**Subset Basis Approximation of Kernel Principal Component Analysis**

Yoshikazu Washizawa, *The University of Electro-Communications, Japan*

#### **1. Introduction**


Principal component analysis (PCA) has been extended in various ways because of its simple definition. In particular, non-linear generalizations of PCA have been proposed and used in many areas. Non-linear generalizations of PCA, such as principal curves (Hastie & Stuetzle, 1989) and principal manifolds (Gorban et al., 2008), have intuitive explanations and formulations compared with other non-linear dimensionality reduction techniques such as ISOMAP (Tenenbaum et al., 2000) and locally linear embedding (LLE) (Roweis & Saul, 2000).

Kernel PCA (KPCA) is a non-linear generalization of PCA obtained by using the kernel trick (Schölkopf et al., 1998). The kernel trick non-linearly maps input samples to a higher-dimensional space, the so-called feature space F. The mapping is denoted by Φ, and for a *d*-dimensional input vector x,

$$
\Phi: \mathbb{R}^d \to \mathcal{F}, \ x \mapsto \Phi(x). \tag{1}
$$

Then a linear operation in the feature space is a non-linear operation in the input space. The dimension of the feature space F is usually much larger than the input dimension *d*, and can even be infinite. A positive definite kernel function *k*(·, ·) that satisfies the following equation is used to avoid explicit calculations in the feature space,

$$k(x\_1, x\_2) = \langle \Phi(x\_1), \Phi(x\_2) \rangle \,\,\forall x\_1, x\_2 \in \mathbb{R}^d\,. \tag{2}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product.
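For illustration only (the Gaussian kernel and the toy data below are assumptions made for this sketch, not choices made in the chapter), a kernel Gram matrix of the form implied by Eq. (2) can be computed as follows:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian (RBF) kernel k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Kernel Gram matrix K with K[i, j] = k(x_i, x_j) for the columns x_i of X."""
    n = X.shape[1]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[:, i], X[:, j])
    return K

# Toy data: d = 2, n = 5 samples stored column-wise, as in X = [x_1 | ... | x_n].
X = np.random.default_rng(0).normal(size=(2, 5))
K = gram_matrix(X, rbf_kernel)
print(K.shape)              # (5, 5)
print(np.allclose(K, K.T))  # the Gram matrix is symmetric
```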

By using the kernel function, inner products in F are replaced by the kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. With this replacement, the problem in F is reduced to a problem in $\mathbb{R}^n$, where $n$ is the number of samples, since the space spanned by the mapped samples is at most an $n$-dimensional subspace. For example, the primal problem of support vector machines (SVMs) in F is reduced to the Wolfe dual problem in $\mathbb{R}^n$ (Vapnik, 1998).

In real problems, $n$ is sometimes too large to solve the problem in $\mathbb{R}^n$. In the case of SVMs, the optimization problem is reduced to a convex quadratic program of size $n$. Even if $n$ is very large, SVMs have efficient computational techniques such as chunking or sequential minimal optimization (SMO) (Platt, 1999), since SVMs have sparse solutions to the Wolfe dual problem. After the optimal solution is obtained, we only have to store a limited number of learning samples, the so-called support vectors, to evaluate input vectors.

Fig. 1. Illustrative example of SubKPCA: (a) PCA, (b) KPCA (n = 1000), (c) KPCA (n = 50), (d) SubKPCA.

In the case of KPCA, the optimization problem is reduced to an eigenvalue problem of size $n$. There are some efficient techniques for eigenvalue problems, such as the divide-and-conquer eigenvalue algorithm (Demmel, 1997) or the implicitly restarted Arnoldi method (IRAM) (Lehoucq et al., 1998)<sup>1</sup>. However, their computational complexity is still too large when $n$ is large, because KPCA does not have a sparse solution. These algorithms require $O(n^2)$ working memory and $O(rn^2)$ computation, where $r$ is the number of principal components. Moreover, we have to store all $n$ learning samples to evaluate input vectors.

Subset KPCA (SubKPCA) approximates KPCA by using a subset of the samples as its basis, while all learning samples are used in the criterion of the cost function (Washizawa, 2009). The optimization problem for SubKPCA is then reduced to a generalized eigenvalue problem whose size is the size of the subset, $m$. The subset size $m$ controls the trade-off between the approximation accuracy and the computational complexity. Since all learning samples are utilized in the criterion, the approximation error is small even if $m$ is much smaller than $n$. The approximation error due to this subset approximation is discussed in this chapter. Moreover, after the construction, we only have to store the subset to evaluate input vectors.

An illustrative example is shown in Figure 1. Figure 1 (a) shows 1000 artificial two-dimensional samples and contour lines of the norms of vectors transformed onto a one-dimensional subspace by PCA. Figure 1 (b) shows the contour curves of KPCA (transformed onto a five-dimensional subspace in F). This is a non-linear analysis; however, it requires solving an eigenvalue problem of size 1000, and for each input vector, kernel function evaluations with all 1000 samples are required. Figure 1 (c) shows KPCA obtained from 50 randomly selected samples. In this case, the size of the eigenvalue problem is only 50, and only 50 kernel evaluations are required to obtain the transform; however, the contour curves are rather different from those in (b). Figure 1 (d) shows the contour curves of SubKPCA using these 50 samples as its basis and all 1000 samples for the criterion. The contour curves are almost the same as those in (b). In this case, the size of the eigenvalue problem is again only 50, and the number of kernel evaluations is also 50.

There are some conventional approaches to reduce the computational complexity of KPCA. Improved KPCA (IKPCA) (Xu et al., 2007) takes an approach similar to SubKPCA; however, its approximation error is much higher than that of SubKPCA. Experimental and theoretical differences are shown in this chapter. Comparisons with sparse KPCA (Smola et al., 1999; Tipping, 2001), the Nyström method (Williams & Seeger, 2001), incomplete Cholesky decomposition (ICD) (Bach & Jordan, 2002), and adaptive approaches (Ding et al., 2010; Günter et al., 2007; Kim et al., 2005) are also discussed.

In this chapter, we denote vectors by bold-italic lower-case symbols x, y, and matrices by bold-italic capital symbols A, B. In kernel methods, F can be an infinite-dimensional space, depending on the selection of the kernel function. If vectors can be infinite-dimensional (functions), we denote them by italic lower-case symbols *f*, *g*. If either the domain or the range of a linear transform can be infinite-dimensional, we denote the transform by an italic capital symbol *X*, *Y*. This is summarized as follows: (i) bold symbols x, A are always finite-dimensional; (ii) non-bold symbols *f*, *X* can be infinite-dimensional.

<sup>1</sup> IRAM is implemented as "eigs" in MATLAB


#### **2. Kernel PCA**


This section briefly reviews KPCA and presents some of its characterizations.

#### **2.1 Brief review of KPCA**

Let $x_1, \dots, x_n$ be $d$-dimensional learning samples, and let $X = [x_1 | \dots | x_n] \in \mathbb{R}^{d \times n}$. Suppose that their mean is zero or has been subtracted. Standard PCA obtains eigenvectors of the variance-covariance matrix $\Sigma$,

$$\boldsymbol{\Sigma} = \frac{1}{n} \sum\_{i=1}^{n} \boldsymbol{x}\_{i} \boldsymbol{x}\_{i}^{\top} = \frac{1}{n} \boldsymbol{\mathsf{X}} \boldsymbol{\mathsf{X}}^{\top}. \tag{3}$$

Then the $i$th largest eigenvector corresponds to the $i$th principal component. Suppose $U_{\mathrm{PCA}} = [u_1 | \dots | u_r]$. The projection and the transform of $x$ onto the $r$-dimensional eigenspace are $U_{\mathrm{PCA}} U_{\mathrm{PCA}}^\top x$ and $U_{\mathrm{PCA}}^\top x$, respectively.
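A minimal numerical sketch of Eq. (3) and of the projection and transform just described (the data, sizes, and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 3, 200, 2

X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)      # subtract the mean, as assumed in the text

Sigma = (X @ X.T) / n                      # Eq. (3): variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # sort eigenvalues in descending order
U_pca = eigvecs[:, order[:r]]              # U_PCA = [u_1 | ... | u_r]

x = rng.normal(size=(d,))
projection = U_pca @ (U_pca.T @ x)         # projection U_PCA U_PCA^T x
transform = U_pca.T @ x                    # transform  U_PCA^T x
```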


In the case of KPCA, input vectors are mapped to feature space before the operation. Let

$$S = \left[\Phi(x\_1)|\dots|\Phi(x\_n)\right] \tag{4}$$

$$
\Sigma\_{\mathcal{F}} = \mathcal{S} \mathcal{S}^\* \tag{5}
$$

$$\mathbf{K} = \mathbf{S}^\* \mathbf{S} \in \mathbb{R}^{n \times n},\tag{6}$$

where $\cdot^{*}$ denotes the adjoint operator<sup>2</sup>, and $\mathbf{K}$ is called the kernel Gram matrix (Schölkopf et al., 1999); the $(i,j)$-component of $\mathbf{K}$ is $k(x_i, x_j)$. Then the $i$th largest eigenvector corresponds to the $i$th principal component. If the dimension of F is large, the eigenvalue decomposition (EVD) cannot be performed directly. Let $\{\lambda_i, u_i\}$ be the $i$th eigenvalue and corresponding eigenvector of $\Sigma_{\mathcal{F}}$, and $\{\lambda_i, \mathbf{v}_i\}$ be the $i$th eigenvalue and eigenvector of $\mathbf{K}$. Note that $\mathbf{K}$ and $\Sigma_{\mathcal{F}}$ have the same eigenvalues. Then the $i$th principal component can be obtained from the $i$th eigenvalue and eigenvector of $\mathbf{K}$,

$$
u_i = \frac{1}{\sqrt{\lambda_i}} S \mathbf{v}_i. \tag{7}
$$

Note that it is difficult to obtain *ui* explicitly on a computer because the dimension of F is large. However, the inner product of a mapped input vector Φ(x) and the *i*th principal component is easily obtained from,

$$
\langle u_i, \Phi(x) \rangle = \frac{1}{\sqrt{\lambda_i}} \langle \mathbf{v}_i, \mathbf{k}_x \rangle, \tag{8}
$$

$$\mathbf{k}_x = \left[ k(x, x_1), \dots, k(x, x_n) \right]^\top \tag{9}$$

k<sup>x</sup> is an *n*-dimensional vector called the empirical kernel map.

Let us summarize using matrix notations. Let

$$\boldsymbol{\Lambda}_{\mathrm{KPCA}} = \operatorname{diag}([\lambda_1, \dots, \lambda_r]) \tag{10}$$

$$U_{\mathrm{KPCA}} = [u_1 | \dots | u_r] \tag{11}$$

$$\mathbf{V}_{\mathrm{KPCA}} = [\mathbf{v}_1 | \dots | \mathbf{v}_r]. \tag{12}$$

Then the projection and the transform of x onto the *r*-dimensional eigenspace are

$$U_{\mathrm{KPCA}} U_{\mathrm{KPCA}}^{*} \Phi(x) = S \mathbf{V}_{\mathrm{KPCA}} \boldsymbol{\Lambda}^{-1} \mathbf{V}_{\mathrm{KPCA}}^{\top} \mathbf{k}_x \tag{13}$$

$$
\mathcal{U}\_{\rm KPCA}^\* \Phi(x) = \Lambda^{-1/2} \mathcal{V}\_{\rm KPCA}^\top k\_x. \tag{14}
$$
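Putting Eqs. (6), (9), and (10)-(14) together, a bare-bones KPCA sketch might look as follows (the Gaussian kernel, the sample sizes, and the omission of feature-space centering are assumptions made for illustration):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
d, n, r = 2, 100, 5
X = rng.normal(size=(d, n))                         # training samples, column-wise

# Kernel Gram matrix, Eq. (6): K[i, j] = k(x_i, x_j)
K = np.array([[rbf(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

# EVD of K; keep the r largest eigenpairs (lambda_i, v_i)
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1][:r]
lam = eigvals[order]                                # Lambda_KPCA = diag([lambda_1, ..., lambda_r])
V = eigvecs[:, order]                               # V_KPCA = [v_1 | ... | v_r]

def transform(x):
    """KPCA transform, Eq. (14): U*_KPCA Phi(x) = Lambda^{-1/2} V^T k_x."""
    k_x = np.array([rbf(x, X[:, i]) for i in range(n)])   # empirical kernel map, Eq. (9)
    return np.diag(1.0 / np.sqrt(lam)) @ (V.T @ k_x)

z = transform(rng.normal(size=(d,)))
print(z.shape)   # (r,)
```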

#### **2.2 Characterization of KPCA**

There are some characterizations or definitions for PCA (Oja, 1983). SubKPCA is extended from the least mean square (LMS) error criterion 3.

$$\begin{array}{ll}\underset{\mathbf{X}}{\text{min}} & J\_0(\mathbf{X}) = \frac{1}{n} \sum\_{i=1}^n ||x\_i - \mathbf{X}x\_i||^2\\ \text{Subject to } & \text{rank}(\mathbf{X}) \le r. \end{array} \tag{15}$$

<sup>2</sup> In a real finite-dimensional space, the adjoint $\cdot^{*}$ and the transpose $\cdot^{\top}$ are equivalent. However, in an infinite-dimensional space, the transpose is not defined.

<sup>3</sup> Since all definitions of PCA lead to the equivalent solution, SubKPCA can also be defined from the other definitions. However, in this chapter, only the LMS criterion is shown.

From this definition, the $X$ that minimizes the averaged distance between $x_i$ and $Xx_i$ over $i$ is obtained under the rank constraint. Note that with this criterion, each principal component is not characterized, i.e., the minimum solution is $X = U_{\mathrm{PCA}} U_{\mathrm{PCA}}^\top$, and the transform $U_{\mathrm{PCA}}$ itself is not determined.

In the case of KPCA, the criterion is


$$\begin{array}{ll}\min\_{\boldsymbol{X}} & J\_{1}(\boldsymbol{X}) = \frac{1}{n} \sum\_{i=1}^{n} \left\| \Phi(\boldsymbol{x}\_{i}) - \boldsymbol{X} \Phi(\boldsymbol{x}\_{i}) \right\|^{2} \\ \text{Subject to } & \text{rank}(\boldsymbol{X}) \le r, \ \mathcal{N}(\boldsymbol{X}) \supset \mathcal{R}(\boldsymbol{S})^{\perp}, \end{array} \tag{16}$$

where $\mathcal{R}(A)$ denotes the range or the image of the matrix or operator $A$, and $\mathcal{N}(A)$ denotes the null space or the kernel of the matrix or operator $A$. In the linear case, we can assume that the number of samples $n$ is sufficiently larger than $r$ and $d$, and the second constraint $\mathcal{N}(X) \supset \mathcal{R}(S)^{\perp}$ is often ignored. However, since the dimension of the feature space is large, $r$ could be larger than the dimension of the space spanned by the mapped samples $\Phi(x_1), \dots, \Phi(x_n)$. For such cases, the second constraint is introduced.

#### **2.2.1 Solution to the problem (16)**

Here, a brief derivation of the solution to the problem (16) is shown. Since the problem is in $\mathcal{R}(S)$, $X$ can be parameterized by $X = S\mathbf{A}S^{*}$, $\mathbf{A} \in \mathbb{R}^{n \times n}$. Accordingly, $J_1$ yields

$$J_1(\mathbf{A}) = \frac{1}{n} \|S - S \mathbf{A} S^{*} S\|_F^2 = \frac{1}{n} \operatorname{Trace}[\mathbf{K} - \mathbf{K} \mathbf{A} \mathbf{K} - \mathbf{K} \mathbf{A}^\top \mathbf{K} + \mathbf{K} \mathbf{A}^\top \mathbf{K} \mathbf{A} \mathbf{K}]$$

$$= \frac{1}{n} \|\mathbf{K} \mathbf{A} \mathbf{K}^{1/2} - \mathbf{K}^{1/2}\|_F^2 \tag{17}$$

where $\cdot^{1/2}$ denotes the square root matrix and $\|\cdot\|_F$ denotes the Frobenius norm. The eigenvalue decomposition of $\mathbf{K}$ is $\mathbf{K} = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^\top$. From the Schmidt approximation theorem (also called the Eckart-Young theorem) (Israel & Greville, 1973), $J_1$ is minimized when

$$\mathbf{K}\mathbf{A}\mathbf{K}^{1/2} = \sum\_{i=1}^{r} \sqrt{\lambda\_i} \mathbf{v}\_i \mathbf{v}\_i^\top \tag{18}$$

$$\mathbf{A} = \sum_{i=1}^{r} \frac{1}{\lambda_i} \mathbf{v}_i \mathbf{v}_i^\top = \mathbf{V}_{\mathrm{KPCA}} \boldsymbol{\Lambda}^{-1} \mathbf{V}_{\mathrm{KPCA}}^\top. \tag{19}$$
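As a quick numerical sanity check of Eqs. (17)-(19) (our own illustration, not part of the chapter): with $\mathbf{A}$ built from the $r$ largest eigenpairs as in Eq. (19), the value of $J_1$ reduces to the average of the discarded eigenvalues, which the following sketch verifies on toy data with an assumed Gaussian kernel.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
d, n, r = 2, 60, 4
X = rng.normal(size=(d, n))
K = np.array([[rbf(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

# Eq. (19): A = V_KPCA Lambda^{-1} V_KPCA^T built from the r largest eigenpairs
A = V[:, :r] @ np.diag(1.0 / lam[:r]) @ V[:, :r].T

# J_1(A) evaluated via the trace expression in Eq. (17)
J1 = np.trace(K - K @ A @ K - K @ A.T @ K + K @ A.T @ K @ A @ K) / n

# It should equal the average of the discarded eigenvalues
print(np.isclose(J1, lam[r:].sum() / n))   # True (up to numerical precision)
```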

#### **2.3 Computational complexity of KPCA**

The procedure of KPCA is as follows:

1. Calculate $\mathbf{K}$ from the samples. [$O(n^2)$]
2. Perform EVD of $\mathbf{K}$, and obtain the $r$ largest eigenvalues and eigenvectors $\lambda_1, \dots, \lambda_r$, $\mathbf{v}_1, \dots, \mathbf{v}_r$. [$O(rn^2)$]
3. Obtain $\boldsymbol{\Lambda}^{-1/2} \mathbf{V}_{\mathrm{KPCA}}^\top$, and store all training samples.
4. For an input vector $x$, calculate the empirical kernel map $\mathbf{k}_x$ from Eq. (9). [$O(n)$]
5. Obtain the transformed vector from Eq. (14). [$O(rn)$]
The procedures 1, 2, and 3 are called the learning (training) stage, and the procedures 4 and 5 are called the evaluation stage.

The dominant computation in the learning stage is the EVD. In realistic situations, $n$ should be less than several tens of thousands. For example, if $n = 100{,}000$, 20 GB of RAM is required to store $\mathbf{K}$ on a four-byte floating-point system. This computational burden is sometimes too heavy for real large-scale problems. Moreover, in the evaluation stage, the response time of the system depends on $n$.
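As a rough back-of-the-envelope sketch of the memory figure quoted above (whether the symmetry of $\mathbf{K}$ is exploited in storage is our assumption):

```python
# Memory needed to hold the n x n kernel Gram matrix K in 4-byte floats.
n = 100_000
full_matrix_gb = n * n * 4 / 1e9                  # dense storage: about 40 GB
upper_triangle_gb = n * (n + 1) / 2 * 4 / 1e9     # exploiting symmetry of K: about 20 GB
print(full_matrix_gb, upper_triangle_gb)
```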

#### **3. Subset KPCA**

#### **3.1 Definition**

Since the problem of KPCA in the feature space F is in the subspace spanned by the mapped samples $\Phi(x_1), \dots, \Phi(x_n)$, i.e., $\mathcal{R}(S)$, the problem in F is transformed into a problem in $\mathbb{R}^n$. SubKPCA seeks the optimal solution in the space spanned by a smaller number of samples $\Phi(y_1), \dots, \Phi(y_m)$, $m \le n$, which is called the basis set. Let $T = [\Phi(y_1), \dots, \Phi(y_m)]$; then the optimization problem of SubKPCA is defined as

$$\begin{array}{ll}\min\limits_{\boldsymbol{X}} & J_{1}(\boldsymbol{X})\\ \text{Subject to } & \text{rank}(\boldsymbol{X}) \le r, \ \mathcal{N}(\boldsymbol{X}) \supset \mathcal{R}(\boldsymbol{T})^{\perp}, \ \mathcal{R}(\boldsymbol{X}) \subset \mathcal{R}(\boldsymbol{T}).\end{array} \tag{20}$$

The last two constraints indicate that the solution is in $\mathcal{R}(T)$. It is worth noting that although SubKPCA seeks the solution in this restricted space, the objective function is the same as that of KPCA, i.e., all training samples are used in the criterion. We call the set of all training samples the criterion set. The selection of the basis set $\{y_1, \dots, y_m\}$ is also an important problem; however, here we assume that it is given, and the selection is discussed in the next section.

#### **3.2 Solution of SubKPCA**

First, the minimum solutions to the problem (20) are shown, and then their derivations are given. If $\mathcal{R}(T) \subset \mathcal{R}(S)$, the solution is simplified. Note that if the set $\{y_1, \dots, y_m\}$ is a subset of $\{x_1, \dots, x_n\}$, $\mathcal{R}(T) \subset \mathcal{R}(S)$ is satisfied. Therefore, solutions for the two cases ($\mathcal{R}(T) \subset \mathcal{R}(S)$ and the general case) are shown.

#### **3.2.1 The case R(T) ⊂ R(S)**

The Moore-Penrose pseudo-inverse is denoted by $\cdot^{\dagger}$. Let $\mathbf{K}_y = T^{*}T \in \mathbb{R}^{m \times m}$ and $\mathbf{K}_{xy} = S^{*}T \in \mathbb{R}^{n \times m}$, i.e., the $(i,j)$-components of $\mathbf{K}_y$ and $\mathbf{K}_{xy}$ are $k(y_i, y_j)$ and $k(x_i, y_j)$, respectively. Suppose that the EVD of $(\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_{xy}^\top \mathbf{K}_{xy} (\mathbf{K}_y^{1/2})^{\dagger}$ is

$$(\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_{xy}^\top \mathbf{K}_{xy} (\mathbf{K}_y^{1/2})^{\dagger} = \sum_{i=1}^{m} \xi_i \mathbf{w}_i \mathbf{w}_i^\top, \tag{21}$$

and let $\mathbf{W} = [\mathbf{w}_1 | \dots | \mathbf{w}_r]$. Then the problem (20) is minimized by

$$P_{\mathrm{SubKPCA}} = T (\mathbf{K}_y^{1/2})^{\dagger} \mathbf{W} \mathbf{W}^\top (\mathbf{K}_y^{1/2})^{\dagger} T^{*}. \tag{22}$$

Let $\mathbf{Z} = (\mathbf{K}_y^{1/2})^{\dagger} \mathbf{W}$. The projection and the transform of SubKPCA for an input vector $x$ are

$$P_{\mathrm{SubKPCA}} \Phi(x) = T \mathbf{Z} \mathbf{Z}^\top \mathbf{h}_x \tag{23}$$

$$U_{\mathrm{SubKPCA}}^{*} \Phi(x) = \mathbf{Z}^\top \mathbf{h}_x, \tag{24}$$

where $U_{\mathrm{SubKPCA}} = T\mathbf{Z}$ and $\mathbf{h}_x = [k(x, y_1), \dots, k(x, y_m)]^\top \in \mathbb{R}^m$ is the empirical kernel map of $x$ for the subset.

A matrix or an operator $A$ that satisfies $AA = A$ and $A^\top = A$ ($A^{*} = A$) is called a projector (Harville, 1997). If $\mathcal{R}(T) \subset \mathcal{R}(S)$, $P_{\mathrm{SubKPCA}}$ is a projector since $P_{\mathrm{SubKPCA}}^{*} = P_{\mathrm{SubKPCA}}$, and

$$P_{\mathrm{SubKPCA}} P_{\mathrm{SubKPCA}} = T \mathbf{Z} \mathbf{Z}^\top \mathbf{K}_y \mathbf{Z} \mathbf{Z}^\top T^{*} = T \mathbf{Z} \mathbf{Z}^\top T^{*} = P_{\mathrm{SubKPCA}}. \tag{25}$$

#### **3.2.2 All cases**

Since the solution for the general case is rather complex, and we do not find any advantage in using a basis set $\{y_1, \dots, y_m\}$ such that $\mathcal{R}(T) \not\subset \mathcal{R}(S)$, we henceforth assume that $\mathcal{R}(T) \subset \mathcal{R}(S)$.

#### **3.2.3 Derivation of the solutions**

Since the problem (20) is in $\mathcal{R}(T)$, the solution can be parameterized as $X = T\mathbf{B}T^{*}$, $\mathbf{B} \in \mathbb{R}^{m \times m}$. Then the objective function is

$$J_1(\mathbf{B}) = \frac{1}{n} \|S - T\mathbf{B}T^{*}S\|_F^2 = \frac{1}{n} \operatorname{Trace}[\mathbf{B}\mathbf{K}_{xy}^\top \mathbf{K}_{xy} \mathbf{B}^\top \mathbf{K}_y - \mathbf{B}^\top \mathbf{K}_{xy}^\top \mathbf{K}_{xy} - \mathbf{B}\mathbf{K}_{xy}^\top \mathbf{K}_{xy} + \mathbf{K}] \tag{28}$$

$$= \frac{1}{n} \|\mathbf{K}_y^{1/2} \mathbf{B} \mathbf{K}_{xy}^\top - (\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_{xy}^\top\|_F^2 + \frac{1}{n} \operatorname{Trace}[\mathbf{K} - \mathbf{K}_{xy} \mathbf{K}_y^{\dagger} \mathbf{K}_{xy}^\top], \tag{29}$$

where the relations $\mathbf{K}_{xy} = \mathbf{K}_{xy} (\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_y^{1/2}$ and $\mathbf{K}_y^{\dagger} = (\mathbf{K}_y^{1/2})^{\dagger} (\mathbf{K}_y^{1/2})^{\dagger}$ are used. Since the second term is a constant for $\mathbf{B}$, from the Schmidt approximation theorem, the minimum solution is given by the singular value decomposition (SVD) of $(\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_{xy}^\top$,

$$(\mathbf{K}_y^{1/2})^{\dagger} \mathbf{K}_{xy}^\top = \sum_{i=1}^{m} \sqrt{\xi_i}\, \mathbf{w}_i \boldsymbol{\nu}_i^\top. \tag{30}$$

Then the minimum solution is given by

$$\mathbf{K}_y^{1/2} \mathbf{B} \mathbf{K}_{xy}^\top = \sum_{i=1}^{r} \sqrt{\xi_i}\, \mathbf{w}_i \boldsymbol{\nu}_i^\top, \tag{31}$$

which is attained by $\mathbf{B} = (\mathbf{K}_y^{1/2})^{\dagger} \mathbf{W} \mathbf{W}^\top (\mathbf{K}_y^{1/2})^{\dagger}$, and this yields the solution (22).
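A minimal numerical sketch of the solution above (the Gaussian kernel, the toy data, and the random choice of the basis subset are assumptions made for illustration; basis selection is discussed later in the chapter):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
d, n, m, r = 2, 500, 30, 5
X = rng.normal(size=(d, n))                       # criterion set: all training samples
Y = X[:, rng.choice(n, size=m, replace=False)]    # basis set {y_1, ..., y_m} (here a random subset)

# Gram matrices: K_y[i, j] = k(y_i, y_j),  K_xy[i, j] = k(x_i, y_j)
K_y = np.array([[rbf(Y[:, i], Y[:, j]) for j in range(m)] for i in range(m)])
K_xy = np.array([[rbf(X[:, i], Y[:, j]) for j in range(m)] for i in range(n)])

# (K_y^{1/2})^+ via the EVD of K_y (small eigenvalues are discarded)
w, Q = np.linalg.eigh(K_y)
keep = w > 1e-10
Ky_half_pinv = Q[:, keep] @ np.diag(1.0 / np.sqrt(w[keep])) @ Q[:, keep].T

# EVD of (K_y^{1/2})^+ K_xy^T K_xy (K_y^{1/2})^+; keep the r leading eigenvectors W
M = Ky_half_pinv @ K_xy.T @ K_xy @ Ky_half_pinv
xi, W_full = np.linalg.eigh(M)
W = W_full[:, np.argsort(xi)[::-1][:r]]
Z = Ky_half_pinv @ W                              # Z = (K_y^{1/2})^+ W, so that P = T Z Z^T T*

def transform(x):
    """SubKPCA transform: Z^T h_x, with h_x the empirical kernel map for the basis set."""
    h_x = np.array([rbf(x, Y[:, i]) for i in range(m)])
    return Z.T @ h_x

z = transform(rng.normal(size=(d,)))
print(z.shape)   # (r,) -- only the m basis samples are needed at evaluation time
```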
