The procedures 1, 2, and 3 are called the learning (training) stage, and the procedures 4 and 5 are called the evaluation stage.

The dominant computation in the learning stage is the EVD. In realistic situations, *n* should be less than several tens of thousands; for example, if *n* = 100,000, 20 GB of RAM is required to store K in a four-byte floating-point system. This computational complexity is sometimes too heavy for real large-scale problems. Moreover, in the evaluation stage, the response time of the system depends on *n*.

#### **3. Subset KPCA**

#### **3.1 Definition**

Since the problem of KPCA in the feature space F lies in the subspace spanned by the mapped samples $\Phi(x_1),\ldots,\Phi(x_n)$, i.e., $\mathcal{R}(S)$, the problem in F is transformed into a problem in $\mathbb{R}^n$. SubKPCA seeks the optimal solution in the space spanned by a smaller number of samples, $\Phi(y_1),\ldots,\Phi(y_m)$, $m \leq n$, called the basis set. Let $T = [\Phi(y_1),\ldots,\Phi(y_m)]$; then the optimization problem of SubKPCA is defined as

$$\begin{array}{ll}\min_{X} & J_1(X)\\ \text{Subject to} & \operatorname{rank}(X) \leq r,\ \mathcal{N}(X) \supset \mathcal{R}(T)^{\perp},\ \mathcal{R}(X) \subset \mathcal{R}(T).\end{array}\tag{20}$$

The last two constraints indicate that the solution lies in $\mathcal{R}(T)$. It is worth noting that although SubKPCA seeks the solution in this limited space, the objective function is the same as that of KPCA, i.e., all training samples are used for the criterion. We call the set of all training samples the criterion set. The selection of the basis set $\{y_1,\ldots,y_m\}$ is also an important problem; here, however, we assume that it is given, and its selection is discussed in the next section.

#### **3.2 Solution of SubKPCA**

First, the minimal solutions to the problem (20) are shown; their derivations follow. If $\mathcal{R}(T) \subset \mathcal{R}(S)$, the solution is simplified. Note that if the set $\{y_1,\ldots,y_m\}$ is a subset of $\{x_1,\ldots,x_n\}$, then $\mathcal{R}(T) \subset \mathcal{R}(S)$ is satisfied. Therefore, solutions are shown for two cases: $\mathcal{R}(T) \subset \mathcal{R}(S)$, and the general case.

#### **3.2.1 The case** $\mathcal{R}(T) \subset \mathcal{R}(S)$

Let $K_y = T^{*}T \in \mathbb{R}^{m\times m}$, $(K_y)_{i,j} = k(y_i, y_j)$, and $K_{xy} = S^{*}T \in \mathbb{R}^{n\times m}$, $(K_{xy})_{i,j} = k(x_i, y_j)$. Let $\kappa_1,\ldots,\kappa_r$ and $z_1,\ldots,z_r$ be the sorted eigenvalues and corresponding eigenvectors of the generalized eigenvalue problem,

$$K_{xy}^{\top} K_{xy} z = \kappa K_y z \tag{21}$$

respectively, where each eigenvector $z_i$ is normalized by $z_i \leftarrow z_i / \sqrt{\langle z_i, K_y z_i\rangle}$, that is, $\langle z_i, K_y z_j\rangle = \delta_{ij}$ (Kronecker delta). Let $Z = [z_1|\cdots|z_r]$; then the problem (20) is minimized by

$$P\_{\text{SubKPCA}} = \mathbf{T} \mathbf{Z} \mathbf{Z}^{\top} \mathbf{T}^\*. \tag{22}$$

The projection and the transform of SubKPCA for an input vector x are

$$P\_{\text{SubKPCA}}\Phi(x) = TZZ^\top h\_x \tag{23}$$

$$L_{\text{SubKPCA}}\Phi(x) = Z^{\top} h_x, \tag{24}$$

where $h_x = [k(x, y_1),\ldots,k(x, y_m)]^{\top} \in \mathbb{R}^m$ is the empirical kernel map of $x$ for the subset.

A matrix or an operator $A$ that satisfies $AA = A$ and $A^{\top} = A$ ($A^{*} = A$) is called a projector (Harville, 1997). If $\mathcal{R}(T) \subset \mathcal{R}(S)$, $P_{\text{SubKPCA}}$ is a projector, since $P_{\text{SubKPCA}}^{*} = P_{\text{SubKPCA}}$, and

$$P\_{\text{SubKPCA}}P\_{\text{SubKPCA}} = \mathbf{T}\mathbf{Z}\mathbf{Z}^{\top}\mathbf{K}\_{\mathbf{y}}\mathbf{Z}\mathbf{Z}^{\top}\mathbf{T}^{\*} = \mathbf{T}\mathbf{Z}\mathbf{Z}^{\top}\mathbf{T}^{\*} = P\_{\text{SubKPCA}}.\tag{25}$$
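The solution above maps directly onto a generalized symmetric eigensolver. The following is a minimal NumPy/SciPy sketch of Eqs. (21)-(24), assuming an RBF kernel; the helper names (`rbf`, `subkpca_fit`, `transform`) and the jitter term are our own additions, not part of the chapter.

```python
# A minimal NumPy/SciPy sketch of Eqs. (21)-(24), assuming an RBF kernel.
# All helper names (rbf, subkpca_fit, transform) are ours, not the chapter's.
import numpy as np
from scipy.linalg import eigh

def rbf(A, B, gamma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def subkpca_fit(X, Y, r, gamma=1.0):
    """X: (n, d) criterion set, Y: (m, d) basis set, r: target rank."""
    Ky  = rbf(Y, Y, gamma)                    # (m, m)
    Kxy = rbf(X, Y, gamma)                    # (n, m)
    jitter = 1e-10 * np.eye(len(Y))           # keeps Ky positive definite
    # Generalized EVD (21): Kxy^T Kxy z = kappa Ky z (ascending order).
    kap, Z = eigh(Kxy.T @ Kxy, Ky + jitter)
    return Z[:, ::-1][:, :r], kap[::-1][:r]   # r leading pairs

def transform(x, Y, Z, gamma=1.0):
    """Eq. (24): Z^T h_x, with h_x the empirical kernel map of x."""
    h = rbf(x[None, :], Y, gamma).ravel()     # (m,)
    return Z.T @ h                            # (r,)
```

Note that `scipy.linalg.eigh(A, B)` returns eigenvectors satisfying $Z^{\top}K_yZ = I$, which is exactly the normalization $\langle z_i, K_y z_j\rangle = \delta_{ij}$ used above.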

#### **3.2.2 All cases**


The Moore-Penrose pseudo-inverse is denoted by $\cdot^{\dagger}$. Suppose that the EVD of $(K_y^{1/2})^{\dagger}K_{xy}^{\top}K_{xy}(K_y^{1/2})^{\dagger}$ is

$$(K_y^{1/2})^{\dagger}K_{xy}^{\top}K_{xy}(K_y^{1/2})^{\dagger} = \sum_{i=1}^{m}\xi_i w_i w_i^{\top} \tag{26}$$

and let W = [w1,..., w*r*]. Then the problem (20) is minimized by

$$P\_{\rm SubKPCA} = T(K\_y^{1/2})^\dagger W W^\top (K\_y^{1/2})^\dagger (K\_{xy}^\top K\_{xy}) (K\_{xy}^\top K\_{xy})^\dagger T^\*. \tag{27}$$

Since this solution is rather complex, and we find no advantage in using a basis set $\{y_1,\ldots,y_m\}$ such that $\mathcal{R}(T) \not\subset \mathcal{R}(S)$, we henceforth assume that $\mathcal{R}(T) \subset \mathcal{R}(S)$.
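For completeness, Eq. (27) can be evaluated with pseudo-inverses and a matrix square root. The following sketch (our own naming, feasible only when the matrices are small enough to form explicitly) computes the coefficient matrix $B$ such that $P_{\text{SubKPCA}} = TBT^{*}$:

```python
# Sketch of the general-case solution, Eqs. (26)-(27): the coefficient matrix
# B with P = T B T^*. Names are ours; matrices assumed small.
import numpy as np
from scipy.linalg import sqrtm

def subkpca_general_coeffs(Ky, Kxy, r):
    Ky_half_pinv = np.linalg.pinv(np.real(sqrtm(Ky)))
    M = Ky_half_pinv @ Kxy.T @ Kxy @ Ky_half_pinv   # matrix in Eq. (26)
    xi, W = np.linalg.eigh(M)                       # ascending eigenvalues
    W = W[:, ::-1][:, :r]                           # r leading eigenvectors
    G = Kxy.T @ Kxy
    # From Eq. (27): B = (Ky^{1/2})† W W^T (Ky^{1/2})† G G†.
    return Ky_half_pinv @ W @ W.T @ Ky_half_pinv @ G @ np.linalg.pinv(G)
```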

#### **3.2.3 Derivation of the solutions**

Since the problem (20) is in $\mathcal{R}(T)$, the solution can be parameterized as $X = TBT^{*}$, $B \in \mathbb{R}^{m\times m}$. Then the objective function is

$$\begin{aligned} J_1(B) &= \frac{1}{n}\|S - TBT^{*}S\|_{\mathrm{F}}^{2} \tag{28}\\ &= \frac{1}{n}\operatorname{Trace}[BK_{xy}^{\top}K_{xy}B^{\top}K_y - B^{\top}K_{xy}^{\top}K_{xy} - BK_{xy}^{\top}K_{xy} + K] \\ &= \frac{1}{n}\|K_y^{1/2}BK_{xy}^{\top} - (K_y^{1/2})^{\dagger}K_{xy}^{\top}\|_{\mathrm{F}}^{2} + \frac{1}{n}\operatorname{Trace}[K - K_{xy}K_y^{\dagger}K_{xy}^{\top}] \tag{29}\end{aligned}$$

where the relations $K_{xy}^{\top} = K_y^{1/2}(K_y^{1/2})^{\dagger}K_{xy}^{\top}$ and $K_{xy} = K_{xy}(K_y^{1/2})^{\dagger}K_y^{1/2}$ are used. Since the second term is constant with respect to B, by the Schmidt approximation theorem the minimum is given by the singular value decomposition (SVD) of $(K_y^{1/2})^{\dagger}K_{xy}^{\top}$,

$$(\mathbf{K}\_y^{1/2})^\dagger \mathbf{K}\_{xy}^\top = \sum\_{i=1}^m \sqrt{\xi\_i} w\_i \nu\_i^\top. \tag{30}$$

Then the minimum solution is given by

$$\mathbf{K}\_y^{1/2} \mathbf{B} \mathbf{K}\_{xy}^\top = \sum\_{i=1}^r \sqrt{\xi\_i} w\_i \nu\_i^\top \,. \tag{31}$$
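Numerically, Eqs. (30)-(31) are an Eckart-Young step: the best rank-*r* approximation in the Frobenius norm is the truncated SVD. A small sketch under our own naming, recovering a coefficient matrix $B$ in the least-squares sense:

```python
# The Schmidt (Eckart-Young) step of Eqs. (30)-(31) numerically: truncate the
# SVD at rank r, then solve for the coefficient matrix B. Names are ours.
import numpy as np
from scipy.linalg import sqrtm

def coeff_from_svd(Ky, Kxy, r):
    Ky_half = np.real(sqrtm(Ky))
    A = np.linalg.pinv(Ky_half) @ Kxy.T               # LHS of Eq. (30)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]                 # truncation, Eq. (31)
    # Solve Ky^{1/2} B Kxy^T = A_r for B via pseudo-inverses.
    return np.linalg.pinv(Ky_half) @ A_r @ np.linalg.pinv(Kxy.T)
```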


From the matrix equation theorem (Israel & Greville, 1973), the minimum solution is given by Eq. (27).

Let us consider the case that R(*T*) ⊂ R(*S*).

**Lemma 1** (Harville (1997))**.** *Let* A *and* B *be non-negative definite matrices that satisfy* R(A) ⊂ R(B)*. Consider an EVD and a generalized EVD,*

$$\begin{aligned} (B^{1/2})^{\dagger} A (B^{1/2})^{\dagger} v &= \lambda v \\ Au &= \sigma B u, \end{aligned}$$

*and suppose that* {(*λi*, v*i*)} *and* {(*σi*, u*i*)}*, i* = 1, 2, . . . *are sorted pairs of the eigenvalues and the eigenvectors respectively. Then*

$$\begin{aligned} \lambda_i &= \sigma_i \\ u_i &= \alpha (B^{1/2})^{\dagger} v_i, \quad \forall \alpha \in \mathbb{R} \\ v_i &= \beta B^{1/2} u_i, \quad \forall \beta \in \mathbb{R} \end{aligned}$$

*are satisfied.*

If $\mathcal{R}(T) \subset \mathcal{R}(S)$, then $\mathcal{R}(K_{xy}^{\top}) = \mathcal{R}(K_y)$. Since $(K_{xy}^{\top}K_{xy})(K_{xy}^{\top}K_{xy})^{\dagger}$ is a projector onto $\mathcal{R}(K_y)$, we have $(K_y^{1/2})^{\dagger}(K_{xy}^{\top}K_{xy})(K_{xy}^{\top}K_{xy})^{\dagger} = (K_y^{1/2})^{\dagger}$ in Eq. (27). From Lemma 1, the solution Eq. (22) is derived.
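Lemma 1 is easy to sanity-check numerically. The following sketch, on random non-negative definite matrices of our own making, confirms that the two spectra coincide:

```python
# A quick numerical check of Lemma 1: the EVD of (B^{1/2})^† A (B^{1/2})^†
# and the generalized EVD A u = sigma B u share their eigenvalues
# (here B is full rank, so R(A) ⊂ R(B) holds automatically).
import numpy as np
from scipy.linalg import eigh, sqrtm

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 6))
B = C @ C.T + 1e-6 * np.eye(6)               # positive definite
D = rng.standard_normal((6, 6))
A = D @ D.T                                  # non-negative definite

B_half_pinv = np.linalg.pinv(np.real(sqrtm(B)))
lam = np.linalg.eigvalsh(B_half_pinv @ A @ B_half_pinv)
sig = eigh(A, B, eigvals_only=True)
assert np.allclose(np.sort(lam), np.sort(sig), atol=1e-6)   # lambda_i = sigma_i
```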

#### **3.3 Computational complexity of SubKPCA**

The procedures and computational complexities of SubKPCA are as follows:

1. Select the subset from the training samples (discussed in the next section).
2. Calculate $K_y$ and $K_{xy}^{\top}K_{xy}$. [$O(m^2) + O(nm^2)$]
3. Perform the generalized EVD, Eq. (21). [$O(rm^2)$]
4. Store Z and the samples in the subset.
5. For an input vector x, calculate the empirical kernel map $h_x$. [$O(m)$]
6. Obtain the transformed vector, Eq. (24).
The procedures 1, 2, and 3 are the construction stage, and 4 and 5 are the evaluation stage. The dominant calculation in the construction stage is the generalized EVD. In the case of standard KPCA, the size of the EVD is *n*, whereas for SubKPCA the size of the generalized EVD is *m*. Moreover, in the evaluation stage, the computational complexity depends on the size of the subset, *m*, and the memory required to store Z and the subset is also reduced. Consequently, the response time of a system using SubKPCA for an input vector x is shorter than that of standard KPCA.

#### **3.4 Approximation error**

We now consider the approximation error due to the subset approximation. In the case of KPCA, the approximation error is the value of the objective function of the problem (16). From Eqs. (17) and (19), the value of $J_1$ at the minimum solution is

$$J\_1 = \frac{1}{n} \sum\_{i=r+1}^n \lambda\_i. \tag{32}$$

In the case of SubKPCA, the approximation error is

$$J\_1 = \frac{1}{n} \sum\_{i=r+1}^n \xi\_i + \frac{1}{n} \text{Trace}[\mathbf{K} - \mathbf{K}\_{xy}(\mathbf{K}\_y)^\dagger \mathbf{K}\_{xy}^\top]. \tag{33}$$

The first term is due to the rank reduction and the second term is due to the subset approximation. Let $P_{\mathcal{R}(S)}$ and $P_{\mathcal{R}(T)}$ be the orthogonal projectors onto $\mathcal{R}(S)$ and $\mathcal{R}(T)$, respectively. The second term satisfies

$$\operatorname{Trace}[K - K_{xy}(K_y)^{\dagger}K_{xy}^{\top}] = \operatorname{Trace}[S^{*}(P_{\mathcal{R}(S)} - P_{\mathcal{R}(T)})S], \tag{34}$$

since $K = S^{*}P_{\mathcal{R}(S)}S$. Therefore, if $\mathcal{R}(S) = \mathcal{R}(T)$ (for example, if the subset contains all training samples), the second term is zero. If the range of the subset is far from the range of the whole training set, the second term is large.
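The decomposition in Eq. (33) can be evaluated directly from the kernel matrices. A sketch with our own naming; the generalized eigenvalues of (21) stand in for the $\xi_i$, which Lemma 1 justifies when $\mathcal{R}(T) \subset \mathcal{R}(S)$:

```python
# Sketch of the error decomposition, Eq. (33): a rank-reduction term plus a
# subset term; the latter vanishes when R(T) = R(S), as Eq. (34) shows.
import numpy as np
from scipy.linalg import eigh

def subkpca_error(K, Ky, Kxy, r):
    n = K.shape[0]
    xi = eigh(Kxy.T @ Kxy, Ky + 1e-10 * np.eye(len(Ky)), eigvals_only=True)
    xi = np.sort(xi)[::-1]
    rank_term   = xi[r:].sum() / n                        # (1/n) sum_{i>r} xi_i
    subset_term = np.trace(K - Kxy @ np.linalg.pinv(Ky) @ Kxy.T) / n
    return rank_term + subset_term
```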

#### **3.5 Pre-centering**



So far we have assumed that the mean of the training vectors in the feature space is zero, but this does not always hold in real problems. In the case of PCA, we subtract the mean vector from all training samples when we compute the variance-covariance matrix **Σ**. In KPCA, although we cannot obtain the mean vector in the feature space, $\bar{\Phi} = \frac{1}{n}\sum_{i=1}^{n}\Phi(x_i)$, explicitly, the pre-centering can be built into the KPCA algorithm. It is achieved by using the centered vector $\bar{\Phi}(x_i)$ instead of the mapped vector $\Phi(x_i)$,

$$\bar{\Phi}(x_i) = \Phi(x_i) - \bar{\Phi}, \tag{35}$$

that is to say, *S* and *K* in Eq. (17) are respectively replaced by

$$\bar{S} = S - \bar{\Phi}\mathbf{1}_n^{\top} = S(I - \frac{1}{n}\mathbf{1}_{n,n}) \tag{36}$$

$$\bar{\mathbf{K}} = \bar{\mathbf{S}}^{\*} \bar{\mathbf{S}} = (\mathbf{I} - \frac{1}{n} \mathbf{1}\_{n,n}) \mathbf{K} (\mathbf{I} - \frac{1}{n} \mathbf{1}\_{n,n}) \tag{37}$$

where I denotes the identity matrix, and $\mathbf{1}_n$ and $\mathbf{1}_{n,n}$ are an *n*-dimensional vector and an *n* × *n* matrix whose elements are all one, respectively.

For SubKPCA, the following three methods of estimating the centroid can be considered:

$$\text{1. } \bar{\Phi}_1 = \frac{1}{n}\sum_{i=1}^{n}\Phi(x_i)$$

$$\text{2. } \bar{\Phi}_2 = \frac{1}{m}\sum_{i=1}^{m}\Phi(y_i)$$

$$\text{3. } \bar{\Phi}_3 = \underset{\Psi\in\mathcal{R}(T)}{\operatorname{argmin}}\|\Psi - \bar{\Phi}_1\| = \frac{1}{n}TK_y^{\dagger}K_{xy}^{\top}\mathbf{1}_n.$$

The first is the same as that of KPCA. The second is the mean of the basis set; if the basis set is a subset of the criterion set, its estimation accuracy is not as good as that of $\bar{\Phi}_1$. The third is the best approximation of $\bar{\Phi}_1$ in $\mathcal{R}(T)$. Since SubKPCA is discussed in $\mathcal{R}(T)$, $\bar{\Phi}_1$ and $\bar{\Phi}_3$ are equivalent there. However, for post-processing such as the pre-image, they are not equivalent.
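Since $\bar{\Phi}_2$ and $\bar{\Phi}_3$ live in $\mathcal{R}(T)$, they are fully described by coefficient vectors with respect to $T$. A sketch (our own naming) that never forms the feature map explicitly:

```python
# The centroid estimates of Section 3.5 as coefficient computations: c holds
# coefficients such that the centroid equals T @ c. Names are ours.
import numpy as np

def centroid_coeffs(Ky, Kxy):
    n, m = Kxy.shape
    c2 = np.full(m, 1.0 / m)                          # Phi_bar_2: uniform weights
    c3 = np.linalg.pinv(Ky) @ Kxy.T @ np.ones(n) / n  # Phi_bar_3, per the formula
    return c2, c3
```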

For SubKPCA, only *S* in Eq. (28) has to be modified for pre-centering<sup>4</sup>. If $\bar{\Phi}_3$ is used, *S* and $K_{xy}$ are replaced by

$$\bar{S} = S - \bar{\Phi}_3\mathbf{1}_n^{\top} \tag{38}$$

$$\bar{K}_{xy} = \bar{S}^{*}T = (I - \frac{1}{n}\mathbf{1}_{n,n})K_{xy}. \tag{39}$$
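In practice, both centerings are rank-one corrections of the kernel matrices. A small sketch under our own naming, assuming $K$ and $K_{xy}$ are precomputed:

```python
# Pre-centering as corrections of the kernel matrices, Eqs. (36)-(39).
# Assumes precomputed K (n x n) and Kxy (n x m); names are ours.
import numpy as np

def precenter(K, Kxy):
    n = K.shape[0]
    Cn = np.eye(n) - np.ones((n, n)) / n    # I - (1/n) 1_{n,n}
    K_bar   = Cn @ K @ Cn                   # Eq. (37), standard KPCA
    Kxy_bar = Cn @ Kxy                      # Eq. (39), SubKPCA
    return K_bar, Kxy_bar
```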

#### **4. Selection of samples**

Selection of samples for the basis set is an important problem in SubKPCA. The ideal criterion for the selection depends on the application, e.g., classification accuracy, or PSNR for denoising. Here we show a simple criterion using the empirical error,

$$\begin{array}{ll}\min_{y_1,\ldots,y_m}\min_{X} & J_1(X)\\ \text{Subject to} & \operatorname{rank}(X)\leq r,\ \mathcal{N}(X)\supset\mathcal{R}(T)^{\perp},\ \mathcal{R}(X)\subset\mathcal{R}(T),\\ & \{y_1,\ldots,y_m\}\subset\{x_1,\ldots,x_n\},\ T = [\Phi(y_1)|\cdots|\Phi(y_m)].\end{array}\tag{40}$$

This criterion is a combinatorial optimization problem over the samples, and it is hard to obtain the global solution if *n* and *m* are large. Instead of solving it directly, the following techniques can be introduced:

1. Greedy forward search
2. Backward search
3. Random sampling consensus (RANSAC)
4. Clustering

and their combinations.

#### **4.1 Sample selection methods**

#### **4.1.1 Greedy forward search**

The greedy forward search adds samples to the basis set one by one (or several at a time). The algorithm is shown in Algorithm 1 below. If several samples are added at steps 9 and 10, the algorithm is faster, but the cost function may be larger.

**Algorithm 1** Greedy forward search (one-by-one version)

1: Set the initial basis set T = ∅, the size of the current basis set *m̃* = 0, the residual set S = {x1,...,x*n*}, and the size of the residual set *ñ* = *n*.
2: **while** *m̃* < *m* **do**
3: **for** *i* = 1,...,*ñ* **do**
4: Let the temporal basis set be T̃ = T ∪ {x*i*}
5: Obtain SubKPCA using the temporal basis set
6: Store the empirical error *Ei* = *J*1(*X*).
7: **end for**
8: Obtain the smallest *Ei*, *k* = argmin*i* *Ei*.
9: Add x*k* to the current basis set, T ← T ∪ {x*k*}, *m̃* ← *m̃* + 1
10: Remove x*k* from the residual set, S ← S \ {x*k*}, *ñ* ← *ñ* − 1.
11: **end while**
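A direct implementation of Algorithm 1, reusing the `rbf` and `subkpca_error` sketches given above (all names are ours). Note the cost: every candidate in the residual set triggers a full re-fit, so this is only practical for moderate *n*:

```python
# A direct implementation of Algorithm 1, reusing rbf() and subkpca_error()
# from the sketches above.
import numpy as np

def greedy_forward(X, m, r, gamma=1.0):
    """X: (n, d) training set; returns indices of the selected basis set."""
    basis, residual = [], list(range(len(X)))   # step 1
    K = rbf(X, X, gamma)
    while len(basis) < m:                       # step 2
        errs = []
        for i in residual:                      # steps 3-7
            idx = basis + [i]
            Y = X[idx]
            errs.append(subkpca_error(K, rbf(Y, Y, gamma),
                                      rbf(X, Y, gamma), min(r, len(idx))))
        k = residual[int(np.argmin(errs))]      # step 8
        basis.append(k)                         # step 9
        residual.remove(k)                      # step 10
    return basis
```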

#### **4.1.2 Backward search**

On the other hand, a backward search removes the samples that have the least effect on the cost function. In this case, the standard KPCA using all samples has to be constructed at the beginning, which may have a very high computational complexity. However, the backward search may be useful in combination with the greedy forward search: the size of the temporal basis set then does not become large, and the value of the cost function decreases monotonically.

Sparse KPCA (Tipping, 2001) is a kind of backward procedure. Therefore, the kernel Gram matrix K over all training samples, and its inverse, have to be calculated at the beginning.

<sup>4</sup> Of course, for KPCA, we can also consider the criterion set and the basis set, and perform pre-centering only for the criterion set. This produces an equivalent result.


#### **4.1.3 Random sampling consensus**

RANSAC is a simple sample (or parameter) selection technique. The best basis set is chosen from many random sampling trials. The algorithm is simple to code.
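A sketch of that idea, reusing the `rbf` and `subkpca_error` helpers from above (our own naming; `trials` sets the sampling budget):

```python
# RANSAC-style selection sketch: draw random basis sets, keep the one with the
# smallest empirical error J1. Reuses rbf() and subkpca_error() from above.
import numpy as np

def ransac_select(X, m, r, trials=100, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    K = rbf(X, X, gamma)
    best_idx, best_err = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(X), size=m, replace=False)
        Y = X[idx]
        err = subkpca_error(K, rbf(Y, Y, gamma), rbf(X, Y, gamma), r)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx
```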

#### **4.1.4 Clustering**

Clustering techniques can also be used for sample selection. When the subset is used for the basis set, either i) the sample closest to each centroid should be used, or ii) the centroids should be included in the criterion set. Clustering in the feature space F has also been proposed (Girolami, 2002).

#### **5. Comparison with conventional methods**

This section compares SubKPCA with related conventional methods.

#### **5.1 Improved KPCA**

Improved KPCA (IKPCA) (Xu et al., 2007) directly approximates $u_i \approx T\tilde{v}_i$ in Eq. (7). From $SS^{*}u_i = \lambda_i u_i$, the approximated eigenvalue problem is

$$SS^{*}T\tilde{v}_i = \lambda_i T\tilde{v}_i. \tag{41}$$

By multiplying by $T^{*}$ from the left side, one gets the approximated generalized EVD, $K_{xy}^{\top}K_{xy}\tilde{v}_i = \lambda_i K_y\tilde{v}_i$. The parameter vector $\tilde{v}_i$ is substituted into the relation $\tilde{u}_i = T\tilde{v}_i$; hence, the transform of an input vector x is

$$U_{\text{IKPCA}}^{*}\Phi(x) = \operatorname{diag}([\tfrac{1}{\sqrt{\kappa_1}},\ldots,\tfrac{1}{\sqrt{\kappa_r}}])\,Z^{\top}h_x, \tag{42}$$

where $\kappa_i$ is the *i*th largest eigenvalue of (21). This approximation has no guarantee of being a good approximation of $u_i$; in our experiments in the next section, IKPCA showed worse performance than SubKPCA. As far as feature extraction is concerned, each dimension of the feature vector is multiplied by $\sqrt{\kappa_i}$ compared to SubKPCA. If the classifier accepts such linear transforms, the classification accuracy of the feature vectors is not affected.
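The relation between Eq. (24) and Eq. (42) amounts to a per-dimension rescaling. A final sketch, reusing `rbf` and the $(Z, \kappa)$ output of `subkpca_fit` from above (our own naming; assumes all $\kappa_i > 0$):

```python
# Eq. (24) vs. Eq. (42) on a single input: IKPCA rescales each SubKPCA feature
# by 1/sqrt(kappa_i). Reuses rbf() and the (Z, kap) output of subkpca_fit().
import numpy as np

def compare_transforms(x, Y, Z, kap, gamma=1.0):
    h = rbf(x[None, :], Y, gamma).ravel()   # empirical kernel map h_x
    f_sub   = Z.T @ h                       # SubKPCA transform, Eq. (24)
    f_ikpca = f_sub / np.sqrt(kap)          # IKPCA transform, Eq. (42);
    return f_sub, f_ikpca                   # assumes all kappa_i > 0
```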
