may be the same with SubKPCA. Indeed, (Xu et al., 2007) uses IKPCA only for feature extraction in a classification problem, where IKPCA shows good performance.

#### **5.2 Sparse KPCA**

Two methods to obtain a sparse solution to KPCA have been proposed (Smola et al., 1999; Tipping, 2001). Both approaches focus on reducing the computational complexity of the evaluation stage and do not consider that of the construction stage. In addition, the degree of sparsity cannot be tuned directly for these sparse KPCAs, whereas the subset size *m* can be tuned for SubKPCA.

As mentioned in Section 4.1.2, (Tipping, 2001) is based on a backward search; therefore, it requires computing the kernel Gram matrix over all training samples, as well as its inverse. These procedures have high computational complexity, especially when *n* is large.

(Smola et al., 1999) utilizes ℓ<sub>1</sub>-norm regularization to make the solution sparse. The principal components are represented by linear combinations of the mapped samples,

$$u_i = \sum_{j=1}^{n} \alpha_i^j \Phi(x_j).$$

The coefficients α<sub>i</sub><sup>j</sup> have many zero entries due to the ℓ<sub>1</sub>-norm regularization. However, since α<sub>i</sub><sup>j</sup> has two indices, even if each principal component *u<sub>i</sub>* is represented by a few samples, the solution may not be sparse over many *i*.
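The two-index caveat can be made concrete with a small numerical illustration; the sparse coefficients below are synthetic stand-ins for an ℓ<sub>1</sub>-regularized solution, not the output of the actual optimizer of (Smola et al., 1999):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 20            # training samples, principal components

# Synthetic l1-style coefficients: each component uses only 20 samples.
alpha = np.zeros((r, n))
for i in range(r):
    support = rng.choice(n, size=20, replace=False)
    alpha[i, support] = rng.normal(size=20)

per_component = (alpha != 0).sum(axis=1)        # 20 samples per component
union = int(((alpha != 0).any(axis=0)).sum())   # samples needed in total
print(per_component.mean(), union)              # 20.0 vs roughly 330
```

Each component alone touches only 20 samples, but evaluating a new sample requires kernel values against the union of all supports, here roughly a third of the training set, so per-component sparsity does not cap the evaluation cost.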

#### **5.3 Nyström approximation**

The Nyström approximation is a method to approximate an EVD, and it has been applied to KPCA (Williams & Seeger, 2001). Let v*<sub>i</sub>* be the *i*th eigenvector of **K**<sub>y</sub>. The Nyström approximation approximates the *i*th eigenvector of **K** by

$$\tilde{v}_i = \sqrt{\frac{m}{n}} \frac{1}{\lambda_i} \mathbf{K}_{xy} v_i, \tag{43}$$

where λ*<sub>i</sub>* is the *i*th eigenvalue of **K**<sub>y</sub>. Since the eigenvectors of **K** are approximated from the eigenvectors of **K**<sub>y</sub>, the computational complexity of the construction stage is reduced, but that of the evaluation stage is not. In our experiments, SubKPCA shows better performance than the Nyström approximation.
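The following sketch computes the Nyström extension of Eq. (43) with NumPy; the eigenvalue rescaling λ̃<sub>i</sub> = (*n*/*m*)λ<sub>i</sub> follows the convention of Williams & Seeger (2001), and the Gaussian kernel with σ² = 1 is an arbitrary choice for the example:

```python
import numpy as np

def gaussian_gram(A, B, sigma2=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # n training samples
sub = X[rng.choice(500, size=50, replace=False)]    # m-sample subset

Ky = gaussian_gram(sub, sub)     # m x m Gram matrix of the subset
Kxy = gaussian_gram(X, sub)      # n x m cross Gram matrix

lam, V = np.linalg.eigh(Ky)      # eigenpairs of Ky, ascending
lam, V = lam[::-1], V[:, ::-1]   # reorder to descending

m, n = Ky.shape[0], X.shape[0]
V_tilde = np.sqrt(m / n) * (Kxy @ V) / lam   # Eq. (43), one column per i
lam_tilde = (n / m) * lam                    # Nystrom eigenvalue rescaling
```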

#### **5.4 Iterative KPCA**

There are several iterative approaches to KPCA (Ding et al., 2010; Günter et al., 2007; Kim et al., 2005). They update the transform matrix **Λ**<sup>−1/2</sup>**V**<sup>⊤</sup><sub>KPCA</sub> in Eq. (14) for incoming samples.

Iterative approaches are sometimes used to reduce computational complexity: even if the optimization does not converge to the optimal point, an early stopping point may be a good approximation of the optimal solution. However, Kim et al. (2005) and Günter et al. (2007) do not compare their computational complexity with that of standard KPCA. In the next section, run-time comparisons show that the iterative KPCAs are not faster than the batch approaches.
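The chapter does not reproduce the update rule, but the core of the kernel Hebbian algorithm of Kim et al. (2005) can be sketched as follows; this assumes an uncentered kernel and a plain GHA update with a fixed step size, and omits the SMD step-size adaptation of Günter et al. (2007):

```python
import numpy as np

def kha(K, r, eta=0.05, epochs=50, seed=0):
    """Kernel Hebbian Algorithm sketch. Rows of A are expansion
    coefficients of the components, u_i = sum_j A[i, j] Phi(x_j)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    A = rng.normal(scale=1e-2, size=(r, n))
    for _ in range(epochs):
        for t in rng.permutation(n):
            y = A @ K[:, t]          # projections of Phi(x_t)
            A[:, t] += eta * y       # Hebbian term: eta * y e_t^T
            # Sanger-style decorrelation: lower-triangular feedback
            A -= eta * (np.tril(np.outer(y, y)) @ A)
    return A
```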

#### **5.5 Incomplete Cholesky decomposition**

ICD can also be used to reduce the computational complexity of KPCA. ICD approximates the kernel Gram matrix **K** by

$$\mathbf{K} \simeq \mathbf{G}\mathbf{G}^\top, \tag{44}$$

where **G** is an *n* × *m* matrix with *m* ≪ *n*.
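A minimal pivoted incomplete Cholesky sketch that produces the factor **G** of Eq. (44); the pivoting rule (largest residual diagonal) and the stopping tolerance are standard choices assumed here. KPCA can then be recovered from the small *m* × *m* matrix **G**<sup>⊤</sup>**G** instead of the full *n* × *n* Gram matrix.

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky: returns G (n x m) with K ~= G @ G.T."""
    n = K.shape[0]
    max_rank = max_rank or n
    d = np.diag(K).astype(float).copy()    # residual diagonal
    G = np.zeros((n, max_rank))
    for m in range(max_rank):
        j = int(np.argmax(d))
        if d[j] <= tol:                    # residual trace is small: stop
            return G[:, :m]
        G[:, m] = (K[:, j] - G[:, :m] @ G[j, :m]) / np.sqrt(d[j])
        d -= G[:, m] ** 2
    return G
```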

### **6. Numerical examples**

This section presents numerical examples and comparisons with the other methods.

#### **6.1 Methods and evaluation criteria**

First, the methods to be compared and the evaluation criteria are described. The compared methods are full KPCA [FKp], SubKPCA [SubKp], reduced KPCA [RKp], improved KPCA [IKp], the Nyström approximation [Nys], sparse KPCA [SpKp], incomplete Cholesky decomposition [ICD], and KHA-SMD (Günter et al., 2007).


Abbreviations in [ ] are used in the figures and tables.

For the evaluation criterion, the empirical error, that is, *J*<sub>1</sub>, is used:

$$E_{\rm emp}(X) = J_1(X) = \frac{1}{n} \sum_{i=1}^{n} \left\| \Phi(x_i) - X\Phi(x_i) \right\|^2 \tag{46}$$

where *X* is replaced by each operator. Note that full KPCA gives the minimum value of *E*<sub>emp</sub>(*X*) under the rank constraint. Since *E*<sub>emp</sub>(*X*) depends on the problem, the error normalized by that of full KPCA, *E*<sub>emp</sub>(*X*)/*E*<sub>emp</sub>(*P*<sub>FKp</sub>), is also used, where *P*<sub>FKp</sub> is the projector of full KPCA. The validation error *E*<sub>val</sub>, which uses validation samples instead of training samples in the empirical error, is also used.
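For full KPCA the sum in Eq. (46) reduces to the trailing eigenvalues of the Gram matrix, which gives a cheap reference value for the normalization; a minimal sketch, assuming an uncentered kernel:

```python
import numpy as np

def emp_error_full_kpca(K, r):
    """E_emp(P_FKp) of Eq. (46) from the Gram matrix K: projecting onto
    the top-r eigendirections leaves the n - r trailing eigenvalues."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K)      # eigenvalues in ascending order
    return lam[: n - r].sum() / n    # (1/n) * sum of the n - r smallest
```

The normalized criterion *E*<sub>emp</sub>(*X*)/*E*<sub>emp</sub>(*P*<sub>FKp</sub>) then compares any approximate operator against this optimum.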

Fig. 2. Contour curves of projection norms: (a) distribution; (b) FKp (*n* = 1000), *E*emp = 0.206, *D* = 0; (c) SubKp (*m* = 50), *E*emp = 0.207, *D* = 0.008; (d) RKp (*n* = 50), *E*emp = 0.232, *D* = 0.157; (e) IKp (*m* = 50), *E*emp = 0.990, *D* = 0.990; (f) Nys, *E*emp = 0.383, *D* = 0.192; (g) SpKp (*σ* = 0.3, *n* = 69), *E*emp = 0.357, *D* = 0.843; (h) *r* = 1.
| Sample selection | SubKPCA | Reduced KPCA | Improved KPCA | Nyström method |
|---|---|---|---|---|
| Random | 1.0025±0.0019 | 1.1420±0.0771 | 4.7998±0.0000 | 2.3693±0.6826 |
| K-means | 1.0001±0.0000 | 1.0282±0.0114 | 4.7998±0.0000 | 1.7520±0.2535 |
| Forward | 1.0002±0.0001 | 1.3786±0.0719 | 4.7998±0.0000 | 14.3043±9.0850 |
| Random | 0.0045±0.0035 | 0.2279±0.1419 | 0.9900±0.0000 | 0.3583±0.1670 |
| K-means | 0.0002±0.0001 | 0.0517±0.0318 | 0.9900±0.0000 | 0.1520±0.0481 |
| Forward | 0.0002±0.0001 | 0.5773±0.0806 | 0.9900±0.0000 | 1.7232±0.9016 |

Table 1. Mean values and standard deviations over 10 trials of *E*emp and *D* in Experiment 1: upper rows are *E*emp(*X*)/*E*emp(*P*FKp); lower rows are *D*. Sparse KPCA does not require sample selection.

An alternative criterion is the operator distance from full KPCA. Since these methods are approximations of full KPCA, an operator closer to that of full KPCA is the better one. In the feature space, the distance between projectors is measured by the Frobenius distance,

$$D(X, P_{\rm FKp}) = \|X - P_{\rm FKp}\|_F. \tag{47}$$

For example, if $X = P_{\text{SubKPCA}} = T\mathbf{Z}\mathbf{Z}^\top T^*$ (Eq. (27)),

$$\begin{split} D^2(P_{\text{SubKPCA}}, P_{\text{FKp}}) &= \|T\mathbf{Z}\mathbf{Z}^\top T^* - S\mathbf{V}_{\text{KPCA}}\mathbf{\Lambda}^{-1}\mathbf{V}_{\text{KPCA}}^\top S^*\|_F^2 \\ &= \text{Trace}[\mathbf{Z}^\top \mathbf{K}_y \mathbf{Z} + \mathbf{V}_{\text{KPCA}}^\top \mathbf{K} \mathbf{V}_{\text{KPCA}} \mathbf{\Lambda}^{-1} \\ &\qquad - 2\mathbf{Z}^\top \mathbf{K}_{xy}^\top \mathbf{V}_{\text{KPCA}} \mathbf{\Lambda}^{-1} \mathbf{V}_{\text{KPCA}}^\top \mathbf{K}_{xy} \mathbf{Z}]. \end{split}$$
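The identity above can be evaluated with Gram matrices alone; a sketch, assuming **Z** (*m* × *r*) and **V**<sub>KPCA</sub> (*n* × *r*) are the SubKPCA and KPCA coefficient matrices with the normalizations used in the derivation:

```python
import numpy as np

def sq_dist_to_full_kpca(Z, Ky, Kxy, V, lam, K):
    """D^2(P_SubKPCA, P_FKp) via the trace identity.
    Z: (m, r) SubKPCA coefficients; Ky: (m, m) subset Gram matrix;
    Kxy: (n, m) cross Gram matrix; V, lam: top-r eigenpairs of K."""
    t1 = np.trace(Z.T @ Ky @ Z)             # Trace[Z^T Ky Z]
    t2 = np.trace((V.T @ K @ V) / lam)      # Trace[V^T K V Lambda^{-1}] (= r)
    B = (Kxy.T @ V) @ np.diag(1.0 / lam) @ (V.T @ Kxy)   # (m, m)
    t3 = np.trace(Z.T @ B @ Z)              # cross term
    return t1 + t2 - 2.0 * t3
```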

#### **6.2 Artificial data**

The two-dimensional artificial data described in the Introduction is used again, with more comparisons and a quantitative evaluation. The Gaussian kernel function *k*(x1, x2) = exp(−0.1‖x1 − x2‖²) and *r* = 5 principal components are chosen. The training samples of reduced KPCA and the basis sets of SubKPCA, the Nyström approximation, and IKPCA are identical and chosen randomly. For sparse KPCA (SpKp), the parameter *σ* is chosen to give the same sparsity level as SubKPCA. Figure 2 shows the contour curves and the values of the evaluation criteria. In terms of both *E*emp and *D*, SubKp shows the best approximation accuracy among these methods.

Table 1 compares sample selection methods. The values in the table are the means and standard deviations over 10 trials using different random seeds or initial points. SubKPCA performed better than the other methods. Regarding sample selection, K-means and forward search give almost the same results for SubKPCA.

#### **6.3 Open dataset**

Three open benchmark datasets, "concrete," "housing," and "tic" from the UCI (University of California, Irvine) machine learning repository, are used<sup>5</sup> (Asuncion & Newman, 2007). Table 2 shows the properties of the datasets.

<sup>5</sup> As of Oct. 2011, the datasets are available from http://archive.ics.uci.edu/ml/index.html


Fig. 3. Results for open datasets. Rand: random, Clus: clustering (K-means), Forw: forward search. (a) concrete, normalized squared distance; (b) housing, normalized squared distance; (c) concrete, empirical error; (d) housing, empirical error; (e) concrete, validation error; (f) housing, validation error.

| dataset | no. of dim. | no. of samples |
|---|---|---|
| concrete | 9 | 1030 |
| housing | 14 | 506 |
| tic | 85 | 9822 |

Table 2. Open datasets.

The Gaussian kernel *k*(x1, x2) = exp(−‖x1 − x2‖²/(2*σ*²)), whose *σ*² is set to the variance of all elements of each dataset, is used for the kernel function. The number of principal components, *r*, is set to the input dimension of each dataset. 90% of the samples are used for training, and the remaining 10% are used for validation. The division into training and validation sets is repeated 50 times randomly.
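A sketch of this kernel setup, reading "the variance of all elements" as the variance over every entry of the data matrix (an assumption about the intended normalization):

```python
import numpy as np

def gaussian_gram_matrix(X):
    """Gram matrix with sigma^2 set to the variance over all entries of X."""
    sigma2 = X.var()                               # variance of all elements
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma2))
```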

Figures 3-(a) and (b) show the averaged squared distance from KPCA using all samples. SubKPCA performs better than reduced KPCA and the Nyström method; in particular, SubKPCA with forward search performed best of all. In both datasets, even when the number of basis vectors is one tenth the number of samples, the distance error of SubKPCA is less than 1%.

Figures 3-(c) and (d) show the averaged normalized empirical error, and Figures 3-(e) and (f) show the averaged validation error. SubKPCA with K-means or forward search performed best, and its performance changed little beyond a basis size of 20%. The results for the Nyström method are outside the range illustrated in the figures.

Figures 4-(a) and (b) show the calculation times for construction. The simulation was done on a system with an Intel Core 2 Quad CPU at 2.83 GHz and 8 GB of RAM. The routines dsygvx and dsyevx in the Intel Math Kernel Library (MKL) were used for the generalized eigenvalue decomposition of SubKPCA and the eigenvalue decomposition of KPCA, respectively. The figures indicate that SubKPCA is faster than full KPCA when the number of basis vectors is less than 80% of the number of samples.
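The pairing of SubKPCA with dsygvx suggests that its coefficients solve a symmetric-definite generalized eigenproblem. The sketch below assumes the form (**K**<sub>xy</sub><sup>⊤</sup>**K**<sub>xy</sub>)z = λ**K**<sub>y</sub>z, consistent with that routine (see Eq. (22) of the chapter for the exact solution); the jitter term is a numerical safeguard added here:

```python
import numpy as np
from scipy.linalg import eigh

def subkpca(Kxy, Ky, r, jitter=1e-10):
    """SubKPCA construction sketch via a generalized eigenproblem,
    mirroring the dsygvx call: (Kxy^T Kxy) z = lambda * Ky z."""
    m = Ky.shape[0]
    A = Kxy.T @ Kxy                  # (m, m) left-hand matrix
    B = Ky + jitter * np.eye(m)      # keep the right-hand matrix definite
    # eigh solves A z = lambda B z; request only the top-r eigenpairs
    lam, Z = eigh(A, B, subset_by_index=[m - r, m - 1])
    return lam[::-1], Z[:, ::-1]     # descending order
```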

Figure 5 shows the relation between runtime [s] and squared distance from full KPCA. In this figure, "kmeans" includes the runtime of the K-means clustering, and the vertical dotted line indicates the run-time of full KPCA. For (a) concrete and (b) housing, incomplete Cholesky decomposition is faster than our method; however, for the larger dataset, (c) tic, it is slower. KHA-SMD (Günter et al., 2007) is slower than full KPCA on all three datasets.

#### **6.4 Classification**

PCA and KPCA are also used for classification as subspace methods (Maeda & Murase, 1999; Oja, 1983; Tsuda, 1999). Subspace methods obtain projectors onto subspaces that correspond to classes. Let P*<sup>i</sup>* be the projector onto the subspace of class *i*. In the class feature information compression (CLAFIC) method, one of the subspace methods, P*<sup>i</sup>* is the projector of PCA for each class. An input sample x is then classified to the class *k* whose squared projection norm is the largest, that is,

$$k = \underset{i=1,\ldots,c}{\operatorname{argmax}} \|P_i x\|^2, \tag{48}$$

where *c* is the number of classes. Binary classifiers such as SVM cannot be applied to multi-class problems directly; therefore, extensions such as the one-against-all strategy have to be used. Subspace methods, in contrast, can be applied to many-class problems easily. Furthermore, subspace methods are easily applied to multi-label problems or class-addition/reduction problems. CLAFIC is easily extended to KPCA (Maeda & Murase, 1999; Tsuda, 1999).
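A minimal linear CLAFIC sketch (uncentered per-class PCA; the kernel version replaces each basis with a class-wise KPCA):

```python
import numpy as np

def fit_clafic(X_by_class, r):
    """Per-class PCA bases: top-r right singular vectors of each class."""
    bases = []
    for Xc in X_by_class:            # Xc: (n_c, d) samples of one class
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        bases.append(Vt[:r].T)       # (d, r), orthonormal columns
    return bases

def classify(x, bases):
    """Eq. (48): the class with the largest squared projection norm."""
    scores = [np.sum((U.T @ x) ** 2) for U in bases]   # ||P_i x||^2
    return int(np.argmax(scores))
```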


Fig. 4. Calculation time: (a) concrete; (b) housing.

Fig. 5. Relation between runtime [s] and squared distance from full KPCA: (a) concrete; (b) housing; (c) tic.


| Method | 10% | 30% | 50% | 100% |
|---|---|---|---|---|
| SubKp (nc) | 3.89±0.82 | 2.87±0.71 | 2.55±0.64 | 2.03±0.56 |
| SubKp (c) | 3.93±0.82 | 2.83±0.70 | 2.55±0.64 | 2.02±0.56 |
| RKp (nc) | 4.92±0.95 | 3.17±0.71 | 2.65±0.60 | 2.03±0.56 |
| RKp (c) | 4.91±0.92 | 3.16±0.70 | 2.65±0.62 | 2.02±0.56 |
| CLAFIC (nc) | 5.24±0.92 | 3.64±0.79 | 3.25±0.70 | 2.95±0.73 |
| CLAFIC (c) | 5.24±0.89 | 3.71±0.76 | 3.38±0.68 | 3.06±0.74 |

Table 3. Minimum validation errors [%] and standard deviations I; random selection, nc: non-centered, c: centered.

| Method | 10% | 30% | 50% | 100% |
|---|---|---|---|---|
| SubKp (nc) | 2.69±0.58 | 2.30±0.61 | 2.18±0.58 | 2.03±0.56 |
| SubKp (c) | 2.68±0.60 | 2.29±0.61 | 2.15±0.55 | 2.02±0.56 |
| RKp (nc) | 2.75±0.61 | 2.35±0.60 | 2.22±0.57 | 2.03±0.56 |
| RKp (c) | 2.91±0.63 | 2.40±0.59 | 2.24±0.58 | 2.02±0.56 |
| PCA (nc) | 3.60±0.66 | 3.38±0.60 | 3.21±0.56 | 3.03±0.60 |

Table 4. Minimum validation errors [%] and standard deviations II; K-means, nc: non-centered, c: centered.

A handwritten digits database, USPS (U.S. Postal Service database), is used for the demonstration. The database has 7291 images for training and 2001 images for testing. Each image is a 16×16-pixel gray-scale image with a label (0, ..., 9).

10% of the samples (729 samples) are extracted from the training set for validation, and the remaining 90% (6562 samples) are used for training. This division is repeated 100 times, and the optimal parameters are obtained over several picks: the width of the Gaussian kernel *c* ∈ {10<sup>−4.0</sup>, 10<sup>−3.8</sup>, ..., 10<sup>0.0</sup>} and the number of principal components *r* ∈ {10, 20, ..., 200}.

Tables 3 and 4 show the validation errors [%] and standard deviations over the 100 validations when the basis samples are selected randomly and by K-means, respectively. SubKPCA has a lower error rate than reduced KPCA when the number of basis vectors is small. Tables 5 and 6 show the test errors when the optimal parameters are chosen by the validation.

| Method | 10% | 30% | 50% | 100% |
|---|---|---|---|---|
| SubKp (nc) | 6.50±0.36 | 5.69±0.15 | 5.20±0.14 | 4.78±0.00 |
| SubKp (c) | 6.54±0.36 | 5.52±0.16 | 5.20±0.14 | 4.83±0.00 |
| RKp (nc) | 7.48±0.44 | 5.71±0.31 | 5.26±0.20 | 4.78±0.00 |
| RKp (c) | 7.50±0.43 | 5.76±0.31 | 5.28±0.20 | 4.83±0.00 |

Table 5. Test errors [%] and standard deviations; random selection, nc: non-centered, c: centered.

| Method | 10% | 30% | 50% | 100% |
|---|---|---|---|---|
| SubKp (nc) | 5.14±0.17 | 4.99±0.14 | 4.97±0.13 | 4.78±0.00 |
| SubKp (c) | 5.14±0.18 | 4.99±0.14 | 4.87±0.14 | 4.83±0.00 |
| RKp (nc) | 5.18±0.21 | 5.01±0.16 | 4.89±0.15 | 4.78±0.00 |
| RKp (c) | 5.36±0.24 | 5.07±0.17 | 4.93±0.15 | 4.83±0.00 |

Table 6. Test errors [%] and standard deviations; K-means, nc: non-centered, c: centered.

#### **6.5 Denoising using a huge dataset**

KPCA is also used for image denoising (Kim et al., 2005; Mika et al., 1999). This subsection demonstrates image denoising by KPCA using the MNIST database. The database has 60000 images for training and 10000 for testing. Each image is a 28×28-pixel gray-scale image of a handwritten digit, and each pixel value of the original image ranges from 0 to 255.


Before the demonstration of image denoising, comparisons of computational complexities are presented, since the database is rather large. The Gaussian kernel function *k*(x1, x2) = exp(−10<sup>−5.1</sup>‖x1 − x2‖²) and *r* = 145 principal components are used because these parameters give the best results in the later denoising experiment. Random selection is used for the basis of SubKPCA. Figure 6 shows the relation between run-time and training error: SubKPCA achieves the lower training error *E*emp = 4.57 × 10<sup>−5</sup> in 28 seconds, whereas ICD achieves *E*emp = 4.59 × 10<sup>−5</sup> in 156 seconds.

Fig. 6. Relation between training error and elapsed time in the MNIST dataset.

Denoising is done by the following procedure:

1. Rescale each pixel value from 0 to 1.
2. Obtain the subset using K-means clustering from the 60000 training samples.
3. Obtain operators: (a) obtain centered SubKPCA using the 60000 training samples and the subset; (b) obtain centered KPCA using the subset.
4. Prepare noisy images using the 10000 test samples: (a) add Gaussian noise whose variance is *σ*²; (b) add salt-and-pepper noise with probability *p* (a pixel flips to white (1) with probability *p*/2 and to black (0) with probability *p*/2).
5. Obtain each transformed vector and its pre-image using the method in (Mika et al., 1999) (see the sketch after this list).
6. Rescale each pixel value from 0 to 255, truncating values below 0 or above 255.
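Step 5 relies on the pre-image method of (Mika et al., 1999); for a Gaussian kernel *k*(a, b) = exp(−*c*‖a − b‖²) it is the fixed-point iteration sketched below (convergence checks and restarts are omitted):

```python
import numpy as np

def preimage(gamma, X, z0, c, iters=100, eps=1e-12):
    """Fixed-point pre-image iteration of Mika et al. (1999).
    gamma: expansion coefficients of the projected feature P Phi(x)
    over the basis images X (n, d); z0: initial guess, e.g. the noisy image."""
    z = z0.copy()
    for _ in range(iters):
        w = gamma * np.exp(-c * ((X - z) ** 2).sum(axis=1))
        s = w.sum()
        if abs(s) < eps:      # degenerate denominator: keep current iterate
            break
        z = (w @ X) / s       # weighted mean of the basis images
    return z
```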




The evaluation criterion is the mean squared error

$$E_{\rm MSE} = \frac{1}{10000} \sum_{i=1}^{10000} \left\| f_i - \hat{f}_i \right\|^2, \tag{49}$$

where f*<sup>i</sup>* is the *i*th original test image and f̂*<sup>i</sup>* is its denoised image. The optimal parameters, *r* (the number of principal components) and *c* (the parameter of the Gaussian kernel), are chosen to give the best performance over several picks, *r* ∈ {5, 10, 15, ..., *m*} and *c* ∈ {10<sup>−6.0</sup>, 10<sup>−5.9</sup>, ..., 10<sup>−2.0</sup>}.

Tables 7 and 8 show the denoising results. SubKPCA always shows lower errors than reduced KPCA. Figure 7 shows the original images, the noisy images, and the de-noised images. Fields of experts (FoE) (Roth & Black, 2009) and block-matching and 3D filtering (BM3D) (Dabov et al., 2007) are state-of-the-art denoising methods for natural images<sup>6</sup>.


| *σ* | 20 | 50 | 80 | 100 |
|---|---|---|---|---|
| SubKp (100) | 3.38±1.37 | 4.64±1.49 | 6.73±1.70 | 8.33±2.17 |
| RKp (100) | 3.48±1.42 | 4.71±1.51 | 6.80±1.74 | 8.55±2.11 |
| SubKp (500) | 0.99±0.24 | 3.64±0.82 | 6.22±1.43 | 7.95±1.91 |
| RKp (500) | 1.01±0.27 | 3.73±0.81 | 6.39±1.51 | 8.14±2.01 |
| SubKp (1000) | 0.93±0.22 | 3.20±0.83 | 5.11±1.67 | 6.18±2.02 |
| RKp (1000) | 0.94±0.20 | 3.27±0.87 | 5.49±1.60 | 7.18±1.88 |
| WF | 0.88±0.24 | 3.14±0.81 | 5.49±1.43 | 7.01±1.84 |
| FoE | 1.15±2.08 | 8.48±0.78 | 23.29±1.90 | 36.53±2.81 |
| BM3D | 1.07±1.80 | 7.17±1.09 | 17.39±2.96 | 25.49±4.17 |

Table 7. Denoising results for Gaussian noise: means and SDs of squared errors; values are divided by 10⁵; the numbers in brackets denote the numbers of basis vectors.

<sup>6</sup> MATLAB codes were downloaded from http://www.gris.tu-darmstadt.de/~sroth/research/foe/downloads.html and http://www.cs.tut.fi/~foi/GCF-BM3D/index.html



| *p* | 0.05 | 0.10 | 0.20 | 0.40 |
|---|---|---|---|---|
| SubKp (100) | 4.73±1.51 | 6.45±1.73 | 10.07±2.16 | 18.11±3.06 |
| RKp (100) | 4.79±1.54 | 6.52±1.72 | 10.51±2.30 | 18.80±3.30 |
| SubKp (500) | 3.61±1.00 | 5.73±1.34 | 9.66±2.02 | 17.87±2.95 |
| RKp (500) | 3.72±1.00 | 6.00±1.39 | 10.04±2.13 | 18.31±3.05 |
| SubKp (1000) | 3.22±0.98 | 4.99±1.46 | 7.93±2.22 | 13.58±4.72 |
| RKp (1000) | 3.25±1.00 | 5.15±1.53 | 8.78±2.22 | 18.21±2.96 |
| WF | 3.15±0.87 | 5.15±1.17 | 8.93±1.72 | 17.07±2.61 |
| Median | 3.26±1.44 | 4.34±1.70 | 7.36±2.45 | 21.78±5.62 |

Table 8. Denoising results for salt-and-pepper noise: means and SDs of squared errors; values are divided by 10⁵; the numbers in brackets denote the numbers of basis vectors.

FoE and BM3D assume that the noise is Gaussian with zero mean and known variance; thus these two methods are compared only in the Gaussian-noise case. Since the dataset does not consist of natural images, these methods do not perform better than SubKPCA. "WF" and "Median" denote the Wiener filter and the median filter, respectively. When the noise is relatively small (*σ* = 20 ∼ 50 for Gaussian noise or *p* = 0.05 ∼ 0.10), these classical methods show better performance; when the noise is large, our method performs better. Note that the Wiener filter is known to be the optimal filter among linear operators in terms of the mean squared error. From a different point of view, the Wiener filter is optimal among all linear and non-linear operators if both signal and noise are Gaussian. However, KPCA is non-linear because of the non-linear mapping Φ, and the pixel values of the images and the salt-and-pepper noise are not Gaussian in this case.

Fig. 7. Results of denoising (first 100 samples): (a) Gaussian noise; (b) salt-and-pepper noise. Top-left: original image; top-right: noisy image (Gaussian, *σ* = 50); bottom-left: image de-noised by SubKPCA; bottom-right: image de-noised by KPCA.

### **7. Conclusion**


Theories, properties, and numerical examples of SubKPCA have been presented in this chapter. SubKPCA has a simple solution form, Eq. (22), and places no constraint on its kernel functions. Therefore, SubKPCA can be applied to any application of KPCA. Furthermore, it should be emphasized that SubKPCA is always better than reduced KPCA in the sense of the empirical error if the subset is the same.

#### **8. References**


Asuncion, A. & Newman, D. (2007). UCI machine learning repository. URL: *http://www.ics.uci.edu/~mlearn/MLRepository.html*

Bach, F. R. & Jordan, M. I. (2002). Kernel independent component analysis, *Journal of Machine Learning Research* 3: 1–48.

Dabov, K., Foi, A., Katkovnik, V. & Egiazarian, K. (2007). Image denoising by sparse 3D transform-domain collaborative filtering, *IEEE Trans. on Image Processing* 16(8): 2080–2095.

Demmel, J. (1997). *Applied Numerical Linear Algebra*, Society for Industrial Mathematics.

Ding, M., Tian, Z. & Xu, H. (2010). Adaptive kernel principal component analysis, *Signal Processing* 90(5): 1542–1553.

Girolami, M. (2002). Mercer kernel-based clustering in feature space, *IEEE Trans. on Neural Networks* 13(3): 780–784.

Gorban, A., Kégl, B., Wunsch, D. & Zinovyev, A. (eds) (2008). *Principal Manifolds for Data Visualisation and Dimension Reduction*, LNCSE 58, Springer.

Günter, S., Schraudolph, N. N. & Vishwanathan, S. V. N. (2007). Fast iterative kernel principal component analysis, *Journal of Machine Learning Research* 8: 1893–1918.

Harville, D. A. (1997). *Matrix Algebra From a Statistician's Perspective*, Springer-Verlag.

Hastie, T. & Stuetzle, W. (1989). Principal curves, *Journal of the American Statistical Association* 84(406): 502–516.

Israel, A. B. & Greville, T. N. E. (1973). *Generalized Inverses: Theory and Applications*, Springer.

Kim, K., Franz, M. O. & Schölkopf, B. (2005). Iterative kernel principal component analysis for image modeling, *IEEE Trans. Pattern Analysis and Machine Intelligence* 27(9): 1351–1366.

Lehoucq, R. B., Sorensen, D. C. & Yang, C. (1998). *ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods*, Software, Environments, and Tools 6, SIAM.

Maeda, E. & Murase, H. (1999). Multi-category classification by kernel based nonlinear subspace method, *IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*, Vol. 2, IEEE Press, pp. 1025–1028.

Mika, S., Schölkopf, B. & Smola, A. (1999). Kernel PCA and de-noising in feature space, *Advances in Neural Information Processing Systems (NIPS)* 11: 536–542.

Oja, E. (1983). *Subspace Methods of Pattern Recognition*, Wiley, New York.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization, *in* B. Schölkopf, C. Burges & A. J. Smola (eds), *Advances in Kernel Methods - Support Vector Learning*, MIT Press, pp. 185–208.

Roth, S. & Black, M. J. (2009). Fields of experts, *International Journal of Computer Vision* 82(2): 205–229.