**Multilinear Supervised Neighborhood Preserving Embedding Analysis of Local Descriptor Tensor**

Xian-Hua Han and Yen-Wei Chen

*Ritsumeikan University, Japan*

#### **1. Introduction**


Subspace-learning-based pattern recognition methods have attracted considerable interest in recent years, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), and several 2D extensions. However, a disadvantage of all these approaches is that they perform subspace analysis directly on the reshaped vector or matrix of pixel-level intensities, which is usually unstable under appearance variation. In this chapter, we propose to represent an image as a local descriptor tensor, which combines the descriptors of local regions (K×K-pixel patches) of the image and is more efficient than the popular Bag-of-Features (BOF) model for combining local descriptors. The idea of BOF is to quantize local invariant descriptors, obtained, for example, with the interest-point detector of Harris & Stephens (1998) and described with SIFT by Lowe (2004), into a set of visual words, as in Lazebnik et al. (2006). The frequency vector of the visual words then represents the image, and an inverted file system is used for efficient comparison of such BOFs. However, the BOF model only approximately represents each local descriptor by a predefined visual word and vectorizes the local descriptors of an image into an orderless histogram, which may lose important (discriminant) information of the local features as well as the spatial information held in the local regions of the image. Therefore, this chapter proposes to combine the local features of an image into a descriptor tensor. Because the local descriptor tensor retains all the information of the local features, it is more efficient for image representation than the BOF model; a moderate number of local regions then suffices for extracting the representation, which also makes it computationally cheaper than the BOF model. For feature representation of image regions, SIFT, proposed by Lowe (2004) and shown by Lazebnik et al. (2006) to be a powerful local descriptor for object and scene recognition, is somewhat invariant to small illumination changes. However, in some benchmark databases such as the YALE and PIE face data sets by Belhumeur et al. (1997), the illumination variation is very large. Therefore, in order to extract robust features invariant to large illumination changes, we explore an improved (intensity-normalized) gradient of the image and use a histogram of orientation weighted with the improved gradient for local-region representation.

With the local descriptor tensor representation of an image, we propose a tensor subspace analysis algorithm, called Multilinear Supervised Neighborhood Preserving Embedding (MSNPE), for discriminant feature extraction, and then use it for object or scene recognition. Subspace learning approaches such as PCA and LDA by Belhumeur et al. (1997) have been widely used in computer vision research for feature extraction or selection and have proven efficient for modeling and classification.


Recently there has been considerable interest in geometrically motivated approaches to visual analysis. The most popular ones include locality preserving projection by He et al. (2005), neighborhood preserving embedding, and so on, which can not only preserve the local structure between samples but also achieve acceptable recognition rates for face recognition. In real applications, all of these subspace learning methods must first reshape the multilinear data into a 1D vector for analysis, which usually leads to overfitting. Therefore, some researchers proposed to address the curse of dimensionality with 2D subspace learning, such as 2-D PCA and 2-D LDA by ming Wang et al. (2009), which analyze a 2D image matrix directly and were proven suitable to some extent. However, all of these conventional methods perform subspace analysis directly on the reshaped vector or matrix of pixel-level intensities, which is unstable under illumination and background variation. In this chapter, we propose MSNPE for discriminant feature extraction on the local descriptor tensor. Unlike tensor discriminant analysis by Wang (2006), which treats all samples of the same category equally, the proposed MSNPE uses the neighbor similarity within a category as a weight in the cost function minimized for Nth-order tensor analysis, which makes it possible to estimate geometrical and topological properties of the sub-manifold from random points ("scattered data") lying on this unknown sub-manifold. In addition, compared with the TensorFaces method by Vasilescu & Terzopoulos (2002), which also analyzes multi-dimensional data directly, the proposed multilinear supervised neighborhood preserving embedding uses a supervised strategy and can thus extract more discriminant features for distinguishing different objects while, at the same time, preserving the relationships among samples of the same object, instead of performing only dimensionality reduction as in TensorFaces. We validate the proposed algorithm on several benchmark databases: view-based object data sets (COIL-100 and ETH-70) and facial image data sets (YALE and CMU PIE) by Belhumeur et al. (1997) and Sim et al. (2001).

#### **2. Related work**

In this section, we first briefly introduce tensor algebra and then review subspace-based feature extraction approaches such as PCA and LPP.

Tensors are arrays of numbers that transform in certain ways under coordinate transformations. The order of a tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$, represented by a multi-dimensional array of real numbers, is $M$. An element of $\mathcal{X}$ is denoted by $\mathcal{X}_{i_1, i_2, \cdots, i_M}$, where $1 \leq i_j \leq N_j$ and $1 \leq j \leq M$. In tensor terminology, the mode-$j$ vectors of the $M$th-order tensor $\mathcal{X}$ are the vectors in $\mathbb{R}^{N_j}$ obtained from $\mathcal{X}$ by varying the index $i_j$ while keeping the other indices fixed. For example, the column vectors of a matrix are its mode-1 vectors and the row vectors are its mode-2 vectors.

**Definition** (**Mode product**). The tensor product $\mathcal{X} \times_d \mathbf{U}$ of a tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ and a matrix $\mathbf{U} \in \mathbb{R}^{N_d \times N'}$ is the $N_1 \times N_2 \times \cdots \times N_{d-1} \times N' \times N_{d+1} \times \cdots \times N_M$ tensor:

$$(\mathcal{X} \times_d \mathbf{U})_{i_1 i_2 \cdots i_{d-1}\, j\, i_{d+1} \cdots i_M} = \sum_{i_d} \mathcal{X}_{i_1 i_2 \cdots i_{d-1} i_d i_{d+1} \cdots i_M}\, \mathbf{U}_{i_d j} \tag{1}$$

for all index values. $\mathcal{X} \times_d \mathbf{U}$ denotes the mode-$d$ product of the tensor $\mathcal{X}$ with the matrix $\mathbf{U}$. The mode product is a special case of a contraction, which is defined for any two tensors, not just for a tensor and a matrix. In this paper, we follow the definitions in Lathauwer (1997) and avoid the use of the term "contraction".
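
To make Eq. (1) concrete, here is a minimal NumPy sketch of the mode-$d$ product (our own illustration, not code from the chapter; the function name and the 1-based mode convention are ours):

```python
import numpy as np

def mode_product(X, U, d):
    """Mode-d product of Eq. (1).

    X: array of shape (N_1, ..., N_M); U: array of shape (N_d, N_prime).
    Modes are 1-based to match the text, so mode d corresponds to axis d-1.
    The result has N_prime in place of N_d along mode d.
    """
    axis = d - 1
    # sum over i_d: contract X's axis d-1 with the first axis of U;
    # tensordot appends the new index j as the last axis
    Y = np.tensordot(X, U, axes=([axis], [0]))
    # move the new index back to position d-1
    return np.moveaxis(Y, -1, axis)

# sanity check against the matrix case: A x_1 U = U^T A and A x_2 V = A V
A = np.random.rand(4, 5)
U = np.random.rand(4, 3)
V = np.random.rand(5, 2)
assert np.allclose(mode_product(A, U, 1), U.T @ A)
assert np.allclose(mode_product(A, V, 2), A @ V)
```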


In tensor analysis, Principal Component Analysis (PCA) is used to extract the basis for each mode. The proposed MSNPE approach is based on the basic idea of Locality Preserving Projection (LPP). Therefore, we briefly introduce PCA, LPP, and a 2D extension of LPP in the following.

(1) Principal component analysis: PCA extracts the principal eigenspace associated with a set (matrix) $\mathbf{X} = [\mathbf{x}_i|_{i=1}^{N}]$ of training samples ($\mathbf{x}_i \in \mathbb{R}^n$ with $1 \leq i \leq N$; $N$: number of samples; $n$: dimension of the samples). Let $\mathbf{m}$ be the mean of the $N$ training samples and $\mathbf{C} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$ the covariance matrix of the $\mathbf{x}_i$. One solves the eigenvalue equation $\lambda_i \mathbf{u}_i = \mathbf{C}\mathbf{u}_i$ for eigenvalues $\lambda_i \geq 0$. The principal eigenspace $\mathbf{U}$ is spanned by the first $K$ eigenvectors with the largest eigenvalues, $\mathbf{U} = [\mathbf{u}_i|_{i=1}^{K}]$. If $\mathbf{x}_t$ is a new feature vector, it is projected onto the eigenspace $\mathbf{U}$: $\mathbf{y}_t = \mathbf{U}^T(\mathbf{x}_t - \mathbf{m})$. The vector $\mathbf{y}_t$ is then used in place of $\mathbf{x}_t$ for representation and classification.
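
A compact NumPy sketch of this PCA projection (our own illustrative code; function and variable names are ours) may help fix the notation: the columns of `X` are the training samples and `K` is the number of retained eigenvectors.

```python
import numpy as np

def pca_fit(X, K):
    """X: (n, N) matrix whose columns are the training samples x_i.
    Returns the mean m and the n x K matrix U of leading eigenvectors of C."""
    m = X.mean(axis=1, keepdims=True)
    Xc = X - m
    C = (Xc @ Xc.T) / X.shape[1]           # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :K]            # K eigenvectors with largest eigenvalues
    return m, U

def pca_project(x_t, m, U):
    """Project a new feature vector x_t: y_t = U^T (x_t - m)."""
    return U.T @ (x_t - m)
```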

(2) Locality Preserving Projection: LPP seeks a linear transformation $\mathbf{P}$ that projects high-dimensional data into a low-dimensional sub-manifold while preserving the local structure of the data. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N]$ denote the set of features of the $N$ training image samples, and let $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_N] = [\mathbf{P}^T\mathbf{x}_1, \mathbf{P}^T\mathbf{x}_2, \cdots, \mathbf{P}^T\mathbf{x}_N]$ denote the sample features in the transformed subspace. The linear transformation $\mathbf{P}$ can then be obtained by solving the following minimization problem under constraints that will be given later:

$$\min_{\mathbf{P}} \sum_{ij} \|\mathbf{y}_i - \mathbf{y}_j\|^2 W_{ij} = \min_{\mathbf{P}} \sum_{ij} \|\mathbf{P}^T\mathbf{x}_i - \mathbf{P}^T\mathbf{x}_j\|^2 W_{ij} \tag{2}$$

where $W_{ij}$ evaluates the local structure of the image space. It can be simply defined as follows:

$$W_{ij} = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ is among the } k \text{ nearest neighbors of } \mathbf{x}_j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

By simple algebraic manipulation, the objective function reduces to:

$$\begin{aligned} \frac{1}{2}\sum_{ij} \|\mathbf{P}^T\mathbf{x}_i - \mathbf{P}^T\mathbf{x}_j\|^2 W_{ij} &= \sum_{i} \mathbf{P}^T\mathbf{x}_i D_{ii} \mathbf{x}_i^T \mathbf{P} - \sum_{ij} \mathbf{P}^T\mathbf{x}_i W_{ij} \mathbf{x}_j^T \mathbf{P} \\ &= \mathbf{P}^T\mathbf{X}(\mathbf{D} - \mathbf{W})\mathbf{X}^T\mathbf{P} = \mathbf{P}^T\mathbf{X}\mathbf{L}\mathbf{X}^T\mathbf{P} \end{aligned} \tag{4}$$
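
The identity in Eq. (4) is easy to check numerically. The snippet below (illustrative only; it uses a single projection vector `p` so that both sides are scalars) verifies it on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 5
X = rng.normal(size=(n, N))                  # columns are the samples x_i
W = rng.random((N, N)); W = (W + W.T) / 2    # symmetric weight matrix
D = np.diag(W.sum(axis=1))                   # D_ii = sum_j W_ij
L = D - W                                    # Laplacian matrix
p = rng.normal(size=n)                       # one projection direction

lhs = 0.5 * sum(W[i, j] * (p @ X[:, i] - p @ X[:, j]) ** 2
                for i in range(N) for j in range(N))
rhs = p @ X @ L @ X.T @ p
assert np.isclose(lhs, rhs)
```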

where each column $\mathbf{P}_i$ of the LPP transformation matrix $\mathbf{P}$ cannot be a zero vector, and a constraint is imposed as follows:

$$\mathbf{Y}^T \mathbf{D} \mathbf{Y} = \mathbf{I} \Rightarrow \mathbf{P}^T \mathbf{X} \mathbf{D} \mathbf{X}^T \mathbf{P} = \mathbf{I} \tag{5}$$

where $\mathbf{I}$ in the constraint term $\mathbf{P}^T\mathbf{X}\mathbf{D}\mathbf{X}^T\mathbf{P} = \mathbf{I}$ or $\mathbf{Y}^T\mathbf{D}\mathbf{Y} = \mathbf{I}$ is the identity matrix. $\mathbf{D}$ is a diagonal matrix whose entries are the column (or row, since $\mathbf{W}$ is symmetric) sums of $\mathbf{W}$, $D_{ii} = \sum_j W_{ij}$; $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the Laplacian matrix [5]. The matrix $\mathbf{D}$ provides a natural measure on the data samples: the larger $D_{ii}$ (corresponding to $\mathbf{y}_i$) is, the more important $\mathbf{y}_i$ is. The constraint on the sample $\mathbf{y}_i$ in $\mathbf{Y}^T\mathbf{D}\mathbf{Y} = \mathbf{I}$ is $D_{ii}\,\mathbf{y}_i^T\mathbf{y}_i = 1$, which means that the more important the sample $\mathbf{y}_i$ is (the larger $D_{ii}$), the smaller the value of $\mathbf{y}_i^T\mathbf{y}_i$. Therefore, the constraint $\mathbf{Y}^T\mathbf{D}\mathbf{Y} = \mathbf{I}$ tends to place the important points (those around which the data are densely distributed) near the origin of the projected subspace.


The dense region near the origin of the projected subspace then includes most of the samples, which makes the objective function in Eq. (2) as small as possible and, at the same time, avoids the trivial solution $\|\mathbf{P}_i\|^2 = 0$ for the transformation matrix $\mathbf{P}$.

The linear transformation $\mathbf{P}$ can then be obtained by minimizing the objective function under the constraint $\mathbf{P}^T\mathbf{X}\mathbf{D}\mathbf{X}^T\mathbf{P} = \mathbf{I}$:

$$\operatorname*{arg\,min}_{\mathbf{P}^T\mathbf{X}\mathbf{D}\mathbf{X}^T\mathbf{P} = \mathbf{I}} \ \mathbf{P}^T\mathbf{X}(\mathbf{D} - \mathbf{W})\mathbf{X}^T\mathbf{P} \tag{6}$$

Finally, the minimization problem can be converted into the following generalized eigenvalue problem:

$$\mathbf{X} \mathbf{L} \mathbf{X}^T \mathbf{P} = \lambda \mathbf{X} \mathbf{D} \mathbf{X}^T \mathbf{P} \tag{7}$$
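
The whole LPP procedure can be summarized in a short sketch (our own illustration, not the authors' implementation; the small ridge `reg` added to the right-hand matrix for numerical stability is our assumption and is not part of the original formulation): build the k-NN weights of Eq. (3), form $\mathbf{D}$ and $\mathbf{L}$, and keep the eigenvectors of Eq. (7) with the smallest eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, k=5, dim=2, reg=1e-6):
    """Locality Preserving Projection.

    X: (n, N) data matrix with samples as columns.
    Returns P of shape (n, dim) minimizing Eq. (2) under P^T X D X^T P = I.
    """
    n, N = X.shape
    # Eq. (3): symmetric k-nearest-neighbor adjacency
    dist = cdist(X.T, X.T)
    W = np.zeros((N, N))
    for j in range(N):
        nbrs = np.argsort(dist[:, j])[1:k + 1]   # k nearest neighbors of x_j
        W[nbrs, j] = 1.0
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + reg * np.eye(n)            # ridge term: our addition
    # Eq. (7): generalized eigenproblem; the smallest eigenvalues give P
    eigvals, eigvecs = eigh(A, B)
    return eigvecs[:, :dim]
```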

For face recognition applications, He et al. [8] extended the LPP method to 2D analysis, named Tensor Subspace Analysis (TSA). TSA can deal directly with 2D gray images and achieves better recognition results than conventional 1D subspace learning methods such as PCA, LDA, and LPP. However, for object recognition, color information also plays an important role in distinguishing different objects. Therefore, in this paper we extend LPP to N-D tensor analysis, which can deal directly not only with 3D data but with general N-D data structures. At the same time, in order to obtain a stable transformation tensor basis, we add a regularization term to the proposed MSNPE objective function for object recognition, which is introduced in detail in Sec. 4.

#### **3. Local descriptor tensor for image representation**

In computer vision, local descriptors (i.e., features computed over limited spatial support) have proven to be well adapted to matching and recognition tasks, as they are robust to partial visibility and clutter. The currently most popular local descriptor is the SIFT feature proposed by Lowe (2004). With the local SIFT descriptor, there are usually two types of algorithms for object recognition. One is to match local points between two images using their SIFT features; the other is to use the popular BOF model, which forms a frequency histogram of predefined visual words over all sampled region features, as in Belhumeur et al. (1997). For a matching algorithm, it is usually not enough to recognize an unknown image even if several points are well matched. The BOF model can usually achieve good recognition performance in most applications such as scene and object recognition. However, in the BOF model, in order to achieve an acceptable recognition rate, it is necessary to sample many points for extracting SIFT features (usually more than 1000 per image) and to compare each extracted local SIFT feature with the predefined visual words (usually more than 1000) to obtain the visual-word occurrence histogram. Therefore, the BOF model requires a lot of computing time to extract the visual-word occurrence histogram. In addition, the BOF model only approximately represents each local region feature by a predefined visual word; it may therefore lose a lot of information and be inefficient for image representation. Therefore, in this paper we propose to represent a color or gray image as a combined local descriptor tensor, which can use different features (such as SIFT or other descriptors) for local region representation.

In order to extract the local descriptor tensor for image representation, we first grid-segment an image into $K$ regions with some overlap, and in each region we extract a descriptor (which can be considered a tensor) for local region representation. For a gray image, an $M$-dimensional feature vector, which can be considered a 1D tensor, is extracted from the local gray region.


For a color image, an $M$-dimensional feature vector can be extracted from each color channel, such as the R, G, and B channels. With the feature vectors of the three color channels, a combined 2D $M \times 3$ tensor can represent the local color region. Furthermore, we combine the $K$ 1D or 2D local tensors ($M$-dimensional vectors or $M \times 3$ 2D tensors) into a 2D or 3D tensor of size $M \times K \times L$ ($L$: 1 or 3). The tensor feature extraction procedure for a color image is shown in Fig. 1(a). For feature representation of local regions such as the red, orange, and green rectangles in Fig. 1(a), the popular SIFT descriptor proposed by Lowe (2004) has proven powerful for object recognition and is somewhat invariant to small illumination changes. However, in some benchmark databases such as the YALE and CMU PIE face data sets, the illumination variation is very large. Therefore, in order to extract robust features invariant to large illumination changes, we explore a normalized (intensity-normalized) gradient of the image and use a Histogram of Orientation weighted with the Normalized Gradient (NHOG) for local region representation. For benchmark databases without large illumination variation, such as the COIL-100 data set, or where the illumination information is itself useful for recognition, such as scene data sets, we use the popular SIFT descriptor for local region representation. For benchmark databases with large illumination variation, which is harmful for subject recognition, such as the YALE and CMU PIE facial data sets, we use the NHOG descriptor for local region representation.
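
The assembly of the combined descriptor tensor can be sketched as follows (illustrative code only; `extract_descriptor` is a placeholder for whichever local descriptor, SIFT or NHOG, is used and is assumed to return an $M$-dimensional vector for a gray patch; the region size and stride are arbitrary examples).

```python
import numpy as np

def local_descriptor_tensor(image, region=16, step=8, extract_descriptor=None):
    """Grid-segment `image` into K overlapping patches (stride `step` < `region`)
    and stack the M-dimensional descriptor of each patch and color channel
    into an M x K x L tensor (L = 1 for a gray image, 3 for a color image)."""
    channels = image[..., None] if image.ndim == 2 else image
    H, W, L = channels.shape
    slices = []                                        # one M x L slice per region
    for y in range(0, H - region + 1, step):
        for x in range(0, W - region + 1, step):
            patch = channels[y:y + region, x:x + region, :]
            feats = [extract_descriptor(patch[:, :, c]) for c in range(L)]
            slices.append(np.stack(feats, axis=1))     # M x L
    return np.stack(slices, axis=1)                    # M x K x L
```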

(1) SIFT: The SIFT descriptor computes a gradient orientation histogram over the support region. For each of 8 orientation planes, the gradient image is sampled over a 4-by-4 grid of locations, resulting in a 128-dimensional feature vector for each region. A Gaussian window function is used to weight the magnitude of each sample point. This makes the descriptor less sensitive to small changes in the position of the support region and puts more emphasis on gradients near the center of the region. To obtain robustness to illumination changes, the descriptors are made invariant to illumination transformations of the form $aI(x) + b$ by scaling the norm of each descriptor to unity [8]. For representing the local region of a color image, we extract the SIFT feature from each color component (R, G, and B) and thus obtain a $128 \times 3$ 2D tensor for each local region.
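
If OpenCV is available, the per-region $128 \times 3$ tensor could be obtained along the following lines (a sketch under the assumption that OpenCV's SIFT implementation is an acceptable stand-in for the descriptor used here; placing a single keypoint at the region center with a scale covering the region is our choice, not the chapter's).

```python
import cv2
import numpy as np

def sift_color_tensor(region_bgr):
    """128 x 3 SIFT tensor for one local color region (uint8 image):
    one 128-d SIFT descriptor per color channel, computed at the region center."""
    sift = cv2.SIFT_create()
    h, w = region_bgr.shape[:2]
    kp = [cv2.KeyPoint(w / 2.0, h / 2.0, float(min(h, w)))]  # one keypoint covering the region
    descriptors = []
    for c in range(3):                        # B, G, R channels
        _, d = sift.compute(region_bgr[:, :, c], kp)
        descriptors.append(d[0])              # 128-d vector
    return np.stack(descriptors, axis=1)      # shape (128, 3)
```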

(2) Histogram of Orientation weighted with the Normalized Gradient (NHOG): Given an image $\mathbf{I}$, we calculate the improved (intensity-normalized) gradient using the following equations:

$$\begin{aligned} \mathbf{I}\_x(i,j) &= \frac{\mathbf{I}(i+1,j) - \mathbf{I}(i-1,j)}{\mathbf{I}(i+1,j) + \mathbf{I}(i-1,j)} \\ \mathbf{I}\_y(i,j) &= \frac{\mathbf{I}(i,j+1) - \mathbf{I}(i,j-1)}{\mathbf{I}(i,j+1) + \mathbf{I}(i,j-1)} \\ \mathbf{I}\_{xy}(i,j) &= \sqrt{\mathbf{I}\_x(i,j)^2 + \mathbf{I}\_y(i,j)^2} \end{aligned} \tag{8}$$

where $\mathbf{I}_x(i,j)$ and $\mathbf{I}_y(i,j)$ denote the horizontal and vertical gradients at pixel position $(i,j)$, respectively, and $\mathbf{I}_{xy}(i,j)$ denotes the global gradient (magnitude) at $(i,j)$. The idea of the normalized gradient comes from the $\chi^2$ distance, a normalized Euclidean distance. For the x-direction, the gradient is normalized by the sum of the two pixels above and below the focused pixel; for the y-direction, it is normalized by the sum of the pixels to its right and left. With the intensity-normalized gradient, we can extract features that are robust and invariant to illumination changes within a local region of an image. Some examples of the intensity-normalized and conventional gradients are shown in Fig. 2.
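
Eq. (8) translates directly into NumPy (our sketch; the small `eps` added to the denominators to avoid division by zero is our assumption and is not discussed in the chapter):

```python
import numpy as np

def normalized_gradient(I, eps=1e-6):
    """Intensity-normalized gradients of Eq. (8) for a 2-D float image I.
    Returns (Ix, Iy, Ixy) with the same shape as I (borders left at zero)."""
    I = I.astype(float)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[1:-1, :] = (I[2:, :] - I[:-2, :]) / (I[2:, :] + I[:-2, :] + eps)
    Iy[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / (I[:, 2:] + I[:, :-2] + eps)
    Ixy = np.sqrt(Ix ** 2 + Iy ** 2)
    return Ix, Iy, Ixy
```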

Fig. 1. (a) Extraction of the local descriptor tensor for color image representation; (b) NHOG feature extraction from a gray region.

(a) Samples of the YALE facial database; (b) samples of the PIE facial database.

Fig. 2. Gradient image samples. Top row: original face images; middle row: intensity-normalized gradient images; bottom row: conventional gradient images.

For feature extraction of a local region $I^R$ in the normalized-gradient image shown in Fig. 1(b), we first segment the region into 4 (2×2) patches, and in each patch we extract a 20-bin histogram of orientation weighted by the global gradient $\mathbf{I}^R_{xy}$, computed from the intensity-normalized gradients $\mathbf{I}^R_x$ and $\mathbf{I}^R_y$. Each region of a gray image is therefore represented by an 80-bin (20×4) histogram, as shown in Fig. 1(b).
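
A possible NHOG implementation for a single gray region is sketched below (again our own illustration; it reuses `normalized_gradient` from the earlier snippet and assumes 20 orientation bins over $[0, 2\pi)$ and the 2×2 patch split described above).

```python
import numpy as np

def nhog_region(region, bins=20):
    """80-bin NHOG descriptor of one gray region: a 20-bin orientation histogram,
    weighted by the normalized gradient magnitude I_xy, is computed in each of
    the 4 (2x2) patches and the histograms are concatenated."""
    Ix, Iy, Ixy = normalized_gradient(region)     # from the previous sketch
    theta = np.arctan2(Iy, Ix) % (2 * np.pi)      # gradient orientation in [0, 2*pi)
    h, w = region.shape
    hists = []
    for pi in range(2):
        for pj in range(2):
            rows = slice(pi * h // 2, (pi + 1) * h // 2)
            cols = slice(pj * w // 2, (pj + 1) * w // 2)
            hist, _ = np.histogram(theta[rows, cols], bins=bins,
                                   range=(0, 2 * np.pi),
                                   weights=Ixy[rows, cols])
            hists.append(hist)
    return np.concatenate(hists)                  # length 80
```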

#### **4. Multilinear supervised neighborhood preserving embedding**

In order to model $N$-dimensional data without rasterization, tensor representation has been proposed and analyzed for feature extraction or modeling. In this section, we propose a multilinear supervised neighborhood preserving embedding by Han et al. (2011).
