**4. Multilinear supervised neighborhood preserving embedding**

In order to model *N*-dimensional data without rasterization, tensor representations have been proposed and analyzed for feature extraction and modeling. In this section, we propose a multilinear supervised neighborhood preserving embedding (MSNPE) by Han et al. (2011), which not only extracts discriminant features but also preserves the local geometrical and topological properties within the same category for recognition. The proposed approach decomposes each mode of the tensor with an objective function that considers both the neighborhood relations and the class labels of the training samples.


Suppose we have *N*-dimensional tensor objects X from *C* classes. The *c*-th class has *n*<sub>c</sub> tensor objects, and the total number of tensor objects is *n*. Let X<sub>i<sub>c</sub></sub> ∈ *R*<sup>*N*<sub>1</sub>×*N*<sub>2</sub>×···×*N*<sub>L</sub></sup> (*i*<sub>c</sub> = 1, 2, ··· , *n*<sub>c</sub>) be the *i*-th object in the *c*-th class. For a color object image tensor, *L* is 3, *N*<sub>1</sub> is the number of rows, *N*<sub>2</sub> is the number of columns, and *N*<sub>3</sub> is the number of color components (*N*<sub>3</sub> = 3). We can build a nearest neighbor graph G to model the local geometrical structure and label information of X. Let **W** be the weight matrix of G. A possible definition of **W** is as follows:

$$W\_{ij} = \begin{cases} \exp\left(-\frac{\|\mathcal{X}\_i - \mathcal{X}\_j\|^2}{t}\right) & \text{if samples } i \text{ and } j \text{ are in the same class} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

where ‖X<sub>i</sub> − X<sub>j</sub>‖<sup>2</sup> denotes the squared Euclidean distance between two tensors, i.e., the sum of the squared differences of all corresponding elements of X<sub>i</sub> and X<sub>j</sub>, and ‖·‖ denotes the *l*<sub>2</sub> norm throughout this paper.
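As a concrete illustration, the following is a minimal NumPy sketch of Eq. (9). The function name `build_weights` and the list-of-arrays storage are our own choices, not part of the original method; a k-nearest-neighbor restriction within each class could be added on top:

```python
import numpy as np

def build_weights(tensors, labels, t=1.0):
    """Supervised heat-kernel weights W (Eq. 9) and diagonal degree matrix D.

    tensors: list of equal-shaped ndarrays (the tensor objects X_i)
    labels:  class label of each tensor object
    t:       heat-kernel width
    """
    n = len(tensors)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # same-class pairs only (the diagonal is excluded here)
            if i != j and labels[i] == labels[j]:
                # squared Euclidean distance over all corresponding elements
                d2 = np.sum((tensors[i] - tensors[j]) ** 2)
                W[i, j] = np.exp(-d2 / t)
    D = np.diag(W.sum(axis=1))  # D_ii = sum_j W_ij
    return W, D
```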

Let **U**<sub>d</sub> be the *d*-mode transformation matrix (dimension: *N*<sub>d</sub> × *N*′<sub>d</sub>). A reasonable transformation respecting the graph structure can be obtained by solving the following objective function:

$$\min\_{\mathbf{U}\_1,\mathbf{U}\_2,\cdots,\mathbf{U}\_L} \frac{1}{2}\sum\_{ij} \left\|\mathcal{X}\_i\times\_1\mathbf{U}\_1\times\_2\mathbf{U}\_2\cdots\times\_L\mathbf{U}\_L - \mathcal{X}\_j\times\_1\mathbf{U}\_1\times\_2\mathbf{U}\_2\cdots\times\_L\mathbf{U}\_L\right\|^2 W\_{ij} \tag{10}$$


In Eq. (10), X<sub>i</sub> is the tensor representation of the *i*-th sample; X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> denotes the mode-1 product of the tensor X<sub>i</sub> with the matrix **U**<sub>1</sub>, X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> ×<sub>2</sub> **U**<sub>2</sub> denotes the mode-2 product of the tensor X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> with the matrix **U**<sub>2</sub>, and so on. The objective function incurs a heavy penalty if neighboring points of the same class, X<sub>i</sub> and X<sub>j</sub>, are mapped far apart. Minimizing it is therefore an attempt to ensure that if X<sub>i</sub> and X<sub>j</sub> are "close", then X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> ×<sub>2</sub> **U**<sub>2</sub> ··· ×<sub>L</sub> **U**<sub>L</sub> and X<sub>j</sub> ×<sub>1</sub> **U**<sub>1</sub> ×<sub>2</sub> **U**<sub>2</sub> ··· ×<sub>L</sub> **U**<sub>L</sub> are "close" as well. Let Y<sub>i</sub> = X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> ×<sub>2</sub> **U**<sub>2</sub> ··· ×<sub>L</sub> **U**<sub>L</sub>, of dimension *N*′<sub>1</sub> × *N*′<sub>2</sub> × ··· × *N*′<sub>L</sub>, and write **P**<sub>i</sub><sup>d</sup> = (X<sub>i</sub> ×<sub>1</sub> **U**<sub>1</sub> ··· ×<sub>d−1</sub> **U**<sub>d−1</sub> ×<sub>d+1</sub> **U**<sub>d+1</sub> ··· ×<sub>L</sub> **U**<sub>L</sub>)<sup>d</sup> for the *d*-mode unfolding (a 2D matrix) of the partial projection that skips mode *d*; its dimension is *N*<sub>d</sub> × (*N*′<sub>1</sub> × *N*′<sub>2</sub> × ··· × *N*′<sub>d−1</sub> × *N*′<sub>d+1</sub> × ··· × *N*′<sub>L</sub>). The *d*-mode unfolding of Y<sub>i</sub> then satisfies (**Y**<sub>i</sub>)<sup>d</sup> = **U**<sub>d</sub><sup>T</sup>**P**<sub>i</sub><sup>d</sup>. Let **D** be a diagonal matrix with *D*<sub>ii</sub> = ∑<sub>j</sub> *W*<sub>ij</sub>. Since ‖**A**‖<sup>2</sup> = *tr*(**AA**<sup>T</sup>), we see that

$$\begin{aligned} \frac{1}{2}\sum\_{ij} &\left\|\mathcal{X}\_i\times\_1\mathbf{U}\_1\cdots\times\_L\mathbf{U}\_L-\mathcal{X}\_j\times\_1\mathbf{U}\_1\cdots\times\_L\mathbf{U}\_L\right\|^2 W\_{ij}\\ &=\frac{1}{2}\sum\_{ij} tr\left(\left((\mathbf{Y}\_i)^d-(\mathbf{Y}\_j)^d\right)\left((\mathbf{Y}\_i)^d-(\mathbf{Y}\_j)^d\right)^T\right)W\_{ij}\\ &=tr\left(\sum\_i D\_{ii}(\mathbf{Y}\_i)^d\left((\mathbf{Y}\_i)^d\right)^T-\sum\_{ij}W\_{ij}(\mathbf{Y}\_i)^d\left((\mathbf{Y}\_j)^d\right)^T\right)\\ &=tr\left(\mathbf{U}\_d^T\left(\sum\_i D\_{ii}\,\mathbf{P}\_i^d\left(\mathbf{P}\_i^d\right)^T-\sum\_{ij}W\_{ij}\,\mathbf{P}\_i^d\left(\mathbf{P}\_j^d\right)^T\right)\mathbf{U}\_d\right)\\ &=tr\left(\mathbf{U}\_d^T(\mathbf{D}\_d-\mathbf{S}\_d)\mathbf{U}\_d\right) \end{aligned} \tag{11}$$

where

$$\mathbf{D}\_d=\sum\_i D\_{ii}\,\mathbf{P}\_i^d\left(\mathbf{P}\_i^d\right)^T \qquad \text{and} \qquad \mathbf{S}\_d=\sum\_{ij} W\_{ij}\,\mathbf{P}\_i^d\left(\mathbf{P}\_j^d\right)^T.$$

In the optimization procedure for each mode, we also impose a normalization constraint on the transformation matrix (**U**<sub>d</sub> for mode *d*), where **Y**<sup>d</sup> collects the *d*-mode unfoldings **P**<sub>i</sub><sup>d</sup> of all samples:

$$\mathbf{U}\_d^T \mathbf{Y}^d \mathbf{D} (\mathbf{Y}^d)^T \mathbf{U}\_d = \mathbf{I} \Rightarrow \mathbf{U}\_d^T \mathbf{D}\_d \mathbf{U}\_d = \mathbf{I} \tag{12}$$

For the optimization problem over all modes, we adopt an alternating least squares (ALS) approach. In ALS, we obtain the optimal basis vectors for one mode while fixing the basis vectors of the other modes, and cycle through the modes. The *d*-mode transformation matrix **U**<sub>d</sub> can be obtained by minimizing the following cost function:

$$\underset{\mathbf{U}\_d^T\mathbf{D}\_d\mathbf{U}\_d=\mathbf{I}}{\operatorname{argmin}}\ tr\left(\mathbf{U}\_d^T(\mathbf{D}\_d-\mathbf{S}\_d)\mathbf{U}\_d\right) \tag{13}$$
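To make the *d*-mode operations concrete, here is a small NumPy sketch of the building blocks used above: the *d*-mode unfolding, the partial projection that skips mode *d* (our **P**<sub>i</sub><sup>d</sup>), and the accumulation of **D**<sub>d</sub> and **S**<sub>d</sub> from Eq. (11). The function names are ours, and the row-major unfolding order is one of several equivalent conventions:

```python
import numpy as np

def unfold(T, d):
    """d-mode unfolding: move axis d to the front and flatten the rest."""
    return np.moveaxis(T, d, 0).reshape(T.shape[d], -1)

def partial_project(T, Us, skip):
    """Apply the mode-k products with U_k for every mode except `skip`."""
    for k, U in enumerate(Us):
        if k == skip:
            continue
        # contract axis k of T with the rows of U_k (N_k x N'_k), keep axis order
        T = np.moveaxis(np.tensordot(T, U, axes=(k, 0)), -1, k)
    return T

def build_Dd_Sd(tensors, Us, d, W, Ddiag):
    """Accumulate D_d and S_d of Eq. (11) from the unfoldings P_i^d."""
    P = [unfold(partial_project(X, Us, skip=d), d) for X in tensors]
    Nd = P[0].shape[0]
    Dd = np.zeros((Nd, Nd))
    Sd = np.zeros((Nd, Nd))
    for i, Pi in enumerate(P):
        Dd += Ddiag[i] * Pi @ Pi.T
        for j, Pj in enumerate(P):
            if W[i, j] != 0.0:
                Sd += W[i, j] * Pi @ Pj.T
    return Dd, Sd
```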


In order to achieve a stable solution, we first regularize the symmetric matrix **D**<sub>d</sub> as **D**<sub>d</sub> = **D**<sub>d</sub> + *α***I** (*α* is a small value and **I** is an identity matrix of the same size as **D**<sub>d</sub>). Then the minimization problem for obtaining the *d*-mode matrix can be converted into the following generalized eigenvalue problem:

$$(\mathbf{D}\_d - \mathbf{S}\_d)\mathbf{U}\_d = \lambda \mathbf{D}\_d \mathbf{U}\_d \tag{14}$$

We can select the generalized eigenvectors corresponding to the first *N*′<sub>d</sub> smallest eigenvalues in Eq. (14), which minimize the objective function in Eq. (13). However, the eigenvectors with the smallest eigenvalues are usually unstable. Therefore, we convert Eq. (14) into:

$$\mathbf{S}\_d \mathbf{U}\_d = (1 - \lambda) \mathbf{D}\_d \mathbf{U}\_d \Rightarrow \mathbf{S}\_d \mathbf{U}\_d = \beta \mathbf{D}\_d \mathbf{U}\_d \tag{15}$$

The generalized eigenvectors corresponding to the first *N*′<sub>d</sub> smallest eigenvalues *λ* in Eq. (14) are those corresponding to the first *N*′<sub>d</sub> largest eigenvalues *β* = 1 − *λ* in Eq. (15). Therefore, the generalized eigenvectors with the first *N*′<sub>d</sub> largest eigenvalues can be selected to minimize the objective function in Eq. (13). The detailed MSNPE algorithm is listed in Table 1. In the MSNPE algorithm, we need to decide the retained number of generalized eigenvectors (the mode dimension) for each mode. Usually, the dimension numbers in most discriminant tensor analysis methods are decided empirically or according to the application. In our experiments, we retain different dimension numbers for different modes and perform recognition of objects or scene categories. The recognition accuracy with varied dimensions in different modes is also given in the experimental part. The dimension numbers are decided empirically in the results compared with the state-of-the-art algorithms.

**Algorithm 1: ND tensor supervised neighborhood preserving embedding**

**Input:** Tensor objects X<sub>i</sub><sup>c</sup> from *C* classes, where X<sub>i</sub><sup>c</sup> denotes the *i*-th tensor object in the *c*-th class.

**Graph-based weights:** Build the nearest neighbor graph within each class, and calculate the graph weight **W** according to Eq. (9) and **D** from **W**.

**Initialize:** Randomly initialize **U**<sub>d</sub> for *d* = 1, 2, ··· , *L*.

**for** t = 1:*T* (iteration steps) or until convergence **do**

- **for** d = 1:*L* **do**
  - Calculate **D**<sub>d</sub> and **S**<sub>d</sub>, keeping **U**<sub>i</sub> (*i* = 1, 2, ··· , *d* − 1, *d* + 1, ··· , *L*) fixed.
  - Solve the minimization problem min<sub>**U**<sub>d</sub></sub> *tr*(**U**<sub>d</sub><sup>T</sup>(**D**<sub>d</sub> − **S**<sub>d</sub>)**U**<sub>d</sub>) with eigenspace analysis.
- **end for**

**end for**

**Output:** the MSNPE tensors T<sub>j</sub> = **U**<sub>1</sub> × **U**<sub>2</sub> × ··· × **U**<sub>L</sub>, *j* = 1, 2, ··· , (*N*′<sub>1</sub> × *N*′<sub>2</sub> × ··· × *N*′<sub>L</sub>).

Table 1. The flowchart of multilinear supervised neighborhood preserving embedding (MSNPE).
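A minimal sketch of the ALS loop of Table 1 follows, assuming the `build_Dd_Sd` helper from the earlier sketch. We use `scipy.linalg.eigh` for the symmetric generalized eigenproblem of Eq. (15), keep the eigenvectors with the largest *β*, and apply the regularization **D**<sub>d</sub> + *α***I** described above; the hyperparameters (`n_iter`, `alpha`) are illustrative:

```python
import numpy as np
from scipy.linalg import eigh  # symmetric-definite generalized eigensolver

def msnpe(tensors, W, dims_out, n_iter=10, alpha=1e-3, seed=0):
    """ALS loop of Table 1: cycle over modes, updating U_d from Eq. (15).

    tensors:  list of equal-shaped ndarrays
    W:        supervised weight matrix (Eq. 9)
    dims_out: target mode dimensions (N'_1, ..., N'_L)
    """
    rng = np.random.default_rng(seed)
    shape = tensors[0].shape
    Ddiag = W.sum(axis=1)  # D_ii = sum_j W_ij
    Us = [rng.standard_normal((n, m)) for n, m in zip(shape, dims_out)]
    for _ in range(n_iter):
        for d in range(len(shape)):
            Dd, Sd = build_Dd_Sd(tensors, Us, d, W, Ddiag)
            Dd = Dd + alpha * np.eye(Dd.shape[0])  # regularize: D_d + alpha*I
            # Eq. (15): S_d u = beta D_d u; eigh returns ascending eigenvalues
            # and eigenvectors normalized so that U_d^T D_d U_d = I (Eq. 12)
            betas, V = eigh((Sd + Sd.T) / 2, Dd)
            Us[d] = V[:, ::-1][:, :dims_out[d]]  # top-N'_d eigenvectors
    return Us
```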

After obtaining the MSNPE basis of each mode, we can project each tensor object onto these MSNPE tensors. For classification, the projection coefficients serve as the extracted feature vector and can be input into any classification algorithm. In our work, besides a simple Euclidean-distance KNN classifier (*k* = 1), we also use a Random Forest (RF) for recognition.
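As an illustration, projecting a tensor and classifying by nearest neighbor might look as follows; this is a sketch continuing the previous ones, where `Us` are the learned MSNPE bases:

```python
import numpy as np

def project(X, Us):
    """Full multilinear projection X x_1 U_1 ... x_L U_L, flattened to a vector."""
    for k, U in enumerate(Us):
        X = np.moveaxis(np.tensordot(X, U, axes=(k, 0)), -1, k)
    return X.ravel()

def knn1(train_feats, train_labels, query_feat):
    """KNN (k = 1) with Euclidean distance on the projection coefficients."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return train_labels[int(np.argmin(dists))]
```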

#### **5. Experiments**

#### **5.1 Database**

We evaluated our proposed framework on two different types of datasets.

(i) View-based object datasets, which include two datasets: The first is the Columbia COIL-100 image library by Nene et al. (1996). It consists of color images of 72 different views of 100 objects. The images were obtained by placing the objects on a turntable and taking a view every 5°. The objects have a wide variety of complex geometric and reflectance characteristics. Fig. 3(a) shows some sample images from COIL-100. The second is the ETH Zurich CogVis ETH-80 dataset by Leibe & Schiele (2003a). This dataset was set up by Leibe and Schiele to explore the capabilities of different features for object class recognition. It contains eight object categories: apple, pear, tomato, cow, dog, horse, cup and car. There are 10 different objects spanning large intra-class variance in each category. Each object has 41 images from viewpoints spaced equally over the upper viewing hemisphere. In total there are 3280 images: 41 images for each object and 10 objects for each category. Fig. 3(b) shows some sample images from ETH-80.

Fig. 3. Sample images from view-based object data sets: (a) COIL-100 dataset; (b) ETH-80 dataset.

(ii) Facial datasets: We use two facial datasets to evaluate the tensor representation with the proposed NHOG for image representation. One is the Yale database, which includes 15 people and 11 facial images of each individual, with different illuminations and expressions. Some sample facial images are shown in the top row of Fig. 2(a). The other is CMU PIE, which includes 68 people and about 170 facial images for each individual, with 13 different poses, 43 different illumination conditions, and 4 different expressions. Some sample facial images are shown in the top row of Fig. 2(b).

#### **5.2 Methodology**


The recognition task is to assign each test image to one of a number of categories or objects. The performance is measured using recognition rates.

For the view-based object databases, we adopt different experimental setups for the COIL-100 and ETH-80 datasets. For COIL-100, the objective is to discriminate between the 100 individual objects. In most previous experiments on object recognition using COIL-100, the number of views used as the training set for each object varied from 36 down to 4. When 36 views are used for training, the recognition rate using an SVM was reported to approach 100% by Pontil & Verri (1998). In practice, however, only very few views of an object are available. In our experiment, in order to compare with the results of Wang (2006), we follow that experimental setup, which uses only 4 views of each object for training and the remaining 68 views for testing. In total, this amounts to 400 images for training and 6800 images for testing. The error rate is the overall error rate over the 100 objects. The 4 training viewpoints are sampled evenly from the 72 viewpoints, which captures enough variance in viewpoint for tensor learning. ETH-80 aims to discriminate between the 8 object categories. Most previous experiments using the ETH-80 dataset adopted leave-one-object-out cross-validation: the training set consists of all views of 9 objects from each category, and the testing set consists of all views of the remaining object from each category. In this setting, objects in the testing set have not appeared in the training set, but objects belonging to the same category have. Classification of a test image is a process of labeling the image with one of the categories. Reported results are based on the average error rate over all 80 possible test objects by Leibe & Schiele (2003b). Similarly to the above, instead of taking all possible views of each object for the training set, we take only 5 views of each object as training data. By doing so we decrease the number of training images to 1/8 of that used by Leibe & Schiele (2003b) and Marrr et al. (2005). The testing set consists of all views of an object. The recognition rate of the proposed scheme is compared to those of different conventional approaches by Wang (2006) and to MSNPE analysis applied directly to the pixel-level intensity tensor.
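For instance, the even sampling of 4 training viewpoints out of the 72 turntable positions can be sketched as follows (function name ours):

```python
import numpy as np

def coil100_split(n_views=72, n_train=4):
    """Pick n_train viewpoints evenly spaced over the n_views turntable
    positions (COIL-100 protocol above); the rest form the test set."""
    train_idx = np.linspace(0, n_views, n_train, endpoint=False).astype(int)  # [0, 18, 36, 54]
    test_idx = np.setdiff1d(np.arange(n_views), train_idx)
    return train_idx, test_idx
```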

For the facial datasets, which contain large illumination variance in the images, we validate that the tensor representation with the proposed NHOG for image representation is much more efficient for face recognition than that with the popular SIFT descriptor, which is only somewhat robust to small illumination variance. In the Yale experiments, we randomly select 2, 3, 4 and 5 facial images from each individual for training and use the remainder for testing. For the CMU PIE dataset, we randomly select 5 and 10 facial images from each individual for training and use the remainder for testing. We perform 20 runs for each training number and average the recognition rates in all experiments. The recognition results of our proposed approach are compared to those of the state-of-the-art algorithms by Cai et al. (2007a) and Cai et al. (2007b).
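A sketch of the random split protocol just described (names are ours):

```python
import numpy as np

def random_splits(labels, n_train_per_id, n_runs=20, seed=0):
    """Yield (train_idx, test_idx): n_train_per_id random images per
    individual for training, the remainder for testing."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    for _ in range(n_runs):
        train = []
        for person in np.unique(labels):
            idx = np.flatnonzero(labels == person)
            train.extend(rng.choice(idx, size=n_train_per_id, replace=False))
        train = np.array(sorted(train))
        yield train, np.setdiff1d(np.arange(len(labels)), train)
```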

#### **6. Experimental results**

#### (1) View-based object data sets

We investigate the performance of the proposed MSNPE tensor learning compared with conventional tensor analysis, such as the tensor LDA of Wang (2006), which has also been used in view-based object recognition, and the efficiency of the proposed tensor representation compared to the pixel-level intensity tensor, which directly treats a whole image as a tensor, on the COIL-100 and ETH-80 datasets. In these experiments, all samples are color images, and the SIFT descriptor is used for local region representation. Therefore, the pixel-level intensity tensor is a 3rd-order tensor of dimension *R*1 × *C*1 × 3, where *R*1 and *C*1 are the row and column numbers of the image, and the local descriptor tensor is of dimension 128 × *K* × 3, where *K* is the number of segmented regions of an image (here *K* = 128). In order to compare with the state-of-the-art work of Wang (2006), the simple KNN method (*k* = 1 in our experiments) is also used for recognition. The experimental setup was given in Sec. 5, and we performed 18 runs so that all samples can serve as test samples. Figure 4(a) shows the compared results of MSNPE using the pixel-level tensor and the local descriptor tensor (denoted MSNPE-PL and MSNPE with the KNN classifier, respectively, and MSNPE-RF-PL and MSNPE-RF with the random forest) and the traditional methods of Wang (2006) on the COIL-100 dataset.
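The assembly of the 128 × *K* × 3 local descriptor tensor can be sketched as follows; `local_descriptor` stands in for the SIFT or NHOG descriptor function and `regions` for the *K* segmented regions, both of which are assumptions supplied elsewhere:

```python
import numpy as np

def build_descriptor_tensor(image, regions, local_descriptor):
    """Stack per-region, per-channel 128-D descriptors into a 128 x K x 3 tensor.

    image:            H x W x 3 color image
    regions:          K (row, col, size) patches covering the image (assumed format)
    local_descriptor: function mapping a single-channel patch to a 128-D vector
                      (SIFT or NHOG in the text; supplied by the caller)
    """
    K = len(regions)
    T = np.zeros((128, K, 3))
    for k, (r, c, s) in enumerate(regions):
        for ch in range(3):
            patch = image[r:r + s, c:c + s, ch]
            T[:, k, ch] = local_descriptor(patch)
    return T
```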




The best result with the same experimental setup (400 training samples and 6800 test samples) on COIL-100 is reported by Wang (2006), in which the average recognition rate using tensor LDA and an AdaBoost classifier (DTROD+AdaBoost) is 84.5%, and the recognition rate of tensor LDA with simple Euclidean distance (DTROD) by Wang (2006) (the same as the KNN method with *k* = 1) is 79.7%. The MSNPE approach with the pixel-intensity tensor, however, achieves about 85.28% with the same classifier (KNN), and a 90% average recognition rate with the random forest classifier. Furthermore, the MSNPE approach with the local SIFT tensor achieves a 93.68% average recognition rate. The recognition rates compared with the state-of-the-art approaches are shown in Fig. 4(a). Figure 4(b) shows the compared recognition rates of one run for different mode dimensions of MSNPE, using the pixel-level intensity and local SIFT tensors with the random forest classifier. It is obvious that the recognition rates using the pixel-level tensor have very large variance as the mode dimensions change. Therefore, we must select optimized row and column mode dimensions to achieve a better recognition rate. However, it is usually difficult to decide the optimal dimension numbers of the different modes automatically. If we shift the mode dimension number just a little from the optimum, the recognition rate can decrease significantly when using the pixel-level tensor, as shown in Fig. 4(b). For the local SIFT tensor representing an object image, the average recognition rates over a large range of mode dimensions (both row and column mode dimension numbers from 3 to 8; color mode dimension 3) are very stable.

Fig. 4. (a) The compared recognition rates on COIL-100 between the proposed framework and the state-of-the-art approaches of Wang (2006). (b) Average recognition rate with different mode dimensions using the random forest classifier.

For the ETH-80 dataset, we perform experiments similar to those on COIL-100, using the proposed MSNPE analysis with the pixel-level and local SIFT tensors, respectively. The results compared with the state-of-the-art approaches are shown in Table 2. From Table 2, it can be seen that our proposed approach greatly improves the overall recognition rate compared with the state-of-the-art methods (from 60-80% to about 86%).

| Methods | DTROD | DTROD+AdaB | RSW | LS | MSNPE-PL | RF-PL | MSNPE | RF |
|---|---|---|---|---|---|---|---|---|
| Rate (%) | 70.0 | 76.0 | 75.0 | 65.0 | 76.83 | 77.74 | 83.54 | 85.98 |

Table 2. The compared recognition rates on ETH-80. RSW denotes the random subwindow method of Marrr et al. (2005) and LS denotes the results of Leibe & Schiele (2003b), both with 2925 samples for training and 328 for testing; the others use 360 samples for training. MSNPE-PL and RF-PL denote MSNPE analysis on the pixel-level intensity tensor using simple Euclidean distance and a random forest classifier, respectively; MSNPE and RF denote the proposed MSNPE analysis on the local SIFT tensor using simple Euclidean distance and a random forest classifier, respectively.

(2) Facial datasets: With the two facial datasets, we investigate the efficiency of the proposed local NHOG feature on data with large illumination variance, compared with the local SIFT descriptor. We perform 20 runs for each training number and average the recognition rates. For comparison, we also run the proposed MSNPE analysis directly on the gray face image (pixel-level intensity, denoted MSNPE-PL), on the local feature tensor with the SIFT descriptor (denoted MSNPE-SIFT), and on our proposed intensity-normalized histogram of orientation (denoted MSNPE-NHOG). Table 3 gives the compared results using MSNPE analysis with the different tensors and a KNN classifier (*k* = 1), together with other subspace learning methods by Cai et al. (2007a), Cai et al. (2007b), Cai (2009) and Cai (n.d.), on the YALE dataset; the compared results on the CMU PIE dataset are shown in Table 4 for our proposed framework and the conventional methods by Cai et al. (2007a), Cai et al. (2007b), Cai (2009) and Cai (n.d.).

| Method | 2 Train | 3 Train | 4 Train | 5 Train |
|---|---|---|---|---|
| PCA | 56.5 | 51.1 | 57.8 | 45.6 |
| LDA | 54.3 | 35.5 | 27.3 | 22.5 |
| Laplacianface | 43.5 | 31.5 | 25.4 | 21.7 |
| O-Laplacianface | 44.3 | 29.9 | 22.7 | 17.9 |
| TensorLPP | 54.5 | 42.8 | 37 | 32.7 |
| R-LDA | 42.1 | 28.6 | 21.6 | 17.4 |
| S-LDA | 37.5 | 25.6 | 19.7 | 14.9 |
| MSNPE | 41.89 | 31.67 | 24.86 | 23.06 |
| MSNPE-SIFT | 35.22 | 26.33 | 22.19 | 20.83 |
| MSNPE-NHOG | **29.74** | **22.87** | **18.52** | **17.44** |

Table 3. Average recognition error rates (%) on the YALE dataset with different training numbers.

| Method | 5 Train | 10 Train |
|---|---|---|
| PCA | 75.33 | 65.5 |
| LDA | 42.8 | 29.7 |
| LPP | 38 | 29.6 |
| MSNPE | 37.66 | 23.57 |
| MSNPE-NHOG | **33.85** | **22.06** |

Table 4. Average recognition error rates (%) on the PIE dataset with different training numbers.

From Tables 3 and 4, it is obvious that our proposed algorithm achieves the best recognition performance in almost all cases, and the improvements in recognition rate become greater when the number of training samples is small, compared to the conventional subspace learning methods by Cai et al. (2007a), Cai et al. (2007b), Cai (2009) and Cai (n.d.). In addition, as shown in the previous section, our proposed strategy can be applied not only to recognition of faces with small variance (such as a mainly frontal face database), but also to recognition of generic objects with large variance. On a generic object dataset with large variance, the recognition rates are also greatly improved compared with using the pixel-level tensor.

#### **7. Conclusion**


In this paper, we proposed to represent an image as a local descriptor tensor, which is a combination of the descriptors of the local regions (*K* ∗ *K*-pixel patches) in the image and is more efficient than the popular Bag-of-Features (BOF) model for combining local descriptors. At the same time, we explored a local descriptor for region representation on databases with large illumination variance, which proves to be more efficient than the popular SIFT descriptor. Furthermore, we proposed to use Multilinear Supervised Neighborhood Preserving Embedding (MSNPE) for discriminant feature extraction from the local descriptor tensors of different images, which can preserve the local sample structure in feature space. We validated our proposed algorithm on different benchmark databases, such as view-based and facial datasets, and the experimental results show that the recognition rate with our method is greatly improved compared with conventional subspace analysis methods.

