126 Manifolds - Current Research Areas

4.3. The biased manifold embedding (BME)

Head pose estimation is affected by identity variation. Ideally, such negative effects are eliminated: face images with close pose angles should stay nearer, and images with quite different poses should stay farther apart, in the low-dimensional manifold, regardless of identity. Based on this idea, the BME is proposed to modify the distance matrix according to the pose angles, and it can be combined with almost all the classical algorithms [23].

The modified distance between a pair of data points $\mathbf{x}_i$ and $\mathbf{x}_j$ is given by:

$$\tilde{d}_{ij} = \begin{cases} \dfrac{\beta \cdot p(i,j)}{\max_{m,n}\{p(m,n)\} - p(i,j)} \cdot d_{ij}, & p(i,j) \neq 0 \\ 0, & p(i,j) = 0 \end{cases} \tag{16}$$

where p(i,j) = |p<sub>i</sub> - p<sub>j</sub>| is simply the absolute difference between the angles of two poses. From the modified distance matrix, one can see that the distance between images with close poses is biased to be proportionally small; images with the same pose are defined to be zero-distance.

In fact, the BME can be seen as a naïve version of supervised manifold learning: the head pose information is used as supervision to enhance the construction of the graph. For the head pose estimation stage, the generalized regression neural network (GRNN) [24] is applied to learn the nonlinear mapping for unseen data points, and linear multivariate regression is applied to estimate the head pose angle. This idea can easily be extended to the classical algorithms, e.g., Isomap, LLE, and LE, among which the biased LE achieves the lowest error rate on the data set of FacePix [25].

4.4. Head pose estimation as frontal view search

The two remarkable head poses, i.e., yaw and pitch, cause the problem of self-occlusion, and yaw makes the problem more serious than pitch does. An extended manifold learning (EML) method is proposed to specialize head pose estimation to variation in yaw only [15]. This work resorts to searching for the frontal view instead of directly estimating the head pose, which is more efficient and robust. The idea is based on the observation that the frontal face lies nearly at the vertex of the symmetrical shape of the embedding. However, if the pose distribution of the data is asymmetric, the location of the frontal face in the manifold shifts away from the vertex. Therefore, the first step of the EML method is data enhancement: all the images are horizontally flipped, and both the original and flipped images are used for manifold learning. To make the method more robust to environmental variations, for example, illumination, the localized edge orientation histogram (LEOH) is presented to represent the original color images with more representative features. The idea is inspired by the classical HoG feature [14]. The first step of LEOH is to apply a Canny edge detector to the original images. Then, the whole image is divided into M × N cells, and the gradient orientation histogram is computed within each cell.

4.5. Head pose estimation by supervised manifold learning

A taxonomy of methods, which structures the general framework of manifold learning into several stages, is proposed to incorporate the head pose angles into one or more of the stages, enabling supervised manifold learning [26]. A straightforward solution is to adapt the distance and weight matrices according to the angle difference between pairwise face images. The head pose estimation problem is then interpreted as a regression problem, whereas it was usually solved as a classification problem. As a result, the model can generalize to continuous head poses.

The general framework of manifold learning can be represented as follows: Stage 1, neighborhood searching; Stage 2, graph weighting; Stage 3, low-dimensional manifold computation; and Stage 4, projection from unseen data to the manifold and pose estimation.

In Stage 1, the distance matrix D = {d<sub>ij</sub>} can be adapted as follows:

$$\tilde{d}_{ij} = f(|\theta_i - \theta_j|) \cdot d_{ij} \tag{17}$$

where $\theta_i$ and $\theta_j$ are the angles of the two poses, keeping the same notation as in previous sections. $f$ is some positive increasing function of reciprocal form, for example, $f(u) = \alpha \cdot u/(\beta - u)$. The introduction of $f$ shrinks the distances between images with nearer poses and enlarges those between farther poses: the farther apart two poses are, the greater the penalty applied to their distance.
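As a concrete illustration of the Stage 1 adaptation in Eq. (17), the sketch below biases a pairwise distance matrix with the reciprocal-form $f(u) = \alpha \cdot u/(\beta - u)$. The function name, the NumPy implementation, and the toy angles are illustrative assumptions, not part of the original method.

```python
import numpy as np

def bias_distances(D, angles, alpha=1.0, beta=None):
    """Adapt a pairwise distance matrix with pose supervision (Eq. 17).

    Multiplies each distance d_ij by f(|theta_i - theta_j|), where
    f(u) = alpha * u / (beta - u) is an increasing reciprocal-form
    function: close poses are pulled together, far poses pushed apart.
    """
    angles = np.asarray(angles, dtype=float)
    U = np.abs(angles[:, None] - angles[None, :])  # |theta_i - theta_j|
    if beta is None:
        beta = U.max() + 1e-6                      # keep the denominator positive
    F = alpha * U / (beta - U)
    return F * D

# toy example: four faces with hypothetical yaw angles in degrees
angles = [0.0, 15.0, 30.0, 90.0]
D = np.ones((4, 4)) - np.eye(4)                    # unit appearance distances
D_tilde = bias_distances(D, angles)
```

Note that f(0) = 0, so two images with the same pose angle end up at zero biased distance, matching the behavior of the BME distance in Eq. (16).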

In Stage 2, the weight matrix W = {w<sub>ij</sub>} can be adapted by a similar incorporation of the supervision information.

$$\tilde{w}_{ij} = w_{ij} \cdot g(|\theta_i - \theta_j|) \tag{18}$$

where $g$ is some positive decreasing function; it plays a role analogous to the $f$ applied in Stage 1, but with the opposite monotonicity, since a larger weight should correspond to a smaller pose difference.

In Stage 3, let us take LLE as an instance. The original objective function of LLE, shown in Eq. (6), can be adapted as follows:

$$\min_{\mathbf{Y}} \sum_{i=1}^{M} \left\| \mathbf{y}_i - \sum_{j \in \mathcal{N}(i)} w_{ij} \mathbf{y}_j \right\|^2 + \lambda \frac{1}{2} \sum_{i,j} \|\mathbf{y}_i - \mathbf{y}_j\|^2 \Lambda_{ij} \tag{19}$$

where Λ = {Λ<sub>ij</sub>} measures the similarity between the angles of pairwise poses. A possible form of Λ is the heat kernel:

$$\Lambda_{ij} = \begin{cases} e^{-\frac{|\theta_i - \theta_j|^2}{2\sigma^2}}, & \text{if the } i^{\text{th}} \text{ and } j^{\text{th}} \text{ data points are neighbors} \\ 0, & \text{otherwise} \end{cases} \tag{20}$$

The adapted objective function preserves the local linearity of the original data while enhancing the similarity between neighbors that share similar poses; this is implemented by the second term of Eq. (19), which introduces the supervision information. Following the derivation from Eq. (6) to Eq. (7), Eq. (19) can be simplified as:

$$\min_{\mathbf{Y}} \ \mathbf{Y}\mathbf{M}\mathbf{Y}^T + \lambda\mathbf{Y}\tilde{\mathbf{L}}\mathbf{Y}^T = \min_{\mathbf{Y}} \mathbf{Y}(\mathbf{M} + \lambda\tilde{\mathbf{L}})\mathbf{Y}^T \tag{21}$$

where $\tilde{\mathbf{L}}$ is the Laplacian matrix of Λ. For the low-dimensional embedding, eigenvector decomposition of $\mathbf{M} + \lambda\tilde{\mathbf{L}}$ can be performed. By incorporating the supervision information, the method is much more capable of imposing a discriminative projection on the learned embedding.
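The simplified objective of Eq. (21) reduces to an eigendecomposition of $\mathbf{M} + \lambda\tilde{\mathbf{L}}$. Below is a minimal NumPy sketch, assuming the LLE reconstruction weights W and a symmetric neighborhood mask are already available; the function name and the toy inputs are illustrative assumptions.

```python
import numpy as np

def supervised_lle_embedding(W, angles, knn_mask, d=2, lam=0.1, sigma=15.0):
    """Embedding from Eq. (21): eigendecompose M + lam * L_tilde.

    W        : (n, n) LLE reconstruction weights (rows sum to 1)
    angles   : pose angles of the n training images
    knn_mask : symmetric boolean (n, n) neighborhood indicator
    """
    n = W.shape[0]
    I = np.eye(n)
    M_mat = (I - W).T @ (I - W)                 # standard LLE cost matrix

    theta = np.asarray(angles, dtype=float)
    Lam = np.exp(-np.abs(theta[:, None] - theta[None, :]) ** 2 / (2 * sigma ** 2))
    Lam = Lam * knn_mask                        # heat kernel on neighbors only (Eq. 20)
    L_tilde = np.diag(Lam.sum(axis=1)) - Lam    # graph Laplacian of Lambda

    vals, vecs = np.linalg.eigh(M_mat + lam * L_tilde)
    return vecs[:, 1:d + 1]                     # skip the constant eigenvector

# toy usage with hypothetical reconstruction weights
rng = np.random.default_rng(0)
W = rng.random((6, 6))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)
mask = ~np.eye(6, dtype=bool)
Y = supervised_lle_embedding(W, [0, 10, 20, 30, 40, 50], mask)
```

As in standard LLE, the embedding is spanned by the eigenvectors of smallest eigenvalue after discarding the constant one.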

In Stage 4, the GRNN algorithm is applied to produce the mapping from unseen data to the low-dimensional embedding. At testing time, support vector regression (SVR) with an RBF kernel and smoothing cubic splines are used to estimate the pose.
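The GRNN used for the out-of-sample mapping is, in Specht's formulation, a kernel-weighted average of the training targets. A minimal sketch, with an illustrative toy mapping from 1-D features to 2-D embedding coordinates (the data and function name are assumptions):

```python
import numpy as np

def grnn_predict(X_train, Y_train, x, sigma=1.0):
    """GRNN prediction: Gaussian-kernel-weighted average of training
    targets, mapping an unseen feature x into the learned embedding."""
    d2 = np.sum((np.asarray(X_train) - np.asarray(x)) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w @ np.asarray(Y_train) / w.sum()

# toy example: querying at a training point recovers (almost exactly) its target
X_train = [[0.0], [10.0]]
Y_train = [[0.0, 0.0], [1.0, 1.0]]
pred = grnn_predict(X_train, Y_train, [0.0], sigma=0.5)
```

With a small bandwidth, the prediction at a training input collapses onto that sample's target, which is the interpolating behavior that makes GRNN convenient for out-of-sample extension.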

A novel supervised manifold learning method for head pose estimation [27, 28] is proposed based on the framework of the former method. Similarly, the pose angles are incorporated into all three stages of the general manifold learning structure.

In Stage 1, an improved version of f is proposed as:

$$\tilde{d}_{ij} = f(|\theta_i - \theta_j|)^p \cdot d_{ij}, \quad p > 0 \tag{22}$$

where $f$ is defined in a rectified reciprocal form, $f(|\theta_i - \theta_j|) = \frac{\alpha\,|\theta_i - \theta_j|}{\max_{m,n}\{|\theta_m - \theta_n|\} - |\theta_i - \theta_j| + \varepsilon}$, in which α is a positive constant and ε is an arbitrarily small positive constant that prevents the denominator of $f$ from being zero. This adaption of the distance matrix further enhances the effect of the supervision information during the neighbor search procedure.
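A sketch of the Eq. (22) adaptation with the rectified reciprocal $f$; the function name, exponent, and toy angles are illustrative assumptions.

```python
import numpy as np

def rectified_biased_distances(D, angles, alpha=1.0, p=2.0, eps=1e-6):
    """Stage 1 adaptation of Eq. (22): d_ij * f(|theta_i - theta_j|)**p,
    with the rectified reciprocal f; eps keeps the denominator nonzero."""
    theta = np.asarray(angles, dtype=float)
    U = np.abs(theta[:, None] - theta[None, :])   # pairwise angle differences
    f = alpha * U / (U.max() - U + eps)           # rectified reciprocal form
    return (f ** p) * D

# toy example: four faces with hypothetical yaw angles in degrees
angles = [0.0, 15.0, 30.0, 90.0]
D = np.ones((4, 4)) - np.eye(4)                   # unit appearance distances
D_rect = rectified_biased_distances(D, angles)
```

Compared with Eq. (17), the exponent p > 0 sharpens the bias, so neighbor search is even more strongly dominated by pose similarity.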

In Stage 2, taking LLE (NPE [29]) as an example, the local distance matrix shown in Eq. (4) is modified as

$$\tilde{c}_{mn} = g_{mn} \cdot c_{mn} \tag{23}$$

where $g_{mn} = \frac{|\theta_i - \theta_m|\,|\theta_i - \theta_n|}{\left(\max_{m,n}\{|\theta_m - \theta_n|\} - |\theta_m - \theta_n| + \varepsilon\right)^2}$ and $\theta_i$ is the angle of the reference face image $\mathbf{x}_i$. This operation enhances the supervision during the computation of the local correlation matrix.

In Stage 3, a supervised neighborhood-based Fisher discriminant analysis (SNFDA) is proposed. The basic idea is to pull neighboring data points as close as possible and push non-neighboring data points as far apart as possible in the low-dimensional embedding, so the SNFDA can be seen as a postprocessing procedure in this stage. Based on the low-dimensional representation Y obtained from the original LLE or the modified LLE of Stages 1 and 2, the within- and between-neighborhood scatter matrices are defined as:

#### Head Pose Estimation via Manifold Learning http://dx.doi.org/10.5772/65903 129

$$\mathbf{S}\_{w} = \frac{K}{2} \sum\_{i,j=1}^{M} A\_{ij}^{w} (\mathbf{y}\_{i} - \mathbf{y}\_{j})(\mathbf{y}\_{i} - \mathbf{y}\_{j})^{T} \tag{24}$$

$$\mathbf{S}\_{B} = \frac{K}{2} \sum\_{i,j=1}^{M} A\_{ij}^{B} (\mathbf{y}\_{i} - \mathbf{y}\_{j})(\mathbf{y}\_{i} - \mathbf{y}\_{j})^{T} \tag{25}$$

where


$$A_{ij}^{w} = \begin{cases} A_{ij}, & \mathbf{y}_j \in N(i) \\ 0, & \text{otherwise} \end{cases} \tag{26}$$

$$A_{ij}^{B} = \begin{cases} A_{ij}\left(\dfrac{1}{M} - \dfrac{1}{K}\right), & \mathbf{y}_j \in N(i) \\ \dfrac{A_{ij}}{M}, & \text{otherwise} \end{cases} \tag{27}$$

Here, $A_{ij}$ is the affinity between $\mathbf{y}_i$ and $\mathbf{y}_j$, defined in the form of a heat kernel:

$$A_{ij} = e^{-\frac{\|\mathbf{y}_i - \mathbf{y}_j\|^2}{2\sigma^2}} \tag{28}$$

Details of the derivation of the scatter matrices can be found in the original reference. The transformation matrix $\mathbf{T}_{\text{SNFDA}}$ of SNFDA is computed from the generalized eigenvector decomposition problem:

$$\mathbf{S}_{B}\,\mathbf{e} = \lambda\,\mathbf{S}_{w}\,\mathbf{e} \tag{29}$$

The top $d$ eigenvectors span $\mathbf{T}_{\text{SNFDA}}$, and the transformed feature is obtained by $\mathbf{z} = \mathbf{T}_{\text{SNFDA}}\,\mathbf{y}$. This supervised learning manner successfully introduces the supervision information into a framework that provides a "good" projection from the original data to the low-dimensional space. Owing to the supervised learning, when the projection is applied to the original data, more discriminative features can be obtained for head pose estimation.
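Putting Eqs. (24)–(29) together, the following is a compact NumPy sketch of the SNFDA post-processing. The neighbor-list format, the ridge regularizer (added so that the within-neighborhood scatter is invertible), and the function name are illustrative assumptions.

```python
import numpy as np

def snfda_transform(Y, neighbors, d=2, sigma=1.0, ridge=1e-8):
    """SNFDA post-processing (Eqs. 24-29) on an embedding Y (n x dim).

    neighbors[i] lists the K neighbor indices N(i). Builds heat-kernel
    affinities and the scatter matrices, then projects Y onto the top-d
    generalized eigenvectors of S_B e = lambda * S_w e.
    """
    n, K = len(Y), len(neighbors[0])
    diff2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    A = np.exp(-diff2 / (2 * sigma ** 2))                 # Eq. (28)

    mask = np.zeros((n, n), dtype=bool)
    for i, nb in enumerate(neighbors):
        mask[i, list(nb)] = True
    Aw = np.where(mask, A, 0.0)                           # Eq. (26)
    Ab = np.where(mask, A * (1.0 / n - 1.0 / K), A / n)   # Eq. (27)

    dim = Y.shape[1]
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for i in range(n):
        for j in range(n):
            v = (Y[i] - Y[j])[:, None]
            Sw += 0.5 * K * Aw[i, j] * (v @ v.T)          # Eq. (24)
            Sb += 0.5 * K * Ab[i, j] * (v @ v.T)          # Eq. (25)

    # generalized eigenproblem S_B e = lambda S_w e (Eq. 29),
    # solved via S_w^{-1} S_B with a small ridge for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + ridge * np.eye(dim), Sb))
    order = np.argsort(-vals.real)
    T = vecs[:, order[:d]].real                           # columns span T_SNFDA
    return Y @ T                                          # z = T^T y per sample

# toy usage on a hypothetical 3-D embedding of 8 points
rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 3))
neighbors = [[(i + 1) % 8, (i + 2) % 8] for i in range(8)]
Z = snfda_transform(Y, neighbors, d=2)
```

The double loop keeps the sketch close to the summation form of Eqs. (24) and (25); a production implementation would vectorize it.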

In Stage 4, at testing time, the GRNN is applied to map unseen data points to the low-dimensional embedding, and the relevance vector machine (RVM) [30] is adopted to accomplish the pose estimation. Experimental results of the proposed method on the FacePix [25] and MIT-CBCL [31] databases show large improvements over other state-of-the-art algorithms [23, 26], both for Stage 3 alone and for Stages 1 + 2 + 3 combined, which means that this method is more robust to identity and illumination variations.
