3. Characteristics of manifold learning algorithms

Given a set of data points, for example, face images, it is difficult to directly estimate or extract the most significant modes from such a high-dimensional representation of the data. If the data in the original feature space are linearly structured, classical principal component analysis (PCA) is able to estimate the discriminative modes and thereby reduce the feature dimension. An example of such data is shown in Figure 3. However, if the distribution of the original data is nonlinear, for example, the famous "swiss roll" shown in Figure 4(a), which is a smooth, continuous but nonlinear surface embedded in 3D space, the Euclidean distance is a poor representation of the structure of the data. Taking the two circled points sampled from the manifold shown in Figure 4(b), for instance, their Euclidean distance is small, yet the distance along the underlying surface is large. The embedded structure can instead be explored with the help of nonlinear dimensionality reduction, such as manifold learning algorithms. The learned low-dimensional representation can approximately model the true distances between the sampled data points, as shown in Figure 4(c).

Figure 3. A data set sampled from a multivariate Gaussian distribution. The most significant modes, indicated by the red orthogonal axes, can be learned by PCA, which preserves the largest variations in the original data.

Figure 4. An example of a data set with an underlying "swiss roll" structure. The figures are produced based on the code from [16]. (a) The original 3D surface. (b) The data points sampled from (a). The Euclidean distance, indicated by the dashed line between the circled points, cannot represent the distance along the underlying structure. (c) The distance measured in the learned low-dimensional space models the data more accurately.
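For readers who wish to reproduce a comparable data set, a minimal Python sketch is given below. It relies on scikit-learn's make_swiss_roll rather than the code from [16], so it is only an approximation of the setup shown in Figure 4.

```python
# A minimal sketch (not the code from [16]): sample a "swiss roll" point cloud.
# The intrinsic coordinate t is exactly what a manifold learning algorithm
# tries to recover from the 3D coordinates X alone.
import numpy as np
from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)  # X: (2000, 3)

# For two points, a small Euclidean distance in 3D does not imply a small
# separation along the roll, which is the point illustrated in Figure 4(b).
i, j = 0, 1
print("Euclidean distance in 3D:", np.linalg.norm(X[i] - X[j]))
print("Separation along the intrinsic coordinate:", abs(t[i] - t[j]))
```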

In this section, in order to reveal the essence of manifold learning, PCA is detailed first. Other classical manifold learning algorithms are elaborated in the following subsections.

## 3.1. Principal component analysis (PCA)

PCA is one of the most popular unsupervised linear dimensionality reduction algorithms. The essence of PCA is to estimate a linear space whose basis preserves the maximum variation in the original data. Mathematically, the low-dimensional data can be obtained by a linear transformation of the original data, as denoted in Eq. (1).

$$\mathbf{y} = \mathbf{V}^T \mathbf{x} \tag{1}$$

where x is the centered data point. The columns of the projection matrix V are the principal components that span the projection space. Let us take one of the principal components for instance. The objective is to preserve the maximum variation in the transformed data, as formulated in Eq. (2).

$$\max\_{\|\boldsymbol{\nu}\|=1} \operatorname{Var}(\boldsymbol{\nu}^T \mathbf{X}) = \max\_{\boldsymbol{\nu}} \frac{1}{M-1} \|\boldsymbol{\nu}^T \mathbf{X}\|^2 = \max\_{\boldsymbol{\nu}} \frac{1}{M-1} (\boldsymbol{\nu}^T \mathbf{X})(\mathbf{X}^T \boldsymbol{\nu}) = \max\_{\boldsymbol{\nu}} (\boldsymbol{\nu}^T \mathbf{C} \boldsymbol{\nu}) \tag{2}$$

The factor $\frac{1}{M-1}$ makes Eq. (2) an unbiased estimate; the denominator $M-1$ can be replaced by $M$ when $M$ is sufficiently large. Eq. (2) is a form of Rayleigh's quotient, which can be maximized by eigendecomposition of the covariance matrix C. The $d$ eigenvectors corresponding to the top $d$ positive eigenvalues are taken to construct the low-dimensional space $\mathbf{V} \in \mathbb{R}^{D \times d}$, onto which the original data are projected. Actually, Eq. (1) can be converted to

$$\mathbf{x} = \mathbf{V}\mathbf{y} \tag{3}$$

which means that the original data can be represented as a linear combination of the principal components.
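As an illustration of Eqs. (1)-(3), a minimal Python sketch of PCA is given below; the function name and the random test data are chosen only for this example.

```python
# A minimal PCA sketch following Eqs. (1)-(3): center the data, eigendecompose
# the covariance matrix C, project with y = V^T x, and reconstruct with x ~ V y.
import numpy as np

def pca(X, d):
    """X: (M, D) data matrix, one sample per row; d: target dimension."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # centered data
    C = Xc.T @ Xc / (X.shape[0] - 1)       # (D, D) covariance matrix of Eq. (2)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :d]            # top-d principal components
    Y = Xc @ V                             # Eq. (1), written row-wise: y = V^T x
    return Y, V, mean

# Usage: project and approximately reconstruct, Eq. (3): x ~ V y (plus the mean)
X = np.random.randn(200, 10)
Y, V, mean = pca(X, d=3)
X_rec = Y @ V.T + mean
```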

Taking the head pose images of one identity shown in Figure 2 as an example, PCA is applied to the vectorized images. Figure 5 visualizes the low-dimensional representation of the face images in the first three dimensions. One can find obvious transitions in pitch and yaw along a valley-shaped 3D surface. The first three principal components are visualized in Figure 6, in which one face image is decomposed into the mean face plus a weighted accumulation of variations. The first and third eigenfaces (principal components) clearly show the variation in yaw. Therefore, PCA can model the head poses with a few discriminative principal components.

Figure 5. Visualization of the low-dimensional features obtained by PCA. A valley-shaped surface can be observed. Blue dots show the face images sampled from the surface, and some face images are selected and shown. (a) The variation in pitch is shown at a yaw of −90° in one view of the surface. (b) The variation in yaw is shown at a pitch of 0° in another view of the surface.

Figure 6. Representation of one face image by the mean face and first three eigenfaces obtained from PCA.

## 3.2. Locally linear embedding (LLE)

From the observation of the data shown in Figure 4, the smooth manifold is globally nonlinear but can be regarded as linear within a local neighborhood. On the basis of this observation, LLE attempts to represent each data point as a weighted linear combination of a number of its neighbors [11]. The weight matrix W can be obtained from the following objective function.

$$\min\_{\mathbf{W}} \sum\_{i=1}^{M} \left\|\mathbf{x}\_i - \sum\_{j \in \mathcal{N}(i)} w\_{ij} \mathbf{x}\_j \right\|^2, \quad \text{s.t. } \sum\_{j \in \mathcal{N}(i)} w\_{ij} = 1 \tag{4}$$

where $\mathbf{w}_i$ denotes the $i$th row vector of matrix W. Eq. (4) shows that LLE aims to minimize the total error of reconstructing each data point from its nearest neighbors. Specifically, W is a sparse matrix that assigns optimal weights to neighboring data points and zeros to nonneighboring data points. As a result, both the global nonlinearity and the local linearity are captured in one form. A closed-form solution for the weight matrix $\mathbf{W} = \{w_{ij}\}$ can be computed efficiently.

$$w\_{ij} = \frac{\sum\_{n=1}^{K} c\_{jn}^{-1}}{\sum\_{m=1}^{K} \sum\_{n=1}^{K} c\_{mn}^{-1}} \tag{5}$$

where $K$ is the number of nearest neighbors, tuned for the specific problem; $\mathbf{C} = \{c_{mn}\}$ is the neighborhood correlation matrix computed for each data point: $c_{mn} = \boldsymbol{\xi}_m^T \boldsymbol{\xi}_n$, where $\boldsymbol{\xi}_m$ and $\boldsymbol{\xi}_n$ are the $m$th and $n$th neighbors of the $i$th data point.
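As a concrete illustration, the following sketch computes the weights of Eq. (5) for a single data point. The neighbor search is assumed to have been done beforehand, and both the centering of the neighbors on $\mathbf{x}_i$ and the small regularization term are practical choices that go beyond the equation as stated in the text.

```python
# A minimal sketch of the closed-form LLE weights of Eq. (5) for one point x_i.
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """x_i: (D,) data point; neighbors: (K, D) its K nearest neighbors."""
    # Neighborhood correlation matrix c_mn; the neighbors are centered on x_i
    # here, a common numerically stable variant of the construction in the text.
    Z = neighbors - x_i
    C = Z @ Z.T
    C += reg * np.trace(C) * np.eye(len(neighbors))  # regularize a near-singular C
    C_inv = np.linalg.inv(C)
    return C_inv.sum(axis=1) / C_inv.sum()           # Eq. (5); weights sum to one
```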

Moreover, the weight matrix W is locally invariant to translation, rotation, and scaling. Therefore, it is reasonable to assume that the low-dimensional representation of the data preserves the same local geometry as the original space, with the same weight matrix W. The next step is then to estimate such a low-dimensional ($d$-dimensional) representation Y by the following equation.

$$\min\_{\mathbf{Y}} \sum\_{i=1}^{M} \left\| \mathbf{y}\_i - \sum\_{j \in \mathcal{N}(i)} w\_{ij} \mathbf{y}\_j \right\|^2 \tag{6}$$

Eq. (6) can be rewritten as

$$\min\_{\mathbf{Y}} \mathbf{Y} \mathbf{M} \mathbf{Y}^T \tag{7}$$

where $\mathbf{M} = (\mathbf{I}-\mathbf{W})^T(\mathbf{I}-\mathbf{W})$. Two constraints are imposed to center the data and to avoid degenerate solutions: $\sum_{i=1}^{M}\mathbf{y}_i = \mathbf{0}$ and $\frac{1}{M}\sum_{i=1}^{M}\mathbf{y}_i\mathbf{y}_i^T = \mathbf{I}_{d \times d}$. Then Y can be obtained, of which the columns correspond to the bottom $d$ eigenvectors of $\mathbf{M}$.
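A minimal sketch of this embedding step is given below; it assumes the full weight matrix W has already been assembled (e.g., row by row from a routine such as lle_weights above) and discards the trivial constant eigenvector, as is usual in practice.

```python
# A minimal sketch of the LLE embedding step of Eqs. (6)-(7):
# build M = (I - W)^T (I - W) and keep its bottom d nontrivial eigenvectors.
import numpy as np

def lle_embed(W, d):
    """W: (M, M) weight matrix with rows summing to one; returns Y of shape (M, d)."""
    n = W.shape[0]
    A = np.eye(n) - W
    M_mat = A.T @ A                           # the matrix M of Eq. (7)
    eigvals, eigvecs = np.linalg.eigh(M_mat)  # eigenvalues in ascending order
    # Skip the constant eigenvector with eigenvalue ~0, keep the next d.
    return eigvecs[:, 1:d + 1]
```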

The same data set used in the previous experiment is processed with LLE, and the first three dimensions are shown in Figure 7. The variation of head pose in yaw at different pitch angles is clearly visible. The transition from one pose to another tends to be continuous and easy to locate. The learned manifold is smoother and more discriminative than that of PCA.

## 3.3. Isomap

Isomap [10] is an abbreviation of isometric feature mapping [17], which is an extension of the classical algorithm of multidimensional scaling (MDS) [18]. From the previous section, one can see that LLE represents the nonlinearity of the original data by preserving local geometrical linearity. In contrast, Isomap proposes a global solution by constructing a graph over all pairwise data points, which seeks a globally consistent preservation of the geometry rather than a purely local one.

Specifically, Isomap first constructs a graph G = (V, E), with V the vertices (the data points) and E the edges (the adjacency connections). Adjacent vertices are connected by edges according to a particular metric. Taking the $i$th and $j$th data points, for instance, an edge is added between them if $\|\mathbf{x}_i - \mathbf{x}_j\| \leq \epsilon$ ($\epsilon$-neighborhood), or if $\mathbf{x}_i$ is among the $K$ nearest neighbors of $\mathbf{x}_j$ and vice versa ($K$-nearest neighbors). The graph is then assigned distances between all pairs of vertices according to the edges. The distance between vertices connected by an edge is $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|$ (Euclidean distance). For all other pairs of vertices, geodesic distances are considered, which can be computed as shortest paths in the graph (e.g., with Floyd's algorithm). This weighted graph models the isometric distances and can thus preserve the global geometry in the learned manifold.
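A minimal sketch of this first stage of Isomap is given below; SciPy's shortest_path routine is used in place of a hand-written Floyd's algorithm, and the $K$-nearest-neighbor rule is chosen over the $\epsilon$-neighborhood one, both of which are implementation choices for this example.

```python
# A minimal sketch of Isomap's graph stage: connect each point to its K nearest
# neighbors with Euclidean edge weights, then approximate geodesic distances
# by shortest paths over the graph.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(X, K=10):
    """X: (M, D) data matrix; returns the (M, M) matrix of graph distances."""
    D = cdist(X, X)                              # pairwise Euclidean distances
    G = np.full_like(D, np.inf)                  # inf marks "no edge"
    idx = np.argsort(D, axis=1)[:, 1:K + 1]      # K nearest neighbors of each point
    rows = np.repeat(np.arange(len(X)), K)
    G[rows, idx.ravel()] = D[rows, idx.ravel()]  # keep only neighbor edges
    return shortest_path(G, method="FW", directed=False)  # Floyd-Warshall paths
```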

Figure 7. Visualization of the low-dimensional features obtained by LLE. A surface of "wings" is observed. (a) The variation in yaw with a pitch of −30° is found along the edge of one "wing" of the 3D surface. (b) Another variation in yaw with a pitch of 0° is found along the ridge of the 3D surface.

From the distance matrix $\mathbf{D} = \{d_{ij}\}$, an objective function is defined as

$$\min\_{\mathbf{Y}} \left\| \tau(\mathbf{D}\_{\mathbf{X}}) - \tau(\mathbf{D}\_{\mathbf{Y}}) \right\|^{2} \tag{8}$$

where $\tau(\mathbf{D_X})$ and $\tau(\mathbf{D_Y})$ denote the conversion of distances to inner products for the original data and the low-dimensional data, respectively. The operator $\tau$ is defined as $\tau(\mathbf{D}) = -\mathbf{H}\mathbf{S}\mathbf{H}/2$, where $\mathbf{S} = \mathbf{D}^2$ is the squared distance matrix and $\mathbf{H} = \mathbf{I} - \frac{1}{M}\mathbf{1}\mathbf{1}^T$ is the centering matrix. This design of the objective function aims to preserve the global structure represented by the graph and its geodesic distances, so that the low-dimensional embedding exhibits a global geometry similar to that of the original data. The objective function can be optimized by MDS. The $d$-dimensional representation Y is obtained by decomposing the matrix $\tau(\mathbf{D_X})$ and preserving the top $d$ eigenvectors.
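A minimal sketch of this MDS stage is given below, reusing the geodesic distance matrix from the earlier sketch; the clipping of negative eigenvalues is a small numerical safeguard added for this example.

```python
# A minimal sketch of the MDS stage of Isomap: apply the tau operator to the
# geodesic distance matrix and eigendecompose it to get the d-dim embedding.
import numpy as np

def isomap_embed(D_geo, d):
    """D_geo: (M, M) geodesic distance matrix; returns Y of shape (M, d)."""
    n = D_geo.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - (1/M) 1 1^T
    T = -H @ (D_geo ** 2) @ H / 2              # tau(D) = -H S H / 2 with S = D^2
    eigvals, eigvecs = np.linalg.eigh(T)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]        # top-d eigenpairs
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```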

Figure 8 shows the low-dimensional head poses obtained from Isomap. An interesting "bowl" shape is obtained for the embedding surface of the head pose images.

Figure 8. Visualization of the low-dimensional features obtained by Isomap. (a) The variation in yaw with a pitch of −30° is found along the edge of the shape. (b) Another variation in yaw with a pitch of 0° is found along the geodesic path in the middle of the shape. Interestingly, the frontal face is located approximately at the center.
