3. Multi-view clustering methods

Due to the widespread use of multi-view datasets in practice, many real-world applications are tackled by multi-view learning methods, such as community detection in social networks, image annotation in computer vision, and cross-domain user modeling in recommendation systems [6]. Meanwhile, building on the seminal work of Bickel and Scheffer [1], plenty of multi-view clustering methods have been proposed [2, 3, 5]. As explained in Section 1, this chapter reviews five kinds of typical clustering methods and their multi-view versions: k-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. All five are popular methods for single-view clustering. Although some other multi-view clustering methods are not covered in this chapter, such as the canonical correlation analysis (CCA)-based methods [7], the DBSCAN-based methods [8], and the low-dimensional subspace-based methods [9], most of them can be unified into the frameworks of these five methods. For instance, the pair-wise sparse subspace representation model for multi-view clustering proposed in [10] can be unified into the framework of matrix factorization.

#### 3.1. Multi-view clustering via k-means

k-means is one of the most popular clustering algorithms, with a history of more than 50 years [11]. Besides its simplicity, k-means scales well to large datasets. Owing to these properties, k-means has been successfully applied in various domains, including computer vision, social network analysis, and market segmentation, to name but a few. Although it has been studied extensively over the past few decades, new variants of k-means continue to be put forward [12–15].

#### 3.1.1. Preliminaries of k-means

As a classic clustering algorithm, k-means employs $K$ prototype vectors (i.e., the centers or centroids of the $K$ clusters) to characterize the data and minimizes a sum-of-squared-error loss function to find these prototypes. Consider a dataset denoted by $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{M \times N}$, where $\mathbf{x}_i \in \mathbb{R}^{M}$ represents the attribute (feature) vector of the $i$-th data sample. In order to partition the dataset $\mathbf{X}$ into $K$ disjoint clusters, denoted by $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$, k-means optimizes the following objective function:

$$\boldsymbol{\varepsilon} = \sum\_{i=1}^{N} \sum\_{k=1}^{K} \delta\_{ik} \|\mathbf{x}\_i - \mathbf{v}\_k\|\_2^2, \quad \mathbf{v}\_k = \frac{\sum\_{i=1}^{N} \delta\_{ik} \mathbf{x}\_i}{\sum\_{i=1}^{N} \delta\_{ik}} = \frac{1}{|\mathbf{C}\_k|} \sum\_{\mathbf{x} \in \mathbf{C}\_k} \mathbf{x}, \tag{1}$$

where $\delta_{ik}$ is an indicator variable with $\delta_{ik} = 1$ if $\mathbf{x}_i \in C_k$ and $0$ otherwise, and $\mathbf{v}_k$ is the $k$-th prototype vector, i.e., the $k$-th cluster center.
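To make Eq. (1) concrete, here is a minimal NumPy sketch of the alternating (Lloyd-style) procedure that k-means uses to minimize this objective; the function name, the random-sample initialization, and the fixed iteration count are illustrative choices, not taken from the chapter.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means for X of shape (M, N): M features, N samples, as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    # Initialize the prototypes v_k with K randomly chosen samples.
    V = X[:, rng.choice(N, K, replace=False)].astype(float)          # (M, K)
    for _ in range(n_iters):
        # Assignment step: delta_ik = 1 for the nearest prototype.
        d2 = ((X[:, :, None] - V[:, None, :]) ** 2).sum(axis=0)      # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # Update step: each prototype becomes the mean of its cluster (right-hand side of Eq. (1)).
        for k in range(K):
            members = X[:, labels == k]
            if members.shape[1] > 0:
                V[:, k] = members.mean(axis=1)
    return labels, V
```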

As can be seen, Eq. (1) adopts the Euclidean distance to measure the similarity between data samples. However, real-world data exhibit many different structures and distributions, so this basic form of k-means is not always able to accurately identify the hidden patterns in a dataset. What is more, some datasets may not be separable in the low-dimensional input space. Kernel methods have attracted wide attention in machine learning: by introducing a kernel function, the original nonlinear dataset is mapped to a higher-dimensional reproducing kernel Hilbert space in which it becomes linearly separable. For this reason, the kernel k-means algorithm [16, 17] has been proposed. It is a generalization of the standard k-means algorithm with the following objective function:

$$\boldsymbol{\varepsilon} = \sum\_{i=1}^{N} \sum\_{k=1}^{K} \delta\_{ik} \left\| \phi(\mathbf{x}\_{i}) - \mathbf{v}\_{k}^{\prime} \right\|\_2^2, \quad \mathbf{v}\_{k}^{\prime} = \frac{\sum\_{i=1}^{N} \delta\_{ik} \phi(\mathbf{x}\_{i})}{\sum\_{i=1}^{N} \delta\_{ik}}, \tag{2}$$

where $\phi: \mathcal{X} \to \mathcal{H}$ is a nonlinear transformation function. Define a kernel function $\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with $\mathcal{K}(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. Then, Eq. (2) can be rewritten in kernel form as below:

$$\varepsilon = \sum\_{i=1}^{N} \sum\_{k=1}^{K} \delta\_{ik} \left( \mathcal{K}(\mathbf{x}\_{i}, \mathbf{x}\_{i}) - 2 \frac{\sum\_{j=1}^{N} \delta\_{jk} \mathcal{K}(\mathbf{x}\_{i}, \mathbf{x}\_{j})}{\sum\_{j=1}^{N} \delta\_{jk}} + \frac{\sum\_{j=1}^{N} \sum\_{l=1}^{N} \delta\_{jk} \delta\_{lk} \mathcal{K}(\mathbf{x}\_{j}, \mathbf{x}\_{l})}{\sum\_{j=1}^{N} \sum\_{l=1}^{N} \delta\_{jk} \delta\_{lk}} \right). \tag{3}$$

With the aid of the kernel function, there is no need to explicitly provide the transformation function $\phi$. This matters because, for certain kernel functions, the corresponding transformation function is intractable, whereas the inner products in the kernel space can be obtained easily from the kernel function.
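The following sketch illustrates how Eq. (3) lets kernel k-means run on the Gram matrix alone, without ever forming $\phi(\mathbf{x}_i)$; the RBF kernel shown in the trailing comment and the random initialization are assumptions made for the example.

```python
import numpy as np

def kernel_kmeans(Kmat, K, n_iters=100, seed=0):
    """Minimal kernel k-means driven only by the N x N Gram matrix Kmat, following Eq. (3)."""
    rng = np.random.default_rng(seed)
    N = Kmat.shape[0]
    labels = rng.integers(0, K, size=N)                 # random initial assignment
    diag = np.diag(Kmat)
    for _ in range(n_iters):
        d2 = np.full((N, K), np.inf)
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:
                continue
            # ||phi(x_i) - v'_k||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl  (the three terms of Eq. (3))
            d2[:, k] = diag - 2.0 * Kmat[:, idx].mean(axis=1) + Kmat[np.ix_(idx, idx)].mean()
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# One way to build a Gram matrix, e.g., an RBF kernel on X of shape (M, N):
# sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
# Kmat = np.exp(-gamma * sq)
```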

#### 3.1.2. Basic form of multi-view k-means


Both the k-means and the kernel k-means described above are designed for single-view data. To solve the multi-view clustering problem, new objective functions need to be developed. Assume that there are $V$ views in total, and let $\mathcal{X} = \{\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(V)}\}$ denote the data of all the views. It is obvious that different views should have different contributions according to the information they convey. To achieve this goal, it is straightforward to modify the standard k-means to make it applicable in the multi-view environment with a new objective function as follows:

$$\varepsilon = \sum\_{v=1}^{V} \mu\_v^{\gamma} \varepsilon\_{v}, \quad \text{s.t. } \mu\_v \ge 0, \ \sum\_{v=1}^{V} \mu\_v = 1, \ \gamma > 1, \tag{4}$$

where $\mu_v$ is the weight factor for the $v$-th view, $\gamma$ is a parameter used to control the weight distribution, and $\varepsilon_v$ corresponds to the objective function (i.e., loss function) of the $v$-th view:

$$\boldsymbol{\varepsilon}\_{v} = \sum\_{i=1}^{N} \sum\_{k=1}^{K} \delta\_{ik} \|\mathbf{x}\_{i}^{(v)} - \mathbf{v}\_{k}^{(v)}\|\_2^2, \quad \mathbf{v}\_{k}^{(v)} = \frac{\sum\_{i=1}^{N} \delta\_{ik} \mathbf{x}\_{i}^{(v)}}{\sum\_{i=1}^{N} \delta\_{ik}}. \tag{5}$$

Similarly, the objective function of the multi-view kernel k-means can be obtained, which is omitted here. Note that finding the optimal solution of Eq. (4) is an NP-hard problem; thus, some iterative algorithms are developed according to the greedy strategy. One basic iterative algorithm works in a two-stage manner: (1) updating the clustering for given weights and (2) updating the weights for given clusters; see [18] for details.
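A minimal sketch of this two-stage alternation is given below, assuming the shared indicator of Eqs. (4)-(5) and a closed-form weight update $\mu_v \propto \varepsilon_v^{1/(1-\gamma)}$, which follows from minimizing Eq. (4) over the simplex with the per-view losses held fixed; treat it as an illustration rather than the exact procedure of [18].

```python
import numpy as np

def multiview_kmeans(views, K, gamma=2.0, n_iters=20, seed=0):
    """Alternating sketch for Eqs. (4)-(5): views is a list of arrays X^(v), each of shape (M_v, N)."""
    rng = np.random.default_rng(seed)
    N = views[0].shape[1]
    mu = np.full(len(views), 1.0 / len(views))        # view weights, uniform at the start
    labels = rng.integers(0, K, size=N)               # shared cluster assignment
    for _ in range(n_iters):
        # Stage 1: update the clustering for the given weights.
        d2 = np.zeros((N, K))
        centers = []
        for v, X in enumerate(views):
            C = np.stack([X[:, labels == k].mean(axis=1) if np.any(labels == k)
                          else X[:, rng.integers(N)] for k in range(K)], axis=1)   # (M_v, K)
            centers.append(C)
            d2 += (mu[v] ** gamma) * ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
        # Stage 2: update the weights for the given clusters.
        eps_v = np.array([((X - centers[v][:, labels]) ** 2).sum() for v, X in enumerate(views)])
        w = (eps_v + 1e-12) ** (1.0 / (1.0 - gamma))  # mu_v proportional to eps_v^{1/(1-gamma)}
        mu = w / w.sum()
    return labels, mu
```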


Denote $\|\mathbf{X}\|_F$ as the Frobenius norm of a given matrix $\mathbf{X}$, i.e., $\|\mathbf{X}\|_F = \sqrt{\sum_{i,j} x_{ij}^2}$. Then, Eq. (4) can be easily transformed into a matrix form as shown in the following:

$$\min\_{\mathbf{V}^{(v)}, \mathbf{U}, \boldsymbol{\mu}\_v} \sum\_{v=1}^{V} \mu\_v^{\gamma} \|\mathbf{X}^{(v)} - \mathbf{V}^{(v)} \mathbf{U}^{T}\|\_{F}^{2}, \\ \text{s.t. } u\_{ik} \in \{0, 1\}, \ \sum\_{k=1}^{K} u\_{ik} = 1, \ \mu\_v \ge 0, \ \sum\_{v=1}^{V} \mu\_v = 1, \ \gamma > 1, \tag{6}$$

where $\mathbf{V}^{(v)} \in \mathbb{R}^{M_v \times K}$ denotes the centroid matrix for the $v$-th view and $\mathbf{U} \in \mathbb{R}^{N \times K}$ denotes the clustering indicator matrix with the $(i, k)$ element being $\delta_{ik}$. Note that all the views share a common clustering indicator matrix $\mathbf{U}$.

#### 3.1.3. Variants of multi-view k-means

The basic formulations of multi-view k-means shown in Eqs. (4) and (6) do have some drawbacks. For example, they assume that all the views share a common clustering indicator matrix $\mathbf{U}$. However, the structure information contained in some views may be very limited or even lost; in such cases, the performance will be severely affected if all the views are forced to share a common clustering indicator matrix. To tackle these issues, many variants of multi-view k-means clustering have been proposed in recent years. Instead of the $\ell_2$-norm, the structured sparsity-inducing $\ell_{2,1}$-norm is adopted to strengthen the basic multi-view k-means, in the hope that the effect of outlier data samples will be reduced [19]. In [20], a k-means-based dual-regularized multi-view outlier detection method (DMOD) is proposed to identify cluster outliers and attribute outliers simultaneously, based on a novel cross-view outlier measurement criterion. Moreover, in the DMOD model, each view is associated with its own clustering indicator matrix, and an additional alignment matrix is introduced to enforce consistency between different views. An automated two-level variable weighting clustering algorithm, called TW-k-means, is developed in [21]. TW-k-means computes weights for each view and each individual attribute simultaneously. More specifically, a view weight is assigned to each view to identify the compactness of the view, and an attribute weight is assigned to each attribute in the view to identify the importance of the attribute. Both view weights and attribute weights are employed in the distance function that determines the cluster structures of the data samples. Similar strategies have also been adopted in [22, 23] to learn more robust multi-view k-means models.

As mentioned above, it is NP-hard to find the optimal solution of the multi-view k-means clustering problem, and the greedy iterative algorithm has a high risk of getting stuck in local optima during the optimization. Recently, self-paced learning has been used to alleviate this problem. The general self-paced learning model consists of a weighted loss function on all data samples and a regularizer imposed on the weights of the data samples. By gradually increasing the penalty on the regularizer, more data samples are automatically taken into consideration, from "easy" to "complex", in a purely self-paced way. In this spirit, Xu et al. [24] present a new multi-view self-paced learning (MSPL) algorithm for clustering based on multi-view k-means. MSPL learns the multi-view model by progressing not only from "easy" to "complex" data samples but also from "easy" to "complex" views. The objective function of MSPL is quite succinct and is shown in Eq. (7).

$$\min\_{\mathbf{V}^{(v)}, \mathbf{U}, \mathcal{U}} \sum\_{v=1}^{V} \left\| \left( \mathbf{X}^{(v)} - \mathbf{V}^{(v)} \mathbf{U}^{T} \right) \text{diag}\left( \sqrt{\boldsymbol{\mu}^{(v)}} \right) \right\|\_{F}^{2} + f(\mathcal{U}), \\ \text{s.t. } u\_{ik} \in \{0, 1\}, \ \sum\_{k=1}^{K} u\_{ik} = 1, \tag{7}$$

where $\boldsymbol{\mu}^{(v)} = [\mu^{(v)}_1, \mu^{(v)}_2, \ldots, \mu^{(v)}_N] \in [0, 1]^N$ denotes the weights of the data samples in the $v$-th view, $\mathcal{U} = \{\boldsymbol{\mu}^{(1)}, \boldsymbol{\mu}^{(2)}, \ldots, \boldsymbol{\mu}^{(V)}\}$, and $f(\mathcal{U})$ denotes the regularization term chosen on demand.
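To illustrate the self-paced mechanism, the sketch below assumes the classical hard regularizer $f(\mathcal{U}) = -\lambda \sum_{v,i} \mu_i^{(v)}$ (an assumption for this example; [24] may use a different regularizer). With the clustering fixed, the optimal weight of a sample is then 1 when its loss is below the pace parameter $\lambda$ and 0 otherwise, and $\lambda$ is gradually increased so that harder samples and views are included.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced rule: losses is a (V, N) array of per-view, per-sample losses.
    Returns binary weights mu_i^(v) = 1 if the loss is below the pace parameter lam, else 0."""
    return (losses < lam).astype(float)

# Schematic use inside an alternating loop:
# 1. fix the weights and solve the weighted multi-view k-means problem of Eq. (7);
# 2. recompute the per-sample losses ||x_i^(v) - V^(v) u_i||^2 and call self_paced_weights;
# 3. increase lam (e.g., lam *= 1.1) so that harder samples and views are gradually included.
```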

#### 3.2. Multi-view clustering via spectral clustering

Spectral clustering is built upon spectral graph theory. In recent years, it has become one of the most popular clustering algorithms and has shown its effectiveness in various real-world applications ranging from statistics and computer science to bioinformatics. Because it adapts to the data distribution, spectral clustering often outperforms traditional clustering algorithms such as k-means. In addition, spectral clustering is simple to implement and can be solved efficiently by standard linear algebra.

#### 3.2.1. Preliminaries of spectral clustering


Spectral clustering is closely related to the minimum cut problem of graphs. It first performs dimensionality reduction on the original data space by leveraging the spectrum of the similarity matrix of the data samples and then performs k-means in the low-dimensional space to partition the data into different clusters. Therefore, for a set of data samples, a similarity matrix has to be constructed first. Typically, each data sample is treated as a node of a graph, each relationship between data samples is regarded as an edge in the graph, and each edge is associated with a weight. The edge weight between two faraway data samples should be low, and the weight between two close data samples should be high. For a given dataset $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{M \times N}$, let $G = (\mathcal{V}, \mathcal{E}, \mathbf{S})$ be the generated undirected weighted graph, where $\mathcal{V}$ denotes the set of nodes representing the data samples and $\mathcal{E}$ denotes the set of edges representing the relationships between data samples. The similarity matrix $\mathbf{S}$ is a symmetric matrix with each element $s_{ij}$ representing the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$. There are three popular approaches to construct the graph $G$, namely the $\varepsilon$-neighborhood graph, the $k$-nearest neighbor graph, and the fully connected graph (see details in [25]). To partition $G$ into disjoint subgraphs (clusters), the minimum cut problem requires that the edge weights across different clusters be as small as possible, while the total edge weights within each cluster be as high as possible.

According to the above graph cut theory, two popular versions of spectral clustering are developed, i.e., the ratio cut (RatioCut) and the normalized cut (Ncut). The classical relaxed form of the RatioCut [26] is shown as below:

$$\min \quad tr(\mathbf{U}^T \mathbf{L} \mathbf{U}), \\ \text{s.t.} \mathbf{U}^T \mathbf{U} = \mathbf{I}, \tag{8}$$


where $tr(\cdot)$ computes the trace of a matrix, $\mathbf{U} \in \mathbb{R}^{N \times K}$ is the clustering indicator matrix, $\mathbf{I}$ is an identity matrix, and $\mathbf{L}$ is the graph Laplacian matrix defined as $\mathbf{L} = \mathbf{D} - \mathbf{S}$. Here, $\mathbf{D}$ is a diagonal matrix with $d_{ii} = \sum_{j=1}^{N} s_{ij}$. The objective function of Ncut [27] is similar to Eq. (8), with $\mathbf{L}$ replaced by the normalized Laplacian matrix $\tilde{\mathbf{L}} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{S} \mathbf{D}^{-1/2}$. Both RatioCut and Ncut can be solved efficiently by the eigenvalue decomposition (EVD) of $\mathbf{L}$ or $\tilde{\mathbf{L}}$.
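Putting the pieces together, the following is a minimal sketch of single-view spectral clustering in the RatioCut form of Eq. (8): it assumes a fully connected graph with Gaussian edge weights (one of the three constructions mentioned above) and reuses the `kmeans` sketch from Section 3.1.1 to cluster the rows of $\mathbf{U}$.

```python
import numpy as np

def spectral_clustering(X, K, sigma=1.0, seed=0):
    """RatioCut-style spectral clustering sketch for X of shape (M, N)."""
    # Fully connected graph with Gaussian edge weights s_ij.
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)          # (N, N) squared distances
    S = np.exp(-sq / (2.0 * sigma ** 2))
    D = np.diag(S.sum(axis=1))
    L = D - S                                                         # unnormalized graph Laplacian
    # Relaxed RatioCut (Eq. (8)): U holds the eigenvectors of L for the K smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :K]                                                # (N, K)
    # Cluster the rows of U, here with the kmeans sketch from Section 3.1.1.
    labels, _ = kmeans(U.T, K, seed=seed)
    return labels
```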

#### 3.2.2. Basic form of multi-view spectral clustering

Multi-view spectral clustering is able to learn the latent cluster structures by fusing the information contained in multiple graphs. Similar to multi-view k-means, it is not hard to extend the basic spectral clustering to the multi-view environment. Given a dataset $\mathcal{X} = \{\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(V)}\}$ with $V$ views, $V$ graphs $\{G^{(1)}, G^{(2)}, \ldots, G^{(V)}\}$ and the corresponding Laplacian matrices $\{\mathbf{L}^{(1)}, \mathbf{L}^{(2)}, \ldots, \mathbf{L}^{(V)}\}$ can be constructed.

Kumar et al. [28] first present a multi-view spectral clustering approach with a co-training flavor, an idea widely used in semi-supervised learning. It follows the consistency assumption of multi-view learning that each view gives the same labels for all data samples, so the eigenvectors of one view can be used to "label" another view and vice versa. For example, after computing the eigenvector matrices of two views, say $\mathbf{U}^{(1)}$ and $\mathbf{U}^{(2)}$, the clustering result of $\mathbf{U}^{(1)}$ can be used to modify the graph similarity matrix $\mathbf{S}^{(2)}$, and then the clustering result of $\mathbf{U}^{(2)}$ can be used to modify the graph similarity matrix $\mathbf{S}^{(1)}$. For more than two views, the same strategy can be applied. Kumar et al. [29] further propose a multi-view spectral clustering approach using a co-regularization idea that makes the clustering results of different views agree with each other. The co-regularization term is stated as the disagreement between the clustering results of two views, $\Phi(\mathbf{U}^{(p)}, \mathbf{U}^{(q)}) = -tr(\mathbf{U}^{(p)} \mathbf{U}^{(p)T} \mathbf{U}^{(q)} \mathbf{U}^{(q)T})$. Then the goal is to minimize the disagreement to achieve consistency between views with the following objective function:

$$\min \sum\_{v=1}^{V} tr\left(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}\right) - \sum\_{p,q=1}^{V} \lambda\_{pq} tr\left(\mathbf{U}^{(p)}\mathbf{U}^{(p)T}\mathbf{U}^{(q)}\mathbf{U}^{(q)T}\right), \\ \text{s.t. } \mathbf{U}^{(v)T}\mathbf{U}^{(v)} = \mathbf{I}, \tag{9}$$

where $\lambda_{pq}$ represents the degree of disagreement between the $p$-th view and the $q$-th view. From another perspective, letting all the views share a common indicator matrix $\mathbf{U}^{*}$ is also reasonable under the consistency requirement, so the model in Eq. (9) can be rewritten as

$$\min \sum\_{v=1}^{V} \text{tr}\left(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}\right) - \sum\_{v=1}^{V} \lambda\_v \text{tr}\left(\mathbf{U}^{(v)}\mathbf{U}^{(v)T}\mathbf{U}^{\*}\mathbf{U}^{\*T}\right), \\ \text{s.t.} \mathbf{U}^{(v)T}\mathbf{U}^{(v)} = \mathbf{I}, \tag{10}$$

where $\lambda_v$ controls the degree of disagreement between $\mathbf{U}^{(v)}$ and $\mathbf{U}^{*}$.
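A minimal sketch of the centroid-based co-regularization of Eq. (10) is shown below. It alternates between updating each $\mathbf{U}^{(v)}$ as the smallest eigenvectors of $\mathbf{L}^{(v)} - \lambda_v \mathbf{U}^{*}\mathbf{U}^{*T}$ and updating $\mathbf{U}^{*}$ as the largest eigenvectors of $\sum_v \lambda_v \mathbf{U}^{(v)}\mathbf{U}^{(v)T}$; using a single shared $\lambda$ is an assumption made to keep the sketch short, and the exact schedule in [29] may differ.

```python
import numpy as np

def coregularized_spectral(laplacians, K, lam=0.5, n_iters=10):
    """Alternating sketch for the centroid-based co-regularization of Eq. (10).
    laplacians is a list of (N, N) graph Laplacians L^(v); a single lam stands in for lambda_v."""
    def smallest_eigvecs(A, k):
        _, vecs = np.linalg.eigh(A)                    # eigenvalues in ascending order
        return vecs[:, :k]

    Us = [smallest_eigvecs(L, K) for L in laplacians]  # per-view indicator matrices U^(v)
    U_star = Us[0].copy()                              # consensus indicator matrix U*
    for _ in range(n_iters):
        # Update each view-specific U^(v) with the consensus held fixed.
        for v, L in enumerate(laplacians):
            Us[v] = smallest_eigvecs(L - lam * (U_star @ U_star.T), K)
        # Update the consensus U*: top-K eigenvectors of sum_v lam * U^(v) U^(v)T.
        M = sum(lam * (U @ U.T) for U in Us)
        _, vecs = np.linalg.eigh(M)
        U_star = vecs[:, -K:]
    return U_star                                      # its rows are then clustered with k-means
```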

#### 3.2.3. Variants of multi-view spectral clustering


The basic form of multi-view spectral clustering achieves the basic goals of multi-view learning. However, some issues have not yet been addressed. For instance, the weight parameter $\lambda$ in Eq. (9) needs to be set manually. To remedy this, it is necessary to compute the weight of each view adaptively. Xia et al. [30] assume that each view has a weight $\mu_v$ representing its importance and that the weight distribution should be sufficiently smooth. They further consider a unified indicator matrix $\mathbf{U}$ across all views, which can be obtained by exploiting the complementary property of different views. To this end, they develop the following model:

$$\min \sum\_{v=1}^{V} \mu\_v^{\gamma} \, tr\left(\mathbf{U}^T \mathbf{L}^{(v)} \mathbf{U}\right), \\ \text{s.t. } \sum\_{v=1}^{V} \mu\_v = 1, \ \mu\_v > 0. \tag{11}$$

The model above needs a manually specified parameter $\gamma$ to adjust the weights of different views, which is sometimes hard to tune. Thus, Nie et al. [31] propose a parameter-free auto-weighted multiple graph learning method (AMGL), wherein the weight parameter $\mu_v$ is replaced by $\alpha_v = 1 / \left(2\sqrt{tr\left(\mathbf{U}^T \mathbf{L}^{(v)} \mathbf{U}\right)}\right)$. Thus, AMGL does not require additional parameters, and $\alpha_v$ can be self-updated. To cope with the considerable noise in each view, which often degrades performance severely, Xia et al. [32] propose a robust multi-view spectral clustering (RMSC) method via low-rank and sparse decomposition. In RMSC, a novel Markov chain is designed to deal with the noise. First, the similarity matrix $\mathbf{S}^{(v)}$ and the corresponding transition probability matrix $\mathbf{P}^{(v)} = \left(\mathbf{D}^{(v)}\right)^{-1} \mathbf{S}^{(v)}$ are computed. Then, the low-rank latent transition probability matrix $\hat{\mathbf{P}}$ and the deviation error matrices $\mathbf{E}^{(v)}$ are obtained via low-rank and sparse decomposition. Finally, based on the transition probability matrix $\hat{\mathbf{P}}$, the standard Markov chain method is applied to partition the data into $K$ clusters. Note that the methods above have a high optimization cost: there are numerous variables to update, and the derivation is rather involved. To overcome this limitation, Chen et al. [33] present a novel variant of the Laplacian matrix, named the block intra-normalized Laplacian and defined as follows, which avoids the linear combination of multiple Laplacian matrices.

$$\mathbf{B} = \mathbf{B}\_w + \beta \mathbf{B}\_a = \begin{pmatrix} \mathbf{L}^{(1)} & 0 & \cdots & 0 \\ 0 & \mathbf{L}^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{L}^{(V)} \end{pmatrix} + \beta \begin{pmatrix} (V-1)\mathbf{I} & -\mathbf{I} & \cdots & -\mathbf{I} \\ -\mathbf{I} & (V-1)\mathbf{I} & \cdots & -\mathbf{I} \\ \vdots & \vdots & \ddots & \vdots \\ -\mathbf{I} & -\mathbf{I} & \cdots & (V-1)\mathbf{I} \end{pmatrix},\tag{12}$$

where $\mathbf{B}_w$ denotes the within-view Laplacian matrix of the $V$ views and $\mathbf{B}_a$ denotes the across-view Laplacian matrix between different views. Based on $\mathbf{B}$, the block intra-normalized Laplacian matrix is then defined as $\hat{\mathbf{B}} = \mathbf{D}^{-1/2} \mathbf{B}_w \mathbf{D}^{-1/2} + \beta \mathbf{B}_a$, where $\mathbf{D}$ is a block diagonal matrix whose $v$-th block is $\mathbf{D}^{(v)}$. By proving that the multiplicity of the zero eigenvalue of the constructed block Laplacian matrix equals the number of clusters $K$, the eigenvectors of the block Laplacian matrix can be used for clustering via the classical form of spectral clustering. At the end, the lower and upper bounds of the optimal solution are also established. See [33] for more details.
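For concreteness, the sketch below builds $\mathbf{B}$ and $\hat{\mathbf{B}}$ exactly as written in Eq. (12) and the definition above from a list of per-view Laplacians and degree matrices; the function name and the single scalar $\beta$ are illustrative.

```python
import numpy as np

def block_intra_normalized_laplacian(laplacians, degrees, beta=1.0):
    """Builds B = B_w + beta * B_a of Eq. (12) and B_hat = D^{-1/2} B_w D^{-1/2} + beta * B_a,
    from per-view Laplacians L^(v) and degree matrices D^(v), each of shape (N, N)."""
    V = len(laplacians)
    N = laplacians[0].shape[0]
    # Within-view part B_w and block-diagonal D: the V per-view matrices on the diagonal.
    B_w = np.zeros((V * N, V * N))
    D = np.zeros((V * N, V * N))
    for v in range(V):
        block = slice(v * N, (v + 1) * N)
        B_w[block, block] = laplacians[v]
        D[block, block] = degrees[v]
    # Across-view part B_a: (V-1)I on the diagonal blocks and -I elsewhere.
    B_a = np.kron(V * np.eye(V) - np.ones((V, V)), np.eye(N))
    B = B_w + beta * B_a
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    B_hat = D_inv_sqrt @ B_w @ D_inv_sqrt + beta * B_a
    return B, B_hat
```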


#### 3.3. Multi-view clustering via matrix factorization

In the fields of data mining and machine learning, matrix factorization (MF) is an effective latent factor learning model. Given a data matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$, MF tries to find two low-rank factor matrices $\mathbf{V} \in \mathbb{R}^{M \times K}$ and $\mathbf{U} \in \mathbb{R}^{N \times K}$ whose product approximates it well, i.e., $\mathbf{X} \approx \mathbf{V}\mathbf{U}^T$. MF has many promising real-world applications, such as information retrieval, recommender systems, signal processing, and document analysis. Usually, nonnegativity constraints are enforced on the factor matrices to promote the interpretability of the MF models. Therefore, in this part, we focus on nonnegative MF (NMF)-related clustering models. For a comprehensive review of NMF-based models and applications, please refer to [34].

#### 3.3.1. Preliminaries of matrix factorization

As is well known, there are many matrix factorization models, including the singular value decomposition, Cholesky decomposition, LU decomposition, QR decomposition, and Schur decomposition. These factorization models either place overly strict restrictions on the factor matrices or lack the ability to support data analysis. Owing to its wide application in recommender systems, NMF has drawn much attention in both academia and industry. In fact, NMF can be regarded as an extension of the standard k-means algorithm obtained by relaxing the constraints imposed on the clustering indicator matrix. For a given dataset $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}_{+}^{M \times N}$, NMF seeks to learn a basis matrix $\mathbf{V}$ and a coefficient matrix $\mathbf{U}$ by optimizing the following objective function:

$$\min\_{\mathbf{V},\mathbf{U}} \|\mathbf{X} - \mathbf{V}\mathbf{U}^{T}\|\_{F}^{2}, \ \text{s.t. } \mathbf{V} \ge 0, \ \mathbf{U} \ge 0, \tag{13}$$

where $\mathbf{V} \in \mathbb{R}_{+}^{M \times K}$ can be considered as the cluster centroid matrix and $\mathbf{U} \in \mathbb{R}_{+}^{N \times K}$ can be treated as a "soft" clustering indicator matrix. The objective function above is not jointly convex in $\mathbf{U}$ and $\mathbf{V}$; therefore, it is impractical to find the global optimum. Typically, there are two methods to solve Eq. (13). The first one is the gradient descent method [35]. The other one is the multiplicative method [36], whose iterative updating rules are as follows:

$$\mathbf{V} \leftarrow \mathbf{V} \odot \frac{\mathbf{X}\mathbf{U}}{\mathbf{V}\mathbf{U}^{\mathrm{T}}\mathbf{U}}, \; \mathbf{U} \leftarrow \mathbf{U} \odot \frac{\mathbf{X}^{\mathrm{T}}\mathbf{V}}{\mathbf{U}\mathbf{V}^{\mathrm{T}}\mathbf{V}} \tag{14}$$

where $\odot$ and the fraction bar denote element-wise multiplication and division, respectively. It is noteworthy that there are many other criteria for measuring the difference between $\mathbf{X}$ and $\mathbf{V}\mathbf{U}^T$, such as the $\ell_1$-norm, the $\ell_{2,1}$-norm, and the Kullback-Leibler divergence (a.k.a. relative entropy). For these criteria, the updating rules can be derived similarly.
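The multiplicative rules of Eq. (14) are short enough to state directly in NumPy; the small constant added to the denominators to avoid division by zero is a common implementation detail and an assumption here, not part of Eq. (14).

```python
import numpy as np

def nmf(X, K, n_iters=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF for a nonnegative X of shape (M, N), following Eqs. (13)-(14)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    V = rng.random((M, K))                        # basis / cluster centroid matrix
    U = rng.random((N, K))                        # coefficient / soft indicator matrix
    for _ in range(n_iters):
        V *= (X @ U) / (V @ U.T @ U + eps)        # left rule of Eq. (14)
        U *= (X.T @ V) / (U @ V.T @ V + eps)      # right rule of Eq. (14)
    return V, U

# Clustering readout: assign sample i to the cluster with the largest coefficient.
# labels = nmf(X, K)[1].argmax(axis=1)
```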

#### 3.3.2. Basic form of multi-view matrix factorization


The hypothesis behind multi-view clustering is that different views should admit the same underlying clustering structure of the dataset. That is, the coefficient matrices learned from different views should be as consistent as possible. To this end, a soft regularization term is introduced to push the coefficient matrices of the different views toward a common consensus [37]. For a given dataset $\mathcal{X} = \{\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(V)}\}$ with $V$ views, the following objective function can be derived to partition $\mathcal{X}$ into $K$ clusters:

$$\min\_{\mathbf{V}^{(v)}, \mathbf{U}^{(v)}} \sum\_{v=1}^{V} \|\mathbf{X}^{(v)} - \mathbf{V}^{(v)}\mathbf{U}^{(v)T}\|\_{F}^{2} + \sum\_{v=1}^{V} \lambda\_{v} \|\mathbf{U}^{(v)} - \mathbf{U}^{\*}\|\_{F}^{2} \\ \text{s.t.} \mathbf{V}^{(v)} \succeq \mathbf{0}, \mathbf{U}^{(v)} \succeq \mathbf{0}, \mathbf{U}^{\*} \succeq \mathbf{0}, \tag{15}$$

where $\mathbf{U}^{*}$ is a consensus matrix that characterizes the intrinsic clustering structure shared by all views, and $\lambda_v$ is the parameter used to tune both the relative importance of different views and the trade-off between the first (reconstruction error) term and the second (disagreement) term. Note that Eq. (15) does not rigidly require all the views to share a common $\mathbf{U}^{*}$; thus, this model is more robust to low-quality views, i.e., the effect of a low-quality view can be reduced by setting the corresponding $\lambda_v$ small enough.

Instead of enforcing a rigid common consensus constraint on all the views as in Eq. (15), another basic form of multi-view NMF for clustering is the pair-wise CoNMF model [38], which imposes similarity constraints on each pair of views. Through the pair-wise co-regularization, it is expected that the coefficient matrices learned from two views can complement each other during the factorization process, so that high-quality clustering results can be obtained. The co-regularization objective function of the pair-wise CoNMF model is defined intuitively as follows:

$$\min\_{\mathbf{V}^{(v)}, \mathbf{U}^{(v)}} \sum\_{v=1}^{V} \lambda\_v \|\mathbf{X}^{(v)} - \mathbf{V}^{(v)}\mathbf{U}^{(v)T}\|\_F^2 + \sum\_{p,q=1}^{V} \lambda\_{pq} \|\mathbf{U}^{(p)} - \mathbf{U}^{(q)}\|\_{F}^{2}, \\ \text{s.t. } \mathbf{V}^{(v)} \succeq \mathbf{0}, \ \mathbf{U}^{(v)} \succeq \mathbf{0}, \tag{16}$$

where $\lambda_v$ is the parameter employed to combine the factorizations of different views and $\lambda_{pq}$ is the weight of the similarity constraint on $\mathbf{U}^{(p)}$ and $\mathbf{U}^{(q)}$. As each column vector of the coefficient matrix $\mathbf{U}$ represents a cluster, when each column is normalized under the vector $\ell_2$-norm, each element of $\mathbf{U}^T\mathbf{U}$ gives the cosine similarity between two clusters. Obviously, in the multi-view environment, the cluster similarities of different views should also be consistent, which results in the cluster-wise CoNMF model. Cluster-wise CoNMF replaces the pair-wise regularization term in Eq. (16) with the following cluster-wise regularization term:

$$\sum\_{p,q=1}^{V} \lambda\_{pq} \|\mathbf{U}^{(p)T}\mathbf{U}^{(p)} - \mathbf{U}^{(q)T}\mathbf{U}^{(q)}\|\_{F}^{2}.\tag{17}$$

Similar to the optimization of the standard single-view NMF model, all the three basic multiview NMF clustering models can be optimized via the multiplicative updating rules.
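As an illustration of such multiplicative updates, the sketch below handles the consensus model of Eq. (15): the rule for $\mathbf{U}^{(v)}$ simply adds the consensus terms $\lambda_v \mathbf{U}^{*}$ and $\lambda_v \mathbf{U}^{(v)}$ to the numerator and denominator of Eq. (14), and $\mathbf{U}^{*}$ is refreshed as the $\lambda$-weighted average of the view-specific coefficient matrices. A single shared $\lambda$ is assumed, and these are illustrative rules rather than the exact updates of [37].

```python
import numpy as np

def multiview_nmf_consensus(views, K, lam=0.1, n_iters=200, eps=1e-10, seed=0):
    """Sketch of Eq. (15): views is a list of nonnegative arrays X^(v) of shape (M_v, N)."""
    rng = np.random.default_rng(seed)
    N = views[0].shape[1]
    Vs = [rng.random((X.shape[0], K)) for X in views]   # per-view basis matrices V^(v)
    Us = [rng.random((N, K)) for _ in views]            # per-view coefficient matrices U^(v)
    U_star = sum(Us) / len(Us)                          # consensus matrix U*
    for _ in range(n_iters):
        for v, X in enumerate(views):
            # Basis update: the standard rule of Eq. (14).
            Vs[v] *= (X @ Us[v]) / (Vs[v] @ Us[v].T @ Us[v] + eps)
            # Coefficient update: Eq. (14) plus the consensus terms of the second sum in Eq. (15).
            Us[v] *= (X.T @ Vs[v] + lam * U_star) / (Us[v] @ Vs[v].T @ Vs[v] + lam * Us[v] + eps)
        # Consensus update: with a single shared lam, the weighted average reduces to the mean.
        U_star = sum(Us) / len(Us)
    return U_star.argmax(axis=1), U_star
```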

#### 3.3.3. Variants of multi-view matrix factorization

As locality-preserving learning and manifold learning have been shown to be very important for promoting the performance of clustering algorithms, Cai et al. [39] propose a graph (or manifold) regularized NMF model, GNMF, for single-view clustering with satisfactory performance. Note that the aforementioned multi-view NMF models cannot preserve the local geometrical structures of the samples. To overcome this limitation, a multi-manifold regularized NMF model (MMNMF) is proposed in [40]. MMNMF incorporates a consensus manifold and a consensus coefficient matrix with multi-manifold regularization to preserve the local geometrical structure of the multi-view data space. Multi-manifold regularization has also been considered in [41]; moreover, the correntropy-induced metric (CIM) is adopted there to measure the reconstruction error, since CIM has achieved excellent performance in many applications and is insensitive to the large errors mainly introduced by heavy noise. A much simpler formulation of the manifold-regularized multi-view NMF model is developed in [42]: without an explicit constraint that enforces a rigid common manifold consensus, an auxiliary matrix is used to constrain the column sums of the basis matrices $\mathbf{V}^{(v)}$ so that the coefficient matrices $\mathbf{U}^{(v)}$ are comparable. A weighted extension of multi-view NMF is presented in [43] to address the image annotation problem. In this model, two weight matrices are introduced: one biases the factorization toward improved reconstruction for rare tags, and the other gives more weight to images containing rare tags and is applied to all views. A weighted extension of the pair-wise CoNMF model has also been developed in [44] to handle attributes that are unobserved in some data samples, so as to resolve the sparseness problem in all views' matrices. For the realistic case in which many views suffer from missing data samples, resulting in many partial examples, Li et al. [45] first devise a partial multi-view clustering method to handle this problem. A multi-incomplete-view clustering method, MIC [46], is also designed to deal with the incompleteness of the views; MIC is built upon the weighted NMF model with an $\ell_{2,1}$-norm regularization. Zhang et al. [47] further propose a constrained multi-view clustering algorithm for unmapped data in the framework of NMF, which uses inter-view constraints to establish the connections between different views.

Due to its great interpretability and high efficacy, NMF has been widely employed for graph clustering [48]. In such a setting, the data matrix X is replaced by the adjacency matrix A. In many applications, graph data may be collected from heterogeneous domains or sources, and integrating multiple graphs has been shown to be a promising way to improve graph clustering accuracy. Clearly, multi-view NMF is suitable for processing multiple graphs. In [49], a flexible and robust NMF-based framework, named co-regularized graph clustering (CGC), is developed to address the multi-domain graph clustering problem. CGC supports many-to-many cross-domain node relationships, and it also incorporates weights on cross-domain relationships. Besides, CGC allows partial cross-domain mapping so that graphs in different domains may have different sizes. In many real-world applications, however, different graphs have different node distributions, so the assumption that the multiple graphs share a common clustering structure does not hold. Given this, Ni et al. [50] develop a novel two-phase clustering method, NoNClus, based on the NMF framework. First, a main graph is constructed by modeling the similarity between different domains. Then, the main graph is utilized to regularize the clustering structures in the domain-specific graphs. In the NoNClus model, multiple underlying clustering structures can co-exist among the domain-specific graphs, while for similar domains the corresponding clustering structures should be as consistent as possible.

#### 3.4. Multi-view clustering via tensor decomposition

In this part, we analyze multi-view clustering from a multilinear algebra perspective and present several novel multi-view clustering algorithms (note that the notations used in this part are self-contained). A tensor is a multidimensional matrix, or multiway array [51]. In the multi-view research field, data can be naturally modeled as a third-order tensor with object, feature, and view dimensions. An intuitive way is to stack the different views along the view dimension of the tensor (see Figure 1). Another widely adopted way is to transform each feature matrix into a similarity matrix before stacking them.
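Both constructions take only a couple of NumPy lines (an illustrative sketch; the Gaussian similarity with unit bandwidth is just one possible choice):

```python
import numpy as np

# V feature matrices of shape (N objects, M features) stacked into an N x M x V tensor
rng = np.random.default_rng(0)
feature_views = [rng.normal(size=(100, 20)) for _ in range(3)]
X_feat = np.stack(feature_views, axis=2)                     # shape (100, 20, 3)

# alternative: turn each view into an N x N similarity matrix first, then stack
sim_views = [np.exp(-np.square(F[:, None, :] - F[None, :, :]).sum(-1))
             for F in feature_views]
X_sim = np.stack(sim_views, axis=2)                          # shape (100, 100, 3)
print(X_feat.shape, X_sim.shape)
```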

#### 3.4.1. Preliminaries of tensor decomposition

In the field of data mining and machine learning, tensor decomposition is an emerging and effective tool for processing multi-view data. In this section, some basic knowledge on tensors and tensor decomposition methods is provided. We refer the readers to [51, 52] for a comprehensive understanding of these topics.

#### 3.4.1.1. Notations

Let $\mathcal{X}$ be an $m$-order tensor of size $I_1 \times I_2 \times \cdots \times I_m$. The mode-$p$ matricization of $\mathcal{X}$ is denoted by the $I_p \times (I_1 \cdots I_{p-1} I_{p+1} \cdots I_m)$ matrix $\mathbf{X}_{(p)}$, which is obtained by arranging the mode-$p$ fibers to be the columns of $\mathbf{X}_{(p)}$. The $p$-mode multiplication $\mathcal{Y} = \mathcal{X} \times_p \mathbf{U}$ can be carried out as the matrix multiplication $\mathbf{Y}_{(p)} = \mathbf{U}\mathbf{X}_{(p)}$, where $\mathbf{U} \in \mathbb{R}^{J_p \times I_p}$ and $\mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_{p-1} \times J_p \times I_{p+1} \times \cdots \times I_m}$. The Frobenius norm of a tensor $\mathcal{X}$ is the square root of the sum of the squares of all its elements $x_{i_1 i_2 \ldots i_m}$. The tensor $\mathcal{X}$ is a rank-one tensor if it can be written as the outer product of $m$ vectors, i.e., $\mathcal{X} = \mathbf{x}^{(1)} \circ \mathbf{x}^{(2)} \circ \cdots \circ \mathbf{x}^{(m)}$, where $\circ$ represents the vector outer product.
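For concreteness, mode-$p$ matricization and the $p$-mode product can be sketched in NumPy as follows (a minimal illustration; the helper names are our own):

```python
import numpy as np

def unfold(X, p):
    """Mode-p matricization: move mode p to the front and flatten the remaining modes."""
    return np.moveaxis(X, p, 0).reshape(X.shape[p], -1)

def mode_product(X, U, p):
    """p-mode product Y = X x_p U, computed as Y_(p) = U @ X_(p) and folded back."""
    new_shape = [U.shape[0]] + [s for i, s in enumerate(X.shape) if i != p]
    Yp = U @ unfold(X, p)
    return np.moveaxis(Yp.reshape(new_shape), 0, p)

# toy third-order tensor (objects x features x views)
X = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.random.default_rng(0).normal(size=(5, 3))   # maps mode-1 size 3 -> 5
print(mode_product(X, U, 1).shape)                  # (2, 5, 4)
```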

Figure 1. Visualization of the process of transforming the feature matrices to a third-order tensor.

#### 3.4.1.2. CP decomposition

The idea of expressing a tensor as the sum of a number of rank-one tensors dates back to the work of Hitchcock [53]. Later, Cattell [54] proposed the idea of parallel proportional analysis. The popular CP decomposition comes from the ideas of Carroll and Chang [55] (canonical decomposition) and Harshman [56] (parallel factors). Taking a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ as an example, the CP decomposition tries to approximate $\mathcal{X}$ with $R$ rank-one components, i.e.,

$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r, \tag{18}$$

where $\mathbf{u}_r \in \mathbb{R}^{I}$, $\mathbf{v}_r \in \mathbb{R}^{J}$, and $\mathbf{w}_r \in \mathbb{R}^{K}$. For simplicity, we denote $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_R]$, $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_R]$, $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_R]$, and $[\![\mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$ as the CP decomposition of $\mathcal{X}$.
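For illustration, the CP reconstruction in Eq. (18) can be written compactly with `np.einsum` (a minimal sketch; the random factors are placeholders only):

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """Rebuild sum_r u_r o v_r o w_r from CP factor matrices
    U (I x R), V (J x R), and W (K x R)."""
    return np.einsum("ir,jr,kr->ijk", U, V, W)

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 3, 2
U, V, W = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
X = cp_reconstruct(U, V, W)   # an exactly rank-R tensor of shape (I, J, K)
print(X.shape)                # (4, 5, 3)
```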

#### 3.4.1.3. Tucker decomposition

The idea of the Tucker decomposition was introduced by Tucker [57]. The Tucker decomposition is a form of higher-order singular value decomposition (HOSVD) [58]. It decomposes a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ into a core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ multiplied by an orthogonal matrix along each mode, i.e.,

$$\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, \mathbf{u}_p \circ \mathbf{v}_q \circ \mathbf{w}_r. \tag{19}$$

A state-of-the-art technique for computing the factor matrices is proposed in [59].
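A truncated HOSVD in the spirit of Eq. (19) can be sketched in a few lines of NumPy (illustrative only; in practice one would typically rely on a dedicated tensor library such as TensorLy):

```python
import numpy as np

def hosvd(X, ranks):
    """Truncated HOSVD: each factor matrix holds the leading left singular vectors of
    the corresponding mode unfolding; the core is X multiplied by their transposes."""
    factors = []
    for p, r in enumerate(ranks):
        Xp = np.moveaxis(X, p, 0).reshape(X.shape[p], -1)   # mode-p unfolding
        U, _, _ = np.linalg.svd(Xp, full_matrices=False)
        factors.append(U[:, :r])
    G = X
    for p, U in enumerate(factors):                          # G = X x_1 U^T x_2 V^T x_3 W^T
        Gp = U.T @ np.moveaxis(G, p, 0).reshape(G.shape[p], -1)
        new_shape = [U.shape[1]] + [s for i, s in enumerate(G.shape) if i != p]
        G = np.moveaxis(Gp.reshape(new_shape), 0, p)
    return G, factors

X = np.random.default_rng(0).normal(size=(6, 5, 4))
G, (U, V, W) = hosvd(X, ranks=(3, 3, 2))
print(G.shape)   # core tensor of shape (3, 3, 2)
```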

#### 3.4.2. Tensor decomposition-based multi-view clustering

In multi-view clustering, the goal is to find meaningful groups of objects in the data. The above CP decomposition naturally divides the multi-view data into several components, which can be viewed as clusters. Thus, it can be directly applied to multi-view clustering problems. For a given dataset $\mathcal{X} = \{\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(V)}\}$ with $V$ views, where each $\mathbf{X}^{(v)}$ takes values in $\mathbb{R}^{N \times M}$, $\mathcal{X}$ can be formulated as a third-order tensor $\mathcal{X} \in \mathbb{R}^{N \times M \times V}$. In this part, a variant of CP decomposition is introduced first, which is quite straightforward. Then we shed light on the relations between several classic multi-view spectral clustering methods and the Tucker decomposition.

#### 3.4.2.1. Total variation based CP (TVCP)

In some clustering problems, a consecutive range of time points is non-negligible. For example, in a dataset with authors, publications, and a sequence of time points, we are interested in figuring out which groups of authors work on the same topics during a period of time. Chen et al. [60] propose a total variation based tensor decomposition method (TVCP) to impose a constraint over consecutive time points. The total variation regularizes the time factor toward a piece-wise constant function w.r.t. the time points. Owing to the piece-wise constant function, the decomposition can be relatively consistent within a cluster and well separated between clusters. The TVCP model is formulated as follows:

$$\min_{[\![\mathbf{U},\mathbf{V},\mathbf{W}]\!]} \frac{1}{2} \Big\|\mathcal{X} - \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r\Big\|_F^2 + \tau \sum_{r=1}^{R} \|\mathbf{F} \mathbf{w}_r\|_{1}, \tag{20}$$

where $\mathbf{F}$ is the $(V-1) \times V$ first-order difference matrix with $f_{ii} = 1$ and $f_{i(i+1)} = -1$ for $i = 1, 2, \ldots, V-1$ and all other entries zero, $\tau$ is a positive regularization parameter, and $\|\cdot\|_1$ denotes the $\ell_1$-norm. The first term corresponds to the CP decomposition of $\mathcal{X}$, and the second term constrains the time mode ($\mathbf{w}$) to be a piece-wise constant function.
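The difference matrix $\mathbf{F}$ and the total-variation penalty in Eq. (20) are straightforward to write out (a minimal sketch; the example time factor is made up):

```python
import numpy as np

def first_order_difference(V):
    """(V-1) x V matrix with f_ii = 1 and f_i(i+1) = -1, zeros elsewhere."""
    F = np.zeros((V - 1, V))
    idx = np.arange(V - 1)
    F[idx, idx], F[idx, idx + 1] = 1.0, -1.0
    return F

def tv_penalty(W, tau):
    """tau * sum_r ||F w_r||_1 for a time-factor matrix W of shape (V, R)."""
    F = first_order_difference(W.shape[0])
    return tau * np.abs(F @ W).sum()

# columns that are piece-wise constant over time incur only a small penalty
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.2, 0.9], [0.2, 0.9]])
print(tv_penalty(W, tau=0.1))
```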

#### 3.4.2.2. Relations between Tucker decomposition and spectral clustering

Liu et al. [61] propose a framework of multi-view clustering via tensor decomposition, mainly the Tucker decomposition. According to this framework, the common form of multi-view spectral clustering is equivalent to the following Tucker decomposition problem:

$$\min_{\mathbf{U}} \sum_{v=1}^{V} \mathrm{tr}\big(\mathbf{U}^{T}\mathbf{L}^{(v)}\mathbf{U}\big) \;\; \text{s.t.}\;\; \mathbf{U}^{T}\mathbf{U} = \mathbf{I} \quad \Leftrightarrow \quad \max_{\mathbf{U}} \big\|\mathcal{X} \times_{1} \mathbf{U}^{T} \times_{2} \mathbf{U}^{T} \times_{3} \mathbf{I}^{T}\big\|_{F}^{2} \;\; \text{s.t.}\;\; \mathbf{U}^{T}\mathbf{U} = \mathbf{I}. \tag{21}$$

Another form of multi-view spectral clustering can also be written as a Tucker problem:

$$\min_{\mathbf{U}, \boldsymbol{\mu}} \mathrm{tr}\Big(\mathbf{U}^T \Big(\sum_{v=1}^V \mu_v \mathbf{L}^{(v)}\Big) \mathbf{U}\Big) \;\; \text{s.t.}\;\; \mathbf{U}^T \mathbf{U} = \mathbf{I},\; \mu_v \ge 0,\; \sum_{v=1}^V \mu_v = 1 \quad \Leftrightarrow \quad \max_{\mathbf{U}, \boldsymbol{\mu}} \big\|\mathcal{X} \times_1 \mathbf{U}^T \times_2 \mathbf{U}^T \times_3 \boldsymbol{\mu}^T\big\|_F^2 \;\; \text{s.t.}\;\; \mathbf{U}^T \mathbf{U} = \mathbf{I},\; \mu_v \ge 0,\; \sum_{v=1}^V \mu_v = 1. \tag{22}$$

With this framework, a variety of spectral clustering problems can be solved by a tensor decomposition algorithm. We can see the strong connection between the two as well as the strong capability of the tensor methodology.
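The left-hand side of Eq. (21) is the familiar multi-view spectral embedding, which can be sketched directly in NumPy (an illustrative toy; the normalized Laplacian and the random affinities are our own choices, and the rows of the embedding would then be clustered with, e.g., k-means):

```python
import numpy as np

def multiview_spectral_embedding(W_list, k):
    """min_U sum_v tr(U^T L^(v) U) s.t. U^T U = I: take the k eigenvectors of the
    summed (normalized) Laplacians associated with the smallest eigenvalues."""
    L_sum = np.zeros_like(W_list[0])
    for W in W_list:
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L_sum += np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sum)     # ascending eigenvalues
    return eigvecs[:, :k]

rng = np.random.default_rng(0)
views = [np.abs(rng.normal(size=(8, 8))) for _ in range(3)]
views = [(W + W.T) / 2 for W in views]           # symmetric toy affinity matrices
U = multiview_spectral_embedding(views, k=2)
print(U.shape)                                   # (8, 2)
```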

Canonical correlation analysis is designed to inspect the linear relationship between two sets of variables [62]. In multi-view learning, a typical approach is to maximize the sum of pair-wise correlations between different views [63]. To avoid losing high-order correlations, Luo et al. [64] propose tensor canonical correlation analysis (TCCA), which is equivalent to a CP decomposition of the correlation tensor. Khan et al. [65] propose a Bayesian extension of CP decomposition for multiple coupled tensors sharing common latent factors.

#### 3.5. Multi-view clustering via deep learning

With the third wave of artificial intelligence, deep learning has gained increasing popularity in recent years. It has demonstrated excellent performance in many real-world applications, such as face recognition, image annotation, natural language processing, object detection, customer relationship management, and mobile advertising. Typically, deep learning models are composed of multiple nonlinear transformations and thus can learn better feature representations than traditional shallow models [66]. However, deep learning usually requires labeled training data, which limits its application to data clustering because training data with cluster labels are unavailable in many cases. Despite this difficulty, some works have been devoted to adapting shallow clustering models to deep learning. Here, we introduce two popular deep clustering models and their extensions to the multi-view environment.

#### 3.5.1. Deep auto-encoder

An auto-encoder [67] is an artificial neural network adopted for unsupervised learning, the goal of which is to learn a representation for each data sample. An auto-encoder consists of two parts: the encoder and the decoder. The encoder plays the role of a nonlinear mapping function that maps each data sample to a representation space. The decoder demands accurate reconstruction of the data from the representation generated by the encoder. The auto-encoder has been shown to be similar to spectral clustering in theory; however, it is more efficient and flexible in practice. The auto-encoder can easily be deepened by adding more encoder layers and corresponding decoder layers. Figure 2(a) gives an example of the framework of the deep auto-encoder.

Figure 2. Frameworks of deep auto-encoder and deep matrix factorization (depth is 2).

Although the auto-encoder can learn a compact representation for each data sample, it contributes little to clustering since it does not require that the representation vectors of similar data samples also be similar. To make the learned feature representation better capture the cluster structures, many variants of deep auto-encoder models have been proposed. In [68], a novel regularization term that is similar to the objective function of k-means is introduced to guide the learning of the mapping function. In this way, the learned feature representation is more stable and suitable for clustering. In [69], a deep embedded clustering method is proposed to simultaneously learn feature representations and cluster assignments using deep auto-encoders. These deep clustering models are designed for single-view data. For deep multi-view clustering, the learned feature representations should not only capture the cluster structure of each single view but also reach a consensus between different views. To this end, a common encoder is utilized to extract the shared feature representation for all views, and different decoders are used to reconstruct the view-specific input data samples [70]. In [71], an extension of CCA based on deep neural networks is proposed to learn a shared representation of two views. In fact, the feature representations of the two views are not exactly the same, but their correlations are maximized. Following this line, the deep canonically correlated auto-encoder (DCCAE) is developed in [72]. DCCAE simultaneously optimizes the canonical correlation between the learned feature representations and the reconstruction errors of the auto-encoders. Benton et al. [73] further extend the deep CCA model for multiple views.
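The shared-encoder idea described above can be sketched in a few lines of PyTorch (an illustrative toy, not the exact architecture of [70]; the layer sizes and the plain sum of reconstruction losses are our own assumptions):

```python
import torch
import torch.nn as nn

class MultiViewAutoEncoder(nn.Module):
    """One shared encoder for all views, one decoder per view."""
    def __init__(self, view_dims, latent_dim=32):
        super().__init__()
        in_dim = sum(view_dims)                        # concatenated views as input
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, d))
            for d in view_dims)

    def forward(self, views):                          # views: list of (batch, d_v) tensors
        z = self.encoder(torch.cat(views, dim=1))      # shared representation
        return z, [dec(z) for dec in self.decoders]    # view-specific reconstructions

model = MultiViewAutoEncoder(view_dims=[20, 50])
views = [torch.randn(8, 20), torch.randn(8, 50)]
z, recons = model(views)
loss = sum(nn.functional.mse_loss(r, v) for r, v in zip(recons, views))
loss.backward()   # z can later be clustered with, e.g., k-means
```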

#### 3.5.2. Deep matrix factorization

Another line of developing deep clustering models is deepening the MF models. As shown earlier, MF, especially NMF, has demonstrated outstanding performance in many applications. Thus, it is worth building a deep structure for MF in the hope that better feature representations can be obtained to facilitate clustering. Figure 2(b) illustrates an example of the framework of the deep MF models. Like the deep auto-encoder, deep MF tries to minimize reconstruction errors; unlike the deep auto-encoder, however, the mapping function of deep MF is linear.

The first nonnegative deep network based on NMF is proposed in [74] for speech separation. This architecture can be discriminatively trained for optimal separation performance. Then Li et al. [75] propose a novel weakly supervised deep MF model to uncover the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure. In [76], a deep semi-NMF model is further developed for learning latent attribute representations. Semi-NMF is a popular variant of NMF that relaxes the factorized basis matrix to be real-valued. This makes semi-NMF applicable far more widely than NMF, since real-world datasets may contain complex information; for instance, the attributes may be mixed-signed. Considering that these deep MF models only factorize the basis matrix hierarchically, Qiu et al. [77] further propose a deep orthogonal NMF model that decomposes the coefficient matrix hierarchically. This model is able to learn higher-level representations for clusters. These deep MF models have achieved great success in clustering single-view data. However, they are seldom utilized for multi-view clustering. A recent work [78] attempts to extend the deep semi-NMF model for multi-view clustering, which can dissemble unimportant factors layer by layer and generate an effective consensus representation in the last layer. Another work [79] proposes to address the incomplete multi-view clustering problem via deep semantic mapping. The proposed model first projects all incomplete multi-view data to a unified representation in a common subspace, which is further processed by a standard shallow NMF for clustering.
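As a rough illustration of the layer-wise idea behind deep MF (a generic sketch, not the algorithm of any particular paper cited above), the inner factor of one NMF layer can itself be factorized again, yielding a hierarchy of factors:

```python
import numpy as np
from sklearn.decomposition import NMF

def layerwise_nmf(X, layer_sizes, seed=0):
    """Greedy layer-wise NMF: factorize X ~ W1 H1, then H1 ~ W2 H2, ...,
    so that X ~ W1 W2 ... Hk.  A toy sketch of the deep MF idea."""
    Ws, H = [], X
    for r in layer_sizes:
        model = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=seed)
        W = model.fit_transform(H)        # H ~ W @ model.components_
        Ws.append(W)
        H = model.components_
    return Ws, H

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 40)))    # nonnegative toy data
Ws, H = layerwise_nmf(X, layer_sizes=[20, 5])
print([W.shape for W in Ws], H.shape)     # [(100, 20), (20, 5)] (5, 40)
```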

4. Open datasets

No one can make bricks without straw. In this section, we first list two kinds of open datasets that can be used for multi-view clustering, i.e., feature-based and graph-based datasets. Then we briefly discuss the performance of multi-view clustering on them.
