**2.1 Problem formulation**

We assume a set of $n$ data observations, each represented as an $m$-dimensional vector $x_k(t) \in \mathbb{R}^m$, where $k = 1, 2, \dots, n$. Moreover, we consider $T$ data modalities, indexed by $t = 1, 2, 3, \dots, T$. Each data modality can then be described by the matrix $\mathbf{X}(t) = [\,x_1(t)\; x_2(t)\; \dots\; x_n(t)\,]$, with $x_k(t) \in \mathbb{R}^m$. Our objective is to assign the data observations to clusters that can be efficiently represented by low-dimensional subspaces. This is equivalent to finding a partitioning $X_1(t), X_2(t), \dots, X_P(t)$ of the $n$ observations, where $P$ is the total number of clusters underlying each data modality, indexed by $p$. Furthermore, each linear subspace can be described as $S_p(t) \subset \mathbb{R}^m$ with $\dim S_p(t) \ll m$.

We will exploit the self-expressive property presented in [1, 26], which entails that each data observation $x_i(t)$ can be represented as a linear combination of the other data points lying in the same subspace $S(x_i(t))$, as follows,

$$x_i(t) = \sum_{j \neq i,\; x_j(t) \in S(x_i(t))} w_{ij}(t)\, x_j(t). \tag{1}$$

If we stack all the data points $x_i(t)$ as columns of the data matrix $\mathbf{X}(t)$, the self-expressive property can be written in matrix form as follows,

$$\mathbf{X}(t) = \mathbf{X}(t)\mathbf{W}(t) \quad \text{s.t.} \quad \mathbf{W}_{ii}(t) = 0. \tag{2}$$

The important information about the relations among data samples is then recorded in the self-representation coefficient matrix $\mathbf{W}(t)$. Under a suitable arrangement/permutation of the data realizations, the sparse coefficient matrix $\mathbf{W}(t)$ is an $n \times n$ block-diagonal matrix with zero diagonal, provided that each sample is represented only by other samples from the same subspace. More precisely, $W_{ij}(t) = 0$ whenever the indexes $i, j$ correspond to samples from different subspaces. As a result, the majority of the elements in $\mathbf{W}(t)$ are equal to zero. A diagram of our algorithm is depicted in **Figure 1**.

Our algorithm consists of three main stages. The first stage is the encoder, which maps the input modalities into a latent space. The encoder consists of $T$ parallel CNNs, where $T$ is the number of data modalities. Each modality is fed into one network, and the output of each network is the projection of that modality onto its corresponding hidden/latent space. The second component of the autoencoder is a bank of $T$ self-expressive layers, whose goal is to enforce the self-expressive property among the data observations of each modality. Each self-expressive layer is a fully connected layer that operates independently on the output of the corresponding encoder. The last stage is the decoder, which reconstructs the input data from the self-expressive layers' outputs. The objective function sought through this approximation network is given in Eq. (5). The group sparsity introduced in [23] requires the minimization of the group norm of the matrices $\mathbf{W}(t)$, which in turn entails a smaller angle between the different spaces across all modalities, thus promoting the goal of obtaining a common latent space. Note that minimizing the group norm yields a solution that is group-sparse across data modalities.

**Figure 1.** *Deep robust group subspace clustering diagram.*
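To make the three-stage pipeline concrete, the following is a minimal TensorFlow sketch of a single modality branch; it is not the authors' released implementation, and the input shape, layer sizes, and names such as `SelfExpressiveLayer`, `build_branch`, and `latent_dim` are illustrative assumptions. The self-expressive layer simply holds a trainable $n \times n$ coefficient matrix and multiplies it with the latent codes of all $n$ samples, with the diagonal zeroed to honor the constraint $\mathbf{W}_{ii}(t) = 0$ of Eq. (2).

```python
# Minimal sketch of one modality branch: CNN encoder, self-expressive layer, decoder.
# The 32x32x1 input shape, layer sizes, and all names are illustrative assumptions.
import tensorflow as tf


class SelfExpressiveLayer(tf.keras.layers.Layer):
    """Fully connected layer holding the trainable n x n coefficient matrix W(t)."""

    def __init__(self, n_samples):
        super().__init__()
        self.W = self.add_weight(
            shape=(n_samples, n_samples),
            initializer=tf.keras.initializers.RandomNormal(stddev=1e-4),
            trainable=True,
            name="W",
        )

    def call(self, z):
        # z holds the latent codes of all n samples (as rows); zero the diagonal so
        # no sample is represented by itself, per the constraint in Eq. (2).
        w = self.W - tf.linalg.diag(tf.linalg.diag_part(self.W))
        return tf.matmul(w, z), w


def build_branch(n_samples, input_shape=(32, 32, 1), latent_dim=64):
    """Encoder / self-expressive layer / decoder triple for a single modality t."""
    encoder = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(latent_dim),
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(latent_dim,)),
        tf.keras.layers.Dense(16 * 16 * 16, activation="relu"),
        tf.keras.layers.Reshape((16, 16, 16)),
        tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same"),
    ])
    return encoder, SelfExpressiveLayer(n_samples), decoder
```

One such branch would be instantiated per modality, and the $T$ branches trained jointly under the common loss of Eq. (5).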

If, in addition, we constrain the coefficient matrices corresponding to the different data modalities to commute, we ensure that they share the same eigenvectors. The idea of commutation has been used in [27–29]. We define $\Omega = \{\mathbf{W}(t)\}_{t=1}^{T}$, where $\mathbf{W}(t) = [w_{kj}(t)]_{k,j}$, and the group norm $\|\Omega\|_{1,2}$ as:

$$\|\Omega\|_{1,2} = \sum_{k,j} \sqrt{\sum_{t=1}^{T} w_{k,j}^2(t)}. \tag{3}$$

We also define the commutator $[\mathbf{W}(t_1), \mathbf{W}(t_2)]$ as,

$$[\mathbf{W}(t_1), \mathbf{W}(t_2)] = \mathbf{W}(t_1)\mathbf{W}(t_2) - \mathbf{W}(t_2)\mathbf{W}(t_1) = \mathbf{0}. \tag{4}$$

The loss function is then rewritten as,

$$\begin{aligned} \min_{\{\mathbf{W}(t)\}_{t=1}^{T}} \; & \sum_{t_1, t_2}^{T} \left\| \left[ \mathbf{W}(t_1), \mathbf{W}(t_2) \right] \right\|^2 + \left\| \Omega \right\|_{1,2} \\ & + \frac{\gamma}{2} \sum_{t=1}^{T} \left\| \mathbf{X}(t) - \mathbf{X}_r(t) \right\|_F^2 \\ & + \rho \sum_{t=1}^{T} \left\| \mathbf{W}(t) \right\|_1 + \frac{\mu}{2} \sum_{t=1}^{T} \left\| \mathbf{L}(t) - \mathbf{L}(t)\mathbf{W}(t) \right\|_F^2 \end{aligned} \tag{5}$$

where $\mathbf{X}_r(t)$ denotes the reconstructed data corresponding to modality $t$, and $\mathbf{L}(t)$ is the output of the $t$-th encoder with input $\mathbf{X}(t)$. $\mathbf{W}(t)$ is the sparse coefficient matrix that ties together the data observations of modality $t$. DRoGSuRe is implemented in TensorFlow, and the loss function is minimized using the adaptive momentum based gradient descent method (ADAM) [30]. For each data modality, the weights of the encoder, the self-expressive layer and the decoder are computed individually; fine-tuning the weights, however, is driven by the loss function, which depends on the group norm and on the pairwise product differences between the sparse coefficient matrices. $\|\cdot\|_1$ denotes the $l_1$ norm, i.e., the sum of the absolute values of the argument. The Lagrangian objective functional may be rewritten as,

$$\begin{aligned} L(\mathbf{W}(t)) = & \sum_{t_1, t_2}^{T} \left\| \left[ \mathbf{W}(t_1), \mathbf{W}(t_2) \right] \right\|^2 + \left\| \Omega \right\|_{1,2} \\ & + \rho \sum_{t=1}^{T} \left\| \mathbf{W}(t) \right\|_1 + \frac{\gamma}{2} \sum_{t=1}^{T} \left\| \mathbf{X}(t) - \mathbf{X}_r(t) \right\|_F^2 \\ & + \sum_{t=1}^{T} \frac{\mu}{2} \left\| \mathbf{L}(t)\mathbf{W}(t) - \mathbf{L}(t) \right\|_F^2 \\ & + \sum_{t=1}^{T} \left\langle \mathbf{L}(t)\mathbf{W}(t) - \mathbf{L}(t),\, \mathbf{Y}(t) \right\rangle \end{aligned} \tag{6}$$
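As a rough sketch of how the terms of Eq. (5) can be assembled, the snippet below evaluates the commutator penalty, the group $(1,2)$-norm of Eq. (3), the reconstruction error, the elementwise $l_1$ norm, and the self-expression term. The tensor lists (`W_list`, `X_list`, `Xr_list`, `L_list`) and the weights `gamma`, `rho`, `mu` are assumed inputs, samples are stored as columns of each $\mathbf{L}(t)$ as in Eq. (5), and the Lagrangian terms of Eq. (6) are omitted; this is not the authors' exact implementation.

```python
# Hedged sketch of the loss terms in Eq. (5); all inputs are assumed to be provided
# by the caller (T coefficient matrices, inputs, reconstructions, latent codes).
import tensorflow as tf


def drogsure_loss(W_list, X_list, Xr_list, L_list, gamma, rho, mu):
    T = len(W_list)

    # Pairwise commutator penalty: sum_{t1,t2} ||W(t1)W(t2) - W(t2)W(t1)||^2.
    commutator = tf.add_n([
        tf.reduce_sum(tf.square(tf.matmul(W_list[t1], W_list[t2])
                                - tf.matmul(W_list[t2], W_list[t1])))
        for t1 in range(T) for t2 in range(T) if t1 != t2
    ])

    # Group (1,2)-norm of Eq. (3): l2 across modalities, summed over all entries.
    stacked = tf.stack(W_list, axis=0)                                # (T, n, n)
    group_norm = tf.reduce_sum(
        tf.sqrt(tf.reduce_sum(tf.square(stacked), axis=0) + 1e-12))

    # Reconstruction, elementwise l1, and self-expression terms of Eq. (5).
    recon = tf.add_n([tf.reduce_sum(tf.square(X - Xr))
                      for X, Xr in zip(X_list, Xr_list)])
    l1 = tf.add_n([tf.reduce_sum(tf.abs(W)) for W in W_list])
    self_exp = tf.add_n([tf.reduce_sum(tf.square(L - tf.matmul(L, W)))
                         for L, W in zip(L_list, W_list)])

    return (commutator + group_norm + 0.5 * gamma * recon
            + rho * l1 + 0.5 * mu * self_exp)
```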

Letting $\hat{\mathbf{W}}(t) = \mathbf{I} - \mathbf{W}(t)$, we update $\mathbf{W}(t)$ as follows,

$$\begin{aligned} \mathbf{W}_{k+1}(t) = \arg\min_{\mathbf{W}(t)} \; & \sum_{t_1, t_2}^{T} \left\| \left[ \mathbf{W}(t_1), \mathbf{W}(t_2) \right] \right\|^2 + \left\| \Omega \right\|_{1,2} + \rho \left\| \mathbf{W}(t) \right\|_1 \\ & + \left\langle \mathbf{L}_{k+1}(t)\mathbf{W}(t) - \mathbf{L}_{k+1}(t),\, \mathbf{Y}_k(t) \right\rangle \\ & + \frac{\mu_k}{2} \left\| \mathbf{L}_{k+1}(t)\mathbf{W}(t) - \mathbf{L}_{k+1}(t) \right\|_F^2 \end{aligned} \tag{7}$$

Similar to [4], we utilize linearized ADMM [31] to approximate the minimizer of Eq. (7), since an exact algorithmic solution is complicated and the optimization functional is non-convex. It has been shown that linearized ADMM is very effective for $l_1$ minimization problems and that the augmented Lagrange multiplier (ALM) method can handle the non-convexity of the problem [32, 33]. Therefore, with an appropriate augmented Lagrange multiplier $\mu_k$, we can compute the global optimizer by solving the dual problem. The solution to Eq. (7) can be approximated, using linearized soft thresholding, as follows,

$$\begin{aligned} \mathbf{W}_k^{+}(t) = \operatorname{prox}_{\frac{\rho}{\eta_1}}\Bigg( \mathbf{W}_k(t) & + \frac{\mathbf{L}_{k+1}^{T}(t)\left( \mathbf{L}_{k+1}(t)\hat{\mathbf{W}}_k(t) - \frac{\mathbf{Y}_k(t)}{\mu_k} \right)}{\eta_1} \\ & + \sum_{t_1, t_2 = 1,\, t_1 \neq t_2}^{T} \Big\{ \big( \mathbf{W}_k(t_1)\mathbf{W}_k(t_2) - \mathbf{W}_k(t_2)\mathbf{W}_k(t_1) \big)\mathbf{W}_k^{T}(t_2) \\ & \qquad + \mathbf{W}_k^{T}(t_2)\big( \mathbf{W}_k(t_1)\mathbf{W}_k(t_2) - \mathbf{W}_k(t_2)\mathbf{W}_k(t_1) \big) \Big\} \Bigg) \\ \mathbf{W}_{k+1}(t) & = \gamma_{\frac{\rho}{\eta_1}}\big( \mathbf{W}_k^{+}(t) \big) \end{aligned} \tag{8}$$

where $\eta_1 \geq \|\mathbf{L}\|_2^2$. We alternately update $\mathbf{L}(t)$ as,

$$\mathbf{L}\_{k+1}(t) = \mathbf{L}\_k(t) + \mu\_k \left( \mathbf{L}\_k(t) \hat{\mathbf{W}}\_{k+1}(t) - \frac{\mathbf{Y}\_k(t)}{\mu\_k} \right) \hat{\mathbf{W}}\_{k+1}^T(t). \tag{10}$$
 
where

$$\operatorname{prox}_{\theta}\big(\mathbf{A}_{ij}(t)\big) = \mathbf{A}_{ij}(t) \cdot \frac{\max\left\{ \sqrt{\sum_{t=1}^{T} \mathbf{A}_{ij}(t)^2} - \theta,\; 0 \right\}}{\sqrt{\sum_{t=1}^{T} \mathbf{A}_{ij}(t)^2}}$$

and

$$\gamma_{\tau}\big(\mathbf{B}_{i,j}\big) = \operatorname{sign}\big(\mathbf{B}_{i,j}\big) \cdot \max\big\{ \left|\mathbf{B}_{i,j}\right| - \tau,\; 0 \big\}.$$

The Lagrange multipliers are updated as follows,

$$\mathbf{Y}\_{k+1}(t) = \mathbf{Y}\_k(t) + \mu\_k(\mathbf{L}\_{k+1}(t)\mathbf{W}\_{k+1}(t) - \mathbf{L}\_{k+1}(t))\tag{11}$$

$$\mu_{k+1} = \epsilon \mu_k \tag{12}$$
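The two shrinkage operators used in these updates can be sketched as follows; `W_stack` is an assumed `(T, n, n)` array stacking the coefficient matrices across modalities, and `dual_update` mirrors Eqs. (11) and (12) for a single modality. The names and array layout are assumptions for illustration.

```python
# Sketch of the group soft-thresholding prox_theta (applied jointly across the T
# matrices) and the elementwise soft-thresholding gamma_tau used for the l1 term.
import numpy as np


def prox_group(W_stack, theta):
    """prox_theta: shrink each (i, j) group {W_ij(1), ..., W_ij(T)} by its l2 norm."""
    norms = np.sqrt(np.sum(W_stack ** 2, axis=0, keepdims=True))   # shape (1, n, n)
    scale = np.maximum(norms - theta, 0.0) / np.maximum(norms, 1e-12)
    return W_stack * scale


def soft_threshold(B, tau):
    """gamma_tau: elementwise soft-thresholding, sign(B) * max(|B| - tau, 0)."""
    return np.sign(B) * np.maximum(np.abs(B) - tau, 0.0)


def dual_update(L, W, Y, mu, eps):
    """Lagrange multiplier and penalty updates of Eqs. (11)-(12) for one modality."""
    Y_next = Y + mu * (L @ W - L)
    return Y_next, eps * mu
```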

After computing the gradient of the loss function, the weights of the multi-layer network corresponding to one modality are updated while the networks of the other modalities are held fixed. In other words, after reconstructing the data during the forward pass, the loss function determines the updates that back-propagate through each layer. The encoder of the first modality is updated, then the self-expressive layer of that modality, and finally its decoder.
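A schematic training step consistent with this alternating scheme is sketched below; `branches` is assumed to be a list of (encoder, self-expressive layer, decoder) triples as in the earlier sketch, and `compute_loss` stands for an evaluation of the multimodal loss of Eq. (5) over all branches.

```python
# Schematic alternating update: for each modality t, apply gradients only to that
# modality's encoder, self-expressive layer and decoder while the other branches
# stay fixed. `branches`, `X_list` and `compute_loss` are assumed to exist.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)


def train_step(branches, X_list, compute_loss):
    for t, (encoder, self_exp, decoder) in enumerate(branches):
        # Only the variables of modality t are updated in this pass.
        variables = (encoder.trainable_variables
                     + self_exp.trainable_variables
                     + decoder.trainable_variables)
        with tf.GradientTape() as tape:
            loss = compute_loss(branches, X_list)   # full multimodal loss of Eq. (5)
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))
```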


Since the weights corresponding to each modality depend on the other modalities, we update the part of the network associated with one modality under the assumption that the components corresponding to all other modalities are fixed. The resulting sparse coefficient matrices $\mathbf{W}(t)$, for $t = 1, 2, \dots, T$, are then integrated as follows,

$$\mathbf{W}\_{Total} = \sum\_{t=1}^{T} \mathbf{W}(t) \tag{13}$$

Integrating the sparse coefficient matrices helps reinforce the relations between data points that persist across all data modalities, thus establishing a cross-sensor consistency. Furthermore, adding the sparse coefficient matrices reduces the noise variance introduced by outliers. A similar approach was introduced in [34] for community detection in social networks, where an aggregation of multi-layer adjacency matrices was shown to provide a better signal-to-noise ratio, and ultimately better performance. To distinguish the various classes in an unsupervised manner, we construct the affinity matrix as follows,

$$\mathbf{A} = \mathbf{W}\_{Total} + \mathbf{W}\_{Total}^T \tag{14}$$

where $\mathbf{A} \in \mathbb{R}^{n \times n}$. We subsequently use the spectral clustering method [35] to retrieve the clusters in the data, using the above affinity matrix as input.
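For completeness, the final fusion and clustering stage of Eqs. (13) and (14) can be sketched as below, using scikit-learn's spectral clustering on the precomputed affinity; `W_list` and `n_clusters` are assumed inputs, and absolute values are taken before summing so the affinity is non-negative, a common sparse-subspace-clustering convention not spelled out in Eq. (13).

```python
# Sketch of the fusion and clustering stage: sum the per-modality coefficient
# matrices (Eq. (13)), symmetrize into an affinity matrix (Eq. (14)), then run
# spectral clustering on it.
import numpy as np
from sklearn.cluster import SpectralClustering


def cluster_from_coefficients(W_list, n_clusters):
    # Absolute values assumed here to keep the affinity non-negative.
    W_total = np.sum(np.stack([np.abs(W) for W in W_list], axis=0), axis=0)  # Eq. (13)
    affinity = W_total + W_total.T                                           # Eq. (14)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                assign_labels="discretize").fit_predict(affinity)
    return labels
```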

#### **2.2 Theoretical discussion**

In order to justify the multiple banks of self-expressive layers, we assume that each modality $\mathbf{X}(t)$ may be expressed as the sum of a private information contribution $\mathbf{X}_p(t)$ and a shared information contribution $\mathbf{X}_s(t)$, such that,

$$\mathbf{X}(t) = \mathbf{X}\_s(t) + \mathbf{X}\_p(t) \tag{15}$$

The shared information can be represented as follows,

$$\mathbf{X}_s(t) = \sum_{t=1}^{T} F\big(\mathbf{W}(t)\,(\boldsymbol{\Pi}_t \mathbf{X}(t))\big) \tag{16}$$

where $\boldsymbol{\Pi}_s = \cap_{t=1,\dots,T}\, \boldsymbol{\Pi}_t^s$. $\mathbf{X}_s(t)$ and $\mathbf{X}_p(t)$ are distinct and hence lie in different subspaces, which are therefore mapped to different components of $\mathbf{W}(t)$. Similarly, for the subspaces spanned by $\mathbf{X}_p(t_i)$ and $\mathbf{X}_p(t_j)$, $i \neq j$, the corresponding components of $\mathbf{W}(t_i)$ and $\mathbf{W}(t_j)$ will almost surely not coincide. On the other hand, the components of $\mathbf{W}(t_i)$ and $\mathbf{W}(t_j)$ corresponding to $\mathbf{X}_s(t_i)$ and $\mathbf{X}_s(t_j)$ will almost surely coincide, thus justifying the construction of a layered $\mathbf{W}_{Total}$ and thereby improving the SNR. In addition, the decoder helps protect and maintain the private information $\mathbf{X}_p(t)$ of each modality by ensuring that the data can be reconstructed from the latent space with minimal loss. In the following, we elaborate on how aggregating affinity matrices impacts the overall clustering performance. The idea of aggregating affinity matrices is not new; in fact, it has been used extensively in the clustering and community detection literature.

For example, in [36], the authors proposed a method that combines the self-similarity matrices of the eigenvectors after applying a singular value decomposition to the clusters. The authors of [37] proposed merging the information provided by the multiple modalities by combining the characteristics of the individual graph layers using tools from subspace analysis on a Grassmann manifold. In [38], a multilayer spectral graph clustering (SGC) framework that performs convex layer aggregation was proposed.

**Proposition:** The persistent differential scaling of $m$-modal Group Robust Subspace Clustering Fusion yields an order-$m$ improvement in resilience over singly differential scaling fusion.

The proof of the proposition can be found in Appendix A. We basically show that when one or more data modalities are perturbed, our proposed approach introduces less error into the overall affinity matrix than DMSC does. Hence, performance is preserved and the clustering accuracy degrades gracefully as an increasing number of modalities is corrupted by noise.
