**1.1 Related work**

Subspace clustering was introduced as an efficient way to uncover the union of low-dimensional subspaces underlying high-dimensional data. It has been extensively studied in computer vision, owing to the vast availability of visual data [1–4], and has been broadly adopted in many applications such as image segmentation [5], image compression [6], and object clustering [7].
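
For reference, the self-expressiveness model underlying most of these methods (e.g., sparse subspace clustering) represents each data point as a combination of the other points. The notation below is generic rather than specific to any cited work: given a data matrix $X \in \mathbb{R}^{D \times N}$, one solves

$$\min_{C} \; \|C\|_{p} \quad \text{s.t.} \quad X = XC, \;\; \operatorname{diag}(C) = 0,$$

where the self-representation matrix $C \in \mathbb{R}^{N \times N}$ (or its symmetrized affinity $|C| + |C|^{\top}$) is passed to a spectral clustering step, and the choice of norm $\|\cdot\|_{p}$ varies by method (e.g., the $\ell_1$ norm in sparse subspace clustering).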

Uncovering the principles of, and laying out the fundamentals for, multi-modal data analysis has become an important research topic in light of its many applications in diverse fields, including image fusion [8], target recognition [9–12], speaker recognition [13], and handwriting analysis [14]. Convolutional neural networks have been widely applied to multi-modal data, as in [15–19].

A multi-modal subspace clustering-inspired approach was also proposed in [20]. The emphasis of our formulation results in a different optimization problem: multi-modal sensing seeks to account not only for the private information that provides the complementarity of the sensors, but also for the common and hidden information. This yields, as an end result, a network structure different from that of [20], with a different application-space inspiration. In addition, we address the robustness of fusing multi-modal sensor data, each modality with its own distinct intrinsic structure, along with a potential scaling for viability. A thorough comparison of our results to the multi-modal fusion network in [21] is carried out, with a demonstration of resilient fusion under a variety of limiting scenarios, including limited sensing modalities (i.e., sensor failures).

In [22], the authors proposed a deep multi-view subspace clustering approach that combines global and local structures so that samples of the same cluster are driven closer together while samples in different clusters, across views, are pushed farther apart. To that end, they used a discriminative constraint between the different views, based on the Hadamard product of the features extracted by a convolutional auto-encoder for each view. In contrast, our approach is based on minimizing the group norm, which, as we proved by derivation in earlier work [23], entails a smaller angle between the subspaces across all modalities, thus promoting the goal of obtaining a common latent space. Moreover, minimizing the group norm also yields a group-sparse solution along the data modalities. Sun et al. [24] proposed a trainable deep multi-view subspace clustering method, named self-supervised deep multi-view subspace clustering (S2DMVSC), which learns the common latent subspace using two losses, a spectral clustering loss and a classification loss, in order to denoise the imperfect correlations among data points.
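
To make the group-norm contrast concrete, the following is a minimal sketch in illustrative notation; the precise formulation and its derivation appear in [23]. For per-modality self-representation matrices $C^{(1)}, \dots, C^{(M)}$, the group norm couples the coefficient at each position $(i,j)$ across all $M$ modalities,

$$\big\|\mathcal{C}\big\|_{1,2} \;=\; \sum_{i,j} \Big( \sum_{m=1}^{M} \big( C^{(m)}_{ij} \big)^{2} \Big)^{1/2},$$

so that the entries at a given position are either jointly active or jointly zero. This shared support pattern is what yields the group-sparse solution along the data modalities and promotes a common latent space.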

In this paper, we prove that our formulation, based on the group norm of the self-representation matrices and the commutation loss between them, provides a natural way to fuse multi-modal data by employing each modality's self-representation matrix as its embedding, making our approach robust under different types of potential limitations. It is worth noting that our proposed approach preserves the relations among each individual sensor's data points, resulting in more flexibility for each data modality.
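
For concreteness, one plausible instantiation of the commutation loss referenced above (the exact form is specified in the sequel; this sketch only assumes the loss penalizes non-commuting pairs of self-representation matrices) is

$$\mathcal{L}_{\mathrm{comm}} \;=\; \sum_{m \neq n} \big\| C^{(m)} C^{(n)} - C^{(n)} C^{(m)} \big\|_{F}^{2},$$

which vanishes when the modality-wise matrices commute and hence (when diagonalizable) admit a common eigenbasis, consistent with the goal of a shared embedding across modalities.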
