**1. Introduction**

Unsupervised learning is a challenging area of Machine Learning (ML): it seeks to discover hidden patterns in data, and to draw inferences from them, without any prior labels. Reliable clustering techniques save the time and effort required to classify/label large datasets that may contain thousands of observations. Multi-modal data, increasingly needed for complex application problems, have become more accessible with recent advances in sensor technology and are now pervasive in practice. The plurality of sensing modalities in our applications of interest provides diverse and complementary information, which is necessary to capture the salient characteristics of the data and to secure their unique signature. A principled combination of the information contained in the different sensors, and at different scales, is therefore pursued to enhance understanding of the distinct structure of the various classes of data.

The objective of this work is to develop a principled multi-modal framework for object clustering in an unsupervised learning scenario. We extract key class-distinctive features (signatures) from each data modality using a CNN encoder, and we subsequently combine those features non-linearly to generate a discriminative characteristic feature. In so doing, we work on the hypothesis that each data modality is well approximated by a Union of low-dimensional Subspaces (UoS), which highlights underlying hidden features. The UoS structure is unveiled by pursuing a sparse self-representation of the given data modality. The subsequent aggregation of the multi-modal subspace structures yields a jointly unified characteristic subspace for each class.
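To make the sparse self-representation idea concrete, the following is a minimal sketch (not the paper's actual implementation): each sample is expressed as a sparse linear combination of the *other* samples, so that nonzero coefficients tend to select samples from the same low-dimensional subspace. The toy data, the `alpha` value, and the use of scikit-learn's `Lasso` as the sparse solver are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_self_representation(X, alpha=1e-3):
    """Express each column of X as a sparse combination of the others.

    X: (d, n) matrix whose columns are samples assumed to lie near a
    union of low-dimensional subspaces. Returns C (n, n) with a zero
    diagonal such that X is approximately X @ C.
    """
    d, n = X.shape
    C = np.zeros((n, n))
    for i in range(n):
        # Exclude sample i from its own dictionary (enforces diag(C) = 0).
        idx = [j for j in range(n) if j != i]
        solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        solver.fit(X[:, idx], X[:, i])
        C[idx, i] = solver.coef_
    return C

# Hypothetical toy data: six samples drawn from two 1-D subspaces in R^3.
rng = np.random.default_rng(0)
b1, b2 = rng.standard_normal(3), rng.standard_normal(3)
X = np.column_stack([b1 * t for t in (1.0, 2.0, -1.0)] +
                    [b2 * t for t in (1.0, -2.0, 3.0)])

C = sparse_self_representation(X)
# A symmetric affinity built from C would typically feed spectral clustering.
W = np.abs(C) + np.abs(C).T
```

Under this hypothesis, the magnitudes in `C` concentrate in the within-subspace blocks, which is what reveals the UoS structure before any aggregation across modalities.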
