**4. Experimental results**

#### **4.1 Dataset description**

We evaluate our approach on two datasets. The first is the Extended Yale-B (EYB) dataset [39], which has been used extensively in subspace clustering, as in [1, 40]. The dataset is composed of 64 frontal images of 38 individuals under different illumination conditions. In this work, we use the augmented data of [4], where facial components such as the left eye, right eye, nose, and mouth have been cropped to serve as four additional modalities. Images corresponding to each modality have been cropped to a size of 32×32 pixels. A sample image for each modality is shown in **Figure 3**. The second validation dataset is the ARL polarimetric face dataset [41], which consists of facial images of 60 individuals in the visible domain and in four different polarimetric states. All images are spatially aligned for each subject, and we have resized them to 32×32 pixels. Sample images from this dataset are shown in **Figure 4**.

**Figure 3.** *Sample images from the augmented extended Yale-B Dataset. (a) Face. (b) Left eye. (c) Right eye. (d) Mouth. (e) Nose.*


**Figure 4.** *Sample images from the ARL polarimetric dataset. (a) Visible. (b) DoLP. (c) S0. (d) S1. (e) S2.*

#### **4.2 Network structure**

In the following, we elaborate on how we construct the neural network for each dataset. Similar to [4], we implemented DRoGSuRe in TensorFlow and used the adaptive momentum-based gradient descent method (ADAM) [30] to minimize the loss function in Eq. (5) with a learning rate of **10<sup>-3</sup>**.
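For reference, a single ADAM update can be sketched in a few lines of numpy (the hyperparameter defaults follow Kingma and Ba; the quadratic toy objective is ours, and a larger step size than the 10<sup>-3</sup> used above is chosen so the toy converges in few iterations):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: running averages of the gradient and its square,
    with bias correction, as in Kingma & Ba."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(x) = x^2, minimized from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
print(x)  # close to the minimizer 0
```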

In the case of the ARL dataset, we have five data modalities and therefore five different encoders, self-expressive layers, and decoders. Each encoder is composed of three convolutional layers: the first has 5 filters of kernel size 3, the second has 7 filters of kernel size 1, and the last has 15 filters of kernel size 1.

For the EYB dataset, we likewise have five data modalities and hence five different encoders, self-expressive layers, and decoders. Each encoder is composed of three convolutional layers: the first has 10 filters of kernel size 5, the second has 20 filters of kernel size 3, and the last has 30 filters of kernel size 3.
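As a rough sanity check, the layer specifications above fix the encoder parameter counts. A minimal sketch, assuming stride-1 2D convolutions with biases and single-channel 32×32 inputs (the helper name is ours):

```python
def conv_params(specs, in_ch=1):
    """Parameter count for a stack of 2D convolutional layers.

    specs: list of (num_filters, kernel_size) tuples.
    Each layer has k*k*in_ch*filters weights plus one bias per filter.
    """
    total = 0
    for filters, k in specs:
        total += k * k * in_ch * filters + filters
        in_ch = filters
    return total

# Encoder specifications taken from the text (grayscale input assumed).
arl_encoder = [(5, 3), (7, 1), (15, 1)]
eyb_encoder = [(10, 5), (20, 3), (30, 3)]

print(conv_params(arl_encoder))  # 212 parameters per ARL encoder
print(conv_params(eyb_encoder))  # 7510 parameters per EYB encoder
```

The counts highlight how lightweight each per-modality encoder is relative to the self-expressive layer, whose coefficient matrix grows quadratically with the number of training samples.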

#### **4.3 Noiseless results**

In the following, we compare the performance of our approach against the DMSC approach when learning the union-of-subspaces structure of noise-free data. First, we divide each dataset into training and validation sets so that newly observed data can be classified using the structure learned from the current unlabeled data. The ARL expression dataset used for training consists of 2160 images per modality, and the validation baseline includes 720 images per modality in total. For the EYB dataset, we randomly selected 1520 images per modality for training and 904 images for validation.

The sparse solution *W*(*t*) corresponding to each data modality provides important information about the relations among data points, which may be used to split the data into individual clusters, each residing in a common subspace. Observations from each object can be seen as data points spanning one subspace. Interpreting the subspace-based affinities given by *W*(*t*) as a layered set of networks, we proceed to carry out what amounts to modality fusion: the *T* sparse matrices are added to produce one sparse matrix across all modalities, *W<sub>Total</sub>*, thereby improving performance. Observations associated with one object/individual are clustered as one subspace, where the contribution of each sensor is embedded in the entries of the *W<sub>Total</sub>* matrix. For clustering by *W<sub>Total</sub>*, we apply spectral clustering.

#### **Table 1.**

*Performance comparison for ARL dataset.*

#### **Table 2.**

*Performance comparison for EYB dataset.*
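The fusion-then-spectral-clustering step can be sketched in numpy on a toy example (six points, two modalities, hypothetical block-structured coefficient matrices; for an exactly block-structured affinity the labels can be read off the null space of the normalized Laplacian, which stands in here for a full k-means step):

```python
import numpy as np

# Hypothetical per-modality self-expressive coefficient matrices W(t)
# (6 points, T = 2 modalities); the block structure encodes two clusters.
W = np.zeros((2, 6, 6))
for t in range(2):
    W[t, :3, :3] = 1.0 - np.eye(3)   # cluster {0, 1, 2}
    W[t, 3:, 3:] = 1.0 - np.eye(3)   # cluster {3, 4, 5}

# Modality fusion: add the T sparse matrices into one matrix W_total.
W_total = np.abs(W).sum(axis=0)

# Symmetric affinity and normalized graph Laplacian, as in spectral clustering.
A = W_total + W_total.T
d = A.sum(axis=1)
L = np.eye(6) - A / np.sqrt(np.outer(d, d))

vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
k = int(np.sum(vals < 1e-8))          # zero eigenvalues = number of clusters
embedding = np.round(vecs[:, :k], 6)  # rows are identical within a cluster here
_, labels = np.unique(embedding, axis=0, return_inverse=True)
print(k, labels)
```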

After learning the structure of the data clusters, we validate our results on the validation set. We extract the principal components (eigenvectors of the covariance matrix) of each cluster in the original (training) dataset to act as a representative subspace of its corresponding class. We subsequently project each new test point onto the subspace of each cluster, spanned by its principal components. The *l*<sub>2</sub> norm of the projection is then computed, and the class with the largest norm is selected as the class of the test point. For DRoGSuRe, we use the coefficient matrix *W<sub>Total</sub>* in Eq. (13) to cluster the test data points coming from all data modalities. We compare the clustering output labels with the ground truth for each dataset. The results for the ARL and EYB datasets are reported in **Tables 1** and **2**, respectively. From the results, it is clear that the DRoGSuRe technique for the fused data remarkably outperforms DMSC on the ARL dataset. The reason behind this significant improvement is the layered structure of our proposed approach, which constructs the latent space in a way that safeguards each sensor's private information and hence dedicates more degrees of freedom to each sensor. In addition, the ARL dataset offers modalities that are different in nature, each individually providing new information, in contrast to the EYB dataset. In the noiseless case on the EYB dataset, however, DMSC performed better than DRoGSuRe.
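The projection-based classification rule above can be sketched as follows (a toy example with two synthetic clusters; the helper names and the choice of two principal components are ours, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def principal_subspace(points, n_components):
    """Top principal directions of a cluster (rows are data points)."""
    centered = points - points.mean(axis=0)
    # Right singular vectors = eigenvectors of the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T          # shape (dim, n_components)

def classify(x, subspaces):
    """Assign x to the class whose subspace projection has the largest l2 norm."""
    norms = [np.linalg.norm(U.T @ x) for U in subspaces]
    return int(np.argmax(norms))

# Two toy clusters living near different coordinate planes in R^4.
c0 = rng.normal(size=(50, 4)) * np.array([5.0, 5.0, 0.1, 0.1])
c1 = rng.normal(size=(50, 4)) * np.array([0.1, 0.1, 5.0, 5.0])
subspaces = [principal_subspace(c, n_components=2) for c in (c0, c1)]

print(classify(np.array([3.0, -2.0, 0.0, 0.0]), subspaces))  # → 0
print(classify(np.array([0.0, 0.0, 1.0, 4.0]), subspaces))   # → 1
```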

#### **4.4 Noisy training with single and multiple modalities**

#### **Table 3.**

*ARL dataset: Distorting one modality.*

#### **Table 4.**

*ARL dataset: Distorting two modalities.*

#### **Table 5.**

*EYB dataset: Distorting one modality.*

In the following, we test the robustness of our approach under noisy learning. We distort one modality at a time by shuffling the pixels of all images in that modality during the training phase. By doing so, we perturb the structure of the sparse coefficient matrix associated with that modality, thus impacting the overall *W* matrix for both DRoGSuRe and DMSC. Testing with clean data, i.e., no distortion, demonstrates the impact of perturbing the training and hence performing an

