**5. Feature concatenation**

Here we propose a rationale, along with an alternative architecture, for enhancing performance on the EYB multi-modal data. Given the specific structure of this data, concatenating the features corresponding to each modality is a reasonable alternative: by doing so, we adjoin the features representing each part of the face. Since the four modalities correspond to non-overlapping partitions of the face, the feature set corresponding to each partition provides solely complementary information. A similar idea is proposed in [4] and is referred to as late concatenation, where the multi-modal data is integrated in the last stage of the encoder. Their decoder structure remains the same for either affinity fusion or late concatenation, which entails de-concatenating the multi-modal data prior to decoding it. Our proposed approach, on the other hand, drives a single self-expressive layer with the concatenated features from the *M* encoder branches, and then feeds the self-expressive layer output to each branch of the decoder. The concatenated information yields a more efficient code for the data, and hence an overall parsimonious representation; combined with a sparse decoder structure, this results in a decoder composed of three neural layers: the first consists of 150 filters of kernel size 3, the second of 20 filters of kernel size 3, and the third of 10 filters of kernel size 5. Our approach is illustrated in **Figure 9**.

**Figure 9.** *CNN Concatenation Network.*

We optimize the weights of the auto-encoder as follows,

$$\min_{\mathbf{W}:\, w_{kk}=0} \; \rho \left\|\mathbf{W}\right\|_{1} + \frac{\gamma}{2} \sum_{t=1}^{T} \left\|\mathbf{X}(t) - \mathbf{X}_{r}(t)\right\|_{F}^{2} + \frac{\mu}{2} \left\|\mathbf{N} - \mathbf{N}\mathbf{W}\right\|_{F}^{2},\tag{18}$$

where $\mathbf{N} = \left[\, \mathbf{L}(1) \;\|\; \mathbf{L}(2) \;\|\; \mathbf{L}(3) \;\|\; \mathbf{L}(4) \;\|\; \mathbf{L}(5) \,\right]$ is the concatenation of the encoder outputs $\mathbf{L}(t)$ from the five branches.
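To make the architecture and objective concrete, below is a minimal PyTorch sketch of the concatenation network and the loss in Eq. (18). The decoder widths (150, 20, and 10 filters with kernels 3, 3, and 5) follow the text; the encoder layout, the final 1-channel output convolution, the input sizes, and the hyper-parameters $\rho$, $\gamma$, $\mu$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatSubspaceAE(nn.Module):
    """Sketch: M encoder branches -> concatenated code N -> one self-expressive
    layer W -> the self-expressed code is split and fed to each decoder branch."""

    def __init__(self, n_samples, n_modalities=5):
        super().__init__()
        # Per-modality encoders (layout assumed; only the decoder is specified in the text).
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 10, 5, padding=2), nn.ReLU(),
                nn.Conv2d(10, 20, 3, padding=1), nn.ReLU(),
                nn.Conv2d(20, 150, 3, padding=1), nn.ReLU(),
            )
            for _ in range(n_modalities)
        )
        # Self-expressive layer: an n_samples x n_samples weight matrix.
        self.W = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))
        # Decoder branches per the text: 150 filters (k=3), 20 filters (k=3),
        # 10 filters (k=5); the final 1-channel conv back to image space is assumed.
        self.decoders = nn.ModuleList(
            nn.Sequential(
                nn.ConvTranspose2d(150, 150, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(150, 20, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(20, 10, 5, padding=2), nn.ReLU(),
                nn.Conv2d(10, 1, 3, padding=1),
            )
            for _ in range(n_modalities)
        )

    def forward(self, xs):  # xs: list of M tensors, each (n_samples, 1, H, W)
        codes = [enc(x) for enc, x in zip(self.encoders, xs)]
        sizes = [c.flatten(1).shape[1] for c in codes]
        n_mat = torch.cat([c.flatten(1) for c in codes], dim=1)  # rows = samples
        W = self.W - torch.diag(torch.diag(self.W))              # enforce w_kk = 0
        n_hat = W @ n_mat  # self-expression ||N - NW||_F of Eq. (18), up to a transpose
        pieces = n_hat.split(sizes, dim=1)
        recons = [dec(p.reshape(c.shape)) for dec, p, c in zip(self.decoders, pieces, codes)]
        return recons, n_mat, n_hat, W

def loss_eq18(xs, recons, n_mat, n_hat, W, rho=1.0, gamma=1.0, mu=1.0):
    """Eq. (18): l1 sparsity on W, per-modality reconstruction, self-expression residual."""
    rec = sum(((x - xr) ** 2).sum() for x, xr in zip(xs, recons))
    return (rho * W.abs().sum()
            + 0.5 * gamma * rec
            + 0.5 * mu * ((n_mat - n_hat) ** 2).sum())
```

As is standard for self-expressive models, `xs` would hold the full training set in one batch, since the matrix `W` couples all samples.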

We compared the performance of our proposed approach against the late concatenation approach of [4]; the results for the EYB dataset are reported in **Table 7**.

From **Table 7**, we can conclude that concatenating the features from the encoder and feeding the concatenated information to each decoder branch achieves better performance for this type of multi-modal data structure. The reason behind this enhancement is the combination of efficiently extracting the basic features of the whole face and the finer features of each part of the face. Beyond promoting efficiency, this concatenation may also be viewed intuitively as a mosaicking, in which different patterns complement each other. In the following, we show how our proposed approach performs in two cases: missing and noisy test data.


**Table 7.** *Concatenation performance for EYB dataset.*

**Figure 10.** *Missing modalities during testing for EYB dataset.*

**Figure 11.** *EYB noiseless training and validating on limited noisy data.*


**Table 8.** *Concatenation performance for ARL dataset.*

The results of the new proposed approach, which we refer to as the CNN concatenation network, are compared to the state-of-the-art DMSC network [4]. We start by training the auto-encoder network on 75% of the data and then test on the remaining 25%. In **Figure 10**, we show how the performance degrades as the number of available modalities at testing decreases from five to one. The results make clear that the CNN concatenation network outperforms the DMSC network. Additionally, we repeated the experiment of Subsection 4.5: we train the network on noiseless data, add Gaussian noise to one data modality at testing, and vary the number of available modalities at testing from one to four. The results, depicted in **Figure 11**, show that the concatenated CNN is more robust to noise than DMSC.
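The two robustness tests can be scripted in a few lines. The text does not specify how missing modalities are imputed at test time, so the zero-filling below, like the noise level `sigma`, is an assumption for illustration:

```python
import torch

def make_test_inputs(xs, n_available, noisy_idx=None, sigma=0.1):
    """Keep only `n_available` of the M modalities (the rest zero-filled,
    an assumed imputation) and optionally add Gaussian noise to one modality."""
    out = []
    for i, x in enumerate(xs):
        if i >= n_available:
            x = torch.zeros_like(x)               # modality missing at test time
        elif i == noisy_idx:
            x = x + sigma * torch.randn_like(x)   # corrupt one available modality
        out.append(x)
    return out

# E.g., sweep the number of available modalities from five down to one:
# for k in range(5, 0, -1):
#     test_xs = make_test_inputs(xs, n_available=k)
```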

In addition, we have utilized the concatenation network to perform object clustering on the ARL data, comparing its clustering performance with both DMSC and DRoGSuRe. The results are depicted in **Table 8**; from them we conclude that DRoGSuRe still outperforms the other approaches on the ARL dataset. Although the number of parameters involved in training the DRoGSuRe network is higher than for the other approaches, since it contains multiple self-expressive layers, DRoGSuRe is more robust to noise and to limited data availability during testing.
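To give a rough sense of that parameter overhead: a self-expressive layer is typically an $N \times N$ weight matrix over the $N$ training samples, so $M$ modality-specific layers cost about $MN^2$ weights versus $N^2$ for the single concatenated layer. A back-of-the-envelope sketch, where the sample and modality counts are illustrative assumptions:

```python
# Rough count of self-expressive-layer parameters only (illustrative values).
N = 1500   # assumed number of training samples
M = 5      # number of modalities

concat_net = N * N    # single self-expressive layer on the concatenated code
per_branch = M * N * N  # one self-expressive layer per modality (DRoGSuRe-style)
print(f"{concat_net:,} vs {per_branch:,}")  # 2,250,000 vs 11,250,000
```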
