
#### **Chapter 3**

## Unsupervised Deep Hyperspectral Image Super-Resolution

*Zhe Liu and Xian-Hua Han*

#### **Abstract**

This chapter presents a recently advanced deep unsupervised hyperspectral (HS) image super-resolution framework for automatically generating a high-resolution (HR) HS image from its low-resolution (LR) HS and high-resolution RGB observations without any external training sample. We incorporate deep learned priors on the underlying structure of the latent HR-HS image into the mathematical model formulating the degradation procedures of the observed LR-HS and HR-RGB images, and introduce an unsupervised end-to-end deep prior learning network for robust HR-HS image recovery. Experiments on two benchmark datasets validate that the proposed method manifests very impressive performance and even outperforms most state-of-the-art supervised learning approaches.

**Keywords:** deep learning, unsupervised learning, hyperspectral image, super-resolution, generative network

#### **1. Introduction**

Hyperspectral images (HSIs) feature hundreds of bands with extensive spectral information that is helpful for a range of visual tasks, such as computer vision [1], mineral exploration [2], medical diagnosis [3], remote sensing [4], and so on. Due to technological restrictions, it is hard to capture high-quality HSIs, and the acquired HSIs have substantially lower resolution. As a result, super-resolution (SR) has been applied to obtain an HR-HSI, but it is challenging because of texture blurring and spectral distortion at high magnifications. Thus, researchers frequently combine a high-resolution PAN image and a low-resolution HSI [5] to achieve SR tasks. In recent years, it has become a trend to fuse a high-resolution multispectral/RGB (HR-MS/RGB) image and a low-resolution hyperspectral (LR-HS) image to generate a high-resolution hyperspectral (HR-HS) image, which is called hyperspectral image super-resolution (HSI-SR). HSI-SR methods are classified into two primary categories based on their reconstruction principles: conventional mathematical model-based methods and deep learning-based approaches in a supervised/unsupervised manner. The following sections go into further detail about each of these categories.

#### **1.1 Mathematical model-based methods**

Since HSI-SR is typically an ill-posed inverse problem, a mathematical model-based approach yields a solution space that is far larger than the actual result needed. To tackle this issue, mathematical model-based HSI-SR constrains the solution space using hand-crafted prior knowledge, regularizes the mathematical model, and then optimizes the model by minimizing the reconstruction errors. This method aims at establishing a mathematical formulation that simulates the transformation of the HR-HS image into the LR-HS and HR-RGB images. This process is extremely difficult, and direct optimization of the formulated mathematical model might result in very unreliable solutions, as the known variables in the LR-HS/HR-RGB images under consideration are significantly fewer than the unknown variables to be estimated in the latent HR-HS image. In order to narrow the set of possible solutions, existing approaches often utilize a variety of priors to regularize the mathematical model.

Based on prior knowledge of various structures, three categories of mathematical model-based HSI-SR methods can currently be distinguished: spectral unmixing-based methods [6], sparse representation-based methods [7], and tensor factorization-based methods [8]. For spectral unmixing-based methods, Yokoya et al. [9] proposed a coupled non-negative matrix factorization (CNMF) approach, which alternately unmixes the LR-HS and HR-RGB images to estimate the HR-HS image. Later, Lanaras et al. [6] proposed a similar framework to jointly unmix the two input images by decoupling the initial optimization problem into two constrained least-squares problems. Dong et al. [7] incorporated alternating direction method of multipliers (ADMM) techniques for solving the spectral unmixing model. Additionally, sparse representation is frequently used as an alternative mathematical model for HSI-SR. In this model, the underlying HR-HS image is recovered by first learning a spectral dictionary from the LR-HS image under consideration and then calculating the sparse coefficients of the HR-RGB image. Inspired by the spectral similarity of neighboring pixels in the latent HS image, Akhtar et al. [10] proposed to perform group-sparse and non-negative representation within a small patch, while Kawakami et al. [11] applied a sparse regularizer for the decomposition of spectral dictionaries. Moreover, tensor factorization-based methods have demonstrated that they can resolve the HSI-SR problem. He et al. [8] factorized the HR-HS image into two low-rankness-constrained matrices and achieved great super-resolution performance, motivated by the intrinsic low dimensionality of the spectral space and the three-dimensional structure of the HR-HS image.

Despite some advancements in handcrafted priors, HSI-SR performance tends to be inconsistent and can cause severe spectral distortion due to the limited representation ability of handcrafted priors, depending on the content of the image under investigation.

#### **1.2 Deep learning-based methods**

Hyperspectral super-resolution is a hot field of research in hyperspectral imaging, as it can improve low-resolution images in both the spatial and spectral domains, turning them into high-resolution hyperspectral images. HSI-SR is a classic inverse problem, and deep learning holds a lot of promise for resolving it. Depending on whether a training dataset is provided, supervised and unsupervised learning are the two approaches used in deep learning-based HSI-SR. Supervised learning requires a labeled training dataset in order to create a function or model into which subsequent data is fed to generate accurate predictions, whereas unsupervised learning does not require a labeled training dataset.

#### *1.2.1 Deep supervised learning-based methods*

Different vision tasks have been successfully resolved by DCNNs. As a result, DCNN-based methods have been suggested for HSI-SR tasks, which eliminate the requirement to investigate various manually handcrafted priors. With the LR-HS observation only, Li et al. [12] presented an HSI-SR model by combining a spatial constraint (SCT) strategy with a deep spectral difference convolutional neural network (SDCNN). Han et al. [13] utilized three straightforward convolutional layers in their groundbreaking HS/RGB fusion work, whereas later works utilized more advanced CNN architectures, such as ResNet [14] and DenseNet [15], in an effort to attain more robust learning capabilities. Dian et al. [16] first provided an optimization technique by resolving the Sylvester equation in a fusion framework, and then investigated a DCNN-based strategy to enhance the initialization results. Further, Han et al. [17] proposed a multi-layer, multi-level spatial and spectral fusion network that successfully fused the observed LR-HS and HR-RGB images. Xie et al. [18] employed a low-resolution imaging model and spectral low-rank knowledge of HR-HS images to design an MS/HS fusion network and optimize the suggested MS/HS fusion system. To solve HS image reconstruction difficulties effectively and accurately, Zhu et al. [19] researched the progressive zero-centric residual network (PZRes-Net), a lightweight deep neural network-based system. Although the reconstruction performance was significantly improved, all the DCNN-based methods mentioned above require training with a large number of pre-prepared training instances that contain not only LR-HS and HR-RGB images but also the corresponding HR-HS images as labels, that is, a set of training triplets.

#### *1.2.2 Deep unsupervised learning-based methods*

Although deep learning networks for HSI-SR require a lot of hyperspectral images as training data, HS images are difficult to obtain in the real world. It is rather challenging to collect good-quality HSIs due to hardware restrictions, and the resolution of the acquired HSIs is relatively low. For supervised learning, which needs big training datasets to succeed, this is an unsolvable problem. As a result, unsupervised learning is one of the key research areas. Unlike supervised learning, unsupervised learning does not require any HR-HS image as a ground truth and uses only easily accessible HR-MS/RGB images and LR-HS images to generate HR-HS images.

It is well known that the corresponding training triplets, especially the HR-HS images, are extremely hard to collect in real applications. Thus, the quality and amount of the collected training triplets generally become the bottleneck of the DCNN-based methods. Most recently, Qu et al. [20] attempted to solve the HSI super-resolution problem in an unsupervised way and designed an encoder-decoder architecture for exploiting the approximately low-rank prior structure of the spectra in the latent HR-HS image. This unsupervised framework did not require any training samples from an HSI dataset and could restore the HR-HS image using a CNN-based end-to-end network. However, this method needed to be carefully optimized step-by-step in an alternating way, and the HS image recovery performance was still not sufficient. Liu et al. [21] proposed an unsupervised multispectral and hyperspectral image fusion (UnMHF) network using the observations of the under-studied scene only, which estimates the latent HR-HS image with a learned encoder-decoder-based generative network from a noise input and can only be applied to observed LR-HS and HR-RGB images with a known spatial downsampling operation and camera spectral function (CSF). Later, Uezato et al. [22] exploited a similar method for unsupervised image pair fusion, dubbed the guided deep decoder (GDD) network, again for known spatial and spectral degradation operations only. Thus, UnMHF [21] and GDD [22] can be categorized into the non-blind paradigm and lack generalization in real scenarios. Zhang et al. [23] proposed a two-step learning method that models the common priors of HR-HS images in a supervised way and then adapts to the under-studied scene to model its specific prior in an unsupervised manner. The unsupervised adaptation is capable of learning the spatial degradation operation of the observed LR-HS image but can only deal with an observed HR-RGB image with a known CSF, and thus it would be categorized as a semi-blind paradigm, possibly learning the spatial degradation operation of the observed LR-HS image only. Moreover, Fu et al. [24] exploited an unsupervised hyperspectral image super-resolution method using a loss function formulated from the observed LR-HS and HR-RGB images only, and integrated a CSR optimization layer after the HSI super-resolution network to automatically select or learn the optimal CSR to adapt to the target RGB image, possibly captured by various color cameras; it also falls into the semi-blind paradigm, possibly learning the spectral degradation operation (the CSF) only. Further, the unsupervised adaptation subnet in ref. [23] and the method in ref. [24] utilize the under-studied observed images only, instead of requiring additional training samples to guide the network training, and achieved impressive performance as unsupervised learning strategies. However, these learning methods based on the under-studied observed images only easily drop into a local solution, and the final prediction heavily depends on the initial input of the network. Our method is also formulated in this unsupervised learning paradigm, and we clarify the distinctiveness of our method in the next section.

#### **2. The proposed unsupervised learning-based methods**

In this section, we first describe the problem formulation in the HSI-SR task and then present the proposed deep unsupervised learning-based method.

#### **2.1 Problem formulation**

Let us consider the observed image pair: an LR-HS image $X \in \mathbb{R}^{w \times h \times L}$, where $w$ and $h$ are its width and height, and an HR-RGB image $Y \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ are the width and height of $Y$ and $Z$, respectively. The HR-HS image $Z \in \mathbb{R}^{W \times H \times L}$, where $L$ is the number of spectral channels, is what we aim to reconstruct in HSI-SR. The degradation between the HR-HS target image and the observed images $X$ and $Y$ can be represented by the following formula:

$$X = \left(k^{(Spa)} \otimes Z\right)\downarrow_{(Spa)} + n_x, \qquad Y = Z \ast C^{(Spec)} + n_y, \tag{1}$$

where $\otimes$ stands for the convolution operator, $\downarrow_{(Spa)}$ for the spatial-domain downsampling operator, and $k^{(Spa)}$ for the two-dimensional blur kernel in the spatial domain. Three one-dimensional spectral filters $C^{(Spec)}$ constitute the spectral sensitivity function of the RGB camera, which translates the $L$ spectral bands to RGB bands. The additive white Gaussian noise (AWGN) with noise level $\sigma$ is represented by $n_x$ and $n_y$. We rephrase the degradation model as a matrix formulation to quantify the problem, that is,

$$X = DBZ + n_x, \qquad Y = ZC + n_y, \tag{2}$$

where $B$ is the spatial blur matrix, $D$ is the downsampling matrix, and $C$ is the transformation matrix representing the spectral sensitivity function (CSF). According to Eqs. (1) and (2), a general HSI-SR task should estimate $k^{(Spa)}$ (or $B$), $\downarrow_{(Spa)}$ (or $D$), and $C^{(Spec)}$ (or $C$) from the observed image pair $X$ and $Y$, which makes it very complicated to obtain the latent $Z$. It is a challenging problem that has rarely been studied in the HSI-SR task. Therefore, the general solution is to assume that the blur kernel type and the spectral sensitivity function (CSF) of the RGB camera are known and to approximate them by some mathematical operations in the application. The current study followed the previous setting in principle, but we also investigated whether it was possible to reconstruct HR-HS images without knowing the CSF or the blur kernel beforehand, as a generic solution for specific scenarios.
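To make the degradation model concrete, the following is a minimal PyTorch sketch of the two degradation paths in Eq. (2), simulating $X = DBZ$ with a depth-wise blur plus downsampling and $Y = ZC$ with a CSF mixing matrix; the box-blur kernel and random CSF used here are hypothetical stand-ins for the true operators.

```python
import torch
import torch.nn.functional as F

def degrade(Z, blur_kernel, csf, scale):
    """Simulate Eq. (2): X = DBZ (blur + downsample) and Y = ZC (spectral mixing).

    Z:           HR-HS image of shape (1, L, H, W)
    blur_kernel: 2-D blur kernel of shape (s, s), s odd
    csf:         camera spectral sensitivity matrix of shape (L, 3)
    scale:       spatial downsampling factor
    """
    L = Z.shape[1]
    s = blur_kernel.shape[-1]
    # B: depth-wise convolution, the same 2-D kernel applied to every band.
    k = blur_kernel.view(1, 1, s, s).repeat(L, 1, 1, 1)
    blurred = F.conv2d(Z, k, padding=s // 2, groups=L)
    X = blurred[:, :, ::scale, ::scale]            # D: nearest downsampling
    # C: mix the L spectral bands into 3 RGB channels.
    Y = torch.einsum('blhw,lc->bchw', Z, csf)
    return X, Y

# Toy usage with random stand-ins for the true kernel and CSF.
Z = torch.rand(1, 31, 512, 512)                    # latent HR-HS image
kernel = torch.ones(9, 9) / 81.0                   # hypothetical box-blur kernel
csf = torch.softmax(torch.randn(31, 3), dim=0)     # hypothetical CSF matrix
X, Y = degrade(Z, kernel, csf, scale=8)            # X: (1,31,64,64), Y: (1,3,512,512)
```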

Let us begin by defining the generic formulation of the HSI-SR task. The maximum a posteriori (MAP) framework is the foundation of the majority of classical approaches:

$$Z^* = \underset{Z}{\operatorname{argmax}} \Pr(Z|X, Y, B, C) = \underset{Z}{\operatorname{argmax}} \Pr{}_{(B,C)}(X, Y|Z) \Pr(Z), \tag{3}$$

where $\Pr(Z)$ performs prior modeling of the latent HR-HS image and $\Pr_{(B,C)}(X,Y|Z)$ is the likelihood of the fidelity term corresponding to the known kernel type and CSF matrix. With regard to the latent HR-HS image $Z$, the fidelity term is defined as $-\log \Pr_{(B,C)}(X,Y|Z)$, where it is specifically assumed that the reconstruction errors of the fidelity terms on $X$ and $Y$ follow independent Gaussian distributions in general. The prior modeling of the HR-HS image is subjected to the regularization requirement $-\log \Pr(Z) = \phi(Z)$. The reconstruction model of the MAP-based HSI-SR in Eq. (3) can then be redefined using the following formula:

$$Z^* = \underset{Z}{\operatorname{argmin}} \, \alpha\beta_1 \|X - DBZ\|_F^2 + (1-\alpha)\beta_2 \|Y - ZC\|_F^2 + \lambda\phi(Z), \tag{4}$$

where $\|\cdot\|_F$ represents the Frobenius norm. It is generally necessary to introduce normalization weights, such as $\beta_1 = 1/N_1$ and $\beta_2 = 1/N_2$, where $N_1$ and $N_2$ are the products of the numbers of pixels and spectral bands in the LR-HS and HR-RGB images, respectively. This is because the HR-RGB and LR-HS images have different numbers of elements. In addition, we further modulate the contributions of these two reconstruction errors using the hyperparameter $\alpha$ $(0 \le \alpha \le 1)$, while $\lambda$ is a trade-off adjustment parameter. Appropriate priors have to be experimentally developed as the regularization term $\phi(Z)$ in order to obtain a robust solution. Numerous prior constraints have been presented. The employed priors, however, are often manually determined and fall short of adequately describing the intricate structure of HR-HS images. Furthermore, the established priors should vary depending on the details of the scene being studied, and choosing suitable priors for a specific scenario is still an art.

The DCNN method is one of the most recent deep learning-based HSI-SR techniques. It effectively captures prospective HS image features (a common prior) in a fully supervised learning manner utilizing previously prepared training samples (external datasets). In particular, supervised deep learning methods seek to learn joint CNN models by minimizing the following loss function, given $N$ training triplets $(X_i, Y_i, Z_i)$, $i = 1, 2, \cdots, N$:

$$\theta^* = \operatorname*{argmin}_{\theta} \sum_{i}^{N} \left\| Z_i - F_{CNN}(X_i, Y_i) \right\|_F^2, \tag{5}$$

where $F_{CNN}$ stands for a DCNN network transform with learnable parameters $\theta$. In contrast to directly searching the ground-truth image space of $Z$, these approaches are trained to extract the optimal parameters $\theta^*$ of the network, and they are able to identify common prior structures concealed in the training samples by utilizing the powerful and effective DCNN modeling capabilities. After learning the network parameters $\theta^*$, the underlying HR-HS image for each observation $(X_t, Y_t)$ can simply be reconstructed as $\hat{Z}_t = F_{CNN}^{\theta^*}(X_t, Y_t)$. Although these supervised deep learning methods have shown encouraging results, it is necessary to provide a substantial training dataset that includes LR-HS, HR-RGB, and HR-HS images, all of which are particularly challenging to gather in HSI-SR tasks, in order to learn a good model.
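For contrast with the unsupervised setting developed below, here is a minimal sketch of one optimization step of the supervised objective in Eq. (5); the `model` standing in for $F_{CNN}$ and all variable names are hypothetical placeholders.

```python
import torch

def supervised_step(model, optimizer, X_i, Y_i, Z_i):
    """One step of Eq. (5): fit F_CNN(X_i, Y_i) to the HR-HS label Z_i."""
    optimizer.zero_grad()
    Z_hat = model(X_i, Y_i)               # F_CNN with parameters theta
    loss = ((Z_i - Z_hat) ** 2).sum()     # squared Frobenius norm
    loss.backward()
    optimizer.step()
    return loss.item()
```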

#### **2.2 Overview of the motivation**

Recent deep learning-based HSI-SR techniques have demonstrated that DCNNs perform well and are capable of accurately capturing the underlying spatial and spectral structure (joint prior information) of potential HS images. These algorithms are typically trained in a fully supervised way and need large-scale training datasets containing LR-HS, HR-RGB, and HR-HS images, where the training labels (HR-HS images) are challenging to gather. Numerous studies on natural image generation (DCGAN [25]) and its variations have demonstrated that high-resolution, high-quality images with specific features and attributes can be produced from noisy random input data without the supervision of high-quality ground-truth data. This indicates that starting from a random initial image and scanning the parameter space of a neural network can capture the inherent structure (a prior) of plausible images with certain features. The deep image prior (DIP) [26] has also been utilized to properly perform a number of natural image restoration tasks, including denoising, inpainting, and super-resolution, using just the degraded version of a scene for guidance. This unsupervised paradigm is adopted in the current study, which tries to learn the precise spatial and spectral structure (a prior) of the latent HR-HS image from the degraded data (LR-HS and HR-RGB images).

The spatial and spectral structure of the underlying HR-HS image $Z$ is specifically modeled using a generative neural network $G_{\theta}$ ($\theta$ is the network parameter to be learned). By substituting $Z$ with $G_{\theta}$ in Eq. (4) and deleting the regularization term $\phi(Z)$, which is replaced by the prior acquired automatically by the generative network, the fusion-based HSI-SR model can be reformulated as follows:

$$\theta^* = \underset{\theta}{\operatorname{argmin}} \, \alpha\beta_1 \| X - DBG_{\theta}(Z_{in}) \|_F^2 + (1-\alpha)\beta_2 \| Y - G_{\theta}(Z_{in})C \|_F^2, \tag{6}$$

where $G_{\theta}(Z_{in})$ is the HR-HS estimation and $Z_{in}$ is the input to the generative neural network. Eq. (6) tries to explore the parameter space of the generative neural network $G_{\theta}$ by leveraging its powerful modeling capability to generate a more reliable HR-HS image, instead of directly searching the exceedingly vast, non-uniquely determined raw HR-HS space.

To solve the above unsupervised HSI-SR task, several issues still need to be carefully addressed: (i) how to design the generative network's architecture so that both spectral correlations and low-level spatial statistics can be effectively modeled during training; (ii) what kind of input to the generative network should be employed so that local minimization points can be avoided; and (iii) how to implement an end-to-end learning framework that incorporates the different degradation operations (blurring, downsampling, and spectral transformation) following the generative network. In the next sections, we present the solutions to the aforementioned issues.

#### **2.3 Architecture of the generative neural network**

The generative neural network $G_{\theta}$ can be implemented using an arbitrary DCNN architecture. A generative neural network $G_{\theta}$ is required to offer sufficient modeling capability due to the diversity of information, including potentially significant structures, rich textures, and complicated spectra, in HR-HS images. It has been demonstrated that various generative neural networks have a great deal of promise for producing high-quality natural images (e.g., Pix2pix), for example in adversarial learning settings [27]. In this study, a multi-level feature learning architecture is employed: an encoder-decoder architecture that allows for feature reuse via skip connections between the encoder and the decoder. **Figure 1** shows a thorough representation of the generative neural network.

**Figure 1.**
*Conceptual diagram of the proposed unsupervised deep HSI-SR.*

The encoder and the decoder each consist of five blocks, which learn representative features at various scales. To reuse the extracted detailed features, the output of each of the five encoder-side blocks is forwarded straight through to the corresponding decoder block. A max-pooling layer with a 2 × 2 kernel is used to reduce the size of the feature maps between encoder blocks, and an up-convolution layer is used to double the size of the feature maps between decoder blocks for recovery. Each block comprises three convolutional layers, each followed by a ReLU activation function. Finally, the HR-HS image is estimated by the convolutional output layer. In an unsupervised learning environment, the training state of the generative neural network cannot be estimated or guided by a ground-truth HR-HS image. The assessment criteria listed in Eq. (6) are therefore computed from the observed HR-RGB and LR-HS images.
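The sketch below is one plausible PyTorch rendering of the encoder-decoder generator just described (five blocks per side, three convolutions plus ReLU per block, 2 × 2 max-pooling, up-convolutions, and skip connections); the channel width of 64 and the sigmoid output enforcing the [0, 1] constraint of Eq. (12) are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Three convolutional layers, each followed by a ReLU activation.
    layers = []
    for i in range(3):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Generator(nn.Module):
    """Encoder-decoder G_theta with skip connections (channel width assumed)."""
    def __init__(self, cin, bands, width=64):
        super().__init__()
        self.enc = nn.ModuleList(
            [conv_block(cin if i == 0 else width, width) for i in range(5)])
        self.pool = nn.MaxPool2d(2)                               # 2 x 2 max-pooling
        self.up = nn.ConvTranspose2d(width, width, 2, stride=2)   # up-convolution
        self.dec = nn.ModuleList([conv_block(2 * width, width) for _ in range(4)])
        self.out = nn.Conv2d(width, bands, 1)                     # HR-HS output layer

    def forward(self, z):
        skips = []
        for i, block in enumerate(self.enc):
            z = block(z)
            if i < 4:                      # keep encoder features for reuse
                skips.append(z)
                z = self.pool(z)
        for block in self.dec:
            z = torch.cat([self.up(z), skips.pop()], dim=1)  # skip connection
            z = block(z)
        return torch.sigmoid(self.out(z))  # keep the estimate within [0, 1]

# For the DUFL noise input, cin equals the band count; for the DSSH fusion
# context, cin equals bands + 3 (upsampled LR-HS stacked with the HR-RGB).
G = Generator(cin=31, bands=31)
Z_hat = G(torch.rand(1, 31, 64, 64))       # spatial size must be divisible by 16
```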

#### **2.4 Input data to the generative neural network**

We classify the input data into two types. The first is a noise input, to which a random perturbation is added to improve robustness, corresponding to the deep unsupervised fusion learning (DUFL) model; to contrast with the addition of the random perturbation, we also perform experiments without it, that is, the DUFL+ model. The second input type is the fusion context of the observed HR-RGB and LR-HS images, which corresponds to the deep self-supervised HS image reconstruction (DSSH) framework.

#### *2.4.1 The noise input*

The deep image prior (DIP) network [26] was developed to capture low-level spatial statistics using randomly generated, uniformly distributed noise vectors as input. Nevertheless, because the noise vectors are chosen at random, DIP has a limited ability to discover spectral and spatial correlations and is challenging to tune. Motivated by DIP, we proposed a deep unsupervised fusion learning (DUFL) model, in which a common generative neural network is trained to generate target images with predetermined features; typically, a noise vector randomly sampled from a distribution function (for example, a Gaussian or uniform distribution) is used as input to ensure that the generated images have enough diversity and variability. Our HSI-SR task requires the observed degradations (LR-HS and HR-RGB images) of the corresponding HR-HS image. Therefore, it makes sense to search the network parameter space for a given HR-HS image with a pre-sampled noise vector $Z_{in}^0$ as input. However, a constant noise input could lead the generative neural network into a local minimum, making the estimate of the HR-HS image inaccurate. Therefore, it is suggested to disturb the fixed initial input with a small randomly generated noise vector in each training step to avoid the local minimum condition. At the $i$-th training step, the input vector can be represented as follows:

$$Z_{in}^{i} = Z_{in}^{0} + \Delta n_{i}, \tag{7}$$

where $\Delta$ stands for the interference level (a small scalar value) and $n_i$ is the noise vector randomly sampled at the $i$-th training step. The final estimated HR-HS image utilized for prediction is $Z^* = G_{\theta}(Z_{in}^0)$, which is created by feeding the fixed noise vector into the generative network $G_{\theta}$ trained with the perturbed inputs.
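A minimal sketch of the DUFL input of Eq. (7); the uniform initialization and the perturbation level of 0.05 follow the implementation details reported in Section 3.1.3, while the tensor shape and the Gaussian perturbation are assumptions.

```python
import torch

L_bands, H, W = 31, 512, 512
z0 = torch.rand(1, L_bands, H, W)   # fixed initial noise Z_in^0, uniform in [0, 1)

def perturbed_noise(z0, delta=0.05):
    """Eq. (7): Z_in^i = Z_in^0 + delta * n_i, with n_i resampled at every step."""
    return z0 + delta * torch.randn_like(z0)   # Gaussian n_i assumed

# After training, the prediction uses the unperturbed input: Z* = G_theta(z0).
```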

This deep unsupervised fusion learning model employs noise vectors randomly sampled from a uniform distribution as input to provide low-level spatial statistics. But this approach is less effective at identifying spectral and spatial correlations and is more challenging to optimize due to the random noise vectors. We propose a solution to this issue in the next section, where we substitute the observed LR-HS and HR-RGB images for the entirely artificial noise. Additionally, we approximate the degradation operations using two distinctive convolutional layers that can be applied as learnable or fixed degradation models for a variety of real-world scenarios.

#### *2.4.2 The fusion context*

To deal with the mentioned problems, we improved the DUFL model above. The underlying prior structure of the HR-HS image is reflected by an internally designed network structure in the deep self-supervised HS image reconstruction (DSSH) framework, which also learns the network parameters exclusively using the observed LR-HS and HR-RGB images. In the proposed DSSH framework, we use the observed fusion context in network learning to gain insight into the scene-specific spatial and spectral priors given the observed images: $X$ reflects the hyperspectral properties of the underlying HR-HS image, although with lower spatial resolution, and $Y$ shows the high-resolution spatial structure, although with fewer spectral channels. To be more specific, we utilize an up-sampling layer to first transform the LR-HS image to the same spatial dimension as the HR-RGB image before merging them, as seen below:

$$Z\_{\rm in}^0 = \text{Stack}(UP(X), Y). \tag{8}$$

A simple fusion context can be used as input, but this generally results in convergence to a local minimum. To train a more reliable model that takes the scene-specific spatial and spectral priors into account, we add an additional perturbation. The input is then represented as follows:

$$\mathbf{Z}\_{\rm in}^{\rm i} = \mathbf{Z}\_{\rm in}^{0} + \lambda \mu,\tag{9}$$

where $\lambda$ is a small number indicating the intensity of the perturbation and $\mu$ is a 3D tensor randomly sampled from a uniform distribution with the same size as the fusion context $Z_{in}^0$. Here, $\lambda$ is set to 0.01 and reduced by half every 1000 steps throughout the training phase. The perturbation is applied to the input of the generative network $G_{\theta}$ at each training step.
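A minimal sketch of the DSSH fusion-context input of Eqs. (8) and (9); bilinear interpolation is assumed for the up-sampling layer $UP(\cdot)$, while the λ = 0.01 start value and its halving every 1000 steps follow the text.

```python
import torch
import torch.nn.functional as F

def fusion_context(X, Y):
    """Eq. (8): upsample the LR-HS image and stack it with the HR-RGB image."""
    up = F.interpolate(X, size=Y.shape[-2:], mode='bilinear', align_corners=False)
    return torch.cat([up, Y], dim=1)              # shape (1, L + 3, H, W)

def perturbed_context(z0, step, lam0=0.01):
    """Eq. (9): Z_in^i = Z_in^0 + lambda * mu, lambda halved every 1000 steps."""
    lam = lam0 * 0.5 ** (step // 1000)
    mu = torch.rand_like(z0)                      # uniform tensor, same size as Z_in^0
    return z0 + lam * mu
```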

Our suggested approach is capable of using any DCNN architecture to construct the generative network $G_{\theta}$. Potential HR-HS images frequently have complicated spectra, expressive patterns, and rich textures, all of which demand the full modeling power of the generative network $G_{\theta}$. Significant advancements have been achieved in generating high-quality natural images [28], and several generative architectures have been presented, for instance in adversarial learning situations [29].

#### **2.5 Degradation modules**

#### *2.5.1 Non-blind degradation module*

We apply degradation operations to obtain approximations of the LR-HS and HR-RGB images from the HR-HS image predicted by the generative network, in order to provide evaluation criteria for training the network. However, if only mathematical operations are utilized to approximate the degradation model, this part of the network is detached and cannot be included in an integrated training system. In this work, after constructing the backbone, we approximate the degradation model within a conventional learning system utilizing two parallel blocks. To specifically implement the blurring and downsampling transformations, we modified conventional deep convolutional layers. We apply the same kernel to the various spectral bands in the depth-wise convolution layer, set the stride to the downsampling factor, and set the bias term to "false", since identical blurring and downsampling operations are applied to each spectral band in a real scene. The equations of the blurring and downsampling transformations are written as follows:

$$\hat{X} = f_{SDW}(G_{\theta}(Z_{in})) = \left(k_{SDW} \otimes G_{\theta}(Z_{in})\right)\downarrow_{(Spa)}, \tag{10}$$

where $f_{SDW}(\cdot)$ denotes the role of the depth-wise convolution layer. To be more precise, $k_{SDW} \in \mathbb{R}^{1 \times 1 \times s \times s}$ refers to the shared kernel of the depth-wise convolution layer, which convolves each channel of the generated HR-HS image $G_{\theta}(Z_{in})$ independently. This layer reproduces the effect of a conventional two-dimensional convolution followed by nearest downsampling, realized via the stride of $f_{SDW}(\cdot)$, with the bias disabled. If the spatially degrading blur kernel is known, we simply set the weights as non-trainable and initialize them based on the known kernel; otherwise, the kernel weights are learned automatically during the network training phase. Similarly, for the spectral transform $f_{Spe}(\cdot)$, we either learn the $1 \times 1$ kernel weights during training or assign them based on the known CSF of the RGB camera. Specifically, we employ a conventional convolution kernel with three output channels and a kernel size of $1 \times 1$ to implement the spectral transform; we likewise set the stride to 1 and the bias term to "false", as shown in the following expression:

$$\hat{Y} = f_{Spe}(G_{\theta}(Z_{in})) = k_{Spe} \otimes G_{\theta}(Z_{in}), \tag{11}$$

where $f_{Spe}(\cdot)$ denotes the operation of the spectral convolution layer. The detailed spectra of the generated HR-HS image are transformed into a degraded RGB image using the convolution kernel $k_{Spe} \in \mathbb{R}^{L \times 3 \times 1 \times 1}$. The trainable kernel $k_{Spe}$ has the same dimension as $C^{(Spec)}$, which represents the spectral sensitivity function of an RGB sensor, allowing us to approximate it within our joint network. With the described framework, these two modules can be used concurrently in our integrated learning model.
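The two degradation modules can be realized as ordinary PyTorch layers, as in this sketch: a depth-wise convolution whose stride equals the scale factor for $f_{SDW}$ in Eq. (10), and a bias-free 1 × 1 convolution for $f_{Spe}$ in Eq. (11). Freezing the weights gives the non-blind setting; leaving them trainable gives the (semi-)blind setting of the next subsection. The weight sharing across bands is approximated here by replicating the kernel at initialization.

```python
import torch.nn as nn

def spatial_module(bands, ksize, scale, known_kernel=None):
    """f_SDW of Eq. (10): depth-wise blur whose stride realizes the downsampling."""
    conv = nn.Conv2d(bands, bands, ksize, stride=scale,
                     padding=ksize // 2, groups=bands, bias=False)
    if known_kernel is not None:          # non-blind: freeze the known blur kernel
        conv.weight.data.copy_(
            known_kernel.reshape(1, 1, ksize, ksize).repeat(bands, 1, 1, 1))
        conv.weight.requires_grad = False
    return conv

def spectral_module(bands, known_csf=None):
    """f_Spe of Eq. (11): 1 x 1 convolution mapping L bands to 3 RGB channels."""
    conv = nn.Conv2d(bands, 3, kernel_size=1, stride=1, bias=False)
    if known_csf is not None:             # non-blind: freeze the known CSF (L x 3)
        conv.weight.data.copy_(known_csf.t().reshape(3, bands, 1, 1))
        conv.weight.requires_grad = False
    return conv
```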

#### *2.5.2 (Semi-)blind degradation module*

This section focuses on automatically learning the transform parameters of the convolutional blocks that embody the unknown degradations. In the spatially semi-blind case, the weight parameters of $k_{SDW}$ in Eq. (10) are automatically learned since the blur process is unknown, while the weight parameters of $k_{Spe}$ are predetermined by setting them to the parameters of the known CSF kernel. Thus, we can easily extract the approximate LR-HS image from the generated HR-HS image $G_{\theta}$ using the specified deep convolutional layer $f_{SDW}$ while fixing the $k_{Spe}$ convolutional kernel. Similarly, it is straightforward to implement the opposite configuration to achieve a spectrally semi-blind process. Furthermore, these two modules can be learned concurrently in our integrated learning framework as a fully blind degradation module. As a result, the investigated learning model is extremely adaptable and simple to fit into many real-world scenarios. By substituting the degradation operations with the learnable convolutional blocks, the loss function used to train our deep self-supervised network can be rewritten as follows:


$$\begin{aligned} \left(\theta^*, \theta_{SDW}^*, \theta_{Spe}^*\right) &= \operatorname*{argmin}_{\theta} \alpha\beta_1 \|X - f_{SDW}(G_{\theta}(Z_{in}))\|_{F}^2 \\ &+ (1-\alpha)\beta_2 \|Y - f_{Spe}(G_{\theta}(Z_{in}))\|_{F}^2 \quad \text{s.t. } 0 \le G_{\theta}(Z_{in})_i \le 1 \ \forall i, \end{aligned} \tag{12}$$

As can be observed from Eq. (12), in order to rebuild the target well, we learn the generative network parameters rather than directly optimizing the underlying HR-HS image. In our network optimization procedure, the generative network *G<sup>θ</sup>* is trained using only test image pairs (i.e., observed LR-HS and HR-RGB images), and no HR-HS images are provided. This can be seen as a "zero-shot" self-supervised learning method [30]. As a result, we refer to our model as a self-supervised learning model for HSI-SR.
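Putting the pieces together, the following is a minimal sketch of the self-supervised optimization of Eq. (12), reusing the hypothetical `Generator`, `perturbed_context`, `spatial_module`, and `spectral_module` sketched above; the normalization weights follow the $\beta_1 = 1/N_1$, $\beta_2 = 1/N_2$ convention of Eq. (4).

```python
import torch

def train_dssh(G, f_sdw, f_spe, X, Y, z0, steps=12000, alpha=0.5):
    """Optimize Eq. (12) from the observed pair (X, Y) only; no HR-HS labels."""
    beta1 = 1.0 / X.numel()               # normalize over the LR-HS elements
    beta2 = 1.0 / Y.numel()               # normalize over the HR-RGB elements
    params = [p for m in (G, f_sdw, f_spe)
              for p in m.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)
    for step in range(steps):
        z_in = perturbed_context(z0, step)        # Eq. (9) input perturbation
        Z_hat = G(z_in)                           # HR-HS estimate, already in [0, 1]
        loss = (alpha * beta1 * (X - f_sdw(Z_hat)).pow(2).sum()
                + (1 - alpha) * beta2 * (Y - f_spe(Z_hat)).pow(2).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(z0).detach()                 # final prediction from the clean input
```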

#### **3. Experiment results**

#### **3.1 Experimental settings**

#### *3.1.1 Datasets*

The efficiency of the suggested method was evaluated using two benchmark HSI datasets, namely, CAVE [31] and Harvard [32]. The CAVE dataset includes 32 HS images of various real-world materials with a spatial resolution of 512 × 512 pixels. The Harvard dataset includes 50 images of various natural scenes, each with a resolution of 1392 × 1040 pixels and 31 spectral bands between 420 and 720 nm. In the experiments, a 1024 × 1024 sub-image at the top-left corner of each original Harvard HS image was cropped and resized to 512 × 512 pixels to serve as the ground-truth HS image. Using different spatial downsampling factors (8 and 16) for the bicubic degradation, the observed LR-HS images were generated from the ground-truth HS images of the two datasets, yielding sizes of 64 × 64 × 31 and 32 × 32 × 31. The observed HR-RGB images were generated by multiplying the HR-HS image by the spectral response function of a Nikon D700 camera [9].

#### *3.1.2 Evaluation metrics*

The proposed method is evaluated against various state-of-the-art methods using five widely used metrics: root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), spectral angle mapper (SAM), and relative dimensionless global error in synthesis (ERGAS). The generated HR-HS image and the ground-truth image were both taken from the same spatial position. RMSE, PSNR, and ERGAS quantitatively measure how the recovered HR-HS image differs from the reference image to assess the spatial accuracy. SAM provides the average spectral angle between the two spectral vectors to show the spectral accuracy. Additionally, SSIM was employed to evaluate how closely the spatial structures of the two images resemble one another. A greater PSNR or SSIM and a lower RMSE, ERGAS, or SAM generally indicate superior performance. Bold values in the tables denote the best results.
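As a reference, two of these metrics can be computed as in the following NumPy sketch (PSNR over the whole cube and SAM as the mean per-pixel spectral angle); some works compute PSNR band-wise and average, so treat the exact convention as an assumption.

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio computed over the whole HS cube."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def sam(ref, est, eps=1e-12):
    """Mean spectral angle in degrees; ref and est have shape (H, W, L)."""
    r = ref.reshape(-1, ref.shape[-1])
    e = est.reshape(-1, est.shape[-1])
    cos = (r * e).sum(axis=1) / (
        np.linalg.norm(r, axis=1) * np.linalg.norm(e, axis=1) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)).mean())
```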

#### *3.1.3 Details of the network implementation*

The suggested approach was implemented in PyTorch. The input noise was first set to the same size as the HR-HS image to be generated. The generative network was trained utilizing the Adam optimizer and a loss function based on the $L_2$ criterion. The learning rate was initially set to 1e-3 and decayed by a factor of 0.7 every 1000 steps. Additionally, the perturbation was initially set to 0.05 and reduced by half every 1000 steps. The optimization process was terminated after 12,000 iterations for all ground-truth HR-HS images from the various datasets and upscale factors. All experiments were carried out in a training environment using a Tesla K80 GPU. According to our experiments, it takes around 20 minutes to learn an image of size 512 × 512. Across all experiments, we first set the hyperparameter $\alpha$ in the loss function of Eq. (12) to 0.5.
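The stated schedule maps directly onto standard PyTorch utilities, as in this sketch; the dummy parameter only makes the snippet self-contained, and in practice `params` would be the generator (and any learnable degradation) weights.

```python
import torch

params = [torch.zeros(1, requires_grad=True)]       # placeholder parameters
opt = torch.optim.Adam(params, lr=1e-3)             # Adam, initial lr of 1e-3
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.7)
delta = 0.05                                        # initial perturbation level
for step in range(12000):                           # stop after 12,000 iterations
    # ... forward pass, L2 loss, opt.zero_grad(), loss.backward(), opt.step() ...
    sched.step()
    if (step + 1) % 1000 == 0:
        delta *= 0.5                                # halve the perturbation
```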

#### **3.2 Performance evaluation**

In the study of HS image super-resolution, there are three main paradigms: 1) traditional optimization methods that form image priors based on practical knowledge or physical properties, 2) fully supervised deep learning methods that learn external image priors from training datasets, and 3) unsupervised methods that learn image priors automatically.

#### *3.2.1 Comparison with traditional non-blind optimization-based methods*

The generalized simultaneous orthogonal matching pursuit (G-SOMP+) method [33], the sparse non-negative matrix factorization (SNNMF) method [34], the coupled spectral unmixing (CSU) method [9], the non-negative structured sparse representation (NSSR) method [7], the Bayesian sparse representation (BSR) method [35], and other optimization-based HSI-SR methods have all been presented recently. To reconstruct stable HS images, conventional optimization-based approaches often employ a variety of hand-crafted priors, and knowledge of the degradation processes (spatial blurring/downsampling and spectral transformations) is a requirement for all of these approaches. In contrast, we propose a deep unsupervised learning network to automatically learn the scene-specific priors of the latent HR-HS image, which can also yield reconstruction results when the degradation operations are unknown. For a fair comparison, we first approximated the bicubic degradation using the Lanczos kernel to initialize the weights of the spatial degradation blocks, and then initialized the spectral transform blocks using the CSF of the Nikon D700 camera, without learning these blocks. We evaluated the efficacy for spatial upscaling factors of 8 and 16; the compared results on the CAVE and Harvard datasets are shown in **Table 1**, and the visualization results are shown in **Figure 2**.

**Table 1.**
*Compared results of the conventional non-blind optimization methods with DUFL and DSSH methods in the CAVE and Harvard datasets for up-scale factors: 8 and 16.*

**Figure 2.**
*Visualization of the DHP [36], uSDN [20], SNNMF [37], and difference images between the proposed DUFL+ method and the ground-truth/reconstructed images in the CAVE and Harvard datasets for an up-scale factor 16.*

#### *3.2.2 Comparison with deep non-blind learning-based methods*

Deep learning-based methods have recently been thoroughly investigated for HSI-SR tasks, in both fully supervised and unsupervised manners. The unsupervised sparse Dirichlet-net (uSDN) [20], the deep hyperspectral image prior (DHP) [36], and the GDD method [22] are just a few examples of works that have attempted unsupervised strategies for HSI-SR tasks. Our approach falls within the unsupervised branch of HSI-SR methods. In this part, we compare against supervised and unsupervised deep learning algorithms, such as SSF-Net [33], ResNet [14], DHSIS [16], uSDN [20], and DHP [36]. Only 12 test images from the CAVE dataset and 10 test images from the Harvard dataset were compared, because the supervised deep learning methods need training examples to learn the model. The comparison results on the CAVE and Harvard datasets are shown in **Table 2** for the two spatial upscaling factors 8 and 16. It is clear from **Table 2** that our proposed method performs noticeably better than the unsupervised deep learning-based methods, and even better than the supervised methods. The visualization results are shown in **Figure 3**.

**Table 2.**
*Compared results of the deep non-blind learning-based methods with DUFL and DSSH methods in the CAVE and Harvard datasets for up-scale factors: 8 and 16.*

**Figure 3.**
*Visualization of the traditional optimization-based methods CSU [9] and NSSR [7], the supervised deep learning-based method DHSIS [16], and the unsupervised deep learning-based methods uSDN [20], DHP [36], and the proposed DSSH method in the CAVE and Harvard datasets for an up-scale factor 16.*

#### *3.2.3 Comparison with (semi-)blind methods*

Our proposed method is implemented in a unified framework, which is capable of reconstructing the HR-HS image from the observations not only with known spatial and spectral degradation operations but also with an unknown spatial or spectral degradation operation, or with both unknown. Thus, our proposed method can be run in a semi-blind setting (an unknown spatial downsampling kernel for the LR-HS image or an unknown CSF for the HR-RGB image). Moreover, our suggested solution can also be used in a completely blind mode (unknown spatial degradation operations for the LR-HS image and unknown CSF for the HR-RGB image). The compared results of our proposed method under the semi-blind and complete-blind settings, the state-of-the-art unsupervised semi-blind method UAL [23] (spatially blind only), and a spatially blind implementation of NSSR [7] obtained by setting an incorrect spatial kernel are given in **Table 3**.

**Table 3.**
*Compared results of the (semi-)blind methods with DUFL and DSSH methods in the CAVE and Harvard datasets for an up-scale factor 8.*

#### *3.2.4 Ablation study*

We adjusted the hyperparameter $\alpha$ to 0.3, 0.5, and 0.7 in order to assess the impact of different data-fidelity weightings on the loss function of the DUFL method. The comparative results are shown in **Table 4**. The quantitative measurements of our DUFL+ method in terms of PSNR, SAM, and ERGAS, also shown in **Table 4**, demonstrate that the performance is not significantly affected by the specific assignment of the hyperparameter $\alpha$. Similarly, the performance of the DSSH reconstruction method was then evaluated in the ablation study by varying $\alpha$ between 0 and 1.0 with an interval of 0.2; the compared results are shown in **Table 5**.



#### **Table 4.**

*Ablation results of the DUFL+ method with different weight values α of 0.3, 0.5, and 0.7 in the CAVE and Harvard datasets for up-scale factors: 8 and 16.*


#### **Table 5.**

*Ablation results of the DSSH method with different weight values α from 0.0 to 1.0 in the CAVE and Harvard datasets for an up-scale factor 8.*

#### **4. Conclusions**

In order to address the super-resolution issue for hyperspectral images, we presented an unsupervised deep hyperspectral image super-resolution framework. A deep convolutional neural network is used to automatically learn the spatial and spectral structures of the latent HR-HS image from a perturbed noise input or from the fusion context, which naturally collects a significant quantity of low-level image statistics. A special depth-wise convolution layer is designed to achieve the degradation transformations between the desired target and the observations, and this yields a universally learnable module that only uses the low-quality observations. Without requiring training samples, the proposed unsupervised deep learning framework can efficiently take advantage of the high-resolution spatial structure of the HR-RGB image and the detailed spectral characteristics of the LR-HS image to deliver more accurate HS image reconstruction. We simply train the parameters of a generative network using the observed LR-HS and HR-RGB images to reconstruct the underlying HR-HS image. Extensive experiments on the CAVE and Harvard datasets demonstrate promising results in the quantitative evaluation.


### **Author details**

Zhe Liu and Xian-Hua Han\* Graduate School of Sciences and Technology for Innovation, Yamaguchi University, Yamaguchi, Japan

\*Address all correspondence to: hanxhua@yamaguchi-u.ac.jp

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Xu JL, Riccioli C, Sun DW. Comparison of hyperspectral imaging and computer vision for automatic differentiation of organically and conventionally farmed salmon. Journal of Food Engineering. 2017;**196**: 170-182

[2] Bishop CA, Liu JG, Mason PJ. Hyperspectral remote sensing for mineral exploration in Pulang, Yunnan Province, China. International Journal of Remote Sensing. 2011;**32**(9): 2409-2426

[3] Barnes M, Pan Z, Zhang S. Systems and methods for hyperspectral medical imaging using real-time projection of spectral information. Google Patents; 2018. US Patent 9,883,833

[4] Bioucas-Dias JM, Plaza A, Camps-Valls G, Scheunders P, Nasrabadi N, Chanussot J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine. 2013;**1**(2):6-36

[5] Laben CA, Brower BV. Process for enhancing the spatial resolution of multispectral imagery using pansharpening. Google Patents; 2000. US Patent 6,011,875.

[6] Lanaras C, Baltsavias E, Schindler K. Hyperspectral super-resolution by coupled spectral unmixing. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: ICCV; 2015. pp. 3586-3594

[7] Dong W, Fu F, Shi G, Cao X, Wu J, Li G, et al. Hyperspectral image superresolution via non-negative structured sparse representation. IEEE Transactions on Image Processing. 2016;**25**(5): 2337-2352

[8] He W, Zhang H, Zhang L, Shen H. Total-variation-regularized low-rank matrix factorization for hyperspectral image restoration. IEEE Transactions on Geoscience and Remote Sensing. 2015; **54**(1):178-188

[9] Yokoya N, Zhu XX, Plaza A. Multisensor coupled spectral unmixing for time-series analysis. IEEE Transactions on Geoscience and Remote Sensing. 2017;**55**(5):2842-2857

[10] Akhtar N, Shafait F, Mian A. Sparse spatio-spectral representation for hyperspectral image super-resolution. In: European Conference on Computer Vision. Zurich, Switzerland: Springer; 2014. pp. 63-78

[11] Kawakami R, Matsushita Y, Wright J, Ben-Ezra M, Tai YW, Ikeuchi K. High-resolution hyperspectral imaging via matrix factorization. In: CVPR 2011. Colorado Springs, CO, USA: IEEE; 2011. pp. 2329-2336

[12] Li Y, Hu J, Zhao X, Xie W, Li J. Hyperspectral image super-resolution using deep convolutional neural network. Neurocomputing. 2017;**266**: 29-41

[13] Han XH, Shi B, Zheng Y. Ssf-cnn: Spatial and spectral fusion with cnn for hyperspectral image superresolution. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece: IEEE; 2018. pp. 2506-2510

[14] Han XH, Sun Y, Chen YW. Residual component estimating CNN for image super-resolution. In: 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). Singapore: IEEE; 2019. pp. 443-447


[15] Han XH, Chen YW. Deep residual network of spectral and spatial fusion for hyperspectral image super-resolution. In: 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). Singapore: IEEE; 2019. pp. 266-270

[16] Dian R, Li S, Guo A, Fang L. Deep hyperspectral image sharpening. IEEE Transactions on Neural Networks and Learning Systems. 2018;**29**(11): 5345-5355

[17] Han XH, Zheng Y, Chen YW. Multi-level and multi-scale spatial and spectral fusion CNN for hyperspectral image super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop. Seoul, Korea: ICCVW; 2019

[18] Xie Q, Zhou M, Zhao Q, Meng D, Zuo W, Xu Z. Multispectral and hyperspectral image fusion by MS/HS fusion net. In: Proceedings of the IEEE/ CVF Conference on Computer Vision and Pattern Recognition. Long Beach, California, USA: CVPR; 2019. pp. 1585-1594

[19] Zhu Z, Hou J, Chen J, Zeng H, Zhou J. Hyperspectral image super-resolution via deep progressive zero-centric residual learning. IEEE Transactions on Image Processing. 2020;**30**:1423-1428

[20] Qu Y, Qi H, Kwan C. Unsupervised sparse dirichlet-net for hyperspectral image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: CVPR; 2018. pp. 2511-2520

[21] Liu Z, Zheng Y, Han XH. Unsupervised multispectral and hyperspectral image fusion with deep spatial and spectral priors. In: Proceedings of the Asian Conference on Computer Vision Workshops. Kyoto, Japan: ACCV; 2020

[22] Uezato T, Hong D, Yokoya N, He W. Guided deep decoder: Unsupervised image pair fusion. In: European Conference on Computer Vision. Glasgow, United Kingdom: Springer; 2020. p. 87-102

[23] Zhang L, Nie J, Wei W, Zhang Y, Liao S, Shao L. Unsupervised adaptation learning for hyperspectral imagery super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: CVPR; 2020. pp. 3073-3082

[24] Fu Y, Zhang T, Zheng Y, Zhang D, Huang H. Hyperspectral image superresolution with optimized rgb guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, California, USA: CVPR; 2019. pp. 11661-11670

[25] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:151106434. 2015

[26] Ulyanov D, Vedaldi A, Lempitsky V. Deep image prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: CVPR; 2018. pp. 9446-9454

[27] Seeliger K et al. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage. 2018;**181**:775-785

[28] Zou C, Huang X. Hyperspectral image super-resolution combining with deep learning and spectral unmixing. Signal Processing: Image Communication. 2020;**2020**:115833

[29] He Z, Liu H, Wang Y, Hu J. Generative adversarial networks-based semi-supervised learning for hyperspectral image classification. Remote Sensing. 2017;**9**(10):1042

[30] Imamura R, Itasaka T, Okuda M. Zero-shot hyperspectral image denoising with separable image prior. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. Seoul, Korea: ICCV; 2019

[31] Yasuma F, Mitsunaga T, Iso D, Nayar SK. Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing. 2010;**19**(9):2241-2253

[32] Chakrabarti A, Zickler T. Statistics of real-world hyperspectral images. In: CVPR 2011. Colorado Springs, CO, USA: IEEE; 2011. pp. 193-200

[33] Sims K et al. The effect of dictionary learning algorithms on super-resolution hyperspectral reconstruction. In: 2015 XXV International Conference on Information, Communication and Automation Technologies (ICAT). Kyoto, Japan: IEEE; 2015. pp. 1-5

[34] Kim H, Park H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics. 2007;**23**(12):1495-1502

[35] Akhtar N, Shafait F, Mian A. Bayesian sparse representation for hyperspectral image super resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, Massachusetts, USA: CVPR; 2015. pp. 3631-3640

[36] Sidorov O, Hardeberg JY. Deep hyperspectral prior: Single-image denoising, inpainting, super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. Seoul, Korea: ICCVW; 2019

[37] Wycoff E, Chan TH, Jia K, Ma WK, Ma Y. A non-negative sparse promoting algorithm for high resolution hyperspectral imaging. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2013. pp. 1409-1413

#### **Chapter 4**

## Hyperspectral and Multispectral Image Fusion Using Deep Convolutional Neural Network - ResNet Fusion

*K. Priya and K.K. Rajkumar*

#### **Abstract**

In recent years, deep learning HS–MS fusion has become a very active research tool for the super-resolution of hyperspectral images. Deep convolutional neural networks (CNNs) help to extract more detailed spectral and spatial features from the hyperspectral image. In a CNN, each convolution layer takes its input from the previous layer, which may cause information loss as the depth of the network increases. This loss of information causes vanishing gradient problems, particularly in the case of very high-resolution images. To overcome this problem, in this work we propose a novel HS–MS ResNet fusion architecture built with skip connections. The ResNet fusion architecture contains residual blocks with different numbers of stacked convolution layers; in this work, we tested residual blocks with two, three, and four stacked convolution layers. To strengthen the gradients and decrease the negative effects of gradient vanishing, we implemented the ResNet fusion architecture with different kinds of skip connections: short, long, and dense. We measured the strength and superiority of our ResNet fusion method against traditional methods on four public datasets using standard quality measures and found that our method shows outstanding performance compared to all other methods.

**Keywords:** convolution neural network, residual network, ResNet fusion, stacked layer, dense skip connection

#### **1. Introduction**

Spectral imaging technology captures a contiguous spectrum for each image pixel over a selected range of wavelength bands. Thus, spectral images accommodate more information than conventional monochromatic or RGB images. The wide range of spectral information available in hyperspectral images brings spectral imaging technology into a new horizon of research for analyzing pixel content at a macroscopic level. This tremendous change in the image processing research area will make revolutionary developments in every walk of human life in the coming future. In general, spectral images are divided into multispectral (fewer than 20 sampled wavelength bands) and hyperspectral (more than 20 wavelength bands). A multispectral image (MSI) captures a maximum of about 20 spectral bands, whereas a hyperspectral image (HSI) captures hundreds of contiguous spectral bands at a time. Due to this exciting prominence, HSI is now becoming an emerging area and at the same time faces a lot of challenges in analyzing the minute details of pixel content in image processing and computer vision [1].

Hyperspectral images (HSIs) are rich in spectral information, which greatly strengthens their information-storing ability. This property of HSI enables rapid growth in many areas such as remote sensing, medical science, the food industry, and various computer vision tasks. However, hyperspectral images capture all these bands within narrow wavelength ranges, which limits the amount of energy received by each band. Therefore, HSI information can easily be influenced by many kinds of noise, which lowers the spatial resolution of the HSI [2].

Many studies have been introduced in the literature to control the trade-off between the spatial and spectral resolution of hyperspectral images. As a result, many HS–MS fusion methods have evolved in the past decades to address it. This straightforward HS–MS fusion approach has become one of the most popular and trending research areas in image processing and computer vision. The early approach is pansharpening-based image fusion, which fuses spectral and spatial information from low-resolution multispectral (LR–MS) images with high-resolution (HR) panchromatic (PAN) images to enhance the spatial and spectral resolution of the fused image. Subsequently, pansharpening image fusion algorithms have been gradually extended to HS–MS image fusion [3].

In HS–MS fusion, a hyperspectral image with high spatial and spectral resolution is estimated by fusing an LR–HS image with an HR–MS image of the same scene. However, the quality of the estimated spatial and spectral data is highly influenced by the constraints used in the fusion process. Recently, neural network-based methods have been widely used to improve HS–MS fusion quality in both the spatial and spectral domains. One such network, the convolutional neural network (CNN) in deep learning (DL), performs very well in image reconstruction, super-resolution, object detection, and related tasks [4].

In a CNN, each layer takes the output of the previous layer as its input, which tends to lose information as the network grows deeper. In this work, we use a ResNet-based HS–MS fusion that adds skip connections between the convolution layers. These skip connections help to carry identity information throughout the deep convolutional network [5].

The rest of this paper is organized as follows. Section 2 reviews the literature on HS–MS fusion methods, covering both traditional and recently introduced deep learning methods. Section 3 describes the materials and methods used in this work. Sections 4 and 5 present the problem formulation and the implementation of our work in detail. The results of our proposed method are discussed in Section 6, and finally, Section 7 concludes the work with future scope.

#### **2. Review of literature**

#### **2.1 Traditional methods**

Many algorithms have been proposed to enhance the spatial quality of HS images over the past decades. One popular and attractive approach is HS–MS image fusion, which is mainly divided into four groups: component substitution (CS), multiresolution analysis (MRA), Bayesian approaches, and spectral unmixing (SU) [6]. The CS and MRA methods are described under the concept of an injection framework, in which high-quality information from one image is injected into another [7]. Bayesian-based methods, in contrast, use a probability or posterior distribution that encodes prior information about the target image; the posterior distribution of the target image is conditioned on the given HS and MS images [8]. Later, spectral unmixing-based HS–MS image fusion was introduced and remains one of the most promising and widely used methods for enhancing the quality of HS images.

In the SU method, the quality of the abundance estimation depends strongly on the accuracy of the endmembers, so any obstruction during the endmember extraction process leads to inconsistency in the abundance estimation. To overcome this limitation, Paatero and Tapper in 1994 [9] introduced the nonnegative matrix factorization (NMF) method, which was later popularized by Lee and Seung in 1999 [10]. It has become an emerging tool for processing high-dimensional data owing to its automatic feature extraction capability. The main advantage of NMF is that it yields a unique solution to the problem compared with other unmixing techniques [11]. In general, NMF-based spectral unmixing jointly estimates both the endmembers and the corresponding fractional abundances in a single step, represented mathematically as follows:

$$\mathbf{Y} = \mathbf{E}\mathbf{A} \tag{1}$$

where the data matrix Y is simultaneously factorized into two nonnegative matrices, E (endmembers) and A (abundances), without any prior knowledge; hence NMF falls under an unsupervised framework [12]. NMF has since become one of the standard methods for blind-source spectral unmixing problems: it factorizes the input matrix into a product of two nonnegative matrices (the endmember matrix E and the abundance matrix A) by enforcing nonnegativity, and this constraint makes it highly relevant to SU for enhancing image quality. Finally, SU-based fusion is accomplished using the coupled NMF (CNMF) method to obtain an enhanced hyperspectral image with high spatial and spectral quality. The CNMF fusion algorithm gives a high-fidelity reconstructed image compared with other existing fusion methods [13]. A minimal sketch of the NMF update rules is given below.
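For concreteness, the following sketch factorizes a toy hyperspectral data matrix with the classic multiplicative update rules of Lee and Seung; the matrix sizes, iteration count, and random data are illustrative assumptions rather than settings from this chapter.

```python
import numpy as np

def nmf(Y, k, n_iter=200, eps=1e-9):
    """Factorize Y (bands x pixels) into nonnegative E (bands x k endmembers)
    and A (k x pixels abundances) so that Y ~ E @ A, using the classic
    multiplicative update rules of Lee and Seung."""
    bands, pixels = Y.shape
    rng = np.random.default_rng(0)
    E = rng.random((bands, k))   # endmember matrix, kept nonnegative
    A = rng.random((k, pixels))  # abundance matrix, kept nonnegative
    for _ in range(n_iter):
        # Multiplicative updates never produce negative entries.
        A *= (E.T @ Y) / (E.T @ E @ A + eps)
        E *= (Y @ A.T) / (E @ A @ A.T + eps)
    return E, A

# Toy example: 50 bands, 1000 pixels, 5 endmembers (illustrative sizes).
Y = np.abs(np.random.default_rng(1).normal(size=(50, 1000)))
E, A = nmf(Y, k=5)
print(np.linalg.norm(Y - E @ A) / np.linalg.norm(Y))  # relative residual
```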

Yokoya *et al.* in 2012 [14] introduced the coupled non-negative matrix factorization (CNMF) method, an unsupervised unmixing-based HS–MS image fusion. CNMF uses a straightforward approach to the unmixing and fusion processes, so its mathematical formulation and implementation are less complex than those of other existing fusion methods. The method optimizes the solution to minimize residual errors and reconstructs a high-fidelity hyperspectral image.

Simoes *et al.* in 2015 [15] introduced a super-resolution method for hyperspectral images termed HySure. This method formulates a model that preserves the edges between objects during unmixing-based data fusion, using an edge-preserving constraint called the vector total variation (VTV) regularizer, which preserves edges while promoting piecewise smoothness in the spatial content of the image.

Lin *et al.* in 2018 [16] introduced a convex optimization-based CNMF (CO-CNMF) method that incorporates a sparsity regularizer and a sum-of-squared-distances (SSD) regularizer. The SSD regularizer extracts high-quality data from the images, and sparsity is enforced through *ℓ*1-norm regularization.

Adding these two regularization terms, handled as two convex subproblems, upgrades the performance of the existing CNMF method. However, the performance of the CO-CNMF algorithm may degrade as the noise level increases; it is therefore necessary to add image denoising and spatial smoothing constraints to this fusion method.

Yang *et al.* in 2019 [17] introduced a CNMF method with total variation and signature-based regularizations, named TVSR-CNMF. The TV regularizer is added to the abundance matrix to ensure the image's spatial smoothness, and a signature-based regularizer (SR) is added to the endmember matrix to extract high-quality spectral data. This method therefore reconstructs a hyperspectral image with good spatial and spectral quality.

Yang *et al.* in 2019 [18] introduced a sparsity and proximal minimum-volume regularized CNMF method named SPR-CNMF. The minimum-volume regularizer controls and minimizes the distance between the selected endmembers and the center of mass of the selected region in the image to reduce computational complexity, refining the solution at each iteration until it reaches the simplex with minimum volume. This method improves fusion performance by limiting the loss of cubic structural information.

Influenced by these works, we implemented an unmixing-based fusion algorithm named fully constrained CNMF (FC-CNMF), a modified version of CNMF that includes all the spatial and spectral constraints available in the literature. In our method, a minimum-volume simplex constraint is imposed on the endmember matrix to fully exploit the spectral information, while sparsity and total variation constraints are applied to the abundance matrix to provide dimensionality reduction and spatial smoothness. Finally, we evaluated the quality of the fused image obtained by FC-CNMF against the methods discussed in the literature using standard quality measures; these evaluations showed that our method performs better, yielding high fidelity in the reconstructed images.

These traditional approaches reconstruct the high-resolution hyperspectral image by fusing the high-quality data from hyperspectral and multispectral images. To improve the quality of the reconstructed images, they rely on constraints such as sparsity, the minimum-volume simplex, and total variation regularization. Since the performance and quality of the reconstructed HS image are highly influenced by these constraints, the existing methods still leave ample room for enhancing HSI quality.

#### **2.2 Deep learning methods**

Deep learning (DL) is a subbranch of machine learning (ML) that has recently shown remarkable performance in research fields, especially image processing and computer vision. DL is based on artificial neural networks and has been widely used in areas such as super-resolution, classification, image fusion, and object detection. DL-based image fusion methods can extract deep features automatically from the image; they therefore overcome the difficulties faced by conventional image fusion methods and make the whole fusion process simpler.

A deep learning-based HS–MS image fusion concept was first introduced by Palsson *et al.* in 2017 [19]. In this method, a 3-D convolutional neural network (3D-CNN) fuses the LR–HS and HR–MS images to construct the HR–HS image.

This method improves the quality of the hyperspectral image while reducing noise and computational cost. However, the authors focused on enhancing the spatial data of the LR–HS image without adjusting the spectral information, which caused degradation of the spectral data [19].

Later, Masi *et al.* in 2017 [20] proposed a CNN architecture for image super-resolution that uses a deep CNN to extract both spatial and spectral features. The deep CNN acquires features from HSI with a very complex spatial–spectral structure. However, the authors used a single-branch CNN architecture, which makes it difficult to extract discriminating features from the image.

To overcome this drawback, Shao and Cai in 2018 [21] designed a fusion method that extends the CNN to the depth of a 3D-CNN for better fusion performance. They implemented a remote sensing image fusion neural network (RSIFNN) that uses two separate CNN branches: one extracts the spectral data and the other extracts the spatial data from the image. In this way, the method exploits both the spectral and the spatial information of the input images to reconstruct a hyperspectral image with high spectral and spatial resolution.

Yang *et al.* in 2019 [22] introduced a deep two-branch CNN for HS–MS fusion. This method uses a two-branch CNN architecture to extract spectral and spatial features from the LR–HSI and HR–MSI; the features extracted by the two branches are concatenated and then passed to a fully connected layer to obtain the HR–HSI. Whereas conventional fusion methods reconstruct the HR–HSI band by band, the CNN reconstructs all bands jointly, which helps to reduce the spectral distortion in the fused image. However, this method uses a fully connected layer for image reconstruction, which is heavily weighted and increases the number of network parameters.

Chen *et al.* in 2020 [23] introduced a spectral–spatial feature extraction fusion CNN (S2FEF-CNN), which extracts joint spectral and spatial features using three S2FEF blocks. The S2FEF method uses 1D and 2D convolution networks to extract spectral and spatial features, respectively, and then fuses them. It uses a fully connected layer for dimensionality reduction, which further reduces the network parameters during fusion. This method shows good results with less computational complexity than other deep learning-based fusion methods.

Although deep learning-based fusion methods have achieved tremendous improvements, they still possess several drawbacks [24]. As the network goes deeper, its performance saturates and then degrades rapidly. This is because each convolution layer takes its input from the output of the previous layer, so by the time the signal reaches the last layer, much of the meaningful information obtained in the initial layers has been lost. This information loss worsens as the architecture deepens, bringing negative effects such as overfitting and, in particular, the vanishing gradient problem [25].

Due to the vanishing gradient problem, existing deep learning-based fusion methods cannot extract detailed features from high-dimensional images. He *et al.* [5] introduced a deep network with residual learning to address this problem: a residual block is added between the layers to diminish performance degradation, and networks built on this concept are called residual networks, or ResNets. In this work, our aim is therefore to bring the ResNet architecture into the standard CNN to extract more detailed features from both the spatial and spectral data of HSI.

#### **3. Materials and methods**

#### **3.1 Dataset**

Four real datasets are used in this work: Washington DC Mall, Botswana, Pavia University, and Indian Pines. The Washington DC Mall dataset is a well-known dataset captured by the HYDICE sensor; it covers a spectral range from 400 to 2500 nm, with 1278 × 307 pixels and 191 bands. The Botswana dataset, captured by the Hyperion sensor over the Okavango Delta in Botswana, covers a spectral range from 400 to 2500 nm, with 1476 × 256 pixels and 145 bands. The Pavia University dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS-3) over the University of Pavia, northern Italy, in 2003; it covers a spectral range from 430 to 838 nm, with 610 × 340 pixels and 103 bands. Finally, the AVIRIS Indian Pines dataset was captured by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana, USA, in 1992; it covers a spectral range from 400 to 2500 nm, with 512 × 614 pixels and 192 bands [25]. All these datasets have been widely used in earlier spectral unmixing-based fusion research.

#### **3.2 Convolution neural networks**

Convolutional neural networks (CNNs) play an important role in deep learning models. A CNN is an algorithm specially designed to work with images, extracting deep features through convolution. Convolution applies a kernel filter across every element of an image so that the network can understand and react to each element within the image; this makes it especially helpful for extracting specific features from high-dimensional images. A convolutional network architecture is composed of an input layer, an output layer, and one or more hidden layers. The hidden layers are combinations of convolution layers, pooling layers, activation layers, and normalization layers, which automatically detect essential features without any human supervision. CNNs are therefore considered a powerful tool for image processing [27].

#### A. Convolution layer

The convolution layer extracts various features from the input image with the help of filters. In the convolution layer, a mathematical operation is performed between the input image and a filter with an m × m kernel. The filter slides across the input image, computing the dot product between the filter and the corresponding part of the image; repeating this as the kernel convolves over the whole image produces the output, called a feature map. The feature map carries essential information about the image, such as the boundaries and edges of objects [28]. A minimal sketch of this operation follows.
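The sketch below performs the sliding dot product just described on a single-band image; as is conventional in deep learning, the kernel is not flipped. The 3 × 3 edge kernel and the 8 × 8 image are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1): slide the kernel over the
    image and take the dot product at each position to build the feature map."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

image = np.random.default_rng(0).random((8, 8))  # toy single-band image
edge_kernel = np.array([[-1, -1, -1],            # simple 3x3 edge filter
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
feature_map = conv2d(image, edge_kernel)
print(feature_map.shape)  # (6, 6): (H - m + 1) x (W - n + 1)
```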

#### B. Pooling layer

The convolution layer is followed by a pooling layer, which reduces the size of the feature map while maintaining the essential features. There are two types of pooling layers: max pooling and average pooling. Max pooling takes the largest element in each region of the feature map, whereas average pooling computes the average of the elements in each region [28]. A small sketch of both is given below.
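A minimal sketch of both pooling variants, using a non-overlapping 2 × 2 window on a toy feature map (the window size and input values are illustrative assumptions):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: split the map into size x size windows and
    reduce each window to one value (its max or its average)."""
    H, W = feature_map.shape
    h, w = H // size, W // size
    windows = feature_map[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 feature map
print(pool2d(fm, mode="max"))   # 2x2 map of window maxima
print(pool2d(fm, mode="avg"))   # 2x2 map of window averages
```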


#### C. Activation function

One of the most important characteristics of any CNN is its activation function. There are several activation functions, such as sigmoid, tanh, softmax, and ReLU, each with its own importance. ReLU is the most commonly used activation function in DL, as it accounts for the nonlinear nature of the input data [28].

#### **3.3 Residual network (ResNet)**

A residual network is formed by stacking several residual blocks together. Each residual block consists of convolution layers, batch normalization, and activation layers. Batch normalization processes the data and brings numerical stability through scaling techniques, without distorting the structure of the data. The activation layer is added to help the neural network learn more complex data; as in standard CNNs, the ReLU (rectified linear unit) function accommodates the nonlinear nature of the image data when producing the output. The residual blocks allow information to flow from the first layers to the last layers of the network through a residual (skip) connection strategy. ResNet can therefore effectively carry features of the input data through to the output of the network and thus alleviate the vanishing gradient problem.

Let $\mathbf{x}$ be the input to a residual block. Processing $\mathbf{x}$ with the two stacked convolution layers of a residual unit yields $F(W_i \mathbf{x})$, where $W_i$ denotes the weights of the convolution layers. In ResNet, before the output $F(W_i \mathbf{x})$ of one block is passed as the input of the next layer, the term $\mathbf{x}$, the input of the residual block, is added to it; this additional identity-mapping path is called a skip connection. The general formulation of a residual block can therefore be represented as follows:

$$\mathbf{y} = F(W_i \mathbf{x}) + \mathbf{x} \tag{2}$$

Here $\mathbf{x}$ is the input and $\mathbf{y}$ is the output of the residual unit; $\mathbf{y}$ then serves as the input to the next residual block. The function $F(W_i \mathbf{x})$ represents the output of the stacked convolution layers, and $W_i$ is the weight associated with the $i$-th residual block. **Figure 2** uses two convolution layers for the residual unit, so the output of this residual layer can be written as:

$$F(\mathbf{x}, W) = W_2 \, \text{ReLU}(W_1 \mathbf{x}) \tag{3}$$

where ReLU represents the rectified linear unit activation function, and $W_1$ and $W_2$ are the weights associated with convolution layers 1 and 2 of the residual block. Deep residual networks consist of many stacked residual blocks, and each block can be formulated in general as follows:

$$\mathbf{x}_{i+1} = F(\mathbf{x}_i, W_i) + \mathbf{x}_i \tag{4}$$

where $F$ is the output of a residual block with $l$ stacked convolution layers and $\mathbf{x}_i$ is the residual connection into the $i$-th residual block; the output $\mathbf{x}_{i+1}$ of the $i$-th residual block is computed through the skip connection and element-wise addition.

**Figure 1.** *HS–MS fusion using CNN.*

After passing through the ReLU activation layer, the output of the residual network can be represented as:

$$\mathbf{y} = \text{ReLU}(\mathbf{x}_{i+1}) \tag{5}$$
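To make Eqs. (2)–(5) concrete, here is a minimal sketch of a two-layer residual block; PyTorch, the channel count, and the kernel size are our own assumptions, since the chapter does not publish framework-level code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked conv layers with a skip connection: y = ReLU(F(x, W) + x),
    where F(x, W) = W2 ReLU(W1 x) as in Eqs. (2)-(5)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x, W) = W2 ReLU(W1 x), Eq. (3)
        return self.relu(f + x)                   # skip connection + ReLU, Eqs. (4)-(5)

x = torch.randn(1, 64, 32, 32)   # toy feature map: batch, channels, H, W
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 32, 32])
```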

#### **4. Problem formulation**

Let $\mathbf{Z} \in \mathbb{R}^{L \times N}$ be the high-resolution hyperspectral image with $L$ spectral bands and $N$ pixels. The observed LR–HSI, obtained by downsampling the spatial quality of $\mathbf{Z}$ with a Gaussian blur of factor $d$, is represented as $\mathbf{Y}_h \in \mathbb{R}^{L \times N/d}$ with $L$ bands and $N/d$ pixels. Similarly, the observed HR–MSI, obtained by downsampling the spectral quality of $\mathbf{Z}$, is represented as $\mathbf{Y}_m \in \mathbb{R}^{L_m \times N}$ with $L_m$ bands and $N$ pixels, where $L_m < L$ [27]. The hyperspectral image can therefore be mathematically modeled as:

$$\mathbf{Z} = \mathbf{E}\mathbf{A} + \mathbf{R} \tag{6}$$

where $\mathbf{Z}$ is the original reference image, $\mathbf{E}$ and $\mathbf{A}$ are the endmember and abundance matrices, and $\mathbf{R}$ is the residual matrix.

The observed $\mathbf{Y}_h$ and $\mathbf{Y}_m$, the spatially and spectrally degraded versions of the image $\mathbf{Z}$, can be further represented mathematically as:

$$\mathbf{Y}_m \approx \mathbf{S}\mathbf{Z} + \mathbf{R}_m \tag{7}$$

$$\mathbf{Y}_h \approx \mathbf{Z}\mathbf{B} + \mathbf{R}_h \tag{8}$$

where $\mathbf{B} \in \mathbb{R}^{N \times N/d}$ is a Gaussian blur filter with blurring factor $d$, used to blur the spatial quality of the reference hyperspectral image $\mathbf{Z}$ to obtain the LR–HSI $\mathbf{Y}_h$, and $\mathbf{S} \in \mathbb{R}^{L_m \times L}$ is the spectral response function used to downsample the spectral quality of $\mathbf{Z}$ to obtain the HR–MSI $\mathbf{Y}_m$. The term $L_m$ denotes the number of spectral bands in the multispectral image after downsampling. In this work, the reference image $\mathbf{Z}$ is spectrally downsampled using the response of the standard Landsat 7 multispectral imager, which provides high-quality visual images of the Earth's surface, to obtain the HR–MSI with $L_m = 7$ [28]. Both $\mathbf{B}$ and $\mathbf{S}$ are sparse matrices containing zeros and ones. In the literature, the residual matrices $\mathbf{R}_m$ and $\mathbf{R}_h$ are generally assumed to be zero-mean Gaussian noise, so the original CNMF objective takes the form:

$$\text{CNMF}(\mathbf{E}, \mathbf{A}) = \left\| \mathbf{Y}_h - \mathbf{E}\mathbf{A}_h \right\|_F^2 + \left\| \mathbf{Y}_m - \mathbf{E}_m \mathbf{A} \right\|_F^2 \tag{9}$$

In this work, however, we treat the residual terms $\mathbf{R}_m$ and $\mathbf{R}_h$ as nonnegative residual matrices to account for nonlinearity effects in the image fusion [29].


The objective function of the original CNMF method in Eq. (9) can then be rewritten as:

$$\text{CNMF}(\mathbf{E}, \mathbf{A}, \mathbf{R}) = \left\| \mathbf{Y}_h - (\mathbf{E}\mathbf{A}_h + \mathbf{R}_h) \right\|_F^2 + \left\| \mathbf{Y}_m - (\mathbf{E}_m \mathbf{A} + \mathbf{R}_m) \right\|_F^2 \tag{10}$$

Eq. (10) therefore represents the proposed HS–MS fusion model, which includes the nonlinear nature of the image. To implement this model, we use the standard deep neural network architectures CNN and ResNet; for further enhancement, we implement a modified ResNet architecture with different numbers of stacked layers and multiple skip connections. A numerical sketch of the objective follows.
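As a sanity check on this model, the sketch below evaluates the two data-fit terms of Eq. (10) for synthetic matrices. The helper name `cnmf_objective`, all matrix sizes, and the stand-in degradation operators are illustrative assumptions; only the algebra follows Eqs. (7)–(10).

```python
import numpy as np

def cnmf_objective(Yh, Ym, E, A, B, S, Rh, Rm):
    """Eq. (10): ||Yh - (E Ah + Rh)||_F^2 + ||Ym - (Em A + Rm)||_F^2,
    with Ah = A @ B (spatially degraded abundances) and Em = S @ E
    (spectrally degraded endmembers)."""
    Ah = A @ B          # k x N/d
    Em = S @ E          # Lm x k
    term_h = np.linalg.norm(Yh - (E @ Ah + Rh), "fro") ** 2
    term_m = np.linalg.norm(Ym - (Em @ A + Rm), "fro") ** 2
    return term_h + term_m

rng = np.random.default_rng(0)
L, Lm, N, d, k = 100, 7, 400, 4, 5            # illustrative sizes
E, A = rng.random((L, k)), rng.random((k, N))
B = rng.random((N, N // d))                    # stand-in spatial blur/downsampling
S = rng.random((Lm, L))                        # stand-in spectral response
Z = E @ A
Yh, Ym = Z @ B, S @ Z                          # degraded observations, Eqs. (7)-(8)
Rh, Rm = np.zeros_like(Yh), np.zeros_like(Ym)  # nonnegative residuals (zero here)
print(cnmf_objective(Yh, Ym, E, A, B, S, Rh, Rm))  # ~0 for consistent data
```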

#### **5. Problem implementation**

#### **5.1 CNN fusion architecture**

In the CNN architecture, a 1D convolution is performed over the observed HS image $\mathbf{Y}_h$ of dimension $L_h \times N_h$, with $L_h$ spectral bands and $N_h$ pixels, to obtain the spectral data. In the same way, a 2D convolution is performed over the observed MS image $\mathbf{Y}_m$ of dimension $L_m \times N_m$, with $L_m$ spectral bands and $N_m$ pixels, to obtain the spatial data. Finally, the high spectral component obtained from $\mathbf{Y}_h$ and the high spatial component obtained from $\mathbf{Y}_m$ are fused together to reconstruct the HR–HSI. The entire deep neural network-based HS–MS fusion is shown in **Figure 1**.

In the CNN architecture, a *Conv1D()* convolution filter with kernel size $r$ and weights $v$ is used to extract the spectral data from the LR–HSI $\mathbf{Y}_h$, represented as follows:

$$f_{spec} = \text{Conv1D}(\text{ReLU}(F(v_i \, Y_h))) \tag{11}$$

Similarly, a *Conv2D()* convolution filter with kernel size $r \times r$ and weights $w$ is used to extract the spatial data from the HR–MSI $\mathbf{Y}_m$:

$$f_{spat} = \text{Conv2D}(\text{ReLU}(F(w_i \, Y_m))) \tag{12}$$

Both convolutional branches use the ReLU (rectified linear unit) activation function, i.e., ReLU(x) = max(x, 0), to provide a nonlinear mapping of the data. Finally, the extracted spatial and spectral features are fused to obtain the high-quality reconstructed image, as shown in Eq. (13).

$$\mathbf{F} = \text{ReLU}\left(f_{spec} \times f_{spat}\right) \tag{13}$$

To implement this CNN fusion architecture, we use two convolution networks, 1D and 2D. Both use the same number of convolution layers: four layers with 32, 64, 128, and 256 filters, respectively. Kernel sizes of 3 × 3 and 1 × 3 are used for the 2D CNN and the 1D CNN to extract the spatial and spectral information of the image. The architecture and parameters of the CNN HS–MS fusion are shown in **Table 1**, and a sketch of the two branches is given after the table.


**Table 1.** *The simple CNN fusion architecture.*
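Below is a minimal sketch of the two-branch CNN summarized in **Table 1** (four layers with 32, 64, 128, and 256 filters; 1 × 3 and 3 × 3 kernels). PyTorch, the padding scheme, and the seven-band MSI input (per the Landsat 7 response of Section 4) are assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn

def branch(conv, in_ch, k, pad):
    """Four conv layers with 32, 64, 128, 256 filters, each followed by ReLU."""
    layers, widths = [], [32, 64, 128, 256]
    for out_ch in widths:
        layers += [conv(in_ch, out_ch, kernel_size=k, padding=pad), nn.ReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

# Spectral branch: 1D convs (kernel 3) over each pixel's spectrum of the LR-HSI.
spec_branch = branch(nn.Conv1d, in_ch=1, k=3, pad=1)
# Spatial branch: 2D convs (kernel 3x3) over the bands of the HR-MSI.
spat_branch = branch(nn.Conv2d, in_ch=7, k=3, pad=1)

yh_spectra = torch.randn(16, 1, 100)  # 16 pixels, 100-band spectra (toy sizes)
ym_patch = torch.randn(1, 7, 32, 32)  # one 7-band MSI patch (toy sizes)
print(spec_branch(yh_spectra).shape)  # torch.Size([16, 256, 100])
print(spat_branch(ym_patch).shape)    # torch.Size([1, 256, 32, 32])
```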

In a CNN, each layer takes as its input the output of the previous layer, so information is lost as the architecture goes deeper. This problem in deep neural networks leads to overfitting of the data and is known as the vanishing gradient problem [24]. To overcome it, we implemented HS–MS fusion using an alternative ResNet-based network architecture. In the ResNet, we introduce skip connections between the convolution layers; these skip connections help to carry identity information throughout the deep convolutional network.

#### **5.2 ResNet fusion architecture**

The ResNet fusion architecture for HS–MS fusion uses residual (skip) connections, which improve the feature extraction capability on the images. For implementation, we use a 1D ResNet to extract the spectral features from the LR–HSI and a 2D ResNet to extract the spatial features from the HR–MSI. Both the 1D and 2D ResNet architectures consist of three residual blocks, each having two convolutional layers with 64 filters, as shown in **Figure 2**. A 3 × 3 kernel for the 2D ResNet and a 1 × 3 kernel for the 1D ResNet are used to extract the spatial and spectral data from the MSI and HSI. Each residual block has a ReLU activation layer to accommodate the nonlinearity constraints included in the proposed hyperspectral image fusion model, as explained in Eq. (10). Finally, feature embedding and image reconstruction are performed using another 2D CNN.

**Figure 2.** *Residual block with two stacked layers.*


#### A. Spectral generative network

The spectral data of the hyperspectral image $\mathbf{Y}_h$ are extracted using the 1D ResNet. Initially, spectral data are extracted from the LR–HSI using a 1D CNN, and the residual connection $r(Y_h)$ is then mapped onto the stacked convolution layers. Finally, the sum of the 1D CNN output and $r(Y_h)$ is given as the input to the next residual block, and this process is repeated for every residual block in the ResNet. The entire process in the 1D ResNet is expressed mathematically as:

$$f\left(Y_{h_l}\right) = \text{ReLU}\left(W_l \, Y_{h_l}\right) \tag{14}$$

$$f_{spec}\left(Y_{h_l}\right) = f\left(Y_{h_l}\right) + r\left(Y_{h_l}\right) \tag{15}$$

Therefore, the output of the $i$-th residual block is represented as:

$$f_{spec}^{i} = f_{spec}^{i-1}\left(Y_{h_l}\right) + r^{i-1}\left(Y_{h_l}\right) \tag{16}$$

where $Y_h$ denotes the input LR–HSI data, $i$ indexes the residual units ($i = 1, 2, 3, \ldots, I$), and $l$ indexes the convolution layers ($l = 1, 2, 3, \ldots, L$). The weights of the convolution kernels are denoted by $W$. Finally, a ReLU activation function introduces nonlinearity into the output of the deep network:

$$F_{spec} = \text{ReLU}\left(f_{spec}\right) \tag{17}$$
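A minimal sketch of this spectral branch with three residual blocks of two 1D convolution layers each (64 filters, 1 × 3 kernels, per Section 5.2); PyTorch, the padding, and treating each spectrum as a one-channel 1D signal are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual unit of the spectral branch, Eqs. (14)-(16):
    f(Y_h) = ReLU(W Y_h); output = f(Y_h) + r(Y_h) (identity skip)."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))
        return f + x  # skip connection

spec_net = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=3, padding=1),  # initial 1D CNN on the spectrum
    ResBlock1D(), ResBlock1D(), ResBlock1D(),    # three residual blocks
    nn.ReLU(),                                   # F_spec = ReLU(f_spec), Eq. (17)
)

spectra = torch.randn(16, 1, 100)  # 16 pixels with 100-band spectra (toy sizes)
print(spec_net(spectra).shape)     # torch.Size([16, 64, 100])
```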

#### B. Spatial generative network

The spatial data of the HR–MSI $\mathbf{Y}_m$ are extracted using the 2D ResNet. Initially, spatial data are extracted from the HR–MSI using a 2D CNN, and the residual connection $r(Y_m)$ is then mapped onto the stacked convolution layers. Finally, the sum of the 2D CNN output and $r(Y_m)$ is given as the input to the next residual block, and this process is repeated for every residual block in the ResNet. The entire process in the 2D ResNet is expressed mathematically as:

$$f\left(Y_{m_l}\right) = \text{ReLU}\left(W_l \, Y_{m_l}\right) \tag{18}$$

$$f_{spat}\left(Y_{m_l}\right) = f\left(Y_{m_l}\right) + r\left(Y_{m_l}\right) \tag{19}$$

Therefore, the output of the $i$-th residual block is represented as:

$$f_{spat}^{i} = f_{spat}^{i-1}\left(Y_{m_l}\right) + r^{i-1}\left(Y_{m_l}\right) \tag{20}$$

where $Y_m$ denotes the input HR–MSI data, $i$ indexes the residual blocks ($i = 1, 2, 3, \ldots, I$), and $l$ indexes the convolution layers ($l = 1, 2, 3, \ldots, L$). The weights of the convolution kernels are denoted by $W$. Finally, as in the spectral branch, ReLU introduces nonlinearity into the spatial output of the deep network:

$$F_{spat} = \text{ReLU}\left(f_{spat}\right) \tag{21}$$

#### C. Fusion of spectral-spatial data

The spectral data of the LR–HSI and the spatial data of the HR–MSI are extracted using the ResNets, with sizes $(1 \times 1 \times \text{Spec})$ and $(\text{Spat} \times \text{Spat} \times 1)$, respectively. After obtaining the spatial and spectral features, the next step is to fuse them by element-wise multiplication:

$$\mathbf{F}_Z = F_{spec} \times F_{spat} \tag{22}$$

Then, feature embedding and image reconstruction are performed using a ReLU activation layer. The proposed ResNet fusion framework is shown in **Figure 3**. The final generated HR–HSI $\mathbf{Z}$ can therefore be written as:

$$\mathbf{Z} = \text{ReLU}(\mathbf{F}_Z) \tag{23}$$
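The broadcast element-wise product of Eqs. (22)–(23) can be sketched as follows; the tensors stand in for the outputs of the two branches above, with illustrative sizes Spec = 64 and Spat = 32.

```python
import torch

# Stand-ins for the branch outputs: a (1 x 1 x Spec) spectral feature and a
# (Spat x Spat x 1) spatial feature, using toy sizes Spec=64 and Spat=32.
F_spec = torch.rand(1, 1, 64)    # spectral features, Eq. (17)
F_spat = torch.rand(32, 32, 1)   # spatial features, Eq. (21)

# Eq. (22): element-wise multiplication with broadcasting yields a
# (Spat x Spat x Spec) fused cube.
F_Z = F_spec * F_spat
# Eq. (23): ReLU gives the reconstructed HR-HSI cube.
Z = torch.relu(F_Z)
print(Z.shape)  # torch.Size([32, 32, 64])
```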

#### D. Different stacked layers and skip connections

We also propose an extension of the ResNet fusion architecture that varies the number of stacked convolution layers (2 to 4) in the residual block to increase the performance of the deep fusion network. The two-layer residual block contains two stacked convolution layers followed by a ReLU activation layer; similarly, the three-layer and four-layer residual blocks contain three and four stacked convolution layers, each followed by a ReLU activation layer. In addition, we extend the ResNet fusion architecture with different skip connections, which help to regulate the flow of information through a deeper network more effectively. For this, we use long skip and dense skip connections, as shown in **Figure 4**. The long skip connections are designed by creating a connection between alternate residual layers $i$ and $i+2$, along with a short skip connection between every layer in the ResNet. In a dense skip connection, each layer $i$ obtains additional inputs from all the preceding layers and passes its own feature maps to all the subsequent layers.

**Figure 3.** *The framework of the proposed ResNet Fusion architecture.*


**Figure 4.** *Representation of short, long, and dense skip connections on ResNet.*

Using the dense skip connection, each layer in the ResNet receives feature maps from all the preceding layers, which limits the number of filters and network parameters needed to extract deep features. To obtain a high-fidelity reconstructed image, we propose a modified version of ResNet with long and dense skip connections, as shown in **Figure 4**.

**Figure 4** shows the three ResNet architectures, each having three residual blocks (Res Block) and one of the three types of skip connection. Algorithm 1 summarizes the procedure of our proposed ResNet fusion method, and a sketch of the dense-skip variant follows.
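A minimal sketch of the dense skip pattern over a stack of blocks, in which each block receives the concatenation of the feature maps of all preceding blocks; the fixed width and the 1 × 1 fusion convolutions are our own assumptions for keeping channel counts bounded.

```python
import torch
import torch.nn as nn

class DenseSkipStack(nn.Module):
    """Each block sees the concatenated outputs of all preceding blocks,
    so features flow directly from every layer to every later layer."""
    def __init__(self, ch=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(n_blocks):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch * (i + 1), ch, kernel_size=1),  # fuse all inputs
                nn.Conv2d(ch, ch, kernel_size=3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))  # dense skip connections
        return feats[-1]

x = torch.randn(1, 64, 32, 32)    # toy feature map
print(DenseSkipStack()(x).shape)  # torch.Size([1, 64, 32, 32])
```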


#### **6. Results and discussion**

In this paper, we first implemented CNN-based fusion by extracting the spectral data from the LR–HSI using a 1D convolution network and the spatial data from the HR–MSI using a 2D convolution network; these extracted features are then fused to obtain the HR–HSI. Extracting more detailed features from the HS and MS images requires a deeper CNN architecture, but as the CNN becomes deeper, it introduces the vanishing gradient problem. To overcome this, we implemented an unsupervised ResNet fusion network using skip connections. The proposed ResNet fusion inherits all the advantages of a standard CNN while allowing the design of a deeper network without performance degradation during feature extraction. The proposed ResNet fusion architecture therefore extracts more discriminative features from both the HSI and the MSI and reconstructs a high-resolution HSI by fusing these high-quality features.

The performance of the CNN and ResNet fusion methods is evaluated on the four benchmark datasets using the standard quality measures SAM, ERGAS, PSNR, and UIQI [30]. We also compared the CNN and ResNet fusion against the baseline fusion methods CNMF [14], FC-CNMF, and S2FEF-CNN [23]. Among these, CNN fusion performs better than CNMF and FC-CNMF, while the ResNet-based fusion shows outstanding performance compared with all the other methods, including CNN. The results obtained by the CNN and ResNet fusion methods against the baseline methods on the four benchmark datasets are shown in **Table 2**. A low SAM indicates good spectral content in the fused image, a low ERGAS indicates good overall statistical quality, and high PSNR and UIQI values indicate good spatial quality and a high-fidelity reconstruction with little spectral distortion.


**Table 2.** *The performance evaluation of different fused algorithms on four hyperspectral datasets.*


From **Table 2**, it is further clear that good spectral preservation is obtained on the Botswana dataset, whose SAM value is reduced by more than 0.02; at the same time, significant spatial preservation is achieved on the Indian Pines dataset, whose PSNR value increases by 1.5 dB.

The above work is extended by varying the number of stacked convolution layers in the residual block of the ResNet; the experimental results are shown in **Table 3**. The SAM values in **Table 3** make clear that the spectral quality of the image degrades as the number of stacked layers in the residual block increases, and the UIQI values likewise reveal that the quality of the reconstructed image diminishes as more layers are stacked. The PSNR and ERGAS show stable performance, which confirms the spatial consistency of our proposed method. From the results in **Table 3**, we therefore conclude that the ResNet fusion network with two stacked convolution layers acquires the most discriminative features from the source images and guarantees the quality of the reconstructed image.

**Figure 5** provides a visual comparison of the output of our proposed ResNet fusion method against all the baseline methods on the four benchmark datasets. From the figure, it is evident that ResNet fusion with two stacked convolution layers produces better results in most of the (highlighted) areas of the images from the four datasets.

We further extend the ResNet fusion architecture to reduce the number of parameters, making our proposed method more efficient and effective at handling high-dimensional data.


**Table 3.** *The performance of ResNet fusion by varying the stacked layers.*



**Figure 5.** *The ground truth and fused image of different methods using four benchmark datasets.*


**Table 4.** *The performance of different skip connections.*

For that, we applied short skip, long skip, and dense skip connections to the ResNet architecture with two stacked convolution layers. **Table 4** gives the total number of network parameters required by the ResNet architecture with each skip connection. From **Table 4**, it is clear that the ResNet architecture with dense skip connections requires far fewer network parameters than the ResNet with short or long skip connections.

#### A. Time complexity

The performance and running time of all the proposed algorithms on the four benchmark datasets are compared in **Figure 6**. From this figure, it is evident that ResNet fusion with dense skip connections requires the least running time while showing good performance in reconstructing a high-fidelity hyperspectral image.


**Figure 6.** *The running time of traditional and deep learning HS–MS image fusion.*

Comparing ResNet with long skip and short skip connections, the long skip variant achieves better performance and running time than the short skip variant. Evaluating all the ResNet fusion architectures, ResNet with dense skip connections outperforms the other two. Comparing performance and running time more broadly, the FC-CNMF method performs better and runs faster than CNN-based fusion. We therefore conclude that ResNet with dense skip connections, with its smaller number of network parameters, shows the best performance in reconstructing an HR–HSI of good spatial and spectral quality among all our proposed methods. However, although all our proposed methods perform well, the cost incurred in terms of time remains high.


#### B. ResNet HS–MS fusion model

The experimental analysis of our ResNet fusion architecture with various parameters was carried out to build a general model for our proposed HS–MS ResNet fusion algorithm. For this purpose, we trained the network using cropped HSI and MSI image pairs from each dataset: each dataset is cropped into several patches of size M × N × L and then divided into training and testing data. For the Pavia University dataset, the 610 × 340 × 103 cube is cropped into patches, and a patch size of M × N × L = 15 × 15 × 103 gave the best performance for our network model. Similarly, we created training and testing samples for the other three datasets: the patch size was 19 × 19 × 191 for the Washington DC Mall dataset, 17 × 17 × 145 for the Botswana dataset, and 19 × 19 × 192 for the Indian Pines dataset, which gave a network model with good running time and network parameter counts. A sketch of the patch extraction is given below.
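A minimal sketch of this patch-cropping step, assuming each cube is stored as a NumPy array of shape (rows, cols, bands); the non-overlapping stride and the 80/20 train/test split are illustrative assumptions.

```python
import numpy as np

def extract_patches(cube, patch=15, stride=15):
    """Crop a (rows, cols, bands) HSI cube into non-overlapping
    patch x patch x bands training samples."""
    rows, cols, bands = cube.shape
    patches = [cube[i:i + patch, j:j + patch, :]
               for i in range(0, rows - patch + 1, stride)
               for j in range(0, cols - patch + 1, stride)]
    return np.stack(patches)

rng = np.random.default_rng(0)
pavia = rng.random((610, 340, 103))         # stand-in for the Pavia cube
samples = extract_patches(pavia, patch=15)  # 15 x 15 x 103 patches
split = int(0.8 * len(samples))             # illustrative 80/20 split
train, test = samples[:split], samples[split:]
print(samples.shape, train.shape[0], test.shape[0])
```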

We measured the quality-metric values of our ResNet fusion while varying the number of stacked layers and found that residual blocks with two stacked convolution layers each perform better than the others. The most significant part of ResNet is the skip connection, which helps information flow through the network more efficiently and effectively, so we also experimented with three skip connections: short skip, long skip, and dense skip. From this experiment, we found that ResNet with a dense skip connection reduces the number of network parameters to a large extent.





**Table 5.** *ResNet-dense skip architecture of the HS–MS image fusion.*

Finally, we built a generative ResNet model for HS–MS image fusion, as shown in **Table 5**. The ResNet fusion model uses 1D and 2D convolution networks, each consisting of three residual blocks; each residual block contains two convolution layers with 64 filters, a 3 × 3 kernel, stride = 1, max pooling, and 'same' padding. To make information flow accurately throughout the network, we use dense skip connections. At last, a 2D convolution decodes the reconstructed image back into its original format.

#### **7. Conclusion**

In this work, we implemented HS–MS fusion with deep learning methods because of their strong ability to extract features from images. We first implemented the fusion process with a conventional CNN. However, in a CNN each layer takes the output of the previous layer as its input, which tends to lose information as the network grows deeper, so we further implemented the fusion process in a ResNet by adding skip connections between the convolution layers. These skip connections help to extract more detailed features from the images without degradation problems. Our ResNet fusion architecture includes three residual blocks, each a combination of stacked convolution layers and skip connections. We also modified the ResNet fusion architecture with different numbers of stacked layers and found that ResNet with two stacked layers gives the most accurate results. Finally, we extended the ResNet architecture to reduce the number of parameters by using different skip connections: short skip, long skip, and dense skip. The experimental analysis shows that the ResNet-dense skip variant improves image reconstruction with far fewer network parameters and a lower running time than the other fusion methods. This deep residual network also helps to extract nonlinear features with the help of the ReLU activation layer. The experiments and performance analysis of our algorithm were carried out quantitatively on four benchmark datasets, and the fusion results indicate that the ResNet with dense skip fusion method shows outstanding performance over traditional and DL methods, preserving the spatial and spectral data of the reconstructed image to a large extent.


### **Author details**

K. Priya\* and K.K. Rajkumar Department of Information Technology, Kannur University, Kerala, India

\*Address all correspondence to: kodothpriya@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Hagen N, Kudenov MW. Review of snapshot spectral imaging technologies. Optical Engineering. 2013;**52**(10):090901

[2] Feng F, Zhao B, Tang L, Wang W, Jia S. Robust low-rank abundance matrix estimation for hyperspectral unmixing. IET International Radar Conference (IRC 2018). 2019;**2019**(21): 6406-6409

[3] Dhore AD, Veena CS. Evaluation of various pansharpening methods using image quality metrics. In: 2nd International Conference on Electronics and Communication Systems (ICECS). IEEE; 2015. DOI: 10.1109/ecs.2015.7125039

[4] Wang Z, Chen B, Ruiying L, Zhang H, Liu H, Varshney PK. FusionNet: An unsupervised convolutional variational network for hyperspectral and multispectral image fusion. IEEE Transactions on Image Processing. 2020;**29**:7565-7577

[5] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. pp. 770-778

[6] Loncan L, de Almeida LB, Bioucas-Dias JM, Briottet X, et al. Hyperspectral pansharpening: A review. IEEE Geoscience and Remote Sensing Magazine. 2015;**3**(3):27-46

[7] Vivone G et al. A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing. 2015; **53**(5):2565-2586

[8] Wei Q, Bioucas-Dias J, Dobigeon N, Tourneret JY. Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing. 2015;**53**:3658-3668

[9] Paatero P, Tapper U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994;**5**:111-126

[10] Lee DD, Seung HS. Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2001. pp. 556-562

[11] Tong L, Zhou J, Qian B, Yu J, Xiao C. Adaptive graph regularized multilayer nonnegative matrix factorization for hyper-spectral unmixing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 2020;**13**:434-447

[12] Cao J et al. An endmember initialization scheme for nonnegative matrix factorization and its application in hyper-spectral unmixing. ISPRS International Journal of Geo-Information. 2018;**7**:195. DOI: 10.3390/ijgi7050195

[13] Nascimento JMP, Bioucas-Dias JM. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing. 2005;**43**(4)

[14] Yokoya N, Yairi T, Iwasaki A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing. 2012;**50**:528-537


[15] Simoes M, Bioucas-Dias J, Almeida L, Chanussot J. A convex formulation for hyperspectral image super resolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing. 2015; **53**:3373-3388

[16] Lin C-H, Ma F, Chi C-Y, Hsieh C-H. A convex optimization-based coupled nonnegative matrix factorization algorithm for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing. 2018;**56**(3):1652-1667. DOI: 10.1109/tgrs.2017.2746078

[17] Yang F, Ma F, Ping Z, Guixian X. Total variation and signature-based regularizations on coupled nonnegative matrix factorization for data fusion. IEEE Access. 2019;**7**:2695-2706. DOI: 10.1109/ACCESS.2018.2857943

[18] Yang F, Ping Z, Ma F, Wang Y. Fusion of hyperspectral and multispectral images with sparse and proximal regularization. IEEE Access. 2019. DOI: 10.1109/ACCESS.2019.2961240

[19] Palsson F, Sveinsson JR, Ulfarsson MO. Multispectral and hyperspectral image fusion using a 3-D convolutional neural network. IEEE Geoscience and Remote Sensing Letters. 2017;**14**:639-643

[20] Masi G, Cozzolino D, Verdoliva L, Scarpa G. Pansharpening by convolutional neural networks. Remote Sensing. 2017;**8**(7):594

[21] Shao Z, Cai J. Remote sensing image fusion with deep convolutional neural network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. May 2018;**11**(5): 1656-1669

[22] Yang J, Zhao Y-Q, Chan J. Hyperspectral and multispectral image fusion via deep two-branches convolutional neural network. Remote Sensing. 2019;**10**(5):800

[23] Chen L, Wei Z, Xu Y. A lightweight spectral–spatial feature extraction and fusion network for hyperspectral image classification. Remote Sensing. 2020;**12**(9):1395. DOI: 10.3390/rs12091395

[24] Song W, Li S, Fang L, Lu T. Hyperspectral image classification with deep feature fusion network. IEEE Transactions on Geoscience and Remote Sensing. 2018;**56**(7):3173-3184

[25] Available from: http://lesun.weebly. com/hyperspectral-data-set.html

[26] Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press; 2016. Available from: https://www.deeplearningbook.org/

[27] Ma F, Yang F, Ping Z, Wang W. Joint spatial-spectral smoothing in a minimum-volume simplex for hyperspectral image super-resolution. Applied Sciences. 2019;**10**(1)

[28] Available from: https://www.usgs. gov/landsat-missions/landsat-7

[29] Hong D, Yokoya N, Chanussot J, Zhu XX. An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing. 2019;**28**(4):1923-1938

[30] Wang Z, Bovik AC. A universal image quality index. IEEE Signal Processing Letters. 2002;**9**(3):81-84

### Section 4
