Theory and Algorithms of Deep Learning and Reinforcement Learning

#### **Chapter 1**

## Utilized System Model Using Channel State Information Network with Gated Recurrent Units (CsiNet-GRUs)

*Hany Helmy, Sherif El Diasty and Hazem Shatila*

#### **Abstract**

Multiple-input multiple-output (MIMO) technology uses multiple antennas to exploit reflected signals, providing channel robustness and throughput gains. It is advantageous in several applications, such as cellular systems in which users are distributed over a wide coverage area, and it improves channel state information (CSI) processing efficiency in massive MIMO systems. This chapter proposes a channel-based deep learning method to enhance performance in a massive MIMO system and compares the proposed technique with previous methods. The proposed technique is based on the channel state information network combined with gated recurrent units (CsiNet-GRUs), which increases recovery efficiency. Besides, a fair balance between compression ratio (CR) and complexity is achieved by using the correlation time of the training samples. The simulation results show that the proposed CsiNet-GRUs technique delivers a performance improvement over the existing literature techniques, namely the CS-based methods LASSO and TVAL3, Conv-LSTM CsiNet, and CsiNet.

**Keywords:** massive MIMO, FDD, compressed sensing, deep learning, convolutional neural network

#### **1. Introduction**

For fifth-generation wireless communication systems, the massive multiple-input multiple-output (MIMO) system is recognized as a powerful technology.

Such a system can significantly reduce multi-user interference and offer a multifold boost in cell throughput by outfitting a base station (BS) with hundreds or even thousands of antennas in a distributed or centralized way. Utilizing channel state information (CSI) at base stations (BSs) is the primary method for obtaining this potential benefit. The downlink CSI in modern frequency division duplex (FDD) MIMO systems (such as long-term evolution Release-8) is collected at the user equipment during the training phase and transmitted back to the BS via feedback links.

To minimize feedback overhead, vector quantization or codeword-based techniques are frequently used. The feedback quantities generated by these methods are not viable in a massive MIMO system since they must scale linearly with the number of transmit antennas. As shown in [1], the difficulty of CSI feedback in massive MIMO systems has inspired several studies. These works have primarily concentrated on reducing feedback overhead by exploiting the spatial and temporal correlation of CSI, which describes how a signal travels from the transmitter to the receiver and represents the combined effect of, for example, scattering, fading, and power decay with distance.

A difficult issue in wireless communication systems is channel estimation during auto-encoding. Most of the time, transmitted signals are reflected and scattered before they reach the receiver, and the channel changes over time as a result of the mobility of the transmitter, receiver, or scattering objects. Deep learning (DL) trains massive, multilayered neural networks on large amounts of training data to approximate how the human brain performs a particular activity. The channel state information network (CsiNet), built as CSI sensing (encoder) and recovery (decoder) networks, includes the features listed below in the auto-encoder system (**Figure 1**).

**Figure 1.** *Enhanced multiple-access for mmWave massive MIMO [2].*



User equipment encodes channel matrices into codewords using the encoder; after the codewords are returned to the BS, the BS uses the decoder to reconstruct the original channel matrices. The technique can be applied as a feedback protocol in FDD MIMO systems. CsiNet is closely related to the autoencoder [3] in deep learning, which is used to learn an encoding of a data set for dimensionality reduction. To recreate accurate models from CS data, several deep learning (DL) architectures have recently been designed and introduced in [4–6].

DL shows state-of-the-art performance in natural-image reconstruction, but because wireless channel reconstruction is more difficult than image reconstruction, it is unclear whether this capability carries over. The DL-based CSI reduction and recovery strategy is introduced in the current work. The most closely related research appears to be [7], in which DL-based CSI encoding is implemented in a closed-loop MIMO system. The present work differs from previous research, which did not consider CSI recovery, by demonstrating that CSI can be recovered by DL with significantly higher reconstruction quality than current CS-based methods.

#### **2. The structure of channel state information network (CsiNet)**

The structure of CsiNet [8], which uses depthwise separable convolution in feature recovery and reconstruction, is illustrated in detail below; CsiNet remarkably outperforms the CS-based methods. We introduce the CSI network feedback process, considering a single-cell FDD massive MIMO-OFDM framework with $N_t$ (≫ 1) transmit antennas at the BS and a single receive antenna at the UE. With OFDM over $N_c$ subcarriers, the received signal at the $n$th subcarrier can be expressed as:

$$y_n = \tilde{h}_n^H v_n x_n + z_n \tag{1}$$

where $\tilde{h}_n \in \mathbb{C}^{N_t \times 1}$ and $v_n \in \mathbb{C}^{N_t \times 1}$ are the channel frequency response vector and the precoding vector at the $n$th subcarrier, respectively, $x_n$ represents the transmitted data symbol, $z_n$ is the additive noise or interference, and $(\cdot)^H$ denotes the conjugate transpose. In the FDD system, to improve the feedback links between the UE and BS, we focus on a feedback scheme that allows autoencoder processing. Assume:

$\tilde{H} = [\tilde{h}_1, \dots, \tilde{h}_{N_c}]^H \in \mathbb{C}^{N_c \times N_t}$ is the CSI stacked in the spatial-frequency domain. The UE should return $\tilde{H}$ to the BS through the feedback links, so the total number of feedback parameters is $N_t N_c$. Using a 2D discrete Fourier transform (DFT), $\tilde{H}$ can be represented in the angular-delay domain to reduce the feedback overhead:

$$H = F_d \tilde{H} F_a^H \tag{2}$$

where $F_d$ and $F_a$ are $N_c \times N_c$ and $N_t \times N_t$ DFT matrices, respectively. Considering the COST 2100 channel model [9], shown in **Figure 2**, with a uniform linear array (ULA), $H$ has only a small fraction of significant components. In the delay domain, only the first $N'_c$ rows of $H$ contain significant values, so we retain the first $N'_c$ rows of $H$ and remove the remaining ones. In a massive MIMO system, the total number of feedback parameters can thus be reduced to $N = N'_c N_t$. So, we design the encoder,

$$s = f_{\text{en}}(H) \tag{3}$$

which converts $H$ into an $M$-dimensional codeword vector $s$, where $M < N$, and the decoder, which performs the inverse transformation from the codeword back to the original channel $H$:

$$H = f_{\text{de}}(s) \tag{4}$$
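To make the transform concrete, the following sketch applies Eq. (2) and the delay-domain truncation in NumPy. The matrix sizes and the random stand-in channel are assumptions for illustration; a channel drawn from the COST 2100 model would actually concentrate its energy in the first delay rows after the transform.

```python
import numpy as np

# Sketch of the angular-delay transform of Eq. (2) plus delay truncation.
Nc, Nt, Nc_trunc = 256, 32, 32   # subcarriers, antennas, retained rows (assumed)

F_d = np.fft.fft(np.eye(Nc)) / np.sqrt(Nc)   # Nc x Nc DFT matrix
F_a = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)   # Nt x Nt DFT matrix

# Random stand-in for the spatial-frequency CSI matrix H_tilde (Nc x Nt).
H_tilde = np.random.randn(Nc, Nt) + 1j * np.random.randn(Nc, Nt)

H = F_d @ H_tilde @ F_a.conj().T             # Eq. (2): angular-delay domain
H_truncated = H[:Nc_trunc, :]                # keep the first N'_c rows
print(H_truncated.shape)                     # (32, 32) -> N = N'_c * Nt parameters
```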

The European Cooperation in Science and Technology (COST) 2100 channel model is a geometry-based stochastic channel model (GSCM) that can reproduce the stochastic properties of massive MIMO channels

**Figure 2.** *A plot of the strength of $H \in \mathbb{C}^{32 \times 32}$ [8].*


over time, frequency, and space. A multipath component (MPC) is characterized in the delay and angular domains by its delay, angle of departure (azimuth of departure (AoD) and elevation of departure (EoD)), and angle of arrival (azimuth of arrival (AoA) and elevation of arrival (EoA)). MPCs with similar delays and angles are grouped into multipath clusters. The MATLAB implementation of the COST 2100 channel model (C2CM) supports both single-link and multiple-link MIMO channels in indoor (285 MHz) and semi-urban (5.3 GHz) channel scenarios. A detailed description of the channel model provides an overview of the C2CM; the parameterization of the C2CM in indoor scenarios is detailed together with a discussion of semi-urban scenarios, and the massive multiple-input multiple-output (MIMO) extensions of the C2CM are also given. The C2CM and its semi-urban channel scenario are implemented in MATLAB [9], as is the C2CM with massive MIMO extensions. However, the data generated by these MATLAB implementations are not presented as potential datasets for validating multipath clustering methods, even for the well-known clustering approaches.

#### **3. Recurrent unit system model**

#### **3.1 The structure**

Thanks to its gated structure, the gated recurrent unit (GRU) [10] can adaptively capture dependencies from long data sequences without discarding information from earlier parts of the sequence. This is accomplished by its gating units, which are related to those in long short-term memory (LSTM) networks and which resolve the vanishing/exploding gradient problem of conventional RNNs. These gates control the information that should be retained or discarded at each time step. The GRU operates like an RNN, except for its internal gating mechanisms: sequential input data is absorbed by the GRU cell at each time step along with memory, also known as the hidden state [11]. The hidden state is then fed back into the RNN cell together with the next input data in the sequence (**Figure 3**).

Fully gated unit

Initially, for $t = 0$, the output vector is $h_0 = 0$.

$$z_t = \rho_g\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{5}$$

$$r_t = \rho_g\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{6}$$

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \phi_h\left(W_h x_t + U_h\left(r_t \odot h_{t-1}\right) + b_h\right) \tag{7}$$

where $x_t$ is the input vector, $h_t$ the output vector, $z_t$ the update gate vector, $r_t$ the reset gate vector, and $W$, $U$, and $b$ denote the parameter matrices and vectors, respectively.

**Activation functions:** $\rho_g$ is the original sigmoid activation, and $\phi_h$ is the original hyperbolic tangent. Alternative activation functions can be used, provided that $\rho_g(x) \in [0, 1]$.

It is possible to construct alternative forms by modifying $z_t$ and $r_t$.
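As a concrete reference, here is a minimal NumPy implementation of one step of the fully gated unit in Eqs. (5)–(7); the dimensions and initialization are toy assumptions, not part of the chapter's model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One step of the fully gated GRU, following Eqs. (5)-(7)."""
    z_t = sigmoid(params["Wz"] @ x_t + params["Uz"] @ h_prev + params["bz"])  # Eq. (5)
    r_t = sigmoid(params["Wr"] @ x_t + params["Ur"] @ h_prev + params["br"])  # Eq. (6)
    h_cand = np.tanh(params["Wh"] @ x_t
                     + params["Uh"] @ (r_t * h_prev) + params["bh"])          # candidate
    return z_t * h_prev + (1.0 - z_t) * h_cand                                # Eq. (7)

# Toy dimensions (hypothetical): 4-dim input, 3-dim hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {k: rng.standard_normal((d_h, d_in if k[0] == "W" else d_h)) * 0.1
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({b: np.zeros(d_h) for b in ("bz", "br", "bh")})

h = np.zeros(d_h)                       # h_0 = 0, as stated above
for t in range(5):                      # unroll over a short random sequence
    h = gru_cell(rng.standard_normal(d_in), h, params)
print(h)
```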

GRU's ability to hold on to long-term dependencies, or memory, stems from the computations the gated recurrent unit cell performs to produce the hidden state. Whereas LSTMs pass two different states between cells, the cell state and the hidden state, which carry the long- and short-term memory respectively, GRUs transfer only one hidden state between time steps. Due to the gating mechanisms and computations that the hidden state and input data go through, this single hidden state can hold both long-term and short-term dependencies at the same time.

**Figure 3.** *Gated recurrent unit, fully gated version [12].*


The GRU cell contains only two gates: the update gate and the reset gate. Like the gates in LSTMs, the GRU gates are trained to selectively filter out irrelevant information while keeping what is useful. These gates are essentially vectors of values between 0 and 1 that are multiplied with the input data or hidden state.

A zero (0) value in a gate vector indicates that the corresponding data in the input or hidden state is unimportant and will, therefore, be returned as zero.

On the other hand, a one (1) value in the gate vector means that the corresponding data is essential and will be used. Reset gate: in the first step, we create the reset gate; this gate is derived and calculated using both the hidden state from the previous time step and the input data at the current time step (**Figure 4**).

**Figure 4.** *Reset gate flow [13].*


Mathematically, this is achieved by multiplying the previous hidden state and the current input by their respective weights, summing them, and passing the sum through the sigmoid function.

The sigmoid function will transform the values to fall between 0 and 1, allowing the gate to filter between the less-important and more-important information in the subsequent steps.

$$\text{gate}_{\text{reset}} = \sigma\left(W_{\text{input,reset}} \cdot x_t + W_{\text{hidden,reset}} \cdot h_{t-1}\right) \tag{8}$$

When the entire network is trained through back-propagation, the weights in the equation are updated such that the vector learns to retain only the valuable features. The previous hidden state is first multiplied by a trainable weight and then undergoes an element-wise (Hadamard) multiplication with the reset vector. This operation decides which information is kept from the previous time steps together with the new inputs.

Simultaneously, the current input is multiplied by a trainable weight before being summed with the product of the reset vector and the previous hidden state from above. Finally, a non-linear tanh activation is applied to the result to obtain $r$ in the equation below.

$$r = \tanh\left(\text{gate}_{\text{reset}} \odot \left(W_{h1} \cdot h_{t-1}\right) + W_{x1} \cdot x_t\right) \tag{9}$$

Update gate: next, we create the update gate. Like the reset gate, it is computed using the previous hidden state and the current input data (**Figure 5**). Both the update and reset gate vectors are created using the same formula, but the weights multiplied with the input and hidden state are unique to each gate, which means each gate's final vector is different; this allows the gates to serve their specific purposes.

**Figure 5.** *Update gate flow [13].*

$$\text{gate}_{\text{update}} = \sigma\left(W_{\text{input,update}} \cdot x_t + W_{\text{hidden,update}} \cdot h_{t-1}\right) \tag{10}$$

The Update vector will undergo element-wise multiplication with the previous hidden state to obtain u in our equation below, which will be used to compute our final output later.

$$u = \text{gate}_{\text{update}} \odot h_{t-1} \tag{11}$$

The Update vector will also be used in another operation later when obtaining our final output.

The purpose of the Update gate here is to help the model determine how much of the past information stored in the previous hidden state needs to be retained for the future. Combining the outputs: In the last step, we will be reusing the Update gate and obtaining the updated hidden state (**Figure 6**).

This time, we take the element-wise inverse of the same update vector (1 − gate_update) and perform an element-wise multiplication with our output from the reset gate, $r$. The purpose of this operation is for the update gate to determine what portion of the new information should be stored in the hidden state. Lastly, the result of the above operations is summed with our output from the update gate in the previous step, $u$.

This gives us our new, updated hidden state; we can use this new hidden state as our output for that time step by passing it through a linear activation layer.

$$h_t = r \odot \left(1 - \text{gate}_{\text{update}}\right) + u \tag{12}$$

The Reset gate determines which parts of the previous hidden state are to be combined with the current input to propose a new hidden state, and the Update gate determines how much of the previous hidden state is to be retained and what part of the new proposed hidden state derived from the Reset gate is to be added to the final

**Figure 6.** *Final output computations [13].*


hidden state. This solves the vanishing/exploding gradient problem. When the update gate is first multiplied with the previous hidden state, the network chooses which components of that state to keep in memory while discarding the rest. It then fills in the previously missing information when it uses the inverse of the update gate to filter the proposed new hidden state derived from the reset gate. As a result, the network can maintain long-term dependencies. If the update vector values are close to 1, the update gate may decide to keep most of the previous memories in the hidden state rather than recalculating or altering the hidden state entirely.

When training a recurrent neural network (RNN), the vanishing or exploding gradient problem can occur, especially if the RNN processes lengthy sequences or has multiple layers. The network's weights are updated in the right direction and by the right amount using the error gradient calculated during training. However, this gradient is computed using the chain rule, beginning at the end of the network. As a result, for long sequences, the gradients undergo continuous matrix multiplications and either shrink (vanish) or grow (explode) exponentially.

A gradient that is too small will prevent the model from effectively updating its weights, whereas a gradient that is too large will make the model unstable.

Due to the additive nature of their update gates, long short-term memory (LSTM) networks and gated recurrent units (GRUs) can keep most of the existing hidden state while adding new content on top of it, unlike traditional RNNs, which replace the entire hidden state content at each time step.

This prevents the additive operations from causing the error gradients to vanish or explode too quickly during backpropagation. The simplest remedy is to use alternative activation functions, such as ReLU, which does not produce a small derivative. Another option is residual networks, which provide residual connections directly to earlier layers. In a feedforward network (FFN), the backpropagated error signal typically decreases (or increases) exponentially as a function of the distance from the final layer; this phenomenon is referred to as the vanishing (or exploding) gradient.
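The following toy NumPy experiment illustrates the effect just described: backpropagating through many time steps multiplies many Jacobian-like factors, so gradient norms shrink or grow exponentially depending on the scale of the recurrent weights. The sequence length, dimension, and scales are arbitrary assumptions for the demonstration.

```python
import numpy as np

# Gradient norm after backpropagating through T steps of a linear recurrence:
# repeated chain-rule multiplications by the recurrent matrix W.
rng = np.random.default_rng(1)
T, d = 50, 16

for scale in (0.5, 1.5):                # contractive vs. expansive recurrent weights
    W = scale * rng.standard_normal((d, d)) / np.sqrt(d)
    grad = np.ones(d)
    for _ in range(T):
        grad = W.T @ grad               # one chain-rule step per time step
    print(f"scale={scale}: gradient norm after {T} steps = "
          f"{np.linalg.norm(grad):.3e}")
# Typical outcome: a vanishingly small norm for 0.5 and a huge norm for 1.5.
```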

#### **4. Design of the CsiNet-GRUs system model**

We enhance the architecture in this chapter by considering time correlation. The recurrent convolutional architectures that have successfully been used to represent and reconstruct images provide references for our work. The basic idea is to extract spatial features with convolutional neural networks (CNNs) and inter-frame correlation with recurrent neural networks (RNNs). Our contribution in this chapter is summarized as follows. CsiNet is extended with a gated recurrent unit network, a common type of RNN, and a DL-based CSI feedback protocol for FDD MIMO systems is proposed. The proposed CsiNet-GRU network modifies the CNN-based CsiNet for channel state information compression and initial recovery and uses gated recurrent units to extract time correlation for further resolution improvement.

CsiNet-GRUs exhibit remarkable robustness to compression ratio (CR) reduction and enable real-time and extensible channel state information (CSI) feedback applications without considerably increasing overhead compared with CsiNet. To reduce feedback overhead, we can exploit the following observations:

**Observation 1 (angular-delay domain sparsity):** $H_t$ can be transformed into an approximately sparse matrix $H'_t$ in the angular-delay domain via the 2D discrete Fourier transform (2D-DFT), $H'_t = F_d H_t F_a$, where $F_d \in \mathbb{C}^{N_c \times N_c}$ and $F_a \in \mathbb{C}^{N_t \times N_t}$ are two DFT matrices. First, due to the limited multipath time delay, performing the DFT on the frequency-domain channel vectors (i.e., the column vectors of $H_t$) transforms $H_t$ into a sparse matrix in the delay domain, in which only the first $N'_c$ ($< N_c$) rows have distinct non-zero values. Second, performing the DFT on the spatial-domain channel vectors (i.e., the row vectors of $H_t$) makes the channel matrix sparse in a defined angle domain if the number of transmit antennas is very large ($N_t \to +\infty$). Usually, $H'_t$ is only approximately sparse for finite $N_t$, which challenges conventional compressed sensing methods. Therefore, we propose a DL-based feedback architecture without a sparsity prior constraint, performing the sparsity transformation only to decrease parameter overhead and training complexity.

We retain the first $N'_c$ non-zero rows and truncate $H'_t$ to an $N'_c \times N_t$ matrix, $H''_t$, which reduces the total number of parameters for feedback to $N = N'_c N_t$.

**Observation 2 (correlation within coherence time):** User equipment motion during communication results in Doppler spread, giving wireless channels time-varying characteristics. With the maximum movement speed denoted as $v$, the coherence time can be calculated as:

$$\Delta t = \frac{c}{2 v f_0} \tag{13}$$

where $f_0$ is the carrier frequency and $c$ is the speed of light. The CSI samples within $\Delta t$ are considered correlated with one another. Therefore, instead of independently recovering the CSI, the BS can combine the feedback with previous channel information to enhance subsequent reconstructions.
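As a quick numerical illustration of Eq. (13), assume a UE moving at 30 m/s under the 5.3 GHz semi-urban carrier mentioned earlier; both values are assumptions for the example.

```python
# Worked example of Eq. (13); v and f0 are assumed values for illustration.
c = 3e8          # speed of light (m/s)
v = 30.0         # maximum movement speed (m/s), assumption
f0 = 5.3e9       # carrier frequency (Hz), assumption

delta_t = c / (2 * v * f0)
print(f"coherence time = {delta_t * 1e3:.3f} ms")   # ~0.943 ms
```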

We set the feedback time interval as $\delta t$ and place $T$ adjacent instantaneous angular-delay domain channel matrices into a channel group, i.e.,

$$\left\{H''_t\right\}_{t=1}^{T} = \left\{H''_1, \dots, H''_t, \dots, H''_T\right\} \tag{14}$$

The group exhibits the correlation property if $T$ satisfies $0 \le \delta t \cdot T \le \Delta t$.

In this chapter, we design an encoder, $s_t = f_{en}(H''_t)$, at the UE to compress each complex-valued $H''_t$ of $\{H''_t\}_{t=1}^T$ into an $M$-dimensional real-valued codeword vector $s_t$ ($M < N$). If two real-valued matrices are used to represent the real and imaginary parts of $H''_t$, then the CR is $M/2N$.

We also design a decoder with memory that can extract time correlation from the previously recovered channel matrices $\hat{H}''_1, \dots, \hat{H}''_{t-1}$ and combine them with the received $s_t$ to reconstruct the current channel, $\hat{H}''_t = f_{de}(s_t; \hat{H}''_1, \dots, \hat{H}''_{t-1})$, where $1 \le t \le T$. Then, an inverse 2D-DFT is performed to obtain the original spatial-frequency channel matrix. The channel state information network demonstrates remarkable performance in CSI sensing and reconstruction. However, its resolution degrades at low CR because it exploits only angular-delay domain sparsity (Observation 1) and ignores the time correlation (Observation 2) of time-varying massive MIMO channels; the two observations are analogous to the spatial and inter-frame correlations of videos.

Motivated by RCNNs, which excel at extracting spatial-temporal features for video representation, we extend CsiNet with GRUs to improve the CR versus recovery-quality


**Figure 7.** *The structure of the proposed CsiNet-GRUs using dropout technique [14].*

trade-off. We also introduce the multi-CR strategy to implement variable CRs on different channel matrices. The proposed CsiNet-GRU is illustrated in **Figure 7** together with CsiNet. Our model includes the following steps: angular-delay domain feature extraction, correlation representation, and final reconstruction. Each GRU has an inherent memory unit that can hold previously extracted information for a long time for future prediction. To facilitate comparison with the results of the CsiNet structure given in [8], the encoder uses 3 × 3 convolutional layers and an $M$-unit dense layer for sensing, while the decoder uses a dense layer with $2N'_cN_t$ units and two RefineNet decoders for reconstruction, as shown in **Figure 7**; each RefineNet comprises four 3 × 3 convolutional layers with different channel sizes.

The CsiNet decoder's output forms a sequence of length $T$, which is then fed into a three-layer GRU. All low-CR CsiNets shown in **Figure 7** share the same network parameters, i.e., weights and biases, because they perform the same work, which dramatically reduces parameter overhead. Furthermore, the architecture can easily be rescaled to operate on channel groups with different $T$ if the value of $T$ changes to adapt to the channel-changing speed and feedback frequency; in practice, a low-CR CsiNet is reused $(T - 1)$ times instead of making $(T - 1)$ copies. The gray blocks in **Figure 7** load parameters from the original CsiNets as pre-training before end-to-end training of the entire architecture. This method can alleviate vanishing gradient problems caused by the long paths from the CsiNets to the GRUs. We use GRUs to extend the CsiNet decoders for time correlation extraction and final reconstruction. Gated recurrent units have inherent memory cells and can keep previously extracted information for a long period for later prediction. In particular, the CsiNet decoders' outputs form a sequence of length $T$ before

being fed into three-layer GRUs. Each GRU has $2N'_cN_t$ hidden units, the same as the size of the output. The final output is then reshaped into two matrices as the

final recovered $\hat{H}''_t$. The first channel matrix is compressed by the high-CR CsiNet encoder into an $M_1 \times 1$ codeword, while for the remaining $T - 1$ channel matrices, less information is required due to channel correlation, so $M_2 \times 1$ codewords ($M_1 > M_2$) are generated. The $(T - 1)$ codewords are each concatenated with the first codeword ($M_1 \times 1$) before being fed into the low-CR CsiNet decoders to fully utilize the feedback information. Each CsiNet outputs two matrices of size $N'_c \times N_t$ as features extracted from the angular-delay domain, forming the final recovered $\hat{H}''_t$. The spatial-frequency domain CSI can then be obtained via the inverse 2D-DFT. At each time step, the GRUs implicitly learn time correlation from the previous inputs and merge it with the current inputs to increase recovery quality at low CR.
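A rough Keras sketch of the pipeline just described is given below. It is a minimal reading of the architecture, not the authors' released code: the filter counts, the use of a single compression ratio (no multi-CR branch), the residual wiring of the RefineNet blocks, and the dropout rates are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

Nc, Nt, T, M = 32, 32, 10, 512     # truncated delay bins, antennas, group size, codeword

def build_encoder():
    inp = layers.Input(shape=(Nc, Nt, 2))          # real and imaginary planes
    h = layers.Conv2D(2, 3, padding="same", activation=tf.nn.leaky_relu)(inp)
    h = layers.Flatten()(h)
    s = layers.Dense(M)(h)                         # M-dimensional codeword
    return models.Model(inp, s, name="encoder")

def build_decoder():
    inp = layers.Input(shape=(M,))
    h = layers.Dense(2 * Nc * Nt)(inp)
    h = layers.Reshape((Nc, Nt, 2))(h)
    for _ in range(2):                             # two RefineNet blocks
        r = h
        for filters in (8, 16, 2, 2):              # four 3x3 convs (sizes assumed)
            r = layers.Conv2D(filters, 3, padding="same",
                              activation=tf.nn.leaky_relu)(r)
        h = layers.Add()([h, r])                   # residual connection (assumed)
    return models.Model(inp, h, name="decoder")

enc, dec = build_encoder(), build_decoder()        # shared across the T time steps
seq_in = layers.Input(shape=(T, Nc, Nt, 2))
coarse = layers.TimeDistributed(dec)(layers.TimeDistributed(enc)(seq_in))

h = layers.Reshape((T, 2 * Nc * Nt))(coarse)       # sequence of coarse recoveries
for _ in range(3):                                 # three GRUs with 2*N'_c*Nt units
    h = layers.GRU(2 * Nc * Nt, return_sequences=True,
                   dropout=0.2, recurrent_dropout=0.2)(h)
out = layers.Reshape((T, Nc, Nt, 2))(h)

csinet_gru = models.Model(seq_in, out, name="CsiNet_GRU")
csinet_gru.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```

With these sizes, the codeword gives CR = M/2N = 512/2048 = 1/4, matching the compression used for the first channel in Section 5.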

#### **4.1 The dropout technique**

Dropout: during training, randomly selected neurons are ignored, or "dropped out." This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass. Dropout can be implemented on any hidden layer in the network as well as on the visible or input layer; the term "dropout" refers to dropping out units (hidden and visible) in a neural network. Dropout is a regularization method used when training the network, as illustrated in **Figure 8**; depending on the framework, the input and recurrent connections to the gated recurrent units (GRUs) in **Figure 7** are not necessarily excluded from activation and weight updates. Training phase: for each hidden layer, training sample, and iteration, ignore (zero out) a random fraction p of the nodes (and the corresponding activations). Testing phase: use all activations, but scale them down by a factor of p to account for the activations missing during training. This regularization approach is used to reduce overfitting and improve the efficiency of the CsiNet-GRU structure. We stated

**Figure 8.** *Neural network with dropout architecture [15].*


that the effect on the activation of "downstream neurons" during the forward pass is temporarily removed and that no weight update is applied in the backward propagation to those neurons [15].


Some observations: dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Dropout roughly doubles the number of iterations needed to converge, but each epoch's training time is shorter. During the testing phase, the entire network is used, and each activation is reduced by a factor of p. When training the network in the proposed structure, the input and recurrent connections to the GRU units may not be excluded from activation and weight updates.
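The mechanics can be summarized in a few lines of NumPy; this sketch uses the common "inverted dropout" convention, where the rescaling happens during training so that nothing changes at test time (an implementation choice, not the chapter's prescription).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5                                           # fraction of units to drop

def dropout_train(activations, p):
    mask = rng.random(activations.shape) >= p     # keep each unit with prob 1-p
    return activations * mask / (1.0 - p)         # rescale so expectations match

def dropout_test(activations):
    return activations                            # inverted dropout: no test-time change

a = rng.standard_normal(8)
print(dropout_train(a, p))   # roughly half the units zeroed, survivors scaled up
```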

There are two dropout parameters in RNN layers: *dropout*, applied to the first operation on the inputs, and *recurrent dropout*, applied to the recurrent operation on the recurrent inputs. It is worth mentioning that we are interested in designing an encoder that can transform the channel matrix into an $M$-dimensional vector (codeword), where $M < N$. Thus, we define the data compression ratio as $\gamma = M/2N_tN_c$.

The encoder first extracts CSI features via a convolutional layer with two 3 × 3 filters, followed by an activation layer. A fully connected (FC) layer with $M$ neurons is then used to compress the CSI features to lower dimensions. The compression ratio of this encoder can be expressed as $CR = 1/\gamma$. The final reconstruction of the CSI is performed by three $2N'_cN_t$-unit GRUs with dropout.

Moreover, adopting depthwise separable convolutions in feature recovery reduces the model's size and exchanges information between channels. We introduce $\theta$ as the set of encoder and decoder parameters, i.e., $\theta = \{\theta_{en}, \theta_{de}\}$. It is worth mentioning that the $H''_t$ are standardized, with all components scaled into the range [0, 1]; this standardization is required for CsiNet.
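A minimal sketch of this standardization step, assuming min-max scaling of the stacked real and imaginary parts (the chapter does not specify the exact scaling):

```python
import numpy as np

def standardize(H):
    """Scale the real and imaginary parts of a complex channel matrix into [0, 1]."""
    parts = np.stack([H.real, H.imag], axis=-1)   # shape (..., 2)
    lo, hi = parts.min(), parts.max()
    return (parts - lo) / (hi - lo)
```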

#### **5. COST 2100 data sets and parameters**

The COST 2100 channel model is a geometry-based stochastic channel model (GSCM) capable of reproducing the stochastic properties of multi-link multiple-input multiple-output channels across time, frequency, and space. As a result, there is no doubt that more advanced channel estimation methods and good measurement campaigns for parameterization and validation are required for the successful development and long-term use of the COST 2100 channel model. Multiple-input

multiple-output (MIMO) is a technology that enables faster and more reliable transmissions over wireless channels.

The COST 2100 model simulates MIMO channels and generates the training samples; we set the MIMO-OFDM system to work on a 20 MHz bandwidth using a uniform linear array (ULA). The parameters used in the indoor and outdoor channel scenarios are given in **Table 1**. Data sets are generated by randomly setting different start positions for the indoor and outdoor scenarios and performing the simulations at several CR values, with the first channel $H''_1$ compressed at CR = 1/4. **Table 1** shows the training, validation, and testing sets; some parameters are preloaded from CsiNet for initialization (epochs from 500 to 1000, a learning rate of 0.001, and a batch size of 200), as shown in **Table 1**.

We compare the proposed architecture's performance with similar previous approaches to channel state information (CSI) modeling based on different deep learning and CS techniques, namely Conv-LSTM CsiNet, LASSO, TVAL3 [16], and CsiNet, using the default setup in the open-source codes of these techniques for reproduction.

TVAL3 uses a minimum total variation method that provides remarkable recovery quality and high computational efficiency, while LASSO uses simple sparse priors to achieve good performance. Conv-LSTM CsiNet uses an RNN and depthwise separable convolutions in its feature extraction and recovery modules.

The term "training" refers to the process of determining which parameters to use in a given dataset. We run the modeling CsiNet-GRUs on Collaboratory (python) according to zero configuration required, free access to GPUs, and easy sharing training and testing of the CsiNet, Conv-LSTM CsiNet, and CsiNet-GRUs on python colab editor.


**Table 1.**

*COST 2100 model data sets and system parameters.*


Comparisons are made using the normalized mean square error, cosine similarity, accuracy, and run-time in the indoor and outdoor channels, with complexity also factored in. The normalized mean square error measures and reflects the mean relative scatter.

Normalizing the MSE ensures that the metric is not biased when the model overestimates or underestimates the predictions. The normalized mean square error (NMSE) used for the comparisons quantifies the difference between the input $\{H''_t\}_{t=1}^T$ and the output $\{\hat{H}''_t\}_{t=1}^T$ and, for the proposed CsiNet-GRUs technique, is given by:

$$\text{NMSE} = \mathbb{E}\left\{\frac{1}{T}\sum_{t=1}^{T} \frac{\left\|H''_t - \hat{H}''_t\right\|_2^2}{\left\|H''_t\right\|_2^2}\right\} \tag{15}$$

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. Its values range between −1.0 and 1.0: a correlation of −1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation.

To measure the degree of similarity between the actual channel $h_{n,t}$ and the estimated channel $\hat{h}_{n,t}$ of the $n$th subcarrier at time $t$, we use the cosine similarity coefficient $\rho$, which for CsiNet-GRUs is given as:

$$\rho = \mathbb{E}\left\{\frac{1}{T}\frac{1}{N_c}\sum_{t=1}^{T}\sum_{n=1}^{N_c} \frac{\left|\hat{h}_{n,t}^H h_{n,t}\right|}{\left\|\hat{h}_{n,t}\right\|_2 \left\|h_{n,t}\right\|_2}\right\} \tag{16}$$

where $\hat{h}_{n,t}$ denotes the reconstructed channel vector of the $n$th subcarrier at time $t$. $\rho$ can measure the quality of the beamforming vector when the vector is set as $v_{n,t} = \hat{h}_{n,t}/\|\hat{h}_{n,t}\|_2$, since the UE will then achieve the equivalent channel $\hat{h}_{n,t}^H h_{n,t}/\|\hat{h}_{n,t}\|_2$.

We introduce a new parameter for comparison, which we call accuracy, defined as the ratio of the number of correct predictions to the total number of input samples; that is, accuracy is the ratio of the recovered channel vector to the original channel vector, $\{\hat{H}''_t\}_{t=1}^T / H''_1$. The accuracy in CsiNet-GRUs is defined as:

$$\text{Accuracy} = \mathbb{E}\left\{\frac{1}{T}\frac{1}{N_c}\sum_{t=1}^{T}\sum_{n=1}^{N_c} \frac{\left|\hat{h}_{n,t}^H\right|}{\left\|h_{n,t}\right\|_2}\right\} \tag{17}$$
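For reference, the three metrics can be sketched as follows. The arrays are assumed to have shape (T, Nc, Nt), with the last axis holding the per-subcarrier channel vectors, and Eq. (17) is read here as a ratio of recovered to original channel norms, since the chapter's notation for it is informal.

```python
import numpy as np

def nmse(H_true, H_hat):
    """Eq. (15): normalized mean square error over a channel group."""
    err = np.linalg.norm((H_true - H_hat).reshape(len(H_true), -1), axis=1) ** 2
    ref = np.linalg.norm(H_true.reshape(len(H_true), -1), axis=1) ** 2
    return np.mean(err / ref)

def cosine_similarity(H_true, H_hat):
    """Eq. (16): mean cosine similarity over time steps and subcarriers."""
    num = np.abs(np.sum(np.conj(H_hat) * H_true, axis=-1))     # |h_hat^H h|
    den = np.linalg.norm(H_hat, axis=-1) * np.linalg.norm(H_true, axis=-1)
    return np.mean(num / den)

def accuracy(H_true, H_hat):
    """Eq. (17), interpreted as the mean norm ratio (an assumption)."""
    return np.mean(np.linalg.norm(H_hat, axis=-1) / np.linalg.norm(H_true, axis=-1))
```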

#### **6. Comparison of results with different techniques**

**Figures 9** and **10** show the relationship between CR and NMSE for all structures in the indoor and outdoor scenarios. **Figure 9** shows that the proposed CsiNet-GRUs have the lowest NMSE, whereas **Figure 10** shows that they have the lowest NMSE among all methods except Conv-LSTM CsiNet at CR > 20. **Figures 11** and **12** show the relationship between CR and accuracy for all structures in the indoor and outdoor scenarios.

**Figure 9.** *NMSE (dB) performance comparison between CS methods INDOOR scenario.*

**Figure 10.** *NMSE (dB) performance comparison between CS methods OUTDOOR scenario.*


**Figure 11.** *Accuracy performance comparison between CS methods INDOOR scenario.*

**Figure 12.** *Accuracy performance comparison between CS methods OUTDOOR scenario.*

The CsiNet-GRUs outperform the other structures, with higher accuracy observed at lower CR values. **Figures 13** and **14** illustrate the relation between the cosine similarity (ρ) and CR in the indoor and outdoor scenarios for all structures. Again, the proposed CsiNet-GRUs outperform the other structures and, in addition, exhibit near-linear performance with the lowest slope.

**Figure 13.** ρ *Performance comparison between CS methods INDOOR scenario.*


**Table 2** compares the performance of the proposed CsiNet-GRUs with the other available techniques, where the corresponding values of normalized mean square error (NMSE), ρ, accuracy, and run-time are calculated for different values of γ in the indoor and outdoor scenarios. All techniques perform better in the indoor scenario than in the outdoor one.

It is worth noting that the channel state information network (CsiNet) techniques significantly outperform the CS-based methods. CsiNet and CsiNet-GRUs continue to provide the highest cosine similarity values at low CRs, where the CS-based methods LASSO and TVAL3 fail. Moreover, the proposed CsiNet-GRUs outperform CsiNet in terms of correlation and accuracy, as shown in **Table 2**. The same comparison is simulated again with



#### **Table 2.**

*Comparison of results between the proposed framework and other available ones (epoch = 1000 iterations for the proposed technique and the previous techniques).*

epoch = 1000 (1000 iterations) in terms of correlation and accuracy for the proposed CsiNet-GRUs technique. In terms of NMSE, the CsiNet-GRUs achieve the lowest values across all compression ratios (CRs), particularly when the CR is low.

CsiNet-GRUs have very short run times compared with the LASSO and TVAL3 techniques. However, compared with the original CsiNet technique, the proposed CsiNet-GRUs lose a little time efficiency. It is worth noting that, despite the significant complexity added by the GRU layers, the run time is still comparable to that of CsiNet.

**Figure 15** depicts the reconstruction results of the proposed technique in comparison to the other modeling techniques, namely LASSO, TVAL3, CsiNet, and Conv-LSTM CsiNet, in an indoor picocellular scenario; the figure represents the average performance at different CRs, reflected in the images reconstructed by the different techniques.

**Figure 15.** *Reconstruction images for CR in CS algorithms in an indoor scenario.*

Conv-LSTM CsiNet, CsiNet, and CsiNet-GRUs continue to provide acceptable correlation coefficients (*ρ*) at low CRs, where the compressed sensing-based methods fail; it is noteworthy that the proposed CsiNet-GRUs technique outperforms the other methods in the indoor scenario. CsiNet-GRUs achieve the best performance across CRs in the measured indicators, improving accuracy, decreasing NMSE, and increasing correlation (*ρ*), with dropout used to reduce overfitting of the modeling system in massive multiple-input multiple-output channels. As a result, CsiNet-GRUs outperform both CsiNet and the CS-based methods. With advanced deep learning technology, this approach has the potential to be deployed in real MIMO systems.

#### **7. Conclusion**

In this chapter, we developed and tested a prediction model to evaluate a real-time, end-to-end channel state information (CSI) feedback framework by extending the DL-based CsiNet with GRUs. By utilizing the time correlation and structural properties of time-varying massive MIMO channels, CsiNet-GRUs achieve an acceptable trade-off between compression ratio, recovery quality, accuracy, and complexity. This chapter proposed a CSI feedback network by extending the deep learning (DL)-based channel state information network (CsiNet) to incorporate gated recurrent units (GRUs) and the dropout method. The GRU layers extend the CsiNet decoders for time correlation extraction and the final reconstruction of the CSI, whereas the dropout method reduces overfitting in the channel modeling. The experimental results show that, at comparable complexity, CsiNet-GRUs achieve the best recovery quality and outperform state-of-the-art CS methods.

#### **Appendices and nomenclature**





### **Author details**

Hany Helmy1 \*, Sherif El Diasty<sup>2</sup> and Hazem Shatila<sup>3</sup>

1 Cairo Airport Company (CAC), Cairo, Egypt

2 Department of Electronics, Arab Academy for Science, Technology and Maritime Transport (AASTMT), Cairo, Egypt

3 Virginia Tech, Artificial Intelligence and Markovdata, Cairo, Egypt

\*Address all correspondence to: hany.nabil@cairo-airport.com; hnabil110@gmail.com

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Zhang T, Ge A, Beaulieu NC, Hu Z, Loo J. A limited feedback scheme for massive MIMO systems based on principal component analysis. EURASIP Journal on Advances in Signal Processing. 2016;**2016**. DOI: 10.1186/s13634-016-0364-9

[2] Busari A, Huq KMS, Mumtaz S, Dai L, Rodriguez J. Millimeter-wave massive MIMO communication for future wireless systems: A survey. IEEE Communications Surveys & Tutorials. 2018;**20**(2):836-869

[3] Tao J, Chen J, Xing J, Fu S, Xie J. Autoencoder neural network based intelligent hybrid beamforming design for mmWave massive MIMO systems. IEEE Transactions on Cognitive Communications and Networking. 2020. DOI: 10.1109/TCCN.2020.2991878

[4] Zhai J, Zhang S, Chen J, He Q. Autoencoder and Its Various Variants. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan. 2018. pp. 415-419. DOI: 10.1109/SMC.2018.00080

[5] Karanov B, Lavery D, Bayvel P, Schmalen L. End-to-end optimized transmission over dispersive intensitymodulated channels using bidirectional recurrent neural networks. Optics Express. 2019;**27**:19650-19663

[6] Sohrabi F, Cheng HV, Yu W. Robust Symbol-Level Precoding Via Autoencoder-Based Deep Learning. 2020. pp. 8951-8955. DOI: 10.1109/ICASSP40776.2020.9054488

[7] Liu Z, del Rosario M, Liang X, Zhang L, Ding Z. Spherical Normalization for Learned Compressive Feedback in Massive MIMO CSI Acquisition. 2020. pp. 1-6. DOI: 10.1109/ICCWorkshops49005.2020.9145171

[8] Wen C, Shih W, Jin S. Deep learning for massive MIMO CSI feedback. IEEE Wireless Communications Letters. 2018; **7**(5):748-751

[9] Liu L, Oestges C, Poutanen J, Haneda K. The COST 2100 MIMO channel model. IEEE Wireless Communications. 2012;**19**(6):92-99

[10] Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 1998;**6**(2):107-116

[11] Aleem S, Huda N, Amin R, Khalid S, Alshamrani SS, Alshehri A. Machine Learning Algorithms for Depression: Diagnosis, Insights, and Research Directions. Electronics. 2022;**11**(7):1111. DOI: 10.3390/electronics11071111

[12] Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 2014. DOI: 10.3115/v1/D14-1179

[13] Dey R, Salem FM. Gate-variants of Gated Recurrent Unit (GRU) neural networks. 2017. pp. 1597-1600. DOI: 10.1109/MWSCAS.2017.8053243

[14] Helmy HMN, Daysti SE, Shatila H, Aboul-Dahab M. Performance enhancement of massive MIMO using deep learning-based channel estimation. IOP Conference Series: Materials Science and Engineering. 2021;**1051**(1):012029

[15] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;**15**(1):1929-1958

[16] Li C, Yin W, Zhang Y. User's guide for TVAL3: TV minimization by augmented Lagrangian and alternating direction algorithms. CAAM Report. 2009;**20**(4):46-47

[17] Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics. 2004;**57**(11):1413-1457

[18] Li X, Huaming W. Spatio-temporal representation with a deep neural recurrent network in MIMO CSI feedback. IEEE Wireless Communications Letters. 2020; **9**(5):653-657

#### **Chapter 2**

## Graph Neural Networks and Reinforcement Learning: A Survey

*Fatemeh Fathinezhad, Peyman Adibi, Bijan Shoushtarian and Jocelyn Chanussot*

#### **Abstract**

Graph neural network (GNN) is an emerging field of research that tries to generalize deep learning architectures to work with non-Euclidean data. Nowadays, combining deep reinforcement learning (DRL) with GNNs for graph-structured problems, especially in multi-agent environments, is a powerful technique in modern deep learning. From the computational point of view, multi-agent environments are inherently complex, because future rewards depend on the joint actions of multiple agents. This chapter examines different ways of applying GNN and DRL techniques to the most common representations of multi-agent problems and their challenges. In general, the fusion of GNN and DRL can be addressed from two different points of view. First, GNN is used to influence the DRL performance and improve its formulation; here, GNN is applied in relational DRL structures such as multi-agent and multi-task DRL. Second, DRL is used to improve the application of GNN; from this viewpoint, DRL can be used for a variety of purposes, including neural architecture search and improving the explanatory power of GNN predictions.

**Keywords:** graph neural network, deep reinforcement learning, multi-agent, multi-task, neural architecture search

#### **1. Introduction**

Building an intelligent system that can extract high-level representations from data is necessary for many problems in artificial intelligence. Theoretical and biological arguments show that building such systems requires deep architecture models that include many layers of non-linear processing units. Before the emergence of deep learning [1], traditional machine learning approaches depended on representations given by feature selection or extraction from the data.

These methods required an expert in the subject domain to extract the features manually. However, such hand-crafted feature extraction is a time-consuming and sub-optimal process. Deep learning quickly replaced these traditional methods because it can automatically extract the features appropriate to each problem. In recent years, deep learning has become the main driver of innovative solutions to artificial intelligence problems. This has been made

possible by increasing the amount of available data, increasing computing resources, and improving techniques in training deep networks.

Deep neural networks (DNNs) have achieved remarkable results in the last decade [2]. However, basic types of neural networks can only be applied to regular, or Euclidean, data, whereas much real-world data has a graph structure, which is non-Euclidean. This irregularity of the data structure has led to the recent advances in graph neural networks (GNNs).

GNNs [3, 4] allow the creation of machine learning models that learn representations of graph-structured data. GNNs are arguably the most interesting topic in graph-based deep learning. They can be applied to graph-structured data for various tasks, from clustering to classification or regression, and they can learn representations at the level of nodes, edges, and graphs. Deep learning on graph-structured data is also known as graph representation learning, geometric deep learning, or graph embedding, which seeks to learn representations of the structured information in graphs.

The purpose of graph representation is to build sets of features that capture the structure of the graph and the data in it. The key idea is to learn a mapping that embeds nodes or entire graphs as points in a low-dimensional vector space, optimized so that the geometric relations in the learned space reflect the structure of the original graph. After optimization, the learned embedding space can be used as input features for downstream graph tasks.

When the graphs seen at deployment differ in size or structure from those seen during training, GNNs play a highly efficient role in knowledge transfer between data-oriented structures: GNNs are inherently designed to generalize over graphs of different structures and sizes. This ability allows a GNN-based DRL agent to learn and generalize over arbitrary network topologies. Many DRL methods instead apply standard neural network structures such as recurrent neural networks (RNNs) [5]. This causes poor generalization and prevents the deployment of DRL in networks, because such models struggle to adjust to dynamic changes in network topology. In recent years, the integration of GNNs and DRL, especially in multi-agent systems, has attracted much attention in graph-structured environments.

Nowadays, many systems can be viewed as multi-agent systems from a new perspective. The cooperation of a group of agents (teamwork) within a graph is one of the most important issues raised in multi-agent systems [6], as it increases the ability to reach the system's final goal and improves the overall strategy.

This issue becomes more important when the environment is complex and dynamic. In such an environment, a purposeful agent is affected by the actions of other agents in addition to changes in the environment. The environment therefore has more dynamic states than before, and the agent must be able to model the action process and to learn from and interact with other agents. Using classical methods to describe agents and establish communication between them in a multi-agent environment requires many equations, which weakens the ability to scale the network to large systems. By defining an intelligent system and using modern methods to solve such problems, approaches such as deep reinforcement learning algorithms have been proven useful in these environments.

Recent advances in DRL [7] have driven progress in automated control problems and decision-making. However, existing DRL-based solutions still fail to generalize when applied to network-related scenarios: when faced with a network state not seen during training, the ability of the DRL agent is impaired.

Recently, GNNs have been proposed to model and operate on graphs to achieve combinatorial generalization and relational reasoning. Indeed, GNNs simplify learning the relations between entities in a graph and the rules for composing them. A combination of DRL and GNN can optimize problems while generalizing to unseen topologies. Specifically, the GNN used by the DRL agent is typically inspired by message-passing neural networks [8].

Robotics, pattern recognition, recommendation systems, and games are some of the areas in which DRL has shown acceptable performance. On the other hand, GNNs exhibit excellent efficiency in supervised learning on graph-structured data [9]. DRL utilizes the ability of DNNs to solve sequential decision problems with RL, while GNNs are new architectures suitable for handling graph-structured data in this setting.

In this survey, an overview of the concepts of GNNs is provided, and their relationship with reinforcement learning (RL) is then explained. The rest of this chapter is structured as follows. A short review of graph neural networks is given in Section 2. The technical background of deep reinforcement learning concepts and multi-agent reinforcement learning is presented in Section 3. The relation between RL and GNN is presented in Section 4. Finally, the conclusion is provided in the last section.

#### **2. Graph neural networks**

Nowadays, many learning problems require graph representations to capture the complex relationships between data [10, 11]. Recently, studies on graph models have received more attention due to their great expressive power in social science (social networks) [12–14] and biological science (predicting protein interfaces, bioinformatics analysis, knowledge graphs, modeling physics systems, and classifying diseases) [15–17].

Pairwise message passing is one of the main elements in the structure of GNNs: each node in the graph repeatedly updates its representation by exchanging information with its neighbors until a stable equilibrium is attained. A graph neural network usually contains two parts: the message-passing part, which extracts local structural features around the nodes, and the readout phase, an aggregation part that summarizes the node-level features into a graph-level feature vector.
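The two parts can be sketched in a few lines of NumPy; the weight shapes, the tanh update, and the mean readout are illustrative assumptions rather than a specific published model.

```python
import numpy as np

def message_passing_step(A, X, W_msg, W_self):
    """One round of pairwise message passing: sum neighbor messages, update."""
    messages = A @ X @ W_msg                # aggregate messages from neighbors
    return np.tanh(X @ W_self + messages)   # update each node's representation

def readout(X):
    """Aggregate node features into a single graph-level feature vector."""
    return X.mean(axis=0)

rng = np.random.default_rng(3)
n, d = 5, 8                                 # 5 nodes with 8-dim features (toy sizes)
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                      # undirected adjacency
np.fill_diagonal(A, 0)
X = rng.standard_normal((n, d))
W_msg = rng.standard_normal((d, d)) * 0.1
W_self = rng.standard_normal((d, d)) * 0.1

for _ in range(3):                          # iterate until representations stabilize
    X = message_passing_step(A, X, W_msg, W_self)
print(readout(X).shape)                     # (8,)
```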

Representing data as a graph has several advantages, such as a simplified representation of complex problems and systematic modeling of relationships. On the other hand, working with graph-structured data using common DNN-based methods poses its own challenges. The variable number of unordered nodes, the irregular structure of the graph, and the dynamic neighborhood composition make it difficult to apply basic operations such as convolution on a graph. Graph neural networks (GNNs), whose general structure is shown in **Figure 1**, overcome these difficulties by adapting DNN methods to graph-structured datasets. GNN architectures can model both structural information and node features. In the following, several well-known graph neural network models are introduced.

**Figure 1.** *Graph neural networks (GNN) framework.*

#### **2.1 Graph convolutional network**

For the first time, in [18], spectral networks and locally connected deep networks were combined on graphs through the graph convolutional network (GCN), a method for semi-supervised learning on graph-structured data. These networks extend the notion of convolutional neural networks to graphs. GCNs [19] learn according to the structure and features of the neighboring nodes. In general, the main difference between a CNN and a GCN lies in the data structure: CNNs are defined in Euclidean space, whereas GCNs work on graph structures (non-Euclidean data), where the number of node connections varies and the nodes are unordered.

Spatial graph convolutional networks and spectral graph convolutional networks are the two main branches of GCNs. The key idea in spectral GCNs is defined by signal/wave propagation: information is propagated along the nodes as a signal. The eigen-decomposition of the graph Laplacian matrix is used in spectral GCNs for information propagation, and also for node classification through an understanding of the graph structure. Poor generalization and computational inefficiency are the two main challenges of spectral graph convolutional networks. ChebNet [20] addresses these problems by using Chebyshev polynomials to approximate the spectral convolution.
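
To make the propagation rule concrete, here is a minimal NumPy sketch of one GCN layer with the symmetric normalization popularized by [19]; the inputs `A`, `H`, and `W` are hypothetical, and ReLU is an assumed nonlinearity.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # normalization matrix
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # propagate, transform, ReLU
```

Stacking a few such layers gives each node a receptive field covering its k-hop neighborhood.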

#### **2.2 Graph attention network**

Graph attention network (GAT) architecture [21] is a type of GCN architecture in which the aggregation process learns the weights between the neighboring nodes of each node with the help of an attention mechanism. In these networks, greater weights are assigned to more important nodes. The advantage of graph attention networks is that they can adaptively learn the importance of each neighbor. However, since the attention weights between each pair of neighbors must be calculated, the computational cost and memory footprint grow rapidly.
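
The sketch below illustrates a single-head attention computation in this spirit; the LeakyReLU slope of 0.2 follows common practice, and `h`, `W`, and `a` are hypothetical parameters rather than the exact formulation of [21].

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gat_weights(h, W, a, i, neighbors_of_i):
    """Attention weights of node i over its neighbors (single-head, GAT-style)."""
    logits = []
    for j in neighbors_of_i:
        pair = np.concatenate([W @ h[i], W @ h[j]])  # [Wh_i || Wh_j]
        e = a @ pair                                 # raw attention score
        logits.append(np.maximum(e, 0.2 * e))        # LeakyReLU
    return softmax(np.array(logits))                 # normalize over the neighborhood
```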

#### **2.3 GraphSAGE**

In graph theory, there is a concept called node embedding, which means mapping nodes to an embedding space whose dimension is lower than the actual dimension of the data defined on the nodes of the graph, such that similar nodes end up close to each other in the resulting latent space.

GraphSAGE [22] is an inductive learning technique that exploits node features to learn an embedding function for dynamic graphs. This inductive approach is scalable across graphs of different sizes as well as subgraphs within a given graph, and a new node can be embedded without retraining. It uses aggregator functions to induce new node embeddings based on node features and neighborhoods.
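
A minimal sketch of a GraphSAGE-style layer with a mean aggregator is given below; the combine-then-normalize order and the two weight matrices are illustrative assumptions, not the exact formulation of [22].

```python
import numpy as np

def graphsage_mean_layer(h, neighbors, W_self, W_neigh):
    """One GraphSAGE-style layer: mean-aggregate neighbors, combine, normalize."""
    out = []
    for v in range(len(h)):
        nbrs = neighbors[v]
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        z = np.maximum(W_self @ h[v] + W_neigh @ agg, 0.0)  # combine + ReLU
        out.append(z / (np.linalg.norm(z) + 1e-8))          # l2-normalize embedding
    return np.stack(out)
```

Because the layer depends only on features and neighborhoods, it can embed a previously unseen node without retraining.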

In [23], data-driven neighborhood subsampling is performed by a nonlinear regressor that estimates the real-valued importance of each node and its neighborhood. This subsampling helps to embed nodes in the graph using a small set of highly important neighboring nodes. The regressor is learned with value-based reinforcement learning, using the negative classification loss of GraphSAGE as the importance signal.

GraphSAGE-D3QN [24] presents a graph DRL method for emergency control of undervoltage load shedding. Feature extraction of states in this model is performed by a GraphSAGE-based method with topology variation during training, after which online emergency control is achieved.

#### **2.4 Applications of GNNs**

Link prediction [25], node classification [26], and clustering [27], among others, are considered graph analysis objectives. In the following, several common GNN goals are described:

**Node classification**: training models to classify nodes by determining the labels of samples that are represented as nodes. Usually, these problems are set up in a semi-supervised way, with only part of the graph being labeled.

**Graph classification**: Graph classification is a task with real applications in social network analysis, categorizing documents in natural language processing, and classifying proteins in bioinformatics. Graph classification obtains a graph-level feature that helps discriminate between graphs of different classes.

**Graph Visualization:** Visual representation of data structures and anomalies with the help of geometric graph theory and information visualization that helps the user understand graphs.

**Link prediction:** Predicting the relationship between two nodes and considering that nodes in a network are likely to have links. An application of this approach is to detect social interactions or suggest potential friends to users on social networks. It has also been used in predicting criminal associations, and in recommender system problems.

**Graph clustering:** Clustering on graphs is performed in two ways: either the nodes are clustered into distinct, connected groups based on edge distances and weights, or whole graphs are treated as objects and clustered based on their similarity.

#### **3. Deep reinforcement learning**

Using DNNs to solve sequential decision problems within the framework of RL led to the emergence of deep reinforcement learning (DRL) for high-dimensional problems (see **Figure 2**). Nowadays, many applications of artificial intelligence have been enhanced with the help of DRL, including natural language processing [28], transportation [29], finance [30], healthcare [31], robotics [32], recommendation systems [33], and gaming [34]. DRL can be defined as a system that maximizes the long-term reward in a reinforcement learning problem using representations that are themselves learned by a deep network. The outstanding success of DRL can be attributed to its ability to deal with complex problems and to provide efficient, scalable, and flexible computational methods. DRL agents also have a high ability to understand the dynamics of the environment and to produce optimal actions from their interactions with it. When dealing with high-dimensional problems or continuous states, reinforcement learning suffers from inefficient feature representation; learning is therefore slow, and techniques must be designed to speed it up. The most significant feature of deep learning is that DNNs can discover compact representations of high-dimensional data automatically.

Combining DNNs with RL has become more attractive in recent years, and the focus has gradually shifted from single-agent environments to multi-agent ones. Working with multiple agents is inherently more complex, because future rewards depend on the joint actions of several agents and the computational complexity of the value function increases. Single-agent environments such as Atari [6] and navigation robots [35], and multi-agent settings such as traffic light control [36], financial market trading [37], and strategy games such as Go, StarCraft, and Dota, are examples developed with DRL.

**Figure 2.** *Schematic structure of a deep reinforcement learning agent.*

In DRL, unstructured input data from the state space are given to the network. This input, such as the pixels rendered on the screen in a video game, images from a camera, or the raw sensor stream from a robot, can be very large and high-dimensional. At the output, the value of each action is determined so that the agent can decide which actions to perform in the environment to maximize the expected reward. RL methods suffer from the curse of dimensionality, whereas DNNs can find low-dimensional representations (features) of high-dimensional data automatically. In the following, DRL for the specific scope of multi-agent reinforcement learning is discussed in detail.
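
As a concrete sketch of this state-to-Q-values mapping, the toy deep Q-network below uses hypothetical dimensions and, for brevity, performs gradient descent only on its output layer; a practical DQN would also use experience replay and a separate target network.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HIDDEN, GAMMA, LR = 8, 4, 32, 0.99, 1e-2

# Two-layer Q-network: state vector in, one Q-value per action out.
W1, b1 = rng.normal(scale=0.1, size=(HIDDEN, STATE_DIM)), np.zeros(HIDDEN)
W2, b2 = rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN)), np.zeros(N_ACTIONS)

def q_values(s):
    h = np.maximum(W1 @ s + b1, 0.0)     # hidden ReLU layer
    return W2 @ h + b2, h

def td_update(s, a, r, s_next, done):
    """One gradient step on the squared TD error for the taken action."""
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    target = r + (0.0 if done else GAMMA * q_next.max())  # bootstrap target
    err = q[a] - target
    W2[a] -= LR * err * h                # output-layer gradient only (simplified)
    b2[a] -= LR * err

s = rng.normal(size=STATE_DIM)
action = int(q_values(s)[0].argmax())    # greedy action from the Q-network
```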

#### **3.1 Multi-agent reinforcement learning**

In multi-agent reinforcement learning (MARL), sets of independent agents interact with each other to learn how to reach their goals. Large, stochastic state spaces are a common problem in MARL systems with dynamic environments; further challenges include inefficient cooperation between agents, unsuitable coordination between agents' decisions, and the effect of state-space size on learning time. In recent years, MARL has been applied to autonomous driving, traffic light control, and network packet delivery. Communication between agents gathers information about the environment and the status of other agents.

The Markov decision process (MDP) is a useful approach for modeling optimal decision-making in stochastic environments, including multi-agent environments, but with different representations. The state dynamics and the expected rewards change for each agent according to the joint action, which violates the stationarity assumption of the MDP. In a multi-agent environment, the MDP can be fully or partially observable. It also depends on the type of interaction, which can be competitive, collaborative, or mixed, and agents may perform actions sequentially or simultaneously.

A Markov game [38] represents a theoretical framework for studying agents with multiple interactions in a fully observable environment, and it can be used in competitive, cooperative, and hybrid environments. A Markov game is a set of normal-form games (matrix games) that the agents play repeatedly; each state of the game can be represented as a matrix with the payoff of each joint action. If the agents cooperate with each other but execute their actions in a decentralized way, the model is a decentralized MDP.

A partially observable Markov game is a multi-agent Markov decision process in which every agent has an individual partial observation of the environment and takes an individual action to receive its own reward. If the agents cooperate to optimize a single reward function according to the joint state and action, the problem can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP). RL in a multi-agent space is associated with several problems, among them partial observability, non-stationarity, computational complexity, and credit assignment. In the following, each of these aspects is discussed:

#### *3.1.1 Partial observability*

In a partially observable environment, each agent makes decisions based on local observations. This leads to asymmetric and incomplete information among agents, which makes the learning process difficult. Partial observability has been studied mainly in settings where a group of agents maximizes a team reward through a common policy. For example, in Dec-POMDP settings, the two main approaches are (1) the centralized-learning, decentralized-execution paradigm, and (2) using communication to exchange information about the environment.

Since in a Dec-POMDP the agents only partially observe the state and try to maximize the reward at each step for all agents, finding an optimal solution for a Dec-POMDP model is a recognized challenge. The lack of access to the true state information in a Dec-POMDP leads to the use of histories of observations and actions, which makes solving the Dec-POMDP model computationally expensive. Policy trees with pruning of suboptimal trees [39, 40] and feature-based heuristic search value iteration [41] are techniques used to address this challenge in the Dec-POMDP model.

Deep multi-agent reinforcement learning algorithms for Dec-POMDP models are also considered approximate policy solution techniques. Different MARL algorithms have been presented that produce decent policies on many challenging Dec-POMDP problems [42, 43]. An independent learning approach is used in [43], where the policy of each agent is updated solely based on its individual experiences.

#### *3.1.2 Non-stationarity*

In a multi-agent environment, all agents simultaneously learn, interact and change the environment. As a result, state transitions and rewards are no longer fixed, and agents continue to adapt to the changing policies of other agents. This violates the Markov assumption, which is problematic because most RL algorithms assume a fixed environment to guarantee convergence. One way to deal with non-stationarity is to learn as much as possible about the environment, e.g., through adversary modeling and information exchange between agents [44].

To address the non-stationarity problem, a centralized critic architecture is used. The actor-critic algorithm in this architecture includes two components: the critic's training is centralized, with access to the observations and actions of all agents, while the actors are trained in a decentralized way. Since the actor computes the policy, the critic can be removed at test time, so the approach allows fully decentralized execution. If the opponents' actions and observations are available during training, the agents do not experience unexpected changes in the dynamics of the environment, which stabilizes the learning process.
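
A minimal sketch of this division of labor is shown below; the linear actor and critic, the dimensions, and the concatenated joint input are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_AGENTS, OBS_DIM, ACT_DIM = 3, 4, 2

# Decentralized actor: maps one agent's local observation to a policy.
W_actor = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM))

def actor(obs_i):
    scores = W_actor @ obs_i
    e = np.exp(scores - scores.max())
    return e / e.sum()                    # distribution over local actions

# Centralized critic: sees ALL observations and actions, but only in training.
w_critic = rng.normal(scale=0.1, size=N_AGENTS * (OBS_DIM + ACT_DIM))

def critic(all_obs, all_actions):
    x = np.concatenate([np.concatenate([o, a]) for o, a in zip(all_obs, all_actions)])
    return float(w_critic @ x)            # scalar joint value estimate

# At execution time only `actor` is needed, so execution is fully decentralized.
```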

The actor-critic algorithm in [42] is used with stochastic policies to evaluate and train agents in the StarCraft game. A single centralized critic is applied for all the agents and a different actor for each agent is used.

Generally, handling non-stationarity in multi-agent systems does not require centralized training approaches. Self-play is a decentralized method that has been explored to manage non-stationarity in MARL problems. This approach trains a neural network, using each agent's observation as input, by playing it against its current or previous versions to learn policies that generalize to any opponent.

#### *3.1.3 Computational complexity*

As each agent is added, the state-action space grows exponentially, which increases the time complexity of algorithms in multi-agent environments. Training a DRL model for a single agent already requires significant resources, and the cost worsens for several interacting agents, which slows learning.

Reducing the learning complexity of goal-directed problems can be achieved by initializing the Q-values with a good approximation function. In multi-agent problems, good approximations exist for a large class of problems, namely goal-directed stochastic games [45]. Such games can model coordination in cooperative robotics.

#### *3.1.4 Credit assignment*

Because several agents act simultaneously in the environment, allocating credit to individual agents makes learning an optimal policy difficult. The individual contribution of an agent cannot be determined from the joint reward signal [46], and an agent is unable to distinguish whether changes in the global reward are due to its own actions or those of other agents. One way to solve this problem is to let agents learn from a local reward. However, an agent may easily increase its local reward in ways that encourage selfish behavior and reduce overall group performance. Several approaches created to deal with these challenges are discussed below.

#### **3.2 Interaction methods between multiple agents with GNN architectures**

In recent research, many MARL methods use GNNs to provide information interactions between agents so that they can complete collaborative tasks and coordinate actions. In general, one problem of simple aggregation in GNNs is that it does not extract enough useful information from neighboring agents, because it ignores the topological relationships in the graph.

To solve this problem, Ding et al. [47] presented a method that extracts as much useful information as possible from neighboring agents in the graph structure and can provide feature representations for completing the cooperation task. For this purpose, mutual information (MI) is applied to measure the agents' topological relationships and feature information, maximizing the correlation between the input features of neighboring agents and the output high-level hidden feature representations.

A GNN architecture for training decentralized agent policies in continuous action spaces has been proposed for defense of the perimeter of a unit circle [48]. In this approach, multi-agent perimeter defense problems are solved by learning decentralized strategies with GNNs. The defenders' local perceptions are the inputs to the learning framework; the model is trained against an expert policy based on the maximum-matching algorithm and returns actions that maximize the number of captures for the defender team.

The framework proposed in [49] uses GNNs for value function factorization in multi-agent deep reinforcement learning. A complete directed graph is formed with the team of agents as nodes, and the edge weights are determined by an attention mechanism. The mixing GNN (GraphMIX) module introduced in this paper is responsible for factorizing the team value function into individual per-agent observation-action value functions, with explicit credit assignment to each agent. Centralized training with decentralized execution in GraphMIX allows the agents to make their decisions independently once training is completed.

An attention mechanism in [50] is defined to adjust the edge weights during an episode based on the agents' observation-action histories. To obtain the factorized state-action value function's assignment of the global reward, additional per-agent loss terms are derived from the output node embeddings of the GNN, which divide the global reward among individual agents explicitly. Neural attention modules have been used in graph structures [50, 51] to compute graph edge weights. Such techniques are used in sentence-translation work for managing associations between structured data [52], and they are used in RL more generally [53].

Non-stationarity and coordination problems can be solved naturally by centralized learning of joint actions, but this is difficult to scale because the joint action space grows exponentially with the number of agents. To address this challenge, conditional independencies between agents are exploited by decomposing a global reward function into a sum of agent-local terms. Sparse cooperative Q-learning [54] is a tabular Q-learning algorithm that learns to coordinate the actions of a group of cooperative agents only in the states where such coordination is necessary, encoding those dependencies in a coordination graph. These methods require the dependencies between agents to be provided in advance. To avoid this requirement, it is instead assumed that each agent always contributes to the global reward and learns the amount of its contribution in each state.

Coordination graph formulation is one method for determining the joint action between agents based on the structure of their interactions. In [55], the Deep Implicit Coordination Graph (DICG) architecture is introduced, which includes two modules: one for obtaining the dynamic coordination graph structure and another for learning implicit reasoning about joint actions or values. DICG uses the actor-critic structure to improve coordination in multi-agent settings. DICG assumes that agents can pass messages that encode their observations; the agents use a GCN to pass these messages between one another, where the adjacency matrix of the network is learned with self-attention. A new state categorization method has also been presented for centralized training with decentralized execution: in this method, implemented in the StarCraft game, each game agent separates the information and observations of itself and its competitors and then leverages a GAT to learn the correlations and relationships among the agents.

#### **3.3 Different methods for computing value function in MARL**

This section describes different methods for calculating the Q-value function in multi-agent environments. In MARL problems, each agent has a local, private observation of its surrounding space on which it wants to base its actions. One problem the agent may face is the locality of its observation and the lack of complete information about the environment. Another problem is the non-stationarity of the environment, because all agents in the environment are learning and exhibit different behaviors during training.

To solve these problems, the simplest method is to use single-agent RL algorithms for each agent and consider the other agents as part of the environment. However, the joint action space grows exponentially with the number of agents. The Independent Q-Learning (IQL) [56] method is based on this logic and performs well on some multi-agent RL problems, but there is no guarantee of convergence. In IQL, each agent has a separate action-value function that receives the agent's local observation, based on which the agent chooses its action. In such environments, an RNN can also be used over the observation-action history.
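
A minimal tabular sketch of IQL under hypothetical state and action counts is shown below; each agent runs an ordinary Q-learning update on its own table and treats the other agents as part of the environment.

```python
import numpy as np

rng = np.random.default_rng(2)
N_AGENTS, N_STATES, N_ACTIONS, ALPHA, GAMMA, EPS = 2, 5, 3, 0.1, 0.95, 0.1

# One independent Q-table per agent.
Q = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_AGENTS)]

def act(i, obs):
    """Epsilon-greedy action from agent i's own table and local observation."""
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(Q[i][obs].argmax())

def iql_update(i, obs, action, reward, next_obs):
    """Standard single-agent Q-learning update, run separately for each agent."""
    td_target = reward + GAMMA * Q[i][next_obs].max()
    Q[i][obs, action] += ALPHA * (td_target - Q[i][obs, action])
```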

In another approach, the agents perform learning in a centralized manner and the choice of action is also centralized. This approach is suitable for problems (such as traffic management or traffic light management) that do not require decentralized execution.

The third approach includes centralized training and decentralized execution. In this approach, the agents have access to the state and complete information during the training step, but in some environments, the learned policy must be applied in a decentralized manner, and the agents cannot access the full state in the execution phase. In this method, the purpose of each agent is to perform actions that maximize their utility function (joint value function), but such decentralization can result in sub-optimal actions [55].

Value-based methods like Value Decomposition Networks (VDN) [57], QMIX [58] and actor-critic methods like Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [59] and Counterfactual Multi-Agent (COMA) [60] are some approaches presented to solve these problems with training in a centralized manner and execution in a decentralized way.

In VDN, the joint action-value function is a linear sum of separate action-value functions, one per agent, learned using only a common reward signal; the decomposed value functions are then used for decentralized execution. QMIX generalizes the VDN method and combines the Q-values of the different agents in a non-linear way, feeding the global state into hypernetworks that generate the weights and biases of the mixing network. The actor-critic architecture is likewise a basis for centralized training and decentralized execution: the full state and the additional information available in the training phase are given to the critic network to generate a richer signal for the actor.
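
The contrast between the two mixing strategies can be sketched as follows; the single-layer QMIX-style mixer is a simplification (the real QMIX mixing network is deeper), and the hypernetwork shapes are illustrative assumptions.

```python
import numpy as np

def vdn_mix(per_agent_q):
    """VDN: the joint Q is a plain sum of per-agent chosen-action Q-values."""
    return float(np.sum(per_agent_q))

def qmix_mix(per_agent_q, state, hyper_w, hyper_b):
    """QMIX-style monotonic mixing: weights come from a hypernetwork of the
    global state and are made positive so that dQ_tot/dQ_i >= 0."""
    w = np.abs(hyper_w @ state)           # state-conditioned positive weights
    b = float(hyper_b @ state)            # state-conditioned bias
    return float(w @ per_agent_q + b)

# Usage: three agents, a 4-dimensional global state, random hypernetworks.
rng = np.random.default_rng(3)
q_i = np.array([1.0, -0.5, 2.0])
s = np.array([0.1, 0.2, 0.3, 0.4])
print(vdn_mix(q_i), qmix_mix(q_i, s, rng.normal(size=(3, 4)), rng.normal(size=4)))
```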

One disadvantage of the aforementioned algorithms is that they do not explicitly capture the underlying structure of cooperation between agents with a graph topology. Some papers try to join MARL with graph learning; for example, a multi-agent deep reinforcement learning method based on a GCN structure has been presented [61]. There, decentralized decision-making by the agents is not considered; only centralized training and centralized execution are investigated, with agents communicating with each other several times during the inference phase.

Multi-agent DDPG (MADDPG) generalizes the actor-critic algorithm into a multi-agent policy gradient algorithm in which decentralized agents learn a centralized critic based on the observations and actions of all agents. This leads to learned policies that use only local information and observations at execution time. The method does not assume a differentiable model of the environment dynamics or any particular structure of communication between agents. It applies not only to cooperative interaction but also to competitive or mixed interaction involving both physical and communicative behavior. The critic is strengthened with additional information about the other agents' policies, while only local information is provided to the actor. After training completes, only the local actors are used in the execution phase, acting in a decentralized manner.

COMA is a multi-agent policy gradient-based method for cooperative multi-agent systems that uses a centralized critic to estimate the Q-function and decentralized actors to optimize the agents' policies. This method addresses the credit assignment problem using a counterfactual baseline. Unlike COMA, which uses one centralized critic for all agents, MADDPG has a centralized critic for each agent, allowing different reward functions in competitive environments.

Recent works build on MADDPG. R-MADDPG [62] extends the MADDPG algorithm to partially observable environments by preserving the history of previous observations in the critic module and by using a recurrent actor. M3DDPG [63] includes minimax optimization for robust policy learning against agents with changing strategies. Mean-field actor-critic [64] factorizes the Q-value function using only interactions with neighboring agents, based on mean-field theory, and the idea of message dropout can be extended to MADDPG to manage large input spaces [65].

#### **4. Combination of graph neural networks and reinforcement learning**

Recently, combining GNNs with reinforcement learning for graph-structured problems has become a powerful tool in modern deep learning [66]. Combinatorial optimization [67], transportation problems [68], and manufacturing and control [69] are interesting applications in this field.

**Figure 3** shows the total structure of the combination of GNNs and DRL. The local observation of each agent is encoded, by an MLP for low-dimensional input or a CNN for visual input, into the feature vector shown in the embedding layer. An attention network is usually used to define the edge weights, which represent the strength of the connection in the coordination graph between each agent and its neighbors. In the next step, the graph convolution layer is applied to perform message passing and information integration across all agents. Finally, a deep Q-network is used to approximate the Q-value function, and the next action for each agent is determined by taking the maximum output of the Q-network.

**Figure 3.** *Total structure of the combination of GNNs and DRL.*

The embedding layer contains an encoder for the $n$ observations $\{o_1, o_2, \ldots, o_n\}$ of the $n$ agents. The outputs of the encoder are the embedding vectors $E_i$ for $i = 1, \ldots, n$:

$$E_i = \mathrm{Encoder}(o_i, \theta_E) \tag{1}$$

In the local attention layer, the attention weights for two agents $i$ and $j$ in the graph are calculated from the embedding vectors as:

$$At_{ij} = \frac{\exp\left(\mathrm{Attention}\left(E_i, E_j, \mathbf{W}_a\right)\right)}{\sum_{k=1}^{n} \exp\left(\mathrm{Attention}\left(E_i, E_k, \mathbf{W}_a\right)\right)} \tag{2}$$

where the attention network is parameterized by the weight matrix $\mathbf{W}_a$.

Message passing and information integration across all agents are expressed in a graph convolution layer as follows:

$$\mathbf{H}^{(l+1)} = \sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}_c^{(l)}\right) \tag{3}$$

where $\mathbf{H}^{(l)}$ is the feature matrix of convolution layer $l$, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$, and $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij}$.

The prediction $\hat{Q}$ of the Q-network is parameterized by $\theta$. The general objective for each minibatch in the training step is to minimize the loss function:

$$L_{\theta} = \frac{1}{b} \sum_{t} \left( y_t - \hat{Q}\left(s_t, a_t, \theta_{predict}\right) \right)^2 \tag{4}$$

where $b$ is the batch size and $y_t = r_{t+1} + \gamma \max_{a_{t+1}} Q\left(s_{t+1}, a_{t+1}, \theta_{target}\right)$ is the target Q-value at time step $t$ for state $s$ and action $a$ with reward $r$.
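
Reading equations (1)–(4) as a single forward pass, a minimal NumPy sketch of the pipeline might look like the following; the tanh encoder, bilinear attention score, and linear Q-head are illustrative stand-ins for the unspecified networks.

```python
import numpy as np

rng = np.random.default_rng(4)
N, OBS, EMB, ACT = 4, 6, 8, 3                 # agents, obs dim, embed dim, actions

W_E = rng.normal(scale=0.1, size=(EMB, OBS))  # encoder parameters, Eq. (1)
W_a = rng.normal(scale=0.1, size=(EMB, EMB))  # attention parameters, Eq. (2)
W_c = rng.normal(scale=0.1, size=(EMB, EMB))  # graph convolution weights, Eq. (3)
W_q = rng.normal(scale=0.1, size=(ACT, EMB))  # Q-head feeding the loss in Eq. (4)

obs = rng.normal(size=(N, OBS))
E = np.tanh(obs @ W_E.T)                      # Eq. (1): E_i = Encoder(o_i)

scores = E @ W_a @ E.T                        # bilinear attention scores
At = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # Eq. (2)

A_tilde = At + np.eye(N)                      # attention graph with self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ E @ W_c, 0.0)  # Eq. (3)

Q = H @ W_q.T                                 # per-agent Q-values
actions = Q.argmax(axis=1)                    # greedy joint action
```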

In general, the combination of GNN and DRL can be approached from two different points of view. From one perspective, GNNs are used to advance the formulation and performance of DRL, specifically when GNNs are applied to relational DRL problems. This relationship has been successfully modeled among (1) different agents in a multi-agent deep reinforcement learning (MADRL) framework and (2) different tasks in a multi-task deep reinforcement learning (MTDRL) framework [70].

From another perspective, DRL can be used to improve the performance of GNNs: DRL is used to improve the explanatory power of GNN predictions, to perform Neural Architecture Search (NAS) [71], and to design adversarial examples for GNNs. NAS is the process of automatically searching for the optimal architecture of a neural network for a given problem, including the number of layers, the number of nodes per layer, etc. In GraphNAS [72], an RL algorithm searches over graph neural architectures: GraphNAS defines a search space covering sampling functions, aggregation functions, and gated functions, and a recurrent network generates variable-length strings that describe the architecture of a graph neural network. Auto-GNN [73] applies RL-based controllers within a predefined search space over the hidden dimension, attention head, attention function, activation function, and aggregation function.

Identifying the subgraph that has the greatest impact on the prediction process is one of the problems in generating explanations for GNN predictions, and in [74], DRL is used for this purpose. A DRL-based iterative graph generator selects the most important node for a prediction as a seed node and then adds edges to generate explanatory sub-graphs.

The sub-graph generation policy is learned with a policy gradient on the mutual information between the predictions and the distribution of predictions under the explanatory sub-graph. This method achieves better performance in terms of the qualitative and quantitative similarity between the generated sub-graphs and the ground-truth explanations.

Another application of DRL is to add or remove edges during adversarial attacks on GNNs [75, 76]. RLS2V [77] is a framework that uses DRL to learn structural changes in graphs and is used to develop strategies for adversarial attacks on GNNs, since GNNs are vulnerable to adversarial attacks that corrupt or poison their training data. An attack methodology based on Q-learning and structure-to-vector representations is learned to modify the graph structure. The purpose of the DRL agent is to perform an attack that evades detection during classification.

#### **4.1 Multi-agent deep reinforcement learning**

Multi-agent deep reinforcement learning needs coordination to solve certain tasks efficiently. Due to the size of joint action spaces, fully centralized control is often infeasible in these problems. Coordination graph-based methods allow reasoning about the joint action based on the structure of interactions.

The coordination graph (CG) was introduced by Guestrin et al. [78], who presented a method for joint value estimation that allows explicit modeling of the locality of interactions and formal reasoning about joint actions. A CG is a way to factorize a complex multi-agent Q-function: rather than using a single joint Q-function that depends on the joint action of all agents, a hypergraph is used to decompose the Q-function into a sum of much lower-dimensional Q-functions over the edges. The optimal joint action can then be found by passing messages along the edges of the coordination hypergraph.
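
As a tiny illustration of this factorization, the sketch below decomposes a three-agent Q-function over two edges; since the graph is small, the best joint action is found by brute-force enumeration rather than message passing, and the edge Q-tables are random stand-ins.

```python
import itertools
import numpy as np

# Coordination graph on 3 agents with 2 actions each; Q_joint(a) is the sum
# over edges (i, j) of the edge payoff Q_ij[a_i, a_j].
rng = np.random.default_rng(5)
edges = [(0, 1), (1, 2)]
Q_edge = {e: rng.normal(size=(2, 2)) for e in edges}

def joint_q(actions):
    return sum(Q_edge[(i, j)][actions[i], actions[j]] for i, j in edges)

# Enumerate all 2^3 joint actions and keep the best one.
best = max(itertools.product(range(2), repeat=3), key=joint_q)
print(best, joint_q(best))
```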

MAGNet [79] represents policies for multi-agent environments based on relevance graphs and message-passing mechanisms; here, the graphs are static and constructed according to heuristic rules. In the DGN model [80], multiple agents are represented as nodes of a graph, and the relationships between them are learned by an observation encoder module. Next, a convolutional kernel module defines multi-head dot-product attention to extract relational features between each agent and its neighbors in the local region. A Q-network module receives the extracted features and uses them to determine the strategy, which ultimately leads to cooperation between agents. To create an effective cooperative strategy, the encoder and the Q-network are trained jointly. The paper shows that graph convolution strongly increases cooperation among agents. The model is evaluated on the MAgent grid-world platform.


Inspired by this idea [80], a model is presented in [81] that controls connected autonomous vehicles (CAVs) as multiple agents, using a GNN and RL for cooperation between them. Local information is obtained through the onboard sensors of nearby human-driven vehicles (HDVs), while global information is obtained from other connected autonomous vehicles via connectivity channels; together, this information defines the graph structure. Within the local network, information passes from HDVs to CAVs; in the global network, all CAVs can share knowledge, including locally sensed information and their own information. The environment contains a variable number of agents and requires a dynamic-length output that matches CAV driving operations. Because of the variable number of agents, it is difficult to use joint training with a distinct Q-network per agent; joint training is also not scalable, because as the number of agents increases, the number of parameters of the distinct Q-networks grows exponentially. One efficient way to solve these challenges is to apply a shared, centralized Q-network that determines the actions of all agents. Combining a GCN with a deep Q-network enables collaborative and safe control of lane-changing decisions in varied traffic.

#### **4.2 Multi-task deep reinforcement learning**

MTDRL provides a learning framework for coordinating and exploiting commonalities between multiple tasks in order to learn policies with improved data efficiency, robustness, and generalization. Compatible state-action spaces, such as the same dimensions of states and actions across multiple tasks, are the main assumption in an MTDRL process. GNNs support this assumption because they are capable of processing graphs of arbitrary size.

One application of GNNs in MTDRL is in continuous-control environments, where the features of each element of a MuJoCo agent are used to construct input graphs [82]. Each actuator obtains information from local sensors, and a shared modular policy is defined as a global policy across each agent's actuators. Each limb of the MuJoCo agent is treated as a state with features including position, rotation, velocity, etc., and implements its own policy to optimize the joint reward function.

A framework is proposed in [83] to learn the job-shop scheduling problem (JSSP) with a GNN and RL. The GNN part creates a graph from the spatial features of the elements of the job-shop problem, and the RL part treats scheduling as sequential decision-making solved with the proximal policy optimization (PPO) method.

#### **5. Conclusion**

In this survey, we summarized GNNs and RL and their relations, with an overview of the challenges inherent in graph neural networks and multi-agent environments. Learning in collaborative multi-agent environments with dynamic, non-deterministic, and large state spaces has become a very important challenge in applications. Among these challenges, we can mention the effect of the size of the state space on the duration of learning, as well as inefficient cooperation and the lack of proper coordination in decision-making between the agents. Also, when reinforcement learning algorithms are used with graph structures, the models face challenges such as the difficulty of determining an appropriate learning goal and the long convergence times caused by trial-and-error learning. The integration of these methods therefore leads to more realistic scenarios and more effective solutions to real-world problems. Researchers in this field can significantly advance the combination of GNNs and DRL by providing newer models and architectures.

### **Author details**

Fatemeh Fathinezhad<sup>1</sup> \*, Peyman Adibi<sup>1</sup> , Bijan Shoushtarian<sup>1</sup> and Jocelyn Chanussot<sup>2</sup>

1 Faculty of Computer Engineering, Artificial Intelligence Department, University of Isfahan, Isfahan, Iran

2 GIPSA-Lab, CNRS, Grenoble INP, University of Grenoble Alpes, Grenoble, France

\*Address all correspondence to: fateme.fathinezhad@eng.ui.ac.ir

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;**521**(7553): 436-444

[2] Montavon G, Samek W, Müller KR. Methods for interpreting and understanding deep neural networks. Digital Signal Processing. 2018;**73**:1-15

[3] Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, et al. Graph neural networks: A review of methods and applications. AI Open. 2020;**1**:57-81

[4] Wu Z, Pan S, Chen F, Long G, Zhang C, Philip SY. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. 2020;**32**(1):4-24

[5] Eck D, Schmidhuber J. A first look at music composition using LSTM recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale. 2002;**103**:48

[6] Buşoniu L, Babuška R, De Schutter B. Multi-agent reinforcement learning: An overview. In: Innovations in Multi-Agent Systems and Applications-1, 2010th edition. Germany, Springer Berlin: Springer Verlag; August 11, 2010. pp. 183-221

[7] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015; **518**(7540):529-533

[8] Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Message passing neural networks. In: Machine Learning Meets Quantum Physics. Switzerland: Springer; 2020. pp. 199-214

[9] Hwang D, Yang S, Kwon Y, Lee KH, Lee G, Jo H, et al. Comprehensive study on molecular supervised learning with graph neural networks. Journal of Chemical Information and Modeling. 2020;**60**(12):5936-5945

[10] Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine. 2017;**34**(4):18-42

[11] Hamilton WL, Ying R, Leskovec J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin. arXiv: 1709.05584. 2017;**40**(3):52-74

[12] Ying R, He R, Chen K, Eksombatchai P, Hamilton WL, Leskovec J. Graph convolutional neural networks for web-scale recommender systems. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, Washington DC. 2018

[13] Monti F, Frasca F, Eynard D, Mannion D, Bronstein MM. Fake news detection on social media using geometric deep learning. arXiv: 1902.06673. 2019

[14] Rossi E, Monti F, Bronstein MM, Liò P. NCRNA classification with graph convolutional networks. In: KDD Workshop on Deep Learning on Graphs. Anchorage, Alaska, USA: Association for Computing Machinery; 2019

[15] Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;**34**(13): 457-466

[16] Veselkov K et al. Hyperfoods: Machine intelligent mapping of cancerbeating molecules in foods. Scientific Reports. 2019;**9**(1):1-12

[17] Gainza P et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods. 2019;**17**: 184-192

[18] Estrach JB, Zaremba W, Szlam A, LeCun Y. Spectral networks and deep locally connected networks on graphs. In: 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada. Vol. 2014; 2014

[19] Kipf TN, Welling M. Semisupervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR), 5th ICLR (Poster), Toulon, France. 2017. Available from: OpenReview.net

[20] Tang S, Li B, Yu H. ChebNet: Efficient and stable constructions of deep neural networks with rectified power units using chebyshev approximations. arXiv preprint arXiv: 1911.05467. 2019

[21] Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017

[22] Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems. 2017; **30**:1025-1035

[23] Oh J, Cho K, Bruna J. Advancing graphsage with a data-driven node sampling. arXiv preprint arXiv: 1904.12935. 2019

[24] Pei Y, Yang J, Wang J, Xu P, Zhou T, Wu F. An emergency control strategy for undervoltage load shedding of power system: A graph deep reinforcement learning method. IET Generation, Transmission & Distribution. 2023;**17**: 2130-2141

[25] Zhang M, Chen Y. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems. 2018; **31**:5171-5181

[26] Wu J, He J, Xu J. Net: Degreespecific graph neural networks for node and graph classification. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, New York NY, United States. 2019. pp. 406-415

[27] Tsitsulin A, Palowitch J, Perozzi B, Müller E. Graph clustering with graph neural networks. arXiv preprint arXiv: 2006.16904. 2020

[28] Wang WY, Li J, He X. Deep reinforcement learning for NLP. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. Association for Computational Linguistics, Melbourne Convention and Exhibition Centre. 2018. pp. 19-21

[29] Haydari A, Yılmaz Y. Deep reinforcement learning for intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems. 2020;**23**(1): 11-32

[30] Hu YJ, Lin SJ. February. Deep reinforcement learning for optimizing finance portfolio management. In: 2019 Amity International Conference on Artificial Intelligence (AICAI), Amity University Dubai Campus Dubai International Academic City. Dubai: IEEE; 2019. pp. 14-20


[31] Coronato A, Naeem M, De Pietro G, Paragliola G. Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine. 2020;**109**:101964

[32] Gu S, Holly E, Lillicrap T, Levine S. Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Marina Bay Sands, Singapore: IEEE; 2017. pp. 3389-3396

[33] Zhao X, Zhang L, Ding Z, Xia L, Tang J, Yin D. Recommendations with negative feedback via pairwise deep reinforcement learning. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, New York NY, United States. 2018. pp. 1040-1048

[34] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. 2013

[35] Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, et al. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Marina Bay Sands, Singapore: IEEE; 2017. pp. 3357-3364

[36] Chu T, Wang J, Codecà L, Li Z. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems. 2019;**21**(3): 1086-1095

[37] Soleymani F, Paquet E. Deep graph convolutional reinforcement learning for financial portfolio management– deeppocket. Expert Systems with Applications. 2021;**182**:115127

[38] Littman ML. Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994. New Brunswick, New Jersey: Morgan Kaufmann; 1994. pp. 157-163

[39] Amato C, Dibangoye JS, Zilberstein S. Incremental policy generation for finite-horizon Dec-POMDPs. In: Proceedings of the Thirty-Second International Conference on Automated Planning and Scheduling. Palo Alto, California USA: AAAI Press; 2009

[40] Oliehoek FA, Spaan MT, Vlassis N. Optimal and approximate q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research. 2008;**32**:289-353

[41] Dibangoye JS, Amato C, Buffet O, Charpillet F. Optimally solving Dec-POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research. 2016;**55**:443-497

[42] Foerster JN, Farquhar G, Afouras T, Nardelli N, Whiteson S. Counterfactual multi-agent policy gradients. In: AAAI conference on artificial intelligence. New Orleans, Louisiana, USA: AAAI Press; 2018. pp. 2974-2982

[43] Omidshafiei S, Pazis J, Amato C, How JP, Vian J. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: Proceedings of the 34th International Conference on Machine Learning: 70. Sydney, Australia: PMLR; 2017. pp. 2681-2690

[44] Papoudakis G, Christianos F, Rahman A, Albrecht SV. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737. 2019

[45] Burkov A, Chaib-Draa B. Reducing the complexity of multiagent reinforcement learning. In: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems. 2007. pp. 1-3

[46] Minsky M. Steps toward artificial intelligence. Proceedings of the IRE. IEEE. 1961;**49**(1):8-30

[47] Ding S, Du W, Ding L, Zhang J, Guo L, An B. Multiagent reinforcement learning with graphical mutual information maximization. In: IEEE Transactions on Neural Networks and Learning Systems. 2023

[48] Lee ES, Zhou L, Ribeiro A, Kumar V. Graph neural networks for decentralized multi-agent perimeter defense. Frontiers in Control Engineering. 2023;**4**:1

[49] Naderializadeh N, Hung FH, Soleyman S, Khosla D. Graph convolutional value decomposition in multi-agent reinforcement learning. arXiv preprint arXiv: 2010.04740. 2020

[50] Thekumparampil KK, Wang C, Oh S, Li L-J. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv: 1803.03735. 2018

[51] Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. In: International Conference on Learning Representations. 2018;**1710.10903**

[52] Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers). Minneapolis: Association for Computational Linguistics; 2019. pp. 4171-4186

[53] Iqbal S, Sha F. Actor-attention-critic for multi-agent reinforcement learning. In: International Conference on Machine Learning. Long Beach, California, USA: PMLR; 2019. pp. 2961-2970

[54] Kok JR, Vlassis N. Sparse cooperative Q-learning. In: Proceedings of the Twenty-First International Conference on Machine Learning, Banff Alberta, Canada. 2004. p. 61

[55] Li S, Gupta JK, Morales P, Allen R, Kochenderfer MJ. Deep implicit coordination graphs for multi-agent reinforcement learning. In: Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2021), International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), London. arXiv preprint arXiv: 2006.11438. 2020

[56] Zhou M, Chen Y, Wen Y, Yang Y, Su Y, Zhang W, et al. Factorized Q-learning for large-scale multi-agent systems. In: Proceedings of the First International Conference on Distributed Artificial Intelligence (DAI'19). October 2019. pp. 1-7. [Article ID: 7]

[57] Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, et al. Value-decomposition networks for cooperative multi-agent learning. In: AAMAS '18: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden. Richland, SC: IFAAMAS; 2018. arXiv preprint arXiv:1706.05296

[58] Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S. Qmix: Monotonic value function factorisation for deep multiagent reinforcement learning. In: International Conference on Machine Learning. PMLR; 2018;**21**:7234-7284

[59] Lowe R, Wu YI, Tamar A, Harb J, Pieter Abbeel O, Mordatch I. Multi-agent actor-critic for mixed cooperativecompetitive environments. Advances in Neural Information Processing Systems. Long Beach, CA, USA. 2017;**30**

[60] Foerster J, Farquhar G, Afouras T, Nardelli N, Whiteson S. Counterfactual multi-agent policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2018;**32**(1).

[61] Jiang J, Dun C, Huang T, Zongqing L. Graph convolutional reinforcement learning. In: International Conference on Learning Representations, Addis Ababa, Ethiopia. 2020. Available from: OpenReview.net

[62] Wang RE, Everett M, How JP. R-MADDPG for partially observable environments and limited communication. arXiv preprint arXiv: 2002.06684. 2020

[63] Li S, Wu Y, Cui X, Dong H, Fang F, Russell, S. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2019;**33**:4213-4220

[64] Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J. Mean field multi-agent reinforcement learning. In: International Conference on Machine Learning. PMLR; 2018. pp. 5571-5580

[65] Kim W, Cho M, Sung Y. Messagedropout: An efficient training method for multi-agent deep reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Hilton Hawaiian Village, Honolulu, Hawaii, USA. 2019;**33**(1):6079-6086

[66] Almasan P, Suárez-Varela J, Rusek K, Barlet-Ros P, Cabellos-Aparicio A. Deep reinforcement learning meets graph neural networks: Exploring a routing optimization use case. Computer Communications. Netherlands: Elsevier; 2022;**196**:184-194

[67] Ma Q, Ge S, He D, Thaker D, Drori I. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. arXiv preprint arXiv:1911.04936. 2019

[68] Wang Q, Tang C. Deep reinforcement learning for transportation network combinatorial optimization: A survey. Knowledge-Based Systems. 2021;**233**:107526

[69] Zheng P, Xia L, Li C, Li X, Liu B. Towards self-X cognitive manufacturing network: An industrial knowledge graph-based multi-agent reinforcement learning approach. Journal of Manufacturing Systems. 2021;**61**:16-26

[70] Vithayathil Varghese N, Mahmoud QH. A survey of multi-task deep reinforcement learning. Electronics. 2020;**9**(9):1363

[71] Elsken T, Metzen JH, Hutter F. Neural architecture search: A survey. The Journal of Machine Learning Research. 2019;**20**(1):1997-2017

[72] Gao Y, Yang H, Zhang P, Zhou C, Hu Y. Graphnas: Graph neural architecture search with reinforcement learning. arXiv preprint arXiv: 1904.09981. 2019

[73] Zhou K, Song Q, Huang X, Hu X. Auto-GNN: Neural architecture search of graph neural networks. Frontiers in Big Data. 2022;**5**. arXiv preprint arXiv:1909.03184

[74] Shan C, Shen Y, Zhang Y, Li X, Li D. Reinforcement learning enhanced explainer for graph neural networks. Advances in Neural Information Processing Systems. 2021;**34**:22523-22533

[75] Tang X, Li Y, Sun Y, Yao H, Mitra P, Wang S. Transferring robustness for graph neural network against poisoning attacks. In: Proceedings of the 13th International Conference on Web Search and Data Mining. 2020. pp. 600-608

[76] Zügner D, Akbarnejad A, Günnemann S. Adversarial attacks on neural networks for graph data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom. New York, NY: Association for Computing Machinery; 2018. pp. 2847-2856

[77] Dai H, Li H, Tian T, Huang X, Wang L, Zhu J, et al. Adversarial attack on graph structured data. In: International Conference on Machine Learning. PMLR, Stockholmsmässan, Stockholm Sweden; 2018. pp. 1115-1124

[78] Guestrin C, Koller D, Parr R. Multiagent planning with factored MDPs. Advances in neural information processing systems. Vancouver, British Columbia, Canada. 2001:14

[79] Malysheva A, Sung TT, Sohn CB, Kudenko D, Shpilman A. Deep multiagent reinforcement learning with relevance graphs. arXiv preprint arXiv: 1811.12557. 2018

[80] Jiang J, Dun C, Huang T, Lu Z. Graph convolutional reinforcement learning. arXiv preprint arXiv: 1810.09202. 2018

[81] Chen S, Dong J, Ha P, Li Y, Labi S. Graph neural network and reinforcement learning for multi-agent cooperative control of connected autonomous vehicles. Computer-Aided Civil and Infrastructure Engineering. 2021;**36**(7):838-857

[82] Huang W, Mordatch I, Pathak D. One policy to control them all: Shared modular policies for agent-agnostic control. In: International Conference on Machine Learning. PMLR; 2020. pp. 4455-4464

[83] Park J, Chun J, Kim SH, Kim Y, Park J. Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning. International Journal of Production Research. United Kingdom. 2021;**59**(11):3360-3377

