**2. Background**

The existing literature focuses on the deployment and movement of UAV relays for numerous applications. In [15], the authors estimated the optimal UAV relay position in a multi-rate communication system using theoretical and simulated analysis. The work in [16] investigated the mission planning of UAV relays to improve the connectivity of ground users. The authors of [17, 18] maximized the lower bound of the uplink transmission rate over the link between the UAV relay and ground devices using dynamic heading adjustment approaches. For throughput maximization of the mobile relaying system, an iterative algorithm was developed in [19, 20], which jointly optimized the relays' trajectory and the transmit power of the sources and UAVs while satisfying practical constraints. In [21], the authors maximized the UAV relay network's throughput by optimizing transmit power, bandwidth, transmission rate, and relay deployment. However, these works use a model-based centralized approach that requires all necessary system parameters to be known. Additionally, a research gap still exists in enhancing network performance for source-destination device pair communication. To overcome these shortcomings, Indu et al. [22] minimized the energy consumption of the UAV along its trajectory using a genetic algorithm (GA). The authors in [6] proposed two meta-heuristic algorithms, GA and particle swarm optimization (PSO), to find the optimal UAV trajectory that satisfies users' minimum data rate requirements. They showed that PSO significantly improves the UAV's wireless coverage compared to GA. Although meta-heuristic algorithms can deal with the complexity of UAV path planning, challenges remain in exchanging information between the UAV and the core network, because the constraints are either unavailable or their gradients are difficult to obtain analytically.

*Optimal Unmanned Aerial Vehicle Control and Designs for Load Balancing in Intelligent… DOI: http://dx.doi.org/10.5772/intechopen.110312*

Another line of research studied the mobility management of UAVs for resource allocation and coverage optimization using RL techniques to deal with convergence issues. Kawamoto et al. [23] presented a Q-learning-based resource allocation algorithm for UAVs that allocates time slots and modulation schemes. The work in [24] presented a framework for finding the optimal UAV trajectory under a given data rate constraint, which relies on the state-action-reward-state-action (SARSA) algorithm. Hu et al. [25] proposed a real-time sensing and transmission protocol in UAV-aided cellular networks and designed optimal UAV trajectories under limited spectrum resources using RL based on a Q-learning algorithm. Furthermore, the authors of [26] transformed the UAV trajectory optimization problem of maximizing the cumulative data collected from sensors into a Markov decision process (MDP) and proposed two stochastic-modeling RL algorithms, namely Q-learning and SARSA, to learn the UAV's policy. They reported that SARSA outperforms Q-learning due to the adaptive system's state update rule. According to the state of the art, the coupled relationship among UAV trajectory, device association, and transmit power allocation of IoT devices for enhancing network lifetime has not been investigated in the data collection process of UAV-assisted IoT networks.

#### **3. Channel characterization of UAV-operated communication system**

This section proposes a multi-hop radio frequency and free space optical (RF-FSO) communication framework that analytically optimizes the UAV's altitude for performance enhancement of a relaying system. Here, we minimize the outage probability and symbol error rate based on independent and identically distributed statistical parameters i.e., pointing errors, atmospheric turbulence, and scintillation.

#### **3.1 Channel model**

Consider a multi-hop hybrid RF–FSO system as shown in **Figure 1**, where single antenna-equipped ground base stations realize periodic data exchange. Since there are significant obstacles in the LoS path, direct link cannot be established between them. Therefore, two UAVs are deployed at a certain altitude which are employed as relays between the source and destination. These UAVs operate as RF and optical link transceiver modules with single-directional apertures. Depending on various environmental conditions, three different channels categorize the source-to-destination link, i.e., Ground to UAV (G2U), UAV to UAV (U2U), and UAV to Ground (U2G) channels.

**Figure 1.** *UAV-assisted multihop hybrid RF–FSO system.*

#### *3.1.1 G2U channel model*

As the ground-to-UAV channel carries RF signals that experience small-scale fading and large-scale path loss, the received symbol at UAV $U\_1$ can be expressed as [27],

$$Y\_{U\_1} = \sqrt{P\_{S, U\_1}} \sqrt{a\_{S, U\_1}}\, h\_{S, U\_1} x\_{S} + n\_{U\_1} \tag{1}$$

where $x\_S$ is the transmitted symbol of power $P\_{S,U\_1}$, $n\_{U\_1}$ represents the additive white Gaussian noise (AWGN) with zero mean and variance $N\_0$ at $U\_1$, $h\_{S,U\_1}$ defines the channel gain of the $S$-$U\_1$ link, and $a\_{S,U\_1} = \kappa\_{S,U\_1} L\_{S,U\_1}^{-\epsilon\_{S,U\_1}}$ is the path loss corresponding to link distance $L\_{S,U\_1}$, where $\epsilon\_{S,U\_1}$ denotes the path loss exponent and $\kappa\_{S,U\_1}$ is an environment-dependent constant. As multipath components govern the $S$-$U\_1$ link, $|h\_{S,U\_1}|^2 = \chi$ follows a non-central chi-square distribution, and its probability density function (PDF) is given by [28],

$$f\_{\chi}(t) = \frac{(K\_{S,U\_1} + 1)e^{-K\_{S,U\_1}}}{\overline{A}\_{S,U\_1}} \exp\left\{\frac{-(K\_{S,U\_1} + 1)t}{\overline{A}\_{S,U\_1}}\right\} \times I\_0\left(2\sqrt{\frac{(K\_{S,U\_1} + 1)K\_{S,U\_1}\, t}{\overline{A}\_{S,U\_1}}}\right) \tag{2}$$

where $\overline{A}\_{S,U\_1} = E\{|h\_{S,U\_1}|^2\} = 1$ is the average fading power, $E\{\cdot\}$ denotes the expectation operator, $I\_0(\cdot)$ is the zero-order modified Bessel function, $K\_{S,U\_1} = |m\_{S,U\_1}|^2 / 2\sigma^2$ is the Rician factor, $m\_{S,U\_1}$ is the amplitude of the LoS component, and $\sigma^2$ is the average power of the multipath components. The instantaneous signal-to-noise ratio (SNR) received at UAV $U\_1$ is expressed as [29],

$$\Upsilon\_{S,U\_1} = \frac{P\_{S,U\_1} a\_{S,U\_1}}{N\_0} \chi = \overline{\Upsilon}\_{S,U\_1} \chi \tag{3}$$

where the average SNR is given as $\overline{\Upsilon}\_{S,U\_1} = P\_{S,U\_1} a\_{S,U\_1} / N\_0$.
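The relation between the Rician factor, the unit average fading power, and the instantaneous SNR of Eq. (3) can be illustrated with a small Monte-Carlo sketch. The function names, the seed, and the value of the Rician factor below are illustrative assumptions, not part of the chapter's method.

```python
import math
import random


def sample_rician_power(k_factor, rng):
    """Draw one sample of |h|^2 for Rician fading with unit average power."""
    # LoS amplitude and per-dimension scatter std are chosen so that
    # K = los^2 / (2 * sigma^2) and E{|h|^2} = los^2 + 2 * sigma^2 = 1.
    los = math.sqrt(k_factor / (k_factor + 1.0))
    sigma = math.sqrt(1.0 / (2.0 * (k_factor + 1.0)))
    re = los + sigma * rng.gauss(0.0, 1.0)
    im = sigma * rng.gauss(0.0, 1.0)
    return re * re + im * im


def instantaneous_snr(avg_snr, k_factor, rng):
    """Eq. (3): instantaneous SNR = average SNR times the fading power."""
    return avg_snr * sample_rician_power(k_factor, rng)


# Hypothetical check: the empirical mean fading power should be close to 1.
rng = random.Random(0)
samples = [sample_rician_power(5.0, rng) for _ in range(20000)]
mean_power = sum(samples) / len(samples)
```

Because the fading power is normalized to unit mean, averaging `instantaneous_snr` over many draws approaches the average SNR defined above.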

#### *3.1.2 U2U channel model*

UAV $U\_1$ first receives the RF signal $Y\_{U\_1}$, converts and encodes it into an optical signal, and then forwards it to UAV $U\_2$ over the FSO link. The received signal at UAV $U\_2$ can be obtained as [27]


$$Y\_{U\_2} = \eta\_{U\_1} \sqrt{P\_{U\_1, U\_2}} h\_{U\_1, U\_2} \mathbf{x}\_{U\_1} + \mathbf{n}\_{U\_2} \tag{4}$$

where $\eta\_{U\_1}$ is the electrical-to-optical conversion coefficient of UAV $U\_1$, $x\_{U\_1}$ indicates the converted and encoded optical symbol of power $P\_{U\_1,U\_2}$, $n\_{U\_2}$ denotes AWGN with zero mean and variance $N\_0$ at UAV $U\_2$, and $h\_{U\_1,U\_2} = h\_a h\_p$ is the optical channel coefficient, which depends on atmospheric turbulence-induced fading ($h\_a$) and pointing errors ($h\_p$). The instantaneous SNR received at UAV $U\_2$ can be expressed as [27]

$$\Upsilon\_{U\_1, U\_2} = \frac{\eta\_{U\_1}^2 P\_{U\_1, U\_2} h\_{U\_1, U\_2}^2}{N\_0} \tag{5}$$

Since the optical link between UAVs $U\_1$ and $U\_2$ experiences atmospheric turbulence and the corresponding optical axis misalignment, the PDF of its instantaneous SNR reflects the variation of atmospheric turbulence and pointing errors, and can be expressed as [30]

$$f\_{\Upsilon\_{U\_1, U\_2}}(\Upsilon) = \frac{\xi^2}{2\Upsilon\Gamma(\alpha)\Gamma(\beta)} G\_{1,3}^{3,0}\left(\alpha\beta \sqrt{\frac{\Upsilon}{\overline{\Upsilon}\_{U\_1, U\_2}}} \,\Bigg|\, \begin{matrix} \xi^2+1 \\ \xi^2, \alpha, \beta \end{matrix}\right) \tag{6}$$

where $\Gamma(\cdot)$ is the Gamma function, $\alpha$ and $\beta$ are scintillation parameters, $\xi$ is the ratio between the equivalent beam radius and the misalignment displacement standard deviation at $U\_2$, $G\_{p,q}^{m,n}\left(x \,\Big|\, \begin{matrix} a\_1, a\_2, \dots, a\_n, \dots, a\_p \\ b\_1, b\_2, \dots, b\_m, \dots, b\_q \end{matrix}\right)$ is Meijer's G function, and $\overline{\Upsilon}\_{U\_1,U\_2} = P\_{U\_1,U\_2}\, \eta\_{U\_1}^2 E\{h\_{U\_1,U\_2}\}^2 / N\_0$ is the average electrical SNR.

#### *3.1.3 U2G channel model*

After receiving the optical signal $Y\_{U\_2}$, UAV $U\_2$ first decodes and converts it to an RF signal and then forwards it to the destination. Hence, the channel characterization is similar to that of the G2U channel model, and the received signal at the destination can be expressed as [27]

$$Y\_D = \eta\_{U\_2} \sqrt{P\_{U\_2,D}} \sqrt{a\_{U\_2,D}}\, h\_{U\_2,D} x\_{U\_2} + n\_D \tag{7}$$

where $\eta\_{U\_2}$ is the optical-to-electrical conversion coefficient of UAV $U\_2$, $x\_{U\_2}$ denotes the transmitted symbol of power $P\_{U\_2,D}$, $n\_D$ defines AWGN with zero mean and variance $N\_0$, $h\_{U\_2,D}$ is the channel coefficient, and $a\_{U\_2,D}$ is the path loss attenuation factor. The instantaneous SNR received at the destination is expressed as

$$\Upsilon\_{U\_2,D} = \frac{\eta\_{U\_2}^2 P\_{U\_2,D}\, a\_{U\_2,D} |h\_{U\_2,D}|^2}{N\_0} \tag{8}$$

where $\overline{\Upsilon}\_{U\_2,D} = \eta\_{U\_2}^2 P\_{U\_2,D}\, a\_{U\_2,D} / N\_0$ is the average SNR.

#### **3.2 Performance metrics of multihop RF–FSO system**

#### *3.2.1 Outage probability*

The outage probability is defined as the probability that the instantaneous SNR is less than the minimum required threshold level, $\Upsilon\_{th}$. For the decode-and-forward relaying mode, the equivalent SNR at the destination can be expressed as [27]

$$\Upsilon\_{S,D} = \min\left(\Upsilon\_{S,U\_1}, \Upsilon\_{U\_1,U\_2}, \Upsilon\_{U\_2,D}\right) \tag{9}$$

The cumulative distribution function (CDF) of the equivalent SNR is expressed by

$$\begin{split} F\_{\Upsilon\_{S,D}}(\Upsilon) &= \Pr(\Upsilon\_{S,D} \le \Upsilon) = \Pr(\min\left(\Upsilon\_{S,U\_1}, \Upsilon\_{U\_1,U\_2}, \Upsilon\_{U\_2,D}\right) \le \Upsilon) \\ &= 1 - \left\{1 - F\_{\Upsilon\_{S,U\_1}}(\Upsilon)\right\} \left\{1 - F\_{\Upsilon\_{U\_1,U\_2}}(\Upsilon)\right\} \left\{1 - F\_{\Upsilon\_{U\_2,D}}(\Upsilon)\right\} \end{split} \tag{10}$$

where $F\_{\Upsilon\_{S,U\_1}}(\Upsilon)$, $F\_{\Upsilon\_{U\_1,U\_2}}(\Upsilon)$ and $F\_{\Upsilon\_{U\_2,D}}(\Upsilon)$ are the CDFs of $\Upsilon\_{S,U\_1}$, $\Upsilon\_{U\_1,U\_2}$ and $\Upsilon\_{U\_2,D}$, respectively. The outage probability of the overall system is obtained in terms of $Q\_1(\cdot,\cdot)$, i.e., the first-order Marcum Q function, as [31]

$$\begin{split} P\_{out} &= F\_{\Upsilon\_{S,D}}(\Upsilon\_{th}) = \Pr(\Upsilon\_{S,D} \le \Upsilon\_{th}) \\ &= 1 - Q\_1\left(\sqrt{2K\_{S,U\_1}}, \sqrt{2\Upsilon\_{th} L\_{S,U\_1}^{\epsilon\_{S,U\_1}} (1 + K\_{S,U\_1})/\tilde{\Upsilon}\_{S,U\_1}}\right) \\ &\quad \times Q\_1\left(\sqrt{2K\_{U\_2,D}}, \sqrt{2\Upsilon\_{th} L\_{U\_2,D}^{\epsilon\_{U\_2,D}} (1 + K\_{U\_2,D})/\tilde{\Upsilon}\_{U\_2,D}}\right) \\ &\quad \times \left[1 - \frac{\xi^{2}}{\Gamma(\alpha)\Gamma(\beta)} G\_{2,4}^{3,1}\left(\alpha\beta\sqrt{\frac{\Upsilon\_{th}}{\overline{\Upsilon}\_{U\_1,U\_2}}} \,\Bigg|\, \begin{matrix} 1, \xi^2+1 \\ \xi^2, \alpha, \beta, 0 \end{matrix}\right)\right] \end{split} \tag{11}$$
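As a rough numerical companion to Eqs. (10) and (11), the sketch below evaluates the first-order Marcum Q function with a standard noncentral chi-square series and combines per-hop outage values for independent hops. It is a minimal illustration under stated assumptions: the function names are hypothetical, the average SNR passed to `rician_hop_outage` is taken to already include path loss, and the FSO hop (which needs Meijer's G function) is assumed to be supplied as a precomputed outage value.

```python
import math


def marcum_q1(a, b, terms=200):
    """First-order Marcum Q function via the noncentral chi-square series."""
    # 1 - Q1(a, b) is the CDF of a noncentral chi-square variable with
    # 2 degrees of freedom and noncentrality a^2, evaluated at b^2.
    lam, x = a * a / 2.0, b * b / 2.0
    poisson = math.exp(-lam)   # Poisson weight for j = 0
    term = math.exp(-x)        # e^{-x} x^i / i! for i = 0
    partial = term             # running sum over i <= j of e^{-x} x^i / i!
    cdf = 0.0
    for j in range(terms):
        cdf += poisson * (1.0 - partial)
        poisson *= lam / (j + 1)
        term *= x / (j + 1)
        partial += term
    return 1.0 - cdf


def rician_hop_outage(k_factor, avg_snr, snr_th):
    """Outage of one Rician RF hop; avg_snr is assumed to include path loss."""
    a = math.sqrt(2.0 * k_factor)
    b = math.sqrt(2.0 * (1.0 + k_factor) * snr_th / avg_snr)
    return 1.0 - marcum_q1(a, b)


def end_to_end_outage(hop_outages):
    """Eq. (10): outage of the weakest of several independent hops."""
    survive = 1.0
    for p in hop_outages:
        survive *= 1.0 - p
    return 1.0 - survive
```

The fixed number of series terms is adequate for moderate arguments only; very large Rician factors or thresholds would need more terms or a library routine.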

#### *3.2.2 Symbol error rate*

It is defined as the probability of false estimation of the received symbol, which can be expressed as [32]

$$P\_{M,PSK}(\epsilon) = 1 - \sum\_{k=1}^{M} P\_k(\Upsilon\_{S,U\_1}) P\_k(\Upsilon\_{U\_1,U\_2}) P\_k(\Upsilon\_{U\_2,D}) \tag{12}$$

$$P\_k(\mathbf{Y}\_{J,d}) = \begin{cases} 1 - \frac{1}{\pi} \int\_0^{\frac{(M-1)\pi}{M}} \mathcal{M}\_{\mathbf{Y}\_{J,d}} \left( -\frac{\sin^2\left(\frac{\pi}{M}\right)}{\sin^2(\phi)} \right) d\phi, \text{for } k = 1 \\\\ \frac{1}{\pi} \int\_0^{\frac{(M-1)\pi}{M}} \mathcal{M}\_{\mathbf{Y}\_{J,d}} \left( -\frac{\sin^2\left(\frac{\pi}{M}\right)}{\sin^2(\phi)} \right) d\phi, \text{for } k = \frac{M}{2} + 1 \\\\ \left[ \frac{1}{2\pi} \int\_0^{\frac{\pi}{M} - \Phi\_k} \mathcal{M}\_{\mathbf{Y}\_{J,d}} \left( -\frac{\sin^2(\phi\_k - 1)}{\sin^2(\phi)} \right) d\phi - \right], \text{otherwise} \end{cases} \tag{13}$$


where $a\_k = (2k - 1)\pi / M$. After substituting Eq. (6) in Eq. (13) and using [29], we can obtain the moment-generating function of the instantaneous SNR corresponding to the FSO link as

$$\mathcal{M}\_{\Upsilon\_{U\_1,U\_2}}(s) = \frac{\xi^2 2^{\alpha+\beta-1}}{4\pi \Gamma(\alpha)\Gamma(\beta)} \times G\_{3,6}^{6,1} \left( \frac{(\alpha\beta)^2}{16\overline{\Upsilon}\_{U\_1,U\_2}\, s} \,\Bigg|\, \begin{matrix} 1, \frac{\xi^2+1}{2}, \frac{\xi^2+2}{2} \\ \frac{\xi^2}{2}, \frac{\xi^2+1}{2}, \frac{\alpha}{2}, \frac{\alpha+1}{2}, \frac{\beta}{2}, \frac{\beta+1}{2} \end{matrix} \right) \tag{14}$$

#### **3.3 UAVs' optimal altitude**

According to Eq. (11), the outage probability is a function of the UAVs' altitude, the distance from source to destination, and the distances between the projection points of the UAVs on the ground and the end users. For given values of these parameters, the optimal altitude is obtained as

$$\tilde{h} = l\_y \tan(\tilde{\phi}\_2) \tag{15}$$

where the optimal altitude must satisfy the following condition [33]

$$\tilde{h} = \arg\min\_{h \in [0, \infty)} P\_{out}(h, l\_x, l\_y, L\_{S, D}) \tag{16}$$

Finally, the optimal elevation angle at the receiver side, $\tilde{\phi}\_2$, is obtained by solving the equation

$$[P\_1 \cdot Q\_1(v\_2, w\_2) + P\_2 \cdot Q\_1(v\_1, w\_1)] \cdot P\_3 = 0 \tag{17}$$

where

$$P\_1 = v\_1 e^{-\frac{v\_1^2 + w\_1^2}{2}} \left[ I\_1(v\_1, w\_1) \frac{K\_{S, U\_1}'(\phi\_1)}{v\_1} - I\_0(v\_1, w\_1) \frac{w\_1}{2} \left\{ \frac{K\_{S, U\_1}'(\phi\_1)}{1 + K\_{S, U\_1}(\phi\_1)} + \epsilon\_{S, U\_1}'(\phi\_1) \ln \left( \frac{l\_x}{\cos \phi\_1} \right) + \epsilon\_{S, U\_1}(\phi\_1) \tan \phi\_1 \right\} \right] \times \frac{l\_x l\_y}{l\_x^2 \cos^2 \phi\_2 + l\_y^2 \sin^2 \phi\_2} \tag{18}$$

$$P\_2 = v\_2 e^{-\frac{v\_2^2 + w\_2^2}{2}} \left[ I\_1(v\_2, w\_2) \frac{K\_{U\_2, D}'(\phi\_2)}{v\_2} - I\_0(v\_2, w\_2) \frac{w\_2}{2} \left\{ \frac{K\_{U\_2, D}'(\phi\_2)}{1 + K\_{U\_2, D}(\phi\_2)} + \epsilon\_{U\_2, D}'(\phi\_2) \ln \left( \frac{l\_y}{\cos \phi\_2} \right) + \epsilon\_{U\_2, D}(\phi\_2) \tan \phi\_2 \right\} \right] \tag{19}$$

$$P\_3 = 1 - \frac{\xi^2}{\Gamma(\alpha) \Gamma(\beta)} G\_{2, 4}^{3, 1} \left( \alpha\beta \sqrt{\frac{\Upsilon\_{th}}{\overline{\Upsilon}\_{U\_1, U\_2}}} \,\Bigg|\, \begin{matrix} 1, \xi^2+1 \\ \xi^2, \alpha, \beta, 0 \end{matrix} \right) \tag{20}$$
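When the stationarity condition of Eq. (17) is inconvenient to solve, the minimization of Eq. (16) can also be approximated by a brute-force search over candidate altitudes. The sketch below is only an illustration of that idea: `toy_outage` is a hypothetical stand-in with a single minimum and is not the chapter's Eq. (11), and the search range and step count are assumed values.

```python
import math


def altitude_from_elevation(ground_dist, elevation_rad):
    """Eq. (15): altitude for a given ground distance and elevation angle."""
    return ground_dist * math.tan(elevation_rad)


def argmin_altitude(outage_fn, h_min, h_max, steps=1000):
    """Eq. (16) by brute force: the grid altitude with the lowest outage."""
    best_h, best_p = h_min, outage_fn(h_min)
    for i in range(1, steps + 1):
        h = h_min + (h_max - h_min) * i / steps
        p = outage_fn(h)
        if p < best_p:
            best_h, best_p = h, p
    return best_h, best_p


# Hypothetical stand-in for P_out(h): NOT Eq. (11), just a smooth curve
# with a single minimum so that the search can be exercised.
def toy_outage(h):
    return 0.01 + ((h - 800.0) / 4000.0) ** 2


h_opt, p_opt = argmin_altitude(toy_outage, 0.0, 2000.0, steps=2000)
```

In practice, `outage_fn` would wrap an evaluation of Eq. (11) for fixed ground distances, and the grid result could seed a finer root search on Eq. (17).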

#### **3.4 Numerical results**

In this section, we provide numerical insights into the optimal UAV altitude and the corresponding performance, and then cross-validate the proposed methodology using Monte-Carlo simulation. We assume that the system operates under moderate and strong atmospheric turbulence conditions with a maximum free-space optical distance of 7 km, where the average SNR is set as $\overline{\Upsilon}\_{S,U\_1} = \overline{\Upsilon}\_{U\_1,U\_2} = \overline{\Upsilon}\_{U\_2,D} = 75$ dB.

The variations of elevation angle corresponding to the optimal UAVs' altitude for the given distance between the projection points of UAVs on the ground and end users under moderate atmospheric turbulence conditions are depicted in **Figure 2**. According to this figure, the optimal elevation angles decrease with the increase in distance from the end-user location to the projection point of the UAVs on the ground because the variation of optimal elevation angle follows Eq. (15).

The variation of outage probability with respect to the UAVs' altitude under moderate atmospheric turbulence conditions is shown in **Figure 3** for an SNR threshold of $\Upsilon\_{th} = 0.4$. Since small-scale fading and signal path loss affect the received SNR least at the optimal altitude, the minimum outage probability is achieved at that altitude. On the other hand, the outage probability increases if the UAVs' altitude deviates from the optimal value.

**Figure 2.** *Variation of optimal elevation angle while considering* $\Upsilon\_{th} = 0.1$.

**Figure 3.** *Outage probability variation for different UAVs' altitude.*

**Figure 4.** *Variation of symbol error rate for different modulation schemes.*

**Figure 4** shows the impact of various modulation schemes on the symbol error rate when the distance between the projection points of the UAVs on the ground and the end users is 2000 m under different atmospheric turbulence conditions. According to the result, the symbol error rate decreases as the average SNR increases. Furthermore, binary phase shift keying (BPSK) outperforms quadrature phase shift keying (QPSK). Although higher-order modulation techniques offer higher data rates and bandwidth efficiency, they are more complicated to implement, require a more stringent RF amplifier, and are less resilient to errors. Therefore, BPSK offers more reliable transmission with fewer errors than the other modulation techniques.
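The ordering between BPSK and QPSK can be reproduced in a much simpler setting than the one analyzed here. The sketch below uses the textbook symbol error rate expressions for a plain AWGN channel, which is an assumption made purely for illustration: it ignores the fading, turbulence, and pointing errors modeled in this chapter, and the function names are hypothetical.

```python
import math


def q_func(x):
    """Gaussian Q-function expressed through the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ser_bpsk(snr_db):
    """BPSK symbol error rate in AWGN for a symbol SNR given in dB."""
    snr = 10.0 ** (snr_db / 10.0)
    return q_func(math.sqrt(2.0 * snr))


def ser_qpsk(snr_db):
    """QPSK symbol error rate in AWGN for the same symbol SNR in dB."""
    snr = 10.0 ** (snr_db / 10.0)
    q = q_func(math.sqrt(snr))
    return 2.0 * q - q * q


# Compare both schemes over a few illustrative SNR points.
curve = [(snr_db, ser_bpsk(snr_db), ser_qpsk(snr_db)) for snr_db in (0, 5, 10)]
```

At equal symbol SNR, the QPSK error rate stays above the BPSK error rate at every point, and both decrease as the SNR grows, which matches the qualitative trend described above.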

#### **4. Throughput maximization in UAVs-supported D2D network**

This section proposes a UAVs-supported self-organized device-to-device (USSD2D) network containing multiple source-destination device pairs and multiple UAVs, where the objective is to find the optimal deployed location of UAVs to support reliable data transmission between source and destination device pairs. Here, we consider SNR-constrained maximization of the total instantaneous transmission rate of the USSD2D network by jointly optimizing device association, UAV's channel selection, and UAVs' deployed location at every time slot.

#### **4.1 System model**

**Figure 5** depicts the UAVs-supported self-organized device-to-device (USSD2D) network, where the stationary source and destination device pairs are randomly deployed on the ground within the target area. The direct D2D pairs can establish LoS links due to good channel conditions and the short distance between them. On the other hand, UAV-assisted D2D pairs cannot establish direct links due to the presence of significant obstacles in the signal propagation path and thereby utilize the deployed UAVs as relays.

**Figure 5.** *UAVs-supported self-organized device-to-device network.*

#### *4.1.1 Channel model*

Consider $M$ UAVs, represented by $\mathcal{M} = \{1, 2, \dots, M\}$, at a fixed altitude of $H\_u$ acting as relays for $\overline{K}$ direct D2D pairs and $\tilde{K}$ UAV-assisted D2D pairs. There are $J$ orthogonal channels in total, represented by $\mathcal{J} = \{1, 2, \dots, J\}$, in the USSD2D network, and each UAV selects a single orthogonal channel at a time. The sets of source and destination devices of the direct D2D and UAV-assisted D2D pairs are represented as $\overline{\mathcal{K}}\_S = \{1, 2, \dots, \overline{K}\}$, $\overline{\mathcal{K}}\_D = \{\overline{K}+1, \overline{K}+2, \dots, 2\overline{K}\}$, $\tilde{\mathcal{K}}\_S = \{1, 2, \dots, \tilde{K}\}$ and $\tilde{\mathcal{K}}\_D = \{\tilde{K}+1, \tilde{K}+2, \dots, 2\tilde{K}\}$, respectively, where the $k$th device's location is $(x\_k, y\_k)$, $\forall k \in \{\overline{\mathcal{K}}\_S \cup \overline{\mathcal{K}}\_D \cup \tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\}$. The UAVs' flight period is discretized into $T$ equally spaced time slots of duration $\delta$ each, and the $m$th UAV's location $U\_m(t) = (x\_m(t), y\_m(t), H\_u)$, $\forall m \in \mathcal{M}$, $t \in \mathcal{T} = \{1, 2, \dots, T\}$, is almost unchanged within each slot. Here, we assume that one source device can only associate with a single UAV in a time slot, but multiple devices can access a single UAV simultaneously. To avoid mutual interference from nearby devices, UAVs select orthogonal channels, and data transmission follows the amplify-and-forward (AF) relaying protocol [34]. The association indicator of device $\tilde{k} \in \{\tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\}$ with UAV $m$ at time slot $t$ is defined as

$$\tilde{I}\_{\tilde{k},m}(t) = \begin{cases} 1, & \text{if device } \tilde{k} \text{ associates with UAV } m \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

Similarly, when UAV $m$ selects an orthogonal channel $j$ at the $t$th time slot, the corresponding channel selection indicator is defined as

$$\tilde{I}\_{m,j}(t) = \begin{cases} 1, & \text{if UAV } m \text{ selects channel } j \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

The path loss between device $\tilde{k}$ and UAV $m$ can be expressed as [35]

$$L\_{\tilde{k},m}(t) = \frac{\mu\_{\text{LoS}} - \mu\_{\text{NLoS}}}{1 + b\_1 \exp\left[ -b\_2 \left( \frac{180}{\pi} \phi\_{\tilde{k},m}(t) - b\_1 \right) \right]} + 20 \log\_{10} \left( \frac{4 \pi f\_c D\_{\tilde{k},m}(t)}{c} \right) + \mu\_{\text{NLoS}} \tag{23}$$


where $c$ is the speed of light, $f\_c$ is the carrier frequency, $\mu\_{\text{LoS}}$ and $\mu\_{\text{NLoS}}$ are attenuation factors corresponding to the LoS and NLoS paths, respectively, and $b\_1$ and $b\_2$ are constants. $\phi\_{\tilde{k},m}(t) = \sin^{-1}\left(H\_u / D\_{\tilde{k},m}(t)\right)$ is the elevation angle between device $\tilde{k}$ and UAV $m$, where the instantaneous distance between them is calculated as $D\_{\tilde{k},m}(t) = \sqrt{\left(x\_m(t) - x\_{\tilde{k}}\right)^2 + \left(y\_m(t) - y\_{\tilde{k}}\right)^2 + H\_u^2}$. The instantaneous channel gain between the $\tilde{k}$th device and relay UAV $m$ can be expressed as

$$G\_{\tilde{k},m}(t) = 10^{-L\_{\tilde{k},m}(t)/10} \tag{24}$$
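The path loss and channel gain of Eqs. (23) and (24) are straightforward to evaluate numerically. The sketch below is a minimal illustration: the function names are hypothetical, and the parameter values used in the example (carrier frequency, attenuation factors, and the two environment constants) are assumed, illustrative numbers rather than values taken from this chapter.

```python
import math

C = 3.0e8  # speed of light in m/s


def path_loss_db(dev_xy, uav_xy, h_u, f_c, mu_los, mu_nlos, b1, b2):
    """Eq. (23): elevation-dependent air-to-ground path loss in dB (h_u > 0)."""
    dx, dy = uav_xy[0] - dev_xy[0], uav_xy[1] - dev_xy[1]
    dist = math.sqrt(dx * dx + dy * dy + h_u * h_u)
    elev_deg = math.degrees(math.asin(h_u / dist))
    # Weight of the LoS attenuation as a function of the elevation angle.
    los_weight = 1.0 / (1.0 + b1 * math.exp(-b2 * (elev_deg - b1)))
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * dist / C)
    return (mu_los - mu_nlos) * los_weight + fspl + mu_nlos


def channel_gain(loss_db):
    """Eq. (24): linear channel gain from the path loss in dB."""
    return 10.0 ** (-loss_db / 10.0)


# Illustrative (assumed) parameters: 2 GHz carrier, UAV at 100 m altitude.
params = dict(h_u=100.0, f_c=2.0e9, mu_los=1.0, mu_nlos=20.0, b1=9.61, b2=0.16)
loss_near = path_loss_db((0.0, 0.0), (0.0, 0.0), **params)
loss_far = path_loss_db((0.0, 0.0), (500.0, 0.0), **params)
```

With these assumed values, a device directly below the UAV sees a lower path loss, and hence a higher gain, than a device 500 m away horizontally.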

#### *4.1.2 Transmission model*

The received SNR at UAV $m$ from the source device $\tilde{k}$ over channel $j$ can be expressed as [34]

$$
\Gamma^{j}\_{\tilde{k},m}(t) = \frac{P\_{\tilde{k}}^{Tx} G\_{\tilde{k},m}(t) \tilde{I}\_{\tilde{k},m}(t) \tilde{I}\_{m,j}(t)}{N\_0} \tag{25}
$$

where $P\_{\tilde{k}}^{Tx}$ is the transmit power of device $\tilde{k}$ and $N\_0$ is the noise power. The expected SNR received by the destination device $\tilde{k} + \tilde{K} \in \tilde{\mathcal{K}}\_D$ from UAV $m$ over channel $j$ can be expressed as

$$
\hat{\Gamma}^{j}\_{m,\tilde{k}+\tilde{K}}(t) = \frac{P\_{m}^{\text{Tx}} G\_{m,\tilde{k}+\tilde{K}}(t) \tilde{I}\_{\tilde{k}+\tilde{K},m}(t) \tilde{I}\_{m,j}(t)}{N\_0} \tag{26}
$$

where $P\_m^{Tx}$ is the transmit power of UAV $m$. The overall SNR at the destination device of the UAV-assisted D2D pair following the AF relaying protocol can be expressed as [36]

$$\hat{\Gamma}\_{\tilde{k},\tilde{k}+\tilde{K}}^{j}(t) = \left[ \prod\_{i=1}^{N} \left( 1 + \frac{1}{\Gamma\_i^{j}(t)} \right) - 1 \right]^{-1} \tag{27}$$

where $\Gamma\_i^{j}(t)$ is the instantaneous SNR of the $i$th hop over the $j$th channel, and $N$ is the total number of hops in the link. For a direct D2D pair, we consider a conventional channel model where the instantaneous channel gain between the source device $\overline{k}$ and destination device $\overline{k} + \overline{K}$ can be expressed as

$$G\_{\overline{k}, \overline{k} + \overline{K}}(t) = \beta\_0 D\_{\overline{k}, \overline{k} + \overline{K}}^{-\varrho}(t) \tag{28}$$

where $\beta\_0 = \left(4\pi f\_c / c\right)^2$ is the free space path loss at a distance of 1 m, and $\varrho$ is the path loss exponent. The expected instantaneous SNR received by the destination device $\overline{k} + \overline{K}$ from the source device $\overline{k}$ over channel $j$ can be expressed as

$$
\hat{\Gamma}^{j}\_{\overline{k},\overline{k}+\overline{K}}(t) = \frac{P\_{\overline{k}}^{Tx} G\_{\overline{k},\overline{k}+\overline{K}}(t)}{N\_0} \tag{29}
$$

The instantaneous transmission rate achieved by the destination device $\overline{k} + \overline{K}$ can be expressed as

$$\overline{R}\_{\overline{k},\overline{k}+\overline{K}}^{j}(t) = B \log\_2 \left[ \mathbf{1} + \hat{\Gamma}\_{\overline{k},\overline{k}+\overline{K}}^{j}(t) \right] \tag{30}$$

The total instantaneous transmission rate achieved by all direct D2D pairs can be calculated as

$$\overline{R}\_{\text{Sum}}(t) = \sum\_{j=1}^{J} \sum\_{\overline{k}=1}^{\overline{K}} \overline{R}\_{\overline{k}, \overline{k} + \overline{K}}^{j}(t) \tag{31}$$

Similarly, the $(\tilde{k} + \tilde{K})$th device obtains the instantaneous transmission rate over channel $j$ as

$$\tilde{R}\_{\tilde{k},\tilde{k}+\tilde{K}}^{j}(t) = B\log\_{2}\left[1 + \hat{\Gamma}\_{\tilde{k},\tilde{k}+\tilde{K}}^{j}(t)\right] \tag{32}$$

The total instantaneous transmission rate of all UAV-assisted D2D pairs can be expressed as

$$\tilde{R}\_{Sum}(t) = \sum\_{j=1}^{J} \sum\_{m=1}^{M} \sum\_{\tilde{k}=1}^{\tilde{K}} \tilde{R}\_{\tilde{k}, \tilde{k} + \tilde{K}}^{j}(t) \tag{33}$$

The overall instantaneous transmission rate of the USSD2D network is formulated as

$$R\_{Sum}(t) = \overline{R}\_{Sum}(t) + \tilde{R}\_{Sum}(t) \tag{34}$$
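The chain from per-hop SNRs to the network sum rate in Eqs. (27) and (30) to (34) can be sketched in a few lines. The function names below are hypothetical, and the inputs are assumed to be already-computed linear SNR values for one time slot, with channel and association indicators folded into them.

```python
import math


def af_end_to_end_snr(hop_snrs):
    """Eq. (27): end-to-end SNR of an amplify-and-forward multi-hop link."""
    prod = 1.0
    for snr in hop_snrs:
        if snr <= 0.0:
            return 0.0  # a hop with zero SNR (e.g., indicator = 0) breaks the link
        prod *= 1.0 + 1.0 / snr
    return 1.0 / (prod - 1.0)


def rate(bandwidth_hz, snr):
    """Eqs. (30) and (32): achievable rate for a given linear SNR."""
    return bandwidth_hz * math.log2(1.0 + snr)


def total_rate(bandwidth_hz, direct_snrs, relay_hop_snrs):
    """Eq. (34): sum of direct-pair rates and UAV-assisted pair rates."""
    direct = sum(rate(bandwidth_hz, s) for s in direct_snrs)
    relayed = sum(rate(bandwidth_hz, af_end_to_end_snr(h)) for h in relay_hop_snrs)
    return direct + relayed
```

For example, `total_rate(1.0, [3.0], [[1.0, 1.0]])` adds the rate of one direct pair with SNR 3 to the rate of one two-hop relayed pair whose end-to-end SNR is 1/3.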

#### *4.1.3 Problem formulation*

From the practical scenario, it is observed that when UAVs fly toward a group of devices to obtain better channel conditions, the remaining devices of the network cannot receive adequate services from the UAV, and consequently, UAVs cannot allocate network resources fairly. Hence, we jointly optimize UAVs' location, device association, and channel selection indicators at every time slot to maximize the total instantaneous transmission rate of the USSD2D network while assuring that each device should achieve a minimum SNR of *ς* to maintain the required QoS. The corresponding optimization problem is formulated as

$$\mathbf{P1}: \underset{\substack{(x\_m(t), y\_m(t)),\, \overline{I}\_{k,m}(t),\, \tilde{I}\_{m,j}(t) \\ \forall k \in \{\overline{\mathcal{K}}\_S \cup \overline{\mathcal{K}}\_D \cup \tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\},\, m \in \mathcal{M},\, j \in \mathcal{J}}}{\text{Maximize}} \; R\_{Sum}(t) \tag{35}$$

Subject to the constraints


$$\mathbf{C1}: \Gamma\_{k,k+K}^{j}(t) > \varsigma, \forall k \in \{\overline{\mathcal{K}}\_S \cup \overline{\mathcal{K}}\_D \cup \tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\} \tag{36}$$

$$\mathbf{C2}: \overline{I}\_{k,m}(t), \overline{I}\_{k+K,m}(t) \in \{0, 1\}, \tilde{I}\_{m,j}(t) \in \{0, 1\}, \forall k \in \{\overline{\mathcal{K}}\_S \cup \overline{\mathcal{K}}\_D \cup \tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\}, m \in \mathcal{M}, j \in \mathcal{J} \tag{37}$$

$$\mathbf{C3}: \sum\_{m \in \mathcal{M}} \overline{I}\_{k,m}(t) \le 1, \sum\_{m \in \mathcal{M}} \overline{I}\_{k+K,m}(t) \le 1, \forall k \in \{\overline{\mathcal{K}}\_S \cup \overline{\mathcal{K}}\_D \cup \tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\} \tag{38}$$

$$\mathbf{C4}: \sum\_{j \in \mathcal{J}} \tilde{I}\_{m,j}(t) \le 1, \forall m \in \mathcal{M} \tag{39}$$

C1 indicates that a device should achieve a minimum SNR threshold to maintain the required QoS. C2 defines the instantaneous device association indicator and the UAVs' channel selection indicator as binary variables. C3 assures that each device can be associated with at most a single UAV in a time slot, and C4 implies that each UAV selects at most one channel in each time slot. The optimization variables $(x\_m(t), y\_m(t))$, $\overline{I}\_{k,m}(t)$ and $\tilde{I}\_{m,j}(t)$ are coupled and interdependent: changing one variable affects the optimization of the other variables and the objective value. Hence, this optimization problem is difficult to solve using standard optimization tools. To tackle this situation, we adopt an RL-based UAV deployment strategy to find the UAVs' optimal positions by estimating the required system parameters using real-time measurements and statistics of collected information.
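A candidate solution for one time slot can be screened against constraints C1 to C4 before its rate is evaluated. The sketch below is a hypothetical helper: it assumes the end-to-end SNR of each pair has already been computed, and it represents the association and channel selection indicators as nested lists of 0/1 values.

```python
def is_feasible(snrs, assoc, chan, snr_min):
    """Check constraints C1-C4 for one time slot.

    snrs:  end-to-end SNR of each device pair (C1)
    assoc: assoc[k][m] = 1 if device k is associated with UAV m
    chan:  chan[m][j]  = 1 if UAV m selects channel j
    """
    c1 = all(s > snr_min for s in snrs)                          # minimum SNR
    c2 = all(v in (0, 1) for row in assoc + chan for v in row)   # binary indicators
    c3 = all(sum(row) <= 1 for row in assoc)                     # one UAV per device
    c4 = all(sum(row) <= 1 for row in chan)                      # one channel per UAV
    return c1 and c2 and c3 and c4


# Two devices, two UAVs, three channels (illustrative values).
ok = is_feasible([5.0, 6.0], [[1, 0], [0, 1]], [[1, 0, 0], [0, 1, 0]], snr_min=3.0)
```

A feasibility filter of this kind only rejects invalid candidates; it does not by itself address the coupling between the variables described above.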

#### **4.2 RL-based solution methodology**

UAVs acting as RL agents select their actions depending on their current positions, which are only related to their previous states. Hence, the proposed framework follows the Markov property and is composed of state, action, reward, state transition probability, and the flying time periods. The following subsections explain each of these elements in detail.

#### *4.2.1 State space*

The state of the $m$th UAV at the $t$th time slot is a vector of two elements that represents its current position, $s\_m(t) = (x\_m(t), y\_m(t))$, $\forall s\_m(t) \in \mathcal{S}$. Here, $\mathcal{S}$ is the state space, whose elements are independent and identically distributed random variables arranged by combining all possible values across the time horizon.

#### *4.2.2 Action space*

UAV's action *am*ð Þ*t* ∈ A in the current state is the change of its position, which is measured with respect to its immediate X and Y coordinates. Here, we consider a benchmark RL gridworld environment where UAVs have maximum of eight possible moving directions at each state, i.e., NORTH, NORTH-WEST, WEST, SOUTH-WEST, SOUTH, SOUTH-EAST, EAST, and NORTH-EAST. After selecting an action, the X and Y coordinate changes of UAV *m* at *t*-th time slot are represented as *δm <sup>x</sup>* ð Þ*<sup>t</sup>* <sup>∈</sup>f g �*ϑ*ð Þ*<sup>t</sup> <sup>δ</sup>*, 0, *<sup>ϑ</sup>*ð Þ*<sup>t</sup> <sup>δ</sup>* and *<sup>δ</sup><sup>m</sup> <sup>y</sup>* ð Þ*t* ∈ f g �*ϑ*ð Þ*t δ*, 0, *ϑ*ð Þ*t δ* respectively, <sup>∀</sup>*am*ðÞ¼ *<sup>t</sup> <sup>δ</sup><sup>m</sup> <sup>x</sup>* ð Þ*<sup>t</sup>* , *<sup>δ</sup><sup>m</sup> <sup>y</sup>* ð Þ*t* n o<sup>∈</sup> <sup>A</sup>, *<sup>t</sup>*<sup>∈</sup> <sup>T</sup> , where *<sup>ϑ</sup>*ð Þ*<sup>t</sup>* is the velocity of UAVs at time slot *<sup>t</sup>* and A is the action set containing all possible actions. The obtained X and Y coordinate of UAV *m* for next time slot is measured as

$$x\_{m}(t+1) = x\_{m}(t) + \delta\_{x}^{m}(t) \tag{40}$$

$$y\_{m}(t+1) = y\_{m}(t) + \delta\_{y}^{m}(t) \tag{41}$$
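As a minimal sketch of the position update in (40) and (41), the Python snippet below maps each of the eight directions to a pair of signs and scales it by the step $\vartheta(t)\delta$. The names `DIRECTIONS` and `next_position`, the convention that NORTH points along the positive Y axis, and the numerical values in the usage line are assumptions made for illustration only; boundary handling is not modeled.

```python
from typing import Dict, Tuple

# Eight moving directions as (sign_x, sign_y) pairs.
# Assumption: NORTH is +Y and EAST is +X; the text does not fix this convention.
DIRECTIONS: Dict[str, Tuple[int, int]] = {
    "NORTH": (0, 1),
    "NORTH-WEST": (-1, 1),
    "WEST": (-1, 0),
    "SOUTH-WEST": (-1, -1),
    "SOUTH": (0, -1),
    "SOUTH-EAST": (1, -1),
    "EAST": (1, 0),
    "NORTH-EAST": (1, 1),
}


def next_position(x: float, y: float, direction: str,
                  velocity: float, delta: float) -> Tuple[float, float]:
    """Apply (40)-(41): add the coordinate changes to the current position."""
    sign_x, sign_y = DIRECTIONS[direction]
    step = velocity * delta          # magnitude of each coordinate change
    delta_x = sign_x * step          # one of {-step, 0, +step}
    delta_y = sign_y * step          # one of {-step, 0, +step}
    return x + delta_x, y + delta_y
```

For instance, `next_position(100.0, 200.0, "NORTH-EAST", velocity=10.0, delta=1.0)` returns `(110.0, 210.0)` under these assumptions.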

#### *4.2.3 Reward formulation*

RL agents choose their actions so as to maximize the long-term cumulative reward. Since our objective is to maximize the total instantaneous transmission rate of the USSD2D network, we need to find UAV locations that improve the immediate objective value. Hence, we model the instantaneous reward function contributed by UAV *m* as

$$\mathcal{R}(s\_{m}(t), a\_{m}(t)) = \sum\_{j=1}^{J} \sum\_{\tilde{k}=1}^{\tilde{K}} \tilde{R}\_{\tilde{k}, m, \tilde{k} + \tilde{K}}^{j}(t) + \sum\_{j=1}^{J} \sum\_{\overline{k}=1}^{\overline{K}} \overline{R}\_{\overline{k}, \overline{k} + \overline{K}}^{j}(t), \forall m \in \mathcal{M} \tag{42}$$

#### *4.2.4 State transition probability*

It is the probability that UAV $m$ changes its state from $s_m(t)$ to $s_m(t+1)$ after selecting an action $a_m(t)$, denoted as $P_{tr}\{s_m(t+1) \in \mathcal{S} \mid s_m(t), a_m(t)\}$. Let us consider the probability vectors of device association and UAVs' channel selection at time slot $t$ as $P^{DA}_{\tilde{k}}(t) = [\tilde{P}_{\tilde{k},1}(t), \tilde{P}_{\tilde{k},2}(t), \ldots, \tilde{P}_{\tilde{k},M}(t)]$, $\forall \tilde{k} \in \{\tilde{\mathcal{K}}_S \cup \tilde{\mathcal{K}}_D\}$, and $P^{CS}_{m}(t) = [\overline{P}_{m,1}(t), \overline{P}_{m,2}(t), \ldots, \overline{P}_{m,J}(t)]$, $\forall m \in \mathcal{M}$, respectively, where $\tilde{P}_{\tilde{k},m}(t)$ indicates the association probability of device $\tilde{k}$ with UAV $m$ at time slot $t$ and $\overline{P}_{m,j}(t)$ is the probability that UAV $m$ selects channel $j$ at time slot $t$. In each time slot, the source and destination devices are associated with a single UAV according to the probability vectors $P^{DA}_{\tilde{k}}(t)$, and each UAV selects a single orthogonal channel according to the probability vector $P^{CS}_{m}(t)$. The probabilities of device association and UAVs' channel selection are updated for the next time slot as follows:

$$
\tilde{P}\_{\tilde{k},m}(t+1) = \begin{cases}
\tilde{P}\_{\tilde{k},m}(t) + w\_{1}\tilde{r}\_{\tilde{k},m}(t) \left(1 - \tilde{P}\_{\tilde{k},m}(t)\right), m = U\_{\tilde{k}}^{\text{Max}}(t) \\
\tilde{P}\_{\tilde{k},m}(t) - w\_{1}\tilde{r}\_{\tilde{k},m}(t)\tilde{P}\_{\tilde{k},m}(t), m \neq U\_{\tilde{k}}^{\text{Max}}(t)
\end{cases}
\tag{43}
$$

$$\overline{P}\_{m,j}(t+1) = \begin{cases} \overline{P}\_{m,j}(t) + w\_2 \overline{r}\_{m,j}(t) \left(1 - \overline{P}\_{m,j}(t)\right), j = \mathbf{C}\_m^{\text{Max}}(t) \\\ \overline{P}\_{m,j}(t) - w\_2 \overline{r}\_{m,j}(t) \overline{P}\_{m,j}(t), j \neq \mathbf{C}\_m^{\text{Max}}(t) \end{cases} \tag{44}$$

where $w_1$ and $w_2$ are the learning step sizes. $U^{\text{Max}}_{\tilde{k}}(t)$ is the current best UAV for device $\tilde{k}$ for a fixed selected channel, and $C^{\text{Max}}_{m}(t)$ is the current best channel of UAV $m$ for its associated devices at that time slot, which can be expressed as

$$U^{\text{Max}}\_{\tilde{k}}(t) = \arg\max\_{m \in \mathcal{M}} \tilde{R}\_{\tilde{k}, m, \tilde{k} + \tilde{K}}(t), \forall \tilde{k} \in \{\tilde{\mathcal{K}}\_S \cup \tilde{\mathcal{K}}\_D\} \tag{45}$$

$$\mathbf{C}\_{m}^{\text{Max}}(t) = \arg\max\_{j \in \mathcal{J}} \tilde{R}\_{m,j}(t), \forall m \in \mathcal{M} \tag{46}$$


where $\tilde{r}_{\tilde{k},m}(t)$ and $\overline{r}_{m,j}(t)$ are the normalized rewards achieved by the source device $\tilde{k}$ and UAV $m$ at time slot $t$, respectively, which are defined as

$$\tilde{r}\_{\tilde{k},m}(t) = \frac{\tilde{R}\_{\tilde{k},m,\tilde{k}+\tilde{K}}(t)}{\max\_{m \in \mathcal{M}} \tilde{R}\_{\tilde{k},m,\tilde{k}+\tilde{K}}(t)} \tag{47}$$

$$\overline{r}\_{m,j}(t) = \frac{\tilde{R}\_{m,j}(t)}{\max\_{j \in \mathcal{J}} \tilde{R}\_{m,j}(t)} \tag{48}$$

From (43) and (44), it is observed that the update of the selection probability vectors depends on the instantaneous transmission rate, which does not need any prior information. Thus, device association and UAVs' channel selection at each time slot are entirely model-free.
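The update rules (43) to (48) share one structure, so a single routine can illustrate both. The Python sketch below takes a probability vector and the instantaneous rates observed for each candidate (UAVs for a device, or channels for a UAV), normalizes the rates by their maximum as in (47) and (48), and applies the two-branch update. The function name `update_probabilities`, the choice to leave the vector unchanged when every rate is zero, and the numbers in the usage note are assumptions made for illustration.

```python
from typing import List


def update_probabilities(probs: List[float], rates: List[float],
                         step: float) -> List[float]:
    """One update of a selection probability vector following (43)/(44).

    probs: current selection probabilities, one per candidate
    rates: instantaneous transmission rates observed for each candidate
    step:  learning step size (w_1 or w_2 in the text)
    """
    # Index of the current best candidate, as in (45)/(46).
    best = max(range(len(rates)), key=lambda i: rates[i])
    max_rate = rates[best]
    if max_rate <= 0.0:
        # Assumption: with no positive rate there is nothing to normalize by,
        # so the probabilities are kept as they are.
        return list(probs)

    updated = []
    for i, (p, rate) in enumerate(zip(probs, rates)):
        r = rate / max_rate                      # normalized reward, (47)/(48)
        if i == best:
            updated.append(p + step * r * (1.0 - p))   # first branch
        else:
            updated.append(p - step * r * p)           # second branch
    return updated
```

With `probs = [0.5, 0.5]`, `rates = [2.0, 1.0]`, and `step = 0.1`, the first entry moves up to about 0.55 and the second moves down to about 0.475. Applied literally, the rule does not guarantee that the entries still sum to one, so an implementation may choose to renormalize; the sketch leaves the values exactly as the equations give them.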

#### *4.2.5 Updating the action value function*

During the operation period, each UAV acts as an RL agent, where UAV $m$ takes an action $a_m(t)$ at the current state $s_m(t)$. It then receives an immediate reward $\mathcal{R}(s_m(t), a_m(t))$ and computes the corresponding $Q(s_m(t), a_m(t))$ value. Finally, the current state $s_m(t)$ is updated to the next state $s_m(t+1)$, and UAV $m$ selects the next action $a_m(t+1)$ using the same policy, where the action-value function is updated as [37]

$$Q(s\_{m}(t), a\_{m}(t)) \leftarrow (1 - \alpha)Q(s\_{m}(t), a\_{m}(t)) + \alpha[\mathcal{R}(s\_{m}(t), a\_{m}(t)) + \gamma Q(s\_{m}(t+1), a\_{m}(t+1))] \tag{49}$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.

UAVs consider all possible actions in the action space and, with a certain probability, select the action that provides the maximum long-term reward. An *ϵ*-greedy action selection policy is adopted, under which the action $a_m(t) \in \mathcal{A}$ taken by UAV $m$ in state $s_m(t) \in \mathcal{S}$ at time slot $t$ can be expressed as [37]

$$\pi\_m^c = \begin{cases} \arg\max\_{a\_m(t) = \left\{\delta\_x^m(t), \delta\_y^m(t)\right\}} Q(s\_m(t), a\_m(t)), \text{with probability } 1 - \varepsilon\\ \text{Random selection, with probability } \varepsilon \end{cases} \tag{50}$$

UAVs execute state-action pairs repeatedly to gain experience from interacting with the environment. The results of these interactions are recorded in the *Q*-table, and the learning policy is updated in each episode until convergence. Algorithm 1 summarizes the optimal deployment strategy using the adaptive State-Action-Reward-State-Action (SARSA) technique.
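The two ingredients of this sub-section, the ϵ-greedy choice in (50) and the SARSA update in (49), can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the Q-table is assumed to be a plain dictionary keyed by (state, action) with missing entries treated as zero, and the names `epsilon_greedy`, `sarsa_update`, `alpha` (learning rate), and `gamma` (discount factor) are chosen here for readability.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def epsilon_greedy(q: QTable, state: Hashable, actions: Sequence[Hashable],
                   epsilon: float, rng: random.Random) -> Hashable:
    """Pick an action as in (50): explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))                  # random selection
    # Greedy choice; unseen (state, action) pairs count as 0 (an assumption).
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def sarsa_update(q: QTable, state: Hashable, action: Hashable, reward: float,
                 next_state: Hashable, next_action: Hashable,
                 alpha: float, gamma: float) -> float:
    """Apply the on-policy update (49) in place and return the new Q-value."""
    current = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = (1.0 - alpha) * current + alpha * target
    return q[(state, action)]
```

Because the next action is drawn from the same ϵ-greedy policy before the update is applied, the update is on-policy, which is what distinguishes SARSA from Q-learning.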

#### **4.3 Simulation results**

In this sub-section, we validate the proposed analysis and provide various numerical insights on key system parameters to improve the system's performance. Later, we compare the obtained results corresponding to the proposed SARSA algorithm with the existing works [34], such as random selection with fixed optimal relay deployment (RS-FORD), an exhaustive search for relay assignment and channel allocation with fixed initial relay deployment (ES-FIRD), and alternative optimization for the individual variable (AOIV). Here, we consider that direct D2D pair and UAV-assisted D2D pair devices are uniformly distributed in a 4 km × 4 km square area where the primary simulation parameters are adopted from [38].

#### **Figure 6.**

*The variation of the total transmission rate of the USSD2D network corresponding to each episode.*

The iterative evolutions of the proposed and benchmark schemes are depicted in **Figure 6**, where the numbers of UAVs, UAV-assisted D2D pairs, direct D2D pairs, and orthogonal channels are set as 5, 10, 2, and 7, respectively, and the transmit power is set as 10 mW. From this figure, it is clear that the proposed algorithm outperforms the benchmark schemes with respect to the converged value because its *ϵ*-greedy action policy covers a large search space by exploring the target region more efficiently. Furthermore, each UAV acting as an RL agent learns to improve the cumulative reward, i.e., the total instantaneous transmission rate, from its past learning experiences. According to this figure, the SARSA algorithm enhances the overall transmission rate by 75.37%, 49.74%, and 11.01% compared with the RS-FORD, ES-FIRD, and AOIV schemes, respectively.



11: Calculate $\tilde{R}^{j}_{\tilde{k},\tilde{k}+\tilde{K}}(t)$ using (32) for a fixed assigned channel

12: **else**

13: $\tilde{R}^{j}_{\tilde{k},\tilde{k}+\tilde{K}}(t) = 0$

14: **for** $\tilde{k} = 1, 2, \ldots, \tilde{K}$ **do**

15: **for** $m = 1, 2, \ldots, M$ **do**

16: Set $I_{\tilde{k},m}(t) = 1$ when $m = \arg\max_{m \in \mathcal{M}} \tilde{P}_{\tilde{k},m}(t)$, otherwise $I_{\tilde{k},m}(t) = 0$

17: According to (43), update the association probability as $\tilde{P}_{\tilde{k},m}(t) \leftarrow \tilde{P}_{\tilde{k},m}(t+1)$

18: **for** $m = 1, 2, \ldots, M$ **do**

19: **for** $j = 1, 2, \ldots, J$ **do**

20: UAV $m$ obtains the $j$th channel selection probability as $\overline{P}_{m,j}(t)$

21: Calculate $\hat{\Gamma}^{j}_{m}(t)$ according to (25) for the fixed associated devices

22: **if** $\hat{\Gamma}^{j}_{m}(t) \geq \varsigma$ **then**

23: $\tilde{R}^{j}_{m}(t) = \sum_{\tilde{k}=1}^{\tilde{K}} B \log_2\left[1 + \hat{\Gamma}^{j}_{\tilde{k},m}(t)\right]$

24: **else**

25: $\tilde{R}^{j}_{m}(t) = 0$

26: **for** $m = 1, 2, \ldots, M$ **do**

27: **for** $j = 1, 2, \ldots, J$ **do**

28: Set $\tilde{I}_{m,j}(t) = 1$ when $j = \arg\max_{j \in \mathcal{J}} \overline{P}_{m,j}(t)$, otherwise $\tilde{I}_{m,j}(t) = 0$

29: According to (44), update the channel selection probability as $\overline{P}_{m,j}(t) \leftarrow \overline{P}_{m,j}(t+1)$

30: **for** $m = 1, 2, \ldots, M$ **do**

31: Choose the action values $a_m(t) = \{\delta_x^m(t), \delta_y^m(t)\}$ by (50)

32: Find the next state as $s_m(t+1) = [x_m(t+1), y_m(t+1), H_u]$ by (40) and (41)

33: Calculate the immediate reward $\mathcal{R}(s_m(t), a_m(t))$ of UAV $m$ by (42)

34: Choose the action $a_m(t+1) = \{\delta_x^m(t+1), \delta_y^m(t+1)\}$ by (50) and obtain the $Q(s_m(t+1), a_m(t+1))$ value

35: Update the $Q(s_m(t), a_m(t))$ value according to (49) and store it in the *Q*-table

36: Update the state and action for the next time slot as $s_m(t) \leftarrow s_m(t+1)$ and $a_m(t) \leftarrow a_m(t+1)$, respectively

37: Calculate the instantaneous reward generated by all UAVs as $\mathcal{R}(t) = \sum_{m=1}^{M} \mathcal{R}(s_m(t), a_m(t))$

**Figure 7a** shows the variation of the instantaneous transmission rate for different numbers of UAVs while the other network parameters are the same as in **Figure 6**. It can be observed that the performance metric value increases with the number of UAVs because all UAVs utilize the available channels efficiently at their deployed locations. But when the number of UAVs exceeds 7, the total instantaneous transmission rate does not increase significantly because all UAVs reuse the limited spectrum, which increases mutual interference among UAVs and source-destination device pairs.

**Figure 7b** plots the objective value for different numbers of available orthogonal channels. From this figure, we can say that the instantaneous transmission rate increases with the number of channels because all the communication nodes select individual channels according to the channel selection probability vectors. But when the number of channels exceeds 7, no further variation in the objective value is found because this is a sufficient resource to avoid mutual interference completely.

**Figure 7c** represents the network throughput variation for different numbers of UAV-assisted D2D pairs when their transmitting power is 10 mW. Since all the devices and UAVs share a fixed number of orthogonal channels, the network's performance is independent of the number of UAV-assisted D2D pairs, and the performance metric value remains almost constant as this key system parameter varies.

#### **Figure 7.**

*Overall network performance of the USSD2D network for different network parameter values. (a) Network throughput for different numbers of UAVs. (b) Network throughput for different numbers of channels. (c) Network throughput corresponding to different numbers of UAV-assisted D2D pairs. (d) Network throughput corresponding to different numbers of direct D2D pairs.*

The performance metric variations for different numbers of direct D2D pairs are illustrated in **Figure 7d** when their transmitting power is set as 10 mW. It is observed that the instantaneous transmission rate decreases with the number of direct D2D pairs because they utilize more orthogonal channels. As a result, mutual interference among UAV-assisted D2D pairs increases since they share limited network resources. Furthermore, our proposed scheme is capable of adaptive action selection and significantly outperforms the benchmark techniques. From **Figure 7**, we can say that the overall network throughput can be improved by 77.58%, 52.51%, and 12.14% compared to the RS-FORD, ES-FIRD, and AOIV schemes, respectively.
