 Review
 Open Access
Trajectory optimization for UAV-assisted relay over 5G networks based on reinforcement learning framework
EURASIP Journal on Wireless Communications and Networking, volume 2023, Article number: 55 (2023)
Abstract
With the integration of unmanned aerial vehicles (UAVs) into fifth-generation (5G) networks, UAVs are used in many applications since they enhance coverage and capacity. To increase wireless communication resources, it is crucial to study the trajectory of the UAV-assisted relay. In this paper, an energy-efficient UAV trajectory for uplink communication is studied, where a UAV serves as a mobile relay to maintain the communication between ground user equipment (UE) and a macro base station. This paper proposes a UAV Trajectory Optimization (UAVTO) scheme for load balancing based on Reinforcement Learning (RL). The proposed scheme utilizes load balancing to maximize energy efficiency for multiple UEs in order to increase network resource utilization. To deal with the non-convex optimization, the RL framework is used to optimize the UAV trajectory. Both model-based and model-free RL approaches are utilized to solve the optimization problem, considering line-of-sight and non-line-of-sight channel models. In addition, the network load distribution is calculated. The simulation results demonstrate the effectiveness of the proposed scheme under different path losses and different flight durations. The results show a significant improvement in performance compared to existing methods.
1 Introduction
Unmanned aerial vehicles (UAVs) are currently one of the most significant technological developments. Real-world applications of UAVs include entertainment, telecommunications, agriculture, transportation, and infrastructure. The utilization of UAVs necessitates the planning of appropriate vehicle trajectories. UAVs can then overcome the accessibility, speed, and dependability limitations of terrestrial systems [1]. In particular, UAVs are utilized to enhance the communication performance of 5G networks that rely on UAV-assisted communication. UAVs can operate autonomously or through remote pilot control without needing a pilot onboard [2]. As such, UAVs are utilized in a number of inventive ways to achieve the sustainable development goals (SDGs), including sustainable cities and communities, climate action, industry, innovation and infrastructure, and affordable and clean energy. Climate change has led to an increase in the severity of wildfires as well as prolonged heat waves during droughts. UAVs are particularly useful for forest fire prevention. Using Victoria fire frequency statistics over a ten-year period and particle swarm optimization [3], the optimal position of the UAV is determined for various fires. UAV-aided communication consists of two types of channels: the UAV-ground channel [4,5,6,7,8,9,10] and the UAV-UAV channel [11, 12]. UAV-assisted wireless communication operates in three modes: UAV-aided ubiquitous coverage, UAV-aided relaying, and UAV-aided information dissemination and data collection. For instance, in [5, 7, 8, 11], UAV-aided ubiquitous coverage deploys UAVs to provide wireless coverage within an area. Two scenarios in which this can be examined are rapid service recovery and base station (BS) offloading in extremely crowded areas. UAV-aided relaying has been considered in [4, 9] to maximize reliability for user equipment (UE) without direct communication.
These efforts aim to enhance the power profile and coverage range of UAVs, thereby reducing energy consumption for the network. Furthermore, UAV-assisted data collection provides an efficient way to collect data from network nodes [8, 10, 13]. Owing to their ability to fly, UAV-aided relays can provide more wireless communication resources, leading to improved coverage.
Recently, there have been several studies on UAV-assisted relay in 5G networks [14,15,16,17,18]. These studies have examined various aspects of UAV-assisted relay, including resource allocation, trajectory optimization, transmit power, load balancing (LB), and UAV channel modeling. Some studies have optimized UAV deployment and trajectory to improve communication performance [19,20,21,22,23,24,25,26,27]. However, most of these studies have focused on static UAVs, which are not suitable for providing reliable relaying communication. Mobile UAVs can be used as relays to maintain the communication between UE and destination. In addition, previous studies have considered either UAV deployment or traffic load balancing for UAVs, but not both. The LB of UAV wireless communication is a topic of interest in the existing literature, with two types of load to consider: the amount of resources associated with each UAV and the amount of resources associated with each UE.
For the UAV-assisted relay network, existing studies evaluated the LB, which utilizes the UAV to improve spectrum allocation [28], data rate [29], wireless latency of users [30], and QoS requirements [31]. Most of the above related works studied one aspect of LB, either allocating resources by UAV or by UE, but ignored the design of an optimal trajectory for the UAV. Load imbalance cannot guarantee efficient distribution of incoming network traffic. Thus, some problems need to be addressed to create an efficient communication environment. Line-of-sight (LOS) communication offers a promising basis for an energy-efficient trajectory. In addition, UAVs offer better LOS communication between UAV and UE, which reduces transmission energy consumption. However, previous works have not jointly considered trajectory design and traffic characteristics, especially as they relate to energy consumption.
This paper focuses on an energy-efficient UAV trajectory for uplink communication, where a UAV serves as a mobile relay to balance the load in 5G networks. In particular, the main contributions of this paper are summarized as follows:

1.
We propose a UAV Trajectory Optimization (UAVTO) scheme for load balancing based on Reinforcement Learning (RL) that maximizes energy efficiency (EE) for uplink communication. The problem considers the three-dimensional (3D) flight trajectory of the UAV to visualize the aircraft's performance and verify the safety and adaptability of the algorithm. The channel models for LOS and non-line-of-sight (NLOS) conditions are formulated to solve the optimization problem while multiple UEs are present on the ground.

2.
The proposed scheme utilizes LB to maximize EE for multiple UEs in order to increase network resource utilization. It is based on the amount of resources that the UAV can carry as its load. Specifically, the UE with the highest data rate transmits first so as to consume the minimum resources. The load takes values between 0 and 1. Therefore, if the load is smaller than 1, the UAV can accept more resources from the UEs. Otherwise, the network cannot allocate sufficient resources to the UEs.

3.
The optimization problem is non-convex, so the RL framework is utilized for UAV trajectory planning in order to overcome this difficulty. Two categories of RL, Monte Carlo (MC) and Dynamic Programming (DP), are used to improve resource utilization. The main objective of RL is to find the optimal solution, which emerges through the interaction between the UAV and its environment. In addition, the network load distribution is calculated.

4.
The simulation results demonstrate the performance of the proposed scheme under different path losses and different flight durations. The results show that the proposed scheme outperforms the existing methods under various parameter configurations.
The rest of the paper is organized as follows: Sect. 2 introduces the related work. Section 3 presents the RL framework. Section 4 presents the system model and channel model. Section 5 describes the problem formulation. Section 6 presents the proposed UAVTO scheme. The simulation results are provided in Sect. 7. Finally, Sect. 8 concludes the paper.
2 Related work
There has been significant work focused on UAV-aided relay communication for the UAV-ground channel [4,5,6,7,8,9,10] and the UAV-UAV channel [11, 12]. The authors highlighted the characteristics of mmWave propagation for 5G in [4]. The Friis transmission equation is used to determine the UE received power for the relay path in order to investigate the various mmWave propagation characteristics. Furthermore, this study models the propagation of mmWave channels in a ray-tracing simulator, which assesses the relative effectiveness of diffracted, reflected, and scattered paths compared to direct paths. When the height of the UAV increases, the power received by the UEs decreases. At a height of 30 m, the UAV provides sufficient coverage. At the same time, throughput maximization is achieved in [5], where a multi-UAV setting is considered and the UAV trajectory is designed to improve delay and packet loss. A graph neural network (GNN) is used to determine the natural ordering of the nodes in a graph capturing transmission and movement properties. In addition, a dynamically reconfigurable topology is described, where the information state of aerial nodes is generated and the node parameters are modified. The location of the UAV is adjusted to reduce packet loss and delay. The energy efficiency of the UAV is investigated in [6] based on the LOS and NLOS communication links in order to optimize the UAV's trajectory path, transmit power, and speed. It is assumed that the UAV's flying speed is fixed. A binary decision variable is assigned to plan the UAV-to-UE connectivity.
By using an estimate of the UAV throughput, the optimal UAV position is proposed in [7] for a multi-rate communication system between two ground nodes. Three multi-rate modulations in IEEE 802.11b are considered. The estimate of UAV throughput depends on the UAV location and link rates. The maximum throughput is achieved when the relay is closer to one of the ground nodes than to the center position. The uplink channel model is presented in [8], which considers the impact of 3D distance and multi-UAV reflection. Maximizing the EE of the UAV-BS uplink is examined by adjusting the UAV's uplink transmit power at different 3D distances.
In [9], an iterative approach is studied to optimize the UAV trajectory and edge-user scheduling in a hybrid cellular network. The objective is to maximize the sum rate of edge users while considering the rate requirements of all UEs. The problem is formulated as a mixed-integer non-convex optimization.
In general, coverage is the most important aspect of UAV-assisted wireless communication. Specifically, the authors of [10] considered coverage and data rate constraints to identify the minimum number of UAVs and their optimal positions. A mathematical model determines the optimal position and height of the UAVs in 3D space. Both decode-and-forward (DF) and amplify-and-forward (AF) relaying are presented in [11] for a cooperative communication system with a single source and receiver. The authors optimize the UAV's power profile, power-splitting ratio profile, and trajectory to maximize the throughput.
For UAV-assisted relay networks, related works have studied the deployment path of the UAV [12,13,14,15,16,17,18]. The work of [12] examined the deployment path of the UAV and resource allocation in order to ensure user fairness. The aim was to maximize the minimum throughput over all UEs while taking into account various constraints such as backhaul bandwidth, backhaul information causality, UAV-BS mobility, total bandwidth, and maximum transmit power. As data collection continues to grow exponentially, [13] explored energy-efficient data gathering for UAV-assisted WSNs, optimizing UAV deployment and sensor node (SN) transmission. The SNs are allowed to choose among three transmission modes: waiting, standard sink-node transmission, and UAV uploading. In addition, the optimal 3D positioning of the UAVs and the power allocation are proposed in [14] to improve performance by maintaining secrecy in the presence of eavesdroppers. In [15], the UAV's transmit power is optimized, utilizing the uplink channel model from UAVs to ground BSs. A resource optimization problem for UAV-assisted 5G networks is investigated in [16]. Resource allocation and path-planning control play an important role in dynamic UAV-assisted 5G networks. Specifically, in [17], multiple UAV-BS communication is proposed, which utilizes UAVs as flying BSs to provide coverage enhancement and meet QoS requirements in 5G wireless networks. In [18], ray-tracing simulations are also used to enhance the coverage of a UAV operating as an aerial BS.
Generally, the trajectory of UAVs plays an active role in improving the performance of UAV-assisted wireless communication. It is worth noting that related works on UAV-aided relay systems mainly focus on trajectory optimization [19,20,21,22,23,24,25,26]. A survey of UAV characteristics and limitations in the integration of 5G communications into UAV-assisted networks is presented in [19]. An energy-efficient trajectory optimization for UAVs is examined in [20], where the trajectory strategy of the UAV is designed in conjunction with prohibitive power depletion and QoS requirements to optimize data transmission, energy consumption, and coverage fairness. For a multi-hop UAV relaying system, UAV trajectory and transmit power optimization are utilized in [21] to maximize the end-to-end throughput. The trajectory and transmit power of a UAV-BS are optimized for UAV-to-ground (U2G) and ground-to-UAV (G2U) communications [22]. The optimal trajectory of the UAV is presented in [23] to maximize the EE of UAV communication. Moreover, when considering a circular UAV trajectory, rate maximization and energy minimization are investigated in situations where the UAV trajectory is unconstrained.
Power allocation and trajectory optimization are studied in [24,25,26] for UAV-assisted relay systems. In particular, [24] investigated the optimal trajectory of the UAV to improve the throughput of the communication link between two UEs. The UAV is considered as a mobile relay to maximize the minimum transmission rate between transmitter and receiver. To address the non-convex problem, the main problem is converted into subproblems, which jointly optimize the power allocations and the UAV trajectory in an alternating fashion. It is worth noting that [25] also investigated an outage probability minimization problem via a long-term proactive optimization algorithm. In addition, a closed form of the outage probability is derived by optimizing the UAV's 3D trajectory. A two-hop UAV mobile relay is used in [26] to serve two UEs on the ground. A UAV-enabled AF relay network is designed to achieve the maximum end-to-end throughput. Several works [28,29,30,31] have discussed the optimal position of UAVs based on LB. Optimal user association and spectrum allocation schemes are investigated in [28] based on the branch-and-bound method to maximize the sum rate in two adjacent cells. In [29], the LB of UAVs and fairness among UEs are investigated, optimizing UAV deployment over multi-UAV networks via diffusion UAV deployment. LB is utilized to distribute the load across neighboring UAVs. In [30], the deployment of UAVs as flying BSs is analyzed to determine suitable locations that serve more UEs and improve wireless latency. The goal is to achieve a traffic load balance that improves the channel quality of UEs in a UAV-assisted fog network. For multi-UAV-aided mobile edge computing, [31] investigated the LB of UAVs to improve coverage and meet QoS requirements.
The RL algorithm has been used in many research studies to aid navigation through unknown environments. The main objective of RL is to find the optimal solution, which emerges through interaction between the UAV and its environment. Trajectory planning, navigation, and control of UAVs are discussed in [32] using RL. Meanwhile, [33] developed trajectory planning of a UAV in an uncertain environment using RL.
In recent years, many new approaches for autonomous navigation and path planning of UAVs have emerged [34,35,36]. Specifically, [34] used generative adversarial networks and window functions to enhance the spatial resolution of satellite images. A high-quality UAV image is produced in [35] using super-resolution techniques based on a Convolutional Neural Network (CNN), while [36] proposed an energy-efficient optimal path for a UAV using hybrid ant colony optimization and a variant of A*.
3 Reinforcement learning framework
RL involves an agent interacting with its environment in a cycle, aiming to learn rewards and optimize a policy [37]. The fundamental elements of RL include action, reward, value, policy, transition, and the environment model. Actions describe how agents move from one position to another, while rewards represent the numerical values of the immediate environment state. The policy defines the actions that an agent takes in any given state, and transitions define the probability distribution from the current state to the next. Value represents the expected value of a state, calculated from cumulative discounted rewards. The main objective of RL is to determine an optimal policy to maximize or minimize a certain objective function. The problem can be defined by a tuple \((S, A, P, R, T)\), where \(S\) is a finite set of states, \(A\) denotes a finite set of actions, and the UAV takes an action \(a \in A\) at state \(s \in S\). The probability transition function is denoted as \(P = \Pr\left( s_{t+1} = s' \mid s_t = s, a_t = a \right)\) from state \(s\) at time \(t\) to state \(s'\) at time \(t+1\) after executing action \(a\). \(R\) is a reward function and \(T\) is the set of decision epochs, which can be finite or infinite. The policy is represented as \(\pi\), a mapping from a state \(s\) to an action \(a\). RL describes how agents learn the optimal policy \(\pi\) that achieves the highest reward value [33]. RL algorithms are classified into two categories, model-based and model-free, as described in Fig. 1.
In the model-based RL method, the agent uses the transition probability from the model to determine the next reward and action. This method requires an explicit model of the environment and agent. DP is an example of a model-based method that requires fully observable environmental knowledge. DP identifies the optimal solution through value or policy iteration. The agents in model-free RL do not store any information about the environment. Instead, they update their knowledge to determine the quality of a proposed action. The agent's objective is to choose the optimal action, estimating action values based on experience rather than on a model. MC and Temporal Difference (TD) are examples of model-free RL algorithms [38]. MC is a model-free technique that learns directly from episodes of experience [33]. In each episode, the agent moves from its current state to its terminal state.
MC may be used with sample models without bootstrapping, whereas the TD approach learns from the current value function estimate using bootstrapping. Generally, TD is utilized to predict a quantity that depends on the signal's future values. Q-learning and SARSA are the two major TD-based algorithms [32].
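To make the model-based DP side concrete, the sketch below runs value iteration on a toy two-state MDP. This is a generic illustration, not the paper's implementation: the transition tensor `P`, reward matrix `R`, and discount factor are hypothetical values chosen only to show the Bellman backup that DP repeats until convergence.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Model-based DP: iterate V(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')].

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    Returns the converged value function and the greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)        # one-step lookahead, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Toy 2-state MDP: in state 0, action 1 moves to the rewarding state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # state 0: a0 stays, a1 goes to s1
              [[0.0, 1.0], [0.0, 1.0]]])  # state 1: both actions stay in s1
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
V, policy = value_iteration(P, R)
```

With a reward of 1 per step in state 1 and discount 0.9, the fixed point is \(V(1) = 10\) and \(V(0) = 9\), and the greedy policy in state 0 is the action that moves toward state 1.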
4 System model and channel model
4.1 System model
The following considers uplink communication in a geographical area. The network has one MBS, which is located at the center of the area, as depicted in Fig. 2. A UAV flies at a fixed altitude \(H\) and serves as a mobile relay for a group of ground UEs. In addition, UAVs can use millimeter waves (mmWave) to provide the backhaul link between the UAV and the MBS. The total duration to complete a relay communication is \(T\). At each time \(t\), \(0 \le t \le T\), the UAV is deployed as an aerial relay to assist communication between UE and MBS using AF communication. In this paper, the UAV, UE, and MBS are each equipped with one antenna. Thus, there is no interference between the UE-UAV and UAV-MBS links. The UEs are indexed by \(i \in \{1, 2, \ldots, I\}\), and the location of static UE \(i\) is \([W_i, 0]^T\), where \(W_i = [x_i, y_i]\) denotes its horizontal coordinate. UEs are distributed randomly in the network. The UAVs are indexed by \(j \in \{1, 2, \ldots, J\}\), and the coordinates of each UAV at time \(t\) are \([q_j(t), H]^T\), where the horizontal coordinate of the UAV is \(q_j(t) = [x_j(t), y_j(t)]^T\).
In UAVTO, the UAV flies at a fixed altitude \(H\) above the ground for a duration \(T\). The location \(q_j\) of the UAV is considered unchanged within each time slot. Over time, the UAVs' positions vary continuously, while the locations of the UEs are assumed to be fixed during each cycle of UAV positioning. The paper studies the channel model under both LOS and NLOS scenarios.
4.2 Channel model
4.2.1 The average path loss between UE and UAV
Mobility and the LOS channel are important aspects of 5G networks [19]. This paper aims to improve the LOS communication between UAVs and UEs to reduce transmission energy consumption. Hence, the probability of an LOS channel between UE \(i\) and UAV \(j\) at time \(t\) is given by [5]:
where \(\alpha\) and \(\beta\) are constant values depending on the environment, such as urban or suburban. Meanwhile, \(\Phi_{ij}\) is the elevation angle (in degrees) between UAV \(j\) and UE \(i\), which can be expressed as \(\Phi_{ij} = \tan^{-1}\left( \frac{H}{d_{ij}(t)} \right)\), where \(H\) is the UAV's altitude and \(d_{ij}(t) = \sqrt{\left\| q_j(t) - W_i \right\|^2 + H^2}\) is the distance between UAV \(j\) and UE \(i\) at time \(t\).
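The LOS-probability equation cited from [5] is not reproduced in this text. A widely used form in the air-to-ground channel literature is the sigmoid \(P^{LOS} = 1/\left(1 + \alpha \exp\left(-\beta(\Phi - \alpha)\right)\right)\); the sketch below assumes that form, with illustrative urban values for \(\alpha\) and \(\beta\). Note that it computes the elevation angle from the horizontal distance, as is common in that literature, which may differ from the exact expression used in the paper.

```python
import math

def los_probability(uav_xy, ue_xy, H, alpha=9.61, beta=0.16):
    """Sigmoid LOS-probability model common for air-to-ground channels.

    alpha, beta are environment constants (values here are illustrative,
    often quoted for urban settings); Phi is the elevation angle in degrees.
    """
    d_h = math.dist(uav_xy, ue_xy)            # horizontal UE-UAV distance
    phi = math.degrees(math.atan2(H, d_h))    # elevation angle seen from the UE
    return 1.0 / (1.0 + alpha * math.exp(-beta * (phi - alpha)))

# UAV 100 m away horizontally at 100 m altitude: 45-degree elevation angle.
p = los_probability(uav_xy=(100.0, 0.0), ue_xy=(0.0, 0.0), H=100.0)
```

As expected for this model, the LOS probability increases toward 1 as the UAV moves directly overhead (elevation angle approaching 90 degrees).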
So, the probability of NLOS between UAV and UE can be expressed as:
\(PL_{ij}^{LOS}(t)\) and \(PL_{ij}^{NLOS}(t)\) represent the path loss between UE \(i\) and UAV \(j\) for the LOS and NLOS channels, respectively [39].
where \(\zeta^{LOS}\) and \(\zeta^{NLOS}\) represent the excessive path loss coefficients for the LOS and NLOS channels, which depend on the urban environment. Moreover, \(f_c\) is the carrier frequency, \(c\) is the speed of light, and \(\tau\) is the path loss exponent. The average path loss model for LOS and NLOS is calculated by using Eqs. (4) and (5). Therefore, the average path loss between the UE and the UAV can be written as:
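As a hedged illustration of this probabilistic averaging, the sketch below combines a free-space-style term with excessive LOS/NLOS losses \(\zeta^{LOS}\) and \(\zeta^{NLOS}\), weighted by the LOS probability. The numeric values of \(\zeta\), \(\tau\), and the carrier frequency are placeholders, not the paper's parameters.

```python
import math

def avg_path_loss_db(d_3d, f_c, p_los, zeta_los=1.0, zeta_nlos=20.0, tau=2.0):
    """Average path loss (dB): LOS/NLOS losses weighted by the LOS probability.

    zeta_* are excessive path-loss terms in dB (illustrative urban values),
    tau is the path-loss exponent, f_c is in Hz, d_3d in metres.
    """
    c = 3e8
    fspl = 10 * tau * math.log10(4 * math.pi * f_c * d_3d / c)  # free-space term
    pl_los = fspl + zeta_los
    pl_nlos = fspl + zeta_nlos
    return p_los * pl_los + (1.0 - p_los) * pl_nlos

# 141.4 m 3D distance at 2 GHz with a 90% LOS probability.
pl = avg_path_loss_db(d_3d=141.4, f_c=2e9, p_los=0.9)
```

Lowering the LOS probability shifts the average toward the heavier NLOS loss, which is the qualitative behavior Eq. (6) captures.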
4.2.2 The average path loss between UAV and MBS
Assuming the altitude of the UAV is high and there is no obstacle between the UAV and the MBS, the backhaul link is assumed to be a LOS link. Therefore, the channel characteristics are dominated by the strong LOS connection. To maximize the EE of the UAV-assisted relay in a 5G network, an optimization problem is formulated to jointly determine the optimal trajectory of the UAV. Similarly, the path loss between the UAV and the MBS can be denoted as:
where \(\sqrt{\left\| q(t) \right\|^2 + H^2}\) represents the distance between the UAV and the MBS.
4.3 Data rate
4.3.1 Transmission from UE to UAV
The average channel gain from UE \(i\) to the UAV can be expressed as \(g_{ij}(t) = \frac{1}{PL_{ij}(t)}\), where \(g_{ij}(t)\) represents the channel gain based on the LOS and NLOS communication links. According to [12], the signal-to-noise ratio (SNR) of the access link can be calculated by:
where \(P_i\) represents the transmission power of UE \(i\) and \(\sigma^2\) denotes the noise power. Therefore, the data rate of the access link of UE \(i\) can be modeled as [38]:
where \(B\) is the bandwidth exclusively used by the UAV.
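The access-link rate just described follows the Shannon form \(B \log_2(1 + \mathrm{SNR})\). The sketch below assumes dB-scale inputs for transmit power, path loss, and noise power; all numeric values are illustrative rather than the paper's simulation parameters.

```python
import math

def access_rate_bps(P_i_dbm, pl_db, sigma2_dbm, B):
    """Data rate of the UE->UAV access link: B * log2(1 + SNR).

    P_i_dbm: UE transmit power (dBm), pl_db: average path loss (dB),
    sigma2_dbm: noise power (dBm), B: bandwidth (Hz).
    """
    snr_db = P_i_dbm - pl_db - sigma2_dbm    # link budget in dB
    snr = 10 ** (snr_db / 10.0)
    return B * math.log2(1.0 + snr)

# 23 dBm UE power, 90 dB path loss, -100 dBm noise, 1 MHz bandwidth.
r = access_rate_bps(P_i_dbm=23.0, pl_db=90.0, sigma2_dbm=-100.0, B=1e6)
```

A higher path loss (e.g. an NLOS-dominated link) directly reduces the achievable rate, which is why the trajectory design favors LOS geometry.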
4.3.2 Transmission from UAV to MBS
Here, \(g_{jd}(t)\) represents the channel gain from the UAV to the MBS and \(P_j\) is the transmission power of the UAV. Thus, the SNR of the backhaul link from UAV to MBS can be calculated as [5]:
In AF mode, the time domain for a UE is divided into two slots. In the first half of the time slot, the UE broadcasts its data to both the UAV and the MBS. Then, in the second half, the UAV amplifies the received data and forwards it to the MBS. As a result, the achievable data rate \(rate(t)\) of UE \(i\) toward the MBS via the UAV based on AF is given as [12]:
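The exact AF rate expression of [12] is not reproduced in this text. A standard two-hop AF end-to-end SNR is \(\gamma_1 \gamma_2 / (\gamma_1 + \gamma_2 + 1)\), with a factor of \(1/2\) for the two half-duplex slots. The sketch below uses that form and, for simplicity, ignores the direct UE-MBS branch that the paper's model also receives in the first slot.

```python
import math

def af_relay_rate(snr_access, snr_backhaul, B):
    """End-to-end rate of a two-hop half-duplex AF relay.

    Standard AF end-to-end SNR: g1*g2 / (g1 + g2 + 1); the factor 1/2
    accounts for the two time slots (UE->UAV, then UAV->MBS).
    """
    snr_e2e = (snr_access * snr_backhaul) / (snr_access + snr_backhaul + 1.0)
    return 0.5 * B * math.log2(1.0 + snr_e2e)

# Symmetric 20 dB hops over a 1 MHz band.
r = af_relay_rate(snr_access=100.0, snr_backhaul=100.0, B=1e6)
```

The end-to-end SNR is always below the weaker hop's SNR, so improving only the backhaul cannot raise the rate past what the access link supports.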
5 Problem formulation
This paper proposes a UAVTO scheme for load balancing based on RL in a UAV-assisted relay network. The proposed scheme formulates LB to maximize EE for multiple UEs and increase network resource utilization. The binary variable \(x_{ij}\) indicates whether UE \(i\) is assigned to UAV \(j\) or not. \(y_j = \sum_{i \in I} x_{ij}\) represents the number of UEs associated with UAV \(j\).
5.1 Load definition
There are two types of load: the amount of resources associated with each UAV and the amount of resources associated with each UE [40]. The proposed scheme treats the amount of resources of each UE as the load. The total available resources from the UEs to the UAV are represented as \(N_i\). When UE \(i\) communicates with UAV \(j\), a load is incurred by transmitting the data of UE \(i\) to UAV \(j\). The load of the UAV can thus be defined as the ratio of the amount of resources the UAV allocates to UE \(i\) to the total available resources of the UEs [41]:
where \(\rho_i(t)\) is the number of resources the UAV requires for UE \(i\). It can be denoted as follows:
where the required data rate and the bandwidth of a resource block are denoted as \(R_i\) and \(BW\), respectively. The bandwidth of a resource block is 180 kHz [42]. Specifically, the UE with the highest data rate transmits first so as to consume the minimum resources. It is important to determine the amount of resources allocated to the UAV according to the LB. Note that the load takes values in [0, 1]. Therefore, if the load is smaller than 1, the UAV can accept more resources from the UEs. Otherwise, the network is overloaded because the load of the UAV exceeds 1, and it cannot allocate sufficient resources to the UEs. Consequently, the load of the UAV cannot exceed 1 for all UEs in the network.
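One possible reading of this load definition, sketched below under stated assumptions: UE \(i\) needs \(\rho_i = R_i / \left(BW \log_2(1 + \mathrm{SNR}_i)\right)\) resource blocks (rounded up to whole blocks), and the UAV load is the sum of these demands over the total available resources. The rate, SNR, and resource-count values are hypothetical.

```python
import math

def uav_load(required_rates, snrs, N_total, BW=180e3):
    """Load of a UAV: resource blocks demanded by its UEs over total resources.

    rho_i = ceil(R_i / (BW * log2(1 + SNR_i))) resource blocks for UE i,
    with BW the 180 kHz resource-block bandwidth. Load <= 1 means feasible.
    """
    rho = [math.ceil(R / (BW * math.log2(1.0 + s)))
           for R, s in zip(required_rates, snrs)]
    return sum(rho) / N_total

# Two UEs demanding 1 and 2 Mbps at 20 and 30 dB SNR, 50 blocks available.
load = uav_load(required_rates=[1e6, 2e6], snrs=[100.0, 1000.0], N_total=50)
```

A UE with a higher SNR needs fewer blocks for the same rate, which matches the text's rationale for serving the highest-rate UE first.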
5.2 Energy efficiency
Many factors contribute to the energy consumption of a UAV, such as the communication energy for data transmission, the flying energy that keeps the UAV mobile, and the energy spent on vertical climbs [23]. The flying energy relates to the speed and acceleration of the UAV, while the communication energy depends on data transmission and reception. To avoid complexity, the UAV's takeoff and landing are not considered; hence the energy required for vertical climb is ignored. The energy required for data transmission is given as:
where \(l_i(t)E_i(t)\) is the energy consumed by UE \(i\) when transmitting data to the MBS through the UAV, \(l_d(t)E_i(t)\) is the energy consumed by the UE when transmitting data to the MBS in direct mode, and \(E_j(t)\) is the energy consumed by UAV \(j\) when transmitting to the MBS.
Based on the approach in [23], the energy required for flying is expressed as:
where \(c_1\) and \(c_2\) are fixed parameters related to the aircraft's weight, wing area, and air density. Furthermore, \(v(t)\) is the velocity of the UAV, \(a(t)\) is its acceleration, and \(g\) is the gravitational acceleration. The total energy of the UAV, denoted \(E_T\), is the combination of the communication energy \(E_c\) and the flying energy \(E_f\). It can be written as:
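The flying-energy model adopted from [23] is commonly written with per-slot power \(c_1 \|v\|^3 + \frac{c_2}{\|v\|}\left(1 + \frac{\|a\|^2}{g^2}\right)\) for a fixed-wing UAV. The sketch below integrates that form over discrete slots under that assumption; the \(c_1, c_2\) values are figures often quoted in the literature, not necessarily this paper's.

```python
import numpy as np

def flying_energy(v, a, dt, c1=9.26e-4, c2=2250.0, g=9.8):
    """Fixed-wing flying energy over a discretized trajectory.

    v, a: arrays of shape (M, 2) with per-slot velocity and acceleration;
    per-slot power is c1*||v||^3 + (c2/||v||) * (1 + ||a||^2 / g^2).
    c1, c2 depend on aircraft weight, wing area, and air density.
    """
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    speed = np.linalg.norm(v, axis=1)
    acc = np.linalg.norm(a, axis=1)
    power = c1 * speed**3 + (c2 / speed) * (1.0 + acc**2 / g**2)
    return float(np.sum(power) * dt)

# Two 1-second slots of straight, level flight at 20 m/s.
E = flying_energy(v=[[20.0, 0.0], [20.0, 0.0]], a=[[0.0, 0.0], [0.0, 0.0]], dt=1.0)
```

Note the trade-off this model encodes: the \(c_1\) term grows with speed while the \(c_2\) term shrinks, so there is an energy-optimal cruising speed rather than "slower is always cheaper."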
Thus, the EE of the UAV is defined as the ratio between the data rate and the energy consumption of the UAV. Therefore, the EE can be denoted as:
5.3 Optimization objective
Our objective is to balance the load of the UAV, utilized as a flying relay, by optimizing its trajectory. Since the UE with the highest data rate transmits first, the load of the UAV is determined by the number of UEs associated with it. The UAV has a flight duration \(T\), which is divided into \(M\) time slots of length \(T/M\). These time slots are used for designing the UAV's trajectory. Therefore, the optimization problem can be written as:
which is subject to:
The constraints in the optimization problem fall into four types: user QoS constraints, UAV mechanical constraints, traffic load constraints, and UAV trajectory constraints. Constraint (20a) indicates the QoS constraints, where \(R_{min}\) and \(R_i\) denote the minimum data rate required for UE \(i\) and the overall data rate required for all UEs, respectively. Constraint (20b) represents the traffic load constraint. Constraints (20c) and (20d) bound the total number of UEs served by the UAV. Constraints (20e) and (20f) are the UAV's velocity and acceleration constraints, where \(v_{max}\) and \(a_{max}\) denote the maximum velocity and maximum acceleration, respectively. Constraint (20g) enforces the UAV trajectory constraint. Constraint (20h) defines the initial location \(q_j[0]\) of the UAV and its final location \(q_j[T]\) at the end of period \(T\).
It is noted that the optimization problem P1 is a mixed-integer non-convex problem. Furthermore, the various flight constraints and the highly dynamic network topology increase the complexity of solving it. Meanwhile, the Markov decision process (MDP) is a mathematical framework for describing the environment in RL problems [43]. Two categories of RL, namely MC and DP, are utilized in this paper in order to improve resource utilization.
Q-learning is a model-free, off-policy approach in which the agent learns the optimal action in an unknown environment. The next action is selected based on the maximum Q-value of the next state, which is a greedy policy. At each time slot \(m\), given state \(s(m)\), the agent chooses action \(a(m)\) according to its policy \(\pi\). After the action is performed, the agent receives an immediate reward \(r\) and transitions to a new state \(s(m+1)\). The cumulative reward from the current state up to the terminal state at time \(M\) can be calculated by [7]:
where \(\vartheta \in [0, 1]\) is a discount factor balancing immediate and future rewards. The value of state \(s\) under policy \(\pi\) is given by [23]:
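The cumulative discounted reward defined above, \(\sum_k \vartheta^k r_k\), can be computed with a simple backward recursion; the reward sequence below is a placeholder:

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward G = sum_k gamma^k * r_k, computed backward:
    G <- r + gamma * G, starting from the terminal slot."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Three unit rewards with discount 0.9: G = 1 + 0.9 + 0.81 = 2.71.
G = discounted_return([1.0, 1.0, 1.0])
```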
The Q-function (state-action value function) is the expected return when performing action \(a\) in state \(s\), denoted as [38]:
According to the Bellman equation in [23], the value function can be written as follows:
The expected value of state \(s\) is defined as the current rewards plus the values of the next states. The optimal Q-function \(Q^{\pi^*}(s, a)\) for each state-action pair gives the highest expected return obtainable from that state when the action is taken. As a result, it can be given as:
The optimal value function \(V^{{\pi^{*} }} \left( {s^{\prime}} \right)\) for each state gives the highest expected return that can be obtained from the state. It can be given as:
The Q value can be updated by taking into account the action that has the maximum Q value of the next time slot:
where \(Q(s', a')\) is the new Q-value of the next state, \(Q(s, a)\) is the Q-value of the previous state, and \(\Omega\) is the learning rate or step size. For the agent's exploration policy, an \(\epsilon\)-greedy policy can be used to gather more information about the Q-function. It is described as:
It can be observed that the optimal policy \(\pi^{*} = argmax_{a} \left( {Q^{*} \left( {s,a} \right)} \right)\) is the unique solution. Action \(a\left( m \right)\) is selected randomly from the action space \(A\); otherwise, the action \(a\left( m \right)\) that maximizes the Q-value is selected with probability \(\epsilon\). In contrast, agents in the model-based approach learn a representation of the transition function \(P\) and the reward function \(r\). The optimal Q-function \(Q^{{\pi^{*} }} \left( {s,a} \right)\) can be written as [38]:

\(Q^{{\pi^{*} }} \left( {s,a} \right) = \mathop \sum \limits_{{s^{\prime}}} P\left( {\left. {s^{\prime}} \right|s,a} \right)\left[ {r\left( {s,a,s^{\prime}} \right) + \vartheta \mathop {\max }\limits_{{a^{\prime}}} Q^{{\pi^{*} }} \left( {s^{\prime},a^{\prime}} \right)} \right]\)
The expected value \(V^{\pi } \left( s \right)\) of state \(s\) is the immediate reward plus the values of the next states, weighted by their transition probabilities.
In the proposed method, both model-free and model-based RL algorithms are used to plan the UAV trajectory. Model-free algorithms, such as MC, learn the optimal policy by estimating action values from the agent's interactions with the environment, without requiring a model of the environment. In contrast, model-based algorithms, such as DP, require a model of the environment and use the transition probability to estimate the next reward and next action.
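As a concrete illustration, the tabular Q-learning update and an ε-greedy selection rule described above can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation; here `eps` is used as the exploration probability, which is the common convention.

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps explore (random action); otherwise exploit
    the action with the highest current Q-value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning step: move Q(s,a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a'), with learning rate alpha."""
    target = r + gamma * max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Here `Q` is a plain dictionary keyed by (state, action) pairs; `alpha` and `gamma` play the roles of \(\Omega\) and \(\vartheta\) in the update rule above.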
6 Proposed UAVTO scheme
In this section, the RL algorithm is utilized for UAV trajectory planning to increase network resource utilization. The system's state depends on the current location of the UAV and the average load of the UAV. The state space is denoted as \(s\left( m \right) = \left( {s_{1} \left( m \right), \ldots ,s_{j} \left( m \right)} \right)\), where \(s_{j} \left( m \right)\) represents the state of UAV \(j\) at time slot \(m\), which includes the current location \(q\left( m \right)\) and the average LB of the UAV \(\overline{l}\left( m \right)\). Therefore, the state element is represented as \(s_{j} \left( m \right) = \left\{ {q_{j} \left( m \right), \overline{{l_{j} }} \left( m \right)} \right\}\). The action space is defined as \(a\left( m \right) = \left( {a_{1} \left( m \right), \ldots ,a_{j} \left( m \right)} \right)\), where \(a_{j} \left( m \right)\) represents the action of UAV \(j\) at time slot \(m\), which gives the movement direction of the UAV. Thus, based on the current state (e.g., location and load balance), the agent makes a decision and chooses an action according to its policy \(\pi\), as depicted in Fig. 3.
The main objective of the Q-function is to determine the reward function based on the current location \(q\left( m \right)\) and the average LB of the UAV \(\overline{l}\left( m \right)\). The reward for ending up in state \(s\left( {m + 1} \right)\) after executing action \(a\left( m \right)\) in state \(s\left( m \right)\) is denoted \(r\left( {s\left( m \right),a\left( m \right),s\left( {m + 1} \right)} \right)\). The load takes values in the range [0, 1]. Assume that \(\overline{l}\left( 0 \right) = 0\). The average load of the UAV at time \(m\) can be written as:
If \(l_{j} \left( m \right) < 0\), the load of the UE is allowable and the LB is satisfied. Otherwise, the resources of the UE cannot satisfy the load balance. The number of UAV resources \(\rho_{i} \left( m \right)\) required by UE \(i\) is given as follows:
The resources occupied by UE \(i\) at time slot \(\left( {m + 1} \right)\) may not exceed \(\left( {N_{i} - \sum\nolimits_{i} {\rho_{i} } \left( m \right)} \right)\); otherwise, \(\rho_{i} \left( m \right) = 0\).
Specifically, the UE with the highest data rate is served first so that the minimum resources are used. The action \(a_{j} \left( m \right)\) determines the location of the UAV at the next slot \(m + 1\). Therefore, the next location can be calculated as:
At each decision point, there are two possible actions for the UAV based on the factor \(\beta\), denoted as 0 and 1. \(\beta = 0\) indicates that the UAV continues waiting for more UEs, while \(\beta = 1\) indicates that the UAV stops and transfers to another state that provides the best reward.
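The movement model above can be sketched as follows. The mapping from discrete actions to direction vectors is a hypothetical example for illustration; the paper's actual next locations are defined by its own equation and Table 2.

```python
# Hypothetical mapping from five discrete actions to unit direction
# vectors (east, north, west, south, hover); the paper's own action
# set is assumed to cover every point in the area via Table 2.
DIRECTIONS = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1), 4: (0, 0)}

def next_location(q, action, v, dt, xmax=500.0, ymax=500.0):
    """q(m+1) = q(m) + v * dt * d(a), clipped to the cell boundary
    so the UAV never leaves the simulated area."""
    dx, dy = DIRECTIONS[action]
    x = min(max(q[0] + v * dt * dx, 0.0), xmax)
    y = min(max(q[1] + v * dt * dy, 0.0), ymax)
    return (x, y)
```

The clipping to `[0, xmax] x [0, ymax]` plays the same role as the factor \(\beta\): it keeps the UAV inside the service area for the whole flight duration.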
The reward function of P2 can be formulated as follows:
Both model-based and model-free RL approaches are applied to solve the optimization problem. The model-free MC approach is used to design the optimal trajectory of the UAV. According to Algorithm 1, the first step is to initialize the Q-values and states. The algorithm starts by resetting time slot \(m\) to zero. The optimal Q-function determines the reward function (P2) in Eq. (36) and aims to balance the load of the UAV based on the computed value of \(\overline{l}\left( m \right)\). The objective is to choose the optimal action in an unknown environment. The state space consists of two components: the current location \(q\left( m \right)\) and the average LB of the UAV \(\overline{l}\left( m \right)\). Consequently, the solution of P2 can be obtained by solving Eqs. (32) and (34). The UE with the highest data rate is served first to use the minimum resources. If the load balance is satisfied, i.e., \(l_{j} \left( m \right) < 0\), then the resources allocated to the UE are acceptable. However, if \(l_{j} \left( m \right) > 0\), then the allocated resources may not be sufficient to satisfy the LB. At each time \(m\), the resources occupied by the UE are calculated using Eq. (33) and should not exceed the available resources \(\left( {N_{i} - \sum\nolimits_{i} {\rho_{i} } \left( m \right)} \right)\).
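A minimal every-visit Monte Carlo control loop in the spirit of Algorithm 1 can be sketched as follows. The environment interface (`env_step`, `reset`) is an assumption for illustration, not the paper's code; note how the return is only computed once the episode ends, which is the defining property of MC noted later for Table 5.

```python
import random
from collections import defaultdict

def mc_control(env_step, reset, actions, episodes=500, eps=0.1,
               gamma=0.95, horizon=120):
    """Every-visit Monte Carlo control sketch: run complete episodes
    under an eps-greedy policy, then update Q toward the observed
    discounted return once each episode has finished."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(episodes):
        s = reset()
        episode = []
        for _ in range(horizon):
            if random.random() < eps:
                a = random.choice(actions)          # explore
            else:
                a = max(actions, key=lambda x: Q[(s, x)])  # exploit
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            s = s_next
            if done:
                break
        G = 0.0
        for s_t, a_t, r_t in reversed(episode):     # backward return pass
            G = r_t + gamma * G
            counts[(s_t, a_t)] += 1
            # Incremental average of observed returns for (s_t, a_t).
            Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / counts[(s_t, a_t)]
    return Q
```

In the paper's setting, the state would encode the UAV location and average load, and the reward would be the P2 objective; here they are abstracted behind `env_step`.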
The agent selects the action that maximizes the Q-value with probability \(\epsilon\) according to its policy \(\pi\), and the chosen action determines the next location of the UAV at the next time slot according to Eq. (34). The factor \(\beta\) in Eq. (35) ensures that the UAV remains within the area for a certain duration \(T\). The Q-value is then updated according to the selected action in order to maximize the Q-value of the next time slot. By repeating these steps, Algorithm 1 finds the optimal trajectory of the UAV that suits the network environment. On the other hand, agents in the model-based approach learn a representation of the transition function \(P\) and the reward function \(r\). In the policy iteration method, the optimal solution is based on the state-value function in Eq. (31). During execution, the proposed algorithm constructs a greedy policy \(\pi^{\prime}\) that selects actions better than the original policy \(\pi\). Algorithm 2 is used to find the optimal trajectory of the UAV, iterating until a new policy no longer improves upon the old policy.
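A minimal policy iteration sketch in the spirit of Algorithm 2, assuming the transition probabilities `P` and rewards `R` are known to the model-based agent (the data-structure layout is an assumption for illustration):

```python
def policy_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    """Policy iteration sketch: evaluate the current policy with the
    Bellman expectation equation, then improve it greedily; stop when
    the new policy no longer changes. P[s][a] is a list of
    (prob, s_next) pairs and R[s][a] is the immediate reward."""
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: iterate V to convergence under `policy`.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to V.
        new_policy = {
            s: max(actions,
                   key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            for s in states
        }
        if new_policy == policy:        # no improvement over the old policy
            return policy, V
        policy = new_policy
```

This makes explicit why DP needs a model and memory for the full state space, in contrast to the MC sketch above.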
7 Simulation results
The simulation results evaluate the effectiveness of the proposed UAVTO scheme, which is optimized to maximize EE under the constraints of LB and the number of resources. Matlab is used to generate the simulation results. A cell size of (500 m, 500 m) is assumed, with one MBS located at the center of the cell at coordinates (250 m, 250 m), serving UEs within a distance of 500 m. The UAV serves 50 UEs that are distributed randomly within (0 m, 500 m). The UAV starts at position (0, 0, h) and ends at position (500, 500, h). The paper proposes a 3D view of UAV trajectories to visualize the aircraft performance and verify the safety and adaptability of the algorithm. The power consumption for flying and communication is set to \(p_{f} = 400\;W\) and \(p_{c} = 40\;W\), respectively. The maximum speed of the UAV is set to \(50{\text{ m}}/{\text{s}}\). The flight duration of the UAV is T = 120 s.
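Under these settings, a simple bits-per-joule energy efficiency calculation might look as follows. This is a sketch assuming a constant-power model; the paper's exact EE definition is given by its earlier equations, and only the power and duration values are taken from the text.

```python
def total_energy(p_f, p_c, T):
    """Total UAV energy in joules: flying power plus communication
    power over the flight duration T (constant-power assumption)."""
    return (p_f + p_c) * T

def energy_efficiency(total_bits, p_f=400.0, p_c=40.0, T=120.0):
    """EE in bits per joule under the stated simulation settings
    (p_f = 400 W, p_c = 40 W, T = 120 s)."""
    return total_bits / total_energy(p_f, p_c, T)
```

For example, with these defaults the UAV consumes 440 W for 120 s, i.e. 52,800 J per flight.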
The simulation parameters are listed in Table 1. The proposed scheme is compared with the Circular deployment scheme [44], Trajectory UAV scheme [21], and Linear scheme [45]. In the Circular scheme [44], the UAV flies at the optimal fixed radius, while in the Linear scheme the UAV moves along a linear path. The UAV moved along a linear path from position (0, 0, 100) to position (500, 500, 100) at a constant height of 100 m and a constant velocity of 30 m/s. Figure 4 shows the model of the horizontal path of the UAV. The UAV moved on a horizontal circular path [44], as shown in Fig. 5, with radius 250 m and center at (250, 250, 100), at a constant height of 100 m and a constant velocity of 30 m/s. The UAV completed a single full round starting from position (500, 250, 100).
The efficiency of the UAVTO scheme is tested using parameters such as EE, LB, flight duration, and number of UEs. The simulation results are divided into three sections: Sect. 1 presents the total EE for various numbers of UEs, Sect. 2 describes the total EE for different UAV heights, and Sect. 3 presents the LB versus the number of UEs.
7.1 The total EE for different numbers of UEs
In this section, two scenarios (MC, DP) of the UAV trajectory are proposed for different path loss coefficients corresponding to Suburban, Urban, Dense Urban, and High-rise Urban areas. Our goal is to verify the UAVTO scheme within the cell area to increase network resource utilization based on the RL algorithm. Each UE chooses the UAV so as to balance the load distribution of the cell. In the simulation environment, the path loss coefficient pairs (\(\zeta^{los}\), \(\zeta^{Nlos}\)), measured in dB, are (0.1, 21), (1.0, 20), (1.6, 23), and (2.3, 34) for Suburban, Urban, Dense Urban, and High-rise Urban, respectively [4]. For this purpose, two scenarios of the UAV trajectory are tested with five possible states of the UAV, as shown in Table 2 and Fig. 6. In this table, the states and actions of the UAV are assumed to cover every point in the area. The columns of the table represent the UAV state, which includes the current location, while the rows represent the possible actions that lead to the next state.
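The excess-loss pairs above can be plugged into the widely used probabilistic LoS air-to-ground path loss model. This is a sketch under stated assumptions: the sigmoid parameters `a` and `b` are illustrative urban values for that model family and are not taken from the paper.

```python
import math

def avg_path_loss_db(d, h, f_hz, zeta_los, zeta_nlos, a=9.61, b=0.16):
    """Average air-to-ground path loss in dB: free-space loss plus the
    LoS/NLoS excess losses (zeta_los, zeta_nlos, in dB), weighted by
    an elevation-angle-dependent LoS probability."""
    c = 3e8                                        # speed of light, m/s
    theta = math.degrees(math.asin(h / d))         # elevation angle (deg)
    p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))
    fspl = 20 * math.log10(4 * math.pi * f_hz * d / c)  # free-space loss
    return p_los * (fspl + zeta_los) + (1 - p_los) * (fspl + zeta_nlos)
```

Directly overhead (h close to d) the LoS probability approaches one and the loss reduces to free-space loss plus \(\zeta^{los}\); at shallow elevation angles the NLoS excess loss dominates, which is why the High-rise Urban pair (2.3, 34) yields the lowest EE.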
Figure 7 shows the two different scenarios, DP and MC, of the UAV trajectory under various environments when T = 120 s. The initial position of the UAV is at (0, 0), while the final position is at (500, 500). Table 3 shows the results of the proposed DP approach for different environments. It indicates a new value function for each state and action. The overall design goal of the reward function is to jointly optimize EE by finding the optimal policy \(\pi^{*} \left( {\left. a \right|s} \right)\). From the initial state, the UAV can go to state 1 by finding the value function of five possible actions. As shown in Table 3, the values of the possible actions of state 1 are 2.4851e+07, 2.4905e+07, 2.4772e+07, 2.4848e+07, and 2.4848e+07 for the Suburban environment. The optimal policy \(\pi^{*} \left( {\left. a \right|s} \right) = argmax_{a} \left( {Q^{*} \left( {s,a} \right)} \right)\) is obtained by finding the action that leads to the maximum value function. Therefore, the optimal action is action 2 with x = 142.3880 and y = 88.2683 (from Table 2), which is highlighted in green. Similarly, the optimal action of state 2 is action 1 with x = 250 and y = 50 (from Table 2), also highlighted in green. By repeating these steps, the optimal sequence of UAV actions is found to be action 2, action 1, action 3, and action 3 for the Suburban environment. Tables 4 and 5 show the optimal trajectory path of the UAV for DP and MC, respectively, corresponding to Fig. 7. Accordingly, the rows represent the actions that have the maximum Q-value under the various area coefficients. From Table 5, it can be seen that MC must wait until the end of an episode to receive the reward.
The proposed scheme is compared with the Circular deployment scheme [44], Trajectory UAV scheme [21], and Linear scheme [45] for balancing the load of the UAV trajectory. Figure 8 represents the total EE versus the number of UEs for four different environments. As shown in Fig. 8, the EE of the UAV is largest for the Suburban environment and smaller for the other environments. It can be observed that the proposed UAVTO for MC (UAVTO-MC) provides higher EE than the proposed UAVTO for DP (UAVTO-DP). Figures 9 and 10 are bar graphs showing the load performance when UEs are distributed in the area for four different environments (Suburban, Urban, Dense Urban, High-rise). It can be observed that the proposed UAVTO scheme serves more UEs than the other schemes; the Circular deployment scheme [44], in particular, has a limited coverage area. Additionally, the proposed UAVTO-MC achieves a better load balance by serving more UEs along the UAV's trajectory. Therefore, the Suburban area is chosen for the remaining results, as it yields high EE for all schemes. Figures 11 and 12 show the EE versus flight duration T for different numbers of UEs, with the UAV having different trajectory designs over different flight durations. The EE of UAVTO-MC improves as the duration increases, since more UEs are allocated to the UAV. In contrast, UAVTO-DP achieves low EE and fails to utilize the available network resources because it relies on an explicit model of the environment; MC makes more efficient use of experience. As expected, the proposed UAVTO-MC scheme achieves the best performance, as it optimizes the UAV trajectory for load balancing.
7.2 The total EE for different heights of UAV
In this section, the effect of the UAV trajectory design on total EE for different UAV heights is studied. To verify the performance of the proposed UAVTO, it is compared with other schemes, including the Circular scheme [44], Trajectory UAV scheme [21], and Linear scheme [45], in maximizing total EE while considering load balancing. Figure 13 shows the total EE as a function of UAV height, and it can be seen that MC achieves the maximum EE. DP requires perfect knowledge of the environment and a large amount of memory to store the problem, while MC requires no prior knowledge or modeling assumptions.
7.3 The LB corresponding to the number of UEs
The LB corresponding to the number of UEs is presented in Fig. 14. It can be observed that both curves increase monotonically with the number of UEs. As expected, with increasing UEs the network becomes overloaded in satisfying the requirements for both DP and MC. The LB of DP is clearly the largest, and the LB of MC is the smallest.
8 Conclusion
This paper proposes a UAVTO scheme for load balancing based on RL. The proposed scheme utilizes LB to maximize EE for multiple UEs and improve network resource utilization. A 3D flight trajectory of the UAV is considered to visualize the aircraft performance and verify the safety and adaptability of the algorithm. Since the problem is modeled as non-convex optimization, RL is utilized for UAV trajectory planning. The proposed scheme was applied with both MC and DP to solve the optimization problem under the LOS and NLOS channel models. Additionally, the network load distribution is calculated. The simulation results demonstrate the performance of the proposed scheme under different path losses and different flight durations. The results show that the proposed scheme outperforms the existing methods under various parameter configurations.
Availability of data and materials
Not applicable.
Abbreviations
AF: Amplify-and-forward
BS: Base station
CNN: Convolutional neural network
DF: Decode-and-forward
DP: Dynamic programming
EE: Energy efficiency
GT: Ground terminals
GNN: Graph neural network
G2U: Ground-to-UAV
LB: Load balance
LOS: Line of sight
MBS: Macro base station
MC: Monte Carlo
MDP: Markov decision process
NLOS: Non-line of sight
SDG: Sustainable development goal
SN: Sensor node
SNR: Signal-to-noise ratio
TD: Temporal difference
UAV: Unmanned aerial vehicle
UE: User equipment
U2G: UAV-to-ground
References
S.G. Gupta, M.M. Ghonge, P. Jawandhiya, Review of unmanned aircraft system (UAS). Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 2, 1646–1658 (2013)
A.A. Laghari, A.K. Jumani, R.A. Laghari, H. Nawaz, Unmanned aerial vehicles: a review. Cogn. Robot. 3, 8–22 (2023)
J. Yang, J. Qian, H. Gao, Forest wildfire monitoring and communication UAV system based on particle swarm optimization. Journal of Physics: Conference Series, 2021 2nd International Conference on Artificial Intelligence and Information Systems (ICAIIS 2021), vol. 1982 (2021)
S.K. Khan, M. Farasat, U. Naseem, F. Ali, Performance evaluation of next-generation wireless (5G) UAV relay. Wirel. Pers. Commun. 113, 945–960 (2020)
S. Ahmed, M.Z. Chowdhury, Y.M. Jang, Energy-efficient UAV-to-user scheduling to maximize throughput in wireless networks. Inst. Electr. Electron. Eng. (IEEE) 8, 21215–21225 (2020)
E. Larsen, L. Landmark, O. Kure, Optimal UAV relay positions in multi-rate networks. Wireless Days Conference, pp. 8–14 (2017)
F. Cheng, S. Zhang, Z. Li, Y. Chen, N. Zhao, F. Richard Yu, V.C.M. Leung, UAV trajectory optimization for data offloading at the edge of multiple cells. IEEE Transactions on Vehicular Technology, vol. 67, pp. 6732–6736 (2018)
Z. Rahimi, M.J. Sobouti, R. Ghanbari, S.A.H. Seno, A.H. Mohajerzadeh, H. Ahmadi, H. Yanikomeroglu, An efficient 3D positioning approach to minimize required UAVs for IoT network coverage. IEEE Internet Things J. 558–571 (2021)
S. Yin, J. Tan, L. Li, UAV-assisted cooperative communications with wireless information and power transfer. Netw. Internet Archit. 1–31 (2017)
D. Huang, M. Cui, G. Zhang, X. Chu, F. Lin, Trajectory optimization and resource allocation for UAV base stations under in-band backhaul constraint. EURASIP J. Wirel. Commun. Netw. 2020, 1–17 (2020)
M.A. Sayeed, R. Kumar, V. Sharma, M.A. Sayeed, Efficient deployment with throughput maximization for UAVs communication networks. Sensors 20, 1–27 (2020)
X. Fu, T. Ding, R. Peng, C. Liu, M. Cheriet, Joint UAV channel modeling and power control for 5G IoT networks. EURASIP J. Wirel. Commun. Netw. 2021, 1–15 (2021)
B. Liu, H. Zhu, Energy-effective data gathering for UAV-aided wireless sensor networks. Sensors. 1–12 (2019)
X.A.F. Cabezas, D.P.M. Osorio, M. Latva-aho, Positioning and power optimization for UAV-assisted networks in the presence of eavesdroppers: a multi-armed bandit approach. EURASIP J. Wirel. Commun. Netw. 2022, 1–24 (2022)
X. Fu, T. Ding, R. Peng, C. Liu, M. Cheriet, Joint UAV channel modeling and power control for 5G IoT networks. EURASIP J. Wirel. Commun. Netw. 2021, 1–15 (2021)
S.R. Pandey, K. Kim, M. Alsenwi, Y.K. Tun, Z. Han, C.S. Hong, Latency-sensitive service delivery with UAV-assisted 5G networks. IEEE Wirel. Commun. Lett. 10, 1518–1522 (2021)
H. Yang, J. Zhao, J. Nie, N. Kumar, K.Y. Lam, Z. Xiong, UAV-assisted 5G/6G networks: joint scheduling and resource allocation based on asynchronous reinforcement learning, in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications Workshops (2021)
I. Ahmad, J. Kaur, H.T. Abbas, Q.H. Abbasi, A. Zoha, M.A. Imran, S. Hussain, UAV-assisted 5G networks for optimised coverage under dynamic traffic load. in 2022 IEEE International Symposium on Antennas and Propagation and USNC-URSI Radio Science Meeting (AP-S/URSI) (2022)
R. Shahzadi, M. Ali, H.Z. Khan, M. Naeem, UAV assisted 5G and beyond wireless networks: a survey. J. Netw. Comput. Appl. 189, 1–20 (2021)
L. Zhang, A. Celik, S. Dang, B. Shihada, Energy-efficient trajectory optimization for UAV-assisted IoT networks. IEEE Trans. Mobile Comput. 21, 4323–4337 (2021)
G. Zhang, H. Yan, Y. Zeng, M. Cui, Y. Liu, Trajectory optimization and power allocation for multi-hop UAV relaying communications. IEEE Access. 6, 48566–48576 (2018)
G. Zhang, Q. Wu, M. Cui, R. Zhang, Securing UAV communications via joint trajectory and power control. IEEE Trans. Wirel. Commun. 18, 1376–1389 (2019)
Y. Zeng, R. Zhang, Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 16, 3747–3760 (2017)
S. Nasrollahi, S.M. Mirrezaei, Toward UAV-based communication: improving throughput by optimum trajectory and power allocation. EURASIP J. Wirel. Commun. Netw. 2022 (2022)
J. Gu, G. Ding, Y. Xu, H. Wang, Q. Wu, Proactive optimization of transmission power and 3D trajectory in UAV-assisted relay systems with mobile ground users. Chin. J. Aeronaut. 34(3), 129–144 (2021)
X. Jiang, Z. Wu, Z. Yin, Z. Yang, Power and trajectory optimization for UAV-enabled amplify-and-forward relay networks. IEEE Access 4, 1–9 (2016)
A. Salah, H. Abd Elatty, R.Y. Rizk, Joint channel assignment and power allocation based on maximum concurrent multicommodity flow in cognitive radio networks. Wirel. Commun. Mobile Comput. 2018, 114 (2018)
D. Zhai, H. Li, X. Tang, R. Zhang, H. Cao, Joint position optimization, user association, and resource allocation for load balancing in UAV-assisted wireless networks. Digit. Commun. Netw. 1–13 (2022)
Z. Luan, H. Jia, P. Wang, R. Jia, B. Chen, Joint UAVs' load balancing and UEs' data rate fairness optimization by diffusion UAV deployment algorithm in multi-UAV networks. Entropy 23, 1470–1489 (2021)
Q. Fan, N. Ansari, Towards traffic load balancing in drone-assisted communications for IoT. IEEE Internet Things J. 6, 3633–3640 (2019)
L. Yang, H. Yao, J. Wang, C. Jiang, A. Benslimane, Y. Liu, Multi-UAV enabled load-balance mobile edge computing for IoT networks. IEEE Internet Things J. 7, 1–12 (2020)
J.H. Cui, R.X. Wei, Z.C. Liu, K. Zhou, UAV motion strategies in uncertain dynamic environments: a path planning method based on Q-learning strategy. Appl. Sci. 8, 1–16 (2018)
A.T. Azar, A. Koubaa, N.A. Mohamed, H.A. Ibrahim, Z.F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A.M. Khamis, I.A. Hameed, G. Casalino, Drone deep reinforcement learning: a review. Electronics 10, 1–30 (2021)
K. Karwowska, D. Wierzbicki, Improving spatial resolution of satellite imagery using generative adversarial networks and window functions. Remote Sens. 14, 1–22 (2022)
S.M. Mousavi, Improving quality of images in UAVs navigation using super-resolution techniques based on convolutional neural network with multi-layer mapping. Marine Technol. 4, 1–11 (2017)
E. Balasubramanian, E. Elangovan, P. Tamilarasan, G.R. Kanagachidambaresan, D. Chutia, Optimal energy efficient path planning of UAV using hybrid MACO-MEA* algorithm: theoretical and experimental approach. J. Ambient Intell. Humaniz. Comput. 1–22 (2022)
G. Kalnoor, G. Subrahmanyam, A review on applications of Markov decision process model and energy efficiency in wireless sensor networks. Proc. Comput. Sci. 167, 2308–2317 (2020)
M. Abualsheikh, D.T. Hoang, D. Niyato, H.P. Tan, S. Lin, Markov decision processes with applications in wireless sensor networks: a survey. IEEE Commun. Surv. Tutorials 17, 1239–1267 (2015)
N. Safwat, I.M. Hafez, F. Newagy, UGPL: a MATLAB application for UAV-to-ground path loss calculations. Softw. Impacts. 12 (2022)
S.M.M. AboHashish, R.Y. Rizk, F.W. Zaki, Energy efficiency optimization for relay deployment in multi-user LTE-advanced networks. Wirel. Pers. Commun. 108, 297–323 (2019)
B.S. Roh, M.H. Han, J.H. Ham, K.I. Kim, Q-LBR: Q-learning based load balancing routing for UAV-assisted VANET. Sensors 20, 1–18 (2020)
S.M.M. AboHashish, R.Y. Rizk, F.W. Zaki, Towards energy efficient relay deployment in multi-user LTE-A networks. IET Commun. 13, 2688–2696 (2019)
P.D. Thanh, T.H. Giang, T.N.K. Hoan, I. Koo, Cache-enabled data rate maximization for solar-powered UAV communication systems. Electronics 9, 1–28 (2020)
O.M. Bushnaq, M.A. Kishk, A. Çelik, M.S. Alouini, T.Y. Al-Naffor, Optimal deployment of tethered drones for maximum cellular coverage in user clusters. IEEE Trans. Wirel. Commun. 20, 2092–2108 (2021)
Ch. Zhan, Y. Zeng, R. Zhang, Energy-efficient data collection in UAV enabled wireless sensor network. IEEE Wirel. Commun. Lett. 7, 328–331 (2018)
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). The research received no external funding.
Author information
Authors and Affiliations
Contributions
All Authors contributed equally to this work and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Abohashish, S.M.M., Rizk, R.Y. & Elsedimy, E.I. Trajectory optimization for UAV-assisted relay over 5G networks based on reinforcement learning framework. J Wireless Com Network 2023, 55 (2023). https://doi.org/10.1186/s13638-023-02268-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13638-023-02268-x
Keywords
 Reinforcement learning
 Sustainable development goals
 Trajectory optimization UAVs