Actor-critic learning-based energy optimization for UAV access and backhaul networks

In unmanned aerial vehicle (UAV)-assisted networks, the UAV acts as an aerial base station that acquires the requested data via a backhaul link and then serves ground users (GUs) through an access network. In this paper, we investigate an energy minimization problem with a limited power supply for both backhaul and access links. The difficulty in solving such a non-convex and combinatorial problem lies in the high computational complexity/time. In solution development, we consider approaches from both actor-critic deep reinforcement learning (AC-DRL) and optimization perspectives. First, two offline non-learning algorithms, i.e., an optimal and a heuristic algorithm, based on piecewise linear approximation and relaxation are developed as benchmarks. Second, toward real-time decision-making, we improve the conventional AC-DRL and propose two learning schemes: AC-based user group scheduling and backhaul power allocation (ACGP), and joint AC-based user group scheduling and optimization-based backhaul power allocation (ACGOP). Numerical results show that the computation time of both ACGP and ACGOP is reduced tenfold to hundredfold compared to the offline approaches, and ACGOP outperforms ACGP in energy savings. The results also verify the superiority of the proposed learning solutions in terms of guaranteeing feasibility and minimizing the system energy compared to conventional AC-DRL.

However, one of the most critical issues of UAV-assisted networks is the limited on-board energy, which may shorten the UAVs' endurance and lead to service failure. Therefore, minimizing the UAV's energy consumption is of great importance. In [4], the authors proposed a joint power allocation and trajectory design algorithm to maximize the UAV's propulsion energy efficiency. Considering both the communication energy and propulsion energy of the UAV, the authors in [5] and [6] proposed energy-efficient communication schemes via user scheduling and sub-channel allocation, respectively. We note that the works in [4][5][6] focused on the access link in UAV-assisted networks, where the UAV serves as an aerial base station (BS) that carries all the ground users' (GUs') requested data. In practice, due to limited storage capacity, the GU's requested data may not be available in the UAV's cache. When the BS in the GU's service area is overloaded or damaged, the UAV serves as an intermediate node that acquires the requested data from a remote auxiliary base station (ABS) through a backhaul link and delivers the data to the GUs via access links [7]. Compared to direct terrestrial communication between the GU and the ABS, the UAV experiences better channel conditions but has a limited energy supply. Thus, it is necessary to consider energy-saving problems for backhaul-access UAV networks. In [8], an energy efficiency maximization problem was investigated via power allocation and trajectory design, where the UAV acts as a relay between the ABS and the GUs. The authors in [9] proposed a joint trajectory design and spectrum allocation algorithm to minimize the UAV's propulsion energy while satisfying the backhaul constraint, meaning that the transmitted data of the access link cannot exceed that of the backhaul link.
The user scheduling schemes in [8,9] are based on time division multiple access (TDMA) or frequency division multiple access (FDMA) with a single-antenna UAV. However, the spatial division multiple access (SDMA) mode with multiple-antenna techniques and precoding design is able to improve network capacity, thereby reducing the tasks' completion time and total energy consumption. In [10], a non-orthogonal multiple access-based user scheduling and power allocation algorithm was proposed to minimize the UAV's transmission energy under the backhaul constraint. In [11], the authors designed a game theory-based precoding scheme for multi-antenna UAV-assisted cluster networks. To maximize the UAV's propulsion energy efficiency, the authors in [12] proposed a power allocation scheme for multi-antenna UAV-enabled relay systems. However, the above works [10][11][12] study the energy consumption of the backhaul link only to a limited extent, although it accounts for a large proportion of the total energy consumption and can be optimized by backhaul power control [13]. This motivates us to investigate an energy minimization problem, including both backhaul and access energy, in multiple-antenna UAV-assisted networks.
Optimization-based solutions, e.g., successive convex approximation [5] or the Lagrangian dual method [6], might not be able to make time-efficient decisions. First, the SDMA-based transmission mode enables the UAV to serve more than one GU simultaneously, resulting in exponential growth of the decision variables as well as the complexity [1]. Moreover, the diversified energy models in UAV systems may lead to non-convexity in the problem formulation, which makes the problem difficult to solve optimally.
Deep reinforcement learning (DRL) learns the optimal policy from the interaction between environment and actions, instead of directly solving the optimization problem. DRL combines artificial neural networks with a reinforcement learning architecture to improve learning efficiency and solution quality. Unlike supervised deep neural networks (DNNs), DRL does not need a large amount of data prepared in advance for offline training. To maximize energy efficiency, the authors in [14] and [15] applied a deep Q network (DQN) to make decisions for resource block allocation and flight path planning, respectively. DQN needs to enumerate all the possible actions before executing the algorithm, so it is usually applied to decision tasks with a discrete action space and a small number of decision variables [16].
Actor-critic-based DRL (AC-DRL) can tackle both discrete and continuous action spaces. For problems with continuous variables, e.g., power control, AC-DRL adopts a stochastic policy that selects an action by probability. In [17], an energy-efficient UAV direction control policy was proposed based on AC-DRL. To minimize the UAV's energy consumption, the authors in [18] applied an AC-based deep deterministic policy gradient algorithm for UAV velocity and direction control. In [17,18], the multiple decision variables in the problem formulations may lead to a huge action space and slow convergence (more than 1000 learning episodes). It is noted that the solutions proposed in [17,18] apply only to unconstrained problems. However, for general UAV-assisted networks, the optimization problems have constraints [4][5][6][7][8][9][11][12][13]. Therefore, directly applying AC-DRL may not lead to a high-quality and feasible solution.
In this paper, we propose two tailored AC-DRL-based schemes: AC-based user group scheduling and backhaul power allocation (ACGP), and joint AC-based user group scheduling and optimization-based backhaul power allocation (ACGOP). The main contributions are summarized as follows:
• We formulate a non-convex mixed-integer programming (NCMIP) problem to minimize both backhaul energy and access energy in UAV-assisted networks.
• To approach the optimum, we first transform the non-linear terms into linear ones by piecewise linear approximation and McCormick envelopes, leading to a mixed-integer linear programming (MILP) problem, which can be solved optimally by branch and bound (B&B).
• We provide a near-optimal algorithm with lower computation time than the optimal method. First, the original NCMIP problem is relaxed to a continuous optimization problem. Second, the relaxed problem is converted to a linear programming (LP) problem by piecewise linear approximation. Then, heuristic solutions can be obtained after a rounding-up operation.
• Aware of the high complexity of the optimization methods, we propose the ACGP and ACGOP learning schemes. To enable the learning algorithms to adapt to the considered NCMIP, in ACGP and ACGOP we improve conventional AC-DRL with a set of approaches, i.e., action filtering and reward re-design, to improve learning performance and avoid infeasible solutions.
• From the numerical results, we conclude that, compared with the non-learning algorithms, ACGP and ACGOP are superior in computational time efficiency, while compared with conventional AC-DRL, they achieve better performance in delivering feasible solutions. Experiments also show that the combined learning-optimization scheme, i.e., ACGOP, achieves better energy-saving performance than ACGP.
The rest of the paper is organized as follows. Section 2 provides the system model. In Sect. 3, we formulate the considered optimization problem and solve it by proposing an optimal algorithm and a heuristic algorithm. In Sect. 4, we resolve the problem by DRL and develop an AC-DRL-based algorithm. Numerical results are presented and analyzed in Sect. 5. Finally, we draw the conclusions in Sect. 6. Notations: Some mathematical operators are defined as follows. For a vector a, ‖a‖ and a^H represent its Euclidean norm and conjugate transpose, respectively. For a matrix A, A^H refers to its conjugate transpose, and A^† denotes its generalized inverse matrix. For scalars x and y, ⌈x⌉ and ⌊x⌋ mean the rounding-up and rounding-down operations, respectively. [x]^+ is equivalent to max{0, x}. N(x, y) denotes a Gaussian distribution with mean x and variance y. For a random variable X, E[X] is the statistical expectation of X.

System model
We consider a UAV-assisted communication system including both backhaul and access links, as shown in Fig. 1. In the backhaul part, a multi-antenna UAV requests data from a multi-antenna ABS that is connected to the core network. In the access network, the UAV acts as an aerial BS serving single-antenna GUs in remote areas when the terrestrial BS in the current service area is not available, e.g., destroyed in a disaster. As the UAV operates at high altitudes, it can overcome the influence of obstacles on the ground, e.g., buildings or mountains, and has a higher probability of experiencing LoS transmission. The difference between the backhaul and access networks in channel modeling is that the former forms a MIMO system while the latter is modeled as a multi-user MISO system. When the UAV receives the GUs' data requests, it first downloads these data from a remote ABS through a backhaul link and then distributes the data to the GUs through access links. The GUs in the service area are divided into several clusters due to the limited communication coverage of the UAV. As an input to the UAV optimization problem, the GU clusters can be determined by two methods. The first is clustering algorithms, e.g., K-means, based on the similarity of the GUs' distances or channel conditions. The second is simply the GUs' association and the coverage areas of the damaged base stations. In this paper, the latter method is adopted. In a cluster, there exist K single-antenna GUs, each with a demand of q_k (bits). The user set is denoted as K = {1, ..., k, ..., K}. The total demand is D = Σ_{k=1}^{K} q_k. In each transmission task, all the GUs' demands need to be served within the time limitation T_max (seconds), including the time used for acquiring data from the ABS and delivering data to the GUs.¹ As shown in Fig. 2, the system spectrum is reused in a TDMA fashion, so the time domain of a transmission task is divided into a sequence of timeslots I = {1, ..., i, ..., I}, where I is the maximum number of timeslots, given by ⌊T_max/δ⌋, and δ (seconds) refers to the duration of each timeslot. In the access network, a timeslot accommodates multiple GUs with the SDMA transmission mode to further improve network capacity.

Backhaul transmission
The ABS and UAV are equipped with L_t and L_r antennas, respectively, so that the backhaul link can be modeled as a MIMO channel. We assume that signals propagate through LoS transmission from the ABS to the UAV. Let G ∈ C^{L_t×L_r} be the channel matrix of the wireless backhaul link, determined by the spherical wave model [19], in which the entry for the antenna pair (l_t, l_r) depends on the path length o_{l_t,l_r} between the l_t-th transmitting antenna and the l_r-th receiving antenna, the carrier frequency f_c, and the path loss exponent β. The received signal at the UAV from the ABS can be described by

y = Gx + n,   (2)

where x and n denote the transmitted signal and the white Gaussian noise at the UAV, respectively. In order to maximize the backhaul capacity, we employ water-filling-based power allocation [20]. The matrix G has a singular value decomposition (SVD) G = UΛV^H, where U ∈ C^{L_t×L_t} and V ∈ C^{L_r×L_r} are unitary matrices, and Λ ∈ C^{L_t×L_r} is a diagonal matrix whose elements are non-negative real numbers. The diagonal elements λ_1, ..., λ_L of Λ are the ordered singular values (from largest to smallest) of G. Under the assumption that G is a full-rank matrix, let L = min{L_t, L_r}. The UAV's received signal is processed with the unitary matrices so that the MIMO channel decomposes into L parallel sub-channels, where p_l denotes the power allocated to the l-th sub-channel. Thus, the capacity of the MIMO channel can be calculated by

C = B_bh Σ_{l=1}^{L} log₂(1 + λ_l² p_l / σ²),

where B_bh is the bandwidth of the backhaul link and σ² is the receiver noise power at the UAV. Based on water-filling power allocation, p_l = [μ − σ²/λ_l²]^+, where μ is the water-filling level [20]. Thus, the total transmit power on the backhaul is p_bh(μ) = Σ_{l=1}^{L} p_l, and the achievable rate of the backhaul can be rewritten as a function of μ. In a timeslot, the backhaul transmission energy and the achievable transmitted data volume follow directly from p_bh(μ) and this rate.
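As an illustration of the water-filling step, the sketch below computes the per-sub-channel powers p_l = [μ − σ²/λ_l²]^+ and the resulting rate for a given water level μ. The function and variable names are our own, not the paper's notation.

```python
import numpy as np

def water_filling(singular_values, mu, noise_power=1.0):
    """Per-sub-channel power under water level mu: p_l = max(0, mu - sigma^2 / lambda_l^2)."""
    lam2 = np.asarray(singular_values, dtype=float) ** 2
    return np.maximum(0.0, mu - noise_power / lam2)

def backhaul_rate(singular_values, mu, bandwidth=1.0, noise_power=1.0):
    """Achievable MIMO rate: B * sum_l log2(1 + lambda_l^2 p_l / sigma^2)."""
    lam2 = np.asarray(singular_values, dtype=float) ** 2
    p = water_filling(singular_values, mu, noise_power)
    return bandwidth * np.sum(np.log2(1.0 + lam2 * p / noise_power))
```

For example, with singular values (2, 1), unit noise, and μ = 1, only the strong sub-channel receives power, so the rate reduces to a single log term.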

Access transmission
From Fig. 2, in the access transmission, a shaded block indicates that the user is scheduled. We define the set of simultaneously scheduled users as a user group. Therefore, the maximum number of candidate groups is G = Σ_{l=1}^{L_r} K!/(l!(K−l)!), which increases exponentially with K. The group set is G = {1, ..., g, ..., G}. To eliminate multi-user interference within a group, minimum mean square error (MMSE) precoding is applied [21]. The signal propagates between the UAV and the GUs via a LoS channel. We denote K_g and K_g as the number and the set of users in group g, respectively, and h_{k,g} ∈ C^{L_r×1} as the channel vector for user k ∈ K_g, whose entries depend on the distance ι_{k,g,l_r} between the UAV's l_r-th antenna and the k-th GU of the g-th group. We form H_g = [h_{1,g}, ..., h_{K_g,g}] as the channel matrix of group g. Based on the MMSE criterion [21], the precoding vector w_{k,g} ∈ C^{L_r×1} is computed from H_g regularized by the noise power σ²_{k,g} of user k ∈ K_g, where I denotes an identity matrix. Since the UAV's transmit power is a constant selected from 0.1 W to 10 W in practical UAV applications [22], we assume the transmit power for user k in group g is fixed, denoted as p_{k,g}. The received signal at GU k ∈ K_g is

y_{k,g} = h_{k,g}^H w_{k,g} x_{k,g} + h_{k,g}^H Σ_{j∈K_g, j≠k} w_{j,g} x_{j,g} + n_{k,g},   (12)

where x_{k,g} and n_{k,g} denote the transmitted signal and white Gaussian noise of GU k ∈ K_g. According to (12), we obtain the SINR of GU k ∈ K_g as

γ_{k,g} = p_{k,g}|h_{k,g}^H w_{k,g}|² / (Σ_{j∈K_g, j≠k} p_{j,g}|h_{k,g}^H w_{j,g}|² + σ²_{k,g}).

Thus, the transmitted data volume for GU k ∈ K_g in a timeslot is d_{k,g} = δ B_ac log₂(1 + γ_{k,g}), and the transmission energy for group g follows from the fixed per-user powers, where B_ac is the bandwidth of the access link.
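The MMSE precoding and SINR computation can be sketched in numpy as follows. This is a minimal regularized-inverse version; the exact regularization constant used in [21] is an assumption here, and all names are our own.

```python
import numpy as np

def mmse_precoders(H, p, noise_power):
    """H: (L_r, K) channel matrix with columns h_k. Returns unit-norm
    MMSE (regularized zero-forcing) precoding vectors, one column per user.
    The regularization alpha = K * sigma^2 / p is an assumed convention."""
    Lr, K = H.shape
    alpha = K * noise_power / p
    A = H @ H.conj().T + alpha * np.eye(Lr)
    W = np.linalg.solve(A, H)              # (L_r, K)
    W = W / np.linalg.norm(W, axis=0)      # normalize each user's vector
    return W

def sinr(H, W, p, noise_power):
    """SINR_k = p |h_k^H w_k|^2 / (sum_{j != k} p |h_k^H w_j|^2 + sigma^2)."""
    G = np.abs(H.conj().T @ W) ** 2        # G[k, j] = |h_k^H w_j|^2
    signal = p * np.diag(G)
    interference = p * (G.sum(axis=1) - np.diag(G))
    return signal / (interference + noise_power)
```

With orthogonal channels (H an identity matrix) the inter-user interference vanishes, so the SINR collapses to p/σ² per user.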

UAV energy model
The propulsion power can be modeled as a function of the flying velocity U [23], which is given by

P(U) = P₀(1 + 3U²/U²_tip) + P₁(√(1 + U⁴/(4U⁴_ind)) − U²/(2U²_ind))^{1/2} + ϱ₁U³,

where P₀ and P₁ are the blade profile power and induced power in hovering status, respectively. U_tip and U_ind refer to the tip speed of the rotor blade and the mean rotor induced velocity, respectively. ϱ₁ is a parameter related to the fuselage drag ratio, rotor solidity, and the rotor disc area, and ϱ₂ denotes the air density.
In the hovering phase, the UAV flies circularly around a hovering point with a small radius. To minimize the hovering power, the hovering velocity is given by U_hov = arg min_U P(U). Therefore, the hovering energy depends only on the hovering time. In the flying phase, the energy consumption over a flying distance S is S·P(U)/U. When the flying path is predetermined, S is a constant, so the flying velocity that minimizes the flying energy is U_fly = arg min_U P(U)/U. Both U_hov and U_fly can be obtained by graph-based numerical methods [24]. Therefore, the hovering power p_hov and flying power p_fly are P(U_hov) and P(U_fly), respectively. Because the UAV suspends data transmission when flying between the clusters under the fly-hover-communicate protocol [5], the minimum flying energy is S·P(U_fly)/U_fly.
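The two velocities can be found numerically. The sketch below evaluates the rotary-wing power model of [23] and grid-searches U_hov and U_fly; the parameter values are illustrative numbers commonly used in the rotary-wing UAV literature, not the paper's, and the grid search stands in for the graph-based method of [24].

```python
import numpy as np

# Illustrative rotary-wing parameters (assumed, not taken from the paper).
P0, P1 = 79.86, 88.63          # blade profile / induced power at hover (W)
U_TIP, U_IND = 120.0, 4.03     # rotor tip speed, mean induced velocity (m/s)
RHO1 = 0.0126                  # fuselage-drag-related coefficient

def propulsion_power(U):
    """P(U) from the standard rotary-wing propulsion model."""
    blade = P0 * (1.0 + 3.0 * U**2 / U_TIP**2)
    induced = P1 * np.sqrt(np.sqrt(1.0 + U**4 / (4 * U_IND**4))
                           - U**2 / (2 * U_IND**2))
    parasite = RHO1 * U**3
    return blade + induced + parasite

# Grid search over velocities (a simple stand-in for graph-based methods).
grid = np.linspace(0.1, 60.0, 6000)
U_hov = grid[np.argmin(propulsion_power(grid))]          # minimizes power
U_fly = grid[np.argmin(propulsion_power(grid) / grid)]   # minimizes energy per meter
```

Note that U_fly > U_hov always holds for this model: the minimum of P(U)/U occurs where P'(U) = P(U)/U > 0, i.e., beyond the power minimum.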

UAV flying path selection and fly-hover-communicate protocol
In the considered scenario, the UAV visits and serves each cluster's data requests in a sequential manner according to a predetermined trajectory and visiting order. Before taking off, the UAV pre-optimizes the trajectory according to different requirements at the dock station. We keep the trajectory design flexible. For example, if the UAV task is time-critical, the flying path can be determined by the clusters' priorities, e.g., the higher-priority cluster is served first. If the task is energy-critical, we apply Dijkstra's algorithm to obtain the shortest or minimal-cost path, which is the method mainly adopted in this paper [25]. The timeline of UAV actions is depicted in Fig. 2. According to the fly-hover-communicate protocol, the UAV stops transmitting data while flying [5]. The UAV first experiences the flying phase before arriving at the hovering center of the target cluster. Then, the UAV hovers at the cluster and delivers data to the GUs, so that the hovering time equals the communication time. When the transmission task in the current cluster is completed, the UAV flies to the next cluster.
The main notations are summarized in Table 1.

Problem formulation
Our goal is to minimize the total system energy consumption via a joint design of user-timeslot scheduling and backhaul power allocation, subject to the users' quality of service requirements. The total energy consumption consists of four parts: (1) the flying energy, (2) the hovering energy, (3) the backhaul transmission energy, and (4) the access transmission energy. As analyzed in the previous section, the flying energy is independent of the scheduling and power transmission decisions and hence can be excluded from the joint design.
On the other hand, the hovering energy is determined by the transmission time and hence needs to be optimized. We denote a set of binary variables indicating timeslot allocation as follows: α^ac_{g,i} = 1 if group g ∈ G is scheduled at timeslot i, and 0 otherwise; similarly, α^bh_i = 1 if the backhaul link is active at timeslot i, and 0 otherwise. Then the joint design of timeslot allocation (via α^ac_{g,i}, α^bh_i) and backhaul power optimization (via μ) for energy minimization can be formulated as problem P1.
In (19a), e_bh(μ) and e_g are given in (8) and (15), respectively, and e_hov = δ·p_hov is the hovering energy per timeslot. The first summation in (19a) represents the transmission and hovering energy spent on the backhaul, and the second summation is the energy consumed on the access links. Note that we optimize the water-filling level μ instead of directly optimizing the backhaul power p_bh since p_bh depends on μ via Eq. (6). Constraints (19b) guarantee that each GU's request is satisfied in the access network. Constraint (19c) states that the contents delivered through the backhaul must accommodate the total demands of the GUs. Constraint (19d) avoids concurrent transmission of the backhaul and access links. Constraint (19e) upper-bounds the water-filling level by μ_max, the maximal water-filling level under the backhaul's limited transmit power. Constraints (19f) and (19g) confine the variables α^ac_{g,i} and α^bh_i to binary values. Due to the non-convex terms e_bh(μ)α^bh_i and d_bh(μ)α^bh_i, P1 is an NCMIP problem for which the optimal solution is difficult to obtain. One method is to apply piecewise linear approximation to linearize the non-linear functions e_bh(μ) and d_bh(μ) [28]. The approximations of e_bh(μ)α^bh_i and d_bh(μ)α^bh_i then take a bilinear form, which can be transformed to linear constraints using the McCormick envelopes [26]. The resulting problem is a mixed-integer linear programming (MILP) problem, which can be solved optimally by the B&B method [27]. When the number of linear pieces in the fitted functions is sufficient and the bounds of the McCormick envelopes are sufficiently tight, the solutions approach the global optimum. However, these approximation and branching operations bring about high computation time (on the order of minutes), which is unaffordable in practice.
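To make the two linearization ingredients concrete, the sketch below samples breakpoints for a piecewise linear fit and evaluates the McCormick envelope of a bilinear term x·y. For a binary y the envelope is exact, which is why products such as e_bh(μ)α^bh_i linearize without loss. The envelope form is the textbook one; names are our own.

```python
import numpy as np

def pwl_breakpoints(f, lo, hi, n):
    """Sample f at n+1 breakpoints; between breakpoints a MILP replaces
    the function by the chord (e.g., via an SOS2 / lambda formulation)."""
    xs = np.linspace(lo, hi, n + 1)
    return xs, f(xs)

def mccormick(x, y, xl, xu, yl, yu):
    """Lower/upper McCormick envelope bounds on the bilinear term z = x*y,
    given box bounds x in [xl, xu], y in [yl, yu]."""
    lower = max(xl * y + x * yl - xl * yl, xu * y + x * yu - xu * yu)
    upper = min(xu * y + x * yl - xu * yl, x * yu + xl * y - xl * yu)
    return lower, upper
```

With y restricted to {0, 1}, the lower and upper bounds coincide (the relaxation is tight), so the MILP represents e_bh(μ)α^bh_i exactly up to the piecewise linear fit of e_bh(μ).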

Heuristic approach
To reduce the computation time of problem P1, we propose a heuristic algorithm. First, we consider the extreme condition δ → 0, so that P1 can be relaxed to a continuous optimization problem P2 in (22a). After relaxation, the allocated times for group g and for the backhaul link become continuous values. By fitting F(μ) and τ_bh(μ) with piecewise linear approximations, P2 can be approximated as a linear programming (LP) problem, which can be solved by classical algorithms such as the simplex method [28]. In practice, when δ > 0, P2 provides a lower bound of P1, whose time variables τ_1, ..., τ_G must be integer multiples of δ. Thus, we take a rounding-up operation as post-processing, which introduces errors but makes the solutions of P2 feasible for P1. We summarize the proposed heuristic algorithm in Alg. 1. When δ is sufficiently small, the solution of P2 approaches the optimal solution of P1, and the proposed heuristic method provides near-optimal solutions. The heuristic algorithm is more efficient than the optimal algorithm, as solving the relaxed continuous problem is easier than solving the original integer programming problem. However, P2 still suffers from high computational complexity, as the number of variables is G + 1, which increases exponentially with the number of GUs. This limits its practical application when the number of users is large or the latency requirement is stringent.
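The post-processing step of the heuristic (rounding the continuous LP allocations up to whole timeslots) can be sketched as follows; the helper name and the small epsilon guard against floating-point noise are our own.

```python
import math

def round_up_to_slots(tau, delta):
    """Round each continuous time allocation up to an integer number of
    timeslots of length delta. The 1e-9 guard avoids rounding an allocation
    that is already (numerically) a slot multiple up by a full extra slot."""
    return [math.ceil(t / delta - 1e-9) * delta for t in tau]
```

For example, with δ = 0.1 s an LP allocation of 0.25 s becomes 0.3 s; the rounding makes the LP solution feasible for P1 at the cost of a small energy overestimate.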

AC overview and the proposed solutions
Being aware of the high computational complexity of the iterative optimal and suboptimal algorithms, we develop ACGP and ACGOP for real-time applications.

AC-DRL framework
To make the paper self-contained, we first provide a brief overview of the adopted AC-DRL framework. Basic RL is modeled as a Markov decision process (MDP) with three elements: state, action, and reward. At each time step t, the current environment is represented as a state s_t. The agent takes an action a_t based on s_t and a policy. Then a reward r_t is received by the agent and the next state s_{t+1} can be observed. By collecting the tuple {s_t, a_t, r_t, s_{t+1}}, the agent updates the policy iteratively with value-based or policy-based methods. The goal of an RL agent is to learn a policy that maximizes the expected cumulative reward. In DRL, the policy or other learned functions are approximated by neural networks to deal with high-dimensional state spaces and improve learning efficiency. AC is a DRL framework that integrates the strengths of both value-based and policy-based methods [16]. AC-DRL splits the learning agent into two components: the actor is responsible for updating the policy and making decisions, while the critic evaluates the decisions via value functions.
For the actor, a stochastic policy is applied, denoted as π(a|s_t), representing the probability of taking action a in state s_t. Usually, we model π(a|s_t) as a Gaussian distribution with mean ψ(s_t) and variance χ(s_t) [29]. At each learning step t, an action a_t is taken by following the policy π(a|s_t). After that, the agent receives a reward r_t as feedback. The objective of AC-DRL is to maximize the cumulative reward, so the actor's objective can be defined as

J(θ_t) = E[log π(a_t|s_t; θ_t) Q^π(s_t, a_t)],

where Q^π(s_t, a_t) is the Q-value function with discount factor γ. The critic evaluates the quality of the action by estimating the current Q-value. Temporal difference (TD) learning can be applied for Q-value estimation with high learning efficiency [16]. In TD learning, the TD error is the difference between the TD target r_t + γQ^π(s_{t+1}, a_{t+1}) and the estimated Q-value Q^π(s_t, a_t). The loss function of the critic is the square of the TD error:

L(ω_t) = (r_t + γQ^π(s_{t+1}, a_{t+1}) − Q^π(s_t, a_t))².

To update the policy and Q-value, we use parameterized functions, i.e., ψ_{θ_t}(s_t), χ_{θ_t}(s_t), and Q_{ω_t}(s_t, a_t), to approximate π(a|s_t) and Q^π(s_t, a_t), where θ_t and ω_t are the parameters of the approximators. Based on the fundamental results of the policy gradient theorem [16], the gradients of J(θ_t) and L(ω_t) are given in (28) and (29), and the update rules for θ_t and ω_t follow from gradient ascent on J and gradient descent on L:

θ_{t+1} = θ_t + ρ∇_θ J(θ_t),  ω_{t+1} = ω_t − ρ∇_ω L(ω_t),

where ρ refers to the learning rate.
However, approximating Q^π(s_t, a_t) directly brings about a large variance in the gradient ∇_θ J(θ_t), resulting in poor convergence [30]. To reduce the variance, we estimate the state value V(s_t) instead of the Q-value. Based on TD learning and the parameterized V-value V_{ω_t}(s_t), the loss function of the critic can be expressed as

L(ω_t) = (r_t + γV_{ω_t}(s_{t+1}) − V_{ω_t}(s_t))² = δ_V(ω_t)²,

where the TD error δ_V(ω_t) provides an unbiased estimate of the advantage of action a_t [30]. Thus, we can rewrite Eqs. (28) and (29) as

∇_θ J(θ_t) = E[∇_θ log(π(a_t|s_t; θ_t)) Q^π(s_t, a_t)] = E[∇_θ log(π(a_t|s_t; θ_t)) δ_V(ω_t)].   (33)

In this paper, we apply DNNs as the approximators. The tuple {s_t, s_{t+1}, r_t, δ_V(ω_t)} is stored in a repository over the learning process. At each learning step, a batch of tuples is extracted as training data for parameter updating.
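A minimal linear-function-approximation version of these updates (Gaussian policy for the actor, V-value critic updated by the TD error) might look as follows. This is a sketch of the generic one-step AC update, not the paper's DNN implementation; all names and hyperparameters are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def td_error(r, v_next, v_now, gamma=0.99):
    """delta_V = r + gamma * V(s') - V(s): unbiased advantage estimate."""
    return r + gamma * v_next - v_now

class LinearActorCritic:
    """One-step actor-critic with linear approximators: the actor's policy
    is Gaussian with mean theta @ s, the critic's value is omega @ s."""
    def __init__(self, dim, lr=0.01, gamma=0.99, sigma=1.0):
        self.theta = np.zeros(dim)   # actor parameters
        self.omega = np.zeros(dim)   # critic parameters
        self.lr, self.gamma, self.sigma = lr, gamma, sigma

    def act(self, s):
        return rng.normal(self.theta @ s, self.sigma)

    def update(self, s, a, r, s_next):
        delta = td_error(r, self.omega @ s_next, self.omega @ s, self.gamma)
        # Critic: gradient descent on delta^2 (semi-gradient TD).
        self.omega += self.lr * delta * s
        # Actor: policy gradient; grad of log N(a; theta@s, sigma) w.r.t. theta.
        grad_log_pi = (a - self.theta @ s) / self.sigma**2 * s
        self.theta += self.lr * delta * grad_log_pi
        return delta
```

A single update with reward 1 and zero-initialized critic yields δ_V = 1 and nudges both parameter vectors along the state direction, matching the update rules above.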

The proposed ACGP and ACGOP
We first reformulate P1 by defining states, actions, and rewards such that an RL framework can be applied. Next, we propose two AC-based solutions, highlighting the differences from conventional AC-DRL and the tailored design for solving P1. In a learning episode, the learning steps range from t = 1 to t = t_e, where t_e represents the last step, at which some termination condition is reached. We set the termination conditions as: • The GUs' requests have been completed.
• The service runs out of time.
Based on the AC-DRL framework, we consider two schemes: (1) A straightforward learning approach ACGP, i.e., the agent makes decisions for all the variables. (2) A combined AC learning and simple optimization approach, i.e., ACGOP.
For ACGP, the system state s_t is jointly determined by the undelivered demands b_{k,t} and the number of remaining timeslots η_t. The undelivered demand b_{k,t} is the residual data to be transmitted to GU k at step t. The actions a_t in ACGP correspond to the decision variables in P1. When t = 1, the agent predicts the water-filling level, i.e., a_t = μ. The backhaul power p_bh(a_t) and backhaul transmission rate r_bh(a_t) can then be calculated by Eqs. (6) and (7). The backhaul energy is expressed as

e_bh(a_t) = τ_bh(a_t)(p_bh(a_t) + p_hov),   (36)

where τ_bh(a_t) = ⌈D/r_bh(a_t)⌉. When t = 2, ..., t_e, the agent makes the user scheduling decisions for the access network, with action a_t = g representing the index of the selected user group; the state then transitions as b_{k,t+1} = [b_{k,t} − d_{k,a_t}]^+ and η_{t+1} = η_t − 1. The reward function r_t is commonly related to the objective of the original problem. For example, r_t = −e_t is widely adopted for min-energy problems [31], where e_t is the energy consumed at step t, given by

e_t = e_bh(a_t) for t = 1, and e_t = e_{a_t} + e_hov for t = 2, ..., t_e.   (39)

Note that in the simulation, −e_t is treated as a benchmark; the tailored reward function for ACGP and ACGOP is given in (46). In ACGOP, we observe that once the user scheduling is fixed, the remaining backhaul power allocation becomes a computationally light single-variable optimization problem. Thus, the agent in ACGOP only takes actions for user scheduling, while the backhaul power is determined by an efficient golden-section search. Specifically, the state s_t remains the same as in ACGP. When t = 1, ..., t_e, the learning agent makes the user scheduling decision, i.e., a_t = g, with the analogous state transition. When a termination condition is reached, i.e., t = t_e, if η_t ≤ 0 the solution is infeasible; otherwise, η_t can be regarded as the number of timeslots available for backhaul transmission.
Since the user scheduling is obtained by the learning agent, the original problem reduces to a single-variable power control problem P3.
Lemma 1 states that the objective F(μ) of P3 is unimodal (see the proof in Appendix 7.1). Figure 3 illustrates the function graph of F(μ). Based on Lemma 1, the optimal value μ* can be quickly found by golden-section search [32]. After that, the backhaul energy e_bh(μ*) can be calculated by Eq. (36). The energy consumption at each time step is rewritten as

e_t = e_{a_t} + e_hov for t = 1, ..., t_e − 1, and e_t = e_{a_t} + e_hov + e_bh(μ*) for t = t_e.   (43)

We observe that conventional AC-DRL has limitations in dealing with P1. First, the decision variables in P1 are both continuous and discrete, so we need to map the stochastic policy in AC-DRL to the corresponding action spaces. Second, the action space is huge due to the combinatorial nature of P1; searching such a huge space may reduce learning efficiency and solution quality. Third, conventional AC-DRL may converge to an infeasible solution without a tailored reward design. In this paper, we propose a set of approaches to address these issues.
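Given Lemma 1, the golden-section search used to find μ* can be sketched as follows. This is a generic implementation for any unimodal function on an interval, such as F(μ) on (0, μ_max]; the function name is our own.

```python
import math

def golden_section_min(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden-section search,
    shrinking the bracket by the inverse golden ratio each iteration."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):        # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                   # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0
```

Each iteration needs only one new function evaluation after the bracket update is refactored to cache values, which is what makes this per-episode search cheap enough for ACGOP's online use.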

Action mapping
Denote â_t as the original action selected by the stochastic policy π(a|s_t). Since π(a|s_t) follows a Gaussian distribution, â_t is a continuous value on (−∞, +∞). We introduce two mapping functions: M_1(x), which maps x to the continuous interval of feasible water-filling levels with a positive scaling parameter κ, and M_2(x), which maps x to the discrete space G = {1, 2, ..., G}. To map â_t to the corresponding action space, we define the after-mapped action a_t per scheme; in particular, for ACGOP,

a_t = M_2(â_t), t = 1, ..., t_e,   (47)

while in ACGP the first action (the water-filling level) is mapped by M_1 and the remaining scheduling actions by M_2.
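One plausible realization of the two mappings is sketched below. The paper does not reproduce the exact forms of M_1 and M_2 here, so the sigmoid squashing, the values of μ_max and κ, and the ceiling-based discretization are all our own assumptions.

```python
import math

MU_MAX, KAPPA = 10.0, 1.0   # assumed maximum water level and slope parameter

def m1(x):
    """Assumed M1: squash a real sample into the continuous interval
    (0, MU_MAX) via a sigmoid with slope KAPPA."""
    return MU_MAX / (1.0 + math.exp(-KAPPA * x))

def m2(x, G):
    """Assumed M2: squash a real sample and discretize it onto {1, ..., G}."""
    return min(G, max(1, math.ceil(G / (1.0 + math.exp(-KAPPA * x)))))
```

Any monotone squashing with a bounded range would serve the same purpose: it keeps the Gaussian policy's unbounded samples inside the feasible action space.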

Action filtering
The size of the discrete space G increases exponentially with the number of users. To confine the action space, we eliminate a considerable number of redundant actions that bring no benefit to the reward. Specifically, the redundant actions refer to scheduling groups that contain demand-satisfied GUs. Therefore, at the beginning of each step, we take an action filtering operation to find which GUs' demands have been satisfied and remove the corresponding groups. As a result, the action space decreases gradually over the learning steps, thereby improving search efficiency and solution quality.
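The filtering step can be sketched as follows, assuming groups and residual demands are kept in plain dictionaries (our own data layout, not the paper's):

```python
def filter_actions(groups, residual_demand):
    """Keep only the groups whose every member still has unserved demand.
    groups: {group_index: set_of_user_ids}
    residual_demand: {user_id: bits_remaining}"""
    return {g: users for g, users in groups.items()
            if all(residual_demand[k] > 0 for k in users)}
```

For example, once user 1's demand hits zero, every candidate group containing user 1 disappears from the action space, so the policy can only choose among groups of still-unserved users.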

Reward design
All the constraints in P1 except (19b) can be met by properly defining actions and states. Constraint (19b) cannot be guaranteed because the commonly used reward function, i.e., r_t = −e_t, purely minimizes energy and does not take the GUs' demands into account. We therefore re-design a tailored reward function. First, if the learned policy is infeasible at the end of an episode, the agent receives a negative penalty −ζ [33]. Second, an extra reward ε Σ_{k=1}^{K} d_{k,a_t} is added to r_t; that is, the reward encourages the actor to deliver more data to meet the GUs' demands. However, transmitting more data results in more energy consumption; in this case, we can decrease the weight factor ε to control the energy growth. The re-designed reward combining these terms with −e_t is expressed in (46). In Alg. 2, we summarize the pseudo-code of ACGOP; the pseudo-code of ACGP is analogous. The significance of the proposed ACGOP and ACGP lies in their practical applicability. The optimization tasks in a UAV-aided communication system typically come with realistic constraints and strict computational delay requirements. Compared to offline optimization approaches, ACGOP and ACGP provide online learning and timely energy-saving solutions, achieving a good trade-off between solution quality and computation time. In addition, unlike conventional DRL methods, ACGOP combines AC learning and optimization to improve the solution quality.
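A sketch of the re-designed reward is given below. The structure (energy term, delivered-data bonus weighted by ε, terminal infeasibility penalty ζ) follows the description above, while the numerical weight values are our own assumptions.

```python
def reward(delivered_bits, energy, terminal, infeasible,
           eps=1e-10, zeta=100.0):
    """Tailored step reward: -energy plus a small bonus for delivered data,
    with a terminal penalty -zeta if the episode ends infeasible.
    eps and zeta are assumed weights, not the paper's values."""
    r = -energy + eps * delivered_bits
    if terminal and infeasible:
        r -= zeta
    return r
```

Shrinking ε trades off demand satisfaction against energy growth, exactly the knob described above; ζ must dominate any per-step energy term so that infeasible episodes are never preferred.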

Numerical results
In this section, we evaluate the performance of the proposed solutions and other three non-learning benchmarks: • Optimal approach (OPT): McCormick envelopes + B&B (refer to Section 3).
• Semi-orthogonal user scheduling-based heuristic algorithm (SUS-HEU) [34]: applying SUS for user scheduling and solving P3-backhaul for power allocation.
In addition, we simulate two conventional AC-DRL schemes based on [31] for performance comparison.

Parameter settings
The parameter setting is similar to that in [12]. Both the ABS and the UAV are equipped with L_t = L_r = 3 antennas. The backhaul channel matrix G and the access channel vector h_{k,g} are obtained by Eq. (1) and Eq. (10), respectively, with carrier frequency f_c = 2.4 GHz and path-loss exponent β = 2.6. In the access link, the GUs are randomly scattered and separated into N = 3 clusters. In each cluster, the number of GUs is up to K = 10. The GUs' demands are randomly selected from the set {3, 3.5, 4, 4.5, 5} Gbits. We assume the bandwidths for the ABS and the UAV are B_bh = 1 GHz and B_ac = 0.05 GHz [35]. The maximum water-filling level μ_max is set to 10 units. The UAV's hovering power p_hov and the GUs' transmit power p_{k,g} are 5 W and 2 W, respectively. The noise powers at the UAV, σ², and at the GUs, σ²_{k,g}, are −87.49 dB and −116.98 dB. The duration of a timeslot is set to 0.1 s. Two fully connected DNNs are employed as the actor and the critic. The adopted parameters in ACGOP and ACGP are summarized in Table 2.
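For reproducibility, the settings above can be collected into a single configuration object. The dictionary keys and the demand-sampling helper below are illustrative names, not from the paper.

```python
import random

# Simulation parameters from the setting above (key names are illustrative)
PARAMS = {
    "L_t": 3, "L_r": 3,                    # ABS / UAV antennas
    "f_c_GHz": 2.4, "beta": 2.6,           # carrier frequency, path-loss exponent
    "clusters_N": 3, "K_max": 10,          # clusters, max GUs per cluster
    "demand_set_Gbit": [3, 3.5, 4, 4.5, 5],
    "B_bh_GHz": 1.0, "B_ac_GHz": 0.05,     # backhaul / access bandwidth
    "mu_max": 10,                          # max water-filling level
    "p_hov_W": 5.0, "p_gu_W": 2.0,         # hovering / GU transmit power
    "sigma2_uav_dB": -87.49, "sigma2_gu_dB": -116.98,
    "timeslot_s": 0.1,
}

def sample_demands(k, params, seed=0):
    """Draw each GU's demand uniformly at random from the demand set (Gbits)."""
    rng = random.Random(seed)
    return [rng.choice(params["demand_set_Gbit"]) for _ in range(k)]

demands = sample_demands(PARAMS["K_max"], PARAMS)
print(len(demands))  # → 10
```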

Results and analysis
We compare the algorithms in terms of energy minimization and computation time.

Figure 4 shows the objective energy versus the number of users K. ACGOP has a 3.97% gap to the optimum, while for ACGP the gap increases to 10.27%. Prop-HEU obtains a near-optimal solution with a 1.61% average gap but requires much more computation time, e.g., see Fig. 6. SUS-HEU results in the highest energy consumption among all the schemes because its grouping strategy is ill-suited to energy saving. In addition, by averaging the results of the OPT algorithm, the sub-figure in Fig. 4 illustrates the proportions of communication and hovering energy, and the percentages of access and backhaul energy. The majority of the energy is consumed in serving the access links, while the backhaul energy accounts for around 25%, a non-negligible share. The communication energy consumed on the backhaul and access links accounts for 31% of the total energy, while the proportion of hovering energy is 69%.

Figure 5 demonstrates the total energy consumption with respect to T_max. When T_max increases from 14 s to 17 s, the energy consumption reduces by 10.43%, 12.34%, and 15.31% for Prop-HEU, ACGP, and ACGOP, respectively. This is because, in the access network, a small T_max may force more GUs to share the same timeslot, which increases inter-user interference as well as the precoding energy. On the other hand, in the backhaul network, the system needs to allocate more backhaul power to satisfy the backhaul constraint within a very limited time. When the transmission time is sufficient, e.g., T_max > 17 s, all the schemes reach their minimum-energy points.

Figure 6 compares the computation time with respect to K. The computation time refers to the time from giving inputs to an algorithm until receiving its results. From Fig. 6, the computation time of OPT and Prop-HEU grows exponentially with K.
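The percentage figures quoted above follow from two standard formulas; the sketch below makes them explicit (function names and the sample inputs are illustrative, not data from the paper).

```python
def relative_gap_pct(e_alg, e_opt):
    """Optimality gap (%) of an algorithm's energy relative to OPT."""
    return 100.0 * (e_alg - e_opt) / e_opt

def reduction_pct(e_before, e_after):
    """Energy reduction (%) when a constraint such as T_max is relaxed."""
    return 100.0 * (e_before - e_after) / e_before

print(round(relative_gap_pct(103.97, 100.0), 2))  # → 3.97
print(round(reduction_pct(100.0, 89.57), 2))      # → 10.43
```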
When K = 10, their computation time reaches 11 s and 90 s, respectively. ACGOP, ACGP, and SUS-HEU can provide online solutions by applying the learned DRL policy or the low-complexity SUS strategy, avoiding directly solving complex optimization problems and thereby saving tenfold to hundredfold computation time compared with OPT and Prop-HEU. The average computation times of the three algorithms are relatively close. However, recalling the energy-saving performance, ACGOP saves 8.21% and 15.28% energy compared to ACGP and SUS-HEU, respectively.

Figures 7 and 8 illustrate the impact of different learning rates ρ on the convergence and feasibility of ACGOP. From Fig. 7, we can observe that the objective energy converges over the learning episodes. Convergence with ρ = 10^-3 is faster than with ρ = 10^-4, whereas when ρ increases to 10^-2, the curve fluctuates strongly and the converged energy is higher than that of ρ = 10^-3 and ρ = 10^-4. Figure 8 depicts the total transmitted data over the learning episodes. When ρ = 10^-3 and ρ = 10^-4, the two curves overlap and the converged solutions of both are feasible, i.e., the transmitted data equal the demands. For ρ = 10^-2, however, feasibility cannot be guaranteed. Therefore, to achieve a fast learning speed while ensuring feasible solutions, the learning rate needs to be appropriately selected. We take ACGOP as an example; ACGP shows the same tendency.

Figures 9 and 10 compare the proposed solutions with conventional AC-DRL. From Fig. 9, ACGOP, ACGP, and conventional AC-DRL with the reward in [31] and action filtering have similar performance in energy minimization. Conventional AC-DRL with the reward in Eq. (48) but without action filtering performs poorly, with slow convergence and high converged energy. Moreover, Fig. 10 demonstrates that neither of the conventional AC-DRL schemes can guarantee feasibility.
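The computation-time metric used in Fig. 6 (time from giving inputs until receiving results) can be measured with a simple wall-clock wrapper. This is a generic sketch, not the paper's measurement code; `sorted` stands in for an actual algorithm.

```python
import time

def timed(algorithm, *args):
    """Wall-clock computation time: from handing the inputs to an
    algorithm until receiving its results."""
    t0 = time.perf_counter()
    result = algorithm(*args)
    return result, time.perf_counter() - t0

# Example: timing a stand-in "algorithm"
result, elapsed = timed(sorted, [3, 1, 2])
print(result)        # → [1, 2, 3]
print(elapsed >= 0)  # → True
```

Using `time.perf_counter` rather than `time.time` avoids clock adjustments skewing short measurements.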
The reason is that the reward in [31] purely minimizes energy and does not take the GUs' demands into account.

Consider the breakpoints between adjacent intervals. We define φ_l(μ) = μ(log₂(μ) + b_l) and ϕ_l(μ) = μ − a_l + c_l. The first and second derivatives of f_l(μ) are given in Eq. (50). Based on Eq. (50), we can characterize the point μ*_l that satisfies ln 2 · φ_l(μ*_l) = ϕ_l(μ*_l). Since the slope of the l-th segment is larger than that of the (l+1)-th, we can derive that a_l < a_{l+1}, b_l < b_{l+1}, c_l > c_{l+1}, and μ*_l > μ*_{l+1} by the graphical method, as shown in Fig. 12.
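From the stated definitions of φ_l and ϕ_l, their derivatives follow directly (a sketch for these two auxiliary functions only; f_l itself and Eq. (50) are not reproduced here):

```latex
\varphi_l(\mu) = \mu\bigl(\log_2 \mu + b_l\bigr), \qquad
\phi_l(\mu) = \mu - a_l + c_l,
\]
\[
\varphi_l'(\mu) = \log_2 \mu + b_l + \frac{1}{\ln 2}, \qquad
\varphi_l''(\mu) = \frac{1}{\mu \ln 2}, \qquad
\phi_l'(\mu) = 1, \qquad \phi_l''(\mu) = 0 .
```

Since φ_l is strictly convex in μ (its second derivative is positive for μ > 0) while ϕ_l is affine, the condition ln 2 · φ_l(μ*_l) = ϕ_l(μ*_l) admits the graphical comparison used in Fig. 12.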