Positioning and power optimisation for UAV-assisted networks in the presence of eavesdroppers: a multi-armed bandit approach

Unmanned aerial vehicles (UAVs) are becoming increasingly attractive for meeting the ambitious expectations of 5G and beyond networks owing to their several benefits. At the same time, UAV-assisted communications introduce a new range of challenges and opportunities regarding the security of these networks. Thus, in this paper we explore the opportunities that UAVs can provide for physical layer security solutions. Particularly, we analyse the secrecy performance of a ground wireless communication network assisted by N friendly UAV jammers in the presence of an eavesdropper. To characterise the secrecy performance of this system, we introduce a new area-based metric, the weighted secrecy coverage (WSC), which measures the improvement in the secrecy performance of a system over a certain physical area brought by the introduction of friendly jamming. Herein, the optimal 3D positioning of the UAVs and the power allocation are addressed in order to maximise the WSC. For that purpose, we provide a reinforcement learning-based solution by modelling the positioning problem as a multi-armed bandit problem over three positioning variables for the UAVs: angle, height and orbit radius. Our results show that the proposed algorithm improves the secrecy of the system over time in terms of the WSC, and that it converges to a stable state close to the exhaustive search solution for discretised actions, where there is a trade-off between how quickly the UAVs move to positions of better secrecy outcome and the energy consumption.

assist terrestrial communications within the so-called UAV-assisted communications. In this way, the advantageous characteristics of UAV-assisted communications, such as on-demand deployment, low cost, flexibility in network reconfiguration, and high chance of line-of-sight (LoS) links, promote the emergence of a number of novel use cases and applications in different contexts such as disaster areas, smart agriculture, traffic control, search and rescue, and package delivery, among others [3,4].
Nonetheless, with all the expected technological and architectural progress for 6G, especially with the integration of artificial intelligence (AI) into the network operation and management and the advancements of quantum computing with its potential to break pre-quantum cryptographic methods, security becomes a highly critical aspect in order to guarantee the levels of resilience and reliability planned for 6G [1]. Physical layer security (PLS) has attracted increased attention as a mechanism to provide more robust and quantum-resistant protection to wireless networks by relying on the unique physical properties of the random and noisy wireless channels to enhance confidentiality in a flexible and adaptive manner. Thus, PLS can find a new horizon in the 6G era, especially for the constrained scenarios of Internet of things (IoT) applications [5,6].
Under these circumstances, UAVs can also be exploited for the design of secure solutions in UAV-assisted communications via PLS; thus, the challenges and opportunities for preventing passive and active attacks in wireless networks have been recently discussed in [7]. On the one hand, UAV-assisted communications are more vulnerable to eavesdropping and jamming attacks than communication between ground nodes, due to their strong LoS links; on the other hand, UAVs can also be used to launch more effective attacks [7]. Therefore, there is a vast research area to be explored for providing secure wireless communications in the UAV era, and some solutions have already been reported in the literature [8-16].
Particularly, the introduction of UAV nodes acting as friendly jammers in order to improve the secrecy performance of wireless networks has recently attracted special attention. For instance, in [8], an optimal three-dimensional (3D) deployment and jamming power of a UAV-based jammer was proposed to improve the secrecy performance of a legitimate transmission between a pair of nodes for an unknown eavesdropper location. In [9], the secrecy outage probability (SOP) of a UAV-based mmWave relay network in the presence of multiple eavesdroppers is investigated. Two scenarios are considered, with and without cooperative jamming, which is introduced via the destination and an external UAV. In [10], the authors studied the secrecy performance of a non-orthogonal multiple access (NOMA)-based scheme in a mmWave UAV-assisted wireless network by considering a protected-zone approach. In [11], the existence of an optimal UAV jammer location in a network with multiple eavesdroppers was proved, and the impact of the density of eavesdroppers, the transmission power of the UAV jammer and the density of UAV jammers on the optimal location was investigated.
Moreover, the joint optimisation of the transmit power and the trajectory of a UAV-based friendly jammer in a three-dimensional space was investigated in [12]. Therein, the problem of maximising the average achievable secrecy rate of the secondary system was investigated for a cognitive relay network by considering imperfect location information of the ground nodes, that is, the eavesdropper, the secondary receiver and the primary receiver. Also in [13], the secrecy rate maximisation problem of a mobile user over all time slots is studied by considering a dual UAV-enabled secure communication system, where one UAV sends confidential information, while the other serves as a friendly jammer. Both UAVs are considered to be energy-constrained devices, and the location information of eavesdroppers is imperfect. Therein, the optimisation problem is solved by jointly designing the 3D trajectory of the UAVs and the time allocated for recharging and for sending jamming signals, under constraints such as the maximum UAV speed, UAV collision avoidance, UAV positioning error and UAV energy harvesting.
More recently, machine learning (ML) approaches have been considered in order to tackle the intricacy of the optimisation problems related to UAV-assisted scenarios, where a number of variables are coupled and the complexity of the problems would otherwise lead to exhaustive searches or complex operations. Particularly, in [14], a deep reinforcement learning (RL) algorithm is proposed to jointly optimise the active beamforming of the UAV, the coefficients of the reflective intelligent surface (RIS) elements, and the UAV trajectory, in order to maximise the sum secrecy rate of the legitimate users in the presence of multiple eavesdroppers in a mmWave UAV communication assisted by a RIS under imperfect channel state information (CSI). Besides, in [15], a deep learning method is employed to optimise a 3D beamformer for the transmission of the confidential signal and friendly jamming in order to maximise the average secrecy rate by considering partial CSI of the legitimate UAV and the eavesdropping UAV. Also, the authors in [16] considered UAV jammers assisting a legitimate transmission between a UAV and ground nodes in the presence of ground eavesdroppers. Therein, a multi-agent deep RL approach was used to maximise the secure capacity by jointly optimising the trajectory of the UAVs, the transmit power of the UAV transmitter and the jamming power of the UAV jammers.
All in all, the employment of friendly jamming has been widely accepted as an effective way to enable confidential transmissions in wireless networks. However, the effectiveness of friendly jamming schemes is in most cases tied to perfect or partial knowledge of the CSI of the legitimate and eavesdropping links, which is hard to obtain in practice. To characterise the effectiveness of friendly jamming in wireless networks, the authors in [17] proposed two novel area-based metrics, the jamming coverage and the jamming efficiency, in order to provide insights into the design of optimal jamming configurations by considering different levels of CSI knowledge. Later, in [18], we considered a UAV-assisted friendly jamming scheme in a wireless network in the presence of eavesdroppers. Based on the area-based metrics in [17], a novel metric, the weighted secrecy coverage (WSC), was proposed to give a better insight into the impact of friendly jamming. Thus, the optimal positioning of two UAV jammers was tackled in order to maximise the WSC. Further, in [19], we proposed a zero-forcing precoding scheme for the two friendly UAV jammers in order to enhance the efficiency of the friendly jamming, thus enhancing the WSC.
Inspired by [17] and based on [18], we advance the state of the art by studying a UAV-assisted wireless network, where a number of UAVs assist a legitimate communication between a pair of ground nodes in a confined region in a fading environment, and the 3D positioning of the UAVs is optimised in order to maximise the WSC metric. For that purpose, we model our optimisation problem as a multi-armed bandit (MAB) problem and provide an RL-based solution. Thus, the contributions of the paper are the following:
• We derive an expression for the SOP of the proposed system, with Rayleigh fading ground channels and air-to-ground (A2G) channels with a Rician fading LoS component and a Rayleigh fading NLoS component, for a wireless wiretap channel with N friendly UAV jammers.
• We propose a time frame-based algorithm to optimise the WSC of the system as three independent multi-armed bandit problems, one for each positioning variable of the UAVs.
• Monte Carlo simulations are performed to validate our theoretical expressions and to evaluate the performance of the algorithm in terms of the WSC and the energy consumption of the system, which show a steady convergence to an optimal result.
The remainder of the paper is organised as follows. In Sect. 2.1, the investigated system model is presented. In Sect. 2.2, the considered secrecy metrics are introduced, namely the secrecy capacity, SOP, secrecy impact of jamming metric, and the WSC metric. In Sect. 2.3, the formulation of the WSC maximisation problem is shown. In Sect. 3, the numerical results are presented. Finally, in Sect. 4, the conclusions of this work are presented.

Notation: Throughout this paper, and unless stated otherwise, $|\cdot|$ denotes the absolute value; $\mathbb{E}_x[\cdot]$ denotes the expectation over the random variable $x$ (if $x$ is omitted, the argument is taken as the random variable); $\mathcal{N}(x, \sigma^2)$ denotes a normal distribution of mean $x$ and variance $\sigma^2$; $\Pr[\cdot]$ denotes probability; $F_x(\cdot)$ is the cumulative distribution function (CDF) of the random variable $x$, while $f_x(\cdot)$ is its probability density function (PDF); and $\iint_S$ is a double integral over the surface $S$.

System model
Consider the system illustrated in Fig. 1, comprising a legitimate pair of ground nodes, the transmitter Alice (A) and the receiver Bob (B), who establish an open wireless link to send private information from A to B. They are confined to a circular area $S$ of radius $R_A$ around A. Within $S$, an illegitimate node Eve (E) is present, trying to intercept the information from the legitimate transmission shared through the wireless medium. It is assumed that E is a passive eavesdropper located within the region $S$, but its exact position and available resources are unknown. Without loss of generality, A is located at the origin of coordinates $(0, 0, 0)$ and B is located along the x-axis at $(d_{AB}, 0, 0)$.
To improve the secrecy performance of the system, $N$ UAVs, $\{J_i\}_{i \in \{1,\dots,N\}}$, are deployed to act as friendly jammers by emitting pseudorandom noise isotropically in order to prevent E from intercepting information. The jammers are positioned at a common height $z_J$ and within a circular orbit of radius $R_J$ around A, at angular positions $\theta_{J_i}$ with $i \in \{1,\dots,N\}$. We assume that the estimate of the radial position of B with respect to A is unreliable; thus, we model the estimated distance between A and B as a Gaussian random variable whose mean is the actual distance $d_{AB}$ (unbiased estimate), with a given uncertainty $\sigma_{AB}$, $$\hat{d}_{AB} \sim \mathcal{N}(d_{AB}, \sigma_{AB}^2),$$ where $\hat{d}_{AB}$ is the estimate of the distance between A and B.

Ground channels
There are two ground channels to consider between ground nodes, one between A and B and the other between A and E. Both channels are considered to undergo Rayleigh fading and are subject to additive white Gaussian noise (AWGN) with mean power $N_0$. Then, the corresponding channel coefficients are $h_{AB}$ and $h_{AE}$, and the respective channel gains are $|h_{AB}|^2$ and $|h_{AE}|^2$. For a node $U \in \{B, E\}$, the channel coefficient $h_{AU}$ is an independent circularly symmetric complex Gaussian random variable with a channel gain of $g_{AU} = |h_{AU}|^2$ with a scale parameter of where $d_{AU}$ is the distance between A and node $U$, $\alpha_G$ is the path loss exponent for the ground links and $\gamma_A$ is the transmit SNR of A given by $\gamma_A = P_A/N_0$, with $P_A$ as the transmit power of A.

Air-to-ground channels
There are two air-to-ground channels for each UAV jammer, one between the UAV and B and the other between the UAV and E. The channel coefficients for those links are given by The propagation path loss for the A2G channels presents a contribution from a LoS component and a non-LoS (NLoS) component, where the contribution of each component to the overall path loss is determined by the probabilities P LoS and P NLoS , respectively [20]. These probabilities are functions of the UAV position with respect to the ground node of interest U and are given by [20] (1) P LoS = 1

Signal analysis
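For the communication process, A sends a symbol $x$ with mean power $\mathbb{E}[|x|^2] = 1$, while the UAVs send pseudorandom symbols $s_i$ with mean power $\mathbb{E}[|s_i|^2] = 1$, with $i \in \{1,\dots,N\}$. We consider a common noise level with power $\mathbb{E}[|w|^2] = N_0$ at every node in the system. Thus, the received signal at both B and E is, respectively, given by $$y_U = \sqrt{P_A}\, h_{AU}\, x + \sum_{i=1}^{N} \sqrt{P_{J_i}}\, h_{J_i U}\, s_i + w,$$ with $U \in \{B, E\}$, where $P_{J_i}$ is the jamming power of $J_i$. Then, the instantaneous received signal-to-interference-plus-noise ratio (SINR) at node $U$ can be expressed as $$\gamma_U = \frac{\gamma_A\, g_{AU}}{1 + \sum_{i=1}^{N} \gamma_{J_i}\, g_{J_i U}},$$ where $\gamma_{J_i} = P_{J_i}/N_0$ is the transmit SNR of $J_i$. For the particular case with no UAV jammers, the SINR values at B and E are, respectively, given by $\gamma_B = \gamma_A g_{AB}$ and $\gamma_E = \gamma_A g_{AE}$.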
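As a small numerical illustration of the signal analysis, the sketch below evaluates an SINR of the usual interference-limited form, $\gamma_U = \gamma_A g_{AU} / (1 + \sum_i \gamma_{J_i} g_{J_i U})$, for one channel realisation. This form is an assumption chosen to be consistent with the no-jammer case $\gamma_B = \gamma_A g_{AB}$; the function name `sinr` and its arguments are hypothetical and not taken from the paper.

```python
def sinr(gamma_a, g_au, gamma_j, g_ju):
    """SINR at node U for one channel realisation (assumed form).

    gamma_a : transmit SNR of A (linear scale)
    g_au    : channel gain between A and U
    gamma_j : list of transmit SNRs of the UAV jammers (linear scale)
    g_ju    : list of channel gains between each jammer and U
    """
    if len(gamma_j) != len(g_ju):
        raise ValueError("one channel gain is needed per jammer")
    # Jamming acts as interference on top of the unit-normalised noise.
    interference = sum(gj * g for gj, g in zip(gamma_j, g_ju))
    return gamma_a * g_au / (1.0 + interference)


# Without jammers the expression reduces to gamma_a * g_au.
no_jamming = sinr(10.0, 0.5, [], [])
# Two jammers with equal transmit SNR and unit channel gains.
with_jamming = sinr(10.0, 0.5, [2.0, 2.0], [1.0, 1.0])
```

With an empty jammer list the function reduces to the no-jammer expression in the text, which is a convenient sanity check.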

Performance analysis
As previously mentioned, E is located within a circular area $S$ around A, but no further knowledge of the exact position of E is assumed, i.e. E can be at any point inside $S$. Therefore, to evaluate the secrecy performance of the proposed system, we consider the area-based secrecy metrics proposed in [17], namely jamming coverage (JC) and jamming efficiency (JE), and a new hybrid metric, the WSC, introduced in [18]. The definition of these metrics is based on the SOP, which is derived for the proposed system as described below.

Secrecy outage probability
For the definition of the area-based secrecy metrics, we consider first the SOP [6], defined as $$\mathrm{SOP} = \Pr[C_S < R_S],$$ where $R_S$ is the chosen rate for a secrecy code and $C_S$ is the secrecy capacity, which for our system is given by $$C_S = [C_B - C_E]^+ = [\log_2(1 + \gamma_B) - \log_2(1 + \gamma_E)]^+,$$ where $C_B$ and $C_E$ are the capacities of the channels between A and B and between A and E, respectively, with $[X]^+ = \max[X, 0]$, which tells us that if the capacity of the illegitimate channel is greater than the capacity of the legitimate channel, no secrecy can be achieved.
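To make the SOP definition concrete, the following sketch estimates it by Monte Carlo for the no-jamming case only, assuming $\mathrm{SOP} = \Pr[C_S < R_S]$ with $C_S = [\log_2(1+\gamma_B) - \log_2(1+\gamma_E)]^+$. Under Rayleigh fading the channel gains are exponentially distributed; their means (`mean_g_ab`, `mean_g_ae`) are hypothetical inputs here, since the scale parameters of the paper are not reproduced in this sketch.

```python
import math
import random


def secrecy_capacity(gamma_b, gamma_e):
    """Secrecy capacity [C_B - C_E]^+ in bits per channel use."""
    return max(math.log2(1.0 + gamma_b) - math.log2(1.0 + gamma_e), 0.0)


def sop_monte_carlo(gamma_a, mean_g_ab, mean_g_ae, r_s, trials=100_000, seed=0):
    """Estimate Pr[C_S < R_S] without jamming.

    Rayleigh fading is modelled by drawing exponentially distributed
    channel gains with the given (assumed) means.
    """
    rng = random.Random(seed)
    outages = 0
    for _ in range(trials):
        g_ab = rng.expovariate(1.0 / mean_g_ab)
        g_ae = rng.expovariate(1.0 / mean_g_ae)
        if secrecy_capacity(gamma_a * g_ab, gamma_a * g_ae) < r_s:
            outages += 1
    return outages / trials
```

Because the same seed reproduces the same channel draws, calling `sop_monte_carlo` with increasing `r_s` gives a non-decreasing estimate, in line with the SOP growing with the secrecy rate.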

Secrecy improvement metric
This metric measures the improvement in the secrecy performance of the proposed system, as measured by the SOP, attained by the introduction of the friendly jamming sent by the UAV jammers. Thus, this metric is given by [17] $$\Delta = \frac{\mathrm{SOP}_{NJ}}{\mathrm{SOP}_{J}},$$ where the SOP subscript identifies whether the SOP is computed with (J) or without (NJ) the presence of friendly jamming. Then, $\Delta > 1$ values imply a reduction of the SOP by the presence of the UAV jammers, while $\Delta < 1$ implies the opposite. For mathematical tractability purposes, in [18] we proposed an analogous secrecy improvement metric that conveys the same general idea with the criterion of secrecy achievement ($1 - \mathrm{SOP}$) instead of the SOP, thus given by the ratio $(1 - \mathrm{SOP}_{J}) / (1 - \mathrm{SOP}_{NJ})$. The SOP without jamming, $\mathrm{SOP}_{NJ}$, is obtained in closed form in [18], while the SOP including jamming, $\mathrm{SOP}_{J}$, is obtained as in Proposition 1.
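The two improvement ratios described above can be compared numerically with a short sketch. It assumes the SOP-based metric of [17] is the ratio $\mathrm{SOP}_{NJ}/\mathrm{SOP}_{J}$ and that the analogous metric of [18] is $(1-\mathrm{SOP}_{J})/(1-\mathrm{SOP}_{NJ})$, as suggested by the description in terms of secrecy achievement; both function names are hypothetical.

```python
def sop_improvement(sop_nj, sop_j):
    """Ratio of SOP without jamming to SOP with jamming (assumed form)."""
    if sop_j <= 0.0:
        raise ValueError("SOP with jamming must be positive")
    return sop_nj / sop_j


def secrecy_achievement_improvement(sop_nj, sop_j):
    """Ratio of secrecy achievement (1 - SOP) with and without jamming."""
    if sop_nj >= 1.0:
        raise ValueError("SOP without jamming must be below one")
    return (1.0 - sop_j) / (1.0 - sop_nj)


# Jamming that halves the SOP from 0.5 to 0.25 improves both metrics.
ratio_sop = sop_improvement(0.5, 0.25)
ratio_achievement = secrecy_achievement_improvement(0.5, 0.25)
```

Both ratios exceed one exactly when jamming lowers the SOP, so they agree on where jamming helps, although their magnitudes differ.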

Proposition 1
The SOP in the presence of $N$ UAV jammers, $\mathrm{SOP}_J$, for the proposed system is given by

The SOP in (12) can be extended for the channel in (5), with both LoS and NLoS components, by considering , which implies doubling the number of terms in the sums and products in (13) and (14). The Rayleigh NLoS parameters are adapted from Rician channels by setting the shape parameters to zero.

Proof Let us consider first the case with 2 UAVs and a LoS connection between the UAVs and the ground nodes. Under these conditions, $g_{J_i U} = g^{LoS}_{J_i U}$, and the scale parameters likewise reduce to their LoS values. For that case, the PDF and CDF of the effective A2G channel gains $g_{J_i U}$ are given by [23]

where $I_0(\cdot)$ is the zero-order modified Bessel function of the first kind and $Q_1[\cdot]$ is the Marcum-Q function of order 1. Additionally, the PDF and CDF of the ground channels $g_{AU}$ are given by

Therefore, the CDF of $\gamma_U$ is obtained as

while the PDF is derived from the CDF as

To simplify the notation, in the following steps $g_A$ is used for $g_{AU}$, $g_i$ for $g_{J_i U}$ and $K_i$ for $K_{J_i U}$, and the scale parameter of the link between $J_i$ and $U$ is likewise indexed by $i$ only. Thus, by considering [24, 8.445], the term $I_0(\cdot)$ in (16) can be rewritten as its series representation as

then, by defining $\eta_i$, (16) can be rewritten as

Then, replacing (24) and (23) into (20) leads to

Then, by plugging (28) in (26) and (29) in (27), $I_1$ and $I_2$ can be, respectively, expressed as

Finally, (25) can be expressed as

To compute the PDF in (16), a process similar to that of the CDF calculation is followed, by considering that

Thus, the PDF can be obtained as

It is worthwhile to note that the integrals in (26) and (27) can be separated into independent terms for each UAV. Therefore, the CDF and PDF for the general case of $N$ UAVs can be obtained as in (13) and (14), respectively.
Then, the SOP is calculated as

Weighted secrecy coverage
As mentioned before, we assume no knowledge of the position of E, other than it is located inside the circular region $S$ within a radius $R_A$ from A, so we analyse the secrecy performance of the proposed system in terms of the area-based metrics in [17], the jamming coverage (JC) and the jamming efficiency (JE). Both of these metrics give a notion of the effect that the presence of the UAV jammers has on the secrecy performance inside $S$.
For the JC, consider that E is located at a single point within the area $S$, where a certain $\Delta$ value can be calculated; we are interested in the points that lead to a $\Delta > 1$ value. Then, the jamming coverage is the integral over the area where $\Delta > 1$, expressed as $$\mathrm{JC} = \iint_{\Delta > 1} dS_E, \tag{34}$$ where the $dS_E$ term indicates an integral over the positions of E over the whole area $S$. To illustrate this concept, Fig. 2 shows a simplified overview of the system as a heatmap of $\Delta$ over the whole area $S$. The JC would be the total area where $\Delta > 1$, which is enclosed by the yellow line surrounding the UAVs and A.
On the other hand, JE measures the average improvement in the secrecy over the whole area $S$: $$\mathrm{JE} = \frac{1}{|S|} \iint_{S} \Delta \, dS_E, \tag{35}$$ where $|S|$ is the area of the region $S$. Note that JC gives a measure of the area within $S$ where an improvement in the secrecy performance of the system is obtained due to the UAV jammers, while JE gives a measure of the average improvement in the secrecy performance over the area $S$, if E were located at a random point.
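The two area-based metrics lend themselves to a simple numerical approximation. The sketch below integrates over the disc $S$ with a midpoint rule on a polar grid, taking JC as the area where the improvement metric exceeds one and JE as its average over $S$, as described above. The callable `delta(x, y)`, which returns the improvement metric for an eavesdropper at $(x, y)$, and the grid sizes are hypothetical; the WSC itself is not computed here.

```python
import math


def jamming_coverage_and_efficiency(delta, radius, n_r=200, n_phi=360):
    """Approximate JC and JE over a disc of the given radius around A.

    delta : callable (x, y) -> improvement metric at Eve's position
    Returns (jc, je): area where delta > 1, and mean of delta over the disc.
    """
    d_r = radius / n_r
    d_phi = 2.0 * math.pi / n_phi
    jc = 0.0
    weighted_sum = 0.0
    for i in range(n_r):
        r = (i + 0.5) * d_r            # midpoint of the radial cell
        cell_area = r * d_r * d_phi    # polar area element r dr dphi
        for k in range(n_phi):
            phi = (k + 0.5) * d_phi    # midpoint of the angular cell
            value = delta(r * math.cos(phi), r * math.sin(phi))
            if value > 1.0:
                jc += cell_area
            weighted_sum += value * cell_area
    return jc, weighted_sum / (math.pi * radius ** 2)


# Toy map: secrecy improves only when Eve lies in the half-plane x > 0.
jc, je = jamming_coverage_and_efficiency(
    lambda x, y: 2.0 if x > 0 else 0.5, radius=1.0, n_r=50, n_phi=72)
```

For this toy map the coverage is half of the disc and the efficiency is the average of 2 and 0.5, which gives a quick check of the integration.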
To get further insights on the jamming effective coverage, in [18] we proposed a hybrid metric, the WSC, to account for both, the area over which secrecy is improved and the average secrecy improvement over the whole area S. The WSC is given by

Positioning optimisation
In this section, we consider the joint optimisation of the 3D positioning of the UAVs (common height, common orbit radius and angles around A) and the power allocation between the UAVs in order to maximise the WSC, given a relative position of B with respect to A, which is characterised by $d_{AB}$. Thus, the optimisation problem is formulated as where $z_{MIN}$ is the minimum flying height, $z_{MAX}$ is the maximum allowed flying height for the UAVs, $R_{MAX}$ is the limit of the orbit radius around A, which is the radius of $S$, and $\gamma_T$ is the maximum jamming transmit SNR from all UAVs.
To simplify the optimisation problem in (40), some trends regarding the angular positioning and the allocated jamming power are considered, as observed in [18] for the case of two UAVs. In that work, it was found that locating both UAVs symmetrically behind the line between A and B leads to the optimal performance; thus, this trend is generalised to the case of $N$ UAVs by considering a single opening angle $\theta_J$ between any pair of adjacent, symmetrically located UAVs, as shown in Fig. 1. It was also proved that the WSC is maximised by an equal power allocation among the friendly jammers, which is likewise generalised to the case of $N$ UAVs.
Under these observations, the optimisation problem in (40) can be reformulated as where only three optimisation variables are considered, namely the opening angle $\theta_J$, the UAV common height $z_J$ and the UAV surveillance orbit radius $R_J$.
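The reduced parameterisation can be illustrated with a short sketch that maps the three variables to UAV coordinates and to a per-UAV transmit SNR. The exact angular layout of Fig. 1 is not reproduced in the text, so the sketch assumes the UAVs are spread symmetrically about the direction opposite to B (angle $\pi$ from the x-axis), with adjacent UAVs separated by $\theta_J$; this layout, the function name `uav_layout` and its arguments are assumptions for illustration.

```python
import math


def uav_layout(n_uavs, theta_j, z_j, r_j, gamma_t):
    """Return UAV positions and the per-UAV transmit SNR.

    Assumed layout: angles centred on pi (behind A as seen from B),
    with adjacent UAVs separated by the opening angle theta_j.
    The total jamming SNR gamma_t is split equally among the UAVs.
    """
    positions = []
    for i in range(n_uavs):
        angle = math.pi + (i - (n_uavs - 1) / 2.0) * theta_j
        positions.append((r_j * math.cos(angle), r_j * math.sin(angle), z_j))
    return positions, gamma_t / n_uavs


# Two UAVs with a 90-degree opening angle, 30 m high, 20 m orbit radius.
positions, gamma_per_uav = uav_layout(2, math.pi / 2.0, 30.0, 20.0, 10.0)
```

With two UAVs this places them at mirror-image positions about the x-axis, on the side of A opposite to B, each using half of the total jamming SNR.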

Reinforcement learning-based positioning
Given that the estimate of the distance from A to B is unreliable, the optimisation problem in (41) cannot be reliably solved directly. To account for the stochastic nature of the estimate of the distance to B, $\hat{d}_{AB}$, we consider a coordinate-descent-based [25] iterative scheme to solve the optimisation problem in (41) by employing an RL approach to ascertain the optimum positioning for the UAVs around A. Particularly, we model this problem as a multi-armed bandit (MAB) problem, by considering the discrete values of the positioning variables as the arms or actions, and the WSC reading obtained at each step as the values or rewards. In the following, we briefly introduce the basis of the MAB problem and some relevant RL concepts to help us explain our approach.

Multi-Armed Bandit Problem [26] An MAB problem consists of an agent (bandit) which has to choose at each time step among a set of actions (arms) to obtain rewards. At each step, the chosen action provides a reward, which is a random variable with a given distribution per action. The goal of the agent is to maximise the reward obtained over time, which can be understood as choosing the optimum action, that is, the action with the highest expected reward; this is called exploitation. This is done by keeping estimates of the expected reward of each action. At the same time, it is also of interest to keep learning about the other actions in order to refine the estimates for each of them, which is called exploration. The action chosen at each step is determined by a policy, which in part sets the exploration/exploitation balance to be taken. An illustrative example of this learning process is shown in Fig. 3.
Considering the optimisation problem in (41), we have three positioning variables: the opening angle of adjacent UAVs behind A ($\theta_J$), the common height of the UAVs ($z_J$) and the orbit radius of the UAVs around A ($R_J$). Each variable is separated into its own RL process, independent of the other two. For each positioning variable, we define its possible actions as a range of values the variable can take, which are given by the constraints in (41), and a discretised number of actions per variable ($N_\theta$, $N_z$, $N_R$). Each action of a variable has a reward distribution, which corresponds to the distribution of WSC values obtained by performing that action. The goal is to be able to estimate with high accuracy which of the actions has the greatest expected reward. At each step, one of the actions is chosen following a policy, and the received WSC reward is processed to contribute to the estimation of the expected reward (WSC) for said action.
To simplify the computations, we perform three separate RL processes, one for each positional variable with its own action range discretisation. The RL loops for each of the variables are repeated back to back, alternating between the variables.
For each RL step of a given positioning variable, an assumption needs to be made regarding the other two positioning variables. The natural way of choosing the values of the other two positioning variables is to choose them in a greedy fashion, i.e. to choose the values that are estimated thus far to lead to the highest reward. This implies that, for any of the positioning variables, the RL process being carried out is non-stationary, since the values of the other positioning variables, which are considered part of the environment, change during the process, thus changing the environment. To account for the non-stationarity of the RL processes, consider the following generic estimate update rule [26]: $$Q_{n+1} = Q_n + \alpha_n \left[ R_n - Q_n \right],$$ where $Q_n$ is a generic action reward estimate at time $n$, $R_n$ is the observed reward at time $n$ and $\alpha_n$ is the so-called step size at time $n$, which controls the contribution of the observed data to the estimate at time $n$. If all observed rewards are to contribute evenly to the estimate, we set $\alpha_n = 1/n$. However, in a non-stationary environment, we may want to give a higher weight to new observations than to past observations, so that the RL process is more sensitive to environmental changes. To accomplish this, we set $\alpha_n = \alpha$ to be a constant for all $n$, such that $0 < \alpha < 1$ [26].
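The difference between the two step-size choices can be seen in a few lines. The sketch below uses the standard incremental update $Q_{n+1} = Q_n + \alpha_n (R_n - Q_n)$ from [26]; the reward sequences are made-up numbers used only for illustration.

```python
def update_estimate(q, reward, step_size):
    """One incremental update of an action-value estimate."""
    return q + step_size * (reward - q)


# Sample-average step size 1/n: every reward contributes evenly,
# so the estimate equals the arithmetic mean of the rewards seen so far.
q_average = 0.0
for n, reward in enumerate([1.0, 2.0, 3.0, 6.0], start=1):
    q_average = update_estimate(q_average, reward, 1.0 / n)

# Constant step size: recent rewards dominate, which helps the estimate
# follow a reward level that changes over time (non-stationary case).
q_constant = 0.0
for reward in [0.0, 0.0, 10.0]:
    q_constant = update_estimate(q_constant, reward, 0.5)
```

Here `q_average` ends at the mean of the four rewards, whereas `q_constant` has already moved halfway towards the latest reward after a single observation of the new level.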
Regarding the policy to be used, we consider the upper confidence bound (UCB) policy [26], which is described next.

Upper Confidence Bound The action chosen at each step is determined both by the estimated value of the action thus far (greedy) and by how frequently that action has been chosen in the past. This rule is given by [26] $$A_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],$$ where $N_t(a)$ is the number of times the action $a$ has been chosen up to time $t$ and $c$ is a constant parameter that controls the degree of exploration. With this policy, exploration continues as time goes on, favouring the actions that have been chosen less often; its extent is controlled by the constant $c$, which has to be set depending on the desired degree of exploration and the expected reward values.
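A minimal sketch of UCB action selection follows, using the standard rule from [26], $A_t = \arg\max_a [Q_t(a) + c\sqrt{\ln t / N_t(a)}]$. Treating actions that have never been tried as maximising is a common convention and is an assumption here, as are the function and argument names.

```python
import math


def ucb_select(q_values, counts, t, c):
    """Pick an action index with the upper confidence bound rule.

    q_values : current reward estimates, one per action
    counts   : number of times each action has been chosen so far
    t        : current time step (t >= 1)
    c        : exploration constant
    """
    # Untried actions are selected first (assumed convention).
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely chosen action can win despite a lower estimate.
choice = ucb_select([1.0, 0.9], [100, 1], t=101, c=2.0)
```

In the usage line the second action is selected because its exploration bonus is large, even though its estimate is slightly lower; with equal counts the higher estimate wins.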

Positioning learning block
RL loops are employed over the positional variables of the UAVs in order to iteratively reach the optimum values in a coordinate descent fashion [25]. This processing is performed at A, which has a global understanding of the system and transmits the positional information to the UAVs for physical adjustment. However, the transmission frequency of positional information to the UAVs is a concern, since every time this information is received, the UAVs are compelled to adjust their position, thus entailing energy consumption. If this occurs after each RL step of each variable, the movement of the UAVs may be unnecessarily erratic (given the randomness of the estimate and the discretisation level of the variable domains), consume a high amount of energy from the UAVs over time and introduce a substantial amount of delay, given that A needs to receive an acknowledgement (ACK) from the UAVs confirming that the required new position has been assumed before starting another RL step. Thus, we propose a time frame-based scheme that splits a given time range, which we name a positioning learning block (PLB), into individual slots, namely RL slots (RLSs) and positioning slots, as shown in Fig. 4. A PLB comprises $n_{RLS}$ consecutive RLSs and a single positioning slot at the end of it. At the beginning of an RLS, a $d_{AB}$ estimate is obtained and used in the rest of the slot, where a single RL step is performed for each of the positioning variables ($\theta_J$, $z_J$, $R_J$), one after another. Each RL step assumes a greedy positioning from the other variables.
For the duration of the RLSs, A performs internal processing of the RL steps, and at the positioning slot, A chooses the greedy actions from the three positioning variables and transmits this information to the UAVs. Then, the UAVs assume their new positions based on this information and send an ACK signal to A, which, upon reception, starts another PLB as shown in Fig. 4. Therefore, we define an off-policy scheme, where we employ a greedy policy at the positioning slots, and a UCB policy at the RLSs.
Given this approach, each UAV incurs an energy consumption at each positioning slot that is given by the sum of the energy needed to receive the positioning instructions from A ($E_{RX}$), the energy needed to manoeuvre to its new position ($E_{Mov}$) and the energy needed to send an ACK back to A ($E_{ACK}$). This energy term is given by $$E_{RX} + E_{Mov} + E_{ACK} = E_{RX} + P_{Mov}\, t_v + E_{ACK},$$ where $P_{Mov}$ is the power needed by the UAV to manoeuvre and $t_v$ is the time it takes the UAV to perform this change in position. Assuming that the UAV changes its position by assuming its new angle, height and radius in that order, $t_v$ is given by $$t_v = \frac{R_{J_0} |\Delta\theta_J| + |\Delta z_J| + |\Delta R_J|}{v_J},$$ where $v_J$ is the manoeuvring speed of the UAV (assumed constant throughout the flight), $\Delta\theta_J$, $\Delta z_J$ and $\Delta R_J$ are the angle, height and radius variations, and $R_{J_0}$ is the initial UAV radius value.
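The energy bookkeeping per positioning slot can be sketched as follows. The manoeuvre time is taken as the total path length divided by the speed, with the angular move performed first along an arc at the initial radius, followed by the height and radius changes; this path model is an assumption based on the stated order of movements, and the numerical values in the usage lines are arbitrary.

```python
def manoeuvre_time(d_theta, d_z, d_r, r_j0, v_j):
    """Time to change angle, then height, then radius at constant speed.

    d_theta : angle variation in radians (arc travelled at radius r_j0)
    d_z     : height variation in metres
    d_r     : radius variation in metres
    r_j0    : initial orbit radius in metres
    v_j     : manoeuvring speed in metres per second
    """
    path_length = r_j0 * abs(d_theta) + abs(d_z) + abs(d_r)
    return path_length / v_j


def positioning_energy(e_rx, e_ack, p_mov, t_v):
    """Energy spent by one UAV in a positioning slot."""
    return e_rx + p_mov * t_v + e_ack


# Arbitrary illustrative values: 0.5 rad at 100 m radius, 10 m up, 5 m out.
t_v = manoeuvre_time(0.5, 10.0, 5.0, r_j0=100.0, v_j=10.0)
energy = positioning_energy(e_rx=1.0, e_ack=0.5, p_mov=100.0, t_v=t_v)
```

A UAV that does not need to move still pays the reception and ACK energies, so the cumulative energy grows at every positioning slot, only more slowly once the positions stabilise.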

MAB-based WSC improvement UAV positioning algorithm
The concepts defined so far have the main goal of establishing the optimal position for the $N$ UAV jammers in order to maximise the WSC, while A sends out information to B over the wireless medium. In Algorithm 1, we present the process followed by the proposed algorithm, where the variables in brackets represent the arrays of action value estimates for each of the variables. Algorithm 1 provides a description of the processes depicted in Fig. 4 over time. In this algorithm, MAB processes are carried out, once for each RLS, for every positioning variable sequentially with the UCB action-choosing policy, over a number of PLBs. This algorithm refines its action estimates for each of the positioning variables over time in each RLS, adapting to the changes in the other positioning variables and allowing the UAVs to take positions that increase the WSC at the end of each PLB. Thus, the WSC of the system moves closer to the optimum at every PLB.
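The structure of the PLB scheme can be illustrated with a simplified sketch; it is not a reproduction of Algorithm 1. It runs one UCB step per variable in each RLS, holds the other variables at their greedy values, and records the greedy position at each positioning slot. The reward function `toy_wsc` is a hypothetical stand-in for the WSC reading, the Gaussian `noise` stands in for the effect of a fresh distance estimate, and initialising an estimate with its first observed reward is a design choice of this sketch.

```python
import math
import random


def greedy_action(q):
    """Index of the action with the highest current estimate."""
    return max(range(len(q)), key=q.__getitem__)


def ucb_action(q, counts, t, c):
    """UCB choice; untried actions are taken first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))


def run_plbs(reward_fn, grids, n_plb, n_rls, alpha=0.5, c=1.0, seed=0):
    """Run n_plb positioning learning blocks of n_rls RL slots each.

    grids maps each variable name to its list of discretised values.
    Returns the greedy position chosen at every positioning slot.
    """
    rng = random.Random(seed)
    names = list(grids)
    q = {v: [0.0] * len(grids[v]) for v in names}
    counts = {v: [0] * len(grids[v]) for v in names}
    steps = {v: 0 for v in names}
    history = []
    for _ in range(n_plb):
        for _ in range(n_rls):
            noise = rng.gauss(0.0, 1.0)  # one draw per RLS
            for v in names:              # one RL step per variable
                steps[v] += 1
                a = ucb_action(q[v], counts[v], steps[v], c)
                # Other variables are held at their greedy values.
                point = {u: grids[u][greedy_action(q[u])] for u in names}
                point[v] = grids[v][a]
                r = reward_fn(point, noise)
                if counts[v][a] == 0:
                    q[v][a] = r          # first observation sets the estimate
                else:
                    q[v][a] += alpha * (r - q[v][a])  # constant step size
                counts[v][a] += 1
        # Positioning slot: the greedy position is sent to the UAVs.
        history.append({v: grids[v][greedy_action(q[v])] for v in names})
    return history


def toy_wsc(point, noise):
    """Hypothetical reward with a single peak at theta=1, z=2, R=3."""
    return (10.0 - (point["theta"] - 1.0) ** 2 - (point["z"] - 2.0) ** 2
            - (point["R"] - 3.0) ** 2 + 0.05 * noise)


grids = {"theta": [0.0, 1.0, 2.0], "z": [1.0, 2.0, 3.0], "R": [2.0, 3.0, 4.0]}
history = run_plbs(toy_wsc, grids, n_plb=30, n_rls=10, seed=1)
```

In this toy setting the early positioning slots may select a suboptimal angle, because its first reward was observed after the other variables had already improved; continued UCB exploration with a constant step size corrects the stale estimate, which mirrors the non-stationarity argument made for the step-size choice.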

Results and discussion
In this section, we evaluate the secrecy performance of the proposed system, in terms of the WSC, and the proposed RL algorithm for certain illustrative cases. The parameters used for the evaluation, unless stated otherwise, are shown in Table 1, with channel-specific parameters for the urban environment taken from [21,22]. For UAV-specific parameters, such as the energy for receiving a data frame $E_{RX}$, the energy for sending an ACK $E_{ACK}$, and the power spent on manoeuvring from one point to another $P_{Mov}$, we refer to values based on common transceiver energy consumption values [27] and manoeuvring power values [28]. Also, we consider a UAV movement speed of $v_J$ and a processing time for an RLS of $t_{RL}$. The actual practical values of these parameters depend on the specific UAVs used, so the values considered here are simplified for comparison purposes.
To validate the expression for the SOP in (12), Fig. 5 shows a comparison of theoretical results and results obtained from Monte Carlo simulations for different configurations of parameters.
Note that the simulation results match the analytical results, thus validating our expressions. As expected, the SOP increases as $R_S$ increases, but it converges more rapidly for larger numbers of UAVs. A better performance in terms of SOP is obtained with higher transmit SNR values, tending to a floor in the performance. However, as the height of the UAV increases, this floor of the SOP decreases and it is reached more slowly.
To evaluate the performance of our algorithm, in the following figures, results from Monte Carlo simulations are presented by considering two performance metrics, the energy spent over time and the secrecy performance in terms of the WSC obtained over time. The energy spent is presented as a cumulative metric over a given time step (PLB). The secrecy performance is illustrated as the WSC obtained at each time step and normalised by the secrecy area. These results are also compared to the WSC resulting from an exhaustive search over discretised positioning values. Figure 6 presents the normalised WSC and the cumulative energy consumed obtained over time (PLBs) by the proposed algorithm for different values of σ AB . Note that the WSC increases until it reaches a convergence level, which is higher as σ AB decreases, obtaining a better secrecy performance. This behaviour occurs because, as σ AB decreases, the variance of the estimates of the action rewards also decreases; thus, more reliable action reward estimates are obtained, and it is more likely to choose the optimal actions from the discretised sets.
The energy consumption of the UAVs is the same over the first time steps, but it then increases more slowly for lower values of $\sigma_{AB}$. This is expected because, at lower $\sigma_{AB}$ values, reliable estimates of the action rewards are found earlier, and any new sample taken to adjust the estimates will not cause a large deviation from their current values (low variance). As the same actions are more reliably chosen, the UAVs move less between PLBs, thus consuming less energy. In general, a smaller uncertainty in the distance between A and B achieves greater secrecy performance and, at the same time, reduces the energy consumption of the UAVs.
Figure 7 shows the impact of the shape parameter K of the A2G channels on the normalised WSC and the cumulative energy consumed over time (PLBs). Note that a stronger LoS component (higher K) leads to a significant loss in the WSC. However, the convergence for lower K values is slower, thus involving more movement between actions that may be further apart, which increases the energy consumption.
Figure 8 shows the impact of the number of UAVs in the system, $n_{UAV}$, on the normalised WSC and the cumulative energy consumed over time (PLBs). The transmit SNR at each UAV is set to $\gamma_J = \gamma_T / n_{UAV}$. The results obtained by exhaustive search are also illustrated.
Note that as more UAVs are introduced in the system, while maintaining the total jamming power constant, the secrecy performance decreases. This may indicate that having more UAVs affects the legitimate node B more than the illegitimate node E, since E can be anywhere in the region S. It is also observed that, for up to three UAVs, a good level of convergence is reached within 10 PLBs; thus, the energy consumption over time remains low. However, the energy consumption increases drastically for four UAVs. This can be explained by the late convergence in the four-UAV case, suggesting a more erratic, less stable movement as the number of UAVs increases. This result suggests that two UAVs may be sufficient to efficiently provide secret transmissions to a single legitimate pair.
Finally, to analyse the convergence of the algorithm, Fig. 9 shows the normalised mean squared error (MSE) of the WSC, which is obtained by comparing the WSC over time to the exhaustive search results and then normalising to the exhaustive search value. The results are shown for different values of $\sigma_{AB}$. Note that the algorithm converges quickly, as the MSE reaches a steady low level within 10 PLBs. As $\sigma_{AB}$ increases, the MSE converges to a higher level, because a higher uncertainty introduces a larger variance in the action estimates, so the optimal actions are chosen less reliably, thus increasing the MSE.
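As a minimal sketch of how such a normalised MSE curve could be computed from simulation output, the following Python function averages the squared deviation from the exhaustive search value over Monte Carlo runs at each PLB. The exact normalisation is an assumption: here the squared error is divided by the square of the exhaustive search WSC, and the function and argument names are hypothetical.

```python
def normalised_mse(wsc_runs, wsc_exhaustive):
    """Per-PLB normalised MSE of the WSC against the exhaustive search value.

    wsc_runs[r][t] is the WSC obtained at PLB t in Monte Carlo run r.
    ASSUMPTION: the squared error is normalised by wsc_exhaustive ** 2.
    """
    if wsc_exhaustive == 0:
        raise ValueError("exhaustive search WSC must be non-zero")
    n_runs = len(wsc_runs)
    n_plbs = len(wsc_runs[0])
    curve = []
    for t in range(n_plbs):
        # Mean squared deviation over runs at time step t.
        mse_t = sum((run[t] - wsc_exhaustive) ** 2 for run in wsc_runs) / n_runs
        curve.append(mse_t / wsc_exhaustive ** 2)
    return curve
```

For instance, two runs with WSC values [1.0, 2.0] and [3.0, 2.0] compared against an exhaustive search value of 2.0 give a curve that starts at 0.25 and drops to 0.0 once both runs reach the reference value.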
It is also worth noting that simulations with $n_{RLS} = 10$ within 10 PLBs (100 RLSs in total), where the UCB algorithm has been applied 100 times to each of the three positioning variables, proved to be enough to reach a good level of convergence with very low MSE. This convergence speed is possible because the number of actions for each of the positioning variables is kept relatively low. Importantly, treating the three MAB processes as independent favours the convergence speed, compared to a joint action space or state-action pairs, which would greatly increase the number of actions or state-action values to be considered.
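The benefit of treating the three positioning variables independently can be sketched as follows. This is an illustrative Python example, not the implementation used for the results: it assumes the standard UCB1 index, hypothetical action set sizes (8 angles, 5 heights, 5 radii) and a made-up reward function `toy_wsc` standing in for the WSC feedback.

```python
import math
import random


class UCB1:
    """Standard UCB1 bandit over a discrete action set (assumed variant)."""

    def __init__(self, n_actions, c=2.0):
        self.counts = [0] * n_actions
        self.values = [0.0] * n_actions
        self.c = c
        self.t = 0

    def select(self):
        self.t += 1
        # Play every action once before using the confidence bound.
        for action, n in enumerate(self.counts):
            if n == 0:
                return action
        scores = [v + math.sqrt(self.c * math.log(self.t) / n)
                  for v, n in zip(self.values, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental sample mean of the rewards seen for this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def toy_wsc(angle, height, radius, rng):
    """HYPOTHETICAL noisy reward; it is not the WSC model of the paper."""
    score = (1.0 - abs(angle - 3) / 7.0
             + 1.0 - abs(height - 2) / 4.0
             + 1.0 - abs(radius - 1) / 4.0) / 3.0
    return score + rng.gauss(0.0, 0.05)


def run(n_steps=100, seed=0):
    rng = random.Random(seed)
    # One independent bandit per positioning variable (sizes are assumptions).
    bandits = {"angle": UCB1(8), "height": UCB1(5), "radius": UCB1(5)}
    for _ in range(n_steps):
        choice = {name: b.select() for name, b in bandits.items()}
        reward = toy_wsc(choice["angle"], choice["height"], choice["radius"], rng)
        # Every bandit learns from the same shared reward.
        for name, b in bandits.items():
            b.update(choice[name], reward)
    return bandits


if __name__ == "__main__":
    for name, b in run().items():
        print(name, "most played action:", b.counts.index(max(b.counts)))
```

With these assumed set sizes, the independent formulation maintains 8 + 5 + 5 = 18 reward estimates, whereas a joint action space would require 8 × 5 × 5 = 200, which is why far fewer samples are needed before the estimates become useful.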

Conclusions
This paper investigated the secrecy performance of a legitimate transmission between a pair of ground nodes aided by N friendly UAV-based jammers. The performance was assessed in terms of the WSC, a secrecy metric that measures the efficiency of friendly jamming over an area and is obtained from the SOP; thus, the exact position of the eavesdropper is not assumed to be known. For that purpose, we first derived an integral-form expression for the SOP of the proposed system, which was validated via Monte Carlo simulations. Additionally, we proposed an RL-based algorithm to optimise the 3D positioning of the UAVs in order to maximise the WSC. The time frame-based algorithm periodically updates the positioning information of the UAVs and allows the energy consumed in UAV positioning to be controlled. Extensive simulations showed that the proposed algorithm improved the secrecy of the system over time and converged towards the exhaustive search upper bound as the uncertainty in the position of B decreased. The proposed time frame structure of the algorithm proved efficient in leading to optimal values of the WSC while remaining flexible in the trade-off between secrecy and energy consumption. Furthermore, the algorithm can be explored for solving other problems in novel wireless communication networks that require parameters to be updated periodically and learnt over time in a non-stationary environment.

AWGN Additive white Gaussian noise
ACK Acknowledgement
CDF Cumulative distribution function
CSI Channel state information

Author Contributions
ML-a contributed heavily to the conception of the study. DPMO contributed to the conception of the study, revised and verified the methods, results and the entire article, and wrote the Introduction section. XAFC wrote the entire article except for the Introduction section, developed the proposed algorithm, carried out the simulations and prepared the graphs. All authors read and approved the final manuscript.

Funding
This work was partially supported by the Academy of Finland, 6Genesis Flagship, under Grant 318927 and FAITH project under Grant 334280.

Data Availability
The scripts used for data gathering during the current study are available from the following repository: https://github.com/xflorescStaff/MAB-based-UAV-positioning.git.