Distributed algorithm under cooperative or competitive priority users in cognitive networks

The opportunistic spectrum access (OSA) problem in cognitive radio (CR) networks allows a secondary (unlicensed) user (SU) to access a vacant channel allocated to a primary (licensed) user (PU). By identifying the best channel, i.e., the channel with the highest availability probability, a SU can increase its transmission time and rate. To maximize the transmission opportunities of a SU, various learning algorithms have been suggested: Thompson sampling (TS), upper confidence bound (UCB), ε-greedy, etc. In this study, we propose a modified UCB version, called AUCB (Arctan-UCB), that achieves a logarithmic regret similar to TS or UCB while further reducing the total regret, defined as the reward loss resulting from the selection of non-optimal channels. To evaluate AUCB's performance in the multi-user case, we propose a novel uncooperative policy for priority access in which the kth user should access the kth best channel. This manuscript theoretically establishes the upper bound on the sum regret of AUCB in the single- and multi-user cases; the users may thus, after a finite number of time slots, converge to their dedicated channels. It also examines the Quality of Service AUCB (QoS-AUCB) using the proposed policy for priority access. Our simulations corroborate AUCB's performance compared to TS and UCB.


Cognitive radio
Static spectrum allocation has nowadays become a major problem in wireless networks, as it results in an inefficient use of the spectrum and can generate holes or white spaces therein. The opportunistic spectrum access (OSA) concept aims at reducing this inefficiency by sharing the available spectrum of primary users (PUs), i.e., licensed users who have full access to a frequency band, with opportunistic users called secondary users (SUs). Under OSA, a SU may at any time access an unoccupied frequency band, but it must abandon the targeted channel whenever a PU restarts its transmission.

Related work
The past decade has witnessed an explosive demand for wireless spectrum, which has placed major stress on the frequency bands and made them scarce. Moreover, the radio landscape has become progressively heterogeneous and very complex (e.g., several radio standards, a diversity of offered services). Nowadays, the rise of new applications and technologies encourages wireless transmission and accelerates the spectrum scarcity problem. The coming wireless technologies (e.g., 5G) will support high-speed data transfer rates for voice, video, and multimedia.
In many countries, the priority bands for 5G include incumbent users, and it is essential that regulators make a significant effort to clear these frequencies for 5G use, especially in the 3.5 GHz range (3.3-3.8 GHz) [14]. These efforts may consist of (1) putting in place incentives to migrate licensees upstream of frequency allocation, (2) moving licensees to other bands or to a single portion of the frequency range, and (3) allowing licensees to exchange their licenses with mobile operators. When it is not possible to free up a band, reserving frequencies in the 5G bands (i.e., 3.5/26/28 GHz) may ensure the success of 5G services at the cost of wasted spectrum. Indeed, according to several recent studies, frequency sharing approaches represent an efficient solution that can support both potential 5G users and the incumbent users. For instance, the Finnish regulator has chosen to adopt this approach instead of reserving frequencies for 5G users [14]. The sharing approach gives 5G access to new frequencies in areas where they are needed but underutilized by incumbent users. In this work, we are interested in opportunistic spectrum access (OSA), a sharing approach in which the SUs can access the frequency bands in an opportunistic manner without any cooperation with the PUs. Before making any decision, a SU should perform spectrum sensing in order to reduce the interference with the primary users. In [15], the authors study different spectrum sensing techniques and their efficiency in obtaining accurate information about the status of the channel selected by a SU at a given time; the proposed techniques are analytically evaluated under Gaussian and Rayleigh fading channels. In this work, we focus on the decision-making process that helps the SU reach the best channel, i.e., the one with the highest availability probability.
On the one hand, this channel mitigates any harmful interference with the PU, since it is rarely used by the latter. On the other hand, accessing the best channel in the long term increases the SU's transmission time and throughput capacity.
Many recent works in CR have attempted to maximize the transmission rate of the secondary user (SU) without generating any harmful interference to the primary user (PU) [16,17]. To reach this goal, the latter works investigate the effects of using different types of modulation, such as OFDM (orthogonal frequency-division multiplexing) and SC-FDMA (single-carrier frequency-division multiple access). The main drawback of OFDM is its large peak-to-average power ratio (PAPR), which may increase the interference with the PU. SC-FDMA, in contrast, is seen as a favorable modulation to maximize the SU's transmission thanks to its lower PAPR as well as its lower complexity [18]. Moreover, SC-FDMA is used in several mobile generations, such as the third-generation partnership project long-term evolution (3GPP-LTE) and the fourth generation (4G). It is also considered a promising radio access technology with an optimal energy-efficient power allocation framework for future generations of wireless networks [19,20].
In this work, we choose to focus on the multi-armed bandit (MAB) approach in order to help a SU make good decisions, reduce the interference between PUs and SUs, and maximize the opportunities of the latter. In MAB, the agent may play an arm at each time slot and collect a reward. The main goal of the agent is to maximize its long-term reward or, equivalently, to minimize its total regret, defined as the reward loss resulting from the selection of bad arms. In [21][22][23][24], the authors considered the MAB approach in an OSA setting to improve spectrum learning.
In MAB, the arm reward can follow different models, such as the independent identically distributed (i.i.d.) or Markovian models. In this paper, we focus on the i.i.d. model, which is the most widely used for the single-user [24,25] and multi-user cases [23,26,27].
Based on the MAB problem introduced by Lai and Robbins in [10], the authors of [28] proposed several versions of UCB: UCB1, UCB2, and UCB-normal. All these versions achieve a logarithmic regret with respect to the number of played slots in the single-user case (note that a SU in OSA is equivalent to a MAB agent trying to access a channel at each time slot in order to increase its gain). For multiple users, we proposed cooperative and competitive policies in [13] and [29], respectively, to collectively learn the vacancy probabilities of channels and decrease the number of collisions among users. The latter policies were simulated under the TS, UCB, and ε-greedy algorithms. These simulations were conducted without any proof of the analytical convergence of the algorithms or of the number of collisions among SUs. In this work, we show that the same policies achieve a better performance with AUCB compared to several existing algorithms. We also investigate the analytical convergence of these two policies under AUCB, and we show that the number of collisions in the competitive access has a logarithmic behavior with respect to time. Therefore, after a finite number of collisions, the users converge to their dedicated channels. The authors of [30] proposed a distributed learning scheme for multiple SUs called time-division fair share (TDFS) and proved that it achieves a logarithmic regret with respect to the number of slots. Moreover, TDFS assumes that the users can access the channels with different offsets in their time-sharing schedule, and each of them achieves almost the same throughput. The work of [31] proposed the musical chair algorithm, a random access policy to manage the secondary network in which the users achieve different throughputs. According to [31], each user selects a random channel up to time T_0 in order to estimate the vacancy probabilities of the channels and the number of users U in the network.
After T_0, each user randomly selects one of the U best channels. Nevertheless, the musical chair suffers from several limitations:
1. The user should have prior knowledge about the number of channels in order to estimate the number of users in the network.
2. It cannot be used under dynamic availability probabilities, since the exploration and exploitation phases are independent.
3. It does not take priority access into account.
To find the U best channels, the authors of [32] proposed a multi-user ε-greedy collision avoiding (MEGA) algorithm based on the ε-greedy algorithm previously proposed in [28]. However, MEGA has the same drawbacks as the musical chair. In the literature, various learning algorithms have been proposed to take the priority access into account, such as selective learning of the kth largest expected rewards (SLK) [33] and kth MAB [34]. SLK is based on the UCB algorithm, while the kth MAB is based on both UCB and ε-greedy.

Contributions and paper organization
The main contributions of this manuscript are as follows:
• An improved version of the UCB algorithm called AUCB: In the literature, several UCB versions have been proposed to achieve a better performance compared to the classical one [28,35-37]. We show that AUCB achieves a better performance than these previous versions of UCB. Under the widely used i.i.d. model, the regret for a single SU or multiple SUs achieves a logarithmic asymptotic behavior with respect to the number of slots, so that the user may quickly find and access the best channel in order to maximize its transmission time.
• Competitive policy for the priority learning access (PLA): To manage a decentralized secondary network, we propose a learning policy, called PLA, that takes the priority access into account. To the best of our knowledge, PLA represents the first competitive learning policy that successfully handles priority dynamic access, where the number of SUs changes over time [38], while several existing learning policies consider only the priority access or only the dynamic access, such as musical chair and dynamic musical chair [31], MEGA [32], SLK [33], and kth MAB [34]. In [38], PLA shows its superiority under UCB and TS compared to SLK, MEGA, musical chair, and dynamic musical chair. In this work, we evaluate the performance of AUCB in the multi-user case based on PLA.
• The upper bound of the regret: We analytically prove the asymptotic convergence of AUCB for single or multiple SUs based on our PLA and side channel policies.
• Investigation of AUCB's performance against TS: TS is known to exceed the state of the art in MAB algorithms [35,39,40], and several studies have found a concrete bound for its optimal regret [41][42][43]. Based on these facts, we adopt TS as a reference to evaluate AUCB's performance.
• We also investigate the QoS of the AUCB algorithm under our PLA policy.
Concerning this manuscript's organization, Section 2 introduces the system model for the single- and multi-user cases. Section 3 presents the AUCB approach for a single user as well as a novel learning policy to manage a secondary network. AUCB's performance in both the single- and multi-user cases is investigated in Section 4. This section also compares the performance of the PLA policy in the multi-user case to recent works. Section 5 concludes the paper.

Problem formulation
In this section, we investigate the MAB problem for both the single- and multi-user cases. We also define the regret, which can be used to evaluate a given policy's performance. All parameters used in this section can be found in Table 1.

Single-user case
Let C be the number of i.i.d. channels, where each channel is in one of two binary states S: S = 0 if the channel is occupied and S = 1 otherwise. At each time slot t, the SU senses a channel to determine whether it is occupied or vacant and receives a reward r_i(t) from the ith channel. Without any loss of generality, we assume that the reward of a good decision, e.g., sensing a vacant channel, equals the channel's binary state, i.e., r_i(t) = S_i(t). The SU can transmit its data on a vacant channel; otherwise, it must wait for the next slot to sense and possibly use another channel. We suppose that all channels are ordered by their mean availability probabilities, i.e., μ_C ≤ μ_{C-1} ≤ ... ≤ μ_1. The availability vector (μ_i) is initially unknown to the secondary user, and our goal is to estimate it over many sensing slots. If a SU had perfect knowledge of the channels and their μ_i, it could select the best available channel, i.e., the first one, to increase its transmission rate. As μ_i is unknown to that user, we define the regret as the sum of the reward losses due to the selection of a sub-optimal channel at each slot. Minimizing the regret determines the efficiency of the selected strategy to find the best channel.
In the single-user case, the regret R(n, β) up to the total number of slots n under a policy β can be defined as follows:

R(n, β) = n μ_1 − E[ Σ_{t=1}^{n} r_{β(t)}(t) ]    (1)

where n is the total number of slots; nμ_1 is the reward obtained in an ideal scenario, i.e., when the SU has prior knowledge and always selects the best channel; β(t) denotes the channel selected under the policy β at time t; and μ_{β(t)} is the mean reward of the channel i selected at slot t, with β(t) = i. The main target of a SU is to estimate the channels' availability as soon as possible in order to settle on the most available one.

Table 1 Parameters used in this section

r_i(t): reward obtained from the ith channel at slot t
μ_1, μ_i: availability of the best and the ith channel, respectively
Δ(a,b): difference between the availabilities of channels a and b
β(t): channel(s) selected at slot t using a policy β, for the single- or multi-user case
T_i(t): number of times the ith channel was sensed up to slot t
X_i(T_i(t)): exploitation contribution of the ith channel, which depends on T_i(t)
A_i(t, T_i(t)): exploration contribution of the ith channel, which depends on t and T_i(t)
B_i(t, T_i(t)): index assigned to the ith channel, taking the availability into consideration
α: exploration-exploitation factor
S_{β(t)}(t): global reward obtained by all users at slot t from the selected channels β(t)
I_{i,j}(t): non-collision indicator for the jth user on the ith channel at slot t
P_{i,j}(n): total number of non-collisions for the jth user on the ith channel up to n
G_i(T_i(t)): quality collected from the ith channel up to slot t
G_max(t): maximum expected quality over channels up to slot t
Q_i(t, T_i(t)): quality factor, which depends on t and T_i(t)
B_i^Q(t, T_i(t)): index assigned to the ith channel, taking both availability and quality into consideration
γ: weight of the quality factor
μ_i^Q: global mean reward of the ith channel, taking both availability and quality into consideration
O_U(n): total number of collisions in the U best channels up to n
p: probability of non-collision in the best channels
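As an illustration, the regret of Eq. (1) can be computed directly from a sequence of channel selections. The sketch below is not the paper's code; the availability vector and selection sequence are invented for illustration.

```python
def empirical_regret(mu, choices, n):
    """Regret of Eq. (1): n*mu_1 minus the expected reward actually
    collected, where mu[0] is assumed to be the best channel and
    choices[t] is the channel index selected at slot t."""
    ideal = n * max(mu)                      # always sensing the best channel
    achieved = sum(mu[c] for c in choices)   # mean reward of selected channels
    return ideal - achieved

# Hypothetical example: 3 channels; the user picks channel 1 twice, then 0.
mu = [0.9, 0.8, 0.7]
print(empirical_regret(mu, [1, 1, 0], n=3))  # ≈ 0.2 (two sub-optimal picks)
```

Each selection of a sub-optimal channel adds its availability gap to the regret, which is exactly what the learning algorithm tries to minimize.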
To reach this goal, UCB was first proposed in [10] and applied in [25] to optimize the access over channels and identify the best one, i.e., the one with the highest availability probability. UCB contains two dimensions, exploitation and exploration, represented by X_i(T_i(t)) and A_i(t, T_i(t)), respectively.
The index assigned to the ith channel can be defined as follows:

B_i(t, T_i(t)) = X_i(T_i(t)) + A_i(t, T_i(t))    (2)

where T_i(t) is the number of times channel i has been sensed by the SU up to time slot t. At slot t, the user selects the channel β(t) that maximized its index in the previous slot. After a sufficient time, the user establishes a good estimate of the availability probabilities and can thus converge towards the optimal channel.
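The index maximization of Eq. (2), with the classical square-root exploration bonus, can be sketched as follows; the value of `alpha`, the empirical means, and the sensing counts are illustrative assumptions, not values from the paper.

```python
import math

def ucb_index(x_mean, t, t_i, alpha=1.5):
    """Classical UCB index B_i = X_i + sqrt(alpha * ln(t) / T_i):
    empirical availability plus an exploration bonus that shrinks
    as the channel is sensed more often."""
    return x_mean + math.sqrt(alpha * math.log(t) / t_i)

def select_channel(means, counts, t, alpha=1.5):
    """Pick the channel maximizing its UCB index at slot t."""
    return max(range(len(means)),
               key=lambda i: ucb_index(means[i], t, counts[i], alpha))

# After sensing each of 3 channels once, the empirical means drive the choice:
# equal exploration bonuses, so the highest empirical mean wins.
means, counts = [0.9, 0.5, 0.2], [1, 1, 1]
print(select_channel(means, counts, t=4))  # channel 0
```

A channel sensed rarely keeps a large bonus, so it is still tried occasionally even when its empirical mean is low.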

Multi-user case
Let us consider U SUs trying to maximize their network's global reward. At every time slot t, each user can access a channel when it is available and transmit its own data. Multiple SUs can work in cooperative or uncooperative modes. In the cooperative mode, the users coordinate their decisions to minimize the global regret of the network. In the non-cooperative mode, each user independently makes its own optimal decision to maximize its local reward. The regret for the multi-user case, under cooperative or competitive modes, can be written as follows:

R(n, β) = n Σ_{k=1}^{U} μ_k − E[ Σ_{t=1}^{n} S_{β(t)}(t) ]    (3)

where μ_k is the mean availability of the kth best channel; S_{β(t)}(t) is the global reward obtained by all users at time slot t; E(.) represents the mathematical expectation; and β(t) represents the set of channels selected by the users at t. We can define S_{β(t)}(t) by:

S_{β(t)}(t) = Σ_{i=1}^{C} Σ_{j=1}^{U} S_i(t) I_{i,j}(t)    (4)

where the state variable S_i(t) = 0 indicates that channel i is occupied by the PU at slot t and S_i(t) = 1 otherwise, and I_{i,j}(t) = 1 if the jth user is the sole occupant of channel i at slot t and 0 otherwise. In the multi-user case, the regret is affected by the collisions among SUs and by the channel occupancy, which allows us to write the regret for U SUs as:

R(n, β) = n Σ_{k=1}^{U} μ_k − Σ_{i=1}^{C} Σ_{j=1}^{U} μ_i P_{i,j}(n)    (5)

where P_{i,j}(n) = Σ_{t=1}^{n} E[I_{i,j}(t)] stands for the expected number of slots in which user j is the sole occupant of channel i up to n, and the mean reward of channel i is given by μ_i = E[S_i(t)].
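The global reward of Eq. (4), where a channel only pays off for its sole occupant, can be sketched as below; the channel states and user choices are invented for illustration.

```python
def global_reward(states, access):
    """Global reward of Eq. (4) at one slot: channel i contributes its
    state S_i only when exactly one user accessed it (I_{i,j} = 1);
    collisions yield no reward for anyone."""
    reward = 0
    for i, s in enumerate(states):
        users_on_i = [j for j, ch in enumerate(access) if ch == i]
        if len(users_on_i) == 1:          # sole occupant, no collision
            reward += s
    return reward

# 3 channels (states 1, 1, 0) and 3 users: users 0 and 1 collide on
# channel 0, user 2 alone on channel 1; only channel 1 pays.
print(global_reward([1, 1, 0], [0, 0, 1]))  # 1
```

This shows why the multi-user regret of Eq. (5) penalizes both collisions and the selection of occupied channels.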

Methods
In this section, we present a new approach inspired by the classical UCB in the single-user case; later on, we generalize our study to the multi-user case. The new approach can find the optimal channel faster than the classical UCB while achieving a lower regret. The classical UCB contains an exploration-exploitation trade-off to find a good estimate of the channels' status and converge to the best one (see Eq. (2)). In UCB, a non-linear function is used for the exploration factor A_i(t, T_i(t)) to ensure the convergence:

A_i(t, T_i(t)) = sqrt( α ln(t) / T_i(t) )    (6)

where α is the exploration-exploitation factor. The effect of α on the classical UCB is well studied in the literature [22,44,45]. According to [28,44,46,47], the best value of α lies in the range [1, 2] in order to balance the exploration and exploitation epochs. If α decreases, the exploration factor of UCB decreases and the exploitation factor dominates; the algorithm then converges quickly to the channel with the highest empirical reward. All previous works study the effect of A_i(t, T_i(t)) on UCB with different values of α. In this study, we consider another form of the exploration factor A_i(t, T_i(t)), based on another non-linear function, in order to enhance the classical UCB's convergence to the best channel. Different non-linear functions with characteristics similar to those of Eq. (6) can be investigated. The square-root function was chosen because it has two main properties:
• A positive function with respect to time t.
• An increasing non-linear function to limit the effect of the exploration.
The square-root function introduced in Eq. (6) is widely accepted [24,28,46,47] as a way to restrict the exploration factor after the learning phase. Classical UCB ensures the balance between the exploration and exploitation phases at each time slot up to n using the two factors A_i(t, T_i(t)) and X_i(T_i(t)). Indeed, A_i(t, T_i(t)) is used to explore the channels' availability in order to access the best channel, i.e., the one with the highest expected availability probability X_i(T_i(t)). The classical UCB gives the exploration factor A_i(t, T_i(t)) the same impact at each time slot up to n. However, our proposal is based on the idea that the exploration factor A_i(t, T_i(t)) should play an important role during the learning phase and become less important afterwards. Indeed, after the learning phase, the user has a good estimate of the channels' availability and can regularly access the best channel. The challenge is thus to restrict A_i(t, T_i(t)) after the learning phase by using another non-linear function with the following features:
• It should be an increasing function with a high derivative with respect to time at the beginning, to boost the exploration factor during the learning phase and accelerate the estimation of the channels' availability.
• It should saturate asymptotically, in order to keep the exploration factor A_i(t, T_i(t)) under a certain limit once the user has collected some information about the channels' availability.
Our study finds that the exploration factor can be adjusted by using the arctan function, which has the above features; the resulting UCB version is called AUCB. Indeed, the arctan function enhances the convergence speed to the best channel compared to the square-root form, and the effect of the exploration factor A_i(t, T_i(t)) is reduced after the learning phase. The algorithm then gives additional weight to the exploitation factor X_i(T_i(t)) in the maximization of the index B_i(t, T_i(t)) (see Eq. (2)). In the next section, we prove that AUCB's regret has a logarithmic asymptotic behavior.
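To illustrate the saturation argument, the sketch below compares the classical square-root bonus of Eq. (6) with an arctan-shaped bonus. The exact AUCB expression is given by the paper's equations; the `explore_arctan` form here is only an assumed stand-in that shares the key property named above, namely that arctan is bounded (it never exceeds π/2), so the bonus flattens out after the learning phase.

```python
import math

def explore_sqrt(t, t_i, alpha=1.5):
    """Classical UCB exploration factor of Eq. (6): unbounded in t."""
    return math.sqrt(alpha * math.log(t) / t_i)

def explore_arctan(t, t_i, alpha=1.5):
    """Assumed arctan-shaped bonus (illustration only): same argument,
    but saturated below pi/2, so exploitation dominates for large t."""
    return math.atan(alpha * math.log(t) / t_i)

# With T_i fixed, the sqrt bonus keeps growing while arctan saturates.
for t in (10, 1000, 100000):
    print(t, round(explore_sqrt(t, 5), 3), round(explore_arctan(t, 5), 3))
```

Whatever the precise AUCB formula, this boundedness is what lets the exploitation term X_i(T_i(t)) dominate the index once the estimates are reliable.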

AUCB for a single user
This section focuses on AUCB's regret convergence for a single user. For the sake of simplicity in the mathematical developments, the regret of Eq. (1) can be rewritten as:

R(n, β) = Σ_{i=2}^{C} Δ(1,i) E[T_i(n)]    (7)

where T_i(n) represents the number of time slots in which channel i was sensed by the SU up to the total number of slots n, and Δ(1,i) = μ_1 − μ_i represents the availability gap between the best and the ith channel. According to Eq. (7), the regret depends on the channels' occupancy probabilities (for stationary channels, the availability probabilities are constant) and on the expectation of T_i(n), which is a stationary random process. An upper bound on E[T_i(n)] therefore implies an upper bound on the regret. The regret of our AUCB approach in the single-user case thus has a logarithmic asymptotic behavior: the upper bound of Eq. (8), established in Appendix A, grows as a logarithm of n.
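The gap-based rewriting of Eq. (7) can be evaluated numerically once the sensing counts T_i(n) are known; the counts below are hypothetical, chosen to mimic a learner that mostly senses the best channel.

```python
def regret_from_counts(mu, counts):
    """Gap-based regret of Eq. (7): R(n) = sum_i (mu_1 - mu_i) * T_i(n),
    where counts[i] is how often channel i was sensed up to slot n and
    mu[0] is assumed to be the best channel."""
    best = max(mu)
    return sum((best - m) * c for m, c in zip(mu, counts))

# 1000 slots: 900 pulls of the best channel, 60 and 40 of the others.
print(regret_from_counts([0.9, 0.8, 0.7], [900, 60, 40]))  # ≈ 14
```

Only the sub-optimal counts contribute, which is why bounding E[T_i(n)] for i > 1 bounds the whole regret.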

Multi-user case under uncooperative or cooperative access
To evaluate the performance of our proposed algorithm in the multi-user case, we propose an uncooperative policy, the priority learning access (PLA), to manage a secondary network. We also prove the convergence of PLA, as well as that of the side channel policy, under AUCB.

Uncooperative learning policy for the priority access
We investigate the case where the SUs take decisions according to their priority ranks. In this section, we propose a competitive learning policy that shares the available spectrum among SUs, and we prove the theoretical convergence of the PLA policy with our AUCB approach. In the multi-user case, the challenge is to let each SU learn the channels' availability while keeping the number of collisions below a certain threshold. Our goal is to ensure that the U users spread out over the U best channels. In the classical priority access method, the first priority user SU_1 should sense and access the best channel, μ_1, at each time slot, while the target of the second priority user SU_2 is to access the second best channel. To reach this goal, SU_2 should sense the two best channels at the same time, i.e., μ_1 and μ_2, in order to estimate their availabilities and thus access the second best channel when available. Similarly, the Uth user should estimate the availability of all U best channels at each time slot to access the Uth best one. However, this is a costly and impractical method to settle each user on its dedicated channel. For this reason, we propose PLA, where at each time slot a SU senses only one channel in order to find its dedicated one. The policy for user SU_k is summarized in Algorithm 1, where r_i(t) indicates the state of the ith channel at slot t (r_i(t) = 1 if the channel is free and 0 otherwise), B_{i,k}(t, T_i(t)) represents the index of the ith channel for the kth user, and a collision indicator marks a collision for the kth user at slot t.

Algorithm 1: PLA for user SU_k
1: for t = 1 to C do
2:     SU_k senses each channel once
3: SU_k generates a random rank from the set {1, ..., k}
4: for t = C to n do
5:     SU_k senses a channel according to its rank and its index B_{i,k}(t, T_i(t))
6:     if r_i(t) = 1 then
7:         SU_k transmits its data
8:     if a collision occurs then
9:         SU_k regenerates a random rank in {1, ..., k}
10:    SU_k updates its index B_{i,k}(t, T_i(t)) according to Eq. (2)
Each user has a dedicated rank k ∈ {1, ..., U}, and its target remains the access of the kth best channel. In PLA, each user generates a rank around its priority one in order to gather information about the channels' availability (see Algorithm 1). In this case, the kth user can scan the k best channels while targeting the kth best one. However, if the generated rank of the kth user differs from k, the user accesses a channel whose vacancy probability lies in the set {μ_1, μ_2, ..., μ_{k-1}} and may collide with the higher-priority users {SU_1, SU_2, ..., SU_{k-1}}. Moreover, after each collision, SU_k regenerates its rank in the set {1, ..., k}. Thus, after a finite number of iterations, each user settles down on its dedicated channel. Equation (9) (see Appendix B) shows that the expected number of collisions in the U best channels, E[O_U(n)], under PLA with our AUCB approach has a logarithmic asymptotic behavior, where p indicates the probability of non-collision and Δ(a,b) = μ_a − μ_b; therefore, after a finite number of collisions, each user may converge to its dedicated channel. We have also proven that the total regret of our PLA policy has a logarithmic asymptotic behavior. It is worth mentioning that the upper regret bound depends not only on the collisions among users but also on the selection of the worst channels (see Eq. (10) and Appendix C): the first term of that bound reflects the selection of the worst channels, while the second represents the reward loss due to collisions among users in the best channels. The upper bound of the regret in Eq. (10) is affected by three parameters:
• The number of users, U, represented in the first summation, where k denotes the kth best channel for the kth SU.
• The number of channels, C, in the second summation of the regret.
• The total number of time slots, n.
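A minimal sketch of PLA's collision handling (the rank regeneration of Algorithm 1), under simplifying assumptions: the channel estimates are taken as already correct, so each rank maps directly to a channel, and we simulate a single slot.

```python
import random

def pla_slot(ranks, k_max, channel_of_rank):
    """One PLA slot (a sketch): user k targets the channel matching its
    current rank; users involved in a collision redraw a rank uniformly
    in {1, ..., k}, where k = k_max[user] is the user's priority."""
    chosen = [channel_of_rank[r] for r in ranks]
    for k, ch in enumerate(chosen):
        if chosen.count(ch) > 1:                    # collision on channel ch
            ranks[k] = random.randint(1, k_max[k])  # redraw in {1, ..., k}
    return ranks

random.seed(0)
# 3 users with priorities 1..3 starting at their own rank: no collision,
# so the ranks (and hence the channel assignment) are stable.
print(pla_slot([1, 2, 3], k_max=[1, 2, 3], channel_of_rank={1: 0, 2: 1, 3: 2}))
```

Note that user 1 can only ever redraw rank 1, which is why the highest-priority user is never displaced from the best channel and the system settles after finitely many collisions.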

Cooperative learning with side channel policy
Coordination among SUs can enhance the efficiency of their network, instead of each SU dealing only with its partial information about the environment. To manage a cooperative network, we propose a policy based on the use of a side channel to exchange simple information among SUs at a very low information rate. Side channels are widely used in wireless telecommunication networks to share data among base stations [48], and specifically in the context of cognitive networks. For instance, in [49] and [50], the authors considered cooperative spectrum sharing between PUs and SUs to enhance the transmission rate of the PUs using a side channel.
The signaling channel in our policy is not wide enough to allow high-data-rate transmission, unlike the one used in [49] and [50], which must have a high rate to ensure data transmission between PUs and SUs. In our policy, the transmission is organized in sub-slots. During the first one, Sub-Slot1, SU_1 (the highest-priority user) searches for the best channel by maximizing its index according to Eq. (2). At the same time, via the side channel, SU_1 informs the other users to evacuate its selected channel in order to avoid any collision with them. Avoiding the first selected channel, the second user SU_2 repeats the same process, and so on. If SU_2 does not receive the choice of SU_1 during Sub-Slot1 (suppose that SU_1 does not need to transmit during this sub-slot), it can directly choose the first suggested channel by maximizing its index B_{i,2}(t, T_{i,2}(t)).
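The sequential selection over the side channel can be sketched as follows, assuming each user j holds a vector of indices B_{i,j} and users choose in priority order while skipping the channels already claimed by higher-priority users; the index values are invented.

```python
def side_channel_assignment(indices):
    """Sequential selection over a side channel (a sketch): users pick in
    priority order, each maximizing its own index over the channels not
    yet claimed. indices[j][i] plays the role of user j's index B_{i,j}."""
    taken, assignment = set(), []
    for user_index in indices:
        best = max((i for i in range(len(user_index)) if i not in taken),
                   key=lambda i: user_index[i])
        taken.add(best)                 # announced over the side channel
        assignment.append(best)
    return assignment

# Two users with similar estimates: user 0 claims channel 0 first, so
# user 1 falls back to its best remaining channel without any collision.
print(side_channel_assignment([[0.9, 0.8, 0.1], [0.85, 0.8, 0.2]]))  # [0, 1]
```

Because claims are broadcast before access, no two users ever pick the same channel, which is exactly what removes the collision term from the regret.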
To the best of our knowledge, all previously proposed policies, such as SLK and kth MAB, consider a fixed priority, i.e., the kth best channel is reserved for the kth user at all times. Hence, if SU_1 does not transmit for a certain time, the other users cannot select better channels. The main advantages of the cooperation in our policy are as follows:
• An efficient use of the spectrum, where the best channels are constantly accessed by users.
• An increase in the transmission time of users by avoiding the collision among them.
• Reaching a lower regret compared to several existing policies. Hence, AUCB's regret under the side channel policy achieves a logarithmic behavior, as established in Appendix D, where Δ(k,i) = μ_k − μ_i and k and i index the kth best channel and a worse channel, respectively. The upper bound of this regret contains two terms:
• Term 1 achieves a logarithmic behavior over time.
• Term 2 depends on the vacant channels.

Quality of service of AUCB
In [12], we studied UCB's quality of service (QoS) for the restless model. The QoS was studied for both the single- and multi-user cases using the random rank policy proposed in [23] to manage a secondary network. Based on QoS-UCB, the user is able to learn the channels' availability and quality.
In this work, we also study the QoS of AUCB using our proposed PLA policy for the priority access of the i.i.d. channels. Suppose that each channel has a binary quality q_i(t) at slot t: q_i(t) = 1 if the channel has a good quality and 0 otherwise. The expected quality G_i(T_i(t)) collected from channel i is then the empirical mean of the observed q_i up to slot t, and the global mean reward μ_i^Q of the ith channel takes into account both the channel's quality and availability [12]. The index assigned to the ith channel, accounting for both availability and quality, can be defined by:

B_i^Q(t, T_i(t)) = X_i(T_i(t)) − Q_i(t, T_i(t)) + A_i(t, T_i(t))

According to [12], the quality factor Q_i(t, T_i(t)) is given by:

Q_i(t, T_i(t)) = γ M_i(t, T_i(t)) ln(t) / T_i(t)

where the parameter γ stands for the weight of the quality factor, and M_i(t, T_i(t)) = G_max(t) − G_i(T_i(t)) is the difference between the maximum expected quality over channels at time t, G_max(t), and the quality collected from channel i up to slot t, G_i(T_i(t)). When the ith channel has a good quality G_i(T_i(t)) as well as a good availability X_i(T_i(t)) at time t, the quality factor Q_i(t, T_i(t)) decreases while X_i(T_i(t)) increases. Subsequently, by selecting the maximum of its index B_i^Q(t, T_i(t)), the user tends to access a channel with both high quality and high availability.
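A sketch of the QoS index computation, assuming the quality factor takes the QoS-UCB-style form γ·M_i·ln(t)/T_i with M_i = G_max − G_i (the exact expression should be checked against [12]); all numeric inputs are invented.

```python
import math

def qos_index(x_mean, g_i, g_max, t, t_i, alpha=1.5, gamma=0.5):
    """QoS index sketch: availability estimate, minus a quality penalty
    that grows with the channel's quality deficit M_i = g_max - g_i,
    plus the usual exploration bonus."""
    a_i = math.sqrt(alpha * math.log(t) / t_i)        # exploration factor
    q_i = gamma * (g_max - g_i) * math.log(t) / t_i   # quality penalty
    return x_mean + a_i - q_i

# Equal availability and equal exploration: the higher-quality channel
# (no quality deficit) keeps the higher index.
good = qos_index(0.8, g_i=0.9, g_max=0.9, t=100, t_i=10)
poor = qos_index(0.8, g_i=0.4, g_max=0.9, t=100, t_i=10)
print(good > poor)  # True
```

The penalty vanishes for the channel matching the best observed quality, so the index degenerates to the plain availability index on that channel.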
To conclude this part, a comparative study in terms of complexity and convergence speed to the optimal channel is presented in Table 2 for UCB, AUCB, QoS-UCB, and QoS-AUCB. These algorithms behave in O(nC) for large n and C, which represent the number of time slots and the number of channels, respectively. Despite the lower complexity of UCB compared to AUCB, the latter converges more quickly to the optimal channel with the highest availability probability.

AUCB's performance
In our simulations, we consider that the SU can access a single available channel at each time slot to transmit its data. In this section, we investigate AUCB's performance for both the single- and multi-user cases. Many simulations have been conducted using Monte Carlo methods; the probability P_best that the SU selects the optimal channel is estimated as the fraction of runs in which the optimal channel is chosen at a given slot.
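A Monte Carlo estimate of P_best counts, across independent runs, how often a policy picks the best channel at a given slot. The sketch below plugs in a hypothetical selector that is right 70% of the time, just to exercise the estimator; it is not AUCB.

```python
import random

def estimate_p_best(select, mu, t, runs=2000, seed=0):
    """Monte Carlo estimate of P_best: fraction of independent runs in
    which the policy `select` (a callable (mu, t, rng) -> channel index)
    picks the channel with the highest availability at slot t."""
    rng = random.Random(seed)
    best = max(range(len(mu)), key=lambda i: mu[i])
    hits = sum(select(mu, t, rng) == best for _ in range(runs))
    return hits / runs

# Hypothetical 70%-accurate selector (stands in for a learning algorithm).
toy = lambda mu, t, rng: 0 if rng.random() < 0.7 else 1
print(estimate_p_best(toy, [0.9, 0.8, 0.7], t=1000))  # close to 0.7
```

Sweeping t from 1 to n with a real learner yields curves of the kind shown in Fig. 1.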

Single user tests
In Fig. 1, P_best shows three parts:
• The first part, from slot 1 to C, represents the initialization, where the SU plays each channel once to obtain prior knowledge about the availability of each channel.
• The second part, from slot C + 1 to 2000, represents the adaptation phase.
• In the last part, the user asymptotically converges towards the optimal channel μ 1 .
After the initialization part, the two curves evolve in a similar way. After hundreds of slots, the proposed AUCB outperforms the classical UCB: AUCB reaches the best channel 65% of the time after about 1000 slots, while the classical UCB reaches it only 45% of the time. Figure 2 shows AUCB's and UCB's regret, evaluated according to Eq. (1) for a single user. As shown in this figure, the regret of both approaches has a logarithmic asymptotic behavior over time. This result confirms the theoretical upper bound of the regret calculated in Eq. (8) and also plotted in Fig. 2, where the upper bound of the regret is logarithmic. The same figure shows that AUCB produces a lower regret compared to the classical UCB, which means that our algorithm recognizes the best channel rapidly while the classical UCB requires more time to find it. Figure 3 shows the number of times the two algorithms sense the sub-optimal channels up to time n. For the worst channels, our approach and the classical UCB have approximately the same behavior. On the other hand, for the near-optimal channels (in our example, channels 2 and 3, which have availability probabilities of 0.8 and 0.7, respectively), UCB cannot clearly switch to the optimal channel and spends a lot of time exploring the near-optimal ones. Figure 4 evaluates AUCB's and UCB's performance with respect to various values of the exploration-exploitation factor α in the interval ]1, 2]. This figure shows that our approach outperforms the classical UCB and achieves a lower regret. Moreover, as α increases, the classical UCB spends more time exploring the channels in order to find the best one, while our approach reaches the best channel within a lower number of slots. This increases the transmission opportunities of the SU and subsequently decreases the total regret. In the following sections, we consider α = 1.5.

Multiple SUs tests
In this section, we consider U = 3 users with C = 10 channels whose availabilities are given by:

Figure 5 compares, for the two approaches (i.e., UCB and AUCB), the regret of the multi-user case defined in Eq. (5) under the random rank policy [23]. The latter was originally used with UCB; however, it is easy to implement this policy under AUCB in order to study both algorithms' performance in the multi-user case. In the random rank policy, when a collision occurs among users, each of them generates a random rank in {1, ..., U}. Although both approaches' regret achieves a logarithmic asymptotic behavior, our algorithm achieves a lower asymptotic regret and converges faster than the classical UCB.
Under the random rank policy, Fig. 6 shows the number of collisions in the U best channels (1, 2, and 3, having availability probabilities of 0.9, 0.8, and 0.7, respectively) for AUCB and the classical UCB. Let us recall that, when a collision occurs among users, no transmission is achieved, and each of them should generate a random rank ∈ {1, ..., U}. The same figure shows that the number of collisions under the random rank policy with AUCB or classical UCB is quite similar. This can be justified by a nice property of the random rank policy: it does not favor any user over another. Therefore, each user has an equal chance of settling down in any of the U best channels. In other words, the random rank policy naturally achieves a probabilistic fairness of access among users. Moreover, in the case of AUCB, a user switches to the optimal channel faster than in the classical one, as shown in Fig. 3 for the single-user case, which decreases the number of collisions among users.

Figure 7 depicts the regret of AUCB and the classical UCB under the side channel policy. As expected, both approaches' regret increases rapidly at the beginning. Later on, the increase is slower for AUCB compared to the classical one. We thus notice that our algorithm presents the smaller regret.
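The redraw-on-collision rule described above can be sketched as follows; the loop structure and all names are hypothetical, illustrating only the rank-regeneration mechanism, not the paper's implementation.

```python
import random

def slots_until_orthogonal(U, rng):
    """Count slots until all U users hold distinct ranks when every
    colliding user redraws a uniform rank in {1, ..., U} (sketch of
    the random rank rule)."""
    ranks = [rng.randint(1, U) for _ in range(U)]
    slots = 0
    while len(set(ranks)) < U:
        slots += 1
        counts = {}
        for r in ranks:
            counts[r] = counts.get(r, 0) + 1
        # Users alone on their rank keep it; colliding users redraw.
        ranks = [rng.randint(1, U) if counts[r] > 1 else r for r in ranks]
    return slots

rng = random.Random(7)
settle = slots_until_orthogonal(4, rng)
```

Because each redraw is uniform and non-colliding users keep their ranks, the process reaches an orthogonal (collision-free) configuration after a finite number of slots with probability one.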

The performance of the PLA policy
This section investigates the performance of the PLA policy under AUCB and UCB, compared to the musical chair [31] and SLK [33] policies; 4 priority users are considered, accessing the channels based on their prior ranks. We then compare the QoS of UCB and AUCB based on the PLA policy. Figure 8 compares the regret of PLA to the SLK and musical chair policies on a set of 9 channels, where PLA achieves the lowest regret under AUCB. It is worth mentioning that our policy and SLK take the priority access into account, while in the case of the musical chair, users access the channels randomly. As shown in Fig. 8, the musical chair produces a constant regret after a finite number of slots, while the other methods' regret is logarithmic. However, during the learning time T_0 of the musical chair, the users randomly access the channels to estimate their availability probabilities as well as the number of users; after that, the users simply access the U best channels in the long run. Consequently, the musical chair does not follow the dynamism of the channels (e.g., assuming that the vacancy probabilities can change with time). The same figure shows that SLK achieves the worst regret.
In Table 3, we compare the regret of the four methods with a fixed number of SUs (U = 4) and different numbers of channels (C = 5, 9, 13, 17, and 21). As the users spend more time learning the availability of the channels, the regret may increase significantly. This result is due to the access to worse channels and to the collisions produced among users. As shown in Table 3, the regret increases with the number of channels, while our PLA policy under AUCB achieves the lowest regret for all considered scenarios (i.e., C = 5, 9, 13, 17, and 21), thanks to the fact that, under our policy, the SUs learn the channels' vacancy probabilities more quickly than under the other methods.
To study AUCB's QoS, let us define the empirical mean of the quality collected from the channels as follows: G = [0.75, 0.99, 0.2, 0.8, 0.9, 0.7, 0.75, 0.85, 0.8]. The global mean reward Q, which takes into account the quality as well as the availability, is then given by: Q = [0.67, 0.79, 0.14, 0.48, 0.37, 0.28, 0.22, 0.17, 0.08]. After estimating the channels' availability and quality (i.e., Q) and based on the PLA policy, the first priority user SU1 should converge to the channel that has the highest global mean, i.e., channel 2, while the targets of SU2, SU3, and SU4 should respectively be channels 1, 4, and 5. This result is confirmed in Fig. 9, where the priority users access their dedicated channels in the case of QoS-UCB or QoS-AUCB. Moreover, QoS-AUCB grants users access to their dedicated channels significantly more often than QoS-UCB does. Figure 10 displays the achievable regret of QoS-AUCB and QoS-UCB in the multi-user case. In [12], the performance of QoS-UCB in the restless MAB problem is compared to several learning algorithms, such as the regenerative cycle algorithm (RCA) [51], the restless UCB (RUCB) [52], and Q-learning [53], where QoS-UCB achieved the lowest regret. From Fig. 10, one can notice that QoS-AUCB achieves better performance than QoS-UCB.

AUCB compared to Thompson sampling
Thompson sampling has shown its superiority over a variety of UCB versions and other bandit algorithms [35]. Instead of comparing further UCB variants to AUCB, in this section we study the performance of TS and AUCB in the multi-user case based on the PLA policy for the priority access. We use two criteria for this comparison: the percentage of access to the best channels by each user, and the regret, which depends not only on the selection of worst channels but also on the number of collisions among users.
In Fig. 11, we display P_best (the percentage of times where the priority users successfully access their dedicated channels) and the cumulative regret using the PLA policy for 4 SUs based on AUCB, UCB, and TS. In Fig. 11a, b, the first priority user SU1 converges to its channel, i.e., the best one, followed by SU2, SU3, and SU4, respectively. Figure 11c compares the P_best of the first priority user under AUCB and TS. According to this figure, the first priority user quickly converges to its dedicated channel using the AUCB algorithm, while in the case of TS, the user needs more time to find the best channel. Figure 11d compares the regret of AUCB, UCB, and TS in the multi-user case using the PLA policy for the priority access. In the TS algorithm, users have to spend more time exploring the C − U worst channels, while in AUCB's case, the users quickly reach their desired channels. A lower regret increases the users' successful transmission opportunities. Moreover, selecting the dedicated channels within a short period is a significant asset in a dynamic environment.
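For reference, a minimal sketch of the Thompson sampling baseline compared here, for Bernoulli channels: each channel keeps a Beta posterior over its availability, one sample is drawn per channel, and the argmax is sensed. The uniform Beta(1, 1) prior and the toy availabilities are assumptions.

```python
import random

def thompson_select(successes, failures, rng):
    """Thompson sampling for Bernoulli channels: draw one sample from
    each channel's Beta posterior (Beta(1, 1) prior assumed) and sense
    the channel with the highest sample."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Toy run on two channels with hypothetical availabilities 0.9 and 0.2.
rng = random.Random(3)
mu = [0.9, 0.2]
succ, fail = [0, 0], [0, 0]
for _ in range(2000):
    i = thompson_select(succ, fail, rng)
    if rng.random() < mu[i]:
        succ[i] += 1
    else:
        fail[i] += 1
```

As the posteriors concentrate, the samples for the better channel dominate, so the selection settles on it; the randomness of the posterior samples is what keeps early exploration alive.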

Conclusion
In this paper, we investigated the problem of opportunistic spectrum access (OSA) in cognitive radio networks, where a SU tries to access the PUs' channels and find the best available channel as fast as possible. We proposed a new AUCB algorithm that achieves a logarithmic regret for a single user. To evaluate AUCB's performance in the multi-user case, we proposed a learning policy called PLA for secondary networks that takes the priority access into account. We also investigated PLA's performance compared to recent works, such as SLK and the musical chair. We theoretically derived upper bounds on AUCB's total regret for a single user as well as for the multi-user case under the proposed policy. Our simulation results show a logarithmic regret under AUCB and corroborate AUCB's performance compared to UCB or TS. It has also been shown that AUCB rapidly converges to the best channel while achieving a lower regret, improving the transmission time and rate of the SUs. Moreover, PLA under AUCB can decrease the number of collisions among users in the competitive scenario, thanks to a faster estimation of the channels' vacancy probabilities. Like most important works on OSA, this work focused on the independent identically distributed (IID) model, in which the state of each channel is supposed to be drawn from an IID process. In future work, we plan to consider the Markov process, which may represent a dynamic memory model describing the state of the available channels; however, it is a more complex process compared to IID. Moreover, our current model ignores dynamic traffic at the secondary nodes; therefore, extending our algorithm to include a queueing-theoretic formulation is desirable.
For a more realistic model, future work will also investigate the effects of using state-of-the-art spectrum sensing techniques, to detect the activity of the primary users, on the performance of learning and decision-making. Moreover, considering imperfect sensing, i.e., the probabilities of false alarm and missed detection, represents a new challenge toward developing a more realistic network model.

Appendix A: Convergence proof of AUCB
In this Appendix, we show that the upper bound of the regret of AUCB is logarithmic with respect to time, which means that, after a finite time, the user will find and access the best channel with availability $\mu_1$. The regret for a single user up to the total number of slots $n$ under a policy $\beta$ can be expressed as follows:

$$R(n, \beta) = \sum_{i=2}^{C} (\mu_1 - \mu_i)\, \mathbb{E}[T_i(n)] \quad (15)$$

where $C$ represents the number of channels; $\mu_1$ and $\mu_i$ are the availability probabilities of the best channel and the $i$th worst one, respectively; $\mathbb{E}(.)$ represents the mathematical expectation; and $T_i(n)$ is the number of times that the $i$th channel has been sensed by the user up to $n$. According to Eq. (15), and since the channel availabilities are constant, an upper bound of $T_i(n)$ contributes to finding an upper bound of $R(n, \beta)$. The user senses the $i$th channel during the initialization stage and every time $\beta(t) = i$, where $\beta(t)$ represents the channel selected at instant $t$ under the policy $\beta$; then, $T_i(n)$ can be expressed as follows:

$$T_i(n) = 1 + \sum_{t=C+1}^{n} \mathbb{1}_{\{\beta(t)=i\}} \quad (16)$$

where the indicator $\mathbb{1}_{\{\beta(t)=i\}}$ equals one if $\beta(t) = i$ and zero otherwise. Let us consider that a SU senses each channel at least $l$ times up to $n$; then, according to Eq. (16), $T_i(n)$ can be bounded as follows:

$$T_i(n) \le l + \sum_{t=C+1}^{n} \mathbb{1}_{\{\beta(t)=i,\; T_i(t-1) \ge l\}} \quad (17)$$

As AUCB selects at each time slot the channel with the highest index obtained in the previous slot, the user may access, at slot $t$, a non-optimal channel $i$ if the index of this channel at $(t-1)$, $B_i(t-1, T_i(t-1))$, is higher than the index of the best channel, $B^*(t-1, T^*(t-1))$. In this case, we can develop Eq. (17) further as follows:

$$T_i(n) \le l + \sum_{t=C+1}^{n} \mathbb{1}_{\{B_i(t-1, T_i(t-1)) \ge B^*(t-1, T^*(t-1)),\; T_i(t-1) \ge l\}} \quad (18)$$

According to Eq. (2), the index of a channel is based on:
• The exploitation factor $\bar{X}_i(T_i(t))$, representing the estimated availability probability.
• The exploration factor $A_i(t, T_i(t))$, which forces the algorithm to explore different channels. Under AUCB, this factor is defined as follows:

$$A_i(t, T_i(t)) = \arctan\left(\sqrt{\frac{\alpha \ln(t)}{T_i(t)}}\right) \quad (19)$$

Using Eq. (18), and noting that the summation argument follows a Bernoulli distribution (i.e., $\mathbb{E}\{X\} = P\{X = 1\}$), the expectation of $T_i(n)$ should satisfy the following constraint:

$$\mathbb{E}[T_i(n)] \le l + \sum_{t=C+1}^{n} P\left\{ B_i(t-1, T_i(t-1)) \ge B^*(t-1, T^*(t-1)),\; T_i(t-1) \ge l \right\} \quad (20)$$

The probability in Eq. (20) becomes:

$$P\left\{ \bar{X}_i(T_i(t-1)) + \arctan\left(\sqrt{\tfrac{\alpha \ln(t)}{T_i(t-1)}}\right) \ge \bar{X}^*(T^*(t-1)) + \arctan\left(\sqrt{\tfrac{\alpha \ln(t)}{T^*(t-1)}}\right),\; T_i(t-1) \ge l \right\} \quad (21)$$

After the learning period, where $T_i(t-1) \ge l$, the user has a good estimation of the channel availabilities and thus regularly accesses the best channel. Therefore, $T_i(t-1) \le T^*(t-1)$, and $\arctan\left(\sqrt{\tfrac{\alpha \ln(t)}{T_i(t-1)}}\right) \ge \arctan\left(\sqrt{\tfrac{\alpha \ln(t)}{T^*(t-1)}}\right)$. Using the asymptotic behaviors of the non-linear functions $\sqrt{\cdot}$ and $\arctan(\cdot)$, the probability in Eq. (21) becomes bounded by:

$$P\left\{ \bar{X}_i(T_i(t-1)) + \sqrt{\tfrac{\alpha \ln(t)}{T_i(t-1)}} \ge \bar{X}^*(T^*(t-1)) + \sqrt{\tfrac{\alpha \ln(t)}{T^*(t-1)}},\; T_i(t-1) \ge l \right\} \quad (22)$$

By taking the minimum value of $\bar{X}^*(T^*(t-1)) + \sqrt{\tfrac{\alpha \ln(t)}{T^*(t-1)}}$ and the maximum value of $\bar{X}_i(T_i(t-1)) + \sqrt{\tfrac{\alpha \ln(t)}{T_i(t-1)}}$ at each time slot, we can upper bound Eq. (20) by the following equation:

$$\mathbb{E}[T_i(n)] \le l + \sum_{t=C+1}^{n} P\left\{ \max_{l \le S_i < t} \left[ \bar{X}_i(S_i) + \sqrt{\tfrac{\alpha \ln(t)}{S_i}} \right] \ge \min_{0 < S^* < t} \left[ \bar{X}^*(S^*) + \sqrt{\tfrac{\alpha \ln(t)}{S^*}} \right] \right\} \quad (23)$$

where $S_i \ge l$ to fulfill the condition $T_i(t-1) \ge l$. Then, we obtain:

$$\mathbb{E}[T_i(n)] \le l + \sum_{t=C+1}^{n} \sum_{S^*=1}^{t-1} \sum_{S_i=l}^{t-1} P\left\{ \bar{X}_i(S_i) + \sqrt{\tfrac{\alpha \ln(t)}{S_i}} \ge \bar{X}^*(S^*) + \sqrt{\tfrac{\alpha \ln(t)}{S^*}} \right\} \quad (24)$$

The inequality inside Eq. (24) is satisfied when at least one of the three following inequalities holds:

$$\bar{X}^*(S^*) \le \mu_1 - \sqrt{\tfrac{\alpha \ln(t)}{S^*}} \quad (25a)$$
$$\bar{X}_i(S_i) \ge \mu_i + \sqrt{\tfrac{\alpha \ln(t)}{S_i}} \quad (25b)$$
$$\mu_1 < \mu_i + 2\sqrt{\tfrac{\alpha \ln(t)}{S_i}} \quad (25c)$$

In fact, if all three inequalities are false, then we should have:

$$\bar{X}^*(S^*) + \sqrt{\tfrac{\alpha \ln(t)}{S^*}} > \mu_1 \ge \mu_i + 2\sqrt{\tfrac{\alpha \ln(t)}{S_i}} > \bar{X}_i(S_i) + \sqrt{\tfrac{\alpha \ln(t)}{S_i}}$$

which contradicts the inequality in Eq. (24). Using the ceiling operator $\lceil \cdot \rceil$, let $l = \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(1,i)}^2} \right\rceil$, where $\Delta_{(1,i)} = \mu_1 - \mu_i$ and $S_i \ge l$; then Eq. (25c) becomes false. In fact:

$$\mu_i + 2\sqrt{\tfrac{\alpha \ln(t)}{S_i}} \le \mu_i + 2\sqrt{\tfrac{\alpha \ln(n)\, \Delta_{(1,i)}^2}{4\alpha \ln(n)}} = \mu_i + \Delta_{(1,i)} = \mu_1$$

Based on Eqs. (24), (25a), and (25b), we obtain:

$$\mathbb{E}[T_i(n)] \le \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(1,i)}^2} \right\rceil + \sum_{t=C+1}^{n} \sum_{S^*=1}^{t-1} \sum_{S_i=l}^{t-1} \left[ P\left\{ \bar{X}^*(S^*) \le \mu_1 - \sqrt{\tfrac{\alpha \ln(t)}{S^*}} \right\} + P\left\{ \bar{X}_i(S_i) \ge \mu_i + \sqrt{\tfrac{\alpha \ln(t)}{S_i}} \right\} \right] \quad (26)$$

Using the Chernoff-Hoeffding bound [54], we can prove that:

$$P\left\{ \bar{X}^*(S^*) \le \mu_1 - \sqrt{\tfrac{\alpha \ln(t)}{S^*}} \right\} \le e^{-2\alpha \ln(t)} = t^{-2\alpha} \quad (27)$$
$$P\left\{ \bar{X}_i(S_i) \ge \mu_i + \sqrt{\tfrac{\alpha \ln(t)}{S_i}} \right\} \le e^{-2\alpha \ln(t)} = t^{-2\alpha} \quad (28)$$

The two equations above and Eq. (26) lead us to:

$$\mathbb{E}[T_i(n)] \le \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(1,i)}^2} \right\rceil + 2\sum_{t=1}^{n} t^{2-2\alpha} \quad (29)$$

According to the Cauchy series criterion [55], the parameter $\alpha$ should be higher than $\frac{3}{2}$ in order to find an upper bound of the second term in the above equation. Let $\alpha = 2$; to resolve $\sum_{t=1}^{n} t^{-2}$, we consider the Taylor series expansion of $\sin(t)$:

$$\sin(t) = t - \frac{t^3}{3!} + \frac{t^5}{5!} - \cdots \quad (30)$$

As $\sin(t) = 0$ when $t = \pm k\pi$, it can also be factored as an infinite product:

$$\sin(t) = t \prod_{k=1}^{\infty} \left(1 - \frac{t^2}{k^2\pi^2}\right) = t - \left(\sum_{k=1}^{\infty} \frac{1}{k^2\pi^2}\right) t^3 + \cdots$$

where the coefficient of each power of $t$ plays the role of a general coefficient $q_k$. By comparing the above expansion with Eq. (30), we obtain $\sum_{i=1}^{\infty} \frac{1}{i^2} = \frac{\pi^2}{3!} = \frac{\pi^2}{6}$.
Finally, we obtain the upper bound of $\mathbb{E}[T_i(n)]$ as follows:

$$\mathbb{E}[T_i(n)] \le \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(1,i)}^2} \right\rceil + \frac{\pi^2}{3} \quad (31)$$
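The series value used in the last step of the proof can be sanity-checked numerically; this quick check is illustrative only.

```python
import math

# Partial sums of sum_{t>=1} 1/t^2 increase toward pi^2/6 (= pi^2/3!),
# so the non-logarithmic term of the regret bound stays finite for alpha = 2.
partial = sum(1.0 / t ** 2 for t in range(1, 100001))
limit = math.pi ** 2 / 6
```

The tail of the series beyond t = n is of order 1/n, so the partial sum over 10^5 terms already agrees with the limit to about five decimal places.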

Appendix B: Upper bound on the collision number under PLA
In this Appendix, we show that the total number of collisions occurring among secondary users under the PLA policy is bounded. We denote by D(n) the total number of collisions among users up to n, and we count separately the collisions occurring in the U best channels (e.g., C1 and C2 in the two-user case). We assume that, when the users have a good estimation of the channel availabilities and each one accesses its dedicated channel, a non-collision state is achieved. On the other hand, the kth user may collide with other users in two cases:
• If it does not clearly identify its dedicated channel.
• If it does not respect its prior rank⁵.
⁵ After each collision, and according to our policy PLA, the user should regenerate a rank.

Let $\bar{T}^k(n)$ and $S_s$ be, respectively, the total number of times where the $k$th user badly identifies its dedicated channel up to $n$, and the time needed to return to its prior rank. After each bad estimation, the user changes its dedicated channel; in this case, it may collide with other users until converging back to its prior rank. Subsequently, for all values of $n$, the total number of collisions of the $k$th user, $D_k(n)$, can be upper bounded by:

$$\mathbb{E}[D_k(n)] \le \mathbb{E}[\bar{T}^k(n)\, S_s]$$

As $\bar{T}^k(n)$ and $S_s$ are independent, we have:

$$\mathbb{E}[D_k(n)] \le \mathbb{E}[\bar{T}^k(n)]\, \mathbb{E}[S_s]$$

Let us find an upper bound of $\mathbb{E}[\bar{T}^k(n)]$, and let $A_k(t)$ be the event that the $k$th user identifies its dedicated channel, the $k$th best one, at instant $t$. Then, $\forall\, k+1 \le i \le C$ and $1 \le m \le k-1$, the event $A_k(t)$ takes place when the following condition is satisfied:

$$B_m(t) \ge B_k(t) \ge B_i(t)$$

For a bad estimation event at instant $t$, $\exists\, i \in \{k+1, \ldots, C\}$ or $\exists\, m \in \{1, \ldots, k-1\}$ such that the complementary event $\bar{A}_k(t)$ is true when we have the following condition:

$$B_i(t) > B_k(t) \quad \text{or} \quad B_k(t) > B_m(t)$$

Then, the total number of times of a bad estimation, where the $k$th priority user does not access its channel up to $n$, $\mathbb{E}[\bar{T}^k(n)]$, can be upper bounded as follows:

$$\mathbb{E}[\bar{T}^k(n)] \le \sum_{i=k+1}^{C} \mathbb{E}[T_{B_i > B_k}(n)] + \sum_{m=1}^{k-1} \mathbb{E}[T_{B_m < B_k}(n)] \quad (35)$$

where $T_{B_i > B_k}(n)$ represents the number of times in which the index of the $i$th channel exceeds that of the $k$th best one, for all $i \in \{k+1, \ldots, C\}$, up to $n$; and $T_{B_m < B_k}(n)$ represents the number of times in which the index of the $k$th best channel exceeds that of the $m$th one, for all $m \in \{1, \ldots, k-1\}$. It is worth mentioning that, for the first priority user, $\mathbb{E}[T_{B_m < B_k}(n)]$ equals 0, since its dedicated channel has the highest availability probability. $T_{B_i > B_k}(n)$ for the $k$th user has a similar definition as in Eq. (31) for a single user; then this term can be upper bounded, as in Appendix A, by:

$$\mathbb{E}[T_{B_i > B_k}(n)] \le \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(k,i)}^2} \right\rceil + \frac{\pi^2}{3} \quad (37)$$

Similarly, the second term $\mathbb{E}[T_{B_m < B_k}(n)]$ in Eq. (35) should satisfy:

$$\mathbb{E}[T_{B_m < B_k}(n)] \le \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(m,k)}^2} \right\rceil + \frac{\pi^2}{3}$$

where $\Delta_{(m,k)} \ge \Delta_{(k-1,k)}$ for all $m \in \{1, \ldots, k-1\}$. Then, we obtain:

$$\sum_{m=1}^{k-1} \mathbb{E}[T_{B_m < B_k}(n)] \le (k-1)\left( \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(k-1,k)}^2} \right\rceil + \frac{\pi^2}{3} \right) \quad (39)$$

Based on Eqs. (35), (37), and (39), $\mathbb{E}[\bar{T}^k(n)]$ can be expressed as follows:

$$\mathbb{E}[\bar{T}^k(n)] \le \sum_{i=k+1}^{C} \left( \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(k,i)}^2} \right\rceil + \frac{\pi^2}{3} \right) + (k-1)\left( \left\lceil \frac{4\alpha \ln(n)}{\Delta_{(k-1,k)}^2} \right\rceil + \frac{\pi^2}{3} \right)$$

Let us now estimate the time $S_s$, and let us consider $U$ users with different priority levels based on our policy PLA. At a certain moment, supposing that each user has a random rank, at least two of them may have the same rank, and a collision may occur. In this case, each colliding user should regenerate a random rank around its prior rank⁶. After a finite number of collisions, the system converges to the steady state where each user has a unique rank, i.e., its prior rank. Let $S_s$ be a random variable with a countable set of finite outcomes $1, 2, \ldots$ occurring with probabilities $p_1, p_2, \ldots$, respectively, where $p_t$ represents the probability of a non-collision at instant $t$. The expectation of $S_s$ can be expressed as follows:

$$\mathbb{E}[S_s] = \sum_{t \ge 1} t\, P[S_s = t]$$

where the random variable $S_s$ follows the probability:

$$P[S_s = t] = p_t \prod_{j=1}^{t-1} (1 - p_j)$$

The global regret under the multi-user case, according to Eq. (5), can be defined as follows:

$$R(n) = n \sum_{k=1}^{U} \mu_k - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu_i\, \mathbb{E}[P_{i,j}(n)] \quad (47)$$

where $\mu_k$ is the availability probability of the $k$th best channel and $P_{i,j}(n)$ represents the total number of non-collisions up to $n$ in channel $i$ for user $j$. Let $T_{i,j}(n)$ be the total number of times where the $j$th user senses the $i$th channel up to $n$.
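The global regret definition above can be evaluated empirically from access counts; the following is a minimal sketch, where `P[i][j]` (collision-free accesses of channel i by user j) and the availabilities (assumed sorted in decreasing order) are hypothetical inputs.

```python
def multiuser_regret(mu_sorted, P, n):
    """Empirical multi-user regret: ideal reward of the U best channels
    over n slots, minus the reward collected through collision-free
    accesses. P[i][j] counts collision-free accesses of channel i by
    user j (hypothetical input); mu_sorted lists availabilities,
    best first."""
    U = len(P[0])                              # number of users
    ideal = n * sum(mu_sorted[:U])             # n * sum of U best availabilities
    collected = sum(mu_sorted[i] * sum(P[i]) for i in range(len(mu_sorted)))
    return ideal - collected

# Two users, three channels: each user parked on its own best channel,
# so no reward is lost and the regret is (numerically) zero.
r0 = multiuser_regret([0.9, 0.8, 0.7], [[10, 0], [0, 10], [0, 0]], 10)
```

Any slot lost to a collision or spent on a channel outside the U best ones reduces `collected` and therefore shows up directly in the regret.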
$T_i(n) = \sum_{j=1}^{U} T_{i,j}(n)$ and $P_i(n) = \sum_{j=1}^{U} P_{i,j}(n)$ represent, respectively, the total number of times where the $i$th channel is sensed by all users, and the total number of times where the users access the $i$th channel without making any collision with each other, up to $n$. Let $O_k(n)$ be the number of collisions in the $k$th best channel (as defined at the beginning of Appendix B). Based on $T_k(n)$ and $P_k(n)$ for the $k$th best channel, $O_k(n)$ can be expressed as follows:

$$O_k(n) = T_k(n) - P_k(n)$$

It is worth mentioning that the number of channels $C$ should be higher than the number of users $U$; otherwise:
• Using a learning algorithm to find the best channels does not make any sense, since all channels need to be accessed.
• Considering that each user senses one channel at each time slot, at least one collision occurs among users at every slot, and the users cannot converge to a collision-free state under any learning policy.
Subsequently, by supposing that $C \ge U$ and $\mu_1 \ge \mu_i\ \forall i$, we can upper bound the regret in Eq. (47) of our policy PLA under our algorithm AUCB by the following equation:

$$R(n) \le \mu_1 \sum_{k=1}^{U} \left( n - \mathbb{E}[P_k(n)] \right) \quad (48)$$

At each time slot, each user senses exactly one channel; then we can consider that:

$$\sum_{i=1}^{C} T_{i,j}(n) = n,\ \forall j \quad \Rightarrow \quad \sum_{i=1}^{C} T_i(n) = nU \quad (49)$$

Based on the above expression, and using $O_k(n) = T_k(n) - P_k(n)$, the regret can be expressed as follows:

$$R(n) \le \mu_1 \left( \sum_{i=1}^{C} \mathbb{E}[T_i(n)] - \sum_{k=1}^{U} \mathbb{E}[T_k(n)] + \sum_{k=1}^{U} \mathbb{E}[O_k(n)] \right) \quad (50)$$

We can break $\sum_{i=1}^{C} \mathbb{E}[T_i(n)]$ into two terms:

$$\sum_{i=1}^{C} \mathbb{E}[T_i(n)] = \sum_{k=1}^{U} \mathbb{E}[T_k(n)] + \sum_{i=U+1}^{C} \mathbb{E}[T_i(n)] \quad (51)$$

Based on Eqs. (50) and (51), we obtain the following equation:

$$R(n) \le \mu_1 \left( \sum_{i=U+1}^{C} \mathbb{E}[T_i(n)] + \sum_{k=1}^{U} \mathbb{E}[O_k(n)] \right) \quad (52)$$

It is worth mentioning that the global regret in the multi-user case depends on the selection of worst channels as well as on the number of collisions among users. Eq. (52) can be further developed by replacing $\mathbb{E}[O_k(n)]$ with its upper bound $\mathbb{E}[\bar{T}^k(n)]\, \mathbb{E}[S_s]$ obtained above. The resulting regret contains three components: the first one is due to the loss of reward when selecting worst channels by all users, while the second and third components represent the loss of reward due to collisions among users in the $U$ best channels. In fact, the regret of PLA is worse than the regret under the side channel policy treated in the next Appendix.

Appendix C: Regret of AUCB under the side channel policy

In this Appendix, we prove that the upper bound of the regret of our algorithm AUCB for the multi-user case under the side channel policy has a logarithmic asymptotic behavior. In this policy, we supposed that no collision occurs among users when the priority user broadcasts the choice of its channel, without considering that the broadcast packet of the priority user may be lost. Considering the latter scenario may add some constant values to the regret as a result of the collisions among users. The regret under the cooperative access can thus be defined as below:

$$R(n) = n \sum_{k=1}^{U} \mu_k - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu_i\, \mathbb{E}[T_{i,j}(n)] \quad (55)$$

since, under the side channel policy, no collisions occur and $P_{i,j}(n) = T_{i,j}(n)$. According to Eq. (49), we obtain:

$$n \sum_{k=1}^{U} \mu_k = \sum_{k=1}^{U} \mu_k \sum_{i=1}^{C} \mathbb{E}[T_{i,k}(n)] \quad (56)$$

Based on Eqs. (55) and (56), the regret can be expressed as follows:

$$R(n) = \sum_{k=1}^{U} \sum_{i=1}^{C} \Delta_{(k,i)}\, \mathbb{E}[T_{i,k}(n)] \quad (57)$$

where $\Delta_{(k,i)} = \mu_k - \mu_i$, and $k$ and $i$ represent, respectively, the $k$th best channel and the $i$th channel. To simplify the above equation, we consider the summation over best and worst channels separately, as follows:

$$R(n) = \sum_{k=1}^{U} \sum_{i=1}^{U} \Delta_{(k,i)}\, \mathbb{E}[T_{i,k}(n)] + \sum_{k=1}^{U} \sum_{i=U+1}^{C} \Delta_{(k,i)}\, \mathbb{E}[T_{i,k}(n)] \quad (58)$$

The first term of the regret in Eq. (58) equals 0; then we obtain:

$$R(n) = \sum_{k=1}^{U} \sum_{i=U+1}^{C} \Delta_{(k,i)}\, \mathbb{E}[T_{i,k}(n)]$$

In this Appendix, we proved that the global regret of AUCB under the side channel policy has a logarithmic asymptotic behavior with respect to n, which means that after a