Sensing time and power allocation for cognitive radios using distributed Q-learning

In cognitive radio systems, the sparse assigned frequency bands are opened to secondary users, provided that the aggregated interference induced by the secondary transmitters on the primary receivers is negligible. Cognitive radios are established in two steps: the radios first sense the available frequency bands and then communicate using these bands. In this article, we propose two decentralized resource allocation Q-learning algorithms. The first one is used to share the sensing time among the cognitive radios in a way that maximizes the throughputs of the radios. The second one is used to allocate the cognitive radio powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. Numerical results show the convergence of the proposed algorithms and allow the discussion of the exploration strategy, the choice of the cost function and the frequency of execution of each algorithm.


Introduction
The scarcity of available radio spectrum frequencies, densely allocated by the regulators, represents a major bottleneck in the deployment of new wireless services. Cognitive radios have been proposed as a new technology to overcome this issue [1]. For cognitive radio use, the assigned frequency bands are opened to secondary users, provided that interference induced on the primary licensees is negligible. Cognitive radios are established in two steps: the radios firstly sense the available frequency bands and secondly communicate using these bands.
To tackle the fading phenomenon (an attenuation of the received power due to destructive interference between the multiple interactions of the emitted wave with the environment) when sensing the frequency spectrum, cooperative spectrum sensing has been proposed to take advantage of the spatial diversity in wireless channels [2,3]. In cooperative spectrum sensing, the secondary cognitive nodes send the results of their individual observations of the primary signal to a base station through specific control channels. The base station then combines the received information in order to make a decision about the primary network presence. Each cognitive node observes the primary signal during a certain sensing time, which should be long enough to ensure the correct detection of the primary emitter but short enough that the node still has enough time to communicate. In the literature [4,5], the sensing times used by the cognitive nodes are generally assumed to be identical and allocated by a central authority. In [6], the sensing performance of a network of independent cognitive nodes that individually select their sensing times is analyzed using evolutionary game theory.
It is generally considered in the literature that the secondary users can only transmit if the primary network is inactive or if the secondary users are located outside a keep-out region surrounding the primary transmitter or, equivalently, if the secondary users generate an interference below a given threshold on a so-called protection contour surrounding the primary transmitter [7,8]. However, multiple simultaneously transmitting secondary users may individually meet the protection contour constraint while collectively generating an aggregated interference that exceeds the acceptable threshold. In [7], the effect of aggregated interference caused by IEEE 802.22 secondary users on primary DTV receivers is analyzed. In [9], the aggregated interference generated by a large-scale secondary network is modeled and the impact of the secondary network density on the sensing requirements is investigated. In [10], a decentralized power allocation Q-learning algorithm is proposed to protect the primary network from harmful aggregated interference. The proposed algorithm removes the need for a central authority to allocate the powers in the secondary network and therefore minimizes the communication overhead. The cost functions used by the algorithm are chosen so that the aggregated interference constraint is exactly met on the protection contour. Unfortunately, the cost functions do not take into account the preferences of the secondary network.
This article aims to illustrate the potential of Q-learning for cognitive radio systems. For this purpose, two decentralized Q-learning algorithms are presented to solve the allocation problems that appear during the sensing phase on the one hand and during the communication phase on the other hand. The first algorithm shares the sensing times among the cognitive radios in a way that maximizes the throughputs of the radios. The second algorithm allocates the secondary user powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. The agents self-adapt by directly interacting with the environment in real time and by properly utilizing their past experience. They aim to distributively learn an optimal strategy to maximize their throughputs or their SINRs.
Reinforcement learning algorithms such as Q-learning are particularly efficient in applications where reinforcement information (i.e., cost or reward) is provided after an action is performed in the environment [11]. The sensing time and power allocation problems both allow for the easy definition of such information. In this article, we make the assumption that no information is exchanged between the agents for each of the two problems. As a result, many traditional multi-agent reinforcement learning algorithms like fictitious play and Nash-Q learning cannot be used [12], which justifies the use of multi-agent Q-learning in this article to solve the sensing time and power allocation problems.
This distributed allocation of the sensing times and the node powers presents several advantages compared to a centralized allocation [10]: (1) robustness of the system towards a variation of parameters (such as the gains of the sensing channels), (2) maintainability of the system thanks to the modularity of the multiple agents and (3) scalability of the system as the need for control communication is minimized: on the one hand there is no need for a central authority to send the result of a centralized allocation to the multiple nodes and on the other hand these nodes do not have to send their specific parameters (sensing SNRs and data rates for the sensing time allocation, space coordinates for the power allocation problem). In addition, a centralized allocation is not a trivial operation, as the sensing time and the power allocation problems are both essentially multi-criteria problems where multiple objective functions to maximize can be defined (e.g., the sum of the individual rewards to aim for a global optimum or the minimum individual reward to guarantee more fairness).
The rest of this article is organized as follows: in Section 2, we formulate the problem of sensing time allocation in the secondary network. In Section 3, we formulate the problem of power allocation in the secondary network. In Section 4, we present the decentralized Q-learning algorithms used to solve the sensing time allocation problem and the power allocation problem. In Section 5, we present numerical results allowing the discussion of the performance of the Q-learning algorithms for different exploration strategies, cost functions and execution frequencies.

Cooperative spectrum sensing
The licensed band is assumed to be divided into N sub-bands, and each secondary user is assumed to communicate in one of the N sub-bands when the primary user is absent. When it is present, the primary network is assumed to use all N sub-bands for its communications. Therefore, the secondary users can jointly sense the primary network presence on these sub-bands and report their observations via a narrowband control channel.
We consider a cognitive radio cell made of N + 1 nodes including a central base station. Each node j performs an energy detection of the received signal using $M_j$ samples [13,14]. The observed energy value at the jth node is given by the random variable:

$$Y_j = \begin{cases} \sum_{i=1}^{M_j} n_{ji}^2, & H_0 \\ \sum_{i=1}^{M_j} (s_{ji} + n_{ji})^2, & H_1 \end{cases}$$

where $s_{ji}$ and $n_{ji}$ denote the received primary signal and additive white noise at the ith sample of the jth cognitive radio, respectively ($1 \le j \le N$, $1 \le i \le M_j$). These samples are assumed to be real without loss of generality. $H_0$ and $H_1$ represent the hypotheses associated with primary signal absence and presence, respectively. In the distributed detection problem, the coordinator node receives information from each of the N nodes (e.g., the communicated $Y_j$) and must decide between the two hypotheses.
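We assume that the instantaneous noise at each node $n_{ji}$ can be modeled as a zero-mean Gaussian random variable with unit variance, $n_{ji} \sim \mathcal{N}(0, 1)$. Let $\gamma_j$ be the signal-to-noise ratio (SNR) computed at the jth node. The energy $Y_j$ is then distributed as:

$$Y_j \sim \begin{cases} \chi^2_{M_j}, & H_0 \\ \chi^2_{M_j}(\lambda_j), & H_1 \end{cases}$$

where $\chi^2_{M_j}$ denotes a central chi-squared distribution with $M_j$ degrees of freedom, $\chi^2_{M_j}(\lambda_j)$ denotes a noncentral chi-squared distribution with $M_j$ degrees of freedom and $\lambda_j = M_j \gamma_j$ is the noncentrality parameter. Furthermore, if $M_j$ is large, the Central Limit theorem gives [15]:

$$Y_j \sim \begin{cases} \mathcal{N}(M_j,\, 2M_j), & H_0 \\ \mathcal{N}(M_j(1+\gamma_j),\, 2M_j(1+2\gamma_j)), & H_1 \end{cases} \quad (1)$$

From (1), it can be shown that the false alarm probability $P_{F_j} = \Pr\{Y_j > \lambda \mid H_0\}$ is given by:

$$P_{F_j} = Q\left(\frac{\lambda - M_j}{\sqrt{2M_j}}\right) \quad (2)$$

and the detection probability $P_{D_j} = \Pr\{Y_j > \lambda \mid H_1\}$ is given by:

$$P_{D_j} = Q\left(\frac{\lambda - M_j(1+\gamma_j)}{\sqrt{2M_j(1+2\gamma_j)}}\right) \quad (3)$$

where

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{+\infty} e^{-t^2/2}\, dt.$$

By combining Equations (2) and (3), the false alarm probability can be expressed with respect to the detection probability:

$$P_{F_j} = Q\left(Q^{-1}(P_{D_j})\sqrt{1+2\gamma_j} + \gamma_j\sqrt{\frac{M_j}{2}}\right) \quad (4)$$

where $Q^{-1}(x)$ is the inverse function of $Q(x)$.

As illustrated on Figure 1, we consider that every $T_H$ seconds, each node sends a one-bit value representing the local hard decision about the primary network presence to the base station. The base station combines the received bits in order to make a global decision for the nodes. The base station decision is sent back to the nodes as a one-bit value. The duration of the communication with the base station is assumed to be negligible compared to the duration $T_H$ of a time slot. In this article, we focus on the logical-OR fusion rule at the base station, but the other fusion rules could be similarly analyzed.

Under the logical-OR fusion rule, the global detection probability $P_D$ (defined as the probability that the coordinator node identifies a time slot as busy when the primary network is present during this time slot) and the global false alarm probability $P_F$ (defined as the probability that the coordinator node identifies a time slot as busy when the primary network is absent during this time slot) depend, respectively, on the local detection probabilities $P_{D_j}$ and false alarm probabilities $P_{F_j}$ [16]:

$$P_D = 1 - \prod_{j=1}^{N} (1 - P_{D_j}) \quad (5)$$

and

$$P_F = 1 - \prod_{j=1}^{N} (1 - P_{F_j}) \quad (6)$$

Given a target global detection probability $\bar{P}_D$, and taking the same local detection probability at every node, we thus have:

$$P_{D_j} = 1 - (1 - \bar{P}_D)^{1/N} \quad (7)$$

and Equation (4) can be rewritten as:

$$P_{F_j} = Q\left(Q^{-1}\left(1 - (1 - \bar{P}_D)^{1/N}\right)\sqrt{1+2\gamma_j} + \gamma_j\sqrt{\frac{M_j}{2}}\right) \quad (8)$$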
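The dependence of the false alarm probabilities on the sensing window length can be explored numerically. The following is a minimal sketch, assuming the Gaussian (Central Limit) approximation of the energy detector, the logical-OR fusion rule and an identical local detection probability at every node. The function names and the example values are illustrative assumptions, not part of the original work.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def q_func(x: float) -> float:
    """Gaussian tail probability Q(x) = Pr{Z > x} for Z ~ N(0, 1)."""
    return 1.0 - _STD_NORMAL.cdf(x)


def q_inv(p: float) -> float:
    """Inverse of Q: returns x such that Q(x) = p, for 0 < p < 1."""
    return _STD_NORMAL.inv_cdf(1.0 - p)


def local_detection_target(p_d_global: float, n_nodes: int) -> float:
    """Per-node detection probability giving p_d_global under logical OR,
    assuming every node uses the same local detection probability."""
    return 1.0 - (1.0 - p_d_global) ** (1.0 / n_nodes)


def local_false_alarm(m_samples: int, snr_linear: float, p_d_local: float) -> float:
    """Local false alarm probability for a window of m_samples samples,
    under the Gaussian approximation (snr_linear is not in dB)."""
    arg = (q_inv(p_d_local) * (1.0 + 2.0 * snr_linear) ** 0.5
           + snr_linear * (m_samples / 2.0) ** 0.5)
    return q_func(arg)


def global_false_alarm(local_pfs) -> float:
    """Logical OR: the slot is declared busy as soon as one node says busy."""
    prod = 1.0
    for p in local_pfs:
        prod *= 1.0 - p
    return 1.0 - prod


if __name__ == "__main__":
    p_d_local = local_detection_target(0.95, n_nodes=2)  # illustrative target
    for m in (1, 4, 9):
        pfs = [local_false_alarm(m, snr, p_d_local) for snr in (1.0, 0.1)]
        print(m, pfs, global_false_alarm(pfs))
```

For a fixed detection target, a longer window lowers each local false alarm probability, and therefore the global one, at the price of less time left for transmission.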

Throughput of a secondary user
The random variable representing the presence of the primary network in each time slot n is denoted $H^{(n)}$ ($H^{(n)} \in \{H_0, H_1\}$) and is assumed to be a Markov chain characterized by a transition matrix $[p_{uv}]$. It is assumed that the probability $p_{01}$ of the primary network appearance is small compared to the probability $p_{00}$. As a result, the secondary users can decide to communicate or not during a time slot based on the result of their sensing in the previous time slot while limiting the probability of interference with the primary network. A secondary user performs data transmission during the time slots that have been identified as free by the base station. In each of these time slots, $M_j T_S$ seconds are used by the secondary user to sense the spectrum, where $T_S$ denotes the sampling period. The remaining $T_H - M_j T_S$ seconds are used for data transmission. The secondary user average throughput $R_j$ is given by the sum of the throughput obtained when the primary network is absent and no false alarm has been generated by the base station plus the throughput obtained when the primary network is present but has not been detected by the base station [6]: where $\pi_{H_0} = \lim_{n \to \infty} \frac{1}{n} \sum_{\nu=1}^{n} 1_{H^{(\nu)} = H_0}$ denotes the stationary probability of the primary network absence, $C_{H_0,j}$ represents the data rate of the secondary user under $H_0$ and $C_{H_1,j}$ represents the data rate of the secondary user under $H_1$. The target detection probability $\bar{P}_D$ is required to be close to 1 since the cognitive radios should not interfere with the primary network; moreover, $\pi_{H_0}$ is usually close to 1, $C_{H_1,j} \ll C_{H_0,j}$ due to the interference from the primary network [6] and it is assumed that $p_{00} \ge p_{10}$. Therefore, (9) can be approximated by:

$$R_j \approx \frac{T_H - M_j T_S}{T_H}\, \pi_{H_0}\, p_{00}\, (1 - P_F)\, C_{H_0,j} \quad (10)$$

Sensing time allocation problem
Equations (6), (8), and (10) show that there is a tradeoff for the choice of the sensing window length $M_j$: on the one hand, if $M_j$ is high, then user j will not have enough time to perform its data transmission and $R_j$ will be low. On the other hand, if all the users use low $M_j$ values, then the global false alarm probability in (10) will be high and all the average throughputs will be low. The sensing time allocation problem consists in finding the optimal sensing window lengths $\{M_1, \dots, M_N\}$ that minimize a cost function $f(R_1, \dots, R_N)$ depending on the secondary throughputs.
In this article, the following cost function is considered: where $\bar{R}_j$ denotes the throughput required by node j. It is observed that the cost decreases with respect to $R_j$ until $R_j$ reaches the threshold value $\bar{R}_j$, then the cost increases with respect to $R_j$. This should prevent secondary users from selfishly transmitting with a throughput higher than required, which would reduce the achievable throughputs for the other secondary users.
Although a base station could determine the sensing window lengths that minimize function (11) and send these optimal values to each secondary user, in this article we rely on the secondary users themselves to determine their individual best sensing window length. This decentralized allocation avoids the introduction of signaling overhead in the system.
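For comparison with the decentralized approach, a centralized reference can be obtained for small networks by exhaustive search. The sketch below rests on several assumptions that are not taken from the original work: the throughput is approximated as the fraction of the slot left for transmission, times the probability of no global false alarm, times the data rate, scaled by a constant `p_free` standing for the product of the stationary absence probability and $p_{00}$; the cost is taken as the sum of squared deviations from the required throughputs; the values of `p_d_global` and `p_free` are hypothetical.

```python
from itertools import product
from statistics import NormalDist

_N01 = NormalDist()


def local_false_alarm(m, snr, p_d_local):
    """Local false alarm probability under the Gaussian approximation."""
    arg = (_N01.inv_cdf(1.0 - p_d_local) * (1 + 2 * snr) ** 0.5
           + snr * (m / 2) ** 0.5)
    return 1.0 - _N01.cdf(arg)


def throughputs(windows, snrs, rates, samples_per_slot, p_d_global, p_free):
    """Approximate average throughput of each node (assumed model)."""
    n = len(windows)
    p_d_local = 1.0 - (1.0 - p_d_global) ** (1.0 / n)
    no_false_alarm = 1.0  # probability that no node raises a false alarm
    for m, snr in zip(windows, snrs):
        no_false_alarm *= 1.0 - local_false_alarm(m, snr, p_d_local)
    return [(1.0 - m / samples_per_slot) * p_free * no_false_alarm * c
            for m, c in zip(windows, rates)]


def best_windows(snrs, rates, required, samples_per_slot,
                 p_d_global=0.95, p_free=0.9):
    """Enumerate every window combination and keep the lowest-cost one."""
    best, best_cost = None, float("inf")
    for windows in product(range(1, samples_per_slot), repeat=len(snrs)):
        r = throughputs(windows, snrs, rates, samples_per_slot,
                        p_d_global, p_free)
        cost = sum((ri - req) ** 2 for ri, req in zip(r, required))
        if cost < best_cost:
            best, best_cost = windows, cost
    return best, best_cost


if __name__ == "__main__":
    # Two nodes with sensing SNRs of 0 dB and -10 dB (linear 1.0 and 0.1).
    print(best_windows([1.0, 0.1], [0.6, 0.6], [0.1, 0.1], 10))
```

The search space grows exponentially with the number of nodes, which is one reason why such an enumeration is only practical as a reference for very small networks.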

Power allocation problem formulation
We consider a large circular primary cell made up of one central primary emitter and several primary receivers whose positions are unknown. The primary emitter could be a DTV broadcasting station that communicates with multiple passive receivers.
The secondary network uses the same frequency band as the primary network and consists of L adjacent secondary cells. Each secondary cell is made up of one central secondary base station and multiple secondary users. For the sake of simplicity, all the secondary users (SU) are assumed to be located on the line that joins the L base stations (BS), as illustrated on Figure 2. The reader is referred to [10] for more realistic assumptions regarding the geometry of the power allocation problem.
In order to protect the primary receivers from receiving harmful interference from the secondary users, a protection contour is defined around the primary emitter as a circle on which the received primary SINR must be greater than a given threshold $\mathrm{SINR}^p_{Th}$. The secondary cells are located around the protection contour. As the primary cell radius is assumed to be much larger than the secondary cell radius, the protection contour can be approximated by a line parallel to the line of secondary base stations.
The secondary network is assumed to follow a Time Division Multiple Access (TDMA) scheme, so that at each time only one secondary user $SU_l$ communicates with its base station $BS_l$ in cell l ($l \in \{1, \dots, L\}$). The difference between the $SU_l$ and $BS_l$ abscissas is denoted $x_l$. The point on the protection contour whose distance to $SU_l$ is minimal is denoted $I_l$. We assume that each cell l deploys sensors on the protection contour so that it is able to measure the primary network SINR at the point $I_l$, denoted $\mathrm{SINR}^p_l$. In this article, the analysis is focused on the interference generated by the upstream transmissions of the secondary users. It is assumed that the secondary SINR at each base station l, denoted $\mathrm{SINR}^s_l$, needs to be greater than a given threshold $\mathrm{SINR}^s_{Th}$ for the secondary communication to be reliable.
The power allocation problem consists in finding the optimal secondary user transmission powers $\{P_1, \dots, P_L\}$ that minimize a cost function $f(\mathrm{SINR}^s_1, \dots, \mathrm{SINR}^s_L)$ depending on the secondary SINRs, under the constraints that

$$\mathrm{SINR}^p_l \ge \mathrm{SINR}^p_{Th}, \quad l \in \{1, \dots, L\} \quad (12)$$

In this article, the following cost function is considered: It is observed that the cost decreases with respect to $\mathrm{SINR}^s_l$ until $\mathrm{SINR}^s_l$ reaches the threshold value $\mathrm{SINR}^s_{Th}$, then the cost increases with respect to $\mathrm{SINR}^s_l$. This should prevent secondary users from selfishly transmitting with a power higher than required, which would remove transmission opportunities for other secondary users.
The primary SINRs in Equation (12) are given by:

$$\mathrm{SINR}^p_l = \frac{P_p}{\sigma^2 + \sum_{k=1}^{L} P_k\, h_{SU_k I_l}} \quad (14)$$

where $P_p$ is the power that is received on the protection contour from the primary transmitter, $\sigma^2$ is the noise power and $h_{SU_k I_l}$ is the link gain between $SU_k$ and the point $I_l$ on the protection contour.
The secondary SINRs in Equation (13) are given by: where $h_{SU_l BS_l}$ is the link gain between $SU_l$ and $BS_l$. In this article, we consider free space path loss. Therefore, the link gains are computed as follows: where $r_s$ is the radius of the secondary cells, $f_c$ is the transmission frequency and c is the speed of light in vacuum.
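The effect of aggregated interference on the protection contour can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the link gain is the textbook free-space gain $(c / (4 \pi f_c d))^2$ for a distance d, every secondary transmitter interferes at the same contour point, and the transmit powers, distances and noise level in the example are purely illustrative values chosen to make the effect visible.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def dbm_to_mw(p_dbm: float) -> float:
    """Convert a power in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def to_db(ratio: float) -> float:
    """Convert a linear power ratio to decibels."""
    return 10 * math.log10(ratio)


def free_space_gain(distance_m: float, freq_hz: float) -> float:
    """Linear free-space link gain between isotropic antennas."""
    return (C_LIGHT / (4 * math.pi * freq_hz * distance_m)) ** 2


def primary_sinr_db(p_primary_dbm, noise_dbm, tx_powers_dbm, distances_m, freq_hz):
    """Primary SINR at a contour point given all secondary transmitters."""
    interference = sum(dbm_to_mw(p) * free_space_gain(d, freq_hz)
                       for p, d in zip(tx_powers_dbm, distances_m))
    return to_db(dbm_to_mw(p_primary_dbm) / (dbm_to_mw(noise_dbm) + interference))


if __name__ == "__main__":
    f_c = 2.45e9  # Hz
    one = primary_sinr_db(0.0, -100.0, [58.0], [100.0], f_c)
    two = primary_sinr_db(0.0, -100.0, [58.0, 58.0], [100.0, 100.0], f_c)
    print(f"one transmitter: {one:.1f} dB, two transmitters: {two:.1f} dB")
```

With these illustrative numbers, a single transmitter leaves the primary SINR at roughly 22 dB, above a 20 dB threshold, while two identical transmitters bring it down to roughly 19 dB: each user meets the constraint individually but the pair violates it.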

Q-learning algorithm
In this article, we use two multi-agent Q-learning algorithms. The first one is used to allocate the secondary user sensing times and the second one is used to allocate the secondary user transmission powers. In the sensing time allocation algorithm, each secondary user is an agent that aims to learn an optimal sensing time allocation policy for itself. In the power allocation algorithm, each secondary base station is an agent that aims to learn an optimal power allocation policy for its cell.
Q-learning implementation requires the environment to be modeled as a finite-state discrete-time stochastic system. The set of all possible states of the environment is denoted $\mathcal{S}$. At each learning iteration, the agent that executes the learning algorithm performs an action chosen from the finite set $\mathcal{A}$ of all possible actions. Each learning iteration consists in the following sequence:
1) The agent senses the state $s \in \mathcal{S}$ of the environment.
2) Based on s and its accumulated knowledge, the agent chooses and performs an action $a \in \mathcal{A}$.
3) Because of the performed action, the state of the environment is modified. The new state is denoted s'. The transition from s to s' generates a cost $c \in \mathbb{R}$ for the agent.
4) The agent uses c and s' to update the accumulated knowledge that made it choose the action a when the environment was in state s.
The Q-learning algorithm keeps a quality information (the Q-value) for every state-action pair (s, a) it has tried. The Q-value Q(s, a) represents how high the expected quality of an action a is when the environment is in state s [17]. The following policy is used for the selection of the action a by the agent when the environment is in state s:

$$a = \begin{cases} \arg\max_{a' \in \mathcal{A}} Q(s, a') & \text{with probability } 1 - \epsilon \\ \text{random action in } \mathcal{A} & \text{with probability } \epsilon \end{cases} \quad (16)$$

where $\epsilon$ is the randomness for exploration of the learning algorithm.
The cost c and the new state s' generated by the choice of action a in state s are used to update the Q-value Q(s, a) based on how good the action a was and how good the new optimal action will be in state s'. The update is handled by the following rule:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left(-c + \gamma \max_{a' \in \mathcal{A}} Q(s', a')\right) \quad (17)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount rate of the algorithm.
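The selection policy and the update rule can be sketched as a small tabular agent. This is a minimal illustration, assuming an epsilon-greedy choice (random action with probability epsilon, best known action otherwise) and an update that blends the old Q-value with the term "negative cost plus discounted best Q-value of the new state". The class name `QAgent` and its default parameters are hypothetical.

```python
import random
from collections import defaultdict


class QAgent:
    """Tabular Q-learning agent driven by costs (lower cost is better)."""

    def __init__(self, actions, alpha=0.5, gamma=0.7, epsilon=0.1, rng=None):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount rate
        self.epsilon = epsilon  # randomness for exploration
        self.rng = rng or random.Random()
        self.q = defaultdict(float)  # unseen (state, action) pairs start at 0

    def best_value(self, state):
        """Highest Q-value reachable from the given state."""
        return max(self.q[(state, a)] for a in self.actions)

    def choose(self, state):
        """Epsilon-greedy selection of the next action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, cost, new_state):
        """Blend previous knowledge with the newly received information."""
        target = -cost + self.gamma * self.best_value(new_state)
        old = self.q[(state, action)]
        self.q[(state, action)] = (1 - self.alpha) * old + self.alpha * target


if __name__ == "__main__":
    agent = QAgent(actions=[1, 2, 3], epsilon=0.1, rng=random.Random(0))
    action = agent.choose("s0")
    agent.update("s0", action, cost=2.0, new_state="s1")
    print(dict(agent.q))
```

Because each agent only stores its own table and only needs its own cost, several such agents can run side by side without exchanging any information.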
The learning rate $\alpha \in [0, 1]$ is used to control the linear blend between the previously accumulated knowledge about the (s, a) pair, Q(s, a), and the newly received quality information $-c + \gamma \max_{a' \in \mathcal{A}} Q(s', a')$. A high value of $\alpha$ gives little importance to previous experience, as the stored Q-values are easily altered by new information, while a low value of $\alpha$ gives an algorithm that learns slowly.
The discount rate $\gamma \in [0, 1]$ is used to control how much the success of a later action a' should be brought back to the earlier action a that led to the choice of a'. A high value of $\gamma$ gives a low importance to the cost of the current action compared to the Q-value of the new state this action leads to, while a low value of $\gamma$ would rate the current action almost only based on the immediate reward it provides.
The randomness for exploration $\epsilon \in [0, 1]$ is used to control how often the algorithm should take a random action instead of the best action it knows. A high value of $\epsilon$ favors exploration of new good actions over exploitation of existing knowledge, while a low value of $\epsilon$ reinforces what the algorithm already knows instead of trying to find new better actions. The exploration-exploitation trade-off is typical of learning algorithms. In this article, we consider online learning (i.e., at every time step the agents should display intelligent behaviors), which requires a low $\epsilon$ value.

Q-Learning implementation for sensing time allocation
Each secondary user is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions and cost function used to solve the sensing time allocation problem.
At each iteration $t \in \{1, \dots, K\}$ of the learning algorithm, a secondary user $j \in \{1, \dots, N\}$ represents the local state $s_{j,t}$ of the environment as follows: where $n_{H_0,t-1}$ denotes the number of time slots that have been identified as free by the base station during the (t - 1)th learning period.
The number of free time slots takes one out of r values: At each iteration t, the action selected by the secondary user j is the duration $M_{j,t}$ of the sensing window to be used during the $T_L$ seconds of the learning iteration t. It is assumed that one learning iteration spans several time slots:

$$T_L = r\, T_H$$

The optimal value of r will be determined in Section 5. Let s denote the ratio between the duration of a time slot and the sampling period:

$$s = \frac{T_H}{T_S}$$

In this article, we compare the performances of the sensing time allocation system for two different cost functions $c_{j,t}$. We first define a competitive cost function in which the cost decreases if the average throughput realized by node j increases: where $\hat{R}_{j,t}$ denotes the average throughput realized by node j during the learning period t: With this cost function, every node tries to achieve the maximum $\hat{R}_{j,t}$ with no consideration for the other nodes in the secondary network.

We then define a cooperative cost function in which the cost decreases if the difference between the realized average throughput and the required average throughput decreases: This last cost function penalizes the actions that lead to a realized average throughput that is higher than required, which should help the disadvantaged nodes (i.e., the nodes that have a low data rate $C_{H_0,j}$) to achieve the required average throughput.
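The difference between the two cost functions can be made concrete with a short sketch. The exact expressions are assumptions made for illustration: the realized throughput is estimated as the share of the slot left for data, times the share of slots found free during the learning period, times the data rate; the competitive cost is taken as the negative realized throughput and the cooperative cost as the squared gap to the required throughput. All function names are hypothetical.

```python
def realized_throughput(window, samples_per_slot, free_slots,
                        slots_per_period, rate):
    """Estimate of the average throughput over one learning period."""
    share_for_data = 1.0 - window / samples_per_slot  # time not spent sensing
    share_free = free_slots / slots_per_period        # slots identified as free
    return share_for_data * share_free * rate


def competitive_cost(realized):
    """Lower cost for higher throughput, whatever the requirement."""
    return -realized


def cooperative_cost(realized, required):
    """Lower cost when the throughput is close to what is required."""
    return (realized - required) ** 2


if __name__ == "__main__":
    short_window = realized_throughput(4, 10, 50, 100, 0.6)
    long_window = realized_throughput(7, 10, 50, 100, 0.6)
    for r in (short_window, long_window):
        print(r, competitive_cost(r), cooperative_cost(r, required=0.1))
```

In this example the competitive cost favors the shorter sensing window, which yields the higher throughput, whereas the cooperative cost favors the longer window because the resulting throughput is closer to the required value, leaving more sensing effort available to the network.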

Q-Learning implementation for distributed power allocation
Each secondary BS is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions, and cost function used to solve the power allocation problem.
At each iteration $t \in \{1, \dots, K\}$ of the learning algorithm, a base station $l \in \{1, \dots, L\}$ represents the local state $s_{l,t}$ of the environment as the following triplet:

$$s_{l,t} = \{x_{l,t},\, P_{l,t},\, I_{l,t}\}$$

where $x_{l,t}$ is the local coordinate of the currently transmitting secondary user $SU_l$, $P_{l,t}$ is the power currently allocated to this user and $I_{l,t} \in \{0, 1\}$ is a binary indicator that specifies whether the measured aggregated interference at the sensor $I_l$ on the protection contour is above or below the acceptable threshold. It is defined as: For Q-learning implementation the states have to be quantized. Therefore it is assumed that $x_{l,t}$ takes one out of the following ξ values: Similarly, $P_{l,t}$ takes one out of the following j values: where $P_{min}$ and $P_{max}$ are the minimum and maximum effective radiated powers (ERP) in dBm.
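Quantizing the state amounts to mapping each measured quantity to the nearest level of a finite grid. The sketch below assumes uniformly spaced levels and an indicator equal to 1 when the interference is acceptable; both choices, as well as the grid bounds used in the example (in particular the upper power bound), are illustrative assumptions rather than values from the original work.

```python
def levels(low, high, count):
    """Return `count` evenly spaced values from low to high (count >= 2)."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def nearest_index(value, grid):
    """Index of the grid level closest to the given value."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))


def encode_state(x, power_dbm, interference_ok, x_grid, p_grid):
    """Map (position, power, indicator) to a hashable discrete state."""
    return (nearest_index(x, x_grid),
            nearest_index(power_dbm, p_grid),
            int(interference_ok))


if __name__ == "__main__":
    x_grid = levels(-15_000.0, 15_000.0, 10)  # positions in meters
    p_grid = levels(0.0, 28.0, 15)            # hypothetical power range in dBm
    print(encode_state(1_200.0, 5.1, True, x_grid, p_grid))
```

The resulting tuple can be used directly as a dictionary key in a Q-table, and the power grid doubles as the action set since the action is the power to allocate.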
At each iteration t, the action selected by the base station $BS_l$ is the power to allocate to the currently transmitting secondary user $SU_l$. The set of all possible actions is therefore given by Equation (27).
In this article, we compare the performances of the power allocation system for two different cost functions $c_{l,t}$. We first define a competitive cost function in which the cost decreases if the secondary SINR at the base station increases, provided that the aggregated interference generated on the primary protection contour does not exceed the acceptable level: where +∞ represents a positive constant that is chosen large enough compared to $\mathrm{SINR}^s_{l,t}$. With this cost function, every agent tries to achieve the maximum $\mathrm{SINR}^s_l$ with no consideration for the other secondary cells in the network.

Second, we define a cooperative cost function in which the cost decreases if the difference between the secondary SINR at the base station and the required secondary SINR threshold decreases, provided that the aggregated interference on the protection contour is acceptable: where +∞ represents a positive constant that is chosen large enough compared to $(\mathrm{SINR}^s_{l,t} - \mathrm{SINR}^s_{Th})^2$. This last cost function penalizes the actions that lead to a secondary SINR that is higher than required, which should help the disadvantaged secondary cells (i.e., the cells in which the transmission distance $|x_{l,t}|$ and/or $BS_l$ is high) to achieve the required secondary SINR threshold.
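The two power allocation costs can be sketched as follows. The expressions are assumptions made for illustration: the competitive cost is taken as the negative secondary SINR, the cooperative cost as the squared gap to the SINR threshold, and `BIG_COST` is a hypothetical stand-in for the large positive constant applied when the interference on the protection contour is not acceptable.

```python
BIG_COST = 1e6  # plays the role of the "+infinity" constant; must dominate


def competitive_cost(sinr_s_db, interference_ok):
    """Reward a high secondary SINR as long as the primary is protected."""
    return -sinr_s_db if interference_ok else BIG_COST


def cooperative_cost(sinr_s_db, sinr_th_db, interference_ok):
    """Reward a secondary SINR close to the threshold, no higher than needed."""
    if not interference_ok:
        return BIG_COST
    return (sinr_s_db - sinr_th_db) ** 2


if __name__ == "__main__":
    for sinr in (2.0, 3.0, 10.0):
        print(sinr,
              competitive_cost(sinr, True),
              cooperative_cost(sinr, 3.0, True))
    print(cooperative_cost(3.0, 3.0, False))  # constraint violated
```

Under the cooperative cost, overshooting the threshold is penalized in the same way as undershooting it, which discourages a well-placed cell from using power that another cell would need.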
In this article, the impact of the frequency of the learning algorithm is also analyzed. If $T_{TDMA}$ denotes the length of a TDMA time slot and $T_L$ denotes the length of a learning iteration, then

$$f = \frac{T_{TDMA}}{T_L} \quad (30)$$

indicates how many times a learning loop is executed during one TDMA time slot (i.e., for a fixed secondary transmitter $SU_l$ in cell l). It is assumed that every secondary cell uses the same TDMA time slot length $T_{TDMA}$ as well as the same learning iteration length $T_L$. However, the secondary transmissions as well as the learning iterations are assumed asynchronous, as illustrated on Figure 3.
Finally, three exploration strategies are compared in this article. These three exploration strategies are characterized by the same average randomness for exploration $\bar{\epsilon}$.
The first exploration strategy consists in using a constant parameter $\epsilon$ during the K learning iterations:

$$\epsilon_t = \bar{\epsilon}, \quad t \in \{1, \dots, K\} \quad (31)$$

In the second exploration strategy, $\epsilon$ decreases linearly between the values $\epsilon_{t=1} = 2\bar{\epsilon}$ and $\epsilon_{t=K} = 0$:

$$\epsilon_t = 2\bar{\epsilon}\, \frac{K - t}{K - 1} \quad (32)$$

In the third exploration strategy, the algorithm does pure exploration during the $\bar{\epsilon} f$ first learning iterations of each TDMA time slot, then pure exploitation during the remaining $(1 - \bar{\epsilon}) f$ last learning iterations of the time slot (see Figure 4):

$$\epsilon_t = \begin{cases} 1 & \text{during the first } \bar{\epsilon} f \text{ learning iterations of each TDMA time slot} \\ 0 & \text{otherwise} \end{cases} \quad (33)$$

Note that for both the sensing time and power allocation problems, the agents have an imperfect knowledge of the state of the environment. The state represented by an agent at each iteration of the Q-learning algorithm is actually an imperfect estimation of the environment state. In this case, the convergence proof of single-agent Q-learning [18] does not hold. However, multi-agent Q-learning algorithms have been successfully applied in multiple scenarios [11] and in particular to cognitive radios [10,12,19]. Numerical results will show that both Q-learning algorithms presented in this article converge as well.
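The three exploration strategies can be written as schedules that return the exploration parameter for iteration t, which also makes it easy to check that they share the same average. This is a minimal sketch: the function names are hypothetical, iterations are numbered from 1, and the alternating schedule assumes that learning iterations are aligned with the start of each TDMA time slot.

```python
def constant_eps(t, n_iter, eps_mean):
    """Same exploration parameter at every iteration."""
    return eps_mean


def linear_eps(t, n_iter, eps_mean):
    """Decrease linearly from 2*eps_mean at t = 1 to 0 at t = n_iter."""
    return 2.0 * eps_mean * (n_iter - t) / (n_iter - 1)


def alternating_eps(t, per_slot, eps_mean):
    """Pure exploration for the first eps_mean*per_slot iterations of each
    TDMA time slot, pure exploitation for the rest of the slot."""
    position = (t - 1) % per_slot  # index of the iteration inside its slot
    return 1.0 if position < eps_mean * per_slot else 0.0


def average(values):
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    K, eps_mean, f = 1000, 0.1, 50
    ts = range(1, K + 1)
    print(average(constant_eps(t, K, eps_mean) for t in ts))
    print(average(linear_eps(t, K, eps_mean) for t in ts))
    print(average(alternating_eps(t, f, eps_mean) for t in ts))
```

All three averages coincide with the chosen mean (up to floating-point rounding) when the number of iterations is a multiple of the number of iterations per slot, so differences in performance can be attributed to when exploration happens rather than to how much of it there is.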

Sensing time allocation algorithm
Unless otherwise specified, the following simulation parameters are used: we consider N = 2 nodes able to transmit at a maximum data rate $C_{H_0,1} = C_{H_0,2} = 0.6$. They each require a data rate $\bar{R}_1 = \bar{R}_2 = 0.1$. One node has a sensing channel characterized by $\gamma_1 = 0$ dB and the second one has a poorer sensing channel characterized by $\gamma_2 = -10$ dB.
We consider s = 10 samples per time slot and r = 100 time slots per learning period. The Q-learning algorithm is implemented with a learning rate $\alpha = 0.5$ and a discount rate $\gamma = 0.7$. The chosen exploration strategy consists in using $\epsilon = 0.1$ during the first K/2 iterations and then $\epsilon = 0$ during the remaining K/2 iterations.

Figure 5 gives the result of the Q-learning algorithms when no exploration strategy is used ($\epsilon = 0$). It is observed that after 430 iterations, the algorithm converges to $M_1 = M_2 = 4$, which is a sub-optimal solution. The optimal solution obtained by minimizing Equation (11) is $M_1 = 4$, $M_2 = 1$ (as the second node has a low sensing SNR, the first node has to contribute more to the sensing of the primary signal). After convergence, the normalized average throughputs are $\hat{R}_{2,opt}/C_{H_0,2} = 0.144$ whereas the optimal normalized (11).

Figure 6 gives the result of the Q-learning algorithms when the exploration strategy described at the beginning of this section is used. It is observed that the algorithm converges to the optimal solution defined in the previous paragraph.

Table 1 compares the performance of the sensing time allocation algorithm implementation based on the cooperative cost function defined by Equation (23) with the one based on the competitive cost function defined by Equation (20). The cooperative cost function penalizes the actions that lead to a higher than required throughput and as a result performs better (i.e., gives higher realized average throughputs $\hat{R}_j$) than the competitive cost function, in different scenarios. In particular, it helps achieve fairness among the nodes when one of the nodes has a lower sensing SNR (in which case the other nodes tend to contribute more to the sensing) or when one of the nodes has an inferior channel capacity (in which case this node tends to contribute less to the sensing). The data in Table 1 are the averages of the sensing window lengths and realized throughputs obtained in each scenario.
Figure 7 shows the average normalized throughput that is obtained with the algorithm with respect to parameter $r = T_L/T_H$ when the total duration of execution of the algorithm, equal to $rKT_H$, is kept constant. When r decreases, the learning algorithm is executed more often but the ratio $n_{H_0,t}/r$ becomes a less accurate approximation of $\pi_{H_0}(1 - P_F)p_{00}$ and, as a result, the agent becomes less aware of its impact on the false alarm probability. Therefore, there is a tradeoff value for r around r ≈ 10, as illustrated on Figure 7.

After convergence of the algorithm, if the value of the local SNR $\gamma_1$ decreases from 0 dB to -10 dB, the algorithm requires an average of 1200 iterations before converging to the new optimal solution $M_1 = M_2 = 1$.

According to Equation (17), each Q-learning iteration requires four additions and five multiplications per node. This result can be compared with the complexity of the centralized allocation algorithm, which must be solved numerically. By using a constant step gradient descent optimization algorithm to solve the centralized allocation problem, it was measured that the convergence occurred after an average of four iterations. At each iteration of the algorithm, the partial derivatives of the cost function with respect to the sensing times are evaluated. It can be shown that 18N - 1 multiplications and 8N - 1 additions are needed for this evaluation. As a result, the centralized allocation algorithm will have a lower computational complexity per node than the Q-learning algorithm. The main advantage of the Q-learning algorithm is therefore the minimization of control information sent between the secondary nodes and the coordinator node.

Power allocation algorithm
The performance of the Q-learning algorithm presented in Section 4 is evaluated by comparison with the optimal centralized power allocation scheme in which a base station having a perfect knowledge of the environment chooses the optimal transmission powers each time there is a change in the environment (i.e., whenever a TDMA time slot ends in any of the L cells). The optimal allocated powers are determined by selecting the transmission powers $(P_1, \dots, P_L) \in \Psi^L$ that minimize the cost function in Equation (13) under the constraints given in Equation (12). The learning algorithm performance metric considered here is the distance $d_t$ (in dB) between the secondary SINRs obtained with the multi-agent Q-learning algorithm and the secondary SINRs given by the optimal allocation algorithm: where $\mathrm{SINR}^s_{t,l}$ denotes the secondary SINR measured at iteration t at $BS_l$ in the distributed learning scenario and $\mathrm{SINR}^s_{t,l}$ denotes the secondary SINR measured at iteration t at $BS_l$ in the optimal centralized scenario.
The performance is evaluated for L = 2 secondary cells with a radius r_s = 15 km. The received power from the primary emitter on the protection contour is P_p = 0 dBm. Both the primary and the secondary networks use a frequency f_c = 2.45 GHz. The minimum acceptable primary SINR on the protection contour is SINR^p_Th = 20 dB. The desired secondary SINR at the base stations is SINR^s_Th = 3 dB. The secondary users are allocated powers ranging from P_min = 0 dBm to The secondary transmission powers P_{l,t} are quantized on j = 15 levels and the local coordinates x_{l,t} of the secondary users are quantized on ξ = 10 levels. The Q-learning algorithm is implemented with a learning rate α = 0.5, a discount rate γ = 0.9 and an average exploration randomness ε̄ = 0.1.
Figure 8 compares the performance of the power allocation algorithm based on the cooperative cost function defined by Equation (29) with that of the implementation based on the competitive cost function defined by Equation (28). The cooperative cost function penalizes the actions that lead to a higher-than-required secondary SINR and, as a result, performs better (i.e., gives a lower distance d_t to the optimal solution) than the competitive cost function.
Figure 9 compares the convergence speed of the Q-learning algorithm for different learning frequencies f. The algorithm converges faster when f increases, but the improvement is negligible when f > 50. After about 20,000 TDMA time slots, the performance of the algorithm is constant and does not depend on the learning frequency.
Figure 10 compares the performance of the Q-learning algorithm when different exploration policies are used. The linearly decreasing strategy defined by Equation (32) converges more slowly than the two other strategies analyzed but leads to better final results. The average d_t of this strategy, computed on the last 50,000 time slots, is equal to 17.5 dB. The strategy defined by Equation (33), which alternates between full exploration and full exploitation, gives the best initial performance but leads to final results that are inferior to those obtained with the linearly decreasing strategy. The average d t , computed on the
The complexity of the decentralized power allocation Q-learning algorithm can be compared to that of a reference centralized power allocation algorithm based on gradient descent, similarly to the analysis performed in Section 1. The conclusion is the same as for the sensing time allocation algorithm: the centralized allocation algorithm has a lower computational complexity than the decentralized Q-learning algorithm, whose main advantage is therefore that the base stations do not need to exchange control information.
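To make the role of the learning parameters and of the exploration policies more concrete, the following minimal sketch shows one way a single agent could combine them. It is not the implementation evaluated above: it assumes the standard tabular Q-learning update written for cost minimization, which may differ in detail from Equation (17), and the three exploration schedules are hypothetical stand-ins for the constant strategy and for Equations (32) and (33), chosen only so that each has an average randomness of 0.1. The state and action counts reuse the quantization levels quoted above; all function names and the `period` parameter are illustrative.

```python
import random

ALPHA = 0.5      # learning rate quoted in the text
GAMMA = 0.9      # discount rate quoted in the text
N_STATES = 10    # quantized user coordinates (10 levels)
N_ACTIONS = 15   # quantized transmission powers (15 levels)


def make_q_table():
    """Q-table of expected costs, initialized to zero."""
    return [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy over costs: explore with probability epsilon,
    otherwise pick the action with the lowest estimated cost."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    row = q[state]
    return min(range(N_ACTIONS), key=row.__getitem__)


def update(q, state, action, cost, next_state):
    """Standard tabular update, assumed here in its cost-minimizing form."""
    target = cost + GAMMA * min(q[next_state])
    q[state][action] = (1 - ALPHA) * q[state][action] + ALPHA * target


def constant_epsilon(t, horizon, eps=0.1):
    """Constant exploration randomness."""
    return eps


def linear_epsilon(t, horizon, eps_start=0.2):
    """Hypothetical linear decay from eps_start to 0 (average eps_start / 2)."""
    return eps_start * max(0.0, 1.0 - t / horizon)


def alternating_epsilon(t, horizon, period=100, explore_fraction=0.1):
    """Hypothetical alternation: full exploration at the start of each
    period, full exploitation for the remainder."""
    return 1.0 if (t % period) < explore_fraction * period else 0.0
```

In a simulation loop, the agent would call one of the schedule functions at each time slot to obtain the current randomness, select a power level with `choose_action`, observe the cost, and then call `update`. Starting from an all-zero table, a first observed cost of 1.0 moves the corresponding entry to 0.5, since the update averages the old estimate and the new target with equal weights when the learning rate is 0.5.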

Conclusion
In this article, we have proposed two decentralized Q-learning algorithms. The first one was used to solve the problem of the allocation of the sensing durations in a cooperative cognitive network in a way that maximizes the throughputs of the cognitive radios. The second one was used to solve the problem of power allocation in a secondary network made up of several independent cells, given a strict limit on the allowed aggregated interference on the primary network. Compared to a centralized allocation system, a decentralized allocation system is more robust, scalable, maintainable and computationally efficient.
Numerical results have demonstrated the need for an exploration strategy for the convergence of the sensing time allocation algorithm. It has also been observed that keeping the exploration parameter constant in the power allocation algorithm is less efficient than using a linearly decreasing parameter or alternating between full exploration and full exploitation, the latter policy leading to the fastest convergence of the power allocation algorithm.
It has furthermore been shown that the implementation of a cost function that penalizes the actions leading to a higher than required throughput in the sensing time allocation algorithm gives better results than the implementation of a cost function without such penalty. Similarly, the implementation of a cost function that penalizes the actions leading to a higher than required secondary SINR in the power allocation algorithm gives better results than the implementation of a cost function without such penalty.
Finally, it has been shown that there is an optimal tradeoff value for the frequency of execution of the sensing time allocation algorithm. The power allocation algorithm has been shown to converge faster when its frequency of execution increases, up to a threshold beyond which the gain in convergence speed becomes insignificant.
Figure 9 Distance d_t between the secondary SINRs generated by the Q-learning algorithm and the optimal secondary SINRs when using different frequencies of learning f in the Q-learning implementation. The randomness of exploration is constant and a cooperative cost function is used.
Figure 10 Distance d_t between the secondary SINRs generated by the Q-learning algorithm and the optimal secondary SINRs when using different exploration strategies in the Q-learning implementation. The learning frequency is f = 100 and a cooperative cost function is used.