Sensing time and power allocation for cognitive radios using distributed Q-learning
EURASIP Journal on Wireless Communications and Networking volume 2012, Article number: 138 (2012)
Abstract
In cognitive radio systems, the sparsely used assigned frequency bands are opened to secondary users, provided that the aggregated interference induced by the secondary transmitters on the primary receivers is negligible. Cognitive radios are established in two steps: the radios first sense the available frequency bands and then communicate using these bands. In this article, we propose two decentralized resource allocation Q-learning algorithms: the first one shares the sensing time among the cognitive radios in a way that maximizes the throughputs of the radios. The second one allocates the cognitive radio powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. Numerical results show the convergence of the proposed algorithms and allow the discussion of the exploration strategy, the choice of the cost function and the frequency of execution of each algorithm.
1. Introduction
The scarcity of available radio spectrum frequencies, densely allocated by the regulators, represents a major bottleneck in the deployment of new wireless services. Cognitive radios have been proposed as a new technology to overcome this issue [1]. For cognitive radio use, the assigned frequency bands are opened to secondary users, provided that interference induced on the primary licensees is negligible. Cognitive radios are established in two steps: the radios firstly sense the available frequency bands and secondly communicate using these bands.
To tackle the fading phenomenon (an attenuation of the received power due to destructive interference between the multiple interactions of the emitted wave with the environment) when sensing the frequency spectrum, cooperative spectrum sensing has been proposed to take advantage of the spatial diversity in wireless channels [2, 3]. In cooperative spectrum sensing, the secondary cognitive nodes send the results of their individual observations of the primary signal to a base station through specific control channels. The base station then combines the received information in order to make a decision about the primary network presence. Each cognitive node observes the primary signal during a certain sensing time, which should be chosen long enough to ensure the correct detection of the primary emitter but short enough so that the node still has enough time to communicate. In the literature [4, 5], the sensing times used by the cognitive nodes are generally assumed to be identical and allocated by a central authority. In [6], the sensing performance of a network of independent cognitive nodes that individually select their sensing times is analyzed using evolutionary game theory.
It is generally considered in the literature that the secondary users can only transmit if the primary network is inactive or if the secondary users are located outside a keep-out region surrounding the primary transmitter, or equivalently, if the secondary users generate an interference below a given threshold on a so-called protection contour surrounding the primary transmitter [7, 8]. However, multiple simultaneously transmitting secondary users may individually meet the protection contour constraint while collectively generating an aggregated interference that exceeds the acceptable threshold. In [7], the effect of aggregated interference caused by IEEE 802.22 secondary users on primary DTV receivers is analyzed. In [9], the aggregated interference generated by a large-scale secondary network is modeled and the impact of the secondary network density on the sensing requirements is investigated. In [10], a decentralized power allocation Q-learning algorithm is proposed to protect the primary network from harmful aggregated interference. The proposed algorithm removes the need for a central authority to allocate the powers in the secondary network and therefore minimizes the communication overhead. The cost functions used by the algorithm are chosen so that the aggregated interference constraint is exactly met on the protection contour. Unfortunately, the cost functions do not take into account the preferences of the secondary network.
This article aims to illustrate the potential of Q-learning for cognitive radio systems. For this purpose, two decentralized Q-learning algorithms are presented to solve the allocation problems that appear during the sensing phase on the one hand and during the communication phase on the other hand. The first algorithm shares the sensing times among the cognitive radios in a way that maximizes the throughputs of the radios. The second algorithm allocates the secondary user powers in a way that maximizes the signal-to-interference-plus-noise ratio (SINR) at the secondary receivers while meeting the primary protection constraint. The agents self-adapt by directly interacting with the environment in real time and by properly utilizing their past experience. They aim to distributively learn an optimal strategy to maximize their throughputs or their SINRs.
Reinforcement learning algorithms such as Q-learning are particularly efficient in applications where reinforcement information (i.e., cost or reward) is provided after an action is performed in the environment [11]. The sensing time and power allocation problems both allow for the easy definition of such information. In this article, we make the assumption that no information is exchanged between the agents for each of the two problems. As a result, many traditional multi-agent reinforcement learning algorithms like fictitious play and Nash-Q learning cannot be used [12], which justifies the use of multi-agent Q-learning in this article to solve the sensing time and power allocation problems.
This distributed allocation of the sensing times and the node powers presents several advantages compared to a centralized allocation [10]: (1) robustness of the system towards a variation of parameters (such as the gains of the sensing channels), (2) maintainability of the system thanks to the modularity of the multiple agents and (3) scalability of the system, as the need for control communication is minimized: on the one hand, there is no need for a central authority to send the result of a centralized allocation to the multiple nodes; on the other hand, these nodes do not have to send their specific parameters (sensing SNRs and data rates for the sensing time allocation problem, space coordinates for the power allocation problem). In addition, a centralized allocation is not a trivial operation, as the sensing time and power allocation problems are both essentially multi-criteria problems where multiple objective functions to maximize can be defined (e.g., the sum of the individual rewards to aim for a global optimum, or the minimum individual reward to guarantee more fairness).
The rest of this article is organized as follows: in Section 2, we formulate the problem of sensing time allocation in the secondary network. In Section 3, we formulate the problem of power allocation in the secondary network. In Section 4, we present the decentralized Q-learning algorithms used to solve the sensing time allocation problem and the power allocation problem. In Section 5, we present numerical results allowing the discussion of the performance of the Q-learning algorithms for different exploration strategies, cost functions and execution frequencies.
2. Sensing time allocation problem formulation
2.1. Cooperative spectrum sensing
The licensed band is assumed to be divided into N subbands, and each secondary user is assumed to communicate in one of the N subbands when the primary user is absent. When present, the primary network is assumed to use all N subbands for its communications. Therefore, the secondary users can jointly sense the primary network presence on these subbands and report their observations via a narrowband control channel.
We consider a cognitive radio cell made of N + 1 nodes including a central base station. Each node j performs an energy detection of the received signal using M_{ j } samples [13, 14]. The observed energy value at the j-th node is given by the random variable:
where s_{ ji } and n_{ ji } denote the received primary signal and additive white noise at the i-th sample of the j-th cognitive radio, respectively (1 ≤ j ≤ N, 1 ≤ i ≤ M_{ j }). These samples are assumed to be real without loss of generality. H_{0} and H_{1} represent the hypotheses associated with primary signal absence and presence, respectively. In the distributed detection problem, the coordinator node receives information from each of the N nodes (e.g., the communicated Y_{ j }) and must decide between the two hypotheses.
We assume that the instantaneous noise at each node n_{ ji } can be modeled as a zero-mean Gaussian random variable with unit variance, ${n}_{ji}\sim \mathcal{N}\left(0,1\right)$. Let γ_{ j } be the signal-to-noise ratio (SNR) computed at the j-th node, defined as ${\gamma}_{j}=\frac{1}{{M}_{j}}{\sum}_{i=1}^{{M}_{j}}{s}_{ji}^{2}$.
Since ${n}_{ji}\sim \mathcal{N}\left(0,1\right)$, the random variable Y_{ j } can be expressed as:
where ${\chi}_{{M}_{j}}^{2}$ denotes a central chisquared distribution with M_{ j } degrees of freedom and λ_{ j } = M_{ j }γ_{ j } is the noncentrality parameter. Furthermore, if M_{ j } is large, the Central Limit theorem gives [15]:
From (1), it can be shown that the false alarm probability ${P}_{{F}_{j}}=\mathrm{Pr}\left\{{Y}_{j}>\lambda \mid {H}_{0}\right\}$ is given by:
and the detection probability ${P}_{{D}_{j}}=\mathrm{Pr}\left\{{Y}_{j}>\lambda \mid {H}_{1}\right\}$ is given by:
where $Q\left(x\right)={\int}_{x}^{+\infty}\frac{1}{\sqrt{2\pi}}{e}^{-\frac{{t}^{2}}{2}}\,\mathrm{d}t$.
By combining Equations (2) and (3), the false alarm probability can be expressed with respect to the detection probability:
where Q^{−1}(x) is the inverse function of Q(x).
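The displayed forms of Equations (2)-(4) are images in the source and are not reproduced here. The sketch below therefore assumes one common closed form for the Gaussian-approximated energy detector, $P_F = Q\left(\sqrt{2\gamma+1}\,Q^{-1}(P_D) + \sqrt{M}\,\gamma\right)$, which may differ in detail from the paper's exact Equation (4); the function names and the bisection-based inverse are our own illustrative helpers, not the authors' code.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_inv(p, lo=-10.0, hi=10.0, tol=1e-12):
    """Inverse of Q, found by bisection (Q is strictly decreasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q_func(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def false_alarm_prob(p_d, snr, m):
    """P_F as a function of the target P_D, the sensing SNR and the window
    length M, under the assumed Gaussian-approximation closed form."""
    return q_func(math.sqrt(2.0 * snr + 1.0) * q_inv(p_d) + math.sqrt(m) * snr)
```

As expected from the trade-off discussed in Section 2.3, lengthening the sensing window (larger M) lowers the false alarm probability for a fixed detection target.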
As illustrated on Figure 1, we consider that every T_{ H } seconds, each node sends a one-bit value representing the local hard decision about the primary network presence to the base station. The base station combines the received bits in order to make a global decision for the nodes. The base station decision is sent back to the nodes as a one-bit value. The duration of the communication with the base station is assumed to be negligible compared to the duration T_{ H } of a time slot. In this article, we focus on the logical-OR fusion rule at the base station, but the other fusion rules could be similarly analyzed. Under the logical-OR fusion rule, the global detection probability P_{ D } (defined as the probability that the coordinator node identifies a time slot as busy when the primary network is present during this time slot) and the global false alarm probability P_{ F } (defined as the probability that the coordinator node identifies a time slot as busy when the primary network is absent during this time slot) depend, respectively, on the local detection probabilities ${P}_{{D}_{j}}$ and false alarm probabilities ${P}_{{F}_{j}}$ [16]:
and
Given a target global detection probability ${\bar{P}}_{D}$, we thus have:
and Equation (4) can be rewritten as:
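The logical-OR combining of Equations (5)-(8) can be sketched numerically. Since the displayed equations are images in the source, the even split of the global detection target across nodes used below is our assumption; Equation (7) may distribute the target differently.

```python
def global_detection_prob(local_pds):
    """Logical-OR fusion: the slot is declared busy if any node detects it,
    so the global miss probability is the product of the local ones."""
    prod = 1.0
    for p in local_pds:
        prod *= (1.0 - p)
    return 1.0 - prod

def local_target_pd(global_target_pd, n_nodes):
    """Identical per-node detection target that meets the global target
    under OR fusion (one simple, even split)."""
    return 1.0 - (1.0 - global_target_pd) ** (1.0 / n_nodes)
```

For example, with N = 2 nodes and a global target of 0.95, each node only needs a local detection probability of about 0.78, which is the cooperation gain exploited by the sensing time allocation.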
2.2. Throughput of a secondary user
The random variable representing the presence of the primary network in each time slot n is denoted H(n) (H(n) ∈ {H_{0}, H_{1}}) and is assumed to be a Markov chain characterized by a transition matrix [p_{ uv }]. It is assumed that the probability p_{01} of the primary network appearing is small compared to the probability p_{00}. As a result, the secondary users can decide whether to communicate during a time slot based on the result of their sensing in the previous time slot, while limiting the probability of interference with the primary network.
A secondary user performs data transmission during the time slots that have been identified as free by the base station. In each of these time slots, M_{ j } T_{ S } seconds are used by the secondary user to sense the spectrum, where T_{ S } denotes the sampling period. The remaining T_{ H } − M_{ j }T_{ S } seconds are used for data transmission. The secondary user average throughput R_{ j } is given by the sum of the throughput obtained when the primary network is absent and no false alarm has been generated by the base station plus the throughput obtained when the primary network is present but has not been detected by the base station [6]:
where ${\pi}_{{H}_{0}}={\lim}_{n\to \infty}\frac{{\sum}_{\nu =1}^{n}{1}_{H\left(\nu \right)={H}_{0}}}{n}$ denotes the stationary probability of the primary network absence, ${C}_{{H}_{0},j}$ represents the data rate of the secondary user under H_{0} and ${C}_{{H}_{1},j}$ represents the data rate of the secondary user under H_{1}. The target detection probability ${\bar{P}}_{D}$ is required to be close to 1 since the cognitive radios should not interfere with the primary network; moreover, ${\pi}_{{H}_{0}}$ is usually close to 1, ${C}_{{H}_{1},j}\ll {C}_{{H}_{0},j}$ due to the interference from the primary network [6], and it is assumed that p_{00} ≥ p_{10}. Therefore, (9) can be approximated by:
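Since the displayed form of Equation (10) is an image in the source, the sketch below assumes a plausible shape of the approximation under the stated simplifications: the data rate under H_{0}, weighted by the fraction of the slot left after sensing and by the probability that the slot is correctly usable. The function name and argument order are ours.

```python
def approx_throughput(c_h0, m, t_s, t_h, pi_h0, p00, p_f):
    """Approximate average throughput in the spirit of Equation (10):
    rate under H0, times the fraction of the slot left for data after
    M samples of sensing, times the probability that the slot is free,
    stays free, and raised no false alarm (an assumed form; the exact
    Equation (10) is not reproduced in the text)."""
    return pi_h0 * p00 * (1.0 - p_f) * c_h0 * (t_h - m * t_s) / t_h
```

For a fixed false alarm probability, the throughput decreases linearly with the sensing window length M, which is one side of the trade-off formalized in Section 2.3.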
2.3. Sensing time allocation problem
Equations (6), (8), and (10) show that there is a tradeoff for the choice of the sensing window length M_{ j }: on the one hand, if M_{ j } is high, then user j will not have enough time to perform its data transmission and R_{ j } will be low. On the other hand, if all the users use low M_{ j } values, then the global false alarm probability in (10) will be high and all the average throughputs will be low.
The sensing time allocation problem consists in finding the optimal sensing window lengths {M_{1}, . . ., M_{ N }} that minimize a cost function f (R_{1}, . . ., R_{ N }) depending on the secondary throughputs.
In this article, the following cost function is considered:
where ${\bar{R}}_{j}$ denotes the throughput required by node j.

It is observed that the cost decreases with respect to R_{ j } until R_{ j } reaches the threshold value ${\bar{R}}_{j}$; beyond that point, the cost increases with respect to R_{ j }. This should prevent secondary users from selfishly transmitting with a throughput higher than required, which would reduce the achievable throughputs for the other secondary users.
Although a base station could determine the sensing window lengths that minimize function (11) and send these optimal values to each secondary user, in this article we rely on the secondary users themselves to determine their individual best sensing window length. This decentralized allocation avoids the introduction of signaling overhead in the system.
3. Power allocation problem formulation
We consider a large circular primary cell made up of one central primary emitter and several primary receivers whose positions are unknown. The primary emitter could be a DTV broadcasting station that communicates with multiple passive receivers.
The secondary network uses the same frequency band as the primary network and consists of L adjacent secondary cells. Each secondary cell is made up of one central secondary base station and multiple secondary users. For the sake of simplicity, all the secondary users (SU) are assumed to be located on the line that joins the L base stations (BS), as illustrated on Figure 2. The reader is referred to [10] for more realistic assumptions regarding the geometry of the power allocation problem.
In order to protect the primary receivers from receiving harmful interference from the secondary users, a protection contour is defined around the primary emitter as a circle on which the received primary SINR must exceed a given threshold ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{p}$. The secondary cells are located around the protection contour. As the primary cell radius is assumed to be much larger than the secondary cell radius, the protection contour can be approximated by a line parallel to the line of secondary base stations.
The secondary network is assumed to follow a Time Division Multiple Access (TDMA) scheme, so that at each time only one secondary user SU_{ l } communicates with its base station BS_{ l } in cell l (l ∈ {1, . . ., L}). The difference between the SU_{ l } and BS_{ l } abscissae is denoted x_{ l }. The point on the protection contour closest to SU_{ l } is denoted I_{ l }. We assume that each cell l deploys sensors on the protection contour so that it is able to measure the primary network SINR at the point I_{ l }, denoted ${\mathsf{\text{SINR}}}_{l}^{p}$.
In this article, the analysis is focused on the interference generated by the upstream transmissions of the secondary users. It is assumed that the secondary SINR at each base station l, denoted ${\mathsf{\text{SINR}}}_{l}^{s}$, must exceed a given threshold ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s}$ for the secondary communication to be reliable.
The power allocation problem consists in finding the optimal secondary users transmission powers {P_{1}, . . ., P_{ L }} that minimize a cost function $f\left({\mathsf{\text{SINR}}}_{1}^{s},\dots ,{\mathsf{\text{SINR}}}_{L}^{s}\right)$ depending on the secondary SINRs, under the constraints that
In this article, the following cost function is considered:
It is observed that the cost decreases with respect to ${\mathsf{\text{SINR}}}_{l}^{s}$ until ${\mathsf{\text{SINR}}}_{l}^{s}$ reaches the threshold value ${\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s}$, then the cost increases with respect to ${\mathsf{\text{SINR}}}_{l}^{s}$. This should prevent secondary users from selfishly transmitting with a power higher than required, which would remove transmission opportunities for other secondary users.
The primary SINRs in Equation (12) are given by:
where P^{p}is the power that is received on the protection contour from the primary transmitter, σ^{2} is the noise power and ${h}_{{\mathsf{\text{I}}}_{l}}^{\mathsf{\text{S}}{\mathsf{\text{U}}}_{k}}$ is the link gain between SU_{ k }and the point I_{ l } on the protection contour.
The secondary SINRs in Equation (13) are given by:
where ${h}_{{\mathsf{\text{BS}}}_{l}}^{{\mathsf{\text{SU}}}_{l}}$is the link gain between SU_{ l }and BS_{ l }.
In this article, we consider free space path loss. Therefore, the link gains are computed as follows:
where r_{ s } is the radius of the secondary cells, f_{ c } is the transmission frequency and c is the speed of light in vacuum.
4. Learning algorithm
4.1. Qlearning algorithm
In this article, we use two multi-agent Q-learning algorithms. The first one is used to allocate the secondary user sensing times and the second one is used to allocate the secondary user transmission powers. In the sensing time allocation algorithm, each secondary user is an agent that aims to learn an optimal sensing time allocation policy for itself. In the power allocation algorithm, each secondary base station is an agent that aims to learn an optimal power allocation policy for its cell.
Q-learning implementation requires the environment to be modeled as a finite-state discrete-time stochastic system. The set of all possible states of the environment is denoted $\mathcal{S}$. At each learning iteration, the agent that executes the learning algorithm performs an action chosen from the finite set $\mathcal{A}$ of all possible actions. Each learning iteration consists in the following sequence:

1) The agent senses the state $s\in \mathcal{S}$ of the environment.

2) Based on s and its accumulated knowledge, the agent chooses and performs an action $a\in \mathcal{A}$.

3) Because of the performed action, the state of the environment is modified. The new state is denoted s'. The transition from s to s' generates a cost c ∈ ℝ for the agent.

4) The agent uses c and s' to update the accumulated knowledge that made it choose the action a when the environment was in state s.
The Q-learning algorithm keeps a quality value (the Q-value) for every state-action couple (s, a) it has tried. The Q-value Q(s, a) represents how high the expected quality of an action a is when the environment is in state s [17]. The following policy is used for the selection of the action a by the agent when the environment is in state s:
where ϵ is the randomness for exploration of the learning algorithm.
The cost c and the new state s' generated by the choice of action a in state s are used to update the Qvalue Q(s, a) based on how good the action a was and how good the new optimal action will be in state s'. The update is handled by the following rule:
where α is the learning rate and γ is the discount rate of the algorithm.
The learning rate α ∈ [0, 1] is used to control the linear blend between the previously accumulated knowledge about the (s, a) couple, Q(s, a), and the newly received quality information $\left(c+\gamma {max}_{a\prime \in \mathcal{A}}Q\left(s\prime ,a\prime \right)\right)$. A high value of α gives little importance to previous experience, while a low value of α gives an algorithm that learns slowly as the stored Qvalues are easily altered by new information.
The discount rate γ ∈ [0, 1] is used to control how much the success of a later action a' should be brought back to the earlier action a that led to the choice of a'. A high value of γ gives a low importance to the cost of the current action compared to the Q-value of the new state this action leads to, while a low value of γ would rate the current action almost only based on the immediate reward it provides.
The randomness for exploration ϵ ∈ [0, 1] is used to control how often the algorithm should take a random action instead of the best action it knows. A high value of ϵ favors exploration of new good actions over exploitation of existing knowledge, while a low value of ϵ reinforces what the algorithm already knows instead of trying to find new better actions. The exploration-exploitation tradeoff is typical of learning algorithms. In this article, we consider online learning (i.e., at every time step the agents should display intelligent behaviors), which requires a low ϵ value.
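The learning loop of steps 1)-4) can be sketched as a minimal tabular Q-learning agent. Since c is a cost to be minimized, the greedy choice below is the argmin of the Q-values and the max of rule (17) becomes a min; this sign convention is our assumption (equivalent to maximizing negative costs), and the class is an illustrative sketch, not the authors' implementation.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Minimal tabular Q-learning agent with an epsilon-greedy policy."""

    def __init__(self, actions, alpha=0.5, gamma=0.7, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # Q(s, a), initialized to zero

    def choose_action(self, s):
        # Policy (16): explore with probability epsilon, else act greedily.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return min(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, cost, s_next):
        # Rule (17): blend old knowledge with the newly observed cost plus
        # the discounted value of the best action in the new state.
        best_next = min(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = (1.0 - self.alpha) * self.q[(s, a)] + \
                         self.alpha * (cost + self.gamma * best_next)
```

With α = 0.5 and γ = 0.7 as in the numerical experiments of Section 5, repeated updates drive the Q-values toward the actions with the lowest long-run cost.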
4.2. Q-learning implementation for sensing time allocation
Each secondary user is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions and cost function used to solve the sensing time allocation problem.
At each iteration t ∈ {1, . . ., K} of the learning algorithm, a secondary user j ∈ {1, . . ., N} represents the local state s_{ j, t } of the environment as follows:
where ${n}_{{H}_{0},t-1}$ denotes the number of time slots that have been identified as free by the base station during the (t − 1)-th learning period.
The number of free time slots takes one out of r values:
At each iteration t, the action selected by the secondary user j is the duration M_{ j, t }of the sensing window to be used during the T_{ L } seconds of the learning iteration t. It is assumed that one learning iteration spans several time slots:
The optimal value of r will be determined in Section 5. Let s denote the ratio between the duration of a time slot and the sampling period:
during each learning period t, the sensing window length takes one out of s + 1 values:
In this article, we compare the performance of the sensing time allocation system for two different cost functions c_{ j, t }. We first define a competitive cost function in which the cost decreases if the average throughput realized by node j increases:
where ${\widehat{R}}_{j,t}$ denotes the average throughput realized by node j during the learning period t:
With this cost function, every node tries to achieve the maximum ${\widehat{R}}_{j,t}$ with no consideration for the other nodes in the secondary network. We secondly define a cooperative cost function in which the cost decreases if the difference between the realized average throughput and the required average throughput decreases:
This last cost function penalizes the actions that lead to a realized average throughput that is higher than required, which should help the disadvantaged nodes (i.e., the nodes that have a low data rate ${C}_{{H}_{0},j}$) to achieve the required average throughput.
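The exact expressions of the competitive cost (20) and the cooperative cost (23) are images in the source. The sketch below assumes the simplest forms matching their descriptions (a negated throughput and a squared deviation from the requirement, respectively); these shapes are our assumptions.

```python
def competitive_cost(r_hat):
    """Cost decreases as the realized throughput increases, with no regard
    for the other nodes (the shape described for Equation (20))."""
    return -r_hat

def cooperative_cost(r_hat, r_required):
    """Cost decreases as the realized throughput approaches the required
    one, and penalizes overshooting it (the shape described for
    Equation (23))."""
    return (r_hat - r_required) ** 2
```

Under the cooperative cost, exceeding the required throughput is penalized exactly like falling short of it, which discourages the greedy behavior that hurts disadvantaged nodes.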
4.3. Q-learning implementation for distributed power allocation
Each secondary BS is an agent in charge of sensing the environment state, selecting an action according to policy (16), performing this action, sensing the resulting new environment state, computing the induced cost and updating the state-action Q-value according to rule (17). In this section, we specify the states, actions, and cost function used to solve the power allocation problem.
At each iteration t ∈ {1, . . ., K} of the learning algorithm, a base station l ∈ {1, . . ., L} represents the local state s_{ l, t }of the environment as the following triplet:
where x_{ l, t } is the local coordinate of the currently transmitting secondary user SU_{ l }, P_{ l, t }is the power currently allocated to this user and I_{ l, t }∈ {0, 1} is a binary indicator that specifies whether the measured aggregated interference at the sensor I_{ l } on the protection contour is above or below the acceptable threshold. It is defined as:
For Q-learning implementation, the states have to be quantized. Therefore, it is assumed that x_{ l, t } takes one out of the following ξ values:
Similarly, P_{ l, t }takes one out of the following ϕ values:
where P_{min} and P_{max} are the minimum and maximum effective radiated powers (ERP) in dBm.
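The quantized power levels that form the action set can be sketched as evenly spaced ERP values between P_{min} and P_{max}; even spacing in dBm is our assumption, since the displayed quantization formula is an image in the source.

```python
def power_levels(p_min_dbm, p_max_dbm, phi):
    """Phi evenly spaced ERP levels in dBm, from P_min to P_max inclusive:
    the quantized set of transmission powers an agent can select."""
    step = (p_max_dbm - p_min_dbm) / (phi - 1)
    return [p_min_dbm + i * step for i in range(phi)]
```

A coarser grid (small ϕ) shrinks the Q-table and speeds up learning, at the price of a less precise power control near the protection contour.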
At each iteration t, the action selected by the base station BS_{ l }is the power to allocate to the currently transmitting secondary user SU_{ l }. The set of all possible actions is therefore given by Equation (27).
In this article, we compare the performance of the power allocation system for two different cost functions c_{ l, t }. We first define a competitive cost function in which the cost decreases if the secondary SINR at the base station increases, provided that the aggregated interference generated on the primary protection contour does not exceed the acceptable level:
where +∞ represents a positive constant that is chosen large enough compared to ${\mathsf{\text{SINR}}}_{l,t}^{s}$. With this cost function, every agent tries to achieve the maximum ${\mathsf{\text{SINR}}}_{l}^{s}$ with no consideration for the other secondary cells in the network. Second, we define a cooperative cost function in which the cost decreases if the difference between the secondary SINR at the base station and the required secondary SINR threshold decreases, provided that the aggregated interference on the protection contour is acceptable:
where +∞ represents a positive constant that is chosen large enough compared to ${({\mathsf{\text{SINR}}}_{l,t}^{s}-{\mathsf{\text{SINR}}}_{\mathsf{\text{Th}}}^{s})}^{2}$. This last cost function penalizes the actions that lead to a secondary SINR that is higher than required, which should help the disadvantaged secondary cells (i.e., the cells in which the transmission distance x_{ l, t } and/or the aggregated interference ${\sum}_{k=1,k\ne l}^{L}{P}_{k}{h}_{{\mathsf{\text{BS}}}_{l}}^{{\mathsf{\text{SU}}}_{k}}$ is high) to achieve the required secondary SINR threshold.
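As with the sensing costs, the displayed expressions of the two power cost functions are images in the source. The shapes described above can be sketched as follows; the squared deviation, the negated SINR and the finite stand-in for the "+∞" penalty constant are all our assumptions.

```python
BIG = 1e9  # finite stand-in for the "+infinity" penalty constant

def competitive_power_cost(sinr_s, primary_ok):
    """Large penalty when the aggregated interference violates the
    protection contour; otherwise reward (negative cost) a high SINR."""
    return -sinr_s if primary_ok else BIG

def cooperative_power_cost(sinr_s, sinr_th, primary_ok):
    """Large penalty on infeasibility; otherwise penalize the squared
    deviation of the secondary SINR from its required threshold."""
    return (sinr_s - sinr_th) ** 2 if primary_ok else BIG
```

The dominant penalty makes any power level that breaks the primary constraint strictly worse than every feasible one, so the learned policies stay within the protection contour constraint.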
In this article, the impact of the frequency of the learning algorithm is also analyzed. If T_{TDMA} denotes the length of a TDMA time slot and T_{ L } denotes the length of a learning iteration, then
indicates how many times a learning loop is executed during one TDMA time slot (i.e., for a fixed secondary transmitter SU_{ l }in cell l). It is assumed that every secondary cell uses the same TDMA time slot length T_{TDMA} as well as the same learning iteration length T_{ L }. However, the secondary transmissions as well as the learning iterations are assumed asynchronous, as illustrated on Figure 3.
Finally, three exploration strategies are compared in this article. These three exploration strategies are characterized by the same average randomness for exploration $\bar{\epsilon}$.
The first exploration strategy consists in using a constant ϵ parameter during the K learning iterations:
In the second exploration strategy, ϵ decreases linearly between the values ${\epsilon}_{t=1}=2\bar{\epsilon}$ and ϵ_{t = K} = 0:
In the third exploration strategy, the algorithm does pure exploration during the $\bar{\epsilon}f$ first learning iterations of each TDMA time slot, then pure exploitation during the remaining $\left(1-\bar{\epsilon}\right)f$ last learning iterations of the time slot (see Figure 4):
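The three ϵ schedules can be sketched as follows (function names are ours, and the third schedule counts the iteration index t within a TDMA slot of f learning iterations). All three average to $\bar{\epsilon}$ over their horizon.

```python
def epsilon_constant(t, k, eps_bar):
    """Strategy 1: constant randomness over the K learning iterations."""
    return eps_bar

def epsilon_linear(t, k, eps_bar):
    """Strategy 2: linear decay from 2*eps_bar at t = 1 down to 0 at
    t = K, so the average over the K iterations equals eps_bar."""
    return 2.0 * eps_bar * (k - t) / (k - 1)

def epsilon_explore_then_exploit(t, f, eps_bar):
    """Strategy 3: pure exploration during the first eps_bar*f iterations
    of each TDMA slot, then pure exploitation for the remaining ones."""
    return 1.0 if t <= eps_bar * f else 0.0
```

The linear schedule front-loads exploration while the TDMA environment is still unknown; the third schedule instead re-explores at the start of every slot, when the transmitting secondary user (and hence the state) has just changed.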
Note that for both the sensing time and power allocation problems, the agents have an imperfect knowledge of the state of the environment. The state represented by an agent at each iteration of the Q-learning algorithm is actually an imperfect estimation of the environment state. In this case, the convergence demonstration of single-agent Q-learning [18] does not hold. However, multi-agent Q-learning algorithms have been successfully applied in multiple scenarios [11] and in particular to cognitive radios [10, 12, 19]. Numerical results will show that both Q-learning algorithms presented in this article converge as well.
5. Numerical results
5.1. Sensing time allocation algorithm
Unless otherwise specified, the following simulation parameters are used: we consider N = 2 nodes able to transmit at a maximum data rate ${C}_{{H}_{0},1}={C}_{{H}_{0},2}=0.6$. They each require a data rate ${\bar{R}}_{1}={\bar{R}}_{2}=0.1$. One node has a sensing channel characterized by γ_{1} = 0 dB and the second one has a poorer sensing channel characterized by γ_{2} = −10 dB.
It is assumed that the primary network transition probabilities are p_{00} = 0.9, p_{01} = 0.1, p_{10} = 0.2, and p_{11} = 0.8. The target detection probability is ${\bar{P}}_{D}=0.95$.
We consider s = 10 samples per time slot and r = 100 time slots per learning period. The Q-learning algorithm is implemented with a learning rate α = 0.5 and a discount rate γ = 0.7. The chosen exploration strategy consists in using ϵ = 0.1 during the first K/2 iterations and then ϵ = 0 during the remaining K/2 iterations.
Figure 5 gives the result of the Q-learning algorithm when no exploration strategy is used (ϵ = 0). It is observed that after 430 iterations, the algorithm converges to M_{1} = M_{2} = 4, which is a suboptimal solution. The optimal solution obtained by minimizing Equation (11) is M_{1} = 4, M_{2} = 1 (as the second node has a low sensing SNR, the first node has to contribute more to the sensing of the primary signal). After convergence, the normalized average throughputs are ${\widehat{R}}_{1}/{C}_{{H}_{0},1}={\widehat{R}}_{2}/{C}_{{H}_{0},2}=0.108$, whereas the optimal normalized average throughputs are ${\widehat{R}}_{1,\mathsf{\text{opt}}}/{C}_{{H}_{0},1}=0.096$ and ${\widehat{R}}_{2,\mathsf{\text{opt}}}/{C}_{{H}_{0},2}=0.144$ and lead to a lower global cost in Equation (11).
Figure 6 gives the result of the Q-learning algorithm when the exploration strategy described at the beginning of this section is used. It is observed that the algorithm converges to the optimal solution defined in the previous paragraph.
Table 1 compares the performance of the sensing time allocation algorithm implementation based on the cooperative cost function defined by Equation (23) with the one based on the competitive cost function defined by Equation (20). The cooperative cost function penalizes the actions that lead to a higher-than-required throughput and, as a result, performs better (i.e., gives higher realized average throughputs ${\widehat{R}}_{j}$) than the competitive cost function in different scenarios. In particular, it helps achieve fairness among the nodes when one of the nodes has a lower sensing SNR (in which case the other nodes tend to contribute more to the sensing) or when one of the nodes has a lower channel capacity (in which case this node tends to contribute less to the sensing). The data in Table 1 are the averages of the sensing window lengths and realized throughputs obtained in each scenario.
Figure 7 shows the average normalized throughput obtained with the algorithm with respect to the parameter r = T_{L}/T_{H} when the total duration of execution of the algorithm, equal to rKT_{H}, is kept constant. When r decreases, the learning algorithm is executed more often, but the ratio $n_{H_0,t}/r$ becomes a less accurate approximation of $\pi_{H_0}\left(1 - P_F\right)p_{00}$ and, as a result, the agent becomes less aware of its impact on the false alarm probability. Therefore, there is a tradeoff value for r around r ≈ 10, as illustrated in Figure 7.
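The accuracy argument can be illustrated with a small Monte Carlo sketch (the Bernoulli model and function name are illustrative assumptions): the fraction of idle-sensed slots out of r is an unbiased estimate of $\pi_{H_0}(1 - P_F)p_{00}$, but its variance grows as r shrinks.

```python
import random

def estimate_idle_fraction(r, p_true, rng):
    """Estimate pi_H0 * (1 - P_F) * p00 from n_H0,t / r: count the time
    slots (out of r) in which the channel is declared idle, modelling
    each slot as an independent Bernoulli trial with probability p_true."""
    n_h0 = sum(rng.random() < p_true for _ in range(r))
    return n_h0 / r
```

For small r (e.g., r = 5) the estimate frequently takes only coarse values such as 0, 0.2 or 0.4, which is why the agent loses track of its effect on the false alarm probability.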
After convergence of the algorithm, if the value of the local SNR γ_{1} decreases from 0 dB to −10 dB, the algorithm requires an average of 1200 iterations before converging to the new optimal solution M_{1} = M_{2} = 1. According to Equation (17), each Q-learning iteration requires four additions and five multiplications per node. This result can be compared with the complexity of the centralized allocation algorithm, which must be solved numerically. By using a constant-step gradient descent optimization algorithm to solve the centralized allocation problem, it was measured that convergence occurred after an average of four iterations. At each iteration of the algorithm, the partial derivatives of the cost function with respect to the sensing times are evaluated; it can be shown that 18N − 1 multiplications and 8N − 1 additions are needed for this evaluation. As a result, the centralized allocation algorithm has a lower computational complexity per node than the Q-learning algorithm. The main advantage of the Q-learning algorithm is therefore the minimization of the control information sent between the secondary nodes and the coordinator node.
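The low per-iteration cost can be seen from a standard Q-learning update for a cost-minimizing agent (a sketch only: the exact form of Equation (17) is not reproduced here, and the dictionary-based Q-table is an assumption; α and γ match the values used in this section):

```python
def q_update(Q, s, a, cost, s_next, alpha=0.5, gamma=0.7):
    """One Q-learning update, minimisation form: move Q[s][a] towards
    cost + gamma * min_a' Q[s_next][a'].  Only a handful of additions
    and multiplications are needed per update."""
    target = cost + gamma * min(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]
```

Each call touches a single Q-table entry, which is why the per-node, per-iteration arithmetic stays constant regardless of the network size N.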
5.2. Power allocation algorithm
The performance of the Q-learning algorithm presented in Section 4 is evaluated by comparison with the optimal centralized power allocation scheme, in which a base station with perfect knowledge of the environment chooses the optimal transmission powers each time there is a change in the environment (i.e., whenever a TDMA time slot ends in any of the L cells). The optimal allocated powers are determined by selecting the transmission powers (P_{1}, . . ., P_{L}) ∈ Ψ^{L} that maximize Equation (13) under the constraints given in Equation (12).
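Since Ψ is a finite quantized set, the centralized reference can be obtained by exhaustive search, sketched below for L = 2 cells (an illustration: the `sinr` objective and `feasible` constraint are caller-supplied stand-ins for Equations (13) and (12), not the paper's expressions):

```python
from itertools import product

def centralized_allocation(power_levels, sinr, feasible, cells=2):
    """Exhaustive search over Psi^L: enumerate every power tuple, discard
    those violating the primary-protection constraint, and return the
    tuple maximising the objective."""
    best, best_val = None, float("-inf")
    for powers in product(power_levels, repeat=cells):
        if not feasible(powers):
            continue
        val = sinr(powers)
        if val > best_val:
            best, best_val = powers, val
    return best
```

With ϕ = 15 quantized levels and L = 2, this enumerates 15² = 225 candidate tuples, which is tractable for a base station but requires the global knowledge that the decentralized agents lack.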
The learning algorithm performance metric considered here is the distance d_{t} (in dB) between the secondary SINRs obtained with the multiagent Q-learning algorithm and the secondary SINRs given by the optimal allocation algorithm:
where $\text{SINR}^s_{t,l}$ denotes the secondary SINR measured at iteration t at BS_{l} in the distributed learning scenario and $\widehat{\text{SINR}}^s_{t,l}$ denotes the secondary SINR measured at iteration t at BS_{l} in the optimal centralized scenario.
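As a concrete sketch of such a metric, one natural choice is the Euclidean distance between the two SINR vectors expressed in dB (this particular form is an assumption for illustration; the paper's exact definition of d_{t} is given by its displayed equation):

```python
import math

def distance_db(sinr_learned_db, sinr_optimal_db):
    """Euclidean distance between the learned and optimal secondary SINR
    vectors, both expressed in dB, one entry per base station BS_l."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(sinr_optimal_db, sinr_learned_db)))
```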
The performance is evaluated for L = 2 secondary cells with a radius r_{s} = 15 km. The received power from the primary emitter on the protection contour is P^{p} = 0 dBm. Both the primary and the secondary networks use a carrier frequency f_{c} = 2.45 GHz. The minimum acceptable primary SINR on the protection contour is $\text{SINR}^p_{\text{Th}} = 20\,\text{dB}$. The desired secondary SINR at the base stations is $\text{SINR}^s_{\text{Th}} = 3\,\text{dB}$. The secondary users are allocated powers ranging from $P_{\min} = 0\,\text{dBm}$ to $P_{\max} = \frac{1}{h^{\text{SU}_l}_{I_l}}\left(\sigma^2 + \frac{P^p}{\text{SINR}^p_{\text{Th}}}\right) = 66.4\,\text{dBm}$.
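The P_{max} expression mixes dB and linear quantities, so its evaluation reduces to a few unit conversions. The sketch below illustrates this (the numeric inputs `sigma2_mw` and `h_gain` in the test are hypothetical example values, not the paper's parameters, which yield 66.4 dBm):

```python
import math

def dbm_to_mw(p_dbm):
    """Convert a power (or ratio in dB) to linear scale."""
    return 10 ** (p_dbm / 10)

def mw_to_dbm(p_mw):
    """Convert a linear power in mW back to dBm."""
    return 10 * math.log10(p_mw)

def p_max_dbm(sigma2_mw, p_p_dbm, sinr_th_p_db, h_gain):
    """Maximum secondary power allowed by the primary-protection
    constraint, following the P_max expression above:
    P_max = (sigma^2 + P^p / SINR_Th^p) / h."""
    allowed_mw = (sigma2_mw + dbm_to_mw(p_p_dbm) / dbm_to_mw(sinr_th_p_db)) / h_gain
    return mw_to_dbm(allowed_mw)
```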
The secondary transmission powers P_{l,t} are quantized on ϕ = 15 levels and the local coordinates x_{l,t} of the secondary users are quantized on ξ = 10 levels. The Q-learning algorithm is implemented with a learning rate α = 0.5, a discount rate γ = 0.9 and an average randomness for exploration $\bar{\epsilon} = 0.1$.
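The quantization step that turns continuous powers and positions into discrete Q-table indices can be sketched as below (uniform quantization over a known range is an assumption; the paper does not specify the quantizer):

```python
def quantize(value, lo, hi, levels):
    """Map a continuous value in [lo, hi] to one of `levels` uniform bins,
    e.g. powers on phi = 15 levels or coordinates on xi = 10 levels."""
    if value <= lo:
        return 0
    if value >= hi:
        return levels - 1
    return int((value - lo) / (hi - lo) * levels)
```

The resulting indices (power level, position level) define the finite state-action space over which each agent's Q-table is maintained.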
Figure 8 compares the performance of the power allocation algorithm implementation based on the cooperative cost function defined by Equation (29) with the one based on the competitive cost function defined by Equation (28). The cooperative cost function penalizes the actions that lead to a higher-than-required secondary SINR and, as a result, performs better (i.e., gives a lower distance d_{t} to the optimal solution) than the competitive cost function.
Figure 9 compares the convergence speed of the Q-learning algorithm when different learning frequencies f are used. The Q-learning algorithm converges faster when f increases, but the improvement is negligible when f > 50. After about 20,000 TDMA time slots, the performance of the algorithm is constant and does not depend on the learning frequency.
Figure 10 compares the performance of the Q-learning algorithm when different exploration policies are used. The linearly decreasing ϵ strategy defined by Equation (32) converges more slowly than the two other analyzed strategies but leads to the best final results: its average d_{t}, computed over the last 50,000 time slots, is equal to 17.5 dB. The full exploration/full exploitation alternation strategy defined by Equation (33) gives the best initial performance but leads to final results that are inferior to those obtained with the linearly decreasing ϵ; its average d_{t} over the last 50,000 time slots is equal to 18.7 dB. The performance of the constant ϵ strategy defined by Equation (31) is always inferior to that of the alternation strategy, with an average d_{t} over the last 50,000 time slots equal to 23.0 dB.
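The three families of exploration policies can be sketched as functions of the iteration index t (illustrative shapes only: the exact expressions of Equations (31)-(33), including the alternation period, are assumptions here):

```python
def constant_eps(t, T, eps0=0.1):
    """Constant-epsilon policy (in the spirit of Equation (31))."""
    return eps0

def linear_eps(t, T, eps0=1.0):
    """Linearly decreasing epsilon (in the spirit of Equation (32)):
    decays from eps0 at t = 0 down to 0 at t = T."""
    return eps0 * max(0.0, 1.0 - t / T)

def alternating_eps(t, T, period=1000):
    """Full-exploration / full-exploitation alternation (in the spirit of
    Equation (33)): epsilon switches between 1 and 0 every `period` slots."""
    return 1.0 if (t // period) % 2 == 0 else 0.0
```

These shapes reproduce the qualitative behavior reported above: the alternation explores aggressively early on (fast initial progress), while the linear decay keeps refining its policy and ends with the lowest residual distance.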
The complexity of the decentralized power allocation Q-learning algorithm can be compared to that of a reference centralized gradient descent power allocation algorithm, similarly to the analysis performed in Section 5.1. The conclusion is the same as for the sensing time allocation algorithm: the centralized allocation algorithm has a lower computational complexity than the decentralized Q-learning algorithm, whose main advantage is therefore that the base stations do not need to exchange control information.
6. Conclusion
In this article, we have proposed two decentralized Q-learning algorithms. The first one was used to solve the problem of allocating the sensing durations in a cooperative cognitive network in a way that maximizes the throughputs of the cognitive radios. The second one was used to solve the problem of power allocation in a secondary network made up of several independent cells, given a strict limit on the allowed aggregated interference on the primary network. Compared to a centralized allocation system, a decentralized allocation system is more robust, scalable and maintainable, and requires less exchange of control information.
Numerical results have demonstrated the need for an exploration strategy for the convergence of the sensing time allocation algorithm. It has also been observed that keeping the exploration parameter constant in the power allocation algorithm is less efficient than using a linearly decreasing parameter or alternating between full exploration and full exploitation, the latter exploration policy leading to the fastest initial convergence of the power allocation algorithm.
It has furthermore been shown that a cost function penalizing the actions that lead to a higher-than-required throughput in the sensing time allocation algorithm gives better results than a cost function without such a penalty. Similarly, a cost function penalizing the actions that lead to a higher-than-required secondary SINR in the power allocation algorithm gives better results than a cost function without such a penalty.
Finally, it has been shown that there is an optimal tradeoff value for the frequency of execution of the sensing time allocation algorithm. The power allocation algorithm has been shown to converge faster when its frequency of execution increases, up to an upper bound beyond which the gain in convergence speed becomes insignificant.