© Husheng Li. 2010
Received: 31 December 2009
Accepted: 18 April 2010
Published: 23 May 2010
An Aloha-like spectrum access scheme without negotiation is considered for multiuser, multichannel cognitive radio systems. To avoid the collisions incurred by the lack of coordination, each secondary user learns how to select channels according to its experience. Multiagent reinforcement learning (MARL) is applied for the secondary users to learn good channel selection strategies. Specifically, the framework of Q-learning is extended from the single-user case to the multiagent case by considering the other secondary users as part of the environment. The dynamics of the Q-learning are illustrated using a Metrick-Polak plot, which shows the traces of the Q-values in the two-user case. For both the complete and partial observation cases, rigorous proofs of the convergence of multiagent Q-learning without communications, under certain conditions, are provided using the Robbins-Monro algorithm and contraction mapping, respectively. The learning performance (speed and gain in utility) is evaluated by numerical simulations.
In recent years, cognitive radio has attracted intensive study in the wireless communications community. It allows unlicensed users (called secondary users) to access licensed frequency bands when the licensed users (called primary users) are not present. The cognitive radio technique can therefore substantially alleviate the underutilization of the frequency spectrum [1, 2].
The following two problems are key to cognitive radio systems.
Resource mining, that is, how to detect the available resource (the frequency bands that are not being used by primary users); usually it is done by spectrum sensing.
Resource allocation, that is, how to allocate the available resource to different secondary users.
Substantial work has been done on the resource mining problem. Many signal processing techniques have been applied to sense the frequency spectrum [3], for example, cyclostationary features [4], quickest change detection [5], and collaborative spectrum sensing [6]. Meanwhile, a significant amount of research has been conducted on resource allocation in cognitive radio systems [7, 8]. Typically, it is assumed that secondary users exchange information about the available spectrum resources and then negotiate the resource allocation according to their own traffic requirements (since the same resource cannot be shared by different secondary users if orthogonal transmission is assumed). These studies typically apply theories from economics, for example, game theory, bargaining theory, or microeconomics.
However, in many applications of cognitive radio, such negotiation-based resource allocation may incur significant overhead. In traditional wireless communication systems, the available resource is almost fixed (even considering the fluctuation of channel quality incurred by fading, the change of available resource is usually very slow and can thus be considered stationary). Therefore, the negotiation need not be carried out frequently, and the negotiation result can be applied over a long period of data communication, incurring only tolerable overhead. In many cognitive radio systems, by contrast, the resource may change very rapidly since the activity of primary users can be highly dynamic. The available resource therefore needs to be updated very frequently, and the data communication period between two spectrum sensing periods should be fairly short, since minimal violation to primary users must be guaranteed. In such a situation, negotiating the resource allocation may be highly inefficient, since a substantial portion of time is consumed by the negotiation itself. To alleviate this inefficiency, high-speed transceivers would be needed to minimize the time spent on negotiation. In particular, the turn-around time, that is, the time needed to switch from receiving (transmitting) to transmitting (receiving), would have to be very small, which is a substantial challenge for hardware design.
To accomplish the learning of Aloha-like spectrum access, multiagent reinforcement learning (MARL) [10] is applied in this paper. One challenge of MARL in our context is that the secondary users do not know the payoffs (and thus the strategies) of the other secondary users in each stage; hence the environment of each secondary user, which includes the other secondary users, is nonstationary and may not assure the convergence of learning. Because there is no mutual communication between different secondary users, many traditional MARL techniques, such as fictitious play [11, 12] and Nash-Q learning [13], cannot be used, since they require information exchange among the players (e.g., exchanging their action information). To cope with the lack of mutual communication, we extend the principle of single-agent Q-learning, that is, evaluating the values of different state-action pairs in an incremental way, to the multiagent situation without information exchange. By applying the theory of stochastic approximation [14], which has been used in many studies on wireless networks [15, 16], we prove the main result of this paper: the learning converges to a stationary point regardless of the initial strategies (Propositions 1 and 2).
Some studies on reinforcement learning in cognitive radio networks have been done [17–19]. The studies in [17] and [19] focus on resource competition in a spectrum auction system, where the channel allocation is determined by a spectrum regulator; this differs from our setting, in which no regulator exists. Reference [18] discusses the correlated equilibrium and achieves it by no-regret learning, that is, by minimizing the gap between the current reward and the optimal reward. In that approach, mutual communication is needed among the secondary users, whereas in our study no inter-secondary-user communication is assumed.
Note that the study in this paper has subtle similarities to evolutionary game theory [20], which has been successfully applied to cooperative spectrum sensing in cognitive radio systems [21]. Both our study and the evolutionary game focus on the dynamics of the users' strategy changes. However, there is a key difference between the two. Evolutionary game theory assumes pure strategies for the players (e.g., cooperate or free-ride in cooperative spectrum sensing [21]) and studies the proportions of players using different pure strategies. The key equation in evolutionary game theory, called the replicator equation, describes the dynamics of these proportions. In contrast, the players in our study use mixed strategies, and the basic equation (16) describes the dynamics of the Q-values for the different channels. Although in both studies the convergence is proved by analyzing ordinary differential equations, the proofs differ significantly since the equations have totally different expressions.
The remainder of this paper is organized as follows. In Section 2, the system model is introduced. The basic elements of the game and the proposed multiagent Q-learning for the fully observable case (i.e., each secondary user can sense all channels) are introduced in Section 3. The corresponding convergence of the Q-learning is proved in Section 4. The Q-learning for the partially observable case (i.e., each secondary user can sense only a subset of the channels) is discussed in Section 5. Numerical results are provided in Section 6, and conclusions are drawn in Section 7.
2. System Model
We consider a set of active secondary users accessing an equal number of licensed frequency channels. (When there are more channels than secondary users, there is less competition, thus making the problem easier. We do not consider the case in which the number of channels is less than the number of secondary users, since a typical cognitive radio system can provide sufficient channels; meanwhile, the proposed algorithm can also be applied in all of these cases.) We index the secondary users, as well as the channels, by the integers 1, 2, and so on. For simplicity, for a given user (channel), we also refer to the set of users (channels) different from it.
The following assumptions are made throughout this paper.
The secondary users are sufficiently close to each other that they observe the same activity of primary users. There is no communication among these secondary users, which excludes the possibility of negotiation.
We assume that the activity of primary users over each channel is a Markov chain with two states: busy (the channel is occupied by primary users and cannot be used by secondary users) and idle (there is no primary user on the channel). (A more reasonable model for the activity of primary users is a semi-Markov chain; the corresponding analysis is more tedious but similar to that in this paper, so for simplicity we consider only Markov chains.) The state of each channel is observed in the sensing period of each spectrum access period. For each channel, the transition probabilities from busy to idle and from idle to busy are fixed. We assume that the Markov chains of the different channels are mutually independent. We also assume perfect spectrum sensing and do not consider possible spectrum sensing errors.
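The two-state channel model above can be sketched in a few lines of Python (an illustrative sketch only; the function name and the transition-probability parameters are our own, not the paper's notation):

```python
import random

def simulate_channel(p_idle_to_busy, p_busy_to_idle, n_slots, start_idle=True):
    """Simulate a two-state (busy/idle) Markov chain for one licensed channel.

    Returns a list of booleans, True meaning the channel is idle
    (usable by secondary users) in that spectrum access period.
    """
    idle = start_idle
    states = []
    for _ in range(n_slots):
        states.append(idle)
        if idle:
            # leave the idle state with the idle-to-busy probability
            idle = random.random() >= p_idle_to_busy
        else:
            # leave the busy state with the busy-to-idle probability
            idle = random.random() < p_busy_to_idle
    return states
```

Independent channels, as assumed above, would simply be simulated by independent calls of this function.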
We assume that the channel state transition probabilities, as well as the channel rewards, are unknown to the secondary users at the beginning. They are fixed throughout the game, unless otherwise noted. Therefore, the secondary users need to learn the channel properties.
The timing structure of spectrum sensing and data transmission is illustrated in Figure 2, where data is transmitted after the spectrum sensing period. We assume that each secondary user is able to sense only one channel during the spectrum sensing period and transmit over only one channel during the data transmission period.
In Sections 3 and 4, we consider the case in which all secondary users have full knowledge of the channel states in the previous spectrum access period (complete observation). Note that this does not contradict the assumption that a secondary user can sense only one channel during the spectrum sensing period, since the secondary user can continue to sense other channels during the data transmission period (supposing that the signal from primary users can be well distinguished from that of secondary users, e.g., using different cyclostationary features [22]). If we consider the set of channel states in the previous spectrum access period as the system state, then the above assumption implies a completely observable system state, which substantially simplifies the analysis. In Section 5, we also study the case in which secondary users cannot continue to sense during the data transmission period (partial observation); there, each secondary user has only partial observations of the system state.
In this section, we introduce the game associated with the learning procedure and the application of Q-learning to the Aloha-like spectrum access problem. Note that in this section and Section 4, we assume that each secondary user knows all channel states in the previous time slot, that is, the completely observable case.
3.1. Game of Aloha-Like Spectrum Access
It is easy to verify that the game has multiple Nash equilibrium points. Obviously, the orthogonal transmission strategies, in which each secondary user accesses a distinct channel with probability one, are pure-strategy equilibria. The reason is the following: if a secondary user changes its strategy and transmits over other channels with nonzero probability, those transmissions will collide with other secondary users (recall that, at a Nash equilibrium, all other secondary users keep their strategies) and incur performance degradation. An orthogonal channel assignment can be achieved as follows: let all secondary users sense channels randomly at the very beginning; once a secondary user finds an idle channel, it accesses this channel forever; after a random number of rounds, all secondary users will have found different channels, thus achieving orthogonal transmission. We call this scheme the simple orthogonal channel assignment since it is simple and fast. However, this scheme ignores the different rewards of the different channels. As will be seen in the numerical simulation results, the proposed learning procedure can significantly outperform the simple orthogonal channel assignment.
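The simple orthogonal channel assignment can be sketched as the following hypothetical simulation (our own illustration: it assumes all channels are idle and that a user can detect whether it transmitted alone; names and the collision model are ours, not the paper's):

```python
import random

def simple_orthogonal_assignment(n_users, n_channels, seed=42):
    """Each unsettled secondary user picks a random unclaimed channel each
    round; once a user transmits alone on a channel (no collision), it
    claims that channel forever.  Returns (user -> channel map, rounds
    until all users settled).  Channel rewards are ignored, which is the
    scheme's weakness."""
    rng = random.Random(seed)
    settled = {}                      # user -> claimed channel
    rounds = 0
    while len(settled) < n_users:
        rounds += 1
        picks = {}                    # channel -> users who picked it
        for u in range(n_users):
            if u in settled:
                continue
            free = [c for c in range(n_channels) if c not in settled.values()]
            picks.setdefault(rng.choice(free), []).append(u)
        for c, users in picks.items():
            if len(users) == 1:       # transmitted alone: claim the channel
                settled[users[0]] = c
    return settled, rounds
```

The random number of rounds mentioned above corresponds to the returned round count.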
where the reward obtained by a secondary user depends on its action as well as the system state, and the expectation is taken over the randomness of the other users' actions, as well as the primary users' occupancies.
where the parameter is called the temperature and controls the randomness of exploration. Obviously, the lower the temperature (the "colder" the distribution), the more concentrated the actions are. In the zero-temperature limit, each user chooses only the channel having the largest Q-value.
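Boltzmann exploration as described above can be sketched as follows (an illustration with our own naming; a max-shift is added for numerical stability and is not part of the paper's formula):

```python
import math, random

def boltzmann_choice(q_values, temperature, rng=random):
    """Pick a channel index with probability proportional to
    exp(Q / temperature) (Boltzmann/softmax exploration).  A low
    temperature concentrates the choice on the largest Q-value."""
    m = max(q_values)                  # shift by the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r = rng.random()
    cum = 0.0
    for i, w in enumerate(weights):
        cum += w / total
        if r < cum:
            return i
    return len(q_values) - 1
```

For instance, with Q-values `[1.0, 0.0]` and temperature `0.01`, the first channel is selected essentially always, matching the zero-temperature limit described above.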
since each of the other secondary users chooses the given channel with its own probability (in which case a collision happens and the secondary user receives no reward), and the channel is idle with the corresponding probability; the product in (4) is then the probability that no other secondary user accesses the channel.
Note that, in a typical stochastic game setting with Q-learning, the updating rule in (5) should take the future into account by adding a discounted term for the future reward to its right-hand side. However, in this paper, the optimal strategy is myopic, since we assume that the system state is known and the secondary users' actions therefore do not affect the system state. For the case of partial observation (i.e., each secondary user knows only the state of a single channel), the action does change each secondary user's state (namely, its belief about the system state), and the future reward should be included on the right-hand side of (5); this is discussed in Section 5.
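The myopic update of (5) then reduces to an exponentially weighted average of the observed rewards; a minimal sketch (the function name and argument names are our own):

```python
def q_update_myopic(q, reward, alpha):
    """One myopic Q-learning update for a (state, channel) pair: move the
    stored Q-value a step of size alpha toward the observed reward.  No
    discounted future term is needed because, under full observation, the
    secondary users' actions do not affect the system state."""
    return (1 - alpha) * q + alpha * reward
```

Repeatedly applying this update with a constant reward drives the Q-value toward that reward, which is the stationarity notion used below.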
3.5. Stationary Point
Note that the stationarity is only in the statistical sense, since the Q-values can fluctuate around the stationary point due to the randomness of exploration. Obviously, as the temperature tends to zero, the stationary point converges to a Nash equilibrium point. However, the existence of such a stationary point is not yet guaranteed. The following lemma assures the existence of a stationary point; the proof is given in the appendix.
In this section, we study the convergence of the proposed Q-learning algorithm. First, we provide an intuitive explanation of the convergence in the two-user case. Then, we apply the tools of stochastic approximation and ordinary differential equations (ODEs) to prove the convergence rigorously.
4.1. Intuition on Convergence
As will be shown in Proposition 1, the updating rule of the Q-values in (5) converges to a stationary equilibrium point close to a Nash equilibrium. Before the rigorous proof, we provide an intuitive explanation of the convergence using the geometric argument proposed in [24].
The intuitive explanation is provided in Figure 4 for the two-user, two-channel case (we call the figure a Metrick-Polak plot since it was originally proposed by A. Metrick and B. Polak in [24]). For simplicity, we ignore the state indices and assume that both channels are idle. The two axes correspond to the two secondary users' relative preferences between the two channels, in terms of their Q-values. As labeled in the figure, the plane is divided into four regions, separated by the two lines along which one of the users values both channels equally, and the dynamics of the Q-learning differ across these regions. We discuss the four regions separately.
Region I: in this region, secondary user 1's Q-value is larger for channel 1, so secondary user 1 prefers channel 1; meanwhile, secondary user 2 prefers channel 2 since its Q-value is larger there. Then, with large probability, the strategies converge to a stationary point in which secondary users 1 and 2 access channels 1 and 2, respectively.
Region II: in this region, both secondary users prefer the same channel; the resulting collisions reduce the corresponding Q-values until the trajectory moves into Region I or III.
Region III: similar to Region I, with the roles of the channels exchanged.
Region IV: similar to Region II.
We thus observe that points in Regions II and IV are unstable and move into Region I or III with large probability, while in Regions I and III the strategy moves close to the stationary point with large probability. Therefore, regardless of where the initial point is, the updating rule in (5) converges to a stationary point with large probability.
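The region dynamics described above can be reproduced with a small hypothetical simulation (our own illustration: two users, two always-idle channels, unit reward, zero reward on collision; the temperature, step size, and step count are arbitrary choices, not the paper's):

```python
import math, random

def two_user_dynamics(n_steps=3000, temp=0.1, alpha=0.05, seed=0):
    """Two secondary users do Boltzmann exploration over two always-idle
    channels with unit reward; a collision yields zero reward.  Returns the
    final Q tables; the collision penalty pushes the users onto different
    channels (Regions I or III of the Metrick-Polak plot)."""
    rng = random.Random(seed)
    Q = [[0.5, 0.5], [0.5, 0.5]]      # Q[user][channel], arbitrary start

    def pick(q):
        w0 = math.exp(q[0] / temp)
        w1 = math.exp(q[1] / temp)
        return 0 if rng.random() < w0 / (w0 + w1) else 1

    for _ in range(n_steps):
        a0, a1 = pick(Q[0]), pick(Q[1])
        r = 1.0 if a0 != a1 else 0.0  # collision gives no reward to either
        Q[0][a0] = (1 - alpha) * Q[0][a0] + alpha * r
        Q[1][a1] = (1 - alpha) * Q[1][a1] + alpha * r
        # unchosen channels keep their old Q-values
    return Q
```

Starting from the symmetric point (the intersection of the two indifference lines), random exploration breaks the tie and the trajectory settles in Region I or III, with each user's preferred-channel Q-value near the unit reward.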
4.2. Stochastic Approximation-Based Convergence
In this section, we prove the convergence of the Q-learning in the proposed Aloha-like spectrum access with Boltzmann-distributed exploration. First, we establish the equivalence between the updating rule (5) and a Robbins-Monro iteration [25] for solving an equation with an unknown expression (a brief introduction is provided in the appendix). Then, we apply a result from stochastic approximation [14] to relate the dynamics of the updating rule to an ODE and prove the convergence of that ODE.
4.2.1. Robbins-Monro Iteration
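A generic Robbins-Monro iteration, of which the Q-value update is an instance, can be sketched as follows (our own illustration; the classical diminishing step 1/(n+1) is assumed as the default):

```python
def robbins_monro(noisy_f, x0, n_steps, step=lambda n: 1.0 / (n + 1)):
    """Generic Robbins-Monro iteration: find a root of g(x) = 0 when only
    noisy evaluations noisy_f(x) = g(x) + noise are available, using the
    update x <- x + a_n * noisy_f(x) with diminishing steps a_n."""
    x = x0
    for n in range(n_steps):
        x = x + step(n) * noisy_f(x)
    return x
```

In the spectrum access setting, the role of noisy_f is played by the difference between the observed reward and the current Q-value, so the iteration is exactly the Q-value update.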
4.2.2. ODE and Convergence
The Robbins-Monro algorithm (i.e., the updating of the Q-values) is a stochastic approximation of the solution of the equation. It is well known that the convergence of such a procedure can be characterized by an ODE. Since the noise in (14) is a martingale difference, it is easy to verify the conditions of Theorem 1 in the appendix and obtain the following lemma (the proof is given in the appendix).
What remains is to analyze the convergence of the ODE (16). We obtain the following lemma by applying a Lyapunov function; the proof is given in the appendix.
Combining Lemmas 1, 2, and 3, we obtain the main result in this paper.
Note that a sufficiently small temperature guarantees the existence of the stationary point, while a sufficiently large temperature assures the convergence of the learning procedure. These requirements do not conflict, however, since neither is a necessary condition. As we found in our simulations, we can always choose a suitable temperature that guarantees both the existence of the stationary point and the convergence.
The system state is partially observable.
The game is imperfectly monitored; that is, each player does not know the other players' actions.
The game has incomplete information; that is, each player knows neither the strategies of the other players nor their beliefs about the system state.
Note that the latter two difficulties are common to both the complete and partial observation cases. However, imperfect monitoring and incomplete information add much more difficulty in the partial observation case. In this section, we formulate the Q-learning algorithm and then prove its convergence under certain conditions.
5.1. State Definition
where the first element is the number of consecutive periods during which the channel has not been sensed before the current period (e.g., if the last time the channel was sensed by the secondary user was several time slots ago, this counter equals the elapsed number of slots), and the second element is the state of the channel the last time it was sensed.
5.2. Learning in the POMDP Game
where the next belief state is uniquely determined by the current belief state and observation, and the step factor depends on the time, channel, user, and belief state. Note that the system state in the next time slot is random. Intuitively, the new Q-value is obtained by combining the old value with a new estimate, which is the sum of the new reward and the discounted Q-value of the next belief state.
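The update just described can be sketched as a one-line rule (a sketch under our own naming; the third argument stands for the value of the best action at the next belief state):

```python
def q_update_pomdp(q_old, reward, q_next_max, alpha, discount):
    """Q-update for the partially observable case: the new estimate is the
    immediate reward plus the discounted value of the best action at the
    next belief state, blended with the old value via the step factor."""
    return (1 - alpha) * q_old + alpha * (reward + discount * q_next_max)
```

Setting the discount to zero recovers the myopic update of the fully observable case.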
Similarly to the complete information situation, we have the following proposition, which states the convergence of the learning procedure with partial information and a large temperature; the proof is given in the appendix. Note that numerical simulations show that a small temperature also results in convergence; however, we are still unable to prove this rigorously.
6. Numerical Results
where the initial learning factor is a constant. A similar step factor is used for the partially observable case. In Sections 6.1 through 6.4, we consider the fully observable case, and in Section 6.5 we consider the partially observable case. Note that, in all simulations, we initialize the Q-values with uniformly distributed random values.
6.2. CDF of Rewards
6.3. Learning Speed
6.4. Time-Varying Channel Rewards
6.5. Partial Observation Case
We have discussed a learning procedure for Aloha-like spectrum access without negotiation in cognitive radio systems. During the learning, each secondary user considers the channels and the other secondary users as its environment, updates its Q-values, and takes the best action. An intuitive explanation of the convergence of the learning is provided using the Metrick-Polak plot. By applying the theory of stochastic approximation and ODEs, we have shown the convergence of the learning under certain conditions. We have also extended the algorithm from the case of full observations to the case of partial observations. Numerical results show that the secondary users can quickly learn to avoid collisions. The performance after learning is significantly better than both the performance before learning and that of a simple scheme that achieves a Nash equilibrium. Note that our study addresses one extreme of the resource allocation problem, in which no negotiation is used, while the other extreme is full negotiation to achieve optimal performance. Our future work will address the intermediate case, that is, limited negotiation for resource allocation.
This work was supported by the National Science Foundation under grant CCF-0830451.
- [1] Mitola J III: Cognitive radio for flexible mobile multimedia communications. Mobile Networks and Applications 2001, 6(5):435-441. doi:10.1023/A:1011426600077
- [2] Mitola J III: Cognitive Radio, Licentiate Proposal. KTH, Stockholm, Sweden; 1998.
- [3] Zhao Q, Sadler BM: A survey of dynamic spectrum access. IEEE Signal Processing Magazine 2007, 24(3):79-89.
- [4] Kim K, Akbar IA, Bae KK, Um J-S, Spooner CM, Reed JH: Cyclostationary approaches to signal detection and classification in cognitive radio. Proceedings of the 2nd IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, April 2007, 212-215.
- [5] Li H, Li C, Dai H: Quickest spectrum sensing in cognitive radio. Proceedings of the 42nd Annual Conference on Information Sciences and Systems (CISS '08), March 2008, Princeton, NJ, USA, 203-208.
- [6] Ghasemi A, Sousa ES: Collaborative spectrum sensing for opportunistic access in fading environments. Proceedings of the 1st IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN '05), November 2005, 131-136.
- [7] Kloeck C, Jaekel H, Jondral F: Multi-agent radio resource allocation. Mobile Networks and Applications 2006, 11(6):813-824. doi:10.1007/s11036-006-0051-4
- [8] Niyato D, Hossain E, Han Z: Dynamics of multiple-seller and multiple-buyer spectrum trading in cognitive radio networks: a game-theoretic modeling approach. IEEE Transactions on Mobile Computing 2009, 8(8):1009-1022.
- [9] Kuo FF: The ALOHA system. In Computer Networks. Prentice-Hall, Englewood Cliffs, NJ, USA; 1973:501-518.
- [10] Buşoniu L, Babuška R, De Schutter B: A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C 2008, 38(2):156-172.
- [11] Fudenberg D, Levine DK: The Theory of Learning in Games. The MIT Press, Cambridge, Mass, USA; 1998.
- [12] Robinson J: An iterative method of solving a game. The Annals of Mathematics 1951, 54(2):296-301.
- [13] Hu J, Wellman MP: Multiagent reinforcement learning: theoretical framework and an algorithm. Proceedings of the 15th International Conference on Machine Learning (ICML '98), July 1998, 242-250.
- [14] Kushner HJ, Yin GG: Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York, NY, USA; 2003.
- [15] Liu J, Yi Y, Proutiere A, Chiang M, Poor HV: Towards utility-optimal random access without message passing. Wireless Communications and Mobile Computing 2010, 10(1):115-128. doi:10.1002/wcm.897
- [16] Yi Y, de Veciana G, Shakkottai S: MAC scheduling with low overheads by learning neighborhood contention patterns. Submitted to IEEE/ACM Transactions on Networking.
- [17] Fu F, van der Schaar M: Learning to compete for resources in wireless stochastic games. IEEE Transactions on Vehicular Technology 2009, 58(4):1904-1919.
- [18] Han Z, Pandana C, Liu KJR: Distributive opportunistic spectrum access for cognitive radio using correlated equilibrium and no-regret learning. Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC '07), March 2007, 11-15.
- [19] van der Schaar M, Fu F: Spectrum access games and strategic learning in cognitive radio networks for delay-critical applications. Proceedings of the IEEE 2009, 97(4):720-739.
- [20] Hofbauer J, Sigmund K: Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK; 1998.
- [21] Wang B, Liu KJR, Clancy TC: Evolutionary game framework for behavior dynamics in cooperative spectrum sensing. Proceedings of the IEEE Global Communications Conference (Globecom '08), November-December 2008, New Orleans, La, USA, 3123-3127.
- [22] Sutton PD, Nolan KE, Doyle LE: Cyclostationary signatures in practical cognitive radio applications. IEEE Journal on Selected Areas in Communications 2008, 26(1):13-24.
- [23] Sutton RS, Barto AG: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Mass, USA; 1998.
- [24] Metrick A, Polak B: Fictitious play in 2 × 2 games: a geometric proof of convergence. Economic Theory 1994, 4(6):923-933. doi:10.1007/BF01213819
- [25] Robbins H, Monro S: A stochastic approximation method. The Annals of Mathematical Statistics 1951, 22(3):400-407.
- [26] Tsitsiklis JN: Asynchronous stochastic approximation and Q-learning. Machine Learning 1994, 16(3):185-202.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.