 Research
 Open Access
Cell range expansion using distributed Q-learning in heterogeneous networks
Toshihito Kudo and Tomoaki Ohtsuki
https://doi.org/10.1186/1687-1499-2013-61
© Kudo and Ohtsuki; licensee Springer. 2013
 Received: 1 August 2012
 Accepted: 4 February 2013
 Published: 4 March 2013
Abstract
Cell range expansion (CRE) is a technique to expand a pico cell range virtually by adding a bias value to the pico received power, instead of increasing the transmit power of the pico base station (PBS), so that coverage, cell-edge throughput, and overall network throughput are improved. Many studies have focused on inter-cell interference coordination (ICIC) in CRE, because the macro base station's (MBS's) strong transmit power harms the expanded region (ER) user equipments (UEs) that select PBSs by the bias value. The optimal bias value that minimizes the number of outage UEs depends on several factors, such as the dividing ratio of radio resources between MBSs and PBSs. In addition, it varies from one UE to another. Thus, most articles use a common bias value among all UEs, determined by the trial-and-error method. In this article, we propose a scheme to determine the bias value of each UE by using the Q-learning algorithm, where each UE independently learns, from its past experience, the bias value that minimizes the number of outage UEs. Simulation results show that, compared to the scheme using the optimal common bias value, the proposed scheme reduces the number of outage UEs and improves network throughput.
Keywords
 Learning Scheme
 Radio Resource
 Receive Power
 Average Throughput
 Optimal Bias
Introduction
Owing to the increasing demand for wireless bandwidth, serving the network's user equipments (UEs) by macro base stations (MBSs) alone has become insufficient. Heterogeneous networks (HetNets), whereby low power base stations (BSs) are deployed within the macro cell, have therefore received significant attention in the literature [1]. HetNets are discussed as one of the proposed solutions in Long Term Evolution-Advanced (LTE-Advanced) by the Third Generation Partnership Project (3GPP) [2].
Several types of low power BSs are considered, for instance pico BSs (PBSs), femto BSs (FBSs), and relay BSs. Among these, PBSs receive the most attention, because they can improve capacity and usually have the same backhaul as the MBS. In [3], the authors place a PBS near a hot spot, where the amount of traffic is high, to prevent many UEs from accessing the MBS. PBSs have low transmit power, ranging from 23 to 30 dBm, and serve tens of UEs within a coverage range of up to 300 m [1]. However, in the presence of MBSs, PBSs' ranges become smaller. MBSs' transmit power is about 46 dBm, so the difference between them is about 16 dB [1]. This large difference causes PBSs' ranges to fall within tens of meters, whereas MBSs' ranges are hundreds or thousands of meters [1]. This is not the case for the uplink (UL), in which the reference signal strengths (RSSs) from a UE at different BSs mostly depend on the UE's transmit power [1]. Therefore, in this article, we consider only the downlink (DL).
If the range of the hot spot area is the same as that of the pico cell, the PBS can serve UEs within that area and improve coverage area. However, because the hot spot’s location and amount of traffic change dynamically, PBSs cannot always cover the hot spot area and UEs may have to access the MBSs even if the PBS may be closer to them.
In [1], the authors discuss cell range expansion (CRE), a technique that adds a bias value to the pico received power during handover, as if the pico cell range were expanded, and many studies focus on this topic [1, 3–9]. CRE makes more UEs access the PBS even if the macro received power is stronger than the pico received power. However, the UEs that access a PBS whose pico received power is weaker than the macro received power suffer a large amount of interference from the MBS; such UEs are referred to as expanded region (ER) UEs [1]. Therefore, whenever CRE is used, inter-cell interference coordination (ICIC) may be needed to eliminate this interference.
Traditionally, all UEs are set to use the same fixed bias value [1, 3–8]. One reason is that varying the bias value would require measuring the UEs' distribution, which is hard to obtain. However, optimal bias values change depending on the locations of UEs and BSs, which differ from one another [4].
Owing to the difficulty of setting an appropriate bias value for each UE, many articles mainly discuss applying ICIC [5–8]. ICIC is realized by dividing the radio resource between the two categories of MBS and PBS, usually by stopping the MBS's transmission on some radio resources. It is applied by separating frequency bands in the frequency-domain approach or by separating time slots in the time-domain approach. In the time-domain approach, the almost blank subframe (ABS) [5], in which MBSs stop sending data while PBSs send to pico UEs (PUEs), particularly ER UEs, is mainly applied. However, even if ABS is used, reference signals are still transmitted by the MBS, which causes interference [7]. To eliminate this interference, proposals in the literature include the lightly loaded controlling channel transmission subframe (LLCS) [7] and interference cancelation of the common reference signal (CRS-IC) [8]. In the frequency-domain approach, restricting the MBS's transmit power on the frequencies allocated to the PBS is also discussed in [9].
Resource blocks (RBs), introduced in the 3GPP-LTE system [10] as blocks of subcarriers, can also realize ICIC when divided between MBSs and PBSs [1]. The appropriate bias values change depending on this RB ratio, which is another reason why setting optimal bias values is difficult. For these reasons, optimal bias values have been obtained only by trial-and-error methods.
Instead of using trial-and-error methods, we propose to use Q-learning [11], a machine learning (ML) technique, to determine the bias values. Using ML in radio communication systems is becoming popular [12–17]: situations where different radio systems coexist in the same area are very common, and since conditions change dynamically, parameter adjustment is increasingly difficult and complicated. Q-learning has been applied to many other areas such as cognitive radio [12] and the inter-cell interference problem of multi-cell networks [13]. It has also been applied to cellular networks, for example: self-organized and distributed interference management for femtocell networks [14], a self-organized resource allocation scheme [15], a cell selection scheme [16], and a self-optimization of capacity and coverage scheme [17]. However, to the best of our knowledge, no studies apply Q-learning to setting the optimal bias value of CRE.
In this article, each UE individually learns, by Q-learning, the bias value that minimizes the number of outage UEs, and can set the appropriate bias value independently. Simulation results show that, compared to the trial-and-error approach to find the optimal common bias value, the proposed scheme reduces the number of outage UEs and improves average throughput in almost all cases.
Heterogeneous network
To solve coverage problems in MBS-based homogeneous networks, where only one BS serves the UEs in its coverage area, HetNets have been suggested in [18]. HetNets introduce remote radio heads or low power BSs such as PBSs, FBSs, and relay BSs into a macro cell [1, 18].
Though HetNets encompass many types of BSs, for simplicity this work is limited to the case of only two types of BSs, namely MBSs and PBSs, as is also the case in the majority of related studies. PBSs are typically deployed within macro cells for capacity enhancement and coverage extension. Moreover, they usually have the same backhaul and access features as MBSs [1].
PBSs are deployed within the macro cell to avoid having hot spot UEs access the MBS. Then, as the radius of a pico cell is limited, CRE [3] is traditionally used, as we explain in the following section.
Cell range expansion
In this article, reference-signal-received-power (RSRP) based handover [3], whereby the handover procedure is triggered through the assessment of the strength of the pilot signal (reference signal), shall be considered. With CRE, a UE selects the PBS when

${\left({w}_{\mathit{\text{p}}}^{\mathit{\text{pilot}}}\right)}_{\text{dB}} + (\Delta_{\mathit{\text{bias}}})_{\text{dB}} > {\left({w}_{\mathit{\text{m}}}^{\mathit{\text{pilot}}}\right)}_{\text{dB}},$

where ${\left({w}_{\mathit{\text{m}}}^{\mathit{\text{pilot}}}\right)}_{\text{dB}}$, ${\left({w}_{\mathit{\text{p}}}^{\mathit{\text{pilot}}}\right)}_{\text{dB}}$, and $(\Delta_{\mathit{\text{bias}}})_{\text{dB}}$ represent the decibel values of the pilot signal powers from the MBS and the PBS, and the bias value, respectively [1].
In this way, the pico cell range can be artificially extended. However, since ER UEs connect to BSs that do not provide the strongest received power, they suffer from interference from MBS [1].
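The biased handover rule above can be sketched in a few lines. The function name and the example powers are illustrative, not from the article; the only logic taken from the text is that, in decibels, the bias is simply added to the pico pilot power before the comparison.

```python
def select_cell(macro_pilot_dbm: float, pico_pilot_dbm: float, bias_db: float) -> str:
    """Biased cell selection used in CRE (sketch).

    All quantities are in decibels, so applying the bias is a simple
    addition to the pico pilot power before comparing with the macro.
    """
    if pico_pilot_dbm + bias_db > macro_pilot_dbm:
        return "pico"
    return "macro"

# A UE hearing -80 dBm from the MBS and -86 dBm from the PBS would
# normally pick the macro cell; an 8 dB bias pulls it into the
# expanded region of the pico cell.
print(select_cell(-80.0, -86.0, 0.0))  # macro
print(select_cell(-80.0, -86.0, 8.0))  # pico
```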
Thus, we need ICIC that can eliminate the interference from MBS to PBS. We apply ICIC by dividing the radio resource between MBSs and PBSs to avoid the interference between them [18]. Although each PBS can interfere with another PBS’s signal, it is not a big problem because they have almost the same transmit powers.
The configuration of optimal bias value
Optimal bias values that minimize the number of outage UEs are changed by the ratio of radio resource among BSs and by the location of UEs and BSs. Since the optimal bias values vary from one UE to another [4], bias values should be defined by each UE. However, because of the difficulty to find the suitable sets of the ratio of radio resource and UEs’ distribution, most articles use the common bias value among all UEs [1, 3]. In this article, each UE learns bias values that minimize the number of outage UEs individually and can decide each bias value independently.
Reinforcement learning
Although supervised learning is effective, it may be hard to obtain training data in the field. Thus, reinforcement learning (RL) represents a suitable alternative, as it uses only the experiences of agents, which learn automatically from the environment. In RL, instead of training data, agents receive scalar values referred to as costs, and only these costs provide knowledge to the agents [11]. The learning proceeds as follows:
1. Agents observe the state s_t of the environment and take an action a_t based on the currently observed s_t at time t.
2. The state transits to the next state s_{t+1} owing to the execution of the selected action a_t, and agents receive a cost c_t for executing action a_t in state s_t.
3. Time t transits to t + 1, and steps 1 and 2 are repeated.
The algorithm described above allows RL to learn online, which is one of its most important characteristics.
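The three-step loop above can be sketched as follows. The toy environment and agent below are hypothetical stand-ins used only to show the control flow; they are not part of the article's system.

```python
class ToyEnv:
    """Hypothetical two-state environment: action 0 keeps the low-cost
    state, action 1 jumps to the high-cost state."""
    def __init__(self):
        self.state = 0

    def observe(self):
        return self.state

    def step(self, action):
        self.state = action
        cost = 1.0 if self.state == 1 else 0.0
        return self.state, cost


class GreedyAgent:
    """Trivial agent that always picks action 0 and records the costs
    it receives (a placeholder for a real learner)."""
    def __init__(self):
        self.costs = []

    def act(self, state):
        return 0

    def learn(self, s, a, cost, s_next):
        self.costs.append(cost)


def run_episode(env, agent, steps):
    s = env.observe()
    for _ in range(steps):               # step 3: repeat over time
        a = agent.act(s)                 # step 1: act on the observed state
        s_next, cost = env.step(a)       # step 2: transition, receive cost
        agent.learn(s, a, cost, s_next)  # online update from the cost
        s = s_next
    return agent.costs


print(run_episode(ToyEnv(), GreedyAgent(), 3))  # [0.0, 0.0, 0.0]
```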
Value function and policy
RL has two important components: the policy and the value function.
The policy defines the action of the agent at each step; in other words, it is the mapping from an observed state to the action that should be taken. It may be expressed as a simple function, a lookup table, or in other forms that need more computation. The policy itself is enough to decide the action of the agent [11]. It is represented as the probability π(s, a) of selecting action a in state s, and computing the policy means deciding π(s, a) for all available actions at every state. The agent's goal is to maximize the total reward (equivalently, to minimize the total cost) it receives over the long run.
Almost all reinforcement learning algorithms are based on estimating value functions—functions of states or of stateaction pairs that estimate how good it is for the agent to be in a given state or how good it is to perform a given action in a given state. The expression of “how good” means the expected future rewards. Of course, the rewards that the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies [11].
The state-value function for policy π is defined as

${V}^{\pi}(s) = {E}_{\pi}\left\{\sum_{k=0}^{\infty} {\gamma}^{k} {c}_{t+k+1} \mid {s}_{t}=s\right\},$

where E_{π}{·} denotes the expected value given that the agent follows policy π, and γ is a discount factor. Note that if a terminal state exists, its value is always zero. The function V^{π} is referred to as the state-value function for policy π [11].
Similarly, the action-value function Q(s, a) can be defined, as explained in the following subsection. In this article, the action-value function Q(s, a) is used as the value function. It represents the value of selecting action a in state s; this is the Q-value of Q-learning, explained later. The best Q(s, a) indicates the best action a in state s.
Q-learning
Q-learning is one of the typical RL methods and is proved to converge in single-agent systems [11, 19]. Q-learning uses the Q-value, i.e., the action-value function. Agents maintain a Q-table in which they store sets of states, actions, and the Q-values that represent the effectiveness of those sets.
The optimal policy minimizes the expected discounted total cost:

$Q({s}_{0},{a}_{0}) = E\left[\sum_{t=0}^{\infty} {\gamma}^{t}\, c({s}_{t},{a}_{t})\right], \qquad (4)$

where γ, c(s_{t}, a_{t}), s_{0}, and a_{0} represent the discount factor (0 ≤ γ ≤ 1), the cost of the pair of state s_{t} and action a_{t}, the initial state, and the initial action, respectively [12].
If a terminal state can be defined, the costs in Equation (4) are summed up to the final one. However, since a terminal state can rarely be defined, the final time becomes infinity and future costs would make the Q-values diverge. That is why discounting future costs is required. If γ = 0, agents ignore future costs and consider only immediate costs; if γ is close to 1, agents take a comprehensive view and consider future costs.
It is very difficult to obtain the optimal policy from Equation (4), because we cannot have knowledge of all states. Therefore, instead of solving Equation (4) directly, Q-learning was proposed in [11].
The Q-value satisfies

$Q(s,a) = \overline{c}(s,a) + \gamma \sum_{v} {P}_{s\rightarrow v}(a)\, \underset{{a}^{\prime}}{\text{min}}\, Q(v,{a}^{\prime}), \qquad (5)$

where P_{s→v}(a) is the transition probability from state s to the next state v when action a is executed, and c(s, a) and $\overline{c}(s,a)$ represent the cost of action a in state s and the mean value of c(s, a), respectively. According to Equation (5), the current state's Q-value can be evaluated from the current cost and the next state's Q-value.
All Q-values are stored per state-action pair in the Q-table and updated repetitively. Although Q-learning has to store all Q-values, which may cause a memory problem, it converges to the action-value function Q(s, a) directly, so Equation (4) can be solved approximately using the Q-table. For the learning to converge, it is sufficient that the Q-values of all state-action pairs continue to be updated. Because this concept is simple, it makes the analysis of the algorithm easier.
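A minimal sketch of such a Q-table, kept as a mapping from (state, action) pairs to Q-values. Initializing unseen entries to zero is an assumption; the article does not state the initial value. Since this article works with costs, the best action is the one with the lowest Q-value.

```python
from collections import defaultdict

# Q-table: (state, action) -> Q-value; unseen pairs default to 0.0
# (an assumption for this sketch).
q_table = defaultdict(float)

q_table[("s0", "a0")] = 2.5
q_table[("s0", "a1")] = 1.0


def best_action(state, actions):
    # Costs are minimized in this article, so the best action is the
    # one with the lowest Q-value in the given state.
    return min(actions, key=lambda a: q_table[(state, a)])


print(best_action("s0", ["a0", "a1"]))  # a1
```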
Step (1): Agents observe their states from the environment and find the sets containing that state in the Q-table. They also receive costs from the environment as the evaluation of the selected actions.
Step (2): Using the state and cost obtained at step (1), the Q-value of the previously selected state-action pair is updated.
Step (3): Following an action selection policy, for instance the ε-greedy policy mentioned later, an action is selected using the Q-values of the states observed at step (1).
Through the above steps, Q-learning realizes Equation (4).
The Q-value is updated as follows:

$Q({s}_{t},{a}_{t}) \leftarrow (1-\alpha)\, Q({s}_{t},{a}_{t}) + \alpha \left[{c}_{t+1} + \gamma\, \underset{a}{\text{min}}\, Q({s}_{t+1},a)\right], \qquad (6)$

where α represents the learning rate (0 < α ≤ 1) that controls the amount of change of the Q-value, and "←" means update. This equation comes from Equation (5), and it takes future costs into account.
The aforementioned Q-learning algorithm has been proved to converge in single-agent systems [19]. However, ours is a multi-agent system, because all UEs can be agents in our system. The convergence of Q-learning in a multi-agent system has not been proved in general, because of the complex relationships among the different agents; convergence has been proved only when the agents do not move and know all the other agents' strategies [20].
Cell range expansion with Q-learning
Though many articles use a common bias value among all BSs and all UEs, UEs can improve the coverage area by using their own bias values. Because of the difficulty of finding the optimal bias value of each UE, in this article we propose a scheme in which every UE decides its bias value independently, to minimize the number of outage UEs, by using Q-learning. Because all UEs learn by themselves, in other words, all UEs can be agents, this is a multi-agent system. Moreover, the online learning allowed by the RL algorithm is also used in our system.
We use RBs, i.e., blocks of subcarriers, as radio resources in this article. The RB is the basic resource allocation unit for scheduling in the 3rd Generation Partnership Project Long Term Evolution (3GPP-LTE) system [10]. Although one or more RBs can be allocated to a UE in the 3GPP-LTE system [10], each UE is allocated only one RB in this article. To eliminate the interference from MBSs to ER UEs, the RBs [1] should be divided between MBSs and PBSs. If UEs use the same RB simultaneously, there is interference among those UEs. UEs that are not allocated an RB by a BS cannot access radio services.
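The one-RB-per-UE allocation described above might be sketched as follows. The function and variable names are illustrative; the article states only that each UE gets at most one RB and that UEs left without an RB are in outage.

```python
import random


def allocate_rbs(ues, n_rbs, rng=random):
    """Randomly assign at most one RB per UE (sketch).

    UEs that receive no RB cannot access radio services in this trial,
    i.e., they are outage UEs.
    """
    rbs = list(range(n_rbs))
    rng.shuffle(rbs)  # random allocation order
    served = {ue: rb for ue, rb in zip(ues, rbs)}
    outage = [ue for ue in ues if ue not in served]
    return served, outage


# 5 UEs competing for 3 RBs: 3 are served, 2 are in outage.
served, outage = allocate_rbs(["u%d" % i for i in range(5)], 3)
print(len(served), len(outage))  # 3 2
```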
Definition of state, action, and cost
We show the definition of state, action, and cost in Table 1.

State: The state at time t is defined as
${s}_{t}=\{{p}_{\mathrm{M}},{p}_{\mathrm{P}}\}, \qquad (7)$
where p_{M} and p_{P} denote the received powers of the pilot signals from the MBS and PBS, respectively. Although UEs can hear many signals from various BSs, they use the strongest macro and pico ones; in other words, only two parameters are saved as the state in the Q-table. To keep the Q-table small, these two powers are quantized.

Action: The action at time t is defined as
${a}_{t}=b, \qquad (8)$
where b denotes the bias value.

Cost: The cost at time t is defined as
${c}_{t}=n, \qquad (9)$
where n denotes the number of UEs that cannot get radio service because of no spectrum vacancy or weak received power, referred to as outage UEs. Using the backhaul between BSs, this number can be calculated and broadcast to the UEs.
Table 1 The definition of state, action, and cost

State   p_{M}: received power of the pilot signal from the MBS; p_{P}: received power of the pilot signal from the PBS. UEs use the strongest macro and pico ones.
Action  b: the UE's bias value.
Cost    n: the number of UEs that cannot get radio service because of no spectrum vacancy or weak received power (outage UEs). Using the backhaul between BSs, this number can be calculated and broadcast to the UEs.
With this definition, UEs decide the bias values that minimize the number of outage UEs depending on the received power from each BS. Furthermore, considering the amount of radio resources, when there are many macro RBs (MRBs), accessing the MBS may be better even if the difference between the macro and pico received powers is small, and vice versa. Each UE can cope with such situations and decide its optimal bias value by using Q-learning.
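Forming the quantized state from the strongest macro and pico pilots might look like the following sketch. The 2 dB step and the clamping limits are assumptions; the article says only that the powers are quantized to even values within upper and lower limits.

```python
def quantize(power_dbm, step=2.0, lo=-120.0, hi=-40.0):
    """Clamp a received power to [lo, hi] and round it to the nearest
    multiple of `step`. The step size and limits are illustrative
    assumptions, not values given in the article."""
    p = max(lo, min(hi, power_dbm))
    return step * round(p / step)


def make_state(macro_pilots, pico_pilots):
    """State s_t = {p_M, p_P}: the strongest macro and pico pilot
    powers, quantized so the Q-table stays small."""
    return (quantize(max(macro_pilots)), quantize(max(pico_pilots)))


print(make_state([-85.3, -97.0], [-90.9, -101.2]))  # (-86.0, -90.0)
```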
Flow of learning

Step (1): Each UE receives pilot signals from each BS and chooses the strongest macro and pico ones. In other words, each UE observes its state.
Step (2): The received powers are quantized to make the learning converge faster, and each UE compares these pilot signal powers with the states in its Q-table.
Step (3): If no equal received powers exist in a UE's Q-table, the UE adds the new received powers to its own Q-table.
Step (4): Among the sets whose received powers are equal to the pilot signal powers, a UE usually chooses the set with the lowest Q-value, or rarely chooses a set randomly to avoid local minima, following the ε-greedy policy [11].
Step (5): Each UE uses the chosen set's bias value as its action.
Step (6): Each UE compares the macro received power with the pico received power plus the bias value and tries to connect to the BS with the larger one.
Step (7): BSs allocate an RB to each UE randomly. In this article, each UE can use only one RB. If MBSs and PBSs shared the same RBs, ER UEs would be strongly interfered with by the MBS's signals; therefore, the RBs are split between them.
Step (8): BSs calculate the number of outage UEs and pass it to the UEs as the cost.
Step (9): Each UE re-evaluates the Q-value of the set chosen at step (4) by updating it based on Equation (6).
Steps (1) to (6) and step (9) are carried out by each UE, while steps (7) and (8) are done by the BSs.
Repeating the above steps makes the Q-values of all sets of states and actions converge, and then agents can take the right actions.
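Steps (4), (5), and (9) can be sketched as below. The candidate bias set at a 2 dB interval, α = 0.5, γ = 0.5, and ε = 0.1 follow the simulation parameters, while the function names and the 0–20 dB bias range are illustrative assumptions.

```python
import random

# Candidate bias values at the 2 dB interval used in the proposal;
# the 0-20 dB range is an assumption for this sketch.
BIASES = list(range(0, 21, 2))


def choose_bias(q, state, epsilon=0.1, rng=random):
    """Steps (4)-(5): epsilon-greedy selection over bias values.

    Usually the bias with the lowest Q-value (costs are minimized);
    occasionally a random bias, to avoid local minima."""
    if rng.random() < epsilon:
        return rng.choice(BIASES)
    return min(BIASES, key=lambda b: q.get((state, b), 0.0))


def learn_step(q, state, next_state, cost, bias, alpha=0.5, gamma=0.5):
    """Step (9): re-evaluate the chosen (state, bias) entry with
    Equation (6)."""
    target = cost + gamma * min(q.get((next_state, b), 0.0) for b in BIASES)
    q[(state, bias)] = (1 - alpha) * q.get((state, bias), 0.0) + alpha * target


q = {("s", 4): -1.0}
print(choose_bias(q, "s", epsilon=0.0))  # 4: the lowest stored Q-value
```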
In our system, if agents always add newly found states to the Q-table, its size grows, which is not allowed by the memory constraint; it also makes the learning time longer. To mitigate this, we use a priori data of the common bias values to make the learning converge faster. The number of outage UEs for all the common bias values can be checked by the trial-and-error method before learning and data transmission start, because the common bias values are easier to obtain than the optimal bias value of each UE. Although the common bias value among all the UEs is not the best bias value for each UE [4], it tends to be close to it. We also quantize the received powers used as the state to even values in step (2), and set upper and lower limits to check for and remove outliers. The state is added only after outlier checking and quantization. These measures reduce the required memory size and speed up convergence.
UEs keep the data in their Q-tables when they move to another PBS's coverage area, because even if the situation changes, situations may share some similarities, and the data obtained in one situation helps learning in another [21]. UEs use the data as the initial values of the next learning, because we expect this to help the learning algorithm converge faster. Even in different situations, UEs keep learning the environment, so the table keeps being updated.
Simulation model and results
Simulation parameters

Macro cell radius        289 m
Pico cell radius         40 m
Carrier frequency        2.0 GHz
Bandwidth                10 MHz
Number of RBs            50
Thermal noise density    −174 dBm/Hz
Macro BSs                1
Pico BSs                 2
Hot spots                2
UEs inside macro cell    50
UEs inside hot spot areas  25
Macro BS transmit power  46 dBm
Pico BS transmit power   30 dBm
Macro path loss model    128.1 + 37.6 log10(R) dB (R in km)
Pico path loss model     140.1 + 36.7 log10(R) dB (R in km)
Velocity of UEs          3 km/h
Channel                  Rayleigh fading
Number of trials         500,000
Learning rate α          0.5
Discount factor γ        0.5
ε                        0.1
From now on, we compare three schemes: the proposed Q-learning scheme, the no-learning scheme (best bias value), and the no-learning scheme (fixed bias value). In the no-learning schemes, all UEs use a common bias value found by the trial-and-error method, i.e., by searching for the bias value that minimizes the number of outage UEs. The no-learning scheme (best bias value) repeats this search at every trial. Although it achieves the minimum number of outage UEs obtainable with a common bias value, it is not practical, because the best bias value is found only after checking the number of outage UEs for all bias values; since the channel condition changes dynamically, this check is needed at every trial, which takes too long in a real environment. The no-learning scheme (fixed bias value) therefore applies the trial-and-error method only at the first trial, as a practical scheme. The compared schemes use a 1 dB interval for the bias value, while 2 dB is used in our proposal; note that a smaller interval results in better performance. In our proposal, the somewhat large 2 dB interval is used to keep the Q-table small.
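The trial-and-error search used by the no-learning baselines amounts to an exhaustive minimization over candidate common bias values. A sketch with illustrative outage counts; `count_outages` stands in for running one trial with a given common bias and counting the outage UEs:

```python
def best_common_bias(candidate_biases, count_outages):
    """Trial-and-error search of the no-learning baselines (sketch):
    try every common bias value and keep the one with the fewest
    outage UEs. `count_outages` is a hypothetical callback that runs
    one trial with that bias and returns the outage count."""
    return min(candidate_biases, key=count_outages)


# Toy outage profile with a minimum at 6 dB (illustrative numbers only).
outages = {0: 9, 2: 7, 4: 5, 6: 3, 8: 4, 10: 6}
print(best_common_bias(sorted(outages), outages.__getitem__))  # 6
```

The cost of this approach, as the text notes, is that every candidate must be evaluated before the best one is known, which is why repeating it at every trial is impractical.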
As shown in Figures 9 and 10, the number of outage UEs and the UEs' average throughput change depending on the ratio of pico RBs (PRBs). This is because the bias values that minimize the number of outage UEs also differ according to the ratio of RBs between the MBS and PBSs. In spite of the rough interval, the Q-learning scheme, the red line in Figure 9, has fewer outage UEs than the no-learning schemes at almost all ratios of RBs. This means that if UEs define their own bias values, fewer UEs fall into outage. When the ratio of PRBs is 20%, however, the no-learning schemes have fewer outage UEs than the Q-learning scheme. At this ratio, many UEs have a difference between the macro and pico received powers small enough for the common bias value to occupy all RBs. Our proposal can also occupy all RBs at this ratio, but the occasional random actions of its ε-greedy policy cause slightly more outage UEs. That is why the no-learning schemes keep the number of outage UEs smaller than that of the proposed scheme at this ratio. In this figure, no learning (best bias value) represents the minimum number of outage UEs among the schemes using a common bias value. Since the best bias value changes depending on several factors, no learning (fixed bias value) has more outage UEs than no learning (best bias value).
The same trend appears in the average throughput of all UEs in Figure 10. When the ratio of PRBs is 20%, the no-learning schemes have higher throughput than the proposed Q-learning scheme; at all other ratios, the Q-learning scheme performs better than the no-learning schemes.
Conclusions
HetNets, which introduce PBSs near hot spots in macro cells, are necessary to improve the coverage area. Since the pico cell range may be too small to cover the hot spot area, the pico cell's CRE is considered. However, to the best of our knowledge, there have been no studies on the optimal bias value that minimizes the number of outage UEs, because this value depends on several factors, such as the dividing ratio of radio resources between MBSs and PBSs, and it has been determined only by the trial-and-error method. Thus, in this article, we proposed a scheme using Q-learning in which UEs learn, from past experience, the bias values that minimize the number of outage UEs.
Our results for the number of outage UEs and the average throughput show that, after thousands of trials, the Q-learning approach can outperform the no-learning schemes. We showed that our proposal can decrease the number of outage UEs and improve the average throughput at almost all ratios of RBs. Moreover, it can largely enhance the cell-edge UE throughput compared with the schemes using a common bias value.
In the simulation, UEs keep the data in their Q-tables when they move to another PBS's coverage area, and we expect this to help the learning algorithm converge faster. However, we have not evaluated the effect of UEs moving to other PBS coverage areas in detail; this evaluation is left for future study. The required learning time should also be studied before this system is deployed, because if convergence takes too long, the system cannot be used in a real environment.
References
 Pérez-López D, Chu X: Intercell interference coordination for expanded region picocells in heterogeneous networks. In Proceedings of the 20th International Conference on Computer Communications and Networks (ICCCN), Maui, HI, USA; 2011:1-6.
 Damnjanovic A, Montojo J, Wei Y, Ji T, Luo T, Vajapeyam M, Yoo T, Song O, Malladi D: A survey on 3GPP heterogeneous networks. IEEE Wirel. Commun. 2011, 18:10-21.
 Sangiamwong J, Saito Y, Miki N, Abe T, Nagata S, Okumura Y: Investigation on cell selection methods associated with inter-cell interference coordination in heterogeneous networks for LTE-Advanced downlink. In 11th European Wireless Conference 2011—Sustainable Wireless Technologies (European Wireless), Vienna, Austria; 2011:1-6.
 Shirakabe M, Morimoto A, Miki N: Performance evaluation of inter-cell interference coordination and cell range expansion in heterogeneous networks for LTE-Advanced downlink. In 8th International Symposium on Wireless Communication Systems (ISWCS), Aachen, Germany; 2011:844-848.
 Güvenç İ, Jeong MR, Demirdoḡen İ, Kecicioḡlu B, Watanabe F: Range expansion and inter-cell interference coordination (ICIC) for picocell networks. In IEEE Vehicular Technology Conference (VTC Fall), San Francisco, CA, USA; 2011:1-6.
 Güvenç İ: Capacity and fairness analysis of heterogeneous networks with range expansion and interference coordination. IEEE Commun. Lett. 2011, 15:1084-1087.
 Okino K, Nakayama T, Yamazaki C, Sato H, Kusano Y: Pico cell range expansion with interference mitigation toward LTE-Advanced heterogeneous networks. In IEEE International Conference on Communications Workshops (ICC), Kyoto, Japan; 2011.
 Vajapeyam M, Damnjanovic A, Montojo J, Ji T, Wei Y, Malladi D: Downlink FTP performance of heterogeneous networks for LTE-Advanced. In IEEE International Conference on Communications Workshops (ICC), Kyoto, Japan; 2011.
 Chiu CS, Huang CC: An interference coordination scheme for picocell range expansion in heterogeneous networks. In IEEE 75th Vehicular Technology Conference (VTC Spring), Yokohama, Japan; 2012:1-6.
 Lee M, Oh SK: On resource block sharing in 3GPP-LTE system. In 17th Asia-Pacific Conference on Communications (APCC), Sabah, Malaysia; 2011:38-42.
 Sutton RS, Barto AG: Reinforcement Learning. Cambridge: MIT Press; 1998.
 Galindo-Serrano A, Giupponi L: Distributed Q-learning for aggregated interference control in cognitive radio networks. IEEE Trans. Veh. Technol. 2010, 59:1823-1834.
 Dirani M, Altman Z: A cooperative reinforcement learning approach for inter-cell interference coordination in OFDMA cellular networks. In Proceedings of the 8th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), Avignon, France; 2010:170-176.
 Galindo-Serrano A, Giupponi L, Auer G: Distributed learning in multiuser OFDMA femtocell networks. In IEEE 73rd Vehicular Technology Conference (VTC Spring), Yokohama, Japan; 2011:1-6.
 Feki A, Capdevielle V, Sorsy E: Self-organized resource allocation for LTE pico cells: a reinforcement learning approach. In IEEE 75th Vehicular Technology Conference (VTC Spring), Yokohama, Japan; 2012:1-5.
 Dhahri C, Ohtsuki T: Learning-based cell selection method for femtocell networks. In IEEE 75th Vehicular Technology Conference (VTC Spring), Yokohama, Japan; 2012:1-5.
 Razavi R, Klein S, Claussen H: Self-optimization of capacity and coverage in LTE networks using a fuzzy reinforcement learning approach. In IEEE 21st International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), Istanbul, Turkey; 2010:1865-1870.
 Khandekar A, Bhushan N, Tingfang J, Vanghi V: LTE-Advanced: heterogeneous networks. In European Wireless Conference (EW), Lucca, Italy; 2010:978-982.
 Ribeiro R, Borges AP, Enembreck F: Interaction models for multiagent reinforcement learning. In International Conference on Computational Intelligence for Modelling, Control and Automation, Vienna, Austria; 2008:464-469.
 Haitao O, Weidong Z, Wenyuan Z, Xiaoming X: A novel multiagent Q-learning algorithm in cooperative multiagent system. In Proceedings of the 3rd World Congress on Intelligent Control and Automation, Hefei, China; 2000:272-276.
 Galindo-Serrano A, Giupponi L, Blasco P, Dohler M: Learning from experts in cognitive radio networks: the docitive paradigm. In Proceedings of the Fifth International Conference on Cognitive Radio Oriented Wireless Networks & Communications (CROWNCOM), Cannes, France; 2010:1-6.
 3GPP TR 36.814 (V9.0.0): Evolved Universal Terrestrial Radio Access (E-UTRA); further advancements for E-UTRA physical layer aspects. 2010.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.