In this section, we evaluate the performance of our successful transmission-only application (Section 6.2) and our joint energy and successful transmission application (Section 6.3).
Simulation platform and benchmarks
Simulation platform
Our novel CR network simulator is described in Fig. 5. We develop the framework using the NS-2 network simulation tool, with specific modifications to the physical, link, and network layers implemented as stand-alone C++ modules. The Multi-agent System Module provides the architecture of the multi-agent system, which is composed of an agent object declaration, an agent cooperation mechanism, and an agent communication protocol. The Learning Module implements several typical learning algorithms and some common functions. The PU Activity Block defines the activities of PUs using the on-off model, including transmission and interference range, rule of spectrum occupancy, and location. The Channel Block Module is a channel table that contains information about background noise and channel capacity. The Spectrum Sensing Block provides the functionality for energy-aware spectrum sensing; one important function is to notify the Spectrum Management Block when a PU is detected, upon which the Spectrum Management Block triggers the sensor to switch to another available channel. The Spectrum Sharing Block coordinates channel access and calculates the interference imposed on sensor nodes by any ongoing transmission. The Wireless Sensor Network Environment Repository provides information about transmission power levels, spectrum bands, sensor locations, and the different network protocols.
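As a concrete illustration of the sensing/management hand-off described above, the following is a minimal C++ sketch; all class and method names here are hypothetical and do not reflect the simulator's actual interfaces.

```cpp
// Hypothetical sketch of the PU-detection hand-off between the Spectrum
// Sensing Block and the Spectrum Management Block. Names are illustrative.
#include <iostream>
#include <vector>

class SpectrumManagementBlock {
public:
    // Move the sensor to the first channel not occupied by a PU.
    int switchChannel(int currentChannel, const std::vector<bool>& puOccupied) {
        for (int ch = 0; ch < static_cast<int>(puOccupied.size()); ++ch) {
            if (ch != currentChannel && !puOccupied[ch]) {
                return ch;  // first available channel
            }
        }
        return currentChannel;  // no alternative found; stay put
    }
};

class SpectrumSensingBlock {
public:
    explicit SpectrumSensingBlock(SpectrumManagementBlock* mgmt) : mgmt_(mgmt) {}

    // Called once per slot: if a PU is detected on the current channel,
    // notify the management block, which triggers a channel switch.
    int onSense(int currentChannel, const std::vector<bool>& puOccupied) {
        if (puOccupied[currentChannel]) {
            return mgmt_->switchChannel(currentChannel, puOccupied);
        }
        return currentChannel;
    }

private:
    SpectrumManagementBlock* mgmt_;
};

int main() {
    SpectrumManagementBlock mgmt;
    SpectrumSensingBlock sensing(&mgmt);
    std::vector<bool> puOccupied = {true, true, false, true};  // PU on 0, 1, 3
    int channel = sensing.onSense(/*currentChannel=*/0, puOccupied);
    std::cout << "Sensor moved to channel " << channel << "\n";  // prints 2
    return 0;
}
```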
We mainly investigate a topology with 100 sensor nodes placed over a square grid of side 1000 m. There are a total of 25 PUs in our CWSN system. Each PU is randomly assigned a default channel, which it keeps with a probability of 0.4; it can also switch to three other pre-assigned, successively placed channels with decreasing probabilities of 0.3, 0.2, and 0.1, respectively. In this way, the PUs follow an underlying rule governing which channels they are active on, but this rule is unknown to the CR sensors.
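This PU activity model can be sketched as follows. The sketch assumes that the three alternative channels are the three channels immediately following the default; the function name is ours, not the simulator's.

```cpp
// Minimal sketch of the PU channel-occupancy model described above: keep the
// default channel with probability 0.4, else move to one of the next three
// channels with probabilities 0.3, 0.2, and 0.1 (an assumed placement).
#include <iostream>
#include <random>

int samplePuChannel(int defaultChannel, std::mt19937& rng) {
    // Probabilities for: default, default+1, default+2, default+3.
    std::discrete_distribution<int> dist({0.4, 0.3, 0.2, 0.1});
    return defaultChannel + dist(rng);
}

int main() {
    std::mt19937 rng(42);
    int defaultChannel = 17;  // randomly assigned, out of the 100 channels
    for (int slot = 0; slot < 5; ++slot) {
        std::cout << "slot " << slot << ": PU on channel "
                  << samplePuChannel(defaultChannel, rng) << "\n";
    }
    return 0;
}
```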
There are a total of 100 licensed channels. Transmission in the CWSN occurs over these channels, connecting multiple pairs of sensor nodes. We denote such a pair with a data link as {i,j}, meaning a directional transmission from the ith sender to the jth receiver. The transmitting spectrum is chosen by the sender node, which notifies the receiver node via a common control channel (CCC). Information about any collisions experienced by the receiver may also be returned to the sender over this CCC. All data are transmitted over the link between the pair of sensor nodes using only the chosen spectrum.
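The control exchange can be sketched as a pair of messages on the CCC; the structure and names below are illustrative, not the actual protocol implementation.

```cpp
// Illustrative sketch of the CCC exchange: the sender announces the chosen
// band, and the receiver reports any experienced collision back to the sender.
#include <iostream>

struct ControlChannel {
    // Sender -> receiver: which licensed channel the data will use.
    void notifyBand(int sender, int receiver, int band) {
        std::cout << "CCC: node " << sender << " -> node " << receiver
                  << ": data will use channel " << band << "\n";
    }
    // Receiver -> sender: feedback when the receiver experienced a collision.
    void reportCollision(int receiver, int sender, int band) {
        std::cout << "CCC: node " << receiver << " -> node " << sender
                  << ": collision on channel " << band << "\n";
    }
};

int main() {
    ControlChannel ccc;
    ccc.notifyBand(/*sender=*/3, /*receiver=*/7, /*band=*/42);
    ccc.reportCollision(/*receiver=*/7, /*sender=*/3, /*band=*/42);
    return 0;
}
```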
The permissible transmission power values for cognitive sensors are uniformly distributed over the interval [0.5 mW, 4 mW], while the PUs always transmit at 10 mW. Time is slotted, and the link layer at each sender node attempts to transmit with a probability of 0.2 in every slot. The time scale of the x axis in the following figures is given in epochs, each composed of 50 time slots, and we show results over 600 epochs.
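A minimal sketch of this per-slot traffic model is shown below; the discrete set of power levels is our assumption, since the text only specifies the [0.5 mW, 4 mW] range.

```cpp
// Sketch of the slotted traffic model: each sender attempts a transmission
// with probability 0.2 per slot, and results are aggregated into epochs of
// 50 slots. The power grid below is an assumed discretization of the range.
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(1);
    std::bernoulli_distribution attempts(0.2);               // link-layer attempt prob.
    std::vector<double> powerLevels = {0.5, 1.0, 2.0, 4.0};  // mW, assumed grid
    std::uniform_int_distribution<int> pickPower(0, (int)powerLevels.size() - 1);

    const int slotsPerEpoch = 50;
    int attemptsInEpoch = 0;
    for (int slot = 0; slot < slotsPerEpoch; ++slot) {
        if (attempts(rng)) {
            ++attemptsInEpoch;
            double p = powerLevels[pickPower(rng)];  // sender's chosen power
            (void)p;  // the full simulator would now run the transmission
        }
    }
    std::cout << attemptsInEpoch << " attempts in one epoch (~10 expected)\n";
    return 0;
}
```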
Simulation benchmarks
In our experimental evaluation, the proposed reinforcement-learning-based scheme, abbreviated as the RL Scheme, is compared with three other schemes: (i) random assignment, (ii) greedy assignment with 1 memory slot, and (iii) greedy assignment with 20 memory slots.
The random assignment scheme, abbreviated as the RA Scheme, uses a random combination of spectrum band and power level in each time slot.
The greedy assignment scheme with 1 memory slot, abbreviated as the GD-1 Scheme, stores only the most recently received reward for every state, that is, for every combination of spectrum band and power level. To avoid local optima, the scheme selects the combination with the highest previous reward with probability η and explores a randomly chosen combination with probability (1−η).
The greedy assignment scheme with 20 memory slots, abbreviated as the GD-20 Scheme, maintains a repository of the rewards received over the past 20 time slots for every combination of spectrum band and power level and picks the best one. As with GD-1, it selects the best combination with probability η and explores a random combination with probability (1−η).
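Under our reading of the two greedy schemes, the selection rule can be sketched as follows; the class layout is illustrative, and we assume unvisited combinations start with a neutral reward of zero.

```cpp
// Sketch of the GD-k rule: per (band, power) combination, keep the rewards
// from the last k slots, score each combination by its best remembered
// reward, exploit with probability eta, and explore otherwise.
#include <deque>
#include <iostream>
#include <random>
#include <vector>

struct GreedyScheme {
    GreedyScheme(int numActions, int memorySlots, double eta, unsigned seed)
        : memory_(numActions), k_(memorySlots), eta_(eta), rng_(seed) {}

    int selectAction() {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng_) < eta_) {
            // Exploit: action with the highest remembered reward.
            int best = 0;
            double bestReward = bestRemembered(0);
            for (int a = 1; a < (int)memory_.size(); ++a) {
                double r = bestRemembered(a);
                if (r > bestReward) { bestReward = r; best = a; }
            }
            return best;
        }
        // Explore: uniformly random action.
        std::uniform_int_distribution<int> pick(0, (int)memory_.size() - 1);
        return pick(rng_);
    }

    void recordReward(int action, double reward) {
        auto& q = memory_[action];
        q.push_back(reward);
        if ((int)q.size() > k_) q.pop_front();  // keep only the last k rewards
    }

private:
    double bestRemembered(int action) const {
        const auto& q = memory_[action];
        if (q.empty()) return 0.0;  // unvisited combinations treated as neutral
        double best = q.front();
        for (double r : q) if (r > best) best = r;
        return best;
    }

    std::vector<std::deque<double>> memory_;
    int k_;
    double eta_;
    std::mt19937 rng_;
};

int main() {
    GreedyScheme gd20(/*numActions=*/10, /*memorySlots=*/20, /*eta=*/0.9, 7);
    int a = gd20.selectAction();
    gd20.recordReward(a, +5.0);  // e.g., reward for a successful transmission
    std::cout << "next action: " << gd20.selectAction() << "\n";
    return 0;
}
```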
In the RL Scheme, the exploration probability ε is set to 0.2. The initial learning rate α is 0.8 and decreases gradually, scaled by a factor of 0.995 in every time slot. Note that the GD-1 Scheme occupies the same amount of memory as the RL Scheme, whereas the GD-20 Scheme uses twenty times more.
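A minimal sketch of the RL Scheme's stated parameter schedule is given below. The stateless one-step value update and the placeholder reward are our assumptions for illustration; the paper's actual learning rule and reward signal are defined in the earlier sections.

```cpp
// Epsilon-greedy selection with epsilon = 0.2 and a learning rate that starts
// at 0.8 and is scaled by 0.995 every slot, as stated above. Shown here in a
// simplified stateless form over (band, power) combinations.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

int main() {
    const int numActions = 10;   // (band, power) combinations, illustrative
    const double epsilon = 0.2;  // exploration probability
    double alpha = 0.8;          // initial learning rate
    const double decay = 0.995;  // per-slot scaling of the learning rate

    std::vector<double> q(numActions, 0.0);
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> randomAction(0, numActions - 1);

    for (int slot = 0; slot < 600 * 50; ++slot) {  // 600 epochs x 50 slots
        // Epsilon-greedy action selection.
        int a;
        if (coin(rng) < epsilon) {
            a = randomAction(rng);
        } else {
            a = 0;
            for (int i = 1; i < numActions; ++i) if (q[i] > q[a]) a = i;
        }

        // Placeholder reward: the simulator would return a positive value on
        // a successful transmission and a negative value on failure/collision.
        double reward = (a == 3) ? 3.5 : -1.0;

        q[a] += alpha * (reward - q[a]);  // incremental value update
        alpha *= decay;                   // anneal the learning rate
    }

    std::cout << "learned best action: "
              << std::distance(q.begin(), std::max_element(q.begin(), q.end()))
              << "\n";
    return 0;
}
```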
Successful transmission-only
We now evaluate the four schemes, i.e., the RA, GD-1, GD-20, and RL Schemes, in both the small topology with 100 nodes and the large one with 500 nodes. We observe (i) the average percentage of successful transmissions, (ii) the average reward obtained by the CR sensors, and (iii) the average number of channel switches made by the CR sensors.
We apply the above schemes to the small network and report the average percentage of successful transmissions over all transmissions in Fig. 6a. After 600 epochs, the RL Scheme delivers approximately 94.8% of packets successfully, while the GD-20, GD-1, and RA Schemes achieve approximately 93.6, 45.7, and 46.1%, respectively. These results indicate that the RL Scheme clearly increases the fraction of successfully transmitted packets and that its learning performance is much better than that of the other schemes, even though the GD-20 Scheme uses an order of magnitude more memory.
We also evaluate the average rewards obtained by the cognitive sensors under the four schemes. Figure 6b gives the results for the small network. The RL Scheme obtains the highest reward, approximately +3.5, and the GD-20 Scheme obtains approximately +1.3, whereas the GD-1 and RA Schemes receive negative rewards of approximately −8.9 and −8.8, respectively.
These results indicate that the RL Scheme drives the CR sensors to gradually obtain higher positive rewards and to choose more suitable spectrum bands and power levels for packet transmission. They also indicate that the reward obtained tends to be proportional to the probability of successful transmission.
Figure 6c shows the average number of channel switches made by the CR sensors, again for the small topology. We observe that, after learning, the RL Scheme reduces the number of channel switches to approximately 19.0, whereas the GD-20, GD-1, and RA Schemes remain at approximately 29.0, 67.4, and 66.8, respectively. These results indicate that our proposed approach keeps the number of channel switches low and converges to an optimal solution.
Joint energy and successful transmission
In this application, each sensor starts with a fixed energy budget of 1500 mW, which is depleted over time. Unless otherwise specified, the parameter τ is set to 1; we also demonstrate the effect of varying τ on the network lifetime in this section. Owing to space constraints, we show measurements only for the small topology of 100 nodes; similar behavior is observed for the 500-node case.
In Fig. 7, we observe that our energy-aware RL approach shows a significant improvement over the basic RL scheme, which does not shape rewards based on the rate of energy consumption. In the energy-aware approach, each node is allowed to consume energy freely during the exploration phase but is forced to become more conservative in exploring channel and power choices towards the end of the network lifetime. As a result, both RL schemes show the same performance during the initial exploration phase, but the energy-aware scheme is still operational after the network under the competing scheme is completely dead.
We further investigate the performance of our energy-aware approach in Fig. 8 by varying τ, which decides the rate at which we allow the network to consume energy. For each of these experiments, the behavior of the competing schemes remains the same (being independent of energy considerations) and is therefore not displayed. We observe in Fig. 8 that, at epoch 600, lower values of τ sustain the network longer with greater residual energy. As τ increases, the difference in residual energy is much greater in the range 1−100 than in the subsequent range 100−200. This is attributed to the exponentially increasing value of the R_lim function. We also observe in Fig. 8 that the CWSN is still partially operational, with 60−70% of nodes alive, for τ ∈ [1,10]. For higher τ, while the optimal solution may be reached more quickly, the network pays a heavy penalty in terms of the number of nodes still alive towards the end of the simulation. Finally, Fig. 8 reveals that for experiments of moderate length, in which extreme sensor lifetime is not a factor, τ can be freely chosen in the range [1,20]; this also allows the network to converge faster, and the resulting energy loss does not cripple the network entirely on moderate time scales.
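For intuition only, the following sketch shows how an exponentially growing limit function of the kind referenced above behaves. The exact R_lim used in the paper is defined earlier in the article; the form below and its parameters are purely assumptions, chosen to illustrate why differences in residual energy flatten at large τ.

```cpp
// Assumed exponential limit function, e.g. R_lim(t) = exp(tau * t / T): with
// larger tau the allowance grows so quickly that it stops constraining energy
// use, so increasing tau from 100 to 200 changes far less than from 1 to 100.
#include <cmath>
#include <iostream>

double rLim(double tau, double t, double horizon) {
    return std::exp(tau * t / horizon);  // assumed form, not the paper's
}

int main() {
    const double horizon = 600.0;  // epochs in the experiment
    for (double tau : {1.0, 10.0, 100.0, 200.0}) {
        std::cout << "tau=" << tau << "  R_lim at epoch 300: "
                  << rLim(tau, 300.0, horizon) << "\n";
    }
    return 0;
}
```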