Q-learning-based dynamic joint control of interference and transmission opportunities for cognitive radio
EURASIP Journal on Wireless Communications and Networking volume 2018, Article number: 160 (2018)
Abstract
In a cognitive radio (CR) system, a secondary user (SU) should use available channels opportunistically when the primary user (PU) is absent. In a CR network, SUs have to detect the PU signal with sufficient sensing time to guarantee the detection probability and minimize the interference to the PU, while the CR system should have enough data transmission time to maximize the transmission opportunity of the SU. Therefore, the sensing time and data transmission time of the SU are generally considered the main optimization parameters for maximizing the throughput of the CR system. In this paper, a separate sensing node is designated, and sensing is performed continuously using the interference alignment (IA) technique. The designated sensing node estimates the interference ratio and the transmission opportunity loss ratio. To satisfy the primary user’s interference requirement and maximize secondary throughput, we propose a dynamic adjustment mechanism for the sensing slot time and sensing report interval using reinforcement learning in a time-varying communication environment. The experimental results show that the proposed approach can minimize the interference on the PU and enhance the transmission opportunity of SUs.
Introduction
As the demand for multimedia services increases explosively, the need for bandwidth to meet the requirements of communication systems is also increasing rapidly. Periodic frequency auctions in each country are a major concern for telecom operators because of their astronomical costs; these costs are included in the operators’ CAPEX plans and are inevitably passed on to consumers to protect operating profit [1]. Therefore, it is necessary to use the frequency spectrum more efficiently to reduce both the cost of the spectrum and the cost of communication. Cognitive radio (CR) is a technology that can improve the efficiency of spectrum use and has been studied continuously since it was proposed by Mitola [2]. A CR should be able to intelligently monitor and adapt to the surrounding environment so that it can share a frequency band with the licensed primary user (PU) when the band is not occupied by the PU [3]. In the last few years, several research works have applied CR to spectrum sharing and to the coexistence of secondary users (SUs) with PUs in relation to wireless standards: the IEEE 802.22 Wireless Regional Access Network in TV white spaces (TVWS), IEEE 802.11af for wireless local area network service in TVWS, IEEE 802.19, IEEE 1900.x, and the European Telecommunications Standards Institute’s Reconfigurable Radio Systems for coexistence of license-exemption systems. Besides these, CR systems are considered in complex problems involving heterogeneous systems, such as the coexistence of a femtocell that can be installed indiscriminately within a macrocell, device-to-device (D2D) coexistence within the licensed band, and the coexistence of Wi-Fi and Long Term Evolution in unlicensed spectrum (LTE-U) in the Industrial, Scientific and Medical (ISM) band [4, 5].
CR users are basically required to have spectrum sensing to gain access to the spectrum without interfering with the primary networks. Therefore, various efforts have been made to improve the accuracy of sensing, such as matched filter detection, energy detection, feature detection, and cooperative detection [6]. Since the radio frequency front end cannot distinguish between the PU and SU signals, sensing and data transmission must be separated [7]. Although feature detection can identify the modulation types of PU signal, the processing time is considerably longer, and higher computational complexity is required [8]. There is a tradeoff between the interference with PUs and throughput of the SU in the mechanism that separates the sensing and data transmissions. For interference avoidance, the observation time (i.e., sensing time) should be long enough to ensure the accuracy of PU detection. However, a longer observation time reduces the transmission time of the SU, thereby reducing the data throughput of the CR system. In this regard, Liang et al. constructed an optimization function using the ratio of sensing time and transmission time, and detection probability in the SU’s frame, and showed that the maximum data throughput of the optimization function can be found by concavity [9]. It is also proposed to specify the operation region for the bounded false alarm assuming the characteristic of the PU activity and to find the optimal sensing time and period in the operation region. Lee and Akyildiz estimated the detection probability, false alarm probability, expected interference ratio, and lost spectrum opportunity by assuming the PDF (probability density function) of the busy/idle times of the PU, and calculated the guaranteed operation interval related to the SU’s sensing time and the data frame time through the constraints calculated [10]. 
In addition to the conventional optimization of the tradeoff between sensing time and period, efforts have been made to further reduce the interference to the PU. Choi and Yoo defined an unprotected primary user transmission (UPT) as a PU transmission that the SU cannot detect during the SU’s transmission interval [11, 12]. In addition to the existing time-constrained detection probability constraints, the UPT constraint is used to optimize the sensing schedule. In multi-input multi-output (MIMO)-based CR systems, which increase the channel capacity, spectrum sensing that exploits the characteristics of the MIMO system has also been proposed. To resolve the tradeoff between sensing and transmission time, Lee and Cho proposed a system in which a MIMO-based SU can perform sensing and data transmission simultaneously using zero forcing (ZF) [13]. Moghimi et al. proposed a system that divides the receiving streams of a MIMO-based SU and operates the data reception stream as a tradeoff between sensing and receiving, while the rest is dedicated to sensing [14].
In wireless communication networks composed of various communication systems, interference is unavoidable. To address this problem, multiple access methods such as time division multiple access, frequency division multiple access, code division multiple access, and space division multiple access are used to make the signals orthogonal to each other in the time, frequency, code, and spatial domains. These methods avoid interference by dividing the resources in each domain, but they cannot fully use the capacity that the channel can provide. In this context, interference alignment (IA) has recently attracted attention as a technique capable of eliminating interference between multiple links and maximizing transmission capacity. In an existing communication system, since each user pair cannot know information about other users in the network, the optimal strategy is to maximize its own transmission rate. Thus, the sum of data rates in the network grows only to the same order as that of a single communication link. With IA, however, the sum rate increases linearly with the number of users at high signal-to-noise ratio (SNR). IA arranges all interference in a common subspace of the total received signal space at a receiver equipped with a multi-antenna system. It separates the interference space from the desired signal space so that multiple transceivers can operate at the same time and/or the same frequency. Applying IA to CR has recently been studied because it can eliminate interference with the PU and remove mutual interference between the SUs, which increases the transmission capacity of the SUs [15].
IA is combined with the CR system to divide the signal space and the interference space so that interference between PU and SU signals can be avoided. Amir et al. analyzed the degree of freedom (DoF) available in IA-based PU and SU networks and maximized the transmission rate of the PU through water-filling [16]. Zhou et al. proposed optimizing the precoding matrix and power allocation to increase the transmission rate of SUs in a network of a single PU and multiple SUs [17]. Men et al. proposed an algorithm that guarantees the transmission performance of the PU using a partial IA algorithm [18]. IA-based CR can also be applied to applications that consider interference between heterogeneous systems. Chatzinotas and Ottersten applied IA to small cells to mitigate interference at macrocell base stations [19], and Huang et al. investigated a joint opportunistic interference avoidance scheme using the interweave paradigm-based Gale-Shapley spectral-sharing scheme to mitigate interference between a macrocell network and a femtocell network [20]. Sharma et al. proposed coordinated and uncoordinated approaches in a system consisting of monobeam and multibeam satellites as well as macrocell and small-cell systems [21]. On the other hand, methods that access the information of the PU to perform IA have been proposed. Chen et al. proposed a system that helps the PU transfer data while the SU uses a DoF that is not used by the PU [22], and Guler and Yener collected all channel information and proposed an interference technique using successive semidefinite programming (SDP) relaxation [23]. Perlaza et al. allow the SU to transmit signals without interfering with the PU through the remaining eigenmodes that are not used by the PU [24]. Hasani-Baferani et al. enabled the SU to perform IA by providing a femtocell with eigenmode information of the PU in a macrocell and femtocell system [25]. Zhao et al. optimized the sum rate of SUs, limiting it to the transmission rate threshold of the PU so that the sum rate of the entire network is optimized according to PU requirements [26].
Previous research on optimizing the sensing and data transmission times cannot fundamentally prevent interference to the PU because the spectrum cannot be sensed continuously. Moreover, assuming the operating characteristics of the PU is not reasonable, because either the PU cannot be observed continuously or, even if the spectrum can be observed continuously for a while, the SUs cannot transmit during spectrum sensing. Meanwhile, even if sensing is performed continuously through a MIMO-based CR system, secondary systems still need to optimize the sensing and transmission times, and this optimization is very difficult to derive in closed form, especially in dynamic wireless environments with time-varying channel characteristics and primary activities. In IA-based CR research, if the PU cannot be included in the IA process, or the IA process cannot be performed by providing PU information to the CR, the conventional IA-based CR system cannot achieve the desired interference alignment performance. Therefore, in this paper, we design a designated sensor in an IA-based SU system that uses a conventional IA algorithm to confine the transmission signals of SUs to the interference space and to sense the PU through the remaining signal space. Since the sensing and data transmission roles are separated, continuous sensing is possible, and the tradeoff between sensing time and transmission time disappears. The problem of not detecting the PU during data transmission also disappears. However, since the sensing role is limited to a specific SU, another problem arises: the sensing result must be transmitted to the other SUs. In this paper, we propose a dynamically adjustable sensing report interval control mechanism using reinforcement learning.
There is a tradeoff regarding the sensing interval (that is, the interval at which the sensing result is transmitted). If it is too long, the interference time to the PU increases because the transmission time of the SUs increases; conversely, if it is too short, the transmission performance of the SUs decreases. In this regard, this paper proposes an algorithm that dynamically determines the sensing time and the reporting interval of the sensing result by using Q-learning, a reinforcement learning technique, to keep interference at a target level and to enhance the SU’s transmission opportunities. The interference ratio and the transmission opportunity loss ratio are defined as the criteria for selecting the sensing time and the reporting interval according to the performance of the system. From the busy and idle times, which are statistical characteristics of the PU, the interference ratio and the transmission opportunity loss ratio can be estimated precisely from the sensing results alone, without any knowledge of the operating characteristics of the PU. First, to minimize the interference to the PU, the sensing is set to satisfy the required detection probability preferentially so that the interference to the PU is limited at the sensing step. In the reward design of Q-learning, the target interference ratio is used so that the interference stays within the desired range. The target transmission opportunity loss ratio is also used to secure the transmission opportunity of the SU. Based on the designed reward, Q-learning dynamically selects the sensing time and reporting interval so that the system operates within the selected interference ratio and loss ratio range.
Since Q-learning is a model-free reinforcement learning technique, it is an attractive method for spectrum sensing in a time-varying environment. Liwang et al. used Q-learning to minimize mutual interference between SUs according to the sensing order [27]. Jan et al. assigned a subband for the spectrum sensing order to obtain high throughput while minimizing the number of subbands that the SU should sense [28]. Das et al. cooperatively calculated the reward for the idle and busy states of the channel and evaluated the priority of the channel [29]. Yang et al. considered hardware reconfiguration energy consumption and time delays when selecting a subband for wideband sensing [30]. Oliver et al. solved the throughput optimization of the sensing and transmission time with Q-learning [31]. In this paper, the activity of the PU can be observed continuously, so the interference ratio and the transmission loss ratio can be calculated. Therefore, we can use Q-learning to select the appropriate sensing time and reporting interval to meet the primary protection and secondary throughput requirements in a dynamic environment.
In this paper, we describe the proposed system model in Section 2 and examine the feasibility of the proposed IA-based sensing system through DoF analysis. Section 3 presents the real-time estimation method for the interference ratio and transmission opportunity loss ratio, which are used as states in Q-learning. Section 4 compares the proposed system with conventional schemes and shows that the proposed method provides stable and desired operation in dynamic wireless environments. Finally, Section 5 concludes the paper.
Proposed IA-based sensing structure for continuous spectrum sensing
System model
A typical IA configuration in a cognitive radio system is shown in Fig. 1 [18]. There are K secondary transmitter-receiver pairs, one SU with a MIMO-CR interface, and one primary transmitter. All the SUs share the transmission resource with the PU at the same time. Each transmitter and receiver has M and N antennas, respectively. H^{[ij]} ∈ ℂ^{N × M} represents the channel between the jth transmitter and the ith receiver, where i, j ∈ {0, 1, …, K} and the 0th user represents the PU. All the elements of H^{[ij]} are independent and identically distributed (i.i.d.) and follow the complex Gaussian distribution with zero mean and unit variance \( \mathcal{CN}\left(0,1\right) \). Then, the received signal at the ith receiver is expressed as:
where x^{[i]} expresses the transmitted symbols of user i and z^{[i]} ∈ ℂ^{N × 1} represents the circularly symmetric additive white Gaussian noise vector with \( \mathcal{CN}\left(0,{\sigma}^2{\mathbf{I}}_N\right) \), in which σ^{2} is noise variance and I_{ N } is an identity matrix.
In an IA system consisting of pairs of MIMO-based transceivers, the transmitter controls the precoding matrix so that the transmitted signal is confined to the interference space at an undesired receiver, and the receiver controls the decoding matrix to remove undesired received signals and to recover the desired signal. These IA design conditions can be represented as follows:
where d^{[i]} is the desired number of streams of user i. V^{[j]} ∈ ℂ^{M × d} and U^{[i]} ∈ ℂ^{N × d} denote the precoding matrix of the jth user and the decoding matrix of the ith user, respectively. \( {\mathbf{U}}^{{\left[i\right]}^{\ast }} \) is the conjugate transpose of U^{[i]}.
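Written with the symbols defined above, the received-signal model and the IA design conditions take the following standard form in the IA literature. This is a sketch under the assumption that the system follows that standard form; the summation range over users 0 to K and the per-user stream count d^{[i]} are assumptions made for illustration.

```latex
\begin{align}
\mathbf{y}^{[i]} &= \sum_{j=0}^{K} \mathbf{H}^{[ij]} \mathbf{x}^{[j]} + \mathbf{z}^{[i]}, \\
\mathbf{U}^{[i]\ast} \mathbf{H}^{[ij]} \mathbf{V}^{[j]} &= \mathbf{0}, \quad \forall j \neq i, \\
\operatorname{rank}\!\left( \mathbf{U}^{[i]\ast} \mathbf{H}^{[ii]} \mathbf{V}^{[i]} \right) &= d^{[i]}.
\end{align}
```

The second line states that every interfering signal falls in the null space of the decoder at receiver i, and the third line states that the desired link still carries d^{[i]} independent streams after decoding.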
We can represent the received signal recovered by decoding matrix and adjusted by precoding matrix from the IA design conditions as follows.
where s^{[i]} is the transmission signal of the ith transmitter. By the IA conditions, the interference term of (4) is eliminated.
In this paper, we designate one of the SUs as a sensor node only responsible for spectrum sensing, as shown in Fig. 2, in order to eliminate dependence on the PU information in the CR system for IA.
The designated sensor node must sense the primary signal continuously, so it cannot transmit or receive data during the sensing process. It may also consume more energy than other secondary nodes. Therefore, a new sensor node needs to be selected after a given time. In wireless ad hoc networks and wireless sensor networks, there have been similar studies on selecting the cluster head (CH) [27, 32, 33, 34]. To select the CH, various parameters can be considered, such as the number of member nodes covered by the CH, the current residual energy level, and the history of CH nodes. In this paper, the sensor is selected by considering the residual energy and energy draining rate of every node, as in [34]: the node with the largest ratio of residual energy to energy draining rate is selected as the spectrum sensing node.
where T_{sel} is a cycle for selecting sensor node. E_{ i } and D_{ i } are the residual energy and the draining rate of energy of node i, respectively.
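The selection rule described above (pick the node with the largest ratio of residual energy E_i to draining rate D_i) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `Node` type, the `select_sensor` name, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical container for one secondary node's energy state."""
    node_id: int
    residual_energy: float  # E_i, e.g. in joules
    drain_rate: float       # D_i, e.g. in joules per second; must be positive


def select_sensor(nodes):
    """Return the node with the largest E_i / D_i ratio."""
    if not nodes:
        raise ValueError("at least one candidate node is required")
    for node in nodes:
        if node.drain_rate <= 0:
            raise ValueError("drain_rate must be positive")
    return max(nodes, key=lambda n: n.residual_energy / n.drain_rate)


# Illustrative values only: the ratios are 25, 40, and 30.
nodes = [Node(1, 50.0, 2.0), Node(2, 40.0, 1.0), Node(3, 90.0, 3.0)]
print(select_sensor(nodes).node_id)
```

In a running system this selection would be repeated once per selection cycle T_{sel}, with E_i and D_i refreshed from each node's latest report.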
The use of the designated sensor node can reduce the sensing overhead of other secondary nodes, but it may degrade sensing accuracy in some wireless scenarios. When a cooperative sensing method is used in a conventional CR system, combining each SU’s sensing result can increase the sensing accuracy and detect hidden primary transmitters. Even though the sensor node in the proposed method is the only node that senses the primary signal, it performs consecutive spectrum sensing as a sequential chain of fixed sensing times, in which the fixed sensing time is determined to satisfy the required minimum primary detection probability. The proposed method can therefore combine multiple sensing results in the time domain, which compensates for the lack of physical cooperative sensing.
As with the usual IA scheme, the sensor performs the IA process with the other SUs to confine their signals to the interference space and remove them through the decoding matrix. The remaining signal space can be used to sense the PU. Because the sensing role is dedicated to a particular sensor, the other SUs do not need to participate in spectrum sensing and can transmit data without spending time on sensing.
To satisfy the required primary detection probability, a fixed sensing time slot (t_{ s }) is determined as in (27). The sensor node performs spectrum sensing in every consecutive sensing time slot. In this paper, we propose two mechanisms by which the sensor node notifies all CR SUs of the sensing result.

i. Periodic notification (default mode): the sensor node broadcasts a sensing report, which includes primary detection information, at every predetermined sensing reporting interval. When SUs receive the primary detection notification, they should not transmit data until the next report broadcasting time. The sensing reporting interval is represented as t_{ r }.

ii. Notification using a dedicated control channel transceiver: every secondary node, including the sensor node, has dual transceivers, one for data transmission (or spectrum sensing) and the other for control signal exchange. When the sensor node detects the primary signal, it transmits a detection notification signal on the dedicated narrowband channel, and the other SUs cease their data transmission. When the data channel returns to idle, the sensor sends a channel idle notification, and the SUs can use the data channel again.
Spectrum sensing and primary detection in the conventional CR system and in the proposed IA-based spectrum sensing system are compared in Fig. 3. Figure 3a represents the primary system activity as busy and idle periods over time. Figure 3b shows the conventional CR system, which uses a fixed spectrum sensing time and interval. The conventional CR system can sense the primary signal only during the short sensing time, so if the primary appears during the secondary data transmission time (i.e., between two consecutive sensing times), the secondary system causes harmful interference to the primary system. As shown in Fig. 3c, the designated sensor node can sense the primary signal continuously. At time (1), the sensor node broadcasts the primary non-detection report to the SUs so that they can use the data channel with interference alignment. At time (2) and time (3), the sensor node detects the primary signal, so it sends the primary detection report at (4). As Fig. 3c shows, the SUs can cease their transmission until the primary signal is no longer detected. In the conventional CR of Fig. 3b, since SUs are not able to detect the primary signal during the short sensing time, they send data and cause strong interference to the primary user. Figure 3d shows the case in which dual transceivers are used. At time (5), when the sensor node detects the primary signal, it can immediately send the detection notification over the dedicated control channel. When the data channel returns to idle at time (6), the sensor node sends the non-detection notification to the SUs, and the SUs use the data channel again. Therefore, the secondary nodes’ data throughput is enhanced.
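The two notification mechanisms can be compared with a small slot-level sketch. It is an illustration under stated assumptions rather than the paper's protocol: time is divided into sensing slots, a periodic report blocks the next interval if any slot in the previous interval had a detection, SUs stay silent until the first report arrives, and the control-channel notification takes effect one slot after the detection. The function names are hypothetical.

```python
def periodic_schedule(detections, report_interval):
    """Per-slot SU transmit permission under periodic reporting.

    detections[k] is True if the sensor detected the PU in sensing slot k.
    report_interval is t_r expressed as a number of sensing slots.
    """
    if report_interval < 1:
        raise ValueError("report_interval must be a positive number of slots")
    transmit = []
    allowed = False  # assumption: SUs stay silent until the first report
    for slot in range(len(detections)):
        if slot > 0 and slot % report_interval == 0:
            window = detections[slot - report_interval:slot]
            # assumption: any detection in the window blocks the next interval
            allowed = not any(window)
        transmit.append(allowed)
    return transmit


def immediate_schedule(detections):
    """Per-slot permission with a dedicated control channel.

    Assumption: permission in slot k follows the sensing result of slot k - 1,
    and "no result yet" is treated as busy.
    """
    previous = [True] + list(detections[:-1])
    return [not detected for detected in previous][:len(detections)]


# PU busy in slots 2 and 3 only (illustrative trace).
trace = [False, False, True, True, False, False, False, False]
print(periodic_schedule(trace, 2))
print(immediate_schedule(trace))
```

With this trace, periodic reporting lets the SUs transmit through both busy slots and then keeps them silent for two idle slots, whereas the control-channel variant overlaps with the PU for one slot and loses one idle slot after the PU leaves.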
The sensing time t_{ s } and sensing reporting interval t_{ r } affect not only the primary protection performance but also the secondary system’s data transmission opportunity. The longer the sensing time, the lower the miss detection and false alarm probabilities. On the other hand, a shorter sensing time results in higher miss detection and false alarm probabilities, especially at low signal-to-noise ratio (SNR). A longer sensing reporting interval gives secondary users more transmission opportunities; however, it may cause more interference to primary users. In a CR wireless network, the primary activity and wireless channel conditions vary dynamically, so it is very difficult to derive the optimal sensing time and reporting interval. Therefore, in this paper, we propose a new dynamic optimal parameter control using reinforcement learning. The multi-objective function of the secondary system is given in (6); it consists of three reward functions: interference ratio reward, transmission opportunity loss ratio, and overhead for sensing. The reward functions are explained in detail in Section 3.3. The proposed method derives the optimal \( \left({t}_s^{\ast },{t}_r^{\ast}\right) \) value that maximizes the multi-objective function subject to the primary protection and secondary throughput requirements.
where f_{intf}(R_{ I }) and f_{loss}(R_{ L }) are the functions of interference and transmission opportunity loss ratio; f_{overhead}(t_{ s }, t_{ r }) is the function of the overhead related to sensing time t_{ s } and reporting interval (integer multiple of t_{ s }); \( {t}_s^{\ast },\kern0.5em {t}_r^{\ast } \) are the optimal spectrum sensing time and reporting interval, respectively; \( {P}_d^{\mathrm{th}} \) is the required primary detection probability P_{ d }; \( {R}_I^{\mathrm{th}} \) and \( {R}_L^{\mathrm{th}} \) are the tolerable interference ratio and secondary transmission opportunity loss ratio, respectively.
The main novel features of the proposed system architecture are as follows:

1. The dedicated sensor is responsible for the sensing function and can perform spectrum sensing through the IA process while SUs transmit, so that the operation of the PU can be observed continuously.

2. We specify the target detection probability so that the detection probability is satisfied first, and operate in the range that satisfies the interference ratio and the secondary transmission opportunity loss ratio.

3. We use Q-learning to determine the sensing time and reporting interval dynamically and design a suitable reward function.
Interference alignment and degree of freedom in the proposed system
A minimum DoF must be ensured for each transceiver pair to communicate using the IA process. We derive the DoF condition that the proposed system can obtain. In addition, this section provides a theoretical basis for the sensor node to perform sensing while the SUs are transmitting. Suppose there is a MIMO-CR interference network with K SUs, one sensor, and one PU, as shown in Fig. 4. The SUs’ IA network is assumed to be symmetric (i.e., every SU has the same number of transmit antennas and the same number of receive antennas). The 0^{th} SU is the sensor, and each transmitter and receiver of the SUs has M and N antennas, respectively. Then, the received signals at the sensor and the ith SU receiver are as shown in (7) and (8):
where x^{[p]} and x^{[i]} are the transmission symbols of the PU and SU i, and z^{[0]}, z^{[i]} are circularly symmetric additive white Gaussian noise vectors with \( \mathcal{CN}\left(0,{\sigma}^2{\mathbf{I}}_N\right) \). H^{[ij]} ∈ ℂ^{N × M} represents the channel between the jth transmitter and the ith receiver, where i, j ∈ {0, 1, …, K}, K represents the number of SUs, and the index p represents the PU. All elements of H^{[ij]} are i.i.d. and follow \( \mathcal{CN}\left(0,1\right) \). We assume a quasi-static channel, i.e., the channel realization remains fixed throughout the duration of transmission.
In order to eliminate interference in a sensor and each SU, we use a decoding matrix and the received signal with d data streams of the ith user is recovered as follows:
where s^{[p]}, s^{[j]} are transmission signals of the PU and the jth SU. V^{[p]} and V^{[j]} are the precoding matrix of the PU and jth SU, and U^{[0]}, U^{[i]} ∈ ℂ^{N × d} are the decoding matrix of the sensor and the ith user. To completely remove interference from the SUs to the sensor or between the SUs, V^{[j]}, U^{[0]}, and U^{[i]} must satisfy the following conditions:
where d^{[i]} is the desired number of streams of user i.
Equations (11) and (12) state that the interference in the received signal dimensions of the sensor and the SU receivers should be zero. Equations (13) and (14) represent the number of signal streams that each SU transceiver pair and the sensor node can acquire. From these constraints, the DoF condition is expressed by (15):
Proof. See the Appendix.
Therefore, in a network satisfying the DoF condition from (11), the sensor node can remove the interference caused by the signals of the other SUs and sense the PU signal.
To fulfill the requirements in (11) and (12), the iterative IA algorithms in [35, 36] can be adopted with some modifications. The sensor should minimize the total leakage interference that remains after canceling the interference by decoding. The other SUs can obtain their precoding and decoding matrices by the maximum SINR algorithm, taking into account the total leakage interference at the sensor. By fixing all V^{[i]}, we can solve for U^{[j]} as
where ν_{max}(∙) denotes the dominant eigenvector when the eigenvalues are real.
Conversely, by fixing all U^{[j]}, we can solve for V^{[i]} as
where Q_{0} is the interference covariance matrix at the sensor.
The interference covariance matrix at the sensor is
The decoder minimizing the total leakage interference at the sensor is
where ν_{min}(∙) is the least dominant eigenvector.
To summarize the process, the transmitters choose the initial precoders randomly, and the receivers choose the decoders that maximize the SINR. The sensor node calculates the interference covariance matrix and chooses its decoding matrix. The transmitters then choose the precoders that maximize the SINR while accounting for the total interference leakage at the sensor node. The receivers then update their decoding matrices, and this sequence is repeated until convergence.
Energy detection with/without interference alignment
In the proposed system, when the PU is detected, the SUs stop transmitting and the sensor performs general MIMO-based spectrum sensing. If the sensor determines that the PU is idle, the SUs transmit through IA and the sensor performs IA-based spectrum sensing, which allows sensing while the SUs communicate. Therefore, in the proposed system, MIMO-based or IA-based spectrum sensing is selected according to the detection result of the PU signal. Since the parameters used to set the thresholds for energy detection depend on the choice of sensing method, this section covers both MIMO-based sensing and IA-based sensing.
If an SU does not transmit because the PU state is determined as busy state, the hypothesis from the received signal y_{ i } of the sensor is expressed in (20):
where H_{0} represents the hypothesis corresponding to “no signal transmitted,” and H_{1} represents “signal transmitted.” s(n) is the signal waveform, and z_{ i }(n) is zero-mean additive white Gaussian noise (AWGN). The PU signal is assumed to be phase shift keying (PSK) modulated. The channel coefficient h_{ i } follows \( \mathcal{CN}\left(0,{\sigma}_h^2\right) \), and z_{ i } follows \( \mathcal{CN}\left(0,{\sigma}_n^2\right) \); \( {\sigma}_h^2 \) and \( {\sigma}_n^2 \) are the variances of the channel gain and the Gaussian noise. N is the number of receiving antennas.
The test statistic for the energy detector is given by
where n_{ s } is the number of spectrum sensing samples.
We assumed the test statistic follows a Gaussian distribution under the central limit theorem. Therefore, each pdf of (21) under H_{0} and H_{1} is given by
where \( {\mu}_{0,\mathrm{non\text{-}IA}}=N{n}_s{\sigma}_n^2 \), \( {\sigma}_{0,\mathrm{non\text{-}IA}}^2=N{n}_s{\sigma}_n^4 \), \( {\mu}_{1,\mathrm{non\text{-}IA}}=N{n}_s\left(P{\sigma}_h^2{\lambda}_m+{\sigma}_n^2\right) \), \( {\sigma}_{1,\mathrm{non\text{-}IA}}^2=N{n}_s{\left(P{\sigma}_h^2{\lambda}_m+{\sigma}_n^2\right)}^2 \). λ_{ m } is an eigenvalue of the correlation matrix, and P is the transmission power of the PU.
The false alarm and detection probabilities for the non-IA case are given by
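Under the Gaussian approximation stated above, the two probabilities are tail probabilities of a normal distribution evaluated at the decision threshold. The sketch below works this out with the moments given in the text. It is a hedged illustration, not a transcription of the paper's equations: the function names are hypothetical, `signal_var` stands for the product P σ_h^2 λ_m, and the numerical values are chosen only for the example.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def q_inverse(p):
    """Inverse of the Q-function for 0 < p < 1."""
    return _STANDARD_NORMAL.inv_cdf(1.0 - p)


def energy_statistic(samples):
    """Energy summed over antennas and samples; samples[i][n] is complex."""
    return sum(abs(y) ** 2 for antenna in samples for y in antenna)


def moments(n_antennas, n_samples, noise_var, signal_var=0.0):
    """Mean and variance of the test statistic under the Gaussian approximation.

    signal_var = 0 gives the H0 moments; signal_var > 0 gives the H1 moments.
    """
    total = n_antennas * n_samples
    power = signal_var + noise_var
    return total * power, total * power ** 2


def threshold_for_pf(pf_target, n_antennas, n_samples, noise_var):
    """Decision threshold that yields the target false alarm probability."""
    mu0, var0 = moments(n_antennas, n_samples, noise_var)
    return mu0 + math.sqrt(var0) * q_inverse(pf_target)


def false_alarm_probability(threshold, n_antennas, n_samples, noise_var):
    mu0, var0 = moments(n_antennas, n_samples, noise_var)
    return q_function((threshold - mu0) / math.sqrt(var0))


def detection_probability(threshold, n_antennas, n_samples, noise_var, signal_var):
    mu1, var1 = moments(n_antennas, n_samples, noise_var, signal_var)
    return q_function((threshold - mu1) / math.sqrt(var1))


# Illustrative setting: N = 2 antennas, n_s = 500 samples, SNR of -10 dB.
gamma = threshold_for_pf(0.1, 2, 500, 1.0)
print(false_alarm_probability(gamma, 2, 500, 1.0))
print(detection_probability(gamma, 2, 500, 1.0, 0.1))
```

Fixing the threshold from the target false alarm probability and then reading off the detection probability is one common design choice; the threshold could equally be fixed from the detection target.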
When an SU transmits because the PU state is determined to be idle, the hypothesis for the received signal of the spectrum sensor, obtained through the decoding matrix, is expressed in (24):
where s_{ j }(n) is the signal waveform from the jth antenna of the PU, and \( {\overset{\sim }{z}}_i(n) \) is AWGN. \( {\overset{\sim }{\mathbf{G}}}^{\left[ ij\right]} \) is the compound channel gain between the PU transmitter and the sensor, and M_{ p } is the number of the PU’s transmit antennas. We assume that the gain does not change over multiple CR frames and can be estimated blindly while the PU is known to be present. Each statistical pdf of (24) is given by
where \( {\mu}_{0,\mathrm{IA}}=N{n}_s{\sigma}_n^2 \), \( {\sigma}_{0,\mathrm{IA}}^2=N{n}_s{\sigma}_n^4 \), \( {\mu}_{1,\mathrm{IA}}=N{n}_s\left(P{\overset{\sim }{g}}_m^2{\sigma}_h^2{\lambda}_m+{\sigma}_n^2\right) \), \( {\sigma}_{1,\mathrm{IA}}^2=N{n}_s{\left(P{\overset{\sim }{g}}_m^2{\sigma}_h^2{\lambda}_m+{\sigma}_n^2\right)}^2 \). \( {\overset{\sim }{g}}_m^2 \) is the sum of \( {\overset{\sim }{\mathbf{G}}}^{\left[ ij\right]} \) on j indexes.
False alarm and detection probability for the IA process case are given by
For a given pair of target probabilities \( \left({P}_d^{\mathrm{th}},{P}_f^{\mathrm{th}}\right) \), the number of required samples can be determined by
where ζ is the SNR, and \( {\overset{\sim }{g}}_m^2=1 \) in the non-IA case.
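The Gaussian model above can be turned into a small numerical sketch. The Python below is illustrative only: it assumes the threshold is set from a target false alarm probability, takes `snr` to mean \( P{\sigma}_h^2{\lambda}_m/{\sigma}_n^2 \) on a linear scale, and derives the sample count directly from the means and variances stated in the text, so its closed form may be arranged differently from the expression used in the paper. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for Q(x) = 1 - CDF(x)


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 1.0 - _STD.cdf(x)


def q_inv(p):
    """Inverse of Q: the x such that Q(x) = p, for 0 < p < 1."""
    return _STD.inv_cdf(1.0 - p)


def detection_probability(n_s, n_ant, snr, pf_target, g2=1.0):
    """P_d of the energy detector when the threshold is set for pf_target.

    n_s is the number of sensing samples, n_ant the number of receive
    antennas, and g2 the compound gain (g2 = 1 for the non-IA case).
    """
    zeta = g2 * snr
    # (threshold - mu_1) / sigma_1 after substituting the threshold
    # mu_0 + sigma_0 * Q^{-1}(pf_target) and dividing by sigma_n^2.
    x = (q_inv(pf_target) - sqrt(n_ant * n_s) * zeta) / (1.0 + zeta)
    return q_func(x)


def required_samples(n_ant, snr, pd_target, pf_target, g2=1.0):
    """Smallest n_s meeting both targets (assumes pd_target > 0.5 > pf_target)."""
    zeta = g2 * snr
    root = (q_inv(pf_target) - (1.0 + zeta) * q_inv(pd_target)) / zeta
    return ceil(root * root / n_ant)
```

Calling `required_samples` with `g2` below 1 returns a larger sample count than the non-IA case at the same SNR, which reflects the weaker effective PU signal seen by the sensor during IA-based sensing.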
Control of interference and transmission opportunity loss through Q-learning
Probabilistic estimation of interference ratio and transmission opportunity loss ratio
In the proposed system, the sensing time and the reporting interval must be chosen to protect the PU while guaranteeing the SU transmission rate. To choose these sensing parameters in a way that balances the loss to the PU against the gain of the CR system, we must be able to predict both the interference with the PU and the transmission opportunity of the CR user that result from the selected parameters. Most conventional algorithms for optimizing the sensing time and sensing period assume that the statistics of PU activity (busy and idle) are known. In the proposed system, we estimate the required probabilistic parameters without any prior knowledge of PU operation, using the sensor node’s observations under the IA process. Interference with the PU is predicted by the interference ratio, which is the estimated fraction of the PU’s busy interval that is disturbed by transmissions from CR users. The transmission performance of the CR users is estimated by the transmission opportunity loss ratio, which is the fraction of the PU’s idle interval (the interval in which transmission is possible) that the SU fails to detect and therefore does not use.
Figure 5 illustrates the interference ratio. Figure 5a shows the operation of the PU, and Fig. 5c shows the operation of the SU as directed by the sensor node in Fig. 5b. As shown in Fig. 5b, the sensor node performs IA-based sensing until the second sensing period, detects the PU in the second sensing period, and switches to non-IA-based sensing. It then confirms in the fifth sensing period that the PU is idle and switches back to IA-based sensing. Because the sensor node, following the k-out-of-N rule, instructs the SU to stop transmitting only after the second sensing period, the part of the SU transmission that overlaps the PU transmission during the second sensing interval interferes with the PU.
The interference ratio is estimated by the following equation:
where t_{ s } denotes the sensing time, D_{0} indicates that the PU is not detected, and D_{1} indicates that the PU is detected. H_{0} and H_{1} are the hypotheses that the PU is absent and present, respectively.
The interference ratio is the ratio of the SU transmission time that overlaps PU transmission to the total PU transmission time within the measurement window. In (28), the numerator is the interference probability and the denominator is the probability that the PU is present. To estimate the interference probability, we compute, for each sensing result (D_{0}, D_{1}), the conditional probability that the PU is busy during the intervals in which the SU operates (SU_{on}) because the PU was judged idle in the previous sensing time, and sum the results as shown in (29). In these intervals, the sensor node performs IA-based sensing. To estimate the probability of the PU busy state (H_{1}) over the whole measurement window, we also need the intervals in which the SU is idle (SU_{off}) and the sensor node performs non-IA-based sensing; the corresponding sum of conditional probabilities over the sensing results (D_{0}, D_{1}) is given in (30). The probability that the PU is present in the measurement window therefore appears in the denominator of (28) as the sum of (29) and (30).
Figure 6 illustrates the transmission opportunity loss ratio. The PU becomes idle in the second sensing interval, but under the k-out-of-N rule the sensor node lets the SUs operate only from the third sensing interval. The part of the second sensing period in which the SUs remain disabled is therefore a lost transmission opportunity.
The transmission opportunity loss ratio of the SU is expressed in (31):
The transmission opportunity loss ratio is the ratio of the time in which the SU does not transmit although the PU is idle, because the sensor node fails to detect the idle state, to the total time in which the PU is idle within the measurement window. In (31), the numerator is the probability that the PU is idle while the SU leaves the vacant time unused, and the denominator is the probability that the PU is idle in the measurement window. To estimate the numerator, we compute, for each sensing result (D_{0}, D_{1}), the conditional probability that the PU is idle during the idle interval of the SU (SU_{off}), and sum the results as shown in (32). To estimate the probability that the PU is idle, we sum the conditional probabilities of the PU idle state (H_{0}) for each sensing result (D_{0}, D_{1}) over both the intervals in which the SU operates and IA-based sensing is performed and the intervals in which the SU is idle and non-IA-based sensing is performed. This estimate appears in the denominator of (31) as the sum of (32) and (33).
Each conditional probability from (28) to (33) can be expressed by Bayes’ rule as follows.
where each conditional probability applies to either non-IA or IA-based sensing. P_{on} and P_{off} represent P(H_{1}) and P(H_{0}), respectively. They can be estimated from the measured P(D_{0}) and P(D_{1}), as follows:
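For reference, the two ratios can also be measured directly when the true PU activity is known, as in a simulation. The following Python sketch computes these "actual" values from per-slot flags, following the verbal definitions of the two ratios. The slot-based representation and the function name are assumptions made for illustration and are not part of the probabilistic estimator described in the text.

```python
def empirical_ratios(pu_busy, su_tx):
    """Return (interference ratio, transmission opportunity loss ratio).

    pu_busy[k] is True when the PU is busy in slot k; su_tx[k] is True when
    the SU transmits in slot k. Both sequences cover one measurement window.
    """
    if len(pu_busy) != len(su_tx):
        raise ValueError("sequences must have the same length")
    busy = sum(1 for b in pu_busy if b)
    idle = len(pu_busy) - busy
    # Slots in which the SU transmits while the PU is busy (interference).
    overlap = sum(1 for b, t in zip(pu_busy, su_tx) if b and t)
    # Slots in which the PU is idle but the SU stays silent (lost opportunity).
    unused = sum(1 for b, t in zip(pu_busy, su_tx) if not b and not t)
    r_i = overlap / busy if busy else 0.0
    r_l = unused / idle if idle else 0.0
    return r_i, r_l
```

With a window of eight slots in which the PU is busy for three and the SU overlaps one of them, the first returned value is 1/3; the second is the share of idle slots that the SU left unused.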
Through the simultaneous equations, P(H_{0}) and P(H_{1}) are calculated as follows.
As a result, we obtain P(D_{0}) and P(D_{1}) from the sensing results in the chosen window, obtain P(H_{0}) and P(H_{1}) from these simultaneous equations, and then estimate each conditional probability. These conditional probabilities are in turn used to track the state of the PU and the CR system through the interference ratio and the transmission opportunity loss ratio for the selected sensing time and reporting interval. In the next section, we propose a system that operates dynamically through Q-learning. The system uses the sensing time and reporting interval as its action, and the response to an action is the state, represented by the interference ratio and the transmission opportunity loss ratio.
Dynamic sensing parameter control using Q-learning
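The estimation chain described here (measured decision frequencies, then priors, then posteriors) can be sketched in a few lines of Python. The sketch assumes a single known pair (P_d, P_f) for the whole window, whereas the text distinguishes IA-based and non-IA-based sensing intervals; it uses the law of total probability P(D_1) = P_d P(H_1) + P_f P(H_0) together with P(H_0) + P(H_1) = 1. The function names and the clipping step are illustrative assumptions.

```python
def estimate_priors(p_d1, pd, pf):
    """Solve P(D1) = Pd*P(H1) + Pf*P(H0) with P(H0) + P(H1) = 1.

    Returns (P_on, P_off), i.e. the estimates of P(H1) and P(H0).
    """
    if pd <= pf:
        raise ValueError("requires Pd > Pf")
    p_on = (p_d1 - pf) / (pd - pf)
    p_on = min(1.0, max(0.0, p_on))  # clip measurement noise into [0, 1]
    return p_on, 1.0 - p_on


def posteriors(p_on, pd, pf):
    """Bayes' rule: P(H | D) for every hypothesis/decision pair."""
    p_off = 1.0 - p_on
    p_d1 = pd * p_on + pf * p_off
    p_d0 = 1.0 - p_d1
    return {
        ("H1", "D1"): pd * p_on / p_d1 if p_d1 else 0.0,
        ("H0", "D1"): pf * p_off / p_d1 if p_d1 else 0.0,
        ("H1", "D0"): (1.0 - pd) * p_on / p_d0 if p_d0 else 0.0,
        ("H0", "D0"): (1.0 - pf) * p_off / p_d0 if p_d0 else 0.0,
    }
```

The entry `("H1", "D0")` is the probability that the PU is busy even though the sensor reported it idle, which is the quantity that drives the interference estimate.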
Q-learning is an off-policy reinforcement learning technique based on a Markov decision process. It adapts to the environment it experiences and allows the system to operate dynamically toward the desired objective [37]. The entity that observes and learns the surrounding environment is called the agent. After choosing an action, the agent obtains a response from the environment and identifies which of the defined states it is in. The general mechanism of Q-learning is shown in Fig. 7, where s_{ t }, r_{ t }, and \( {a}_t^i \) represent the state, the reward, and the ith action at time t, respectively. The agent recognizes the state s_{ t } and selects the action \( {a}_t^{\ast } \) that gives the maximum reward r_{ t } among the selectable actions. Q-learning proceeds by repeating this series of steps.
As shown in Fig. 8, an agent performs decision-making and learning in the environment of the current state. The agent takes an action, evaluates the effect of the environment on that action, obtains the reward, and observes the resulting state change; this sequence is then repeated. In the proposed method, the agent is the sensor, and its action is the choice of sensing time and reporting interval. When the sensor takes a specific action in an environment where the PU alternates between busy and idle, it can estimate R_{ I } and R_{ L } from the sensing results. The sensor then obtains the reward from the gain or loss function of R_{ I } and R_{ L } and the overhead function of the action. It recognizes the state change, given by the observed R_{ I } and R_{ L }, and takes its next action accordingly. When the PU is busy, the SUs do not transmit, so the sensor performs basic MIMO-based sensing; when the PU is idle, the sensor performs IA-based sensing because the SUs transmit using IA.
In Q-learning, the Q-table is a database in which the agent records the rewards and state changes obtained from the actions it selects in a given environment. The Q-table stores this information for (state, action) pairs. When the system starts to operate in a given environment, the Q-table contains no information. As the process of selecting an action, obtaining the reward, and changing state is repeated, the Q-table accumulates information on how to maximize the designed reward in that environment. A Q-table consists of rows representing the states and columns representing the actions. The action set is the product set of sensing times and reporting intervals, which can be expressed as \( \mathcal{A}=\left\{{a}_1,{a}_2,\dots, {a}_t,\dots, {a}_M\right\} \), where a_{ t } is \( \left\{{a}_t^s,{a}_t^r\right\} \) and t is the serial number of the action; \( {a}_t^s \) is the sensing time, and \( {a}_t^r \) is the multiple of the sensing time that expresses the reporting interval. The state set is the product set of R_{ I } and R_{ L }, which can be expressed as \( \mathcal{S}=\left\{{s}_1,{s}_2,\dots, {s}_t,\dots, {s}_N\right\} \), where s_{ t } is \( \left\{{s}_t^I,{s}_t^L\right\} \) and t is the serial number of the state; \( {s}_t^I \) is the interference ratio, and \( {s}_t^L \) is the loss ratio. Each action and state is quantized into a finite number of steps because a Q-learning implementation requires the input and environment to be modeled as a finite-state system.
The Q-value is updated according to (42):
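The structure of the action set, the state set, and the Q-table can be sketched as follows. The Python below uses the quantization levels reported for the simulation in Section 4 (four sensing sample counts, three reporting multiples, four R_I levels, and three R_L levels); the zero initialization, the row ordering, and all identifier names are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import product

SENSING_SAMPLES = (200, 600, 1000, 1400)   # a^s: sensing time in samples
REPORT_MULTIPLES = (3, 6, 9)               # a^r: reporting interval multiple
ACTIONS = list(product(SENSING_SAMPLES, REPORT_MULTIPLES))

R_I_EDGES = (0.07, 0.1, 0.2)   # boundaries of the four interference levels
R_L_EDGES = (0.1, 0.2)         # boundaries of the three loss-ratio levels
N_STATES = (len(R_I_EDGES) + 1) * (len(R_L_EDGES) + 1)


def state_index(r_i, r_l):
    """Map a measured (R_I, R_L) pair to a row index of the Q-table."""
    i_level = bisect_right(R_I_EDGES, r_i)
    l_level = bisect_right(R_L_EDGES, r_l)
    return i_level * (len(R_L_EDGES) + 1) + l_level


def new_q_table():
    """Rows are states, columns are actions; all entries start at zero."""
    return [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
```

With these levels the table has 12 rows and 12 columns, small enough that every (state, action) pair can be visited many times within a simulation run.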
where α ∈ [0, 1] is the learning rate. If α is high, the system gives more weight to present and future experience. If α is low, the stored Q-values carry more weight and the system takes longer to learn the environment. On the right side of (42), \( {r}_t+\gamma \underset{a_{t+1}}{\max}\mathcal{Q}\left({s}_{t+1},{a}_{t+1}\right) \) is the actual value of the action plus the future value. Q-learning finds the Q-value by iteratively approximating the Q-function, using the difference between the predicted value and the actual value as the estimation error [38]. γ ∈ [0, 1] is the discount factor. If γ is high, the system gives more weight to the Q-value of the new state reached by the action than to the immediate reward. If γ is low, the immediate reward dominates and the update is influenced more by the current action.
If the agent always chooses the action with the maximum Q-value, it can become trapped in a local optimum. Therefore, we select actions with the ε-greedy policy, as follows:
where ε ∈ [0, 1] is the probability of choosing a random action. When a uniformly drawn random number is less than ε, the action is selected at random; otherwise, the action with the highest Q-value is chosen. Table 1 shows the sequence of steps for Q-learning in the proposed algorithm.
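As a minimal sketch, and assuming that (42) takes the standard Q-learning form in which the stored value is moved toward \( {r}_t+\gamma \underset{a_{t+1}}{\max}\mathcal{Q}\left({s}_{t+1},{a}_{t+1}\right) \) by a fraction α of the estimation error, one update step can be written in Python as below. The default values α = 0.5 and γ = 0.5 are those reported for the simulation; the list-of-lists table layout and the function name are illustrative assumptions.

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.5, gamma=0.5):
    """Apply one Q-learning update to q_table[s][a] in place and return it.

    q_table is a list of rows (states), each a list of per-action values.
    """
    # Actual value of the action plus the discounted best future value.
    target = reward + gamma * max(q_table[s_next])
    # Estimation error between the target and the stored prediction.
    td_error = target - q_table[s][a]
    q_table[s][a] += alpha * td_error
    return q_table[s][a]
```

With α = 1 the stored value is replaced by the target outright, and with α = 0 the table never changes, which matches the roles of high and low α described above.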
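The ε-greedy rule described above can be sketched in Python as follows. The simulation section states only that ε starts at 0.3 and has a lower limit of 0.1; the multiplicative decay schedule, its rate, and the tie-breaking rule used here are hypothetical choices made for illustration.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a column of q_row: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return q_row.index(best)   # first maximiser; ties go to the lowest index


def decay_epsilon(epsilon, rate=0.99, floor=0.1):
    """Hypothetical multiplicative decay that never drops below floor."""
    return max(floor, epsilon * rate)
```

Keeping ε at or above the floor means the agent keeps trying non-greedy actions occasionally, so the table can still be revised after the environment changes.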
Reward function design
In this section, we describe the reward function of the proposed Q-learning-based algorithm that dynamically selects the sensing time and reporting interval for the sensor. Because the required detection probability is fixed in order to protect the PU first, the false alarm probability depends on the sensing time. The false alarm probability strongly affects how well the SU captures transmission opportunities. A long reporting interval ensures continuity of SU transmission and saves sensor power because sensing results are sent less often, but it can increase interference with the PU when the PU appears suddenly. Therefore, in Q-learning, the action is defined as a combination of sensing time and reporting interval, and the state is defined as a combination of the interference ratio and loss ratio, which change as a result of the action. The reward is designed as shown in the flow chart of Fig. 9. Note that because the wireless environment can change dynamically and we assume no prior environmental statistics, the proposed learning-based mechanism may have difficulty meeting the constraints on the interference ratio threshold and the secondary transmission opportunity loss ratio at every time instance. We have therefore relaxed the constraints in (6) so that they apply to the long-term average rather than to every time instance. From this point of view, the proposed reward function, built from the multiple objectives in (6), dynamically controls the sensing and reporting parameters so that the constraints are satisfied on average.
The reward takes one of two forms. Because the interference ratio R_{ I } has priority, (44) is used when R_{ I } is smaller than its threshold (\( {R}_I<{R}_I^{\mathrm{th}} \)), and (45, 46) is used otherwise.
where φ, ρ_{1}, ρ_{2}, and L are constants. ρ_{1} and ρ_{2} should be selected carefully so that the reward is reduced or increased appropriately depending on whether R_{ I } and R_{ L } exceed their thresholds; δ_{1} and δ_{2} are the signs of the first and second terms; \( {R}_I^{\mathrm{th}} \) and \( {R}_L^{\mathrm{th}} \) are the thresholds of the interference ratio and loss ratio; N_{ s } is the number of sensing samples, and N_{ r } is the multiple of the sensing time that represents the reporting interval. These are obtained by sampling the sensing and reporting intervals at twice the sensing frequency. The first, second, and third terms of (44) correspond to f_{intf}(R_{ I }), f_{loss}(R_{ L }), and f_{overhead}(t_{ s }, t_{ r }) of (6), respectively.
The first term is positive (δ_{1} = + 1) because R_{ I } satisfies the condition (\( {R}_I<{R}_I^{\mathrm{th}} \)), and the second term is positive (δ_{2} = + 1) or negative (δ_{2} = − 1) according to whether R_{ L } satisfies its condition (\( {R}_L<{R}_L^{\mathrm{th}} \)). ρ_{1} is greater than ρ_{2} so that R_{ I } takes priority over R_{ L }. The exponential function is used in the reward terms for R_{ I } and R_{ L } so that the Q-value changes more sharply when the interference ratio and loss ratio become significantly worse. The third term reflects that the overhead on the system increases as the response time (the product of the sensing length and the multiple for the response length) becomes shorter than the average response time.
If R_{ I } does not satisfy the condition (\( {R}_I<{R}_I^{\mathrm{th}} \)), we do not consider the system overhead in order to focus on the interference ratio satisfaction as follows:
where ϕ, ω_{1}, and ω_{2} are constants. \( {R}_I^{\mathrm{new}} \) and \( {R}_L^{\mathrm{new}} \) are the present values of the interference ratio and loss ratio, and \( {R}_I^{\mathrm{old}} \) and \( {R}_L^{\mathrm{old}} \) are their previous values. Because a single action may not correct the interference ratio and loss ratio immediately, we favor the action that shows an improving tendency when the interference ratio does not satisfy the constraint. The first and second terms of (45, 46) correspond to f_{intf}(R_{ I }) and f_{loss}(R_{ L }), respectively. The constants should be selected carefully so that the reward is increased or reduced appropriately depending on whether the tendency of the interference ratio and loss ratio is good.
The expression in (45, 46) is further divided according to whether the previous value of R_{ I } in the measurement window (\( {R}_I^{\mathrm{old}} \)) is larger than the current value (\( {R}_I^{\mathrm{new}} \)). The first term of (45, 46) is positive (δ_{1} = + 1) when R_{ I } improves (\( {R}_I^{\mathrm{old}}>{R}_I^{\mathrm{new}} \)) and negative (δ_{1} = − 1) when R_{ I } deteriorates (\( {R}_I^{\mathrm{old}}\le {R}_I^{\mathrm{new}} \)). Similarly, for R_{ L } the second term is positive (δ_{2} = + 1) when R_{ L } is lower than the threshold or has improved (\( {R}_L\ge {R}_L^{\mathrm{th}},\kern0.5em {R}_L^{\mathrm{old}}>{R}_L^{\mathrm{new}} \)), and negative (δ_{2} = − 1) when R_{ L } has worsened (\( {R}_L\ge {R}_L^{\mathrm{th}},\kern0.5em {R}_L^{\mathrm{old}}\le {R}_L^{\mathrm{new}} \)). The results in Section 4 show that the algorithm designed in this way selects the sensing time and reporting interval dynamically according to the surrounding environment.
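The branching and sign rules of the reward can be summarized in a short Python sketch. It reproduces only the choice between the two reward forms and the signs δ_{1} and δ_{2} as described in the text; the magnitudes of the terms depend on the constants and the exponential expressions of (44) to (46), which are deliberately not modeled here. The function name and the branch labels are hypothetical.

```python
def reward_signs(r_i_new, r_l_new, r_i_old, r_l_old, r_i_th, r_l_th):
    """Return (branch, delta1, delta2) following the sign rules in the text."""
    if r_i_new < r_i_th:
        # Interference constraint met: form (44), overhead term is active.
        delta1 = +1
        delta2 = +1 if r_l_new < r_l_th else -1
        return "constraint_met", delta1, delta2
    # Constraint violated: forms (45, 46) judge the trend between windows.
    delta1 = +1 if r_i_old > r_i_new else -1
    loss_ok = r_l_new < r_l_th or r_l_old > r_l_new
    delta2 = +1 if loss_ok else -1
    return "constraint_violated", delta1, delta2
```

For example, a window in which R_I is still above its threshold but lower than in the previous window yields δ_{1} = +1, so the action that produced the improvement is still encouraged.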
Simulation results
In this section, we first show that P_{on}, P_{off}, the interference ratio, and the transmission opportunity loss ratio can be estimated accurately. Second, we compare the interference ratio and transmission opportunity loss ratio obtained with Q-learning against those obtained with a fixed sensing time and reporting interval at each SNR. We also show the interference ratio and transmission opportunity loss ratio of Q-learning as the SNR changes over time. Finally, we show the advantage of continuous sensing.
The simulation was performed using MATLAB. The PU alternates between busy and idle states whose durations follow an exponential distribution; more details can be found in [39]. For MIMO-based sensing, the antenna correlation is 0.5, following [40]. The parameters of PU activity and spectrum sensing are shown in Table 2. Table 3 lists the experimental parameters for (44), (45), and (46) in Section 3.3 and the target values for the interference ratio and loss ratio. The parameters used in Q-learning are shown in Table 4. The reward parameters were selected experimentally. We assume that the compound channel gain is 0.9.
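The PU activity model can be illustrated with a short sketch. The paper's simulation was written in MATLAB; the Python below is only an illustrative equivalent that draws alternating busy and idle durations from exponential distributions. The example means of 2.5 ms each correspond to the setting used for Fig. 10 (a 5 ms sum of average busy and idle times with an ON ratio of 0.5); the function names and the trace format are assumptions.

```python
import random


def pu_activity(n_cycles, mean_on, mean_off, seed=None):
    """Return a list of (state, duration) pairs alternating busy and idle.

    Durations are exponentially distributed; random.expovariate takes the
    rate, i.e. the reciprocal of the mean duration.
    """
    rng = random.Random(seed)
    trace = []
    for _ in range(n_cycles):
        trace.append(("busy", rng.expovariate(1.0 / mean_on)))
        trace.append(("idle", rng.expovariate(1.0 / mean_off)))
    return trace


def on_ratio(trace):
    """Fraction of the total time during which the PU is busy."""
    busy = sum(d for s, d in trace if s == "busy")
    total = sum(d for _, d in trace)
    return busy / total


# Example: 2.5 ms average busy time and 2.5 ms average idle time.
example_trace = pu_activity(1000, 2.5e-3, 2.5e-3, seed=1)
```

Over a long trace the value returned by `on_ratio` approaches `mean_on / (mean_on + mean_off)`, which gives a simple check that the generated activity matches the intended ON ratio.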
As shown in Section 3.1, we can estimate P_{on} and P_{off} without the assumptions about PU statistics made in [10, 39], and we can then estimate R_{ I } and R_{ L } using (28) and (31). Figure 10 shows the results of P_{on} estimation for each measurement time. We set SNR = − 5 dB, a sensing bandwidth of 1.5 MHz, a sum of the average busy and idle times of the PU of 5 ms, and an ON-time ratio of 0.5. The average estimation error is only 0.0264, 0.0265, and 0.0264 for (a), (b), and (c) in Fig. 10, respectively, so it is nearly independent of the measurement time. A short measurement interval makes it possible to follow the dynamically fluctuating ON state of the PU, whereas the variation of the estimate decreases as the measurement interval becomes longer.
Figure 11 shows R_{ I } and R_{ L } estimated using the P_{on} estimate with a measurement window of about 0.5 s. As in Fig. 10, the estimated R_{ I } and R_{ L } are close to the actual values, with average estimation errors of 0.0395 and 0.0427, respectively. This result supports the reliability of evaluating the effect of each action, because the interference ratio and the transmission opportunity loss ratio, which are the responses to the actions selected by Q-learning, can be estimated with small error.
In the simulation, we evaluated the proposed method against fixed actions (i.e., a fixed sensing time and reporting interval). The actions of Q-learning are combinations of the sensing time, expressed as the number of sensing samples (ss = 200, 600, 1000, 1400), and the reporting interval, expressed as a multiple of the sensing time (sr = 3, 6, 9). For the state, we divide R_{ I } into four levels (0 to 0.07, 0.07 to 0.1, 0.1 to 0.2, and above 0.2) and R_{ L } into three levels (0 to 0.1, 0.1 to 0.2, and above 0.2). The states are the combinations of the R_{ I } and R_{ L } levels. We set the Q-learning parameters to α = 0.5 and γ = 0.5, and the random action probability to 0.1 ≤ ε ≤ 0.3 (starting from 0.3, with a lower limit of 0.1) using ε-greedy exploration. The lower limit of ε is needed so that the Q-table can adapt flexibly when the environment changes. The fixed cases are the combinations of the number of sensing samples (ss = 200, 600, 1000, and 1400) and the reporting interval as a multiple of the sensing time (sr = 3 and 9).
Figures 12, 13, and 14 show boxplots of R_{ I } and R_{ L } for the fixed cases and Q-learning at SNRs of − 6, −9, and −12 dB, respectively. If the response time is short, as in ss200sr3 (ss = 200, sr = 3, reply length = 200×3) and ss200sr9 (ss = 200, sr = 9), the false alarm probability is high because the sensing time is insufficient. The SU system then abandons transmission, so the interference is very low, but the loss of transmission opportunity increases at low SNR (−9 and −12 dB). For ss600sr3 and ss600sr9, both the interference ratio and the loss ratio are good down to −9 dB, but the transmission loss increases at −12 dB for the same reason. Both ss1000sr9 and ss1400sr9 have high interference and loss ratios because the reporting interval itself is long. For ss1400sr3, the interference ratio and loss ratio are stable at all SNRs. However, its reward is lower than that of Q-learning, because Q-learning selects actions dynamically in every environment and therefore performs better in terms of system load (transmission power loss, possibility of continuous transmission). In terms of reward, Q-learning has a higher mean and a lower variance than the fixed cases at all SNRs. These results show that in most fixed cases the loss of transmission opportunity increases in order to minimize interference with the PU, and that in certain cases both the interference with the PU and the loss of the CR system are unsatisfactory. In contrast, Q-learning dynamically tracks the frequently changing busy/idle state of the PU at a fixed SNR and keeps the interference ratio and the transmission opportunity loss ratio within the selected range. The higher reward and lower variance relative to the fixed cases also indicate that the overhead of the sensor is considerably reduced.
Figures 15 and 16 show that Q-learning operates adaptively as the SNR changes. Over the whole region, the interference ratio stays around the threshold. The loss ratio occasionally jumps at −12 dB but returns to around the threshold. Q-learning can therefore select the sensing time and the reporting interval dynamically even when the SNR varies, so that the system operates within the desired range of interference ratio and transmission opportunity loss ratio.
Figure 17 shows the actual interference ratio as the average primary ON time E[on] changes. The proposed method is compared with the general sensing time optimization method for throughput maximization [9], for which the data frame length is 17.5 ms (52,500 samples). For the proposed method, we implemented two primary detection notification models (default periodic reporting and dual transceiver). Because the conventional method cannot detect the PU during the data transmission period, its interference ratio is always larger than that of the proposed method. The smaller the average primary ON time (i.e., the shorter the primary system activation time), the more often the conventional method fails to detect the primary signal and the more harmful the interference, because the primary may appear only between consecutive sensing times. Of the two proposed notification models, the dual transceiver method performs slightly better than periodic reporting because it can stop secondary data transmission immediately using the dedicated narrowband control channel.
Figure 18 shows the primary appearance detection ratio for different average primary ON times E[on]. The primary appearance detection ratio indicates how accurately the secondary system detects the appearance of the primary each time the primary turns on. In the conventional method, for shorter E[on], the primary activation time can be smaller than the sensing interval, so the secondary system cannot sense the primary appearance; the primary can be detected only when its activation time overlaps the secondary sensing time. The proposed methods always show a very high (> 0.98) primary appearance detection ratio.
Conclusions
In this paper, we proposed an algorithm that uses Q-learning to dynamically select the sensing time and reporting interval in an IA-based CR network so that it adapts to the surrounding environment. Unlike the conventional IA-based CR system, the proposed system does not depend on PU information. Instead, it monitors the PU continuously by designating a sensor dedicated to sensing. The remaining SUs, however, need to receive the sensing results from the sensor periodically, so the optimization problem becomes how often the sensing results should be received. We formulate this as the choice of the sensing time and its multiple for the reporting interval, and solve it using Q-learning, a typical reinforcement learning algorithm. We defined the action of Q-learning as the product set of sensing times and their multiples, and the state as the product set of the interference ratio and the transmission opportunity loss ratio. We proposed a method to predict the interference ratio and loss ratio without any assumptions about the operation of the PU and confirmed that the predictions are close to the actual interference ratio and loss ratio. We designed the reward to account for the interference ratio, the loss ratio, and the load on the system, and compared the proposed method with the fixed cases at each SNR through simulation. We also confirmed that the system operates dynamically and stably as the SNR changes. Finally, we confirmed the benefit that the proposal gains from sensing the channel continuously.
Methods/experimental
The purpose of this study is to minimize interference with the primary user and to keep monitoring primary users by sensing continuously, without alternating between sensing and data transmission. For this purpose, the IA-based cognitive radio system performs spectrum sensing by assigning one secondary user as a sensor. In this paper, we propose a precoding and decoding method for this system, and a sensing method for each of the two cases in which the secondary users do or do not transmit data, depending on the presence of the primary user. Because spectrum sensing is limited to the sensor node, it is necessary to determine the sensing time and the period at which the other secondary users receive the sensing result from the sensor. These are selected by Q-learning. Q-learning is a representative learning algorithm that allows the agent to identify its own state from the surrounding information and to take an appropriate action. In the proposed system, the state of the agent (CH) is defined as the combination of the interference ratio of the PU and the transmission opportunity loss ratio of the SU. We propose a method to calculate the interference ratio and the transmission opportunity loss ratio using the statistical characteristics of the PU obtained by continuous sensing. The action of the agent is the combination of a sensing time and a period for reporting the sensing result. When the interference ratio and the transmission opportunity loss ratio are determined for the selected action (sensing time and reporting period), the resulting time-dependent relationship is stored in the Q-table. Q-learning uses the information stored in the Q-table to select the most appropriate action for a given state at each time.
The experimental results in this paper were obtained using MATLAB R2015b on an Intel® Core i7 3.4 GHz system. The exponential random function that generates the PU activity over time and the Q-table matrix for Q-learning can be implemented with appropriate MATLAB code.
Abbreviations
AWGN: Additive white Gaussian noise
CR: Cognitive radio
D2D: Device-to-device
DoF: Degree of freedom
IA: Interference alignment
ISM: Industrial, Scientific and Medical
LTE-U: Long Term Evolution in unlicensed spectrum
MIMO: Multi-input multi-output
PDF: Probability density function
PSK: Phase shift keying
PU: Primary user
SDP: Semidefinite programming
SU: Secondary user
TVWS: TV white spaces
UPT: Unprotected primary user transmission
ZF: Zero forcing
References
MR Kelley, The spectrum auction: big money and lots of unanswered questions. IEEE Internet Comput. 12(1), 66–70 (2008)
J Mitola III, Cognitive radio. Licentiate thesis, KTH Royal Inst. of Technol., Stockholm, Sweden (1999)
S Haykin, Cognitive radio: brain-empowered wireless communications. IEEE Journal on Selected Areas in Communications 23(2), 201–220 (2005)
YY Liu, SJ Yoo, Dynamic resource allocation using reinforcement learning for LTE-U and WiFi in the unlicensed spectrum. IEEE ICUFN, 471–475 (2017)
E Almeida, AM Cavalcante, RCD Paiva, et al., Enabling LTE/WiFi coexistence by LTE blank subframe allocation. IEEE ICC, 5083–5088 (2013)
D Cabric, SM Mishra, RW Brodersen, Implementation issues in spectrum sensing for cognitive radios. IEEE, Pacific Grove, 772–776 (2004)
N Sai Shankar, C Cordeiro, K Challapali, Spectrum agile radios: utilization and sensing architectures. IEEE DySPAN, 160–169 (2005)
Y Hur et al., A cognitive radio (CR) system employing a dual-stage spectrum sensing technique: a multi-resolution spectrum sensing (MRSS) and a temporal signature detection (TSD) technique. IEEE Globecom, 1–5 (2006)
YC Liang, Y Zeng, ECY Peh, AT Hoang, Sensing-throughput tradeoff for cognitive radio networks. IEEE Trans. on Wireless Commun. 7(4), 1326–1337 (2008)
WY Lee, IF Akyildiz, Optimal spectrum sensing framework for cognitive radio networks. IEEE Trans. on Wireless Commun. 7(10), 3845–3857 (2008)
JK Choi, SJ Yoo, Undetectable primary user transmissions in cognitive radio networks. IEEE Commun. Letters 17(2), 277–280 (2013)
JK Choi, SJ Yoo, Optimal sensing interval considering per-primary transmission protection in cognitive radio networks. Wireless Personal Commun. 78, 1891–1903 (2014)
W Lee, DH Cho, Enhanced spectrum sensing scheme in cognitive radio systems with MIMO antennae. IEEE Trans. on Veh. Tech. 60(3), 1072–1085 (2011)
F Moghimi, RK Mallik, R Schober, Sensing time and power optimization in MIMO cognitive radio networks. IEEE Trans. on Wireless Commun. 11(9), 3398–3408 (2012)
X Li, N Zhao, Y Sun, FR Yu, Interference alignment based on antenna selection with imperfect channel state information in cognitive radio networks. IEEE Trans. on Veh. Tech. 65(7), 5497–5511 (2016)
M Amir, A ElKeyi, M Nafie, Constrained interference alignment and the spatial degrees of freedom of MIMO cognitive networks. IEEE Trans. on Infor. Theory 57(5), 2994–3004 (2011)
H Zhou, T Ratnarajah, YC Liang, On secondary network interference alignment in cognitive radio. IEEE DySPAN, 637–641 (2011)
H Men, N Zhao, M Jin, JM Kim, Optimal transceiver design for interference alignment based cognitive radio networks. IEEE Commun. Letters 19(8), 1442–1445 (2015)
S Chatzinotas, B Ottersten, Cognitive interference alignment between small cells and a macrocell. IEEE ICT, 1–6 (2012)
L Huang, G Zhu, X Du, Cognitive femtocell networks: an opportunistic spectrum access for future indoor wireless coverage. IEEE Wirel. Commun. 20(2), 44–51 (2013)
SK Sharma, S Chatzinotas, B Ottersten, Interference alignment for spectral coexistence of heterogeneous networks. EURASIP Journal on Wireless Commun. and Networking, 1–14 (2013)
G Chen, Z Xiang, C Xu, M Tao, On degrees of freedom of cognitive networks with user cooperation. IEEE Wireless Commun. Letters 1(6), 617–620 (2012)
B Guler, A Yener, Interference alignment for cooperative MIMO femtocell networks. IEEE, GLOBECOM, 1–5 (2011)
SM Perlaza, N Fawaz, S Lasaulce, M Debbah, From spectrum pooling to space pooling: opportunistic interference alignment in MIMO cognitive networks. IEEE Trans. on Signal Processing 58(7), 3728–3741 (2010)
M Hasani-Baferani, J Abouei, Z Zeinalpour-Yazdi, Interference alignment in overlay cognitive radio femtocell networks. IET Commun. 10(11), 1401–1410 (2016)
X Li, N Zhao, Y Sun, FR Yu, Interference alignment based on antenna selection with imperfect channel state information in cognitive radio networks. IEEE Trans. on Veh. Tech. 65(7), 5497–5511 (2016)
L Li, T Li, J Ge, L Kong, J Liu, Channel sensing order for distributed cognitive networks with multiuser and multichannel. IEEE ICCSN, 44–50 (2017)
J Oksanen, J Lunden, V Koivunen, Reinforcement learning based sensing policy optimization for energy efficient cognitive radio networks. Neurocomputing 80, 102–110 (2012)
A Das, SC Ghosh, N Das, AD Barman, Q-learning based cooperative spectrum mobility in cognitive radio networks. IEEE LCN 2017, 502–505 (2017)
Y Li, SK Jayaweera, M Bkassiny, C Ghosh, Learning-aided subband selection algorithms for spectrum sensing in wideband cognitive radios. IEEE Trans. on Wireless Commun. 13(4), 2012–2024 (2014)
O van den Biggelaar, JM Dricot, PD Doncker, F Horlin, Sensing time and power allocation for cognitive radios using distributed Q-learning. EURASIP J. Wirel. Commun. Netw. 2012, 1–40 (2012)
SH Kang, T Nguyen, Distance based thresholds for cluster head selection in wireless sensor networks. IEEE Commun. Letters 16(9), 1396–1399 (2012)
D Jia, H Zhu, S Zou, P Hu, Dynamic cluster head selection method for wireless sensor network. IEEE Sensors J. 16(8), 2746–2754 (2016)
B Gangwar, JD Bhosale, N Gangwar, An energy optimized path selection and dynamic cluster head selection for wireless mesh network. ICEI, 272–277 (2017)
O El Ayach, SW Peters, RW Heath, The feasibility of interference alignment over measured MIMO-OFDM channels. IEEE Trans. on Veh. Tech. 59(9), 4309–4321 (2010)
SW Peters, RW Heath, Cooperative algorithms for MIMO interference channels. IEEE Trans. on Veh. Tech. 60(1), 206–218 (2011)
RS Sutton, AG Barto, Reinforcement learning: an introduction (MIT Press, Cambridge, MA, 1998)
CJCH Watkins, P Dayan, Q-learning. Machine Learning 8(3–4), 279–292 (1992)
H Kim, KG Shin, Efficient discovery of spectrum opportunities with MAC-layer sensing in cognitive radio networks. IEEE Trans. on Mobile Computing 7(5), 533–545 (2008)
S Kim, J Lee, H Wang, D Hong, Sensing performance of energy detector with correlated multiple antennas. IEEE Signal Proc. Letters 16(8), 671–674 (2009)
Dataset of simulations
The simulation was performed using MATLAB on an Intel Core i7 (32-bit) machine.
The alternating busy and idle periods of the PU follow an exponential distribution. More details can be found in [39].
For MIMO-based sensing, the antenna correlation is set to 0.5, following [40].
The Q-table is constructed as defined in the paper and is updated according to the Q-table update equation in conjunction with sensing.
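The two modelling ingredients listed above, exponentially distributed PU busy/idle periods and a tabular Q-update, can be sketched as follows. This is only an illustrative Python sketch, not the authors' MATLAB code: the function names `pu_activity` and `q_update`, the mean durations, the learning rate `alpha`, the discount factor `gamma`, and the toy state/action/reward values are all assumptions made for the example, and the update shown is the standard Q-learning rule of [38] rather than the specific state, action, and reward design of the paper.

```python
import random


def pu_activity(mean_busy, mean_idle, horizon, rng):
    """Return (state, start, end) periods that alternate between 'idle' and
    'busy', each with an exponentially distributed duration (hypothetical helper)."""
    periods = []
    t = 0.0
    state = "idle"
    while t < horizon:
        mean = mean_idle if state == "idle" else mean_busy
        duration = rng.expovariate(1.0 / mean)  # rate = 1 / mean duration
        end = min(t + duration, horizon)        # clip the last period at the horizon
        periods.append((state, t, end))
        t = end
        state = "busy" if state == "idle" else "idle"
    return periods


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update on a list-of-lists Q-table."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


# Toy usage with assumed parameters (not the paper's simulation settings).
rng = random.Random(0)
trace = pu_activity(mean_busy=2.0, mean_idle=3.0, horizon=100.0, rng=rng)

q = [[0.0, 0.0] for _ in range(3)]  # 3 placeholder states, 2 placeholder actions
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=2)
```

In a fuller simulation, the generated trace would drive the sensing outcomes, and the reward passed to `q_update` would be derived from the estimated interference ratio and transmission opportunity loss ratio described in the paper.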
Funding
This work was supported by the Inha University research grant.
Author information
Contributions
Both authors contributed to the concept, the design and development of the theoretical analysis and algorithm, and the simulation results in this manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Authors’ information
Prof. Sang-Jo Yoo, PhD (corresponding author)
Sang-Jo Yoo received the B.S. degree in electronic communication engineering from Hanyang University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, in 1990 and 2000, respectively. From 1990 to 2001, he was a Member of Technical Staff with the Korea Telecom Research and Development Group, where he was involved in communication protocol conformance testing and network design. From 1994 to 1995 and from 2007 to 2008, he was a Guest Researcher with the National Institute of Standards and Technology, USA. Since 2001, he has been with Inha University, where he is currently a Professor with the Information and Communication Engineering Department. His current research interests include cognitive radio network protocols, ad hoc wireless networks, MAC and routing protocol design, wireless network QoS, and wireless sensor networks.
Mr. Sung-Jeen Jang
Sung-Jeen Jang received the B.S. degree in electrical engineering from Inha University, Incheon, Korea, in 2007, and the M.S. degree from the Graduate School of Information Technology and Telecommunication, Inha University, Incheon, Korea, in 2009. Since March 2009, he has been pursuing the Ph.D. degree at the Graduate School of Information Technology and Telecommunication, Inha University, Incheon, Korea. His current research interests include cognitive radio network protocols and machine learning applied to wireless communications.
Competing interests
I confirm that I have read Springer Open’s guidance on competing interests and none of the authors have any competing interests in the manuscript.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
According to Bezout’s theorem, N_{ e } ≤ N_{ v } must be satisfied, where N_{ e } is the total number of equations, and N_{ v } is the total number of variables. Considering the conditions in (11), (12), (13), and (14), N_{ e } and N_{ v } can be obtained as follows:
where the number of desired streams is assumed to be equal for simplicity of representation.
Then, the DoF condition is expressed by (48):
If we generalize this to P sensors, we can obtain (49) for N_{ e } and N_{ v }.
The DoF condition is expressed by (50):
Therefore, in a network environment in which (48) or (50) is satisfied, the sensor node can remove the interference caused by the signals of other SUs and sense the PU signal.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Jang, SJ., Yoo, SJ. Q-learning-based dynamic joint control of interference and transmission opportunities for cognitive radio. J Wireless Com Network 2018, 160 (2018). https://doi.org/10.1186/s1363801811559
DOI: https://doi.org/10.1186/s1363801811559
Keywords
 Cognitive radio
 Interference alignment
 Spectrum sensing
 Q-learning