Joint spectrum sensing and access for stable dynamic spectrum aggregation

Spectrum aggregation is an emerging technology for meeting the data rate requirements of broadband services in next-generation wireless communication systems. In a dynamic spectrum environment, where spectrum availability is time-varying, maintaining the stability of spectrum aggregation is challenging. In this paper, we investigate spectrum sensing and access schemes that minimize the number of channel switches in order to achieve stable dynamic spectrum aggregation, taking into account the hardware limitations on spectrum sensing and aggregation capability. We develop an analytical framework for the joint spectrum sensing and access problem based on a partially observable Markov decision process (POMDP). In particular, we derive the reward function by estimating the stability of different spectrum sensing and access strategies. Based on the POMDP framework, we propose a rollout-based suboptimal spectrum sensing and access scheme that approximates the value function of the POMDP, together with a differential training method that improves its robustness. We prove that the rollout policy improves on the performance of the base heuristics. Simulation results show that the proposed POMDP-based spectrum sensing and access scheme improves system stability significantly and achieves near-optimal performance at much lower complexity.


Introduction
Spectrum aggregation [1,2] enables the use of discrete spectrum bands or fragments to support broadband services. With spectrum aggregation, discrete spectrum bands can provide the same transmission service as a continuous spectrum band. Spectrum aggregation has recently become one of the key features of LTE-Advanced standardization. The system efficiency and fairness of spectrum aggregation are investigated in [3] and [4], and its energy efficiency is considered in [5].
The introduction of cognitive radio (CR) [6,7] increases spectrum efficiency by using the spectrum dynamically and further facilitates the application of spectrum aggregation. To exploit instantaneous spectrum opportunities in a dynamic spectrum environment, secondary users (SUs) identify available spectrum resources by spectrum sensing and then access the available channels without interfering with primary users (PUs). Dynamic spectrum aggregation (DSA) provides a feasible way to support broadband services in such an environment. With DSA, multiple available spectrum bands discovered via CR can be aggregated dynamically to fulfill the service requirement.
Several publications address spectrum sensing and access schemes in a dynamic spectrum environment. In [8], a decentralized MAC protocol is proposed for the SUs to sense spectrum opportunities, and the optimal sensing and channel selection are investigated to maximize the expected total number of bits delivered over a finite number of slots. In [9] and [10], the authors investigate the impact of sensing errors on system performance and try to alleviate their negative effects. In [11], the authors design a multi-channel MAC protocol by adopting the fusion strategy of collaborative spectrum sensing. However, these works all focus on the case in which each user uses only a single channel and do not consider spectrum aggregation. In [12], we propose a Maximum Satisfaction Algorithm (MSA) for admission control and a Least Channel Switch (LCS) strategy for DSA, but the spectrum sensing and access schemes are not considered jointly. In [13], we provide some preliminary results on a general POMDP framework for cognitive radio networks.
In this paper, we investigate joint spectrum sensing and access for DSA under several practical limitations:
• Spectrum Sensing Limitation: Because spectrum sensing capability is limited, it is not always possible to sense all the bands of a large-span spectrum. Each SU chooses only a subset of channels (i.e., a part of the spectrum) to sense. As a result, the system lacks perfect information on channel availability, which brings new technical challenges for DSA.
• Spectrum Aggregation Limitation: Because of hardware capability, only spectrum bands within a certain range can be aggregated for a single user. This aggregation range imposes an additional constraint when the SUs access the spectrum.
• Channel Switch Overhead: When an SU adjusts its access strategy and switches to other channels, the switch unavoidably incurs extra system overhead, such as rendezvous and synchronization. A spectrum sensing and access scheme designed with this overhead in mind should therefore keep the number of channel switches as small as possible.
Taking the above practical issues into consideration, we propose a decision-theoretic approach that casts the design of joint spectrum sensing and access for stable DSA in the partially observable Markov decision process (POMDP) [14] framework. To provide the reward function for the POMDP framework, the probability of a channel switch is estimated based on a Markov chain. Because the optimal solution of the POMDP is computationally very intensive due to the curse of dimensionality, i.e., the computation time and storage requirements grow exponentially with the number of channels, we further introduce an approximation technique called rollout [15] to design a suboptimal joint spectrum sensing and access scheme. A heuristic is first proposed as the base policy, which greedily chooses the spectrum sensing and access actions to reduce the number of channel switches. By rolling out the base policy, the proposed rollout algorithm approximately calculates the value function defined in the POMDP framework and reduces the number of channel switches. A theoretical analysis proves the performance improvement of the rollout policy over the heuristic. Furthermore, we propose a differential training method that reduces the sensitivity to approximation errors. The performance of the proposed scheme is evaluated by simulation, which demonstrates that the proposed policies in the POMDP framework reduce the number of channel switches significantly and that the rollout-based scheme achieves near-optimal performance compared with the optimal scheme.
The rest of this paper is organized as follows. Section 2 describes the system model and formulates the problem. Section 3 introduces the POMDP framework and the approach to estimating the access and switching probabilities. In Section 4, the rollout-based suboptimal spectrum sensing and access schemes are proposed. Section 5 provides the performance evaluation by simulation. Finally, Section 6 summarizes this paper.

Dynamic spectrum aggregation model
We consider a large-span licensed spectrum consisting of $N$ channels, each with the same bandwidth $BW$. Time is slotted, and each time slot has duration $T_p$. The availabilities of the channels, which depend on the PU activities, are modeled under the following assumption.
Assumption 1 (Channel Availability). The availabilities of the $N$ channels form a system that can be modeled as a discrete-time Markov process with $2^N$ states, where $S_n(t) \in \{0\ (\text{occupied}),\ 1\ (\text{idle})\}$ denotes the occupancy state of channel $n \in \{1, \ldots, N\}$ at time slot $t$ and is independent across channels.
Let $S(t) = [S_1(t), \ldots, S_N(t)]$ and denote by $p_{ij}$ the transition probability from system state $i$ to system state $j$, i.e., $p_{ij} = \Pr\{S(t+1) = j \mid S(t) = i\}$, which can be obtained by multiplying the transition probabilities of the individual channels:
$$p_{ij} = \prod_{n=1}^{N} \Pr\{S_n(t+1) = j_n \mid S_n(t) = i_n\},$$
where $i_n \in \{0, 1\}$ and $j_n \in \{0, 1\}$ are the $n$th elements of the system states $i$ and $j$, respectively. For simplicity of expression, we denote $P_n(t) = \Pr\{S_n(t) = 1\}$.
The SUs sense the presence of PUs and access the spectrum opportunistically in a decentralized manner. Here, we consider the spectrum sensing and access of a single SU. At the beginning of each time slot $t$, the SU chooses a set of channels $A_1(t)$ to sense.
Assumption 2 (Spectrum Sensing). Because of its limited spectrum sensing capability, the SU can sense at most $L$ channels, i.e., $|A_1(t)| \le L$. When $L < N$, the SU obtains the availability information of only a subset of channels.
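As a small illustration of the product form under Assumption 1, the following Python sketch builds one row of the system transition matrix from per-channel two-state chains. The function name, the nested-list layout of the per-channel matrices, and the numerical values are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def system_transition_prob(i, j, per_channel):
    """Return Pr{S(t+1) = j | S(t) = i} for independent two-state channels.

    i, j: tuples of 0 (occupied) / 1 (idle), one entry per channel.
    per_channel[n][a][b]: assumed probability that channel n moves from
    state a to state b in one slot (hypothetical data layout).
    """
    prob = 1.0
    for n, (a, b) in enumerate(zip(i, j)):
        prob *= per_channel[n][a][b]
    return prob


# Hypothetical per-channel transition matrices for N = 3 channels.
channels = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.6, 0.4], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]

state = (1, 0, 1)
# One row of the 2^N x 2^N system transition matrix.
row = {j: system_transition_prob(state, j, channels)
       for j in product((0, 1), repeat=len(channels))}
```

The row has $2^N$ entries and sums to one, which makes the exponential growth of the state space with $N$ concrete.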
Note that although $L$ channels can be sensed by the SU, the sensed availability states of these channels are not always accurate because of sensing errors. The SU performs a binary hypothesis test:
• $H_0$: Null hypothesis indicating that the sensed channel is available.
• $H_1$: Alternative hypothesis indicating that the sensed channel is occupied.
If the SU obtains the incorrect sensing result $H_1$ when the channel state is $H_0$, i.e., a false alarm, the SU refrains from transmitting and a spectrum opportunity is wasted. On the other hand, if the SU obtains the incorrect sensing result $H_0$ when the channel state is $H_1$, i.e., a missed detection, the SU collides with a PU. Let $P_f$ and $P_m$ denote the probabilities of false alarm and missed detection, respectively.
Based on the spectrum sensing results, the SU aggregates a set of channels $A_2(t)$ for data transmission with spectrum aggregation.

Assumption 3 (Spectrum Aggregation). Because of the spectrum aggregation limitation, the SU can only aggregate channels within an aggregation range $\Gamma$, which means that the channels in $A_2(t)$ lie within a frequency range of $\Gamma$, i.e., $D(i, j) \le \Gamma$, $\forall i, j \in A_2(t)$, where $D(i, j)$ indicates the frequency distance between channel $i$ and channel $j$. The total bandwidth of the available channels in $A_2(t)$ should satisfy the SU's bandwidth requirement, denoted by $\Upsilon$.
We illustrate $A_1(t)$ and $A_2(t)$ in a large-span spectrum in Figure 1.

Problem formulation
The SU can detect the return of PUs and use the channels unoccupied by PUs so as to avoid interference to PUs. To satisfy the bandwidth requirement $\Upsilon$ in the dynamic spectrum environment, the SU adopts different spectrum sensing and access strategies according to the number of currently available channels $R(t)$ within $A_2(t)$ in time slot $t$, which is defined as $R(t) = \sum_{n \in A_2(t)} S_n(t)$.
• If $R(t) \ge \Upsilon/BW$, the SU reselects only $A_1(t)$. The spectrum aggregation decision does not change, i.e., the SU keeps the current access set $A_2(t)$.
• If $R(t) < \Upsilon/BW$, the SU reselects both $A_1(t)$ and $A_2(t)$, which triggers a channel switch.
To reduce the system overhead and maintain the stability of dynamic spectrum aggregation, our aim is to minimize the expected number of channel switches by adjusting the spectrum sensing and access strategies $A_1(t)$ and $A_2(t)$. Denote by $\eta(t)$ the expected number of channel switches from time slot 0 to $t$. The joint spectrum sensing and access optimization problem for stable DSA is then to minimize $\eta(t)$ over $A_1(t)$ and $A_2(t)$ subject to three constraints: $|A_1(t)| \le L$; $D(i, j) \le \Gamma$, $\forall i, j \in A_2(t)$; and $R(t) \ge \Upsilon/BW$. The first two constraints express the spectrum sensing and spectrum aggregation limitations, respectively, and the last constraint guarantees that the SU's bandwidth requirement is satisfied.

A POMDP framework for dynamic spectrum aggregation
In this section, we propose a decision-theoretic framework for DSA based on POMDP [14]. In particular, we convert the minimization of the number of channel switches into a new objective, namely maximizing the time interval between channel switches, and provide an approach to estimating this interval as the reward of the POMDP model, which is challenging in a dynamic spectrum environment. If the SU were able to sense the whole spectrum accurately, all the elements of $S(t)$ could be obtained and the optimization problem would be a standard Markov decision process (MDP), since the channel availability state $S(t)$ is a discrete-time Markov process. In practice, however, because of the limited spectrum sensing ability and the existence of sensing errors, the SU can only obtain imperfect occupancy states for a part of the channels, which means that $L < N$ and $S(t)$ is partially and inaccurately observable. As a result, we need to cast the optimization problem in the POMDP framework, a generalization of the MDP in which the state of the system is only partially observed by the decision maker.

POMDP framework
Before discussing the POMDP framework, we introduce the concept of a control interval, which consists of a number of consecutive time slots and is delimited by channel switches. The length of a control interval is uncertain, since it depends on how long the currently aggregated channels keep satisfying the SU's bandwidth requirement. The joint spectrum sensing and access scheme is designed on the POMDP framework with this control interval structure, for which the framework in [8] is no longer suitable.
Let $T$ denote the number of control intervals within the whole time horizon (a finite number of time slots), and let the index $m$ indicate the $m$th last control interval. Denote by $t_s(m)$ the time slot in which the $m$th channel switch is triggered. If the current control interval includes $\kappa$ time slots, the state transition probability at the next control interval is the $\kappa$-step transition probability $p_{ij}^{(\kappa)}$, i.e., the $(i, j)$ entry of the $\kappa$th power of the one-slot transition matrix $[p_{ij}]$. We now define the key components of the POMDP framework for DSA. For simplicity of expression, the sets and states below are indexed by the control interval $m$.
Action. The actions of the SU have two stages: determining $A_1(m)$ to sense and $A_2(m)$ to access. Define the SU action for the $m$th last control interval as $a(m) = \{C_1, \ldots, C_L, C_{\text{start}}\}$, where $C_i$ is the index of the $i$th sensed channel, $\forall i \in \{1, \ldots, L\}$, and $C_{\text{start}}$ is the starting index of the accessed channel set $A_2(m)$. Define $A$ as the set of all possible actions, i.e., $a(m) \in A$.
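Observation. By sensing the channels in $A_1(m)$, the SU obtains their occupancy states, possibly with errors. Let $\Theta_{i,A_1(m)}$ denote the sensing results given the current system state $i$ and the sensed channel set $A_1(m)$. The observation in the $m$th last control interval is expressed as $\theta(m) = [\theta_{C_1}(m), \ldots, \theta_{C_L}(m)]$. Although $\theta_{C_j}(m)$ may differ from the actual channel availability state $S_{C_j}(m)$ because of sensing errors, the two are correlated, and $\theta_{C_j}(m) \in \{0, 1\}$ provides useful information for estimating $S_{C_j}(m)$. Specifically, the conditional probabilities of the channel states can be calculated by Bayes rule [16]. If $\theta_{C_j}(m) = 0$ (sensed occupied), the observation has likelihood $P_f$ when the channel is idle and $1 - P_m$ when it is occupied. Thus, we have
$$\Pr\{S_{C_j}(m) = 1 \mid \theta_{C_j}(m) = 0\} = \frac{P_f\, P_{C_j}(m)}{P_f\, P_{C_j}(m) + (1 - P_m)\,(1 - P_{C_j}(m))}.$$
Similarly, if $\theta_{C_j}(m) = 1$ (sensed idle), we obtain
$$\Pr\{S_{C_j}(m) = 1 \mid \theta_{C_j}(m) = 1\} = \frac{(1 - P_f)\, P_{C_j}(m)}{(1 - P_f)\, P_{C_j}(m) + P_m\,(1 - P_{C_j}(m))}.$$
Belief Vector. In the optimization problem (4), the channel availability state $S(m)$ is partially and inaccurately observed, which means that the internal state of the system cannot be obtained exactly. Consequently, we introduce an important metric $\Lambda(m)$, called the belief vector, to represent the SU's estimate of the system state based on the past decisions and observations: $\Lambda(m) = [\lambda_1(m), \ldots, \lambda_{2^N}(m)]$, where $\lambda_i(m)$ is the conditional probability that the system is in state $i$ given the decision and observation history. The belief vector is updated according to the action and the observation in the last control interval,
$$\Lambda(m-1) = \mathcal{T}(\Lambda(m) \mid a(m), \theta(m)), \qquad (13)$$
where $\mathcal{T}(\cdot \mid a(m), \theta(m))$ indicates the update operator. The updated belief vector is calculated by Bayes rule from the state transition probabilities and the observation likelihood $\Pr\{\Theta_{i,A_1(m)} = \theta\}$, and therefore also depends on the length of the control interval under consideration; $\Pr\{\Theta_{i,A_1(m)} = \theta\}$ can be obtained through the information provided by Equations (10) and (11). It has been proved in [14] that the belief vector $\Lambda(m)$ is a sufficient statistic for determining the optimal actions for future control intervals.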
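The per-channel Bayes correction described under Observation can be sketched in a few lines of Python. This is a minimal sketch under the sensing-error model stated earlier (false alarm: an idle channel is sensed as occupied; missed detection: an occupied channel is sensed as idle); the function name and argument names are hypothetical.

```python
def posterior_idle(prior_idle, sensed_idle, p_f, p_m):
    """Return Pr{S = 1 | sensing result} for a single sensed channel.

    prior_idle:  prior probability that the channel is idle, Pr{S = 1}.
    sensed_idle: True if the sensing result is 1 (idle), False if 0.
    p_f: false-alarm probability,  Pr{result = 0 | S = 1}.
    p_m: missed-detection probability, Pr{result = 1 | S = 0}.
    """
    if sensed_idle:
        like_idle, like_busy = 1.0 - p_f, p_m
    else:
        like_idle, like_busy = p_f, 1.0 - p_m
    numerator = like_idle * prior_idle
    evidence = numerator + like_busy * (1.0 - prior_idle)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return numerator / evidence


# With a flat prior and 10% error rates, the posterior follows the sensing result.
after_idle = posterior_idle(0.5, True, 0.1, 0.1)
after_busy = posterior_idle(0.5, False, 0.1, 0.1)
```

With perfect sensing (`p_f = p_m = 0`) the posterior collapses to 0 or 1, which corresponds to the fully observed MDP case mentioned at the start of this section.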
Policy. Denote the policy vector by $\pi = [\mu_1, \mu_2, \ldots, \mu_T]$, where $\mu_m$ is a mapping from a belief vector $\Lambda(m)$ to an action $a(m)$ for each control interval. The set of all possible policies is denoted by $\Pi$, i.e., $\pi \in \Pi$.
A policy is said to be stationary if the mapping $\mu_m$ depends only on the belief vector $\Lambda(m)$ and is independent of the number of remaining control intervals $m$. Denote the set of stationary policies by $\Pi_s$; it is usual to restrict the set of policies to $\Pi_s$ in a POMDP. In our framework, a spectrum sensing and access scheme is essentially a policy of this POMDP.
Reward. To quantify the SU's objective, we define the reward of a control interval as its length, i.e., the number of time slots it includes, denoted by $U(m)$. We now show that minimizing the number of channel switches is equivalent to maximizing the average reward. For a given total number of time slots $t$, we have $t = \sum_{m=1}^{T} U(m)$, so the average reward per control interval is $t/T$, and the number of channel switches grows with the number of control intervals $T$. It then follows that
$$\arg\min_{\pi} \eta(t) = \arg\max_{\pi} E\left[\frac{1}{T} \sum_{m=1}^{T} U(m)\right],$$
which means that over the finite time horizon, the longer the control intervals are, the fewer channel switches are expected in total, and our objective can be converted into maximizing the average reward.
For control interval $m$, a set of accessed channels $A_2(m)$ is determined according to the belief vector $\Lambda(m)$. To evaluate the reward of $A_2(m)$, we first define the access probability and the switching probability as follows.
Definition 1 (Access Probability). The access probability $\zeta$ is the probability that the bandwidth of the available channels in $A_2(m)$ is no less than the required bandwidth $\Upsilon$, i.e., $\zeta = \Pr\{R(t) \ge \Upsilon/BW\}$.
Definition 2 (Switching Probability). The switching probability $\xi$ is the probability that the bandwidth of the available channels in $A_2(m)$ at the next time slot is less than the required bandwidth $\Upsilon$, i.e., $\xi = \Pr\{R(t+1) < \Upsilon/BW\}$.
Both the access probability $\zeta$ and the switching probability $\xi$ can be calculated from the sensing and access action $a$, as discussed in detail in the next subsection. We omit $a$ from the notation of both probabilities for simplicity of expression.
The reward $U(m)$ is a function of the sensing and access action $a$ when the system state is $j$ in the $m$th last control interval. It is a discrete random variable taking values $\kappa \in \mathbb{Z}^{+}$, whose probability mass function $p(\kappa)$ is derived from the access probability $\zeta$ and the switching probability $\xi$. Using Equations (14) and (20), we can update the belief vector. In summary, our goal in the POMDP framework is to find the optimal policy $\pi^{*}$ that maximizes the average reward. The POMDP framework for joint spectrum sensing and access for DSA is illustrated in Figure 2, which also indicates the Markovian dynamics of the system.
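To make the reward concrete, the length of a control interval for a fixed access set can be estimated by simulating the channel states slot by slot until the bandwidth requirement is violated. The sketch below is only an illustration: it assumes that all channels in the access set share the same two transition probabilities, and the function names, parameters, and values are hypothetical.

```python
import random


def interval_length(start_states, p_stay_idle, p_become_idle, k_req, rng,
                    max_slots=100_000):
    """Number of consecutive slots in which at least k_req channels are idle.

    start_states:  list of 0/1 states of the channels in the access set.
    p_stay_idle:   assumed Pr{idle -> idle} per slot (same for all channels).
    p_become_idle: assumed Pr{occupied -> idle} per slot.
    k_req:         required number of idle channels (requirement / BW).
    """
    states = list(start_states)
    length = 0
    while sum(states) >= k_req and length < max_slots:
        length += 1
        # Advance every channel by one slot of its two-state Markov chain.
        states = [
            1 if rng.random() < (p_stay_idle if s == 1 else p_become_idle) else 0
            for s in states
        ]
    return length


def mean_interval_length(start_states, p_stay_idle, p_become_idle, k_req,
                         n_runs, seed=0):
    """Monte Carlo estimate of the expected reward of one control interval."""
    rng = random.Random(seed)
    total = sum(interval_length(start_states, p_stay_idle, p_become_idle,
                                k_req, rng) for _ in range(n_runs))
    return total / n_runs


stable = mean_interval_length([1, 1, 1, 1], 0.95, 0.3, 2, n_runs=2000)
volatile = mean_interval_length([1, 1, 1, 1], 0.60, 0.3, 2, n_runs=2000)
```

Access sets whose channels tend to stay idle yield longer intervals and hence fewer switches over a fixed horizon, which is the intuition behind using the interval length as the reward.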

Estimation of access probability and switching probability
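To obtain the reward function of the POMDP, the access probability $\zeta$ and the switching probability $\xi$ need to be estimated. For the channels in the access set $A_2(m)$, we have the state vector $[S_{C_{\text{start}}}, S_{C_{\text{start}}+1}, \ldots, S_{C_{\text{start}}+M-1}]$, where $M$ is the number of channels in $A_2(m)$. The states in this vector can be treated as independent random variables with mean $\mu_i = P_i$ and variance $\sigma_i^2 = P_i - P_i^2$ for $i \in \{C_{\text{start}}, C_{\text{start}}+1, \ldots, C_{\text{start}}+M-1\}$. According to the central limit theorem [17], the number of available channels $R(t)$ is approximately normally distributed, $R(t) \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu = \sum_{i=C_{\text{start}}}^{C_{\text{start}}+M-1} P_i$ and $\sigma^2 = \sum_{i=C_{\text{start}}}^{C_{\text{start}}+M-1} (P_i - P_i^2)$, as illustrated in Figure 3. Based on the distribution of $R(t)$, we calculate the access probability $\zeta$ and the switching probability $\xi$ in the following two propositions.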
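The quality of the normal approximation to $R(t)$ can be checked numerically against the exact distribution of a sum of independent Bernoulli variables. The following Python sketch computes the tail probability $\Pr\{R \ge k\}$ both ways; the function names and the optional continuity correction are assumptions added for illustration and are not part of the analysis above.

```python
import math


def exact_tail(p_idle, k_req):
    """Exact Pr{R >= k_req}, where R is a sum of independent Bernoulli(p_i)."""
    dist = [1.0]  # dist[r] = Pr{r of the channels processed so far are idle}
    for p in p_idle:
        nxt = [0.0] * (len(dist) + 1)
        for r, q in enumerate(dist):
            nxt[r] += q * (1.0 - p)
            nxt[r + 1] += q * p
        dist = nxt
    return sum(dist[k_req:])


def normal_tail(p_idle, k_req, continuity=0.0):
    """Normal approximation of Pr{R >= k_req} with mean sum(p_i) and
    variance sum(p_i - p_i^2). The continuity correction is optional."""
    mu = sum(p_idle)
    var = sum(p * (1.0 - p) for p in p_idle)
    if var == 0.0:
        # Fully known channel states: R is deterministic.
        return 1.0 if mu >= k_req else 0.0
    z = (k_req - continuity - mu) / math.sqrt(var)
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


beliefs = [0.5, 0.5, 0.5, 0.5]   # hypothetical idle probabilities
exact = exact_tail(beliefs, 2)
approx = normal_tail(beliefs, 2)
approx_cc = normal_tail(beliefs, 2, continuity=0.5)
```

For small access sets the uncorrected approximation can be noticeably off, so comparing it with the exact value is a cheap sanity check before relying on it inside a reward estimate.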

Proposition 1 (Calculation of Access Probability). If the spectrum sensing obtains accurate availability information for all channels, the access probability $\zeta$ is
$$\zeta = \begin{cases} 1, & R \ge \Upsilon/BW, \\ 0, & \text{otherwise}. \end{cases}$$
Otherwise, the access probability $\zeta$ is calculated as
$$\zeta = \int_{\Upsilon/BW}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx.$$
Proof. If the spectrum sensing obtains accurate availability information for all channels, $R$ is deterministic and Equation (24) follows directly. On the other hand, if the spectrum sensing is incomplete or inaccurate, $R$ is a random variable whose p.d.f. is $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. The access probability is the probability that $R \ge \Upsilon/BW$, which is shown by the shaded region in Figure 3. Therefore, Equation (25) is obtained and the proposition holds.
We now estimate the switching probability $\xi$ by asymptotic analysis, in which the sensing period $T_p$ is divided equally into $k$ slim time spans. The situation within one slim time span $T_p/k$ is analyzed first, and the period $T_p$ is then investigated by considering multiple slim time spans.
1) The case with complete and accurate sensing.
There are $R$ available channels and $M - R$ channels occupied by the PUs in the set $A_2(m)$ at the beginning of the period $T_p$. The sensing period $T_p$ is divided equally into $k$ parts, where $k$ is large enough that we can assume that at most one channel changes state during one slim time span.
During the slim time span $T_p/k$, the number of available channels $R$ in the set $A_2(m)$ can change in three ways: it increases by one, decreases by one, or remains unchanged. The probabilities of these three events are denoted by $P_{\text{up}}$, $P_{\text{down}}$, and $P_{\text{hold}}$, respectively.
Here, we make two approximating assumptions. First, the number of occupied channels $M - R$ stays unchanged at the beginning of each slim time span $T_p/k$, which is the most likely case. Second, we take the geometric mean over all the channels in the set $A_2$ to approximate the probability that a channel remains occupied, since the geometric mean reflects the influence of small probabilities. The probability $P_{00}$ that one channel remains occupied is calculated in this way, and $P_{\text{up}}$ and $P_{\text{down}}$ are obtained accordingly. Based on $P_{\text{up}}$ and $P_{\text{down}}$, we have $P_{\text{hold}} = 1 - P_{\text{up}} - P_{\text{down}}$. From the above analysis, we obtain the expression of the switching probability $\xi$ in the following proposition.

Proposition 2 (Calculation of Switching Probability). For a given $R$, the switching probability $\xi(R)$ is given by Equation (30).
Proof. During the sensing period $T_p$, there are $H = k(1 - P_{\text{hold}}) = k(P_{\text{up}} + P_{\text{down}})$ changes of channel state in total, of which we assume that $l$ decrease and $H - l$ increase the number of available channels. If a channel switch occurs, the bandwidth requirement of the SU is no longer satisfied, so the number of available channels $R - l + (H - l)$ is less than $\Upsilon/BW$. In other words, $l - (H - l) > R - \Upsilon/BW$, and consequently $l > \frac{1}{2}\left(H + R - \Upsilon/BW\right)$. The switching probability is calculated by summing the probabilities of all such $l$, which gives Equation (33). Based on Equation (33), Equation (30) can be obtained.
2) The case with partial or inaccurate sensing. As in the estimation of $\zeta$, $R$ is a random variable instead of a deterministic one, with the p.d.f. given in the proof of Proposition 1. The switching probability $\xi$ is then obtained by averaging $\xi(R)$ over this distribution.
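The counting argument in the proof of Proposition 2 can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the text, that each of the $H$ state changes is independently a decrease with probability $P_{\text{down}}/(P_{\text{up}} + P_{\text{down}})$, so that the number of decreases $l$ is binomial; $P_{\text{up}}$ and $P_{\text{down}}$ are taken as given inputs, and the function name is hypothetical.

```python
import math


def switching_prob(r_avail, k_req, n_changes, p_up, p_down):
    """Probability that, after n_changes single-channel state changes,
    fewer than k_req channels are available.

    r_avail:   number of available channels R at the start of the period.
    k_req:     required number of channels (requirement / BW).
    n_changes: integer number of state changes H in the period.
    Assumption: each change is a decrease with probability
    p_down / (p_up + p_down), independently of the others.
    """
    q = p_down / (p_up + p_down)
    margin = r_avail - k_req
    total = 0.0
    for l in range(n_changes + 1):
        # l decreases and (n_changes - l) increases; a switch occurs when
        # the net decrease exceeds the margin R - requirement/BW.
        if l - (n_changes - l) > margin:
            total += math.comb(n_changes, l) * q**l * (1.0 - q)**(n_changes - l)
    return total


tight = switching_prob(r_avail=3, k_req=3, n_changes=1, p_up=0.1, p_down=0.1)
slack = switching_prob(r_avail=4, k_req=3, n_changes=2, p_up=0.1, p_down=0.1)
```

The sketch only checks the state at the end of the period, exactly as the inequality in the proof does; it does not track whether the requirement was violated at an intermediate slim time span.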

Joint spectrum sensing and access: Rollout policy
On the basis of the POMDP framework proposed in Section 3, we can derive the optimal spectrum sensing and access scheme. For optimality, the value function $V_m(\Lambda)$ is computed by averaging over all possible state transitions and observations. Since the number of system states grows exponentially with the number of channels, the optimal scheme suffers from the curse of dimensionality and is computationally overwhelming. In this section, we exploit the specific structure of the problem and develop a rollout-based suboptimal spectrum sensing and access scheme with much lower complexity.

Rollout policy
The central issue in designing the spectrum sensing and access scheme is the calculation of the value function $V_m(\Lambda)$, which is also the most computationally intensive part. To reduce the complexity, we adopt an approximation technique that offers an effective and computationally efficient solution. The rollout algorithm [15], an approximate dynamic programming methodology based on policy iteration ideas, has been applied successfully in domains such as combinatorial optimization [18] and stochastic scheduling [19]. Instead of computing the exact value, the rollout algorithm estimates the value function approximately. Using the Monte Carlo method, the results of a number of randomly generated samples are averaged, and the number of samples is typically smaller than the dimensionality of the total strategy space. When the number of samples is large enough, we obtain a joint spectrum sensing and access scheme with reduced complexity and limited performance loss.
To obtain a suboptimal solution, a problem-dependent heuristic is first proposed as the base policy, and the reward of the base policy is then used by the rollout algorithm in a one-step lookahead to approximate the value function. The procedure of the rollout-based scheme is illustrated in Figure 4.
The value function of the POMDP can be written as
$$V_m(\Lambda) = \max_{a \in A} E\left[\kappa_m(a) + V_{m-1}\big(\mathcal{T}(\Lambda \mid a, \theta)\big)\right],$$
where $\kappa_m(a)$ denotes the number of time slots included in the $m$th last control interval, namely the reward, which depends on the action $a$.
Base Policy. The rollout algorithm needs a heuristic algorithm to serve as the base policy, which is also designed on the basis of the control interval structure. Here, we propose two different heuristics based on our design objective, namely the Bandwidth-Oriented Heuristic (BOH) and the Switch-Oriented Heuristic (SOH).
In BOH, we simply choose the sensing and access sets $A_1$ and $A_2$ that yield the widest expected available bandwidth at present, i.e., that maximize $\sum_{i \in A_2(m)} P_i \cdot BW$, where $P_i = \Pr\{S_i = 1\}$ can be updated according to $A_1(m)$. Intuitively, the wider the available bandwidth is, the better the requirement of the SU is satisfied, and the less likely a channel switch is triggered in the next time slot. However, this heuristic does not use the statistics of the PU traffic to predict the future dynamic behavior of the channels.
In SOH, we choose the sensing and access actions that maximize the expected current reward $E[\kappa_m(a)] = \sum_{\kappa} \kappa\, p_{\kappa_m}(\kappa)$, where the calculation of $p_{\kappa_m}$ includes the prediction of the access probability $\zeta$ and the switching probability $\xi$. By making full use of the dynamic statistics of the channels, SOH is more sophisticated and achieves better performance than BOH. Both heuristics are greedy and have low computational complexity. With either heuristic as the base policy $\pi_H$, the expected reward from the current control interval to termination can be calculated by the recursion $V_m^H(\Lambda) = E\left[\kappa_m(a_H) + V_{m-1}^H\big(\mathcal{T}(\Lambda \mid a_H, \theta)\big)\right]$, with the initial condition $V_0^H(\Lambda) = 0$.
Rollout Policy. The rollout policy based on the base policy $\pi_H$ is denoted by $\pi_{RL}$ and is defined through the following operation:
$$a_{RL}(m) = \arg\max_{a \in A} E\left[\kappa_m(a) + V_{m-1}^H\big(\mathcal{T}(\Lambda \mid a, \theta)\big)\right]. \qquad (42)$$
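The access part of the BOH heuristic described in this subsection can be sketched as a sliding-window search. The sketch assumes, for illustration only, that the access set consists of $M$ adjacent channels starting at $C_{\text{start}}$ and that all channels have equal bandwidth, so maximizing the expected available bandwidth reduces to maximizing the sum of idle probabilities in the window. The function name and the 0-based indexing are hypothetical.

```python
def boh_access_window(p_idle, m):
    """Pick the start index of the m adjacent channels with the largest
    expected number of idle channels.

    p_idle: list of current idle probabilities P_i, one per channel.
    m:      number of adjacent channels in the access set.
    Returns (start_index, expected_idle_count), with 0-based start_index.
    """
    if not 0 < m <= len(p_idle):
        raise ValueError("window size must be between 1 and the number of channels")
    window = sum(p_idle[:m])
    best_start, best_sum = 0, window
    for start in range(1, len(p_idle) - m + 1):
        # Slide the window one channel to the right.
        window += p_idle[start + m - 1] - p_idle[start - 1]
        if window > best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum


# Hypothetical beliefs over five channels, access set of two channels.
start, expected_idle = boh_access_window([0.1, 0.9, 0.8, 0.2, 0.7], 2)
```

Multiplying `expected_idle` by the channel bandwidth gives the expected available bandwidth of the chosen window. The choice of which channels to sense is not covered by this sketch.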
The rollout policy approximates the value function by using the reward of the base policy and thereby decides the near-optimal action $a_{RL}(m)$. We prove that the rollout policy is guaranteed to perform at least as well as the heuristic used as its base policy.

Proposition 3 (Rollout Improving Property). The rollout policy is guaranteed to obtain an aggregated reward at least as good as that of the base policy.
Proof. The proposition is proved by backward mathematical induction. For $m = T$, the definition of the rollout policy (42) gives the required inequality between the rewards of the rollout policy and the base policy; hence, the proposition holds for $m = T$. Assume that it holds for $m$. Applying the definition of the rollout policy (42) again yields the corresponding inequality for $m - 1$; therefore, the property holds for $m - 1$. By mathematical induction, the proposition is proved.

Suboptimal spectrum sensing and access
To implement the proposed rollout policy (42), we define the Q-factor as
$$Q_m(a) = E\left[\kappa_m(a) + V_{m-1}^H\big(\mathcal{T}(\Lambda \mid a, \theta)\big)\right],$$
which indicates the expected reward that the SU can accrue over the remaining lifetime of the process from the current control interval. The rollout action can then be expressed as $a_{RL}(m) = \arg\max_{a \in A} Q_m(a)$.
However, the Q-factor, which is the key quantity of the rollout policy, may not be known in closed form, which makes the computation of $a_{RL}(m)$ a nontrivial issue [20]. To overcome this difficulty, we adopt the widely used Monte Carlo method [21].
Here, we define a trajectory as a sequence of the form
$$(S(T), a(T), S(T-1), a(T-1), \cdots, S(1), a(1)). \qquad (50)$$
Using the Monte Carlo method, we consider each possible action $a \in A$ and generate a large number of trajectories of the system starting from the belief vector $\Lambda(m)$, using $a$ as the first action and the base policy $\pi_H$ thereafter. Thus, a trajectory has the form
$$(S(m), a, S(m-1), a_H(m-1), \cdots, S(1), a_H(1)), \qquad (51)$$
where the system states $S(m), S(m-1), \cdots, S(1)$ are randomly sampled according to the belief vectors, which are updated based on the action and observation history. The rewards corresponding to these trajectories are averaged to compute $\tilde{Q}_m(a)$ as an approximation of the Q-factor $Q_m(a)$. The approximation becomes increasingly accurate as the number of trajectories increases. Once the approximate Q-factor $\tilde{Q}_m(a)$ has been computed for each action $a \in A$, we obtain the approximate rollout action by the maximization $\tilde{a}_{RL}(m) = \arg\max_{a \in A} \tilde{Q}_m(a)$. This rollout-based suboptimal spectrum sensing and access scheme reduces the computational complexity considerably by estimating the value function approximately rather than computing its exact value.
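The Monte Carlo evaluation of the Q-factors described above can be sketched generically in Python. In this sketch, `simulate(action, rng)` is a hypothetical user-supplied function that returns the total reward of one sampled trajectory that starts with `action` and follows the base policy afterwards; the toy simulator and its action names are invented purely for demonstration.

```python
import random


def rollout_action(actions, simulate, n_traj, seed=0):
    """Approximate rollout step: estimate a Q-factor for every candidate
    action by averaging n_traj simulated trajectory rewards, then return
    the action with the largest estimate together with all estimates."""
    rng = random.Random(seed)
    q_hat = {}
    for a in actions:
        total = 0.0
        for _ in range(n_traj):
            total += simulate(a, rng)
        q_hat[a] = total / n_traj
    best = max(q_hat, key=q_hat.get)
    return best, q_hat


# Toy stand-in for the system simulator (an assumption, not the paper's
# model): each action leads to a fixed per-slot switching probability, and
# the reward is the geometric number of slots until the first switch.
SWITCH_PROB = {"narrow_band": 0.5, "wide_band": 0.1}


def toy_simulate(action, rng):
    length = 1
    while rng.random() >= SWITCH_PROB[action]:
        length += 1
    return float(length)


best_action, estimates = rollout_action(list(SWITCH_PROB), toy_simulate, n_traj=2000)
```

The cost of one decision is the number of actions times the number of trajectories times the trajectory length, which is how the scheme trades accuracy for complexity.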

Robustness via differential training
In a stochastic environment, the Monte Carlo computation of the rollout policy is particularly sensitive to the approximation error, which is closely related to the number of trajectories. In this subsection, we adopt differential training [22] in the proposed rollout-based suboptimal scheme to improve its robustness. In the differential training method, we estimate the relative Q-factor difference rather than the absolute Q-factor value, which is a suitable improvement of the recursively generated rollout policy in the context of Monte Carlo-based policy iteration methods.
To compute the rollout action $a_{RL}(m) = \arg\max_{a \in A} Q_m(a)$, the Q-factor differences $Q_m(a_1) - Q_m(a_2)$, $\forall a_1, a_2 \in A$, should be computed accurately, since the candidate actions are compared through the sign of these differences. Unfortunately, in a stochastic environment, the approximation $\tilde{Q}_m(a)$ fluctuates randomly around the accurate Q-factor value $Q_m(a)$, so that taking the difference of two approximations enlarges the approximation error. For example, suppose that $a_1$ performs better than $a_2$, so that $Q_m(a_1) - Q_m(a_2) > 0$. With the stochastic Monte Carlo method, the approximation $\tilde{Q}_m(a_1)$ may be smaller than the accurate value $Q_m(a_1)$ while $\tilde{Q}_m(a_2)$ is larger than the accurate value $Q_m(a_2)$, which makes it quite possible that $\tilde{Q}_m(a_1) - \tilde{Q}_m(a_2) < 0$. This result leads to the wrong action being chosen for spectrum sensing and access.
To reduce the negative effects of the approximation error discussed above, we adopt the differential training method. Specifically, we take the Q-factor value of the base policy $\pi_H$ as a reference to enhance the robustness, which can be viewed as a variance reduction technique. Instead of approximating each Q-factor independently, the approximate rollout action is obtained by maximizing the approximation of the Q-factor difference, $\tilde{a}_{RL}(m) = \arg\max_{a \in A} \big[\tilde{Q}_m(a) - \tilde{Q}_m(a_H)\big]$. The reference $\tilde{Q}_m(a_H)$ fluctuates in the same direction as $\tilde{Q}_m(a)$, because both fluctuations are caused by the approximation error due to the limited number of trajectories. Consider again the example in which $a_1$ actually performs better than $a_2$ and $Q_m(a_1) > Q_m(a_2)$. If the approximation $\tilde{Q}_m(a_1)$ is smaller than the accurate value, so is the reference $\tilde{Q}_{m1}(a_H)$ estimated with it. Similarly, if $\tilde{Q}_m(a_2)$ is larger than the accurate value, so is the reference $\tilde{Q}_{m2}(a_H)$. In the difference $\tilde{Q}_m(a) - \tilde{Q}_m(a_H)$, the approximation errors largely cancel. Thus, it is likely that $\tilde{Q}_m(a_1) - \tilde{Q}_{m1}(a_H) > \tilde{Q}_m(a_2) - \tilde{Q}_{m2}(a_H)$, and consequently the SU chooses the better action $a_1$.
From the above discussion, the approximate Q-factor difference $\tilde{Q}_m(a) - \tilde{Q}_m(a_H)$ is more robust than the approximate independent Q-factor $\tilde{Q}_m(a)$. With differential training of the rollout policy, the approximation error caused by the Monte Carlo method is reduced considerably and the proposed suboptimal spectrum sensing and access scheme performs more robustly.
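The variance reduction behind estimating Q-factor differences can be illustrated with a toy experiment. The sketch below uses common random numbers, i.e., the candidate action and the reference action are evaluated on the same sampled noise, which is one way to make the two estimates fluctuate together; this pairing, the additive toy reward model, and all names are assumptions made for illustration rather than the exact procedure of [22].

```python
import random
import statistics


def toy_reward(action_gain, noise):
    """Toy trajectory reward: an action-dependent gain plus environment
    noise that is shared by all actions (hypothetical model)."""
    return action_gain + 5.0 * noise


def independent_difference(gain_a, gain_ref, n_traj, rng):
    """Estimate Q(a) - Q(a_H) from two independent sets of trajectories."""
    q_a = statistics.fmean(toy_reward(gain_a, rng.gauss(0, 1)) for _ in range(n_traj))
    q_ref = statistics.fmean(toy_reward(gain_ref, rng.gauss(0, 1)) for _ in range(n_traj))
    return q_a - q_ref


def paired_difference(gain_a, gain_ref, n_traj, rng):
    """Estimate Q(a) - Q(a_H) with the same noise sample for both actions."""
    diffs = []
    for _ in range(n_traj):
        noise = rng.gauss(0, 1)
        diffs.append(toy_reward(gain_a, noise) - toy_reward(gain_ref, noise))
    return statistics.fmean(diffs)


rng = random.Random(0)
indep = [independent_difference(2.0, 1.0, 20, rng) for _ in range(200)]
paired = [paired_difference(2.0, 1.0, 20, rng) for _ in range(200)]
spread_indep = statistics.pstdev(indep)
spread_paired = statistics.pstdev(paired)
```

In this additive toy model the shared noise cancels exactly, so the paired estimate has essentially zero spread; in a realistic simulator the cancellation is only partial, but the spread of the difference is still typically smaller than with independent trajectories.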

Simulation results
In this section, we evaluate the performance of the proposed joint spectrum sensing and access scheme by simulation. We investigate the effect of the number of Monte Carlo random trajectories, the proportion of sensing channels L/N, and the ratio of aggregation range to bandwidth requirement /γ . The PU traffic statistics follows the model of Erlang-distribution [23], and the simulation configuration is listed in Table 1. The average simulation results are obtained by 100 runs with random channel states.
In Figure 5, for different number of Monte Carlo random trajectories, we compute the value of the approximate independent Q-factor Q m (a) and the value of the approximate Q-factor difference Q m (a)− Q m (a H ), respectively. Two pairs of curves represent two different rollout It is indicated that the approximate Q-factor difference is more robust than the approximate independent Q-factor, and the differential training of the rollout policy can reduce the negative effect of the approximation error a lot. In the case that the trajectory number is large enough, both approximate values are nearly accurate compared with the original values. Figure 6 illustrates the effect of the proportion of sensing channels L/N on the performance of both optimal and suboptimal schemes. The random scheme is adopted as a baseline for performance comparison, in which M channels are chosen randomly to access. Besides, we also evaluate the performance of the base policies, the suboptimal rollout schemes based on BOH and SOH, and the POMDP-based optimal scheme. Here, we adopt the performance with 1,500 random trajectories for approximation in rollout policies, which can achieve the performance in convergence.
In Figure 6, as the number of sensing channels L increases, the number of channel switches decreases for the BOH, SOH, BOH- and SOH-based rollout, and optimal POMDP-based schemes. When the whole spectrum can be sensed (L/N = 1), these schemes achieve their best performance, because the more channels the SU senses, the more information about the system state it obtains. A spectrum aggregation action determined on the basis of the sensing results therefore performs better in minimizing the expected number of channel switches. For the random access scheme, which determines the access channels without considering the sensing results, the performance does not change as L increases.
When L is small, which means that only a small number of channels can be sensed, the performance of all schemes is almost the same because the system performance is limited by L in this case. As L increases, the POMDP-based optimal scheme performs the best, and the rollout-based suboptimal schemes achieve much better performance than the basis heuristics and the random scheme. In particular, the SOH-based rollout scheme achieves a performance gain over the BOH-based rollout scheme, which verifies that the choice of the base policy affects the performance of the corresponding rollout policy. When the heuristic is good, the rollout scheme based on it achieves relatively better performance. Compared with the optimal POMDP-based scheme, the rollout-based suboptimal scheme incurs only a slight performance loss while significantly reducing the computational complexity.
Figure 7 illustrates the effect of the ratio of aggregation range to bandwidth requirement /γ on the performance of both optimal and suboptimal schemes. The performance comparison of these schemes is similar to that in Figure 6. The performance gaps between the schemes are large when the aggregation range is small, because sophisticated schemes can select the channels whose availability state is stable. In contrast, all the schemes achieve good performance when the aggregation range is large enough.
In Figure 8, we evaluate the performance of the rollout-based schemes with different numbers of random trajectories when L = 3. The performance of the random and optimal schemes stays constant as the number of random trajectories increases. The performance of the rollout-based scheme approaches that of the optimal scheme as the number of random trajectories increases, up to the convergence point, which is 1,500 in this simulation. The differential training method improves the performance of the rollout-based schemes significantly when the number of random trajectories is small. After convergence, the advantage of differential training is small, since the performance is less sensitive to the approximation error when the number of random trajectories is large enough. Figure 8 also confirms that the SOH-based rollout outperforms the BOH-based one.

Conclusion
In this paper, we investigate spectrum sensing and access schemes that minimize the number of channel switches in order to achieve stable DSA, taking into consideration the practical limitations of both spectrum sensing and aggregation capability. We develop a POMDP framework for joint spectrum sensing and access. In particular, we derive the reward function by estimating the stability of different spectrum sensing and access strategies. Based on the POMDP framework, we propose a rollout-based suboptimal spectrum sensing and access scheme which approximates the value function of the POMDP. It is proved that the rollout policy achieves a performance improvement over the basis heuristics. By numerical evaluation, we find that as the number of random trajectories increases, the performance of the proposed rollout-based scheme approaches the optimal performance. When the number of random trajectories is large enough, the proposed scheme performs near-optimally with a lower complexity and achieves a significant improvement over the base policy. In the rollout-based schemes, the basis heuristic affects the performance of the corresponding rollout policy, and the differential training method improves the robustness to the approximation error.
Endnotes
a The transition probabilities can be estimated from the statistics of the channel availabilities of two adjacent slots and are assumed to be known by the SUs [23].
b The schemes proposed in this paper can be easily extended to multiple-SU cases by adopting the RTS/CTS scheme [24] for access coordination between SUs.
c Minimizing the expected number of channel switches can equivalently be treated as maximizing the throughput when the system overhead of channel switches is taken into account.
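As a rough illustration of endnote a, the following sketch estimates transition probabilities by counting how often each availability state in one slot is followed by each state in the next slot. It assumes a two-state channel model in which 1 means available and 0 means occupied. The function name `estimate_transition_probabilities` and the sample history are hypothetical and do not come from the simulation in this paper.

```python
def estimate_transition_probabilities(availability):
    """Estimate P(next state | current state) from a 0/1 per-slot availability sequence."""
    # counts[current][nxt] = number of adjacent slot pairs (current, nxt) observed.
    counts = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}}
    for current, nxt in zip(availability, availability[1:]):
        counts[current][nxt] += 1

    probabilities = {}
    for state, row in counts.items():
        total = row[0] + row[1]
        if total == 0:
            # The state never appeared as a "current" slot, so no estimate is possible.
            probabilities[state] = None
        else:
            probabilities[state] = {0: row[0] / total, 1: row[1] / total}
    return probabilities


# Hypothetical observation history of one channel (1 = available, 0 = occupied).
history = [1, 1, 0, 0, 1, 1, 1, 0]
estimated = estimate_transition_probabilities(history)
print(estimated)
```

With longer observation histories the frequency estimates become more reliable; a short history such as the one above is only meant to show the counting step.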