Due to limited space, we discuss only the soft constraint for finite-state Markov sub-channels; the hard-constraint case can be derived in a similar manner. Recall that the state of sub-channel *n* at time slot *t* is denoted by *S*_{n}(*t*). Then, the overall state of the system is given by \(\mathbf {S}(t)\triangleq \left (S_{1}(t),...,S_{N}(t)\right)\). We assume that, for each sub-channel *n*, the transition probabilities \(q^{BI}_{n}\) and \(q^{IB}_{n}\) are both positive and less than 1. Then, both the idle and busy states are recurrent, since the state transitions are not affected by the input. Therefore, it is easy to verify that the overall sub-channel is indecomposable and the sub-channel capacity is given by [6]

$$\begin{array}{@{}rcl@{}} C={\lim}_{T\rightarrow\infty}\frac{1}{T}\max_{\mathbf{a}_{1}^{T}}I\left(\mathbf{X}_{1}^{T},\mathbf{Y}_{1}^{T}\right), \end{array} $$

(22)

where (recall that **a**(*t*) is the input policy for time slot *t*)

$$\begin{array}{@{}rcl@{}} \mathbf{a}_{1}^{T}=\left(\mathbf{a}(1),...,\mathbf{a}(T)\right), \end{array} $$

(23)

$$\begin{array}{@{}rcl@{}} \mathbf{X}_{1}^{T}=\left\{X_{n}(t)\right\}_{n=1,...,N;t=1,...,T}, \end{array} $$

(24)

and

$$\begin{array}{@{}rcl@{}} \mathbf{Y}_{1}^{T}=\left\{Y_{n}(t)\right\}_{n=1,...,N;t=1,...,T}. \end{array} $$

(25)

Since the secondary user cannot sense all sub-channels simultaneously, it has only partial information about the overall channel state. Therefore, we can apply the framework of the partially observable Markov decision process (POMDP) to study the optimal policy achieving the channel capacity. We first define the belief about the channel states, converting the partially observable state into a completely observable one. Then, we formulate the channel capacity as an average-reward Markov decision problem. The uncountable state space is reduced to a countable one using the special structure of the spectrum sensing problem. Finally, the channel capacity is expressed in terms of the stable state probability.

### Belief states

We denote by *π*_{n}(*t*) the *a posteriori* probability (in this paper, we call it the *belief* about sub-channel *n*) that sub-channel *n* is idle in the *t*-th time slot, conditioned on all previous inputs^{Footnote 1}. It is easy to verify that *π*_{n}(*t*) can be computed recursively:

$$\begin{array}{@{}rcl@{}} {}\pi_{n}(t)&=&I(X_{n}(t)=\Psi)q^{BI}_{n}\\ &&+I(X_{n}(t)\in \mathcal{X}^{M})\left(1-q^{IB}_{n}\right)\\ &&+I(X_{n}(t)=\Phi)\pi_{n}(t-1)\left(1-q^{IB}_{n}\right)\\ &&+I(X_{n}(t)=\Phi)\left(1-\pi_{n}(t-1)\right)q^{BI}_{n}, \end{array} $$

(26)

where *I* is the characteristic function. The first term covers the case in which sub-channel *n* is sensed and found to be busy, and the second term the case in which it is sensed and found to be idle. In the last two terms, sub-channel *n* is not sensed at time slot *t*, so its state can only be inferred from the *a posteriori* probability at time slot *t*−1.
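The recursion in (26) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `SensingOutcome` enum and the function name `update_pi` are hypothetical, and the three enum members stand for *X*_{n}(*t*)=Ψ, *X*_{n}(*t*)∈𝒳^{M} and *X*_{n}(*t*)=Φ, respectively.

```python
from enum import Enum


class SensingOutcome(Enum):
    """Hypothetical labels for the three kinds of input X_n(t)."""
    SENSED_BUSY = "Psi"      # X_n(t) = Psi: sensed, found busy
    SENSED_IDLE = "symbol"   # X_n(t) in X^M: sensed, found idle, symbol sent
    NOT_SENSED = "Phi"       # X_n(t) = Phi: sub-channel not sensed


def update_pi(prev_pi, outcome, q_bi, q_ib):
    """Transmitter-side belief update following Eq. (26).

    prev_pi : belief pi_n(t-1) that sub-channel n is idle
    q_bi    : busy-to-idle transition probability
    q_ib    : idle-to-busy transition probability
    """
    if outcome is SensingOutcome.SENSED_BUSY:
        return q_bi                      # first term of (26)
    if outcome is SensingOutcome.SENSED_IDLE:
        return 1.0 - q_ib                # second term of (26)
    # Not sensed: propagate the previous belief through the Markov chain
    # (third and fourth terms of (26)).
    return prev_pi * (1.0 - q_ib) + (1.0 - prev_pi) * q_bi
```

Only the unsensed case depends on the previous belief; a sensing event resets the belief to a value fixed by the transition probabilities.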

Meanwhile, we denote by *μ*_{n}(*t*) the *a posteriori* probability that sub-channel *n* is idle in the *t*-th time slot, conditioned on all previous outputs. It is easy to verify that *μ*_{n}(*t*) can be computed recursively:

$$\begin{array}{@{}rcl@{}} {}\mu_{n}(t)&=&I\left(Y_{n}(t)\in \mathcal{Y}^{M}\right)\left(1-q^{IB}_{n}\right)\\ &&+I(Y_{n}(t)=\Theta)\mu_{n}(t-1)\left(1-q^{IB}_{n}\right)\\ &&+I(Y_{n}(t)=\Theta)(1-\mu_{n}(t-1))q^{BI}_{n}, \end{array} $$

(27)

where the first term is for the case in which the receiver receives explicit symbols over sub-channel *n*, while the following two terms mean that the receiver receives nothing from sub-channel *n* (the transmitter may have sensed the sub-channel and found it busy, or may not have sensed sub-channel *n* at all). We assume that the initial probability is given by *π*_{n}(0)=*μ*_{n}(0).
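A companion sketch of (27) from the receiver's side is given below. The names `update_mu`, `trace_mu` and the boolean flag `received_symbol` are assumptions made for illustration. Because the receiver only knows whether it received an explicit symbol, a sensed-busy slot and an unsensed slot are handled by the same branch.

```python
def update_mu(prev_mu, received_symbol, q_bi, q_ib):
    """Receiver-side belief update following Eq. (27).

    received_symbol is True when Y_n(t) is in Y^M and False when
    Y_n(t) = Theta (nothing received on sub-channel n).
    """
    if received_symbol:
        return 1.0 - q_ib                # first term of (27)
    # Nothing received: propagate the previous belief (last two terms).
    return prev_mu * (1.0 - q_ib) + (1.0 - prev_mu) * q_bi


def trace_mu(mu0, outputs, q_bi, q_ib):
    """Return the belief after each slot for a sequence of output flags."""
    beliefs = []
    mu = mu0
    for received in outputs:
        mu = update_mu(mu, received, q_bi, q_ib)
        beliefs.append(mu)
    return beliefs
```

For instance, `trace_mu(0.5, [True, False, False], q_bi=0.2, q_ib=0.1)` resets the belief on the first slot and then lets it relax over the two silent slots.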

Following the approach in [11], we can consider the beliefs {*π*_{n}(*t*),*μ*_{n}(*t*)}_{n=1,...,N} as the system state at time slot *t* (note that the system state is different from the state of the sub-channels). Then, the POMDP problem is converted to a full-information MDP problem, since all belief states are known to the transmitter.

### Average reward

Using the same argument as in [10], we can obtain

$$\begin{array}{@{}rcl@{}} {}C&=&{\lim}_{T\rightarrow\infty}\max_{\mathbf{a}_{1}^{T}}\frac{1}{T}\sum\limits_{t=1}^{T} \sum\limits_{n=1}^{N}\\ &&\left(H(Y_{n}(t)|\mu_{n}(t))-H(Y_{n}(t)|X_{n}(t),\pi_{n}(t))\right). \end{array} $$

(28)

The following lemma simplifies the difference of the two conditional entropies:

**Lemma 1**

The following equation holds:

$$\begin{array}{@{}rcl@{}} H(Y_{n}(t)|\mu_{n}(t))&-&H(Y_{n}(t)|X_{n}(t),\pi_{n}(t))\\ &=&P\left(X_{n}(t)\in\mathcal{X}^{M}\right) \sum\limits_{m=1}^{M}I(x_{mn},y_{mn})\\ &&+H(\tilde{Y}_{n}(t)|\mu_{n}(t)), \end{array} $$

(29)

where \(\tilde {Y}_{n}(t)\) is a binary random variable equal to 1 when \(Y_{n}(t)\in \mathcal {Y}^{M}\) and equal to 0 when *Y*_{n}(*t*)=*Θ*.

**Remark 3**

Similarly to the memoryless case, the optimization of explicit input distribution is independent of that of sensing probability. Again, we assume that the explicit input distribution has been optimized using traditional approaches and denote by *I*_{n,max} the corresponding optimal mutual information over sub-channel *n*. Then, we focus on only the sensing probabilities.

We assume that the input policy is determined by the belief states, i.e., the sensing probability is determined by {*π*_{n}(*t*)} and {*μ*_{n}(*t*)}. Therefore, the input policy, denoted by **a**(**π**(*t*),**μ**(*t*)), is a vector function whose *n*-th element, *ρ*_{n}(*t*)=(**a**(**π**(*t*),**μ**(*t*)))_{n}, is the probability of sensing sub-channel *n*. We assume that the input policy is stationary, i.e., it does not change with time.

Note that the input policy maps from [0,1]^{2N} (the belief states) to the simplex \(\sum _{n=1}^{N}\rho _{n}=N'\) in [*ε*,1]^{N} (the sensing probabilities), where *ε* is a positive number. The lower bound *ε*, which prevents any sensing probability from being zero, is justified by the following lemma (the proof is straightforward, using the fact that the derivative of log *x* is infinite at *x*=0).

**Lemma 2**

For an optimal input policy, the sensing probabilities should be non-zero.
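One way to see why, sketched under an assumption that is not stated in this section: suppose the receiver observes an explicit symbol on sub-channel *n* with probability *p*=*ρ*_{n}*μ*_{n}, so that the entropy of the binary indicator in Lemma 1 is the binary entropy *h*(*p*). Then

$$ h(p)=-p\log p-(1-p)\log(1-p),\qquad h'(p)=\log\frac{1-p}{p}\rightarrow\infty \quad\text{as } p\rightarrow 0^{+}. $$

Moving a small amount *ε* of sensing probability from a sub-channel *k* with *ρ*_{k}>0 to a sub-channel *n* with *ρ*_{n}=0 and *μ*_{n}>0 would then gain

$$ h(\varepsilon\mu_{n})=\varepsilon\mu_{n}\log\frac{1}{\varepsilon\mu_{n}}+O(\varepsilon) $$

on sub-channel *n*, while the loss on sub-channel *k* is at most *O*(*ε*). For sufficiently small *ε*, the gain dominates, so *ρ*_{n}=0 cannot be optimal under this assumption.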

We define the following reward for time slot *t*:

$$\begin{array}{@{}rcl@{}} &&r(\mathbf{a},S(t))\\ &=&\sum\limits_{n=1}^{N}\left(H(Y_{n}(t)|\mu_{n}(t))-H(Y_{n}(t)|X_{n}(t),\pi_{n}(t))\right), \end{array} $$

(30)

(note that the conditional entropies are completely determined by **a**(*t*) and *π*(*t*)).

The channel capacity under the constraint of stationary input policy^{Footnote 2} can be written as

$$\begin{array}{@{}rcl@{}} \hat{C}={\lim}_{T\rightarrow\infty}\max_{\mathbf{\rho}_{1}^{T}}\frac{1}{T}\sum\limits_{t=1}^{T}r(\mathbf{a},S(t)), \end{array} $$

(31)

which is the average reward of a controlled Markov process. This motivates us to apply the theory of controlled Markov processes to find the optimal input policy.

### Countable state space

The difficulty in analyzing the optimal input policy for the controlled Markov process in (31) is that the state space {*π*_{n}(*t*),*μ*_{n}(*t*)} is uncountable, so discretization would be needed to optimize the input policy. However, we can show that the uncountable state space is equivalent to a countable one, thus substantially reducing the complexity.

First, we notice that the belief *π*_{n}(*t*) at time slot *t* is determined as follows (suppose that the last time slot before *t* in which the transmitter sensed sub-channel *n* is *t*−*τ*):

$$ {}\pi_{n}(t)=\left\{ \begin{aligned} &\left(\mathcal{Q}_{n}^{\tau}\right)_{11},\qquad \text{if }X_{n}(t-\tau)\in\mathcal{X}^{M}\\ &\left(\mathcal{Q}_{n}^{\tau}\right)_{12},\qquad \text{if }X_{n}(t-\tau)=\Psi\\ &\left(\mathcal{Q}_{n}^{t}\right)_{11}\pi_{n}(0)+\left(\mathcal{Q}_{n}^{t}\right)_{12}(1-\pi_{n}(0))\text{, if }\tau\leq 0 \end{aligned} \right., $$

(32)

with the convention that *τ*≤0 means sub-channel *n* has never been sensed (recall that \(\mathcal {Q}_{n}\) is the transition matrix of sub-channel *n* defined in (2)).

Since \(q_{n}^{IB}+q_{n}^{BI}\neq 1\) (otherwise, the sub-channel degenerates to the memoryless case), \(\left (\mathcal {Q}_{n}^{t_{1}}\right)_{11}\neq \left (\mathcal {Q}_{n}^{t_{2}}\right)_{12}\) for *t*_{1},*t*_{2}>0 almost surely. Also, \(\left (\mathcal {Q}_{n}^{t}\right)_{11}\pi _{n}(0)+\left (\mathcal {Q}_{n}^{t}\right)_{12}(1-\pi _{n}(0))\) is equal to \(\left (\mathcal {Q}_{n}^{t_{1}}\right)_{11}\) or \(\left (\mathcal {Q}_{n}^{t_{1}}\right)_{12}\) in only countably many cases, which form a set of measure zero. Therefore, we can determine from *π*_{n}(*t*), almost surely, the last time slot before *t* in which sub-channel *n* was sensed.
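The link between the belief recursion and the powers of the transition matrix can be checked numerically with the sketch below. It rests on an assumption: \(\mathcal {Q}_{n}\) is defined in (2), which is not reproduced in this section, so the code takes it to be column-stochastic with index 1 denoting the idle state, the convention under which the third case of (32) is the idle probability after *t* unsensed slots. All function names are hypothetical.

```python
def transition_matrix(q_bi, q_ib):
    """Assumed form of Q_n: column-stochastic, index 0 = idle, 1 = busy."""
    return [[1.0 - q_ib, q_bi],
            [q_ib, 1.0 - q_bi]]


def mat_mul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_pow(q, steps):
    """Q raised to a non-negative integer power by repeated multiplication."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        result = mat_mul(result, q)
    return result


def propagate(belief, q_bi, q_ib, steps):
    """Apply the unsensed branch of the belief recursion `steps` times."""
    for _ in range(steps):
        belief = belief * (1.0 - q_ib) + (1.0 - belief) * q_bi
    return belief


def beliefs_after_sensing(q_bi, q_ib, tau):
    """Return ((Q^tau)_11, (Q^tau)_12) under the assumed convention."""
    q_tau = mat_pow(transition_matrix(q_bi, q_ib), tau)
    return q_tau[0][0], q_tau[0][1]
```

Starting the recursion from a belief of 1 (known idle) or 0 (known busy) and iterating *τ* times gives the same numbers as the two matrix entries, which is why the belief takes values in a countable set indexed by *τ* and the last sensing result.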

Similarly, the belief *μ*_{n}(*t*) at time slot *t* is determined as follows (suppose that the last time slot in which sub-channel *n* was sensed and found to be idle, i.e., \(Y_{n}(t-\delta)\in \mathcal {Y}^{M}\), is *t*−*δ*):

$$ {}\mu_{n}(t)=\left\{ \begin{aligned} &\left(\mathcal{Q}_{n}^{\delta}\right)_{11},\qquad \text{if }X_{n}(t-\delta)\in\mathcal{X}^{M}\\ &\left(\mathcal{Q}_{n}^{t}\right)_{11}\pi_{n}(0)+\left(\mathcal{Q}_{n}^{t}\right)_{12}(1-\pi_{n}(0))\text{, if }\delta\leq 0 \end{aligned} \right., $$

(33)

with the convention that *δ*≤0 means the receiver has never received a signal over sub-channel *n* before time *t*. Similarly, *μ*_{n}(*t*) is equivalent to *δ* almost surely.

When the initial state for sub-channel *n* is *π*_{n}(0)=1 and *μ*_{n}(0)=1, *π*_{n}(*t*) is either \(\left (\mathcal {Q}_{n}^{t_{1}}\right)_{11}\) or \(\left (\mathcal {Q}_{n}^{t_{2}}\right)_{12}\), where *t*_{1} and *t*_{2} are integers, due to (32), and *μ*_{n}(*t*) can only be \(\left (\mathcal {Q}_{n}^{t_{3}}\right)_{11}\), where *t*_{3} is an integer, due to (33). This means that the possible values of *π*_{n}(*t*) and *μ*_{n}(*t*) are countable. Then, each sub-state (*π*_{n}(*t*),*μ*_{n}(*t*)) is equivalent to a 3-tuple (*S*_{n}(*t*−*τ*),*τ*,*δ*), where *t*−*τ* is the last time slot in which sub-channel *n* was sensed and *t*−*δ* is the last time slot in which sub-channel *n* was sensed and found to be idle (obviously, *τ*≤*δ*). Therefore, the state space [0,1]^{2N} degenerates to a discrete state space

$$\begin{array}{@{}rcl@{}} \mathbf{\Xi}=\left\{\left\{B,I\right\}\times \left\{(\tau,\delta)|\tau\in\mathbb{N},\delta\in\mathbb{N},\tau\leq \delta\right\}\right\}^{N}. \end{array} $$

(34)

We denote by *ξ*(*t*) and *ξ*_{n}(*t*) the overall state and the sub-state for sub-channel *n* at time slot *t*, respectively.

However, it loses generality to assume *π*_{n}(0)=1 or 0, ∀*n*. Fortunately, we can show that the long-term average reward is a constant that depends only on the control strategy, regardless of the initial state. To this end, we apply Theorem 1 in Appendix E [11]. The following lemma, whose proof is given in Appendix F, verifies the assumptions of Theorem 1.

**Lemma 3**

Assumptions 1 and 2 hold for the controlled Markov process of spectrum sensing.

Applying the conclusion in Theorem 1 and Lemma 3, we obtain the following proposition, which converts the finite state sub-channel into a memoryless one:

**Proposition 3**

The sub-channel capacity is independent of the initial state and is given by

$$\begin{array}{@{}rcl@{}} {}C=\max_{\Delta}\sum\limits_{\xi\in \mathbf{\Xi}} \sum\limits_{n=1}^{N}\left(H(Y_{n}|\xi)-H(Y_{n}|X_{n},\xi)\right)\Delta(\xi), \end{array} $$

(35)

where *Δ* is the stable probability of belief state *ξ*.

The stable probability *Δ* is determined by the following equation:

$$\begin{array}{@{}rcl@{}} \Delta(\xi)=\sum\limits_{\xi'\rightarrow\xi}\Delta(\xi')\prod\limits_{n=1}^{N}P(\xi_{n}|\xi'), \end{array} $$

(36)

where *ξ*^{′} and *ξ* are both overall states, *ξ*_{n} is the state of sub-channel *n*, *ξ*^{′}→*ξ* means that *ξ*^{′} is a legal state in the previous time slot when the current state is *ξ*, and *P*(*ξ*_{n}|*ξ*^{′}) is the transition probability, which is given by

$$\begin{array}{@{}rcl@{}} P(\xi_{n}|\xi')=1-\left(\mathbf{a}(\xi')\right)_{n}, \end{array} $$

(37)

if *ξ*_{n}=(*x*,*τ*,*δ*) and *ξ*^{′}_{n}=(*x*,*τ*−1,*δ*−1) (i.e., sub-channel *n* is not sensed), and

$$\begin{array}{@{}rcl@{}} P(\xi_{n}|\xi')=\left(\mathbf{a}(\xi')\right)_{n}, \end{array} $$

(38)

otherwise. Then, the sensing probability can be optimized numerically, which is beyond the scope of this paper.

### Myopic strategy

The above approaches based on POMDP can achieve theoretically optimal performance. However, they can hardly be implemented when the number of channels becomes large, even if we keep only finitely many states in (34). For example, if we keep only two states for each channel, there will be 2^{N} overall states. When *N*=20, which is used in the numerical simulation section of this paper, the computational and memory costs will be prohibitive.
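A quick count makes the scale concrete. The helper below is purely illustrative (its name and the choice of what to count are assumptions): it reports the number of joint states, the size of a tabular policy with one sensing probability per sub-channel and state, and the size of a joint transition matrix if it were stored densely.

```python
def pomdp_table_sizes(states_per_channel, num_channels):
    """Count joint states and the sizes of naive tabular representations."""
    joint_states = states_per_channel ** num_channels
    return {
        "joint_states": joint_states,
        # One sensing probability per sub-channel for every joint state.
        "policy_entries": joint_states * num_channels,
        # Dense joint transition matrix: one entry per ordered state pair.
        "dense_transition_entries": joint_states ** 2,
    }


sizes = pomdp_table_sizes(states_per_channel=2, num_channels=20)
# joint_states             = 2**20 = 1_048_576
# policy_entries           = 20 * 2**20 = 20_971_520
# dense_transition_entries = 2**40 = 1_099_511_627_776
```

Even with this coarsest truncation, the dense transition matrix has about 10^{12} entries, and keeping more than two states per channel increases every count exponentially in *N*.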

Hence, we propose a practical approach based on the myopic strategy, namely, maximizing the expected throughput in the next time slot. We treat the belief *π*_{n}(*t*) as the true idle probability of channel *n*. Then, we apply the scheduling strategy in Prop. 1 (for the soft constraint case) or in Prop. 2 (for the hard constraint case).