Structured optimal transmission control in network-coded two-way relay channels

Ding, Ni; Sadeghi, Parastoo; Kennedy, Rodney A.

doi:10.1186/s13638-015-0470-7

Published: 06 November 2015

EURASIP Journal on Wireless Communications and Networking volume 2015, Article number: 241 (2015)

Abstract

This paper considers a transmission control problem in network-coded two-way relay channels (NC-TWRC), where the relay buffers randomly arriving packets from two users and the channels are assumed to be fading. The problem is modeled as a discounted infinite-horizon Markov decision process (MDP). The objective is to find an adaptive transmission control policy that minimizes the packet delay, buffer overflow, transmission power consumption, and downlink error rate simultaneously and in the long run. Using the concepts of submodularity, multimodularity, and $L^{\natural}$-convexity, we study the structure of the optimal policy found by the dynamic programming (DP) algorithm. We show that the optimal transmission policy is nondecreasing in queue occupancies and/or channel states under certain conditions on the chosen values of the parameters in the MDP model, the channel modeling method, and the preservation of stochastic dominance in the transitions of system states. Based on these results, we propose two low-complexity algorithms for searching the optimal monotonic policy: monotonic policy iteration (MPI) and discrete simultaneous perturbation stochastic approximation (DSPSA). We show that MPI reduces the time complexity of DP, and that DSPSA is able to adaptively track the optimal policy when the statistics of the packet arrival processes change with time.

1 Introduction

Network coding (NC) was proposed in [1] to maximize the information flow in a wired network. It was introduced in multicast wireless communications to optimize the throughput and has attracted significant interest recently due to the rapid growth in multimedia applications [2]. It was shown in [3] that the power efficiency in wireless transmission systems could be improved by NC. For example, in a 3-node network system, called the network-coded two-way relay channels (NC-TWRC) [4] as shown in Fig. 1, the messages $m_{1}$ and $m_{2}$ are XORed at the relay and broadcast to the end users. Compared to conventional store-and-forward transmission, this method reduces the total number of transmissions from 4 to 3, so the transmission power is reduced by 25%. Since then, numerous optimization problems have been studied in NC-TWRC, e.g., the precoding scheme design proposed in [5], the optimal achievable sum-rate problem studied in [6], and the optimal beamforming method proposed in [7].
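The XOR relaying idea described above can be illustrated with a minimal sketch. The function name `relay_xor` and the two-byte packets are hypothetical and chosen only for illustration; the sketch assumes equal-length packets and error-free links.

```python
def relay_xor(m1: bytes, m2: bytes) -> bytes:
    """XOR two equal-length packets byte by byte (the relay's coding step)."""
    if len(m1) != len(m2):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(m1, m2))


m1 = b"\x0f\xa5"  # hypothetical packet sent by user 1
m2 = b"\xf0\x5a"  # hypothetical packet sent by user 2

# Two uplink transmissions (m1 and m2) plus one coded broadcast: 3 in total,
# instead of the 4 needed by store-and-forward.
coded = relay_xor(m1, m2)

# Each user removes its own packet from the broadcast to recover the other's.
recovered_by_user1 = relay_xor(coded, m1)
recovered_by_user2 = relay_xor(coded, m2)
```

Decoding works because XOR is its own inverse, so each end user only needs the packet it sent itself.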

In [8], Katti et al. pointed out the importance of being opportunistic in practical NC scenarios. It was suggested that the assumptions in related research should comply with practical wireless environments, e.g., decentralized routing and time-varying traffic rates. This suggestion highlighted a problem in the existing literature: the majority of the studies (e.g., [9, 10]) consider static environments (e.g., synchronized traffic) while ignoring the stochastic nature of the packet arrivals in the data link layer. On the other hand, the randomness of traffic in Fig. 1 poses the problem of how to make an optimal decision in a dynamic environment with a power-delay tradeoff: when there are packet inflows in the relay but no coding opportunities or XORing pairs (e.g., one packet arrives from one user, but no packet arrives from the other), waiting for coding opportunities by holding packets saves transmission power but increases packet delay and results in more packets to be transmitted in the future. Since a decision made at any instant affects both the immediate and future costs, the decision-making is a dynamic, instead of a one-time, process, i.e., the objective is to determine a decision rule that is optimal over time. In [11, 12], this problem was studied and solved by a cross-layer design, NC-TWRC with buffering. The optimal policy obtained from a Markovian process formulation was shown to minimize the transmission power and packet delay simultaneously and in the long run. In [13], the buffer-assisted NC-TWRC was extended to include the dynamics of wireless channels (Fig. 2). In this system, a transmission policy that solves the power-delay tradeoff may not be the best decision rule because it does not consider the possible loss in throughput due to downlink transmission errors.
For this reason, the scheduler is required to make an optimal decision that simultaneously minimizes the transmission power, packet delay, and downlink BER in the long run by considering the current queue and channel states and their expectations in the future. In [13], this problem was formulated as a discounted infinite-horizon Markov decision process (MDP) [14] with channels modeled by finite-state Markov chains (FSMCs) [15]. The optimal transmission policy was shown to be superior to those in [11, 12] in terms of enhancing the QoS (quality of service, evaluated by packet delay and overflow in the data link layer, and power consumption and error rate in the physical layer) in a practical wireless environment, e.g., Rayleigh fading channels.

The optimal policy of a discounted infinite-horizon MDP can be found by dynamic programming (DP) [16], e.g., by value or policy iteration. However, the DP algorithm is burdened with high complexity. In Fig. 2, the system state is a 4-tuple (two channels and two queues), and the decision/action is a 2-tuple (each associated with the departure control of one queue). In such a high-dimensional MDP, the curse of dimensionality¹ becomes more evident [17]: the computation load grows quickly if the cardinality of any tuple in the state variable is large. To relieve the curse, one solution is to qualitatively understand the model and prove the existence of a monotonic optimal policy [18]. Then, a low-complexity algorithm or a model-free learning method can be proposed, e.g., simultaneous perturbation stochastic approximation (SPSA) [19, 20]. However, a monotonic optimal policy does not exist in general. Most often, an optimal policy exists but varies irregularly with the state variable. In order to prove the existence of a certain feature in the optimal policy, we need to extensively analyze the MDP model and the recursive functions in the DP algorithm. The basic approach in the existing literature is to show by induction that submodularity is preserved in each iterative optimization step (maximization/minimization) in DP, e.g., [19, 21]. We adopt the same method in this paper but consider submodularity in high-dimensional cases. Moreover, we use $L^{\natural}$-convexity and multimodularity, two concepts that were originally defined in discrete convex analysis [22, 23], to describe the joint submodularity and integral convexity in a high-dimensional space.

The aim of our work is to prove the existence of a monotonic optimal transmission policy in the NC-TWRC system in Fig. 2. By observing the $L^{\natural}$-convexity and submodularity of the DP functions, we derive sufficient conditions for the optimal policy to be nondecreasing in queue and/or channel states. These structural results are used to derive two low-complexity algorithms: monotonic policy iteration (MPI) and discrete simultaneous perturbation stochastic approximation (DSPSA). We compare the time complexity of MPI to that of DP and show the convergence performance of the DSPSA algorithm. The main results in this paper are:

We prove that each tuple in the optimal policy is nondecreasing in the queue state that is controlled by that tuple if the chosen values of the unit costs in the immediate cost function give rise to an $L^{\natural}$-convex or multimodular DP. Moreover, we show that the same results found in [19, 21] can also be explained by $L^{\natural}$-convexity or multimodularity through a unimodular coordinate transform.
By thinking of each iteration in DP as a one-stage pure coordination supermodular game, we show that equiprobable traffic rates and certain conditions on unit costs guarantee that each tuple in the optimal policy is monotonic in not only the queue state that is controlled by that tuple but also the queue state that is associated with the information flow of the opposite direction, i.e., the one that is not under the control of that tuple.
By observing the submodularity of DP, we show the sufficient conditions for an optimal policy to be nondecreasing in both queue and channel states in terms of unit costs, channel statistics, and FSMC models.
Based on the submodularity, multimodularity, and $L^{\natural}$-convexity of DP, we show that the optimal transmission control problem in Fig. 2 can be solved by two low-complexity algorithms. One is MPI, a modified DP algorithm whose action search space shrinks progressively as the indices of the queue and/or channel states increase. It is shown that the time complexity of MPI is much lower than that of DP when the cardinality of the system state is large. The other algorithm is a stochastic optimization method. We formulate the optimal policy search as a minimization problem over a set of queue thresholds and use the DSPSA algorithm to approximate the minimizer. We show that DSPSA is able to adaptively track the optimal values of the queue thresholds when the statistics of the packet arrival processes change with time. We run simulations in NC-TWRC with Rayleigh fading channels to show that the average cost incurred by the policy approximated by DSPSA is similar to that incurred by the optimal policy found by DP.

The rest of this paper is organized as follows. In Section 2, we state the optimization problem in NC-TWRC with random packet arrivals and FSMC-modeled channels and clarify the assumptions. In Section 3, we describe the MDP formulation, state the objective, and present the DP algorithm. In Section 4, we investigate the structure, in queue and channel states, of the optimal transmission policy found by the DP algorithm. Section 5 presents the MPI and DSPSA algorithms.

2 System

Consider the NC-TWRC shown in Fig. 2. Users 1 and 2 randomly send packets to each other via the relay. The relay is equipped with two finite-length FIFO queues, queues 1 and 2, to buffer the incoming packets from users 1 and 2, respectively. The outflows of the queues are controlled by a scheduler, which repeatedly decides whether or not to transmit packets from the queues. If the decision results in a pair of packets in opposite directions being transmitted at the same time, they will be XORed (coded) and broadcast. Otherwise, the packet will simply be forwarded to the end user. The objective is to minimize packet delay, queue overflow, transmission power (saved by utilizing the coding opportunities), and downlink transmission errors, both immediately and in expectation over the future. These optimization concerns conflict with each other: (1) If there does not exist a pair of packets for XORing, waiting for a coding opportunity by holding packets results in a high packet delay on average, while transmitting a packet without coding results in one more packet to be transmitted in the future, i.e., more transmission power on average; (2) If the SNR of one channel is low, waiting for a transition to high SNR by holding packets results in higher packet delay but lower transmission error rate. Therefore, the scheduler must seek an optimal decision rule that solves this power-delay-error tradeoff.

It should be pointed out that the problem under consideration is a cross-layer multi-objective optimization one; we want to optimize both the power consumption and transmission error rate in the physical layer and the packet delay in the data link layer. As discussed above, since there are tradeoffs among these optimization metrics, it is not possible to get all of them optimized simultaneously. Therefore, in this paper, we are actually seeking the Pareto optimality of these optimization metrics.²

2.1 Assumptions

We consider a discrete-time decision-making process, where the time is divided into small intervals, called decision epochs and denoted by t∈{0,1,…,T}. Let i∈{1,2} and assume the following:

A1
(i.i.d. incoming traffic) Denote random variable $f_{i}^{(t)}\in {\mathcal {F}_{i}}$ as the number of incoming packets to queue i at decision epoch t. Let the maximum number of packets arrived per decision epoch be no greater than 1, i.e., $\mathcal {F}_{i}=\{0, 1\}$. Assume that $\left \{f_{1}^{(t)}\right \}$ and $\left \{f_{2}^{(t)}\right \}$ are two independent i.i.d. random processes with $Pr\left (f_{i}^{(t)}=1\right)=p_{i}$ and $Pr\left (f_{i}^{(t)}=0\right)=1-p_{i}$ for all t.
A2
(modulation scheme) Packets are of equal length. The packets that arrive at the relay are decoded and stored in the queues. The relay transmits packets by BPSK modulation. Denote by $L_{P}$ the packet length in bits. Since the maximum information flow is one packet per decision epoch, each decision epoch lasts for $L_{P}$ symbol durations. The relay can set a certain field in the header of a packet so as to notify the receivers whether the packet is XORed or not.
A3
(finite-length queues) Queue $i$ can store at most $L_{i}$ packets. At each $t$, the scheduler makes a decision and incurs an immediate cost before the event $\mathbf{f}^{(t)}=\left(f_{1}^{(t)},f_{2}^{(t)}\right)$. Denote $b_{i}^{(t)}\in{\mathcal{B}_{i}}$ as the occupancy of queue $i$ at the beginning of decision epoch $t$, then $\mathcal{B}_{i}=\left\{0,1,\dotsc,L_{i}+\max{\left\{f_{i}^{(t-1)}\right\}}\right\}=\{0,1,\dotsc,L_{i}+1\}$. If the relay's decision results in queue occupancy $L_{i}+1$, the newly arrived packet will be dropped. We call this a packet loss due to queue overflow.
A4
(Markovian channel modeling) Let the full variation range of $\gamma_{i}^{(t)}$, the instantaneous SNR of channel $i$, be partitioned into $K_{i}$ non-overlapping regions $\{[\Gamma_{1},\Gamma_{2}),[\Gamma_{2},\Gamma_{3}),\dotsc,[\Gamma_{K_{i}},\infty)\}$, called channel states. Here, the SNR boundaries satisfy $\Gamma_{1}<\Gamma_{2}<\dotsc<\Gamma_{K_{i}}$. Denote $\mathcal{G}_{i}=\{1,2,\dotsc,K_{i}\}$ as the state set of channel $i$ and $g_{i}^{(t)}$ as the state of channel $i$ at decision epoch $t$. We say that $g_{i}^{(t)}=k_{i}$ if $\gamma_{i}^{(t)}\in{[\Gamma_{k_{i}},\Gamma_{k_{i}+1})}$. Each channel is modeled by a finite-state Markov chain (FSMC) [15], where the state evolution of channel $i$ is governed by the transition probability $P_{g_{i}^{(t)}g_{i}^{(t+1)}}=Pr\left(g_{i}^{(t+1)}|g_{i}^{(t)}\right)$.
A5
(downlink channel state information) Let $\left \{g_{1}^{(t)}\right \}$ and $\left \{g_{2}^{(t)}\right \}$ be two independent and i.i.d. random processes. The relay has the channel state information (the value of channel state and its transition probabilities) of both channels before the decision making at t.
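The SNR partition in A4 can be sketched as a lookup from an instantaneous SNR to a channel state. The function name `channel_state` and the boundary values below are hypothetical; the sketch assumes the boundaries are given on a linear scale in increasing order.

```python
import bisect


def channel_state(snr, boundaries):
    """Return the 1-based state k with boundaries[k-1] <= snr < boundaries[k].

    `boundaries` lists Gamma_1 < Gamma_2 < ... < Gamma_K; the last region
    [Gamma_K, infinity) has no upper limit.
    """
    if snr < boundaries[0]:
        raise ValueError("SNR lies below the first boundary Gamma_1")
    # bisect_right counts the boundaries that are <= snr, which is exactly k.
    return bisect.bisect_right(boundaries, snr)


boundaries = [0.0, 1.0, 4.0]  # hypothetical Gamma_1, Gamma_2, Gamma_3
states = [channel_state(s, boundaries) for s in (0.5, 1.0, 3.9, 10.0)]
```

With these example boundaries, an SNR that falls exactly on a boundary is assigned to the higher state, matching the half-open intervals in A4.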

3 Markov decision process formulation

Based on A1, A4, and A5, we know that the statistics of the incoming traffic flow and channel dynamics associated with user 1 or 2 are time-invariant. It follows that the transmission control problem in Fig. 2 can be formulated as a stationary Markov decision process (MDP). In the following, we drop the decision epoch notation $t$ in A1-A5 and use the notation $y$ and $y'$ for the system variable $y$ at the current and next decision epochs, respectively.

3.1 System state

Denote the system state $\mathbf{x}=(\mathbf{b},\mathbf{g})\in{\mathcal{X}}$, where $\mathbf{b}=(b_{1},b_{2})\in{\mathcal{B}_{1}\times{\mathcal{B}_{2}}}$ and $\mathbf{g}=(g_{1},g_{2})\in{\mathcal{G}_{1}\times{\mathcal{G}_{2}}}$, i.e., $\mathcal{X}=\mathcal{B}_{1}\times{\mathcal{B}_{2}}\times{\mathcal{G}_{1}}\times{\mathcal{G}_{2}}$. Here, $\times$ denotes the Cartesian product. We also use the 4-tuple notation $\mathbf{x}=(b_{1},b_{2},g_{1},g_{2})$ in the following.

3.2 Action

Denote action $\mathbf{a}=(a_{1},a_{2})\in{\mathcal{A}}$, where $a_{i}\in{\mathcal{A}_{i}}=\{0,1\}$ denotes the number of packets departing from queue $i$ and $\mathcal{A}=\mathcal{A}_{1}\times{\mathcal{A}_{2}}=\{0,1\}^{2}$. The terminology of the actions is shown in Table 1.

Table 1 Action set


3.3 State transition probabilities

The transition probability $P_{\mathbf{x}\mathbf{x}^{\prime}}^{\mathbf{a}}=Pr(\mathbf{x}^{\prime}|\mathbf{x},\mathbf{a})$ denotes the probability of being in state $\mathbf{x}^{\prime}$ at the next decision epoch if action $\mathbf{a}$ is taken in state $\mathbf{x}$ at the current decision epoch. Due to the assumptions of independent random processes in A1 and A5, the state transition probability is given by

$$ P_{\mathbf{x}\mathbf{x}^{\prime}}^{\mathbf{a}}=P_{\mathbf{b}\mathbf{b}'}^{\mathbf{a}}P_{\mathbf{g}\mathbf{g}'} =\prod_{i=1}^{2}P_{b_{i}b_{i}^{\prime}}^{a_{i}}P_{g_{i}g_{i}^{\prime}}, $$

(1)

where $P_{g_{i}g_{i}^{\prime}}$ is determined by the channel statistics and the FSMC modeling method in A4 and $P_{b_{i}b_{i}^{\prime}}^{a_{i}}$ is the queue state transition probability. At the current decision epoch, the occupancy of queue $i$ after decision $a_{i}$ is $\min\{[b_{i}-a_{i}]^{+},L_{i}\}$, where $[y]^{+}=\max\{y,0\}$. The occupancy at the beginning of the next decision epoch is given by

$$ b'_{i}=\min\left\{[\!b_{i}-a_{i}]^{+},L_{i}\right\}+\,f_{i}. $$

(2)

Therefore, the state transition probability of queue i is

$$\begin{array}{*{20}l} P_{b_{i}b_{i}^{\prime}}^{a_{i}}&=Pr\left(\,f_{i}=b_{i}^{\prime}-\min\left\{[\!b_{i}-a_{i}]^{+},L_{i}\right\}\right) \\ &=Pr\left(\,f_{i}=b_{i}^{\prime}\,-\,[\!b_{i}-a_{i}]^{+}+\,\mathcal{I}_{\left\{[b_{i}-a_{i}]^{+}>L_{i}\right\}}\right) \\ &=\left\{\begin{array}{ll} Pr\left(\,f_{i}=b_{i}^{\prime}\,-\,[\!b_{i}-a_{i}]^{+}\right) & [\!b_{i}-a_{i}]^{+}\leq{L_{i}}\\ Pr\left(\,f_{i}=b_{i}^{\prime}-L_{i}\right) & [\!b_{i}-a_{i}]^{+}>L_{i} \end{array}\right., \end{array} $$

(3)

where $\mathcal {I}_{\{\cdot \}}$ is the indicator function that returns 1 if the expression in {·} is true and 0 otherwise.
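The queue dynamics in (2) and the transition probability in (3) can be sketched as follows. The function names and the values of `L` and `p` are hypothetical; the sketch assumes Bernoulli arrivals as in A1.

```python
def occupancy_after_action(b, a, L):
    """min{[b - a]^+, L}: packets left in the queue after the decision."""
    return min(max(b - a, 0), L)


def queue_transition_prob(b, b_next, a, L, p):
    """Pr(b' | b, a) for one queue, following (3) with Pr(f = 1) = p."""
    f = b_next - occupancy_after_action(b, a, L)  # arrival needed to reach b'
    if f == 1:
        return p
    if f == 0:
        return 1.0 - p
    return 0.0  # b' is unreachable in one epoch


L, p = 3, 0.25  # hypothetical queue length and arrival probability
rows = {
    (b, a): [queue_transition_prob(b, bn, a, L, p) for bn in range(L + 2)]
    for b in range(L + 2)
    for a in (0, 1)
}
```

Each row of `rows` is a probability distribution over the next occupancy. For the overflow state $b_{i}=L_{i}+1$ with $a_{i}=0$, the row places probability $p$ on $L_{i}+1$ and $1-p$ on $L_{i}$, which is the second case of (3).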

3.4 Immediate cost

$C:\mathcal {X}\times \mathcal {A}\rightarrow {\mathbb {R}_{+}}$ is the cost incurred immediately after action a is taken in state x at current decision epoch. It reflects three optimization concerns: the packet delay and queue overflow, the transmission power, and the downlink transmission error rate.

3.4.1 Holding and overflow cost

We define h _i, the holding and queue overflow cost associated with queue i, as

$$\begin{array}{*{20}l} h_{i}(y_{i})&=\lambda\min\{[y_{i}]^{+},L_{i}\}+\xi_{o}\mathcal{I}_{\{[y_{i}]^{+}=L_{i}+1\}} \\ &=\lambda[y_{i}]^{+}+(\xi_{o}-\lambda)\mathcal{I}_{\{[y_{i}]^{+}=L_{i}+1\}}. \end{array} $$

(4)

$\lambda>0$ is the unit holding cost and $\xi_{o}>\lambda$ is the unit queue overflow cost, which makes $h_{i}(y_{i})$ a nondecreasing convex function. When $y_{i}=b_{i}-a_{i}$, $\min\{[y_{i}]^{+},L_{i}\}$ and $\mathcal{I}_{\{[y_{i}]^{+}=L_{i}+1\}}$ count the number of packets held in queue $i$ and the number of packets lost due to the overflow of queue $i$, respectively. We say that the term $\lambda\min\{[y_{i}]^{+},L_{i}\}$ accounts for the packet delay because, by Little's Law, the average packet delay is proportional to the average number of packets held in the queue in the long run for a given packet arrival rate [24]. We sum up $h_{i}$ for $i\in\{1,2\}$ and obtain the total holding and overflow cost as

$$ C_{h}(\mathbf{b},\mathbf{a})=\sum_{i=1}^{2}h_{i}(b_{i}-a_{i}). $$

(5)
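As a sanity check on (4) and (5), the sketch below evaluates both forms of $h_{i}$ and the total cost $C_{h}$. The function names and the integer unit costs are hypothetical and chosen only so that the arithmetic is exact.

```python
def holding_cost(y, L, lam, xi_o):
    """First form of (4): holding cost plus overflow penalty."""
    y_plus = max(y, 0)
    return lam * min(y_plus, L) + xi_o * (1 if y_plus == L + 1 else 0)


def holding_cost_alt(y, L, lam, xi_o):
    """Second form of (4), which should agree with the first."""
    y_plus = max(y, 0)
    return lam * y_plus + (xi_o - lam) * (1 if y_plus == L + 1 else 0)


def total_holding_cost(b, a, L, lam, xi_o):
    """Equation (5): sum of h_i(b_i - a_i) over both queues."""
    return sum(holding_cost(b[i] - a[i], L[i], lam, xi_o) for i in range(2))


L, lam, xi_o = 3, 1, 5  # hypothetical values with xi_o > lam
ys = range(-1, L + 2)   # y = b - a ranges from -1 to L + 1
values = [holding_cost(y, L, lam, xi_o) for y in ys]
second_diffs = [values[k + 1] - 2 * values[k] + values[k - 1]
                for k in range(1, len(values) - 1)]
```

With $\xi_{o}>\lambda$, all second differences are nonnegative, which is the discrete convexity claimed for $h_{i}$; the only strictly positive jumps occur at $y_{i}=0$ and $y_{i}=L_{i}$.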

3.4.2 Transmission cost

Since forwarding and broadcasting one packet, either coded or non-coded, consume the same amount of energy, we have the immediate transmission cost as

$$ t_{r}(\mathbf{a})=\tau\mathcal{I}_{\{a_{1}=1~\text{or}~a_{2}=1\}}= \left\{\begin{array}{ll} 0 & \mathbf{a}=(0,0)\\ \tau & \text{otherwise} \end{array}\right., $$

(6)

where τ>λ is the unit transmission cost and $\mathcal {I}_{\left \{a_{1}=1~\text {or}~a_{2}=1\right \}}$ counts the number of transmissions resulting from action a.

Note that (5) and (6) form a power-delay tradeoff. A policy that always transmits whenever there is an incoming packet without considering coding opportunities in the long run is penalized by (6), and a policy that always holds packet to wait for coding opportunities without considering the average packet delay is penalized by (5).

3.4.3 Packet error cost

Since packet errors in downlink transmissions happen only when we decide to transmit, we define the immediate packet error cost due to the action a _i as

$$ \text{err}(g_{-i},a_{i})=\eta a_{i}P_{e}(g_{-i}), $$

(7)

where $\eta$ is the unit packet error cost and $-i\in\{1,2\}\setminus\{i\}$, i.e., $-i=2$ if $i=1$, and $-i=1$ if $i=2$. The cost is written as $\text{err}(g_{-i},a_{i})$ because the packet departing queue $i$ is transmitted through channel $-i$, e.g., the relay sends one packet in queue 1 through fading channel 2 when $a_{1}=1$. $P_{e}(g_{i})$ is an estimate of the average BER when transmitting a packet, either coded or non-coded, through channel $i$ when the state is $g_{i}$. Since BPSK modulation is used at the relay, we define $P_{e}$ as

$$ P_{e}(g_{i})=\frac{1}{2}\text{erfc}(\sqrt{\Gamma_{g_{i}}}). $$

(8)

Here, $P_{e}(g_{i})\leq 0.5$ because $\Gamma_{1}\geq 0$ in A4.

Note, the aforementioned power-delay tradeoff formed by (5) and (6) just poses the problem of whether or not to transmit if an instantaneous packet inflow is not able to form an XORing pair. However, if the scheduler considers downlink transmission error rate in addition, a policy that always broadcasts XORed packets whenever there is a coding opportunity without considering downlink channel states is penalized by (7). Therefore, (5), (6), and (7) form a power-delay-error tradeoff.

In summary, we define the immediate cost as

$$ C(\mathbf{x},\mathbf{a})=C(\mathbf{b},\mathbf{g},\mathbf{a})=C_{h}(\mathbf{b},\mathbf{a})+C_{t}(\mathbf{g},\mathbf{a}), $$

(9)

where

$$\begin{array}{*{20}l} C_{t}(\mathbf{g},\mathbf{a}) &= \sum_{i=1}^{2}\text{err}(g_{-i},a_{i})+t_{r}(\mathbf{a}). \end{array} $$

(10)

Here, $C(\mathbf{x},\mathbf{a})$ is in fact a linear combination of loss functions (each quantifying an optimization concern). The unit costs $\lambda$, $\xi_{o}$, $\tau$, and $\eta$ can be considered as weight factors that are either given or adjustable depending on the real application. In Section 4, we will derive sufficient conditions for the existence of a structured optimal policy mainly in terms of the chosen values of these unit costs.
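The immediate cost in (9) and (10) can be assembled as in the sketch below. All function names, SNR boundaries, and unit costs are hypothetical; the sketch assumes 1-based channel states and boundaries on a linear scale, as in A4 and (8).

```python
import math


def ber_bpsk(gamma):
    """Equation (8): BER estimate at the lower SNR boundary gamma."""
    return 0.5 * math.erfc(math.sqrt(gamma))


def holding_cost(y, L_i, lam, xi_o):
    """Equation (4) for one queue."""
    y_plus = max(y, 0)
    return lam * min(y_plus, L_i) + (xi_o if y_plus == L_i + 1 else 0.0)


def immediate_cost(b, g, a, L, Gamma, lam, xi_o, tau, eta):
    """C(x, a) = C_h(b, a) + C_t(g, a) from (9) and (10)."""
    c_h = sum(holding_cost(b[i] - a[i], L[i], lam, xi_o) for i in range(2))
    # A packet leaving queue i is sent over the other channel (index 1 - i).
    c_err = sum(eta * a[i] * ber_bpsk(Gamma[1 - i][g[1 - i] - 1])
                for i in range(2))
    c_tx = tau if (a[0] == 1 or a[1] == 1) else 0.0  # equation (6)
    return c_h + c_err + c_tx


# Hypothetical parameters for illustration only.
L = (3, 3)
Gamma = ([0.0, 1.0, 4.0], [0.0, 1.0, 4.0])
costs = dict(lam=1.0, xi_o=10.0, tau=2.0, eta=4.0)

c_broadcast = immediate_cost((1, 1), (1, 1), (1, 1), L, Gamma, **costs)
c_hold = immediate_cost((2, 0), (3, 3), (0, 0), L, Gamma, **costs)
c_bad_channel = immediate_cost((1, 0), (1, 1), (1, 0), L, Gamma, **costs)
c_good_channel = immediate_cost((1, 0), (1, 3), (1, 0), L, Gamma, **costs)
```

In this example, sending the same packet over channel 2 costs less when that channel is in its best state, which is the error part of the power-delay-error tradeoff.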

3.5 Objective and dynamic programming

Let $\mathbf{x}^{(t)}$ and $\mathbf{a}^{(t)}$ denote the state and action at decision epoch $t$, respectively, and consider an infinite-horizon MDP model where the discrete decision-making process is assumed to be infinitely long. We can describe the long-run objective as

$$\begin{array}{*{20}l} \min\mathbb{E}\left[\sum_{t=0}^{\infty} \beta^{t} C\left(\mathbf{x}^{(t)},\mathbf{a}^{(t)}\right)|\mathbf{x}^{(0)}\right], \forall{\mathbf{x}^{(0)}\in{\mathcal{X}}}, \end{array} $$

(11)

where $\mathbf{x}^{(t+1)}\sim Pr(\cdot|\mathbf{x}^{(t)},\mathbf{a}^{(t)})$ and $\beta\in[0,1)$ is the discount factor that ensures the convergence of the series. It is proved in [14] that if the state space $\mathcal{X}$ is countable, the action set $\mathcal{A}$ is finite, and the MDP is stationary, there exists a deterministic stationary policy $\theta^{*}:\mathcal{X}\rightarrow{\mathcal{A}}$ that optimizes (11), and $\theta^{*}$ can be found by DP

$$ V^{(n)}(\mathbf{x})=\min_{\mathbf{a}\in{\mathcal{A}}}Q^{(n)}(\mathbf{x},\mathbf{a}), \forall{\mathbf{x}}\in\mathcal{X}, $$

(12)

where

$$ Q^{(n)}(\mathbf{x},\mathbf{a})=C(\mathbf{x},\mathbf{a})+\beta{\sum_{\mathbf{x}'\in{\mathcal{X}}}} P_{\mathbf{x}\mathbf{x}^{\prime}}^{\mathbf{a}} V^{(n-1)}(\mathbf{x}'). $$

(13)

Here, $n$ denotes the iteration index and $V^{(0)}(\mathbf{x})=0$ for all $\mathbf{x}$. Usually, a very small convergence threshold $\epsilon>0$ is applied so that DP terminates when $|V^{(N-1)}(\mathbf{x})-V^{(N)}(\mathbf{x})|\leq\epsilon$ for all $\mathbf{x}$ and some $N<\infty$.³ The optimal policy is obtained as $\theta^{*}(\mathbf{x})=\arg\min_{\mathbf{a}\in{\mathcal{A}}}Q^{(N)}(\mathbf{x},\mathbf{a})$.
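The recursion (12) and (13) can be sketched as plain value iteration on a small instance. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `value_iteration` and `toy_cost`, the 0-based channel states, the two-state channel matrix, the BER values in `pe`, and all unit costs are hypothetical, and the policy is taken greedily with respect to the final value function.

```python
import itertools


def value_iteration(L, P_ch, p, cost, beta=0.9, eps=1e-6, max_iter=10000):
    """Iterate V(x) = min_a [C(x, a) + beta * E V(x')] until the gap is <= eps."""
    K = (len(P_ch[0]), len(P_ch[1]))
    states = list(itertools.product(range(L[0] + 2), range(L[1] + 2),
                                    range(K[0]), range(K[1])))
    actions = list(itertools.product((0, 1), repeat=2))

    def q_value(x, a, V):
        b1, b2, g1, g2 = x
        y1 = min(max(b1 - a[0], 0), L[0])  # occupancy after the decision
        y2 = min(max(b2 - a[1], 0), L[1])
        expected = 0.0
        for f1, f2 in itertools.product((0, 1), repeat=2):
            p_f = (p[0] if f1 else 1 - p[0]) * (p[1] if f2 else 1 - p[1])
            for h1, h2 in itertools.product(range(K[0]), range(K[1])):
                p_g = P_ch[0][g1][h1] * P_ch[1][g2][h2]
                expected += p_f * p_g * V[(y1 + f1, y2 + f2, h1, h2)]
        return cost(x, a) + beta * expected

    V = {x: 0.0 for x in states}
    for _ in range(max_iter):
        V_new = {x: min(q_value(x, a, V) for a in actions) for x in states}
        gap = max(abs(V_new[x] - V[x]) for x in states)
        V = V_new
        if gap <= eps:
            break
    policy = {x: min(actions, key=lambda a: q_value(x, a, V)) for x in states}
    return V, policy


def toy_cost(x, a, L=(2, 2), lam=1.0, xi_o=12.0, tau=3.0, eta=4.0,
             pe=(0.4, 0.01)):
    """Toy immediate cost: pe[k] is an assumed BER for 0-based channel state k."""
    b, g = x[:2], x[2:]
    c = tau if 1 in a else 0.0
    for i in range(2):
        y = max(b[i] - a[i], 0)
        c += lam * min(y, L[i]) + (xi_o if y == L[i] + 1 else 0.0)
        c += eta * a[i] * pe[g[1 - i]]  # queue i is sent over channel -i
    return c


P = [[0.8, 0.2], [0.2, 0.8]]  # hypothetical two-state channel
V, policy = value_iteration((2, 2), (P, P), (0.3, 0.3), toy_cost)
```

In this toy instance, holding is the only sensible action when both queues are empty, since transmitting would add cost without changing the next state.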

As discussed in Section 2, the problem under consideration is a cross-layer multi-objective one. When defining the immediate cost function (9), we use the scalarization technique, i.e., $C(\mathbf{x},\mathbf{a})$ is a weighted sum of the holding and packet overflow costs incurred in the data link layer and the transmission power consumption and error rate incurred in the physical layer. Therefore, the optimal policy $\theta^{*}$ is in fact a Pareto optimal solution.⁴ It should be clear that a Pareto optimal solution is not optimal if we just consider an individual optimization metric, e.g., $\theta^{*}$ is not the optimal solution if we just want to minimize the power consumption in the physical layer.

4 Structured optimal policies

The time complexity of iteration $n$ in DP is $O(|\mathcal{X}|^{2}|\mathcal{A}|)$. There are $|\mathcal{X}|$ minimization operations, each of which requires $|\mathcal{A}|$ calculations of $Q^{(n)}$, and each $Q^{(n)}$ value requires $|\mathcal{X}|$ multiplications over the next state $\mathbf{x}^{\prime}$. Since $|\mathcal{X}|=|\mathcal{B}_{1}||\mathcal{B}_{2}||\mathcal{G}_{1}||\mathcal{G}_{2}|$, the complexity grows quadratically if the cardinality of any tuple in the state variable increases. If the node-to-node transmission in NC-TWRC is via multiple channels (e.g., single-user MIMO channels), the complexity grows exponentially with the number of user-to-relay channels, which may severely overload the CPU. In this section, we investigate the submodularity, $L^{\natural}$-convexity, and multimodularity of the functions $Q^{(n)}(\mathbf{x},\mathbf{a})$ and $V^{(n)}(\mathbf{x})$ in DP to establish sufficient conditions for the existence of a monotonic optimal policy. These results serve as the prerequisites for the low-complexity algorithms proposed in Section 5. We first clarify some concepts as follows.

Definition 4.1 (Monotonic policy).

Let $\theta\colon\mathbb{Z}^{n}\rightarrow{\mathbb{Z}^{m}}$. $\theta(\mathbf{x})$ is monotonic nondecreasing if $\theta(\mathbf{x}^{+})\succeq\theta(\mathbf{x}^{-})$ for all $\mathbf{x}^{+},\mathbf{x}^{-}\in{\mathbb{Z}^{n}}$ such that $\mathbf{x}^{+}\succeq\mathbf{x}^{-}$, where $\succeq$ denotes componentwise greater than or equal to.

Definition 4.2 (Submodularity [23, 25]).

Let $\mathbf{e}_{i}\in{\mathbb{Z}^{n}}$ be an $n$-tuple with all zero entries except the $i$th entry being one. $f\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is submodular if $f(\mathbf{x}+\mathbf{e}_{i})+f(\mathbf{x}+\mathbf{e}_{j})\geq f(\mathbf{x})+f(\mathbf{x}+\mathbf{e}_{i}+\mathbf{e}_{j})$ for all $\mathbf{x}\in{\mathbb{Z}^{n}}$ and $1\leq i,j\leq n$. $f$ is strictly submodular if the inequality is strict.

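In DP, a submodular function $Q^{(n)}(\mathbf{x},\mathbf{a})$ has $Q^{(n)}(\mathbf{x},\mathbf{a}^{-})-Q^{(n)}(\mathbf{x},\mathbf{a}^{+})$ nondecreasing in $\mathbf{x}$ for all $\mathbf{a}^{+}\succeq\mathbf{a}^{-}$, i.e., the preference for choosing action $\mathbf{a}^{+}$ over $\mathbf{a}^{-}$ is always nondecreasing in $\mathbf{x}$. Therefore, an increase in the state variable $\mathbf{x}$ implies an increase in the decision rule $\theta^{(n)}(\mathbf{x})=\arg\min_{\mathbf{a}}Q^{(n)}(\mathbf{x},\mathbf{a})$. This property is summarized in a general form in the following lemma.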

Lemma 4.3.

If $g\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is submodular in $(\mathbf{x},\mathbf{y})\in{\mathbb{Z}^{n}}$, then $f(\mathbf{x})=\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})$ is submodular in $\mathbf{x}$, and the minimizer $\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})$ is nondecreasing in $\mathbf{x}$ [26].
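Lemma 4.3 can be illustrated numerically on a small grid. The checker below is a sketch that tests the inequality in Definition 4.2 for distinct coordinates $i\neq j$ only, which is an assumption of this example; the test functions and grid bounds are hypothetical.

```python
import itertools


def is_submodular(f, lo, hi, n):
    """Check f(x+e_i) + f(x+e_j) >= f(x) + f(x+e_i+e_j) for i < j on a grid.

    Base points x range over {lo, ..., hi-1}^n so that every shifted point
    stays inside {lo, ..., hi}^n.
    """
    for x in itertools.product(range(lo, hi), repeat=n):
        for i, j in itertools.combinations(range(n), 2):
            x_i = tuple(v + (k == i) for k, v in enumerate(x))
            x_j = tuple(v + (k == j) for k, v in enumerate(x))
            x_ij = tuple(v + (k == i) + (k == j) for k, v in enumerate(x))
            if f(x_i) + f(x_j) < f(x) + f(x_ij):
                return False
    return True


def g(z):
    """Example function (x - y)^2, submodular in (x, y)."""
    return (z[0] - z[1]) ** 2


def product(z):
    """Counterexample x * y, which is supermodular rather than submodular."""
    return z[0] * z[1]


def minimizer(x, ys=range(0, 6)):
    """Smallest y minimizing g(x, y) over the candidate set ys."""
    return min(ys, key=lambda y: g((x, y)))


g_ok = is_submodular(g, 0, 5, 2)
product_ok = is_submodular(product, 0, 5, 2)
y_star = [minimizer(x) for x in range(0, 6)]
```

For this example, the minimizer follows $x$ exactly, so it is nondecreasing, in line with the lemma.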

Definition 4.4 ($L^{\natural}$-convexity [23]).

$f\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is $L^{\natural}$-convex if $\psi(\mathbf{x},\zeta)=f(\mathbf{x}-\zeta\mathbf{1})$ is submodular in $(\mathbf{x},\zeta)$, where $\mathbf{1}=(1,1,\dotsc,1)\in{\mathbb{Z}^{n}}$ and $\zeta\in{\mathbb{Z}}$.

Definition 4.5 (multimodularity [23]).

$f\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is multimodular if $\psi(\mathbf{x},\zeta)=f(x_{1}-\zeta,x_{2}-x_{1},\dotsc,x_{n}-x_{n-1})$ is submodular in $(\mathbf{x},\zeta)$, where $\zeta\in{\mathbb{Z}}$.

$L^{\natural}$-convexity and multimodularity are two concepts defined in discrete convex analysis [27]. $L^{\natural}$-convexity implies submodularity, while multimodularity implies supermodularity⁵ [28]. They both contribute to a monotonic structure in the optimal policy.
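As a small worked check of Definition 4.4, consider the one-dimensional case $n=1$, reading the submodularity inequality for the two distinct coordinates $x$ and $\zeta$ of $\psi$ (an assumption of this sketch). With $\psi(x,\zeta)=f(x-\zeta)$ and $d=x-\zeta$, the condition reduces to ordinary discrete convexity of $f$:

$$ \psi(x+1,\zeta)+\psi(x,\zeta+1)\geq\psi(x,\zeta)+\psi(x+1,\zeta+1) \;\Longleftrightarrow\; f(d+1)+f(d-1)\geq 2f(d). $$

For instance, the holding cost in (4) satisfies $h_{i}(L_{i}+1)+h_{i}(L_{i}-1)-2h_{i}(L_{i})=\xi_{o}-\lambda$ for $L_{i}\geq 1$, which is positive whenever $\xi_{o}>\lambda$.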

Lemma 4.6.

If $g\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is $L^{\natural}$-convex/multimodular in $(\mathbf{x},\mathbf{y})\in\mathbb{Z}^{n}$, then $f(\mathbf{x})=\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})$ is $L^{\natural}$-convex/multimodular in $\mathbf{x}$, and the minimizer $\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})$ is nondecreasing/nonincreasing in $\mathbf{x}$ [28, 29].

The unimodular coordinate transform below describes the relationship between $L^{\natural}$-convexity and multimodularity.

Lemma 4.7 (unimodular coordinate transform [23, 28]).

Let matrix $M_{n,i}=\left[\begin{array}{cc} -U_{i} & 0 \\ 0 & L_{n-i} \end{array}\right]$, where $U_{i}$ and $L_{i}$ are the $i\times i$ upper and lower triangular matrices with all nonzero entries equal to one, respectively, then

(a)
a function $f\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is multimodular if and only if it can be represented by $f(\mathbf{x})=g(\pm M_{n,i}\mathbf{x})$ for some $L^{\natural}$-convex function $g$.
(b)
a function $g\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ is $L^{\natural}$-convex if and only if it can be represented by $g(\mathbf{x})=f\left(\pm M_{n,i}^{-1}\mathbf{x}\right)$ for some multimodular function $f$.

Definition 4.8 (First order stochastic dominance [18]).

Let $\tilde{\rho}(x)$ be a random selection on space $\mathcal{X}$ according to a probability measure $\mu(x)$, where $x$ conditions the random selection. Then $\tilde{\rho}(x)$ is first order stochastically nondecreasing in $x$ if $\mathbb{E}[u(\tilde{\rho}(x^{+}))]\geq\mathbb{E}[u(\tilde{\rho}(x^{-}))]$ for all nondecreasing functions $u$ and $x^{+}\geq x^{-}$.

4.1 Structured properties of dynamic programming

To propose the prototypical procedure of proving the existence of a monotonic optimal policy, we first define a ${\mathcal {P}^{\star }\!}$ property as follows:

Definition 4.9 ($\mathcal {P}^{\star }\!$ property).

$f\colon\mathbb{Z}^{n}\rightarrow{\mathbb{R}_{+}}$ has $\mathcal{P}^{\star}$ property in $(\mathbf{x},\mathbf{y})\in{\mathbb{Z}^{n}}$ if $f^{*}(\mathbf{x})=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ has $\mathcal{P}^{\star}$ property in $\mathbf{x}$ and $\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ is monotonic (nondecreasing/nonincreasing) in $\mathbf{x}$.

Theorem 4.10.

Submodularity, $L^{\natural}$-convexity, and multimodularity have $\mathcal{P}^{\star}$ property.

Proof.

It can be directly proved by Lemma 4.3 and Lemma 4.6.

We therefore propose an approach, similar to Proposition 5 in [18], as follows:

Proposition 4.11.

Let DP converge at the $N$th iteration. The optimal value function $V^{*}(\mathbf{x})=V^{(N)}(\mathbf{x})$ has $\mathcal{P}^{\star}$ property, and the optimal policy $\theta^{*}$ is monotonic in $\mathbf{x}$, if:

$C(\mathbf{x},\mathbf{a})$ has $\mathcal{P}^{\star}$ property,
$Q^{(n)}(\mathbf{x},\mathbf{a})=C(\mathbf{x},\mathbf{a})+\beta\sum_{\mathbf{x}'\in{\mathcal{X}}}P_{\mathbf{x}\mathbf{x}^{\prime}}^{\mathbf{a}}V^{(n-1)}(\mathbf{x}')$ has $\mathcal{P}^{\star}$ property for all $\mathcal{P}^{\star}$ property functions $V^{(n-1)}$ and all $n$.

Proof.

Since DP starts from $V^{(0)}(\mathbf{x})=0$ for all $\mathbf{x}\in{\mathcal{X}}$, $Q^{(1)}(\mathbf{x},\mathbf{a})=C(\mathbf{x},\mathbf{a})$ has $\mathcal{P}^{\star}$ property. So $V^{(1)}(\mathbf{x})=\min_{\mathbf{a}\in{\mathcal{A}}}Q^{(1)}(\mathbf{x},\mathbf{a})$ has $\mathcal{P}^{\star}$ property. By induction, assume $V^{(n-1)}(\mathbf{x})$ has $\mathcal{P}^{\star}$ property. Then $Q^{(n)}$ and $V^{(n)}(\mathbf{x})=\min_{\mathbf{a}\in{\mathcal{A}}}Q^{(n)}(\mathbf{x},\mathbf{a})$ have $\mathcal{P}^{\star}$ property. Therefore, $Q^{(N)}(\mathbf{x},\mathbf{a})$ and $V^{*}(\mathbf{x})=V^{(N)}(\mathbf{x})$ must also possess $\mathcal{P}^{\star}$ property, and $\theta^{*}(\mathbf{x})=\arg\min_{\mathbf{a}\in{\mathcal{A}}}Q^{(N)}(\mathbf{x},\mathbf{a})$ is monotonic in $\mathbf{x}$.

4.2 Monotonic policies in queue states

4.2.1 Nondecreasing $a_{i}^{*}$ in $b_{i}$

Let the optimal action be $\mathbf{a}^{*}=\theta^{*}(\mathbf{x})=(\theta_{1}^{*}(\mathbf{x}),\theta_{2}^{*}(\mathbf{x}))$. $a_{i}^{*}=\theta_{i}^{*}(\mathbf{x})$ is the optimal action for queue $i$ determined by $\theta^{*}$. The following theorem shows that the optimal action $a_{i}^{*}$ is monotonic in $b_{i}$, the state of the queue controlled by $a_{i}$, if the unit costs satisfy a certain condition.

Theorem 4.12.

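If $\xi_{o}\geq 2\lambda+\eta+\tau$,⁶ then for all $i\in\{1,2\}$, $C(\mathbf{x},\mathbf{a})$ and $Q^{(n)}(\mathbf{x},\mathbf{a})$ are nondecreasing in $b_{i}$ and $L^{\natural}$-convex in $(b_{i},a_{i})$, $V^{*}(\mathbf{x})$ is nondecreasing and $L^{\natural}$-convex in $b_{i}$, and the optimal action $a^{*}_{i}$ is nondecreasing in $b_{i}$.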

Proof.

We define two functions

$$ \tilde{C}(\mathbf{y},\mathbf{g},\mathbf{a})=\tilde{C}_{h}(\mathbf{y})+C_{t}(\mathbf{g},\mathbf{a}), $$

((14))

where $\tilde {C}_{h}(\mathbf {y})=\sum _{i=1}^{2}h_{i}(y_{i})$ and

$$ \tilde{Q}^{(n)}(\mathbf{y},\mathbf{g},\mathbf{a})=\tilde{C}(\mathbf{y},\mathbf{g},\mathbf{a})+\beta \mathbb{E}_{\mathbf{g}'} \Big[ V_{\mathbf{f}}^{(n-1)}(\mathbf{y},\mathbf{g}') \Big|\mathbf{g} \Big]. $$

((15))

Here,

$$ {\fontsize{7.9pt}{12pt}\selectfont{\begin{aligned} {}V_{\mathbf{f}}^{(n-1)}(\mathbf{y},\mathbf{g}')\,=\,\mathbb{E}_{\mathbf{f}} \Big[\! V^{(n-1)}(\min\{[y_{1}]^{+}\!,L_{1}\}+f_{1}, \min\{[y_{2}]^{+},L_{2}\}+f_{2},\mathbf{g}')\! \Big], \end{aligned}}} $$

((16))

y=(y ₁,y ₂) and f=(f ₁,f ₂). It is easy to see that $C(\mathbf {b},\mathbf {g},\mathbf {a})=\tilde {C}(\mathbf {b}-\mathbf {a},\mathbf {g},\mathbf {a})$ and $Q^{(n)}(\mathbf {b},\mathbf {g},\mathbf {a})=\tilde {Q}^{(n)}(\mathbf {b}-\mathbf {a},\mathbf {g},\mathbf {a})$. Since

$$ \left[ \begin{array}{ccc} b_{i}-a_{i} \\ a_{i} \end{array} \right]=\left[ \begin{array}{ccc} 1 & -1 \\ 0 & 1 \end{array} \right]\left[ \begin{array}{ccc} b_{i} \\ a_{i} \end{array} \right]=-M_{2,2}^{-1}\left[ \begin{array}{ccc} b_{i} \\ a_{i} \end{array} \right], $$

((17))

according to Lemma 4.7(b), it follows that proving the L ^♮-convexity of C(b,g,a) and Q ⁽ⁿ⁾(b,g,a) in (b _i,a _i) is equivalent to showing the multimodularity of $\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})$ and $\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})$ in (y _i,a _i). It is also clear that the monotonicity of C(b,g,a) and Q ⁽ⁿ⁾(b,g,a) in b _i is equivalent to the monotonicity of $\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})$ and $\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})$ in y _i. See Appendix C for the proof of the monotonicity and multimodularity of $\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})$ and $\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})$ in y _i and (y _i,a _i), respectively.

According to Proposition 4.7.3 in [14], V ^∗(x) is nondecreasing in b _i. By Theorem 4.10 and Proposition 4.11, V ^∗(x) is L ^♮-convex in b _i, and $a_{i}^{*}$ is nondecreasing in b _i.

Note, Theorem 4.12 aligns with the existing results in the literature, e.g., the adaptive MIMO transmission control [21] and the Markov game modeled adaptive modulation of cognitive radio [19]. In fact, both of them can be explained by L ^♮-convexity. In [21], the monotonicity of $a_{i}^{*}$ in b _i was shown by the multimodularity in (b _i,−a _i). But,

$$ \left[ \begin{array}{ccc} b_{i} \\ -a_{i} \end{array} \right]=\left[ \begin{array}{ccc} 1 & 0 \\ 0 & -1 \end{array} \right] \left[ \begin{array}{ccc} b_{i} \\ a_{i} \end{array} \right]=-M_{2,1}^{-1}\left[ \begin{array}{ccc} b_{i} \\ a_{i} \end{array} \right] $$

((18))

By Lemma 4.7(b), we know that if a function is multimodular in (b _i,−a _i), then it must be L ^♮-convex in (b _i,a _i). Consequently, V ⁽ⁿ⁾(x) is integer convex in b _i because L ^♮-convexity in one dimension is exactly integer convexity⁷. In [19], the monotonicity of $a_{i}^{*}$ was shown by the submodularity of Q ⁽ⁿ⁾ in (b _i,a _i). But, Q ⁽ⁿ⁾ is a function of b _i−a _i. According to Definition 4.4, the L ^♮-convexity of g(x ₁,x ₂)=f(x ₁−x ₂) in (x ₁,x ₂) is equivalent to the submodularity of g(x ₁,x ₂) in (x ₁,x ₂). So Q ⁽ⁿ⁾ is also L ^♮-convex in (b _i,a _i).

4.2.2 Nondecreasing $a_{i}^{*}$ in (b ₁,b ₂)

We formulate the optimization problem in the nth iteration of DP by a 2-player 2-strategy game, which is called one-stage game in Fig. 3. Assume that action a ₁ is taken by player 1, and a ₂ is taken by player 2. Obviously, it is a pure coordination game where the utility −Q ⁽ⁿ⁾(x,(a ₁,a ₂)) is the same to player 1 and 2.

We prove, in Appendix D, that the one-stage game in Fig. 3 is a supermodular game with utility function −Q ⁽ⁿ⁾(x,(a ₁,a ₂)) strictly supermodular in a=(a ₁,a ₂) for all x and V ⁽ⁿ⁻¹⁾(x ^′) that is L ^♮-convex in $\mathbf {b}'=(b_{1}',b_{2}')$. It is proved in [30] that there exists at least one equilibrium $(a_{1}^{*},a_{2}^{*})$ in the form of pure strategy in a supermodular game. Then, we have the following theorem for the monotonicity of the optimal action $a_{i}^{*}$ in b=(b ₁,b ₂).

Theorem 4.13.

If

(a)
ξ _o≥2λ+η+τ,
(b)
one-stage game (in Fig. 3) has two pure strategy equilibria (0,0) and (1,1) for all x=(b ₁,b ₂,g ₁,g ₂) such that b _i<L _i+1 for all i∈{1,2},

then C(x,a) and Q ⁽ⁿ⁾(x,a) are L ^♮-convex in (b,a)=(b ₁,b ₂,a ₁,a ₂), the optimal value function V ^∗(x) is L ^♮-convex in b=(b ₁,b ₂) and the optimal action $\mathbf {a}^{*}=(a_{1}^{*},a_{2}^{*})$ is nondecreasing in b=(b ₁,b ₂).

Proof.

The proof is in Appendix E.

Here is a corollary of Theorem 4.13.

Corollary 4.14.

If

(a)
ξ _o≥2λ+η+τ,
(b)
p ₁=p ₂=0.5,
(c)
$\beta \leq \frac {2(\tau -\lambda)}{\tau +\eta }$,

then Theorem 4.13 holds.

Proof.

The proof is in Appendix F.
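Because Corollary 4.14 is stated purely in terms of model parameters, its conditions can be checked before running DP. The helper below is a minimal sketch of such a check; the function name, argument names, and the numeric values in the usage lines are hypothetical and are not the settings used for Figs. 6 and 7. It assumes τ+η>0. Since the conditions are only sufficient, a failed check does not imply that the optimal policy is non-monotonic.

```python
def corollary_4_14_conditions(xi_o, lam, eta, tau, p1, p2, beta, tol=1e-12):
    """Return the truth value of conditions (a), (b), (c) of Corollary 4.14.

    xi_o, lam, eta, tau : unit costs (overflow, holding, error, missed coding)
    p1, p2              : Bernoulli packet arrival probabilities
    beta                : discount factor
    Assumes tau + eta > 0 so that the bound in (c) is well defined.
    """
    cond_a = xi_o >= 2 * lam + eta + tau
    cond_b = abs(p1 - 0.5) <= tol and abs(p2 - 0.5) <= tol
    cond_c = beta <= 2 * (tau - lam) / (tau + eta)
    return {"a": cond_a, "b": cond_b, "c": cond_c}


def corollary_4_14_holds(*args, **kwargs):
    """True only if all three sufficient conditions are satisfied."""
    return all(corollary_4_14_conditions(*args, **kwargs).values())


# Hypothetical unit costs: the bound in (c) is 2 * (2 - 1) / (2 + 1) = 2/3.
ok = corollary_4_14_holds(xi_o=5, lam=1, eta=1, tau=2, p1=0.5, p2=0.5, beta=0.6)
too_patient = corollary_4_14_holds(xi_o=5, lam=1, eta=1, tau=2,
                                   p1=0.5, p2=0.5, beta=0.95)
```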

We show examples of Theorems 4.12 and 4.13 in Figs. 4, 5, 6 and 7. The results are collected by value iteration, a DP algorithm, applied on an NC-TWRC system with Bernoulli packet arrivals, 5 queue states, and 8 channel states, i.e., $f_{i}^{(t)}\sim {\text {Bernoulli}(p_{i})}$, L _i=3 and K _i=8 for all t and i∈{1,2}. In Fig. 4, we choose the values of unit costs to make Theorem 4.12 hold. As shown in the figure, the optimal actions $a_{1}^{*}$ and $a_{2}^{*}$ are monotonic in b ₁ and b ₂, respectively, i.e., $a_{i}^{*}$ is nondecreasing in the queue state that is being controlled by a _i. In Fig. 5, we change the value of unit cost ξ _o to breach the condition in Theorem 4.12 so that the monotonicity of $a_{i}^{*}$ in b _i is not guaranteed. In this case, $a_{1}^{*}$ is not monotonic in b ₁.

In Fig. 6, we choose the equiprobable packet arrival rates p ₁=p ₂=0.5 and the unit costs according to Corollary 4.14 to make Theorem 4.13 hold. As shown in the figure, the optimal action $a_{1}^{*}$ and $a_{2}^{*}$ are both nondecreasing in (b ₁,b ₂). As compared to Fig. 4, in this case, $a_{i}^{*}$ is also monotonic in b _−i, the queue state that is affected by the message flow and transmission control in the opposite direction, i.e., the queue state that is not controlled by a _i. In Fig. 7, we switch unit cost η from 1 to 2 so that Theorem 4.13 no longer holds. In this case, neither $a_{1}^{*}$ nor $a_{2}^{*}$ is monotonic in (b ₁,b ₂). But, the condition in Theorem 4.12 is satisfied. Therefore, $a_{1}^{*}$ and $a_{2}^{*}$ are still nondecreasing in b ₁ and b ₂, respectively.

4.3 Monotonic policies in channel states

The related research work in the existing literature considers the structure of the optimal policy in queue state only, e.g., [19, 21, 24]. This section breaks this limitation in that we extend the investigation of the monotonicity to the channel states. The main results are summarized as follows.

Theorem 4.15.

If

(a)
ξ _o≥2λ+η+τ,
(b)
P _e(g _i)≥P _e(g _i+1),
(c)
$P_{g_{i}g_{i}^{\prime }}$ is first order stochastic nondecreasing in g _i,
(d)
$\beta \leq \frac {P_{e}(g_{i})-P_{e}(g_{i}+1)}{\sum _{g'_{i}}P_{g_{i}g'_{i}} (P_{e}(g'_{i})-P_{e}(g'_{i}+1)) }$,

then C(x,a) and Q ⁽ⁿ⁾(x,a) are submodular in (b _i,g _−i,a _i), V ^∗(x) is submodular in (b _i,g _−i), and the optimal action $a_{i}^{*}$ is nondecreasing in (b _i,g _−i).

Proof.

The proof is in Appendix G.

In Theorem 4.15, condition (b) is straightforwardly satisfied because of the definition of P _e in (8) and assumption A4. Conditions (c) and (d) depend on the fading statistics and the FSMC modeling method. In fact, condition (c) is not hard to satisfy.

Corollary 4.16.

If the FSMC of channel i adopts equiprobable partitioning (of the full range of SNR), and channel i experiences slow and flat Rayleigh fading, then condition (c) in Theorem 4.15 is satisfied.

Proof.

The proof is in Appendix H.
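Condition (c) of Theorem 4.15 can also be verified numerically for a given channel transition matrix. The sketch below uses the usual definition of first order stochastic monotonicity, namely that every tail sum $\sum _{g'\geq k}P_{gg'}$ is nondecreasing in g. The function name and the two toy matrices are assumptions for illustration and are not the FSMC obtained from the Rayleigh fading model.

```python
def is_stochastically_nondecreasing(P, tol=1e-12):
    """Check first order stochastic monotonicity of a transition matrix.

    P[g][g2] is the probability of moving from channel state g to g2.
    The rows are stochastically nondecreasing in g if, for every level k,
    the tail probability sum(P[g][k:]) does not decrease as g grows.
    """
    n = len(P)
    for k in range(n):
        tails = [sum(row[k:]) for row in P]
        if any(tails[g] > tails[g + 1] + tol for g in range(n - 1)):
            return False
    return True


# Hypothetical birth-death chain: transitions only to adjacent states.
P_birth_death = [[0.50, 0.50, 0.00],
                 [0.25, 0.50, 0.25],
                 [0.00, 0.50, 0.50]]

# Hypothetical counterexample: the chain flips between the two states.
P_flip = [[0.0, 1.0],
          [1.0, 0.0]]
```

Birth-death chains of this kind, where a slowly varying channel only moves to adjacent states, are a typical case in which the tail sums are ordered.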

We show examples of Theorem 4.15 in Figs. 8 and 9. In Fig. 8, we use the same system parameters as in Fig. 4 except that the discount factor β is switched from 0.97 to 0.95 in order to satisfy the inequality in condition (d) of Theorem 4.15. The results are obtained from an NC-TWRC system where the channels experience slow and flat Rayleigh fading with average SNR $\overline {\gamma }_{1}=\overline {\gamma }_{2}=0\text {dB}$. Both FSMCs are 8-state and adopt equiprobable partition method. In this case, all the conditions in Theorem 4.15 are satisfied according to Corollary 4.16. Therefore, $a_{1}^{*}$ is nondecreasing in (b ₁,g ₂), and $a_{2}^{*}$ is nondecreasing in (b ₂,g ₁). In Fig. 9, we switch $\overline {\gamma }_{2}$ from 0dB to 3dB to breach condition (d) in Theorem 4.15. In this case, $a_{1}^{*}$ is not monotonic in g ₂. But, since Theorem 4.12 still holds, $a_{1}^{*}$ and $a_{2}^{*}$ are monotonic in b ₁ and b ₂, respectively.

Note that the related previous studies usually placed constraints on the environments or the DP functions in order to prove the structure in the optimal policy. For example, in [19] the submodularity of the state transition probability was proved by assuming uniformly distributed traffic rates, and in [31], the strict submodularity of Q ⁽ⁿ⁾ in DP iterations was assumed to be preserved by a weight factor in the immediate cost function (however, the exact value of this factor was not given). In contrast, the basic result in this paper, Theorem 4.12, is essentially given in terms of unit costs and discount factor, the parameters in the MDP model. The practical meaning of Theorem 4.12 can be interpreted in two ways. If the unit costs and discount factor are adjustable, we can tune them to get a structured optimal policy. If they are given, we can check the sufficient conditions for the existence of a monotonic optimal policy after the MDP modeling. In addition, we also derive the results, Theorems 4.13 and 4.15, by considering the uniform traffic rates, stochastic dominance of channel transition probabilities and channel modeling, and modulation scheme in this paper. They are also applicable if the associated conditions are satisfied.

5 Low complexity algorithms

This section considers the question of how to exploit the results in Section 4 to simplify the optimization process of problem (11). For this purpose, we present MPI and DSPSA algorithms for the MDP model in Section 3.

5.1 Monotonic policy iteration

The idea of MPI is to modify (12) as

$$\begin{array}{*{20}l} V^{(n)}(\mathbf{x})=\min_{\mathbf{a}\in{\mathcal{A}(\mathbf{x})}}Q^{(n)}(\mathbf{x},\mathbf{a}), \forall{\mathbf{x}} \end{array} $$

((19))

where $\mathcal {A}(\mathbf {x})\subseteq {\mathcal {A}}$ is a selection of actions in $\mathcal {A}=\{0,1\}^{2}=\{(0,0),(0,1),(1,0),(1,1)\}$. Let $\theta ^{(n)}(\mathbf {x})=\arg \min _{\mathbf {a}\in \mathcal {A}(\mathbf {x})}Q^{(n)}(\mathbf {x},\mathbf {a})$. Note, θ ⁽ⁿ⁾(x) can be obtained at the same time when V ⁽ⁿ⁾(x) is calculated. We express θ ⁽ⁿ⁾(x) as

$$\begin{array}{*{20}l} \theta^{(n)}(\mathbf{x}) &= \theta^{(n)}(b_{1},b_{2},g_{1},g_{2}) \\ &=\left(\theta_{1}^{(n)}(b_{1},b_{2},g_{1},g_{2}),\theta_{2}^{(n)}(b_{1},b_{2},g_{1},g_{2})\right). \end{array} $$

((20))

Assume that Theorem 4.13 holds. We can define the action selection set $\mathcal {A}(\mathbf {x})$ as follows. Due to the L ^♮-convexity of Q ⁽ⁿ⁾ in (b _i,a _i), $\theta _{i}^{(n)}$ is always nondecreasing in b _i. Therefore, we can define $\mathcal {A}(\mathbf {x})$ as

$$ \begin{aligned} {}\mathcal{A}(\mathbf{x})&\,=\, \left\{\! a_{1}\!\in\{0,1\} \colon a_{1} \!\geq\! \theta_{1}^{(n)}([b_{1}-1]^{+}\!,[\!b_{2}-1]^{+},g_{1},g_{2})\right\} \\ &\!\!\!\quad\times\! \left\{\! a_{2}\!\in\!\{0,1\} \colon\! a_{2} \!\geq\! \theta_{2}^{(n)}([b_{1}-1]^{+}\!,[b_{2}-1]^{+}\!,g_{1},g_{2})\!\right\} \end{aligned} $$

when b ₁≠0 and b ₂≠0 and $\mathcal {A}(\mathbf {x})=\mathcal {A}$ when b ₁=b ₂=0. For example, consider the case when g ₁=1 and g ₂=1 at some iteration n. We need to determine the value of θ ⁽ⁿ⁾(x) for all x=(b ₁,b ₂,g ₁,g ₂) such that g ₁=1 and g ₂=1. We start with the lowest values of b ₁ and b ₂. For state x=(0,0,1,1), we have $\mathcal {A}(\mathbf {x})=\mathcal {A}=\{0,1\}^{2}$. In this case, the minimization problem $\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})$ is equivalent to $\min _{\mathbf {a}\in {\mathcal {A}}}Q^{(n)}(\mathbf {x},\mathbf {a})$, i.e., we need to obtain four values of Q ⁽ⁿ⁾(x,a) at a=(0,0), (0,1), (1,0), and (1,1) to determine the minimum. If we get θ ⁽ⁿ⁾(x)=(0,1) for x=(0,0,1,1), then $\mathcal {A}(\mathbf {x})=\{0,1\}\times \{1\}=\{(0,1),(1,1)\}$ for x=(0,1,1,1), (1,0,1,1) and (1,1,1,1). It means that only two calculations of Q ⁽ⁿ⁾(x,a) are required when we want to determine the value of $\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})$ for these three states. In addition, if we find that θ ⁽ⁿ⁾(x)=(1,1) for x=(1,1,1,1), then, for all x such that b ₁>1, b ₂>1, g ₁=1 and g ₂=1, $\mathcal {A}(\mathbf {x})=\{(1,1)\}$ and we can directly assign θ ⁽ⁿ⁾(x)=(1,1) without doing the minimization $\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})$. We can find the optimal policy by repeating this process for all values of g ₁ and g ₂ in each iteration. From this example, it can be seen that (19) should be conducted in the increasing order of b ₁ and b ₂ so that the cardinality of set $\mathcal {A}(\mathbf {x})$ is progressively reducing.
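The sweep described in this example can be sketched as follows for one fixed pair of channel states. This is a minimal sketch, not the authors' implementation: `q` is a hypothetical callable standing in for Q ⁽ⁿ⁾(x,a) at fixed (g ₁,g ₂), and the toy cost at the bottom is invented so that the unrestricted minimizer is monotone in (b ₁,b ₂).

```python
from itertools import product


def mpi_sweep(q, n_b1, n_b2):
    """One MPI sweep over queue states for fixed channel states.

    q(b1, b2, a) -> float plays the role of Q^(n)(x, a), a = (a1, a2).
    States are visited in increasing order of b1 and b2, and the action
    set is restricted to a_i >= theta_i at ([b1-1]^+, [b2-1]^+).
    Returns (theta, value, n_evals), where n_evals counts calls to q.
    """
    theta, value, n_evals = {}, {}, 0
    for b1 in range(n_b1):
        for b2 in range(n_b2):
            if b1 == 0 and b2 == 0:
                lo1, lo2 = 0, 0  # A(x) is the full action set at the origin
            else:
                lo1, lo2 = theta[(max(b1 - 1, 0), max(b2 - 1, 0))]
            candidates = list(product(range(lo1, 2), range(lo2, 2)))
            scores = {a: q(b1, b2, a) for a in candidates}
            n_evals += len(scores)
            best = min(scores, key=scores.get)
            theta[(b1, b2)], value[(b1, b2)] = best, scores[best]
    return theta, value, n_evals


def toy_q(b1, b2, a):
    """Invented separable cost: transmitting on queue i pays off iff b_i >= 2."""
    return a[0] * (1.5 - b1) + a[1] * (1.5 - b2)


theta, value, n_evals = mpi_sweep(toy_q, 4, 4)
```

In this sketch a state whose restricted set has a single element still costs one evaluation, because V ⁽ⁿ⁾(x) is needed for the next iteration; only the comparison among actions is skipped. The restriction is valid only when the monotonicity conditions hold, otherwise the sweep may discard the true minimizer.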

5.2 Discrete simultaneous perturbation stochastic approximation

Assume Theorem 4.12 holds.⁸ Due to the monotonicity of the optimal policy in queue states, the optimization problem (11) can be converted to a minimization problem over a set of queue thresholds.

For i∈{1,2}, we define $\phi _{i}(b_{-i},g_{1},g_{2})\in \mathcal {B}_{i}$ as

$$ \phi_{i}(b_{-i},g_{1},g_{2})=\min\{b_{i} \colon \theta_{i}(\mathbf{x})=1\}, \forall{b_{-i},g_{1},g_{2}}. $$

((21))

Here, ϕ _i(b _−i,g ₁,g ₂) is the threshold to queue i when the other user’s queue state is b _−i and channel states are g ₁ and g ₂. Let ϕ _i be constructed by stacking ϕ _i for all $(b_{-i},g_{1},g_{2})\in \mathcal {B}_{-i}\times {\mathcal {G}_{1}}\times {\mathcal {G}_{2}}$. The queue threshold vector is defined as ϕ=(ϕ ₁,ϕ ₂). In Fig. 10, we show the optimal queue threshold vector $\boldsymbol {\phi }^{*}=(\boldsymbol {\phi }_{1}^{*},\boldsymbol {\phi }_{2}^{*})$ where

$$ \phi_{i}^{*}(b_{-i},g_{1},g_{2})=\min\{b_{i} \colon \theta_{i}^{*}(\mathbf{x})=1\}, \forall{b_{-i},g_{1},g_{2}} $$

((22))

and θ ^∗ is the optimal policy obtained in Fig. 4. Each queue threshold vector ϕ=(ϕ ₁,ϕ ₂) determines a deterministic policy $\theta _{\boldsymbol {\phi }}(\mathbf {x})=(\theta _{1\boldsymbol {\phi }_{1}}(\mathbf {x}),\theta _{2\boldsymbol {\phi }_{2}}(\mathbf {x}))\phantom {\dot {i}\!}$ by

$$ \theta_{i\boldsymbol{\phi}_{i}}(\mathbf{x})=\mathcal{I}_{\{b_{i}\geq{\phi_{i}(b_{-i},g_{1},g_{2})\}}}= \left\{\begin{array}{ll} 1 & b_{i}\geq \phi_{i}(b_{-i},g_{1},g_{2})\\ 0 & b_{i}<\phi_{i}(b_{-i},g_{1},g_{2}) \end{array}\right.. $$

((23))
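The maps in (21) and (23) can be illustrated for a single queue with b _−i, g ₁ and g ₂ held fixed. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and a policy that never transmits is encoded as a threshold one past the largest queue state, a convention chosen here and not taken from the paper.

```python
def threshold_from_actions(actions):
    """Eq. (21) for fixed (b_-i, g1, g2): smallest b_i with action 1.

    actions[b] in {0, 1} is the action of queue i at occupancy b.
    Assumption: if the queue never transmits, return len(actions).
    """
    for b, a in enumerate(actions):
        if a == 1:
            return b
    return len(actions)


def actions_from_threshold(phi, n_states):
    """Eq. (23): transmit (1) exactly when b_i >= phi."""
    return [1 if b >= phi else 0 for b in range(n_states)]


# A monotone policy survives the round trip ...
monotone = [0, 0, 1, 1, 1]
recovered = actions_from_threshold(threshold_from_actions(monotone), 5)

# ... while a non-monotone one is not representable by a single threshold.
non_monotone = [0, 1, 0, 1, 1]
distorted = actions_from_threshold(threshold_from_actions(non_monotone), 5)
```

The round trip returns the original actions only for a nondecreasing policy, which is why the threshold parametrization relies on Theorem 4.12.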

Since $\theta _{i}^{*}$ is nondecreasing in b _i for all i∈{1,2} if Theorem 4.12 holds and θ ^∗ determines ϕ ^∗ via (22), finding the optimal policy θ ^∗ is equivalent to finding the optimal queue threshold vector ϕ ^∗. We can convert problem (11) to

$$ \min_{\boldsymbol{\phi}} J(\boldsymbol{\phi}), $$

((24))

where

$$ J(\boldsymbol{\phi})=\sum_{\mathbf{x}^{(0)}\in\mathcal{X}}\mathbb{E}\left[\sum_{t=0}^{\infty} \beta^{t} C(\mathbf{x}^{(t)},\theta_{\boldsymbol{\phi}}(\mathbf{x}^{(t)}))|\mathbf{x}^{(0)}\right]. $$

((25))

The advantage of formulating problem (24) is that the solutions can be approximated by the DSPSA algorithm [32] presented in Algorithm 1. The parameters/functions in this algorithm are explained as follows

$\hat {J}(\boldsymbol {\phi })$ is an estimation of J at ϕ that is obtained by simulation. The method is to simulate the state sequence {x ^(t)} governed by the transition probability $Pr\left (\mathbf {x}^{(t+1)}|\mathbf {x}^{(t)}\right)=P_{\mathbf {x}^{(t)} \mathbf {x}^{(t+1)}}^{\theta _{\boldsymbol {\phi }}(\mathbf {x}^{(t)})}$ for all $\mathbf {x}^{(0)}\in \mathcal {X}$. $\hat {J}(\boldsymbol {\phi })$ is obtained as
$$\begin{array}{*{20}l} \hat{J}(\boldsymbol{\phi})=\sum_{\mathbf{x}^{(0)}\in\mathcal{X}}\sum_{t=0}^{T}\beta^{t}C\left(\mathbf{x}^{(t)}, \theta_{\boldsymbol{\phi}}(\mathbf{x}^{(t)})\right). \end{array} $$
((26))

Each simulation stops if the increments over several successive decision epochs fall below a small threshold (10⁻⁵), i.e., the simulation length is finite.
The step size parameters A, B, and α are crucial for the convergence performance of DSA algorithms. In this paper, we set A=0.3, B=100, and α=0.602. These values are found by adopting the method suggested in [33] for practical problems where the computation budget N, the total number of iterations, is fixed: B=0.095N, α=0.602 and A is chosen so that $A/(B+1)^{\alpha }\|\mathbf {d} \left (\tilde {\boldsymbol {\phi }}^{(0)}\right)\|=0.1$.
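The simulation-based estimate in (26) with the stopping rule above can be sketched as follows. The simulator interface `step(x, a, rng)`, the `patience` parameter (the number of successive small increments that triggers the stop), and the constant-cost toy model are all assumptions for this sketch; the paper only states that the simulation stops after several successive small increments.

```python
import random


def discounted_cost_from(x0, step, cost, policy, beta,
                         tol=1e-5, patience=5, max_steps=100_000, rng=None):
    """Simulate one trajectory from x0 and accumulate beta^t * C(x, a).

    step(x, a, rng) -> next state (hypothetical simulator interface)
    cost(x, a)      -> immediate cost
    policy(x)       -> action, e.g. a threshold policy as in (23)
    """
    rng = rng or random.Random(0)
    x, total, discount, small = x0, 0.0, 1.0, 0
    for _ in range(max_steps):
        a = policy(x)
        increment = discount * cost(x, a)
        total += increment
        small = small + 1 if abs(increment) < tol else 0
        if small >= patience:
            break
        x = step(x, a, rng)
        discount *= beta
    return total


def j_hat(states, step, cost, policy, beta, **kwargs):
    """Estimate of J as in (26): sum of simulated costs over all x^(0)."""
    return sum(discounted_cost_from(x0, step, cost, policy, beta, **kwargs)
               for x0 in states)


# Toy check with a constant unit cost: each start contributes about 1/(1-beta).
estimate = j_hat([0, 1, 2],
                 step=lambda x, a, rng: x,
                 cost=lambda x, a: 1.0,
                 policy=lambda x: 0,
                 beta=0.5)
```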

The DSPSA algorithm is in fact a line search algorithm. It starts with any initial guess ϕ ⁽⁰⁾, say ϕ ⁽⁰⁾=0, and iteratively updates the guess by the estimated descent direction −a ⁽ⁿ⁾ d(ϕ ⁽ⁿ⁾). The gradient d(ϕ ⁽ⁿ⁾) in each iteration is obtained based on two values of $\hat {J}$, $\hat {J} \left (\lfloor \boldsymbol {\phi }^{(n)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2} \right)$ and $ \hat {J} \left (\lfloor \boldsymbol {\phi }^{(n)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2} \right)$.⁹ According to a study in [34], the estimation sequence {ϕ ⁽ⁿ⁾} slowly converges to the optimal queue threshold vector ϕ ^∗.
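Algorithm 1 is not reproduced in this section, so the following is only a sketch of a DSPSA-style update consistent with the description above. It assumes a Bernoulli ±1 perturbation Δ, the gain sequence a ⁽ⁿ⁾=A/(B+n+1)^α, and rounding of the final iterate to the nearest integer vector. The function name and the one-dimensional toy objective are hypothetical, and the projection of thresholds onto the feasible set is omitted.

```python
import math
import random


def dspsa(j_hat, phi0, n_iter=1000, A=0.3, B=100, alpha=0.602, seed=0):
    """Minimize j_hat over integer vectors with a DSPSA-style iteration.

    j_hat(phi) -> float is a (possibly noisy) estimate of J at the
    integer vector phi. Assumed update rule, see the lead-in text.
    """
    rng = random.Random(seed)
    phi = [float(v) for v in phi0]
    for n in range(n_iter):
        delta = [rng.choice((-1, 1)) for _ in phi]
        base = [math.floor(v) for v in phi]
        # The two evaluation points floor(phi) + (1 +/- delta) / 2.
        plus = [b + (1 + d) // 2 for b, d in zip(base, delta)]
        minus = [b + (1 - d) // 2 for b, d in zip(base, delta)]
        diff = j_hat(plus) - j_hat(minus)
        a_n = A / (B + n + 1) ** alpha
        # Elementwise division by delta equals multiplication, as delta = +/-1.
        phi = [v - a_n * diff * d for v, d in zip(phi, delta)]
    return [round(v) for v in phi]


# Hypothetical noise-free one-dimensional objective with minimizer 3.
best = dspsa(lambda p: (p[0] - 3) ** 2, [0])
```

Each iteration needs only two evaluations of $\hat {J}$ regardless of the dimension of ϕ, which is the property used in the complexity discussion of Section 5.3.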

5.3 Complexity

MPI is in fact a modified DP algorithm that exploits L ^♮-convexity or submodularity of Q ⁽ⁿ⁾. It converges at the same rate as DP. But, since $\mathcal {A}(\mathbf {x})\subseteq \mathcal {A}$ and $|\mathcal {A}(\mathbf {x})|$, the cardinality of $\mathcal {A}(\mathbf {x})$, is progressively decreasing in b ₁ and b ₂, the complexity in each iteration in MPI is lower than that in DP. Let ρ be the average size of $\mathcal {A}(\mathbf {x})$ over all states x. The complexity in one iteration of MPI is $O(|\mathcal {X}|^{2}\rho)$, where $\rho \leq |\mathcal {A}|$. The exact value of ρ varies with different systems. To show the examples of the actual complexity of MPI, we do the following experiment. We use the same system settings as in Fig. 6 and set the number of channel states of both channels to K, i.e., K ₁=K ₂=K. We vary K from 2 to 10. For each value of K, we run both DP and MPI and obtain the complexity as the number of calculations of Q ⁽ⁿ⁾ averaged over iterations. The results are shown in Fig. 11. It can be seen that the complexity of MPI is always less than that of DP, and MPI alleviates the drastically growing complexity in DP when the size of the channel state space grows large.

Consider the complexity of the DSPSA algorithm. Let ζ be the complexity of obtaining the value of $\hat {J}$ by simulation. Since we only need two values of $\hat {J}$ to calculate the gradient d, the complexity in each iteration of DSPSA is O(ζ). But, the convergence rate depends on the parameters of the DSPSA algorithm [35], e.g., the step size parameters, and may vary with different MDP systems, i.e., DSPSA may converge slower than DP or MPI. However, we have two advantages of implementing DSPSA algorithm over DP or MPI. One is that DSPSA is a simulation-based algorithm, the runs of which do not require the full knowledge of the MDP model. Based on (26), to obtain $\hat {J}$, we only require the knowledge of the state space $\mathcal {X}$ and a simulation model that can generate a state sequence {x ^(t)} based on a given queue threshold vector ϕ and the statistics of packet arrival and channel variation processes. If the packet arrival probabilities and/or channel statistics change suddenly, the optimal policy will change accordingly, and DSPSA algorithm can adapt slowly to the new optimal policy.

The results in Fig. 12 are based on an experiment of DSPSA in an environment where the system parameters change with time. The relay is assumed to serve the first pair of users with packet arrival probabilities being p ₁=0.1 and p ₂=0.2 in the first 500 iterations and serve another pair of users with p ₁=0.8 and p ₂=0.2 in the second 500 iterations. It can be seen that DSPSA is able to adaptively track the optimum and optimizer of problem (24). In contrast, to run DP or MPI, we require the full knowledge of the MDP model. If the statistics of packet arrival and channel variation processes change, we need to determine the new MDP model by calculating all values of the state transition probability $P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}}$ before running DP or MPI. Alternatively speaking, MPI and DP are model-based algorithms while DSPSA is a model-free algorithm [36].

The other advantage of DSPSA is that it allows the scheduler to learn the optimal policy online. For example, assume that we start with an arbitrary threshold vector ϕ ⁽⁰⁾. We first let the scheduler adopt the policy that is determined by the queue threshold vector $\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2}$ (via (23)) for a while and obtain the value of $\hat {J}\left (\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2}\right)$ based on the actual immediate costs incurred. Then, we let the scheduler adopt the policy that is determined by $\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2}$ for a while and obtain the value of $\hat {J}\left (\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2}\right)$. By doing so, the gradient d can be calculated, and we update ϕ ⁽⁰⁾ and get a new queue threshold vector ϕ ⁽¹⁾. By repeating this process, the scheduler can slowly update the estimation ϕ ⁽ⁿ⁾ towards ϕ ^∗ and hence find the optimal policy θ ^∗.

It should be noted that low complexity algorithms for searching or approximating the optimal policy θ ^∗ are not restricted to MPI and DSPSA. With the results on monotonicity derived in Section 4, one can propose other algorithms, e.g., the random search method [37] or the simulated annealing method [38], the complexity of which could be even lower than MPI and DSPSA. For example, the random search method [37] can be applied to find the solution of the multivariate minimization problem (24). In this method, the descent direction is found by random sampling in each iteration. The complexity incurred by random sampling could be lower than that incurred by simulation (as in DSPSA). But, we still need to compare the convergence rates of the random search and DSPSA algorithms. In summary, MPI and DSPSA are two examples of low complexity algorithms that are based on the monotonicity of the optimal policy. Proposing more low complexity algorithms and comparing their convergence performance are beyond the scope of this paper and could be one of the future directions of research.

5.4 Simulation results

We run simulations in an NC-TWRC with Rayleigh fading channels. Let DP-MDP-QC be the optimal policy searched by DP based on the MDP model in Section 3. We compare the performance of DP-MDP-QC to the following four policies:

DSPSA-MDP-QC: This policy is searched by DSPSA based on the MDP model in Section 3 with the total number of iterations being N=1000. As explained in Section 5.2, the estimation sequence produced by the DSPSA algorithm should be slowly converging to the optimal policy. Therefore, DSPSA-MDP-QC should be close to DP-MDP-QC (in Euclidean distance) and the performance of DSPSA-MDP-QC should be similar to that of DP-MDP-QC.
MYO-QC: This policy is obtained by $\theta _{\text {MYO-QC}}(\mathbf {x})=\arg \min _{\mathbf {a}} C(\mathbf {x},\mathbf {a})$, where C(x,a) is the immediate cost function as defined in (9). Recall that policy DP-MDP-QC searched by DP is $\theta ^{*}(\mathbf {x})=\theta ^{(N)}(\mathbf {x})=\arg \min _{\mathbf {a}} C(\mathbf {x},\mathbf {a})+\beta \sum _{\mathbf {x}'} P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}} V^{(N-1)}(\mathbf {x}')$, where N is the iteration index when DP converges. MYO-QC is the policy that neglects the expected future cost $\beta \sum _{\mathbf {x}'} P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}} V^{(N-1)}(\mathbf {x}')$ that is incurred by the action taken at the current decision epoch. Alternatively speaking, MYO-QC is myopic while DP-MDP-QC is far-sighted.¹⁰ In a stochastic environment, myopic policies usually incur a higher expected long-term cost than far-sighted ones.
AT: This policy is denoted as θ _AT(x)=(θ _AT1(x),θ _AT2(x)) where θ _ATi(x)=1 if b _i≠0, i.e., always transmit whenever queue i is not empty. This policy minimizes the costs incurred by the packet delay and queue overflow. But, the performance of this policy should not be as good as DP-MDP-QC if the purpose is to minimize the long-term cost incurred by not only the packet delay and queue overflow but also the transmission power consumption and downlink transmission error rate.
DP-MDP-Q: This policy is determined by DP based on an MDP model that is the same as the one in Section 3 except that the immediate cost function is defined as C(x,a)=C _h(b,a)+t _r(a). This policy was proposed in [12], where the authors assume that the channels are lossless so that the packet error cost $\sum _{i=1}^{2}\text {err}(g_{-i},a_{i})=0$ always. However, the wireless channels are usually not ideal in practice. If we adopt policy DP-MDP-Q, it should incur a higher downlink transmission error rate than DP-MDP-QC.
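The difference between the myopic, far-sighted and always-transmit rules can be made concrete with a small sketch. The data layout `C[x][a]`, `P[a][x][y]`, the value vector `V` (standing in for V ⁽ᴺ⁻¹⁾) and the two-state toy numbers are hypothetical and serve only to show a case where the myopic and far-sighted choices differ.

```python
def myopic_action(C, x):
    """MYO-QC rule: minimize the immediate cost only."""
    return min(range(len(C[x])), key=lambda a: C[x][a])


def farsighted_action(C, P, V, beta, x):
    """DP rule: minimize immediate cost plus discounted expected future cost."""
    n = len(V)
    return min(range(len(C[x])),
               key=lambda a: C[x][a] + beta * sum(P[a][x][y] * V[y]
                                                  for y in range(n)))


def always_transmit(b):
    """AT rule: transmit from queue i whenever b_i is nonzero."""
    return tuple(1 if b_i != 0 else 0 for b_i in b)


# Toy model: in state 1, action 0 is slightly cheaper now but keeps the
# system in the expensive state 1; action 1 moves it to the cheap state 0.
C = [[0.0, 0.5], [0.4, 0.5]]
P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
     [[1.0, 0.0], [1.0, 0.0]]]   # action 1: go to state 0
V = [0.0, 5.0]                   # assumed value estimates, not computed here
a_myopic = myopic_action(C, 1)
a_far = farsighted_action(C, P, V, 0.9, 1)
```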

We fix p ₂=0.5 and vary p ₁ from 0.2 to 0.6. The other system parameters are the same as in Fig. 4. A simulation lasting for 10⁵ decision epochs is run for each value of p ₁. Each packet contains 100 bits, i.e., the packet length L _P=100. We obtain the number of holding and overflowing packets and the number of transmissions averaged over decision epochs. The former indicates the mean packet delay and queue overflow costs, and the latter indicates the average transmission power consumption. The transmission error rate is calculated as the ratio of the number of erroneous bits received to the total number of bits sent. We also obtain the immediate cost averaged over decision epochs, which indicates the long-term cost (the minimand in (11)). The results are presented in Fig. 13. It can be seen that the average immediate cost of DSPSA-MDP-QC almost overlaps with that of DP-MDP-QC. It means that if we allow the total number of iterations in the DSPSA algorithm large enough, e.g., 1000 iterations, it is able to converge to a policy that is very close to DP-MDP-QC.

For policy MYO-QC, we can see that it always incurs a greater number of transmissions and holding and overflow packets and a higher transmission error rate than DP-MDP-QC. The average immediate cost of this policy is at least 0.23 higher than that of DP-MDP-QC, which makes MYO-QC the worst among all policies. Therefore, a far-sighted policy outperforms a myopic one when we want to minimize the long-term cost in a stochastic system.

The number of holding and overflow packets incurred by policy AT is always zero. However, it results in the highest number of transmissions and transmission error rate. The average immediate cost incurred by AT is at least 0.09 higher than DP-MDP-QC, which justifies our expectation; AT minimizes the packet delay and queue overflow costs but incurs higher transmission power consumption and downlink transmission error rate. Therefore, the long-term cost incurred by AT is not as low as that incurred by DP-MDP-QC. For policy DP-MDP-Q, the number of transmissions is almost the same as DP-MDP-QC, and the number of holding and overflow packets is even lower than DP-MDP-QC. However, since this policy assumes that the wireless channels are ideal (but they are in fact not), the transmission error rate is about 1.3 times higher than DP-MDP-QC (almost as high as AT). Therefore, the average immediate cost is still higher than DP-MDP-QC. In summary, in a stochastic environment where the long-term loss can be incurred by multiple causes, the policy that considers all such causes simultaneously outperforms those that only consider some and neglect others.

6 Conclusion

This paper studied an MDP-modeled transmission control problem in NC-TWRC with random traffic and fading channels. The purpose was to prove the existence of a monotonic optimal transmission policy that minimized packet delay, queue overflow, transmission power, and the downlink transmission error rate in the long run. We proved that the optimal policy is nondecreasing in queue and/or channel states by investigating how certain properties (submodularity, L ^♮-convexity and multimodularity) varied with the system parameters. Based on these properties of DP, we presented two low-complexity algorithms, MPI and DSPSA.

As a part of the conclusion, we point out two directions for the research work in the future. The structured results derived in Section 4 can be used to design model-free learning algorithms, e.g., monotonic Q-learning. Since queue-assisted transmission control is also used in cross-layer variable-rate adaptive modulation problems, it would be of interest if we can use submodularity, L ^♮-convexity, and multimodularity to establish the sufficient conditions for the existence of monotonic optimal transmission policies in these systems.

7 Endnotes

¹ The complexity of the algorithm grows drastically with the cardinality of the system variables [16].

² The definition of Pareto optimality is given in Appendix A. In Section 3.5, we will explain the Pareto optimality of the optimal policy of MDP.

³ In this paper, we use ε=10⁻⁵.

⁴ See the definition of Pareto optimality and description of scalarization technique in Appendix A. The Pareto optimality of θ ^∗ has also been discussed in [31].

⁵ $f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{-}}$ is (strictly) supermodular if −f is (strictly) submodular.

⁶ The interpretation of ξ _o≥2λ+η+τ is that the cost of overflowing a packet is greater than or equal to the sum of the cost of holding two packets, the cost when transmission error rate is increased by η and the cost of missing a coding opportunity.

⁷ In [21], integer convexity was used to denote the one dimensional discrete convexity as explained in Lemma B.1(b).

⁸ According to the conditions in Theorems 4.12, 4.13 and 4.15, Theorem 4.12 is straightforwardly satisfied if either Theorem 4.13 or Theorem 4.15 holds. Therefore, if DSPSA can be applied when Theorem 4.12 holds, it can also be applied when Theorem 4.13 or 4.15 holds.

⁹ The gradient d in (27) is defined based on the discrete mid-point convexity [32].

¹⁰ More comparisons of far-sighted and myopic policies in NC-TWRC are presented in [13].

¹¹ The one dimensional discrete convex function $h:\mathbb {Z}\rightarrow {\mathbb {R}}$ satisfies h(x + 1) + h(x − 1) − 2h(x)≥0 for all $x\in {\mathbb {Z}}$. Moreover, by Definition 4.4 and 4.5, h is both L ^♮-convex and multimodular.

¹² A function $f\colon \mathbb {Z}^{2}\rightarrow {\mathbb {R}_{+}}$ is multimodular if and only if it is (1) supermodular: Δ _i Δ _j f(x)≥0 and (2) superconvex: Δ _i f(x + e _i)≥Δ _i f(x + e _j) for all i,j∈{1,2}, where Δ _i f(x) = f(x) − f(x − e _i) and $\mathbf {e}_{i}\in {\mathbb {Z}^{2}}$ is a 2-tuple with all zero entries except the ith entry being one.

8 Appendix A

In multi-objective optimization [39], there are N optimization metrics. Each of them can be quantified by a loss function $f_{n}\colon \mathbb {R}^{M}\rightarrow {\mathbb {R}}$. The problem can be expressed as

$$ \min_{\mathbf{x}\in\mathbb{R}^{M}} (\;f_{1}(\mathbf{x}),f_{2}(\mathbf{x}),\dotsc,f_{N}(\mathbf{x})), $$

((27))

where x is the decision vector. We say x Pareto dominates x ^′ if f _n(x)≤f _n(x ^′) for all n∈{1,…,N} and f _n(x)<f _n(x ^′) for at least one n. We call x ^∗ a Pareto optimal decision vector if no $\mathbf {x}\in \mathbb {R}^{M}$ Pareto dominates x ^∗. In a multi-objective optimization problem, we always want to seek a Pareto optimal solution. One way to solve this problem is the scalarization technique. The idea is to convert (27) to a single-objective problem

$$ \min_{\mathbf{x}\in\mathbb{R}^{M}}w_{1}\,f_{1}(\mathbf{x})+w_{2}\,f_{2}(\mathbf{x})+\dotsc+w_{N}\,f_{N}(\mathbf{x}), $$

((28))

where w _n>0 is the weight. It is shown in [39] that the solution of problem (28) is a Pareto optimal solution of problem (27). Note that, by the definition of Pareto optimality, a Pareto optimal solution is not necessarily optimal with respect to any single optimization metric considered alone.
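The scalarization step in (28) can be illustrated with a minimal sketch. It assumes a finite candidate set instead of $\mathbb {R}^{M}$, two hypothetical quadratic losses, and the standard form of Pareto dominance (no worse in every metric, strictly better in at least one). The names `dominates`, `scalarize` and `is_pareto_optimal` are illustrative and do not come from the paper.

```python
def dominates(fx, fy):
    """Standard Pareto dominance: fx is no worse in every metric and better in one."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better


def scalarize(candidates, losses, weights):
    """Return the candidate minimizing the weighted sum of losses, as in (28)."""
    return min(candidates, key=lambda x: sum(w * f(x) for w, f in zip(weights, losses)))


def is_pareto_optimal(x_star, candidates, losses):
    """True if no candidate Pareto dominates x_star."""
    f_star = [f(x_star) for f in losses]
    return not any(dominates([f(x) for f in losses], f_star) for x in candidates)


# Hypothetical toy problem: integer decisions 0..10 and two conflicting metrics.
candidates = list(range(11))
losses = [lambda x: (x - 2) ** 2, lambda x: (x - 8) ** 2]
weights = [1.0, 1.0]  # any positive weights are allowed in (28)

x_star = scalarize(candidates, losses, weights)
best_for_first_metric = min(candidates, key=losses[0])
```

In this toy instance the scalarized minimizer passes the Pareto check but differs from the minimizer of the first loss alone, which is the point made at the end of the paragraph above.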

9 Appendix B

Lemma B.1.

Submodularity, L ^♮-convexity and multimodularity have the following properties:

(a)
If $f_{i}\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}$ is submodular/ L ^♮-convex/multimodular in $\mathbf {x}\in {\mathbb {Z}^{n}}$ and α _i≥0 for all i, then $\sum _{i=1}^{m} \alpha _{i} f_{i}(\mathbf {x})$ is submodular/ L ^♮-convex/multimodular in x.
(b)
If $h\colon \mathbb {Z}\rightarrow {\mathbb {R}_{+}}$ is convex ¹¹, then f(x)=h(x ₁−x ₂) is L ^♮-convex in x=(x ₁,x ₂) and g(x)=h(x ₁+x ₂) is multimodular in x=(x ₁,x ₂).
(c)
Let d be a random variable. If g(x,d) is L ^♮-convex/multimodular in $\mathbf {x}\in {\mathbb {Z}^{n}}$ for all d, then $\mathbb {E}_{d}[g(\mathbf {x},d)]$ is L ^♮-convex/multimodular in x.
(d)
If $f\colon \mathbb {Z}^{n}\rightarrow \mathbb {R}_{+}$ is L ^♮-convex, then ψ(x,ζ)=f(x−ζ 1) is L ^♮-convex in (x,ζ).

Proof.

The proofs of (a), (c), and (d) can be found in [26, 28, 29]. We show the proof of (b). Consider function f first. Since ψ(x,ζ)=f(x−ζ 1)=h(x ₁−x ₂), according to Definition 4.4, it suffices to show the submodularity of h in (x ₁,x ₂). But, because of the convexity of h,

$$ \begin{aligned} h(x_{1}+1-x_{2})&+h(x_{1}-(x_{2}+1))-h(x_{1}-x_{2})-h(x_{1}+1-(x_{2}+1)) \\ &=h(x_{1}-x_{2}+1)+h(x_{1}-x_{2}-1)-2h(x_{1}-x_{2})\geq{0}. \end{aligned} $$

((29))

By Definition 4.2, h is submodular in (x ₁,x ₂). Therefore, f(x)=h(x ₁−x ₂) is L ^♮-convex in (x ₁,x ₂). Since g(x)=f(−M _2,1 x), according to Lemma 4.7(a), g(x) is multimodular in (x ₁,x ₂).
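The inequality in (29) can be checked numerically on a small grid. The sketch below is only an illustration: the grid range and the two test functions (a convex and a concave quadratic) are assumptions, and `is_submodular_2d` is a hypothetical helper, not part of the paper.

```python
def is_submodular_2d(f, lo, hi):
    """Check f(x+e1) + f(x+e2) >= f(x) + f(x+e1+e2) for all x in [lo, hi)^2."""
    return all(
        f(x1 + 1, x2) + f(x1, x2 + 1) >= f(x1, x2) + f(x1 + 1, x2 + 1)
        for x1 in range(lo, hi)
        for x2 in range(lo, hi)
    )


def h_convex(z):
    """Discrete convex: h(z+1) + h(z-1) - 2h(z) = 2 >= 0."""
    return z * z


def h_concave(z):
    """Not discrete convex: h(z+1) + h(z-1) - 2h(z) = -2 < 0."""
    return -z * z


def f_from_convex(x1, x2):
    return h_convex(x1 - x2)


def f_from_concave(x1, x2):
    return h_concave(x1 - x2)


convex_case_ok = is_submodular_2d(f_from_convex, -5, 5)
concave_case_ok = is_submodular_2d(f_from_concave, -5, 5)
```

With the convex h the check passes at every grid point, while the concave h fails it, matching the role that convexity plays in (29).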

10 Appendix C

$\tilde {C}$ is nondecreasing in y ₁ because h ₁ is nondecreasing in y ₁. By assuming that V ⁽ⁿ⁻¹⁾ is nondecreasing in $b^{\prime }_{1}$, we have $\tilde {Q}^{(n)}$ nondecreasing in y ₁ since min{[y _i]⁺,L _i}+f _i is nondecreasing in y _i. Next, consider the multimodularity by using Proposition 1 ¹² in [40]. The supermodularity and superconvexity of $\tilde {C}$ in (y ₁,a ₁) can be proved by the convexity of h ₁. So, $\tilde {C}$ is multimodular in (y ₁,a ₁). Assume the monotonicity and L ^♮-convexity of V ⁽ⁿ⁻¹⁾ in $b^{\prime }_{1}$. $\tilde {Q}$ is supermodular and superconvex in (y ₁,a ₁) because

$$ \begin{aligned} \tilde{Q}^{(n)}(\mathbf{y},\mathbf{g},\mathbf{a})+\tilde{Q}^{(n)}(\mathbf{y}&+\mathbf{e}_{1},\mathbf{g},\mathbf{a}+\mathbf{e}_{1})\\ &-\tilde{Q}^{(n)}(\mathbf{y}+\mathbf{e}_{1},\mathbf{g},\mathbf{a})-\tilde{Q}^{(n)}(\mathbf{y},\mathbf{g},\mathbf{a}+\mathbf{e}_{1})=0, \end{aligned} $$

$$ \begin{aligned} &\tilde{Q}^{(n)}(\mathbf{y}+\mathbf{e}_{1},\mathbf{g},\mathbf{a})-\tilde{Q}^{(n)}(\mathbf{y},\mathbf{g},\mathbf{a}) -\tilde{Q}^{(n)}(\mathbf{y},\mathbf{g},\mathbf{a}+\mathbf{e}_{1})+\tilde{Q}^{(n)}(\mathbf{y}-\mathbf{e}_{1},\mathbf{g},\mathbf{a}+\mathbf{e}_{1})\\ &=\begin{cases} \lambda+\beta \mathbb{E}_{\mathbf{g}'} \left[ V_{\mathbf{f}}^{(n-1)}(\mathbf{y}+\mathbf{e}_{1},\mathbf{g}') -V_{\mathbf{f}}^{(n-1)}(\mathbf{y},\mathbf{g}') \Big|\mathbf{g} \right] \geq 0 & y_{1}=0 \\ \xi_{o}-\lambda+\beta \mathbb{E}_{\mathbf{g}'} \left[-V_{\mathbf{f}}^{(n-1)}(\mathbf{y},\mathbf{g}')+V_{\mathbf{f}}^{(n-1)}(\mathbf{y}-\mathbf{e}_{1},\mathbf{g}') \Big|\mathbf{g} \right] \geq 0 & y_{1}=L_{1} \\ \beta \mathbb{E}_{\mathbf{g}'} \left[ V_{\mathbf{f}}^{(n-1)}(\mathbf{y}+\mathbf{e}_{1},\mathbf{g}')-2V_{\mathbf{f}}^{(n-1)}(\mathbf{y},\mathbf{g}') +V_{\mathbf{f}}^{(n-1)}(\mathbf{y}-\mathbf{e}_{1},\mathbf{g}') \Big|\mathbf{g} \right] \geq 0 & \text{otherwise} \end{cases}. \end{aligned} $$

((30))

The second inequality in (30) (when y ₁=L ₁) is explained as follows. Recall that we have V ⁽ⁿ⁻¹⁾(x ^′)=Q ⁽ⁿ⁻¹⁾(x ^′,a ^∗(x ^′)), where Q ⁽ⁿ⁻¹⁾ is L ^♮-convex in $(b^{\prime }_{1},a^{\prime }_{1})$ and $\mathbf {a}^{*}(\mathbf {x}^{\prime })=\arg \min _{\mathbf {a}^{\prime }}Q^{(n-1)}(\mathbf {x}^{\prime },\mathbf {a}^{\prime })$ is nondecreasing in $b^{\prime }_{1}$. It can be shown that

$$ \begin{aligned} {}&-V^{(n-1)}(b^{\prime}_{1},b^{\prime}_{2},\mathbf{g}^{\prime})+V^{(n-1)}(b^{\prime}_{1}-1,b^{\prime}_{2},\mathbf{g}^{\prime}) \\ {}&=-Q^{(n-1)}(b^{\prime}_{1},b^{\prime}_{2},\mathbf{g}^{\prime},\mathbf{a}^{*}(b^{\prime}_{1},b^{\prime}_{2},\mathbf{g}^{\prime}))\\ {}&\quad+Q^{(n-1)}(b^{\prime}_{1}-1,b^{\prime}_{2},\mathbf{g}^{\prime},\mathbf{a}^{*}(b^{\prime}_{1}-1,b^{\prime}_{2},\mathbf{g}^{\prime})) \\ {}&\geq-Q^{(n-1)}(b^{\prime}_{1},b^{\prime}_{2},\mathbf{g}^{\prime},(1,1))+Q^{(n-1)}(b^{\prime}_{1}-1,b^{\prime}_{2},\mathbf{g}^{\prime},(0,0)) \\ {}&\geq-C(b^{\prime}_{1},b^{\prime}_{2},\mathbf{g}^{\prime},(1,1))+C(b^{\prime}_{1}-1,b^{\prime}_{2},\mathbf{g}^{\prime},(0,0)) \\ {}&\geq -\lambda-\eta-\tau. \end{aligned} $$

((31))

Since ξ _o≥2λ+η+τ, we have the inequality when y ₁=L ₁ in (30). Therefore, $\tilde {Q}$ is multimodular in (y ₁,a ₁). The monotonicity and multimodularity of $\tilde {C}$ and $\tilde {Q}^{(n)}$ in (y ₂,a ₂) can be proved in the same way.

11 Appendix D

By knowing the L ^♮-convexity of V ⁽ⁿ⁻¹⁾ in b ^′, we have

$$ \begin{aligned} &Q^{(n)}(\mathbf{x},(1,0))+Q^{(n)}(\mathbf{x},(0,1)) -Q^{(n)}(\mathbf{x},(0,0))-Q^{(n)}(\mathbf{x},(1,1)) \\ &=\tau+\beta \mathbb{E}_{\mathbf{g}'} \left[ V_{\mathbf{f}}^{(n-1)}(\mathbf{b}-\mathbf{e}_{1},\mathbf{g}')+V_{\mathbf{f}}^{(n-1)}(\mathbf{b}-\mathbf{e}_{2},\mathbf{g}') -V_{\mathbf{f}}^{(n-1)}(\mathbf{b},\mathbf{g}')-V_{\mathbf{f}}^{(n-1)}(\mathbf{b}-\mathbf{e}_{1}-\mathbf{e}_{2},\mathbf{g}') \Big|\mathbf{g} \right] \\ &\geq{\tau}>0, \end{aligned} $$

((32))

i.e., −Q ⁽ⁿ⁾ is strictly supermodular in a for all x. By definition in [30], the game is supermodular.

12 Appendix E

C _h is L ^♮-convex in (b,a) because of the convexity of h _i, and C _t is L ^♮-convex in (b,a) because of the submodularity of t _r in a. By Lemma B.1(a), C is L ^♮-convex in (b,a). Consider the L ^♮-convexity of Q in (b,a). Let ${BR}_{i}(a_{-i})=\arg \min _{a_{i}}Q^{(n)}(\mathbf {x},(a_{i},a_{-i}))$. Having equilibria (0,0) and (1,1) implies B R _i(a _−i)=a _−i, i.e., a ₁=a ₂. Moreover, Q ⁽ⁿ⁾(x,(a ₁,a ₁)) is L ^♮-convex in (b,a ₁) for the following reasons. When b _i−a ₁<L _i+1 for all i∈{1,2}, Q ⁽ⁿ⁾ is L ^♮-convex in (b,a ₁) because of the L ^♮-convexity of V ⁽ⁿ⁻¹⁾ in b ^′ and Lemma B.1(c) and (d). When b _i−a ₁=L _i+1 for either i=1 or i=2, the L ^♮-convexity of Q ⁽ⁿ⁾ can be shown in the same way as in Appendix C under the condition ξ _o≥2λ+τ+η. By Theorem 4.10 and Proposition 4.11, V ^∗(x) is L ^♮-convex in b and the optimal action a ^∗ is nondecreasing in b.
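The best-response map B R _i and the resulting equilibria can be sketched for a single fixed state x. The Q-values below are hypothetical numbers chosen only for illustration, and the helper names are assumptions. Both components of the action minimize the same Q, following the definition of B R _i above.

```python
from itertools import product


def best_response(Q, player, a_other):
    """argmin over the player's own action of Q, with the other action held fixed."""
    def cost(a_own):
        action = (a_own, a_other) if player == 0 else (a_other, a_own)
        return Q[action]
    return min((0, 1), key=cost)


def pure_equilibria(Q):
    """Action pairs in which each component is a best response to the other."""
    return [
        (a1, a2)
        for a1, a2 in product((0, 1), repeat=2)
        if best_response(Q, 0, a2) == a1 and best_response(Q, 1, a1) == a2
    ]


# Hypothetical Q-values for one fixed state x, indexed by (a1, a2).
Q = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 2.0, (1, 1): 0.5}

equilibria = pure_equilibria(Q)
# Same combination of Q-values as the left-hand side of (32).
gap = Q[(1, 0)] + Q[(0, 1)] - Q[(0, 0)] - Q[(1, 1)]
```

For these numbers the best response copies the other component's action, so the equilibria are (0,0) and (1,1), and the gap is positive as (32) requires.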

13 Appendix F

We just need to show that condition (b) in Theorem 4.13 is satisfied. Let b _i−a ₁<L _i+1 for all i∈{1,2}. It suffices to show B R _i(a _−i)=a _−i for all i∈{1,2} in order to prove equilibria (0,0) and (1,1) in Theorem 4.13. Because the game has strictly supermodular utility, B R _i(a _−i+1)>B R _i(a _−i). So B R _i(1)=1, if we can prove B R _i(0)=0. By knowing that p ₁=0.5, we can show that

$$ \begin{aligned} &Q^{(n)}(\mathbf{b},\mathbf{g},(1,0))-Q^{(n)}(\mathbf{b},\mathbf{g},(0,0)) \\ &=\begin{cases} -\lambda+\tau+\eta P_{e}(g_{1})+0.5\beta \Big(V(b_{1}-1,\hat{b}_{2},\mathbf{g}^{\prime}) -V(b_{1},\hat{b}_{2},\mathbf{g}^{\prime})\Big) \geq{0} & 0<b_{1}<L_{1}+1 \\ -\lambda+\tau+\eta P_{e}(g_{1}) \geq{0} & \text{otherwise} \end{cases}, \end{aligned} $$

((33))

where $\hat {b}_{2}=\min \{[b_{2}]^{+},L_{2}\}+f_{2}$ and the inequality in the case 0<b ₁<L ₁+1 holds because, by an approach similar to that in (31), $V^{(n-1)}(b^{\prime }_{1}-1,b^{\prime }_{2},\mathbf {g}^{\prime })-V^{(n-1)}(b^{\prime }_{1}+1,b^{\prime }_{2},\mathbf {g}^{\prime })\geq -\tau -\eta $ and $\beta \leq \frac {2(\tau -\lambda)}{\tau +\eta }$.

Similarly, we can show Q ⁽ⁿ⁾(b,g,(0,1))−Q ⁽ⁿ⁾(b,g,(0,0))≥0 in the case when p ₂=0.5. So, B R _i(a _−i)=a _−i.
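The role of the condition $\beta \leq \frac {2(\tau -\lambda)}{\tau +\eta }$ can be made concrete with a short arithmetic sketch. It assumes that the value-function difference in (33) is bounded below by −(τ+η), as stated above, and that P _e(g ₁)≥0. The parameter values are hypothetical and the function names are illustrative.

```python
def q_gap_lower_bound(lam, tau, eta, beta, pe=0.0):
    """Lower bound on Q(b,g,(1,0)) - Q(b,g,(0,0)) in (33).

    Uses the assumed bound V-difference >= -(tau + eta), scaled by 0.5 * beta.
    """
    return -lam + tau + eta * pe - 0.5 * beta * (tau + eta)


def beta_threshold(lam, tau, eta):
    """Largest discount factor for which the bound above stays nonnegative at pe = 0."""
    return 2.0 * (tau - lam) / (tau + eta)


# Hypothetical cost parameters.
lam, tau, eta = 0.5, 1.0, 1.0
beta_max = beta_threshold(lam, tau, eta)

gap_below = q_gap_lower_bound(lam, tau, eta, beta=0.4)
gap_at = q_gap_lower_bound(lam, tau, eta, beta=beta_max)
gap_above = q_gap_lower_bound(lam, tau, eta, beta=0.6)
```

For these values the threshold is 0.5: the lower bound is positive for β=0.4, zero at the threshold, and negative for β=0.6, in which case this argument no longer guarantees B R _i(0)=0.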

14 Appendix G

Let i=2. C(x,a) is submodular in (b ₂,g ₁,a ₂) because of the convexity of h _i and the condition P _e(g ₁)≥P _e(g ₁+1). By knowing the submodularity of V ⁽ⁿ⁻¹⁾ in $(b^{\prime }_{2},g^{\prime }_{1})$ and the L ^♮-convexity of Q ⁽ⁿ⁾ in (b ₂,a ₂) under condition ξ _o≥2λ+η+τ, we can show the submodularity of Q ⁽ⁿ⁾ in (b ₂,a ₂) and (b ₂,g ₁). Consider the submodularity of Q ⁽ⁿ⁾ in (g ₁,a ₂). We can show that

$$ \begin{aligned} Q^{(n)}(\mathbf{b},\mathbf{g},\mathbf{a}+\mathbf{e}_{2})&+Q^{(n)}(\mathbf{b},\mathbf{g}+\mathbf{e}_{1},\mathbf{a}) - Q^{(n)}(\mathbf{b},\mathbf{g},\mathbf{a})-Q^{(n)}(\mathbf{b},\mathbf{g}+\mathbf{e}_{1},\mathbf{a}+\mathbf{e}_{2}) \\ &\geq \eta \left(P_{e}(g_{1})-P_{e}(g_{1}+1) + \beta \mathbb{E}_{g^{\prime}_{1}} \left[P_{e}(g^{\prime}_{1}+1)-P_{e}(g^{\prime}_{1}) \,\big|\, g_{1}\right] \right) \geq{0}. \end{aligned} $$

((34))

The first inequality in (34) is obtained by using an approach similar to that in (31), and the last one is due to the condition $\beta \leq \frac {P_{e}(g_{i})-P_{e}(g_{i}+1)}{\sum _{g'_{i}}P_{g_{i}g'_{i}} (P_{e}(g'_{i})-P_{e}(g'_{i}+1)) }$. Therefore, Q ⁽ⁿ⁾ is submodular in (b ₂,g ₁,a ₂). Similarly, we can show that Theorem 4.15 holds for i=1.

15 Appendix H

In an equiprobable partitioned slow and flat Rayleigh fading channel, the channel transitions can be worked out by the level crossing rate (LCR) [15] and only happen between adjacent states, i.e., $g^{\prime }_{i}\in \{g_{i}-1, g_{i}, g_{i}+1\}$. Further, $P_{gg'}=P_{g'g}$, and $P_{gg'}\ll {P_{gg}}$ for all g ^′≠g. According to Definition 4.8, for nondecreasing u, $P_{g_{i}g_{i}^{\prime }}$ is first order stochastic nondecreasing in g _i because

$$ \begin{aligned} \sum_{(g_{i}+1)^{\prime}}P_{(g_{i}+1)(g_{i}+1)^{\prime}}u \left((g_{i}+1)^{\prime} \right)-\sum_{g^{\prime}_{i}}P_{g_{i}g^{\prime}_{i}}u(g^{\prime}_{i}) \\ \geq (1-2P_{g_{i}g_{i}+1}) \left(u(g_{i}+1)-u(g_{i}) \right) \geq 0, \end{aligned} $$

((35))

where $1-2P_{g_{i}g_{i}+1}\geq {0}$ holds because $P_{gg'}\ll {P_{gg}}$ and $\sum _{g'}P_{gg'}=1$.
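The stochastic monotonicity claimed in (35) can be checked numerically through the equivalent tail-sum test: the tail sums $\sum _{g'\geq k}P_{gg'}$ must be nondecreasing in g for every k. The sketch below assumes a simplified chain in which every adjacent transition has the same hypothetical probability `p_cross`. This is an illustrative stand-in, not the LCR-based model of [15].

```python
def tridiagonal_chain(num_states, p_cross):
    """Symmetric chain that moves to each adjacent state with probability p_cross."""
    P = [[0.0] * num_states for _ in range(num_states)]
    for g in range(num_states):
        for g_next in (g - 1, g + 1):
            if 0 <= g_next < num_states:
                P[g][g_next] = p_cross
        P[g][g] = 1.0 - sum(P[g])  # remaining mass stays in the current state
    return P


def is_stochastically_nondecreasing(P, tol=1e-12):
    """Tail sums over g' >= k must be nondecreasing in the current state g."""
    n = len(P)
    for k in range(n):
        tails = [sum(row[k:]) for row in P]
        if any(tails[g + 1] < tails[g] - tol for g in range(n - 1)):
            return False
    return True


slow_fading = tridiagonal_chain(4, 0.05)  # 1 - 2 * p_cross >= 0
fast_flip = tridiagonal_chain(2, 0.6)     # 1 - 2 * p_cross < 0

slow_ok = is_stochastically_nondecreasing(slow_fading)
fast_ok = is_stochastically_nondecreasing(fast_flip)
```

The slowly varying chain passes the test, while the two-state chain with crossing probability 0.6 fails it, which mirrors the requirement $1-2P_{g_{i}g_{i}+1}\geq {0}$ in (35).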

Abbreviations

NC-TWRC: network-coded two-way relay channels
MDP: Markov decision process
DP: dynamic programming
MPI: monotonic policy iteration
DSPSA: discrete simultaneous perturbation stochastic approximation
NC: network coding
FSMC: finite-state Markov chain
QoS: quality of service
SPSA: simultaneous perturbation stochastic approximation

References

[1] R Ahlswede, N Cai, S-YR Li, RW Yeung, Network information flow. IEEE Trans. Inf. Theory. 46(4), 1204–1216 (2000). doi:http://dx.doi.org/10.1109/18.850663.
[2] Y Wu, Information exchange in wireless networks with network coding and physical-layer broadcast. Technical Report MSR-TR-2004-78, Microsoft Research, Redmond WA (2004).
[3] S Katti, H Rahul, W Hu, D Katabi, M Médard, J Crowcroft, Xors in the air: practical wireless network coding. SIGCOMM Comput. Commun. Rev. 36(4), 243–254 (2006).
[4] C Hausl, J Hagenauer, in Proceedings of IEEE International Conference on Communications: June 2006. Iterative network and channel decoding for the two-way relay channel (IEEE, Istanbul, 2006), pp. 1568–1573. doi:http://dx.doi.org/10.1109/ICC.2006.255034.
[5] J Li, S Song, Y Guo, M Lee, Joint optimization of source and relay precoding for AF MIMO relay systems. EURASIP J. Wireless Commun. Netw. 2015, 175 (2015).
[6] M Soussi, A Zaidi, L Vandendorpe, DF-based sum-rate optimization for multicarrier multiple access relay channel. EURASIP J. Wireless Commun. Netw. 2015, 133 (2015).
[7] J Joung, AH Sayed, Multiuser two-way amplify-and-forward relay processing and power control methods for beamforming systems. IEEE Trans. Signal Process. 58(3), 1833–1846 (2010).
[8] S Katti, D Katabi, W Hu, H Rahul, M Medard, in Proceedings of 43rd Annual Allerton Conference on Communications, Control and Computing: September 2005. The importance of being opportunistic: Practical network coding for wireless environments (University of Illinois, Monticello, IL, 2005), pp. 756–765.
[9] S Peters, A Panah, K Truong, R Heath, Relay architectures for 3GPP LTE-advanced. EURASIP J. Wireless Commun. Netw. 2009, 618787 (2009).
[10] Q-T Vien, L-N Tran, E-K Hong, Network coding-based retransmission for relay aided multisource multicast networks. EURASIP J. Wireless Commun. Netw. 2011, 643920 (2011).
[11] W Chen, KB Letaief, Z Cao, in Proceedings of IEEE International Conference on Communications: 24-28 June 2007. Opportunistic network coding for wireless networks (IEEE, Glasgow, 2007), pp. 4634–4639. doi:http://dx.doi.org/10.1109/ICC.2007.765.
[12] Y-P Hsu, N Abedini, S Ramasamy, N Gautam, A Sprintson, S Shakkottai, in Proceedings IEEE International Symposium on Information Theory: July 31 - August 5 2011. Opportunities for network coding: To wait or not to wait (IEEE, St. Petersburg, 2011), pp. 791–795. doi:http://dx.doi.org/10.1109/ISIT.2011.6034243.
[13] N Ding, I Nevat, GW Peters, J Yuan, in Proceedings of IEEE International Conference on Communications: 9-13 June 2013. Opportunistic network coding for two-way relay fading channels (IEEE, Budapest, 2013), pp. 5980–5985. doi:http://dx.doi.org/10.1109/ICC.2013.6655556.
[14] ML Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn (Wiley, New York, 1994).
[15] P Sadeghi, RA Kennedy, PB Rapajic, R Shams, Finite-state Markov modeling of fading channels: A survey of principles and applications. IEEE Signal Process. Mag. 25(5), 57–80 (2008). doi:http://dx.doi.org/10.1109/MSP.2008.926683.
[16] RS Sutton, AG Barto, Introduction to Reinforcement Learning, 1st edn (MIT Press, Cambridge, MA, 1998).
[17] WB Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley, New Jersey, 2007).
[18] JE Smith, KF McCardle, Structural properties of stochastic dynamic programs. Oper. Res. 50(5), 796–809 (2002).
[19] J Huang, V Krishnamurthy, Transmission control in cognitive radio as a Markovian dynamic game: Structural result on randomized threshold policies. IEEE Trans. Commun. 58(1), 301–310 (2010).
[20] MH Ngo, V Krishnamurthy, Monotonicity of constrained optimal transmission policies in correlated fading channels with ARQ. IEEE Trans. Signal Process. 58(1), 438–451 (2010). doi:http://dx.doi.org/10.1109/TSP.2009.2027735.
[21] DV Djonin, V Krishnamurthy, MIMO transmission control in fading channels–a constrained Markov decision process formulation with monotone randomized policies. IEEE Trans. Signal Process. 55(10), 5069–5083 (2007).
[22] DM Topkis, Supermodularity and Complementarity (Princeton University Press, Princeton, 2001).
[23] K Murota, Note on multimodularity and L-convexity. Math. Oper. Res. 30(3), 658–661 (2005).
[24] L Yang, YE Sagduyu, JH Li, in Proceedings of 13th ACM International Symposium on Mobile Ad Hoc Networking and Computing: 11-14 June 2012. Adaptive network coding for scheduling real-time traffic with hard deadlines (ACM SIGMOBILE, New York, 2012), pp. 105–114.
[25] B Hajek, Extremal splittings of point processes. Math. Oper. Res. 10(4), 543–556 (1985).
[26] DM Topkis, Minimizing a submodular function on a lattice. Oper. Res. 26(2), 305–321 (1978).
[27] K Murota, Discrete Convex Analysis (SIAM, Philadelphia, 2003).
[28] QLP Yu, Multimodularity and structural properties of stochastic dynamic programs. Working Paper. School of Bus. and Manage., HongKong University of Sci. and Tech (2013).
[29] P Zipkin, On the structure of lost-sales inventory models. Oper. Res. 58(4), 937–944 (2008).
[30] P Milgrom, J Roberts, Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica J. Econ. Soc. 58(6), 1255–1277 (1990).
[31] AT Hoang, M Motani, Cross-layer adaptive transmission: Optimal strategies in fading channels. IEEE Trans. Commun. 56(5), 799–807 (2008). doi:http://dx.doi.org/10.1109/TCOMM.2008.060214.
[32] Q Wang, JC Spall, in Proceedings of American Control Conference: June 29-July 1 2011. Discrete simultaneous perturbation stochastic approximation on loss function with noisy measurements (IEEE, San Francisco, CA, 2011), pp. 4520–4525.
[33] JC Spall, Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Trans. Aerosp. Electron. Syst. 34(3), 817–823 (1998). doi:http://dx.doi.org/10.1109/7.705889.
[34] N Ding, P Sadeghi, RA Kennedy, Discrete Convexity and Stochastic Approximation for Cross-layer On-off Transmission Control. Wireless Communications, IEEE Transactions on, 1536–1276 (2015). doi:http://dx.doi.org/10.1109/TWC.2015.2473858.
[35] JC Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control. 37(3), 332–341 (1992). doi:http://dx.doi.org/10.1109/9.119632.
[36] A Gosavi, Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement Learning, vol. 55 (Springer, New York, 2014).
[37] L Rastrigin, Convergence of random search method in extremal control of many-parameter system. Automat. Remote Control. 24(11), 1337 (1964).
[38] S Kirkpatrick, Optimization by simulated annealing: quantitative studies. J. Stat. Phys. 34(5-6), 975–986 (1984).
[39] M Caramia, P Dell’Olmo, Multi-objective Management in Freight Logistics: Increasing Capacity, Service Level and Safety with Optimization Algorithms (Springer, Berlin, Germany, 2008).
[40] W Zhuang, MZF Li, Monotone optimal control for a class of Markov decision processes. European J. Oper. Res. 217(2), 342–350 (2012).

Author information

Authors and Affiliations

Research School of Engineering, College of Engineering and Computer Science, Australian National University (ANU), Canberra, 2601, Australia
Ni Ding, Parastoo Sadeghi & Rodney A. Kennedy

Corresponding author

Correspondence to Ni Ding.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Ding, N., Sadeghi, P. & Kennedy, R.A. Structured optimal transmission control in network-coded two-way relay channels. J Wireless Com Network 2015, 241 (2015). https://doi.org/10.1186/s13638-015-0470-7

Received: 09 July 2015
Accepted: 20 October 2015
Published: 06 November 2015
DOI: https://doi.org/10.1186/s13638-015-0470-7
