Structured optimal transmission control in network-coded two-way relay channels
EURASIP Journal on Wireless Communications and Networking volume 2015, Article number: 241 (2015)
Abstract
This paper considers a transmission control problem in network-coded two-way relay channels (NC-TWRC), where the relay buffers randomly arriving packets from two users and the channels are assumed to be fading. The problem is modeled as a discounted infinite-horizon Markov decision process (MDP). The objective is to find an adaptive transmission control policy that jointly minimizes the packet delay, buffer overflow, transmission power consumption, and downlink error rate in the long run. Using the concepts of submodularity, multimodularity, and \(L^{\natural}\)-convexity, we study the structure of the optimal policy found by the dynamic programming (DP) algorithm. We show that the optimal transmission policy is nondecreasing in queue occupancies and/or channel states under certain conditions on the chosen values of parameters in the MDP model, the channel modeling method, and the preservation of stochastic dominance in the transitions of system states. Based on these results, we propose two low-complexity algorithms for searching for the optimal monotonic policy: monotonic policy iteration (MPI) and discrete simultaneous perturbation stochastic approximation (DSPSA). We show that MPI reduces the time complexity of DP and that DSPSA is able to adaptively track the optimal policy when the statistics of the packet arrival processes change with time.
Introduction
Network coding (NC) was proposed in [1] to maximize the information flow in a wired network. It was introduced in multicast wireless communications to optimize the throughput and has attracted significant interest recently due to the rapid growth in multimedia applications [2]. It was shown in [3] that the power efficiency in wireless transmission systems could be improved by NC. For example, in a 3-node network system, called the network-coded two-way relay channels (NC-TWRC) [4] as shown in Fig. 1, the messages \(m_{1}\) and \(m_{2}\) are XORed at the relay and broadcast to the end users. Compared to conventional store-and-forward transmission, this method reduces the total number of transmissions from 4 to 3, so that 25 % of the transmission power is saved. Since then, numerous optimization problems have been studied in NC-TWRC, e.g., the precoding scheme design proposed in [5], the optimal achievable sum-rate problem studied in [6], and the optimal beamforming method proposed in [7].
In [8], Katti et al. pointed out the importance of being opportunistic in practical NC scenarios. It was suggested that the assumptions in the related research work should comply with practical wireless environments, e.g., decentralized routing and time-varying traffic rate. This suggestion highlighted a problem in the existing literature: the majority of the studies (e.g., [9, 10]) consider static environments (e.g., synchronized traffic) while ignoring the stochastic nature of the packet arrivals in the data link layer. On the other hand, the randomness of traffic in Fig. 1 poses the problem of how to make an optimal decision in a dynamic environment with a power-delay tradeoff: when there are packet inflows in the relay but no coding opportunities or XORing pairs (e.g., one packet arrives from one user, but no packet arrives from the other), waiting for coding opportunities by holding packets saves transmission power but increases packet delay and results in more packets to be transmitted in the future. Since a decision made at any instant affects both the immediate and future costs, the decision-making is a dynamic, instead of a one-time, process, i.e., the objective is to determine a decision rule that is optimal over time. In [11, 12], this problem was studied and solved by a cross-layer design, NC-TWRC with buffering. The optimal policy obtained from a Markovian process formulation was shown to minimize the transmission power and packet delay simultaneously in the long run. In [13], the buffer-assisted NC-TWRC was extended to include the dynamics of wireless channels (Fig. 2). In this system, a transmission policy that solves the power-delay tradeoff may not be the best decision rule because it does not consider the possible loss in throughput due to downlink transmission errors.
For this reason, the scheduler is required to make an optimal decision that simultaneously minimizes the transmission power, packet delay, and downlink BER in the long run by considering the current queue and channel states and their expectations in the future. In [13], this problem was formulated as a discounted infinite-horizon Markov decision process (MDP) [14] with channels modeled by finite-state Markov chains (FSMCs) [15]. The optimal transmission policy was shown to be superior to those in [11, 12] in terms of enhancing the QoS (quality of service, evaluated by packet delay and overflow in the data link layer, and power consumption and error rate in the physical layer) in a practical wireless environment, e.g., Rayleigh fading channels.
The optimal policy of a discounted infinite-horizon MDP can be found by dynamic programming (DP) [16], e.g., value or policy iteration. However, the DP algorithm is burdened with high complexity. In Fig. 2, the system state is a 4-tuple (two channels and two queues), and the decision/action is a 2-tuple (each entry associated with the departure control of one queue). In such a high-dimensional MDP, the curse of dimensionality ^{1} becomes more evident [17]: the computation load grows quickly if the cardinality of any tuple in the state variable is large. To relieve the curse, one solution is to qualitatively understand the model and prove the existence of a monotonic optimal policy [18]. Then, a low-complexity algorithm or a model-free learning method can be proposed, e.g., simultaneous perturbation stochastic approximation (SPSA) [19, 20]. However, a monotonic optimal policy does not exist in general. Most often, an optimal policy exists, but it varies irregularly with the state variable. In order to prove the existence of a certain feature in the optimal policy, we need to extensively analyze the MDP model and the recursive functions in the DP algorithm. The basic approach in the existing literature is to show by induction that submodularity is preserved in each iterative optimization step (maximization/minimization) in DP, e.g., [19, 21]. We adopt the same method in this paper but consider submodularity in high-dimensional cases. Moreover, we use \(L^{\natural}\)-convexity and multimodularity, two concepts that were originally defined in discrete convex analysis [22, 23], to describe the joint submodularity and integral convexity in a high-dimensional space.
The aim of our work is to prove the existence of a monotonic optimal transmission policy in the NC-TWRC system in Fig. 2. By observing the \(L^{\natural}\)-convexity and submodularity of the DP functions, we derive sufficient conditions for the optimal policy to be nondecreasing in queue and/or channel states. These structured results are used to derive two low-complexity algorithms: monotonic policy iteration (MPI) and discrete simultaneous perturbation stochastic approximation (DSPSA). We compare the time complexity of MPI to that of DP and show the convergence performance of the DSPSA algorithm. The main results in this paper are:
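As a minimal illustration of the XOR-and-broadcast step described above, the following Python sketch codes two packets at the relay and lets each user recover the other's message. The payload bytes and all function and variable names are hypothetical and chosen only for illustration; they do not come from the cited works.

```python
def relay_xor(m1: bytes, m2: bytes) -> bytes:
    """XOR two equal-length packets, as the relay does before broadcasting."""
    if len(m1) != len(m2):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(m1, m2))


# Hypothetical two-byte payloads, for illustration only.
m1 = b"\x0f\xa5"  # packet sent by user 1
m2 = b"\x33\x5a"  # packet sent by user 2
coded = relay_xor(m1, m2)

# Each user XORs the broadcast with its own packet to recover the other one.
recovered_at_user1 = relay_xor(coded, m1)
recovered_at_user2 = relay_xor(coded, m2)

# Two uplink slots in both schemes; two downlink slots without coding, one with.
store_and_forward = 2 + 2
network_coded = 2 + 1
saving = 1 - network_coded / store_and_forward  # fraction of transmissions saved
```

The transmission count reproduces the reduction from 4 to 3 transmissions stated in the text.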

We prove that each tuple in the optimal policy is nondecreasing in the queue state that is controlled by that tuple if the chosen values of unit costs in the immediate cost function give rise to an \(L^{\natural}\)-convex or multimodular DP. Moreover, we show that the same results found in [19, 21] can also be explained by \(L^{\natural}\)-convexity or multimodularity via a unimodular coordinate transform.

By thinking of each iteration in DP as a one-stage pure coordination supermodular game, we show that equiprobable traffic rates and certain conditions on unit costs guarantee that each tuple in the optimal policy is monotonic in not only the queue state that is controlled by that tuple but also the queue state that is associated with the information flow of the opposite direction, i.e., the one that is not under the control of that tuple.

By observing the submodularity of DP, we give sufficient conditions, in terms of unit costs, channel statistics, and FSMC models, for an optimal policy to be nondecreasing in both queue and channel states.

Based on the submodularity, multimodularity, and \(L^{\natural}\)-convexity of DP, we show that the optimal transmission control problem in Fig. 2 can be solved by two low-complexity algorithms. One is MPI, a modified DP algorithm in which the action search space progressively shrinks as the indices of queue and/or channel states increase. It is shown that the time complexity of MPI is much less than that of DP when the cardinality of the system state is large. The other algorithm is a stochastic optimization method. We formulate the optimal policy search as a minimization problem over a set of queue thresholds and use the DSPSA algorithm to approximate the minimizer. We show that DSPSA is able to adaptively track the optimal values of queue thresholds when the statistics of the packet arrival processes change with time. We run simulations in NC-TWRC with Rayleigh fading channels to show that the average cost incurred by the policy approximated by DSPSA is similar to that incurred by the optimal policy found by DP.
The rest of this paper is organized as follows. In Section 2, we state the optimization problem in NC-TWRC with random packet arrivals and FSMC-modeled channels and clarify the assumptions. In Section 3, we describe the MDP formulation, state the objective, and present the DP algorithm. In Section 4, we investigate the structure, in queue and channel states, of the optimal transmission policy found by the DP algorithm. Section 5 presents the MPI and DSPSA algorithms.
System
Consider the NC-TWRC shown in Fig. 2. Users 1 and 2 randomly send packets to each other via the relay. The relay is equipped with two finite-length FIFO queues, queue 1 and queue 2, to buffer the incoming packets from users 1 and 2, respectively. The outflows of the queues are controlled by a scheduler. The scheduler keeps making decisions as to whether or not to transmit packets from the queues. If the decision results in a pair of packets in opposite directions being transmitted at the same time, they are XORed (coded) and broadcast. Otherwise, the packet is simply forwarded to the end user. The objective is to minimize packet delay, queue overflow, transmission power (saved by utilizing the coding opportunities), and downlink transmission errors, both immediately and in expectation over the future. Obviously, the optimization concerns conflict with each other: (1) if there does not exist a pair of packets for XORing, waiting for a coding opportunity by holding packets results in a high packet delay on average, while transmitting a packet without coding results in one more packet to be transmitted in the future, i.e., more transmission power on average; (2) if the SNR of one channel is low, waiting for a transition to high SNR by holding packets results in higher packet delay but a lower transmission error rate. Therefore, the scheduler must seek an optimal decision rule that solves this power-delay-error tradeoff.
It should be pointed out that the problem under consideration is a cross-layer multi-objective optimization problem: we want to optimize both the power consumption and transmission error rate in the physical layer and the packet delay in the data link layer. As discussed above, since there are tradeoffs among these optimization metrics, it is not possible to optimize all of them simultaneously. Therefore, in this paper, we are actually seeking Pareto optimality with respect to these optimization metrics.^{2}
Assumptions
We consider a discrete-time decision-making process, where time is divided into small intervals, called decision epochs and denoted by t∈{0,1,…,T}. Let i∈{1,2} and assume the following:

A1
(i.i.d. incoming traffic) Denote by the random variable \(f_{i}^{(t)}\in {\mathcal {F}_{i}}\) the number of incoming packets to queue i at decision epoch t. Let the maximum number of packets arriving per decision epoch be no greater than 1, i.e., \(\mathcal {F}_{i}=\{0, 1\}\). Assume that \(\left \{f_{1}^{(t)}\right \}\) and \(\left \{f_{2}^{(t)}\right \}\) are two independent i.i.d. random processes with \(Pr\left (f_{i}^{(t)}=1\right)=p_{i}\) and \(Pr\left (f_{i}^{(t)}=0\right)=1-p_{i}\) for all t.

A2
(modulation scheme) Packets are of equal length. The packets that arrive at the relay are decoded and stored in the queues. The relay transmits packets using BPSK modulation. Denote by \(L_{P}\) the packet length in bits. Since the maximum information flow is one packet per decision epoch, each decision epoch lasts for \(L_{P}\) symbol durations. The relay can set a certain field in the header of a packet so as to notify the receivers whether the packet is XORed or not.

A3
(finite-length queues) Queue i can store a maximum of \(L_{i}\) packets. At each t, the scheduler makes a decision and incurs an immediate cost before the event \(\mathbf {f}^{(t)}=\left (f_{1}^{(t)},f_{2}^{(t)}\right)\). Denote by \(b_{i}^{(t)}\in {\mathcal {B}_{i}}\) the occupancy of queue i at the beginning of decision epoch t; then \(\mathcal {B}_{i}=\left \{0,1,\dotsc,L_{i}+\max {\left \{f_{i}^{(t-1)}\right \}}\right \}=\{0,1,\dotsc,L_{i}+1\}\). If the relay's decision results in queue occupancy \(L_{i}+1\), the newly arrived packet is dropped. We call this a packet loss due to queue overflow.

A4
(Markovian channel modeling) Let the full variation range of \(\gamma _{i}^{(t)}\), the instantaneous SNR of channel i, be partitioned into \(K_{i}\) nonoverlapping regions \(\{[\Gamma _{1},\Gamma _{2}),[\Gamma _{2},\Gamma _{3}),\dotsc,[\Gamma _{K_{i}},\infty)\}\), called channel states. Here, the SNR boundaries satisfy \(\Gamma _{1}<\Gamma _{2}<\dotsc <\Gamma _{K_{i}}\). Denote by \(\mathcal {G}_{i}=\{1,2,\dotsc,K_{i}\}\) the state set of channel i and by \(g_{i}^{(t)}\) the state of channel i at decision epoch t. We say that \(g_{i}^{(t)}=k_{i}\) if \(\gamma _{i}^{(t)}\in {[\Gamma _{k_{i}},\Gamma _{k_{i}+1})}\). Each channel is modeled by a finite-state Markov chain (FSMC) [15], where the state evolution of channel i is governed by the transition probability \(P_{g_{i}^{(t)}g_{i}^{(t+1)}}=Pr\left (g_{i}^{(t+1)}\mid g_{i}^{(t)}\right)\).

A5
(downlink channel state information) Let \(\left \{g_{1}^{(t)}\right \}\) and \(\left \{g_{2}^{(t)}\right \}\) be two independent and i.i.d. random processes. The relay has the channel state information (the value of channel state and its transition probabilities) of both channels before the decision making at t.
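The SNR partition in A4 can be sketched as a simple lookup. The snippet below is an illustration under stated assumptions: the boundary values in `GAMMA` are hypothetical, and the helper name `channel_state` is not from the paper.

```python
from bisect import bisect_right
from typing import Sequence


def channel_state(snr: float, boundaries: Sequence[float]) -> int:
    """Return k in {1, ..., K} such that Gamma_k <= snr < Gamma_{k+1}.

    boundaries = [Gamma_1, ..., Gamma_K] must be strictly increasing; the
    last region [Gamma_K, inf) is unbounded, as in A4.
    """
    if snr < boundaries[0]:
        raise ValueError("SNR lies below Gamma_1")
    # bisect_right counts the boundaries that are <= snr, which is exactly k.
    return bisect_right(boundaries, snr)


# Hypothetical linear-scale boundaries for K = 4 channel states.
GAMMA = [0.0, 0.5, 1.5, 4.0]
```

For instance, an SNR equal to a boundary \(\Gamma_{k}\) is assigned to state k, because each region is closed on the left.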
Markov decision process formulation
Based on A1, A4, and A5, we know that the statistics of the incoming traffic flow and channel dynamics associated with user 1 or 2 are time-invariant. It follows that the transmission control problem in Fig. 2 can be formulated as a stationary Markov decision process (MDP). In the following, we drop the decision epoch notation t in A1-A5 and use the notation \(y\) and \(y'\) for the system variable y at the current and next decision epochs, respectively.
System state
Denote the system state \(\mathbf {x}=(\mathbf {b}, \mathbf {g})\in {\mathcal {X}}\), where \(\mathbf {b}=(b_{1},b_{2})\in {\mathcal {B}_{1}\times {\mathcal {B}_{2}}}\) and \(\mathbf {g}=(g_{1},g_{2})\in {\mathcal {G}_{1}\times {\mathcal {G}_{2}}}\), i.e., \(\mathcal {X}=\mathcal {B}_{1}\times {\mathcal {B}_{2}}\times {\mathcal {G}_{1}}\times {\mathcal {G}_{2}}\), where × denotes the Cartesian product. We also use the 4-tuple notation \(\mathbf{x}=(b_{1},b_{2},g_{1},g_{2})\) in the following.
Action
Denote action \(\mathbf {a}=(a_{1},a_{2})\in {\mathcal {A}}\), where \(a_{i}\in {\mathcal {A}_{i}}=\{0,1\}\) denotes the number of packets departing from queue i and \(\mathcal {A}=\mathcal {A}_{1}\times {\mathcal {A}_{2}}=\{0,1\}^{2}\). The terminology of actions is shown in Table 1.
State transition probabilities
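The transition probability \(P_{\mathbf{x}\mathbf{x}'}^{\mathbf{a}}=Pr(\mathbf{x}'\mid \mathbf{x},\mathbf{a})\) denotes the probability of being in state \(\mathbf{x}'\) at the next decision epoch if action \(\mathbf{a}\) is taken in state \(\mathbf{x}\) at the current decision epoch. Due to the assumptions of independent random processes in A1 and A5, the state transition probability is given by
\[P_{\mathbf{x}\mathbf{x}'}^{\mathbf{a}}=\prod_{i=1}^{2}P_{g_{i}g_{i}^{\prime}}P_{b_{i}b_{i}^{\prime}}^{a_{i}}, \qquad (1)\]
where \(P_{g_{i}g_{i}^{\prime }}\) is determined by the channel statistics and the FSMC modeling method in A4 and \(P_{b_{i}b_{i}^{\prime }}^{a_{i}}\) is the queue state transition probability. At the current decision epoch, the occupancy of queue i after decision \(a_{i}\) is \(\min\{[b_{i}-a_{i}]^{+},L_{i}\}\), where \([y]^{+}=\max\{y,0\}\). The occupancy at the beginning of the next decision epoch is given by
\[b_{i}^{\prime}=\min\{[b_{i}-a_{i}]^{+},L_{i}\}+f_{i}. \qquad (2)\]
Therefore, the state transition probability of queue i is
\[P_{b_{i}b_{i}^{\prime}}^{a_{i}}=p_{i}\,\mathcal{I}_{\{b_{i}^{\prime}=\min\{[b_{i}-a_{i}]^{+},L_{i}\}+1\}}+(1-p_{i})\,\mathcal{I}_{\{b_{i}^{\prime}=\min\{[b_{i}-a_{i}]^{+},L_{i}\}\}}, \qquad (3)\]
where \(\mathcal {I}_{\{\cdot \}}\) is the indicator function that returns 1 if the expression in {·} is true and 0 otherwise.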
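The queue dynamics described above (one possible departure, truncation at \(L_{i}\), then a Bernoulli arrival as in A1) can be tabulated as a transition matrix. The following is a minimal sketch under those assumptions; the function name `queue_transition_matrix` and the numerical values in the usage note are illustrative, not taken from the paper.

```python
import numpy as np


def queue_transition_matrix(L: int, p: float, a: int) -> np.ndarray:
    """Return P with P[b, b'] = Pr(b' | b, a) for occupancies b, b' in {0, ..., L+1}.

    L: queue length, p: Bernoulli arrival probability, a: departures (0 or 1).
    """
    n = L + 2
    P = np.zeros((n, n))
    for b in range(n):
        # Occupancy after the decision: min{[b - a]^+, L} (an overflowed packet is dropped).
        kept = min(max(b - a, 0), L)
        P[b, kept] += 1.0 - p      # no arrival during this epoch
        P[b, kept + 1] += p        # one arrival during this epoch
    return P
```

For example, with `L = 3`, `p = 0.3`, and `a = 0`, the overflow state b = 4 moves to occupancy 3 with probability 0.7 and stays at 4 with probability 0.3, and every row sums to one.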
Immediate cost
\(C:\mathcal {X}\times \mathcal {A}\rightarrow {\mathbb {R}_{+}}\) is the cost incurred immediately after action a is taken in state x at current decision epoch. It reflects three optimization concerns: the packet delay and queue overflow, the transmission power, and the downlink transmission error rate.
Holding and overflow cost
We define \(h_{i}\), the holding and queue overflow cost associated with queue i, as
\[h_{i}(y_{i})=\lambda \min\{[y_{i}]^{+},L_{i}\}+\xi_{o}\,\mathcal{I}_{\{[y_{i}]^{+}=L_{i}+1\}}, \qquad (4)\]
where λ>0 is the unit holding cost and \(\xi_{o}>\lambda\) is the unit queue overflow cost, which makes \(h_{i}(y_{i})\) a nondecreasing convex function. In the case when \(y_{i}=b_{i}-a_{i}\), \(\min\{[y_{i}]^{+},L_{i}\}\) and \(\mathcal {I}_{\{[y_{i}]^{+}=L_{i}+1\}}\) count the number of packets held in queue i and the number of packets lost due to the overflow of queue i, respectively. We say that the term \(\lambda \min\{[y_{i}]^{+},L_{i}\}\) accounts for the packet delay because, by Little's Law, the average packet delay is proportional to the average number of packets held in the queue in the long run for a given packet arrival rate [24]. We sum up \(h_{i}\) for i∈{1,2} and obtain the total holding and overflow cost as
\[C_{h}(\mathbf{b},\mathbf{a})=\sum_{i=1}^{2}h_{i}(b_{i}-a_{i}). \qquad (5)\]
Transmission cost
Since forwarding and broadcasting one packet, either coded or noncoded, consume the same amount of energy, we have the immediate transmission cost as
\[C_{t}(\mathbf{a})=\tau\,\mathcal{I}_{\{a_{1}=1~\text{or}~a_{2}=1\}}, \qquad (6)\]
where τ>λ is the unit transmission cost and \(\mathcal {I}_{\left \{a_{1}=1~\text {or}~a_{2}=1\right \}}\) counts the number of transmissions resulting from action \(\mathbf{a}\).
Note that (5) and (6) form a power-delay tradeoff. A policy that always transmits whenever there is an incoming packet, without considering coding opportunities in the long run, is penalized by (6), and a policy that always holds packets to wait for coding opportunities, without considering the average packet delay, is penalized by (5).
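The power-delay tradeoff can be made concrete with a short sketch of the holding/overflow and transmission costs as they are described in the text. The unit-cost values used in the usage note are hypothetical, and the function names are chosen for illustration only.

```python
def holding_cost(y: int, L: int, lam: float, xi_o: float) -> float:
    """Holding plus overflow cost of one queue: lam per held packet, xi_o per lost packet."""
    held = min(max(y, 0), L)                      # packets kept in the queue
    overflow = 1 if max(y, 0) == L + 1 else 0     # one packet dropped on overflow
    return lam * held + xi_o * overflow


def transmission_cost(a, tau: float) -> float:
    """One transmission is counted whenever at least one queue releases a packet."""
    return tau if (a[0] == 1 or a[1] == 1) else 0.0


def power_delay_cost(b, a, L, lam: float, xi_o: float, tau: float) -> float:
    """Sum of both queues' holding/overflow costs and the transmission cost."""
    hold = sum(holding_cost(b[i] - a[i], L[i], lam, xi_o) for i in range(2))
    return hold + transmission_cost(a, tau)
```

With, say, `lam = 1`, `xi_o = 10`, `tau = 3`, and `L = (3, 3)`, holding the single packet in state b = (1, 0) costs 1 while forwarding it costs 3, whereas in state b = (1, 1) one coded broadcast clears both queues for a single transmission cost of 3.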
Packet error cost
Since packet errors in downlink transmissions happen only when we decide to transmit, we define the immediate packet error cost due to the action a _{ i } as
where η is the unit packet error cost and −i∈{1,2}∖{i}, i.e., −i=2 if i=1, and −i=1 if i=2. The argument of the cost is \(\text{err}(g_{-i},a_{i})\) because the packet departing queue i is transmitted through channel −i, e.g., the relay sends one packet in queue 1 through fading channel 2 when \(a_{1}=1\). \(P_{e}(g_{i})\) is an estimate of the average BER when transmitting a packet, either coded or noncoded, through channel i when the state is \(g_{i}\). Since BPSK modulation is used at the relay, we define \(P_{e}\) as
Here, \(P_{e}(g_{i})\leq 0.5\) because \(\Gamma_{1}\geq 0\) in A4.
Note that the aforementioned power-delay tradeoff formed by (5) and (6) only poses the problem of whether or not to transmit if an instantaneous packet inflow is not able to form an XORing pair. However, if the scheduler also considers the downlink transmission error rate, a policy that always broadcasts XORed packets whenever there is a coding opportunity, without considering the downlink channel states, is penalized by (7). Therefore, (5), (6), and (7) form a power-delay-error tradeoff.
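The bound of 0.5 can be checked against the standard instantaneous BER of coherent BPSK over an AWGN channel, \(Q(\sqrt{2\gamma})=\tfrac{1}{2}\operatorname{erfc}(\sqrt{\gamma})\). The sketch below evaluates only this textbook expression at a given linear-scale SNR; how the BER is averaged over a channel state region to obtain \(P_{e}(g_{i})\) is not shown here, and the function name is an assumption made for illustration.

```python
import math


def bpsk_ber(snr: float) -> float:
    """Instantaneous BER of coherent BPSK over AWGN at linear-scale SNR `snr` >= 0."""
    # Q(sqrt(2 * snr)) written with the complementary error function.
    return 0.5 * math.erfc(math.sqrt(snr))
```

The BER equals 0.5 at zero SNR and decreases as the SNR grows, so any average taken over a region of nonnegative SNR cannot exceed 0.5.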
In summary, we define the immediate cost as
where
Here, \(C(\mathbf{x},\mathbf{a})\) is in fact a linear combination of loss functions, each of which quantifies one optimization concern. The unit costs λ, \(\xi_{o}\), τ, and η can be considered as weight factors that are either given or adjustable depending on the application. In Section 4, we will derive sufficient conditions for the existence of a structured optimal policy, mainly in terms of the chosen values of these unit costs.
Objective and dynamic programming
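Let \(\mathbf{x}^{(t)}\) and \(\mathbf{a}^{(t)}\) denote the state and action at decision epoch t, respectively, and consider an infinite-horizon MDP model where the discrete decision-making process is assumed to be infinitely long. We can describe the long-run objective as
\[\min_{\theta}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\beta^{t}C\left(\mathbf{x}^{(t)},\mathbf{a}^{(t)}\right)\right], \qquad (11)\]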
where \(\mathbf{x}^{(t+1)}\sim Pr(\cdot \mid \mathbf{x}^{(t)},\mathbf{a}^{(t)})\) and β∈[0,1) is the discount factor that ensures the convergence of the series. It is proved in [14] that if the state space \(\mathcal {X}\) is countable, the action set \(\mathcal {A}\) is finite, and the MDP is stationary, there exists a deterministic stationary policy \(\theta ^{*}:\mathcal {X}\rightarrow {\mathcal {A}}\) that optimizes (11), and \(\theta^{*}\) can be found by DP:
\[V^{(n)}(\mathbf{x})=\min_{\mathbf{a}\in\mathcal{A}}Q^{(n)}(\mathbf{x},\mathbf{a}), \qquad (12)\]
where
\[Q^{(n)}(\mathbf{x},\mathbf{a})=C(\mathbf{x},\mathbf{a})+\beta\sum_{\mathbf{x}'\in\mathcal{X}}P_{\mathbf{x}\mathbf{x}'}^{\mathbf{a}}V^{(n-1)}(\mathbf{x}'). \qquad (13)\]
Here, n denotes the iteration index and \(V^{(0)}(\mathbf{x})=0\) for all \(\mathbf{x}\). Usually, a very small convergence threshold ε>0 is applied so that DP terminates when \(|V^{(N-1)}(\mathbf{x})-V^{(N)}(\mathbf{x})|\leq \epsilon\) for all \(\mathbf{x}\) and some N<∞.^{3} The optimal policy is obtained as \(\theta ^{*}(\mathbf {x})=\arg \min _{\mathbf {a}\in {\mathcal {A}}}Q^{(N)}(\mathbf {x},\mathbf {a})\).
As discussed in Section 2, the problem under consideration is a cross-layer multi-objective one. When defining the immediate cost function (9), we use a scalarization technique, i.e., \(C(\mathbf{x},\mathbf{a})\) is a weighted sum of the holding and packet overflow costs incurred in the data link layer and the transmission power consumption and error rate incurred in the physical layer. Therefore, the optimal policy \(\theta^{*}\) is in fact a Pareto optimal solution.^{4} It should be clear that a Pareto optimal solution is not necessarily optimal for an individual optimization metric, e.g., \(\theta^{*}\) is not the optimal solution if we just want to minimize the power consumption in the physical layer.
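The DP recursion with \(V^{(0)}=0\) and the ε stopping rule can be sketched as a generic value iteration. This is a minimal sketch for an arbitrary finite MDP, not the NC-TWRC model itself: the array layout (`C[s, a]`, `P[a, s, s']`), the function name, and the two-state toy problem are all assumptions made for illustration.

```python
import numpy as np


def value_iteration(C, P, beta, eps=1e-8, max_iter=10_000):
    """Discounted-cost value iteration.

    C: (S, A) immediate costs; P: (A, S, S) transition probabilities.
    Returns the converged value function and a greedy (arg min) policy.
    """
    S, A = C.shape
    V = np.zeros(S)                                   # V^(0)(x) = 0 for all x
    for _ in range(max_iter):
        # Q[s, a] = C[s, a] + beta * sum_j P[a, s, j] * V[j]
        Q = C + beta * np.einsum("asj,j->sa", P, V)
        V_new = Q.min(axis=1)
        converged = np.max(np.abs(V_new - V)) <= eps  # stop when successive values agree
        V = V_new
        if converged:
            break
    Q = C + beta * np.einsum("asj,j->sa", P, V)
    return V, Q.argmin(axis=1)


# Toy problem: action 0 stays in place, action 1 jumps to state 0 at cost 0.5.
C_toy = np.array([[0.0, 0.5], [1.0, 0.5]])
P_toy = np.array([[[1.0, 0.0], [0.0, 1.0]],           # transitions under action 0
                  [[1.0, 0.0], [1.0, 0.0]]])          # transitions under action 1
V_star, policy = value_iteration(C_toy, P_toy, beta=0.9)
```

In the toy problem, staying in state 1 forever would accumulate a discounted cost of 10, so the greedy policy pays 0.5 once to move to the cost-free state 0.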
Structured optimal policies
The time complexity of iteration n in DP is \(O(|\mathcal {X}|^{2}|\mathcal {A}|)\): there are \(|\mathcal {X}|\) minimization operations, each of which requires \(|\mathcal {A}|\) calculations of \(Q^{(n)}\), and each \(Q^{(n)}\) value requires \(|\mathcal {X}|\) multiplications over the next states \(\mathbf{x}'\). Since \(|\mathcal {X}|=|\mathcal {B}_{1}||\mathcal {B}_{2}||\mathcal {G}_{1}||\mathcal {G}_{2}|\), the complexity grows quadratically if the cardinality of any tuple in the state variable increases. If the node-to-node transmission in NC-TWRC is via multiple channels (e.g., single-user MIMO channels), the complexity grows exponentially with the number of user-to-relay channels, which may make the computation prohibitively expensive. In this section, we investigate the submodularity, \(L^{\natural}\)-convexity, and multimodularity of the functions \(Q^{(n)}(\mathbf{x},\mathbf{a})\) and \(V^{(n)}(\mathbf{x})\) in DP to establish sufficient conditions for the existence of a monotonic optimal policy. These results serve as the prerequisites for the low-complexity algorithms proposed in Section 5. We first clarify some concepts as follows.
Definition 4.1 (Monotonic policy).
Let \(\theta \colon \mathbb {Z}^{n}\rightarrow {\mathbb {Z}^{m}}\). \(\theta(\mathbf{x})\) is monotonic nondecreasing if \(\theta(\mathbf{x}^{+})\succeq \theta(\mathbf{x}^{-})\) for all \(\mathbf {x}^{+},\mathbf {x}^{-}\in {\mathbb {Z}^{n}}\) such that \(\mathbf{x}^{+}\succeq \mathbf{x}^{-}\), where ≽ denotes componentwise greater than or equal to.
Definition 4.2 (Submodularity [23, 25]).
Let \(\mathbf {e}_{i}\in {\mathbb {Z}^{n}}\) be an n-tuple with all zero entries except the ith entry being one. \(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is submodular if \(f(\mathbf{x}+\mathbf{e}_{i})+f(\mathbf{x}+\mathbf{e}_{j})\geq f(\mathbf{x})+f(\mathbf{x}+\mathbf{e}_{i}+\mathbf{e}_{j})\) for all \(\mathbf {x}\in {\mathbb {Z}^{n}}\) and 1≤i,j≤n with i≠j. f is strictly submodular if the inequality is strict.
In DP, a submodular function \(Q^{(n)}(\mathbf{x},\mathbf{a})\) has \(Q^{(n)}(\mathbf{x},\mathbf{a}^{-})-Q^{(n)}(\mathbf{x},\mathbf{a}^{+})\) nondecreasing in \(\mathbf{x}\) for all \(\mathbf{a}^{+}\succeq \mathbf{a}^{-}\), i.e., the preference for choosing action \(\mathbf{a}^{+}\) over \(\mathbf{a}^{-}\) is always nondecreasing in \(\mathbf{x}\). Therefore, an increase in the state variable \(\mathbf{x}\) implies an increase in the decision rule \(\theta^{(n)}(\mathbf{x})=\arg\min_{\mathbf{a}}Q^{(n)}(\mathbf{x},\mathbf{a})\). This property is summarized in a general form in the following lemma.
Lemma 4.3.
If \(g\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is submodular in \((\mathbf {x},\mathbf {y})\in {\mathbb {Z}^{n}}\), then \(f(\mathbf{x})=\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})\) is submodular in \(\mathbf{x}\), and the minimizer \(\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})\) is nondecreasing in \(\mathbf{x}\) [26].
Definition 4.4 (\(L^{\natural}\)-convexity [23]).
\(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is \(L^{\natural}\)-convex if \(\psi(\mathbf{x},\zeta)=f(\mathbf{x}-\zeta \mathbf{1})\) is submodular in \((\mathbf{x},\zeta)\), where \(\mathbf {1}=(1,1,\dotsc,1)\in {\mathbb {Z}^{n}}\) and \(\zeta \in {\mathbb {Z}}\).
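Submodularity and the monotone minimizer in Lemma 4.3 can be checked numerically on a small grid. The sketch below tests the pairwise inequality of Definition 4.2 for distinct coordinates on a finite array; restricting the check to a finite grid, the helper name, and the example functions are assumptions made for illustration.

```python
import itertools
import numpy as np


def is_submodular(F, tol=1e-12):
    """Check f(x+e_i) + f(x+e_j) >= f(x) + f(x+e_i+e_j) for all i != j on a finite grid."""
    F = np.asarray(F, dtype=float)
    for i, j in itertools.combinations(range(F.ndim), 2):
        for x in np.ndindex(*F.shape):
            if x[i] + 1 >= F.shape[i] or x[j] + 1 >= F.shape[j]:
                continue  # a neighbor falls outside the grid; nothing to check
            xi = list(x); xi[i] += 1
            xj = list(x); xj[j] += 1
            xij = list(xi); xij[j] += 1
            if F[tuple(xi)] + F[tuple(xj)] < F[x] + F[tuple(xij)] - tol:
                return False
    return True


# g(x, y) = (x - y)^2 is submodular; g(x, y) = x * y is not.
G = np.array([[(x - y) ** 2 for y in range(5)] for x in range(5)])
H = np.array([[x * y for y in range(5)] for x in range(5)])
minimizer = G.argmin(axis=1)   # y*(x) = arg min_y g(x, y)
```

For the submodular example the minimizer is y*(x) = x, which is nondecreasing in x, in line with the lemma.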
Definition 4.5 (multimodularity [23]).
\(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is multimodular if \(\psi(\mathbf{x},\zeta)=f(x_{1}-\zeta,x_{2}-x_{1},\dotsc,x_{n}-x_{n-1})\) is submodular in \((\mathbf{x},\zeta)\), where \(\zeta \in {\mathbb {Z}}\).
\(L^{\natural}\)-convexity and multimodularity are two concepts defined in discrete convex analysis [27]. \(L^{\natural}\)-convexity implies submodularity, while multimodularity implies supermodularity^{5} [28]. They both contribute to a monotonic structure in the optimal policy.
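Definition 4.4 can be turned into a direct numerical test. Expanding the submodularity of \(\psi(\mathbf{x},\zeta)=f(\mathbf{x}-\zeta\mathbf{1})\) gives two families of inequalities on f: the usual pairwise submodularity, and \(f(\mathbf{z}+\mathbf{e}_{i})+f(\mathbf{z}-\mathbf{1})\geq f(\mathbf{z})+f(\mathbf{z}-\mathbf{1}+\mathbf{e}_{i})\). The sketch below checks both on a finite grid, skipping any inequality with a point outside the grid; this finite-grid restriction and all names are assumptions made for illustration.

```python
import itertools
import numpy as np


def _holds(F, lhs, rhs, tol):
    """True if F[lhs0] + F[lhs1] >= F[rhs0] + F[rhs1], or if any point is off-grid."""
    pts = lhs + rhs
    if not all(0 <= c < F.shape[k] for p in pts for k, c in enumerate(p)):
        return True
    left = F[tuple(lhs[0])] + F[tuple(lhs[1])]
    right = F[tuple(rhs[0])] + F[tuple(rhs[1])]
    return left >= right - tol


def is_l_natural_convex(F, tol=1e-12):
    """Check the inequalities implied by submodularity of psi(x, zeta) = f(x - zeta * 1)."""
    F = np.asarray(F, dtype=float)
    n = F.ndim
    e = np.eye(n, dtype=int)
    one = np.ones(n, dtype=int)
    for idx in np.ndindex(*F.shape):
        z = np.array(idx)
        # Pairs of x-coordinates: plain submodularity of f.
        for i, j in itertools.combinations(range(n), 2):
            if not _holds(F, [z + e[i], z + e[j]], [z, z + e[i] + e[j]], tol):
                return False
        # Pairs (x_i, zeta): f(z + e_i) + f(z - 1) >= f(z) + f(z - 1 + e_i).
        for i in range(n):
            if not _holds(F, [z + e[i], z - one], [z, z - one + e[i]], tol):
                return False
    return True
```

In one dimension the second family reduces to f(z+1)+f(z−1) ≥ 2f(z), so a convex sequence such as z² passes and a concave one such as √z fails.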
Lemma 4.6.
If \(g\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is \(L^{\natural}\)-convex/multimodular in \((\mathbf {x},\mathbf {y})\in \mathbb {Z}^{n}\), then \(f(\mathbf{x})=\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})\) is \(L^{\natural}\)-convex/multimodular in \(\mathbf{x}\), and the minimizer \(\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}g(\mathbf{x},\mathbf{y})\) is nondecreasing/nonincreasing in \(\mathbf{x}\) [28, 29].
The unimodular coordinate transform below describes the relationship between \(L^{\natural}\)-convexity and multimodularity.
Lemma 4.7 (unimodular coordinate transform [23, 28]).
Let matrix \(M_{n,i}=\left [\begin {array}{cc} U_{i} & 0 \\ 0 & L_{n-i} \end {array} \right ]\), where \(U_{i}\) and \(L_{i}\) are the i×i upper and lower triangular matrices with all nonzero entries being one, respectively. Then

(a)
a function \(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is multimodular if and only if it can be represented by \(f(\mathbf{x})=g(\pm M_{n,i}\mathbf{x})\) for some \(L^{\natural}\)-convex function g.

(b)
a function \(g\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is \(L^{\natural}\)-convex if and only if it can be represented by \(g(\mathbf {x})=f\left (\pm M_{n,i}^{-1}\mathbf {x}\right)\) for some multimodular function f.
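The block matrix in Lemma 4.7 is easy to construct and inspect. The sketch below builds \(M_{n,i}\) from an upper triangular block of ones of size i and a lower triangular block of ones of size n−i, and confirms that it is unimodular (integer entries, determinant 1, integer inverse). The function name is an assumption made for illustration.

```python
import numpy as np


def unimodular_M(n: int, i: int) -> np.ndarray:
    """Return M_{n,i} = blockdiag(U_i, L_{n-i}) with triangular blocks of ones."""
    M = np.zeros((n, n), dtype=int)
    M[:i, :i] = np.triu(np.ones((i, i), dtype=int))          # upper triangular block U_i
    M[i:, i:] = np.tril(np.ones((n - i, n - i), dtype=int))  # lower triangular block L_{n-i}
    return M


M31 = unimodular_M(3, 1)
M20_inv = np.linalg.inv(unimodular_M(2, 0))
```

For n = 2 and i = 0 the inverse is [[1, 0], [−1, 1]], i.e., it maps (x₁, x₂) to (x₁, x₂ − x₁), which is the kind of difference coordinate that appears in Definition 4.5.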
Definition 4.8 (First order stochastic dominance [18]).
Let \(\tilde {\rho }(x)\) be a random selection on space \(\mathcal {X}\) according to a probability measure μ(x), where x conditions the random selection. Then \(\tilde {\rho }(x)\) is first order stochastically nondecreasing in x if \(\mathbb {E}[u(\tilde {\rho }(x^{+}))] \geq \mathbb {E}[u(\tilde {\rho }(x^{-}))]\) for all nondecreasing functions u and all \(x^{+}\geq x^{-}\).
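For a finite-state chain, first order stochastic dominance between rows of a transition matrix can be tested through tail sums: row k⁺ dominates row k⁻ exactly when \(\sum_{j\geq m}P_{k^{+}j}\geq \sum_{j\geq m}P_{k^{-}j}\) for every threshold m. The sketch below applies this standard equivalence; the function name and the two example matrices are hypothetical.

```python
import numpy as np


def is_stochastically_nondecreasing(P, tol=1e-12):
    """True if the rows of the transition matrix P are first order stochastically nondecreasing."""
    P = np.asarray(P, dtype=float)
    # tails[k, m] = sum_{j >= m} P[k, j]
    tails = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]
    # Every tail sum must be nondecreasing in the row index k.
    return bool(np.all(np.diff(tails, axis=0) >= -tol))


# Hypothetical 3-state chain with transitions to adjacent states only.
P_good = [[0.8, 0.2, 0.0],
          [0.1, 0.8, 0.1],
          [0.0, 0.2, 0.8]]
# A chain that tends to flip between its two states violates the ordering.
P_bad = [[0.1, 0.9],
         [0.9, 0.1]]
```

The first matrix satisfies the ordering, meaning that a better current state makes a better next state more likely; the second does not.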
Structured properties of dynamic programming
To propose the prototypical procedure of proving the existence of a monotonic optimal policy, we first define a \({\mathcal {P}^{\star }\!}\) property as follows:
Definition 4.9 (\(\mathcal {P}^{\star }\!\) property).
\(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) has \(\mathcal {P}^{\star }\!\) property in \((\mathbf {x},\mathbf {y})\in {\mathbb {Z}^{n}}\) if \(f^{*}(\mathbf{x})=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\) has \({\mathcal {P}^{\star }\!}\) property in \(\mathbf{x}\) and \(\mathbf{y}^{*}(\mathbf{x})=\arg\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\) is monotonic (nondecreasing/nonincreasing) in \(\mathbf{x}\).
Theorem 4.10.
Submodularity, \(L^{\natural}\)-convexity, and multimodularity have \(\mathcal {P}^{\star }\!\) property.
Proof.
It can be directly proved by Lemma 4.3 and Lemma 4.6.
We therefore propose an approach, similar to Proposition 5 in [18], as follows:
Proposition 4.11.
Let DP converge at the Nth iteration. The optimal value function \(V^{*}(\mathbf{x})=V^{(N)}(\mathbf{x})\) has \(\mathcal {P}^{\star }\!\) property, and the optimal policy \(\theta^{*}\) is monotonic in \(\mathbf{x}\), if:

C(x,a) has \(\mathcal {P}^{\star }\!\) property,

\(Q^{(n)}(\mathbf {x},\mathbf {a})=C(\mathbf {x},\mathbf {a})+ \beta \sum _{\mathbf {x}'\in {\mathcal {X}}}P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}} V^{(n-1)}(\mathbf {x}')\) has \(\mathcal {P}^{\star }\!\) property for all n and all functions \(V^{(n-1)}\) with \({\mathcal {P}^{\star }\!}\) property.
Proof.
Since DP starts from \(V^{(0)}(\mathbf{x})=0\) for all \(\mathbf {x}\in {\mathcal {X}}\), \(Q^{(1)}(\mathbf{x},\mathbf{a})=C(\mathbf{x},\mathbf{a})\) has \(\mathcal {P}^{\star }\!\) property. So \(V^{(1)}(\mathbf {x})=\min _{\mathbf {a}\in {\mathcal {A}}}Q^{(1)}(\mathbf {x},\mathbf {a})\) has \({\mathcal {P}^{\star }\!}\) property. By induction, assume \(V^{(n-1)}(\mathbf{x})\) has \({\mathcal {P}^{\star }\!}\) property. Then \(Q^{(n)}\) and \(V^{(n)}(\mathbf {x})=\min _{\mathbf {a}\in {\mathcal {A}}}Q^{(n)}(\mathbf {x},\mathbf {a})\) have \({\mathcal {P}^{\star }\!}\) property. Therefore, \(Q^{(N)}(\mathbf{x},\mathbf{a})\) and \(V^{*}(\mathbf{x})=V^{(N)}(\mathbf{x})\) must also possess \({\mathcal {P}^{\star }\!}\) property, and \(\theta ^{*}(\mathbf {x})=\arg \min _{\mathbf {a}\in {\mathcal {A}}}Q^{(N)} (\mathbf {x},\mathbf {a})\) is monotonic in \(\mathbf{x}\).
Monotonic policies in queue states
Nondecreasing \(a_{i}^{*}\) in \(b_{i}\)
Let the optimal action be \(\mathbf {a}^{*}=\theta ^{*}(\mathbf {x})=(\theta _{1}^{*}(\mathbf {x}),\theta _{2}^{*}(\mathbf {x}))\), where \(a_{i}^{*}=\theta _{i}^{*}(\mathbf {x})\) is the optimal action for queue i determined by \(\theta^{*}\). The following theorem shows that the optimal action \(a_{i}^{*}\) is monotonic in \(b_{i}\), the state of the queue controlled by \(a_{i}\), if the unit costs satisfy a certain condition.
Theorem 4.12.
If \(\xi_{o}\geq 2\lambda+\eta+\tau\),^{6} then for all i∈{1,2}, \(C(\mathbf{x},\mathbf{a})\) and \(Q^{(n)}(\mathbf{x},\mathbf{a})\) are nondecreasing in \(b_{i}\) and \(L^{\natural}\)-convex in \((b_{i},a_{i})\), \(V^{*}(\mathbf{x})\) is nondecreasing and \(L^{\natural}\)-convex in \(b_{i}\), and the optimal action \(a^{*}_{i}\) is nondecreasing in \(b_{i}\).
Proof.
We define two functions
where \(\tilde {C}_{h}(\mathbf {y})=\sum _{i=1}^{2}h_{i}(y_{i})\) and
Here,
\(\mathbf{y}=(y_{1},y_{2})\) and \(\mathbf{f}=(f_{1},f_{2})\). It is easy to see that \(C(\mathbf {b},\mathbf {g},\mathbf {a})=\tilde {C}(\mathbf {b}-\mathbf {a},\mathbf {g},\mathbf {a})\) and \(Q^{(n)}(\mathbf {b},\mathbf {g},\mathbf {a})=\tilde {Q}^{(n)}(\mathbf {b}-\mathbf {a},\mathbf {g},\mathbf {a})\). Since
according to Lemma 4.7(b), it follows that proving the \(L^{\natural}\)-convexity of \(C(\mathbf{b},\mathbf{g},\mathbf{a})\) and \(Q^{(n)}(\mathbf{b},\mathbf{g},\mathbf{a})\) in \((b_{i},a_{i})\) is equivalent to showing the multimodularity of \(\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})\) and \(\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})\) in \((y_{i},a_{i})\). It is also clear that the monotonicity of \(C(\mathbf{b},\mathbf{g},\mathbf{a})\) and \(Q^{(n)}(\mathbf{b},\mathbf{g},\mathbf{a})\) in \(b_{i}\) is equivalent to the monotonicity of \(\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})\) and \(\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})\) in \(y_{i}\). See Appendix C for the proof of the monotonicity and multimodularity of \(\tilde {C}(\mathbf {y},\mathbf {g},\mathbf {a})\) and \(\tilde {Q}^{(n)}(\mathbf {y},\mathbf {g},\mathbf {a})\) in \(y_{i}\) and \((y_{i},a_{i})\), respectively.
According to Proposition 4.7.3 in [14], \(V^{*}(\mathbf{x})\) is nondecreasing in \(b_{i}\). By Theorem 4.10 and Proposition 4.11, \(V^{*}(\mathbf{x})\) is \(L^{\natural}\)-convex in \(b_{i}\), and \(a_{i}^{*}\) is nondecreasing in \(b_{i}\).
Note that Theorem 4.12 aligns with the existing results in the literature, e.g., the adaptive MIMO transmission control [21] and the Markov-game-modeled adaptive modulation of cognitive radio [19]. In fact, both of them can be explained by \(L^{\natural}\)-convexity. In [21], the monotonicity of \(a_{i}^{*}\) in \(b_{i}\) was shown by the multimodularity in \((b_{i},-a_{i})\). But,
By Lemma 4.7(b), we know that if the a function is multimodular in (b _{ i },−a _{ i }), then it must be L ^{♮}convex in (b _{ i },a _{ i }). Consequently, V ^{(n)}(x) is integer convex in b _{ i } because L ^{♮}convexity in one dimension is exactly integer convexity^{7}. In [19], the monotonicity of \(a_{i}^{*}\) was shown by the submodularity of Q ^{(n)} in (b _{ i },a _{ i }). But, Q ^{(n)} is a function of b _{ i }−a _{ i }. According to Definition 4.4, the L ^{♮}convexity of g(x _{1},x _{2})=f(x _{1}−x _{2}) in (x _{1},x _{2}) is equivalent to the submodularity of g(x _{1},x _{2}) in (x _{1},x _{2}). So Q ^{(n)} is also L ^{♮}convex in (b _{ i },a _{ i }).
Nondecreasing \(a_{i}^{*}\) in (b _{1},b _{2})
We formulate the optimization problem in the nth iteration of DP as a 2-player 2-strategy game, which is called the one-stage game in Fig. 3. Assume that action a _{1} is taken by player 1, and a _{2} is taken by player 2. It is a pure coordination game because the utility −Q ^{(n)}(x,(a _{1},a _{2})) is the same for players 1 and 2.
We prove, in Appendix D, that the game in Fig. 3 is a supermodular game with utility function −Q ^{(n)}(x,(a _{1},a _{2})) strictly supermodular in a=(a _{1},a _{2}) for all x and all V ^{(n−1)}(x ^{′}) that are L ^{♮}convex in \(\mathbf {b}^{\prime }=(b^{\prime }_{1},b^{\prime }_{2})\). It is proved in [30] that there exists at least one equilibrium \((a_{1}^{*},a_{2}^{*})\) in the form of pure strategy in a supermodular game. Then, we have the following theorem for the monotonicity of the optimal action \(a_{i}^{*}\) in b=(b _{1},b _{2}).
Theorem 4.13.
If

(a)
ξ _{ o }≥2λ+η+τ,

(b)
the one-stage game (in Fig. 3) has two pure strategy equilibria (0,0) and (1,1) for all x=(b _{1},b _{2},g _{1},g _{2}) such that b _{ i }<L _{ i }+1 for all i∈{1,2},
then C(x,a) and Q ^{(n)}(x,a) are L ^{♮}convex in (b,a)=(b _{1},b _{2},a _{1},a _{2}), the optimal value function V ^{∗}(x) is L ^{♮}convex in b=(b _{1},b _{2}) and the optimal action \(\mathbf {a}^{*}=(a_{1}^{*},a_{2}^{*})\) is nondecreasing in b=(b _{1},b _{2}).
Proof.
The proof is in Appendix E.
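Condition (b) of Theorem 4.13 can be inspected for a given state by enumerating the pure strategy equilibria of the one-stage game. The following is a minimal sketch under stated assumptions: the table `q_example` is a hypothetical set of Q ^{(n)} values for one state, not taken from the paper, and both players are assumed to minimise the common cost Q ^{(n)}.

```python
from itertools import product


def pure_equilibria(q):
    """Return the pure strategy equilibria of a 2x2 common-cost game.

    q[(a1, a2)] is the cost shared by both players (utility is -q).
    A profile is an equilibrium if no player can lower q by deviating alone.
    """
    found = []
    for a1, a2 in product((0, 1), repeat=2):
        best_for_1 = all(q[(a1, a2)] <= q[(d, a2)] for d in (0, 1))
        best_for_2 = all(q[(a1, a2)] <= q[(a1, d)] for d in (0, 1))
        if best_for_1 and best_for_2:
            found.append((a1, a2))
    return found


def cost_is_strictly_submodular(q):
    """Strict submodularity of q in (a1, a2), i.e. -q is strictly supermodular."""
    return q[(1, 1)] + q[(0, 0)] < q[(0, 1)] + q[(1, 0)]


# Hypothetical Q-values for a single state x.
q_example = {(0, 0): 1.0, (0, 1): 2.5, (1, 0): 2.0, (1, 1): 1.2}
equilibria = pure_equilibria(q_example)
```

For this hypothetical table the enumeration returns (0,0) and (1,1), the pattern required by condition (b).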
Here is a corollary of Theorem 4.13.
Corollary 4.14.
If

(a)
ξ _{ o }≥2λ+η+τ,

(b)
p _{1}=p _{2}=0.5,

(c)
\(\beta \leq \frac {2(\tau -\lambda)}{\tau +\eta }\),
then Theorem 4.13 holds.
Proof.
The proof is in Appendix F.
We show examples of Theorems 4.12 and 4.13 in Figs. 4, 5, 6 and 7. The results are obtained by value iteration, a DP algorithm, applied to an NCTWRC system with Bernoulli packet arrivals, 5 queue states, and 8 channel states, i.e., \(f_{i}^{(t)}\sim {\text {Bernoulli}(p_{i})}\), L _{ i }=3 and K _{ i }=8 for all t and i∈{1,2}. In Fig. 4, we choose the values of unit costs to make Theorem 4.12 hold. As shown in the figure, the optimal actions \(a_{1}^{*}\) and \(a_{2}^{*}\) are monotonic in b _{1} and b _{2}, respectively, i.e., \(a_{i}^{*}\) is nondecreasing in the queue state that is being controlled by a _{ i }. In Fig. 5, we change the value of unit cost ξ _{ o } to violate the condition in Theorem 4.12 so that the monotonicity of \(a_{i}^{*}\) in b _{ i } is not guaranteed. In this case, \(a_{1}^{*}\) is not monotonic in b _{1}.
In Fig. 6, we choose the equiprobable packet arrival rates p _{1}=p _{2}=0.5 and the unit costs according to Corollary 4.14 to make Theorem 4.13 hold. As shown in the figure, the optimal actions \(a_{1}^{*}\) and \(a_{2}^{*}\) are both nondecreasing in (b _{1},b _{2}). Compared to Fig. 4, in this case \(a_{i}^{*}\) is also monotonic in b _{−i }, the queue state that is affected by the message flow and transmission control in the opposite direction, i.e., the queue state that is not controlled by a _{ i }. In Fig. 7, we switch unit cost η from 1 to 2 so that Theorem 4.13 no longer holds. In this case, neither \(a_{1}^{*}\) nor \(a_{2}^{*}\) is monotonic in (b _{1},b _{2}). But, the condition in Theorem 4.12 is satisfied. Therefore, \(a_{1}^{*}\) and \(a_{2}^{*}\) are still nondecreasing in b _{1} and b _{2}, respectively.
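The monotonicity patterns described for Figs. 4 to 7 can be verified mechanically once a policy table is available. The following is a minimal sketch; the dictionary layout (state tuple mapped to a single binary action) and the two example policies are hypothetical and are not the policies plotted in the figures.

```python
def is_nondecreasing_in(policy, axis):
    """Check that a binary action is nondecreasing along one state coordinate.

    policy maps state tuples, e.g. (b1, b2), to an action in {0, 1}.
    axis is the index of the coordinate that is increased by one.
    """
    for state, action in policy.items():
        neighbour = list(state)
        neighbour[axis] += 1
        neighbour = tuple(neighbour)
        if neighbour in policy and policy[neighbour] < action:
            return False
    return True


# Hypothetical policies for a_1 over queue states (b1, b2) in {0,...,4}^2.
threshold_policy = {(b1, b2): int(b1 >= 2) for b1 in range(5) for b2 in range(5)}
bumpy_policy = {(b1, b2): int(b1 in (1, 4)) for b1 in range(5) for b2 in range(5)}
```

Here `threshold_policy` is nondecreasing in both b _{1} and b _{2}, whereas `bumpy_policy` is not monotonic in b _{1}, which mirrors the kind of behaviour reported when the conditions of the theorems are violated.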
Monotonic policies in channel states
The related research work in the existing literature considers the structure of the optimal policy in queue state only, e.g., [19, 21, 24]. This section breaks this limitation in that we extend the investigation of the monotonicity to the channel states. The main results are summarized as follows.
Theorem 4.15.
If

(a)
ξ _{ o }≥2λ+η+τ,

(b)
P _{ e }(g _{ i })≥P _{ e }(g _{ i }+1),

(c)
\(P_{g_{i}g_{i}^{\prime }}\) is first order stochastic nondecreasing in g _{ i },

(d)
\(\beta \leq \frac {P_{e}(g_{i})-P_{e}(g_{i}+1)}{\sum _{g'_{i}}P_{g_{i}g'_{i}} (P_{e}(g'_{i})-P_{e}(g'_{i}+1)) }\),
then C(x,a) and Q ^{(n)}(x,a) are submodular in (b _{ i },g _{−i },a _{ i }), V ^{∗}(x) is submodular in (b _{ i },g _{−i }), and the optimal action \(a_{i}^{*}\) is nondecreasing in (b _{ i },g _{−i }).
Proof.
The proof is in Appendix G.
In Theorem 4.15, condition (b) is straightforwardly satisfied because of the definition of P _{ e } in (8) and assumption A4. Conditions (c) and (d) depend on the fading statistics and the FSMC modeling method. In fact, condition (c) is not hard to satisfy.
Corollary 4.16.
If the FSMC of channel i adopts equiprobable partitioning (of the full range of SNR), and channel i experiences slow and flat Rayleigh fading, then condition (c) in Theorem 4.15 is satisfied.
Proof.
The proof is in Appendix H.
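Condition (c) of Theorem 4.15 can also be checked numerically for a given channel transition matrix. The following is a minimal sketch: it assumes the standard equivalent form of first order stochastic monotonicity (every upper tail sum of a row is nondecreasing in the row index), and `nearest_neighbour_matrix` is a hypothetical stand-in for an FSMC in which transitions occur only between adjacent states with a common probability `p`.

```python
def is_stochastically_nondecreasing(P, tol=1e-12):
    """Check first order stochastic monotonicity of the rows of P.

    For every threshold k, the tail sum over g' >= k of P[g][g'] must be
    nondecreasing in the current state g.
    """
    K = len(P)
    for k in range(K):
        tails = [sum(row[k:]) for row in P]
        if any(tails[g + 1] < tails[g] - tol for g in range(K - 1)):
            return False
    return True


def nearest_neighbour_matrix(K, p):
    """Hypothetical symmetric chain that moves only between adjacent states."""
    P = [[0.0] * K for _ in range(K)]
    for g in range(K):
        if g > 0:
            P[g][g - 1] = p
        if g < K - 1:
            P[g][g + 1] = p
        P[g][g] = 1.0 - sum(P[g])  # remaining mass stays in the same state
    return P


slow_fading = nearest_neighbour_matrix(8, 0.05)
swap = [[0.0, 1.0], [1.0, 0.0]]  # always jumps to the other state
```

With a small adjacent-state probability the check passes, while the `swap` matrix fails it, since a higher current state then makes a higher next state less likely.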
We show examples of Theorem 4.15 in Figs. 8 and 9. In Fig. 8, we use the same system parameters as in Fig. 4 except that the discount factor β is switched from 0.97 to 0.95 in order to satisfy the inequality in condition (d) of Theorem 4.15. The results are obtained from an NCTWRC system where the channels experience slow and flat Rayleigh fading with average SNR \(\overline {\gamma }_{1}=\overline {\gamma }_{2}=0\text {dB}\). Both FSMCs are 8state and adopt equiprobable partition method. In this case, all the conditions in Theorem 4.15 are satisfied according to Corollary 4.16. Therefore, \(a_{1}^{*}\) is nondecreasing in (b _{1},g _{2}), and \(a_{2}^{*}\) is nondecreasing in (b _{2},g _{1}). In Fig. 9, we switch \(\overline {\gamma }_{2}\) from 0dB to 3dB to breach condition (d) in Theorem 4.15. In this case, \(a_{1}^{*}\) is not monotonic in g _{2}. But, since Theorem 4.12 still holds, \(a_{1}^{*}\) and \(a_{2}^{*}\) are monotonic in b _{1} and b _{2}, respectively.
Note that the related previous studies usually placed constraints on the environments or the DP functions in order to prove the structure in the optimal policy. For example, in [19] the submodularity of the state transition probability was proved by assuming uniformly distributed traffic rates, and in [31], the strict submodularity of Q ^{(n)} in DP iterations was assumed to be preserved by a weight factor in the immediate cost function (however, the exact value of this factor was not given). In contrast, the basic result in this paper, Theorem 4.12, is essentially given in terms of unit costs and discount factor, the parameters in the MDP model. The practical meaning of Theorem 4.12 can be interpreted in two ways. If the unit costs and discount factor are adjustable, we can tune them to get a structured optimal policy. If they are given, we can check the sufficient conditions for the existence of a monotonic optimal policy after the MDP modeling. In addition, we derive further results, Theorems 4.13 and 4.15, by considering uniform traffic rates, stochastic dominance of channel transition probabilities, the channel modeling method, and the modulation scheme. They are applicable whenever the associated conditions are satisfied.
Low complexity algorithms
This section considers the question of how to exploit the results in Section 4 to simplify the optimization process of problem (11). For this purpose, we present MPI and DSPSA algorithms for the MDP model in Section 3.
Monotonic policy iteration
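The idea of MPI is to modify (12) as
$$\begin{array}{*{20}l} V^{(n)}(\mathbf{x})=\min_{\mathbf{a}\in\mathcal{A}(\mathbf{x})}Q^{(n)}(\mathbf{x},\mathbf{a}), \end{array} $$((19))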
where \(\mathcal {A}(\mathbf {x})\subseteq {\mathcal {A}}\) is a selection of actions in \(\mathcal {A}=\{0,1\}^{2}=\{(0,0),(0,1),(1,0),(1,1)\}\). Let \(\theta ^{(n)}(\mathbf {x})=\arg \min _{\mathbf {a}\in \mathcal {A}(\mathbf {x})}Q^{(n)}(\mathbf {x},\mathbf {a})\). Note that θ ^{(n)}(x) can be obtained at the same time when V ^{(n)}(x) is calculated. We express θ ^{(n)}(x) as
Assume that Theorem 4.13 holds. We can define the action selection set \(\mathcal {A}(\mathbf {x})\) as follows. Due to the L ^{♮}convexity of Q ^{(n)} in (b _{ i },a _{ i }), \(\theta _{i}^{(n)}\) is always nondecreasing in b _{ i }. Therefore, we can define \(\mathcal {A}(\mathbf {x})\) as
when b _{1}≠0 and b _{2}≠0 and \(\mathcal {A}(\mathbf {x})=\mathcal {A}\) when b _{1}=b _{2}=0. For example, consider the case when g _{1}=1 and g _{2}=1 at some iteration n. We need to determine the value of θ ^{(n)}(x) for all x=(b _{1},b _{2},g _{1},g _{2}) such that g _{1}=1 and g _{2}=1. We start with the lowest values of b _{1} and b _{2}. For state x=(0,0,1,1), we have \(\mathcal {A}(\mathbf {x})=\mathcal {A}=\{0,1\}^{2}\). In this case, the minimization problem \(\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})\) is equivalent to \(\min _{\mathbf {a}\in {\mathcal {A}}}Q^{(n)}(\mathbf {x},\mathbf {a})\), i.e., we need to obtain four values of Q ^{(n)}(x,a) at a=(0,0), (0,1), (1,0), and (1,1) to determine the minimum. If we get θ ^{(n)}(x)=(0,1) for x=(0,0,1,1), then \(\mathcal {A}(\mathbf {x})=\{0,1\}\times \{1\}=\{(0,1),(1,1)\}\) for x=(0,1,1,1), (1,0,1,1) and (1,1,1,1). It means that only two calculations of Q ^{(n)}(x,a) are required when we want to determine the value of \(\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})\) for these three states. In addition, if we find that θ ^{(n)}(x)=(1,1) for x=(1,1,1,1), then, for all x such that b _{1}>1, b _{2}>1, g _{1}=1 and g _{2}=1, \(\mathcal {A}(\mathbf {x})=\{(1,1)\}\) and we can directly assign θ ^{(n)}(x)=(1,1) without doing the minimization \(\min _{\mathbf {a}\in {\mathcal {A}(\mathbf {x})}}Q^{(n)}(\mathbf {x},\mathbf {a})\). We can find the optimal policy by repeating this process for all values of g _{1} and g _{2} in each iteration. From this example, it can be seen that (19) should be conducted in the increasing order of b _{1} and b _{2} so that the cardinality of set \(\mathcal {A}(\mathbf {x})\) is progressively reducing.
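The sweep over queue states described in this example can be sketched as follows. This is an illustration under stated assumptions rather than the exact MPI procedure: the pruning rule (keep only actions that are componentwise no smaller than the minimisers already found at (b _{1}−1,b _{2}) and (b _{1},b _{2}−1)) is one hypothetical way to build \(\mathcal {A}(\mathbf {x})\) from the monotonicity in Theorem 4.13, and `toy_q` is a made-up Q-function for fixed channel states.

```python
from itertools import product

ACTIONS = list(product((0, 1), repeat=2))


def monotone_sweep(q, n1, n2):
    """Minimise q over actions for all (b1, b2), visiting states in increasing order.

    q(b1, b2, a) returns the Q-value for action a = (a1, a2).
    The minimiser is assumed to be nondecreasing in (b1, b2), so actions
    below previously found minimisers are never evaluated.
    Returns the policy and the number of Q evaluations performed.
    """
    policy, evaluations = {}, 0
    for b1 in range(n1):
        for b2 in range(n2):
            low1 = low2 = 0
            for previous in ((b1 - 1, b2), (b1, b2 - 1)):
                if previous in policy:
                    low1 = max(low1, policy[previous][0])
                    low2 = max(low2, policy[previous][1])
            candidates = [a for a in ACTIONS if a[0] >= low1 and a[1] >= low2]
            if len(candidates) == 1:
                policy[(b1, b2)] = candidates[0]  # only (1, 1) remains
                continue
            evaluations += len(candidates)
            policy[(b1, b2)] = min(candidates, key=lambda a: q(b1, b2, a))
    return policy, evaluations


def toy_q(b1, b2, a):
    """Hypothetical Q-function whose minimiser is a threshold policy."""
    return abs(b1 - 3 * a[0]) + abs(b2 - 3 * a[1])


mpi_policy, mpi_evaluations = monotone_sweep(toy_q, 5, 5)
full_policy = {
    (b1, b2): min(ACTIONS, key=lambda a: toy_q(b1, b2, a))
    for b1 in range(5)
    for b2 in range(5)
}
```

In this toy case the pruned sweep returns the same policy as the exhaustive search while evaluating Q fewer than 4 × 25 = 100 times, which is the kind of saving the complexity discussion below quantifies.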
Discrete simultaneous perturbation stochastic approximation
Assume Theorem 4.12 holds.^{8} Due to the monotonicity of the optimal policy in queue states, the optimization problem (11) can be converted to a minimization problem over a set of queue thresholds.
For i∈{1,2}, we define \(\phi _{i}(b_{-i},g_{1},g_{2})\in \mathcal {B}_{i}\) as
Here, ϕ _{ i }(b _{−i },g _{1},g _{2}) is the threshold for queue i when the other user’s queue state is b _{−i } and the channel states are g _{1} and g _{2}. Let \(\boldsymbol {\phi }_{i}\) be constructed by stacking ϕ _{ i }(b _{−i },g _{1},g _{2}) for all \((b_{-i},g_{1},g_{2})\in \mathcal {B}_{-i}\times {\mathcal {G}_{1}}\times {\mathcal {G}_{2}}\). The queue threshold vector is defined as \(\boldsymbol {\phi }=(\boldsymbol {\phi }_{1},\boldsymbol {\phi }_{2})\). In Fig. 10, we show the optimal queue threshold vector \(\boldsymbol {\phi }^{*}=(\boldsymbol {\phi }_{1}^{*},\boldsymbol {\phi }_{2}^{*})\) where
and θ ^{∗} is the optimal policy obtained in Fig. 4. Each queue threshold vector ϕ=(ϕ _{1},ϕ _{2}) determines a deterministic policy \(\theta _{\boldsymbol {\phi }}(\mathbf {x})=(\theta _{1\boldsymbol {\phi }_{1}}(\mathbf {x}),\theta _{2\boldsymbol {\phi }_{2}}(\mathbf {x}))\) by
Since \(\theta _{i}^{*}\) is nondecreasing in b _{ i } for all i∈{1,2} if Theorem 4.12 holds and θ ^{∗} determines ϕ ^{∗} via (22), finding the optimal policy θ ^{∗} is equivalent to finding the optimal queue threshold vector ϕ ^{∗}. We can convert problem (11) to
where
The advantage of formulating problem (24) is that the solutions can be approximated by the DSPSA algorithm [32] presented in Algorithm 1. The parameters/functions in this algorithm are explained as follows:

\(\hat {J}(\boldsymbol {\phi })\) is an estimate of J at ϕ that is obtained by simulation. The method is to simulate the state sequence {x ^{(t)}} governed by the transition probability \(\Pr \left (\mathbf {x}^{(t+1)}\mid \mathbf {x}^{(t)}\right)=P_{\mathbf {x}^{(t)} \mathbf {x}^{(t+1)}}^{\theta _{\boldsymbol {\phi }}(\mathbf {x}^{(t)})}\) for all \(\mathbf {x}^{(0)}\in \mathcal {X}\). \(\hat {J}(\boldsymbol {\phi })\) is obtained as
$$\begin{array}{*{20}l} \hat{J}(\boldsymbol{\phi})=\sum_{\mathbf{x}^{(0)}\in\mathcal{X}}\sum_{t=0}^{T}\beta^{t}C\left(\mathbf{x}^{(t)}, \theta_{\boldsymbol{\phi}}(\mathbf{x}^{(t)})\right). \end{array} $$((26))
Each simulation stops if the increments over several successive decision epochs fall below a small threshold (10^{−5}), i.e., the simulation length is finite.

The step size parameters A, B, and α are crucial for the convergence performance of DSA algorithms. In this paper, we set A=0.3, B=100, and α=0.602. These values are found by adopting the method suggested in [33] for practical problems where the computation budget N, the total number of iterations, is fixed: B=0.095N, α=0.602 and A is chosen so that \(A/(B+1)^{\alpha }\|\mathbf {d} \left (\tilde {\boldsymbol {\phi }}^{(0)}\right)\|=0.1\).
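The simulation estimate in (26) can be sketched as follows. This is a minimal illustration under stated assumptions: `policy`, `step` and `cost` are hypothetical callables standing in for θ _{ϕ}, the state transition sampler and the immediate cost C, and the `patience` and `max_epochs` arguments are illustrative choices for "several successive decision epochs" and a safety cap.

```python
def discounted_cost(x0, policy, step, cost, beta, tol=1e-5, patience=5,
                    max_epochs=100000):
    """Simulate one trajectory from x0 and accumulate the discounted cost.

    The run stops once the discounted increment has stayed below tol for
    `patience` successive decision epochs (or after max_epochs epochs).
    """
    total, state, discount, small_steps = 0.0, x0, 1.0, 0
    for _ in range(max_epochs):
        action = policy(state)
        increment = discount * cost(state, action)
        total += increment
        small_steps = small_steps + 1 if abs(increment) < tol else 0
        if small_steps >= patience:
            break
        state = step(state, action)  # sample or compute the next state
        discount *= beta
    return total


def estimate_J(states, policy, step, cost, beta):
    """Sum the simulated discounted costs over all initial states, as in (26)."""
    return sum(discounted_cost(x0, policy, step, cost, beta) for x0 in states)


# Hypothetical deterministic toy model: two states, unit cost per epoch.
J_toy = estimate_J(
    states=[0, 1],
    policy=lambda x: 0,
    step=lambda x, a: x,
    cost=lambda x, a: 1.0,
    beta=0.5,
)
```

With a unit cost per epoch and β=0.5, each trajectory accumulates approximately 1/(1−β)=2, so the toy estimate is close to 4. In a model where the immediate cost can be zero for several epochs in a row, this stopping rule could end a run early, so a minimum run length may be needed in practice.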
The DSPSA algorithm is in fact a line search algorithm. It starts with any initial guess ϕ ^{(0)}, say ϕ ^{(0)}=0, and iteratively updates the guess along the estimated descent direction −a ^{(n)} d(ϕ ^{(n)}). The gradient estimate d(ϕ ^{(n)}) in each iteration is obtained from two values of \(\hat {J}\), namely \(\hat {J} \left (\lfloor \boldsymbol {\phi }^{(n)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2} \right)\) and \( \hat {J} \left (\lfloor \boldsymbol {\phi }^{(n)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2} \right)\).^{9} According to a study in [34], the estimation sequence {ϕ ^{(n)}} slowly converges to the optimal queue threshold vector ϕ ^{∗}.
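The iteration can be sketched in a few lines. This is not a reproduction of Algorithm 1: it assumes the commonly used DSPSA form with gain A/(B+n+1)^{α}, a perturbation Δ with independent ±1 entries, clipping to a box, and rounding of the final iterate. The objective `toy_J` is a hypothetical noise-free function standing in for \(\hat {J}\).

```python
import math
import random


def dspsa(j_hat, phi0, n_iter, A=0.3, B=100.0, alpha=0.602,
          lower=None, upper=None, seed=0):
    """Discrete simultaneous perturbation stochastic approximation (sketch).

    j_hat takes a list of integers and returns a (possibly noisy) cost.
    Returns the final iterate rounded to the nearest integer vector.
    """
    rng = random.Random(seed)
    phi = [float(v) for v in phi0]
    for n in range(n_iter):
        gain = A / (B + n + 1) ** alpha
        delta = [rng.choice((-1, 1)) for _ in phi]
        base = [math.floor(v) for v in phi]
        plus = [b + (1 + d) // 2 for b, d in zip(base, delta)]   # floor(phi) + (1+Delta)/2
        minus = [b + (1 - d) // 2 for b, d in zip(base, delta)]  # floor(phi) + (1-Delta)/2
        diff = j_hat(plus) - j_hat(minus)
        # For entries equal to +1 or -1, dividing by delta equals multiplying by it.
        phi = [v - gain * diff * d for v, d in zip(phi, delta)]
        if lower is not None and upper is not None:
            phi = [min(max(v, lower), upper) for v in phi]
    return [round(v) for v in phi]


def toy_J(phi):
    """Hypothetical separable convex objective with minimiser (3, 3)."""
    return sum((v - 3) ** 2 for v in phi)


estimate = dspsa(toy_J, [0, 0], n_iter=1000, lower=0, upper=4)
```

Only two evaluations of the objective are needed per iteration regardless of the dimension of ϕ, which is the property used in the complexity discussion that follows.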
Complexity
MPI is in fact a modified DP algorithm that exploits L ^{♮}convexity or submodularity of Q ^{(n)}. It converges at the same rate as DP. But, since \(\mathcal {A}(\mathbf {x})\subseteq \mathcal {A}\) and \(|\mathcal {A}(\mathbf {x})|\), the cardinality of \(\mathcal {A}(\mathbf {x})\), is progressively decreasing in b _{1} and b _{2}, the complexity in each iteration in MPI is lower than that in DP. Let ρ be the average size of \(\mathcal {A}(\mathbf {x})\) over all states x. The complexity in one iteration of MPI is \(O(|\mathcal {X}|^{2}\rho)\), where \(\rho \leq |\mathcal {A}|\). The exact value of ρ varies with different systems. To show the examples of the actual complexity of MPI, we do the following experiment. We use the same system settings as in Fig. 6 and set the number of channel states of both channels to K, i.e., K _{1}=K _{2}=K. We vary K from 2 to 10. For each value of K, we run both DP and MPI and obtain the complexity as the number of calculations of Q ^{(n)} averaged over iterations. The results are shown in Fig. 11. It can be seen that the complexity of MPI is always less than that of DP, and MPI alleviates the drastically growing complexity in DP when the size of the channel state space grows large.
Consider the complexity of the DSPSA algorithm. Let ζ be the complexity of obtaining the value of \(\hat {J}\) by simulation. Since we only need two values of \(\hat {J}\) to calculate the gradient d, the complexity in each iteration of DSPSA is O(ζ). But, the convergence rate depends on the parameters of the DSPSA algorithm [35], e.g., the step size parameters, and may vary with different MDP systems, i.e., DSPSA may converge slower than DP or MPI. However, DSPSA has two advantages over DP and MPI. One is that DSPSA is a simulation-based algorithm, the runs of which do not require the full knowledge of the MDP model. Based on (26), to obtain \(\hat {J}\), we only require the knowledge of the state space \(\mathcal {X}\) and a simulation model that can generate a state sequence {x ^{(t)}} based on a given queue threshold vector ϕ and the statistics of packet arrival and channel variation processes. If the packet arrival probabilities and/or channel statistics change suddenly, the optimal policy will change accordingly, and the DSPSA algorithm can gradually adapt to the new optimal policy.
The results in Fig. 12 are based on an experiment of DSPSA in an environment where the system parameters change with time. The relay is assumed to serve the first pair of users with packet arrival probabilities p _{1}=0.1 and p _{2}=0.2 in the first 500 iterations and to serve another pair of users with p _{1}=0.8 and p _{2}=0.2 in the second 500 iterations. It can be seen that DSPSA is able to adaptively track the optimum and optimizer of problem (24). In contrast, to run DP or MPI, we require the full knowledge of the MDP model. If the statistics of packet arrival and channel variation processes change, we need to determine the new MDP model by calculating all values of the state transition probability \(P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}}\) before running DP or MPI. In other words, MPI and DP are model-based algorithms while DSPSA is a model-free algorithm [36].
The other advantage of DSPSA is that it allows the scheduler to learn the optimal policy online. For example, assume that we start with an arbitrary threshold vector ϕ ^{(0)}. We first let the scheduler adopt the policy that is determined by the queue threshold vector \(\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2}\) (via (21)) for a while and obtain the value of \(\hat {J}\left (\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}+\mathbf {\Delta }}{2}\right)\) based on the actual immediate costs incurred. Then, we let the scheduler adopt the policy that is determined by \(\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2}\) for a while and obtain the value of \(\hat {J}\left (\lfloor \boldsymbol {\phi }^{(0)}\rfloor +\frac {\mathbf {1}-\mathbf {\Delta }}{2}\right)\). By doing so, the gradient d can be calculated, and we update ϕ ^{(0)} and get a new queue threshold vector ϕ ^{(1)}. By repeating this process, the scheduler can slowly update the estimate ϕ ^{(n)} towards ϕ ^{∗} and hence find the optimal policy θ ^{∗}.
It should be noted that low-complexity algorithms for searching or approximating the optimal policy θ ^{∗} are not restricted to MPI and DSPSA. With the results on monotonicity derived in Section 4, one can propose other algorithms, e.g., the random search method [37] or the simulated annealing method [38], the complexity of which could be even lower than that of MPI and DSPSA. For example, the random search method [37] can be applied to find the solution of the multivariate minimization problem (24). In this method, the descent direction is found by random sampling in each iteration. The complexity incurred by random sampling could be lower than that incurred by simulation (as in DSPSA). But, we still need to compare the convergence rates of the random search and DSPSA algorithms. In summary, MPI and DSPSA are two examples of low-complexity algorithms that are based on the monotonicity of the optimal policy. Proposing more low-complexity algorithms and comparing their convergence performance are beyond the scope of this paper and could be a direction for future research.
Simulation results
We run simulations in an NCTWRC with Rayleigh fading channels. Let DPMDPQC be the optimal policy searched by DP based on the MDP model in Section 3. We compare the performance of DPMDPQC to the following four policies:

DSPSAMDPQC: This policy is searched by DSPSA based on the MDP model in Section 3 with the total number of iterations being N=1000. As explained in Section 5.2, the estimation sequence produced by the DSPSA algorithm should slowly converge to the optimal policy. Therefore, DSPSAMDPQC should be close to DPMDPQC (in Euclidean distance) and the performance of DSPSAMDPQC should be similar to that of DPMDPQC.

MYOQC: This policy is obtained by \(\theta _{\text {MYOQC}}(\mathbf {x})=\arg \min _{\mathbf {a}} C(\mathbf {x},\mathbf {a})\), where C(x,a) is the immediate cost function as defined in (9). Recall that policy DPMDPQC searched by DP is \(\theta ^{*}(\mathbf {x})=\theta ^{(N)}(\mathbf {x})=\arg \min _{\mathbf {a}} \left \{C(\mathbf {x},\mathbf {a})+\beta \sum _{\mathbf {x}'} P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}} V^{(N-1)}(\mathbf {x}')\right \}\), where N is the iteration index when DP converges. MYOQC is the policy that neglects the future cost \(\beta \sum _{\mathbf {x}'} P_{\mathbf {x}\mathbf {x}^{\prime }}^{\mathbf {a}} V^{(N-1)}(\mathbf {x}')\) that is incurred by the action taken at the current decision epoch. In other words, MYOQC is myopic while DPMDPQC is farsighted.^{10} In a stochastic environment, myopic policies usually incur a higher expected long-term cost than farsighted ones.

AT: This policy is denoted as θ _{AT}(x)=(θ _{AT1}(x),θ _{AT2}(x)) where θ _{ATi }(x)=1 if b _{ i }≠0, i.e., always transmit whenever queue i is not empty. This policy minimizes the costs incurred by the packet delay and queue overflow. But, the performance of this policy should not be as good as DPMDPQC if the purpose is to minimize the longterm cost incurred by not only the packet delay and queue overflow but also the transmission power consumption and downlink transmission error rate.

DPMDPQ: This policy is determined by DP based on an MDP model that is the same as the one in Section 3 except that the immediate cost function is defined as C(x,a)=C _{ h }(b,a)+t _{ r }(a). This policy was proposed in [12], where the authors assume that the channels are lossless so that the packet error cost \(\sum _{i=1}^{2}\text {err}(g_{i},a_{i})=0\) always. However, the wireless channels are usually not ideal in practice. If we adopt policy DPMDPQ, it should incur a higher downlink transmission error rate than DPMDPQC.
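The difference between a myopic policy such as MYOQC and a farsighted policy such as DPMDPQC can be illustrated on a toy MDP. The following is a minimal sketch: the two-state model, its costs and its transitions are hypothetical and unrelated to the NCTWRC model, and the stopping tolerance follows the ε=10^{−5} convention of endnote 3.

```python
def value_iteration(states, actions, cost, trans, beta, eps=1e-5):
    """Discounted-cost value iteration.

    trans(x, a) returns a dict {next_state: probability}.
    Returns the value function and a greedy (farsighted) policy.
    """
    V = {x: 0.0 for x in states}
    while True:
        Q = {
            (x, a): cost(x, a) + beta * sum(p * V[y] for y, p in trans(x, a).items())
            for x in states
            for a in actions
        }
        V_new = {x: min(Q[(x, a)] for a in actions) for x in states}
        converged = max(abs(V_new[x] - V[x]) for x in states) <= eps
        V = V_new
        if converged:
            policy = {x: min(actions, key=lambda a: Q[(x, a)]) for x in states}
            return V, policy


def myopic_policy(states, actions, cost):
    """Minimise the immediate cost only, ignoring future states."""
    return {x: min(actions, key=lambda a: cost(x, a)) for x in states}


# Hypothetical model: in state 0, action 1 is free now but leads to the
# costly absorbing state 1; action 0 costs 1 and stays in state 0.
def toy_cost(x, a):
    return 10.0 if x == 1 else (1.0 if a == 0 else 0.0)


def toy_trans(x, a):
    return {1: 1.0} if (x == 1 or a == 1) else {0: 1.0}


V_far, far_policy = value_iteration([0, 1], [0, 1], toy_cost, toy_trans, beta=0.9)
myo_policy = myopic_policy([0, 1], [0, 1], toy_cost)
```

In this toy model the myopic policy takes the free action in state 0 and then pays 10 per epoch forever, whereas the farsighted policy accepts a cost of 1 per epoch, giving a discounted cost of about 10 instead of 90.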
We fix p _{2}=0.5 and vary p _{1} from 0.2 to 0.6. The other system parameters are the same as in Fig. 4. A simulation lasting for 10^{5} decision epochs is run for each value of p _{1}. Each packet contains 100 bits, i.e., the packet length L _{ P }=100. We obtain the number of holding and overflowing packets and the number of transmissions averaged over decision epochs. The former indicates the mean packet delay and queue overflow costs, and the latter indicates the average transmission power consumption. The transmission error rate is calculated as the ratio of the number of erroneous bits received to the total number of bits sent. We also obtain the immediate cost averaged over decision epochs, which indicates the long-term cost (the minimand in (11)). The results are presented in Fig. 13. It can be seen that the average immediate cost of DSPSAMDPQC almost overlaps with that of DPMDPQC. It means that if the total number of iterations in the DSPSA algorithm is large enough, e.g., 1000 iterations, it is able to converge to a policy that is very close to DPMDPQC.
For policy MYOQC, we can see that it always incurs a greater number of transmissions and holding and overflow packets and a higher transmission error rate than DPMDPQC. The average immediate cost of this policy is at least 0.23 higher than that of DPMDPQC, which makes it the worst among all policies. Therefore, a farsighted policy outperforms a myopic one when we want to minimize the long-term cost in a stochastic system.
The number of holding and overflow packets incurred by policy AT is always zero. However, it results in the highest number of transmissions and transmission error rate. The average immediate cost incurred by AT is at least 0.09 higher than that of DPMDPQC, which confirms our expectation: AT minimizes the packet delay and queue overflow costs but incurs higher transmission power consumption and downlink transmission error rate. Therefore, the long-term cost incurred by AT is not as low as that incurred by DPMDPQC. For policy DPMDPQ, the number of transmissions is almost the same as DPMDPQC, and the number of holding and overflow packets is even lower than DPMDPQC. However, since this policy assumes that the wireless channels are ideal (but they are in fact not), the transmission error rate is about 1.3 times higher than DPMDPQC (almost as high as AT). Therefore, the average immediate cost is still higher than DPMDPQC. In summary, in a stochastic environment where the long-term loss can be incurred by multiple causes, the policy that considers all such causes simultaneously outperforms those that only consider some and neglect others.
Conclusion
This paper studied an MDPmodeled transmission control problem in NCTWRC with random traffic and fading channels. The purpose was to prove the existence of a monotonic optimal transmission policy that minimized packet delay, queue overflow, transmission power, and the downlink transmission error rate in the long run. We proved that the optimal policy is nondecreasing in queue and/or channel states by investigating how certain properties (submodularity, L ^{♮}convexity and multimodularity) varied with the system parameters. Based on these properties of DP, we presented two lowcomplexity algorithms, MPI and DSPSA.
As a part of the conclusion, we point out two directions for the research work in the future. The structured results derived in Section 4 can be used to design model-free learning algorithms, e.g., monotonic Q-learning. Since queue-assisted transmission control is also used in cross-layer variable-rate adaptive modulation problems, it would be of interest if we can use submodularity, L ^{♮}convexity, and multimodularity to establish the sufficient conditions for the existence of monotonic optimal transmission policies in these systems.
Endnotes
^{1} The complexity of the algorithm grows drastically with the cardinality of the system variables [16].
^{2} The definition of Pareto optimality is given in Appendix A. In Section 3.5, we will explain the Pareto optimality of the optimal policy of MDP.
^{3} In this paper, we use ε=10^{−5}.
^{4} See the definition of Pareto optimality and description of scalarization technique in Appendix A. The Pareto optimality of θ ^{∗} has also been discussed in [31].
^{5} \(f\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}}\) is (strictly) supermodular if −f is (strictly) submodular.
^{6} The interpretation of ξ _{ o }≥2λ+η+τ is that the cost of overflowing a packet is greater than or equal to the sum of the cost of holding two packets, the cost when transmission error rate is increased by η and the cost of missing a coding opportunity.
^{7} In [21], integer convexity was used to denote the one dimensional discrete convexity as explained in Lemma B.1(b).
^{8} According to the conditions in Theorems 4.12, 4.13 and 4.15, Theorem 4.12 is straightforwardly satisfied if either Theorem 4.13 or Theorem 4.15 holds. Therefore, if DSPSA can be applied when Theorem 4.12 holds, it can also be applied when Theorem 4.13 or 4.15 holds.
^{9} The gradient d in (27) is defined based on the discrete midpoint convexity [32].
^{10} More comparisons of farsighted and myopic policies in NCTWRC are presented in [13].
^{11} The one dimensional discrete convex function \(h:\mathbb {Z}\rightarrow {\mathbb {R}}\) satisfies h(x + 1) + h(x − 1) − 2h(x)≥0 for all \(x\in {\mathbb {Z}}\). Moreover, by Definition 4.4 and 4.5, h is both L ^{♮}convex and multimodular.
^{12} A function \(f\colon \mathbb {Z}^{2}\rightarrow {\mathbb {R}_{+}}\) is multimodular if and only if it is (1) supermodular: Δ _{ i } Δ _{ j } f(x)≥0 and (2) superconvex: Δ _{ i } f(x + e _{ i })≥Δ _{ i } f(x + e _{ j }) for all i,j∈{1,2}, where Δ _{ i } f(x) = f(x) − f(x − e _{ i }) and \(\mathbf {e}_{i}\in {\mathbb {Z}^{2}}\) is a 2-tuple with all zero entries except the ith entry being one.
Appendix A
In multiobjective optimization [39], there are N optimization metrics. Each of them can be quantified by a loss function \(f_{n}\colon \mathbb {R}^{M}\rightarrow {\mathbb {R}}\). The problem can be expressed as
where x is the decision vector. We say x Pareto dominates x ^{′} if f _{ n }(x)≤f _{ n }(x ^{′}) for all n∈{1,…,N}. We call x ^{∗} a Pareto optimal decision vector if no \(\mathbf {x}\in \mathbb {R}^{M}\) Pareto dominates x ^{∗}. In a multiobjective optimization problem, we always want to seek a Pareto optimal solution. One way to solve this problem is the scalarization technique. The idea is to convert (27) to a single-objective problem
where w _{ n }>0 is the weight. It is shown in [39] that the solution of problem (28) is a Pareto optimal solution of problem (27). Note that, based on the definition of Pareto optimality, a Pareto optimal solution is not necessarily an optimal solution if we consider only one optimization metric.
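The scalarization idea can be illustrated on a finite set of candidate decisions. The following is a minimal sketch under stated assumptions: the loss vectors are hypothetical, the decision set is finite rather than \(\mathbb {R}^{M}\), and dominance is implemented in the common form that additionally requires a strict improvement in at least one metric.

```python
def dominates(u, v):
    """u dominates v: no worse in every metric and strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def scalarized_argmin(losses, weights):
    """Index of the candidate minimising the weighted sum of its losses."""
    return min(
        range(len(losses)),
        key=lambda k: sum(w * f for w, f in zip(weights, losses[k])),
    )


def is_pareto_optimal(k, losses):
    """True if no other candidate dominates candidate k."""
    return not any(
        dominates(losses[j], losses[k]) for j in range(len(losses)) if j != k
    )


# Hypothetical loss vectors (f_1, f_2) for four candidate decisions.
candidate_losses = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
best = scalarized_argmin(candidate_losses, weights=(1.0, 1.0))
```

With equal positive weights the scalarized minimiser is the candidate with losses (2.0, 2.0), which is Pareto optimal, while the candidate (3.0, 3.0) is dominated. Note that the selected candidate is not the best one for either metric taken alone.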
Appendix B
Lemma B.1.
Submodularity, L ^{♮}convexity and multimodularity have the following properties:

(a)
If \(f_{i}\colon \mathbb {Z}^{n}\rightarrow {\mathbb {R}_{+}}\) is submodular/ L ^{♮}convex/multimodular in \(\mathbf {x}\in {\mathbb {Z}^{n}}\) and α _{ i }≥0 for all i, then \(\sum _{i=1}^{m} \alpha _{i} f_{i}(\mathbf {x})\) is submodular/ L ^{♮}convex/multimodular in x.

(b)
If \(h\colon \mathbb {Z}\rightarrow {\mathbb {R}_{+}}\) is convex ^{11}, then f(x)=h(x _{1}−x _{2}) is L ^{♮}convex in x=(x _{1},x _{2}) and g(x)=h(x _{1}+x _{2}) is multimodular in x=(x _{1},x _{2}).

(c)
Let d be a random variable. If g(x,d) is L ^{♮}convex/multimodular in \(\mathbf {x}\in {\mathbb {Z}^{n}}\) for all d, then \(\mathbb {E}_{d}[g(\mathbf {x},d)]\) is L ^{♮}convex/multimodular in x.

(d)
If \(f\colon \mathbb {Z}^{n}\rightarrow \mathbb {R}_{+}\) is L ^{♮}convex, then ψ(x,ζ)=f(x−ζ 1) is L ^{♮}convex in (x,ζ).
Proof.
The proofs of (a), (c), and (d) can be found in [26, 28, 29]. We show the proof of (b). Consider function f first. Since ψ(x,ζ)=f(x−ζ 1)=h(x _{1}−x _{2}), according to Definition 4.4, it suffices to show the submodularity of h in (x _{1},x _{2}). But, because of the convexity of h,
By Definition 4.2, h is submodular in (x _{1},x _{2}). Therefore, f(x)=h(x _{1}−x _{2}) is L ^{♮}convex in (x _{1},x _{2}). Since g(x)=f(−M _{2,1} x), according to Lemma 4.7(a), g(x) is multimodular in (x _{1},x _{2}).
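The multimodularity claim in Lemma B.1(b) can be checked numerically with the two-dimensional characterization quoted in endnote 12 (supermodularity and superconvexity in terms of backward differences). The following is a minimal sketch; the convex function `h_convex` and the test grid are hypothetical choices.

```python
from itertools import product


def backward_diff(f, i):
    """Return the function x -> f(x) - f(x - e_i)."""
    def diff(x):
        y = list(x)
        y[i] -= 1
        return f(x) - f(tuple(y))
    return diff


def shifted(x, i):
    """Return x + e_i."""
    y = list(x)
    y[i] += 1
    return tuple(y)


def is_multimodular_2d(f, points, tol=1e-12):
    """Check supermodularity and superconvexity on the given grid points."""
    for x in points:
        for i, j in product(range(2), repeat=2):
            # Supermodularity: Delta_i Delta_j f(x) >= 0.
            if backward_diff(backward_diff(f, j), i)(x) < -tol:
                return False
            # Superconvexity: Delta_i f(x + e_i) >= Delta_i f(x + e_j).
            di = backward_diff(f, i)
            if di(shifted(x, i)) < di(shifted(x, j)) - tol:
                return False
    return True


def h_convex(z):
    """Hypothetical one-dimensional convex function."""
    return z * z


def g_of_sum(x):
    """g(x) = h(x1 + x2), claimed to be multimodular in Lemma B.1(b)."""
    return h_convex(x[0] + x[1])


def f_of_difference(x):
    """f(x) = h(x1 - x2), L-natural-convex but used here as a contrast."""
    return h_convex(x[0] - x[1])


test_grid = list(product(range(-3, 4), repeat=2))
```

On this grid `g_of_sum` satisfies both conditions, whereas `f_of_difference` violates the supermodularity condition, in line with the distinction the lemma draws between h(x _{1}−x _{2}) and h(x _{1}+x _{2}).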
Appendix C
\(\tilde {C}\) is nondecreasing in y _{1} because h _{1} is nondecreasing in y _{1}. By assuming that V ^{(n−1)} is nondecreasing in \(b^{\prime }_{1}\), we have \(\tilde {Q}^{(n)}\) nondecreasing in y _{1} since min{[y _{ i }]^{+},L _{ i }}+f _{ i } is nondecreasing in y _{ i }. Next, consider the multimodularity by using Proposition 1 ^{12} in [40]. The supermodularity and superconvexity of \(\tilde {C}\) in (y _{1},a _{1}) can be proved by the convexity of h _{1}. So, \(\tilde {C}\) is multimodular in (y _{1},a _{1}). Assume the monotonicity and L ^{♮}convexity of V ^{(n−1)} in \(b^{\prime }_{1}\). \(\tilde {Q}\) is supermodular and superconvex in (y _{1},a _{1}) because
The second inequality in (30) (when y _{1}=L _{1}) is explained as follows. Recall that we have V ^{(n−1)}(x ^{′})=Q ^{(n−1)}(x ^{′},a ^{∗}(x ^{′})), where Q ^{(n−1)} is L ^{♮}convex in \((b^{\prime }_{1},a^{\prime }_{1})\) and \(\mathbf {a}^{*}(\mathbf {x}^{\prime })=\arg \min _{\mathbf {a}^{\prime }}Q^{(n-1)}(\mathbf {x}^{\prime },\mathbf {a}^{\prime })\) is nondecreasing in \(b^{\prime }_{1}\). It can be shown that
Since ξ _{ o }≥2λ+η+τ, we have the inequality when y _{1}=L _{1} in (30). Therefore, \(\tilde {Q}\) is multimodular in (y _{1},a _{1}). The monotonicity and multimodularity of \(\tilde {C}\) and \(\tilde {Q}^{(n)}\) in (y _{2},a _{2}) can be proved in the same way.
Appendix D
By knowing the L ^{♮}convexity of V ^{(n−1)} in b ^{′}, we have
i.e., −Q ^{(n)} is strictly supermodular in a for all x. By definition in [30], the game is supermodular.
Appendix E
C _{ h } is L ^{♮}convex in (b,a) because of the convexity of h _{ i }, and C _{ t } is L ^{♮}convex in (b,a) because of the submodularity of t _{ r } in a. By Lemma B.1(a), C is L ^{♮}convex in (b,a). Consider the L ^{♮}convexity of Q in (b,a). Let \({BR}_{i}(a_{-i})=\arg \min _{a_{i}}Q^{(n)}(\mathbf {x},(a_{i},a_{-i}))\). Equilibria (0,0) and (1,1) imply B R _{ i }(a _{−i })=a _{−i }, i.e., a _{1}=a _{2}. But, Q ^{(n)}(x,(a _{1},a _{1})) is L ^{♮}convex in (b,a _{1}) for the following reasons. When b _{ i }−a _{1}<L _{ i }+1 for all i∈{1,2}, Q ^{(n)} is L ^{♮}convex in (b,a _{1}) because of the L ^{♮}convexity of V ^{(n−1)} in b ^{′} and Lemma B.1(c) and (d). When b _{ i }−a _{1}=L _{ i }+1 for either i=1 or i=2, the L ^{♮}convexity of Q ^{(n)} can be shown in the same way as in Appendix C under condition ξ _{ o }≥2λ+τ+η. By Theorem 4.10 and Proposition 4.11, V ^{∗}(x) is L ^{♮}convex in b and the optimal action a ^{∗} is nondecreasing in b.
Appendix F
We just need to show that condition (b) in Theorem 4.13 is satisfied. Let b _{ i }−a _{1}<L _{ i }+1 for all i∈{1,2}. It suffices to show B R _{ i }(a _{−i })=a _{−i } for all i∈{1,2} in order to prove the equilibria (0,0) and (1,1) in Theorem 4.13. Because the game has strictly supermodular utility, B R _{ i }(a _{−i }+1)>B R _{ i }(a _{−i }). So B R _{ i }(1)=1, if we can prove B R _{ i }(0)=0. By knowing that p _{1}=0.5, we can show that
where \(\hat {b}_{2}=\min \{[\!b_{2}]^{+},L_{2}\}+f_{2}\) and the inequality in the case when 0<b _{1}<L _{1}+1 is because that, by a similar approach as in (31), V ^{(n−1)}(b1′−1,b2′,g ^{′})−V ^{(n−1)}(b1′+1,b2′,g ^{′})≥−τ−η and \(\beta \leq \frac {2(\tau \lambda)}{\tau +\eta }\).
Similarly, we can show Q ^{(n)}(b,g,(0,1))−Q ^{(n)}(b,g,(0,0))≥0 in the case when p _{2}=0.5. So, B R _{ i }(a _{−i })=a _{−i }.
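The best-response argument can be illustrated on a toy two-player game with binary actions. The sketch below is a minimal Python example under stated assumptions: both players minimize a shared cost table, `toy_cost` is a made-up table rather than Q ^{(n)} from the model, and all function names are hypothetical. It checks strict submodularity of the cost (equivalently, strict supermodularity of its negative) and enumerates the pure equilibria by testing BR _{ i }(a _{−i })=a _{ i } for both players.

```python
from itertools import product

def best_response(cost, player, other_action):
    """Action in {0, 1} minimizing the shared cost for `player` (0 or 1)."""
    def player_cost(action):
        joint = (action, other_action) if player == 0 else (other_action, action)
        return cost[joint]
    return min((0, 1), key=player_cost)  # ties resolve to action 0

def is_strictly_submodular(cost):
    """Strict submodularity of the cost in (a_1, a_2) on {0,1}^2."""
    return cost[(0, 0)] + cost[(1, 1)] < cost[(0, 1)] + cost[(1, 0)]

def pure_equilibria(cost):
    """Joint actions at which each action is a best response to the other."""
    return [
        joint for joint in product((0, 1), repeat=2)
        if best_response(cost, 0, joint[1]) == joint[0]
        and best_response(cost, 1, joint[0]) == joint[1]
    ]

# Hypothetical cost table: mismatched actions are penalized.
toy_cost = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 2.0}

print(is_strictly_submodular(toy_cost))
print(pure_equilibria(toy_cost))
```

In this toy table each player's best response copies the other player's action, so the only pure equilibria are the two diagonal action pairs, which mirrors the structure BR _{ i }(a _{−i })=a _{−i } used in the proof.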
Appendix G
Let i=2. C(x,a) is submodular in (b _{2},g _{1},a _{2}) because of the convexity of h _{ i } and the condition P _{ e }(g _{1})≥P _{ e }(g _{1}+1). From the submodularity of V ^{(n−1)} in \((b^{\prime }_{2},g^{\prime }_{1})\) and the L ^{♮}-convexity of Q ^{(n)} in (b _{2},a _{2}) under the condition ξ _{ o }≥2λ+η+τ, we can show the submodularity of Q ^{(n)} in (b _{2},a _{2}) and (b _{2},g _{1}). Now consider the submodularity of Q ^{(n)} in (g _{1},a _{2}). We can show that
The second-to-last inequality in (34) is obtained by an approach similar to that in (31), and the last one follows from the condition \(\beta \leq \frac {P_{e}(g_{i})-P_{e}(g_{i}+1)}{\sum _{g'_{i}}P_{g_{i}g'_{i}} (P_{e}(g'_{i})-P_{e}(g'_{i}+1)) }\). Therefore, Q ^{(n)} is submodular in (b _{2},g _{1},a _{2}). Similarly, we can show that Theorem 4.15 holds for i=1.
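The proof establishes submodularity pair by pair, which is sufficient on a product of chains. As a rough illustration, the Python sketch below checks pairwise submodularity of a two-dimensional cost table and confirms that the minimizing action is then nondecreasing in the state, in the spirit of Topkis' monotonicity result. The tables and function names are hypothetical and are not derived from Q ^{(n)}.

```python
def is_submodular_2d(table):
    """Check f(i+1, j+1) + f(i, j) <= f(i+1, j) + f(i, j+1) on a grid.

    `table[i][j]` holds f(state i, action j); the adjacent-cell test is
    equivalent to submodularity on a two-dimensional integer lattice.
    """
    rows, cols = len(table), len(table[0])
    return all(
        table[i + 1][j + 1] + table[i][j] <= table[i + 1][j] + table[i][j + 1]
        for i in range(rows - 1)
        for j in range(cols - 1)
    )

def minimizing_actions(table):
    """Smallest minimizing action index for every state (row)."""
    return [min(range(len(row)), key=row.__getitem__) for row in table]

def is_nondecreasing(values):
    return all(x <= y for x, y in zip(values, values[1:]))

# Hypothetical costs: (action - state)^2 is submodular, state * action is not.
tracking_cost = [[(a - s) ** 2 for a in range(4)] for s in range(4)]
product_cost = [[s * a for a in range(4)] for s in range(4)]

print(is_submodular_2d(tracking_cost), minimizing_actions(tracking_cost))
print(is_submodular_2d(product_cost))
```

For a three-variable function such as Q ^{(n)} in (b _{2},g _{1},a _{2}), the same test would be applied to each pair of variables with the third held fixed.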
Appendix H
In an equiprobable partitioned slow and flat Rayleigh fading channel, the channel transitions can be worked out from the level crossing rate (LCR) [15] and happen only between adjacent states, i.e., \(g^{\prime }_{i}\in \{g_{i}-1, g_{i}, g_{i}+1\}\). Further, \(P_{gg'}=P_{g'g}\), and \(P_{gg'}\ll P_{gg}\) for all g ^{′}≠g. According to Definition 4.8, for nondecreasing u, \(P_{g_{i}g_{i}^{\prime }}\) is first order stochastic nondecreasing in g _{ i } because
where \(1-2P_{g_{i},g_{i}+1}\geq 0\) holds because \(P_{gg'}\ll P_{gg}\) and \(\sum _{g'}P_{gg'}=1\).
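The role of the condition \(1-2P_{g_{i},g_{i}+1}\geq 0\) can be seen in a small numerical check. The Python sketch below builds a symmetric tridiagonal transition matrix with a single adjacent-state probability p (a simplifying assumption; in an LCR-based model the probabilities generally depend on the state) and tests first order stochastic monotonicity through tail sums: for every threshold k, the probability of moving to a state of index at least k must be nondecreasing in the current state. Function names are hypothetical.

```python
def tridiagonal_channel(num_states, p):
    """Symmetric chain that moves only to adjacent states, each with prob. p."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for g in range(num_states):
        if g > 0:
            matrix[g][g - 1] = p
        if g < num_states - 1:
            matrix[g][g + 1] = p
        # Diagonal is still 0.0 here, so the row sum equals the off-diagonal mass.
        matrix[g][g] = 1.0 - sum(matrix[g])
    return matrix

def is_stochastically_monotone(matrix, tol=1e-12):
    """True if every tail sum sum_{g' >= k} P[g][g'] is nondecreasing in g."""
    n = len(matrix)
    for k in range(n):
        tails = [sum(row[k:]) for row in matrix]
        if any(tails[g] > tails[g + 1] + tol for g in range(n - 1)):
            return False
    return True

slow_fading = tridiagonal_channel(4, 0.1)   # p much smaller than the diagonal
fast_flipping = tridiagonal_channel(2, 0.7)  # violates 1 - 2p >= 0

print(is_stochastically_monotone(slow_fading))
print(is_stochastically_monotone(fast_flipping))
```

With p=0.1 the tail sums increase with the current state, whereas with two states and p=0.7 the tail at k=1 drops from 0.7 to 0.3, which is exactly the case excluded by requiring 1−2p≥0.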
Abbreviations
 NCTWRC: network-coded two-way relay channels
 MDP: Markov decision process
 DP: dynamic programming
 MPI: monotonic policy iteration
 DSPSA: discrete simultaneous perturbation stochastic approximation
 NC: network coding
 FSMC: finite-state Markov chain
 QoS: quality of service
 SPSA: simultaneous perturbation stochastic approximation
References
R Ahlswede, N Cai, SYR Li, RW Yeung, Network information flow. IEEE Trans. Inf. Theory. 46(4), 1204–1216 (2000). doi:http://dx.doi.org/10.1109/18.850663.
Y Wu, Information exchange in wireless networks with network coding and physical-layer broadcast. Technical Report MSR-TR-2004-78, Microsoft Research, Redmond WA (2004).
S Katti, H Rahul, W Hu, D Katabi, M Médard, J Crowcroft, XORs in the air: practical wireless network coding. SIGCOMM Comput. Commun. Rev. 36(4), 243–254 (2006).
C Hausl, J Hagenauer, in Proceedings of IEEE International Conference on Communications: June 2006. Iterative network and channel decoding for the two-way relay channel (IEEE, Istanbul, 2006), pp. 1568–1573. doi:http://dx.doi.org/10.1109/ICC.2006.255034.
J Li, S Song, Y Guo, M Lee, Joint optimization of source and relay precoding for AF MIMO relay systems. EURASIP J. Wireless Commun. Netw. 2015, 175 (2015).
M Soussi, A Zaidi, L Vandendorpe, DF-based sum-rate optimization for multicarrier multiple access relay channel. EURASIP J. Wireless Commun. Netw. 2015, 133 (2015).
J Joung, AH Sayed, Multiuser two-way amplify-and-forward relay processing and power control methods for beamforming systems. IEEE Trans. Signal Process. 58(3), 1833–1846 (2010).
S Katti, D Katabi, W Hu, H Rahul, M Medard, in Proceedings of 43rd Annual Allerton Conference on Communications, Control and Computing: September 2005. The importance of being opportunistic: Practical network coding for wireless environments (University of Illinois, Monticello, IL, 2005), pp. 756–765.
S Peters, A Panah, K Truong, R Heath, Relay architectures for 3GPP LTE-advanced. EURASIP J. Wireless Commun. Netw. 2009, 618787 (2009).
QT Vien, LN Tran, EK Hong, Network coding-based retransmission for relay aided multisource multicast networks. EURASIP J. Wireless Commun. Netw. 2011, 643920 (2011).
W Chen, KB Letaief, Z Cao, in Proceedings of IEEE International Conference on Communications: 24–28 June 2007. Opportunistic network coding for wireless networks (IEEE, Glasgow, 2007), pp. 4634–4639. doi:http://dx.doi.org/10.1109/ICC.2007.765.
YP Hsu, N Abedini, S Ramasamy, N Gautam, A Sprintson, S Shakkottai, in Proceedings IEEE International Symposium on Information Theory: July 31–August 5, 2011. Opportunities for network coding: To wait or not to wait (IEEE, St. Petersburg, 2011), pp. 791–795. doi:http://dx.doi.org/10.1109/ISIT.2011.6034243.
N Ding, I Nevat, GW Peters, J Yuan, in Proceedings of IEEE International Conference on Communications: 9–13 June 2013. Opportunistic network coding for two-way relay fading channels (IEEE, Budapest, 2013), pp. 5980–5985. doi:http://dx.doi.org/10.1109/ICC.2013.6655556.
ML Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn (Wiley, New York, 1994).
P Sadeghi, RA Kennedy, PB Rapajic, R Shams, Finite-state Markov modeling of fading channels: A survey of principles and applications. IEEE Signal Process. Mag. 25(5), 57–80 (2008). doi:http://dx.doi.org/10.1109/MSP.2008.926683.
RS Sutton, AG Barto, Introduction to Reinforcement Learning, 1st edn (MIT Press, Cambridge, MA, 1998).
WB Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley, New Jersey, 2007).
JE Smith, KF McCardle, Structural properties of stochastic dynamic programs. Oper. Res. 50(5), 796–809 (2002).
J Huang, V Krishnamurthy, Transmission control in cognitive radio as a Markovian dynamic game: Structural result on randomized threshold policies. IEEE Trans. Commun. 58(1), 301–310 (2010).
MH Ngo, V Krishnamurthy, Monotonicity of constrained optimal transmission policies in correlated fading channels with ARQ. IEEE Trans. Signal Process. 58(1), 438–451 (2010). doi:http://dx.doi.org/10.1109/TSP.2009.2027735.
DV Djonin, V Krishnamurthy, MIMO transmission control in fading channels–a constrained Markov decision process formulation with monotone randomized policies. IEEE Trans. Signal Process. 55(10), 5069–5083 (2007).
DM Topkis, Supermodularity and Complementarity (Princeton University Press, Princeton, 2001).
K Murota, Note on multimodularity and L-convexity. Math. Oper. Res. 30(3), 658–661 (2005).
L Yang, YE Sagduyu, JH Li, in Proceedings of 13th ACM International Symposium on Mobile Ad Hoc Networking and Computing: 11–14 June 2012. Adaptive network coding for scheduling real-time traffic with hard deadlines (ACM SIGMOBILE, New York, 2012), pp. 105–114.
B Hajek, Extremal splittings of point processes. Math. Oper. Res. 10(4), 543–556 (1985).
DM Topkis, Minimizing a submodular function on a lattice. Oper. Res. 26(2), 305–321 (1978).
K Murota, Discrete Convex Analysis (SIAM, Philadelphia, 2003).
QLP Yu, Multimodularity and structural properties of stochastic dynamic programs. Working Paper. School of Bus. and Manage., HongKong University of Sci. and Tech (2013).
P Zipkin, On the structure of lost-sales inventory models. Oper. Res. 58(4), 937–944 (2008).
P Milgrom, J Roberts, Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica J. Econ. Soc. 58(6), 1255–1277 (1990).
AT Hoang, M Motani, Cross-layer adaptive transmission: Optimal strategies in fading channels. IEEE Trans. Commun. 56(5), 799–807 (2008). doi:http://dx.doi.org/10.1109/TCOMM.2008.060214.
Q Wang, JC Spall, in Proceedings of American Control Conference: June 29–July 1, 2011. Discrete simultaneous perturbation stochastic approximation on loss function with noisy measurements (IEEE, San Francisco, CA, 2011), pp. 4520–4525.
JC Spall, Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Trans. Aerosp. Electron. Syst. 34(3), 817–823 (1998). doi:http://dx.doi.org/10.1109/7.705889.
N Ding, P Sadeghi, RA Kennedy, Discrete Convexity and Stochastic Approximation for Cross-layer On-off Transmission Control. Wireless Communications, IEEE Transactions on, 1536–1276 (2015). doi:http://dx.doi.org/10.1109/TWC.2015.2473858.
JC Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control. 37(3), 332–341 (1992). doi:http://dx.doi.org/10.1109/9.119632.
A Gosavi, Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement Learning, vol. 55 (Springer, New York, 2014).
L Rastrigin, Convergence of random search method in extremal control of many-parameter system. Automat. Remote Control. 24(11), 1337 (1964).
S Kirkpatrick, Optimization by simulated annealing: quantitative studies. J. Stat. Phys. 34(5–6), 975–986 (1984).
M Caramia, P Dell’Olmo, Multi-objective Management in Freight Logistics: Increasing Capacity, Service Level and Safety with Optimization Algorithms (Springer, Berlin, Germany, 2008).
W Zhuang, MZF Li, Monotone optimal control for a class of Markov decision processes. European J. Oper. Res. 217(2), 342–350 (2012).
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Ding, N., Sadeghi, P. & Kennedy, R.A. Structured optimal transmission control in network-coded two-way relay channels. J Wireless Com Network 2015, 241 (2015). https://doi.org/10.1186/s13638-015-0470-7
DOI: https://doi.org/10.1186/s13638-015-0470-7
Keywords
 Cross-layer optimization
 Discounted Markov decision process
 Discrete stochastic approximation
 Dynamic programming
 L ^{♮}-convexity
 Multimodularity
 Network coding
 Submodularity