Structured Optimal Transmission Control in Network-coded Two-way Relay Channels

This paper considers a transmission control problem in network-coded two-way relay channels (NC-TWRC), where the relay buffers random symbol arrivals from two users and the channels are assumed to be fading. The problem is modeled as a discounted infinite-horizon Markov decision process (MDP). The objective is to find a transmission control policy that minimizes symbol delay, buffer overflow, transmission power consumption and error rate simultaneously and in the long run. Using the concepts of submodularity, multimodularity and L♮-convexity, we study the structure of the optimal policy found by the dynamic programming (DP) algorithm. We show that the optimal transmission policy is nondecreasing in queue occupancies and/or channel states under certain conditions, such as the chosen values of parameters in the MDP model, the channel modeling method, the modulation scheme and the preservation of stochastic dominance in the transitions of system states. The results derived in this paper can be used to relieve the high complexity of DP and facilitate real-time control.


I. INTRODUCTION
Network coding (NC) was proposed in [1] to maximize the information flow in a wired network. It was introduced to multicast wireless communications to optimize throughput, and has gained a lot of interest recently due to the rapid growth of multimedia applications [2]. It was shown in [3] that the power efficiency of wireless transmission systems can be improved by NC. For example, in the 3-node network called the network-coded two-way relay channel (NC-TWRC) [4] shown in Fig. 1, the messages m1 and m2 are XORed at the relay and broadcast to the end users. Compared to conventional store-and-forward transmission, this method reduces the total number of transmissions from 4 to 3, so that 25% of the transmission power is saved. Since then, numerous optimization problems have been studied in NC-TWRC, e.g., training design for optimal channel estimation [5] and the design principle of modulation and NC schemes [6].

[Fig. 1: NC-TWRC [4]. Two users exchange information (m1 and m2) via the center node R (the relay).]
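The 4-to-3 transmission saving rests on XOR decodability at the end users, which can be sketched in a few lines (the 4-bit message values are illustrative only):

```python
# Toy demonstration of the XOR trick in Fig. 1 (illustrative 4-bit messages).
m1, m2 = 0b1011, 0b0110       # uplink messages from users 1 and 2
coded = m1 ^ m2               # relay broadcasts one coded symbol, not two
assert coded ^ m1 == m2       # user 1 recovers m2 using its own message as key
assert coded ^ m2 == m1       # user 2 recovers m1 likewise
# Store-and-forward: 2 uplink + 2 downlink = 4 transmissions;
# NC-TWRC: 2 uplink + 1 broadcast = 3, i.e., a 25% saving.
```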
In [7], Katti et al. pointed out the importance of being opportunistic in practical NC scenarios.
It was suggested that the assumptions in related research work should comply with practical wireless environments, e.g., decentralized routing and non-stationary traffic rates. This suggestion highlighted a problem in the existing literature: the majority of studies consider synchronized traffic while ignoring the stochastic nature of packet arrivals in the network layer. However, introducing traffic randomness in Fig. 1 gives rise to a power-delay tradeoff: when there are symbol inflows at the relay but no coding opportunities or XORing pairs (e.g., one symbol arrives from one user but no symbol arrives from the other), waiting for coding opportunities by holding symbols saves transmission power but increases symbol delay. This dilemma was studied and solved by NC-TWRC with buffering in [8] and [9]. It was shown that the optimal transmission policy of a Markovian process formulation minimized the transmission power and symbol delay simultaneously and in the long run. In [10], the NC-TWRC of [8], [9] was extended to include the dynamics of wireless channels (Fig. 2). In this system, a transmission policy that solves the power-delay tradeoff may not be the best decision rule because it does not consider the possible loss in throughput due to downlink transmission errors. For this reason, the scheduler is required to make optimal decisions that simultaneously minimize the transmission power, symbol delay and downlink BER in the long run by considering the current queue and channel states and their expectations in the future. In [10], this problem was formulated as a discounted infinite-horizon Markov decision process (MDP) with channels modeled by finite-state Markov chains (FSMCs) [11].
The optimal transmission policy found by dynamic programming (DP) was shown to be superior to those of [8] and [9] in terms of enhancing the QoS (quality of service, evaluated by symbol delay and overflow in the data link layer and by power consumption and error rate in the physical layer) in a practical wireless environment, e.g., Rayleigh fading channels. However, the DP algorithm in [10] is burdened with high complexity. The scheduler requires information on both queue states/occupancies and channel states to assist the decision making, i.e., the system state is a 4-tuple (two channels and two queues) and the decision/action is a 2-tuple (each tuple associated with the departure control of one queue). In such a high-dimensional MDP, the curse of dimensionality becomes more evident. The problem could be intractable if the cardinality of any tuple in the state variable is large. To relieve the curse, one solution is to qualitatively understand the model and prove the existence of a structured optimal policy [13]. Then a modified DP algorithm, or an alternative algorithm with lower complexity, can be proposed. For example, if the optimal policy is proved to be nondecreasing in the state variable, we can run a monotonic policy iteration [14], which reduces the computation load by shrinking the feasible action space with the increasing state index in each iteration of DP; if the optimal policy is of threshold type, the problem can be solved by a simultaneous perturbation stochastic approximation (SPSA) algorithm [15]. However, a structured optimal policy does not exist in general.
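The action-space shrinking behind monotonic policy iteration can be sketched as follows, assuming a scalar, totally ordered state space and a nondecreasing optimal policy; the function name and the toy single-queue MDP are ours, not taken from [14]:

```python
import numpy as np

def monotone_value_iteration(C, P, beta, n_iter=200):
    """Value iteration exploiting a (presumed) nondecreasing optimal policy:
    while scanning states in increasing order, actions below the last chosen
    one are never re-examined, shrinking the feasible action set."""
    nx, na = C.shape
    V = np.zeros(nx)
    policy = np.zeros(nx, dtype=int)
    for _ in range(n_iter):
        V_new = np.empty(nx)
        a_min = 0                              # monotonicity shrinks this set
        for x in range(nx):                    # states in increasing order
            q = [C[x, a] + beta * P[a][x] @ V for a in range(a_min, na)]
            best = int(np.argmin(q))
            policy[x] = a_min + best
            V_new[x] = q[best]
            a_min = policy[x]                  # never revisit smaller actions
        V = V_new
    return V, policy

# Toy example: state = queue length, action 1 = transmit (cost 1),
# action 0 = hold (cost = queue length); transmitting removes one symbol.
C = np.array([[0., 1.], [1., 1.], [2., 1.]])
P = np.array([np.eye(3),                                        # hold: stay
              [[1, 0, 0], [1, 0, 0], [0, 1, 0]]], dtype=float)  # transmit
V, policy = monotone_value_iteration(C, P, beta=0.9)
```

In this toy instance the converged policy is nondecreasing (hold only when the queue is empty), so the inner loop examines fewer actions as the state index grows.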
Most often, an optimal policy exists, but it varies with the state variable irregularly. In order to prove the existence of a certain feature (e.g., monotonicity) of the optimal policy, we need to extensively analyze the MDP model and the recursive functions of the DP algorithm. The basic approach in the existing literature is to show by induction that monotonicity/submodularity is preserved in each iterative optimization (maximization/minimization) of DP, e.g., [14], [15]. We adopt the same method in this paper, but consider these properties in high-dimensional cases.
For example, a 3-dimensional submodularity (instead of the usual 2-dimensional one) is shown to contribute to an optimal solution that is monotonic in both queue and channel states. Moreover, we use L♮-convexity and multimodularity, two concepts that were originally defined in discrete convex analysis [16], [17] and recently applied in operations research [18]-[20], to describe joint submodularity and integral convexity in a high-dimensional space.
The aim of our work is to establish sufficient conditions for the existence of a monotonic optimal transmission policy in the NC-TWRC system in Fig. 2. Unlike related research work where certain features were assumed a priori (e.g., strict submodularity of DP functions [21] or uniformly distributed traffic flows [15]), we prove the properties of DP by observing the variation of the parameters in the immediate cost function of the MDP (e.g., the unit costs associated with symbol delay and transmission power, the quantized error rate associated with the channel state, etc.) while keeping our assumptions consistent with actual applications (e.g., arbitrary traffic rates, flat and slow Rayleigh fading channels). The main results in this paper are:
• We prove that each tuple of the optimal policy is nondecreasing in the queue state that is controlled by that tuple if the chosen values of the unit costs give rise to an L♮-convex or multimodular DP. Moreover, we show that the results found in [15] and [14] can also be explained by L♮-convexity or multimodularity via a unimodular coordinate transform [22].
• By viewing each iteration of DP as a one-stage pure-coordination supermodular game, we show that equiprobable traffic rates and certain conditions on the unit costs guarantee that each tuple of the optimal policy is monotonic not only in the queue state that is controlled by that tuple but also in the queue state associated with the information flow in the opposite direction, i.e., the one that is not under the control of that tuple.
• By observing the submodularity of DP, we establish sufficient conditions for an optimal policy to be nondecreasing in both queue and channel states, in terms of the unit costs, channel statistics, FSMC models and modulation scheme.
The rest of this paper is organized as follows. In Section II, we state the optimization problem in NC-TWRC with random symbol arrivals and FSMC-modeled channels and clarify the assumptions of this system. In Section III, we describe the MDP formulation, state the objective and present the DP algorithm. In Section IV, we investigate the structure of the optimal transmission policy found by the DP algorithm in queue and channel states, and present numerical examples.

II. SYSTEM MODEL

A. NC-TWRC with random symbol arrivals and wireless fading channels
Consider the NC-TWRC shown in Fig. 2. Users 1 and 2 randomly send symbols to each other via the relay. The relay is equipped with two finite-length FIFO queues, queues 1 and 2, to buffer the incoming symbols from users 1 and 2, respectively. The outflows of the queues are controlled by a scheduler, which keeps making decisions as to whether or not to transmit symbols from the queues. If the decision results in a pair of symbols in opposite directions being transmitted at the same time, they are XORed (coded) and broadcast; otherwise, the symbol is simply forwarded to the end user. The optimization problem is to find an optimal transmission control policy that minimizes symbol delay, queue overflow, transmission power (saved by utilizing coding opportunities) and downlink transmission errors simultaneously, together with their expectations in the future.
Obviously, these optimization concerns contradict each other: (1) because of the random symbol inflows, there will not always be a pair of symbols for XORing. In this situation, waiting for a coding opportunity by holding symbols results in a higher symbol delay on average, while transmitting a symbol without coding results in one more symbol to be transmitted in the future, i.e., more transmission power on average; (2) even if there is a coding opportunity, broadcasting a coded symbol when either channel has low SNR incurs a high symbol error rate. Therefore, the scheduler must seek an optimal decision rule that solves this power-delay-error tradeoff.

B. Assumptions
We consider a discrete-time decision-making process, where time is divided into small intervals, called decision epochs and denoted by t ∈ {0, 1, . . . , T}. Let i ∈ {1, 2} and assume the following (A1-A4). In particular, each channel is modeled by a finite-state Markov chain (FSMC) [11], where the state evolution of channel i is governed by the transition probability P_{g_i g_i'}(t).

III. MARKOV DECISION PROCESS FORMULATION
Based on A1, A3 and A4, we know that the statistics of the incoming traffic flow and channel dynamics associated with user 1 or 2 are time-invariant. It follows that the transmission control problem in Fig. 2 can be formulated as a stationary Markov decision process (MDP). In the following context, we drop the decision epoch notation t in A1-A4 and use the notation y and y ′ for the system variable y at the current and next decision epochs, respectively.

B. Action
Denote the action a = (a_1, a_2) ∈ A, where a_i ∈ A_i = {0, 1} denotes the number of symbols departing from queue i and A = A_1 × A_2 = {0, 1}^2. The terminology of actions is shown in Table I; in particular, a = (1, 1) means XOR two symbols, one from each queue, then broadcast.

C. State Transition Probabilities
The transition probability P^a_{xx'} = Pr(x' | x, a) denotes the probability of being in state x' at the next decision epoch if action a is taken in state x at the current decision epoch. Due to the assumptions of independent random processes in A1 and A4, the state transition probability factorizes as

P^a_{xx'} = P_{g_1 g_1'} P_{g_2 g_2'} P^{a_1}_{b_1 b_1'} P^{a_2}_{b_2 b_2'},

where P_{g_i g_i'} is determined by the channel statistics and the FSMC modeling method in A3, and P^{a_i}_{b_i b_i'} is the queue state transition probability derived as follows.
At the current decision epoch, the occupancy of queue i after the decision is y_i = b_i − a_i, where [y]^+ = max{y, 0} is used below to keep occupancies nonnegative. Then, the occupancy at the beginning of the next decision epoch is b_i' = min{[y_i]^+ + f_i, L_i + 1}, where f_i is the number of symbol arrivals. Therefore, the state transition probability of queue i is

P^{a_i}_{b_i b_i'} = Σ_{f_i} Pr(f_i) I{b_i' = min{[b_i − a_i]^+ + f_i, L_i + 1}},

where I{·} is the indicator function that returns 1 if the expression in {·} is true and 0 otherwise.
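Under one plausible reading of this recursion (Bernoulli(p_i) arrivals and a buffer of length L_i plus one overflow slot, both of which are assumptions on our part), the queue transition matrix can be sketched as:

```python
import numpy as np

def queue_transition_matrix(L, p, a):
    """States 0..L+1: buffer length L plus one overflow slot (our assumption).
    Bernoulli(p) arrivals f, a in {0, 1} departures:
    b' = min([b - a]^+ + f, L + 1)."""
    n = L + 2
    P = np.zeros((n, n))
    for b in range(n):
        y = max(b - a, 0)                      # occupancy after the decision
        for f, pf in ((0, 1 - p), (1, p)):     # arrivals and their probabilities
            P[b, min(y + f, L + 1)] += pf
    return P

P0 = queue_transition_matrix(L=2, p=0.5, a=0)   # hold
P1 = queue_transition_matrix(L=2, p=0.5, a=1)   # transmit one symbol
assert np.allclose(P0.sum(axis=1), 1.0) and np.allclose(P1.sum(axis=1), 1.0)
```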

D. Immediate Cost
C : X × A → R_+ is the cost incurred immediately after action a is taken in state x at the current decision epoch. It reflects three optimization concerns: symbol delay and queue overflow, transmission power, and the downlink transmission error rate.
Holding and overflow cost: We define h_i, the holding and queue overflow cost associated with queue i, as

h_i(y_i) = λ min{[y_i]^+, L_i} + ξ_o [y_i − L_i]^+,

where λ > 0 is the unit holding cost and ξ_o > λ is the unit queue overflow cost, which makes h_i(y_i) a nondecreasing convex function. The terms min{[y_i]^+, L_i} and [y_i − L_i]^+ count the number of symbols held in queue i and the number of symbols lost due to the overflow of queue i, respectively. We say that the term λ min{[y_i]^+, L_i} accounts for the symbol delay because, by Little's law, the average symbol delay is proportional to the average number of symbols held in the queue in the long run for a given symbol arrival rate [23]. We sum up h_i for i ∈ {1, 2} and obtain the total holding and overflow cost as

h_1(y_1) + h_2(y_2).    (5)

Transmission cost: Since forwarding and broadcasting one symbol, either coded or non-coded, consume the same amount of power, we define the immediate transmission cost as

τ I{a_1 = 1 or a_2 = 1},    (6)

where τ > λ is the unit transmission cost and I{a_1 = 1 or a_2 = 1} counts the number of transmissions resulting from action a.
Note that (5) and (6) form a power-delay tradeoff. A policy that always transmits whenever there is an incoming symbol without considering coding opportunities in the long run is penalized by (6), and a policy that always holds symbol to wait for coding opportunities without considering the average symbol delay is penalized by (5).
Symbol error cost: Let P e (g i ) denote the symbol error probability when channel i is in state g i .
The form of the function P_e is determined by the modulation scheme (e.g., P_e(g_i) = (1/2) erfc(√Γ_{g_i}) for BPSK modulation), and P_e(g_i) ≤ 0.5 for all g_i because Γ_1 ≥ 0 in A3. Since symbol errors happen only when we decide to transmit, we define the immediate symbol error cost due to action a_i as

η err(g_{−i}, a_i),    (7)

where η is the unit symbol error cost and err(g_{−i}, a_i) = P_e(g_{−i}) I{a_i = 1}. Note that the argument of err is g_{−i} because the symbol departing queue i is transmitted through channel −i, e.g., the relay sends a symbol from queue 1 through fading channel 2 when a_1 = 1.
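The BPSK expression above can be evaluated with the standard complementary error function; the mapping from a channel state g_i to a representative SNR value Γ_{g_i} is left abstract here:

```python
import math

def bpsk_ser(snr_linear):
    """BPSK symbol error probability P_e = 0.5 * erfc(sqrt(SNR));
    nonincreasing in the (linear) SNR and bounded above by 0.5."""
    return 0.5 * math.erfc(math.sqrt(snr_linear))

pes = [bpsk_ser(s) for s in (0.0, 1.0, 4.0, 10.0)]
assert pes[0] == 0.5                                   # worst case: zero SNR
assert all(pes[k] >= pes[k + 1] for k in range(len(pes) - 1))
```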
Note that the power-delay tradeoff formed by (5) and (6) only poses the question of whether or not to transmit when an instantaneous symbol inflow cannot form an XORing pair. However, if the scheduler additionally considers the downlink transmission error rate, a policy that always broadcasts XORed symbols whenever there is a coding opportunity, without considering the downlink channel states, is penalized by (7). Therefore, (5), (6) and (7) form a power-delay-error tradeoff.
In summary, we define the immediate cost as

C(x, a) = h_1(b_1 − a_1) + h_2(b_2 − a_2) + τ I{a_1 = 1 or a_2 = 1} + η [err(g_2, a_1) + err(g_1, a_2)].

Here, C(x, a) is in fact a linear combination of loss functions, each quantifying one optimization concern. The unit costs λ, ξ_o, τ and η can be considered weight factors that are either given or adjustable depending on the application. In Section IV, we derive sufficient conditions for the existence of a structured optimal policy mainly in terms of the chosen values of these unit costs.
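A minimal sketch of C(x, a) under our reconstruction of (5)-(7); all unit-cost values and the placeholder pe() below are illustrative, not the paper's:

```python
def immediate_cost(b, g, a, lam=1.0, xi=5.0, tau=2.0, eta=1.0, L=(2, 2),
                   pe=lambda gi: 0.5 / (gi + 1)):
    """C(x, a) = holding/overflow (5) + transmission (6) + symbol error (7).
    The overflow form and pe() are placeholder reconstructions."""
    cost = 0.0
    for i in (0, 1):
        y = b[i] - a[i]                          # post-decision occupancy y_i
        cost += lam * min(max(y, 0), L[i])       # holding (delay) term
        cost += xi * max(y - L[i], 0)            # overflow term
        cost += eta * pe(g[1 - i]) * a[i]        # error via the opposite channel
    cost += tau * (1 if (a[0] == 1 or a[1] == 1) else 0)  # one broadcast/forward
    return cost

assert immediate_cost((1, 1), (1, 1), (1, 1)) == 2.5   # XOR pair: one transmission
```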

E. Objective and Dynamic Programming
Let x^{(t)} and a^{(t)} denote the state and action at decision epoch t, respectively, and consider an infinite-horizon MDP where the discrete decision-making process is assumed to be infinitely long. The long-run objective is

min_θ lim_{T→∞} E[ Σ_{t=0}^{T} β^t C(x^{(t)}, θ(x^{(t)})) ],    (10)

where β ∈ [0, 1) is the discount factor that ensures the convergence of the series. It is proved in [24] that if the state space X is countable, the action set A is finite and the MDP is stationary, there exists an optimal deterministic stationary policy θ*: X → A, and θ* can be found by dynamic programming (DP):

Q^{(n)}(x, a) = C(x, a) + β Σ_{x'} P^a_{xx'} V^{(n−1)}(x'),
V^{(n)}(x) = min_{a∈A} Q^{(n)}(x, a),

where n denotes the iteration index, V^{(0)}(x) = 0 for all x, and the optimal policy is θ*(x) = arg min_{a∈A} Q^{(N)}(x, a) if DP converges at the Nth iteration. Usually a very small convergence threshold is applied, i.e., N < ∞.
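The recursion can be sketched generically as follows, with the 4-tuple state assumed flattened to a single index; the names and the stopping rule are illustrative:

```python
import numpy as np

def value_iteration(C, P, beta, tol=1e-8):
    """DP of Section III-E: Q(x,a) = C(x,a) + beta * sum_x' P^a_{xx'} V(x'),
    V(x) = min_a Q(x,a), iterated until the sup-norm change falls below a
    small convergence threshold (the 'N < infinity' remark in the text)."""
    V = np.zeros(C.shape[0])
    while True:
        Q = C + beta * np.einsum('axy,y->xa', P, V)   # expected cost-to-go
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)            # V* and theta*
        V = V_new
```

A two-state toy run: with `C = [[0,1],[2,1]]` and action 1 resetting to state 0, the policy holds in the cheap state and transmits in the costly one.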

IV. STRUCTURED OPTIMAL POLICIES
A conventional transmission control MDP model, e.g., [21] for adaptive modulation, usually has a 2-tuple state and a 1-tuple action variable. However, the MDP model defined in Section III has a higher dimension: a 4-tuple state and a 2-tuple action. This can make the DP algorithm cumbersome. The major problem with DP is the curse of dimensionality: the computation load grows exponentially with the number and dimensions of the system parameters.
The consequence is that the optimization problem may become intractable. For example, an increment in the cardinality of any tuple of the state variable x = (b_1, b_2, g_1, g_2) may severely overload the CPU in the DP iterations or, even worse, drive the processor out of memory during MDP modeling. To cope with this problem, researchers look for certain structures, e.g., monotonicity, in the optimal policy, because a modified optimization algorithm with lower complexity, e.g., structured policy iteration [14] or simultaneous perturbation stochastic approximation (SPSA) [15], can then be proposed. In this section, we investigate the submodularity, L♮-convexity and multimodularity of the functions Q^{(n)}(x, a) and V^{(n)}(x) in DP to establish sufficient conditions for the existence of a monotonic optimal policy. Before that, we clarify some concepts as follows.
Throughout, monotonicity is with respect to the componentwise partial order: x ⪰ x' means that each tuple of x is greater than or equal to the corresponding tuple of x'.
In the case when |X| ≫ |A|, a nondecreasing optimal policy θ* takes the form of a switching curve or plane characterized by an optimal threshold vector. In this case, we can formulate a multivariate optimization problem over the threshold vector, with J being the objective in (10). This problem can be solved by an SPSA algorithm, e.g., [15].

Definition 4.2 (Submodularity):
Let e_i ∈ Z^n be the n-tuple with all zero entries except the ith entry being one. A function f: Z^n → R is submodular if

f(x + e_i) + f(x + e_j) ≥ f(x) + f(x + e_i + e_j)

for all x ∈ Z^n and 1 ≤ i, j ≤ n such that i ≠ j; f is strictly submodular if the inequality is strict. The insight behind submodularity can be explained by the following example. In DP, a submodular Q^{(n)}(x, a) means that Q^{(n)}(x, a^+) − Q^{(n)}(x, a^−) is nonincreasing in x for actions a^+ ≥ a^−, i.e., the preference for choosing action a^+ over a^− is nondecreasing in x. Therefore, an increase in the state variable x implies an increase in the decision rule θ^{(n)}(x) = arg min_a Q^{(n)}(x, a).
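This mechanism is easy to check numerically: a submodular toy cost has a minimizer that is nondecreasing in the state (the concrete f below is our illustration, not the paper's Q^{(n)}):

```python
import itertools

def is_submodular_2d(f, X, A):
    """Definition 4.2 restricted to two dimensions, checked on a finite grid:
    f(x+1, a) + f(x, a+1) >= f(x, a) + f(x+1, a+1)."""
    return all(f(x + 1, a) + f(x, a + 1) >= f(x, a) + f(x + 1, a + 1) - 1e-12
               for x, a in itertools.product(X[:-1], A[:-1]))

# Toy submodular cost: hold x symbols (cost x) or pay 0.8 to transmit one.
f = lambda x, a: max(x - a, 0) + 0.8 * a
X, A = list(range(6)), [0, 1]
assert is_submodular_2d(f, X, A)
minimizer = [min(A, key=lambda a: f(x, a)) for x in X]
assert minimizer == sorted(minimizer)       # arg min is nondecreasing in x
```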
It follows that the optimal policy θ*(x) = arg min_a Q^{(N)}(x, a) must be monotonic in x if we can prove the submodularity of Q^{(n)} for all n. This property is summarized in a general form in the following lemma.

Lemma 4.3: (a) If f(x, a) is submodular in (x, a), then the minimizer a*(x) = arg min_a f(x, a) is nondecreasing in x.

Definition 4.4 (L♮-convexity [16]): f: Z^n → R is L♮-convex if ψ(x, ζ) = f(x − ζ1) is submodular in (x, ζ), where 1 = (1, 1, . . . , 1) ∈ Z^n and ζ ∈ Z.

Definition 4.5 (multimodularity [17]): f: Z^n → R is multimodular if f(x + v) + f(x + w) ≥ f(x) + f(x + v + w) for all x and all distinct v, w ∈ F = {−e_1, e_1 − e_2, . . . , e_{n−1} − e_n, e_n}.

L♮-convexity and multimodularity are two concepts defined in discrete convex analysis [27]. Both imply integral convexity. The difference is that L♮-convexity implies submodularity while multimodularity implies supermodularity [22]. The two concepts are related by a unimodular coordinate transform, and both have been applied to control problems in queueing networks [28] and inventory systems [19]. Due to the implied submodularity/supermodularity, both contribute to a monotonic structure in the optimal policy. In Section IV-B, we will use them to show that the optimal transmission policy is nondecreasing in the queue states. As a preliminary step, we state some properties of L♮-convexity and multimodularity in the following lemma.

Lemma 4.7: (a) If g: Z^n → R_+ is L♮-convex/multimodular in (x, y), then f(x) = min_y g(x, y) is L♮-convex/multimodular in x, and the minimizer y*(x) = arg min_y g(x, y) is nondecreasing in x [19], [22].

Stochastic dominance, as defined in decision theory, describes the situation in which the expected aftermath (quantified by a utility or cost function) of one decision is superior to that of another.
In Section IV-C, we will show that the stochastic dominance of the channel state transition probabilities preserves submodularity across the iterations in DP and contributes to a nondecreasing optimal transmission policy in channel states.

A. Structured Properties of Dynamic Programming
To present the prototypical procedure for proving the existence of a monotonic optimal policy, we first define a P⋆ property: a class of functions (Definition 4.9) that is closed under the minimization performed in each DP iteration and that implies a monotonic minimizer. By Lemmas 4.3(a) and 4.7(a), submodularity, L♮-convexity and multimodularity all satisfy the conditions of Definition 4.9, which we summarize as Theorem 4.10. We therefore propose an approach, similar to Proposition 5 in [13], as follows.

Proposition 4.11: Let DP converge at the Nth iteration. The optimal value function V*(x) = V^{(N)}(x) has the P⋆ property, and the optimal policy θ* is monotonic in x, if: (a) C(x, a) has the P⋆ property; (b) Q^{(n)}(x, a) has the P⋆ property for every P⋆-property function V^{(n−1)} and all n.
Proof: By (a) and (b), Q^{(1)}(x, a) has the P⋆ property, so V^{(1)}(x) = min_{a∈A} Q^{(1)}(x, a) has the P⋆ property. By induction, assume V^{(n−1)}(x) has the P⋆ property. Then Q^{(n)}(x, a) and V^{(n)}(x) = min_{a∈A} Q^{(n)}(x, a) have the P⋆ property. Therefore, Q^{(N)}(x, a) and V*(x) = V^{(N)}(x) must also possess the P⋆ property, and θ*(x) = arg min_{a∈A} Q^{(N)}(x, a) is monotonic in x.

1) Nondecreasing a*_i in b_i: Let the optimal action be a* = (a*_1, a*_2) = θ*(x). The following theorem (Theorem 4.12) shows that the optimal action a*_i is monotonic in b_i, the state of the queue controlled by a_i, if the unit costs satisfy a certain condition. Here, y = (y_1, y_2) and f = (f_1, f_2). It is easy to see that C(b, g, a) = C̃(b − a, g, a) and Q^{(n)}(b, g, a) = Q̃^{(n)}(b − a, g, a); according to Lemma 4.6(b), it follows that proving the L♮-convexity of C(b, g, a) and Q^{(n)}(b, g, a) in (b_i, a_i) is equivalent to showing the multimodularity of C̃(y, g, a) and Q̃^{(n)}(y, g, a) in (y_i, a_i).
It is also clear that the monotonicity of C(b, g, a) and Q^{(n)}(b, g, a) in b_i is equivalent to the monotonicity of C̃(y, g, a) and Q̃^{(n)}(y, g, a) in y_i. See Appendix B for the proofs of the monotonicity and multimodularity of C̃(y, g, a) and Q̃^{(n)}(y, g, a) in y_i and (y_i, a_i), respectively.
According to Proposition 4.7.3 in [24], V*(x) is nondecreasing in b_i. By Theorem 4.10 and Proposition 4.11, V*(x) is L♮-convex in b_i, and a*_i is nondecreasing in b_i. Note that Theorem 4.12 aligns with existing results in the literature, e.g., the adaptive MIMO transmission control in [14] and the Markov-game-modeled adaptive modulation for cognitive radio in [15]. In fact, both of them can be explained by L♮-convexity. In [14], the monotonicity of a* was shown via integer convexity; by Lemma 4.6(b), a function that is multimodular in one dimension is exactly integer convex. In [15], the monotonicity of a*_i was shown by the submodularity of the DP functions. We now formulate the optimization problem in the nth iteration of DP as a 2-player, 2-strategy game, called the one-stage game in Fig. 3. Assume that action a_1 is taken by player 1 and a_2 by player 2. Obviously, it is a pure coordination game where the utility −Q^{(n)}(x, (a_1, a_2)) is the same for players 1 and 2.
We prove, in Appendix C, that Fig. 3 is a supermodular game with the utility function −Q^{(n)}(x, a) strictly supermodular in a = (a_1, a_2) for all x and all V^{(n−1)}. It is proved in [29] that a supermodular game has at least one pure-strategy equilibrium (a*_1, a*_2). We then have the following theorem for the monotonicity of the optimal policy.

Theorem 4.13: If the one-stage game (Fig. 3) has the two pure-strategy equilibria (0, 0) and (1, 1) for all x = (b_1, b_2, g_1, g_2) such that b_i < L_i + 1 for all i ∈ {1, 2}, then C(x, a) and Q^{(n)}(x, a) are L♮-convex in (b, a) = (b_1, b_2, a_1, a_2), the optimal value function V*(x) is L♮-convex in (b_1, b_2), and the optimal action a* = (a*_1, a*_2) is nondecreasing in b = (b_1, b_2).
The proof is in Appendix D.
Here is a corollary of Theorem 4.13.
The proof is in Appendix E. In Fig. 4, we choose the values of the unit costs so that Theorem 4.12 holds. As shown in the figure, the optimal actions a*_1 and a*_2 are monotonic in b_1 and b_2, respectively, i.e., a*_i is nondecreasing in the queue state controlled by a_i. In Fig. 5, we change the value of the unit cost ξ_o to breach the condition of Theorem 4.12, so that the monotonicity of a*_i in b_i is not guaranteed; in this case, a*_1 is not monotonic in b_1. In Fig. 6, we choose the equiprobable symbol arrival rates p_1 = p_2 = 0.5 and the unit costs according to Corollary 4.14, so that Theorem 4.13 holds. As shown in the figure, the optimal actions a*_1 and a*_2 are both nondecreasing in (b_1, b_2). Compared to Fig. 4, in this case a*_i is also monotonic in b_{−i}, the queue state affected by the message flow and transmission control in the opposite direction, i.e., the queue state that is not controlled by a_i. In Fig. 7, we switch the unit cost η from 1 to 2 so that Theorem 4.13 no longer holds. In this case, neither a*_1 nor a*_2 is monotonic in (b_1, b_2). However, the condition of Theorem 4.12 is still satisfied; therefore, a*_1 and a*_2 are still nondecreasing in b_1 and b_2, respectively.
The proof is in Appendix G.
Finally, condition (d) depends on the modulation scheme, which should be determined in the real applications.
We show examples of Theorem 4.15 in Figs. 8 and 9. In Fig. 8, we use the same system parameters as in Fig. 4, except that the discount factor β is switched from 0.97 to 0.95 in order to satisfy the inequality in condition (d) of Theorem 4.15. As shown in the figure, a*_1 is nondecreasing in (b_1, g_2), and a*_2 is nondecreasing in (b_2, g_1). In Fig. 9, we switch γ_2 from 0 dB to 3 dB to breach condition (d) of Theorem 4.15. In this case, a*_1 is not monotonic in g_2; but, since Theorem 4.12 still holds, a*_1 and a*_2 are monotonic in b_1 and b_2, respectively. In Fig. 10, we show a monotonic optimal threshold policy when both Theorem 4.13 and Theorem 4.15 hold. In this figure, b*_{i,th} is the optimal threshold: because of the monotonicity of a*_i, the policy switches from a_i = 0 to a_i = 1 at b_i = b*_{i,th}. By stacking b_{i,th} over all (b_{−i}, g_1, g_2), we can form a vector b_th with cardinality |B_{−i}| × |G_1| × |G_2| and convert (10) to a threshold optimization problem

min_{b_th} J(b_th) subject to C_i(b_th),    (22)

where J could be the objective in (10) obtained by a Markov chain simulation, and C_i(b_th) could be the constraint imposed by the monotonicity of b_{i,th} in (b_{−i}, g_{−i}). The method that solves (22) could be, but is not restricted to, a stochastic approximation algorithm, e.g., SPSA [15], which has lower complexity than DP and is suitable for online reinforcement learning. We do not discuss the details since that is not the purpose of this work; however, it should be clear that the results derived in this paper can be used for further study of (22). They are, in fact, prerequisites for threshold policy optimization problems.
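Once monotonicity holds, each policy slice compresses to a single switching point; a minimal sketch (the helper name and the "never transmit" encoding are ours):

```python
def monotone_slice_to_threshold(policy_slice):
    """For a 0/1 action nondecreasing in b_i (fixed b_-i, g_1, g_2), the whole
    slice is summarized by the smallest b_i with a_i = 1; len(policy_slice)
    encodes 'never transmit in this slice'."""
    assert policy_slice == sorted(policy_slice), "slice must be nondecreasing"
    return policy_slice.index(1) if 1 in policy_slice else len(policy_slice)

assert monotone_slice_to_threshold([0, 0, 1, 1, 1]) == 2
assert monotone_slice_to_threshold([0, 0, 0]) == 3
```

Stacking these scalars over all (b_{−i}, g_1, g_2) yields the vector b_th on which an SPSA-type search would operate.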
Note that related previous studies usually placed constraints on the environment or on the DP functions in order to prove structure in the optimal policy. For example, in [15] the submodularity of the state transition probability was proved by assuming uniformly distributed traffic rates, and in [21] the strict submodularity of Q^{(n)} in the DP iterations was assumed to be preserved by a weight factor in the immediate cost function (however, the exact value of this factor was not given). In contrast, the results in this paper, Theorems 4.12, 4.13 and 4.15, are given essentially in terms of the unit costs and discount factor, i.e., the parameters of the MDP model.
Their practical meaning can be interpreted in two ways: if the unit costs and discount factor are adjustable, we can tune them to obtain a structured optimal policy; if they are given, we can check whether a structured optimal policy exists. Firstly, the nondecreasing optimal policy was proved to be conditioned on the parameters of the MDP model, e.g., the unit costs and discount factor, instead of constraints placed on the environment of the NC-TWRC system. Secondly, we observed monotonicity in both queue and channel states instead of the queue states only. The results in this paper can be utilized for further studies on how to simplify the optimization processes in DP, which could be useful in real-time control scenarios of NC-TWRC.

APPENDIX A
Proof of Lemma 4.3(b): According to Definition 4.2, since

Proof of Lemma 4.7(a):
The case where g is L♮-convex can be proved by Lemmas 2 and 3 in [19] through a sequential minimization, i.e., minimizing over the tuples of y one after another.
The case where g is multimodular can be proved by Theorem 1 in [22] in the same way.
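Lemma 4.7(a) can be illustrated numerically: minimizing an L♮-convex function over one argument leaves a minimizer that is nondecreasing in the other argument (the concrete g below is an illustrative choice, not taken from the paper):

```python
def g(x, y):
    # L-natural-convex: a convex function of (x - y) plus a linear term in y.
    return (x - y) ** 2 + 0.6 * y

Y = range(0, 8)
# Minimizer y*(x) = argmin_y g(x, y), computed by exhaustive search:
y_star = [min(Y, key=lambda y: g(x, y)) for x in range(0, 8)]
assert y_star == sorted(y_star)            # y*(x) is nondecreasing in x
```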
Proof of Lemma 4.7(b): Consider the case of L ♮ -convexity first. According to Definition 4.4 and Lemma 4.

APPENDIX B
Let i = 1 and consider the monotonicity first. C̃(y, g, a) is nondecreasing in y_1 because C̃(y + e_1, g, a) − C̃(y, g, a) ≥ 0, and Q̃^{(n)}(y, g, a) is nondecreasing in y_1 by assuming the monotonicity of V^{(n−1)}. Then, consider the multimodularity. Since the function is two-dimensional, we use Proposition 2 in [18] to prove multimodularity: a function f: Z^2 → R is multimodular if and only if it is supermodular and superconvex, where e_i ∈ Z^2 is the 2-tuple with all zero entries except the ith entry being one.
By assuming the monotonicity and L♮-convexity of V^{(n−1)}, Q̃^{(n)} is supermodular in (y_1, a_1) because

Q̃^{(n)}(y + e_1, g, a) + Q̃^{(n)}(y, g, a + e_1) − Q̃^{(n)}(y, g, a) − Q̃^{(n)}(y + e_1, g, a + e_1) = 0    (29)

and superconvex in (y_1, a_1) because

Q̃^{(n)}(y + e_1, g, a) − Q̃^{(n)}(y, g, a) − Q̃^{(n)}(y, g, a + e_1) + Q̃^{(n)}(y − e_1, g, a + e_1) ≥ 0.    (30)

The inequality in (30) is verified case by case. The inequality in the case y_1 = 0 follows from the monotonicity of V^{(n−1)}(b', g') in b'_1. The inequality in the case y_1 = L_1 is explained as follows.
So, BR_i(a_{−i}) = a_{−i}.
Because of the condition β ≤ P_e(g_i) − P_e(g_i + 1), we have the inequality in (42). Similarly, we can show that Theorem 4.15 holds for i = 1. Therefore, by Proposition 4.11 and Theorem 4.10, V*(x) is submodular in (b_i, g_{−i}) and the optimal action a*_i is nondecreasing in (b_i, g_{−i}).

APPENDIX G
In an equiprobable-partition Rayleigh fading FSMC, P_e(g_i) is nonincreasing in g_i, i.e., P_e(g_i) ≥ P_e(g_i + 1). Because of the slow and flat fading assumption, the channel transition probabilities can be worked out from the level crossing rate (LCR) [11], and transitions only happen between adjacent states, i.e., g'_i ∈ {g_i − 1, g_i, g_i + 1}. Further, P_{gg'} = P_{g'g}, and P_{gg'} ≪ P_{gg} for all g' ≠ g. According to Definition 4.8, for nondecreasing u, P_{g_i g'_i} is first-order stochastically nondecreasing in g_i; in the associated inequality, 1 − 2P_{g_i g_i+1} ≥ 0 because P_{gg'} ≪ P_{gg} and Σ_{g'} P_{gg'} = 1.
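The first-order stochastic dominance property used here can be checked directly on a tridiagonal FSMC transition matrix (the numbers below are illustrative, not measured channel statistics):

```python
import numpy as np

def is_fosd_monotone(P):
    """First-order stochastic dominance (Definition 4.8): row g+1 dominates
    row g iff every tail sum  sum_{g' >= k} P[g, g']  is nondecreasing in g."""
    tails = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]   # tail sums over g'
    return bool(np.all(np.diff(tails, axis=0) >= -1e-12))

# Slow-fading tridiagonal FSMC: transitions only between adjacent states and
# P[g, g] dominant, as described above.
P = np.array([[0.95, 0.05, 0.00],
              [0.04, 0.92, 0.04],
              [0.00, 0.05, 0.95]])
assert np.allclose(P.sum(axis=1), 1.0)    # rows are valid distributions
assert is_fosd_monotone(P)                # higher state -> better next state
```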