3.1 Optimal decision problem
Under the prescribed model, the optimal relay decision problem for a source node transmitting energy-efficient TCP traffic is straightforward: the source periodically updates its best choice of relay node for cooperative communication and sets the appropriate control parameters for link layer frame transmission. Equivalently, in each time slot t, S calculates the optimal η(n) for each relay candidate R_{n} with relay state x_{n}(t) by finding the optimal setting of the control parameters L_{fr}(t) and N_{arq}(t), and then finds the best relay R_{n}^{*}(t) with control parameters L_{fr}^{*}(t) and N_{arq}^{*}(t):
\begin{array}{c}\left\{{R}_{n}^{*}\left(t\right),{L}_{fr}^{*}\left(t\right),{N}_{arq}^{*}\left(t\right)\right\}\hfill \\ =\underset{Rn\left(t\right)}{arg\; max}\left\{\frac{B\left({W}_{\mathsf{\text{m}}},RTT\left[{L}_{fr}\left(t\right),{N}_{arq}\left(t\right)\right],{P}_{\mathsf{\text{seg}}}\left[{x}_{n}\left(t\right),{L}_{fr}\left(t\right),{N}_{arq}\left(t\right)\right]\right){L}_{\mathsf{\text{seg}}}{T}_{fr}\left[{L}_{fr}\left(t\right)\right]}{{E}_{f}\left[{L}_{fr}\left(t\right)\right]}\right\}\hfill \end{array}
(10)
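As a minimal sketch of the per-slot decision in (10), the search can be written as an exhaustive scan over relay candidates and a finite grid of control parameters. The function name `best_relay_decision`, the parameter grids and the stand-in `toy_eta` are assumptions made for illustration only; the actual efficiency η is the ratio inside (10) and is not reproduced here.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# Relay state x_n(t): level indices of (gamma_SD, gamma_SRn, gamma_RnD).
RelayState = Tuple[int, int, int]


def best_relay_decision(
    relay_states: Dict[int, RelayState],
    frame_sizes: Iterable[int],
    arq_limits: Iterable[int],
    eta: Callable[[RelayState, int, int], float],
) -> Tuple[int, int, int, float]:
    """Return (relay, L_fr, N_arq, efficiency) that maximizes eta in one slot."""
    frame_sizes = list(frame_sizes)
    arq_limits = list(arq_limits)
    best = None
    for n, x_n in relay_states.items():
        for l_fr, n_arq in product(frame_sizes, arq_limits):
            value = eta(x_n, l_fr, n_arq)
            if best is None or value > best[3]:
                best = (n, l_fr, n_arq, value)
    if best is None:
        raise ValueError("no relay candidates or empty parameter grid")
    return best


def toy_eta(x: RelayState, l_fr: int, n_arq: int) -> float:
    """Hypothetical stand-in for eta; NOT the efficiency model of (10)."""
    return sum(x) * l_fr / (l_fr + 100.0) - 0.05 * n_arq


decision = best_relay_decision(
    {1: (0, 1, 1), 2: (2, 2, 1)}, [256, 512], [1, 3], toy_eta
)
```

With the toy function, relay 2 with the larger frame and the smaller retransmission limit wins; with the real η the trade-off between frame size, retransmissions and energy would decide the outcome.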
We may transform the original decision problem (10) into an optimal control problem based on the system view: source S decides \left\{{R}_{n}^{*}\left(t\right),{L}_{fr}^{*}\left(t\right),{N}_{arq}^{*}\left(t\right)\right\} as an optimal control signal into the system consisting of S, \mathcal{N} and D, so that the control objective, i.e., the energy efficiency of TCP relay transmission for the S-D pair in the t-th time slot, is maximized.
The decision in formula (10) is made for one time slot only. However, when the overall energy efficiency across the time horizon is considered, a greedy solution whose scope is a single time slot may fail to achieve the optimization goal in the long run. The channel fading state of relay n, x_{n}(t), may change with high probability to another state x_{n}(t+1) in the next time slot t+1, which may degrade the energy efficiency. As a result, the overall TCP traffic may have poor energy efficiency in the long run. Therefore, we prefer to make the decision with a "one-step lookahead" policy [14].
Based on the assumption that the Rayleigh fading channels can be modeled using an FSMC and that the channel states of a relay are independently distributed, the decision model for the optimal control problem with the "one-step lookahead" policy matches the Restless Bandit system model in control theory [14]. In addition, we also consider the switching cost in the relay selection procedure, where the reselection and handover process consumes energy and takes time away from TCP data transmission. Therefore, we adopt an approach based on the Restless Bandit system with Switching Cost (RBSC) [14] in this article.
3.2 Approach based on restless bandit system with switching cost
The RBSC approach is used to solve stochastic control problems with the following characteristics: there are N projects, and one is selected and set active in each time slot t = 0, 1, . . . , T − 1. If the system switches from project i to project j, it incurs an additional cost denoted by c_{ij}. The state of each project lies in a finite state space, x_{n}(t) ∈ S_{n}, n = 1, . . . , N. We denote the Cartesian product of the state spaces as \mathcal{J}={S}_{1}\times \dots \times {S}_{N}. For project n, a reward {r}_{n}^{1}\left({x}_{n}\left(t\right)\right) is earned if it is selected; otherwise the reward is {r}_{n}^{0}\left({x}_{n}\left(t\right)\right). Moreover, the reward is time-discounted by a factor α ∈ (0,1). The projects' states change in a Markovian fashion according to the transition probability matrix. A project is selected under a policy u from the set of all available policies \mathcal{U}. The purpose is to find the optimal u that maximizes the expected discounted reward.
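One way to write this objective explicitly, as a sketch rather than the formulation of [14], is given below. The symbols a(t) (project selected in slot t) and s(t) (project that was active before the selection) are notation assumed here; the bracketed term mirrors the coefficient of the occupation measure in the linear program (15), and any normalization constant is omitted.

```latex
\max_{u \in \mathcal{U}} \;
\mathbb{E}_{u}\left[ \sum_{t=0}^{T-1} \alpha^{t}
\left( r_{a(t)}^{1}\bigl(x_{a(t)}(t)\bigr)
     + \sum_{j \ne a(t)} r_{j}^{0}\bigl(x_{j}(t)\bigr)
     - c_{s(t)\,a(t)} \right) \right]
```

Read this way, a switch is worthwhile only when the discounted gain in reward from the new project outweighs the one-off cost c_{s(t)a(t)}.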
In the following subsections, the relay network with relay selection decisions is formulated as an RBSC system.
3.2.1 Relay state description and system reward
An FSMC is used to model the Rayleigh fading channel according to the received SNR [18]. The received SNR at j from i at time t is denoted as γ_{ij}(t) and is divided into K discrete levels in increasing order. Each level is associated with a state of the Markov chain. Let ϒ = {ϒ_{1}, ϒ_{2}, . . . , ϒ_{K}} denote the finite state space. The K × K transition probability matrix is
\Psi \left(t\right)=\left[{\psi}_{uv}\left(t\right)\right],
(11)
where {\psi}_{uv}\left(t\right)=\mathsf{\text{Pr}}\left(\gamma \left(t+1\right)=v\mid \gamma \left(t\right)=u\right) and u, v ∈ ϒ.
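The quantization into K levels and the matrix Ψ can be illustrated with a short sketch that maps SNR samples to level indices and estimates the transition probabilities from an observed trace by row-normalized counting. The thresholds, the sample trace and the handling of unvisited states are assumptions for illustration; the source does not specify how the levels are chosen.

```python
from bisect import bisect_right
from typing import List, Sequence


def snr_to_state(snr_db: float, thresholds: Sequence[float]) -> int:
    """Map an SNR sample to a level index 0..K-1 using K-1 increasing thresholds."""
    return bisect_right(thresholds, snr_db)


def estimate_transition_matrix(states: Sequence[int], k: int) -> List[List[float]]:
    """Row-normalized counts of observed u -> v transitions (each row sums to 1)."""
    counts = [[0] * k for _ in range(k)]
    for u, v in zip(states, states[1:]):
        counts[u][v] += 1
    matrix = []
    for u, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Unvisited state: assume it stays put, purely as a placeholder.
            matrix.append([1.0 if v == u else 0.0 for v in range(k)])
        else:
            matrix.append([c / total for c in row])
    return matrix


thresholds_db = [5.0, 10.0]  # hypothetical level boundaries, K = 3
trace_db = [3.0, 6.0, 7.5, 12.0, 11.0, 4.0, 6.5]
levels = [snr_to_state(s, thresholds_db) for s in trace_db]
psi = estimate_transition_matrix(levels, k=3)
```

Counting transitions is only one possible estimator; a longer trace or smoothing would be needed before such a matrix could be trusted in practice.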
In each time slot t ∈ {0, 1, . . . , T − 1}, the state of the n-th relay is characterized by three Rayleigh fading channels: the channel from the source to the destination, the channel from the source to the n-th relay, and the channel from the n-th relay to the destination. We assume that the states of these channels are independently distributed.
Therefore, the state of the n-th relay is described by the states of the three channels at time t:
{x}_{n}\left(t\right)=\left[{\gamma}_{SD}\left(t\right),{\gamma}_{S{R}_{n}}\left(t\right),{\gamma}_{{R}_{n}D}\left(t\right)\right].
(12)
The relay state changes in a Markovian fashion, and the finite state space of the n-th relay is denoted as S_{n}, so that x_{n}(t) ∈ S_{n}. The transition probability matrix is
{P}_{n}\left(t\right)={\left[{\psi}_{{f}_{n}{j}_{n}}\left(t\right),{\psi}_{{h}_{n}{y}_{n}}\left(t\right),{\psi}_{{u}_{n}{v}_{n}}\left(t\right)\right]}_{{H}_{n}\times {H}_{n}},
(13)
where H_{n} = K × K × K and f_{n}, j_{n}, h_{n}, y_{n}, u_{n}, v_{n} ∈ ϒ. The matrices can be obtained and updated from the history of observations; in the simulation, they are kept constant.
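Under the independence assumption stated above, one plausible reading of (13) is that the H_{n} × H_{n} relay matrix is the Kronecker product of the three per-channel K × K matrices. The sketch below builds it that way; this interpretation, the helper names and the example matrix are assumptions, not a construction given in the source.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product of two matrices given as nested lists."""
    return [
        [a_uv * b_fj for a_uv in row_a for b_fj in row_b]
        for row_a in a
        for row_b in b
    ]


def relay_transition_matrix(psi_sd: Matrix, psi_sr: Matrix, psi_rd: Matrix) -> Matrix:
    """Joint transition matrix of (gamma_SD, gamma_SRn, gamma_RnD), assuming independence."""
    return kron(kron(psi_sd, psi_sr), psi_rd)


# Hypothetical two-level channel (K = 2), reused for all three links.
psi = [[0.9, 0.1], [0.2, 0.8]]
p_n = relay_transition_matrix(psi, psi, psi)  # 8 x 8, since H_n = K^3
```

Each row of the result is again a probability distribution, so the product of row-stochastic matrices stays row-stochastic; the cubic growth of H_{n} in K is what makes the choice of K a practical concern.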
The purpose of the proposed scheme is to maximize the energy efficiency described in equation (9) by selecting the optimal relay and adjusting the lower layer parameters according to the current relay state. The reward of the n-th relay in each time slot t is
{r}_{n}^{{a}_{n}\left(t\right)}\left({x}_{n}\left(t\right)\right)={a}_{n}\left(t\right)\eta \left({x}_{n}\left(t\right),{L}_{fr}\left(t\right),{N}_{arq}\left(t\right)\right).
(14)
where x_{n}(t) is the state of the n-th relay at time t, {a}_{n}\left(t\right)\in \mathcal{A}=\left\{0,1\right\} is the action (a_{n}(t) = 1 if the relay is selected), L_{fr}(t) is the frame size and N_{arq}(t) is the maximum number of retransmissions.
3.2.2 Solving the RBSC
The RBSC problem is determined by several factors: \mathcal{N} is the set of available projects, S_{n} is the state space of the n-th project, \mathcal{J}={S}_{1}\times \dots \times {S}_{N} is the Cartesian product of the state spaces, A_{n} is the action space, P_{n} is the transition probability matrix, ν is the initial distribution, c_{ij} is the switching cost from the i-th relay to the j-th relay, α ∈ (0,1) is the discount factor and r is the reward.
The complete state of the network is (x;s) = (x_{1}, . . . , x_{N}; s), where the vector x contains the current states of the N relay nodes and s\in \mathcal{N} denotes the currently selected relay. To solve the RBSC problem with linear programming, we denote the set of state-action pairs as \mathcal{K}=\left\{\left(\left(\mathsf{\text{x}};s\right),a\right):\mathsf{\text{x}}\in \mathcal{J},s\in \mathcal{N},a\in \mathcal{N}\right\}. The occupation measures ρ_{(x;s),a} are the expected discounted times spent in the different state-action pairs. The distribution of the initial system state is described by ν, e.g., ν(x_{1}, . . . , x_{N}; s) = ν_{1}(x_{1}) . . . ν_{N}(x_{N}) δ_{1}(s) if project 1 is selected and set active initially. As before, c_{ij} denotes the switching cost from i to j and α ∈ (0,1) is the discount factor. The RBSC problem can be cast into the MDP framework, and the linear program can be written as:
\begin{array}{c}\mathsf{\text{maximize}}\hfill \\ \sum _{s=1}^{N}\sum _{\mathsf{\text{x}}\in \mathcal{J}}\sum _{a=1}^{N}\left({r}_{a}^{1}\left({x}_{a}\right)+\sum _{j\ne a}{r}_{j}^{0}\left({x}_{j}\right)-{c}_{sa}\right){\rho}_{\left({x}_{1},\dots ,{x}_{N};s\right),a}\hfill \\ \mathsf{\text{subject to}}\hfill \\ \sum _{a=1}^{N}{\rho}_{\left(\mathsf{\text{x}};s\right),a}-\alpha \sum _{{s}^{\prime}=1}^{N}\sum _{{\mathsf{\text{x}}}^{\prime}\in \mathcal{J}}{\rho}_{\left({\mathsf{\text{x}}}^{\prime};{s}^{\prime}\right),s}{p}_{{x}_{1}^{\prime}{x}_{1}}^{0}\dots {p}_{{x}_{s}^{\prime}{x}_{s}}^{1}\dots {p}_{{x}_{N}^{\prime}{x}_{N}}^{0}=\left(1-\alpha \right)\nu \left(\mathsf{\text{x}};s\right),\phantom{\rule{1em}{0ex}}\forall \left(\mathsf{\text{x}};s\right)\in \mathcal{J}\times \mathcal{N}\hfill \\ {\rho}_{\left(\mathsf{\text{x}};s\right),a}\ge 0,\phantom{\rule{1em}{0ex}}\forall \left(\left(\mathsf{\text{x}};s\right),a\right)\in \mathcal{J}\times {\mathcal{N}}^{2}.\hfill \end{array}
(15)
In the above formula, the number of decision variables {\rho}_{\left({x}_{1},\dots ,{x}_{N};s\right),a} is exponential in the size of the input data. Therefore, this problem cannot be solved efficiently in practice. By relaxing the constraints in (15), we get the final LP relaxation for RBSC:
\begin{array}{c}\mathsf{\text{maximize}}\hfill \\ \sum _{s=1}^{N}\sum _{a=1}^{N}\left[\sum _{{x}_{a}\in {S}_{a}}\left({r}_{a}^{1}\left({x}_{a}\right)-{c}_{sa}\right){\rho}_{\left({x}_{a};s\right),a}^{a}+\sum _{j\ne a}\sum _{{x}_{j}\in {S}_{j}}{r}_{j}^{0}\left({x}_{j}\right){\rho}_{\left({x}_{j};s\right),a}^{j}\right]\hfill \\ \mathsf{\text{subject to}}\hfill \\ {\rho}_{\left({x}_{i};s\right),a}^{i}\in {Q}_{i}^{\alpha}\left({\nu}_{i}\right),\phantom{\rule{1em}{0ex}}\forall i\in \mathcal{N}\hfill \\ \sum _{{x}_{i}\in {S}_{i}}{\rho}_{\left({x}_{i};s\right),a}^{i}=\sum _{{x}_{1}\in {S}_{1}}{\rho}_{\left({x}_{1};s\right),a}^{1},\phantom{\rule{1em}{0ex}}\forall i,s,a\in \mathcal{N},\hfill \end{array}
(16)
where {Q}_{i}^{\alpha}\left({\nu}_{i}\right) is the polytope of the state-action pair variables, defined as:
\begin{array}{c}{Q}_{i}^{\alpha}\left({\nu}_{i}\right)=\left\{\begin{array}{l}{\rho}_{\left({x}_{i};s\right),a}^{i}\in {\mathbb{R}}_{+}^{\left|{S}_{i}\right|\times {N}^{2}}:\hfill \\ \sum _{a=1}^{N}{\rho}_{\left({x}_{i};i\right),a}^{i}-\alpha \sum _{{x}_{i}^{\prime}\in {S}_{i}}\sum _{{s}^{\prime}=1}^{N}{\rho}_{\left({x}_{i}^{\prime};{s}^{\prime}\right),i}^{i}{p}_{{x}_{i}^{\prime}{x}_{i}}^{1}=\left(1-\alpha \right){\nu}_{i}\left({x}_{i}\right){\delta}_{1}\left(i\right)\phantom{\rule{1em}{0ex}}\forall {x}_{i}\in {S}_{i},\hfill \\ \sum _{a=1}^{N}{\rho}_{\left({x}_{i};s\right),a}^{i}-\alpha \sum _{{x}_{i}^{\prime}\in {S}_{i}}\sum _{{s}^{\prime}=1}^{N}{\rho}_{\left({x}_{i}^{\prime};{s}^{\prime}\right),s}^{i}{p}_{{x}_{i}^{\prime}{x}_{i}}^{0}=\left(1-\alpha \right){\nu}_{i}\left({x}_{i}\right){\delta}_{1}\left(s\right)\phantom{\rule{1em}{0ex}}\forall {x}_{i}\in {S}_{i},s\ne i\hfill \end{array}\right\}.\hfill \end{array}
(17)
For N projects, there are \mathcal{O}\left({N}^{3}\times {\max}_{i}\left|{S}_{i}\right|\right) variables and constraints, so the size of (16) is polynomial in the size of the input data. In addition, the dual of the above LP is:
\begin{array}{c}\mathsf{\text{minimize}}\hfill \\ \left(1-\alpha \right)\sum _{i=1}^{N}\sum _{s=1}^{N}\sum _{{x}_{i}\in {S}_{i}}{\nu}_{i}\left({x}_{i}\right){\delta}_{1}\left(s\right){\lambda}_{{x}_{i},s}^{i}\hfill \\ \mathsf{\text{subject to}}\hfill \\ {\lambda}_{{x}_{1},s}^{1}-\alpha \sum _{{x}_{1}^{\prime}\in {S}_{1}}{p}_{{x}_{1}{x}_{1}^{\prime}}^{1}{\lambda}_{{x}_{1}^{\prime},1}^{1}+\sum _{i=2}^{N}{\mu}_{s,1}^{i}\ge {r}_{1}^{1}\left({x}_{1}\right)-{c}_{s1},\phantom{\rule{1em}{0ex}}s\in \mathcal{N},{x}_{1}\in {S}_{1}\hfill \\ {\lambda}_{{x}_{1},s}^{1}-\alpha \sum _{{x}_{1}^{\prime}\in {S}_{1}}{p}_{{x}_{1}{x}_{1}^{\prime}}^{0}{\lambda}_{{x}_{1}^{\prime},a}^{1}+\sum _{i=2}^{N}{\mu}_{s,a}^{i}\ge {r}_{1}^{0}\left({x}_{1}\right),\phantom{\rule{1em}{0ex}}s\in \mathcal{N},{x}_{1}\in {S}_{1},a\ne 1\hfill \\ {\lambda}_{{x}_{j},s}^{j}-\alpha \sum _{{x}_{j}^{\prime}\in {S}_{j}}{p}_{{x}_{j}{x}_{j}^{\prime}}^{1}{\lambda}_{{x}_{j}^{\prime},j}^{j}-{\mu}_{s,j}^{j}\ge {r}_{j}^{1}\left({x}_{j}\right)-{c}_{sj},\phantom{\rule{1em}{0ex}}j\ge 2,s\in \mathcal{N},{x}_{j}\in {S}_{j}\hfill \\ {\lambda}_{{x}_{j},s}^{j}-\alpha \sum _{{x}_{j}^{\prime}\in {S}_{j}}{p}_{{x}_{j}{x}_{j}^{\prime}}^{0}{\lambda}_{{x}_{j}^{\prime},a}^{j}-{\mu}_{s,a}^{j}\ge {r}_{j}^{0}\left({x}_{j}\right),\phantom{\rule{1em}{0ex}}j\ge 2,s\in \mathcal{N},{x}_{j}\in {S}_{j},a\ne j.\hfill \end{array}
(18)
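As a numerical illustration of the size argument for (15) versus (16), the sketch below counts the primal variables of each program. The helper names and the example of five relays with K = 4 levels per channel (so |S_{i}| = 64) are assumptions chosen only to make the gap concrete.

```python
from math import prod
from typing import Sequence


def full_lp_variables(state_sizes: Sequence[int]) -> int:
    """Variables rho_{(x;s),a} in (15): |J| joint states times N choices of s and a."""
    n = len(state_sizes)
    return prod(state_sizes) * n * n


def relaxed_lp_variables(state_sizes: Sequence[int]) -> int:
    """Variables rho^i_{(x_i;s),a} in (16): one block of |S_i| * N^2 per project i."""
    n = len(state_sizes)
    return sum(state_sizes) * n * n


sizes = [4 ** 3] * 5  # hypothetical: N = 5 relays, K = 4 levels per channel
full_count = full_lp_variables(sizes)        # 64^5 * 25
relaxed_count = relaxed_lp_variables(sizes)  # 5 * 64 * 25
```

For this example the full program has 26,843,545,600 variables while the relaxation has 8,000, which matches the N^3 max_i |S_i| bound exactly because all state spaces have the same size.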
A primal-dual index heuristic for the selection problem is provided in [14]. The optimal solutions of (16) and (18) are denoted as \left\{{\bar{\rho}}_{\left({x}_{i};s\right),a}^{i}\right\} and \left\{{\bar{\lambda}}_{{x}_{i},s}^{i},{\bar{\mu}}_{s,a}^{i}\right\}, respectively. The reduced costs \left\{{\bar{\xi}}_{\left({x}_{i};s\right),a}^{i}\right\} are the differences between the left-hand and right-hand sides of the constraints in (18). The reduced costs are nonnegative, and {\bar{\xi}}_{\left({x}_{i};s\right),a}^{i} equals 0 if the corresponding {\bar{\rho}}_{\left({x}_{i};s\right),a}^{i}>0. A reduced cost can also be interpreted as the rate of decrease in the objective value of the primal linear program (16) per unit increase of the variable {\rho}_{\left({x}_{i};s\right),a}^{i}. If the current state is (x_{1}, . . . , x_{N}; s), the candidate index (CI) of action a is defined as:
I\left(\left({x}_{1},\dots ,{x}_{N};s\right),a\right)=\sum _{i=1}^{N}{\bar{\xi}}_{\left({x}_{i};s\right),a}^{i}
(19)
which is the sum of the reduced costs of all projects for action a. Therefore, the optimal action is to select the project with the minimal CI:
\hat{a}=\arg \underset{a\in \mathcal{N}}{\min}\left\{I\left(\left({x}_{1},\dots ,{x}_{N};s\right),a\right)\right\}.
(20)
We can store the \mathcal{O}\left({N}^{3}{\max}_{i}\left|{S}_{i}\right|\right) optimal reduced costs and compute the indices described in (19) in time \mathcal{O}\left({N}^{2}\right), which is polynomial in the size of the input data.
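The index computation in (19) and the selection rule in (20) amount to a table lookup followed by an arg min. The sketch below assumes the reduced costs have already been obtained from an LP solver and stored per project in dictionaries keyed by (x_i, s, a); that storage layout, the function names and the toy numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# reduced_costs[i][(x_i, s, a)] holds the optimal reduced cost of project i.
ReducedCosts = Sequence[Dict[Tuple[int, int, int], float]]


def candidate_index(xi: ReducedCosts, states: Sequence[int], s: int, a: int) -> float:
    """Sum over all projects of the reduced cost for action a, as in (19)."""
    return sum(xi[i][(states[i], s, a)] for i in range(len(states)))


def select_relay(xi: ReducedCosts, states: Sequence[int], s: int) -> int:
    """Return the action with the smallest candidate index, as in (20)."""
    return min(range(len(states)), key=lambda a: candidate_index(xi, states, s, a))


# Toy reduced costs for N = 2 projects with two states each (made-up values).
keys = [(x, s, a) for x in range(2) for s in range(2) for a in range(2)]
xi_toy = [
    {(x, s, a): float(abs(x - a)) for (x, s, a) in keys},
    {(x, s, a): 0.5 * (s != a) for (x, s, a) in keys},
]
chosen = select_relay(xi_toy, states=(1, 0), s=0)
```

Each index costs N lookups and there are N actions, which is where the \mathcal{O}(N^{2}) running time comes from.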
3.3 Relay selection procedure
Based on the approach depicted above, the relay decision algorithm is summarized as Algorithm 1.
For each node, a CoopTable similar to that in [17] is maintained. Based on the CoopTable, the source node can decide which relay candidate to cooperate with for energy-efficient TCP transmission. The revised CoopTable consists of seven fields: the address of the destination node (D), the address of the potential relay for source node S (R_{n}), the CSI (SNR) from S to D (h_{SD}), the CSI (SNR) from S to R_{n} (h_{SR_{n}}), the CSI (SNR) from R_{n} to D (h_{R_{n}D}), the failure count of this cooperative relay (NF_{R_{n}}), and the update time. The initial state distribution and the transition probabilities, which can be obtained in advance from a network node (a long-running node or base station), are also stored in each node.
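A record with the seven fields listed above might be represented as follows. The class name `CoopTableEntry`, the field names and the `update_snr` helper are illustrative assumptions; the source specifies only which quantities are stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoopTableEntry:
    destination: str        # address of the destination node D
    relay: str              # address of the potential relay R_n
    snr_sd: float           # CSI (SNR) from S to D
    snr_sr: float           # CSI (SNR) from S to R_n
    snr_rd: float           # CSI (SNR) from R_n to D
    failure_count: int = 0  # failures observed with this cooperative relay
    updated_at: float = 0.0 # time of the last update

    def update_snr(
        self,
        now: float,
        snr_sd: Optional[float] = None,
        snr_sr: Optional[float] = None,
        snr_rd: Optional[float] = None,
    ) -> None:
        """Overwrite whichever SNR estimates were just observed and stamp the time."""
        if snr_sd is not None:
            self.snr_sd = snr_sd
        if snr_sr is not None:
            self.snr_sr = snr_sr
        if snr_rd is not None:
            self.snr_rd = snr_rd
        self.updated_at = now


entry = CoopTableEntry("D", "R1", snr_sd=10.0, snr_sr=8.0, snr_rd=9.0)
entry.update_snr(now=5.0, snr_sr=7.5)  # e.g. a fresh estimate taken from an HTS frame
```

Keeping the update time in the record lets the source discard stale estimates before they feed into the relay state x_{n}(t).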
The channel state information in a CoopTable record can be updated through control frames and overheard transmissions. The channel between any two nodes is assumed to be symmetric, because all nodes use the same frequency band for transmission. When S receives the HTS frame sent by R_{n}, it obtains the estimate of γ_{SR_{n}}. D also receives the HTS frame, estimates γ_{R_{n}D} and piggybacks it in its CTS frames. When D sends CTS frames, it also encapsulates the SNR from S to D (γ_{SD}), which is estimated at D when receiving the RTS. After receiving the CTS, S extracts the channel state information γ_{SD} and γ_{SR_{n}}. S and D can also overhear frames from other candidate relays to estimate the channel state information between S and R_{x} and between R_{x} and D, and D also piggybacks γ_{R_{x}D} in its CTS frames. In this way, each S can maintain the CSI for all relays, based on which the optimized relay selection for energy-efficient TCP traffic can be calculated in every time slot.
Moreover, when the decision of S produces a new relay R_{n} other than the currently selected relay R_{sel}, S invokes the relay selection procedure, which extends CoopMAC, as shown in Figure 3. A new message, IDLE, is introduced to facilitate the reselection. S sends an IDLE frame, which informs the currently selected relay R_{sel} to suspend its cooperation and to set its state from ACTIVE to NOT ACTIVE after sending back an ACK. The IDLE frame also contains a NAV to reserve the channel for the selection query interactions. S then sends an RTS to request R_{n1}^{*} to provide cooperation, setting the NAV in its frame. The target relay R_{n1}^{*} may be available and willing to cooperate, in which case it sends back an HTS. On the other hand, the target relay may be reserved by another blind source or be unwilling to cooperate. In this case, it keeps silent and replies nothing. If no HTS is received at S after the SIFS interval, which means that the newly selected relay is unavailable at this time, S begins another round of relay decision using Algorithm 1, excluding the unavailable relay. It sends an RTS to the newly selected R_{nj}^{*} and waits for the HTS. The procedure iterates until an HTS is received. Finally, D sends a CTS to complete the selection and control interactions. After that, data transmission begins for the t-th time slot with the selected relay R_{nj}^{*}, with the link layer frame length being L_{fr}^{j}(t) and the maximum number of retransmissions being N_{arq}^{j}(t). If S does not decide to change the relay in time slot t, the CoopMAC RTS/CTS mode and the two phases of relayed data transmission follow the regular design shown in Figure 2.
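The retry logic of this reselection procedure can be sketched as a loop that repeats the relay decision while excluding relays that stayed silent. The callables `choose` (a stand-in for Algorithm 1 restricted to the remaining candidates) and `rts_gets_hts` (a stand-in for the RTS/HTS exchange) are hypothetical, and returning `None` when every candidate is silent is an assumption; the source does not say what happens in that case.

```python
from typing import Callable, Iterable, Optional, Set


def reselect_relay(
    candidates: Iterable[int],
    choose: Callable[[Set[int]], int],
    rts_gets_hts: Callable[[int], bool],
) -> Optional[int]:
    """Repeat the relay decision, excluding relays that did not answer the RTS."""
    available = set(candidates)
    while available:
        target = choose(available)   # decision among the remaining relays
        if rts_gets_hts(target):     # HTS arrived within the SIFS interval
            return target
        available.discard(target)    # silent relay: exclude it and decide again
    return None                      # assumption: give up if nobody answers


# Toy run: relays 1 and 2 are busy, relay 3 answers with an HTS.
picked = reselect_relay({1, 2, 3}, choose=min, rts_gets_hts=lambda r: r == 3)
nobody = reselect_relay({1, 2}, choose=min, rts_gets_hts=lambda r: False)
```

Every failed round costs one extra RTS, which is the overhead that the switching cost is meant to capture.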
Algorithm 1 Select the optimal relay â
Input:
S_{n}, n\in \mathcal{N} {State Space}
ν(x_{1}, . . . , x_{N}; s) {Distribution of Initial State}
P_{n}(t) {State Transition Probability of Each Relay}
c_{ij} {Switching Cost From Relay i to j}
Output:
\hat{a} {Selected Relay}
1: Obtain the current state of the system: (x;s).
2: Get the optimal solutions of the primal and dual problems:
\left\{{\bar{\rho}}_{\left({x}_{i};s\right),a}^{i}\right\},\phantom{\rule{1em}{0ex}}\left\{{\bar{\lambda}}_{{x}_{i},s}^{i},{\bar{\mu}}_{s,a}^{i}\right\}.
3: Get the reduced costs:
\left\{{\bar{\xi}}_{\left({x}_{i};s\right),a}^{i}\right\}.
4: Calculate the CI of the N relays and select the relay with the smallest CI:
\hat{a}=\arg \underset{a\in \mathcal{N}}{\min}\left\{I\left(\left({x}_{1},\dots ,{x}_{N};s\right),a\right)\right\}.
In the procedure described above, the switching cost is incurred by the control message interactions for reselection, e.g., IDLE and ACK, and RTS and its retransmissions. To keep the measurement of the switching cost consistent with the energy efficiency in Equation (1), we define the switching cost as the product of the average system reward \bar{\eta} and the expected number of extra retransmissions of control frames, where \bar{\eta}=\frac{\sum _{n=1}^{N}\sum _{{x}_{n}\in {S}_{n}}\eta \left({x}_{n}\right)}{\sum _{n=1}^{N}\left|{S}_{n}\right|}. The probability that the target relay node is idle is denoted as P_{idl}, and the expected number of extra retransmissions is 1/P_{idl} − 1. Therefore, the switching cost is \bar{\eta}\left(1/{P}_{idl}-1\right); i.e., if the target relay node is idle with probability 1, there is no switching cost.
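The switching cost defined above can be computed directly from a table of rewards and the idle probability. In the sketch below, the function names and the toy reward values are assumptions; the formula itself is the one stated in the text, with the average taken over all relay states of all relays.

```python
from typing import Sequence


def average_reward(eta_by_relay: Sequence[Sequence[float]]) -> float:
    """Mean of eta(x_n) over every state of every relay."""
    total = sum(sum(values) for values in eta_by_relay)
    count = sum(len(values) for values in eta_by_relay)
    return total / count


def switching_cost(eta_bar: float, p_idle: float) -> float:
    """Average reward times the expected number of extra retransmissions."""
    if not 0.0 < p_idle <= 1.0:
        raise ValueError("p_idle must lie in (0, 1]")
    return eta_bar * (1.0 / p_idle - 1.0)


eta_table = [[1.0, 3.0], [2.0, 2.0]]      # made-up eta values, two relays
eta_bar = average_reward(eta_table)       # (1 + 3 + 2 + 2) / 4
cost_busy = switching_cost(eta_bar, 0.5)  # one extra attempt expected
cost_idle = switching_cost(eta_bar, 1.0)  # relay always idle
```

The term 1/P_{idl} − 1 is the mean number of failures before the first success when each attempt succeeds independently with probability P_{idl}, so the cost grows quickly as the target relay becomes busier.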