In this section, decentralized dynamic computation offloading is proposed to minimize the long-term average computation cost of each user. Specifically, by leveraging the DDPG [35] algorithm, dynamic computation offloading policies can be learned independently at each user, where a continuous action, i.e., the powers allocated for local execution and computation offloading in a continuous domain, is selected for each slot based on local observation of the environment. Each user has no prior knowledge of the system environment: the number of users *M* and the statistics of task arrivals and wireless channels are all unknown to each user agent, and thus the online learning process is totally model-free. In the sequel, we first introduce some basics of DRL. Then, the DDPG-based framework for decentralized dynamic computation offloading is built, where the state space, action space, and reward function are defined. Finally, training and testing of decentralized policies in the framework are presented.

### Preliminaries on DRL

The standard setup for DRL follows from traditional RL, which consists of an agent, an environment *E*, a set of possible states \(\mathcal {S}\), a set of available actions \(\mathcal {A}\), and a reward function \(r : \mathcal {S} \times \mathcal {A} \rightarrow \mathcal {R}\), where the agent continually learns and makes decisions from the interaction with the environment in discrete time steps. In each step *t*, the agent observes the current state of the environment as \(s_{t} \in \mathcal {S}\), and chooses and executes an action \(a_{t} \in \mathcal {A}\) according to a policy *π*. The policy *π* is generally stochastic and maps the current state to a probability distribution over the actions, i.e., \(\pi : \mathcal {S} \rightarrow \mathcal {P}(\mathcal {A})\). Then, the agent receives a scalar reward \(r_{t} = r(s_{t}, a_{t}) \in \mathcal {R} \subseteq \mathbb {R}\) and transitions to the next state *s*_{t+1} according to the transition probability of the environment *p*(*s*_{t+1}|*s*_{t},*a*_{t}). Thus, the state transition of the environment *E* depends on the agent’s action *a*_{t} executed in the current state *s*_{t}. In order to find the optimal policy, we define the return from a state as the sum of discounted future rewards, \(R_{t} = \sum _{i=t}^{T} \gamma ^{i-t} r(s_{i},a_{i})\), where *T*→*∞* represents the total number of time steps and *γ*∈[0,1] stands for the discounting factor. The goal of RL is to find the policy that maximizes the long-term expected discounted reward from the start distribution, i.e.,

$$\begin{array}{*{20}l} J = \mathbb{E}_{s_{i} \sim E, a_{i} \sim \pi} [R_{1}]. \end{array} $$

(12)
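As a small numerical illustration of the return \(R_{t}\) defined above, the following sketch evaluates the discounted sum for a finite reward sequence. The function name and the sample rewards are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_i gamma^(i-t) * r_i for a finite list of rewards starting at step t."""
    ret = 0.0
    # Accumulate backwards using the recursion R_i = r_i + gamma * R_{i+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [1.0, 1.0, 1.0]  # illustrative rewards only
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion avoids computing powers of *γ* explicitly and makes the role of the discounting factor visible: rewards further in the future contribute less to the return.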

Under a policy *π*, the state-action function \(Q^{\pi }(s_{t},a_{t}):\mathcal {S} \times \mathcal {A} \rightarrow \mathcal {R}\), also known as the critic function that will be introduced later in this paper, is defined as the expected discounted return starting from state *s*_{t} with the selected action *a*_{t},

$$\begin{array}{*{20}l} Q^{\pi}(s_{t},a_{t}) = \mathbb{E}_{s_{i>t} \sim E, a_{i>t} \sim \pi}\left[R_{t}|s_{t},a_{t}\right], \end{array} $$

(13)

where the expectation is taken over the transition probability *p*(*s*_{t+1}|*s*_{t},*a*_{t}) and the policy *π*.

RL algorithms attempt to learn the optimal policy from actual interactions with the environment and adapt their behavior upon experiencing the outcome of their actions. This is necessary because there may not be an explicit model of the dynamics of the environment *E*. That is, the underlying transition probability *p*(*s*_{t+1}|*s*_{t},*a*_{t}) is unknown and may even be non-stationary. Fortunately, it has been proved [21] that under the policy *π*^{∗} that maximizes the expected discounted rewards, the state-action function satisfies the following Bellman optimality equation,

$$\begin{array}{*{20}l} Q^{*}(s_{t},a_{t}) = \mathbb{E}_{s_{t+1} \sim E}\left[r(s_{t},a_{t}) + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1},a_{t+1}) \right], \end{array} $$

(14)

from which the optimal policy *π*^{∗} can be derived by choosing the action in any state \(s \in \mathcal {S}\) as follows^{Footnote 3},

$$\begin{array}{*{20}l} \pi^{*}(s) = \arg \max_{a \in \mathcal{A}} Q^{*}(s, a). \end{array} $$

(15)

In order to learn the value of the state-action function from raw experience, the temporal-difference (TD) method can be leveraged to update the state-action function from an agent’s experience tuple (*s*_{t},*a*_{t},*r*_{t},*s*_{t+1}) at each time step *t*,

$$\begin{array}{*{20}l} Q(s_{t}, a_{t}) \leftarrow Q(s_{t},a_{t}) + \alpha \left[ r(s_{t},a_{t}) + \gamma\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_{t},a_{t})\right], \end{array} $$

(16)

where *α* is the learning rate and the value \(\delta _{t} := r(s_{t},a_{t}) + \gamma \max _{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_{t},a_{t})\) is referred to as the TD error. Note that the well-known Q-learning algorithm [36] is based on (16), and thus the state-action function is also known as the *Q*-value. Besides, it has been proved that the Q-learning algorithm converges with probability one [21], which indicates that an estimate of *Q*^{∗}(*s*,*a*) will eventually be obtained.
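The TD update in (16) can be sketched for a tabular *Q*-function as follows. This is a minimal illustration assuming a small discrete state and action set; the dictionary-based table, the state labels, and the default values of *α* and *γ* are assumptions made for the example only.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply the TD update (16) to the table Q in place and return the TD error."""
    best_next = max(Q[(s_next, b)] for b in actions)  # max over a_{t+1}
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # all Q-values start at zero
actions = [0, 1]        # hypothetical discrete action set
delta = q_learning_update(Q, s="s0", a=1, r=1.0, s_next="s1", actions=actions)
print(delta, Q[("s0", 1)])
```

With a zero-initialized table, the first update has TD error equal to the reward, and the stored value moves a fraction *α* of the way toward the TD target.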

Thanks to the powerful function approximation properties of DNNs, DRL algorithms are able to learn low-dimensional representations of RL problems efficiently. For instance, the DQN algorithm [22] utilizes a DNN parameterized by *θ*^{Q} to approximate the *Q*-value as *Q*(*s*_{t},*a*_{t}|*θ*^{Q}). Moreover, an experience replay buffer \(\mathcal {B}\) is employed to store the agent’s experience tuple *e*_{t}=(*s*_{t},*a*_{t},*r*_{t},*s*_{t+1}) at each time step *t*, which can then be used to mitigate the instability caused by function approximation in RL. Specifically, for each time slot, a mini-batch of samples \((s,a,r,s^{\prime }) \sim U(\mathcal {B})\) is drawn uniformly at random from \(\mathcal {B}\) to calculate the following loss function:

$$\begin{array}{*{20}l} L(\theta^{Q}) = \mathbb{E}_{(s,a,r,s^{\prime}) \sim U(\mathcal{B})}\left[\left(r+\gamma \max\limits_{\hat{a}\in\mathcal{A}} Q(s^{\prime},\hat{a} | \theta^{{Q}^{\prime}})-Q(s,a| \theta^{Q})\right)^{2}\right]. \end{array} $$

(17)

Such a loss function is used to update the network parameters by \(\theta ^{Q} \leftarrow \theta ^{Q} - \alpha _{Q}\cdot \nabla _{\theta ^{Q}}L(\theta ^{Q})\) with a learning rate *α*_{Q}. In order to further improve stability, a target network parameterized by \(\theta ^{Q^{\prime }}\) in (17) is used to derive the TD error for the agent. With the so-called soft update strategy, the target network slowly tracks the weights of the learned network by \(\theta ^{Q^{\prime }} \leftarrow \tau \theta ^{Q} + (1-\tau)\theta ^{Q^{\prime }}\) with *τ*≪1. Besides, from (15), we know that the action taken by the agent at each time step *t* is obtained by \(a_{t} = \arg \max \limits _{a\in \mathcal {A}} Q(s_{t},a | \theta ^{Q})\).
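The two stabilizing mechanisms described above, uniform sampling from an experience replay buffer and slow tracking of the learned weights by a target network, can be sketched as follows. The class name, the flat-list representation of network parameters, and the buffer capacity are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of experience tuples (s, a, r, s')."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest tuples are dropped when full

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement, i.e., (s, a, r, s') ~ U(B).
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target, online, tau):
    """Return tau * theta + (1 - tau) * theta' element-wise for flat parameter lists."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, -1.0, t + 1)   # dummy tuples; only the last three are kept
batch = buffer.sample(2)
new_target = soft_update(target=[0.0], online=[1.0], tau=0.1)
print(len(buffer), len(batch), new_target)
```

Sampling uniformly from stored experience breaks the temporal correlation between consecutive tuples, while a small *τ* keeps the TD target from changing abruptly between updates.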

### DDPG-based dynamic computation offloading

Although problems in high-dimensional state spaces can be solved by DQN, only discrete and low-dimensional action spaces are supported. To this end, DDPG has been proposed [35] to extend conventional DRL algorithms to continuous action spaces. As shown in Fig. 2, an actor-critic approach is adopted by using two separate DNNs to approximate the *Q*-value network *Q*(*s*,*a*|*θ*^{Q}), i.e., the critic function, and the policy network *μ*(*s*|*θ*^{μ}), i.e., the actor function, respectively^{Footnote 4}. Specifically, the critic *Q*(*s*,*a*|*θ*^{Q}) is similar to DQN and can be updated according to (17). On the other hand, the actor *μ*(*s*|*θ*^{μ}) deterministically maps state *s* to a specific continuous action. As derived in [37], the policy gradient of the actor can be calculated as follows,

$$\begin{array}{*{20}l} \nabla_{\theta^{\mu}} J \approx \mathbb{E}_{(s,a,r,s^{\prime}) \sim U(\mathcal{B})}\left[\nabla_{a} Q(s,a|\theta^{Q}) |_{a=\mu(s|\theta^{\mu})} \cdot \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\right], \end{array} $$

(18)

which is the gradient of the expected return from the start distribution *J* with respect to the actor parameter *θ*^{μ}, averaged over the sampled mini-batch \(U(\mathcal {B})\). Note that the chain rule is applied here since the action *a*=*μ*(*s*|*θ*^{μ}) is taken as the input of the critic function. Thus, with (17) and (18) in hand, the network parameters of the critic and the actor can be updated by \(\theta ^{Q} \leftarrow \theta ^{Q} - \alpha _{Q}\cdot \nabla _{\theta ^{Q}}L(\theta ^{Q})\) and \(\theta ^{\mu } \leftarrow \theta ^{\mu } + \alpha _{\mu }\cdot \nabla _{\theta ^{\mu }}J\), respectively, i.e., gradient descent on the critic loss and gradient ascent on the expected return *J*. Here, *α*_{Q} and *α*_{μ} are the learning rates.
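To make the chain rule in (18) concrete, the following toy sketch uses a scalar linear actor *μ*(*s*|*θ*)=*θ**s* and a fixed, hand-picked critic \(Q(s,a) = -(a - ks)^{2}\), whose maximizing action is *a*=*ks*. Both functional forms and all numerical values are assumptions chosen so that the gradients can be written in closed form; in DDPG both functions are DNNs and the critic is learned. Because *J* is to be maximized, the sketch moves *θ* in the direction of the gradient estimate.

```python
def actor_gradient(theta, states, k):
    """Mini-batch estimate of (18) for mu(s) = theta * s and Q(s, a) = -(a - k * s)**2."""
    grads = []
    for s in states:
        a = theta * s                  # a = mu(s | theta)
        dq_da = -2.0 * (a - k * s)     # gradient of Q w.r.t. a, evaluated at a = mu(s)
        dmu_dtheta = s                 # gradient of mu w.r.t. theta
        grads.append(dq_da * dmu_dtheta)
    return sum(grads) / len(grads)


theta, k, lr = 0.0, 0.7, 0.1           # illustrative values only
states = [0.5, 1.0, 1.5]               # a fixed mini-batch of sampled states
for _ in range(200):
    theta += lr * actor_gradient(theta, states, k)   # gradient ascent on J
print(theta)
```

In this toy setting the actor parameter converges to *k*, i.e., the actor learns to output the action that the critic rates highest in every state.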

It is worth noting that the soft update strategy is also needed for DDPG. Thus, target networks parameterized by \(\theta ^{\mu ^{\prime }}\) and \(\theta ^{Q^{\prime }}\) are used to calculate the loss function, which differs from (17) in that the maximization over actions is replaced by the action that the target actor selects in the next state *s*^{′},

$$\begin{array}{*{20}l} L(\theta^{Q}) = \mathbb{E}_{(s,a,r,s^{\prime}) \sim U(\mathcal{B})}\left[\left(r+ \gamma Q(s^{\prime}, \mu(s^{\prime}|\theta^{\mu^{\prime}}) | \theta^{Q^{\prime}})-Q(s,a| \theta^{Q})\right)^{2}\right], \end{array} $$

(19)

where the target networks track the weights of the learned networks by \(\theta ^{\mu ^{\prime }} \leftarrow \tau \theta ^{\mu } + (1-\tau)\theta ^{\mu ^{\prime }}\) and \(\theta ^{Q^{\prime }} \leftarrow \tau \theta ^{Q} + (1-\tau)\theta ^{Q^{\prime }}\) with *τ*≪1, respectively.
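A minimal sketch of the critic loss computation is given below. Following the standard DDPG algorithm, the TD target uses the target actor's action in the next state *s*^{′}. The scalar states and the simple callables standing in for the actor and critic networks are hypothetical and serve only to make the arithmetic checkable by hand.

```python
def ddpg_td_targets(batch, target_actor, target_critic, gamma):
    """y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)) for each (s, a, r, s') in the batch."""
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_, _, r, s_next) in batch]


def critic_loss(batch, critic, targets):
    """Mean squared TD error, i.e., the mini-batch estimate of the critic loss."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Hypothetical scalar stand-ins for the target actor, target critic, and critic.
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: s + a

batch = [(1.0, 0.5, -1.0, 2.0), (2.0, 1.0, -2.0, 0.0)]   # tuples (s, a, r, s')
targets = ddpg_td_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, critic, targets)
print(targets, loss)
```

Note that no maximization over actions appears: in a continuous action space it is replaced by a single forward pass through the target actor.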

In order to minimize the computation cost at each user, a DDPG-based framework is proposed to train decentralized policies that control the power allocation for local execution and computation offloading. More specifically, the learned policy depends only on local observations of the environment and can make power allocation decisions in the continuous domain. In the sequel, the definitions of the state space, action space, and reward function in the framework are introduced in detail.

*State space:* In the proposed framework, the state space defines the representation of the environment from each user *m*’s perspective, which is then used to make power allocation decisions. For decentralized policies, each user’s state space depends only on local observations, as shown in Fig. 3. Specifically, at the start of time slot *t*, the queue length of its data buffer *B*_{m}(*t*) is updated according to (6). Meanwhile, the user receives feedback from the BS containing its last received SINR *γ*_{m}(*t*−1). Moreover, the channel vector *h*_{m}(*t*) for uplink transmission can be estimated by using channel reciprocity.

To define the state space based on the local observations mentioned above, we first leverage *B*_{m}(*t*) to indicate the amount of task data bits waiting to be executed. To reflect the impact of the interference from other mobile users’ uplink signals under ZF detection, we define the normalized SINR of slot *t* as follows,

$$\begin{array}{*{20}l} \phi_{m}(t) & = \frac{\gamma_{m}(t) \sigma_{R}^{2}}{p_{o,m}(t)\|\boldsymbol{h}_{m}(t)\|^{2}} \end{array} $$

(20)

$$\begin{array}{*{20}l} & = \frac{1}{\|\boldsymbol{h}_{m}(t)\|^{2} \left[\left(\boldsymbol{H}^{H}(t) \boldsymbol{H}(t)\right)^{-1}\right]_{mm}}, \end{array} $$

(21)

which represents the projected power ratio after ZF detection at the BS. In fact, in order to decode user *m*’s symbol without inter-stream interference, ZF detection projects the received signal *y*(*t*) onto the subspace orthogonal to the one spanned by the other users’ channel vectors [38], after which the interference from other users has already been removed for user *m*. Thus, *ϕ*_{m}(*t*) can be interpreted as the normalized SINR of user *m*, since it excludes the influence of its transmission power *p*_{o,m}(*t*) and channel vector *h*_{m}(*t*) and accurately reflects the impact of other users’ interference on the unit received power for user *m*. Practically, for slot *t*, the last normalized SINR *ϕ*_{m}(*t*−1) can be estimated from *γ*_{m}(*t*−1) and then included in the state space. Besides, channel quality can be quantified by the channel vector *h*_{m}(*t*). To summarize, from the perspective of each user *m*, the state space of slot *t* can be defined as

$$\begin{array}{*{20}l} s_{m,t} = \left[B_{m}(t), \phi_{m}(t-1), \boldsymbol{h}_{m}(t)\right]. \end{array} $$

(22)
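The normalized SINR in (21) and the state vector in (22) can be sketched numerically as follows. The sketch assumes that the channel matrix ***H*** has one column per user (antennas × users) and that the complex channel vector is fed to the networks as its real and imaginary parts; the latter is an assumption, since the representation of *h*_{m}(*t*) as a network input is not specified here. Function names are illustrative.

```python
import numpy as np


def normalized_sinr(H):
    """phi_m = 1 / (||h_m||^2 * [(H^H H)^{-1}]_{mm}) for every user m, as in (21)."""
    gram_inv = np.linalg.inv(H.conj().T @ H)
    col_power = np.sum(np.abs(H) ** 2, axis=0)            # ||h_m||^2 per user
    return 1.0 / (col_power * np.real(np.diag(gram_inv)))


def build_state(queue_len, phi_prev, h):
    """s_{m,t} = [B_m(t), phi_m(t-1), h_m(t)], with h split into real and imaginary parts."""
    return np.concatenate(([queue_len, phi_prev], h.real, h.imag))


# Orthogonal user channels: ZF detection loses no signal power, so phi = 1.
H_orth = np.array([[1, 0], [0, 2], [0, 0]], dtype=complex)
# Correlated user channels: part of each user's power is projected away.
H_corr = np.array([[1, 1], [0, 1]], dtype=complex)

print(normalized_sinr(H_orth), normalized_sinr(H_corr))
state = build_state(queue_len=3.0, phi_prev=0.5, h=H_corr[:, 0])
print(state)
```

For two users, *ϕ*_{m} equals the squared sine of the angle between the two channel vectors, which is why orthogonal channels give 1 and the correlated example gives 0.5.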

*Action space:* Based on the current state *s*_{m,t} of the system observed by each user agent *m*, an action *a*_{m,t} including the allocated powers for both local execution and computation offloading will be selected for each slot \(t \in \mathcal {T}\) as below:

$$\begin{array}{*{20}l} a_{m,t} = \left[p_{l,m}(t),p_{o,m}(t)\right]. \end{array} $$

(23)

It is worth noting that, by applying the DDPG algorithm, both power allocations can be finely optimized in a continuous action space, i.e., *p*_{l,m}(*t*)∈[0,*P*_{l,m}] and *p*_{o,m}(*t*)∈[0,*P*_{o,m}], to minimize the average computation cost, unlike conventional DRL algorithms that select from several predefined discrete power levels. Consequently, the high dimensionality of finely discretized action spaces is avoided.
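One possible way to keep the actor's output inside the feasible ranges [0,*P*_{l,m}] and [0,*P*_{o,m}] is to squash each raw output with a sigmoid and scale it by the corresponding power limit. This mapping is an assumption made for illustration; the output layer actually used is not specified in this section.

```python
import math


def scale_action(raw, p_local_max, p_offload_max):
    """Map two unbounded actor outputs to (p_l, p_o) in [0, P_l] x [0, P_o] via a sigmoid."""
    sig = [1.0 / (1.0 + math.exp(-x)) for x in raw]
    return sig[0] * p_local_max, sig[1] * p_offload_max


# Hypothetical power limits of 2 W for both local execution and offloading.
p_l, p_o = scale_action([0.0, 3.0], p_local_max=2.0, p_offload_max=2.0)
print(p_l, p_o)
```

Because the mapping is smooth, gradients of the critic with respect to the action can still be propagated back to the actor parameters.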

*Reward function:* As mentioned in Section 3.1, the behavior of each user agent is reward-driven, and thus the reward function plays a key role in the performance of DRL algorithms. In order to learn an energy-aware dynamic computation offloading policy that minimizes the long-term computation cost defined in (9), the reward *r*_{m,t} that each user receives after time step *t* can be defined as

$$\begin{array}{*{20}l} r_{m,t} = -w_{m,1} \cdot \left(p_{l,m}(t) + p_{o,m}(t)\right) - w_{m,2} \cdot B_{m}(t), \end{array} $$

(24)

which is the negative weighted sum of the instantaneous total power consumption and the queue length of task buffer.
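The reward in (24) translates directly into code. The weights and the numerical values below are illustrative assumptions only.

```python
def reward(p_local, p_offload, queue_len, w1, w2):
    """r_{m,t} = -w1 * (p_l + p_o) - w2 * B, the negative weighted cost of one slot."""
    return -w1 * (p_local + p_offload) - w2 * queue_len


# Illustrative values: 1 W local power, 2 W offloading power, 10 units of queued data.
print(reward(p_local=1.0, p_offload=2.0, queue_len=10.0, w1=0.5, w2=0.1))
```

A larger *w*_{m,2} relative to *w*_{m,1} penalizes queued data more heavily and therefore pushes the learned policy toward spending more power to empty the buffer.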

### Remark 1.

For each user agent, the value function is defined as the expected discounted return starting from a state, which can be maximized by the optimal policy [21]. Specifically, in the proposed MEC model, the policy learned by the DDPG-based algorithm maximizes the value function of user *m* which starts from the initial state *s*_{m,1} under policy *μ*_{m}, i.e.,

$$\begin{array}{*{20}l} V^{\mu_{m}}(s_{m,1}) = \mathbb{E} \left[\sum_{t=1}^{\infty} \gamma^{t-1} r_{m,t} | s_{m,1} \right], \end{array} $$

(25)

which can be used to approximate the real expected infinite-horizon undiscounted return at each user agent when *γ*→1 [39]. That is, the long-term average computation cost

$$\begin{array}{*{20}l} \bar{C}_{m}(s_{m,t})= \mathbb{E} \left[{\lim}_{T \rightarrow \infty}\frac{1}{T}\sum_{i=t}^{T} w_{m,1} \left(p_{l,m}(i)+p_{o,m}(i)\right) + w_{m,2} B_{m}(i) | s_{m,t} \right], \end{array} $$

(26)

will be minimized by applying the learned computation offloading policy \(\mu ^{*}_{m}\).
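One way to see the connection between the discounted value in (25) and a long-term average is that, under suitable regularity conditions, the discounted sum scaled by (1−*γ*) approaches the time-averaged reward as *γ*→1. The following sketch checks this numerically for a deterministic, periodic reward sequence; the sequence and the function name are assumptions for illustration and are not derived from the MEC model.

```python
def normalized_discounted_sum(rewards_cycle, gamma, horizon):
    """(1 - gamma) * sum_{t=0}^{horizon-1} gamma^t * r_t for a periodic reward sequence."""
    total, weight = 0.0, 1.0
    n = len(rewards_cycle)
    for t in range(horizon):
        total += weight * rewards_cycle[t % n]
        weight *= gamma
    return (1.0 - gamma) * total


cycle = [-1.0, -3.0]   # illustrative per-slot rewards; the time average is -2
for gamma in (0.9, 0.99, 0.999):
    print(gamma, normalized_discounted_sum(cycle, gamma, horizon=100_000))
```

The printed values move toward the time average of −2 as *γ* increases, which is the sense in which maximizing the discounted value with *γ* close to 1 approximately minimizes the long-term average cost.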

### Training and testing

To train and evaluate the policies for decentralized computation offloading, a simulated environment with a group of user agents is constructed to conduct training and testing. In particular, training and testing data are generated from the interactions between the user agents and the simulated environment, which accepts the decision of each user agent and returns the local observations. Also notice that the interaction between the user agents and the environment is generally a continuing RL task [21], which does not break naturally into identifiable episodes^{Footnote 5}. Thus, in order to achieve better exploration performance, the interaction is manually started from a random initial state *s*_{m,1} for each user *m* and terminated after a predefined maximum number of steps *T*_{max} in each episode.

The detailed training stage is illustrated in Algorithm 1. At each time step *t* during an episode, each agent’s experience tuple (*s*_{m,t},*a*_{m,t},*r*_{m,t},*s*_{m,t+1}) is stored in its own experience buffer \(\mathcal {B}_{m}\). Meanwhile, the user agent’s actor and critic networks are updated using a mini-batch of experience tuples \(\{(s_{i},a_{i},r_{i},s^{\prime }_{i})\}_{i=1}^{I}\) randomly sampled from the replay buffer \(\mathcal {B}_{m}\). In this way, after training for *K*_{max} episodes, the dynamic computation offloading policy is gradually and independently learned at each user agent. Since the model can only improve with adequate exploration of the state space, one major challenge is the trade-off between exploration and exploitation, which is even more difficult for learning in continuous action spaces [35]. Since exploration in DDPG is treated independently from the learning process, the exploration policy *μ*^{′} can be constructed by adding noise *Δ**μ* sampled from a random noise process to the actor, i.e.,

$$\begin{array}{*{20}l} \mu^{\prime}(s) = \mu(s|\theta^{\mu}) + \Delta\mu, \end{array} $$

(27)

where the random noise process needs to be carefully selected. For example, exploration noise sampled from a temporally correlated random process can better preserve momentum.
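One common choice of temporally correlated noise, used in the original DDPG work [35], is the Ornstein-Uhlenbeck (OU) process. Whether this particular process is used in the proposed framework is not stated in this section, so the sketch below is only an example of (27) under that assumption. The parameter values and the clipping of the noisy action to the feasible power range are likewise assumptions.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process producing temporally correlated noise."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = random.Random(seed)
        self.x = mu   # the process starts at its long-run mean

    def sample(self):
        # Euler-Maruyama step: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x


def explore(actor_output, noise, low, high):
    """mu'(s) = mu(s) + noise as in (27), clipped to [low, high] (clipping is an assumption)."""
    return min(high, max(low, actor_output + noise.sample()))


noise = OUNoise(seed=0)
actions = [explore(0.5, noise, low=0.0, high=1.0) for _ in range(5)]
print(actions)
```

Because each sample depends on the previous one, consecutive perturbations tend to point in the same direction, unlike independent Gaussian noise that averages out over a few slots.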

As for the testing stage, each user agent first loads its actor network parameters learned in the training stage. Then, the user agent starts with an empty data buffer and interacts with a randomly initialized environment, where it selects actions according to the output of the actor network, using its local observation of the environment as the current state.
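The testing procedure can be sketched as a simple rollout in which the actor is queried without exploration noise and the per-slot reward is averaged. The single-user `ToyQueueEnv` below is a purely hypothetical stand-in: its queue dynamics, reward weights, and interface are assumptions made so that the example runs, and they do not reproduce the MEC system model or Algorithm 1.

```python
import random


class ToyQueueEnv:
    """Hypothetical single-user environment whose state is just the queue length."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.queue = 0.0

    def reset(self):
        self.queue = 0.0                  # testing starts with an empty data buffer
        return self.queue

    def step(self, action):
        served = action                   # toy assumption: one unit of power serves one unit of data
        arrivals = self.rng.uniform(0.0, 1.0)
        self.queue = max(0.0, self.queue - served) + arrivals
        reward = -0.5 * action - 0.5 * self.queue   # illustrative weights only
        return self.queue, reward


def evaluate(actor, env, num_steps):
    """Average per-slot reward of a deterministic actor over num_steps slots."""
    state = env.reset()
    total = 0.0
    for _ in range(num_steps):
        action = actor(state)             # no exploration noise at test time
        state, r = env.step(action)
        total += r
    return total / num_steps


greedy_actor = lambda q: min(q, 1.0)      # stand-in for a trained actor network
print(evaluate(greedy_actor, ToyQueueEnv(seed=1), num_steps=100))
```

Fixing the random seed of the environment makes evaluations of different policies comparable, since they then face the same sequence of task arrivals.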