Decentralized Computation Offloading for Multi-User Mobile Edge Computing: A Deep Reinforcement Learning Approach

Mobile edge computing (MEC) has recently emerged as a promising solution to relieve resource-limited mobile devices of computation-intensive tasks: it enables devices to offload workloads to nearby MEC servers and thereby improve the quality of the computation experience. Nevertheless, for a MEC system consisting of multiple mobile users with stochastic task arrivals and time-varying wireless channels, as considered in this paper, designing computation offloading policies that minimize the long-term average computation cost in terms of power consumption and buffering delay is challenging. We investigate a deep reinforcement learning (DRL) based decentralized dynamic computation offloading strategy that builds a scalable MEC system with limited feedback. Specifically, a continuous action space based DRL approach named deep deterministic policy gradient (DDPG) is adopted to learn efficient computation offloading policies independently at each mobile user, so that the powers of both local execution and task offloading can be adaptively allocated by the learned policy from each user's local observation of the MEC system. Numerical results demonstrate that efficient policies can be learned at each user, and that the proposed DDPG based decentralized strategy outperforms the conventional deep Q-network (DQN) based discrete power control strategy as well as some greedy strategies, with reduced computation cost. Besides, the power-delay tradeoff is analyzed for both the DDPG based and DQN based strategies.


I. INTRODUCTION
With the growing popularity of smart mobile devices in the coming 5G era, mobile applications, especially computation-intensive tasks such as online 3D gaming, face recognition, and location-based augmented or virtual reality (AR/VR), have been greatly constrained by the limited on-device computation capability [1]. Meanwhile, for the large number of low-power, resource-constrained wireless terminals serving the emerging Internet of Things (IoT) [2] and Intelligent Transport Systems (ITS) [3], a huge amount of sensory data also needs to be pre-processed and analyzed.
As a result, to meet the quality of experience (QoE) of these mobile applications, the technology of mobile edge computing (MEC) [4] has been proposed as a promising solution to bridge the gap between the limited resources on mobile devices and the ever-increasing demand of computation requested by mobile applications.
Instead of relying on remote public clouds, as in conventional cloud computing systems such as Amazon Web Services and Microsoft Azure, MEC enhances the radio access networks (RANs), which are in close proximity to mobile users, with computing capability [5]-[7]. It enables mobile devices to offload computation workloads to the MEC server associated with a base station (BS), and thus improves the QoE of mobile applications with considerably reduced latency and power consumption. Much research attention has been attracted from both industry [8] and academia [9]-[11]. Nevertheless, computation offloading depends highly on the efficiency of wireless data transmission, which requires MEC systems to manage radio resources along with computation resources to complete computation tasks efficiently.
In order to achieve higher energy efficiency or a better computation experience, computation offloading strategies for MEC have been widely investigated in the recent literature. For short-term optimization over quasi-static channels, several algorithms have been studied in [12]-[18]. In [12], optimal joint offloading selection and radio resource allocation for mobile task offloading was studied to minimize the overall execution time. For decentralized algorithms with reduced overhead, a game-theoretic computation offloading scheme was constructed in [13]. Moreover, with dynamic voltage and frequency scaling (DVFS) techniques, the CPU-cycle frequency was flexibly controlled along with other features in [14], [15], where the system cost, defined as the weighted sum of energy consumption and execution time, was reduced. Besides, the energy-latency tradeoff was discussed in [16], where communication and computation resource allocation are jointly optimized under limited energy and latency constraints. Also, it has been shown that the performance of MEC can be further improved by adopting other emerging technologies such as wireless power transfer [17] and non-orthogonal multiple access (NOMA) [18].
To cope with stochastic task arrivals and time-varying wireless channels, strategies for dynamic joint control of radio and computation resources in MEC systems become even more challenging [19]-[28]. In [19], a threshold-based dynamic computation offloading policy was proposed to minimize energy consumption under stochastic wireless channels. For low-complexity online algorithms, Lyapunov optimization has been widely adopted. In [20], dynamic policies for offloading decisions, clock speed, and network interface control were considered to minimize energy consumption under given delay constraints. Joint optimization of multiple-input multiple-output (MIMO) beamforming and computational resource allocation for a multi-cell MEC system was designed in [21].
Additionally, an energy-harvesting-enabled green MEC system is studied in [22], where a delay cost addressing both the execution delay and task failure is minimized. For multi-user scenarios, the power-delay tradeoff [23], network utility maximization balancing throughput and fairness with reduced feedback [24], and stochastic admission control and scheduling for multi-user multi-task computation offloading [25] were discussed, respectively. On the other hand, the Markov decision process (MDP) framework can also be applied to the analysis and design of dynamic control of MEC systems [26], [27]. Furthermore, it was shown in [28] and [29] that an optimal dynamic computation offloading policy can be learned by emerging reinforcement learning (RL) based algorithms without any prior knowledge of the MEC system.
Conventional RL algorithms cannot scale well as the number of agents increases, since the explosion of the state space makes traditional tabular methods infeasible [30]. Nevertheless, by exploiting deep neural networks (DNNs) for function approximation, deep reinforcement learning (DRL) has been demonstrated to efficiently approximate the Q-values of RL [31]. There have been some attempts to adopt DRL in the design of online resource allocation and scheduling in wireless networks [32]-[34], including some recent works targeting computation offloading in MEC [35]-[38]. Specifically, in [38], the system sum cost combining execution delay and energy consumption of a multi-user MEC system is minimized by optimizing the offloading decision and computational resource allocation. Similarly, the authors in [36] considered an online offloading algorithm to maximize the weighted sum computation rate in a wireless powered MEC system.
In [37], a DRL based computation offloading strategy for an IoT device is learned to choose an MEC server to offload to and to determine the offloading rate. Besides, a double deep Q-network (DQN) based strategic computation offloading algorithm was proposed in [35], where a mobile device learns the optimal task offloading and energy allocation to maximize its long-term utility based on the task queue state, the energy queue state, and the channel qualities. However, existing works have focused only on centralized DRL based algorithms for optimal computation offloading in MEC systems; the design of decentralized DRL based algorithms for dynamic task offloading control of a multi-user MEC system remains open.
In this paper, we consider a general MEC system consisting of one base station (BS) with an attached MEC server and multiple mobile users, where tasks arrive stochastically and the channel condition is time-varying for each user. Without any prior knowledge of the network statistics of the MEC system, a dynamic computation offloading policy is learned independently at each mobile user based on local observations of the MEC system. Moreover, different from the DRL based policies in existing works, which make decisions in discrete action spaces, we adopt a continuous action space based algorithm named deep deterministic policy gradient (DDPG) to derive better power control for local execution and task offloading. Specifically, the major contributions of this paper can be summarized as follows:
• A multi-user MIMO based MEC system is considered, where each mobile user, with stochastic task arrivals and time-varying wireless channels, attempts to independently learn dynamic computation offloading policies from scratch to minimize the long-term average computation cost in terms of power consumption and task buffering delay.
• By adopting DDPG, a DRL framework for decentralized dynamic computation offloading has been designed, which enables each mobile user to leverage only local observations of the MEC system to gradually learn efficient policies for dynamic power allocation of both local execution and computation offloading in a continuous domain.
• Numerical simulations are performed to illustrate performance of the policy learned from the DDPG based decentralized strategy and analyze the power-delay tradeoff for each user.
Superiority of the continuous power control based DDPG over the discrete control based DQN and some other greedy strategies is also demonstrated.
The rest of this paper is organized as follows. In Section II, some preliminaries on DRL are introduced. Then, system model for the dynamic computation offloading of the MEC system is presented in Section III. Design of the decentralized DRL based dynamic computation offloading algorithm is proposed in Section IV. Numerical results will be illustrated in Section V. Finally, Section VI concludes this paper.

II. PRELIMINARIES ON DEEP REINFORCEMENT LEARNING
In this section, we first give an overview of MDP and RL [30], and then introduce some basics of the emerging DRL technology [31]. Finally, the recent extension of DRL to continuous action spaces, i.e., DDPG [39], is presented.

A. MDP
An MDP consists of an agent, an environment E, a set of possible states S, a set of available actions A, and a reward function r : S × A → R, where the agent continually learns and makes decisions through interaction with the environment in discrete time steps. In each time step t, the agent observes the current state of the environment s_t ∈ S, and chooses and executes an action a_t ∈ A according to a policy π. After that, the agent receives a scalar reward r_t = r(s_t, a_t) ∈ R ⊆ R from the environment E and finds itself in the next state s_t+1 ∈ S according to the transition probability p(s_t+1|s_t, a_t) of the environment. Thus, the dynamics of the environment E is determined by the transition probability in response to the action taken by the agent in the current state, while the goal of the agent is to find the optimal policy that maximizes the long-term expected discounted reward it receives, i.e.,

R_0 = E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) ],

where T → ∞ is the total number of time steps taken and γ ∈ [0, 1] is the discount factor.
It is worth noting that a policy π is generally stochastic, mapping the current state to a probability distribution over the actions, i.e., π : S → P(A). Under the policy π, the expected discounted return starting from state s_t is defined as the value function

V^π(s_t) = E[ Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) ],

while the state-action function is the expected discounted return after taking an action a_t, i.e.,

Q^π(s_t, a_t) = E[ r(s_t, a_t) + γ V^π(s_t+1) ].

A fundamental property frequently used in MDPs, called the Bellman equation, expresses the recursive relationship of the value function and the state-action function, respectively:

V^π(s_t) = E_{a_t∼π}[ Q^π(s_t, a_t) ],
Q^π(s_t, a_t) = E_{s_t+1∼E}[ r(s_t, a_t) + γ E_{a_t+1∼π}[ Q^π(s_t+1, a_t+1) ] ].
Moreover, under the optimal policy π*, the Bellman optimality equation for the value function can be written as

V*(s_t) = max_{a_t∈A} E_{s_t+1∼E}[ r(s_t, a_t) + γ V*(s_t+1) ].     (6)

Based on the assumption of a perfect model of the environment in an MDP, dynamic programming (DP) algorithms such as value iteration can be applied to obtain the optimal value function of any state s ∈ S under the optimal policy π*, i.e.,

V_{k+1}(s) = max_{a∈A} E_{s'∼E}[ r(s, a) + γ V_k(s') ],

where k is the index of value iteration. Once the optimal value function V*(s) is obtained, the optimal state-action function can be derived by Q*(s, a) = E_{s'∼E}[ r(s, a) + γ V*(s') ]. Then, the optimal policy π* chooses the optimal action greedily in state s as follows:

π*(s) = arg max_{a∈A} Q*(s, a).     (8)
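As a concrete illustration of the value-iteration recursion above, the following minimal sketch solves a hypothetical two-state, two-action MDP; all transition probabilities, rewards, and the discount factor are made-up illustrative values, not part of the MEC model:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.5

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a E[r(s, a) + gamma * V_k(s')] to a fixed point."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # Q[s, a] under the current V_k
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # Return the (near-)optimal values and the greedy policy
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

Once the iteration converges, acting greedily with respect to V* recovers the optimal policy, exactly as in (8).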

B. RL
Unlike MDP-based methods, RL algorithms attempt to derive optimal policies without an explicit model of the environment's dynamics. In this case, the underlying transition probability p(s_t+1|s_t, a_t) is unknown and may even be non-stationary. Thus, the RL agent learns from actual interactions with the environment and adapts its behavior upon experiencing the outcomes of its actions, so as to maximize the expected discounted reward. To this end, as a combination of Monte Carlo methods and DP, the temporal-difference (TD) method arises to learn directly from raw experience.
Note that, similar to (6), the Bellman optimality equation for the state-action function is

Q*(s_t, a_t) = E_{s_t+1∼E}[ r(s_t, a_t) + γ max_{a_t+1∈A} Q*(s_t+1, a_t+1) ],

from which we can update the state-action function using the agent's experience tuple (s_t, a_t, r_t, s_t+1) and other learned estimates at each time step t as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a∈A} Q(s_t+1, a) − Q(s_t, a_t) ],     (10)

where α is the learning rate. The update (10) is the well-known Q-learning algorithm [40], from which the state-action function is also known as the Q-value. It is worth noting that Q-learning is off-policy, since it directly approximates the optimal Q-value while the transitions experienced by the agent are independent of the policy being learned. Besides, it can be proved that the Q-learning algorithm converges with probability one [30]. With the estimated optimal state-action function, the optimal policy π* can be easily obtained from (8).
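The Q-learning update above can be sketched in a few lines. The toy two-state environment below (reward 1 only for action 1 in state 0) is purely illustrative and unrelated to the MEC model:

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Two states, two actions; action 1 taken in state 0 pays reward 1.
Q = [[0.0, 0.0], [0.0, 0.0]]
random.seed(0)
for _ in range(200):
    s = random.randint(0, 1)
    a = random.randint(0, 1)            # pure exploration (epsilon = 1)
    r = 1.0 if (s == 0 and a == 1) else 0.0
    s_next = 1 if a == 1 else 0
    q_update(Q, s, a, r, s_next)
```

After enough updates, the learned Q-values identify action 1 in state 0 as the rewarding choice, so the greedy policy of (8) recovers it.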

C. DRL
Thanks to the powerful function approximation properties of DNNs, DRL algorithms [31] can learn low-dimensional representations for RL problems, which efficiently addresses the curse of dimensionality. As illustrated in Fig. 1, the recent DQN technique [31] successfully takes advantage of a DNN parameterized by θ to approximate the Q-values Q(s, a). In order to resolve the instability of using function approximation in RL, an experience replay buffer B is employed, which stores the agent's experience e_t = (s_t, a_t, r_t, s_t+1) at each time step t. Meanwhile, a mini-batch of samples (s, a, r, s') ∼ U(B) is drawn uniformly at random from B, and the following loss function is calculated:

L(θ) = E_{(s,a,r,s')∼U(B)}[ ( r + γ max_{a'∈A} Q(s', a'|θ') − Q(s, a|θ) )² ],     (11)

which can be used to update the network parameters by θ ← θ − α · ∇_θ L(θ) with a learning rate α. Note that, in order to further improve the stability of RL, DQN utilizes a target network with parameters θ' to derive the TD error for the agent, which adopts a so-called soft update strategy that tracks the weights of the learned network by

θ' ← τθ + (1 − τ)θ',

with τ ≪ 1. Besides, the action taken by the agent at each time step t is obtained by a_t = arg max_{a∈A} Q(s_t, a|θ).
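To make the loss and the soft target update concrete, the following numpy sketch substitutes a hypothetical linear Q-function for the DNN; all dimensions and data are illustrative, and in practice the target network is initialized as a copy of the learned one:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, tau = 4, 3, 0.99, 0.001
theta = rng.normal(size=(n_features, n_actions))         # learned parameters
theta_target = rng.normal(size=(n_features, n_actions))  # target parameters

def q_values(theta, s):
    # Linear stand-in for Q(s, .|theta); a DQN would evaluate a DNN here
    return s @ theta

def dqn_loss(theta, theta_target, batch):
    s, a, r, s_next = batch
    q = q_values(theta, s)[np.arange(len(a)), a]         # Q(s_t, a_t | theta)
    # TD target computed with the *target* network parameters
    y = r + gamma * q_values(theta_target, s_next).max(axis=1)
    return np.mean((y - q) ** 2)

def soft_update(theta_target, theta, tau):
    # Target slowly tracks the learned network: theta' <- tau*theta + (1-tau)*theta'
    return tau * theta + (1 - tau) * theta_target

# A random mini-batch of 32 transitions (s, a, r, s'), for illustration only
batch = (rng.normal(size=(32, n_features)),
         rng.integers(0, n_actions, size=32),
         rng.normal(size=32),
         rng.normal(size=(32, n_features)))
loss = dqn_loss(theta, theta_target, batch)
theta_target_old = theta_target.copy()
theta_target = soft_update(theta_target, theta, tau)
```

The small soft-update rate τ keeps the TD target slowly moving, which is what stabilizes learning.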

D. DDPG
Although problems in high-dimensional state spaces have been successfully solved by DQN, only discrete and low-dimensional action spaces can be handled. To extend DRL algorithms to continuous action spaces, DDPG has been proposed in [39]. As shown in Fig. 2, an actor-critic approach is adopted by using two separate DNNs to approximate the Q-value network Q(s, a|θ^Q), i.e., the critic function, and the policy network µ(s|θ^µ), i.e., the actor function, respectively. Specifically, the critic Q(s, a|θ^Q) is similar to DQN and can be updated following (11). On the other hand, the actor µ(s|θ^µ) deterministically maps a state s to a specific continuous action. As derived in [41], the policy gradient of the actor can be calculated by the chain rule as

∇_{θ^µ} J ≈ E_{s∼U(B)}[ ∇_a Q(s, a|θ^Q)|_{a=µ(s|θ^µ)} · ∇_{θ^µ} µ(s|θ^µ) ],     (12)

which is the gradient of the expected return from the start distribution J with respect to the actor parameters θ^µ, averaged over the sampled mini-batch U(B). Thus, with (11) and (12) in hand, the network parameters of the critic and the actor can be updated by θ^Q ← θ^Q − α_Q · ∇_{θ^Q} L(θ^Q) and θ^µ ← θ^µ + α_µ · ∇_{θ^µ} J (gradient ascent on J), respectively, where α_Q and α_µ are the learning rates.
One major challenge in RL is the tradeoff between exploration and exploitation [42], which is even more difficult for learning in continuous action spaces. As an off-policy algorithm, DDPG can treat exploration independently from the learning process. Thus, the exploration policy µ' can be constructed by adding noise ∆µ sampled from a random noise process to the actor, i.e.,

µ'(s_t) = µ(s_t|θ^µ) + ∆µ,

where the random noise process needs to be carefully selected; e.g., exploration noise sampled from a temporally correlated random process can better preserve momentum [39].
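For instance, the temporally correlated noise can be generated with a discretized Ornstein-Uhlenbeck process, as suggested in [39]; the parameter values below are only illustrative:

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for temporally correlated noise."""
    def __init__(self, dim, theta=0.15, sigma=0.12, dt=1.0, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, I)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape))
        self.x = self.x + dx
        return self.x.copy()

noise = OUNoise(dim=2)
samples = np.array([noise.sample() for _ in range(1000)])
```

Because the process mean-reverts slowly (θ small), successive samples are strongly correlated, which keeps the explored actions smooth over time.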

III. DYNAMIC COMPUTATION OFFLOADING FOR MOBILE EDGE COMPUTING
As shown in Fig. 3, we consider a multi-user MEC system consisting of one BS, an attached MEC server, and a set M = {1, . . . , M} of mobile users. The MEC server is deployed in proximity to the BS by the telecom operator, and it can improve each user's computation experience by enabling the user to offload part of its computation load to the MEC server via the wireless link [6]. A discrete-time model is adopted for the MEC system, where the operating period is slotted with equal length τ_0 and indexed by T = {0, 1, . . .}.
The channel condition and task arrival of each user vary from slot to slot t ∈ T. Thus, aiming to balance the average energy consumption and the task processing delay, each user needs to determine the ratio of local execution to computation offloading in each slot. Moreover, as the number of mobile users increases, decentralized task scheduling at each user becomes more favorable, since it reduces the system overhead between the users and the MEC server and improves the scalability of the MEC system. In the following parts, we introduce the network and computation models in detail.

A. Network Model
In the MEC system, we consider a 5G macro-cell or small-cell BS equipped with N antennas, which manages the uplink transmissions of multiple single-antenna mobile users by employing the well-known zero-forcing (ZF) linear detection algorithm, which is of low complexity yet efficient, especially for multi-user MIMO with large antenna arrays [43]. For each time slot t ∈ T, if the channel vector of each mobile user m ∈ M is represented by h_m(t) ∈ C^N, the received signal at the BS can be written as

y(t) = Σ_{m∈M} h_m(t) √(p_o,m(t)) s_m(t) + n(t),

where p_o,m(t) ∈ [0, P_o,m] is the transmission power of user m to offload task data bits, with P_o,m being its maximum value, s_m(t) is the complex data symbol with unit variance, and n(t) is a vector of additive white Gaussian noise (AWGN) with variance σ_R². Note that I_N denotes the N × N identity matrix. In order to characterize the temporal correlation between time slots for each mobile user m ∈ M, the following Gauss-Markov block fading autoregressive model [44] is adopted:

h_m(t) = ρ_m h_m(t − 1) + √(1 − ρ_m²) e(t),     (15)

where ρ_m is the normalized channel correlation coefficient between slots t and t − 1, and the error vector e(t) is complex Gaussian and uncorrelated with h_m(t − 1). Note that ρ_m = J_0(2πf_{d,m}τ_0) according to Jakes' fading spectrum, where f_{d,m} is the Doppler frequency of user m, τ_0 is the slot length, and J_0(·) is the zeroth-order Bessel function of the first kind [45].
Denoting H(t) = [h_1(t), . . . , h_M(t)] as the N × M channel matrix between the BS and the M users, the linear ZF detector at the BS can be written via the channel matrix's pseudo-inverse [43] as

G(t) = (H^H(t)H(t))^{−1} H^H(t),

whose m-th row g_m^H(t) satisfies g_m^H(t) h_n(t) = δ_mn. Here, δ_ij = 1 when i = j and 0 otherwise. Thus, the corresponding signal-to-interference-plus-noise ratio (SINR) of user m can be derived as

γ_m(t) = p_o,m(t) / ( σ_R² [(H^H(t)H(t))^{−1}]_{mm} ),     (16)

where [A]_{mn} is the (m, n)-th element of matrix A. From (16), it can be verified that each user's SINR degrades as the number of users M increases, which requires each user to allocate more power for task offloading. In the sequel, we will show how each user learns to adapt to the environment from the SINR feedback.
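The ZF detector and the per-user SINR in (16) can be checked numerically. The sketch below uses an i.i.d. complex Gaussian channel; the values of N, M, the powers, and the noise level are illustrative and are not the simulation settings of this paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 4                       # BS antennas and users (illustrative)
sigma2 = 1e-9                     # noise power sigma_R^2
p_o = np.full(M, 1.0)             # offloading powers of the M users [W]

# i.i.d. Rayleigh-fading channel matrix H of size N x M
H = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)

G = np.linalg.pinv(H)             # ZF detector: (H^H H)^{-1} H^H

# Per-user SINR: gamma_m = p_o,m / (sigma_R^2 * [(H^H H)^{-1}]_{mm})
gram_inv = np.linalg.inv(H.conj().T @ H)
sinr = p_o / (sigma2 * np.real(np.diag(gram_inv)))
```

Since G(t)H(t) = I_M, the ZF rows null out the other users' streams, which is why only the diagonal of (H^H H)^{−1} appears in the SINR.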

B. Computation Model
In this part, we discuss how each mobile user m ∈ M takes advantage of local execution and computation offloading to serve its running applications. Without loss of generality, we use a_m(t) to quantify the number of task arrivals during slot t ∈ T, which can be processed starting from slot t + 1 and is independent and identically distributed (i.i.d.) over different time slots with mean arrival rate λ_m = E[a_m(t)]. Besides, we assume that the applications are fine-grained [20].
That is, some of the task bits, denoted by d_l,m(t), will be processed on the mobile device, while the other bits, denoted by d_o,m(t), will be offloaded to and executed by the MEC server. Thus, if B_m(t) stands for the queue length of user m's task buffer at the beginning of slot t, it evolves as

B_m(t + 1) = [B_m(t) − (d_l,m(t) + d_o,m(t))]^+ + a_m(t),     (17)

where B_m(0) = 0 and [x]^+ = max(x, 0).

1) Local computing:
In this part, we show the amount of data processed locally given the allocated local execution power p_l,m(t) ∈ [0, P_l,m]. To start with, let L_m denote the number of CPU cycles required to process one task bit at user m, which can be estimated through off-line measurement [46]. By chip voltage adjustment using DVFS techniques [47], the CPU frequency scheduled for slot t can be written as

f_m(t) = (p_l,m(t)/κ)^{1/3},

where κ is the effective switched capacitance depending on the chip architecture. Note that 0 ≤ f_m(t) ≤ F_m, with F_m = (P_l,m/κ)^{1/3} being the maximum allowable CPU-cycle frequency of user m's device. As a result, the number of locally processed bits in slot t is

d_l,m(t) = τ_0 f_m(t)/L_m.

2) Edge computing: To take advantage of edge computing, it is worth noting that the MEC server is usually equipped with sufficient computational resources, e.g., a high-frequency multi-core CPU. Thus, it can be assumed that different applications are handled in parallel with negligible processing latency, and the feedback delay is ignored due to the small size of the computation output. In this way, all the task data bits offloaded to the MEC server via the BS will be processed. Therefore, according to (16) and given the uplink transmission power p_o,m(t), the amount of offloaded data bits of user m during slot t can be derived as

d_o,m(t) = τ_0 W log₂(1 + γ_m(t)),

where W is the system bandwidth and γ_m(t) is obtained from (16).
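Putting the pieces together, one slot of a user's bookkeeping — the DVFS frequency, locally processed bits, offloaded bits, and the queue update — can be sketched as follows. The slot length, arrival size, and SINR value are illustrative placeholders, not the paper's simulation settings:

```python
import numpy as np

tau0 = 1e-3        # slot length [s] (illustrative)
kappa = 1e-27      # effective switched capacitance
L = 500.0          # CPU cycles per task bit
W = 1e6            # system bandwidth [Hz]

def slot_update(B, p_l, sinr, a=2.0e3):
    """One slot of user m: returns next queue length and the processed bits."""
    f = (p_l / kappa) ** (1.0 / 3.0)         # CPU frequency from the DVFS model
    d_local = tau0 * f / L                    # bits processed on-device
    d_off = tau0 * W * np.log2(1.0 + sinr)    # bits offloaded via the uplink
    B_next = max(B - (d_local + d_off), 0.0) + a   # queue evolution
    return B_next, d_local, d_off

B_next, d_local, d_off = slot_update(B=1e4, p_l=1.0, sinr=100.0)
```

With p_l = 1 W and κ = 10⁻²⁷, the cube-root model gives f = 10⁹ Hz, i.e., 2000 bits processed locally in a 1 ms slot.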

IV. DRL BASED DECENTRALIZED DYNAMIC COMPUTATION OFFLOADING
In this section, we develop a DRL based approach to minimize the computation cost of each mobile user, in terms of energy consumption and buffering delay, in the proposed multi-user MEC system. Specifically, by employing the DDPG algorithm, a decentralized dynamic computation offloading policy is learned independently at each user, which selects an action, i.e., the allocated powers for both local execution and computation offloading, upon observing the environment from its own perspective. It is worth noting that each user has no prior knowledge of the MEC system: the number of users M and the statistics of task arrivals and wireless channels are unknown to each user agent, so the online learning process is entirely model-free. In the following, the DDPG based DRL framework for decentralized dynamic computation offloading is introduced, where the state space, action space, and reward function are defined. Then, how to use the framework to train and test the decentralized policies is also presented.

A. The DRL Framework
State Space: A full observation of the system includes the channel vectors and the task-buffer queue lengths of all users. However, the system overhead of collecting such information at the BS and then distributing it to each user is very high in practice, and becomes even higher as the number of mobile users increases. In order to reduce the overhead and make the MEC system more scalable, we assume that the state of each user agent is determined only by its local observation of the system, upon which each user selects an action independently of the other user agents.
As shown in Fig. 4, at the start of time slot t, the queue length B_m(t) of each user m's data buffer is updated according to (17). Meanwhile, a feedback from the BS conveying user m's last received SINR at the BS, i.e., γ_m(t − 1), is received. At the same time, the channel vector h_m(t) for the upcoming uplink transmission can be estimated by using channel reciprocity. As a result, from the perspective of each user m, the state can be defined as

s_m,t = [ B_m(t), φ_m(t − 1), h_m(t) ],

where we denote the projected power ratio after ZF detection at the BS for slot t as

φ_m(t) = γ_m(t) σ_R² / p_o,m(t) = 1 / [(H^H(t)H(t))^{−1}]_{mm}.

Note that, in order to decode user m's symbol without inter-stream interference, ZF detection projects the received signal y(t) onto the subspace orthogonal to the one spanned by the other users' channel vectors [48]. In this way, φ_m(t) can be interpreted as the fraction of user m's uplink signal power retained after projection, per unit received power.
Action Space: Based on the current state s_m,t observed by each user agent m, an action a_m,t comprising the allocated powers for both local execution and computation offloading is selected for each slot t ∈ T as

a_m,t = [ p_l,m(t), p_o,m(t) ].

It is worth noting that, by applying the DDPG algorithm, each power can be optimized in a continuous action space, i.e., p_l,m(t) ∈ [0, P_l,m] and p_o,m(t) ∈ [0, P_o,m], to minimize the average computation cost, unlike conventional DRL algorithms that select from several predefined discrete power levels. Consequently, the high dimensionality of discrete action spaces can be avoided.
Reward Function: As mentioned in Section II, the behavior of each user agent is reward-driven, which indicates that the reward function plays a key role in the performance of DRL algorithms. In order to learn an energy-aware dynamic computation offloading policy for the proposed MEC model, we aim to minimize the energy consumption while completing tasks within an acceptable buffering delay. Thus, the overall computation cost of each user agent is counted by both the total energy cost and a penalty on task buffering delay. Notice that, according to Little's Theorem [49], the average queue length of the task buffer is proportional to the average buffering delay. In this way, we define the reward function r_m,t that each user agent m receives after slot t as

r_m,t = − w_m,1 ( p_l,m(t) + p_o,m(t) ) − w_m,2 B_m(t),     (24)

where w_m,1 and w_m,2 are nonnegative weight factors, so the reward r_m,t is the negative weighted sum of the instantaneous total power consumption and the queue length of the task buffer. By choosing the discount factor γ close to 1, the expected discounted return can be used to approximate the real expected infinite-horizon undiscounted return [50] at each user agent. That is, the following average computation cost is minimized by applying the learned computation offloading policy µ*_m:

C̄_m = lim_{T→∞} (1/T) Σ_{t=1}^{T} E[ w_m,1 ( p_l,m(t) + p_o,m(t) ) + w_m,2 B_m(t) ].
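The per-slot reward described above is a one-liner; the weights and arguments below are illustrative values only:

```python
def reward(p_l, p_o, B, w1=0.5, w2=0.5):
    """r_m,t = -w1 * (p_l + p_o) - w2 * B_m(t)."""
    return -w1 * (p_l + p_o) - w2 * B

r = reward(p_l=1.0, p_o=0.5, B=10.0)
```

Spending less power and keeping the queue short both push the reward toward zero from below, which is exactly the power-delay tradeoff the weights control.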

B. Training and Testing
To learn and evaluate the decentralized computation offloading algorithm, the DRL framework operates in two stages: training and testing. Before training, we build the DRL framework with a simulated environment and a group of user agents. The simulated environment, which mimics the interaction of the user agents with the MEC system, is used to generate the training and testing data; it accepts the decision of each user agent and returns feedback on CSI and SINR. For each user agent, the TensorFlow library [51] is used to construct and train the DNNs in the DDPG algorithm.
The detailed training stage is illustrated in Algorithm 1. It is worth noting that the interaction between the user agents and the environment is generally a continuing RL task [30], which does not break naturally into identifiable episodes. Thus, in order to achieve better exploration, the interaction of each user agent with the MEC system is manually started with a random initial state s_m,1 and terminated after a predefined maximum number of steps T_max in each episode.
At each time step t during an episode, each agent's experience tuple (s_m,t, a_m,t, r_m,t, s_m,t+1) is stored in its own experience buffer B_m. Meanwhile, the user agent's actor and critic networks are updated accordingly using a mini-batch of experience tuples {(s_i, a_i, r_i, s'_i)}_{i=1}^{I} randomly sampled from the replay buffer B_m. In this way, after training for K_max episodes, the dynamic computation offloading policy is gradually and independently learned at each user agent.
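The per-agent experience buffer B_m can be sketched with a fixed-capacity deque; the capacity and tuple contents below are illustrative placeholders:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer for one user agent."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest tuples are evicted first

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of the experience
        return random.sample(self.buf, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(2000):
    buf.store(s=t, a=0.5, r=-1.0, s_next=t + 1)
batch = buf.sample(32)
```

The bounded deque automatically discards the oldest transitions once the capacity is reached, so stale experience from early, poorly trained policies fades out of the training data.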
As for the testing stage, each user agent first loads the actor network parameters learned in the training stage. Then, the user agent starts with an empty data buffer and interacts with a randomly initialized environment, selecting actions according to the output of its actor network once its local observation of the environment is obtained as the current state.

V. NUMERICAL RESULTS
In this section, numerical simulations are presented to illustrate the proposed DRL framework for decentralized dynamic computation offloading in the MEC system. The system setup for the simulations is firstly introduced. Then, the performance of the DRL framework is demonstrated and compared with several baseline schemes in the single-user and multi-user scenarios, respectively.

Algorithm 1 DDPG based decentralized dynamic computation offloading (training stage)
1: for each user agent m ∈ M do
2:   Randomly initialize the actor network µ(s|θ^µ_m) and the critic network Q(s, a|θ^Q_m);
3:   Initialize the associated target networks with weights θ^µ'_m ← θ^µ_m and θ^Q'_m ← θ^Q_m;
4:   Initialize the experience replay buffer B_m;
5: end for
6: for each episode k = 1, 2, . . . , K_max do
7:   Reset simulation parameters for the multi-user MEC model environment;
8:   Randomly generate an initial state s_m,1 for each user agent m ∈ M;
9:   for each time slot t = 1, 2, . . . , T_max do
10:    for each user agent m ∈ M do
11:      Determine the powers for local execution and computation offloading by selecting an action a_m,t = µ(s_m,t|θ^µ_m) + ∆µ, using the current policy network θ^µ_m and exploration noise ∆µ;
12:      Execute action a_m,t independently at the user agent, then receive the reward r_m,t and observe the next state s_m,t+1 from the environment simulator;
13:      Collect and save the tuple (s_m,t, a_m,t, r_m,t, s_m,t+1) into the replay buffer B_m;
14:      Randomly sample a mini-batch of I tuples {(s_i, a_i, r_i, s'_i)}_{i=1}^{I} from B_m;
15:      Update the critic network Q(s, a|θ^Q_m) by minimizing the loss L with the samples;
16:      Update the actor network µ(s|θ^µ_m) by using the sampled policy gradient;
17:      Update the target networks with the soft update strategy;
18:    end for
19:  end for
20: end for

A. System Setup
The initial channel vector h_m(0) follows a complex Gaussian distribution with path loss determined by d_m, the distance of user m to the BS in meters. In subsequent slots, h_m(t) is updated according to (15), where the channel correlation coefficient is ρ_m = 0.95 and the error vector e(t) is complex Gaussian accordingly. Additionally, we set the system bandwidth to 1 MHz, the maximum transmission power P_o,m = 2 W, and the noise power σ_R² = 10⁻⁹ W. On the other hand, for local execution, we assume κ = 10⁻²⁷, the required CPU cycles per bit L_m = 500 cycles/bit, and the maximum allowable CPU-cycle frequency F_m = 1.26 GHz, from which the maximum power required for local execution is P_l,m = 2 W.
To implement the DDPG algorithm, for each user agent m, the actor and critic networks are four-layer fully connected neural networks with two hidden layers of 400 and 300 neurons, respectively. All hidden layers use the ReLU activation function, i.e., f(x) = max(0, x), while the final output layer of the actor uses a sigmoid to bound the actions. Note that for the critic, actions are not included until the second hidden layer of the Q-network. The adaptive moment estimation (Adam) method [52] is used for learning the neural network parameters, with learning rates of 0.0001 and 0.001 for the actor and critic, respectively. The soft update rate for the target networks is τ = 0.001. To initialize the network layer weights, the settings in the experiments of [39] are adopted.
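A forward pass of the actor described above can be sketched in numpy. The 400/300 ReLU hidden layers and sigmoid output follow the text, but the weights, state dimension, and power limits below are illustrative placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, a_dim = 4, 2                      # e.g. action = (p_l, p_o)
P_max = np.array([2.0, 2.0])                 # P_l,m and P_o,m in watts

# Random placeholder weights for the 400/300 hidden layers
W1, b1 = rng.normal(0, 0.05, (state_dim, 400)), np.zeros(400)
W2, b2 = rng.normal(0, 0.05, (400, 300)), np.zeros(300)
W3, b3 = rng.normal(0, 0.05, (300, a_dim)), np.zeros(a_dim)

def actor(s):
    h1 = np.maximum(s @ W1 + b1, 0.0)              # ReLU hidden layer 1
    h2 = np.maximum(h1 @ W2 + b2, 0.0)             # ReLU hidden layer 2
    out = 1.0 / (1.0 + np.exp(-(h2 @ W3 + b3)))    # sigmoid in (0, 1)
    return P_max * out                              # scale to [0, P_max]

a = actor(np.ones(state_dim))
```

The sigmoid output scaled by P_max is what guarantees the selected powers stay inside the feasible continuous ranges [0, P_l,m] and [0, P_o,m].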
Moreover, to encourage exploration, the Ornstein-Uhlenbeck process [53] with θ = 0.15 and σ = 0.12 is used to provide temporally correlated noise. The experience replay buffer size is set to |B_m| = 2.5 × 10^5. With a slight abuse of notation, we introduce a tradeoff factor w_m ∈ [0, 1] for each user agent m ∈ M, which represents the two nonnegative weight factors as w_{m,1} = 10 w_m and w_{m,2} = 1 − w_m. Thus, the reward function r_{m,t} in (24) can be rewritten as (29), from which we can trade off energy consumption against buffering delay by setting the single factor w_m. Moreover, the number of episodes in the training stage is K_max = 2000, and the maximum number of steps per episode is T_max = 200.
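The Ornstein-Uhlenbeck exploration noise can be generated with a standard Euler-Maruyama discretization, using the paper's θ = 0.15 and σ = 0.12; the zero mean, unit time step, and class interface here are assumptions of this sketch.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, dim, theta=0.15, sigma=0.12, mu=0.0, dt=1.0, seed=0):
        self.dim, self.theta, self.sigma, self.mu, self.dt = dim, theta, sigma, mu, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu)   # start at the long-run mean

    def sample(self):
        # Euler-Maruyama step: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * dW.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.dim)
        self.x = self.x + dx
        return self.x
```

The mean-reverting drift keeps successive noise samples correlated, which helps the agent explore the continuous power space with smooth rather than jittery perturbations.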
For comparison, the baseline strategies are introduced as follows:

1) Greedy Local Execution First (GD-Local):
In each slot, the user agent first attempts to execute as many of the buffered task data bits locally as possible. The remaining buffered data bits are then offloaded to the MEC server.

2) Greedy Computation Offloading First (GD-Offload):
Similar to the GD-Local strategy, each user agent first makes its best effort to offload data bits to the MEC server; local execution is then used to process the remaining buffered data bits in each time slot.

3) DQN based Dynamic Offloading (DQN):
To evaluate the performance of the proposed DDPG based algorithm, the conventional discrete action space based DRL algorithm, i.e., DQN [31], is also implemented for the dynamic computation offloading problem. Specifically, for each user m, the power levels for local execution and computation offloading are defined as P_{l,m} = …
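The two greedy baselines above can be sketched as a single per-slot allocation rule. This is a minimal sketch: the per-slot bit capacities of the two paths (determined in the paper by P_l,m, F_m, L_m and the channel rate) are taken as given inputs here rather than computed from the system model.

```python
def greedy_allocate(buffered_bits, local_cap_bits, offload_cap_bits, local_first=True):
    """Split one slot's buffered bits between local execution and offloading.

    local_cap_bits / offload_cap_bits: the maximum bits each path can process
    in this slot (illustrative inputs standing in for the paper's rate and
    CPU models). Returns (bits_local, bits_offloaded).
    """
    first_cap, second_cap = (local_cap_bits, offload_cap_bits) if local_first \
        else (offload_cap_bits, local_cap_bits)
    first = min(buffered_bits, first_cap)             # fill the preferred path first
    second = min(buffered_bits - first, second_cap)   # remainder goes to the other path
    return (first, second) if local_first else (second, first)
```

With `local_first=True` this implements GD-Local, and with `local_first=False` it implements GD-Offload.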

B. Single User Scenario
In this part, numerical results of training and testing for the single user scenario are illustrated.
The user is assumed to be randomly located at a distance of d_1 = 100 meters from the BS.

1) Training: In Fig. 5(a) and Fig. 5(b), the training process of the single-user dynamic computation offloading is presented for w_1 = 0.5 and w_1 = 0.8, respectively.
Note that the results are averaged over 10 runs of numerical simulations. In each figure, we compare two cases, with the task arrival rate set to λ_1 = 2.0 Mbps and λ_1 = 3.0 Mbps, respectively. It can be observed that for the policies learned by both DDPG and DQN, the average reward per episode increases as the interaction between the user agent and the MEC system environment continues, which indicates that efficient computation offloading policies can be learned without any prior knowledge. Moreover, the performance of each learned policy becomes stable after about 1500 episodes. On the other hand, the policy learned by DDPG always outperforms that learned by DQN across the different scenarios, which demonstrates that for continuous control problems, DDPG based strategies can explore the action space more efficiently than DQN based strategies.
2) Testing: In the training stage, we obtained dynamic computation offloading policies by applying the DDPG based and DQN based learning algorithms for K_max = 2000 episodes, respectively. For task arrival rates ranging from λ_1 = 1.5 to 4.0 Mbps, the actor and critic networks are trained with the same network architecture and hyper-parameters. To compare the performance of the different policies, testing results are averaged over 100 runs of numerical simulations, where each run consists of 10000 steps. Testing results are shown in Fig. 6 and Fig. 7 for w_1 = 0.5 and w_1 = 0.8, respectively, each of which includes the average reward, power consumption, and buffering delay. It can be observed from Fig. 6 that the average reward increases as the task arrival rate grows, which indicates that the computation cost is higher for a larger computation demand. Specifically, the increased computation cost results from higher power consumption and longer buffering delay. Moreover, the DDPG based strategy outperforms both greedy strategies with the minimum average reward; it does, however, slightly compromise the buffering delay to achieve the lowest energy consumption.
It is worth noting that the average reward of the DQN based strategy is higher than that of the greedy strategies, which is due to the limited number of discrete power levels of the DQN based strategy.
In Fig. 7, testing results for a larger tradeoff factor w_1 = 0.8 are provided. From (29), we know that a larger w_1 places a heavier penalty on power consumption in the reward function, i.e., the computation cost. In this scenario, the average reward of the DDPG based strategy outperforms all other strategies, and the gap is much larger than in the case of w_1 = 0.5. Specifically, this is achieved by much lower power consumption and an increased buffering delay.
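With the single tradeoff factor w_1, the weighted-cost combination in (29) can be illustrated in a few lines. The sign convention (treating the reward as a negative weighted cost) and the use of aggregate power and delay values as arguments are assumptions of this sketch; the exact power and delay terms come from the paper's (24).

```python
def reward(avg_power, avg_delay, w):
    """Weighted-cost reward under the single-factor weights w_1 = 10*w, w_2 = 1 - w.

    Sign convention (reward = negative weighted cost) is an assumption of this
    sketch. A larger w penalizes power consumption more heavily.
    """
    return -(10.0 * w * avg_power + (1.0 - w) * avg_delay)
```

For the same power and delay, increasing w lowers the reward contribution of power-hungry actions, which is why the learned policies consume less power (at the cost of delay) as w grows.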

3) Power-Delay Tradeoff:
We also investigate testing results for the power-delay tradeoff by setting different values of w_1 in Fig. 8. It can be inferred from the curves that there is a tradeoff between the average power consumption and the average buffering delay. Specifically, with a larger w_1, the power consumption is decreased at the expense of delay performance, which indicates that in practice w_1 can be tuned to achieve minimum power consumption under a given delay constraint. It is also worth noting that for each value of w_1, the policy learned from DDPG always performs better in terms of both power consumption and buffering delay, which demonstrates the superiority of the DDPG based strategy for continuous power control.

C. Multi-User Scenario
In this part, numerical results for the multi-user scenario are presented. There are M = 3 mobile users in the MEC system, each randomly located at a distance of d_m = 100 meters from the BS, with task arrival rate λ_m = m × 1.0 Mbps for m ∈ {1, 2, 3}.

1) Training: By setting w_m = 0.5 for all users, the training process is shown in Fig. 9. It can be observed that for each mobile user, the average reward increases gradually as the mobile user interacts with the MEC system over more episodes. Thus, for both the DDPG based and DQN based strategies, efficient decentralized dynamic computation offloading policies can be learned at each mobile user, even for heterogeneous users with different computation demands. Moreover, it can be inferred that a higher computation cost must be paid by a user with a higher computation demand. Meanwhile, compared with the single-user scenario, the average reward obtained in the multi-user scenario is much lower for the same task arrival rate. This is due to the fact that the spectral efficiency of data transmission degrades when more mobile users are served by the BS; hence, more power is consumed in computation offloading in the multi-user scenario.
2) Testing: By loading the neural network parameters learned by the DDPG based and DQN based algorithms after K_max = 2000 episodes, testing results of the different dynamic computation offloading policies are compared in Table II and Table III. From Table II, we know that the average rewards of user 2 and user 3 under the DDPG based strategy are better than those of all other strategies for w_m = 0.5. However, for user 1, the DDPG based strategy is slightly worse than the GD-Local strategy, which indicates that DDPG's exploration of small allocated powers needs to be further improved. It can also be observed that both the DDPG based and DQN based strategies achieve much lower power consumption with only slightly compromised buffering delay.
By setting the tradeoff factor to w_m = 0.8, as shown in Table III, the DDPG based strategy obtains the best average reward at each user agent. Moreover, the performance gaps between the DDPG based strategy and the greedy strategies become larger. Besides, with a heavier penalty on power consumption, the power consumed by each user is much lower than in the w_m = 0.5 case, which, however, results in a moderately increased buffering delay. Thus, for the multi-user scenario, power consumption can be minimized while maintaining a satisfactory average buffering delay by selecting a proper value of w_m. Notice again that the DDPG based strategies outperform the DQN based strategies in terms of average reward for all users.

VI. CONCLUSION
In this paper, we considered a multi-user MEC system in which tasks arrive stochastically and wireless channels are time-varying at each user. To minimize the long-term average computation cost in terms of power consumption and buffering delay, the design of DRL based decentralized dynamic computation offloading algorithms has been investigated. Specifically, by adopting the continuous action space based DRL approach named DDPG, an efficient computation offloading policy has been successfully learned at each mobile user, which can adaptively allocate the powers of local execution and task offloading based on its local observation of the MEC system. Numerical simulations have verified the superiority of the proposed DDPG based decentralized strategy over the conventional DQN based discrete power control strategy and several greedy strategies in terms of reduced computation cost. Besides, the power-delay tradeoff for both the DDPG based and DQN based strategies has also been studied.