A new task offloading algorithm in edge computing


In the last few years, the Internet of Things (IoT), as a new disruptive technology, has gradually changed the world. With the prosperous development of the mobile Internet and the rapid growth of the IoT, new applications continue to emerge, such as mobile payment, face recognition, wearable devices, driverless vehicles, and VR/AR. Although the computing power of mobile terminals keeps increasing and the traditional cloud computing model offers even higher computing power, cloud processing is often accompanied by higher latency and cannot meet the needs of users. To reduce user delay, improve user experience, and at the same time reduce network load to a certain extent, edge computing came into being as an application of the IoT. Given the new architecture introduced by edge computing, this paper focuses on task offloading in edge computing, covering task migration in multi-user scenarios and the management of edge server resources, and proposes DTOMALB, a distributed task offloading algorithm based on multi-agent deep reinforcement learning and load balancing, which can make reasonable offloading decisions in this scenario to improve user experience and balance resource utilization. Simulations show that, in the multi-user single-cell scenario, the algorithm is more adaptable than traditional algorithms, has lower complexity than the centralized algorithm, and reduces the average response delay of users overall. It also balances the load of the edge computing servers and improves the robustness and scalability of the system.

1 Introduction

Over the past decade, mobile communication technology has evolved from the third and fourth generations to today's fifth generation. The rapid development of communication technology brings higher transmission speed and shorter delay. At the same time, integrated circuits continue to shrink in size and grow in integration level, while the computing power and storage capacity of chips keep rising. All kinds of intelligent hardware, such as smartphones, tablet PCs, VR devices, and wearable smart devices, have emerged as a result, and human society is moving from the era of the mobile Internet to the era of the Internet of Everything. For the purpose of perceiving and driving the world, the IoT has gradually developed into a multi-disciplinary ecosystem that is widely used in scenarios requiring real-time data processing and feedback. Driven by the IoT, smart edge devices have been rapidly popularized, and new mobile applications such as navigation, mobile payment, face recognition, and VR/AR have followed one after another [1]. However, this is accompanied by exponential growth of data traffic and a large number of computationally intensive tasks, which poses huge challenges to network bandwidth and servers. Traditional cloud computing mostly adopts a centralized computing model. Linearly expanding cloud computing services cannot efficiently handle the massive data and computing tasks generated by exponentially growing edge devices [2], and face problems with real-time performance, task accumulation, and bandwidth occupation. Therefore, to meet the needs of scenarios that require real-time operation, low latency, and high quality of service (QoS), edge computing emerged as an application paradigm of the IoT.

As a new distributed computing paradigm, edge computing places computing and data storage closer to edge devices, thereby shortening the response time of computing tasks, greatly reducing the pressure on network bandwidth and cloud centers, reducing the energy consumption of edge devices, and improving service quality for users. Owing to its superior performance in delay-sensitive applications, edge computing has become a crucial enabling technology in 5G.

Task offloading refers to the user equipment handing over some computationally intensive applications: after weighing delay or other indicators, it uploads the data needed to process these applications to the edge server through wireless transmission. Resource allocation refers to the edge server allocating computing resources to the uploaded applications, so as to improve these indicators and provide a better user experience.

Usually, the first and key step of computation offloading and resource allocation is deciding whether to offload, that is, the offloading decision. After deciding to offload, the next question is how much and what should be offloaded. In general, the offloading decision can result in one of the following cases, as shown in Fig. 1:

  1. Local Execution: The entire calculation is completed locally. This case generally applies to tasks with low computing power requirements.

  2. Full Offloading: The entire calculation is sent to the base station via wireless channels and then offloaded to the edge server for processing. This is also called the complete offloading problem or the binary offloading problem. It assumes that the application cannot be split, so the only choices are to compute locally or to offload the whole task to the edge server.

  3. Partial Offloading: Provided that the calculation can be split, part of it is processed locally and the other part is offloaded to the edge server for processing.

Fig. 1 Task offloading method comparison

This paper mainly studies the task offloading mechanism in the multi-user single-cell scenario. In this scenario, multiple users are connected to an edge server through a single LTE macro base station, and the edge server can schedule tasks to other edge computing servers connected to it. To address the competition and selfishness that may arise when multiple users offload their computing tasks, a global load balancing penalty factor is introduced to minimize the response time of user tasks overall and keep the load on each edge server relatively balanced. In addition, because centralized task scheduling suffers from dimensional explosion, poor scalability, and poor adaptability to dynamics as the number of users increases, an algorithm model with centralized training and distributed execution is proposed. By modeling the users as a Markov game, a distributed task offloading algorithm based on multi-agent reinforcement learning and load balancing, DTOMALB, is proposed. This DRL-based algorithm mitigates the dimensional explosion, poor scalability, and poor dynamics faced by existing task scheduling.

The rest of this paper is structured as follows. The second section reviews related work. The third section presents the system model: it describes the single-cell multi-user scenario, the communication model, and the computation model, and formulates the task offloading problem as an optimization problem. The fourth section proposes the new algorithm, a multi-agent task offloading algorithm that optimizes user response delay, balances the load, and improves server resource utilization. Through centralized training and distributed execution, it avoids the problems caused by centralized scheduling and improves the robustness of the system. The fifth section covers the experiments, results, and discussion: it describes the experimental method used to verify the feasibility and effectiveness of DTOMALB, analyzes the simulation results, discusses the meaning of the results in the context of existing research, and points out the limitations of the study. The last section is the conclusion.

2 Related work

In an IoT network based on edge computing, task offloading is a major way to overcome the limited computing, storage, and power resources of edge devices. An edge device can offload part or all of its computing tasks to the edge computing server, thereby speeding up task processing, saving device energy, and reducing the response time. The main issues are whether to offload, when to offload, how many computing tasks to offload, and where to offload them. Offloading brings additional communication overhead, such as the transmission delay and energy consumption caused by communication. Much research is devoted to optimal offloading strategies for different scenarios and different optimization goals.

Wang et al. proposed a joint base station caching and D2D offloading algorithm based on Q-learning for the edge computing architecture. They applied Q-learning to distributed cache replacement strategies according to content popularity [3].

Flores et al. designed and implemented a task delegation and code offloading framework for a mobile cloud computing model. In task delegation, resource-intensive mobile tasks are delegated asynchronously by directly invoking services. In code offloading, the mobile application is partitioned and analyzed, complex computing operations are identified, and they are offloaded to the cloud to improve overall system performance [4].

You et al. studied multi-user edge computing scenarios based on time division multiple access (TDMA) and orthogonal frequency division multiple access (OFDMA). Under a constraint on computing delay, the optimal resource allocation problem is expressed as a convex optimization problem that minimizes the weighted sum of mobile energy consumption. For a cloud with limited capacity, a suboptimal resource allocation algorithm is proposed to reduce the complexity of computing the threshold [5].

Bi et al. studied the offloading problem in multi-user scenarios, decoupled the strongly coupled problems of multi-user computing mode selection and transmission time allocation, and proposed a simple bisection search algorithm to obtain the conditionally optimal time allocation. On this basis, a coordinate descent method was designed to optimize the mode selection [6].

Guo et al. proposed an energy-saving dynamic offloading and resource scheduling strategy to reduce energy consumption and shorten the application completion time. Under task dependency requirements and a completion time constraint, the problem is transformed into an energy efficiency cost minimization problem and decoupled into three subproblems: computation offloading selection, clock frequency control, and transmission power allocation [7].

In response to the current MEC offloading problem, Wang et al. proposed a new offloading framework based on deep reinforcement learning, which can automatically infer the optimal offloading strategy in different scenarios according to the characteristics of the offloaded task, so as to minimize the overall service delay [8].

Xu et al. incorporated renewable energy into mobile edge computing and proposed an effective resource management algorithm based on reinforcement learning to learn dynamic optimal strategies for dynamic load offloading and edge server configuration to minimize long-term system costs [9].

Dong et al. proposed an intelligent offloading system for vehicle edge computing based on deep reinforcement learning. The offloading system includes a task scheduling module and a resource allocation module. With the goal of maximizing the quality of experience (QoE), a joint optimization problem of two modules was established, and this problem was solved through DQN [10].

The scenario studied by Liu et al. is the vehicle edge computing (VEC) network. An efficient computation offloading scheme is proposed for users and formulated as an optimization problem that maximizes the utility of the proposed VEC network while taking into account the randomness of vehicle traffic. Dynamic communication requirements and time-varying communication conditions are modeled as semi-Markov processes, and an algorithm based on Q-learning is proposed. Then, to avoid the curse of dimensionality, an algorithm based on deep reinforcement learning (DRL) is proposed to obtain the optimal computation offloading and resource allocation strategy [11].

3 System model

In a single-cell multi-user scenario, multiple users N = {1,2,3,…,N} communicate wirelessly through a single Long Term Evolution (LTE) macro base station, whose bandwidth can be divided into multiple sub-channels for different users. The base station is connected to multiple edge servers M = {1,2,3,…,M} that provide users with computing resources to help them process computing tasks, as shown in Fig. 2. Each user can be a mobile phone, computer, wearable smart device, etc., all with different computing capabilities. The computing power and load of each edge server also differ. In addition, computing tasks generated by multiple user devices may compete for limited computing resources. Since each user is selfish, it wants to offload its tasks to the best-performing server. However, when a large number of concurrent user tasks reach a high-performance edge server at the same time, the server cannot allocate the required computing resources to all of them, which degrades the user experience. Meanwhile, some lower-performance servers are relatively idle and could provide comparatively ample computing resources, but they are not utilized, resulting in a waste of resources. Therefore, when the task distribution is relatively balanced, both the resource utilization of each server and the overall quality of service can be improved. Under limited resources, with heterogeneous user equipment and different user tasks, how to reasonably allocate the computing resources of the edge servers to each task, so as to ensure the user experience while balancing the load of each edge server, is an urgent problem.

Fig. 2 Single-cell multi-user scenario in edge computing

We assume that there are N user devices (UDs), indexed by n ∈ {1,2,3,…,N}, and an evolved base station (eNodeB, eNB) in the cell. The eNB is directly connected through optical fiber to an edge computing server (ECS) numbered m = 1. This edge server is in turn connected through optical fibers to the other edge computing servers, numbered m ∈ {2,3,4,…,M}, which can also provide users with computing resources. It is assumed that at each moment each user device generates a computationally intensive task and can choose to compute it locally, offload it through the base station to the directly connected edge server, or offload it to one of the remaining edge servers connected to that server. Define the offloading decision vector of the nth user as:

$${\mathcal{A}}_{n} = \left[ {a_{n,0} ,a_{n,1} ,a_{n,2} ,...,a_{n,M} } \right]$$

where \(a_{n,m} \in \left\{ {0,1} \right\}\): \(a_{n,m} = 0\) means the task is not offloaded to the m-th edge computing server, and \(a_{n,m} = 1\) means it is. \(a_{n,1} = 1\) means the task is offloaded to the edge computing server directly connected to the eNB. In particular, \(a_{n,0} = 1\) indicates that the task is computed locally. Assuming that the task cannot be split, the offloading vector of the task generated by user n satisfies the following constraint:

$$\mathop \sum \limits_{m = 0}^{M} a_{n,m} \left( t \right) = 1,\quad \forall n \in N$$

The decision space of all users is:

$${\mathcal{A}} = \left[ {{\mathcal{A}}_{1} ,{\mathcal{A}}_{2} ,{\mathcal{A}}_{3} , \ldots ,{\mathcal{A}}_{N} } \right] = \left[ {\begin{array}{*{20}c} {a_{1,0} ,a_{1,1} ,a_{1,2} , \ldots ,a_{1,M} } \\ {a_{2,0} ,a_{2,1} ,a_{2,2} , \ldots ,a_{2,M} } \\ \vdots \\ {a_{N,0} ,a_{N,1} ,a_{N,2} , \ldots ,a_{N,M} } \\ \end{array} } \right]$$
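As a small illustration of the decision vectors and the constraint that each task has exactly one destination, the following Python sketch builds one-hot offloading vectors and checks their validity. The function names and the example sizes (N = 3 users, M = 2 servers) are assumptions made for illustration and are not part of the original model.

```python
from typing import List


def decision_vector(choice: int, num_servers: int) -> List[int]:
    """Build [a_{n,0}, ..., a_{n,M}]; index 0 means local execution."""
    if not 0 <= choice <= num_servers:
        raise ValueError("choice must be in 0..M")
    return [1 if m == choice else 0 for m in range(num_servers + 1)]


def is_valid(vector: List[int]) -> bool:
    """Entries must be binary and sum to 1 (the task cannot be split)."""
    return all(a in (0, 1) for a in vector) and sum(vector) == 1


# Hypothetical example: user 1 computes locally, user 2 uses ECS1, user 3 uses ECS2.
choices = [0, 1, 2]
decisions = [decision_vector(c, num_servers=2) for c in choices]
```

Each row of `decisions` corresponds to one row of the decision matrix, so the whole list plays the role of the joint decision of all users.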

We use the length of the task and the number of calculation cycles required by the task to describe it.

$$S_{n} \triangleq \left( {L_{n} ,C_{n} } \right)$$

\(L_{n}\) is the length of the calculation task, including the calculation task request, data, code, etc. \(C_{n}\) is the number of CPU cycles required to complete the calculation task.

3.1 Communication model

In the current scenario, multiple users share one eNB in a single cell, so inter-cell interference can be ignored. Assuming that N users choose to offload their computing tasks at the same time, the bandwidth of the wireless channel is evenly divided among the UDs. The upload rate available to the nth UD is then:

$$r_{n} = \frac{W}{N}\log_{2} \left( {1 + \frac{{p_{n} g_{n} }}{{\frac{W}{N}N_{0} }}} \right)$$

where W is the bandwidth of the entire wireless channel, \(p_{n}\) is the upload power of user equipment n, \(g_{n}\) is the gain of the wireless channel allocated to user equipment n, and \(N_{0}\) is the variance of the complex white Gaussian channel noise.
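The rate formula can be evaluated directly. The sketch below is a minimal Python version; the parameter names and the numerical values are hypothetical and chosen only so that the arithmetic is easy to follow. Because the formula multiplies \(N_{0}\) by the per-user bandwidth, the code treats it as a noise level per hertz.

```python
import math


def upload_rate(bandwidth_hz: float, num_users: int, tx_power_w: float,
                channel_gain: float, noise_n0: float) -> float:
    """r_n = (W/N) * log2(1 + p_n * g_n / ((W/N) * N_0))."""
    share = bandwidth_hz / num_users          # bandwidth per offloading user, W/N
    snr = tx_power_w * channel_gain / (share * noise_n0)
    return share * math.log2(1.0 + snr)


# Hypothetical values: 10 MHz shared by 10 users, giving an SNR of about 3.
rate = upload_rate(bandwidth_hz=10e6, num_users=10, tx_power_w=0.3,
                   channel_gain=1e-2, noise_n0=1e-9)
```

With these assumed numbers each user receives 1 MHz, and the rate is about 2 Mbit/s. Increasing `num_users` lowers the per-user share and hence the rate.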

3.2 Computation model

3.2.1 Local computation model

For user n, when the offloading variable \(a_{n,0} = 1\), the task is executed locally. The task descriptor \(S_{n}\) gives the total number of cycles \(C_{n}\) required to execute the task. Writing \(f_{local}\) for the computing capability (CPU cycles per second) of the user device, the delay of executing task \(S_{n}\) locally can be expressed as:

$$t_{n}^{local} = \frac{{C_{n} }}{{f_{local} }}$$

3.2.2 Offloading computation model

For a task generated by user n, \(a_{n,m} \left( t \right) = 1\) means that the task is offloaded to the m-th edge server. If m = 1, the edge computing server attached to the base station performs the calculation; otherwise the task is forwarded once more to reach the m-th server. A user task is offloaded through the following three steps:

  1. User n uploads the task to the edge server ECS1 through the uplink allocated by the base station, which takes time \(t_{n}^{up}\):

    $$t_{n}^{up} = \frac{{L_{n} }}{{r_{n} }}$$

    where \(r_{n}\) is the upload rate.

  2. Check whether m is 1. If so, the task is computed in ECS1, and the result is returned after the computation time \(t_{n,1}^{comp}\).

  3. If not, ECS1 forwards the task to the edge computing server ECSm, which takes time \(t_{n,m}^{tran}\); after the computation time \(t_{n,m}^{comp}\), the result is obtained and returned.

For each edge server, assuming that it serves arriving computing tasks with a first-come-first-served strategy, the response delay in the server is the queueing delay plus the computation delay:

$$t_{n,m}^{resp} = t_{n,m}^{wait} + t_{n,m}^{comp} = \frac{{\mathop \sum \nolimits_{i \in k} C_{i} }}{{f_{m} }} + \frac{{C_{n} }}{{f_{m} }}$$

If \(m \ne 1\), forwarding the task takes \(t_{n,m}^{tran}\):

$$t_{n,m}^{tran} = \frac{{L_{n} }}{\mu }$$

where μ is the forwarding rate between edge servers.

The delay in the entire task offloading process can be expressed as:

$$t_{n}^{off} = \left\{ {\begin{array}{*{20}l} {t_{n}^{up} + t_{n,m}^{resp} ,} & {m = 1} \\ {t_{n}^{up} + t_{n,m}^{resp} + t_{n,m}^{tran} ,} & {m \ne 1} \\ \end{array} } \right.$$

The delay that users spend in computing their tasks is:

$$t_{n} = \left\{ {\begin{array}{*{20}l} {\frac{{C_{n} }}{{f_{local} }},} & {m = 0} \\ {t_{n}^{up} + t_{n,m}^{resp} ,} & {m = 1} \\ {t_{n}^{up} + t_{n,m}^{resp} + t_{n,m}^{tran} ,} & {m \ne 1} \\ \end{array} } \right.$$
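The piecewise delay can be written as a single function. The following Python sketch follows the three cases (local, ECS1, forwarded server) and the first-come-first-served response delay defined earlier; the argument names and the example numbers in the usage note are assumptions for illustration only.

```python
from typing import Sequence


def task_delay(m: int, length_bits: float, cycles: float, f_local: float,
               rate: float, queued_cycles: Sequence[float],
               f_server: float, mu: float) -> float:
    """Delay t_n of one task for destination m (0 = local, 1 = ECS1, >= 2 = forwarded)."""
    if m == 0:
        return cycles / f_local                        # local execution
    t_up = length_bits / rate                          # upload to ECS1
    t_wait = sum(queued_cycles) / f_server             # tasks already queued (FCFS)
    t_comp = cycles / f_server                         # own computation time
    t_tran = 0.0 if m == 1 else length_bits / mu       # forwarding ECS1 -> ECSm
    return t_up + t_wait + t_comp + t_tran
```

For example, with an assumed task of 2e6 bits and 1e9 cycles, a local CPU of 1e9 cycles/s, an upload rate of 2e6 bit/s, an empty 4e9 cycles/s server, and a forwarding rate of 1e7 bit/s, the local delay is 1.0 s, offloading to ECS1 takes 1.25 s, and offloading to a forwarded server takes about 1.45 s, so local execution would be preferred for this particular task.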

In the overall offloading process, each user wants to minimize its own delay. For the entire system, user tasks should be allocated reasonably so that the average user delay is minimized. In summary, the optimization problem can be modeled as:

$$\begin{aligned} P1: \quad & \min \frac{1}{N}\mathop \sum \limits_{n \in N} t_{n} \\ {\text{s.t.}} \quad & \mathop \sum \limits_{m = 0}^{M} a_{n,m} \left( t \right) = 1, \\ & \forall n \in N,\;\forall m \in M \\ \end{aligned}$$

The above problem is, in essence, a combinatorial optimization problem and is NP-hard. Traditional methods, mostly brute-force search or heuristic algorithms, have difficulty solving it. To improve the adaptability of the model, this paper uses reinforcement learning to solve the problem.
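To make the combinatorial nature of P1 concrete, the sketch below enumerates every joint assignment for a tiny instance and keeps the one with the lowest average delay. It is only an illustration under simplifying assumptions that are not in the original model: upload and forwarding delays are given as fixed numbers per user, tasks that pick the same server are served first-come-first-served in user order, and all values are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def average_delay(assignment: Sequence[int], cycles, f_local, f_servers,
                  t_up, t_tran) -> float:
    """Average delay when user n picks assignment[n] (0 = local, m >= 1 = server m)."""
    backlog = [0.0] * (len(f_servers) + 1)   # queued cycles per server; index 0 unused
    total = 0.0
    for n, m in enumerate(assignment):
        if m == 0:
            total += cycles[n] / f_local[n]
            continue
        resp = (backlog[m] + cycles[n]) / f_servers[m - 1]   # wait + compute
        backlog[m] += cycles[n]
        total += t_up[n] + resp + (t_tran[n] if m != 1 else 0.0)
    return total / len(assignment)


def brute_force(cycles, f_local, f_servers, t_up, t_tran) -> Tuple[int, ...]:
    """Try all (M + 1)^N assignments; feasible only for very small N and M."""
    options = range(len(f_servers) + 1)
    return min(product(options, repeat=len(cycles)),
               key=lambda a: average_delay(a, cycles, f_local, f_servers, t_up, t_tran))


# Hypothetical instance: 2 users, 2 servers, normalized units.
cycles, f_local, f_servers = [4.0, 4.0], [1.0, 1.0], [4.0, 4.0]
t_up, t_tran = [0.5, 0.5], [0.25, 0.25]
best = brute_force(cycles, f_local, f_servers, t_up, t_tran)
best_delay = average_delay(best, cycles, f_local, f_servers, t_up, t_tran)
```

In this toy instance the best assignment spreads the two users over the two servers rather than queueing both on ECS1. The number of candidates grows exponentially with the number of users, which is why exhaustive search does not scale.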

4 Deep reinforcement learning-based solution

Reinforcement learning is a field of machine learning. Its biggest difference from supervised learning is that it does not require manual labeling. Its core lies in balancing exploration of the unknown environment with exploitation of known knowledge. Through continuous trial and error, a reinforcement learning agent interacts with the environment, receives rewards or penalties from it, and uses this feedback to update its own model. In this way it learns a behavior policy; after a certain amount of training, it can make decisions that maximize long-term return given the state of the environment. With the introduction of deep learning, reinforcement learning can handle high-dimensional actions and states, learning efficiency is greatly improved, and some limitations of reinforcement learning are overcome. In this paper, to let the model adaptively learn offloading decisions, deep reinforcement learning is used to solve the task offloading problem in edge computing.

To apply deep reinforcement learning algorithms, a Markov decision process model needs to be established. The three elements of the model are the system state, the action, and the reward function.

4.1 System state

In the current state, a total of N users, indexed by \(n \in \left\{ {1,2, \ldots N} \right\}\), offload tasks, and the available edge computing servers are indexed by \(m \in \left\{ {1,2, \ldots M} \right\}\). In each state \(t \in \left\{ {0,1,2, \ldots T} \right\}\), each user has a computation-intensive task \(u_{n} \left( t \right)\) that needs to be computed.

The edge server ECSm may provide computing resources for multiple user tasks at the same time, and the user tasks are queued in order in its computing queue. Assume that k user tasks are currently queued in the task queue of ECSm and that the computing power of ECSm is \(f_{m}\). Define the current load factor of ECSm as the ratio of the computing cycles required by all tasks currently in the server to the computing capacity of the server:

$$LD_{m} \left( t \right) = \frac{{\mathop \sum \nolimits_{i \in k} C_{i} }}{{f_{m} }}$$

Define the current system state as the vector of load factors of the edge servers, which reflects how much work is queued and being computed in the system:

$$\chi \left( t \right) = \left[ {LD_{1} \left( t \right),LD_{2} \left( t \right),LD_{3} \left( t \right),...,LD_{M} \left( t \right)} \right]$$
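The state vector is straightforward to compute from the server queues. The following Python sketch is a minimal illustration; the queue contents and server frequencies are hypothetical values, and the function names are not taken from the original work.

```python
from typing import List, Sequence


def load_factor(queued_cycles: Sequence[float], server_freq: float) -> float:
    """LD_m = (sum of cycles of all tasks queued in ECSm) / f_m."""
    return sum(queued_cycles) / server_freq


def system_state(queues: Sequence[Sequence[float]],
                 freqs: Sequence[float]) -> List[float]:
    """State chi(t): one load factor per edge server."""
    return [load_factor(q, f) for q, f in zip(queues, freqs)]


# Hypothetical snapshot of three servers: two tasks, one task, and an idle server.
queues = [[2e9, 2e9], [1e9], []]
freqs = [4e9, 2e9, 2e9]
state = system_state(queues, freqs)
```

Here `state` has one entry per server, and an idle server contributes a load factor of zero.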

Since each user can offload to any of the M servers, the size of the entire state space is \(M^{N}\). As the number of users increases, the state space of the entire system therefore grows exponentially.

4.2 System action

Define the current system action as an offload vector for all user tasks:

$${\mathcal{A}}_{n} \left( t \right) = \left[ {a_{n,0} ,a_{n,1} ,a_{n,2} ,...,a_{n,M} } \right]$$

subject to the following constraint:

$$\mathop \sum \limits_{m = 0}^{M} a_{n,m} \left( t \right) = 1,\quad \forall n \in N$$

The action space of the entire system is:

$${\mathcal{A}}\left( t \right) = \left[ {{\mathcal{A}}_{1} \left( t \right),{\mathcal{A}}_{2} \left( t \right),{\mathcal{A}}_{3} \left( t \right), \ldots ,{\mathcal{A}}_{N} \left( t \right)} \right] = \left[ {\begin{array}{*{20}c} {a_{1,0} ,a_{1,1} ,a_{1,2} , \ldots ,a_{1,M} } \\ {a_{2,0} ,a_{2,1} ,a_{2,2} , \ldots ,a_{2,M} } \\ \vdots \\ {a_{N,0} ,a_{N,1} ,a_{N,2} , \ldots ,a_{N,M} } \\ \end{array} } \right]$$

As the above formula shows, the action space of the entire system is also quite large: the joint action has dimension \(M \times N\), and the size of the action space is \(M^{N}\). As the number of users increases, the dimension of the system action keeps growing, and the action space grows exponentially.

4.3 System reward function

Since each user is selfish, multiple users can easily select the same edge server for offloading at the same time, which degrades the experience of every user. To make user decisions more cooperative and the model more adaptable, we propose a task offloading algorithm based on load balancing and reinforcement learning (TOLB-RL). That is, when setting the reward function in reinforcement learning, a system load balancing coefficient is introduced, so that tasks are distributed more evenly, model training is accelerated, and the average delay of users across the system is reduced.

Administrators of edge computing servers want to improve server resource utilization and keep the load of each server relatively balanced, so that tasks are not all concentrated on a few servers while the resources of others are wasted. Define the load balancing factor of the system as the variance of the server loads:

$$var\left( {LD} \right) = \frac{{\mathop \sum \nolimits_{m \in M} \left( {LD_{m} - \left| E \right|} \right)^{2} }}{M}$$

where \(\left| E \right|\) is the average of all edge server load factors.

Whenever the system takes an action in the current state, it receives immediate feedback on that action. Reinforcement learning generally seeks to maximize rewards, while according to problem P1 every user wants to minimize its task offloading delay. Therefore, the negative of the delay can be used as the reward after a decision. In addition, the load balancing coefficient of the system is used as a penalty term in the reward function: if an action increases the overall load balancing coefficient, the penalty increases; otherwise it decreases. We define the instant reward function as:

$$R_{n} \left( t \right) = - \omega_{1} t_{n} - \omega_{2} var\left( {LD} \right)$$

This reward function jointly considers the user delay and the system load balancing coefficient. Its value depends on the current system state and the actions taken by the users. Here \(\omega_{1}\) and \(\omega_{2}\) are the weights of the task response time and of the edge server load balancing, respectively. Since the optimization goal is to minimize both the delay and the load balancing factor, whereas reinforcement learning maximizes the reward, the entire reward function is negative. Obtaining the maximum reward of the entire system is therefore equivalent to achieving the minimum delay and load balancing factor. Define the system utility as the overall system reward.
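The reward can be sketched in a few lines of Python. The code below computes the load variance as defined above and combines it with the delay; the weights and the two example load vectors are hypothetical and serve only to show how an unbalanced load is penalized.

```python
from typing import Sequence


def load_variance(loads: Sequence[float]) -> float:
    """var(LD): mean squared deviation of the server load factors from their average."""
    mean = sum(loads) / len(loads)
    return sum((ld - mean) ** 2 for ld in loads) / len(loads)


def instant_reward(delay: float, loads: Sequence[float],
                   w1: float, w2: float) -> float:
    """R_n(t) = -w1 * t_n - w2 * var(LD)."""
    return -w1 * delay - w2 * load_variance(loads)


# Same delay, different load distributions (assumed values, w1 = 1 and w2 = 2).
balanced = instant_reward(delay=1.5, loads=[0.5, 0.5], w1=1.0, w2=2.0)
skewed = instant_reward(delay=1.5, loads=[1.0, 0.0], w1=1.0, w2=2.0)
```

With equal delays, the balanced outcome receives the higher (less negative) reward, which is the incentive the penalty term is meant to create. The relative size of `w1` and `w2` controls how strongly balance is traded against delay.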

4.4 Multi-agent reinforcement learning

The centralized DQN [12] needs to observe the global state, and every user's offloading decision depends on the observations of a single agent, so it scales poorly and suffers from dimensional explosion. Training and execution are also relatively slow. In addition, with a centralized algorithm, once the central agent node that makes the decisions fails, the entire system is paralyzed, so its reliability is relatively low. Therefore, we introduce a multi-agent reinforcement learning algorithm, which treats each user as a separate agent that makes offloading decisions for its own tasks, rather than relying solely on a centralized single agent.

For multi-agent reinforcement learning, a Markov game model first needs to be established. A Markov game can be described by the tuple \((n,S,A_{1} , \ldots ,A_{n} ,R_{1} , \ldots ,R_{n} )\):

  • \(n\) is the number of agents; in this section it is the number of users N.

  • \(S\) is the system state, which generally refers to the joint state of the agents. In this section, users share the load status of the current edge computing servers, which can be expressed as:

    $$\chi \left( t \right) = \left[ {LD_{1} \left( t \right),LD_{2} \left( t \right),LD_{3} \left( t \right), \ldots ,LD_{M} \left( t \right)} \right]$$

  • \(R_{i}\) is the instant reward function of each agent, that is, the reward obtained when the joint action \((A_{1} , \ldots ,A_{n} )\) taken by the agents in the current state s leads to the next system state s'.

The reward functions completely describe the relationship between the agents. Note that each agent has its own reward function. When the reward functions of all agents are identical, that is \(R_{1} = R_{2} = \cdots = R_{n}\), the agents are in a fully cooperative relationship; when there are only two agents with opposite reward functions, that is \(R_{1} = - R_{2}\), the relationship is fully competitive; anything in between is a mixed relationship of competition and cooperation.

Figure 3 shows a multi-agent reinforcement learning system. The agents act at the same time; under their joint action the system transitions to a new state, and each agent receives an immediate reward.

Fig. 3 Multi-agent reinforcement learning system

Directly adopting the centralized reinforcement learning algorithm DQN makes the action dimension too large and convergence difficult. To solve these problems, the DTOMALB algorithm is proposed, which is the multi-agent version of the DDPG [13] algorithm combined with the load balancing factor.

DDPG is a deterministic policy gradient algorithm based on the actor-critic model. It also inherits the advantages of DQN, using a target network and experience replay.

The value-function-based reinforcement learning method introduces an approximate action value function \(\hat{q}\). This function is often approximated by a neural network with parameter \(w\) that accepts state s and action a as inputs. The network outputs the approximate action value, and the action with the maximum value is then selected:

$$\hat{q}\left( {s,a,w} \right) \approx q_{\pi } \left( {s,a} \right){ }$$

In the policy-based reinforcement learning method, a policy function is learned, and the action probability is obtained directly from the input state. The policy π can then be described as a function with parameter θ, namely

$$\pi_{\theta } = P(a|s,\theta ) \approx \pi (a|s)$$

The optimization goal is:

$$J\left( \theta \right) = V_{{\pi_{\theta } }} \left( {s_{1} } \right) = E_{{\pi_{\theta } }} \left( {G_{1} } \right) = E_{{\pi_{\theta } }} \left[ {r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots } \right]$$
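The quantity inside the expectation is the discounted return of one trajectory. A minimal Python sketch of that sum is given below; the reward sequence and discount factor in the example are hypothetical.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G_1 = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one finite trajectory."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

The objective \(J(\theta)\) is the average of this value over trajectories generated by the policy, so in practice it is estimated from sampled episodes.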

Finally, the gradient to derive \(\theta\) can be expressed as:

$$\nabla_{\theta } J\left( \theta \right) = E_{\pi \theta } \left( {\nabla_{\theta } log\pi_{\theta } \left( {s,a} \right)Q_{\pi } \left( {s,a} \right)} \right)$$

With Monte Carlo updates, the following formula updates the network parameters at every time step \(t\) of every sampled episode, where \(v_{t}\) is the return from step \(t\):

$$\theta \leftarrow \theta + \alpha \nabla_{\theta } \log \pi_{\theta } \left( s_{t} ,a_{t} \right)v_{t}$$
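The Monte Carlo update can be sketched for a tabular softmax strategy, where \(\theta\) holds one preference per state-action pair and the gradient of \(\log \pi_{\theta}(s,a)\) with respect to the preference of action \(b\) is \(1[b = a] - \pi_{\theta}(s,b)\). The tabular form, the step size and the function names are assumptions chosen to keep the example short; the paper itself uses neural networks.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """Return v_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_update(theta, episode, alpha=0.1, gamma=0.9):
    """Apply theta <- theta + alpha * grad(log pi) * v_t along one episode.

    episode is a list of (state, action, reward); theta[state] is a list
    of action preferences (hypothetical tabular parameterization).
    """
    rewards = [r for _, _, r in episode]
    for (s, a, _), v_t in zip(episode, discounted_returns(rewards, gamma)):
        probs = softmax(theta[s])
        for b in range(len(theta[s])):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * grad_log * v_t
    return theta
```

An action followed by a positive return has its preference raised and the preferences of the other actions lowered, which is the behaviour the update formula describes.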

When temporal-difference learning is used to learn an approximation \(Q_{\pi } \left( {s,a} \right)\) of the true action value function, the method becomes the Actor-Critic algorithm. The Actor network selects actions, and the Critic network learns the action value function and uses it to guide the Actor's action selection.

The deterministic strategy adopted in the DDPG algorithm is to directly output actions according to the current state:

$$a = \pi (s|\theta^{\mu } )$$

With a stochastic strategy, the strategy outputs a probability distribution over actions. For example, the A2C [14] algorithm samples the action from a normal distribution, so every action has some probability of being selected. A stochastic strategy integrates exploration and improvement into a single strategy, but it requires a large amount of training data.

With a deterministic strategy, the output of the strategy is the action itself. This requires fewer samples and makes the algorithm efficient, but the strategy cannot explore the environment on its own. For this reason the DDPG algorithm learns off-policy, which means that the strategy used for sampling and the strategy being improved are not the same. As in DQN, samples are generated by a stochastic strategy and stored in an experience replay buffer, from which samples are drawn at random during training. The strategy being improved is the current deterministic strategy. The entire deterministic strategy learning framework adopts the actor-critic (AC) method.
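The storage and random sampling of experience can be sketched as follows. The class name `ReplayBuffer`, its capacity and the transition layout are assumptions for illustration; the paper does not specify its buffer implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        # Transitions are produced by the exploratory (sampling) strategy.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions before they are used for training.
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)
```

Because the samples come from the buffer and not from the current deterministic strategy, the learning is off-policy in the sense described above.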

In DDPG, two deep neural networks with parameters \(\theta^{\mu }\) and \(\theta^{Q}\) represent the deterministic strategy \(a = \pi (s|\theta^{\mu } )\) and the action value function \(Q(s,a|\theta^{Q} )\), respectively. The strategy network updates the strategy and corresponds to the actor in the AC framework; the value network approximates the value function of state-action pairs and provides gradient information, corresponding to the critic in the AC framework. The objective function is defined as the total discounted return:

$$J\left( {\theta^{\mu } } \right) = E_{{\theta^{\mu } }} \left[ {r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots } \right]$$

The objective function is optimized end-to-end by a stochastic gradient method. Silver et al. [15] proved that the gradient of the objective function with respect to \(\theta^{\mu }\) is equivalent to the expected gradient of the Q-value function with respect to \(\theta^{\mu }\):

$$\frac{{\partial J\left( {\theta^{\mu } } \right)}}{{\partial \theta^{\mu } }} = E_{s} \left[ {\frac{{\partial Q(s,a|\theta^{Q} )}}{{\partial \theta^{\mu } }}} \right]$$

Because the action is given by the deterministic strategy \(a = \pi (s|\theta^{\mu } )\), applying the chain rule yields:

$$\frac{{\partial J\left( {\theta^{\mu } } \right)}}{{\partial \theta^{\mu } }} = E_{s} \left[ {\frac{{\partial Q(s,a|\theta^{Q} )}}{{\partial a}}\frac{{\partial \pi (s|\theta^{\mu } )}}{{\partial \theta^{\mu } }}} \right]$$

The Actor network will update the parameters of the strategy network in the direction of increasing the Q value.
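The actor update can be illustrated with a scalar toy problem: the sensitivity of the critic to the action is multiplied by the sensitivity of the strategy to its parameter, and the parameter moves in the direction that increases Q. The linear actor \(a = \theta s\), the hand-written critic \(Q(s,a) = -(a - 2s)^{2}\) and the step size are all hypothetical choices, not taken from the paper.

```python
def actor(state, theta):
    """Deterministic strategy a = theta * s (toy linear actor)."""
    return theta * state


def dq_da(state, action):
    """Derivative of the toy critic Q(s, a) = -(a - 2 s)^2 with respect to a."""
    return -2.0 * (action - 2.0 * state)


def actor_gradient(states, theta):
    """Average over states of dQ/da * da/dtheta, where da/dtheta = s."""
    grads = [dq_da(s, actor(s, theta)) * s for s in states]
    return sum(grads) / len(grads)


theta = 0.0
states = [0.5, 1.0, 1.5]
for _ in range(200):
    # Gradient ascent: move theta in the direction of increasing Q.
    theta += 0.1 * actor_gradient(states, theta)
```

For this critic the best action is \(a = 2s\), so the parameter approaches 2 as the updates proceed.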

The Critic network is updated in the same way as the value network in DQN. The gradient information is:

$$\frac{{\partial L\left( {\theta^{Q} } \right)}}{{\partial \theta^{Q} }} = E_{{s,a,r,s^{\prime } \sim D}} [(TargetQ - Q(s,a|\theta^{Q} ))\frac{{\partial Q(s,a|\theta^{Q} )}}{{\partial \theta^{Q} }}]$$
$$TargetQ = r + \gamma Q^{\prime } (s^{\prime } ,\pi (s^{\prime } |\theta^{{\mu^{\prime } }} )|\theta^{{Q^{\prime } }} )$$
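A minimal sketch of the critic update follows, assuming a linear critic \(Q(s,a|w) = w_{0} s + w_{1} a\) in place of a neural network. The target actor and target critic are passed in as plain functions; all names and the learning rate are illustrative assumptions.

```python
def td_target(reward, next_state, gamma, target_actor, target_critic):
    """TargetQ = r + gamma * Q'(s', pi'(s'))."""
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def critic_step(w, batch, gamma, target_actor, target_critic, lr=0.05):
    """One averaged update of a linear critic Q(s, a | w) = w[0]*s + w[1]*a.

    batch is a list of (s, a, r, s_next) transitions sampled from the buffer.
    """
    grad = [0.0, 0.0]
    for s, a, r, s_next in batch:
        q = w[0] * s + w[1] * a
        error = td_target(r, s_next, gamma, target_actor, target_critic) - q
        # For a linear critic, dQ/dw = (s, a).
        grad[0] += error * s
        grad[1] += error * a
    n = len(batch)
    return [w[0] + lr * grad[0] / n, w[1] + lr * grad[1] / n]
```

The step moves the critic's estimate toward the target, which is computed from the separate target networks so that the regression target does not shift with every update.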

The MADDPG [16] algorithm, which extends DDPG to multiple agents, adopts a framework of centralized training and distributed execution. During training, the agents learn with the help of global information; during execution, each agent needs only local information to make decisions. Q-learning lacks this property, because it must use the same information during learning and execution. Compared with DDPG, the Critic network additionally takes the strategy information of the other agents as input, whereas each Actor can access only local information. After training is completed, only the Actors are used, and each Actor runs in a distributed manner.

Let \(\theta = \left\{ {\theta_{1} ,\theta_{2} , \ldots ,\theta_{N} } \right\}\) denote the parameters of the strategies of the N agents, and \(\pi = \left\{ {\pi_{1} ,\pi_{2} ,...,\pi_{N} } \right\}\) denote the strategies of the N agents. For a stochastic strategy, the gradient of the cumulative expected reward of the i-th agent is:

$$\nabla_{{\theta_{i} }} J\left( {\theta_{i} } \right) = E_{{s \sim p^{\mu } ,a_{i} \sim \pi_{i} }} \left[ {\nabla_{{\theta_{i} }} \log \pi_{i} \left( {a_{i} {|}o_{i} } \right)Q_{i}^{\pi } \left( {x,a_{1} , \ldots ,a_{N} } \right)} \right]$$

Here \(Q_{i}^{\pi } \left( {x,a_{1} , \ldots ,a_{N} } \right)\) is a centralized action value function, which takes action \(a_{1} , \ldots ,a_{N}\) of all agents, plus state information \(x\) as input, and then outputs the Q value of agent i.
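The difference between what the centralized critic sees and what each actor sees can be sketched as below. The helper names and the flat-list encoding of states, observations and actions are assumptions for illustration only.

```python
def critic_input(global_state, joint_actions):
    """Centralized critic input: the state information x followed by the
    actions a_1, ..., a_N of all agents, flattened into one vector."""
    flat_actions = [v for action in joint_actions for v in action]
    return list(global_state) + flat_actions


def actor_input(observations, agent_index):
    """Decentralized actor input: only the local observation o_i."""
    return list(observations[agent_index])


def joint_target_actions(target_actors, next_observations):
    """Each target actor maps its own observation to its next action."""
    return [actor(obs) for actor, obs in zip(target_actors, next_observations)]
```

During training the critic of agent i receives the full vector from `critic_input`; at execution time only `actor_input` is needed, which is what allows each agent to run in a distributed manner.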

For deterministic strategies \(\mu_{{\theta_{i} }}\), the gradient is:

$$\nabla_{{\theta_{i} }} J\left( {\mu_{i} } \right) = E_{x,a \sim D} \left[ {\nabla_{{\theta_{i} }} \mu_{i} (a_{i} |o_{i} )\nabla_{{a_{i} }} Q_{i}^{\mu } \left( {x,a_{1} , \ldots ,a_{N} } \right)|_{{a_{i} = \mu_{i} \left( {o_{i} } \right)}} } \right]$$

The experience replay buffer D stores tuples \(\left( {{\text{x}},{\text{x}}^{\prime } ,a_{1} , \ldots ,a_{N} ,r_{1} , \ldots ,r_{N} } \right)\) that record the experience of all agents. The centralized action value function \(Q_{i}^{\mu }\) is updated by minimizing the following loss:

$$L\left( {\theta_{i} } \right) = E_{{x,a,r,x^{\prime } }} \left[ {\left( {Q_{i}^{\mu } \left( {x,a_{1} , \ldots ,a_{N} } \right) - y} \right)^{2} } \right],\;y = r_{i} + \gamma Q_{i}^{{\mu^{\prime } }} \left( {x^{\prime } ,a_{1}^{\prime } , \ldots ,a_{N}^{\prime } } \right)|_{{a_{j}^{\prime } = \mu_{j}^{\prime } \left( {o_{j} } \right)}}$$

DDPG is generally used for problems with continuous actions, whereas the action space in our problem is discrete. Therefore, the Gumbel-Softmax [16] method is used to convert discrete actions into continuous action estimates.
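The standard Gumbel-Softmax relaxation can be sketched as follows: Gumbel noise is added to the logits of the discrete actions, and a temperature-scaled softmax turns the result into a continuous vector of action weights. The temperature value and function name are assumptions; the paper does not report its settings here.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a continuous relaxation of a one-hot sample over discrete actions."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u uniform on (0, 1).
        u = max(rng.random(), 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def to_discrete(weights):
    """Pick the discrete action (e.g. an edge server index) with the largest weight."""
    return max(range(len(weights)), key=lambda i: weights[i])
```

A lower temperature pushes the output closer to a one-hot vector, while a higher one makes it smoother; the discrete decision used at execution time is the index of the largest weight.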

Figure 4 is a schematic diagram of the network structure of the DTOMALB algorithm.

Fig. 4 Schematic diagram of the network structure of the DTOMALB algorithm

The overall algorithm flow is shown in Table 1.

Table 1 DTOMALB algorithm description

5 Experiments and methods

In this section, experiments are presented that evaluate the algorithm from four aspects. They comprise a simulation of the DTOMALB algorithm, a comparison with other algorithms on four indicators to test the usability and superiority of the proposed algorithm, and a study of how adding certain variables affects the whole training process.

The experimental method is to simulate task offloading for edge computing in a single-cell multi-user scenario. First, the number of edge servers and users in the scenario and the other simulation parameters are set. Then, the DTOMALB algorithm is used to simulate the task offloading process. Finally, the performance of the proposed algorithm is evaluated by comparing it with existing algorithms in the same simulation environment.

The analysis methods adopted here are cross-comparison and trend analysis. First, with all other simulation parameters held fixed, one variable of the training process is changed and its influence on the simulation results is observed. Then the trend of the simulation results under this change, and the reason for it, are analyzed. Finally, different algorithms are run under the same simulation settings and their performance is compared, so as to verify the feasibility of the DTOMALB algorithm proposed in this paper.

In this paper, four experiments are designed, and the experimental simulation environment is set as follows.

In the single-cell multi-user scenario of this experiment, there are 10 users and 5 edge servers. More detailed simulation parameter settings are shown in Table 2.

Table 2 Simulation parameter setting

6 Results and discussions

After the algorithm is simulated, its effectiveness is analyzed from four aspects: training convergence, the load balancing coefficient, system utility versus number of users, and task load balancing results. The significance and the limitations of the results are discussed as part of the analysis below.

7 Training convergence analysis

In Fig. 5, DQN-TOLB and IDQL-TOLB [17] are, respectively, the centralized DQN algorithm with the load factor introduced and the independent DQN algorithm in the multi-agent setting. To facilitate comparison with the centralized DQN-TOLB algorithm, we define the system utility of the two multi-agent algorithms as the average of the cumulative rewards of the agents. Figure 5 shows how the system utility of the three algorithms changes during training. The proposed DTOMALB algorithm converges fastest and reaches the highest utility. The centralized DQN-TOLB algorithm also converges eventually, but slowly and to a local optimum, because its action space is huge and a lot of exploration is required to converge and find the optimal value. The training process of the IDQL-TOLB algorithm is very unstable and has the largest fluctuation range. This is because each agent makes its decisions independently, so the environment seen by every agent changes dynamically; as a result, training is unstable and the reward fluctuates at a lower level.

Fig. 5 The convergence of the algorithm
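The system utility used for the two multi-agent algorithms, the average of the agents' cumulative rewards, can be written as a short sketch. The reward values in the usage note are hypothetical and do not come from the experiments.

```python
def cumulative_reward(rewards):
    """Total reward collected by one agent over an episode."""
    return sum(rewards)


def system_utility(per_agent_rewards):
    """Average of the cumulative rewards of all agents."""
    totals = [cumulative_reward(r) for r in per_agent_rewards]
    return sum(totals) / len(totals)
```

For instance, three agents with episode rewards `[-1.0, -2.0]`, `[-3.0, -1.0]` and `[-2.0, -2.0]` have cumulative rewards of -3, -4 and -4, so `system_utility` returns their mean.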

8 With or without load balancing coefficient training

Figure 6 compares the training of the DTOMALB algorithm with that of the DTOMA algorithm, which does not introduce the load balancing coefficient. With the load balancing factor, the DTOMALB algorithm converges faster and achieves higher system utility, whereas the DTOMA algorithm needs a longer training process and reaches a relatively low system utility. The reason is that the load balancing factor reduces the selfishness of each user: each user tends to choose an edge server with a lower load, which reduces unnecessary competition, accelerates convergence, and improves the effectiveness of the system. Without load balancing coefficients, competition occurs more easily, so training needs more rounds to explore the optimal strategy and is more likely to fall into a local optimum. Introducing the load balancing coefficient is therefore necessary: it speeds up training and helps reduce the average delay over all users.

Fig. 6 Comparison of training with and without load balance coefficient

9 Comparison of system utility and number of users

As the total number of user devices increases, the overall utility of the system decreases, because more and more tasks need to be offloaded. On the one hand, each user receives a smaller share of the communication resources, so the offloading rate decreases and the offloading delay increases, which reduces system utility. On the other hand, the computing resources that can be allocated to each user also decrease, which increases latency and again lowers the overall system utility. Figure 7 shows that the resource scheduling scheme based on DTOMALB achieves higher overall utility, reflecting the superiority of the algorithm.

Fig. 7 The relationship between system utility and number of users
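The two effects described above can be made concrete with a rough sketch in which bandwidth and server CPU are split equally among users and the transmission rate follows the Shannon formula. The equal-share assumption, the rate model and every parameter value are hypothetical and are not the settings of Table 2.

```python
import math


def task_delay(n_users, data_bits=1e6, cycles=1e9,
               bandwidth_hz=20e6, snr=100.0, server_hz=16e9):
    """Transmission delay plus computation delay for one user (in seconds)."""
    # Each user gets an equal share of the bandwidth, so the rate drops with n_users.
    rate = (bandwidth_hz / n_users) * math.log2(1.0 + snr)
    # Each user gets an equal share of the server's CPU cycles per second.
    cpu_share = server_hz / n_users
    return data_bits / rate + cycles / cpu_share


delays = [task_delay(n) for n in (5, 10, 20)]
```

Under these assumptions both delay terms grow in proportion to the number of users, which is consistent with the decreasing utility discussed for Fig. 7.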

10 Task load balance result analysis

Figure 8 shows the load balancing results achieved by the DTOMALB algorithm. The computing power of servers 1 to 5 is [12, 14, 16, 18, 20] GHz. The load across the servers is well balanced: servers with more computing power are assigned relatively more computing tasks, and servers with less computing power are assigned fewer. The load factor of each server is almost the same, so load balancing is achieved.

Fig. 8 Edge server load balancing
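To illustrate what equal load factors mean, the sketch below assumes that the load factor of a server is its assigned workload divided by its computing capacity; the paper's own definition may differ in detail. The capacities are those stated above, while the workloads are hypothetical numbers, not the measured results of Fig. 8.

```python
from statistics import pstdev


def load_factors(workloads, capacities):
    """Load factor per server: assigned workload divided by capacity (assumed definition)."""
    return [w / c for w, c in zip(workloads, capacities)]


capacities_ghz = [12, 14, 16, 18, 20]

# Hypothetical workloads proportional to capacity: stronger servers get more work.
balanced = load_factors([6.0, 7.0, 8.0, 9.0, 10.0], capacities_ghz)

# Hypothetical workloads that ignore capacity: weaker servers are overloaded.
unbalanced = load_factors([10.0, 10.0, 10.0, 5.0, 5.0], capacities_ghz)

# The spread of the load factors measures how uneven the allocation is.
balanced_spread = pstdev(balanced)
unbalanced_spread = pstdev(unbalanced)
```

In the proportional case every server has the same load factor and the spread is zero, even though the absolute workloads differ; this is the sense in which stronger servers carrying more tasks still counts as balanced.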

From the above experiments and results, it can be seen that the DTOMALB algorithm proposed in this paper produces reasonable offloading decisions in the single-cell multi-user edge computing scenario, improving user experience and balancing resource utilization. The limitations of this research mainly concern model training, which could be improved through offline training.

11 Conclusion

This paper studies task offloading and resource allocation in a multi-user single-cell scenario. In this scenario, multiple users are connected to an edge server through a single LTE macro base station, and the edge server can schedule tasks to other servers connected to it. To address the resource waste caused by uneven load when multiple users offload their computing tasks, while also minimizing the response time of user tasks, the user offloading and resource allocation problem in this scenario is formulated as a multi-objective optimization problem. At the same time, because centralized task scheduling suffers from dimensional explosion, poor scalability, and poor adaptability to dynamics as the number of users grows, a task offloading model with centralized training and distributed execution is proposed. By modeling the users as a Markov game and introducing load balancing coefficients, a multi-agent DTOMALB algorithm is proposed. Simulation experiments comparing it with the centralized algorithm and the independent multi-agent algorithm show that the DTOMALB algorithm can effectively reduce the response delay of all users and keep the load of the edge computing servers relatively balanced, improving the robustness and scalability of the system.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.



IoT: Internet of Things

DTOMALB: Distributed task offloading algorithm based on multi-agent and load balancing

VEC: Vehicle Edge Computing

DRL: Deep Reinforcement Learning

LTE: Long Term Evolution


  1. K. Ashton, That ‘Internet of things’ thing. RFID J. 22(7), 97–101 (2009)

  2. R. Buyya, C.S. Yeo, S. Venugopal et al., Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst. 25(6), 599–616 (2009)

  3. W. Wang, R. Lan, J. Gu et al., Edge caching at base stations with device-to-device offloading. IEEE Access 5(99), 6399–6410 (2017)

  4. H. Flores, S.N. Srirama, R. Buyya, Computational offloading or data binding? Bridging the cloud infrastructure to the proximity of the mobile user, in 2014 2nd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering (MobileCloud) (IEEE, 2014), pp. 10–18

  5. C. You, K. Huang, H. Chae et al., Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans. Wireless Commun. 16(3), 1397–1411 (2017)

  6. S. Bi, Y.J.A. Zhang, Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading. IEEE Trans. Wirel. Commun., 1–1 (2018)

  7. S. Guo, B. Xiao, Y. Yang et al., Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing, in IEEE INFOCOM 2016 - IEEE Conference on Computer Communications (IEEE, 2016)

  8. J. Wang, J. Hu, G. Min et al., Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning. IEEE Commun. Mag. 57(5), 64–69 (2019)

  9. J. Xu, L. Chen, S. Ren, Online learning for offloading and autoscaling in energy harvesting mobile edge computing. IEEE Trans. Cognit. Commun. Netw. 3(3), 361–373 (2017)

  10. P. Dong, X.X. Wang, J.J.P.C. Rodrigues et al., Deep reinforcement learning for vehicular edge computing: an intelligent offloading system. ACM Trans. Intell. Syst. Technol. 10(6) (2019)

  11. Y. Liu, H. Yu, S. Xie et al., Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks. IEEE Trans. Veh. Technol. 99, 1–1 (2019)

  12. V. Mnih, K. Kavukcuoglu, D. Silver et al., Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  13. T.P. Lillicrap, J.J. Hunt, A. Pritzel et al., Continuous control with deep reinforcement learning (2015). arXiv preprint arXiv:1509.02971

  14. V. Mnih, A.P. Badia, M. Mirza et al., Asynchronous methods for deep reinforcement learning, in International Conference on Machine Learning (2016), pp. 1928–1937

  15. D. Silver, G. Lever, N. Heess et al., Deterministic policy gradient algorithms (2014)

  16. R. Lowe, Multi-agent actor-critic for mixed cooperative-competitive environments (2017)

  17. A. Tampuu et al., Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12(4), e0172395 (2017)



This work was supported by the National Natural Science Foundation of China (NO. 61772064), the National Key Research and Development Program of China (2018YFC0831900).






SP, ZZ conceived and designed the study. CL and XP performed the simulation experiments. ZZ and CL wrote the paper. ZZ and CL reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhenjiang Zhang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Zhang, Z., Li, C., Peng, S. et al. A new task offloading algorithm in edge computing. J Wireless Com Network 2021, 17 (2021).
