A Risk-Sensitive Task Offloading Strategy for Edge Computing in Industrial Internet of Things

Edge computing has become one of the key enablers for ultra-reliable and low-latency communications in the industrial Internet of Things in fifth generation communication systems, and is also a promising technology for future sixth generation communication systems. In this work, we consider the application of edge computing to smart factories for mission-critical task offloading through wireless links. In such scenarios, although high end-to-end delays from the generation to the completion of tasks happen with low probability, they may incur severe casualties and property loss, and must be treated seriously. Inspired by the risk management theory widely used in finance, we adopt the Conditional Value at Risk to capture the tail of the delay distribution. An upper bound of the Conditional Value at Risk is derived through analysis of the queues at both the devices and the edge computing servers. We aim to find the optimal offloading policy, taking into consideration both the average and the worst-case delay performance of the system. Given that the formulated optimization problem is a non-convex mixed integer non-linear programming problem, it is decomposed into sub-problems and a two-stage heuristic algorithm is proposed. Simulation results validate our analysis and indicate that the proposed algorithm can reduce the risk in both the queuing and end-to-end delay.


arXiv:2101.05946v1 [cs.IT] 15 Jan 2021

Introduction
Intelligent factory automation is one of the typical applications envisioned in ultra-reliable and low-latency communications (URLLC) scenarios in the fifth generation (5G) and the coming sixth generation (6G) communications [1,2]. In future smart factories, machines and sensors are seamlessly connected with each other through wireless links to conduct production tasks cooperatively. During the manufacturing process, a great number of operations of the machines and robots require complex control algorithms and intense data computation, such as travelling across zones to identify and pick up objects and controlling robotic arms to assemble components within a precise position alignment [3]. The limited built-in computing resources are not sufficient for the stringent latency requirements, so the tasks have to be offloaded to servers for processing [4]. Conventionally, the large volume of data generated at the local devices is uploaded to cloud computing servers [5]. However, since cloud computing servers are usually deployed remotely, the large round-trip transmission latency as well as possible network congestion make it hard to meet the stringent end-to-end delay requirements of the actuators and control units in the industrial Internet of Things (IIoT) system. To overcome these difficulties, edge computing has emerged, where the servers are placed at the edge of the network to achieve a much lower transmission and processing latency [6].
Several studies have focused on improving the service efficiency of edge computing systems [7,8,9,10,11]. The authors in [7] adopt the Markov decision process (MDP) to minimize the average delay of a mobile edge computing system by deciding whether to compute locally or offload the tasks to the edge server. In [8], the authors define a delay-based Lyapunov function instead of the queue-length-based one, and minimize the average delay by optimizing resource scheduling under the Lyapunov optimization framework. Three-tier multi-server mobile computing networks are investigated in [9], where a cooperative task offloading strategy is proposed based on the Alternating Direction Method of Multipliers (ADMM). In [10], an adaptive learning-based task offloading algorithm is proposed to minimize the average delay for vehicular edge computing systems. Most relevantly, [11] integrates fog computing into the cloud-based industrial Internet of Things (IIoT), where the task offloading, transmission and computing resource allocation schemes are jointly optimized to reduce the service response latency in an unreliable communication environment.
The aforementioned works focus only on the average delay performance and neglect the worst-case performance of the edge computing system. However, in IIoT systems, the probability of an intense delay jitter usually matters much more than the average delay, since when the delay exceeds a certain threshold, severe accidents may occur, such as the deadlock of the manufacturing process, damage to the machines and even casualties. Therefore, in such scenarios, not only the average delay performance but also the potential hazard, i.e. the risk behind the tail distribution of the delay, should be carefully investigated. There have been some preliminary works dealing with the embedded risks in edge computing systems [12,13,14]. In [12], the tail distribution of the task queue under a probability constraint imposed on the excess value is characterized by the extreme value theory [15], and an offloading strategy is designed to minimize the energy consumption. The authors of [13] also apply the extreme value theory to edge computing systems, in order to investigate the extreme event of queue length violation in the computation phase. Besides, in [14], the authors focus on a vehicular edge computing network where vehicles either fetch images from cameras or acquire synthesized images from an edge computing server. A risk-sensitive learning [16] based task fetching and offloading strategy is proposed to minimize the risk behind the end-to-end delay.
Different from the works mentioned above, we introduce the risk management theory [17], widely used in the field of finance, to the edge computing system in consideration of the uncertainty of the wireless channels. Value at Risk (VaR) and Conditional Value at Risk (CVaR) are the two widely used tools to characterize risks. While VaR assumes a Gaussian distribution and lacks convexity and sub-additivity, which makes it inapplicable in many cases, CVaR is a coherent risk measure for any type of probability distribution and is much easier to handle in practice. Therefore, CVaR is employed in this work to model the risk of the task completion delay in the considered edge computing-assisted IIoT system. We aim to minimize both the average delay and the CVaR by jointly designing the offloading and computing resource allocation strategy. The main contributions of this work are summarized as follows:
• We focus on the hazard incurred by the intense delay jitter in the edge computing-assisted IIoT scenario and introduce the risk management theory to the design of the offloading and computation resource allocation strategy.
• A cascade queuing model is constructed to describe the end-to-end delay property of the system. Due to the uncertainty of the wireless channel, the transmission time follows a general distribution, which makes the queuing model hard to analyze. By exploring the queuing theory and the risk management theory, we provide an upper bound for both the average end-to-end delay and the CVaR.
• A low-complexity risk-sensitive task offloading strategy is proposed, where both the average performance and the risk with respect to the end-to-end delay are optimized simultaneously. The computation complexity of each procedure of the proposed algorithm is analyzed in detail. Simulations under the practical wireless environment in the automated factory validate the effectiveness of the proposed strategy in controlling the risk behind the intense delay jitter.

The remainder of this paper is organized as follows. We introduce the system model and analyze both the average delay and the CVaR in Section 2. In Section 3, we formulate the offloading and computation resource allocation problem and propose a low-complexity heuristic algorithm. In Section 4, numerical results are reported with discussions. Finally, Section 5 concludes the paper.
System model

Edge computing system
As shown in Fig. 1, we consider an edge computing-assisted IIoT system that consists of a set $\mathcal{M} = \{1, 2, \cdots, M\}$ of IIoT devices and a set $\mathcal{N} = \{1, 2, \cdots, N\}$ of edge computing servers (ECSs). Each IIoT device $i \in \mathcal{M}$ randomly generates tasks of identical size $d_i$ bits, and we assume that the task arrival process follows the Poisson distribution with average arrival rate $\lambda_i$. We denote by $\omega_i$ the computation intensity of the tasks of device $i$ [18], i.e. the number of CPU cycles required to process one bit of data. Then, the total number of CPU cycles needed for a task of device $i$, denoted by $c_i$, can be calculated as
$$c_i = \omega_i d_i.$$
Owing to their insufficient computation capability, the IIoT devices offload their tasks to the ECSs through wireless links. Each ECS $j \in \mathcal{N}$ is equipped with a CPU of $N_j$ cores, which can work simultaneously and independently. We assume that the tasks of a device can only be offloaded to one ECS, while each core only processes the tasks from the same device, which means that an ECS can receive tasks from multiple devices as long as the number of devices it serves does not exceed the number of its CPU cores [12]. Let $X_{M \times N} = [x_{ij}]$ be the offloading matrix, where $x_{ij}$, $i \in \mathcal{M}$, $j \in \mathcal{N}$, is defined as follows:
$$x_{ij} = \begin{cases} 1, & \text{device } i \text{ offloads its tasks to ECS } j, \\ 0, & \text{otherwise,} \end{cases}$$
with $\sum_{j=1}^{N} x_{ij} = 1$ and $\sum_{i=1}^{M} x_{ij} \le N_j$ for $\forall i \in \mathcal{M}$, $\forall j \in \mathcal{N}$.

Due to the massive deployment of devices in the complex industrial environment, it is impractical to obtain the instantaneous channel state information (CSI) accurately and in a timely manner. Therefore, in this work, we design a task offloading strategy based on the statistics of the wireless links, i.e. the distribution of the channel gain. We assume block-fading channels such that the channel gain remains unchanged during the execution of one task and varies independently between two executions following an identical distribution, which is known a priori.
Denote by g ij the channel gain from device i to ECS j, the transmission rate can be expressed as follows: where B is the bandwidth, N 0 is the noise power, p i is the transmit power of device i and Φ ij is the path loss from device i to ECS j. Without loss of generality, we assume that the noise power at each ECS is identical, and each IIoT device has an orthogonal channel with the same bandwidth B.
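To make the role of these parameters concrete, the following is a minimal numerical sketch. It assumes a Shannon-type rate of the form B log2(1 + p g / (Φ N0)), with g a power gain and Φ a linear path-loss factor; this exact form is an assumption made for illustration and is not confirmed by the text above. The function names are hypothetical, and the numbers are the simulation settings reported later in the paper (10 MHz, 30 dBm, 70 dB, 10^-9 W, 0.5 Mbit tasks).

```python
import math

def dbm_to_watt(p_dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10)

def db_to_linear(x_db):
    """Convert a ratio in dB to a linear factor."""
    return 10 ** (x_db / 10)

def transmission_rate(bandwidth_hz, tx_power_w, channel_gain, path_loss_lin, noise_power_w):
    """Assumed Shannon-type rate in bit/s: B * log2(1 + p * g / (Phi * N0))."""
    snr = tx_power_w * channel_gain / (path_loss_lin * noise_power_w)
    return bandwidth_hz * math.log2(1 + snr)

def transmission_delay(task_bits, rate_bps):
    """Time in seconds needed to send one task of task_bits bits."""
    return task_bits / rate_bps

# One channel realisation with unit gain (an arbitrary illustrative value).
rate = transmission_rate(10e6, dbm_to_watt(30), 1.0, db_to_linear(70), 1e-9)
delay = transmission_delay(0.5e6, rate)
print(f"rate = {rate / 1e6:.1f} Mbit/s, delay = {delay * 1e3:.2f} ms")
```

Because the channel gain is random, the transmission delay computed this way is itself a random variable, which is what makes the device queue hard to analyze.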

Queuing model
In the considered edge computing-assisted IIoT system, there are two kinds of queues: the queue at each device and the queue at each ECS, as depicted in Fig. 2. Without loss of generality, we assume that device $i$ offloads its tasks to ECS $j$, and denote by $Q^D_{ij}$ the queue formed at device $i$. The arrival process of $Q^D_{ij}$ is a Poisson process, and the departure process depends on the transmission delay $t^D_{ij}$, i.e. the time needed to send the $d_i$ bits of a task at the transmission rate from device $i$ to ECS $j$, written here as $r_{ij}$:
$$t^D_{ij} = \frac{d_i}{r_{ij}}.$$
Therefore, $Q^D_{ij}$ follows the M/G/1 model. As for an ECS, the tasks from each device connected to it form an independent queue. Denote by $Q^S_{ij}$ the queue of tasks offloaded from device $i$ to ECS $j$; then the arrival process of $Q^S_{ij}$ is the same as the departure process of $Q^D_{ij}$. We denote by $f_{ij}$ the computation frequency allocated to device $i$ by ECS $j$. The computation delay $t^S_{ij}$, i.e. the time required to complete a task of device $i$, is then calculated as
$$t^S_{ij} = \frac{c_i}{f_{ij}}.$$
As a result, the service time of a task follows a deterministic distribution and thus $Q^S_{ij}$ follows the G/D/1 model. Based on the queuing analysis, when device $i$ offloads its tasks to ECS $j$, the total delay $t_{ij}$ is given by
$$t_{ij} = W^D_{ij} + t^D_{ij} + W^S_{ij} + t^S_{ij}, \tag{4}$$
where $W^D_{ij}$ and $W^S_{ij}$ are the queuing delays at device $i$ and ECS $j$, respectively. We analyze both the average performance and the CVaR of the total delay in the following.

Figure 2: Queuing model (the queue at device $i$, the wireless channel, and the queue at ECS $j$ where the tasks of device $i$ are computed).
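The cascade of the two queues can be explored with a short simulation. The sketch below is an illustration only: it assumes first-come-first-served service at both stages, and the uniform transmission-time distribution and all numeric values are placeholders rather than the paper's channel model. `simulate_tandem` is a hypothetical helper name.

```python
import random

def simulate_tandem(lam, sample_tx_time, compute_time, n_tasks, seed=0):
    """Simulate n_tasks through the device queue and then the ECS queue.

    lam            -- Poisson task arrival rate at the device (tasks/s)
    sample_tx_time -- callable rng -> random transmission time (s)
    compute_time   -- deterministic computation time at the ECS (s)
    Returns the list of end-to-end delays (s), one per task.
    """
    rng = random.Random(seed)
    arrival = 0.0      # arrival time of the current task at the device
    device_free = 0.0  # time at which the device finishes its previous transmission
    ecs_free = 0.0     # time at which the ECS core finishes its previous computation
    delays = []
    for _ in range(n_tasks):
        arrival += rng.expovariate(lam)               # exponential inter-arrival times
        tx_start = max(arrival, device_free)          # waiting in the device queue
        device_free = tx_start + sample_tx_time(rng)  # task reaches the ECS
        comp_start = max(device_free, ecs_free)       # waiting in the ECS queue
        ecs_free = comp_start + compute_time          # task completed
        delays.append(ecs_free - arrival)
    return delays

# Placeholder setting: 20 tasks/s, 4-12 ms transmission time, 3 ms computation time.
delays = simulate_tandem(20.0, lambda rng: rng.uniform(0.004, 0.012), 0.003, 10_000)
mean_delay = sum(delays) / len(delays)
p99 = sorted(delays)[int(0.99 * len(delays))]
print(f"mean = {mean_delay * 1e3:.2f} ms, 99th percentile = {p99 * 1e3:.2f} ms")
```

Each simulated delay is the sum of the waiting time at the device, the transmission time, the waiting time at the ECS and the computation time, so the gap between the mean and the 99th percentile gives a first impression of the tail that the CVaR is meant to capture.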

Average delay
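According to (4), the average delay can be calculated as follows:
$$E[t_{ij}] = E[W^D_{ij}] + E[t^D_{ij}] + E[W^S_{ij}] + t^S_{ij}. \tag{5}$$
As for the queuing delay at device $i$, we denote by $\mu_{ij}$ the service rate of $Q^D_{ij}$, which can be calculated as the reciprocal of the average transmission time, i.e.
$$\mu_{ij} = \frac{1}{E[t^D_{ij}]},$$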
where the expectation is taken over the probability distribution of the channel gain. According to [19], the average queuing delay at device i can be expressed as follows where To analyze the queuing delay in the G/D/1 queuing model of Q S ij , we first give the following lemma [20].
Lemma 1 In the G/G/1 queuing model, let $\lambda$, $\mu$ and $W$ be the arrival rate, service rate and queuing delay, respectively. Then an upper bound of the average queuing delay is given by
$$E[W] \le \frac{\lambda(\sigma_a^2 + \sigma_b^2)}{2(1 - \rho)}, \tag{8}$$
where $\rho = \lambda/\mu$, $\sigma_a^2$ is the variance of the inter-arrival time, and $\sigma_b^2$ is the variance of the service time.
According to Lemma 1, we obtain the following theorem characterizing the upper bound of the average queuing delay at an ECS.
Theorem 1 When device $i$ offloads its tasks to ECS $j$, an upper bound of the average queuing delay at ECS $j$ is given by
$$E[W^S_{ij}] \le \frac{\lambda^S_{ij}(\sigma^S_{ij})^2}{2(1 - \rho^S_{ij})}, \tag{9}$$
where $\lambda^S_{ij}$ is the arrival rate of $Q^S_{ij}$, $\rho^S_{ij} = \lambda^S_{ij}/\mu_{ij}$ is the traffic intensity and $(\sigma^S_{ij})^2$ is the variance of the inter-arrival time of the tasks offloaded from device $i$ to ECS $j$.
Proof The G/D/1 model can be seen as a special case of the G/G/1 model in which the service time follows a deterministic distribution, the variance of which is zero. By substituting $\sigma_b^2 = 0$ into (8), we get (9) and Theorem 1 is proved.
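The two queuing results used in this subsection can be checked numerically. The sketch below uses the textbook Pollaczek-Khinchine formula for the M/G/1 mean waiting time and the Kingman-type G/G/1 bound as stated in Lemma 1; the notation in [19] and [20] may differ, and the function names are hypothetical.

```python
def pk_mean_wait(lam, mean_s, var_s):
    """Pollaczek-Khinchine mean waiting time in queue for an M/G/1 queue.

    lam    -- Poisson arrival rate
    mean_s -- mean service time
    var_s  -- variance of the service time
    """
    rho = lam * mean_s
    if rho >= 1:
        raise ValueError("queue is unstable (rho >= 1)")
    second_moment = var_s + mean_s ** 2
    return lam * second_moment / (2 * (1 - rho))

def kingman_upper_bound(lam, mean_s, var_a, var_s=0.0):
    """Upper bound on the G/G/1 mean waiting time (form of Lemma 1).

    var_a -- variance of the inter-arrival time
    var_s -- variance of the service time (0 for the G/D/1 case of Theorem 1)
    """
    rho = lam * mean_s
    if rho >= 1:
        raise ValueError("queue is unstable (rho >= 1)")
    return lam * (var_a + var_s) / (2 * (1 - rho))

# Sanity check on an M/M/1 queue with lam = 0.5 and mean service time 1:
# the exact mean wait is rho / (mu - lam) = 1, and the bound must not be smaller.
exact = pk_mean_wait(0.5, 1.0, 1.0)
bound = kingman_upper_bound(0.5, 1.0, var_a=4.0, var_s=1.0)
print(exact, bound)
```

Setting `var_s=0.0` gives the deterministic-service case used for the ECS queue, where the bound depends only on the arrival rate, the inter-arrival variance and the traffic intensity.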
Note that, due to the cascaded structure between Q D ij and Q S ij , the arrival rate λ S ij of Q S ij is equal to the departure rate of Q D ij , which can be evaluated from the analysis of the inter-departure time in [21]. Similarly, the variance of the inter-arrival time of Q S ij , i.e. σ S ij 2 , can be derived from the variance of the inter-departure time of Q D ij as in [22]. Finally, combining (5), (6) and (9), the upper bound of the average total delay can be obtained in the following corollary.
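Corollary 1 When device $i$ offloads its tasks to ECS $j$, an upper bound of the average total delay $E[t_{ij}]$ is given by
$$E[t_{ij}] \le E[W^D_{ij}] + E[t^D_{ij}] + \frac{\lambda^S_{ij}(\sigma^S_{ij})^2}{2(1 - \rho^S_{ij})} + t^S_{ij}. \tag{10}$$
Since each IIoT device offloads its tasks to only one ECS, for each device $i$ the task completion time, denoted by $t_i$, can be expressed as $t_i = \sum_{j=1}^{N} x_{ij} t_{ij}$, and correspondingly an upper bound of the average total delay of device $i$ based on (10) is given by
$$E[t_i] \le \sum_{j=1}^{N} x_{ij} \bar{T}_{ij}, \tag{11}$$
where $\bar{T}_{ij}$ denotes the right-hand side of (10).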

Risk metric for delay
Risk in the considered edge computing-assisted system is mainly reflected in high latency that occurs with low probability. Specifically, we introduce CVaR as a measure of risk to characterize the tail distribution of the delay. Before we formally define CVaR, we first give the definition of VaR [23].
Definition 1 For a random variable $X$ and a confidence level $\alpha \in (0, 1)$, the $\alpha$-VaR of $X$ is the $\alpha$-percentile of the distribution of $X$, which can be expressed mathematically as follows:
$$\mathrm{VaR}_\alpha(X) = \min\{x \in \mathbb{R} : P(X \le x) \ge \alpha\}. \tag{13}$$
The CVaR measures the expected loss in the right tail given that a particular threshold has been crossed, and can also be considered as the average of the potential losses that exceed the VaR. The definition of CVaR is given as follows [24].
Definition 2 For a random variable X and a confidence level α ∈ (0, 1), the α-CVaR of X is given by To characterize the CVaR of the total delay, we first analyze the CVaR of each part of the total delay in (4). Recall that the service process of the queue at the device follows general distribution, so it is quite difficult to directly characterize the probability distribution of the waiting time. However, in the considered IIoT scenario, it is reasonable to take the heavy traffic assumption. According to [25], the cumulative distribution function (CDF) of W D ij can be approximated as where V ij is the variance of the service time of Q D ij , i.e. the transmission time of a task. Based on (15), the CVaR of W D ij can be evaluated in the following theorem. Theorem 2 For a confidence level α ∈ (0, 1), the α-CVaR of W D ij can be expressed as Proof According to the definition of VaR in Definition 1, we can obtain that By substituting (17) to (14), the CVaR of W D ij can be calculated as follows.
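As for the transmission delay $t^D_{ij}$, we define an auxiliary function
$$F_\alpha(t^D_{ij}, \gamma) = \gamma + \frac{1}{1 - \alpha} E\left[(t^D_{ij} - \gamma)^+\right], \tag{19}$$
where $(x)^+ = \max(0, x)$ and the expectation is taken over the distribution of the channel gain. According to [26], the CVaR of $t^D_{ij}$ can finally be calculated as
$$\mathrm{CVaR}_\alpha(t^D_{ij}) = \min_{\gamma \in \mathbb{R}} F_\alpha(t^D_{ij}, \gamma). \tag{20}$$
Now we turn to the queue at the ECS. Similar to (15), the CDF of $W^S_{ij}$ can be approximated as and thus the CVaR of the queuing delay at the ECS can be evaluated in the following theorem.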

Theorem 3
For a confidence level α ∈ (0, 1), the α-CVaR of W S ij can be expressed as The proof of Theorem 3 is similar to Theorem 2 and is omitted here for brevity.
Since we assume constant computing capability at the ECS, the service time of $Q^S_{ij}$ is deterministic, and its CVaR equals the service time itself, i.e.
$$\mathrm{CVaR}_\alpha(t^S_{ij}) = t^S_{ij}. \tag{23}$$
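As a rough illustration of how a heavy-traffic approximation turns into closed-form risk values, the sketch below assumes that the waiting time has a purely exponential tail, P(W > x) = exp(-x / s), with some scale s. The scale is treated as a given input here; how it depends on the arrival rate and the variances is not reproduced, and the function names are hypothetical. Under this assumption the VaR is the α-quantile of the exponential distribution, and the CVaR exceeds it by exactly s because of the memoryless property.

```python
import math

def exp_tail_var(scale, alpha):
    """alpha-VaR of a waiting time with tail P(W > x) = exp(-x / scale)."""
    return -scale * math.log(1.0 - alpha)

def exp_tail_cvar(scale, alpha):
    """alpha-CVaR of the same waiting time: VaR plus the mean excess, which equals scale."""
    return exp_tail_var(scale, alpha) + scale

# Illustrative value: a 2 ms scale at confidence level 0.99.
var_99 = exp_tail_var(0.002, 0.99)
cvar_99 = exp_tail_cvar(0.002, 0.99)
print(f"VaR = {var_99 * 1e3:.2f} ms, CVaR = {cvar_99 * 1e3:.2f} ms")
```

For a deterministic quantity such as the computation time, both measures simply return the value itself, which is why that term enters the risk bound unchanged.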
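With the CVaR of each part of the delay involved in the task offloading, we provide an upper bound of the CVaR of the total delay in the following theorem, based on the convexity and sub-additivity of the CVaR [27].

Theorem 4 For a confidence level $\alpha \in (0, 1)$, an upper bound of the $\alpha$-CVaR of $t_i$ is given by
$$\mathrm{CVaR}_\alpha(t_i) \le \sum_{j=1}^{N} x_{ij} \bar{C}_{ij}, \tag{24}$$
where
$$\bar{C}_{ij} = \mathrm{CVaR}_\alpha(W^D_{ij}) + \mathrm{CVaR}_\alpha(t^D_{ij}) + \mathrm{CVaR}_\alpha(W^S_{ij}) + \mathrm{CVaR}_\alpha(t^S_{ij}). \tag{25}$$

Proof According to the convexity, and since $t_i = \sum_{j=1}^{N} x_{ij} t_{ij}$ with $\sum_{j=1}^{N} x_{ij} = 1$, the $\alpha$-CVaR of $t_i$ satisfies the following Jensen inequality:
$$\mathrm{CVaR}_\alpha(t_i) \le \sum_{j=1}^{N} x_{ij} \mathrm{CVaR}_\alpha(t_{ij}). \tag{26}$$
Furthermore, based on (4) and the sub-additivity of the CVaR, we have the following inequality:
$$\mathrm{CVaR}_\alpha(t_{ij}) \le \mathrm{CVaR}_\alpha(W^D_{ij}) + \mathrm{CVaR}_\alpha(t^D_{ij}) + \mathrm{CVaR}_\alpha(W^S_{ij}) + \mathrm{CVaR}_\alpha(t^S_{ij}). \tag{27}$$
By combining (26) and (27), Theorem 4 is proved.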

Problem formulation
In the design of the edge computing-assisted IIoT system, not only the average latency but also the risk behind the intense delay jitter should be carefully considered.
Taking into account both the average delay performance and the risk, we set the objective of the task offloading problem as the weighted sum of the average delay and the CVaR, i.e. the mean-risk sum. We have shown that obtaining an explicit expression of both terms is often cumbersome, especially for the complex wireless environment in automated factories. Therefore, the two upper bounds of the average total delay and the corresponding CVaR derived in the previous section are adopted instead. Furthermore, in the considered mission-critical IIoT scenario, the performance of the whole system is usually determined by the device with the worst performance. As a result, we aim to minimize the maximum mean-risk sum among all the devices, which can be described as the following optimization problem: where $f = [f_{ij}]$, $i \in \mathcal{M}$, $j \in \mathcal{N}$ is the computation frequency allocation matrix, $\beta \in (0, 1)$ is the weight of the CVaR, also called the risk-sensitive parameter, and $F_j$ is the overall computation frequency of ECS $j$. Constraint (28b) guarantees that the tasks generated by a device can only be offloaded to one ECS. Constraints (28c) and (28d) indicate that the number of devices served by an ECS should not exceed the number of its CPU cores, and that the sum of the computation frequencies allocated to these devices should not exceed its overall computation frequency. Substituting (12), (18), (20), (22) and (23) into (28), we find that the optimization problem is a non-convex mixed integer non-linear problem (MINLP), which is NP-hard [28]. To reduce the computation overhead, we propose a heuristic algorithm, which is described in detail in the following subsection.

Problem solving
Recall that the CVaR of the transmission delay in (20) is in the form of a minimization problem. Since the optimization variable $\gamma$ in (20) is independent of the optimization variables $X$ and $f$ in (28), we can solve (20) first and substitute its optimal solution into (28) for the subsequent problem solving. To solve (20), we introduce an auxiliary variable $z_{ij} = (t^D_{ij} - \gamma)^+$, and problem (20) can be transformed into the following problem:
$$\min_{\gamma \in \mathbb{R},\, z_{ij}} \ \gamma + \frac{1}{1 - \alpha} E[z_{ij}] \quad \text{s.t.} \quad z_{ij} \ge t^D_{ij} - \gamma, \ z_{ij} \ge 0. \tag{29}$$
Problem (29) is a stochastic optimization problem with the expectation taken over the channel gain $g_{ij}$. To approximate the expectation, we sample the probability distribution of $g_{ij}$ [26], and a transformed problem is obtained as follows:
$$\min_{\gamma \in \mathbb{R},\, z^k_{ij}} \ \gamma + \frac{1}{(1 - \alpha)K} \sum_{k=1}^{K} z^k_{ij} \quad \text{s.t.} \quad z^k_{ij} \ge t^{D,k}_{ij} - \gamma, \ z^k_{ij} \ge 0, \ \forall k \in \mathcal{K}, \tag{30}$$
where $z^k_{ij}$, $k \in \mathcal{K} = \{1, 2, \cdots, K\}$ are the samples of $z_{ij}$ and $t^{D,k}_{ij}$ is the transmission delay under the $k$-th sample of $g_{ij}$. Problem (30) is a linear optimization problem, the optimal solution of which, denoted by $U_{ij}$, can be obtained with the interior point method (IPM) [29]. There are $K + 1$ variables and $2K$ constraints in problem (30), so we can solve it with a time complexity of $O(((3K + 1)(K + 1)^2 + (3K + 1)^{1.5}(K + 1))\delta)$, where $\delta$ is the number of decoded bits [30].
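The sample-based formulation can be illustrated without an LP solver. The sketch below does not use the interior point method mentioned above; instead it relies on the fact that the sample-average objective γ + Σ_k (t_k - γ)^+ / ((1 - α)K) is convex and piecewise linear in γ with breakpoints at the samples, so a minimizer can be found by evaluating the objective at each sample. The function names and the input values are hypothetical.

```python
def ru_objective(gamma, samples, alpha):
    """Sample-average auxiliary function: gamma + mean((t - gamma)^+) / (1 - alpha)."""
    k = len(samples)
    excess = sum(max(t - gamma, 0.0) for t in samples)
    return gamma + excess / ((1.0 - alpha) * k)

def sample_cvar(samples, alpha):
    """Empirical alpha-CVaR obtained by minimizing the objective over the sample points.

    The objective is convex and piecewise linear in gamma, decreasing below the
    smallest sample and increasing above the largest one, so its minimum is
    attained at one of the samples. Cost is O(K^2), fine for small K.
    """
    return min(ru_objective(g, samples, alpha) for g in set(samples))

# Ten equally likely delay samples (arbitrary units) at confidence level 0.8:
# the result is the average of the worst 20 % of the samples.
samples = [float(v) for v in range(1, 11)]
print(sample_cvar(samples, 0.8))
```

For large K, sorting the samples once or calling an LP solver is preferable; the enumeration is kept here only because it makes the structure of the problem easy to see.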
With all the derived average and CVaR terms, problem (28) can be reformulated as the following optimization problem: It is obvious that the objective function of problem (31) contains the term $x_{ij}/f_{ij}$, and thus problem (31) is still a non-convex MINLP. In order to reduce the computational complexity, we decompose the original problem into two sub-problems.

First, we consider the following problem: It is worthwhile to mention that problem (32) is a convex MINLP, which can generally be solved via an outer approximation algorithm or an extended cutting plane algorithm [31]. More specifically, problem (32) is in the form of a bottleneck generalized assignment problem [32], which can be solved through the algorithm proposed in [33] with a time complexity of $O(MN \log N + \theta(NM + N^2))$, where $\theta$ is the number of bits required to encode $\max_{i,j} V_{ij}$.

After solving problem (32) for the optimal offloading matrix, denoted by $X^* = [x^*_{ij}]$, $i \in \mathcal{M}$, $j \in \mathcal{N}$, we substitute $X^*$ into (28), and the second sub-problem can be formulated as follows: Problem (34) is a non-convex optimization problem. To transform it into a convex problem, we introduce an auxiliary variable $G = [1/f_{ij}]$, $i \in \mathcal{M}$, $j \in \mathcal{N}$, and an optimization problem equivalent to (34) is given by The right side of inequality (35a) is used to maintain the stability of the queue at the ECS. Although problem (35) is a convex optimization problem, the objective function is in the form of the pointwise maximum of $M$ mean-risk sums. To handle this, we optimize the epigraph of problem (35) and obtain the equivalent problem as follows: ij constraints in all. According to [34], we can find an $\epsilon$-optimal solution to problem (36) in $O\left(\kappa \sqrt{L} \ln \frac{L \mu_0}{\epsilon}\right)$ Newton iterations through the logarithmic barrier method [35], where $\kappa$ is the self-concordance factor, $\mu_0$ is the initial barrier value and $\epsilon$ is the accuracy parameter. At this point, both the offloading matrix $X$ and the frequency allocation matrix $f$ have been solved.
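The first stage has a min-max assignment structure: each device is assigned to exactly one ECS, each ECS serves at most as many devices as it has cores, and the largest per-device cost is minimized. The sketch below is a brute-force enumeration that only illustrates this structure for small instances; it is not the algorithm of [33], and the cost matrix is a hypothetical stand-in for the per-pair mean-risk terms.

```python
from itertools import product

def bottleneck_assignment(cost, capacity):
    """Exhaustively search for the assignment minimizing the maximum cost.

    cost[i][j]  -- hypothetical cost of offloading device i to ECS j
    capacity[j] -- number of CPU cores of ECS j (maximum devices served)
    Returns (best maximum cost, tuple giving the ECS index chosen for each device).
    Runs in O(N^M) time, so it is only usable for small M and N.
    """
    m, n = len(cost), len(capacity)
    best_val, best_assign = float("inf"), None
    for assign in product(range(n), repeat=m):
        # Skip assignments that overload any ECS.
        if any(assign.count(j) > capacity[j] for j in range(n)):
            continue
        val = max(cost[i][assign[i]] for i in range(m))
        if val < best_val:
            best_val, best_assign = val, assign
    return best_val, best_assign

# Three devices, two ECSs with two cores each (illustrative numbers).
cost = [[1.0, 5.0],
        [2.0, 6.0],
        [3.0, 1.0]]
print(bottleneck_assignment(cost, [2, 2]))
```

In this toy instance all three devices would prefer the first ECS or tie on it, but the core limit forces one of them to the second ECS, and the search picks the device for which that move is cheapest.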

Results and discussion
In this section, we evaluate the proposed strategy through numerical results. We consider a typical use case in the edge computing-assisted IIoT system, i.e. the video-operated remote control use case, with a typical latency requirement of 10-100 ms and a payload size of 15-150 kbytes [36]. In the simulation, we consider 8 IIoT devices that offload their tasks to 2 ECSs. Without loss of generality, we set the data size to 0.5 Mbits, i.e. 62.5 kbytes, and the task computation intensity to 15 cycles/bit for each task of each device. Each ECS is equipped with a four-core CPU. The task arrival rate of each device, i.e. the parameter of the Poisson process, is set to be uniformly distributed in (10, 30). The bandwidth of each wireless channel is 10 MHz, and the transmission power, the noise power at the receiver and the path loss are all set to be identical for each device at 30 dBm, $10^{-9}$ W and 70 dB, respectively. To characterize the fading channel in a practical automated factory, we set the channel distribution as a mixture of Rayleigh and log-normal distributions, which has been confirmed by measurements in a real industrial environment [37]. The parameter of the Rayleigh distribution is set to be uniformly distributed in (0.5, 1) for each IIoT-ECS pair, and correspondingly, the two parameters of the log-normal distribution are set to be uniformly distributed in (1, 2) and (0, 4), respectively. Finally, we set the confidence level to α = 0.99.

Our proposed strategy considers both the queuing effect and the risk behind the total delay, and is thus denoted as the queuing-based and risk-sensitive (Q-R) strategy in the following simulations. To evaluate the performance of Q-R, we compare it with the following five strategies:
i) the queuing-based and risk-sensitive optimal (Q-R-Opt) strategy, i.e. the globally optimal solution to problem (28);
ii) the queuing-based and non-risk-sensitive (Q-NR) strategy, which considers the queuing effect but only optimizes the average delay performance, i.e. the weight of the CVaR is set to β = 0;
iii) the queuing-based and non-risk-sensitive optimal (Q-NR-Opt) strategy, i.e. the globally optimal solution that corresponds to the Q-NR case;
iv) the non-queuing-based and risk-sensitive (NQ-R) strategy, which takes into account both the average delay and the CVaR, but does not consider the queuing effect;
v) the non-queuing-based and non-risk-sensitive (NQ-NR) strategy, which considers neither the queuing effect nor the risk.
In the following simulations, we set the weight of the CVaR to β = 2 for the risk-sensitive strategies.
We first investigate the complementary cumulative distribution function (CCDF) of the total delay under the six offloading strategies, since the CVaR captures the tail information of the delay distribution. As presented in Fig. 3, in the region of ultra-high delay, the curves of Q-R, Q-R-Opt and NQ-R are all lower than those of their corresponding non-risk-sensitive strategies. This implies that by adding the CVaR to the optimization objective, the risk of high delay can be greatly reduced. On the other hand, we can see that for any value of the total delay, the CCDF under NQ-R and NQ-NR is greater than that under Q-R, Q-R-Opt, Q-NR and Q-NR-Opt, which means that the non-queuing strategies are more likely to incur a higher delay. This is reasonable, since the non-queuing strategies neglect the queuing effect in the strategy design, which leads to a higher queuing delay. Furthermore, the curve of Q-R is close to that of Q-R-Opt and has nearly the same trend, which indicates that the proposed algorithm achieves near-optimal performance with a great reduction in computation complexity.

In Fig. 4 and Fig. 5, we compare how the delay performance evolves with the computation frequency of the ECS under the six strategies. Specifically, Fig. 4 investigates the relationship between the average delay and the computation frequency, while Fig. 5 focuses on the 99th percentile of the total delay. It can be seen that with increasing computation frequency, the average total delay and the 99th percentile decrease for all six strategies. The reason is that the higher the computation frequency, the more computation resources are allocated to the IIoT devices and thus the lower the computation delay. Furthermore, note that the delay does not decrease much when the computation frequency is relatively high. This is because for high computation frequency, both the computation delay and the queuing delay at the ECS are relatively low, and thus the total delay depends mainly on the queuing delay at the devices. We can also see that the queuing strategies outperform the corresponding non-queuing strategies in both the average performance and the 99th percentile, which again verifies the significance of the queuing analysis. More importantly, the two figures verify the near-optimality of the proposed algorithm, and jointly indicate that the risk-sensitive strategies achieve nearly the same average total delay as the non-risk-sensitive ones, but greatly reduce the 99th percentile of the total delay by incorporating the risk into the design of the offloading strategy. In other words, the intense delay jitter can be effectively controlled under the risk-sensitive strategies at the price of only a very small degradation of the average performance.
Finally, we investigate the effect of the task size on both the average delay and the 99th percentile under the Q-R and Q-NR strategies. As shown in Fig. 6, the 99th percentile is higher than the average total delay for both strategies, since the former characterizes the worst-case delay. With the increase of the task size, both the average delay and the 99th percentile increase under both strategies. This is due to the fact that a larger task size leads to higher transmission and computation delays. Furthermore, the 99th percentile of Q-R is always lower than that of Q-NR, while the two curves of the average delay almost coincide with each other. Note that the average delay under the Q-NR strategy is a lower bound of that under Q-R. This implies that Q-R achieves nearly the same average performance as Q-NR while simultaneously improving the worst-case performance with respect to the total delay.

Conclusions
In this work, we introduce the risk management theory to the design of the edge computing-assisted IIoT system. We explore the queuing theory and the properties of the CVaR to capture the tail distribution of the end-to-end delay, and provide two upper bounds, one for the average total delay and one for the CVaR. A joint task offloading and computation resource allocation problem is formulated to simultaneously minimize the average total delay and the risk. Since the problem is a non-convex MINLP, we decompose it into two sub-problems and design a two-stage heuristic algorithm. The computation complexity of each procedure of the proposed algorithm has been analyzed. Finally, simulations are performed under a practical channel model for automated factories, and the results verify that the proposed strategy can effectively control the risk of intense delay jitter while guaranteeing the average delay performance.