Task admission control for application service operators in mobile cloud computing

The resource constraint has become an important factor hindering the further development of mobile devices (MDs). Mobile cloud computing (MCC) is a new approach proposed to extend MDs’ capacity and improve their performance through task offloading. In MCC, MDs send task requests to the application service operator (ASO), which provides application services to MDs and needs to determine whether to accept each task request according to the system condition. This paper studies the task admission control problem for ASOs, considering three features: two-dimensional resources, uncertainty, and incomplete information. A task admission control model, which considers radio resource variations as well as computing and radio resources, is established based on the semi-Markov decision process, with the goal of maximizing the ASO’s profits while guaranteeing the quality of service (QoS). To develop the admission policy, a reinforcement learning-based policy algorithm is proposed, which develops the admission policy through system simulations without knowing the complete system information. Experimental results show that the established model adaptively adjusts the admission policy to accept or reject different levels and classes of task requests based on the ASO load, available radio resources, and event type. The proposed policy algorithm outperforms the existing policy algorithms and maximizes the ASO’s profits while guaranteeing the QoS.


Introduction
In recent years, with the rapid development of wireless networks and computer technologies, mobile devices (MDs) such as smartphones, wearable devices, and smart vehicles have become very popular in many industries. Cisco predicted that the number of MDs worldwide will grow from 8.6 billion in 2017 to 12.3 billion in 2022 [1]. At the same time, with the increasing popularity of the mobile Internet, a large number of mobile applications providing different types of services have been developed. According to the Internet Trends Report 2019 [2], people are spending more and more time on their MDs and want to do everything with the help of mobile applications. However, MD resources remain limited. On the one hand, battery technology has not made breakthroughs in the short term, and the annual growth rate of battery capacity is only 5% [6]; the development of battery technology lags far behind that of semiconductor technology, which follows Moore's law. On the other hand, owing to factors such as architecture and heat dissipation, MD processing capacity, although improved, is still weak compared with ordinary computers, so MDs spend considerable time and energy executing some applications and cannot execute heavy applications at all. These constraints lead to a poor user experience and hinder the further development of MDs.
Task offloading, which offloads computing tasks to the external platform to extend available MD resources, is an effective way to solve the problem of limited MD resources. Cloud computing, as the foundation of the future information industry, is a business computing model, which provides powerful external computing resources to MDs. On the basis of cloud computing, mobile cloud computing (MCC), which offloads application tasks to the cloud via task offloading, is proposed to address the problem that MD resources are limited. MCC provides a rich pool of resources that can be accessed through wireless networks. MCC has attracted wide attention from industry and academia because of its tremendous potential. There have been many mobile cloud applications for mobile healthcare [7], e-commerce [8], and mobile education [9]. According to the assessment of Allied Analytics LLP [10], the mobile cloud market is valued at $12.07 billion in 2016 and is expected to reach $72.55 billion by 2023, with a compound annual growth rate of 30.1% from 2017 to 2023. It is believed that the mobile cloud market will become more prosperous with the development of MCC.
The task offloading architecture, the offloading policy, and the offloading granularity are the three main branches of current research on MCC [11]. How to develop offloading policies has been studied in many previous works [12][13][14][15][16]. The offloading policy aims to improve MD performance and determines whether a task should be offloaded to the cloud. If the offloading policy indicates that a task should be offloaded, a task request is sent to the application service operator (ASO), which provides application services for mobile users. The target of ASOs is to maximize their profits, and ASOs try to accept as many tasks as possible to increase their income. However, if an ASO accepts all incoming tasks, resources become overloaded and the quality of service (QoS) degrades. ASOs therefore need task admission control to determine whether to accept a new task according to their current load conditions. The features of the task admission control problem in MCC can be summarized in three points: (a) Two-dimensional resources. Mobile users in MCC are connected to the cloud via wireless networks, which have a serious impact on MCC [17]. Therefore, task admission control in MCC has to consider both computing and radio resources. (b) Uncertainty. Wireless networks are not stable and vary for many reasons, such as wireless channel fading and channel interference [18]. Wireless network variations introduce uncertainty into task admission control in MCC. In addition, dynamic resource extension and uncertain task arrivals and departures also lead to uncertainty. (c) Incomplete information. In real-life MCC, some information of the task admission control problem is often unclear or hard to obtain.
This paper strives to tackle the task admission control problem in MCC and aims to maximize the ASO's profits while ensuring the QoS. For features (a) and (b), a task admission control model, which considers radio resource variations as well as computing and radio resources, is established based on the semi-Markov decision process (SMDP) with the long-term average criterion. The SMDP is a powerful tool for solving sequential decision-making problems and provides a mathematical framework that selects an action according to the state observed at each decision epoch. The SMDP policy, composed of a set of state-action pairs, can be developed offline and applied online. For feature (c), a policy algorithm based on reinforcement learning (RL) is proposed to develop the admission policy. SMDP problems can be solved by classical dynamic programming methods. However, these methods require the exact transition probabilities, which are often hard to obtain and require extra storage [19]. At the same time, the complete system information of the task admission control problem is often unclear or hard to obtain in real-life MCC. Therefore, an RL-based policy algorithm is proposed. RL is a machine learning framework for solving sequential decision-making problems and can solve SMDP problems approximately without the complete system information. The main contributions of this paper are summarized as follows: (1) The task admission control problem is formulated as an SMDP, and an SMDP-based model, which aims to maximize the ASO's profits while ensuring the QoS, is established. To describe the task admission control problem in MCC accurately, the established model considers two-dimensional resources (computing and radio resources), system uncertainty (radio resource variations, task uncertainty, and dynamic resource extension), and multi-level and multi-class application services.
(2) An RL-based policy algorithm is proposed to develop the admission policy. The proposed policy algorithm develops the admission policy through system simulations without requiring the complete system information. The policy can be developed offline and applied online, and the admission control depends only on the current system state. These advantages make the proposed policy algorithm efficient for the task admission control problem in real-life MCC, whose complete information is often unclear or hard to obtain. (3) Extensive simulation experiments are conducted to verify the established system model and proposed policy algorithm. The impact of system parameters on the ASO's profits and QoS is evaluated and analyzed.
To verify the efficiency of the proposed algorithm, it is compared with existing algorithms such as the threshold-based policy algorithm, the greedy policy algorithm, and the random policy algorithm.
The remainder of this paper is organized as follows. Section 2 reviews the related work. In Section 3, we first describe the system model and then illustrate the RL-based policy algorithm. In Section 4, the established model and proposed policy algorithm are evaluated. Section 5 concludes this paper.

Related work
The offloading policy, which is developed by the offloading decision-making algorithm, determines whether a task request is sent to the ASO. We first briefly review the work that focuses on offloading decision-making. The admission control problem also arises in wireless networks; therefore, after reviewing the work on task admission control in MCC, we also review the work on admission control in wireless networks.

Offloading decision-making
As mentioned above, how to make offloading decisions is a main branch of current research on MCC, and this problem is usually described as an application partitioning problem. Many works proposed algorithms to develop offloading policies that improve MD performance. Zheng et al. investigated the problem of multi-user task offloading for MCC under a dynamic environment, wherein mobile users become active or inactive dynamically and the wireless channels for mobile users to offload tasks vary randomly [12]. They formulated the mobile users' offloading decision process under the dynamic environment as a stochastic game and proved that the formulated stochastic game is equivalent to a weighted potential game, which has at least one Nash equilibrium. Mahmoodi et al. proposed an energy-efficient joint scheduling and task offloading scheme for MDs running applications with arbitrary component dependency graphs [13]. They defined a net utility that trades off the energy saved by the MD against constraints on the communication delay, overall application execution time, and component precedence ordering. Kumari et al. considered a trade-off between time and cost for offloading in MCC and proposed an approach consisting of the cost- and time-constrained offloading algorithm, a task scheduling algorithm based on teaching-learning-based optimization, and energy saving using the dynamic voltage and frequency scaling technique [14]. Hong and Kim proposed optimal transmission scheduling and optimal service class selection for task offloading while capturing the trade-off between energy, latency, and pricing [15]. They formulated the transmission scheduling problem as dynamic programming and derived its optimal scheduling and two suboptimal scheduling algorithms. Hekmati et al. considered the multi-decision problem when task execution completion times are subject to hard deadline constraints and when the wireless channel can be modeled as a Markov process [16].
They proposed an online mobile task offloading algorithm named MultiOpt to develop the offloading policy. In [3], Zhou et al. studied the data offloading problem and formulated it as an optimization problem, which also involves offloading decision-making, in which the mobile network operator decides whether a WiFi access point is selected to offload traffic. They designed an effective reverse auction-based incentive mechanism to stimulate WiFi access points to participate in the data offloading process.

Task admission control
After offloading decisions are made, requests of offloadable tasks are sent to the ASO. The ASO needs to use admission control to determine whether to accept a task and to allocate resources for it. Several works studied task admission control from different aspects. Guo et al. established an ASO resource model using queuing theory and optimized admission control for multi-type task requests [20]. They modeled the admission control problem as an NP-hard optimization problem and used the moment-based convex linear matrix inequality relaxation to develop the admission policy. Lyu et al. studied the task admission problem with the aim of minimizing the total energy consumption while guaranteeing the latency requirements of MDs [21]. They transformed the admission control problem into an integer programming problem with the optimal substructure by pre-admitting resource-restrained MDs and proposed a quantized dynamic programming algorithm to develop the admission policy. Liu and Lee studied the admission control and resource allocation problem for partitioned mobile applications in MCC [22]. A discounted SMDP-based model was proposed to solve the admission control and resource allocation problem, and the policy iteration approach was used to develop the optimal policy. Liu et al. focused on the resource allocation problem for the cloudlet-based MCC system with resource-intensive and latency-sensitive mobile applications [23]. They proposed a joint multi-resource allocation framework based on the SMDP and used linear programming to obtain the optimal resource allocation policy among multiple mobile users. Wang et al. studied the admission control problem in the multi-server and multi-user situation with the aim of minimizing the total energy consumption of MDs while guaranteeing their latency requirements [24]. They formulated the admission control problem as a multi-choice integer program and utilized Ben's genetic algorithm to solve it. Chen et al.
proposed a comprehensive framework consisting of a resource-efficient computation offloading mechanism for users and a joint communication and computation resource allocation mechanism for the network operator [25]. They formulated the admission control problem as an NP-hard optimization problem and designed an approximation algorithm based on user ranking criteria to develop the admission policy. Qi et al. proposed a multi-level computing architecture coupled with admission control to meet the heterogeneous requirements of vehicular services and modeled the admission control problem as an MDP that optimizes the network throughput [26]. Khojasteh et al. proposed two task admission algorithms that keep the cloud system in the stable operating region by using two controlling parameters, namely the full-rate task acceptance threshold and the filtering coefficient [27]. Their first admission algorithm, based on the long-term estimation of the average utilization and offered load, is lightweight and appropriate for cloud systems with a stable task arrival rate. Their second admission algorithm, based on the instantaneous utilization, is computation-intensive and appropriate for systems with a varying task arrival rate. Lyazidi et al. focused on the admission control and resource allocation problem in the cloud radio access network [28]. They formulated the problem as an optimization problem constrained by mobile users' QoS requirements, the maximum transmission power, and the fronthaul link capacity. They reformulated the original nonlinear optimization problem as a mixed-integer linear program and proposed a two-stage algorithm to solve it.

Admission control in wireless networks
Admission control is also studied in the research field of wireless networks. Mirahsan et al. studied the admission control problem for wireless virtual networks in heterogeneous wireless networks with the goal of improving the QoS of network operators and proposed an admission method including the feedback information of virtual network users [29]. They formulated the admission control problem as a convex optimization problem, which allows the general multi-association between users and base stations, and proposed a solution algorithm for heterogeneous traffic distribution networks. Dromard et al.
proposed an admission control model combining dynamic link scheduling for the bandwidth limitation problem of wireless mesh networks and transformed it into a 0-1 linear programming problem, aiming to optimize the network bandwidth usage [30]. Zhang et al. studied the admission control problem in sensor nodes and developed stochastic models in wireless sensor networks to explore admission control with the sleep/active scheme [31]. Shang et al. established an admission control model based on the matching game and multi-attribute decision-making according to the network's attributes and system resource allocation [32]. They proposed an algorithm that balances the interests of the network and the user, reflects the superiority of a balanced decision by both parties, and guarantees their common interests. This paper focuses on the task admission control problem in MCC and fully considers features (a)-(c) in model establishment and algorithm design, which goes beyond existing works. Existing models of task admission control in MCC ignore feature (b), uncertainty, and assume that the wireless networks are stable. At the same time, existing works on admission control in wireless networks only need to consider the radio resource load conditions and thus cannot capture feature (a), two-dimensional resources. Different from these works, two-dimensional resources (computing and radio resources), system uncertainty (radio resource variations, task uncertainty, and dynamic resource extension), and multi-level and multi-class application services are considered in the established model, making it more accurate in describing the task admission control problem in MCC. Furthermore, existing works do not pay enough attention to feature (c), incomplete information, and use dynamic programming methods, which need the complete system information to develop the admission policy.
The proposed RL-based policy algorithm develops the admission policy through system simulations without requiring the complete system information.

Mobile cloud computing system architecture
The MCC system architecture, illustrated in Fig. 1, is divided into three parts: the mobile users, the ASO, and the cloud operator. The cloud operator manages physical resources and provides the virtual resource renting service. The ASO obtains virtual resources from the resource pool and does not need to own hardware equipment, which frees the ASO from equipment purchase and maintenance and lets it focus more on application service development. The ASO provides application services (e.g., augmented reality (AR), virtual reality (VR), and speech and image recognition) for mobile users. Task requests from mobile users are sent to the ASO. After receiving a task request, the admission controller determines whether to accept the task.

SMDP-based task admission control model
In this section, the SMDP-based task admission control model is illustrated. The application service, the state space, the action space, and the reward function of the SMDP-based model are defined. The task admission control model is abstracted as the model shown in Fig. 2.
The admission controller decides whether to accept a task request according to the system state when the request arrives. If the task is accepted, the ASO allocates resources and executes the task. Many ASOs offer the same services for mobile users in the cloud service market [33]. If a task request is rejected, it leaves the current ASO and switches to another ASO that provides the same application service. The system state is updated after a system event occurs.
(1) Application service
The ASO provides multi-level and multi-class application services to mobile users. This paper assumes that the ASO provides L levels and M classes of application services; thus, the ASO receives L × M types of task requests from mobile users. If a task request is accepted, the ASO allocates resources to execute the task according to the level it requires. Let {(c_l, b_l) | 1 ≤ l ≤ L} denote the resources that the ASO provides, where c_l = x_l c_u and b_l = y_l b_u represent the computing and radio resources provided by the ASO for l-level application services, respectively. c_u (b_u) represents one computing (radio) resource unit, i.e., the minimum computing (radio) resource allocation provided to mobile users; for example, c_u = 1 GHz and b_u = 100 kbps. In the following description, the unit is omitted; that is, c_l = x_l (b_l = y_l) has the same meaning as c_l = x_l c_u (b_l = y_l b_u).
The task request rejecting probability (referred to as the "rejecting probability" for simplicity in the following description) is taken as the indicator of QoS. Let P_l^r denote the rejecting probability that the ASO guarantees for the l-level application services. In general, the higher the application service level, the higher the QoS the ASO guarantees. Also, if a mobile user purchases a higher-level application service, the ASO allocates more resources. Mathematically, if 1 ≤ l1 < l2 ≤ L, then c_{l1} < c_{l2}, b_{l1} < b_{l2}, and P_{l1}^r > P_{l2}^r. The notations used in this paper are summarized in Table 1.
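As a concrete illustration of the multi-level service definition and the monotonicity conditions above, the following Python sketch encodes a hypothetical level table (all numbers are assumptions, not values from the paper) and checks that resources grow and guaranteed rejecting probabilities shrink with the level:

```python
# Hypothetical level table: each entry is (level l, computing units x_l,
# radio units y_l, guaranteed rejecting probability Pr_l). The numbers are
# assumptions for illustration, not values from the paper.
levels = [
    (1, 1, 1, 0.20),
    (2, 2, 2, 0.10),
    (3, 4, 4, 0.05),
]

def monotone(levels):
    """Check c_l1 < c_l2, b_l1 < b_l2, and Pr_l1 > Pr_l2 for all l1 < l2."""
    for (_, c1, b1, p1), (_, c2, b2, p2) in zip(levels, levels[1:]):
        if not (c1 < c2 and b1 < b2 and p1 > p2):
            return False
    return True

print(monotone(levels))  # True: higher levels get more resources, tighter QoS
```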
(2) State space
At a decision epoch, the state is the system descriptor, and the admission controller makes decisions according to the current state. The state space, represented by the set S, is composed of all system states. The state s (s ∈ S) is expressed as s = (Z(s), B(s), E(s)).
Z(s) = (Z_l^m(s)), 1 ≤ l ≤ L, 1 ≤ m ≤ M, represents the numbers of tasks that are being executed in the ASO, and its element Z_l^m(s) represents the number of l-level and m-class tasks that are being executed. B(s) represents the available ASO radio resources. As mentioned above, radio resources vary for many reasons, such as channel fading and channel interference [18]. Moreover, the ASO radio resources rented from the cloud operator may vary within the service-level agreement (SLA) between the ASO and the cloud operator. This paper assumes that B(s) varies from B_L to B_U, in which B_L and B_U represent the lower and upper radio resource bounds, respectively. E(s) represents the event that occurs, such as a task arrival, a task departure, or a radio resource variation. An advantage of cloud computing is that it allows dynamic resource extension, so the ASO computing resources can be further extended. Let C denote the base computing resources and C_U denote the upper bound of the extendable computing resources. The computing and radio resources occupied by the tasks that are being executed should not exceed the computing and radio resource upper bounds, which is ensured by the constraints

Σ_{l=1}^{L} Σ_{m=1}^{M} Z_l^m(s) c_l ≤ C_U and Σ_{l=1}^{L} Σ_{m=1}^{M} Z_l^m(s) b_l ≤ B_U.

In addition, Z_l^m(s) ≥ 1 when a departure event of an l-level and m-class task occurs, which ensures that at least one such task is being executed when a departure event occurs.
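The state constraints above can be sketched as a feasibility check. The resource units, bounds, and the event encoding below are illustrative assumptions:

```python
# Sketch of the state s = (Z(s), B(s), E(s)) resource constraints: occupied
# computing resources must not exceed C_U, occupied radio resources must not
# exceed B_U, and a departure event D_l^m requires at least one running
# (l, m) task. All numbers are illustrative assumptions.
C_U, B_U = 20, 20                      # upper bounds on computing/radio resources
c = {1: 1, 2: 2, 3: 4}                 # computing units c_l per level
b = {1: 1, 2: 2, 3: 4}                 # radio units b_l per level

def feasible(Z, event=None):
    """Z maps (level, cls) -> number of running tasks; event is e.g. ('D', l, m)."""
    comp = sum(c[l] * n for (l, _m), n in Z.items())
    radio = sum(b[l] * n for (l, _m), n in Z.items())
    if comp > C_U or radio > B_U:
        return False
    # A departure event D_l^m requires at least one running (l, m) task.
    if event and event[0] == 'D':
        return Z.get((event[1], event[2]), 0) >= 1
    return True

Z = {(1, 1): 3, (2, 1): 2}             # three 1-level and two 2-level tasks run
print(feasible(Z, ('D', 2, 1)))        # True: constraints hold, a (2,1) task runs
```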

(3) Action space
The action space for state s is the set of all possible actions that can be taken at state s. When an event E(s) occurs, the admission controller selects an action from the action space A(s). The action space A(s) is a subset of A = {a_d, a_o, a_a, a_r}, which represents all actions in the system model. If a task arrival event occurs (E(s) = A_l^m), the task request can be accepted or rejected, denoted as taking action a_a or a_r, respectively. However, the task request must be rejected when the required ASO computing or radio resources would exceed their upper bounds. Therefore, if E(s) = A_l^m, the action space is A(s) = {a_a, a_r} when sufficient resources remain and A(s) = {a_r} otherwise.
(4) Reward function
The reward function represents the profit of the decision-making at the current state. The reward function

r(k, s, a) = f_r(k, s, a) − τ(k, s, a) [o_1(k, s, a) + o_2(k, s, a)]    (5)

represents the profit resulting from taking action a at state s with next state k. The income term

f_r(k, s, a) = R_l^m if E(s) = A_l^m and a = a_a, and f_r(k, s, a) = 0 otherwise,

represents the income from taking action a at state s, in which R_l^m represents the income from accepting and executing an l-level and m-class task. The system cost consists of two parts: the penalty for a shortage of radio resources and the cost of occupying resources to execute tasks. The second portion of Eq. (5) represents the system cost, in which τ(k, s, a) represents the sojourn time from current state s to next state k after taking action a, o_1(k, s, a) represents the penalty per unit time for the radio resource shortage, and o_2(k, s, a) represents the system cost per unit time of occupying the computing and radio resources rented from the cloud operator to execute tasks. After action a is taken at state s, the numbers of tasks that are being executed are given by Z_l^m(k), and the available radio resources are B(k). o_1(k, s, a) is calculated by

o_1(k, s, a) = F_0 max(0, Σ_{l,m} Z_l^m(k) b_l − B(k)),

in which F_0 represents the penalty coefficient, and the max term denotes the radio resource shortage, which may be caused by a radio resource variation event.
If there are enough radio resources, o_1(k, s, a) is equal to zero; otherwise, o_1(k, s, a) is positive. When radio resources are short, the ASO cannot allocate sufficient radio resources to the tasks, and the ASO pays a punitive cost to mobile users as compensation for the radio resource shortage. One simple way to handle radio resource reallocation and compensation division among mobile users is to reduce the radio resource allocation of each task and divide the compensation proportionally. o_2(k, s, a) is calculated by

o_2(k, s, a) = f_c(x) x + f_b y,

in which x and y denote the occupied computing and radio resources, and f_c(·) and f_b represent the cost coefficients of occupying the computing and radio resources, respectively. Generally, dynamically extended computing resources are more expensive, and

f_c(x) x = f_c0 x if x ≤ C, and f_c(x) x = f_c0 C + f_c1 (x − C) if x > C,

is used to calculate the cost of occupying the computing resources x, in which f_c0 ≤ f_c1; here f_c0 C represents the cost of occupying the base computing resources C, and f_c1 (x − C) represents the cost of occupying the dynamically extended computing resources (x − C). f_c0, f_c1, and f_b are constants. On the basis of the system model, the decision process of the admission controller can be represented as the process shown in Fig. 3 [34]. At time t_i, the admission controller observes the current state s_i and selects action a_i from action space A(s_i). This action-taking has two effects on the system: the controller receives the corresponding reward, and the system state is affected by the action and enters next state s_{i+1}. At time t_{i+1}, the admission controller faces the same problem as at the previous decision-making time, that is, selecting an action according to the current system state. The decision-making process continues in this form and generates a policy made up of state-action pairs together with a reward sequence. The policy-solving problem is to find a policy that maximizes the value of a function (criterion) of the reward sequence under this policy.
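A minimal sketch of the reward computation described by Eq. (5) follows; the coefficient values and the helper names are assumptions, not parameters from the paper:

```python
# Sketch of r(k, s, a) = f_r - tau * (o1 + o2): income minus the sojourn-time-
# weighted radio-shortage penalty o1 and occupation cost o2. F0, fc0, fc1, fb,
# C, and the sample numbers are assumed parameters for illustration.
F0 = 5.0                 # penalty coefficient for radio shortage
fc0, fc1 = 1.0, 2.0      # cost coefficients: base / dynamically extended computing
fb = 0.5                 # cost coefficient for radio resources
C = 10                   # base computing resources

def occupation_cost(x, y):
    """o2: cost per unit time of occupying x computing and y radio resources."""
    comp_cost = fc0 * x if x <= C else fc0 * C + fc1 * (x - C)
    return comp_cost + fb * y

def reward(income, tau, occupied_radio, available_radio, occupied_comp):
    """r(k, s, a) = f_r - tau * (o1 + o2)."""
    shortage = max(0.0, occupied_radio - available_radio)
    o1 = F0 * shortage
    o2 = occupation_cost(occupied_comp, occupied_radio)
    return income - tau * (o1 + o2)

# Accepting a task that earns 30 UM, with sojourn time 0.5, 8 radio units
# occupied out of 10 available, and 12 computing units occupied:
print(round(reward(30.0, 0.5, 8, 10, 12), 2))  # 21.0
```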
A long-term average criterion, expressed as

ρ^π = lim_{N→∞} E[ Σ_{i=0}^{N−1} r(s_{i+1}, s_i, π(s_i)) ] / E[ Σ_{i=0}^{N−1} τ(s_{i+1}, s_i, π(s_i)) ],

in which π(s) represents the policy, is used in the system model, and the goal is to maximize the profit per unit time.

Task admission control policy algorithm
In this paper, an RL-based task admission control policy algorithm, whose pseudo-code is shown in Fig. 4, is proposed. The RL-based policy algorithm has two loops, whose execution times are N and V. It also needs space to store Q(s̃, a), Q+(s̃, a), and some intermediate variables. Therefore, the RL-based policy algorithm has time complexity O(NV) and space complexity O(|S̃||A|), in which |S̃| represents the aggregated state space size and |A| represents the action space size. The RL model is illustrated in Fig. 5: the admission controller learns by interacting with the system environment. The RL-based policy algorithm is a simulation-based algorithm, which develops the approximate optimal policy using observed data from the real-life system or through system simulations without the complete system information. Therefore, it has a wider range of use and can be applied in more problem scenarios. Long-term average Q-learning [19], which belongs to the value iteration-based RL methods, is used to develop the policy. By using Q-learning, the admission controller can make decisions after learning an action-value function, which gives a value for each state-action pair. For a state, the admission controller selects the action with the highest value given by the action-value function as the optimal action. In the learning process, the QoS constraint should be considered, and thus the system state is modified to handle the QoS constraint. As the core components of the RL-based policy algorithm, the action-value function and the QoS constraint are illustrated in the following description.

Action-value function
Let Q(s, a) represent the action-value function, whose value denotes the average adjusted value of taking action a at state s. According to [19], the Bellman equation for the long-term average reward SMDP-based problem can be expressed as

Q*(s, a) = Σ_k p(k|s, a) [ r(k, s, a) − ρ* τ(k, s, a) + max_{a′} Q*(k, a′) ],    (11)

in which Q*(s, a) represents the average adjusted value obtained by taking actions optimally, p(k|s, a) represents the transition probability that the system transfers from state s to state k after taking action a, and ρ* represents the optimal average reward. The optimal policy is π*(s) = arg max_a Q*(s, a). Based on the Robbins-Monro algorithm, a temporal difference method, expressed as

Q(s, a) ← Q(s, a) + α [ r(k, s, a) − ρ τ(k, s, a) + max_{a′} Q(k, a′) − Q(s, a) ],    (12)

is used to update Q(s, a) in the iterations, in which α represents the learning rate, ρ represents the average reward, and k represents the next state. Equation (12) shows that the RL-based policy algorithm does not require transition probabilities and can run with data from simulations or the real-life system. At the initial step, all Q(s, a) are set to the same value (e.g., 0). When visiting a state, the RL-based policy algorithm needs to simulate an action-taking. To avoid falling into a locally optimal policy and to improve the probability of achieving the globally optimal policy, the simulated annealing algorithm [35], which simulates the annealing process of heated solids and brings random factors into the selection process, is used to select an action for the state-action pair whose action-value function is to be updated. A random action that may be worse than the greedy action is selected with a certain probability. When selecting an action for a state-action pair, a random number ϕ ∈ [0, 1) is generated and compared with

p(a_greedy → a_random) = e^{[Q(s, a_random) − Q(s, a_greedy)] / T},    (13)

which represents the probability of selecting action a_random instead of action a_greedy.
a_greedy represents the greedy action that results in the highest action-value function, a_random represents a random action selected from the action space, and T represents the current temperature. T is calculated by

T = T_0 T_γ^n,    (14)

in which T_0 represents the initial temperature, T_γ represents the temperature dropping coefficient, and n represents the number of iterations. If ϕ ≤ p(a_greedy → a_random), a_random is selected; otherwise, a_greedy is selected.
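The simulated-annealing action selection of Eqs. (13) and (14) can be sketched as follows, assuming a geometric cooling schedule T = T_0 · T_γ^n; all parameter values are illustrative:

```python
# Sketch of simulated-annealing action selection: the greedy action is replaced
# by a random action with probability exp([Q(s,a_random) - Q(s,a_greedy)] / T).
# The cooling schedule T = T0 * Tg**n and all parameter values are assumptions.
import math
import random

def select_action(Q, s, actions, n, T0=10.0, Tg=0.95):
    T = T0 * (Tg ** n)                          # current temperature
    a_greedy = max(actions, key=lambda a: Q.get((s, a), 0.0))
    a_random = random.choice(actions)
    p = math.exp((Q.get((s, a_random), 0.0) - Q.get((s, a_greedy), 0.0)) / T)
    # p = 1 when the random action is no worse than the greedy one.
    return a_random if random.random() <= p else a_greedy

Q = {("s0", "accept"): 4.0, ("s0", "reject"): 1.0}
a = select_action(Q, "s0", ["accept", "reject"], n=100)
print(a in ("accept", "reject"))  # True; early on, random actions are likelier
```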
After the simulated action a is taken, the system goes to the next simulated state k. The average reward is updated by

ρ ← (1 − β) ρ + β (r_n / τ_n),    (15)

in which β represents another learning rate, and r_n and τ_n represent the accumulated reward and accumulated time until the nth iteration, respectively. The two learning rates (α and β) decay with the number of iterations, e.g., α = α_0 / (1 + n) and β = β_0 / (1 + n), respectively. After a certain number of iterations, the action-value function is learnt, and the approximate optimal policy is derived from the action-value function.
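The resulting update step can be sketched in the style of SMART-type average-reward Q-learning; the decaying learning rate and the example inputs are assumptions:

```python
# Sketch of the average-reward Q-learning update: Q(s, a) moves toward
# r - rho * tau + max_a' Q(k, a'), while rho tracks accumulated reward over
# accumulated time. The decaying rates and sample inputs are assumptions.
def q_update(Q, s, a, r, tau, k, actions, rho, n):
    """One update of Q(s, a); Q is a dict keyed by (state, action)."""
    alpha = 1.0 / (1 + n)                       # assumed decaying learning rate
    best_next = max(Q.get((k, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r - rho * tau + best_next - old)
    return Q

def rho_update(rho, beta, total_reward, total_time):
    """rho <- (1 - beta) * rho + beta * (r_n / tau_n)."""
    return (1 - beta) * rho + beta * (total_reward / total_time)

Q = q_update({}, "s", "a", r=10.0, tau=2.0, k="k", actions=["a"], rho=1.0, n=0)
print(Q[("s", "a")])                            # 8.0 = 10 - 1.0*2 + 0
```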

QoS constraint
The QoS constraint is formulated as

lim_{N→∞} [ Σ_{i=0}^{N−1} P_l^r(s_{i+1}) τ(s_{i+1}, s_i, a_i) ] / [ Σ_{i=0}^{N−1} τ(s_{i+1}, s_i, a_i) ] ≤ P_l^r, 1 ≤ l ≤ L,    (18)

with the long-term average criterion. Equation (18) indicates that the long-term time-average rejecting probability of the l-level application services is no more than P_l^r. In Eq. (18), s_{i+1} represents the next state of state s_i, a_i represents the action taken at state s_i, and P_l^r(s_{i+1}) represents the rejecting probability at state s_{i+1}. The Lagrange multiplier framework [36] is used to deal with the QoS constraint. According to Eq. (18), the QoS constraint depends on the rejecting probability and the sojourn time. Therefore, the expression of state s is extended as s = (Z(s), B(s), E(s), N^t(s), N^r(s), τ), in which N^t(s) = (N_l^t(s)), 1 ≤ l ≤ L, represents the total numbers of task requests, N^r(s) = (N_l^r(s)), 1 ≤ l ≤ L, represents the total numbers of rejected task requests, and τ represents the sojourn time between decision epochs. However, N_l^t(s) and N_l^r(s) (N_l^r(s) ≤ N_l^t(s)) can be any nonnegative integers, and τ is a decimal, making the extended state space infinite. To add the QoS constraint into the RL-based policy algorithm, the extended state space must be finite. The quantization method [37] for the rejecting probability and sojourn time is used to aggregate the extended states and make the extended state space finite. The aggregated state is denoted as s̃ = (Z(s), B(s), E(s), h_1(s̃), h_2(s̃)), in which h_1(s̃) = (h_1^l(s̃)), 1 ≤ l ≤ L, represents the quantized rejecting probabilities of l-level task requests, and h_2(s̃) represents the quantized sojourn time. The rejecting probability N_l^r(s) / N_l^t(s) is quantized into 100 levels, and τ is quantized into 2 levels: if τ ≤ τ̄ (τ̄ represents the average sojourn time), τ is quantized to level 1; otherwise, τ is quantized to level 2.
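The quantization step can be sketched as follows; the granularity (100 rejecting-probability levels, 2 sojourn-time levels) follows the text, while the helper names are mine:

```python
# Sketch of the state-aggregation step: the rejecting probability N_r / N_t is
# quantized into 100 levels and the sojourn time into 2 levels (below/above the
# average sojourn time), making the extended state space finite.
def quantize_reject_prob(n_rejected, n_total, levels=100):
    """Map N_r / N_t in [0, 1] to an integer level in 0..levels-1."""
    if n_total == 0:
        return 0
    p = n_rejected / n_total
    return min(levels - 1, int(p * levels))

def quantize_sojourn(tau, tau_mean):
    """Level 1 if tau <= average sojourn time, otherwise level 2."""
    return 1 if tau <= tau_mean else 2

print(quantize_reject_prob(7, 50))   # 14: rejecting probability 0.14
print(quantize_sojourn(0.3, 0.5))    # 1: below the average sojourn time
```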
After the extended states are aggregated, the action-value function with the QoS constraint is denoted Q(s̃, a). In the Lagrange multiplier framework, the reward function in Eq. (19) is adjusted with the Lagrange multipliers ω = (ω_l)_{l=1}^L. In Eq. (19), r(k, s̃, a) equals the original reward function, that is, r(k, s̃, a) = r(k, s, a). q_l(k, s̃, a) denotes the cost function associated with the QoS constraint, with q_l(k, s̃, a) = f(h_1^l(k)) τ(k, s̃, a), where f(h_1^l(k)) denotes the rejecting probability represented by level h_1^l(k). To find an optimal ω_l, ω_l is updated iteratively, where δ_l is an updating coefficient and P_r^{ω_l} denotes the rejecting probability obtained with ω_l.
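The Lagrangian adjustment and multiplier update can be sketched as follows; the subgradient-style update toward the constraint boundary is a common hypothetical choice, since the paper's exact update expression is not reproduced here:

```python
def adjusted_reward(r, costs, omega):
    """Lagrangian reward: original reward minus weighted QoS costs.

    costs[l] plays the role of q_l(k, s, a); omega[l] is the
    Lagrange multiplier omega_l for the l-level constraint.
    """
    return r - sum(w * q for w, q in zip(omega, costs))

def update_multipliers(omega, measured_rp, target_rp, delta):
    """Subgradient-style update of the Lagrange multipliers.

    Each omega_l moves up when the measured rejecting probability
    P_r^{omega_l} exceeds its bound P_r^l, and down otherwise,
    clipped at zero. This specific form is an illustrative assumption.
    """
    return [max(0.0, w + d * (p - p_max))
            for w, d, p, p_max in zip(omega, delta, measured_rp, target_rp)]
```

Raising ω_l makes rejections of l-level requests more costly in the adjusted reward, steering the learned policy back toward the QoS bound.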

Results and discussion
In this section, extensive simulation experiments are conducted to evaluate the established system model and the proposed policy algorithm. The arrival of l-level and m-class task requests is assumed to follow a Poisson process with mean rate λ_l^m. If a task request is accepted, the ASO allocates resources and executes the task. The resource occupation time of a task is assumed to follow an exponential distribution; the mean occupation time of an l-level and m-class task is 1/μ_l^m, and thus the mean task departure rate is μ_l^m. The occurrence of the radio resource variation event is assumed to follow a Poisson process with mean rate λ_o, and the radio resources vary uniformly between their upper and lower bounds. Under these settings, the cumulative event rate at state s with action a, denoted γ(s, a), is the sum of all event rates. According to the property of exponential distributions that the minimum of independent exponential random variables is itself exponentially distributed with the cumulative rate parameter [38], the sojourn time until the earliest event follows an exponential distribution with rate parameter γ(s, a). The sojourn time τ(k, s, a) is therefore a random variable generated according to this distribution, where k is the next state after the earliest event occurs. R_l^m denotes the income from accepting an l-level and m-class task and is set to R_l · (1/μ_l^m), where R_l is the income per unit time and η_l = R_l / R_1 is the ratio of R_l to R_1. Two indicators are of concern: the system reward/profits (SR), measured in UM (Unit Money), and the rejecting probability (RP). This section first evaluates the established system model and then compares the performance of the proposed policy algorithm against other algorithms. The default simulation parameters are listed in Table 2.

Figure 6 shows the SR and RPs under different task request arrival rates.
In this experiment, the arrival rates are set equal, and λ_sum denotes their sum, that is, λ_1^1 = λ_1^2 = λ_2^1 = λ_2^2 = λ_sum / 4. It can be observed that the SR increases with increasing λ_sum, with successive increments of 7.31 UM, 6.83 UM, 6.38 UM, 5.72 UM, 5.27 UM, 4.68 UM, 3.95 UM, 3.5 UM, 2.71 UM, 2.38 UM, 1.62 UM, and 0.59 UM, which shows that the SR grows slowly when λ_sum becomes large. The RP of 1-level task requests (RP(l = 1)) increases from 1.77% to 8.07%, and the RP of 2-level task requests (RP(l = 2)) increases from 0.33% to 4.77%. When λ_sum is 24.75, RP(l = 1) (8.07%) is slightly larger than the maximum allowable rejecting probability P_r^1 = 8%, with a difference of 0.07%. The RL-based policy algorithm iteratively searches for the approximate optimal admission policy while guaranteeing the QoS requirement; the policy is considered to meet the requirement within the accuracy range, which is 0-0.1% in this paper. When λ_sum is small, the ASO resources are ample and the QoS requirement is easily met; therefore, more task requests are accepted as λ_sum increases, and the SR rises rapidly. When λ_sum becomes large, many task requests are rejected because of the heavy ASO load. With the help of the RL-based policy algorithm, the system model adaptively adjusts the admission policy to the arrival rates while satisfying the QoS, so the SR keeps increasing and the QoS requirement remains satisfied. The average increment rates of RP(l = 1) and RP(l = 2) are 0.54% and 0.38%, respectively, showing that RP(l = 1) increases faster than RP(l = 2). This is because the income from accepting a 2-level task is larger than that from a 1-level task, and thus more 1-level task requests are rejected.

Figures 7 and 8 show the SR and RPs under different arrival rates of the 1-level and 2-level task requests, respectively. In the experiment of Fig. 7, the arrival rates of the 1-level task requests are set equal, and the arrival rates of the 2-level task requests are set to the default simulation parameters, that is, λ_1^1 = λ_1^2 = λ_{l=1}, λ_2^1 = λ_2^2 = 5. From Fig. 7, it can be observed that the SR first increases and then decreases with increasing λ_{l=1}, while RP(l = 1) first decreases and then increases. When λ_{l=1} is small, RP(l = 1) decreases as λ_{l=1} grows, and more 1-level task requests are accepted, which increases the SR; at the same time, a small λ_{l=1} causes few 2-level task requests to be rejected, so RP(l = 2) increases slowly. When λ_{l=1} is large, RP(l = 1) increases as λ_{l=1} grows, and more 1-level task requests are rejected to balance the ASO load; a large λ_{l=1} also leads to a rapid increase of RP(l = 2). In the experiment of Fig. 8, the arrival rates of the 2-level task requests are set equal, and the arrival rates of the 1-level task requests are set to the default simulation parameters, that is, λ_1^1 = λ_1^2 = 5, λ_2^1 = λ_2^2 = λ_{l=2}. From Fig. 8, it can be observed that the SR, RP(l = 1), and RP(l = 2) all increase with increasing λ_{l=2}. The average increment rates of RP(l = 2) are 0.85% and 0.66% in Figs. 7 and 8, respectively, meaning that RP(l = 2) increases faster in Fig. 7. The reason is that the increasing number of 1-level tasks occupies too many resources, so more 2-level task requests are rejected to balance the ASO load.

Figures 9 and 10 show the SR and RPs under different arrival rates of the 1-class and 2-class task requests, respectively. In the experiment of Fig. 9, the arrival rates of the 1-class task requests are set equal, and the arrival rates of the 2-class task requests are set to the default simulation parameters, that is, λ_1^1 = λ_2^1 = λ_{m=1}, λ_1^2 = λ_2^2 = 5. In the experiment of Fig.
10, the arrival rates of the 2-class task requests are set equal, and the arrival rates of the 1-class task requests are set to the default simulation parameters, that is, λ_1^1 = λ_2^1 = 5, λ_1^2 = λ_2^2 = λ_{m=2}. From Figs. 9 and 10, it can be observed that the SR, RP(l = 1), and RP(l = 2) increase with increasing λ_{m=1} and λ_{m=2}. In Fig. 9, the average increment rates of the SR, RP(l = 1), and RP(l = 2) are 14.53 UM, 1.40%, and 0.74%, respectively; in Fig. 10, they are 5.58 UM, 0.68%, and 0.53%, respectively. The income from accepting an l-level and m-class task request is R_l · (1/μ_l^m), and μ_l^1 < μ_l^2, which indicates that accepting a 1-class task request yields more income. This is reflected in the SR increments: the average SR increment rate in Fig. 9 is larger than that in Fig. 10, showing that the SR increases faster with increasing λ_{m=1}. The mean occupation time of an l-level and m-class task is 1/μ_l^m, and μ_l^1 < μ_l^2, which indicates that a 1-class task occupies ASO resources longer. Therefore, more task requests are rejected with increasing λ_{m=1}; accordingly, the average increment rates of RP(l = 1) and RP(l = 2) in Fig. 9 are larger than those in Fig. 10, showing that both RPs increase faster with increasing λ_{m=1}.

Impacts of the resources
Figure 11 shows the impacts of the resources on the SR and RPs. Larger B_L and C make the resources ample and allow the ASO to provide more of them, which leads to fewer penalties for radio resource shortages, lower costs for extended computing resources, and more task request acceptances. As reflected in Fig. 11, the SRs (SR(η_2 = 6.5) and SR(η_2 = 4.5)) increase, and the RPs (RP(l = 1, η_2 = 6.5), RP(l = 2, η_2 = 6.5), RP(l = 1, η_2 = 4.5), and RP(l = 2, η_2 = 4.5)) decrease with increasing B_L and C. η_2 is the ratio of R_2 to R_1, and a larger η_2 means more income from accepting a 2-level task request; SR(η_2 = 6.5) is larger than SR(η_2 = 4.5) for this reason. For the same reason, more 1-level task requests are rejected so that enough 2-level task requests can be accepted to optimize the SR, making RP(l = 1, η_2 = 6.5) larger than RP(l = 1, η_2 = 4.5) and RP(l = 2, η_2 = 6.5) smaller than RP(l = 2, η_2 = 4.5). In addition, it can be observed that RP(l = 2, η_2 = 6.5) and RP(l = 2, η_2 = 4.5) first decrease and then remain stable. Meanwhile, the average decrease rates of RP(l = 1, η_2 = 6.5) are 0.66% and 0.18% for 6 ≤ B_L = C ≤ 10 and 10 ≤ B_L = C ≤ 14, respectively, and those of RP(l = 1, η_2 = 4.5) are 0.99% and 0.38% over the same ranges. This shows that the RPs decrease slowly when the resources are ample: once resources are ample, they are no longer the main factor limiting task request acceptance; the resource occupation cost is.

Figure 12 shows the SR and RPs under different radio resource variation rates. It can be observed that the SRs (SR(F_0 = 100) and SR(F_0 = 175)) decrease with increasing λ_o. The radio resources become more unstable as λ_o grows, and more penalties for radio resource shortages are generated: a large λ_o raises the probability of radio resource variations during task execution, which leads to more punitive cost. For example, if the radio resources are sufficient and stable during a period, no penalty arises during that period; in the same period, if λ_o is large, the radio resources are unstable and easily fall below what is required, which incurs the penalty. To offset the cost caused by the penalties, more task requests are accepted, which is reflected in the decreasing trend of the RPs (RP(l = 1, F_0 = 100), RP(l = 2, F_0 = 100), RP(l = 1, F_0 = 175), and RP(l = 2, F_0 = 175)). F_0 is the penalty coefficient, and a larger F_0 leads to more punitive cost; therefore, SR(F_0 = 100) is larger than SR(F_0 = 175). From Fig. 12, it can also be observed that RP(l = 1, F_0 = 175) and RP(l = 2, F_0 = 175) are both larger than RP(l = 1, F_0 = 100) and RP(l = 2, F_0 = 100) when λ_o ≥ 4: when F_0 is large, more penalties are generated by radio resource variations, and more task requests are rejected. From Figs. 11 and 12, it can be concluded that stable and ample resources are crucial for improving the SR and reducing the RPs.

Impacts of the income and cost
Figure 13 shows the impacts of the income on the SR and RPs, in which η_2 is the ratio of R_2 to R_1. It can be observed that the SR increases with increasing η_2, with an average increment rate of 21.93 UM. The income from accepting a 2-level and m-class task request is R_2 · (1/μ_2^m), and a larger η_2 yields more income from accepting a 2-level task request; therefore, the SR increases rapidly with increasing η_2. Correspondingly, RP(l = 1) increases and RP(l = 2) decreases with increasing η_2: more 1-level task requests are rejected to accept more of the higher-income 2-level task requests.

Figure 14 shows the SR and RPs under different penalty coefficients. It can be seen that the SR decreases with increasing F_0. F_0 is the penalty coefficient for the radio resource shortage, so the system cost rises as F_0 grows; moreover, as F_0 increases, the ASO rejects more task requests to avoid the penalty caused by radio resource shortages, and thus the RPs increase with F_0. Both factors reduce the system reward. The increment of RP(l = 2) (2.98%) is larger than that of RP(l = 1) (2.8%) when F_0 increases from 50 to 300, showing that RP(l = 2) increases faster. This is because a 2-level task occupies more radio resources, so a large F_0 affects it more.

Figures 15 and 16 show the SR and RPs under different resource occupation cost coefficients. In the experiment of Fig. 15, the occupation cost coefficients, including f_c^0 and f_c^1, are varied together, with f denoting their common value; as f increases, the computing and radio resource occupation costs increase. As reflected in Fig. 15, the SRs (SR(η_2 = 6.5) and SR(η_2 = 4.5)) decrease with increasing f. The RPs (RP(l = 1, η_2 = 6.5), RP(l = 2, η_2 = 6.5), RP(l = 1, η_2 = 4.5), and RP(l = 2, η_2 = 4.5)) first remain relatively stable and then increase with increasing f. To counter the growing resource occupation cost, the ASO reduces the acceptance of task requests so that fewer resources are allocated. It can be observed that RP(l = 2, η_2 = 6.5) increases only slightly.
This is because η_2 = 6.5 is large enough that the occupation cost can be offset by the income from accepting 2-level task requests, so fewer 2-level task requests are rejected; this is also confirmed by the obvious increase of RP(l = 2, η_2 = 4.5). It can be seen that when η_2 decreases from 6.5 to 4.5, RP(l = 1, η_2 = 6.5) is larger than RP(l = 1, η_2 = 4.5), and RP(l = 2, η_2 = 6.5) is smaller than RP(l = 2, η_2 = 4.5): as η_2 decreases, the income from accepting a 2-level task request decreases while the income from accepting a 1-level task request increases relatively, so fewer 1-level task requests and more 2-level task requests are rejected. Figure 16 shows the SR and RPs under different values of f_c^1, the cost coefficient of occupying the dynamically extended computing resources. From Fig. 16, it can be seen that the SRs (SR(η_2 = 6.5) and SR(η_2 = 4.5)) decrease, and the RPs (RP(l = 1, η_2 = 6.5), RP(l = 2, η_2 = 6.5), RP(l = 1, η_2 = 4.5), and RP(l = 2, η_2 = 4.5)) increase with increasing f_c^1. Dynamically extended computing resources are more expensive, and a large f_c^1 leads to more occupation cost; therefore, the ASO rejects more task requests to reduce the need to extend the computing resources. Similar to Fig. 15, when η_2 = 6.5, fewer 2-level task requests are rejected because the occupation cost can be offset by the income from accepting them. When η_2 decreases from 6.5 to 4.5, the income from accepting 2-level task requests decreases while the income from accepting 1-level task requests increases relatively; therefore, the ASO rejects fewer 1-level task requests and more 2-level task requests, which is reflected in Fig. 16 by RP(l = 1, η_2 = 6.5) being larger than RP(l = 1, η_2 = 4.5) and RP(l = 2, η_2 = 6.5) being smaller than RP(l = 2, η_2 = 4.5).

Performance comparisons of the policy algorithms
In this section, the performance of four admission control policy algorithms is compared.
(1) RACPA, the random admission control policy algorithm, which accepts and rejects task requests randomly.
(2) TACPA, the threshold admission control policy algorithm, which rejects task requests when the ASO resource occupation ratio exceeds 95% and otherwise accepts all task requests.
(3) GACPA, the greedy admission control policy algorithm, which takes the action that yields the larger immediate reward when receiving a task request.
(4) RLACPA, the proposed RL-based admission control policy algorithm.
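The three baseline policies reduce to one-line decision rules; a hedged sketch with illustrative state fields (RLACPA is the policy learned by the RL procedure and cannot be reduced to such a rule):

```python
import random

def racpa(state, rng=random):
    """Random policy: accept or reject with equal probability."""
    return rng.random() < 0.5

def tacpa(state, threshold=0.95):
    """Threshold policy: reject once resource occupation exceeds 95%.

    The "occupation_ratio" field name is an illustrative assumption.
    """
    return state["occupation_ratio"] <= threshold

def gacpa(state, reward_accept, reward_reject):
    """Greedy policy: take the action with the larger immediate reward."""
    return reward_accept >= reward_reject
```

Each function returns True to accept the task request and False to reject it, which is how the comparison experiments below exercise the policies.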
The scenario parameters are shown in Table 3; the other parameters are the default simulation parameters listed in Table 2. The SRs in the different scenarios are shown in Table 4, and their box plot, which visualizes the data in Table 4, is shown in Fig. 17. Table 5 shows the RPs in the different scenarios.
As shown in Tables 4 and 5, the SRs of RACPA are the smallest, and its RPs are the largest. The reason is that RACPA ignores the system condition when making admission control decisions and accepts or rejects task requests randomly without optimizing the SR. In this experiment, RACPA generates the random policy evenly, so its RPs are about 50%.

TACPA is a commonly used policy algorithm that makes admission control decisions based on the system resource occupation ratio. TACPA does not optimize the SR and RPs from a long-term perspective and therefore cannot obtain the optimal SR while satisfying the QoS requirement; its SRs are smaller than those of RLACPA. In S2, S3, and S5, the RPs of TACPA satisfy the QoS requirement. In S1, S4, and S6-S10, although the RP(l = 1) values of TACPA satisfy the QoS requirement, the RP(l = 2) values exceed P_r^2 by 37.80%, 41.67%, 59.60%, 60.0%, 54.0%, 67.20%, and 66.0% of its value, respectively.

GACPA is another commonly used policy algorithm and makes admission control decisions greedily. GACPA selects the locally optimal action at each step but does not optimize the SR and RPs from a global perspective; therefore, it cannot obtain the optimal SR while satisfying the QoS requirement either. In S1-S5, the RPs of GACPA satisfy the QoS requirement. In S6-S10, although the RP(l = 1) values of GACPA satisfy the QoS requirement, the RP(l = 2) values exceed P_r^2 by 10.20%, 10.20%, 6.60%, 13.60%, and 7.0% of its value, respectively. The average margin by which GACPA exceeds P_r^2 (9.52%) is noticeably smaller than that of TACPA (55.18%).

As shown in Table 4, the SRs of RLACPA are larger than those of the other algorithms; the average relative differences between RLACPA and RACPA, TACPA, and GACPA are 67.29%, 8.06%, and 9.05%, respectively. As shown in Table 5, in S2, S4-S6, and S8-S10, the RPs of RLACPA meet the QoS requirement.
In S1, S3, and S7, the RP(l = 1) values of RLACPA are slightly larger than P_r^1, with differences of 0.05%, 0.06%, and 0.03%, respectively, which meets the accuracy requirement of 0.1%. As explained for Fig. 6, RLACPA iteratively searches for the optimal SR while satisfying the QoS requirement and is considered to meet the requirement within the accuracy range, which is 0-0.1% in this paper.

Conclusion
In MCC, mobile users send task requests to the ASO according to the offloading policy provided by offloading decision-making algorithms, and the ASO needs a task admission controller to decide whether to accept each task request. The features of the task admission control problem in MCC are summarized in three points: (a) two-dimensional resources, (b) uncertainty, and (c) incomplete information. Considering these three features, an SMDP-based task admission control model, which accounts for radio resource variations as well as computing and radio resources, is established. In addition, an RL-based policy algorithm, which develops the admission policy through system simulations without complete system information, is proposed. The established system model and proposed policy algorithm can be extended to more general admission control problems with one or more of the above features. Experimental results show that the SMDP-based task admission control model adaptively adjusts the admission policy to accept or reject different levels and classes of tasks according to the ASO load, available radio resources, and event type. The proposed RL-based policy algorithm outperforms the existing policy algorithms. The experimental results also show that stable and ample radio resources improve the ASO performance.
As mentioned above, wireless networks have a serious impact on MCC. The current version of the problem considers only one type of radio resource. Concurrent multipath transfer (CMT) technology can use multiple physical wireless interfaces to transfer data in MCC, combating the limited bandwidth and low robustness of wireless links. Therefore, in future work, we will study the admission control problem with consideration of CMT.
Abbreviations
MD: Mobile device; MCC: Mobile cloud computing; ASO: Application service operator; QoS: Quality of service; SMDP: Semi-Markov decision process; RL: Reinforcement learning; AR: Augmented reality; VR: Virtual reality; SLA: Service-level agreement; SR: System reward; RP: Rejecting probability; UM: Unit money; RACPA: Random admission control policy algorithm; TACPA: Threshold admission control policy algorithm; GACPA: Greedy admission control policy algorithm; RLACPA: The proposed RL-based admission control policy algorithm