 Research
 Open Access
 Published:
Resourceaware task scheduling by an adversarial bandit solver method in wireless sensor networks
EURASIP Journal on Wireless Communications and Networking volume 2016, Article number: 10 (2016)
Abstract
A wireless sensor network (WSN) is composed of a large number of tiny sensor nodes. Sensor nodes are very resourceconstrained, since nodes are often batteryoperated and energy is a scarce resource. In this paper, a resourceaware task scheduling (RATS) method is proposed with better performance/resource consumption tradeoff in a WSN. Particularly, RATS exploits an adversarial bandit solver method called exponential weight for exploration and exploitation (Exp3) for target tracking application of WSN. The proposed RATS method is compared and evaluated with the existing scheduling methods exploiting online learning: distributed independent reinforcement learning (DIRL), reinforcement learning (RL), and cooperative reinforcement learning (CRL), in terms of the tracking quality/energy consumption tradeoff in a target tracking application. The communication overhead and computational effort of these methods are also computed. Simulation results show that the proposed RATS outperforms the existing methods DIRL and RL in terms of achieved tracking performance.
Introduction
Wireless sensor networks (WSNs) [1] are an important and attractive platform for various pervasive applications like target tracking, area monitoring, routing, and innetwork data aggregation. Resource awareness is an important issue for WSNs. Basically, battery power, memory, and processing functionality form the resource infrastructure. A WSN has its own resource and design constraints. Resource constraints include a limited energy, low bandwidth, limited processing capability of the central processing unit, limited storing capacity of the storage device, and short communication range. Design constraints are applicationdependent and also depend on the environment being monitored. The environment acts as a major determinant regarding the size of the network, deployment strategy, and network topology. The number of sensor nodes or the size of the network changes based on the monitored environment. For example, in indoor environments, fewer nodes are needed to form a network in a limited space, whereas outdoor environments may require more sensor nodes to cover a huge unattended area. The deployment scheme also depends on the environment. Ad hoc deployment is preferred over a preplanned deployment when the environment is not accessible and the network is composed of a vast number of nodes.
Battery power is the main resource constraint of a WSN. One of the major reasons of energy consumption for the communication in WSN is idle mode consumption. When there is no transmission/reception, sensor nodes consume some energy for listening and waiting for the information from the neighboring nodes. Over hearing is another source of energy consumption. Over hearing means that a node picks up packets that are destined for other nodes. Packet collision is another issue of energy consumption. Collided packets should be retransmitted which require extra effort in energy consumption. Protocol overhead is also a reason of energy consumption.
As a WSN is a resourceconstrained network, there are challenges associated with task scheduling. For performing the application, sensor nodes execute some tasks. Task scheduling methods help to schedule the tasks in a way that the resource is optimized with the goal of maximizing the lifetime of the network. Sensor nodes consume some resources from the resource budget for each executed task. Scheduling can be performed online, offline, or periodically.
Sensor nodes pose strong energy limitations due to fixed battery operation [2–4]. Based on the application demand, sensor nodes need to execute a particular task at each time step. Every task execution consumes some energy from the available energy budget of the sensor node. The main goal is to improve the highest amount of performance while keeping the energy consumption low by exploiting online learning [5] for scheduling.
For example, in object tracking application, sensor nodes need to perform some tasks like sensing, transmitting, receiving, and sleeping over time steps. The performed task at each time step provides an impact on the overall tracking performance by a WSN. There is a tradeoff between tracking performance and energy consumption for the tasks. If we perform tracking all the time, then it is possible to get higher tracking quality, but this is very energy consuming. It is necessary to schedule the tasks in a way that the energy consumption is optimized, and at the same time, a certain amount of tracking quality is maintained.
Since it is not possible to schedule the tasks a priori, online and energyaware task scheduling is required. For determining the next task to execute, sensor nodes need to consider the available energy and the energy required for executing that task. Sensor nodes also need to consider the effect of executing the task on the application’s overall performance.
In this paper, an online learning algorithm is proposed for the task scheduling in order to explore the tradeoff between performance and energy consumption. A bandit solver method Exp3 (Exp3 denotes exponential weight for exploration and exploitation) is used [6]. Exp3 is an adversarial bandit solver method used for online task scheduling. This works by maintaining a list of weights for each task to perform. Using these weights, it decides randomly which task to take next and increases/decreases the relevant weights when a payoff is good or bad. Exp3 has an egalitarianism factor which tunes the desire to pick a task uniformly at random.
The proposed resourceaware task scheduling (RATS) method is based on a simulation study of the performance and energy consumption of a prototypical target tracking application. The balancing factor of the reward function is varied, and the number of nodes in the network and the randomness of moving targets to find out the tracking quality/energy consumption tradeoff are also varied. The average execution time and communication overhead for distributed independent reinforcement learning (DIRL) [7], reinforcement learning (RL) [8], cooperative reinforcement learning (CRL) [9], and Exp3 are also calculated.
The main contribution of this paper is to propose a method for RATS and to perform the evaluation with the existing methods. The proposed RATS approach also considers the cooperation where each node shares local observations of object trajectories with the neighboring nodes. This cooperation helps to improve the tracking performance of our considered tracking application.
The rest of this paper is organized as follows. Section 2 discusses related works, and Section 3 introduces the network model. Section 4 describes the underlying system model for task scheduling based on online learning. In Section 6, we present the proposed method for resourceaware task scheduling. Section 7 presents the experimental setup and discusses the simulation results for a target tracking application. Section 8 concludes this paper with a summary and brief discussion on future work.
Related works
In a resourceconstrained WSN, effective task scheduling is very important for facilitating the effective use of resources [10, 11]. Cooperative behavior among sensor nodes can be very helpful to schedule the tasks in a way that the energy consumption is optimized and also a considerable performance is maintained. Most of the existing methods of task scheduling do not provide online scheduling of tasks. Most of them rather consider static task allocation instead of focusing on distributed task scheduling. The main difference between task allocation and distributed task scheduling is that task allocation deals with the problem of finding a set of task assignments on a sensor network that minimizes an objective function such as total execution time [12, 13]. On the other hand, in a task scheduling problem, the objective is to determine the “best” order of task execution for each sensor node. Each sensor node has to execute a particular task at each time step in order to perform the application, and each node determines the next task to execute based on the observed application behavior and available resources. The following subsections describe some task scheduling methods in WSN.
Selfadaptive task allocation
Guo et al. [14] propose a selfadaptive task allocation strategy in a WSN. They assume that the WSN is composed of a number of sensor nodes and a set of independent tasks which compete for the sensors. They consider neither distributed task scheduling nor the tradeoff among energy consumption and performance.
Xu et al. [15] propose a novel hierarchical data aggregation method using compressive sensing which combines a hierarchical network configuration. Their key idea is to set multiple compression thresholds adaptively based on cluster sizes at different levels of the data aggregation tree to optimize the amount of data transmitted. The advantages of the proposed model in terms of the total amount of data transmitted and data compression ratio are analytically verified. Moreover, they formulate a new energy model by factoring in both processor and radio energy consumption into the cost, especially the computation cost incurred in relatively complex algorithms.
Collaborative resource allocation
Giannecchini et al. [16] propose an online task scheduling mechanism called collaborative resource allocation to allocate the network resources between the tasks of periodic applications in WSNs. This mechanism does also not explicitly consider energy consumption.
Meng et al. [17] argue that by carefully considering spatial reusability of the wireless communication media, they can tremendously improve the endtoend throughput in multihop wireless networks. To support their argument, they propose spatial reusabilityaware singlepath routing (SASR) and anypath routing (SAAR) protocols and compare them with existing singlepath routing and anypath routing protocols, respectively.
Rulebased method
Frank and Römer [10] propose an algorithm for generic task allocation in WSNs. They define some rules for the task execution and propose a rolerule model for sensor networks where “role” is used as a synonym for task. It is a programming abstraction of the rolerule model. This distributed approach provides a specification that defines possible roles and rules for how to assign roles to nodes. This specification is distributed to the whole network via a gateway, or alternatively, it can be preinstalled on the nodes. A role assignment algorithm takes into account the rules and node properties, which may trigger execution and network data aggregation. This generic role assignment approach does consider the energy consumption but not the ordering of tasks to sensor nodes.
Constraint satisfactionbased method
Krishnamachari and Wicker [18] examine the channel utilization as a resource management problem by a distributed constraint satisfaction method. They consider a WSN of n nodes placed randomly in a square area with a uniform, independent distribution. This work tests three selfconfiguration tasks in WSNs: partition into coordinating cliques, formation of Hamiltonian cycles, and conflictfree channel scheduling. They explore the impact of varying the transmission radius on the solvability and complexity of these problems. In the case of partition into cliques and Hamiltonian cycle formation, they observe that the probability that these tasks can be performed undergoes a transition from 0 to 1.
Busch et al. [19] propose a classic optimization problem in network routing which is to minimize C+D, where C is the maximum edge congestion and D is the maximum path length (also known as dilation). The problem of computing the optimal is NPcomplete. They study routing games in general networks where each player i selfishly selects a path that minimizes the sum of congestion and dilation of the player’s path.
These constraint satisfaction approaches neither address mapping of tasks to sensor nodes nor discuss the resource consumption/performance tradeoff.
Utilitybased method
Dhanani et al. [20] compare utilitybased information management policies in sensor networks. Here, the considered resource is information or data, and two models are distinguished: the sensorcentric utilitybased model (SCUB) and the resource manager (RM) model. SCUB follows a distributed approach that instructs individual sensors to make their own decisions about what sensor information should be reported based on a utility model for data. RM is a consolidated approach that takes into account knowledge from all sensors before making decisions. They evaluate these policies through simulation in the context of dynamically deployed sensor networks in military scenarios. Both SCUB and RM can extend the lifetime of a network as compared to a network without maintaining any policy. Peng et al. [21] propose a reliable multicast protocol, called CodePipe, with energy efficiency, high throughput and fairness in lossy wireless networks. Building upon opportunistic routing and random linear network coding, CodePipe can not only eliminate coordination between nodes but also improve the multicast throughput significantly by exploiting both intrabatch and interbatch coding opportunities.
These approaches do not address the task scheduling to improve the resource consumption/performance tradeoff.
Reinforcement learningbased method
Reinforcement learning helps to enable applications with inherent support for efficient resource/task management. It is the process by which an agent improves task scheduling according to previously learned behavior. It does not need a model of its environment and can be used online. It is simple and demands minimal computational resources.
Shah and Kumar [7] consider Q learning as reinforcement learning for the task management. They describe a distributed independent reinforcement learning (DIRL) approach for resource management, which forms an important component of any application including initial sensor selection and task allocation as well as runtime adaptation of allocated resources to tasks. Here, the optimization parameters are energy, bandwidth, network lifetime, etc. DIRL allows each individual sensor node to selfschedule its tasks and allocate its resources by learning their usefulness in any given state while honoring applicationdefined constraints and maximizing the total amount of reward over time.
Khan and Rinner [8] apply reinforcement learning (RL) for online task scheduling. They use cooperative reinforcement learning for task scheduling. They introduce cooperation among neighboring nodes with the local information of each node. This cooperation helps to provide better performance. Cooperative Q learning is a reinforcement learning approach to learn the usefulness of some tasks over time in a particular environment. They consider the WSN as a multiagent system. The nodes correspond to agents in the multiagent reinforcement learning. The world surrounding the sensor nodes forms the environment. Tasks are considered as activities for the sensor nodes at each time step such as transmit, receive, sleep, and sense. States are formed by a set of system variables such as object in the field of view (FOV) of sensor nodes, required energy for a specific action, and data to transmit. A reward value provides some positive or negative feedback for performing a task at each time step. Value functions define what is good for an agent over the long run described by reward function and some parameters.
Khan and Rinner [9] apply cooperative reinforcement learning (CRL) for online task scheduling. They use StateActionRewardStateAction (S A R S A(λ)) [22] learning and introduced cooperation among neighboring sensor nodes to further improve the task scheduling.
The proposed RATS method applies Exp3 which does not need any statistical assumptions like stochastic bandit solvers. Exp3 is an online bandit solver in which an adversary, rather than a wellbehaved stochastic process, has complete control over the payoffs/rewards [6].
DIRL, RL, CRL, and Exp3 are compared for the task scheduling in a target tracking application and analyzed for the performance in terms of tracking quality/energy consumption tradeoff.
Network model
Before describing the problem formally, terms like resource consumption and tracking quality need to be defined. In WSNs, resource consumption happens because of performing the various tasks needed for the application. Each task consumes an amount of energy from the fixed energy budget of the sensor nodes. Typically, tracking quality in a target tracking application of a WSN is defined as the accuracy of target location estimation provided by the network.
In the proposed approach, a WSN is composed of N nodes represented by the set \(\hat {N}=\{n_{1},\ldots,n_{N}\}\). Each node has a known position (u _{ i },v _{ i }) and a given sensing coverage range which is simply modeled by circle with radius r _{ i }. All nodes within the communication range R _{ i } can directly communicate with n _{ i } and are referred to as neighbors. The number of neighbors of n _{ i } is given as ngh(n _{ i }). The available resources of node n _{ i } are modeled by a scalar E _{ i }. The battery power of sensor nodes is considered as resource. A set of tasks is considered to perform over time steps. Each task consumes some battery power from the energy budget of the sensor nodes. A set of static values for the energy consumption of tasks is considered. These values are assigned based on the energy demands of the task. A higher value is set for the tasks which need higher energy consumption.
The WSN application is composed of A tasks (or actions) represented by the set \(\hat {A}=\{a_{1},\ldots,a_{A}\}\). Once a task is started at a specific node, it executes for a specific (short) period of time and terminates afterwards. Each task execution on a specific node n _{ i } requires some resources \(\tilde {E}_{j}\) and contributes to the overall application performance P. Thus, the execution of task a _{ j } on node n _{ i } is only feasible if \(E_{i} \ge \tilde {E}_{j}\). The overall performance P is represented by an applicationspecific metric. On each node, an online task scheduling takes place which selects the next task to execute among the Aindependent tasks. The task execution time is abstracted as a fixed period. Thus, scheduling is required at the end of each period which is represented as time instant t _{ i }. Nonpreemptive scheduling is considered based on our proposed model. Figure 1 shows our considered WSN model components.
Table 1 shows the notations and meanings used to represent the network model. The task scheduling approach is demonstrated using a target tracking application. A sensor network may consist of a variable number of nodes. The sensing region of each node is called the field of view (FOV). Every node aims to detect and track all targets in the FOV. If the sensor nodes would perform tracking all the time, then this would result in the best tracking performance. But executing target tracking all the time is energy demanding. Thus, a task should only be executed when necessary and sufficient for tracking performance. Sensor nodes can cooperate with each other by informing neighboring nodes about “approaching” targets. Neighboring nodes can therefore become aware of approaching targets.
The objective function is defined in a way that it is possible to trade the application performance and the required energy consumption by a balancing factor. The ultimate objective of the problem is to determine the order of tasks on each node such that the overall performance is maximized while the resource consumption is minimized. Naturally, these are conflicting optimization criteria, so there is no single best solution. The set of nondominating solutions for such a multicriteria problem can be typically represented by a Pareto front.
System model
The task scheduler operates in a highly dynamic environment, and the effect of the task ordering on the overall application performance is difficult to model. We consider the set of tasks, set of states, and the reward function as considered in [23]. Figure 2 depicts the scheduling framework where its key components can be described as follows:

Agent: Each sensor node embeds an agent which is responsible for executing the online learning algorithm.

Environment: The WSN application represents the environment in our approach. Interaction between the agent and the environment is achieved by executing actions and receiving a reward function.

Action: An agent’s action is the currently executed application task on the sensor node. At the end of each time period t _{ i }, each node triggers the scheduler to determine the next action to execute.

State: A state describes an internal abstraction of the application which is typically specified by some system parameters. In our target tracking application, the states are represented by the number of currently detected targets in the node’s FOV and expected arrival times of targets detected by neighboring nodes. The state transitions depend on the current state and action.

Policy: An agent’s policy determines what action will be selected in a particular state. In our case, this policy determines which task to execute at the perceived state. The policy can focus more on exploration or exploitation depending on the selected setting of the learning algorithm.

Value function: This function defines what is good for an agent over the long run. It is built upon the reward function values over time, and hence, its quality totally depends on the reward function [7].

Reward function: This function provides a mapping of the agent’s state and the corresponding action to a reward value that contributes to the performance. We apply a weighted reward function which is capable of expressing the tradeoff between energy consumption and tracking performance.

Cooperation: Information exchange is considered among neighboring nodes as cooperation. The received information may influence the application’s state of sensor nodes.
Existing methods for task scheduling
Existing methods DIRL, RL, and CRL are described below. Each method is described briefly with the learning mechanism and considered set of states and tasks.
DIRL
Shah and Kumar [7] use distributed independent Q learning (DIRL) as reinforcement learning. The advantage of using independent learning is that no communication is required for coordination between sensor nodes and each node selfishly tries to maximize its own rewards. In Q learning, every agent needs to maintain a Q matrix for the value functions. Initially, all entries of the Q matrix are 0 and the agent of the nodes may be in any state. Based on the applicationdefined variables, the system goes to a particular state. Then it performs an action which depends on the status of the nodes.
It calculates the Q value for this (state, action) pair as
where Q _{ t+1}(s _{ t },a _{ t }) means the update of the Q value at time t+1 after executing the action a at time step t. r _{ t+1} represents the immediate reward after executing the action a at time t, V _{ t } represents the value function for node at time t, and V _{ t+1} represents the value function at time t+1. \(\max \limits _{a\in A} Q_{t+1}(s_{t},a)\) means the maximum Q value after performing an action from the action set A for the agent i. γ is the discount factor which can be set to a value in [ 0,1]. For higher γ values, the agent relies more on the future than the immediate reward. α is the learning rate parameter which can be set to a value in [ 0,1]. It controls the rate at which an agent tries to learn by giving more or less weight to the previously learned utility value. When α is close to 1, the agent gives more priority to the previously learned utility value.
Algorithm 1 depicts the RL algorithm.
RL
Khan and Rinner [8] propose cooperative Q learning (RL) where every agent needs to maintain a Q matrix for the value functions like independent Q learning. Initially, all entries of the Q matrix are 0 and the nodes or agents may be in any state. Based on the applicationdefined variable or system variables, the system goes to a particular state. Then it performs an action which depends on the status of the nodes (example: For transmit action, a node must have residual energy which is greater than transmission cost). It calculates the Q value for this (state, task) pair with the immediate reward.
where Q _{ t+1}(s _{ t },a _{ t }) means the update of Q value at time t+1, after executing the action a at time step t. r _{ t+1} means the immediate reward after executing the action a at time t. V _{ t } is the value function at time t. V _{ t+1} is the value function at time t+1. \(\max \limits _{a\in A} Q_{t+1}(s_{t},a)\) means the maximum Q value after performing an action from the action set A. γ is the discount factor which can be set to a value in [0,1]. The higher the value, the greater the agent relies on future reward than the immediate reward. α is the learning rate parameter which can be set to a value in [ 0,1]. It controls the rate at which an agent tries to learn by giving more or less weight to the previously learned utility value. When α is set close to 1, the agent gives more priority to the previously learned utility value.
f is the weight factor [24] for the neighbors of agent i and can be defined as follows:
The algorithm can be stated as follows:
CRL
Khan and Rinner [9] apply S A R S A(λ) (CRL), also referred to as StateActionRewardStateAction, which is an iterative algorithm that approximates the optimal solution. S A R S A(λ) [22] is an iterative algorithm that approximates the optimal solution without knowledge of the transition probabilities which is very important for a dynamic system like WSN. At each state s _{ t+1} of iteration t+1, it updates Q _{ t+1}(s,a), which is an estimate of the Q function by computing the estimation error δ _{ t } after receiving the reward in the previous iteration. The S A R S A(λ) algorithm has the following updating rule for the Q values:
for all s, a.
In Eq. 7, α∈ [ 0,1] is the learning rate which decreases with time. δ _{ t } is the temporal difference error which is calculated by following rule
In Eq. 8, γ _{1} is a discount factor which varies from 0 to 1. The higher the value, the more the agent relies on future rewards than on the immediate reward. r _{ t+1} represents the reward received for performing action. f is the weight factor [24] for the neighbors of agent i and can be defined as follows:
An important aspect of an RL framework is the tradeoff between exploration and exploitation [25]. Exploration deals with randomly selecting actions which may not have higher utility in search of better rewarding actions, while exploitation aims at the learned utility to maximize the agent’s reward.
A simple heuristic is used where exploration probability at any point of time is given by
where ε _{max} and ε _{min} denote upper and lower boundaries for the exploration factor, respectively. S _{max} represents maximum number of states which is three in our work, and S represents current number of states already known. At each time step, the system calculates ε and generates a random number in the interval of [ 0,1]. If the selected random number is less than or equal to ε, the system chooses a uniformly random task (exploration); otherwise, it chooses the best task using Q values (exploitation).
S A R S A(λ) improves learning through eligibility traces. e _{ t }(s,a) is the eligibility traces in Eq. 7. Here, λ is another learning parameter similar to α for guaranteed convergence. γ _{2} is the discount factor. In general, eligibility traces give a higher update factor for recently revisited states. This means that the eligibility trace for a stateaction pair (s,a) will be reinforced if s _{ t }∈s and a _{ t }∈a. Otherwise, if the previous action a _{ t } is not greedy, the eligibility trace is cleared.
The algorithm can be stated as follows:
The eligibility trace is updated by the following rule:
The learning rate α is decreased slowly in such a way that it reflects the degree to which a stateaction pair has been chosen in the recent past. It is calculated as
where ζ is a positive constant and visited(s,a) represents the visited stateaction pairs so far [26].
Proposed method for task scheduling
Following set of actions, set of states and reward function are considered for the proposed RATS.
Set of actions
The following actions are considered in our target tracking application:

1.
Detect_Targets: This function scans the field of view (FOV) and returns the number of detected targets in the FOV.

2.
Track_Targets: This function keeps track of the targets inside the FOV and returns the current 2D positions of all targets. Every target within the FOV is assigned with a unique ID number.

3.
Send_Message: This function sends information about the target’s trajectory to neighboring nodes. The trajectory information includes (i) the current position and time of the target and (ii) the estimated speed and direction. This function is executed when the target is about to leave the FOV.

4.
Predict_Trajectory: This function predicts the velocity of the trajectory. A simple approach is to use the two most recent target positions, i.e., (x _{ t },y _{ t }) at time t _{ t } and (x _{ t−1},y _{ t−1}) at t _{ t−1}. Then the constant target’s speed can be estimated as
$$ v=\sqrt{(x_{t}x_{t1})^{2}+(y_{t}y_{t1})^{2}}/(t_{t}t_{t1}) $$((15)) 
5.
Intersect_Trajectory: This function checks whether the trajectory intersects with the FOV and predicts the expected time of the intersection. This function is executed by all nodes which receive the “target trajectory” information from a neighboring node. Trajectory intersection with the FOV of a sensor node is computed by basic algebra. The expected time to intersect the node is estimated by
$$ \tilde{t}_{i}=D_{P_{i}P_{j}}/v $$((16))where \(D_{P_{i}P_{j}}\) is the distance between points P _{ j } and P _{ i }. P _{ j } represents the point where the trajectory is predicted at node j, and P _{ i } corresponds to the trajectory’s intersection points with the FOV of node i (cp. Fig. 3). v is the estimated velocity as calculated by Eq. 15.

6.
Goto_Sleep: This function shuts down the sensor node for a single time period. It consumes the least amount of energy of all available actions.
The advanced trajectory prediction and intersection are considered for these methods. Inputs for this prediction task are the last few tracked positions of the target. Here, the last six tracked positions of the target are considered based on simulation studies. The trajectory is linearized given by the last six tracked positions of the target considering the constant speed and direction. The speed is calculated by Eq. 15.
Suppose (x _{1},y _{1}), (x _{2},y _{2}) … (x _{ n },y _{ n }) are tracked positions of the moving object inside the FOV of the sensor node at time steps t _{1}, t _{2} … t _{ n }.
The trajectory can be predicted by the regression line [27] in Eq. 17:
where b is the slope, a is the intercept, and ε is the residual or error for the calculation.
So residual, ε, can be calculated by following
where i=1,2,3,…,n.
If the squares of the residuals of all the points from the line are summed up, what we get is a measure of the fitness of the line. The aim is to minimize this value.
So, the square of the residual is as follows:
To calculate the sum of square residuals, all the individual square residuals are added together as follows:
where J is the sum of square residuals and n is the number of considered points.
J in Eq. 20 needs to be minimized. The minimum value for J has to occur when the first derivative is 0. The partial derivatives for J are with respect to the two parameters of the regression line b and a. To get the minimum, it needs to assign 0 [28].
Equations 21 and 22 can be shuffled and divided as
Some constants can be pulled out in front of the summations. The \(\sum ^{n}_{i=1}a\) can be written as na in Eq. 23. We take the values of unknown parameters a and b using Eqs. 23 and 24. These give two equations as follows:
Now, from Eqs. 25 and 26, some simple substitutions between the two equations are obtained as follows:
These formulas in Eqs. 27 and 28 do not tell us how precise the estimates are. That is, how much the estimators a and b can deviate from the “true” values of a and b. It can be solved by confidence intervals.
Using Student’s tdistribution with (n−2) degrees of freedom [29], a confidence interval can be constructed for a and b as follows:
where \(\hat a\) and \(\hat b\) are the new estimated values of a and b. \(t^{*}_{n2}\) is the (1−τ/2)th quantile of the t _{ n−2} distribution. For example, if τ=0.05, the confidence level becomes 95 %. s _{ a } and s _{ b } are the standard deviations as follows:
where \(\bar {x}\) is the average of the x values.
In Fig. 4, some tracked positions of the target are observed which is denoted by the “black” dots. At first, the regression line is predicted and the middle line is obtained. Then the confidence band is calculated which gives two other lines.
For the intersection with the circles, the line as follows is considered:
where b is the slope and a is the intercept.
The line given by Eq. 33 intersects a circle (sensing range is considered as a circle) given by Eq. 34:
where (u _{1},v _{1}) is the center and r _{1} is the radius of the circle.
Substituting the value of Eq. 33 in Eq. 34 gives the following:
Simply expanding Eq. 35 by algebraic formula gives a quadratic equation of x and can be solved using the quadratic formula.
After solving the quadratic equation, we get the values of x and y.
if B ^{2}−4A C<0, then the line misses the circle. If B ^{2}−4A C=0, then the line is tangent to the circle. If B ^{2}−4A C>0, then the line meets the circle in two distinct points.
x can be substituted in Eq. 33 from Eq. 36 to get the y values:
Set of states
The application is abstracted by three states at every node.

Idle: This state indicates that there is currently no target detected within the node’s FOV and the local clock is too far from the expected arrival time of any target already detected by some neighbor. If the time gap between the local clock L _{ c } and the expected arrival time N _{ET} is greater than or equal to a threshold Th_{1} (cp. Fig. 5), then the node remains in the idle state. The threshold Th_{1} is set to 5 based on our simulation studies. In this state, the sensor node performs Detect_Targets less frequently to save energy.

Awareness: There is currently also no detected target in the node’s FOV in this state. However, the node has received some relevant trajectory information, and the expected arrival time of at least one target is in less than Th_{1} clock ticks. In this state, the sensor node performs Detect_Targets more frequently, since at least one target is expected to enter the FOV.

Tracking: This state indicates that there is currently at least one detected target within the node’s FOV. Thus, the sensor node performs tracking frequently to achieve high tracking performance.
Obviously, the frequency of executing Detect_Targets and Track_Targets depends on the overall objective, i.e., whether to focus more on tracking performance or energy consumption. The states can be identified by two application variables, i.e., the number of detected targets at the current time N _{ t } and the list of arrival times of targets expected to intersect with node N _{ET}. N _{ t } is determined by the task Detect_Targets which is executed at time t. If the sensor node executes the task Detect_Targets at time t, then N _{ t } returns the number of detected targets in the FOV. Each node maintains a list of appearing targets and the corresponding arrival time. Targets are inserted in this list if the sensor node receives a message and the estimated trajectory intersects with the FOV. Targets are removed if a target is detected by the node or the expected arrival time with an additional threshold Th_{1} has expired.
Initially, each node has no idea about which task to perform at which state. They learn this scheduling online over time. For example, Track_Targets is a necessary task for keep tracking when the target is in FOV. The application learns online about the next task to execute based on our proposed methods. If the sensor node does not perform the Track_Targets task when the target is in FOV, there is a chance to miss the target which implies less tracking quality. But this situation could provide better energy efficiency, since the Track_Targets task consumes the highest amount of energy among all the tasks. So, selection of a particular task at each time step or scheduling of tasks provides an impact on overall tracking quality/energy consumption tradeoff.
Figure 5 depicts the state transition diagram where L _{ c } is the local clock value of the sensor node and Th_{1} represents the time threshold between L _{ c } and N _{ET}.
Reward function
The reward function is a key system component for expressing the effect of the task execution on the system performance and resource consumption. Thus, both aspects should be covered by the reward function. Among the various options, it is simplified by merging energy consumption and system performance using a balancing parameter. In detail, the reward function in our algorithm is defined as
where the parameter β balances the conflicting objectives between E _{ i } and P _{ t }. E _{ i } represents the residual energy of the node. P _{ t } represents the number of tracked positions of the target inside the FOV of the node. E _{max} is the maximum energy level of the sensor node, and P is the number of all possible detected target’s positions in the FOV. These two parameters are used for normalizing the energy and performance parameters. By modifying the balancing parameter β, we can control whether more focus is put on energy efficiency or system performance.
Proposed method
The classical adversarial algorithm Exp3 (exponentialweight algorithm for exploration and exploitation) is used for task scheduling [6].
The algorithm can be stated as follows:
Exp3 has a parameter κ which controls the probability with which arms are explored in each round. At each time step t, Exp3 draws an action a according to the distribution P _{1,t },P _{2,t },…,P _{ A,t }. The distribution can be calculated by the following equation:
where w _{ a,t } is the weight associated with the action a at time t.
This distribution is a mixture of the uniform distribution and a distribution which assigns to each action a probability mass exponential in the estimated reward for that action. Intuitively, mixing in the uniform distribution is done to make sure that the algorithm tries out all actions A and gets good estimates of the rewards for each action.
Weight for each action can be calculated by following equation
where r _{ t+1} is the reward after executing the action a.
Reward can be calculated by following equation
where P _{ a,t } is the calculated probability distribution for the action a by Eq. 39.
Exp3 works by maintaining a list of weights w _{ i } by Eq. 40 for each of the actions, using these weights to decide which action to take next based on a probability distribution P _{ t }, and increasing the relevant weights when the reward is positive. The egalitarianism factor κ∈[ 0,1] tunes the desire to pick an action uniformly at random. If κ=1, the weights have no effect on the choices at any step.
Experimental results and evaluation
Simulation environment
The proposed method is implemented and evaluated with other task scheduling methods using a WSN multitarget tracking scenario implemented in a C# simulation environment.
The simulator consists of two stages: the deployment of the nodes and the execution of the tracking application. In the evaluation scenario, the sensor nodes are uniformly distributed in a 2D rectangular area. A given number of sensor nodes are placed randomly in this area which can result in partially overlapping FOVs of the nodes. However, placement of nodes on the same position is avoided. Before deploying the network, the network parameters should be configured using the configuration sliders.
The following network parameters can be configured by our simulator.

Network size: Network size means the number of nodes in the network. In the current settings of the simulator, the number of sensor nodes can be varied between [3, 40].

Sensor radius: Sensor radius is the sensing range of the sensors in the network. Sensor radius can be varied between [1, 50].

Transmission radius: Transmission radius is the maximum distance within two sensor nodes communicating with each other. If it is set to a high value, nodes on the opposite side of the rectangular area may be able to reach each other. If it is set to a low value, nodes must be very close to communicate with each other. Transmission radius can be varied between [1, 50].
The network is displayed on the simulation environment as a set of red circles surrounded by gray circles. The red circles denote the sensor nodes, and the gray circles denote the sensing range of the nodes. Each node is connected to nearby nodes by black lines which represent the communication links. When a message is being exchanged, it appears as red. The color in the center of the red circle represents the battery status of the node, which gradually shifts from white to black. White color denotes the nodes with full power, and black color denotes the nodes with no power. When a node loses all power, the node becomes completely black. The gray area of the node shrinks and disappears. All of the communication links associated with the node disappear as well.
Targets move around in the area based on a GaussMarkov mobility model [30]. The GaussMarkov mobility model was designed to adapt to different levels of randomness via tuning parameters. Initially, each mobile target is assigned with a current speed and direction. At each time step t, the movement parameters of each target are updated based on the following rule:
where S _{ t } and D _{ t } are the current speed and direction of the target at time t, respectively. S and D are constants representing the mean value of speed and direction, respectively. \(S^{G}_{t1}\) and \(D^{G}_{t1}\) are random variables from a Gaussian distribution. η is a parameter in the range [0,1] and is used to vary the randomness of the motion. Random (Brownian) motion is obtained if η=0, and linear motion is obtained if η=1. At each time t, the target’s position is given by the following equations:
Settings of parameters
In the simulation, we limit the number of concurrently available targets to seven. The total energy budget for each sensor node is considered as 1000 units. Table 2 shows the energy consumption for the execution of each action. Sending messages over two hops consumes energy on both the sender and relay nodes. To simplify the energy consumption at the network level, only the energy consumption to ten units on the sending node only is aggregated. The egalitarianism factor κ=0.5 is set for Exp3. The sensing radius is considered as r _{ i }=5, and the communication radius is set as R _{ i }=8. These fixed values are set for the parameters based on simulation studies. For each simulation run, the achieved tracking quality and energy consumption are aggregated and normalized to [ 0,1].
Performed experiments
For the evaluation, the following four experiments are performed with the following assumptions of parameters.

1.
To find out the tradeoff between tracking quality and energy consumption, the balancing factor β of the reward function is set between [ 0.1,0.9] in 0.1 steps, keeping the randomness of moving target as η=0.5, setting the egalitarianism factor of Exp3 as κ=0.5, and fixing the topology to five nodes.

2.
The network size is varied to check the tradeoff between tracking quality and energy consumption. Three different topologies consisting of 5, 10, and 20 sensor nodes are considered where the coverage ratio is 0.0029, 0.0057, and 0.0113, respectively. The coverage ratio is defined as the ratio of the aggregated FOV of all deployed sensor nodes over the area of the entire surveillance area. The balancing factor is set β=0.5 and the randomness of the mobility model η=0.5 which is a constant for this experiment.

3.
Randomness of moving targets η is set to one of the following values {0.1,0.15,0.2,0.25,0.3,0.4,0.5,0.7,0.9} and setting the balancing factor β=0.5 and fixing the topology to five nodes.

4.
DIRL, RL, CRL, and Exp3 are evaluated in terms of average execution time and average communication effort. These values are measured from 20 iterations and represent the mean execution times and the mean of Send_Message task executions.
Discussion
Figure 6 shows the results of our first experiment. Each data point in these figures represents the average of normalized tracking quality and energy consumption of ten complete simulation runs. The results show the tracking quality/energy consumption tradeoff for DIRL, RL, CRL, and Exp3 by varying the balancing factor β between [0.1, 0.9] in 0.1 steps. It is observed that CRL and Exp3 provide similar results, i.e., the corresponding data points are closely colocated. RL is energyaware but is not able to achieve high tracking quality. DIRL achieves the most energy awareness but provides the least tracking quality.
Figure 7 shows the results of the second experiment. In this experiment, each data point represents the average of normalized tracking quality and energy consumption of ten complete simulation runs by varying the network size to one of the values {5,10,20} for each method. Here, the same trend can be identified, i.e., the CRL and Exp3 achieve almost similar results in terms of tracking quality/energy consumption tradeoff and DIRL shows less tracking performance with the higher energy efficiency.
Figures 8, 9 and 10 show the results of our third experiment. In this experiment, each data point represents the average of normalized tracking quality and the energy consumption of ten complete simulation runs by varying the randomness of moving objects η to one of these values {0.10,0.15,0.20,0.25,0.30,0.40,0.50,0.70,0.90} for each method. From these figures, it can be seen that CRL and Exp3 outperform RL and DIRL in terms of achieved tracking performance. It can be seen that for lower randomness, η=0.5, 0.7, and 0.9, RL and Exp3 show very close results for tracking performance. But for higher randomness, η=0.1, 0.15, and 0.2, DIRL gives poor performance with regard to tracking performance.
Table 3 shows the comparison of DIRL, RL, CRL, and Exp3 in terms of average execution time and average communication effort. These values are derived from 20 iterations and represent the mean execution times and the mean of Send_message task executions. It can be seen that DIRL and RL are resourceaware in terms of execution time and communication effort. Exp3 requires 25 % more and CRL requires 86 % more execution time. The communication overhead is similar for both Exp3 and CRL.
Conclusions
In this paper, an adversarial bandit solver is applied based on online learning algorithm for resourceaware task scheduling in WSN. The performance of our proposed online task scheduling method is evaluated with other existing task scheduling methods based on the three learning algorithms: DIRL, RL, and CRL. Evaluation results show that our proposed RATS method provides better performance in terms of tracking quality. DIRL shows the best energy efficiency but provides poor results in terms of tracking quality. The proposed RATS and CRL show almost similar results in terms of tracking qualityenergy consumption tradeoff. Evaluation results show that these methods provide different properties concerning achieved performance and resource awareness. The selection of a particular algorithm depends on the application requirements and the available resources of sensor nodes.
Future work includes the application of our resourceaware scheduling approach to different WSN applications, the implementation on our visual sensor network platforms [31], and the comparison of our approach with other variants of reinforcement learning methods.
References
 1
L Xiang, J Luo, A Vasilakos, “Compressed data aggregation for energy efficient wireless sensor networks,” in Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2011 8th Annual IEEE Communications Society Conference on, june 2011, 46–54 (2011).
 2
Y Song, L Liu, H Ma, AV Vasilakos, A biologybased algorithm to minimal exposure problem of wireless sensor networks. IEEE Trans. Netw. Serv. Manag. 11(3), 417–430 (2014).
 3
AV Vasilakos, GI Papadimitriou, A new approach to the design of reinforcement schemes for learning automata: stochastic estimator learning algorithm. Neurocomputing. 7(3), 275–297 (1995).
 4
L Liu, Y Song, H Zhang, H Ma, AV Vasilakos, Physarum optimization: a biologyinspired algorithm for the Steiner tree problem in networks. IEEE Trans. Comput. 64(3), 819–832 (2015).
 5
H Saad, A Mohamed, T ElBatt, Cooperative Qlearning techniques for distributed online power allocation in femtocell networks. Wirel. Commun. Mob. Comput (2014). doi:10.1002/wcm.2470.
 6
P Auer, NC Bianchi, Y Freund, RE Schapire, The nonstochastic multiarmed bandit problem. SIAM J. Comput.32:, 48–77 (2003). doi:10.1137/S0097539701398375.
 7
K Shah, M Kumar, in Proceedings of IEEE Mobile Adhoc and Sensor Systems. Distributed Independent Reinforcement Learning (DIRL) Approach to Resource Management in Wireless Sensor Networks (IEEEPisa, Italy, 2007), pp. 1–9.
 8
MI Khan, B Rinner, in Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops. Resource Coordination in Wireless Sensor Networks by Cooperative Reinforcement Learning (IEEELugano, Switzerland, 2012), pp. 895–900.
 9
M Khan, B Rinner, in Proceedings of the IEEE International Conference on Communications Workshops. EnergyAware Task Scheduling in Wireless Sensor Networks Based on Cooperative Reinforcement Learning (IEEESydney, Australia, 2014), pp. 871–877.
 10
C Frank, K Römer, in Proceedings of the ACM Conference on Embedded Networked Sensor Systems. Algorithms for Generic Role Assignments in Wireless Sensor Networks (IEEESan Diego, California, 2005), pp. 230–242.
 11
JHW Ye, D Estrin, in Proceedings of the INFOCOM’02. An EnergyEfficient MAC Protocol for Wireless Sensor Networks (IEEENew York, USA, 2002), pp. 1567–1576.
 12
Y Tian, E Ekici, F Ozguner, in Proceedings of the IEEE International Conference on Mobile Adhoc and Sensor Systems Conference. Energyconstrained Task Mapping and Scheduling in Wireless Sensor Networks (IEEEWashington, DC, 2005), pp. 210–218.
 13
T He, S Krishnamurthy, JA Stankovic, T Abdelzaher, L Luo, R Stoleru, T Yan, L Gu, in In Mobisys. EnergyEfficient Surveillance System Using Wireless Sensor Networks (ACM Press, 2004), pp. 270–283.
 14
W Guo, N Xiong, HC Chao, S Hussain, G Chen, Design and analysis of self adapted task scheduling strategies in WSN. Sensors J. 11:, 6533–6554 (2011). doi:10.3390/s110706533.
 15
X Xu, R Ansari, A Khokhar, AV Vasilakos, Hierarchical data aggregation using compressive sensing (HDACS) in WSNs. ACM Trans. Sens. Netw. 11(3), 1–25 (2015).
 16
S Giannecchini, M Caccamo, CS Shih, in Proceedings of the Euromicro Conference on RealTime Systems. Collaborative Resource Allocation in Wireless Sensor Networks (IEEERennes, France, 2004), pp. 35–44.
 17
T Meng, F Wu, Z Yang, G Chen, AV Vasilakos, Spatial reusabilityaware routing in multihop wireless networks. IEEE Trans. Comput., 1–13 (2015). doi:10.1109/TC.2015.2417543.
 18
B Krishnamachari, S Wicker, R Bejar, C Fernandez, “On the complexity of distributed selfconfiguration in wireless networks”. J. Telecommun. Syst. 22(1–4), 33–59 (2003).
 19
C Busch, R Kannan, AV Vasilakos, Approximating congestion + dilation in networks via “Quality of Routing” games. Int. J. Distrib. Wirel. Sens. Netw. 61(9), 22 (2014).
 20
S Dhanani, J Arseneau, A Weatherton, B Caswell, N Singh, S Patek, in Proceedings of the IEEE Systems and Information Engineering Design Symposium. A Comparison of Utility Based Information Management Policies in Sensor Networks (IEEECharlottesville, Virginia USA, 2006), pp. 84–89.
 21
P Li, S Guo, S Yu, AV Vasilakos, in Proceedings of the IEEE INFOCOM. CodePipe: An opportunistic feeding and routing protocol for reliable multicast with pipelined network coding (Orlando, FL, 2012), pp. 100–108.
 22
RS Sutton, AG Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, Massachusetts, United States, 1998).
 23
MI Khan, B Rinner, Performance analysis of resource aware task scheduling methodologies in wireless sensor networks. International Journal of Distributed Sensor Networks, Hindawi, Volume 2014, 11 (2014).
 24
KLA Yau, P Komisarczuk, PD Teal, Reinforcement learning for context awareness and intelligence in wireless networks: review, new features and open issues. J. Netw. Comput. Appl.35:, 253–267 (2012).
 25
J Byers, G Nasser, in Proceedings of the Workshop on Mobile and Ad Hoc Networking and Computing. Utility Based Decision making in Wireless Sensor Networks (IEEEBoston, MA, 2000), pp. 143–144.
 26
RAC Bianchi, CHC Ribeiro, AHR Costa, Advances in Artificial Intelligence (Springer, Berlin, Germany, 2004).
 27
DC Montgomery, EA Peck, GG Vining, Introduction to Linear Regression Analysis (Wiley, Hoboken, New Jersey, United States, 2007).
 28
N Bery, Linear regression. Technical report. DataGenetics (2009).
 29
MR Spiegel, Theory and Problems of Probability and Statistics (McGrawHill, New York City, New York, United States, 1992).
 30
T Abbes, S Mohamed, K Bouabdellah, Impact of model mobility in ad hoc routing protocols. Comput. Netw. Inf. Secur. 10:, 47–54 (2012).
 31
L Esterle, PR Lewis, X Yao, B Rinner, Socioeconomic vision graph generation and handover in distributed smart camera networks. ACM Trans. Sens. Netw. 10(2), 24 (2014).
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The author declares that he has no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Khan, M.I. Resourceaware task scheduling by an adversarial bandit solver method in wireless sensor networks. J Wireless Com Network 2016, 10 (2016). https://doi.org/10.1186/s136380150515y
Received:
Accepted:
Published:
Keywords
 Wireless sensor networks
 Task scheduling
 Resourceawareness
 Independent reinforcement learning
 Cooperative reinforcement learning
 Adversarial bandit solvers