 Research
 Open Access
 Published:
Retrieving the relative kernel dataset from big sensory data for continuous queries in IoT systems
EURASIP Journal on Wireless Communications and Networking volume 2019, Article number: 132 (2019)
Abstract
Internet of Things (IoT) is rapidly developed and widely deployed in recent years, which makes the sensory data generated by IoT systems explode. The huge amount of sensory data generated by some IoT systems has already exceeded the storage, transmission, and computation capacities of IoT systems. However, the valuable sensory data which is highly related to a query in an IoT system is relatively small. The sensory data which is highly related to a query Q forms the relative kernel dataset of Q. Therefore, retrieving sensory data in the relative kernel dataset of a query instead of the raw sensory data will reduce the heavy burdens of an IoT system in terms of transmission and computation and then reduce the energy consumption of the IoT system. In this paper, we investigate the problem of retrieving relative kernel dataset from big sensory data for continuous queries in IoT systems. Two algorithms, relative kernel dataset retrieving algorithm and piecewise linear fittingbased relative kernel dataset retrieving algorithm, are proposed to retrieve the relative kernel dataset for continuous queries. Beside, algorithms for estimating the answers of continuous queries based on their relative kernel datasets are also proposed. Extensive simulation results are provided to verify the effectiveness and energy efficiency of the proposed algorithms.
Introduction
The Internet of Things (IoT) system, which refers to the network with a variety of intelligent things or objects around us, provides an efficient way to observe the complicated physical world and brings convenience to our life. With the repaid development of wireless telecommunications, embedded systems, and sensing techniques, the IoT systems are widely developed and generate mass sensory data. Gartner forecasts that there will be 20.8 billion connected things by 2020 and 100 billion by 2030 [1]. Such a large scale of IoT systems will generate mass sensory data. According to [2], the global climate data will reach to 100 petabytes in 2020. Other types of sensory data, such as monitoring data from a largescale system (e.g., Large Hadron Collider) and GPS data, also increase rapidly and have exceeded terabytes or even petabytes currently.
The mass sensory data generated by IoT systems is called as big sensory data, which has four V characters.

Volume. The volume of big sensory data is extremely huge that has already exceeded the storage, transmission and computation capacities of IoT systems.

Variety. An IoT system always contains a wide variety of sensors to satisfy the complex applications of the IoT system.

Velocity. To improve the accuracy of applications in an IoT system, the sampling frequency of each sensor is high. Therefore, the generating velocity of big sensory data is quite fast.

Value. The value of big sensory data always concentrates on a relative small subset of the big sensory data on account of the existence of noises and redundancies in big sensory data.
The Volume, Variety, and Velocity characters of big sensory data make the data processing in IoT systems more challenging. The mass sensory data may exceed the storage, transmission and computation capacities of IoT systems. Besides, the energy consumption for transmitting and computing big sensory data is quite high. Therefore, the existing sensory data acquisition, routing [3–6], data collection [7–9], and data computation [10, 11] techniques are no longer applicable for the big sensory data. The authors in [3] design clusterbased routing protocol for wireless sensor networks with nonuniform distribution of sensors. The works of [4–6] focus on constructing connected dominating sets for wireless sensor networks. However, the transmission of mass sensory data will exceed the communication capacity of wireless sensor networks and affect the effectiveness of these methods accordingly. The authors in [7] study the approximated holistic aggregation algorithms based on uniform sampling. And the authors in [8] propose aggregation scheduling algorithms for energy harvesting sensor networks. However, the mass sensory data will increase the aggregation latency and affect the performances of these algorithms. The work of [9] proposes algorithm to track quantiles and range countings in wireless sensor networks. The authors in [10] propose algorithm to support curve query processing in wireless sensor networks. The work in [11] studies the applicationaware scheduling in wireless networks. When the amount of sensory data is huge, the computation complexity of these algorithms may exceed the computation capability of wireless sensor networks. Therefore, a series of new innetwork processing algorithms with much lighter transmission and computation overloads should be considered.
The Value character of big sensory data reveals that the dataset of sensory data that is highly related to a given query is usually small. Therefore, for a query, only retrieving the dataset of sensory data that is highly related to the query will greatly reduce the transmission and computation overhead of an IoT system and then reduce the energy consumption of the IoT system.
There already exists some data reduction algorithms which reduce the amount of sensory data transmitted and computed in an IoT system and then reduce the energy consumption of the IoT system. Firstly, the simplest data reduction technique is based on sampling [12–16]. However, the sampling technique is only applicable to some simple statistic queries. Some valuable data in other types of queries may be missed by sampling technique on account of the limitation of sampling frequency. Besides, the sampled data may not be necessary for all queries. Secondly, another classical technique for innetwork data reduction is compressed sensing [17–22]. However, compressed sensing only considers the temporal and spatial correlation between sensory data while neglecting the correlation between the queries conducted by users and the sensory data generated by different sensors. As a consequence, the data reduction for a given query is not enough in compressed sensing. Thirdly, the work proposed by Cheng et. al. is the first one to investigate data processing problem in big sensory data [23, 24]. The work can reduce sensory data by drawing dominant dataset from Big Sensory Data. Nevertheless, the dominant dataset defined in [23, 24] is irrelevant to any given query. As with the sampling technique and compressed sensing technique, the sensory data in the dominant dataset defined in [23, 24] may not be necessary for all queries.
All existing algorithms above are not energy efficient enough since they ignore the correlation between a given query and the sensory data. As a consequence, we concentrate on retrieving the relative kernel dataset for continuous queries from big sensory data in IoT systems in this paper. The relative kernel dataset of a query is the sensory data that is highly related to the query. The amount of sensory data in the relative kernel dataset of a query is usually small according to the Value character of big sensory data. Therefore, only transmitting and computing these sensory data will highly reduce the energy consumption of an IoT system. Besides, many applications in the IoT systems provide users to have access to temporal variation information of the monitored environment. That is, users can conduct continuous queries in IoT systems to monitor the physical world in real time, where a continuous query is a query which is conducted by a user continuously and frequently. Since continuous queries are frequent in an IoT system, reducing the energy consumption of continuous queries will highly reduce the energy consumption of an IoT system.
On account of the above reasons, we study the problem of retrieving the relative kernel dataset from big sensory data for continuous queries in IoT systems in this paper. Two algorithms, the relative kernel dataset retrieving (RKDR) algorithm and the piecewise linear fittingbased relative kernel dataset retrieving (PLFRKDR) algorithm, are proposed firstly to retrieve the relative kernel dataset for continuous queries. Besides, two algorithms, the piecewise linear fittingbased answer estimating (PLFAE) algorithm and temporal correlationbased answer estimating (TCAE) algorithm, are proposed to estimate the answers of continuous queries based on their relative kernel datasets. The major contributions of this paper are as follows.

1
The definition of the relative kernel dataset of a continuous query is proposed firstly. Besides, the relative kernel dataset retrieving problem is formally defined in this paper.

2
Two algorithms, the RKDR algorithm and the PLFRKDR algorithm, are proposed to retrieve the relative kernel dataset from big sensory data for continuous queries in IoT systems.

3
Two algorithms, the PLFAE algorithm and the TCAE algorithm, are proposed to estimate the answers of continuous queries in an IoT system based on their relative kernel datasets.

4
Extensive simulations on both simulation dataset and real dataset are carried out to verify the accuracy of the proposed algorithms.
The rest of this paper is organized as follows. Section 2 summarizes the methods of this paper. Section 3 presents the problem definition. Section 4 elaborates the algorithms for the relative kernel dataset retrieving problem. Section 5 provides the algorithms for estimating the answers of continuous queries with their relative kernel datasets. Section 6 analyzes the performances of the proposed algorithms. Section 7 shows the simulation results and discussion. Section 8 discusses the related work. Finally, Section 9 concludes the whole paper.
Methods
In this paper, we study the problem of retrieving the relative kernel dataset from big sensory data for continuous queries in IoT systems. The Pearson correlation coefficient is applied to measure the linear correlation between sensory data and the continuous queries. Besides, the piecewise linear fitting method is applied to approximate the relationships between sensory data and the continuous queries. Two algorithms are proposed to retrieve the relative kernel dataset for continuous queries in IoT systems. Another two algorithms are proposed to estimate the answers of continuous queries with their relative kernel datasets. The absolute error and relative error are defined and evaluated on both simulation dataset and real dataset. The simulation results show that the proposed algorithms are effective and efficient.
Problem definition
The IoT system model
An IoT system is equipped with n categories of sensors, which can generate n types of sensory data from the monitored environment. Each type of sensory data is described as an attribute of the monitored environment. The attribute set of the monitored environment is denoted as \(\mathcal {A}=\{x_{1},x_{2},\cdots,x_{n}\}\). For each continuous query Q from users at time slot t in an IoT system, the IoT system returns an answer to users according to the sensory data collected at time slot t. The answer of query Q is described as the target value of query Q, denoted as y_{q}. Apparently, the target value y_{q} is correlated with partial or all n attributes. Taking the healthcare monitoring IoT system as an example, there are a variety of sensors collecting data of heart sounds, electrical heart signals, blood oxygen saturation, respiratory sounds, blood pressure, body temperature, etc., which are attributes of this IoT system. When the user’s query is the possibility of cardiac arrest, the IoT system returns a probability. In this application, the query Q is the possibility of cardiac arrest and the target value y_{q} is the returned probability.
The correlation model
To explore the correlations between attributes in \(\mathcal {A}\) and query Q, a training set with m training examples is applied. Each training example is denoted as a column vector S_{l}=[x_{1l},⋯,x_{nl},y_{ql}]^{T}, where T denotes the matrix transpose and x_{il} (1≤i≤n) is the value of attribute x_{i} and y_{ql} is the target value of query Q in training example S_{l}, 1≤l≤m. The training set can be denoted as a (n+1)×m matrix \(\mathcal {S}=[S_{1},S_{2},\cdots,S_{m}]\).
In most applications of IoT systems, the sensory data can reflect the statement of the monitored physical world intuitively. That means the simple linear correlation can reveal the relationship between the sensory data and the query of an IoT system in most cases. The Pearson correlation coefficient is a common metric to measure the linear correlation between two variables in correlation analysis. The value of the Pearson correlation coefficient is in [−1,1], where −1 presents the total negative linear correlation, 0 presents no linear correlation, and 1 is the total positive linear correlation. Therefore, we apply the Pearson correlation coefficient to measure the correlation between attributes and the correlation between attributes and continuous queries.
Firstly, the Pearson correlation coefficient of any two attributes x_{i} and x_{j} (1≤i≤n, 1≤j≤n and i≠j) is presented by the following formula,
where x_{il} is the value of attribute x_{i} and x_{jl} is the value of attribute x_{j} in training example S_{l}. Besides, \(\overline {x_{i}}=\frac {1}{m}\sum \nolimits _{l=1}^{m} x_{il}\) is the average value of attribute x_{i} and \(\overline {x_{j}}=\frac {1}{m}\sum \nolimits _{l=1}^{m} x_{jl}\) is the average value of attribute x_{j} in the training set \(\mathcal {S}\).
Secondly, the Pearson correlation coefficient of attribute x_{i} (1≤i≤n) and the target value y_{q} of a continuous query Q is presented by the following formula,
where x_{il} denotes the value of attribute x_{i} and y_{ql} denotes the target value of query Q in training example S_{l}. Besides, \(\overline {y_{q}}=\frac {1}{m}\sum \nolimits _{l=1}^{m} y_{ql}\) is the average value of target values of query Q in the training set \(\mathcal {S}\).
Problem statement
In this paper, we study the problem of retrieving the relative kernel dataset for continuous queries in IoT systems to reduce the transmission and computation overhead and energy consumption of the IoT systems. The formal definition of the relative kernel dataset \(\mathcal {K}^{q}\) for a continuous query Q is described as follows.
Definition 1
(βcompatible) Given a parameter β, where 0<β<1, any two attributes x_{i} and x_{j} are βcompatible if their Pearson correlation coefficient r_{ij} satisfies that r_{ij}≤β, where r_{ij} is calculated by formula (1).
Definition 2
(βcompatible set) Given a parameter β, where 0<β<1, the subset \(\mathcal {A}_{S}\) of attribute set \(\mathcal {A}\) is a βcompatible set if any two attributes in \(\mathcal {A}_{S}\) are βcompatible with each other, i.e., \(\mathcal {A}_{S}\subseteq \mathcal {A}\) and \(\forall x_{i},x_{j}\in \mathcal {A}_{S}\), r_{ij}≤β.
Definition 3
(the weight of a βcompatible set) Given a βcompatible set \(\mathcal {A}_{S}\) and a continuous query Q, the weight of \(\mathcal {A}_{S}\) is denoted as \(w(\mathcal {A}_{S})\). The definition of \(w(\mathcal {A}_{S})\) is \(w(\mathcal {A}_{S})=\sum \nolimits _{x_{i}\in \mathcal {A}_{S}}R^{q}_{i}\), where \(R^{q}_{i}\) is the correlation coefficient between attribute x_{i} and query Q.
Definition 4
((k,β)relative kernel dataset \(\mathcal {K}^{q}\)) Given the attribute set \(\mathcal {A}\), query Q, the compatible parameter β, and the required size k of the relative kernel dataset, a training set \(\mathcal {S}\). The (k,β)relative kernel dataset of query Q is a subset of \(\mathcal {A}\), denoted as \(\mathcal {K}^{q}\), satisfying the following three conditions.
(1) \(\mathcal {K}^{q}\) is a βcompatible subset,
(2) the size of \(\mathcal {K}^{q}\) is at most k, and
(3) \(w(\mathcal {K}^{q})>w(\mathcal {A}_{S})\) for any attribute subset \(\mathcal {A}_{S}\) satisfying conditions (1)(2).
Based on the definition of (k,β)relative kernel dataset \(\mathcal {K}^{q}\), we define the problem of retrieving the relative kernel dataset for continuous queries in IoT systems as follows.
Problem statement The problem of retrieving the relative kernel dataset for continuous queries in IoT systems is denoted as relative kernel dataset retrieving (RKDR) problem. The input and output of the RKDR problem are presented as follows.
Input:

1.
The attribute set of an IoT system, \(\mathcal {A}=\{x_{1},x_{2},\cdots,x_{n}\}\);

2.
The continuous query, Q;

3.
The training set, \(\mathcal {S}=[S_{1},S_{2},\cdots,S_{m}]\);

4.
The required size of the relative kernel dataset, k;

5.
The compatible parameter, β.
Output:
The (k,β)relative kernel dataset \(\mathcal {K}^{q}\) of the continuous query Q;
Theorem 1
The RKDR problem is NPhard.
Proof 1
The Maximum Weighted Budgeted Independent Set (MWBIS) problem is to find a maximum weighted independent set in a weighted graph G(V,E) of cardinality at most k. The MWBIS problem has been proved to be NPhard in [25]. The MWBIS problem can be converted to the RKDR problem in polynomial time. Each vertex i of the weighted graph G(V,E) can be regarded as an attribute x_{i} of an IoT system, where the weight of vertex i can be regarded as the absolute value of the correlation coefficient between attribute x_{i} and query Q. For each edge (i,j)∈E in the weighted graph G(V,E), we regard that attributes x_{i} and x_{j} are not βcompatible. To find a maximum weighted independent set of cardinality at most k is to find the (k,β)relative kernel dataset of query Q. Then the MWBIS problem is converted to the RKDR problem. Therefore, the RKDR problem is NPhard.
Algorithms for the relative kernel dataset retrieving problem
In this section, we propose two algorithms to retrieve the relative kernel dataset for continuous queries in IoT systems, which are the relative kernel dataset retrieving (RKDR) algorithm and the piecewise linear fittingbased relative kernel dataset retrieving (PLFRKDR) algorithm. The detailed description of the RKDR algorithm is presented in Algorithm 1 and the detailed description of the PLFRKDR algorithm is presented in Algorithm 3. The RKDR algorithm is suitable for the IoT systems where simple linear correlation can reveal the relationship between the sensory data and the continuous queries. And the PLFRKDR algorithm is more suitable for the IoT systems where the correlation between an attribute and a continuous query is not merely linear correlation.
Relative kernel dataset retrieving algorithm
The relative kernel dataset retrieving (RKDR) algorithm firstly reduces redundancies in attributes by measuring the linear correlations between different attributes. Then, the Pearson correlation coefficient is applied to measure the correlations between attributes and the given continuous query. The RKDR algorithm contains the following two steps.
Step 1. Calculate the candidate relative kernel dataset through reducing the redundancies in attributes.
The candidate relative kernel dataset is denoted as \(\mathcal {X}\). At first, the candidate relative kernel dataset is initialized as the attribute set, i.e., \(\mathcal {X}=\mathcal {A}=\{x_{1},\cdots,x_{n}\}\). Then, \(\mathcal {X}\) is updated by the following two substeps.
Step 1.1. For each attribute x_{i} (1≤i≤n) in \(\mathcal {X}\), the Pearson correlation coefficient \(R^{q}_{i}\) between x_{i} and the target value y_{q} of query Q is calculated by formula (2).
Step 1.2 For any two attribute x_{i} and x_{j} in \(\mathcal {X}\), the Pearson correlation coefficient r_{ij} of them is calculated by formula (1). If x_{i} and x_{j} are not βcompatible (i.e., r_{ij}>β), one of them is redundant in \(\mathcal {X}\) and should be removed. If x_{i} has stronger linear correlation with query Q than x_{j}, i.e., \(R^{q}_{i}>R^{q}_{j}\), remove x_{j} from \(\mathcal {X}\). Otherwise, remove x_{i} from \(\mathcal {X}\).
After the above two substeps, the candidate relative kernel dataset \(\mathcal {X}\) is calculated through reducing redundancies in attributes.
Step 2. Retrieve the (k,β)relative kernel dataset from the candidate relative kernel dataset.
Any two attributes in the candidate relative kernel dataset \(\mathcal {X}\) are βcompatible according to Step 1.2. The Pearson correlation coefficient of each attribute in \(\mathcal {X}\) and query Q is calculated in Step 1.1. The absolute values of these Pearson correlation coefficients are calculated. Therefore, we select the topk absolute values of these Pearson correlation coefficients and their corresponding k attributes. The topk absolute values of these Pearson correlation coefficients are denoted as \(R^{q}_{a_{1}} \geq R^{q}_{a_{2}} \geq \cdots \geq R^{q}_{a_{k}}\) and the corresponding k attributes are \(x_{a_{1}},\cdots,x_{a_{k}}\). These k attributes are added to the (k,β)relative kernel dataset of query Q, i.e., \(\mathcal {K}^{q}=\{x_{a_{1}},\cdots,x_{a_{k}}\}\).
After the above two steps, the (k,β)relative kernel dataset for query Q is derived. The detailed operations are formally described in Algorithm 1. Algorithm 1 is proposed to retrieve the (k,β)relative kernel dataset for continuous query Q in IoT systems by the linear correlation analysis.
Piecewise linear fittingbased relative kernel dataset retrieving algorithm
In some scenarios, linear correlation is not enough to describe the relationship between an attribute and a continuous query. For example, the correlation between an attribute and a continuous query could be exponential, quadric, or logarithmic. In this situation, the RKDR algorithm proposed in the last subsection may be no longer applicable. As a consequence, the piecewise linear fittingbased relative kernel dataset retrieving (PLFRKDR) algorithm is proposed to improve the RKDR algorithm in this subsection.
Since it is almost impossible to know the actual correlation function between an attribute and a continuous query, it is difficult to measure the correlation coefficient between them. The PLFRKDR algorithm applies the piecewise linear fitting method to approximate the actual correlation function between an attribute and a continuous query. Since each segment of the piecewise linear function is a linear function, the PLFRKDR algorithm combines the Pearson correlation coefficients of all segments in the piecewise linear function to measure the correlation of each attribute and the continuous query. The PLFRKDR algorithm contains the following three steps.
Step 1. Calculate the Pearson correlation coefficient of each attribute \(x_{i}\in \mathcal {A}\) and the continuous query Q by piecewise linear fitting.
We apply the training set \(\mathcal {S}\) to calculate the piecewise linear function between each attribute \(x_{i}\in \mathcal {A}\) and the target value of the continuous query Q.
Firstly, for each training example S_{l} (1≤l≤m), we extract the tuple (x_{il},y_{ql}), where x_{il} is the value of attribute x_{i} and y_{ql} is the target value in S_{l}. Then, we sort the extracted tuples by the nondescending order of the values of attribute x_{i} and denote the ordered tuples as \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\), where \(x^{1}_{i}\leq \cdots \leq x^{m}_{i}\).
Secondly, the optimal segment points of the sorted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) are calculated recursively by the method proposed in [26]. For one segment, \(\{(x^{s}_{i},y^{s}_{q}), (x^{s+1}_{i},y^{s+1}_{q}), \cdots,(x^{e}_{i},y^{e}_{q})\}\), the linear fitting function of it is derived by the leastsquares method and is denoted as \(F_{i}^{(s,e)}(x)\). The error sum of squares of linear fitting function \(F_{i}^{(s,e)}(x)\) is E(s,e). The optimal segment point \((x^{j}_{i},y^{j}_{q})\) of \(\{(x^{s}_{i},y^{s}_{q}), (x^{s+1}_{i},y^{s+1}_{q}), \cdots,(x^{e}_{i},y^{e}_{q})\}\) should satisfy that \(j=\mathop {\arg \min }_{s< l< e} (E(s,l)+E(l,e))\). This process is conducted recursively until E(s,e)−(E(s,j)+E(j,e))≤γ. The formal description is presented in Algorithm 2. Algorithm 2 returns the set of segment points \(\mathcal {P}_{i}=\{s_{1},s_{2},\cdots,s_{l_{i}}\}\) for segment \(\{(x^{s}_{i},y^{s}_{q}), (x^{s+1}_{i},y^{s+1}_{q}), \cdots,(x^{e}_{i},y^{e}_{q})\}\).
Finally, the sorted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) are divided into l_{i}−1 subsets according to the calculated segment point set \(\mathcal {P}_{i}\). Each subset is denoted as \(D_{j}=\{(x^{s_{j}}_{i},y^{s_{j}}_{q}), \cdots,(x^{s_{j+1}}_{i},y^{s_{j+1}}_{q})\}\), where \(s_{j},s_{j+1}\in \mathcal {P}_{i}\), and 1≤j<l_{i}. The Pearson correlation coefficient \(R^{q}_{ij}\) for subset D_{j} is calculated by the following formula,
where \(\overline {x_{i}}^{j}=\frac {1}{s_{j+1}s_{j}+1}\sum \nolimits _{l=s_{j}}^{s_{j+1}}x_{i}^{l}\) and \(\overline {y_{q}}^{j}=\frac {1}{s_{j+1}s_{j}+1}\sum \nolimits _{l=s_{j}}^{s_{j+1}}y_{q}^{l}\). As a consequence, the correlation coefficient of attribute x_{i} and the continuous query Q is the weighted average of the absolute values of all calculated Pearson correlation coefficients \(\{R^{q}_{ij}1\leq j< l_{i}\}\), which is presented as the following formula.
Step 2. Calculate the candidate relative kernel dataset \(\mathcal {X}\) through reducing the redundancies in attributes.
Initially, all attributes are added to \(\mathcal {X}\), i.e., \(\mathcal {X}=\mathcal {A}=\{x_{1},\cdots,x_{n}\}\). Then, for each two attribute \(x_{i},x_{j}\in \mathcal {X}\), we compute the Pearson correlation coefficient r_{ij} of them by formula (1). When r_{ij}>β, meaning that x_{i} and x_{j} are not βcompatible, x_{i} and x_{j} are redundant in \(\mathcal {X}\) and one of them should be removed. \(R^{q}_{i}\) and \(R^{q}_{j}\) are compared, where \(R^{q}_{i}\) and \(R^{q}_{j}\) are calculated by formula (4). If \(R^{q}_{i}>R^{q}_{j}\), remove x_{j} from \(\mathcal {X}\). Otherwise, x_{i} is removed from \(\mathcal {X}\).
Step 3. Retrieve the relative kernel dataset from the candidate relative kernel dataset for the continuous query Q.
After Step 2, all attributes in candidate relative kernel dataset \(\mathcal {X}\) are βcompatible. Then, we can obtain the topk correlation coefficients of all attributes in \(\mathcal {X}\) calculated in Step 1 and the corresponding k attributes. The topk correlation coefficients are denoted as \(R^{q}_{a_{1}}\geq R^{q}_{a_{2}}\geq \cdots \geq R^{q}_{a_{k}}\), and the corresponding k attributes are denoted as \(x_{a_{1}}, x_{a_{2}}, \cdots, x_{a_{k}}\). Finally, the (k,β)relative kernel dataset of query Q is \(\mathcal {K}^{q}=\{x_{a_{1}}, x_{a_{2}}, \cdots, x_{a_{k}}\}\).
After the above three steps, the (k,β)relative kernel dataset of query Q is calculated. The detailed operations are formally described in Algorithm 3. Algorithm 3 is proposed to retrieve the (k,β)relative kernel dataset of continuous query Q in IoT systems by the piecewise linear fitting method. And Algorithm 2 is called by Algorithm 3 to return the set of segment points for piecewise linear fitting.
Algorithms for estimating the answer of a continuous query with its relative kernel dataset
Once a continuous query Q is issued by users in an IoT system, sensory data of attributes in (k,β)relative kernel dataset of query Q are transmitted to the sink node of the IoT system. After the (k,β)relative kernel dataset of Q is retrieved by the RKDR algorithm (Algorithm 1) or the PLFRKDR algorithm (Algorithm 3), the IoT system should return an answer of Q to the users. In this section, we propose two algorithms to estimate the answers of continuous queries with their relative kernel datasets, which are the piecewise linear fittingbased answer estimating (PLFAE) algorithm and the temporal correlationbased answer estimating (TCAE) algorithm. The detailed descriptions of the PLFAE algorithm and the TCAE algorithm are presented in Algorithm 4 and Algorithm 5, respectively. The PLFAE algorithm is proposed to estimate the answers for continuous queries with their (k,β)relative kernel datasets by piecewise linear fitting method. And the TCAE algorithm is proposed to estimate the answers for continuous queries with less energy consumption by applying the temporal correlations of sensor data.
Piecewise linear fittingbased answer estimating algorithm
The piecewise linear fittingbased answer estimating (PLFAE) algorithm applies the piecewise linear fitting method to estimate the answer of a continuous query with sensory data of attributes in its (k,β)relative kernel dataset. Once a continuous query Q is issued by users in an IoT system at time slot t, only sensory data of attributes in the (k,β)relative kernel dataset \(\mathcal {K}^{q}\) of Q are transmitted to the sink node. For each attribute, the sensory data generated at time slot t is denoted as x_{i}(t). That is, the sink node collects sensory data in \(\{x_{i}(t)x_{i}\in \mathcal {K}^{q}\}\) to estimate the answer of Q. Firstly, the piecewise linear function of each attribute in \(\mathcal {K}^{q}\) and the target value of query Q is calculated by the piecewise linear fitting method. Then, each piecewise linear function is assigned a weight. Finally, the target value of Q at time slot t is estimated by the sensory data of attributes in \(\mathcal {K}^{q}\) according to the weighted piecewise linear functions. The PLFAE algorithm consists of the following three steps.
Step 1. Calculate the piecewise linear functions of query Q and attributes in \(\mathcal {K}^{q}\).
For each attribute \(x_{i}\in \mathcal {K}^{q}\), the piecewise linear function of x_{i} and query Q is denoted as f_{i}. The process of calculating f_{i} consists of the following three substeps.
Step 1.1. Preprocess the training set \(\mathcal {S}=\{S_{1},\cdots,S_{m}\}\).
For each training example S_{l} (1≤l≤m), the value of attribute x_{i} and the target value y_{ql} are extracted and denoted as a tuple (x_{il},y_{ql}). All extracted tuples are sorted by the nondescending order of the values of attribute x_{i}. The sorted tuples are denoted as \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\), where \(x^{1}_{i}\leq \cdots \leq x^{m}_{i}\).
Step 1.2. Calculate the optimal segment points of piecewise linear function f_{i}.
The optimal segment points of f_{i} is calculated by Algorithm 2. The returned segment point set is denoted as \(\mathcal {P}_{i}=\{s_{1},s_{2},\cdots,s_{l_{i}}\}\). As a consequence, the sorted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) are divided into l_{i}−1 subsets, where each subset D_{j} (1≤j<l_{i}) is denoted as \(D_{j}=\{(x^{s_{j}}_{i},y^{s_{j}}_{q}), \cdots,(x^{s_{j+1}}_{i},y^{s_{j+1}}_{q})\}\).
Step 1.3. Calculate the linear functions for all subsets.
For each subset \(D_{j}=\{(x^{s_{j}}_{i},y^{s_{j}}_{q}), \cdots,(x^{s_{j+1}}_{i},y^{s_{j+1}}_{q})\}\), the linear function is calculated by the leastsquares method and is denoted as f_{i,j}. The Pearson correlation coefficient \(R^{q}_{ij}\) for subset D_{j} is calculated by formula (3). As a consequence, the piecewise linear function f_{i} of attribute x_{i} and the continuous query Q is as follows.
Step 2. Assign weights for the calculated piecewise linear functions.
For each attribute \(x_{i}\in \mathcal {K}^{q}\), the generated sensory data at time slot t is x_{i}(t). If x_{i}(t) satisfies that xis_{j}<x_{i}(t)≤xis_{j+1} (\(s_{j},s_{j+1}\in \mathcal {P}_{i}\) and 1≤j<l_{i}), the piecewise linear function f_{i} is assigned a weight with \(w_{i}=R^{q}_{ij}\), where \(R^{q}_{ij}\) is the Pearson correlation coefficient calculated in Step 1.
Step 3. Estimate the target value of query Q at time slot t.
Based on the sensory data generated at time slot t, i.e., \(\{x_{i}(t)x_{i}\in \mathcal {K}^{q}\}\), the estimated target value of Q at time slot t is calculated by the following formula.
The detailed operations for the PLFAE algorithm is described in Algorithm 4. Algorithm 4 is proposed to estimate the answers of continuous queries with their (k,β)relative kernel datasets by piecewise linear fitting method.
Temporal correlationbased answer estimating algorithm
It has been revealed by many research works that the monitored physical environment always varies continuously [27, 28]. That means the difference between the sensory data of an attribute in two continuous time slots may be small. Consequently, the temporal correlationbased answer estimating (TCAE) algorithm utilizes the temporal correlation between sensory data of an attribute to estimate the answer of a continuous query with less energy consumption.
In the PLFAE algorithm, each time a continuous query Q is issued by users, k sensory data of attributes in the (k,β)relative kernel dataset of Q are transmitted and computed to estimate the answer of Q. The basic idea of TCAE algorithm is to further reduce the sensory data transmitted and computed in the IoT system based on the temporal correlations of sensory data.
The sink node of an IoT system stores the latest received sensory data of all attributes, denoted as {b_{1},b_{1},⋯,b_{n}}. Besides, for each attribute \(x_{i}\in \mathcal {A}\), the corresponding sensor also stores its latest transmitted sensory data b_{i}. For each attribute \(x_{i}\in \mathcal {A}\), there exists a value range [p_{i},q_{i}] for its sensory data, i.e., ∀t,p_{i}≤x_{i}(t)≤q_{i}. When a continuous query Q is conducted by users, only sensory data in a subset of \(\{x_{i}(t)x_{i}\in \mathcal {K}^{q}\}\) are transmitted to the sink node. For each attribute \(x_{i}\in \mathcal {K}^{q}\), x_{i}(t) is transmitted to the sink node only if x_{i}(t) satisfies that x_{i}(t)−b_{i}≥α×(q_{i}−p_{i}), where 0≤α≤1 is a userdefined adjustable constant. Applying this method, the energy consumption of the IoT system can be further reduced with a little sacrifice of the accuracy for estimating the answer of a continuous query. Through adjusting α, we can balance the tradeoff between the energy consumption and the estimation accuracy.
If the sensory data x_{i}(t) satisfies that x_{i}(t)−b_{i}≥α×(q_{i}−p_{i}), the stored sensory data in the corresponding sensor is updated to b_{i}=x_{i}(t). Besides, after the sink node received x_{i}(t), it also updates its stored sensory data to b_{i}=x_{i}(t). After the sink node has received all sensory data \(\{x_{i}(t)x_{i}\in \mathcal {K}^{q}, x_{i}(t)b_{i}\geq \alpha \times (q_{i}p_{i})\}\), the sink node estimates the answer of Q by the similar method described in the PLFAE algorithm.
The detailed operations for the TCAE algorithm is described in Algorithm 5. Algorithm 5 is proposed to estimate the answers of continuous queries with their (k,β)relative kernel datasets by piecewise linear fitting method. Algorithm 5 can further reduce the sensory data transmitted and computed in an IoT system by applying the temporal correlations of sensory data.
The performance analysis
Computation complexity of the RKDR algorithm
The computation complexity of computing the Pearson correlation coefficients between query Q and n attributes is O(mn). Then, the computation complexity of calculating the Pearson correlation coefficients between any two attributes is O(mn^{2}). Therefore, the computation complexity of Step 1 is O(mn^{2}). The computation complexity of sorting n Pearson correlation coefficients in Step 2 is O(n logn). In conclusion, the computation complexity for the RKDR algorithm is O(mn^{2}).
Computation complexity of the PLFRKDR algorithm
The computation complexity of Step 1 in the PLFRKDR algorithm consists of three parts. Firstly, the computation complexity of preprocessing the training set is O(nm logm). Then, the computation complexity of calculating optimal segment points for n attributes is O(nm^{3}). Finally, the computation complexity of calculating the correlation coefficients between query Q and attributes is O(nm). Therefore, the computation complexity of Step 1 is O(nm^{3}). The computation complexity of calculating candidate relative kernel dataset in Step 2 is O(mn^{2}). In Step 3, the computation complexity of sorting all correlation coefficients is O(n logn). In conclusion, the computation complexity for the PLFRKDR algorithm is O(nm^{3}).
Computation complexity of the PLFAE algorithm and the TCAE algorithm
The computation complexity of the PLFAE algorithm and the computation complexity of the TCAE algorithm are similar. For each attribute \(x_{i}\in \mathcal {K}^{q}\), the following computations are conducted. Firstly, the computation complexity of sorting the extracted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) is O(m logm). Then, the computation complexity of calculating optimal segment points is O(m^{3}). Finally, the computation complexity of calculating the Pearson correlation coefficients and piecewise linear functions for all segments is O(m). Therefore, the computation complexity of the algorithm is O(km^{3}).
Simulation results and discussion
In this section, the performances of the RKDR algorithm and the PLFRKDR algorithm are evaluated by extensive simulations on both simulation dataset and real dataset. The RKDR algorithm and the PLFRKDR algorithm retrieve the (k,β)relative kernel datasets for continuous queries. The application of the RKDR algorithm and the PLFRKDR algorithm is to estimate the answers of the continuous queries given by users in IoT systems with the retrieved (k,β)relative kernel datasets. Therefore, to evaluate the performances of the RKDR algorithm and the PLFRKDR algorithm, we evaluate the accuracy of the answers estimated by the retrieved (k,β)relative kernel datasets of continuous queries. The PLFAE algorithm and the TCAE algorithm are proposed in this paper to estimate the answers of continuous queries by the retrieved (k,β)relative kernel datasets.
The simulation dataset consists of two subdatasets, which are about two functions y_{q}=f_{1}(x_{1},⋯,x_{15}) and y_{q}=f_{2}(x_{1},⋯,x_{15}), respectively. There are 15 attributes of the IoT systems, which are x_{1},x_{2},⋯,x_{15}. There are two continuous queries of the IoT systems, which are query Q_{1} and query Q_{2}. Each training example in the simulation dataset includes the values of 15 attributes and the target values of query Q_{1} and query Q_{2} following function f_{1} and f_{2}, respectively. We apply the RKDR algorithm and the PLFRKDR algorithm to retrieve the (k,β)relative kernel dataset for query Q_{1} and query Q_{2}. Then, we apply the PLFAE algorithm and the TCAE algorithm to estimate the target values of query Q_{1} and query Q_{2} with the retrieved (k,β)relative kernel dataset.
The real dataset is collected from a mobile device. The collected data from the mobile device contains 6 attributes, i.e. the X, Y, and Zaxes of a threeaxis accelerometer and the X, Y, and Zaxes of a threeaxis gyroscope of the mobile phone, and a motion type of a person holding the mobile phone. Each training example is generated by a person holding the mobile device and making 7 types of motions, which are numbered as motion 1 to 7. Each training example contains values of 6 attributes and a target value y_{q} indicating which type of motion happens. We apply the RKDR algorithm and the PLFRKDR algorithm to retrieve the (k,β)relative kernel dataset for the motion query, where the (k,β)relative kernel dataset is a subset of 6 attributes of the mobile phone. Then, we apply the PLFAE algorithm and the TCAE algorithm to estimate the motion type of the person holding the mobile phone with the retrieved (k,β)relative kernel dataset.
In the following simulations, two types of error are applied to evaluate the performance of the proposed algorithms. The first one is the absolute error \(er_{1}=\frac {\widehat {y_{q}}(t)y_{q}(t)}{y_{q}(t)}\), where \(\widehat {y_{q}}(t)\) is the target value estimated by sensory data of attributes in the (k,β)relative kernel dataset \(\mathcal {K}^{q}\) retrieved by the proposed algorithms and y_{q}(t) is the true target value of query Q at time slot t. However, the true target value of query Q is always unknown or inaccessible in the monitored physical world. As a consequence, another type of error replaces y_{q}(t) by \(y^{\prime }_{q}(t)\), where \(y^{\prime }_{q}(t)\) is the target value estimated by sensory data of all n attributes. The second type of error is called as the relative error, denoted as \(er=\frac {\widehat {y_{q}}(t)y'_{q}(t)}{y'_{q}(t)}\).
In the following subsections, the comparison experiments are firstly conducted to compare the performances of the RKDR algorithm and the PLFRKDR algorithm. Then, both absolute error and relative error are evaluated on simulation dataset to verify the performance of the proposed algorithms. Finally, experiments on real dataset are carried out to verify the performance of the proposed algorithms.
Comparison experiments of the RKDR algorithm and the PLFRKDR algorithm
We compare the RKDR algorithm and the PLFRKDR algorithm by conducting a group of comparison experiments on simulation dataset. We compare the relative error \(er=\frac {\widehat {y_{q}}(t)y'_{q}(t)}{y'_{q}(t)}\) of the RKDR algorithm and the PLFRKDR algorithm when the size of the (k,β)relative kernel dataset increases. The simulation results are presented in Fig. 1. Each data point presented in Fig. 1 is the average of simulation results on 500 times of query.
In Fig. 1, the relative error of both algorithms decreases when k increases. However, the relative error of the PLFRKDR algorithm is much smaller than that of the RKDR algorithm no matter how much k is. In particular, the relative error of the PLFRKDR algorithm is almost a half of the RKDR algorithm when k is larger than 5. Therefore, the PLFRKDR algorithm has better performance than the RKDR algorithm. Besides, Fig. 1 shows that the relative error of both algorithms on simulation dataset with function f_{1} is much less than that of function f_{2}, which reveals that linear correlation cannot describe the relationship between sensory data and query Q_{2} perfectly.
The performance of proposed algorithms on simulation dataset
The absolute error on simulation dataset
The first group of simulations is conducted on the simulation dataset to evaluate the performance of the PLFRKDR algorithm through the absolute error. Firstly, the PLFRKDR algorithm is applied to retrieve the (k,β)relative kernel datasets of continuous queries. Then, the PLFAE algorithm is applied to estimate answers of continuous queries. Finally, the absolute error of the estimated answers of the continuous queries are evaluated. The simulation results are presented in Figs. 2 and 3.
Firstly, we investigate the impact of k, the size of the (k,β)relative kernel dataset, on the absolute error of the estimated answers of the continuous queries. Figure 2 presents the absolute error under three test sets, whose size are T=1000, T=500, and T=100. The data points presented in Fig. 2 are the average of simulation results on 1000, 500, and 100 times of query, respectively. Figure 2 presents that the absolute error decreases with the increasing of k. It is shown that the absolute error can reach to 0.258 when k is only 5, which is pretty small for users.
Then, we study the impact of m, the size of training set, on the absolute error of estimated answers of the continuous queries. Figure 3 presents the absolute error when the size of the (k,β)relative kernel dataset are k=5, k=8, and k=11. Each data point presented in Fig. 3 is the average of simulation results on 500 times of query. Figure 3 reveals that the absolute error of estimated answers of the continuous queries decreases sharply with the increasing of m.
The relative error on simulation dataset
The second group of simulations is carried out on the simulation dataset to evaluate performance of the PLFRKDR algorithm. Firstly, the PLFRKDR algorithm is applied on the simulation dataset to retrieve the (k,β)relative kernel datasets of query Q_{1} and query Q_{2}. Then, the PLFAE algorithm is applied to estimate answers of query Q_{1} and query Q_{2}. Finally, the relative error of the estimated answers of query Q_{1} and query Q_{2} is evaluated. The simulation results are presented in Figs. 4 and 5. Each data point presented in Figs. 4 and 5 is the average of simulation results on 500 times of query.
Firstly, we investigate the impact of k, the size of the (k,β)relative kernel dataset, on the relative error of the estimated answers of query Q_{1} and query Q_{2}. The size of training set is set as m=3000. When the size of the (k,β)relative kernel dataset is fixed, the relative error of the estimated answers of query Q_{1} and query Q_{2} are different since different queries have different relative kernel datasets. That explains why we need to retrieve the relative kernel dataset for each given query. Figure 4 shows that the relative error of the estimated answers of both queries decreases with the increasing of k. Specially, Fig. 4 presents that the relative error is reduced to 0.05 when k is only a half of n. That means the IoT system can save a half of energy consumption when sacrificing only a few accuracy.
Then, we study the impact of m, the size of training set, on the relative error of the estimated answers of the continuous queries. In the simulations, different scales of training set are evaluated and the size of the (k,β)relative kernel dataset is set as k=8. We evaluate the relative error with the size of training set m varying from 500 to 3000 with increment of 500. Figure 5 presents that the relative error decreases slowly with m increases. Similarly, the relative error of the estimated answers of query Q_{1} is less than that of query Q_{2}. It is worth noting that even m is only 500, the relative error is less than 0.15 when k is no less than a half of n.
The performance of the proposed algorithm on real dataset
In this subsection, simulations are conducted on the real dataset to evaluate the performance of the PLFRKDR algorithm. Each training example of the real dataset consists of the values of 6 attributes and a target value indicating which type of motion happens. For each type of motion i, the target value is set as 1 if motion i happens; otherwise, the target value is set as 0. Firstly, the (k,β)relative kernel datasets for motion 1 to 7 are retrieved by the PLFRKDR algorithm. Then, the PLFAE algorithm is applied to estimate the target value of each motion, respectively. Finally, if the target value of a motion i calculated by the PLFAE algorithm with its (k,β)relative kernel dataset retrieved by the PLFRKDR algorithm is larger than 0.5, it is thought that motion i happens. To evaluate the performance of the PLFRKDR algorithm, we evaluate the precision of the judgement of each motion. The simulation results are presented in Figs. 6 and 7.
The first group of simulations are conducted to study the impact of k on the motion judgement precision. The size of training set is fixed to m=4796. The simulation results are presented in Fig. 6, where each data point is the average of the simulation results on 1511 times of query. Figure 6 presents that the precision for each motion increases with k increases. The reason is similar to the analysis of simulations on the simulation dataset. For each type of motion, more than 80% of the queries are correctly answered by the PLFRKDR algorithm and the PLFAE algorithm.
The second group of simulations are carried out to study the impact of m on the motion judgement precision. The simulation results are presented in Fig. 7, where the size of the (k,β)relative kernel dataset is set as k=4. Each data point presented in Fig. 7 is the average of simulation results on 585 times of query. Figure 7 presents that more than 80% of the motions are correctly judged.
Related works
Reducing the energy consumption of an IoT system through reducing sensory data transmitted and computed in the IoT system has been studied for a long time. The existing data reduction algorithms can be divided into two categories. The first category of algorithms are based on sampling. The second category of algorithms are based on compressed sensing.
The first category of data reduction algorithms is samplingbased data reduction algorithms. The authors in [12] investigate Bernoulli samplingbased aggregation algorithms, which can satisfy arbitrary precision requirement. They firstly adapt the sampling probability to satisfy the arbitrary precision requirement given by users. Then, the aggregation is approximately computed with the sampled sensory data. The work in [13] proposes a samplingbased algorithm to compute approximate quantiles in sensor networks. The authors apply random sampling in the algorithm and provide a guarantee that the computed ϕquantile are within error ε with a constant probability which can arbitrarily close to 1. The algorithms proposed in [14] adapt the sampling frequencies for energyhungry sensors in wireless sensor networks according to the real needs of the monitored physical environment. As a consequence, the proposed algorithms can reduce the sensory data by sampling and then reduce the energy consumption of a wireless sensor network. Another work in [15] adaptively adjusts the sampling frequency to retrieve the critical data points of each sensor, which can significantly reduce the energy consumption. The authors in [16] apply adaptive sampling in snow monitoring applications of wireless sensor networks. They proposed algorithms to dynamically estimate the sampling frequencies of sensors to minimize sensory data sensed and transmitted in sensor networks while maintaining acceptable accuracy of the monitoring application.
The second category of data reduction algorithms is compressed sensingbased data reduction algorithms. The authors in [17] propose an algorithm combining the compressed sensing with the principal component analysis. The algorithm can recover the whole dataset through a small subset of data, which can significantly reduce the sensory data transmitted in the network. Other two works [18, 19] are proposed to compress the sensory data in a wireless sensor network by the principal component analysis technique. The work in [20] applies the compressed sensing in both singlesink and multisink wireless sensor networks to efficiently gather sensory data. The authors in [21] propose a compressed sensingbased approach to monitor with vehicular networks. The tradeoff between the communication cost and the estimation accuracy is studied in this work. The authors in [22] proposed an algorithm to minimize the energy consumption of sensor networks through joint routing and compressed aggregation. Compressed sensingbased data reduction algorithms compress raw sensory data and transmit compressed sensory data in the network.
However, the sampling based data reduction algorithms in the first category are only effective for some simple statistic queries, such as maximum, quantile, and average. These algorithms cannot reduce sensory data for arbitrary queries. Besides, the compressed sensingbased data reduction algorithms in the second category only consider the temporal and spatial correlations between sensory data but ignoring the correlation between the sensory data and the queries. The sensory data can be further reduced if the correlations between sensory data and queries of IoT system are taken into consideration. And the data reduction for a given query is not enough in these algorithms. Another work is proposed in [23, 24] to retrieve the dominant dataset from big sensory data for a wireless sensor network. However, the dominant dataset defined in [23, 24] is a general one for all queries instead of a specific one for a given query. As a consequence, the existing algorithms cannot retrieve the relative kernel dataset for continuous queries in IoT systems.
Conclusion
This paper studies the problem of retrieving the relative kernel dataset from big sensory data for continuous queries in IoT systems. The RKDR algorithm and PLFRKDR algorithm are proposed to retrieve the (k,β)relative kernel dataset for continuous queries. Besides, the PLFAE algorithm and TCAE algorithm are proposed to estimate the answers of continuous queries based on their (k,β)relative kernel datasets. Extensive simulations are carried out to evaluate the performances of the proposed algorithms.
Abbreviations
 IoT:

Internet of Things
 LSPLF:

Leastsquaresbased piecewise linear fitting
 PLFAE:

Piecewise linear fittingbased answer estimating
 PLFRKDR:

Piecewise linear fittingbased relative kernel dataset retrieving
 RKDR:

Relative kernel dataset retrieving
 TCAE:

Temporal correlationbased answer estimating
References
 1
G. G. Says, 6.4 billion connected things will be in use in 2016, up 30 percent from 2015 (2015). http://www.gartner.com/newsroom/id/3165317.
 2
J. T. Overpeck, G. A. Meehl, S. Bony, D. R. Easterling, Climate data challenges in the 21st century. Science. 331(6018), 700–702 (2011). American Association for the Advancement of Science.
 3
J. Yu, Y. Qi, G. Wang, X. Gu, A clusterbased routing protocol for wireless sensor networks with nonuniform node distribution. AEUInt. J. Electr. Commun.66(1), 54–61 (2012). Elsevier.
 4
J. Yu, N. Wang, G. Wang, Constructing minimum extended weaklyconnected dominating sets for clustering in ad hoc networks. J. Parallel Distrib. Comput.72(1), 35–47 (2012). Elsevier.
 5
J. Yu, X. Ning, Y. Sun, S. Wang, Y. Wang, in INFOCOM 2017IEEE Conference on Computer Communications, IEEE. Constructing a selfstabilizing cds with bounded diameter in wireless networks under sinr (IEEE, 2017), pp. 1–9.
 6
T. Shi, S. Cheng, Z. Cai, J. Li, in IEEE INFOCOM 2016The 35th Annual IEEE International Conference on Computer Communications. Adaptive connected dominating set discovering algorithm in energyharvest sensor networks (IEEE, 2016), pp. 1–9.
 7
J. Li, S. Cheng, Z. Cai, J. Yu, C. Wang, Y. Li, Approximate holistic aggregation in wireless sensor networks. ACM Trans. Sens. Netw. (TOSN). 13(2), 11 (2017). ACM.
 8
Q. Chen, H. Gao, Z. Cai, L. Cheng, J. Li, in IEEE INFOCOM 2018IEEE Conference on Computer Communications. Energycollision aware data aggregation scheduling for energy harvesting sensor networks (IEEE, 2018), pp. 117–125.
 9
Z. He, Z. Cai, S. Cheng, X. Wang, Approximate aggregation for tracking quantiles and range countings in wireless sensor networks. Theor. Comput. Sci.607:, 381–390 (2015). Elsevier.
 10
S. Cheng, Z. Cai, J. Li, Curve query processing in wireless sensor networks. IEEE Trans. Veh. Technol.64(11), 5198–5209 (2015).
 11
X. Zheng, Z. Cai, J. Li, H. Gao, A study on applicationaware scheduling in wireless networks. IEEE Trans. Mob. Comput.16(7), 1787–1801 (2017). IEEE.
 12
S. Cheng, J. Li, Q. Ren, L. Yu, in Proceedings of the 29th Conference on Information Communications. Bernoulli sampling based (ε, δ)approximate aggregation in largescale sensor networks (IEEE Press, 2010), pp. 1181–1189.
 13
Z. Huang, L. Wang, K. Yi, Y. Liu, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. Sampling based algorithms for quantile computation in sensor networks (ACM, 2011), pp. 745–756.
 14
C. Alippi, G. Anastasi, M. D. Francesco, M. Roveri, An adaptive sampling algorithm for effective energy management in wireless sensor networks with energyhungry sensors. IEEE Trans. Instrum. Meas.59(2), 335–344 (2010).
 15
T. Zhu, S. Cheng, Z. Cai, J. Li, Critical data points retrieving method for big sensory data in wireless sensor networks. EURASIP J. Wirel. Commun. Netw.2016(1), 18 (2016).
 16
C. Alippi, G. Anastasi, C. Galperti, F. Mancini, M. Roveri, in Mobile Adhoc and Sensor Systems, 2007. MASS 2007. IEEE International Conference On. Adaptive sampling for energy conservation in wireless sensor networks for snow monitoring applications (IEEE, 2007), pp. 1–6.
 17
R. Masiero, G. Quer, D. Munaretto, M. Rossi, J. Widmer, M. Zorzi, in Global Telecommunications Conference, 2009. GLOBECOM 2009. IEEE. Data acquisition through joint compressive sensing and principal component analysis (IEEE, 2009), pp. 1–6.
 18
S. V. Macua, P. Belanovic, S. Zazo, in Signal Processing Advances in Wireless Communications (SPAWC), 2010 IEEE Eleventh International Workshop On. Consensusbased distributed principal component analysis in wireless sensor networks (IEEE, 2010), pp. 1–5.
 19
A. Rooshenas, H. R. Rabiee, A. Movaghar, M. Y. Naderi, in Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2010 Sixth International Conference On. Reducing the data transmission in wireless sensor networks using the principal component analysis (IEEE, 2010), pp. 133–138.
 20
H. Wang, Y. Zhu, Q. Zhang, in INFOCOM, 2013 Proceedings IEEE. Compressive sensing based monitoring with vehicular networks (IEEE, 2013), pp. 2823–2831.
 21
H. Zheng, S. Xiao, X. Wang, X. Tian, M. Guizani, Capacity and delay analysis for data gathering with compressive sensing in wireless sensor networks. IEEE Trans. Wirel. Commun.12(2), 917–927 (2013). IEEE.
 22
L. Xiang, J. Luo, A. Vasilakos, in Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2011 8th Annual IEEE Communications Society Conference On. Compressed data aggregation for energy efficient wireless sensor networks (IEEE, 2011), pp. 46–54.
 23
S. Cheng, Z. Cai, J. Li, X. Fang, in 2015 IEEE Conference on Computer Communications (INFOCOM). Drawing dominant dataset from big sensory data in wireless sensor networks (IEEE, 2015), pp. 531–539.
 24
S. Cheng, Z. Cai, J. Li, H. Gao, Extracting kernel dataset from big sensory data in wireless sensor networks. IEEE Trans. Knowl. Data Eng.29(4), 813–827 (2017).
 25
T. Kalra, R. Mathew, S. P. Pal, V. Pandey, in Conference on Algorithms and Discrete Applied Mathematics. Maximum weighted independent sets with a budget (Springer, 2017), pp. 254–266.
 26
T. L. L. Zongtian, Leastsquares method piecewise linear fitting. Comput. Sci.39(6A), 482–484 (2012).
 27
S. Cheng, J. Li, Z. Cai, in INFOCOM, 2013 Proceedings IEEE. O (ε)approximation to physical world by sensor networks (IEEE, 2013), pp. 3084–3092.
 28
J. Li, S. Cheng, H. Gao, Z. Cai, Approximate physical world reconstruction algorithms in sensor networks. IEEE Trans. Parallel Distrib. Syst.25(12), 3099–3110 (2014). IEEE.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China under Grant NO. 61632010, 61502116, U1509216, 61370217, and 61602129 and the National Science Foundation (NSF) under grant NO.1741277.
Funding
This work is partly supported by the National Natural Science Foundation of China under Grant NO. 61632010, 61502116, U1509216, 61370217 and 61602129 and the National Science Foundation (NSF) under grant NO.1741277.
Availability of data and materials
No applicable.
Author information
Affiliations
Contributions
TX designed the idea, performed the experiments, and wrote the manuscript. JB and SY gave valuable suggestions on the design of the study. YS assisted in processing the data and helped to finish the simulation results. JZ supervised the work and helped to check and revise the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Jinbao Wang.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Received
Accepted
Published
DOI
Keywords
 Internet of things
 Big sensory data
 Relative kernel dataset