Research | Open Access

# Data quality analysis and cleaning strategy for wireless sensor networks

Hongju Cheng^{1, 2}, Danyang Feng^{1}, Xiaobin Shi^{1}, and Chongcheng Chen^{2}

**2018**:61

https://doi.org/10.1186/s13638-018-1069-6

© The Author(s). 2018

**Received:** 2 January 2018 | **Accepted:** 28 February 2018 | **Published:** 15 March 2018

## Abstract

The quality of data in wireless sensor networks has a significant impact on decision support, and data cleaning is an effective way to improve data quality. However, if the data cleaning strategy is not correctly designed, it may yield an unsatisfactory cleaning effect while increasing the system cleaning cost. In this paper, we first introduce data quality evaluation indicators for wireless sensor networks and their measurement methods. We then explore the relationships between the different indicators used in the quality assessment. Finally, by comparing and analyzing data cleaning schemes that apply the cleaning steps in different orders, we propose a data cleaning strategy for wireless sensor networks based on the relationships between the data quality indicators. The experimental results show that, for the same cleaning cost, the proposed data cleaning strategy effectively improves data availability and achieves a better cleaning effect in wireless sensor networks.

## Keywords

- Wireless sensor networks
- Data quality
- Data cleaning

## 1 Introduction

In wireless sensor networks (WSNs), many errors occur in the sensor data due to characteristics such as low-cost sensors, limited resources, and link variation [1]. These errors appear in different modes, for example, data loss or anomalies caused by hardware faults, data failure due to transmission delays, and sampling jitter [2] caused by node task conflicts. The dataset collected at the sink node may contain several of these errors simultaneously.

The data-centric feature is becoming increasingly prominent with wireless sensor networks that are widely deployed in the real world. Data is the bridge between the network and the physical world, and the quality of data has an important impact on the application. However, the dataset is not reliable due to numerous data errors in the network. It is necessary to improve the data quality to support various applications [3].

There are two main aspects of data management in wireless sensor networks: data quality assessment and data cleaning technology. The current mainstream approach is to decompose data quality into specific quality indicators [4] such as accuracy, timeliness, completeness, and consistency [5]. Dozens of metrics are currently used to assess the quality of sensory data, but the search for a common and valid data quality assessment framework is still ongoing. Data cleaning aims to detect and eliminate errors in the raw data [6]. Current data cleaning strategies generally deal with duplicate object detection, outlier detection, and missing data processing. Duplicate object detection determines whether there is data duplication or inconsistency based on the data volume and consistency indicators. Abnormal data detection aims at identifying and correcting abnormal data. Elimination of sampling jitter mainly serves the time-related indicators, while missing data processing serves the data integrity indicators.

There are relations among different quality indicators in data cleaning. Fan et al. [7] show that data quality indicators are not completely isolated. Although a data cleaning strategy might be designed for a given indicator, it may influence other indicators at the same time. For example, mending missing data improves integrity but may cause uncertain changes in the accuracy measurement, because the related cleaning techniques cannot guarantee data correctness [8]. In contrast, abnormal data correction can improve correctness without changing the measurement of the data integrity indicator. However, current research pays little attention to the impact of the relationships between quality indicators, and a systematic study of these relationships in wireless sensor networks is still an open issue. The main contributions of this paper are as follows:

- (1).
We introduce four indicators for data quality assessment: data volume, correctness, completeness, and time correlation. We also provide detailed measurements of the relationships between the different indicators.

- (2).
By utilizing the relationships among different indicators, we theoretically analyze the final results of cleaning strategies that apply the cleaning steps in different orders.

- (3).
An efficient data cleaning strategy is proposed to handle multiple mixed errors in wireless sensor networks, and its effect is verified by experiments.

The paper is organized as follows. In Section 2, we present the related works. Section 3 describes the system model and the problem formulation. In Section 4, we describe the measurement of the quality indicators. In Section 5, we introduce the method, including the relationships between indicators and the proposed cleaning strategy. Section 6 presents the simulation results, and Section 7 concludes the paper.

## 2 Related works

There is a large body of research on data quality and its assessment. Data quality is usually divided into different indicators, i.e., accuracy, completeness, and timeliness [4]. In order to deal with “dirty” data, Klein et al. [5] propose five measures to evaluate the quality of sensor data streams, namely accuracy, credibility, integrity, data volume, and timeliness. A flexible model for data quality dissemination and processing is used to capture, process, and deliver the quality features and support the corresponding business tasks. Li et al. [6] define metrics and observe real-world data using three commonly used indicators: timeliness, availability, and effectiveness. The definition of these indicators ensures that their parameters are interpretable and can be obtained by analyzing historical data.

Currently, there is a considerable body of work on data cleaning. Ghorbel et al. [9] propose a method of detecting outliers by using the Mahalanobis distance based on kernel principal component analysis (KPCA). KPCA maps the data points to another feature space, which separates the outliers from the normal data distribution patterns. Experiments show that KPCA performs well in outlier detection and can find abnormal values quickly and effectively. Zhuang et al. [10] propose a method of cleaning outlier values in the network based on wavelet-based correction and distance-based dynamic time warping (DTW) outlier detection. The cleaning is completed during multi-hop data forwarding using the neighbor relationships in the hop-based routing algorithm. Experiments show that this method can clean abnormal sensing data.

Hamrani et al. [11] use radial basis functions as the basic interpolation functions for data restoration in WSNs. Li et al. [12] propose a *kd*-tree-based K-nearest neighbor (KNN) data restoration algorithm that uses weighted variance and weighted Euclidean distance to construct a binary search tree over the *k*-dimensional non-missing data. The weight of an attribute is inversely proportional to its amount of data loss and proportional to its variance. For time-dependent sampling jitter, Rahm et al. [13] aim at repairing non-uniformly sampled time series and propose to eliminate the data error by linear interpolation. During the execution of the algorithm, a linear function is fitted through the data points immediately before and after the problem point in the time series, and the target data point is re-estimated at the correct sampling time, yielding a value close to the true one. In this way, the inaccuracy caused by node sampling jitter is eliminated and a regularly sampled WSN dataset is obtained.
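The linear-interpolation idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation from [13]: the function name, signature, and the `(time, value)` representation are our own assumptions.

```python
def correct_jitter(samples, t_ideal):
    """Estimate the value at the ideal sampling time t_ideal by linear
    interpolation between the two neighboring (t_real, value) samples.

    `samples` is a list of (time, value) pairs sorted by time; names and
    representation are illustrative, not taken from the cited paper.
    """
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t_ideal <= t1:
            if t1 == t0:                    # duplicate timestamps: no slope
                return v0
            slope = (v1 - v0) / (t1 - t0)   # linear model through neighbors
            return v0 + slope * (t_ideal - t0)
    raise ValueError("t_ideal outside the sampled range")
```

A sample jittered at t = 2.2 s can thus be re-estimated at its ideal sampling instant t = 2.0 s from its two neighbors.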

Although some works have studied data management in the areas of data assessment and data cleaning [14], the relationship between data quality indicators is still a challenging issue. Fan et al. [7] point out that the various indicators of data quality are not isolated from each other, for example completeness and timeliness. However, that work does not study the specific relationships between the quality indicators and does not explicitly point out their relevance to each other. Ding et al. [8] study the relationships between data quality properties in information systems. However, the quality evaluation properties of information systems cannot be used directly in WSNs, and that paper does not analyze the differences in the final results of data cleaning strategies applied in different orders.

## 3 Network model and problem

Assume that there are *n* sensor nodes deployed in the monitoring area, denoted as *S* = {*s*_{1}, *s*_{2}, …, *s*_{ n }}. The total time to monitor the area is *T*. The nodes are time synchronized, and the sampling interval is Δ*t*. At a given time, one node can collect *k* physical quantities, and the collected data of node *i* at time *t* can be represented by the set *X*(*i*, *t*). The data sequence collected by node *i* during the monitoring time *T* is denoted as *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. The dataset collected by all nodes in *S* and received at the sink node during the monitoring time *T* can be represented by a matrix *D* of size (*T*/Δ*t*) × *n*, whose column *i* is the sequence *X*_{ i }.

By detailed analysis of the different quality indicators shown in [15–18], we adopt the following metrics for data quality evaluation in WSNs: data volume, completeness, time correlation, and correctness. Let *q*_{ v }, *q*_{ c }, *q*_{ t }, and *q*_{ a } represent the corresponding quality indicators of dataset *D*.

The quality assessment and data cleaning of dataset *D* are done at the sink node. Data cleaning includes missing data patching, sampling jitter correction, and outlier detection and correction.

We assume that the physical signal detected by a sensor node changes in a smooth way. For example, the temperature or humidity within 1 day usually changes continuously and smoothly. This constraint is necessary for sampling jitter elimination and the data cleaning process, and amounts to assuming that the sampling interval is much smaller than the time scale over which the physical signal changes.

Similar to [8], the relationship between different quality indicators is defined as follows. For a given dataset *D*, let *d*_{ i }, *d*_{ j } ∈ {*q*_{ v }, *q*_{ c }, *q*_{ t }, *q*_{ a }} denote two different quality indicators. The metric of *D* on *d*_{ i } is denoted as *q*_{ i }, and the metric on *d*_{ j } as *q*_{ j }. The new dataset after data cleaning for *d*_{ i } is denoted as *D*_{new}. The new metric of *d*_{ i } on *D*_{new} is denoted as *q*_{ i }*′* and that of *d*_{ j } as *q*_{ j }*′*. We have Δ*q*_{ i } = *q*_{ i }*′* − *q*_{ i } and Δ*q*_{ j } = *q*_{ j }*′* − *q*_{ j }. Here, we assume that Δ*q*_{ i } > 0 because data cleaning is generally used to improve the data metric.

1. If Δ*q*_{ j } > 0, cleaning for indicator *d*_{ i } leads to an increment in the metric of indicator *d*_{ j }. In this case, *d*_{ i } is *positively correlated* with *d*_{ j }, which is denoted as *d*_{ i } ≺ *d*_{ j }.

2. If Δ*q*_{ j } < 0, cleaning for indicator *d*_{ i } leads to a reduction in the metric of indicator *d*_{ j }. In this case, *d*_{ i } is *negatively correlated* with *d*_{ j }, which is denoted as *d*_{ i } ≻ *d*_{ j }.

3. If Δ*q*_{ j } = 0, cleaning for indicator *d*_{ i } has no impact on the metric of indicator *d*_{ j }. In this case, *d*_{ i } and *d*_{ j } are *irrelevant*, which is denoted as *d*_{ i } ⊀ *d*_{ j }.

4. If Δ*q*_{ j } > 0 holds with a probability *p* ∈ (0,1), cleaning for indicator *d*_{ i } leads to an increment in the metric of *d*_{ j } with probability *p*. In this case, *d*_{ i } and *d*_{ j } are *not completely related*, which is denoted as \( {d}_i\ \overset{\sim }{\prec }\ {d}_j \).
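The four cases above can be operationalized by observing Δ*q*_{ j } over repeated cleaning runs for *d*_{ i }. The following sketch is our own illustration of that classification, not code from the paper; the function name and the sampling-based approach are assumptions.

```python
def classify_relation(delta_qj_samples, eps=1e-12):
    """Classify the relation between indicators d_i and d_j from the observed
    changes Δq_j across repeated cleaning runs for d_i (the four cases above).
    Illustrative only; the paper gives the symbolic definition, not code."""
    pos = sum(1 for d in delta_qj_samples if d > eps)
    neg = sum(1 for d in delta_qj_samples if d < -eps)
    n = len(delta_qj_samples)
    if pos == n:
        return "positively correlated"   # Δq_j > 0 in every run
    if neg == n:
        return "negatively correlated"   # Δq_j < 0 in every run
    if pos == 0 and neg == 0:
        return "irrelevant"              # Δq_j = 0 in every run
    return "not completely related"      # Δq_j > 0 only with probability p
```

For instance, a mix of positive and zero changes is classified as "not completely related", matching case 4.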

As mentioned in the introduction, the collected dataset *D* in a WSN contains different data errors, such as data missing, data anomalies, sampling jitter, and data invalidation. Applying a cleaning process to the dataset leads to interactions between two different indicators, *d*_{ i } and *d*_{ j }. The first part of this paper studies the quality indicators and formally describes the relationships between pairs of indicators. The second part compares and analyzes the performance of different data cleaning orders and derives a proper data cleaning strategy.

## 4 Data quality indicators and metrics

### 4.1 Data volume indicators

The data volume describes the size of the dataset, which can be used to describe the working state of a given sensor node. If a node has less data than the other nodes, it is considered that data has been lost. The data volume also reflects the availability of the dataset and the reliability of results derived from it. For example, if a mean operation is performed on two datasets of different sizes for the same observation object, the result from the dataset with the smaller data volume is considered less trustworthy.

**Definition 1** (Data volume indicator) Assume that the monitoring area has *n* nodes, the monitoring time duration is *T*, and all nodes collect data with the same time interval Δ*t*. The data sequence of node *i* in the monitoring duration *T* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. The existence of a sampling for node *i* at time *t* is defined as

\( {f}_v\left(X\left(i,t\right)\right)=\begin{cases}1,& X\left(i,t\right)\ \mathrm{exists}\\ 0,& \mathrm{otherwise}\end{cases} \)

Let *v*_{ i } be the number of samplings for node *i*:

\( {v}_i=\sum_{t=1}^{T/\Delta t}{f}_v\left(X\left(i,t\right)\right) \)
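As a minimal sketch of Definition 1, missing samplings can be marked as `None` and counted; the representation and function names here are our own assumptions, not the paper's.

```python
def f_v(record):
    """Existence of a sampling: 1 if the record X(i, t) exists (is not
    missing), else 0. A sketch of Definition 1; None marks a lost record."""
    return 0 if record is None else 1

def data_volume(sequence):
    """v_i: number of samplings actually present in the sequence X_i."""
    return sum(f_v(r) for r in sequence)
```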

### 4.2 Completeness indicator

Completeness describes the seriousness of data loss problems in the dataset. The completeness indicator is generally measured with the proportion of the raw data volume compared with the required data volume.

**Definition 2** (Completeness indicator) Assume that the monitoring area has *n* nodes, the monitoring time duration is *T*, and all nodes collect data with the same time interval Δ*t*. The data sequence of node *i* in the monitoring duration *T* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. The completeness of a data record *X*(*i*, *t*) = {*x*_{1}, *x*_{2}, …, *x*_{ k }} is defined as

\( {f}_c\left(X\left(i,t\right)\right)=\begin{cases}1,& \mathrm{no\ attribute\ of}\ X\left(i,t\right)\ \mathrm{is\ missing}\\ 0,& \mathrm{otherwise}\end{cases} \)

The completeness of dataset *D* at time *t* is denoted as *cv*_{ t }, that is,

\( {cv}_t=\sum_{i=1}^n{f}_c\left(X\left(i,t\right)\right) \)
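A record is complete only when all *k* attributes are present. The sketch below illustrates Definition 2 under the assumption (ours, not the paper's) that records are lists with `None` marking a missing attribute or record.

```python
def f_c(record, k):
    """Completeness of one record X(i, t) = {x_1, ..., x_k}: 1 only if all k
    attributes are present. None marks a missing record or attribute."""
    if record is None or len(record) != k:
        return 0
    return int(all(x is not None for x in record))

def completeness_at_t(records_at_t, k):
    """cv_t: number of complete records over all n nodes at time t."""
    return sum(f_c(r, k) for r in records_at_t)
```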

### 4.3 Time-related indicator

**Definition 3** (Time-related indicator) Assume that the monitoring area has *n* nodes, the monitoring time duration is *T*, and the collection interval of all nodes is Δ*t*. The volatility is defined as the length of time during which the data remains valid, volatility = *k*Δ*t*, where *k* is a constant that can be chosen differently for various situations. The age of the data record from node *i* at moment *t* is defined as its currency,

\( \mathrm{currency}=\left({t}_{\mathrm{arrive}}-{t}_{\mathrm{real}}\right)+\left|{t}_{\mathrm{real}}-{t}_{\mathrm{ideal}}\right| \)

where *t*_{ideal} is the ideal sampling time, *t*_{real} is the actual sampling time, and *t*_{arrive} is the system time at which the sink node receives the data record. The timeliness of record *X*(*i*, *t*) is then

\( {f}_t\left(X\left(i,t\right)\right)=\max \left(0,1-\frac{\mathrm{currency}}{\mathrm{volatility}}\right) \)

and the time-related indicator *q*_{ t } of dataset *D* is the average timeliness over all its records.

### 4.4 Correctness indicator

The correctness indicator describes the closeness of the monitored value to the true value. For the data obtained from one sampling of a specific physical quantity (such as temperature), the data is considered correct if the error between the measured value and the real value of the environment is less than a given threshold.

**Definition 4** (Correctness indicator) Assume that the monitoring area has *n* nodes, the monitoring time duration is *T*, and all nodes collect data with the same time interval Δ*t*. The data sequence of node *i* in the monitoring duration *T* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. The observation value can be expressed as val = val_{real} + Δ, a combination of the real value of the environment val_{real} and the error Δ. The correctness of node *i* at time *t* is defined as

\( {f}_a\left({\mathrm{val}}_t\right)=\begin{cases}1,& \left|\Delta \right|<{\xi}_c\\ 0,& \mathrm{otherwise}\end{cases} \)

where *ξ*_{ c } is the error threshold. The correctness indicator *q*_{ a } of dataset *D* is the proportion of correct observations among all observations in *D*.
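Definition 4 can be sketched as below. Note one practical caveat encoded as an assumption: the real value is unknown in deployment and must itself be estimated (e.g., from neighboring nodes); here it is simply passed in for illustration.

```python
def f_a(value, real_value, xi_c):
    """Correctness of one observation val = val_real + Δ: 1 if the error
    magnitude is below the threshold ξ_c (a sketch of Definition 4)."""
    return int(abs(value - real_value) < xi_c)

def correctness_indicator(values, real_values, xi_c):
    """q_a: fraction of observations whose error is within the threshold."""
    flags = [f_a(v, r, xi_c) for v, r in zip(values, real_values)]
    return sum(flags) / len(flags)
```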

### 4.5 Data quality evaluation coefficient

**Definition 5** (Data quality evaluation coefficient) Given the dataset *D* in the time duration *T*, the data quality *Q* is the weighted combination of the data volume, correctness, completeness, and time-related indicators:

\( Q=\sum_i{w}_i{q}_i \)

in which *w*_{ i } is the weight of each indicator.
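Combining the four metrics into *Q* is then a weighted sum. The paper does not fix the weights *w*_{ i }; the equal weights and the [0, 1] normalization of each metric below are our assumptions for illustration.

```python
def quality_coefficient(q_v, q_c, q_t, q_a, weights=(0.25, 0.25, 0.25, 0.25)):
    """Q: weighted combination of the four indicator metrics (Definition 5).
    Equal weights are assumed here; each q is a value normalized to [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(w * q for w, q in zip(weights, (q_v, q_c, q_t, q_a)))
```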

## 5 Method

Data management requires not only data quality assessment but also high-quality datasets obtained by data cleaning or other technologies. Quality indicators affect each other during the data cleaning process. This paper aims at finding the relationships between quality indicators as well as a proper data cleaning strategy. Note that the relationships between indicators analyzed below are considered within the data cleaning process unless otherwise specified.

### 5.1 Relationship between data volume indicator and others

**Theorem 1** *The data volume indicator and completeness indicator are not completely correlated*.

*Proof* Given the time duration *T* in the same location and the sampling interval Δ*t*, let the data sequence collected by an unreliable node be *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)] and the sequence collected by a reliable node be *X*_{ i }^{′} = [*X*^{′}(*i*, 1), *X*^{′}(*i*, 2), …, *X*^{′}(*i*, *T*/Δ*t*)]. The data sizes are denoted as *v*_{ i } and *v*_{ i }*′*, respectively, with *v*_{ i }*′* > *v*_{ i }.

For data sequence *X*_{ i }, each instance *X*(*i*, *t*) is lost independently with probability *p*_{loss}, so the number of complete records *cv*_{ t } follows a binomial distribution. Similarly, for data sequence *X*_{ i }*′* with its own loss probability, *cv*_{ t }*′* follows a binomial distribution too. According to Formula (6), the variation of the completeness indicator is Δ*q*_{ c } = *q*_{ c }*′* − *q*_{ c }, in which *cv*_{ t } and *cv*_{ t }*′* are independent and each follows a binomial distribution.

So, we have *p*(Δ*q*_{ c } ≥ 0) ∈ (0,1). In this way, \( {q}_v\overset{\sim }{\prec }\ {q}_c \) ■.
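The binomial argument can be checked with a small Monte Carlo experiment. All parameters below (sequence length, loss probabilities, trial count, seed) are illustrative choices of ours; the theorem only claims that the probability lies strictly in (0, 1).

```python
import random

def completeness_metric(n, p_loss, rng):
    """Fraction of n scheduled records surviving independent loss with
    probability p_loss (the surviving count is Binomial(n, 1 - p_loss))."""
    return sum(1 for _ in range(n) if rng.random() >= p_loss) / n

def prob_improvement(n=50, p_loss=0.2, p_loss2=0.1, trials=2000, seed=7):
    """Estimate p(Δq_c > 0) when comparing an unreliable sequence (p_loss)
    against a more reliable one (p_loss2 < p_loss)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        q_c = completeness_metric(n, p_loss, rng)
        q_c2 = completeness_metric(n, p_loss2, rng)
        wins += q_c2 > q_c
    return wins / trials
```

With these settings the estimated probability is high but strictly below 1, consistent with the "not completely correlated" relation.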

**Theorem 2** *The data volume indicator and time correlation indicator are not completely correlated*.

*Proof* Given the time duration *T*, let *v*_{ i } be the size of the data sequence *X*_{ i } collected by an unreliable node, and *v*_{ i }′ be the size of the data sequence *X*_{ i }*′* collected by a reliable node. Each data instance suffers independent sampling jitter with probability *p*_{time}, and the transmission delay of a data instance in the network follows a normal distribution. According to Formula (10), the variation of the time correlation indicator is Δ*q*_{ t } = *q*_{ t }*′* − *q*_{ t }, in which *q*_{ t } and *q*_{ t }′ are independent of each other and each follows a binomial distribution.

So, we have *p*(Δ*q*_{ t } ≥ 0) ∈ (0,1). In conclusion, there is no complete correlation between the data volume indicator and the time correlation indicator, which can be described as \( {q}_v\overset{\sim }{\prec }{q}_t \) ■.

**Theorem 3**
*The data volume indicator and the correctness indicator are not completely correlated.*

*Proof* Similar to the proof of Theorem 1, each data instance is independently wrong with probability *p*_{error}. According to Formula (12), the correctness indicators *q*_{ a } and *q*_{ a }′ are independent of each other and each follows a binomial distribution. We have Δ*q*_{ a } = *q*_{ a }*′* − *q*_{ a } with *p*(Δ*q*_{ a } ≥ 0) ∈ (0,1). So, we have \( {q}_v\ \overset{\sim }{\prec }\ {q}_a \) ■.

### 5.2 Relationship between completeness indicator and others

**Theorem 4** *There is a positive correlation between the completeness indicator and data volume indicator*.

*Proof* In the time duration *T*, the data sequence of node *i* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. A missed value is represented as val_{ t } = null, where val_{ t } ∈ *X*(*i*, *t*) = {*x*_{1}, *x*_{2}, …, *x*_{ k }}. Completeness cleaning repairs these missing values, so the data volume cannot decrease, i.e., *v*_{ i }*′* − *v*_{ i } ≥ 0.

So, we have *q*_{ c } ≺ *q*_{ v } ■.

**Theorem 5**
*There is no correlation between the completeness indicator and the time-related indicator after repairing the missing data of the dataset assuming that only the collected data is calculated by the time-related indicator.*

*Proof* In the time duration *T*, the data sequence of node *i* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. When there is a data loss, we have val_{ t } = null. After data cleaning fixes the missed data, these values are no longer empty, and thus the effective data volume increases from *cv*_{ t } to *cv*_{ t }*′*. However, this increment is independent of the time-related metric, because the repair is carried out at the sink node and only the originally collected data is counted by the time-related indicator. According to Formula (10), Δ*q*_{ t } = 0. So, we have *q*_{ c } ⊀ *q*_{ t } ■.

**Theorem 6** *The completeness indicator and the correctness indicator are not completely correlated*.

*Proof* In the time duration *T*, the data sequence of node *i* is *X*_{ i } = [*X*(*i*, 1), *X*(*i*, 2), …, *X*(*i*, *T*/Δ*t*)]. A data loss is represented as val_{ t } = null, where val_{ t } ∈ *X*_{ i }. Completeness cleaning adds the lost data back into the sequence, and thus the effective data volume increases from *cv*_{ t } to *cv*_{ t }*′*. A repaired value is correct, i.e., its error is within the threshold, only with some probability *p*_{ r }, so *p*(Δ*q*_{ a } ≥ 0) ∈ (0,1).

So, we have \( {q}_c\tilde{\prec}{q}_a \) ■.

### 5.3 Relationship between time correlation indicator and others

**Theorem 7**
*There is no correlation between time correlation indicator and data volume indicator.*

*Proof* The timeliness measurement is calculated by Formula (8), and the currency of *X*(*i*, *t*) decreases after data cleaning because the jitter is eliminated. At the same time, the cleaning does not add sampling records, which means that the number of records is not changed. According to Definition 1, Δ*q*_{ v } = 0. So, we have *q*_{ t } ⊀ *q*_{ v } ■.

**Theorem 8** *The time correlation indicator and completeness indicator are irrelevant*.

*Proof* According to the definition of the timeliness measurement, the currency decreases because the jitter is eliminated by the time-related cleaning, while *f*_{ c }(*X*(*i*, *t*)) remains unchanged for every *X*(*i*, *t*). According to Definition 2, the effective data volume *cv*_{ t } = ∑*f*_{ c }(*X*(*i*, *t*)) remains unchanged. According to Formula (14), we have Δ*q*_{ c } = 0. So, we have *q*_{ t } ⊀ *q*_{ c } ■.

**Theorem 9** *In the case that the physical signal changes continuously and smoothly, there is a positive correlation between the time-related indicator and the correctness indicator after eliminating jitter in the collected dataset*.

*Proof* As shown in Fig. 1, the sampling time is *t*_{real} = *t*_{ideal} + Δ*t*, while the observation value is val = val_{real} + Δ, where Δ is the error caused by the jitter Δ*t*.

Considering the general situation, the physical signals observed by the nodes change continuously and smoothly over a long period of time, and the frequency of signal changes is far lower than the sampling frequency of the nodes. When the sampling delay Δ*t* decreases, we can assume that the error Δ decreases too. According to Definition 4, *f*_{ a }(val_{ t }) = 1 when |Δ| < *ξ*_{ c }. So, ∑*f*_{ a }(val_{ t }) increases for a given data sequence *X*_{ i }. According to Formula (12), we have Δ*q*_{ a } > 0. So, we get *q*_{ t } ≺ *q*_{ a } ■.

### 5.4 Relationship between correctness indicator and others

**Theorem 10** *There is no correlation between the correctness indicator and the data volume indicator*.

*Proof* The observed value can be described as val = val_{real} + Δ, in which Δ is the error. In the case that |Δ| > *ξ*_{ c }, the value is considered abnormal, and correctness cleaning will eliminate the data error so that *f*_{ a }(val_{ t }) = 1. At the same time, the number of records in dataset *D* at time *t* is not changed, so according to Definition 1 we have Δ*q*_{ v } = 0. So, we have *q*_{ a } ⊀ *q*_{ v } ■.

**Theorem 11**
*There is no correlation between the correctness indicator and the completeness indicator.*

**Theorem 12** *There is no correlation between the correctness indicator and the time correlation indicator*.

*Proof* The proof process is similar to that in Theorem 10 ■.

### 5.5 Analysis of sequential cleaning strategies

As mentioned before, the dataset *D* collected in a WSN may contain multiple mixed errors, such as missing data, sampling jitter, and abnormal values. The existence of these errors leads to lower metrics for the data indicators, i.e., *q*_{ c }, *q*_{ t }, and *q*_{ a }. There are several combinations for the data cleaning strategies in which the cleaning process is carried out in different orders:

- (1)
Completeness, time-related, and correction;

- (2)
Completeness, correction, and time-related;

- (3)
Time-related, completeness, and correction;

- (4)
Time-related, correction, and completeness;

- (5)
Correction, completeness, and time-related;

- (6)
Correction, time-related, and completeness.

According to the relationship analysis in the previous section, completeness cleaning cannot guarantee data correctness, and thus abnormal data might still exist if correctness cleaning is performed before completeness cleaning. This means that orders (4), (5), and (6) are not suitable for WSNs. In order (3), the performance of the time-related cleaning algorithm cannot be guaranteed, especially when data loss is serious in the original dataset. In order (2), eliminating the jitter helps to reduce the abnormal data; however, if there is a peak between two adjacent collections, Theorem 9 does not hold, which may lead to poor performance after the cleaning process.

On the other hand, if we adopt order (1), completeness cleaning is carried out first, which repairs the lost data and helps to guarantee the performance of the subsequent time-related cleaning. The final correctness cleaning then eliminates the abnormal data remaining after the previous two steps, and the final metrics for all three indicators increase accordingly. In this way, order (1) is the best among the candidate strategies.

### 5.6 Data cleaning strategy

- Step 1: Calculate the volume indicator of dataset *D*.
- Step 2: If the volume indicator is larger than a given threshold, then
- Step 3: Clean the dataset by the completeness indicator;
- Step 4: Clean the dataset by the time-related indicator;
- Step 5: Clean the dataset by the correctness indicator;
- Step 6: End.

Steps 1 and 2 determine whether the cleaning process is necessary. The volume indicator describes the size of the collected data. If the size is very small, the network is probably not working properly because the system cannot gather enough data, and the reliability of the data is very low. Although data cleaning could repair the lost data, it is considered useless when the volume indicator is below the threshold, because the reliability of the resulting dataset would still be too low. Steps 3 to 5 carry out the cleaning process via the completeness, time-related, and correctness indicators, in the order justified in the previous section.
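The whole strategy can be sketched for a single node's time series. This is our own simplified illustration: neighbor interpolation stands in for the KNN repair used in the paper, a local-median rule stands in for wavelet-based outlier correction, and the jitter step is omitted because this sketch carries no per-sample timestamps. The threshold value is also an assumption.

```python
def clean_dataset(series, volume_threshold=0.5):
    """Sketch of the proposed cleaning order for one node's series
    (None marks a missing sample). Stand-in algorithms, not the paper's."""
    n = len(series)
    # Steps 1-2: volume gate -- too little data means cleaning is pointless.
    if sum(v is not None for v in series) / n < volume_threshold:
        return None                          # network likely misbehaving
    data = list(series)
    # Step 3: completeness -- fill missing samples from nearest neighbors.
    for i, v in enumerate(data):
        if v is None:
            left = next((data[j] for j in range(i - 1, -1, -1)
                         if data[j] is not None), None)
            right = next((data[j] for j in range(i + 1, n)
                          if data[j] is not None), None)
            neighbors = [x for x in (left, right) if x is not None]
            data[i] = sum(neighbors) / len(neighbors)
    # Step 4 (time-related jitter correction) omitted: no timestamps here.
    # Step 5: correctness -- replace values far from the local median.
    cleaned = []
    for i, v in enumerate(data):
        window = data[max(0, i - 2): i + 3]
        med = sorted(window)[len(window) // 2]
        cleaned.append(med if abs(v - med) > 3.0 else v)
    return cleaned
```

A series with one missing sample and one spike is repaired in the stated order; a series that fails the volume gate is rejected outright.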

## 6 Simulation

The simulation is carried out in MATLAB based on the dataset of the Intel indoor laboratory project. The project includes 54 Mica2Dot sensor nodes in the Intel Berkeley Research Lab. The temperature, humidity, and light data of the environment are collected every 30 s by the nodes, and the data are gathered through the TinyDB in-network query processing system [20]. In this paper, data cleaning is carried out with wavelet-based abnormal data detection and correction, sampling jitter elimination based on linear interpolation, and missing data patching based on KNN. We first verify Theorem 1 to Theorem 12 by different groups of simulations. Then, we carry out the cleaning strategy on the temperature dataset and compare the final result with the practical values. Finally, the performance of the proposed data cleaning strategy is demonstrated.

### 6.1 Correlation simulation

This group of simulations demonstrates the relationship between the volume indicator and the other indicators. Data loss, jitter, errors, and other mistakes in the dataset are independent and follow binomial distributions. Two datasets with different data volumes are gathered, using sampling intervals Δ*t* and 2Δ*t*, and the metrics of the other indicators are calculated for each. The results are as follows.

The next group of simulation deals with the relationship between completeness and other indicators. Given one dataset, we carry out the completeness cleaning two times which will increase the completeness indicators. Then, we can observe the difference between the other three indicators.

The following group of simulations deals with the relationship between time-dependent and other indicators. Similar to the above experiment, the sample jitter is eliminated twice on the same dataset in order to guarantee that the time-dependent indicator of the dataset gradually increases. Then, we can observe the difference between the other three indicators.

The following group deals with the relationship between correctness and other indicators. Twice, data cleaning operation for the abnormal data are carried out sequentially, and thus the correctness will increase accordingly. Then, we can observe the difference between the other three indicators.

### 6.2 Data cleaning simulation

In order to verify the performance of the proposed data cleaning strategy, we adopt two different sequential cleaning strategies under the same cleaning cost. The data before and after cleaning are respectively compared with the true values of the environment so that the difference between them can be observed intuitively. Both strategies perform the same operations at the same cost: abnormal data detection and correction, missing data mending, and linear-interpolation cleaning for eliminating sampling jitter. Because the true value of the environment is not available in the experiment, we use the average of the 54 nodes as the practical value of the environment.

The first plot in Fig. 2 shows the original dataset *D* of node 7, whose quality metric *Q* is 65.34%. When *D* is cleaned with the proposed data cleaning strategy, the resulting dataset *D′* is much closer to the practical value (the second plot in Fig. 2), and the new quality metric *Q* is 89.43%. We also carry out the data cleaning strategy with order (4) from Section 5.5 and compare the result with the practical value (the last one in Fig. 8). It can be seen that the proposed data cleaning strategy achieves a better cleaning effect on dataset *D*.

## 7 Conclusions

Reasonable data cleaning strategies, which can effectively improve data quality and avoid the extra cleaning overhead caused by repeated cleaning, are very important to data management in wireless sensor networks. In this paper, we introduced four data quality indicators, namely data volume, completeness, time-dependence, and correctness, and provided a theoretical analysis of their relationships. We analyzed the cleaning effects of strategies that apply the cleaning steps in different orders and proposed a data cleaning strategy suitable for wireless sensor networks. Detailed simulations were carried out to demonstrate the correctness and performance of the suggested strategy, which has a significant effect on improving data availability.

## Declarations

### Acknowledgements

None.

### Funding

This work is supported by the National Science Foundation of China under Grant No. 61370210, the Program for New Century Excellent Talents in Fujian Province of China under Grant No. SX2015-01, and the Fujian Province Key Laboratory of Network Computing and Intelligent Information Processing Project under Grant No. 2009J1007.

### Availability of data and materials

None.

### Authors’ contributions

HC proposed the framework of the data cleaning strategy. Moreover, he also participated in the writing of this paper. DF carried out the simulation. XS contributed to the relationship analysis of these indicators, and he wrote the initial version of this paper. CC contributed to the performance analysis. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. C Batini, M Scannapieco, *Data quality: concepts, methodologies and techniques* (Springer Publishing Company, 2010). https://doi.org/10.1007/3-540-33173-5
2. D Ganesan, S Ratnasamy, H Wang, et al., Coping with irregular spatio-temporal sampling in sensor networks. ACM SIGCOMM Comput. Commun. Rev. **34**(1), 125–130 (2004). https://doi.org/10.1145/972374.972396
3. A Karkouch, H Mousannif, HA Moatassime, et al., Data quality in internet of things: a state-of-the-art survey. J. Netw. Comput. Appl. **73**, 57–81 (2016). https://doi.org/10.1016/j.jnca.2016.08.002
4. S Sicari, C Cappiello, FD Pellegrini, et al., A security- and quality-aware system architecture for Internet of Things. Inf. Syst. Front. **18**(4), 665–677 (2016). https://doi.org/10.1007/s10796-014-9538-x
5. A Klein, W Lehner, Representing data quality in sensor data streaming environments. J. Data Inf. Qual. **1**(2), 1–28 (2009). https://doi.org/10.1145/1577840.1577845
6. F Li, S Nastic, S Dustdar, Data quality observation in pervasive environments, in *IEEE 15th International Conference on Computational Science and Engineering (CSE)*, 602–609 (2012). https://doi.org/10.1109/ICCSE.2012.88
7. W Fan, S Ma, N Tang, et al., Interaction between record matching and data repairing. J. Data Inf. Qual. **4**(4), 16 (2014). https://doi.org/10.1145/2567657
8. XO Ding, HZ Wang, XY Zhang, et al., Research on the relationship of data quality with many kinds of properties. J. Software **27**(7), 1626–1644 (2016). https://doi.org/10.1007/s10115-011-0474-5
9. O Ghorbel, W Ayedi, H Snoussi, et al., Fast and efficient outlier detection method in wireless sensor networks. IEEE Sensors J. **15**(6), 3403–3411 (2015). https://doi.org/10.1109/JSEN.2015.2388498
10. Y Zhuang, L Chen, In-network outlier cleaning for data collection in sensor networks, in *Int'l VLDB Workshop on Clean Databases (CleanDB)*, Seoul, Korea (2006). https://doi.org/10.1007/s10115-011-0474-5
11. A Hamrani, I Belaidi, E Monteiro, et al., On the factors affecting the accuracy and robustness of smoothed-radial point interpolation method. Adv. Appl. Math. Mech. **9**(1), 43–72 (2016). https://doi.org/10.4208/aamm.2015.m1115
12. YY Li, LE Parker, Nearest neighbor imputation using spatial–temporal correlations in wireless sensor networks. Inf. Fusion **15**(1), 64–79 (2014). https://doi.org/10.1016/j.inffus.2012.08.007
13. E Rahm, HD Hong, Data cleaning: problems and current approaches. IEEE Data Eng. Bull. **23**(4), 3–13 (2000). https://doi.org/10.1007/978-0-387-39940-9
14. S Sathe, TG Papaioannou, H Jeung, et al., A survey of model-based sensor data acquisition and management, in *Managing and Mining Sensor Data* (Springer, US, 2013), pp. 9–50. https://doi.org/10.1007/978-1-4614-6309-2_2
15. CC Aggarwal, *Managing and Mining Sensor Data* (Springer, US, 2013). https://doi.org/10.1007/978-1-4614-6309-2
16. RY Wang, DM Strong, Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. **12**(4), 5–33 (1996). https://doi.org/10.1080/07421222.1996.11518099
17. L Jiang, A Borgida, J Mylopoulos, Towards a compositional semantic account of data quality attributes, in *International Conference on Conceptual Modeling (ER 2008)*, pp. 55–68 (2008). https://doi.org/10.1007/978-3-540-87877-3_6
18. C Zhang, X Zhou, C Gao, C Wang, On improving the precision of localization with gross error removal, in *IEEE ICDCS Workshops*, 144–149 (2008). https://doi.org/10.1109/ICDCSWorkshops.2008.44
19. M Suzuki, S Saruwatari, N Kurata, et al., A quantitative error analysis of synchronized sampling on wireless sensor networks for earthquake monitoring, in *ACM Conference on Embedded Networked Sensor Systems (SenSys)*, 417–418 (2008). https://doi.org/10.1145/1460412.1460481
20. H Cheng, S Zhihuang, N Xiong, X Yang, Energy-efficient nodes scheduling algorithms for wireless sensor networks using Markov random field model. Inf. Sci. **329**(1), 461–477 (2016). https://doi.org/10.1016/j.ins.2015.09.039