Data management requires not only data quality assessment but also high-quality datasets obtained through data cleaning or other techniques. Quality assessment indicators affect each other during the data cleaning process. This paper aims at finding the relationships between quality indicators as well as a proper data cleaning strategy. Note that, unless otherwise specified, the relationships between indicators analyzed below are considered in the context of the data cleaning process.
Relationship between data volume indicator and others
Theorem 1 The data volume indicator and completeness indicator are not completely correlated.
Proof Given the time duration T at the same location and the sampling frequency Δt, the data sequence collected by an unreliable node is X_i = [X(i, 1), X(i, 2), …, X(i, T/Δt)], and the sequence collected by a reliable node is X_i′ = [X′(i, 1), X′(i, 2), …, X′(i, T/Δt)]. Their data sizes are denoted as v_i and v_i′, respectively, with v_i′ > v_i.
For sequence X_i, each instance X(i, t) is independently lost with probability p_loss. Therefore, the effective data volume cv_t, which follows the completeness constraint, satisfies a binomial distribution. For data sequence X_i′, the loss probability is also p_loss, and cv_t′ satisfies a binomial distribution as well. According to Formula (6), the variation of the completeness indicator is as follows:
$$ \Delta {q}_c=\frac{\Delta t\cdot \left(\sum \limits_{t=1}^{T/\Delta t}{cv}_t^{\prime }-\sum \limits_{t=1}^{T/\Delta t}{cv}_t\right)}{N\times T}, $$
(14)
in which cv_t and cv_t′ each satisfy a binomial distribution.
So, we have p(Δq_c ≥ 0) ∈ (0,1). In this way, \( {q}_v\overset{\sim }{\prec }\ {q}_c \) ■.
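For illustration, the following minimal Python sketch (not part of the proof; the node count, duration, loss probability, and sequence sizes are hypothetical choices) draws the binomially distributed effective volumes and evaluates the sign of Δq_c as in Formula (14). The estimated probability p(Δq_c ≥ 0) falls strictly between 0 and 1, so a larger data volume does not guarantee a higher completeness indicator.

```python
import random

# Monte Carlo sketch for Theorem 1; all parameters are hypothetical.
N = 10          # number of nodes
T = 100.0       # observation duration
dt = 1.0        # sampling interval, so each node should report T/dt records
p_loss = 0.2    # independent loss probability per record
slots = int(T / dt)

v = slots - 10        # records produced by an unreliable node (v_i)
v_prime = slots       # records produced by a reliable node (v_i' > v_i)

def completeness(records_per_node):
    """q_c in the spirit of Formula (14): surviving records over N*T/dt."""
    survived = sum(
        sum(1 for _ in range(records_per_node) if random.random() > p_loss)
        for _ in range(N)
    )
    return dt * survived / (N * T)

trials = 10000
hits = sum(completeness(v_prime) >= completeness(v) for _ in range(trials))
print("estimated p(dq_c >= 0):", hits / trials)   # strictly between 0 and 1
```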
Theorem 2 The data volume indicator and time correlation indicator are not completely correlated.
Proof Given the time duration T, let v_i be the size of the data sequence X_i collected by an unreliable node, and v_i′ be the size of the data sequence X_i′ collected by a reliable node. Each data instance suffers independent jitter with probability p_time, and the data instances follow a normal distribution during network transmission. According to Formula (10), the variation of the time correlation indicator is described as Δq_t = q_t′ − q_t:
$$ \Delta {q}_t=\frac{\sum \limits_{i=1}^n\left(\sum \limits_{t=1}^{v_i^{\prime }}{f}_t\left(X\left(i,t\right)\right)-\sum \limits_{t=1}^{v_i}{f}_t\left(X\left(i,t\right)\right)\right)}{N}, $$
(15)
in which q_t and q_t′ are independent of each other and each satisfy a binomial distribution.
So, we have p(Δq_t ≥ 0) ∈ (0,1). In conclusion, there is no complete correlation between the data volume indicator and the time correlation indicator, which can be described as \( {q}_v\overset{\sim }{\prec }{q}_t \) ■.
Theorem 3 The data volume indicator and the correctness indicator are not completely correlated.
Proof Similar to the proof of Theorem 1, each data instance is independently erroneous with probability p_error. According to Formula (12), the correctness indicators q_a and q_a′ are independent of each other and each satisfy a binomial distribution. We have Δq_a = q_a′ − q_a, in which p(Δq_a ≥ 0) ∈ (0,1). So, we have \( {q}_v\ \overset{\sim }{\prec }\ {q}_a \) ■.
Relationship between completeness indicator and others
Theorem 4 There is a positive correlation between the completeness indicator and the data volume indicator.
Proof In the time duration T, the data sequence of node i is X_i = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. Missing data is represented as follows:
$$ X\left(i,t\right)=\mathrm{null}\ \mathrm{or}\ {x}_j=\mathrm{null}, $$
(16)
where X(i, t) = {x_1, x_2, …, x_k}.
However, after data repair the data is no longer lost. According to Definition 1, we have
$$ \Delta {q}_v=\frac{\Delta t\times \left({v}_i^{\prime }-{v}_i\right)}{N\times T} $$
(17)
in which v_i′ − v_i ≥ 0.
So, we have q_c ≺ q_v ■.
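A small numeric check of Formula (17) is sketched below (illustrative only; the volume indicator is assumed here to normalize the total record count by N·T/Δt, mirroring Formula (17), and all numbers are hypothetical). Because repairing missing data can only raise v_i to v_i′, Δq_v is never negative.

```python
# Numeric sketch of Formula (17); parameters and record counts are hypothetical.
N, T, dt = 10, 100.0, 1.0

def volume_indicator(records_per_node):
    # q_v assumed to be the total record count normalized by N*T/dt
    return dt * sum(records_per_node) / (N * T)

v_before = [90] * N     # v_i: each node is missing some records
v_after = [100] * N     # v_i': the missing records have been repaired

dq_v = volume_indicator(v_after) - volume_indicator(v_before)
print(dq_v)             # >= 0 whenever v_i' >= v_i, i.e., completeness repair raises q_v
```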
Theorem 5 There is no correlation between the completeness indicator and the time-related indicator after repairing the missing data of the dataset, assuming that the time-related indicator is calculated only over the originally collected data.
Proof In the time duration T, the data sequence of node i is X_i = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. When data is lost, we have
$$ X\left(i,t\right)=\mathrm{null}\ \mathrm{or}\ {x}_j=\mathrm{null},\quad X\left(i,t\right)=\left\{{x}_1,{x}_2,\dots, {x}_k\right\}. $$
(18)
After data cleaning repairs the missing data, these values are no longer empty, and thus the effective data volume increases from cv_t to cv_t′.
However, this increment is independent of the sampling time because the repair is carried out at the sink node.
According to Formula (10), Δq_t = 0. So, we have q_c ⊀ q_t ■.
Theorem 6 There is no complete correlation between the completeness indicator and the correctness indicator.
Proof In the time duration T, the data sequence of node i is X_i = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. Data loss is represented as val_t = null, where val_t ∈ X_i. Completeness cleaning adds the lost data back into the sequence, and thus the data volume increases from cv_i to cv_i′, and we have:
$$ \Delta {cv}_i={cv}_i^{\prime }-{cv}_i. $$
Suppose each repaired value is judged correct with probability p_r, that is,
$$ p\left({f}_a\left({\mathrm{val}}_t\right)=1\right)={p}_r,\quad {\mathrm{val}}_t\in \Delta {cv}_i. $$
According to Formula (12), the probability p(Δq_a ≥ 0) ∈ (0,1), with Δq_a ≥ 0 requiring every repaired value to be judged correct:
$$ p\left(\Delta {q}_a\ge 0\right)=\prod \limits_{i=1}^n\prod \limits_{j=1}^{\Delta {cv}_i}{p}_r. $$
(19)
So, we have \( {q}_c\overset{\sim }{\prec }{q}_a \) ■.
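The sketch below evaluates Formula (19) numerically and checks it by simulation (illustrative only; the value of p_r and the per-node repair counts Δcv_i are hypothetical).

```python
import random

# Numeric sketch of Formula (19): the probability that every repaired value
# is judged correct. p_r and the repair counts are hypothetical.
p_r = 0.9
delta_cv = [3, 5, 2]              # Δcv_i: repaired records per node

analytic = p_r ** sum(delta_cv)   # product of p_r over all repaired records

trials = 100000
hits = sum(
    all(random.random() < p_r for _ in range(sum(delta_cv)))
    for _ in range(trials)
)
print(analytic, hits / trials)    # both lie strictly between 0 and 1
```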
Relationship between time correlation indicator and others
Theorem 7 There is no correlation between the time correlation indicator and the data volume indicator.
Proof The timeliness measurement is calculated by Formula (8), and the currency of X(i, t) decreases after data cleaning because the jitter is eliminated. At the same time, the cleaning does not increase the number of sampling records, which means that X(i, t) is unchanged. According to Definition 1, we have Δq_v = 0. So, we have q_t ⊀ q_v ■.
Theorem 8 The time correlation indicator and the completeness indicator are uncorrelated.
Proof According to the definition of the timeliness measurement, the currency decreases because the jitter is eliminated by the time-related cleaning process, while f_c(X(i, t)) remains unchanged for each X(i, t). According to Definition 2, the effective data volume cv_t = ∑f_c(X(i, t)) remains unchanged. According to Formula (14), we have Δq_c = 0. So, we have q_t ⊀ q_c ■.
Theorem 9 In the case that the physical signal changes continuously and smoothly, there is a positive correlation between the time-related indicator and the correctness indicator after eliminating jitter in the collected dataset.
Proof As shown in Fig. 1, the sampling time is t_real = t_ideal + Δt, while the observed value is val = val_real + Δ, where Δ is the error caused by the jitter Δt.
Considering the general situation, the physical signals observed by the nodes change continuously and smoothly over a long period of time, and the sampling period of the nodes is far shorter than the time scale of the signal changes. When the sampling delay Δt decreases, we can assume that the error Δ decreases too. According to Definition 4, f_a(val_t) = 1 when Δ < ξ_c. So, ∑f_a(val_t) increases for a given data sequence X_i. According to Formula (12), we have Δq_a > 0. So, we get q_t ≺ q_a ■.
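The following Python sketch illustrates this argument (illustrative only; the sinusoidal signal, the jitter levels, and the threshold ξ_c are hypothetical choices, not taken from the paper): for a slowly varying signal, shrinking the sampling jitter shrinks the observation error Δ, so more samples pass the correctness check f_a.

```python
import math
import random

# Sketch of the Theorem 9 argument; signal, jitter levels and xi_c are hypothetical.
def signal(t):
    return math.sin(2 * math.pi * t / 60.0)    # smooth, slowly varying signal

xi_c = 0.05                                    # correctness threshold ξ_c
ideal_times = [i * 1.0 for i in range(100)]    # ideal sampling instants

def correct_fraction(max_jitter):
    ok = 0
    for t_ideal in ideal_times:
        t_real = t_ideal + random.uniform(0, max_jitter)   # jittered sampling time
        delta = abs(signal(t_real) - signal(t_ideal))      # error caused by jitter
        ok += delta < xi_c                                 # f_a(val_t) = 1 if error < ξ_c
    return ok / len(ideal_times)

print(correct_fraction(max_jitter=2.0))   # before time-related cleaning
print(correct_fraction(max_jitter=0.1))   # after jitter is reduced: larger fraction
```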
Relationship between correctness indicator and others
Theorem 10 There is no correlation between the correctness indicator and the data volume indicator.
Proof The observed value can be described as val = val_real + Δ, in which Δ is the error. In the case that Δ > ξ, the value is considered abnormal, and correctness data cleaning eliminates the data error; accordingly, we have f_a(val_t) = 1. At the same time, the data volume of dataset D at time t is not changed according to Definition 2, which means Δq_v = 0. So, we have q_a ⊀ q_v ■.
Theorem 11 There is no correlation between the correctness indicator and the completeness indicator.
Theorem 12 There is no correlation between the correctness indicator and the time correlation indicator.
Proof The proofs of Theorems 11 and 12 are similar to that of Theorem 10 ■.
Analysis of sequential cleaning strategies
As mentioned in the previous section, there are relationships between the different indicators, and a directed graph can be used to describe them. Figures 2 and 3 illustrate the positive correlations and the incomplete correlations between these data quality indicators, respectively.
Assume that various data errors, such as jitter, data loss, and abnormal data, occur in the collected dataset D. The existence of these errors lowers the metrics of the corresponding indicators, i.e., q_c, q_t, and q_a. There are several possible data cleaning strategies in which the cleaning operations are carried out in different orders:
- (1) Completeness, time-related, and correctness;
- (2) Completeness, correctness, and time-related;
- (3) Time-related, completeness, and correctness;
- (4) Time-related, correctness, and completeness;
- (5) Correctness, completeness, and time-related;
- (6) Correctness, time-related, and completeness.
According to the relationship analysis in the previous section, completeness cleaning cannot guarantee the correctness of the repaired data, so abnormal data might still exist if correctness cleaning is performed before completeness cleaning. This means that orders (4), (5), and (6) are not suitable for WSNs. In order (3), the performance of the time-related cleaning algorithm cannot be guaranteed, especially when data loss in the original dataset is severe. In order (2), eliminating the jitter helps to reduce the abnormal data; however, if there is a peak between two adjacent samples, Theorem 9 does not hold, which may lead to poor performance after the cleaning process.
On the other hand, if order (1) is adopted, completeness cleaning is carried out first, which repairs the lost data and helps to guarantee the performance of the subsequent time-related cleaning. The final correctness cleaning eliminates the abnormal data remaining after the previous two steps, and the final metrics of all three indicators increase accordingly. In this way, order (1) is the best among these strategies.
Data cleaning strategy
According to the analysis of the final cleaning effect of the different cleaning sequences in the previous section, the data cleaning strategy following order (1) is the best one. Therefore, in this paper, we propose the following data cleaning strategy to avoid redundant cleaning operations and reduce the cleaning cost while ensuring the cleaning effect; a code sketch of the procedure is given after the step list below.
- Step 1 Calculate the volume indicator of dataset D.
- Step 2 If the volume indicator is larger than a given threshold, then
- Step 3 Clean the dataset by the completeness indicator;
- Step 4 Clean the dataset by the time-related indicator;
- Step 5 Clean the dataset by the correctness indicator;
- Step 6 End.
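The sketch below expresses the strategy in Python (a minimal sketch, not the authors' implementation). The indicator computation and the three cleaning routines are passed in as callables because their concrete definitions come from the formulas and algorithms discussed earlier; the default threshold value is hypothetical.

```python
from typing import Callable, List, Optional

Record = Optional[float]        # None models a lost sample
Dataset = List[List[Record]]    # one record list per node

def clean_dataset(
    dataset: Dataset,
    volume_indicator: Callable[[Dataset], float],
    completeness_cleaning: Callable[[Dataset], Dataset],
    time_related_cleaning: Callable[[Dataset], Dataset],
    correctness_cleaning: Callable[[Dataset], Dataset],
    volume_threshold: float = 0.5,      # hypothetical threshold for Step 2
) -> Dataset:
    # Steps 1-2: if too little data was gathered, reliability is low and
    # cleaning is considered useless, so return the dataset unchanged.
    if volume_indicator(dataset) <= volume_threshold:
        return dataset
    # Step 3: repair the missing records first.
    dataset = completeness_cleaning(dataset)
    # Step 4: eliminate jitter on the repaired sequences.
    dataset = time_related_cleaning(dataset)
    # Step 5: remove the remaining abnormal values.
    dataset = correctness_cleaning(dataset)
    # Step 6: end.
    return dataset
```

Keeping the three cleaning steps in the order completeness, time-related, correctness mirrors order (1): each step only improves the conditions under which the next one operates.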
Steps 1 and 2 determine whether the cleaning process is necessary. The volume indicator describes the size of the collected data. If the size is very small, the network may not be working properly because the system cannot gather enough data, and the reliability of the data is very low in this case. Although data cleaning can repair the lost data, it is considered useless when the volume indicator is below the threshold. Steps 3 to 5 carry out the cleaning process based on the completeness, time-related, and correctness indicators, as described in the previous section.