Data management requires not only data quality assessment but also high-quality datasets obtained by data cleaning or other technologies. Quality assessment indicators affect each other during the data cleaning process. This paper aims to find the relationships between quality indicators as well as a proper data cleaning strategy. Unless otherwise specified, the relationships between indicators analyzed below are considered in the context of the data cleaning process.
Relationship between data volume indicator and others
Theorem 1 The data volume indicator and completeness indicator are not completely correlated.
Proof Given the time duration T at the same location and the sampling interval Δt, the data sequence collected by unreliable nodes is X_{i} = [X(i, 1), X(i, 2), …, X(i, T/Δt)], and that collected by reliable nodes is X_{i}′ = [X′(i, 1), X′(i, 2), …, X′(i, T/Δt)]. The data sizes are denoted as v_{i} and v_{i}′, respectively, with v_{i}′ > v_{i}.
For sequence X_{i}, each instance X(i, t) is lost independently with probability p_{loss}. Therefore, under the completeness constraint, the data volume cv_{t} follows a binomial distribution. For data sequence X_{i}′, the loss probability is also p_{loss}, and cv_{t}′ follows a binomial distribution as well. According to Formula (6), the variation of the completeness indicator is as follows:
$$ \Delta {q}_c=\frac{\Delta t\cdot \left(\sum \limits_{t=1}^{T/\Delta t}{cv}_t^{\prime }-\sum \limits_{t=1}^{T/\Delta t}{cv}_t\right)}{N\times T}, $$
(14)
in which cv_{t} and cv_{t}′ each follow a binomial distribution.
So, we have p(Δq_{c} ≥ 0) ∈ (0,1). In this way, \( {q}_v\overset{\sim }{\prec }\ {q}_c \) ■.
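The claim p(Δq_{c} ≥ 0) ∈ (0,1) can be checked numerically with a small Monte Carlo sketch. It assumes that the sign of Δq_{c} is the sign of the difference between the two complete-instance counts, since the factor Δt/(N × T) is positive. The function names, sequence lengths, loss probability, and trial count below are illustrative assumptions and are not taken from the paper.

```python
import random


def count_complete(n_instances, p_loss, rng):
    """Count instances that survive independent loss (one binomial draw)."""
    return sum(1 for _ in range(n_instances) if rng.random() >= p_loss)


def estimate_p_nonnegative(n_unreliable, n_reliable, p_loss, trials=2000, seed=0):
    """Estimate p(cv' - cv >= 0) for two independently lossy sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        cv = count_complete(n_unreliable, p_loss, rng)      # unreliable node
        cv_prime = count_complete(n_reliable, p_loss, rng)  # reliable node
        if cv_prime - cv >= 0:
            hits += 1
    return hits / trials


# Hypothetical sizes: the reliable node delivers slightly more data (v' > v).
p_hat = estimate_p_nonnegative(n_unreliable=18, n_reliable=20, p_loss=0.3)
print(p_hat)
```

With these assumed parameters the estimate lies strictly between 0 and 1: a larger data volume makes a higher completeness likely but does not guarantee it, which is what the incomplete correlation expresses.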
Theorem 2 The data volume indicator and time correlation indicator are not completely correlated.
Proof Given the time duration T, let v_{i} be the size of data sequence X_{i} collected by unreliable nodes, and v_{i}′ be the size of data sequence X_{i}′ collected by reliable nodes. Each data instance experiences jitter independently with probability p_{time}, and the data instances follow a normal distribution during network transmission. According to Formula (10), the variation of the time correlation indicator is Δq_{t} = q_{t}′ − q_{t}:
$$ \Delta {q}_t=\frac{\sum \limits_{i=1}^n\left(\sum \limits_{t=1}^{v_i^{\prime }}{f}_t\left(X\left(i,t\right)\right)-\sum \limits_{t=1}^{v_i}{f}_t\left(X\left(i,t\right)\right)\right)}{N}, $$
(15)
in which q_{t} and q_{t}′ are independent of each other and each follows a binomial distribution.
So, we have p(Δq_{t} ≥ 0) ∈ (0,1). In conclusion, there is not a complete correlation between the data volume indicator and the time correlation indicator, which can be described as \( {q}_v\overset{\sim }{\prec }{q}_t \) ■.
Theorem 3 The data volume indicator and the correctness indicator are not completely correlated.
Proof Similar to the proof of Theorem 1, each data instance is independently wrong with probability p_{error}. According to Formula (12), the correctness indicators q_{a} and q_{a}′ are independent of each other and each follows a binomial distribution. We have Δq_{a} = q_{a}′ − q_{a}, in which p(Δq_{a} ≥ 0) ∈ (0,1). So, we have \( {q}_v\ \overset{\sim }{\prec }\ {q}_a \) ■.
Relationship between completeness indicator and others
Theorem 4 There is a positive correlation between the completeness indicator and data volume indicator.
Proof In the time duration T, the data sequence of node i is X_{i} = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. Missing data is characterized as follows:
$$ X\left(i,t\right)=\mathrm{null}\ \mathrm{or}\ {x}_j=\mathrm{null}, $$
(16)
where X(i, t) = {x_{1}, x_{2}, …, x_{k}}.
After data repair, however, these values are no longer missing. According to Definition 1, we have
$$ \Delta {q}_v=\frac{\Delta t\times \left({v}_i^{\prime }-{v}_i\right)}{N\times T}, $$
(17)
in which v_{i}′ − v_{i} ≥ 0.
So, we have q_{c} ≺ q_{v} ■.
Theorem 5 There is no correlation between the completeness indicator and the time-related indicator after the missing data of the dataset are repaired, assuming that the time-related indicator is calculated only over the collected data.
Proof In the time duration T, the data sequence of node i is X_{i} = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. When there is a data loss, we have
$$ X\left(i,t\right)=\mathrm{null}\ \mathrm{or}\ {x}_j=\mathrm{null},\quad X\left(i,t\right)=\left\{{x}_1,{x}_2,\dots, {x}_k\right\}. $$
(18)
After data cleaning repairs the missing data, these values are no longer empty, and thus the data volume increases from cv_{t} to cv_{t}′.
However, this increment is independent of time because the repair is carried out at the sink node.
According to Formula (10), Δq_{t} = 0. So, we have q_{c} ⊀ q_{t} ■.
Theorem 6 There is no complete correlation between the completeness indicator and the correctness indicator.
Proof In the time duration T, the data sequence of node i is X_{i} = [X(i, 1), X(i, 2), …, X(i, T/Δt)]. A data loss is represented as val_{t} = null, where val_{t} ∈ X_{i}. Completeness cleaning adds the lost data back into the sequence, and thus the data volume increases from cv_{i} to cv_{i}′, and we have:
$$ \Delta {cv}_i={cv}_i^{\prime }-{cv}_i. $$
Suppose each repaired value is judged correct with probability p_{r}, that is,
$$ p\left({f}_a\left({\mathrm{val}}_t\right)=1\right)={p}_r,\quad {\mathrm{val}}_t\in \Delta {cv}_i. $$
According to Formula (12), Δq_{a} ≥ 0 holds with a probability p(Δq_{a} ≥ 0) ∈ (0,1), given by
$$ p\left(\Delta {q}_a\ge 0\right)=\prod \limits_{i=1}^n\prod \limits_{j=1}^{\Delta {cv}_i}{p}_r. $$
(19)
So, we have \( {q}_c\overset{\sim }{\prec }{q}_a \) ■.
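The double product in Formula (19) multiplies p_{r} once for every repaired value of every node, so it reduces to p_{r} raised to the total number of repaired values. The sketch below evaluates it under that reading. The function name and the example counts are hypothetical.

```python
def prob_all_repairs_correct(repaired_counts, p_r):
    """Evaluate the double product of Formula (19).

    repaired_counts[i] plays the role of Δcv_i, the number of values
    repaired for node i; p_r is the probability that one repaired value
    is judged correct.
    """
    prob = 1.0
    for delta_cv in repaired_counts:
        prob *= p_r ** delta_cv  # inner product over the Δcv_i repaired values
    return prob


# Hypothetical example: three nodes with 2, 1, and 3 repaired values.
p = prob_all_repairs_correct([2, 1, 3], p_r=0.9)
print(p)
```

Whenever 0 < p_{r} < 1 and at least one value is repaired, the result lies strictly between 0 and 1, and it shrinks as more values are repaired.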
Relationship between time correlation indicator and others
Theorem 7 There is no correlation between the time correlation indicator and the data volume indicator.
Proof The timeliness measurement is calculated by Formula (8), and we can see that the currency of X(i, t) decreases after data cleaning because the jitter is eliminated. At the same time, the cleaning does not add sampling records, which means that the number of instances X(i, t) is unchanged. According to Definition 1, we have Δq_{v} = 0. So, we have q_{t} ⊀ q_{v} ■.
Theorem 8 There is no correlation between the time correlation indicator and the completeness indicator.
Proof According to the definition of the timeliness measurement, currency decreases because the jitter is eliminated by the time-related cleaning process, while ƒ_{c}(X(i, t)) remains unchanged for each X(i, t). According to Definition 2, the effective data volume cv_{t} = ∑ƒ_{c}(X(i, t)) remains unchanged. According to Formula (14), we have Δq_{c} = 0. So, we have q_{t} ⊀ q_{c} ■.
Theorem 9 In the case that the physical signal changes continuously and smoothly, there is a positive correlation between the time-related indicator and the correctness indicator after jitter is eliminated from the collected dataset.
Proof As shown in Fig. 1, the sampling time is t_{real} = t_{ideal} + Δt, while the observation value is val = val_{real} + Δ, where Δ is the error caused by the jitter Δt.
Considering the general situation, the physical signals observed by the nodes change continuously and smoothly over a long period of time, and the sampling frequency of the nodes is far less than the frequency of signal changes. When the sampling delay Δt decreases, we can assume the error Δ decreases too. According to Definition 4, f_{a}(val_{t}) = 1 when Δ < ξ_{c}. So, ∑f_{a}(val_{t}) increases for a given data sequence X_{i}. According to Formula (12), we have Δq_{a} > 0. So, we get q_{t} ≺ q_{a} ■.
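The argument of Theorem 9 can be illustrated with a toy experiment on a smooth signal. The sketch assumes a slowly varying sine wave as the physical signal, uniformly distributed jitter, and a fixed tolerance ξ. None of these choices come from the paper, and all names are hypothetical.

```python
import math
import random


def correctness_fraction(max_jitter, xi=0.05, n_samples=200, period=100.0, seed=1):
    """Fraction of samples whose jitter-induced error Δ stays below xi."""
    rng = random.Random(seed)  # same seed: each sample's jitter is only rescaled
    correct = 0
    for k in range(n_samples):
        t_ideal = float(k)
        t_real = t_ideal + rng.uniform(0.0, max_jitter)       # t_real = t_ideal + Δt
        val_real = math.sin(2 * math.pi * t_ideal / period)   # value at the ideal time
        val = math.sin(2 * math.pi * t_real / period)         # observed value
        if abs(val - val_real) < xi:                          # f_a(val_t) = 1
            correct += 1
    return correct / n_samples


before = correctness_fraction(max_jitter=3.0)  # before time-related cleaning
after = correctness_fraction(max_jitter=0.3)   # after most jitter is removed
print(before, after)
```

Because the assumed signal is smooth, shrinking the jitter shrinks the error of every sample, so the fraction of values judged correct can only grow. A signal with a sharp peak between two samples would break this monotonic behaviour.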
Relationship between correctness indicator and others
Theorem 10 There is no correlation between the correctness indicator and the data volume indicator.
Proof The observed value can be described as val = val_{real} + Δ, in which Δ is the error. In the case that Δ > ξ, the value is considered abnormal, and correctness cleaning eliminates the data error, so that f_{a}(val_{t}) = 1. At the same time, the completeness metric for dataset D at time t is not changed according to Definition 2, which means Δq_{v} = 0. So, we have q_{a} ⊀ q_{v} ■.
Theorem 11 There is no correlation between the correctness indicator and the completeness indicator.
Theorem 12 There is no correlation between the correctness indicator and the time correlation indicator.
Proof The proof process is similar to that in Theorem 10 ■.
Analysis of sequential cleaning strategies
As mentioned in the previous section, there are relationships between different indicators, and a directed graph can be used to describe them. Figures 2 and 3 show the positive and incomplete correlations between these data quality indicators, respectively.
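One possible way to hold such a directed graph in code is a dictionary keyed by ordered indicator pairs. The sketch below records the relation concluded at the end of each proof of Theorems 1 to 12. The labels and the data layout are illustrative assumptions and do not claim to reproduce Figures 2 and 3.

```python
POSITIVE = "positive"      # a ≺ b
INCOMPLETE = "incomplete"  # a ~≺ b (not completely correlated)
NONE = "none"              # a ⊀ b

# Key (a, b): how cleaning with respect to indicator a affects indicator b.
RELATIONS = {
    ("volume", "completeness"): INCOMPLETE,        # Theorem 1
    ("volume", "time"): INCOMPLETE,                # Theorem 2
    ("volume", "correctness"): INCOMPLETE,         # Theorem 3
    ("completeness", "volume"): POSITIVE,          # Theorem 4
    ("completeness", "time"): NONE,                # Theorem 5
    ("completeness", "correctness"): INCOMPLETE,   # Theorem 6 (conclusion of its proof)
    ("time", "volume"): NONE,                      # Theorem 7
    ("time", "completeness"): NONE,                # Theorem 8
    ("time", "correctness"): POSITIVE,             # Theorem 9
    ("correctness", "volume"): NONE,               # Theorem 10
    ("correctness", "completeness"): NONE,         # Theorem 11
    ("correctness", "time"): NONE,                 # Theorem 12
}


def edges(kind):
    """Return the sorted list of directed edges carrying the given relation."""
    return sorted(pair for pair, relation in RELATIONS.items() if relation == kind)


positive_edges = edges(POSITIVE)
incomplete_edges = edges(INCOMPLETE)
print(positive_edges)
print(incomplete_edges)
```

Filtering by relation type yields one edge list per kind of correlation, which is one way to obtain a separate graph for the positive and for the incomplete correlations.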
Assume that many data errors, such as jitter, data loss, and data exceptions, occur in the collected dataset D. These errors lower the metrics of the data indicators q_{c}, q_{t}, and q_{a}. The cleaning process can be carried out in several different orders, giving the following combinations of data cleaning strategies:

(1) Completeness, time-related, and correctness;
(2) Completeness, correctness, and time-related;
(3) Time-related, completeness, and correctness;
(4) Time-related, correctness, and completeness;
(5) Correctness, completeness, and time-related;
(6) Correctness, time-related, and completeness.
According to the relationship analysis in the previous section, completeness cleaning cannot guarantee data correctness, and thus abnormal data might still exist if completeness cleaning is performed after correctness cleaning. This means that orders (4), (5), and (6) are not suitable for WSNs. In order (3), the performance of the time-related cleaning algorithm cannot be guaranteed, especially when data loss is serious in the original dataset. In order (2), eliminating the jitter helps to reduce the abnormal data. However, if there is a peak between two adjacent samples, Theorem 9 does not hold, which may lead to poor performance after the cleaning process.
On the other hand, if we adopt order (1), completeness cleaning is carried out first, which repairs the lost data and helps to guarantee the performance of the subsequent time-related cleaning. The final correctness cleaning then eliminates the abnormal data remaining after the previous two steps, and the final metrics of all three indicators increase accordingly. In this way, we can see that order (1) is the best of the six strategies.
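The elimination argument can be replayed mechanically by enumerating all six orders and filtering them. The sketch reads the discussion as three precedence constraints. This reading is an interpretation made for illustration and is not stated in this form in the text, and the step names are hypothetical labels.

```python
from itertools import permutations

STEPS = ("completeness", "time", "correctness")


def acceptable(order):
    """Check the assumed precedence constraints for one cleaning order."""
    pos = {step: index for index, step in enumerate(order)}
    return (
        pos["completeness"] < pos["correctness"]  # excludes orders (4), (5), (6)
        and pos["completeness"] < pos["time"]     # excludes order (3)
        and pos["time"] < pos["correctness"]      # excludes order (2)
    )


candidates = [order for order in permutations(STEPS) if acceptable(order)]
print(candidates)
```

Under these assumed constraints only one of the six permutations survives: completeness, then time-related, then correctness cleaning, which is order (1).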
Data cleaning strategy
According to the analysis of the final cleaning effect of the different cleaning sequences in the previous section, the data cleaning strategy following order (1) is considered the best one. Therefore, in this paper, we propose the following data cleaning strategy, which avoids redundant cleaning operations and reduces the cleaning expense while ensuring the cleaning effect.

Step 1 Calculate the volume indicator of dataset D.
Step 2 If the volume indicator is larger than a given threshold, then
Step 3 Clean the dataset by the completeness indicator;
Step 4 Clean the dataset by the time-related indicator;
Step 5 Clean the dataset by the correctness indicator;
Step 6 End.
Steps 1 and 2 determine whether the cleaning process is necessary. The volume indicator describes the size of the collected data. If the size is very small, the network may not be operating properly, because the system cannot gather enough data. The reliability of these data is very low in this case. Although data cleaning helps to repair the lost data, it is considered useless here since the reliability is below the threshold. Steps 3 to 5 carry out the cleaning process via the completeness, time-related, and correctness indicators, as mentioned in the previous section.
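The six steps map onto a short control-flow skeleton. The sketch below is a minimal illustration that treats the volume indicator and the three cleaners as caller-supplied functions. Their names, signatures, and the toy dataset are assumptions, since the concrete cleaning algorithms are not specified in this section.

```python
def clean_dataset(dataset, volume_indicator, threshold,
                  clean_completeness, clean_time, clean_correctness):
    """Apply the sequential strategy; return (dataset, whether it was cleaned)."""
    # Steps 1 and 2: skip cleaning when too little data was collected.
    if volume_indicator(dataset) <= threshold:
        return dataset, False
    # Steps 3 to 5: completeness, then time-related, then correctness cleaning.
    dataset = clean_completeness(dataset)
    dataset = clean_time(dataset)
    dataset = clean_correctness(dataset)
    return dataset, True  # Step 6


# Toy usage with placeholder cleaners that only record the call order.
log = []


def tagged(name):
    def cleaner(data):
        log.append(name)
        return data
    return cleaner


data = [1.0, None, 1.2, 9.9]
_, was_cleaned = clean_dataset(
    data,
    volume_indicator=lambda d: len(d) / 4,  # hypothetical: 4 expected samples
    threshold=0.5,
    clean_completeness=tagged("completeness"),
    clean_time=tagged("time"),
    clean_correctness=tagged("correctness"),
)
print(was_cleaned, log)
```

Passing the cleaners as arguments keeps the ordering logic separate from the individual cleaning algorithms, so each one can be replaced without touching the strategy itself.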