Problem formulation
Given M time instances and N devices in the network, the original data in the network can be represented by an M×N matrix X, where each row contains the readings of all devices at one time instant and each column contains the readings of one IoT device across the time instants. X_{ij} (1≤i≤M, 1≤j≤N) denotes the reading of device j at time instant t_{i}.
Let A denote a linear map from R^{M×N} to R^{p} and vec denote the linear map that transforms a matrix into a vector by stacking one column on another; we have
$$A(\boldsymbol{X})=\boldsymbol{\Phi} \cdot vec(\boldsymbol{X}) $$
where Φ is a p×MN matrix. Let Φ be a random matrix satisfying the RIP condition [21]. Before deployment, each device is equipped with a pseudorandom number generator. Once the device produces a reading at some time instance, the pseudorandom number generator generates a random vector of length p, using the combination of the current time instance and the device’s ID as the random seed. The elements of this random vector are i.i.d. samples from a Gaussian distribution with mean 0 and variance 1/p. Note that this pseudorandom number generation at each device can be reproduced by the base station using the same generator.
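This seeding scheme can be sketched as follows. It is a minimal illustration only: the exact way of combining the time instance and the device ID into a seed is an assumption, not specified by the paper.

```python
import numpy as np

def random_vector(time_instant, device_id, p):
    """Generate the length-p measurement vector for one reading.

    The seed combines the time instant and the device ID (illustrative
    seeding scheme), so the base station can regenerate the exact same
    vector offline.  Entries are i.i.d. Gaussian with mean 0 and
    variance 1/p, as in the text.
    """
    rng = np.random.default_rng(seed=(time_instant, device_id))
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / p), size=p)

# The device and the base station produce identical vectors from the
# same (time, ID) pair without ever transmitting the vector itself.
phi_device = random_vector(time_instant=3, device_id=17, p=8)
phi_station = random_vector(time_instant=3, device_id=17, p=8)
assert np.array_equal(phi_device, phi_station)
```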
The dimension p of A is the number of measurements (namely, the random combinations) needed to recover X. Typically, p should be no less than cr(3M+3N−5r) [13], where r is the rank of X and c is a positive constant. Therefore, the problem can be formulated as the following optimization problem:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad \frac{1}{2} \|A(\boldsymbol{X})-\boldsymbol{b}\|^{2}_{F}+ \mu \|\boldsymbol{X}\|_{*} $$
(3)
where the first term accounts for measurement noise and the second term promotes a low-rank solution.
Remark 1
In IoT networks, devices may produce erroneous readings due to noisy environments or error-prone hardware. Erroneous readings usually occur at sporadic times and locations and thus have little impact on the data sparsity of the network. Thus, outlier/abnormal reading recovery/detection could still work in compressive sensing-based data gathering. However, device measurements of the same event usually have strong inter-correlations and are geographically concentrated in a group of devices in close proximity. Such events may spread over diverse time and space scales and result in dynamic sparsity of the data, which violates the assumption of constant sparsity in compressive sensing and thus leads to poor recovery.
Remark 2
Given N M-dimensional signal vectors generated from N devices within M time instances, a good basis that makes these vectors sparse may not be easy to find. Interestingly, [22] has analyzed different sets of data from two independent device network testbeds. The results indicate that the N×M data matrix may be approximately low rank under the various scenarios under investigation. Therefore, such an N×M temporal-spatial signal gathering problem with diverse-scale event data, which cannot be well addressed by the CS method, could be tackled under the low-rank framework^{Footnote 1}.
Path along compressive collection
In this paper, we provide a generalization of current data gathering methods to temporal-spatial signals with diverse-scale events, during which device readings are compressively collected along the relay paths, e.g., in a chain-type or mesh topology, to the sink.
At each device s_{j}, given the reading produced from s_{j} at time instance t_{1}, s_{j} generates a random vector Φ_{1j} of length p, with time instance t_{1} and its ID s_{j} as the seed, and computes the vector X_{1j}Φ_{1j}. At the next time instance t_{2}, s_{j} generates a random vector Φ_{2j}, computes X_{2j}Φ_{2j}, and adds it to the previous vector X_{1j}Φ_{1j}. At time instance t_{M}, s_{j} computes X_{Mj}Φ_{Mj} and would have the summation \(S_{j}=\sum \limits ^{M}_{i=1} \boldsymbol {X}_{ij}\Phi _{ij}\).
In the network, each device s_{j} continuously updates its vector sum S_{j} till time instance t_{M}. After that, device s_{j} relays the vector S_{j} to the next device s_{i}. Then, s_{i} adds S_{j} to its own vector sum S_{i} and forwards S_{i}+S_{j} to the next device. After the collection along the relay paths, the sink receives \(\sum \limits ^{N}_{j=1}\sum \limits ^{M}_{i=1} \boldsymbol {X}_{ij}\Phi _{ij}=\Phi \cdot vec(\boldsymbol {X})\).
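The collection process above can be sketched end-to-end in NumPy. This is a toy illustration: the sizes, the readings, and the seeding scheme are assumptions for demonstration.

```python
import numpy as np

def measurement_vector(t, device_id, p):
    # Same seeded generator on device and base station (illustrative seeding).
    rng = np.random.default_rng(seed=(t, device_id))
    return rng.normal(0.0, np.sqrt(1.0 / p), size=p)

M, N, p = 5, 4, 12
X = np.arange(M * N, dtype=float).reshape(M, N)   # toy readings X[i, j]

# Each device j accumulates S_j = sum_i X[i, j] * Phi_ij locally ...
S = [sum(X[i, j] * measurement_vector(i, j, p) for i in range(M))
     for j in range(N)]
# ... and the partial sums are added along the relay path, so the sink
# receives a single length-p vector regardless of the path length.
b = sum(S)

# The sink reproduces Phi column-by-column and verifies b = Phi * vec(X),
# where vec stacks the columns of X (Fortran order).
Phi = np.column_stack([measurement_vector(i, j, p)
                       for j in range(N) for i in range(M)])
assert np.allclose(b, Phi @ X.flatten(order="F"))
```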
Remark 3
During data gathering, each node sends out only one vector of fixed length along the collection path, regardless of the distance to the sink (the property of the fixed-length vector will be discussed in Section 4).
Considering event data, recall that each row of the data matrix X (the signal in the network) represents the data acquired at some time instance from all devices, and each column of X represents the data collected from one device at different time instances.
Outlier readings may come from internal errors at error-prone devices, for example, noise or systematic errors, or may be caused by external events due to environmental changes. The former (internal errors) are often sparse in the spatial domain, while the latter (event readings) are usually low rank in the time domain. Each stays sparse in its corresponding domain, but together they may lead to dynamic changes in data sparsity.
Let matrix X be decomposed into two parts, the normal one and the abnormal one: X=X_{n}+X_{s}. We could have:
$$ \begin{aligned} A(\boldsymbol{X})&=A\left(\boldsymbol{X}_{n}+\boldsymbol{X}_{s}\right)\\ &=A\cdot[I,I]\left[\boldsymbol{X}_{n},\boldsymbol{X}_{s}\right]^{T}\\ &=[A,A]\left[\boldsymbol{X}_{n},\boldsymbol{X}_{s}\right]^{T}\\ \end{aligned} $$
(4)
Based on Eq. 1, [A,A] is a new linear map. The formulated problem could be solved in the framework of matrix recovery. That is, given the observation vector b∈R^{p}, the stacked matrix [X_{n},X_{s}]^{T}∈R^{2M×N} could be recovered, from which the original data matrix X^{∗}=X_{n}+X_{s} is obtained.
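The augmented map can be checked numerically. The sketch below uses toy sizes and illustrative data; it only verifies the identity A(X_n + X_s) = [A, A]·[vec(X_n); vec(X_s)], not the recovery itself.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, p = 4, 3, 10
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
vec = lambda Z: Z.flatten(order="F")   # stack columns

Xn = rng.normal(size=(M, N))           # "normal" component
Xs = np.zeros((M, N))
Xs[2, 1] = 5.0                         # sparse outlier component

# A(Xn + Xs) equals the augmented map [A, A] applied to the stacked
# unknown [vec(Xn); vec(Xs)], so recovery can run over the doubled space.
lhs = Phi @ vec(Xn + Xs)
rhs = np.hstack([Phi, Phi]) @ np.concatenate([vec(Xn), vec(Xs)])
assert np.allclose(lhs, rhs)
```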
A basic design of data recovery
This section generalizes the data recovery method from compressive sensing to the realm of matrix recovery. The advantages of such an extension are twofold: (1) it exploits the data correlation in both time and space domains, and (2) diverse-scale event data, which would undermine the CS method due to sparsity changes, can be tackled with the proposed method.
According to Eqs. 3 and 4, the general form of the problem could be expressed with the following minimization problem:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad \frac{1}{2} \|A(\boldsymbol{X})-\boldsymbol{b}\|^{2}_{F}+ \mu \|\boldsymbol{X}\|_{*} $$
(5)
where A(X)=Φ·T(X), with T(·) denoting the transformation of a matrix into a vector by stacking one column on another (i.e., T(·)=vec(·)), and Φ is a p×MN random matrix.
Note that Eq. 3 is the Lasso form of Eq. 2. Under relaxed conditions, its solution is the solution of Eq. 2 [23]. Therefore, we consider Eq. 3 (Eqs. 3 and 5 are essentially the same) instead of the original problem in Eq. 2.
This problem could be further transformed into the following form:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad F(\boldsymbol{X})\triangleq f(\boldsymbol{X}) +P(\boldsymbol{X}) $$
(6)
where \(f(\boldsymbol{X})=\frac {1}{2} \|A(\boldsymbol {X})-\boldsymbol{b}\|^{2}_{F}\) and \(P(\boldsymbol{X})=\mu \|\boldsymbol{X}\|_{*}\).
Note that both parts are convex, but only the first part is differentiable, while the second part may not be. Then, we have
$$\nabla f(\boldsymbol{X})= A^{*}(A(\boldsymbol{X})-\boldsymbol{b}) $$
where A^{∗} is the dual operator of A.
Since A^{∗}(X)=Φ^{T}X, we have
$$\nabla f(\boldsymbol{X})= A^{*}(A(\boldsymbol{X})-\boldsymbol{b})=\Phi^{T}(\Phi\cdot T(\boldsymbol{X})-\boldsymbol{b}) $$
Because ∇f is an affine function of X, it is Lipschitz continuous. Then, there exists a positive constant L_{f} satisfying the following inequality:
$$\|\nabla f(\boldsymbol{X})-\nabla f(\boldsymbol{Y})\|_{F}\leq L_{f} \|\boldsymbol{X}-\boldsymbol{Y}\|_{F}\quad\forall \boldsymbol{X},\boldsymbol{Y}\in R^{M\times N} $$
Lemma 1
A rough estimate of L_{f} is
$$\sqrt{MN\cdot \underset{i}{\text{max}}\: \left\{\left\|\left(\Phi^{T}\Phi\right)_{i}\right\|^{2}_{2}\right\}}, $$
where \(\left (\Phi ^{T}\Phi \right)_{i}\) is the ith column of the matrix Φ^{T}Φ.
Proof
\(\|\nabla f(\boldsymbol {X})-\nabla f(\boldsymbol {Y})\|_{F}^{2}=\|\Phi ^{T}\Phi\, T(\boldsymbol {X}-\boldsymbol {Y})\|^{2}_{2}\)
Set \(\Phi ^{T}\Phi = \left (\begin {array}{ccc} a_{11} & \cdots & a_{1,MN} \\ \vdots & & \vdots \\ a_{MN,1} &\cdots &a_{MN,MN} \end {array}\right)\) with rows \(a_{j}\),
\(T(\boldsymbol{X}-\boldsymbol{Y})=\left (\begin {array}{c} x_{1} \\ \vdots \\ x_{MN} \end {array}\right)\),
and \(h= \underset {i}{\text {max}}\; \left \{\left\|\left (\Phi ^{T}\Phi \right)_{i}\right\|^{2}_{2}\right\}\); then, by the Cauchy–Schwarz inequality,
$$\begin{aligned} \|\Phi^{T} \Phi\, T(\boldsymbol{X}-\boldsymbol{Y})\|^{2}_{2}&=\sum\limits^{MN}_{j=1}\left(\sum\limits^{MN}_{i=1} a_{ji}x_{i}\right)^{2}\\ &\leq \sum\limits^{MN}_{j=1}\|a_{j}\|^{2}_{2}\,\|T(\boldsymbol{X}-\boldsymbol{Y})\|^{2}_{2}\\ &=\|\Phi^{T}\Phi\|^{2}_{F}\,\|\boldsymbol{X}-\boldsymbol{Y}\|^{2}_{F}\\ &\leq MN\cdot h\cdot \|\boldsymbol{X}-\boldsymbol{Y}\|^{2}_{F} \end{aligned} $$
Thus, \(L_{f}\leq \sqrt {MNh}\). □
Remark 4
A much smaller L_{f} can often be found in real scenarios, which helps the algorithm converge quickly. The experimental results of this paper show that L_{f} can be much smaller than the rough estimate above when the matrix is sampled from a Gaussian distribution.
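This remark can be checked numerically: for a Gaussian Φ, the spectral norm ‖Φ^{T}Φ‖₂ (the smallest valid Lipschitz constant of ∇f) is typically well below the rough estimate of Lemma 1. The sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, p = 10, 8, 60
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
G = Phi.T @ Phi

# The smallest valid Lipschitz constant of grad f is the spectral norm.
Lf_true = np.linalg.norm(G, ord=2)

# Lemma 1's rough estimate: sqrt(MN * max_i ||(Phi^T Phi)_i||_2^2).
h = np.max(np.sum(G ** 2, axis=0))
Lf_rough = np.sqrt(M * N * h)

assert Lf_true <= Lf_rough   # valid upper bound, typically quite loose
```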
Consider the following quadratic approximation of F(·) of Eq. 6 at Y:
$$ \begin{aligned} Q_{\tau}(X,Y)&\triangleq f(Y)+\langle\nabla f(Y),X-Y\rangle\\ &\quad+\frac{\tau}{2}\|X-Y\|^{2}_{F} +P(X)\\ &= \frac{\tau}{2}\|X-G\|^{2}_{F}+P(X)+f(Y)\\ &\quad-\frac{1}{2\tau}\|\nabla f(Y)\|^{2}_{F}\\ \end{aligned} $$
(7)
where τ>0 is a given parameter and G=Y−τ^{−1}∇f(Y).
Since the above function of X is strongly convex, it has a unique global minimizer.
Consider the minimization problem
$$ \underset{X\in R^{M\times N}}{\text{min}}\quad \frac{\tau}{2}\|X-G\|^{2}_{F}+\mu\|X\|_{*} $$
(8)
where G∈R^{M×N}. Note that if G=Y−τ^{−1}A^{∗}(A(Y)−b), then the above minimization problem is a special case of Eq. 7 with \(f(X)=\frac {1}{2}\|A(X)-b\|^{2}_{2}\) and P(X)=μ‖X‖_{∗} when we ignore the constant terms.
Let S_{τ}(G) denote the minimizer of Eq. 8. According to [24], we further have
$$S_{\tau}(G)=U\cdot diag((\delta-\mu/\tau)_{+})\cdot V^{T} $$
given the SVD decomposition of G=Y−τ^{−1}A^{∗}(A(Y)−b)=U·diag(δ)·V^{T}. Here, for a given vector x∈R^{p}, we let x_{+}=max{x,0} where the maximum is taken componentwise.
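The singular value shrinkage operator S_τ(G) above can be sketched in NumPy. This is a minimal illustration; the matrix sizes and parameter values are arbitrary.

```python
import numpy as np

def svt(G, mu, tau):
    """Minimizer S_tau(G) of (tau/2)||X - G||_F^2 + mu ||X||_*:
    shrink the singular values of G by mu/tau and rebuild the matrix."""
    U, delta, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(np.maximum(delta - mu / tau, 0.0)) @ Vt

# Quick check: the singular values of the result are (delta - mu/tau)_+.
rng = np.random.default_rng(2)
G = rng.normal(size=(5, 4))
X = svt(G, mu=1.0, tau=2.0)
expect = np.maximum(np.linalg.svd(G, compute_uv=False) - 0.5, 0.0)
assert np.allclose(np.linalg.svd(X, compute_uv=False), expect)
```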
Based on the accelerated proximal gradient (APG) design given in [13, 24], we set t_{0}=t_{1}=1 and τ_{k}=L_{f}, and denote by {X_{k}},{Y_{k}},{t_{k}} the sequences generated by APG. For k=1,2,3,⋯, we have

Step 1: Set \(Y_{k}=X_{k}+\frac {t_{k-1}-1}{t_{k}}\left (X_{k}-X_{k-1}\right)\)

Step 2: Set G_{k}=Y_{k}−(τ_{k})^{−1}A^{∗}(A(Y_{k})−b). Compute \(S_{\tau _{k}}(G_{k})\) from the SVD of G_{k}

Step 3: Set \(X_{k+1}=S_{\tau _{k}}(G_{k})\)

Step 4: Set \(t_{k+1}=\frac {1+\sqrt {1+4(t_{k})^{2}}}{2}\)
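The four steps above can be sketched as a minimal NumPy routine. The problem sizes, μ, and iteration count below are illustrative assumptions, and τ_k is fixed to L_f as in the text.

```python
import numpy as np

def apg(Phi, b, M, N, mu, iters=300):
    """Accelerated proximal gradient for
        min 0.5 * ||Phi vec(X) - b||_2^2 + mu * ||X||_*
    following Steps 1-4 above, with tau_k fixed to Lf = ||Phi^T Phi||_2."""
    vec = lambda Z: Z.flatten(order="F")          # stack columns
    mat = lambda z: z.reshape(M, N, order="F")    # inverse of vec
    Lf = np.linalg.norm(Phi.T @ Phi, ord=2)       # Lipschitz constant of grad f

    X_prev = X = np.zeros((M, N))
    t_prev = t = 1.0
    for _ in range(iters):
        # Step 1: extrapolation point Y_k
        Y = X + ((t_prev - 1.0) / t) * (X - X_prev)
        # Step 2: gradient step G_k = Y_k - tau^{-1} A*(A(Y_k) - b)
        G = Y - mat(Phi.T @ (Phi @ vec(Y) - b)) / Lf
        # Step 3: singular value shrinkage X_{k+1} = S_tau(G_k)
        U, d, Vt = np.linalg.svd(G, full_matrices=False)
        X_prev, X = X, U @ np.diag(np.maximum(d - mu / Lf, 0.0)) @ Vt
        # Step 4: momentum update t_{k+1}
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return X

# Toy run: p = 15 random combinations of a rank-1 5x4 matrix
# (sizes, mu, and iteration count are illustrative choices).
rng = np.random.default_rng(3)
M, N, p = 5, 4, 15
X_true = np.outer(rng.normal(size=M), rng.normal(size=N))
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
b = Phi @ X_true.flatten(order="F")
X_hat = apg(Phi, b, M, N, mu=1e-4, iters=500)
```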
Lemma 2
For any μ>0, the optimal solution X^{∗} of Eq. 3 is bounded according to [13, 24], and ‖X^{∗}‖_{F}<χ, where
$$ \chi = \left\{ \begin{array}{ll} \text{min}\left\{\|b\|^{2}_{2}/(2\mu), \|X_{LS}\|_{*}\right\} & \text{if}~A~\text{is surjective}\\ \|b\|^{2}_{2}/(2\mu) & \text{otherwise} \end{array} \right. $$
(9)
with X_{LS}=A^{∗}(AA^{∗})^{−1}b.
Based on this lemma, we can reach a deterministic estimate of the convergence speed of data recovery.
Let {X_{k}},{Y_{k}},{t_{k}} be the sequences generated by APG. Then, for any k≥1, we have
$$F(X_{k})-F(X^{*})\leq \frac{2L_{f}\|X^{*}-X_{0}\|^{2}_{F}}{(k+1)^{2}} $$
Thus,
$$F(X_{k})-F(X^{*})\leq \varepsilon \quad \text{if}\quad k\geq \sqrt{\frac{2L_{f}}{\varepsilon}}(\|X_{0}\|_{F}+\chi)-1. $$
Let δ(X) denote dist(0,∂(f(X)+μ‖X‖_{∗})), which measures how close X is to an optimal solution and thus reflects the convergence of data recovery. It is easy to see that the process naturally stops when δ(X) is small enough.
Since ‖X‖_{∗} is not differentiable, it may not be easy to compute δ(X) directly. However, a good upper bound for δ(X) is provided by APG designs [24].
Given
$$\begin{aligned} \tau_{k}(G_{k}-X_{k+1})&=\tau_{k}(Y_{k}-X_{k+1})-\nabla f(Y_{k})\\ &=\tau_{k} (Y_{k}-X_{k+1})\\ &\quad-\Phi^{T}(\Phi\cdot vec(Y_{k})-b) \end{aligned} $$
Note that, by the optimality of X_{k+1} in Eq. 8,
$$\tau_{k}(G_{k}-X_{k+1}) \in \partial (\mu \|X_{k+1}\|_{*}) $$
let
$$\begin{aligned} S_{k+1}&\triangleq \tau_{k}(Y_{k}-X_{k+1})+\nabla f(X_{k+1})-\nabla f(Y_{k})\\ &= \tau_{k}(Y_{k}-X_{k+1}) +A^{*}(A(X_{k+1})-A(Y_{k}))\\ &=\tau_{k}(Y_{k}-X_{k+1})+\Phi^{T}(\Phi \cdot T(X_{k+1}-Y_{k})) \end{aligned} $$
we could have
$$S_{k+1} \in \partial (f(X_{k+1})+\mu \|X_{k+1}\|_{*}) $$
Therefore, we have δ(X_{k+1})≤‖S_{k+1}‖_{F}.
According to the derivation above, the stopping condition can be given as follows:
$$ \frac{\|S_{k+1}\|_{F}}{\tau_{k}\, \text{max}\{1,\|X_{k+1}\|_{F}\}}\leq Tol $$
where Tol is a user-defined tolerance, usually a moderately small threshold.
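The stopping test can be sketched as follows. The helper name `stop_check` and its signature are illustrative assumptions, not part of the paper.

```python
import numpy as np

def stop_check(Phi, Y, X_next, tau, tol):
    """Evaluate ||S_{k+1}||_F / (tau * max{1, ||X_{k+1}||_F}) <= Tol
    for one APG step (illustrative helper, tol chosen by the user)."""
    vec = lambda Z: Z.flatten(order="F")
    # S_{k+1} = tau*(Y_k - X_{k+1}) + Phi^T Phi vec(X_{k+1} - Y_k),
    # reshaped back to matrix form.
    S = tau * (Y - X_next) + (Phi.T @ (Phi @ vec(X_next - Y))).reshape(
        Y.shape, order="F")
    ratio = np.linalg.norm(S, "fro") / (tau * max(1.0, np.linalg.norm(X_next, "fro")))
    return ratio <= tol

# At a fixed point (X_{k+1} = Y_k), S vanishes and the criterion is met.
rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 12))
Y = rng.normal(size=(4, 3))
assert stop_check(Phi, Y, Y, tau=1.0, tol=1e-9)
```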