Problem formulation
Given M time instances and N devices in the network, the original data in the network can be represented by an M×N matrix X, where each row contains the readings of all devices at one time instant and each column contains the readings of one IoT device across the time instants. X_{ij} (1≤i≤M, 1≤j≤N) denotes the reading of device j at time instant t_{i}.
Let A denote a linear map from R^{M×N} to R^{p} and vec denote the linear map that transforms a matrix into a vector by stacking one column on another; we have
$$A(\boldsymbol{X})=\boldsymbol{\Phi} \cdot vec(\boldsymbol{X}) $$
where Φ is a p×MN matrix. Let Φ be a random matrix satisfying the RIP condition [21]. Before deployment, each device is equipped with a pseudorandom number generator. Once the device produces a reading at some time instance, the pseudorandom number generator generates a random vector of length p, using the combination of the current time instance and the device’s ID as the random seed. The elements of this random vector are i.i.d. samples from a Gaussian distribution with mean 0 and variance 1/p. Note that this pseudorandom number generation at each device can be reproduced by the base station using the same generator.
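This seeding scheme can be sketched as follows. It is a minimal illustration only: the exact way of combining the time instance and the device ID into a seed is an assumption, not specified by the paper.

```python
import numpy as np

def random_vector(time_instant, device_id, p):
    """Generate the length-p measurement vector for one reading.

    The seed combines the time instant and the device ID (illustrative
    seeding scheme), so the base station can regenerate the exact same
    vector offline.  Entries are i.i.d. Gaussian with mean 0 and
    variance 1/p, as in the text.
    """
    rng = np.random.default_rng(seed=(time_instant, device_id))
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / p), size=p)

# The device and the base station produce identical vectors from the
# same (time, ID) pair without ever transmitting the vector itself.
phi_device = random_vector(time_instant=3, device_id=17, p=8)
phi_station = random_vector(time_instant=3, device_id=17, p=8)
assert np.array_equal(phi_device, phi_station)
```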
The dimension p of A is the number of measurements (namely, the random combinations) needed to recover X. Typically, p should be no less than cr(3M+3N−5r) [13], where r is the rank of X and c is a positive constant. Therefore, the problem can be formulated as the following optimization problem:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad \frac{1}{2} \|A(\boldsymbol{X})-\boldsymbol{b}\|^{2}_{F}+ \mu \|\boldsymbol{X}\|_{*} $$
(3)
where the first term accounts for measurement noise and the second term promotes a low-rank solution.
Remark 1
In IoT networks, devices may produce erroneous readings due to noisy environments or error-prone hardware. Erroneous readings usually occur at sporadic times and locations and thus have little impact on the data sparsity of the network. Thus, outlier/abnormal reading recovery/detection could still work in compressive sensing-based data gathering. However, device measurements of the same event usually have strong inter-correlations and are geographically concentrated in a group of devices in close proximity. Such events may spread over diverse time and space scales and result in dynamic sparsity of the data, which violates the assumption of constant sparsity in compressive sensing and thus leads to poor recovery.
Remark 2
Given N M-dimensional signal vectors generated from N devices within M time instances, a good basis that makes these vectors sparse may not be easy to find. Interestingly, [22] has analyzed different sets of data from two independent device network testbeds. The results indicate that the N×M data matrix may be approximately low rank under the various scenarios under investigation. Therefore, such an N×M temporal-spatial signal gathering problem with diverse-scale event data, which cannot be well addressed by the CS method, could be tackled under the low-rank framework^{Footnote 1}.
Path along compressive collection
In this paper, we provide a generalization of current data gathering methods to temporal-spatial signals with diverse-scale events, during which device readings are compressively collected along the relay paths, e.g., in a chain-type or mesh topology, to the sink.
At each device s_{j}, given the reading produced from s_{j} at time instance t_{1}, s_{j} generates a random vector Φ_{1j} of length p, with time instance t_{1} and its ID s_{j} as the seed, and computes the vector X_{1j}Φ_{1j}. At the next time instance t_{2}, s_{j} generates a random vector Φ_{2j}, computes X_{2j}Φ_{2j}, and adds it to the previous vector X_{1j}Φ_{1j}. At time instance t_{M}, s_{j} computes X_{Mj}Φ_{Mj} and would have the summation \(S_{j}=\sum \limits ^{M}_{i=1} \boldsymbol {X}_{ij}\Phi _{ij}\).
In the network, each device s_{j} continuously updates its vector sum S_{j} till time instance t_{M}. After that, device s_{j} relays the vector S_{j} to the next device s_{i}. Then, s_{i} adds S_{j} to its own vector sum S_{i} and forwards S_{i}+S_{j} to the next device. After the collection along the relay paths, the sink receives \(\sum \limits ^{N}_{j=1}\sum \limits ^{M}_{i=1} \boldsymbol {X}_{ij}\Phi _{ij}=\Phi \cdot vec(\boldsymbol {X})\).
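The collection process above can be sketched end-to-end in NumPy. This is a toy illustration: the sizes, the readings, and the seeding scheme are assumptions for demonstration.

```python
import numpy as np

def measurement_vector(t, device_id, p):
    # Same seeded generator on device and base station (illustrative seeding).
    rng = np.random.default_rng(seed=(t, device_id))
    return rng.normal(0.0, np.sqrt(1.0 / p), size=p)

M, N, p = 5, 4, 12
X = np.arange(M * N, dtype=float).reshape(M, N)   # toy readings X[i, j]

# Each device j accumulates S_j = sum_i X[i, j] * Phi_ij locally ...
S = [sum(X[i, j] * measurement_vector(i, j, p) for i in range(M))
     for j in range(N)]
# ... and the partial sums are added along the relay path, so the sink
# receives a single length-p vector regardless of the path length.
b = sum(S)

# The sink reproduces Phi column-by-column and verifies b = Phi * vec(X),
# where vec stacks the columns of X (Fortran order).
Phi = np.column_stack([measurement_vector(i, j, p)
                       for j in range(N) for i in range(M)])
assert np.allclose(b, Phi @ X.flatten(order="F"))
```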
Remark 3
During data gathering, each node sends out only one vector of fixed length along the collection path, regardless of the distance to the sink (the property of the fixed-length vector will be discussed in Section 4).
Considering event data, recall that each row of the data matrix X (the signal in the network) represents the data acquired at some time instance from all devices, and each column of X represents the data collected from one device at different time instances.
Outlier readings may come from internal errors at error-prone devices, for example, noise or systematic errors, or may be caused by external events due to environmental changes. The former (internal errors) are often sparse in the spatial domain, while the latter (event readings) are usually low rank in the time domain. Each stays sparse in its corresponding domain, but together they may lead to dynamic changes in data sparsity.
Let matrix X be decomposed into two parts, the normal one and the abnormal one: X=X_{n}+X_{s}. We could have:
$$ \begin{aligned} A(\boldsymbol{X})&=A\left(\boldsymbol{X}_{n}+\boldsymbol{X}_{s}\right)\\ &=A\cdot[I,I]\left[\boldsymbol{X}_{n},\boldsymbol{X}_{s}\right]^{T}\\ &=[A,A]\left[\boldsymbol{X}_{n},\boldsymbol{X}_{s}\right]^{T}\\ \end{aligned} $$
(4)
Based on Eq. 1, [A,A] is a new linear map. The formulated problem could be solved in the framework of matrix recovery. That is, given the observation vector b∈R^{p}, the stacked matrix [X_{n},X_{s}]^{T}∈R^{2M×N} could be recovered, from which the original data matrix X^{∗}=X_{n}+X_{s} is obtained.
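The augmented map can be checked numerically. The sketch below uses toy sizes and illustrative data; it only verifies the identity A(X_n + X_s) = [A, A]·[vec(X_n); vec(X_s)], not the recovery itself.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, p = 4, 3, 10
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
vec = lambda Z: Z.flatten(order="F")   # stack columns

Xn = rng.normal(size=(M, N))           # "normal" component
Xs = np.zeros((M, N))
Xs[2, 1] = 5.0                         # sparse outlier component

# A(Xn + Xs) equals the augmented map [A, A] applied to the stacked
# unknown [vec(Xn); vec(Xs)], so recovery can run over the doubled space.
lhs = Phi @ vec(Xn + Xs)
rhs = np.hstack([Phi, Phi]) @ np.concatenate([vec(Xn), vec(Xs)])
assert np.allclose(lhs, rhs)
```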
A basic design of data recovery
This section generalizes the data recovery method from compressive sensing to the realm of matrix recovery. The advantages of such an extension are twofold: (1) it exploits the data correlation in both time and space domains, and (2) diverse-scale event data, which would undermine the CS method due to sparsity changes, can be tackled with the proposed method.
According to Eqs. 3 and 4, the general form of the problem could be expressed with the following minimization problem:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad \frac{1}{2} \|A(\boldsymbol{X})-\boldsymbol{b}\|^{2}_{F}+ \mu \|\boldsymbol{X}\|_{*} $$
(5)
where A(X)=Φ·T(X), with T(·) denoting the transformation of a matrix into a vector by stacking one column on another (i.e., T(·)=vec(·)), and Φ is a p×MN random matrix.
Note that Eq. 3 is the Lasso form of Eq. 2. Under relaxed conditions, its solution is the solution of Eq. 2 [23]. Therefore, we consider Eq. 3 (Eqs. 3 and 5 are essentially the same) instead of the original problem in Eq. 2.
This problem could be further transformed into the following form:
$$ \underset{\boldsymbol{X} \in R^{M\times N}}{\text{min}}\quad F(\boldsymbol{X})\triangleq f(\boldsymbol{X}) +P(\boldsymbol{X}) $$
(6)
where \(f(\boldsymbol{X})=\frac {1}{2} \|A(\boldsymbol {X})-\boldsymbol{b}\|^{2}_{F}\) and \(P(\boldsymbol{X})=\mu \|\boldsymbol{X}\|_{*}\).
Note that both parts are convex, but only the first part is differentiable, while the second part may not be. Then, we have
$$\nabla f(\boldsymbol{X})= A^{*}(A(\boldsymbol{X})-\boldsymbol{b}) $$
where A^{∗} is the dual operator of A.
Since A^{∗}(X)=Φ^{T}X, we have
$$\nabla f(\boldsymbol{X})= A^{*}(A(\boldsymbol{X})-\boldsymbol{b})=\Phi^{T}(\Phi\cdot T(\boldsymbol{X})-\boldsymbol{b}) $$
Because ∇f is an affine function of X, it is Lipschitz continuous. Then, there exists a positive constant L_{f} satisfying the following inequality:
$$\|\nabla f(\boldsymbol{X})-\nabla f(\boldsymbol{Y})\|_{F}\leq L_{f} \|\boldsymbol{X}-\boldsymbol{Y}\|_{F}\quad\forall \boldsymbol{X},\boldsymbol{Y}\in R^{M\times N} $$
Lemma 1
A rough estimate of L_{f} is
$$\sqrt{MN\cdot \underset{i}{\text{max}}\: \left\{\left\|\left(\Phi^{T}\Phi\right)_{i}\right\|^{2}_{2}\right\}}, $$
where \(\left (\Phi ^{T}\Phi \right)_{i}\) is the ith column of the matrix Φ^{T}Φ.
Proof
\(\|\nabla f(\boldsymbol {X})-\nabla f(\boldsymbol {Y})\|_{F}^{2}=\|\Phi ^{T}\Phi\, T(\boldsymbol {X}-\boldsymbol {Y})\|^{2}_{2}\)
Set \(\Phi ^{T}\Phi = \left (\begin {array}{ccc} a_{11} & \cdots & a_{1,MN} \\ \vdots & & \vdots \\ a_{MN,1} &\cdots &a_{MN,MN} \end {array}\right)\) with rows \(a_{j}\),
\(T(\boldsymbol{X}-\boldsymbol{Y})=\left (\begin {array}{c} x_{1} \\ \vdots \\ x_{MN} \end {array}\right)\),
and \(h= \underset {i}{\text {max}}\; \left \{\left\|\left (\Phi ^{T}\Phi \right)_{i}\right\|^{2}_{2}\right\}\); then, by the Cauchy–Schwarz inequality,
$$\begin{aligned} \|\Phi^{T} \Phi\, T(\boldsymbol{X}-\boldsymbol{Y})\|^{2}_{2}&=\sum\limits^{MN}_{j=1}\left(\sum\limits^{MN}_{i=1} a_{ji}x_{i}\right)^{2}\\ &\leq \sum\limits^{MN}_{j=1}\|a_{j}\|^{2}_{2}\,\|T(\boldsymbol{X}-\boldsymbol{Y})\|^{2}_{2}\\ &=\|\Phi^{T}\Phi\|^{2}_{F}\,\|\boldsymbol{X}-\boldsymbol{Y}\|^{2}_{F}\\ &\leq MN\cdot h\cdot \|\boldsymbol{X}-\boldsymbol{Y}\|^{2}_{F} \end{aligned} $$
Thus, \(L_{f}\leq \sqrt {MNh}\). □
Remark 4
A much smaller L_{f} can often be found in real scenarios, which helps the algorithm converge quickly. The experimental results of this paper show that L_{f} can be much smaller than the rough estimate above when the matrix is sampled from a Gaussian distribution.
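This remark can be checked numerically: for a Gaussian Φ, the spectral norm ‖Φ^{T}Φ‖₂ (the smallest valid Lipschitz constant of ∇f) is typically well below the rough estimate of Lemma 1. The sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, p = 10, 8, 60
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
G = Phi.T @ Phi

# The smallest valid Lipschitz constant of grad f is the spectral norm.
Lf_true = np.linalg.norm(G, ord=2)

# Lemma 1's rough estimate: sqrt(MN * max_i ||(Phi^T Phi)_i||_2^2).
h = np.max(np.sum(G ** 2, axis=0))
Lf_rough = np.sqrt(M * N * h)

assert Lf_true <= Lf_rough   # valid upper bound, typically quite loose
```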
Consider the following quadratic approximation of F(·) of Eq. 6 at Y:
$$ \begin{aligned} Q_{\tau}(X,Y)&\triangleq f(Y)+\langle\nabla f(Y),X-Y\rangle\\ &\quad+\frac{\tau}{2}\|X-Y\|^{2}_{F} +P(X)\\ &= \frac{\tau}{2}\|X-G\|^{2}_{F}+P(X)+f(Y)\\ &\quad-\frac{1}{2\tau}\|\nabla f(Y)\|^{2}_{F}\\ \end{aligned} $$
(7)
where τ>0 is a given parameter and G=Y−τ^{−1}∇f(Y).
Since the above function of X is strongly convex, it has a unique global minimizer.
Consider the minimization problem
$$ \underset{X\in R^{M\times N}}{\text{min}}\quad \frac{\tau}{2}\|X-G\|^{2}_{F}+\mu\|X\|_{*} $$
(8)
where G∈R^{M×N}. Note that if G=Y−τ^{−1}A^{∗}(A(Y)−b), then the above minimization problem is a special case of Eq. 7 with \(f(X)=\frac {1}{2}\|A(X)-b\|^{2}_{2}\) and P(X)=μ‖X‖_{∗} when we ignore the constant terms.
Let S_{τ}(G) denote the minimizer of Eq. 8. According to [24], we further have
$$S_{\tau}(G)=U\cdot diag((\delta-\mu/\tau)_{+})\cdot V^{T} $$
given the SVD decomposition of G=Y−τ^{−1}A^{∗}(A(Y)−b)=U·diag(δ)·V^{T}. Here, for a given vector x∈R^{p}, we let x_{+}=max{x,0} where the maximum is taken componentwise.
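The singular value shrinkage operator S_τ(G) above can be sketched in NumPy. This is a minimal illustration; the matrix sizes and parameter values are arbitrary.

```python
import numpy as np

def svt(G, mu, tau):
    """Minimizer S_tau(G) of (tau/2)||X - G||_F^2 + mu ||X||_*:
    shrink the singular values of G by mu/tau and rebuild the matrix."""
    U, delta, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(np.maximum(delta - mu / tau, 0.0)) @ Vt

# Quick check: the singular values of the result are (delta - mu/tau)_+.
rng = np.random.default_rng(2)
G = rng.normal(size=(5, 4))
X = svt(G, mu=1.0, tau=2.0)
expect = np.maximum(np.linalg.svd(G, compute_uv=False) - 0.5, 0.0)
assert np.allclose(np.linalg.svd(X, compute_uv=False), expect)
```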
Based on the accelerated proximal gradient (APG) design given in [13, 24], we set t_{0}=t_{1}=1 and τ_{k}=L_{f}, and denote by {X_{k}},{Y_{k}},{t_{k}} the sequences generated by APG. For k=1,2,3,⋯, we have

Step 1: Set \(Y_{k}=X_{k}+\frac {t_{k-1}-1}{t_{k}}\left (X_{k}-X_{k-1}\right)\)

Step 2: Set G_{k}=Y_{k}−(τ_{k})^{−1}A^{∗}(A(Y_{k})−b). Compute \(S_{\tau _{k}}(G_{k})\) from the SVD of G_{k}

Step 3: Set \(X_{k+1}=S_{\tau _{k}}(G_{k})\)

Step 4: Set \(t_{k+1}=\frac {1+\sqrt {1+4(t_{k})^{2}}}{2}\)
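The four steps above can be sketched as a minimal NumPy routine. The problem sizes, μ, and iteration count below are illustrative assumptions, and τ_k is fixed to L_f as in the text.

```python
import numpy as np

def apg(Phi, b, M, N, mu, iters=300):
    """Accelerated proximal gradient for
        min 0.5 * ||Phi vec(X) - b||_2^2 + mu * ||X||_*
    following Steps 1-4 above, with tau_k fixed to Lf = ||Phi^T Phi||_2."""
    vec = lambda Z: Z.flatten(order="F")          # stack columns
    mat = lambda z: z.reshape(M, N, order="F")    # inverse of vec
    Lf = np.linalg.norm(Phi.T @ Phi, ord=2)       # Lipschitz constant of grad f

    X_prev = X = np.zeros((M, N))
    t_prev = t = 1.0
    for _ in range(iters):
        # Step 1: extrapolation point Y_k
        Y = X + ((t_prev - 1.0) / t) * (X - X_prev)
        # Step 2: gradient step G_k = Y_k - tau^{-1} A*(A(Y_k) - b)
        G = Y - mat(Phi.T @ (Phi @ vec(Y) - b)) / Lf
        # Step 3: singular value shrinkage X_{k+1} = S_tau(G_k)
        U, d, Vt = np.linalg.svd(G, full_matrices=False)
        X_prev, X = X, U @ np.diag(np.maximum(d - mu / Lf, 0.0)) @ Vt
        # Step 4: momentum update t_{k+1}
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return X

# Toy run: p = 15 random combinations of a rank-1 5x4 matrix
# (sizes, mu, and iteration count are illustrative choices).
rng = np.random.default_rng(3)
M, N, p = 5, 4, 15
X_true = np.outer(rng.normal(size=M), rng.normal(size=N))
Phi = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, M * N))
b = Phi @ X_true.flatten(order="F")
X_hat = apg(Phi, b, M, N, mu=1e-4, iters=500)
```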
Lemma 2
For any μ>0, the optimal solution X^{∗} of Eq. 3 is bounded according to [13, 24], and ‖X^{∗}‖_{F}<χ, where
$$ \chi = \left\{ \begin{array}{ll} \text{min}\left\{\|b\|^{2}_{2}/(2\mu), \|X_{LS}\|_{*}\right\} & \text{if}~A~\text{is surjective}\\ \|b\|^{2}_{2}/(2\mu) & \text{otherwise} \end{array} \right. $$
(9)
with X_{LS}=A^{∗}(AA^{∗})^{−1}b.
Based on this lemma, we can reach a deterministic estimate of the convergence speed of data recovery.
Let {X_{k}},{Y_{k}},{t_{k}} be the sequences generated by APG. Then, for any k≥1, we have
$$F(X_{k})-F(X^{*})\leq \frac{2L_{f}\|X^{*}-X_{0}\|^{2}_{F}}{(k+1)^{2}} $$
Thus,
$$F(X_{k})-F(X^{*})\leq \varepsilon \quad \text{if}\quad k\geq \sqrt{\frac{2L_{f}}{\varepsilon}}(\|X_{0}\|_{F}+\chi)-1. $$
Let δ(X) denote dist(0,∂(f(X)+μ‖X‖_{∗})), which measures how close X is to an optimal solution and thus reflects the convergence of data recovery. It is easy to see that the process naturally stops when δ(X) is small enough.
Since ‖X‖_{∗} is not differentiable, it may not be easy to compute δ(X) directly. However, a good upper bound for δ(X) is provided by APG designs [24].
Given
$$\begin{aligned} \tau_{k}(G_{k}-X_{k+1})&=\tau_{k}(Y_{k}-X_{k+1})-\nabla f(Y_{k})\\ &=\tau_{k} (Y_{k}-X_{k+1})\\ &\quad-\Phi^{T}(\Phi\cdot vec(Y_{k})-b) \end{aligned} $$
Note that, by the optimality of X_{k+1} in Eq. 8,
$$\tau_{k}(G_{k}-X_{k+1}) \in \partial (\mu \|X_{k+1}\|_{*}) $$
let
$$\begin{aligned} S_{k+1}&\triangleq \tau_{k}(Y_{k}-X_{k+1})+\nabla f(X_{k+1})-\nabla f(Y_{k})\\ &= \tau_{k}(Y_{k}-X_{k+1}) +A^{*}(A(X_{k+1})-A(Y_{k}))\\ &=\tau_{k}(Y_{k}-X_{k+1})+\Phi^{T}(\Phi \cdot T(X_{k+1}-Y_{k})) \end{aligned} $$
we could have
$$S_{k+1} \in \partial (f(X_{k+1})+\mu \|X_{k+1}\|_{*}) $$
Therefore, we have δ(X_{k+1})≤‖S_{k+1}‖_{F}.
According to the derivation above, the stopping condition can be given as follows:
$$ \frac{\|S_{k+1}\|_{F}}{\tau_{k}\, \text{max}\{1,\|X_{k+1}\|_{F}\}}\leq Tol $$
where Tol is a user-defined tolerance, usually a moderately small threshold.
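The stopping test can be sketched as follows. The helper name `stop_check` and its signature are illustrative assumptions, not part of the paper.

```python
import numpy as np

def stop_check(Phi, Y, X_next, tau, tol):
    """Evaluate ||S_{k+1}||_F / (tau * max{1, ||X_{k+1}||_F}) <= Tol
    for one APG step (illustrative helper, tol chosen by the user)."""
    vec = lambda Z: Z.flatten(order="F")
    # S_{k+1} = tau*(Y_k - X_{k+1}) + Phi^T Phi vec(X_{k+1} - Y_k),
    # reshaped back to matrix form.
    S = tau * (Y - X_next) + (Phi.T @ (Phi @ vec(X_next - Y))).reshape(
        Y.shape, order="F")
    ratio = np.linalg.norm(S, "fro") / (tau * max(1.0, np.linalg.norm(X_next, "fro")))
    return ratio <= tol

# At a fixed point (X_{k+1} = Y_k), S vanishes and the criterion is met.
rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 12))
Y = rng.normal(size=(4, 3))
assert stop_check(Phi, Y, Y, tau=1.0, tol=1e-9)
```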