2.1 Classic kernel least squares regression
As introduced in [11], the classic kernel least squares regression method is a well-known approach based on applying the kernel trick to the linear least squares algorithm. Given a set of training inputs \(\mathbf {x}_{i}\in \mathbb {R}^{d}\) and the corresponding outputs \(y_{i}\in \mathbb {R}\), with i=1,...,n, the algorithm finds the hyperplane in the non-linearly transformed kernel space, f(x), that best fits the training data by minimizing a least squares criterion. Then, for any new or test input, \(\mathbf {x}^{\star }\), the method predicts the output as \(f(\mathbf {x}^{\star })\). In a WSN, the input \(\mathbf {x}_{i}\) could be the position coordinates while the output \(y_{i}\) could be some environmental measure, e.g. the temperature.
In the computation of the prediction, f(x), an optimization problem is solved in which a loss function, \(\mathcal {L}\), between the truth \(y_{i}\) and its prediction \(f(\mathbf {x}_{i})\),
$$ \mathcal{L}(y_{i},f(\mathbf{x}_{i}))\geq0, $$
((1))
is averaged over the joint probability density function of the inputs and outputs,
$$ \mathcal{R}(f)=\int_{\mathcal{Y}\times\mathcal{X}}\mathcal{L}(y,f(\mathbf{x}))p(y,\mathbf{x})dyd\mathbf{x} $$
((2))
where \(\mathcal {R}(f)\) is the so-called risk function.
Usually, the risk of making estimations cannot be computed because the joint distribution of the inputs and the outputs is unknown. However, we can approximate it by averaging the loss over the available training data, as formulated by the empirical risk minimization (ERM) principle. The formulation in (2) then yields
$$ \mathcal{R}_{\text{emp}}(f)=\sum\limits_{i=1}^{n}\mathcal{L}(y_{i},f(\mathbf{x}_{i})). $$
((3))
Using the least squares (LS) loss function, we get:
$$ \mathcal{R}_{\text{emp}}(f)=\sum\limits_{i=1}^{n}(y_{i}-f(\mathbf{x}_{i}))^{2}. $$
((4))
Since there is an infinite set of non-linear prediction functions, f(x), that fit the output data, we need to constrain the solution. This is achieved through Tikhonov regularization, which leads to the classic formulation of the KLS problem in (5),
$$ f_{\lambda}(\cdot)=\arg\min_{f\in\mathcal{H}_{K}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-y_{i})^{2}+\lambda\Vert f\Vert_{\mathcal{H}_{K}}^{2}. $$
((5))
The optimization variable is f, a function constrained to lie in \(\mathcal {H}_{K}\), the reproducing kernel Hilbert space induced by the kernel k(·,·), whose norm we denote by \(\parallel \cdot \parallel _{\mathcal {H}_{K}}\). \(\mathcal {H}_{K}\) is a vector space of functions with a certain (and convenient) inner product. Note that in (5), the first term makes f(·) minimize the mean square error, while the second term forces the solution f(·) to have minimum norm in order to avoid overfitting. The inner-product structure implies that the solution to (5), denoted by \(f_{\lambda }(\cdot)\), satisfies:
$$ f_{\lambda}(\cdot)=\sum\limits_{i=1}^{n}c_{\lambda,i}k(\cdot,\mathbf{x}_{i}) $$
((6))
for some \(\mathbf {c}_{\lambda }\in \mathbb {R}^{n}\). This fact is known as the representer theorem [12]. In the case of least squares, \(\mathbf {c}_{\lambda }\) is the solution to a system of n linear equations:
$$ \mathbf{c}_{\lambda}=(\mathbf{K}+\lambda\mathbf{I})^{-1}\mathbf{y} $$
((7))
where K is the kernel matrix, whose elements are defined by \(k_{ij}=k(\mathbf {x}_{i},\mathbf {x}_{j})\) for a pre-specified kernel.
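As an illustration, the following is a minimal sketch of centralized KLS with a Gaussian kernel: it builds the kernel matrix, solves the linear system in (7) and evaluates (6) at a test point. The kernel choice, the function names (kls_fit, kls_predict) and the synthetic data are illustrative assumptions, not taken from [11].

```python
# Minimal centralized KLS sketch (Eqs. (5)-(7)); kernel and names are assumptions.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kls_fit(X, y, lam=0.1, sigma=1.0):
    """Solve (K + lam*I) c = y, i.e. Eq. (7)."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kls_predict(X_train, c, X_test, sigma=1.0):
    """Evaluate f_lambda(x*) = sum_i c_i k(x*, x_i), i.e. Eq. (6)."""
    return gaussian_kernel(X_test, X_train, sigma) @ c

# Toy WSN-style data: sensor positions in the plane with noisy readings.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                  # sensor coordinates
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)   # synthetic measurements
c = kls_fit(X, y, lam=0.1, sigma=2.0)
print(kls_predict(X, c, np.array([[5.0, 5.0]]), sigma=2.0))
```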
2.2 Distributed kernel least squares (DKLS)
2.2.1 Distributed definition of KLS
The previous solution is a centralized algorithm and, as it stands, cannot be implemented in a distributed fashion. Let us suppose that we have a wireless sensor network of m nodes and that we have n≤m measurements from them as training samples. Using the same notation as in Section 2.1, we could think of positions as inputs \(\mathbf {x}_{i}\in \mathbb {R}^{3}\) and of temperature measurements \(y_{i}\) as outputs. The training samples are collected in the set \(S_{n}\). Let us suppose that not all nodes have access to all the samples, so that the training samples accessible from node j form the subset \({S_{n}^{j}}\). Let us also denote the set of indices of the training samples in \(S_{n}\) by \(\overline {S}_{n}\) and the indices of the training samples accessible by node j by \(\overline {S}_{n}^{j}\).
The first approximation to a distributed problem in this scenario is to compute m centralized solutions, one for each node of the network, so the classical KLS problem could be written as:
$$\begin{array}{*{20}l} &\underset{f_{j}\in\mathcal{H}_{K}}{min}\sum\limits_{i=1}^{n}(z_{i}-y_{i})^{2}+\sum\limits_{j=1}^{m}\lambda_{j}\Vert f_{j} \Vert_{\mathcal{H}_{K}}^{2} \end{array} $$
((8))
$$\begin{array}{*{20}l} &\!\!\!\!\begin{array}{llll} s.t. &z_{i}=f_{j}(x_{i}), & \forall i\in\overline{S}_{n}, & j=1,...,m. \\ \end{array} \end{array} $$
((9))
In this problem, the optimization variables are \(\mathbf {z} \in \mathbb {R}^{n}\) and \(\{f_{j}\}_{j=1}^{m}\); rather than finding a single function f(·), we are estimating a set of them.
The constraints in (9) require that all nodes agree on the training data. This makes the problem equivalent to the classic KLS problem, yielding the centralized solution, i.e. \(f_{j}(\cdot)=f_{\lambda }(\cdot)\) for j=1,...,m (see Lemma 1 in the Appendix of [6]). So, we can associate centralized regression with a global agreement of the nodes on the training samples. Alternatively, we can associate distributed regression with a local agreement, in which only a limited number of samples are shared between each pair of nodes. This latter problem can be described as follows,
$$\begin{array}{*{20}l} &\underset{f_{j}\in\mathcal{H}_{K}}{min}\sum\limits_{i\in\overline{S}_{n}^{j}}^{}(z_{i}-y_{i})^{2}+\sum\limits_{j=1}^{m}\lambda_{j}\Vert f_{j}\Vert_{\mathcal{H}_{K}}^{2} \end{array} $$
((10))
$$\begin{array}{*{20}l} &\!\!\!\!\begin{array}{llll} s.t. & z_{i}=f_{j}(x_{i}), & \forall i\in\overline{S}_{n}^{j}, & j=1,...,m. \end{array} \end{array} $$
((11))
In this formulation, a solution is feasible if and only if \(f_{j}(x_{i})=z_{i}=f_{k}(x_{i})\) for \((x_{i},y_{i})\in {S_{n}^{j}}\cap {S_{n}^{k}}\) and for j,k=1,...,m; that is, if and only if every pair of node decision rules agrees on the samples they share. The minimizer of (10) is \((\mathbf {z},f_{1},...,f_{m})\), where each \(f_{j}\), as part of the joint minimizer, is a function of only the training samples in \({S_{n}^{j}}\).
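As a small illustration of this feasibility condition, the sketch below checks whether two node rules coincide on the samples they share. The callables f1 and f2 and the index sets are hypothetical stand-ins, not actual DKLS estimates.

```python
# Illustrative check of the local-agreement condition in (10)-(11); the node
# rules below are stand-in callables, not actual DKLS estimates.
import numpy as np

def agree_on_shared(f_j, f_k, X, idx_j, idx_k, tol=1e-9):
    """True iff f_j(x_i) == f_k(x_i) for every training index i seen by both nodes."""
    shared = sorted(set(idx_j) & set(idx_k))
    return all(abs(f_j(X[i]) - f_k(X[i])) <= tol for i in shared)

X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)   # five shared-coordinate samples
f1 = lambda x: 2.0 * x[0]                     # node j's decision rule
f2 = lambda x: 2.0 * x[0]                     # node k's rule, identical on shared inputs
print(agree_on_shared(f1, f2, X, idx_j=[0, 1, 2], idx_k=[2, 3, 4]))   # True
```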
2.2.2 Successive orthogonal projections algorithm
A distributed formulation of the KLS problem has been presented in the previous subsection. Here, we address its solution, for which an alternating projections algorithm is proposed in [6], taking into account the similarities between both problems. In particular, the algorithm uses the non-relaxed successive orthogonal projection (SOP) algorithm, described next.
Let \(C_{1},...,C_{m}\) be closed convex subsets of the Hilbert space \(\mathcal {H}\), whose intersection \(C=\cap _{i=1}^{m}C_{i}\) is non-empty. Let \(P_{C}(\hat {v})\) denote the orthogonal projection of \(\hat {v}\in \mathcal {H}\) onto C:
$$ P_{C}(\hat{v})\triangleq\arg\min_{v\in C}\parallel v-\hat{v}\parallel $$
((12))
and let \(P_{C_{i}}(\hat {v})\) denote the orthogonal projection of \(\hat {v}\in \mathcal {H}\) onto \(C_{i}\):
$$ P_{C_{i}}(\hat{v})\triangleq\arg\min_{v\in C_{i}}\parallel v-\hat{v}\parallel $$
((13))
In [6, 13], the successive orthogonal projection (SOP) algorithm is defined to compute \(P_{C}(\cdot)\) using \(\left \{ P_{C_{i}}(\cdot)\right \}_{i=1}^{m}\) as follows:
$$ v_{0}:=\hat{v}\qquad v_{t}:=P_{C_{(t\; \text{mod}\; m)+1}}(v_{t-1}) $$
((14))
In definition (14), (t mod m) denotes the remainder of the division t/m. It establishes that \(P_{C}(\cdot)\) can be computed by projecting sequentially onto all the convex subsets \(C_{i}\), using as input to \(P_{C_{i+1}}\) the result of the previous projection: first, \(\hat {v}\) is projected onto \(C_{1}\); the result \(P_{C_{1}}(\hat {v})\) is then projected onto \(C_{2}\), and the procedure iterates in this way a certain number of times.
As pointed out in [6] (Theorem 2), it is demonstrated in [14] that for every v∈C and every t≥1
$$ \parallel v_{t}-v\parallel\leq\parallel v_{t-1}-v\parallel $$
((15))
and that
$$ {\lim}_{t\rightarrow\infty}v_{t}\in\left(\cap_{i=1}^{m}C_{i}\right) $$
((16))
$$ {\lim}_{t\rightarrow\infty}\parallel v_{t}-P_{C}(\hat{v})\parallel=0 $$
((17))
if the \(C_{i}\) are affine for all i∈{1,...,m}. Hence, the more iterations we perform, the more accurate the result.
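To make the iteration concrete, the following is a minimal sketch of SOP under the simple convention of cycling through \(C_{1},...,C_{m}\) in order; the function name sop, the projection callables and the toy affine sets are assumptions for illustration only.

```python
# Minimal SOP sketch: cycle through the projections onto C_1, ..., C_m, feeding
# each one the output of the previous projection, as in Eq. (14).
import numpy as np

def sop(v_hat, projections, num_sweeps=50):
    """Successive orthogonal projections onto closed convex sets C_1, ..., C_m."""
    v = np.asarray(v_hat, dtype=float)
    m = len(projections)
    for t in range(num_sweeps * m):
        v = projections[t % m](v)   # pick the next set cyclically
    return v

# Toy example: two affine sets (lines) in R^2 whose intersection is the point (1, 2).
# C_1 = {v : v[0] = 1},  C_2 = {v : v[1] = 2}.
P1 = lambda v: np.array([1.0, v[1]])   # orthogonal projection onto C_1
P2 = lambda v: np.array([v[0], 2.0])   # orthogonal projection onto C_2
print(sop(np.array([5.0, -3.0]), [P1, P2]))   # -> [1. 2.]
```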
2.2.3 Distributed KLS solution
It is possible to recast the problem in (10) in terms of the SOP algorithm [6] by defining the Hilbert space \(\mathcal {H}=\mathbb {R}^{n}\times \mathcal {H}_{K}^{m}\) with norm
$$ \parallel(\mathbf{z},f_{1},...,f_{m})\parallel^{2}=\parallel \mathbf{z}\parallel_{2}^{2}+ \sum_{j=1}^{m}\lambda_{j}\parallel f_{j}\parallel_{\mathcal{H}_{K}}^{2}. $$
((18))
With this norm, (10) can be interpreted as the orthogonal projection of the vector \((\mathbf {y},0,...,0)\in \mathcal {H}\) onto the set \(C=\cap _{j=1}^{m}C_{j}\subset \mathcal {H}\), with
$$\begin{array}{*{20}l} &C_{j}=\left\{(\mathbf{z},f_{1},...,f_{m}):f_{j}(x_{i})=z_{i}, \forall i\in\overline{S}_{n}^{j},\right.\\ &\qquad\qquad\qquad\quad\left.\mathbf{z}\in \mathbb{R}^{n},\left\{ f_{j}\right\}_{j=1}^{m} \subset\mathcal{H}_{K}\right\}\subset\mathcal{H} \end{array} $$
((19))
It is important to note that, for any \(v = (\mathbf {z},f_{1},...,f_{m})\in \mathcal {H}\), the computation of
$$ P_{C_{j}}(v)=\arg\min_{v'\in C_{j}}\parallel v-v'\parallel $$
((20))
is restricted to the training examples locally accessible to node j. This means that computing \(P_{C_{j}}(v)\) leaves \(z_{i}\) unchanged for all \(i\notin \overline {S}_{n}^{j}\) and leaves \(f_{k}\) unchanged for all k≠j.
The new function associated with node j can be computed using \(f_{j}\), \(\{x_{i}\}_{i\in \overline {S}_{n}^{j}}\) and the message variables \(\{z_{i}\}_{i\in \overline {S}_{n}^{j}}\). This method defines the DKLS algorithm (using the notation of [7]), shown in Algorithm 1.
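The sketch below shows one way such a scheme could look, assuming the weighted norm in (18) and a closed-form local projection in which node j solves a small system over its own samples, \(\mathbf {c}=(\mathbf {K}_{j}+\lambda _{j}\mathbf {I})^{-1}(\mathbf {z}_{j}-f_{j}(X_{j}))\). The names, the shared regularizer and the overlapping-block neighbourhoods are illustrative assumptions, and the message schedule may differ from Algorithm 1 in [6, 7].

```python
# Sketch of DKLS sweeps: SOP applied to (10)-(11), starting from (y, 0, ..., 0),
# assuming the closed-form local projection c = (K_j + lam*I)^{-1}(z_j - f_j(X_j)).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def dkls_sweeps(X, y, local_idx, lam=0.1, sigma=1.0, sweeps=20):
    """X: all inputs, y: outputs, local_idx[j]: indices of samples visible to node j."""
    n, m = len(y), len(local_idx)
    z = y.astype(float).copy()                 # message variables; start at (y, 0, ..., 0)
    coef = [np.zeros(n) for _ in range(m)]     # f_j(.) = sum_i coef[j][i] k(., x_i)
    K = gaussian_kernel(X, X, sigma)
    for _ in range(sweeps):
        for j in range(m):                     # project onto C_j, cf. Eq. (20)
            S = np.asarray(local_idx[j])
            Kj = K[np.ix_(S, S)]
            fj_at_S = K[S] @ coef[j]           # current f_j evaluated at local inputs
            c = np.linalg.solve(Kj + lam * np.eye(len(S)), z[S] - fj_at_S)
            coef[j][S] += c                    # f_j <- f_j + sum_{i in S} c_i k(., x_i)
            z[S] = fj_at_S + Kj @ c            # updated z_i = new f_j(x_i) for i in S
    return z, coef

# Example: 4 nodes, each seeing an overlapping block of the 80 training samples.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(80, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)
local_idx = [range(0, 30), range(20, 50), range(40, 70), range(50, 80)]
z, coef = dkls_sweeps(X, y, local_idx)
```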
It is interesting to note that the solution of (10) is an approximation to the centralized KLS. As discussed in [6], the neighbourhood of a mote limits the accuracy of its estimates, so local connectivity influences an estimator's bias. Simulation studies in [6, 7] show that the error decays exponentially with the number of neighbours.