A graphical overview of our DVC architecture, which targets the aforementioned application scenarios, is given in Figure 2.

### 4.1. The encoder

Every incoming frame is categorized as a key frame or a Wyner-Ziv frame, denoted by *K* and *W*, respectively, so as to construct groups of pictures (GOPs) of the form *KW* ...*W*. The key frames are coded separately using a conventional intra codec, e.g., H.264/AVC intra [1] or Motion JPEG.^{b} The Wyner-Ziv frames, on the other hand, are encoded in two stages. For every Wyner-Ziv frame, the encoder first generates and codes a hash, which assists the decoder during the motion estimation process. In the second stage, every Wyner-Ziv frame undergoes a discrete cosine transform (DCT) and is subsequently coded in the transform domain using powerful channel codes, generating a Wyner-Ziv bit stream.

#### 4.1.1. Hash formation and coding

Our Wyner-Ziv video encoder creates an efficient hash that consists of a low-quality version of the downsized original Wyner-Ziv frames. In contrast to our previous hash-based DVC architectures [30, 31], where the dimensions of the hash were equal to those of the original input frames, coding a hash based on the downsampled Wyner-Ziv frames reduces the computational complexity. In particular, every Wyner-Ziv frame undergoes a down-scaling operation by a factor *d* ∈ ℤ_{+}. To limit the involved operations, straightforward downsampling is applied. Foregoing a low-pass filter to bandlimit the signal prior to downsampling runs the risk of introducing undesirable aliasing artefacts. However, experimental experience has shown that the resulting impact on the overall rate-distortion (RD) performance of the entire system does not outweigh the computational complexity incurred by the use of state-of-the-art downsampling filters, e.g., Lanczos filters [35].
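The straightforward downsampling step amounts to plain decimation, as sketched below (`downsample_plain` is an illustrative name, not part of the codec):

```python
import numpy as np

def downsample_plain(frame: np.ndarray, d: int) -> np.ndarray:
    """Straightforward downsampling by factor d: keep every d-th sample
    in each dimension, with no anti-aliasing pre-filter."""
    return frame[::d, ::d]
```

This costs no arithmetic at all, at the risk of aliasing; as discussed above, the RD penalty was found not to justify a full filtering stage at the encoder.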

After the dimensions of the original Wyner-Ziv frames have been reduced, the result is coded using a conventional intra video codec, exploiting spatial correlation within the hash frame only. The quality at which the hash is coded has been selected experimentally and constitutes a trade-off between (i) obtaining a constant quality of the decoded frames, which is of particular interest in medical applications, (ii) achieving a high RD performance for the proposed system, and (iii) maintaining a low hash rate overhead. We note that constraining the hash overhead comes with the additional benefit of minimizing the hash encoding complexity. On the other hand, it is important to ensure a sufficient hash quality, so that the accuracy of the hash-based motion estimation at the decoder is not compromised and so that even pixels in the hash itself can serve as predictors. Afterwards, the resulting hash bit stream is multiplexed with the key frame bit stream and sent to the decoder.

We wish to highlight that, apart from assisting motion estimation at the decoder as in contemporary hash-based systems, the proposed hash code is designed to also act as a candidate predictor for pixels for which the temporal correlation is low. This feature is of particular significance when difficult-to-capture endoscopic video content is coded. To this end, the presented hash generation approach was chosen over existing methods in which the hash consists of a number of most significant bit-planes of the Wyner-Ziv frames [30, 31], of coarsely sub-sampled and quantized versions of blocks [28], or of quantized low-frequency DCT bands [29] in the Wyner-Ziv frames.

Furthermore, we note that, in contrast to other hash-based DVC solutions [12, 28], the proposed architecture avoids block-based decisions on the transmission of the hash at the encoder side. Although this can increase the hash rate overhead when easy-to-predict motion content is coded, it comes with the benefit of constraining the encoding complexity, in the sense that the encoder is burdened neither by expensive block-based comparisons nor by the memory requirements necessary for such mode decisions. An additional key advantage of the presented hash code is that it facilitates accurate side information creation using pixel-based multi-hypothesis compensation at the decoder, as explained in Section 4.2.2. In this way, the presented hash code enhances the RD performance of the proposed system, especially for irregular motion content, e.g., endoscopic video material.

#### 4.1.2. Wyner-Ziv encoding

In addition to the coded hash, a Wyner-Ziv layer is created for every Wyner-Ziv frame, providing efficient compression [5] and scalable coding [25]. In line with the DVC architecture introduced in [5], the Wyner-Ziv frames are first transformed with a 4 × 4 integer approximation of the DCT [1] and the obtained coefficients are subsequently assembled in frequency bands. Each DCT band is independently quantized using a collection of predefined quantization matrices (QMs) [26], where the DC band is quantized with a uniform scalar quantizer and the AC bands with a double-deadzone scalar quantizer. The quantized symbols are translated into binary codewords and passed to an LDPC Accumulate (LDPCA) encoder [36], which assumes the role of the Slepian-Wolf encoder.
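The two band quantizers can be sketched as follows (a simplified illustration; the actual step sizes follow from the predefined QMs of [26], and the function names are hypothetical):

```python
import numpy as np

def quantize_uniform(x: np.ndarray, step: float) -> np.ndarray:
    """Uniform scalar quantizer for the DC band (DC coefficients of the
    integer DCT of pixel data are non-negative)."""
    return np.floor(x / step).astype(int)

def quantize_deadzone(x: np.ndarray, step: float) -> np.ndarray:
    """Double-deadzone scalar quantizer for the AC bands: the zero bin
    spans (-step, step), i.e., twice the width of the other bins."""
    return (np.sign(x) * np.floor(np.abs(x) / step)).astype(int)
```

The doubled zero bin maps small AC coefficients to index 0, which is beneficial because AC energy is concentrated around zero.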

The LDPCA [36] encoder realizes Slepian and Wolf's random binning argument [15] through linear channel code syndrome binning. In detail, let **b** be a binary *M*-tuple containing a bit-plane of a coded DCT band *β* of a Wyner-Ziv frame, where *M* is the number of coefficients in the band. To compress **b**, the encoder employs an (*M*, *k*) LDPC channel code *C* constructed by the generator matrix $\mathbf{G}_{k \times M} = [\mathbf{I}_{k} \;\; \mathbf{P}_{k \times (M-k)}]$.^{c} The corresponding parity check matrix of *C* is $\mathbf{H}_{(M-k) \times M} = [\mathbf{P}_{k \times (M-k)}^{T} \;\; \mathbf{I}_{M-k}]$. Thereafter, the encoder forms the syndrome vector as **s** = **bH**^{T}. To achieve various puncturing rates, the LDPC syndrome-based scheme is concatenated with an accumulator [36]. Namely, the derived syndrome bits **s** are in turn mod-2 accumulated, producing the accumulated syndrome tuple **α**. The encoder stores the accumulated syndrome bits in a buffer and transmits them incrementally upon the decoder's request over a feedback channel, as explained in Section 4.2.3. Note that contemporary wireless (implantable) sensors--including capsule endoscopes--support bidirectional communication [33, 37, 38]. That is, a feedback channel from the decoder to the encoder is a viable solution for the pursued applications. The effect of the employed feedback channel on the decoding delay, and in turn on the buffer requirements at the encoder of a wireless capsule endoscope, is studied in Section 5.3.
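The syndrome formation and mod-2 accumulation can be illustrated with a toy parity-check matrix (a hypothetical 3 × 7 matrix chosen for illustration only; real LDPCA codes are far longer and sparse):

```python
import numpy as np

# Toy (M = 7, k = 4) code: H is (M - k) x M = 3 x 7.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def syndrome(b: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Slepian-Wolf binning: compress bit-plane b to s = b H^T (mod 2)."""
    return (b @ H.T) % 2

def accumulate(s: np.ndarray) -> np.ndarray:
    """Mod-2 accumulator: alpha_i = s_1 xor ... xor s_i, which enables
    rate-adaptive puncturing of the transmitted increments."""
    return np.cumsum(s) % 2

b = np.array([1, 0, 1, 1, 0, 0, 1], dtype=np.uint8)  # one bit-plane
s = syndrome(b, H)        # 3 syndrome bits represent the 7-bit plane
alpha = accumulate(s)     # accumulated syndrome tuple
```

In this toy example the 7-bit plane is binned into 3 syndrome bits, i.e., a compression rate of (*M* - *k*)/*M*; the decoder recovers **b** from **α** with the help of the side information.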

Note that the focus of this study is to successfully target various lightweight applications by improving the compression efficiency of Wyner-Ziv video coding while maintaining a low computational cost at the encoder. Hence, in order to accurately evaluate the impact of the proposed techniques on the RD performance, the proposed system employs the LDPCA codes which are also used in the state-of-the-art codecs of [13, 26]. Observe that, for distributed compression under a noiseless transmission scenario, the syndrome-based Slepian-Wolf scheme [15] is optimal, since it can achieve the information-theoretic bound with the shortest channel codeword length [23]. Nevertheless, in order to address distributed joint source-channel coding (DJSCC) in a noisy transmission scenario, the parity-based [23] Slepian-Wolf scheme needs to be deployed. In the latter, parity-check bits are employed to indicate the Slepian-Wolf bins, thereby achieving equivalent Slepian-Wolf compression performance at the cost of an increased codeword length [23].

It is important to mention that, in contrast to other hash-driven Wyner-Ziv schemes operating in the transform domain, e.g., [12, 31], the presented Wyner-Ziv encoder encodes the entire original Wyner-Ziv frame, instead of coding the difference between the original frame and the reconstructed hash. The motivation for this decision is twofold. The first reason stems from the nature of the hash. Namely, coding the difference between the Wyner-Ziv frame and the reconstructed hash would require decoding and interpolating the hash at the encoder, an operation which is computationally demanding and would pose an additional strain on the encoder's memory demands. Second, compressing the entire Wyner-Ziv frame with linear channel codes enables the extension of the scheme to the DJSCC case [23], thereby providing error resilience for the entire Wyner-Ziv frame if a parity-based Slepian-Wolf approach is followed.

### 4.2. The decoder

The main components of the decoding process of the presented DVC architecture, namely hash decoding, side information generation and Wyner-Ziv decoding, are treated separately below. The decoder first conventionally intra decodes the key frame bit stream and stores the reconstructed frame in the reference frame buffer. In the following phase, the hash is handled, as detailed next.

#### 4.2.1. Hash decoding and reconstruction

The hash bit stream is decoded with the appropriate conventional intra codec. The reconstructed hash is then upscaled to the resolution of the original Wyner-Ziv frames. The ideal upscaling process consists of upsampling followed by ideal interpolation filtering. The ideal interpolation filter is a perfect low-pass filter with gain *d* and cut-off frequency *π*/*d* without a transition band [39]. However, such a filter corresponds to an infinite-length impulse response *h*_{ideal}, to be precise, a sinc function *h*_{ideal}(*n*) = sinc(*n*/*d*) with *n* ∈ ℤ, which cannot be implemented in practice.

Therefore, our system employs a windowing method [39] to create a filter with finite impulse response *h*(*n*), namely

$$h(n) = h_{\mathrm{ideal}}(n) \cdot z(n), \quad |n| < 3d, \qquad (1)$$

where the window function *z*(*n*) corresponds to samples taken from the central lobe of a sinc function, that is

$$z(n) = \mathrm{sinc}\!\left(\frac{n}{3d}\right), \quad |n| < 3d. \qquad (2)$$

Such an interpolation filter is known in the literature as a Lanczos3 filter [35]. Following [40], the resulting filter taps are normalized to obtain unit DC gain, while the input samples are preserved by the upscaling process since *h*(0) = 1 and *h*(*nd*) = 0 for *n* ≠ 0.
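Under the definitions of Eqs. (1)-(2), the filter can be sketched as follows (an illustrative implementation; the per-polyphase normalization to unit DC gain follows the spirit of [40], and `lanczos3_kernel` is a hypothetical helper name):

```python
import numpy as np

def lanczos3_kernel(d: int) -> np.ndarray:
    """Windowed-sinc interpolation filter of Eqs. (1)-(2) for upscaling
    by an integer factor d (Lanczos3)."""
    n = np.arange(-3 * d + 1, 3 * d)            # support |n| < 3d
    # np.sinc(x) = sin(pi x) / (pi x): ideal sinc times the Lanczos window.
    h = np.sinc(n / d) * np.sinc(n / (3 * d))
    # Normalize each polyphase component to unit DC gain. The phase that
    # contains n = 0 is (numerically) the unit impulse, so input samples
    # pass through the upscaler unchanged.
    for phase in range(d):
        taps = (n % d) == phase
        h[taps] /= h[taps].sum()
    return h
```

For *d* = 2 the kernel has 11 taps; the center tap is 1 and each polyphase component sums to 1, so low-resolution samples are copied through while intermediate samples are interpolated.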

#### 4.2.2. Side information generation

After the hash has been restored to the same frame size as the original Wyner-Ziv frames, it is used to perform decoder-side motion estimation. The quality of the side information is an important factor in the overall compression performance of any Wyner-Ziv codec, since the higher its quality, the less channel code rate is required for Wyner-Ziv decoding. The proposed side information generation algorithm performs bidirectional overlapped block motion estimation (OBME) using the available hash information and a past and a future reconstructed Wyner-Ziv and/or key frame as references.

Temporal prediction is carried out using a hierarchical frame organization, similar to the prediction structures used in [5, 12, 26]. It is important to note that, in contrast to our previous study [30], in which motion estimation was based on bit-planes, this study follows a different approach regarding both the nature of the hash and the block matching process. Before motion estimation is initiated, the reference frames are preprocessed. Specifically, to improve the consistency of the resulting motion vectors, the reference frames are first subjected to the same downsampling and interpolation operations as the hash.

Figure 3 contains a graphical representation of the motion estimation algorithm. To offer a clear presentation of the proposed algorithm, we introduce the following notation. Let $\tilde{W}$ be the reconstructed hash of a Wyner-Ziv frame, let *Y* be the side information and let $\tilde{R}_{k}$, *k* ∈ {0, 1}, be the preprocessed versions of the reference frames *R*_{k}, respectively. Also, denote by $Y_{\mathbf{m}}$, $R_{k,\mathbf{m}}$, $\tilde{W}_{\mathbf{m}}$, $\tilde{R}_{k,\mathbf{m}}$ the blocks of size *B* × *B* pixels with top-left coordinates **m** = (*m*_{1}, *m*_{2}) in *Y*, *R*_{k}, $\tilde{W}$ and $\tilde{R}_{k}$, respectively. Finally, let $Y_{\mathbf{m}}(\mathbf{p})$ designate the sample at position **p** = (*p*_{1}, *p*_{2}) in the block $Y_{\mathbf{m}}$.

At the outset, the available hash frame is divided into overlapping spatial blocks $\tilde{W}_{\mathbf{u}}$, with top-left coordinates **u** = (*u*_{1}, *u*_{2}), using an overlapping step size *ε* ∈ ℤ_{+}, 1 ≤ *ε* ≤ *B*. For each overlapping block $\tilde{W}_{\mathbf{u}}$, the best matching block within a specified search range *ρ* is found in the reference frames $\tilde{R}_{k}$. In contrast to our earlier study [30], the proposed algorithm retains the motion vector **v** = (*v*_{1}, *v*_{2}), -*ρ* < *v*_{1}, *v*_{2} ≤ *ρ*, which minimizes the sum of absolute differences (SAD) between $\tilde{W}_{\mathbf{u}}$ and a block $\tilde{R}_{k,\mathbf{u}-\mathbf{v}}$, in other words

$$\mathbf{v} = \arg\min_{\mathbf{v}} \sum_{\mathbf{p}} \left|\tilde{W}_{\mathbf{u}}(\mathbf{p}) - \tilde{R}_{k,\mathbf{u}-\mathbf{v}}(\mathbf{p})\right|, \qquad (3)$$

where **p** visits all the co-located pixel positions in the blocks $\tilde{W}_{\mathbf{u}}$ and $\tilde{R}_{k,\mathbf{u}-\mathbf{v}}$. The motion search is executed at integer-pel accuracy and the obtained motion field is extrapolated to the original reference frames *R*_{k}. By construction, every pixel *Y*(**p**), **p** = (*p*_{1}, *p*_{2}), in the side information frame *Y* is located inside a number of overlapping blocks $Y_{\mathbf{u}_{n}}$ with **u**_{n} = (*u*_{n,1}, *u*_{n,2}). After the execution of the OBME, a temporal predictor block $R_{k,\mathbf{u}_{n}}$ has been identified in one reference frame for every block $Y_{\mathbf{u}_{n}}$. As a result, each pixel *Y*(**p**) in the side information frame has a number of associated temporal predictors $r_{k,\mathbf{u}_{n}}$ in the blocks $R_{k,\mathbf{u}_{n}}$.
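The integer-pel full search of Eq. (3) for a single overlapping block can be sketched as follows (illustrative helper names; border handling is simplified and square blocks are assumed):

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def best_match(hash_block: np.ndarray, ref: np.ndarray, u: tuple, rho: int):
    """Full search at integer-pel accuracy: return the motion vector v,
    with -rho < v1, v2 <= rho, minimizing the SAD of Eq. (3) between the
    hash block at u and the candidate reference block at u - v."""
    B = hash_block.shape[0]
    best_v, best_cost = (0, 0), None
    for v1 in range(-rho + 1, rho + 1):
        for v2 in range(-rho + 1, rho + 1):
            r1, r2 = u[0] - v1, u[1] - v2          # candidate block at u - v
            if 0 <= r1 <= ref.shape[0] - B and 0 <= r2 <= ref.shape[1] - B:
                cost = sad(hash_block, ref[r1:r1 + B, r2:r2 + B])
                if best_cost is None or cost < best_cost:
                    best_cost, best_v = cost, (v1, v2)
    return best_v, best_cost
```

In the codec this search is run against both preprocessed reference frames $\tilde{R}_0$ and $\tilde{R}_1$ for every overlapping hash block, and the per-block minimum SAD is also retained for the reliability screening described below.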

However, some temporal predictors may stem from rather unreliable motion vectors. Especially when the input sequence was recorded at a low frame rate or when the motion content is highly irregular, as might be the case in endoscopic sequences, temporal prediction is not the preferred method for all blocks at all times. Therefore, to avoid quality degradation of the side information due to untrustworthy predictors, all obtained motion vectors are subjected to a reliability screening. Namely, when the SAD based on which the motion vector associated with temporal predictor $r_{k,\mathbf{u}_{n}}$ was determined is not smaller than a threshold *T*, the motion vector and the associated temporal predictor are labeled as *unreliable*. In this case, the temporal predictor for the side information pixel *Y*(**p**) is replaced by the co-located pixel in the upsampled hash frame, that is, $\tilde{W}(\mathbf{p})$. In other words, when motion estimation is considered not to be trusted, the hash itself is assumed to convey more dependable information. This feature of OBME is referred to as hash-predictor selection (HPS).

During the motion compensation process, the obtained predictors per pixel, whether temporal predictors or taken from the upsampled hash, are combined to perform multi-hypothesis pixel-based prediction. Specifically, every side information pixel *Y*(**p**) is calculated as the mean value of the predictor values $g_{k,\mathbf{u}_{n}}$:

$$Y(\mathbf{p}) = \frac{1}{N_{\mathbf{p}}} \sum_{\mathbf{u}_{n}} g_{k,\mathbf{u}_{n}}, \qquad (4)$$

where $N_{\mathbf{p}}$ denotes the number of predictors for pixel *Y*(**p**), and $g_{k,\mathbf{u}_{n}} = r_{k,\mathbf{u}_{n}}$ when $r_{k,\mathbf{u}_{n}}$ is *reliable* or $g_{k,\mathbf{u}_{n}} = \tilde{W}(\mathbf{p})$ when $r_{k,\mathbf{u}_{n}}$ is *unreliable*. The derived multi-hypothesis motion field is employed in an analogous manner to estimate the chroma components of the side information frame from the chroma components of the reference frames *R*_{k} or the upsampled hash.
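The per-pixel fusion of Eq. (4) combined with HPS can be sketched for a single pixel as follows (hypothetical helper; in the codec the predictors and SAD values come from the OBME stage):

```python
def fuse_predictors(temporal_preds, sads, hash_pixel, T):
    """Multi-hypothesis fusion of one side information pixel, Eq. (4).

    temporal_preds : predictor values r taken from the reference frames
    sads           : SAD of the block match that produced each predictor
    hash_pixel     : co-located pixel of the upsampled hash
    T              : reliability threshold of the HPS screening
    """
    # HPS: a predictor whose block SAD reached T is deemed unreliable
    # and replaced by the co-located hash pixel.
    g = [r if s < T else hash_pixel for r, s in zip(temporal_preds, sads)]
    return sum(g) / len(g)  # mean over the N_p hypotheses
```

For example, with predictors [10, 20, 30], block SADs [5, 100, 5], hash pixel 14 and *T* = 50, the middle hypothesis is replaced by the hash pixel and the fused value is the mean of [10, 14, 30].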

#### 4.2.3. Wyner-Ziv decoding

The derived motion-compensated frame is first DCT transformed to serve as side information *Y* for decoding the Wyner-Ziv bit stream in the transform domain. Then, online transform-domain correlation channel estimation [7] is carried out to model the correlation channel between the side information *Y* and the original Wyner-Ziv frame samples *W* in the DCT domain. As in [7], the correlation is expressed by an additive input-dependent noise model, *W* = *Y* + *N*, where the correlation noise *N* ~ ℒ(0, *σ*(*y*)) is zero-mean Laplacian with standard deviation *σ*(*y*), which varies depending on the realization *y* of the channel input, i.e., the side information, namely [7],

$$f_{N|Y}(n|y) = \frac{1}{\sigma(y)\sqrt{2}}\, e^{-\frac{\sqrt{2}|n|}{\sigma(y)}}. \qquad (5)$$

Thereafter, the estimated correlation channel statistics per coded DCT band bit-plane are interpreted into soft estimates, i.e., log-likelihood ratios (LLRs). These LLRs, which provide *a priori* information about the probability of each bit to be 0 or 1, are passed to the variable nodes of the LDPCA decoder. Then, the message passing algorithm [41] is used for iterative LDPC decoding, in which the received syndrome bits correspond to the check nodes on the bipartite graph.
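A minimal sketch of how such soft estimates can be derived from the Laplacian model of Eq. (5): the probability mass that the unknown coefficient falls into bins labeled 0 versus bins labeled 1 is obtained by integrating the model around *y*, and the LLR is the log of their ratio. The bin lists are hypothetical inputs here; in the codec they follow from the QM and the already decoded bit-planes.

```python
import numpy as np

def laplace_cdf(n, sigma):
    """CDF of the zero-mean Laplacian of Eq. (5) with std. deviation sigma."""
    b = sigma / np.sqrt(2)                      # Laplacian scale parameter
    return np.where(n < 0, 0.5 * np.exp(n / b), 1.0 - 0.5 * np.exp(-n / b))

def llr_for_bit(y, sigma, bins_bit0, bins_bit1):
    """A priori LLR of a bit-plane bit: integrate the correlation model
    W = Y + N over the coefficient intervals (lo, hi) that map to bit 0
    versus bit 1."""
    p0 = sum(laplace_cdf(hi - y, sigma) - laplace_cdf(lo - y, sigma)
             for lo, hi in bins_bit0)
    p1 = sum(laplace_cdf(hi - y, sigma) - laplace_cdf(lo - y, sigma)
             for lo, hi in bins_bit1)
    return float(np.log(p0 / p1))
```

A positive LLR means bit 0 is more likely given the side information value *y*; the LDPCA decoder combines these priors with the received syndrome bits.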

Notice that the scheme follows a layered Wyner-Ziv coding approach to provide quality scalability without experiencing a performance loss [25]. Namely, in the formulation of the LLRs, information given by the side information and the already decoded source bit-planes is taken into account. In detail, let *b*_{l} denote a bit of the *l*th bit-plane of the source and *b*_{1}, ..., *b*_{l-1} be the already decoded bits in the previous *l* - 1 bit-planes. Then the estimated LLR at the corresponding variable node of the LDPCA decoder is given by

$$\mathrm{LLR} = \log\frac{p(b_{l}=0 \mid y, b_{1},\ldots,b_{l-1})}{p(b_{l}=1 \mid y, b_{1},\ldots,b_{l-1})} = \log\frac{p(b_{1},\ldots,b_{l-1}, b_{l}=0 \mid y)}{p(b_{1},\ldots,b_{l-1}, b_{l}=1 \mid y)}, \qquad (6)$$

where the equality in (6) stems from *p*(*b*_{l}|*y*, *b*_{1}, ..., *b*_{l-1}) = *p*(*b*_{1}, ..., *b*_{l-1}, *b*_{l}|*y*)/*p*(*b*_{1}, ..., *b*_{l-1}|*y*). Hence, in (6) the numerator and the denominator are calculated by integrating the conditional probability density function of the correlation channel, i.e., *f*_{X|Y}(*x*|*y*), over the quantization bins indexed by *b*_{1}, ..., *b*_{l}.

Remark that the LDPCA decoder achieves various rates by altering the decoding graph upon reception of an additional increment of the accumulated syndrome [36]. Initially, the decoder receives a short syndrome corresponding to an aggressive code rate and attempts to decode [36]. If decoding fails, the encoder receives a request to augment the previously received syndrome with extra bits. The process repeats until the syndrome suffices for successful decoding.

Once all the *L* bit-planes of a DCT band of a Wyner-Ziv frame are LDPCA decoded, the obtained *L* binary *M*-tuples **b**_{1}, **b**_{2}, ..., **b**_{L} are combined to form the decoded quantization indices of the coefficients of the band. Subsequently, the decoded quantization indices are fed to the reconstruction module, which performs inverse quantization using the side information and the correlation channel statistics. Since the mean square error distortion measure is employed, the optimal reconstruction of a Wyner-Ziv coefficient *w* is obtained as the centroid of the random variable *W* given the corresponding side information coefficient *y* and the decoded quantization index *q* [25]. Namely,

$$E[w \mid y, q] = \frac{\int_{q_{L}}^{q_{H}} w\, f_{W|Y}(w|y)\, dw}{\int_{q_{L}}^{q_{H}} f_{W|Y}(w|y)\, dw}, \qquad (7)$$

where *q*_{L}, *q*_{H} denote the lower and upper bounds of the quantization bin *q*. Finally, the inverse DCT provides the reconstructed frame *Ŵ* in the spatial domain. The reconstructed frame is then ready for display and is stored in the reference frame buffer, serving as a reference for future temporal prediction.
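The centroid reconstruction of Eq. (7) can be sketched numerically, assuming the Laplacian correlation model of Eq. (5) (a grid integration for illustration; closed-form expressions for the truncated Laplacian are typically used in practice, and `reconstruct` is a hypothetical helper name):

```python
import numpy as np

def reconstruct(y, sigma, q_lo, q_hi, n_steps=1000):
    """Minimum-MSE reconstruction of a Wyner-Ziv coefficient, Eq. (7):
    the centroid of f(w|y) over the decoded bin [q_lo, q_hi]."""
    w = np.linspace(q_lo, q_hi, n_steps)
    b = sigma / np.sqrt(2)                   # Laplacian scale parameter
    f = np.exp(-np.abs(w - y) / b)           # unnormalized Laplacian pdf
    # With a uniform grid, the normalization constants cancel in the ratio.
    return float((w * f).sum() / f.sum())
```

When the side information coefficient *y* falls inside the decoded bin, the reconstruction stays close to *y*; when it falls outside, the centroid is pulled toward the nearest bin boundary, which is precisely the clipping behavior that makes centroid reconstruction outperform midpoint reconstruction.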