Beyond 100 Gbit/s Pipeline Decoders for Spatially Coupled LDPC Codes
EURASIP Journal on Wireless Communications and Networking volume 2022, Article number: 90 (2022)
Abstract
Low-density parity-check (LDPC) codes are a well-established class of forward error correction codes that provide excellent error correction performance for large code block sizes. However, for throughputs toward 1 Tbit/s, as expected for B5G systems, state-of-the-art LDPC soft decoders are restricted to short code block sizes of several hundred to a few thousand bits due to routing congestion challenges, limiting the overall communications performance of the transmission system. Spatially coupled LDPC (SC-LDPC) codes and respective sliding window decoding methods show the potential to overcome these block size restrictions. However, in contrast to conventional LDPC codes, little literature exists on the efficient hardware implementation of respective high-throughput decoders. In this work, we present the first in-depth investigation on the implementation of SC-LDPC decoders for throughputs beyond 100 Gbit/s. For an \(N=51328\), \(R=0.8\) terminated SC-LDPC code with sub-block size \(c=640\) and coupling width \(m_s=1\), we explore various design trade-offs, including row- and column-wise decoding, non-overlapping and overlapping window scheduling, and processor pipelining. To the best of our knowledge, this is the first description of a column-wise SC-LDPC decoding architecture in the literature. We complement the algorithmic investigation with the virtual silicon implementation of all presented decoders in a 22 nm FD-SOI technology.
1 Introduction
Forward error correction (FEC) is a key technology of modern communication systems. The capability of detecting and correcting transmission errors has largely contributed to the progress in data rates, round-trip latencies, and number of connected devices in cellular mobile communication systems. And the demand does not cease. Beyond-5G systems target data rates toward 1 Tbit/s, posing new and fundamental challenges to the design and implementation of FEC systems [1]. Low-density parity-check (LDPC) codes are among the most powerful and widely used FEC schemes. Since their discovery by Gallager in 1962 [2], innovations in both the code and the decoder design have opened this code class for a wide range of practical applications. Today, LDPC codes are part of many modern communication standards, like DVB-S2X, Wi-Fi, and 3GPP 5G-NR. However, achieving 1–2 orders of magnitude higher throughputs than today's fastest standards without sacrificing performance remains a great challenge, mainly due to area, power, and power density restrictions in the decoder implementation.
LDPC decoding is based on an iterative exchange of messages between the variable nodes and the check nodes in the Tanner graph of the code. To achieve good error correction performance, the Tanner graph must be large, of low density, and without short cycles, causing the graph to be highly unstructured and without distinct regularity. These properties, however, challenge an efficient and high-throughput decoder implementation, which requires locality to minimize the cost of data transfers and regularity to achieve large parallelism [3]. This fact manifests in particular in an increased wiring complexity and routing congestion for large code block sizes, resulting in low area utilization, poor timing, and increased power consumption of the respective decoding hardware. As a consequence, state-of-the-art high-throughput decoders are limited to small code block sizes of 1000–2000 bits, e.g., [4,5,6], which limits the overall performance of the FEC system.
Spatial coupling of LDPC codes is a promising approach to overcome these block size limitations. Spatially coupled LDPC (SC-LDPC) codes, initially introduced as LDPC convolutional codes (LDPC-CCs) [7], are constructed from a set of small LDPC codes that are coupled together to form a chain of local subcodes. In this way, a code of almost any length can be generated. Similarly, a respective decoder can be constructed by chaining together multiple sub-decoders, each of which operates on a much smaller sub-block. In this way, SC-LDPC codes have the potential to combine good error correction performance with high-throughput decoding. This property is of particular importance in the context of beyond-5G/6G high-THz communications systems that aim at data rates of 1 Tbit/s. Respective use cases show significant variations in bit error rate (BER) requirements depending on whether they fall into infrastructure or end-user domains. For instance, the IEEE 802.15.3d standard sets very stringent BER requirements of \(10^{-12}\) for infrastructure-type use cases such as wireless backhaul/fronthaul and data centers, in contrast to a relatively relaxed BER requirement of \(10^{-6}\) for close-proximity communications with applications in personal-area networks [8]. SC-LDPC codes can cover this broad range of use cases as they exhibit good performance in both the waterfall and the error floor region of the BER curve [9].
In contrast to classical LDPC block codes, considerably less research exists on the implementation of efficient, high-throughput SC-LDPC decoders. Essentially, we can distinguish three candidate architectures in the literature:

The row-layered pipeline decoder (RLPD) [7],

The row-compact pipeline decoder (RCPD) [10] and

The full-parallel window decoder (FPWD) [11].
The RLPD achieves fast convergence but supposedly suffers from a long initial decoding delay and high storage requirements [10], whereas the RCPD and the FPWD exhibit a relatively smaller decoding delay and lower storage requirements but also a slower convergence [10, 12]. However, these characterizations provide little information about the efficiency of the respective decoding hardware, which is evaluated according to implementation metrics like achievable frequency, latency, area, and power consumption. In particular, for data-transfer-dominated circuits with complex signal routing, like highly parallel LDPC decoders, it is difficult to fully assess the implications of algorithmic-level design decisions for the implementation efficiency.
In this work, we therefore provide the first comparative investigation of various high-throughput SC-LDPC decoding architectures down to the silicon level. Our investigation focuses on the \(N=51328\), \(R=0.8\) terminated EPIC SC-LDPC code with sub-block size \(c=640\) and coupling width \(m_s=1\) [13]. We explore various design trade-offs, including row- and column-wise decoding, non-overlapping and overlapping window scheduling, and processor pipelining. To the best of our knowledge, we present the first description of a column-wise SC-LDPC decoding architecture in the literature.
2 Preliminaries
2.1 Notation
In the following, we use italic letters, e.g., x, for the representation of scalars, bold lowercase letters, e.g., \({\mathbf {x}}\), for the representation of vectors, and bold uppercase letters, e.g., \({\mathbf {X}}\), for the representation of matrices. \({\mathbb {N}}_0\) denotes the set of natural numbers including the zero element, \({\mathbb {Z}}\) the set of integers, \({\mathbb {R}}\) the set of real numbers, and \({\mathbb {F}}_2\) the Galois field of order 2. \(\log _2(x)\) denotes the base-2 logarithm of variable x, and \(\lceil x \rceil\) is the least integer greater than or equal to x.
2.2 System model
We define a system for the continuous transmission of data blocks, as depicted in Fig. 1. An information source generates a stream, or sequence, of information blocks \({\mathbf {u}}_{[-\infty ,\infty ]} = [{\mathbf {u}}_{-\infty }, ..., {\mathbf {u}}_{\infty }]\), with each information block being composed of b bits, i.e., \({\mathbf {u}}_t = [u_t(1), ..., u_t(b)], t \in {\mathbb {Z}}\) and \(u_t{(\cdot )} \in {\mathbb {F}}_2\). An encoder maps this sequence of information blocks onto a sequence of code blocks \({\mathbf {v}}_{[-\infty ,\infty ]} = [{\mathbf {v}}_{-\infty }, ..., {\mathbf {v}}_{\infty }]\) with \({\mathbf {v}}_t = [v_t(1), ..., v_t(c)]\), \(c>b\) and \(v_t{(\cdot )} \in {\mathbb {F}}_2\). After modulation, transmission over a noisy channel, and subsequent bit-level demodulation, the decoder receives a sequence of blocks \({\mathbf {y}}_{[-\infty ,\infty ]}\), each comprising c log-likelihood ratio (LLR) values, i.e., \({\mathbf {y}}_t = [y_t(1), ..., y_t(c)]\) and \(y_t{(\cdot )} \in {\mathbb {R}}\). Finally, a decoder provides an estimate \(\mathbf {{\hat{u}}}_{[-\infty ,\infty ]}\) of the initially transmitted information, with \(\mathbf {{\hat{u}}}_t = [{\hat{u}}_t(1), ..., {\hat{u}}_t(b)]\) and \({\hat{u}}_t{(\cdot )} \in {\mathbb {F}}_2\).
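To make the system model concrete, the following sketch generates the LLR block \({\mathbf {y}}_t\) for one code block, assuming BPSK modulation (\(0 \mapsto +1\), \(1 \mapsto -1\)) and an AWGN channel with noise standard deviation \(\sigma\). The modulation, channel model, and seed are not fixed by the paper; they are illustrative assumptions.

```python
import numpy as np

def channel_llrs(v, sigma, seed=0):
    """Map one code block v_t in F_2^c to the c channel LLRs y_t.

    Assumes BPSK (0 -> +1, 1 -> -1) over an AWGN channel with noise
    standard deviation sigma; the per-bit LLR is then 2*r / sigma^2.
    """
    rng = np.random.default_rng(seed)
    x = 1.0 - 2.0 * np.asarray(v, dtype=float)     # BPSK mapping
    r = x + rng.normal(0.0, sigma, size=x.shape)   # noisy channel observation
    return 2.0 * r / sigma**2                      # LLR per code bit

# all-zero code block with sub-block size c = 640
llrs = channel_llrs(np.zeros(640, dtype=int), sigma=0.5)
```

Positive LLRs indicate a likely 0-bit, so for the all-zero block the values are predominantly positive.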
2.3 SC-LDPC codes
We define an \(R=b/c\) SC-LDPC code as the set of code sequences \({\mathbf {v}}_{[-\infty ,\infty ]}\) satisfying

$$\begin{aligned} {\mathbf {v}}_{[-\infty ,\infty ]} \cdot {\mathbf {H}}_{[-\infty ,\infty ]}^{\mathsf {T}} = {\mathbf {0}}, \end{aligned}$$
(1)

where

$$\begin{aligned} {\mathbf {H}}_{[-\infty ,\infty ]} = \begin{bmatrix} \ddots & \ddots & & \\ & {\mathbf {H}}_{m_s}(t) & \cdots & {\mathbf {H}}_{0}(t) & \\ & & {\mathbf {H}}_{m_s}(t+1) & \cdots & {\mathbf {H}}_{0}(t+1) \\ & & & \ddots & \ddots \end{bmatrix} \end{aligned}$$

is also denoted as the parity-check matrix or syndrome former matrix of the SC-LDPC code. The elements \({\mathbf {H}}_{i}(t) \ne {\mathbf {0}}\) are binary, full-rank submatrices of dimension \((c-b) \times c\). From Eq. 1, it follows that each \({\mathbf {v}}_t\) satisfies

$$\begin{aligned} \sum _{i=0}^{m_s} {\mathbf {v}}_{t-i} \cdot {\mathbf {H}}_{i}(t)^{\mathsf {T}} = {\mathbf {0}} \end{aligned}$$
and is thus coupled to the \(m_s\) preceding and succeeding code blocks. Consequently, \(m_s\) is also referred to as the coupling width or the memory of the code, and \((m_s+1)\) as the constraint length.
We define the parity-check matrix of a terminated SC-LDPC code of finite length L as \({\mathbf {H}}_{[1,L]}\). For simplification, we consider in this work only time-invariant codes, i.e., \({\mathbf {H}}_{i}(t) = {\mathbf {H}}_{i}(t')\) for \(i=0, \ldots , m_\mathrm {s}\) and \(t,t' \in {\mathbb {Z}}\). For a detailed description of SC-LDPC codes, we refer to [7, 10] and [9].
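The diagonal band structure of \({\mathbf {H}}_{[1,L]}\) for a time-invariant code with \(m_s=1\) is easy to assemble programmatically from the two component matrices. The sketch below illustrates the construction; the toy matrices used to exercise it are illustrative and not the EPIC code.

```python
import numpy as np

def build_H(H0, H1, L):
    """Assemble H_[1,L] of a terminated, time-invariant SC-LDPC code with
    m_s = 1: column block t contains H_0 (from row layer t) stacked above
    H_1 (from row layer t+1), giving (L+1)*(c-b) rows and L*c columns."""
    mb, c = H0.shape  # mb = c - b check rows per layer
    H = np.zeros(((L + 1) * mb, L * c), dtype=np.uint8)
    for t in range(L):
        H[t * mb:(t + 1) * mb, t * c:(t + 1) * c] = H0       # diagonal block
        H[(t + 1) * mb:(t + 2) * mb, t * c:(t + 1) * c] = H1  # coupling block
    return H
```

The extra \((c-b)\) check rows at the bottom realize the termination of the chain.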
2.4 Window decoding of SCLDPC codes
SC-LDPC codes are decoded with message-passing algorithms, i.e., an exchange of probabilistic messages between variable and check nodes in the corresponding Tanner graph of the code. The Tanner graph can be directly derived from \({\mathbf {H}}_{[1,L]}\), and the decoding can be performed similarly to a conventional LDPC block code by repeatedly updating the respective variable nodes and check nodes. This block scheduling, however, does not take advantage of the limited memory of the code and the consequent diagonal band structure of \({\mathbf {H}}_{[1,L]}\). Instead, it is more efficient to use a sliding window scheduling. Here, the decoding is not performed simultaneously on the full graph corresponding to \({\mathbf {H}}_{[1,L]}\), but time-delayed on multiple subgraphs, each defined by a decoding window \({\mathcal {W}}(t)\) of length W. The decoding window can be considered as the Tanner graph corresponding to a submatrix of \({\mathbf {H}}_{[1,L]}\) comprising W block rows and \(W+m_s\) block columns. Throughout the decoding process, the window traverses \({\mathbf {H}}_{[1,L]}\) from the upper left to the lower right, moving by one sub-block each time step t. The advantage of a window decoding schedule is a significantly lower structural latency than that of a block schedule, as the decoder can start the decoding right after receiving the first sub-block \({\mathbf {y}}_1\). Furthermore, since the window length is typically much smaller than the block length, i.e., \(W \ll L\), the memory requirements are far lower than those of a conventional LDPC block decoder of similar length.
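A minimal software model of the sliding window schedule might look as follows. Here `decode_window` stands for an arbitrary message-passing routine on the W buffered sub-blocks; it is a placeholder, not the decoder architecture proposed later in the paper.

```python
from collections import deque

def window_decode(rx_blocks, W, decode_window):
    """Sliding-window schedule sketch: a window of W received sub-blocks is
    decoded, the oldest (target) sub-block is emitted, and the window then
    shifts by one sub-block per time step."""
    window = deque(maxlen=W)
    decided = []
    for y_t in rx_blocks:                      # new sub-block enters the window
        window.append(y_t)
        if len(window) == W:
            updated = decode_window(list(window))
            window = deque(updated, maxlen=W)
            decided.append(window.popleft())   # target sub-block leaves
    decided.extend(window)                     # flush the tail after block L
    return decided
```

With an identity `decode_window`, the schedule simply delays each sub-block by W time steps, which illustrates the structural latency of the window decoder.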
3 State-of-the-art high-throughput decoders
From an implementation perspective, a sliding window decoder is subject to similar constraints as a conventional LDPC block decoder. For a large window size W, which in this analogy corresponds to a large block size N, routing congestion limits the achievable throughput. Reducing W, on the other hand, reduces the performance of the decoder. This performance loss can be partially counterbalanced by increasing the number of iterations on the window, but this again increases the logic critical path.
In [7], the authors propose to split the window into multiple sub-windows, each comprising only a single row of \({\mathbf {H}}_{[1,L]}\), in the following denoted as a row layer. These sub-windows are then processed individually on multiple processors in parallel. Each processor performs only a single iteration involving \((c-b)\) check nodes and \((m_s+1)\cdot c\) variable nodes. Moving the window to the next position is similar to shifting the processed data from one processor to the next. In this way, the processors form a pipeline, and the architecture is hence commonly referred to as a pipeline decoder. The initially proposed architecture requires a spacing of \(m_s\) between simultaneously processed layers to avoid memory hazards. To better differentiate it from other decoding architectures, we will refer to it in the following as RLPD. An application-specific integrated circuit (ASIC) implementation of an RLPD was presented in [14]. For a (491,3,6) LDPC-CC, the decoder achieves a throughput of 2.37 Gbit/s in 90 nm CMOS technology. Note that the achievable throughput for an LDPC-CC is much lower due to the code's serial structure, which limits the achievable hardware parallelism.
The structural latency, which also impacts the memory requirements of the RLPD, is proportional to the number of processors (iterations) I and the constraint length \((m_s+1)\). This can become a significant issue for codes with a large \(m_s\) and many required iterations. In [10], the authors propose a so-called compact pipeline decoder with overlapping row layers. The overlapping regions reduce the decoding window's size and thus the decoder's latency and memory. In analogy to the RLPD, we refer to this architecture as RCPD. However, a drawback of this architecture is a slower convergence of the decoding. The reason for this is the simultaneous update of the variable nodes in the overlapping regions, similar to a flooding schedule. Via the size of the overlap, latency and convergence of the decoder can be traded against one another. In [15], the authors implemented an RCPD for a (215,3,6) LDPC-CC in a 65 nm technology. The decoder achieves a throughput of 7.72 Gbit/s.
Another approach for low decoding latency in combination with high throughput is the FPWD [11]. Like the RLPD and RCPD, the FPWD comprises multiple processors arranged in a pipeline. However, here the processors operate on small overlapping sub-windows of length at least \(W=m_s + 1\) using a flooding schedule. In the course of the decoding, the processors exchange extrinsic messages. An implementation of an FPWD was presented in [12]. The decoder achieves a throughput of 336 Gbit/s in a 22 nm technology and is currently the fastest SC-LDPC decoder in the literature.
4 Proposed pipeline decoder architectures
The decoders presented in the previous section, i.e., the RLPD, the RCPD, and the FPWD, are individual decoding solutions for SC-LDPC codes with the advantages and disadvantages discussed. In this section, we generalize several of the presented state-of-the-art concepts, like the layered/compact window schedules [10] and overlapping decoding windows [12], and combine them with row- and column-wise decoding algorithms. This systematic approach leads to a new, more abstract perspective on state-of-the-art decoders. For example, the FPWD can then be viewed as a particular column-wise decoder that uses a compact processing schedule.
Based on our methodology, we propose new pipeline decoder architectures, which are described in the following. The description follows a bottom-up approach, starting with the row and column processors that constitute the fundamental building blocks of the respective decoders. We then show how the different window schedules can be implemented by different interconnections of these elementary components. For simplicity, we assume in the following a fixed coupling width of \(m_s=1\). Furthermore, we focus on Min-Sum (MS) decoding. However, the presented concepts can also be applied to larger values of \(m_s\) and to other decoding algorithms.
4.1 Processors
For the processor design, we utilize the node-splitting concept [12] that was initially introduced for the check nodes in the FPWD and apply it to the variable nodes of a row-wise decoder and the check nodes of a column-wise decoder. Furthermore, the on-demand variable node activation (OVA) schedule for row-wise decoding is extended to column-wise decoding.
4.1.1 Node splitting
For \(m_s=1\), a row layer of \({\mathbf {H}}_{[1,L]}\) at time instance t is

$$\begin{aligned} {\mathbf {H}}_\mathrm {r}(t) = \left[ {\mathbf {H}}_1(t) \;\; {\mathbf {H}}_0(t) \right] \end{aligned}$$

and the corresponding subgraph is composed of \(N_\text {r} = 2\cdot c\) variable nodes and \(M_\text {r} = c-b\) check nodes. By transforming the SC-LDPC factor graph [16], the variable nodes corresponding to a row layer can be considered the partial variable nodes of a group of coupling nodes. For layer t, the respective coupling nodes connect the variable nodes corresponding to \({\mathbf {H}}_0(t)\) to the variable nodes corresponding to \({\mathbf {H}}_1(t+1)\) in layer \(t+1\). This principle is illustrated in Fig. 2. Similarly, we can merge the coupling nodes of layer t with the partial nodes of \({\mathbf {H}}_0(t)\) such that the c rightmost nodes of layer t are directly connected via a single edge to the c leftmost nodes of layer \(t+1\). Likewise, the c leftmost nodes of layer t are connected to the c rightmost nodes of layer \(t-1\). This concept can be similarly applied to a column of \({\mathbf {H}}_{[1,L]}\)

$$\begin{aligned} {\mathbf {H}}_\mathrm {c}(t) = \begin{bmatrix} {\mathbf {H}}_0(t) \\ {\mathbf {H}}_1(t+1) \end{bmatrix} \end{aligned}$$

corresponding to \(N_\text {c} = c\) variable and \(M_\text {c}=2\cdot (c-b)\) check nodes. For simplicity, we assume that the SC-LDPC code is \((d_v,d_c)\)-regular and that \(d_v({\mathbf {H}}_\mathrm {r}) = d_v/2\) and \(d_c({\mathbf {H}}_\mathrm {c}) = d_c/2\), i.e., the nodes are split exactly in half.
4.1.2 On-demand variable and check node activation
In the standard flooding schedule commonly used in the decoding of LDPC block codes, an iteration starts with the update of the check nodes (CNs) and ends with the update of the variable nodes (VNs). We denote this in the following as a CN–VN iteration, see Algorithm 1. It was shown in [10] that when employing the flooding schedule in a row-wise pipeline decoder, the processors do not make use of the most recent information in the processing pipeline. The authors, therefore, propose an OVA schedule that provides faster convergence for row-wise pipeline decoders. This OVA schedule resembles a sub-iteration of the layered decoding schedule for LDPC block codes [17]. Here, the iteration, in the following denoted as VN–VN iteration, starts with the calculation of new extrinsic messages for the check nodes (line 8, Alg. 1) and ends with the calculation of the a posteriori probability (APP) values (line 7, Alg. 1).
Similarly, we propose an on-demand check node activation (OCA) schedule for column-wise pipeline decoders. The corresponding CN–CN iteration starts at line 4 of Alg. 1 and ends at line 3. The first step in the CN–CN iteration is thus to compute the extrinsic messages for the VNs based on the respective signs and minimum values. Then, the VNs are updated, and new signs and minimum values are calculated. Between two decoding iterations, the extrinsic messages are represented using only the first and second absolute minima \(\text {min}_0\) and \(\text {min}_1\), the edge index of the first minimum \(\text {idx}_0\), and the \(d_c/2\) signs of the output messages.
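The compressed check node representation described above can be sketched in software as follows: between iterations only \(\text {min}_0\), \(\text {min}_1\), \(\text {idx}_0\), and the signs are kept, from which all outgoing min-sum extrinsic messages can be reconstructed. This is a behavioral model of the standard min-sum rule, not the RTL of the processors.

```python
def compress_cn(msgs):
    """Compress incoming VN-to-CN messages of one check node into the
    min-sum state: two smallest magnitudes, index of the smallest, signs."""
    mags = [abs(m) for m in msgs]
    idx0 = min(range(len(mags)), key=mags.__getitem__)
    min0 = mags[idx0]
    min1 = min(m for i, m in enumerate(mags) if i != idx0)
    signs = [m < 0 for m in msgs]
    return min0, min1, idx0, signs

def expand_cn(min0, min1, idx0, signs):
    """Reconstruct the extrinsic CN-to-VN messages: edge idx0 receives min1,
    all other edges min0; each output sign is the parity of all input signs
    excluding the edge's own sign."""
    total = sum(signs) % 2
    out = []
    for i, s in enumerate(signs):
        mag = min1 if i == idx0 else min0
        out.append(-mag if (total ^ s) else mag)
    return out
```

Storing only this compressed state is what keeps the register count of the column processor small.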
The node splitting requires an exchange of messages between neighboring windows/processors. Therefore, we extend the VN–VN and CN–CN iterations by additional steps for the message exchange.

Before the iteration on the respective subcode, the local variable nodes, respectively check nodes, are updated with the incoming messages from the neighboring windows/processors: for the row-wise decoder by adding the exchange messages to the respective APP values, for the column-wise decoder by sorting the exchange messages into the list of minima and updating the sign of the respective check node.

After the VN–VN/CN–CN iteration, the outgoing exchange messages for the neighboring windows/processors are generated: for the row-wise decoder by subtracting the incoming exchange message from the updated APP values, and for the column-wise decoder by multiplying the minimum value with the check node sign.
Algorithm 2 and Algorithm 3 summarize the resulting processing algorithms for a row and a column layer.
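For the row-wise processor, the two exchange steps wrapped around a VN-VN iteration can be modeled as below. Here `vnvn_iteration` is a placeholder for the inner iteration of Algorithm 2, and the pairing of the incoming messages with the left/right halves is kept generic for illustration.

```python
def exchange_wrap(app_left, app_right, in_left, in_right, vnvn_iteration):
    """Exchange-message handling around one VN-VN iteration (sketch)."""
    # before the iteration: add incoming exchange messages to the partial APPs
    app_left = [a + m for a, m in zip(app_left, in_left)]
    app_right = [a + m for a, m in zip(app_right, in_right)]
    # run the local VN-VN iteration on the updated APP values
    app_left, app_right = vnvn_iteration(app_left, app_right)
    # after the iteration: outgoing exchange = updated APP minus incoming
    out_left = [a - m for a, m in zip(app_left, in_left)]
    out_right = [a - m for a, m in zip(app_right, in_right)]
    return app_left, app_right, out_left, out_right
```

The subtraction ensures that only the extrinsic part of the updated APP values is forwarded to the neighboring processor.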
4.1.3 Processor architecture
The row and column processors apply full parallelism on the node and edge level. For a description of the respective variable node functional units (VFUs) and check node functional units (CFUs), we refer to [18] and [19]. The architecture of a full-parallel row processor that implements Algorithm 2 is depicted in Fig. 3a. The incoming right-to-left (R2L) and left-to-right (L2R) exchange messages are first added to the left and right local APP values, which are then passed to the VN–VN processor. The VN–VN processor comprises three stages, each implementing different parts of Algorithm 1: the VFU out (VFUo) stage implements line 8, the CFU stage lines 1–5, and the VFU in (VFUi) stage line 7. Finally, the received exchange messages are subtracted from the left and right APP values. Note that the respective channel values are implicitly contained in the APP values, which is a common practice in row-layered decoding architectures [19].
The column processor has a similar structure and is depicted in Fig. 3b. The update minima unit (UMU) stage updates the minima of all \(2\cdot (c-b)\) check nodes with the exchange messages according to Algorithm 4. Inside the CN–CN processor, the CFU out (CFUo) stage implements line 4, the VFU stage lines 6–9, and the CFU in (CFUi) stage lines 2–3 of Algorithm 1. In the split stage, the new exchange messages are generated by concatenating the sign bit from the check node computation with the new minimum value \(\text {min}_0\) (sign-magnitude representation). Note that no logic is required to perform this step.
The first and the last sub-block of a code block require special processing. If a processed sub-block is the first of a block, it must not interact with the previous sub-block in the pipeline; if it is the last, not with the subsequent one. This is realized with respective multiplexers that, depending on the case, pass a neutral element or the termination sequence to the respective VN–VN or CN–CN processor.
The outputs are stored in a register stage. Let \(Q_\text {chv}\), \(Q_\text {ext}\) and \(Q_\text {app}\) denote the bit-widths of the channel values, the extrinsic messages, and the APP values. The memory requirement for a row processor is then
and for the column processor
4.2 Window schedules
Using the described processors, the different window processing schedules can be implemented solely by the way the processors are interconnected.
4.2.1 Layered schedule
The layered schedule resembles the layer-by-layer schedule for LDPC block codes [17], except that multiple layers are processed in parallel. As already stated in the introduction of the RLPD, the spacing between active layers must be at least \(m_s\). Respective processing windows for 4 iterations of row- and column-layered decoding are illustrated in Fig. 4a.
Figure 5a shows the corresponding processor arrangement. Note that the processors can be interchangeably row or column processors. The APP and extrinsic messages are passed from one processor to the next via an intermediate pipeline stage. Consequently, the respective rows/columns are updated only every second clock cycle, which creates the alternating layer processing depicted in Fig. 4a. To better understand the exchange message handling, Fig. 5a also shows the respective pipeline diagram. If we consider a sub-block t currently processed inside the pipeline (shown at the bottom right of the pipeline diagram), it must receive an R2L message from sub-block \(t-1\) and an L2R message from sub-block \(t+1\). With the sub-blocks moving through the pipeline, the R2L output of a processor i is connected to its own R2L input and the L2R output to the L2R input of processor \(i+1\). In the first stage of the pipeline, L2R messages are not yet available and must be initialized with '0'.
It should be noted that this processor arrangement results in redundant computations in the row processors, as each processor updates its right and left APP values in every clock cycle, although the respective R2L and L2R messages are only updated every second clock cycle. To improve the efficiency of the architecture, we can directly connect the left APP input to the right APP output of the same processor and the right APP input to the left APP output of the previous processor. The R2L and L2R inputs are then set to zero. In this way, the processors share the APP values. This is not possible for the column-wise decoder, as there is no distinction between intrinsic and extrinsic messages in the list representation.
4.2.2 Compact schedule
In the compact schedule, the layers are overlapping (see Fig. 4b). The respective processor arrangement is shown in Fig. 5b. Due to the overlaps, the subcodes are updated in every clock cycle, and the stored data are passed from one processor to the next. This modification of the processing also affects the message exchange, which is again illustrated in the pipeline diagram. Accordingly, the R2L exchange messages must be propagated to the input of the same processor or, in the case of L2R messages, two stages ahead.
4.2.3 Multi-compact schedule
The decoding throughput for a given code is mainly determined by the achievable operating frequency and thus by the critical path of the processors. The critical path of both the row and column processors comprises a complete VN–VN or CN–CN iteration. Unrolled block decoders introduce additional pipeline stages in the decoding iterations to achieve higher clock frequencies [5, 18]. For pipelined SC-LDPC decoders, this is more challenging due to the data dependencies between neighboring sub-blocks. With the proposed split-node architecture, however, each processor operates on its local data set, which allows us to introduce additional pipeline stages in the processors and delay the message exchange. The corresponding arrangement for one additional pipeline stage in the processors is shown in Fig. 5c. The connections between processors for the L2R and R2L messages are again derived from the pipeline diagram. For each additional pipeline stage, the exchange messages must be delayed by one clock cycle. We denote this schedule in the following as multi-compact, or compact(i), where i denotes the number of pipeline stages inside the processors.
4.3 Input/output stages
The column-wise decoders require an input stage that generates the initial list for the first processor in the pipeline from the channel values. This input stage is equivalent to a CFUi stage of the CN–CN processor. The row-wise decoders do not require a designated input stage, as the right APP values can be initialized directly with the channel values and the left APP values with zero messages. Instead, the row-wise decoders require an output stage to combine the left and right partial APP values. For this purpose, the right APP messages of the last processor are stored in a register stage and then added to the current left APP messages.
4.4 Computational complexity analysis
We estimate the computational complexity of the decoders based on the number of edge messages that are computed and transferred in every clock cycle [20]. Let \(E_\text {row}\) denote the number of edges corresponding to a row layer and \(E_\text {col}\) the number of edges corresponding to a column layer of \({\mathbf {H}}_{[1,L]}\). For a \((d_v,d_c)\)-regular code, the number of edges can be expressed as \(E_\text {row} = d_c \cdot (c-b)\) and \(E_\text {col} = d_v \cdot c\), which is equivalent to the number of ones in \({\mathbf {H}}_\mathrm {r}(t)\) and \({\mathbf {H}}_\mathrm {c}(t)\), respectively. Since \({\mathbf {H}}_\mathrm {r}(t)\) and \({\mathbf {H}}_\mathrm {c}(t)\) of a time-invariant code are composed of the same submatrices \({\mathbf {H}}_1\) and \({\mathbf {H}}_0\), the number of edges in one row layer equals the number of edges in one column layer, i.e., \(E_\text {row} = E_\text {col}\). From this, it follows that the VN–VN processor and the CN–CN processor compute the same number of edge messages in each clock cycle and thus exhibit similar computational complexity. However, the node splitting introduces additional edges for the exchange of messages between processors. The resulting number of edges for the row processor is then \({\hat{E}}_\text {row} = E_\text {row} + 2 \cdot c\) and for the column processor \({\hat{E}}_\text {col} = E_\text {col} + 2 \cdot (c-b)\). Taking into account the number of iterations (processors) I, we can define the complexity of the row-wise decoders as \(I \cdot {\hat{E}}_\text {row}\) and that of the column-wise decoders as \(I \cdot {\hat{E}}_\text {col}\).
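Plugging in the parameters of the EPIC code used in Sect. 5 (\(d_v=4\), \(d_c=20\), \(b=512\), \(c=640\)) reproduces the per-cycle edge counts quoted in the discussion of the implementation results:

```python
def edge_counts(dv, dc, b, c):
    """Edge messages per clock cycle for a (dv, dc)-regular SC-LDPC code with
    sub-block size c and b information bits, including the additional edges
    that the node splitting introduces for the exchange messages."""
    E_row = dc * (c - b)              # edges of one row layer
    E_col = dv * c                    # edges of one column layer
    E_row_hat = E_row + 2 * c         # + exchange edges of the row processor
    E_col_hat = E_col + 2 * (c - b)   # + exchange edges of the column processor
    return E_row, E_col, E_row_hat, E_col_hat

# EPIC code parameters: d_v = 4, d_c = 20, b = 512, c = 640
E_row, E_col, E_row_hat, E_col_hat = edge_counts(4, 20, 512, 640)
```

As expected, \(E_\text {row} = E_\text {col}\), while the split column processor needs fewer additional exchange edges than the split row processor.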
5 Results and discussion
In this section, we present FEC performance and implementation results of the presented pipeline decoders. The performance evaluation and the implementation were performed for the EPIC SC-LDPC code with \(L=80\), \(m_s=1\), \(b=512\) and \(c=640\) [1]. The code is regular with \(d_v=4\) and \(d_c=20\). The code rate is \(R=0.798\), and the total block length is \(N=51238\) bit. The choice of code is based on the fact that the associated parity-check matrices and performance data are publicly available [13]. Furthermore, the sub-block size \(c=640\) allowed for a fully parallel decoder implementation, unlike the larger EPIC codes with \(c=960\) and \(c=1280\). For the latter, not all decoder architectures could successfully pass the placement and routing process in the integrated circuit design due to routing congestion. The smaller sub-block size results in a lower performance of the code, for the quantification of which we again refer to [13]. A detailed description of the experimental setup is provided in Sect. 7.
5.1 FEC performance
Figure 6a shows a side-by-side performance comparison of the row- and the column-layered decoder with 4, 8, 16, and 64 iterations (floating point). For better classification of the results, performance curves of the (64800, 51840) DVB-S2 code and the (648, 540) Wi-Fi code are also shown, each decoded with 200 iterations of Normalized MS (NMS) decoding (\(\gamma =0.75\), flooding schedule). Both codes are comparable to the SC-LDPC code in terms of code rate (\(R=0.8\) and \(R=0.83\)). The DVB-S2 code serves as a reference LDPC code that is similar in length to a full code block (64800 bit vs. 51238 bit). Conversely, the Wi-Fi code can be considered an uncoupled counterpart to the SC-LDPC code with a block length similar to one sub-block (648 bit vs. 640 bit). We see that for the same number of iterations, the row-layered decoder outperforms the column-layered decoder. This observation also holds for the compact decoders in Fig. 6b and the compact decoders with one additional pipeline stage in the processor in Fig. 6c. We also see that the performance decreases from the layered to the compact and the double-compact decoders. For better visualization, Fig. 6d shows the signal-to-noise ratio (SNR) at BER \(10^{-6}\) for all presented decoders over the iterations. This representation helps to illustrate the convergence behavior of the decoders, where a strong curvature toward the origin of the diagram indicates a faster convergence. The fastest convergence is achieved with layered decoding, the slowest with double-compact decoding. This behavior is expected, as the exchanged messages in the compact, and particularly in the double-compact decoder, are less recent, which affects the convergence speed of the decoder. Furthermore, we see that all decoders asymptotically approach an SNR of approximately 3.1 dB.
5.2 Implementation results
To compare the efficiency of different pipeline decoders, we implemented

A 4-iteration row-layered decoder,

A 5-iteration column-layered decoder,

A 6-iteration row-compact decoder,

A 7-iteration column-compact decoder,

And two 9-iteration compact decoders with one additional pipeline stage in the processor, with row- and column-wise processing.
The pipeline stage was inserted in the minimum search tree of the CFU stage for the row-wise processor and after the VFU stage for the column-wise processor. The respective number of iterations was selected such that all decoders achieve the same FEC performance (fixed point) of \(\text {BER}=10^{-6}\) at 4.2 dB, which is around 1 dB from the maximum MS performance of the code. All decoders use 4 bits to represent the channel values and the extrinsic messages. The row processors' APP, R2L, and L2R values are quantized with 6 bits. The performance loss compared to floating point for this configuration is less than 0.2 dB down to a BER of \(10^{-6}\).
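A simple behavioral model of the fixed-point quantization is given below. The paper specifies only the bit-widths (4 bit channel/extrinsic, 6 bit APP), so the uniform symmetric saturating characteristic and the unit step size are assumptions for illustration:

```python
def quantize(x, bits, step=1.0):
    """Uniform symmetric saturating quantizer (sketch): round to the nearest
    multiple of `step` and clip to the symmetric range of a `bits`-bit
    sign-magnitude representation, i.e. [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    q = round(x / step)
    return max(-qmax, min(qmax, q))
```

For example, a 4-bit quantizer saturates all channel LLRs to the range [-7, 7], while the 6-bit APP values span [-31, 31].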
The layouts of the implemented decoders are depicted in Fig. 7, and Table 1 summarizes the results. When comparing the respective row- and column-wise decoders, we see that the column-wise decoders outperform the respective row-wise decoders in terms of throughput, energy efficiency, and area efficiency. For the layered architecture, the column-wise decoder achieves a \(25\%\) higher clock frequency (and hence throughput), around \(19\%\) better energy efficiency, and \(22\%\) better area efficiency. For the compact and double-compact architectures, the improvements are even more pronounced, with \(50\%\) (freq.), \(59\%\) (energy eff.), and \(135\%\) (area eff.), and with \(38\%\) (freq.), \(24\%\) (energy eff.), and \(38\%\) (area eff.), respectively.
At first glance, these results seem unexpected, especially considering the fact that the column-wise decoders are composed of an equal or even larger number of processors. However, an important fact to consider is the lower number of edge message computations of the column-wise processors (\({\hat{E}}_\text {col}=2816\) vs. \({\hat{E}}_\text {row}=3840\)). Relating the calculated computational complexity values to implementation metrics such as energy and area efficiency, we observe a large discrepancy. We attribute this to the data-transfer dominance of the decoder. It was already shown in [21] that for data-transfer-dominated applications like LDPC decoding, the number of operations has little impact on area and energy efficiency. Another important aspect contributing to the better efficiency of the column-wise decoders is the much lower data transfer between the processors (\(256\cdot Q_\text {ext}=1024\) vs. \(1296 \cdot Q_\text {app}=7776\) wires) and, consequently, the lower wiring load on the logic circuit, resulting in smaller logic cells. Furthermore, the column processor has \(57\%\) fewer registers (Eqs. 5 and 6) and thus a smaller clock tree.
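The quoted transfer widths follow directly from the message counts and word widths; reading \(Q_\text {ext}=4\) and \(Q_\text {app}=6\) as the 4-bit extrinsic and 6-bit APP quantization of Sect. 5.2 (our interpretation), the products can be checked as:

```python
# Inter-processor transfer width in bits per clock cycle (values from the text):
Q_ext, Q_app = 4, 6               # extrinsic-message and APP word widths
col_transfer = 256 * Q_ext        # column-wise: 256 extrinsic messages per transfer
row_transfer = 1296 * Q_app       # row-wise: 1296 APP values per transfer
print(col_transfer, row_transfer) # 1024 vs. 7776
```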
A similar observation is made when comparing the compact and the double-compact decoders. Despite the higher computational complexity of the latter, the energy efficiency improves. Here, the introduction of an additional pipeline stage reduces the wiring load and toggling rates. Consequently, the energy efficiency per processor improves by \(28\%\) for the column-wise decoder and \(60\%\) for the row-wise decoder.
It should be noted that there is a large drop in frequency and a degradation in energy efficiency from the row-layered to the row-compact decoder. The reason is that the row-layered decoder does not process exchange messages (see Sect. 4.2.1). Consequently, the corresponding adders in the row processors and their wiring were not synthesized, which contributes to the better efficiency of the row-layered decoder.
Finally, Table 2 shows the results for the column double-compact decoder in comparison with data from the fastest LDPC block code (LDPC-BC) and LDPC convolutional code (LDPC-CC) decoders in the literature. The column double-compact decoder was selected for the comparison as it achieves the highest throughput of the presented decoders. The FPWD in [12] is not listed in Table 2, as its results are similar to those of the column-compact decoder in Table 1.
The LDPC-BC decoder in [8] is implemented in the same 22 nm technology and features a similar code rate and (sub)block size and therefore allows for a detailed comparison. The throughput of the SC-LDPC decoder is about 20% lower than that of the unrolled block decoder. This results from the smaller number of pipeline stages per iteration of the SC-LDPC decoder (2 vs. 3), which allows the LDPC-BC decoder to achieve a higher maximum frequency. The area of both decoders is similar. However, the SC-LDPC decoder achieves a gain in error correction performance of 0.7 dB over the LDPC-BC decoder. At an \(E_b/N_0\) of 4.2 dB, the SC-LDPC decoder surpasses the maximum performance of the Wi-Fi code at 200 iterations (see Fig. 6c). This illustrates the great potential of SC-LDPC codes for high-throughput decoding beyond 5G.
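For context, the throughput of such fully pipelined decoders scales linearly with the clock frequency. Assuming the pipeline accepts one subblock of \(c=640\) coded bits (\(b=512\) information bits) per clock cycle — our reading of the pipeline principle, not a figure taken from Table 2 — a quick estimate looks like:

```python
def pipeline_throughput(f_clk_hz, c=640, b=512):
    """Throughput estimate for a pipeline decoder that accepts one subblock
    per clock cycle (assumption; c and b are the EPIC code parameters)."""
    coded = f_clk_hz * c   # coded bits per second
    info = f_clk_hz * b    # information bits per second
    return coded, info

# A hypothetical 200 MHz clock already exceeds 100 Gbit/s of coded throughput:
coded, info = pipeline_throughput(200e6)  # 128 Gbit/s coded, 102.4 Gbit/s info
```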
6 Conclusions
In this paper, we presented the first in-depth investigation on the implementation of SC-LDPC decoders for throughputs beyond 100 Gbit/s. For an \(N=51328\), \(R=0.8\) terminated SC-LDPC code with subblock size \(c=640\) and coupling width \(m_s=1\), we explored various design trade-offs, including row- and column-wise decoding, non-overlapping and overlapping window scheduling, and processor pipelining. In this context, we provided the first description of a column-wise SC-LDPC decoding architecture. We have shown that column-wise decoding of SC-LDPC codes is a promising approach that, despite poorer convergence behavior, offers advantages in implementation efficiency.
7 Methods
Performance evaluation and implementation were performed for the EPIC SC-LDPC code with \(L=80\), \(m_s=1\), \(b=512\), and \(c=640\) [13].
Performance was evaluated with BER-over-SNR simulations for the transmission scenario depicted in Fig. 1. Each SNR point was simulated with \(10^6\) code blocks (corresponding to \(80 \cdot 10^6\) subblocks) transmitted over an additive white Gaussian noise (AWGN) channel with binary phase-shift keying (BPSK) modulation. All decoders use the NMS algorithm with \(\gamma =0.75\).
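The channel and LLR-generation part of this setup can be sketched as follows (function name, seed, and the \(E_b/N_0\) interpretation of the SNR axis are our assumptions):

```python
import numpy as np

def awgn_bpsk_llrs(bits, snr_db, rate=0.8, rng=None):
    """Transmit code bits with BPSK over an AWGN channel, return channel LLRs.

    snr_db is interpreted as Eb/N0 in dB; the code rate converts it to Es/N0.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    ebn0 = 10 ** (snr_db / 10)
    esn0 = rate * ebn0                    # Es/N0 for a rate-R code
    sigma2 = 1.0 / (2.0 * esn0)           # noise variance per real dimension
    x = 1.0 - 2.0 * np.asarray(bits)      # BPSK mapping: 0 -> +1, 1 -> -1
    y = x + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    return 2.0 * y / sigma2               # channel LLRs fed to the decoder
```

Counting decoder output errors per transmitted block over many such blocks yields the BER curves of Fig. 6.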
Implementation was performed in a 22 nm fully depleted silicon-on-insulator (FD-SOI) technology under worst-case process, voltage, and temperature (PVT) conditions (125 °C; 0.72 V for timing, 0.80 V for power). For synthesis we used Design Compiler, and for placement and routing IC Compiler, both from Synopsys. Power numbers were calculated with back-annotated wiring data using Synopsys PrimeTime and Siemens ModelSim. The stimuli for the power simulations were obtained at a fixed SNR of 4 dB. Measurements started after an initialization phase of 100 clock cycles to ensure that the pipelines are filled.
Availability of data and materials
EPIC SC-LDPC codes: https://www.uni-kl.de/en/channel-codes/channel-codes-database/spatially-coupled-ldpc/
Abbreviations
APP: A posteriori probability
AWGN: Additive white Gaussian noise
ASIC: Application-specific integrated circuit
BER: Bit error rate
BPSK: Binary phase-shift keying
BSMC: Binary symmetric memoryless channel
CCPD: Column-compact pipeline decoder
CFU: Check node functional unit
CFUi: CFU in
CFUo: CFU out
CLPD: Column-layered pipeline decoder
FD-SOI: Fully depleted silicon-on-insulator
FEC: Forward error correction
FER: Frame error rate
FPWD: Full-parallel window decoder
L2R: Left-to-right
LDPC: Low-density parity-check
LDPC-BC: LDPC block code
LDPC-CC: LDPC convolutional code
LEX: Left exchange
LLR: Log-likelihood ratio
MS: Min-sum
NMS: Normalized MS
OCA: On-demand check node activation
OVA: On-demand variable node activation
PVT: Process, voltage, and temperature
R2L: Right-to-left
RCPD: Row-compact pipeline decoder
REX: Right exchange
RLPD: Row-layered pipeline decoder
SC-LDPC: Spatially coupled LDPC
SNR: Signal-to-noise ratio
SOA: State-of-the-art
UMU: Update minima unit
VFU: Variable node functional unit
VFUi: VFU in
VFUo: VFU out
References
1. EPIC - Enabling practical wireless Tb/s communications with next generation channel coding. https://epic-h2020.eu/results
2. R.G. Gallager, Low-density parity-check codes. IRE Trans. Inf. Theory 8(1), 21–28 (1962)
3. C. Kestel, M. Herrmann, N. Wehn, When channel coding hits the implementation wall, in 2018 IEEE 10th International Symposium on Turbo Codes & Iterative Information Processing (ISTC), pp. 1–6 (2018). https://doi.org/10.1109/ISTC.2018.8625324
4. M. Li, V. Derudder, K. Bertrand, C. Desset, A. Bourdoux, High-speed LDPC decoders towards 1 Tb/s. IEEE Trans. Circuits Syst. I Regul. Pap., pp. 1–10 (2021). https://doi.org/10.1109/TCSI.2021.3060880
5. R. Ghanaatian, A. Balatsoukas-Stimming, T.C. Müller, M. Meidlinger, G. Matz, A. Teman, A. Burg, A 588-Gb/s LDPC decoder based on finite-alphabet message passing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(2), 329–340 (2018). https://doi.org/10.1109/TVLSI.2017.2766925
6. M. Herrmann, C. Kestel, N. Wehn, Energy efficient FEC decoders, in 2021 IEEE International Symposium on Topics in Coding (ISTC) (2021)
7. A. Jimenez Felstrom, K.S. Zigangirov, Time-varying periodic convolutional codes with low-density parity-check matrix. IEEE Trans. Inf. Theory 45(6), 2181–2191 (1999)
8. N. Wehn, O. Sahin, M. Herrmann, Forward-error-correction for beyond-5G ultra-high throughput communications, in 2021 IEEE International Symposium on Topics in Coding (ISTC) (2021)
9. D.J. Costello, L. Dolecek, T.E. Fuja, J. Kliewer, D.G.M. Mitchell, R. Smarandache, Spatially coupled sparse codes on graphs: theory and practice. IEEE Commun. Mag. 52(7), 168–176 (2014)
10. A.E. Pusane, A.J. Feltstrom, A. Sridharan, M. Lentmaier, K.S. Zigangirov, D.J. Costello, Implementation aspects of LDPC convolutional codes. IEEE Trans. Commun. 56(7), 1060–1069 (2008)
11. N.U. Hassan, M. Schluter, G.P. Fettweis, Fully parallel window decoder architecture for spatially coupled LDPC codes, in 2016 IEEE International Conference on Communications (ICC), pp. 1–6 (2016)
12. M. Herrmann, N. Wehn, M. Thalmaier, M. Fehrenz, T. Lehnigk-Emden, M. Alles, A 336 Gbit/s full-parallel window decoder for spatially coupled LDPC codes, in Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit) (2021)
13. M. Helmling, S. Scholl, F. Gensheimer, T. Dietz, K. Kraft, S. Ruzika, N. Wehn, Database of channel codes and ML simulation results (2022). https://www.uni-kl.de/channel-codes
14. C. Chen, Y. Lin, H. Chang, C. Lee, A 2.37-Gb/s 284.8-mW rate-compatible (491, 3, 6) LDPC-CC decoder. IEEE J. Solid-State Circuits 47(4), 817–831 (2012)
15. C. Lin, R. Liu, C. Chen, H. Chang, C. Lee, A 7.72 Gb/s LDPC-CC decoder with overlapped architecture for pre-5G wireless communications, in 2016 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 337–340 (2016)
16. F.R. Kschischang, B.J. Frey, H.-A. Loeliger, Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 47(2), 498–519 (2001). https://doi.org/10.1109/18.910572
17. D.E. Hocevar, A reduced complexity decoder architecture via layered decoding of LDPC codes, in IEEE Workshop on Signal Processing Systems (SIPS 2004), pp. 107–112 (2004)
18. P. Schläfer, N. Wehn, M. Alles, T. Lehnigk-Emden, A new dimension of parallelism in ultra high throughput LDPC decoding, in SiPS 2013 Proceedings, pp. 153–158 (2013). https://doi.org/10.1109/SiPS.2013.6674497
19. O. Boncalo, A. Amaricai, Ultra high throughput unrolled layered architecture for QC-LDPC decoders, in 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 225–230 (2017)
20. M. Baldi, Low-Density Parity-Check Codes (Springer, Cham, 2014), pp. 5–21. https://doi.org/10.1007/978-3-319-02556-8_2
21. F. Kienle, N. Wehn, H. Meyr, On complexity, energy- and implementation-efficiency of channel decoders. IEEE Trans. Commun. 59(12), 3301–3310 (2011). https://doi.org/10.1109/TCOMM.2011.092011.100157
22. C. Chen, Y. Lan, H. Chang, C. Lee, A 3.66-Gb/s 275-mW TB-LDPC-CC decoder chip for MIMO broadcasting communications, in 2013 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 153–156 (2013)
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was financially supported by the EU (project ID: 760150-EPIC) and the German Ministry of Education and Research (project: FACTOR).
Author information
Authors and Affiliations
Contributions
MH and NW contributed to the main idea. MH performed the numerical simulations and hardware implementation. MH and NW wrote the paper. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Herrmann, M., Wehn, N. Beyond 100 Gbit/s Pipeline Decoders for Spatially Coupled LDPC Codes. J Wireless Com Network 2022, 90 (2022). https://doi.org/10.1186/s13638-022-02169-5
DOI: https://doi.org/10.1186/s13638-022-02169-5
Keywords
 Spatial coupling
 LDPC codes
 VLSI implementation
 Highthroughput decoding