Reduced-complexity decoding implementation of QC-LDPC codes with modified shuffling

Layered decoding (LD) facilitates a partially parallel architecture for performing the belief propagation (BP) algorithm used to decode low-density parity-check (LDPC) codes. Such a schedule has, in general, lower implementation complexity than a fully parallel architecture and a higher convergence rate than both serial and parallel architectures, regardless of the codeword length or code rate. In this paper, we introduce a modified shuffling method which shuffles the rows of the parity-check matrix (PCM) of a quasi-cyclic LDPC (QC-LDPC) code, yielding a PCM in which each layer can be produced by circulating the layer above it one symbol to the right. The proposed shuffling scheme additionally guarantees that the columns of each layer of the shuffled PCM are either zero weight or single weight, a condition that plays a key role in further decreasing LD complexity. We show that, owing to these two properties, the number of occupied look-up tables (LUTs) on a field-programmable gate array (FPGA) is reduced by about 93% and the consumed on-chip power by nearly 80%, while the bit error rate (BER) performance is maintained. The only drawback of the shuffling is a degradation of decoding throughput, which is negligible for low values of E_b/N_0, down to a BER of 1e−6.

Central to BP decoding is the schedule of the algorithm, i.e., the order in which the reliability messages are exchanged between the nodes of the Tanner graph. The BP schedule is directly associated with the implementation architecture of the decoding method, and it falls into three main categories: (1) the fully parallel architecture, realized by the flood schedule [1], in which all the variable nodes (VNs) and check nodes (CNs) in the Tanner graph pass messages concurrently to their neighbors in every iteration of the algorithm. Although yielding a high throughput, this schedule requires a large silicon area with high interconnect complexity [2]. This architecture is enabled by the inherently parallelizable nature of the BP algorithm, in contrast to turbo codes, whose decoding algorithms are inherently serial; nevertheless, several works have devised fully parallel architectures for decoding turbo codes [3][4][5][6]. (2) The serial architecture, in which a smaller number of functional units are re-used several times to perform each decoder iteration; decoding complexity is thereby lowered, although at the price of reduced decoding throughput. (3) The partially parallel architecture, which offers a good trade-off between hardware complexity and decoding throughput and is best accomplished by the layered decoding (LD) schedule.
In the LD schedule, the rows of the PCM are divided into a number of layers, and each iteration of the BP algorithm is likewise split into the same number of sub-iterations. Each sub-iteration runs over one layer of the PCM, during which the CNs of that layer exchange reliability messages with their neighboring VNs. At the end of a sub-iteration, the updated reliability messages are delivered to the next layer. Accordingly, in each sub-iteration, only a subset of CNs, i.e., as many as the number of rows in each layer, participates in the decoding process, giving LD a lower hardware utilization than the flood schedule. Furthermore, the LD schedule achieves better convergence than the flood schedule because the latest variable-to-check (VTC) messages are always used to update the check-to-variable (CTV) messages during a sub-iteration. For the sake of LD complexity, it is highly desirable that the number of ones in each column of a layer be either one or zero. Quasi-cyclic LDPC (QC-LDPC) codes inherently have a layered structure with this property, making them an appropriate candidate for LD. QC-LDPC codes are a special type of LDPC codes possessing a cyclic property which simplifies their encoding and decoding, while preserving performance comparable to random (or unstructured) LDPC codes [34,35].

Related work and contributions
The shuffling idea proposed in [15] shuffles the rows of the PCM of a QC-LDPC code prior to decoding, in the sense that the order of the rows of the PCM is completely changed. After shuffling, each layer can be produced by circulating the layer above it one symbol to the right, leading to a simplified LD and a sped-up convergence rate. In particular, due to the cyclic property, it is enough to realize only the first layer of the PCM in hardware rather than the whole PCM. The downside of this shuffling is that it may spoil the original property of single-weight columns within each layer.
To work around this shortcoming, we outlined a modified shuffling idea in our previous work [36] which results in a shuffled PCM that retains the desired property of single-weight columns while also possessing the cyclic property. This was accomplished by introducing a set of offset values prior to performing the shuffling. In this paper, this modified shuffling idea is further investigated. Specifically: (1) the logic behind the offset values applied for shuffling is clarified, elaborating how the offset values come into effect, and the procedure for determining the offset values is outlined; (2) since [36] lacks implementation results to verify the improvements promised by the modified shuffling method, we provide in this work implementation results for LD of several QC-LDPC codes shuffled with the proposed technique. Improvements in the number of occupied look-up tables (LUTs) on a field-programmable gate array (FPGA) and in power consumption are observed compared with non-shuffled LD. These improvements are achieved without sacrificing bit error rate (BER) performance. Although the decoding throughput deteriorates as E_b/N_0 rises, our analysis shows that if a BER of 1e−6 is chosen as the target, the throughput degradation is insubstantial.
The organization of the paper is as follows. Section 3 presents the necessary fundamentals of QC-LDPC codes and LD. Section 4 is devoted to the assessment of the novel shuffling method and its attributes. Implementation and simulation results, together with the necessary analysis, come in Sect. 5. Final conclusions are drawn in Sect. 6.

QC-LDPC codes
The PCM of a QC-LDPC code is comprised of circulant permutation matrices (CPMs) and zero matrices, wherein a CPM is a cyclically shifted identity matrix. Such a PCM can be represented as in (1). Codewords of a QC-LDPC code have a sectionized cyclic structure, in the sense that by cyclically shifting the t sections of a codeword, another valid codeword is obtained [37]. A compact way of representing the PCM of a QC-LDPC code is the base matrix, denoted by W. In W, non-negative integers specify the shift value, with respect to the identity matrix, of the corresponding CPM, and the other entries, usually chosen to be −1, represent zero matrices in the PCM. Fig. 1 shows the base matrices of the QC-LDPC codes utilized in the IEEE 802.15.3c standard, and Fig. 2 shows that of the 1/2-rate (2304,1152) QC-LDPC code used in IEEE 802.16e. In these two figures, empty places are the locations of zero matrices. The Tanner graph representation of a PCM is an important means of comprehending the BP algorithm. It consists of two sets of nodes, where one set represents CNs, i.e., parity-check sums (equivalent to rows of H), and the other set represents VNs (equivalent to columns of H). A CN in a Tanner graph is connected to a VN if and only if the corresponding element of H is one.
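The expansion of a base matrix into the full PCM can be sketched in a few lines of Python. The function name, the toy base matrix, and the CPM size below are illustrative assumptions, not values from the paper's codes:

```python
import numpy as np

def expand_base_matrix(W, z):
    """Expand a base matrix W into the full PCM H.

    Each entry s >= 0 becomes a z-by-z identity matrix cyclically
    shifted s positions; an entry of -1 becomes a z-by-z zero matrix.
    """
    c, t = W.shape
    H = np.zeros((c * z, t * z), dtype=np.uint8)
    for i in range(c):
        for j in range(t):
            s = W[i, j]
            if s >= 0:
                # CPM: identity with its columns rotated s places
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(
                    np.eye(z, dtype=np.uint8), s, axis=1)
    return H

# Toy 2x3 base matrix with CPM size z = 4 (hypothetical shift values)
W = np.array([[0, 2, -1],
              [1, -1, 3]])
H = expand_base_matrix(W, 4)
print(H.shape)  # (8, 12)
```

Every column of a non-zero block has weight exactly one, which is the structural property exploited by the layered schedule discussed next.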

LD schedule
One appropriate schedule which facilitates a partially parallel architecture for the execution of the BP algorithm is LD. In this schedule, each iteration is split into several sub-iterations, each corresponding to a layer of the PCM. In each sub-iteration, the CNs of the corresponding layer exchange reliability messages with the VNs of that layer, and at the end, the updated reliability messages are provided to the next layer. Accordingly, fewer processing units for CNs and VNs are realized in hardware, and they are re-utilized in successive sub-iterations for successive layers. Let y = (y_0, ..., y_{n−1}) be the soft-decision sequence at the output of the channel that is to be decoded. The J rows of the PCM H_qc are divided into L layers, each containing E consecutive rows; hence, the i-th layer comprises rows iE through (i+1)E − 1. In (3), the min-sum scheme [38] is used for computing the CTV messages, wherein sgn(x) is the sign function, equal to 1 when x ≥ 0 and −1 otherwise.
At the end of each sub-iteration, the codeword bits are estimated based on the updated APP values; in particular, v_l = 1 if Y_l > 0 and v_l = 0 otherwise. If the check-sum condition H_qc v^T = 0 is satisfied by the estimated codeword v = (v_0, ..., v_{n−1}), it is declared the valid codeword and the algorithm terminates. Otherwise, the algorithm continues with the vertical step of the next layer. The maximum number of iterations is, however, limited by a threshold I_max, and the algorithm declares a failure if decoding has not converged to a valid codeword within I_max iterations.
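The procedure above can be sketched as a minimal layered min-sum decoder. The sketch below is an illustrative assumption, not the paper's implementation: it uses one row per layer (E = 1), a toy parity-check matrix, and the common sign convention in which a positive LLR favors bit 0 (the paper instead decides v_l = 1 when Y_l > 0):

```python
import numpy as np

def layered_min_sum(H, llr, max_iter=20):
    """Layered min-sum sketch with one row of H per layer (E = 1).

    Y holds the APP values, updated in place so that each layer works
    with the latest messages; R holds the CTV messages per check row.
    """
    J, n = H.shape
    Y = llr.astype(float).copy()            # APP values, initialized to channel LLRs
    R = np.zeros((J, n))                    # CTV messages
    for _ in range(max_iter):
        for j in range(J):                  # one sub-iteration per layer
            idx = np.flatnonzero(H[j])
            Q = Y[idx] - R[j, idx]          # VTC: remove the old CTV contribution
            signs = np.where(Q >= 0, 1.0, -1.0)   # sgn(x) = 1 for x >= 0, else -1
            absQ = np.abs(Q)
            order = np.argsort(absQ)
            min1, min2 = absQ[order[0]], absQ[order[1]]
            mins = np.full(len(idx), min1)
            mins[order[0]] = min2           # the minimum edge uses the 2nd minimum
            # min-sum CTV: product of the other signs times the other-edge minimum
            R[j, idx] = np.prod(signs) * signs * mins
            Y[idx] = Q + R[j, idx]          # APP update
            v = (Y < 0).astype(np.uint8)    # tentative hard decision
            if not (H @ v % 2).any():       # check-sum condition satisfied
                return v, True
    return (Y < 0).astype(np.uint8), False
```

For example, on the toy code H = [[1,1,1,0],[0,1,1,1]] with channel LLRs (2, 3, −1, 4) (the all-zero codeword with one unreliable bit), the decoder converges to the all-zero codeword within the first sub-iteration.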

Modified shuffling
The LD schedule is well suited to decoding QC-LDPC codes, since each block-row in (1) can be regarded as a layer. Accordingly, the columns in each layer are either single weight or zero weight. This property is useful for simplifying the decoding implementation. First and foremost, the summation in (4) is simplified, as there exists only a single CTV message to be added to y_l. Moreover, the implementation of the operations in (2) and (3) is also simplified.
Shuffling is the act of swapping the rows of the PCM in a manner that reduces the complexity of the decoding algorithm while preserving the error-correction performance. [15] proposes to partition the J = c·b rows of H_qc among matrices H_qc^{sh,(i)}, i = 1, ..., b, each containing rows i, i + b, i + 2b, ..., i + (c − 1)b of H_qc. Accordingly, the shuffled PCM, H_qc^{sh}, has the matrices H_qc^{sh,(i)}, i = 1, ..., b, as its row-blocks. An example of such a shuffling, performed on a (12, 4) QC-LDPC code, is depicted in Fig. 3. In H_qc^{sh}, each layer H_qc^{sh,(i)} is obtained by cyclically shifting the layer above it, H_qc^{sh,(i−1)}, one symbol to the right, noting that the circulation must be performed separately on the t individual sections of a layer. Owing to this cyclic property, it is no longer necessary to implement the whole PCM in the target hardware (for example, an FPGA) and define all of its "1"-entries as connections. Instead, it suffices to implement only H_qc^{sh,(1)}, the first layer of H_qc^{sh}, and then perform a circulation on the memory blocks containing the updated APP values at the end of each sub-iteration. This implementation benefit is further elaborated in Sect. 4.2. The shortcoming of this shuffling method is that the layers of the shuffled matrix may no longer be comprised of single-weight columns. In particular, if the base matrix of a QC-LDPC code has repeated numbers in a column, the corresponding block-column of the shuffled matrix will have columns of weight greater than one in each layer. For instance, in Figs. 1 and 2, the shaded columns have repeated numbers; consequently, after shuffling, they bring about columns of weight 2 or 3 in each layer of their respective shuffled matrices.
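The row permutation of [15] and its cyclic property can be checked numerically. The sketch below (illustrative names and a hypothetical toy base matrix, not one of the paper's codes) builds a small QC-LDPC PCM, shuffles its rows, and verifies that each layer equals its predecessor circulated one symbol to the right within each section:

```python
import numpy as np

def shuffle_rows(H, c, b):
    """Basic row shuffling of [15]: layer i of the shuffled PCM collects
    rows i, i + b, ..., i + (c-1)b of H (0-indexed here)."""
    order = [i + m * b for i in range(b) for m in range(c)]
    return H[np.array(order)]

# Hypothetical 2x3 base matrix, CPM size b = 4
W = np.array([[0, 1, 2],
              [3, 0, 1]])
b, c, t = 4, 2, 3
H = np.zeros((c * b, t * b), dtype=np.uint8)
for i in range(c):
    for j in range(t):
        H[i*b:(i+1)*b, j*b:(j+1)*b] = np.roll(
            np.eye(b, dtype=np.uint8), W[i, j], axis=1)

H_sh = shuffle_rows(H, c, b)
# Cyclic property: layer 1 equals layer 0 circulated one symbol right,
# section by section
layer0, layer1 = H_sh[0:c], H_sh[c:2*c]
rolled = np.hstack([np.roll(layer0[:, j*b:(j+1)*b], 1, axis=1)
                    for j in range(t)])
print(np.array_equal(layer1, rolled))  # True
```

Note that the toy base matrix above has no repeated shift value in any column, so every layer of the shuffled PCM keeps single-weight columns; a repeated value would break exactly this property, as discussed next.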
As a solution to this weakness, we propose to employ a set of integer values 0 ≤ o_m < b, m = 1, ..., c, serving as offsets that modify the order in which the rows of H_qc are placed in the same layer of H_qc^{sh}: the i-th layer of H_qc^{sh} collects from the m-th block-row the row indexed by i offset by o_m (modulo b). This perspective of the modified shuffling shows the way to find appropriate offset values for a specific QC-LDPC code. The straightforward way is to try all possible values until finding a set that yields a base matrix in which all the integers in any column are distinct. In general, for a QC-LDPC code, there are b^c possible offset sets, which must be examined one by one until one is found that results in a base matrix without repeated integers in a column; once such an offset set is found, the search halts. It should be noted that in some cases no valid offset set exists, as for the (672, 588) QC-LDPC code of Fig. 1-d; hence, the modified shuffling is not necessarily applicable to all QC-LDPC codes. It should also be emphasized that the desired cyclic property of H_qc^{sh} holds under the modified shuffling just as it does under the basic shuffling method, and hence each layer of H_qc^{sh} can be produced from its previous layer by a circular shift.
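The brute-force search over the b^c offset sets can be sketched as follows. This is an illustrative assumption of how the offsets enter the base matrix (adding o_m modulo b to the shift values of block-row m); the function name and the toy base matrix are hypothetical:

```python
import numpy as np
from itertools import product

def find_offsets(W, b):
    """Search for offsets (o_1, ..., o_c), 0 <= o_m < b, such that the
    values (W[m, j] + o_m) mod b are distinct within every column,
    ignoring -1 entries (zero blocks). Returns None if no set exists."""
    c, t = W.shape
    for offs in product(range(b), repeat=c):   # all b**c candidate sets
        for j in range(t):
            vals = [(W[m, j] + offs[m]) % b
                    for m in range(c) if W[m, j] >= 0]
            if len(vals) != len(set(vals)):    # repeated integer in column j
                break
        else:                                  # every column was distinct
            return offs
    return None

# Toy base matrix with a repeated shift (1 appears twice in column 0)
W = np.array([[1, 0, 2],
              [1, 2, -1],
              [0, -1, 1]])
print(find_offsets(W, 4))  # (0, 1, 0)
```

The search halts at the first valid set; returning None corresponds to the case, mentioned above, where the modified shuffling is not applicable.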

LD implementation
The advantage of shuffling can be highlighted by an investigation of the implementation methodology. Specifically, Fig. 6 illustrates the LD architecture for the H_qc^{sh} of Fig. 3. In this figure, the VN processing units (VNPUs) and CN processing units (CNPUs) are responsible for computing the VTC and CTV messages in (2) and (3), and their interconnections are determined by the 1-entries of the first layer. At the end of each sub-iteration, the computed APP values are cyclically shifted as shown in the figure. This circulation serves as an alternative to redefining the connections between VNPUs and CNPUs, allowing the current connections to remain valid and the next sub-iteration to start immediately. Fig. 7 shows the decoding flowchart for the algorithm outlined in Sect. 3.2. As shown, the last block in the decoding loop is a "circulate" operation performed on the updated APP values in order to prepare them for processing in the next sub-iteration. The other blocks in the decoding loop are responsible for performing operations (2)-(4).
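The "circulate" step has a compact software analogue. The sketch below (the function name and flat memory layout are assumptions) rotates the APP memory one position within each of the t sections of length b, which is what keeps the fixed first-layer wiring valid for the next layer; rotating the data left compensates for each layer being a right-shifted copy of its predecessor, though the direction depends on the chosen convention:

```python
import numpy as np

def circulate_app(Y, t, b):
    """Rotate the APP memory one position within each of the t sections
    of length b, emulating the end-of-sub-iteration circulate block."""
    return np.roll(Y.reshape(t, b), -1, axis=1).reshape(-1)

# Toy APP memory with t = 2 sections of b = 4 values each
Y = np.arange(8.0)
print(circulate_app(Y, 2, 4))  # [1. 2. 3. 0. 5. 6. 7. 4.]
```

In hardware, the same effect is obtained by a barrel shifter or by re-addressing the APP memory blocks, at the cost of the one extra clock cycle noted in the throughput analysis below.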

Implementation and experimental results
We implemented both LD with the shuffled PCM and LD with the non-shuffled PCM for the example codes of the IEEE 802.16e and IEEE 802.15.3c standards. The utilized hardware was a Xilinx VC709 evaluation board, shown in Fig. 8. The acquired results in terms of utilized LUTs, on-chip power, and maximum clock frequency are shown in Table 1. The first two parameters are directly reported by the synthesis tool, and the maximum clock frequency is estimated from the worst negative slack, also reported by the synthesis tool. As deduced from the figures in the table, the design with the shuffled PCM is considerably smaller, and its consumed power is notably lower. For instance, the number of occupied LUTs is reduced from 262122 to only 15299 for the (672,336) code, equivalent to a (1 − 15299/262122) × 100 ≈ 94% reduction in the number of LUTs on the FPGA. Similarly, a (1 − 0.84/6.693) × 100 ≈ 87% reduction in consumed power results. In summary, the superiority of the shuffling method in terms of hardware area and consumed power is apparent from the implementation results. Note that the design of the non-shuffled IEEE 802.16e code is too big to fit in the FPGA, and hence its results are not available. Figure 9 depicts the performance simulation of the three IEEE 802.15.3c codes, showing that shuffling does not degrade the BER performance. Indeed, LD with the (modified) shuffled PCM performs as well as LD with the non-shuffled PCM. Furthermore, the average number of iterations needed to achieve a specific BER performance, depicted in Fig. 10, is essentially the same in the two cases, further confirming their similar performance. This is because shuffling merely changes the order of the rows of the PCM, while the overall characteristics of the PCM remain intact; in particular, determining attributes such as the distance property, the cycle distribution, and the row and column weights do not change.
Comparing the two cases in terms of throughput is also of interest. The average throughput for the different codes is plotted in Fig. 11. Given that f_clk is the clock frequency specified in Table 1, N_ave_ite is the average number of sub-iterations, and N_clk is the number of clock cycles each sub-iteration needs, the average duration of decoding a sequence is N_ave_ite · N_clk / f_clk, leading to the average throughput (6) τ = k · f_clk / (N_ave_ite · N_clk). In our implementation, N_clk = 8 and 7 for the shuffled and non-shuffled PCM, respectively, the one extra clock cycle in the former case being needed for the cyclic shifting of the computed APP values. The average throughputs of the two cases overlap for low values of E_b/N_0, indicating that the additional number of sub-iterations is fully compensated by the higher clock frequency. This is no longer true as E_b/N_0 grows. If BER = 1e−6 is chosen as the target BER performance, the throughput degradation is 0.1, 0.3, and 1.3 Gbps for the three codes, respectively. The degradation in throughput stems from the fact that the layering differs in the two cases: with the non-shuffled PCM, the J rows are divided into c layers of b rows each, while with the shuffled PCM they are divided into b layers of c rows each. Since c is usually much smaller than b, in the first case a larger number of VNs is processed per sub-iteration and hence fewer sub-iterations are needed in total.
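Equation (6) is straightforward to evaluate. The numbers below are hypothetical, chosen only to show the units, and do not correspond to Table 1 or Fig. 11:

```python
def avg_throughput(k, f_clk, n_ave_ite, n_clk):
    """Eq. (6): tau = k * f_clk / (N_ave_ite * N_clk), in bits/s,
    where k is the number of information bits per codeword."""
    return k * f_clk / (n_ave_ite * n_clk)

# Hypothetical example: k = 336 information bits, 200 MHz clock,
# 40 average sub-iterations, 8 clock cycles per sub-iteration
print(avg_throughput(336, 200e6, 40, 8) / 1e9)  # 0.21 (Gbps)
```

The formula makes the trade-off visible: shuffling raises N_clk by one and increases the number of sub-iterations, but also raises the achievable f_clk, so the net effect depends on the operating E_b/N_0.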

Conclusion
The novel shuffling method proposed in this paper is basically a swapping of the rows of the PCM of a QC-LDPC code with two objectives in mind. First, the columns in each layer of the shuffled PCM must remain single weight or zero weight. Second, each layer must be producible from the layer above it by a one-symbol circular shift. Though simple, this shuffling brings about a considerable complexity reduction in the decoding implementation, while preserving the error-correcting capability of the code and its decoding throughput down to BER values of 1e−6.