Systematic construction, verification and implementation methodology for LDPC codes
EURASIP Journal on Wireless Communications and Networking volume 2012, Article number: 84 (2012)
Abstract
In this article, a novel and systematic low-density parity-check (LDPC) code construction, verification and implementation methodology is proposed. The methodology is composed of a simulated annealing based LDPC code constructor, a GPU-based high-speed code selector, an ant colony optimization based pipeline scheduler and an FPGA-based hardware implementer. Compared to traditional approaches, this methodology enables us to construct LDPC codes that are both decoding-performance-aware and hardware-efficiency-aware in a short time. Simulation results show that the generated codes have far fewer cycles (length-6 cycles are eliminated) and memory conflicts (75% reduction in idle clocks), with no BER performance loss compared to WiMAX codes. Additionally, the GPU simulation runs 490 times faster than the CPU under float precision and achieves a net throughput of 24.5 Mbps. Finally, a multi-mode LDPC decoder with a net throughput of 1.2 Gbps (bit throughput 2.4 Gbps) is implemented on FPGA, with completely on-the-fly configuration and less than 0.2 dB BER performance loss.
1. Introduction
Low-density parity-check (LDPC) codes were first proposed by Gallager [1] and rediscovered by MacKay and Neal, who introduced the Tanner Graph [2] into LDPC codes [3]. LDPC codes with soft decoding algorithms on the Tanner Graph can achieve outstanding capacity and approach the Shannon limit over noisy channels at moderate decoding complexity [4]. Most algorithms are rooted in the well-known belief propagation (BP) algorithm, such as the min-sum algorithm (MSA) with simplified calculation, the modified MSA (MMSA) [5] with improved BER performance, and layered versions [6] with fast decoding convergence.
The existence of "cycles" in the Tanner Graph is a critical constraint of the above algorithms, as it breaks the "message independence hypothesis" and degrades the BER performance. As a result, "girth" becomes an important metric for estimating the performance of an LDPC code. The progressive edge-growth (PEG) algorithm [7] is a girth-aware construction method that tries to make the shortest cycle as long as possible. The approximate cycle extrinsic (ACE) message degree constraint is further combined with PEG [8] to lower the error floor. However, these performance-aware methods do not take hardware implementation into account, which usually results in low efficiency or high complexity.
As to decoder implementation, the fully-parallel architecture [9] was first proposed to achieve the highest decoding throughput, but its hardware complexity is very high due to the routing overhead. The semi-parallel layered decoder [10] was then proposed to trade off hardware complexity against decoding throughput. Memory conflict is a critical problem for the layered decoder and is modeled as a single-layer traveling salesman problem (TSP) in [11]. However, this model ignores "element permutation", i.e., the order assignment of the edges in each layer, so its search does not cover the entire solution space. Further, a fully-parallel graphics processing unit (GPU) based implementation is also proposed in [12].
In this article, a novel and systematic LDPC code construction, verification, and implementation methodology is proposed, and a software and hardware platform is implemented, which is composed of four modules as shown in Figure 1. The simulated annealing (SA) based LDPC code constructor continuously constructs good candidate codes. The BER performance of the generated codes, especially the error floor, is then evaluated by the high-speed GPU-based simulation platform. Next, the hardware pipeline of the selected codes is optimized by the ant colony optimization (ACO) based scheduling algorithm, which can substantially reduce memory conflicts. Finally, detailed implementation schemes are proposed, i.e., the reconfigurable switch network (adopted from [13]), offset-threshold decoding, the split-row MMSA core, the early-stopping scheme and the multi-block scheme, and the corresponding multi-mode high-throughput decoder for the optimized codes is implemented on FPGA. The novelties of the proposed methodology are listed as follows:

Compared to traditional methods (PEG, ACE), the SA-based constructor takes both decoding performance and hardware efficiency into consideration during the construction process.

Compared to existing work [11], the ACO-based scheduler covers both layer and element permutation and maps the problem to a double-layered TSP, which covers the complete solution space and can provide a better pipeline schedule.

Compared to existing works, the GPU-based evaluator is the first to implement the semi-parallel layered architecture on GPU. The obtained net throughput is similar to the highest reported one [12] (about 25 Mbps), while the proposed scheme has higher precision and better BER performance. Further, we put the whole coding and decoding system into the GPU rather than a single decoder.

Compared to existing FPGA or ASIC implementations [14–16], the proposed multi-mode high-throughput decoder not only supports multiple modes with completely on-the-fly configuration, but also has a performance loss within 0.2 dB against float precision and 20 iterations, and a stable net throughput of 721.58 Mbps under code rate 1/2 and 20 iterations. With the early-stopping scheme, a net throughput of 1.2 Gbps is further achieved on a Stratix III FPGA.
The remainder of this paper is organized as follows. Section 2 presents the background of our research. Sections 3, 4, and 5 introduce the ACO-based pipeline scheduler, the SA-based code constructor and the GPU-based performance evaluator, respectively, followed by the hardware implementation schemes and issues of the multi-mode high-throughput LDPC decoder in Section 6. Simulation results are provided in Section 7 and hardware implementation results are given in Section 8. Finally, Section 9 concludes this article.
2. Background
2.1. LDPC codes and Tanner graph
An LDPC code is a special linear block code, characterized by a sparse parity-check matrix H with dimensions M × N; H_{ j,i }= 1 if code bit i is involved in parity-check equation j, and 0 otherwise. An LDPC code is usually described by its Tanner Graph, a bipartite graph defined on the code bit set ℝ and parity-check equation set ℂ, whose elements are called "bit nodes" and "check nodes", respectively. An edge is assigned between bit node BN_{ i } and check node CN_{ j } if H_{ j,i }= 1. A simple 4 × 6 LDPC code and the corresponding Tanner Graph are shown in Figure 2.
Quasi-cyclic LDPC (QC-LDPC) codes are a popular class of structured LDPC codes defined by a base matrix H^{b} whose elements satisfy $-1\le {\mathbf{H}}_{j,i}^{b}<{z}_{f}$, where ${z}_{f}$ is called the expansion factor. Each element of the base matrix is further expanded to a z_{ f } × z_{ f } matrix to obtain H. The elements ${\mathbf{H}}_{j,i}^{b}=-1$ are expanded to zero matrices, while the elements ${\mathbf{H}}_{j,i}^{b}\ge 0$ are expanded to cyclic-shift identity matrices with permutation factor $p={\mathbf{H}}_{j,i}^{b}$. QC-LDPC codes are naturally suited to layered algorithms, with the jth row of the base matrix being exactly layer j. We refer to the set $\left\{{\mathbf{H}}_{j,i}^{b}\mid {\mathbf{H}}_{j,i}^{b}\ge 0\right\}$ as the "1"s of the jth row. See Figure 3 for an example of a 4 × 6 base matrix with z_{ f } = 4.
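The expansion step can be illustrated with a minimal Python sketch. The function name `expand_base_matrix`, the example base matrix and the shift direction (row r of a block with permutation factor p has its 1 in column (r + p) mod z_f) are assumptions made for illustration; the article does not fix a shift convention here.

```python
def expand_base_matrix(base, zf):
    """Expand a QC-LDPC base matrix into the full binary matrix H.

    Entry -1 becomes a zf x zf zero block; entry p >= 0 becomes an identity
    block cyclically shifted by p (assumed convention: row r -> column r + p).
    """
    rows, cols = len(base), len(base[0])
    H = [[0] * (cols * zf) for _ in range(rows * zf)]
    for j in range(rows):
        for i in range(cols):
            p = base[j][i]
            if p < 0:
                continue  # zero block, nothing to set
            for r in range(zf):
                H[j * zf + r][i * zf + (r + p) % zf] = 1
    return H


# Hypothetical 2 x 3 base matrix with expansion factor 4.
base = [[0, -1, 1],
        [2, 1, -1]]
H = expand_base_matrix(base, 4)
```

Each non-negative entry contributes exactly one 1 per expanded row, so the row weight of H equals the number of "1"s in the corresponding layer.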
2.2. The BP algorithm and effect of cycle
The BP algorithm is a general soft decoding scheme for codes described by a Tanner Graph. It can be viewed as a process of iterative message exchange between bit nodes and check nodes. In each iteration, each bit node or check node collects the messages passed from its neighbors, updates its own message and passes the updated message back to its neighbors. The BP algorithm has many modified versions, such as log-domain BP, MSA, and layered BP. All of them originate from the basic log-domain message-passing equations, given as follows.
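In the standard log-domain form, stated here as an assumption consistent with the symbols and the Φ function defined next, the bit-node update, the check-node update (the one later referred to as Equation (2)) and the a posteriori update read:

$$L\left({q}_{ij}\right)=L\left({c}_{i}\right)+\sum_{j'\in {\mathcal{C}}_{i}\setminus j}L\left({r}_{j'i}\right)$$(1)
$$L\left({r}_{ji}\right)=\left(\prod_{i'\in {\mathcal{R}}_{j}\setminus i}\operatorname{sgn}L\left({q}_{i'j}\right)\right)\Phi \left(\sum_{i'\in {\mathcal{R}}_{j}\setminus i}\Phi \left(\left|L\left({q}_{i'j}\right)\right|\right)\right)$$(2)
$$L\left({Q}_{i}\right)=L\left({c}_{i}\right)+\sum_{j\in {\mathcal{C}}_{i}}L\left({r}_{ji}\right)$$(3)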
where L(c_{ i } ) is the initial channel message, L(q_{ ij } ) is the message passed from BN_{ i } to CN_{ j } , L(r_{ ji } ) is the message in the inverse direction, and L(Q_{ i } ) is the a posteriori message of bit node BN_{ i } . ${\mathcal{C}}_{i}$ is the neighbor set of BN_{ i } , ℛ_{ j } is the neighbor set of CN_{ j } , and $\Phi \left(x\right)=\text{log}\frac{{e}^{x}+1}{{e}^{x}-1}$. These equations can also be applied in layered BP; the difference is that L(q_{ ij } ) and L(r_{ ji } ) are updated in each layer of the iteration.
The above equations require the independence of all the messages L(q_{ i′j }), i′ ∈ ℛ_{ j }, and L(r_{ j′i }), j′ ∈ ${\mathcal{C}}_{i}$. However, the existence of "cycles" in the Tanner Graph invalidates this independence assumption and thus degrades the BER performance of the BP algorithm. A length-6 cycle is shown with bold lines in Figure 2. In this case, if the BP algorithm proceeds for more than three iterations, the messages received by the involved bit nodes v_{2}, v_{4}, v_{5} will partly contain their own messages sent three iterations before. For this reason, the minimum cycle length in the Tanner Graph, called the "girth", is strongly related to the BER performance and is considered an important metric in LDPC code construction algorithms (PEG, ACE) [7, 8].
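As an illustration of the girth metric, the following Python sketch computes the girth of a Tanner Graph by running a breadth-first search from every node. The function name `tanner_girth` and the small test matrices are hypothetical; this is a generic graph routine, not the construction algorithm of [7, 8].

```python
from collections import deque


def tanner_girth(H):
    """Length of the shortest cycle in the Tanner Graph of H (inf if none)."""
    m, n = len(H), len(H[0])
    # Nodes 0..n-1 are bit nodes, n..n+m-1 are check nodes.
    adj = [[] for _ in range(n + m)]
    for j in range(m):
        for i in range(n):
            if H[j][i]:
                adj[i].append(n + j)
                adj[n + j].append(i)
    best = float("inf")
    for root in range(n + m):
        dist = {root: 0}
        parent = {root: -1}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # A non-tree edge closes a cycle through the BFS tree.
                    best = min(best, dist[u] + dist[v] + 1)
    return best


four_cycle = [[1, 1],
              [1, 1]]
six_cycle = [[1, 1, 0],
             [0, 1, 1],
             [1, 0, 1]]
```

Because the graph is bipartite, every cycle length found this way is even, and the minimum over all roots is the girth.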
2.3. Decoder architecture and memory conflict
The semi-parallel structure with a layered MMSA core is a popular decoder architecture due to its good trade-off among low complexity, high BER performance and high throughput. As shown in Figure 4, the main components in the top-level architecture include an LLRSUM RAM storing L(Q_{ i } ), an LLREX RAM storing L(r_{ ji } ) and a layered MMSA core pipeline. The two RAMs should be readable and writable. Old values of L(Q_{ i } ) and L(r_{ ji } ) are read, and new values are calculated through the pipeline and written back to the RAMs. For QC-LDPC codes, the values are processed layer by layer, and the "1"s in each layer are processed one by one.
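The read-update-write loop over the two memories can be sketched in software as follows. This is a minimal sequential model, not the pipelined hardware: `layered_min_sum`, the offset value 0.5, the sign convention (positive LLR means bit 0) and the (7,4) Hamming-style row lists are all assumptions for illustration, and every row is assumed to have at least two entries.

```python
def layered_min_sum(rows, channel_llr, max_iter=20, offset=0.5):
    """Layered offset min-sum decoding; rows[j] lists the bits of check j."""
    llr_sum = list(channel_llr)                  # L(Q_i), the LLRSUM memory
    llr_ex = [[0.0] * len(r) for r in rows]      # L(r_ji), the LLREX memory
    hard = [1 if x < 0 else 0 for x in llr_sum]
    for _ in range(max_iter):
        for j, row in enumerate(rows):
            # Read old values and remove this check's previous contribution.
            q = [llr_sum[i] - llr_ex[j][k] for k, i in enumerate(row)]
            mags = [abs(x) for x in q]
            order = sorted(range(len(row)), key=mags.__getitem__)
            min_pos, min1, min2 = order[0], mags[order[0]], mags[order[1]]
            sign_all = 1
            for x in q:
                if x < 0:
                    sign_all = -sign_all
            # Compute new values and write them back.
            for k, i in enumerate(row):
                mag = max((min2 if k == min_pos else min1) - offset, 0.0)
                sign = sign_all * (-1 if q[k] < 0 else 1)
                llr_ex[j][k] = sign * mag
                llr_sum[i] = q[k] + llr_ex[j][k]
        hard = [1 if x < 0 else 0 for x in llr_sum]
        if all(sum(hard[i] for i in row) % 2 == 0 for row in rows):
            break  # all parity checks satisfied
    return hard


rows = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
received = [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]  # all-zero codeword, bit 0 corrupted
decoded = layered_min_sum(rows, received)
```

Since each layer here finishes before the next one starts, the write-back is always visible to the following read; the pipelined decoder loses this guarantee, which is the source of the memory conflicts discussed next.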
Memory conflict is a critical problem that constrains the throughput of the semi-parallel decoder. Essentially, a memory conflict occurs when the read-after-write (RAW) dependency of L(Q_{ i } ) is violated. Note that the new value of L(Q_{ i } ) is not written back to RAM until the pipelined calculation finishes. If L(Q_{ i } ) is needed again during this calculation period, the old value is read while the new one is still being processed; see L(Q_{6}) in Figure 4. This happens when layers j and j + l have "1"s in the same position $i\ \left({\mathbf{H}}_{j,i}^{b}\ge 0,\ {\mathbf{H}}_{j+l,i}^{b}\ge 0\right)$. We call it a gap-l conflict.
Memory conflict slows the decoding convergence and thus reduces the BER performance. The traditional way of handling memory conflicts is to insert idle clocks in the pipeline, at the cost of throughput. The smaller l is, the more idle clocks must be inserted, since the pipeline needs to wait at least K stages before the new values are written back. Usually, the numbers of gap-1, gap-2 and gap-3 conflicts, denoted c_{1}, c_{2} and c_{3}, are used as the metrics of memory conflict.
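Counting gap-l conflicts directly from a base matrix can be sketched as below. The function name `gap_conflicts` and the 3 × 3 example are hypothetical, and the `wrap` option (treating the last layer of one iteration as adjacent to the first layer of the next) is an assumption rather than something the text specifies.

```python
def gap_conflicts(base, gap, wrap=True):
    """Count positions where layers j and j + gap both hold a non-negative entry."""
    m = len(base)
    last = m if wrap else m - gap
    count = 0
    for j in range(last):
        nxt = base[(j + gap) % m]
        count += sum(1 for a, b in zip(base[j], nxt) if a >= 0 and b >= 0)
    return count


# Hypothetical 3 x 3 base matrix (-1 marks a zero block).
base = [[0, -1, 2],
        [1, 3, -1],
        [-1, 0, 1]]
c1 = gap_conflicts(base, 1)
c2 = gap_conflicts(base, 2)
```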
3. The ACO-based pipelining scheduler
In this section, we propose the ACO-based pipeline scheduling algorithm to minimize memory conflicts. We first formulate the problem, then map it to a double-layered TSP and finally use ACO to solve it.
3.1. Problem formulation
Consider a QC-LDPC code described by its base matrix H with dimensions M × N. Thus, there are M layers. Denote w_{m}, 1 ≤ m ≤ M, as the number of elements ("1"s) in the mth layer. Denote h_{m,n}, 1 ≤ n ≤ w_{m}, as the column index in H of the nth element of the mth layer. Additionally, we assume the core pipeline has K stages.
As discussed above, the decoder processes all the "1"s in H exactly once in each iteration, layer by layer, and element by element within each layer. However, the order can be arbitrary, which enables us to schedule the elements carefully to minimize memory conflicts. There are two ways to do this.

Layer permutation: We can choose which layer is processed first and which next. If two layers i, j have "1"s at totally different positions, i.e., no k, l exist such that h_{i,k} = h_{j,l}, they should preferably be assigned as adjacent layers, since no conflict arises between them.

Element permutation: Within a layer, we can choose which element is processed first and which next. If two adjacent layers i, j still have a conflict, i.e., h_{i,k} = h_{j,l} for some k, l, then we can assign element k to be first in layer i and element l to be last in layer j. In this way, we increase the time interval between the conflicting elements k and l.
Therefore, the memory conflict minimization problem is exactly a scheduling problem, in which the layer permutation and element permutation should be designed to minimize the number of idle pipeline clock insertions. We denote the layer permutation as m → λ_{m}, 1 ≤ m, λ_{m} ≤ M, and the element permutation of layer m as n → µ_{m,n}, 1 ≤ n, µ_{m,n} ≤ w_{m}.
Based on the above definitions, a memory conflict occurs between layer i, element k and layer j, element l if the following conditions are satisfied: (1) layers i, j are assigned to be adjacent, i.e., λ_{j} = λ_{i} + 1; (2) the two elements lie in the same column, i.e., h_{i,k} = h_{j,l}; (3) the pipeline time interval does not exceed the number of pipeline stages, i.e., w_{i} − µ_{i,k} + µ_{j,l} ≤ K. Further, we define the "conflict set" $\mathcal{C}(i,j)=\{(k,l)\mid \text{elements } (i,k) \text{ and } (j,l) \text{ cause a memory conflict}\}$, and the "conflict stages" $\mathcal{C}(i,k;j,l)$ as the minimum number of idle clocks that must be inserted because of this conflict.
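The conflict conditions can be checked mechanically, as in the following Python sketch. Two assumptions are made and should be read as such: condition (2) is taken to mean that both elements share a column (h_{i,k} = h_{j,l}), and the conflict stages are taken to be K − (w_{i} − µ_{i,k} + µ_{j,l}) + 1, i.e., the smallest number of inserted clocks that makes the interval exceed K under condition (3). The function names and the toy schedule are hypothetical.

```python
def interval(h, mu, i, j, k, l):
    """Clocks between element k of layer i and element l of the next layer j."""
    return len(h[i]) - mu[i][k - 1] + mu[j][l - 1]


def conflict_set(h, mu, i, j, K):
    """Pairs (k, l), 1-based, that collide when layer j directly follows layer i."""
    return [(k, l)
            for k, col_k in enumerate(h[i], start=1)
            for l, col_l in enumerate(h[j], start=1)
            if col_k == col_l and interval(h, mu, i, j, k, l) <= K]


def conflict_stages(h, mu, i, j, k, l, K):
    """Assumed idle-clock count: just enough to push the interval past K."""
    return K - interval(h, mu, i, j, k, l) + 1


def layer_distance(h, mu, i, j, K):
    """Worst conflict between adjacent layers i -> j (0 if they do not collide)."""
    return max((conflict_stages(h, mu, i, j, k, l, K)
                for k, l in conflict_set(h, mu, i, j, K)), default=0)


h = [[0, 2, 5], [2, 3, 4]]           # column indices of the elements of two layers
natural = [[1, 2, 3], [1, 2, 3]]     # elements processed in their stored order
tuned = [[2, 1, 3], [3, 1, 2]]       # shared column 2 moved first / last
```

With K = 4, the natural order leaves only two clocks between the two accesses to column 2, while the tuned element permutation separates them by five clocks and removes the conflict.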
3.2. The double-layered TSP
This part introduces the mapping from the above memory conflict minimization problem to a double-layered TSP. The TSP is a famous NP-hard problem in which a salesman must find the shortest path that visits each of n cities exactly once and finally returns to the starting point. Denote d_{i,j} as the distance between city i and city j. The TSP can be mathematically described as follows: given the distance matrix D = [d_{i,j}]_{n×n}, find the permutation of the city indices x_{1}, x_{2}, ..., x_{n} that minimizes the loop distance, i.e., the sum of the distances between consecutive cities plus the distance from the last city back to the first.
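In symbols, and assuming the standard closed-tour form that the min-max objective (8) also mirrors, the target function can be written as:

$$\text{min}\left(\sum_{t=1}^{n-1}{d}_{{x}_{t},{x}_{t+1}}+{d}_{{x}_{n},{x}_{1}}\right)$$(5)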
Compared to layer permutation, which contributes most of the memory conflict reduction, element permutation only makes minor adjustments once the layer permutation is determined. Therefore, we map the problem to a double-layered TSP, where layer permutation is mapped to the first layer, and element permutation is mapped to the second layer based on the result of the first layer. Details are described as follows:

Layer permutation layer: In this layer we only deal with layer permutation. We define the "distance", or "cost", between layers i and j as the minimum number of idle clocks inserted before the processing of layer j. If more than one conflicting position pair exists, i.e., $\left|\mathcal{C}(i,j)\right|>1$, we take the maximum one. Thus in this layer, the distance matrix is defined by
$${d}_{i,j}=\underset{\left(k,l\right)\in \mathcal{C}\left(i,j\right)}{\text{max}}\ \mathcal{C}\left(i,k;j,l\right)$$(6)
and the target function remains the same as (5).

Element permutation layer: In this layer we inherit the layer permutation result and map the element permutation of each layer to an independent TSP. In the TSP for layer i, we fix the schedule of the prior layer p (λ_{p} = λ_{i} − 1) and the next layer q (λ_{q} = λ_{i} + 1), and only tune the elements of layer i. We define the "distance" d_{k,l} as the change in the number of idle clocks if element k is assigned to position l, i.e., µ_{i,k} = l. Note that element k can conflict with layer p or q, and d_{k,l} differs between the conflict cases, as given by
$${d}_{k,l}=\begin{cases}0 & \text{both conflict or neither conflict}\\ k-l & k\ \text{only conflicts with layer}\ p\\ l-k & k\ \text{only conflicts with layer}\ q\end{cases}$$(7)
Since the largest d_{k,l} becomes the bottleneck of element permutation, the target function should change to the following max form:
$$\text{min}\ \text{max}\left\{{d}_{{x}_{1},{x}_{2}},{d}_{{x}_{2},{x}_{3}},\dots ,{d}_{{x}_{n-1},{x}_{n}},{d}_{{x}_{n},{x}_{1}}\right\}$$(8)
3.3. The ACO-based algorithm
This part introduces the ACO-based algorithm for solving the double-layered TSP discussed above. ACO is a heuristic algorithm for computational problems that can be reduced to finding good paths through graphs. Its idea originates from mimicking the behavior of ants seeking a path between their colony and a source of food. ACO is especially suitable for solving the TSP.
Algorithm 1 [see Additional file 1] gives the ACO-based double-layered memory conflict minimization algorithm. First we try layer permutation LAYER1_MAX times, and for each layer permutation, we try element permutation LAYER2_MAX times. We record the pipeline schedule with the fewest idle clocks as the best solution.
The detailed ACO algorithm for TSP is described in Algorithm 2. We try SOL_MAX solutions, and for each solution, all ants finish CYCLE_MAX cycles, of which the shortest is recorded as the best solution. One ant cycle is finished in VERTEX_NUM ant-move steps, where one step consists of four sub-steps: Ant Choose, Ant Move, Local Update and Global Update. Further, a Bonus is rewarded to the shortest cycle. All specific parameters (e.g., ρ and φ) follow the suggestions of [17].
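Since Algorithm 2 is not reproduced here, the following Python sketch shows one way the four sub-steps could fit together. It is a simplified ant-colony-system style loop, not the authors' algorithm: the names `aco_tsp` and `tour_cost`, the parameters `rho`, `phi`, `tau0`, `alpha`, `beta`, the heuristic 1/(1 + d) and the fixed bonus level of 1.0 are all assumptions, and distances are assumed non-negative. The `objective` argument switches between the sum form and the max form of the target function.

```python
import random


def tour_cost(d, tour, objective=sum):
    """Closed-tour cost; objective=sum is the loop distance, max the bottleneck."""
    n = len(tour)
    return objective(d[tour[t]][tour[(t + 1) % n]] for t in range(n))


def aco_tsp(d, n_ants=8, n_cycles=40, alpha=1.0, beta=2.0,
            rho=0.1, phi=0.1, tau0=0.1, objective=sum, seed=0):
    rng = random.Random(seed)
    n = len(d)
    tau = [[tau0] * n for _ in range(n)]   # pheromone on each directed edge
    best_tour, best_cost = None, float("inf")
    for _ in range(n_cycles):
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                cur = tour[-1]
                cand = sorted(unvisited)
                # Ant Choose: prefer strong pheromone and short edges.
                weights = [tau[cur][c] ** alpha * (1.0 / (1.0 + d[cur][c])) ** beta
                           for c in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                # Ant Move
                tour.append(nxt)
                unvisited.remove(nxt)
                # Local Update: pull the used edge back toward tau0.
                tau[cur][nxt] = (1 - phi) * tau[cur][nxt] + phi * tau0
            cost = tour_cost(d, tour, objective)
            if cost < best_cost:
                best_tour, best_cost = tour, cost
        # Global Update: bonus on the edges of the best tour found so far.
        for t in range(n):
            a, b = best_tour[t], best_tour[(t + 1) % n]
            tau[a][b] = (1 - rho) * tau[a][b] + rho * 1.0
    return best_tour, best_cost


# Hypothetical symmetric 4-city distance matrix.
d = [[0, 1, 4, 2],
     [1, 0, 1, 5],
     [4, 1, 0, 1],
     [2, 5, 1, 0]]
tour, cost = aco_tsp(d)
tour_max, cost_max = aco_tsp(d, objective=max)
```

The pheromone matrix is kept directed because the layer distance d_{i,j} is generally not symmetric: the idle clocks needed when layer j follows layer i differ from those needed in the reverse order.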
4. The SA-based code constructor
In this section, we propose a jointly optimized construction algorithm that takes both performance and efficiency into consideration when constructing the H matrix of the LDPC code. We first give the SA-based framework and then discuss the details of the algorithm.
4.1. Problem formulation
We now deal with the classic code construction problem. Given the code length N, code rate R, and perhaps other constraints such as the QC-RA type (e.g., WiMAX, DVB-S2) or a fixed degree distribution (optimized by density evolution), we should construct a "good" LDPC code, described by its H matrix, that meets practical needs. The word "good" here mainly refers to the following two metrics.

High performance, which means the code should have high coding gain and good BER/BLER performance, including an early waterfall region, a low error floor and anti-fading ability. This is strongly related to large girth, a large ACE spectrum, few trapping sets, etc.

High efficiency, which means the implementation of the encoder and decoder should have moderate complexity and high throughput. This is strongly related to the QC-RA type, a high degree of parallelism, a short decoding pipeline, few memory conflicts, etc.
Traditional construction methods, such as PEG and ACE, mainly focus on the performance of the code, which motivates us to find a jointly optimized construction method concerning both performance and efficiency.
4.2. The double-stage SA framework
In this part, we introduce the double-stage SA [18] based framework for the jointly optimized construction problem. SA is a generic probabilistic metaheuristic for global optimization, which locates a good approximation to the global optimum of a given function in a large search space. Since our search space is a large 0-1 matrix space, denoted as {0, 1}^{M×N}, SA is well suited to this problem.
Note that the performance metric is more important than the efficiency metric for LDPC construction. Therefore, we divide the algorithm into two stages, aiming at performance and efficiency, respectively, and regard performance as the major stage that should be satisfied first. For a specific target measured by "performance energy" e_{1} and "efficiency energy" e_{2}, we set two thresholds: an upper bound e_{1h} = e_{1} and a lower bound e_{1l} < e_{1}. The algorithm enters the second stage when the current performance energy is less than e_{1l}. In the second stage, the algorithm keeps the performance energy no larger than e_{1h} and tries to reduce e_{2}. Algorithm 3 shows the details.
4.3. Details of the algorithm
This part discusses the details of the important functions and configurations of Algorithm 3.

sample_temperature is the temperature sampling function, decreasing with k. It can be an exponential form αe^{−βk}.

prob is the acceptance probability function of the new search point h_new. If h_new is better (E_new < E), it returns 1; otherwise, it decreases with E_new − E and increases with t. It can take an exponential form αe^{−β(E_new−E)/t}.

perf_energy is the performance energy function. It evaluates the performance-related factors of the matrix h and gives a lower energy for better performance. Typically, we can count the number of length-l cycles c_{l}, then calculate a total cost given by ∑_{l} w_{l}c_{l}, where w_{l} is the cost weight of a length-l cycle, decreasing with l.

effi_energy is the efficiency energy function, similar to perf_energy except that it gives a lower energy for higher efficiency. Typically, we can count the number of gap-l memory conflicts c_{l}, then calculate a total cost given by ∑_{l} w_{l}c_{l}, where w_{l} is the cost weight of a gap-l conflict, decreasing with l.

perf_neighbor searches for a neighbor of h in the matrix space when aiming at performance, based on minor changes of h. For QC-LDPC codes, we can define three atomic operations on the base matrix H^{b} as follows.

Horizontal swap: For chosen rows i, j and columns k, l, swap the values of ${\mathbf{H}}_{i,k}^{b}$ and ${\mathbf{H}}_{i,l}^{b}$, then swap the values of ${\mathbf{H}}_{j,k}^{b}$ and ${\mathbf{H}}_{j,l}^{b}$.

Vertical swap: For chosen rows i, j and columns k, l, swap the values of ${\mathbf{H}}_{i,k}^{b}$ and ${\mathbf{H}}_{j,k}^{b}$, then swap the values of ${\mathbf{H}}_{i,l}^{b}$ and ${\mathbf{H}}_{j,l}^{b}$.

Permutation change: Change the permutation factor for chosen element ${\mathbf{H}}_{i,k}^{b}$.
For a higher temperature t, we allow the neighbor searching process to search in a wider space. This is done by performing the atomic operations more times.


effi_neighbor searches for a neighbor of h in the matrix space when aiming at efficiency. This is similar to perf_neighbor; however, we should typically remove the permutation change operation, as it does nothing to reduce conflicts.
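The way these functions interact in the two stages can be sketched as follows. Algorithm 3 is not reproduced in the text, so this is only a plausible skeleton under stated assumptions: the loop structure, the stopping target `e2_target`, the default constants and the temperature floor are hypothetical, and the toy state (x, y) with |x| and |y| as energies merely stands in for a base matrix with its cycle and conflict costs.

```python
import math
import random


def sample_temperature(k, alpha=1.0, beta=0.01):
    """Exponential schedule alpha * exp(-beta * k), decreasing with k."""
    return alpha * math.exp(-beta * k)


def prob(e, e_new, t, alpha=1.0, beta=1.0):
    """Acceptance probability: 1 for an improvement, exponential decay otherwise."""
    if e_new < e:
        return 1.0
    return min(1.0, alpha * math.exp(-beta * (e_new - e) / t))


def double_stage_sa(h, perf_energy, effi_energy, perf_neighbor, effi_neighbor,
                    e1_low, e1_high, e2_target, k_max=5000, seed=0):
    rng = random.Random(seed)
    stage = 1
    for k in range(k_max):
        t = max(sample_temperature(k), 1e-12)   # floor avoids division by zero
        if stage == 1:
            if perf_energy(h) < e1_low:
                stage = 2                        # performance target reached
                continue
            h_new = perf_neighbor(h, t, rng)
            e, e_new = perf_energy(h), perf_energy(h_new)
        else:
            if effi_energy(h) <= e2_target:
                break
            h_new = effi_neighbor(h, t, rng)
            if perf_energy(h_new) > e1_high:
                continue                         # would violate the performance bound
            e, e_new = effi_energy(h), effi_energy(h_new)
        if rng.random() < prob(e, e_new, t):
            h = h_new
    return h


def step(v, t, rng):
    """Toy neighbor move; a higher temperature allows a wider step."""
    return v + rng.choice((-1, 1)) * (1 + int(2 * t))


result = double_stage_sa(
    (6, 6),
    perf_energy=lambda s: abs(s[0]),
    effi_energy=lambda s: abs(s[1]),
    perf_neighbor=lambda s, t, rng: (step(s[0], t, rng), s[1]),
    effi_neighbor=lambda s, t, rng: (s[0], step(s[1], t, rng)),
    e1_low=1, e1_high=2, e2_target=0)
```

For a real code search, the state would be the base matrix H^{b}, the neighbor functions would apply the atomic swap and permutation operations, and the energies would be the weighted cycle and conflict counts described above.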
5. The GPU-based performance evaluator
In this section, we introduce the implementation of a high-speed LDPC verification platform based on compute unified device architecture (CUDA) supported GPUs. We first give the architecture and algorithm on GPU, and then discuss some details.
5.1. Motivation and architecture
Compute unified device architecture is NVIDIA's parallel computing architecture. It enables dramatic increases in computing performance by executing many parallel, independent and cooperating threads on the GPU, and is thus particularly suitable for the Monte Carlo model. The BER simulation of an LDPC code is a Monte Carlo simulation, since it collects a huge amount of bit error statistics from the same decoding process, especially in the error floor region where the BER is low (10^{−7} to 10^{−10}). This motivates us to implement the verification platform on GPU, where many decoders run in parallel, like hardware such as ASIC/FPGA, to provide statistics.
Figure 5 shows our GPU architecture. The CPU is used as the controller, which puts the code into GPU constant memory, launches the GPU kernels and collects the statistics. In the GPU grid, we implement the whole coding system in each GPU block, including the source generator, LDPC encoder, AWGN channel, LDPC decoder and statistics. Our decoding algorithm is layered MMSA. In each GPU block, we assign z_{ f } threads to calculate the new LLRSUM and LLREX of the z_{ f } rows in each layer, where z_{ f } is the expansion factor of the QC-LDPC code. The z_{ f } threads cooperate to complete the decoding job.
5.2. Algorithm and procedure
This part introduces the procedure that implements the GPU simulation, given by Algorithm 4. P × Q blocks run in parallel, each simulating an individual coding system, where P is the number of multiprocessors (MPs) on the device and Q is the number of cores per MP. In each system, z_{ f } threads cooperatively perform encoding, channel simulation and decoding. When decoding, the threads process data layer after layer, each thread performing LMMSA for one row of the layer. The procedure ends with the statistics of P × Q LDPC blocks.
5.3. Details and instructions

Ensure "coalesced access" when reading or writing global memory; otherwise the operation will be automatically serialized. In our algorithm, adjacent threads access adjacent L(Q_{ i }) and L(r_{ ji }).

Shared memory and registers are fast yet limited resources and their use should be carefully planned. In our algorithm, we store L(Q_{ i }) in shared memory and L(r_{ ji }) in registers due to the lack of resources.

Make sure all the P × Q cores are running. This calls for careful assignment of limited resources (i.e., warps, shared memory, registers). In our case, we limit the registers per thread to 16 and the threads per block to 128; otherwise some of the Q cores on each MP will "starve" and be disabled.
6. Hardware implementation schemes
6.1. Top-level hardware architecture
Our goal is to implement a multi-mode high-throughput QC-LDPC decoder, which can support multiple code rates and expansion factors on-the-fly. The proposed decoder consists of three main parts, namely, the interface part, the execution part and the control part. The top-level architecture is shown in Figure 6.
The interface part buffers the input and output data and handles the configuration commands. In the execution part, the LLRSUM and LLREX are read out from the RAMs, updated in the Σ parallel LMMSA cores, and written back to the RAMs, thus forming the LLRSUM loop and the LLREX loop, as marked red in Figure 6. The control part generates control signals, including port control, LLRSUM control, LLREX control and iteration control.
Note that the reconfigurable switch network is placed in the LLRSUM loop to support the multi-mode feature. To achieve high throughput, we propose the split-row MMSA core, the early-stopping scheme and the multi-block scheme. The split-row core has two data inputs and two data outputs; hence it also "splits" the LLRSUM RAM and LLREX RAM into two parts, and two identical switch networks are needed to shuffle the data simultaneously. We also propose the offset-threshold decoding scheme to improve BER/BLER performance. The above five techniques are described in detail as follows.
6.2. The reconfigurable switch network
A switch network is an S-input, S-output hardware structure that can place the input signals in arbitrary order at the output. Formally, given input signals x_{1}, x_{2}, ..., x_{S} with data width W, the output of the switch network has the form ${x}_{{a}_{1}},{x}_{{a}_{2}},\dots ,{x}_{{a}_{S}}$, where a_{1}, a_{2}, ..., a_{S} is any desired permutation of 1, 2, ..., S. For the design of reconfigurable LDPC decoders, two special kinds of output order are particularly important, described as follows.

Full cyclic-shift: The output is a cyclic shift of all S inputs, i.e., x_{c}, x_{c+1}, ..., x_{S}, x_{1}, x_{2}, ..., x_{c−1}, where 1 ≤ c ≤ S.

Partial cyclic-shift: The output is a cyclic shift of the first p inputs, while the other signals can be in arbitrary order, i.e., x_{c}, x_{c+1}, ..., x_{p}, x_{1}, x_{2}, ..., x_{c−1}, x_{*}, ..., x_{*}, where 1 ≤ c < p < S, and x_{*} can be any signal from x_{p+1} to x_{S}.
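The two output orders can be written down directly in Python as a behavioral reference. The function names are hypothetical, c is 1-based as in the text, and the don't-care outputs of the partial case are simply passed through in their original order, which is only one of the permitted choices.

```python
def full_cyclic_shift(x, c):
    """Return x_c, x_{c+1}, ..., x_S, x_1, ..., x_{c-1} (c is 1-based)."""
    return x[c - 1:] + x[:c - 1]


def partial_cyclic_shift(x, p, c):
    """Cyclic-shift the first p signals; the remaining outputs are don't-cares."""
    return x[c - 1:p] + x[:c - 1] + x[p:]


signals = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = full_cyclic_shift(signals, 3)
partial = partial_cyclic_shift(signals, 5, 2)
```

A hardware switch network realizes the same mapping with a fixed number of multiplexer stages; this sketch only specifies the expected output order.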
For the implementation of a QC-LDPC decoder, the switch network is an essential module. Suppose ${H}_{j,i}^{b}\ne {H}_{k,i}^{b}\ge 0,\ j<k$, and ${H}_{l,i}^{b}=-1$ for any $j<l<k$; then the same data, i.e., the LLRSUM and LLREX of BN_{i × z_f} to BN_{(i+1) × z_f−1}, is involved in the processing of the above two "1"s. However, after processing ${H}_{j,i}^{b}$, this data must be cyclic-shifted to ensure the correct order for processing ${H}_{k,i}^{b}$, which corresponds to the full cyclic-shift case.
Further, in the case of multiple expansion factors, such as WiMAX [19] (z_{ f } = 24: 4: 96), the partial cyclic-shift is required.
The existing schemes for implementing switch networks include the MSCS network [20] and the Benes network [13, 21, 22]. The former can handle the case when S is not a power of 2, while the latter has been shown to be more efficient in area and gate count. In [13], the most efficient on-the-fly generation method of control signals is proposed. Therefore, we adopt the Benes network proposed in [13] for our decoder. The structure is shown in Figure 7 and the features are given in Table 1.
6.3. The offset-threshold decoding scheme
In this part, we propose the offset-threshold decoding method, which is adopted in our decoder architecture. Unlike existing modifications of MSA [5], the proposed scheme uses an offset-threshold correction to further improve the BER/BLER performance.
The traditional MSA is a simplified version of BP, obtained by replacing the complicated Equation (2) with a simple min operation, shown as follows.
In [5], the normalized and offset MMSA schemes (13) (14) are proposed to compensate for the loss of the above approximation, described as follows.
In our simulation, for BLER, the offset MMSA performs better than the normalized one. However, as to BER, both schemes show an error floor at 10^{−6}, as shown in Figure 8. The problem is that the offset MMSA works well in most cases, while in a few cases the decoding fails with many bit errors in one block. The intuitive explanation of this phenomenon is the existence of extremely large likelihoods (L(q_{ ij } ), L(r_{ ji } )). In the high SNR region, the L(q_{ ij } ) likelihoods converge quickly to large values, in some cases for both correct and wrong bits. The wrong bits not only remain wrong, but also propagate large L(r_{ ji } ) to other bits, resulting in more wrong bits and finally decoding failure. For this reason, we set a threshold on top of the offset MMSA to prevent the likelihoods from becoming extremely large, which leads to the proposed offset-threshold scheme, given by the following equation.
The differences between the traditional MSA, normalized MMSA, offset MMSA and offset-threshold MMSA are shown in Figure 9. The simulation result (Figure 8) shows that the proposed scheme has the lowest error floor (10^{−8}) among the above schemes, while achieving BLER performance as good as that of the offset MMSA.
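The four magnitude rules can be compared side by side in a short Python sketch. The normalized and offset forms follow the usual definitions (scaling the minimum by a factor, or subtracting an offset and flooring at zero); the offset-threshold form is written here as an assumption, namely the offset result clipped at a threshold, because the defining equation is not reproduced in the text. The function name and the constants `alpha`, `beta` and `threshold` are hypothetical.

```python
def check_magnitude(mags, scheme, alpha=0.75, beta=0.5, threshold=15.0):
    """Magnitude of L(r_ji) given the magnitudes |L(q_i'j)| of the other bits."""
    m = min(mags)
    if scheme == "msa":
        return m
    if scheme == "normalized":
        return alpha * m
    if scheme == "offset":
        return max(m - beta, 0.0)
    if scheme == "offset-threshold":
        # Assumed form: offset correction followed by clipping.
        return min(max(m - beta, 0.0), threshold)
    raise ValueError(f"unknown scheme: {scheme}")


typical = [3.0, 1.0, 20.0]
saturated = [30.0, 40.0]
```

For typical inputs the offset and offset-threshold rules agree; they differ only when all incoming magnitudes are very large, which is exactly the situation blamed above for the error floor.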
6.4. The split-row MMSA core
This part presents the split-row MMSA core. In the traditional semi-parallel structure with a layered MMSA core (see Figure 4), since the "1"s in the jth row are processed one by one to find the minimum and sub-minimum of all L(q_{ ij } ), the number of decoding stages K for one iteration is proportional to the number of "1"s in each row of the base matrix H^{b}. The idea is that, if k "1"s can be processed at the same time, the decoding time of one iteration is shortened by a factor of k, and the throughput gains a factor of k. This is done by the split-row scheme, which vertically splits H^{b} into multiple parts. The "1"s in each part are processed simultaneously to find the local minimum, and the results are merged together. In this way, for H^{b} with maximum row weight w, the minimum and sub-minimum can be obtained in w/k clocks. Taking Figure 2 as an example, we split the 4 × 6 H^{b} into two parts, each having one or two "1"s in every row. The corresponding architecture with k = 2 is shown in Figure 10. The LLRSUM (L(Q_{ i } )), LLREX (L(r_{ ji } )) and LLR (L(q_{ ij } )) of the left part and right part are stored in two individual RAM/FIFOs, respectively. Two minimum/sub-minimum finders pass their results to the merger for final comparison, thus approximately halving the processing pipeline. Note that a split position must exist for the code H^{b} such that each row in each part contains nearly the same number of "1"s. Otherwise, we need RAMs with multiple read ports and write ports, which is not practical for FPGA implementation.
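The finder-and-merger idea for k = 2 can be modeled in a few lines of Python. The function names and sample magnitudes are hypothetical; each call to `two_smallest` stands for one finder scanning its half of the row one value per clock, and `merge` stands for the final comparison.

```python
def two_smallest(values):
    """Return (minimum, sub-minimum, index of minimum), one value per step."""
    m1, m2, idx = float("inf"), float("inf"), -1
    for k, v in enumerate(values):
        if v < m1:
            m1, m2, idx = v, m1, k
        elif v < m2:
            m2 = v
    return m1, m2, idx


def merge(left, right, right_offset):
    """Combine the results of the two finders into the row-wide result."""
    (a1, a2, ai), (b1, b2, bi) = left, right
    if a1 <= b1:
        return a1, min(a2, b1), ai
    return b1, min(b2, a1), bi + right_offset


def split_row_min(mags, split):
    """Minimum and sub-minimum of a row split into two parts at `split`."""
    return merge(two_smallest(mags[:split]), two_smallest(mags[split:]), split)


mags = [3.0, 0.5, 2.0, 0.7, 4.0, 1.0]
result = split_row_min(mags, 3)
```

Each finder sees only half of the row, so the scan takes about half as many steps as a single finder, at the cost of one extra merge comparison.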
6.5. The early-stopping scheme
This part introduces the early-stopping scheme applied in our decoder. In practical scenarios, the decoding process often converges well before the preset maximum number of iterations is reached, especially under favorable transmission conditions when the SNR is large. Thus, if the decoder can terminate the decoding iterations as soon as it detects convergence, both the power consumption of the circuit and the decoding delay can be reduced. Throughput is also increased if the system dynamically adjusts the transmission rate according to the statistics of the average number of iterations under the current channel state.
Traditional stopping criteria focus on whether the code can be decoded successfully or not. They either cost too many extra resources to store iteration parameters, such as HDA [23] and NSPC [24], or use floating-point calculation to evaluate the current iteration state, such as VNR [25] and CMM [23]. None of these methods is suitable for hardware implementation.
Here, we propose a simple and effective scheme to detect the convergence of the decoding. "Convergence" means that, at some point, all of the hard decisions sgn (L(Q_{ i } )) satisfy the check equations. Detecting convergence usually demands parallel evaluation of every equation. However, owing to the layered (QC) structure and the limited hardware resources, we use a semi-parallel algorithm to implement the iteration-stopping module, which evaluates one layer (z_{ f } equations) simultaneously. If the number of consecutive layers that pass the check reaches a threshold ω, the module triggers a signal indicating that the decoding has converged and the iteration can be stopped.
One important issue of Algorithm 5 is the choice of the threshold ω. The BER/BLER performance and the average number of iterations for different ω are shown in Figure 11, where the stopping criterion of the ideal iteration is H c^{T} = 0. We choose ω = 2.5 × M to achieve a tradeoff between time and performance. In this case, if the average number of iterations is I_{ ave } (ideal iteration case), the decoding terminates at approximately I_{ ave } + 2 iterations.
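The counting logic of the semi-parallel detector can be sketched as below. This is a simplified Python model under stated assumptions, not a transcription of Algorithm 5: the per-layer check result is taken as a ready-made boolean, and the function name and its interface are hypothetical.

```python
def layers_until_stop(layer_checks, omega):
    """Return how many layers are processed before the stop signal fires.

    layer_checks: iterable of booleans, one per processed layer, True when
        all z_f parity checks of that layer are satisfied by the current
        hard decisions.
    omega: required number of consecutive successful layers.
    Returns None if the stop signal never fires.
    """
    run = 0  # length of the current run of successful layers
    for count, ok in enumerate(layer_checks, start=1):
        run = run + 1 if ok else 0  # any failed layer resets the counter
        if run >= omega:
            return count
    return None
```

Because a single failed layer resets the counter, a larger ω makes a false stop less likely but delays termination, which is the tradeoff discussed above.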
6.6. The multi-block scheme
Suppose the LDPC code with code length N and expansion factor z_{ f } still has serious memory conflicts after being optimized by our SA and ACO algorithms, which is common for large z_{ f } and relatively small N. To address this problem, we propose a hardware method called "multi-block" to further avoid memory conflicts and increase pipeline efficiency. The "multi-block" scheme is explained as follows.
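We construct a new matrix H_{ v } from the parity-check matrix H with M rows:
$$\mathbf{H}_{v} = \begin{pmatrix} \mathbf{H} & \mathbf{0} \\ \mathbf{0} & \mathbf{H} \end{pmatrix}$$
Here the "virtual matrix" H_{ v } is the combination of two codes H without any "cross constraint" (an edge between nodes from different codes in the Tanner Graph) between them. Suppose v_{1}, v_{2} are any two legal encoded blocks satisfying
$$\mathbf{H}\mathbf{v}_{1}^{T} = \mathbf{0}, \qquad \mathbf{H}\mathbf{v}_{2}^{T} = \mathbf{0}$$
Then the vector (v_{1} v_{2}) is also a legal block for H_{ v }:
$$\mathbf{H}_{v} \begin{pmatrix} \mathbf{v}_{1} & \mathbf{v}_{2} \end{pmatrix}^{T} = \begin{pmatrix} \mathbf{H}\mathbf{v}_{1}^{T} \\ \mathbf{H}\mathbf{v}_{2}^{T} \end{pmatrix} = \mathbf{0}$$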
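The key observation is that there are no memory conflicts between the two codes H owing to the diagonal form of H_{ v }. This enables us to reorder and combine the decoding schedules of the two codes to reduce the memory conflicts of each code. We rewrite H and H_{ v } (up to a reordering of rows) as follows:
$$\mathbf{H} = \begin{pmatrix} \mathbf{H}_{1} \\ \mathbf{H}_{2} \\ \vdots \\ \mathbf{H}_{M} \end{pmatrix}, \qquad \mathbf{H}_{v} = \begin{pmatrix} \mathbf{H}_{1}^{(1)} \\ \mathbf{H}_{1}^{(2)} \\ \mathbf{H}_{2}^{(1)} \\ \mathbf{H}_{2}^{(2)} \\ \vdots \\ \mathbf{H}_{M}^{(1)} \\ \mathbf{H}_{M}^{(2)} \end{pmatrix}$$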
where ${\mathbf{H}}_{i}^{\left(j\right)}$ denotes the ith row of the jth code. The decoding schedule is given by the above equation, i.e., ${\mathbf{H}}_{i}^{\left(1\right)}$ comes first, followed by ${\mathbf{H}}_{i}^{\left(2\right)}$, then ${\mathbf{H}}_{i+1}^{\left(1\right)}$, and so on. The benefit of this "multi-block" scheme is that the insertion of ${\mathbf{H}}_{i}^{\left(2\right)}$ provides extra stages that separate the conflicting layers ${\mathbf{H}}_{i}^{\left(1\right)}$ and ${\mathbf{H}}_{i+1}^{\left(1\right)}$.
To sum up, the "multi-block" scheme changes any gap-l memory conflict into a gap-(2l − 1) conflict, and can thus improve the pipeline efficiency significantly. Meanwhile, it demands no extra logic resources (LE) for the design, but may double the memory bits needed to buffer two encoded blocks. Since the memory depth is not fully used on our FPGA, the proposed method can exploit the unused depth at no extra resource cost.
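The interleaved schedule described above can be sketched in a few lines of Python. This is only an illustration of the layer ordering; the function names are hypothetical, and layers are numbered from 1 to match the row index i.

```python
def multiblock_schedule(num_layers):
    """Interleave the layers of two independent blocks of the same code.

    Returns (code, layer) pairs: (1, 1), (2, 1), (1, 2), (2, 2), ...
    """
    schedule = []
    for layer in range(1, num_layers + 1):
        schedule.append((1, layer))  # layer i of the first block
        schedule.append((2, layer))  # layer i of the second block
    return schedule


def slots_between(schedule, code, layer_a, layer_b):
    """Number of schedule slots separating two layers of the same block."""
    return abs(schedule.index((code, layer_a)) - schedule.index((code, layer_b)))
```

In the resulting order, consecutive layers of one block are always separated by a layer of the other block, and since the two blocks share no variable nodes, the inserted layer cannot conflict with either neighbour.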
7. Numerical simulation
In this section, we show how our platform produces "good" LDPC codes with outstanding decoding performance and hardware efficiency. For comparison, we target the WiMAX LDPC code (N = 2304, R = 0.5, z_{ f } = 96). We use the same parameters and degree distributions as WiMAX for our SA-based constructor. We use "cycle" as the performance metric and memory conflict as the efficiency metric. The performance of one of the candidate codes and that of the WiMAX code are listed in Table 2. The candidate code has far fewer length-6/8 cycles and gap-1/2/3 memory conflicts. Usually, the candidate codes can eliminate length-6 cycles and gap-1 conflicts, which ensures a girth of at least 8 and no conflict under a short pipeline (when K ≤ w_{ m } ).
We simulate the candidate code and the WiMAX code on the GPU platform. The BER/BLER performance is shown in Figure 12, while the platform parameters and throughput are listed in Table 3. The waterfall region and the error floor of our candidate code are almost the same as those of the WiMAX code. For speed comparison, we also include the fastest result reported to date [12]. The "net throughput" is defined as the number of decoded "message bits" per second, given by:
where t is the time consumed in running the GPU kernel (in our case, Algorithm 4). As shown in Table 3, our GPU platform achieves a 490-fold speedup over the CPU and a net throughput of 24.5 Mbps. Further, our throughput approaches the fastest reported one, while providing better precision (floating-point vs. 8-bit fixed-point) for the simulation.
Finally, we optimize the pipeline schedule with the ACO-based scheduler; the results are shown in Table 2. The "pipeline occupancy" is the ratio of running clocks to total clocks required for one iteration. For the candidate code, the number of idle clock insertions is 5 after ACO, compared with 12 before ACO, a 58.3% reduction. For the WiMAX code, 20 idle clock insertions are still required after the layer-permutation-only (single-layer) scheme proposed in [11]. In this case, the double-layered ACO achieves a 75% reduction against the single-layer scheme (5 vs. 20 idle clocks).
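The reduction figures quoted above follow directly from the idle-clock counts, as the short Python check below shows. The occupancy helper assumes that the total clocks of one iteration are simply the running clocks plus the inserted idle clocks; that decomposition and both function names are assumptions made for illustration.

```python
def reduction_percent(before, after):
    """Relative reduction of idle clock insertions, in percent."""
    return 100.0 * (before - after) / before


def pipeline_occupancy(running_clocks, idle_clocks):
    """Running/total clocks for one iteration (assumes total = running + idle)."""
    return running_clocks / (running_clocks + idle_clocks)


# Idle-clock counts taken from the text above.
print(round(reduction_percent(12, 5), 1))  # candidate code, before vs. after ACO
print(round(reduction_percent(20, 5), 1))  # single-layer scheme vs. double-layered ACO
```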
8. The multi-mode high-throughput decoder
Based on the above techniques, namely, the reconfigurable switch network, offset-threshold decoding, the split-row MMSA core, the early-stopping scheme and the multi-block scheme, we implement the multi-mode high-throughput LDPC decoder on an Altera Stratix III FPGA. The proposed decoder supports 27 modes, comprising nine code lengths and three code rates, and a maximum of 31 iterations. The configurations for code length, code rate, and iteration number are completely on-the-fly. Further, it has a BER gap of less than 0.2 dB against the floating-point LMMSA, while achieving a stable net throughput of 721.58 Mbps at code rate R = 1/2 and 20 iterations (corresponding to a bit throughput of 1.44 Gbps). With the early-stopping module enabled, the net throughput rises to 1.2 Gbps (bit throughput 2.4 Gbps), calculated for an average of 12 iterations. The features are listed in Table 4.
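The relation between the quoted throughput figures can be checked with a few lines of Python. The sketch assumes that the bit throughput equals the net throughput divided by the code rate, and that throughput scales inversely with the number of iterations; both are simplifying assumptions that happen to reproduce the numbers above, and the function names are hypothetical.

```python
def bit_throughput(net_mbps, rate):
    """Coded-bit throughput from net (message-bit) throughput and code rate."""
    return net_mbps / rate


def scale_with_iterations(net_mbps, ref_iterations, avg_iterations):
    """Net throughput when the average iteration count drops (assumed inverse scaling)."""
    return net_mbps * ref_iterations / avg_iterations


net_20 = 721.58                                   # Mbps at R = 1/2, 20 iterations
net_12 = scale_with_iterations(net_20, 20, 12)    # about 1.2 Gbps
print(bit_throughput(net_20, 0.5), net_12, bit_throughput(net_12, 0.5))
```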
One great advantage of the proposed multi-mode high-throughput LDPC decoder is that more modes can be supported at the cost of more memory bits only, with no architecture-level change. Since the reconfigurable switch network supports all expansion factors z_{ f } ≤ 256, and the layered MMSA cores support arbitrary QC-LDPC codes, more code lengths and code rates are naturally supported, for example, the WiMAX codes (z_{ f } = 24: 4: 96, R = 1/2, 2/3, 3/4, 5/6, 114 modes in total, i.e., 19 expansion factors times the six base matrices of the standard, which defines two variants each for rates 2/3 and 3/4). The only cost is that more memory bits are required to store the new base matrices H^{b}.
9. Conclusion
In this article, a novel LDPC code construction, verification, and implementation methodology is proposed, which can produce LDPC codes with both good decoding performance and high hardware efficiency. Additionally, a GPU verification platform is built that achieves a 490× speedup over the CPU, and a multi-mode high-throughput decoder is implemented on FPGA, achieving a net throughput of 1.2 Gbps with a performance loss within 0.2 dB.
References
1. Gallager R: Low-density parity-check codes. IRE Trans Inf Theory 1962, 8(1):21-28. 10.1109/TIT.1962.1057683
2. Tanner R: A recursive approach to low complexity codes. IEEE Trans Inf Theory 1981, 27(9):533-547.
3. MacKay D: Good error-correcting codes based on very sparse matrices. IEEE Trans Inf Theory 1999, 45(3):399-431.
4. Richardson T, Shokrollahi M, Urbanke R: Design of capacity approaching irregular low-density parity-check codes. IEEE Trans Inf Theory 2001, 47(2):619-637. 10.1109/18.910578
5. Chen J, Tanner RM, Jones C, Yan Li L: Improved min-sum decoding algorithms for irregular LDPC codes. In Proc ISIT. Adelaide; 2005:449-453.
6. Hocevar DE: A reduced complexity decoder architecture via layered decoding of LDPC codes. IEEE workshop on SiPS 2004, 107-112.
7. Hu Y, Eleftheriou E, Arnold DM: Regular and irregular progressive edge growth Tanner graphs. IEEE Trans Inf Theory 2005, 51(1):386-398.
8. Vukobratovic D, Senk V: Generalized ACE constrained progressive edge-growth LDPC code design. IEEE Comm Lett 2008, 12(1):32-34.
9. Blanksby AJ, Howland CJ: A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. J Solid State Circ 2002, 37(3):404-412. 10.1109/4.987093
10. Cui Z, Wang Z, Liu Y: High-throughput layered LDPC decoding architecture. IEEE Trans VLSI Syst 2009, 17(4):582-587.
11. Marchand C, Dore J, Canencia L, Boutillon E: Conflict resolution for pipelined layered LDPC decoders. In IEEE workshop on SiPS. Tampere; 2009:220-225.
12. Falcao G, Silva V, Sousa L: How GPUs can outperform ASICs for fast LDPC decoding. In Proc international conf on Supercomputing. New York; 2009:390-399.
13. Lin J, Wang Z: Efficient shuffle network architecture and application for WiMAX LDPC decoders. IEEE Trans on Circuits and Systems 2009, 56(3):215-219.
14. Gunnam KK, Choi GS, Yeary MB, Atiquzzaman M: VLSI architectures for layered decoding for irregular LDPC codes of WiMax. In IEEE International Conference on Communications. Glasgow; 2007:4542-4547.
15. Brack T, Alles M, Kienle F, Wehn N: A synthesizable IP core for WIMAX 802.16E LDPC code decodings. In IEEE Inter Symp on Personal, Indoor and Mobile Radio Comm. Helsinki; 2006:1-5.
16. Tzu-Chieh K, Willson AN: A flexible decoder IC for WiMAX QC-LDPC codes. In Custom Integrated Circuits Conference. San Jose; 2008:527-530.
17. Dorigo M, Gambardella LM: Ant colonies for the travelling salesman problem. Biosystems 1997, 43(2):73-81. 10.1016/S0303-2647(97)01708-5
18. Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science, New Series 1983, 220(4598):671-680.
19. IEEE Standard for Local and Metropolitan Area Networks Part 16 IEEE Standard 802.16e 2008.
20. Rovini M, Gentile G, Rossi F: Multi-size circular shifting networking for decoders of structured LDPC codes. Electron Lett 2007, 43(17):938-940. 10.1049/el:20071157
21. Tang J, Bhatt T, Sundaramurthy V: Reconfigurable shuffle network design in LDPC decoders. In IEEE Intern Conf ASAP. Steamboat Springs, CO; 2006:81-86.
22. Oh D, Parhi K: Area efficient controller design of barrel shifters for reconfigurable LDPC decoders. In IEEE Intern Symp on Circuits and Systems. Seattle; 2008:240-243.
23. Jin L, Xiaohu Y, Jing L: Early stopping for LDPC decoding: convergence of mean magnitude (CMM). IEEE Commun Lett 2006, 10(9):667-669. 10.1109/LCOMM.2006.1714539
24. Donghyuk S, Kyoungwoo H, Sangbong O, Jeongseok Ha A: A stopping criterion for low-density parity-check codes. In Vehicular Technology Conference. Dublin; 2007:1529-1533.
25. Kienle F, Wehn N: Low complexity stopping criterion for LDPC code decoders. Vehicular Technology Conference 2005, 1: 606-609.
Acknowledgements
This paper is partially sponsored by the Shanghai Basic Research Key Project (No. 11DZ1500206) and the National Key Project of China (No. 2011ZX0300100201).
Additional information
Competing interests
The authors declare that they have no competing interests.
Electronic supplementary material
Algorithm
Additional file 1: This file contains Algorithm 1, Memory conflict minimization algorithm; Algorithm 2, ACO algorithm for TSP; Algorithm 3, The SA-based LDPC construction framework; Algorithm 4, The GPU-based LDPC simulation; and Algorithm 5, Semi-parallel early-stopping algorithm. (PDF 98 KB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yu, H., Cui, J., Wang, Y. et al. Systematic construction, verification and implementation methodology for LDPC codes. J Wireless Com Network 2012, 84 (2012). https://doi.org/10.1186/16871499201284
Keywords
 lowdensity paritycheck codes
 simulated annealing
 ant colony optimization
 graphic processing unit
 decoder architecture