 Research
 Open Access
 Published:
Configurable Mfactor VLSI DVBS2 LDPC decoder architecture with optimized memory tiling design
EURASIP Journal on Wireless Communications and Networking volume 2012, Article number: 98 (2012)
Abstract
Semiparallel architectures for decoding Digital Video BroadcastingSatellite 2 (DVBS2) LowDensity ParityCheck (LDPC) codes have improved Very Large Scale Integration (VLSI) solutions, but their design is challenging from several perspectives. In order to conveniently exploit parallelism for obtaining VLSI LDPC decoders that occupy small circuit areas and demand low power consumption, we propose in this article a novel ASIC reconfigurable approach that exploits efficiently the memory block reshaping required to use a reduced number of processor nodes. We exploit different memory tiling configurations to reduce the memory area about 20%. The proposed architecture was synthesized for a 90 nm process design with a variable number of processor nodes and a competitive circuit area of 6.2 mm^{2} was achieved. The operating frequency simultaneously guarantees throughputs superior to 90 Mbps, as required by DVBS2, and low levels of power consumption.
1 Introduction
Over the last 15 years LowDensity ParityCheck (LDPC) codes have assumed a growing importance in the channel coding arena, namely because they have error correction capability to achieve efficient coding close to the Shannon limit. These codes were invented by Robert Gallager (MIT) in the early sixties [1] and have never been fully exploited due to overwhelming computational requirements by that time. LDPCs are linear block codes (N,K) and can be described by sparse binary paritycheck H matrices with dimensions (N  K) × N. They can also be elegantly represented by a Tanner graph [2] defined by edges connecting two distinct types of nodes usually denoted as Bit Nodes (BN), with a BN for each one of the N variables of H, and Check Nodes (CN), also called restriction or test nodes, with a CN for each one of the (N  K) paritycheck equations given by H. Naturally, the fact that their patent has expired, has shifted the attention of the scientific community and industry away from Turbo codes [3] towards the study of LDPC codes [4], which quickly have shown to be able of guaranteeing similar or even superior coding performance. Mainly for these reasons and also because advances in microelectronics allowed the development of hardware solutions for realtime decoding, LDPC codes have been adopted by modern communication standards [5–7]. Important examples of these standards are: the Digital Video BroadcastingSatellite 2 (DVBS2) for satellite communications [8]; the WiMAX IEEE 802.16e for wireless communication systems in Metropolitan area networks (MAN) [9]; the WiFi 802.11n standard for wired home networking technologies; and the 10 Gb Ethernet IEEE 802.3an. Also, the introduction of LDPC codes in 4G systems has recently been proposed, as opposed to Turbo codes adopted in 3G. Some of these applications impose challenges that typically have to be addressed by using dedicated solutions that require SystemOnChip (SoC) hardware providing at the same time good performance, low power consumption and small die areas. Naturally, special emphasis has been given to solutions addressing the DVBS2 standard for satellite communications [8], which represents the most challenging application that actually incorporates the use of LDPC codes.
In the past we have seen that parallelism can be efficiently exploited to achieve good performances with iterative messagepassing algorithms used in LDPC decoding [10–12]. However, LDPC codes with good coding performances demand very long lengths [4, 13], as it is the case of those adopted in the DVBS2 standard. The development of efficient Very Large Scale Integration (VLSI)hardware able of providing the huge computational power necessary and performing irregular memory accesses in realtime actually represents a real challenge. Focused on the original architecture developed by Kienle et al. [5] that uses M = 360 processor nodes, we investigated and developed efficient hardwarededicated DVBS2 LDPC decoders, namely by using a reduced number submultiple of M of lightweight node processors. Consequently, we have also reduced significantly the routing complexity of the interconnection network between processor nodes and memory blocks, which represents a target that aims to improve the design in terms of cost and complexity. This architecture has been initially proposed in [14], but some important challenges have not been addressed yet. In this article we tackle these challenges, namely by materializing the benefits and quantifying the gains achieved with the VLSI design of this architecture, in particular regarding the complex design of Application Specific Integrated Circuit (ASIC) memory blocks that can benefit from the use of the reduced number of processors. The proposed architecture exploits the modulom properties of DVBS2 codes that allow to take advantage of partial parallelism to increase throughput performance. Because we propose a reduction of the number of processors used, we also address the necessary memory blocks reshape. Although the amount of stored bits remains the same independently of the number of processors, we show that the memory configuration for a solution that uses less processing units is more efficient in terms of circuit area. The architecture has been prototyped using ASIC technology. Synthesis results show that the proposed solution achieves high performance with low power consumption and within a small circuit area context. We also present synthesis results for a memory optimized ASIC architecture that compare well with those reported in stateoftheart solutions of DVBS2 LDPC decoders [5–7, 15, 16]. The main contributions of the article are: (i) efficient, scalable and parallel architectures with any submultiple of M = 360 number of processors for VLSIbased LDPC decoders under the context of DVBS2; (ii) optimized synthesis area results for different sets of functional units and corresponding memory blocks reconfiguration; (iii) architecture with reduced routing complexity, occupying small die areas and consuming low power; and (iv) decoders with high throughput.
This article is organized as follows. In the next section we describe structured DVBS2 LDPC codes. In Section 3 we present an overview of the stateoftheart and also the main challenges regarding the development of efficient VLSI LDPC decoders for DVBS2. Section 4 addresses the Mfactorizable parallel approach proposed, and in Section 5 we describe an optimized RAM memory design procedure for ASIC. Section 6 presents experimental results for the proposed LDPC decoder architecture synthesized for ASIC and compares its performance with other architectures in the literature. Section 7 concludes the article.
2 LDPC codes
Graphical models, and in particular Tanner graphs [13], have often been proposed to perform approximate inference calculations [17, 18]. They are based on iterative intensive messagepassing algorithms [19] also known as Belief Propagation (BP) that perform the computation of joint probabilities on graphs (as depicted in Figure 1) and are commonly used in information theory (e.g., channel coding), artificial intelligence and computer vision (e.g., stereo vision) [17], which, under certain circumstances, can become computationally prohibitive.
2.1 LDPC decoding
In this article we are interested in the BP algorithm applied to LDPC decoding. In particular, we exploit the MinSum Algorithm (MSA), an efficient simplification [13] of the SumProduct Algorithm (SPA), which is very demanding from a computational perspective. The MSA here applied to LDPC decoding operates over Loglikelihood Ratios (LLR) [1], exchanging information and updating messages between neighbor nodes over successive iterations. From a computational perspective, this simplification decreases the complexity of processing because it uses sum operations instead of multiplications, while subtractions replace divisions [13, 20]. Although this represents a lower workload than the one required by the SPA, it is still quite significant. If the number of nodes is largein the order of thousands as in the case of DVBS2the MSA can still demand very intensive processing [13, 19].
But if from a computational perspective it makes sense to use the MSA, in [15, 21] it is shown that by using the conventional MSA algorithm, it is not possible to achieve acceptable Bit Error Rate (BER) performance for the particular case of DVBS2. The utilization of this algorithm puts performance far away from BER specifications defined for the standard [8], as it is clearly shown by figure 7 in [15]. In order to obtain BER performances closer to those defined by the DVBS2 standard, we adopted the improved normalized MSA algorithm [22, 23] with two distinct normalization factors that depend on the processed messages at the output of the CNs. Although in [23] this problem is only analyzed for short length codes, in [22] it is successfully extended to DVBS2 LDPC codes. This fact, associated with the corresponding and relatively simple changes necessarily introduced in the design of the architecture, is the main reason why we adopted the λ  min = 1 solution [24], which computes LLR messages based on two minimums.
The SPA can also be performed in the logarithmic domain as described in [20]. In the next equations, Lp_{ n } designates the a priori LLR of BN_{ n }, derived from the values received from the channel, and Lr_{ mn } is the message that is sent from CN_{ m } to BN_{ n }, computed based on all received messages from BNs of group $\mathcal{N}\left(m\right)\backslash n$, where $\mathcal{N}\left(m\right)\backslash n$ represents BNs connected to CN_{ m }excluding BN_{ n }, according to:
with
and
In the particular case of the MSA f(a, b) = 0. This represents an overestimation of LLR calculated values, which can be compensated by using the improved normalized MSA algorithm [22, 23] with two normalization factors (this was the solution adopted in our architecture). Consequently, (1) becomes:
where α can be obtained by:
Also, Lq_{ nm } is the LLR of BN_{ n }, which is sent to CN_{ m }and calculated based on message Lp_{ n } received from the channel and from CNs of $\mathcal{M}\left(n\right)\backslash m$, where $\mathcal{M}\left(n\right)\backslash m$ represents the set of CNs connected to BN_{ n }excluding CN_{ m }, and is given by:
At the end of each iteration we compute the a posteriori LLRs of BNs,
and perform hard decoding to obtain the decoded output word ĉ[13, 19]
Before the first iteration occurs, for each node pair (BN_{ n }, CN_{ m }), we initialize Lq_{ nm } with the LLR information Lp_{ n } received from the channel, and then we proceed to the iterative body of the algorithm until stop conditions occur: Hĉ^{T}≠ 0 or the maximum number of iterations is reached.
2.2. DVBS2 LDPC codes
Following the recognition of their potential, LDPC codes have been recently adopted by DVBS2 [8] and other new standards for communication and storage applications. They use their powerful coding gains, obtained at the expense of computational power, to achieve good performances under adverse channel conditions. Some of the LDPC codes adopted in those standards have a periodic nature, which allows exploiting suitable representations of data structures for reducing the computational requirements. These modulom properties of DVBS2 LDPC codes are described next, in order to allow understanding how architectures can take advantage of them, namely for exploiting parallelism.
The Forward Errorcorrecting (FEC) system of the recent DVBS2 standard [8] incorporates a special class of LDPC codes based on Irregular Repeat Accumulate (IRA) codes [25]. The paritycheck matrix H is of the form:
where A is sparse and has modulom properties, and B is a staircase lower triangular matrix. The structured constraints put on the pseudorandom generation of matrix A allow a significant reduction on the storage requirements without significant code performance loss. The N bits of a codeword are represented in the systematic form, divided in information bits (IN) and paritycheck bits (PN) given by the number of columns of A and B, respectively. The construction technique used to generate the A matrix consists of splitting the IN nodes into disjoint groups of M = 360 consecutive 1's. All the IN nodes of a group l should have the same weight w_{ l } and it is only necessary to choose the CN nodes that connect to the first IN of the group, in order to specify the CN nodes that connect to each one of the remaining (M  1) IN nodes of that group. The connection choice for the first elements of group l is pseudorandom but it guarantees that, in the resulting LDPC code, all the CN nodes must connect to the same number of IN nodes. Denoting by ${r}_{1},{r}_{2},\dots ,{r}_{{w}_{l}}$ the indices of the CN nodes that connect to the first IN of group l, the indices of the CN nodes that connect to IN_{ i }, with 0 ≤ i ≤ M  1, of group l can be obtained by:
with
The parameter M = 360 is constant for all codes used in the DVBS2 standard. For each code, the first groups of IN that form A have constant weights w_{ b } > 3, and the remaining groups have weights w_{ b } = 3. Matrix B has a lower triangle staircase profile as shown in (9). The LDPC codes adopted by the DVBS2 standard support two different frame lengths, one for short frames (N = 16, 200 bit) and the other for normal frames (N = 64, 800 bit). The short frame mode supports ten distinct code rates as depicted in Table 1, while the latter supports 11 rates as shown in Table 2. The column and row weights are also depicted in Tables 1 and 2 for all rates in the standard. For short frame lengths, only 50% of the codes have constant weight w_{ c } as indicated in Table 1, while for normal frame length codes all CNs have a constant weight w_{ c } , as indicated in Table 2.
Both tables show the number of edges for each code adopted in the DVBS2 standard. In each edge circulate messages (Lr_{ mn } and Lq_{ nm } ) that are used to update the connected nodes. A closer inspection, for example, of code with rate = 3/5 for normal frames, shows that the total number of edges of the Tanner graph is 285120. Given that communications occur in both directions (from CNs to BNs, and then from BNs to CNs), 570240 > 2^{19} messages are exchanged per iteration, which imposes significant computational demands (at several levels) for the development of LDPC decoders.
3 Overview of VLSI LDPC decoding architectures
The two types of nodes (BN and CN) in the Tanner graph require distinct processing. Therefore, processor nodes of LDPC decoding architectures must be capable of performing these two types of calculations in order to compute the improved normalized MSA. Also, it is necessary to incorporate memory to store messages associated with each processor node, and the decoder requires an efficient switching mechanism to allow each processor node in the graph to access different memory banks. Ideally, this should be performed in parallel to increase throughput. The messagepassing scheduling mechanism controls the order in which messages are updated (i.e., the order how memory is addressed by each node processor), defining which nodes communicate with each other and in what order. A control logic unit, dependent on scheduling, controls the interconnection network that performs this task. It connects existing BN (usually a factor of N) with CN (usually a factor of (N  K)) processors to the corresponding memory blocks, in order to allow read/write operations (to be performed in parallel) into different blocks that store subsections of the graph. The interconnection network should be designed to allow conflictfree accesses to memory banks.
3.1 Parallel LDPC decoder architectures
The complex nature of such a VLSI architecture presents several challenges. To perform fast LDPC decoding, it requires a certain level of parallelism, which should have some significance in order to achieve a throughput that can be very demanding (90 Mbps in the DVBS2 case [8]). Above some level of parallelism the requirements for routing, area and power consumption may be difficult to achieve for practical reasons. Presently, there are LDPC decoder architectures ranging from serial [26] to fully parallel [10], targeting different objectives. Serial decoders trade throughput with architectural simplicity, obtaining reduced die areas. Although limitations at the processing level can make them target mainly lowthroughput applications, they can exploit hardware simplifications, such as reducing the storage size of messages and still achieve coding gains without noticeable loss in performance [26]. Parallel approaches offer superior throughput at the expense of architectural complexity and circuit area. On a fully parallel system, the complexity increases significantly as the code length grows. The dimension and complexity of this type of architectures are only acceptable for short to medium length LDPC codes. One of the first LDPC decoder on ASIC [10] is based on a fully parallel architecture for a 1, 024 bit, rate = 1/2 LDPC decoder processing 64 iterations, that obtains a throughput of 1 Gbps. The excellent performance achieved is only possible due to the complex routing established to connect all N BN processors with corresponding (N  K) CN processors. Consequently, the complex wiring of the circuit created the need to carefully manage the floor planning and routing of the design, which alerted for the prohibitive complexity involved in the design of architectures that support block codes with high lengths (N > 1,024 bit).
Naturally, the size of the architecture is influenced by the length of the largest code supported and also by the level of parallelism adopted. This has direct implications in the number of node processors necessary to perform computation and, more importantly, in the complexity of the interconnection network (represented by a barrel shifter in Figure 2). The complexity of this network decreases by reducing the level of parallelism of the architecture. For a fixed code, permutation patterns are known a priori and the indices/addresses that indicate which nodes connect with each other are usually computed offline. Compared with fullparallel decoders, the complexity of an interconnection network in a partialparallel architecture significantly decreases, which eliminates important restrictions from the routing process (in the place & route phase of the design).
Partialparallel architectures for highthroughput LDPC decoding have been proposed by Mansour et al. [11, 12], where the concept of architectureaware LDPC codes was firstly introduced. Architectureaware codes decouple the architectural dependence of the decoder from the LDPC code properties. They achieve a faster convergence rate, which allows obtaining higher throughputs [12]. In this case, the interconnection network can be efficiently implemented by using programmable multistage networks [11]. A comparison between serial, partlyparallel and fullparallel architectures has been performed by Fanucci et al. in [27] for a regular code with N = 2,048 bits, w_{ c } = 6 and w_{ b } = 3. However, these architectures are dedicated to particular cases and other solutions were necessary to support other types of more demanding LDPC codes, namely those that have irregular nature and higher block lengths.
For long length LDPC codes (e.g., DVBS2 codes), typically more than 70% of the circuit's area in VLSI decoders is occupied by memory units [6, 15], which are essential to support the iterative messagepassing mechanism. They are extensively addressed in the context of this article. Memory requirements for parallel architectures capable of supporting long length codes are feasible, but hard to achieve for practical reasons on a VLSI approach. The number of accesses and the amount of memory necessary to store messages exchanged between hundreds or thousands of nodes is extremely high, which imposes restrictions in die area and power consumption. Although in a partialparallel approach the total amount of memory necessary to store messages remains unchanged for any number of processors, by varying the latter, the height and width of memory blocks change, which can be exploited to achieve architectures with more efficient areas.
3.2 Parallel Mkernel LDPC decoder architectures for DVBS2
The development of LDPC decoders for the DVBS2 standard can be considered to be among the most demanding applications for this type of processing, mainly due to the high length of codes adopted [8]. LDPC codes used in this standard are IRA codes, where the sparse paritycheck H matrix has modulom properties that can be exploited to achieve hardware parallelism [25].
The use of scalable parallelism to obtain chips with small areas for DVBS2 has been proposed for technologies ranging from 0.13 µ m down to 65 nm [5–7, 15, 16] based on the processing of subsets of the Tanner graph, which guarantee the minimum necessary throughput of 90 Mbps required by the standard [8]. The design of partialparallel solutions generates systems with lower areas, but also implies lower throughputs, comparing to fullparallel architectures in a SoC. One of the architectures initially proposed by Kienle et al. [5] uses M = 360 processor nodes, also referred in this text as Functional Units (FU). They work in parallel and share control signals, that process both CN nodes (in CN mode) and IN nodes (in BN mode) according to a flooding schedule approach.
Attending to the zigzag connectivity [8] between PNs and CNs defined by B in (9), they are updated jointly in CN mode [5]. An efficient VLSI architecture has been proposed in [28, 29], where the joint processing of CNs and PNs is performed as depicted in Figure 3. For example, when updating PN_{ m }, according to (6) it becomes a passing node because the message it sends to CN_{m+1}is the message received from CNm added to the channel information, and viceversa (see Figure 3). Since each FU processes q consecutive CNs, the update of PNs follows an horizontal schedule approach, with PNs and CNs being processed simultaneously. This scheduling mechanism allows a faster decoding convergence [30]. Therefore, the message that travels through CN_{ m }, PN_{ m }and CN_{m+1}is kept in the FU and only the backward message that is sent from CN_{ m }to PN_{m1}is saved in the external memory. The equations that describe the operation inside the FU in CN mode are [29]:
and
where IN(m) represents the set of INs connected to CN_{ m }, and Mem is the internal memory of the FU.
All 360 FUs process one message per cycle. In IN mode, the 360 messages are read from the same address (sequentially incremented, for this type of processing) in the respective memory blocks. The new messages resulting from this computation are then stored in the same address cyclically shifted right through the interconnection network, implemented by a barrel shifter. The modulom properties of LDPC codes used in DVBS2 allow to replace the complex interconnection network by a common barrel shifter. In CN mode, messages have to be read from specific addresses and stored back in the same addresses cyclically shifted left to conclude the execution of an iteration. Once again, the access is performed in parallel for all 360 messages. The barrel shifter mechanism and the efficient memory mapping scheme constitute the major strengths of the architecture described in [5].
Memory requirements for a partially parallel architecture capable of supporting long length codes used in DVBS2 are demanding. The appropriate Control of ADDRESSES and SHIFTS memory banks indicated in Figure 2 guarantees that every time the address of a memory block changes, it changes accordingly and without conflicts for all M = 360 processors in parallel. The barrel shifter, which has a switching activity controlled by the SHIFTS memory bank, can be properly managed together with the ADDRESSES memory block that defines memory accesses, to simulate the connectivity of the Tanner graph. In [5] it is illustrated the memory mapping mechanisms used in BN processing as a function of q (a parameter defined by the standard), which is presented in (11). In CN processing, a similar scheme applies. For each different code we use different SHIFTS and ADDRESSES values, which can be easily obtained from annexes B and C of the DVBS2 standard [8]. In Figure 4 mainly three distinct types of memory are depicted: (i) Channel memory which initially receives data (IN and PN) from the input of the LDPC decoder; (ii) Message memory, where all messages associated with information bits are stored (the FU supports both types of BN and CN processing, which perform alternately during the same iteration); and (iii) PN Message memory that holds messages associated with parity bits, which are computed in CN mode and have all weight w_{ c } = 2 (zigzag connectivity depicted in Figure 3). In PN Message memory each FU only needs to store the message which is passed during the backward update of CNs [14].
However, the large number of FUs used (360) still implies a wide and complex barrel shifter that requires a significant die area and imposes routing problems in order to accommodate the simultaneous accesses of all 360 FUs to corresponding messages in memory. Since this architecture is able to provide, with the current technology, a throughput far above the mandatory 90 Mbps, we are able to reduce the number of FUs even further. In fact, we herein show that this can be done by any integer factor submultiple of M = 360, which, for VLSI systems, presents a beneficial reduction of the size of the barrel shifter.
4 Mfactorizable VLSI parallel LDPC decoder architecture for DVBS2
Under this context we developed a novel hardware approach, originally proposed in [14, 28], which is based on a partialparallel architecture that simplifies the barrel shifter and reduces memory requirements. We address the generalization of the well known Mkernel parallel hardware structure [5] and propose its partitioning by any integer L submultiple of M (which can be obtained from the decomposition of M = 360 = 2^{3} × 3^{2} × 5), without memory addressing/reconfiguration overheads and keeping unchanged the efficient messagepassing mechanism. The proposed architecture provides an efficient way of reducing the number of FUs and the overall complexity of the decoder. This approach does not only surpass some disadvantages of the architecture described in [5], such as die area occupied or routing congestion, but it also adds flexibility and reconfigurability to the system according to the decoder requirements and device constraints. This type of subsampling approach preserves the key modulom properties of the architecture [5], with only P = M/L processing units addressed in [28, 29] as shown in Figure 2. This strategy allows a linear reduction by L of the hardware resources occupied by the FU blocks, and reduces significantly the complexity (2 × O(P log P)) of the interconnection network (or barrel shifter), which simplifies the routing problem. This strategy does not imply an increase by L in the size of ROM memories (that hold SHIFTS and ADDRESSES values). In fact, by performing a subsampling by L, each group of M = 360 INs or CNs results in L subgroups of P INs or CNs. As the properties of the first subgroups of CNs and BNs to process are known, the modulom periodicity of DVBS2 codes allows to automatically calculate the properties of the remaining subgroups. Based on the SHIFTS shift_{0} to apply to the first subgroup, we can obtain the remaining subgroups γ, with 0 ≤ γ ≤ L  1 by [14]:
The same applies for ADDRESSES, which can be computed as:
For the architecture to support only P FUs (instead of the M = 360 FUs as in [5]), memory blocks have to be reorganized according to (16), (17) and Figures 2 and 5. We can reshape these memory blocks and keep unchanged the size of the system ROM, by computing on the fly the new SHIFTS as a function of those initially stored in the ROM, as indicated in (16). In the configuration shown in Figure 2, each FU_{ i }, with 0 < i < P  1, is now responsible for processing information (IN), parity (PN) and CN according to a proper memory mapping and shuffling mechanism. As we increase L, the smaller become the subsets of the Tanner graph processed in parallel. Figure 5 describes the addressing mechanisms used in the factorizable architecture for L = 2, which uses 180 FUs. The amount of memory is exactly the same as in the architecture with L = 1, but the structure is different. There are less FUs and the corresponding memory word size decreases. As shown in Figure 2, memories have to become higher and thinner in order to hold the same information as before. This new memory configuration will introduce benefits in area occupied by the architecture, as it will be shown later.
5 Optimizing RAM memory design based on macro cells for ASIC
The presented architecture has been described in Register Transfer Level (RTL) for P = {45, 90, 180, 360} FUs. All solutions use 5bit to represent data messages. Figure 2 shows that all four configurations use exactly the same amount of memory, though rearranged with different widths and heights (distinct memory dimensions). The smaller the number of P functional units adopted, the higher and thicker block memories become. The complete type and amount of memories necessary for such design are:

Message memory holds internal messages calculated during each iteration of the decoding process associated with IN; Message memory width (word length) is given by w_{ a } = number of FU × message, which in this case is w_{ a } = M × 5/L bits; the height h_{ mm }can be obtained for worst case scenario (code with rate = 3/5 obtained from Table 2, where the number of edges is maximum). In this case, h_{ mm } = L × q × (w_{ c }  2);

Channel memory stores IN and PN data incoming from the channel. This memory has the same word length as Message memory (w_{ ch }= w_{ a }), and the height is given by h_{ ch }= L × 64800/M, representing the worst case (normal frame);

PN message memory holds messages calculated during each iteration, which are associated only with parity bits. It has the same width as Channel memory and the height h_{ PN }= L × q is obtained for worst case scenario (code with rate = 1/4) from Table 2, which turns q = 135 (only one 5bit message per each parity bit has to be stored);

Hard decoding memory holds IN and PN data obtained in the hard decoding phase of processing (1bit per FU); the width of this memory is given by the number M/L FUs of the architecture, and the height h_{ HD }is obtained from h_{ HD }= L × 64800/M.
Table 3 summarizes the required width and height (mem. pos.) of all memories used in the four synthesized configurations. Unfortunately, sometimes RAM memory libraries of macro cells do not exactly support all the specified heights requested in Table 3, but rather standard dimensions which are usually a power of 2, as shown in Table 4. The area results obtained in the synthesis process and shown in Table 5 allowed other interesting conclusions. Memories occupy nearly 97% of the circuits' total area. The remaining part of the circuit is occupied by the barrel shifter, the functional units and the control logic. The fact that in the four configurations the areas are different was also a surprise. Table 5 shows these differences. As mentioned before and depicted in Figure 2, the total amount of memory is the same for all designs. If we realize that they occupy a significant area of the design, we conclude that their differences should be minimal. To analyze these differences, we first need to understand how memories are generated and the RAM generator limitations. The architecture implemented in RTL uses memories with large width (word length) and height (number of words, or memory positions).
The RAM generator used can create memories that support the requested number of words, but the maximum word width is limited to 64 bits, which is far below the architectural needs. To overcome this problem, B blocks of RAM memory were concatenated until the required width was achieved, as shown in Figure 6. Each RAM memory has its own internal control logic which can address data in the operating clock cycle, with its own antenna diode protection, testing mechanisms, power rings, etc. As represented in Figure 6, more memories are necessary for large words, which replicates control hardware and increases silicon area. This seemed to show that some level of inefficiency should be expected. A practical example with real areas can be addressed for the Message RAMs for 360 and 45 FUs, to give a better perspective of the problem. In the former configuration, the Message RAM width is 1,800 bits (5 bits per message × 360 FUs) with 2^{10} addresses (height). For the second configuration, the RAM width is 225 bits (5 bits per message × 45 FUs) with 2^{13} addresses. Both Message memory blocks have the same capacity (2^{10} × 1, 800 bits), however the area used by the wider 360 FUs Message memory is 6.2 mm^{2}, while the thiner 45 FUs memory occupies only 3.2 mm^{2}. These memories were created by concatenating B banks of 45 bit RAMs, as illustrated in Figure 6. For 360 FUs, 40 instantiations are necessary (40 × 45 bits = 1,800 bits) while for 45 FUs only five instantiations are needed (5 × 45 bits = 225 bits). Whatever the factorization level L adopted, the architecture will still have to support a Very Long Data Word (VLDW) bus. This property is illustrated in detail in Figure 4, where the VLDW bus implies a complex routing of the wiring associated with it. This introduces additional complexity managing the floor planning and routing process (in the place & route phase of the design). Nevertheless, the architecture with 45 FUs minimizes this problem by moving its complexity to a dimension where it can be more easily tractable.
5.1 Minimal RAM memory configuration
From all designs in Table 5 the architecture with 45 FUs occupies the smallest area. Comparing Tables 3 and 4, we notice that due to hardware restrictions, there is a portion of the RAM which is never used. The 45 FUs architecture Message RAM, for instance, needs 225 × 5184 = 1,166,400 bits. But physically, we have 225 × 2^{13} = 1,843,200 bits, resulting in about 37% unused bits. This fact occurs in all memories and for all architectures and we tried to minimize it by using decompositions into powers of 2 and concatenating smaller RAMs, but we realized that instead of decreasing, the area increases nearly 30%, which can be justified by the additional number of internal RAM hardware controllers incorporated into the design. However, joining together several blocks of memory that operate in different time slots into one bigger common block of memory can become a more efficient solution. By analyzing in detail the Message memory block of the architecture with 45 FUs depicted in Figure 4, each 5 bits of the word are directly connected to each FU, and consequently only one main memory block is used, as mentioned in the beginning of Section 5. Therefore, the control is unique and only one (write or read) address is needed to transfer data to/from all FUs. This memory must be a dual port RAM, in order to allow reading and writing during the same clock cycle. However, due to memory generator limitations in the tools used, this type of memory was not possible to achieve. To overcome this, we adopted a solution that consists of creating a memory that operates at twice the circuit's frequency (f_{op}): in the first cycle it performs read operations; and in the second one it writes data to memory. The system's master clock operates at 200 MHz and it is used to feed memory blocks, while f_{op} = 100 MHz is synchronously derived from this clock source.
The LDPC decoder input data comes from an external deinterleaver module that does not work at the same clock frequency as the LDPC decoder. To guarantee that no information is lost, all messages delivered to the decoder must be stored in Channel memory. The PN message memory is equally stored on a dual port RAM which, due to memory generator limitations as before, was converted into a single RAM working at twice of the initial frequency. In the beginning, this memory is loaded with zeros, and then it will hold data associated with parity bits necessary in each iteration of the decoding process. Shift and address data are loaded from an outside source into SHIFTS and ADDRESSES memories. The values stored in these memories depend on the code rate and must be previously calculated.
6 Experimental results
In the next subsections we present the obtained experimental results for the architecture in Figure 2, which was synthesized for ASIC using a 90 nm CMOS process design. Place & route was performed using eight metal layers. Memories were synthesized using a generic RAM generator tool with libraries of macro cells. Estimates for power consumption were obtained assuming a toggling rate of 10%. The architecture was also synthesized for FieldProgrammable Gate Array (FPGA), but only for validation purposes [14].
6.1 Experimental synthesis results for ASIC
Synthesizing in ASIC technology the architecture with P = {45, 90, 180, 360} FUs aims at finding the one which produces better power consumption and area results, while simultaneously supporting the throughput requirements of the standard. The architecture supports short and normal frame lengths for all code rates, even though it is dimensioned for the worst case. All rates in the solutions here proposed support 90 Mbps as required by the DVBS2 standard. Figures 7a) and 7b) report throughput performance achieved for normal and short frame length DVBS2 codes decoding ten iterations. For example, with P = 90 FUs (or equivalently L = 4) in the normal frame mode, for code with rate = 1/4 the throughput surpasses 295 Mbps, while with P = 90 FUs it reaches 1.18 Gbps. Figures 7c) and 7d) report the maximum number of iterations supported by the architecture working at 100 MHz, while producing the DVBS2 target throughput of 90 Mbps. For example, for P = 180 FUs (L = 2) in the normal frame mode, a maximum of 65 iterations is supported when decoding rate = 1/4, while 27 iterations are achieved for code with rate 3/5 (worst case working conditions for DVBS2). Half these values are obtained with P = 90 FUs (L = 4) and a value four times inferior is achieved with P = 45 FUs (L = 8). It can also be seen that a maximum of 55 iterations is supported for the configuration with P = 90 FUs decoding the worst case code. Technical limitations imposed by the available library of memory macro cells, namely the maximum frequency of operation at 200 MHz and the fact that they do not support dualport memories, have put limits on throughput and maximum number of iterations supported. To exemplify this, our architecture implemented with P = 45 FUs supports at maximum seven iterations when decoding at a throughput of 90 Mbps for worst case working conditions, as depicted in Table 6. By using dualport memories, or memories capable of supporting frequencies in the range of 400 MHz (that would allow to have the main circuit operating at 200 MHz), it would be possible to achieve 14 iterations for worst case condition (see Table 6) without additional modifications in the design, at the expense of an increase in power consumption. To validate this assumption, we resynthesized the architecture with the circuit operating at f_{op} = 125 MHz and then at f_{op} = 200 MHz (as indicated in Table 6) and no timing violations occurred.
In (18) and (19) it is described how the throughput and maximum number of iterations were calculated for Figure 7, with the frequency of operation selected at f_{op} = 100 MHz, i.e.:
The variables frame_length, W and w_{ j } depend on code length and rate, and can be found in the standard [8]. The variable W defines the number of elements of A in the compact form (see (9)) as presented in annexes B and C of the standard [8]. It can be computed from Tables 1 and 2 by performing:
where IN_weight_j and IN_weight_3 are the number of IN nodes with weights weight_j and 3, respectively.
Table 5 shows that the total area of the circuit is equal to 21.2 mm^{2} for a 360 FUs solution, while the architecture with 45 FUs can be obtained with only 7.7 mm^{2}. Different parts of the architecture work at two distinct clock frequencies, namely 100 and 200 MHz, and for the solution with 45 FUs we estimate a power consumption of approximately 105 mW. Final circuit validation has been performed by place & route using eight metal layers. No time, physical or routing violations were reported, so at this stage the 45 FUs architecture is physically validated.
At this point and due to the configurable nature of the architecture, we found room to perform some extra optimizations. For practical purposes, and since it presented good results for throughput and area, we adopted the configuration based on 45 FUs as the basis solution to be optimized. The next subsection shows that there is still room to perform optimizations and how to achieve that.
6.2 Experimental synthesis results for ASIC using an optimized 45 FUs architecture
To translate the optimizations mentioned in Section 5.1 into area results, the new circuit has been synthesized for the same 90 nm technology used in the original synthesis. The optimizations reduce the area of the smallest original architecture nearly 20%. Synthesis results for the previously mentioned P = {45, 90, 180, 360} FUs based architectures and also for the optimized 45 FUs version are listed in Table 5. The areas range from 21.2 to 6.2 mm^{2}, with corresponding levels of power consumption of, respectively, 290 mW down to 85 mW. The 45 FUs' architecture optimized with an efficient RAM memory reshape presents a total area of 6.2 mm^{2} and 85 mW of power consumed at a typical corner voltage of 1.1 V. The estimation of power consumption was performed by considering typical corner operating conditions for current, voltage and temperature. This approach was followed because extreme working conditions usually do not occur, or if they do, it is during small periods of time which does not negatively affect the design.
No timing or design rule check (DRC) violations were reported, so at this stage the 45 FUs optimized architecture is physically validated and ready to be implemented in silicon. To accommodate place & route, it should be noted that, based on previous design experience, we estimated an increase in area equivalent to 20% (after final validation, we realized that it could have been approximately 19%), which corresponds to a global area of 7.4 mm^{2} for the optimized 45 FUs architecture.
6.3 Discussion
The assessment of synthesis area results for the proposed architecture presented in Table 5 shows that it compares well with stateoftheart solutions [5–7, 15, 16, 31]. Table 6 compares synthesis results for stateoftheart architectures with the 45 FUsoptimized architecture here proposed. A 90 nm technology is used in [6, 16], producing, respectively, an area of 4.1 and 9.6 mm^{2}. Although it is not possible to fully assess [6] because important characteristics of the circuit such as power consumption are not indicated and only area and throughput are mentioned, it is important to refer that the architecture in [6] presents a smaller area but it only supports the normal frame mode of operation. Results for similar architectures but using different technologies have been presented: for example, for a 0.13μ m technology, an area of 22.7 mm^{2} has been achieved [5], while [7] presents 3.9 mm^{2} for a 65 nm technology, and [15] achieves 6.03 mm^{2} for the same 65 nm technology. Although some of them claim to occupy smaller die areas, our solution supports both frame lengths, while [6, 7] only support the normal frame mode. Also, considering that our design is based on a 90 nm process, it compares favorably in terms of area against [7, 15], which use a 65 nm technology as mentioned before. The new architecture here proposed based on 45 FUs shows an LDPC decoder circuit with smaller area occupied than those reported in [5, 16]. More recently, Kim et al. [31] presented an architecture with an excellent throughput of 520 Mbps but at the expense of a circuit area equal to 12.4 mm^{2} for a 90 nm process. For power consumption purposes, perhaps most important is the fact that our architecture works with an inferior maximum frequency of operation than those just reported in stateoftheart solutions [5, 7, 15, 16]. While [5, 15] propose architectures with operating frequencies of 270 MHz, the one in [7] requires 400 and 320 MHz in the case of [16]. In the new approach herein proposed, 45% of the circuit works at 200 MHz, while the remaining 55% work at 100 MHz. The power consumption required by our architecture is inferior to those mentioned by competitors. Namely, 477 mW are reported in [15] for best case working conditions, while we achieved 105 mW for the 45 FUs architecture and 85 mW for the optimized version of it, as depicted in Tables 5 and 6.
By scaling the proposed architecture to a 65 nm technology using the same parameters as those reported in [7] and correspondingly increasing the frequency of operation to 400 MHz (which is possible as long as proper memory technology is adopted), it is possible to increase the maximum number of iterations supported by this architecture to 28 for worst case working conditions, which for DVBS2 occurs for code with rate = 3/5.
7 Conclusions
This article addresses the generalization of a stateoftheart Mkernel parallel structure for LDPCIRA DVBS2 decoding, for any integer factor of M = 360. The proposed architecture adopts a partitioned processing of subsets of the Tanner graph that keeps unchanged the efficient message memory mapping structure without addressing unnecessary overheads. This architecture proves to be flexible and easily configurable according to the decoder constraints and represents a tradeoff between silicon area and decoder throughput above the required 90 Mbps for all DVBS2 codes. Under this context, five configuration designs with different number of processing units have been synthesized using ASIC technology. They range from 360 to 45 FUs which represents, respectively, an equivalent occupied area of 21.2 and 7.7 mm^{2}. Although the process of generating RAM memories imposes constraints, the investigation carried out under the context of this article allowed several interesting conclusions that were applied into the design of the LDPC decoder circuit. Redimensioning and rearranging the order how memory blocks are grouped together allowed reducing the global area of the circuit to a value as low as 6.2 mm^{2}. Additional improvements in area could still be experienced either by using a different number of metal layers, or by adopting full custom RAM memories in the circuit design. Since in the present case nearly 97% of the circuit's area is occupied by memory blocks and we used a generic RAM generator tool, the area occupied by the circuit can be further reduced if we require to the foundry the use of dedicated RAM memory cells. This shows the competitiveness of the architecture when compared with stateoftheart solutions for the same 90 nm technology. Moreover, the maximum frequency of operation of the design here proposed is smaller than those reported by competitors, which justifies the low levels of power consumption achieved of approximately 85 mW.
References
 1.
Gallager RG: Lowdensity paritycheck codes. IRE Trans Inf Theory 1962, 8: 2128. 10.1109/TIT.1962.1057683
 2.
Tanner R: A recursive approach to low complexity codes. IEEE Trans Inf Theory 1981, 27(5):533547. 10.1109/TIT.1981.1056404
 3.
Berrou C, Glavieux A: Near optimum error correcting coding and decoding: turbocodes. IEEE Trans Commun 1996, 44(10):12611271. 10.1109/26.539767
 4.
Mackay D, Neal R: Near shannon limit performance of low density parity check codes. IEEE Electron Lett 1996, 32(18):16451646. 10.1049/el:19961141
 5.
Kienle F, Brack T, Wehn N: A synthesizable IP Core for DVBS2 LDPC code decoding. Proceedings of Design, Automation and Test in Europe, 2005 (DATE'05), IEEE Electron. Lett 2005, 16.
 6.
Dielissen J, Hekstra A, Berg V: Low cost LDPC decoder for DVBS2. In Proceedings of Design, Automation and Test in Europe, 2006 (DATE'06). IEEE, Munich, Germany; 2006:16.
 7.
Brack T, LehnigkEmden T, Kienle F, Wehn N, L'Insalata NE, Rossi F, Rovini M, Fanucci L: Low complexity LDPC code decoders for next generation standards. In Proceedings of Design, Automation and Test in Europe, 2007 (DATE'07). IEEE, Nice, France; 2007:16.
 8.
European Telecommunications Standards Institute (ETSI): EN 302 307 V1. 1.1 Digital video broadcasting (DVB); second generation framing structure, channel coding and modulation systems for broadcasting, interactive services, news gathering and other broadband satellite applications 2005.
 9.
IEEE: IEEE 802.16e. IEEE P802.16e/D12: 'Draft IEEE Standard for Local and Metropolitan Area Networks. Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems', October 2005 2005.
 10.
Blanksby AJ, Howland CJ: A 690mW 1Gb/s 1024b, Rate1/2 lowdensity paritycheck code decoder. J Solid State Circ 2002, 37(3):404412. 10.1109/4.987093
 11.
Mansour M, Shanbhag N: A 640Mb/s 2048Bit programmable LDPC decoder chip. IEEE J Solid State Circ 2006, 41(3):684698. 10.1109/JSSC.2005.864133
 12.
Mansour M, Shanbhag N: Highthroughput LDPC decoders. IEEE Trans Very Large Scale Integr Syst 2003, 11(6):976996.
 13.
Lin S, Costello DJ: Error Control Coding. Prentice Hall, Englewood Cliffs, NJ; 2004.
 14.
Gomes M, Falcao G, Silva V, Ferreira V, Sengo A, Falcao M: Flexible parallel architecture for DVBS2 LDPC decoders. In Proceedings of the IEEE Global Telecommunications Conf (GLOBECOM'07). IEEE, Washington, D.C., USA; 2007:32653269.
 15.
Müller S, Schreger M, Kabutz M, Alles M, Kienle F, Wehn N: A novel LDPC decoder for DVBS2 IP. In Proceedings of Design, Automation and Test in Europe, 2009 (DATE'09). IEEE, Nice, France; 2009:13081313.
 16.
Zhang B, Liu H, Chen X, Liu D, Yi X: Low complexity DVBS2 LDPC decoder. In Proceedings of the IEEE 69th Vehicular Technology Conference, 2009 (VTC Spring 2009). IEEE, Barcelona, Spain; 2009:15.
 17.
Sun J, Zheng NN, Shum HY: Stereo matching using belief propagation. IEEE Trans Pattern Anal Mach Intell 2003, 25(7):787800. 10.1109/TPAMI.2003.1206509
 18.
Sudderth EB, Freeman WT: Signal and image processing with belief propagation. IEEE Signal Process. Mag 2003, 25: 114141.
 19.
Wicker SB, Kim S: Fundamentals of Codes, Graphs, and Iterative Decoding. Kluwer Academic Publishers, Boston; 2003.
 20.
Hu XY, Eleftheriou E, Arnold DM, Dholakia A: Efficient implementations of the sumproduct algorithm for decoding LDPC codes. In Proceedings of the IEEE Global Telecommunications Conf (GLOBECOM'01). IEEE, San Antonio, Texas, USA; 2001:10361036E.
 21.
Papaharalabos S, Papaleo M, Mathiopoulos P, Neri M, VanelliCoralli A, Corazza G: DVBS2 LDPC decoding using robust check node update approximations. IEEE Trans Broadcast 2008, 54: 120126.
 22.
Tsai CJ, Chen MC: Efficient LDPC decoder implementation for DVBS2 system. In Proceedings of the International Symposium on VLSI Design Automation and Test (VLSIDAT'10). Hsinchu, Taiwan; 2010:3740.
 23.
Oh D, Parhi K: Minsum decoder architectures with reduced word length for LDPC codes. IEEE Trans Circ Syst 2010, 105115.
 24.
Guilloud F, Boutillon E, Danger JL: λmin decoding algorithm of regular and irregular LDPC codes. In Proceedings of the 3rd Int Symp Turbo Codes Relat Topics. Brest, France; 2003:14.
 25.
Eroz M, Sun FW, Lee LN: DVBS2 low density parity check codes with near Shannon limit performance. Int J Satell Commun Netw 2004, 22: 269279. 10.1002/sat.787
 26.
Prabhakar A, Narayanan K: A memory efficient serial LDPC decoder architecture. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005). IEEE, Philadelphia, PA, USA; 2005:V41V44.
 27.
Fanucci L, Rossi F: A throughput/complexity analysis for the VLSI implementation of LDPC decoder. In Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology. IEEE, Rome, Italy; 2004:409412.
 28.
Falcao G, Gomes M, Goncalves J, Faia P, Silva V: HDL library of processing units for an automatic LDPC decoder design. In Proceedings of the IEEE Ph.D. Research in Microelectronics and Electronics (PRIME'06). IEEE, Otranto (Lecce), Italy; 2006:349352.
 29.
Gomes M, Falcao G, Goncalves J, Silva V, Falcao M, Faia P: HDL library of processing units for generic and DVBS2 LDPC decoding. In Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP'06). ICETE, Setubal, Portugal; 2006:1724.
 30.
Sharon E, Litsyn S, Goldberger J: Efficient serial messagepassing schedules for LDPC decoding. IEEE Trans Inf Theory 2007, 53(11):40764091.
 31.
Kim SW, Park CS, Hwang SY: Design of a highthroughput LDPC decoder for DVBS2 using local memory banks. IEEE Trans Consum Electron 2009, 55(3):10451050.
Acknowledgements
This study was partially supported by the Portuguese Foundation for Science and Technology (FCT), namely through the PIDDAC program funds and under grants SFRH/BD/37495/2007 and SFRH/BD/38338/2007, and also by FCT project PEstOE/EEI/LA0008/2011.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Falcao, G., Gomes, M., Silva, V. et al. Configurable Mfactor VLSI DVBS2 LDPC decoder architecture with optimized memory tiling design. J Wireless Com Network 2012, 98 (2012). https://doi.org/10.1186/16871499201298
Received:
Accepted:
Published:
Keywords
 LDPC decoding
 DVBS2
 VLSI
 ASIC
 memory tiling
 semiparallel architecture
 Mfactorizable architecture
 Lowpower consumption
 highthroughput