Configurable M-factor VLSI DVB-S2 LDPC decoder architecture with optimized memory tiling design

Semi-parallel architectures for decoding Digital Video Broadcasting-Satellite 2 (DVB-S2) Low-Density Parity-Check (LDPC) codes have improved Very Large Scale Integration (VLSI) solutions, but their design is challenging from several perspectives. In order to conveniently exploit parallelism for obtaining VLSI LDPC decoders that occupy small circuit areas and demand low power consumption, we propose in this article a novel ASIC reconfigurable approach that exploits efficiently the memory block reshaping required to use a reduced number of processor nodes. We exploit different memory tiling configurations to reduce the memory area about 20%. The proposed architecture was synthesized for a 90 nm process design with a variable number of processor nodes and a competitive circuit area of 6.2 mm2 was achieved. The operating frequency simultaneously guarantees throughputs superior to 90 Mbps, as required by DVB-S2, and low levels of power consumption.


Introduction
Over the last 15 years Low-Density Parity-Check (LDPC) codes have assumed a growing importance in the channel coding arena, namely because they have error correction capability to achieve efficient coding close to the Shannon limit. These codes were invented by Robert Gallager (MIT) in the early sixties [1] and have never been fully exploited due to overwhelming computational requirements by that time. LDPCs are linear block codes (N,K) and can be described by sparse binary parity-check H matrices with dimensions (N -K) × N. They can also be elegantly represented by a Tanner graph [2] defined by edges connecting two distinct types of nodes usually denoted as Bit Nodes (BN), with a BN for each one of the N variables of H, and Check Nodes (CN), also called restriction or test nodes, with a CN for each one of the (N -K) parity-check equations given by H. Naturally, the fact that their patent has expired, has shifted the attention of the scientific community and industry away from Turbo codes [3] towards the study of LDPC codes [4], which quickly have shown to be able of guaranteeing similar or even superior coding performance. Mainly for these reasons and also because advances in microelectronics allowed the development of hardware solutions for real-time decoding, LDPC codes have been adopted by modern communication standards [5][6][7]. Important examples of these standards are: the Digital Video Broadcasting-Satellite 2 (DVB-S2) for satellite communications [8]; the WiMAX IEEE 802.16e for wireless communication systems in Metropolitan area networks (MAN) [9]; the WiFi 802.11n standard for wired home networking technologies; and the 10 Gb Ethernet IEEE 802.3an. Also, the introduction of LDPC codes in 4G systems has recently been proposed, as opposed to Turbo codes adopted in 3G. Some of these applications impose challenges that typically have to be addressed by using dedicated solutions that require System-On-Chip (SoC) hardware providing at the same time good performance, low power consumption and small die areas. Naturally, special emphasis has been given to solutions addressing the DVB-S2 standard for satellite communications [8], which represents the most challenging application that actually incorporates the use of LDPC codes.
In the past we have seen that parallelism can be efficiently exploited to achieve good performances with iterative message-passing algorithms used in LDPC decoding [10][11][12]. However, LDPC codes with good coding performances demand very long lengths [4,13], as it is the case of those adopted in the DVB-S2 standard. The development of efficient Very Large Scale Integration (VLSI)-hardware able of providing the huge computational power necessary and performing irregular memory accesses in real-time actually represents a real challenge. Focused on the original architecture developed by Kienle et al. [5] that uses M = 360 processor nodes, we investigated and developed efficient hardware-dedicated DVB-S2 LDPC decoders, namely by using a reduced number submultiple of M of lightweight node processors. Consequently, we have also reduced significantly the routing complexity of the interconnection network between processor nodes and memory blocks, which represents a target that aims to improve the design in terms of cost and complexity. This architecture has been initially proposed in [14], but some important challenges have not been addressed yet. In this article we tackle these challenges, namely by materializing the benefits and quantifying the gains achieved with the VLSI design of this architecture, in particular regarding the complex design of Application Specific Integrated Circuit (ASIC) memory blocks that can benefit from the use of the reduced number of processors. The proposed architecture exploits the modulom properties of DVB-S2 codes that allow to take advantage of partial parallelism to increase throughput performance. Because we propose a reduction of the number of processors used, we also address the necessary memory blocks reshape. Although the amount of stored bits remains the same independently of the number of processors, we show that the memory configuration for a solution that uses less processing units is more efficient in terms of circuit area. The architecture has been prototyped using ASIC technology. Synthesis results show that the proposed solution achieves high performance with low power consumption and within a small circuit area context. We also present synthesis results for a memory optimized ASIC architecture that compare well with those reported in state-of-the-art solutions of DVB-S2 LDPC decoders [5][6][7]15,16]. The main contributions of the article are: (i) efficient, scalable and parallel architectures with any submultiple of M = 360 number of processors for VLSI-based LDPC decoders under the context of DVB-S2; (ii) optimized synthesis area results for different sets of functional units and corresponding memory blocks reconfiguration; (iii) architecture with reduced routing complexity, occupying small die areas and consuming low power; and (iv) decoders with high throughput.
This article is organized as follows. In the next section we describe structured DVB-S2 LDPC codes. In Section 3 we present an overview of the state-of-the-art and also the main challenges regarding the development of efficient VLSI LDPC decoders for DVB-S2. Section 4 addresses the M-factorizable parallel approach proposed, and in Section 5 we describe an optimized RAM memory design procedure for ASIC. Section 6 presents experimental results for the proposed LDPC decoder architecture synthesized for ASIC and compares its performance with other architectures in the literature. Section 7 concludes the article.

LDPC codes
Graphical models, and in particular Tanner graphs [13], have often been proposed to perform approximate inference calculations [17,18]. They are based on iterative intensive message-passing algorithms [19] also known as Belief Propagation (BP) that perform the computation of joint probabilities on graphs (as depicted in Figure 1) and are commonly used in information theory (e.g., channel coding), artificial intelligence and computer vision (e.g., stereo vision) [17], which, under certain circumstances, can become computationally prohibitive.

LDPC decoding
In this article we are interested in the BP algorithm applied to LDPC decoding. In particular, we exploit the Min-Sum Algorithm (MSA), an efficient simplification [13] of the Sum-Product Algorithm (SPA), which is very demanding from a computational perspective. The MSA here applied to LDPC decoding operates over Log-likelihood Ratios (LLR) [1], exchanging information and updating messages between neighbor nodes over successive iterations. From a computational perspective, this simplification decreases the complexity of processing because it uses sum operations instead of multiplications, while subtractions replace divisions [13,20]. Although this represents a lower workload than the one required by the SPA, it is still quite significant. If the number of nodes is large-in the order of thousands as in the case of DVB-S2-the MSA can still demand very intensive processing [13,19].
But if from a computational perspective it makes sense to use the MSA, in [15,21] it is shown that by using the conventional MSA algorithm, it is not possible to achieve acceptable Bit Error Rate (BER) performance for the particular case of DVB-S2. The utilization of this algorithm puts performance far away from BER specifications defined for the standard [8], as it is clearly shown by figure 7 in [15]. In order to obtain BER performances closer to those defined by the DVB-S2 standard, we adopted the improved normalized MSA algorithm [22,23] with two distinct normalization factors that depend on the processed messages at the output of the CNs. Although in [23] this problem is only analyzed for short length codes, in [22] it is successfully extended to DVB-S2 LDPC codes. This fact, associated with the corresponding and relatively simple changes necessarily introduced in the design of the architecture, is the main reason why we adopted the lmin = 1 solution [24], which computes LLR messages based on two minimums.
The SPA can also be performed in the logarithmic domain as described in [20]. In the next equations, Lp n designates the a priori LLR of BN n , derived from the values received from the channel, and Lr mn is the message that is sent from CN m to BN n , computed based on all received messages from BNs of group N (m)\n , where N (m)\n represents BNs connected to CN m excluding BN n , according to: with and In the particular case of the MSA f(a, b) = 0. This represents an overestimation of LLR calculated values, which can be compensated by using the improved normalized MSA algorithm [22,23] with two normalization factors (this was the solution adopted in our architecture). Consequently, (1) becomes: where α can be obtained by:   CN m , and is given by: At the end of each iteration we compute the a posteriori LLRs of BNs, Lr m n (7) and perform hard decoding to obtain the decoded output word ĉ [13,19] Before the first iteration occurs, for each node pair (BN n , CN m ), we initialize Lq nm with the LLR information Lp n received from the channel, and then we proceed to the iterative body of the algorithm until stop conditions occur: Hĉ T ≠ 0 or the maximum number of iterations is reached.

DVB-S2 LDPC codes
Following the recognition of their potential, LDPC codes have been recently adopted by DVB-S2 [8] and other new standards for communication and storage applications. They use their powerful coding gains, obtained at the expense of computational power, to achieve good performances under adverse channel conditions. Some of the LDPC codes adopted in those standards have a periodic nature, which allows exploiting suitable representations of data structures for reducing the computational requirements. These modulo-m properties of DVB-S2 LDPC codes are described next, in order to allow understanding how architectures can take advantage of them, namely for exploiting parallelism.
The Forward Error-correcting (FEC) system of the recent DVB-S2 standard [8] incorporates a special class of LDPC codes based on Irregular Repeat Accumulate (IRA) codes [25]. The parity-check matrix H is of the form: where A is sparse and has modulo-m properties, and B is a staircase lower triangular matrix. The structured constraints put on the pseudo-random generation of matrix A allow a significant reduction on the storage requirements without significant code performance loss. The N bits of a codeword are represented in the systematic form, divided in information bits (IN) and parity-check bits (PN) given by the number of columns of A and B, respectively. The construction technique used to generate the A matrix consists of splitting the IN nodes into disjoint groups of M = 360 consecutive 1's. All the IN nodes of a group l should have the same weight w l and it is only necessary to choose the CN nodes that connect to the first IN of the group, in order to specify the CN nodes that connect to each one of the remaining (M -1) IN nodes of that group. The connection choice for the first elements of group l is pseudorandom but it guarantees that, in the resulting LDPC code, all the CN nodes must connect to the same number of IN nodes. Denoting by r 1 , r 2 , . . . , r w l the indices of the CN nodes that connect to the first IN of group l, the indices of the CN nodes that connect to IN i , with 0 ≤ i ≤ M -1, of group l can be obtained by: The parameter M = 360 is constant for all codes used in the DVB-S2 standard. For each code, the first groups of IN that form A have constant weights w b >3, and the remaining groups have weights w b = 3. Matrix B has a lower triangle staircase profile as shown in (9). The LDPC codes adopted by the DVB-S2 standard support two different frame lengths, one for short frames (N = 16, 200 bit) and the other for normal frames (N = 64, 800 bit). The short frame mode supports ten distinct code rates as depicted in Table 1, while the latter supports 11 rates as shown in Table 2. The column and row weights are also depicted in Tables 1 and 2 for all rates in the standard. For short frame lengths, only 50% of the codes have constant weight w c as indicated in Table 1, while for normal frame length codes all CNs have a constant weight w c , as indicated in Table 2.
Both tables show the number of edges for each code adopted in the DVB-S2 standard. In each edge circulate messages (Lr mn and Lq nm ) that are used to update the connected nodes. A closer inspection, for example, of code with rate = 3/5 for normal frames, shows that the total number of edges of the Tanner graph is 285120. Given that communications occur in both directions (from CNs to BNs, and then from BNs to CNs), 570240 >2 19 messages are exchanged per iteration, which imposes significant computational demands (at several levels) for the development of LDPC decoders.

Overview of VLSI LDPC decoding architectures
The two types of nodes (BN and CN) in the Tanner graph require distinct processing. Therefore, processor nodes of LDPC decoding architectures must be capable of performing these two types of calculations in order to compute the improved normalized MSA. Also, it is necessary to incorporate memory to store messages associated with each processor node, and the decoder requires an efficient switching mechanism to allow each processor node in the graph to access different memory banks. Ideally, this should be performed in parallel to increase throughput. The message-passing scheduling mechanism controls the order in which messages are updated (i.e., the order how memory is addressed by each node processor), defining which nodes communicate with each other and in what order. A control logic unit, dependent on scheduling, controls the interconnection network that performs this task. It connects existing BN (usually a factor of N) with CN (usually a factor of (N -K)) processors to the corresponding memory blocks, in order to allow read/write operations (to be performed in parallel) into different blocks that store subsections of the graph. The interconnection network should be designed to allow conflict-free accesses to memory banks.

Parallel LDPC decoder architectures
The complex nature of such a VLSI architecture presents several challenges. To perform fast LDPC decoding, it requires a certain level of parallelism, which should have some significance in order to achieve a throughput that can be very demanding (90 Mbps in the DVB-S2 case [8]). Above some level of parallelism the requirements for routing, area and power consumption may be difficult to achieve for practical reasons. Presently, there are LDPC decoder architectures ranging from serial [26] to fully parallel [10], targeting different objectives. Serial decoders trade throughput with architectural simplicity, obtaining reduced die areas. Although limitations at the processing level can make Properties of LDPC codes (the five short frame codes marked with the symbol * don't have a constant weight per row w c , since they have been shortened has defined in the standard [8]. Consequently, they have rates which are an approximation to those mentioned in this table, but not exactly the same.) used in the DVB-S2 standard [8] for the short frame length them target mainly low-throughput applications, they can exploit hardware simplifications, such as reducing the storage size of messages and still achieve coding gains without noticeable loss in performance [26]. Parallel approaches offer superior throughput at the expense of architectural complexity and circuit area. On a fully parallel system, the complexity increases significantly as the code length grows. The dimension and complexity of this type of architectures are only acceptable for short to medium length LDPC codes. One of the first LDPC decoder on ASIC [10] is based on a fully parallel architecture for a 1, 024 bit, rate = 1/2 LDPC decoder processing 64 iterations, that obtains a throughput of 1 Gbps. The excellent performance achieved is only possible due to the complex routing established to connect all N BN processors with corresponding (N -K) CN processors. Consequently, the complex wiring of the circuit created the need to carefully manage the floor planning and routing of the design, which alerted for the prohibitive complexity involved in the design of architectures that support block codes with high lengths (N >1,024 bit). Naturally, the size of the architecture is influenced by the length of the largest code supported and also by the level of parallelism adopted. This has direct implications in the number of node processors necessary to perform computation and, more importantly, in the complexity of the interconnection network (represented by a barrel shifter in Figure 2). The complexity of this network decreases by reducing the level of parallelism of the architecture. For a fixed code, permutation patterns are known a priori and the indices/addresses that indicate which nodes connect with each other are usually computed offline. Compared with full-parallel decoders, the complexity of an interconnection network in a partialparallel architecture significantly decreases, which eliminates important restrictions from the routing process (in the place & route phase of the design).
Partial-parallel architectures for high-throughput LDPC decoding have been proposed by Mansour et al. [11,12], where the concept of architecture-aware LDPC codes was firstly introduced. Architecture-aware codes decouple the architectural dependence of the decoder from the LDPC code properties. They achieve a faster convergence rate, which allows obtaining higher throughputs [12]. In this case, the interconnection network can be efficiently implemented by using programmable multi-stage networks [11]. A comparison between serial, partly-parallel and full-parallel architectures has been performed by Fanucci et al. in [27] for a regular code with N = 2,048 bits, w c = 6 and w b = 3. However, these architectures are dedicated to particular cases and other solutions were necessary to support other types of more demanding LDPC codes, namely those that have irregular nature and higher block lengths.
For long length LDPC codes (e.g., DVB-S2 codes), typically more than 70% of the circuit's area in VLSI decoders is occupied by memory units [6,15], which are essential to support the iterative message-passing mechanism. They are extensively addressed in the context of this article. Memory requirements for parallel architectures capable of supporting long length codes are feasible, but hard to achieve for practical reasons on a VLSI approach. The number of accesses and the amount of memory necessary to store messages exchanged between hundreds or thousands of nodes is extremely high, which imposes restrictions in die area and power consumption. Although in a partial-parallel  approach the total amount of memory necessary to store messages remains unchanged for any number of processors, by varying the latter, the height and width of memory blocks change, which can be exploited to achieve architectures with more efficient areas.

Parallel M-kernel LDPC decoder architectures for DVB-S2
The development of LDPC decoders for the DVB-S2 standard can be considered to be among the most demanding applications for this type of processing, mainly due to the high length of codes adopted [8]. LDPC codes used in this standard are IRA codes, where the sparse parity-check H matrix has modulo-m properties that can be exploited to achieve hardware parallelism [25]. The use of scalable parallelism to obtain chips with small areas for DVB-S2 has been proposed for technologies ranging from 0.13 µm down to 65 nm [5][6][7]15,16] based on the processing of sub-sets of the Tanner graph, which guarantee the minimum necessary throughput of 90 Mbps required by the standard [8]. The design of partial-parallel solutions generates systems with lower areas, but also implies lower throughputs, comparing to full-parallel architectures in a SoC. One of the architectures initially proposed by Kienle et al. [5] uses M = 360 processor nodes, also referred in this text as Functional Units (FU). They work in parallel and share control signals, that process both CN nodes (in CN mode) and IN nodes (in BN mode) according to a flooding schedule approach.
Attending to the zigzag connectivity [8] between PNs and CNs defined by B in (9), they are updated jointly in CN mode [5]. An efficient VLSI architecture has been proposed in [28,29], where the joint processing of CNs and PNs is performed as depicted in Figure 3. For example, when updating PN m , according to (6) it becomes a passing node because the message it sends to CN m+1 is the message received from CNm added to the channel information, and vice-versa (see Figure 3). Since each FU processes q consecutive CNs, the update of PNs follows an horizontal schedule approach, with PNs and CNs being processed simultaneously. This scheduling mechanism allows a faster decoding convergence [30]. Therefore, the message that travels through CN m , PN m and CN m+1 is kept in the FU and only the backward message that is sent from CN m to PN m-1 is saved in the external memory. The equations that describe the operation inside the FU in CN mode are [29]:   All 360 FUs process one message per cycle. In IN mode, the 360 messages are read from the same address (sequentially incremented, for this type of processing) in the respective memory blocks. The new messages resulting from this computation are then stored in the same address cyclically shifted right through the interconnection network, implemented by a barrel shifter. The modulo-m properties of LDPC codes used in DVB-S2 allow to replace the complex interconnection network by a common barrel shifter. In CN mode, messages have to be read from specific addresses and stored back in the same addresses cyclically shifted left to conclude the execution of an iteration. Once again, the access is performed in parallel for all 360 messages. The barrel shifter mechanism and the efficient memory mapping scheme constitute the major strengths of the architecture described in [5].
Memory requirements for a partially parallel architecture capable of supporting long length codes used in DVB-S2 are demanding. The appropriate Control of ADDRESSES and SHIFTS memory banks indicated in Figure 2 guarantees that every time the address of a memory block changes, it changes accordingly and without conflicts for all M = 360 processors in parallel. The barrel shifter, which has a switching activity controlled by the SHIFTS memory bank, can be properly managed together with the ADDRESSES memory block that defines memory accesses, to simulate the connectivity of the Tanner graph. In [5] it is illustrated the memory mapping mechanisms used in BN processing as a function of q (a parameter defined by the standard), which is presented in (11). In CN processing, a similar scheme applies. For each different code we use different SHIFTS and ADDRESSES values, which can be easily obtained from annexes B and C of the DVB-S2 standard [8]. In Figure 4 mainly three distinct types of memory are depicted: (i) Channel memory which initially receives data (IN and PN) from the input of the LDPC decoder; (ii) Message memory, where all messages associated with information bits are stored (the FU supports both types of BN and CN processing, which perform alternately during the same iteration); and (iii) PN Message memory that holds messages associated with parity bits, which are computed in CN mode and have all weight w c = 2 (zigzag connectivity depicted in Figure 3). In PN Message memory each FU only needs to store the message which is passed during the backward update of CNs [14].
However, the large number of FUs used (360) still implies a wide and complex barrel shifter that requires a significant die area and imposes routing problems in order to accommodate the simultaneous accesses of all 360 FUs to corresponding messages in memory. Since this architecture is able to provide, with the current technology, a throughput far above the mandatory 90 Mbps, we are able to reduce the number of FUs even further. In fact, we herein show that this can be done by any integer factor submultiple of M = 360, which, for VLSI systems, presents a beneficial reduction of the size of the barrel shifter.

M-factorizable VLSI parallel LDPC decoder architecture for DVB-S2
Under this context we developed a novel hardware approach, originally proposed in [14,28], which is based on a partial-parallel architecture that simplifies the barrel shifter and reduces memory requirements. We address the generalization of the well known M-kernel parallel hardware structure [5] and propose its partitioning by any integer L submultiple of M (which can be obtained from the decomposition of M = 360 = 2 3 × 3 2 × 5), without memory addressing/reconfiguration overheads and keeping unchanged the efficient message-passing mechanism. The proposed architecture provides an efficient way of reducing the number of FUs and the overall complexity of the decoder. This approach does not only surpass some disadvantages of the architecture described in [5], such as die area occupied or routing congestion, but it also adds flexibility and reconfigurability to the system according to the decoder requirements and device constraints. This type of sub-sampling approach preserves the key modulo-m properties of the architecture [5], with only P = M/L processing units addressed in [28,29] as shown in Figure 2. This strategy allows a linear reduction by L of the hardware resources occupied by the FU blocks, and reduces significantly the complexity (2 × O(P log P)) of the interconnection network (or barrel shifter), which simplifies the routing problem. This strategy does not imply an increase by L in the size of ROM memories (that hold SHIFTS and ADDRESSES values). In fact, by performing a sub-sampling by L, each group of M = 360 INs or CNs results in L subgroups of P INs or CNs. As the properties of the first subgroups of CNs and BNs to process are known, the modulo-m periodicity of DVB-S2 codes allows to automatically calculate the properties of the remaining subgroups. Based on the SHIFTS shift 0 to apply to the first subgroup, we can obtain the remaining subgroups g, with 0 ≤ g ≤ L -1 by [14]: The same applies for ADDRESSES, which can be computed as: For the architecture to support only P FUs (instead of the M = 360 FUs as in [5]), memory blocks have to be reorganized according to (16), (17) and Figures 2 and 5. We can reshape these memory blocks and keep unchanged the size of the system ROM, by computing on the fly the new SHIFTS as a function of those initially stored in the ROM, as indicated in (16). In the configuration shown in Figure 2, each FU i , with 0 < i < P -1, is now responsible for processing information (IN), parity (PN) and CN according to a proper memory mapping and shuffling mechanism. As we increase L, the smaller become the sub-sets of the Tanner graph processed in parallel. Figure 5 describes the addressing mechanisms used in the factorizable architecture for L = 2, which uses 180 FUs. The amount of memory is exactly the same as in the architecture with L = 1, but the structure is different. There are less FUs and the corresponding memory word size decreases. As shown in Figure 2, memories have to become higher and thinner in order to hold the same information as before. This new memory configuration will introduce benefits in area occupied by the architecture, as it will be shown later.

Optimizing RAM memory design based on macro cells for ASIC
The presented architecture has been described in Register Transfer Level (RTL) for P = {45, 90, 180, 360} FUs. All solutions use 5-bit to represent data messages. Figure 2 shows that all four configurations use exactly the same amount of memory, though rearranged with different widths and heights (distinct memory dimensions). The smaller the number of P functional units adopted, the higher and thicker block memories become. The complete type and amount of memories necessary for such design are: • Message memory holds internal messages calculated during each iteration of the decoding process associated with IN; Message memory width (word length) is given by w a = number of FU × message, which in this case is w a = M × 5/L bits; the height h mm can be obtained for worst case scenario (code with rate = 3/5 obtained from Table 2, where the number of edges is maximum). In this case, h mm = L × q × (w c -2); • Channel memory stores IN and PN data incoming from the channel. This memory has the same word length as Message memory (w ch = w a ), and the height is given by h ch = L × 64800/M, representing the worst case (normal frame); • PN message memory holds messages calculated during each iteration, which are associated only with parity bits. It has the same width as Channel memory and the height h PN = L × q is obtained for worst case scenario (code with rate = 1/4) from Table 2, which turns q = 135 (only one 5-bit message per each parity bit has to be stored); • Hard decoding memory holds IN and PN data obtained in the hard decoding phase of processing (1-bit per FU); the width of this memory is given by the number M/L FUs of the architecture, and the height h HD is obtained from h HD = L × 64800/M.    Table 3, but rather standard dimensions which are usually a power of 2, as shown in Table 4. The area results obtained in the synthesis process and shown in Table 5 allowed other interesting conclusions. Memories occupy nearly 97% of the circuits' total area. The remaining part of the circuit is occupied by the barrel shifter, the functional units and the control logic. The fact that in the four configurations the areas are different was also a surprise. Table 5 shows these differences. As mentioned before and depicted in Figure 2, the total amount of memory is the same for all designs. If we realize that they occupy a significant area of the design, we conclude that their differences should be minimal. To analyze these differences, we first need to understand how memories are generated and the RAM generator limitations. The architecture implemented in RTL uses memories with large width (word length) and height (number of words, or memory positions).
The RAM generator used can create memories that support the requested number of words, but the maximum word width is limited to 64 bits, which is far below the architectural needs. To overcome this problem, B blocks of RAM memory were concatenated until the required width was achieved, as shown in Figure 6. Each RAM memory has its own internal control logic which can address data in the operating clock cycle, with its own antenna diode protection, testing mechanisms, power rings, etc. As represented in Figure  6, more memories are necessary for large words, which replicates control hardware and increases silicon area. This seemed to show that some level of inefficiency should be expected. A practical example with real areas can be addressed for the Message RAMs for 360 and 45 FUs, to give a better perspective of the problem. In the former configuration, the Message RAM width is 1,800 bits (5 bits per message × 360 FUs) with 2 10 addresses (height). For the second configuration, the RAM width is 225 bits (5 bits per message × 45 FUs) with 2 13 addresses. Both Message memory blocks have the same capacity (2 10 × 1, 800 bits), however the area used by the wider 360 FUs Message memory is 6.2 mm 2 , while the thiner 45 FUs memory occupies only 3.2 mm 2 . These memories were created by concatenating B banks of 45 bit RAMs, as illustrated in Figure 6. For 360 FUs, 40 instantiations are necessary (40 × 45 bits = 1,800 bits) while for 45 FUs only five instantiations are needed (5 × 45 bits = 225 bits). Whatever the factorization level L adopted, the architecture will still have to support a Very Long Data Word (VLDW) bus. This property is illustrated in detail in Figure 4, where the VLDW bus implies a complex routing of the wiring associated with it. This introduces additional complexity managing the floor planning and routing process (in the place & route phase of the design). Nevertheless, the architecture with 45 FUs minimizes this problem by moving its complexity to a dimension where it can be more easily tractable.

Minimal RAM memory configuration
From all designs in Table 5 the architecture with 45 FUs  occupies the smallest area. Comparing Tables 3 and 4, we notice that due to hardware restrictions, there is a   This fact occurs in all memories and for all architectures and we tried to minimize it by using decompositions into powers of 2 and concatenating smaller RAMs, but we realized that instead of decreasing, the area increases nearly 30%, which can be justified by the additional number of internal RAM hardware controllers incorporated into the design. However, joining together several blocks of memory that operate in different time slots into one bigger common block of memory can become a more efficient solution. By analyzing in detail the Message memory block of the architecture with 45 FUs depicted in Figure 4, each 5 bits of the word are directly connected to each FU, and consequently only one main memory block is used, as mentioned in the beginning of Section 5. Therefore, the control is unique and only one (write or read) address is needed to transfer data to/ from all FUs. This memory must be a dual port RAM, in order to allow reading and writing during the same clock cycle. However, due to memory generator limitations in the tools used, this type of memory was not possible to achieve. To overcome this, we adopted a solution that consists of creating a memory that operates at twice the circuit's frequency (f op ): in the first cycle it performs read operations; and in the second one it writes data to memory. The system's master clock operates at 200 MHz and it is used to feed memory blocks, while f op = 100 MHz is synchronously derived from this clock source. The LDPC decoder input data comes from an external de-interleaver module that does not work at the same clock frequency as the LDPC decoder. To guarantee that no information is lost, all messages delivered to the decoder must be stored in Channel memory. The PN message memory is equally stored on a dual port RAM which, due to memory generator limitations as before, was converted into a single RAM working at twice of the initial frequency. In the beginning, this memory is loaded with zeros, and then it will hold data associated with parity bits necessary in each iteration of the decoding process. Shift and address data are loaded from an outside source into SHIFTS and ADDRESSES memories. The values stored in these memories depend on the code rate and must be previously calculated.

Experimental results
In the next subsections we present the obtained experimental results for the architecture in Figure 2, which was synthesized for ASIC using a 90 nm CMOS process design. Place & route was performed using eight metal layers. Memories were synthesized using a generic RAM generator tool with libraries of macro cells. Estimates for power consumption were obtained assuming a toggling rate of 10%. The architecture was also synthesized for Field-Programmable Gate Array (FPGA), but only for validation purposes [14].

Experimental synthesis results for ASIC
Synthesizing in ASIC technology the architecture with P = {45, 90, 180, 360} FUs aims at finding the one which produces better power consumption and area results, while simultaneously supporting the throughput requirements of the standard. The architecture supports short and normal frame lengths for all code rates, even though it is dimensioned for the worst case. All rates in  It can also be seen that a maximum of 55 iterations is supported for the configuration with P = 90 FUs decoding the worst case code. Technical limitations imposed by the available library of memory macro cells, namely the maximum frequency of operation at 200 MHz and the fact that they do not support dual-port memories, have put limits on throughput and maximum number of iterations supported. To exemplify this, our architecture implemented with P = 45 FUs supports at maximum seven iterations when decoding at a throughput of 90 Mbps for worst case working conditions, as depicted in Table 6. By using dual-port memories, or memories capable of supporting frequencies in the range of 400 MHz (that would allow to have the main circuit operating at 200 MHz), it would be possible to achieve 14 iterations for worst case condition (see Table 6) without additional modifications in the design, at the expense of an increase in power consumption. To validate this assumption, we re-synthesized the architecture with the circuit operating at f op = 125 MHz and then at f op = 200 MHz (as indicated in Table 6) and no timing violations occurred.
In (18) and (19) it is described how the throughput and maximum number of iterations were calculated for Figure 7, with the frequency of operation selected at f op = 100 MHz, i.e.: . (18) The variables frame_length, W and w j depend on code length and rate, and can be found in the standard [8].  The variable W defines the number of elements of A in the compact form (see (9)) as presented in annexes B and C of the standard [8]. It can be computed from Tables 1 and 2 by performing: where IN_weight_j and IN_weight_3 are the number of IN nodes with weights weight_j and 3, respectively. At this point and due to the configurable nature of the architecture, we found room to perform some extra optimizations. For practical purposes, and since it presented good results for throughput and area, we adopted the configuration based on 45 FUs as the basis solution to be optimized. The next subsection shows that there is still room to perform optimizations and how to achieve that.

Experimental synthesis results for ASIC using an optimized 45 FUs architecture
To translate the optimizations mentioned in Section 5.1 into area results, the new circuit has been synthesized for the same 90 nm technology used in the original synthesis. The optimizations reduce the area of the smallest original architecture nearly 20%. Synthesis results for the previously mentioned P = {45, 90, 180, 360} FUs based architectures and also for the optimized 45 FUs version are listed in Table 5. The areas range from 21.2 to 6.2 mm 2 , with corresponding levels of power consumption of, respectively, 290 mW down to 85 mW. The 45 FUs' architecture optimized with an efficient RAM memory reshape presents a total area of 6.2 mm 2 and 85 mW of power consumed at a typical corner voltage of 1.1 V. The estimation of power consumption was performed by considering typical corner operating conditions for current, voltage and temperature. This approach was followed because extreme working conditions usually do not occur, or if they do, it is during small periods of time which does not negatively affect the design.
No timing or design rule check (DRC) violations were reported, so at this stage the 45 FUs optimized architecture is physically validated and ready to be implemented in silicon. To accommodate place & route, it should be noted that, based on previous design experience, we estimated an increase in area equivalent to 20% (after final validation, we realized that it could have been approximately 19%), which corresponds to a global area of 7.4 mm 2 for the optimized 45 FUs architecture.

Discussion
The assessment of synthesis area results for the proposed architecture presented in Table 5 shows that it compares well with state-of-the-art solutions [5][6][7]15,16,31]. Table 6 compares synthesis results for state-of-the-art architectures with the 45 FUs-optimized architecture here proposed. A 90 nm technology is used in [6,16], producing, respectively, an area of 4.1 and 9.6 mm 2 . Although it is not possible to fully assess [6] because important characteristics of the circuit such as power consumption are not indicated and only area and throughput are mentioned, it is important to refer that the architecture in [6] presents a smaller area but it only supports the normal frame mode of operation. Results for similar architectures but using different technologies have been presented: for example, for a 0.13μm technology, an area of 22.7 mm 2 has been achieved [5], while [7] presents 3.9 mm 2 for a 65 nm technology, and [15] achieves 6.03 mm 2 for the same 65 nm technology. Although some of them claim to occupy smaller die areas, our solution supports both frame lengths, while [6,7] only support the normal frame mode. Also, considering that our design is based on a 90 nm process, it compares favorably in terms of area against [7,15], which use a 65 nm technology as mentioned before.
The new architecture here proposed based on 45 FUs shows an LDPC decoder circuit with smaller area occupied than those reported in [5,16]. More recently, Kim et al. [31] presented an architecture with an excellent throughput of 520 Mbps but at the expense of a circuit area equal to 12.4 mm 2 for a 90 nm process. For power consumption purposes, perhaps most important is the fact that our architecture works with an inferior maximum frequency of operation than those just reported in state-of-the-art solutions [5,7,15,16]. While [5,15] propose architectures with operating frequencies of 270 MHz, the one in [7] requires 400 and 320 MHz in the case of [16]. In the new approach herein proposed, 45% of the circuit works at 200 MHz, while the remaining 55% work at 100 MHz. The power consumption required by our architecture is inferior to those mentioned by competitors. Namely, 477 mW are reported in [15] for best case working conditions, while we achieved 105 mW for the 45 FUs architecture and 85 mW for the optimized version of it, as depicted in Tables 5 and  6. By scaling the proposed architecture to a 65 nm technology using the same parameters as those reported in [7] and correspondingly increasing the frequency of operation to 400 MHz (which is possible as long as proper memory technology is adopted), it is possible to increase the maximum number of iterations supported by this architecture to 28 for worst case working conditions, which for DVB-S2 occurs for code with rate = 3/ 5.

Conclusions
This article addresses the generalization of a state-ofthe-art M-kernel parallel structure for LDPC-IRA DVB-S2 decoding, for any integer factor of M = 360. The proposed architecture adopts a partitioned processing of subsets of the Tanner graph that keeps unchanged the efficient message memory mapping structure without addressing unnecessary overheads. This architecture proves to be flexible and easily configurable according to the decoder constraints and represents a trade-off between silicon area and decoder throughput above the required 90 Mbps for all DVB-S2 codes. Under this context, five configuration designs with different number of processing units have been synthesized using ASIC technology. They range from 360 to 45 FUs which represents, respectively, an equivalent occupied area of 21.2 and 7.7 mm 2 . Although the process of generating RAM memories imposes constraints, the investigation carried out under the context of this article allowed several interesting conclusions that were applied into the design of the LDPC decoder circuit. Re-dimensioning and rearranging the order how memory blocks are grouped together allowed reducing the global area of the circuit to a value as low as 6.2 mm 2 . Additional improvements in area could still be experienced either by using a different number of metal layers, or by adopting full custom RAM memories in the circuit design. Since in the present case nearly 97% of the circuit's area is occupied by memory blocks and we used a generic RAM generator tool, the area occupied by the circuit can be further reduced if we require to the foundry the use of dedicated RAM memory cells. This shows the competitiveness of the architecture when compared with state-ofthe-art solutions for the same 90 nm technology. Moreover, the maximum frequency of operation of the design here proposed is smaller than those reported by competitors, which justifies the low levels of power consumption achieved of approximately 85 mW.