# Parallel LDPC decoding using CUDA and OpenMP

*EURASIP Journal on Wireless Communications and Networking*
**volume 2011**, Article number: 172 (2011)

## Abstract

Digital mobile communication technologies, such as next generation mobile communication and mobile TV, are rapidly advancing. Hardware designs that provide baseband processing for new protocol standards are being actively attempted. However, because multiple standards are emerging concurrently and the needs on device functions are diverse, hardware-only implementation may have reached a limit. To overcome this challenge, digital communication system designs are adopting software solutions that use central processing units or graphics processing units (GPUs) to implement communication protocols. In this article we propose a parallel software implementation of low density parity check decoding algorithms, and we use a multi-core processor and a GPU to achieve both flexibility and high performance. Specifically, we use OpenMP for parallelizing software on a multi-core processor and Compute Unified Device Architecture (CUDA) for parallel software running on a GPU. We process information on H-matrices using OpenMP pragmas on a multi-core processor and execute decoding algorithms in parallel using CUDA on a GPU. We evaluated the performance of the proposed implementation with respect to two different code rates for the China Multimedia Mobile Broadcasting (CMMB) standard, and we verified that the proposed implementation satisfies the CMMB bandwidth requirement.

## 1. Introduction

Today, wireless devices transmit and receive high rate data in real-time. The need to provide high transmission rates with reliability is increasing, in order to offer various multimedia services with 4G mobile communication systems. Typical data transmission requirements of 4G mobile communication systems are 100 Mbps in mobile circumstances and 1 Gbps in a stationary state [1]. Therefore, powerful error correcting codes are becoming indispensable.

The low density parity check (LDPC) code is one of the strongest error correcting codes; it is a linear block code originally devised by Gallager in the 1960s [2]. However, it was impossible to implement the code in the hardware that was available at that time. About 30 years later, the LDPC code was revisited by MacKay and Neal [3, 4]. They rediscovered the excellent properties of the code and suggested its current feasibility, thanks to the development of communication and integrated circuit technologies. Recently, Chung et al. [5] showed that the LDPC code can approach the Shannon limit to within 0.0045 dB. The LDPC code has a larger minimum distance than the Turbo code, which was regarded as the best channel coding technique before the LDPC code started to draw attention. Hence, with almost no error floor issues, it shows very good bit error rate curves. Furthermore, iterative LDPC decoding schemes based on the sum-product algorithm (SPA) can be fully parallelized, leading to high-speed decoding [5]. For these reasons, LDPC coding is widely regarded as a very attractive coding technique for high-speed 4G wireless communications.

LDPC codes are used in many standards, and they support multiple data rates for each standard. However, it is very challenging to design decoder hardware that supports various standards and multiple data rates. Recently, software defined radio (SDR) [6] baseband processing has emerged as a promising technology that can provide a cost-effective, flexible alternative by implementing a wide variety of wireless protocols in software. The physical layer baseband processing generally requires very high bandwidth and thus high processing power. Thus, multi-core processors are often employed in modern, embedded communication devices [7, 8]. Also, GPUs are often adopted to achieve high computational power [9]. Wide deployment of multi-core processors and rapid advances in GPU performance have led to active studies in designing LDPC decoders using GPUs [10–14]. For example, a recent study proposed a technique to utilize a GPU to run SPA and described a way to access LLR data [11]. A related study proposed a method for LDPC decoding using Compute Unified Device Architecture (CUDA) [13]. They showed that a GPU could reduce decoding time dramatically.

In this article, we extend this parallelization further in such a way that various standards and code rates can be supported seamlessly. We propose a design which employs both a central processing unit (CPU) and a graphics processing unit (GPU). To support various code rates, the host multi-core CPU reads the H-matrix, and, using OpenMP, it generates address patterns which help the GPU to effectively execute the LDPC decoding in parallel. The LDPC decoding algorithm is written in CUDA [15], which is a parallel computing language developed by NVIDIA, and the CUDA program is executed by the GPU.

The rest of this article is organized as follows. In Section 2, we review LDPC decoding algorithms and parallelization techniques using CUDA. In Section 3, we present the memory structure for CUDA, an address generation method for LDPC decoding, and existing parallelization techniques. In Section 4, the proposed software implementation and performance evaluation results are presented. Section 5 concludes this article.

## 2. Background

### 2.1 Review of LDPC decoding algorithms

The LDPC code is a linear block code with a very sparse parity check matrix called the H-matrix. The rows and columns of an H-matrix denote parity check equations and code symbols, respectively. LDPC codes can be represented by a Tanner graph, which is a bipartite graph whose two sides represent check nodes and bit nodes, respectively. Thus, check nodes correspond to the rows of the H-matrix, and bit nodes correspond to the columns of the H-matrix. For example, when the (*i, j*) element of an H-matrix is '1', the *i* th check node is connected to the *j* th bit node of the equivalent Tanner graph. Figures 1 and 2 illustrate an H-matrix and the equivalent Tanner graph for (8, 4) LDPC codes.
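The row and column correspondence described above can be sketched in a few lines of C. The matrix below is an illustrative regular (8, 4) parity check matrix chosen for this example; it is an assumption and is not claimed to be the matrix of Figure 1. The function names `check_node_neighbors` and `bit_node_neighbors` are likewise hypothetical.

```c
#include <assert.h>

#define M 4  /* check nodes = rows of H */
#define N 8  /* bit nodes   = columns of H */

/* Illustrative (8, 4) parity check matrix (an assumption, not taken from Figure 1). */
static const int H[M][N] = {
    {0, 1, 0, 1, 1, 0, 0, 1},
    {1, 1, 1, 0, 0, 1, 0, 0},
    {0, 0, 1, 0, 0, 1, 1, 1},
    {1, 0, 0, 1, 1, 0, 1, 0},
};

/* List the bit nodes connected to check node `row`; returns the row weight. */
int check_node_neighbors(int row, int out[N])
{
    assert(row >= 0 && row < M);
    int count = 0;
    for (int col = 0; col < N; col++)
        if (H[row][col] == 1)
            out[count++] = col;
    return count;
}

/* List the check nodes connected to bit node `col`; returns the column weight. */
int bit_node_neighbors(int col, int out[M])
{
    assert(col >= 0 && col < N);
    int count = 0;
    for (int row = 0; row < M; row++)
        if (H[row][col] == 1)
            out[count++] = row;
    return count;
}
```

For this assumed matrix, check node 0 is connected to bit nodes 1, 3, 4, and 7, and bit node 1 is connected to check nodes 0 and 1.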

Most practical LDPC decoders use soft-decisions, because soft-decision decoders typically outperform hard-decision ones. A soft-decision decoding scheme is carried out, based on the concept of belief propagation, by passing messages, which contain the amount of belief for a value being between 0 and 1, between adjacent check nodes and bit nodes. Based on the delivered messages, each node attempts to decode its own value. If the decoded value turns out to contain an error, the decoding process is repeated, up to a predefined number of times. Typically, there are two ways to deliver messages in LDPC decoding. One is to use probabilities, and the other is to use log-likelihood ratios (LLRs). In general, using LLRs is favored since that allows us to replace expensive multiplication operations with inexpensive addition operations.

### 2.2 Parallelization of LDPC decoding

As explained in Section 2.1, an LDPC decoding algorithm is capable of correcting errors by repeatedly computing and exchanging messages. The amount of computation depends on the size of the H-matrix. However, recently published standards reveal a growing trend: codewords are getting longer as the amount of data transfer increases [1]. By the same token, the size of the H-matrix is increasing. For a recent standard, DVB-T2 [16], the length of the codeword is 64,800 bits or 16,200 bits. For China Multimedia Mobile Broadcasting (CMMB) [17], the length is 9,216 bits. The huge size causes both decoding complexity and decoding time to increase. Therefore, it is crucial to distribute the computation load evenly to multiple cores and to parallelize the computation efficiently.

LDPC decoding consists of four general operations: initialization, check node update, bit node update, and parity check. Examining the algorithm reveals that the check node update can be done in parallel, since the rows are uncorrelated with each other. Also, the bit node update on each column can be processed in parallel, because an LDPC decoding algorithm has independent memory accesses among the four types of operations. For example, in the H-matrix in Figure 3, check node operations can process four rows in parallel, and bit node operations can process eight columns concurrently.

This article has three main technical contributions. First, we propose an efficient and flexible technique which can be applied to various protocols and target multi-core platforms, since we propose a solution which employs both CPUs and GPUs. We also introduce an efficient technique to reduce the memory requirement significantly. Next, we propose parallelization techniques for not only check and bit node update operations, but also parity check operations, which will be described in detail in Section 3.

### 2.3 CUDA programming

CUDA is a GPU software development kit developed by NVIDIA. One major advantage of CUDA is that it is an extension of the standard C programming language. Hence, those who are familiar with the C/C++ programming language can learn how to program in CUDA relatively easily. Also, CUDA is capable of fully utilizing the fast-improving GPU processing power. Further, NVIDIA hardware engineers actively reflect scientists' opinions as they develop the next generation of CUDA and GPUs. For instance, support for double precision computation, error correction capability, and increased shared memory may not be crucial for graphics processing in game applications, but they are important for many scientific and engineering applications. These features have been added in recent versions of CUDA and recent GPUs.

Figure 4 shows the architecture of NVIDIA's 8800GTX. There are 16 multiprocessors, and each multiprocessor has 8 single precision thread processors (SPs). Therefore, the total number of SPs is 128. Each SP can process a block of data with a thread allocation in parallel. However, it is not possible for the CPU and the GPU to share memory space. Thus, the GPU must make a copy of the shared data to its own memory space in advance. If the CPU wants data stored in the memory of the GPU, a similar copy operation must take place. These copy operations incur significant overhead.

Figure 5 shows the relation between a block and a thread in the GPU. A kernel function is executed once by each thread. For example, if there were 12 threads in each block, such as Block (1,1), and there were 6 blocks in a grid, then the kernel function would be executed 72 times. When a kernel is invoked, the thread and the block index are identified by the built-in `threadIdx` and `blockIdx` variables, respectively [18, 19].
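The arithmetic of the example (6 blocks of 12 threads giving 72 kernel executions) can be checked with a sequential C emulation of a one-dimensional launch. This is only a sketch: `kernel_body` and `emulate_launch` are hypothetical names, and a real GPU would run the invocations concurrently rather than in nested loops.

```c
#include <assert.h>

/* Hypothetical stand-in for a kernel body: records which element each call handled. */
static void kernel_body(int block_idx, int thread_idx, int block_dim, int *visits)
{
    /* Same formula as blockIdx.x * blockDim.x + threadIdx.x in CUDA. */
    int global_index = block_idx * block_dim + thread_idx;
    visits[global_index] += 1;
}

/* Sequential emulation of a launch; returns how many times the kernel body ran. */
int emulate_launch(int grid_dim, int block_dim, int *visits)
{
    assert(grid_dim > 0 && block_dim > 0);
    int calls = 0;
    for (int b = 0; b < grid_dim; b++) {
        for (int t = 0; t < block_dim; t++) {
            kernel_body(b, t, block_dim, visits);
            calls++;
        }
    }
    return calls;
}
```

Calling `emulate_launch(6, 12, visits)` with a zeroed array of 72 counters runs the body 72 times and touches each element exactly once, which is the property that lets one thread be assigned to one check node or one bit node.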

### 2.4 OpenMP

OpenMP is a set of application program interfaces (APIs) for parallelization of C/C++ or Fortran programs in a shared memory multiprocessing environment. OpenMP has gained lots of attention lately as multi-core systems are being widely deployed in many embedded platforms. Recently, version 2.5 was released. OpenMP is a parallelization method based on compiler directives, in which a directive will tell the compiler which part of the program should be parallelized, by generating multiple threads. Many commercial and noncommercial compilers support OpenMP directives, and thus, we can utilize OpenMP in many existing platforms [20].

OpenMP's parallelization model is based on a fork-join model (Figure 6). Starting with only the master thread, additional slave threads are forked on demand. All threads except for the master one are terminated when execution of a parallel region ends. In this article, we use OpenMP pragmas to parallelize address generation computations. Since only the new address is transferred to the CUDA memory, the memory copy overhead is minimal.

## 3. Proposed LDPC decoder

As described above, when the size of H-matrices increases, the amount of computation grows rapidly. This makes it difficult to achieve satisfactory performance in either software- or hardware-only implementations that attempt to support multiple standards and data rates. Therefore, we propose a novel parallel software implementation of LDPC decoding algorithms that are based on OpenMP and CUDA programming. We will show that the proposed design is a cost-effective and flexible LDPC decoder which satisfies the throughput requirement for various H-matrices and multiple code rates. First, we will show the overall software structure, and next we will explain the parallelization techniques that we propose. (Figure 7)

### 3.1 Architecture of the proposed LDPC decoder

The overall structure of the proposed LDPC decoder is as follows. We assume that the target platform consists of a single host multi-core processor which can run C code with OpenMP pragmas and a GPU which can run CUDA code. To support multiple standards and data rates, multiple H-matrices are stored as files. The host CPU reads the H-matrix for a given standard and signal-to-noise ratio (SNR) constraint. The host CPU then generates an address table for the data that the GPU processes in parallel. Generation of the address table is parallelized by OpenMP pragmas. Next, the generated address information is transferred to the memory in the GPU. This copy operation takes place only if there is a change in standard, SNR constraint, or code rate.

When signals are received, the host CPU delivers them to the GPU. The GPU executes the proposed LDPC decoding software in parallel. Upon completion of the decoding, decoded bits are transferred to the host CPU. The CUDA memory copy API (`cudaMemcpy`) is used to exchange data between the host and the GPU. The copy overhead may be significant, so it is crucial to minimize it. It should be noted that in our implementation this copy operation takes place only for generated address transfers, received signal transfers, decoded bit transfers, and configuration (standard, code rates, etc.) changes. Therefore, the copy overhead is not large in our implementation.

### 3.2 One-dimensional address generation for parallelization

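**Function 1. New address generator**

    /* For every edge (check node i, its j-th bit node), find the position k
       of check node i in that bit node's own list of connected check nodes. */
    #pragma omp parallel for private(i, j, k) shared(v_nodes, c_nodes) schedule(static)
    for (i = 0; i < check_node_num; i++) {
        for (j = 0; j < check_node_weight; j++) {
            for (k = 0; k < bit_node_weight; k++) {
                if (v_nodes[c_nodes[i].index[j]].index[k] == i) {
                    c_nodes[i].order[j] = k;   /* bit node order for this edge */
                    break;
                }
            }
        }
    }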

We will explain how we generate addresses for CUDA parallelization, using the H-matrix in Figure 1. The H-matrix is stored in a file as a two-dimensional array which contains the bit node positions that are necessary for the check node update operation. The first table in Figure 8 shows an example. Since the positions of the LLR values of Bit Node 1 for Check Node 0 and Check Node 1 are different, the bit node order is determined by reading the H-matrix information. We minimize the execution time for this by parallelizing the operation using OpenMP. We use an Intel Quad-Core processor as the host CPU, so the algorithm in Function 1 can be run with four threads.

The position of an LLR value is stored in the form of (*x, y*), where *x* is the position of a bit node and *y* indicates that it is the (*y* + 1)th 1 (0 ≤ *y* ≤ (degree - 1)) of the same bit node. To make it more convenient to parallelize the execution and reduce the memory access time, this (*x, y*) information is rearranged as a one-dimensional array, as shown at the bottom of Figure 8. The position of the LLR value for (*x, y*) in the one-dimensional array, which is the address for check node computation, is easily computed as follows:

$$\mathit{Laddr} = x \times \mathit{Wb} + y$$

where *Wb* is the degree of bit nodes.

By using this method, when CUDA parallelizes the decoding process, the position of LLR values is obtained by reading memory instead of computing a new address. This improves execution time.

Figure 9 shows the positions of check nodes which are necessary for bit node update operations. Using a method similar to the one used to compute *Laddr*, *Zaddr*, which is the position of bit node (*x, y*) in the one-dimensional array, is computed as follows:

$$\mathit{Zaddr} = x \times \mathit{Wc} + y$$

where *x* is the position of a check node, *y* indicates that it is the (*y* + 1)th 1 (0 ≤ *y* ≤ (degree - 1)) of the same check node, and *Wc* is the degree of check nodes. Using this one-dimensional address arrangement, the number of memory accesses is minimized in all of the operations of check node updates, bit node updates, initialization, and parity checks.
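The address generation of this section can be sketched end to end in C. This is one plausible reading of the scheme, not the authors' code: the matrix is an assumed regular (8, 4) example rather than the matrix of Figure 1, the array names `c_index` and `v_index` are hypothetical, and the tables are indexed per edge so that `Laddr` holds *x* × *Wb* + *y* with *x* a bit node and `Zaddr` holds *x* × *Wc* + *y* with *x* a check node.

```c
#include <assert.h>

#define M  4   /* number of check nodes */
#define N  8   /* number of bit nodes */
#define WC 4   /* degree of each check node (ones per row) */
#define WB 2   /* degree of each bit node (ones per column) */

/* Illustrative regular (8, 4) matrix; an assumption, not necessarily Figure 1. */
static const int H[M][N] = {
    {0, 1, 0, 1, 1, 0, 0, 1},
    {1, 1, 1, 0, 0, 1, 0, 0},
    {0, 0, 1, 0, 0, 1, 1, 1},
    {1, 0, 0, 1, 1, 0, 1, 0},
};

int c_index[M][WC];  /* c_index[i][j]: j-th bit node of check node i */
int v_index[N][WB];  /* v_index[b][k]: k-th check node of bit node b */
int Laddr[M * WC];   /* per check-node edge: bit position * WB + order */
int Zaddr[N * WB];   /* per bit-node edge: check position * WC + order */

void build_address_tables(void)
{
    int i, j, k, b;

    /* Collect the neighbor lists from H. */
    for (i = 0; i < M; i++) {
        int count = 0;
        for (b = 0; b < N; b++)
            if (H[i][b]) c_index[i][count++] = b;
        assert(count == WC);
    }
    for (b = 0; b < N; b++) {
        int count = 0;
        for (i = 0; i < M; i++)
            if (H[i][b]) v_index[b][count++] = i;
        assert(count == WB);
    }

    /* Check nodes are independent, so rows can be split across threads.
       Every edge writes its own table entries, so there is no data race. */
    #pragma omp parallel for private(i, j, k, b) schedule(static)
    for (i = 0; i < M; i++) {
        for (j = 0; j < WC; j++) {
            b = c_index[i][j];
            for (k = 0; k < WB; k++) {
                if (v_index[b][k] == i) {
                    Laddr[i * WC + j] = b * WB + k;  /* x = b, y = k */
                    Zaddr[b * WB + k] = i * WC + j;  /* x = i, y = j */
                    break;
                }
            }
        }
    }
}
```

With the assumed matrix, the two entries for bit node 1 are `Zaddr[2] = 0` (Check Node 0, first position) and `Zaddr[3] = 5` (Check Node 1, second position), which illustrates why the bit node order has to be looked up per edge. In this reading the two tables are inverse permutations of each other, so `Zaddr[Laddr[e]] == e` for every edge `e`.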

### 3.3 Parallel LDPC decoding by GPU

LDPC decoding consists of four parts, as shown in Figure 10. First, the received LLR values are copied into the locations of the 1's in the H-matrix. Then, check node update operations and bit node update operations are carried out. Lastly, parity check operations are conducted.

**Kernel 1. Initialization kernel**

1: {Initialization :}

2: for *i* weight of bit Node

3: end for

**Kernel 2. Check node update kernel**

1: {Check Node Update :}

2: for *i* weight of Check Node

3: end for

**Kernel 3. Bit node update kernel**

1: {Bit Node Update :}

2: for *i* weight of bit Node

3: end for

4: Decode[Index] = BitNode_Comp

**Kernel 4. Parity check kernel**

1: {Parity Check :}

2: for *i* weight of Check Node

3: end for

The first initialization step is carried out with a pre-generated *Zaddr*. When the number of signals received equals the number of bit nodes, each received value is copied into the position indicated by *Zaddr*. This task for the H-matrix in Figure 1 can be processed in parallel using eight threads, if we use the GPU in Figure 4.
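The initialization step can be sketched as follows. This is a sequential C emulation under stated assumptions, not the CUDA kernel itself: each iteration of the outer loop stands in for one GPU thread, and the names `init_messages`, `llr`, `zaddr`, and `msg` are hypothetical.

```c
#include <assert.h>

/* Copy each received LLR into every message slot that belongs to its bit node.
   zaddr[b * wb + k] is the slot of the k-th edge of bit node b. */
void init_messages(const float *llr, const int *zaddr, float *msg,
                   int n_bits, int wb)
{
    assert(n_bits > 0 && wb > 0);
    for (int b = 0; b < n_bits; b++) {      /* one GPU thread per bit node */
        for (int k = 0; k < wb; k++) {      /* loop over the weight of the bit node */
            msg[zaddr[b * wb + k]] = llr[b];
        }
    }
}
```

Because different bit nodes never share a slot, the outer loop has no dependencies between iterations, which is what allows one thread per bit node.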

Second, a check node update operation is conducted after generating as many threads as the number of check nodes. Each thread sequentially reads as many values from the memory as the degree of check nodes, and it updates the values and stores them back to the same locations from which they were read.
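The read, update, and write-back pattern of the check node update can be sketched in C. Two assumptions are made loudly here: the update rule is the min-sum approximation, swapped in for the full sum-product rule to keep the example short and free of transcendental functions, and the messages of check node `c` are assumed to sit contiguously at `msg[c * wc]`. The outer loop stands in for the GPU threads.

```c
#include <assert.h>
#include <float.h>

static float magnitude(float v) { return v < 0.0f ? -v : v; }

/* Min-sum check node update, in place (an approximation of the SPA rule). */
void check_node_update(float *msg, int n_checks, int wc)
{
    assert(n_checks > 0 && wc >= 2);
    for (int c = 0; c < n_checks; c++) {         /* one GPU thread per check node */
        float *row = &msg[c * wc];
        float min1 = FLT_MAX, min2 = FLT_MAX;    /* two smallest magnitudes */
        int min_pos = -1, negatives = 0;

        /* First pass: read all incoming messages of this check node. */
        for (int j = 0; j < wc; j++) {
            float m = magnitude(row[j]);
            if (row[j] < 0.0f) negatives++;
            if (m < min1) { min2 = min1; min1 = m; min_pos = j; }
            else if (m < min2) { min2 = m; }
        }

        /* Second pass: each output excludes its own input, then is stored
           back to the location it was read from. */
        for (int j = 0; j < wc; j++) {
            float m = (j == min_pos) ? min2 : min1;
            int others_negative = negatives - (row[j] < 0.0f ? 1 : 0);
            row[j] = (others_negative % 2) ? -m : m;
        }
    }
}
```

For the inputs `{1.5, -2.0, 0.5, 3.0}` on a single check node of degree 4, the stored results are `{-0.5, 0.5, -1.5, -0.5}`: each output takes the smallest magnitude among the other three inputs and the product of their signs.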

Third, the bit node update is conducted after generating as many threads as the number of bit nodes. The stored data in memory are arranged in such a way that a check node update operation can effectively be carried out. Therefore, for bit node updates, each thread reads as many values as the degree of bit nodes, using *Zaddr*. Using the input values, bit node updates and determination of a decode bit are conducted. Updated values are stored back to the same locations from which they were read.
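A sequential C sketch of the bit node update is given below. It assumes the common LLR convention that a negative total means the bit is decoded as 1, and it assumes the channel LLRs are kept in a separate array `llr`; neither detail is stated in the text, and the function and parameter names are hypothetical. The outer loop again stands in for the GPU threads.

```c
#include <assert.h>

/* Bit node update with hard decision; messages are gathered through zaddr
   because memory is laid out in check node order. */
void bit_node_update(const float *llr, const int *zaddr, float *msg,
                     unsigned char *decode, int n_bits, int wb)
{
    assert(n_bits > 0 && wb > 0);
    for (int b = 0; b < n_bits; b++) {           /* one GPU thread per bit node */
        float total = llr[b];
        for (int k = 0; k < wb; k++)
            total += msg[zaddr[b * wb + k]];

        /* Assumed convention: negative LLR decodes to bit 1. */
        decode[b] = (total < 0.0f) ? 1 : 0;

        /* Each outgoing message excludes the incoming message on its own
           edge and is stored back where that message was read. */
        for (int k = 0; k < wb; k++) {
            int slot = zaddr[b * wb + k];
            msg[slot] = total - msg[slot];
        }
    }
}
```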

Last, parity check operations are parallelized for each check node. A parity check operation is intended to check that all the checking results are 0; this is done using the addresses in *Laddr*. When an address from *Laddr* is divided by the degree of the bit node, we obtain the position of the decode bit for parity check operations.
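The division trick described above can be sketched in C. The function name and the choice to return the number of failed checks are assumptions made for illustration; `laddr[c * wc + j]` is taken to be the address of the *j*th edge of check node `c`, so dividing it by the bit node degree `wb` recovers the bit position.

```c
#include <assert.h>

/* Returns the number of unsatisfied parity checks (0 means a valid codeword). */
int parity_check(const unsigned char *decode, const int *laddr,
                 int n_checks, int wc, int wb)
{
    assert(n_checks > 0 && wc > 0 && wb > 0);
    int failed = 0;
    for (int c = 0; c < n_checks; c++) {         /* one GPU thread per check node */
        int parity = 0;
        for (int j = 0; j < wc; j++) {
            int bit = laddr[c * wc + j] / wb;    /* Laddr / Wb = decode bit position */
            parity ^= decode[bit];
        }
        if (parity != 0)
            failed++;
    }
    return failed;
}
```

As a usage example under the same assumptions, take an address table of `{2, 6, 8, 14, 0, 3, 4, 10, 5, 11, 12, 15, 1, 7, 9, 13}` for a regular code with four check nodes of degree 4 and bit nodes of degree 2. The all-zero word passes every check, while flipping bit 1 makes exactly the two checks connected to that bit fail.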

## 4. Performance results

Performance evaluation results of the proposed LDPC decoding implementation are presented in this section. H-matrices for the CMMB standard, which is currently used in Digital Multimedia Broadcasting in China, are used for performance evaluation. The length of the codeword in the CMMB standard is 9,216 bits, and two code rates, 1/2 and 3/4, are supported [17].

The execution platform was composed of an Intel i5 750 (a Quad-Core CPU running at 2.6 GHz) as the host CPU, with 4 GB of DDR3 RAM, running Windows XP Service Pack 3 (32 bit). MS Visual Studio 2005 was used as the C compiler. The GPU was an NVIDIA GT 8800 with 512 MB of memory, and CUDA v2.3 was used.

To optimize the GPU performance, the best block size must be determined. Table 1 summarizes the performance evaluation results for various block sizes. From this evaluation, we determined that the optimal block size was 64 (threads per block).

To evaluate the performance of the proposed design, we compared the performance of three cases: (1) where no parallelization technique was applied, (2) where only parallelization utilizing OpenMP was applied, and (3) where both OpenMP and CUDA were utilized. Reported times are the latencies for decoding 10,000 frames at each SNR value.

Table 2 shows that when both OpenMP and CUDA were utilized, we achieved a speedup of 22 over the case in which no parallelization was applied. When the iteration count increased with low SNR values, the speedup became greater. This is mainly because, as the iteration count increases, the number of check node and bit node operations increases. Thus, more parallelization can be done, and accordingly the speedup becomes larger. The effective SNR must be greater than 1.5 dB for the CMMB. Hence, the iteration count is typically greater than 20. When we iterated 20 times, the decoding performance was 2.046 Mbps, which satisfies the CMMB standard.

Table 3 summarizes the performance for a code rate 3/4 of the CMMB. As the code rate increased, LDPC decoding was successfully finished with at least 2.5 dB. To satisfy the CMMB performance requirement, more than 300 frames per second must be processed. Our results verify that signal reception with at least 3 dB will satisfy this requirement.

To show that our method is satisfactory for multiple standards, we implemented an LDPC decoder for DVB-S2 with a code rate of 2/5; Table 4 summarizes the results. DVB-S2 is a standard for European satellite broadcasting. The H-matrix has 16,200 bits of code words and 9,720 bits of an information word. Table 4 shows that for an SNR of 1, we achieved a speedup of 24.5.

Table 5 compares our performance with the results reported in recent studies [11, 13]. The H-matrix in our decoding is about twice the size of those previously reported [11, 13], which suggests that the performance of our proposed decoder is excellent.

## 5. Conclusion

Owing to the multiple standards and diverse device function needs of current digital communications, hardware-only implementations may not be cost-effective. Instead, software implementations of communication protocols using CPUs or GPUs are rapidly being adopted in digital communication system designs. In this article, we have described a software design that implements parallel processing of LDPC decoding algorithms. In our proposal, we use a combination of a multi-core processor and a GPU to achieve both flexibility and high performance. Specifically, we use OpenMP for parallelizing software on a multi-core processor and CUDA for parallel software running on a GPU. Test results show that our parallel software implementation of LDPC algorithms satisfies the CMMB performance and bandwidth requirements.

## References

1. Standardization Roadmap for IT839 Strategy Ver 2007.
2. Gallager R: *Low-Density Parity Check Codes*. MIT Press, Cambridge, MA; 1963.
3. MacKay D, Neal R: Near Shannon limit performance of low density parity check codes. *IEE Electron Lett* 1996, 32(18):1645-1646. 10.1049/el:19961141
4. MacKay D: Good error-correcting codes based on very sparse matrices. *IEEE Trans Inf Theory* 1999, 45(2):399-431. 10.1109/18.748992
5. Chung SY, Forney GD Jr, Richardson TJ, Urbanke R: On the design of low-density parity-check codes within 0.0045dB of the Shannon limit. *IEEE Commun Lett* 2001, 5: 58-60. 10.1109/4234.905935
6. SDR Forum [http://www.wirelessinnovation.org/]
7. Sousa L, Momcilovic S, Silva V, Falcão G: Multi-core platforms for signal processing: source and channel coding. *IEEE Press Multimedia Expo* 2009, 1805-1808.
8. Kollig P, Osborne C, Henriksson T: Heterogeneous multi-core platform for consumer multimedia applications. *Date Conference* 2009, 1254-1259.
9. van Berkel CH(K): Multi-core for mobile phones. *Date Conference* 2009, 1260-1265.
10. Falcão G, Sousa L, Silva V: Massive parallel LDPC decoding on GPU. *13th ACM SIGPLAN* 2008, 83-90.
11. Falcão G, Yamagiwa S, Sousa L, Silva V: Parallel LDPC decoding on GPUs using a stream-based computing approach. *J Comput Sci Technol* 2009, 24(5):913-924. 10.1007/s11390-009-9266-8
12. Falcão G, Sousa L, Silva V: How GPUs can outperform ASICs for fast LDPC decoding. *Proceedings of the 23rd international conference on Supercomputing* 2009, 390-399.
13. Wang S, S C, Wu Q: A parallel decoding algorithm of LDPC codes using CUDA. *Signals, Systems and Computers, 2008 42nd Asilomar Conference* 2008, 171-175.
14. Ji H, Cho J, Sung W: Massively parallel implementation of cyclic LDPC codes on a general purpose graphics processing unit. *Signal Processing Systems, SiPS 2009* 2009, 285-290.
15. Lin S, Costello DJ Jr: *Error Control Coding*. 2nd edition. Prentice Hall; 2004.
16. ETSI EN 302 307, Digital Video Broadcasting (DVB): Second generation framing structure, channel coding and modulation systems for Broadcasting, Interactive Services, News Gathering and other broadband satellite applications.
17. GY/T 220.1-2006: Mobile Multimedia Broadcasting Part1.
18. NVIDIA CUDA Development Tools 2.3, Getting Started 2009.
19. NVIDIA CUDA C Programming Best Practices Guide, Optimization 2009.
20. Chapman B, Jost G, van der Pas R: *Using OpenMP Portable Shared Memory Parallel Programming*. The MIT Press; 2007.

## Acknowledgements

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-C1090-1100-0010).

## Additional information

### Competing interests

The authors declare that they have no competing interests.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Park, JY., Chung, KS. Parallel LDPC decoding using CUDA and OpenMP.
*J Wireless Com Network* **2011**, 172 (2011). https://doi.org/10.1186/1687-1499-2011-172

### Keywords

- LDPC
- decoder
- parallel processing
- CUDA
- graphic processing unit