### 2.1 Review of LDPC decoding algorithms

The LDPC code is a linear block code with a very sparse parity check matrix called H-matrix. The rows and columns of an H-matrix denote parity check codes and symbols, respectively. LDPC codes can be represented by a Tanner graph which is a bipartite graph in which the sides represent check nodes and bit nodes, respectively. Thus, check nodes correspond to the rows of the H-matrix, and bit nodes correspond to the columns of the H-matrix. For example, when the (*i, j*) element of an H-matrix is '1', the *i* th check node is connected to the *j* th bit node of the equivalent Tanner graph. Figures 1 and 2 illustrate an H-matrix and the equivalent Tanner graph for (8, 4) LDPC codes.

Most practical LDPC decoders use soft-decisions, because soft-decision decoders typically outperform hard-decision ones. A soft-decision decoding scheme is carried out, based on the concept of belief propagation, by passing messages, which contain the amount of belief for a value being between 0 and 1, between adjacent check nodes and bit nodes. Based on the delivered messages, each node attempts to decode its own value. If the decoded value turns out to contain error, the decoding process is repeated for a predefined number of times. Typically, there are two ways to deliver messages in LDPC decoding. One is to use probabilities, and the other is to use log-likelihood ratios (LLRs). In general, using LLRs is favored since that allows us to replace expensive multiplication operations with inexpensive addition operations.

### 2.2 Parallelization of LDPC decoding

As explained in Section 2.1, an LDPC decoding algorithm is capable of correcting errors by repeatedly computing and exchanging messages. The amount of computation depends on the size of the H-matrix. However, recently published standards reveal a growing trend that the length of codewords is getting longer as the amount of data transfer is increasing [1]. By the same token, the size of the H-matrix is increasing. For a recent standard, DVB-T2 [16], the length of the codeword is 64,800 bits or 16,200 bits. For China Multimedia Mobile Broadcasting (CMMB) [17], the length is 9,126 bits. The huge size causes both decoding complexity and decoding time to increase. Therefore, it is crucial to distribute the computation load evenly to multiple cores and to parallelize the computation efficiently.

LDPC decoding consists of four general operations: initialization, check node update, bit node update, and parity check. Examining the algorithm reveals that the check node update can be done in parallel, since the rows are uncorrelated with each other. Also, the bit node update on each column can be processed in parallel, because an LDPC decoding algorithm has independent memory accesses among the four types of operations. For example, in the H-matrix in Figure 3, check node operations can process four rows in parallel, and bit node operations can process eight columns concurrently.

This article has three main technical contributions. First, we propose an efficient and flexible technique which can be applied to various protocols and target multi-core platforms, since we propose a solution which employs both CPUs and GPUs. We also introduce an efficient technique to reduce the memory requirement significantly. Next, we propose parallelization techniques for not only check and bit node update operations, but also parity check operations, which will be described in detail in Section 3.

### 2.3 CUDA programming

CUDA is a GPU software development kit proposed by David Kirk and Mark Harris. One major advantage of CUDA is that it is an extension of the standard C programming language. Hence, those who are familiar with the C/C++ programming language can learn how to program in CUDA relatively easily. Also, CUDA is capable of fully utilizing the fast-improving GPU processing power. Further, NVIDIA hardware engineers actively reflect scientists' opinions as they develop the next generation of CUDA and GPU. For instance, support for double precision computation, error correction capability, and increased shared memory may not be crucial for graphics processing in game applications, but they are important for many scientific and engineering applications. These features have been added in recent versions of CUDA and GPU.

Figure 4 shows the architecture of NVIDIA's 8800GTX. There are 16 multiprocessors, and each multiprocessor has 8 single precision thread processors (SPs). Therefore, the total number of SPs is 128. Each SP can process a block of data with a thread allocation in parallel. However, it is not possible for the CPU and the GPU to share memory space. Thus, the GPU must make a copy of the shared data to its own memory space in advance. If the CPU wants data stored in the memory of the GPU, a similar copy operation must take place. These copy operations incur significant overhead.

Figure 5 shows the relation between a block and a thread in the GPU. A kernel function is executed for one thread at a time. For example, if there were 12 threads in Block (1,1), and there were 6 blocks in a grid, then the kernel function would be executed 72 times. When a function is invoked, the thread and the block index are identified by the thread_idx and block_idx variables, respectively [18, 19].

#### 2.4 OpenMP

OpenMP is a set of application program interfaces (APIs) for parallelization of C/C++ or Fortran programs in a shared memory multiprocessing environment. OpenMP has gained lots of attention lately as multi-core systems are being widely deployed in many embedded platforms. Recently, version 2.5 was released. OpenMP is a parallelization method based on compiler directives, in which a directive will tell the compiler which part of the program should be parallelized, by generating multiple threads. Many commercial and noncommercial compilers support OpenMP directives, and thus, we can utilize OpenMP in many existing platforms [20].

OpenMP's parallelization model is based on a fork-join model. Starting with only the master thread, additional slave threads are forked on demand. All threads except for the master one are terminated when execution for a parallel region ends. In this article, we use OpenMP pragmas to parallelize address generation computations. Since only the new address is transferred to the CUDA memory, the memory copy overhead is minimal.(Figure 6)

### 3. Proposed LDPC decoder

As described above, when the size of H-matrices increases, the amount of computation grows rapidly. This makes it difficult to achieve satisfactory performance in either software- or hardware-only implementations that attempt to support multiple standards and data rates. Therefore, we propose a novel parallel software implementation of LDPC decoding algorithms that are based on OpenMP and CUDA programming. We will show that the proposed design is a cost-effective and flexible LDPC decoder which satisfies the throughput requirement for various H-matrices and multiple code rates. First, we will show the overall software structure, and next we will explain the parallelization techniques that we propose. (Figure 7)

#### 3.1 Architecture of the proposed LDPC decoder

The overall structure of the proposed LDPC decoder is as follows. We assume that the target platform consists of a single host multi-core processor which can run C codes with OpenMP pragmas and a GPU which can run CUDA codes. To support multiple standards and data rates, multiple H-matrices are stored as files. The host CPU reads the H-matrix for a given standard and signal-to-noise ratio (SNR) constraint. The host CPU then generates an address table of data processed in parallel by the GPU. Generation of the address table is parallelized by OpenMP pragmas. Next, generated address information is transferred to the memory in the GPU. This copy operation takes place only if there is a change in standard, SNR constraint, or code rate.

When signals are received, the host CPU delivers them to the GPU. The GPU executes the proposed LDPC decoding software in parallel. Upon completion of the decoding, decoded bits are transferred to the host CPU. A CUDA API called "CUDA Copy" is used to exchange data between the host and the GPU. The copy overhead may be significant, so it is crucial to minimize it. It should be noted that in our implementation this copy operation takes place only for generated address transfers, received signal transfers, decoded bit transfers, and configuration (standard, code rates, etc.) changes. Therefore, the copy overhead is not large in our implementation.

#### 3.2 One-dimensional address generation for parallelization

**Function 1. New address generator**

1: {New address :}

#pragma omp parallel for private(i, j)

shared(v_nodes, c_nodes) schedule(static)

2: for *i* Check Node Num

3: for j weight of check Node

4: for k weight of bit Node

if(v_nodes[c_nodes[i].index[j]].index[k] = = i

{

c_nodes[i].order[j] = k;

break;

}

5: end for

6: end for

7: end for

We will explain how we generate addresses for CUDA parallelization, using the H-matrix in Figure 1. The H-matrix is stored in a file as a two-dimensional array which contains bit node positions that are necessary for the check node update operation. The first table in Figure 8 shows an example. Since the positions of the LLR values of Bit Node 1 for Check Node 0 and Check Node 1 are different, the bit node order is determined by reading the H-matrix information. We minimize the execution time for this by parallelizing the operation using OpenMP. We use an Intel Quad-Core processor as the host CPU, and the following algorithm with four threads may be used.

The position of an LLR value is stored in the form of (*x, y*) where *x* is the position of a bit node and *y* indicates that it is the (*y* + 1)th 1 (0 ≤ *y* ≤ (degree - 1)) of the same bit node. To make it more convenient to parallelize the execution and reduce the memory access time, this (*x, y*) information is rearranged as a one-dimensional array, as shown at the bottom of Figure 8. The position of the LLR value for (*x, y*) in the one-dimensional array which is the address for check node computation is easily computed as follows:

where *Wb* is the degree of bit nodes.

By using this method, when CUDA parallelizes the decoding process, the position of LLR values is obtained by reading memory instead of computing a new address. This improves execution time.

Figure 9 shows the positions of check nodes which are necessary for bit node update operations. Using a similar method to compute *Laddr*, *Zaddr*, which is the position of bit node (*x, y*) in the one-dimensional array, is computed as follows:

where *x* is the position of a check node, and *y* indicates that it is the (*y* + 1)th 1 (0 ≤ *y* ≤ (degree - 1)) of the same check node and *Wc* is the degree of check nodes. Using this one-dimensional address arrangement, the number of memory accesses is minimized in all of the operations of check node updates, bit node updates, initialization, and parity checks.

### 3.3 Parallel LDPC decoding by GPU

LDPC decoding consists of four parts as shown in Figure 10. The first part is that received LLR values are copied into the location of 1's in the H-matrix. Then, check node update operations and bit node update operations are carried out. Lastly, parity check operations are conducted.

**Kernel 1. Initialization kernel**

1: {Initilization :}

\begin{array}{c}xIndex=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}Wb\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\times Wb\\ Index=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\end{array}

2: for *i* weight of bit Node

Memory\left[Zaddr\left[xIndex+i\right]\right]=Init\left[Index\right]

3: end for

**Kernel 2. Check node update kernel**

1: {Check Node Update :}

xIndex=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}Wc\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\times Wc

2: for *i* weight of Check Node

Memory\left[xIndex+i\right]=CheckNode\text{\_}Comp\left[xIndex+i\right]

3: end for

**Kernel 3. Bit node update kernel**

1: {Bit Node Update :}

\begin{array}{c}xIndex=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}Wb\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\times Wb\\ Index=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\end{array}

2: for *i* weight of bit Node

Memory\left[Zaddr\left[xIndex+i\right]\right]=BitNode\text{\_}Comp\left[Zaddr\left[xIndex+i\right]\right]

3: end for

4: Decode\left[Index\right]=BitNode\text{\_}Comp

**Kernel 4. Parity check kernel**

1: {Parity Check :}

\begin{array}{c}xIndex=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}Wc\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\times Wc\\ Index=blockIdx.x\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\times \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}blocksize\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}threadIdx.x\end{array}

2: for *i* weight of Check Node

check=Decode\left[\mathsf{\text{int}}\phantom{\rule{1em}{0ex}}\left(Laddr\left[xIndex+i\right]\u2215Wb\right)\right]+check

3: end for

check=\mathsf{\text{int}}\phantom{\rule{1em}{0ex}}\left(check\%2\right)

The first initialization step is carried out with a pre-generated *Zaddr*. When the number of signals received equals the number of bit nodes, each received value is copied into the position indicated by *Zaddr*. This task for the H-matrix in Figure 1 can be processed in parallel using eight threads, if we use the GPU in Figure 4.

Second, a check node update operation is conducted after generating as many threads as the number of check nodes. Each thread sequentially reads values from the memory as many times as the degree of check nodes from the memory, and it updates the values and stores them back to the same locations from which they were read.

Third, the bit node update is conducted after generating as many threads as the number of bit nodes. The stored data in memory are arranged in such a way that a check node update operation can effectively be carried out. Therefore, for bit node updates, each thread reads as many values as the degree of bit nodes, using *Zaddr*. Using the input values, bit node updates and determination of a decode bit are conducted. Updated values are stored back to the same locations from which they were read.

Last, parity check operations are parallelized for each check node. A parity check operation is intended to check that all the checking results are 0; this is done using the addresses in *Laddr*. When an address from *Laddr* is divided by the degree of the bit node, we obtain the position of the decode bit for parity check operations.