Low-complexity high-throughput decoding architecture for convolutional codes

Sequential decoding can achieve a very low computational complexity and short decoding delay when the signal-to-noise ratio (SNR) is relatively high. In this article, a low-complexity high-throughput decoding architecture based on a sequential decoding algorithm is proposed for convolutional codes. Parallel Fano decoders are scheduled to the codewords in parallel input buffers according to buffer occupancy, so that the processing capabilities of the Fano decoders can be fully utilized, resulting in high decoding throughput. A discrete time Markov chain (DTMC) model is proposed to analyze the decoding architecture. The relationship between the input data rate, the clock speed of the decoder and the input buffer size can be easily established via the DTMC model. Different scheduling schemes and decoding modes are proposed and compared. The novel high-throughput decoding architecture is shown to incur 3-10% of the computational complexity of Viterbi decoding at a relatively high SNR.


Introduction
The 57-64 GHz unlicensed band around 60 GHz can accommodate multi-gigabit-per-second (multi-Gbps) wireless transmission over a short range. There are several standards for 60 GHz systems, such as WirelessHD [1] and IEEE 802.15.3c [2,3]. In both WirelessHD and the AV PHY mode of IEEE 802.15.3c, a concatenated FEC scheme is used with an RS code as the outer code and a convolutional code as the inner code. In order to achieve the target decoding throughput at multi-Gbps, parallel convolutional encoding has been adopted by the transmitter baseband design in both standards. It is straightforward to use parallel Viterbi decoding in the receiver baseband. However, it has been shown in [4,5] that parallel Viterbi decoders in the receiver baseband result in massive hardware complexity and power consumption. The problem will become more severe if a higher decoding throughput is targeted (e.g., 10 Gbps) for a battery-powered user terminal in the future [6]. Hence it is desirable to find a low-complexity high-throughput decoding method for convolutional codes in such systems.
The Viterbi algorithm (VA) achieves maximum likelihood decoding for convolutional codes [7]. The VA is a breadth-first, exhaustive search approach based on the trellis diagram. Sequential decoding is another method of convolutional decoding and is a depth-first, non-exhaustive search approach based on the tree diagram. It only explores partial paths locally in the code tree, so it has sub-optimal decoding performance and its computational complexity varies with SNR. There are two main types of sequential decoding algorithms, known as the Stack algorithm [8] and the Fano algorithm [9,10]. Because the Fano algorithm has low storage and sorting requirements, it can achieve a higher decoding throughput than the Stack algorithm. Only the Fano algorithm is considered in this article. Sequential decoding is not widely used in real systems due to the excessive computations and long decoding delay when the SNR is low. However, if a relatively high SNR can be achieved (e.g., for a very short range and/or via beamforming) or is required for some applications (e.g., HD video streaming), sequential decoding will on average incur a very low computational complexity and short decoding delay, which results in a high decoding throughput.
In this article, a novel low-complexity high-throughput decoding architecture based on parallel Fano algorithm decoding with scheduling is proposed. Different scheduling schemes and decoding modes are investigated. A discrete time Markov chain (DTMC) is introduced to model the proposed architecture to establish the relationship between input data rate, input buffer size, and clock speed of the decoders. The trade-offs between error rate, computational complexity, scheduling schemes and decoding modes are studied. It will be shown that the high-throughput decoding architecture can achieve a much lower computational complexity compared to the Viterbi decoding with a similar error rate performance. The rest of the article is organized as follows. First, the unidirectional Fano algorithm (UFA) and bidirectional Fano algorithm (BFA) are reviewed in Section 2. The novel parallel Fano decoding with scheduling architecture is proposed in Section 3. Different scheduling schemes and decoding modes are also proposed in this section. The DTMC based modeling is applied to the decoding architecture in Section 4. Simulation results are given in Section 5, and the conclusions are drawn in Section 6.

Unidirectional Fano algorithm and bidirectional Fano algorithm
In the conventional unidirectional Fano algorithm, the decoder starts decoding from the initial state zero (or origin node). During each iteration of the algorithm, the decoder may move forward (increase depth within the tree), move backward (reduce depth), or stay at the current tree depth. The decision is made based on the comparison between the threshold value and the path metric. If a forward movement is made, the threshold value needs to be tightened. If the decoder cannot move forward or backward, the threshold value needs to be loosened. A detailed flowchart of the UFA can be found in [11].
A bidirectional Fano algorithm was proposed in [12]. Both the forward decoder (FD) and the backward decoder (BD) start decoding from the known state zero and perform decoding in the forward and backward directions in parallel, as shown in Figure 1. The decoding finishes if the FD and the BD merge somewhere in the code tree. Otherwise, if the FD and the BD cannot merge, the decoding finishes when either of them reaches the other end of the code tree. Merging means that the FD and the BD have the same encoder state at the same level within the codeword. A simple merging condition requires the FD and the BD to have one merged state, as shown in the shaded box on the left. A more rigorous merging condition requires the FD and the BD to have more than one merged state (e.g., 5 merged states), as shown in the shaded box on the right. Increasing the number of merged states (NMS) increases the probability that the FD and the BD are decoding on the same path, resulting in an improved error rate performance. However, this comes at the cost of higher computational effort. This trade-off has been discussed in [13]. In this article, the simple merging condition (NMS = 1) is adopted by the BFA.
The simulated complementary cumulative distributions of computational complexity of the UFA, the BFA and the VA are compared in Figure 2 at different SNR values. The computational complexity is measured by the number of branch metric calculations (BMC). It can be seen that as the SNR increases, the computational complexity and variability of the UFA and the BFA reduce. However, the computational complexity of the VA has a constant value which does not change with the SNR. Additionally, the BFA can achieve a lower computational complexity and variability compared to the UFA, which is more pronounced at a lower SNR.

Architecture
It has been discussed in [14] that increasing the parallelism in a Viterbi decoder can be achieved at the bit-level, the word-level and the algorithm-level. The bit-level parallelism can be realized by pipelining, and the word-level parallelism can be achieved by the look-ahead (or high-radix) technique [15]. However, the add-compare-select (ACS) unit, which selects the best branches within the Viterbi decoder, is still the bottleneck for achieving high decoding throughput [16]. The fastest Viterbi decoder at the time of writing has a decoding throughput of about 1 Gbps for a 64-state convolutional code [17]. In order to achieve a higher decoding throughput at the level of multi-Gbps, an effective approach is to use parallel convolutional encoders at the transmitter (Tx) and parallel convolutional decoders at the receiver (Rx). Each convolutional decoder does not need to run at a very high speed, but an overall high decoding throughput can still be achieved. This parallel convolutional encoding approach has been adopted by the WirelessHD specification [1] and the IEEE 802.15.3c AV PHY mode [2]. Each of the parallel convolutional encoders has the structure shown in Figure 3. For both standards of interest, the convolutional code has a code rate of R = 1/3 and a constraint length of K = 7. For each input bit, there are three coded output bits (X, Y, and Z). The generator polynomials in octal are g_0 = 133, g_1 = 171, and g_2 = 165. This convolutional code is used throughout the article to target the WirelessHD specification and the IEEE 802.15.3c AV PHY mode, though it should be noted that sequential decoding can also be used to decode very long constraint length convolutional codes, which may be infeasible for the Viterbi algorithm to decode.
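The encoder described above can be sketched in a few lines of Python. This is a minimal illustration, not the standards' reference implementation: the tap ordering (most significant bit of each octal generator applied to the current input bit) is an assumption, since Figure 3 is not reproduced here, and the function name `encode` is hypothetical.

```python
from typing import List, Sequence, Tuple

K = 7                                  # constraint length from the text
GENERATORS = (0o133, 0o171, 0o165)     # g_0, g_1, g_2 in octal


def encode(bits: Sequence[int]) -> List[Tuple[int, ...]]:
    """Rate-1/3 convolutional encoding with K - 1 zero tail bits appended.

    Assumption: the MSB of each generator taps the current input bit and
    the LSB taps the oldest bit in the shift register.
    """
    state = 0                          # the K - 1 most recent input bits
    out = []
    for b in list(bits) + [0] * (K - 1):
        window = (b << (K - 1)) | state            # 7-bit window, newest bit on top
        out.append(tuple(bin(window & g).count("1") & 1 for g in GENERATORS))
        state = window >> 1                        # drop the oldest bit
    return out


if __name__ == "__main__":
    # The impulse response of output X spells out g_0 = 133 (octal) in binary.
    print([x for x, _, _ in encode([1])])
```

With a frame of 200 information bits, the appended tail gives 206 branches of three coded bits each, which matches the codeword length used later in the simulations.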
A reference receiver baseband design (see endnote a) for the WirelessHD and IEEE 802.15.3c standards is shown in Figure 4. The building blocks operate in reverse compared to the corresponding building blocks at the Tx. There are eight parallel convolutional decoders, and the VA can be implemented in each of them. However, the Viterbi decoder is one of the most power and hardware intensive blocks in the Rx baseband. The system operates in indoor and short range environments, so it is possible that there is a line-of-sight (LOS) path between the Tx and the Rx which enables a relatively high SNR at the Rx. Even if the LOS component is not available, the adaptive antenna beamforming technique can still guarantee a relatively high SNR at the Rx. Additionally, the Tx and the Rx are quasi-static, which means the SNR is roughly constant. All these facts make the sequential decoding algorithm an attractive approach for high-throughput convolutional decoding.
In Figure 5 there are N parallel Fano decoders, each with a finite input buffer accommodating up to B codewords. The supported input data rate of each buffer is assumed to be R_d information bits per second. The total supported data rate, or average decoding throughput, will be N · R_d. This parallel Fano decoding system can be treated as a parallel queuing system, in which the parallel input buffers are the queues and the parallel Fano decoders are the servers. Due to the variable computational efforts of the Fano decoders, the input buffer occupancies (Q_1, ..., Q_N) differ from each other, as shown in Figure 5. If the Fano decoders can be scheduled to decode the codewords in different input buffers, the utilization of the Fano decoders can be increased, resulting in a higher decoding throughput. For example, if a Fano decoder F_m finishes decoding one codeword and its input buffer occupancy is lower than that of another input buffer, say B_n, it is possible to schedule the decoder F_m to help decode a codeword in the input buffer B_n, thus reducing Q_n and avoiding a potential buffer overflow or frame erasure. In order to realize this, a scheduler is introduced which can allocate the Fano decoders to the input buffers dynamically, as shown in Figure 5. Each Fano decoder also needs to be connected to all the input and output buffers. The scheduler is invoked when a decoder finishes decoding one codeword. It then allocates the decoder to an input buffer according to some scheduling scheme. The allocation of the decoders to the input buffers can be achieved by changing the connectivity between the input buffers and the decoders and between the decoders and the output buffers.
For ease of analysis and modeling, an equivalent architecture is proposed in Figure 6. Each Fano decoder has a buffer which can hold one codeword, and the codeword in this buffer may come from any of the parallel long input buffers, whose size is B − 1. When a decoder F_m finishes decoding the codeword in its buffer, the buffer is cleared and updated with a new codeword from a long input buffer according to some scheduling scheme. For example, as shown in Figure 6, when the decoder F_2 finishes decoding the codeword in its buffer, the scheduler selects the long input buffer B_N according to [...] decoders and erases the codeword of the decoder F_m if it has consumed the highest computational effort among all the decoders. After the codeword of the decoder F_m is erased, one codeword in the input buffer B_n is scheduled to the decoder F_m and the occupancy of the input buffer B_n is reduced by one codeword (Q_n = Q_n − L_f). The number of decoders M is assumed to be the same as the number of input buffers N in Figures 5 and 6 (i.e., M = N) for ease of illustration. However, it will be shown in Section 5 that a higher number of decoders may be required to achieve a target decoding throughput (i.e., M > N).

Scheduling schemes
When a decoder finishes decoding a codeword, the scheduler needs to decide which input buffer the decoder should serve next. It has been discussed in [18-20] that serving the longest queue first (LQF) helps keep the parallel queues (or input buffers) as balanced and stable as possible, thus maximising the input data rate R_d. LQF is therefore considered to be one of the best scheduling schemes for the proposed architecture in terms of achieving a high decoding throughput.
The LQF scheme needs to compare the input buffer occupancy values. Other, simpler scheduling schemes can be employed to reduce the computational and hardware complexity of the scheduler. One possible scheduling scheme is to select the input buffer at random, which is named the RDM scheme. Another scheduling scheme is to group the parallel input buffers and decoders, such that each decoder can only be scheduled to the input buffers within the same group. The decoders in the same group are scheduled according to the LQF scheme. This is known as the static scheduling scheme or the STC scheme. In this article, each group is assumed to have two input buffers and two UFA decoders. Compared to the LQF scheme, the STC scheme can help reduce the need for multi-port memories and high fan-out multiplexers. It can also simplify the design of the scheduler and the connections between the input buffers and the decoders.
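The three schemes can be illustrated with a small Python sketch. The function names, the convention of returning `None` when there is nothing to serve, and the mapping of decoder indices to groups are all assumptions made for illustration; the article does not specify these details.

```python
import random
from typing import Optional, Sequence


def lqf(occupancy: Sequence[int]) -> Optional[int]:
    """Longest queue first: index of the fullest buffer, or None if all are empty."""
    best = max(range(len(occupancy)), key=lambda n: occupancy[n])
    return best if occupancy[best] > 0 else None


def rdm(occupancy: Sequence[int], rng: random.Random) -> int:
    """Random scheme: pick any buffer uniformly, regardless of its occupancy."""
    return rng.randrange(len(occupancy))


def stc(occupancy: Sequence[int], decoder: int, group_size: int = 2) -> Optional[int]:
    """Static scheme: LQF restricted to the buffers in the decoder's own group.

    Assumption: decoder d belongs to group d // group_size, which holds the
    buffers with the same indices (two buffers per group by default).
    """
    start = (decoder // group_size) * group_size
    group = range(start, min(start + group_size, len(occupancy)))
    best = max(group, key=lambda n: occupancy[n])
    return best if occupancy[best] > 0 else None


if __name__ == "__main__":
    occupancy = [3, 7, 2, 5]                 # codewords waiting in each input buffer
    print(lqf(occupancy))                    # fullest buffer overall
    print(stc(occupancy, decoder=2))         # fullest buffer within decoder 2's group
    print(rdm(occupancy, random.Random(0)))  # uniformly random buffer
```

LQF needs N − 1 comparisons per invocation, whereas STC with groups of two needs a single comparison and RDM needs none, which is the complexity trade-off the text describes.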

PUFAS mode and PBFAS mode
When a decoder F_m finishes decoding a codeword, it can be scheduled to decode a new codeword from one of the input buffers, or it can be scheduled to help another decoder F_m' which is already working on a whole codeword. The scheduled decoder F_m can decode from the end state zero of this codeword, so that F_m and F_m' decode the same codeword in the BFA mode. These two modes are known as the parallel unidirectional Fano algorithm decoding with scheduling (PUFAS) mode and the parallel bidirectional Fano algorithm decoding with scheduling (PBFAS) mode, respectively. It has been shown in [12,13] that the decoding throughput of a BFA decoder is at least twice that of a UFA decoder (D_BFA ≥ 2 D_UFA), due to the parallel processing between the FD and the BD and also due to the computational effort reduction achieved by the BFA. As a result, if there are M UFA decoders among which any two can decode in the BFA mode, the decoding throughput can be improved by forming ⌊M/2⌋ BFA decoders. In this case, there will be ⌊M/2⌋ parallel BFA decoders which can be scheduled in the architecture.

DTMC based modeling
In the proposed parallel Fano decoding with scheduling architecture, the total number of codewords can be written as:
\[ N_{\text{total}} = N_{\text{decoded}} + N_{\text{erased}} \tag{1} \]
where N_decoded is the number of decoded codewords and N_erased is the number of codewords erased due to buffer overflow. A metric called the blocking probability (P_B) is defined as:
\[ P_B = \frac{N_{\text{erased}}}{N_{\text{total}}} \tag{2} \]
where P_B is similar to the frame error rate (P_F) caused by undetected errors. In designing the system, the input data rate R_d (in bps), the clock speed of each Fano decoder f_clk (in Hz) and the input buffer size B (in codewords) need to be chosen properly to ensure that:
\[ P_B \ll P_F \tag{3} \]
In this article, P_B = 0.01 × P_F is adopted as the target blocking probability (P_target). The relationship between R_d, f_clk and B can be found via simulation. Another way to analyze the architecture is to model it based on queuing theory.

DTMC based modeling on single UFA/BFA
A single UFA/BFA decoder with a finite input buffer can be treated as a D/G/1/B queue [21], in which D means that the input data rate is deterministic, G means that the decoding time follows a general distribution, 1 means that there is one decoder and B is the number of codewords the input buffer can hold. The state of the Fano decoder is represented by the input buffer occupancy, or queue length, at the moment a codeword finishes decoding, measured in terms of branches or information bits stored in the buffer. Q(n) and Q(n + 1) have the following relationship:
\[ Q(n+1) = Q(n) + \left[ \frac{T_s(n) \cdot R_d}{f_{\text{clk}}} \right] - L_f \tag{4} \]
where Q(n + 1) is the input buffer occupancy when the n-th codeword just finishes decoding, T_s(n) is the decoding time of the n-th codeword by the Fano decoder and L_f is the length of a codeword in terms of branches or information bits. [x] denotes the nearest integer to x. The speed factor of the Fano decoder is defined as the ratio between f_clk and R_d:
\[ \mu = \frac{f_{\text{clk}}}{R_d} \tag{5} \]
If f_clk is normalized to 1, Equation (4) can be changed to:
\[ Q(n+1) = Q(n) + \left[ \frac{T_s(n)}{\mu} \right] - L_f \tag{6} \]
It can be seen from Equation (6) that for fixed values of μ and L_f, the state of the input buffer Q(n + 1) is determined uniquely by the state Q(n) and the decoding time T_s(n). T_s(n) and T_s(n + 1) are i.i.d. in the AWGN channel or in randomly interleaved fading channels. As a result, the state of the input buffer is a DTMC. It is assumed that the Fano decoder can execute one iteration per clock cycle, which is feasible according to [22], so T_s(n) is measured in clock cycles per codeword. The simulated distribution of T_s will be used in the following analysis since its closed form expression is intractable. The difference between Q(n + 1) and Q(n) is defined as:
\[ \Delta = Q(n+1) - Q(n) = \left[ \frac{T_s(n)}{\mu} \right] - L_f \tag{7} \]
The total number of states of the input buffer with size B is denoted by Ω. The state transition probability matrix of the input buffer is P_T = [P_ij], where P_ij is the state transition probability from S_i to S_j. P_ij is calculated from the distribution of Δ, where Δ_min = [min(T_s)/μ] − L_f and p_{+w} = Pr(Δ = w).
The value of p_{+w} can be estimated from the simulated distribution of T_s, which is shown in Figure 7 for the UFA with different speed factors at E_b/N_0 = 4 dB. It should be noted that a bad codeword may incur unbounded decoding time for a Fano decoder, and it is common to erase such a codeword. This case corresponds to j = Ω in Equations (9) and (10). The state probability of the input buffer at time n is described by π(n), whose element π_ω(n) is the probability that the input buffer is in state S_ω at time n; π(0) is the initial state probability (n = 0). The steady state probability of the input buffer is then:
\[ \Pi = \lim_{n \to \infty} \pi(n) = \lim_{n \to \infty} \pi(0)\, P_T^{\,n} \tag{12} \]
Hence, the blocking probability of the decoder can be calculated by:
\[ P_B = \sum_{i} \Pi(i)\, p^{+}_{\Omega - i} \tag{13} \]
where Π(i) is the steady state probability that the input buffer is in state S_i and p^{+}_{Ω−i} = Pr(Δ > Ω − i).
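The semi-analytical procedure can be sketched in pure Python as follows. This is an illustrative approximation under stated assumptions rather than the authors' exact formulation: the state space is taken as occupancies 0 to Ω (with Ω chosen by the caller, for instance B · L_f), the occupancy is clamped at 0 and at Ω, the steady state is found by simple power iteration from an empty buffer, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def delta_pmf(ts_samples: Sequence[float], mu: float, lf: int) -> Dict[int, float]:
    """Empirical pmf of Delta = [T_s / mu] - L_f, with [x] the nearest integer."""
    counts = Counter(math.floor(ts / mu + 0.5) - lf for ts in ts_samples)
    return {w: c / len(ts_samples) for w, c in counts.items()}


def transition_matrix(pmf: Dict[int, float], omega: int) -> List[List[float]]:
    """Transition probabilities between occupancies 0..omega.

    Assumption: occupancy cannot drop below 0, and any step past omega is
    treated as an overflow that leaves the buffer in state omega.
    """
    size = omega + 1
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for w, p in pmf.items():
            j = min(max(i + w, 0), omega)
            matrix[i][j] += p
    return matrix


def steady_state(matrix: List[List[float]], iterations: int = 500) -> List[float]:
    """Power iteration pi(n + 1) = pi(n) P_T, starting from an empty buffer."""
    size = len(matrix)
    pi = [1.0] + [0.0] * (size - 1)
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(size)) for j in range(size)]
    return pi


def blocking_probability(pi: Sequence[float], pmf: Dict[int, float], omega: int) -> float:
    """Sum over states of Pi(i) * Pr(Delta > omega - i)."""
    return sum(
        pi[i] * sum(p for w, p in pmf.items() if w > omega - i)
        for i in range(omega + 1)
    )


if __name__ == "__main__":
    # Toy decoding times (clock cycles per codeword), not simulation data.
    pmf = delta_pmf([6, 6, 6, 10], mu=2.0, lf=4)
    pi = steady_state(transition_matrix(pmf, omega=8))
    print(blocking_probability(pi, pmf, omega=8))
```

For the parameters used later in the article (L_f = 206, B = 10) the matrix has a few thousand states, so a numerical linear algebra library would be a more practical choice than nested Python lists.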

Extension to PUFAS/PBFAS-LQF
When scheduling is involved, it is difficult to apply DTMC based modeling since the parallel queues behave in a very complex way. However, if the LQF scheduling scheme is used, the proposed decoding architecture can be modelled by the DTMC in an approximate way. If there are M Fano decoders working in parallel, each running at f_clk, and the LQF scheduling scheme is used, the M Fano decoders can be fully utilized to decode the codewords in the N input buffers. Since the M Fano decoders and the N input buffers are identical to each other, the system is totally symmetric and can be treated as a faster Fano decoder with a clock speed of f'_clk = M · f_clk working on each input buffer with a probability of P_S = 1/N. As a result, Equations (6) and (7) are modified for each input buffer i ∈ {1, ..., N} to account for the faster clock speed f'_clk and the service probability P_S, which gives the occupancy difference Δ_i of input buffer i. The state transition probability matrix P_T,i can be calculated based on the distribution of Δ_i, and Equations (8)-(13) can still be applied to the PUFAS/PBFAS-LQF. The validity of the proposed DTMC model is confirmed by the simulation results shown in the following section.

Simulation results
The performance of the proposed parallel Fano decoding with scheduling is examined via simulation in this section. The branch metric calculation is based on 1-bit hard-decision with the Fano metric [11]. Using 3-bit soft-decision for the branch metric calculation results in about 1.75 to 2 dB additional coding gain. However, 1-bit hard-decision is favoured in very high throughput decoder design to achieve a trade-off between the complexity of the decoder and the error rate performance. In this article, 1-bit hard-decision is adopted for the metric calculation for both the Viterbi and the Fano algorithm. The threshold adjustment value in the Fano algorithm is δ = 2. The modulation is BPSK and the channel is assumed to be an AWGN channel. The AWGN channel is similar to the LOS multipath channel for 60 GHz as discussed in [23]. Each frame has L = 200 bits plus K − 1 = 6 zero bits, which results in a total frame (or codeword) length of L_f = L + K − 1 = 206 bits. The input buffer size is assumed to be B = 10.
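As a rough illustration of a hard-decision branch metric, the sketch below uses the textbook bit-wise Fano metric for a binary symmetric channel, with the crossover probability obtained from hard-decision BPSK over AWGN. The exact metric scaling and quantization used with δ = 2 in the simulations are not given in this section, so the formulas and function names here are assumptions for illustration only.

```python
import math
from typing import Sequence, Tuple


def crossover_probability(ebn0_db: float, rate: float) -> float:
    """Hard-decision BPSK over AWGN: p = Q(sqrt(2 * R * Eb/N0))."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(rate * ebn0))


def fano_bit_metrics(p: float, rate: float) -> Tuple[float, float]:
    """Bit metrics for a BSC with crossover probability p: (match, mismatch)."""
    match = math.log2(2 * (1 - p)) - rate
    mismatch = math.log2(2 * p) - rate
    return match, mismatch


def fano_branch_metric(received: Sequence[int], expected: Sequence[int],
                       p: float, rate: float) -> float:
    """Sum of bit metrics over the coded bits of one branch."""
    match, mismatch = fano_bit_metrics(p, rate)
    return sum(match if r == e else mismatch for r, e in zip(received, expected))


if __name__ == "__main__":
    rate = 1 / 3
    p = crossover_probability(4.0, rate)
    print(p, fano_bit_metrics(p, rate))
    print(fano_branch_metric((1, 0, 1), (1, 1, 1), p, rate))
```

The match metric is positive and the mismatch metric is negative, so the path metric tends to rise along the correct path and fall along incorrect ones, which is what the threshold comparison in the Fano algorithm relies on.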

Comparison between different scheduling schemes
The performance of different scheduling schemes is compared by simulation in Figure 8. The SNR was set to E_b/N_0 = 4 dB, which corresponds to a target blocking probability of P_target = 10^-3. In both the PUFAS and the PBFAS, the LQF scheduling scheme has the best performance. In the PUFAS the RDM scheme performs better than the STC scheme, while in the PBFAS the RDM scheme performs worst of all the schemes. This is because when the RDM scheme is employed in the PBFAS, a BFA decoder may become idle if it randomly selects a low occupancy input buffer, whereas a wrong selection by the RDM scheme in the PUFAS makes only one UFA decoder idle. As a result, the RDM scheme can be used in the PUFAS and the STC scheme can be used in the PBFAS to reduce the complexity of the scheduler. However, since the complexity added by the LQF scheduler to the parallel decoders is minimal, it is favoured in terms of achieving a higher decoding throughput.

Validation of the DTMC model
The semi-analytical results (see endnote b) are compared with the simulation results to validate the DTMC model. It can be seen from Figure 9 that the semi-analytical results are quite close to the simulation results, which indicates the accuracy of the proposed DTMC model. The working speed factor of the parallel unidirectional Fano algorithm decoding without scheduling (PUFA) is about μ = 17, which can be reduced to μ = 7 and μ = 5.6 if the LQF scheduling scheme is performed in the PUFAS and in the PBFAS, respectively. The corresponding decoding throughput improvements are about 140% and 200%, respectively.
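The quoted improvements can be checked from the speed factors alone. The short derivation below assumes the same clock speed f_clk and the same number of input buffers N in all three cases, so that the throughput N · R_d = N · f_clk / μ is inversely proportional to μ; the subscripts on D and μ are labels introduced here for illustration.

```latex
\[
\frac{D_{\text{PUFAS}}}{D_{\text{PUFA}}}
  = \frac{\mu_{\text{PUFA}}}{\mu_{\text{PUFAS}}}
  = \frac{17}{7} \approx 2.43
  \quad\Rightarrow\quad \text{improvement} \approx 143\% \;(\text{about } 140\%)
\]
\[
\frac{D_{\text{PBFAS}}}{D_{\text{PUFA}}}
  = \frac{\mu_{\text{PUFA}}}{\mu_{\text{PBFAS}}}
  = \frac{17}{5.6} \approx 3.04
  \quad\Rightarrow\quad \text{improvement} \approx 204\% \;(\text{about } 200\%)
\]
```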
It has been found that the proposed DTMC based modeling of the PUFAS-LQF and PBFAS-LQF is accurate when the input buffer size B is large enough (i.e., B ≥ 5). The accuracy of the model degrades as B gets smaller. However, a very short input buffer will not be adopted according to the trade-off between area and decoding throughput as discussed in [21]. Additionally, it has also been found that the accuracy of the model does not depend on the relationship between M and N (i.e., M > N, M = N or M < N) as long as the input buffer size is large enough.

Number of parallel Fano decoders
If the target decoding throughput is D_target = 1 Gbps and the clock speed of the Fano decoder is f_clk = 500 MHz, the supported input data rate will be R_d = D_target / N = 125 Mbps for N = 8 input buffers and the target speed factor will be μ_1 = f_clk / R_d = 4. It can be seen from Figure 10 that the required number of decoders is M = 14 for the PUFAS-LQF and M = 12 for the PBFAS-LQF at E_b/N_0 = 4 dB. Two decoders can be saved if the PBFAS-LQF is adopted instead of the PUFAS-LQF for the same decoding throughput.
It can also be seen from Figure 10 that the decoding throughput improves as the SNR increases for the same number of decoders. As a result, some of the decoders can be dynamically turned off as the SNR increases for the same decoding throughput, though a large number of decoders may be required to support a low SNR. For example, if the target decoding throughput increases to D_target = 2 Gbps and the clock speed of the Fano decoder is still f_clk = 500 MHz, the target speed factor will be μ_2 = 2. It can be seen from Figure 10 that the required number of decoders is M = 28 for the PUFAS-LQF and M = 26 for the PBFAS-LQF at E_b/N_0 = 4 dB, which can be reduced to only M = 12 if the SNR increases to 5 dB. In this case, more than half of the decoders can be turned off to reduce the power consumption of the decoding architecture.
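The target speed factors used in these examples follow directly from the definition μ = f_clk / R_d with R_d = D_target / N. A minimal Python sketch of the arithmetic is given below; the function name is hypothetical, and reading the required number of decoders M off Figure 10 is not modeled.

```python
def target_speed_factor(d_target_bps: float, f_clk_hz: float, n_buffers: int) -> float:
    """Speed factor mu = f_clk / R_d, where R_d = D_target / N per input buffer."""
    r_d = d_target_bps / n_buffers
    return f_clk_hz / r_d


if __name__ == "__main__":
    # 1 Gbps and 2 Gbps targets with f_clk = 500 MHz and N = 8 input buffers.
    print(target_speed_factor(1e9, 500e6, 8))
    print(target_speed_factor(2e9, 500e6, 8))
```

Doubling the target throughput at a fixed clock speed halves the target speed factor, which is why roughly twice as many decoders are needed at the same SNR.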

Error rate performance and computational complexity
The proposed parallel Fano decoding with scheduling is compared with the parallel Fano decoding without scheduling and the parallel Viterbi algorithm decoding (PVA) in terms of bit-error-rate (BER) and computational complexity. As discussed in [24-26], the state-of-the-art low power Viterbi decoders based on the T-algorithm [27] can also achieve a reduced computational complexity at a high SNR with a minimal penalty in coding gain, so the performance of the T-algorithm is also included for comparison. It can be seen in Figure 11 that the PVA has the best BER performance. There is about 0.1 dB penalty in coding gain at BER = 10^-4 when using the PUFAS-LQF. The PBFAS-LQF has the worst performance, with about 0.25 dB coding gain loss compared to the PVA. The T-algorithm has been tuned to achieve a similar BER performance by setting the discarding threshold T = 5. The computational complexity, measured by the number of branch metric calculations, is compared in Figure 12. Each BMC corresponds to one node extension in the code tree or one state update in the trellis diagram. Each state update in the VA involves an ACS operation, which has a similar computational complexity to one node extension in the UFA or BFA. This quantity has been widely used in the literature to compare Viterbi decoding and sequential decoding in terms of computational complexity [28,29]. The computational complexity of the PUFAS-LQF to decode one codeword is:
\[ C_{\text{PUFAS}} = C_{\text{UFA}} + C_S \tag{16} \]
where C_UFA is the computational complexity of the UFA decoder and C_S is the computational complexity of the LQF scheduler. It is known that C_UFA ≥ L_f = 206 BMC, while C_S amounts to only N − 1 = 7 comparisons of input buffer occupancy values. As a result, the computational complexity of the PUFAS-LQF to decode one codeword is C_PUFAS ≈ C_UFA.
Similarly, the computational complexity of the PBFAS-LQF to decode one codeword is:
\[ C_{\text{PBFAS}} \approx C_{\text{BFA}} = C_{\text{FD}} + C_{\text{BD}} \tag{17} \]
where C_FD is the number of BMC to decode one codeword in the forward direction and C_BD is the number of BMC in the backward direction. The computational complexity of the PVA to decode one codeword, C_VA, has a fixed value that does not change with the SNR. The distributions of C_UFA, C_BFA, and C_VA at different SNR values can be found in Figure 2.
It can be seen that the proposed decoding architecture incurs a much lower computational complexity than the PVA. For example, at E_b/N_0 = 4 dB the computational complexity of the PUFAS-LQF is only 10% of that of the PVA, and it reduces to 3% at 6 dB. Additionally, the computational complexity of the PBFAS-LQF is lower than that of the PUFAS-LQF at a lower SNR, but they become very similar as the SNR increases. This is because at a high SNR, the computational complexity reduction achieved by the BFA compared to the UFA becomes minimal. Since there is a very limited improvement in decoding throughput and computational complexity from using the PBFAS-LQF instead of the PUFAS-LQF at a high SNR, the PUFAS-LQF is favored due to its better BER performance. It can also be seen from Figures 11 and 12 that, with a BER performance similar to that of the PUFAS-LQF and the PBFAS-LQF, the T-algorithm cannot achieve the same low computational complexity.
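To put the 10% and 3% figures in context, the sketch below compares them with the lowest ratio a Fano decoder could reach. It assumes that the Viterbi decoder updates all 2^(K−1) states at every one of the L_f trellis stages, so that C_VA = 2^(K−1) · L_f; this expression is an assumption consistent with counting one BMC per state update, not a value quoted in the text, and the function names are hypothetical.

```python
K = 7        # constraint length
L_F = 206    # codeword length in branches


def viterbi_bmc(k: int = K, lf: int = L_F) -> int:
    """Assumed Viterbi cost: one state update per state per trellis stage."""
    return 2 ** (k - 1) * lf


def fano_floor_ratio(k: int = K, lf: int = L_F) -> float:
    """Best case for the UFA (C_UFA = L_f, no backtracking) relative to the VA."""
    return lf / viterbi_bmc(k, lf)


if __name__ == "__main__":
    c_va = viterbi_bmc()
    print(c_va)                  # BMC per codeword under the stated assumption
    print(fano_floor_ratio())    # lower bound on C_UFA / C_VA
    print(0.10 * c_va, 0.03 * c_va)   # average BMC implied by the 10% and 3% figures
```

Under this assumption the floor is 1/64, or about 1.6%, so the reported 3% at 6 dB would mean the Fano decoder is already within a factor of two of the minimum possible effort.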

Conclusions
This article considered the application of the sequential decoding algorithm in high-throughput wireless communication systems. A novel architecture based on parallel Fano algorithm decoding with scheduling was proposed. Because the Fano decoders are scheduled according to the input buffer occupancy, a high decoding throughput can be achieved by the proposed architecture. Different scheduling schemes and decoding modes were proposed and compared. It was shown that the PBFAS-LQF scheme could achieve the highest decoding throughput. A DTMC model was proposed for the decoding architecture. The relationship between the input data rate, the clock speed of the decoder and the input buffer size can be easily established via the DTMC model. The model was validated by simulation and used to determine the number of decoders required for a target decoding throughput. It was shown that the novel high-throughput decoding architecture requires 3-10% of the computational complexity of Viterbi decoding with a similar error rate performance. This novel architecture can be employed in high-throughput systems such as 60 GHz systems to achieve energy efficient, low-complexity decoding of convolutional codes.
Endnotes
a The standard does not specify the Rx design. Only the Tx design is given.
b Since the distribution of T_s is obtained by simulation, the DTMC based results are referred to as semi-analytical.