Low-complexity high-throughput decoding architecture for convolutional codes
© Xu et al; licensee Springer. 2012
Received: 23 August 2011
Accepted: 23 April 2012
Published: 23 April 2012
Sequential decoding can achieve a very low computational complexity and short decoding delay when the signal-to-noise ratio (SNR) is relatively high. In this article, a low-complexity high-throughput decoding architecture based on a sequential decoding algorithm is proposed for convolutional codes. Parallel Fano decoders are scheduled to the codewords in parallel input buffers according to buffer occupancy, so that the processing capabilities of the Fano decoders can be fully utilized, resulting in high decoding throughput. A discrete time Markov chain (DTMC) model is proposed to analyze the decoding architecture. The relationship between the input data rate, the clock speed of the decoder and the input buffer size can be easily established via the DTMC model. Different scheduling schemes and decoding modes are proposed and compared. The novel high-throughput decoding architecture is shown to incur 3-10% of the computational complexity of Viterbi decoding at a relatively high SNR.
The 57-64 GHz unlicensed bandwidth around 60 GHz can accommodate multi-gigabit per second (multi-Gbps) wireless transmission over a short range. There are several standards for 60 GHz systems, such as WirelessHD and IEEE 802.15.3c [2, 3]. In both WirelessHD and the AV PHY mode in IEEE 802.15.3c, a concatenated FEC scheme is used with an RS code as the outer code and a convolutional code as the inner code. In order to achieve the target multi-Gbps throughput, parallel convolutional encoding has been adopted by the transmitter baseband design in both standards. It is straightforward to use parallel Viterbi decoding in the receiver baseband. However, it has been shown in [4, 5] that parallel Viterbi decoders in the receiver baseband result in massive hardware complexity and power consumption. The problem will become more severe if a higher decoding throughput (e.g., 10 Gbps) is targeted for a battery powered user terminal in the future. Hence it is desirable to find a low-complexity high-throughput decoding method for convolutional codes in such systems.
The Viterbi algorithm (VA) achieves maximum likelihood decoding for convolutional codes. The VA is a breadth-first, exhaustive search approach based on the trellis diagram. Sequential decoding is another method of convolutional decoding and is a depth-first, non-exhaustive search approach based on the tree diagram. It only explores partial paths locally in the code tree, so it has sub-optimal decoding performance and its computational complexity varies with SNR. There are two main types of sequential decoding algorithms, known as the Stack algorithm and the Fano algorithm [9, 10]. Because the Fano algorithm has low storage and sorting requirements, it can achieve a higher decoding throughput than the Stack algorithm. Only the Fano algorithm is considered in this article. Sequential decoding is not widely used in real systems due to the excessive computations and long decoding delay when the SNR is low. However, if a relatively high SNR can be achieved (e.g., for a very short range and/or via beamforming) or is required for some applications (e.g., HD video streaming), sequential decoding will on average incur a very low computational complexity and short decoding delay, which results in a high decoding throughput.
In this article, a novel low-complexity high-throughput decoding architecture based on parallel Fano algorithm decoding with scheduling is proposed. Different scheduling schemes and decoding modes are investigated. A discrete time Markov chain (DTMC) is introduced to model the proposed architecture to establish the relationship between input data rate, input buffer size, and clock speed of the decoders. The trade-offs between error rate, computational complexity, scheduling schemes and decoding modes are studied. It will be shown that the high-throughput decoding architecture can achieve a much lower computational complexity compared to the Viterbi decoding with a similar error rate performance. The rest of the article is organized as follows. First, the unidirectional Fano algorithm (UFA) and bidirectional Fano algorithm (BFA) are reviewed in Section 2. The novel parallel Fano decoding with scheduling architecture is proposed in Section 3. Different scheduling schemes and decoding modes are also proposed in this section. The DTMC based modeling is applied to the decoding architecture in Section 4. Simulation results are given in Section 5, and the conclusions are drawn in Section 6.
2 Unidirectional Fano algorithm and bidirectional Fano algorithm
In the conventional unidirectional Fano algorithm, the decoder starts decoding from the initial state zero (or origin node). During each iteration of the algorithm, the decoder may move forward (increase depth within the tree), move backward (reduce depth), or stay at the current tree depth. The decision is made based on the comparison between the threshold value and the path metric. If a forward movement is made, the threshold value needs to be tightened. If the decoder cannot move forward or backward, the threshold value needs to be loosened. A detailed flowchart of the UFA can be found in .
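The forward, backward and threshold-update moves described above can be sketched in a few lines. This is only an illustrative sketch, not the decoder evaluated in this article: it assumes a small rate-1/2, K = 3 code with generators (7, 5), integer per-bit metrics of +1 for a match and -9 for a mismatch, and a threshold step of 4. The function names and all of these parameter values are hypothetical choices made to keep the example short.

```python
def branch(state, bit, gens, K):
    """Encoder output bits and next state for one input bit."""
    reg = (bit << (K - 1)) | state
    out = tuple(bin(reg & g).count("1") & 1 for g in gens)
    return out, reg >> 1


def encode(info_bits, gens=(0b111, 0b101), K=3):
    """Encode the info bits followed by K-1 zero tail bits."""
    state, code = 0, []
    for bit in list(info_bits) + [0] * (K - 1):
        out, state = branch(state, bit, gens, K)
        code.append(out)
    return code


def fano_decode(received, n_info, gens=(0b111, 0b101), K=3,
                delta=4, good=1, bad=-9):
    """Unidirectional Fano search; returns (info bits, iteration count)."""
    path, rank = [], []          # chosen bit and its rank at each depth
    states, metrics = [0], [0]   # encoder state and path metric per depth
    threshold, try_rank, iterations = 0, 0, 0
    while len(path) < len(received):
        iterations += 1
        depth = len(path)
        allowed = (0, 1) if depth < n_info else (0,)  # tail bits are zero
        cands = []
        for bit in allowed:
            out, nxt = branch(states[-1], bit, gens, K)
            bm = sum(good if o == r else bad
                     for o, r in zip(out, received[depth]))
            cands.append((bm, bit, nxt))
        cands.sort(key=lambda c: -c[0])               # best branch first
        can_go = (try_rank < len(cands)
                  and metrics[-1] + cands[try_rank][0] >= threshold)
        if can_go:
            # Move forward along the branch of the requested rank.
            bm, bit, nxt = cands[try_rank]
            first_visit = metrics[-1] < threshold + delta
            path.append(bit)
            rank.append(try_rank)
            states.append(nxt)
            metrics.append(metrics[-1] + bm)
            if first_visit:                           # tighten the threshold
                while metrics[-1] >= threshold + delta:
                    threshold += delta
            try_rank = 0
        elif path and metrics[-2] >= threshold:
            # Move backward, then try the next-best branch of the parent.
            try_rank = rank.pop() + 1
            path.pop()
            states.pop()
            metrics.pop()
        else:
            # Cannot move forward or backward: loosen the threshold.
            threshold -= delta
            try_rank = 0
    return path[:n_info], iterations


info = [1, 0, 1, 1, 0, 1]
rx = encode(info)
rx[2] = (1 - rx[2][0], rx[2][1])          # flip one channel bit
decoded, iterations = fano_decode(rx, len(info))
```

With an error-free input the search only moves forward, one iteration per branch. The single flipped bit forces a few backward moves and threshold reductions, which is the SNR-dependent computational effort that the rest of the article exploits.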
3 Parallel Fano decoding with scheduling
When an input buffer is about to overflow, the scheduler compares the computational efforts of all the decoders and erases the codeword held by the decoder that has consumed the highest computational effort. After that codeword is erased, one codeword from the input buffer is scheduled to the freed decoder and the occupancy of the input buffer is reduced by one codeword length, $Q_n = Q_n - L_f$.
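A minimal sketch of this erasure rule is given below. It assumes that buffer occupancy is tracked in bits and that each decoder keeps a running count of its computational effort; the function name and the list-based bookkeeping are hypothetical and only serve the illustration.

```python
def erase_on_overflow(occupancy, n, efforts, frame_len):
    """Erase the most expensive codeword and refill that decoder from buffer n.

    occupancy: list of buffer fill levels Q_n (assumed to be in bits)
    efforts:   computational effort spent so far by each decoder
    Returns the index of the decoder whose codeword was erased.
    """
    victim = max(range(len(efforts)), key=lambda m: efforts[m])
    efforts[victim] = 0           # the decoder restarts on a fresh codeword
    occupancy[n] -= frame_len     # Q_n = Q_n - L_f
    return victim


occupancy = [412, 2060]           # two input buffers, buffer 1 is nearly full
efforts = [30, 900, 120]          # three decoders
victim = erase_on_overflow(occupancy, 1, efforts, frame_len=206)
```

In this example decoder 1 has spent the most effort, so its codeword is dropped and it takes the next codeword from buffer 1.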
The number of decoders M is assumed to be the same as the number of input buffers N in Figures 5 and 6 (i.e., M = N) for ease of illustration. However, it will be shown in Section 5 that a higher number of decoders may be required to achieve a target decoding throughput (i.e., M > N).
3.2 Scheduling schemes
When a decoder finishes decoding a codeword, the scheduler needs to decide which input buffer the decoder should serve next. It has been discussed in [18–20] that serving the longest queue first (LQF) helps keep the parallel queues (or input buffers) balanced and stable, thus maximising the supported input data rate $R_d$. LQF is therefore considered to be one of the best scheduling schemes for the proposed architecture in terms of achieving a high decoding throughput.
The LQF scheme needs to compare the input buffer occupancy values. Other, simpler scheduling schemes can be employed to reduce the computational and hardware complexity of the scheduler. One possible scheme is to select the input buffer at random, which is named the RDM scheme. Another is to group the parallel input buffers and decoders, such that each decoder can only be scheduled to the input buffers within its own group. The decoders in the same group are scheduled according to the LQF scheme. This is known as the static scheduling scheme or the STC scheme. In this article, each group is assumed to have two input buffers and two UFA decoders. Compared to the LQF scheme, the STC scheme reduces the need for multi-port memories and high fan-out multiplexers. It also simplifies the design of the scheduler and the connections between the input buffers and the decoders.
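The three selection rules can be compared side by side in a short sketch. The function names are hypothetical. Two details are assumptions made here rather than statements from the article: the RDM rule only draws from non-empty buffers, and in the STC rule decoder i belongs to group i // 2 together with the buffers of the same indices.

```python
import random


def pick_lqf(occupancy):
    """Longest queue first: index of the fullest buffer, None if all are empty."""
    best = max(range(len(occupancy)), key=lambda n: occupancy[n])
    return best if occupancy[best] > 0 else None


def pick_rdm(occupancy, rng=random):
    """Random scheme: any non-empty buffer, chosen uniformly."""
    nonempty = [n for n, q in enumerate(occupancy) if q > 0]
    return rng.choice(nonempty) if nonempty else None


def pick_stc(occupancy, decoder_id, group_size=2):
    """Static scheme: LQF restricted to the decoder's own group of buffers."""
    start = (decoder_id // group_size) * group_size
    group = range(start, min(start + group_size, len(occupancy)))
    best = max(group, key=lambda n: occupancy[n])
    return best if occupancy[best] > 0 else None


occupancy = [3, 7, 2, 9]                  # codewords waiting in each buffer
lqf_choice = pick_lqf(occupancy)          # global comparison over all buffers
stc_choice = pick_stc(occupancy, 0)       # decoder 0 only sees buffers 0 and 1
```

The LQF rule compares all N occupancy values, whereas the STC rule only compares the values within one group, which is where the savings in multiplexer fan-out and memory ports come from.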
3.3 PUFAS mode and PBFAS mode
When a decoder finishes decoding a codeword, it can be scheduled to decode a new codeword from one of the input buffers, or it can be scheduled to help another decoder which is already working on a whole codeword. In the latter case, the scheduled decoder starts decoding from the end state zero of this codeword, so that the two decoders decode the same codeword in the BFA mode. These two modes are known as the parallel unidirectional Fano algorithm decoding with scheduling (PUFAS) mode and the parallel bidirectional Fano algorithm decoding with scheduling (PBFAS) mode, respectively. It has been shown in [12, 13] that the decoding throughput of a BFA decoder is at least twice that of a UFA decoder ($D_{BFA} \geq 2D_{UFA}$) due to the parallel processing between the forward decoder (FD) and the backward decoder (BD) and also due to the computational effort reduction achieved by the BFA. As a result, if there are $M$ UFA decoders among which any two can decode in the BFA mode, the decoding throughput can be improved by forming $\lfloor M/2 \rfloor$ BFA decoders. In this case, there will be $\lfloor M/2 \rfloor$ parallel BFA decoders which can be scheduled in the architecture.
4 DTMC based modeling
In this article, $P_B = 0.01 \times P_F$ is adopted as the target blocking probability ($P_{target}$). The relationship between $R_d$, $f_{clk}$ and $B$ can be found via simulation. Another way to analyze the architecture is to model it based on queuing theory.
4.1 DTMC based modeling on single UFA/BFA
where ∏(i) is the steady state probability that the input buffer is in state S i and .
4.2 Extension to PUFAS/PBFAS-LQF
The state transition probability matrix $P_{T,i}$ can be calculated based on the distribution of $\Delta_i$, and Equations (8)-(13) can still be applied to the PUFAS/PBFAS-LQF. The proposed DTMC model is validated by the simulation results shown in the following section.
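Once a state transition probability matrix is available, the steady state probabilities follow from standard DTMC theory. The sketch below is a generic illustration and does not reproduce Equations (8)-(13): it finds the stationary distribution of a small row-stochastic matrix by power iteration. The function name, the three-state matrix and the reading of the last state as the "buffer full" state are all assumptions made for the example.

```python
def steady_state(P, tol=1e-12, max_iter=100000):
    """Stationary distribution pi of a row-stochastic matrix P (pi = pi P)."""
    n = len(P)
    pi = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Toy buffer chain with made-up transition probabilities:
# state 0 = empty, state 1 = partly filled, state 2 = full.
P_toy = [
    [0.7, 0.3, 0.0],
    [0.4, 0.4, 0.2],
    [0.0, 0.6, 0.4],
]
pi_toy = steady_state(P_toy)
p_full = pi_toy[-1]     # assumed here to play the role of a blocking probability
```

Power iteration is used only because it needs no external library; for the larger matrices that arise with realistic buffer sizes, a linear solver would normally be the more practical choice.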
5 Simulation results
The performance of the proposed parallel Fano decoding with scheduling is examined via simulation in this section. The branch metric calculation is based on 1-bit hard-decision with the Fano metric . Using 3-bit soft-decision for the branch metric calculation results in about 1.75 to 2 dB additional coding gain. However, 1-bit hard-decision is favoured in very high throughput decoder design to achieve a trade-off between the complexity of the decoder and the error rate performance. In this article, 1-bit hard-decision is adopted for the metric calculation for both the Viterbi and the Fano algorithm. The threshold adjustment value in the Fano algorithm is $\delta = 2$. The modulation is BPSK and the channel is assumed to be an AWGN channel. The AWGN channel is similar to the LOS multipath channel for 60 GHz as discussed in . Each frame has $L = 200$ bits plus $K - 1 = 6$ zero bits, which results in a total frame (or codeword) length of $L_f = L + K - 1 = 206$ bits. The input buffer size is assumed to be $B = 10$.
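For 1-bit hard decisions, the BPSK/AWGN channel seen by the decoder is a binary symmetric channel, and the per-bit Fano metric takes only two values. The sketch below computes them from the textbook expressions. The code rate of 1/2 is an assumption (the rate is not stated in this section), and the function names are hypothetical.

```python
import math


def bsc_crossover(ebn0_db, rate):
    """Hard-decision bit error probability for coded BPSK over AWGN."""
    ebn0 = 10 ** (ebn0_db / 10)
    # Q(sqrt(2 * rate * Eb/N0)) written with the complementary error function
    return 0.5 * math.erfc(math.sqrt(rate * ebn0))


def fano_bit_metrics(p, rate):
    """Fano metric per code bit on a BSC: (value for a match, value for a mismatch)."""
    match = math.log2(2 * (1 - p)) - rate
    mismatch = math.log2(2 * p) - rate
    return match, mismatch


rate = 0.5                                  # assumed code rate
p = bsc_crossover(4.0, rate)                # Eb/N0 = 4 dB
match, mismatch = fano_bit_metrics(p, rate)
frame_len = 200 + 7 - 1                     # L_f = L + K - 1 with K = 7
```

A matching bit contributes a small positive amount and a mismatching bit a much larger negative amount, so the metric of the correct path tends to rise while incorrect paths fall quickly. Hardware decoders usually scale and round these two values to small integers.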
5.1 Comparison between different scheduling schemes
5.2 Validation of the DTMC model
It has been found that the proposed DTMC based modeling of the PUFAS-LQF and PBFAS-LQF is accurate when the input buffer size $B$ is large enough (i.e., $B \geq 5$). The accuracy of the model degrades as $B$ gets smaller. However, a very small input buffer will not be adopted according to the trade-off between area and decoding throughput as discussed in . Additionally, it has also been found that the accuracy of the model does not depend on the relationship between $M$ and $N$ (i.e., $M > N$, $M = N$ or $M < N$) as long as the input buffer size is large enough.
5.3 Number of parallel Fano decoders
If the target decoding throughput is $D_{target} = 1$ Gbps and the clock speed of the Fano decoder is $f_{clk} = 500$ MHz, the supported input data rate will be $R_d = D_{target}/N = 125$ Mbps for $N = 8$ input buffers and the target speed factor will be $\mu_1 = f_{clk}/R_d = 4$. It can be seen from Figure 10 that the required number of decoders is $M = 14$ for the PUFAS-LQF and $M = 12$ for the PBFAS-LQF at $E_b/N_0 = 4$ dB. Two decoders can be saved if the PBFAS-LQF is adopted instead of the PUFAS-LQF for the same decoding throughput.
It can also be seen from Figure 10 that the decoding throughput can be improved as SNR increases for the same number of decoders. As a result, some of the decoders can be dynamically turned off as SNR increases for the same decoding throughput, though a large number of decoders may be required to support a low SNR. For example, if the target decoding throughput increases to $D_{target} = 2$ Gbps and the clock speed of the Fano decoder is still $f_{clk} = 500$ MHz, the target speed factor will be $\mu_2 = 2$. It can be seen from Figure 10 that the required number of decoders is $M = 28$ for the PUFAS-LQF and $M = 26$ for the PBFAS-LQF at $E_b/N_0 = 4$ dB, which can be reduced to only $M = 12$ if the SNR increases to 5 dB. In this case, more than half of the decoders can be turned off to reduce the power consumption of the decoding architecture.
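The arithmetic behind the two speed factors can be checked with a short helper. The function name is hypothetical; the numbers are the ones quoted in the text, with N = 8 input buffers in both cases.

```python
def speed_factor(d_target_bps, f_clk_hz, n_buffers):
    """Per-buffer input data rate R_d and speed factor mu = f_clk / R_d."""
    r_d = d_target_bps / n_buffers
    return r_d, f_clk_hz / r_d


r_d1, mu1 = speed_factor(1e9, 500e6, 8)   # 1 Gbps target: R_d = 125 Mbps, mu = 4
r_d2, mu2 = speed_factor(2e9, 500e6, 8)   # 2 Gbps target: R_d = 250 Mbps, mu = 2
```

Doubling the target throughput at a fixed clock speed halves the speed factor, which is why roughly twice as many decoders are needed at the same SNR.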
5.4 Error rate performance and computational complexity
The distributions of $C_{UFA}$, $C_{BFA}$, and $C_{VA}$ at different SNR can be found in Figure 2.
It can be seen that the proposed decoding architecture incurs a much lower computational complexity than the PVA. For example, at $E_b/N_0 = 4$ dB, the computational complexity of the PUFAS-LQF is only 10% of that of the PVA, and it reduces to 3% at 6 dB. Additionally, the computational complexity of the PBFAS-LQF is lower than that of the PUFAS-LQF at a lower SNR, but they become very similar as SNR increases. This is because at a high SNR, the computational complexity reduction achieved by the BFA compared to the UFA becomes minimal. Since the PBFAS-LQF offers only a very limited improvement in decoding throughput and computational complexity over the PUFAS-LQF at a high SNR, the PUFAS-LQF is favored due to its better BER performance. It can also be seen from Figures 11 and 12 that, at a BER performance similar to that of the PUFAS-LQF and the PBFAS-LQF, the T-algorithm cannot achieve the same low computational complexity.
6 Conclusions
This article considered the application of sequential decoding algorithms in high-throughput wireless communication systems. A novel architecture based on parallel Fano algorithm decoding with scheduling was proposed. Due to the scheduling of the Fano decoders according to the input buffer occupancy, a high decoding throughput can be achieved by the proposed architecture. Different scheduling schemes and decoding modes were proposed and compared. It was shown that the PBFAS-LQF scheme could achieve the highest decoding throughput. A DTMC model was proposed for the decoding architecture. The relationship between the input data rate, the clock speed of the decoder and the input buffer size can be easily established via the DTMC model. The model was validated by simulation and utilized to determine the number of decoders required for a target decoding throughput. It was shown that the novel high-throughput decoding architecture requires 3-10% of the computational complexity of Viterbi decoding with a similar error rate performance. This novel architecture can be employed in high-throughput systems such as 60 GHz systems to achieve energy efficient, low-complexity decoding of convolutional codes.
a The standard does not specify the Rx design. Only the Tx design is given.
b Since the distribution of $T_s$ is obtained by simulation, the DTMC based results are referred to as semi-analytical.
The authors would like to thank the Telecommunications Research Laboratory (TRL) of Toshiba Research Europe Ltd and its directors for supporting this study.
- WirelessHD Specification Version 1.1 Overview [http://www.wirelesshd.org/pdfs/WirelessHD-Specification-Overview-v1.1May2010.pdf]
- IEEE Standard for Information technology-Telecommunications and information exchange between systems-Local and metropolitan area networks-Specific requirements. Part 15.3: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for High Rate Wireless Personal Area Networks (WPANs) Amendment 2: Millimeter-wave-based Alternative Physical Layer Extension. IEEE Std 802.15.3c-2009 (Amendment to IEEE Std 802.15.3-2003) 2009.
- IEEE 802.15 WPAN Task Group 3c (TG3c) Millimeter Wave Alternative PHY [http://www.ieee802.org/15/pub/TG3c.html]
- Kato S, Harada H, Funada R, Baykas T, Sam C, Junyi W, Rahman M: Single carrier transmission for multi-gigabit 60-GHz WPAN systems. IEEE J Sel Areas Commun 2009, 27(8):1466-1478.
- Marinkovic M, Piz M, Choi C, Panic G, Ehrig M, Grass E: Performance evaluation of channel coding for Gbps 60-GHz OFDM-based wireless communications. In IEEE 21st International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC). Istanbul, Turkey; 2010:994-998.
- Fettweis G, Guderian F, Krone S: Entering the path towards terabit/s wireless links. In Design, Automation and Test in Europe Conference and Exhibition (DATE). Grenoble, France; 2011:1-6.
- Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 1967, 13(2):260-269.
- Jelinek F: Fast sequential decoding algorithm using a stack. IBM J Res Develop 1969, 13(6):675-685.
- Fano R: A heuristic discussion of probabilistic decoding. IEEE Trans Inf Theory 1963, 9(2):64-74. doi:10.1109/TIT.1963.1057827
- Pan W, Ortega A: Adaptive computation control of variable complexity Fano decoders. IEEE Trans Commun 2009, 57(6):1556-1559.
- Lin S, Costello D: Error Control Coding: Fundamentals and Applications. Pearson Prentice-Hall, Upper Saddle River, NJ; 2004.
- Xu R, Kocak T, Woodward G, Morris K, Dolwin C: Bidirectional Fano algorithm for high throughput sequential decoding. In IEEE 20th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). Tokyo, Japan; 2009:1809-1813.
- Xu R, Kocak T, Woodward G, Morris K: Throughput improvement on bidirectional Fano algorithm. In Proc of the 6th International Wireless Communications and Mobile Computing Conference (IWCMC). Caen, France; 2010:276-280.
- Habib I, Paker O, Sawitzki S: Design space exploration of hard-decision Viterbi decoding: algorithm and VLSI implementation. IEEE Trans Very Large Scale Integr (VLSI) Syst 2010, 18(5):794-807.
- Black P, Meng T: 1-Gb/s, four-state, sliding block Viterbi decoder. IEEE J Solid-State Circ 1997, 32(6):797-805. doi:10.1109/4.585246
- Fettweis G, Meyr H: Parallel Viterbi algorithm implementation: breaking the ACS-bottleneck. IEEE Trans Commun 1989, 37(8):785-790. doi:10.1109/26.31176
- Anders M, Mathew S, Hsu S, Krishnamurthy R, Borkar S: A 1.9 Gb/s 358 mW 16-256 state reconfigurable Viterbi accelerator in 90 nm CMOS. IEEE J Solid-State Circ 2008, 43(1):214-222.
- Tassiulas L, Ephremides A: Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Trans Inf Theory 1993, 39(2):466-478. doi:10.1109/18.212277
- Ganti A, Modiano E, Tsitsiklis J: Optimal transmission scheduling in symmetric communication models with intermittent connectivity. IEEE Trans Inf Theory 2007, 53(3):998-1008.
- Al-Zubaidy H, Talim J, Lambadaris I: Optimal scheduling policy determination for high speed downlink packet access. In IEEE International Conference on Communications (ICC). Glasgow, Scotland; 2007:472-479.
- Xu R, Woodward G, Morris K, Kocak T: A discrete time Markov chain model for high throughput bidirectional Fano decoders. In IEEE Global Telecommunications Conference (GLOBECOM). Miami, USA; 2010:1-5.
- Ozdag R, Beerel P: An asynchronous low-power high-performance sequential decoder implemented with QDI templates. IEEE Trans Very Large Scale Integr (VLSI) Syst 2006, 14(9):975-985.
- Sum C, Zhou L, Funada R, Wang J, Baykas T, Rahman M, Harada H: Virtual time-slot allocation scheme for throughput enhancement in a millimeter-wave multi-Gbps WPAN system. IEEE J Sel Areas Commun 2009, 27(8):1379-1389.
- Sun F, Zhang T: Low-power state-parallel relaxed adaptive Viterbi decoder. IEEE Trans Circ Syst I Regular Papers 2007, 54(5):1060-1068.
- Jin J, Tsui C: Low-power limited-search parallel state Viterbi decoder implementation based on scarce state transition. IEEE Trans Very Large Scale Integr (VLSI) Syst 2007, 15(10):1172-1176.
- He J, Liu H, Wang Z, Huang X, Zhang K: High-speed low-power Viterbi decoder design for TCM decoders. IEEE Trans Very Large Scale Integr (VLSI) Syst 2012, 20(4):755-759.
- Simmons S: Breadth-first trellis decoding with adaptive effort. IEEE Trans Commun 1990, 38(1):3-12. doi:10.1109/26.46522
- Shieh S, Chen P, Han Y: Reduction of computational complexity and sufficient stack size of the MLSDA by early elimination. In IEEE International Symposium on Information Theory (ISIT). Nice, France; 2007:1671-1675.
- Han Y, Chen P, Wu H: A maximum-likelihood soft-decision sequential decoding algorithm for binary convolutional codes. IEEE Trans Commun 2002, 50(2):173-178. doi:10.1109/26.983310
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.