Algorithm and hardware design of a 2D sorter-based K-best MIMO decoder
EURASIP Journal on Wireless Communications and Networking volume 2014, Article number: 93 (2014)
Abstract
In the field of multiple-input multiple-output (MIMO) decoders, the K-best algorithm has been well investigated because it guarantees an SNR-independent fixed throughput with a performance close to that of the optimal maximum likelihood detection (MLD). However, the complexity of its expansion and sorting tasks is significantly affected by the constellation size W. In this paper, we propose an algorithm and hardware design of a 2D sorter-based K-best MIMO decoder whose complexity is negligibly affected by W. The main novelties of the algorithm are the following: (1) direct expansion and parent node grouping are proposed to reduce the complexity of the expansion task, and (2) a two-dimensional (2D) sorter is proposed to simplify the sorting task. The hardware design of the decoder supports up to 256QAM modulation and targets 4 × 4 MIMO 802.11n and 802.11ac systems. The paper shows that the proposed decoder outperforms the Bell Labs layered space-time (BLAST) minimum mean square error (MMSE) and lattice-reduction aided (LRA) MMSE decoders, and is close to the full K-best in terms of bit error rate (BER) performance. The hardware design of the decoder is synthesized as an application-specific integrated circuit (ASIC) and compared with previous works. It achieves the highest throughput (up to 2.7 Gbps), consumes the least power (56 mW), obtains the best hardware efficiency (15.2 Mbps/Kgate), and has the shortest latency (0.07 µs).
1 Introduction
Multiple-input multiple-output (MIMO) technology has shown great promise for future wireless communication because of its high spectral efficiency. For example, it has been applied in many wireless communication standards such as IEEE 802.16e/m and IEEE 802.11n/ac [1]. As an important part of the MIMO system, the MIMO decoder has been well investigated recently. Several types, such as maximum likelihood detection (MLD), linear minimum mean square error (LMMSE), Bell Labs layered space-time MMSE (BLAST MMSE), and lattice-reduction aided MMSE (LRA MMSE), have been proposed. Among these, it is well known that the MLD is the optimal approach in terms of bit error rate (BER) performance. However, its complexity increases exponentially with the number of constellation points of the modulation and with the number of spatial streams [2]. Much research has therefore focused on suboptimal MLD algorithms, especially on the full K-best. If a MIMO system sends data via N spatial streams, the full K-best processes the data through N stages. In each stage, it first computes the Euclidean distance from the received information to all of the constellation nodes (i.e., the expansion task) and then sorts the obtained results (i.e., the sorting task) to select the K best nodes. If we denote by W the number of constellation nodes, the complexity of the expansion and sorting tasks increases proportionally to W and W^{2}, respectively.
To reduce the complexity of the K-best, several studies have been carried out and published. They can be classified into two methods, named complex domain and real domain. The former processes the data through N stages as the full K-best does, but introduces new ideas that reduce the complexity in exchange for an acceptable performance degradation. Typical proposals of this kind are the fixed sphere decoder (FSD) algorithm in [3], the step-reduced K-best sphere decoder algorithm in [4], and the zigzag on-demand expansion scheme in [5]. On the other hand, the real domain method separates the in-phase (IP) and quadrature-phase (QP) components of complex data into two independent real data and processes these data in the real domain. Thus, the complexity of each stage is reduced, while the number of stages is increased from N (in the complex domain) to 2N (in the real domain). Well-known works on this method are [6–9]. Studying these proposals, we recognize that the expansion and sorting tasks are still too complex for practical implementation if a large value of K and high-order modulation types such as 256QAM are needed.
In this paper, we propose an algorithm and hardware design of a low-complexity 2D sorter-based K-best MIMO decoder. The proposal is based on the complex domain method. The contributions of this paper are briefly described as follows:

In terms of algorithm, we propose direct expansion and parent node grouping methods to reduce the complexity of the expansion task, and a two-dimensional (2D) sorter to simplify the sorting task. The direct expansion specifies the best candidates directly without searching all the constellation nodes. Consequently, the complexity of the algorithm is negligibly affected by the constellation size. The Euclidean distance computation becomes simpler, and the divider is eliminated. The parent node grouping reduces the number of search candidates to an acceptable amount without sacrificing BER performance. The 2D sorter performs matrix-based sorting. It has low complexity, is suitable for hardware resource sharing, and provides an approximate result.

In terms of hardware architecture, a prototype of the algorithm that aims to support 4 × 4 MIMO 802.11n/ac systems is developed. We utilize techniques such as resource sharing and a GAIN-MUX-based multiplier to further reduce the complexity.
The rest of this paper is organized as follows: Section 2 presents preliminary information such as notations, the channel model, and the full K-best algorithm. Section 3 describes our algorithm. Section 4 focuses on hardware design. Sections 5 and 6 compare the proposed decoder with previous works in terms of BER performance and application-specific integrated circuit (ASIC) results, respectively. We conclude the paper in Section 7.
2 Background
2.1 Notations
We shall use bold lowercase letters for vectors and bold capital letters for matrices. Furthermore, ∥·∥ denotes the L2-norm or Euclidean distance, (·)^{H} denotes the Hermitian transpose of a matrix, and (·)^{I} and (·)^{Q} respectively denote the in-phase (IP) and quadrature-phase (QP) parts of a signal.
2.2 Channel model and full K-best algorithm
This paper considers a MIMO system with spatial multiplexing signaling (i.e., the signals transmitted from individual antennas are independent of each other). Let N and M represent the number of transmit and receive antennas, respectively, with M ≥ N. Assume that each transmit symbol is taken from a quadrature amplitude modulation (QAM) constellation that has W nodes.
The transmission of each vector x over flat-fading MIMO channels can be modeled as (1), in which y is the M × 1 received signal vector, H is the M × N channel matrix, x is the N × 1 transmit symbol vector, and n is the M × 1 independent identically distributed (i.i.d.) Gaussian white noise vector. The channel H is decomposed into two matrices Q and R: H = QR, in which Q is an M × M unitary matrix and R is an M × N upper triangular matrix. In case M > N, the last M − N rows of R are zero, and the size of the R matrix thus effectively becomes N × N. For simplicity, in this paper we assume that M = N.
The full K-best finds the transmitted symbol x by solving (2). In this equation, Ω^{N} denotes the set of W^{N} possible transmitted symbol vectors x, and z = Q^{H}y. Equation 2 is computed through N stages in the order from N to 1, one after another. The nth stage (n = N,…,1) computes the nth partial Euclidean distance (PED_{n}) in (3) by adding PED_{n+1} (i.e., the result of the (n+1)th stage) to D_{n} (i.e., the increment calculated in the nth stage).
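Equations (1) to (3) are referenced above but are not reproduced in this text. As a hedged reconstruction from the definitions of y, H, x, n, z, Q, and R given in this section (standard K-best notation; the exact typesetting of the original and the initial value PED_{N+1} = 0 are assumptions), they can be written as:

```latex
\begin{align}
\mathbf{y} &= \mathbf{H}\mathbf{x} + \mathbf{n} \tag{1}\\
\hat{\mathbf{x}} &= \arg\min_{\mathbf{x}\in\Omega^{N}}
  \left\lVert \mathbf{z} - \mathbf{R}\mathbf{x} \right\rVert^{2},
  \qquad \mathbf{z} = \mathbf{Q}^{H}\mathbf{y} \tag{2}\\
\mathrm{PED}_{n} &= \mathrm{PED}_{n+1} + D_{n},
  \qquad D_{n} = \Bigl| z_{n} - \sum_{j=n}^{N} r_{nj}\,x_{j} \Bigr|^{2},
  \qquad \mathrm{PED}_{N+1} = 0 \tag{3}
\end{align}
```

Because R is upper triangular, row n of z − Rx involves only x_{n},…,x_{N}, which is what allows the minimization in (2) to be accumulated stage by stage from n = N down to n = 1.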
Two main tasks, expansion and sorting, are performed in stage n (n = N,…,1) (refer to [5] for details).

The expansion task first computes K × W values of D_{n} and x_{n} (i.e., child nodes) from the K parent nodes selected in stage n+1. It then calculates PED_{n} = PED_{n+1} + D_{n}.

The sorting task sorts the K × W values of PED_{n} to find the K smallest values of PED_{n} and the corresponding {x_{N},…,x_{n}}. The selected data become the parent nodes of the next stage (i.e., stage n−1).
The processing of the first two stages (i.e., N and N−1) is illustrated in Figure 1. Notice that K = 1 if n = N.
At the final stage (i.e., stage 1), the sorting is not performed. All K × W values of PED_{1} are used for the final decision, which is either hard or soft. The hard decision method finds the value of {x_{N},…,x_{1}} that corresponds to the smallest value of PED_{1} and takes this value as the decoded data, while the soft decision method calculates the log likelihood ratio (LLR) of all information bits.
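The stage-by-stage procedure described above can be sketched in a few lines. The following is a minimal illustration of the full K-best with hard decision, not the authors' implementation: it assumes NumPy, uses a plain (unsorted) QR decomposition, an unnormalized square QAM grid, and the hypothetical helper names `qam_points` and `full_k_best`.

```python
import numpy as np

def qam_points(sqrt_w):
    """Unnormalized square QAM grid with sqrt_w levels per axis (e.g. 4 -> 16QAM)."""
    levels = np.arange(-(sqrt_w - 1), sqrt_w, 2)
    return np.array([complex(i, q) for i in levels for q in levels])

def full_k_best(y, H, points, K):
    """Hard-decision full K-best: expand every parent to all W children, keep K."""
    Q, R = np.linalg.qr(H)           # H = QR (square H assumed, M = N)
    z = Q.conj().T @ y               # z = Q^H y
    N = H.shape[1]
    # Each path is (symbols, PED); symbols holds (x_{n+1}, ..., x_N) so far.
    paths = [((), 0.0)]              # single root, i.e. K = 1 at stage N
    for n in range(N - 1, -1, -1):   # 0-based row index, stage N down to 1
        children = []
        for symbols, ped in paths:
            # Interference from the already-decided symbols x_{n+1..N}.
            interference = sum(R[n, n + 1 + t] * s for t, s in enumerate(symbols))
            for x in points:         # expansion task: all W constellation nodes
                d = abs(z[n] - interference - R[n, n] * x) ** 2
                children.append(((x,) + symbols, ped + d))
        if n > 0:                    # sorting task (skipped at the final stage)
            children.sort(key=lambda c: c[1])
            children = children[:K]
        paths = children
    best_symbols, _ = min(paths, key=lambda c: c[1])
    return np.array(best_symbols)
```

Each stage visits K × W children and sorts them, which is the cost that the remainder of the paper sets out to reduce.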
3 The proposed algorithm
Firstly, we use sorted QR decomposition (SQRD) preprocessing, H = SQR, instead of the conventional QRD, H = QR, to improve the BER performance. In [10], the authors have shown that a low-complexity SQRD can be designed by using the modified Gram-Schmidt algorithm with pipelining and resource sharing.
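As an illustration of what a sorted QR decomposition based on modified Gram-Schmidt can look like, the sketch below uses the common heuristic of processing, at each step, the remaining column with the smallest norm. This ordering rule, the function name `sorted_qr`, and the returned permutation vector are assumptions for illustration; the text does not specify the ordering used in [10].

```python
import numpy as np

def sorted_qr(H):
    """Modified Gram-Schmidt QR with min-norm column ordering.

    Returns Q, R, perm such that H[:, perm] == Q @ R (up to rounding).
    """
    Q = np.array(H, dtype=complex)       # working copy, becomes Q
    n = Q.shape[1]
    R = np.zeros((n, n), dtype=complex)
    perm = np.arange(n)
    for i in range(n):
        # Pick the remaining column with the smallest residual norm.
        norms = np.linalg.norm(Q[:, i:], axis=0)
        k = i + int(np.argmin(norms))
        Q[:, [i, k]] = Q[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]
        perm[[i, k]] = perm[[k, i]]
        # Normalize column i, then remove its component from later columns.
        R[i, i] = np.linalg.norm(Q[:, i])
        Q[:, i] = Q[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = np.vdot(Q[:, i], Q[:, j])   # conj(q_i)^T q_j
            Q[:, j] = Q[:, j] - R[i, j] * Q[:, i]
    return Q, R, perm
```

With this construction the diagonal of R is real and positive, and the detected symbol vector has to be un-permuted with `perm` after decoding.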
The main process of our algorithm is done through N stages as in the full K-best. The following ideas are proposed to reduce the complexity.
3.1 Direct expansion
Firstly, D_{n} in (3) is rewritten into (4) and (5) as follows.
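Equations (4) and (5) are not reproduced in this text. A hedged reconstruction, consistent with how f_{n}, DI_{n}, and DQ_{n} are used in the steps of this section and with the center node x_{c} = f_{n}/r_{nn} mentioned later, is given below. It assumes that r_{nn} is real, as is usual after QR decomposition; the exact form in the original may differ.

```latex
\begin{align}
f_{n} &= z_{n} - \sum_{j=n+1}^{N} r_{nj}\,x_{j} \tag{4}\\
D_{n} &= \bigl| f_{n} - r_{nn}x_{n} \bigr|^{2}
       = \underbrace{\bigl(f_{n}^{I} - r_{nn}x_{n}^{I}\bigr)^{2}}_{DI_{n}}
       + \underbrace{\bigl(f_{n}^{Q} - r_{nn}x_{n}^{Q}\bigr)^{2}}_{DQ_{n}} \tag{5}
\end{align}
```

Under this form, the IP and QP parts of x_{n} can be searched independently, which is what the subdomain comparison below exploits.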
In the first quadrant of the constellation (in which the IP and QP parts are both nonnegative), we divide the IP space into $\sqrt{W}-1$ subdomains: $[0, r_{nn}), [r_{nn}, 2r_{nn}), \dots, [(\sqrt{W}-2)r_{nn}, \infty)$. Each subdomain is associated with a set of $\mathrm{ceil}(\sqrt{L})$ best values of $x_{n}^{I}$. For example, if the modulation is 16QAM and L = 9, the IP space is divided into the [0, r_{nn}), [r_{nn}, 2r_{nn}), and [2r_{nn}, ∞) subdomains. The corresponding three best values of $x_{n}^{I}$ are (1, −1, 3), (1, 3, −1), and (3, 1, −1), respectively (refer to Figure 2a). The QP space and $x_{n}^{Q}$ are treated similarly.
The L best child nodes per parent node in stage n (n = N,…,1) are directly specified as follows:
Step 1. Calculate f_{n}, which is defined in (4).
Step 2. Determine the IP subdomain that $f_{n}^{I}$ belongs to by comparing $|f_{n}^{I}|$ with the values r_{nn}, 2r_{nn}, …, $(\sqrt{W}-2)r_{nn}$. From that, the $\mathrm{ceil}(\sqrt{L})$ best values of $x_{n}^{I}$ are known. If $f_{n}^{I} < 0$, the signs of $x_{n}^{I}$ are reversed. Then, we calculate the corresponding DI_{n} in (5). The $x_{n}^{Q}$ and DQ_{n} are found similarly (refer to Figure 2a).
Step 3. From the $\mathrm{ceil}(\sqrt{L})$ best values of $x_{n}^{I}$, DI_{n} and of $x_{n}^{Q}$, DQ_{n}, we compute the L best values of x_{n} and D_{n} in (5). Let i_{n} and q_{n} denote the index numbers of the best values of DI_{n} and DQ_{n}, which are already in ascending order. The combinations of the sum D_{n} = DI_{n} + DQ_{n} are arranged so that the sum $i_{n}^{2} + q_{n}^{2}$ increases. Consequently, the results of D_{n} are approximately in ascending order without sorting (refer to Figure 2b).
To expand the L best child nodes from a parent node, previous works such as [5] first find the center node by rounding the result x_{c} = f_{n}/r_{nn} and then seek the L nearest nodes to the center node. A divider is thus required. By using the comparisons of step 2 instead, the proposed algorithm eliminates the divider for f_{n}/r_{nn}. Furthermore, by using (5), the L values of D_{n} are obtained from only $\mathrm{ceil}(\sqrt{L})$ values each of DI_{n} and DQ_{n}, so the complexity of computing the Euclidean distance D_{n} is reduced.
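Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' hardware logic: constellation levels are the unnormalized odd integers (the modulation gain is omitted), the per-subdomain lists of best values are precomputed offline by a brute-force ranking, ties in $i^{2}+q^{2}$ are broken arbitrarily, and all function names are hypothetical.

```python
import math

def build_axis_table(sqrt_w, m):
    """For each subdomain [k*r, (k+1)*r), list the m nearest levels, best first.

    Built offline; the ranking is constant inside a subdomain because it can
    only change where |f| crosses an integer multiple of r_nn.
    """
    levels = range(-(sqrt_w - 1), sqrt_w, 2)       # e.g. -3, -1, 1, 3
    table = []
    for k in range(sqrt_w - 1):
        probe = k + 0.5                            # any point inside subdomain k
        ranked = sorted(levels, key=lambda p: abs(probe - p))
        table.append(tuple(ranked[:m]))
    return table

def expand_axis(f, r_nn, table):
    """Step 2 for one axis: comparisons only, no division by r_nn."""
    mag = abs(f)
    k = 0
    for t in range(1, len(table)):                 # compare with r, 2r, ...
        if mag >= t * r_nn:
            k = t
    sign = -1 if f < 0 else 1                      # mirror for negative f
    values = [sign * v for v in table[k]]
    dists = [(f - r_nn * v) ** 2 for v in values]  # DI_n or DQ_n, ascending
    return values, dists

def direct_expand(f_n, r_nn, sqrt_w, L):
    """Step 3: combine axis candidates in order of increasing i^2 + q^2."""
    m = math.isqrt(L - 1) + 1                      # integer form of ceil(sqrt(L))
    table = build_axis_table(sqrt_w, m)
    xi, di = expand_axis(f_n.real, r_nn, table)
    xq, dq = expand_axis(f_n.imag, r_nn, table)
    order = sorted(((i, q) for i in range(len(xi)) for q in range(len(xq))),
                   key=lambda p: p[0] ** 2 + p[1] ** 2)
    return [(complex(xi[i], xq[q]), di[i] + dq[q]) for i, q in order[:L]]
```

For 16QAM and L = 9, `build_axis_table(4, 3)` reproduces the three value sets quoted in the text, and the first child returned by `direct_expand` is always the nearest constellation node.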
3.2 Parent node grouping
It is important to determine how large the number of child nodes per parent node (L) should be. If L is large, the BER performance improves, but the complexity of the decoder also increases. If L is too small, the BER performance may be too poor to fulfill the system requirement.
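Notice that the L best child nodes of a parent are directly specified in ascending order of D_{n}, as described in Section 3.1, and that they all share the same PED_{n+1}. Any child ranked after the Kth therefore has at least K siblings with a smaller PED_{n}. Hence, if L > K, there is no chance that one of the last L − K child nodes of any parent becomes one of the K selected nodes. Thus, selecting L ≤ K is a way to reduce the complexity without sacrificing performance.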
In addition, let k and c be the index numbers of the K parent nodes (PED_{n+1}) and of the L child nodes (D_{n}) per parent node in stage n, respectively. Because the values of PED_{n+1} are already sorted in stage n+1, a parent node with a high index k has a large value of PED_{n+1}. Thus, its child nodes are expected to have a low probability of being selected as one of the K smallest (best) nodes for the next stage. To verify this analysis, we ran a simulation and computed the probability (in %) that a child node becomes one of the K best nodes. The result is shown in Figure 3. From this figure, it can be seen that the larger the index k is, the fewer of its child nodes are likely to be selected.
Based on that fact, we propose a parent node grouping method as follows: The K parent nodes are divided into G groups. Each group has A = K/G parent nodes. Note that K and G should be selected so that K is divisible by G (i.e., mod(K,G) = 0). Group 1 contains the best parent nodes, while group G contains the worst parent nodes. Each parent node of the gth group (g = 1, 2, …, G) is expanded into L_{g} child nodes so that L_{G} < ⋯ < L_{1} ≤ K.
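The grouping rule can be expressed as a small helper that assigns a child-node budget to each parent index. This is only a sketch of the rule as stated above; the function name `children_per_parent` is hypothetical, and the example values K = 21 and (L_1, L_2, L_3) = (4, 3, 1) are the ones selected later in Section 4.1.

```python
def children_per_parent(K, group_sizes):
    """Return L_g for each of the K sorted parents (index 0 = best parent).

    group_sizes is (L_1, ..., L_G), which must satisfy L_G < ... < L_1 <= K.
    """
    G = len(group_sizes)
    if K % G != 0:
        raise ValueError("K must be divisible by G")
    if group_sizes[0] > K:
        raise ValueError("L_1 must not exceed K")
    if any(a <= b for a, b in zip(group_sizes, group_sizes[1:])):
        raise ValueError("L_g must be strictly decreasing")
    A = K // G                         # parents per group
    return [group_sizes[k // A] for k in range(K)]

# Example with the configuration used in Section 4.1.
plan = children_per_parent(21, (4, 3, 1))
total_children = sum(plan)             # C = A * (L_1 + L_2 + L_3)
```

With these values, each stage expands 56 child nodes instead of the 21 × 256 that the full K-best would visit for 256QAM.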
3.3 Two-dimensional sorter
Sorting is the major bottleneck of the K-best decoder because of its high complexity. Theoretically, the sorting of n elements requires (n^{2} − n)/2 comparators.
In this subsection, we propose a two-dimensional (2D) sorter which has low complexity, is suitable for hardware resource sharing, and produces an approximate result. The 2D sorter for sorting $C = \sum_{g=1}^{G} A L_{g}$ child nodes is described as follows: we put the C child nodes into an A × B matrix, in which $B = \sum_{g=1}^{G} L_{g}$. The jth row of the matrix contains all the child nodes of the jth parent of all groups. The case G = 3 is illustrated in Figure 4. The matrix is processed by two operations, called row sorting and column sorting, one after the other, as follows:

Row sorting. The B elements in a row are sorted. The smallest value is placed at the left of the row. This sorting is repeated for all rows.

Column sorting. The A elements in a column are sorted. The smallest value is placed at the top of the column. This sorting is repeated for all columns.
After completing the row and column sorting, the K top-left elements of the sorted matrix are expected to be the best (smallest) values and are selected. A simulation is needed in advance to correctly determine the positions of the best candidates. To verify the correctness of the 2D sorter, we ran a simulation and measured the probability (in %) that an element of the sorted matrix is one of the actual K = 7, K = 14, and K = 21 best nodes. The results are shown in Figure 5. From these results, the positions of the 1st to the 7th (yellow), 8th to the 14th (green), and 15th to the 21st (blue) best nodes are determined one by one. The figure also shows that the obtained results (in %) are slightly affected by the channel type. However, the influence is so small that the positions of the best nodes do not depend on the channel type.
The 2D sorter is suitable for hardware resource sharing because all the rows (columns) perform the same task. A circuit which sorts the B elements of the 1st row in the 1st cycle can be reused to sort the 2nd, …, Ath rows in the 2nd, …, Ath cycles.
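The row-then-column procedure can be sketched in software as below. The row and column sorting follow the description above; the rule for choosing which K top-left positions to keep is an assumption made only for this illustration (cells are ranked by (row+1)·(column+1), a lower bound on an element's true rank), whereas the paper fixes the positions from simulation (Figure 5). The function names are hypothetical.

```python
def two_d_sort(matrix):
    """Sort every row ascending, then every column ascending."""
    rows = [sorted(row) for row in matrix]              # row sorting
    cols = [sorted(col) for col in zip(*rows)]          # column sorting
    return [list(row) for row in zip(*cols)]            # back to row-major

def select_top_left(sorted_matrix, k):
    """Pick k cells near the top-left corner (illustrative selection rule).

    An element at 0-based (i, j) of a row- and column-sorted matrix has at
    least (i+1)*(j+1) - 1 elements that are not larger, so that product is
    used here as a stand-in for the simulation-derived positions.
    """
    cells = [((i + 1) * (j + 1), i, j)
             for i, row in enumerate(sorted_matrix)
             for j in range(len(row))]
    cells.sort()
    return [sorted_matrix[i][j] for _, i, j in cells[:k]]
```

After both passes the rows remain sorted, the global minimum sits at the top-left corner, and the selected set is an approximation of the true K smallest values rather than an exact sort.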
4 Hardware design
4.1 Overview architecture
To evaluate the effectiveness of the proposed algorithm in practice, we develop a 4 × 4 2D sorter-based K-best MIMO decoder for 802.11n and 802.11ac systems. The decoder supports five modulation types: BPSK, QPSK, 16QAM, 64QAM, and 256QAM. After exhaustive simulation and consideration of the tradeoff between BER performance and complexity, the decoder is configured as follows:

At all stages, we select K = 21, G = 3, A = K/G = 7, L_{1} = 4, L_{2} = 3, L_{3} = 1, B = 8, and C = 56. In the cases of 16QAM, QPSK, and BPSK, which have W < K, the numbers of parent nodes of stages 3, 2, and 1 (denoted by K_{3}, K_{2}, and K_{1}, respectively) are selected as follows: in 16QAM mode, K_{3} = 14 and K_{2} = K_{1} = 21; in QPSK mode, K_{3} = 4, K_{2} = 14, and K_{1} = 21; and in BPSK mode, K_{3} = 2, K_{2} = 4, and K_{1} = 8.

Stage 4 does not use the sorter, while stages 3 and 2 use the proposed 2D sorter with the matrix size of 7 × 8.

Soft decision is used to achieve high BER performance.

This configuration is illustrated in Figure 6a.
The overview hardware architecture of the decoder is shown in Figure 6b. The ‘STAGE 4’ block computes K best values of PED_{4} and the corresponding x_{4} in (6). Similarly, the ‘STAGE 3’ block computes K best values of PED_{3} and the corresponding {x_{4},x_{3}} in (8). The ‘STAGE 2’ block computes K best values of PED_{2} and the corresponding {x_{4},x_{3},x_{2}} in (10). The ‘STAGE 1’ block computes C best values of PED_{1} and the corresponding {x_{4},x_{3},x_{2},x_{1}} in (12). The ‘LLR’ block computes the log likelihood ratio. The ‘MultiplierLess’ block prepares necessary data so that no multiplier will be implemented in all the abovementioned blocks.
4.2 Hardware implementation
To achieve low complexity, in addition to using the proposed algorithm, the following implementation points are worth noting.
4.2.1 GAIN-MUX-based multiplier
From (6) to (12), it can be seen that the decoder requires a large number of multipliers to compute r_{ij}x_{j} (i = 4,3,2,1; j ≥ i). For example, $2\sqrt{K}$ multipliers are needed to compute $r_{44}x_{4}^{I}$ and $r_{44}x_{4}^{Q}$ in stage 4 (see (6)), and each multiplier consumes a large amount of hardware resources.
To compute r_{ij}x_{j} (i = 4,3,2,1; j ≥ i), instead of using multipliers, we implement a GAIN block and multiplexers (MUX) as shown in Figure 7. This figure illustrates the case of multiplying r_{ij} by m best values of x_{j}. The left figure shows the conventional method, which uses m different multipliers. The right figure shows our proposed GAIN-MUX-based multiplier. The input data r_{ij} first goes into the ‘GAIN’ block, which amplifies r_{ij} by the modulation gain D and then by the values of the constellation points 1, 3, 5, …, 15. Notice that all the possible values of x_{j} are {D, 3D, …, 15D}. The outputs of the ‘GAIN’ block are then fed into m MUX blocks. Each MUX is controlled by a select signal of x_{j} (denoted by $\mathrm{sel\_}x_{j}^{(m)}$). If the values of $x_{j}^{(m)}$ are {D, 3D, …, 15D}, the values of $\mathrm{sel\_}x_{j}^{(m)}$ are {0, 1, …, 7}. Consequently, the outputs of the MUX blocks are equivalent to the outputs of the multipliers in the left figure, while the hardware cost of a MUX is much smaller than that of a multiplier.
The decoder needs multipliers to compute many products, such as $r_{44}x_{4}^{I}$, $r_{33}x_{3}^{I}$, and $r_{22}x_{2}^{I}$, while the possible values of $x_{4}^{I}$, $x_{3}^{I}$, and $x_{2}^{I}$ are the same. Thus, one ‘GAIN’ block can be shared among them. The ‘MultiplierLess’ block implements this ‘GAIN’ block.
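A behavioral model of the GAIN-MUX idea is sketched below: the products of r_{ij} with every possible constellation magnitude are computed once, and each "multiplication" then becomes a table selection. This is a software analogy only, with hypothetical function names; the value D = 1/√170 is the usual 256QAM normalization and is used here as an example, and deriving the select index from x is done only for the demonstration (in hardware the select signal comes from the expansion logic).

```python
import math

LEVELS = (1, 3, 5, 7, 9, 11, 13, 15)       # positive 256QAM levels per axis

def gain_block(r_ij, D, levels=LEVELS):
    """Precompute r_ij * D * level for every level (shifts/adds in hardware)."""
    return [r_ij * D * level for level in levels]

def mux(gain_outputs, sel):
    """Select one precomputed product; sel is in {0, ..., 7}."""
    return gain_outputs[sel]

def select_index(x, D):
    """Map x in {D, 3D, ..., 15D} to its select value {0, ..., 7}."""
    return (round(x / D) - 1) // 2

# Example: m = 3 candidate values of x_j share one GAIN block.
D = 1 / math.sqrt(170)
r_ij = 0.37
gains = gain_block(r_ij, D)
candidates = [3 * D, D, 5 * D]
products = [mux(gains, select_index(x, D)) for x in candidates]
```

The same `gains` list can serve any number of MUX selections, which mirrors the sharing of one ‘GAIN’ block among several products with the same set of possible x values.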
4.2.2 Resource sharing
This technique is implemented in STAGE 4, STAGE 3, STAGE 2, and STAGE 1 blocks.
The STAGE 4 block computes the K best values of PED_{4} and x_{4} in (6). Based on the direct expansion method, it finds the $\mathrm{ceil}(\sqrt{21}) = 5$ best values of DI_{4} and DQ_{4} and then adds these values together. Because the processes of finding DI_{4} and DQ_{4} are similar to each other, they share the same circuit. Figure 8a shows the block diagram inside STAGE 4, in which ‘BLOCK A’ is shared to find the best values of DI_{4}, $x_{4}^{I}$ and DQ_{4}, $x_{4}^{Q}$ in two clock cycles. In other words, the sharing factor of this block is 2. The design of BLOCK A is shown in Figure 8b, in which the ‘SIGN ABS’ block determines the sign and absolute value of $z_{4}^{I}$ (and $z_{4}^{Q}$). The ‘CONSLOCAT’ block specifies the subdomain in the constellation that $|z_{4}^{I}|$ (and $|z_{4}^{Q}|$) belongs to. Based on the information from the CONSLOCAT block, the ‘DI/DQ CAL’ block computes the best values of DI_{4} and DQ_{4}, while the ‘XDECODE’ block finds the best values of $x_{4}^{I}$ and $x_{4}^{Q}$.
The STAGE 3 block computes the best values of PED_{3} and the corresponding {x_{4},x_{3}} in (8). The block diagram is shown in Figure 8c, in which ‘B1,’ ‘B2,’ and ‘B3’ respectively perform the direct expansion for the 1st to 7th (i.e., group 1), 8th to 14th (i.e., group 2), and 15th to 21st (i.e., group 3) parent nodes. Because all parent nodes in the same group are processed in the same way, they can share the same circuit. Consequently, the B1 block is designed to find the L_{1} best child nodes of one parent node only. It is then reused over seven clock cycles to complete the direct expansion for the seven parent nodes of group 1. The sharing factor is 7. Similarly, the B2 and B3 blocks are each shared seven times. Each of the B1, B2, and B3 blocks has the following components: ‘CAL f_{3}’ computes f_{3} in (7), ‘BLOCK A*’ computes DI_{3} and DQ_{3}, and ‘SUM’ computes PED_{3} from PED_{4}, DI_{3}, and DQ_{3} (see (8)).
After each clock cycle, B1, B2, and B3 output the best child nodes of one parent node in each group. In other words, all elements of one row in the sort matrix (see Section 3.3) are obtained per cycle. The ‘2DSORT’ block thus requires only a one-row sorting circuit. This circuit is then shared to sort all seven rows in seven clock cycles. The sharing factor is 7. The hardware design of the 2DSORT block is shown in Figure 9. The ‘ROWSORT’ block sorts the eight outputs of B1, B2, and B3 per clock cycle. Only the four best data are kept. In the ‘COLSORT’ block, the ‘1to7’ block collects the best values from ROWSORT over seven cycles and sorts them. The ‘1to6’ block collects the 2nd best values from ROWSORT over seven cycles, sorts them, and obtains the six best data, and so on. The designs of ROWSORT and ‘1to3’ of COLSORT are shown in Figure 9b,c, respectively. It can be seen that the 2DSORT needs only 36 comparators to sort 56 child nodes, which is a significant reduction compared to the (56^{2} − 56)/2 = 1,540 comparators required by a full sort.
The architectures of STAGE 2 and STAGE 1 are similar to STAGE 3. The sharing factor of these blocks is 7. However, the 2DSORT block is not implemented in STAGE 1. Instead, the results of B1, B2, and B3 are directly passed to the LLR block.
5 BER performance comparison
An 802.11ac simulator with the following options was used in our simulation: 4 × 4 MIMO and 5,000 transferred packets. The total transferred data was 2.5 × 10^{6} bytes. The bandwidth was 80 MHz. The channel type was D. The forward error correction (FEC) type was binary convolutional code (BCC).
5.1 QRD versus SQRD
Figure 10 shows that, compared with QRD, SQRD preprocessing improves the BER performance by 0.6 dB and 0.8 dB (at BER = 10^{−3}) and by 1 dB (at BER = 10^{−2}) for the cases of K = 21, K = 10, and K = 6, respectively.
5.2 Parent node grouping
Figure 11 shows that the BER performance is insignificantly degraded when ‘L1-L2-L3’ is decreased from 256-256-256 (full K-best) to 9-9-9, 9-6-3, 4-4-4, and 4-3-1. Numerically, the performance degradation of L1-L2-L3 = 4-3-2 is about 0.15 dB as compared to the full K-best (at BER = 10^{−3}). However, when the number of child nodes per parent node is further reduced to L1-L2-L3 = 1-1-1, the performance degradation is about 1.2 dB, which is considerable, as compared to the full K-best.
5.3 2D sorter
Figure 12 shows that the BER performance of ‘S4 NoSort - S32 2D Sort’ is insignificantly degraded as compared to the case of ‘S4 FullSort - S32 FullSort’. The amount of degradation is about 0.08 dB (at BER = 10^{−3}). In other words, (1) by applying the direct expansion method, the sorter can be eliminated in stage 4, and (2) the 2D sorter is an acceptable approximation of the full sorter, at the cost of about 0.08 dB in BER performance.
5.4 The proposed decoder
Figure 13 shows the BER of a 4 × 4 MIMO 802.11ac system when applying BLAST MMSE, LRA MMSE, full K-best (soft decision), and the proposed decoder (soft and hard decisions).
From this figure, it can be seen that for all modulation types (16QAM, 64QAM, and 256QAM), the proposed decoder with soft decision (green line) outperforms the BLAST MMSE (blue line) and LRA MMSE (black line), and is close to the full K-best with soft decision (red line). Numerically, at the observation point of BER = 10^{−3}, the proposed decoder (with soft decision) is better than BLAST MMSE by 6.7, 3.7, and 2.3 dB, respectively. It is better than LRA MMSE by 1, 0.5, and 0.02 dB, respectively. As compared to the full K-best, the BER performance degradation of the proposed decoder is about 0.2 dB in all cases. In addition, using soft decision improves the performance of the proposed decoder by about 2 dB as compared to hard decision (green line versus pink line).
From this figure, we also see that the BER performance gap between the K-best decoders (the proposed decoder with soft decision and the full K-best) and the MMSE decoders (LRA MMSE and BLAST MMSE) decreases when the modulation order increases from 16QAM to 64QAM and to 256QAM. That is because the modulation size increases while the K value is fixed to 21. Consequently, the BER performance of the proposed decoder and of the full K-best is expected to become worse as the modulation size increases.
Notice that in cases of BPSK and QPSK, the proposed decoder searches all of the constellation nodes; it thus achieves the same BER as the optimal MLD does.
6 Complexity comparison
Due to the application of the direct expansion method, the number of search candidates (or visited nodes) of the proposed decoder is no longer affected by the constellation size. It is affected by K, L_{g} (g = 1,…,G), and N only.
Numerically, we compare the complexity of the proposed algorithm with previous works in terms of the total number of visited nodes (abbreviated as ‘total nodes’) in Table 1. All the compared algorithms are configured as 4 × 4 MIMO decoders (N = 4). The data of [3] and [4] are obtained from their papers. The data of [5] is calculated by ourselves based on our understanding of the algorithm. To the best of our knowledge, this algorithm needs to visit $(\sqrt{W}+K+1)+2K(\text{RSE\_num}+\text{CSE\_num}+1)+K$ nodes, in which RSE_num = 4 and CSE_num = 3 are reported to be optimal for the case of N = 4, K = 10, and W = 64 (64QAM).
This table shows that

As compared to [3] and [4], the proposed algorithm visits about 8.5 times fewer total nodes, while the K values differ by a factor of about 1.24.

The total nodes of the proposed algorithm is about half of that of [5] when both use the same K = 21, while the proposed algorithm supports higher modulation than [5] (256QAM versus 64QAM). When [5] uses K = 10 and the proposed algorithm uses K = 21, they have the same total nodes.
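The node counts behind these statements can be checked with a short calculation. The expression for [5] is the one quoted above with RSE_num = 4 and CSE_num = 3. The count for the proposed decoder is an assumption made for this sketch (K nodes in the first stage plus C = A·ΣL_g nodes in each of the remaining N − 1 stages); Table 1 is not reproduced here, so this breakdown is not confirmed by the text. The function names are hypothetical.

```python
import math

def total_nodes_ref5(W, K, rse_num=4, cse_num=3):
    """Visited nodes of [5] per the expression quoted in the text (N = 4)."""
    return (math.isqrt(W) + K + 1) + 2 * K * (rse_num + cse_num + 1) + K

def total_nodes_proposed(K, group_sizes, N=4):
    """Assumed count: K nodes in stage N, then C nodes in each later stage."""
    A = K // len(group_sizes)            # parents per group
    C = A * sum(group_sizes)             # child nodes per stage
    return K + (N - 1) * C

ref5_k10 = total_nodes_ref5(W=64, K=10)          # 19 + 160 + 10
ref5_k21 = total_nodes_ref5(W=64, K=21)          # 30 + 336 + 21
proposed = total_nodes_proposed(K=21, group_sizes=(4, 3, 1))
ratio = proposed / ref5_k21
```

Under these assumptions the proposed decoder visits 189 nodes, equal to [5] at K = 10 and roughly half of the 387 nodes of [5] at K = 21, which agrees with the comparison stated above.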
The comparison in Table 1, however, reflects the complexity of the algorithms only in terms of total nodes. It does not show the complexity of computing the Euclidean distance of each visited node or of sorting the nodes.
To compare the decoder with the previous ones thoroughly, we designed and synthesized our decoder in ASIC. The synthesis tool was the Design Vision of Synopsys. The CMOS SAED 90 nm technology and saed90nm_min library were used. The applied voltage was 1.32 V.
The ASIC synthesis results are shown and compared in Table 2. All the designs are 4 × 4 K-best-based MIMO decoders. From this table, the advantages of the proposed decoder can be seen as follows:

High throughput. The proposed decoder achieves the highest throughput among all designs. Compared with the most recent work in [5], the throughput of the proposed decoder is two times higher.

Low power consumption. Among all the designs, the proposed design consumes the least power, which is about 56 mW.

Small area. Although it supports higher modulation (i.e., 256QAM) and a larger K (i.e., K = 21) than the most recent work in [5], the proposed decoder occupies less hardware area. It needs 180 Kgates, which is almost half of [5]. Recall that the proposed decoder and [5] have the same number of visited nodes (see Table 1). This is evidence for the effectiveness of the 2D sorter and of the computation method of the direct expansion.

High normalized hardware efficiency (NHE). The proposed design obtains the highest NHE. It is 15.2 Mbps/Kgate, which is better than [8, 11, 12], and [5] by 50.7, 29.2, 8.5, and 3.6 times, respectively.

Short latency. The proposed design has the shortest latency. It is 0.07 µs.
7 Conclusions
In this paper, we have proposed an algorithm and hardware design of a 2D sorter-based K-best MIMO decoder that supports up to 256QAM. By utilizing direct expansion, parent node grouping, and the 2D sorter, the algorithm has been shown to be less complex than previous works, and its complexity is negligibly affected by the constellation size. A prototype hardware architecture of the algorithm has been developed to support 4 × 4 MIMO 802.11n and 802.11ac systems. Techniques such as resource sharing and the GAIN-MUX-based multiplier have been implemented to further reduce the complexity.
The paper has shown that the proposed decoder outperforms the BLAST MMSE and LRA MMSE, and is close to the full K-best in terms of BER performance. The hardware design of the decoder achieves the highest throughput (2.7 Gbps), consumes the least power (56 mW), obtains the best hardware efficiency (15.2 Mbps/Kgate), and has the shortest latency (0.07 µs). This research is, thus, expected to be utilized not only in 802.11n/ac but also in other MIMO systems.
Our future work is to upgrade the designed decoder so that it supports from 1 × 1 to 8 × 8 MIMO cases.
References
1. Tran TH, Nagao Y, Kurosaki M, Sai B, Ochi H: ASIC implement of 600 Mbps IEEE 802.11n 4 × 4 MIMO wireless LAN system. The 14th IEEE Int. Conf. on Advan. Commu. Tech. (ICACT), Pyeongchang, Korea, 19–22 Feb. 2012, pp. 360–363
2. Azzam L, Ayanoglu E: Reduction of ML decoding complexity for MIMO sphere decoding, QOSTBC, and OSTBC. Information Theory and Application Workshop, San Diego, CA, USA, 27 Jan - 1 Feb 2008, pp. 18–25
3. Barbero LG, Thompson JS: Fixing the complexity of the sphere decoder for MIMO detection. IEEE Trans. Wireless Commun 2008, 7: 2131–2142.
4. Mao X, Cheng Y, Ma L, Xiang H: Step reduced K-best sphere decoding. Vehicular Technology Conference (VTC Fall), Quebec, Canada, 3–6 Sept 2012, pp. 1–4
5. Mahdavi M, Shabany M: Novel MIMO detection algorithm for high-order constellations in the complex domain. IEEE Trans. VLSI Syst 2013, 21: 834–847.
6. Wenk M, Zellweger M, Burg A, Felber N, Fichtner W: K-best MIMO detection VLSI architectures achieving up to 424 Mbps. Proc. Int. Symp. Circuits and Systems (ISCAS 2006), Kos Island, Greece, 21–24 May 2006, pp. 1151–1154
7. Guo Z, Nilsson P: Algorithm and implementation of the K-best sphere decoder for MIMO detection. IEEE Trans. Sel. Areas Commun 2006, 24: 491–503.
8. Mondal S, Eltawil A, Shen CA, Salama KN: Design and implementation of a sort free K-best sphere decoder. IEEE Trans. VLSI Syst 2010, 18: 1497–1501.
9. Shabany M, Gulak PG: A 675 Mb/s, 4 × 4 64QAM K-best MIMO detector in 0.13 µm CMOS. IEEE Trans. VLSI Syst 2012, 20: 135–147.
10. Miyaoka Y, Nagao Y, Kurosaki M, Ochi H: RTL design of high-speed QR decomposition for MIMO decoder. IEICE Trans. Fund. Elec., Commu. Comp. Sci A E95, 1991–1997.
11. Chen S, Zhang T, Xin Y: Relaxed K-best MIMO signal detector design and VLSI implementation. IEEE Trans. VLSI Syst 2007, 15: 328–337.
12. Liao C, Wang T, Chiueh T: A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications. IEEE Trans. Solid State Circuits 2010, 45: 411–421.
Acknowledgements
The authors would like to thank Assoc. Prof. Masayuki Kurosaki and Ms. Reina Hongyo for their help on tool licenses and hardware design. This work was partly supported by Japan Ministry of Education KAKENHI (22360156) and by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Tran, T.H., Nagao, Y. & Ochi, H. Algorithm and hardware design of a 2D sorter-based K-best MIMO decoder. J Wireless Com Network 2014, 93 (2014). doi:10.1186/1687-1499-2014-93
Keywords
Maximum likelihood detection (MLD); K-best; MIMO decoder; IEEE 802.11n/ac; 256QAM