4.1 Overview architecture
To evaluate the proposed algorithm in practice, we develop a 4×4 2D sorter-based K-best MIMO decoder for 802.11n and 802.11ac systems. The decoder supports five modulation types: BPSK, QPSK, 16QAM, 64QAM, and 256QAM. After exhaustive simulation, and considering the tradeoff between BER performance and complexity, the decoder is configured as follows:

At all stages, we select K = 21, G = 3, A = K/G = 7, L_{1} = 4, L_{2} = 3, L_{3} = 1, B = 8, and C = 56. For 16QAM, QPSK, and BPSK, which have W < K, the numbers of parent nodes of stages 3, 2, and 1 (denoted by K_{3}, K_{2}, and K_{1}, respectively) are selected as follows: in 16QAM mode, K_{3} = 14 and K_{2} = K_{1} = 21; in QPSK mode, K_{3} = 4, K_{2} = 14, and K_{1} = 21; and in BPSK mode, K_{3} = 2, K_{2} = 4, and K_{1} = 8.

Stage 4 does not use the sorter, while stages 3 and 2 use the proposed 2D sorter with a matrix size of 7 × 8.

Soft decision is used to achieve high BER performance. This configuration is illustrated in Figure 6a.
\text{PED}_{4} = \underbrace{\left|z_{4}^{I} - r_{44}x_{4}^{I}\right|^{2}}_{DI_{4}} + \underbrace{\left|z_{4}^{Q} - r_{44}x_{4}^{Q}\right|^{2}}_{DQ_{4}}
(6)
f_{3} = z_{3} - r_{34}x_{4}
(7)
\text{PED}_{3} = \text{PED}_{4} + \underbrace{\left|f_{3}^{I} - r_{33}x_{3}^{I}\right|^{2}}_{DI_{3}} + \underbrace{\left|f_{3}^{Q} - r_{33}x_{3}^{Q}\right|^{2}}_{DQ_{3}}
(8)
f_{2} = z_{2} - r_{23}x_{3} - r_{24}x_{4}
(9)
\text{PED}_{2} = \text{PED}_{3} + \underbrace{\left|f_{2}^{I} - r_{22}x_{2}^{I}\right|^{2}}_{DI_{2}} + \underbrace{\left|f_{2}^{Q} - r_{22}x_{2}^{Q}\right|^{2}}_{DQ_{2}}
(10)
f_{1} = z_{1} - r_{12}x_{2} - r_{13}x_{3} - r_{14}x_{4}
(11)
\text{PED}_{1} = \text{PED}_{2} + \underbrace{\left|f_{1}^{I} - r_{11}x_{1}^{I}\right|^{2}}_{DI_{1}} + \underbrace{\left|f_{1}^{Q} - r_{11}x_{1}^{Q}\right|^{2}}_{DQ_{1}}
(12)
The overall hardware architecture of the decoder is shown in Figure 6b. The ‘STAGE 4’ block computes the K best values of PED_{4} and the corresponding x_{4} in (6). Similarly, the ‘STAGE 3’ block computes the K best values of PED_{3} and the corresponding {x_{4},x_{3}} in (8). The ‘STAGE 2’ block computes the K best values of PED_{2} and the corresponding {x_{4},x_{3},x_{2}} in (10). The ‘STAGE 1’ block computes the C best values of PED_{1} and the corresponding {x_{4},x_{3},x_{2},x_{1}} in (12). The ‘LLR’ block computes the log-likelihood ratio. The ‘MultiplierLess’ block prepares the necessary data so that none of the above-mentioned blocks needs a multiplier.
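The recursion in (6) to (12) can be illustrated in software for a single candidate vector. The following is a minimal sketch, not the hardware implementation: it assumes the usual K-best convention in which the interference of already-detected symbols is subtracted from z_{i}, and an upper-triangular matrix of r_{ij} with a real diagonal. The function name `partial_distances` and the 0-based indexing are choices made for this example only.

```python
def partial_distances(z, R, x):
    """Return [PED_4, PED_3, PED_2, PED_1] for one candidate vector x.

    z: 4 complex received values (assumed already preprocessed).
    R: 4x4 upper-triangular matrix with a real diagonal (assumption).
    x: 4 complex candidate symbols.
    Indices are 0-based, so row 3 here corresponds to stage 4 in the text.
    """
    peds = []
    ped = 0.0
    for i in range(3, -1, -1):
        # f_i = z_i - sum_{j>i} r_ij * x_j, as in (7), (9), and (11);
        # for stage 4 the sum is empty, so f equals z_4 as in (6).
        f = z[i] - sum(R[i][j] * x[j] for j in range(i + 1, 4))
        r_ii = R[i][i].real
        d_i = (f.real - r_ii * x[i].real) ** 2  # DI_i
        d_q = (f.imag - r_ii * x[i].imag) ** 2  # DQ_i
        ped += d_i + d_q                        # PED_i = PED_{i+1} + DI_i + DQ_i
        peds.append(ped)
    return peds
```

Under these assumptions the last value, PED_{1}, equals the full squared distance between z and the product of the r_{ij} matrix with x, which gives a simple consistency check for the sketch.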
4.2 Hardware implementation
To achieve low complexity, in addition to using the proposed algorithm, the following implementation points are worth noting.
4.2.1 GAIN-MUX-based multiplier
From (6) to (12), it can be seen that the decoder requires a large number of multipliers to compute r_{ij}x_{j} (i = 4,3,2,1; j ≥ i). For example, 2\sqrt{K} multipliers are needed to compute r_{44}x_{4}^{I} and r_{44}x_{4}^{Q} in stage 4 (see (6)), and each multiplier consumes considerable hardware resources.
To compute r_{ij}x_{j} (i = 4,3,2,1; j ≥ i) without multipliers, we use a GAIN block and multiplexers (MUX), as shown in Figure 7. This figure illustrates the case of multiplying r_{ij} by the m best values of x_{j}. The left figure shows the conventional method, which uses m different multipliers. The right figure shows our proposed GAIN-MUX-based multiplier. The input r_{ij} first goes into the ‘GAIN’ block, which amplifies r_{ij} by the modulation gain D and then by the values of the constellation points 1,3,5,…,15. Notice that all the possible values of x_{j} are {D,3D,…,15D}. The outputs of the ‘GAIN’ block are then fed to m MUX blocks. Each MUX is controlled by a select signal derived from x_{j} (denoted by sel\_x_{j}^{(m)}). If the values of x_{j}^{(m)} are {D,3D,…,15D}, the values of sel\_x_{j}^{(m)} are {0,1,…,7}. Consequently, the outputs of the MUX blocks are equivalent to the outputs of the multipliers in the left figure, while the hardware cost of a MUX is much smaller than that of a multiplier.
The decoder needs many such products, such as r_{44}x_{4}^{I}, r_{33}x_{3}^{I}, and r_{22}x_{2}^{I}, while the possible values of x_{4}^{I}, x_{3}^{I}, and x_{2}^{I} are the same. Thus, one ‘GAIN’ block can be shared among them. The ‘MultiplierLess’ block implements this ‘GAIN’ block.
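The equivalence between the GAIN-MUX structure and m separate multipliers can be checked with a small behavioral model. This is only a sketch: the function names `gain_block`, `mux`, and `gain_mux_products` are hypothetical, and ordinary arithmetic stands in for the fixed gains of the hardware block.

```python
def gain_block(r, D, levels=(1, 3, 5, 7, 9, 11, 13, 15)):
    """Precompute r*D, 3*r*D, ..., 15*r*D once (the 'GAIN' outputs)."""
    return [r * D * level for level in levels]


def mux(gains, sel):
    """Pick one precomputed product; sel is an index in 0..7."""
    return gains[sel]


def gain_mux_products(r, D, selects):
    """Model m MUX blocks that all read the same shared GAIN outputs."""
    gains = gain_block(r, D)
    return [mux(gains, sel) for sel in selects]
```

For a select value s, the selected output equals r multiplied by (2s + 1)D, which is the product a dedicated multiplier would produce for that value of x_{j}.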
4.2.2 Resource sharing
This technique is implemented in STAGE 4, STAGE 3, STAGE 2, and STAGE 1 blocks.
The STAGE 4 block computes the K best values of PED_{4} and x_{4} in (6). Based on the direct expansion method, it finds the ceil(\sqrt{21}) = 5 best values of DI_{4} and of DQ_{4} and then adds these values together. Because the processes of finding DI_{4} and DQ_{4} are similar, they share the same circuit. Figure 8a shows the block diagram of STAGE 4, in which ‘BLOCK A’ is shared to find the best values of DI_{4}, x_{4}^{I} and of DQ_{4}, x_{4}^{Q} in two clock cycles. In other words, the sharing factor of this block is 2. The design of BLOCK A is shown in Figure 8b, in which the ‘SIGN ABS’ block determines the sign of z_{4}^{I} (and z_{4}^{Q}) and the absolute value \left|z_{4}^{I}\right| (and \left|z_{4}^{Q}\right|). The ‘CONSLOCAT’ block specifies the subdomain of the constellation to which \left|z_{4}^{I}\right| (and \left|z_{4}^{Q}\right|) belongs. Based on the output of the CONSLOCAT block, the ‘DI/DQ CAL’ block computes the best values of DI_{4} and DQ_{4}, while the ‘XDECODE’ block finds the best values of x_{4}^{I} and x_{4}^{Q}.
The STAGE 3 block computes the best values of PED_{3} and the corresponding {x_{4},x_{3}} in (8). The block diagram is shown in Figure 8c, in which ‘B1,’ ‘B2,’ and ‘B3’ respectively perform the direct expansion for the 1st → 7th (i.e., group 1), 8th → 14th (i.e., group 2), and 15th → 21st (i.e., group 3) parent nodes. Because all parent nodes in the same group are processed in the same way, they can share the same circuit. Consequently, the B1 block is designed to find the L_{1} best child nodes of one parent node only. It is then reused over seven clock cycles to complete the direct expansion for the seven parent nodes of group 1, so its sharing factor is 7. Similarly, the B2 and B3 blocks are each shared seven times. Each of the B1, B2, and B3 blocks has the following components: ‘CAL f_{3}’ computes f_{3} in (7), ‘BLOCK A*’ computes DI_{3} and DQ_{3}, and ‘SUM’ computes PED_{3} from PED_{4}, DI_{3}, and DQ_{3} (see (8)).
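The time sharing of B1, B2, and B3 can be written out as a schedule. The sketch below only enumerates which parent node each block handles in each clock cycle, using K = 21, G = 3, and L_{1} = 4, L_{2} = 3, L_{3} = 1 from Section 4.1; the function name `stage_schedule` and the 0-based parent indices are assumptions of this example, and the expansion itself is not modeled.

```python
def stage_schedule(K=21, G=3, children_per_group=(4, 3, 1)):
    """List, per clock cycle, (block name, parent index, number of children)."""
    A = K // G  # parents per group, i.e., clock cycles per stage
    schedule = []
    for cycle in range(A):
        row = []
        for g in range(G):
            parent = g * A + cycle  # 0-based position in the sorted parent list
            row.append((f"B{g + 1}", parent, children_per_group[g]))
        schedule.append(row)
    return schedule
```

With the default parameters, each of the seven cycles yields 4 + 3 + 1 = 8 child nodes, so 56 child nodes are produced in total and every parent node is expanded exactly once.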
After each clock cycle, B1, B2, and B3 output the best child nodes of one parent node in each group. In other words, all elements of one row of the sort matrix (see Section 3.3) are obtained per cycle. The ‘2DSORT’ block thus requires only a one-row-sorting circuit, which is shared to sort all seven rows in seven clock cycles. The sharing factor is 7. The hardware design of the 2DSORT block is shown in Figure 9. The ‘ROWSORT’ block sorts the eight outputs of B1, B2, and B3 per clock cycle and keeps only the four best. In the ‘COLSORT’ block, the ‘1to7’ block collects the best values from ROWSORT over seven cycles and sorts them. The ‘1to6’ block collects the 2nd best values from ROWSORT over seven cycles, sorts them, and obtains the six best, and so on. The designs of ROWSORT and of ‘1to3’ in COLSORT are shown in Figure 9b,c, respectively. It can be seen that the 2DSORT block needs only 36 comparators to sort 56 child nodes, a significant reduction compared with the (56^{2} − 56)/2 = 1,540 comparators required by a full sort.
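The row-then-column selection described above can be mimicked in software as follows. This is one possible reading of the text rather than the circuit of Figure 9: it works on PED values only, uses a software sort in place of the comparator network, and assumes keep counts of (7, 6, 5, 3) per column. Only the counts 7 and 6 and the block name ‘1to3’ come from the text; the count 5 is an assumption chosen so that the total equals K = 21. The function name `two_d_sort_sketch` is hypothetical.

```python
def two_d_sort_sketch(rows, keep=(7, 6, 5, 3)):
    """Select sum(keep) survivors from a 7 x 8 matrix of PED values.

    rows: 7 lists of 8 PEDs, one list per clock cycle.
    keep: survivors taken from the column of 1st best, 2nd best, ... values
          (assumed counts, see the lead-in).
    """
    # ROWSORT: sort each row and keep only its len(keep) best (smallest) PEDs.
    ranked_rows = [sorted(row)[:len(keep)] for row in rows]
    survivors = []
    # COLSORT: for each rank, collect that rank from all rows, sort the
    # collected values, and keep the assumed number of best values.
    for rank, n_keep in enumerate(keep):
        column = sorted(r[rank] for r in ranked_rows)
        survivors.extend(column[:n_keep])
    return survivors
```

Under these assumptions the selection returns 21 of the 56 child nodes and always contains the globally smallest PED, but it is not guaranteed to return exactly the 21 smallest values.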
The architectures of STAGE 2 and STAGE 1 are similar to that of STAGE 3, and the sharing factor of these blocks is also 7. However, the 2DSORT block is not implemented in STAGE 1. Instead, the results of B1, B2, and B3 are passed directly to the LLR block.