Hardware oriented, quasi-optimal detectors for iterative and non-iterative MIMO receivers

In this article we study hardware-oriented versions of the recently appeared Layered ORthogonal lattice Detector (LORD) and Turbo LORD (T-LORD). LORD and T-LORD are attractive Multiple-Input Multiple-Output (MIMO) detection algorithms, that aim to approach the optimal Maximum-Likelihood (ML) and Maximum-A-Posteriori (MAP) performance, respectively, yet allowing a complexity quadratic in the number of transmitting antennas rather than exponential. LORD and T-LORD are also well suited to a hardware (e.g., ASIC or FPGA) implementation because of their regularity, parallelism, deterministic latency, and complexity. Nevertheless, their complexity is still high in case of high cardinality constellations, such as the 64-QAM foreseen by the 802.11n standard. We show that, when only global latency constraints exist, e.g., a fixed time to detect the whole OFDM symbol, the LORD and T-LORD complexity can be remarkably reduced, still approaching the ML and MAP performance, respectively. Notwithstanding the suboptimal low-complexity and hardware-oriented implementation, LORD and T-LORD approach the EXtrinsic Information Transfer of the ML and MAP detectors, respectively. To focus on a specific setting, we consider the indoor MIMO wireless LAN 802.11n standard, taking into account errors in channel estimation and a frequency selective, spatially correlated channel model.


Introduction
Because of the increasing demand for data rate and link robustness in wireless transmissions, Multiple-Input Multiple-Output (MIMO) technologies are nowadays an indispensable option in the wireless communications standards recently released or under definition, such as IEEE 802.11n [1], WiMax [2], and mobile long term evolution (LTE) [3]. In fact, the capacity of the wireless link grows linearly with the number of transmitting or receiving antennas [4,5], when spatial diversity is available. In practice, MIMO is often combined with space-frequency bit interleaved coded modulation (BICM) and orthogonal frequency-division multiplexing (OFDM) [1,2], which ensure that almost uncorrelated channels are experienced by different tones within an OFDM symbol.
To increase the spectral efficiency of the link, the transmitting antennas can be used in layered mode, i.e., each antenna transmits a different symbol in the same bandwidth at the same time. On one side, a sophisticated receiver is needed to solve spatial inter-symbol interference and effectively exploit the theoretical advantages. On the other side, mobile devices must be low power consuming and moderately expensive.
The ideal receiver should consider the likelihood of the received vectors for each possible codeword, jointly performing detection and decoding. This has prohibitive complexity, except for simple space-time codes. In practice, detection and decoding are decoupled, and Soft-Input Soft-Output (SISO) detectors are used in conjunction with SISO decoders in iterative schemes [6], to approximate the ideal receiver through disjoint stages, according to the turbo principle [7]. Turbo detectors exploit the information fed back by the channel decoder as a priori information about the transmitted vectors of symbols. If needed, detector and decoder can be simply applied in cascade, as a special case of turbo detection and decoding without iterations. In the former case, the optimal detector to be used is the Maximum-A-Posteriori (MAP) detector, while in the latter case the MAP detector degenerates into the symbol-by-symbol Maximum-Likelihood (ML) detector, since there is no available a-priori information.
Despite these simplifications, the complexity of the optimal MAP and ML detectors still increases exponentially with $N_t N_b$, where $N_t$ is the number of transmitting antennas and $N_b$ the modulation order [bits/dimension]. This is why many researchers have sought suboptimal detection strategies, trying to approach the ideal detector with limited complexity, e.g., via (Turbo) MMSE detection [8,9] or sphere detection [10-14]. The former strategy combines "soft decision" subtraction (soft-DFE) and linear minimum mean square error (MMSE) spatial equalization [8,9,15]. This method, although computationally affordable, can lead to largely sub-optimal results, especially when very high spectral efficiencies are sought with large $N_b$ and $N_t$ and high rate codes. Conversely, sphere decoding (SD) detectors [16] reduce the MAP and ML complexity by restricting the search to a subset of the whole hyper-symbol constellation. The SD family can be roughly divided in two groups: the depth-first [10] and the breadth-first [11-14,17-19] SDs. The second family has several advantages, such as fixed latency and lower complexity. However, for a small number of candidates, breadth-first SD leads to a performance degradation. Recently, to improve its behavior in iterative receivers, SD has been combined with MMSE [20]: the linear detector assists the SD, finding a good center of the search sphere and thus improving the performance of the iterative receiver.
One of the most promising proposals is the Layered ORthogonal lattice Detector (LORD) [21,22], and its iterative version, namely Turbo LORD (T-LORD) [23]. A proposal similar to LORD was published a few years later [24]. LORD detects the ML hyper-symbol, or a hyper-symbol close to it, depending on the number of antennas involved. In a similar way, the T-LORD approaches the MAP performance, combining the low-complexity spatial DFE principles of LORD with a simple yet accurate method to handle a priori log-likelihood ratios (LLRs). LORD and T-LORD are particularly suited for hardware, parallel implementation and soft-output bit detection, and perform very well in combination with soft decoders like SOVA [25] or BCJR [26] for convolutional codes, or with an LDPC decoder [27]. Nevertheless, their complexity is still high in case of high cardinality constellations, such as the 64-QAM foreseen, e.g., by the 802.11n standard.
In this article, we propose simplified LORD and T-LORD versions, that keep their vocation for hardware implementation, maintaining deterministic complexity (quadratic in the number of transmitting antennas) and flexibility for setting the desired performance-complexity trade-off. Besides, we show that, when only global latency constraints exist, e.g., a fixed time to detect the whole OFDM symbol, the LORD and T-LORD complexity can be remarkably reduced, still approaching the ML and MAP performance, respectively.
In all cases the performance is very good. We show in particular that LORD and T-LORD can perform very close to the ideal MAP detector for at least up to four antennas and for modulation orders of 3 bits per dimension, even in a realistic setting, with imperfect channel state information (CSI) and correlated channels. Furthermore, we show that our detectors exhibit the same EXIT chart behavior as the MMSE-assisted SD [20], yet with a lower complexity, even more reduced w. r.t. [23]. Recently, a strategy that recalls in principle the idea proposed in this article for OFDM tones processing has been proposed in [28], with a different detector.
The article is organized as follows. Section 2 describes the system model. Section 3 recalls the full-complexity LORD and T-LORD algorithms. Section 4 motivates and describes our low-complexity hardware-oriented LORD and T-LORD proposals. Section 5 shows their performance. Section 6 concludes the article.

System model
Consider a MIMO communication system with $N_t$ antennas at the transmitter side and $N_r \ge N_t$ at the receiver side. To focus on a practical application, we adopt many of the parameters from the 802.11n standard [1]. At the transmitter, each Wireless LAN packet carrying 390 data bytes is encoded by a 64-state binary convolutional encoder, space-frequency interleaved and Gray-mapped onto an M-QAM constellation. An OFDM modulator, for each spatial stream, splits the overall frequency band (20 MHz) in $N_f = 64$ sub-bands (tones), out of which $N_{dc} = 52$ are data carriers. The block diagram in Figure 1 summarizes the main operations applied to a packet, when $N_t = 2$ and $N_r = 2$. The extension to more than two antennas is straightforward.
The OFDM format allows each tone to be considered separately. Therefore the dependency on the carrier index is omitted in the article, and all the equations refer to a single OFDM tone. The received signal $y \in \mathbb{C}^{N_r}$ reads

$$y = Hx + n \qquad (1)$$

where $x \in \mathbb{C}^{N_t}$ contains the transmitted symbols and $n \in \mathbb{C}^{N_r}$ is an i.i.d. Gaussian complex white noise vector with covariance matrix $R_n = E[nn^H] = N_0 I$. The transmitted M-QAM symbols are uncorrelated, with zero mean and variance $\sigma_X^2 = 1$, for each transmitting antenna. Therefore, the transmitter Signal-to-Noise-Ratio $SNR_{TX}$ equals $N_t/N_0$.
The MIMO (frequency selective) channel is represented by H ∈ C N r ×N t, whose elements h r,t are the complex (flat) gain of the path between transmitter t and receiver r, at a certain tone. These elements are normalized, i.e., E [|h r,t | 2 ] = 1, with t = 1, 2,..., N t and r = 1, 2,..., N r . More details on channel assumptions are given in Section 5.
At the receiver (see Figure 1), after OFDM demodulation the symbol vector y is passed to the detector, which computes the LLRs of the coded bits $l_t(n)$, with $t = 1, 2, \ldots, N_t$ and $n = 1, 2, \ldots, 2N_b$, where $N_b = \log_2(M)/2$ is the number of bits per dimension. The soft values are deinterleaved and passed to a decoder. In case of serially concatenated detection and decoding, the decoder, e.g., a Viterbi one, outputs hard bits. Conversely, in case of turbo detection and decoding, the SISO decoder, such as the BCJR [26] or the SOVA [25], also outputs the extrinsic information used as a priori LLR $\xi_t(n)$ at the detector (after interleaving)

$$\xi_t(n) = \ln \frac{P(b_n(x_t) = 1)}{P(b_n(x_t) = 0)} \qquad (2)$$

where $b_n(x_t) \in \{0,1\}$ is the n-th bit of symbol $x_t$. Taking advantage of the soft information fed back by the decoder, at each iteration the detector produces more reliable extrinsic information to be passed to the decoder itself:

$$l_t(n) = j_t(n) - \xi_t(n) \qquad (3)$$

Here $j_t(n)$ is the a posteriori LLR of $b_n(x_t)$, computed during the detection process. In the next section we focus on this detection stage.

Detectors outline
In this section, we give a concise description of LORD and T-LORD, indispensable to understand the implementation proposed in the following sections. For more details about LORD and T-LORD and their various versions, we refer the reader to [23]. We also recall the optimal MAP and ML detectors. Rather than practicable solutions, they are a reference for the performance of any other detector. The reader interested in other techniques, such as the MMSE and the Sphere, can refer again to [23], where a detailed comparison with the "standard" T-LORD is reported, both in terms of complexity and performance.

MAP and ML detectors
A MAP detector accepts the received vector y and the a priori information, coming from the decoder, and evaluates the probability of all possible transmitted vectors x. These two contributions can be easily identified in the following metric: With no a priori information, i.e., ξ t (i) = 0, (4) reduces to the ML metric.
Equation (4) is the basis for the computation of a posteriori LLRs The last term in (5) is the (typically very accurate) Max-Log-MAP approximation. As mentioned in the introduction, the number of complex Euclidean distances (ED) and a-priori probabilities to compute, either in case of MAP and ML, as well as of their Max-Log approximations, increases exponentially with N b N t .
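As a hedged illustration of the two contributions mentioned above, the block below writes one common form of a MAP metric and of the corresponding a posteriori LLR with its Max-Log approximation. It is a sketch, not necessarily the exact notation of (4) and (5): the symbol $\Lambda(x)$, the sign convention (smaller is better) and the LLR convention $\xi_t(n) = \ln[P(b_n(x_t)=1)/P(b_n(x_t)=0)]$ are assumptions, and terms that do not depend on x are dropped.

```latex
% Assumed convention: \xi_t(n) = \ln [ P(b_n(x_t)=1) / P(b_n(x_t)=0) ],
% so that P(b_n(x_t)) is proportional to \exp( b_n(x_t) \xi_t(n) ).
\begin{align}
\Lambda(x) &= \underbrace{\frac{\lVert y - Hx \rVert^2}{N_0}}_{\text{Euclidean distance}}
            \;-\; \underbrace{\sum_{t=1}^{N_t} \sum_{n=1}^{2N_b} b_n(x_t)\,\xi_t(n)}_{\text{a priori term}} \\
j_t(n) &= \ln \frac{\sum_{x:\, b_n(x_t)=1} e^{-\Lambda(x)}}{\sum_{x:\, b_n(x_t)=0} e^{-\Lambda(x)}}
        \;\approx\; \min_{x:\, b_n(x_t)=0} \Lambda(x) \;-\; \min_{x:\, b_n(x_t)=1} \Lambda(x)
\end{align}
```

With all $\xi_t(n) = 0$ the a priori term vanishes and only the Euclidean distance remains, which is the ML case. Both sums and both minimizations run over $M^{N_t} = 2^{2 N_b N_t}$ hyper-symbols, which is the exponential growth noted above.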

LORD detector
The LORD algorithm is composed of two stages. The former is a pre-processing, common to several received sequences y, as long as the channel can be supposed constant for several OFDM symbols. It consists of $N_t$ QR factorizations of the channel matrix H, with permuted column orders:

$$H\,\Pi(t) = Q(t)\,R(t), \qquad t = 1, \ldots, N_t \qquad (6)$$

where $\Pi(t) = [u_1 \ldots u_{t-1}\; u_{t+1} \ldots u_{N_t}\; u_t]$ is the permutation matrix which moves the t-th column of H to the last position. Thus, the received symbol and the system model can be rewritten as

$$\tilde{y}(t) = Q^H(t)\,y = R(t)\,\tilde{x}(t) + \tilde{n}, \qquad \tilde{x}(t) = \Pi^T(t)\,x \qquad (7)$$

without impairing the receiver performance as long as the AWGN is spatially i.i.d., since $\tilde{n}$ is still Gaussian with $R_{\tilde{n}} = Q^H(t) R_n Q(t) = N_0 I$. The evaluation burden of this phase is negligible if the channel can be supposed constant for several consecutive OFDM symbols, thus we focus on the second stage of the algorithm.
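For each permutation $\Pi(t)$, the LORD algorithm explores every possible transmitted QAM symbol $x_t$, moved to the lowest position of $\tilde{x}(t)$, called "root layer" from now on. For each hypothesis $x_t = \tilde{x}_{N_t} = \bar{x}$, the algorithm subtracts its interference over the upper layers, chooses the closest symbol over the new interference-free layer (through a simple slicing) and iterates this process, up to $\tilde{y}_1(t)$. In formulas:

$$\hat{x}_{N_t}(t,\bar{x}) = \bar{x}, \qquad \hat{x}_k(t,\bar{x}) = \mathrm{slice}\left[\frac{\tilde{y}_k(t) - \sum_{j=k+1}^{N_t} r_{k,j}(t)\,\hat{x}_j(t,\bar{x})}{r_{k,k}(t)}\right], \quad k = N_t - 1, \ldots, 1 \qquad (8)$$

Finally, the algorithm builds $N_t$ different sets, back to the "original" order:

$$S_t = \left\{ \Pi(t)\,\hat{x}(t,\bar{x}) \right\}, \quad \bar{x} \text{ spanning the whole QAM constellation} \qquad (9)$$

and performs the Log-Likelihood search only over the elements in (9):

$$l_t(n) = \frac{1}{N_0}\left[\min_{x \in S_t:\, b_n(x_t) = 0} \lVert y - Hx \rVert^2 - \min_{x \in S_t:\, b_n(x_t) = 1} \lVert y - Hx \rVert^2\right] \qquad (10)$$

If $N_t = 2$, it can be shown that the set $S_t$ contains, for each bit of $x_t$, the closest hyper-symbol x having $b_n(x_t)$ equal to one and the closest having it equal to zero. Thus, the algorithm has the same performance as the Max-Log ML detector. On the contrary, if $N_t > 2$, this is not assured, due to possible error propagation in the decision feedback equalization (DFE) (8). This sub-optimal behavior can be mitigated by crossing many sets $S_j$, i.e., letting a hyper-symbol $x \in S_t$ be replaced by the candidate $x' \in S_j$, if its ED is smaller and $x'_t = x_t = \bar{x}$ (11).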
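The two-antenna case described above can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes a 16-QAM constellation with a reflected Gray labelling per dimension, a 2x2 channel, and the LLR sign convention ln P(b=1)/P(b=0). The function names (`lord_llrs`, `ml_llrs`, `slice_qam`) are hypothetical. It runs one Gram-Schmidt QR per permutation, spans the root layer, slices the upper layer, and compares the result with an exhaustive Max-Log ML search.

```python
import random

PAM = [-3, -1, 1, 3]                       # 4-PAM levels, i.e. 16-QAM per dimension
GRAY = [(0, 0), (0, 1), (1, 1), (1, 0)]    # assumed Gray labels of the PAM levels
QAM = {complex(a, b): GRAY[i] + GRAY[j]
       for i, a in enumerate(PAM) for j, b in enumerate(PAM)}

def slice_pam(v):
    """Nearest 4-PAM level, clamped to the constellation boundaries."""
    return PAM[min(max(round((v + 3) / 2), 0), 3)]

def slice_qam(z):
    return complex(slice_pam(z.real), slice_pam(z.imag))

def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(u.conjugate() * v for u, v in zip(a, b))

def lord_llrs(H, y, n0):
    """Max-Log LLRs for N_t = 2 (H is a list of two rows)."""
    llrs = []
    for t in (0, 1):
        a1 = [row[1 - t] for row in H]     # upper layer: the other antenna
        a2 = [row[t] for row in H]         # root layer: antenna t moved last
        r11 = abs(inner(a1, a1)) ** 0.5    # Gram-Schmidt QR of [a1 a2]
        q1 = [v / r11 for v in a1]
        r12 = inner(q1, a2)
        w = [v - r12 * q for v, q in zip(a2, q1)]
        r22 = abs(inner(w, w)) ** 0.5
        q2 = [v / r22 for v in w]
        y1, y2 = inner(q1, y), inner(q2, y)
        best = {}
        for x_root, bits in QAM.items():   # span every root-layer hypothesis
            x_up = slice_qam((y1 - r12 * x_root) / r11)
            ed = (abs(y2 - r22 * x_root) ** 2
                  + abs(y1 - r11 * x_up - r12 * x_root) ** 2)
            for n, b in enumerate(bits):
                best[n, b] = min(best.get((n, b), float("inf")), ed)
        llrs.append([(best[n, 0] - best[n, 1]) / n0 for n in range(4)])
    return llrs

def ml_llrs(H, y, n0):
    """Exhaustive Max-Log ML reference over all M^2 hyper-symbols."""
    best = {}
    for x0, b0 in QAM.items():
        for x1, b1 in QAM.items():
            ed = sum(abs(v - row[0] * x0 - row[1] * x1) ** 2
                     for v, row in zip(y, H))
            for t, bits in enumerate((b0, b1)):
                for n, b in enumerate(bits):
                    key = (t, n, b)
                    best[key] = min(best.get(key, float("inf")), ed)
    return [[(best[t, n, 0] - best[t, n, 1]) / n0 for n in range(4)]
            for t in (0, 1)]

rng = random.Random(1)
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2)] for _ in range(2)]
x = [rng.choice(list(QAM)) for _ in range(2)]
y = [sum(h * s for h, s in zip(row, x)) + complex(rng.gauss(0, 0.3), rng.gauss(0, 0.3))
     for row in H]
print(lord_llrs(H, y, 0.18))
print(ml_llrs(H, y, 0.18))
```

The sketch spans 2M = 32 hyper-symbols instead of M^2 = 256, and the two printed LLR lists agree up to rounding, in line with the $N_t = 2$ statement above. For more than two antennas the slicing step would be repeated layer by layer and the equality would no longer be guaranteed.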

T-LORD detector
Turbo-LORD is a generalization of LORD, able to manage a-priori information. Basically, (9) is replaced by (12), where the metric is the same as (4) and all the elements in the same subset $U_t(\bar{x})$ share the same root symbol $\bar{x}$ (13). Thus, the a-posteriori LLR (5) is eventually approximated as in (14). The cardinality of $U_t(\bar{x})$ is selected according to the desired complexity-performance trade-off. E.g., one could consider just two hyper-symbols, the one with the highest a-priori probability, $x_A(t,\bar{x})$ (15), and the one with the smallest ED, $x_D(t,\bar{x})$ (16). Using the terminology introduced in [15] in the scalar case, (15) and (16) are said to obey the a priori and the distance criteria: T-LORD assumes that the symbol with the smallest (4) is either of the two. When $N_t > 2$, the distance to be minimized is a function of two or more variables, hence the solution cannot be found through simple slicing. As in the previous section, one can rely on a DFE process, but there is no guarantee that the chosen symbol is actually the closest, because wrong decisions in intermediate layers can happen.
We denote this sequentially chosen hyper-symbol as x D,...,D (t,x), to underline the layer by layer application of the distance criterion, over the upper layers. In the same way, even the a-priori criterion can be applied layer by layer, processing blocks of 2N b LLRs per layer, letting their signs drive the choice of QAM candidates and subtracting their interference from the upper layers, this time without introducing errors.
In [23], other criteria have been proposed to choose the hyper-symbols in U t (x), e.g., to take into account not only the most probable a-priori symbol, but also the second one x F (t,x), that can be easily computed by flipping the weakest a-priori LLR. Furthermore, there is no need for applying the same criterion at each layer: they can be mixed, retaining only the K candidates with the best Partial A-Posteriori Probability (PAPP) where k ÷ N t stands for k,k + 1,...,N t and is the LLR permutation, coherent with matrix Π(t). The K-best algorithm builds U t (x) as follows. At the first step, the partial set U N t t (x) contains onlyx. At the k-th step, according to each possible criterion (a priori, distance and possibly flipping), the K-best algorithm expands each candidate in U k+1 t (x) (note that index k decreases while the algorithm proceeds). Only the K best results, out of this expanded set, are retained in the partial set U k t (x), while the other ones are discarded. At the last step, when k = 1, we declare U t (x) = U 1 t (x) and the T-LORD search goes on as explained in Section 3.2.
We remark that (18) can be recursively updated, layer by layer, adding the new partial ED and a priori terms to the previous partial metric, saving computing time and power. A very interesting choice, as shown in the next section, is K = 1. This corresponds to retaining at each layer only the best new candidate. This algorithm can be interpreted as a decision feedback equalizer (DFE), driven by the aforementioned criteria.
Finally, though the above enhancements proved to be effective, it can anyhow happen that the above generalized DFE process fails, missing the correct detection at some intermediate layer. However, when S j for another transmitted symbol is computed, the distance and the a priori criteria may select some symbols x with x t =x in the upper layers with a better metric (4). So, in analogy with (11), one can augment S t as follows: One can first compute (12) for all t, then cross data to obtain the improved S t . This enhancement implies no growth of the number of checked hyper-symbols, but only extra latency and complexity due to additional metric comparisons and selections.
Hardware-oriented, low-complexity LORD
As we are going to show, the number of QAM symbols that LORD must check is largely affected by M, the cardinality of the constellation, which can even reach hundreds of points. Therefore, we aim to reduce the number of candidates in $S_t$, exploring only a subset of the QAM constellation at the root layer. Thus, the hyper-symbol span described in (8) is performed only with the root belonging to this reduced set of QAM symbols. Trying to preserve the regularity and parallelism typical of the full-complexity LORD, we restrict our attention to square subsets of the QAM constellation, centered on the received (equalized) signal $\tilde{y}_{N_t}(t)/r_{N_t,N_t}(t)$, at the root layer (red cross in Figure 2a).
The performance of the detector depends on the probability that the transmitted symbol does not belong to the square QAM subset, since when this happens, the LORD algorithm fails with high probability. To describe how this border violation behaves, we keep the QAM constellation size fixed and properly re-scale the noise power. After the pre-processing operations in (7), the channel gain $r_{N_t,N_t}(t)$ multiplies the signal $\tilde{x}_{N_t}(t)$, and the noise at the root layer reads $\tilde{n}_{N_t}/r_{N_t,N_t}(t)$, which is a Gaussian variable with zero mean and variance $N_0/r_{N_t,N_t}^2(t)$. This provides a better insight into our problem. Indeed, let us assume the channel is composed of i.i.d. Gaussian variables. Due to the Gram-Schmidt orthogonalization in the QR process, $r_{N_t,N_t}(t)$ is a Rayleigh random variable with unit power and probability density function given by $p(r) = 2r\,e^{-r^2}$, $r \ge 0$. We can easily compute the pdf of the actual noise standard deviation, normalized by $\sqrt{N_0}$, being the outcome of the change of variable $\sigma = 1/r$: $p(\sigma) = (2/\sigma^3)\,e^{-1/\sigma^2}$, $\sigma > 0$. This pdf is plotted in Figure 2b (solid blue line), along with other simulated pdfs (dashed red lines), referring to a MIMO system with $N_t = N_r = 2$ and spatially correlated channel gains. It can be observed that the correlated case is even worse, because the matrix H can easily be ill-conditioned, i.e., with the last diagonal elements of R(t) close to zero.
Even more importantly, the pdfs always exhibit long tails at high standard deviations. This suggests the following interpretation of the SNR behaviour at the root layer: quite often, the noise level is moderate; occasionally it is very high. The square side reduction is expected to be applicable only in the former case, since any approximation in the latter case would likely be harmful. This idea will be made practical in the following section.
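The long tail can be reproduced with a few lines of simulation. The sketch below is illustrative only and assumes the i.i.d. Gaussian channel case discussed above, in which the root-layer gain is a unit-power Rayleigh variable, so that its square is exponentially distributed with unit mean. The function names are hypothetical; the analytical tail follows from the Rayleigh cumulative distribution, P(sigma > s) = P(r < 1/s) = 1 - exp(-1/s^2).

```python
import math
import random

def tail_probability(s):
    """P(sigma > s) for sigma = 1/r, with r unit-power Rayleigh."""
    return 1.0 - math.exp(-1.0 / s ** 2)

def simulate_sigma(n_samples, seed=0):
    """Draw normalized root-layer noise standard deviations sigma = 1/r."""
    rng = random.Random(seed)
    # r^2 is exponential with unit mean; the floor avoids a division by zero
    return [1.0 / math.sqrt(max(rng.expovariate(1.0), 1e-12))
            for _ in range(n_samples)]

samples = simulate_sigma(100_000)
for s in (1.0, 2.0, 4.0):
    simulated = sum(v > s for v in samples) / len(samples)
    print(f"P(sigma > {s}): simulated {simulated:.3f}, "
          f"analytical {tail_probability(s):.3f}")
```

Under these assumptions the tail decays only as 1/s^2 for large s, which is why a fixed-size square subset cannot be safe for every carrier and the worst carriers deserve a dedicated treatment.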

Algorithm description
Aiming at square subsets (see Figure 2a) that include the transmitted symbol with a high probability, square side lengths larger than $\alpha N_0/2$ (with $\alpha > 1$) should be considered. Probably, one could approach the performance of the full-complexity LORD with ad-hoc tuning of the parameter $\alpha$, which is fundamental to find a good trade-off between complexity reduction and capability to detect the ML point. Nevertheless, this solution would be sensitive to the parameter $\alpha$, and thus not well suited to a device implementation.
On the contrary, we propose an attractive alternative solution, which does not require any parameter to be tuned, once the architecture has been chosen. Indeed, we suggest performing the full-complexity search over the whole QAM constellation for the carriers affected by the worst-case noise (in the following, "worst carriers" for brevity), limiting the search for all the other ones to a square subset with a fixed number of points. This way, we always reserve the more robust (full-complexity) algorithm for the carriers affected by higher noise powers at the root layer. This significantly reduces the probability that a transmitted point falls outside the reduced square subset of Figure 2a. Clearly, the higher the hardware capability of the device, the higher the fraction of carriers that can be fully spanned. The fundamental hypothesis, at the basis of this solution, is the existence of detection time constraints (measured in number of clock cycles) only at the level of the entire OFDM symbol, and not carrier-by-carrier. This hypothesis seems quite reasonable, since devices for typical applications have to conclude the detection of an entire OFDM symbol within a fixed time, say $T_{max}$. Let us focus on the QAM symbols transmitted by the same antenna within an OFDM symbol. Define the parallelism $P_{lc}$ as the minimum number of DFE processors (8) able to exhaustively analyze only the worst $N_c^{full}$ carriers, limiting the search for the rest of the tones within a square subset of cardinality $S^2$, as in Figure 2.
Assume that each DFE processing (8) takes $T_{elem}$ clock cycles. Thus, the low-complexity LORD requires a number of elementary processing units per antenna $P_{lc}$ such that the overall detection time, accounting for the $N_c^{full}$ fully-spanned carriers and for the remaining $N_{dc} - N_c^{full}$ reduced ones, does not exceed $T_{max}$ (24). As a special case, when $N_c^{full} = N_{dc}$ we obtain the full-complexity LORD parallelism $P_{full}$, which satisfies the corresponding constraint (25). For example, assume that $T_{max} = 6 N_{dc} T_{elem}$, i.e., on average 6 clock cycles are allowed to conclude the detection process for each antenna permutation, at each data carrier. With $N_t = 2$, $N_{dc} = 52$ and a 64-QAM constellation, Figure 3 plots the maximum number of fully-spanned tones (solid lines) as a function of the parallelism $P_{lc}$ and of the square subset side S. Clearly, the higher the number of processing units, the higher the number of exhaustive searches we can exploit. On the contrary, the larger the square side length S, the lower the number of full-complexity detections.
To efficiently compute not only (8) but also (10), we impose further regularity to the hardware supposing that at each clock cycle the device can work only over a row/column of the square.
When $P_{lc} \ge S$, as in Figure 4a, the square processing is always performed in the same direction, e.g., by columns. On the contrary, when $P_{lc} < S$, the square is computed through a tessellation, as in Figure 4b. It can be shown that this kind of processing requires

$$\left\lfloor \frac{S}{P_{lc}} \right\rfloor (2S - P_{lc}) - \left\lfloor \frac{S}{P_{lc}} \right\rfloor^2 P_{lc} + S \qquad (26)$$

clock cycles and, with a reasoning analogous to (25), leads to a smaller number of available fully-spanned carriers, as reported in Figure 3 (dashed lines). Curves in case of tessellation have been plotted only up to S = 6, which is the limit case, since with 6 clock cycles (on average) the detection time is exhausted for performing the low-complexity algorithm over each carrier (i.e., $N_c^{full} = 0$). If S > 6, it is not possible to meet the overall OFDM time constraints of our example. Nevertheless, it will be shown in the next section that S = 5 is enough for all cases of practical interest we have tested. This means that the ML performance can be approached with $S \cdot P_{lc}/M = 25/64 \approx 40\%$ of the original LORD complexity.
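The clock-cycle count of the row/column processing can be sketched as follows. The function `square_cycles` implements a tessellation count of the kind described above (strips of width P first, then the leftover strip, then the leftover corner), in units of $T_{elem}$. The function `max_full_carriers` is only an assumed budget model, not the article's exact constraint: it supposes that the full constellation is spanned with the same row/column scheme and that the total budget is an average number of cycles per carrier.

```python
def square_cycles(side, procs):
    """Cycles to span a side x side square with procs units, one row or
    column segment per cycle (tessellation when procs < side)."""
    q = side // procs                       # number of full-width strips
    return q * (2 * side - procs) - q * q * procs + side

def max_full_carriers(procs, side, qam_side=8, n_dc=52, avg_cycles=6):
    """Largest number of fully spanned carriers that fits the OFDM budget.

    Hypothetical model: budget = avg_cycles * n_dc cycles in total.
    Returns None when even the reduced search on every carrier does not fit.
    """
    budget = avg_cycles * n_dc
    reduced = square_cycles(side, procs)
    full = square_cycles(qam_side, procs)
    if reduced * n_dc > budget:
        return None
    if full <= reduced:
        return n_dc
    return min(n_dc, (budget - reduced * n_dc) // (full - reduced))

for procs, side in [(4, 4), (5, 5), (6, 6), (4, 5)]:
    print(procs, side, square_cycles(side, procs), max_full_carriers(procs, side))
```

With 64-QAM, 52 data carriers and 6 cycles per carrier on average, this model gives 8 fully spanned carriers for S = P = 4 and none for S = P = 6, which is consistent with the limit case discussed above. The other values depend on the assumed model and should not be read as the curves of Figure 3.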

From the description of LORD and T-LORD outlined in Section 3, it is clear that the LORD algorithm is actually a special case of the T-LORD, when all the a-priori LLRs $\xi_t(n)$ are zero. Indeed, in this case there is no point in applying the a-priori criterion, since any symbol has the same a-priori probability, equal to $1/M$. Only the distance criterion makes sense, and its DFE process is actually the same as (8). Finally, having just one meaningful candidate per set $U_t(\bar{x})$, also the K-best approach becomes superfluous. To summarize, the distance criterion in the T-LORD works as the LORD algorithm. For this reason we can generalize the LORD hardware-oriented simplification, presented in the previous section, to the T-LORD. Basically, the full T-LORD algorithm is performed only for the most attenuated carriers. For the rest, the DFE process is run just for a subset of root layer symbols. In this case, the candidate sets $U_t(\bar{x})$ are not determined for every hypothesis $\bar{x}$, but only for those belonging to a properly chosen square subset of the QAM constellation. The only difference with LORD is in the way we can choose the square subset of cardinality $S^2$. Indeed, in principle one could find it at the first iteration, as shown in Section 4, and leave it unchanged along iterations. Though this approach is quite attractive, it is potentially harmful, since it could inhibit the influence of the a-priori LLRs on the detector outputs, if the search gets stuck in a bad subset of root layer symbols, not containing the transmitted one. An effective solution is to let the a-priori information drive the choice of the square subset, in conjunction with the observed signal $\tilde{y}_{N_t}(t)$. We compute the L-MMSE estimate $\hat{x}(t)$ of the transmitted symbol on the root layer, performing a weighted maximal ratio combining (MRC) of the equalized received signal $x_D(t)$ and of the a-priori expected symbol $x_A(t)$, whose variances are $\sigma_D^2(t)$ and $\sigma_A^2(t)$, respectively. In case of null a-priori information, $\sigma_A^2(t) = \sigma_X^2$ and the square subset choice is practically the same as the LORD one, since $\sigma_X^2 = 1$ is typically much greater than $\sigma_D^2(t)$. Conversely, when the a-priori information is high, the received symbol is ignored in the computation of $\hat{x}(t)$, since $\sigma_A^2(t)$ is small. Finally we remark that (29) and (30) can be efficiently computed with techniques similar to [15].
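The weighted combining described above can be sketched in a few lines. This is an illustration under stated assumptions, not the article's exact formulas: it uses a unit-energy 16-QAM with an assumed Gray labelling, takes LLRs as ln P(b=1)/P(b=0), computes the a priori mean and variance by brute force over the constellation (the article points to more efficient techniques), and weights each estimate by the inverse of its variance. The function names are hypothetical.

```python
import math

PAM = [-3, -1, 1, 3]
GRAY = [(0, 0), (0, 1), (1, 1), (1, 0)]     # assumed Gray labels per dimension
SCALE = 1 / math.sqrt(10)                   # unit average energy for 16-QAM
QAM = {complex(a, b) * SCALE: GRAY[i] + GRAY[j]
       for i, a in enumerate(PAM) for j, b in enumerate(PAM)}

def apriori_moments(llrs):
    """Mean and variance of the root symbol implied by the a priori LLRs."""
    p1 = [1 / (1 + math.exp(-l)) for l in llrs]       # P(bit = 1)
    mean, power = 0j, 0.0
    for sym, bits in QAM.items():
        p = math.prod(p1[n] if b else 1 - p1[n] for n, b in enumerate(bits))
        mean += p * sym
        power += p * abs(sym) ** 2
    return mean, max(power - abs(mean) ** 2, 0.0)

def square_center(y_root, r_root, n0, llrs):
    """Inverse-variance combination of the equalized and a priori symbols."""
    x_d = y_root / r_root                   # equalized received symbol
    var_d = n0 / r_root ** 2                # its noise variance
    x_a, var_a = apriori_moments(llrs)
    return (var_a * x_d + var_d * x_a) / (var_a + var_d)

print(square_center(0.5 + 0.5j, 1.0, 0.1, [0.0] * 4))           # no a priori info
print(square_center(0.5 + 0.5j, 1.0, 0.1, [30, -30, 30, 30]))   # strong a priori info
```

With zero LLRs the center is the equalized symbol slightly shrunk towards the origin, while with strong LLRs it collapses onto the a priori symbol, matching the two limit cases discussed above.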

Related issues
As the constellation search is restricted at the root layer, there is no guarantee that at least one candidate symbol in S t exists for each value of any bit of x t . In this case, if the crossing processes (11) or (20) do not recover one in S t , one of the two terms in (10) and (14) is missing for that particular bit. Clipping approximations, like assigning a fixed (finite or infinite) value to its LLR, based on the hard decided ML or MAP symbol, are not completely satisfactory. Nevertheless, for a Gray 64-QAM constellation this approximation is required only when S ≤ 4. In fact, as clear in Figure 5, if we consider five or more adjacent symbols of an 8-PAM Gray constellation, we are assured to span at least one symbol for each possible bit value.
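The claim about five adjacent symbols can be verified exhaustively. The short sketch below assumes the reflected Gray labelling 000, 001, 011, 010, 110, 111, 101, 100 for the 8-PAM levels, and searches for the smallest window width such that every run of adjacent symbols contains both values of every bit. The function names are hypothetical.

```python
GRAY_8PAM = ["000", "001", "011", "010", "110", "111", "101", "100"]

def covers_both_values(window):
    """True if every bit position takes both values inside the window."""
    n_bits = len(window[0])
    return all({label[n] for label in window} == {"0", "1"}
               for n in range(n_bits))

def smallest_safe_window(labels):
    """Smallest width for which all runs of adjacent symbols are safe."""
    for width in range(1, len(labels) + 1):
        windows = [labels[i:i + width] for i in range(len(labels) - width + 1)]
        if all(covers_both_values(w) for w in windows):
            return width
    return None

print(smallest_safe_window(GRAY_8PAM))   # prints 5
```

A width of four fails because, for instance, the four leftmost symbols all share the same most significant bit, so that bit would have no counter-hypothesis in the reduced square.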
Another problem is how to efficiently find the square subset of Figure 2a. An efficient solution is to apply, for each dimension, the Euchner-Schnorr "zig-zag" algorithm [29], which determines the symbol closest to the received one or to the estimated $\hat{x}(t)$, and alternately adds points on its left and right, until the boundaries of the constellation or the square subset size are reached.
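A per-dimension zig-zag enumeration can be sketched as follows. This is a simplified illustration, assuming an unnormalized 8-PAM per dimension (64-QAM) and hypothetical function names: it starts from the level closest to the center, moves first towards the side on which the center lies, then alternates sides, skipping positions that fall outside the constellation.

```python
PAM8 = [-7, -5, -3, -1, 1, 3, 5, 7]

def zigzag(center, levels, size):
    """Up to `size` levels around `center`, in zig-zag order."""
    k = min(range(len(levels)), key=lambda i: abs(levels[i] - center))
    picked, lo, hi = [k], k, k
    step = 1 if center >= levels[k] else -1   # first move towards the center side
    while len(picked) < min(size, len(levels)):
        nxt = hi + 1 if step > 0 else lo - 1
        if 0 <= nxt < len(levels):            # skip moves beyond the boundaries
            picked.append(nxt)
            lo, hi = min(lo, nxt), max(hi, nxt)
        step = -step
    return [levels[i] for i in picked]

def square_subset(center, levels, size):
    """Square subset of QAM points: product of the two per-dimension picks."""
    return [complex(a, b)
            for a in zigzag(center.real, levels, size)
            for b in zigzag(center.imag, levels, size)]

print(zigzag(0.4, PAM8, 4))    # [1, -1, 3, -3]
print(zigzag(6.8, PAM8, 3))    # [7, 5, 3]
print(len(square_subset(0.4 - 6.8j, PAM8, 4)))
```

Near the constellation border the window simply grows on the inner side, so the subset always contains the requested number of points per dimension.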

LORD and T-LORD complexity
In this section we discuss the complexity of the proposed hardware-oriented LORD and T-LORD. A simple measure to rate the complexity of any detector is the number of spanned modulated symbols, i.e., the number of EDs to compute. Indeed, this is approximately proportional to the number of multiplications (usually more expensive than additions in hardware). E.g., one could compare the ML receiver with LORD and MMSE. With the above definition, the ML complexity is larger than $M^{N_t}$. Conversely, LORD evaluates the number of EDs given in (32), while the MMSE essentially requires 2 points per coded bit (we can exploit the Gray mapping regularity as in [30], "folding" the constellation), i.e., $2 N_t \log_2 M$ on the whole. Though the computation of the number of EDs is only a preliminary tool to evaluate complexity, it reveals that the ML cost is exponential in the number of antennas, practically unaffordable even for small arrays, while the LORD complexity is only quadratic. Analogous considerations can be made in case of iterative detectors.
Here, we focus on the complexity reduction of the simplified LORD and K-best T-LORD, w.r.t. the "original" ones. For an exhaustive complexity analysis of all the T-LORD versions as well as of other detector families, we refer the reader to [23]. As reported therein, the number of EDs computed by the K-best T-LORD (with K = 1 and all the enhancements set on) is given by (33). From (32) and (33), it is clear that when the constellation cardinality M is large, it represents the largest contribution to the LORD and T-LORD complexity. The simplification proposed in the previous subsections reduces that factor, and with it the complexity averaged w.r.t. frequency tones (34). To strengthen the above analysis, we study the number of multiplications, additions and comparisons, also distinguishing those performed just once (such as the QR decomposition) from those to be repeated for every detection process, i.e., referring to a single tone, OFDM symbol and iteration. Results for these fixed and variable operations are reported in Table 1, for the most complex case that we have investigated, i.e., M = 64 and $N_t = 4$. The table also includes the complexity of the soft-output generation stage. For completeness, the SIC-MMSE [15], the Full-Complexity T-LORD [23] and an SD have been reported, too. Among different SD families, a breadth-first list detector has been considered, since it guarantees deterministic complexity and latency, as T-LORD does. The list size is K = 36, chosen to achieve a performance close to the T-LORD one.
Focusing on the low-complexity K-best T-LORD, we have chosen a reduced square subset of side S = 4, the smallest available parallelism $P_{lc} = 4$, and a full search over $N_c^{full} = 8$ carriers. The square subset center is driven both by the received symbols and the a-priori LLRs. This solution is very attractive, since the loss w.r.t. the full-complexity T-LORD vanishes after some iterations, as shown in the next section. Furthermore, since the square subset is 4 × 4, the same hardware could be used also to detect a 16-QAM constellation, with negligible incremental costs. As we can see in Table 1, the low-complexity K-best T-LORD saves fewer multiplications and additions than expected from the number of computed EDs. Indeed, there are additional operations to perform, e.g., (29)-(31) and the identification of the $N_c^{full}$ worst carriers. Also the crossing between candidate sets $S_t$ limits the complexity reduction, since it must be performed within the entire constellation, and not only among points of the square subset. Anyway, the simplified T-LORD greatly benefits from the reduction of the spanned symbols, and almost halves the number of required variable multiplications. Also the number of variable additions is slightly smaller (the a-priori criterion remains almost unchanged).
To conclude, we remark that not only the device area (i.e., the number of required logic gates) benefits from the simplification proposed in this paper, but also the power consumption, as reported in [31]. E.g., in case of LORD with $N_t = N_r = 2$ and M = 64, assuming a 65 nm CMOS technology with an 80 MHz clock frequency, the area is reduced from 0.64 mm² to 0.21 mm² and the power from 38 to 14 mW. A comparison with prior designs can be found therein, too.

Simulation results
In this section, we provide performance results, both in terms of extrinsic information delivered by several detectors and Monte Carlo simulation of the receivers embedding them.
We assume two different environments, referred to in the following as the "ideal" and the "real" one. Firstly, we use a rich scattering channel, whose coefficients $h_{r,t}$ are i.i.d. complex Gaussian values with unit power. Perfect CSI at the receiver side is assumed, too. Then we consider a more realistic channel, with exponentially decaying power delay profile (PDP) and a short time delay spread $\tau_{rms} = 50$ ns (equal to the sampling time). Spatial correlation is assumed equal to $r(t_1, t_2) = 0.5^{|t_1 - t_2|}$, where $t_1$ and $t_2$ are the antenna indexes, sorted in ascending order from a border of the linear array. The perfect CSI hypothesis is abandoned, and substituted by pilot-aided tone-by-tone channel estimation (CE): due to the average of subsequent orthogonal long training sequences (LTS), as in [32], each channel tap estimate $\hat{h}_{r,t}$ is affected by i.i.d. Gaussian noise. Neither time nor frequency smoothing, such as [32,33], is adopted, since this would have reduced the difference between the ideal and real settings.

EXIT charts
In an iterative receiver, a detector is rated on its capability to transfer extrinsic information to the decoder. EXIT charts are an effective tool to predict the convergence behavior of iterative systems [34] and to design component codes [27,35], even in case of (possibly MIMO) selective channels. The EXIT analysis assumes independent a-priori LLRs $\xi_t(n)$, drawn at random from some probability density function (pdf), often modeled as the output of an AWGN channel with variance equal to twice the mean $\mu$. The output pdf $p(l)$ of the extrinsic LLRs is generally sampled experimentally, and the mutual information for a consistent pdf is then computed as in [36]. In case of serial concatenation between the detector and the decoder, the quality of the detector output can be evaluated by looking at the leftmost value in the graph, corresponding to the absence of a-priori information (see, e.g., Figure 6). On the contrary, in turbo receivers one can track the system convergence by overlapping the charts of the two iterative modules (with exchanged axes), since the output of the former becomes the input of the latter, and so on.
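To make the measurement concrete, the sketch below draws a-priori LLRs from the Gaussian model with variance twice the mean and estimates the mutual information from samples. It relies on the commonly used sample-average estimator for consistent LLR distributions, $I \approx 1 - \frac{1}{N}\sum_n \log_2\left(1 + e^{-x_n l_n}\right)$ with $x_n \in \{-1, +1\}$ the transmitted bits; this estimator and the function names are assumptions introduced for illustration, and the exact expression used in [36] is not reproduced here.

```python
import math
import random

def consistent_gaussian_llrs(mu, bits, rng):
    """A-priori LLRs modeled as AWGN outputs: mean mu * bit, variance 2 * mu."""
    sigma = math.sqrt(2.0 * mu)
    return [rng.gauss(mu * b, sigma) for b in bits]

def mutual_information(llrs, bits):
    """Sample-average estimate of I(bit; LLR), assuming a consistent LLR pdf."""
    total = 0.0
    for l, b in zip(llrs, bits):
        z = b * l
        # Numerically stable evaluation of log2(1 + exp(-z)).
        total += (max(-z, 0.0) + math.log1p(math.exp(-abs(z)))) / math.log(2.0)
    return 1.0 - total / len(llrs)

rng = random.Random(0)
bits = [rng.choice((-1, 1)) for _ in range(20000)]
for mu in (0.0, 1.0, 4.0, 16.0):
    llrs = consistent_gaussian_llrs(mu, bits, rng)
    print(mu, round(mutual_information(llrs, bits), 3))
```

Sweeping `mu` from zero upwards moves the input information $I_A$ from 0 towards 1, which spans the horizontal axis of an EXIT chart; the same estimator applied to the detector output LLRs gives the vertical axis.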
In Figure 6, we plot (both in ideal and realistic channel conditions) the EXIT curve of the hardware-oriented T-LORD developed so far and, as references, those of the SIC-MMSE [9] and MAP detectors.
We choose SNR = 23 dB, corresponding to a target packet error rate (PER) close to $10^{-2}$. The T-LORD algorithm produces at the output almost the same information as the MAP detector, for any input information $I_A$. On the contrary, SIC-MMSE is largely suboptimal and is expected to introduce severe losses in an iterative receiver. Monte Carlo simulations will confirm these predictions.
In Figure 7 we further compare the EXIT curves of different T-LORD detectors. As we can see, the gap between the hardware-oriented T-LORD with $S = P_{lc} = 5$ and the full-complexity T-LORD is small. Conversely, the case $S = P_{lc} = 4$ without the update of the square subset center leads to some information loss.
An interesting choice is $S = P_{lc} = 4$ with the update enabled (denoted throughout the figures as "moving square"), which exhibits an abrupt change in the delivered information as the a-priori information gets large. When no a-priori information is available, the positioning of the reduced square is based only on the received, noisy symbol; therefore, when the transmitted signal lies outside the reduced square, some extrinsic information is lost. Conversely, when the square positioning can benefit from a-priori information, a relevant fraction of the extrinsic information is recovered. This closes the gap iteration by iteration, as confirmed by the simulations in the following. Similar results hold for the realistic channel.

Monte Carlo simulations
In this section we provide simulation results for the low-complexity LORD detectors described above, with different square sides and parallelisms. For comparison, we also plot the MAP, MMSE [9] and full-complexity T-LORD [23] curves. Simulations are floating-point. Iterations range from 1 to 4. Iteration 1 means that no extrinsic information is available to the detector, i.e., LORD and T-LORD coincide, as do MAP and ML. Aiming to achieve very high spectral efficiencies, up to 15 bit/carrier, and to test the simplified T-LORD in challenging conditions, we always consider a channel code rate $R_c = 5/6$, i.e., the most sensitive 64-QAM mode in the 802.11n standard [1]. Figures 8 and 9 plot PER vs SNR for the cases $N_t = N_r = 2$ and $N_t = N_r = 3$, respectively. The target PER has been set equal to $10^{-2}$, a common assumption in wireless LAN communications when retransmission is allowed (see endnote e). Here, we assume ideal channel conditions (i.e., Gaussian uncorrelated channels with perfect CSI at the receiver).
As we can see, the T-LORD performance is close to that of the MAP detector, and T-LORD largely outperforms the MMSE receiver, which is plagued by ill-conditioned channel matrices. Only the T-LORD with $S = P_{lc} = 4$ and fixed subset choice at the root layer shows a modest loss, about 0.2 dB w.r.t. the full-complexity T-LORD.
The T-LORD robustness w.r.t. MMSE becomes even more pronounced in Figure 10, which assumes the realistic channel conditions described at the beginning of this section. Part of the SNR gap between ideal and realistic conditions can be ascribed to the noisy (tone-by-tone, ZF) channel estimates, computed exploiting the orthogonal preambles in [1]. The estimation error can be interpreted as additional noise over the link, and one can expect an overall performance degradation equal to 3 dB when $N_t = 2$ or $N_t = 4$, and slightly smaller (2.4 dB) when $N_t = 3$, since in that case the 802.11n preamble contains one more LTS than the number of transmitting antennas. The remaining 3 dB loss, which one would experience even in case of ideal CE, is due to the severe channel described in Section 2, with an exponentially decaying PDP, short delay spread and spatial correlation. In this challenging case with less spatial diversity, the MMSE receiver completely fails to improve with iterations, while the full-complexity K-best T-LORD misses the MAP performance by only 0.5 dB in the realistic case. This gap is probably due to error propagation in the DFE process.
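The 3 dB and 2.4 dB figures quoted above are consistent with a simple model, assumed here for illustration and not stated explicitly in the text, in which the estimation error raises the effective noise power by a factor $1 + N_t / N_{LTS}$, with $N_{LTS}$ the number of long training sequences averaged by the estimator. The LTS counts in the table below follow the 802.11n preamble (one, two, four and four training fields for one to four spatial streams); the function name is hypothetical.

```python
import math

# Number of long training sequences in the 802.11n preamble per number of streams.
LTS_COUNT = {1: 1, 2: 2, 3: 4, 4: 4}

def ce_loss_db(n_tx):
    """SNR degradation in dB under the assumed model: noise scaled by (1 + Nt / N_LTS)."""
    n_lts = LTS_COUNT[n_tx]
    return 10.0 * math.log10(1.0 + n_tx / n_lts)

for n_tx in (2, 3, 4):
    print(n_tx, round(ce_loss_db(n_tx), 2), "dB")
```

Under this assumption the extra LTS available for $N_t = 3$ is what lowers the loss below 3 dB in that configuration.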
For completeness, in Figures 11 and 12 we also report simulations for the case $N_t = N_r = 4$, both for ideal and realistic channels (see endnote f). In this case, the extremely time-consuming MAP has been replaced by a lower bound, obtained by assuming that the MAP receiver can fail only when the full-complexity T-LORD also fails to recover the message.
Focusing now on LORD detectors, $S = P_{lc} = 5$ is enough to approach the performance of the optimal ML detector, as the full-complexity LORD does, while with $S = P_{lc} = 4$ LORD suffers some performance degradation. This can be explained by a higher probability that the noise crosses the square subset borders, or that the soft-output generation misses some EDs in (10) or (14). Thus, the former parameters have been chosen for the HDL implementation of LORD [31]. Conversely, focusing on the fourth iteration, the above gap is almost closed by the $S = P_{lc} = 4$ T-LORD with the square subset center update (as predicted by the EXIT charts), which performs even better than the case $S = P_{lc} = 5$ with fixed subset.
To conclude, we refer readers interested in a performance comparison with sphere detectors to [23]. Results show that T-LORD achieves better performance while computing fewer EDs, in every simulated case.

Conclusions
In this article, we have proposed innovative hardware-oriented, soft-output LORD and T-LORD algorithms that can heavily reduce the number of parallel elementary processing units required to meet the latency constraints in MIMO-OFDM systems when high-cardinality QAM constellations are deployed. The simplified versions preserve the features of the original algorithms, i.e., fixed complexity, deterministic latency and a remarkably parallel structure. The proposed solution is regular and scalable, and does not require any ad hoc parameter tuning, e.g., depending on the experienced average SNR or the actual channel realization. Besides, the loss in 802.11n systems w.r.t. the ML and MAP detectors is very small (a few tenths of a dB). We tried several configurations up to 20 bit/carrier ($N_t = 4$), corresponding to a system throughput of 260 Mb/s, if we consider the 802.11n standard. We also tested the system with very noisy channel estimates, as well as with a more realistic channel offering less spatial and frequency diversity, due to correlation. In each case, the simplified LORD and T-LORD showed reassuring robustness, outperforming the non-iterative ML and the iterative SIC-MMSE receivers, and always approaching the receiver with the ideal detector. These features make LORD and T-LORD good candidates for VLSI MIMO receivers.
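The spectral efficiency and throughput figures quoted in this article follow from simple arithmetic, sketched below with hypothetical helper names. The sketch assumes 64-QAM with code rate 5/6 and the 802.11n numerology recalled in the endnotes (52 data carriers per OFDM symbol lasting 4 μs).

```python
import math

def bits_per_carrier(n_tx, qam_order=64, code_rate=5 / 6):
    """Information bits per carrier: Nt streams x log2(M) coded bits x code rate."""
    return n_tx * math.log2(qam_order) * code_rate

def throughput_mbps(n_tx, data_carriers=52, symbol_us=4.0):
    """Throughput in Mb/s (bits per microsecond) over one OFDM symbol."""
    return bits_per_carrier(n_tx) * data_carriers / symbol_us

for n_tx in (2, 3, 4):
    print(n_tx, round(bits_per_carrier(n_tx), 2), "bit/carrier,",
          round(throughput_mbps(n_tx), 1), "Mb/s")
```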
Endnotes
a For brevity, we give a simplified description of the LORD algorithm. For more details, refer to [21] and [22], where a real-domain modified QR decomposition makes it possible to avoid the normalization of the columns of Q. Nevertheless, the low-complexity, hardware-oriented LORD and T-LORD presented in this article can also be applied to that framework, as shown in [31].
b Obviously this value depends on the hardware, but it is reasonable for an FPGA device (with a clock frequency of tens of MHz) aiming to process in real time an OFDM symbol lasting 4 μs and carrying 52 data carriers, as in [1].
c Note that the measure does not refer to the number of $N_t$-dimensional hyper-symbols, but to the number of spanned QAM symbols throughout the algorithm.
d A fair comparison of area and power consumption is hard to achieve, since many parameters change from one design to the other (e.g., clock frequency, CMOS technology, modulations, antennas, soft-output generation). Nevertheless, in [31] it is shown that LORD provides a very good trade-off in any case.
e E.g., the standardization group for 802.11n chose PER = $10^{-2}$ for performance comparison purposes.
f A direct performance comparison between systems with a different number of antennas is hard to achieve. The SNR at which the system meets the target PER changes when the number of antennas gets large, and its trend is hard to foresee, for at least two reasons. On the one hand, we exploit the capacity growth to increase the throughput, not to strengthen the communication. On the other hand, a larger number of antennas makes the data packet shorter, since more information is conveyed at each channel use: this reduces the PER for a given SNR.