

  • Research
  • Open Access

A low-complexity MIMO subspace detection algorithm

EURASIP Journal on Wireless Communications and Networking 2015, 2015:95

  • Received: 24 October 2014
  • Accepted: 9 February 2015
  • Published:


A low-complexity multiple-input multiple-output (MIMO) subspace detection algorithm is proposed. It is based on decomposing a MIMO channel into multiple subsets of decoupled streams that can be detected separately. The new scheme employs triangular decomposition followed by elementary matrix operations to transform the channel into a generalized elementary matrix whose structure matches the subsets of streams to be detected. The proposed approach avoids matrix inversion and allows subsets to overlap, thus achieving better diversity gain. An optimized detector architecture based on a 2-by-2 ML detector core is also presented. Simulations demonstrate that the proposed algorithm performs to within a few tenths of a dB from the optimum detection algorithm.


  • Maximum likelihood
  • MIMO detection
  • LLR
  • Subspace detection

1 Introduction

With the advent of smart mobile devices, the demand for wireless access to broadband networks has been rapidly increasing in the past decade. Service providers are constantly faced with a major challenge of meeting contradictory requirements for higher data rates, improved quality of service (QoS), and better network capacity, while maintaining the transmit power and bandwidth budgets. To achieve these targets, novel enabling technologies need to be considered.

Exploiting the spatial dimension by using multiple-input multiple-output (MIMO) antenna systems is one of the key-enabling technologies for achieving high spectral efficiency in modern wireless communications standards. MIMO technology improves both the spectral efficiency and the QoS of wireless communication systems. However, detection of spatially multiplexed MIMO streams plays a key role in receiver design both in terms of performance and complexity [1] and has remained an active area of research. Schemes for MIMO detection are either based on joint-stream detection to separate the streams, such as maximum likelihood (ML) detection [2-4], or subset-stream detection such as those employed in successive interference cancellation (SIC), interference coordination, transmit beamforming, network MIMO, and coordinated multi-point transmission and reception [5-9].

A plethora of MIMO detectors have appeared in the literature on this subject, offering various performance-complexity tradeoffs. Suboptimal zero-forcing (ZF) and minimum mean-squared error (MMSE) detectors [10], as well as the nonlinear parallel and successive interference cancellation schemes [11-14], require relatively low complexity but sacrifice performance. On the other hand, tree-search or list-based detectors require substantially higher complexity but can offer (near-)ML performance such as the well-known sphere decoding algorithm [2-4,15-19]. Other tree-search schemes, such as the K-Best algorithm [20-26], address the non-deterministic throughput aspects of sphere decoders. Practical implementation aspects have been investigated in [18,23,25-39].

Subspace detection based on channel decomposition offers a good compromise between performance and complexity. In [6,7], a method was presented in which the effective MIMO channel matrix H is uniformly decomposed into identical parallel subchannels using geometric mean decomposition (GMD). In [8], a related scheme was presented in which H is block-wise diagonalized to narrow down the number of jointly detected streams to two, and then GMD is applied to balance capacity within each pair of subchannels. The scheme was generalized in [9] to allow joint detection of several overlapping subchannels using QR decomposition (QRD). By allowing subspaces to overlap, additional diversity can be gathered by placing a low-reliability data stream into several detection sets. Other subspace methods employ projection operators and lists to generate candidates for interference cancellation in equalization schemes (e.g., [40-44]).

The LORD algorithm proposed in [45,46] can be viewed as a special class of subspace MIMO detectors. It achieves ML performance (in the Max-log-MAP [47] sense) on two transmit antennas, but its performance degrades when the number of antennas increases. In [48], the LORD algorithm was generalized to four transmit antennas by using matrix inversion to decompose H into single streams.

Contributions: In this paper, we propose an efficient near-ML soft-output MIMO detection algorithm that jointly detects subsets of decoupled streams by transforming H into a generalized elementary matrix. A matrix transformation is analytically derived to induce the desired structure on H. It applies QL decomposition followed by elementary matrix operations on H, in a manner analogous to the modified QL decomposition algorithm based on the Gram-Schmidt orthogonalization procedure, and thereby avoids computationally expensive operations such as matrix inversion and pseudo-inversion. Various decomposition structures are investigated, including the option of allowing the decomposition sets to overlap, which results in additional diversity gain. Furthermore, we show that a parallel MIMO detector can be constructed from 2×2 component detector cores that decouple the streams in the subsets in parallel. Finally, we show that for two streams, the proposed algorithm is optimal and reduces to the LORD algorithm [45,46]. When the subsets include single streams only, the algorithm reduces to that of [48]. Computer simulations of the bit error rate of a MIMO system demonstrate that the proposed algorithm attains near-ML performance and outperforms both the sphere decoder [30] and the near-ML algorithm of [48].

The rest of the paper is organized as follows. Section 2 introduces the system model, and Section 3 reviews ML detection for two streams. Sections 4 and 5 present the proposed matrix decomposition scheme and a simplified construction using QLD. The detection algorithm is detailed in Section 6. Section 7 presents simulation results, while Section 8 ends the paper with concluding remarks.

2 System model

Consider a MIMO system with N transmit (Tx) antennas and M ≥ N receive (Rx) antennas. Assuming perfect channel knowledge at the receiver, the equivalent complex baseband input-output system relation can be modeled as \( \mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \), where \( \mathbf{y} \in {\mathcal{C}}^{M\times 1} \) is the received complex signal vector, \( \mathbf{H} \in {\mathcal{C}}^{M\times N} \) is the complex channel matrix, and \( \mathbf{x} = {\left[{x}_1\ {x}_2 \cdots {x}_N\right]}^T \in \mathcal{X} = {\mathcal{X}}_1 \times \cdots \times {\mathcal{X}}_N \) is the N × 1 transmitted complex symbol vector. Each symbol \( x_n \) belongs to a complex constellation \( {\mathcal{X}}_n \) of size \( {Q}_n = {2}^{q_n} \), is normalized so that \( \mathbb{E}\left[{x}_n^{\ast }{x}_n\right] = 1 \), and is formed from a sequence of \( q_n \) coded, interleaved bits \( {\mathbf{b}}_n = \left({b}_{n,1},{b}_{n,2},\cdots ,{b}_{n,{q}_n}\right) \) over the binary field \( {\mathcal{F}}_2 \). The effect of thermal noise is modeled as a zero-mean circularly symmetric complex Gaussian random vector \( \mathbf{n} \in {\mathcal{C}}^{M\times 1} \) with covariance matrix \( \mathbb{E}\left[\mathbf{n}{\mathbf{n}}^{\ast}\right] = {\sigma}_{\mathbf{n}}^2{\mathbf{I}}_M \). The signal-to-noise ratio (SNR) is defined as \( \mathrm{SNR} = N/{\sigma}_{\mathbf{n}}^2 \). Here \( \mathbb{E}\left[\cdot \right] \) stands for the expected value, \( (\cdot)^T \) and \( (\cdot)^{\ast} \) for the transpose and conjugate transpose, and \( \mathbf{I}_M \) for the M × M identity matrix.

Assuming equiprobable symbols, ML MIMO detection algorithms achieve optimum performance by finding the symbol vector \( \mathbf{x}\in \mathcal{X} \) that is closest to the received vector y under the Euclidean distance metric
$$ d\left(\mathbf{x}\right)\triangleq {\parallel \mathbf{y}-\mathbf{H}\mathbf{x}\parallel}^2={\parallel \overset{\sim }{\mathbf{y}}-\mathbf{L}\mathbf{x}\parallel}^2, $$
where \( \mathbf{H} = \mathbf{Q}\mathbf{L} \) is the QL decomposition (QLD) [49] of H into a unitary matrix \( \mathbf{Q} \in {\mathcal{C}}^{M\times N} \) and a lower triangular matrix (LTM) \( \mathbf{L} \in {\mathcal{C}}^{N\times N} \) with positive real diagonal elements, and \( \overset{\sim }{\mathbf{y}} = {\mathbf{Q}}^{\ast}\mathbf{y} = \mathbf{L}\mathbf{x} + {\mathbf{Q}}^{\ast}\mathbf{n} \in {\mathcal{C}}^{N\times 1} \) denotes the transformed received signal vector. Since Q is unitary, it preserves Euclidean norm as well as noise statistics. Hence, for equiprobable symbols, a ‘hard-decision’ ML MIMO detector finds \( {\mathbf{x}}^{\mathrm{ML}} \in \mathcal{X} \) such that \( \mathbf{H}{\mathbf{x}}^{\mathrm{ML}} \) is closest to y in \( {\mathcal{C}}^{M\times 1} \) (or equivalently, \( \mathbf{L}{\mathbf{x}}^{\mathrm{ML}} \) is closest to \( \overset{\sim }{\mathbf{y}} \) in \( {\mathcal{C}}^{N\times 1} \)). This is essentially an integer least-squares problem [4] of the form
$$ \begin{array}{cc}{d}^{\mathrm{ML}}& =\underset{\mathbf{x}\in \mathcal{X}}{ \min}\kern0.3em {\parallel \overset{\sim }{\mathbf{y}}-\mathbf{L}\mathbf{x}\parallel}^2\\ {}\end{array} $$
$$ \begin{array}{cc}{\mathbf{x}}^{\mathrm{ML}}& =\underset{\mathbf{x}\in \mathcal{X}}{ \arg } \min \kern0.3em {\parallel \overset{\sim }{\mathbf{y}}-\mathbf{L}\mathbf{x}\parallel}^2.\end{array} $$
In MIMO systems employing soft-input channel decoders, however, ML MIMO detectors generate soft-outputs (SO) in the form of log-likelihood ratios (LLRs) by searching for other ‘closest’ symbol vectors to \( \overset{\sim }{\mathbf{y}} \). The (unscaled) LLR value associated with bit \( b_{n,k} \) is given by
$$ {\varLambda}_{n,k}^{\mathrm{ML}} = \underset{\mathbf{x}\in {\mathcal{X}}_{n,k}^{(0)}}{ \min}\, d\left(\mathbf{x}\right) - \underset{\mathbf{x}\in {\mathcal{X}}_{n,k}^{(1)}}{ \min}\, d\left(\mathbf{x}\right),\quad n = 1,\cdots ,N;\ k = 1,\cdots ,{q}_n, $$

where \( {\mathcal{X}}_{n,k}^{(0)} = \left\{\mathbf{x}\in \mathcal{X}:{b}_{n,k}=0\right\} \) and \( {\mathcal{X}}_{n,k}^{(1)} = \left\{\mathbf{x}\in \mathcal{X}:{b}_{n,k}=1\right\} \) are the subsets of symbol vectors in \( \mathcal{X} \) that have their corresponding kth bit in the nth symbol equal to 0 and 1, respectively.
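As a point of reference for the reduced-complexity schemes that follow, the exhaustive search behind these LLRs can be sketched in a few lines. This is only an illustrative sketch: the Gray-mapped QPSK table, the function names `distance` and `exhaustive_llrs`, and the list-of-rows layout for H are assumptions made here, not part of the paper.

```python
import itertools

# Assumed bit-to-symbol mapping: Gray-mapped QPSK with unit average energy.
QPSK = {
    (0, 0): complex(1, 1) / 2 ** 0.5,
    (0, 1): complex(1, -1) / 2 ** 0.5,
    (1, 1): complex(-1, -1) / 2 ** 0.5,
    (1, 0): complex(-1, 1) / 2 ** 0.5,
}

def distance(y, H, x):
    """Euclidean metric d(x) = ||y - H x||^2, with H given as a list of rows."""
    return sum(abs(y_m - sum(h * x_n for h, x_n in zip(row, x))) ** 2
               for y_m, row in zip(y, H))

def exhaustive_llrs(y, H, mapping=QPSK):
    """Enumerate every symbol vector; return (ML bit labels, unscaled LLRs)."""
    n_tx = len(H[0])
    q = len(next(iter(mapping)))          # bits per symbol
    inf = float("inf")
    best0 = [[inf] * q for _ in range(n_tx)]  # min d(x) over vectors with b_{n,k} = 0
    best1 = [[inf] * q for _ in range(n_tx)]  # min d(x) over vectors with b_{n,k} = 1
    d_ml, bits_ml = inf, None
    for bits in itertools.product(mapping, repeat=n_tx):
        d = distance(y, H, [mapping[b] for b in bits])
        if d < d_ml:
            d_ml, bits_ml = d, bits
        for n, label in enumerate(bits):
            for k, bit in enumerate(label):
                target = best1 if bit else best0
                target[n][k] = min(target[n][k], d)
    llrs = [[best0[n][k] - best1[n][k] for k in range(q)] for n in range(n_tx)]
    return bits_ml, llrs
```

With this sign convention, a negative LLR favours bit value 0. The loop visits all \( \prod_n Q_n \) symbol vectors, which is the cost the later sections aim to avoid.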

3 Optimum 2×2 soft-output MIMO detection

In general, finding the ML solution requires computing \( \prod_{n=1}^N {Q}_n \) distance metrics. When N = 2, a simplification [45] can be applied to reduce the number of computations from \( Q_1 \cdot Q_2 \) to \( Q_1 + Q_2 \) by triangularizing the channel matrix as \( \mathbf{H} = {\mathbf{Q}}_1{\mathbf{L}}_1 \), with \( {\mathbf{Q}}_1 \) being unitary and \( {\mathbf{L}}_1 \) being a LTM, leading to:
$$ \mathbf{y}-\mathbf{H}\mathbf{x}\to \kern0.6em \left[\begin{array}{c}{\overset{\sim }{y}}_1\\ {}{\overset{\sim }{y}}_2\end{array}\right]\kern0.6em -\kern0.6em \left[\begin{array}{cc}{a}_1& 0\\ {}{c}_1& {b}_1\end{array}\kern0.9em \right]\cdotp \left[\begin{array}{c}{x}_1\\ {}{x}_2\end{array}\right]={\overset{\sim }{\mathbf{y}}}_1\kern0.3em -\kern0.3em {\mathbf{L}}_1\mathbf{x}, $$
where \( {\overset{\sim }{\mathbf{y}}}_1\kern0.3em =\kern0.3em {\mathbf{Q}}_1^{\ast}\mathbf{y} \), with \( {a}_1,{b}_1\in {\mathcal{R}}^{+} \) and \( {c}_1\in \mathcal{C} \). Then
$$ \begin{array}{lcrr}\underset{\mathbf{x}\in \mathcal{X}}{ \min }{\parallel {\overset{\sim }{\mathbf{y}}}_1\kern0.6em -\kern0.3em {\mathbf{L}}_1\kern0.3em \mathbf{x}\parallel}^2\kern0.6em & \kern0.3em =& \kern0.9em \underset{\begin{array}{c}{x}_1\in {\mathcal{X}}_1\\ {}{x}_2\in {\mathcal{X}}_2\end{array}}{ \min}\left({\left|{\overset{\sim }{y}}_1\kern0.3em -\kern0.3em {a}_1{x}_1\right|}^2\kern0.6em +\kern0.3em {\left|{\overset{\sim }{y}}_2\kern0.3em -\kern0.3em {c}_1{x}_1\kern0.3em -\kern0.3em {b}_1{x}_2\right|}^2\right)& \\ {}\kern0.9em & =& \kern0.9em \underset{x_1\in {\mathcal{X}}_1}{ \min}\kern0.3em \left(\kern0.3em {\left|{\overset{\sim }{y}}_1\kern0.6em -\kern0.3em {a}_1{x}_1\kern0.3em \right|}^2\kern0.6em +\kern0.6em \underset{x_2\in {\mathcal{X}}_2}{ \min}\kern0.3em \hspace{2.22144pt}{\left|{\overset{\sim }{y}}_2\kern0.3em -\kern0.3em {c}_1{x}_1\kern0.6em -\kern0.3em {b}_1{x}_2\kern0.3em \right|}^2\kern0.3em \right)& \\ {}\kern0.6em & =& \kern0.9em \underset{x_1\in {\mathcal{X}}_1}{ \min}\left(\kern0.3em {\left|{\overset{\sim }{y}}_1\kern0.3em -\kern0.3em {a}_1{x}_1\right|}^2\kern0.3em +\kern0.3em {\left|{\overset{\sim }{y}}_2\kern0.3em -\kern0.3em {c}_1{x}_1\kern0.3em -\kern0.3em {b}_1{\widehat{x}}_2\right|}^2\right)& \\ {}\kern0.6em & \triangleq & \kern0.9em \underset{x_1\in {\mathcal{X}}_1}{ \min }{d}_1\left({x}_1\right)& \end{array} $$
where \( {\widehat{x}}_2 \) is obtained by slicing \( \left({\overset{\sim }{y}}_2\kern0.3em -\kern0.3em {c}_1{x}_1\right)/{b}_1\kern0.3em \in \kern0.3em \mathcal{C} \) to the nearest constellation point in \( {\mathcal{X}}_2 \) using the operator \( {\left\lfloor \alpha \right\rceil}_{{\mathcal{X}}_n}\triangleq \underset{x\in {\mathcal{X}}_n}{ \arg \min}\left|\alpha -x\right| \):
$$ \begin{array}{lcrr}{\widehat{x}}_2& =& {\left\lfloor \left({\overset{\sim }{y}}_2\kern0.3em -\kern0.3em {c}_1{x}_1\right)/{b}_1\right\rceil}_{{\mathcal{X}}_2}\in {\mathcal{X}}_2.& \end{array} $$
Hence (6) requires only \( \left|{\mathcal{X}}_1\right| = {Q}_1 \) distance computations. The LLRs of the bits in symbol \( x_1 \) can simply be obtained as
$$ {\varLambda}_{1,k}^{\mathrm{ML}} = \underset{x_1\in {\mathcal{X}}_{1,k}^{(0)}}{ \min}\, {d}_1\left({x}_1\right) - \underset{x_1\in {\mathcal{X}}_{1,k}^{(1)}}{ \min}\, {d}_1\left({x}_1\right),\quad k = 1,\cdots ,{q}_1. $$
To obtain the LLRs of the bits in \( x_2 \), however, we triangularize H as \( {\mathbf{Q}}_2{\mathbf{L}}_2 \) so that a zero appears in the upper left corner:
$$ \mathbf{y}-\mathbf{H}\mathbf{x}\longrightarrow \kern0.60em \left[\begin{array}{c}{\bar{y}}_1\\ {\bar{y}}_2\end{array}\right]-\left[\begin{array}{cc}0& {a}_2\\ {b}_2& {c}_2\end{array}\kern0.3em \right]\cdotp \left[\begin{array}{c}{x}_1\\ {x}_2\end{array}\right]={\bar{\mathbf{y}}}_2\kern0.3em -\kern0.3em {\mathbf{L}}_2\mathbf{x}, $$
where now \( {\bar{\mathbf{y}}}_2={\mathbf{Q}}_2^{\ast}\mathbf{y} \), and \( {a}_2,{b}_2\in {\mathcal{R}}^{+} \) and \( {c}_2\in \mathcal{C} \). Then
$$ \begin{array}{llll}\underset{\mathbf{x}\in \mathcal{X}}{ \min }{\parallel {\bar{\mathbf{y}}}_2- {\mathbf{L}}_2\mathbf{x}\parallel}^2 & =& {\underset{\substack{{{x}_1\in {\mathcal{X}}_1}\\{{x}_2\in {\mathcal{X}}_2}}}{ \min}}\left({\left|{\bar{y}}_1- {a}_2{x}_2\right|}^2+ {\left|{\bar{y}}_2 - {c}_2{x}_2 - {b}_2{x}_1\right|}^2\right)& \\ & =& \underset{x_2\in {\mathcal{X}}_2}{ \min} \left({\left|{\bar{y}}_1 - {a}_2{x}_2\right|}^2 + {\left|{\bar{y}}_2 - {c}_2{x}_2 - {b}_2{\hat{x}}_1\right|}^2\right)& \\ & \triangleq & \underset{x_2\in {\mathcal{X}}_2}{ \min} {d}_2\left({x}_2\right)& \end{array} $$
where \( {\hat{x}}_1 ={\left\lfloor \left({\bar{y}}_2 -{c}_2{x}_2\right)/{b}_2\right\rceil}_{{\mathcal{X}}_1} \). The LLRs of the bits in \( x_2 \) are given by
$$ {\varLambda}_{2,k}^{\mathrm{ML}} = \underset{x_2\in {\mathcal{X}}_{2,k}^{(0)}}{ \min}\, {d}_2\left({x}_2\right) - \underset{x_2\in {\mathcal{X}}_{2,k}^{(1)}}{ \min}\, {d}_2\left({x}_2\right),\quad k = 1,\cdots ,{q}_2. $$

Since \( {\mathbf{Q}}_1 \) and \( {\mathbf{Q}}_2 \) are unitary, the ML solutions in (6) and (10) are identical. To find the hard-decision ML solution, only a one-sided QLD is needed on either layer 1 or layer 2. A list of \( Q_n \) distances \( \left\{{d}_n(x),\ \forall x \in {\mathcal{X}}_n\right\} \) is generated by enumerating all symbols \( x \in {\mathcal{X}}_n \), for n = 1 or n = 2, and the minimum is selected. However, to generate soft LLRs, two-sided decompositions are needed, and two lists of distances for n = 1 and n = 2 must be computed to select the appropriate minima in (8) and (11).
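The enumerate-and-slice procedure of this section can be sketched as follows for a square 2×2 channel. The helper names (`slice_to`, `ql_2x2`, `layer_distances`), the QPSK alphabet, and the column-list representation of H are assumptions made for illustration. The second list is obtained here by swapping the roles of the two columns, which is one simple way to realize the second triangularization.

```python
# Assumed alphabet: unit-energy QPSK; any finite constellation would do.
QPSK = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]

def slice_to(alpha, constellation):
    """Nearest-point slicer: argmin over x of |alpha - x|."""
    return min(constellation, key=lambda x: abs(alpha - x))

def ql_2x2(h1, h2):
    """Gram-Schmidt QL of H = [h1 h2], so that h1 = a*q1 + c*q2 and h2 = b*q2
    with a, b real and positive (L = [[a, 0], [c, b]])."""
    b = sum(abs(v) ** 2 for v in h2) ** 0.5
    q2 = [v / b for v in h2]
    c = sum(q.conjugate() * v for q, v in zip(q2, h1))
    r = [v - c * q for v, q in zip(h1, q2)]
    a = sum(abs(v) ** 2 for v in r) ** 0.5
    q1 = [v / a for v in r]
    return q1, q2, a, c, b

def layer_distances(y, h_enum, h_slice, X_enum, X_slice):
    """Return {x: d(x)} for every x on the enumerated layer; the other
    layer is resolved by slicing after interference cancellation."""
    q1, q2, a, c, b = ql_2x2(h_enum, h_slice)
    y1 = sum(q.conjugate() * v for q, v in zip(q1, y))  # first entry of Q^* y
    y2 = sum(q.conjugate() * v for q, v in zip(q2, y))  # second entry of Q^* y
    out = {}
    for x in X_enum:
        x_hat = slice_to((y2 - c * x) / b, X_slice)
        out[x] = abs(y1 - a * x) ** 2 + abs(y2 - c * x - b * x_hat) ** 2
    return out
```

Calling `layer_distances(y, h1, h2, X1, X2)` yields the list \( d_1(x_1) \), and `layer_distances(y, h2, h1, X2, X1)` yields \( d_2(x_2) \). The LLRs then follow by taking minima over the entries whose bit label is 0 or 1, for a total of \( Q_1 + Q_2 \) metric evaluations.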

4 Extensions to higher-order layers

The previous optimization cannot be extended in a straightforward manner to N = 3 or more layers because the structure of the lower triangular matrix L includes off-diagonal terms that prevent searching for the ML solution by enumerating symbols on one layer and finding the minima through slicing individually on all other layers in parallel. More specifically, in Figure 1a, the presence of the red-marked entries in the LTM implies that determining the ML solution requires enumerating symbols on N − 1 layers and slicing only on the last layer, as is typically done in tree-based detectors (e.g., [30]), and hence still requires \( O\left(\prod_n{Q}_n\right) \) complexity rather than \( O\left(\sum_n{Q}_n\right) \).
Figure 1 4×4 channel matrix structures: (a) full; and (b-e) punctured structures for every layer.
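To give a feel for the gap between the two complexity orders, the small sketch below counts distance metrics for a hypothetical system with four 16-QAM layers. The function name `metric_counts` and the chosen constellation sizes are illustrative assumptions.

```python
from math import prod

def metric_counts(constellation_sizes):
    """Return (joint, per_layer): prod(Q_n) for an exhaustive joint search
    versus sum(Q_n) when each layer is enumerated once and the rest sliced."""
    return prod(constellation_sizes), sum(constellation_sizes)

# Assumed example: four layers, each using 16-QAM (Q_n = 16).
joint, per_layer = metric_counts([16, 16, 16, 16])
print(joint, per_layer)  # 65536 64
```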

One desirable structure of H for a four-layer MIMO system would be as shown in Figure 1b, in which the red-marked entries are zeroed-out. Here, by enumerating symbols on layer 1, the minimum distances and associated symbols on layers 2 to 4 can be searched for in parallel through slicing only on the corresponding layers, similar to the two-layer system. This suffices to compute the LLRs associated with the bits on layer-1 symbol. A similar process is repeated by decomposing H according to the structures shown in Figure 1c,d,e [48] to compute the LLRs for bits associated with layers 2 to 4.

Other ‘punctured’ structures are also possible for a 4 × 4 system as illustrated in Figure 2. They differ in 1) the number of layers over which symbols are enumerated (enumeration set), 2) the submatrix structure used to propagate these enumerated symbols and cancel their interference effect from the remaining layers (interference cancellation set), and 3) the number of layers in which the minimum distance and associated symbol can be obtained by slicing after interference cancellation (slicer set). Let E denote the size of the enumeration set, S the size of the slicer set, and S × E the size of the interference cancellation set. We refer to this structure using the triplet (E,S × E,S). For example, in Figure 2a, we enumerate over E = 1 layer only, cancel interference from this layer to the three other layers using a 3 × 1 structure, and slice over S = 3 layers. In the structure in Figure 2b, we enumerate over E = 2 layers, cancel interference using a 2 × 2 structure, and slice over S = 2 layers.
Figure 2 (1,3×1,3) (a), (2,2×2,2) (b), and (3,1×3,1) (c) punctured structures.

LLR values are generated for bits in symbols included in the enumeration set only. Complementary structures that enumerate symbols on other layers are required to generate their respective LLRs. For example, the (1,3 × 1,3) structure requires three other similar structures to generate LLRs for layers 2 to 4 (see Figure 1c,d,e). The (2,2 × 2,2) structure of Figure 2b on the other hand requires only one identical structure to generate LLRs for both layers 3 and 4, while Figure 2c requires one non-identical (1,3 × 1,3) structure to handle layer 4.

4.1 Preliminaries

Let \( \mathbf{H} = \left[{\mathbf{h}}_1\ {\mathbf{h}}_2 \cdots {\mathbf{h}}_N\right] \), and assume H has full column rank. For simplicity, we assume N = M in the remainder of this work. We seek a matrix \( \mathbf{W} = \left[{\mathbf{w}}_1\ {\mathbf{w}}_2 \cdots {\mathbf{w}}_N\right] \in {\mathcal{C}}^{N\times N} \) that transforms H into a punctured LTM \( \mathbf{L} = \left[{l}_{ij}\right] \in {\mathcal{C}}^{N\times N} \) with \( {l}_{ii} \in {\mathcal{R}}^{+} \), such that:
$$ {\mathbf{W}}^{\ast}\mathbf{H}=\mathbf{L}. $$

The aim is to induce a specific pattern of zeros below the main diagonal of L by appropriately choosing the columns of W. Setting \( {\mathbf{W}}^{\ast} = {\left({\mathbf{H}}^{\ast}\mathbf{H}\right)}^{-1}{\mathbf{H}}^{\ast} \) to be the left Moore-Penrose pseudo-inverse of H would obviously zero-out all entries in L below the main diagonal, resulting in \( \mathbf{L} = {\mathbf{I}}_N \). On the other hand, choosing W to be an orthonormal basis of the column space of H would transform H into a regular (i.e., unpunctured) LTM, with W being unitary.

In general, if L is punctured, then W is non-unitary. However, we impose the condition on the column vectors of W to have unit length, i.e., \( {\mathbf{w}}_n^{\ast }{\mathbf{w}}_n = 1 \) for \( n = 1,\cdots ,N \), or
$$ \mathrm{diag}\left({\mathbf{W}}^{\ast}\mathbf{W}\right)={\left[1\kern1em 1\kern1em \cdots \kern1em 1\right]}_{1\times N}^T. $$

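This guarantees that, when y is left-multiplied by \( {\mathbf{W}}^{\ast} \), each entry of the transformed noise vector \( {\mathbf{W}}^{\ast}\mathbf{n} \) retains the variance \( {\sigma}_{\mathbf{n}}^2 \): the covariance matrix is \( \mathbb{E}\left[{\mathbf{W}}^{\ast}\mathbf{n}{\mathbf{n}}^{\ast}\mathbf{W}\right] = {\sigma}_{\mathbf{n}}^2{\mathbf{W}}^{\ast}\mathbf{W} \), which has \( {\sigma}_{\mathbf{n}}^2 \) on its diagonal and reduces to \( {\sigma}_{\mathbf{n}}^2{\mathbf{I}}_N \) when W is unitary. Note also that non-zero off-diagonal entries of the Gram matrix \( {\mathbf{W}}^{\ast}\mathbf{W} \) correspond to pairs of column vectors in W that are non-orthogonal.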

4.2 Proposed WL decomposition scheme

Let \( \mathbf{P} = \mathbf{H}{\left({\mathbf{H}}^{\ast}\mathbf{H}\right)}^{-1}{\mathbf{H}}^{\ast} \) be the orthogonal projection onto the column space of H, and \( {\mathbf{P}}^{\perp} = \mathbf{I} - \mathbf{H}{\left({\mathbf{H}}^{\ast}\mathbf{H}\right)}^{-1}{\mathbf{H}}^{\ast } \) be the orthogonal projection onto the left nullspace of H. Let \( {\mathbf{H}}_{\mathcal{I}} \) be the submatrix formed by the columns of H whose index n belongs to the set \( \mathcal{I} \). For example, if \( \mathbf{H} = \left[{\mathbf{h}}_1\ {\mathbf{h}}_2\ {\mathbf{h}}_3\ {\mathbf{h}}_4\right] \) and \( \mathcal{I} = \left\{1,2\right\} \), then \( {\mathbf{H}}_{\mathcal{I}} = \left[{\mathbf{h}}_1\ {\mathbf{h}}_2\right] \).

Let \( {\mathcal{I}}_n \) be the column index set of the entries in the nth row of H to be zeroed out. Define the nth column vector \( {\overset{\sim }{\mathbf{w}}}_n = {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }{\mathbf{h}}_n \), where
$$ \begin{array}{cc}{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }& ={\mathbf{I}}_N-{\mathbf{H}}_{{\mathcal{I}}_n}\kern0.6em {\left({\mathbf{H}}_{{\mathcal{I}}_n}^{\ast }{\mathbf{H}}_{{\mathcal{I}}_n}\kern0.3em \right)}^{-1}\kern0.3em {\mathbf{H}}_{{\mathcal{I}}_n}^{\ast}\kern0.3em \end{array} $$
and \( {\mathbf{H}}_{{\mathcal{I}}_n}\kern0.3em =\kern0.3em \left\{{\mathbf{h}}_m\left.\right|m\in {\mathcal{I}}_n\right\} \). To satisfy (13), first note that
$$ {\overset{\sim }{\mathbf{w}}}_n^{\ast }{\overset{\sim }{\mathbf{w}}}_n={\mathbf{h}}_n^{\ast }{\left({\mathbf{P}}_{{\mathcal{I}}_n}^{\perp}\right)}^{\ast }{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }{\mathbf{h}}_n={\mathbf{h}}_n^{\ast }{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }{\mathbf{h}}_n={\overset{\sim }{\mathbf{w}}}_n^{\ast }{\mathbf{h}}_n, $$
where the second equality follows since \( {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp } \) is an orthogonal projection matrix, and hence \( {\left({\mathbf{P}}_{{\mathcal{I}}_n}^{\perp}\right)}^{\ast} = {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp } \) and \( {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp}{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp} = {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp } \). Hence the normalized vector \( {\mathbf{w}}_n \), defined as
$$ {\mathbf{w}}_n=\frac{{\overset{\sim }{\mathbf{w}}}_n}{\parallel {\overset{\sim }{\mathbf{w}}}_n\parallel }, $$
with \( \parallel {\overset{\sim }{\mathbf{w}}}_n\parallel \kern0.3em =\kern0.3em \sqrt{{\mathbf{h}}_n^{\ast }{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }{\mathbf{h}}_n} \), has unit length.

We have the following main result:

Theorem 1 (WL decomposition).

Let \( {\mathcal{I}}_n,n\kern0.3em =\kern0.3em 1,\cdots \kern0.3em ,N \), be the column index sets where puncturing is desired in each row n of H. Form the submatrices \( {\mathbf{H}}_{{\mathcal{I}}_n} \) and their projection matrices \( {\mathbf{P}}_{{\mathcal{I}}_n}^{\perp } \) as noted above. Let \( \mathbf{D}\kern0.3em =\kern0.3em \left[{d}_n\right]\kern0.3em \in \kern0.3em {\mathcal{R}}^{+} \) be a diagonal matrix whose entries are given by \( {d}_n\kern0.3em =\kern0.3em 1/\sqrt{{\mathbf{h}}_n^{\ast }{\mathbf{P}}_{{\mathcal{I}}_n}^{\perp }{\mathbf{h}}_n},n\kern0.3em =\kern0.3em 1,\cdots \kern0.3em ,N \). Then the matrix
$$ {\mathbf{W}}^{\ast }=\mathbf{D}\left[\begin{array}{c}{\mathbf{h}}_1^{\ast }{\mathbf{P}}_{{\mathcal{I}}_1}^{\perp}\\ {}{\mathbf{h}}_2^{\ast }{\mathbf{P}}_{{\mathcal{I}}_2}^{\perp}\\ {}\vdots \\ {}{\mathbf{h}}_N^{\ast }{\mathbf{P}}_{{\mathcal{I}}_N}^{\perp}\\ {}\end{array}\right], $$

when right multiplied by H, zeros out the entries in the rows of H at column positions given in \( {\mathcal{I}}_n \), for all n, and satisfies condition (13).


Proof. Consider the product of \( {\mathbf{w}}_n^{\ast } \) with \( {\mathbf{H}}_{{\mathcal{I}}_n} \):
$$ \begin{array}{ll}{\mathbf{w}}_n^{\ast }{\mathbf{H}}_{{\mathcal{I}}_n}& ={d}_n{\mathbf{h}}_n^{\ast}\left({\mathbf{I}}_N-{\mathbf{H}}_{{\mathcal{I}}_n}{\left({\mathbf{H}}_{{\mathcal{I}}_n}^{\ast }{\mathbf{H}}_{{\mathcal{I}}_n}\right)}^{-1}{\mathbf{H}}_{{\mathcal{I}}_n}^{\ast}\right){\mathbf{H}}_{{\mathcal{I}}_n}\\ {}& ={d}_n{\mathbf{h}}_n^{\ast}\left({\mathbf{H}}_{{\mathcal{I}}_n}-{\mathbf{H}}_{{\mathcal{I}}_n}\right)={\mathbf{0}}_{1\times \left|{\mathcal{I}}_n\right|}.\end{array} $$

Also, in general, \( {\mathbf{w}}_n^{\ast }{\mathbf{h}}_m \ne 0 \) for all \( m \notin {\mathcal{I}}_n, 1 \le m \le N \). Hence \( {\mathbf{w}}_n \) nulls the nth row of H only at the column indices given in \( {\mathcal{I}}_n \). Since each \( {\mathbf{w}}_n \) is normalized to unit length, condition (13) holds. □

Example 1.

For a 4×4 MIMO system, choosing the puncturing sets as \( {\mathcal{I}}_1 = \left\{2,3,4\right\},{\mathcal{I}}_2 = \left\{3,4\right\},{\mathcal{I}}_3 = \left\{2,4\right\},{\mathcal{I}}_4 = \left\{2,3\right\} \) transforms H into a LTM L with entries \( {l}_{32} = {l}_{42} = {l}_{43} = 0 \) as follows (Figure 2a):
$$ {\mathbf{W}}^{\ast}\mathbf{H}=\mathbf{D}\left[\begin{array}{c}{\mathbf{h}}_1^{\ast }{\mathbf{P}}_{2,3,4}^{\perp}\\ {\mathbf{h}}_2^{\ast }{\mathbf{P}}_{3,4}^{\perp}\\ {\mathbf{h}}_3^{\ast }{\mathbf{P}}_{2,4}^{\perp}\\ {\mathbf{h}}_4^{\ast }{\mathbf{P}}_{2,3}^{\perp}\\ \end{array}\right]\mathbf{H}=\left[\begin{array}{cccc}\times \\ \times & \times \\ \times && \times \\ \times &&& \times \\ \end{array}\right]=\mathbf{L} $$

Choosing \( {\mathcal{I}}_1 = \left\{2,3,4\right\},{\mathcal{I}}_2 = \left\{1,3,4\right\},{\mathcal{I}}_3 = \left\{4\right\},{\mathcal{I}}_4 = \left\{3\right\} \) results in the form of Figure 2b, while the choice \( {\mathcal{I}}_1 = \left\{2,3,4\right\},{\mathcal{I}}_2 = \left\{1,3,4\right\},{\mathcal{I}}_3 = \left\{1,2,4\right\},{\mathcal{I}}_4 = \varnothing \) results in the form of Figure 2c. □
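The construction in Theorem 1 can be checked numerically for the first puncturing pattern of Example 1. The sketch below is only illustrative: it replaces the explicit inverse in the projection matrix by Gram-Schmidt orthogonalization of the columns in \( {\mathbf{H}}_{{\mathcal{I}}_n} \) (the two agree when those columns are linearly independent), and the channel values, helper names, and column-list layout are assumptions made here.

```python
def inner(u, v):
    """Hermitian inner product u^* v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def project_out(v, vectors):
    """Return the component of v orthogonal to span(vectors), i.e. P_perp v,
    computed via Gram-Schmidt rather than an explicit matrix inverse."""
    basis = []
    for u in vectors:
        for e in basis:
            c = inner(e, u)
            u = [a - c * b for a, b in zip(u, e)]
        norm = inner(u, u).real ** 0.5
        basis.append([a / norm for a in u])
    for e in basis:
        c = inner(e, v)
        v = [a - c * b for a, b in zip(v, e)]
    return v

def wl_decompose(cols, puncture_sets):
    """cols: columns h_1..h_N; puncture_sets: 1-based index sets I_n.
    Returns (W as a list of columns, L as a list of rows) with L = W^* H."""
    W = []
    for h_n, I_n in zip(cols, puncture_sets):
        w = project_out(h_n, [cols[m - 1] for m in sorted(I_n)])
        norm = inner(w, w).real ** 0.5
        W.append([a / norm for a in w])       # unit-length column w_n
    L = [[inner(w, h) for h in cols] for w in W]
    return W, L

# Arbitrary, well-conditioned illustrative channel (one list per column).
H_COLS = [
    [1 + 0.5j, 0.2 - 0.1j, -0.1 + 0.2j, 0.1 + 0.1j],
    [0.2 - 0.2j, 1.1 + 0.3j, 0.1 + 0.1j, -0.2 + 0.1j],
    [-0.1 + 0.2j, 0.2 + 0.2j, 1.3 - 0.4j, 0.1 - 0.1j],
    [0.2 + 0.1j, -0.1 + 0.2j, 0.1 - 0.2j, 0.9 + 0.8j],
]
SETS = [{2, 3, 4}, {3, 4}, {2, 4}, {2, 3}]    # pattern of Figure 2a
W, L = wl_decompose(H_COLS, SETS)
```

Up to rounding error, `L` has zeros exactly at the positions listed in `SETS`, a real positive diagonal, and every column of `W` has unit norm.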

The \( \mathbf{W},\mathbf{L} \) pair in (12) that punctures H into a given lower triangular structure is not unique since W is non-unitary. But any other pair that generates the same structure is related to \( \mathbf{W},\mathbf{L} \) by a matrix transformation, as the next lemma shows.

Lemma 1 (Similar WL decompositions).

Let \( {\mathbf{W}}_1,{\mathbf{W}}_2 \) be two distinct matrices that transform H into two distinct but identically punctured LTMs \( {\mathbf{L}}_1 \) and \( {\mathbf{L}}_2 \), and let \( {\mathbf{R}}_1,{\mathbf{R}}_2 \) be the upper triangular factors of their QR decompositions. Then the orthonormal bases of the column space of \( {\mathbf{W}}_1 \) and \( {\mathbf{W}}_2 \) are both identical to that of H, and
$$ {\mathbf{L}}_1={\mathbf{R}}_1^{\ast }{\left({\mathbf{R}}_2^{\ast}\right)}^{-1}{\mathbf{L}}_2,\quad {\mathbf{W}}_1={\mathbf{W}}_2{\mathbf{R}}_2^{-1}{\mathbf{R}}_1. $$

Proof. We have \( {\mathbf{W}}_1^{\ast}\mathbf{H} = {\mathbf{L}}_1 \) and \( {\mathbf{W}}_2^{\ast}\mathbf{H} = {\mathbf{L}}_2 \). Let \( {\mathbf{W}}_1 = {\mathbf{Q}}_1{\mathbf{R}}_1 \) and \( {\mathbf{W}}_2 = {\mathbf{Q}}_2{\mathbf{R}}_2 \) denote the QR decompositions of \( {\mathbf{W}}_1 \) and \( {\mathbf{W}}_2 \), respectively. Then \( \mathbf{H} = {\mathbf{Q}}_1{\left({\mathbf{R}}_1^{\ast}\right)}^{-1}{\mathbf{L}}_1 = {\mathbf{Q}}_2{\left({\mathbf{R}}_2^{\ast}\right)}^{-1}{\mathbf{L}}_2 \), where \( {\mathbf{Q}}_1 \) and \( {\mathbf{Q}}_2 \) are unitary, with both \( {\left({\mathbf{R}}_1^{\ast}\right)}^{-1}{\mathbf{L}}_1 \) and \( {\left({\mathbf{R}}_2^{\ast}\right)}^{-1}{\mathbf{L}}_2 \) being lower triangular matrices. But H admits a unique QLD in the form \( \mathbf{H} = \mathbf{Q}\mathbf{L} \), hence \( {\mathbf{Q}}_1 = {\mathbf{Q}}_2 = \mathbf{Q} \) and \( {\left({\mathbf{R}}_1^{\ast}\right)}^{-1}{\mathbf{L}}_1 = {\left({\mathbf{R}}_2^{\ast}\right)}^{-1}{\mathbf{L}}_2 \). Left-multiplying the last equality by \( {\mathbf{R}}_1^{\ast} \) gives the first relation, and \( {\mathbf{W}}_1 = \mathbf{Q}{\mathbf{R}}_1 = {\mathbf{W}}_2{\mathbf{R}}_2^{-1}{\mathbf{R}}_1 \) gives the second. □

5 Reduced-complexity WL decomposition using QLD

A brute force approach for computing W involves extensive matrix inversion, which is computationally expensive and prone to numerical error when executed on DSP or dedicated hardware with finite precision. We develop next an efficient scheme to determine W using the modified Gram-Schmidt orthogonalization procedure [49-51] followed by elementary matrix operations. From Lemma 1, any other W that produces an identical structure is related by (16).

Assume that H is decomposed into a unitary matrix \( {\mathbf{Q}}_1 = \left[{\mathbf{q}}_1\ {\mathbf{q}}_2 \cdots {\mathbf{q}}_N\right] \) and a lower triangular matrix \( {\mathbf{L}}_1 = {\left[{l}_{ij}\right]}_{N\times N} \) with real and positive diagonal elements. We then have \( {\mathbf{Q}}_1^{\ast}\mathbf{H}={\mathbf{L}}_1 \). Obviously, \( {\mathbf{q}}_1^{\ast }{\mathbf{q}}_1 = 1 \) and \( {\mathbf{q}}_1^{\ast }{\mathbf{h}}_m = 0 \) for all \( m = 2,\cdots ,N \). Hence, \( {\mathbf{w}}_1 = {\mathbf{q}}_1 \). Now consider row n, with \( 1 < n \le N \), of \( {\mathbf{L}}_1 \), and assume the mth entry \( {l}_{nm} \), m < n, is to be nulled. We have \( {\mathbf{q}}_n^{\ast }{\mathbf{h}}_m = {l}_{nm}\in \mathcal{C} \) and \( {\mathbf{q}}_m^{\ast }{\mathbf{h}}_m = {l}_{mm}\in {\mathcal{R}}^{+} \), from which it follows that \( \left({\mathbf{q}}_n^{\ast }-{\mathbf{q}}_m^{\ast}\frac{l_{nm}}{l_{mm}}\right){\mathbf{h}}_m = 0 \). Therefore, the matrix operations
$$ \begin{array}{cc}{\mathbf{q}}_n& ={\mathbf{q}}_n-{\mathbf{q}}_m{l}_{nm}^{\ast }/{l}_{mm}\\ {}\end{array} $$
$$ \begin{array}{cc}{l}_{nj}& ={l}_{nj}-{l}_{mj}{l}_{nm}/{l}_{mm},\kern1em \mathrm{f}\mathrm{o}\mathrm{r}\kern1em j=1,\cdots \kern0.3em ,m,\end{array} $$
puncture the required entry and update the columns of \( {\mathbf{Q}}_1 \) accordingly. These operations are repeated for all other column entries m < n to be punctured in row n. Finally, \( {\mathbf{q}}_n \) is normalized to have unit length, and the non-zero entries in row n of \( {\mathbf{L}}_1 \) are updated:
$$ \begin{array}{cc}{l}_{nj}& ={l}_{nj}/\parallel {\mathbf{q}}_n\parallel, \kern1em \mathrm{f}\mathrm{o}\mathrm{r}\kern1em j=1,\cdots \kern0.3em ,n\\ {}\end{array} $$
$$ \begin{array}{cc}{\mathbf{q}}_n& ={\mathbf{q}}_n/\parallel {\mathbf{q}}_n\parallel \end{array} $$

The operations in (17)-(18) followed by the normalization steps (19)-(20) are repeated for all rows n where puncturing is required. The resulting \( {\mathbf{Q}}_1 \) is W, and \( {\mathbf{L}}_1 \) is the desired punctured LTM L.
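A minimal sketch of this procedure is given below: a Gram-Schmidt QL decomposition followed by the row operations (17)-(20). Several details are assumptions of the sketch rather than statements from the text: entries within a row are nulled from right to left so that no previously nulled entry is refilled, rows are processed in increasing order using the already-updated rows, puncturing sets are 0-based and list only columns m < n, and the channel values and function names are illustrative.

```python
def inner(u, v):
    """Hermitian inner product u^* v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def ql_gram_schmidt(cols):
    """QL decomposition H = Q L by Gram-Schmidt, run from the last column
    to the first. Returns Q as a list of columns and L as a list of rows."""
    n = len(cols)
    Q = [None] * n
    L = [[0j] * n for _ in range(n)]
    for j in range(n - 1, -1, -1):
        v = list(cols[j])
        for i in range(n - 1, j, -1):
            L[i][j] = inner(Q[i], v)
            v = [a - L[i][j] * b for a, b in zip(v, Q[i])]
        norm = inner(v, v).real ** 0.5
        L[j][j] = complex(norm)
        Q[j] = [a / norm for a in v]
    return Q, L

def puncture(Q, L, puncture_sets):
    """Apply (17)-(20) in place. puncture_sets[n] holds the 0-based columns
    m < n to null in row n. Returns (W, L) with W^* H = L."""
    for n in range(len(L)):
        for m in sorted((m for m in puncture_sets[n] if m < n), reverse=True):
            g = L[n][m] / L[m][m]
            Q[n] = [a - g.conjugate() * b for a, b in zip(Q[n], Q[m])]  # (17)
            for j in range(m):                                          # (18)
                L[n][j] -= g * L[m][j]
            L[n][m] = 0j                       # entry (n, m) is now nulled
        norm = inner(Q[n], Q[n]).real ** 0.5
        L[n] = [v / norm for v in L[n]]                                 # (19)
        Q[n] = [a / norm for a in Q[n]]                                 # (20)
    return Q, L

# Arbitrary, well-conditioned illustrative channel (one list per column).
H_COLS = [
    [1 + 0.5j, 0.2 - 0.1j, -0.1 + 0.2j, 0.1 + 0.1j],
    [0.2 - 0.2j, 1.1 + 0.3j, 0.1 + 0.1j, -0.2 + 0.1j],
    [-0.1 + 0.2j, 0.2 + 0.2j, 1.3 - 0.4j, 0.1 - 0.1j],
    [0.2 + 0.1j, -0.1 + 0.2j, 0.1 - 0.2j, 0.9 + 0.8j],
]
Q, L = ql_gram_schmidt(H_COLS)
W, L = puncture(Q, L, [set(), set(), {1}, {1, 2}])  # l_32 = l_42 = l_43 = 0
```

No matrix inverse is formed: each nulled entry costs one scaled vector subtraction and one scaled row subtraction, followed by a single normalization per row.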

In matrix form, we can write (17)-(18) using elementary matrices E m = [e nj ], 1 ≤ m ≤ N, which differ from I N by a single elementary row operation and are defined as follows:
$$ {e}_{nj}=\left\{\begin{array}{cc}1,& \mathrm{if}\kern0.6em j=n;\\ {}-{l}_{nm}/{l}_{mm},& \mathrm{if}\kern0.6em j=m,j\in {\mathcal{I}}_n;\\ {}0,& \mathrm{otherwise}.\end{array}\right. $$

The product of these elementary matrices forms the unscaled matrices L 2 = (E n ⋯ E 1)L 1 and \( {\mathbf{Q}}_2^{\ast}\kern0.3em =\kern0.3em \left({\mathbf{E}}_n^{\ast}\cdots {\mathbf{E}}_1^{\ast}\right){\mathbf{Q}}_1^{\ast } \).
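
As a small worked instance of a single elementary operation (an illustration only, assuming N = 3 and that l 32 is the only entry to be nulled; the symbol E without subscript is used here just for this example), the elementary matrix and its effect on L 1 are
$$ \mathbf{E}=\left[\begin{array}{ccc}1& 0& 0\\ {}0& 1& 0\\ {}0& -{l}_{32}/{l}_{22}& 1\end{array}\right],\kern2em \mathbf{E}{\mathbf{L}}_1=\left[\begin{array}{ccc}{l}_{11}& 0& 0\\ {}{l}_{21}& {l}_{22}& 0\\ {}{l}_{31}-{l}_{21}{l}_{32}/{l}_{22}& 0& {l}_{33}\end{array}\right]. $$
Row 3 of the product agrees with (18) for n = 3, m = 2, and j = 1, 2, while rows 1 and 2 are left unchanged.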

The scaling operations (19)-(20) can be written using the diagonal matrix \( \mathbf{D}=\left[{d}_n\right]\in {\mathcal{R}}^{+} \), where \( {d}_n\kern0.3em =\kern0.3em 1/\sqrt{{\left[{\mathbf{Q}}_2^{\ast }{\mathbf{Q}}_2\right]}_{nn}} \) and [·] nn denotes the nth diagonal element. The desired (scaled) matrices are given by \( {\mathbf{W}}^{\ast}\kern0.3em =\kern0.3em \mathbf{D}{\mathbf{Q}}_2^{\ast } \) and L = D L 2.

For detection, the product \( {\mathbf{W}}^{\ast}\mathbf{y} \) must be formed as well. This can simply be handled by first left-augmenting y to H, and then performing QLD on the augmented matrix to form \( \overset{\sim }{\mathbf{Q}}\overset{\sim }{\mathbf{L}}\kern0.3em =\kern0.3em \left[\mathbf{y}\kern0.3em \Big|\kern0.3em \mathbf{H}\right] \). When carrying out the orthogonalization procedure, the same operations applied to the columns of H are applied to the augmented column. This results in \( \overset{\sim }{\mathbf{Q}}\kern0.3em =\kern0.3em \left[{\mathbf{0}}_{N\times 1}\left|\right.\mathbf{Q}\right] \) and \( \overset{\sim }{\mathbf{L}}\kern0.3em =\kern0.3em \left[\overset{\sim }{\mathbf{y}}\left|\right.\mathbf{L}\right] \), where Q L = H and \( \overset{\sim }{\mathbf{y}}\kern0.3em =\kern0.3em {\mathbf{Q}}^{\ast}\mathbf{y} \), with \( \overset{\sim }{\mathbf{y}} \) essentially generated as a by-product of the decomposition. Next, when carrying out operations (17)-(18) followed by (19)-(20) to puncture a given entry, these operations are also applied to the leftmost column of \( \overset{\sim }{\mathbf{L}} \), which contains \( \overset{\sim }{\mathbf{y}} \).

Algorithm 1 summarizes the steps of the proposed WLD scheme. Step 1 performs augmented QLD, while Step 2 performs nulling and normalization.
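
The two steps can be sketched in plain Python as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function names (`inner`, `ql_mgs`, `wl_decompose`), the list-of-columns storage, the random test channel, and the choice of punctured entries `nulls` (one assumed reading of a (2,2 × 2,2) pattern) are all assumptions made for the example. Entries in a row are nulled from right to left so that previously created zeros are preserved.

```python
import random

def inner(a, b):
    """Return a^* b for two complex vectors stored as lists."""
    return sum(x.conjugate() * z for x, z in zip(a, b))

def ql_mgs(cols):
    """QL-decompose H (a list of columns) by modified Gram-Schmidt.

    Returns (q, L) where q[i] is the i-th orthonormal column and L is lower
    triangular with real positive diagonal, so that q[i]^* h_j = L[i][j].
    """
    n = len(cols)
    v = [list(c) for c in cols]
    q = [None] * n
    L = [[0j] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):          # process the last column first
        norm = abs(inner(v[i], v[i])) ** 0.5
        q[i] = [x / norm for x in v[i]]
        L[i][i] = complex(norm)
        for j in range(i):                  # deflate the remaining columns
            L[i][j] = inner(q[i], v[j])
            v[j] = [x - L[i][j] * z for x, z in zip(v[j], q[i])]
    return q, L

def wl_decompose(cols, y, nulls):
    """Puncture the 0-based (row, col) entries in `nulls`.

    Returns (w, L, y_t) with w[n]^* h_j = L[n][j] and y_t[n] = w[n]^* y.
    """
    q, L = ql_mgs(cols)
    y_t = [inner(qi, y) for qi in q]        # Q^* y, by-product of the QLD
    for n in sorted({r for r, _ in nulls}):
        for m in sorted((c for r, c in nulls if r == n), reverse=True):
            ratio = L[n][m] / L[m][m]
            # operation (17): update column q_n
            q[n] = [a - b * ratio.conjugate() for a, b in zip(q[n], q[m])]
            # operation (18): update row n of L for j = 1..m
            for j in range(m + 1):
                L[n][j] -= L[m][j] * ratio
            y_t[n] -= y_t[m] * ratio        # same row operation on y-tilde
        # operations (19)-(20): rescale so that q_n has unit length
        norm = abs(inner(q[n], q[n])) ** 0.5
        q[n] = [a / norm for a in q[n]]
        L[n] = [x / norm for x in L[n]]
        y_t[n] /= norm
    return q, L, y_t

# Hypothetical 4 x 4 example with i.i.d. complex Gaussian entries.
random.seed(0)
N = 4
H_cols = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
          for _ in range(N)]
y = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
nulls = [(1, 0), (3, 2)]                    # assumed punctured positions
w, L, y_t = wl_decompose(H_cols, y, nulls)
```

No matrix inversion is needed: each punctured entry costs one division by a real diagonal element plus row and column updates, and the transformed observation is carried along with the same row operations.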

5.1 Complexity

We next analyze the complexity in terms of floating-point operations (flops) based on real multiplication (RMUL) and addition (RADD). We assume that real division and square-root operations are each equivalent to one RMUL. Also, complex multiplication requires 4 RMUL and 2 RADD, while complex addition requires 2 RADD operations. Augmented QLD requires
$$ {\theta}_1\kern0.3em =\kern0.3em \left(4{N}^3\kern0.3em -\kern0.3em {N}^2\kern0.3em -\kern0.3em N\right)\mathsf{RADD}\kern0.3em +\kern0.3em \left(4{N}^3\kern0.3em +\kern0.3em 3{N}^2\right)\mathsf{RMUL} $$
flops, while puncturing requires
$$ {\theta}_2\kern0.3em =\kern0.3em \frac{2}{3}\left(8{N}^3\kern0.3em -\kern0.3em 15{N}^2\kern0.3em +\kern0.3em 4N-12\right)\mathsf{RADD}\kern0.3em +\kern0.3em \left(\frac{16}{3}{N}^3\kern0.3em -\kern0.3em 7{N}^2\kern0.3em +\kern0.3em \frac{8}{3}N\kern0.3em -\kern0.3em 20\right)\mathsf{RMUL} $$

flops, assuming a (1,(N − 1) × 1,N − 1) structure. In comparison, [48] requires \( \left(16{N}^4-4{N}^3\right) \) RMUL and \( \left(16{N}^4-4{N}^3\right) \) RADD.
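
To get a feel for these counts, the expressions can be evaluated for a given N. The sketch below simply transcribes the formulas above; the function names `wld_flops` and `ref48_flops` are hypothetical, and exact rational arithmetic is used only to avoid rounding in the fractional coefficients.

```python
from fractions import Fraction

def wld_flops(n):
    """Return (RADD, RMUL) for augmented QLD plus puncturing (theta_1 + theta_2)."""
    qld_add = 4 * n**3 - n**2 - n
    qld_mul = 4 * n**3 + 3 * n**2
    punc_add = Fraction(2, 3) * (8 * n**3 - 15 * n**2 + 4 * n - 12)
    punc_mul = Fraction(16, 3) * n**3 - 7 * n**2 + Fraction(8, 3) * n - 20
    return qld_add + punc_add, qld_mul + punc_mul

def ref48_flops(n):
    """Return (RADD, RMUL) quoted in the text for the scheme of [48]."""
    count = 16 * n**4 - 4 * n**3
    return count, count

add_wld, mul_wld = wld_flops(4)
add_ref, mul_ref = ref48_flops(4)
print(f"N=4  WLD: {add_wld} RADD, {mul_wld} RMUL")
print(f"N=4  [48]: {add_ref} RADD, {mul_ref} RMUL")
```

For N = 4 this evaluates to 420 RADD and 524 RMUL for the proposed decomposition, against 3,840 of each for [48].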

6 Optimized detection algorithm

We next present a detection algorithm based on the proposed WLD scheme. We decompose H into T punctured LTMs having identical structure (E,S × E,S) where T = N/E as follows. The N streams are decoupled E at a time in T steps by cyclically shifting the columns of H using the permutation
$$ {\pi}_t(i)= \mod \kern0.3em \left(i\kern0.3em +\kern0.3em \left(t\kern0.3em -\kern0.3em 1\right)E\kern0.3em -\kern0.3em 1,N\right)\kern0.3em +\kern0.3em 1,\kern2em i=1,\cdots \kern0.3em ,N, $$

for t = 1,…,T. Each permuted H is then WL-decomposed into W (t) and L (t), wherein the E leftmost columns of L (t) correspond to E distinct decoupled streams. For example, to decouple all four streams of a 4 × 4 MIMO system using the (2,2 × 2,2) structure of Figure 2b, T = 2 such decompositions are required, one on [h 1 h 2 h 3 h 4] to decouple streams 1 and 2 and one on [h 3 h 4 h 1 h 2] to decouple streams 3 and 4. To allow decoupled subsets to overlap, the permutation is altered to place a stream with low reliability in several detection sets. For example, to decompose the four streams in the previous example into overlapping (2,2 × 2,2) structures, four decompositions are needed based on [h 1 h 2 h 3 h 4], [h 2 h 3 h 4 h 1], [h 3 h 4 h 1 h 2], and [h 4 h 3 h 2 h 1]. For simplicity, we assume identical constellations \( {\mathcal{X}}_n\kern0.3em =\kern0.3em \mathcal{O} \) on all layers.
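
The cyclic column ordering for the non-overlapping case can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper names `pi` and `unpermute` are hypothetical, and indices are kept 1-based to match the text.

```python
def pi(t, n_streams, e):
    """Column order of the t-th permuted channel (1-based stream indices)."""
    return [(i + (t - 1) * e - 1) % n_streams + 1
            for i in range(1, n_streams + 1)]

def unpermute(order, x_perm):
    """Undo the permutation: entry i of x_perm belongs to stream order[i]."""
    x = [None] * len(order)
    for pos, stream in enumerate(order):
        x[stream - 1] = x_perm[pos]
    return x

# 4 x 4 system decoupled two streams at a time (E = 2, T = 2).
orders = [pi(t, 4, 2) for t in (1, 2)]
print(orders)                                   # [[1, 2, 3, 4], [3, 4, 1, 2]]
print(unpermute(orders[1], ["c", "d", "a", "b"]))  # ['a', 'b', 'c', 'd']
```

The second ordering places columns h 3 and h 4 first, matching the example above, and `unpermute` restores detected symbols to their natural stream order.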

One simple approach to low-complexity MIMO detection is zero-forcing. Left-multiplying y by \( {\mathbf{W}}^{(t)\ast} \) to get \( {\mathbf{W}}^{(t)\ast}\mathbf{y}={\overset{\sim }{\mathbf{y}}}^{(t)}={\mathbf{L}}^{(t)}\mathbf{x}+{\mathbf{W}}^{(t)\ast}\mathbf{n} \) results in E decoupled streams corresponding to columns h 1,…,h E . These streams can be detected separately by slicing the layers individually to find their closest symbols.

A more powerful detection scheme takes full advantage of the punctured structure of L (t). Instead of merely slicing to find the E closest symbols in the enumeration set, we enumerate all symbol vectors in the set \( {\mathcal{O}}^E \) and then slice to find the closest symbols in the remaining S layers, recording the Euclidean distance of each resulting symbol vector along the way. First partition \( {\overset{\sim }{\mathbf{y}}}^{(t)} \), L (t), and x as
$$ {\overset{\sim }{\mathbf{y}}}^{(t)}\kern0.3em =\kern0.3em \left[\begin{array}{c}{\overset{\sim }{\mathbf{y}}}_1^{(t)}\\ {}{\overset{\sim }{\mathbf{y}}}_2^{(t)}\\ {}\end{array}\right],\kern1em {\mathbf{L}}^{(t)}\kern0.3em =\kern0.3em \left[\begin{array}{cc}{\mathbf{A}}^{\kern0.3em (t)}& \mathbf{0}\\ {}{\mathbf{B}}^{(t)}& {\mathbf{C}}^{(t)}\\ {}\end{array}\right],\kern1em \mathbf{x}\kern0.3em =\kern0.3em \left[\begin{array}{c}{\mathbf{x}}_1\\ {}{\mathbf{x}}_2\\ {}\end{array}\right], $$
where \( {\overset{\sim }{\mathbf{y}}}_1^{(t)}\kern0.3em \in \kern0.3em {\mathcal{C}}^{E\times 1} \), \( {\overset{\sim }{\mathbf{y}}}_2^{(t)}\kern0.3em \in \kern0.3em {\mathcal{C}}^{S\times 1} \), \( {\mathbf{A}}^{(t)}\kern0.3em \in \kern0.3em {\mathcal{R}}^{E\times E} \), \( {\mathbf{B}}^{(t)}\kern0.3em \in \kern0.3em {\mathcal{R}}^{S\times E} \), \( {\mathbf{C}}^{(t)}\kern0.3em \in \kern0.3em {\mathcal{R}}^{S\times S} \), \( {\mathbf{x}}_1\kern0.3em \in \kern0.3em {\mathcal{O}}^E \), and \( {\mathbf{x}}_2\kern0.3em \in \kern0.3em {\mathcal{O}}^S \). Then the symbol vector with minimum distance for the partitioned structure t is given by
$$ \begin{array}{cc}{\overset{\sim }{\mathbf{x}}}_t^{\mathrm{WL}}& \kern0.6em \triangleq \kern0.3em \underset{\mathbf{x}\in \mathcal{X}}{ \arg \min }{\parallel {\overset{\sim }{\mathbf{y}}}^{(t)}\kern0.6em -\kern0.3em {\mathbf{L}}^{\kern0.3em (t)}\mathbf{x}\parallel}^2\\ {}\kern0.6em =\kern0.3em \underset{{\mathbf{x}}_1\in {\mathcal{O}}^E}{ \arg \min}\left(\kern0.3em {\parallel {\overset{\sim }{\mathbf{y}}}_1^{(t)}\kern0.6em -\kern0.6em {\mathbf{A}}^{\kern0.3em (t)}{\mathbf{x}}_1\parallel}^2\kern0.9em +\kern0.3em {\parallel {\overset{\sim }{\mathbf{y}}}_2^{(t)}\kern0.6em -\kern0.3em {\mathbf{B}}^{(t)}{\mathbf{x}}_1\kern0.6em -\kern0.3em {\mathbf{C}}^{(t)}{\hat{\mathbf{x}}}_2\parallel}^2\right)\end{array} $$
$$ \begin{array}{cc}{\hat{\mathbf{x}}}_2& \kern0.3em =\kern0.3em {\left\lfloor \left({\overset{\sim }{\mathbf{y}}}_2^{(t)}\kern0.3em -\kern0.3em {\mathbf{B}}^{(t)}{\mathbf{x}}_1\right)/{\mathbf{C}}^{(t)}\right\rceil}_{{\mathcal{O}}^S}\end{array} $$
Note that the first argument of the slicing operator in (27) is a vector of length S since C (t) is a diagonal matrix. Here slicing is applied to the individual elements of the vector over the constellation \( \mathcal{O} \). The symbols in the vector \( {\overset{\sim }{\mathbf{x}}}_t^{\mathrm{WL}} \) are rearranged back to their normal order using (24), and the permuted vector is denoted as \( {\mathbf{x}}_t^{\mathrm{WL}} \):
$$ {\mathbf{x}}_t^{\mathrm{WL}}={\pi}_t^{-1}\left({\overset{\sim }{\mathbf{x}}}_t^{\mathrm{WL}}\right). $$
The minimum distance \( {d}_t^{\mathrm{WL}} \) itself is computed as the Euclidean distance of y from \( \mathbf{H}{\mathbf{x}}_t^{\mathrm{WL}} \) (rather than the distance of \( {\overset{\sim }{\mathbf{y}}}^{(t)} \) from \( {\mathbf{L}}^{(t)}{\overset{\sim }{\mathbf{x}}}_t^{\mathrm{WL}} \)):
$$ {d}_t^{\mathrm{WL}}={\parallel \mathbf{y}-\mathbf{H}{\mathbf{x}}_t^{\mathrm{WL}}\parallel}^2\ne {\parallel {\overset{\sim }{\mathbf{y}}}^{(t)}-{\mathbf{L}}^{(t)}{\overset{\sim }{\mathbf{x}}}_t^{\mathrm{WL}}\parallel}^2. $$
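
The enumerate-then-slice search of (26)-(27) can be sketched as below. This is a hedged illustration, not the paper's Algorithm 2: the names `slice_symbol` and `wl_detect`, the QPSK constellation, and the example matrix `L_t` are assumptions chosen to keep the example small. The search metric is the one in (26), based on the transformed observation and L (t); the final distance in (28) would be recomputed from y and H after reordering.

```python
from itertools import product

# Assumed constellation for the example (the simulations use 16/64-QAM).
QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]

def slice_symbol(z, constellation):
    """Return the constellation point closest to z."""
    return min(constellation, key=lambda s: abs(z - s))

def wl_detect(y_t, L, e, constellation):
    """Enumerate x1 over O^E, slice x2 on the diagonal block C, keep the best.

    y_t is the transformed observation, L the punctured lower triangular
    matrix whose bottom-right (N-E) x (N-E) block is diagonal.
    Returns (distance, symbol_vector) in the permuted stream order.
    """
    n = len(L)
    best = None
    for x1 in product(constellation, repeat=e):
        x = list(x1)
        for i in range(e, n):
            # cancel the enumerated streams: element i of (y2 - B x1)
            resid = y_t[i] - sum(L[i][j] * x1[j] for j in range(e))
            # C is diagonal, so each remaining layer is sliced on its own
            x.append(slice_symbol(resid / L[i][i], constellation))
        dist = sum(abs(y_t[i] - sum(L[i][j] * x[j] for j in range(n))) ** 2
                   for i in range(n))
        if best is None or dist < best[0]:
            best = (dist, x)
    return best

# Noise-free check on a hypothetical punctured matrix with E = 2, S = 2.
L_t = [[2 + 0j, 0, 0, 0],
       [0, 1.5 + 0j, 0, 0],
       [0.3 - 0.2j, -0.4 + 0.1j, 1.2 + 0j, 0],
       [0.5 + 0.5j, 0.1 - 0.3j, 0, 0.9 + 0j]]
x_true = [QPSK[0], QPSK[3], QPSK[1], QPSK[2]]
y_tilde = [sum(L_t[i][j] * x_true[j] for j in range(4)) for i in range(4)]
d_hat, x_hat = wl_detect(y_tilde, L_t, 2, QPSK)
```

Only \( \left|{\mathcal{O}}^E\right| \) candidates are visited, each completed with S independent slicing operations, instead of the full search over all N layers.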
To compute the LLRs, we expand distances similar to (26) when taking arg min to determine the symbol vector
$$ \begin{array}{cc}{\overset{\sim }{\mathbf{u}}}_{n,k,t}^{\mathrm{WL}}\kern0.6em & =\kern0.3em \underset{\mathbf{x}\in \underset{n,k}{\overset{(0)}{\mathcal{X}}}}{ \arg \min }{\parallel {\overset{\sim }{\mathbf{y}}}^{(t)}\kern0.6em -\kern0.3em {\mathbf{L}}^{\kern0.3em (t)}\mathbf{x}\parallel}^2\end{array} $$
$$ \begin{array}{cc}\kern0.6em & =\kern0.3em \underset{{\mathbf{x}}_1\in \underset{n,k}{\overset{(0)}{\mathcal{O}}}}{ \arg \min}\left(\kern0.3em {\parallel {\overset{\sim }{\mathbf{y}}}_1^{(t)}\kern0.6em -\kern0.6em {\mathbf{A}}^{\kern0.3em (t)}{\mathbf{x}}_1\parallel}^2\kern0.9em +\kern0.3em {\parallel {\overset{\sim }{\mathbf{y}}}_2^{(t)}\kern0.6em -\kern0.3em {\mathbf{B}}^{(t)}{\mathbf{x}}_1\kern0.6em -\kern0.3em {\mathbf{C}}^{(t)}{\hat{\mathbf{x}}}_2\parallel}^2\right)\end{array} $$
where \( {\mathcal{O}}_{n,k}^{(0)}\kern0.3em =\kern0.3em \left\{{\mathbf{x}}_1\kern0.3em \in \kern0.3em {\mathcal{O}}^E:{b}_{n,k}\kern0.3em =\kern0.3em 0\right\} \), and \( {\hat{\mathbf{x}}}_2 \) is defined as in (27) but with \( {\mathbf{x}}_1\kern0.3em \in \kern0.3em {\mathcal{O}}_{n,k}^{(0)} \). A similar computation is needed for the other hypothesis in the set \( {\mathcal{X}}_{n,k}^{(1)} \) to determine:
$$ \begin{array}{cc}{\overset{\sim }{\mathbf{v}}}_{n,k,t}^{\mathrm{WL}}\kern0.6em & =\kern0.3em \underset{\mathbf{x}\in \underset{n,k}{\overset{(1)}{\mathcal{X}}}}{ \arg \min }{\parallel {\overset{\sim }{\mathbf{y}}}^{(t)}\kern0.6em -\kern0.3em {\mathbf{L}}^{\kern0.3em (t)}\mathbf{x}\parallel}^2.\end{array} $$
Then (24) is applied to permute the symbols in \( {\overset{\sim }{\mathbf{u}}}_{n,k,t}^{\mathrm{WL}} \) and \( {\overset{\sim }{\mathbf{v}}}_{n,k,t}^{\mathrm{WL}} \) and form the reordered symbol vectors \( {\mathbf{u}}_{n,k,t}^{\mathrm{WL}} \) and \( {\mathbf{v}}_{n,k,t}^{\mathrm{WL}} \):
$$ \begin{array}{cc}{\mathbf{u}}_{n,k,t}^{\mathrm{WL}}& ={\pi}_t^{-1}\left({\overset{\sim }{\mathbf{u}}}_{n,k,t}^{\mathrm{WL}}\right),\\ {}{\mathbf{v}}_{n,k,t}^{\mathrm{WL}}& ={\pi}_t^{-1}\left({\overset{\sim }{\mathbf{v}}}_{n,k,t}^{\mathrm{WL}}\right).\end{array} $$
The LLR values are then computed as
$$ {\varLambda}_{n,k,t}^{\mathrm{WL}}={\parallel \mathbf{y}-\mathbf{H}{\mathbf{u}}_{n,k,t}^{\mathrm{WL}}\parallel}^2-{\parallel \mathbf{y}-\mathbf{H}{\mathbf{v}}_{n,k,t}^{\mathrm{WL}}\parallel}^2, $$

for n = 1,…,N, \( k\kern0.3em =\kern0.3em 1,\cdots \kern0.3em ,{\log}_2\left|\mathcal{O}\right| \), and t = 1,…,T.

Finally, one simple way to approximate the ML distance and LLRs is by selecting the minimum over all t in (28) and (31), respectively:
$$ \begin{array}{cc}{d}^{\mathrm{ML}}& \approx \underset{t=1,\cdots \kern0.3em ,T}{ \min}\left({d}_t^{\mathrm{WL}}\right),\kern2em {\varLambda}_{n,k}^{\mathrm{ML}}\approx \underset{t=1,\cdots \kern0.3em ,T}{ \min}\left({\varLambda}_{n,k,t}^{\mathrm{WL}}\right).\end{array} $$

Tighter LLRs can be produced by tracking global minimum distances rather than just minimizing over the per-stream LLRs. Specifically, when using (26)-(27) for every stream t, t = 1,…,T, instead of just retaining the minimum distance and its corresponding arg min symbol vector, a list of all \( \left|{\mathcal{O}}^E\right| \) distances and their corresponding symbol vectors is populated. These T lists are then used to compute the LLR values by selecting the minimum distances for symbol vectors from these lists having 0 or 1 in the desired bit position where the LLR value is to be computed.
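
A minimal sketch of this list-based LLR computation is given below. It assumes each candidate has already been reordered to the natural stream order and mapped to its bit labels, and that its distance was computed as in (28); the function name `list_llrs`, the data layout, and the toy numbers are hypothetical. The sign convention follows (31): distance of the best bit-0 candidate minus distance of the best bit-1 candidate.

```python
def list_llrs(candidate_lists, n_bits):
    """Compute LLRs from T lists of (distance, bits) candidates.

    For each bit position, the smallest distance among all candidates with
    bit 0 and the smallest among all candidates with bit 1 are tracked
    globally across the T lists, and their difference is returned.
    """
    inf = float("inf")
    d0 = [inf] * n_bits
    d1 = [inf] * n_bits
    for candidates in candidate_lists:
        for dist, bits in candidates:
            for pos, bit in enumerate(bits):
                if bit == 0:
                    d0[pos] = min(d0[pos], dist)
                else:
                    d1[pos] = min(d1[pos], dist)
    return [a - b for a, b in zip(d0, d1)]

# Toy example: T = 2 lists, two bit positions in total.
lists = [
    [(1.0, (0, 0)), (4.0, (1, 0))],
    [(0.5, (0, 1)), (3.0, (1, 1))],
]
llrs = list_llrs(lists, 2)
print(llrs)   # [-2.5, 0.5]
```

In the toy example, no single list contains a candidate with a 1 in the second bit position except the second list, which is why pooling the lists gives every bit position a finite counter-hypothesis distance.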

The pseudo-code in Algorithm 2 summarizes the steps of the proposed WLD detection algorithm. It produces tight LLRs according to the previous discussion, using the equations marked with (). Figure 3 shows a block diagram describing the dataflow of the algorithm. For N = 2 layers, the algorithm is optimal and reduces to that in [45] since the Ws are unitary in this case. When E = 1 and distances in (28) are computed based on L instead of H, the algorithm reduces to that in [48].
Figure 3. Block diagram of the proposed WL detection algorithm.

6.1 Multi-core detector architectures

Depending on the target throughput and the number of antennas N in the MIMO system, multiple 2×2 detector cores can be configured to construct an N-stream MIMO detector. Figure 4 shows a four-sided fully parallel 4 × 4 MIMO detector that uses four cores to process the four streams. Here distance buffering and accumulation are needed before LLR processing in order to adjust the individual LLRs according to the earlier discussion. It is assumed that an external digital signal processor (DSP) supplies the WLD matrix inputs for all four streams according to the decompositions in (25). If chip area is the constraining factor, a MIMO detector can be built using a single core that is time-multiplexed among the four streams.
Figure 4. Block diagram of four-sided MAP detector.

7 Performance simulation results

The coded bit-error rate (BER) performance of the proposed detection algorithm was evaluated through simulations of a MIMO system employing four transmit and four receive antennas. The channel encoder is based on the LTE turbo encoder specification [52] with interleaver length 1,024, using 16-QAM and 64-QAM modulation constellations. The channel entries are assumed to be independent and identically distributed (i.i.d.) complex Gaussian random variables. At the receiver end, we assume perfect channel knowledge. The turbo decoder implements the true A Posteriori Probability algorithm, and performs four (full) decoding iterations.

Figure 5 compares the BER versus SNR per receive antenna of the proposed WLD scheme with E = 1 and 2 structures, versus ML, ZF, the approach of [48], and the sphere decoder with radius clipping [30], for 16-QAM. Both overlapping and non-overlapping subsets are considered. Furthermore, two scenarios for distance computations in (28) are followed; one based on H and one on L. The curve labeled WLD-H1 corresponds to the proposed WLD detection scheme with E = 1 and distance computations based on H in (28), while the curve labeled WLD-L1 corresponds to the proposed scheme with E = 1 but with distance computations based on L in (28). Similarly for the curves labeled WLD-H2 and WLD-L2, but with E=2 and non-overlapping decomposition sets. Finally the curve labeled WLD-H2ov corresponds to E=2 but with four overlapping (2,2×2,2) structures.
Figure 5. BER vs. SNR plots for 16-QAM constellation.

The plots in the figure demonstrate that the proposed WLD algorithm with E=2 using H distances with overlapping subsets performs virtually identically to ML, and is less than 0.1 dB away from ML without overlapping subsets. Furthermore, the WLD-L2 scheme performs much worse than WLD-H2. For decompositions with single streams, L distances perform better than H distances; the WLD-H1 scheme exhibits an error floor. Compared to the well-known sphere decoding algorithm with radius clipping [30], the proposed WLD-based schemes achieve better performance, especially with E=2 structures.

Figure 6 compares the BER performance of the WLD schemes using 64-QAM constellations. The plots demonstrate again that the WLD scheme with E=2 using H distances and overlapping subsets performs very close to ML. It is worth mentioning that WLD-H2 again performs very close to WLD-H2ov (and hence to ML), which has significant hardware saving implications. Note here that the WLD-L2ov scheme using L-distances with E=2 overlapping decompositions performs the worst among the WLD schemes.
Figure 6. BER vs. SNR plots for 64-QAM constellation.

Finally, another important advantage of the proposed schemes is the significant reduction in simulation time observed when generating the BER plots. On a typical multi-core server machine, one simulation point takes on the order of a few hours to complete, while the sphere decoder and ML approaches require a few days. This is valuable for designers performing rapid evaluations and tradeoff analyses of various MIMO system features.

8 Conclusions

A low-complexity MIMO subspace detection algorithm has been presented. By decomposing the channel matrix of an M-stream MIMO system into a generalized elementary matrix structure, the detection problem becomes a generalization of that of a two-stream detection problem, which admits a simple architecture suitable for high-speed implementation. Multiple two-stream detector cores can be employed in parallel to improve detection throughput. The channel decomposition scheme is cast in terms of the standard Gram-Schmidt QL decomposition, which is supported in most modern DSPs.


Authors’ Affiliations

American University of Beirut, Bliss Street, Beirut, 11-0236, Lebanon


  1. A Paulraj, R Nabar, D Gore, Introduction to Space-Time Wireless Communications (Cambridge Univ. Press, Cambridge, U.K, 2003).
  2. E Viterbo, J Boutros, A universal lattice code decoder for fading channels. IEEE Trans. Inf. Theory. 45(5), 1639–1642 (1999).
  3. O Damen, A Chkeif, J-C Belfiore, Lattice code decoder for space-time codes. IEEE Commun. Lett. 4(5), 161–163 (2000).
  4. E Agrell, T Eriksson, A Vardy, K Zeger, Closest point search in lattices. IEEE Trans. Inf. Theory. 48(8), 2201–2214 (2002).
  5. GJ Foschini, Layered space-time architecture for wireless communication in a fading environment when using multi-element antennas. Bell Labs Tech. J. 1(2), 41–59 (1996).
  6. Y Jiang, J Li, WW Hager, Joint transceiver design for MIMO communications using geometric mean decomposition. IEEE Trans. Signal Process. 53(10), 3791–3803 (2005).
  7. Y Jiang, J Li, WW Hager, Uniform channel decomposition for MIMO communications. IEEE Trans. Signal Process. 53(11), 4283–4294 (2005).
  8. SL Ariyavisitakul, J Zheng, E Ojard, J Kim, Subspace beamforming for near-capacity MIMO performance. IEEE Trans. Signal Process. 56(11), 5729–5733 (2008).
  9. Y Chen, ST Brink, in Proc. IEEE Int. Symp. Personal Indoor and Mobile Radio Commun. (PIMRC). Near-capacity MIMO subspace detection (Toronto, Canada, 2011), pp. 1733–1737.
  10. E Biglieri, R Calderbank, A Constantinides, A Goldsmith, A Paulraj, HV Poor, MIMO Wireless Communications (Cambridge Univ. Press, Cambridge, U.K, 2007).
  11. B Hassibi, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal, Process. (ICASSP). An efficient square-root algorithm for BLAST (Istanbul, Turkey, 2000), pp. 5–9.
  12. GD Golden, JG Foschini, RA Valenzuela, PW Wolniansky, Detection algorithm and initial laboratory results using V-BLAST space-time communication architecture. IEE Electron. Lett. 35(1), 14–15 (1999).
  13. D Wübben, R Böhnke, V Kühn, K Kammeyer, in Proc. IEEE Vehicular Technol. Conf. (VTC). MMSE extension of V-BLAST based on sorted QR decomposition (Orlando, Florida, 2003), pp. 508–512.
  14. C Studer, S Fateh, D Seethaler, ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation. IEEE Trans. Syst. Sci. Cybern. 47(7), 1754–1765 (2011).
  15. E Viterbo, E Biglieri, in 14ème Colloque GRETSI. A universal decoding algorithm for lattice codes (Juan-Les-Pins, France, 1993), pp. 611–614.
  16. B Hochwald, S ten Brink, Achieving near-capacity on a multiple-antenna channel. IEEE Trans. Commun. 51(3), 389–399 (2003).
  17. B Hassibi, H Vikalo, On sphere decoding algorithm. I. Expected complexity. IEEE Trans. Signal Process. 53(8), 2806–2818 (2005).
  18. J Jaldén, B Ottersten, On the complexity of sphere decoding in digital communications. IEEE Trans. Signal Process. 53(4), 1474–1484 (2005).
  19. D Seethaler, J Jaldén, C Studer, H Bölcskei, On the complexity distribution of sphere decoding. IEEE Trans. Inf. Theory. 57(9), 5754–5768 (2011).
  20. K-W Wong, C-Y Tsui, RS-K Cheng, W-H Mow, in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 3. A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels (Scottsdale, Arizona, 2002), pp. 273–276.
  21. M Wenk, M Zellweger, A Burg, N Felber, W Fichtner, in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS). K-best MIMO detection VLSI architectures achieving up to 424Mbps (Island of Kos, Greece, 2006), pp. 1151–1154.
  22. S Mondal, A Eltawil, C-A Shen, K Salama, Design and implementation of a sort free K-best sphere decoder. IEEE Trans. VLSI Syst. 18(10), 1497–1501 (2010).
  23. L Liu, F Ye, X Ma, T Zhang, J Ren, A 1.1-Gb/s 115-pJ/bit configurable MIMO detector using 0.13 μm CMOS technology. IEEE Trans. Circuits Syst. II. 57(9), 701–705 (2010).
  24. C-A Shen, A Eltawil, K Salama, Evaluation framework for K-best sphere decoders. J. Circuits Syst. Comput. 19(5), 975–995 (2010).
  25. M Shabany, P Gulak, A 675 Mbps, 4×4 64-QAM K-Best MIMO detector in 0.13 μm CMOS. IEEE Trans. VLSI Syst. 20(1), 135–147 (2012).
  26. M Mahdavi, M Shabany, Novel MIMO detection algorithm for high-order constellations in the complex domain. IEEE Trans. VLSI Syst. 21(5), 834–847 (2013).
  27. D Garrett, L Davis, S ten Brink, B Hochwald, G Knagge, Silicon complexity for maximum likelihood MIMO detection using spherical decoding. IEEE J. Solid-State Circuits. 39(9), 1544–1552 (2004).
  28. Z Guo, P Nilsson, in Proc. IEEE CAS Symp. Emerging Technologies, 1. A VLSI architecture of the Schnorr-Euchner decoder for MIMO systems (Shanghai, China, 2004), pp. 65–68.
  29. A Burg, M Borgmann, M Wenk, M Zellweger, W Fichtner, H Bölcskei, VLSI implementation of MIMO detection using the sphere decoding algorithm. IEEE J. Solid-State Circuits. 40(7), 1566–1577 (2005).
  30. C Studer, A Burg, H Bölcskei, Soft-output sphere decoder: algorithms and VLSI implementation. IEEE J. Sel. Areas Commun. 26(2), 290–300 (2008).
  31. C-H Yang, D Markovic, A flexible DSP architecture for MIMO sphere decoding. IEEE Trans. Circuits Syst. I. 56(10), 2301–2314 (2009).
  32. C-H Yang, D Markovic, in European Solid-State Circuits Conf. (ESSCIRC). A 2.89 mW 50 GOPS 16×16 16-core MIMO sphere decoder in 90 nm CMOS (Athens, Greece, 2009), pp. 344–347.
  33. F Borlenghi, EM Witte, G Ascheid, H Meyr, A Burg, in IEEE Asian Solid State Circuits Conf. (A-SSCC). A 772 Mbit/s 8.81 bit/nJ 90 nm CMOS soft-input soft-output sphere decoder (Jeju, Korea, 2011), pp. 297–300.
  34. L Liu, J Lofgren, P Nilsson, Area-efficient configurable high-throughput signal detector supporting multiple MIMO modes. IEEE Trans. Circuits Syst. I. 59(9), 2085–2096 (2012).
  35. Y Sun, JR Cavallaro, Trellis-search based soft-input soft-output MIMO detector: algorithm and VLSI architecture. IEEE Trans. Signal Process. 60(5), 2617–2627 (2012).
  36. X Chen, G He, J Ma, VLSI implementation of a high-throughput iterative fixed-complexity sphere decoder. IEEE Trans. Circuits Syst. II. 60(5), 272–276 (2013).
  37. MM Mansour, S Alex, MA Jalloul, Reduced complexity soft-output MIMO sphere detectors – Part I Algorithmic optimizations. IEEE Trans. Signal Process. 62(21), 5505–5520 (2014).
  38. MM Mansour, S Alex, MA Jalloul, Reduced complexity soft-output MIMO sphere detectors – Part II Architectural optimizations. IEEE Trans. Signal Process. 62(21), 5521–5535 (2014).
  39. M-Y Huang, P-Y Tsai, Toward multi-gigabit wireless: design of high-throughput MIMO detectors with hardware-efficient architecture. IEEE Trans. Circuits Syst. I. 61(2), 613–624 (2014).
  40. J Choi, A Singer, J Lee, N-I Cho, Improved linear soft-input soft-output detection via soft feedback successive interference cancellation. IEEE Trans. Commun. 58(3), 986–996 (2010).
  41. RC de Lamare, R Sampaio-Neto, Adaptive reduced-rank equalization algorithms based on alternating optimization design techniques for MIMO systems. IEEE Trans. Veh. Technol. 60(6), 2482–2494 (2011).
  42. T Hwang, Y Kim, H Park, Energy spreading transform approach to achieve full diversity and full rate for MIMO systems. IEEE Trans. Signal Process. 60(12), 6547–6560 (2012).
  43. D Persson, J Kron, M Skoglund, EG Larsson, Joint source-channel coding for the MIMO broadcast channel. IEEE Trans. Signal Process. 60(4), 2085–2090 (2012).
  44. RC de Lamare, Adaptive and iterative multi-branch MMSE decision feedback detection algorithms for multi-antenna systems. IEEE Trans. Wireless Commun. 12(10), 5294–5308 (2013).
  45. M Siti, MP Fitz, in Proc. IEEE Int. Conf. Commun. (ICC), 4. A novel soft-output layered orthogonal lattice detector for multiple antenna communications (Istanbul, Turkey, 2006), pp. 1686–1691.
  46. M Siti, MP Fitz, in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC). On layer ordering techniques for near-optimal MIMO detectors (Hong Kong, 2007), pp. 1199–1204.
  47. MS Yee, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal, Process. (ICASSP), 3. Max-log-MAP sphere decoder (Philadelphia, PA, 2005), pp. 1013–1016.
  48. E Ojard, S Ariyavisitakul, Method and system for approximate maximum likelihood (ML) detection in a multiple input multiple output (MIMO) receiver. U. S. Patent No. 20090074114 (2009).
  49. GH Golub, CFV Loan, Matrix Computations, 3rd edn (Johns Hopkins Univ. Press, Baltimore, MD, 1996).
  50. RC-H Chang, C-H Lin, K-H Lin, C-L Huang, F-C Chen, Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm for MIMO systems. IEEE Trans. Circuits Syst. I. 57(5), 1095–1102 (2010).
  51. D Wübben, R Böhnke, J Rinas, V Kühn, K Kammeyer, Efficient algorithm for decoding layered space-time codes. IEE Electron. Lett. 37(22), 1348–1350 (2001).
  52. Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Channels and Modulation. TS 36.211 3GPP.


© Mansour; licensee Springer. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.