
# A low-complexity MIMO subspace detection algorithm

- Mohammad M Mansour^{1}

**2015**:95

https://doi.org/10.1186/s13638-015-0274-9

© Mansour; licensee Springer. 2015

**Received:** 24 October 2014 **Accepted:** 9 February 2015 **Published:** 28 March 2015

## Abstract

A low-complexity multiple-input multiple-output (MIMO) subspace detection algorithm is proposed. It is based on decomposing a MIMO channel into multiple subsets of decoupled streams that can be detected separately. The new scheme employs triangular decomposition followed by elementary matrix operations to transform the channel into a generalized elementary matrix whose structure matches the subsets of streams to be detected. The proposed approach avoids matrix inversion and allows subsets to overlap, thus achieving better diversity gain. An optimized detector architecture based on a 2-by-2 ML detector core is also presented. Simulations demonstrate that the proposed algorithm performs within a few tenths of a dB of the optimum detection algorithm.

## Keywords

- Maximum likelihood
- MIMO detection
- LLR
- Subspace detection

## 1 Introduction

With the advent of smart mobile devices, the demand for wireless access to broadband networks has been rapidly increasing in the past decade. Service providers are constantly faced with a major challenge of meeting contradictory requirements for higher data rates, improved quality of service (QoS), and better network capacity, while maintaining the transmit power and bandwidth budgets. To achieve these targets, novel enabling technologies need to be considered.

Exploiting the spatial dimension by using multiple-input multiple-output (MIMO) antenna systems is one of the key-enabling technologies for achieving high spectral efficiency in modern wireless communications standards. MIMO technology improves both the spectral efficiency and the QoS of wireless communication systems. However, detection of spatially multiplexed MIMO streams plays a key role in receiver design both in terms of performance and complexity [1] and has remained an active area of research. Schemes for MIMO detection are either based on joint-stream detection to separate the streams, such as maximum likelihood (ML) detection [2-4], or subset-stream detection such as those employed in successive interference cancellation (SIC), interference coordination, transmit beamforming, network MIMO, and coordinated multi-point transmission and reception [5-9].

A plethora of MIMO detectors have appeared in the literature on this subject, offering various performance-complexity tradeoffs. Suboptimal zero-forcing (ZF) and minimum mean-squared error (MMSE) detectors [10], as well as the nonlinear parallel and successive interference cancellation schemes [11-14], require relatively low complexity but sacrifice performance. On the other hand, tree-search or list-based detectors require substantially higher complexity but can offer (near-)ML performance such as the well-known sphere decoding algorithm [2-4,15-19]. Other tree-search schemes, such as the K-Best algorithm [20-26], address the non-deterministic throughput aspects of sphere decoders. Practical implementation aspects have been investigated in [18,23,25-39].

Subspace detection based on channel decomposition offers a good compromise between performance and complexity. In [6,7], a method was presented in which the effective MIMO channel matrix **H** is uniformly decomposed into identical parallel subchannels using geometric mean decomposition (GMD). In [8], a related scheme was presented in which **H** is block-wise diagonalized to narrow down the number of jointly detected streams to two, and then GMD is applied to balance capacity within each pair of subchannels. The scheme was generalized in [9] to allow joint detection of several overlapping subchannels using QR decomposition (QRD). By allowing subspaces to overlap, additional diversity can be gathered by placing a data stream of low reliability into several detection sets. Other subspace methods employ projection operators and lists to generate candidates for interference cancellation in equalization schemes (e.g., [40-44]).

The LORD algorithm proposed in [45,46] can be viewed as a special class of subspace MIMO detectors. It achieves ML performance (in the Max-log-MAP [47] sense) on two transmit antennas, but its performance degrades when the number of antennas increases. In [48], the LORD algorithm was generalized to four transmit antennas by using matrix inversion to decompose **H** into single streams.

*Contributions*: In this paper, we propose an efficient near-ML soft-output MIMO detection algorithm that jointly detects subsets of decoupled streams by transforming **H** into a generalized elementary matrix. A matrix transformation is analytically derived to induce the desired structure on **H**. It applies QL decomposition followed by elementary matrix operations on **H**, in a manner analogous to the modified QL decomposition algorithm based on the Gram-Schmidt orthogonalization procedure, and thus avoids computationally expensive operations such as matrix (pseudo-)inversion. Various decomposition structures are investigated, including the option of allowing the decomposition sets to overlap, which results in additional diversity gain. Furthermore, we show that a parallel MIMO detector can be constructed from 2×2 component detector cores that decouple the streams in the subsets in parallel. Finally, we show that for two streams, the proposed algorithm is optimal and reduces to the LORD algorithm [45,46]. When the subsets include single streams only, the algorithm reduces to that of [48]. Computer simulations of the bit error rate of a MIMO system demonstrate that the proposed algorithm attains near-ML performance and outperforms both the sphere decoder [30] and the near-ML algorithm of [48].

The rest of the paper is organized as follows. Section 2 introduces the system model, and Section 3 reviews ML detection for two streams. Sections 4 and 5 present the proposed matrix decomposition scheme and a simplified construction using QLD. The detection algorithm is detailed in Section 6. Section 7 presents simulation results, while Section 8 ends the paper with concluding remarks.

## 2 System model

Consider a MIMO system with *N* transmit (Tx) antennas and *M* ≥ *N* receive (Rx) antennas. Assuming perfect channel knowledge at the receiver, the equivalent complex baseband input-output system relation can be modeled as **y** = **Hx** + **n**, where \( \mathbf{y} \in \mathcal{C}^{M\times 1} \) is the received complex signal vector, \( \mathbf{H} \in \mathcal{C}^{M\times N} \) is the complex channel matrix, and \( \mathbf{x} = {\left[x_1\; x_2 \cdots x_N\right]}^T \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_N \) is the *N* × 1 transmitted complex symbol vector. Each symbol *x*_{n} belongs to a complex constellation \( \mathcal{X}_n \) of size \( Q_n = 2^{q_n} \), normalized so that \( \mathbb{E}\left[x_n^{\ast}x_n\right] = 1 \), and is formed from a bit-interleaved coded sequence of *q*_{n} bits \( \mathbf{b}_n = \left(b_{n,1}, b_{n,2}, \cdots, b_{n,q_n}\right) \) over the binary field \( \mathcal{F}_2 \). The effect of thermal noise is modeled as a zero-mean circularly symmetric complex Gaussian random vector \( \mathbf{n} \in \mathcal{C}^{M\times 1} \) with covariance matrix \( \mathbb{E}\left[\mathbf{n}\mathbf{n}^{\ast}\right] = \sigma_{\mathbf{n}}^2\mathbf{I}_M \). The signal-to-noise ratio (SNR) is defined as \( \mathrm{SNR} = N/\sigma_{\mathbf{n}}^2 \). Here \( \mathbb{E}\left[\cdot\right] \) stands for the expected value, (·)^{T} and (·)^{∗} for the transpose and conjugate transpose, and **I**_{M} for the *M* × *M* identity matrix.

Joint-stream detection searches for the symbol vector closest to **y** under the Euclidean distance metric

\[ d(\mathbf{x}) = \left\Vert \mathbf{y}-\mathbf{H}\mathbf{x}\right\Vert^2 = \left\Vert \tilde{\mathbf{y}}-\mathbf{L}\mathbf{x}\right\Vert^2, \]

where **H** = **QL** is the QL decomposition (QLD) [49] of **H** into a unitary matrix \( \mathbf{Q} \in \mathcal{C}^{M\times N} \) and a lower triangular matrix (LTM) \( \mathbf{L} \in \mathcal{C}^{N\times N} \) with positive real diagonal elements, and \( \tilde{\mathbf{y}} = \mathbf{Q}^{\ast}\mathbf{y} = \mathbf{L}\mathbf{x} + \mathbf{Q}^{\ast}\mathbf{n} \in \mathcal{C}^{N\times 1} \) denotes the transformed received signal vector. Since **Q** is unitary, it preserves Euclidean norm as well as noise statistics. Hence, for equiprobable symbols, a ‘hard-decision’ ML MIMO detector finds \( \mathbf{x}^{\mathrm{ML}} \in \mathcal{X} \) such that **Hx**^{ML} is closest to **y** in \( \mathcal{C}^{M\times 1} \) (or equivalently, **Lx**^{ML} is closest to \( \tilde{\mathbf{y}} \) in \( \mathcal{C}^{N\times 1} \)). This is essentially an integer least-squares problem [4] of the form

\[ \mathbf{x}^{\mathrm{ML}} = \arg\min_{\mathbf{x}\in\mathcal{X}} \left\Vert \tilde{\mathbf{y}}-\mathbf{L}\mathbf{x}\right\Vert^2. \]

A soft-output detector additionally generates log-likelihood ratios (LLRs) for the coded bits. Under the max-log approximation, the LLR of bit *b*_{n,k} is given by the difference between the minimum distances \( \min d(\mathbf{x}) \) taken over \( \mathcal{X}_{n,k}^{(0)} \) and over \( \mathcal{X}_{n,k}^{(1)} \), where \( \mathcal{X}_{n,k}^{(0)} = \left\{\mathbf{x}\in\mathcal{X}: b_{n,k}=0\right\} \) and \( \mathcal{X}_{n,k}^{(1)} = \left\{\mathbf{x}\in\mathcal{X}: b_{n,k}=1\right\} \) are the subsets of symbol vectors in \( \mathcal{X} \) whose *k*th bit in the *n*th symbol is 0 and 1, respectively.
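As a rough illustration of the exhaustive search and the max-log bit metrics described in this section, the following sketch enumerates every symbol vector of a small QPSK system. The bit-to-symbol labeling, the function names, and the sign convention of the returned differences (positive favors bit 0) are assumptions made for the example, not taken from the paper.

```python
import itertools

# QPSK with unit average energy; this bit labeling is an assumed, illustrative mapping.
QPSK = {(0, 0): complex(1, 1) / 2**0.5, (0, 1): complex(1, -1) / 2**0.5,
        (1, 0): complex(-1, 1) / 2**0.5, (1, 1): complex(-1, -1) / 2**0.5}

def distance(y, H, x):
    """Squared Euclidean distance ||y - H x||^2 (H given as a list of rows)."""
    return sum(abs(y[i] - sum(H[i][j] * x[j] for j in range(len(x)))) ** 2
               for i in range(len(y)))

def ml_detect(y, H):
    """Exhaustive search: returns (ML labels, ML distance, per-bit distance differences)."""
    N = len(H[0])
    best = None
    min_d = {}  # min_d[(n, k, b)]: smallest distance with bit k of symbol n equal to b
    for combo in itertools.product(list(QPSK), repeat=N):
        d = distance(y, H, [QPSK[lab] for lab in combo])
        if best is None or d < best[1]:
            best = (combo, d)
        for n, lab in enumerate(combo):
            for k, b in enumerate(lab):
                if (n, k, b) not in min_d or d < min_d[(n, k, b)]:
                    min_d[(n, k, b)] = d
    # Assumed sign convention: a positive difference favors bit value 0.
    diff = {(n, k): min_d[(n, k, 1)] - min_d[(n, k, 0)]
            for n in range(N) for k in range(2)}
    return best[0], best[1], diff
```

The search visits all \( \prod_n Q_n \) candidates, which is the cost that the decompositions in the following sections aim to avoid.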

## 3 Optimum 2×2 soft-output MIMO detection

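For *N* = 2, a simplification [45] can be applied to reduce the number of computations from *Q*_{1}·*Q*_{2} to *Q*_{1} + *Q*_{2} by triangularizing the channel matrix as **H** = **Q**_{1}**L**_{1}, with **Q**_{1} being unitary and **L**_{1} being a LTM, leading to:

\[ \mathbf{Q}_1^{\ast}\mathbf{y} = \mathbf{L}_1\mathbf{x} + \mathbf{Q}_1^{\ast}\mathbf{n}, \qquad \mathbf{L}_1 = \begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}. \]

For each enumerated candidate of *x*_{1}, the closest *x*_{2} follows by slicing after interference cancellation, so the minimum distances needed for the bits of *x*_{1} can simply be obtained as minima over the *Q*_{1} enumerated candidates. For the bits of *x*_{2} however, we triangularize **H** as **Q**_{2}**L**_{2} so that a zero appears in the upper left corner:

\[ \mathbf{Q}_2^{\ast}\mathbf{y} = \mathbf{L}_2\mathbf{x} + \mathbf{Q}_2^{\ast}\mathbf{n}, \qquad \mathbf{L}_2 = \begin{bmatrix} 0 & l^{\prime}_{12} \\ l^{\prime}_{21} & l^{\prime}_{22} \end{bmatrix}. \]

The minimum distances needed for the bits of *x*_{2} are given by the analogous minima over the *Q*_{2} enumerated candidates of *x*_{2}.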

Since **Q**_{1}, **Q**_{2} are unitary, the ML solutions in (6), (10) are identical. To find the hard-decision ML solution, only one-sided QLD is needed on either layer 1 or 2. A list of *Q*_{n} distances \( \left\{d_n(x),\; \forall x \in \mathcal{X}_n\right\} \) is generated by enumerating all symbols \( x \in \mathcal{X}_n \), *n* = 1 or *n* = 2, and the minimum is selected. However, to generate soft LLRs, two-sided decompositions are needed, and *two* lists of distances for *n* = 1 *and* *n* = 2 must be computed to select the appropriate minima in (8) and (11).
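The two-sided enumeration can be sketched for a 2×2 QPSK system as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the QL factors are formed by Gram-Schmidt on the two columns, and the second side is obtained by swapping the roles of the two columns, which is taken here as equivalent to the second triangularization up to a column permutation.

```python
import math

S = 1 / math.sqrt(2)
QPSK = [complex(a, b) * S for a in (1, -1) for b in (1, -1)]

def slice_qpsk(z):
    """Nearest QPSK point: decide the sign of each real dimension."""
    return complex(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1) * S

def inner(a, b):
    """Inner product a^* b of two complex vectors."""
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))

def enumerate_and_slice(y, h_enum, h_slice):
    """Return [(distance, x_enum, x_slice)]: enumerate one layer, slice the other."""
    # 2x2 QL factors via Gram-Schmidt, starting from the sliced (last) column.
    l22 = math.sqrt(inner(h_slice, h_slice).real)
    q2 = [v / l22 for v in h_slice]
    l21 = inner(q2, h_enum)
    r = [a - l21 * b for a, b in zip(h_enum, q2)]
    l11 = math.sqrt(inner(r, r).real)
    q1 = [v / l11 for v in r]
    y1, y2 = inner(q1, y), inner(q2, y)
    out = []
    for xe in QPSK:
        resid = y2 - l21 * xe              # cancel the enumerated symbol
        xs = slice_qpsk(resid / l22)       # closest symbol on the other layer
        d = abs(y1 - l11 * xe) ** 2 + abs(resid - l22 * xs) ** 2
        out.append((d, xe, xs))
    return out
```

Calling `enumerate_and_slice(y, h1, h2)` and `enumerate_and_slice(y, h2, h1)` yields the two lists of distances, for a total of *Q*_{1} + *Q*_{2} distance evaluations.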

## 4 Extensions to higher-order layers

The two-layer simplification does not extend directly to *N* = 3 or more layers because the structure of the lower triangular matrix **L** includes off-diagonal terms that prevent searching for the ML solution by enumerating symbols on one layer and finding the minima through slicing individually on all other layers in parallel. More specifically, in Figure 1a, the presence of the red-marked entries in the LTM implies that determining the ML solution requires enumerating symbols on *N* − 1 layers and slicing only on the last layer, as is typically done in tree-based detectors (e.g., [30]), and hence still requires \( O\left(\prod_n Q_n\right) \) complexity rather than \( O\left(\sum_n Q_n\right) \).

One desirable structure of **H** for a four-layer MIMO system would be as shown in Figure 1b, in which the red-marked entries are zeroed-out. Here, by enumerating symbols on layer 1, the minimum distances and associated symbols on layers 2 to 4 can be searched for in parallel through slicing only on the corresponding layers, similar to the two-layer system. This suffices to compute the LLRs associated with the bits on layer-1 symbol. A similar process is repeated by decomposing **H** according to the structures shown in Figure 1c,d,e [48] to compute the LLRs for bits associated with layers 2 to 4.

In general, a decomposition structure is characterized by 1) the number of layers over which symbols are enumerated (*enumeration set*), 2) the submatrix structure used to propagate these enumerated symbols and cancel their interference effect from the remaining layers (*interference cancellation set*), and 3) the number of layers in which the minimum distance and associated symbol can be obtained by slicing after interference cancellation (*slicer set*). Let *E* denote the size of the enumeration set, *S* the size of the slicer set, and *S* × *E* the size of the interference cancellation set. We refer to this structure using the triplet (*E*, *S* × *E*, *S*). For example, in Figure 2a, we enumerate over *E* = 1 layer only, cancel interference from this layer to the three other layers using a 3 × 1 structure, and slice over *S* = 3 layers. In the structure in Figure 2b, we enumerate over *E* = 2 layers, cancel interference using a 2 × 2 structure, and slice over *S* = 2 layers.

LLR values are generated for bits in symbols included in the enumeration set only. Complementary structures that enumerate symbols on other layers are required to generate their respective LLRs. For example, the (1,3 × 1,3) structure requires three other similar structures to generate LLRs for layers 2 to 4 (see Figure 1c,d,e). The (2,2 × 2,2) structure of Figure 2b on the other hand requires only one identical structure to generate LLRs for both layers 3 and 4, while Figure 2c requires one non-identical (1,3 × 1,3) structure to handle layer 4.

### 4.1 Preliminaries

Let **H** = [**h**_{1} **h**_{2} ⋯ **h**_{N}], and assume **H** has full column rank. For simplicity, we assume *N* = *M* in the remainder of this work. We seek a matrix \( \mathbf{W} = \left[\mathbf{w}_1\; \mathbf{w}_2\; \cdots\; \mathbf{w}_N\right] \in \mathcal{C}^{N\times N} \) that transforms **H** into a *punctured* LTM \( \mathbf{L} = \left[l_{ij}\right] \in \mathcal{C}^{N\times N} \) with \( l_{ii} \in \mathcal{R}^{+} \), such that:

\[ \mathbf{W}^{\ast}\mathbf{H} = \mathbf{L}. \qquad (12) \]

The aim is to induce a specific pattern of zeros below the main diagonal of **L** by appropriately choosing the columns of **W**. Setting **W**^{∗} = (**H**^{∗}**H**)^{−1}**H**^{∗} to be the left Moore-Penrose pseudo-inverse of **H** would obviously zero-out all entries in **L** below the main diagonal, resulting in **L** = **I**_{N}. On the other hand, choosing **W** to be an orthonormal basis of the column space of **H** would transform **H** into a regular (i.e., unpunctured) LTM, with **W** being unitary.

If **L** is punctured, then **W** is non-unitary. However, we impose the condition on the column vectors of **W** to have unit length, i.e., \( \mathbf{w}_n^{\ast}\mathbf{w}_n = 1 \) for *n* = 1,⋯,*N*, or

\[ \operatorname{diag}\left(\mathbf{W}^{\ast}\mathbf{W}\right) = \mathbf{I}_N. \qquad (13) \]

This guarantees that, when **y** is left-multiplied by **W**^{∗}, each element of the transformed noise vector **W**^{∗}**n** retains the variance \( \sigma_{\mathbf{n}}^2 \): the covariance matrix \( \mathbb{E}\left[\mathbf{W}^{\ast}\mathbf{n}\mathbf{n}^{\ast}\mathbf{W}\right] = \sigma_{\mathbf{n}}^2\mathbf{W}^{\ast}\mathbf{W} \) has diagonal entries equal to \( \sigma_{\mathbf{n}}^2 \). Note also that non-zero off-diagonal entries of the Gram matrix **W**^{∗}**W** correspond to pairs of column vectors in **W** that are non-orthogonal.

### 4.2 Proposed WL decomposition scheme

Let **P** = **H**(**H**^{∗}**H**)^{−1}**H**^{∗} be the orthogonal projection onto the column space of **H**, and \( \mathbf{P}^{\perp} = \mathbf{I} - \mathbf{H}{\left(\mathbf{H}^{\ast}\mathbf{H}\right)}^{-1}\mathbf{H}^{\ast} \) be the orthogonal projection onto the left nullspace of **H**. Let \( \mathbf{H}_{\mathcal{I}} \) be the submatrix formed by the columns of **H** whose index *n* belongs to the set \( \mathcal{I} \). For example, if **H** = [**h**_{1} **h**_{2} **h**_{3} **h**_{4}] and \( \mathcal{I} = \left\{1,2\right\} \), then \( \mathbf{H}_{\mathcal{I}} = \left[\mathbf{h}_1\; \mathbf{h}_2\right] \).

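Let \( \mathcal{I}_n \) denote the set of column indices of the entries in the *n*th row of **H** to be zeroed out. Define the *n*th column vector \( \tilde{\mathbf{w}}_n = \mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n \), where \( \mathbf{P}_{\mathcal{I}_n}^{\perp} \) is the orthogonal projection onto the left nullspace of \( \mathbf{H}_{\mathcal{I}_n} \), and its normalized version **w**_{n}, defined as

\[ \mathbf{w}_n = \frac{\tilde{\mathbf{w}}_n}{\sqrt{\tilde{\mathbf{w}}_n^{\ast}\tilde{\mathbf{w}}_n}} = \frac{\mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n}{\sqrt{\mathbf{h}_n^{\ast}\mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n}}. \]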

We have the following main result:

### **Theorem 1** (WL decomposition).

Let \( \mathcal{I}_n \) be the set of column indices to be nulled in row *n* of **H**. Form the submatrices \( \mathbf{H}_{\mathcal{I}_n} \) and their projection matrices \( \mathbf{P}_{\mathcal{I}_n}^{\perp} \) as noted above. Let \( \mathbf{D} = \left[d_n\right] \in \mathcal{R}^{+} \) be a diagonal matrix whose entries are given by \( d_n = 1/\sqrt{\mathbf{h}_n^{\ast}\mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n},\; n = 1,\cdots,N \). Then the conjugate transpose of the matrix

\[ \mathbf{W} = \left[\mathbf{P}_{\mathcal{I}_1}^{\perp}\mathbf{h}_1\;\; \mathbf{P}_{\mathcal{I}_2}^{\perp}\mathbf{h}_2\;\; \cdots\;\; \mathbf{P}_{\mathcal{I}_N}^{\perp}\mathbf{h}_N\right]\mathbf{D}, \]

when right-multiplied by **H**, zeros out the entries in the rows of **H** at column positions given in \( \mathcal{I}_n \), for all *n*, and satisfies condition (13).

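### *Proof*.

For every \( m \in \mathcal{I}_n \), the projection satisfies \( \mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_m = \mathbf{0} \), so \( \mathbf{w}_n^{\ast}\mathbf{h}_m = d_n\mathbf{h}_n^{\ast}\mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_m = 0 \). Since \( \mathbf{P}_{\mathcal{I}_n}^{\perp} \) is Hermitian and idempotent, \( \mathbf{w}_n^{\ast}\mathbf{w}_n = d_n^2\,\mathbf{h}_n^{\ast}\mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n = 1 \), which is condition (13). Also, \( \mathbf{w}_n^{\ast}\mathbf{h}_m \ne 0 \) for all \( m \notin \mathcal{I}_n,\; 1 \le m \le N \). Hence **w**_{n} nulls the *n*th row of **H** only at the column indices given in \( \mathcal{I}_n \).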
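A direct, projection-based construction of **W** along the lines of Theorem 1 can be sketched as below. This is an illustrative implementation only: the projection \( \mathbf{P}_{\mathcal{I}_n}^{\perp}\mathbf{h}_n \) is evaluated by orthonormalizing the columns of \( \mathbf{H}_{\mathcal{I}_n} \) with Gram-Schmidt rather than through the matrix inverse in its definition, the function names are hypothetical, and index sets are 0-based.

```python
import math

def inner(a, b):
    """Inner product a^* b of two complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def project_out(v, cols):
    """Return v minus its projection onto span(cols), i.e. P_perp applied to v."""
    basis = []
    for c in cols:                      # orthonormalize the columns to be nulled
        u = list(c)
        for b in basis:
            coef = inner(b, u)
            u = [ui - coef * bi for ui, bi in zip(u, b)]
        norm = math.sqrt(inner(u, u).real)
        basis.append([ui / norm for ui in u])
    w = list(v)
    for b in basis:                     # remove every component inside span(cols)
        coef = inner(b, w)
        w = [wi - coef * bi for wi, bi in zip(w, b)]
    return w

def wl_by_projection(cols, index_sets):
    """cols[n] is column h_n; index_sets[n] is I_n (0-based). Returns (W columns, L = W^* H)."""
    W = []
    for h, I in zip(cols, index_sets):
        w = project_out(h, [cols[m] for m in sorted(I)])
        norm = math.sqrt(inner(w, w).real)   # equals sqrt(h^* P_perp h), i.e. 1/d_n
        W.append([wi / norm for wi in w])
    L = [[inner(w, h) for h in cols] for w in W]
    return W, L
```

Each column of the returned `W` has unit length, and row *n* of `L` vanishes at the indices in \( \mathcal{I}_n \) up to rounding error.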

### **Example 1**.

A 4 × 4 channel matrix **H** can be transformed into a LTM **L** with entries *l*_{32} = *l*_{42} = *l*_{43} = 0 as follows (Figure 2a): choose \( \mathcal{I}_1 = \left\{2,3,4\right\}, \mathcal{I}_2 = \left\{3,4\right\}, \mathcal{I}_3 = \left\{2,4\right\}, \mathcal{I}_4 = \left\{2,3\right\} \), so that

\[ \mathbf{W}^{\ast}\mathbf{H} = \mathbf{L} = \begin{bmatrix} l_{11} & 0 & 0 & 0 \\ l_{21} & l_{22} & 0 & 0 \\ l_{31} & 0 & l_{33} & 0 \\ l_{41} & 0 & 0 & l_{44} \end{bmatrix}. \]

Choosing \( \mathcal{I}_1 = \left\{2,3,4\right\}, \mathcal{I}_2 = \left\{1,3,4\right\}, \mathcal{I}_3 = \left\{4\right\}, \mathcal{I}_4 = \left\{3\right\} \) results in the form of Figure 2b, while the choice \( \mathcal{I}_1 = \left\{2,3,4\right\}, \mathcal{I}_2 = \left\{1,3,4\right\}, \mathcal{I}_3 = \left\{1,2,4\right\}, \mathcal{I}_4 = \varnothing \) results in the form of Figure 2c. □
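The index sets for a given (*E*, *S* × *E*, *S*) structure follow a simple pattern, which the sketch below generates. The function name is hypothetical, and the pattern assumed here is that enumeration rows keep only their diagonal entry while slicer rows keep the *E* enumerated columns plus their diagonal entry; this reproduces the sets quoted for Figure 2b and Figure 2c.

```python
def index_sets(N, E):
    """1-based sets I_n of columns to null in row n for an (E, S x E, S) structure."""
    all_cols = set(range(1, N + 1))
    sets = []
    for n in range(1, N + 1):
        if n <= E:
            keep = {n}                            # enumeration row: diagonal only
        else:
            keep = set(range(1, E + 1)) | {n}     # slicer row: enumerated columns + diagonal
        sets.append(all_cols - keep)
    return sets
```

For instance, `index_sets(4, 2)` returns the sets listed above for Figure 2b.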

The **W**,**L** pair in (12) that punctures **H** into a given lower triangular structure is not unique since **W** is non-unitary. But any other pair that generates the same structure is related to **W**,**L** by a matrix transformation as the next lemma shows.

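### **Lemma 1** (Similar WL decompositions).

Let **W**_{1}, **W**_{2} be two distinct matrices that transform **H** into two distinct but identically punctured LTMs **L**_{1} and **L**_{2}. Then the orthonormal bases of the column space of **W**_{1} and **W**_{2} are both identical to that of **H**, and

\[ {\left(\mathbf{R}_1^{\ast}\right)}^{-1}\mathbf{L}_1 = {\left(\mathbf{R}_2^{\ast}\right)}^{-1}\mathbf{L}_2, \qquad (16) \]

where **R**_{1} and **R**_{2} are the upper triangular factors in the QR decompositions of **W**_{1} and **W**_{2}.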

### *Proof*.

We have \( \mathbf{W}_1^{\ast}\mathbf{H} = \mathbf{L}_1 \) and \( \mathbf{W}_2^{\ast}\mathbf{H} = \mathbf{L}_2 \). Let \( \mathbf{W}_1 = \mathbf{Q}_1\mathbf{R}_1 \) and \( \mathbf{W}_2 = \mathbf{Q}_2\mathbf{R}_2 \) denote the QR decompositions of **W**_{1} and **W**_{2}, respectively. Then \( \mathbf{H} = \mathbf{Q}_1{\left(\mathbf{R}_1^{\ast}\right)}^{-1}\mathbf{L}_1 = \mathbf{Q}_2{\left(\mathbf{R}_2^{\ast}\right)}^{-1}\mathbf{L}_2 \), where \( \mathbf{Q}_1 \) and \( \mathbf{Q}_2 \) are unitary, with both \( {\left(\mathbf{R}_1^{\ast}\right)}^{-1}\mathbf{L}_1 \) and \( {\left(\mathbf{R}_2^{\ast}\right)}^{-1}\mathbf{L}_2 \) being lower triangular matrices. But **H** admits a unique QLD in the form **H** = **QL**, hence **Q**_{1} = **Q**_{2} = **Q** and \( {\left(\mathbf{R}_1^{\ast}\right)}^{-1}\mathbf{L}_1 = {\left(\mathbf{R}_2^{\ast}\right)}^{-1}\mathbf{L}_2 \).

## 5 Reduced-complexity WL decomposition using QLD

A brute force approach for computing **W** involves extensive matrix inversion, which is computationally expensive and prone to numerical error when executed on DSP or dedicated hardware with finite precision. We develop next an efficient scheme to determine **W** using the modified Gram-Schmidt orthogonalization procedure [49-51] followed by elementary matrix operations. From Lemma 1, any other **W** that produces an identical structure is related by (16).

First, **H** is decomposed into a unitary matrix **Q**_{1} = [**q**_{1} **q**_{2} ⋯ **q**_{N}] and a lower triangular matrix **L**_{1} = [*l*_{ij}]_{N×N} with real and positive diagonal elements. We then have \( \mathbf{Q}_1^{\ast}\mathbf{H} = \mathbf{L}_1 \). Obviously, \( \mathbf{q}_1^{\ast}\mathbf{q}_1 = 1 \) and \( \mathbf{q}_1^{\ast}\mathbf{h}_m = 0 \) for all *m* = 2,⋯,*N*. Hence, **w**_{1} = **q**_{1}. Now consider row 1 < *n* ≤ *N* of **L**_{1}, and assume the *m*th entry *l*_{nm}, *m* < *n*, is to be nulled. We have \( \mathbf{q}_n^{\ast}\mathbf{h}_m = l_{nm} \in \mathcal{C} \) and \( \mathbf{q}_m^{\ast}\mathbf{h}_m = l_{mm} \in \mathcal{R}^{+} \), from which it follows that \( \left(\mathbf{q}_n^{\ast} - \mathbf{q}_m^{\ast}\frac{l_{nm}}{l_{mm}}\right)\mathbf{h}_m = 0 \). Therefore, the matrix operations

\[ \mathbf{q}_n^{\ast} \leftarrow \mathbf{q}_n^{\ast} - \frac{l_{nm}}{l_{mm}}\,\mathbf{q}_m^{\ast}, \qquad (17) \]

\[ l_{nj} \leftarrow l_{nj} - \frac{l_{nm}}{l_{mm}}\,l_{mj}, \quad j = 1,\cdots,m, \qquad (18) \]

null the entry *l*_{nm} and update **Q**_{1} accordingly. These operations are repeated for all other column entries *m* < *n* to be punctured in row *n*. Finally, **q**_{n} is normalized to have unit length, and the non-zero entries in row *n* of **L**_{1} are updated:

\[ l_{nj} \leftarrow \frac{l_{nj}}{\sqrt{\mathbf{q}_n^{\ast}\mathbf{q}_n}}, \quad j = 1,\cdots,n, \qquad (19) \]

\[ \mathbf{q}_n \leftarrow \frac{\mathbf{q}_n}{\sqrt{\mathbf{q}_n^{\ast}\mathbf{q}_n}}. \qquad (20) \]

The operations in (17)-(18) followed by the normalization steps (19)-(20) are repeated for all rows *n* where puncturing is required. The resulting **Q**_{1} is **W**, and **L**_{1} is the desired punctured LTM **L**.

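The nulling operations can be expressed using elementary matrices **E**_{m} = [*e*_{nj}], 1 ≤ *m* ≤ *N*, which differ from **I**_{N} by a single elementary row operation, and defined as follows: the diagonal entries are *e*_{jj} = 1, the entry at the position (*n*, *m*) being nulled is *e*_{nm} = −*l*_{nm}/*l*_{mm}, and all other entries are zero.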

The product of these elementary matrices forms the unscaled matrices **L**_{2} = (**E**_{n}⋯**E**_{1})**L**_{1} and \( \mathbf{Q}_2^{\ast} = \left(\mathbf{E}_n^{\ast}\cdots\mathbf{E}_1^{\ast}\right)\mathbf{Q}_1^{\ast} \).

The scaling operations (19)-(20) can be written using the diagonal matrix \( \mathbf{D} = \left[d_n\right] \in \mathcal{R}^{+} \), where \( d_n = 1/\sqrt{{\left[\mathbf{Q}_2^{\ast}\mathbf{Q}_2\right]}_{nn}} \) and [·]_{nn} denotes the *n*th diagonal element. The desired (scaled) matrices are given by \( \mathbf{W}^{\ast} = \mathbf{D}\mathbf{Q}_2^{\ast} \) and **L** = **DL**_{2}.

For detection, the product **W**^{∗}**y** must be formed as well. This can simply be handled by first left-augmenting **y** to **H**, and then performing QLD on the augmented matrix to form \( \tilde{\mathbf{Q}}\tilde{\mathbf{L}} = \left[\mathbf{y}\,|\,\mathbf{H}\right] \). When carrying out the orthogonalization procedure, the same operations applied to the columns of **H** are applied to the augmented column. This results in \( \tilde{\mathbf{Q}} = \left[\mathbf{0}_{N\times 1}\,|\,\mathbf{Q}\right] \) and \( \tilde{\mathbf{L}} = \left[\tilde{\mathbf{y}}\,|\,\mathbf{L}\right] \), where **QL** = **H** and \( \tilde{\mathbf{y}} = \mathbf{Q}^{\ast}\mathbf{y} \), with \( \tilde{\mathbf{y}} \) essentially generated as a by-product of the decomposition. Next, when carrying out operations (17)-(18) followed by (19)-(20) to puncture a given entry, these operations are also applied on the leftmost column of \( \tilde{\mathbf{L}} \), which contains \( \tilde{\mathbf{y}} \).

Algorithm 1 summarizes the steps of the proposed WLD scheme. Step 1 performs augmented QLD, while Step 2 performs nulling and normalization.
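The two steps can be sketched in plain Python as follows. This is a minimal illustration, not the paper's Algorithm 1: the function names are hypothetical, the augmentation with **y** is omitted for brevity, and the entries of each row are nulled from the highest column index downward, an ordering assumed here so that a later row operation does not disturb an entry that has already been nulled.

```python
import math

def inner(a, b):
    """Inner product a^* b of two complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def ql_gram_schmidt(cols):
    """Step 1: QL decomposition by Gram-Schmidt, processing columns from last to first."""
    N = len(cols)
    Q = [None] * N
    for n in range(N - 1, -1, -1):
        u = list(cols[n])
        for j in range(n + 1, N):
            coef = inner(Q[j], u)
            u = [ui - coef * qi for ui, qi in zip(u, Q[j])]
        norm = math.sqrt(inner(u, u).real)
        Q[n] = [ui / norm for ui in u]
    L = [[inner(Q[i], cols[j]) if j <= i else 0j for j in range(N)] for i in range(N)]
    return Q, L

def wld(cols, null_below):
    """Step 2: null_below[n] holds 0-based columns m < n to null in row n. Returns (W, L)."""
    Q, L = ql_gram_schmidt(cols)
    for n in range(len(cols)):
        for m in sorted(null_below[n], reverse=True):
            c = L[n][m] / L[m][m]
            # Row operation on L, and the matching update q_n^* <- q_n^* - c q_m^*.
            L[n] = [a - c * b for a, b in zip(L[n], L[m])]
            L[n][m] = 0j
            Q[n] = [a - c.conjugate() * b for a, b in zip(Q[n], Q[m])]
        s = math.sqrt(inner(Q[n], Q[n]).real)    # normalize q_n and rescale row n of L
        Q[n] = [a / s for a in Q[n]]
        L[n] = [a / s for a in L[n]]
    return Q, L
```

No matrix inversion appears anywhere: each punctured entry costs one scaled row subtraction on **L** and one scaled vector subtraction on the corresponding column of **Q**.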

### 5.1 Complexity

[…] flops, assuming a (1,(*N* − 1) × 1,*N* − 1) structure. In comparison, [48] requires (16*N*^{4} − 4*N*^{3}) RMUL and (16*N*^{4} − 4*N*^{3}) RADD.

## 6 Optimized detection algorithm

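The proposed detection algorithm decomposes **H** into *T* punctured LTMs having *identical* structure (*E*, *S* × *E*, *S*) where *T* = *N*/*E* as follows. The *N* streams are decoupled *E* at a time in *T* steps by cyclically shifting the columns of **H** using the permutation

\[ \left[\mathbf{h}_{\pi_t(1)}\; \cdots\; \mathbf{h}_{\pi_t(N)}\right], \quad \pi_t(n) = \left(\left(n - 1 + (t-1)E\right) \bmod N\right) + 1, \qquad (24) \]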

for *t* = 1,⋯,*T*. Each permuted **H** is then WL-decomposed into **W**^{(t)}, **L**^{(t)}, wherein the *E* leftmost columns of **L**^{(t)} correspond to *E* distinct decoupled streams. For example, to decouple all four streams of a 4 × 4 MIMO system using the (2,2 × 2,2) structure of Figure 2b, *T* = 2 such decompositions are required, one on [**h**_{1} **h**_{2} **h**_{3} **h**_{4}] to decouple streams 1 and 2 and one on [**h**_{3} **h**_{4} **h**_{1} **h**_{2}] to decouple streams 3 and 4. To allow decoupled subsets to overlap, the permutation is altered to place a stream with low reliability in several detection sets. For example, to decompose the four streams in the previous example into overlapping (2,2 × 2,2) structures, four decompositions are needed based on [**h**_{1} **h**_{2} **h**_{3} **h**_{4}], [**h**_{2} **h**_{3} **h**_{4} **h**_{1}], [**h**_{3} **h**_{4} **h**_{1} **h**_{2}], and [**h**_{4} **h**_{3} **h**_{2} **h**_{1}]. For simplicity, we assume identical constellations \( \mathcal{X}_n = \mathcal{O} \) on all layers.
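The non-overlapping column orders in the example above can be generated with a short helper. The function names and the closed-form index expression are assumptions chosen to reproduce the orders [**h**_{1} **h**_{2} **h**_{3} **h**_{4}] and [**h**_{3} **h**_{4} **h**_{1} **h**_{2}]; the second helper undoes the permutation on a detected symbol vector.

```python
def shifted_columns(N, E, t):
    """1-based column order of H at step t (t = 1..N/E): cyclic shift by (t - 1) * E."""
    shift = (t - 1) * E
    return [((n + shift) % N) + 1 for n in range(N)]

def unpermute(x_perm, order):
    """Place symbols detected in permuted order back into their natural positions."""
    x = [None] * len(order)
    for pos, col in enumerate(order):
        x[col - 1] = x_perm[pos]
    return x
```

Overlapping decompositions would use a different list of column orders, but `unpermute` applies unchanged.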

One simple approach to low-complexity MIMO detection is zero-forcing. Left-multiplying **y** by **W**^{(t)∗} to get \( \mathbf{W}^{(t)\ast}\mathbf{y} = \tilde{\mathbf{y}}^{(t)} = \mathbf{L}^{(t)}\mathbf{x} + \mathbf{W}^{(t)\ast}\mathbf{n} \) results in *E* decoupled streams corresponding to columns **h**_{1},⋯,**h**_{E}. These streams can be detected separately by slicing the layers individually to find their closest symbols.

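Better performance can be obtained by exploiting the punctured structure of **L**^{(t)}. Instead of merely slicing to find the *E* closest symbols in the enumeration set, we enumerate all symbol vectors in the set \( \mathcal{O}^E \) and then slice to find the closest symbols in the remaining *S* layers, populating the Euclidean distance of the resulting symbol vectors along the way. First partition \( \tilde{\mathbf{y}}^{(t)} \), **L**^{(t)}, and **x** as

\[ \tilde{\mathbf{y}}^{(t)} = \begin{bmatrix} \tilde{\mathbf{y}}_1^{(t)} \\ \tilde{\mathbf{y}}_2^{(t)} \end{bmatrix}, \qquad \mathbf{L}^{(t)} = \begin{bmatrix} \mathbf{A}^{(t)} & \mathbf{0} \\ \mathbf{B}^{(t)} & \mathbf{C}^{(t)} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}, \]

where \( \mathbf{A}^{(t)} \) is *E* × *E*, \( \mathbf{B}^{(t)} \) is *S* × *E*, and \( \mathbf{C}^{(t)} \) is *S* × *S*. The detected symbol vector \( \tilde{\mathbf{x}}_t^{\mathrm{WL}} \) for step *t* is given by the pair \( \left(\mathbf{x}_1, \hat{\mathbf{x}}_2(\mathbf{x}_1)\right) \) that minimizes

\[ \left\Vert \tilde{\mathbf{y}}_1^{(t)} - \mathbf{A}^{(t)}\mathbf{x}_1 \right\Vert^2 + \left\Vert \tilde{\mathbf{y}}_2^{(t)} - \mathbf{B}^{(t)}\mathbf{x}_1 - \mathbf{C}^{(t)}\hat{\mathbf{x}}_2(\mathbf{x}_1) \right\Vert^2 \]

over \( \mathbf{x}_1 \in \mathcal{O}^E \), where \( \hat{\mathbf{x}}_2(\mathbf{x}_1) \) is obtained by slicing \( {\left(\mathbf{C}^{(t)}\right)}^{-1}\left(\tilde{\mathbf{y}}_2^{(t)} - \mathbf{B}^{(t)}\mathbf{x}_1\right) \), an operation that decouples into *S* scalar slicing operations since **C**^{(t)} is a diagonal matrix. Here slicing is applied to the individual elements of the vector over the constellation \( \mathcal{O} \). The symbols in the vector \( \tilde{\mathbf{x}}_t^{\mathrm{WL}} \) are rearranged back to their normal order using (24), and the permuted vector is denoted as \( \mathbf{x}_t^{\mathrm{WL}} \).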

The distance of each candidate is then computed as the distance of **y** from \( \mathbf{H}\mathbf{x}_t^{\mathrm{WL}} \) (rather than the distance of \( \tilde{\mathbf{y}}^{(t)} \) from \( \mathbf{L}^{(t)}\tilde{\mathbf{x}}_t^{\mathrm{WL}} \)):

\[ d\left(\mathbf{x}_t^{\mathrm{WL}}\right) = \left\Vert \mathbf{y} - \mathbf{H}\mathbf{x}_t^{\mathrm{WL}} \right\Vert^2, \qquad (28) \]

[…] for *n* = 1,⋯,*N*, \( k = 1,\cdots,\log_2\left|\mathcal{O}\right| \), and *t* = 1,⋯,*T*.

[…] *t* in (28) and (31), respectively: […]

Tighter LLRs can be produced by tracking global minimum distances rather than just minimizing over the per stream LLRs. Specifically, when using (26)-(27) for every stream *t*, *t*=1,⋯,*T*, instead of just retaining the minimum distance and its corresponding arg min symbol vector, a list of all \( \left|{\mathcal{O}}^E\right| \) distances and their corresponding symbol vectors is populated. These *T* lists are then used to compute the LLR values by selecting the minimum distances for symbol vectors from these lists having 0 or 1 in the desired bit position where the LLR value is to be computed.
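The enumerate-and-slice step, with the full list of \( \left|\mathcal{O}^E\right| \) candidates retained as described above, might look as follows for QPSK. This is a sketch under stated assumptions: the function names are hypothetical, the input is an already punctured lower triangular matrix with real positive diagonal, and distances are measured on the transformed quantities for brevity, whereas recomputing them against **y** and **H** is the variant the text describes.

```python
import itertools
import math

S = 1 / math.sqrt(2)
QPSK = [complex(a, b) * S for a in (1, -1) for b in (1, -1)]

def slice_qpsk(z):
    """Nearest QPSK point: decide the sign of each real dimension."""
    return complex(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1) * S

def enumerate_and_slice(y_t, L, E):
    """Return all |O|^E candidates as (distance, symbol vector) for E enumerated layers."""
    N = len(L)
    out = []
    for x1 in itertools.product(QPSK, repeat=E):
        x = list(x1)
        d = 0.0
        for i in range(N):
            # Remove the contribution of the enumerated symbols from row i.
            r = y_t[i] - sum(L[i][j] * x1[j] for j in range(E))
            if i >= E:
                xi = slice_qpsk(r / L[i][i])   # slicer row: scalar decision
                x.append(xi)
                r -= L[i][i] * xi
            d += abs(r) ** 2
        out.append((d, x))
    return out
```

Keeping every `(distance, vector)` pair, rather than only the minimum, is what allows the bit-wise minima to be selected across all *T* lists afterwards.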

For *N* = 2 layers, the algorithm is optimal and reduces to that in [45] since the **W**s are unitary in this case. When *E* = 1 and distances in (28) are computed based on **L** instead of **H**, the algorithm reduces to that in [48].

### 6.1 Multi-core detector architectures

For a larger number of streams *N* in the MIMO system, multiple 2×2 detector cores can be configured to construct an *N*-stream MIMO detector. Figure 4 shows a four-sided fully parallel 4 × 4 MIMO detector that uses four cores to process the four streams. Here distance buffering and accumulation are needed before LLR processing in order to adjust the individual LLRs according to the earlier discussion. It is assumed that an external digital signal processor (DSP) supplies the WLD matrix inputs for all four streams according to the decompositions in (25). If chip area is the constraining factor, a MIMO detector can be built using a single core that is time-multiplexed among the four streams.

## 7 Performance simulation results

The coded bit-error rate (BER) performance of the proposed detection algorithm was evaluated through simulations of a MIMO system employing four transmit and four receive antennas. The channel encoder is based on the LTE turbo encoder specification [52] with interleaver length 1,024, using 16-QAM and 64-QAM modulation constellations. The channel entries are assumed to be independent and identically distributed (i.i.d.) complex Gaussian random variables. At the receiver end, we assume perfect channel knowledge. The turbo decoder implements the true *A Posteriori* Probability algorithm, and performs four (full) decoding iterations.

The first set of results compares the BER performance of the proposed algorithm using *E* = 1 and 2 structures, versus ML, ZF, the approach of [48], and the sphere decoder with radius clipping [30], for 16-QAM. Both overlapping and non-overlapping subsets are considered. Furthermore, two scenarios for distance computations in (28) are followed: one based on **H** and one on **L**. The curve labeled WLD-H1 corresponds to the proposed WLD detection scheme with *E* = 1 and distance computations based on **H** in (28), while the curve labeled WLD-L1 corresponds to the proposed scheme with *E* = 1 but with distance computations based on **L** in (28). The curves labeled WLD-H2 and WLD-L2 are defined similarly, but with *E* = 2 and non-overlapping decomposition sets. Finally, the curve labeled WLD-H2ov corresponds to *E* = 2 but with four overlapping (2,2×2,2) structures.

The plots demonstrate that the proposed WLD algorithm with *E* = 2 using **H** distances with overlapping subsets performs virtually identically to ML and is less than 0.1 dB away from ML with non-overlapping subsets. Furthermore, the WLD-L2 scheme performs much worse than WLD-H2. For decompositions with single streams, **L** distances perform better than **H** distances; the WLD-H1 scheme exhibits an error floor. Compared to the well-known sphere decoding algorithm with radius clipping [30], the proposed WLD-based schemes achieve better performance, especially with *E* = 2 structures.

For 64-QAM, the proposed algorithm with *E* = 2 using **H** distances and overlapping subsets performs very close to ML. It is worth mentioning that WLD-H2 again performs very close to WLD-H2ov (and hence to ML), which has significant hardware saving implications. Note here that the WLD-L2ov scheme using **L**-distances with *E* = 2 overlapping decompositions performs the worst among the WLD schemes.

Finally, another important advantage of the proposed schemes is the significant reduction in simulation time observed when generating the BER plots. On a typical multi-core server machine, one simulation point takes on the order of a few hours to complete with the proposed schemes, while the sphere decoder and ML approaches require a few days. This is very valuable for designers when doing rapid evaluations and tradeoff analyses of various MIMO system features.

## 8 Conclusions

A low-complexity MIMO subspace detection algorithm has been presented. By decomposing the channel matrix of an *M*-stream MIMO system into a generalized elementary matrix structure, the detection problem generalizes the two-stream detection problem, which admits a simple architecture suitable for high-speed implementation. Multiple two-stream detector cores can be employed in parallel to improve detection throughput. The channel decomposition scheme is cast in terms of the standard Gram-Schmidt QL decomposition, which is supported in most modern DSPs.

## References

- A Paulraj, R Nabar, D Gore, Introduction to Space-Time Wireless Communications (Cambridge Univ. Press, Cambridge, U.K., 2003).
- E Viterbo, J Boutros, A universal lattice code decoder for fading channels. IEEE Trans. Inf. Theory. 45(5), 1639–1642 (1999).
- O Damen, A Chkeif, J-C Belfiore, Lattice code decoder for space-time codes. IEEE Commun. Lett. 4(5), 161–163 (2000).
- E Agrell, T Eriksson, A Vardy, K Zeger, Closest point search in lattices. IEEE Trans. Inf. Theory. 48(8), 2201–2214 (2002).
- GJ Foschini, Layered space-time architecture for wireless communication in a fading environment when using multi-element antennas. Bell Labs Tech. J. 1(2), 41–59 (1996).
- Y Jiang, J Li, WW Hager, Joint transceiver design for MIMO communications using geometric mean decomposition. IEEE Trans. Signal Process. 53(10), 3791–3803 (2005).
- Y Jiang, J Li, WW Hager, Uniform channel decomposition for MIMO communications. IEEE Trans. Signal Process. 53(11), 4283–4294 (2005).
- SL Ariyavisitakul, J Zheng, E Ojard, J Kim, Subspace beamforming for near-capacity MIMO performance. IEEE Trans. Signal Process. 56(11), 5729–5733 (2008).
- Y Chen, S ten Brink, in Proc. IEEE Int. Symp. Personal Indoor and Mobile Radio Commun. (PIMRC). Near-capacity MIMO subspace detection (Toronto, Canada, 2011), pp. 1733–1737.
- E Biglieri, R Calderbank, A Constantinides, A Goldsmith, A Paulraj, HV Poor, MIMO Wireless Communications (Cambridge Univ. Press, Cambridge, U.K., 2007).
- B Hassibi, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. (ICASSP). An efficient square-root algorithm for BLAST (Istanbul, Turkey, 2000), pp. 5–9.
- GD Golden, JG Foschini, RA Valenzuela, PW Wolniansky, Detection algorithm and initial laboratory results using V-BLAST space-time communication architecture. IEE Electron. Lett. 35(1), 14–15 (1999).
- D Wübben, R Böhnke, V Kühn, K Kammeyer, in Proc. IEEE Vehicular Technol. Conf. (VTC). MMSE extension of V-BLAST based on sorted QR decomposition (Orlando, Florida, 2003), pp. 508–512.
- C Studer, S Fateh, D Seethaler, ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation. IEEE Trans. Syst. Sci. Cybern. 47(7), 1754–1765 (2011).
- E Viterbo, E Biglieri, in 14ème Colloque GRETSI. A universal decoding algorithm for lattice codes (Juan-Les-Pins, France, 1993), pp. 611–614.
- B Hochwald, S ten Brink, Achieving near-capacity on a multiple-antenna channel. IEEE Trans. Commun. 51(3), 389–399 (2003).
- B Hassibi, H Vikalo, On sphere decoding algorithm. I. Expected complexity. IEEE Trans. Signal Process. 53(8), 2806–2818 (2005).
- J Jaldén, B Ottersten, On the complexity of sphere decoding in digital communications. IEEE Trans. Signal Process. 53(4), 1474–1484 (2005).
- D Seethaler, J Jaldén, C Studer, H Bölcskei, On the complexity distribution of sphere decoding. IEEE Trans. Inf. Theory. 57(9), 5754–5768 (2011).
- K-W Wong, C-Y Tsui, RS-K Cheng, W-H Mow, in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 3. A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels (Scottsdale, Arizona, 2002), pp. 273–276.
- M Wenk, M Zellweger, A Burg, N Felber, W Fichtner, in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS). K-best MIMO detection VLSI architectures achieving up to 424Mbps (Island of Kos, Greece, 2006), pp. 1151–1154.
- S Mondal, A Eltawil, C-A Shen, K Salama, Design and implementation of a sort free *K*-best sphere decoder. IEEE Trans. VLSI Syst. 18(10), 1497–1501 (2010).
- L Liu, F Ye, X Ma, T Zhang, J Ren, A 1.1-Gb/s 115-pJ/bit configurable MIMO detector using 0.13 *μ*m CMOS technology. IEEE Trans. Circuits Syst. II. 57(9), 701–705 (2010).
- C-A Shen, A Eltawil, K Salama, Evaluation framework for *K*-best sphere decoders. J. Circuits Syst. Comput. 19(5), 975–995 (2010).
- M Shabany, P Gulak, A 675 Mbps, 4×4 64-QAM K-Best MIMO detector in 0.13 *μ*m CMOS. IEEE Trans. VLSI Syst. 20(1), 135–147 (2012).
- M Mahdavi, M Shabany, Novel MIMO detection algorithm for high-order constellations in the complex domain. IEEE Trans. VLSI Syst. 21(5), 834–847 (2013).
- D Garrett, L Davis, S ten Brink, B Hochwald, G Knagge, Silicon complexity for maximum likelihood MIMO detection using spherical decoding. IEEE J. Solid-State Circuits. 39(9), 1544–1552 (2004).
- Z Guo, P Nilsson, in Proc. IEEE CAS Symp. Emerging Technologies, 1. A VLSI architecture of the Schnorr-Euchner decoder for MIMO systems (Shanghai, China, 2004), pp. 65–68.
- A Burg, M Borgmann, M Wenk, M Zellweger, W Fichtner, H Bölcskei, VLSI implementation of MIMO detection using the sphere decoding algorithm. IEEE J. Solid-State Circuits. 40(7), 1566–1577 (2005).
- C Studer, A Burg, H Bölcskei, Soft-output sphere decoder: algorithms and VLSI implementation. IEEE J. Sel. Areas Commun. 26(2), 290–300 (2008).
- C-H Yang, D Markovic, A flexible DSP architecture for MIMO sphere decoding. IEEE Trans. Circuits Syst. I. 56(10), 2301–2314 (2009).
- C-H Yang, D Markovic, in European Solid-State Circuits Conf. (ESSCIRC). A 2.89 mW 50 GOPS 16×16 16-core MIMO sphere decoder in 90 nm CMOS (Athens, Greece, 2009), pp. 344–347.
- F Borlenghi, EM Witte, G Ascheid, H Meyr, A Burg, in IEEE Asian Solid State Circuits Conf. (A-SSCC). A 772 Mbit/s 8.81 bit/nJ 90 nm CMOS soft-input soft-output sphere decoder (Jeju, Korea, 2011), pp. 297–300.
- L Liu, J Lofgren, P Nilsson, Area-efficient configurable high-throughput signal detector supporting multiple MIMO modes. IEEE Trans. Circuits Syst. I. 59(9), 2085–2096 (2012).
- Y Sun, JR Cavallaro, Trellis-search based soft-input soft-output MIMO detector: algorithm and VLSI architecture. IEEE Trans. Signal Process. 60(5), 2617–2627 (2012).
- X Chen, G He, J Ma, VLSI implementation of a high-throughput iterative fixed-complexity sphere decoder. IEEE Trans. Circuits Syst. II. 60(5), 272–276 (2013).
- MM Mansour, S Alex, MA Jalloul, Reduced complexity soft-output MIMO sphere detectors – Part I: Algorithmic optimizations. IEEE Trans. Signal Process. 62(21), 5505–5520 (2014).
- MM Mansour, S Alex, MA Jalloul, Reduced complexity soft-output MIMO sphere detectors – Part II: Architectural optimizations. IEEE Trans. Signal Process. 62(21), 5521–5535 (2014).
- M-Y Huang, P-Y Tsai, Toward multi-gigabit wireless: design of high-throughput MIMO detectors with hardware-efficient architecture. IEEE Trans. Circuits Syst. I. 61(2), 613–624 (2014).
- J Choi, A Singer, J Lee, N-I Cho, Improved linear soft-input soft-output detection via soft feedback successive interference cancellation. IEEE Trans. Commun. 58(3), 986–996 (2010).
- RC de Lamare, R Sampaio-Neto, Adaptive reduced-rank equalization algorithms based on alternating optimization design techniques for MIMO systems. IEEE Trans. Veh. Technol. 60(6), 2482–2494 (2011).
- T Hwang, Y Kim, H Park, Energy spreading transform approach to achieve full diversity and full rate for MIMO systems. IEEE Trans. Signal Process. 60(12), 6547–6560 (2012).
- D Persson, J Kron, M Skoglund, EG Larsson, Joint source-channel coding for the MIMO broadcast channel. IEEE Trans. Signal Process. 60(4), 2085–2090 (2012).
- RC de Lamare, Adaptive and iterative multi-branch MMSE decision feedback detection algorithms for multi-antenna systems. IEEE Trans. Wireless Commun. 12(10), 5294–5308 (2013).
- M Siti, MP Fitz, in Proc. IEEE Int. Conf. Commun. (ICC), 4. A novel soft-output layered orthogonal lattice detector for multiple antenna communications (Istanbul, Turkey, 2006), pp. 1686–1691.
- M Siti, MP Fitz, in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC). On layer ordering techniques for near-optimal MIMO detectors (Hong Kong, 2007), pp. 1199–1204.
- MS Yee, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. (ICASSP), 3. Max-log-MAP sphere decoder (Philadelphia, PA, 2005), pp. 1013–1016.
- E Ojard, S Ariyavisitakul, Method and system for approximate maximum likelihood (ML) detection in a multiple input multiple output (MIMO) receiver. U.S. Patent No. 20090074114 (2009). http://www.google.com/patents/US20090074114.
- GH Golub, CF Van Loan, Matrix Computations, 3rd edn (Johns Hopkins Univ. Press, Baltimore, MD, 1996).
- RC-H Chang, C-H Lin, K-H Lin, C-L Huang, F-C Chen, Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm for MIMO systems. IEEE Trans. Circuits Syst. I. 57(5), 1095–1102 (2010).
- D Wübben, R Böhnke, J Rinas, V Kühn, K Kammeyer, Efficient algorithm for decoding layered space-time codes. IEE Electron. Lett. 37(22), 1348–1350 (2001).
- Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Channels and Modulation. TS 36.211 3GPP. http://www.3gpp.org.

## Copyright

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.