Achieving Low-Complexity Maximum-Likelihood Detection for the 3D MIMO Code

The 3D MIMO code is a robust and efficient space-time block code (STBC) for distributed MIMO broadcasting, but it suffers from high maximum-likelihood (ML) decoding complexity. In this paper, we first analyze some properties of the 3D MIMO code to show that it is fast-decodable. We prove that ML decoding performance can be achieved with a complexity of O(M^{4.5}) instead of O(M^8) in quasi-static channels with M-ary square QAM modulations. We then propose a simplified ML decoder that exploits the unique properties of the 3D MIMO code. Simulation results show that the proposed simplified ML decoder achieves much lower processing latency than the classical sphere decoder with Schnorr-Euchner enumeration.


Introduction
Multiple-input multiple-output (MIMO) is a promising technique that can bring significant improvements to wireless communication systems. In combination with space-time block codes (STBCs), it provides higher spectral efficiency and better communication reliability [1]. Over the last decades, MIMO has been widely adopted in wireless communication standards such as IEEE 802.11n, 3GPP Long Term Evolution (LTE), WiMAX and Digital Video Broadcasting-Next Generation Handheld (DVB-NGH). It is also seen as the key technology for future digital terrestrial TV broadcasting standards [2].
A so-called space-time-space (3D) MIMO code [3] was proposed for future TV broadcasting systems in which the services are delivered by MIMO transmission in a single frequency network (SFN). Specifically, it is proposed for a distributed MIMO broadcasting scenario where TV programs are transmitted by two geographically separated transmission sites, each equipped with two transmit antennas. Each receiver has two receive antennas, forming a 4 × 2 MIMO transmission. The 3D MIMO code has been shown to be robust and efficient in distributed MIMO broadcasting scenarios where strong received signal power imbalances exist [4]. Hence, it is a promising candidate for the MIMO profile of future broadcasting standards. However, the 3D MIMO code suffers from a high computational complexity when maximum-likelihood (ML) decoding is adopted. The decoding complexity is as high as O(M^8) when an M-QAM constellation is used. Up to now, no study on decoding complexity reduction for the 3D MIMO code has been reported in the literature.
Recently, considerable effort has been devoted to STBC designs that offer both a high code rate and a low decoding complexity [5][6][7][8][9][10][11]. The decoding complexity reduction is commonly achieved by exploiting the orthogonality embedded in the STBC codeword. When there is group-wise orthogonality in the codeword, the joint detection of many information symbols is converted into independent, group-wise detections [6,10], yielding a low decoding complexity. In other cases, such as the DjABBA code [12], the Biglieri-Hong-Viterbo (BHV) code [7], the Srinath-Rajan code [8] and the Ismail-Fiorina-Sari (IFS) code [11], the orthogonality exists only among a part of the information symbols, and some symbols can be detected in a group-wise manner once they are conditioned on the other symbols. The ML solution can then be obtained with a lower complexity than that of an exhaustive ML search. In other words, the decoding complexity is less than O(M^κ), where κ is the number of information symbols in a codeword. Such STBCs are referred to as fast-decodable STBCs [7].
However, most fast-decodable STBCs are not optimized for distributed MIMO broadcasting scenarios, and they are not robust under received signal power imbalance conditions [4].
A partial interference cancellation (PIC) group decoding scheme has been presented with the aim of reducing the decoding complexity of STBCs whose codewords contain group-wise orthogonality [13,14]. A number of STBCs optimized for this decoding scheme have also been proposed [14,15]. This scheme uses linear equalization to convert the joint detection of a large number of symbols into several group-wise ML decodings of a few symbols each. However, the overall performance of this decoding scheme does not reach ML optimality.
Some alternatives with reduced decoding complexity have been presented for distributed MIMO broadcasting. Polonen et al. described an STBC with lower decoding complexity based on an orthogonal basis [16]. However, such a code does not achieve full diversity or full rate for 4 × 2 MIMO transmissions and therefore performs worse than the 3D MIMO code. A "punctured version" of the 3D MIMO code that is full-rate for 4 × 2 MIMO transmissions with low decoding complexity has also been proposed [17]. However, it does not achieve full diversity and is hence less robust in harsh channel conditions.
In this paper, we propose a reduced-complexity ML decoder for the 3D MIMO code which exploits the embedded orthogonality in the codeword. The main contributions are:
• We propose to modify the original 3D MIMO codeword through some permutations of information symbols, which leads to an ML decoding algorithm with reduced complexity without affecting any of the desirable properties of the 3D MIMO code.
• We prove that the 3D MIMO code is fast decodable. Moreover, we show that the worst-case decoding complexity is O(M^{4.5}) for M-ary square QAM modulations, which is the lowest among all square full-rate STBCs for 4 × 2 MIMO transmission.
• Based on the unique properties of the new form of the 3D MIMO codeword, we propose a novel implementation of the simplified decoder that achieves a lower average complexity in terms of time latency without losing ML optimality. The proposed implementation is also applicable to other fast-decodable STBCs.
The remainder of the paper is organized as follows. Some fundamentals of MIMO detection are presented in Section 2. In Section 3, the 3D MIMO code is first recalled. A modification of the codeword is then proposed to facilitate the decoding process, and three important properties of the new codeword are revealed. With this knowledge, the ML decoder with a worst-case decoding complexity of O(M^{4.5}) is derived in Section 4. Then, in Section 5, a new implementation of the reduced-complexity ML decoder is described. Section 6 presents the symbol error and complexity performance of the new decoder. Conclusions are drawn in Section 7.
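Notations: Vectors and matrices are written in boldface letters. The superscript in $\mathbf{X}^T$ represents transposition of the matrix $\mathbf{X}$. $x^R$ and $x^I$ denote the real and imaginary parts of a complex number $x$, respectively. The operator $\check{(\cdot)}$ performs the complex-real conversion from $\mathbb{C}$ to $\mathbb{R}^{2 \times 2}$:

$$\check{x} \triangleq \begin{bmatrix} x^R & -x^I \\ x^I & x^R \end{bmatrix}. \qquad (1)$$

When the $\check{(\cdot)}$ operator is applied to a matrix $\mathbf{X} \in \mathbb{C}^{m \times n}$, the operation in (1) is performed for all elements $x_{j,k}$ of the matrix, i.e. the $(j, k)$th $2 \times 2$ submatrix of $\check{\mathbf{X}}$ is $\check{x}_{j,k}$. For a complex vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \mathbb{C}^n$, the operator $\tilde{(\cdot)}$ separates the real and imaginary parts of the given vector, i.e. $\tilde{\mathbf{x}} \triangleq [x^R_1, x^I_1, \ldots, x^R_n, x^I_n]^T$. For a matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_j$ is the $j$th column of $\mathbf{X}$, the operator $\mathrm{vec}(\mathbf{X})$ stacks the columns of $\mathbf{X}$ to form one column vector, i.e. $\mathrm{vec}(\mathbf{X}) \triangleq [\mathbf{x}^T_1, \mathbf{x}^T_2, \ldots, \mathbf{x}^T_n]^T$. $\widetilde{\mathrm{vec}}(\mathbf{X})$ denotes vectorizing the matrix $\mathbf{X}$ followed by the real/imaginary part separation. The inner product of two real-valued vectors $\mathbf{x}$ and $\mathbf{y}$ is denoted by $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y}$. The $n \times n$ identity matrix is denoted by $\mathbf{I}_n$. The operator $\otimes$ denotes the Kronecker product.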
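The conversions described in the Notations can be illustrated with a short sketch. It assumes the common sign convention for the 2 × 2 real representation (real part on the diagonal, imaginary part off the diagonal with opposite signs); the function names `check`, `check_matrix`, `tilde`, `vec` and `matvec`, as well as the numerical values, are hypothetical and chosen only for illustration.

```python
def check(x):
    """2x2 real representation of a complex scalar (assumed sign convention)."""
    return [[x.real, -x.imag],
            [x.imag, x.real]]


def check_matrix(X):
    """Apply check() to every entry of a complex matrix given as a list of rows."""
    rows = []
    for row in X:
        top, bottom = [], []
        for x in row:
            block = check(x)
            top += block[0]
            bottom += block[1]
        rows += [top, bottom]
    return rows


def tilde(v):
    """Separate real and imaginary parts: [x1R, x1I, ..., xnR, xnI]."""
    out = []
    for x in v:
        out += [x.real, x.imag]
    return out


def vec(columns):
    """Stack the columns of a matrix (given as a list of columns)."""
    return [x for col in columns for x in col]


def matvec(A, v):
    """Plain matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


# Hypothetical 2x2 complex matrix and vector, used only as an example.
H = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
x = [1 - 1j, 2 + 3j]

Hx = [sum(h * xi for h, xi in zip(row, x)) for row in H]
real_side = matvec(check_matrix(H), tilde(x))   # computed entirely with reals
complex_side = tilde(Hx)                        # complex product, then split
print(real_side, complex_side)
```

Under this convention the real-valued product reproduces the complex one, which is why the real and complex forms of the signal model can be used interchangeably.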

MIMO system model
We consider a MIMO transmission with $N_t$ transmit and $N_r$ receive antennas over a flat-fading channel. The received signal $\mathbf{Y} \in \mathbb{C}^{N_r \times T}$ is:

$$\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{W}, \qquad (2)$$

where $\mathbf{X} \in \mathbb{C}^{N_t \times T}$ is the STBC codeword matrix which is transmitted over $T$ channel uses; $\mathbf{W} \in \mathbb{C}^{N_r \times T}$ is a complex-valued additive white Gaussian noise (AWGN) component; $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ is the channel matrix whose $(j, k)$th element $h_{j,k}$ denotes the channel coefficient of the link between the $k$th transmit antenna and the $j$th receive antenna. The channel is assumed to be quasi-static. That is, the channel coefficients remain constant over the duration of one STBC codeword, but change from one codeword to another. Moreover, the coefficients $h_{j,k}$ are assumed to be independent of each other.
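For linear STBCs, the codeword matrix $\mathbf{X}$ can be obtained through a linear operation [7]:

$$\widetilde{\mathrm{vec}}(\mathbf{X}) = \mathbf{G}\tilde{\mathbf{s}}, \qquad (3)$$

where $\widetilde{\mathrm{vec}}(\cdot)$ is the vectorization followed by the real/imaginary part separation, and $\mathbf{s} = [s_1, s_2, \ldots, s_\kappa]^T$ is the vector containing $\kappa$ independent information symbols. The code rate of the STBC is $\kappa/T$ information symbols per channel use. The generator matrix $\mathbf{G} \in \mathbb{R}^{2N_tT \times 2\kappa}$ is obtained as:

$$\mathbf{G} = \left[\widetilde{\mathrm{vec}}(\mathbf{A}_1), \widetilde{\mathrm{vec}}(\mathbf{B}_1), \ldots, \widetilde{\mathrm{vec}}(\mathbf{A}_\kappa), \widetilde{\mathrm{vec}}(\mathbf{B}_\kappa)\right], \qquad (4)$$

where $\mathbf{A}_j \in \mathbb{C}^{N_t \times T}$ and $\mathbf{B}_j \in \mathbb{C}^{N_t \times T}$ are the complex weight matrices representing the contribution of the real and imaginary parts of the $j$th information symbol $s_j$ to the final codeword matrix.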
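Separating the real and imaginary parts of the transmitted and received signals, and stacking the columns of the codeword, the received MIMO signal (2) can be expressed in an equivalent real-valued form:

$$\mathbf{y} = \mathbf{H}_{eq}\tilde{\mathbf{s}} + \mathbf{w}, \qquad (5)$$

where $\mathbf{y} = \widetilde{\mathrm{vec}}(\mathbf{Y})$, $\mathbf{w} = \widetilde{\mathrm{vec}}(\mathbf{W})$ and $\mathbf{H}_{eq} \in \mathbb{R}^{2N_rT \times 2\kappa}$ is the equivalent channel matrix, obtained by:

$$\mathbf{H}_{eq} = (\mathbf{I}_T \otimes \check{\mathbf{H}})\mathbf{G}, \qquad (6)$$

where $\check{\mathbf{H}} \in \mathbb{R}^{2N_r \times 2N_t}$ is the real-valued representation of $\mathbf{H}$ obtained by the complex-real conversion in (1). Note that the real-valued expression of the signal can be obtained from the complex-valued form via a linear transform. Hence, we will jointly use both real- and complex-valued forms in the sequel.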
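As an illustration of the equivalent real-valued model, the following sketch builds an equivalent channel matrix for the 2 × 2 Alamouti code (used here only because its weight matrices are small, not because it is the 3D MIMO code) and checks that the real-valued product matches the complex transmission without noise. The helper names and the channel values are hypothetical, and the 2 × 2 real representation follows the usual sign convention, which is an assumption.

```python
def check_matrix(H):
    """Entry-wise 2x2 real representation of a complex matrix (list of rows)."""
    rows = []
    for row in H:
        top, bottom = [], []
        for h in row:
            top += [h.real, -h.imag]
            bottom += [h.imag, h.real]
        rows += [top, bottom]
    return rows


def vec_tilde(X):
    """Stack the columns of X, then separate real and imaginary parts."""
    out = []
    for t in range(len(X[0])):
        for row in X:
            out += [row[t].real, row[t].imag]
    return out


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def equivalent_channel(H, weights, T):
    """Each column is (I_T kron check(H)) applied to one column of G."""
    Hc = check_matrix(H)
    n = len(Hc[0])                       # 2 * Nt real entries per channel use
    columns = []
    for W in weights:                    # weights = [A_1, B_1, ..., A_k, B_k]
        g = vec_tilde(W)                 # one column of the generator matrix
        col = []
        for t in range(T):               # block-diagonal action of I_T kron check(H)
            col += matvec(Hc, g[t * n:(t + 1) * n])
        columns.append(col)
    return [list(r) for r in zip(*columns)]   # list of columns -> list of rows


# Alamouti weight matrices for X = [[s1, -conj(s2)], [s2, conj(s1)]].
A1 = [[1 + 0j, 0j], [0j, 1 + 0j]]
B1 = [[1j, 0j], [0j, -1j]]
A2 = [[0j, -1 + 0j], [1 + 0j, 0j]]
B2 = [[0j, 1j], [1j, 0j]]

H = [[0.3 - 0.7j, 1.1 + 0.2j], [-0.4 + 0.9j, 0.6 - 0.5j]]   # hypothetical channel
s = [1 - 1j, -1 + 1j]
X = [[s[0], -s[1].conjugate()], [s[1], s[0].conjugate()]]
Y = [[sum(H[r][k] * X[k][t] for k in range(2)) for t in range(2)] for r in range(2)]

H_eq = equivalent_channel(H, [A1, B1, A2, B2], T=2)
s_tilde = [s[0].real, s[0].imag, s[1].real, s[1].imag]
y_direct = vec_tilde(Y)            # complex transmission, then real/imag split
y_model = matvec(H_eq, s_tilde)    # equivalent real-valued model
```

For this toy code the equivalent channel has 2·N_r·T = 8 rows and 2κ = 4 columns, in line with the dimensions stated above.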

ML decoding of MIMO signals
Once the channel $\mathbf{H}_{eq}$ is known by the receiver, the information symbols can be retrieved from the received signal $\mathbf{y}$ in (5). The maximum-likelihood (ML) solution of the transmitted signal is the combination of information symbols $\mathbf{s} = [s_1, s_2, \ldots, s_\kappa]$ that minimizes the Euclidean distance between the channel-distorted information signal $\mathbf{H}_{eq}\tilde{\mathbf{s}}$ and the received signal $\mathbf{y}$, namely:

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s} \in \Theta^{\kappa}} \|\mathbf{y} - \mathbf{H}_{eq}\tilde{\mathbf{s}}\|^2, \qquad (7)$$

where $\Theta$ is the set of constellation symbols. Equation (7) indicates that the ML solution is found by jointly determining $\kappa$ independent information symbols. In other words, when the modulation of these symbols is M-QAM, the ML decoding should exhaustively check all $M^\kappa$ combinations. The search complexity grows dramatically with a higher modulation order or a larger number of information symbols in one codeword. Hence, ML decoding is computationally demanding.
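The exhaustive search implied by the ML criterion can be sketched as follows. This is a minimal illustration, not the decoder studied in this paper: the function name `ml_exhaustive`, the 4 × 4 real channel matrix and the QPSK alphabet are assumptions chosen so that the example stays small (κ = 2, hence 4^2 = 16 candidates).

```python
from itertools import product

QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]


def ml_exhaustive(H_eq, y, constellation, kappa):
    """Check every one of the M**kappa symbol combinations."""
    best, best_metric, visited = None, float("inf"), 0
    for cand in product(constellation, repeat=kappa):
        visited += 1
        # Real/imaginary separation of the candidate symbol vector.
        s_tilde = [part for s in cand for part in (s.real, s.imag)]
        metric = sum((yi - sum(h * x for h, x in zip(row, s_tilde))) ** 2
                     for yi, row in zip(y, H_eq))
        if metric < best_metric:
            best, best_metric = cand, metric
    return best, best_metric, visited


# Hypothetical real-valued equivalent channel (2*kappa = 4 columns).
H_eq = [[2.0, 0.5, 0.1, -0.3],
        [0.0, 1.5, 0.2, 0.4],
        [0.0, 0.0, 1.8, -0.6],
        [0.0, 0.0, 0.0, 1.2]]
s_true = (1 - 1j, -1 + 1j)
s_true_tilde = [part for s in s_true for part in (s.real, s.imag)]
y = [sum(h * x for h, x in zip(row, s_true_tilde)) for row in H_eq]   # noise-free

s_hat, metric, visited = ml_exhaustive(H_eq, y, QPSK, kappa=2)
```

The number of visited candidates grows as M^κ, which for the eight symbols of one 3D MIMO codeword gives the M^8 figure quoted above.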

Fast ML decoding of MIMO signals
More efficient STBC decoding is achieved with the help of the orthogonal-triangular (QR) decomposition [7,18]. The QR decomposition of the equivalent channel matrix $\mathbf{H}_{eq}$ yields $\mathbf{H}_{eq} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q} \in \mathbb{R}^{2N_rT \times 2\kappa}$ is a unitary matrix, and $\mathbf{R} \in \mathbb{R}^{2\kappa \times 2\kappa}$ is an upper triangular matrix. The definitions of the QR decomposition based on the "classical Gram-Schmidt algorithm" can be found in the Appendices. Note that other numerically stable QR decomposition algorithms can also be used without affecting the properties of the 3D MIMO code or the resulting low-complexity decoding methods. Instead of solving (7), the ML solution can alternatively be found by:

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{S}} \|\mathbf{z} - \mathbf{R}\tilde{\mathbf{s}}\|^2, \qquad (8)$$

where $\mathbf{z} = \mathbf{Q}^T\mathbf{y} \in \mathbb{R}^{2\kappa}$ is a linear transformation of the received signal and $\mathcal{S}$ is a hypersphere centered on the received signal. Only the codewords inside the hypersphere are checked during the search in order to reduce the search complexity. The size of the hypersphere is represented by its radius. The decoding process is turned into a bounded search over a κ-level tree with complex-valued nodes. Hence, the worst-case decoding complexity remains O(M^κ).
Moreover, according to the properties of the QR decomposition, some information symbols can be decoded independently of the others if some elements of $\mathbf{R}$ are equal to zero. This means that the joint search in a high dimension is converted into several parallel, independent searches in low dimensions, which results in a significant reduction of the worst-case decoding complexity [7,8,19].
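A minimal sketch of a QR decomposition by the classical Gram-Schmidt procedure is given below to make the link between zero entries of R and orthogonal columns concrete. It assumes a real matrix with full column rank; the function name `gram_schmidt_qr` and the 3 × 3 example matrix are hypothetical, and this sketch is not the definition given in the Appendices.

```python
import math


def gram_schmidt_qr(H):
    """Classical Gram-Schmidt QR of a real matrix (list of rows, full column rank)."""
    m, n = len(H), len(H[0])
    cols = [[H[i][k] for i in range(m)] for k in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = cols[k][:]
        for j in range(k):
            # R[j][k] is the inner product <q_j, h_k> with the ORIGINAL column h_k.
            R[j][k] = sum(a * b for a, b in zip(q_cols[j], cols[k]))
            v = [vi - R[j][k] * qi for vi, qi in zip(v, q_cols[j])]
        R[k][k] = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / R[k][k] for vi in v])
    Q = [[q_cols[k][i] for k in range(n)] for i in range(m)]
    return Q, R


# Hypothetical matrix whose first two columns are orthogonal to each other.
H = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 2.0],
     [0.0, 0.0, 3.0]]
Q, R = gram_schmidt_qr(H)
print(R[0][1])   # zero entry: the first two symbols do not interfere
```

Because the first two columns of this example are orthogonal, the corresponding entry of R vanishes, which is the kind of structure that allows some symbols to be detected independently.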

3D MIMO Code
The 3D MIMO code [3] is more robust against received signal power imbalances in distributed MIMO broadcasting scenarios, but it suffers from high decoding complexity [4]. In this section, we propose a new 3D MIMO codeword that enables low sphere decoding complexity by exchanging the positions of information symbols in the original 3D MIMO codeword. The basic idea behind this modification comes from two facts: the orthogonality embedded in the information symbols essentially enables independent detections, and the sphere decoding complexity is mainly determined by the orthogonality among the first several symbols. Hence, exploiting the underlying orthogonality in the codeword and carefully choosing the sequence of information symbols can bring benefits in terms of decoding complexity.

A new proposal of the 3D MIMO codeword
The initially proposed codeword matrix of the 3D MIMO code is explicitly written as: where $\theta = \frac{1+\sqrt{5}}{2}$. It is constructed in a hierarchical manner: eight information symbols (κ = 8) are first encoded into two Golden codewords [20], i.e. $\mathbf{X}_{Golden,1}$ and $\mathbf{X}_{Golden,2}$, which are subsequently arranged in an Alamouti manner [21] over four channel uses (T = 4).
This results in a code rate of 2, which is full-rate for the 4 × 2 MIMO transmission. A previous study shows that the 3D MIMO code achieves efficient and robust performance. However, since eight information symbols are stacked in one codeword, the ML decoding complexity is up to O(M^8).
It was shown that it is possible to achieve lower sphere decoding complexity through permuting the sequence of information symbols [22]. We propose to slightly modify the codeword by exchanging the positions of information symbols (s 3 , s 4 ) and (s 5 , s 6 ), yielding a new form of codeword: Since we only change the sequence of the information symbols in the codeword (the third and fourth information symbols become the fifth and sixth, and vice versa) and the information symbols are independent from each other, the new codeword preserves all the good attributes of the original 3D MIMO code in distributed MIMO scenarios illustrated in [4]. More importantly, this modification is based on the embedded orthogonalities in the 3D MIMO codeword and yields an interesting codeword structure which will be exploited to achieve lower decoding complexity. The advantages brought by the new codeword structure will be highlighted in the following sections.

Key properties of the proposed 3D MIMO codeword
Due to the underlying Alamouti and Golden structures, the 3D MIMO code has some unique properties which lead to simplified decoding algorithms.
Based on the new codeword in (10) and taking into account (6), (3) and (4), we can obtain a few interesting properties of R that can be made use of to achieve a low decoding complexity.
Theorem 2. $\mathbf{R}_{12}$ is a null matrix when the channel is quasi-static, i.e. $\langle \mathbf{q}_j, \mathbf{h}_k \rangle = 0$, $\forall j = 1, 2, 3, 4$ and $k = 5, 6, 7, 8$.
Similarly, $\langle \mathbf{q}_2, \mathbf{h}_3 \rangle = 0$ means that their imaginary parts, namely $z(2)$ and $z(4)$, do not contain any contribution from $s^R_1$ and $s^R_2$, either. As we will show later, this real/imaginary independence leads to independent and parallel detections for the real part and the imaginary part, respectively.
The real/imaginary part independence comes from the underlying Golden and Alamouti structures. It has been shown that the complex-valued R matrix of the Golden code has a real upper-left submatrix [19], which coincides with the structure presented in Theorem 1. This shows the real/imaginary part independence of the Golden code in its 2 × 2 codeword matrix. The Alamouti-like arrangement of the two Golden codewords, on the other hand, helps create this independence in the 4 × 4 codeword matrix of the 3D MIMO code.

Group-wise parallel detections
We divide the information symbols and received symbols into four groups, i.e. a = [s 1 , Taking into account the structure of R and Theorem 2, the decoding metric in (8) can be rewritten as: From (12) and (13), it can be seen that the contributions from the information symbol groups a and b are uncorrelated in the received symbol. For instance, z 12 does not contain any information from b, and z 34 is irrelevant to a, either. This enables us to use group-wise conditional detections to retrieve the ML solutions [23].
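In particular, the ML solution $\hat{\mathbf{s}}_{ML} = [\hat{\mathbf{a}}, \hat{\mathbf{b}}, \hat{\mathbf{c}}, \hat{\mathbf{d}}]^T$ is achieved in two search steps, namely a joint "outer" search for $[\hat{\mathbf{c}}, \hat{\mathbf{d}}]$: and two independent "inner" searches for $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$, respectively: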
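The principle of combining an outer joint search with independent inner searches can be sketched on a toy problem. The example below is deliberately simplified and is an assumption-laden illustration: each of the groups a, b and the jointly searched group c is reduced to a single real 4-PAM symbol, and the 3 × 3 upper-triangular matrix `R` with a zero between a and b, the vector `z` and all function names are hypothetical.

```python
from itertools import product

PAM = (-3, -1, 1, 3)


def metric_parts(z, R, a, b, c):
    """Per-group terms of ||z - R s||^2 when R[0][1] == 0 (a and b never interact)."""
    ma = (z[0] - R[0][0] * a - R[0][2] * c) ** 2
    mb = (z[1] - R[1][1] * b - R[1][2] * c) ** 2
    mc = (z[2] - R[2][2] * c) ** 2
    return ma, mb, mc


def joint_search(z, R):
    """Reference: exhaustive search over all |PAM|**3 combinations."""
    return min(product(PAM, repeat=3),
               key=lambda s: sum(metric_parts(z, R, *s)))


def conditional_search(z, R):
    """Outer search over c; for each c, a and b are found independently."""
    best, best_metric = None, float("inf")
    for c in PAM:
        a = min(PAM, key=lambda x: metric_parts(z, R, x, 0, c)[0])   # inner search 1
        b = min(PAM, key=lambda x: metric_parts(z, R, 0, x, c)[1])   # inner search 2
        total = sum(metric_parts(z, R, a, b, c))
        if total < best_metric:
            best, best_metric = (a, b, c), total
    return best


R = [[1.0, 0.0, 0.5],
     [0.0, 1.2, -0.3],
     [0.0, 0.0, 0.8]]
z = [1.3, -2.2, 0.4]
print(conditional_search(z, R), joint_search(z, R))
```

In this toy case the conditional search evaluates 4 × (4 + 4) inner candidates instead of 4^3 joint ones while returning the same minimizer, which is the mechanism behind the complexity reduction of group-wise conditional detection.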

Independent detections of real and imaginary parts
If square QAM modulations are considered, the decoding complexity can be further reduced. A square M-QAM symbol can be separated into two independent $\sqrt{M}$-PAM symbols on the real and imaginary axes, respectively. Using Theorem 1 and Corollary 1, the real and imaginary parts can be decoded separately.
Take the detection of $\mathbf{a}$ as an example. Denote its real and imaginary parts as $\mathbf{a}^R = [s^R_1, s^R_2]^T$ and $\mathbf{a}^I = [s^I_1, s^I_2]^T$, respectively. Given $[\mathbf{c}, \mathbf{d}]$ and using Theorem 1, the detection of $\mathbf{a}$ in (16) is rewritten as [19]: where $\Psi$ is the set of $\sqrt{M}$-PAM symbols; $\mathbf{R}^R_{11}$ and $\mathbf{R}^I_{11}$ are tailored upper-triangular matrices associated with the real and imaginary parts, respectively: (18) means that the detections of the real and imaginary parts are similar and can be performed separately.
Take the real part as an example. We apply the conditional detection again here. For a given $s^R_2$, the metric for the real part detection becomes: where For a given $s^R_2$, the best $s^R_1$ that minimizes the decoding metric can alternatively be found by minimizing a quadratic function of $s^R_1$ given on the right-hand side of (20). The best solution of $s^R_1$ is easily found by: where $Q(\cdot)$ is the slicing operation providing the PAM symbol that is closest to the given value.
Consequently, $\hat{s}^R_1$ is obtained by using the solution $\hat{s}^R_2$ in (21). A similar process can be applied to solve the imaginary parts. The best solution of $[\hat{s}^I_1, \hat{s}^I_2]^T$ given $[\mathbf{c}, \mathbf{d}]$ can be found by: where

Table 1:
DjABBA [12]: O(M^7), O(M^6)
Perfect code (2-layer) [24]: O(M^6), O(M^{5.5})
BHV [7]: O(M^6), O(M^{4.5})
EAST [9]: O

It can be seen that the 3D MIMO code is among the simplest full-rate STBCs when square QAM modulations are considered.
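The conditional detection of one real-valued symbol pair with a slicer can be sketched as follows. This is a generic illustration on a 2 × 2 real upper-triangular system, not the exact tailored matrices of the paper: the names `slice_pam` and `detect_real_pair`, the coefficients `r11`, `r12`, `r22` and the observations `z1`, `z2` are hypothetical, and `r11` is assumed to be positive.

```python
def pam_alphabet(levels):
    """PAM alphabet {-(L-1), ..., -1, 1, ..., L-1}; L = sqrt(M) for square M-QAM."""
    return range(-(levels - 1), levels, 2)


def slice_pam(value, levels):
    """Slicing operation: PAM symbol closest to the given real value."""
    return min(pam_alphabet(levels), key=lambda p: abs(p - value))


def detect_real_pair(z1, z2, r11, r12, r22, levels):
    """Enumerate s2; for each s2 the best s1 follows directly from slicing."""
    best = None
    for s2 in pam_alphabet(levels):
        s1 = slice_pam((z1 - r12 * s2) / r11, levels)
        metric = (z1 - r11 * s1 - r12 * s2) ** 2 + (z2 - r22 * s2) ** 2
        if best is None or metric < best[0]:
            best = (metric, s1, s2)
    return best[1], best[2]


# Hypothetical values; levels=4 corresponds to the PAM component of 16-QAM.
pair = detect_real_pair(z1=2.1, z2=-0.7, r11=1.0, r12=0.6, r22=0.9, levels=4)
print(pair)
```

Only √M values of the second symbol are enumerated per branch. Combined with the joint search over the four remaining complex symbols (M^4 candidates), this gives M^4 · √M = M^{4.5} operations in the worst case, consistent with the complexity order stated for square QAM.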

Proposed Implementation of the Simplified ML Decoder
In the previous sections, we have illustrated the fast decodability of the 3D MIMO code in theory. With this knowledge, we propose an implementation of the simplified ML decoder that can be used in practice. Using the two-stage tree-search structure and leveraging the symmetric structure of the codeword, the proposed implementation requires a low average complexity in practice. Moreover, various performance-complexity trade-offs can be easily achieved by replacing the sphere decoder with other suboptimal tree search algorithms such as the K-best algorithm [25] or the fixed-complexity sphere decoder [26].

Figure 3: Illustration of the two-stage sphere decoding.

Two-stage decoding structure
Recall that the fast decodability is achieved by concatenating the joint search of four complex symbols and several detections in parallel. Figure 3 presents the general structure of the proposed simplified ML decoder. Detailed pseudo code is presented in Algorithms 1, 2, 3 and 4 so that the proposed decoder can be implemented without major effort.

4-level tree search phase
The joint detection of [c, d] is realized by a complex sphere decoder with Schnorr-Euchner enumeration, which is visualized by the search over a 4-level tree as shown in Figure 3. The nodes of the same level represent all the solutions of a complex information symbol. Each path from the root to a leaf node represents a possible combination of [c, d].
The details of the tree search are presented in Algorithm 2. The search starts from the root node and traverses the nodes of lower levels in a depth-first manner. An adaptive search radius is used to speed up the convergence of the algorithm by limiting the search to a hypersphere $\mathcal{S}$. For the node being checked, the partial distance resulting from the current path is compared with the radius. If the partial distance is smaller than the radius, the search moves on to the child nodes on the next level. Otherwise, the search jumps to another sibling node on the current level. When all the nodes of the level have been checked, the search goes back to the upper level. The radius is initially set to infinity and is adaptively updated whenever a better solution is found.
The sequence in which the sibling nodes are visited is determined by their partial distances in ascending order. This guarantees that the promising candidates are visited first, which reduces the search complexity. This ordering process is referred to as Schnorr-Euchner enumeration [18,27,28]. It can simply be implemented by a lookup table [29,30] (line 4 in Algorithm 1), and its complexity is merely the computation of the linear estimation $\hat{\mathbf{s}}_{ZF}$.
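The depth-first search with an adaptive radius and Schnorr-Euchner ordering can be sketched as below. This is a simplified, real-valued illustration rather than Algorithm 2: each tree level carries one real PAM symbol instead of a complex symbol, the ordering is computed by sorting instead of a lookup table, and the search skips the remaining siblings of a level as soon as one exceeds the radius (valid here because the siblings are visited in order of increasing distance). The names `se_order` and `sphere_decode` and the numerical values are hypothetical.

```python
def se_order(center, alphabet):
    """Schnorr-Euchner order: candidates sorted by distance to the estimate."""
    return sorted(alphabet, key=lambda p: abs(p - center))


def sphere_decode(z, R, alphabet):
    """Depth-first search on z = R s + noise, with R upper triangular and real."""
    n = len(z)
    state = {"radius": float("inf"), "path": None, "visited": 0}
    s = [0] * n

    def search(level, partial):
        # Interference from the symbols already fixed closer to the root.
        interf = sum(R[level][k] * s[k] for k in range(level + 1, n))
        center = (z[level] - interf) / R[level][level]
        for cand in se_order(center, alphabet):
            state["visited"] += 1
            dist = partial + (z[level] - interf - R[level][level] * cand) ** 2
            if dist >= state["radius"]:
                break                    # later siblings are farther away
            s[level] = cand
            if level == 0:               # leaf reached: shrink the radius
                state["radius"], state["path"] = dist, s[:]
            else:
                search(level - 1, dist)

    search(n - 1, 0.0)                   # the root is the last symbol
    return state["path"], state["radius"], state["visited"]


R = [[1.5, 0.3, -0.2, 0.4],
     [0.0, 1.2, 0.5, -0.1],
     [0.0, 0.0, 1.1, 0.3],
     [0.0, 0.0, 0.0, 0.9]]
z = [0.7, -1.9, 2.4, -1.1]
path, radius, visited = sphere_decode(z, R, (-3, -1, 1, 3))
print(path, radius, visited)
```

In this example only a small fraction of the 4^4 leaves is ever reached, because the first leaf found already gives a tight radius.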

Parallel decision phase
Once a leaf node is reached in the tree search, a better solution of [c, d] is found. The tree search process is then suspended and the new [c, d] is used to trigger the parallel detections of the remaining symbols.
The parallel detection is depicted in Figure 4. The implementation details are presented in Algorithm 3.
As shown in Figure 4, the four detection branches are carried out in parallel. For each branch, a one-level sphere decoder is used to traverse all possible PAM symbols as given in (22).
The visiting sequence is also determined by the Schnorr-Euchner enumeration. The detections in different branches are synchronized by a common clock signal because the operations are exactly the same for all branches. All branches simultaneously check the first candidate PAM symbol and then move on to the second one, and so on.
Moreover, we propose a mechanism that terminates the search in each branch not only based on its own results, but also taking into account the results from other branches. In particular, once the best solution of the jth branch is found ahead of others, the resulting branch distance d j is recorded and shared with other branches to speed up the overall search process.
Take the search of the first branch as an example. The most promising PAM symbol in the unchecked symbol list is assigned to $s^R_2$ (line 8 in Algorithm 3). The partial distance $\tau_1$ is calculated (line 9 in Algorithm 3). The search is terminated in two cases: 1) if this partial distance is greater than the current minimum branch distance ($\tau_1 > p_1$); or 2) if the overall distance is beyond the current radius of the sphere decoder in the tree search phase ($(\tau_1 + d_2 + d_3 + d_4 + d) > \text{radius}$).
Once the searches on all branches are terminated, the solution [a, b] and the resulting distance $d_p$ are returned to the tree search phase. The tree search process is resumed. The overall distance is compared with the current radius (line 14 in Algorithm 2) to determine whether the current solution is a better one.
If a better solution is found, the radius is updated accordingly (line 16 in Algorithm 2). The tree search process is moved on to the next unchecked node.

Column switch based on ZF estimation
In the proposed algorithm, the search of eight symbols is divided into a tree search for four symbols and parallel detections for the other four symbols. Due to the symmetric structure of the codeword matrix The exchanging of the symbol sequences can be achieved by permuting the corresponding columns in the equivalent channel matrix H eq . Note that, the aforementioned column permutations do not affect the decoding performance. This permits us to choose the symbols that will be determined by the tree search and the ones that will be decoded in the parallel detections.
The proposed column switch method is presented in Algorithm 4. The basic idea is to use the tree search to determine the more difficult half of the symbols and to use the parallel detections to find the easier half. The reason is that the parallel decoding is more efficient when it decodes reliable symbols separately: the more accurate the linear estimation, the faster each individual detection branch converges. The tree search phase, on the other hand, is by nature a joint serial detection, which is more suitable for decoding the unreliable symbols.
The next question is how to properly choose the unreliable symbols. In the literature, Barbero et al. proposed to sort the decoding sequence based on the norm of subchannels in the fixed-complexity sphere decoder [26]. However, it is not applicable here because the 3D MIMO code achieves full-diversity and the equivalent subchannels have similar norm values. In addition, as we have to maintain the structure of R matrix, the unconstrained subchannel sorting proposed in [29] is not applicable, either.
Alternatively, we propose to sort the information symbols according to the aggregate error of the linear estimation: where $\mathbf{s}_{ZF} = \mathbf{H}^{\dagger}_{eq}\mathbf{y}$ is the unconstrained estimation of the information symbols, in which $\mathbf{H}^{\dagger}_{eq}$ represents the inverse of the equivalent channel matrix; $\hat{\mathbf{s}}_{ZF} = Q(\mathbf{s}_{ZF})$ is the constellation point that is closest to $\mathbf{s}_{ZF}$. The metric is the distance between the estimated information symbols and the nearest constellation points, i.e. an indicator of the estimation accuracy.
Using (25), the decoding sequence can be determined in two levels. We first compare the aggregate errors of the first half and second half parts of symbols (line 2 in Algorithm 4). The half with worse accuracy is assigned to the tree search (put in the latter part of the decoding sequence). Consequently, within this half part, the errors of the first two symbols and the second two are compared. The two symbols with worse accuracy are put closer to the root of the tree. If this two-symbol by two-symbol exchange takes place in the second half of the symbols which are to be decoded using tree search, the same two-symbol by two-symbol exchange should be done accordingly in the other half in order to maintain the structure of R matrix. If only the symbol exchange between two halves of the symbols is carried out, it is referred to as "4-by-4 column switch". Otherwise, if the exchange within each half is also performed, it is called "2-by-2 column switch".
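The two-level ordering described above can be sketched as follows. Several points are assumptions made for illustration and are not taken from Algorithm 4: the aggregate error is computed as the sum of squared distances between each zero-forcing estimate and its nearest constellation point, the root of the tree is taken to be the last position of the decoding sequence, and the function names and the example estimates are hypothetical.

```python
def slice_qam(x, levels):
    """Nearest square-QAM point; levels = sqrt(M) PAM levels per axis."""
    pam = range(-(levels - 1), levels, 2)

    def nearest(v):
        return min(pam, key=lambda p: abs(p - v))

    return complex(nearest(x.real), nearest(x.imag))


def column_switch(s_zf, levels, two_by_two=True):
    """Return the decoding order of the eight symbol indices."""
    # Assumed reliability metric: squared distance to the sliced estimate.
    err = [abs(x - slice_qam(x, levels)) ** 2 for x in s_zf]
    first, second = [0, 1, 2, 3], [4, 5, 6, 7]
    # 4-by-4 switch: the less reliable half goes last (tree search).
    if sum(err[j] for j in first) > sum(err[j] for j in second):
        first, second = second, first
    # 2-by-2 switch: inside the tree-search half, the less reliable pair goes
    # last (assumed to be the root side); the other half is permuted the same
    # way so that the structure of R is kept.
    if two_by_two:
        if sum(err[j] for j in second[:2]) > sum(err[j] for j in second[2:]):
            second = second[2:] + second[:2]
            first = first[2:] + first[:2]
    return first + second


# Hypothetical zero-forcing estimates for QPSK: symbols 0 and 1 are unreliable.
s_zf = [0.2 + 0.1j, 0.3 - 0.2j, 1.0 + 1.0j, -1.0 + 1.0j,
        0.9 + 1.0j, 1.0 - 1.0j, -1.0 - 1.0j, 1.0 + 1.1j]
order_2x2 = column_switch(s_zf, levels=2)
order_4x4 = column_switch(s_zf, levels=2, two_by_two=False)
print(order_2x2, order_4x4)
```

In the actual decoder the resulting order would be applied as a permutation of the corresponding columns of the equivalent channel matrix before the QR decomposition.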
The advantage of the column switch will be shown in the next section.

The ML decoder and the sphere decoder with Schnorr-Euchner (S-E) enumeration proposed by Guo and Nilson [28] are also given as references. Guo-Nilson's sphere decoder is a low-complexity implementation of the sphere decoder with S-E enumeration, but it is sub-optimal in terms of symbol error rate. It can be seen that the proposed decoders achieve the same performance as the ML decoder with both QPSK and 16-QAM modulations. In addition, the proposed decoders outperform Guo-Nilson's sphere decoder with 16-QAM modulation. A gain of around 0.7 dB can be observed at a symbol error rate of 1 × 10^{-4}.

Guo-Nilson's sphere decoder is also given as a reference. Since processing each node requires roughly the same operations for both decoders, these experiments actually compare the processing time latency [19,28]. It can be seen from the results that the proposed decoders require much less processing time than the ML decoder, which needs to traverse all M^8 possibilities. In the QPSK case, the proposed decoders always yield less latency than Guo-Nilson's sphere decoder. For instance, the proposed decoder with 2-by-2 column switch visits only 254.6 nodes on average at an SNR of 0 dB. Compared with Guo-Nilson's decoder, which visits 1276.6 nodes at an SNR of 0 dB, the proposed one achieves a processing time reduction of 80%. The reductions are 50% and 49% at 10 dB and 20 dB, respectively. In addition, the improvements brought by the proposed column switch technique can also be seen in the results. For instance, the 2-by-2 column switch yields a processing time reduction of 38% at 0 dB compared with the decoder without column switch. This improvement is less significant in the high SNR region (e.g. greater than 15 dB).
Moreover, the 2-by-2 column switch offers better performance than its 4-by-4 counterpart because it is more likely to allocate the four most unreliable complex symbols to the tree-search stage, which helps improve the global convergence speed. In the 16-QAM case (see Figure 7), the proposed decoder with 2-by-2 column switch also reduces the processing time in the low SNR region. At 8 dB, it visits 1301.1 nodes on average, yielding a time reduction of 84% compared with Guo-Nilson's decoder. The proposed decoder needs a similar time latency to Guo-Nilson's decoder in the higher SNR region (e.g. greater than 15 dB). Taking into account the symbol error rate performance given in Figure 5, 8 to 15 dB is the SNR region where the error correction capability of the channel coding takes effect significantly. This region also represents the minimum SNR level at which the receiver can still work properly.

Computational complexity
Figure 8 and Figure 9 give the overall number of multiplications required to decode each codeword. For the proposed decoders, the multiplications spent by the tree search and by all four search branches are taken into account. The computational overheads, such as the QR decomposition and the linear estimation, are also included in the results to give the overall complexity of the decoders. It can be seen from the results that, in the QPSK case, the proposed decoder with 2-by-2 column switch spends 11%, 21% and 3% more multiplications than Guo-Nilson's decoder at SNRs of 0 dB, 6 dB and 20 dB, respectively. In the 16-QAM case, it needs 5% fewer multiplications at 8 dB but spends 85% and 9% more multiplications at 14 dB and 28 dB, respectively. However, it is worth noting that, at the cost of increased multiplications, the proposed decoders provide lower processing latencies. For instance, the proposed decoder with 2-by-2 column switch achieves a 62% processing time reduction at an SNR of 6 dB with QPSK and a 42% reduction at 14 dB with 16-QAM.
Finally, Figure 10 and Figure 11 present the overall number of divisions spent by the decoders. The proposed decoders require fewer divisions than Guo-Nilson's decoder. For instance, the proposed decoder with 2-by-2 column switch requires 47%, 9% and 5% fewer divisions at SNRs of 0 dB, 10 dB and 20 dB, respectively, with QPSK. In the 16-QAM case, it achieves a 79% reduction of divisions at 8 dB. Meanwhile, the two decoders spend roughly the same number of divisions in the higher SNR region, e.g. greater than 18 dB.
In general, we can see the different trade-offs achieved by the different decoders. The proposed decoders achieve ML performance with less time latency and fewer divisions than Guo-Nilson's decoder. On the other hand, Guo-Nilson's decoder needs fewer multiplications, with some performance loss with 16-QAM.

Conclusion
The 3D MIMO code has been shown to be efficient and robust in distributed MIMO scenarios. Yet, it suffers from high ML decoding complexity. In this paper, we first proposed a new form of the 3D MIMO codeword and investigated some important properties of the new codeword. With these properties, the 3D MIMO code is proved to be fast decodable. We then proposed a reduced-complexity ML decoder for the 3D MIMO code which offers the same performance as the ML decoder. Simulation results demonstrate that the novel low-complexity decoder yields a much lower processing latency than the classical Guo-Nilson's sphere decoder with Schnorr-Euchner enumeration. Moreover, the proposed 2-by-2 column switch technique can significantly reduce the average decoding complexity, especially with 16-QAM modulation.
Algorithm 1: Simple ML decoder for 3D MIMO code.