Low-complexity deep unfolded neural network receiver for MIMO systems based on the probability data association detector

Interest in applications that combine machine learning algorithms and communications has been rising in recent years. Machine learning and neural networks are being advocated as a way of improving the performance of several functions across all layers of future communication systems. Furthermore, in applications where complexity reduction is essential for system feasibility, at the cost of an affordable performance loss, more efficient systems might be achieved with the aid of machine learning algorithms. Signal detection for multiple-input multiple-output (MIMO) systems has become a hot topic in recent years given its prominent role in the fourth and fifth generations of mobile networks. However, the computational complexity of MIMO detection can become prohibitive when the number of antennas increases. Therefore, by leveraging neural network architectures, we propose a deep unfolded detector, whereby the algorithm of the probability data association (PDA) detector is adapted and enhanced by means of neural network learning capabilities. We show that the proposed detector is orders of magnitude less complex than the PDA detector, while incurring no severe penalties in performance in terms of bit error rate (BER).

less complex than the latter. However, the system model in [4] considers a system. Recently, several works [5,6] proposed solutions that attempt to integrate machine learning (ML) and NNs into MIMO systems. One emerging solution involves adapting NN architectures according to model-driven detection algorithms, such that their iterations are unfolded on the NN layers. This solution is called deep unfolding [5,7]. Therefore, in this work we propose a deep unfolded detector [8] based on the probability data association (PDA) detector [9] for MIMO systems. The main aim is to achieve in MIMO systems the aforementioned advantages of data-driven detectors for SISO systems, while maintaining the advantageous features of the PDA detector [3]. To the best of the authors' knowledge, this is the first attempt at combining the deep unfolded architecture with the algorithm of the PDA detector for MIMO systems.
MIMO systems are also largely used for beamforming and beamsteering in the most recent mobile networks, where precoding can provide spatial multiplexing and improve the system performance without increasing the complexity on the receiver side [10]. It is clear that precoding will play an important role in future MIMO systems for mobile communications. Nevertheless, this work focuses on MIMO systems where multiple antennas transmit data over a rich scattering environment without considering precoding, relying on detection techniques that can resolve the interantenna interference (IAI) with affordable complexity, a scenario where the PDA detector is an interesting solution [3].

Contributions and paper organization
In this paper, we make the following contributions:
• We propose a novel combination of the data-driven deep unfolded detector and the PDA algorithm for signal detection in MIMO systems;
• Differently from other similar proposals [5,6,8], we employ the categorical cross-entropy loss function and dispense with the use of optimal Gaussian denoisers;
• The computational complexity of the proposed detector is evaluated and compared with the complexity presented by detectors of interest;
• A low-complexity variation of the deep unfolded PDA (DU-PDA) is also presented, its computational complexity being lower than that of the linear zero-forcing (ZF) detector;
• Numerical results from computational simulations compare the uncoded and coded error rates of the proposed detectors with other detectors under time-dispersive channels.
The remainder of this paper is organized as follows. In Sect. 2, we present the system model of the baseline orthogonal frequency division multiplexing (OFDM)-MIMO system. Section 3 then introduces the problem of signal detection for MIMO systems and gives a brief description of the PDA detector and of deep unfolding. This is followed by a description of the proposed DU-PDA and an analysis of the computational complexity of all detectors discussed throughout this paper. Next, in Sect. 4, we provide numerical results to evaluate the performance of all detectors studied in this paper, including the optimum MLD. Finally, Sect. 5 concludes the paper.
The $n$th entry of the vector $\mathbf{x}$ is represented by $x(n)$. The entry on the $i$th row and $j$th column of the matrix $\mathbf{X}$ is denoted by $X_{i,j}$. The superscript in $\mathbf{x}^{(n)}$ denotes the $n$th instance of the vector $\mathbf{x}$, such that $\mathcal{X} = \{\mathbf{x}^{(n)}\}_{\forall n}$ forms a collection of vectors or a dataset. The sets of real and complex numbers are represented by $\mathbb{R}$ and $\mathbb{C}$, respectively. The absolute value of the scalar $x \in \mathbb{R}$ or the modulus of $x \in \mathbb{C}$ is denoted by $|x|$. The sets of vectors of dimension $X$ with real and complex entries are, respectively, represented by $\mathbb{R}^{X}$ and $\mathbb{C}^{X}$. The sets of matrices of dimension $X \times Y$ with real and complex entries are correspondingly described by $\mathbb{R}^{X \times Y}$ and $\mathbb{C}^{X \times Y}$. The transposition operation of a vector or matrix is represented as $(\cdot)^T$. The $\ell_p$-norm, $p \geq 1$, of the vector $\mathbf{x}$ is given by $\|\mathbf{x}\|_p = \left(|x(0)|^p + |x(1)|^p + \cdots + |x(n-1)|^p\right)^{1/p}$. The expected value of the random variable $z$ is denoted by $E[z]$. The real and imaginary parts of $z \in \mathbb{C}$ are denoted by $\Re(z)$ and $\Im(z)$. The estimate of a scalar $x$, a vector $\mathbf{x}$ or a matrix $\mathbf{X}$ is represented by $\hat{x}$, $\hat{\mathbf{x}}$ and $\hat{\mathbf{X}}$, respectively. The number of elements in a set $\mathcal{X}$ is given by $\#\mathcal{X}$. Computational complexity is denoted by the asymptotic operator $O(\cdot)$.

System model
Suppose that in a multiple antenna system we have $N_t$ transmitting antennas and $N_r$ receiving antennas, thereby constituting an $N_t \times N_r$ point-to-point baseband and fully digital MIMO system. Bits of data are demultiplexed into $N_t$ substreams, which in turn are mapped to sequences of complex symbols. These symbols are transmitted by their respective transmit antennas using an OFDM system, for which it is assumed that the cyclic prefix (CP) length is larger than the maximum delay spread of all $N_t N_r$ channels. Finally, after performing the discrete Fourier transform (DFT), we have the following representation of the received baseband signal at the $k$th subcarrier:

$$\tilde{\mathbf{r}}_k = \tilde{\mathbf{H}}_k \tilde{\mathbf{a}}_k + \tilde{\mathbf{n}}_k. \tag{1}$$

Here, $\tilde{\mathbf{H}}_k \in \mathbb{C}^{N_r \times N_t}$ is the matrix containing all channel frequency responses for the $k$th OFDM subcarrier; $\tilde{\mathbf{a}}_k \in \mathbb{C}^{N_t}$ represents the symbol vector transmitted by the $N_t$ transmit antennas on the $k$th subcarrier of the OFDM block and $\tilde{\mathbf{n}}_k \in \mathbb{C}^{N_r}$ is the complex additive white Gaussian noise (AWGN) vector in the frequency domain at the $k$th subcarrier for the $N_r$ receive antennas, with zero mean and covariance matrix given by $\sigma^2 \mathbf{I}_{N_r}$.
For convenience, henceforth we make use of the real-valued system representation [3,8,9]. Therefore, let the received signal (1) be represented by the concatenation of its real and imaginary parts, such that

$$\mathbf{r}_k = \mathbf{H}_k \mathbf{a}_k + \mathbf{n}_k, \tag{2}$$

where $\mathbf{r}_k \in \mathbb{R}^{2N_r}$, $\mathbf{H}_k \in \mathbb{R}^{2N_r \times 2N_t}$, $\mathbf{a}_k \in \mathbb{R}^{2N_t}$ and $\mathbf{n}_k \in \mathbb{R}^{2N_r}$ are the real-valued counterparts of $\tilde{\mathbf{r}}_k$, $\tilde{\mathbf{H}}_k$, $\tilde{\mathbf{a}}_k$ and $\tilde{\mathbf{n}}_k$ obtained from this concatenation. Moreover, we assume that $\Re(\tilde{\mathbf{a}}_k) \in S^{N_t}$ and $\Im(\tilde{\mathbf{a}}_k) \in S^{N_t}$; that is, the real and imaginary parts of $\tilde{\mathbf{a}}_k$ can take on different values from the finite set of coordinates pertaining to the square $M$-quadrature amplitude modulation (QAM) constellation. Hence, let $S$ denote this set of coordinates, scaled such that the constellation energy is normalized to 1 (unity).
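The passage from the complex model (1) to the real-valued model (2) can be illustrated with a minimal NumPy sketch. The stacking order (real parts first, imaginary parts second), the helper name `to_real` and the numerical values are assumptions made for illustration only; other equivalent orderings exist.

```python
import numpy as np

def to_real(H_c, a_c, n_c):
    """Stack real and imaginary parts into a real-valued model r = H a + n."""
    H = np.block([[H_c.real, -H_c.imag],
                  [H_c.imag,  H_c.real]])      # shape (2*Nr, 2*Nt)
    a = np.concatenate([a_c.real, a_c.imag])   # shape (2*Nt,)
    n = np.concatenate([n_c.real, n_c.imag])   # shape (2*Nr,)
    return H, a, n

rng = np.random.default_rng(0)
Nt, Nr = 4, 8                                   # illustrative dimensions
H_c = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2 * Nr)
a_c = (rng.choice([-1.0, 1.0], Nt) + 1j * rng.choice([-1.0, 1.0], Nt)) / np.sqrt(2)
n_c = 0.1 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))

r_c = H_c @ a_c + n_c          # complex-valued model
H, a, n = to_real(H_c, a_c, n_c)
r = H @ a + n                  # real-valued model with doubled dimensions
```

Under this assumed ordering, `r` coincides with the stacked real and imaginary parts of `r_c`, so both models carry the same information.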

Detection in MIMO systems
A classical problem in the MIMO literature is to decide which symbols were transmitted by each antenna when only (2) is available at the receiver. This detection problem can be solved optimally, albeit at great computational effort, by the MLD for MIMO as follows:

$$\hat{\mathbf{a}}_k = \arg\min_{\mathbf{a} \in S^{2N_t}} \left\| \mathbf{r}_k - \mathbf{H}_k \mathbf{a} \right\|_2^2, \tag{7}$$

for which $\hat{\mathbf{a}}_k \in S^{2N_t}$ is the estimated vector of symbols' coordinates.
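To make the cost of the MLD concrete, the following sketch performs an exhaustive search that minimizes the Euclidean distance between the received vector and every candidate, which is the standard ML criterion under AWGN. The function name `mld`, the QPSK coordinate scaling and the toy dimensions are assumptions for illustration.

```python
import itertools
import numpy as np

def mld(r, H, S):
    """Exhaustive ML search over all candidate vectors with entries in S."""
    best, best_cost = None, np.inf
    for cand in itertools.product(S, repeat=H.shape[1]):
        a = np.array(cand)
        cost = np.sum((r - H @ a) ** 2)    # squared Euclidean distance
        if cost < best_cost:
            best, best_cost = a, cost
    return best

rng = np.random.default_rng(1)
S = np.array([-1.0, 1.0]) / np.sqrt(2)     # assumed QPSK coordinates per real dimension
H = rng.standard_normal((8, 4))            # real-valued model with Nt = 2, Nr = 4
a_true = rng.choice(S, 4)
r = H @ a_true                             # noiseless observation for the sanity check
a_hat = mld(r, H, S)
```

The loop visits $(\sqrt{M})^{2N_t} = M^{N_t}$ candidates, which is why the search quickly becomes impractical as $N_t$ or $M$ grows.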
It is known that the prohibitive complexity of the MLD has motivated research on several alternative detectors for MIMO throughout the last decades [3]. The PDA detector is one of these alternatives; it presents significantly lower complexity than the MLD, with an affordable bit error rate (BER) performance loss under specific conditions, as will be detailed in Sects. 3.5 and 4. In Sect. 3.1, the PDA detector's algorithm first proposed in [9] is briefly revisited, followed by our proposed DU-PDA, for which the PDA is the underlying algorithm.

Probability data association detector
Before the detection task is carried out by the PDA detector, the received signal, $\mathbf{r}_k$, is preprocessed or equalized using the ZF principle as follows [2,3,9]:

$$\mathbf{z}_k = \mathbf{H}_k^{\dagger} \mathbf{r}_k = \mathbf{a}_k + \mathbf{v}_k, \tag{8}$$

wherein $\mathbf{H}_k^{\dagger} = (\mathbf{H}_k^T \mathbf{H}_k)^{-1} \mathbf{H}_k^T$ is the left Moore-Penrose pseudoinverse and $\mathbf{v}_k = \mathbf{H}_k^{\dagger} \mathbf{n}_k$ is the enhanced AWGN. Let us rewrite (8), such that

$$\mathbf{z}_k = \mathbf{e}_i a_k(i) + \sum_{j \neq i} \mathbf{e}_j a_k(j) + \mathbf{v}_k = \mathbf{e}_i a_k(i) + \mathbf{V}_i, \tag{9}$$

where $\mathbf{e}_i$ is the vector with 1 (one) at its $i$th entry and 0 (zero) otherwise, and $\mathbf{V}_i$ is a multivariate random variable (RV) that can be seen as the effective interference-plus-noise contaminating $a_k(i)$ [9]. Therefore, the crux lies in detecting the symbol transmitted by the $i$th antenna while treating all other $j \neq i$ transmitted symbols as interference added to the noise term, which is described by $\mathbf{V}_i$. Accordingly, the PDA detector associates with each $a_k(i)$ a probability vector $\mathbf{p}_i$, whose entries are given by the evaluation of $P_m(a_k(i) = q(m) \mid \mathbf{z}_k, \{\mathbf{p}_j\}_{\forall j \neq i})$, $q(m) \in S$ being a coordinate of the $M$-QAM constellation and $m \in \{0, 1, \ldots, \sqrt{M} - 1\}$. It is important to remark that the PDA detector uses all $\{\mathbf{p}_j\}_{\forall j \neq i}$ associated with interfering symbols already detected, thanks to the incorporation of a strategy similar to that of successive interference cancellation (SIC) detectors. This significantly reduces the computational complexity of calculating $\mathbf{p}_i$, since otherwise $P_m(a_k(i) = q(m) \mid \mathbf{z}_k)$ would have to be evaluated. The problem with the latter is the requirement of computing multiple integrals for each received symbol, rendering this evaluation prohibitive in practice. Dropping the subscript $(\cdot)_k$ in order to simplify the notation and assuming that $\mathbf{V}_i$ has a Gaussian distribution [9,11], then the likelihood function of $\mathbf{z} \mid a(i) = q(m)$ can be defined as [2] that accounts for the noise enhancement caused by the ZF. To evaluate the posterior probabilities associated with each symbol, we compute which can be seen as an approximate form of Bayes' theorem [11].
Then, substituting (10) into (14) yields (15). Finally, the PDA detector procedure is given in Algorithm 1.
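The procedure described above can be sketched as follows. This is not a transcription of Algorithm 1: it assumes the commonly used Gaussian-forcing form of the PDA, in which the interference from symbol $j$ is summarized by the mean and variance implied by its current probability vector, and it visits symbols in natural order for a fixed number of sweeps rather than in the optimal sequence. The names `pda_detect`, `Omega`, `sweeps` and the numerical values are illustrative assumptions.

```python
import numpy as np

def pda_detect(r, H, S, sigma2, sweeps=3):
    """Gaussian-forcing PDA sketch on the real-valued model r = H a + n."""
    n_sym = H.shape[1]
    G_inv = np.linalg.inv(H.T @ H)
    z = G_inv @ H.T @ r                          # ZF preprocessing
    R_v = 0.5 * sigma2 * G_inv                   # covariance of the enhanced noise
    P = np.full((n_sym, len(S)), 1.0 / len(S))   # uniform initial probabilities
    for _ in range(sweeps):
        for i in range(n_sym):
            mean = P @ S                         # E[a(j)] under current beliefs
            var = P @ S**2 - mean**2             # Var[a(j)] under current beliefs
            mean[i] = 0.0                        # symbol i is the one being detected,
            var[i] = 0.0                         # so it is excluded from the interference
            Omega_inv = np.linalg.inv(np.diag(var) + R_v)
            logp = np.empty(len(S))
            for m, q in enumerate(S):
                d = z - mean
                d[i] -= q                        # hypothesis a(i) = q(m)
                logp[m] = -0.5 * d @ Omega_inv @ d
            logp -= logp.max()                   # numerical stabilization
            P[i] = np.exp(logp) / np.exp(logp).sum()
    return S[P.argmax(axis=1)], P

rng = np.random.default_rng(3)
S = np.array([-1.0, 1.0]) / np.sqrt(2)           # assumed QPSK coordinates
H = rng.standard_normal((16, 4))
a_true = rng.choice(S, 4)
sigma2 = 0.01
r = H @ a_true + np.sqrt(sigma2 / 2) * rng.standard_normal(16)
a_hat, P = pda_detect(r, H, S, sigma2)
```

Note that each symbol update inverts a full matrix, which is the kind of operation the deep unfolded variant aims to avoid.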
Note that the optimal detection sequence [9] used in Algorithm 1 can be found with the aid of the following operation: where $\mathbf{f}_i^T$ represents the $i$th row of $\mathbf{F} = \mathbf{H}^{\dagger}$ and $\mathbf{h}_j$ denotes the $j$th column of $\mathbf{H}$. Note that larger magnitudes for $\rho(i)$ mean that the $i$th antenna suffers less IAI [3]. In other words, the off-diagonal entries of the $i$th row from $\mathbf{F}\mathbf{H}$ have, combined, smaller magnitudes than its $i$th diagonal entry. It is easy to show that the optimal sequence is defined by sorting $\boldsymbol{\rho} = [\rho(0)\ \rho(1)\ \ldots\ \rho(2N_t - 1)]^T$ in descending order, denoted as

Deep unfolding
Prior to presenting our proposed DU-PDA detector, a brief description of NNs and deep unfolding is provided in this section.
In general, the NN architecture has shown great potential for detecting signals, but its design and parameterization, among other problems, impose limitations [4]. Alternatively, this architecture can be adapted such that the iterations of a given algorithm are unfolded on its layers [5,6,12], hence the term "unfolding." It is also commonly assumed that the NN employs several layers and, consequently, the term "deep" is added.
More specifically, consider an algorithm with an input vector denoted by $\mathbf{x} \in \mathbb{R}^N$ and an output given by $\mathbf{y} \in \mathbb{R}^S$. This algorithm can then be expressed by [12] wherein is the set of all parameters used by the algorithm, $g(\cdot)$ represents a mapping function, usually nonlinear, and $\boldsymbol{\psi}$ is iteratively updated as follows where the $\ell$th iteration also involves an operation with a mapping function $f(\cdot)$ and $\boldsymbol{\psi}_0$ denotes the initial value. Therefore, in the deep unfolded context, $\boldsymbol{\psi}_\ell$ can be understood as the input-output relationship at the $\ell$th layer of an NN architecture, as illustrated in Fig. 1.
Note that the dimensions of the learnable parameters are defined according to the underlying algorithm on which (17), (18) and the architecture depicted in Fig. 1 are based. This includes weights and biases, for example, which are optimized by the NN training algorithm [4,12]. In other words, the number of layers and neurons is fixed, thereby considerably simplifying the process of defining what is commonly known as the NN hyperparameters.
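As a generic illustration of this idea, the sketch below unfolds a fixed number of iterations of a simple algorithm (gradient descent for least squares) into a sequence of layers, each with its own parameter. The helper `unfold`, the per-layer parameter `theta` and the toy problem are assumptions; in a trained deep unfolded network the per-layer parameters would be learned rather than fixed by hand.

```python
import numpy as np

def unfold(x, psi0, layer_params, f, g):
    """Apply one layer psi <- f(x, psi, theta) per parameter set, then the output map g."""
    psi = psi0
    for theta in layer_params:         # the depth equals the number of parameter sets
        psi = f(x, psi, theta)
    return g(x, psi)

# Toy instance: unfolded gradient descent for min_y ||b - A y||^2.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = A @ rng.standard_normal(3)
step = 1.0 / np.linalg.norm(A, 2) ** 2              # step size below the stability limit
f = lambda x, psi, theta: psi + theta * (A.T @ (x - A @ psi))
g = lambda x, psi: psi
y_hat = unfold(b, np.zeros(3), [step] * 50, f, g)   # 50 layers with a shared, hand-set step
```

Because the layer structure is dictated by the algorithm, only the per-layer parameters remain to be chosen, which mirrors the simplification of hyperparameter selection noted above.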
Moreover, improvements are also obtained by using the aforementioned learnable parameters directly into the iterative algorithm. That way, learning capabilities of NNs can be applied for optimizing algorithms such that its global performance, computational complexity, or even both, are improved. In Sect. 3.3, the PDA detector, reviewed in Sect. 3.1, is implemented using the deep unfolded architecture for NNs, unveiling our proposed DU-PDA detector for MIMO systems.

Proposed deep unfolded PDA detector
Fig. 1 Deep unfolding architecture. It is based on an underlying algorithm with an input vector given by $\mathbf{x}$ and an output determined by $\mathbf{y}$. Each hidden layer unfolds the $\ell$th iteration of this algorithm and its input-output relationship is expressed by (18), whereas the output layer is represented by (17)

Aiming to take advantage of the iterative algorithm of the PDA detector, we propose the DU-PDA detector. Firstly, in the DU-PDA detector, the received signal, $\mathbf{r}$, is preprocessed at the $\ell$th layer by the following operation [8]; [13, §IV-B, p. 1706] where $\hat{\mathbf{a}}_\ell \in \mathbb{R}^{2N_t}$ is the estimated transmitted symbol vector and the scalar $w_\ell \in \mathbb{R}$ represents a learnable parameter. Note that this preprocessing principle differs from the ZF, which is used by the PDA detector, as defined in (8). In contrast, the proposed DU-PDA employs a preprocessing based on the approximate message passing (AMP) algorithm [14], which also bears similarities to the Richardson method [2, §IV-6, p. 9]. In this way, $\hat{\mathbf{a}}_\ell$ is updated iteratively until it converges to an acceptable approximation of the transmitted symbol vector. Interestingly, when we have $\hat{\mathbf{a}}_\ell \to \mathbf{a}$, then the so-called residual term $\mathbf{r} - \mathbf{H}\hat{\mathbf{a}}_\ell \to \mathbf{n}$, which gives us a result in (19) similar to (8).
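A residual-based preprocessing of this kind can be sketched as below, assuming the Richardson-like form $\hat{\mathbf{a}}_\ell + w_\ell \mathbf{H}^T(\mathbf{r} - \mathbf{H}\hat{\mathbf{a}}_\ell)$. This form is an assumption chosen to match the description of the residual term; the exact expression of (19) may differ, and the function name `preprocess` and the value of `w` are illustrative.

```python
import numpy as np

def preprocess(r, H, a_hat, w):
    """Assumed residual update: current estimate plus a scaled matched-filtered residual."""
    residual = r - H @ a_hat            # tends to the noise vector as a_hat approaches a
    return a_hat + w * (H.T @ residual)

rng = np.random.default_rng(5)
H = rng.standard_normal((16, 8)) / 4.0  # real-valued 2*Nr x 2*Nt channel, Nt = 4, Nr = 8
a = rng.choice(np.array([-1.0, 1.0]) / np.sqrt(2), 8)
z_exact = preprocess(H @ a, H, a, w=0.3)           # noiseless, estimate already exact
z_start = preprocess(H @ a, H, np.zeros(8), w=0.3) # first layer, all-zero initial estimate
```

The update uses two matrix-vector products with a $2N_r \times 2N_t$ matrix and no matrix inversion, in contrast to the ZF preprocessing used by the PDA detector.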
The preprocessed signal of (19) is then fed into the following operation: where Note that the nonlinear function $\mathrm{softm}(\cdot)$ is applied at each layer. This makes (20) identical to (15), except that it is unfolded on successive layers and that $\boldsymbol{\psi}_j = \mathbf{p}_j$. Notably, this also distinguishes the proposed DU-PDA from other architectures [5,8,13], which instead use optimal denoisers at each layer and therefore do not account for interfering symbols as the underlying PDA algorithm of the DU-PDA does. Moreover, since the preprocessing is modified, it is necessary to redefine the covariance matrix, $\boldsymbol{\Sigma}_{\ell^*}$, as follows [15, §III-D, p. 2023], [8] where wherein $[x]_+ = \max(0, x)$ and for which Equation (23) can be understood as the empirical mean-squared error (MSE) estimator of the covariance matrix originating from the residual and noise terms of (19). More importantly, note that $\boldsymbol{\Sigma}_{\ell^*}$ is now a diagonal matrix. This means that computing $\boldsymbol{\Sigma}_{\ell^*}^{-1}$ is not as costly as its counterpart in (11), that is, in the PDA detector. More details about such implications are given in Sect. 3.4. Therefore, by considering the developments presented in this subsection and the general model described in Sect. 3.2, we have

$$\boldsymbol{\psi}_{\ell^*+1}(m) = \mathrm{softm}\left(\mathbf{z}_\ell, \boldsymbol{\psi}_{\ell^*}(m), \{w_\ell, \boldsymbol{\mu}_{\ell^*}, \boldsymbol{\Sigma}_{\ell^*}\}\right), \tag{25}$$

which is similar to what is evaluated in (15) with the addition, however, of a learnable parameter and a different preprocessing of the received signal. Note also that $\boldsymbol{\psi}_L = \mathbf{y}$, meaning that the last layer output is also given by (25). Furthermore, let such that the convergence of (19) might be improved, given that the soft combination of the symbols' coordinates and their estimated associated probabilities is fed forward to the next layer.
In Algorithm 2, we detail the general procedure carried out by the proposed DU-PDA detector. The ground truth used for training the NN is defined by $\mathbf{I}_{\ell^*} = [I(0)\ I(1)\ \ldots\ I(\sqrt{M} - 1)]^T$, such that $\mathcal{I} = \{\mathbf{I}_{\ell^*}\}_{\forall \ell^*}$. It indicates the known constellation coordinates that are transmitted for the training procedure; thus, $I_{\ell^*}(m) \in \{0, 1\}\ \forall m$. Observe also that the PDA detector outputs approximate posteriors, as shown in (15), which is leveraged by our proposed DU-PDA detector in Algorithm 2 when employing the categorical cross-entropy loss function: Bear in mind that the loss is calculated considering the output of all $L$ unfolded layers and not only the last one. Also, note that the use of (27) contrasts with the popular choice of the MSE loss function [5]. Additionally, it is well known that the cross-entropy loss function is more appropriate for classification tasks.
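A loss of this type, accumulated over all layers, can be sketched as follows. The plain summation without averaging, the clipping constant `eps` and the function name are assumptions; the normalization used in (27) may differ.

```python
import numpy as np

def multilayer_cross_entropy(P_layers, I_layers, eps=1e-12):
    """Categorical cross-entropy summed over every unfolded layer output."""
    P = np.clip(np.asarray(P_layers, dtype=float), eps, 1.0)   # avoid log(0)
    I = np.asarray(I_layers, dtype=float)                      # one-hot ground truth
    return float(-(I * np.log(P)).sum())

# Two layers, binary coordinate set: the second layer is more confident in the true class.
P_layers = [[0.5, 0.5], [0.9, 0.1]]
I_layers = [[1, 0], [1, 0]]
loss = multilayer_cross_entropy(P_layers, I_layers)
```

Since every layer contributes to the sum, early layers are also pushed toward the correct posterior instead of only the final output.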

Simplified DU-PDA
The model of the DU-PDA presented in the previous subsection can be simplified even further if some assumptions are made. Therefore, a new variation of the proposed DU-PDA detector, namely the simplified DU-PDA detector, is presented in this subsection. For this detector, the calculations performed in (23) are simplified and the scalar $0.5\sigma^2$ is applied directly in (22). The reasoning behind this approach lies in the asymptotic case, that is, when $N_t \to \infty$ and $N_r \to \infty$. For this case, the first term of (23) vanishes, since and similarly for the second term we have which yields wherein, for the sake of simplicity, the learnable parameter $w_\ell$ is omitted. This is analogous to the channel hardening effect present in massive MIMO systems [2,3], where the values of $N_t$ and $N_r$ are large. Although we demonstrate via computational simulations in Sect. 4 that the simplified DU-PDA only presents marginal losses in performance, it is still unknown whether other similar architectures proposed in the literature [5,6,8,13] are robust enough to allow such simplifications.
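The channel hardening effect mentioned above can be observed numerically. The sketch below is only an illustration of that effect, not a derivation of (30): it draws channels with entries distributed as $\mathcal{CN}(0, 1/N_r)$, keeps $N_t$ fixed and measures how far the Gram matrix is from the identity as $N_r$ grows. The function name and the chosen dimensions are assumptions.

```python
import numpy as np

def gram_deviation(Nt, Nr, rng):
    """Largest entrywise deviation of H^H H from the identity for one channel draw."""
    H_c = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2 * Nr)
    G = H_c.conj().T @ H_c                      # Gram matrix, size Nt x Nt
    return np.abs(G - np.eye(Nt)).max()

rng = np.random.default_rng(4)
dev_small = np.mean([gram_deviation(4, 8, rng) for _ in range(20)])     # few receive antennas
dev_large = np.mean([gram_deviation(4, 2048, rng) for _ in range(20)])  # many receive antennas
```

With many receive antennas the Gram matrix is nearly the identity, which is the regime in which replacing data-dependent terms by constants becomes plausible.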

Computational complexity
According to the guidelines presented in [4, §IV-C, p. 122404], the global computational complexity of the PDA detector is approximately given by However, if we let $N_r \gg \sqrt{M}$ and simplify constants, then it can be written more compactly as For the proposed DU-PDA, the global complexity is composed mainly of the local cost of (19), given by $O(8N_t N_r)$ per layer, and the local cost of (23), expressed by $O(4N_t^2 + 8N_t N_r + N_r)$ for each layer. The cost of the NN training stage is not taken into account when calculating the computational complexity of the detection stage, since the training stage is assumed to be computed offline, as discussed in [4]. Nevertheless, in general, the backpropagation algorithm used for training NNs has a complexity that scales linearly with the number of training samples, $N_{TR}$, and training iterations, say $N_{TI}$. More importantly, it scales exponentially with the number of layers $L$ because of the chain rule derivatives calculated during backpropagation. In principle, this is a high complexity when compared with the detection or forward-pass complexity, but once trained, the NN-based detector may serve multiple users during a prescribed timeline [16]. This means that the training complexity cost is distributed over time and users, whereas the detection complexity is fixed for each user and transmission cycle. Hence, since training is not performed in the detection cycle, its complexity is not considered, enabling a fair comparison with other detectors.
Furthermore, recall that the simplified form of calculation given by (30) reduces even further the global complexity of the proposed DU-PDA detector. More specifically, the global complexity of the simplified DU-PDA detector is given approximately by $O(L N_t N_r)$, meaning that the cost is reduced by one order of magnitude when compared to the DU-PDA detector.
From the computational complexity associated with each detector, it is possible to conclude that the PDA is more complex than the proposed DU-PDA. More specifically, this cost difference is due to the higher-order term $N_t^4$ included in the PDA global complexity. This is expected because of the matrix inversions performed by the PDA detector, which are not necessary for either the DU-PDA or the simplified DU-PDA detector. Also, notice that for both of these detectors, the total number of layers $L$ might significantly increase the global complexity. It is demonstrated in Sect. 4, however, that this number is a multiple of $N_t$, thus still implying a lower global complexity for the DU-PDA when compared to the PDA. In fact, the simplified DU-PDA complexity becomes even lower than that of the ZF in the aforementioned case. Additionally, an optimal detection sequence, such as (16), is not a general requirement for the DU-PDA, which further reduces its global complexity in relation to the PDA. Although they shed light on how the detectors' computational complexities compare to each other, these are only asymptotic predictions of complexity. A detailed evaluation of the system end-to-end latency [17,18], for example, is out of the scope of this work. However, it can be verified for the typical 4 × 8 MIMO system considered in Sect. 4 that the symbol detection (see Line 15 of Algorithm 2) of the DU-PDA takes approximately 50 milliseconds on average with negligible variance. Note that this time value heavily depends on the implementation of the proposed detector, which in this work is based on the TensorFlow library [19] and is not yet optimized for a full-fledged hardware implementation. Indeed, implementations using a hardware description language (HDL) can provide a more reliable analysis of the end-to-end latency of the proposed detector.
For convenience, Table 1 summarizes the global computational complexity for all detectors of interest. Observe that the AMP detector and the sphere detector (SD) are also included for the sake of completeness. For the AMP, N I refers to the number of iterations or updates executed, whereas for the SD we considered the fixed-complexity SD [3, §VIII-D, p. 20], since its performance is near-optimum. To conclude, note also in Table 1 how the complexity of all detectors increases polynomially with the number of transmitting antennas N t . The exceptions, however, are the MLD and the SD, whose complexity increases exponentially with N t and √ N t , respectively, as expected.

Numerical results and discussion
Before presenting numerical results on the detectors' performance, we list important system parameters in the following subsection.

System parameters
Table 1 Global computational complexity of detectors studied in this work (columns: Detector; Global computational complexity). Note that they are given in the most compact form and are also ranked in ascending order, that is, from less to more costly as lines progress to the bottom of the table

In this work, the following system parameters are adopted: (i) before transmission, a frame of $n_b$ data bits is encoded using the polar encoder [20] with a code rate of $R < 1$. Thus, $n_b/R$ bits now represent the coded frame that is effectively transmitted; (ii) entries of the channel frequency response matrix, $\mathbf{H}$, are drawn from a complex Gaussian random process for all $k$ subcarriers at each transmission of an OFDM frame and are normalized by $1/\sqrt{N_r}$. Hence, we have $H_{i,j} \sim \mathcal{CN}(0, 1/N_r),\ \forall\, i, j$ and, consequently, the system signal-to-noise ratio (SNR) per bit can be expressed as follows which is henceforward assumed to be identical for all subcarriers. The BER is employed for measuring the performance of the coded detectors, and it is obtained by averaging bit decision errors over multiple Monte Carlo experiments. Each experiment is generated using a computational simulation that involves: (i) the generation of $n_b = 256$ equiprobable data bits; (ii) the encoding of data bits by the polar encoder, resulting in a codeword of $256/R$ bits; (iii) the mapping of coded bits into complex symbols $\tilde{\mathbf{a}}_k$, with $\Re(\tilde{\mathbf{a}}_k), \Im(\tilde{\mathbf{a}}_k) \in S^{N_t}$, for all $k$ subcarriers; (iv) the transmission of the OFDM frame; (v) the generation of normalized channel coefficients to form the entries of the channel matrix $\mathbf{H}_k$; (vi) the generation of the complex AWGN samples present in the receiver; (vii) the final decision in favor of the symbol coordinate associated with the highest probability value; and (viii) the subsequent decoding of decided symbols into bits via the polar decoder [21]. More specifically, we implement a tree-based architecture of a successive cancellation list decoder [22], with code rate equal to $R$.
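The final decision of step (vii) amounts to selecting, for each symbol, the coordinate with the largest probability. A minimal sketch is given below; the function name, the coordinate set and the probability values are illustrative assumptions.

```python
import numpy as np

def hard_decision(P, S):
    """Return, for each row of P, the coordinate of S with the highest probability."""
    return S[np.argmax(P, axis=1)]

S = np.array([-1.0, 1.0]) / np.sqrt(2)     # assumed QPSK coordinates per real dimension
P = np.array([[0.9, 0.1],                  # one probability vector per detected symbol
              [0.2, 0.8]])
decided = hard_decision(P, S)
```

The decided coordinates are then demapped to bits before being passed to the decoder.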
For the sake of brevity, some algorithmic procedures were omitted from Algorithm 2. However, it is worth mentioning that the DU-PDA training is performed considering that SNR values are drawn from a uniform distribution $U \sim [\min(\mathrm{SNR}), \max(\mathrm{SNR})]$, as discussed in [4, §VI-A, p. 122405]. Additionally, it was decided heuristically to use a total number of $N_{TR} = 10^5$ samples for training and also that the DU-PDA should include $L = 4N_t$ layers. More details about the proposed DU-PDA hyperparameters can be found in Table 2. These parameters are used for all scenarios presented in Sect. 4.2.
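The SNR sampling used for training can be sketched as follows. Drawing the values uniformly on the decibel scale and the particular range of 0 to 14 dB are assumptions made for illustration; the function name is hypothetical.

```python
import numpy as np

def sample_training_snr_db(snr_min_db, snr_max_db, n_samples, rng):
    """Draw one SNR value (in dB) per training sample from a uniform distribution."""
    return rng.uniform(snr_min_db, snr_max_db, size=n_samples)

rng = np.random.default_rng(6)
N_TR = 10**5                       # number of training samples, as stated in the text
Nt = 4
L = 4 * Nt                         # number of unfolded layers, as stated in the text
snr_db = sample_training_snr_db(0.0, 14.0, N_TR, rng)
```

Training over a range of SNR values, instead of a single operating point, avoids retraining the detector for each SNR of interest.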
Furthermore, note that in this work we employ hard decoding for all detectors analyzed. However, in principle, soft decoding could also be integrated into the proposed DU-PDA, since soft outputs are available via (25) [11]. Nonetheless, for the proposed DU-PDA, the hard decoding approach attains a better performance-complexity trade-off, which is more aligned with the general aim of this work of proposing a low-complexity detector with affordable performance losses. This also allows for a fair comparison with algorithms that provide hard decoding sequences.

Figure 2 shows the uncoded detection performance for all detectors presented in Table 1, considering a square 4 × 4 MIMO system (Fig. 2a) and an underloaded [3] 4 × 8 MIMO system (Fig. 2b), both of which employ the quadrature phase shift keying (QPSK) ($M = 4$) modulation.
The detection performance is given as a function of multiple SNR values, and it is defined as the probability of occurrence of any error in the received symbol vector. This is done because bits are not encoded for the scenarios analyzed in Fig. 2.
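This error metric counts a transmission as erroneous whenever at least one entry of the detected vector differs from the transmitted one. A minimal sketch, with a hypothetical function name and toy data, is:

```python
import numpy as np

def vector_error_rate(A_hat, A):
    """Fraction of transmissions in which any entry of the detected vector is wrong."""
    return float(np.mean(np.any(A_hat != A, axis=1)))

A = np.array([[1, -1], [1, 1], [-1, -1], [-1, 1]])       # transmitted vectors, one per row
A_hat = np.array([[1, -1], [1, -1], [-1, -1], [-1, 1]])  # second vector has one wrong entry
ver = vector_error_rate(A_hat, A)
```

This measure is more pessimistic than a per-symbol error rate, since a single wrong coordinate invalidates the whole vector.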
Firstly, observe in Fig. 2a that the performance of the PDA detector adheres closely to that reported in the seminal work of [9], thus validating the simulation model. Moreover, notice that the DU-PDA detector shows a prohibitively poor performance for the 4 × 4 MIMO scenario, which was also verified to be the case for other square MIMO systems. However, for the underloaded scenario shown in Fig. 2b, where $N_r \gg N_t$, the DU-PDA detector presents better performance. All the same, if the relative performance of the DU-PDA against the ZF and, particularly, the AMP detectors is taken into account, then Fig. 2a and b shows that the DU-PDA outperforms these detectors for most of the SNR range analyzed, while presenting a comparable detection complexity. It was verified, however, that for the underloaded scenario of 4 × 8 MIMO, the DU-PDA detector reaches a performance floor of $P(\hat{\mathbf{a}} \neq \mathbf{a}) \approx 3 \times 10^{-3}$, from which no improvement can be obtained irrespective of how high the SNR values are. This motivated the integration of the polar encoder as described in Sect. 4.1, also with the aim of potentially improving the proposed DU-PDA performance relative to other detectors. Note in Fig. 3a that the 4 × 8 MIMO scenario is illustrated again as in Fig. 2, now considering polar encoding with a code rate of $R = 1/2$. This is accompanied by Fig. 3b, in which the 4 × 16 MIMO scenario with a 16-QAM ($M = 16$) modulation is presented, considering the same aforementioned code rate.
We begin by pointing out that the performance floor observed in Fig. 3a and b, although undesirable, is not as detrimental to the overall performance as in Fig. 2b. This happens because the introduced channel coding improves the performance over the entire SNR range under analysis. Therefore, the BER values for which the DU-PDA is better than the ZF and AMP fall within the more interesting region of SNR < 10 (dB). Admittedly, the performance floor is still present in Fig. 3a and b, but now at low values of BER ≈ $2 \times 10^{-4}$ and BER ≈ $2 \times 10^{-5}$, respectively. These observations support the conjecture that the uncoded DU-PDA detector is interference limited for high SNR values. In this SNR range, the distribution of (19) ceases to be approximately Gaussian because of the low AWGN levels and becomes defined for the most part by the non-Gaussian IAI distribution. This in turn violates the Gaussian distribution assumption mentioned in Sect. 3.1 regarding the PDA detector, which is the underlying algorithm of the proposed DU-PDA detector. Hence, we have the performance floor shown in Fig. 2b, which is partially mitigated by a robust coding scheme in Fig. 3. Furthermore, regarding the detection performance of the AMP detector in Figs. 2 and 3, one can see that this detector suffers from a severe performance floor for high SNR. This behavior is also explained by the reasoning described for the DU-PDA, which means that the violation of the Gaussian distribution assumption also severely affects the AMP detection performance [23].
Moreover, note also that Fig. 3 depicts the detection performance of the simplified DU-PDA detector. For this detector, the calculations performed in (23) are simplified, yielding (30). Although the dimensions of the MIMO systems illustrated in Fig. 3 are far from the asymptotic regime assumed in Sect. 3.4, it can be seen in Fig. 3 that the detection performance of the simplified DU-PDA detector is practically identical to that of the DU-PDA detector, except in the high SNR region, where the simplified DU-PDA is marginally worse than the DU-PDA detector. Finally, note also that the simplified DU-PDA complexity becomes even lower than that of the ZF and AMP detectors, especially when the number of $L = 4N_t$ layers used is considered. This makes the simplified DU-PDA detector the least costly of all detectors analyzed in this work, as can be verified in Table 1. Yet it performs approximately 2 dB better than the ZF in Fig. 3a for values of SNR < 10 (dB), for example. More importantly, the simplified DU-PDA largely improves upon the performance of the AMP detector, in spite of using similar operations, as described in (19).
Additionally, Fig. 4a shows the performance of relevant detectors for the 8 × 16 MIMO scenario considering the QPSK modulation. Figure 4b in turn illustrates the detectors' performance, also considering the QPSK modulation, for multiple values of transmitting antennas, $N_t$, for which the number of receiving antennas, $N_r = 12$, and the SNR = 7 (dB) are fixed. Note that for this scenario we still assume the number of layers, $L$, of the DU-PDA detector to be constrained by $N_t$, such that $L = 2cN_t$. This is adopted since each layer in the DU-PDA architecture outputs the posterior associated with one transmitted symbol, a by-product of the underlying PDA algorithm employed by the DU-PDA detector. However, we verified through experiments that for $c > 2$ no improvement was obtained in detection performance, while training and detection complexity increased. Therefore, the value $L = 4N_t$ defined in Sect. 4.1 was shown to be the most suitable one.
In Fig. 4a, it can be observed with the larger MIMO system that the proposed simplified DU-PDA detector outperforms the ZF detector, particularly for the low BER  Table 1 are not analyzed either because their performance is too prohibitive or identical to the performance of detectors already shown here