Linear precoding based on polynomial expansion: reducing complexity in massive MIMO
EURASIP Journal on Wireless Communications and Networking volume 2016, Article number: 63 (2016)
Abstract
Massive multiple-input multiple-output (MIMO) techniques have the potential to bring tremendous improvements in spectral efficiency to future communication systems. Counterintuitively, the practical issues of having uncertain channel knowledge, high propagation losses, and implementing optimal nonlinear precoding are solved more or less automatically by enlarging system dimensions. However, the computational precoding complexity grows with the system dimensions. For example, the close-to-optimal and relatively “antenna-efficient” regularized zero-forcing (RZF) precoding is very complicated to implement in practice, since it requires fast inversions of large matrices in every coherence period. Motivated by the high performance of RZF, we propose to replace the matrix inversion and multiplication by a truncated polynomial expansion (TPE), thereby obtaining the new TPE precoding scheme which is more suitable for real-time hardware implementation and significantly reduces the delay to the first transmitted symbol. The degree of the matrix polynomial can be adapted to the available hardware resources and enables smooth transition between simple maximum ratio transmission and more advanced RZF.
By deriving new random matrix results, we obtain a deterministic expression for the asymptotic signal-to-interference-and-noise ratio (SINR) achieved by TPE precoding in massive MIMO systems. Furthermore, we provide a closed-form expression for the polynomial coefficients that maximize this SINR. To maintain a fixed per-user rate loss as compared to RZF, the polynomial degree does not need to scale with the system, but it should be increased with the quality of the channel knowledge and the signal-to-noise ratio.
Introduction
The current wireless networks must be greatly densified to meet the exponential growth in data traffic and number of user terminals (UTs) [1]. The conventional densification approach is to decrease the inter-site distance by adding new base stations (BSs) [2]. However, the cells are subject to more interference from neighboring cells as distances shrink, which requires substantial coordination between neighboring BSs or fractional frequency reuse patterns. Furthermore, serving high-mobility UTs by small cells is very cumbersome due to the large overhead caused by rapidly recurring handover.
Massive multiple-input multiple-output (MIMO) techniques, also known as large-scale multi-user MIMO techniques, have been shown to be viable alternatives and complements to small cells [3–7]. By deploying large-scale arrays with very many antennas at current macro BSs, an exceptional array gain and spatial precoding resolution can be obtained. This is exploited to achieve higher UT rates and serve more UTs simultaneously. In this paper, we consider the single-cell downlink case where one BS with M antennas serves K single-antenna UTs. As a rule of thumb, hundreds of BS antennas may be deployed in the near future to serve several tens of UTs in parallel. If the UTs are selected spatially to have a very small number of common scatterers, the user channels naturally decorrelate as M grows large [8, 9] and space-division multiple access (SDMA) techniques become robust to channel uncertainty [3].
One might imagine that by taking M and K large, it becomes terribly difficult to optimize the system throughput. The beauty of massive MIMO is that this is not the case: simple linear precoding is asymptotically optimal in the regime M≫K≫0 [3] and random matrix theory can provide simple deterministic approximations of the stochastic achievable rates [5, 10–14]. These so-called deterministic equivalents are tight as M grows large due to channel hardening but are usually also very accurate at small values of M and K.
Although linear precoding is computationally more efficient than its nonlinear alternatives, the complexity of most linear precoding schemes is still intractable in the large (M,K) regime since the number of arithmetic operations is proportional to \(K^{2}M\). For example, both the optimal precoding parametrization in [15] and the near-optimal regularized zero-forcing (RZF) precoding [16] require an inversion of the Gram matrix of the joint channel of all users; this matrix operation has a complexity proportional to \(K^{2}M\). A notable exception is the matched filter, also known as maximum ratio transmission (MRT) [17], whose complexity only scales as MK. Unfortunately, this precoding scheme requires roughly an order of magnitude more BS antennas to perform as well as RZF [5]. Since it makes little sense to deploy an advanced massive MIMO system and then cripple the system throughput by using interference-ignoring MRT, treating the precoding complexity problem is the main focus of this paper.
Similar complexity issues appear in multi-user detection, where the minimum mean square error (MMSE) detector involves matrix inversions [18]. This uplink problem has received considerable attention in the last two decades; see [18–21] and references therein. In particular, different reduced-rank filtering approaches have been proposed, often based on the concept of truncated polynomial expansion (TPE). Simply speaking, the idea is to approximate the matrix inverse by a matrix polynomial with J terms, where J need not scale with the system dimensions to maintain a certain approximation accuracy [19]. TPE-based detectors admit simple and efficient multistage/pipelined hardware implementation [18], which stands in contrast to the complicated implementation of matrix inversion. A key requirement to achieve good detection performance at small J is to find good coefficients for the polynomial. This has been a major research challenge because the optimal coefficients are expensive to compute [18]. Alternatives based on appropriate scaling [20] and asymptotic analysis [21] have been proposed. A similar TPE-based approach was used in [22] for the purpose of low-complexity channel estimation in massive MIMO systems.
In this paper, which follows our work in [23], we propose a new family of low-complexity linear precoding schemes for the single-cell multi-user downlink. We exploit TPE to enable a balancing of precoding complexity and system throughput. A main analytic contribution is the derivation of deterministic equivalents for the achievable user rates for any order J of TPE precoding. These expressions are tight when M and K grow large with a fixed ratio but also provide close approximations at small parameter values. The deterministic equivalents allow for optimization of the polynomial coefficients; we derive the coefficients that maximize the throughput. We note that this approach for precoding design is very new. The only other work is [24] by Zarei et al., of which we just became aware at the time this paper was first submitted. Unlike our work, the precoding in [24] is conceived to minimize the sum MSE of all users. Although our approach builds upon the same TPE concept as [24], the design method proposed herein is more efficient since it considers the optimization of the throughput. This metric is usually more pertinent than the sum MSE. Additionally, our work is more comprehensive in that we consider a channel model which takes into account the transmit correlation at the base station.
Our novel TPE precoding scheme enables a smooth transition in performance between MRT (J=1) and RZF (\(J=\min(M,K)\)), where the majority of the gap is bridged for small values of J. We show that J is independent of the system dimensions M and K but must increase with the signal-to-noise ratio (SNR) and channel state information (CSI) quality to maintain a fixed per-user rate gap to RZF. We stress that the polynomial structure provides a green radio approach to precoding, since it enables energy-efficient multistage hardware implementation as compared to the complicated/inefficient signal processing required to compute conventional RZF. Also, the delay to the first transmitted symbol is significantly reduced, which is of great interest in systems with very short coherence periods. Furthermore, the hardware complexity can be easily tailored to the deployment scenario or even changed dynamically by increasing and reducing J in high- and low-SNR situations, respectively.
Notation
Boldface (lowercase) is used for column vectors, x, and (uppercase) for matrices, X. Let X^{T}, X^{H}, and X^{∗} denote the transpose, conjugate transpose, and conjugate of X, respectively, while tr(X) is the matrix trace function. The Frobenius norm is denoted as ∥·∥, and the spectral norm is denoted as ∥·∥_{2}. A circularly symmetric complex Gaussian random vector x is denoted as \(\mathbf{x} \sim \mathcal{CN}(\bar{\mathbf{x}},\mathbf{Q})\), where \(\bar{\mathbf{x}}\) is the mean and Q is the covariance matrix. The set of all complex numbers is denoted by \(\mathbb{C}\), with \(\mathbb{C}^{N\times 1}\) and \(\mathbb{C}^{N\times M}\) being the generalizations to vectors and matrices, respectively. The M×M identity matrix is written as I_{M}, and the zero vector of length M is denoted as 0_{M×1}. For an infinitely differentiable monovariate function f(t), the ℓth derivative at t=t_{0} (i.e., \(\frac{d^{\ell}}{dt^{\ell}}f(t)\big|_{t=t_{0}}\)) is denoted by f^{(ℓ)}(t_{0}) and more concisely f^{(ℓ)} when t_{0}=0. An analogous definition is considered in the bivariate case; in particular, f^{(ℓ,m)}(t_{0},u_{0}) refers to the ℓth and mth derivatives with respect to t and u at t_{0} and u_{0}, respectively (i.e., \(\frac{\partial^{\ell}}{\partial t^{\ell}}\frac{\partial^{m}}{\partial u^{m}}f(t,u)\big|_{t=t_{0},u=u_{0}}\)). If t_{0}=u_{0}=0, we abbreviate again as f^{(ℓ,m)}=f^{(ℓ,m)}(0,0). Furthermore, we use the big O and small o notations in their usual sense; that is, \(\alpha_{M} = \mathcal{O}(\beta_{M})\) serves as a flexible abbreviation for α_{M}≤Cβ_{M}, where C is a generic constant, and α_{M}=o(β_{M}) is shorthand for α_{M}=ε_{M}β_{M} with ε_{M}→0 as M goes to infinity.
System model
This section defines the single-cell system with flat-fading channels, linear precoding, and channel estimation errors.
Transmission model
We consider a single-cell downlink system in which a BS, equipped with M antennas, serves K single-antenna UTs. The received complex baseband signal \(y_{k} \in \mathbb{C}\) at the kth UT is given by
\[ y_{k} = \mathbf{h}_{k}^{\text{\tiny H}} \mathbf{x} + n_{k}, \quad k=1,\ldots,K, \tag{1} \]
where \(\mathbf{x} \in \mathbb{C}^{M\times 1}\) is the transmit signal and \(\mathbf{h}_{k} \in \mathbb{C}^{M\times 1}\) represents the random channel vector between the BS and the kth UT. The additive circularly symmetric complex Gaussian noise at the kth UT is denoted by \(n_{k} \sim \mathcal{CN}(0,\sigma^{2})\) for k=1,…,K, where σ^{2} is the receiver noise variance.
The small-scale channel fading is modeled as follows.
Assumption 1.
The channel vector h_{k} is modeled as
\[ \mathbf{h}_{k} = \boldsymbol{\Phi}^{\frac{1}{2}} \mathbf{z}_{k}, \tag{2} \]
where the channel covariance matrix \(\boldsymbol{\Phi} \in \mathbb{C}^{M \times M}\) has bounded spectral norm ∥Φ∥_{2}, as \(M \rightarrow \infty\), and \(\mathbf{z}_{k} \sim \mathcal{CN}(\mathbf{0}_{M\times 1},\mathbf{I}_{M})\). The channel vector has a fixed realization for a coherence period and then takes a new independent realization. This model is known as Rayleigh block-fading.
Note that we assume that the UTs reside in a rich scattering environment described by the covariance matrix Φ. This matrix can either be a scaled identity matrix as in [3] or describe array-specific properties (e.g., non-isotropic radiation patterns) and general propagation properties of the coverage area (e.g., for practical sectorized sites). We only consider a common covariance matrix Φ here, since the main focus in this publication is the precoding scheme. This simplification has been done in many recent publications. Adhikary et al. [25] have proposed to always only serve groups of UTs that share approximately equal covariance matrices, hence providing further motivation behind Assumption 1.
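As a hedged illustration of Assumption 1, the following sketch draws channel vectors of the form \(\mathbf{h}_{k} = \boldsymbol{\Phi}^{1/2}\mathbf{z}_{k}\) and compares their sample covariance with Φ. The exponential correlation matrix, the dimensions, and the helper name `sample_channels` are assumptions chosen only for this example; they are not taken from the paper.

```python
import numpy as np

def sample_channels(Phi_sqrt, K, rng):
    """Draw K i.i.d. channel vectors h_k = Phi^{1/2} z_k with z_k ~ CN(0, I_M)."""
    M = Phi_sqrt.shape[0]
    # CN(0, I): real and imaginary parts each have variance 1/2.
    Z = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
    return Phi_sqrt @ Z  # columns are h_1, ..., h_K

rng = np.random.default_rng(0)
M, K = 8, 20000
# Hypothetical exponential correlation model (illustration only).
idx = np.arange(M)
Phi = 0.5 ** np.abs(np.subtract.outer(idx, idx))
# Hermitian square root of Phi via its eigendecomposition.
eigval, eigvec = np.linalg.eigh(Phi)
Phi_sqrt = (eigvec * np.sqrt(eigval)) @ eigvec.conj().T

H = sample_channels(Phi_sqrt, K, rng)
Phi_hat = H @ H.conj().T / K  # sample covariance, should be close to Phi
```

With many independent draws, the sample covariance `Phi_hat` should approach `Phi`, which is the sense in which Φ describes the scattering environment shared by all UTs.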
The application of TPE precoding to multi-cell systems can be found in our paper [26]. However, the models used in this paper and in [26] are incompatible and differ most prominently in the assumption whether the total transmit power increases with the number of users as in [26] or is fixed as in this paper; see (8). This seemingly negligible change has a big impact on the analysis and applicability of the models, as this assumption means that the noise term in [26] becomes asymptotically zero, while in the current work, the noise term is non-negligible. The channel estimation model in [26] and in this paper is also different, and the calculations follow very different approaches, due to the inclusion of power control later on. Another big extension in the current work is the complete complexity analysis of the TPE approach in comparison to the classical RZF approach. Only this analysis gives TPE precoding its motivation and pertinence. Finally, we want to point out that the optimization in [26] is with respect to a max-min SNR problem and the solution is not given in closed form, while here we maximize the throughput and find a closed-form solution. Before utilizing our work, one needs to decide which model gives the most accurate asymptotic behavior for the specific type of system considered.
Assumption 2.
The BS employs Gaussian codebooks and linear precoding, where \(\mathbf {g}_{k} \in {\mathbb C}^{M\times 1}\) denotes the precoding vector and \(s_{k} \sim \mathcal {CN}(0,1)\) is the data symbol of the kth UT.
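Based on this assumption, the transmit signal in (1) is
\[ \mathbf{x} = \sum_{n=1}^{K} \mathbf{g}_{n} s_{n} = \mathbf{G}\mathbf{s}. \tag{3} \]
The matrix notation is obtained by letting \(\mathbf{G} = [ \mathbf{g}_{1} \, \ldots \, \mathbf{g}_{K} ] \in \mathbb{C}^{M \times K}\) be the precoding matrix and \(\mathbf{s} = [s_{1} \, \ldots \, s_{K}]^{\text{\tiny T}} \sim \mathcal{CN}(\mathbf{0}_{K\times 1},\mathbf{I}_{K})\) be the vector containing all UT data symbols.
Consequently, the received signal (1) can be expressed as
\[ y_{k} = \mathbf{h}_{k}^{\text{\tiny H}} \mathbf{G}\mathbf{s} + n_{k}. \tag{4} \]
Let \(\mathbf{G}_{k} \in \mathbb{C}^{M \times (K-1)}\) be the matrix G with column g_{k} removed. Then, the SINR at the kth UT becomes
\[ \text{SINR}_{k} = \frac{\mathbf{h}_{k}^{\text{\tiny H}} \mathbf{g}_{k} \mathbf{g}_{k}^{\text{\tiny H}} \mathbf{h}_{k}}{\mathbf{h}_{k}^{\text{\tiny H}} \mathbf{G}_{k} \mathbf{G}_{k}^{\text{\tiny H}} \mathbf{h}_{k} + \sigma^{2}}. \tag{5} \]
By assuming that each UT has perfect instantaneous CSI, the achievable data rates at the UTs are
\[ r_{k} = \log_{2}(1+\text{SINR}_{k}), \quad k=1,\ldots,K. \]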
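The SINR and rate definitions can be evaluated numerically for any given pair of channel and precoding matrices. The sketch below assumes the usual form of the SINR, namely the desired power \(|\mathbf{h}_{k}^{\text{H}}\mathbf{g}_{k}|^{2}\) divided by the interference \(\sum_{n\neq k}|\mathbf{h}_{k}^{\text{H}}\mathbf{g}_{n}|^{2}\) plus σ^{2}, which matches the definition of G_{k} as G with column g_{k} removed. The function name and the toy matrices are hypothetical.

```python
import numpy as np

def sinr_and_rates(H, G, sigma2):
    """Per-user SINR and rate log2(1 + SINR) for channels H and precoder G (both M x K)."""
    A = np.abs(H.conj().T @ G) ** 2          # A[k, n] = |h_k^H g_n|^2
    signal = np.diag(A).copy()               # desired-signal power of each user
    interference = A.sum(axis=1) - signal    # power received from the other users' beams
    sinr = signal / (interference + sigma2)
    return sinr, np.log2(1.0 + sinr)

# Toy check with two orthogonal channels (illustrative values only).
H = 2.0 * np.eye(4)[:, :2]
G = 0.5 * H                                  # matched-filter beams, no mutual interference
sinr, rates = sinr_and_rates(H, G, sigma2=1.0)
```

In this toy case the channels are orthogonal, so the interference term vanishes and each SINR reduces to the received signal power divided by the noise variance.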
Model of imperfect channel information at transmitter
Since we typically have M≥K in practice, we assume that we either have a time-division duplex (TDD) protocol where the BS acquires channel knowledge from uplink pilot signaling [5] or a frequency-division duplex (FDD) protocol where temporal correlation is exploited as in [27]. In both cases, the transmitter generally has imperfect knowledge of the instantaneous channel realizations and we model this by the generic Gauss-Markov formulation; see [12, 28, 29]:
Assumption 3.
The transmitter has an imperfect channel estimate
\[ \widehat{\mathbf{h}}_{k} = \boldsymbol{\Phi}^{\frac{1}{2}} \left( \sqrt{1-\tau^{2}}\, \mathbf{z}_{k} + \tau \mathbf{v}_{k} \right) = \sqrt{1-\tau^{2}}\, \mathbf{h}_{k} + \tau \mathbf{n}_{k} \tag{6} \]
for each UT, k=1,…,K, where h_{k} is the true channel, \(\mathbf{v}_{k} \sim \mathcal{CN}(\mathbf{0}_{M\times 1},\mathbf{I}_{M})\), and \(\mathbf{n}_{k} = \boldsymbol{\Phi}^{\frac{1}{2}} \mathbf{v}_{k} \sim \mathcal{CN}(\mathbf{0}_{M\times 1},\boldsymbol{\Phi})\) models the independent error. The scalar parameter τ∈[0,1] indicates the quality of the instantaneous CSI, where τ=0 corresponds to perfect instantaneous CSI and τ=1 corresponds to having only statistical channel knowledge.
The parameter τ depends on factors such as time/power spent on pilot-based channel estimation and user mobility. Note that we assume for simplicity that the BS has the same quality of channel knowledge for all UTs.
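Based on the model in (6), the matrix
\[ \widehat{\mathbf{H}} = \left[ \widehat{\mathbf{h}}_{1} \, \ldots \, \widehat{\mathbf{h}}_{K} \right] \in \mathbb{C}^{M \times K} \tag{7} \]
denotes the joint imperfect knowledge of all user channels.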
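A minimal sketch of the imperfect-CSI model follows, assuming the Gauss-Markov form \(\widehat{\mathbf{h}}_{k} = \sqrt{1-\tau^{2}}\,\mathbf{h}_{k} + \tau\mathbf{n}_{k}\), which is consistent with τ=0 giving perfect CSI and τ=1 giving an estimate that is independent of the true channel. The identity covariance, the value τ=0.6, and the helper names are illustrative assumptions.

```python
import numpy as np

def complex_normal(shape, rng):
    """Samples with i.i.d. CN(0, 1) entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def imperfect_estimate(H, Phi_sqrt, tau, rng):
    """Estimate H_hat = sqrt(1 - tau^2) H + tau N, with error columns n_k ~ CN(0, Phi)."""
    N = Phi_sqrt @ complex_normal(H.shape, rng)
    return np.sqrt(1.0 - tau**2) * H + tau * N

rng = np.random.default_rng(1)
M, K = 4, 50000
Phi_sqrt = np.eye(M)                      # uncorrelated antennas, for illustration only
H = Phi_sqrt @ complex_normal((M, K), rng)

H_perfect = imperfect_estimate(H, Phi_sqrt, tau=0.0, rng=rng)
H_noisy = imperfect_estimate(H, Phi_sqrt, tau=0.6, rng=rng)
# Empirical correlation between true and estimated entries, expected near sqrt(1 - tau^2).
corr = np.mean(H.conj() * H_noisy).real
```

The estimate keeps the same covariance Φ as the true channel for every τ, so τ only controls how strongly the estimate is correlated with the true realization.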
Linear precoding
Many heuristic linear precoding schemes have been proposed in the literature, mainly because finding the optimal precoding (in terms of weighted sum rate or other criteria) is very computationally demanding and thus unsuitable for fading systems [30]. Among the heuristic schemes, we distinguish RZF precoding [16], which is also known as transmit Wiener filter [31], signal-to-leakage-and-noise ratio maximizing beamforming [32], generalized eigenvalue-based beamformer [33], and virtual SINR maximizing beamforming [34]. The reason that RZF precoding has been proposed by different authors (under different names) is, most likely, that it provides close-to-optimal performance in many scenarios. It also outperforms classical MRT and zero-forcing beamforming (ZFBF) by combining the respective benefits of these schemes [30]. Therefore, RZF is deemed the natural starting point for this paper.
Next, we provide a brief review of RZF and prior performance results in massive MIMO systems. These results serve as a starting point for Section 3.2, where we propose an alternative precoding scheme with a computational/hardware complexity more suited for large systems.
Review on RZF precoding in massive MIMO systems
Suppose we have a total transmit power constraint
\[ \text{tr}\left( \mathbf{G}\mathbf{G}^{\text{\tiny H}} \right) = P. \tag{8} \]
We stress that the total power P is fixed, while we let the number of antennas, M, and number of UTs, K, grow large.
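Similar to [12], we define the RZF precoding matrix as
\[ \mathbf{G}_{\text{RZF}} = \frac{\beta}{\sqrt{K}} \widehat{\mathbf{H}} \left( \frac{1}{K} \widehat{\mathbf{H}}^{\text{\tiny H}} \widehat{\mathbf{H}} + \xi \mathbf{I}_{K} \right)^{-1} \mathbf{P}^{\frac{1}{2}} = \frac{\beta}{\sqrt{K}} \left( \frac{1}{K} \widehat{\mathbf{H}} \widehat{\mathbf{H}}^{\text{\tiny H}} + \xi \mathbf{I}_{M} \right)^{-1} \widehat{\mathbf{H}} \mathbf{P}^{\frac{1}{2}}, \tag{9} \]
where the power normalization parameter β is set such that G_{RZF} satisfies the power constraint in (8) and \(\mathbf{P}\) is a fixed diagonal matrix whose diagonal elements are power allocation weights for each user. We assume that \(\mathbf{P}\) satisfies the following: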
Assumption 4.
The diagonal values p _{ k }, k=1,…,K in P=diag(p _{1},…,p _{ K }) are positive and of order \(\mathcal {O}(\frac {1}{K})\).
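The following sketch shows one way to compute an RZF precoding matrix numerically. It assumes the form \(\frac{\beta}{\sqrt{K}}\widehat{\mathbf{H}}(\frac{1}{K}\widehat{\mathbf{H}}^{\text{H}}\widehat{\mathbf{H}}+\xi\mathbf{I}_{K})^{-1}\mathbf{P}^{1/2}\) with β chosen so that \(\text{tr}(\mathbf{G}\mathbf{G}^{\text{H}})\) equals the power budget, and it checks numerically that the K×K and M×M forms coincide. The function name, dimensions, and parameter values are hypothetical.

```python
import numpy as np

def rzf_precoder(H_hat, xi, p, P_total):
    """RZF precoder using the K x K inverse; returns (G, beta) with tr(G G^H) = P_total."""
    M, K = H_hat.shape
    Hs = H_hat / np.sqrt(K)
    A = Hs.conj().T @ Hs + xi * np.eye(K)           # K x K regularized Gram matrix
    # Solve instead of inverting: (A^{-1} Hs^H)^H = Hs A^{-1}, because A is Hermitian.
    F = np.linalg.solve(A, Hs.conj().T).conj().T
    F = F * np.sqrt(p)                               # right-multiply by P^{1/2}
    beta = np.sqrt(P_total / np.trace(F @ F.conj().T).real)
    return beta * F, beta

rng = np.random.default_rng(2)
M, K, xi, P_total = 16, 4, 0.1, 1.0
H_hat = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
p = np.full(K, P_total / K)                          # uniform power allocation weights
G, beta = rzf_precoder(H_hat, xi, p, P_total)

# Equivalent M x M form of the same precoder.
Hs = H_hat / np.sqrt(K)
G_alt = beta * np.linalg.solve(Hs @ Hs.conj().T + xi * np.eye(M), Hs) * np.sqrt(p)
```

Working with the K×K system is the cheaper choice whenever K≤M, which is the case of interest in massive MIMO.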
The scalar regularization coefficient ξ can be selected in different ways, depending on the noise variance, channel uncertainty at the transmitter, and system dimensions [12, 16]. In [12], the performance of each UT under RZF precoding is studied in the large (M,K) regime. This means that M and K tend to infinity at the same speed, which can be formalized as follows.
Assumption 5.
In the large (M,K) regime, M and K tend to infinity such that
\[ 0 < \liminf \frac{K}{M} \leq \limsup \frac{K}{M} < +\infty. \]
The user performance is characterized by SINR_{k} in (5). Although the SINR is a random quantity that depends on the instantaneous values of the random user channels in H and the instantaneous estimate \(\widehat{\mathbf{H}}\), it can be approximated using deterministic quantities in the large (M,K) regime [10–13]. These are quantities that only depend on the statistics of the channels and are often referred to as deterministic equivalents, since they are almost surely (a.s.) tight in the asymptotic limit. This channel hardening property is essentially due to the law of large numbers. Deterministic equivalents were first proposed by Hachem et al. in [10], who have also shown their ability to capture important system performance indicators. When the deterministic equivalents are applied at finite M and K, they are referred to as large-scale approximations.
In the sequel, by deterministic equivalent of a sequence of random variables X_{n}, we mean a deterministic sequence \(\overline{X}_{n}\) which approximates X_{n} such that
\[ X_{n} - \overline{X}_{n} \xrightarrow[n\rightarrow\infty]{} 0 \quad \text{a.s.} \]
As an example, we recall the following result from [10], which provides some widely known results on deterministic equivalents. Note that we have chosen to work with a slightly different definition of the deterministic equivalents than in [10], since this better fits the analysis of our proposed precoding scheme.
Theorem 1.
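(Adapted from [10]) Consider the resolvent matrix \(\mathbf{Q}(t)=\left(\frac{t}{K}\mathbf{H}\mathbf{H}^{\text{\tiny H}}+ \mathbf{I}_{M}\right)^{-1}\) where the columns of H are distributed according to Assumption 1. Then, the equation
\[ \delta(t) = \frac{1}{K}\text{tr}\left( \boldsymbol{\Phi} \left( \mathbf{I}_{M}+\frac{t\boldsymbol{\Phi}}{1+t\delta(t)} \right)^{-1} \right) \]
admits a unique solution δ(t)>0 for every t>0.
Let \(\mathbf{T}(t)=\left(\mathbf{I}_{M}+\frac{t\boldsymbol{\Phi}}{1+t\delta(t)}\right)^{-1}\) and let U be any matrix with bounded spectral norm. Under Assumption 5 and for t>0, we have
\[ \frac{1}{K}\text{tr}\left(\mathbf{U}\mathbf{Q}(t)\right) - \frac{1}{K}\text{tr}\left(\mathbf{U}\mathbf{T}(t)\right) \rightarrow 0 \quad \text{a.s.} \tag{11} \]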
The statement in (11) shows that \(\frac {1}{K}\text {tr}(\mathbf {U}\mathbf {T}(t))\) is a deterministic equivalent to the random quantity \(\frac {1}{K}\text {tr}(\mathbf {U}\mathbf {Q}(t))\).
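To make the deterministic equivalent concrete, the sketch below solves a fixed-point equation of the form \(\delta(t) = \frac{1}{K}\text{tr}(\boldsymbol{\Phi}\mathbf{T}(t))\) by plain iteration and compares \(\frac{1}{K}\text{tr}(\mathbf{T}(t))\) with one random realization of \(\frac{1}{K}\text{tr}(\mathbf{Q}(t))\), taking U as the identity. The fixed-point form is an assumption consistent with the definition of T(t); the choice Φ=I_{M}, the dimensions, the starting value, and the function name are illustrative only.

```python
import numpy as np

def deterministic_equivalent(Phi, K, t, tol=1e-12, max_iter=1000):
    """Iterate delta <- tr(Phi T(delta)) / K and return (delta, T) at convergence."""
    M = Phi.shape[0]
    delta = 1.0                                   # arbitrary positive starting point
    for _ in range(max_iter):
        T = np.linalg.inv(np.eye(M) + t * Phi / (1.0 + t * delta))
        new_delta = np.trace(Phi @ T).real / K
        converged = abs(new_delta - delta) < tol
        delta = new_delta
        if converged:
            break
    T = np.linalg.inv(np.eye(M) + t * Phi / (1.0 + t * delta))
    return delta, T

rng = np.random.default_rng(3)
M, K, t = 200, 100, 1.0
Phi = np.eye(M)                                   # illustrative covariance
delta, T = deterministic_equivalent(Phi, K, t)

# One random realization of the resolvent Q(t) for comparison.
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
Q = np.linalg.inv((t / K) * (H @ H.conj().T) + np.eye(M))
random_value = np.trace(Q).real / K
deterministic_value = np.trace(T).real / K
```

For Φ=I_{M} and M/K=2 with t=1, the fixed-point equation reduces to δ^{2}=2, which gives a simple analytical check of the iteration.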
In this paper, the deterministic equivalents are essential to determine the limit to which the SINRs tend in the large (M,K) regime. For RZF precoding, as in (9), this limit is given by the following theorem.
Theorem 2.
(Adapted from Corollary 1 in [12]) Let \(\rho =\frac {P}{\sigma ^{2}}\) and consider the notation \(\mathbf {T}=\mathbf {T}(\frac {1}{\xi })\) and \(\delta =\delta (\frac {1}{\xi })\). Define the deterministic scalar quantities
and
Then, the SINRs with RZF precoding satisfy
Note that all UTs obtain the same asymptotic value of the SINR since the UTs have homogeneous channel statistics. Theorem 2 holds for any regularization coefficient ξ, but the parameter can also be selected to maximize the limiting value θ of the SINRs. This is achieved by the following theorem.
Theorem 3.
(Adapted from Proposition 2 in [12]) Under the assumption of a uniform power allocation, \(p_{k}=\frac{P}{K}\), the large-scale approximated SINR in (12) under RZF precoding is maximized by the regularization parameter ξ^{⋆}, given as the positive solution to the fixed-point equation
where ν(ξ) is given by
The RZF precoding matrix in (9) is a function of the instantaneous CSI at the transmitter. Although the SINRs converge to the deterministic equivalents given in Theorem 2 in the large (M,K) regime, the precoding matrix remains a random quantity that is typically recalculated on a millisecond basis (i.e., at the same pace as the channel knowledge is updated). This is a major practical issue, because the matrix inversion operation in RZF precoding is very computationally demanding in large systems [35]; the number of operations scales as \(\mathcal{O}(K^{2} M)\) and the known inversion algorithms are complicated to implement in hardware (see Section 4 for details). The matrix inversion is the key to interference suppression in RZF precoding; thus, there is a need to develop less complicated precoding schemes that still can suppress interference efficiently.
Truncated polynomial expansion precoding
Motivated by the inherent complexity issues of RZF precoding, we now develop a new linear precoding class that is much easier to implement in large systems. The precoding is based on rewriting the matrix inversion by a polynomial expansion, which is then truncated. The following lemma provides a major motivation behind the use of polynomial expansions.
Lemma 1.
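For any positive definite Hermitian matrix X,
\[ \mathbf{X}^{-1} = \kappa \left( \mathbf{I} - (\mathbf{I} - \kappa \mathbf{X}) \right)^{-1} = \kappa \sum_{\ell=0}^{\infty} (\mathbf{I} - \kappa \mathbf{X})^{\ell}, \]
where the second equality holds if the parameter κ is selected such that \(0 < \kappa < \frac{2}{\max_{n} \lambda_{n}(\mathbf{X})}\).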
Proof 1.
The inverse of an Hermitian matrix can be computed by inverting each eigenvalue, while keeping the eigenvectors fixed. This lemma follows by applying the standard Taylor series expansion \((1-x)^{-1} = \sum_{\ell = 0}^{\infty} x^{\ell}\), for any |x|<1, on each eigenvalue of the Hermitian matrix (I−κX). The condition on x corresponds to requiring that the spectral norm ∥I−κX∥_{2} is smaller than unity, which holds for \(0 < \kappa < \frac{2}{\max_{n} \lambda_{n}(\mathbf{X})}\). See [20] for an in-depth analysis of such properties of polynomial expansions.
This lemma shows that the inverse of any positive definite Hermitian matrix can be expressed as a matrix polynomial. More importantly, the low-order terms are the most influential ones, since the eigenvalues of (I−κX)^{ℓ} converge geometrically to zero as ℓ grows large. This is due to each eigenvalue λ of (I−κX) having an absolute value smaller than unity, |λ|<1, and thus λ^{ℓ} goes geometrically to zero as \(\ell \rightarrow \infty\). As such, it makes sense to consider a TPE of the matrix inverse using only the first J terms. This corresponds to approximating the inversion of each eigenvalue by a Taylor polynomial with J terms; hence, the approximation accuracy per matrix element is independent of M and K; that is, J need not change with the system dimensions.
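The geometric decay of the truncation error can be observed numerically. The sketch below approximates \(\mathbf{X}^{-1}\) by \(\kappa\sum_{\ell=0}^{J-1}(\mathbf{I}-\kappa\mathbf{X})^{\ell}\) for increasing J. The test matrix, the choice κ=1/λ_{max}(X) (one admissible value inside the allowed interval), and the function name are assumptions made for illustration.

```python
import numpy as np

def truncated_inverse(X, J, kappa):
    """Approximate inv(X) by kappa * sum_{l=0}^{J-1} (I - kappa X)^l."""
    n = X.shape[0]
    B = np.eye(n) - kappa * X
    term = np.eye(n, dtype=X.dtype)     # (I - kappa X)^0
    acc = np.zeros_like(X)
    for _ in range(J):
        acc = acc + term
        term = term @ B                 # next power of (I - kappa X)
    return kappa * acc

rng = np.random.default_rng(4)
A = (rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))) / np.sqrt(2)
X = A @ A.conj().T / 6 + np.eye(6)      # Hermitian positive definite test matrix
kappa = 1.0 / np.linalg.eigvalsh(X).max()
X_inv = np.linalg.inv(X)

# Spectral-norm error of the truncated expansion for several orders J.
orders = (1, 2, 4, 8, 16, 200)
errors = [np.linalg.norm(truncated_inverse(X, J, kappa) - X_inv, 2) for J in orders]
```

With this κ, all eigenvalues of (I−κX) lie in [0,1), so each added term strictly reduces the error, and the first few terms account for most of the improvement.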
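TPE has been successfully applied for low-complexity multi-user detection in [18–21] and channel estimation in [22]. Next, we exploit the TPE technique to approximate RZF precoding by a matrix polynomial. Starting from G_{RZF} in (9), we note that
\begin{align}
\mathbf{G}_{\text{RZF}} &= \beta \left( \frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} + \xi \mathbf{I}_{M} \right)^{-1} \frac{\widehat{\mathbf{H}}}{\sqrt{K}} \mathbf{P}^{\frac{1}{2}} \notag \\
&= \beta \kappa \sum_{\ell=0}^{\infty} \left( \mathbf{I}_{M} - \kappa \left( \frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} + \xi \mathbf{I}_{M} \right) \right)^{\ell} \frac{\widehat{\mathbf{H}}}{\sqrt{K}} \mathbf{P}^{\frac{1}{2}} \tag{15} \\
&\approx \beta \kappa \sum_{\ell=0}^{J-1} \left( \mathbf{I}_{M} - \kappa \left( \frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} + \xi \mathbf{I}_{M} \right) \right)^{\ell} \frac{\widehat{\mathbf{H}}}{\sqrt{K}} \mathbf{P}^{\frac{1}{2}} \tag{16} \\
&= \sum_{\ell=0}^{J-1} \left[ \beta \kappa \sum_{n=\ell}^{J-1} \binom{n}{\ell} (1-\kappa \xi)^{n-\ell} (-\kappa)^{\ell} \right] \left( \frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} \right)^{\ell} \frac{\widehat{\mathbf{H}}}{\sqrt{K}} \mathbf{P}^{\frac{1}{2}}, \tag{17}
\end{align}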
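where (15) follows directly from Lemma 1 (for an appropriate selection of κ), (16) is achieved by truncating the polynomial (only keeping the first J terms), and (17) follows from applying the binomial theorem and gathering the terms for each exponent. Inspecting (17), we have a precoding matrix with the structure
\[ \mathbf{G}_{\text{TPE}} = \sum_{\ell=0}^{J-1} w_{\ell} \left( \frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} \right)^{\ell} \frac{\widehat{\mathbf{H}}}{\sqrt{K}} \mathbf{P}^{\frac{1}{2}}, \tag{18} \]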
where w _{0},…,w _{ J−1} are scalar coefficients. Although the bracketed term in (17) provides a potential expression for w _{ ℓ }, we stress that these are generally not the optimal coefficients when \(J < \infty \). Also, these coefficients are not satisfying the power constraint in (8) since the coefficients are not adapted to the truncation. Hence, we treat w _{0},…,w _{ J−1} as design parameters that should be selected to maximize the performance; for example, by maximizing the limiting value of the SINRs, as was done in Theorem 3 for RZF precoding. We note especially that the value of κ in (17) does not need to be explicitly known in order to choose, optimize, and implement the coefficients. We only need for κ to exist, which is always the case under Assumption 2. Besides the simplified structure, the proposed precoding matrix G _{TPE} possesses a higher number of degrees of freedom (represented by the J scalars w _{ ℓ }) than the RZF precoding (which has only the regularization coefficient ξ).
The precoding in (18) is coined TPE precoding and actually defines a whole class of precoding matrices for different J. For J=1, we obtain \(\mathbf{G} = \frac{w_{0}}{\sqrt{K}} \widehat{\mathbf{H}}\mathbf{P}^{\frac{1}{2}}\), which equals MRT. Furthermore, RZF precoding can be obtained by choosing \(J=\min(M,K)\) and coefficients based on the characteristic polynomial of \((\frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{\tiny H}} +\xi \mathbf{I}_{M})^{-1}\) (directly from the Cayley-Hamilton theorem). We refer to J as the TPE order and note that the corresponding polynomial degree is J−1. Clearly, proper selection of J enables a smooth transition between the traditional low-complexity MRT and the high-complexity RZF precoding. Based on the discussion that followed Lemma 1, we assume that the parameter J is a finite constant that does not grow with M and K.
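As a hedged numerical check of the TPE structure, the sketch below builds a precoder of the form \(\sum_{\ell=0}^{J-1} w_{\ell}(\frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{H}})^{\ell}\frac{\widehat{\mathbf{H}}}{\sqrt{K}}\mathbf{P}^{1/2}\), confirms that J=1 reduces to a scaled MRT matrix, and verifies that coefficients obtained from the binomial expansion reproduce the truncated series in \((\mathbf{I}_{M}-\kappa(\frac{1}{K}\widehat{\mathbf{H}}\widehat{\mathbf{H}}^{\text{H}}+\xi\mathbf{I}_{M}))\). The function names, β=1, κ=1/λ_{max}, and all numerical values are assumptions for illustration; these binomial coefficients are not the optimized coefficients discussed in the text.

```python
import numpy as np
from math import comb

def tpe_precoder(H_hat, w, p):
    """G = sum_l w[l] * (Hs Hs^H)^l Hs P^{1/2}, with Hs = H_hat / sqrt(K)."""
    K = H_hat.shape[1]
    Hs = H_hat / np.sqrt(K)
    term = Hs * np.sqrt(p)                     # l = 0 term: Hs P^{1/2}
    G = np.zeros_like(term)
    for w_l in w:
        G = G + w_l * term
        term = Hs @ (Hs.conj().T @ term)       # multiply by (1/K) H_hat H_hat^H
    return G

def binomial_coefficients(J, kappa, xi, beta=1.0):
    """Coefficients from expanding beta*kappa*sum_n ((1 - kappa*xi) I - kappa A)^n."""
    return [beta * kappa * sum(comb(n, l) * (1 - kappa * xi) ** (n - l) * (-kappa) ** l
                               for n in range(l, J))
            for l in range(J)]

rng = np.random.default_rng(5)
M, K, J, xi = 12, 4, 4, 0.2
H_hat = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
p = np.full(K, 1.0 / K)
Hs = H_hat / np.sqrt(K)
R = Hs @ Hs.conj().T + xi * np.eye(M)
kappa = 1.0 / np.linalg.eigvalsh(R).max()

G_binom = tpe_precoder(H_hat, binomial_coefficients(J, kappa, xi), p)
B = np.eye(M) - kappa * R
G_trunc = (kappa * sum(np.linalg.matrix_power(B, l) for l in range(J))) @ (Hs * np.sqrt(p))
G_mrt = tpe_precoder(H_hat, [1.0], p)          # J = 1 with w_0 = 1
```

The agreement between `G_binom` and `G_trunc` only confirms the algebraic regrouping; in practice the coefficients are treated as free design parameters.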
Complexity analysis
In this section, we compare the complexities of RZF and TPE precoding in a theoretical fashion and in an implementation sense. The complexities are given as simple numbers of complex addition and multiplication operations needed for a given arithmetic operation. The number of floating point operations (flops) needed to implement these complex operations varies greatly according to the used hardware and complex number representation (i.e., polar or Cartesian). Thus, we will not attempt to give a measure in flops. Also, the ability to parallelize operations and to customize algorithm-specific circuits has a fundamental impact on the computational delays and energy consumption in practical systems.
Sum complexity per coherence period for RZF and TPE
In order to compare the number of complex operations needed for conventional RZF precoding and the proposed TPE precoding, it is important to consider how often each operation is repeated. There are two time scales: (1) operations that take place once per coherence period (i.e., once per channel realization) and (2) operations that take place every time the channel is used for downlink transmission. To differentiate between these time scales, we let \(T_{\text {data}}^{\text {pcp}}\) denote the number of downlink channel uses for data transmission per coherence period. Recall from (3) that the transmit signal is G s, where the precoding matrix \(\mathbf {G} \in \mathbb {C}^{M \times K}\) changes once per coherence period and the data transmit symbols \(\mathbf {s} \in \mathbb {C}^{K \times 1}\) are different for each channel use.
The RZF precoding matrix in (9) is computed once per coherence period. There are two equivalent expressions in (9), where the difference is that the matrix inversion is either of dimension K×K or M×M. Since K≤M in most cases of practical interest, and especially in the massive MIMO regime, we consider the first precoding expression: \(\frac{1}{\sqrt{K}} \widehat{\mathbf{H}}\left(\frac{1}{\sqrt{K}}\widehat{\mathbf{H}}^{\text{\tiny H}}\frac{1}{\sqrt{K}}\widehat{\mathbf{H}}+\xi \mathbf{I}_{K}\right)^{-1}\mathbf{P}^{\frac{1}{2}}\beta\).
Assuming that \(\frac{1}{\sqrt{K}} \widehat{\mathbf{H}}\), ξ, β, and \(\mathbf{P}^{\frac{1}{2}}\) are available in advance and the Hermitian operation is “free,” we need to (1) compute the matrix-matrix multiplication \((\frac{1}{\sqrt{K}} \widehat{\mathbf{H}}^{\text{\tiny H}}) (\frac{1}{\sqrt{K}}\widehat{\mathbf{H}})\); (2) add the diagonal matrix ξI_{K} to the result; (3) compute \(\frac{1}{\sqrt{K}} \widehat{\mathbf{H}}\left(\frac{1}{K}\widehat{\mathbf{H}}^{\text{\tiny H}}\widehat{\mathbf{H}}+\xi \mathbf{I}_{K}\right)^{-1}\); and (4) multiply the result with the diagonal matrix resulting from \(\mathbf{P}^{\frac{1}{2}} \beta\). These are standard operations for matrices; thus, we obtain the numbers of complex operations as K^{2}(2M−1), K, \(\frac{K^{3}}{3}+2 K^{2} M\), and MK+K, respectively. Step 3 is not immediately obvious, but an efficient method for this part is to compute a Cholesky factorization of \(\frac{1}{K}\widehat{\mathbf{H}}^{\text{\tiny H}}\widehat{\mathbf{H}}+\xi \mathbf{I}_{K}\) (at a cost of K^{3}/3) and then solve a simple linear equation system for each row of \(\frac{1}{\sqrt{K}}\widehat{\mathbf{H}}^{\text{\tiny H}}\) (at a cost of 2K^{2} each) ([36], Slides 9–6, 9). This approach is preferable to the alternative of completely inverting the matrix (again using Cholesky factorization) and then using matrix-matrix multiplication, as long as K^{3}−KM>0, given that the alternative method has a cost of 4K^{3}/3+MK(2K−1). It is interesting to note here that, for the case of M≫K, the matrix-matrix multiplication is actually more expensive than the matrix inversion (2MK^{2} vs. K^{3}).^{1}
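Once G_{RZF} has been computed, the matrix-vector multiplication G_{RZF}s requires M(2K−1) operations per channel use of data transmission. In summary, RZF precoding has a total number of complex operations per coherence period of
\[ C_{\text{RZF}}^{\text{pcp}} = \underbrace{4K^{2}M + \frac{K^{3}}{3} - K^{2} + MK + 2K}_{\text{precoder computation}} + T_{\text{data}}^{\text{pcp}}\, M(2K-1), \]
which is the sum of the four step counts above plus the per-channel-use multiplications.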
There is a second approach to looking at the RZF precoder complexity. Let the transmit signal with RZF precoding at channel use t be denoted as \(\mathbf{x}^{(t)}_{\text{RZF}}\). The transmitted signal is then \(\mathbf{x}^{(t)}_{\text{RZF}} = \mathbf{G}_{\text{RZF}} \mathbf{s}^{(t)}= \frac{1}{\sqrt{K}} \widehat{\mathbf{H}}\left(\frac{1}{K}\widehat{\mathbf{H}}^{\text{\tiny H}}\widehat{\mathbf{H}}+\xi \mathbf{I}_{K}\right)^{-1}\beta \mathbf{P}^{\frac{1}{2}}\mathbf{s}^{(t)}\). Thus, one can replace the “matrix times inverse of another matrix” operation taking place each coherence period by a matrix-inverse operation per coherence period and two matrix-vector multiplications per data symbol vector. Thus, one effectively splits the previous point (3) into two parts and waits for the symbol vector to allow for the matrix-vector multiplications. This results in
Still, this complexity is dominated by the matrix-matrix multiplication inside the inverse. However, the per coherence period complexity is reduced in exchange for a slight increase in complexity per symbol. Depending on the use case of the precoder, this change can either be advantageous or disadvantageous (see Fig. 1 and Subsection 4.2). We note that choosing to incorporate the multiplication with \(\mathbf{P}^{\frac{1}{2}}\) per coherence period or per symbol vector changes the stated outcomes only insignificantly. In the following, we will choose the appropriate version for each comparison.
Next, we consider TPE precoding. Similar to before, we assume that \(\frac {1}{\sqrt {K}} \widehat {\mathbf {H}}\), w _{ ℓ }, and \(\mathbf {P}^{\frac {1}{2}}\) are available in advance and the Hermitian operation is “free.” Let the transmit signal vector with TPE precoding at channel use t be denoted as \({\mathbf {x}}^{(t)}_{\text {TPE}} \) and observe that it can be expressed as
where s ^{(t)} is the vector of data symbols at channel use t and
This reveals that there is an iterative way of computing the J terms in TPE precoding. The benefit of this approach is that it can be implemented using only matrix-vector multiplications.^{2}
Similar to the above, we conclude that the case ℓ=0 uses K+M(2K−1) operations and each of the J−1 cases of ℓ≥1 needs M(2K−1)+K(2M−1) operations. Note that it is impractical and unnecessary to carry out a matrix-matrix multiplication at this step. Finally, the multiplication with w _{ ℓ } and the summation require M(2J−1) further operations. In summary, TPE precoding has a total number of arithmetic operations of
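The iterative computation can be sketched in plain Python as follows. The sketch assumes that the TPE transmit signal has the form \(\sum _{\ell =0}^{J-1} w_{\ell } (\frac {1}{K}\widehat {\mathbf {H}}\widehat {\mathbf {H}}^{\text {\tiny H}})^{\ell } \frac {1}{\sqrt {K}}\widehat {\mathbf {H}}\mathbf {P}^{\frac {1}{2}}\mathbf {s}\), which is consistent with the operation counts above. The function names, and the convention that the channel estimate is passed already scaled by \(1/\sqrt {K}\), are illustrative choices.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


def tpe_transmit(H, w, p_sqrt, s):
    """Return (x, n_matvec) for the assumed TPE form.

    H      : M x K channel estimate, assumed already scaled by 1/sqrt(K)
    w      : the J polynomial coefficients w_0, ..., w_{J-1}
    p_sqrt : diagonal of P^(1/2), length K
    s      : data symbol vector, length K
    """
    H_h = hermitian(H)
    # Case l = 0: one diagonal scaling and one matrix-vector product.
    x_l = matvec(H, [p * sym for p, sym in zip(p_sqrt, s)])
    x = [w[0] * v for v in x_l]
    n_matvec = 1
    for w_l in w[1:]:
        # Cases l >= 1: two matrix-vector products, never a matrix-matrix product.
        x_l = matvec(H, matvec(H_h, x_l))
        n_matvec += 2
        x = [acc + w_l * v for acc, v in zip(x, x_l)]
    return x, n_matvec
```

For J coefficients the loop performs 2J−1 matrix-vector products in total, and the precoding matrix itself is never formed.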
When comparing RZF and TPE precoding, we note that the complexity of precomputing the RZF precoding matrix is very large, but it is only done once per coherence period. The corresponding matrix G _{TPE} for TPE precoding is never computed separately but only indirectly as G _{TPE} s for each data symbol vector s. Intuitively, precomputation is beneficial when the coherence period is long (compared to M and K), and the sequential computation of TPE precoding is beneficial when the system dimensions M and K are large (compared to the coherence period) or the coherence period is short. This is seen from the large-dimensional complexity scaling, which is \(\mathcal {O}(4 K^{2} M)\) or \(\mathcal {O}(2 K^{2} M)\) for RZF precoding (for the RZF and RZF2 approaches, respectively) and \(\mathcal {O}(4 J K M T_{\text {data}}^{\text {pcp}})\) for TPE precoding; thus, the asymptotic difference is significant. The break-even point, where TPE precoding outperforms RZF, is easily computed by looking at \(C_{\text {RZF}}^{\text {pcp}} > C_{\text {TPE}}^{\text {pcp}}\)
and similar for \(C_{\text {RZF2}}^{\text {pcp}} > C_{\text {TPE}}^{\text {pcp}}\).
One should not forget the overhead signaling required to obtain CSI at the UTs, which makes the number of channel uses T _{data} available for data symbols reduce with K. For example, suppose T _{coherence} is the total coherence period and that we use a TDD protocol, where η _{DL} is the fraction used for downlink transmission and μ K channel uses (for some μ≥1) are consumed by downlink pilot signals that provide the UTs with sufficient CSI. We then have T _{data}=η _{DL} T _{coherence}−μ K. Using this relationship, the number of arithmetic operations is illustrated numerically in Fig. 1 for \(\eta _{\text {DL}} = \frac {1}{2}\), K=100, and μ=2.^{3} This figure shows that TPE precoding uses fewer operations than RZF precoding when the coherence period is short and the TPE order is small, while RZF is competitive for long coherence times.
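The comparison can be reproduced in spirit with a short Python sketch that sums the per-step operation counts stated in this subsection. It is a sketch under assumptions: the totals are assembled from the step counts in the text rather than copied from the closed-form expressions, the function names are illustrative, and the value M=500 is hypothetical.

```python
def data_channel_uses(t_coherence, eta_dl, mu, K):
    """T_data = eta_DL * T_coherence - mu * K."""
    return eta_dl * t_coherence - mu * K


def rzf_total(M, K, t_data):
    """Precomputation steps (1)-(4) plus one matrix-vector product per channel use."""
    precompute = K**2 * (2*M - 1) + K + (K**3 / 3 + 2 * K**2 * M) + (M*K + K)
    return precompute + t_data * M * (2*K - 1)


def tpe_total(M, K, J, t_data):
    """Per-symbol TPE cost (l = 0 term, J-1 further terms, weighting and summation)."""
    per_symbol = ((K + M * (2*K - 1))
                  + (J - 1) * (M * (2*K - 1) + K * (2*M - 1))
                  + M * (2*J - 1))
    return t_data * per_symbol


if __name__ == "__main__":
    K, M = 100, 500  # M is a hypothetical choice for illustration
    t_data = data_channel_uses(t_coherence=1000, eta_dl=0.5, mu=2, K=K)
    for J in (1, 2, 4):
        print(J, tpe_total(M, K, J, t_data) < rzf_total(M, K, t_data))
```

With these illustrative numbers, only the lowest TPE order stays below the RZF count, in line with the qualitative statement about short coherence periods and small J.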
We remark that all previously found results change in favor of TPE, if one uses the canonical transformation of complex to real operations by doubling all dimensions.
Remark 1.
Power normalization. In this section, we assumed that β and w _{ ℓ } (and ξ) are known beforehand. These factors are responsible for the power normalization of the transmit signal. Depending on the chosen normalization, computing them can require the full precoding matrix to be known; this is the case, for example, for the average per-UT normalization used in this paper. Such a normalization thus rules out the alternative implementation of RZF precoding detailed before. Note that this could be remedied by changing to “strict” per-UT normalization. In general, we can find values for β and w _{ ℓ } that rely only on channel statistics and are valid in the large (M,K) regime. This, and the possible fix for the alternative RZF approach, has motivated us to assume β and w _{ ℓ } as known.
Delay to the first transmission for RZF and TPE
A practically important complexity metric is the number of complex operations for the first channel use. This number can also be interpreted as the delay until the start of data transmission. This complexity can easily be found from the previous results, by choosing T _{data}=1. Directly looking at the massive MIMO case, we find \(C_{\text {RZF}}^{\text {1st}} = 4 M K^{2}\), \(C_{\text {RZF2}}^{\text {1st}} = 2 M K^{2}\), and \( C_{\text {TPE}}^{\text {1st}} = 4JMK\). Hence, the first data vector is transmitted by a factor of K/(2J) earlier,^{4} when TPE precoding is employed. This factor is significant and gives TPE precoding practical relevance, especially in massive MIMO systems and in very fast changing environments, i.e., when coherence periods are very short. We also remark that not wasting time during the coherence period pays off greatly, as the lost channel uses are given by the saved time multiplied by the (often large) coherence bandwidth.
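The speed-up factors follow directly from the three leading-order counts. A minimal Python sketch, with an illustrative function name and with example dimensions taken from the orders of magnitude mentioned in the endnote (K=100, M=10K, J=4):

```python
def first_use_costs(M, K, J):
    """Leading-order operation counts for the first channel use, as stated in the text."""
    return {"RZF": 4 * M * K**2, "RZF2": 2 * M * K**2, "TPE": 4 * J * M * K}


if __name__ == "__main__":
    K, J = 100, 4
    c = first_use_costs(M=10 * K, K=K, J=J)
    print(c["RZF"] / c["TPE"])   # equals K / J
    print(c["RZF2"] / c["TPE"])  # equals K / (2 J)
```

The factor K/(2J) quoted above is thus the more conservative comparison, against the RZF2 variant; against the first RZF variant the ratio is K/J.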
Implementation complexity of RZF and TPE precoding
In practice, the main issue is not the number of arithmetic operations but the implementation cost in terms of hardware complexity, time delays, and energy consumption. The analysis in Subsection 4.1 showed that we can only expect improvements in the sum of complex operations from TPE precoding per coherence period in certain scenarios. However, one advantage of TPE precoding is that it enables multistage hardware implementation where the computations are pipelined [20] over multiple processing cores (e.g., application-specific integrated circuits (ASICs)). This structure is illustrated in Fig. 2, where the transmitted signal x ^{(t)} is prepared in various cores (black path), while the preceding and succeeding transmit signals are computed in the “free” cores (gray paths). Each processing core performs two simple matrix-vector multiplications, each requiring approximately \(\mathcal {O} (2MK)\) complex additions and multiplications per coherence period. This is relatively easy to implement using ASICs or FPGAs, which are known to be very energy-efficient and have low production cost. Consequently, we can select the TPE order J as large as needed to obtain a certain precoding accuracy, if we are prepared to use as many circuits of the same type as needed. Then, the delay between two consecutive transmitted symbol vectors is given only by the delay of two matrix-vector multiplications.
In comparison, the inversion of RZF precoding can only be pseudo-parallelized by using tree structures. Hence, the pipelining of the C _{RZF} complex operations per coherence period is limited by the delay of a single processing core that implements the inverse of a matrix-matrix product; this delay is most probably much larger than that of the two matrix-vector multiplications of TPE. The delay of a second core implementing the multiplication of the inverse with the channel matrix is negligible in comparison. As mentioned before, the precomputation of the RZF precoding matrix causes non-negligible delays that force \(T_{\text {data}}^{\text {pcp}}\) to be smaller than for TPE precoding; for example, [35] describes a hardware implementation from [37] where it takes 0.15 ms to compute RZF precoding for K=15, which translates to a loss of 0.15 ms × 200 kHz=30 channel uses in a system with coherence bandwidth 200 kHz. Also, the number of active UTs can be much larger than this in large-scale MIMO systems [38]. TPE precoding does not cause such delays because there are no precomputations; the arithmetic operations are spread over the coherence period.
In practice, this means one can argue that only the curve pertaining to J=1 in Fig. 1 is relevant for comparisons between TPE and RZF after implementation, if one is prepared to add (seemingly unfairly) as many computation cores as necessary to TPE.
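The conversion from precomputation delay to lost channel uses is a simple product of delay and coherence bandwidth. A minimal Python sketch (the function name is an illustrative assumption) that reproduces the figure quoted above:

```python
from fractions import Fraction


def lost_channel_uses(delay_ms, bandwidth_khz):
    """Channel uses lost while waiting: delay [ms] x coherence bandwidth [kHz].

    The units cancel (1e-3 s times 1e3 per second), leaving a plain count.
    Decimal strings are accepted so that values such as "0.15" stay exact.
    """
    return Fraction(delay_ms) * Fraction(bandwidth_khz)


if __name__ == "__main__":
    print(lost_channel_uses("0.15", 200))
```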
Analysis and optimization of TPE precoding
In this section, we consider the large (M,K) regime, defined in Assumption 5. We show that SINR _{ k }, for k=1,…,K, under TPE precoding converges to a limit, a deterministic equivalent, that depends only on the coefficients w _{ ℓ }, the respective attributed power p _{ k }, and the channel statistics.
Recall the SINR expression in (5) and observe that g _{ k }=G e _{ k } and \(\mathbf {h}_{k}^{\text {\tiny H}}\mathbf {G}_{k}\mathbf {G}_{k}^{\text {\tiny H}}\mathbf {h}_{k} = \mathbf {h}_{k}^{\text {\tiny H}}\mathbf {G}\mathbf {G}^{\text {\tiny H}}\mathbf {h}_{k} - \mathbf {h}_{k}^{\text {\tiny H}}\mathbf {g}_{k}\mathbf {g}_{k}^{\text {\tiny H}}\mathbf {h}_{k}\), where e _{ k } is the kth column of the identity matrix I _{ K }. By substituting the TPE precoding expression (18) into (5), it is easy to show that the SINR can be written as
where w=[w _{0} … w _{ J−1}]^{T} and the (ℓ,m)th elements of the matrices A _{ k }, \(\mathbf {B}_{k} \in \mathbb {C}^{J \times J}\) are
for ℓ=0,…,J−1 and m=0,…,J−1.^{5}
Since the random matrices A _{ k } and B _{ k } are of finite dimensions, it suffices to determine a deterministic equivalent for each of their elements. To achieve this, we express them using the resolvent matrix of \(\widehat {\mathbf {H}}\). This can be done by introducing the following random functionals in t and u:
By taking derivatives of X _{ k,M }(t,u) and Z _{ k,M }(t,u) at the point (t,u)=(0,0), we obtain
Substituting (24) and (25) into (20) and (21), respectively, we obtain the alternative expressions
It, thus, suffices to study the asymptotic convergence of the bivariate functions X _{ k,M }(t,u) and Z _{ k,M }(t,u). This is achieved by the following new theorem and its corollary:
Theorem 4.
Consider a channel matrix \(\widehat {\mathbf {H}}\) whose columns are distributed according to Assumption 3. Under the asymptotic regime described in Assumption 5, we have
and
where
and β _{ M }(t,u) is given by
Proof 2.
The proof leans heavily on lemmas presented in Appendix 1 and is detailed in Appendix 2.
Corollary 1.
Assume that Assumptions 1 and 5 hold true. Then, we have
and
Proof 3.
See Appendix 4.
Corollary 1 shows that the entries of A _{ k } and B _{ k }, which depend on the derivatives of X _{ k,M }(t,u) and Z _{ k,M }(t,u), can be approximated in the asymptotic regime by T ^{(ℓ)} and δ ^{(ℓ)}, which are the derivatives of T(t) and δ(t) at t=0. Such derivatives can be computed numerically using the iterative algorithm of [21], which is provided in Appendix 6 for the sake of completeness.
It remains to compute the aforementioned derivatives. To this end, we denote \(f(t)=\frac {1}{1+t\delta (t)}\), \(\boldsymbol {\mathcal {T}}(t)=f(t)\mathbf {T}(t)\), and by f ^{(ℓ)}, \(\boldsymbol {\mathcal {T}}^{(\ell)}\) their derivatives at t=0. \(\boldsymbol {\mathcal {T}}^{(\ell)}\) can be calculated using the Leibniz derivation rule \( \boldsymbol {\mathcal {T}}^{(\ell)} = \left. \left (\mathbf {T}(t) f(t) \right)^{(\ell)} \right |_{t=0} = \sum _{n=0}^{\ell } {\ell \choose n} \mathbf {T}^{(n)} f^{(\ell -n)} \) and the respective values from Appendix 6. Rewriting (26) as
and using the Leibniz rule, we obtain for any integers ℓ and m greater than 1, the expression
An iterative algorithm for the computation of \(\beta _{M}^{(\ell,m)}\) is given in Appendix 5.
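The Leibniz rule used in the steps above is easy to exercise numerically. The following Python sketch applies it to scalar derivative sequences; the matrix case replaces the scalar products by matrix products. The function name is an illustrative assumption, and the inputs are simply lists of derivatives at t=0.

```python
from math import comb


def leibniz_derivatives(T_derivs, f_derivs):
    """Derivatives at t=0 of the product T(t) f(t).

    T_derivs[n] and f_derivs[n] hold the n-th derivatives of T and f at t=0.
    Entry l of the result is sum_{n=0}^{l} C(l, n) * T^(n) * f^(l-n).
    """
    order = min(len(T_derivs), len(f_derivs))
    return [sum(comb(l, n) * T_derivs[n] * f_derivs[l - n] for n in range(l + 1))
            for l in range(order)]


if __name__ == "__main__":
    # exp(2t) * exp(3t) = exp(5t), so the l-th derivative at 0 must be 5**l.
    print(leibniz_derivatives([2**n for n in range(5)], [3**n for n in range(5)]))
```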
With these derivation results at hand, we are now in the position to determine the expressions for the derivatives of the quantities of interest, namely \(\overline {X}_{k,M}(t,u)\) and \(\overline {b}_{M}(t,u)\). Using again the Leibniz derivation rule, we obtain
Using these results in combination with Corollary 1, we immediately obtain the asymptotic equivalents of A _{ k } and B _{ k }:
Corollary 2.
Let \(\widetilde {\mathbf {A}}\) and \(\widetilde {\mathbf {B}}\) be the J×J matrices, whose entries are
Then, in the asymptotic regime, for any k∈{1,…,K} we have
Optimization of the polynomial coefficients
Next, we consider the optimization of the asymptotic SINRs with respect to the polynomial coefficients w=[w _{0} …w _{ J−1}]^{T}. Using results from the previous sections, a deterministic equivalent for the SINR of the kth UT is
The optimized TPE precoding should satisfy the power constraints in (8)
or equivalently
where the (ℓ,m)th element of the J×J matrix C is
In order to make the optimization problem independent of the channel realizations, we replace the constraint in (28) by a deterministic one, which depends only on the statistics of the channel. To find a deterministic equivalent of the matrix C, we introduce the random quantity
whose derivatives \(Y_{M}^{(\ell,m)}\) satisfy
Using the same method as for the matrices A and B, we obtain the following result:
Theorem 5.
Considering the setting of Theorem 4, we have the following convergence results:

1. Let \(c(t,u)=\frac {\frac {1}{K}\text {tr} \left (\boldsymbol {\Phi }\mathbf {T}(u)\mathbf {T}(t) \right)}{(1+t\delta (t))(1+u\delta (u))}(1+tu\beta (t,u))\), then
$$ Y_{M}(t,u)-\text{tr} \left(\mathbf{P}\right)c(t,u)\xrightarrow[M,K\to+\infty]{\mathrm{a.s.}}0. $$
2. Denote by c ^{(ℓ,m)} the ℓth and mth derivatives with respect to t and u, respectively, then
$$\begin{array}{*{20}l} c^{(\ell,m)}&=\sum\limits_{k=1}^{\ell}\sum\limits_{n=1}^{m} kn{\ell \choose k}{m \choose n}\beta^{(n-1,k-1)}\\ &\quad\times\frac{1}{K}\text{tr} \left(\boldsymbol{\Phi} \boldsymbol{\mathcal{T}}^{(\ell-k)}\boldsymbol{\mathcal{T}}^{(m-n)} \right) \\ &\quad+\frac{1}{K}\text{tr}\left(\boldsymbol{\Phi}\boldsymbol{\mathcal{T}}^{(m)}\boldsymbol{\mathcal{T}}^{(\ell)}\right) \end{array} $$
3. Let \(\widetilde {\mathbf {C}}\) be the J×J matrix with entries given by
$$ [\widetilde{\mathbf{C}}]_{\ell,m}=\frac{(-1)^{\ell+m}c^{(\ell,m)}}{\ell!m!}. $$
Then, in the asymptotic regime
$$ \left\|\mathbf{C}-\text{tr} \left(\mathbf{P}\right)\widetilde{\mathbf{C}}\right\|\xrightarrow[M,K\to+\infty]{\mathrm{a.s.}} 0. $$
Proof 4.
The proof relies on the same techniques as before, so we provide only a sketch in Appendix 7.
Based on Theorem 5, we can consider the deterministic power constraint
which can be seen as an approximation of (28), in the sense that for any w satisfying (30), we have
Now, the maximization of the asymptotic SINR of UT k amounts to solving the following optimization problem:
The next theorem shows that the optimal solution, w _{opt}, to (31) admits a closedform expression.
Theorem 6.
Let a be a unit norm eigenvector corresponding to the maximum eigenvalue \(\lambda _{\max }\) of
Then, the optimal value of the problem in (31) is achieved by
where the scaling factor α is
Moreover, for the optimal coefficients, the asymptotic SINR for the kth UT is
Proof 5.
The proof is given in Appendix 8.
The optimal polynomial coefficients for UT k are given in (33) of Theorem 6. Interestingly, these coefficients are independent of the user index; thus, we have indeed derived the jointly optimal coefficients. Furthermore, all users converge to the same deterministic SINR up to a UT-specific scaling factor \(\frac {K p_{k}}{\gamma \text {tr} (\mathbf {P}) }\).
Remark 2.
The asymptotic SINR expressions in (35) are only functions of the statistics and the power allocation p _{1},…,p _{ K }. The power allocation can be optimized with respect to some system performance metric. For example, one can show that the asymptotic average achievable rate
is maximized by a uniform power allocation \(p_{k} \! =\! \frac {P}{K}\) for all k.
Remark 3.
Theorem 6 shows that the J polynomial coefficients that jointly maximize the asymptotic SINRs can be computed using only the channel statistics and the channel estimation error. The optimal coefficients are then given in closed form in (33). Numerical experiments show that the coefficients are very robust to underestimation of τ and robust to overestimation. Hence, the main feature of Theorem 6 is that the TPE precoding coefficients can be computed beforehand or at least be updated at the relatively slow rate of change of the channel statistics. Thus, the cost of the optimization step is negligible with respect to calculating the precoding itself. The performance of finite-dimensional large-scale MIMO systems is evaluated numerically in Section 6.
Remark 4.
Finally, we remark that Assumption 5 prevents us from directly analyzing the scenario where K is fixed and \(M \rightarrow \infty \), but we can infer the behavior of TPE precoding based on previous works. In particular, it is known that MRT is an asymptotically optimal precoding scheme in this scenario [4]. We recall from Section 3.2 that TPE precoding reduces to MRT for J=1. Hence, we expect the optimal coefficients to behave as w _{0}≠0 and \(w_{\ell }\rightarrow 0\) for ℓ≥1 when \(M \rightarrow \infty \). In other words, we can reduce J as M grows large and still keep a fixed performance gap to RZF precoding.
Simulation results
In this section, we compare the RZF precoding from [16] (which was restated in (9)) with the proposed TPE precoding (defined in (18)) by means of simulations. The purpose is to validate the performance of the proposed precoding scheme and illustrate some of its main properties. The performance measure is the average achievable rate
of the UTs, where the expectation is taken with respect to different channel realizations and users. In the simulations, we model the channel covariance matrix as
where a is chosen to be 0.1. This approach is known as the exponential correlation model [39]. More involved models could be chosen here but would make it harder to evaluate the performance and function of TPE, while not offering more insight. The sum power constraint
is applied for both precoding schemes. Unless otherwise stated, we use uniform power allocation for the UTs, since the asymptotic properties of RZF precoding are known in this case (see Theorem 3). Without loss of generality, we have set σ ^{2}=1. Our default simulation model is a large-scale single-cell MIMO system of dimensions M=128 and K=32.
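For readers who wish to reproduce the setup, the exponential correlation model can be generated with a few lines of Python. The sketch assumes the common real-valued form in which entry (i,j) of the covariance matrix equals a ^{|i−j|}; this form, the function name, and the list-of-rows representation are assumptions made for illustration.

```python
def exponential_correlation(M, a):
    """M x M covariance matrix with entries a**|i - j| (assumed real-valued form)."""
    return [[a ** abs(i - j) for j in range(M)] for i in range(M)]


if __name__ == "__main__":
    Phi = exponential_correlation(M=128, a=0.1)  # default dimensions from the text
    print(Phi[0][:4])
```

With a=0.1 the correlation between antennas decays by a factor of ten per index step, so the matrix is close to the identity.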
We first take a look at Fig. 3. It considers a TPE order of J=3 and three different quality levels of the CSI at the BS: τ∈{0.1, 0.4, 0.7}. From Fig. 3, we see that RZF and TPE achieve almost the same average UT performance when a bad channel estimate is available (τ=0.7). Furthermore, TPE and RZF perform almost identically at low SNR values, for any τ. In general, the unsurprising observation is that the rate difference becomes larger at high SNRs and when τ is small (i.e., with more accurate channel knowledge).
Figure 4 shows more directly the relationship between the average achievable UT rates and the TPE order J. We consider the case τ=0.1, M=512, and K=128, in order to be in a regime where TPE performs relatively poorly (see Fig. 3) and the precoding complexity becomes an issue. From the figure, we see that choosing a larger value for J gives a TPE performance closer to that of RZF. However, doing so will also require more hardware; see Section 4.3. The proposed TPE precoding never surpasses the RZF performance, which is noteworthy since TPE has J degrees of freedom that can be optimized (see Section 5.1), while RZF only has one design parameter. Hence, one can regard RZF precoding as an upper bound to TPE precoding in the single-cell scenario.^{6}
It is desirable to select the TPE order J in such a way that we achieve a certain limited rate loss with respect to RZF precoding. Figure 5 illustrates the rate loss (per UT) between TPE and RZF, while the number of UTs K and transmit antennas M increase with a fixed ratio (M/K=4). The figure considers the case of τ=0.1. We observe that the TPE order J and the system dimensions are independent in their respective effects on the rate loss between TPE and RZF precoding. This observation is in line with previous results on polynomial expansions, for example, [19] where reduced-rank received filtering was considered. The independence between J and the system dimensions M and K (given the same ratio) is indeed a main motivation behind TPE precoding, because it implies that the order J can be kept small even when TPE precoding is applied to very large-scale MIMO systems. The intuition behind this result is that the polynomial expansion approximates the inversion of each eigenvalue with the same accuracy, irrespective of the number of eigenvalues; see Section 3.2 for details. Although the relative performance loss is unaffected by the system dimensions, we also see that J needs to be increased along with the SNR, if a constant performance gap is desired.
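The per-eigenvalue intuition can be illustrated with a scalar Python sketch. It assumes a truncated geometric-series approximation of the inverse, 1/y ≈ α Σ_{ℓ<J} (1−αy)^ℓ, which converges when |1−αy|<1; whether this matches the exact expansion of Section 3.2 is an assumption, and the function names and the scaling α are illustrative.

```python
def truncated_inverse(y, alpha, J):
    """Degree-(J-1) polynomial approximation of 1/y: alpha * sum_{l<J} (1 - alpha*y)**l."""
    return alpha * sum((1 - alpha * y) ** l for l in range(J))


def relative_error(y, alpha, J):
    """|1 - y * approx|, which equals |1 - alpha*y|**J for this expansion."""
    return abs(1 - y * truncated_inverse(y, alpha, J))


if __name__ == "__main__":
    # The error for a given eigenvalue y depends only on alpha*y and J,
    # not on how many other eigenvalues the matrix has.
    for J in (1, 2, 4, 8):
        print(J, relative_error(0.5, 1.0, J))
```

Applied to a matrix, the same polynomial acts on each eigenvalue separately, so enlarging the matrix at a fixed M/K ratio (which keeps the eigenvalue range essentially fixed) does not by itself call for a larger J.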
In the simulation depicted in Fig. 6, we introduce a hypothetical case of TPE precoding (TPEopt) that optimizes the J coefficients using the estimated channel coefficients in each coherence period, instead of relying solely on the channel statistics. More precisely, the optimal coefficients in Theorem 6 are not computed using the deterministic equivalents of \(\widetilde {\mathbf {A}}\), \(\widetilde {\mathbf {B}}\), and \(\widetilde {\mathbf {C}}\) but using the original matrices from (20), (21), and (29). This plot illustrates the additional performance loss caused by precalculating the TPE coefficients based on channel statistics and asymptotic analysis, instead of carrying out the optimization step for each channel realization. The difference is virtually zero at low SNRs and becomes large at high SNRs. Furthermore, we note that increasing the value of J has the same performance-gap-reducing effect on TPEopt as it has on TPE (see Figs. 4 and 5). In order to preserve readability, only the curves pertaining to J=3 are shown in Fig. 6.
Finally, to assess the validity of our results, we treat the case of non-uniform power allocation (i.e., with different values for p _{ k }). In particular, we considered a situation where the users are divided into four classes corresponding to {c _{1},c _{2},c _{3},c _{4}}={1,2,3,4}, where \(p_{k} = \frac {c_{k}}{K}\) in order to adhere to the scaling in Assumption 4. Figure 7 shows the theoretical large (M,K) regime (DE; based on (35)) and empirical (MC; based on (19)) average rate per UT for each class, when K=32, M=128, and τ=0.1. We note in particular the very good agreement between our theoretical analysis and the empirical system performance.
Conclusions
Conventional RZF precoding provides attractive system throughput in massive MIMO systems, but its computational and implementation complexity is prohibitively high, due to the required channel matrix inversion. In this paper, we have proposed a new class of TPE precoding schemes where the inversion is approximated by truncated polynomial expansions to enable simple hardware implementation. In the single-cell downlink with M transmit antennas and K single-antenna users, this new class can approximate RZF precoding to an arbitrary accuracy by choosing the TPE order J in the interval \(1\leq J \leq \min (M,K)\). In terms of implementation complexity, TPE precoding has several advantages: (1) There is no need to compute the precoding matrix beforehand (which leaves more channel uses for data transmission); (2) the delay to the first transmitted symbol is reduced significantly; (3) the multistage structure enables pipelining; and (4) the parameter J can be tailored to the available hardware.
Although the polynomial coefficients depend on the instantaneous channel realizations, we have shown that the per-user SINRs converge to deterministic values in the large (M,K) regime. This enabled us to compute asymptotically optimal coefficients using merely the statistics of the channels. The simulations revealed that the difference in performance between RZF and TPE is small at low SNRs and for large CSI errors. The TPE order J can be chosen very small in these situations, and in general, it does not need to scale with the system dimensions. However, to maintain a fixed per-user rate loss compared to RZF, J should increase with the SNR or as the CSI quality improves.
Endnotes
^{1} Matrix multiplication combined with matrix inversion can be implemented using Strassen’s algorithm in [40] and the improved Coppersmith-Winograd algorithm in [41]. These are divide-and-conquer algorithms that exploit that 2×2 matrices can be multiplied efficiently and thereby reduce the asymptotic complexity of multiplying/inverting K×K matrices to \(\mathcal {O}(K^{2.8074})\) and \(\mathcal {O}(K^{2.373})\), respectively. Unfortunately, the overhead in these algorithms is heavy and thus K needs to be on the order of several thousands to achieve a lower complexity than the Cholesky approach considered here. Hence, these alternative algorithms are unfavorable for matrices of practical sizes.
^{2} Intuitively, one circumvents the expensive matrix-matrix multiplication with a domino-like chain of 2J−1 (less expensive) matrix-vector multiplications per transmitted symbol vector. This became possible by replacing the inverse of a matrix-matrix multiplication in the RZF with a sum of weighted matrix powers.
^{3} These parameter values correspond to symmetric downlink/uplink transmission and 2 downlink pilot symbols per UT (at different frequencies). Looking at values similar to the LTE standard ([42] Chapter 10), e.g., a coherence bandwidth of 200 kHz and a coherence period of 5 ms, one would arrive at a T _{coherence} of 1000.
^{4} Depending on the massive MIMO system, K can be on the order of 100 and M of the order 10K, while we will see later that J=4 is sufficient for many cases.
^{5} The entries of matrices are numbered from 0, for notational convenience.
^{6} The optimal precoding parametrization in [15] has K−1 parameters. To optimize some general performance metric, it is therefore necessary to let the number of design parameters scale with the system dimensions.
Appendix 1: Useful lemmas
Lemma 2.
(Common inverses of resolvents) Given any matrix \(\widehat {\mathbf {H}} \in \mathbb {C}^{M\times K}\), let \(\widehat {\mathbf {h}}_{k}\) denote its kth column and \(\widehat {\mathbf {H}}_{k}\) denote the matrix obtained after removing the kth column from \(\widehat {\mathbf {H}}\). The resolvent matrices of \(\widehat {\mathbf {H}}\) and \(\widehat {\mathbf {H}}_{k}\) are denoted by \( \mathbf {Q}(t)=\left (\frac {t}{K}\widehat {\mathbf {H}}\widehat {\mathbf {H}}^{\text {\tiny H}}+\mathbf {I}_{M}\right)^{-1} \) and \(\mathbf {Q}_{k}(t)=\left (\frac {t}{K}\widehat {\mathbf {H}}_{k}\widehat {\mathbf {H}}_{k}^{\text {\tiny H}}+\mathbf {I}_{M}\right)^{-1} \), respectively. It then holds that
and also
Proof 6.
This follows from the Woodbury identity [43].
The following lemma characterizes the asymptotic behavior of quadratic forms. It will be of frequent use in the computation of deterministic equivalents.
Lemma 3.
(Convergence of quadratic forms) Let \(\mathbf {x}_{M}=\left [X_{1},\ldots,X_{M}\right ]^{\text {\tiny T}}\) be an M×1 vector with i.i.d. complex Gaussian random variables with unit variance. Let A _{ M } be an M×M matrix independent of x _{ M }, whose spectral norm is bounded; that is, there exists \(C_{A} < \infty \) such that ∥A _{ M }∥_{2}≤C _{ A }. Then, for any p≥1, there exists a constant C _{ p } depending only on p, such that
where the expectation is taken over the distribution of x _{ M }. By choosing p≥2, we thus have that
Lemma 4.
Let A _{ M } be as in Lemma 3, and x _{ M },y _{ M } be random, mutually independent with complex Gaussian entries of zero mean and variance 1. Then,
Lemma 5.
(Rank-one perturbation lemma) Let Q(t) and Q _{ k }(t) be the resolvent matrices as defined in Lemma 2. Then, for any matrix A we have
Lemma 6.
Let X _{ M } and Y _{ M } be two scalar random variables with variances such that \(\text {var}(X_{M})=\mathcal {O}\left (M^{-2}\right)\) and \(\text {var}(Y_{M})=\mathcal {O}\left (M^{-2}\right)=\mathcal {O}\left (K^{-2}\right)\). Then,
Proof 7.
We have
Using the Cauchy-Schwarz inequality, we see that
which establishes the desired result.
Appendix 2: Proof of Theorem 4
Here, we present the proof of Theorem 4, which establishes the asymptotic convergence of X _{ k,M }(t,u) and Z _{ k,M }(t,u) to deterministic quantities.
Deterministic equivalent for X _{ k,M }(t,u)
We will begin the proof by looking at the random quantity X _{ k,M }(t,u). Using the notation of Lemma 2, we can write
To control the quadratic form \(\frac {1}{K}\mathbf {h}_{k}^{\text {\tiny H}}\mathbf {Q}(t)\widehat {\mathbf {h}}_{k}\), we need to remove the dependency of Q(t) on vector \(\widehat {\mathbf {h}}_{k}\). For that, we shall use the relation in (36), thereby yielding
Using Lemma 3, we thus have
Since \(\frac {1}{K}\text {tr} \left (\boldsymbol {\Phi }\mathbf {Q}_{k}(t) \right) - \frac {1}{K}\text {tr} \left (\boldsymbol {\Phi }\mathbf {Q}(t) \right) \xrightarrow [M,K\to +\infty ]{\mathrm {a.s.}} 0\), by the rank-one perturbation property in Lemma 5, we have
Finally, Theorem 1 implies that
The same kind of calculations can be used to deal with the quadratic form \(\frac {1}{K}\mathbf {h}_{k}^{\text {\tiny H}}\mathbf {Q}_{k}(t)\widehat {\mathbf {h}}_{k}\), whose asymptotic limit is the same as \(\frac {\sqrt {1-\tau ^{2}}}{K}\widehat {\mathbf {h}}_{k}^{\text {\tiny H}}\mathbf {Q}_{k}(t)\widehat {\mathbf {h}}_{k}\), due to the independence between the channel estimation error and the channel vector h _{ k }. Hence,
Plugging the deterministic approximation of (39) and (40) into (38), we thus see that
and hence,
Deterministic equivalent for Z _{ k,M }(t,u)
Finding a deterministic equivalent for Z _{ k,M }(t,u) is much more involved than for X _{ k,M }(t,u). Following the same steps as in the section “Deterministic equivalent for X _{ k,M }(t,u)” of Appendix 2, we decompose Z _{ k,M }(t,u) as
As will be shown next, to determine the asymptotic limit of the random variables X _{ i }(t,u),i=1,…,4, we need to find a deterministic equivalent for
This is the most involved step of the proof. It will, thus, be treated separately in Appendix 3, where we establish the following lemma:
Lemma 7.
Let H be an M×K random matrix whose columns are drawn according to Assumption 1. Define for t≥0, the resolvent matrix \( \mathbf {Q}(t)=\left (\frac {t}{K}\mathbf {H}\mathbf {H}^{\text {\tiny H}}+\mathbf {I}_{M}\right)^{-1}. \) Let A be an M×M deterministic matrix with uniformly bounded spectral norm and \(\widehat {\alpha }_{M}(t,u,\mathbf {A})\) given as
Then, in the asymptotic regime described by Assumption 5, we have
where
In particular, if A=Φ, we have
The proof of this lemma is deferred to Appendix 3.
Let us begin by treating X _{1}(t,u)
The right-hand side term in the equation above can be treated using (40), thereby yielding
Using Lemma 3, we can prove that
Continuing, according to Lemma 7, we have
Combining (42) with (43) yields
Thus, in the asymptotic regime, we have
Controlling the other terms X _{ i }(t,u),i=2,3,4, will also include the term β(t,u). First note that X _{2}(t,u) is given by
where
Observe that Y _{2}(t,u) is very similar to X _{1}(t,u). The only difference is that Y _{2}(t,u) is a quadratic form involving vectors h _{ k } and \(\widehat {\mathbf {h}}_{k}\), whereas X _{1}(t,u) involves only the vector h _{ k }. Following the same kind of calculations leads to
Since \(\frac {\frac {1}{K}\widehat {\mathbf {h}}_{k}^{\text {\tiny H}}\mathbf {Q}_{k}(u)\mathbf {h}_{k}}{1+\frac {u}{K}\widehat {\mathbf {h}}_{k}^{\text {\tiny H}}\mathbf {Q}_{k}(u)\widehat {\mathbf {h}}_{k}}\) satisfies
we now have
Similarly, X _{3}(t,u) satisfies
Finally, X _{4}(t,u) can be treated using the same approach, thereby providing the following convergence:
Summing (44), (45), (46), and (47) yields
Appendix 3: Proof of Lemma 7
The aim of this section is to determine a deterministic equivalent for the random quantity
The proof is technical and will make frequent use of results from Appendix 1. First, we need to control \(\text {var}\left ({\widehat {\alpha }}_{M}(t,u)\right)\). This has already been treated in [10], where it was proved that \(\text {var}\left (\widehat {\alpha }_{M}(t,u,\mathbf {A})\right)=\mathcal {O}(K^{-2})\) when t=u. The same calculations hold for t≠u; thus, we consider in the sequel that \(\text {var}\left (\widehat {\alpha }_{M}(t,u,\mathbf {A})\right)=\mathcal {O}(K^{-2}).\) Hence, we have
Equation (48) allows us to focus directly on controlling \(\mathbb {E}[\widehat {\alpha }_{M}(t,u,\mathbf {A})]\). Using the resolvent identity
we decompose \(\widehat {\alpha }_{M}(t,u,\mathbf {A})\) as
We will only directly deal with the terms Z _{1} and Z _{3}, since Z _{2} will be compensated by terms in Z _{3}. We begin with Z _{1}:
Using Lemma 3, we can show that the first term on the righthand side of the above equation is negligible. Therefore,
Using Lemma 5, we have
Theorem 1, thus, implies
We now look at Z _{3}, where
Using (37), we arrive at
From (36), Z _{3} can be decomposed as
We sequentially deal with the terms Z _{31} and Z _{32}. The same arguments as those used before allow us to substitute the denominator by 1+t δ(t), thereby yielding
By Lemma 3, the quadratic forms involved in χ _{2} have variance \(\mathcal {O}(K^{-2})\), and thus can be substituted by their expectations (see Lemma 6). We obtain
The term χ _{1} will be compensated by Z _{2}. To see that, observe that the first order of χ _{1} does not change if we substitute H _{ ℓ } by H and P _{ ℓ } by P. Besides, due to Lemma 5, we can substitute Q _{ ℓ }(t) by Q(t) and Q _{ ℓ }(u) by Q(u), hence proving that
Finally, it remains to deal with Z _{32}. Substituting \(\frac {1}{K}\mathbf {h}_{\ell }^{\text {\tiny H}}\mathbf {Q}_{\ell }(t)\mathbf {h}_{\ell }\) and \(\frac {1}{K}\mathbf {h}_{\ell }^{\text {\tiny H}}\mathbf {Q}_{\ell }(u)\mathbf {h}_{\ell }\) by their asymptotic equivalent δ(t) and δ(u), we get
Analogously to before, \(\mathbb {E}\left [Z_{32}\right ]\) can be simplified as
Combining (7), (50), and (51), we obtain
Replacing A with Φ, one finds a deterministic equivalent
Appendix 4: Proof of Corollary 1
The proof of Corollary 1 relies on Montel’s theorem [44]. We only prove the result for X _{ k,M }(t,u); the result for Z _{ k,M }(t,u) follows analogously. Note that X _{ k,M }(t,u) and \(\overline {X}_{k,M}(t,u)\) are analytic functions when their domains are extended to \(\mathbb {C}\backslash \mathbb {R}_{-}\times \mathbb {C}\backslash \mathbb {R}_{-}\), where \(\mathbb {R}_{-}\) is the set of negative real-valued numbers. Since \(X_{k,M}(t,u)-\overline {X}_{k,M}(t,u)\) is almost surely bounded for large M and K on every compact subset of \(\mathbb {C}\backslash \mathbb {R}_{-}\), Montel’s theorem asserts that there exists a converging subsequence, which converges to an analytic function. Since this limiting function is necessarily zero on the positive real axis, it must be zero everywhere. Thus, from every subsequence, one can extract a further subsequence that converges to zero. Hence,
Since X _{ k,M }(z _{1},z _{2}) is analytic, the derivatives of \(X_{k,M}(z_{1},z_{2}) - \overline {X}_{k,M}(z_{1},z_{2})\) converge to zero. In particular, if \(\tilde {t}\) and \(\tilde {u}\) are strictly positive scalars, we have
This result can be extended to the case of \(\tilde {t}=0\) and \(\tilde {u}=0\). To see this, let η>0 and decompose
where
Now, let ε>0. Since the derivatives of \(X_{k,M}^{(m,\ell)}\) and \(\overline {X}_{k,M}^{(m,\ell)}\) are almost surely bounded for large M and K, the quantities α _{1} and α _{3} can be made smaller than ε/3 when η is small enough. On the other hand, (55) implies that α _{2} converges to zero almost surely. Hence, there exists M _{0} such that, for M≥M _{0}, we have \(\alpha _{2}\leq \frac {\epsilon }{3}\). Therefore, for M large enough, \( \left | X_{k,M}^{(m,\ell)} - \overline {X}_{k,M}^{(m,\ell)} \right |\leq \epsilon, \) thereby proving
Appendix 5: Iterative algorithm for computing \(\beta _{M}^{(\ell,m)}\)
An iterative approach for computing \(\beta _{M}^{(\ell,m)}\) is given by Algorithm 1.
Appendix 6: Iterative algorithm for computing T ^{(q)}
For the sake of completeness, we provide hereafter Algorithm 2 that can be used to compute T ^{(q)}. It is an adapted version of the iterative algorithm given in [21].
Appendix 7: Sketch of the proof of Theorem 5
The goal of this section is to provide an outline of the proof for finding the deterministic equivalent of the quantity
A full proof proceeds in the following steps:
1. First, compute the deterministic equivalent for
$$ Y_{M}(t,u)=\frac{1}{K}\text{tr}\left(\textbf{Q}(t)\widehat{\mathbf{H}}\textbf{P}\widehat{\mathbf{H}}^{\text{\tiny H}} \textbf{Q}(u)\right), $$
where \(\textbf {Q}(t)=\left (\frac {t}{K}\textbf {H}\textbf {H}^{\text {\tiny H}}+\textbf {I}\right)^{-1}\). This can be achieved by using Lemma 7, where it is proved that
$$ Y_{M}(t,u)-\overline{\alpha}_{M}(t,u,\textbf{I})\xrightarrow[M,K\to+\infty]{\mathrm{a.s.}}0 $$
and thus,
$$ Y_{M}(t,u)-\text{tr}(\textbf{P}) c(t,u)\xrightarrow[M,K\to+\infty]{\mathrm{a.s.}}0. $$
2. Now, since
$$ \left[\widetilde{\mathbf{C}}\right]_{\ell,m}=\frac{(-1)^{\ell+m}Y_{M}^{(\ell,m)}}{\ell!m!}, $$
we can prove, using the same approach as in the proof of Theorem 1, that
$$ Y_{M}^{(\ell,m)}-\text{tr}(\textbf{P}) c^{(\ell,m)}\xrightarrow[M,K\to+\infty]{\mathrm{a.s.}}0. $$
3. Finally, one computes the derivatives of c(t,u) at t=0 and u=0, using the Leibniz rule, to arrive at the desired result.
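The link between the derivatives of Y _{ M }(t,u) at the origin and the entries of \(\widetilde {\mathbf {C}}\) can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it takes Q(t) exactly as written above, uses the hypothetical names G for \(\frac {1}{K}\textbf {H}\textbf {H}^{\text {\tiny H}}\) and R for \(\widehat {\mathbf {H}}\textbf {P}\widehat {\mathbf {H}}^{\text {\tiny H}}\), and relies on the Neumann series \(\textbf {Q}(t)=\sum _{\ell }(-t)^{\ell }\textbf {G}^{\ell }\), which is valid only when t times the largest eigenvalue of G is below one. Under these assumptions, \((-1)^{\ell +m}Y_{M}^{(\ell,m)}/(\ell !m!)\) evaluated at t=u=0 equals \(\frac {1}{K}\text {tr}(\textbf {G}^{\ell }\textbf {R}\textbf {G}^{m})\).

```python
import numpy as np


def y_direct(t, u, G, R, K):
    """Y(t,u) = (1/K) tr(Q(t) R Q(u)) with Q(t) = (t G + I)^{-1}, via explicit inverses."""
    identity = np.eye(G.shape[0])
    Q_t = np.linalg.inv(t * G + identity)
    Q_u = np.linalg.inv(u * G + identity)
    return np.trace(Q_t @ R @ Q_u).real / K


def c_tilde(G, R, K, L):
    """Matrix with entries (1/K) tr(G^l R G^m) for l, m = 0..L-1.

    These are the Taylor coefficients (-1)^(l+m) Y^(l,m)(0,0) / (l! m!).
    """
    powers = [np.linalg.matrix_power(G, l) for l in range(L)]
    C = np.empty((L, L))
    for l in range(L):
        for m in range(L):
            C[l, m] = np.trace(powers[l] @ R @ powers[m]).real / K
    return C


def y_series(t, u, C):
    """Truncated double power series sum_{l,m} (-t)^l (-u)^m C[l, m]."""
    L = C.shape[0]
    t_pow = (-t) ** np.arange(L)
    u_pow = (-u) ** np.arange(L)
    return t_pow @ C @ u_pow
```

For small t and u, `y_direct` and `y_series` should agree up to the truncation error of the series, which is one way to sanity-check an implementation of the coefficient matrix before taking the large-system limit.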
Appendix 8: Proof of Theorem 6
By using \(\frac {\text {tr} \left (\textbf {P}\right)\textbf {w}^{\text {\tiny H}}\widetilde {\mathbf {C}}\textbf {w}}{P}=1\) and dividing the objective function by the constant \(\frac {K p_{k}}{\text {tr}(\textbf {P})}\), the problem (31) can be rewritten as
Making the change of variable \(\textbf {a}=\left (\widetilde {\mathbf {B}}+\frac {\sigma ^{2}}{P}{\widetilde {\mathbf {C}}} \right)^{\frac {1}{2}} \textbf {w}\), we transform (P _{1}) into
We notice that the objective function of (P _{2}) is independent of the norm of a. We can, therefore, select a to maximize the objective function and then adapt the norm to fit the constraint. If we discard the constraint, what remains is a classic Rayleigh quotient [45], which is maximized by the eigenvector a corresponding to the maximum eigenvalue of
By transforming a back to the original variable w, we obtain (33), where the scaling in (34) corresponds to a scaling of a in order to satisfy the constraint.
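The change of variable and the Rayleigh-quotient step can be sketched numerically. The Python code below is a minimal illustration, not the authors' implementation: it assumes that the objective has the homogeneous form \(\textbf {w}^{\text {\tiny H}}\textbf {A}\textbf {w}/\textbf {w}^{\text {\tiny H}}(\widetilde {\mathbf {B}}+\frac {\sigma ^{2}}{P}\widetilde {\mathbf {C}})\textbf {w}\) with a generic Hermitian numerator matrix A (a hypothetical placeholder), that the denominator matrix is positive definite, and that the constraint is \(\text {tr}(\textbf {P})\textbf {w}^{\text {\tiny H}}\widetilde {\mathbf {C}}\textbf {w}=P\). The function and argument names are illustrative only.

```python
import numpy as np


def herm_inv_sqrt(D):
    """Inverse square root of a Hermitian positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(D)
    # Divide column j of vecs by sqrt(vals[j]), then rotate back.
    return (vecs / np.sqrt(vals)) @ vecs.conj().T


def max_rayleigh_weights(A, B, C, sigma2, P, tr_P):
    """Maximize w^H A w / w^H (B + sigma2/P * C) w subject to tr_P * w^H C w = P.

    Returns the scaled weight vector and the maximum value of the quotient.
    """
    D = B + (sigma2 / P) * C
    D_inv_sqrt = herm_inv_sqrt(D)
    # With a = D^{1/2} w, the objective becomes the Rayleigh quotient of D^{-1/2} A D^{-1/2}.
    reduced = D_inv_sqrt @ A @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(reduced)
    a = vecs[:, -1]                      # eigenvector of the largest eigenvalue
    w = D_inv_sqrt @ a                   # transform back to the original variable
    # The quotient is scale-invariant, so rescale w to meet the constraint.
    scale = np.sqrt(P / (tr_P * (w.conj() @ C @ w).real))
    return scale * w, vals[-1]
```

Because the quotient does not depend on the norm of w, the final rescaling changes only the constraint value and leaves the achieved objective equal to the largest eigenvalue of the reduced matrix.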
References
 1
Cisco, Cisco visual networking index: global mobile data traffic forecast update, 2012–2017 (Cisco Public Information, San Jose, CA, 2013).
 2
J Hoydis, M Kobayashi, M Debbah, Green small-cell networks. IEEE Veh. Technol. Mag. 6(1), 37–43 (2011).
 3
TL Marzetta, Noncooperative cellular wireless with unlimited numbers of base station antennas. IEEE Trans. Wireless Commun. 9(11), 3590–3600 (2010).
 4
F Rusek, D Persson, BK Lau, EG Larsson, TL Marzetta, O Edfors, F Tufvesson, Scaling up MIMO: opportunities and challenges with very large arrays. IEEE Signal Process. Mag. 30(1), 40–60 (2013).
 5
J Hoydis, S ten Brink, M Debbah, Massive MIMO in the UL/DL of cellular networks: how many antennas do we need? IEEE J. Sel. Areas Commun. 31(2), 160–171 (2013).
 6
K Hosseini, J Hoydis, S ten Brink, M Debbah, in Communications (ICC), 2013 IEEE International Conference on 9–13 June 2013. Massive MIMO and small cells: How to densify heterogeneous networks (IEEE, Budapest, 2013), pp. 5442–5447. doi: 10.1109/ICC.2013.6655455.
 7
E Björnson, M Kountouris, M Debbah, in Telecommunications (ICT), 2013 20th International Conference on 6–8 May 2013. Massive MIMO and small cells: Improving energy efficiency by optimal soft-cell coordination (IEEE, Casablanca, 2013), pp. 1–5. doi: 10.1109/ICTEL.2013.6632074.
 8
X Gao, O Edfors, F Rusek, F Tufvesson, in Vehicular Technology Conference (VTC Fall), 2011 IEEE. Linear Pre-Coding Performance in Measured Very-Large MIMO Channels (IEEE, San Francisco, CA, 2011), pp. 1–5. doi: 10.1109/VETECF.2011.6093291.
 9
J Hoydis, C Hoek, T Wild, S ten Brink, in Wireless Communication Systems (ISWCS), 2012 International Symposium on 28–31 Aug. 2012. Channel measurements for large antenna arrays, (2012), pp. 811–815. Print ISBN: 9781467307611.
 10
W Hachem, O Khorunzhy, P Loubaton, J Najim, LA Pastur, A new approach for capacity analysis of large dimensional multi-antenna channels. IEEE Trans. Inf. Theory. 54(9), 3987–4004 (2008).
 11
VK Nguyen, JS Evans, in Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE. Multiuser Transmit Beamforming via Regularized Channel Inversion: A Large System Analysis (IEEE, New Orleans, LA, 2008). doi: 10.1109/GLOCOM.2008.ECP.176.
 12
S Wagner, R Couillet, M Debbah, DTM Slock, Large system analysis of linear precoding in MISO broadcast channels with limited feedback. IEEE Trans. Inf. Theory. 58(7), 4509–4537 (2012).
 13
R Muharar, J Evans, in Communications (ICC), 2011 IEEE International Conference on 5–9 June 2011. Downlink Beamforming with Transmit-Side Channel Correlation: A Large System Analysis (IEEE, Kyoto, 2011), pp. 1–5. doi: 10.1109/icc.2011.5962672.
 14
R Couillet, M Debbah, Random Matrix Methods for Wireless Communications, 1st edn. (Cambridge University Press, New York, NY, USA, 2011).
 15
E Björnson, M Bengtsson, B Ottersten, Pareto characterization of the multicell MIMO performance region with simple receivers. IEEE Trans. Signal Process. 60(8), 4464–4469 (2012).
 16
CB Peel, BM Hochwald, AL Swindlehurst, A vector-perturbation technique for near-capacity multiantenna multiuser communication, part I: channel inversion and regularization. IEEE Trans. Commun. 53(1), 195–202 (2005).
 17
TKY Lo, Maximum ratio transmission. IEEE Trans. Commun. 47(10), 1458–1461 (1999).
 18
S Moshavi, EG Kanterakis, DL Schilling, Multistage linear receivers for DS-CDMA systems. Int. J. Wireless Inf. Netw. 3(1), 1–17 (1996).
 19
ML Honig, W Xiao, Performance of reduced-rank linear interference suppression. IEEE Trans. Inf. Theory. 47(5), 1928–1946 (2001).
 20
G Sessler, F Jondral, Low complexity polynomial expansion multiuser detector for CDMA systems. IEEE Trans. Veh. Technol. 54(4), 1379–1391 (2005).
 21
J Hoydis, M Debbah, M Kobayashi, in Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on July 31 2011–Aug. 5 2011. Asymptotic moments for interference mitigation in correlated fading channels (IEEE, St. Petersburg, 2011), pp. 2796–2800. doi: 10.1109/ISIT.2011.6034083.
 22
N Shariati, E Björnson, M Bengtsson, M Debbah, in Personal Indoor and Mobile Radio Communications (PIMRC), 2013 IEEE 24th International Symposium on 8–11 Sept. 2013. Low-complexity channel estimation in large-scale MIMO using polynomial expansion (IEEE, London, 2013), pp. 1157–1162. doi: 10.1109/PIMRC.2013.6666313.
 23
A Müller, A Kammoun, E Björnson, M Debbah, in Sensor Array and Multichannel Signal Processing Workshop (SAM), 2014 IEEE 8th. Efficient linear precoding for massive MIMO systems using truncated polynomial expansion (IEEE, A Coruna, 2014), pp. 273–276. doi: 10.1109/SAM.2014.6882394.
 24
S Zarei, W Gerstacker, R Schober, in Signals, Systems and Computers, 2013 Asilomar Conference on. Low-complexity linear precoding and power allocation scheme for downlink massive MIMO systems (IEEE, Pacific Grove, CA, 2013), pp. 285–290. doi: 10.1109/ACSSC.2013.6810278.
 25
A Adhikary, N Junyoung, JY Ahn, G Caire, Joint spatial division and multiplexing—the large-scale array regime. IEEE Trans. Inf. Theory. 59(10), 6441–6463 (2013).
 26
A Kammoun, A Müller, E Björnson, M Debbah, Linear precoding based on polynomial expansion: large-scale multi-cell MIMO systems. IEEE J. Sel. Topics Signal Process. 8(5), 861–875 (2014).
 27
J Choi, DJ Love, P Bidigare, Downlink Training Techniques for FDD Massive MIMO Systems: Open-Loop and Closed-Loop Training With Memory. IEEE J. Sel. Topics Signal Process. 8(5), 802–814 (2014). doi: 10.1109/JSTSP.2014.2313020.
 28
C Wang, RD Murch, Adaptive downlink multiuser MIMO wireless systems for correlated channels with imperfect CSI. IEEE Trans. Wireless Commun. 5(9), 2435–2436 (2006).
 29
B Nosrat-Makouei, JG Andrews, RW Heath, MIMO interference alignment over correlated channels with imperfect CSI. IEEE Trans. Signal Process. 59(6), 2783–2794 (2011).
 30
E Björnson, E Jorswieck, Optimal resource allocation in coordinated multi-cell systems. Foundations Trends Commun. Inf. Theory. 9(2–3), 113–381 (2013).
 31
M Joham, W Utschick, JA Nossek, Linear transmit processing in MIMO communications systems. IEEE Trans. Signal Process. 53(8), 2700–2712 (2005).
 32
M Sadek, A Tarighat, AH Sayed, A leakage-based precoding scheme for downlink multi-user MIMO channels. IEEE Trans. Wireless Commun. 6(5), 1711–1721 (2007).
 33
R Stridh, M Bengtsson, B Ottersten, System evaluation of optimal downlink beamforming with congestion control in wireless communication. IEEE Trans. Wireless Commun. 5(4), 743–751 (2006).
 34
E Björnson, R Zakhour, D Gesbert, B Ottersten, Cooperative multicell precoding: rate region characterization and distributed strategies with instantaneous and statistical CSI. IEEE Trans. Signal Process. 58(8), 4298–4310 (2010).
 35
C Shepard, H Yu, N Anand, E Li, T Marzetta, R Yang, L Zhong, in Proceedings of the 18th Annual International Conference on Mobile Computing and Networking, Mobicom ’12. Argos: Practical Many-antenna Base Stations (ACM, New York, NY, USA, 2012), pp. 53–64. doi: 10.1145/2348543.2348553.
 36
S Boyd, L Vandenberghe, Numerical linear algebra background. http://www.seas.ucla.edu/~vandenbe/ee236b/lectures/numlinalg.pdf Accessed 12 Feb 2016.
 37
C Dick, F Harris, M Pajic, D Vuletic, Implementing a real-time beamformer on an FPGA platform. Xcell Journal 86 (Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124-3400, 2007).
 38
E Björnson, EG Larsson, M Debbah, in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on 3–5 Dec. 2014. Optimizing multi-cell massive MIMO for spectral efficiency: How many users should be scheduled? (IEEE, Atlanta, GA, 2014), pp. 612–616. doi: 10.1109/GlobalSIP.2014.7032190.
 39
SL Loyka, Channel capacity of MIMO architecture using the exponential correlation matrix. IEEE Commun. Lett. 5(9), 369–371 (2001).
 40
V Strassen, Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969).
 41
VV Williams, in Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC ’12. Multiplying Matrices Faster Than Coppersmith-Winograd (ACM, New York, NY, USA, 2012), pp. 887–898. doi: 10.1145/2213977.2214056.
 42
E Dahlman, S Parkvall, J Skold, 4G: LTE/LTE-Advanced for Mobile Broadband (Academic Press, an imprint of Elsevier, Oxford, UK, and Burlington, MA, USA, 2013).
 43
GH Golub, CF Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, MD, USA, 1996).
 44
W Rudin, Real and Complex Analysis, 3rd edn. (McGraw-Hill Inc, New York, 1986).
 45
S Boyd, L Vandenberghe, Convex Optimization (Cambridge University Press, New York, 2004).
Acknowledgements
This research has been supported by the ERC Starting Grant 305123 MORE (Advanced Mathematical Tools for Complex Network Engineering). Parts of the results were previously presented at the 8th IEEE Sensor Array and Multichannel Signal Processing Workshop, 2014. E. Björnson is funded by the International Postdoc Grant 2012228 from the Swedish Research Council.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Mueller, A., Kammoun, A., Björnson, E. et al. Linear precoding based on polynomial expansion: reducing complexity in massive MIMO. J Wireless Com Network 2016, 63 (2016). https://doi.org/10.1186/s136380160546z
Keywords
 Massive MIMO
 Linear precoding
 Multiuser systems
 Polynomial expansion
 Random matrix theory