Structured Channel Covariance Estimation from Limited Samples for Large Antenna Arrays

In massive multiple-input multiple-output (MIMO) systems, the knowledge of the users' channel covariance matrix is crucial for minimum mean square error (MMSE) channel estimation in the uplink as well as it plays an important role in several multiuser beamforming schemes in the downlink. Due to the large number of base station antennas in massive MIMO, accurate covariance estimation is challenging especially in the case where the number of samples is limited and thus comparable to the channel vector dimension. As a result, the standard sample covariance estimator may yield a too large estimation error which in turn may yield significant system performance degradation with respect to the ideal channel covariance knowledge case. To address such problem, we propose a method based on a parametric representation of the channel angular scattering function. The proposed parametric representation includes a discrete specular component which is addressed using the well-known MUltiple SIgnal Classification (MUSIC) method, and a diffuse scattering component, which is modeled as the superposition of suitable dictionary functions. To obtain the representation parameters we propose two methods, where the first solves a non-negative least-squares problem and the second maximizes the likelihood function using expectation-maximization. Our simulation results show that the proposed methods outperform the state of the art with respect to various estimation quality metrics and different sample sizes.


Introduction
Massive multiple-input multiple-output (MIMO) communication system, where the number of base station (BS) antennas M is much larger than the number of single antenna users, has been shown to achieve high spectral efficiency in wireless cellular networks and to enjoy various system level benefit, such as energy efficiency, inter-cell interference reduction, and dramatic simplification of user scheduling (e.g., see [2,3]). In a large number of papers on the subject, the knowledge of the uplink (UL) and downlink (DL) channel covariance matrix, i.e., of the correlation structure of the channel antenna coefficients at the BS array, is assumed and used for a variety of purposes, such as minimum mean square error (MMSE) UL channel estimation and pilot decontamination [4][5][6], efficient DL multiuser precoding/beamforming design, especially in the frequency division duplexing (FDD) case [6][7][8][9][10], and multiuser DL precoding design based on statistical channel state information (CSI) [11][12][13].
Under the usual assumption of wide-sense stationary (WSS) uncorrelated scattering (US) [6,7,9,14,15], 1 the channel vector evolves over time as a vector-valued WSS process and its spatial correlation is frequency-invariant over a frequency interval significantly larger than the signal bandwidth, although much smaller than the carrier frequency. In particular, in an orthogonal frequency division multiplexing (OFDM) system, the channel spatial covariance is independent of time (OFDM symbol index) and frequency (subcarrier index). In order to capture the actual WSS statistics, samples sufficiently spaced over time and frequency must be collected (e.g., one sample for every resource block in the UL slots over which a given user is active). On the other hand, the WSS model holds only locally, over time intervals where the propagation geometry (angle and distances of multipath components) does not change significantly. Such time interval, referred to as "geometry coherence time, " is several orders of magnitude larger than the coherence time of the small-scale channel coefficients. For a typical mobile urban environment, the channel geometry coherence time is of the order of seconds, while the small-scale fading coherence time is of the order of milliseconds (see [19] and references therein). 2 Hence, the BS can collect a window of tens-to-hundreds of noisy channel snapshots from UL pilot symbols sent by any given user and use them to produce a "local" estimate of the corresponding user covariance matrix which remains valid during a channel geometry coherence time. Such estimation must be repeated, or updated, at a rate that depends on the propagation scenario and user-BS relative motion. This discussion points out that the number of samples N available for covariance estimation is limited and often comparable or even less than the number of BS antennas M. In general, the accurate estimation of a high-dimensional M × M spatial channel covariance matrix (both in the UL and in the DL) from a limited number of noisy channel realizations (samples) is generally a difficult task.
The simplest way to estimate the channel covariance matrices is the sample covariance estimator. Such estimation is asymptotically unbiased and consistent and works well when N ≫ M . Unfortunately, as already noticed above, this is typically not the case in massive MIMO. Hence, the goal of this paper is to devise new parametric estimators that outperform the sample covariance estimator, as well as the other state-of-the-art methods proposed in the literature. In addition, an attractive feature of the proposed parametric estimation is that it lends itself to the extrapolation of the estimated channel covariance matrix from the UL to the DL frequency band. As the channel samples are collected by the BS through pilot symbols sent by the users on the UL, the UL channel covariance can be directly estimated via our method. However, as pointed out above, several schemes for DL multiuser precoding/beamforming in FDD systems make use of the user channel covariance matrix in the DL, which differs from the UL covariance since the frequency separation between the UL and the DL bands is large. The estimation of the DL covariance from UL channel samples has been considered in several works and it is generally another challenging task [7,[21][22][23][24][25]. We shall see that our scheme is able to accurately estimate the DL channel covariance by extrapolating (over frequency) the estimated parametric model in the UL.

Related work
Covariance estimation from limited samples is a very well-known classical problem in statistics. In the "large system regime, " when both the vector dimension M and the number of samples N grow large with fixed sampling ratio N/M of samples per dimension, a vast literature focused on the asymptotic eigenvalue distribution of sample covariance matrices estimator under specific statistical assumptions (see, e.g., [26][27][28] and the references therein). Covariance estimation has reemerged recently in many problems in machine learning, compressed sensing, biology, etc. (see, e.g., [29][30][31][32] from some recent results). What makes these recent works different from the classical ones is the highly structured nature of the covariance matrices in these applications. A key challenge in these new applications is to design efficient estimation algorithms able to take advantage of the underlying structure to recover the covariance matrix with a small sample size.
As explained in the sequel, MIMO covariance matrices are highly structured due to the particular spatial configuration of the BS antennas and wireless propagation scenario. So, it is very important to design efficient algorithms that are able to take advantage of this specific structure. Previous works have considered either diffuse scattering (the received power from any given angle direction is infinitesimal [21]), or separable discrete components [33][34][35]. Interestingly, both classes of state-of-the-art methods incur significant degradation when the actual channel model does not conform to the assumptions. No current work has addressed the simultaneous presence of both discrete components and diffuse scattering. In this paper, we propose a method to address such mixed scattering model. We shall compare our proposed method with alternative approaches which can be regarded as the state of the art for the specific case of wireless massive MIMO channels. The competitor methods shall be briefly discussed when presenting such comparison results.

Contributions
The main contributions of this work are listed below: (1) In contrast to the work in [21] that only considers diffuse scattering, and the works in [33][34][35] that only consider separable discrete components, we propose a method that handles the more realistic mixed scattering model where discrete and diffuse scattering is simultaneously present in the angle domain. It should be noticed that our model is not just "one or many possible alternatives. " In fact, our model encompasses any possible received power distribution over the angle domain, referred to in this work as the angular scattering function (ASF).
(2) The new idea of the proposed general model consists of approximating the ASF in terms of atoms of a dictionary. This dictionary is formed by Dirac delta functions placed at the angle-of-arrival (AoA) of the discrete specular components and by a family of functions obtained by shifts of a template density function over the angle domain, in order to approximate the diffuse component of the ASF. Notice that the proposed dictionary is partially "on the grid" (the shifts of the template density function at integer multiples of some chosen discretization interval) and partially "off the grid" (the Diracs placed at the unquantized AoAs). In this sense, the method differs substantially from standard compressed sensing schemes and does not assume sparsity of the ASF in the angle domain. In fact, the diffuse scattering component is not sparse at all, but is dense on given angular spread intervals. (3) In order to estimate the AoA of the discrete components and their number (model order), we propose to use the well-known MUltiple SIgnal Classification (MUSIC) method [36]. In fact, this approach is particularly suited to the problem at hand thanks to the provable asymptotic consistency for the corresponding spiked model. Notice that this property has not been proven for other super-resolution methods based on convexity and atomic norm, which have also the disadvantage of being significantly more computationally complex (see, e.g., [37, Table III]) and numerically unstable. (4) After obtaining the estimated AoAs of the spikes via MUSIC, we propose two estimators, namely a constrained least-squares estimator and maximum-likelihood (ML) estimator, to estimate the parametric model coefficients. In particular, the maximization of the likelihood function in the ML estimator is obtained via the expectation-maximization (EM) initialized by the constrained least-squares solution.
We demonstrate the advantage of the proposed approach through extensive numerical results based on the channel emulator QUAsi Deterministic RadIo channel GenerAtor (QuaDriGa) [18], used routinely in 3GPP standardization to provide a common reference for comparison of different physical layer algorithms. We compare our method with other state-of-the-art schemes, showing that the proposed algorithms outperform the competitors over all benchmarks in a wide range of sampling ratio N/M. It should be noticed that the QuaDriGa channel generator is agnostic and unaware of the proposed ASF representation and generates channel snapshots on the basis of a mixed physical/ statistical model. Therefore, our results are not "biased" by generating channels that are "matched" to the assumed model. On the contrary, our results show that the proposed approach is robust to the underlying channel model and provides improvements on the state of the art even if the underlying channel statistics do not reflect exactly the assumed one.

Notations
An identity matrix with K columns is denoted as I K . An all-zero matrix with size m × n is denoted as 0 m×n . diag(·) returns a diagonal matrix. � · � F returns Frobenius norm of a matrix. δ(·) denotes the Dirac delta function. We use [n] to denote the ordered integer set {1, 2, . . . , n}.

Methods
This study is theoretical. The algorithms are developed through mathematical equations and implemented using standard scientific computation software (MATLAB). The numerical results of Sect. 5 are obtained by standard computer simulation using MATLAB, with channel vector snapshots extracted from the publicly available channel simulator (QuaDriGa) [18], which is used in a large number of comparative studies in the framework of 3GPP standardization. All details of the numerical results, including comparison with competing state-of-the-art methods, are fully described in Sect. 5. No proprietary data have been used in this study.

System model
We consider a typical single-cell massive MIMO communication system, where a BS equipped with a uniform linear array (ULA) of M antennas communicates with multiple users through a multipath channel. 3 Following the common assumption that the pilot sequences of different users are orthogonal to each other in the time-frequency domain, and since the observations used for channel covariance estimation use only the UL pilots, without loss of generality we focus on a generic user. Figure 1 visualizes the propagation model based on multipath clusters, which is physically motivated and widely adopted in standard channel simulation tools such as QuaDriGa [18]. During UL transmission, on each time-frequency resource block (RB) s, the BS receives an UL user pilot carrying a measurement for the channel vector h[s] . We assume that the window of N samples collected for covariance estimation is designed such that the samples are enough spaced in the time-frequency domain and resulting in statistically independent channel snapshots {h[s] : s = [N ]} . Meanwhile, the whole window spans a time significantly shorter than the geometry coherence time so that the WSS assumption holds (see discussion in Sect. 1) and the channel snapshots are identically distributed. The channel vectors are given by [38] (1) where ρ(ξ ; s) is the channel complex coefficient at the normalized AoA ξ = sin(θ) sin(θ max ) ∈ [−1, 1) , where θ max ∈ [0, π 2 ] is the maximum array angular aperture 4 ; a(ξ ) ∈ C M denotes the array response vector as a function of ξ , with m-th element given , where d denotes the antenna spacing and 0 denotes the carrier wavelength. For convenience, we assume the antenna spacing to be d = 0 2 sin(θ max ) . Thus, the array response vector is given as The channel coefficient ρ(ξ ; s) represents the small-scale multipath fading component at a given AoA, and it is modeled as a complex circularly symmetric Gaussian process with respect to ξ . Due to the WSS property, the channel second-order statistics are invariant with respect to the index s ∈ [N ] . In particular, ρ(ξ ; s) has mean zero and variance E[ρ(ξ ; s)ρ * (ξ ; s)] = γ (ξ ) . The function γ : [−1, 1] → R + is a real nonnegative measure that describes how the channel energy is distributed across the angle domain, and it is referred to as the channel ASF. From (1) and the ASF definition, it follows that the channel spatial covariance matrix, describing the correlation of the channel coefficients at the different antenna elements, is given by Notice that h is Toeplitz. This fact is verified when all the scattering clusters (see Fig. 1) are in the far field of the BS array. 5

Sample covariance matrix
We start by reviewing the sample covariance estimator. For known noise power N 0 at the BS, the sample covariance matrix is given by 6 (2) a(ξ ) = 1, e jπξ , . . . , e jπ(M−1)ξ T . ( 4 Note that we use the standard spherical coordinate system (see, e.g., [16,Section 7.1]), where θ is the azimuth angle and limited by θ ∈ [−θ max , θ max ]. 5 As a note of caution, we hasten to say that the model must be reconsidered in the case of "extra-large aperture" arrays (e.g., see [39]), where the Toeplitz form does not hold any longer. This is a consistent estimator, in the sense that it converges to the true covariance matrix as N → ∞ [40, Section 1.2.2]. The mean square (Frobenius norm) error incurred by the sample covariance estimator is given as [40] By applying the Cauchy-Schwarz inequality to the singular values of h , it is seen that , which together with the estimation error expression yields the upper bound to the normalized mean squared error E As already discussed, a relevant and interesting regime for massive MIMO is when N and M are of the same order. From the above analysis, it is clear that the sample covariance estimator yields a small error if rank( h ) ≪ N . For example, if the scattering contains only a finite number of discrete components (e.g., as assumed in [33][34][35] dν is piecewise continuous with strictly monotonically increasing segments, then rank( h ) increases linearly with M (see [10]) and the error incurred by the sample covariance estimator may be large. On the other hand, the presence of discrete scattering components implies that γ (ξ ) contains Dirac delta functions (spikes) and therefore it is not squared-integrable. This poses significant problems for estimation methods that assume γ (ξ ) to be an element in a Hilbert space of functions (e.g., the method proposed in [21]). The challenge tackled in this work is to devise an estimator which is able to handle both the small sample regime N /M ≤ 1 and the presence of discrete and diffuse scattering.

Structure of the channel covariance matrix
As said, the ASF describes how the received signal power is distributed over the AoA domain. The signal from the UE to the BS array propagates through a given scattering environment. The line-of-sight (LoS) path (if present), specular reflections, and wedge diffraction occupy extremely narrow angular intervals. This is usually modeled in a large number of papers as the superposition of discrete separable angular components coming at normalized AoAs {φ i } . In particular, it is assumed that the general form (1) reduces to the discrete sum of r paths h . However, it is well known from channel sounding observations (e.g., see [41]) and widely treated theoretically (e.g., see [38]) that diffuse scattering is typically also present and may carry a very significant part of the received signal power especially at frequencies below 6GHz. In this case, scattering clusters span continuous intervals over the AoA domain. In order to encompass full generality, we model the ASF γ (ξ ) as a mixedtype distribution [42, Section 5.3] including discrete and diffuse scattering components: where γ d (ξ ) models the power received from r ≪ M discrete paths and γ c (ξ ) models the power coming from diffuse scattering clusters. Since the ASF can be seen as a (generalized) density function, we borrow the language of discrete and continuous random variables and refer to γ d (ξ ) and to γ c (ξ ) as the discrete and the continuous parts of the ASF, respectively. 7 This corresponds to a so-called spiked model in the language of asymptotic random matrix theory (e.g., see [28]), where the spikes are the discrete scattering components.
Plugging (6) into (3), we obtain a corresponding decomposition of the channel covariance matrix as Notice that, in a typical massive MIMO scenario, rank( d h ) = r is much smaller than M.

Dictionary-based parametric representation and covariance estimation
An outline of the steps taken by the proposed method is given in the following: i) Spike Location Estimation for γ d (ξ ) : We apply the MUSIC algorithm [36] to estimate the AoAs of the spike components, i.e., the angles {φ i } r i=1 in (6), from the N noisy samples {y[s] : s ∈ [N ]} . We let { φ i } r i=1 denote the estimated AoAs, where also the number of spikes r is estimated (not assumed known). This is detailed in Sect. 4.1. For complete estimation of the discrete part d h , the weights {c i } r i=1 need to be further recovered. This is done jointly with the coefficients of the continuous part, as elaborated in the next step.
ii) Dictionary-based Method for Joint Estimation of γ d (ξ ) and γ c (ξ ) : We assume that the ASF continuous part can be written as } is a suitable dictionary of nonnegative density functions (not containing spikes). Figure 2 shows an example of Dirac delta and Gaussian dictionaries. Now, the goal is to estimate the model parameters . The parameters estimation can be formulated as a constrained least-squares problem, as detailed in Sect. 4.2.In particular, if the functions in G c have disjoint supports (i.e., the nonzero parts of different functions are not overlapping, e.g., the Dirac delta functions in Fig. 2a), we obtain a nonnegative least-squares (NNLS) problem, while if the functions in G c have overlapping Fig. 2 Examples of Dirac delta (disjoint supports) and Gaussian dictionaries (overlapping supports) 7 As for probability density functions, γ c (ξ ) needs not be continuous, but its cumulative distribution function supports (e.g., the Gaussian functions in Fig. 2b), we obtain a quadratic programming (QP) problem. As an alternative, we can use ML estimation. The log-likelihood function is a difference of concave functions in the model parameters and can be maximized using majorization-minimization (MM) methods. However, general MM approaches are prohibitively computationally complex for typical values of M arising in massive MIMO. It turns out that when G c is formed by Dirac deltas on a discrete grid, the likelihood function maximization can be obtained through EM with much lower complexity. Since in general EM is guaranteed to converge to local optima, the initialization plays an important role. We propose to use the result of the (low-complexity) NNLS estimator as initial point for the EM iteration. The resulting method is detailed in Sect. 4.3. iii) From ASF to Covariance Estimation: Finally, having estimated γ d (ξ ) and γ c (ξ ) , we estimate the covariance h via (7). In particular, since γ (ξ ) depends only on the scattering geometry and it is invariant with frequency, 8 the mapping γ (ξ ) → h defined by (7) can be applied for different carrier frequencies by changing the wavelength parameter 0 in the expression of the array response vector a(ξ ) . Specifically, replacing a(ξ ) in (7) with a(νξ ) where ν is a wavelength expansion/contraction coefficient, we obtain an estimator at the carrier frequency f ν = νf 0 , where f 0 = c 0 / 0 is the UL carrier frequency and c 0 denotes the speed of light. The resulting covariance estimator is given by } are the estimated model parameters, and where we define S . In particular, for the DL carrier we have ν > 1 since in typical cellular systems the DL carrier frequency is higher than the UL carrier frequency.

Discrete ASF support estimation
In this part, we first estimate the model order r by applying the well-known minimum description length (MDL) principle [44]. Then, we use the MUSIC method (e.g., see [45,46] and references therein), to estimate the locations {φ i } r i=1 of the spikes in discrete part of ASF γ d (ξ ) . Given the noisy samples Y = {y[s]} N s=1 , the sample covariance of Y is given as

Number of spikes estimation using MDL
First, we wish to estimate the model order r. We adopt the classical MDL method provided in [47], which was designed for estimating the number of sources impinging on a passive array of sensors under white Gaussian noise with unknown noise power. Specifically, assume that the covariance matrix of the observation has the form where d h is the rank-r covariance matrix in (7) and σ 2 is an unknown parameter. 9 We consider the following family of covariance matrices where (k) denotes a semi-positive matrix of rank k and k ∈ {0, 1, . . . , M − 1} is the model order. The MDL approach selects a model order k ∈ {0, 1, . . . , M − 1} from a parameterized family of probability densities f (Y| (k) ) that minimize the so-called total description length of the samples, which is tightly approximated by Rissanen bound [49] (after neglecting o(1) terms) as (11) (k) (k) = arg min where the first term is the result of the maximum likelihood estimator obtained in (15) and the second term is a penalty function including | (k) | = k(2M − k) being the number of free parameters in (k) [47,48]. The number of spikes is estimated by minimizing the MDL metric Remark 1 It is noticed that the proposed MDL approach does not explicitly consider the diffuse part. It considers only a spike signal subspace with dimension r corresponding to the r largest eigenvalues of y and a white noise with power σ 2 resulting in M − r smallest eigenvalues of y equaling to σ 2 . The diffuse component in our model can be considered as an additional spatially colored noise as far as the spike estimation is concerned. It is shown in [50] that in the presence of colored Gaussian noise, MDL tends to overestimate the model order with increasing number of samples N. However, we shall see later that overcounting the spikes is not catastrophic in the overall covariance estimation scheme, since the coefficients of fictitious spikes are typically estimated as near zero in the model coefficient estimation step. In other words, it is always better to overestimate the number of spikes than to underestimate them so that the true spikes are not missed. In [51], it is shown that undermodeling may happen when the signal and the noise eigenvalues are not well separated and the noise eigenvalues are clustered sufficiently closely. We show that the undermodeling is not likely to happen in our case by showing that the gap between r large (containing the contribution of the spikes) and M − r small eigenvalues (containing only the contribution of the diffuse scattering and the noise) of the sample covariance y becomes larger and larger as the number of antennas M increases. This can be explained by noticing that as the array angular resolution increases, the amount of received signal energy in each angular bin decreases with the bin width in the bins that contain no spikes, while it remains roughly constant with M if the angular bin contains a spike. More precisely, an angular bin of width 2/M and centered at ξ contains a received signal power proportional to 2(γ c (ξ ) + N 0 )/M if no spike falls in the interval, and to c i + 2(γ c (ξ ) + N 0 )/M if the spike located at φ i falls in the interval. By Szegö's theorem (e.g., see [10] and references therein), the eigenvalues of the covariance matrix y = h + N 0 I M converge asymptotically as M → ∞ to the energy received on equally spaced angular bins of width 2/M over the interval [−1, 1] in the ξ domain.  where rect A is 1 over the interval A and zero elsewhere. This ASF contains r = 2 spikes and two rectangular-shaped diffuse components. The SNR is set to 20 dB. For a large enough number of antennas (and even for a moderate number such as M = 25 ), the eigenvalue distribution shows a significant jump, such that the two largest eigenvalues "escape" from the rest. Note that, by increasing the number of antennas, this separation becomes more and more significant. Hence, the undermodeling of MDL in the relevant case of massive MIMO is unlikely to occur.

Location of spikes estimation using MUSIC
Once the number of spikes is estimated using MDL, MUSIC proceeds to identify the locations of those spikes.

Coefficients estimation by constrained least-squares
After obtaining an estimate of the number of spikes and their locations as described before, we need to find the spike coefficients c = [c 1 , . . . , c r ] T ∈ R r + and the coefficients In order to estimate the coefficients vector u from the noisy samples {y[s] : s ∈ [N ]} , we propose three algorithms, namely NNLS estimator, QP estimator, and ML-EM estimator.

NNLS estimator
If the dictionary functions have disjoint support, since the overall γ c (ξ ) must be nonnegative, it follows that the coefficients {u i : i ∈ [G]} must take values in R + , i.e., the whole vector u is nonnegative. We know that if the number of samples N is large enough, the sample covariance matrix y would converge to y = h + N 0 I M . Then, our goal is to find a good fitting to the sample covariance matrix y from the set of all covariance matrices of the form (22)  For this purpose, we use the Frobenius norm as a fitting metric and obtain an estimate of the model coefficients as Applying vectorization and defining A = [vec(S 1 ), . . . , vec(S G+ r )] and f = vec( h ) , we can write this as a NNLS problem which can be efficiently solved using a variety of convex optimization techniques (see, e.g., [52,53]). In our simulation, we use the built-in MATLAB function lsqnonneg.
Note that a ULA covariance is Hermitian Toeplitz. By leveraging this, we can reformulate the optimization problem so that the problem dimension is reduced. Concretely, let h denote the orthogonal projection of h onto the space of Hermitian Toeplitz matrices. This is obtained by averaging the diagonals of h and replacing the diagonal elements by the corresponding average value (see Appendix B for completeness). We define the first column of h as σ . Then, (25) can be reformulated as where A = [(S 1 ) ·,1 , . . . , (S G+ r ) ·,1 ] is the matrix collecting the first columns (S i ) ·,1 of the S i T is the weighting matrix to compensate for the number of times an element is repeated in a Hermitian Toeplitz matrix. (25) and (26) are equivalent.

Lemma 2 The optimization problems in
Proof Note that the difference between (25) and (26) is only the replacement of h from h . Thus, it is sufficient to show the equivalence of the following two optimization problems: where HT is the set of Hermitian Toeplitz matrices. Let = h − h . Then, P1 is rewritten as min X∈HT �X − h − � 2 F . Since h is the orthogonal projection of h on the set HT , by the orthogonality principle the difference = h − h is orthogonal to the whole set, i.e., � , T� := tr( H T) = 0, ∀T ∈ HT . It follows that for any X ∈ HT the difference X − h is also in HT . Thus, is constant with respect to X and therefore plays no role in minimization. This proves the equivalence of P1 and P2.

QP estimator
When the dictionary functions ψ i (ξ ) have overlapping support (e.g., with Gaussian or Laplacian densities), the model coefficients {b i : i = [G]} may take negative values as long as the resulting continuous ASF component is nonnegative, i.e., We approximate this infinite-dimensional constraint by defining a sufficiently fine grid of equally spaced points {ξ 1 , . . . , ξ G } on ξ ∈ [−1, 1] , where G is generally significantly larger than G and impose the nonnegativity of γ c (·) at these points. The resulting constraint is . Then, the estimation problem under dictionary functions with overlapping support is given by where u = [b T , c T ] T is consistently with the definition in (22). Notice that (29) is a QP problem and can be solved using standard QP solvers, such as quadprog in MATLAB.

Coefficients estimation by maximum likelihood
Instead of directly fitting the Frobenius norm between the sample covariance and the parametric covariance, the model parameters can be estimate by the ML method. Give the matrix of the observed noisy channel samples Y , the likelihood function of Y assuming h = h (u) in the form of (22) is given by Using (30), we can form the minus log-likelihood function f ML (u) := − 1 N log p(Y|u) . Then, the ML-based covariance estimator is obtained by minimizing f ML (u) with respect to the real and nonnegative coefficients vector u , which is formulated as the optimization problem: where y is the sample covariance matrix of the observations defined in (10). Note that the objective function f ML (u) = f cav (u) + f vex (u) in (31) is the sum of a concave and a convex function and thus (31) is not a convex problem. It is generally difficult to find the global optimum of a non-convex function such as f ML (u) . A standard approach in such cases is to adopt a MM algorithm [54,55], alternating through two steps with an updating surrogate function that has favorable optimization properties (e.g., convexity) and approximates the upper bound of the original objective function. Typical examples of MM algorithms are EM method [56], cyclic minimization [57], and the concave-convex procedure [58]. We choose the EM algorithm to iteratively find a good stationary point of f ML (u) as we will see that this algorithm yields a computationally efficient update rule and excellent empirical results for the task of estimating the parametric ASF coefficients. Note that although the likelihood function in (31) is in a general form for any family of dictionary functions, the EM method can be applied only in the case where all the matrices S i have rank 1, which is the case when the dictionary functions ψ i (ξ ) are Dirac delta functions. In contrast, the more general concave-convex procedure (e.g., see [1] for the application in this case) can deal with any type of dictionary, but yields significantly higher computational complexity, so that it is not suited for large M (massive MIMO case). In this work, we deal with general dictionary functions using constrained LS, while restrict the use of EM to the case of Dirac delta dictionary functions.
Application of the EM algorithm.  , z[N ]] . The EM algorithm treats (Y, X) as the incomplete data, where X is referred to as missing data. Using the fact that Y given X is Gaussian with mean DX and independent components with variance N 0 , we have that p(Y, X|u) = p(Y|X)p(X|u) , where p(Y|X) is a conditional Gaussian distribution that does not depend on u . By marginalizing with respect to X and taking the logarithm, the log-likelihood function takes on the form of a conditional expectation L(u) := log p(Y|u) = log E X|u [p(Y|X)] . The EM algorithm maximizes iteratively a lower bound on L(u) by alternating the expectation step (E-step) and the maximization (32)  step (M-step) [55]. Let u (ℓ) be the estimate of u in the ℓ-th iteration, by introducing a posterior density of X as p(X|Y, u (ℓ) ) we have where (a) follows Jensen's inequality and the concavity of log(·).
The E-step consists of computing L(u| u (ℓ) ) . Using the joint conditional Gaussianity of Y and X given u = u (ℓ) , L(u| u (ℓ) ) can be evaluated in closed form by computing the conditional mean and covariance of x[s] given y[s] and u (ℓ) , respectively, given by [59] where we define U (ℓ) = diag( u (ℓ) ) . The M-step consists of the maximization: Note that p(Y|X) and p(X|Y, u (ℓ) ) in (35) do not depend on u and thus can be neglected in the M-step. Hence, the function to be maximized in the M-step can be equivalently written as (details are omitted for brevity) It is observed from (38) that the maximization is decoupled with respect to each component u i of u . Then, the optimality in the ℓ-th iteration is also easily obtained in closed form. Setting each partial derivative ∂ ∂u i of (38) to zero, we find With a initial point u (0) , the ML-EM algorithm iteratively runs the E-step and M-step until the stop condition f ML ( u (ℓ) ) − f ML ( u (ℓ+1) ) ≤ ǫ EM is met, where ǫ EM is the predefined stop threshold. The initial point is usually set as an all ones vector. In our case, it can be set as the result of NNLS solution. An extensive comparison of these two initializations obtained by simulating several different channel scattering geometries is depicted in Fig. 5, which shows that both initializations converge within 100 iterations. It reveals that the NNLS initialization converges much faster and results in lower values of the objective function. Therefore, in our results we used the NNLS solution to initialize ML-EM algorithm. Furthermore, we provide computational complexity analysis of the proposed NNLS and ML-EM algorithms in Appendix C. Remark 3 The use of Dirac delta functions as dictionary functions to approximate the continuous part of the ASF γ c may sound as a contradiction, since by definition this component does not contain spikes. However, there is a fundamental difference between the Dirac components corresponding to spikes and the equally spaced "picket-fence" used to approximate γ c . This can be noticed by observing that the power associated with the i-th spike component is c i , which is a constant independent on the number of grid points G, while the power associated with the i-th dictionary function ψ

Results and discussion
In this section, the proposed constrained LS-and ML-EM-based estimators are numerically evaluated and compared with existing state-of-the-art methods. We use the realistic channel emulator QuaDriGa [18] to generate the channel samples. We adopt two communication scenarios in QuaDriGa: 3GPP 3D Urban Macro-Cell Line Of Sight (3GPP-3D-UMa-LOS) and 3GPP 3D Urban Macro-Cell Non-Line Of Sight (3GPP-3D-UMa-NLOS). The BS array adopts horizontal ULA with M = 128 antennas. The carrier frequency of the UL and DL is 1.9 GHz and 2.1 GHz, respectively. The SNR is set to 10 dB. The results are averaged over 20 random ASFs and 100 times of random channel realizations for each ASF.

Compared benchmarks
We compare the proposed algorithms to the following three benchmarks.
(1)Toeplitz-PSD Projection: The first benchmark is an intuitively simple approach. We know that the ULA channel covariance is a Toeplitz-PSD matrix. Thus, we can project the sample covariance onto the space of Toeplitz-PSD by solving the convex optimization problem: The projected matrix PSD h is the covariance estimate. We use the projections onto convex sets (POCS) algorithm [60] to solve (40). Specifically, the sample covariance matrix is alternately projected onto the convex set of Toeplitz and PSD cone. Projecting on Toeplitz follows the way in Appendix B, and projection on the PSD cone is achieved by eigendecomposition and setting all negative eigenvalues to zero. These two projections are repeated till convergence. Note that the complexity of this semi-definite programming problem can be high when the number of antennas M is large. Moreover, Toeplitz-PSD can not provide UL-DL covariance transformation.
(2)The SPICE Method: The second method we use for comparison is known as sparse iterative covariance-based estimation (SPICE) [33]. This method also exploits the ASF domain but can be only applied with Dirac delta dictionaries. Similar to the parametric covariance model with only Dirac delta dictionaries introduced in (32), assuming Dirac delta dictionaries D = [a(ξ 1 ), . . . , a(ξ G )] of G array response vectors corresponding to G AoAs, and defining = Ddiag(u)D H , the ASF coefficients u are estimated by solving the following convex optimization problem for two cases: The channel covariance estimate is then obtained as SPICE h = Ddiag(u ⋆ )D H .
(3)Convex Projection Method: This method is proposed in [21] for the ASF estimation by solving a convex feasibility problem γ = find γ , subject to γ ∈ S , where This can be solved by applying an iterative projection algorithm, which produces a sequence of functions in L 2 that converges to a function satisfying the constraint γ ∈ S . Given the estimated ASF γ , the channel covariance estimation is obtained following (3).

Considered metrics
Denoting a generic covariance estimate as , we use three error metrics to evaluate the estimation quality: (1) Normalized Frobenius norm Error: This error is defined as (42) S = γ : (2) Normalized MSE of Channel Estimation: Given a noisy channel observation y = h + z , the optimal estimation of channel vector h is obtained via linear MMSE filter h = (N 0 I M + ) −1 y . Then, this metric considers the normalized mean squared error (NMSE) of instantaneous channel estimation: By the optimality of linear MMSE estimation for Gaussian random vectors, the lower bound of this error is obtained using the true channel covariance. (3) Power Efficiency: This metric evaluates the similarity of dominant subspaces between the estimated and true matrices, which is an important factor in various applications of massive MIMO such as user grouping and group-based beamforming [7,10,61]. Specifically, let p ∈ [M] denote a subspace dimension parameter and let U p ∈ C M×p and U p ∈ C M×p be the p dominant eigenvectors of h and corresponding to their largest p eigenvalues, respectively. The power efficiency (PE) based on p is defined as It is noticed that E PE (p) ∈ [0, 1] and the closer it is to 0, the more power is captured by the estimated p-dominant subspace.

Performance comparison
We first provide a performance comparison under only Dirac delta dictionaries with a relatively large number G of grid points. Then, we provide a comparison under different dictionaries to show the benefit of properly choosing the dictionary adapted to the case at hand.
(1)Comparison under Dirac delta dictionaries: We set the number of dictionary functions for the ASF continuous part as G = 2M . SPICE is also applied to the same picketfence dictionary without knowledge of the spike locations, since this is a feature of our own method and not intrinsic in the SPICE algorithm. The projection method is not a dictionary based method. To implement it, we discretize the ASF domain with 5000 grid points to approximate the continue ASF.
In Fig. 6, the UL and DL channel covariance estimation error for scenario 3GPP-3D-UMa-LOS in terms of the normalized Frobenius norm error, normalized MSE of channel estimation, and power efficiency under different sample ratios N/M (from 0.0625 to 1) are depicted. It is observed that the results of the proposed NNLS and ML methods significantly outperform the other benchmarks in both UL and DL for all metrics under all range of sample size. Although the Toeplitz-PSD method has very similar Frobenius norm errors compared to our methods, it performs significantly worse than our methods in terms of channel estimation error especially under extremely small sample size. Moreover, it is worth to emphasize again that the Toeplitz-PSD method does not give directly an easy way to perform the UL-DL covariance transformation. In contrast (see (44) Fig. 6c), the proposed methods yield very good UL-DL transformation (according to the considered metrics) even under very small sample size. Furthermore, we also present the results of NNLS and ML-EM without MUSIC (i.e., without explicit spike location estimation) denoted as "NNLS, Delta, nM" and "ML-EM, Delta, nM" in Fig. 6a and b. It is observed that in this case the performances degrade dramatically. This indicates that the proposed MUSIC step is necessary and non-trivially improves the overall performance. It is also noticed that our results without MUSIC are still much better than the results of SPICE and projection method, which shows the advantage of the proposed NNLS and ML-EM algorithms. Results for the NLOS scenario 3GPP-3D-UMa-NLOS are depicted in Fig. 7. First, we observe again that our methods outperform the other benchmarks. Interestingly, the results with MUSIC perform almost the same as the results without MUSIC. Notice that in this case the scheme is unaware of the fact that there are no spikes, and the MDL/MUSIC step may give some spurious spikes, which are then eliminated by the subsequent NNLS and ML/EM step (estimated coefficients near zero). This demonstrates the fact that the proposed method is robust to both LOS and NLOS cases.
(2)Comparison with Dirac delta and overlapping Gaussian dictionaries: In this part, we show the results based on Dirac delta and overlapping Gaussian dictionaries to indicate the importance of finding the proper dictionary. We set a fixed N = 16 , i.e., N /M = 0.125 . We test the NNLS algorithm with Dirac delta and Gaussian dictionaries as well as the QP estimator with Gaussian dictionaries with G = 10000 under different number of dictionaries (G/M is from 0.0625 to 2) for the NLOS scenario 3GPP-3D-UMa-NLOS. The overlapping Gaussian dictionary functions are defined as follows. Given G, let ψ(ξ ) be a Gaussian density whose support is limited to [− 4 G+3 , 4 G+3 ] . The dictionary {ψ i (ξ ) : i ∈ [G]} consists of skewed shifted versions of ψ(ξ ) , i.e., ψ i (ξ ) = J (ξ ) ψ(ξ + 1 − 2(i+1) G+3 ), i ∈ [G] , where J (ξ ) = 1 √ 1−ξ 2 is a factor due to coordinate transformation from θ to ξ . Specifically, ψ(ξ ) is given by where µ = 0 and σ = 4 3(G+3) to ensure that the truncated interval [− 4 G+3 , 4 G+3 ] accounts for 6σ of the Gaussian function, and a 0 is a normalization scalar such that 1 −1 ψ(ξ )dξ = 1. The results are shown in Fig. 8. From Fig. 8a and b, it is observed that the results under Gaussian dictionaries are much better than the results under Dirac delta dictionaries when the number of dictionaries is small, e.g., G/M ≤ 1 . It is also observed that both Frobenius norm error and channel estimate MSE of NNLS and SPICE under Dirac delta dictionaries decrease dramatically as G increases. In contrast, the results under Gaussian dictionaries become slightly worsen when G/M > 0.5 . As G becomes large, the results with Gaussian dictionary functions converge to the Dirac delta case. This is of course explained by the fact that the thin Gaussian density with normalized integral is more and more similar to Dirac delta functions. A similar behavior is observed in Fig. 8c and d. These results indicate that by choosing some template dictionary function may achieve advantages of the quantization of the AoA domain compared to Dirac delta function for small size of the dictionary, which reduces significantly the dimension and computational complexity.

Conclusion
In this work, we addressed the problem of estimating the covariance matrix of the channel vector from a set of noisy UL pilot observations in massive MIMO systems. By modeling the ASF of the channel as a parametric representation in the angle domain, we proposed the NNLS-, QP-, and ML-EM-based estimators to obtain the model parameters. In order to find the discrete scattering components (number and (46)