Training sequence design for MIMO channels: an application-oriented approach
 Dimitrios Katselis^{1},
 Cristian R Rojas^{1},
 Mats Bengtsson^{1},
 Emil Björnson^{1},
 Xavier Bombois^{2},
 Nafiseh Shariati^{1},
 Magnus Jansson^{1} and
 Håkan Hjalmarsson^{1}
https://doi.org/10.1186/168714992013245
© Katselis et al.; licensee Springer. 2013
Received: 29 April 2013
Accepted: 1 October 2013
Published: 17 October 2013
Abstract
In this paper, the problem of training optimization for estimating a multiple-input multiple-output (MIMO) flat fading channel in the presence of spatially and temporally correlated Gaussian noise is studied in an application-oriented setup. So far, the problem of MIMO channel estimation has mostly been treated within the context of minimizing the mean square error (MSE) of the channel estimate subject to various constraints, such as an upper bound on the available training energy. We introduce a more general framework for the task of training sequence design in MIMO systems, which can treat not only the minimization of the channel estimator's MSE but also the optimization of a final performance metric of interest related to the use of the channel estimate in the communication system. First, we show that the proposed framework can be used to minimize the training energy budget subject to a quality constraint on the MSE of the channel estimator. A deterministic version of the 'dual' problem is also provided. We then focus on four specific applications, where the training sequence can be optimized with respect to the classical channel estimation MSE, a weighted channel estimation MSE and the MSE of the equalization error due to the use of an equalizer at the receiver or an appropriate linear precoder at the transmitter. In this way, the intended use of the channel estimate is explicitly accounted for. The superiority of the proposed designs over existing methods is demonstrated via numerical simulations.
1 Introduction
An important factor in the performance of multiple antenna systems is the accuracy of the channel state information (CSI) [1]. CSI is primarily used at the receiver side for purposes of coherent or semicoherent detection, but it can also be used at the transmitter side, e.g., for precoding and adaptive modulation. Since the maximization of spectral efficiency is an objective of interest in communication systems, the training duration and energy should be minimized. Most current systems use training signals that are white, both spatially and temporally, which is known to be a good choice according to several criteria [2, 3]. However, when some prior knowledge of the channel or noise statistics is available, it is possible to tailor the training signal and obtain significantly improved performance. In particular, several authors have studied scenarios where long-term CSI in the form of a covariance matrix over the short-term fading is available. So far, most proposed algorithms have been designed to minimize the squared error of the channel estimate, e.g., [4–9]. Alternative design criteria are used in [5] and [10], where the channel entropy is minimized given the received training signal. In [11], the resulting capacity in the case of a single-input single-output (SISO) channel is considered, while [12] focuses on the pairwise error probability.
Herein, a generic context is described, drawing from similar techniques that have been recently proposed for training signal design in system identification [13–15]. This context aims at providing a unified theoretical framework that can be used to treat the MIMO training optimization problem in various scenarios. Furthermore, it provides a different way of looking at the aforementioned problem that could be adjusted to a wide variety of estimation-related problems in communication systems. First, we show how the problem of minimizing the training energy subject to a quality constraint can be solved, while a 'dual' deterministic (average design) problem is considered^{a}. In the sequel, we show that by a suitable definition of the performance measure, the problem of optimizing the training for minimizing the channel MSE can be treated as a special case. We also consider a weighted version of the channel MSE, which relates to the well-known L-optimality criterion [16]. Moreover, we explicitly consider how the channel estimate will be used and attempt to optimize the end performance of the data transmission, which is not necessarily equivalent to minimizing the mean square error (MSE) of the channel estimate. Specifically, we study two uses of the channel estimate: channel equalization at the receiver using a minimum mean square error (MMSE) equalizer and channel inversion (zero-forcing precoding) at the transmitter, and derive the corresponding optimal training signals for each case. In the case of MMSE equalization, separate approximations are provided for the high and low SNR regimes. Finally, the resulting performance is illustrated based on numerical simulations. Compared to related results in the control literature, here, we directly design a finite length training signal and consider not only deterministic channel parameters but also a Bayesian channel estimation framework.
A related pilot design strategy has been proposed in [17] for the problem of jointly estimating the frequency offset and the channel impulse response in single-antenna transmissions.
Implementing an adaptive choice of pilot signals in a practical system would require a feedback signalling overhead, since both the transmitter and the receiver have to agree on the choice of the pilots. Just as the previous studies in the area, the current paper is primarily intended to provide a theoretical benchmark on the resulting performance of such a scheme. Directly considering the end performance in the pilot design is a step toward making the results more relevant. The data model used in [4–10] is based on the assumption that the channel is frequency flat but the noise is allowed to be frequency selective. Such a generalized assumption is relevant in systems that share spectrum with other radio interfaces using a narrower bandwidth and possibly in situations where channel coding introduces a temporal correlation in interfering signals. In order to focus on the main principles of our proposed strategy, we maintain this research line by using the same model in the current paper.
As a final comment, the novelty of this paper is on introducing the application-oriented framework as the appropriate context for training sequence design in communication systems. To this end, Hermitian form-like approximations of performance metrics are addressed here because they usually are good approximations of many performance metrics of interest, as well as for simplicity purposes and comprehensiveness of presentation. Although the ultimate performance metric in communications systems, namely the bit error rate (BER), would be of interest, its handling seems to be a challenging task and is reserved for future study. In this paper, we make an effort to introduce the application-oriented training design framework in the most illustrative and straightforward way.
This paper is organized as follows: Section 2 introduces the basic MIMO received signal model and specific assumptions on the structure of channel and noise covariance matrices. Section 3 presents the optimal channel estimators, when the channel is considered to be either a deterministic or a random matrix. Section 4 presents the application-oriented optimal training designs in a guaranteed performance context, based on confidence ellipsoids and Markov bound relaxations. Moreover, Section 5 focuses on four specific applications, namely that of MSE channel estimation, channel estimation based on the L-optimality criterion, and finally channel estimation for MMSE equalization and ZF precoding. Numerical simulations are provided in Section 6, while Section 7 concludes this paper.
1.1 Notations
Boldface (lowercase) is used for column vectors, x, and (uppercase) for matrices, X. Moreover, X^{ T }, X^{ H }, X^{∗}, and X^{†} denote the transpose, the conjugate transpose, the conjugate, and the Moore-Penrose pseudoinverse of X, respectively. The trace of X is denoted as tr(X) and A ≽ B means that A − B is positive semidefinite. vec(X) is the vector produced by stacking the columns of X, and (X)_{i,j} is the (i, j)th element of X. [X]_{+} means that all negative eigenvalues of X are replaced by zeros (i.e., [X]_{+} ≽ 0). $\mathcal{C}\mathcal{N}(\bar{\mathbf{x}},\mathbf{Q})$ stands for circularly symmetric complex Gaussian random vectors, where $\bar{\mathbf{x}}$ is the mean and Q the covariance matrix. Finally, α! denotes the factorial of the nonnegative integer α and mod (a, b) the modulo operation between the integers a, b.
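As an illustration of two of the operators defined above, the following is a minimal NumPy sketch of vec(·) and [·]_{+}. The function names `vec` and `pos_part` are chosen here for illustration only, and `pos_part` assumes a Hermitian argument so that the eigenvalues are real.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into a single vector (column-major order)."""
    return X.reshape(-1, order="F")

def pos_part(X):
    """Return [X]_+ : X with its negative eigenvalues replaced by zeros.

    Assumption: X is Hermitian, so np.linalg.eigh applies and the
    eigenvalues are real.
    """
    eigvals, eigvecs = np.linalg.eigh(X)
    clipped = np.clip(eigvals, 0.0, None)          # zero out negative eigenvalues
    return (eigvecs * clipped) @ eigvecs.conj().T  # V diag(clipped) V^H
```

For a positive semidefinite input, `pos_part` returns the input itself up to rounding, which is consistent with [X]_{+} ≽ 0.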
2 System model
 (i) A deterministic model.
 (ii) A stochastic Rayleigh fading model^{b}, i.e., $\text{vec}(\mathbf{H})\in \mathcal{C}\mathcal{N}(\mathbf{0},\mathbf{R})$, where, for mathematical tractability, we will assume that the known covariance matrix R possesses the Kronecker model used, e.g., in [7, 10]:
$\mathbf{R}={\mathbf{R}}_{T}^{T}\otimes {\mathbf{R}}_{R}$ (1)
where ${\mathbf{R}}_{T}\in {\mathbb{C}}^{{n}_{T}\times {n}_{T}}$ and ${\mathbf{R}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ are the spatial covariance matrices at the transmitter and receiver side, respectively. This model has been experimentally verified in [18, 19] and further motivated in [20, 21].
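The Kronecker model (1) can be illustrated numerically. The sketch below, which is not taken from the paper, draws a channel realization as $\mathbf{H}={\mathbf{R}}_{R}^{1/2}\mathbf{W}{\mathbf{R}}_{T}^{1/2}$ with W having i.i.d. $\mathcal{C}\mathcal{N}(0,1)$ entries; by the identity vec(AXB) = (B^{ T } ⊗ A)vec(X), this gives vec(H) the covariance in (1). The helper names `herm_sqrt`, `kron_cov`, and `sample_channel` are hypothetical.

```python
import numpy as np

def herm_sqrt(A):
    """Hermitian square root of a Hermitian positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def kron_cov(R_T, R_R):
    """Covariance of vec(H) under the Kronecker model (1): R_T^T kron R_R."""
    return np.kron(R_T.T, R_R)

def sample_channel(R_T, R_R, rng):
    """Draw H = R_R^{1/2} W R_T^{1/2}, with W having i.i.d. CN(0, 1) entries."""
    n_R, n_T = R_R.shape[0], R_T.shape[0]
    W = (rng.standard_normal((n_R, n_T))
         + 1j * rng.standard_normal((n_R, n_T))) / np.sqrt(2.0)
    return herm_sqrt(R_R) @ W @ herm_sqrt(R_T)
```

Averaging vec(H)vec(H)^{ H } over many draws from `sample_channel` should approach `kron_cov(R_T, R_R)`.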
where $\mathbf{N}=\left[\mathbf{n}(1)\;\dots \;\mathbf{n}(B)\right]\in {\mathbb{C}}^{{n}_{R}\times B}$ is the combined noise and interference matrix.
Here, ${\mathbf{S}}_{Q}\in {\mathbb{C}}^{B\times B}$ represents the temporal covariance matrix^{c} that is used to model the effect of temporal correlations in interfering signals, when the noise incorporates multiuser interference. Moreover, ${\mathbf{S}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ represents the received spatial covariance matrix that is mostly related to the characteristics of the receive array. The Kronecker structure (3) corresponds to an assumption that the spatial and temporal properties of N are uncorrelated.
The channel and noise statistics will be assumed known to the receiver during estimation. These statistics can often be obtained by long-term estimation and tracking [22]. For the data transmission phase, we will assume that the transmit signal {x(t)} is a zero-mean, weakly stationary process, which is both temporally and spatially white, i.e., its spectrum is Φ_{ x }(ω) = λ_{ x }I.
3 Channel matrix estimation
3.1 Deterministic channel estimation
where ${\chi}_{\alpha}^{2}(n)$ is the α percentile of the χ^{2}(n) distribution [15].
3.2 Bayesian channel estimation
where ${\mathcal{I}}_{\text{F,MMSE}}\triangleq {\mathbf{C}}_{\text{MMSE}}^{-1}$ is the inverse covariance matrix in the MMSE case [15].
4 Application-oriented optimal training design
We will call this problem the deterministic maximized performance problem (DMPP). The corresponding Bayesian problem will be denoted as the stochastic maximized performance problem (SMPP). We will study the DGPP/SGPP in detail in this contribution, but the DMPP/SMPP can be treated in similar ways. In fact, Theorem 3 in [24] suggests that the solutions to the DMPP/SMPP are the same as for DGPP/SGPP, save for a scaling factor.
The existing work on optimal training design for MIMO channels is, to the best of the authors' knowledge, based upon standard measures on the quality of the channel estimate, rather than on the quality of the end use of the channel. The framework presented in this section can be used to treat the existing results as special cases. Additionally, if an end performance metric is optimized, the DGPP/SGPP and DMPP/SMPP formulations better reflect the ultimate objective of the training design. This type of optimal training design formulation has already been used in the control literature, but mainly for large sample sizes [13, 14, 25, 26], yielding an enhanced performance with respect to conventional estimation-theoretic approaches. A reasonable question is whether such a performance gain can be achieved in the case of training sequence design for MIMO channel estimation, where the sample sizes would be very small.
Remark.
for some ε ∈ [0, 1]. Problems (12), (13), and (14) correspond to a convex relaxation of this chance constraint based on confidence ellipsoids [27], as we show in the next subsection.
4.1 Approximating the training design problems
(for a more general result, see [15], Theorem 3.1).
We call the last problem approximative SGPP (ASGPP).
 1. The approximation (16) is not possible for the performance metric of every application. Several examples where this is possible are presented in Section 5. Therefore, in some applications, different convex approximations of the corresponding performance metrics may have to be found.
 2. The quality of the approximation (16) is characterized by its tightness to the true performance metric. For our purposes, when the tightness of the aforementioned approximation is acceptable, such an approximation will be desirable for two reasons. First, it corresponds to a Hermitian form, therefore offering nice mathematical properties and tractability. Additionally, the constraint ${\mathcal{D}}_{D}\subseteq {\mathcal{D}}_{\text{adm}}$ can be efficiently handled.
 3. The sizes of ${\mathcal{D}}_{D}$ and ${\mathcal{D}}_{\text{adm}}$ critically depend on the parameter α. In practice, requiring α to have a value close to 1 corresponds to adequately representing the uncertainty set in which (approximately) all possible channel estimation errors lie.
4.2 The deterministic guaranteed performance problem
The problem formulations for ADGPP and ASGPP in (19) and (20), respectively, are similar in structure. The solutions to these problems (and to other approximative guaranteed performance problems) can be obtained from the following general theorem, which has not previously been available in the literature, to the best of our knowledge:
Theorem 1.
and D_{ A }, D_{ B } are real-valued diagonal matrices, with their diagonal elements sorted in ascending and descending order, respectively, that is, 0 < (D_{ A })_{1,1} ≤ … ≤ (D_{ A })_{N,N} and (D_{ B })_{1,1} ≥ … ≥ (D_{ B })_{n,n} ≥ 0.
If the eigenvalues of A and B are distinct and strictly positive, then the solution (22) is unique up to the multiplication of the columns of U_{ A } and U_{ B } by complex unit-norm scalars.
Proof.
The proof is given in Appendix 2. □
By the right choice of A and B, Theorem 1 will solve the ADGPP in (19). This is shown by the next theorem (recall that we have assumed that $\mathbf{S}={\mathbf{S}}_{Q}^{T}\otimes {\mathbf{S}}_{R}$).
Theorem 2.
where $\tilde{\mathbf{P}}={\mathbf{P}}^{T}\otimes \mathbf{I}$, ${\mathbf{S}}_{Q}\in {\mathbb{C}}^{B\times B}$, ${\mathbf{S}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ are Hermitian positive definite, ${\mathcal{I}}_{T}\in {\mathbb{C}}^{{n}_{T}\times {n}_{T}}$, ${\mathcal{I}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ are Hermitian positive semidefinite, and c is a positive constant.
If $B\ge \text{rank}({\mathcal{I}}_{T})$, this problem is equivalent to (21) in Theorem 1 for A = S_{ Q } and $\mathbf{B}=c{\lambda}_{\text{max}}\left({\mathbf{S}}_{R}{\mathcal{I}}_{R}\right){\mathcal{I}}_{T}$, where λ_{max}(·) denotes the maximum eigenvalue.
Proof.
The proof is given in Appendix 7. □
4.3 The stochastic guaranteed performance problem
We will see that Theorem 1 can also be used to solve the ASGPP in (20). In order to obtain closed-form solutions, we need some equality relation between the Kronecker blocks of $\mathbf{R}={\mathbf{R}}_{T}^{T}\otimes {\mathbf{R}}_{R}$ and of either $\mathbf{S}={\mathbf{S}}_{Q}^{T}\otimes {\mathbf{S}}_{R}$ or ${\mathcal{I}}_{\text{adm}}={\mathcal{I}}_{T}^{T}\otimes {\mathcal{I}}_{R}$. For instance, it can be R_{ R } = S_{ R }, which may be satisfied if the receive antennas are spatially uncorrelated or if the signal and interference are received from the same main direction (see [7] for details on the interpretations of these assumptions).
The solution to ASGPP in (20) is given by the next theorem.
Theorem 3.
where $\tilde{\mathbf{P}}={\mathbf{P}}^{T}\otimes \mathbf{I}$, $\mathbf{R}={\mathbf{R}}_{T}^{T}\otimes {\mathbf{R}}_{R}$, and $\mathbf{S}={\mathbf{S}}_{Q}^{T}\otimes {\mathbf{S}}_{R}$. Here, ${\mathbf{R}}_{T}\in {\mathbb{C}}^{{n}_{T}\times {n}_{T}}$, ${\mathbf{R}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$, ${\mathbf{S}}_{Q}\in {\mathbb{C}}^{B\times B}$, ${\mathbf{S}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ are Hermitian positive definite, ${\mathcal{I}}_{T}\in {\mathbb{C}}^{{n}_{T}\times {n}_{T}}$, ${\mathcal{I}}_{R}\in {\mathbb{C}}^{{n}_{R}\times {n}_{R}}$ are Hermitian positive semidefinite, and c is a positive constant.

If R_{ R } = S_{ R } and $B\ge \text{rank}\left({\left[c{\lambda}_{\text{max}}\left({\mathbf{S}}_{R}{\mathcal{I}}_{R}\right){\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}\right]}_{+}\right)$, then the problem is equivalent to (21) in Theorem 1 for A = S_{ Q } and $\mathbf{B}={\left[c{\lambda}_{\text{max}}({\mathbf{S}}_{R}{\mathcal{I}}_{R}){\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}\right]}_{+}$.

If ${\mathbf{R}}_{R}^{-1}={\mathcal{I}}_{R}$ and $B\ge \text{rank}\left({\left[c{\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}\right]}_{+}\right)$, then the problem is equivalent to (21) in Theorem 1 for A = S_{ Q } and $\mathbf{B}={\lambda}_{\text{max}}({\mathbf{S}}_{R}{\mathcal{I}}_{R}){\left[c{\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}\right]}_{+}$.

If ${\mathbf{R}}_{T}^{-1}={\mathcal{I}}_{T}$ and $B\ge \text{rank}\left({\mathcal{I}}_{T}\right)$, then the problem is equivalent to (21) in Theorem 1 for A = S_{ Q } and $\mathbf{B}={\lambda}_{\text{max}}\left({\mathbf{S}}_{R}{\left[c{\mathcal{I}}_{R}-{\mathbf{R}}_{R}^{-1}\right]}_{+}\right){\mathcal{I}}_{T}$.
Proof.
The proof is given in Appendix 3. □
The mathematical difference between ADGPP and ASGPP is the R^{-1} term that appears in the constraint of the latter. This term has a clear impact on the structure of the optimal ASGPP training matrix.
It is also worth noting that the solution for R_{ R } = S_{ R } requires $B\ge \text{rank}({[c{\lambda}_{\text{max}}({\mathbf{S}}_{R}{\mathcal{I}}_{R}){\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}]}_{+})$, which means that solutions can be achieved also for B < n_{ T } (i.e., when only the B < n_{ T } strongest eigendirections of the channel are excited by training). In certain cases, e.g., when the interference is temporally white (S_{ Q } = I), it is optimal to have $B=\text{rank}({[c{\lambda}_{\text{max}}({\mathbf{S}}_{R}{\mathcal{I}}_{R}){\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}]}_{+})$, as a larger B will not decrease the training energy usage, cf. [9].
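The rank condition for the case R_{ R } = S_{ R } can be checked numerically. The following NumPy sketch forms the matrix $\mathbf{B}={[c{\lambda}_{\text{max}}({\mathbf{S}}_{R}{\mathcal{I}}_{R}){\mathcal{I}}_{T}-{\mathbf{R}}_{T}^{-1}]}_{+}$ and reports its rank, i.e., the smallest training length B admitted by the theorem. The function names and the numerical tolerance are assumptions made for this illustration; the constant c is supplied by the caller.

```python
import numpy as np

def pos_part(X):
    """[X]_+ for a Hermitian X: negative eigenvalues replaced by zeros."""
    w, V = np.linalg.eigh(X)
    return (V * np.clip(w, 0.0, None)) @ V.conj().T

def asgpp_B_matrix(c, S_R, I_R, I_T, R_T):
    """B = [c * lambda_max(S_R I_R) * I_T - R_T^{-1}]_+  (case R_R = S_R)."""
    # S_R is positive definite and I_R positive semidefinite, so the
    # eigenvalues of their product are real and nonnegative.
    lam_max = np.max(np.linalg.eigvals(S_R @ I_R).real)
    return pos_part(c * lam_max * I_T - np.linalg.inv(R_T))

def min_training_length(B_mat, tol=1e-10):
    """Smallest block length B satisfying B >= rank(B_mat)."""
    return int(np.linalg.matrix_rank(B_mat, tol=tol))
```

With R_{ T } = diag(4, 1, 0.25), identity ${\mathcal{I}}_{T}$, ${\mathcal{I}}_{R}$, S_{ R }, and c = 2, the weakest transmit eigendirection is dropped and the rank is 2 < n_{ T } = 3.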
4.4 Optimizing the average performance
so problem (26) is solved by the following theorem.
Theorem 4.
where ${\mathcal{I}}_{\text{adm}}={\mathcal{I}}_{T}^{T}\otimes {\mathcal{I}}_{R}$ as before. Set ${\mathcal{I}}_{T}^{\prime}={\mathcal{I}}_{T}^{T}={\mathbf{U}}_{T}{\mathbf{D}}_{T}{\mathbf{U}}_{T}^{H}$ and ${\mathbf{S}}_{Q}^{\prime}={\mathbf{S}}_{Q}^{T}={\mathbf{U}}_{Q}{\mathbf{D}}_{Q}{\mathbf{U}}_{Q}^{H}$. Here, ${\mathbf{U}}_{T}\in {\mathbb{C}}^{{n}_{T}\times {n}_{T}}$, ${\mathbf{U}}_{Q}\in {\mathbb{C}}^{B\times B}$ are unitary matrices and D_{ T }, D_{ Q } are diagonal n_{ T } × n_{ T } and B × B matrices containing the eigenvalues of ${\mathcal{I}}_{T}^{\prime}$ and ${\mathbf{S}}_{Q}^{\prime}$ in descending and ascending order, respectively. Then, the optimal training matrix P equals ${\left({\mathbf{U}}_{T}{\mathbf{D}}_{P}{\mathbf{U}}_{Q}^{H}\right)}^{\ast}$, where D_{ P } is an n_{ T } × B diagonal matrix with main diagonal entries equal to ${({\mathbf{D}}_{P})}_{i,i}=\sqrt{\mathcal{P}\sqrt{{\alpha}_{i}}/{\sum}_{j=1}^{{n}_{T}}\sqrt{{\alpha}_{j}}}$, i = 1, 2, …, n_{ T } (B ≥ n_{ T }), and α_{ i } = (D_{ T })_{i,i}(D_{ Q })_{i,i}, i = 1, 2, …, n_{ T }, with the aforementioned ordering.
Proof.
The proof is given in Appendix 7. □
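The closed form in Theorem 4 lends itself to a direct numerical construction. The sketch below is a minimal NumPy illustration under the stated assumption B ≥ n_{ T }; the function name `theorem4_training` and the argument name `energy` (standing for $\mathcal{P}$, taken here to be the training energy budget) are hypothetical, and at least one α_{ i } is assumed to be nonzero.

```python
import numpy as np

def theorem4_training(I_T, S_Q, energy):
    """Training matrix P = (U_T D_P U_Q^H)^* following Theorem 4.

    I_T : (n_T, n_T) Hermitian positive semidefinite matrix
    S_Q : (B, B) Hermitian positive definite matrix, with B >= n_T
    """
    n_T, B = I_T.shape[0], S_Q.shape[0]
    if B < n_T:
        raise ValueError("this sketch assumes B >= n_T")
    # eigh returns eigenvalues in ascending order; reverse them to get
    # the descending order required for I_T' = I_T^T.
    d_T, U_T = np.linalg.eigh(I_T.T)
    d_T, U_T = d_T[::-1], U_T[:, ::-1]
    # S_Q' = S_Q^T is used with ascending eigenvalues, as returned by eigh.
    d_Q, U_Q = np.linalg.eigh(S_Q.T)
    alpha = np.clip(d_T, 0.0, None) * d_Q[:n_T]
    root = np.sqrt(alpha)
    D_P = np.zeros((n_T, B))
    D_P[np.arange(n_T), np.arange(n_T)] = np.sqrt(energy * root / root.sum())
    return np.conj(U_T @ D_P @ U_Q.conj().T)
```

Since the squared diagonal entries of D_{ P } sum to $\mathcal{P}$, the returned matrix satisfies tr(PP^{ H }) = $\mathcal{P}$, which offers a quick sanity check on any implementation.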
 1. In the general case of a non-Kronecker-structured ${\mathcal{I}}_{\text{adm}}$, the training can be obtained using numerical methods like the semidefinite relaxation approach described in [28].
 2. If ${\mathcal{I}}_{\text{adm}}$ depends on H, then in order to implement this design, the embedded H in ${\mathcal{I}}_{\text{adm}}$ may be replaced by a previous channel estimate. This implies that this approach is possible whenever the channel variations allow for such a design. This observation also applies to the designs in the previous subsections (see also [24, 29], where the same issue is discussed for other system identification applications).
In this case, we can derive closed form expressions for the optimal training under assumptions similar to those made in Theorem 3. We therefore have the following result:
Theorem 5.
where ${\mathcal{I}}_{\text{adm}}={\mathcal{I}}_{T}^{T}\otimes {\mathcal{I}}_{R}$ as before. Set ${\mathbf{S}}_{Q}^{\prime}={\mathbf{S}}_{Q}^{T}={\mathbf{V}}_{Q}{\mathit{\Lambda}}_{Q}{\mathbf{V}}_{Q}^{H}$. Here, we assume that ${\mathbf{V}}_{Q}\in {\mathbb{C}}^{B\times B}$ is a unitary matrix and Λ_{ Q } a diagonal B × B matrix containing the eigenvalues of ${\mathbf{S}}_{Q}^{\prime}$ in arbitrary order. Assume also that ${\mathbf{R}}_{T}^{\prime}={\mathbf{R}}_{T}^{T}$ with eigenvalue decomposition ${\mathbf{U}}_{T}^{\prime}{\mathit{\Lambda}}_{T}^{\prime}{\mathbf{U}}_{T}^{\prime H}$. The diagonal elements of ${\mathit{\Lambda}}_{T}^{\prime}$ are assumed to be arbitrarily ordered. Then, we have the following cases:

R_{ R } = S_{ R }: We further distinguish two cases:

${\mathcal{I}}_{T}=\mathbf{I}$: Then the optimal training is given by a straightforward adaptation of Proposition 2 in [8].

${\mathbf{R}}_{T}^{-1}={\mathcal{I}}_{T}$: Then, the optimal training matrix P equals ${\left({\mathbf{U}}_{T}^{\prime}({\pi}_{\text{opt}}){\mathbf{D}}_{P}{\mathbf{V}}_{Q}^{H}({\varpi}_{\text{opt}})\right)}^{\ast}$, where π_{opt}, ϖ_{opt} stand for the optimal orderings of the eigenvalues of ${\mathbf{R}}_{T}^{\prime}$ and ${\mathbf{S}}_{Q}^{\prime}$, respectively. These optimal orderings are determined by Algorithm 1 in Appendix 5. Additionally, define the parameter m_{∗} as in Equation 69 (see Appendix 5). Assuming in the following that, for simplicity of notation, the ${({\mathit{\Lambda}}_{T}^{\prime})}_{i,i}$'s and (Λ_{ Q })_{i,i}'s have the optimal ordering, the optimal (D_{ P })_{j,j}, j = 1, 2, …, m_{∗}, are given by the expression
$\sqrt{\frac{\mathcal{P}+{\sum}_{i=1}^{{m}_{\ast}}\frac{{({\mathit{\Lambda}}_{Q})}_{i,i}}{{({\mathit{\Lambda}}_{T}^{\prime})}_{i,i}}}{{\sum}_{i=1}^{{m}_{\ast}}\sqrt{\frac{{({\mathit{\Lambda}}_{Q})}_{i,i}}{{({\mathit{\Lambda}}_{T}^{\prime})}_{i,i}}}}\sqrt{\frac{{({\mathit{\Lambda}}_{Q})}_{j,j}}{{({\mathit{\Lambda}}_{T}^{\prime})}_{j,j}}}-\frac{{({\mathit{\Lambda}}_{Q})}_{j,j}}{{({\mathit{\Lambda}}_{T}^{\prime})}_{j,j}}},$
while (D_{ P })_{j,j} = 0 for j = m_{∗} + 1, …, n_{ T }.

Proof.
The proof is given in Appendix 5. □
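The amplitude expression of Theorem 5 for the case ${\mathbf{R}}_{T}^{-1}={\mathcal{I}}_{T}$ can be evaluated as in the following NumPy sketch. It assumes that the eigenvalues are already supplied in the optimal ordering and that m_{∗} is known; neither Algorithm 1 nor Equation 69 is reproduced here, so both are treated as inputs. The function name `theorem5_amplitudes` and its argument names are hypothetical.

```python
import numpy as np

def theorem5_amplitudes(energy, lam_Q, lam_T, m_star):
    """Diagonal entries (D_P)_{j,j} from the expression in Theorem 5.

    energy : training energy budget (the symbol P in the theorem)
    lam_Q  : eigenvalues of S_Q', assumed already optimally ordered
    lam_T  : eigenvalues of R_T', assumed already optimally ordered
    m_star : number of active eigendirections (treated as given)
    """
    lam_Q = np.asarray(lam_Q, dtype=float)
    lam_T = np.asarray(lam_T, dtype=float)
    ratio = lam_Q[:m_star] / lam_T[:m_star]
    level = (energy + ratio.sum()) / np.sqrt(ratio).sum()
    power = level * np.sqrt(ratio) - ratio   # squared amplitudes
    if np.any(power < 0):
        raise ValueError("negative power: m_star or the ordering is not valid")
    d = np.zeros(lam_T.size)
    d[:m_star] = np.sqrt(power)              # remaining entries stay zero
    return d
```

Summing the squared amplitudes over j = 1, …, m_{∗} gives exactly $\mathcal{P}$, since the ratio terms cancel; this is a convenient check that the full energy budget is used.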
 1. If the modal matrices of R_{ R } and S_{ R } are the same, ${\mathcal{I}}_{T}=\mathbf{I}$ and ${\mathcal{I}}_{R}=\mathbf{I}$, then the optimal training is given by [9].
 2. In any other case (e.g., if R_{ R } ≠ S_{ R }), the training can be found using numerical methods like the semidefinite relaxation approach described in [28]. Note again that this approach can also handle general ${\mathcal{I}}_{\text{adm}}$, not necessarily expressed as ${\mathcal{I}}_{T}^{T}\otimes {\mathcal{I}}_{R}$.
According to the analysis in [27], these approximations should be tighter than the approximations based on confidence ellipsoids presented in Subsections 4.1, 4.2, and 4.3 for practically relevant values of ε.
5 Applications
5.1 Optimal training for channel estimation
which corresponds to ${\mathcal{I}}_{\text{adm}}=\mathbf{I}$, i.e., to ${\mathcal{I}}_{T}=\mathbf{I}$ and ${\mathcal{I}}_{R}=\mathbf{I}$. The ADGPP and ASGPP are given by (19) and (20), respectively, with the corresponding substitutions. Their solutions follow directly from Theorems 2 and 3, respectively. To the best of the authors’ knowledge, such formulations for the classical MIMO training design problem are presented here for the first time. Furthermore, solutions to the standard approach of minimizing the channel MSE subject to a constraint on the training energy budget are provided by Theorems 4 and 5 as special cases.
Remark.
Although the confidence ellipsoid and Markov bound approximations are generally different [27], in the simulation section, we show that their performance is almost identical for reasonable operating γ-regimes in the specific case of standard channel estimation.
5.2 Optimal training for the L-optimality criterion
for some positive semidefinite weighting matrix W. Assume also that W = W_{1} ⊗ W_{2} for some positive semidefinite matrices W_{1}, W_{2}. Taking the expected value of this performance metric with respect to either $\tilde{\mathbf{H}}$ or both $\tilde{\mathbf{H}}$ and H leads to the well-known L-optimality criterion for optimal experiment design in statistics [16]. In this case, ${\mathcal{I}}_{T}={\mathbf{W}}_{1}^{T}$ and ${\mathcal{I}}_{R}={\mathbf{W}}_{2}$. In the context of MIMO communication systems, such a performance metric may arise, e.g., if we want to estimate the MIMO channel having some deficiencies in the transmit and/or the receive antenna arrays. The simplest case would be both W_{1} and W_{2} being diagonal with nonzero entries in the interval [0,1], W_{1} representing the deficiencies in the transmit antenna array and W_{2} in the receive array. More general matrices can be considered if we assume cross-couplings between the transmit and/or receive antenna elements.
Remark.
The numerical approach of [28] mentioned after Theorems 4 and 5 can handle general weighting matrices W, not necessarily Kronecker-structured.
5.3 Optimal training for channel equalization
In this subsection, we consider the problem of estimating a transmitted signal sequence {x(t)} from the corresponding received signal sequence {y(t)}. Among a wide range of methods that are available [30, 31], we will consider the MMSE equalizer, and for mathematical tractability, we will approximate it by the noncausal Wiener filter. Note that for reasonably long block lengths, the MMSE estimate becomes similar to the noncausal Wiener filter [32]. Thus, the optimal training design based on the noncausal Wiener filter should also provide good performance when using an MMSE equalizer.
5.3.1 Equalization using exact channel state information
Remark.
Assuming nonsingularity of Φ_{ n }(ω) for every ω, the MMSE equalizer is applicable for all values of the pair (n_{ T }, n_{ R }).
5.3.2 Equalization using a channel estimate
We see that the poorer the accuracy of the estimate, the larger the performance metric ${J}_{\text{CE}}(\tilde{\mathbf{H}},\mathbf{H})$ and, thus, the larger the performance loss of the equalizer. Therefore, this performance metric is a reasonable candidate to use when formulating our training sequence design problem. Indeed, the Wiener equalizer based on the estimate $\hat{\mathbf{H}}=\mathbf{H}+\tilde{\mathbf{H}}$ of H can be deemed to have a satisfactory performance if ${J}_{\text{CE}}(\tilde{\mathbf{H}},\mathbf{H})$ remains below some user-chosen threshold. Thus, we will use J_{CE} as J in problems (12) and (13). Though these problems are not convex, we show in Appendix 1 how they can be convexified, provided some approximations are made.
 1. The excess MSE ${J}_{\text{CE}}(\tilde{\mathbf{H}},\mathbf{H})$ quantifies the distance of the MMSE equalizer using the channel estimate $\hat{\mathbf{H}}$ from the clairvoyant MMSE equalizer, i.e., the one using the true channel. This performance metric is not the same as the classical MSE in the equalization context, where the difference $\hat{\mathbf{x}}(t;\mathbf{H}+\tilde{\mathbf{H}})-\mathbf{x}(t)$ is considered instead of $\hat{\mathbf{x}}(t;\mathbf{H}+\tilde{\mathbf{H}})-\hat{\mathbf{x}}(t;\mathbf{H})$. However, since in practice the best transmit vector estimate that can be attained is the clairvoyant one, the choice of ${J}_{\text{CE}}(\tilde{\mathbf{H}},\mathbf{H})$ is justified. This selection allows for a performance metric approximation given by (16).
 2. There are certain cases of interest, where ${J}_{\text{CE}}(\tilde{\mathbf{H}},\mathbf{H})$ approximately coincides with the classical equalization MSE. Such a case occurs when n_{ R } ≥ n_{ T }, H is full column rank and the SNR is high during data transmission.
5.4 Optimal training for zero-forcing precoding
where the precoder is $\mathbf{\Psi}={\hat{\mathbf{H}}}^{\dagger}$, i.e., $\mathbf{\Psi}={\hat{\mathbf{H}}}^{H}{(\hat{\mathbf{H}}{\hat{\mathbf{H}}}^{H})}^{-1}$ if we limit ourselves to the practically relevant case n_{ T } ≥ n_{ R } and assume that $\hat{\mathbf{H}}$ is full rank. Note that x(t) is an n_{ R } × 1 vector in this case, but the transmitted vector is Ψ x(t), which is n_{ T } × 1.
if we define ${\mathcal{I}}_{T}={\lambda}_{x}{\mathbf{H}}^{\dagger}{({\mathbf{H}}^{\dagger})}^{H}={\lambda}_{x}{\mathbf{H}}^{H}{(\mathbf{H}{\mathbf{H}}^{H})}^{-2}\mathbf{H}$ and ${\mathcal{I}}_{R}=\mathbf{I}$.
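The two equivalent expressions for ${\mathcal{I}}_{T}$ in the ZF precoding case can be verified numerically. The NumPy sketch below builds the precoder for a full-row-rank channel with n_{ T } ≥ n_{ R } and forms ${\mathcal{I}}_{T}$; the function names `zf_precoder` and `zf_I_T` are hypothetical and the code is only an illustration of the formulas above.

```python
import numpy as np

def zf_precoder(H_hat):
    """ZF precoder H^H (H H^H)^{-1}; requires full row rank (n_T >= n_R)."""
    return H_hat.conj().T @ np.linalg.inv(H_hat @ H_hat.conj().T)

def zf_I_T(H, lam_x):
    """I_T = lam_x * H^H (H H^H)^{-2} H, an n_T x n_T matrix."""
    G = np.linalg.inv(H @ H.conj().T)
    return lam_x * H.conj().T @ G @ G @ H
```

For a full-row-rank H, `zf_precoder(H)` coincides with `np.linalg.pinv(H)`, and `zf_I_T(H, lam_x)` equals `lam_x * Psi @ Psi.conj().T` with `Psi = zf_precoder(H)`; its rank is at most n_{ R }.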
Remark.
The cost functions of (27) and (28) reveal the fact that any performance-oriented training design is a compromise between the strict channel estimation accuracy and the desired accuracy related to the end performance metric at hand. Caution is needed to identify cases where the performance-oriented design may severely degrade the channel estimation accuracy, annihilating all gains from such a design. In the case of ZF precoding, if n_{ T } > n_{ R }, ${\mathcal{I}}_{T}$ will have rank at most n_{ R } yielding a training matrix P with only n_{ R } active eigendirections. This is in contrast to the secondary target, which is the channel estimation accuracy. Therefore, we expect ADGPP, ASGPP, and the approaches in Subsection 4.4 to behave abnormally in this case. Thus, we propose the performance-oriented design only when n_{ T } = n_{ R } in the context of the ZF precoding.
6 Numerical examples
where r is the (complex) normalized correlation coefficient with magnitude ρ = |r| < 1. We choose to examine the high correlation scenario for all the presented schemes. Therefore, in all plots, |r| = 0.9 for all matrices R_{ T }, R_{ R }, S_{ Q }, S_{ R }. Additionally, the transmit SNR during data transmission is chosen to be 15 dB, when channel equalization and ZF precoding are considered. High SNR expressions are therefore used for optimal training sequence designs. Since the optimal pilot sequences depend on the true channel, we have for these two applications additionally assumed that the channel changes from block to block according to the relationship H_{ i } = H_{i-1} + μ E_{ i }, where E_{ i } has the same Kronecker structure as H, and it is completely independent from H_{i-1}. The estimated H_{i-1} is used in the pilot design. The value of μ is 0.01.
7 Conclusions
In this contribution, we have presented a quite general framework for MIMO training sequence design subject to flat and block fading, as well as spatially and temporally correlated Gaussian noise. The main contribution has been to incorporate the objective of the channel estimation into the design. We have shown that by a suitable approximation of $J(\tilde{\mathbf{H}},\mathbf{H})$, it is possible to solve this type of problem for several interesting applications such as standard MIMO channel estimation, L-optimality criterion, MMSE channel equalization, and ZF precoding. For these problems, we have numerically demonstrated the superiority of the schemes derived in this paper. Additionally, the proposed framework is valuable since it provides a universal way of posing different estimation-related problems in communication systems. We have seen that it shows interesting promise for, e.g., ZF precoding, and it may yield even greater end performance gains in estimation problems related to communication systems, when approximations can be avoided, depending on the end performance metric at hand.
Endnotes
^{a} The word 'dual' in this paper differs from the Lagrangian duality studied in the context of convex optimization theory (see [24] for more details on this type of duality).
^{b} For simplicity, we have assumed a zero-mean channel, but it is straightforward to extend the results to Rician fading channels, similar to [9].
^{c} We set the subscript Q to S_{ Q } to highlight its temporal nature and the fact that its size is B × B. The matrices with subscript T in this paper share the common characteristic that they are n_{ T } × n_{ T }, while those with subscript R are n_{ R } × n_{ R }.
^{d} For a Hermitian positive semidefinite matrix A, we consider here that A^{1/2} is the matrix with the same eigenvectors as A and eigenvalues as the square roots of the corresponding eigenvalues of A. With this definition of the square root of a Hermitian positive semidefinite matrix, it is clear that A^{1/2} = A^{H/2}, leading to A = A^{1/2}A^{H/2} = A^{H/2}A^{1/2}.
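As an illustration of the square root defined in endnote d, the following sketch computes A^{1/2} from an eigendecomposition and checks the stated properties on a random rank-deficient matrix. The function name `herm_psd_sqrt` and the test matrix are hypothetical.

```python
import numpy as np

def herm_psd_sqrt(a):
    """Square root of a Hermitian PSD matrix: same eigenvectors, square-rooted eigenvalues."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)            # remove tiny negative round-off
    return (v * np.sqrt(w)) @ v.conj().T  # scales column j of v by sqrt(w[j])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
a = x @ x.conj().T                        # 4 x 4 Hermitian PSD matrix of rank 2
s = herm_psd_sqrt(a)

is_hermitian = np.allclose(s, s.conj().T)  # A^{1/2} = A^{H/2}
squares_back = np.allclose(s @ s, a)       # A = A^{1/2} A^{H/2}
print(is_hermitian, squares_back)
```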
^{e} For simplicity, we use the MATLAB notation in this table.
Appendix 1
Approximating the performance measure for MMSE channel equalization
We have now obtained a quadratic form. Note indeed that the last two terms are just complex conjugates of each other and thus we can write them as two times their real part.
High SNR analysis
In order to obtain a simpler expression for ${\mathcal{I}}_{\text{adm}}$, we will assume high SNR in the data transmission phase. We consider the practically relevant case where rank(H) = min(n_{T}, n_{R}). Depending on the rank of the channel matrix H, we will have three different cases:
Case 1.
Case 2.
Case 3.
Low SNR analysis
Appendix 2
Proof of Theorem 1
For the proof of Theorem 1, we require some preliminary results. Lemmas 1 and 2 will be used to establish the uniqueness part of Theorem 1, and Lemma 3 is an extension of a standard result in majorization theory, which is used in the main part of the proof.
Lemma 1.
Let $\mathbf{D}\in {\mathbb{R}}^{n\times n}$ be a diagonal matrix with elements d_{1,1} > ⋯ > d_{n,n} > 0. If $\mathbf{U}\in {\mathbb{C}}^{n\times n}$ is a unitary matrix such that UDU^{H} has diagonal (d_{1,1}, …, d_{n,n}), then U is of the form U = diag(u_{1,1}, …, u_{n,n}), where |u_{i,i}| = 1 for i = 1, …, n. This also implies that UDU^{H} = D.
Proof.
However, since d_{1,1} > ⋯ > d_{n,n} > 0, the only way to satisfy this equation is to have |u_{1,1}| = 1 and u_{i,1} = 0 for i = 2, …, n. Now, if the assertion holds for i = 1, …, k, the orthogonality of the columns of U implies that u_{i,k+1} = 0 for i = 1, …, k, and by following a similar reasoning as for the case i = 1, we deduce that |u_{k+1,k+1}| = 1 and u_{i,k+1} = 0 for i = k + 2, …, n. □
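A small numerical illustration of Lemma 1, under hypothetical values for D: a diagonal unitary matrix with unit-modulus entries leaves D unchanged, whereas a generic unitary matrix changes the diagonal of UDU^{H}. This is only a sanity check of the statement, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([4.0, 3.0, 2.0, 1.0])   # hypothetical distinct, decreasing, positive entries
D = np.diag(d)

# Diagonal unitary matrix with unit-modulus entries: U D U^H equals D
phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=d.size))
U_diag = np.diag(phases)
same = U_diag @ D @ U_diag.conj().T

# Generic unitary matrix (Q factor of a random complex matrix): the diagonal changes
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))
other = Q @ D @ Q.conj().T

print(np.allclose(same, D))
print(np.round(np.diag(other).real, 3))
```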
Lemma 2.
Let $\mathbf{D}\in {\mathbb{R}}^{n\times n}$ be a diagonal matrix with elements d_{1,1} > ⋯ > d_{N,N} > 0. If $\mathbf{U}\in {\mathbb{C}}^{n\times n}$, with n ≤ N, is such that U^{H}U = I and $\mathbf{V}=\widetilde{\mathbf{D}}\mathbf{U}{\widetilde{\mathbf{D}}}^{-1}$ (where $\widetilde{\mathbf{D}}=\text{diag}({d}_{1,1},\dots ,{d}_{n,n})$) also satisfies V^{H}V = I, then U is of the form U = [diag(u_{1,1}, …, u_{n,n}) 0_{N-m,n}]^{T}, where |u_{i,i}| = 1 for i = 1, …, n.
Proof.
Since d_{1,1} > ⋯ > d_{N,N} > 0, the only way to satisfy this equation is to have |u_{1,1}| = 1 and u_{i,1} = 0 for i = 2, …, N. If now the assertion holds for columns 1 to k, the orthogonality of the columns of U implies that u_{i,k+1} = 0 for i = 1, …, k, and by following a similar reasoning as for the first column of U, we have that |u_{k+1,k+1}| = 1 and u_{i,k+1} = 0 for i = k + 2, …, N. □
Lemma 3.
Let $\mathbf{A},\mathbf{B}\in {\mathbb{C}}^{n\times n}$ be Hermitian matrices. Arrange the eigenvalues a_{1}, …, a_{n} of A in a descending order and the eigenvalues b_{1}, …, b_{n} of B in an ascending order. Then, $\text{tr}\,(\mathbf{A}\mathbf{B})\ge \sum _{i=1}^{n}{a}_{i}{b}_{i}$. Furthermore, if B = diag(b_{1}, …, b_{n}) and both matrices have distinct eigenvalues, then $\text{tr}\,(\mathbf{A}\mathbf{B})=\sum _{i=1}^{n}{a}_{i}{b}_{i}$ if and only if A = diag(a_{1}, …, a_{n}).
Proof.
unless (A)_{[i, i]} = a_{i} for every i = 1, …, n. Therefore, $\text{tr}(\mathbf{A}\mathbf{B})={\sum}_{i=1}^{n}{a}_{i}{b}_{i}$ if and only if the diagonal of A is (a_{1}, …, a_{n}). Now, we have to prove that A is actually diagonal, but this follows from Lemma 1. □
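The trace inequality of Lemma 3 can be checked numerically. The sketch below draws random Hermitian matrices (the sizes, seed, and helper names are hypothetical), compares tr(AB) with the sum of oppositely ordered eigenvalue products, and then evaluates the diagonal equality case.

```python
import numpy as np

def rand_hermitian(n, rng):
    """Random n x n Hermitian matrix (not necessarily positive semidefinite)."""
    x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (x + x.conj().T) / 2

def lower_bound(a_mat, b_mat):
    """Sum of a_i * b_i with a_i sorted descending and b_i sorted ascending."""
    a = np.sort(np.linalg.eigvalsh(a_mat))[::-1]
    b = np.sort(np.linalg.eigvalsh(b_mat))
    return float(a @ b)

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    a_mat, b_mat = rand_hermitian(4, rng), rand_hermitian(4, rng)
    gaps.append(np.trace(a_mat @ b_mat).real - lower_bound(a_mat, b_mat))
min_gap = min(gaps)   # should be nonnegative up to round-off

# Equality case: both matrices diagonal, with opposite orderings on the diagonal
a_diag = np.diag([4.0, 3.0, 2.0, 1.0])
b_diag = np.diag([0.5, 1.0, 1.5, 2.0])
eq_gap = np.trace(a_diag @ b_diag) - lower_bound(a_diag, b_diag)

print(min_gap, eq_gap)
```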
Proof of Theorem 1
With this problem formulation, it follows (from Sylvester’s law of inertia [35]) that we need m ≥ rank(D_{ B }) to achieve feasibility in the constraint (i.e., having at least as many nonzero singular values of Σ as nonzero eigenvalues in D_{ B }). This corresponds to the condition N ≥ rank(B) in the theorem.
where λ_{j}(·) denotes the jth largest eigenvalue. Equality is achieved if V = I, and observe that we can select V in this manner without affecting the constraint.
requiring m ≥ rank(D_{B}). Suppose that $\bar{\mathbf{U}}$ and $\bar{\mathbf{\Sigma}}$ minimize the cost. Then, we can replace $\bar{\mathbf{U}}$ by I and satisfy the constraint, without affecting the cost in (48). This means that there exists an optimal solution with U = I.
with D_{ P } as stated in the theorem.
Finally, we will show how to characterize all optimal solutions for the case when A and B have distinct nonzero eigenvalues (thus, m = n). The optimal solutions need to give equality in (48), and thus Lemma 3 gives that V Σ Σ^{H}V^{H} is diagonal and equal to Σ Σ^{H}. Lemma 1 then implies that V = diag(v_{1,1}, …, v_{n,n}) with |v_{i,i}| = 1 for i = 1, …, n.
where |u_{i,i}| = 1 for i = 1, …, n. Since U has to be unitary and its last N − n + 1 columns play no role in $\bar{\mathbf{P}}$ (due to the form of Σ), we can take them as [0_{n,N-m+1} I_{N-m+1}]^{T} without loss of generality.
Summarizing, an optimal solution is given by (23). When A and B have distinct eigenvalues, V and U can only multiply the columns of U_{ A } and U_{ B }, respectively, by complex scalars of unit magnitude.
Appendix 3
Proof of Theorems 2 and 3
Before proving Theorems 2 and 3, a lemma will be given that characterizes equivalences between different sets of feasible training matrices P.
Lemma 4.
Proof.
□
Hence, RHS⊆LHS.
which is a contradiction. Hence, LHS⊆RHS.
Proof of Theorem 2
Proof of Theorem 3
where the last equality follows from the fact that the left hand side is positive semidefinite.
Observe that this expression is identical to the constraint in (24), except that the positive semidefinite ${\mathcal{I}}_{T}$ has been replaced by ${[c{\mathcal{I}}_{T}-{\mathbf{R}}_{T}]}_{+}$. Thus, the equivalence follows directly from Theorem 2.
As in the previous case, the equivalence follows directly from Theorem 2.
Appendix 4
Proof of Theorem 4
while the objective value equals ${\left({\sum}_{i=1}^{{n}_{T}}\sqrt{{\alpha}_{i}}\right)}^{2}/\mathcal{P}$. Using Lemma 3, it can be seen that π and ϖ should correspond to opposite orderings of (Λ_{T})_{i,i}, (Λ_{Q})_{j,j}, i = 1, 2, …, n_{T}, j = 1, 2, …, B, respectively. Since B can be greater than n_{T}, the eigenvalues of ${\mathcal{I}}_{T}^{\prime}$ must be set in decreasing order and those of S^{′}_{Q} in increasing order.
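The objective value quoted above is what one obtains from a cost of the form Σ α_{i}/p_{i} minimized over positive powers p_{i} with Σ p_{i} = 𝒫. That form is an assumption made here for illustration; the paper's exact optimization problem is not reproduced in this section. Under it, the Cauchy-Schwarz inequality gives p_{i} proportional to √α_{i}, which the sketch below checks against random feasible allocations. The weights `alpha`, the budget `total`, and the function name are hypothetical.

```python
import numpy as np

def allocate(alpha, total):
    """Power split p_i proportional to sqrt(alpha_i), summing to `total` (assumed cost form)."""
    root = np.sqrt(alpha)
    return total * root / root.sum()

alpha = np.array([2.0, 0.5, 1.0, 0.1])   # hypothetical positive weights
total = 10.0                              # hypothetical training energy budget

p_opt = allocate(alpha, total)
cost_opt = float(np.sum(alpha / p_opt))
closed_form = float(np.sqrt(alpha).sum() ** 2 / total)

# Random feasible allocations (positive, summing to `total`) should never do better
rng = np.random.default_rng(0)
trials = rng.dirichlet(np.ones(alpha.size), size=2000) * total
cost_rand = (alpha / trials).sum(axis=1)

print(cost_opt, closed_form, cost_rand.min())
```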
Appendix 5
Proof of Theorem 5
where ${\mathbf{R}}_{T}^{\prime}={\mathbf{R}}_{T}^{T}$ with eigenvalue decomposition ${\mathbf{U}}_{T}^{\prime}{\mathit{\Lambda}}_{T}^{\prime}{\mathbf{U}}_{T}^{\mathrm{\prime H}}$. This objective function subject to the training energy constraint $\text{tr}({\mathbf{P}}^{\prime}{\mathbf{P}}^{\mathrm{\prime H}})\le \mathcal{P}$ seems very difficult to minimize analytically unless special assumptions are made.

R_{R} = S_{R}: Then, (63) becomes $\text{tr}\left\{{\left({\mathcal{I}}_{T}^{\prime -1/2}{\mathbf{R}}_{T}^{\prime -1}{\mathcal{I}}_{T}^{\prime -1/2}+{\mathcal{I}}_{T}^{\prime -1/2}{\mathbf{P}}^{\prime H}{\mathbf{S}}_{Q}^{\prime -1}{\mathbf{P}}^{\prime}{\mathcal{I}}_{T}^{\prime -1/2}\right)}^{-1}\otimes \left({\mathcal{I}}_{R}^{1/2}{\mathbf{R}}_{R}{\mathcal{I}}_{R}^{1/2}\right)\right\}.$ (64)

Using once more the fact that tr(A ⊗ B) = tr(A) tr(B) for square matrices A and B, it is clear from (64) that the optimal training matrix can be found by minimizing $\text{tr}\left\{{\left({\mathbf{R}}_{T}^{\prime -1}+{\mathbf{P}}^{\prime H}{\mathbf{S}}_{Q}^{\prime -1}{\mathbf{P}}^{\prime}\right)}^{-1}{\mathcal{I}}_{T}^{\prime}\right\}.$ (65)

Again, here some special assumptions may be of interest.

– ${\mathcal{I}}_{T}=\mathbf{I}$: Then, the optimal training matrix can be found by straightforward adjustment of Proposition 2 in [8].
– ${\mathbf{R}}_{T}^{-1}={\mathcal{I}}_{T}$: Then, (65) takes the form $\text{tr}\left\{{\left(\mathbf{I}+{\mathbf{R}}_{T}^{\prime 1/2}{\mathbf{P}}^{\prime H}{\mathbf{S}}_{Q}^{\prime -1}{\mathbf{P}}^{\prime}{\mathbf{R}}_{T}^{\prime 1/2}\right)}^{-1}\right\}.$ (66)
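Two algebraic steps used in this derivation can be verified numerically: the identity tr(A ⊗ B) = tr(A) tr(B), and the reduction of (65) to (66) when the weighting matrix equals the inverse transmit correlation. In the sketch below, `m` stands in for the product P′^{H} S′_{Q}^{-1} P′, and all matrices are random positive definite placeholders rather than quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_pd(n):
    """Random n x n Hermitian positive definite matrix (placeholder data)."""
    x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return x @ x.conj().T + np.eye(n)

# Identity tr(A kron B) = tr(A) tr(B) for square A and B
a, b = rand_pd(3), rand_pd(2)
kron_trace = np.trace(np.kron(a, b))
prod_trace = np.trace(a) * np.trace(b)

# Reduction of (65) to (66) when the weighting equals the inverse of R_T'
r_t = rand_pd(3)   # plays the role of R_T'
m = rand_pd(3)     # stands in for P'^H S_Q'^{-1} P'
w, v = np.linalg.eigh(r_t)
r_half = (v * np.sqrt(w)) @ v.conj().T   # Hermitian square root of R_T'
r_inv = np.linalg.inv(r_t)

val_65 = np.trace(np.linalg.inv(r_inv + m) @ r_inv).real
val_66 = np.trace(np.linalg.inv(np.eye(3) + r_half @ m @ r_half)).real

print(np.isclose(kron_trace, prod_trace), np.isclose(val_65, val_66))
```

The second check works because (R^{-1} + M)^{-1} R^{-1} = (I + RM)^{-1}, and I + RM is similar to I + R^{1/2} M R^{1/2}, so the two traces coincide.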