
Spatial attention and quantization-based contrastive learning framework for mmWave massive MIMO beam training


Deep learning (DL)-based beam training schemes have been exploited to improve spectral efficiency with fast optimal beam selection for millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems. To achieve high prediction accuracy, these DL models rely on training with a tremendous amount of labeled environmental measurements, such as mmWave channel state information (CSI). However, demanding a large volume of ground truth labels for beam training is inefficient and infeasible due to the high labeling cost and the requirement for expertise in practical mmWave massive MIMO systems. Meanwhile, a complex environment incurs critical performance degradation in the continuous output of beam training. In this paper, we propose a novel contrastive learning framework, named self-enhanced quantized phase-based transformer network (SE-QPTNet), for reliable beam training with only a small fraction of the labeled CSI dataset. We first develop a quantized phase-based transformer network (QPTNet) with a hierarchical structure to explore the essential features from frequency and spatial views and quantize the environmental components with a latent beam codebook to achieve robust representation. Next, we design the SE-QPTNet including self-enhanced pre-training and supervised beam training. SE-QPTNet pre-trains by the contrastive information of the target user and others with the unlabeled CSI, and then, it is utilized as the initialization to fine-tune with a reduced volume of labeled CSI. Finally, the experimental results show that the proposed framework improves beam prediction accuracy and data rates with 5% labeled data compared to existing solutions. Our proposed framework further enhances flexibility and breaks the limitation of the quantity of label information for practical beam training.

1 Introduction

Millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) is one of the most critical technologies in fifth-/sixth-generation (5G/6G) wireless communication systems due to its capacity to provide comprehensive spectrum and spatial resources for high transmission rate demands [1,2,3]. Benefiting from the short wavelength of mmWave signals, a massive number of antenna elements can be integrated into limited equipment size at both the base station (BS) and user equipment (UE) sides [4]. In addition, the massive MIMO array can compensate for the severe path loss of mmWave signals by highly directional beamforming, leading to stronger coverage, larger data rates, and improved reliability [5, 6].

Generally, mmWave massive MIMO arrays adaptively transmit signals via directional beams according to the wireless environment state [7,8,9,10,11]. To efficiently transmit beams with maximum gain, beam training is essential to identify the line-of-sight (LOS) or dominant channel path, yielding the optimal beam pair for the transceivers [7]. In practical beam training, candidate beams can be defined by a finite-size codebook covering the intended angle range and exhaustively searched to determine the optimal beam pair [8]. However, mmWave massive MIMO beam training is challenging due to the large codebook size, which results in high computational overhead. To reduce the training overhead, [9] proposes hierarchical multi-resolution codebook solutions, where a low-resolution sub-codebook detects the candidate transmitting direction, and the high-resolution codebook confirms the optimal beam pair. Another efficient beam training approach is interactive beam search. The schemes in [10] and [11] detect the direction of the LOS/dominant path from the mmWave channel estimation and select optimal beam pairs.

The performance of beam training schemes, including alignment accuracy and overhead, is highly dependent on the codebook design. The literature has shown that an adaptive hierarchical codebook can decide the codewords based on previous beam training results with multiple mainlobes covering a spatial region for one or more user equipments (UEs) [12]. In [13], the hierarchical codebook is efficiently generated by jointly exploiting sub-arrays with partially active antenna elements. An adaptive and sequential alignment scheme was proposed in [14], demonstrating the relation between fast search time and the probability of error in acquired beam directions through extrinsic Jensen–Shannon divergence. The study in [15] developed a fast beam-sweeping algorithm based on compressive sensing (CS) to determine the minimum number of measurements required.

1.1 Related work

The conventional schemes can satisfy user demands but can hardly learn from experience to further improve their ability. Deep learning (DL) has recently elevated the field of wireless systems research and beam training to new heights [16, 17]. These heuristic proposals depend on well-labeled channel state information (CSI) corresponding to the aligned gain of a pre-defined codebook to construct a supervised learning (SL) framework. In this framework, the optimal beam can be directly predicted based on the large volume of labeled knowledge to reduce training overhead and the effects of noise [18,19,20,21,22,23]. A beam selection scheme based on deep neural networks (DNN) was proposed in [18] that recognized the desired beam from the elementary relation of position and received signals with low alignment overhead. Meanwhile, the potential of DL in predicting the optimal mmWave beam and correcting blockage status based on the sub-6 GHz channel information was explored in [19] to reduce beam training overhead and achieve reliable communication. The authors of [20] trained a beamforming prediction network (BPNet) using supervised and unsupervised learning methods to optimize power allocation and predict virtual uplink beamforming (VUB) for improving computational efficiency. An online learning-based training strategy in [21] utilized a large-volume DNN to obtain the offline model parameters and fine-tune the DNN model according to the extra CSI measured in real time. An adaptive beam training scheme for calibration was proposed in [22] that estimated approximate CSI features by a convolutional neural network (CNN) and determined the optimal beam by self-criticism of the long short-term memory (LSTM). Moreover, a non-deterministic beam training was proposed in [23] that developed a binary coding scheme to represent the valid CSI and reduce the effect of noise. Although DL-based studies can obtain impressive achievements, severe multipath interference may impair prediction accuracy when the angles of paths in the local cluster are close to the dominant channel path.

CSI is informative in deciding the optimal beam by exploring the dominant path during the training process. In [24], the authors decomposed the multipath into several virtual components and used a hierarchical search to recover the dominant CSI for beam training. However, estimation performance highly relies on the training penalty. DL-based beamspace channel estimation was proposed in [25] to directly estimate the beamspace from the received signal, eliminating the need for a time-consuming beamforming process. A channel signature-based hybrid precoding design was proposed in [26] using DNN to estimate the channel and perform hybrid precoding with low computational complexity. Furthermore, a dual timescale variational framework for mmWave beam training and tracking [27] addressed beamforming direction in real time by training a deep recurrent variational autoencoder, taking into account both the historical channel information and the current channel conditions. In [28] and [29], LSTM is shown to further improve the ability of beam training by the implicit channel signature. In particular, [28] inferred the optimal beam directions at a target BS in future time slots depending on the historical channel features for mobile UE. The work in [30] indicated that spatial attention beam training can improve transmission reliability, and that the associative LSTM encoder captures explicit channel features to improve training ability.

The existing solutions exploiting the DL model for beam training share the following limitations. Firstly, most existing solutions perform inefficient feature extraction for frequency- and spatial-domain CSI. Direct vectorization for frequency and spatial information leads to a coarse representation. It is thus hard for DNN-based beam training to extract the CSI feature information effectively, resulting in low learning efficiency and beam prediction performance. Meanwhile, CNN-based models show strong ability on two-dimensional local feature extraction but suffer from a limited global view of beamspace awareness. In addition, sequence modeling can capture the relation of features from frequency and spatial domains but can hardly learn over an extensive range. Secondly, environmental measurements can affect the continuous output of the dense network. The continuous representation of CSI is sensitive to noise and channel variation, resulting in an incorrect beam prediction. Finally, the existing SL approaches require all CSI data to be labeled. However, labeling the large volume of CSI is unrealistic due to the high labeling cost and expertise requirement in practical mmWave massive MIMO systems. Although CSI is easily obtainable, handling the rapidly changing CSI when labeling the actual optimal beam is impractical. Therefore, the deficiency of labeled CSI can constrain the performance of existing DL-based beam training schemes.

1.2 Motivation

Typically, the CSI of massive MIMO contains the frequency dynamic response that exhibits spatial fluctuations in interconnected power grids. Because we concentrate on self-enhanced pre-training and SL with limited labeled data, it is critical to align the relation of preponderant paths corresponding to different subcarriers in order to explore beneficial features well. Since the training signals result from the transmitted signals and the propagation environment, tracking the UE movement and capturing the local feature couples have achieved promising results [22, 30]. However, these methods are incapable of obtaining a global view of spatial and frequency features. Consequently, we build on the benefits of existing works to develop a hierarchical DL architecture with two levels, where the first level tracks the spatial variation on different subcarriers and serves the second level, which explores the global view for efficient beam training.

Moreover, the complexity and diversity of the wireless environment are challenging issues for DL-based beam training. Continuous feature extraction results in an uncontrollable representation that is sensitive to random disturbance terms. Existing works address phase quantization mainly in three ways: by designing with real-valued phase shifts and then applying quantization [31], by constructing an analog beamforming codebook [32], and by nonlinear mapping into binary phase quantization [33]. They lack flexibility and robustness to noise and channel variations. To address this problem, we develop a configurable codebook to quantize the continuous spatial feature satisfying the equal AoD distribution in categorical beams. Specifically, discrete codewords can suppress the effects of noise and channel variations and reflect the dominant factor of the CSI. Thus, we can obtain controllable quantized results from the codebook to improve the robustness of the hierarchical DL architecture.

DL has been suggested as a promising approach to address the nonlinear relationship of CSI and optimal beam prediction. As indicated in [21, 30], DL-integrated beam prediction is typically performed under an SL framework with perfect CSI annotation. However, it is challenging to acquire the exhaustive annotation of massive CSI for DL-based beam training in realistic massive MIMO mmWave systems, which leads to high labeling costs and expertise requirements. [34] proposes an unsupervised method that performs the CSI reconstruction and accomplishes the online SL beam training with a large dataset of labeled CSI. It is inefficient for limited labeled CSI because of the uniqueness of the wireless environment, and it lacks extendability. We shed light on a novel contrastive learning framework with limited labeled CSI to mitigate the expertise requirements. Specifically, we leverage the uniqueness of the CSI of different UE locations to pre-train by identifying the contrastive information between the target UE and others.

1.3 Contributions

This paper proposes a novel contrastive learning framework, named self-enhanced quantized phase-based transformer network (SE-QPTNet), in mmWave massive MIMO systems to enable reliable beam training with CSI measurements underlying limited labels. We first design a hierarchical DL architecture, named quantized phase-based transformer network (QPTNet), which sequentially extracts frequency and spatial-domain features to enable effective representation. In order to perform a reliable spatial representation, we also develop a latent beam codebook to align the similar phase features of the frequency domain for exploring the environmental components. Then, we propose the contrastive learning framework SE-QPTNet extended from the QPTNet architecture. The SE-QPTNet framework is concerned with detecting the relationship between the global beam signature and the distinctive CSI corresponding to user locations using contrastive environmental prediction. By leveraging the contrastive information of the unlabeled CSI, SE-QPTNet is pre-trained as the initialization for fine-tuning with a limited amount of labels to provide more accurate performance. The main contributions of this paper can be summarized as follows:

  • We propose QPTNet, a hierarchical DL architecture with two levels, to enable effective representation of CSI in frequency and spatial domains. The first level applies a gated recurrent unit (GRU) to model and extract the dependencies among cluster information in the frequency domain. The second level utilizes the spatial attention mechanism to extract a global beam signature based on the results of quantization.

  • We design a codebook-based phase quantization to explore the complicated environmental components for reliable spatial representation. This quantization method can match and aggregate similar phase features with a codeword, improving the feature robustness from the effect of noise in the CSI. It converts the prior knowledge of continuous spatial features extracted from the frequency domain into the posterior knowledge of categorical beams.

  • We develop a contrastive learning framework SE-QPTNet benefiting from the hierarchical QPTNet and codebook-based phase quantization. This enhanced model further improves beam training accuracy with limited labeled CSI. To the best of our knowledge, this is the first study that introduces contrastive learning in beam training applications. SE-QPTNet offers two benefits based on contrastive environmental prediction. Firstly, it can pre-train without any label information by detecting the relationship between the global beam feature and positive/negative samples. Secondly, the similarity of a positive sample and beam signature can effectively capture the spatial dynamic changes under long inter-frequency spans. SE-QPTNet preserves the benefits of QPTNet and reduces the labeling cost.

The rest of the paper is organized as follows: Sect. 2 introduces the mmWave massive MIMO system model and the problem formulation. Section 3 explains the principle of spatial attention-associated beam prediction. Section 4 develops the hierarchical feature extraction scheme QPTNet and discusses the details of the codebook-based phase quantization and the proposed QPTNet. Section 5 proposes the contrastive learning framework SE-QPTNet and the procedure of self-enhanced pre-training. We then present the simulation results in Sect. 6, followed by a conclusion of this study in Sect. 7.

Notations: \(\varvec{A}\) is a matrix; \(\varvec{a}\) is a vector; \(a\) is a scalar; \((\cdot )^{T}\) and \((\cdot )^{H}\) denote transpose and conjugate transpose, respectively, while \(|\cdot |\) denotes the magnitude operator. \(\mathfrak {R}(\cdot )\) and \(\mathfrak {I}(\cdot )\) denote the real and imaginary parts of a complex number, respectively. \(\mathcal{C}\mathcal{N}(0,\varvec{\Sigma })\) represents the zero-mean complex Gaussian distribution with covariance matrix \(\varvec{\Sigma }\); \(\sigma (\cdot )\) denotes the activation function of the neural network.

2 Methods/experimental

The purpose of this study was to tackle the annotation costs and the spectral efficiency problem of mmWave massive MIMO systems with a novel contrastive learning framework, SE-QPTNet. We consider a system containing a BS equipped with massive MIMO communicating with the UE in a complex environment with limited label information. Taking advantage of the contrastive structure, the proposed framework SE-QPTNet pre-trains with unlabeled data and transforms the model to a hierarchical QPTNet with spatial attention-based feature extraction. The proposed SE-QPTNet can improve the learning efficiency, preserve the stability of feature extraction, reduce the requirements of expertise, and train with limited annotations. To analyze the performance of the framework experimentally, we generate the dataset by DeepMIMO and evaluate the proposal and competitive methods in terms of training loss, success rate, and achievable rate.

3 System model and problem formulation

Fig. 1 Overview of the proposed beam training schemes in mmWave massive MIMO system

It is worth emphasizing that our proposal can be extended to various communication scenarios. First, the proposed contrastive learning framework can be developed for the multi-user hybrid beamforming design since the beam direction is unique for each user separately to achieve fewer effects of multi-user interference, and the labeling cost is also intractable. Then, the proposed contrastive learning framework can be applied in the harsh wireless environment, e.g., the reconfigurable intelligent surface (RIS) scenario, where the beam direction of each user is aligned with the dominant AoD of RIS, while others are treated as negative.

The overview of the proposed beam training schemes is illustrated in Fig. 1. The proposed DL-integrated beam prediction is typically performed at the BS side to ensure fast prediction responses with high computational resources. For analytical simplicity, we consider downlink transmission between an mmWave massive MIMO BS and a single-antenna UE. For a two-dimensional (2D) mmWave channel where only azimuth angles are considered at both BS and UE, the Saleh–Valenzuela channel model is typically adopted, which can be formulated as

$$\begin{aligned} \varvec{H} = \sqrt{\frac{N_{t}N_{r}}{L}}\sum _{l=1}^{L} \beta _{l}\varvec{a}_{r}(N_{r},\theta _{l}) \varvec{a}^{H}_{t}(N_{t},\phi _{l}), \end{aligned}$$

where L, \(\beta _{l}\), \(\theta _{l}\), and \(\phi _{l}\) denote the number of channel paths, channel gain, angle-of-arrival (AoA), and angle-of-departure (AoD) of the lth channel path, respectively.

Since the first channel path, corresponding to the LOS path, is typically significant, recognizing the LOS path can be beneficial for improving the coverage of mmWave signals [30]. Although the number of resolvable channel paths is much smaller than the number of BS antennas, i.e., \(L \ll N_{t}\), it is still challenging to efficiently distinguish the LOS path because of the limited scattering of mmWave channels and the non-line-of-sight (NLOS) paths [35, 36]. The AoD and AoA of the lth path can be defined as \(\phi _{l} = 2d_{t}\sin \Phi _{l}/\lambda\) and \(\theta _{l} = 2d_{r}\sin \Theta _{l}/\lambda\), where \(\Phi _{l}\) and \(\Theta _{l}\) are the physical AoD and AoA of the lth path (LOS or NLOS), respectively; \(\lambda\) denotes the wavelength; \(d_{t}=d_{r}=\lambda /2\) are the antenna spacings at the BS and UE. In particular, both \(\Theta _{l}\) and \(\Phi _{l}\) follow a uniform distribution within \([-\frac{\pi }{2},\frac{\pi }{2}]\). The transmit and receive array steering vectors can be expressed as

$$\begin{aligned} \varvec{a}_{t}(N_{t},\phi _{l})&= \frac{1}{\sqrt{N_{t}}}[1,\textrm{e}^{j\pi \phi _{l}},\cdots , \textrm{e}^{j(N_{t}-1)\pi \phi _{l}}]^{T}, \end{aligned}$$
$$\begin{aligned} \varvec{a}_{r}(N_{r},\theta _{l})&= \frac{1}{\sqrt{N_{r}}}[1,\textrm{e}^{j\pi \theta _{l}},\cdots , \textrm{e}^{j(N_{r}-1)\pi \theta _{l}}]^{T}. \end{aligned}$$
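As a minimal numerical sketch of the channel model in (1) and the steering vectors in (2) and (3), the NumPy snippet below draws one narrowband channel realization. The function names, the unit-variance complex Gaussian path gains, and the example sizes are illustrative assumptions and are not specified in the text.

```python
import numpy as np

def steering_vector(n_antennas, spatial_angle):
    """Array response [1, e^{j*pi*a}, ..., e^{j*(N-1)*pi*a}]^T / sqrt(N)."""
    k = np.arange(n_antennas)
    return np.exp(1j * np.pi * k * spatial_angle) / np.sqrt(n_antennas)

def sv_channel(n_t, n_r, n_paths, rng):
    """Draw one Saleh-Valenzuela channel matrix H of shape (n_r, n_t)."""
    # Physical angles are uniform in [-pi/2, pi/2]; with half-wavelength
    # spacing the spatial angle equals the sine of the physical angle.
    phi = np.sin(rng.uniform(-np.pi / 2, np.pi / 2, n_paths))    # AoD
    theta = np.sin(rng.uniform(-np.pi / 2, np.pi / 2, n_paths))  # AoA
    # Assumption: unit-variance complex Gaussian path gains beta_l.
    beta = (rng.standard_normal(n_paths)
            + 1j * rng.standard_normal(n_paths)) / np.sqrt(2)
    H = np.zeros((n_r, n_t), dtype=complex)
    for l in range(n_paths):
        a_r = steering_vector(n_r, theta[l])
        a_t = steering_vector(n_t, phi[l])
        H += beta[l] * np.outer(a_r, a_t.conj())  # a_r a_t^H
    return np.sqrt(n_t * n_r / n_paths) * H

rng = np.random.default_rng(0)
H = sv_channel(n_t=64, n_r=1, n_paths=3, rng=rng)
print(H.shape)  # (1, 64)
```

Setting `n_r=1` mirrors the single-antenna UE considered here; a larger `n_r` gives the general form of (1).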

With the mmWave channel matrix \(\varvec{H}\) given in (1), the received signal can be described as

$$\begin{aligned} y = \sqrt{P}\varvec{\omega }^{H}\varvec{H}\varvec{f}x +\varvec{\omega }^{H}\varvec{n}, \end{aligned}$$

where P, \(\varvec{\omega }\in {\mathbb {C}}^{N_{r}\times {1}}\), and \(\varvec{f}\in {\mathbb {C}}^{N_{t}\times {1}}\) denote the transmit power, combiner, and beamformer, respectively. x is the transmitted data with unit power, i.e., \(\vert x \vert =1\), while \(\varvec{n} \sim \mathcal{C}\mathcal{N}(\varvec{0},\sigma ^{2}\varvec{I}_{N_{r}})\) denotes the additive white Gaussian noise (AWGN) vector with power \(\sigma ^{2}\). Typically, the beamformer and combiner do not increase or decrease the power gain, i.e., \(\Vert \varvec{\omega }\Vert _{2}=\Vert \varvec{f}\Vert _{2}=1\). The achievable rate can be described by

$$\begin{aligned} R = \log _{2}\left( 1+ \frac{P\vert \varvec{\omega }^{H}\varvec{H}\varvec{f} \vert ^2}{\sigma ^{2}} \right) . \end{aligned}$$

To get the maximum achievable rate for the given \(\varvec{H}\), we conventionally perform the beam training to construct the optimal beam pair by optimizing \(\varvec{f}\) and \(\varvec{\omega }\) before data transmission (as shown in Fig. 1). This optimization issue can be implemented by the pre-defined codebooks \({\mathcal {F}}\) and \({\mathcal {W}}\) as the following equation:

$$\begin{aligned} \{\varvec{f}^{op}, \varvec{\omega }^{op}\}= \mathop {\arg \max }_{\begin{array}{c} \varvec{f}\in {\mathcal {F}} \\ \varvec{\omega }\in {\mathcal {W}} \end{array}}\vert \varvec{\omega }^{H}\varvec{H}\varvec{f} \vert ^2. \end{aligned}$$

Practically, it is impossible to directly reveal the ideal pair of \(\varvec{f}^{op}\) and \(\varvec{\omega }^{op}\) since it is hard to handle the three independent matrices in (4).

A straightforward and exhaustive solution of (6) is to enumerate all possible candidate codewords of \(\varvec{f}\) and \(\varvec{\omega }\) to determine the optimal solution with the largest achievable rate through beam-sweeping; the overview is shown in Fig. 1. Conventionally, the beamformer \(\varvec{f}\) can be generated by a pre-defined codebook \({\mathcal {F}}\triangleq \{\varvec{f}_{n}, n=1,2,\dots , N_{t}\}\) that includes \(N_{t}\) codewords corresponding to different AoDs with the inherent transmitting spatial resolution. Identically, the combiner can be generated by \({\mathcal {W}}\triangleq \{\varvec{\omega }_{m},m=1,2,\dots ,N_{r}\}\) including \(N_{r}\) codewords corresponding to different AoAs with the inherent receiving spatial resolution. For each beam training test, the BS selects a codeword from \({\mathcal {F}}\) as the beamformer, which is aligned with a combiner from \({\mathcal {W}}\) at the UE side. Generally, the discrete Fourier transform (DFT) codebook is a feasible option to decide the candidate beamformer \(\varvec{f}_{n}\) and combiner \(\varvec{\omega }_{m}\):

$$\begin{aligned} \varvec{f}_{n}&= \frac{1}{\sqrt{N_{t}}}[1,\textrm{e}^{j\pi \sin {\xi _{{t},n}}},\cdots , \textrm{e}^{j\pi (N_{t}-1)\sin {\xi _{{t},n}}}]^{T}, \end{aligned}$$
$$\begin{aligned} \varvec{\omega }_{m}&= \frac{1}{\sqrt{N_{r}}}[1,\textrm{e}^{j\pi \sin {\xi _{{r},m}}}, \cdots ,\textrm{e}^{j\pi (N_{r}-1)\sin {\xi _{{r},m}}}]^{T}, \end{aligned}$$

where \(\xi _{{t},n}\) and \(\xi _{{r},m}\) are the beam directions of the nth candidate transmit beam at the BS and the mth candidate receive beam at the UE side. To span the whole angular domain at both BS and UE, \(\xi _{{t},n}\) and \(\xi _{{r},m}\) can be uniformly sampled in \((-\Xi _{t},\Xi _{t})\) and \((-\Xi _{r},\Xi _{r})\), i.e.,

$$\begin{aligned} \xi _{{t},n}&= \left( -1+\frac{2n-1}{N_{t}}\right) \Xi _{t}, \end{aligned}$$
$$\begin{aligned} \xi _{{r},m}&= \left( -1+\frac{2m-1}{N_{r}}\right) \Xi _{r}. \end{aligned}$$
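The codebook construction in (7) to (10) can be sketched in a few lines of NumPy. The helper names and the default angular range \(\Xi = \pi /2\) are assumptions made for illustration; the text does not fix a value for \(\Xi _{t}\) or \(\Xi _{r}\).

```python
import numpy as np

def beam_directions(n_beams, max_angle):
    """Uniformly sampled beam directions xi_n in (-max_angle, max_angle)."""
    n = np.arange(1, n_beams + 1)
    return (-1 + (2 * n - 1) / n_beams) * max_angle

def dft_codebook(n_antennas, max_angle=np.pi / 2):
    """Return an (N, N) matrix whose n-th column is the codeword for xi_n."""
    xi = beam_directions(n_antennas, max_angle)
    k = np.arange(n_antennas)[:, None]          # antenna index, column vector
    return np.exp(1j * np.pi * k * np.sin(xi)[None, :]) / np.sqrt(n_antennas)

F = dft_codebook(16)   # transmit codebook for N_t = 16 (example size)
print(F.shape)         # (16, 16)
```

Every codeword has unit norm, which matches the power constraint on the beamformer and combiner stated after (4).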

To find the optimal solution of (6), the searching range is \(K \triangleq N_{t}N_{r}\), while the candidate beam pair can be denoted as

$$\begin{aligned} \begin{aligned} {\mathcal {B}}&= \{ \varvec{b}_{k}|k = 1,2, \dots , K \}, \varvec{b}_{k} = \{\varvec{f}_{n},\varvec{\omega }_{m}\},\\&n =1,2,\dots , N_{t}, m=1,2,\dots , N_{r}. \end{aligned} \end{aligned}$$
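To make the exhaustive search over the \(K = N_{t}N_{r}\) candidate pairs concrete, the following sketch evaluates \(\vert \varvec{\omega }_{m}^{H}\varvec{H}\varvec{f}_{n} \vert ^2\) for every pair and reports the best one together with the rate in (5). The function names, the toy single-path channel, the assumed range \(\Xi = \pi /2\), and the default power and noise values are illustrative assumptions; indices in the code are 0-based.

```python
import numpy as np

def codebook(n_antennas, max_angle=np.pi / 2):
    """Codewords as columns, following (7)-(10); max_angle is an assumption."""
    n = np.arange(1, n_antennas + 1)
    xi = (-1 + (2 * n - 1) / n_antennas) * max_angle
    k = np.arange(n_antennas)[:, None]
    return np.exp(1j * np.pi * k * np.sin(xi)) / np.sqrt(n_antennas)

def exhaustive_beam_search(H, F, W, power=1.0, noise_var=1.0):
    """Sweep all (f_n, w_m) pairs and return the best pair and its rate."""
    # gains[m, n] = |w_m^H H f_n|^2 for every candidate pair.
    gains = np.abs(W.conj().T @ H @ F) ** 2
    m_opt, n_opt = np.unravel_index(np.argmax(gains), gains.shape)
    rate = np.log2(1 + power * gains[m_opt, n_opt] / noise_var)
    return int(n_opt), int(m_opt), float(rate)

n_t, n_r = 16, 4
F, W = codebook(n_t), codebook(n_r)
# Toy single-path channel aligned with transmit beam 5 and receive beam 1.
H = np.sqrt(n_t * n_r) * np.outer(W[:, 1], F[:, 5].conj())
n_opt, m_opt, rate = exhaustive_beam_search(H, F, W)
print(n_opt, m_opt, rate)
```

The cost of this sweep grows with \(N_{t}N_{r}\), which is the overhead that learning-based beam prediction aims to avoid.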

To evaluate the performance of beam training, the success rate is regarded as an important criterion in [13]. A trial whose selected index corresponds to the largest achievable rate is treated as successful; otherwise, the trial is considered to have failed. Hence, the success rate \(\gamma\) can be defined as the ratio of the number of successful trials \(N_\mathrm{{{Suc}}}\) to the total number of trials \(N_\mathrm{{{Tot}}}\):

$$\begin{aligned} \gamma = \frac{N_\mathrm{{{Suc}}}}{ N_\mathrm{{{Tot}}} }. \end{aligned}$$
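The success rate in (12) reduces to counting matches between predicted and optimal beam indices. A small sketch, with hypothetical index lists used purely as an example:

```python
import numpy as np

def success_rate(predicted_idx, optimal_idx):
    """Fraction of trials whose predicted beam index equals the optimal one."""
    predicted_idx = np.asarray(predicted_idx)
    optimal_idx = np.asarray(optimal_idx)
    return float(np.mean(predicted_idx == optimal_idx))

# Four trials, three of which pick the optimal beam.
print(success_rate([3, 7, 7, 1], [3, 7, 2, 1]))  # 0.75
```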

This paper proposes a novel contrastive learning-based SE-QPTNet for reliable beam training with low labeling cost and expertise requirements. Generally, the prediction of the optimal mmWave transmitting beam is operated at the BS side, and the same method can be easily extended to predict the optimal receiving beam. Considering severe multipath interference and inconsistent dominant beam prediction, we propose to adopt the phase quantization of periodically estimated CSI to reflect the relation of cluster AoDs/AoAs of the UE and predict the optimal mmWave beam when mmWave beam training is required. Since the mmWave channels are considered to have identical LOS AoD/AoA and NLOS cluster AoDs/AoAs, for AoDs we can rewrite the received signal (4) at the UE side as

$$\begin{aligned} \begin{aligned} y&= \sqrt{P}(\varvec{H}_\mathrm{{{LOS}}}+\varvec{H}_\mathrm{{{NLOS}}})\varvec{f}_{n}x + \varvec{n}\\&= \sqrt{P}\varvec{H}_\mathrm{{{LOS}}}\varvec{f}_{n}x + \underbrace{\sqrt{P}\varvec{H}_\mathrm{{{NLOS}}}\varvec{f}_{n}x + \varvec{n}}_{\varvec{n}_{eq}}, \end{aligned} \end{aligned}$$

where \(\varvec{n}_{eq}\) denotes the equivalent noise. Substituting (1) into (13) yields

$$\begin{aligned} y = \sqrt{\frac{N_{t}N_{r}P}{L}}\sum _{l=1}^{L} \beta _\mathrm{{{LOS}}}\underbrace{\varvec{a}^{H}_{t}(N_{t},\phi _\mathrm{{{LOS}}})\varvec{f}_{n}}_\mathrm{{{correlation}}}x + \varvec{n}_{eq}. \end{aligned}$$

We can quantify the correlation between the mmWave beam in (7) and the array steering vector in (2) as

$$\begin{aligned} \begin{aligned} q_{n}(N_{t},\phi _\mathrm{{{LOS}}})&= \varvec{a}_{t}^{H}(N_{t},\phi _\mathrm{{{LOS}}})\varvec{f}_{n}(\xi _{{t},n})\\&=\frac{1}{N_{t}}\frac{\sin \frac{\pi N_{t}\psi _{n}}{2}}{\sin \frac{\pi \psi _{n}}{2}}\textrm{e}^{j\pi \frac{N_{t}-1}{2}\psi _{n}}, \end{aligned} \end{aligned}$$

where \(\psi _{n}=\sin \xi _{{t},n}-\phi _\mathrm{{{LOS}}}\), with \(\phi _\mathrm{{{LOS}}}\) being the spatial AoD that enters the steering vector in (2). The quantity \(q_{n}(N_{t},\phi _\mathrm{{{LOS}}})\) illustrates the relation between the beam direction \(\xi _{{t,n}}\) and \(\phi _\mathrm{{{LOS}}}\), which is regarded as a quantization error [12]. However, the multipath interference and approximation of \(\phi _\mathrm{{{LOS}}}\) may lead to inaccurate predicted results.
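The closed form in (15) can be checked numerically against the direct inner product of the steering vector (2) and the codeword (7). In this sketch, \(\phi _\mathrm{{{LOS}}}\) is treated as the spatial angle that enters (2) directly, so \(\psi _{n}\) is the difference between \(\sin \xi _{{t},n}\) and that spatial angle; this reading, the function names, and the example angles are assumptions.

```python
import numpy as np

def beam_correlation_direct(n_t, xi, phi_los):
    """Compute a_t^H(phi_los) f_n(xi) element by element."""
    k = np.arange(n_t)
    a_t = np.exp(1j * np.pi * k * phi_los) / np.sqrt(n_t)      # steering vector
    f_n = np.exp(1j * np.pi * k * np.sin(xi)) / np.sqrt(n_t)   # codeword
    return np.vdot(a_t, f_n)  # vdot conjugates its first argument

def beam_correlation_closed(n_t, xi, phi_los):
    """Closed form: ratio of sines times a linear phase term (psi != 0)."""
    psi = np.sin(xi) - phi_los
    ratio = np.sin(np.pi * n_t * psi / 2) / np.sin(np.pi * psi / 2)
    return ratio * np.exp(1j * np.pi * (n_t - 1) * psi / 2) / n_t

n_t = 32
phi_los = np.sin(np.deg2rad(20.0))  # example spatial AoD
xi = np.deg2rad(25.0)               # example beam direction
q_direct = beam_correlation_direct(n_t, xi, phi_los)
q_closed = beam_correlation_closed(n_t, xi, phi_los)
print(abs(q_direct), abs(q_closed))
```

The magnitude approaches 1 as the beam direction approaches the LOS direction and decays quickly once \(\vert \psi _{n}\vert\) exceeds roughly \(2/N_{t}\), which is why a mismatch between codeword and path direction behaves like a quantization error.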

Since the number of candidate beams is finite, mmWave beam training is a multiclass classification task based on the beam codebook \({\mathcal {F}}\). Mathematically, the prediction is represented by the probability results of the training function \({\mathcal {Q}}(\cdot )\) as

$$\begin{aligned} n^{*} = \mathop {\arg \max }_{{n} \in \{ 1,2,\dots ,N_{t} \} }P_{n}(\varvec{c}_{t}|{\mathcal {Q}}(\varvec{H});\varvec{W}), \end{aligned}$$

where \(P_{n}\) is the predicted probability, and the optimal beam index \({n}^{*}\) is the one with the maximum probability in the output given the parameters \(\varvec{W}\).
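The classification view in (16) amounts to turning the network output into per-beam probabilities and picking the largest one. The sketch below assumes a softmax output layer, which the text does not state explicitly, and uses made-up scores in place of a trained network.

```python
import numpy as np

def predict_beam(logits):
    """Return the 1-based beam index n* and the per-beam probabilities."""
    z = logits - np.max(logits)            # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))  # softmax over the N_t beams
    return int(np.argmax(probs)) + 1, probs

# Hypothetical network outputs for N_t = 4 candidate beams.
scores = np.array([0.1, 2.0, -1.0, 0.5])
n_star, probs = predict_beam(scores)
print(n_star)  # 2
```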

In this paper, we propose a novel spatial attention and quantization-based contrastive learning framework to achieve high learning efficiency with lower labeled data requirements and greater robustness. The proposed SE-QPTNet framework includes two stages:

Training stage: We divide the training stage into two phases: self-enhanced pre-training and supervised training. Self-enhanced pre-training is an unsupervised procedure, where we only utilize the CSI as the training samples to acquire the knowledge of environment variations based on the contrastive learning framework. This solution concentrates on modeling the belonging relation between the CSI features and their global beam signature to enhance the uniqueness of the feature of the current user. In supervised training, limited training samples are collected from the CSI and the transmitting beam information. The optimal mmWave transmitting beam index decided by beam-sweeping is used as the classification label, and the corresponding CSI is used as input. Different from the existing works, we fine-tune the model from an initialization that carries the knowledge of environmental variations acquired in the pre-training phase.

Predicting stage: When mmWave beam training is required, the BS can predict the optimal beam with received signals depending on the well-trained model. According to our proposal, the labeling cost and requirement of expertise are flexible and beam prediction can achieve reliability under environmental variation.

For clarity, we first describe the spatial attention-associated beam prediction of the proposed hierarchical QPTNet. Then, we illustrate the details of codebook-based phase quantization for the spatial features and the procedure of QPTNet, including the first level for UE tracking corresponding to the different locations, quantization, and the beam prediction decided by spatial attention. We finally introduce the proposed contrastive learning framework SE-QPTNet and summarize the procedure of self-enhanced pre-training.

4 Spatial attention-associated beam prediction

This section presents the efficient spatial attention-associated beam prediction, depicted in Fig. 2. Efficient perception of the environment from the observed CSI can significantly improve beam prediction accuracy. Benefiting from the attention mechanism, we can infer the optimal beam response from implicit directions by its attention scoring [37]. Moreover, the attention mechanism is also good at simultaneously capturing features from entire implicit directions for a comprehensive beam signature.

Beam gain generation module: Since the initial spatial-frequency channel measurement \(\varvec{H}\) is complex-valued, we first normalize \(\varvec{H}\) by the maximum amplitude of its elements and convert it into the real-valued \(\varvec{\tilde{H}}\):

$$\begin{aligned} \varvec{\tilde{H}} =\left[ \mathfrak {R}\left( \frac{\varvec{H}}{\left\| \max (\varvec{H})\right\| }\right) ,\; \mathfrak {I}\left( \frac{\varvec{H}}{\left\| \max (\varvec{H})\right\| }\right) \right] . \end{aligned}$$

Then real-valued \(\varvec{\tilde{H}}\) concatenates the real and imaginary components in the spatial domain to simultaneously generate the beam gain. Finally, the input modified channel can be denoted as \(\varvec{\tilde{H}} \in {\mathbb {R}}^{N_{f}\times {2N_{t}}}\).
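The preprocessing in (17) can be sketched as follows, assuming the channel is stored as a complex array of shape \(N_{f}\times N_{t}\) (subcarriers by transmit antennas). The function name and the random stand-in channel are illustrative only.

```python
import numpy as np

def to_real_input(H):
    """Normalize a complex (N_f, N_t) channel and stack [Re, Im] -> (N_f, 2*N_t)."""
    H_norm = H / np.max(np.abs(H))   # scale by the largest element amplitude
    return np.concatenate([H_norm.real, H_norm.imag], axis=1)

rng = np.random.default_rng(0)
n_f, n_t = 8, 16
# Random stand-in for a measured spatial-frequency channel.
H = rng.standard_normal((n_f, n_t)) + 1j * rng.standard_normal((n_f, n_t))
H_tilde = to_real_input(H)
print(H_tilde.shape)  # (8, 32)
```

After normalization every entry of the real-valued input lies in \([-1, 1]\), which keeps the scale of the network input independent of the absolute channel gain.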

Fig. 2 Overview of spatial attention-associated beam prediction. The candidate beam responses can be performed by calculating the attention score with the beam gains and directions representation based on the channel signatures. And the predicted beam is decided by a scoring evaluation choosing the high pairing probability of beam gains and direction

To generate the transmitting beam gain in each antenna direction, we linearly project \(\varvec{\tilde{H}}\) onto the antenna space of dimension \(N_{t}\) through an embedding layer. To preserve the relative distance among the antennas, distinct beam gain embedding parameters are employed in the linear projection. The transmitting beam gains of \(\varvec{\tilde{H}}\) can be defined as

$$\begin{aligned} \varvec{B}_{g} = \texttt {embedding}(\varvec{\tilde{H}}), \end{aligned}$$

where \(\texttt {embedding}(\cdot )\) is a linear projection layer of dimension \(2N_{t} \times N_{t}\), and \(\varvec{B}_{g} \in {\mathbb {R}}^{N_{f}\times {N_{t}}}\).

Beam direction tagging module: To symbolize the inherent direction of the beamforming gains, we attach a beam direction tag \(\varvec{B}_{t}\) to each transmitting beam gain in (18). The tag \(\varvec{B}_{t} \in {\mathbb {R}}^{N_{f}\times {N_{t}}}\) is generated through a linear projection of each transmitter antenna index. Combining it with (18), the input of the transformer encoder is \(\varvec{{X}}=\varvec{B}_{g}+\varvec{B}_{t}\).

Stacked attention module: The transformer encoder learns an essential representation of the spatial-frequency features by applying the stacked attention module. In particular, the attention mechanism [37] can capture the relation to the specific LOS/dominant path direction while diminishing the effects of the NLOS/subordinate paths by traversing all transmission antennas with the beam direction query matrix \(\varvec{Q}\), beam gains key matrix \(\varvec{K}\), and corresponding scoring value matrix \(\varvec{V}\). The stacked attention module focuses on the mutually important alignment between the candidate beam gains and the beam directions of the local antenna, learning which specific AoD information is more competitive than others, depending on the limited number of scatterers of the LOS and NLOS paths. The query matrix \(\varvec{Q}\), key matrix \(\varvec{K}\), and value matrix \(\varvec{V}\) are generated by wide fully connected (FC) layers with input signal \(\varvec{X}\) as

$$\begin{aligned} \begin{aligned} \varvec{Q}&= \varvec{W}^{q_{i}}\varvec{X},\\ \varvec{K}&= \varvec{W}^{k_{i}}\varvec{X},\\ \varvec{V}&= \varvec{W}^{v_{i}}\varvec{X}, \end{aligned} \end{aligned}$$

where \(\varvec{W}^{q_{i}},\varvec{W}^{k_{i}},\varvec{W}^{v_{i}}\) are the linear projection layers in the \(i\)th attention module. The attention operation can be viewed as the scaled coherence function shown in Fig. 2, which maps a beam direction query and a set of candidate beam gain pairs to a dependency relation. Specifically, we compute the dot product of the beam direction query matrix \(\varvec{Q}\) with the beam gains key matrix \(\varvec{K}\) and apply a \(\texttt {softmax}(\cdot )\) activation to obtain the pairing probabilities, which weight the scoring value matrix \(\varvec{V}\), so that the beam gains can be precisely aligned with the LOS direction. The output is then computed by

$$\begin{aligned} \texttt {Attention}(\varvec{Q},\varvec{K},\varvec{V}) = \texttt {softmax}\left(\frac{\varvec{Q}\varvec{K}^{\textrm{T}}}{\sqrt{N_{t}}}\right)\varvec{V}. \end{aligned}$$

The candidate pairs context of the beam gains and direction can be extracted by the forward stacked attention process and decided with the relevant score, which comes from the softmax probability result. Moreover, the superior collaboration of beam gains and direction can effectively improve the beam training performance.
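The attention computation above can be sketched for a single attention module. This is an illustrative sketch, not the authors' implementation: it treats each row of \(\varvec{X}\) as a token and applies the weights on the right (\(\varvec{X}\varvec{W}\)), which is one common convention, whereas the text writes \(\varvec{W}\varvec{X}\). The random weights and the helper names `matmul`, `softmax_rows`, and `rand_matrix` are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out


def attention(X, Wq, Wk, Wv):
    """Single-module scaled dot-product attention over the rows of X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    n_t = len(X[0])  # scaling dimension N_t
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(n_t) for s in row] for row in matmul(Q, K_T)]
    P = softmax_rows(scores)  # pairing probabilities
    return matmul(P, V), P


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N_f, N_t = 4, 3
B_g, B_t = rand_matrix(N_f, N_t), rand_matrix(N_f, N_t)
# Encoder input X = B_g + B_t (beam gains plus direction tags).
X = [[g + t for g, t in zip(rg, rt)] for rg, rt in zip(B_g, B_t)]
out, P = attention(X, rand_matrix(N_t, N_t), rand_matrix(N_t, N_t), rand_matrix(N_t, N_t))
```

Each row of `P` sums to one, so every output row is a convex combination of the rows of the value matrix, weighted by how strongly the corresponding query matches each key.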

Output module: Different from the dense FC layer-based prediction scheme, global average pooling (GAP) [38] is introduced to average the tokens from the stacked attention operations over the candidate beams, which can be written as

$$\begin{aligned} \varvec{c}_{t} = \frac{1}{D}\displaystyle \sum _{d=1}^{D}\varvec{c}_{d}, \end{aligned}$$

where \(\varvec{c}_{d} \in {\mathbb {R}}^{N_{t}\times {1}}, d = 1,2, \cdots , D\) is the output from the stacked attention module, and the size of beam signature vector \(\varvec{c}_{t} \in {\mathbb {R}}^{N_{t}\times {1}}\). One advantage of GAP over the FC layer is that it summarizes the aligned information between beam gains and transmitting direction with the weighted average results; thus, it is more robust to spatial translations of the fusion local features of gains and direction. In addition, there are no extra parameters to optimize in the GAP layer, and overfitting can be avoided. The resulting vector \(\varvec{c}_{t}\) is directly fed into the \(\texttt {softmax}(\cdot )\) layer, and the optimal beam can be chosen with the maximum probability of the global beam signature \(\varvec{c}_{t}\)

$$\begin{aligned} \begin{aligned} \varvec{p}&= \texttt {softmax}(\varvec{c}_{t}), \\ n^{*}&= \mathop {\arg \max }_{{{n}}\in \{1,2,\dots ,N_{t}\}} {p}_{n}, \end{aligned} \end{aligned}$$

where \(\varvec{p}\) is the predicted probability and \(p_{n}\) is the element of \(\varvec{p}\). To train our model, the cross-entropy loss is applied as the evaluation metric for the classification problem. The cross-entropy loss can be expressed as

$$\begin{aligned} {\mathcal {L}}_\mathrm{{{cls}}} = -\sum _{n=1}^{N_{t}}y_{n}\log ({p}_{n}), \end{aligned}$$

where \(y_{n}\) is the one-hot classification label of the actual optimal mmWave beam: \(y_{n}=1\) if beam \(n\) is the optimal beam, and \(y_{n}=0\) otherwise.
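The output module and the classification loss can be sketched as follows. This is a minimal sketch with toy token values; the function names are hypothetical, and the beam index is zero-based here, whereas the text counts beams from 1 to \(N_{t}\).

```python
import math


def gap(tokens):
    """Global average pooling: average D token vectors of length N_t."""
    D = len(tokens)
    return [sum(col) / D for col in zip(*tokens)]


def softmax(c):
    m = max(c)
    e = [math.exp(v - m) for v in c]
    s = sum(e)
    return [v / s for v in e]


def predict_beam(tokens):
    """Return the zero-based index of the most probable beam and the probabilities."""
    p = softmax(gap(tokens))
    n_star = max(range(len(p)), key=p.__getitem__)
    return n_star, p


def cross_entropy(p, label):
    """Cross-entropy against a one-hot label; only the true-beam term survives."""
    return -math.log(p[label])


# D = 2 tokens over N_t = 4 candidate beams (illustrative values).
tokens = [[0.1, 2.0, -1.0, 0.3],
          [0.0, 1.0, 0.5, 0.2]]
n_star, p = predict_beam(tokens)
loss = cross_entropy(p, label=1)
print(n_star)  # 1
```

Because the label is one-hot, the sum in the loss collapses to the negative log-probability assigned to the true beam, which is what `cross_entropy` computes.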

5 QPTNet-associated beam training

Fig. 3

Overview of the proposed QPTNet. CSI is separately processed along the frequency subcarriers to capture the continuous spatial feature \(\varvec{z}_{l}\) via the GRU encoder. A latent beam codebook handles the perception of environmental variations by exploring the relation between the categorical beam \(\varvec{g}^{*}_{i}\) and the continuous spatial feature. Moreover, we reconstruct the channel data based on the selected latent beams via a GRU decoder to update the codebook and hold the consistency of channel and beamspace. The selected latent beams are fed into a spatial attention-associated beam prediction. \(\varvec{c}_t\) indicates the global beam signature

This section presents the hierarchical QPTNet for efficiently processing the CSI. QPTNet has two hierarchical levels and a codebook-based phase quantization procedure. The first level is an autoregressive encoder based on GRU, and the second level is a spatial attention encoder based on the transformer. The output of the GRU encoder can be quantized with a generated beam codebook. Finally, the spatial attention encoder extracts a global beam signature. The overview of QPTNet is illustrated in Fig. 3.

5.1 Codebook-based phase quantization

To analyze the components of the propagation environment, we attempt to capture the implicit relation of cluster AoDs by a discrete phase-based quantization method. We define a latent beam codebook \({\mathcal {G}}\) including \(N_{CB}\) codewords \(\varvec{g}_{i} \in {\mathbb {C}}^{N_{t} \times 1}, i = 1,2, \cdots , N_{CB}\) (i.e., \(N_{CB}\)-way categorical beams). The latent beamspace is initialized with randomly sampled angles \({\bar{\phi }}_{i} \in [-\frac{\pi }{2},\frac{\pi }{2}]\), and the normalized spatial frequency \({u}_{i}\) is defined as

$$\begin{aligned} {u}_{i} = \frac{d\sin {{\bar{\phi }}_{i}}}{\lambda }, \end{aligned}$$

where \(u_{i} \in [-1/2,1/2]\) for \(\lambda /2\) element spacing. Intuitively, the beam vector can be generated by the DFT of \(\varvec{u}\) at points separated by \(1/N_{CB}\). Note that \(N_{CB} \ge N_{t}\) and there are \(N_{CB}\) categorical beam vectors \(\varvec{g}_{i}\). Thus, the latent beamspace can be described as

$$\begin{aligned} \varvec{g}_{i} = \frac{1}{\sqrt{N_{t}}}[1,\textrm{e}^{-j2\pi u_{i}},\cdots ,\textrm{e}^{-j2\pi (N_{t}-1)u_{i}}]^{T}. \end{aligned}$$

Since we only consider the spatial power spectrum, the elements of (25) are adjustable with the network training.
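The codebook initialization can be sketched as follows. This is an illustrative sketch that assumes \(\lambda /2\) element spacing and gives each codeword \(N_{t}\) entries, matching the stated dimension \(\varvec{g}_{i} \in {\mathbb {C}}^{N_{t} \times 1}\); the function name and the fixed random seed are hypothetical, and the subsequent codeword updates during training are not shown.

```python
import cmath
import math
import random


def latent_beam_codebook(n_cb, n_t, spacing_over_lambda=0.5, seed=0):
    """Build n_cb steering-vector codewords of length n_t from random angles."""
    rng = random.Random(seed)
    codebook = []
    for _ in range(n_cb):
        phi = rng.uniform(-math.pi / 2, math.pi / 2)  # sampled angle
        u = spacing_over_lambda * math.sin(phi)       # normalized spatial frequency
        g = [cmath.exp(-2j * math.pi * n * u) / math.sqrt(n_t) for n in range(n_t)]
        codebook.append(g)
    return codebook


codebook = latent_beam_codebook(n_cb=8, n_t=4)  # N_CB >= N_t, as noted in the text
```

With the \(1/\sqrt{N_{t}}\) scaling, every codeword has unit norm, so the initial distances between features and codewords depend only on direction and not on codeword power.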

As shown in Fig. 3(a) and (b), the GRU encoder yields a continuous spatial feature matrix \(\varvec{Z}\) with column vectors \(\varvec{z}_{l} \in {\mathbb {C}}^{N_{t} \times 1}, l = 1,2, \cdots , N_{f}\). Next, we quantize the continuous feature \(\varvec{z}_{l}\) into the categorical beam \(\varvec{g}_{i^{*}}\) based on the Euclidean distance \(d_{l,i}\). The categorical beam index is expressed as

$$\begin{aligned} i^{*}=\mathop {\arg \min }_{i}\underbrace{\Vert {\varvec{z}_{{l}}-\varvec{g}_{i}}\Vert _{2}}_{d_{l,i}}. \end{aligned}$$

The minimal \(d_{l,i^{*}}\) ensures that the continuous spatial features corresponding to the categorical beams are within a neighboring zone of the latent beamspace. Meanwhile, the distance calculation can efficiently reflect the relation of cluster AoDs to represent the components of the environment. The quantization approach also performs clustering by mapping each feature to its nearest codeword. The posterior categorical beam distribution \(p(\varvec{\hat{z}}_{l}= \varvec{g}_{i} |\varvec{\tilde{H}},\varvec{\bar{\phi }}_{i})\) is formulated as

$$\begin{aligned} p(\varvec{\hat{z}}_{l}= \varvec{g}_{i} |\varvec{\tilde{H}},\varvec{\bar{\phi }}_{i}) = {\left\{ \begin{array}{ll}1,~&{}{\text { if }}~ i=i^{*}, \\ 0,~&{}{\text { else.}}\end{array}\right. }\ \end{aligned}$$

During the forward pass, we define \(\varvec{\hat{Z}} = [\varvec{\hat{z}}_{1}, \varvec{\hat{z}}_{2},\cdots , \varvec{\hat{z}}_{N_{f}}]\) to represent the discrete angular beam responses. The vectors \(\varvec{\hat{z}}_{l} = \varvec{g}_{i^{*}}\) are then selected corresponding to the approximate AoDs \(\varvec{\bar{\phi }}_{i}\). Different from the setup in Section 3, we acquire the optimal beam prediction based on the attention scoring results of \(\varvec{\hat{Z}}\) to enhance the ability of environmental representation.
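The nearest-codeword quantization can be sketched as follows. The toy codebook and features are illustrative only and are not steering vectors; the function name `quantize` is hypothetical, and indices are zero-based.

```python
import math


def quantize(Z, codebook):
    """Map each feature vector to its nearest codeword in Euclidean distance."""
    indices, Z_hat = [], []
    for z in Z:
        dists = [math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(z, g)))
                 for g in codebook]
        i_star = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(i_star)
        Z_hat.append(codebook[i_star])  # posterior is one-hot at i_star
    return indices, Z_hat


codebook = [[1 + 0j, 0 + 0j], [0 + 0j, 1 + 0j], [0 + 1j, 0 + 0j]]
Z = [[0.9 + 0.1j, 0.1 + 0j], [0.1 + 0j, 0.8 - 0.1j], [0.0 + 0.7j, 0.2 + 0j]]
indices, Z_hat = quantize(Z, codebook)
print(indices)  # [0, 1, 2]
```

The third feature shows why complex distances matter: it is closest to the purely imaginary codeword even though its magnitude profile resembles that of the first codeword.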

5.2 Hierarchical learning procedure

The proposed QPTNet contains two hierarchical levels for extracting features from the frequency and spatial domains. For the first level, we apply a GRU encoder to extract the relations between the two domains and the spatial feature \(\varvec{Z}\). We employ the GRU module as the instant feature extractor because of its high training efficiency and its similarity to the LSTM network in temporal sequence learning. The GRU encoder synchronizes the CSI estimation through its gate mechanism, which controls the spatial feature across the different subcarriers. In processing \(\varvec{\tilde{H}}\) at each subcarrier index, the GRU encoder takes in the channel delay response at the current subcarrier together with the shared hidden state and the output from the previous subcarrier step \((\varvec{z}_{l-1},\varvec{\Theta })\), and updates its hidden state \(\varvec{\Theta }\) at the current subcarrier index, resulting in a new \(\varvec{z}_{l}\). The process can be described by

$$\begin{aligned} \varvec{z}_{l} ={\mathbb {P}}(\varvec{\tilde{H}}_{f_{l}} \mid \varvec{\tilde{H}}_{f_{l-1}},\cdots ,\varvec{\tilde{H}}_{f_{2}},\varvec{\tilde{H}}_{f_{1}},\varvec{\Theta }), \end{aligned}$$

where \({\mathbb {P}}(\cdot )\) represents the conditional probability distribution over the subcarriers. Once the continuous spatial feature \(\varvec{z}_{l}\) is extracted, we quantize it by a latent beam codebook to perceive the environmental component. According to (26) and (27), we perform a nearest neighbor search to decide the posterior latent beam vector related to the discrete angular index \(\varvec{\bar{\phi }}_{i^{*}}\).
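The subcarrier-by-subcarrier recursion can be sketched with a plain GRU cell. The text does not list the gate equations, so this sketch assumes a common GRU formulation without bias terms; the parameter names (`Wr`, `Ur`, and so on), the random weights, and the real-valued hidden state of size 3 are all hypothetical simplifications.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h):
    """Compute W x + U h for matrices stored as lists of rows."""
    return [sum(w * a for w, a in zip(wr, x)) + sum(u * b for u, b in zip(ur, h))
            for wr, ur in zip(W, U)]


def gru_step(x, h, p):
    """One GRU update: x is the current subcarrier input, h the previous state."""
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h)]      # reset gate
    u = [sigmoid(v) for v in affine(p["Wu"], x, p["Uu"], h)]      # update gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in affine(p["Wn"], x, p["Un"], rh)]   # candidate state
    return [(1 - ui) * ni + ui * hi for ui, ni, hi in zip(u, n, h)]


def gru_encode(H_tilde, p, hidden):
    """Run the cell over subcarriers l = 1..N_f and collect the features z_l."""
    h = [0.0] * hidden
    Z = []
    for x in H_tilde:
        h = gru_step(x, h, p)
        Z.append(h)
    return Z


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N_f, in_dim, hidden = 4, 6, 3
params = {k: rand_matrix(hidden, in_dim) for k in ("Wr", "Wu", "Wn")}
params.update({k: rand_matrix(hidden, hidden) for k in ("Ur", "Uu", "Un")})
H_tilde = rand_matrix(N_f, in_dim)
Z = gru_encode(H_tilde, params, hidden)
```

Each \(\varvec{z}_{l}\) therefore depends on all subcarriers up to \(l\) through the carried hidden state, which is the conditioning expressed in the equation above.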

The input to the decoder corresponds to the selected latent beam matrix \(\varvec{\hat{Z}}\). We reconstruct the channel matrix based on \(\varvec{\hat{Z}}\) to guarantee the consistency between the latent beamspace and channel space. The reconstruction loss is formulated as

$$\begin{aligned} {\mathcal {L}}_\mathrm{{{recon}}} ={\mathbb {E}}\{ {\frac{1}{N}}\displaystyle \sum _{n=1}^{N} \Vert {{\mathcal {F}}_\mathrm{{{enc}}}(\varvec{\tilde{H}})- {\mathcal {F}}_\mathrm{{{dec}}}(\varvec{\hat{Z}}) }\Vert ^{2}_{2}/\Vert {{\mathcal {F}}_{dec}(\varvec{\hat{Z}})}\Vert ^{2}_{2} \}. \end{aligned}$$

Here, \({\mathcal {F}}_\mathrm{{{enc}}}\) and \({\mathcal {F}}_{dec}\) are the GRU encoder and decoder mappings, respectively, and the reconstructed channel is \(\varvec{\hat{H}}= {\mathcal {F}}_{dec}(\varvec{\hat{Z}})\). The forward computation pipeline can be regarded as a regular autoencoder with a specific nonlinearity that maps to the 1-of-\(N_{CB}\) latent beam vectors. The optimization of QPTNet follows the back-propagation procedure in [39]. To make sure the encoder commits to the latent beamspace and its output does not grow, we add a commitment loss. Thus, the total training loss becomes

$$\begin{aligned} {\mathcal {L}}_{total} = {\mathcal {L}}_\mathrm{{{cls}}} + {\mathcal {L}}_\mathrm{{{recon}}}+\Vert {sg[{\mathcal {F}}_\mathrm{{{enc}}}(\varvec{\tilde{H}})]-\varvec{G}}\Vert ^{2}_{2} +\beta \Vert {{\mathcal {F}}_{enc}(\varvec{\tilde{H}})-sg[\varvec{G}]}\Vert ^{2}_{2} , \end{aligned}$$

where \(sg[\cdot ]\) is the stop-gradient operator, defined as the identity in the forward computation and having zero partial derivatives, \(\varvec{G}= [\varvec{g}_{1},\varvec{g}_{2},\cdots ,\varvec{g}_{N_{CB}}]\), and \(\beta\) is a hyperparameter. The first term is the beam prediction loss in (23); its optimization does not change the latent beam codebook because of the nonlinearity of quantization. The second term is the CSI reconstruction loss used to update the latent beam codebook. The third term uses the \(l_{2}\) norm loss to move the latent beam codebook vectors close to the encoder output \({\mathcal {F}}_{enc}(\varvec{\tilde{H}})\), and the fourth term ensures that the encoder output stays close to the selected codeword.
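To make the role of the stop-gradient operator concrete, the gradients of the last two terms can be written out for a single feature vector. This is a sketch under two assumptions that the text does not state: the vectors are treated as real-valued, and the loss is evaluated per subcarrier between the encoder output \(\varvec{z}_{l}\) and its selected codeword \(\varvec{g}_{i^{*}}\):

$$\begin{aligned} \frac{\partial }{\partial \varvec{g}_{i^{*}}}\Vert sg[\varvec{z}_{l}]-\varvec{g}_{i^{*}}\Vert ^{2}_{2}&= -2(\varvec{z}_{l}-\varvec{g}_{i^{*}}),&\frac{\partial }{\partial \varvec{z}_{l}}\Vert sg[\varvec{z}_{l}]-\varvec{g}_{i^{*}}\Vert ^{2}_{2}&= \varvec{0},\\ \frac{\partial }{\partial \varvec{z}_{l}}\beta \Vert \varvec{z}_{l}-sg[\varvec{g}_{i^{*}}]\Vert ^{2}_{2}&= 2\beta (\varvec{z}_{l}-\varvec{g}_{i^{*}}),&\frac{\partial }{\partial \varvec{g}_{i^{*}}}\beta \Vert \varvec{z}_{l}-sg[\varvec{g}_{i^{*}}]\Vert ^{2}_{2}&= \varvec{0}. \end{aligned}$$

Under these assumptions, a gradient step on the third term moves only the codeword toward the encoder output, while a step on the fourth term moves only the encoder output toward the codeword, scaled by \(\beta\).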

6 SE-QPTNet-associated beam training

Fig. 4

Overview of the proposed SE-QPTNet. The proposed SE-QPTNet compensates for the effects of propagation delay by separately dealing with the sub-antenna arrays. Practically, the intermediate beam codes \(\varvec{\hat{z}}\) are divided into \(\varvec{\hat{z}}_{\le t}\) as small-scale parts and \(\varvec{\hat{z}}_{>t}\) as large-scale parts by an anchor t. We feed \(\varvec{\hat{z}}_{\le t}\) to the spatial attention-associated beam prediction for identifying the beam signature vector \(\varvec{c}_{t}\). Contrastive environmental prediction measures the similarity between \(\varvec{c}_{t}\) and the reconstructed \(\varvec{\hat{H}}_\mathrm{{{recon}}}\), and makes the negative samples \(\varvec{\hat{H}}_\mathrm{{{neg}}}\) dissimilar to its beam signature

This section proposes SE-QPTNet to provide an unsupervised pre-training strategy for improving the performance of QPTNet given the small training dataset. In SE-QPTNet, we construct positive and negative samples based on the anchor operation. Then, the pre-training process can be completed by detecting the relationship among a positive sample of target UE, negative samples of others, and global beam signature with the contrastive environmental prediction. Therefore, SE-QPTNet can constantly perceive the environment component under a contrastive learning framework to further improve the performance of beam training. Figure 4 describes the procedure of self-enhanced pre-training in SE-QPTNet.

To pre-train the proposed SE-QPTNet under the contrastive learning framework, we present an anchor operation to acquire the positive and negative samples. Specifically, we obtain forward latent beam features by splitting the outputs of quantization at the anchor t and explore the global beam signature \(\varvec{c}_{t}\) based on attention scoring evaluation. Meanwhile, the reconstructed channel data streams recovered by the GRU decoder are divided into contrastive samples of length k, in which the positive sample is determined by the CSI of the target UE location while the negative samples are determined by the others. SE-QPTNet enhances the similarity between the target UE and the prediction of \(\varvec{c}_{t}\) while making the other UE responses dissimilar through the contrastive environmental prediction. Benefiting from this distinction, we can facilitate latent angular learning within the latent beams corresponding to the dominant path, discard the noise, and enhance the representation of the environmental component.

Considering the mmWave massive MIMO system, the amplitudes and phases of the received signals suffer from spatial differences due to a significant fraction of propagation delay [40]. To achieve robustness, the proposed SE-QPTNet compensates for the effects of propagation delay by separately dealing with the selected latent beams. We denote the global view \(\varvec{\hat{Z}}\) output from the quantization for notational convenience. We select an anchor t and define \(\varvec{\hat{Z}}_{\le t}\) as the forward component. Taking inspiration from recent advancements [41], we introduce a token \(\varvec{\hat{z}}_{0}\) at the beginning of \(\varvec{\hat{Z}}_{\le t}\) and feed them to the spatial attention-associated beam prediction to identify the beam signature vector \(\varvec{c}_{t}\). We then predict the k reconstructed channels by a linear mapping \(\varvec{\bar{H}}_{t+k} = \varvec{W}_{t+k} \varvec{c}_{t}\), where \(\varvec{W}_{t+k}\) denotes the weight of the linear mapping \({\mathcal {T}}\). If the large-scale dense channel signatures can be successfully predicted, it indicates that the beam signature vector \(\varvec{c}_{t}\) contains a global and robust view of the propagation environment. By combining the positive and negative samples, the loss function of contrastive environmental prediction is

$$\begin{aligned} {\mathcal {L}}_{N} = - \sum _{k} \frac{\exp \left( \varvec{\hat{H}}_{t+k}^T \varvec{\bar{H}}_{t+k} \right) }{\exp \left( \varvec{\hat{H}}_{t+k}^T \varvec{\bar{H}}_{t+k} \right) + \sum _{l} \exp \left( \varvec{\hat{H}}_{t+k}^T \varvec{\bar{H}}_\mathrm{{{neg,l}}} \right) }, \end{aligned}$$

where \(\varvec{\bar{H}}_\mathrm{{{neg,l}}}\) denotes negative samples taken from other channel data, and the positive sample is \(\varvec{\hat{H}}_{t+k}\). Considering the latent beam codebook updating, the total pre-training loss is expressed as

$$\begin{aligned} {\mathcal {L}}_\mathrm{{{pre}}} = {\mathcal {L}}_{N}+\Vert {sg[{\mathcal {F}}_\mathrm{{{enc}}}(\varvec{\tilde{H}})]-\varvec{G}}\Vert ^{2}_{2} +\beta \Vert { {\mathcal {F}}_\mathrm{{{enc}}}(\varvec{\tilde{H}})-sg[\varvec{G}]}\Vert ^{2}_{2}. \end{aligned}$$
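The contrastive term \({\mathcal {L}}_{N}\) can be evaluated numerically as in the following sketch. It follows the expression as written in the text, which contains no logarithm (unlike the usual InfoNCE form), and it treats each channel sample as a flattened real-valued vector. All function names and toy values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_channels(W_list, c_t):
    """Linear predictions H_bar_{t+k} = W_{t+k} c_t, one per step k."""
    return [[dot(row, c_t) for row in W] for W in W_list]


def contrastive_loss(positives, predictions, negatives):
    """Minus the sum over k of the softmax-style ratio, as written in the text."""
    loss = 0.0
    for h_pos, h_pred in zip(positives, predictions):
        pos = math.exp(dot(h_pos, h_pred))
        neg = sum(math.exp(dot(h_pos, h_neg)) for h_neg in negatives)
        loss -= pos / (pos + neg)
    return loss


c_t = [1.0, 0.5]                                   # beam signature (toy values)
W_list = [[[1.0, 0.0], [0.0, 1.0]],                # W_{t+1}
          [[0.0, 1.0], [1.0, 0.0]]]                # W_{t+2}
predictions = predict_channels(W_list, c_t)
positives = [[1.0, 0.4], [0.6, 1.0]]               # reconstructed target-UE samples
negatives = [[-1.0, 0.0], [0.0, -1.0]]             # samples from other channel data
loss = contrastive_loss(positives, predictions, negatives)
```

The loss approaches \(-K\) for \(K\) prediction steps when every prediction aligns with its positive sample and is dissimilar to the negatives, and it moves toward zero as the negatives dominate.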

The pre-training process is illustrated in Algorithm 1.

figure a

Once the pre-training process is completed, we transfer the parameters of the GRU encoder/decoder and spatial attention to the QPTNet for supervised beam training. According to (23) and (32), the total beam training loss of the proposed SE-QPTNet can be described as

$$\begin{aligned} {\mathcal {L}}_\mathrm{{{tot}}} = {\mathcal {L}}_\mathrm{{{cls}}}+{\mathcal {L}}_\mathrm{{{pre}}}. \end{aligned}$$

Benefiting from the pre-trained model, the network can achieve high performance with a small training dataset. It allows mitigation of preprocessing tasks for labeling the actual optimal beam knowledge, reduces training overhead, and reaches stronger adaptability to environmental variation.

7 Results and discussion

7.1 Simulation setup

We evaluate the training and predicting performances of the proposed QPTNet and SE-QPTNet on three tasks: training performance, success rate, and achievable rate. We consider publicly available evaluation scenarios from the DeepMIMO dataset [42] for data generation. The scenario is constructed using the 3D ray-tracing software Wireless InSite [43], which captures the channel dependence on frequency and spatial domains. The mmWave signal is available at 28 GHz in an outdoor scenario, and we consider a single BS with ULA massive MIMO located at BS 3 of the scenario. We collect the mmWave channel data by the DeepMIMO generator and describe the details in Table 1. The main parameters of the proposed framework are represented in Table 2.

Table 1 DeepMIMO dataset parameters
Table 2 Training hyperparameters

7.2 Training performance

Fig. 5

Training performance of QPTNet and SE-QPTNet

We first compare the training performance of QPTNet and SE-QPTNet with different sizes of latent beam codebook options, and then, we evaluate the proposed schemes against existing related works in [21, 30].

Fig. 6

Training performance of QPTNet, SE-QPTNet, spatial attention-associated beam prediction, and DNN [21] and LSTM attention [30]

Figure 5 presents the training performance of QPTNet and SE-QPTNet for different latent beam codebook options. The results indicate that the latent beam codebook with a spatial resolution of 256 yields the smallest training loss for both QPTNet and SE-QPTNet. Additionally, SE-QPTNet outperforms QPTNet for every latent beam codebook option, achieving a much lower training loss curve. These results highlight the importance of a higher spatial resolution in precisely quantizing the continuous channel features into categorical beams to achieve better beam training performance and faster convergence. They also show that the contrastive learning framework-based SE-QPTNet can improve the learning efficiency by consistently interacting with the propagation environment and accounting for the mmWave channel fluctuation in the re-identification of QPTNet. Therefore, both the higher spatial resolution and the contrastive environmental prediction enhance training performance and learning efficiency. Consequently, we set the latent beam codebook option \(N_{CB}=256\) for the following results.

Figure 6 illustrates the training comparisons among the proposed SE-QPTNet, QPTNet, spatial attention-associated beam prediction, and DNN [21] and LSTM attention [30]. The training loss performance indicates that the SE-QPTNet achieves the fastest convergence with the loss value of 0.12, outperforming LSTM attention (0.15), DNN (1.8), spatial attention (0.31), and QPTNet (0.4). It is observed that the learning efficiency of SE-QPTNet is the best among the comparisons, benefiting from its competitive initial loss value (0.78) and smooth convergence. Looking closely at QPTNet and spatial attention, the latent beams can quantify the components of the propagation environment and significantly improve the accuracy of beam prediction, leading to faster convergence. However, the latent beam codebook updated by the mmWave channel reconstruction restricts the learning efficiency of QPTNet, because of the effects of noise and channel variations.

7.3 Success rate performance

To assess robustness and training overhead, we evaluate the beam prediction performance in terms of the success rate under SNRs ranging from 0 to 20 dB and with various training dataset sizes.

Fig. 7

Average success rate performance of SE-QPTNet, QPTNet, QP-DNN, and LSTM attention [30] with different prior label information

Figure 7 shows the average success rate of the proposed SE-QPTNet, QPTNet, and related works with different training data sizes. The proposed SE-QPTNet achieves a success rate of around 0.69 with only 1% of the training data and rises to 0.91 with 30% of the training data. To illustrate the advantage of the proposed phase quantization scheme, we also replace the spatial attention-associated beam prediction with a DNN and name the result QP-DNN. The results demonstrate that the unsupervised pre-training strategy can perceive informative environmental components to distinguish the implicit representations of AoDs, so that SE-QPTNet learns effectively from a small training dataset. Moreover, the proposed contrastive learning framework can enhance the global beam signature under multipath interference to improve reliability. Compared with the LSTM attention, the proposed SE-QPTNet obtains almost the same performance with only 5% of the data, which improves learning efficiency by six times. The result shows the potential to reduce both the training time and the effort of labeling the actual optimal beam, enabling a flexible beam training process.

Fig. 8

Average success rate of QPTNet, SE-QPTNet, spatial attention-associated beam prediction, and DNN [21] and LSTM attention [30] with different SNRs from 0 to 20 dB

Fig. 9

Success rate of SE-QPTNet with different codebook sizes, sigmoid-based quantization [33], and SE-QPTNet without quantization

To further investigate the robustness and training overhead, we evaluate the average success rate of the proposed SE-QPTNet using 50% of the training dataset. The results are presented in Fig. 8, which shows the success rate of the proposed SE-QPTNet (50% training data), QPTNet, spatial attention-associated beam prediction, DNN, and LSTM attention. At 0 dB SNR, the proposed SE-QPTNet achieves a success rate of 0.58, compared with success rates of 0.42, 0.43, 0.26, and 0.39 for LSTM attention, DNN, spatial attention, and the proposed QPTNet, respectively. The proposed SE-QPTNet is thus robust against noise at low SNR levels owing to its contrastive learning framework. Looking closely at QPTNet, LSTM attention, and spatial attention, the performance of QPTNet is sensitive at low SNR levels because the latent beam codebook is updated through the reconstruction of the mmWave channel. This makes QPTNet quantize the spatial feature incorrectly, resulting in fuzzy environmental components.

Figure 9 illustrates the effectiveness of the proposed codebook-based phase quantization. We compare the success rate of different latent beam codebook options and sigmoid-based quantization [33]. The baseline is the proposed SE-QPTNet without quantization. The proposed SE-QPTNet obtains a higher success rate and faster convergence than both the sigmoid-based method and the baseline. The sigmoid-based quantization in [33] maps each frequency component into the range between 0 and 1 to establish a phase relationship. Because the sigmoid is continuous and monotonically increasing, its output can hardly determine a discrete phase option. Unlike the sigmoid-based quantization, the proposed codebook-based method leads to a rich diversity of beam features corresponding to the discrete phases. Moreover, increasing the size of the codebook makes the proposed quantization method more robust (smaller error band).

Table 3 Complexity of the proposed SE-QPTNet, QPTNet, and related works

In Table 3, we compare the complexity of SE-QPTNet, QPTNet, and related works. SE-QPTNet achieves the best success rate and achievable rate among the compared schemes. Although the achievable rates of QPTNet and SE-QPTNet are almost identical, the success rate of SE-QPTNet is higher by \(2.6 \%\), owing to the contrastive environmental prediction that enables it to update the latent beam codebook efficiently. The number of parameters of SE-QPTNet is lower than those of LSTM attention [30], spatial attention [37], and the DNN [21], saving \(18\%\), \(28\%\), and \(82\%\) of the parameters, respectively. The results demonstrate that the proposed contrastive learning framework can improve DL performance with low overhead. Additionally, contrastive environmental prediction is beneficial for improving the learning efficiency by distinguishing the target UE from others, maintaining a low parameter overhead and a better global beam signature representation. Consequently, the proposed contrastive learning framework promotes learning efficiency while avoiding high complexity.

Table 4 Performances of different attention module options

Table 4 investigates the effect of the number of spatial attention modules on beam prediction. We compare the complexity, success rate, and achievable rate of different spatial attention options. According to the results, the success rate improves by \(20\%\) when the number of attention modules is larger than 1. Increasing the number of attention modules further does not significantly improve the accuracy of beam prediction but clearly increases the overhead. Therefore, we adopt two spatial attention modules as the optimal option for SE-QPTNet.

7.4 Achievable rate performance

The proposed SE-QPTNet can obtain extra performance by pre-training without label information and achieve a high transmission rate with a reduced amount of labeled CSI. To evaluate the training overhead and robustness in practical systems, we plot the achievable rate for different training dataset sizes and their performance under different SNR levels.

In Table 5, we compare the achievable rate of QPTNet and SE-QPTNet for different latent beam codebook options. The achievable rates of SE-QPTNet and QPTNet are close but still higher than that of QPTNet without noise effects. Although the performance of QPTNet and SE-QPTNet is almost identical, SE-QPTNet is more robust. We can observe an improvement of 0.41 at 10 dB SNR, owing to the contrastive environmental prediction that enables it to update the latent beam codebook efficiently. The results demonstrate that a larger spatial resolution can comprehensively quantize environmental variations to improve beam prediction accuracy. Additionally, contrastive environmental prediction is beneficial for improving robustness through the inherent representation of the global beam signature among different UEs.

Table 5 Achievable rate of QPTNet and SE-QPTNet with different SNRs and \(N_{CB}\)
Fig. 10

Achievable rate performance of QPTNet, SE-QPTNet, and DNN [21] with different training samples

Fig. 11

Achievable rate performance of QPTNet, SE-QPTNet, and DNN [21] with different SNRs from 0 to 20 dB

To illustrate the low training overhead of the proposed solutions, Fig. 10 shows the achievable rate, defined in (5), of the proposed SE-QPTNet, QPTNet, and DNN approaches for different dataset sizes. Note that the dot-dash line in Fig. 10 represents the upper bound on the performance for the given system and channel models. The horizontal axis represents the training data size, and the achievable rates are calculated with \(1\%, 5\%, 10\%, 30\%, 50\%, 70\%, 100\%\) of the training data. The results of SE-QPTNet and QPTNet are close to the upper bound with only 10% of the training data, indicating that the proposed solutions allow the BS to apply the desired beam with an efficient training scheme. SE-QPTNet can successfully predict the optimal beam based on the contrastive learning framework with only \(1\%\) of the training data at the BS. This illustrates the ability of the self-enhancement to efficiently represent the desired angular sector with positive and negative samples, achieving negligible training overhead. Moreover, the proposed solutions are flexible in customizing the beam solution with small training data. Therefore, approaching this bound illustrates the optimality of the proposed solutions.

To verify the robustness of the proposed SE-QPTNet and QPTNet, we investigate the achievable rate at different SNR levels, as shown in Fig. 11. The dot-dash line represents the upper bound performance, that is, the optimal rate without noise for the given system and channel models. The results show that SE-QPTNet and QPTNet are consistently better than the simple DNN beam training approach. Benefiting from the contrastive environmental prediction to update the latent beam codebook, SE-QPTNet achieves a higher rate than QPTNet at low SNR levels. Both proposed beam training schemes are near the optimal rate when the SNR is higher than 15 dB. This illustrates that the hierarchical strategy and the quantization can deal with the frequency and spatial channel efficiently, and that the self-enhancement can provide a better perception of the complicated propagation environment through the contrastive learning framework. Moreover, the contrastive environmental prediction can improve the robustness by relying on a distinct global beam signature, obtained by considering the similarity between the global beam signature and the positive/negative samples.

8 Conclusion

This paper proposed a novel framework, SE-QPTNet, for beam training based on spatial attention and quantization, which utilizes contrastive environmental prediction to detect the relationship between the beam signature of the target UE and those of other UEs. The proposed hierarchical scheme, QPTNet, quantized the continuous spatial features into controllable latent beam responses within a wide range of discrete phases, achieving higher robustness. The proposed contrastive learning framework, SE-QPTNet, enhanced the capability of identifying the beam signature through contrastive environmental prediction with a small fraction of the labeled CSI dataset. SE-QPTNet thus permitted DL-integrated beam training to adapt flexibly to wireless environments with low labeling costs. The simulation results showed that the proposed SE-QPTNet framework achieves a data rate of 7.01 bps/Hz with only \(5\%\) labeled data, which represents \(95.1\%\) of the performance obtained with the full-size dataset. The proposed SE-QPTNet framework outperformed existing DL-based schemes, obtaining higher capacity with lower expertise requirements and highly reliable performance for mmWave massive MIMO systems.
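As a quick consistency check on the two reported figures, the full-dataset rate implied by them can be worked out as follows. The symbols \(R_{5\%}\) and \(R_{100\%}\) are introduced here only for illustration, and the resulting value is derived from the stated numbers rather than quoted from the results.

```latex
\[
R_{100\%} \approx \frac{R_{5\%}}{0.951} = \frac{7.01}{0.951} \approx 7.37 \ \text{bps/Hz}
\]
```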

Availability of data and materials

The millimeter-wave massive MIMO dataset is generated with DeepMIMO, a public framework for generating large-scale MIMO data based on accurate Remcom 3D ray-tracing.



Abbreviations

5G/6G: Fifth/sixth generation
mmWave: Millimeter wave
MIMO: Multiple-input multiple-output
BS: Base station
UE: User equipment
CSI: Channel state information
RF: Radio frequency
CS: Compressive sensing
DL: Deep learning
Supervised learning
DNN: Deep neural network
Beam prediction network
Virtual uplink beamforming
LSTM: Long short-term memory
GRU: Gated recurrent unit
QPTNet: Quantized phase-based transformer network
SE-QPTNet: Self-enhanced QPTNet
RIS: Reconfigurable intelligent surface
AWGN: Additive white Gaussian noise
DFT: Discrete Fourier transform
FC: Fully connected
GAP: Global average pooling


  1. T.S. Rappaport, Y. Xing, G.R. MacCartney, A.F. Molisch, E. Mellios, J. Zhang, Overview of millimeter wave communications for fifth-generation (5G) wireless networks with a focus on propagation models. IEEE Trans. Antennas Propag. 65(12), 6213–6230 (2017).

    Article  Google Scholar 

  2. M. Alsabah, M.A. Naser, B.M. Mahmmod, S.H. Abdulhussain, M.R. Eissa, A. Al-Baidhani, N.K. Noordin, S.M. Sait, K.A. Al-Utaibi, F. Hashim, 6G wireless communications networks: a comprehensive survey. IEEE Access 9, 148191–148243 (2021).

    Article  Google Scholar 

  3. T.S. Rappaport, S. Sun, R. Mayzus, H. Zhao, Y. Azar, K. Wang, G.N. Wong, J.K. Schulz, M. Samimi, F. Gutierrez, Millimeter wave mobile communications for 5G cellular: it will work! IEEE Access 1, 335–349 (2013).

    Article  Google Scholar 

  4. J. Hoydis, S. ten Brink, M. Debbah, Massive MIMO in the UL/DL of cellular networks: how many antennas do we need? IEEE J. Sel. Areas Commun. 31(2), 160–171 (2013).

    Article  Google Scholar 

  5. J. Mo, B.L. Ng, S. Chang, P. Huang, M.N. Kulkarni, A. Alammouri, J.C. Zhang, J. Lee, W.-J. Choi, Beam codebook design for 5G mmWave terminals. IEEE Access 7, 98387–98404 (2019).

    Article  Google Scholar 

  6. F.A. Pereira de Figueiredo, An overview of massive MIMO for 5G and 6G. IEEE Lat Am Trans 20(6), 931–940 (2022).

  7. J. Li, Y. Niu, H. Wu, B. Ai, S. Chen, Z. Feng, Z. Zhong, N. Wang, Mobility support for millimeter wave communications: opportunities and challenges (IEEE Commun. Surv, Tutor, 2022)

    Google Scholar 

  8. Y. Li, J.G. Andrews, F.h. Baccelli, Design and analysis of initial access in millimeter wave cellular networks. IEEE Trans. Wireless Commun. 16(10), 6409–6425 (2017).

  9. J. Kim, A.F. Molisch, Fast millimeter-wave beam training with receive beamforming. J. Commn. Net 16(5), 512–522 (2014).

    Article  Google Scholar 

  10. M.E. Eltayeb, A. Alkhateeb, R.W. Heath, T.Y. Al-Naffouri, Opportunistic beam training with hybrid analog/digital codebooks for mmWave systems. In: Proc. IEEE GlobalSIP, pp. 315–319 (2015).

  11. J. Wang, Z. Lan, C. Pyo, T. Baykas, C. Sum, M.A. Rahman, J. Gao, R. Funada, F. Kojima, H. Harada, S. Kato, Beam codebook based beamforming protocol for multi-Gbps millimeter-wave WPAN systems. IEEE J. Sel. Areas Commun. 27(8), 1390–1399 (2009).

  12. C. Qi, K. Chen, O.A. Dobre, G.Y. Li, Hierarchical codebook-based multiuser beam training for millimeter wave massive mimo. IEEE Trans. Wireless Commun. 19(12), 8142–8152 (2020).

    Article  Google Scholar 

  13. Z. Xiao, T. He, P. Xia, X.-G. Xia, Hierarchical codebook design for beamforming training in millimeter-wave communication. IEEE Trans. Wireless Commun. 15(5), 3380–3392 (2016).

    Article  Google Scholar 

  14. S.-E. Chiu, N. Ronquillo, T. Javidi, Active learning and csi acquisition for mmwave initial alignment. IEEE J. Sel. Areas Commun. 37(11), 2474–2489 (2019).

    Article  Google Scholar 

  15. I. Aykin, M. Krunz, Efficient beam sweeping algorithms and initial access protocols for millimeter-wave networks. IEEE Trans. Wireless Commun. 19(4), 2504–2514 (2020).

    Article  Google Scholar 

  16. A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, D. Tujkovic, Deep learning coordinated beamforming for highly-mobile millimeter wave systems. IEEE Access 6, 37328–37348 (2018)

    Article  Google Scholar 

  17. H. Echigo, Y. Cao, M. Bouazizi, T. Ohtsuki, A deep learning-based low overhead beam selection in mmwave communications. IEEE Trans. Veh. Technol. 70(1), 682–691 (2021).

    Article  Google Scholar 

  18. S. Rezaie, C.N. Manchón, E. de Carvalho, Location- and orientation-aided millimeter wave beam selection using deep learning. In: Proc. IEEE ICC,

  19. M. Alrabeiah, A. Alkhateeb, Deep learning for mmWave beam and blockage prediction using sub-6 GHz channels. IEEE Trans. Commun. 68(9), 5504–5518 (2020)

    Article  Google Scholar 

  20. H. Huang, Y. Peng, J. Yang, W. Xia, G. Gui, Fast beamforming design via deep learning. IEEE Trans. Veh. Technol. 69(1), 1065–1069 (2019)

    Article  Google Scholar 

  21. C. Qi, Y. Wang, G.Y. Li, Deep learning for beam training in millimeter wave massive MIMO systems. IEEE Trans. Wireless Commun. (2020).

    Article  Google Scholar 

  22. K. Ma, D. He, H. Sun, Z. Wang, S. Chen, Deep learning assisted calibrated beam training for millimeter-wave communication systems. IEEE Trans. Commun. 69(10), 6706–6721 (2021).

    Article  Google Scholar 

  23. H.Jia, Liu, Z. Wang, N. Chen, M. Okada, Non-deterministic sparse feature learning for reliable beam prediction in mmWave massive MIMO systems. Proc. IEEE PIMRC (2022).

  24. Z. Xiao, H. Dong, L. Bai, P. Xia, X.-G. Xia, Enhanced channel estimation and codebook design for millimeter-wave communication. IEEE Trans. Veh. Technol. 67(10), 9393–9405 (2018).

    Article  Google Scholar 

  25. J. Palacios, D. De Donno, J. Widmer, Tracking mm-wave channel dynamics: Fast beam training strategies under mobility. In: Proc. IEEE INFOCOM, pp. 1–9 (2017).

  26. W. Ma, C. Qi, Z. Zhang, J. Cheng, Sparse channel estimation and hybrid precoding using deep learning for millimeter wave massive MIMO. IEEE Trans. Commun. 68(5), 2838–2849 (2020).

    Article  Google Scholar 

  27. M. Hussain, N. Michelusi, Learning and adaptation for millimeter-wave beam tracking and training: A dual timescale variational framework. IEEE J. Sel. Areas Commun. 40(1), 37–53 (2022).

    Article  Google Scholar 

  28. S. Chen, Z. Jiang, S. Zhou, Z. Niu, Time-sequence channel inference for beam alignment in vehicular networks. In: Proc. IEEE GlobalSIP, pp. 1199–1203 (2018).

  29. H. Jia, N. Chen, M. Okada, Memory shared spatial attention neural network for reliable beam prediction. In: Proc. IEEE GCCE, pp. 107–108 (2022).

  30. L. Dai, X. Gao, S. Han, I. Chih-Lin, X. Wang, Beamspace channel estimation for millimeter-wave massive MIMO systems with lens antenna array. In: Proc. IEEE ICCC, pp. 1–6 (2016).

  31. H. Hojatian, V.N. Ha, J. Nadal, J.-F. Frigon, F. Leduc-Primeau, RSSI-based hybrid beamforming design with deep learning. In: Proc. IEEE ICC, pp. 1–6 (2020).

  32. K. Chen, J. Yang, Q. Li, X. Ge, Sub-array hybrid precoding for massive MIMO systems: a CNN-based approach. IEEE Commun. Lett. 25(1), 191–195 (2021).

    Article  Google Scholar 

  33. H. Hojatian, J. Nadal, J.-F. Frigon, F. Leduc–Primeau, Flexible unsupervised learning for massive MIMO subarray hybrid beamforming. In: IEEE GLOBECOM, pp. 3833–3838 (2022). IEEE

  34. H. Hojatian, J. Nadal, J.-F. Frigon, F. Leduc-Primeau, Unsupervised deep learning for massive MIMO hybrid beamforming. IEEE Trans. Wirel. Commun. 20(11), 7086–7099 (2021).

    Article  Google Scholar 

  35. C. Gustafson, K. Haneda, S. Wyne, F. Tufvesson, On mmWave multipath clustering and channel modeling. IEEE Trans. Antennas Propag. 62(3), 1445–1455 (2014).

    Article  Google Scholar 

  36. T.S. Rappaport, Y. Xing, G.R. MacCartney, A.F. Molisch, E. Mellios, J. Zhang, Overview of millimeter wave communications for fifth-generation (5G) wireless networks-with a focus on propagation models. IEEE Trans. Antennas Propag. 65(12), 6213–6230 (2017).

    Article  Google Scholar 

  37. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. In: Proc. NeuralIPS, pp. 5998–6008 (2017)

  38. M. Lin, Q. Chen, S. Yan, Network in network. arXiv preprint arXiv:1312.4400 (2013)

  39. A. Van Den Oord, O. Vinyals, et al., Neural discrete representation learning. In: Proc. NeuralIPS, vol. 30 (2017)

  40. B. Wang, F. Gao, S. Jin, H. Lin, G.Y. Li, S. Sun, T.S. Rappaport, Spatial-wideband effect in massive MIMO with application in mmWave systems. IEEE Commun. Mag. 56(12), 134–141 (2018).

    Article  Google Scholar 

  41. A.v.d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  42. A. Alkhateeb, DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications. CoRR arxiv: abs/1902.06435 (2019)

  43. Remcom: Wireless insite, (July, 2023)

Download references


This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 23K16870 and Hirose Foundation.

Author information


All the authors contributed equally to data collection, processing, experiments and article writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Na Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Jia, H., Chen, N., Urakami, T. et al. Spatial attention and quantization-based contrastive learning framework for mmWave massive MIMO beam training. J Wireless Com Network 2023, 69 (2023).

  • mmWave
  • Massive MIMO
  • Deep learning
  • Spatial attention
  • Feature quantization
  • Contrastive learning