Spatial attention and quantization-based contrastive learning framework for mmWave massive MIMO beam training

Deep learning (DL)-based beam training schemes have been exploited to improve spectral efficiency with fast optimal beam selection for millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems. To achieve high prediction accuracy, these DL models rely on training with a tremendous amount of labeled environmental measurements, such as mmWave channel state information (CSI). However, demanding a large volume of ground truth labels for beam training is inefficient and infeasible due to the high labeling cost and the requirement for expertise in practical mmWave massive MIMO systems. Meanwhile, a complex environment incurs critical performance degradation in the continuous output of beam training. In this paper, we propose a novel contrastive learning framework, named self-enhanced quantized phase-based transformer network (SE-QPTNet), for reliable beam training with only a small fraction of the labeled CSI dataset. We first develop a quantized phase-based transformer network (QPTNet) with a hierarchical structure to explore the essential features from frequency and spatial views and quantize the environmental components with a latent beam codebook to achieve robust representation. Next, we design the SE-QPTNet including self-enhanced pre-training and supervised beam training. SE-QPTNet pre-trains by the contrastive information of the target user and others with the unlabeled CSI, and then, it is utilized as the initialization to fine-tune with a reduced volume of labeled CSI. Finally, the experimental results show that the proposed framework improves beam prediction accuracy and data rates with 5% labeled data compared to existing solutions. Our proposed framework further enhances flexibility and breaks the limitation of the quantity of label information for practical beam training.


Introduction
Millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) is one of the most critical technologies in fifth-/sixth-generation (5G/6G) wireless communication systems due to its capacity to provide abundant spectrum and spatial resources for high transmission rate demands [1][2][3]. Benefiting from the short wavelength of mmWave signals, a massive number of antenna elements can be integrated into equipment of limited size at both the base station (BS) and user equipment (UE) sides [4]. In addition, the massive MIMO array can compensate for the more severe path loss of mmWave signals through highly directional beamforming, leading to stronger coverage, larger data rates, and improved reliability [5,6].
Generally, mmWave massive MIMO arrays adaptively transmit signals via directional beams according to the wireless environment state [7][8][9][10][11]. To efficiently transmit beams with maximum gain, beam training is essential to identify the line-of-sight (LOS) or dominant channel path, yielding the optimal beam pair for the transceivers [7]. In practical beam training, candidate beams can be defined by a finite-size codebook covering the intended angle range and exhaustively searched to determine the optimal beam pair [8]. However, mmWave massive MIMO beam training is challenging due to the large codebook size, which results in high computational overhead. To reduce the training overhead, [9] proposes hierarchical multi-resolution codebook solutions, where a low-resolution sub-codebook detects the candidate transmitting direction, and the high-resolution codebook confirms the optimal beam pair. Another efficient beam training approach is interactive beam search. The schemes in [10] and [11] detect the direction of the LOS/dominant path from the mmWave channel estimate and select the optimal beam pairs.
The performance of beam training schemes, including alignment accuracy and overhead, is highly dependent on the codebook design. The literature has shown that an adaptive hierarchical codebook can decide the codewords based on previous beam training results, with multiple mainlobes covering a spatial region for one or more UEs [12]. The scheme in [13] efficiently generates the hierarchical codebook by jointly exploiting sub-arrays with partially active antenna elements. An adaptive and sequential alignment scheme was proposed in [14], demonstrating the relation between fast search time and the probability of error in the acquired beam directions through the extrinsic Jensen-Shannon divergence. The study in [15] developed a fast beam-sweeping algorithm based on compressive sensing (CS) to determine the minimum number of measurements required.

Related work
The conventional schemes can satisfy the user demands but can hardly learn from experience to further improve their ability. Deep learning (DL) has recently elevated the field of wireless systems research and beam training to new heights [16,17]. These heuristic proposals depend on well-labeled channel state information (CSI) corresponding to the aligned gain of a pre-defined codebook to construct a supervised learning (SL) framework. In this framework, the optimal beam can be directly predicted based on a large volume of labeled knowledge to reduce the training overhead and the effects of noise [18][19][20][21][22][23]. A beam selection scheme based on deep neural networks (DNN) was proposed in [18] that recognized the desired beam from the elementary relation between position and received signals with low alignment overhead. Meanwhile, the potential of DL in predicting the optimal mmWave beam and correcting blockage status based on sub-6 GHz channel information has been demonstrated in [19] to reduce beam training overhead and achieve reliable communication. The work in [20] trained a beamforming prediction network (BPNet) using supervised and unsupervised learning methods to optimize power allocation and predict virtual uplink beamforming (VUB) for improving computational efficiency. An online learning-based training strategy in [21] utilized a large DNN to obtain the offline model parameters and fine-tuned the DNN model according to the extra CSI measured in real time. An adaptive beam training scheme for calibration was proposed in [22] that estimated approximate CSI features by a convolutional neural network (CNN) and determined the optimal beam by self-criticism of a long short-term memory (LSTM) network. Moreover, a non-deterministic beam training scheme was proposed in [23] that developed a binary coding scheme to represent the valid CSI and reduce the effect of noise. Although DL-based studies have obtained impressive results, severe multipath interference may impair prediction accuracy when the angles of paths in the local cluster are close to the dominant channel path.
CSI is informative in deciding the optimal beam by exploring the dominant path during the training process. In [24], the authors proposed a hierarchical search that decomposes the multipath into several virtual components and recovers the dominant CSI for beam training. However, the estimation performance highly relies on the training penalty. DL-based beamspace channel estimation was proposed in [25] to directly estimate the beamspace from the received signal, eliminating the need for a time-consuming beamforming process. A channel signature-based hybrid precoding design was proposed in [26] using a DNN to estimate the channel and perform hybrid precoding with low computational complexity. Furthermore, a dual timescale variational framework for mmWave beam training and tracking [27] addressed the beamforming direction in real time by training a deep recurrent variational autoencoder, taking into account both the historical channel information and the current channel conditions. In [28] and [29], LSTM is shown to further improve the ability of beam training through the implicit channel signature. The work in [28] inferred the optimal beam directions at a target BS in future time slots from the historical channel features of a mobile UE. The work in [30] indicated that spatial attention beam training can improve transmission reliability, and that the associative LSTM encoder provides explicit channel features to improve the training ability.
The existing solutions exploiting DL models for beam training share the following limitations. Firstly, most existing solutions perform inefficient feature extraction for frequency- and spatial-domain CSI. Direct vectorization of frequency and spatial information leads to a coarse representation. It is thus hard for DNN-based beam training to extract the CSI feature information effectively, resulting in low learning efficiency and beam prediction performance. Meanwhile, CNN-based models show strong ability in two-dimensional local feature extraction but suffer from a limited global view of the beamspace. In addition, sequence modeling can capture the relation of features from the frequency and spatial domains but can hardly learn over an extensive range. Secondly, environmental measurements can affect the continuous output of the dense network. The continuous representation of CSI is sensitive to noise and channel variation, resulting in incorrect beam prediction. Finally, the existing SL approaches require all CSI data to be labeled. However, labeling a large volume of CSI is unrealistic due to the high labeling cost and expertise requirement in practical mmWave massive MIMO systems. Although CSI is easily obtainable, labeling the actual optimal beam for rapidly changing CSI is impractical. Therefore, the deficiency of labeled CSI can constrain the performance of existing DL-based beam training schemes.

Motivation
Typically, the CSI of massive MIMO contains the frequency dynamic response that exhibits spatial fluctuations in interconnected power grids. Because we concentrate on self-enhanced pre-training and SL with limited labeled data, it is critical to align the relation of the dominant paths across different subcarriers in order to extract useful features. Since the training signals result from the transmitted signals and the propagation environment, tracking the UE movement and capturing local feature pairs have achieved promising results [22,30]. However, these methods are incapable of capturing a global view of spatial and frequency features. Consequently, we build on the benefits of existing works to develop a hierarchical DL architecture with two levels, where the first level tracks the spatial variation across different subcarriers and feeds the second level, which explores the global view for efficient beam training.
Moreover, the complexity and diversity of the wireless environment are challenging issues for DL-based beam training. Continuous feature extraction results in an uncontrollable representation that is sensitive to random disturbance terms. Existing works address phase quantization mainly in three ways: by designing with real-valued phase shifts and then applying quantization [31], by constructing an analog beamforming codebook [32], and by nonlinear mapping into binary phase quantization [33]. They lack flexibility and robustness to noise and channel variations. To address this problem, we develop a configurable codebook to quantize the continuous spatial feature, satisfying the equal AoD distribution in categorical beams. Specifically, discrete codewords can suppress the effects of noise and channel variations and reflect the dominant factor of the CSI. Thus, we can obtain controllable quantized results from the codebook to improve the robustness of the hierarchical DL architecture.
DL has been suggested as a promising approach to address the nonlinear relationship between CSI and optimal beam prediction. As indicated in [21,30], DL-integrated beam prediction is typically performed under an SL framework with perfect CSI annotation. However, it is challenging to acquire exhaustive annotation of massive CSI for DL-based beam training in realistic massive MIMO mmWave systems, which leads to high labeling costs and expertise requirements. The work in [34] proposes an unsupervised method that performs CSI reconstruction and accomplishes online SL beam training with a large dataset of labeled CSI. It is inefficient for limited labeled CSI because of the uniqueness of the wireless environment, and it lacks extendability. We shed light on a novel contrastive learning framework with limited labeled CSI to mitigate the expertise requirements. Specifically, we leverage the uniqueness of the CSI at different UE locations to pre-train the model by identifying the contrastive information between the target UE and others.

Contributions
This paper proposes a novel contrastive learning framework, named self-enhanced quantized phase-based transformer network (SE-QPTNet), in mmWave massive MIMO systems to enable reliable beam training with CSI measurements that carry only limited labels. We first design a hierarchical DL architecture, named quantized phase-based transformer network (QPTNet), which sequentially extracts frequency- and spatial-domain features to enable effective representation. In order to obtain a reliable spatial representation, we also develop a latent beam codebook to align the similar phase features of the frequency domain for exploring the environmental components. Then, we propose the contrastive learning framework SE-QPTNet, extended from the QPTNet architecture. The SE-QPTNet framework is concerned with detecting the relationship between the global beam signature and the distinctive CSI corresponding to user locations using contrastive environmental prediction. By leveraging the contrastive information of the unlabeled CSI, SE-QPTNet is pre-trained as the initialization for fine-tuning with a limited amount of labels to provide more accurate performance. The main contributions of this paper can be summarized as follows:
• We propose QPTNet, a hierarchical DL architecture with two levels, to enable effective representation of CSI in the frequency and spatial domains. The first level applies a gated recurrent unit (GRU) to model and extract the dependencies among cluster information in the frequency domain. The second level utilizes the spatial attention mechanism to extract a global beam signature based on the results of quantization.
• We design a codebook-based phase quantization to explore the complicated environmental components for reliable spatial representation. This quantization method can match and aggregate similar phase features with a codeword, improving the robustness of the features against noise in the CSI. It converts the prior knowledge of continuous spatial features extracted from the frequency domain into the posterior knowledge of categorical beams.
• We develop a contrastive learning framework, SE-QPTNet, benefiting from the hierarchical QPTNet and codebook-based phase quantization. This enhanced model further improves beam training accuracy with limited labeled CSI. To the best of our knowledge, this is the first study that introduces contrastive learning in beam training applications. SE-QPTNet offers two benefits based on contrastive environmental prediction. Firstly, it can pre-train without any label information by detecting the relationship between the global beam feature and positive/negative samples. Secondly, the similarity of a positive sample and the beam signature can effectively capture the spatial dynamic changes under long inter-frequency spans. SE-QPTNet preserves the benefits of QPTNet and reduces the labeling cost.
The rest of the paper is organized as follows: Sect. 2 introduces the mmWave massive MIMO system model and the problem formulation. Section 3 explains the principle of spatial attention-associated beam prediction. Section 4 develops the hierarchical feature extraction scheme QPTNet and discusses the details of the codebook-based phase quantization and the proposed QPTNet. Section 5 proposes the contrastive learning framework SE-QPTNet and the procedure of self-enhanced pre-training. We then present the simulation results in Sect. 6, followed by a conclusion of this study in Sect. 7.
Notations: $\mathbf{A}$ is a matrix; $\mathbf{a}$ is a vector; $a$ is a scalar; $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and conjugate transpose, respectively, while $|\cdot|$ denotes the magnitude operator. $\mathcal{R}(\cdot)$ and $\mathcal{I}(\cdot)$ denote the real and imaginary parts of a complex number, respectively. $\mathcal{CN}(0, \boldsymbol{\Sigma})$ represents the zero-mean complex Gaussian distribution with covariance matrix $\boldsymbol{\Sigma}$; $\sigma(\cdot)$ denotes the activation function of the neural network.

Methods/experimental
The purpose of this study was to tackle the annotation costs and the spectral efficiency problem of mmWave massive MIMO systems with a novel contrastive learning framework, SE-QPTNet. We consider a system containing a BS equipped with massive MIMO communicating with the UE in a complex environment with limited label information. Taking advantage of the contrastive structure, the proposed framework SE-QPTNet pre-trains with unlabeled data and transfers the model to a hierarchical QPTNet with spatial attention-based feature extraction. The proposed SE-QPTNet can improve the learning efficiency, preserve the stability of feature extraction, reduce the requirements of expertise, and train with limited annotations. To analyze the performance of the framework experimentally, we generate the dataset with DeepMIMO and evaluate the proposal and competitive methods in terms of training loss, success rate, and achievable rate.

System model and problem formulation
It is worth emphasizing that our proposal can be extended to various communication scenarios. First, the proposed contrastive learning framework can be developed for multi-user hybrid beamforming design, since the beam direction is unique for each user, which reduces the effects of multi-user interference, and the labeling cost is likewise intractable. Second, the proposed contrastive learning framework can be applied in harsh wireless environments, e.g., the reconfigurable intelligent surface (RIS) scenario, where the beam direction of each user is aligned with the dominant AoD of the RIS, while others are treated as negative.
An overview of the proposed beam training scheme is illustrated in Fig. 1. The proposed DL-integrated beam prediction is typically performed at the BS side, where high computational resources are available, to ensure fast prediction responses. For analytical simplicity, we consider downlink transmission between an mmWave massive MIMO BS and a single-antenna UE. For a two-dimensional (2D) mmWave channel where only azimuth angles are considered at both the BS and UE, the Saleh-Valenzuela channel model is typically adopted, as formulated in (1), where $L$, $\beta_l$, $\theta_l$, and $\phi_l$ denote the number of channel paths, and the channel gain, angle-of-arrival (AoA), and angle-of-departure (AoD) of the $l$th channel path, respectively.
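As a minimal illustrative sketch of such a geometric channel (not the authors' implementation), the following NumPy code builds a channel matrix as a sum of $L$ rank-one path components for uniform linear arrays with half-wavelength spacing. The steering-vector form, the $\sqrt{N_t N_r / L}$ scaling, the angular range $[-\pi/2, \pi/2]$, and the names `steering_vector` and `sv_channel` are assumptions made for illustration.

```python
import numpy as np

def steering_vector(n_antennas, spatial_angle):
    """Unit-norm ULA steering vector for half-wavelength spacing.

    spatial_angle is the sine of the physical angle, so it lies in [-1, 1].
    """
    k = np.arange(n_antennas)
    return np.exp(1j * np.pi * k * spatial_angle) / np.sqrt(n_antennas)

def sv_channel(n_t, n_r, gains, aoas, aods):
    """Narrowband geometric channel as a sum of rank-one path components."""
    n_paths = len(gains)
    h = np.zeros((n_r, n_t), dtype=complex)
    for beta, theta, phi in zip(gains, aoas, aods):
        a_r = steering_vector(n_r, theta)   # receive-side response (AoA)
        a_t = steering_vector(n_t, phi)     # transmit-side response (AoD)
        h += beta * np.outer(a_r, a_t.conj())
    # Assumed normalization; a common convention, not taken from the source.
    return np.sqrt(n_t * n_r / n_paths) * h

rng = np.random.default_rng(0)
num_paths = 3
gains = (rng.standard_normal(num_paths) + 1j * rng.standard_normal(num_paths)) / np.sqrt(2)
aoas = np.sin(rng.uniform(-np.pi / 2, np.pi / 2, num_paths))  # assumed angular range
aods = np.sin(rng.uniform(-np.pi / 2, np.pi / 2, num_paths))
H = sv_channel(n_t=64, n_r=1, gains=gains, aoas=aoas, aods=aods)
print(H.shape)
```

With a single path the matrix has rank one, which is why identifying the dominant direction is equivalent to finding the best beam in that case.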
Since the first channel path, corresponding to the LOS path, is typically significant, recognizing the LOS path can be beneficial for improving the coverage of mmWave signals [30]. Although the number of resolvable channel paths is much smaller than the number of BS antennas, i.e., $L \ll N_t$, it is still challenging to efficiently distinguish the LOS path because of the limited scattering of mmWave channels and the non-line-of-sight (NLOS) paths [35][36]. The AoD and AoA of the $l$th path can be defined as $\phi_l = 2 d_t \sin\Phi_l / \lambda$ and $\theta_l = 2 d_r \sin\Theta_l / \lambda$, where $\Phi_l$ and $\Theta_l$ are the physical angles of the LOS and NLOS paths; $\lambda$ denotes the wavelength; and $d_t = d_r = \lambda/2$ is the antenna spacing at the BS and UE. In particular, both $\Phi_l$ and $\Theta_l$ follow a uniform distribution. With the mmWave channel matrix $\mathbf{H}$ given in (1), the received signal can be described as
$$y = \sqrt{P}\,\boldsymbol{\omega}^H \mathbf{H} \mathbf{f} x + \boldsymbol{\omega}^H \mathbf{n}, \qquad (4)$$
where $P$, $\boldsymbol{\omega} \in \mathbb{C}^{N_r \times 1}$, and $\mathbf{f} \in \mathbb{C}^{N_t \times 1}$ denote the transmit power, combiner, and beamformer, respectively. $x$ is the transmitted data with unit power, i.e., $|x| = 1$, while $\mathbf{n} \sim \mathcal{CN}(0, \sigma^2 \mathbf{I}_{N_r})$ denotes the additive white Gaussian noise (AWGN) vector with power $\sigma^2$. Typically, the beamformer and combiner do not increase or decrease the power gain, i.e., $\|\boldsymbol{\omega}\|_2 = \|\mathbf{f}\|_2 = 1$. The achievable rate can be described by
$$R = \log_2\left(1 + \frac{P}{\sigma^2}\left|\boldsymbol{\omega}^H \mathbf{H} \mathbf{f}\right|^2\right).$$
To get the maximum achievable rate for the given $\mathbf{H}$, we conventionally perform beam training to construct the optimal beam pair by optimizing $\mathbf{f}$ and $\boldsymbol{\omega}$ before data transmission (as shown in Fig. 1). This optimization can be implemented with the pre-defined codebooks $\mathcal{F}$ and $\mathcal{W}$ as
$$\{\mathbf{f}_{op}, \boldsymbol{\omega}_{op}\} = \arg\max_{\mathbf{f} \in \mathcal{F},\, \boldsymbol{\omega} \in \mathcal{W}} R. \qquad (6)$$
Practically, it is impossible to directly reveal the ideal pair of $\mathbf{f}_{op}$ and $\boldsymbol{\omega}_{op}$, since it is hard to handle the three independent matrices in (4) jointly.
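To make the link between a beam pair and the achievable rate concrete, the sketch below evaluates the received signal and the rate for one beamformer/combiner pair, assuming the usual form $R = \log_2(1 + P|\boldsymbol{\omega}^H \mathbf{H} \mathbf{f}|^2 / \sigma^2)$. The function names and the toy channel are hypothetical and serve only as an illustration.

```python
import numpy as np

def received_signal(h, f, w, x, power, noise_var, rng):
    """Scalar received signal after combining: sqrt(P) w^H H f x + w^H n."""
    n_r = h.shape[0]
    noise = np.sqrt(noise_var / 2.0) * (
        rng.standard_normal(n_r) + 1j * rng.standard_normal(n_r)
    )
    return np.sqrt(power) * (w.conj() @ h @ f) * x + w.conj() @ noise

def achievable_rate(h, f, w, power, noise_var):
    """Rate in bit/s/Hz for unit-norm beamformer f and combiner w."""
    gain = np.abs(w.conj() @ h @ f) ** 2
    return np.log2(1.0 + power * gain / noise_var)

# Toy example: 4 BS antennas, single-antenna UE, all-ones channel.
h = np.ones((1, 4), dtype=complex)
f = np.ones(4, dtype=complex) / 2.0      # unit-norm beamformer
w = np.ones(1, dtype=complex)            # trivial combiner
rate = achievable_rate(h, f, w, power=1.0, noise_var=1.0)
y = received_signal(h, f, w, x=1.0, power=1.0, noise_var=0.0,
                    rng=np.random.default_rng(0))
print(rate, y)
```

In this toy case the beamforming gain is $|\boldsymbol{\omega}^H \mathbf{H} \mathbf{f}|^2 = 4$, so the rate equals $\log_2 5$.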
A straightforward and exhaustive solution of (6) is to enumerate all possible candidate codewords of $\mathbf{f}$ and $\boldsymbol{\omega}$ through beam-sweeping and to select the pair with the largest achievable rate, as outlined in Fig. 1. Conventionally, the beamformer $\mathbf{f}$ can be generated by a pre-defined codebook $\mathcal{F} \triangleq \{\mathbf{f}_n, n = 1, 2, \ldots, N_t\}$ that includes $N_t$ codewords corresponding to different AoDs with the inherent transmitting spatial resolution. Identically, the combiner can be generated by $\mathcal{W} \triangleq \{\boldsymbol{\omega}_m, m = 1, 2, \ldots, N_r\}$, including $N_r$ codewords corresponding to different AoAs with the inherent receiving spatial resolution. For each beam training test, the BS selects a codeword from $\mathcal{F}$ as the beamformer, which is aligned with a combiner from $\mathcal{W}$ at the UE side. Generally, the discrete Fourier transform (DFT) codebook is a feasible option to decide the candidate beamformer $\mathbf{f}_n$ and combiner $\boldsymbol{\omega}_m$, where $\xi_{t,n}$ and $\xi_{r,m}$ are the beam directions of the $n$th possible beam at the BS and the $m$th possible received beam at the UE side. To span the whole angular domain at both the BS and UE, $\xi_{t,n}$ and $\xi_{r,m}$ can be uniformly sampled over symmetric angular ranges at the BS and UE, respectively. To find the optimal solution of (6), the search range is $K \triangleq N_t N_r$, and the candidate beam pairs are all combinations $(\mathbf{f}_n, \boldsymbol{\omega}_m)$ with $\mathbf{f}_n \in \mathcal{F}$ and $\boldsymbol{\omega}_m \in \mathcal{W}$. To evaluate the performance of beam training, the success rate is regarded as an important criterion in [13]. A trial is treated as successful if the selected index corresponds to the largest achievable rate; otherwise, the trial is considered a failure. Hence, the success rate $\gamma$ can be defined as the ratio of the number of successful trials $N_{Suc}$ to the total number of trials $N_{Tot}$:
$$\gamma = \frac{N_{Suc}}{N_{Tot}}.$$
This paper proposes a novel contrastive learning-based SE-QPTNet for reliable beam training with low labeling cost and expertise requirements. Generally, the prediction of the optimal mmWave transmitting beam is operated at the BS side, and the same method can be easily extended to predict the optimal receiving beam. Considering severe multipath interference and inconsistent dominant beam prediction, we propose to adopt the phase quantization of periodically estimated CSI to reflect the relation of the cluster AoDs/AoAs of the UE and to predict the optimal mmWave beam when mmWave beam training is required. Since the mmWave channels are considered to have identical LOS AoD/AoA and NLOS cluster AoDs/AoAs, for AoDs we can rewrite the received signal (4) at the UE side as in (13), where $n_{eq}$ denotes the equivalent noise. By substituting (1) into (13), we can quantify the correlation between the mmWave beam in (7) and the array steering vector in (2) as $q_n(N_t, \phi_{LOS})$, where $\psi_n = \sin\xi_{t,n} - \sin\phi_{LOS}$. The quantity $q_n(N_t, \phi_{LOS})$ illustrates the relation between the beam direction $\xi_{t,n}$ and $\phi_{LOS}$, which is regarded as a quantization error [12]. However, the multipath interference and the approximation of $\phi_{LOS}$ may lead to inaccurate predicted results.
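The exhaustive beam-sweeping baseline and the success-rate criterion described above can be sketched as follows. This is a simplified illustration rather than the paper's procedure: the sampling grid of the beam directions in `dft_codebook`, and the helper names `beam_sweep` and `success_rate`, are assumptions.

```python
import numpy as np

def dft_codebook(n_antennas):
    """Columns are unit-norm beams on an assumed uniform grid in (-1, 1)."""
    k = np.arange(n_antennas)[:, None]
    xi = -1.0 + (2.0 * np.arange(n_antennas) + 1.0) / n_antennas
    return np.exp(1j * np.pi * k * xi[None, :]) / np.sqrt(n_antennas)

def beam_sweep(h, F, W):
    """Test all N_t * N_r beam pairs and return the best (n, m) indices."""
    gains = np.abs(W.conj().T @ h @ F) ** 2   # rows: combiners, cols: beamformers
    m, n = np.unravel_index(np.argmax(gains), gains.shape)
    return int(n), int(m)

def success_rate(predicted, optimal):
    """Fraction of trials whose predicted beam index equals the optimal one."""
    return float(np.mean(np.asarray(predicted) == np.asarray(optimal)))

n_t, n_r = 8, 1
F, W = dft_codebook(n_t), dft_codebook(n_r)
true_beam = 5
# Single-path toy channel whose AoD coincides with codeword 5.
h = np.outer(np.ones(n_r), F[:, true_beam].conj())
best_n, best_m = beam_sweep(h, F, W)
gamma = success_rate([1, 2, 3, 4], [1, 2, 0, 4])
print(best_n, best_m, gamma)
```

The sweep costs $K = N_t N_r$ measurements per channel realization, which is the overhead that a learned predictor aims to avoid.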
Since the number of candidate beams is finite, mmWave beam training is a multiclass classification task based on the beam codebook $\mathcal{F}$. Mathematically, the prediction is represented by the probability results of the training function $Q(\cdot)$ as
$$n^{*} = \arg\max_{n \in \{1, 2, \ldots, N_t\}} P_n,$$
where $P_n$ is the predicted probability of the $n$th beam, and the optimal beam index $n^{*}$ corresponds to the maximum probability of the output of $Q(\cdot)$ given the parameters $\mathbf{W}$.
In this paper, we propose a novel spatial attention and quantization-based contrastive learning framework to achieve high learning efficiency with lower labeled data requirements and greater robustness. The proposed SE-QPTNet framework includes two stages:
Training stage: We divide the training stage into two phases: self-enhanced pre-training and supervised training. Self-enhanced pre-training is an unsupervised procedure, where we only utilize the CSI as the training samples to acquire knowledge of the environmental variations based on the contrastive learning framework. This solution concentrates on modeling the belonging relation between the CSI features and their global beam signature to enhance the uniqueness of the feature of the current user. In supervised training, limited training samples are collected from the CSI and the transmitting beam information. The optimal mmWave transmitting beam index decided by beam-sweeping is used as the classification label, and the corresponding CSI is used as the input. Different from the existing works, we fine-tune the model starting from the initialization that carries the knowledge of environmental variations from the pre-training phase.
Predicting stage: When mmWave beam training is required, the BS can predict the optimal beam from the received signals using the well-trained model. With our proposal, the labeling cost and the requirement of expertise are flexible, and beam prediction remains reliable under environmental variation.
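The contrastive objective used for self-enhanced pre-training is not specified in this section. As a hedged illustration only, the sketch below uses the InfoNCE loss, a common choice in contrastive learning, to score a global beam signature against one positive sample (the target UE) and several negative samples (other UEs). The cosine similarity, the temperature value, and the function name `info_nce` are assumptions and may differ from the loss actually used by SE-QPTNet.

```python
import numpy as np

def info_nce(signature, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the signature matches the positive sample."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = [cosine(signature, positive)]
    sims += [cosine(signature, neg) for neg in negatives]
    logits = np.array(sims) / temperature
    logits = logits - logits.max()            # numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)

signature = np.array([1.0, 0.0])
# Positive sample aligned with the signature, negatives pointing elsewhere.
aligned = info_nce(signature, np.array([1.0, 0.0]),
                   [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
# Positive sample opposite to the signature, one negative aligned instead.
misaligned = info_nce(signature, np.array([-1.0, 0.0]),
                      [np.array([0.0, 1.0]), np.array([1.0, 0.0])])
print(aligned, misaligned)
```

Minimizing such a loss requires no beam labels: only the pairing between a signature and the CSI of the same user is needed, which is what makes the pre-training phase unsupervised.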
For clarity, we first describe the spatial attention-associated beam prediction of the proposed hierarchical QPTNet. Then, we illustrate the details of the codebook-based phase quantization for the spatial features and the procedure of QPTNet, including the first level for UE tracking corresponding to the different locations, the quantization, and the beam prediction decided by spatial attention. We finally introduce the proposed contrastive learning framework SE-QPTNet and summarize the procedure of self-enhanced pre-training.

Spatial attention-associated beam prediction
This section presents the efficient spatial attention-associated beam prediction, depicted in Fig. 2. Efficient perception of the environment from the observed CSI can significantly improve beam prediction accuracy. Benefiting from the attention mechanism, we can infer the optimal beam response from implicit directions by its attention scoring [37]. Moreover, the attention mechanism is also good at simultaneously capturing features from all implicit directions for a comprehensive beam signature.
Beam gain generation module: Since the initial spatial-frequency channel measurement $\mathbf{H}$ is complex-valued, we first normalize it by the maximum amplitude of its elements, $\bar{\mathbf{H}} = \mathbf{H} / \max_{i,j} |[\mathbf{H}]_{i,j}|$. Then the real and imaginary components are concatenated in the spatial domain so that the beam gain can be generated simultaneously. Finally, the modified real-valued input channel can be denoted as $\tilde{\mathbf{H}} = [\mathcal{R}(\bar{\mathbf{H}}), \mathcal{I}(\bar{\mathbf{H}})]$. To generate the transmitting beam gain in each antenna direction, we exploit the linear projection of $\tilde{\mathbf{H}}$ to the antenna space $N_t$ through an embedding layer. To preserve the relative distance among the antennas, distinctive beam gain embedding parameters are employed with the linear projection results. The transmitting beam gains of $\tilde{\mathbf{H}}$ can be defined as
$$\mathbf{B}_g = \mathrm{embedding}(\tilde{\mathbf{H}}), \qquad (18)$$
where $\mathrm{embedding}(\cdot)$ is a linear projection layer of dimension $2N_t \times N_t$.
Beam direction tagging module: To symbolize the inherent direction of the beamforming gains, we consider a beam direction tagging $\mathbf{B}_t$ for each transmitting beam gain in (18), which is generated through the linear projection of each transmitter antenna index. Combining it with (18), the input of the transformer encoder is $\mathbf{X} = \mathbf{B}_g + \mathbf{B}_t$.
Stacked attention module: The transformer encoder enables the spatial-frequency feature to deeply learn an essential representation by applying the stacked attention module. Particularly, the attention mechanism [37] may capture the relevant relation of the specific LOS/dominant path direction while diminishing the effects of NLOS/subordinate paths by traversing all transmission antennas with the beam direction query matrix $\mathbf{Q}$, the beam gains key matrix $\mathbf{K}$, and the corresponding scoring value matrix $\mathbf{V}$. The stacked attention module devotes more focus to mutually important alignment degrees from the coherence of the candidate beam gains and the beam directions of the local antenna, learning which specific AoD information is more competitive than others, depending on the limited number of scatterers of the LOS and NLOS paths. Then, the query matrix $\mathbf{Q}$, key matrix $\mathbf{K}$, and value matrix $\mathbf{V}$ are generated by wide fully connected (FC) layers with the input signal $\mathbf{X}$ as
$$\mathbf{Q}_i = \mathbf{X}\mathbf{W}_i^{q}, \quad \mathbf{K}_i = \mathbf{X}\mathbf{W}_i^{k}, \quad \mathbf{V}_i = \mathbf{X}\mathbf{W}_i^{v},$$
where $\mathbf{W}_i^{q}$, $\mathbf{W}_i^{k}$, and $\mathbf{W}_i^{v}$ are the linear projection layers in the $i$th attention module. The attention operation can be introduced as a scaled coherence function in Fig. 2, which maps a beam direction query and a set of candidate beam gain pairs to a dependency relation. Specifically, we compute the dot-product of the beam direction query matrix $\mathbf{Q}$ with all key beam gains in $\mathbf{K}$ and apply a $\mathrm{softmax}(\cdot)$ activation to obtain the pairing probabilities on the scoring value matrix $\mathbf{V}$, so that the beam gains can be precisely aligned with the LOS direction. Then, the output can be computed by
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V},$$
where $d_k$ is the dimension of the keys. The context of the candidate pairs of beam gains and directions can be extracted by the forward stacked attention process and decided with the relevant score, which comes from the softmax probability result. Moreover, the superior collaboration of beam gains and directions can effectively improve the beam training performance.
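The three modules above can be sketched end to end in a few lines of NumPy. This is a single-head, single-layer illustration under several assumptions: tokens are taken along the subcarrier axis, the weights are random rather than trained, the beam direction tagging `B_t` is modeled as a free matrix, and the scaling by the square root of the key dimension follows the standard scaled dot-product attention of [37].

```python
import numpy as np

def to_real_input(h):
    """Normalize by the maximum amplitude, then concatenate real and imaginary parts."""
    h = h / np.max(np.abs(h))
    return np.concatenate([h.real, h.imag], axis=-1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention over the rows (tokens) of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v, scores

rng = np.random.default_rng(0)
n_f, n_t = 4, 8                                          # subcarriers, BS antennas
H = rng.standard_normal((n_f, n_t)) + 1j * rng.standard_normal((n_f, n_t))
H_real = to_real_input(H)                                # shape (n_f, 2 * n_t)
W_embed = 0.1 * rng.standard_normal((2 * n_t, n_t))      # 2N_t x N_t embedding
B_g = H_real @ W_embed                                   # beam gains
B_t = 0.1 * rng.standard_normal((n_f, n_t))              # hypothetical direction tagging
X = B_g + B_t                                            # transformer encoder input
W_q, W_k, W_v = (0.1 * rng.standard_normal((n_t, n_t)) for _ in range(3))
out, scores = attention(X, W_q, W_k, W_v)
print(out.shape, scores.sum(axis=1))
```

Each row of `scores` is a probability distribution over tokens, so every output row is a weighted average of the value vectors.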
Output module: Different from the dense FC layer-based prediction scheme, global average pooling (GAP) [38] is introduced to average the tokens from the stacked attention operations over the candidate beams, which can be written as
$$\mathbf{c}_t = \frac{1}{D}\sum_{d=1}^{D} \mathbf{c}_d,$$
where $\mathbf{c}_d \in \mathbb{R}^{N_t \times 1}$, $d = 1, 2, \cdots, D$, is the output from the stacked attention module, and the beam signature vector satisfies $\mathbf{c}_t \in \mathbb{R}^{N_t \times 1}$. One advantage of GAP over the FC layer is that it summarizes the aligned information between beam gains and transmitting directions with the weighted average results; thus, it is more robust to spatial translations of the fused local features of gains and directions. In addition, there are no extra parameters to optimize in the GAP layer, so overfitting at this layer is avoided. The resulting vector $\mathbf{c}_t$ is directly fed into the $\mathrm{softmax}(\cdot)$ layer, and the optimal beam is chosen with the maximum probability of the global beam signature $\mathbf{c}_t$:
$$\mathbf{p} = \mathrm{softmax}(\mathbf{c}_t), \qquad n^{*} = \arg\max_{n} p_n,$$
where $\mathbf{p}$ is the predicted probability and $p_n$ is the $n$th element of $\mathbf{p}$. To train our model, the cross-entropy loss is applied as the evaluation metric for the classification problem. The cross-entropy loss can be expressed as
$$\mathcal{L} = -\sum_{c=1}^{N_t} y_c \log p_c,$$
where $y_c$ is the actual optimal mmWave beam described by one-hot encoding as the classification label. If label $c$ is identical to the optimal beam, $y_c = 1$; otherwise, $y_c = 0$.
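The output module can be illustrated with a short sketch that averages the $D$ attention tokens, applies a softmax, and evaluates the cross-entropy loss against a one-hot label. The function names and the toy token values are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_beam(tokens):
    """tokens: (D, N_t) outputs of the stacked attention module."""
    c_t = tokens.mean(axis=0)            # global average pooling, no parameters
    p = softmax(c_t)                     # probability of each candidate beam
    return int(np.argmax(p)), p

def cross_entropy(p, label):
    """Cross-entropy with a one-hot label reduces to -log p[label]."""
    return float(-np.log(p[label]))

tokens = np.array([[0.0, 2.0, 0.0],
                   [0.0, 4.0, 0.0]])     # D = 2 tokens, N_t = 3 beams
beam, p = predict_beam(tokens)
loss_correct = cross_entropy(p, label=1)
loss_wrong = cross_entropy(p, label=0)
print(beam, loss_correct, loss_wrong)
```

Because the label is one-hot, only the probability assigned to the true beam contributes to the loss, so the loss is small when the predicted beam matches the label.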

QPTNet-associated beam training
This section presents the hierarchical QPTNet for efficiently processing the CSI. QPTNet has two hierarchical levels and a codebook-based phase quantization procedure. The first level is an autoregressive encoder based on GRU, and the second level is a spatial attention encoder based on the transformer. The output of the GRU encoder is quantized with a generated beam codebook. Finally, the spatial attention encoder extracts a global beam signature. The overview of QPTNet is illustrated in Fig. 3.

Codebook-based phase quantization
To analyze the components of the propagation environment, we attempt to capture the implicit relation of cluster AoDs by the discrete phase-based quantization method. We define a latent beam codebook $\mathcal{G}$ including $N_{CB}$ codewords $\mathbf{g}_i \in \mathbb{C}^{N_t \times 1}$, $i = 1, 2, \cdots, N_{CB}$ (i.e., $N_{CB}$-way categorical beams). The latent beamspace is initialized with randomly sampled angles $\hat{\phi}_i \in [-\frac{\pi}{2}, \frac{\pi}{2}]$, and the normalized spatial frequency $u_i$ is defined as $u_i = \frac{d_t}{\lambda}\sin\hat{\phi}_i$, where $u_i \in [-1/2, 1/2]$ for $\lambda/2$ element spacing. Intuitively, the beam vector can be generated by the DFT of $u$ at points separated by $1/N_{CB}$. Note that $N_{CB} \geq N_t$ and there are $N_{CB}$ categorical beam vectors $\mathbf{g}_i$. Thus, the latent beamspace can be described as in (25). Since we only consider the spatial power spectrum, the elements of (25) are adjustable with the network training.
As shown in Fig. 3(a) and (b), the GRU encoder yields a continuous spatial feature matrix $\mathbf{Z}$ with column vectors $\mathbf{z}_l \in \mathbb{C}^{N_t \times 1}$, $l = 1, 2, \cdots, N_f$. Next, we quantize the continuous feature $\mathbf{z}_l$ into the categorical beam $\mathbf{g}_{i^*}$ based on the Euclidean distance $d_{l,i} = \|\mathbf{z}_l - \mathbf{g}_i\|_2$. The categorical beam index is expressed as

$$i^* = \arg\min_{i} d_{l,i}. \tag{26}$$

The minimal $d_{l,i^*}$ ensures that the continuous spatial features corresponding to the categorical beams lie within a neighboring zone of the latent beamspace. Meanwhile, the distance calculation efficiently reflects the relation of cluster AoDs to represent the components of the environment. The quantization also acts as a clustering step, since each feature is mapped to its nearest codeword. The posterior categorical beam distribution $p(\hat{\mathbf{z}}_l = \mathbf{g}_i \mid \mathbf{H}, \phi_i)$ is formulated as

$$p(\hat{\mathbf{z}}_l = \mathbf{g}_i \mid \mathbf{H}, \phi_i) = \begin{cases} 1, & i = \arg\min_{j} \|\mathbf{z}_l - \mathbf{g}_j\|_2, \\ 0, & \text{otherwise}. \end{cases} \tag{27}$$

Fig. 3 Overview of the proposed QPTNet. CSI is separately processed along the frequency subcarriers to capture the continuous spatial feature $\mathbf{z}_l$ via the GRU encoder. A latent beam codebook handles the perception of environmental variations by exploring the relation between the categorical beam $\mathbf{g}_{i^*}$ and the continuous spatial feature. Moreover, we reconstruct the channel data based on the selected latent beams via a GRU decoder to update the codebook and maintain the consistency of channel and beamspace. The selected latent beams are fed into a spatial attention-associated beam prediction. $\mathbf{c}_t$ indicates the global beam signature

During the forward pass, we define $\hat{\mathbf{Z}} = [\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2, \cdots, \hat{\mathbf{z}}_{N_f}]$ to represent the discrete angular beam responses. The codewords $\hat{\mathbf{z}}_l = \mathbf{g}_{i^*}$ are then selected according to the approximate AoDs $\phi_{i^*}$. Different from the setup in Section 3, we acquire the optimal beam prediction based on the attention scoring results of $\hat{\mathbf{Z}}$ to enhance the ability of environmental representation.
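The nearest-codeword quantization described above can be sketched in a few lines of Python. The toy codebook and features below are made-up values chosen only to make the mapping easy to follow; the function names are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature z_l to the index and codeword of its nearest g_i."""
    indices = []
    for z in features:
        dists = [squared_distance(z, g) for g in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices, [codebook[i] for i in indices]

# Toy codebook with three 2-element codewords and two noisy features (assumed values).
codebook = [[1 + 0j, 0j], [0j, 1 + 0j], [-1 + 0j, 0j]]
features = [[0.9 + 0.1j, 0.1j], [-0.8 + 0j, 0.2 + 0j]]
indices, z_hat = quantize(features, codebook)
```

Minimizing the squared distance selects the same index as minimizing the distance itself, so the square root is skipped. The returned list `z_hat` plays the role of the discrete beam responses that are passed on to the attention stage.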

Hierarchical learning procedure
The proposed QPTNet contains two hierarchical levels for extracting features from the frequency and spatial domains. At the first level, we apply a GRU encoder to extract the relations between the two domains and the spatial feature $\mathbf{Z}$. We employ the GRU module as the instant feature extractor because of its high training efficiency and its similarity to the LSTM network in temporal sequence learning. The GRU encoder synchronizes the CSI estimation through its gate mechanism, which controls the spatial feature across subcarriers. When processing $\mathbf{H}$ at each subcarrier index, the GRU encoder takes in the channel delay response at the current subcarrier together with the shared hidden state and the output $\mathbf{z}_{l-1}$ from the previous subcarrier step, and updates its hidden state at the current subcarrier index, resulting in a new $\mathbf{z}_l$. The process is characterized by $P(\cdot)$, the conditional probability distribution over the subcarriers. Once the continuous spatial feature $\mathbf{z}_l$ is extracted, we quantize it with the latent beam codebook to perceive the environmental component. According to (26) and (27), we perform a nearest neighbor search to decide the posterior latent beam vector related to the discrete angular index $\phi_{i^*}$.
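To make the recurrence over subcarriers concrete, the sketch below runs a plain GRU cell along the subcarrier axis and collects one feature vector per step. It is a simplified stand-in under several assumptions: real-valued inputs instead of complex CSI, no bias terms, randomly drawn weights, and the hidden state itself used as the output feature. The class and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Plain GRU cell; biases are omitted to keep the sketch short."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # One input matrix W and one recurrent matrix U per gate (z, r) and candidate (h).
        self.w = {g: random_matrix(hidden_size, input_size, rng) for g in "zrh"}
        self.u = {g: random_matrix(hidden_size, hidden_size, rng) for g in "zrh"}

    def step(self, x, h_prev):
        def pre(gate, state):
            return [a + b for a, b in zip(matvec(self.w[gate], x),
                                          matvec(self.u[gate], state))]
        z = [sigmoid(v) for v in pre("z", h_prev)]        # update gate
        r = [sigmoid(v) for v in pre("r", h_prev)]        # reset gate
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        cand = [math.tanh(v) for v in pre("h", gated)]    # candidate state
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def encode(channel_rows, cell, hidden_size):
    """Run the cell over subcarriers, collecting one feature vector per step."""
    h = [0.0] * hidden_size
    features = []
    for row in channel_rows:
        h = cell.step(row, h)
        features.append(h)
    return features

rng = random.Random(1)
channel = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # N_f = 6, N_t = 4
features = encode(channel, GRUCell(input_size=4, hidden_size=4), hidden_size=4)
```

Each output depends on the current subcarrier and on the state carried over from the previous one, which is the dependency the conditional distribution over subcarriers expresses.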
The input to the decoder is the selected latent beam matrix $\hat{\mathbf{Z}}$. We reconstruct the channel matrix from $\hat{\mathbf{Z}}$ to guarantee the consistency between the latent beamspace and the channel space. The reconstruction loss is formulated as

$$\mathcal{L}_{rec} = \|\mathbf{H} - \hat{\mathbf{H}}\|_2^2,$$

where $F_{enc}$ and $F_{dec}$ are the GRU encoder and decoder mappings, respectively, and $\hat{\mathbf{H}} = F_{dec}(\hat{\mathbf{Z}})$ is the reconstructed channel. The forward computation pipeline can be regarded as a regular autoencoder with a specific nonlinearity that corresponds to the 1-of-$N_{CB}$ latent beam vectors. The optimization of QPTNet follows the back-propagation scheme in [39]. To make sure the encoder commits to a latent beamspace and its output does not grow, we add a commitment loss. Thus, the total training loss becomes

$$\mathcal{L} = \mathcal{L}_{CE} + \|\mathbf{H} - \hat{\mathbf{H}}\|_2^2 + \|\mathrm{sg}[F_{enc}(\mathbf{H})] - \mathbf{G}\|_2^2 + \beta \|F_{enc}(\mathbf{H}) - \mathrm{sg}[\mathbf{G}]\|_2^2,$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator, defined as the identity at forward computation time and having zero partial derivatives, $\mathbf{G} = [\mathbf{g}_1, \mathbf{g}_2, \cdots, \mathbf{g}_{N_{CB}}]$, and $\beta$ is a hyperparameter. The first term is the beam prediction loss in (23); its optimization does not change the latent beam codebook because of the nonlinearity of quantization. The second term is the CSI reconstruction loss, which updates the latent beam codebook. The third term uses the $\ell_2$ norm loss to move the latent beam codebook vectors close to the encoder output $F_{enc}(\mathbf{H})$, and the fourth term ensures that the encoder output commits to the codeword.
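The structure of the four-term loss can be illustrated with a forward-only sketch. It assumes squared-error distances for the reconstruction, codebook, and commitment terms, and it uses made-up matrices and a hypothetical $\beta = 0.25$. Since the stop-gradient operator is the identity in the forward pass, it does not appear in the arithmetic; the comments indicate which side would receive gradients in an automatic-differentiation framework.

```python
def squared_error(a, b):
    """Sum of squared magnitudes of the entrywise difference of two matrices."""
    return sum(abs(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def qptnet_loss(ce_loss, h, h_rec, z, z_hat, beta):
    """Forward value of the four-term loss.

    The stop-gradient sg[.] is the identity in the forward pass, so the
    codebook and commitment terms share the same distance and differ only
    in which argument would receive gradients during back-propagation.
    """
    reconstruction = squared_error(h, h_rec)
    codebook_term = squared_error(z, z_hat)      # gradients would reach the codebook only
    commitment = beta * squared_error(z, z_hat)  # gradients would reach the encoder only
    return ce_loss + reconstruction + codebook_term + commitment

h = [[1.0, 0.0], [0.0, 1.0]]        # toy channel
h_rec = [[0.9, 0.0], [0.1, 1.0]]    # toy decoder output
z = [[0.5, 0.5], [0.2, 0.8]]        # toy encoder output
z_hat = [[0.5, 0.0], [0.0, 1.0]]    # toy selected codewords
total = qptnet_loss(ce_loss=0.3, h=h, h_rec=h_rec, z=z, z_hat=z_hat, beta=0.25)
```

Separating the codebook and commitment terms in this way lets the codewords follow the encoder while $\beta$ controls how strongly the encoder is pulled toward the selected codewords.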

SE-QPTNet-associated beam training
This section proposes SE-QPTNet to provide an unsupervised pre-training strategy for improving the performance of QPTNet given a small training dataset. In SE-QPTNet, we construct positive and negative samples based on the anchor operation. Then, the pre-training process is completed by detecting the relationship among a positive sample of the target UE, negative samples of other UEs, and the global beam signature through contrastive environmental prediction. Therefore, SE-QPTNet can constantly perceive the environment component under a contrastive learning framework to further improve the performance of beam training. Figure 4 describes the procedure of self-enhanced pre-training in SE-QPTNet.
To pre-train the proposed SE-QPTNet under the contrastive learning framework, we present an anchor operation to acquire the positive and negative samples. Specifically, we obtain forward latent beam features by splitting the outputs of quantization at the anchor $t$ and explore the global beam signature $\mathbf{c}_t$ based on attention scoring evaluation. Meanwhile, the reconstructed channel data streams recovered by the GRU decoder are divided into contrastive samples of length $k$, in which the positive sample is determined by the CSI of the target UE location while the negative samples are determined by the others. SE-QPTNet enhances the similarity between the target UE and the prediction of $\mathbf{c}_t$ while making the other UE responses dissimilar through contrastive environmental prediction. Benefiting from this distinction, we can facilitate latent angular learning within the latent beams corresponding to the dominant path, discard the noise, and enhance the representation of the environmental component.

Fig. 4 Overview of the proposed SE-QPTNet. Proposed SE-QPTNet compensates for the effects of propagation delay by separately dealing with the sub-antenna arrays. Practically, the intermediate beam codes $\hat{\mathbf{z}}$ are divided into $\hat{\mathbf{z}}_{\le t}$ as small-scale parts and $\hat{\mathbf{z}}_{>t}$ as large-scale parts by an anchor $t$. We feed $\hat{\mathbf{z}}_{\le t}$ to the spatial attention-associated beam prediction for identifying the beam signature vector $\mathbf{c}_t$. Contrastive environmental prediction measures the similarity between $\mathbf{c}_t$ and the reconstructed $\hat{\mathbf{H}}_{recon}$, and makes the negative samples $\hat{\mathbf{H}}_{neg}$ dissimilar to its beam signature
In the mmWave massive MIMO system, the amplitudes and phases of the received signals suffer from spatial differences due to a significant fraction of propagation delay [40]. To achieve robustness, the proposed SE-QPTNet compensates for the effects of propagation delay by separately dealing with the selected latent beams. For notational convenience, we denote the global view output from the quantization by $\hat{\mathbf{Z}}$. We select an anchor $t$ and define $\hat{\mathbf{Z}}_{\le t}$ as the forward component. Taking inspiration from recent advancements [41], we introduce a token $\hat{\mathbf{z}}_0$ at the beginning of $\hat{\mathbf{Z}}_{\le t}$ and feed them to the spatial attention-associated beam prediction to identify the beam signature vector $\mathbf{c}_t$. We then predict the $k$ reconstructed channels by a linear mapping $\mathbf{H}_{t+k} = \mathbf{W}_{t+k} \mathbf{c}_t$, where $\mathbf{W}_{t+k}$ denotes the weight of the linear mapping. If the large-scale dense channel signatures can be successfully predicted, the beam signature vector $\mathbf{c}_t$ contains a global and robust view of the propagation environment. The loss function of contrastive environmental prediction combines the positive and negative samples, where $\hat{\mathbf{H}}_{neg,l}$ denotes the negative samples taken from other channel data and $\hat{\mathbf{H}}_{t+k}$ is the positive sample. The total pre-training loss (32) additionally accounts for the update of the latent beam codebook. The pre-training process is illustrated in Algorithm 1.

Once the pre-training process is completed, we transfer the parameters of the GRU encoder/decoder and spatial attention to QPTNet for supervised beam training. The total beam training loss of the proposed SE-QPTNet is obtained from (23) and (32). Benefiting from the pre-trained model, the network can achieve high performance with a small training dataset. This mitigates the preprocessing effort of labeling the actual optimal beam, reduces training overhead, and gives stronger adaptability to environmental variation.
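One common way to realize a contrastive objective of this kind is an InfoNCE-style loss, sketched below. This is an assumption made for illustration: the inner-product similarity, the toy vectors, and the weight matrix are not taken from the paper, whose exact loss expression is not reproduced here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_map(weight, c_t):
    """Predict a future channel vector from the beam signature: W c_t."""
    return [dot(row, c_t) for row in weight]

def contrastive_loss(prediction, positive, negatives):
    """InfoNCE-style loss: negative log-softmax score of the positive sample."""
    scores = [dot(prediction, positive)] + [dot(prediction, n) for n in negatives]
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[0]

c_t = [1.0, 0.0]                                   # toy beam signature
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # hypothetical W_{t+k}
prediction = linear_map(weight, c_t)               # equals [1.0, 0.0, 1.0]
positive = [1.0, 0.0, 1.0]                         # target-UE sample
negatives = [[-1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]   # samples from other UEs
loss_aligned = contrastive_loss(prediction, positive, negatives)
loss_swapped = contrastive_loss(prediction, negatives[0], [positive, negatives[1]])
```

The loss is small when the prediction aligns with the target-UE sample and large when a sample from another UE is treated as the positive, which is the behavior the pre-training stage relies on.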

Simulation set up
We evaluate the training and prediction performance of the proposed QPTNet and SE-QPTNet in three respects: training performance, success rate, and achievable rate. We consider publicly available evaluation scenarios from the DeepMIMO dataset [42] for data generation. The scenario is constructed using the 3D ray-tracing software Wireless InSite [43], which captures the channel dependence on the frequency and spatial domains. The mmWave signal operates at 28 GHz in an outdoor scenario, and we consider a single BS with a ULA massive MIMO array located at BS 3 of the scenario. We collect the mmWave channel data with the DeepMIMO generator and describe the details in Table 1. The main parameters of the proposed framework are listed in Table 2.

Training performance
We first compare the training performance of QPTNet and SE-QPTNet with different sizes of latent beam codebook options, and then, we evaluate the proposed schemes against existing related works in [16,29].
Figure 5 presents the training performance of QPTNet and SE-QPTNet with different latent beam codebook options. The results indicate that the latent beam codebook with a spatial resolution of 256 yields the smallest training loss for both QPTNet and SE-QPTNet. Additionally, SE-QPTNet outperforms QPTNet for every latent beam codebook option, achieving a much lower training loss curve. These results highlight the importance of higher spatial resolution in precisely quantizing the continuous channel features into categorical beams to achieve better beam training performance and faster convergence. They also show that the contrastive learning framework-based SE-QPTNet can improve the learning efficiency by consistently interacting with the propagation environment and accounting for the mmWave channel fluctuation in the re-identification of QPTNet. Therefore, the higher spatial resolution option and the contrastive environmental prediction enhance training performance and learning efficiency. Consequently, we set the latent beam codebook option $N_{CB} = 256$ for the following results.

Figure 6 illustrates the training comparisons among the proposed SE-QPTNet, QPTNet, spatial attention-associated beam prediction, DNN [21], and LSTM attention [30]. The training loss performance indicates that SE-QPTNet achieves the fastest convergence with a loss value of 0.12, outperforming LSTM attention (0.15), DNN (1.8), spatial attention (0.31), and QPTNet (0.4). The learning efficiency of SE-QPTNet is the best among the compared schemes, benefiting from its competitive initial loss value (0.78) and smooth convergence. Looking closely at QPTNet and spatial attention, the latent beams can quantify the components of the propagation environment and significantly improve the accuracy of beam prediction, leading to faster convergence. However, the latent beam codebook updated by the mmWave channel reconstruction restricts the learning efficiency of QPTNet because of the effects of noise and channel variations.

Success rate performance
Fig. 6 Training performance of QPTNet, SE-QPTNet, spatial attention-associated beam prediction, and DNN [21] and LSTM attention [30]

To assess robustness and training overhead, we evaluate the performance of beam prediction in terms of the success rate under SNRs ranging from 0 to 20 dB and with various training dataset sizes. Figure 7 shows the average success rate of the proposed SE-QPTNet, QPTNet, and related works with different training data sizes. The proposed SE-QPTNet achieves a success rate of around 0.69 with only 1% of the training data, rising to 0.91 given 30% of the training data. To illustrate the advantage of the proposed phase quantization scheme, we apply the DNN instead of the spatial attention-associated beam prediction, and name the result QP-DNN. The results demonstrate that the unsupervised pre-training strategy can perceive informative environmental components to distinguish the implicit representations of AoDs, so that SE-QPTNet can improve its learning ability with a small training dataset. Moreover, the proposed contrastive learning framework can enhance the global beam signature under multipath interference to improve reliability. Compared with LSTM attention, the proposed SE-QPTNet obtains almost the same performance with only 5% of the data, which improves learning efficiency by six times. This result shows the potential to reduce training time and the effort of labeling the actual optimal beam, enabling a flexible beam training process.
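For reference, the success rate can be computed as the fraction of test samples whose predicted beam index matches the optimal one. This top-1 definition is an assumption here, since the metric is defined outside this section, and the index lists are made-up values.

```python
def success_rate(predicted, optimal):
    """Fraction of samples whose predicted beam index equals the optimal one."""
    if len(predicted) != len(optimal) or not optimal:
        raise ValueError("need two non-empty index lists of equal length")
    hits = sum(1 for p, o in zip(predicted, optimal) if p == o)
    return hits / len(optimal)

# Five toy predictions, three of which match the optimal beam.
rate = success_rate(predicted=[3, 7, 7, 1, 0], optimal=[3, 7, 2, 1, 5])
```

Averaging this quantity over test sets drawn at each SNR or training-set size would give curves of the kind discussed in this subsection.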
To further investigate the robustness and lower training overhead, we evaluate the average success rate of the proposed SE-QPTNet using 50% of the training dataset. The results are presented in Fig. 8, which shows the success rate performance of the proposed SE-QPTNet (50% training data), QPTNet, spatial attention-associated beam prediction, DNN, and LSTM attention. At 0 dB SNR, the proposed SE-QPTNet achieves a success rate of 0.58, compared with success rates of 0.42, 0.43, 0.26, and 0.39 for LSTM attention, DNN, spatial attention, and the proposed QPTNet, respectively. The proposed SE-QPTNet is therefore robust against noise at low SNR levels, owing to its contrastive learning framework. Looking closely at QPTNet, LSTM attention, and spatial attention, the performance of QPTNet is sensitive at low SNR because the latent beam codebook is updated with the reconstruction of the mmWave channel. This makes QPTNet quantize the spatial feature incorrectly, resulting in fuzzy environmental components.
Fig. 8 Average success rate of QPTNet, SE-QPTNet, spatial attention-associated beam prediction, and DNN [21] and LSTM attention [30] with different SNRs from 0 to 20 dB

Fig. 9 illustrates the effectiveness of the proposed codebook-based phase quantization. We compare the success rate performance of different latent beam codebook options and sigmoid-based quantization [33]. The baseline is the proposed SE-QPTNet without quantization. The proposed SE-QPTNet obtains a higher success rate and faster convergence than the sigmoid-based method and the baseline method. The sigmoid-based quantization in [33] maps each frequency component into the range between 0 and 1 to establish a phase relationship. Because the sigmoid is continuous and monotonically increasing, its output can hardly determine a discrete phase option. Unlike the sigmoid-based quantization, the proposed codebook-based method leads to a rich diversity of beam features corresponding to the discrete phases. Moreover, the proposed quantization method becomes more robust (smaller error band) as the size of the codebook increases.
In Table 3, we compare the complexity of SE-QPTNet, QPTNet, and related works. SE-QPTNet achieves the best success rate and achievable rate among the compared works. Although the achievable rates of QPTNet and SE-QPTNet are almost identical, the success rate of SE-QPTNet is higher. We observe an improvement of 2.6%, owing to the contrastive environmental prediction that enables it to update the latent beam codebook efficiently. The number of parameters of SE-QPTNet is lower than those of LSTM attention [30], spatial attention [37], and the DNN [21], saving 18%, 28%, and 82% of the parameters, respectively. The results demonstrate that the proposed contrastive learning framework can improve DL performance with low overhead. Additionally, contrastive environmental prediction improves the learning efficiency by distinguishing the target UE from others, while maintaining a low parameter overhead and a better global beam signature representation. Consequently, the proposed contrastive learning framework promotes learning efficiency without incurring high complexity.

Table 4 investigates the effects of spatial attention modules for beam prediction. We compare the complexity, success rate, and achievable rate of different spatial attention options. According to the results, the success rate improves by 20% when the number of attention modules is larger than 1. Further increasing the number of attention modules does not significantly improve the accuracy of beam prediction, but noticeably increases the overhead. Therefore, we adopt two spatial attention modules as the optimal option for SE-QPTNet.

Achievable rate performance
The proposed SE-QPTNet can obtain extra performance by pre-training without label information and achieve a high transmission rate with a reduced amount of labeled CSI. To evaluate the training overhead and robustness in practical systems, we plot the achievable rate for different training dataset sizes and under different SNR levels.
In Table 5, we compare the achievable rate of QPTNet and SE-QPTNet for different latent beam codebook options. Without noise effects, the achievable rates of SE-QPTNet and QPTNet are close, although that of SE-QPTNet is still higher. Although the performance of QPTNet and SE-QPTNet is almost identical, SE-QPTNet is more robust. We observe an improvement of 0.41 at 10 dB SNR, owing to the contrastive environmental prediction that enables it to update the latent beam codebook efficiently. The results demonstrate that a larger spatial resolution can comprehensively quantify environmental variations to improve beam prediction accuracy. Additionally, contrastive environmental prediction improves robustness through the inherent representation of the global beam signature among different UEs.

To illustrate the low training overhead of the proposed solutions, Fig. 10 shows the achievable rate, defined in (5), of the proposed SE-QPTNet, QPTNet, and DNN approaches for different dataset sizes. Note that the dot-dash line in Fig. 10 represents the upper bound on the performance for the given system and channel models. The horizontal axis represents the training data size, and the achievable rate results are calculated with 1%, 5%, 10%, 30%, 50%, 70%, and 100% of the training data. The results of SE-QPTNet and QPTNet are close to the upper bound with only 10% of the training data, indicating that the proposed solutions allow the BS to apply the desired beam with an efficient training scheme. SE-QPTNet can successfully predict the optimal beam based on the contrastive learning framework with only 1% of the training data at the BS. This illustrates the ability of the self-enhancement to efficiently represent the desired angular sector with positive and negative samples, achieving negligible training overhead. Moreover, the proposed solutions are flexible in customizing the beam solution with small training data. Therefore, approaching this bound illustrates the optimality of the proposed solutions.

To verify the robustness of the proposed SE-QPTNet and QPTNet, we investigate the achievable rate at different SNR levels, as shown in Fig. 11. The dot-dash line represents the upper bound performance, i.e., the optimal rate without noise for the given system and channel models. The results show that SE-QPTNet and QPTNet are consistently better than the simple DNN beam training approach. Benefiting from the contrastive environmental prediction to update the latent beam codebook, SE-QPTNet achieves a higher rate than QPTNet at low SNR levels. Both proposed beam training schemes are near the optimal rate when the SNR is higher than 15 dB. This illustrates that the hierarchical strategy and the quantization deal with the frequency and spatial channel efficiently, and that the self-enhancement provides a better perception of the complicated propagation environment through the contrastive learning framework. Moreover, the contrastive environmental prediction improves robustness by relying on the distinct global beam signature, considering the similarity among the global beam signature and the positive/negative samples.
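The relation between a selected beam and the achievable rate can be illustrated with the sketch below. It assumes a single-path, single-subcarrier channel, a single-antenna UE, a DFT grid of candidate beams, and the rate expression $\log_2(1 + \mathrm{SNR}\,|\mathbf{h}^H \mathbf{f}|^2)$; these modeling choices and all numeric values are illustrative assumptions, because the definition in (5) lies outside this section.

```python
import cmath
import math

def beam(u, n_t):
    """Unit-norm ULA beam at normalized spatial frequency u."""
    scale = 1.0 / math.sqrt(n_t)
    return [scale * cmath.exp(2j * math.pi * n * u) for n in range(n_t)]

def achievable_rate(h, f, snr_db):
    """log2(1 + SNR * |h^H f|^2) for channel h and unit-norm beam f."""
    gain = abs(sum(hn.conjugate() * fn for hn, fn in zip(h, f))) ** 2
    return math.log2(1.0 + 10 ** (snr_db / 10.0) * gain)

n_t = 8
h = [math.sqrt(n_t) * x for x in beam(0.1, n_t)]   # single-path channel at u = 0.1
candidates = [k / n_t - 0.5 for k in range(n_t)]   # DFT grid of spatial frequencies
rates = [achievable_rate(h, beam(u, n_t), snr_db=10) for u in candidates]
upper_bound = achievable_rate(h, beam(0.1, n_t), snr_db=10)  # perfectly matched beam
```

In this toy setting the best grid beam is the one closest to the true direction, and its rate stays slightly below the matched-beam bound, mirroring the gap between predicted beams and the dot-dash upper bound in the figures.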

Conclusion
This paper proposed a novel framework, SE-QPTNet, for beam training based on spatial attention and quantization, utilizing contrastive environmental prediction to detect the relationship between the beam signature of the target UE and those of others. The proposed hierarchical scheme QPTNet quantized the continuous spatial features into controllable latent beam responses within a wide range of discrete phases, achieving higher robustness. The proposed contrastive learning framework SE-QPTNet enhanced the capability of identifying the beam signature by contrastive environmental prediction with a small fraction of the labeled CSI dataset. The proposed SE-QPTNet permitted DL-integrated beam training to adapt flexibly to wireless environments with low labeling costs. The simulation results showed that the proposed SE-QPTNet framework can achieve a data rate of 7.01 bps/Hz with only 5% labeled data, which represents 95.1% of the performance obtained with the full-size dataset. The proposed SE-QPTNet framework outperformed existing DL-based schemes, obtaining higher capacity with lower expertise requirements and highly reliable performance for mmWave massive MIMO systems.

Fig. 1
Fig. 1 Overview of the proposed beam training schemes in mmWave massive MIMO system

Fig. 2
Fig. 2 Overview of spatial attention-associated beam prediction. The candidate beam responses are obtained by calculating the attention score from the beam gain and direction representations based on the channel signatures. The predicted beam is then decided by a scoring evaluation that chooses the highest pairing probability of beam gain and direction


Fig. 10 Achievable rate performance of QPTNet, SE-QPTNet, and DNN [21] with different training samples

Jia et al. J Wireless Com Network 2023, 2023(1):69

member since 2018. Her research interest includes signal processing, wireless communications, and multimedia communication. Her current work focuses on intelligent signal processing, massive MIMO, and radio over fiber (RoF).

Taisei Urakami received the B.S. degree in Electrical Engineering from National Institute of Technology (KOSEN), Kagawa College, Kagawa, Japan, in 2022. He is currently a master's student in the Division of Information Science, Nara Institute of Science and Technology, Japan. He is a student member of IEICE. His research interests include deep learning-based beam prediction for intelligent reflecting surface, metasurface reflector, and microwave circuits.

Hui Gao received the B.Eng. degree in Information Engineering and the Ph.D. degree in Signal and Information Processing from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2007 and 2012, respectively. From 2009 to 2012, he was a research assistant with the Wireless and Mobile Communications Technology R&D Center, Tsinghua University, Beijing, China. In 2012, he was a research assistant with the Singapore University of Technology and Design, Singapore, where he was a postdoctoral researcher from 2012 to 2014. He is currently an associate professor with the School of Information and Communication Engineering, BUPT. His research interests include massive and mmWave multiple-input multiple-output systems, cooperative communications, and ultra-wideband wireless communications.

Minoru Okada received the B.E. degree from the University of Electro-Communications in 1990, and the M.E. and Ph.D. degrees in Communications Engineering from Osaka University in 1992 and 1998, respectively. He served as an assistant professor at Osaka University from 1993 to 2000. He was a visiting researcher at Southampton University, UK, in 1999. He moved to Nara Institute of Science and Technology as an associate professor in 2000. Since 2006, he has been a professor at the same institute. He is serving as a chair of the technical committee on wideband systems (WBS) in IEICE. His research interests include wireless communications and power transfer.

Table 2
Training hyperparameters

Table 3
Complexity of the proposed SE-QPTNet, QPTNet, and related works

Table 4
Performances of different attention module options

Table 5
Achievable rate of QPTNet and SE-QPTNet with different SNRs and N CB