User selection and dynamic power allocation in the SWIPT-NOMA relay system

Non-orthogonal multiple access (NOMA) technology provides an effective solution to massive access with a high data rate demand in new-generation mobile networks. This paper combines NOMA with a simultaneous wireless information and power transfer (SWIPT) relay to maximize the sum rate in the downlink system. To that end, it is critical to select the users that access the system effectively and to allocate power to the accessing users. Because traditional optimization methods have difficulty solving nonlinear and non-convex problems, this paper proposes a neural-network-based user selection and dynamic power allocation (USDPA) scheme for the SWIPT-NOMA relay system. We establish a user selection network utilizing a deep neural network (DNN) and propose a power allocation network using deep reinforcement learning. The simulation results show that the proposed scheme achieves better performance than other related schemes, especially for high quality of service requirements.

Numerous studies were conducted on wireless resource management to improve the performance of the SWIPT-NOMA relay system [6][7][8][9][10][11]. Reference [6] utilized an average power allocation scheme in the downlink and fixed power control in the uplink to evaluate the ergodic rate; however, the strategy does not guarantee that all signals are successfully decoded in the downlink and uplink. Reference [7] compared a cognitive radio-inspired power allocation scheme with a fixed power allocation scheme to ensure the fairness of the data rate. Reference [8] analyzed the error probability of the SWIPT-NOMA system by using a fixed allocation power scheme. In [9], the outage probability was regarded as the optimization function to obtain the power allocation factors. The analysis in [10] was in line with realistic scenarios regarding the impact of imperfect channel state information (ICSI) and residual hardware impairments (RHIs). Reference [11] evaluated the performance of a complex SWIPT scenario that allocated fixed power in the downlink. However, most papers focused on fixed power allocation, whereas artificial intelligence-based (AI) schemes for wireless resource allocation have not been well researched.
Many studies investigated access schemes in SWIPT-NOMA relay systems [12][13][14]. Reference [12] analyzed a SWIPT-NOMA relay system that considered channel estimation errors (CEEs) and RHIs. When all users accessed the system, more interference occurred at the receivers. Reference [13] investigated the outage probability when choosing an optimum near destination node and an optimum far destination node, with the near node used as the relay. Nevertheless, neither [12] nor [13] considered the channel gain from the relay to the mobile terminal (MT), causing performance degradation. References [7][8][9][10][11] considered access by all users, which prevented the decoding of all signals and generated more interference at the receivers. Furthermore, schemes in which all users access the system do not perform well despite their high model complexity [15,16]. In contrast to [7][8][9][10][11][12][13], reference [14] proposed granting access to the users that have fed back channel state information (CSI); however, the algorithm had difficulty converging.
Therefore, it is imperative to develop a scheme that provides user access and allocates power to qualified users. AI techniques can extract valuable information from data to learn and support different functions for optimization, prediction, and decision-making in mobile edge computing, mobility prediction, optimal handover solutions, and spectrum management [17]. Deep reinforcement learning (DRL) can solve real-time and dynamic decision-making problems for power allocation [18][19][20]. Reference [18] proposed a deep Q network (DQN) for each MT to obtain the optimal power allocation scheme. The objective was to reduce the size of the state space; however, this distributed power allocation method has no information interaction between the MTs, resulting in power allocation conflicts. Reference [19] proposed a two-step model-free DRL-based power control scheme to maximize the long-term sum energy efficiency (EE). Based on a multi-carrier NOMA network with SWIPT, reference [20] proposed to use a deep belief network (DBN) to approximate the optimal power allocation.

Contribution
To address the problems of traditional methods [7][8][9][10][11][12][13][14][15][16], and inspired by the above studies [17][18][19][20], we propose a combined user selection and dynamic power allocation (USDPA) scheme that chooses the best users to access the system and determines the optimal power allocation to maximize the sum rate. The main advantages and contributions of this paper are summarized as follows.
• The USDPA scheme is proposed in the SWIPT-NOMA relay system to optimize the user access and power allocation simultaneously to maximize the sum rate, because traditional optimization methods have difficulty solving nonlinear and non-convex problems. More importantly, the results show that our algorithm can successfully admit more users than comparable algorithms.
• We use a deep neural network (DNN) for the user selection network to generate the access decision. Subsequently, the access decision is mapped to several candidate access actions, whose number changes adaptively. In addition, the results show that the model converges quickly without adding computational complexity.
• We utilize a DQN to generate the optimal power allocation for each candidate access action. Afterwards, we select the pair of access action and power allocation action that yields the maximum sum rate in the system. The best power allocation action is stored in the replay memory to train this network.
• Finally, we compare the performance of the USDPA with other schemes. The simulations under different scenarios show that the proposed algorithm improves quality of service (QoS) and can achieve better performance than other related schemes.

Organization
The remainder of this paper is organized as follows. Section 2 describes the system model and the problem formulation of the user selection and power allocation model for the SWIPT-NOMA relay system. Section 3 presents the USDPA scheme, including the user selection network and power allocation network. Section 4 presents the experimental results and analysis, including the convergence, the sum rate, and the number of successful communication users (NSCUs). Finally, the conclusions are summarized in Sect. 5.

System model
We consider a system model that includes a base station (BS), a relay employing a decode-and-forward (DF) protocol, and N destinations, as illustrated in Fig. 1. Hereafter, subscripts S, R, and $D_i$ denote the BS, the relay, and destination i, respectively. Sector $S_1$ has radius $\Upsilon_{S_1}$, is centered at the BS, and spans an angle φ. Sector $S_2$ has radius $\Upsilon_{S_2}$ and the same center and angle as sector $S_1$. The relay is located on the circular arc of radius $\Upsilon_{S_1}$, and the N destinations are randomly and uniformly distributed in the region between $\Upsilon_{S_1}$ and $\Upsilon_{S_2}$. Each destination node and the relay have a single antenna operating in half-duplex (HD) mode. We assume that all links experience independent and identically distributed small-scale Rayleigh fading. The channel coefficients of the links from the BS to the relay and from the relay to $D_i$ are $h_{S,R} \sim \mathcal{CN}(0, d_{S,R}^{-\tau})$ and $h_{R,D_i} \sim \mathcal{CN}(0, d_{R,D_i}^{-\tau})$, respectively, where $d_{i,j}$ denotes the distance between node i and node j, and τ is the path-loss exponent.
For simplicity, we assume that the transmission time T = 1 and the bandwidth B = 1. The power splitting relay (PSR) strategy is used (Fig. 2). Within the first αT of each slot, the relay performs energy harvesting (EH) and information decoding (ID); within the remaining (1 − α)T, the relay performs information forwarding (IF) in the NOMA mode. ρ is the power splitting factor for harvesting energy, and (1 − ρ) is the fraction used for decoding information. At the end of αT of each slot, the relay receives the signal from the BS, which can be expressed as: where $x_S$ is the signal transmitted by the BS to the relay and $n_R \sim \mathcal{CN}(0, \sigma_R^2)$ is the additive white Gaussian noise.
The energy harvested by the relay is defined as follows: where η is the energy conversion efficiency factor. The remaining battery power of the relay is $B_r(t)$ at the beginning of each slot, and $B_r(0) = 0$ in the first time slot. We assume that the harvested energy is much less than the maximum storage capacity of the relay. After αT of each slot, the total energy of the battery is: where $B_r(t-1)$ is the remaining energy of the previous time slot. The relay decodes the received signal and forwards the superimposed signal through NOMA. In each (1 − α)T, the maximum transmitting power of the relay can be expressed as:

Fig. 2 The PSR strategy. The relay uses the PSR strategy to harvest energy and forward the superimposed signals to the qualified users. Panel labels: Relay, Energy Harvesting, Information Forwarding.

We assume that $D_j$ belongs to a set C that includes the qualified access users, where $|C| = \varpi$. The received signals from the relay can be defined as: where each signal $x_j$ is assigned a power allocation factor by the relay, and $n_D \sim \mathcal{CN}(0, \sigma_D^2)$ is the additive white Gaussian noise. The remaining energy of the battery at the relay after each time slot can be expressed as: We implement successive interference cancellation (SIC) based on the power ranking from strong to weak. If the j-th user is able to eliminate the signals of weaker users, the signal-to-interference-plus-noise ratio (SINR) for decoding its own signal is: The achievable data rate at $D_j$ is defined as follows: The sum rate of this system is as follows:
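Because the expressions of the system model are not reproduced in this text, the following minimal sketch shows one way the quantities described above could be computed. It rests on several assumptions that are not taken from the paper: the harvested energy is modeled as η ρ P_S |h_{S,R}|² αT, the relay may spend its whole battery during (1 − α)T, each user cancels every signal with a larger power factor and treats the signals with smaller factors as interference, and the rate carries a (1 − α) pre-log factor. The function names, parameter defaults, and noise level are hypothetical.

```python
import math

def relay_transmit_power(p_s, gain_sr, battery_prev,
                         alpha=0.5, rho=0.5, eta=0.8, slot=1.0):
    """Maximum relay power under an assumed PSR energy model.

    gain_sr is |h_{S,R}|^2; battery_prev plays the role of B_r(t-1).
    """
    harvested = eta * rho * p_s * gain_sr * alpha * slot  # energy harvested during alpha*T
    battery = battery_prev + harvested                    # total energy after the EH phase
    return battery / ((1.0 - alpha) * slot)               # power if all energy is spent in (1-alpha)*T

def noma_rates(p_r, gains, factors, noise=1e-3, alpha=0.5):
    """Per-user rates for the users in C (gains[j] is |h_{R,D_j}|^2).

    Assumption: user j removes every signal whose power factor is larger
    than its own and sees the smaller-factor signals as interference.
    """
    rates = []
    for gain, own in zip(gains, factors):
        residual = sum(f for f in factors if f < own)     # factors left after SIC
        sinr = own * p_r * gain / (residual * p_r * gain + noise)
        rates.append((1.0 - alpha) * math.log2(1.0 + sinr))
    return rates

def sum_rate(rates):
    """Sum rate of the admitted users."""
    return sum(rates)
```

For example, `noma_rates(relay_transmit_power(10.0, 0.5, 0.0), [0.9, 0.4], [0.3, 0.7])` returns one rate per admitted user, and `sum_rate` adds them up.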

Problem formulation
We aim to maximize the sum rate of the SWIPT-NOMA relay system; thus, the optimization problem is expressed as: where the set C includes the qualified access users, $R_{th}$ is the data rate threshold that $D_j$ must reach to decode its signal successfully, and the remaining optimization variable is the set of power allocation factors.
Constraint C1 represents that each access user belongs to the qualified set C ; constraint C2 represents the minimum quality of service (QoS) requirements for selected access users, where the data rate of each qualified access user needs to be larger than the rate threshold; constraint C3 states that the power cannot be larger than the transmission power of the relay.
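The optimization problem itself is not reproduced in this text. Based on the three constraints described above, it plausibly takes a form like the following. The symbol $\lambda_j$ for the power allocation factor of $D_j$, the symbol $R_{D_j}$ for its rate, and the exact form of C3 (factors summing to at most one, so that the forwarded power does not exceed the relay's transmission power) are assumptions made here for illustration.

```latex
\begin{aligned}
\max_{\mathcal{C},\,\{\lambda_j\}} \quad & \sum_{D_j \in \mathcal{C}} R_{D_j} \\
\text{s.t.} \quad
 & \text{C1: } D_j \in \mathcal{C}, \\
 & \text{C2: } R_{D_j} \ge R_{th}, \quad \forall D_j \in \mathcal{C}, \\
 & \text{C3: } \sum_{D_j \in \mathcal{C}} \lambda_j \le 1 .
\end{aligned}
```

Written this way, the coupling between the discrete choice of $\mathcal{C}$ and the continuous factors $\lambda_j$ is explicit, which is what makes the problem non-convex and motivates the two-network approach.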
A user selection network is established to reduce the interference caused by the access of all users. In addition, since power allocation adjustment is inefficient, we propose a DQN algorithm to solve this problem.

USDPA scheme
In this section, we describe the USDPA scheme to determine user access and power allocation. The USDPA algorithm for the downlink SWIPT-NOMA relay system is presented in Fig. 3. We first determine the user access based on the user selection network and subsequently derive the power allocation based on the DQN.
The relay forwards the signals to the users according to the actions produced by the user selection network and the power allocation network. By optimizing the user access and the power allocation of the system, we maximize the sum rate. The USDPA algorithm is shown in Algorithm 1.

User selection network
In this part, we design an access policy π x (t) that rapidly generates an access decision Y (t), where Y (t) represents the output of the user selection network.

User selection algorithm
The user selection network has an embedded parameter ω 1 (t) that connects the hidden neurons. At the beginning of each slot, the user selection network uses h R,D (t) as the input and outputs a relaxed user access action Y (t) with N dimensions according to the access policy π x (t) and the parameterized ω 1 (t) . Since each value in Y (t) is between 0 and 1, it is difficult to determine who should access the system; thus, we design a mapping rule to quantize the output Y (t) . According to this rule, Y (t) is mapped into W access vectors, and the one with the maximum sum rate is the best access vector q * (t).
A four-layer DNN is designed with one input layer, two hidden layers, and one output layer. The input $h_{R,D}(t)$ and the output Y (t) both have dimension N, the number of destination nodes. The two hidden layers use the ReLU activation function, and the output layer uses a sigmoid activation function. In the t-th slot, the output of the user selection network can be expressed as $Y(t) = f_{\omega_1}(h_{R,D}(t))$. The user selection algorithm is shown in Algorithm 2.
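The forward pass of such a network can be sketched in a few lines of plain Python. This is only an illustration of the layer structure described above (N inputs, two ReLU hidden layers, N sigmoid outputs); the hidden widths, the random initialization, and all function names are placeholders rather than the settings used in the paper.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (placeholder initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def user_selection_forward(h, layers):
    """Map the N channel gains h to a relaxed access decision Y in (0, 1)^N."""
    hidden1 = relu(dense(h, *layers[0]))
    hidden2 = relu(dense(hidden1, *layers[1]))
    return sigmoid(dense(hidden2, *layers[2]))
```

With N = 4, for instance, `layers = [init_layer(4, 8, rng), init_layer(8, 6, rng), init_layer(6, 4, rng)]` builds the three weight matrices, and `user_selection_forward(h, layers)` returns four values strictly between 0 and 1.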

The mapping rule
The output Y (t) of the user selection network is mapped to W vectors. Each entry of a vector is either 0 or 1, where 0 means that the user does not access the system and 1 means that it does. There are $2^N$ possible vectors; consequently, $W \in [1, 2^N]$, and its initial value is N. Reference [21] proved the effectiveness of this method using the same binary representation in edge computing to evaluate the output of the DNN. The detailed mapping rules are as follows:
(1) The first mapping vector q 1 is obtained by comparing each entry of Y (t) with 0.5.
(2) The new sequence Y * (t) is obtained by sorting the entries of Y (t) according to the absolute value of their difference from 0.5.
(3) The remaining W − 1 mapping vectors are derived from Y * (t), and the i-th mapping vector is as follows:
After every ζ slots, W is updated as $W^* = \min(\max(W(t-1), \ldots, W(t-\zeta)) + 1, N)$, where $W(t-\zeta)$ is the position, among the W vectors, of the best user selection vector in slot (t − ζ).
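The three mapping steps and the update of W can be sketched as follows. Because the expression for the i-th mapping vector is not reproduced in this text, the sketch assumes the order-preserving rule commonly paired with this kind of DNN output quantization: each further vector thresholds Y (t) at the next entry of the sorted sequence, and ties are resolved toward 1 when the threshold is at most 0.5. The names `map_access_vectors` and `next_vector_count` are hypothetical.

```python
def map_access_vectors(y, w):
    """Quantize the relaxed decision y into at most w binary access vectors."""
    # Step (1): threshold every entry at 0.5.
    vectors = [[1 if v > 0.5 else 0 for v in y]]
    # Step (2): order the entries by their distance to 0.5 (closest first).
    ordered = sorted(y, key=lambda v: abs(v - 0.5))
    # Step (3): use the first w - 1 ordered entries as thresholds (assumed rule).
    for ref in ordered[: w - 1]:
        vectors.append(
            [1 if (v > ref or (v == ref and ref <= 0.5)) else 0 for v in y]
        )
    return vectors

def next_vector_count(best_positions, n):
    """W* = min(max(recent best positions) + 1, N), applied every zeta slots."""
    return min(max(best_positions) + 1, n)
```

For Y = [0.2, 0.4, 0.7, 0.9] and W = 4, the sketch yields [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1], and [1, 1, 1, 1]. Under this rule at most N + 1 distinct candidates are produced, so shrinking W over time only trims candidates that were rarely the best.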

The training of the user selection network
To maximize the sum rate, $h_{R,D}(t)$ and each access vector $q_k(t)$ are the inputs of the power allocation network, which outputs the power allocation $p_k(t)$. Then, the sum rate is calculated for each action pair $(q_k(t), p_k(t))$, where k = 1, . . . , W. The system selects the best access action $q^*(t)$ and adds the newly obtained pair $(h_{R,D}(t), q^*(t))$ to replay memory 1 for training, with $q^*(t)$ used as the label. Subsequently, a batch of training samples is drawn from replay memory 1 to train the user selection network, and the parameters $\omega_1(t)$ and the policy $\pi_x(t)$ are updated. The parameters $\omega_1(t)$ are updated at a fixed interval of slots by reducing the loss function of the user selection network as follows: The Adam optimizer is utilized in the training process with learning rate $\theta_1$. After training, the updated user selection policy $\pi_x(t)^*$ is obtained.
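The loss function of the user selection network is not reproduced in this text. Since the labels $q^*(t)$ are binary and the output layer is a sigmoid, an averaged binary cross-entropy is a natural candidate, and the sketch below assumes it. This choice, and the function name, are assumptions rather than the paper's stated loss.

```python
import math

def binary_cross_entropy(labels, outputs, eps=1e-12):
    """Average cross-entropy between a binary label vector and sigmoid outputs."""
    total = 0.0
    for q, y in zip(labels, outputs):
        y = min(max(y, eps), 1.0 - eps)  # clip to keep log() finite
        total += q * math.log(y) + (1 - q) * math.log(1.0 - y)
    return -total / len(labels)
```

The loss shrinks as the relaxed output Y (t) moves toward the stored best access vector, which is the behavior the training step described above relies on.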

Power allocation algorithm
Next, we obtain the appropriate allocation action using the DQN; the algorithm is shown in Algorithm 3. We first provide some background information on reinforcement learning (RL) to clarify the algorithm. The key elements of RL are defined as follows:
State space: The state space is defined as
Action space: The power allocation action space is defined as $a = \{a_1, \ldots, a_z\}$, where z = A N M. There are M power allocation factors, and the action space for N destinations has A N M actions.
Reward: We use the NSCUs, i.e., the number of users whose data rate is no less than the QoS threshold, to obtain an immediate reward, which is defined as follows: where $\varpi_k(t)$ is the number of qualified users accessing the system in the k-th access vector. Moreover, the cumulative reward function of the power allocation network is defined as follows: where γ is the discount factor of the reward during L slots.
Transition probability: P represents the transition probability, i.e., the probability of moving from state s(t) to the next state s(t + 1), given the action a(t) executed in state s(t).
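Two of the quantities above lend themselves to a short illustration: counting the NSCUs for a given set of rates and accumulating discounted rewards over L slots. The sketch assumes the usual discounted sum, with the reward l slots ahead weighted by γ to the power l. The exact reward expression of the paper is not reproduced here, so `count_successful_users` only returns the count on which that reward is based. Both function names are hypothetical.

```python
def count_successful_users(rates, r_th):
    """Number of admitted users whose rate is no less than the QoS threshold."""
    return sum(1 for r in rates if r >= r_th)

def discounted_return(rewards, gamma):
    """Cumulative reward over len(rewards) slots with discount factor gamma."""
    total = 0.0
    for r in reversed(rewards):  # accumulate from the last slot backwards
        total = r + gamma * total
    return total
```

For example, three successive rewards of 1 with γ = 0.5 accumulate to 1 + 0.5 + 0.25 = 1.75.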
The Q value function is instrumental in solving RL problems [22]. The function describes the expected cumulative reward R(t) obtained by starting from state s(t), performing action a(t), and following policy $\pi_r(t)$ thereafter. To obtain the appropriate power allocation action, the Q value function is defined as:
(16) $Q^{\pi_r(t)}(s(t), a(t)) = \mathbb{E}_{\pi_r(t)}\left[ r(t) + \gamma Q^{\pi_r(t)}(s(t+1), a(t+1)) \mid s(t), a(t) \right]$
The optimal action-value function in Eq. (18) satisfies the Bellman optimality equation [22], which is expressed as follows:
(19) $Q^{*}(s(t), a(t)) = \mathbb{E}\left[ r(t) + \gamma \max_{a(t+1)} Q^{*}(s(t+1), a(t+1)) \mid s(t), a(t) \right]$
After the optimal Q-function $Q^{*}(s(t), a(t))$ is obtained, the Q-learning policy is determined by: The state-value function is obtained as follows: The Q-value is defined as follows: where $\theta_2(t)$ is the learning rate of the power allocation network.
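For reference, the textbook tabular Q-learning update that follows from the Bellman optimality equation can be sketched as below. The paper's own update expression is not reproduced in this text, so this is the standard rule rather than a transcription. The argument `lr` plays the role of the learning rate, and the nested-dictionary table layout is a choice made for the example.

```python
def q_update(q_table, state, action, reward, next_state, lr, gamma):
    """One tabular Q-learning step; q_table maps state -> {action: value}."""
    best_next = max(q_table[next_state].values())  # max over next actions
    td_target = reward + gamma * best_next         # bootstrapped target
    q_table[state][action] += lr * (td_target - q_table[state][action])
    return q_table[state][action]
```

Each call moves the stored value a fraction `lr` of the way toward the bootstrapped target, so repeated updates drive the table toward a fixed point of the Bellman optimality equation.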
In general, the Q-learning algorithm adopts the ε-greedy policy: the power allocation action a(t) with the largest Q value is selected with probability 1 − ε, whereas a random action is selected with probability ε = 0.8. The power allocation action is generated by: where $\omega_2$ is the parameter of the power allocation network.
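A minimal sketch of ε-greedy selection over a list of estimated Q values is given below. The function name and the optional `rng` argument are illustrative; actions are represented by their indices.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])   # exploit
```

With ε = 0.8, as stated above, roughly four out of five selections are exploratory; passing a seeded `random.Random` instance as `rng` makes a run reproducible.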

Power allocation algorithm based on the DQN
Nevertheless, the Bellman equation is difficult to solve because it is nonlinear and does not have a closed-form solution. The solution to this problem is to utilize neural networks to estimate the Q value. Therefore, we adopt a DQN to establish the power allocation network with a DNN to output the estimated Q value.
We design a power allocation policy $\pi_r(t)$ that quickly generates a power allocation decision corresponding to each access vector of the user selection network. The power allocation is implemented by the DQN, which is characterized by the embedded parameter $\omega_2(t)$ that connects the hidden neurons. After the output of the user selection network has been mapped to W access vectors q(t), $h_{R,D}(t)$ combined with each access vector $q_k(t)$ is used as the input of the power allocation network. The output of this algorithm is $p_k(t)$ corresponding to each access vector $q_k(t)$. Then, we choose the actions $(q^*(t), p^*(t))$ with the maximum sum rate as the best actions and add the newly obtained pair $(h_{R,D}(t), p^*(t))$ to replay memory 2. Subsequently, a batch of training samples drawn from replay memory 2 is used to train the power allocation network, and the parameters $\omega_2(t)$ and $\pi_r(t)$ are updated.
A five-layer power allocation network is designed, with one input layer, three hidden layers, and one output layer. The ReLU function is used as the activation function in the first two hidden layers, and the tanh function is used in the last hidden layer. The output of the power allocation network can be expressed as $p_k(t) = f_{\omega_2}(h_{R,D}(t), q_k(t))$.
After allocating power according to access vectors, the relay executes the optimal actions (q * (t), p * (t)) with the maximum sum rate of the system and receives the immediate reward r(t) . Subsequently, the system moves to the next state, and the replay memory 2 is used to store the tuple (s(t), p * (t), r(t), s(t + 1)) of each slot. When the replay memory 2 is full, the oldest record is removed, and the newest record is stored.
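The first-in, first-out behavior of the replay memory can be sketched with a bounded deque. The class and method names are hypothetical, and uniform sampling of the training batch is an assumption, since the text does not state how batches are drawn.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer that discards the oldest tuple when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        """Append (s(t), p*(t), r(t), s(t+1)); the deque evicts the oldest entry."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly at random without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

`ReplayMemory(400)` would match the size reported for replay memory 2 in the simulation settings.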

The training of power allocation algorithm
A batch of training samples is drawn from replay memory 2 to train the power allocation network. Then, the target Q value is obtained according to the target Q network as follows:
(23) $a_1(t) = \arg\max_{a(t)} Q(s(t), a(t); \omega_2)$,
(24) $y_i = r(i) + \max Q(s(i+1), a_1(i+1); \omega_2)$,
where $\omega_2$ is the parameter of the target Q network. The Q network is trained on this batch by minimizing the loss function of the power allocation network, which is given in Eq. (25). Meanwhile, the Adam optimizer is utilized in the training process with learning rate $\theta_2$. We update the parameters $\omega_2$ of the target Q network by copying the parameters of the Q network in each slot.
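One reading of Eqs. (23) to (25) is that the next action is chosen with the Q network and evaluated with the target Q network, after which the squared error is minimized. The sketch below follows that reading and should be taken as an interpretation, not a transcription. The two networks are passed in as plain callables, averaging the loss over the batch is an assumption, and the `gamma` argument defaults to 1.0 because Eq. (24) as printed shows no discount factor.

```python
def dqn_batch_loss(batch, q_online, q_target, actions, gamma=1.0):
    """Mean squared error between targets y_i and current Q estimates.

    batch: iterable of (state, action, reward, next_state) tuples.
    q_online, q_target: callables (state, action) -> estimated Q value.
    """
    losses = []
    for state, action, reward, next_state in batch:
        # Eq. (23)-style step: pick the next action with the Q network.
        a1 = max(actions, key=lambda a: q_online(next_state, a))
        # Eq. (24)-style step: evaluate that action with the target network.
        y = reward + gamma * q_target(next_state, a1)
        # Eq. (25): squared error for this sample.
        losses.append((y - q_online(state, action)) ** 2)
    return sum(losses) / len(losses)
```

In a full implementation the callables would be two copies of the same network, and the averaged loss would be passed to the Adam optimizer.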

Complexity analysis
The complexity of the USDPA algorithm depends on the number of layers of the neural network and the number of neurons in each layer. The complexity of the user selection network is $M_1 = N f_1 + f_1 f_2 + f_2 N$, where $f_1$ and $f_2$ are the numbers of neurons in the first and second hidden layers, respectively. The complexity of the power allocation network is $M_2$, where $f_{Q1}$, $f_{Q2}$, and $f_{Q3}$ are the numbers of neurons in the first, second, and third hidden layers, respectively. In the USDPA algorithm, the output of the user selection network is mapped to W user access vectors; thus, the algorithm complexity is $O(M_1 + W M_2)$.
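The count behind $M_1$ is the number of weights between consecutive layers, which can be checked with a short helper. The layer sizes in the usage note are hypothetical and are not the paper's settings. The same helper applies to the power allocation network once its input and output dimensions are known.

```python
def dense_network_cost(layer_sizes):
    """Sum of products of consecutive layer widths (one multiply per weight)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def usdpa_cost(m1, m2, w):
    """Per-slot cost: one user selection pass plus W power allocation passes."""
    return m1 + w * m2
```

For N = 10, f_1 = 120, and f_2 = 80, `dense_network_cost([10, 120, 80, 10])` gives 1200 + 9600 + 800 = 11600, and the total cost grows linearly with W, which is why a small converged W keeps the scheme inexpensive.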

(25) $\mathrm{Loss}(\omega_2) = \left( y_i - Q(s(i), a_1(i); \omega_2) \right)^2$

Results and discussion
In this section, the effectiveness of the proposed user selection and power allocation optimization scheme of the SWIPT-NOMA relay system is verified by simulation. The effects of $R_{th}$ and of various levels of transmitting power at the BS on the performance of the SWIPT-NOMA relay system are analyzed to illustrate the superiority of the proposed scheme in increasing the sum rate.
In this paper, TensorFlow 2.0 is used for simulation. The simulation parameters are set as follows [23] (Table 1).
The sizes of replay memory 1 and replay memory 2 are 1000 and 400, respectively. The initial number of mapping vectors W = N.

Validation of training effects
In this part, we assess the performance of the proposed USDPA algorithm using simulations with different requirements for the successful decoding of the signals. Figure 4 shows W of the USDPA algorithm versus the training slots when P S = 40 dBm. The value of W converges after about 4000 slots and then stays nearly stable within 2, indicating that the mapping scheme does not increase the computational complexity. It can be seen that the higher the QoS requirement, the lower the value to which W converges. The reason is that a lower QoS requirement is easier to satisfy. Figure 5 shows the loss of the USDPA algorithm with P S = 40 dBm and R th = 0.3 bits/s/Hz. It can be seen that the loss functions of the user selection network and the power allocation network converge quickly. Figure 6 shows the average reward of the USDPA versus the training time slots with P S = 40 dBm for different QoS requirements. It can be observed that different QoS requirements have different effects on the performance of the USDPA algorithm. Specifically, the algorithm takes longer to reach convergence when the QoS requirements are high. In addition, when the QoS requirement is 0.3 bits/s/Hz, the loss of the power allocation network converges after about 10,000 slots (Fig. 5), and the average reward converges rapidly to 10 (Fig. 6). Furthermore, after 2000 time slots, the average reward at a QoS requirement of 0.5 bits/s/Hz is higher than that at 0.4 bits/s/Hz. The reason is that when the QoS requirement is 0.5 bits/s/Hz, the USDPA algorithm selects fewer users, which causes less interference; the selected users can therefore meet the QoS requirement more easily once the DQN allocates the appropriate power. Therefore, the average rewards are relatively high. In general, the results indicate that the USDPA algorithm exhibits excellent learning performance for different QoS requirements.

Experimental results and discussion
The goal of this paper is to maximize the sum rate of the SWIPT-NOMA relay system. Consequently, the sum rate and the NSCUs are used to evaluate the algorithm's performance. Four algorithms are compared with the proposed algorithm:
(1) All users access + DQN (AU + DQN): all users access the system, and the power allocation scheme uses the DQN, which is the same as the power allocation scheme in [18].
(2) All users access + average power allocation (AU + AP): all users access the system, and the power of each user's signal is the average power factor. The algorithm decodes the signals using the order of channels from strong to weak.
(3) User selection + average power allocation (US + AP): the users that access the system are determined by the proposed user selection network, and the power of each user's signal is the average power factor.
(4) Random user access + DQN (RU + DQN): the users that access the system are determined randomly, and the DQN is used to allocate the power.
Figure 7 shows the NSCUs for different data rate thresholds. The NSCUs exhibit a decreasing trend for the USDPA, AU + DQN, RU + DQN, AU + AP, and US + AP when P S = 40 dBm. The reason is that, as the threshold increases, it becomes difficult for the system to allocate the appropriate power factor to enable the users to decode the signal successfully. AU + DQN shows the best NSCU performance when the data rate thresholds are R th = 0.2 bits/s/Hz and R th = 0.25 bits/s/Hz. The reason is that lower QoS requirements are easier to satisfy even when all users access the system. The USDPA algorithm exhibits the best performance when R th = 0.3 bits/s/Hz, R th = 0.35 bits/s/Hz, and R th = 0.4 bits/s/Hz because the user selection network chooses only some users to access the system. The fewer the users accessing the system, the less interference there is. However, under the USDPA scheme, the system never admits only one user.
The reason is that each power allocation factor lies strictly between 0 and 1, so admitting a single user cannot yield the maximum sum rate. Therefore, the system always selects multiple users to access the system. Moreover, it can be seen that the performance of the AU + AP and US + AP schemes both converge when the QoS requirement is high. The reason is that both of them use the average power allocation factor for access users, which makes it difficult to guarantee that all qualified users can successfully decode the signal under the high QoS requirements. When R th = 0.3 bits/s/Hz, the performance of the USDPA algorithm is 63.3%, 144.7%, 115%, and 7% higher than that of the RU + DQN, US + AP, AU + AP, and AU + DQN, respectively. Figure 8 shows the average sum rate of the five schemes with P S = 40 dBm. We can see that the performance of the USDPA is the best for all QoS requirements. When R th = 0.3 bits/s/Hz, the average sum rate of the USDPA algorithm is 47.8%, 38.2%, 178%, and 63.1% higher than that of the AU + DQN, RU + DQN, AU + AP, and US + AP, respectively. The reason is that when the average power allocation is utilized for user access, there is no dynamic adjustment of the power allocation factor. More importantly, we observe that the average sum rate of the USDPA scheme is higher for a QoS requirement of 0.4 bits/s/Hz than for a QoS requirement of 0.35 bits/s/Hz. The reason is that when the QoS requirements are higher, the USDPA algorithm selects fewer users to access the system. Thus, there is less interference at the receivers, and the achieved sum rate is higher. In addition, if all users access the system, there is more interference at the receivers, even though the DQN is used to allocate power. The US + AP algorithm maintains stable performance as R th increases because the user selection network chooses appropriate users to access the system.
Although the RU + DQN algorithm chooses the users randomly, it still maintains a steady average sum rate because it uses the DQN algorithm to allocate power. Figure 9 displays the trend of the average sum rate of the five schemes for different levels of transmitting power at the BS when R th = 0.3 bits/s/Hz at the receivers. The average sum rate increases with increasing P S. As P S increases, the SINR at the accessed receivers improves, leading to a performance improvement. In addition, we find that the USDPA scheme outperforms the other four schemes. The proposed scheme jointly optimizes user access and power allocation, and the algorithm exhibits efficient learning ability by utilizing the user selection network and the power allocation network in the dynamic environment.

Fig. 8 The average sum rate comparison versus R th. Different QoS requirements are used to examine the stability of the algorithm. When a certain value is reached, the algorithm starts to converge. The x-axis is R th, and the y-axis is the average sum rate. The curves are the USDPA, AU + DQN, RU + DQN, AU + AP, and US + AP.

Fig. 9 The average sum rate comparison versus P S. The x-axis is P S, and the y-axis is the average sum rate. The curves are the USDPA, AU + DQN, RU + DQN, AU + AP, and US + AP.

Conclusion
We proposed a USDPA scheme for the SWIPT-NOMA relay system to maximize the sum rate in the downlink. A model of the SWIPT-NOMA relay system was established with a PSR scheme to harvest energy and forward signals. The USDPA was used to optimize the user access action and the power allocation action simultaneously. The simulation results showed that the proposed scheme provided the best sum rate among the compared schemes. Due to the complexity of the problem, practical scenarios with multi-antenna configurations and a bidirectional relay will be analyzed in a future study.