Joint spectrum and power allocation scheme based on value decomposition networks in D2D communication networks
EURASIP Journal on Wireless Communications and Networking volume 2024, Article number: 79 (2024)
Abstract
Device-to-device (D2D) communications allow short-range communication devices to multiplex the cellular licensed spectrum and directly establish local connections, supporting an ultra-high number of terminal connections and greater system throughput. However, spectrum sharing also introduces serious interference into the network. Therefore, a reliable and efficient resource allocation strategy is important to mitigate the interference and improve the system spectral efficiency. In this paper, we investigate spectrum access and power allocation in D2D communications underlaying cellular networks based on deep reinforcement learning, with the aim of finding a feasible resource allocation strategy that maximizes the data rate and system fairness. We propose a value decomposition network-based resource allocation scheme for D2D communication networks. The proposed scheme avoids frequent information exchanges among D2D users through centralized training, while allowing D2D users to make distributed joint resource allocation decisions. Simulation results show that the proposed scheme has stable convergence and good scalability, and can effectively improve the system capacity.
1 Introduction
With the application of 5G communication technology, the access of massive terminal devices and the explosive growth of data traffic pose a great challenge to the existing network system and architecture [1, 2]. D2D communication is one of the key technologies for 5G communications, allowing two nearby mobile devices to communicate directly without involving a base station (BS) [3]. By sharing the underlying spectrum resources, D2D users can reuse the licensed spectrum of cellular users, potentially improving spectrum efficiency [4]. However, the transmission of D2D signals can create detrimental interference to cellular networks and other D2D links, significantly hampering the progress of D2D communications [5, 6]. As a result, an effective resource allocation policy is crucial to enhance the capacity of D2D underlay communications within cellular networks.
The joint spectrum and power allocation for D2D communications is an NP-hard problem. Existing works use different mathematical theories to solve the resource allocation problem. Graph theory is widely used to solve resource allocation problems in wireless networks [7, 8]. In [9], the joint mode selection and resource allocation problem is investigated and a graph theory-based algorithm is proposed to obtain the optimal solution. In [10], a simplified bipartite graph is constructed, with the weights of the bipartite graph representing the maximum sum rate of the D2D and cellular links under the outage constraints. Game theory is also an effective method for solving the resource allocation problem in networks [11,12,13]. In [11], an optimization algorithm based on the branch-and-price idea is proposed, aiming to solve the joint optimization problem of link dispatching, channel allocation and power control in D2D communication-assisted fog computing. In [12], a dynamic Stackelberg game algorithm is proposed in which the base station, as the leader, charges the D2D users, as followers, to reduce interference and increase the throughput of the D2D users. In [13], a zero-sum game-based algorithm is proposed to optimize the rate and security of the network in the presence of incomplete channel state information. Resource allocation algorithms based on heuristic ideas usually require a large number of iterative computations with high complexity [14, 15]. In [14], a hybrid genetic algorithm (GA) and binary particle swarm optimization algorithm is proposed for joint subchannel allocation and power control. In [15], a simulated annealing-based algorithm for allocating resources to D2D pairs and cellular users improves the average resource utilization.
Resource allocation methods based on traditional optimization algorithms have certain limitations. The optimization effect of graph theory-based resource allocation schemes relies on global channel state information (CSI), which cannot adapt to increasingly complex communication network structures. Game theory-based resource allocation schemes require frequent information exchange between users, which incurs great signaling overhead. Resource allocation schemes based on heuristic algorithms require many iterations to converge. In recent years, reinforcement learning (RL) methods have been used to learn resource allocation strategies from historical environmental information so as to maximize long-term rewards. Moreover, deep reinforcement learning (DRL) has been applied to the spectrum allocation problem. In [16], a distributed spectrum allocation scheme based on Q-learning is proposed, where D2D users adaptively select the access spectrum to reduce interference and maximize the system throughput in a multi-tier heterogeneous network. A D2D spectrum access algorithm based on the double deep Q-network (DDQN) has been proposed in [17]; this algorithm enables D2D pairs to select orthogonal or non-orthogonal spectrum access in different time slots. In [18], a DRL-based spectrum access scheme is designed to optimize system throughput and resource allocation fairness. In addition, several works have considered joint power control and spectrum allocation [19,20,21,22]. In [19], a distributed deep reinforcement learning scheme is proposed to autonomously optimize channel selection and transmission power using local information and historical environmental information. In [20], a neural network with two output layers is used to approximate the Q-values of the spectrum allocation and power control strategies. In [21], a dueling double deep Q-network based on priority sampling is proposed to maximize the total throughput.
In [22], a distributed spectrum access and power allocation scheme based on deep Q-networks (DQN) [23] is proposed that can be applied to vehicle-to-vehicle (V2V) communication scenarios without the need for global channel information. Although all these efforts have contributed greatly to resource allocation in D2D communications, none of them considers the distribution of rewards between agents in a cooperative relationship or the effect of the instability of a multi-agent environment on the algorithms, and their convergence is overly dependent on the design of the rewards. Recently, a number of multi-agent deep reinforcement learning (MADRL)-based resource allocation studies have begun to focus on improving the convergence of algorithms in multi-agent environments [24, 25]. In [24], a distributed spectrum access algorithm has been designed based on the actor-critic (AC) approach to train DRL models by collecting historical information of all users or neighboring users through a centralized training process. This method shares global historical states to alleviate the instability of multi-agent environments; however, it lacks a global reward to promote cooperation among the agents. In [25], a framework based on the QMIX [26] algorithm was developed for pairing D2D users under a cooperative relationship, solving the problem of associating D2D users and caching devices.
This paper studies collaborative resource allocation for D2D underlay communications in cellular networks. Considering the requirements of spectral efficiency as well as system fairness, a nonlinear integer optimization objective is formulated. To tackle the resource allocation challenges in our scenario, we propose the value decomposition network-based resource allocation (VDNRA) scheme. The VDNRA scheme leverages value decomposition networks (VDN) [27] to optimize collaborative spectrum and power allocation in D2D networks. This scheme aims to mitigate interference, enhance the overall network capacity and ensure fairness among D2D pairs. The main contributions are summarized as follows:

1.
We investigate the resource allocation problem for D2D communications. The spectral efficiency and fairness objectives of the system are maximized by jointly optimizing spectrum allocation and power control.

2.
The optimization problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP). The D2D users are defined as agents in a cooperative relationship. Each D2D user can use local observation information to obtain its resource allocation policy without requiring real-time global channel state information (CSI).

3.
A value decomposition network-based resource allocation (VDNRA) scheme is proposed. Centralized training exploits the environment states observed by all agents to mitigate the instability of the multi-agent environment and the resulting difficulty of convergence, while distributed execution reduces the signaling overhead and the load on the base station.
The rest of this paper is organized as follows. Section 2 provides a detailed description of the system model and the multi-agent reinforcement learning scheme. Section 3 presents the simulation results of our proposed scheme. Finally, the conclusions and discussion are given in Sect. 4.
2 Method
We investigate the problem of spectrum allocation and power control for D2D underlay communications in a single-cell uplink scenario. First, the resource allocation task is transformed into a multi-agent reinforcement learning process under a cooperative relationship. A value decomposition network-based resource allocation (VDNRA) scheme is proposed, and the corresponding key elements such as states, actions and rewards are designed. In the VDNRA scheme, cooperation between agents is facilitated by learning a global action-value function to mitigate the impact of multi-agent environment instability on the algorithm performance.
2.1 System model and problem formulation
We consider a typical D2D communication scenario under a single cellular network with spectrum allocation and power control issues, as shown in Fig. 1. The base station (BS) is located at the center of the cell and provides services to the cellular users (CUEs). The D2D users (DUEs) share the same spectrum resources with the cellular users. There are M CUEs, denoted as \(\mathcal {M}=\{1,\ldots ,M\}\), and N D2D pairs, denoted as \(\mathcal {N}=\{1,\ldots ,N\}\), in the scenario. We consider uplink interference, i.e., the DUEs reuse the uplink channels of the CUEs. This choice is driven by the fact that, in real-world communication scenarios, the utilization of uplink resources tends to be lower than that of downlink resources. Additionally, the BS has a higher capacity to manage interference than mobile devices [28]. The orthogonal frequency-division multiplexing (OFDM) technique is used to partition the frequency-selective channel into flat channels on multiple different subcarriers. CUEs are pre-allocated K orthogonal subchannels, where the kth subchannel is assigned to the mth CUE, thus ensuring that CUEs do not interfere with each other. In the system model, the number of reusable subchannels is equal to the number of CUEs (\(M=K\)), i.e., the cellular network is fully loaded. As a result of sharing the same spectrum, D2D links are susceptible to two types of interference. The first is cross-layer interference, which refers to the interference experienced by the D2D receiver from the cellular user with which it shares the same spectrum. The second is co-layer interference, which arises from other D2D transmitters sharing the same subchannel with that D2D pair.
We assume that the kth subchannel is assigned to the nth DUE and use the indicator \(\alpha _{n}^{k}\in \{0,1\}\) to represent the assignment decision. Specifically, if the nth DUE reuses the kth subchannel, \(\alpha _{n}^{k}=1\); otherwise, it is 0. \(P_{n}^{d}\) and \(P_{m}^{c}\) denote the transmit power of the nth DUE and mth CUE, respectively. \(g_{n}^{d}\) denotes the gain of subchannel k from the nth D2D transmitter to its receiver. \(g_{m,k}^{c,r}\) and \(g_{n^{'},k}^{t,r}\) denote the interference link gain from the mth CUE to the nth D2D receiver on subchannel k and the interference link gain from the \(n^{'}\)th D2D transmitter to the nth D2D receiver on subchannel k, respectively. \(\sigma ^{2}\) is the additive white Gaussian noise (AWGN) power. Note that the channel gains of these links are frequency-selective, meaning they are subject to frequency-dependent small-scale fading and large-scale fading effects. Moreover, the channel gains vary over different coherence time periods, further complicating the resource allocation problem.
Therefore, the signal-to-interference-plus-noise ratio (SINR) on the kth subchannel for the nth DUE is given by
\[\gamma _{n}^{k}=\frac{P_{n}^{d}g_{n}^{d}}{\sigma ^{2}+I_{c}+I_{d}},\]
where
\[I_{c}=P_{m}^{c}g_{m,k}^{c,r},\qquad I_{d}=\sum _{n^{'}\in \mathcal {N},\,n^{'}\ne n}\alpha _{n^{'}}^{k}P_{n^{'}}^{d}g_{n^{'},k}^{t,r}.\]
Among them, \(I_{c}\) represents the interference power when the DUE and CUE share the same spectrum, and \(I_{d}\) indicates the interference from the other DUEs sharing the same subchannel with that D2D pair.
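As an illustration of the two interference terms, the SINR and the resulting spectral efficiency can be sketched in Python. This is a minimal numerical sketch; the function names and the default noise power are our own placeholders, not part of the system model.

```python
import math

def sinr(p_d, g_d, p_c, g_c, co_channel, noise=1e-13):
    """SINR of a D2D receiver on its assigned subchannel.

    p_d, g_d   -- transmit power and direct-link gain of this D2D pair
    p_c, g_c   -- CUE transmit power and CUE->D2D interference gain (I_c)
    co_channel -- list of (power, gain) pairs for the other D2D
                  transmitters sharing the subchannel (I_d)
    noise      -- AWGN power sigma^2 (placeholder default)
    """
    i_c = p_c * g_c
    i_d = sum(p * g for p, g in co_channel)
    return (p_d * g_d) / (noise + i_c + i_d)

def spectral_efficiency(snr):
    """Shannon spectral efficiency in bit/s/Hz."""
    return math.log2(1.0 + snr)
```

With an empty interferer list and zero CUE power the expression reduces to a plain SNR.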
Based on the SINR of the DUE, the spectral efficiency of the DUE can be further obtained as
\[C_{n}^{d}=\log _{2}\left( 1+\gamma _{n}^{k} \right).\]
In order to maintain fairness within the system, we employ Jain's fairness index [29]. Here, the fairness index is given by
\[f=\frac{\left( \sum _{n=1}^{N}C_{n}^{d} \right) ^{2}}{N\sum _{n=1}^{N}\left( C_{n}^{d} \right) ^{2}},\]
where \(f\in [0,1]\); this index serves as a metric to gauge the fairness of the system's quality of service among D2D users. It quantifies the spectral efficiency gap between different DUEs, with a larger fairness index indicating a smaller gap in spectral efficiency and a fairer distribution of resources among the users. By considering Jain's fairness index, we aim to ensure equal opportunities and satisfactory quality of service for DUEs within the network.
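The fairness metric above can be computed directly from the per-DUE spectral efficiencies. A minimal sketch, assuming the standard form of Jain's index; the handling of an all-zero rate vector is our own convention:

```python
def jain_index(rates):
    """Jain's fairness index over per-DUE spectral efficiencies.

    Returns 1.0 when all rates are equal and 1/N when a single
    user receives everything.
    """
    n = len(rates)
    total = sum(rates)
    if total == 0:
        return 1.0  # assumption: an all-zero allocation is treated as fair
    return total ** 2 / (n * sum(r * r for r in rates))
```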
Our main objective is to develop a comprehensive strategy for joint spectrum and power allocation that maximizes the total spectral efficiency of the system while simultaneously ensuring fairness among users. Therefore, the optimization objective can be written as
\[\begin{aligned}\max _{\{\alpha _{n}^{k},P_{n}^{d}\}}\quad&\rho _{1}\sum _{n=1}^{N}C_{n}^{d}+\rho _{2}f\\ \text {s.t.}\quad&\text {C1: }\alpha _{n}^{k}\in \{0,1\},\ \forall n\in \mathcal {N},\ \forall k\in \{1,\ldots ,K\}\\&\text {C2: }0\le P_{n}^{d}\le P_{\rm{max}}^{d},\ \forall n\in \mathcal {N}\\&\text {C3: }\sum _{k=1}^{K}\alpha _{n}^{k}\le 1,\ \forall n\in \mathcal {N}\end{aligned}\]
where \(\rho _{1}\) and \(\rho _{2}\) are weighting factors that balance the total spectral efficiency and system fairness, respectively. \(P_{\rm{max}}^{d}\) is the maximum transmit power of the D2D transmitter. C1 denotes the range of the channel occupancy indicator variable, C2 denotes the DUE maximum transmit power constraint, and C3 ensures that each DUE multiplexes at most one subchannel. This is an NP-hard optimization problem with nonlinear constraints that requires mixed-integer programming in every time slot as the channel changes. To solve this problem, we investigate the application of a multi-agent RL algorithm in this scenario.
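To illustrate the combinatorial nature of the problem, the sketch below enumerates every joint (subchannel, power) assignment respecting C1-C3 for a toy instance. The helper and its objective callback are hypothetical; the point is that the joint space grows as \((K\cdot |\mathcal {P}|)^{N}\), which is why exhaustive search is only feasible for tiny instances.

```python
from itertools import product

def brute_force(n_dues, k_sub, power_levels, objective):
    """Enumerate all joint allocations for a tiny instance.

    Each DUE picks exactly one subchannel (C3) and one quantized power
    level from a bounded set (C1, C2); `objective` scores a joint choice.
    """
    choices = list(product(range(k_sub), power_levels))
    best_val, best_alloc = float("-inf"), None
    for alloc in product(choices, repeat=n_dues):  # (K * |P|)^N joint actions
        val = objective(alloc)
        if val > best_val:
            best_val, best_alloc = val, alloc
    return best_val, best_alloc
```

Even with 20 DUEs, 5 subchannels and 5 power levels, the joint space already has \(25^{20}\) points, which motivates the learning-based approach that follows.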
2.2 Deep reinforcement learning framework
In this section, we first model the multi-agent environment and then propose a value decomposition network-based resource allocation (VDNRA) scheme with centralized training and distributed execution to solve the joint spectrum and power allocation problem. The proposed VDNRA scheme makes full use of the cooperation among DUEs to further improve the overall performance of the system.
2.2.1 Multiagent environments
Here we describe the optimization problem as a fully cooperative multi-agent task. The process of D2D pairs cooperating with each other to maximize rewards can be described as a decentralized partially observable Markov decision process (Dec-POMDP), represented as a tuple \((N,\mathcal {S},\mathcal {A},r,P,\gamma )\). N is the set of agents, and \(\mathcal {S}\) is the set of all agents' local partially observable states \(s_{t}\) of the environment. \(\mathcal {A}\) is the action space representing the set of optional actions \(a_t\) of the agents, and r is the common reward for all agents. P is the state transition probability, i.e., the probability that, given the current state \(s_{t}\in \mathcal {S}\) and the agent's action \(a_{t}\in \mathcal {A}\), the environment transitions to the next state \(s_{t+1}\). \(\gamma\) is the discount factor, which indicates the importance of future rewards.
State: For feasibility reasons, we use the local channel state information (CSI) observed by the D2D link as the state representation, consisting of the following components: the instantaneous channel information of the D2D link, \(G_{n}^{d,t}\); the channel information of the interfering link from the cellular user to the D2D receiver, \(G_{n}^{c,t}\); the interference to the link in the previous time slot, \(I_{n}^{t-1}\); and the spectral efficiency of the D2D link in the previous time slot, \(C_{n}^{t-1}\). Hence, \(s_{n}^{t} = [G_{n}^{d,t}, G_{n}^{c,t}, I_{n}^{t-1}, C_{n}^{t-1}]\).
Action: The action space is set as a collection of discrete power levels and subchannel resource blocks. At each time slot t, agent n takes an action \(a_{n}^{t}\) denoted as
\[a_{n}^{t}=\left[ \alpha _{n},P_{n}^{d} \right],\]
where
\({{\alpha }_{n}}=[\alpha _{n}^{1},\alpha _{n}^{2}, \ldots ,\alpha _{n}^{K}]\) is a K-dimensional one-hot vector that represents the subchannel occupancy of the nth DUE, and \(P_{n}^{d}\) is the transmit power of the nth D2D transmitter. We quantize the transmit power into five levels, so the dimension of the action space is \(D_{a}= 5 \times K\).
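Since each action jointly encodes a subchannel and one of the five power levels, a flat index in \([0, 5K)\) can be decoded as below. The row-major layout (power level major, subchannel minor) is an assumption for illustration; the paper only fixes the dimension \(D_{a}=5\times K\).

```python
N_POWER_LEVELS = 5  # power quantization levels, as in the text

def decode_action(a, k_subchannels):
    """Map a flat action index in [0, 5*K) to (power_level, subchannel)."""
    assert 0 <= a < N_POWER_LEVELS * k_subchannels
    power_level, subchannel = divmod(a, k_subchannels)
    return power_level, subchannel
```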
Reward: In the RL process, agents make decisions to maximize rewards based on interactions with the environment. Therefore, the design of the reward should be consistent with the optimization goal. We design the following reward function for this distributed resource allocation problem:
\[r_{t}=\rho _{1}\sum _{n=1}^{N}C_{n}^{d,t}+\rho _{2}f_{t},\]
where \(C_{n}^{d,t}\) and \(f_{t}\) are the spectral efficiency of the nth DUE and the fairness index at time slot t, respectively. \(\rho _{1}\) and \(\rho _{2}\) are weighting factors.
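A sketch of such a reward, assuming the weighted combination sums the spectral efficiencies of all DUEs and adds the Jain fairness term; the exact functional form is our assumption based on the weighting factors described above.

```python
def team_reward(rates, rho1=0.9, rho2=0.1):
    """Shared reward: weighted total spectral efficiency plus Jain fairness.

    The additive combination of the two terms is an assumption consistent
    with the weighting factors rho1, rho2 described in the text.
    """
    n = len(rates)
    total = sum(rates)
    fairness = total ** 2 / (n * sum(r * r for r in rates)) if total else 1.0
    return rho1 * total + rho2 * fairness
```

Because the reward is common to all agents (as in the Dec-POMDP definition), every agent is pushed toward the same system-level objective.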
2.2.2 The problem solution
In the single-agent RL model, an agent interacts with the environment and makes decisions based on a specific policy \(\pi\). At each time step t, the agent receives a state \(s_{t}\) based on the true information of the environment and then selects an action \(a_{t}\) from the action space \(\mathcal {A}\) according to the current policy \(\pi\). Following the action, the agent obtains the new state \(s_{t+1}\) and receives a reward \(r_{t}\). The goal of reinforcement learning is to learn a policy that maximizes the expected discounted return
\[G_{t}=\mathbb {E}\left[ \sum _{k=0}^{\infty }\gamma ^{k}r_{t+k} \right].\]
In value-based RL algorithms, the action-value function \(Q_{\pi }(s_t,a_t)\) represents the expected return in the current state \(s_t\) after taking action \(a_t\), which is expressed as
\[Q_{\pi }(s_{t},a_{t})=\mathbb {E}_{\pi }\left[ \sum _{k=0}^{\infty }\gamma ^{k}r_{t+k}\,\Big |\,s_{t},a_{t} \right].\]
There are two difficulties in using RL algorithms in a multi-agent environment. First, the agents cannot obtain global state information when interacting with the environment and can only obtain their own local observations. Second, when multiple agents update their policies independently and simultaneously during learning, then from the point of view of any single agent the environmental feedback it receives, such as the expected reward, is no longer an unbiased estimate because of the actions of the other agents in the environment, making it difficult for the algorithm to converge.
Most existing DRL-based distributed resource allocation algorithms directly apply single-agent DRL algorithms to the multi-agent environment. This approach ignores the potential cooperation between agents in optimizing the overall goal, and its effectiveness depends heavily on the design of the reward function. To address the non-stationary learning problem in the multi-agent environment, we propose a distributed RL algorithm called VDNRA, as shown in Fig. 2. VDNRA takes into account that each agent contributes differently to the global optimization objective within a cooperative relationship. It decomposes the reward expectation of the overall network into the sum of the local reward expectations of the individual agents and, through parameter sharing, centrally trains a reinforcement learning model capable of making rational decisions based on the states observed in real time. This mitigates the instability inherent in multi-agent environments and improves the overall performance of the algorithm. By accounting for the contribution of each agent, the proposed algorithm can better handle the complexity of resource allocation in the D2D communication scenario. Furthermore, each D2D pair uses only locally observed information to make decisions, which eliminates the need for frequent message interactions between D2D pairs and avoids excessive signaling overhead. This improves the efficiency and scalability of the resource allocation process in D2D networks, making it more suitable for deployment in practical scenarios.
VDN is based on the assumption that the joint action-value function of the system can be additively decomposed into value functions across agents:
\[Q\left( (h^{1},\ldots ,h^{d}),(a^{1},\ldots ,a^{d}) \right)\approx \sum _{n=1}^{d}\overset{\sim }{Q}_{n}\left( h^{n},a^{n} \right),\]
where d denotes the number of agents, \(h^{n}=[s_{0}^{n},s_{1}^{n}, \ldots ,s_{t}^{n}]\) is the set of historical local state trajectories observed by agent n, and \(a^{n}\) is the action chosen by agent n. \(\overset{ \sim }{Q}_{n}\) depends on the local observation of each agent and is learned implicitly rather than from any reward specific to agent n. Each agent acting greedily with respect to its local value \(\overset{\sim }{Q}_{n}\) is equivalent to a central arbiter choosing the joint action by maximizing \(\sum _{n=1}^{N} \overset{ \sim }{Q}_{n}\).
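The greedy-equivalence property can be checked numerically: under the additive decomposition, per-agent greedy choices attain the same joint value as a central search over all joint actions. A small sketch with tabular per-agent Q-values (the numbers are illustrative):

```python
from itertools import product

def greedy_decentralized(per_agent_q):
    """Each agent independently maximizes its own local Q~_n."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)

def greedy_centralized(per_agent_q):
    """A central arbiter searches every joint action for the maximal sum."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(joint_actions,
               key=lambda joint: sum(q[a] for q, a in zip(per_agent_q, joint)))
```

Both selections attain the same value of the summed local Q-values, so distributed execution loses nothing relative to centralized action selection.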
The algorithm performs end-to-end training by minimizing the loss function \(\mathcal {L}( \theta )\), which can be written as
\[\mathcal {L}(\theta )=\frac{1}{b}\sum _{j=1}^{b}\left( y_{j}-\sum _{n=1}^{N}\overset{\sim }{Q}_{n}\left( h_{j}^{n},a_{j}^{n};\theta \right) \right) ^{2},\]
where b is the batch size of the transitions sampled from the replay buffer and \(\theta\) denotes the parameters of the network. The target value is denoted as
\[y=r+\gamma \max _{a^{'}}\sum _{n=1}^{N}\overset{\sim }{Q}_{n}\left( h^{'},a^{'};\theta ^{-} \right),\]
where \(h^{'}\) and \(a^{'}\) are the historical trajectories and the chosen actions at the next time step, respectively, and \(\theta ^{-}\) denotes the parameters of the target network.
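Because the joint Q-function is a sum, the max over the joint next action in the target decouples into independent per-agent maxima, which keeps the target cheap to compute. A minimal sketch with tabular next-step Q-values (illustrative numbers):

```python
def vdn_td_target(reward_t, gamma, next_q_per_agent):
    """y = r + gamma * max_{a'} sum_n Q~_n(h'_n, a'_n; theta^-).

    With the additive decomposition, the joint max over a' factorizes
    into one independent max per agent.
    """
    return reward_t + gamma * sum(max(q) for q in next_q_per_agent)
```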
VDNRA implicitly learns an individual value function for each agent by backpropagating gradients from the Q-learning rule, without the need to deliberately design a reward for each individual agent. With this approach, the individual value functions are learned centrally and then executed in a distributed manner. Algorithm 1 summarizes the operation of the proposed method.
3 Results and discussion
We consider a D2D underlay cellular network scenario with 5 subchannels, and it is assumed that a subchannel can be assigned to multiple D2D pairs. Cellular users and D2D users are randomly distributed in a cell with a radius of 250 m, and each D2D pair has a communication distance between 10 and 40 m. In our VDNRA scheme, the agent network uses the DRQN structure [30]. The agent network has four layers, including a hidden layer consisting of a recurrent neural network (RNN). The first hidden layer has 128 neurons. The second hidden layer employs gated recurrent units (GRUs) [31] with 64 neurons. The ReLU function is used as the activation function. The learning rate of the network is 0.0001. Note that the weighting factors \(\rho _{1}\) and \(\rho _{2}\) are used to regulate the size of the reward value, have no real physical meaning, and are chosen mainly based on the convergence behavior of the algorithm in the experiments. In the following experiments, setting \(\rho _{1}=0.9\) and \(\rho _{2}=0.1\) balances the spectral efficiency and fairness terms in the reward well. The detailed parameters are given in Table 1.
We compared the proposed VDNRA scheme with four other schemes. The first is the classical DRL method DQN, as used in [22]. The second, denoted QMIX-DRL, is also based on value decomposition; the QMIX algorithm utilizes a mixing network and a hypernetwork to estimate the Q-value [25]. In the third scheme, DUEs select the subchannel based on the SINR in a round-robin manner while transmitting at maximum power. This simple approach can be useful in scenarios where more sophisticated resource allocation strategies are not available or necessary, but it is limited in optimizing system performance. The fourth approach selects subchannels randomly and transmits at maximum power; it has no practical application and only serves to indicate a lower bound on system performance. Note that the states, rewards and agent network structure used by DQN and QMIX-DRL are the same as in the proposed VDNRA scheme, ensuring that the compared DRL schemes have the same complexity as the proposed scheme.
Figure 3 shows the change in the average spectral efficiency of the D2D pairs as the number of training steps increases, with 20 D2D pairs. For all DRL schemes, the average spectral efficiency initially increases with the number of training steps and eventually converges. In comparison with the other schemes, our proposed VDNRA scheme makes better use of the global state information by learning a joint action-value function through centralized training, which mitigates the effects of multi-agent environment instability and partial observability and thus promotes cooperation among the agents toward the overall goal. Therefore, the VDNRA scheme has the fastest convergence speed and the best converged performance. This indicates that our proposed scheme better optimizes the resource allocation process and improves the overall performance of the system. The faster convergence rate also suggests that our scheme is more efficient in adapting resource allocation in a dynamic and changing environment.
In Fig. 4, we generate the cumulative distribution function (CDF) of the average spectral efficiency for D2D pairs over 5000 test time slots. The results indicate that the DRL scheme can substantially enhance the average spectral efficiency of the system compared to general methods. Furthermore, the VDNRA scheme surpasses other DRL algorithms in terms of performance during the test phase. This outcome confirms the effectiveness of our VDNRA algorithm in improving the overall spectral efficiency of the D2D network.
Figures 5 and 6 show the average rate and fairness index of the different algorithms versus the number of D2D pairs, respectively. First, as the number of D2D pairs increases, the interference between D2D pairs increases; therefore, the average spectral efficiency decreases. In addition, as the number of D2D pairs increases, the disparity between the data rates of the pairs becomes more pronounced due to differences in interference levels, which leads to a decrease in the fairness index. Furthermore, VDNRA and QMIX-DRL outperform the other three benchmark schemes. This is because ordinary DQN algorithms treat each agent independently, perceiving the other agents as part of the environment, which results in a single-agent reinforcement learning process for each agent. Due to the presence of other agents, the environment is non-stationary and convergence depends heavily on the design of the reward function. Since, in our optimization objective, the contribution of each D2D pair to the whole is its own spectral efficiency, the relationship between the overall Q-value and the local Q-values satisfies the linear-summation assumption of VDNRA. In the QMIX-DRL approach, the relationship between the overall Q-value and the local Q-values is nonlinear, whereas our proposed VDNRA scheme describes the overall Q-value as a sum of local Q-values, which is more aligned with the characteristics of our system environment. The final experimental results accordingly show that, in terms of both average spectral efficiency and fairness index, our proposed algorithm achieves the best results.
4 Conclusion
This paper explored the problem of joint spectrum allocation and power control for D2D communication using a multi-agent DRL approach. The proposed VDNRA algorithm maximizes the throughput of the D2D links and ensures the fairness of the system. Specifically, D2D pairs are treated as multiple agents that share resources. Each agent can adaptively control its power and select resources independently without any a priori information. Considering the non-stationarity of the multi-agent environment, the centralized-training, distributed-execution design of the proposed algorithm ensures convergence without inter-agent signaling interaction. Simulation results verify the effectiveness of the proposed algorithm, which significantly improves the spectral efficiency of the D2D links while ensuring system fairness. For future work, we plan to consider the D2D communication resource allocation problem in multi-cell heterogeneous networks, and to consider hierarchical combinatorial optimization to design resource allocation schemes for continuous and discrete variables, respectively.
Availability of data and materials
Data can be obtained from the corresponding author upon reasonable request. Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Abbreviations
 D2D:

Device-to-device
 DRL:

Deep reinforcement learning
 MADRL:

Multi-agent deep reinforcement learning
 VDNRA:

Value decomposition network-based resource allocation
 DQN:

Deep Q-network
 DDQN:

Double deep Q-network
 V2V:

Vehicle-to-vehicle
 AWGN:

Additive white Gaussian noise
 CUE:

Cellular user
 DUE:

D2D user
 OFDM:

Orthogonal frequency-division multiplexing
 Dec-POMDP:

Decentralized partially observable Markov decision process
 CSI:

Channel state information
References
A. Gupta, R.K. Jha, A survey of 5G network: architecture and emerging technologies. IEEE Access 3, 1206â€“1232 (2015). https://doi.org/10.1109/ACCESS.2015.2461602
J. Cao, M. Ma, H. Li, R. Ma, Y. Sun, P. Yu, L. Xiong, A survey on security aspects for 3GPP 5G networks. IEEE Commun. Surv. Tutor. 22(1), 170â€“195 (2020). https://doi.org/10.1109/COMST.2019.2951818
N. Zhao, Y.C. Liang, D. Niyato, Y. Pei, M. Wu, Y. Jiang, Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks. IEEE Trans. Wirel. Commun. 18(11), 5141â€“5152 (2019). https://doi.org/10.1109/TWC.2019.2933417
J. Liu, N. Kato, J. Ma, N. Kadowaki, Device-to-device communication in LTE-Advanced networks: a survey. IEEE Commun. Surv. Tutor. 17(4), 1923–1940 (2015). https://doi.org/10.1109/COMST.2014.2375934
Y. Kai, J. Wang, H. Zhu, J. Wang, Resource allocation and performance analysis of cellular-assisted OFDMA device-to-device communications. IEEE Trans. Wirel. Commun. 18(1), 416–431 (2019). https://doi.org/10.1109/TWC.2018.2880956
D. Shi, L. Li, T. Ohtsuki, M. Pan, Z. Han, H.V. Poor, Make smart decisions faster: deciding D2D resource allocation via Stackelberg game guided multi-agent deep reinforcement learning. IEEE Trans. Mobile Comput. 21(12), 4426–4438 (2022). https://doi.org/10.1109/TMC.2021.3085206
J. Hao, H. Zhang, L. Song, Z. Han, Graph-based resource allocation for device-to-device communications aided cellular network, in 2014 IEEE/CIC International Conference on Communications in China (ICCC) (2014), pp. 256–260. https://doi.org/10.1109/ICCChina.2014.7008282
Y. Xue, Z. Yang, W. Yang, J. Yang, D2D resource allocation and power control algorithms based on graph coloring in 5G IoT, in 2019 Computing, Communications and IoT Applications (ComComAp) (2019), pp. 17–22. https://doi.org/10.1109/ComComAp46287.2019.9018806
Y. Dai, M. Sheng, J. Liu, N. Cheng, X. Shen, Q. Yang, Joint mode selection and resource allocation for D2D-enabled NOMA cellular networks. IEEE Trans. Veh. Technol. 68(7), 6721–6733 (2019). https://doi.org/10.1109/TVT.2019.2916395
L. Wang, H. Tang, H. Wu, G.L. Stüber, Resource allocation for D2D communications underlay in Rayleigh fading channels. IEEE Trans. Veh. Technol. 66(2), 1159–1170 (2017). https://doi.org/10.1109/TVT.2016.2553124
C. Yi, S. Huang, J. Cai, Joint resource allocation for device-to-device communication assisted fog computing. IEEE Trans. Mobile Comput. 20(3), 1076–1091 (2021). https://doi.org/10.1109/TMC.2019.2952354
N. Sawyer, D.B. Smith, Flexible resource allocation in device-to-device communications using Stackelberg game theory. IEEE Trans. Commun. 67(1), 653–667 (2019). https://doi.org/10.1109/TCOMM.2018.2873344
R. Gupta, S. Tanwar, A zero-sum game-based secure and interference mitigation scheme for socially aware D2D communication with imperfect CSI. IEEE Trans. Netw. Serv. Manag. 19(3), 3478–3486 (2022). https://doi.org/10.1109/TNSM.2022.3173305
M. Hamdi, M. Zaied, Resource allocation based on hybrid genetic algorithm and particle swarm optimization for D2D multicast communications. Appl. Soft Comput. 83, 105605 (2019). https://doi.org/10.1016/j.asoc.2019.105605
B. Lu, S. Lin, J. Shi, Y. Wang, Resource allocation for D2D communications underlaying cellular networks over Nakagami-\(m\) fading channel. IEEE Access 7, 21816–21825 (2019). https://doi.org/10.1109/ACCESS.2019.2894721
K. Zia, N. Javed, M.N. Sial, S. Ahmed, A.A. Pirzada, F. Pervez, A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets. IEEE Access 7, 6733–6745 (2019). https://doi.org/10.1109/ACCESS.2018.2890210
J. Huang, Y. Yang, G. He, Y. Xiao, J. Liu, Deep reinforcement learning-based dynamic spectrum access for D2D communication underlay cellular networks. IEEE Commun. Lett. 25(8), 2614–2618 (2021). https://doi.org/10.1109/LCOMM.2021.3079920
J. Huang, Y. Yang, Z. Gao, D. He, D.W.K. Ng, Dynamic spectrum access for D2D-enabled Internet of Things: a deep reinforcement learning approach. IEEE Internet Things J. 9(18), 17793–17807 (2022). https://doi.org/10.1109/JIOT.2022.3160197
J. Tan, Y.C. Liang, L. Zhang, G. Feng, Deep reinforcement learning for joint channel selection and power control in D2D networks. IEEE Trans. Wirel. Commun. 20(2), 1363–1378 (2021). https://doi.org/10.1109/TWC.2020.3032991
D. Wang, H. Qin, B. Song, K. Xu, X. Du, M. Guizani, Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC. Phys. Commun. 45, 101262 (2021). https://doi.org/10.1016/j.phycom.2020.101262
H. Xiang, Y. Yang, G. He, J. Huang, D. He, Multi-agent deep reinforcement learning-based power control and resource allocation for D2D communications. IEEE Wirel. Commun. Lett. 11(8), 1659–1663 (2022). https://doi.org/10.1109/LWC.2022.3170998
H. Ye, G.Y. Li, B.H.F. Juang, Deep reinforcement learning based resource allocation for V2V communications. IEEE Trans. Veh. Technol. 68(4), 3163–3173 (2019). https://doi.org/10.1109/TVT.2019.2897134
V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski et al., Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Z. Li, C. Guo, Multi-agent deep reinforcement learning based spectrum allocation for D2D underlay communications. IEEE Trans. Veh. Technol. 69(2), 1828–1840 (2020). https://doi.org/10.1109/TVT.2019.2961405
A. Mseddi, W. Jaafar, A. Moussaid, H. Elbiaze, W. Ajib, Collaborative D2D pairing in cache-enabled underlay cellular networks, in 2021 IEEE Global Communications Conference (GLOBECOM) (IEEE, 2021), pp. 1–6. https://doi.org/10.1109/GLOBECOM46510.2021.9685468
T. Rashid, M. Samvelyan, C.S. De Witt, G. Farquhar, J. Foerster, S. Whiteson, QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning (2018). arXiv preprint arXiv:1803.11485. https://doi.org/10.48550/arXiv.1803.11485
P. Sunehag, G. Lever, A. Gruslys, W.M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J.Z. Leibo, K. Tuyls et al., Value-decomposition networks for cooperative multi-agent learning (2017). arXiv preprint arXiv:1706.05296. https://doi.org/10.48550/arXiv.1706.05296
X. Lin, J.G. Andrews, A. Ghosh, R. Ratasuk, An overview of 3GPP device-to-device proximity services. IEEE Commun. Mag. 52(4), 40–48 (2014). https://doi.org/10.1109/MCOM.2014.6807945
C.H. Liu, Z. Chen, J. Tang, J. Xu, C. Piao, Energy-efficient UAV control for effective and fair communication coverage: a deep reinforcement learning approach. IEEE J. Sel. Areas Commun. 36(9), 2059–2070 (2018). https://doi.org/10.1109/JSAC.2018.2864373
M. Hausknecht, P. Stone, Deep recurrent Q-learning for partially observable MDPs (2015). arXiv preprint arXiv:1507.06527. https://doi.org/10.48550/arXiv.1507.06527
J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling (2014). arXiv preprint arXiv:1412.3555. https://doi.org/10.48550/arXiv.1412.3555
Acknowledgements
The authors would like to thank the respected editors and the reviewers for their helpful comments.
Funding
This work was supported in part by the National Key R&D Program of China under Grant 2022YFC3302801, in part by the National Key R&D Program of China under Grant 2020YFC0833201 and in part by the Natural Science Foundation of Shandong Province under Grant ZR2020MF004.
Author information
Authors and Affiliations
Contributions
ZH, TL and CS developed the idea, wrote the paper and developed the simulation code. ZL and JW polished the paper, and added to the content and important references. XL helped in mathematical modeling and discussing the results. HC and XZ were involved in discussing the technical idea and contributed to mathematical modeling. YC helped to manage the discussion, polish the paper thoroughly and adjust the writing of the paper.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All authors agree to publish the research in this journal.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Huang, Z., Li, T., Song, C. et al. Joint spectrum and power allocation scheme based on value decomposition networks in D2D communication networks. J Wireless Com Network 2024, 79 (2024). https://doi.org/10.1186/s13638-024-02393-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13638-024-02393-1