
Joint spectrum and power allocation scheme based on value decomposition networks in D2D communication networks

Abstract

Device-to-device (D2D) communications allow short-range devices to multiplex cellular-licensed spectrum and establish direct local connections, supporting an ultra-high number of terminal connections and greater system throughput. However, spectrum sharing also introduces serious interference into the network. Therefore, a reliable and efficient resource allocation strategy is important to mitigate the interference and improve the system spectral efficiency. In this paper, we investigate spectrum access and power allocation in D2D communications underlaying cellular networks based on deep reinforcement learning, with the aim of finding a feasible resource allocation strategy that maximizes data rate and system fairness. We propose a value decomposition network-based resource allocation scheme for D2D communication networks. The proposed scheme avoids frequent information exchange among D2D users through centralized training, while allowing D2D users to make distributed joint resource allocation decisions. Simulation results show that the proposed scheme converges stably, scales well and effectively improves the system capacity.

1 Introduction

With the application of 5G communication technology, the access of massive numbers of terminal devices and the explosive growth of data traffic pose a great challenge to existing network systems and architectures [1, 2]. D2D communication is one of the key technologies for 5G communications, allowing two nearby mobile devices to communicate directly without involving a base station (BS) [3]. By sharing the underlying spectrum resources, D2D users can reuse the licensed spectrum of cellular users, potentially improving spectrum efficiency [4]. However, the transmission of D2D signals can create detrimental interference to cellular networks and other D2D links, significantly hampering the progress of D2D communications [5, 6]. As a result, an effective resource allocation policy becomes crucial in order to enhance the capacity of D2D underlay communications within cellular networks.

The joint spectrum and power allocation problem for D2D communications is NP-hard. Existing works use different mathematical theories to solve the resource allocation problem. Graph theory is widely used to solve resource allocation problems in wireless networks [7, 8]. In [9], the joint mode selection and resource allocation problem is investigated and a graph theory-based algorithm is proposed to obtain the optimal solution. In [10], a simplified bipartite graph is constructed, with the edge weights representing the maximized rates of the D2D and cellular links under outage constraints. Game theory is also an effective method for solving the resource allocation problem in networks [11,12,13]. In [11], an optimization algorithm based on the branch-and-price idea is proposed to solve the joint optimization problem of link scheduling, channel allocation and power control in D2D communication-assisted fog computing. In [12], a dynamic Stackelberg game algorithm is proposed in which the base station, as a leader, charges the D2D users, as followers, to reduce interference and increase the throughput of the D2D users. In [13], a zero-sum game-based algorithm is proposed to optimize the rate and security of the network in the presence of incomplete channel state information. Resource allocation algorithms based on heuristic ideas usually require a large number of iterative computations with high complexity [14, 15]. In [14], a hybrid genetic algorithm (GA) and binary particle swarm optimization algorithm is proposed for joint sub-channel allocation and power control. In [15], a simulated annealing-based algorithm for allocating resources to D2D pairs and cellular users improves the average resource utilization.

Resource allocation methods based on traditional optimization algorithms have certain limitations. The performance of graph theory-based resource allocation schemes relies on global channel state information (CSI), which cannot adapt to increasingly complex communication network structures. Game theory-based resource allocation schemes require frequent information exchange between users, which incurs large signaling overhead. Resource allocation schemes based on heuristic algorithms require many iterations to reach a solution. In recent years, reinforcement learning (RL) methods have been used to learn resource allocation strategies from historical environmental information so as to maximize long-term rewards, and deep reinforcement learning (DRL) has been applied to the spectrum allocation problem. In [16], a distributed spectrum allocation scheme based on Q-learning is proposed, where D2D users adaptively select the access spectrum to reduce interference and maximize the system throughput in a multi-tier heterogeneous network. A D2D spectrum access algorithm based on the double deep Q-network (DDQN) has been proposed in [17]; this algorithm enables D2D pairs to select orthogonal or non-orthogonal spectrum access in different time slots. In [18], a DRL-based spectrum access scheme is designed to optimize system throughput and resource allocation fairness. In addition, several works have considered joint power control and spectrum allocation [19,20,21,22]. In [19], a distributed deep reinforcement learning scheme is proposed to autonomously optimize channel selection and transmission power using local information and historical environmental information. In [20], a neural network with two output layers is used to approximate the Q-values of the spectrum allocation and power control strategies. In [21], a dueling double deep Q-network based on priority sampling is proposed to maximize the total throughput. In [22], a distributed spectrum access and power allocation scheme based on deep Q-networks (DQN) [23] is proposed that can be applied to vehicle-to-vehicle (V2V) communication scenarios without the need for global channel information. Although all these efforts have contributed greatly to resource allocation in D2D communications, none of them considers the distribution of rewards among agents in a cooperative relationship or the effect of multi-agent environment instability on the algorithms, and their convergence is overly dependent on the design of the rewards. Recently, a number of multi-agent deep reinforcement learning (MADRL)-based resource allocation studies have begun to focus on improving the convergence of algorithms in multi-agent environments [24, 25]. In [24], a distributed spectrum access algorithm has been designed based on the actor-critic (AC) approach, which trains DRL models by collecting historical information of all users or neighboring users through a centralized training process. This method shares global historical states to alleviate instability in multi-agent environments. However, it lacks a global reward that promotes cooperation among the agents. In [25], a framework based on the QMIX [26] algorithm was developed for pairing D2D users under a collaborative relationship, solving the problem of associating D2D users and caching devices.

This paper studies collaborative resource allocation for D2D underlay communications in cellular networks. Considering the requirements of spectral efficiency as well as system fairness, a nonlinear integer optimization objective is formulated. To tackle the resource allocation challenge in our scenario, we propose the value decomposition network-based resource allocation (VDN-RA) scheme. The VDN-RA scheme leverages value decomposition networks (VDN) [27] to optimize collaborative spectrum and power allocation in D2D networks. This scheme aims to mitigate interference, enhance the overall network capacity and ensure fairness among D2D pairs. The main contributions are summarized as follows:

  1.

    We investigate the resource allocation problem for D2D communications. A weighted objective of system spectral efficiency and fairness is maximized by jointly optimizing spectrum allocation and power control.

  2.

    The optimization problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP). The D2D users are defined as agents under a cooperative relationship. Each D2D user can use its locally observed state information to obtain a resource allocation policy without requiring real-time global channel state information (CSI).

  3.

    A value decomposition network-based resource allocation (VDN-RA) scheme is proposed that uses centralized training over the environment states observed by each agent to mitigate the instability of the multi-agent environment and the resulting difficulty of training convergence. The distributed execution also reduces the signaling overhead and the load on the base station.

The rest of this paper is organized as follows. Section 2 provides a detailed description of the system model and the multi-agent reinforcement learning scheme. Section 3 presents the simulation results of our proposed scheme. Finally, the conclusions and discussion are given in Sect. 4.

2 Method

We investigate the problem of spectrum allocation and power control for D2D underlay communications in a single-cell uplink scenario. First, the resource allocation task is transformed into a multi-agent reinforcement learning process under a cooperative relationship. A value decomposition network-based resource allocation (VDN-RA) scheme is proposed, and the corresponding key elements such as states, actions and rewards are designed. In the VDN-RA scheme, cooperation between agents is facilitated by learning a global action–value function to mitigate the impact of multi-agent environment instability on the algorithm performance.

2.1 System model and problem formulation

We consider a typical D2D communication scenario under a single cellular network with spectrum allocation and power control issues, as shown in Fig. 1. The base station (BS) is located at the center of the cell and provides services to the cellular users (CUEs). The D2D users (DUEs) share the same spectrum resources with the cellular users. There are M CUEs, denoted as \(\mathcal {M}=\{1,\ldots ,M\}\), and N D2D pairs, denoted as \(\mathcal {N}=\{1,\ldots ,N\}\), in the scenario. We consider uplink interference, i.e., the DUEs reuse the uplink channels of the CUEs. This choice is driven by the fact that, in real-world communication scenarios, the utilization of uplink resources tends to be lower than that of downlink resources. Additionally, the BS has a greater capacity to manage interference than mobile devices [28]. The orthogonal frequency-division multiplexing (OFDM) technique is used to partition the frequency-selective channel into flat channels on multiple sub-carriers. CUEs are pre-allocated K orthogonal sub-channels, with the kth sub-channel assigned to the mth CUE, thus ensuring that CUEs do not interfere with each other. In the system model, the number of reusable sub-channels is equal to the number of CUEs (\(M=K\)), i.e., the cellular network is fully loaded. As a result of sharing the same spectrum, D2D links are susceptible to two types of interference. The first is cross-layer interference, which refers to the interference experienced by the D2D receiver from the cellular user with which it shares the same spectrum. The second is co-layer interference, which arises from other D2D transmitters sharing the same sub-channel with that D2D pair.

Fig. 1 System model

We assume that the kth sub-channel is assigned to the nth DUE and use the indicator function \(\alpha _{n}^{k}\in \{0,1\}\) to represent the DUE assignment decision. Specifically, if the nth DUE reuses the kth sub-channel, \(\alpha _{n}^{k}=1\); otherwise, it is 0. \(P_{n}^{d}\) and \(P_{m}^{c}\) denote the transmit power of the nth DUE and mth CUE, respectively. \(g_{n}^{d}\) denotes the gain of the sub-channel k from the nth D2D transmitter to the receiver. \(g_{m,k}^{c,r}\) and \(g_{n^{'},k}^{t,r}\) denote the interference link gain from the mth CUE to the nth D2D receiver on the sub-channel k and the interference link gain from the \(n^{'}\)th D2D transmitter to the nth D2D receiver on the sub-channel k, respectively. \(\sigma ^{2}\) is the additive white Gaussian noise (AWGN) power. Note that the channel gains of these links are frequency-selective, meaning they are subject to frequency-dependent small-scale fading and large-scale fading effects. Moreover, the channel gains vary over different coherence time periods, further complicating the resource allocation problem.

Therefore, the signal-to-interference-plus-noise ratio (SINR) on the kth sub-channel for the nth DUE is given by

$$\begin{aligned} \xi _{n}^{d}=\frac{ P_{n}^{d}g_{n}^{d}}{I_{c}+I_{d}+ \sigma ^{2}} \end{aligned}$$
(1)

where

$$\begin{aligned} I_{c} & = \sum _{m \in M}\alpha _{n}^{k}P_{m}^{c} g_{m,k}^{c,r} \end{aligned}$$
(2)
$$\begin{aligned} I_{d} & = \sum _{n^{'} \in N, n^{'} \ne n}\alpha _{n^{'}}^{k}P_{n^{'}}^{d} g_{n^{'},k}^{t,r} \end{aligned}$$
(3)

Among them, \(I_{c}\) represents the interference power when the DUE and CUE share the same spectrum, and \(I_{d}\) indicates the interference from other DUEs that share the same spectrum with that D2D pair.

Based on the SINR of the DUE, the spectral efficiency of the DUE can be further obtained as

$$\begin{aligned} C_{n}^{d}= \log (1+\xi _{n}^{d}) \end{aligned}$$
(4)
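
To make Eqs. (1)–(4) concrete, the following NumPy sketch evaluates the SINR and spectral efficiency of one DUE. All array names (allocation matrix, power vectors, channel gains) are illustrative placeholders rather than quantities taken from the paper's simulator, and log base 2 is assumed for the rate in bits/s/Hz since Eq. (4) does not state a base.

```python
import numpy as np

def due_spectral_efficiency(n, k, alpha, p_d, p_c, g_d, g_c, g_dd, noise_power):
    """Spectral efficiency of DUE n on sub-channel k, following Eqs. (1)-(4).

    alpha : (N, K) binary matrix, alpha[n, k] = 1 if DUE n reuses sub-channel k
    p_d   : (N,) DUE transmit powers;  p_c : (K,) CUE transmit powers (CUE m uses sub-channel k = m)
    g_d   : (N,) direct D2D link gains on the assigned sub-channel
    g_c   : (N, K) interference gains from the CUE on sub-channel k to D2D receiver n
    g_dd  : (N, N, K) interference gains from D2D transmitter n' to D2D receiver n on sub-channel k
    """
    # Cross-layer interference, Eq. (2): only the CUE pre-assigned to sub-channel k contributes
    i_c = alpha[n, k] * p_c[k] * g_c[n, k]
    # Co-layer interference, Eq. (3): other DUEs occupying the same sub-channel
    i_d = sum(alpha[m, k] * p_d[m] * g_dd[m, n, k]
              for m in range(alpha.shape[0]) if m != n)
    sinr = p_d[n] * g_d[n] / (i_c + i_d + noise_power)   # Eq. (1)
    return np.log2(1.0 + sinr)                           # Eq. (4), base-2 assumed
```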

In order to maintain fairness within the system, we employ Jain’s fairness index [29]. Here, the fairness index is given by

$$\begin{aligned} f=\frac{(\sum _{n\in N} C_{n}^{d})^{2}}{N\sum _{n\in N} (C_{n}^{d})^2} \end{aligned}$$
(5)

where \(f\in [0,1]\); this index serves as a metric to gauge the fairness of the system’s quality of service among D2D users. It quantifies the spectral efficiency gap between different DUEs, with a larger fairness index indicating a smaller gap in spectral efficiency and a fairer distribution of resources among the users. By considering Jain’s fairness index, we aim to ensure equal opportunities and satisfactory quality of service for DUEs within the network.

Our main objective is to develop a comprehensive strategy for joint spectrum and power allocation. This strategy aims to maximize the total spectral efficiency within the system while simultaneously ensuring fairness among users. Therefore, the optimization objective can be written as

$$\begin{aligned} & \max _{{\alpha _{n}^{k} ,P_{n}^{d} }} \rho _{1} \sum\limits_{{n \in N}} {C_{n}^{d} } + \rho _{2} f \\ & C1:\alpha _{n}^{k} \in \{ 0,1\} \\ & C2:0 \le P_{n}^{d} \le P_{{{\text{max}}}}^{d} ,\quad \forall n \in N \\ & C3:\sum\limits_{{k \in K}} {\alpha _{n}^{k} } \le 1,\quad \forall n \in N \\ \end{aligned}$$
(6)

where \(\rho _{1}\) and \(\rho _{2}\) are weighting factors that balance the total spectral efficiency and system fairness, respectively. \(P_{\rm{max}}^{d}\) is the maximum transmit power of the D2D transmitter. C1 denotes the range of values of the channel occupancy indicator variable, C2 denotes the maximum transmit power constraint of the DUE, and C3 denotes that each DUE can multiplex at most one sub-channel. This is an NP-hard optimization problem with nonlinear constraints and requires mixed-integer programming for each time slot as the channel changes. To solve this problem, we investigate the application of a multi-agent RL algorithm in this scenario.
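
As a sanity check on the objective in Eq. (6), the short sketch below combines the per-DUE spectral efficiencies with Jain's fairness index of Eq. (5). The weights and the example rate vector are placeholders; the constraints C1–C3 are assumed to be enforced by the allocation procedure itself.

```python
import numpy as np

def jains_fairness(rates):
    """Jain's fairness index, Eq. (5)."""
    rates = np.asarray(rates, dtype=float)
    return rates.sum() ** 2 / (len(rates) * np.sum(rates ** 2))

def weighted_objective(rates, rho1=0.9, rho2=0.1):
    """Weighted objective of Eq. (6): total spectral efficiency plus system fairness."""
    return rho1 * np.sum(rates) + rho2 * jains_fairness(rates)

# Example with placeholder per-DUE spectral efficiencies (bits/s/Hz)
print(weighted_objective([2.1, 1.8, 3.0, 0.9]))
```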

2.2 Deep reinforcement learning framework

In this section, we first model the multi-agent environment and then propose a value decomposition network-based resource allocation (VDN-RA) scheme with centralized training and distributed execution to solve the joint spectrum and power allocation problem. The proposed VDN-RA can make full use of the cooperation among DUEs to further improve the overall performance of the system.

2.2.1 Multi-agent environments

Here, we describe the optimization problem as a fully cooperative multi-agent task, and the process of D2D pairs cooperating with each other to maximize rewards can be described as a decentralized partially observable Markov decision process (Dec-POMDP), which can be represented as a tuple \((N,\mathcal {S},\mathcal {A},r,P,\gamma )\). N is the set of agents, and \(\mathcal {S}\) is the set of the agents' local, partially observable states \(s_{t}\) of the environment. \(\mathcal {A}\) is the action space representing the set of optional actions \(a_t\) of the agents, and r is the common reward for all agents. P is the state transition probability, i.e., the probability that, given the current state \(s_{t}\in \mathcal {S}\) and the action \(a_{t}\in \mathcal {A}\) performed by the agents, the environment moves to the next state \(s_{t+1}\). \(\gamma\) is a discount factor that indicates the importance of future rewards.

State: For feasibility reasons, we use the local channel state information (CSI) observed by the D2D link as the state characterization, consisting of the following components: instantaneous channel information of the D2D link, \(G_{n}^{d,t}\), the channel information of the interfering link from the cellular user to the D2D receiver, \(G_{n}^{c,t}\), interference to the link in the previous time slot, \(I_{n}^{t-1},\) and the spectral efficiency of the D2D link in the previous time slot, \(C_{n}^{t-1}\). Hence, \(s_{n}^{t} = [G_{n}^{d,t}, G_{n}^{c,t}, I_{n}^{t-1}, C_{n}^{t-1}]\).
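
A minimal sketch of how agent n could assemble this local observation; the variable names are illustrative and the four components are simply concatenated into the flat feature vector fed to the agent network.

```python
import numpy as np

def build_observation(g_d_t, g_c_t, interference_prev, rate_prev):
    """Local observation s_n^t = [G_n^{d,t}, G_n^{c,t}, I_n^{t-1}, C_n^{t-1}]."""
    return np.concatenate([np.atleast_1d(g_d_t),
                           np.atleast_1d(g_c_t),
                           np.atleast_1d(interference_prev),
                           np.atleast_1d(rate_prev)]).astype(np.float32)
```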

Action: The action space is set as a collection of discrete power and sub-channel resource blocks. At each time t, agent n takes an action \(a_{n}^{t}\) denoted as

$$\begin{aligned} a_{n}^{t}=[\alpha _{n},P_{n}^{d}] \end{aligned}$$
(7)

\({{\alpha }_{n}}=[\alpha _{n}^{1},\alpha _{n}^{2}, \ldots ,\alpha _{n}^{K}]\) is a K-dimensional one-hot variable that represents the sub-channel occupancy of the nth DUE. \(P_{n}^{d}\) is the transmit power of the nth D2D transmitter. We quantize the transmit power into five levels, so the dimension of the action space is \(D_{a}= 5 \times K\).
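
Since the flat action space has \(D_{a}=5\times K\) entries, an integer action index can be decoded into a (sub-channel, power level) pair as sketched below. The particular five power levels are placeholders; the paper only states that the transmit power is quantized into five levels up to \(P_{\rm{max}}^{d}\).

```python
import numpy as np

K = 5                  # number of sub-channels (equal to the number of CUEs here)
P_MAX_DBM = 23.0       # placeholder maximum D2D transmit power
# Placeholder quantization of the transmit power into five levels
POWER_LEVELS_DBM = np.linspace(P_MAX_DBM - 20.0, P_MAX_DBM, 5)

def decode_action(action_index):
    """Map a flat action index in [0, 5*K) to (one-hot sub-channel vector, transmit power)."""
    sub_channel = action_index % K
    power_level = action_index // K
    alpha_n = np.zeros(K, dtype=int)
    alpha_n[sub_channel] = 1          # one-hot sub-channel occupancy alpha_n
    return alpha_n, POWER_LEVELS_DBM[power_level]
```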

Reward: In the RL process, agents make decisions to maximize rewards based on their interactions with the environment. Therefore, the design of the reward should be consistent with the optimization goal. We design the following reward function for this distributed resource allocation problem:

$$\begin{aligned} r_{n}^{t}=\rho _{1}\frac{1}{N}\sum _{n\in N} C_{n}^{d,t}+\rho _{2}f_{t} \end{aligned}$$
(8)

where \(C_{n}^{d,t}\) and \(f_{t}\) are the spectral efficiency of the nth DUE and the fairness index at time slot t, respectively. \(\rho _{1}\) and \(\rho _{2}\) are weighting factors.
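
A direct transcription of the common reward in Eq. (8), which mirrors the objective of Eq. (6) but averages the instantaneous spectral efficiencies; the default weights follow the values used later in the simulations.

```python
import numpy as np

def common_reward(rates_t, rho1=0.9, rho2=0.1):
    """Shared reward of Eq. (8): weighted mean spectral efficiency plus Jain's fairness."""
    rates_t = np.asarray(rates_t, dtype=float)
    fairness = rates_t.sum() ** 2 / (len(rates_t) * np.sum(rates_t ** 2))
    return rho1 * rates_t.mean() + rho2 * fairness
```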

2.2.2 The problem solution

In the single-agent RL model, the agent interacts with the environment and makes decisions based on a specific policy \(\pi\). At each time step t, the agent observes a state \(s_{t}\) based on the true information of the environment and then selects an action \(a_{t}\) from the action space \(\mathcal {A}\) according to the current policy \(\pi\). Following the action, the agent transitions to a new state \(s_{t+1}\) and receives a reward \(r_{t}\). The goal of reinforcement learning is to learn a policy that maximizes the expected discounted return

Fig. 2 Value decomposition network-based resource allocation (VDN-RA) algorithm architecture

$$\begin{aligned} J =E\left[ \sum _{t=0}^\infty \gamma ^tr_{t+1} \right] \end{aligned}$$
(9)

In value-based RL algorithms, the action-value function \(Q_{\pi }(s_t,a_t)\) represents the expected return given the current state \(s_t\) and action \(a_t\), and is expressed as

$$\begin{aligned} Q_{\pi }(s_t,a_t)=E_{s,a}\left[ {\sum _{t=0}^\infty \gamma ^tr_{t+1}}|s_{t},a_{t} \right] \end{aligned}$$
(10)

There are two difficulties in using RL algorithms in a multi-agent environment. First, the agents cannot obtain global state information when interacting with the environment and can only obtain their own local observations. Second, when multiple agents update their policies independently and simultaneously during learning, then from the point of view of any single agent, the environmental feedback it obtains, such as the reward expectation, is no longer stationary because of the actions of the other agents in the environment, making it difficult for the algorithm to converge.

Most existing DRL-based distributed resource allocation algorithms directly apply single-agent DRL algorithms to the multi-agent environment. This approach ignores the potential cooperative relationship between the agents in optimizing the overall goal, and its effectiveness depends heavily on the setting of the reward function. To address the non-stationary learning problem in multi-agent environments, we propose a distributed RL algorithm called VDN-RA, as shown in Fig. 2. Our VDN-RA algorithm makes an important distinction by taking into account that each agent contributes differently to the global optimization objective within a cooperative relationship. It decomposes the reward expectation of the overall network into the sum of the local reward expectations of the individual agents, and it centrally trains, through parameter sharing, a reinforcement learning model capable of making rational decisions based on states observed in real time. This helps mitigate the instability inherent in multi-agent environments and subsequently improves the overall performance of the algorithm. By focusing on the contribution of each agent, our proposed algorithm can better handle the complexity of resource allocation in D2D communication scenarios. Furthermore, in our approach, each D2D pair utilizes locally observed information to make decisions. This eliminates the need for frequent message interactions between D2D pairs, thereby avoiding excessive signaling overhead. This improves the efficiency and scalability of the resource allocation process in D2D networks, making it more suitable for deployment in practical scenarios.

VDN is based on the assumption that the joint action–value function for the system can be additively decomposed into value functions across agents,

$$\begin{aligned} Q_{tot}((h^{1}, h^{2}, \ldots , h^{N}),( a^{1}, a^{2}, \ldots , a^{N})) \approx \sum _{n=1}^{N} \overset{ \sim }{Q}_{n}( h^{n}, a^{n}) \end{aligned}$$
(11)

where N is the number of agents, \(h^{n}=[s_{0}^{n},s_{1}^{n}, \ldots ,s_{t}^{n}]\) is the history of local state observations of agent n and \(a^{n}\) is the action chosen by agent n. \(\overset{ \sim }{Q}_{n}\) depends only on the local observations of each agent and is learned implicitly rather than from any reward specific to agent n. Each agent acting greedily with respect to its local value \(\overset{\sim }{Q}_{n}\) is equivalent to a central arbiter choosing joint actions by maximizing \(\sum _{n=1}^{N} \overset{ \sim }{Q}_{n}\).
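
In code, the additive decomposition of Eq. (11) amounts to summing the per-agent Q-values of the chosen actions, and greedy distributed execution reduces to a per-agent argmax. The PyTorch sketch below uses random stand-in Q outputs purely for illustration.

```python
import torch

N_AGENTS, N_ACTIONS = 4, 25            # e.g. 5 power levels x 5 sub-channels

# Stand-in per-agent values Q_n(h^n, .) that a shared agent network would output
per_agent_q = torch.randn(N_AGENTS, N_ACTIONS)
chosen_actions = torch.randint(0, N_ACTIONS, (N_AGENTS,))

# Q_n(h^n, a^n): the value of each agent's chosen action
q_chosen = per_agent_q.gather(1, chosen_actions.unsqueeze(1)).squeeze(1)

# Eq. (11): the joint action-value is the sum of the individual values
q_tot = q_chosen.sum()

# Greedy distributed execution: maximizing each Q_n also maximizes Q_tot
greedy_actions = per_agent_q.argmax(dim=1)
```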

The algorithm is trained end-to-end by minimizing the loss function \(\mathcal {L}( \theta )\), which can be written as

$$\begin{aligned} \mathcal {L}( \theta )= \sum _{i=1}^{b}\left[ {\left( y_{i}^{tot}-Q_{tot}(h_{i},a_{i}; \theta )\right) }^{2}\right] \end{aligned}$$
(12)

where b is the number of transitions sampled from the replay buffer and \(\theta\) denotes the parameters of the network. The target value is denoted as

$$\begin{aligned} y^{tot}=r+ \gamma \max _{a^{'}}Q_{tot}(h^{'},a^{'};\theta ^{-}) \end{aligned}$$
(13)

where \(h^{'}\) and \(a^{'}\) are the historical trajectories and the chosen actions at the next time step, respectively, and \(\theta ^{-}\) denotes the parameters of the target network.
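
The TD target of Eq. (13) and the loss of Eq. (12) can then be computed over a sampled batch as in the sketch below, where the network outputs are replaced by random placeholder tensors and the mean squared error is used in place of the batch sum for readability.

```python
import torch
import torch.nn.functional as F

BATCH, N_AGENTS, N_ACTIONS, GAMMA = 32, 4, 25, 0.95    # placeholder dimensions and discount

# Placeholder outputs: Q_n(h, .) from the online net, Q_n(h', .) from the target net
q_online = torch.randn(BATCH, N_AGENTS, N_ACTIONS, requires_grad=True)
q_target_next = torch.randn(BATCH, N_AGENTS, N_ACTIONS)
actions = torch.randint(0, N_ACTIONS, (BATCH, N_AGENTS))
rewards = torch.rand(BATCH)                            # common reward r

# Q_tot(h, a; theta): sum over agents of the chosen-action values
q_tot = q_online.gather(2, actions.unsqueeze(-1)).squeeze(-1).sum(dim=1)

# y^tot = r + gamma * max_{a'} Q_tot(h', a'; theta^-); the max decomposes per agent
y_tot = rewards + GAMMA * q_target_next.max(dim=2).values.sum(dim=1)

# Eq. (12): squared TD error over the batch
loss = F.mse_loss(q_tot, y_tot.detach())
loss.backward()
```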

Algorithm 1 VDN-RA algorithm

VDN-RA implicitly learns an individual value function for each agent by backpropagating gradients from the Q-learning rule, without the need to deliberately design rewards for individual agents. With this approach, the individual value functions are learned centrally and then executed in a distributed manner. Algorithm 1 summarizes the operation of the proposed method.
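
Since the Algorithm 1 figure is not reproduced here, the following self-contained sketch shows the centralized-training, distributed-execution loop as we read it: all agents share one parameterized Q-network, act epsilon-greedily on their local observations, and the summed Q-values are regressed onto the VDN target of Eq. (13). The toy environment, dimensions and hyper-parameters are placeholders, and a feedforward network stands in for the paper's recurrent DRQN agent for brevity.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

N_AGENTS, OBS_DIM, N_ACTIONS = 4, 8, 25
GAMMA, EPS, LR, BATCH, TARGET_SYNC = 0.95, 0.1, 1e-4, 32, 100

def make_net():
    # Shared agent network (parameter sharing across all D2D agents)
    return nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

online_net, target_net = make_net(), make_net()
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=LR)
replay = []

def toy_env_step(actions):
    """Placeholder environment: random next observations and a random common reward."""
    return torch.randn(N_AGENTS, OBS_DIM), random.random()

obs = torch.randn(N_AGENTS, OBS_DIM)
for step in range(1, 1001):
    # Distributed execution: each agent acts epsilon-greedily on its own observation
    with torch.no_grad():
        q = online_net(obs)                                   # (N_AGENTS, N_ACTIONS)
    actions = torch.where(torch.rand(N_AGENTS) < EPS,
                          torch.randint(0, N_ACTIONS, (N_AGENTS,)),
                          q.argmax(dim=1))
    next_obs, reward = toy_env_step(actions)
    replay.append((obs, actions, reward, next_obs))
    obs = next_obs

    # Centralized training on a sampled mini-batch of joint transitions
    if len(replay) >= BATCH:
        batch = random.sample(replay, BATCH)
        o = torch.stack([b[0] for b in batch])                # (BATCH, N_AGENTS, OBS_DIM)
        a = torch.stack([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch])
        o2 = torch.stack([b[3] for b in batch])

        q_tot = online_net(o).gather(2, a.unsqueeze(-1)).squeeze(-1).sum(dim=1)
        with torch.no_grad():
            y_tot = r + GAMMA * target_net(o2).max(dim=2).values.sum(dim=1)
        loss = F.mse_loss(q_tot, y_tot)                       # Eq. (12)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Periodically copy the online parameters into the target network
    if step % TARGET_SYNC == 0:
        target_net.load_state_dict(online_net.state_dict())
```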

3 Results and discussion

We consider a D2D underlay cellular network scenario. The scenario has 5 sub-channels, and it is assumed that a sub-channel can be assigned to multiple D2D pairs. Cellular users and D2D users are randomly distributed in a cell with a radius of 250 m, and each D2D pair has a communication distance between 10 and 40 m. In our VDN-RA scheme, the agent network uses the structure of DRQN [30]. The agent network has four layers, including a hidden layer consisting of a recurrent neural network (RNN). We allocate 128 neurons to the first hidden layer. For the second hidden layer, we employ gated recurrent units (GRUs) [31] and assign 64 neurons to this layer. The ReLU function is used as the activation function. The learning rate of the network is 0.0001. Note that the weighting factors \(\rho _{1}\) and \(\rho _{2}\) are used to scale the reward value and have no real physical meaning; their values are chosen mainly according to the convergence behavior of the algorithm in the experiments. In the following experiments, with \(\rho _{1}=0.9\) and \(\rho _{2}=0.1\), the spectral efficiency and fairness terms in the reward are well balanced. The detailed parameters are given in Table 1.
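
A possible PyTorch rendering of the described agent network (a 128-unit fully connected layer, a 64-unit GRU carrying the observation history, and an output head of size \(5\times K\)); the observation dimension is a placeholder and the exact layer ordering in the paper may differ.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Recurrent agent network: FC(128) -> GRUCell(64) -> FC(5*K Q-values)."""
    def __init__(self, obs_dim, n_subchannels=5, n_power_levels=5):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 128)
        self.gru = nn.GRUCell(128, 64)
        self.out = nn.Linear(64, n_subchannels * n_power_levels)

    def forward(self, obs, hidden):
        x = torch.relu(self.fc1(obs))
        hidden = self.gru(x, hidden)       # carries the local history h^n across time slots
        return self.out(hidden), hidden

# Usage: one forward pass per time slot, keeping the GRU hidden state between slots
agent = DRQNAgent(obs_dim=8)
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-4)   # learning rate from the paper
h = torch.zeros(1, 64)
q_values, h = agent(torch.randn(1, 8), h)
action = q_values.argmax(dim=1)
```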

Table 1 Simulation parameters

We compare the proposed VDN-RA scheme with four other schemes. The first is the classical DRL method DQN, which is used in [22]. The second approach, denoted as QMIX-DRL, is also based on value decomposition; in the QMIX algorithm, a mixing network and a hypernetwork are utilized to accurately estimate the Q-value [25]. In the third scheme, DUEs select the sub-channel based on the SINR in a round-robin manner while transmitting at maximum power. This simple approach can be useful in certain scenarios where more sophisticated resource allocation strategies are not available or necessary, but it is limited in optimizing system performance. The fourth approach selects sub-channels randomly and transmits at maximum power; it has no practical application and only serves to indicate a lower bound on the system performance. Note that the states, rewards and agent network structure used by DQN and QMIX-DRL are the same as those of the proposed VDN-RA scheme, ensuring that the compared DRL schemes have the same complexity as the proposed scheme.

Fig. 3 Train average rate with different time slots

Fig. 4 Test performance comparison under different algorithms

Figure 3 shows the change in the average spectral efficiency of the D2D pairs as the number of training steps increases when the number of D2D pairs is 20. For all DRL schemes, the average spectral efficiency initially increases with the number of training steps and eventually converges. In comparison with the other schemes, our proposed VDN-RA scheme makes better use of the global state information by learning a joint action-value function through centralized training, which mitigates the effects of multi-agent environment instability and partial observability and thus promotes cooperation among the agents to optimize the overall goal. Therefore, the VDN-RA scheme has the fastest convergence speed and the best converged performance. This indicates that our proposed scheme achieves better optimization of the resource allocation process and improves the overall performance of the system. The faster convergence rate also suggests that our scheme is more efficient in adapting and optimizing resource allocation in a dynamic and changing environment.

In Fig. 4, we generate the cumulative distribution function (CDF) of the average spectral efficiency for D2D pairs over 5000 test time slots. The results indicate that the DRL scheme can substantially enhance the average spectral efficiency of the system compared to general methods. Furthermore, the VDN-RA scheme surpasses other DRL algorithms in terms of performance during the test phase. This outcome confirms the effectiveness of our VDN-RA algorithm in improving the overall spectral efficiency of the D2D network.

Fig. 5 Average rate of D2D links versus the number of D2D links

Fig. 6 Average fairness index of D2D links versus the number of D2D links

Figures 5 and 6 show the average rate and fairness index of the different algorithms versus the number of D2D pairs, respectively. First, as the number of D2D pairs increases, the interference between D2D pairs increases; therefore, the average spectral efficiency decreases. In addition, as the number of D2D pairs grows, the disparity between the data rates of these pairs becomes more pronounced due to differences in interference levels, which leads to a decrease in the fairness index. Furthermore, VDN-RA and QMIX-DRL outperform the other three benchmark schemes. This is because ordinary DQN algorithms treat each agent independently, perceiving the other agents as part of the environment, which results in a single-agent reinforcement learning process for each agent; due to the presence of other agents, the environment is non-stationary and convergence depends heavily on the design of the reward function. Since, in our optimization objective, each D2D pair contributes to the overall goal through its own spectral efficiency, the relationship between the overall Q-value and the local Q-values satisfies the additive (linear summation) assumption of VDN-RA. In the QMIX-DRL approach, the relationship between the overall Q-value and the local Q-values is nonlinear, whereas in our proposed VDN-RA scheme the overall Q-value is expressed as a sum of local Q-values, which is more aligned with the specific requirements and characteristics of our system environment. The experimental results therefore show that, in terms of both average spectral efficiency and fairness index, our proposed algorithm achieves the best results.

4 Conclusion

This paper explores the problem of joint spectrum allocation and power control for D2D communication using a multi-agent DRL approach. The proposed VDN-RA algorithm maximizes the throughput of D2D links and ensures the fairness of the system. Specifically, D2D pairs are treated as multiple agents that share resources. Each agent can adaptively control its power and select resources independently without any a priori information. Considering the non-stationarity of the multi-agent environment, the centralized-training, distributed-execution design of the proposed algorithm ensures convergence without the need for inter-agent signaling interaction. Simulation results verify the effectiveness of the proposed algorithm, which significantly improves the spectral efficiency of the D2D links while ensuring system fairness. For future work, we plan to consider the D2D communication resource allocation problem in multi-cell heterogeneous networks, as well as hierarchical combinatorial optimization to design resource allocation schemes for continuous and discrete variables, respectively.

Availability of data and materials

Data can be obtained from the corresponding author upon reasonable request. Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Abbreviations

D2D:

Device-to-device

DRL:

Deep reinforcement learning

MADRL:

Multi-agent deep reinforcement learning

VDN-RA:

Value decomposition network-based resource allocation

DQN:

Deep Q-network

DDQN:

Double deep Q-network

V2V:

Vehicle-to-vehicle

AWGN:

Additive white Gaussian noise

CUE:

Cellular user

DUE:

D2D user

OFDM:

Orthogonal frequency-division multiplexing

Dec-POMDP:

Decentralized partially observable Markov decision process

CSI:

Channel state information

References

  1. A. Gupta, R.K. Jha, A survey of 5G network: architecture and emerging technologies. IEEE Access 3, 1206–1232 (2015). https://doi.org/10.1109/ACCESS.2015.2461602


  2. J. Cao, M. Ma, H. Li, R. Ma, Y. Sun, P. Yu, L. Xiong, A survey on security aspects for 3GPP 5G networks. IEEE Commun. Surv. Tutor. 22(1), 170–195 (2020). https://doi.org/10.1109/COMST.2019.2951818


  3. N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, Y. Jiang, Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks. IEEE Trans. Wirel. Commun. 18(11), 5141–5152 (2019). https://doi.org/10.1109/TWC.2019.2933417


  4. J. Liu, N. Kato, J. Ma, N. Kadowaki, Device-to-device communication in LTE-advanced networks: a survey. IEEE Commun. Surv. Tutor. 17(4), 1923–1940 (2015). https://doi.org/10.1109/COMST.2014.2375934


  5. Y. Kai, J. Wang, H. Zhu, J. Wang, Resource allocation and performance analysis of cellular-assisted OFDMA device-to-device communications. IEEE Trans. Wirel. Commun. 18(1), 416–431 (2019). https://doi.org/10.1109/TWC.2018.2880956


  6. D. Shi, L. Li, T. Ohtsuki, M. Pan, Z. Han, H.V. Poor, Make smart decisions faster: deciding D2D resource allocation via Stackelberg game guided multi-agent deep reinforcement learning. IEEE Trans. Mobile Comput 21(12), 4426–4438 (2022). https://doi.org/10.1109/TMC.2021.3085206


  7. J. Hao, H. Zhang, L. Song, Z. Han, Graph-based resource allocation for device-to-device communications aided cellular network, in 2014 IEEE/CIC International Conference on Communications in China (ICCC) (2014), pp. 256–260. https://doi.org/10.1109/ICCChina.2014.7008282

  8. Y. Xue, Z. Yang, W. Yang, J. Yang, D2D resource allocation and power control algorithms based on graph coloring in 5G IoT, in 2019 Computing, Communications and IoT Applications (ComComAp) (2019), pp. 17–22. https://doi.org/10.1109/ComComAp46287.2019.9018806

  9. Y. Dai, M. Sheng, J. Liu, N. Cheng, X. Shen, Q. Yang, Joint mode selection and resource allocation for D2D-enabled NOMA cellular networks. IEEE Trans. Veh. Technol. 68(7), 6721–6733 (2019). https://doi.org/10.1109/TVT.2019.2916395


  10. L. Wang, H. Tang, H. Wu, G.L. Stüber, Resource allocation for D2D communications underlay in Rayleigh fading channels. IEEE Trans. Veh. Technol. 66(2), 1159–1170 (2017). https://doi.org/10.1109/TVT.2016.2553124


  11. C. Yi, S. Huang, J. Cai, Joint resource allocation for device-to-device communication assisted fog computing. IEEE Trans. Mobile Comput. 20(3), 1076–1091 (2021). https://doi.org/10.1109/TMC.2019.2952354


  12. N. Sawyer, D.B. Smith, Flexible resource allocation in device-to-device communications using Stackelberg game theory. IEEE Trans. Commun. 67(1), 653–667 (2019). https://doi.org/10.1109/TCOMM.2018.2873344


  13. R. Gupta, S. Tanwar, A zero-sum game-based secure and interference mitigation scheme for socially aware D2D communication with imperfect CSI. IEEE Trans. Netw. Serv. Manag. 19(3), 3478–3486 (2022). https://doi.org/10.1109/TNSM.2022.3173305


  14. M. Hamdi, M. Zaied, Resource allocation based on hybrid genetic algorithm and particle swarm optimization for D2D multicast communications. Appl. Soft Comput. 83, 105605 (2019). https://doi.org/10.1016/j.asoc.2019.105605


  15. B. Lu, S. Lin, J. Shi, Y. Wang, Resource allocation for D2D communications underlaying cellular networks over Nakagami-\(m\) fading channel. IEEE Access 7, 21816–21825 (2019). https://doi.org/10.1109/ACCESS.2019.2894721


  16. K. Zia, N. Javed, M.N. Sial, S. Ahmed, A.A. Pirzada, F. Pervez, A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets. IEEE Access 7, 6733–6745 (2019). https://doi.org/10.1109/ACCESS.2018.2890210


  17. J. Huang, Y. Yang, G. He, Y. Xiao, J. Liu, Deep reinforcement learning-based dynamic spectrum access for D2D communication underlay cellular networks. IEEE Commun. Lett. 25(8), 2614–2618 (2021). https://doi.org/10.1109/LCOMM.2021.3079920


  18. J. Huang, Y. Yang, Z. Gao, D. He, D.W.K. Ng, Dynamic spectrum access for D2D-enabled internet of things: a deep reinforcement learning approach. IEEE Internet Things J. 9(18), 17793–17807 (2022). https://doi.org/10.1109/JIOT.2022.3160197


  19. J. Tan, Y.-C. Liang, L. Zhang, G. Feng, Deep reinforcement learning for joint channel selection and power control in D2D networks. IEEE Trans. Wirel. Commun. 20(2), 1363–1378 (2021). https://doi.org/10.1109/TWC.2020.3032991


  20. D. Wang, H. Qin, B. Song, K. Xu, X. Du, M. Guizani, Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC. Phys. Commun. 45(2), 1874–4907 (2021). https://doi.org/10.1016/j.phycom.2020.101262


  21. H. Xiang, Y. Yang, G. He, J. Huang, D. He, Multi-agent deep reinforcement learning-based power control and resource allocation for D2D communications. IEEE Wirel. Commun. Lett. 11(8), 1659–1663 (2022). https://doi.org/10.1109/LWC.2022.3170998


  22. H. Ye, G.Y. Li, B.-H.F. Juang, Deep reinforcement learning based resource allocation for V2V communications. IEEE Trans. Veh. Technol. 68(4), 3163–3173 (2019). https://doi.org/10.1109/TVT.2019.2897134


  23. V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski et al., Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)


  24. Z. Li, C. Guo, Multi-agent deep reinforcement learning based spectrum allocation for D2D underlay communications. IEEE Trans. Veh. Technol. 69(2), 1828–1840 (2020). https://doi.org/10.1109/TVT.2019.2961405


  25. A. Mseddi, W. Jaafar, A. Moussaid, H. Elbiaze, W. Ajib, Collaborative D2D pairing in cache-enabled underlay cellular networks, in 2021 IEEE Global Communications Conference (GLOBECOM) (IEEE, 2021), pp. 1–6. https://doi.org/10.1109/GLOBECOM46510.2021.9685468

  26. T. Rashid, M. Samvelyan, C.S. De Witt, G. Farquhar, J. Foerster, S. Whiteson, QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning (2018). arXiv preprint arXiv:1803.11485. https://doi.org/10.48550/arXiv.1803.11485

  27. P. Sunehag, G. Lever, A. Gruslys, W.M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J.Z. Leibo, K. Tuyls et al., Value-decomposition networks for cooperative multi-agent learning (2017). arXiv preprint arXiv:1706.05296. https://doi.org/10.48550/arXiv.1706.05296

  28. X. Lin, J.G. Andrews, A. Ghosh, R. Ratasuk, An overview of 3GPP device-to-device proximity services. IEEE Commun. Mag. 52(4), 40–48 (2014). https://doi.org/10.1109/MCOM.2014.6807945


  29. C.H. Liu, Z. Chen, J. Tang, J. Xu, C. Piao, Energy-efficient UAV control for effective and fair communication coverage: a deep reinforcement learning approach. IEEE J. Sel. Areas Commun. 36(9), 2059–2070 (2018). https://doi.org/10.1109/JSAC.2018.2864373


  30. M. Hausknecht, P. Stone, Deep recurrent Q-learning for partially observable MDPs (2015). arXiv preprint arXiv:1507.06527. https://doi.org/10.48550/arXiv.1507.06527

  31. J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling (2014). arXiv preprint arXiv:1412.3555. https://doi.org/10.48550/arXiv.1412.3555


Acknowledgements

The authors would like to thank the respected editors and the reviewers for their helpful comments.

Funding

This work was supported in part by the National Key R&D Program of China under Grant 2022YFC3302801, in part by the National Key R&D Program of China under Grant 2020YFC0833201 and in part by the Natural Science Foundation of Shandong Province under Grant ZR2020MF004.

Author information

Authors and Affiliations

Authors

Contributions

ZH, TL and CS developed the idea, wrote the paper and developed the simulation code. ZL and JW polished the paper, and added to the content and important references. XL helped in mathematical modeling and discussing the results. HC and XZ were involved in discussing the technical idea and contributed to mathematical modeling. YC helped to manage the discussion, polish the paper thoroughly and adjust the writing of the paper.

Corresponding author

Correspondence to Yewen Cao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors agree to publish the research in this journal.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Huang, Z., Li, T., Song, C. et al. Joint spectrum and power allocation scheme based on value decomposition networks in D2D communication networks. J Wireless Com Network 2024, 79 (2024). https://doi.org/10.1186/s13638-024-02393-1


Keywords