Table 1 Parameters of the TD3 algorithm implemented by the secondary user

From: Dynamic spectrum access and sharing through actor-critic deep reinforcement learning

Parameter                                       Value
Discount rate γ of cumulative reward            0.5
Learning rate of actor                          0.0001
Learning rate of critic                         0.0003
Update parameter \(\rho\) of target networks    0.001
TD3 delayed update of actor                     1 actor update per 10 critic updates
Experience replay buffer size                   100,000
Mini-batch size                                 128
State observation time span \(T_0\)             32 time slots
Reward coefficient \(\beta\)                    0.05 (bit/s/Hz)/mW
Exploration noise \(w\) added to the action     decays during training: starts at \(\sigma_{w} = 10\) mW, with \(\sigma_{w,t+1} = 0.99995\,\sigma_{w,t}\)
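For reference, the settings above can be gathered into a small configuration object. The sketch below is illustrative only: `TD3Config`, `exploration_sigma`, and all field names are assumptions rather than the authors' code; the values are copied from the table, and the helper simply applies the stated geometric decay of the exploration-noise standard deviation.

```python
from dataclasses import dataclass


@dataclass
class TD3Config:
    """Hyperparameters from Table 1 (field names are illustrative, not the authors')."""
    gamma: float = 0.5            # discount rate of cumulative reward
    actor_lr: float = 1e-4        # learning rate of actor
    critic_lr: float = 3e-4       # learning rate of critic
    rho: float = 0.001            # update parameter of target networks
    policy_delay: int = 10        # 1 actor update per 10 critic updates
    buffer_size: int = 100_000    # experience replay buffer size
    batch_size: int = 128         # mini-batch size
    obs_span_T0: int = 32         # state observation time span, in time slots
    beta: float = 0.05            # reward coefficient, (bit/s/Hz)/mW
    sigma_w0: float = 10.0        # initial exploration-noise std, mW
    sigma_decay: float = 0.99995  # per-step multiplicative decay of sigma_w


def exploration_sigma(cfg: TD3Config, t: int) -> float:
    """Noise std after t training steps: sigma_{w,t} = 0.99995**t * sigma_{w,0}."""
    return cfg.sigma_w0 * cfg.sigma_decay ** t


if __name__ == "__main__":
    cfg = TD3Config()
    # After 50,000 steps the noise std has decayed to roughly 0.82 mW.
    print(f"sigma_w after 50k steps: {exploration_sigma(cfg, 50_000):.3f} mW")
```

Under this schedule the exploration noise shrinks smoothly rather than being cut off, so the secondary user keeps probing the spectrum early in training and converges toward near-deterministic actions as training proceeds.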