From: Dynamic spectrum access and sharing through actor-critic deep reinforcement learning
| Parameter | Value |
| --- | --- |
| Discount rate \(\gamma\) of cumulative reward | 0.5 |
| Learning rate of actor | 0.0001 |
| Learning rate of critic | 0.0003 |
| Update parameter \(\rho\) of target networks | 0.001 |
| TD3 delayed update of actor | 1 actor update per 10 critic updates |
| Experience replay buffer size | 100,000 |
| Mini-batch size | 128 |
| State observation time span \(T_0\) | 32 time slots |
| Reward coefficient \(\beta\) | 0.05 (bit/s/Hz)/mW |
| Exploration noise \(w\) added to the action, decreasing during training | starts at \(\sigma_{w} = 10\) mW, decays as \(\sigma_{w,t+1} = 0.99995\,\sigma_{w,t}\) |
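For readers who want to map the table onto a training configuration, the sketch below collects the listed values in a PyTorch-style setup. The names (`TD3Config`, `soft_update`, `decayed_noise_std`) are illustrative, not from the paper, and the helpers assume the standard TD3 conventions of Polyak target updates \(\theta' \leftarrow \rho\theta + (1-\rho)\theta'\) and per-step multiplicative decay of the exploration-noise standard deviation, which the tabulated values appear to follow.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class TD3Config:
    """Hyperparameters from the table above (field names are illustrative)."""
    gamma: float = 0.5              # discount rate of cumulative reward
    actor_lr: float = 1e-4          # actor learning rate
    critic_lr: float = 3e-4         # critic learning rate
    rho: float = 0.001              # target-network soft-update parameter
    policy_delay: int = 10          # 1 actor update per 10 critic updates
    buffer_size: int = 100_000      # experience replay buffer size
    batch_size: int = 128           # mini-batch size
    obs_span: int = 32              # state observation time span T0 (time slots)
    beta: float = 0.05              # reward coefficient, (bit/s/Hz)/mW
    sigma_w0: float = 10.0          # initial exploration noise std (mW)
    sigma_decay: float = 0.99995    # per-step multiplicative noise decay


def soft_update(target: nn.Module, source: nn.Module, rho: float) -> None:
    """Polyak averaging: theta' <- rho * theta + (1 - rho) * theta'."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - rho).add_(rho * sp)


def decayed_noise_std(cfg: TD3Config, step: int) -> float:
    """Exploration noise std after `step` training steps: sigma_{w,t} = decay^t * sigma_{w,0}."""
    return cfg.sigma_w0 * cfg.sigma_decay ** step
```

With these values, the noise standard deviation falls to roughly half its initial 10 mW after about 139,000 training steps (0.99995^t = 0.5), so exploration tapers off gradually over the course of training rather than being switched off abruptly.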