Algorithm 1 Q-learning algorithm for the proposed scheme

 1: while required packet exists
 2:     if time index > sensing window size
 3:         Initialize the sensing results and obtained results
 4:     else
 5:         time index = time index + 1
 6:         while time index < \( n_{\mathrm{sensing}} n_{\mathrm{reply}} \)
 7:             Determine the sensing tool (with IA or without IA)
 8:             from the k-out-of-N rule with \( N = n_{\mathrm{sensing}} n_{\mathrm{reply}} \) sensing results
 9:         end
10:         Obtain \( P(D_0) \), \( P(D_1) \) from the sensing results
11:         Calculate \( P(H_0) \), \( P(H_1) \) from the simultaneous equations of \( P(D_0) \), \( P(D_1) \)
12:         Calculate \( R_I \), \( R_L \) from the conditional probabilities and determine the state
13:         Select an action \( a_t \) based on the optimal policy from the current state \( s_t \)
14:         Obtain the immediate payoff \( r_t \) from action \( a_t \)
15:         Observe the next state \( s_{t+1} \) by \( R_I \), \( R_L \)
16:         Update \( \mathcal{Q}\left({s}_t,{a}_t\right) \) based upon this experience as
17:         \( \mathcal{Q}\left({s}_t,{a}_t\right)\leftarrow \left(1-\alpha \right)\mathcal{Q}\left({s}_t,{a}_t\right)+\alpha \left\{{r}_t+\gamma \underset{a_{t+1}}{\max}\mathcal{Q}\left({s}_{t+1},{a}_{t+1}\right)\right\} \)
18:     end
19: end
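The core of Algorithm 1 (lines 13-17) is a standard tabular Q-learning loop. The sketch below illustrates just that update rule under stated assumptions: the state/action set sizes, the epsilon-greedy policy, and the toy reward are all placeholders, since the paper's actual states come from thresholding \( R_I \), \( R_L \) on the sensing results and its payoff depends on the sensing outcome.

```python
import numpy as np

# Hypothetical sizes: the paper derives states from (R_I, R_L) and actions
# from the sensing configuration; these small sets are illustrative only.
N_STATES, N_ACTIONS = 4, 3
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))  # tabular Q(s, a)

def select_action(state: int) -> int:
    """Epsilon-greedy stand-in for 'select a_t based on the optimal policy'."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def update(s: int, a: int, r: float, s_next: int) -> None:
    """Line 17: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * (r + GAMMA * np.max(Q[s_next]))

# Toy environment: reward 1 when the action index matches the state index
# modulo N_ACTIONS, with uniformly random next states. Purely illustrative.
s = 0
for _ in range(5000):
    a = select_action(s)
    r = 1.0 if a == s % N_ACTIONS else 0.0
    s_next = int(rng.integers(N_STATES))
    update(s, a, r, s_next)
    s = s_next
```

Because the toy reward depends only on the current state-action pair and transitions are action-independent, the greedy policy learned by the table converges to picking `s % N_ACTIONS` in each state, which is a quick sanity check that the update rule is implemented correctly.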