Fig. 3

From: Positioning and power optimisation for UAV-assisted networks in the presence of eavesdroppers: a multi-armed bandit approach

Reinforcement learning process: (1) at time \(t\), the agent chooses an action \(A_t\) based on a policy and the set of estimates \(\{Q_a\}\); (2) the agent applies action \(A_t\) to the environment; (3) the agent obtains reward \(R_t\) from the environment; and (4) the agent updates the estimate \(Q_{A_t}\) for the chosen action.
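
The four steps in the caption can be sketched as a short loop. This is only an illustration under stated assumptions, not the article's method: the caption does not name the policy or the update rule, so the sketch assumes an epsilon-greedy policy and a sample-average update of \(Q_{A_t}\). The function name `run_bandit`, its parameters, and the Gaussian reward model are all hypothetical.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, noise=1.0, seed=0):
    """Run the four-step loop from the caption on a toy bandit.

    Assumptions (not from the source): epsilon-greedy policy,
    sample-average update, Gaussian rewards around `true_means`.
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    q = [0.0] * n_actions      # estimates {Q_a}
    counts = [0] * n_actions   # how often each action was chosen

    for _ in range(steps):
        # (1) Choose action A_t from the policy and the estimates {Q_a}.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)                   # explore
        else:
            a = max(range(n_actions), key=lambda i: q[i])  # exploit

        # (2) Apply A_t to the environment and (3) observe reward R_t.
        r = rng.gauss(true_means[a], noise)

        # (4) Update the estimate for the chosen action only:
        #     Q_{A_t} <- Q_{A_t} + (R_t - Q_{A_t}) / N_{A_t}
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]

    return q, counts


if __name__ == "__main__":
    estimates, pulls = run_bandit([0.0, 0.5, 5.0])
    print("estimates:", estimates)
    print("pulls:", pulls)
```

Only the estimate of the action actually taken changes at each step, which matches step (4) in the caption; the other estimates stay as they were until their actions are chosen.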