Human-level control through deep reinforcement learning
Published:
Notes
Reinforcement learning methods that use non-linear function approximators (e.g. neural networks) to represent the action-value function have no theoretical stability guarantees and are known to oscillate or diverge in practice.
This paper gets around that problem with two main changes to training:
Experience replay
Each transition is stored as a tuple of state, action, reward, and next state in a replay memory. Q-learning updates are then applied to random mini-batches drawn from this memory rather than to the most recent, strongly correlated transitions (sketched below).
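A minimal sketch of such a buffer, in plain Python; the class name, capacity, and the extra `done` flag are my own illustrative choices, not details from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded first once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random mini-batch: decorrelates consecutive time steps
        # and lets each transition be reused in many updates
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly at random breaks the correlation between consecutive time steps and smooths the data distribution over many past behaviours, which is a large part of what stabilises training.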
Separate target and behavior value networks
The Q-learning targets are computed from a separate target network, a periodically refreshed copy of the online network. The online network (the one actually used to choose actions) is updated every step; only after a fixed number of updates are its weights copied into the target network, so the targets stay fixed in between and the updates are not chasing a moving target.
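A sketch of how the two networks could fit together in one Q-learning update, assuming PyTorch; the layer sizes, optimizer settings, Huber loss, and helper names are illustrative choices of mine, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def make_q_net(n_inputs=4, n_actions=2):
    # Toy architecture standing in for the paper's convolutional network
    return nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))

online_net = make_q_net()                      # chooses actions, updated every step
target_net = make_q_net()                      # provides Q-learning targets
target_net.load_state_dict(online_net.state_dict())  # start as an exact copy

optimizer = torch.optim.RMSprop(online_net.parameters(), lr=2.5e-4)
gamma = 0.99

def q_learning_update(states, actions, rewards, next_states, dones):
    # Inputs are tensors built from a mini-batch sampled from the replay buffer
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Targets come from the frozen target network: r + gamma * max_a' Q_target(s', a')
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    loss = nn.functional.smooth_l1_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Called every C gradient updates: copy the online weights into the target network
    target_net.load_state_dict(online_net.state_dict())
```

Because the targets are produced by weights that stay frozen between syncs, a single gradient step no longer shifts the very values it is regressing towards.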