Human-level control through deep reinforcement learning
Published:
Notes
Reinforcement learning methods that use non-linear function approximators (e.g. neural networks) to represent the action-value function have no theoretical stability guarantees and are known to oscillate or diverge in practice.
This paper gets around that problem with two main changes to training:
Experience replay
Each transition is stored as a tuple of state, action, reward, and next state in a replay memory. Q-learning updates are then applied to random mini-batches drawn from this memory rather than to the most recent, strongly correlated transitions (sketched below).
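A minimal sketch of such a buffer, in plain Python; the class name, capacity, and the extra `done` flag are my own illustrative choices, not details from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded first once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random mini-batch: decorrelates consecutive time steps
        # and lets each transition be reused in many updates
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly at random breaks the correlation between consecutive time steps and smooths the data distribution over many past behaviours, which is a large part of what stabilises training.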
Separate target and behavior value networks
The Q-learning targets are computed from a separate target network, a periodically refreshed copy of the online network. The online network (the one actually used to choose actions) is updated every step; only after a fixed number of updates are its weights copied into the target network, so the targets stay fixed in between and the updates are not chasing a moving target.
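A sketch of how the two networks could fit together in one Q-learning update, assuming PyTorch; the layer sizes, optimizer settings, Huber loss, and helper names are illustrative choices of mine, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def make_q_net(n_inputs=4, n_actions=2):
    # Toy architecture standing in for the paper's convolutional network
    return nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))

online_net = make_q_net()                      # chooses actions, updated every step
target_net = make_q_net()                      # provides Q-learning targets
target_net.load_state_dict(online_net.state_dict())  # start as an exact copy

optimizer = torch.optim.RMSprop(online_net.parameters(), lr=2.5e-4)
gamma = 0.99

def q_learning_update(states, actions, rewards, next_states, dones):
    # Inputs are tensors built from a mini-batch sampled from the replay buffer
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Targets come from the frozen target network: r + gamma * max_a' Q_target(s', a')
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    loss = nn.functional.smooth_l1_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Called every C gradient updates: copy the online weights into the target network
    target_net.load_state_dict(online_net.state_dict())
```

Because the targets are produced by weights that stay frozen between syncs, a single gradient step no longer shifts the very values it is regressing towards.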