Sample-Efficient Deep RL with Generative Adversarial Tree Search
Overview
They use a GAN as the dynamics model, based on pix2pix, with the Wasserstein distance as the loss and spectral normalization to stabilize training. The generator's input is four consecutive frames, Gaussian noise, and a sequence of actions.
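A minimal sketch of how that conditioning might be assembled, assuming 84x84 grayscale Atari frames and one-hot actions broadcast over spatial planes (the sizes and the `make_generator_input` helper are illustrative, not the paper's exact architecture):

```python
import numpy as np

H = W = 84      # assumed Atari frame size
N_ACTIONS = 4   # assumed action-space size
ROLLOUT = 1     # number of future actions the generator conditions on

def make_generator_input(frames, actions, rng):
    """Stack the GAN generator input channel-wise: 4 consecutive frames,
    one Gaussian-noise channel, and tiled one-hot action planes."""
    assert frames.shape == (4, H, W)
    noise = rng.standard_normal((1, H, W))  # Gaussian noise channel
    # Broadcast each one-hot action over a full spatial plane.
    action_planes = np.zeros((ROLLOUT * N_ACTIONS, H, W))
    for t, a in enumerate(actions):
        action_planes[t * N_ACTIONS + a] = 1.0
    return np.concatenate([frames, noise, action_planes], axis=0)

rng = np.random.default_rng(0)
x = make_generator_input(rng.random((4, H, W)), [2], rng)
print(x.shape)  # (9, 84, 84): 4 frames + 1 noise + 4 action planes
```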
The Wasserstein distance can also serve as an optimism bonus on the Q-function, giving a better exploration strategy than ε-greedy.
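A hedged sketch of the idea: instead of taking a random action with probability ε, act greedily with respect to Q plus an uncertainty bonus. The Q-values, bonus values, and weighting constant below are made up for illustration:

```python
import numpy as np

def optimistic_action(q_values, w_bonus, c=1.0):
    """Pick the argmax of Q plus a c-weighted uncertainty bonus,
    rather than exploring uniformly at random as in epsilon-greedy."""
    return int(np.argmax(q_values + c * w_bonus))

q = np.array([1.0, 0.9, 0.2])
bonus = np.array([0.0, 0.5, 0.1])   # e.g. a Wasserstein-derived bonus
print(optimistic_action(q, bonus))  # 1: the bonus flips the greedy choice
```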
They say:
In order to improve the quality of the generated frames, it is common to also add a class of multiple losses and capture different frequency aspects of the frames. Therefore, we also add 10 * L1 + 90 * L2 loss to the GAN loss in order to improve the training process.
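The quoted recipe amounts to weighted reconstruction penalties on top of the adversarial term. A minimal sketch, assuming a WGAN-style generator loss; the stand-in arrays and the `gats_generator_loss` name are illustrative:

```python
import numpy as np

def gats_generator_loss(critic_score_fake, generated, target,
                        w_l1=10.0, w_l2=90.0):
    """Wasserstein generator term plus 10 * L1 + 90 * L2 reconstruction losses."""
    adv = -np.mean(critic_score_fake)           # WGAN generator term (assumed form)
    l1 = np.mean(np.abs(generated - target))    # captures low-frequency structure
    l2 = np.mean((generated - target) ** 2)     # penalizes large pixel errors
    return adv + w_l1 * l1 + w_l2 * l2

gen = np.array([0.5, 0.5])
tgt = np.array([0.0, 1.0])
print(gats_generator_loss(np.array([0.2]), gen, tgt))  # -0.2 + 10*0.5 + 90*0.25 = 27.3
```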
Refs:
- Image-to-Image Translation with Conditional Adversarial Networks
- Action-Conditional Video Prediction using Deep Networks in Atari Games
Learned Environment Models
- World Models
- Learning Robot Policies by Dreaming
- Sample-Efficient Deep RL with Generative Adversarial Tree Search