Sample-Efficient Deep RL with Generative Adversarial Tree Search
Overview
They use a GAN as the dynamics model, based on pix2pix, with the Wasserstein distance as the loss and spectral normalization to stabilize training. The generator's input is four consecutive frames, Gaussian noise, and a sequence of actions.
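A minimal sketch of how that conditioning might be assembled, assuming 84x84 grayscale Atari frames and one-hot actions broadcast over spatial planes (the sizes and the `make_generator_input` helper are illustrative, not the paper's exact architecture):

```python
import numpy as np

H = W = 84      # assumed Atari frame size
N_ACTIONS = 4   # assumed action-space size
ROLLOUT = 1     # number of future actions the generator conditions on

def make_generator_input(frames, actions, rng):
    """Stack the GAN generator input channel-wise: 4 consecutive frames,
    one Gaussian-noise channel, and tiled one-hot action planes."""
    assert frames.shape == (4, H, W)
    noise = rng.standard_normal((1, H, W))  # Gaussian noise channel
    # Broadcast each one-hot action over a full spatial plane.
    action_planes = np.zeros((ROLLOUT * N_ACTIONS, H, W))
    for t, a in enumerate(actions):
        action_planes[t * N_ACTIONS + a] = 1.0
    return np.concatenate([frames, noise, action_planes], axis=0)

rng = np.random.default_rng(0)
x = make_generator_input(rng.random((4, H, W)), [2], rng)
print(x.shape)  # (9, 84, 84): 4 frames + 1 noise + 4 action planes
```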
The Wasserstein distance can also serve as an optimism bonus on the Q-function, giving a better exploration strategy than ε-greedy.
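A hedged sketch of the idea: instead of taking a random action with probability ε, act greedily with respect to Q plus an uncertainty bonus. The Q-values, bonus values, and weighting constant below are made up for illustration:

```python
import numpy as np

def optimistic_action(q_values, w_bonus, c=1.0):
    """Pick the argmax of Q plus a c-weighted uncertainty bonus,
    rather than exploring uniformly at random as in epsilon-greedy."""
    return int(np.argmax(q_values + c * w_bonus))

q = np.array([1.0, 0.9, 0.2])
bonus = np.array([0.0, 0.5, 0.1])   # e.g. a Wasserstein-derived bonus
print(optimistic_action(q, bonus))  # 1: the bonus flips the greedy choice
```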
They say:
In order to improve the quality of the generated frames, it is common to also add a class of multiple losses and capture different frequency aspects of the frames. Therefore, we also add 10 * L1 + 90 * L2 loss to the GAN loss in order to improve the training process.
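The quoted recipe amounts to weighted reconstruction penalties on top of the adversarial term. A minimal sketch, assuming a WGAN-style generator loss; the stand-in arrays and the `gats_generator_loss` name are illustrative:

```python
import numpy as np

def gats_generator_loss(critic_score_fake, generated, target,
                        w_l1=10.0, w_l2=90.0):
    """Wasserstein generator term plus 10 * L1 + 90 * L2 reconstruction losses."""
    adv = -np.mean(critic_score_fake)           # WGAN generator term (assumed form)
    l1 = np.mean(np.abs(generated - target))    # captures low-frequency structure
    l2 = np.mean((generated - target) ** 2)     # penalizes large pixel errors
    return adv + w_l1 * l1 + w_l2 * l2

gen = np.array([0.5, 0.5])
tgt = np.array([0.0, 1.0])
print(gats_generator_loss(np.array([0.2]), gen, tgt))  # -0.2 + 10*0.5 + 90*0.25 = 27.3
```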
Refs:
- Image-to-Image Translation with Conditional Adversarial Networks
- Action-Conditional Video Prediction using Deep Networks in Atari Games
Learned Environment Models
- World Models
- Learning Robot Policies by Dreaming
- Sample-Efficient Deep RL with Generative Adversarial Tree Search