Sample-Efficient Deep RL with Generative Adversarial Tree Search

Published:

Overview

They use a GAN as the dynamics model. It is based on pix2pix, trained with the Wasserstein metric as the loss, and uses spectral normalization to make training more stable. The input to the GAN is four consecutive frames, Gaussian noise, and a sequence of actions.
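To make the input description concrete, here is a minimal sketch of how the three conditioning inputs could be bundled before being passed to a generator. The function name `build_generator_input`, the 84x84 frame size, the noise dimension, and the one-hot action encoding are all assumptions made for illustration; these notes do not say how the paper encodes or injects each input.

```python
import numpy as np


def build_generator_input(frames, actions, num_actions, noise_dim=8, rng=None):
    """Bundle four frames, Gaussian noise, and an action sequence.

    Hypothetical helper: shapes and encodings are illustrative assumptions.
    frames:  sequence of 4 grayscale frames, each of shape (H, W)
    actions: sequence of integer action ids to condition the rollout on
    """
    rng = np.random.default_rng() if rng is None else rng

    # Stack the four consecutive frames along a channel axis: (4, H, W).
    stacked = np.stack(frames, axis=0).astype(np.float32)
    if stacked.shape[0] != 4:
        raise ValueError("expected four consecutive frames")

    # One-hot encode the action sequence: (T, num_actions).
    action_ids = np.asarray(actions, dtype=np.int64)
    one_hot = np.eye(num_actions, dtype=np.float32)[action_ids]

    # Sample the Gaussian noise vector: (noise_dim,).
    noise = rng.standard_normal(noise_dim).astype(np.float32)

    return {"frames": stacked, "noise": noise, "actions": one_hot}


if __name__ == "__main__":
    blank = [np.zeros((84, 84)) for _ in range(4)]
    batch = build_generator_input(blank, actions=[0, 2, 1], num_actions=3)
    print({name: value.shape for name, value in batch.items()})
```

Returning the three parts separately, rather than concatenating them, leaves open where the noise and actions enter the network, since that detail is not covered here.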

The Wasserstein distance can be used to approximate optimism in the Q-function, which they propose as a better exploration method than ε-greedy.
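One way to picture the difference from ε-greedy is the sketch below. It assumes a Wasserstein estimate is already available as one number per action and simply adds it to the Q-value as a scaled bonus. That additive form, the `bonus_scale` parameter, and both function names are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Baseline: pick a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def optimistic_action(q_values, w_estimates, bonus_scale=1.0):
    """Pick the action maximizing Q plus a Wasserstein-based bonus.

    Assumption: w_estimates[a] is a per-action Wasserstein estimate,
    treated here as a measure of how uncertain the model is about a.
    """
    q = np.asarray(q_values, dtype=np.float64)
    w = np.asarray(w_estimates, dtype=np.float64)
    return int(np.argmax(q + bonus_scale * w))


if __name__ == "__main__":
    q = [1.0, 0.9, 0.2]
    w = [0.0, 0.5, 0.1]
    print(optimistic_action(q, w))
    print(epsilon_greedy(q, epsilon=0.1, rng=np.random.default_rng(0)))
```

In this sketch the bonus directs exploration toward actions with large estimates, whereas ε-greedy explores uniformly at random regardless of what the model has already seen.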

They say:

In order to improve the quality of the generated frames, it is common to also add a class of multiple losses and capture different frequency aspects of the frames. Therefore, we also add 10 * L1 + 90 * L2 loss to the GAN loss in order to improve the training process.
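As a rough sketch of the combined objective in the quote, the function below adds the weighted reconstruction terms to a generator loss. The weights 10 and 90 come from the quote. Everything else is an assumption: L1 is taken as mean absolute error, L2 as mean squared error, and the GAN term as the usual WGAN generator loss (the negated mean critic score on generated frames). The function name and arguments are hypothetical.

```python
import numpy as np


def generator_loss(critic_scores_fake, predicted, target,
                   l1_weight=10.0, l2_weight=90.0):
    """GAN loss + 10 * L1 + 90 * L2, under the assumptions stated above."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Assumed WGAN generator term: push critic scores on fakes upward.
    adversarial = -np.mean(critic_scores_fake)

    # Assumed definitions: mean absolute error and mean squared error.
    l1 = np.mean(np.abs(predicted - target))
    l2 = np.mean((predicted - target) ** 2)

    return adversarial + l1_weight * l1 + l2_weight * l2


if __name__ == "__main__":
    predicted = np.zeros((2, 2))
    target = np.full((2, 2), 0.5)
    print(generator_loss([1.0, 3.0], predicted, target))
```

With a perfect reconstruction both pixel terms vanish and only the adversarial term remains.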

Refs:

Learned Environment Models