Unsupervised Predictive Memory in a Goal-Directed Agent

Horribly Oversimplified Overview

The observation (image, reward at t-1, velocity, text input) is passed through an encoder for each modality, and the per-modality outputs are combined into a single embedding.
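
A rough sketch of that encoding step, in PyTorch. Everything here (layer sizes, the 6-dim velocity, the module names) is my own guess at a minimal version, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Encode each modality separately, then merge into one embedding."""
    def __init__(self, text_vocab=1000, embed_dim=200):
        super().__init__()
        # Image encoder: small CNN over 64x64 RGB frames (assumed size).
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        # Text encoder: embed tokens and mean-pool (assumed).
        self.text_emb = nn.Embedding(text_vocab, 32)
        # Scalars: previous reward (1-dim) and velocity (6-dim, assumed).
        self.scalar_enc = nn.Linear(1 + 6, 32)
        self.merge = nn.Linear(embed_dim + 32 + 32, embed_dim)

    def forward(self, image, prev_reward, velocity, text_ids):
        img = self.image_enc(image)
        txt = self.text_emb(text_ids).mean(dim=1)
        scal = self.scalar_enc(torch.cat([prev_reward, velocity], dim=-1))
        return self.merge(torch.cat([img, txt, scal], dim=-1))
```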

The prior from the previous step (which includes memory reads), together with the embedding, is used to form a posterior distribution q. A sample from q becomes the new state variable z.
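
In sketch form, with a diagonal Gaussian and a reparameterized sample (the parameterization below is mine; I believe the paper adds a learned correction to the prior's statistics rather than predicting the posterior from scratch):

```python
import torch
import torch.nn as nn

class PosteriorNet(nn.Module):
    """Turn (embedding, prior statistics) into a posterior q and a sample z."""
    def __init__(self, embed_dim=200, z_dim=200):
        super().__init__()
        self.net = nn.Linear(embed_dim + 2 * z_dim, 2 * z_dim)

    def forward(self, embedding, prior_mu, prior_logvar):
        h = self.net(torch.cat([embedding, prior_mu, prior_logvar], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)          # parameters of q
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterized sample of z
        return z, mu, logvar
```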

z is written to memory and passed to the policy, which also reads from memory to produce actions.
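
A deliberately stripped-down version of the write/read step: a single read head with cosine-similarity addressing (the real model is recurrent and uses several read heads; the details below are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryPolicy(nn.Module):
    """Append z to memory, read from it, and produce an action distribution."""
    def __init__(self, z_dim=200, n_actions=6):
        super().__init__()
        self.query = nn.Linear(z_dim, z_dim)           # read key from z
        self.policy = nn.Linear(2 * z_dim, n_actions)  # acts on [z, read]

    def forward(self, z, memory):
        # memory: (num_slots, z_dim) of previously written z's; z: (1, z_dim).
        key = self.query(z)
        weights = F.softmax(F.cosine_similarity(memory, key, dim=-1), dim=0)
        read = (weights.unsqueeze(-1) * memory).sum(dim=0, keepdim=True)
        logits = self.policy(torch.cat([z, read], dim=-1))
        new_memory = torch.cat([memory, z], dim=0)     # write z as a new row
        return torch.distributions.Categorical(logits=logits), new_memory
```

The initial memory can just be a single zero row, e.g. `torch.zeros(1, z_dim)`.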

z is also passed through decoders to produce a reconstruction loss for the unsupervised memory-based predictor.
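
The reconstruction side is what makes the memory-based predictor unsupervised. The decoder shapes and plain MSE losses below are placeholders, not the paper's modality-specific choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoders(nn.Module):
    """Decode z back to each input modality and sum the reconstruction losses."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.image_dec = nn.Sequential(nn.Linear(z_dim, 3 * 64 * 64), nn.Sigmoid())
        self.reward_dec = nn.Linear(z_dim, 1)
        self.velocity_dec = nn.Linear(z_dim, 6)

    def reconstruction_loss(self, z, image, prev_reward, velocity):
        img_pred = self.image_dec(z).view_as(image)
        loss = F.mse_loss(img_pred, image)
        loss = loss + F.mse_loss(self.reward_dec(z), prev_reward)
        loss = loss + F.mse_loss(self.velocity_dec(z), velocity)
        return loss
```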

Possible Simplifications

Based on their lesion studies, removing the retroactive memory updates looks to me like the change with the least effect on performance, so that would be the first simplification to try.

From a performance point of view, combining the memory and the policy had little effect; they were separated mainly to verify that the unsupervised memory worked on its own.

World Models is a much simpler version of this that doesn’t include a separate memory.