Unicorn: Continual learning with a universal, off-policy agent

Overview

Combines Universal Value Function Approximators (UVFAs) with off-policy updates toward many goals at once, so a single agent learns about every goal from every trajectory.

UVFAs extend value functions to be conditional on a goal signal: Q(s, a; g).
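In symbols (my notation, assuming a per-goal pseudo-reward r_g as in the setup above, not a formula quoted from the paper), the one-step target for a goal g is just the familiar Q-learning target conditioned on g:

$$ y_g = r_g + \gamma \max_{a'} Q(s', a'; g) $$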

CNN -> FC -> concat(prev action, prev reward) -> LSTM -> concat(goal signal matrix) -> MLP -> Q-values (Actions x Goals)

The goal signal matrix is (number of goals) x (goal representation dimensionality); this goal dimension is carried through the head, so the network outputs a Q-vector for every goal (sketched below).
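A minimal PyTorch sketch of this pipeline, under assumed layer sizes and input shapes (the class name `UnicornQNet`, the channel counts, and `hidden=256` are illustrative, not the paper's hyperparameters):

```python
import torch
import torch.nn as nn

class UnicornQNet(nn.Module):
    """CNN -> FC -> concat(prev action, reward) -> LSTM -> concat(goal) -> MLP."""

    def __init__(self, n_actions, goal_matrix, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),   # assumes RGB frames
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(hidden)  # FC layer after the CNN
        # LSTM input = visual features + one-hot prev action + prev reward
        self.lstm = nn.LSTM(hidden + n_actions + 1, hidden, batch_first=True)
        # Goal signal matrix: (num goals) x (goal representation dim)
        self.register_buffer("goals", goal_matrix)
        self.head = nn.Sequential(
            nn.Linear(hidden + goal_matrix.shape[1], hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, prev_action, prev_reward, state=None):
        # obs: (B, T, C, H, W); prev_action: (B, T, n_actions) one-hot;
        # prev_reward: (B, T, 1)
        B, T = obs.shape[:2]
        x = self.conv(obs.flatten(0, 1))
        x = torch.relu(self.fc(x)).view(B, T, -1)
        h, state = self.lstm(torch.cat([x, prev_action, prev_reward], -1), state)
        G = self.goals.shape[0]
        # Pair the LSTM output with every goal row so Q-values come out
        # for all goals at once: (B, T, G, n_actions)
        hg = torch.cat([h.unsqueeze(2).expand(B, T, G, -1),
                        self.goals.expand(B, T, G, -1)], dim=-1)
        return self.head(hg), state
```

The output tensor holds Q-values for every (goal, action) pair at each step, which is what lets one trajectory produce learning signal for all goals.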

Q-values are estimated for all goals, not just the one currently being pursued, and the TD errors are summed across all goals. For the off-policy goals (the ones not currently active), the multi-step return is truncated, bootstrapping from the value estimate instead, as soon as the action actually taken (chosen for the on-policy goal) stops matching the greedy action for that off-policy goal.
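A sketch of that update for a single trajectory segment; the function names, tensor shapes, and squared-error loss form are my assumptions, but the truncation rule follows the description above:

```python
import torch

def multi_goal_targets(q_tgt, actions, rewards, gamma=0.99):
    """Per-goal multi-step targets with off-policy truncation.

    q_tgt:   (T+1, G, A) target-network Q-values along the segment
    actions: (T,)        actions actually taken (serving the on-policy goal)
    rewards: (T, G)      pseudo-reward for every goal at every step
    """
    T, G = rewards.shape
    targets = torch.empty(T, G)
    bootstrap = q_tgt.max(dim=-1).values   # max_a Q(s_t, a; g): (T+1, G)
    ret = bootstrap[T]                     # bootstrap at the segment end
    for t in reversed(range(T)):
        if t + 1 < T:
            # Greedy action at s_{t+1} under each goal's Q-values
            greedy = q_tgt[t + 1].argmax(dim=-1)          # (G,)
            # Keep accumulating the return only for goals whose greedy
            # action matches the action actually taken; otherwise truncate
            # and bootstrap from max_a Q(s_{t+1}, a; g)
            keep = (greedy == actions[t + 1]).float()
            ret = keep * ret + (1 - keep) * bootstrap[t + 1]
        targets[t] = rewards[t] + gamma * ret
        ret = targets[t]
    return targets

def summed_td_loss(q, actions, targets):
    # q: (T, G, A) online Q-values; TD errors are summed across all goals
    T, G, _ = q.shape
    q_taken = q.gather(-1, actions.view(T, 1, 1).expand(T, G, 1)).squeeze(-1)
    return ((targets.detach() - q_taken) ** 2).sum(dim=1).mean()
```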

Notes

The biggest drawback seems to be that the number of goals has to be specified ahead of time.

Refs

T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1312–1320, 2015.