Unicorn: Continual learning with a universal, off-policy agent




Unicorn combines Universal Value Function Approximators (UVFAs) with off-policy learning updates across goals.

UVFAs extend value functions to be conditioned on a goal signal: Q(s, a; g).

CNN -> FC -> + Prev Action and Reward -> LSTM -> + goal signal matrix -> MLP -> Q-values (Actions x Goals)

The goal signal matrix has shape (number of goals) x (goal representation dimensionality). The goal axis is carried through to the Q-values, giving one set of action values per goal.
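
To make the shapes concrete, here is a minimal NumPy sketch of the last stage only (LSTM output + goal signal matrix -> MLP -> Q-values). It is an illustration under assumptions, not the paper's implementation: the function name `q_head`, the two-layer ReLU MLP, the layer sizes, the random weights, and the one-hot goal signals are all hypothetical choices made here.

```python
import numpy as np

def q_head(lstm_out, goal_matrix, w1, b1, w2, b2):
    """Map one LSTM output plus all goal signals to Q-values of shape (actions, goals)."""
    num_goals = goal_matrix.shape[0]
    # Pair the same state features with every goal row.
    tiled = np.tile(lstm_out, (num_goals, 1))          # (goals, hidden)
    x = np.concatenate([tiled, goal_matrix], axis=1)   # (goals, hidden + goal_dim)
    h = np.maximum(x @ w1 + b1, 0.0)                   # ReLU layer, (goals, mlp_hidden)
    q = h @ w2 + b2                                    # (goals, actions)
    return q.T                                         # (actions, goals)

# Hypothetical sizes, chosen only for illustration.
hidden, goal_dim, mlp_hidden, num_actions, num_goals = 8, 4, 16, 5, 3

rng = np.random.default_rng(0)
lstm_out = rng.normal(size=hidden)
goal_matrix = np.eye(num_goals, goal_dim)  # one-hot goal signals (an assumption)
w1 = rng.normal(size=(hidden + goal_dim, mlp_hidden))
b1 = np.zeros(mlp_hidden)
w2 = rng.normal(size=(mlp_hidden, num_actions))
b2 = np.zeros(num_actions)

q_values = q_head(lstm_out, goal_matrix, w1, b1, w2, b2)
print(q_values.shape)  # (5, 3): one column of action values per goal
```

Because each goal row is processed independently by the same MLP weights, column i of the output depends only on the state features and goal i.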

Q-values are estimated for every goal, not just the currently active one, and TD errors are summed across all goals. For off-policy goals (the ones not currently active), the return is truncated as soon as the action taken under the on-policy goal differs from the action the off-policy goal would have chosen.
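
The truncation rule can be sketched as follows, assuming it is applied to n-step returns and that "the action the goal would have chosen" means the greedy action under that goal's Q-values. The function names, the array layout, the squared-error loss, and the choice to check for a mismatch only from the second step onward (since the first action is the one being evaluated) are assumptions of this sketch, not details taken from the notes above.

```python
import numpy as np

def truncated_targets(q, actions, rewards, active_goal, gamma=0.99):
    """Return one n-step target per goal for the first step of a trajectory.

    q:       (n + 1, num_actions, num_goals) Q-values along the trajectory
    actions: (n,) actions actually taken while following the active goal
    rewards: (n, num_goals) reward each goal would have received at each step
    """
    n, num_goals = rewards.shape
    greedy = q.argmax(axis=1)  # (n + 1, num_goals): greedy action per step and goal
    targets = np.zeros(num_goals)
    for g in range(num_goals):
        ret, discount, k = 0.0, 1.0, 0
        while True:
            ret += discount * rewards[k, g]
            discount *= gamma
            k += 1
            if k == n:
                break  # reached the full n-step horizon
            if g != active_goal and actions[k] != greedy[k, g]:
                break  # off-policy goal disagrees with the taken action: truncate
        # Bootstrap from the goal's own best action value at the cut-off step.
        targets[g] = ret + discount * q[k, :, g].max()
    return targets

def summed_td_loss(q, actions, rewards, active_goal, gamma=0.99):
    """Squared TD errors for the first step, summed across all goals."""
    targets = truncated_targets(q, actions, rewards, active_goal, gamma)
    td = targets - q[0, actions[0], :]
    return 0.5 * np.sum(td ** 2)

# Tiny hand-built trajectory: n = 3 steps, 2 actions, 2 goals; goal 0 is active.
q = np.zeros((4, 2, 2))
q[1, 0, 1] = 2.0   # goal 1 prefers action 0 at step 1
q[3, 1, 0] = 8.0   # bootstrap value for goal 0 at the end of the trajectory
actions = np.array([0, 1, 0])
rewards = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

targets = truncated_targets(q, actions, rewards, active_goal=0, gamma=0.5)
loss = summed_td_loss(q, actions, rewards, active_goal=0, gamma=0.5)
```

In this example the active goal gets the full 3-step return (1 + 0.5 + 0.25 + 0.125 * 8 = 2.75), while goal 1 is cut off after one step because the action taken at step 1 is not its greedy action, so its target is 0 + 0.5 * 2 = 1.0.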


The biggest drawback seems to be that the number of goals has to be specified ahead of time.


