Off-Policy Actor-Critic

Notes

Action-value methods like Greedy-GQ and Q-learning have three limitations:

  1. The learned policies are deterministic.
  2. Finding the maximizing action is expensive when the action space is large.
  3. Small changes in the action-value function can produce large changes in behavior (see the toy example below).
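
A quick illustration of the third point, with toy numbers of my own: under a greedy policy, a tiny change in one action value can completely flip the selected action.

```python
import numpy as np

q = np.array([1.00, 1.01])   # action values for a single state
print(np.argmax(q))          # greedy action: 1

q[0] += 0.02                 # a small change to the value function...
print(np.argmax(q))          # ...flips the greedy action to 0
```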

Policy gradient methods, including actor-critic methods, avoid these limitations, but before this paper they were all on-policy.

The Off-PAC algorithm:

Initialize the vectors $e_v$, $e_u$, and $w$ to zero

Initialize the vectors $v$ and $u$ arbitrarily

Initialize the state $s$

For each step:

Choose an action $a$ according to $b(\cdot \mid s)$

Observe the resultant reward $r$ and next state $s'$

$\delta \leftarrow r + \gamma(s') v^T x_{s'} - v^T x_s$

$\rho \leftarrow \pi_u(a \mid s) / b(a \mid s)$

Update the critic using the GTD($\lambda$) algorithm:

$e_v \leftarrow \rho (x_s + \gamma(s) \lambda e_v)$

$v \leftarrow v + \alpha_v [\delta e_v - \gamma(s')(1 - \lambda)(w^T e_v) x_{s'}]$

$w \leftarrow w + \alpha_w [\delta e_v - (w^T x_s) x_s]$

Update the actor:

$e_u \leftarrow \rho \left[ \frac{\nabla_u \pi_u(a \mid s)}{\pi_u(a \mid s)} + \gamma(s) \lambda e_u \right]$

$u \leftarrow u + \alpha_u \delta e_u$

$s \leftarrow s'$
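
To make the per-step update concrete, here is a minimal NumPy sketch of a single Off-PAC step with linear value features. The helper names (`features`, `pi_prob`, `grad_log_pi`, `b_prob`, `gamma`) and the step sizes are placeholders of my own, not from the paper.

```python
import numpy as np

def off_pac_step(s, a, r, s_next, v, w, u, e_v, e_u,
                 features, pi_prob, grad_log_pi, b_prob, gamma,
                 alpha_v=0.01, alpha_w=0.001, alpha_u=0.001, lam=0.4):
    """One Off-PAC update step (a sketch of the pseudocode above, not the authors' code).

    features(s)          -> feature vector x_s for state s
    pi_prob(u, a, s)     -> pi_u(a|s), probability of a under the target policy
    grad_log_pi(u, a, s) -> gradient of log pi_u(a|s) w.r.t. u (equals grad pi / pi)
    b_prob(a, s)         -> b(a|s), probability of a under the behavior policy
    gamma(s)             -> state-dependent discount (0 at terminal states)
    """
    x_s, x_next = features(s), features(s_next)

    # TD error and importance-sampling ratio
    delta = r + gamma(s_next) * (v @ x_next) - v @ x_s
    rho = pi_prob(u, a, s) / b_prob(a, s)

    # Critic: GTD(lambda) update of v, with w as GTD's second set of weights
    e_v = rho * (x_s + gamma(s) * lam * e_v)
    v = v + alpha_v * (delta * e_v
                       - gamma(s_next) * (1 - lam) * (w @ e_v) * x_next)
    w = w + alpha_w * (delta * e_v - (w @ x_s) * x_s)

    # Actor: importance-weighted eligibility trace on grad log pi
    e_u = rho * (grad_log_pi(u, a, s) + gamma(s) * lam * e_u)
    u = u + alpha_u * delta * e_u

    return v, w, u, e_v, e_u
```

The caller then sets $s \leftarrow s'$ and repeats. Note that $\nabla_u \pi_u(a \mid s) / \pi_u(a \mid s) = \nabla_u \log \pi_u(a \mid s)$, which is usually the easier quantity to compute; for a softmax policy over state-action features $\phi(s, a)$ it is $\phi(s, a) - \sum_{a'} \pi_u(a' \mid s)\, \phi(s, a')$.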

They only talk about using eligibility traces with a linear combination of state features, so I have no idea how well they would work with a neural network. I'm also not sure what the $w$ weights are; they aren't explained anywhere in the paper.