Off-Policy Actor-Critic
Notes
Action-value methods like Greedy-GQ and Q-Learning have three limitations:
- The learned policies are deterministic.
- Finding the maximizing action is difficult in large action spaces.
- Small changes in the action value function can produce large changes in behavior.
Policy gradient methods, including Actor-Critic, avoid those limitations, but before this paper they were all on-policy.
The Off-PAC algorithm:
- Initialize the vectors $e_v$, $e_u$, and $w$ to zero
- Initialize the vectors $v$ and $u$ arbitrarily
- Initialize the state $s$
- For each step:
    - Choose an action, $a$, according to $b(·\|s)$
    - Observe the resultant reward, $r$, and next state, $s′$
    - $\delta \leftarrow r + \gamma(s′)v^Tx_{s′} − v^Tx_s$
    - $ρ \leftarrow \pi_u(a\|s) / b(a\|s)$
    - Update the critic using the GTD($\lambda$) algorithm:
        - $e_v \leftarrow ρ(x_s + \gamma(s) \lambda e_v)$
        - $v \leftarrow v + \alpha_v [\delta e_v − \gamma(s′)(1 − \lambda)(w^Te_v)x_{s′}]$
        - $w \leftarrow w + \alpha_w [\delta e_v − (w^Tx_s)x_s]$
    - Update the actor:
        - $e_u \leftarrow ρ [ \frac{\nabla_u \pi_u (a\|s)}{\pi_u(a\|s)} + \gamma(s) \lambda e_u]$
        - $u \leftarrow u + \alpha_u \delta e_u$
    - $s \leftarrow s′$
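Here is a minimal Python sketch of the update loop above, assuming linear state features and a softmax (Gibbs) target policy over a small discrete action set. The class name, step sizes, the constant discount standing in for $\gamma(s)$, and the uniform behavior policy in the usage lines at the bottom are my own illustrative choices, not from the paper.

```python
import numpy as np


class OffPAC:
    """Sketch of Off-PAC with linear features and a softmax target policy."""

    def __init__(self, n_features, n_actions, alpha_v=0.01, alpha_w=0.001,
                 alpha_u=0.001, lam=0.4, gamma=0.99):
        self.alpha_v, self.alpha_w, self.alpha_u = alpha_v, alpha_w, alpha_u
        self.lam, self.gamma = lam, gamma
        self.v = np.zeros(n_features)                 # critic weights
        self.w = np.zeros(n_features)                 # secondary GTD(lambda) weights
        self.u = np.zeros((n_actions, n_features))    # actor (policy) weights
        self.e_v = np.zeros(n_features)               # critic eligibility trace
        self.e_u = np.zeros((n_actions, n_features))  # actor eligibility trace

    def pi(self, x):
        """Softmax target policy pi_u(.|s) for feature vector x."""
        prefs = self.u @ x
        prefs -= prefs.max()                          # numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    def grad_log_pi(self, x, a):
        """grad_u log pi_u(a|s) for the softmax policy (one row per action)."""
        probs = self.pi(x)
        g = -np.outer(probs, x)
        g[a] += x
        return g

    def step(self, x, a, b_prob, r, x_next, terminal=False):
        """One Off-PAC update; b_prob is the behavior policy's probability of a.

        gamma(s) is treated as a constant here, with termination handled by
        zeroing gamma(s'); traces should also be reset at episode boundaries.
        """
        gamma_next = 0.0 if terminal else self.gamma  # plays the role of gamma(s')
        delta = r + gamma_next * self.v @ x_next - self.v @ x
        rho = self.pi(x)[a] / b_prob                  # importance-sampling ratio

        # Critic: GTD(lambda) update of v, with secondary weights w.
        self.e_v = rho * (x + self.gamma * self.lam * self.e_v)
        self.v += self.alpha_v * (delta * self.e_v
                                  - gamma_next * (1.0 - self.lam)
                                  * (self.w @ self.e_v) * x_next)
        self.w += self.alpha_w * (delta * self.e_v - (self.w @ x) * x)

        # Actor: off-policy policy-gradient update with a trace.
        self.e_u = rho * (self.grad_log_pi(x, a) + self.gamma * self.lam * self.e_u)
        self.u += self.alpha_u * delta * self.e_u


# Hypothetical usage with one-hot state features and a uniform behavior policy.
agent = OffPAC(n_features=8, n_actions=3)
x, x_next = np.eye(8)[0], np.eye(8)[1]
a = np.random.randint(3)
agent.step(x, a, b_prob=1.0 / 3, r=1.0, x_next=x_next)
```

The $\frac{\nabla_u \pi_u(a\|s)}{\pi_u(a\|s)}$ term in the actor trace is just $\nabla_u \log \pi_u(a\|s)$, which for a softmax policy has the simple closed form computed in `grad_log_pi`.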
They only discuss eligibility traces with a linear combination of state features, so I have no idea how well they would work with a neural network. I'm also not sure what the $w$ weights are; they look like the secondary weight vector that GTD($\lambda$) uses for its gradient-correction term, but the paper never explains them.