Mastering the game of Go without human knowledge
They use a single network with a shared representation for the policy and value outputs, each computed by a separate output head.
Policy head: Conv (2 filters, 1×1, stride 1), BN, ReLU, fully connected layer with 362 outputs (19² + 1: 361 board points plus the pass move)
Value head: Conv (1 filter, 1×1, stride 1), BN, ReLU, FC 256, ReLU, FC with one output, tanh ([-1, 1])
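A minimal sketch of the two heads, assuming PyTorch, a 19×19 board, and 256 filters coming out of the shared residual tower (the tower itself is omitted). Class and variable names are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # 19x19 Go board


class PolicyValueHeads(nn.Module):
    """Policy and value heads on top of a shared feature tower (sketch)."""

    def __init__(self, n_filters: int = 256):
        super().__init__()
        # Policy head: 1x1 conv with 2 filters -> BN -> ReLU -> FC to 362 logits
        self.p_conv = nn.Conv2d(n_filters, 2, kernel_size=1, stride=1)
        self.p_bn = nn.BatchNorm2d(2)
        self.p_fc = nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1)  # 361 points + pass
        # Value head: 1x1 conv with 1 filter -> BN -> ReLU -> FC 256 -> ReLU -> FC 1 -> tanh
        self.v_conv = nn.Conv2d(n_filters, 1, kernel_size=1, stride=1)
        self.v_bn = nn.BatchNorm2d(1)
        self.v_fc1 = nn.Linear(BOARD * BOARD, 256)
        self.v_fc2 = nn.Linear(256, 1)

    def forward(self, shared: torch.Tensor):
        # shared: (batch, n_filters, 19, 19) features from the shared tower
        p = F.relu(self.p_bn(self.p_conv(shared)))
        p_logits = self.p_fc(p.flatten(start_dim=1))       # (batch, 362)
        v = F.relu(self.v_bn(self.v_conv(shared)))
        v = F.relu(self.v_fc1(v.flatten(start_dim=1)))
        value = torch.tanh(self.v_fc2(v)).squeeze(-1)       # (batch,), in [-1, 1]
        return p_logits, value
```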
Loss function:
$l = (z - v)^2 - \boldsymbol{\pi}^{T} \log \mathbf{p} + c\,\|\theta\|^2$
MSE for the value output, categorical cross-entropy for the policy output, and an L2 regularizer with $c = 10^{-4}$.
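A hedged sketch of that loss, again assuming PyTorch and the heads above; `pi` is the MCTS visit-count distribution over the 362 moves, `z` the game outcome in {-1, +1}, and the function name is my own, not the paper's.

```python
import torch


def alphazero_loss(p_logits, value, pi, z, model, c=1e-4):
    """Combined value MSE + policy cross-entropy + L2 loss (illustrative sketch)."""
    # (z - v)^2, averaged over the batch
    value_loss = torch.mean((z - value) ** 2)
    # -pi^T log p, with p the softmax over the 362 policy logits
    log_p = torch.log_softmax(p_logits, dim=1)
    policy_loss = -torch.mean(torch.sum(pi * log_p, dim=1))
    # c * ||theta||^2 over all parameters
    l2 = c * sum((w ** 2).sum() for w in model.parameters())
    return value_loss + policy_loss + l2
```

In practice the L2 term is usually handled by the optimizer's `weight_decay` argument rather than an explicit sum over parameters; the explicit sum is kept here only to mirror the formula.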
By using a combined policy and value network architecture, and by using a low weight on the value component, it was possible to avoid overfitting to the values (a problem described in previous work [12]).
I don’t see the ‘low weight on the value component’ in the description of the loss function.
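Presumably it would appear as an explicit coefficient on the squared-error term, e.g. $l = \lambda (z - v)^2 - \boldsymbol{\pi}^{T} \log \mathbf{p} + c\,\|\theta\|^2$ with $\lambda < 1$ ($\lambda$ is my notation, not the paper's); as the loss is written above, the value and policy terms are weighted equally.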