Experimenting with OpenAI's Baselines code

I forked OpenAI's Baselines code and made a few changes. This was my first full run before I started playing around with the model architecture.

Changes from OpenAI's version:

  • Turned on logging, including TensorBoard output
  • Logged rewards
  • Added a command-line option for setting the number of CPUs (see the sketch after this list)
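
Below is a minimal sketch of the kind of flag I mean, using argparse; the argument name here is illustrative, and the exact change is in the linked commit rather than this snippet.

```python
# Illustrative sketch of the extra command-line flag (not the exact code
# from my fork -- see the linked commit for that).
import argparse

parser = argparse.ArgumentParser(description='Run Atari training')
parser.add_argument('--cpu', type=int, default=3,
                    help='number of CPUs / parallel environments to use')
args = parser.parse_args()
print('Running with %d CPUs' % args.cpu)
```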

Code: Commit used

Command line: python3 ./run_atari.py --cpu=3 -m=200

I increased the nsteps parameter to 200 and ran for 200e6 timesteps, which I think offsets my very low number of CPUs.

I saved checkpoints every 5,000 updates, and also saved one after the very last update.
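
Roughly, the checkpointing cadence looks like the sketch below; run_one_update and save_model are stand-ins of my own, not the actual Baselines functions.

```python
SAVE_INTERVAL = 5000
TOTAL_UPDATES = 20000  # illustrative; the real count depends on timesteps, nsteps, and nenvs

def run_one_update():
    pass  # stand-in for one training update over nsteps

def save_model(update):
    print('saving checkpoint at update %d' % update)  # stand-in for the real save call

for update in range(1, TOTAL_UPDATES + 1):
    run_one_update()
    if update % SAVE_INTERVAL == 0:
        save_model(update)      # periodic checkpoint

save_model(TOTAL_UPDATES)       # one final checkpoint after the very last update
```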

Results of playing Breakout 100 times with the final trained acktr model from above (rewards):

  • Minimum: 0.000
  • Maximum: 104.000
  • Avg/std: 31.470/38.248

Looking at the plot of rewards versus updates, I think I could drop back down to 40e6 timesteps.

rewards vs updates

The way rewards were originally being logged doesn't make much sense. It looks like Gym treats each life (each done returned from the step() method) as an episode, while a Breakout game is 5 lives, so in my later testing I sum the scores over each run of 5 'episodes' and report that as one game.
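
A small sketch of this per-game scoring, assuming a list of per-life 'episode' rewards; the helper name is mine, not from the Baselines code.

```python
import numpy as np

def games_from_episodes(episode_rewards, lives_per_game=5):
    """Sum every 5 consecutive per-life 'episode' rewards into one game score."""
    rewards = np.asarray(episode_rewards, dtype=float)
    n_games = len(rewards) // lives_per_game
    return rewards[:n_games * lives_per_game].reshape(n_games, lives_per_game).sum(axis=1)

# Example: 10 'episodes' become 2 game scores.
print(games_from_episodes([3, 0, 5, 1, 2, 10, 4, 0, 0, 7]))  # [11. 21.]
```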

Original Model

I partially reproduced OpenAI's model in Keras so I could output a summary; since the Keras summary doesn't include filter size and stride, I added those columns myself.

Layer (type)          Output Shape        Param #    Size    Stride
conv2d_1 (Conv2D)     (None, 20, 20, 32)  8224       8x8     4
conv2d_2 (Conv2D)     (None, 9, 9, 64)    32832      4x4     2
conv2d_3 (Conv2D)     (None, 7, 7, 32)    18464      3x3     1
flatten_1 (Flatten)   (None, 1568)        0
dense_1 (Dense)       (None, 512)         803328
dense_2 (Dense)       (None, 4)           2052

Total params: 864,900

Trainable params: 864,900
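
For reference, a minimal Keras sketch that reproduces the shapes and parameter counts above. The filter sizes and strides come from the table; the 84x84x4 stacked-frame input and ReLU activations are assumptions, and the sketch leaves out the value head and the initialization the real Baselines model uses.

```python
from tensorflow.keras import layers, models

# Partial reconstruction of the policy CNN, matching the summary table above.
model = models.Sequential([
    layers.Conv2D(32, 8, strides=4, activation='relu', input_shape=(84, 84, 4)),
    layers.Conv2D(64, 4, strides=2, activation='relu'),
    layers.Conv2D(32, 3, strides=1, activation='relu'),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(4),  # one logit per Breakout action
])
model.summary()  # prints the layer shapes and parameter counts shown above
```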

3x3 Model

Layer (type)          Output Shape        Param #    Size    Stride
conv2d_1 (Conv2D)     (None, 21, 21, 32)  1184       3x3     4
conv2d_2 (Conv2D)     (None, 10, 10, 64)  18496      3x3     2
conv2d_3 (Conv2D)     (None, 8, 8, 32)    18464      3x3     1
flatten_1 (Flatten)   (None, 2048)        0
dense_1 (Dense)       (None, 512)         1049088
dense_2 (Dense)       (None, 4)           2052

Total params: 1,089,284

Trainable params: 1,089,284
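
The output shapes in both tables follow from the usual 'valid'-padding formula, floor((input - filter) / stride) + 1. A quick check for the 3x3 model:

```python
def conv_out(size, filt, stride):
    # 'valid' padding: floor((size - filter) / stride) + 1
    return (size - filt) // stride + 1

print(conv_out(84, 3, 4))  # 21
print(conv_out(21, 3, 2))  # 10
print(conv_out(10, 3, 1))  # 8
```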

Testing of checkpoints

I ran checkpointed versions of each model on 10 games and plotted the average reward with error bars at 1 standard deviation. These plots use the per-game method of computing reward described above.
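
A minimal matplotlib sketch of how such a plot can be produced; the checkpoint numbers and rewards below are placeholders, not my actual data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: per-game rewards for 10 games at each checkpoint.
checkpoints = [5000, 10000, 15000, 20000]
rewards_per_checkpoint = [np.random.rand(10) * 100 for _ in checkpoints]

means = [r.mean() for r in rewards_per_checkpoint]
stds = [r.std() for r in rewards_per_checkpoint]

plt.errorbar(checkpoints, means, yerr=stds, fmt='-o', capsize=3)
plt.xlabel('update (checkpoint)')
plt.ylabel('average reward per game')
plt.title('Average reward vs. checkpoint (error bars: 1 std dev)')
plt.show()
```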

The original model (8x8, 4x4, 3x3 filters)

Original Model

3x3 Filters

3x3 Filters