5. Why Policy Gradients?
Learn the policy directly instead of Q or V
No dynamics model needed
Vanilla Q-learning is intractable for large action spaces
10. Intuition
Gradient tries to:
● Increase the probability of paths with positive reward
● Decrease the probability of paths with negative reward
Figure source: Schulman & Abbeel
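The intuition above is the score-function (REINFORCE) estimator: the gradient of expected reward is E[∇ log π(a) · R], so sampled actions with positive reward get their log-probability pushed up, and those with negative reward pushed down. A minimal sketch on a hypothetical 3-armed bandit (the arm rewards, learning rate, and step count are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical bandit: arm 0 pays +1, arm 1 pays 0, arm 2 pays -1.
rewards = np.array([1.0, 0.0, -1.0])
theta = np.zeros(3)  # policy logits

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    # Gradient of log softmax w.r.t. logits: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    # Score-function update: grad log pi(a) scaled by the reward
    theta += 0.1 * grad_logp * rewards[a]

p = softmax(theta)
# The positive-reward arm ends up most probable,
# the negative-reward arm least probable.
```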
21. Active areas of research
New environments!
Better sample efficiency
Transfer learning
Perception
Exploration/Auxiliary Tasks
22. TensorFlow tips and tricks
Use tf summaries for bookkeeping
Variable scoping for re-use
● tf.get_collection(tf.GraphKeys.VARIABLES)
Global counters
Coordinator and server APIs for multi-threaded/distributed training
Gradient clipping
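The gradient-clipping tip can be illustrated without TensorFlow: this NumPy sketch mirrors what `tf.clip_by_global_norm` does (the function name and sample values here are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm
    is at most clip_norm; returns (clipped grads, original norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [g * scale for g in grads], global_norm

# Example: two gradient tensors with joint norm sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, 5.0)
```

Clipping by the *global* norm (rather than per-tensor) preserves the direction of the overall update while bounding its magnitude, which guards against the occasional huge policy-gradient step.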
24. RL tips and tricks
Standardize your rollouts
Batch size makes a big difference
Neural net architecture doesn’t matter that much:
● batch norm, dropout, etc. help less than in supervised learning
Policy gradients don’t benefit as much from off-policy exploration (e.g. ε-greedy)
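“Standardize your rollouts” usually means normalizing each batch of returns (or advantages) to zero mean and unit variance before computing the policy gradient, so that roughly half the sampled actions are reinforced and half discouraged, which reduces gradient variance. A minimal sketch, with hypothetical return values:

```python
import numpy as np

def standardize(returns, eps=1e-8):
    """Shift and scale a batch of returns to zero mean, unit std.
    eps avoids division by zero when all returns are equal."""
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)

# Hypothetical returns from one batch of rollouts
batch_returns = [10.0, 12.0, 8.0, 14.0]
adv = standardize(batch_returns)
# Actions with above-average return get positive advantage,
# below-average get negative advantage.
```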