We introduce a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on-the-fly.
1. Episodic Policy Gradient Training
Authors: Hung Le, Majid Abdolshah, Thommen Karimpanal George, Kien Do, Dung Nguyen,
Svetha Venkatesh
Presented by Hung Le
3. Role of hyperparameters in RL
• RL is very sensitive to hyperparameters
• SOTA performance is achieved with extensive hyperparameter tuning

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
[Figure: table of DQN hyperparameters, not to mention the neural network architecture]
4. A quick taxonomy of hyperparameter (HP) optimization

• HP Tuning: HP is fixed during a training run
  o Parallel search: grid search, random search, evolutionary search
  o Sequential search: Bayesian optimization (BO)
• HP Scheduling: HP changes across training
  o Parallel search: population-based training
  o Sequential search: meta-gradients, greedy search, and episodic search (ours, the focus of this talk)
5. Why hyperparameter scheduling (HS)?

• A fixed hyperparameter throughout training is suboptimal
  o E.g., the learning rate is often reduced over training to guarantee convergence (see the sketch below)
• Empirical studies show that in many cases dynamic hyperparameters are better

François-Lavet, Vincent, Raphael Fonteneau, and Damien Ernst. "How to discount deep reinforcement learning: Towards new dynamic strategies." arXiv preprint arXiv:1512.02011 (2015).
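To make the fixed-vs-dynamic contrast concrete, here is a minimal Python sketch; the linear-decay form and all numbers are illustrative assumptions, not from the paper:

```python
# Illustrative only: a fixed learning rate vs. a simple decaying schedule.
def fixed_lr(step, total_steps, lr=3e-4):
    return lr  # same value for the whole run (HP tuning regime)

def linear_decay_lr(step, total_steps, lr0=3e-4, lr_min=1e-5):
    frac = step / total_steps
    return lr0 + frac * (lr_min - lr0)  # shrinks over training (HP scheduling regime)

for step in (0, 50_000, 100_000):
    print(step, fixed_lr(step, 100_000), linear_decay_lr(step, 100_000))
```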
7. Limitations of current HS

• Current methods don't incorporate the context of training into the optimization process
• The problem is treated as a stateless bandit or solved by greedy optimization

Ignoring the context prevents the use of episodic experiences that can be critical in optimization and planning. E.g., the hyperparameters that helped overcome a past local optimum in the loss surface can be reused when the learning algorithm falls into a similar local optimum.
8. How to build the state (context)

• If we knew the loss landscape and the current parameters, we would have the exact state
• A more practical assumption: state = [current parameters + (estimated) derivatives]
• This raw state is huge, so we need to learn a compact representation (a sketch follows the image credits below)

Images:
https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf
https://towardsdatascience.com/debugging-your-neural-nets-and-checking-your-gradients-f4d7f55da167
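As a concrete illustration of "state = parameters + estimated derivatives", here is a small NumPy sketch. The function and variable names are hypothetical; for real networks this vector is far too large, which motivates the compression on the following slides:

```python
import numpy as np

# Hypothetical sketch: build the raw hyper-state from flattened parameters,
# the current gradient, and a finite-difference estimate of the next-order
# derivative (difference of successive gradients, as on slide 13).
def build_raw_hyper_state(flat_params, grad, prev_grad):
    second_order_est = grad - prev_grad  # crude order-2 estimate
    return np.concatenate([flat_params, grad, second_order_est])

params = np.random.randn(10)
grad_t, grad_tm1 = np.random.randn(10), np.random.randn(10)
state = build_raw_hyper_state(params, grad_t, grad_tm1)
print(state.shape)  # (30,) -- already huge for real networks, hence the compression
```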
10. Finding hyperparameters as an MDP

• At each policy update, the hyper-agent:
  o Observes the training context (the hyper-state)
  o Configures the PG algorithm with suitable hyperparameters ψ (the hyper-action)
  o Trains the RL agent with ψ and observes the learning progress (the hyper-reward)
• The goal of the Hyper-RL is the same as the main RL's: to maximize the return of the RL agent

At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (the hyper-return). A runnable skeleton of this loop follows.
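The sketch below shows the observe/act/train/reward cycle in Python. Every component is a dummy stand-in (the real hyper-agent uses the episodic memory of slides 15-16, and the "policy update" here is a toy); only the loop structure mirrors the bullets above:

```python
import random

PSI_CHOICES = [1e-4, 3e-4, 1e-3]  # discretized hyper-actions (assumed values)

class DummyHyperAgent:
    def observe(self, weights):
        return round(sum(weights), 1)      # toy "training context" (hyper-state)
    def act(self, state):
        return random.choice(PSI_CHOICES)  # real version: episodic-memory lookup
    def update(self, state, psi, reward):
        pass                               # real version: write to episodic memory

weights = [1.0, -0.5]
hyper_agent = DummyHyperAgent()
for t in range(3):                         # each iteration = one policy update
    s = hyper_agent.observe(weights)
    psi = hyper_agent.act(s)               # hyper-action: pick hyperparameters
    weights = [w - psi * w for w in weights]  # dummy PG update with lr = psi
    hyper_reward = -abs(weights[0])           # dummy learning-progress signal
    hyper_agent.update(s, psi, hyper_reward)  # improve the hyper-policy
    print(f"update {t}: psi={psi:.0e}, hyper-reward={hyper_reward:.2f}")
```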
11. Hyper-RL structure

[Figure: as training progresses, update steps alternate with environment steps; each update step is one Hyper-RL transition. Hyper-state: weights and gradients; hyper-action: discretized ψ; hyper-reward: average return.]
12. Hyper-state representation learning

• Compress the parameters/gradients into a vector hyper-state s
• A VAE learns to reconstruct s (sketched below)
• The latent vector is the hyper-state representation

Image: https://medium.com/retina-ai-health-inc/variational-inference-derivation-of-the-variational-autoencoder-vae-loss-function-a-true-story-3543a3dc67ee
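A minimal PyTorch sketch of this idea follows. The layer sizes and the 16-dimensional latent are assumptions, but the structure (encode s, sample with the reparameterization trick, reconstruct, take the latent mean as the representation) is standard VAE practice:

```python
import torch
import torch.nn as nn

class HyperStateVAE(nn.Module):
    """Compresses a raw hyper-state vector s into a small latent code."""
    def __init__(self, in_dim=512, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, s):
        h = torch.relu(self.enc(s))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(s, recon, mu, logvar):
    recon_term = ((recon - s) ** 2).sum()                         # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()     # KL to N(0, I)
    return recon_term + kl

vae = HyperStateVAE()
s = torch.randn(8, 512)                  # a batch of raw hyper-states
recon, mu, logvar = vae(s)
vae_loss(s, recon, mu, logvar).backward()
# mu.detach() then serves as the compact hyper-state representation
```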
13. Compression via linear mapping

• Parameters and their order-n derivatives at layer m are arranged as tensors W_m^n ∈ R^(d′ × d_nm) (n: order, m: layer)
• High-order derivatives are estimated by taking differences of the gradients
• A learnable linear mapping C_m^n ∈ R^(d_nm × d) compresses them, where d_nm ≫ d
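In code, the compression is just a matrix product. The PyTorch sketch below uses assumed sizes and also shows the finite-difference estimate of a higher-order derivative mentioned above:

```python
import torch

d_prime, d_nm, d = 32, 4096, 8               # assumed sizes, with d_nm >> d
W = torch.randn(d_prime, d_nm)               # W_m^n: e.g., layer m's gradient, reshaped
C = torch.nn.Parameter(torch.randn(d_nm, d) / d_nm ** 0.5)  # C_m^n: learnable mapping
compressed = W @ C                           # shape (d_prime, d): compact summary

# Higher-order derivatives are estimated by differencing successive gradients,
# then compressed with the same kind of mapping:
grad_t, grad_tm1 = torch.randn(d_prime, d_nm), torch.randn(d_prime, d_nm)
second_order = grad_t - grad_tm1             # crude order-2 estimate
print(compressed.shape, (second_order @ C).shape)
```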
15. Why episodic memory?

• Any standard RL algorithm could solve the Hyper-RL
• However:
  o The number of update steps is small relative to the number of env steps
  o The hyper-agent must be sample-efficient and arrive at good hyper-actions quickly; otherwise it makes training the RL agent chaotic
• Episodic memory:
  o Simple and non-parametric
  o Estimates values via nearest-neighbor lookup (no learning needed)
  o Supports contextual decision making, e.g., we may use past experiences of traffic to avoid driving home from work at 5pm
16. Episodic memory for Hyper-RL

• Estimates the value of any hyper-state/action pair (see the sketch below)

Experience memory:
  KEY: experienced hyper-state/action
  VALUE: outcome (hyper-return)
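Below is one plausible Python implementation of such a memory, assuming Euclidean keys over [hyper-state representation; hyper-action id], kNN reads, and the weighted-average write mentioned in the takeaways. The details (k, write rate, distance threshold) are illustrative assumptions:

```python
import numpy as np

class EpisodicMemory:
    """Non-parametric key-value store: keys are (hyper-state, hyper-action)
    vectors, values are hyper-returns."""
    def __init__(self, k=3, write_rate=0.5):
        self.keys, self.values = [], []
        self.k, self.write_rate = k, write_rate

    def read(self, key):
        """Estimate the hyper-value of a hyper-state/action by kNN lookup."""
        if not self.keys:
            return 0.0
        dists = np.linalg.norm(np.array(self.keys) - key, axis=1)
        nearest = np.argsort(dists)[: self.k]
        weights = 1.0 / (dists[nearest] + 1e-6)   # closer neighbors count more
        return float(np.average(np.array(self.values)[nearest], weights=weights))

    def write(self, key, hyper_return):
        """Blend the observed hyper-return into the closest stored entry."""
        if self.keys:
            dists = np.linalg.norm(np.array(self.keys) - key, axis=1)
            i = int(np.argmin(dists))
            if dists[i] < 1e-3:                   # treat as the same entry
                self.values[i] += self.write_rate * (hyper_return - self.values[i])
                return
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(float(hyper_return))

mem = EpisodicMemory()
mem.write(np.array([0.1, 0.2, 1.0]), hyper_return=5.0)  # last dim: action id
print(mem.read(np.array([0.1, 0.2, 1.0])))              # ~5.0, no learning needed
```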
23. Key takeaways about our episodic training

• Jointly optimizes hyperparameters and parameters of RL models (this paper focuses on policy gradient RL)
• Treats hyperparameter optimization as a Hyper-RL problem whose state representation is the context of training
• Learns the context of training by reconstructing the model's parameters, derivatives, …
• Solves the Hyper-RL with episodic control:
  o Episodic memory storing hyper-states, hyper-actions and hyper-values
  o Weighted-average writing mechanism
• Results are consistently good:
  o Mujoco, Atari, …
  o A2C, PPO, ACKTR, …
  o Learning rate, batch size, clip, GAE lambda, …