We introduce a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on-the-fly.
1. Episodic Policy Gradient Training
Authors: Hung Le, Majid Abdolshah, Thommen Karimpanal George, Kien Do, Dung Nguyen,
Svetha Venkatesh
Presented by Hung Le
3. Role of hyperparameters in RL
• RL is very sensitive to hyperparameters
• SOTA performance is achieved with extensive hyperparameter tuning

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
[Figure: table of DQN hyperparameters, not to mention the neural network architecture]
4. A quick taxonomy of hyperparameter (HP) optimization

• HP Tuning: HP is fixed during a training run
  o Parallel search: grid search, random search, evolutionary search
  o Sequential search: Bayesian optimization (BO)
• HP Scheduling: HP changes across training
  o Parallel search: population-based training
  o Sequential search: meta-gradients, greedy search, and episodic search (ours, the focus of this talk)
5. Why hyperparameter scheduling (HS)?

• A fixed hyperparameter throughout training is suboptimal
  o E.g., the learning rate is often reduced over training to guarantee convergence (see the sketch below)
• Empirical studies show that in many cases dynamic hyperparameters are better

François-Lavet, Vincent, Raphael Fonteneau, and Damien Ernst. "How to discount deep reinforcement learning: Towards new dynamic strategies." arXiv preprint arXiv:1512.02011 (2015).
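To make the fixed-vs-dynamic contrast concrete, here is a minimal Python sketch; the linear-decay form and all numbers are illustrative assumptions, not from the paper:

```python
# Illustrative only: a fixed learning rate vs. a simple decaying schedule.
def fixed_lr(step, total_steps, lr=3e-4):
    return lr  # same value for the whole run (HP tuning regime)

def linear_decay_lr(step, total_steps, lr0=3e-4, lr_min=1e-5):
    frac = step / total_steps
    return lr0 + frac * (lr_min - lr0)  # shrinks over training (HP scheduling regime)

for step in (0, 50_000, 100_000):
    print(step, fixed_lr(step, 100_000), linear_decay_lr(step, 100_000))
```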
7. Limitations of current HS

• Current methods don't incorporate the context of training into the optimization process
• The problem is treated as a stateless bandit or solved by greedy optimization

Ignoring the context prevents the use of episodic experiences that can be critical in optimization and planning. E.g., the hyperparameters that helped overcome a past local optimum in the loss surface can be reused when the learning algorithm falls into a similar local optimum.
8. How to build the state (context)

• If we knew the loss landscape and the current parameters, we would have the exact state
• A more practical assumption: state = [current parameters + (estimated) derivatives]
• This raw state is huge, so we need to learn a compact representation (a sketch follows the image credits below)

Images:
https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf
https://towardsdatascience.com/debugging-your-neural-nets-and-checking-your-gradients-f4d7f55da167
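As a concrete illustration of "state = parameters + estimated derivatives", here is a small NumPy sketch. The function and variable names are hypothetical; for real networks this vector is far too large, which motivates the compression on the following slides:

```python
import numpy as np

# Hypothetical sketch: build the raw hyper-state from flattened parameters,
# the current gradient, and a finite-difference estimate of the next-order
# derivative (difference of successive gradients, as on slide 13).
def build_raw_hyper_state(flat_params, grad, prev_grad):
    second_order_est = grad - prev_grad  # crude order-2 estimate
    return np.concatenate([flat_params, grad, second_order_est])

params = np.random.randn(10)
grad_t, grad_tm1 = np.random.randn(10), np.random.randn(10)
state = build_raw_hyper_state(params, grad_t, grad_tm1)
print(state.shape)  # (30,) -- already huge for real networks, hence the compression
```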
10. Finding hyperparameters as an MDP

• At each policy update, the hyper-agent:
  o Observes the training context (the hyper-state)
  o Configures the PG algorithm with suitable hyperparameters ψ (the hyper-action)
  o Trains the RL agent with ψ and observes the learning progress (the hyper-reward)
• The goal of the Hyper-RL is the same as the main RL's: to maximize the return of the RL agent

At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (the hyper-return). A runnable skeleton of this loop follows.
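The sketch below shows the observe/act/train/reward cycle in Python. Every component is a dummy stand-in (the real hyper-agent uses the episodic memory of slides 15-16, and the "policy update" here is a toy); only the loop structure mirrors the bullets above:

```python
import random

PSI_CHOICES = [1e-4, 3e-4, 1e-3]  # discretized hyper-actions (assumed values)

class DummyHyperAgent:
    def observe(self, weights):
        return round(sum(weights), 1)      # toy "training context" (hyper-state)
    def act(self, state):
        return random.choice(PSI_CHOICES)  # real version: episodic-memory lookup
    def update(self, state, psi, reward):
        pass                               # real version: write to episodic memory

weights = [1.0, -0.5]
hyper_agent = DummyHyperAgent()
for t in range(3):                         # each iteration = one policy update
    s = hyper_agent.observe(weights)
    psi = hyper_agent.act(s)               # hyper-action: pick hyperparameters
    weights = [w - psi * w for w in weights]  # dummy PG update with lr = psi
    hyper_reward = -abs(weights[0])           # dummy learning-progress signal
    hyper_agent.update(s, psi, hyper_reward)  # improve the hyper-policy
    print(f"update {t}: psi={psi:.0e}, hyper-reward={hyper_reward:.2f}")
```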
11. Hyper-RL structure

[Figure: as training progresses, update steps alternate with environment steps; each update step is one Hyper-RL transition. Hyper-state: weights and gradients; hyper-action: discretized ψ; hyper-reward: average return.]
12. Hyper-state representation learning

• Compress the parameters/gradients into a vector hyper-state s
• A VAE learns to reconstruct s (sketched below)
• The latent vector is the hyper-state representation

Image: https://medium.com/retina-ai-health-inc/variational-inference-derivation-of-the-variational-autoencoder-vae-loss-function-a-true-story-3543a3dc67ee
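A minimal PyTorch sketch of this idea follows. The layer sizes and the 16-dimensional latent are assumptions, but the structure (encode s, sample with the reparameterization trick, reconstruct, take the latent mean as the representation) is standard VAE practice:

```python
import torch
import torch.nn as nn

class HyperStateVAE(nn.Module):
    """Compresses a raw hyper-state vector s into a small latent code."""
    def __init__(self, in_dim=512, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, s):
        h = torch.relu(self.enc(s))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(s, recon, mu, logvar):
    recon_term = ((recon - s) ** 2).sum()                         # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()     # KL to N(0, I)
    return recon_term + kl

vae = HyperStateVAE()
s = torch.randn(8, 512)                  # a batch of raw hyper-states
recon, mu, logvar = vae(s)
vae_loss(s, recon, mu, logvar).backward()
# mu.detach() then serves as the compact hyper-state representation
```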
13. Compression via linear mapping

• Parameters and their order-n derivatives at layer m are arranged as tensors W_m^n ∈ R^(d′ × d_nm) (n: order, m: layer)
• High-order derivatives are estimated by taking differences of the gradients
• A learnable linear mapping C_m^n ∈ R^(d_nm × d) compresses them, where d_nm ≫ d
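In code, the compression is just a matrix product. The PyTorch sketch below uses assumed sizes and also shows the finite-difference estimate of a higher-order derivative mentioned above:

```python
import torch

d_prime, d_nm, d = 32, 4096, 8               # assumed sizes, with d_nm >> d
W = torch.randn(d_prime, d_nm)               # W_m^n: e.g., layer m's gradient, reshaped
C = torch.nn.Parameter(torch.randn(d_nm, d) / d_nm ** 0.5)  # C_m^n: learnable mapping
compressed = W @ C                           # shape (d_prime, d): compact summary

# Higher-order derivatives are estimated by differencing successive gradients,
# then compressed with the same kind of mapping:
grad_t, grad_tm1 = torch.randn(d_prime, d_nm), torch.randn(d_prime, d_nm)
second_order = grad_t - grad_tm1             # crude order-2 estimate
print(compressed.shape, (second_order @ C).shape)
```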
15. Why episodic memory?

• Any standard RL algorithm could solve the Hyper-RL
• However:
  o The number of update steps is small relative to the number of env steps
  o The hyper-agent must be sample-efficient and arrive at good hyper-actions quickly; otherwise it makes training the RL agent chaotic
• Episodic memory:
  o Simple and non-parametric
  o Estimates values via nearest-neighbor lookup (no learning needed)
  o Supports contextual decision making, e.g., we may use past experiences of traffic to avoid driving home from work at 5pm
16. Episodic memory for Hyper-RL

• Estimates the value of any hyper-state/action pair (see the sketch below)

Experience memory:
  KEY: experienced hyper-state/action
  VALUE: outcome (hyper-return)
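Below is one plausible Python implementation of such a memory, assuming Euclidean keys over [hyper-state representation; hyper-action id], kNN reads, and the weighted-average write mentioned in the takeaways. The details (k, write rate, distance threshold) are illustrative assumptions:

```python
import numpy as np

class EpisodicMemory:
    """Non-parametric key-value store: keys are (hyper-state, hyper-action)
    vectors, values are hyper-returns."""
    def __init__(self, k=3, write_rate=0.5):
        self.keys, self.values = [], []
        self.k, self.write_rate = k, write_rate

    def read(self, key):
        """Estimate the hyper-value of a hyper-state/action by kNN lookup."""
        if not self.keys:
            return 0.0
        dists = np.linalg.norm(np.array(self.keys) - key, axis=1)
        nearest = np.argsort(dists)[: self.k]
        weights = 1.0 / (dists[nearest] + 1e-6)   # closer neighbors count more
        return float(np.average(np.array(self.values)[nearest], weights=weights))

    def write(self, key, hyper_return):
        """Blend the observed hyper-return into the closest stored entry."""
        if self.keys:
            dists = np.linalg.norm(np.array(self.keys) - key, axis=1)
            i = int(np.argmin(dists))
            if dists[i] < 1e-3:                   # treat as the same entry
                self.values[i] += self.write_rate * (hyper_return - self.values[i])
                return
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(float(hyper_return))

mem = EpisodicMemory()
mem.write(np.array([0.1, 0.2, 1.0]), hyper_return=5.0)  # last dim: action id
print(mem.read(np.array([0.1, 0.2, 1.0])))              # ~5.0, no learning needed
```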
23. Key takeaways about our episodic training

• Jointly optimizes hyperparameters and parameters of RL models (this paper focuses on policy gradient RL)
• Treats hyperparameter optimization as a Hyper-RL problem whose state representation is the context of training
• Learns the context of training by reconstructing the model's parameters, derivatives, …
• Solves the Hyper-RL with episodic control:
  o Episodic memory storing hyper-states, hyper-actions and hyper-values
  o Weighted-average writing mechanism
• Results are consistently good:
  o Mujoco, Atari, …
  o A2C, PPO, ACKTR, …
  o Learning rate, batch size, clip, GAE lambda, …