Episodic Policy Gradient Training
Authors: Hung Le, Majid Abdolshah, Thommen Karimpanal George, Kien Do, Dung Nguyen,
Svetha Venkatesh
Presented by Hung Le
Introduction
Role of hyperparameters in RL
• RL is very sensitive to hyperparameters
• SOTA performance is achieved with extensive hyperparameter tuning
[Figure: table of DQN hyperparameters, not to mention the neural network architecture]
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
A quick taxonomy of hyperparameter (HP) optimization
• HP tuning (the HP is fixed during a training run)
o Parallel search: grid search, random search, evolutionary search
o Sequential search: Bayesian optimization (BO)
• HP scheduling (the HP changes across training)
o Parallel search: population-based
o Sequential search: meta-gradients, greedy search, episodic search (ours)
Why hyperparameter scheduling (HS)?
• Keeping a hyperparameter fixed throughout training is suboptimal
o E.g., the learning rate is often reduced over training to guarantee convergence
• Empirical studies show that in many cases dynamic hyperparameters perform better
François-Lavet, Vincent, Raphael Fonteneau, and Damien Ernst. "How to discount deep reinforcement learning: Towards new dynamic strategies." arXiv preprint arXiv:1512.02011 (2015).
How to learn a good hyperparameter schedule?
Limitation of current HS
• The optimization process has no access to the context of training
• HS is treated as a stateless bandit or greedy optimization problem
Ignoring the context prevents the use of episodic experiences, which can be critical in optimization and planning.
E.g., the hyperparameters that helped overcome a past local optimum in the loss surface can be reused when the learning algorithm falls into a similar local optimum.
How to build the state (context)
• If we know the loss landscape and the current parameters → we know the exact state
• More practical assumption: State = [current parameters + (estimated) derivatives]
• It is huge → need to learn a compact representation
Images:
https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf
https://towardsdatascience.com/debugging-your-neural-nets-and-checking-your-gradients-f4d7f55da167
Hyper-RL formulation
Finding hyperparameters as an MDP
• At each policy update, the hyper-agent:
o Observes the training context: the hyper-state
o Configures the PG algorithm with suitable hyperparameters ψ: the hyper-action
o Trains the RL agent with ψ and observes the learning progress: the hyper-reward
• The goal of the Hyper-RL is the same as the main RL's: to maximize the return of the RL agent
→ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (hyper-return)
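The observe / configure / train cycle on this slide can be sketched as a plain loop. This is a minimal illustration, not the authors' implementation: `run_hyper_episode`, the callback names, the learning-rate values in `HYPER_ACTIONS`, and the discount `gamma` are all assumptions made for the example.

```python
# Hypothetical discretized hyper-action set: candidate learning rates for psi.
HYPER_ACTIONS = [1e-4, 1e-3, 1e-2]


def run_hyper_episode(observe_state, choose_action, policy_update,
                      num_updates, gamma=0.9):
    """Treat one training run as a hyper-episode.

    observe_state()        -> hyper-state (the training context)
    choose_action(state)   -> index into HYPER_ACTIONS (the hyper-action)
    policy_update(psi)     -> average return after the update (the hyper-reward)
    gamma is an assumed discount for accumulating hyper-rewards.
    Returns a list of [state, action, reward, hyper_return] entries.
    """
    trajectory = []
    for _ in range(num_updates):
        state = observe_state()
        action = choose_action(state)
        reward = policy_update(HYPER_ACTIONS[action])
        trajectory.append([state, action, reward])

    # Hyper-return: discounted sum of the hyper-rewards that follow each step,
    # computed by walking the trajectory backwards.
    hyper_return = 0.0
    for step in reversed(trajectory):
        hyper_return = step[2] + gamma * hyper_return
        step.append(hyper_return)
    return trajectory
```

The three callbacks stand in for the real components: a state encoder, a hyper-agent, and one policy-gradient update of the RL agent.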
Hyper-RL structure
[Figure: training progress as a sequence of update steps interleaved with environment steps]
• Hyper-state: weights and gradients
• Hyper-action: discretized ψ
• Hyper-reward: average return
Hyper-state representation learning
Image: https://medium.com/retina-ai-health-inc/variational-inference-derivation-of-the-variational-autoencoder-vae-loss-function-a-true-story-3543a3dc67ee
• Compress the parameters and gradients into a vector hyper-state s
• A VAE learns to reconstruct s
• The VAE's latent vector is the hyper-state representation
Compression via linear mapping
• Parameters and derivatives are formed as tensors $W_m^n \in \mathbb{R}^{d' \times d_{nm}}$ (n: order, m: layer)
• High-order derivatives are estimated by taking the difference of the gradients
• Learnable linear mapping $C_m^n \in \mathbb{R}^{d_{nm} \times d}$
→ $d_{nm} \gg d$
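A rough sketch of the compression step, under stated assumptions: each tensor W (d' × d_nm) is multiplied by its mapping C (d_nm × d), and the results are flattened and concatenated into one vector. The helper names are hypothetical, the mappings here are fixed random matrices rather than learned ones, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def matmul(a, b):
    """Multiply matrix a (r x k) by matrix b (k x c); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def difference(grad_now, grad_prev):
    """Estimate the next-order derivative as the difference of two gradients."""
    return [[x - y for x, y in zip(r1, r2)]
            for r1, r2 in zip(grad_now, grad_prev)]


def compress(tensors, mappings):
    """Map each W through its C, then flatten and concatenate the results."""
    features = []
    for w, c in zip(tensors, mappings):
        for row in matmul(w, c):
            features.extend(row)
    return features


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_prime, d_nm, d = 4, 64, 2                      # toy sizes with d_nm >> d

weights = random_matrix(d_prime, d_nm, rng)      # order n = 0
grad_prev = random_matrix(d_prime, d_nm, rng)
grad_now = random_matrix(d_prime, d_nm, rng)     # order n = 1
second_order = difference(grad_now, grad_prev)   # order n = 2 (estimated)

tensors = [weights, grad_now, second_order]
# Assumption: C is learnable in the slides; it is random here for illustration.
mappings = [random_matrix(d_nm, d, rng) for _ in tensors]
s = compress(tensors, mappings)                  # 3 tensors * d' * d values
```

For one layer with three orders, the 3 × 4 × 64 raw values shrink to 3 × 4 × 2 = 24, which is the point of requiring d_nm ≫ d.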
Episodic memory as a practical solution
Why episodic memory?
• Any standard RL algorithm can be used to solve the Hyper-RL
• However:
o The number of update steps is small relative to the number of environment steps
o The hyper-agent must be sample-efficient and arrive at good hyper-actions quickly; otherwise it makes the training of the RL agent chaotic
• Episodic memory:
o Simple and non-parametric
o Estimates values via nearest-neighbor lookup (no learning needed)
o Supports contextual decision making, e.g., we may use past experiences of traffic to decide not to return home from work at 5pm
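The nearest-neighbor lookup can be illustrated with a short sketch. It assumes one memory per hyper-action, holding (hyper-state key, hyper-return) pairs, and an inverse-distance weighting; `knn_value`, `best_action`, `k`, and `eps` are hypothetical names and choices, not the exact rule from the paper.

```python
import math


def knn_value(memory, query, k=3, eps=1e-3):
    """Estimate the hyper-return at `query` from its k nearest stored keys.

    memory: list of (key, value) pairs, where key is a tuple of floats.
    The inverse-distance weighting is an assumed kernel.
    """
    if not memory:
        return 0.0  # assumed default for an empty memory
    nearest = sorted(memory, key=lambda kv: math.dist(kv[0], query))[:k]
    weights = [1.0 / (math.dist(key, query) + eps) for key, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def best_action(memories, query, k=3):
    """Pick the hyper-action whose memory predicts the highest hyper-return.

    memories: dict mapping each hyper-action to its own (key, value) list.
    """
    return max(memories, key=lambda a: knn_value(memories[a], query, k))
```

No parameters are trained here: a new experience becomes usable as soon as it is stored, which is what makes the approach attractive when update steps are scarce.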
Episodic memory for Hyper-RL
• Estimate the value of any hyper-state-action pair
Memory layout:
KEY | VALUE
Experience (hyper-state/action) | Outcome (hyper-return)
Update memory
Given a new outcome, update the values in the memory
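One possible form of such an update is sketched below, assuming a weighted-average write: the stored values nearest to the new key move toward the new hyper-return in proportion to their closeness, and the new pair is then stored. The function name `write`, the step size `lr`, and the inverse-distance weights are assumptions for illustration only.

```python
import math


def write(memory, key, hyper_return, k=3, lr=0.5, eps=1e-3):
    """Blend a new outcome into nearby entries, then store it.

    memory: list of [key, value] lists, mutated in place.
    lr and the inverse-distance weights are assumed, not taken from the paper.
    """
    nearest = sorted(memory, key=lambda kv: math.dist(kv[0], key))[:k]
    weights = [1.0 / (math.dist(kv[0], key) + eps) for kv in nearest]
    total = sum(weights)
    for w, kv in zip(weights, nearest):
        # Closer neighbours move further toward the new outcome.
        kv[1] += lr * (w / total) * (hyper_return - kv[1])
    memory.append([tuple(key), hyper_return])
```

Writing to neighbours as well as appending lets a single outcome refine the estimates for similar training contexts.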
Experimental results
Classical control: Episodic memory vs DQN
Continuous control: EPGT vs prior HS
Atari: EPGT vs manual tuning
Ablation study
Key takeaways about our episodic training
• Jointly optimize the hyperparameters and parameters of RL models (this paper focuses on policy gradient RL)
• Treat the hyperparameter optimization problem as a Hyper-RL whose state representation is the context of training
• Learn the context of training by reconstructing the model's parameters, derivatives, …
• Solve the Hyper-RL with episodic control:
o Episodic memory storing hyper-state, hyper-action and hyper-value
o Weighted-average writing mechanism
• Results are consistently good across:
o MuJoCo, Atari, …
o A2C, PPO, ACKTR, …
o Learning rate, batch size, clip, GAE lambda, …
Thank you
thai.le@deakin.edu.au
A²I²
Deakin University
Geelong Waurn Ponds
Campus, Geelong, VIC 3220
Hung Le

More Related Content

What's hot

Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Chris Ohk
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Dongmin Lee
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAnirban Santara
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmSupun Abeysinghe
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed banditJie-Han Chen
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLJie-Han Chen
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 
Algorithms and Programming
Algorithms and ProgrammingAlgorithms and Programming
Algorithms and ProgrammingMelanie Knight
 
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoHridyesh Bisht
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
 

What's hot (20)

Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGI
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn Algorithm
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed bandit
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RL
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
Algorithms and Programming
Algorithms and ProgrammingAlgorithms and Programming
Algorithms and Programming
 
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demo
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
 

Similar to Episodic Policy Gradient Training

Information Theoretic aspect of reinforcement learning
Information Theoretic aspect of reinforcement learningInformation Theoretic aspect of reinforcement learning
Information Theoretic aspect of reinforcement learningJongsuHa
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
AI_Unit-4_Learning.pptx
AI_Unit-4_Learning.pptxAI_Unit-4_Learning.pptx
AI_Unit-4_Learning.pptxMohammadAsim91
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneXiaoweiJiang7
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Deep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDDeep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDAmmar Rashed
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
 
Online Hyperparameter Meta-Learning with Hypergradient Distillation
Online Hyperparameter Meta-Learning with Hypergradient DistillationOnline Hyperparameter Meta-Learning with Hypergradient Distillation
Online Hyperparameter Meta-Learning with Hypergradient DistillationMLAI2
 
Population Based Training of Neural Networks
Population Based Training of Neural NetworksPopulation Based Training of Neural Networks
Population Based Training of Neural NetworksDADAJONJURAKUZIEV
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Jisu Han
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1YasutoTamura1
 
part3Module 3 ppt_with classification.pptx
part3Module 3 ppt_with classification.pptxpart3Module 3 ppt_with classification.pptx
part3Module 3 ppt_with classification.pptxVaishaliBagewadikar
 
MLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performanceMLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performancejemin lee
 
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptxMahmoudAbuGhali
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...Seldon
 
Introduction to cyclical learning rates for training neural nets
Introduction to cyclical learning rates for training neural netsIntroduction to cyclical learning rates for training neural nets
Introduction to cyclical learning rates for training neural netsSayak Paul
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectKushal417
 

Similar to Episodic Policy Gradient Training (20)

Information Theoretic aspect of reinforcement learning
Information Theoretic aspect of reinforcement learningInformation Theoretic aspect of reinforcement learning
Information Theoretic aspect of reinforcement learning
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
AI_Unit-4_Learning.pptx
AI_Unit-4_Learning.pptxAI_Unit-4_Learning.pptx
AI_Unit-4_Learning.pptx
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tune
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Deep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfDDeep Q-learning from Demonstrations DQfD
Deep Q-learning from Demonstrations DQfD
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Online Hyperparameter Meta-Learning with Hypergradient Distillation
Online Hyperparameter Meta-Learning with Hypergradient DistillationOnline Hyperparameter Meta-Learning with Hypergradient Distillation
Online Hyperparameter Meta-Learning with Hypergradient Distillation
 
Population Based Training of Neural Networks
Population Based Training of Neural NetworksPopulation Based Training of Neural Networks
Population Based Training of Neural Networks
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1
 
part3Module 3 ppt_with classification.pptx
part3Module 3 ppt_with classification.pptxpart3Module 3 ppt_with classification.pptx
part3Module 3 ppt_with classification.pptx
 
MLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performanceMLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performance
 
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
 
Tuning learning rate
Tuning learning rateTuning learning rate
Tuning learning rate
 
Introduction to cyclical learning rates for training neural nets
Introduction to cyclical learning rates for training neural netsIntroduction to cyclical learning rates for training neural nets
Introduction to cyclical learning rates for training neural nets
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS Project
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Episodic Policy Gradient Training

  • 1. Episodic Policy Gradient Training Authors: Hung Le, Majid Abdolshah, Thommen Karimpanal George, Kien Do, Dung Nguyen, Svetha Venkatesh Presented by Hung Le 1
  • 3. Role of hyperparameters in RL • RL is very sensitive to hyperparameters • SOTA performance is achieved with extensive hyperparameter tuning Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133. 3 DQN Hyperparameters Not to mention Neural network architecture
  • 4. A quick taxonomy of hyperparameter (HP) optimization 4 HP Optimization HP Tuning Parallel search Grid search Random search Evolutionary search Sequential search BO HP Scheduling Parallel search Population- based Sequential search Meta- gradients Greedy search Episodic search (ours) RED: Slide focus HP is fixed during a training run HP changes across training
  • 5. Why hyperparameter scheduling (HS)? • Keeping a hyperparameter fixed during training is suboptimal (e.g., the learning rate is often reduced over training to guarantee convergence) • Empirical studies show that, in many cases, dynamic hyperparameters are better. Reference: François-Lavet, Vincent, Raphael Fonteneau, and Damien Ernst. "How to discount deep reinforcement learning: Towards new dynamic strategies." arXiv preprint arXiv:1512.02011 (2015).
  • 6. How to learn a good hyperparameter schedule?
  • 7. Limitation of current HS • Current methods do not use the context of training in the optimization process • HS is treated as a stateless bandit or greedy optimization • Ignoring the context prevents the use of episodic experiences that can be critical in optimization and planning • E.g., the hyperparameters that helped overcome a past local optimum in the loss surface can be reused when the learning algorithm falls into a similar local optimum
  • 8. How to build the state (context) • If we know the loss landscape and the current parameters → the exact state • More practical assumption: State = [current parameters + (estimated) derivatives] • It is huge → need to learn a compact representation. Images: https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf and https://towardsdatascience.com/debugging-your-neural-nets-and-checking-your-gradients-f4d7f55da167
  • 10. Finding hyperparameters as an MDP • At each policy update, the hyper-agent: • Observes the training context (hyper-state) • Configures the PG algorithm with suitable hyperparameters ψ (hyper-action) • Trains the RL agent with ψ and observes the learning progress (hyper-reward) • The goal of the Hyper-RL is the same as the main RL’s: to maximize the return of the RL agent → At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (hyper-return)
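The hyper-MDP loop described on this slide can be sketched as follows. This is only an illustration of the interaction pattern, not the authors' implementation: `ToyLearner`, its toy dynamics, the candidate ψ values, the random placeholder policy, and the discount `gamma` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyLearner:
    """Hypothetical stand-in for a policy gradient learner (not a real RL agent)."""
    avg_return: float = 0.0
    step: int = 0

    def hyper_state(self):
        # A real hyper-state would summarize weights and gradients.
        return (self.step, self.avg_return)

    def update(self, psi):
        # Toy dynamics (assumption): progress is largest when psi is near 1e-3.
        self.avg_return += 1.0 / (1.0 + abs(psi - 1e-3) * 1e3)
        self.step += 1
        return self.avg_return  # hyper-reward: average return after the update


def random_hyper_agent(hyper_state, actions, rng):
    """Placeholder policy; any hyper-RL solver could be plugged in here."""
    return rng.choice(actions)


def run(num_updates=5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    actions = [1e-4, 1e-3, 1e-2]  # discretized psi (illustrative values)
    learner = ToyLearner()
    transitions = []
    for _ in range(num_updates):
        s = learner.hyper_state()                # observe hyper-state
        a = random_hyper_agent(s, actions, rng)  # choose hyper-action psi
        r = learner.update(a)                    # train with psi, get hyper-reward
        transitions.append((s, a, r))
    # Accumulate hyper-rewards backwards into a (discounted) hyper-return per step.
    returns, g = [], 0.0
    for _, _, r in reversed(transitions):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return transitions, returns


transitions, returns = run()
print(len(transitions), len(returns))
```

The hyper-returns computed at the end are what a hyper-agent would try to maximize when choosing ψ at each hyper-state.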
  • 11. Hyper-RL structure • [Diagram: training progress shown as a sequence of update-step/env-step pairs] • Hyper-state: weights and gradients • Hyper-action: discretized ψ • Hyper-reward: average return
  • 12. Hyper-state representation learning • Compress the parameters/gradients to a vector hyper-state s • A VAE learns to reconstruct s • The latent vector is the hyper-state representation. Image: https://medium.com/retina-ai-health-inc/variational-inference-derivation-of-the-variational-autoencoder-vae-loss-function-a-true-story-3543a3dc67ee
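For readers unfamiliar with the VAE objective, here is a minimal NumPy sketch of one common form of the loss (squared-error reconstruction plus the KL term for a diagonal Gaussian posterior) and of the reparameterization step. The function names are hypothetical, and the deck does not state which reconstruction term or encoder/decoder architecture is used.

```python
import numpy as np


def vae_loss(s, s_recon, mu, logvar):
    """Reconstruction error plus KL(q(z|s) || N(0, I)) for one hyper-state vector."""
    recon = np.sum((s - s_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + kl


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

After training, the posterior mean `mu` (or a sample `z`) would serve as the compact hyper-state representation.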
  • 13. Compression via linear mapping • Parameters and derivatives are formed as tensors $W_m^n \in \mathbb{R}^{d' \times d_{nn}}$ (n: order, m: layer) • High-order derivatives are estimated by taking differences of the gradients • Learnable linear mapping $C_m^n \in \mathbb{R}^{d_{nn} \times d}$, with $d_{nn} \gg d$
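The compression step can be sketched in NumPy as below. This assumes that each tensor is multiplied by its own mapping and that the results are flattened and concatenated into one vector. The deck does not specify how the compressed tensors are combined, and the mappings here are fixed random matrices standing in for the learnable ones. The names `d_out`, `d_in`, and `d` are illustrative.

```python
import numpy as np


def estimate_derivatives(grad_history, max_order=2):
    """Return [g_t, g_t - g_{t-1}, ...]: the latest gradient, then finite differences."""
    orders = [grad_history[-1]]
    diffs = list(grad_history)
    for _ in range(max_order - 1):
        diffs = [b - a for a, b in zip(diffs[:-1], diffs[1:])]
        orders.append(diffs[-1])
    return orders


def compress(tensors, mappings):
    """Map each (d_out x d_in) tensor through its (d_in x d) matrix, then concatenate."""
    parts = [(W @ C).ravel() for W, C in zip(tensors, mappings)]
    return np.concatenate(parts)


rng = np.random.default_rng(0)
d_out, d_in, d = 4, 100, 3  # d_in >> d
W = rng.standard_normal((d_out, d_in))                          # one layer's weights
grads = [rng.standard_normal((d_out, d_in)) for _ in range(2)]  # two recent gradients
tensors = [W] + estimate_derivatives(grads, max_order=2)
mappings = [rng.standard_normal((d_in, d)) for _ in tensors]    # stand-ins for learnable C
hyper_state = compress(tensors, mappings)
print(hyper_state.shape)  # (36,)
```

Three tensors of shape 4 x 100 shrink to three blocks of shape 4 x 3, so the vector has 36 entries instead of 1200.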
  • 14. Episodic memory as a practical solution
  • 15. Why episodic memory? • Any standard RL algorithm can be used to solve the Hyper-RL • However: • The number of update steps is small relative to the number of env steps • The hyper-agent must be sample-efficient and arrive at good hyper-actions fast; otherwise it makes training the RL agent chaotic • Episodic memory: • Simple, non-parametric • Estimates value via nearest-neighbor lookup (no need to learn) • Contextual decision making, e.g., we may use past experiences of traffic to avoid returning home from work at 5pm
  • 16. Episodic memory for Hyper-RL • Estimate the value of any hyper-state-action pair • Memory layout: KEY = experienced hyper-state/action | VALUE = outcome (hyper-return)
  • 17. Update memory • Given a new outcome, update the values in the memory
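A key-value episodic memory with nearest-neighbor reads and weighted-average writes can be sketched as follows. The details are assumptions for illustration, not the authors' exact rules: Euclidean distance, inverse-distance read weights, the `1 / (1 + distance)` write kernel, the step size `alpha`, and the one-hot action encoding in `make_key`.

```python
import numpy as np


class EpisodicMemory:
    """Non-parametric store: key = hyper-state/action vector, value = hyper-return."""

    def __init__(self, k=3, alpha=0.5):
        self.k = k          # number of nearest neighbors
        self.alpha = alpha  # write step size (assumption)
        self.keys = []
        self.values = []

    def _neighbors(self, key):
        dist = np.linalg.norm(np.stack(self.keys) - key, axis=1)
        idx = np.argsort(dist)[: self.k]
        return idx, dist[idx]

    def read(self, key):
        """Estimate a value as the inverse-distance weighted mean of the k nearest values."""
        if not self.keys:
            return 0.0
        idx, dist = self._neighbors(np.asarray(key, dtype=float))
        w = 1.0 / (dist + 1e-3)
        return float(np.dot(w / w.sum(), np.asarray(self.values)[idx]))

    def write(self, key, hyper_return):
        """Pull nearby stored values toward the new outcome, then store the new pair."""
        key = np.asarray(key, dtype=float)
        if self.keys:
            idx, dist = self._neighbors(key)
            for i, d in zip(idx, dist):
                self.values[i] += self.alpha / (1.0 + d) * (hyper_return - self.values[i])
        self.keys.append(key)
        self.values.append(float(hyper_return))


def make_key(hyper_state, action, num_actions):
    """Concatenate the hyper-state with a one-hot code of the discrete hyper-action."""
    one_hot = np.zeros(num_actions)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(hyper_state, dtype=float), one_hot])


def best_action(memory, hyper_state, num_actions):
    """Pick the hyper-action with the highest estimated hyper-return."""
    scores = [memory.read(make_key(hyper_state, a, num_actions)) for a in range(num_actions)]
    return int(np.argmax(scores))
```

Because values come from lookup rather than gradient-based learning, a handful of stored outcomes is already enough to rank hyper-actions at a similar hyper-state. This sketch appends on every write, so a real memory would need a capacity limit.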
  • 19. Classical control: Episodic memory vs DQN
  • 20. Continuous control: EPGT vs prior HS
  • 21. Atari: EPGT vs manual tuning
  • 23. Key takeaways about our episodic training • Jointly optimize hyperparameters and parameters of RL models (this paper focuses on policy gradient RL) • Treat the hyperparameter optimization problem as a Hyper-RL whose state representation is the context of training • Learn the context of training by reconstructing the model’s parameters, derivatives, … • Solve the Hyper-RL with Episodic Control: • Episodic memory storing hyper-state, hyper-action and hyper-value • Weighted-average writing mechanism • Results are consistently good across: • Mujoco, Atari, … • A2C, PPO, ACKTR, … • Learning rate, batch size, clip, GAE lambda, …
  • 24. Thank you. Hung Le, thai.le@deakin.edu.au, A²I², Deakin University, Geelong Waurn Ponds Campus, Geelong, VIC 3220