Parameter Space Noise for Exploration
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
November 02, 2017
Exploration-Exploitation Tradeoff
Exploration and exploitation must be carefully balanced for
optimal performance
Exploration in RL
Exploration in multi-armed bandits is simply choosing a suboptimal
arm. How do we explore in RL environments?
Naive approaches:
ε-greedy actions in DQN
Entropy loss in policy gradient methods
More sophisticated approaches:
Density Modelling
Dynamics Modelling
Self-supervised curiosity
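The two naive approaches listed above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `epsilon_greedy` and `entropy` are assumptions, and the entropy value would typically be added to a policy gradient loss as a bonus term.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# epsilon = 0 never explores; a uniform policy has maximal entropy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
uniform_entropy = entropy([0.5, 0.5])
```

Both mechanisms inject randomness at the level of individual actions, independently at every step, which is the behaviour that parameter space noise is meant to replace.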
Parameter Space Noise for Exploration
Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon
Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel,
Marcin Andrychowicz
Proposed Method
θ̃ = θ + N(0, σ²I)
We perturb the policy parameters at the beginning of each episode and
keep them fixed for the entire rollout
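A minimal sketch of the perturbation step, assuming a toy linear policy with a plain list of weights. The names `perturb`, `linear_policy`, and `rollout_actions` are hypothetical and only illustrate sampling the noise once per episode.

```python
import random

def perturb(theta, sigma, rng):
    """Return theta + eps with eps ~ N(0, sigma^2 I); theta itself is left untouched."""
    return [w + rng.gauss(0.0, sigma) for w in theta]

def linear_policy(theta, state):
    """Hypothetical deterministic policy: action = dot(theta, state)."""
    return sum(w * s for w, s in zip(theta, state))

def rollout_actions(theta, sigma, states, rng):
    """Sample the noise once, then act with the same perturbed weights at every step."""
    theta_tilde = perturb(theta, sigma, rng)
    return [linear_policy(theta_tilde, s) for s in states]

rng = random.Random(0)
theta = [0.5, -0.2]
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # the first two states are identical
actions = rollout_actions(theta, sigma=0.1, states=states, rng=rng)
```

Because the noise is held fixed for the rollout, the perturbed policy here maps identical states to identical actions within an episode, which per-step action noise would not do.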
Off-policy
Gather experience with the perturbed parameters θ̃ = θ + N(0, σ²I), and update
the unperturbed network parameters θ.
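The off-policy recipe above can be sketched with tabular Q-learning standing in for a deep network. Everything here is an assumption made for illustration: the 2-state dynamics in `toy_step`, the hyperparameters, and the use of a Q-table as the "parameters" that receive noise.

```python
import random

def perturbed_copy(theta, sigma, rng):
    """Return a copy of the Q-table with N(0, sigma^2) noise added to every entry."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in theta]

def greedy(theta, state):
    """Greedy action under the given parameters."""
    row = theta[state]
    return max(range(len(row)), key=lambda a: row[a])

def toy_step(state, action):
    """Hypothetical 2-state dynamics: the action picks the next state;
    only taking action 1 while in state 1 is rewarded."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return action, reward

def train(episodes=20, horizon=5, sigma=0.5, lr=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # unperturbed parameters (tabular Q)
    buffer = []                       # replay buffer of (s, a, r, s') tuples
    for _ in range(episodes):
        theta_tilde = perturbed_copy(theta, sigma, rng)  # sampled once per episode
        state = 0
        for _ in range(horizon):
            action = greedy(theta_tilde, state)          # act with perturbed parameters
            next_state, reward = toy_step(state, action)
            buffer.append((state, action, reward, next_state))
            state = next_state
        # Off-policy update: replayed transitions update the unperturbed theta.
        for s, a, r, s_next in rng.sample(buffer, min(len(buffer), 8)):
            target = r + gamma * max(theta[s_next])
            theta[s][a] += lr * (target - theta[s][a])
    return theta, buffer

theta, buffer = train()
```

The point of the sketch is the split of roles: `theta_tilde` only selects actions, while the Q-learning update reads and writes `theta`.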
On-policy
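Given policy π(a|s; θ̃) with θ̃ ∼ N(φ, Σ), the policy gradient is
$$\nabla_{\phi,\Sigma} \mathbb{E}_{\tau}[R(\tau)] \approx \frac{1}{N} \sum_{\epsilon^i, \tau^i} \left[ \sum_{t=0}^{T-1} \nabla_{\phi,\Sigma} \log \pi(a_t \mid s_t; \phi + \epsilon^i \Sigma^{1/2}) \, R_t(\tau^i) \right]$$
with ε^i ∼ N(0, I) and τ^i the rollout collected under the parameters φ + ε^i Σ^{1/2}.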
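A short sketch of where the sampled estimator above comes from, assuming the perturbed parameters are written with the reparameterization trick as φ + ε Σ^{1/2} with ε ∼ N(0, I). The intermediate steps are a reconstruction for illustration, not taken from the slides.

```latex
% Reparameterize the parameter noise so that the randomness no longer depends on (phi, Sigma):
%   perturbed parameters = \phi + \epsilon \Sigma^{1/2}, \quad \epsilon \sim \mathcal{N}(0, I)
\begin{align*}
\nabla_{\phi,\Sigma}\, \mathbb{E}_{\tau}[R(\tau)]
  &= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\,
     \nabla_{\phi,\Sigma}\, \mathbb{E}_{\tau \sim \pi(\cdot\,;\, \phi + \epsilon \Sigma^{1/2})}[R(\tau)] \\
  &= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, \mathbb{E}_{\tau}
     \left[ \sum_{t=0}^{T-1} \nabla_{\phi,\Sigma}
       \log \pi(a_t \mid s_t;\, \phi + \epsilon \Sigma^{1/2})\, R_t(\tau) \right] \\
  &\approx \frac{1}{N} \sum_{\epsilon^i, \tau^i}
     \left[ \sum_{t=0}^{T-1} \nabla_{\phi,\Sigma}
       \log \pi(a_t \mid s_t;\, \phi + \epsilon^i \Sigma^{1/2})\, R_t(\tau^i) \right]
\end{align*}
```

The first line moves the gradient inside the expectation over ε, the second applies the usual likelihood-ratio policy gradient for each fixed ε, and the third replaces both expectations with N sampled pairs (ε^i, τ^i).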
Experiments
Chain Environment
A simple environment in which directed exploration is required
to perform well
Start at s2; small reward at s1, large reward at sN
Easy to fall into the local optimum of staying at s1
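A minimal sketch of such a chain, to make the local optimum concrete. The class `ChainEnv`, the reward magnitudes, the horizon, and the start state used below are assumptions chosen for illustration; the slide only fixes that rewards sit at the two ends.

```python
class ChainEnv:
    """Hypothetical chain of states s1..sN with actions 0 (left) and 1 (right)."""

    def __init__(self, n_states, start, small_reward=0.001, large_reward=1.0):
        self.n_states = n_states
        self.start = start
        self.small_reward = small_reward  # assumed value, paid on every step that ends in s1
        self.large_reward = large_reward  # assumed value, paid on every step that ends in sN
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp so the agent cannot leave the chain at either end.
        self.state = min(max(self.state + move, 1), self.n_states)
        if self.state == 1:
            return self.state, self.small_reward
        if self.state == self.n_states:
            return self.state, self.large_reward
        return self.state, 0.0

def episode_return(env, action, horizon):
    """Total reward of a policy that repeats one action for `horizon` steps."""
    env.reset()
    return sum(env.step(action)[1] for _ in range(horizon))

env = ChainEnv(n_states=5, start=2)  # start state is an assumption of this sketch
always_left = episode_return(env, action=0, horizon=10)
always_right = episode_return(env, action=1, horizon=10)
```

Moving left is rewarded immediately, while moving right pays nothing until the far end is reached, so an agent that explores with uncorrelated per-step noise rarely strings together enough right moves on a long chain.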
Experiments
Chain Environment
Lower is better.
Parameter space noise outperforms both ε-greedy and
bootstrapped DQN.
Experiments
Atari
Parameter space noise outperforms ε-greedy in games that
require exploration
Experiments
Continuous Control with DDPG
Parameter space noise outperforms action space noise in
HalfCheetah (the other networks get stuck in a local optimum)
Not much difference in other environments. This is because
the rewards are well-shaped, so exploration isn’t really crucial
here.
Experiments
Continuous Control with DDPG
Harder environments with sparse rewards
Two environments in which only parameter noise gets a
non-zero reward
Experiments
Continuous Control with TRPO
Parameter space noise is slightly better in HalfCheetah, and
significantly better in Walker2D.
The wrong variance setting seems to disable learning, and
each environment has a different optimal variance.
Experiments
Continuous Control with TRPO
Parameter space noise works well in sparse reward
environments.
Summary
Parameter space noise is a simple method that allows directed
exploration.
Applicable to both on-policy and off-policy methods
Orthogonal to advances such as Double DQN, Dueling
Networks or TRPO.
Discussion
No comparison with sophisticated exploration methods
If this works, why did no one try using dropout in policy
networks/DQN?
What does this imply about the parameter space of a neural
network?
Is there a connection between this and recent results linking
parameter noise to variational inference?
Thank You
