Parameter Space Noise for Exploration
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
November 02, 2017
Exploration-Exploitation Tradeoff
Exploration and exploitation must be carefully balanced for
optimal performance
Exploration in RL
Exploration in multi-armed bandits is simply choosing a suboptimal
arm. How do we explore in RL environments?
Naive approaches:
ε-greedy actions in DQN
Entropy loss in policy gradient methods
More sophisticated approaches:
Density Modelling
Dynamics Modelling
Self-supervised curiosity
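For reference, the two naive approaches can be sketched in a few lines. This is a minimal illustration, not code from the slides or the paper: the function names and the entropy coefficient `beta` are hypothetical. In both cases the randomness is injected independently at every step, which is why these methods give undirected exploration.

```python
import numpy as np


def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def entropy(probs: np.ndarray) -> float:
    """Entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s) of a categorical policy."""
    p = probs[probs > 0]  # drop zero-probability actions, since 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))


def pg_loss_with_entropy(log_prob_taken: float, advantage: float,
                         probs: np.ndarray, beta: float) -> float:
    """Loss to minimise: -(log pi(a|s) * A) - beta * H(pi(.|s)).

    The entropy term (weighted by the hypothetical coefficient beta) rewards
    the policy for staying stochastic, which keeps it exploring.
    """
    return -log_prob_taken * advantage - beta * entropy(probs)
```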
Parameter Space Noise for Exploration
Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon
Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel,
Marcin Andrychowicz
Proposed Method
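$\tilde{\theta} = \theta + \mathcal{N}(0, \sigma^2 I)$
We perturb the policy parameters at the beginning of each episode and
keep them fixed for the entire rollout
Off-policy
Gather experience with the perturbed parameters $\tilde{\theta} = \theta + \mathcal{N}(0, \sigma^2 I)$,
but apply the network updates to the unperturbed parameters $\theta$.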
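A minimal sketch of the off-policy scheme, assuming a hypothetical linear Q-function and hypothetical `env_reset`/`env_step` callables (none of these names come from the slides). One noise sample is drawn per episode and the perturbed parameters act greedily for the whole rollout; the collected transitions would then go into a replay buffer used to update the original, unperturbed parameters.

```python
import numpy as np


def perturb(params: dict, sigma: float, rng: np.random.Generator) -> dict:
    """Return theta_tilde = theta + N(0, sigma^2 I): one independent draw per weight.

    The original arrays are not modified, so theta stays available for training.
    """
    return {name: w + sigma * rng.standard_normal(w.shape) for name, w in params.items()}


def q_values(params: dict, state: np.ndarray) -> np.ndarray:
    # Hypothetical linear Q-network: Q(s, .) = W s + b
    return params["W"] @ state + params["b"]


def collect_episode(params, env_reset, env_step, sigma, rng, max_steps=100):
    """Roll out one episode with a single parameter perturbation held fixed throughout."""
    noisy = perturb(params, sigma, rng)  # sampled once, at the start of the episode
    transitions = []
    state = env_reset()
    for _ in range(max_steps):
        action = int(np.argmax(q_values(noisy, state)))  # greedy w.r.t. the perturbed net
        next_state, reward, done = env_step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    # These transitions would be stored in a replay buffer; gradient updates
    # are then applied to `params` (theta), not to `noisy` (theta_tilde).
    return transitions
```

Because the noise is held fixed, the same state always maps to the same action within an episode, which is what makes the exploration consistent rather than step-wise random.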
On-policy
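Given policy $\pi_\theta(a \mid s)$ with $\theta \sim \mathcal{N}(\phi, \Sigma)$, write $\theta = \phi + \epsilon \Sigma^{1/2}$ with $\epsilon \sim \mathcal{N}(0, I)$. The policy gradient is then estimated as
$$\nabla_{\phi,\Sigma} \mathbb{E}_\tau[R(\tau)] \approx \frac{1}{N} \sum_{\epsilon^i, \tau^i} \left[ \sum_{t=0}^{T-1} \nabla_{\phi,\Sigma} \log \pi\left(a_t \mid s_t; \phi + \epsilon^i \Sigma^{1/2}\right) R_t(\tau^i) \right]$$
where $N$ is the number of sampled $(\epsilon^i, \tau^i)$ pairs and $R_t(\tau^i)$ is the return from time step $t$ onward.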
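The estimator can be made concrete for a toy case. This sketch assumes a hypothetical linear-Gaussian policy $a \sim \mathcal{N}(\theta^\top s, 1)$ with a scalar action and a diagonal covariance $\Sigma^{1/2} = \mathrm{diag}(\sigma)$, so the gradient is taken with respect to the standard deviations `sigma` rather than $\Sigma$ itself. Since $\theta = \phi + \sigma \odot \epsilon$, the chain rule gives $\nabla_\phi = \nabla_\theta$ and $\nabla_\sigma = \epsilon \odot \nabla_\theta$.

```python
import numpy as np


def sample_theta(phi: np.ndarray, sigma: np.ndarray, rng: np.random.Generator):
    """Draw eps ~ N(0, I) and return (eps, theta) with theta = phi + sigma * eps."""
    eps = rng.standard_normal(phi.shape)
    return eps, phi + sigma * eps


def grad_estimate(phi: np.ndarray, sigma: np.ndarray, episodes):
    """Monte Carlo estimate of grad_{phi, sigma} E[R] for a ~ N(theta . s, 1).

    episodes: list of (eps, states [T x d], actions [T], returns_to_go [T]),
    where eps is the noise that produced that episode's theta.
    """
    g_phi = np.zeros_like(phi)
    g_sigma = np.zeros_like(sigma)
    for eps, states, actions, returns in episodes:
        theta = phi + sigma * eps
        # grad_theta log N(a_t; theta . s_t, 1) = (a_t - theta . s_t) s_t, one row per step
        score = (actions - states @ theta)[:, None] * states
        g_theta = (score * returns[:, None]).sum(axis=0)  # sum_t score_t * R_t
        g_phi += g_theta          # d theta / d phi = I
        g_sigma += eps * g_theta  # d theta_j / d sigma_j = eps_j
    n = len(episodes)
    return g_phi / n, g_sigma / n
```

In practice `sample_theta` would be called once per episode before the rollout, and its `eps` stored alongside the trajectory so the gradient can be formed afterwards.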
Experiments
Chain Environment
A simple environment in which directed exploration is required
to perform well
Start at $s_1$; rewards only at $s_1$ and $s_N$
It is easy to fall into the local optimum of staying at $s_1$
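A toy version of the chain environment, as a sketch under stated assumptions: the reward values (0.001 at $s_1$, 1.0 at $s_N$), the episode horizon $N + 9$, and the class and method names are all hypothetical, chosen only to reproduce the structure described on the slide. The start state defaults to $s_1$ as the slide states.

```python
class ChainEnv:
    """Hypothetical chain of states s_1..s_n; action 0 moves left, action 1 moves right."""

    def __init__(self, n, small_reward=0.001, large_reward=1.0, start=1, horizon=None):
        self.n = n
        self.small_reward = small_reward  # assumed reward at s_1
        self.large_reward = large_reward  # assumed reward at s_n
        self.start = start
        self.horizon = horizon if horizon is not None else n + 9  # assumed episode length
        self.reset()

    def reset(self):
        self.state = self.start
        self.t = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = min(self.state + 1, self.n)
        else:
            self.state = max(self.state - 1, 1)
        self.t += 1
        if self.state == 1:
            reward = self.small_reward
        elif self.state == self.n:
            reward = self.large_reward
        else:
            reward = 0.0
        return self.state, reward, self.t >= self.horizon


def rollout(env, policy):
    """Total reward of one episode under a deterministic policy state -> action."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total
```

With `n=10`, always moving left collects only the small reward at every step, while always moving right reaches $s_N$ after 9 steps and earns far more. An agent that dithers randomly (as with per-step ε-greedy noise) rarely walks far enough right to discover this.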
Experiments
Chain Environment
Lower is better.
Parameter space noise outperforms both ε-greedy and
bootstrapped DQN.
Experiments
Atari
Parameter space noise outperforms ε-greedy in games that
require exploration
Experiments
Continuous Control with DDPG
Parameter space noise outperforms action space noise in
HalfCheetah (the other configurations converge to a local optimum)
There is little difference in the other environments, likely because
their rewards are well shaped, so exploration is less critical
there.
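The difference between the two kinds of noise for a deterministic actor can be sketched as follows. The actor $\mu(s) = \tanh(Ws)$, the use of plain Gaussian action noise, and all names are assumptions for illustration. Action space noise draws fresh noise at every step, whereas parameter space noise perturbs $W$ once per episode, so the perturbed actor stays a deterministic function of the state.

```python
import numpy as np


def policy(W: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Hypothetical deterministic actor mu(s) = tanh(W s)."""
    return np.tanh(W @ s)


def act_with_action_noise(W, s, sigma, rng):
    # Action space noise: fresh Gaussian noise added to the action at every step
    return policy(W, s) + sigma * rng.standard_normal(W.shape[0])


class ParamNoiseActor:
    """Parameter space noise: perturb W once per episode, then act deterministically."""

    def __init__(self, W, sigma, rng):
        self.W, self.sigma, self.rng = W, sigma, rng
        self.new_episode()

    def new_episode(self):
        # One perturbation, reused for every step of the episode
        self.W_tilde = self.W + self.sigma * self.rng.standard_normal(self.W.shape)

    def act(self, s):
        return policy(self.W_tilde, s)
```

Visiting the same state twice within an episode gives the same action under parameter noise but (almost surely) different actions under action noise.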
Experiments
Continuous Control with DDPG
Harder environments with sparse rewards
Two environments in which only parameter space noise gets a
non-zero reward
Experiments
Continuous Control with TRPO
Parameter space noise is slightly better in HalfCheetah and
significantly better in Walker2D.
A poorly chosen noise variance appears to prevent learning entirely, and
each environment has a different optimal variance.
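One heuristic for the variance-sensitivity problem is to adapt σ during training so that the perturbed policy stays a target distance δ from the unperturbed one. This is a sketch of that idea, not something stated on these slides: the multiplicative rule, the constant `alpha = 1.01`, the RMS action distance, and the `tanh` actor are all assumptions for illustration.

```python
import numpy as np


def adapt_sigma(sigma: float, distance: float, delta: float, alpha: float = 1.01) -> float:
    """Grow sigma while the perturbed policy stays within delta of the original, else shrink it."""
    return sigma * alpha if distance <= delta else sigma / alpha


def action_distance(W: np.ndarray, W_tilde: np.ndarray, states: np.ndarray) -> float:
    """RMS difference between unperturbed and perturbed actions of a tanh actor over a state batch."""
    diff = np.tanh(states @ W.T) - np.tanh(states @ W_tilde.T)
    return float(np.sqrt(np.mean(diff ** 2)))
```

The appeal of measuring distance in action space is that δ is expressed in units of the action, which is easier to choose than a raw parameter-space σ whose effect depends on the network architecture and scale.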
Experiments
Continuous Control with TRPO
Parameter space noise works well in sparse reward
environments.
Summary
Parameter space noise is a simple method that allows directed
exploration.
Applicable to both on-policy and off-policy methods
Orthogonal to advances such as Double DQN, Dueling
Networks or TRPO.
Discussion
No comparison with more sophisticated exploration methods (density
modelling, dynamics modelling, self-supervised curiosity)
If this works, why has no one tried using dropout in policy
networks or DQN?
What does this imply about the parameter space of a neural
network?
Is there a connection between this and recent results linking
parameter noise to variational inference?
Thank You