Parameter Space Noise for Exploration

Parameter space noise is a simple exploration method for reinforcement learning: noise is added to the policy parameters at the start of each episode and held fixed for the whole rollout. In environments that require directed exploration, such as the chain environment, it outperforms ε-greedy and bootstrapped DQN. With DDPG on continuous control it outperforms action space noise in HalfCheetah, and in two sparse-reward environments it is the only method that obtains a non-zero reward. The method applies to both on-policy and off-policy algorithms and is orthogonal to other advances in deep reinforcement learning.

Parameter Space Noise for Exploration
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
November 02, 2017

Exploration-Exploitation Tradeoff
Exploration and exploitation must be carefully balanced for
optimal performance

Exploration in RL
Exploration in multi-armed bandits is simply choosing a suboptimal
arm. How do we explore in RL environments?
Naive approaches:
ε-greedy actions in DQN
Entropy loss in policy gradient methods
More sophisticated approaches:
Density modelling
Dynamics modelling
Self-supervised curiosity
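
The two naive approaches can be sketched in a few lines. This is only an illustrative sketch: the function names `epsilon_greedy` and `entropy`, the toy Q-values, and the example distributions are assumptions made here, not taken from the slides.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

rng = random.Random(0)
q = [0.1, 0.5, 0.2]                                   # toy Q-values, assumed for illustration
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
uniform_entropy = entropy([0.25] * 4)                 # maximal entropy for 4 actions
peaked_entropy = entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic policy
```

An entropy term added to a policy gradient objective rewards distributions like the uniform one over the peaked one. Both approaches inject randomness independently at every step, which is what parameter space noise avoids.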

Parameter Space Noise for Exploration
Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon
Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel,
Marcin Andrychowicz

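Proposed Method
$\tilde{\theta} = \theta + \mathcal{N}(0, \sigma^2 I)$
We perturb the policy parameters at the beginning of each episode and
keep them fixed for the entire rollout.
Off-policy
Gather experience with $\tilde{\theta} = \theta + \mathcal{N}(0, \sigma^2 I)$, and update the network
with $\theta$.
On-policy
Given policy $\pi_{\tilde{\theta}}(a|s)$ with $\tilde{\theta} \sim \mathcal{N}(\phi, \Sigma)$, the policy gradient is
$$\nabla_{\phi,\Sigma} \mathbb{E}_\tau[R(\tau)] \approx \frac{1}{N} \sum_{\epsilon^i, \tau^i} \left[ \sum_{t=0}^{T-1} \nabla_{\phi,\Sigma} \log \pi(a_t \mid s_t; \phi + \epsilon^i \Sigma^{1/2}) R_t(\tau^i) \right]$$
where $\epsilon^i \sim \mathcal{N}(0, I)$.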
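
A minimal sketch of the off-policy data collection loop, assuming a toy linear policy. The names `perturb`, `act`, `run_episode`, the toy states, and `sigma=0.1` are all hypothetical choices for illustration; the actual learner update (DQN, DDPG, ...) is left as a comment.

```python
import random

def perturb(theta, sigma, rng):
    """Return a perturbed copy theta + N(0, sigma^2 I); theta itself is not modified."""
    return [w + rng.gauss(0.0, sigma) for w in theta]

def act(params, state):
    """Hypothetical deterministic linear policy: action = params . state."""
    return sum(w * s for w, s in zip(params, state))

def run_episode(theta, sigma, states, rng):
    """Sample the noise once, then act with the same perturbed parameters all episode."""
    theta_tilde = perturb(theta, sigma, rng)
    return [(s, act(theta_tilde, s)) for s in states]

rng = random.Random(0)
theta = [0.5, -0.2]                                   # unperturbed parameters
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # toy states standing in for a rollout
replay_buffer = []
for episode in range(3):
    transitions = run_episode(theta, sigma=0.1, states=states, rng=rng)
    replay_buffer.extend(transitions)                 # experience comes from the perturbed policy
    # An off-policy learner would now update the unperturbed theta from replay_buffer.
```

The perturbed copy is only used to gather experience; the parameters being trained stay noise-free.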

Experiments
Chain Environment
A simple environment in which directed exploration is required
to perform well
Start at s1, rewards only at s1 and sN
Easy to fall into the local optimum of staying at s1
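
To make the local optimum concrete, here is a toy version of such a chain. The slide gives no numbers, so the reward sizes (0.001 and 1.0), the chain length, the horizon of 19 steps, and the class and function names are all assumptions for illustration; the start state follows the slide and is left configurable.

```python
class ChainEnv:
    """Toy chain s1..sN; every numeric default here is an illustrative assumption."""

    def __init__(self, n, small_reward=0.001, large_reward=1.0, start=1):
        self.n = n
        self.small_reward = small_reward
        self.large_reward = large_reward
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        """Action 0 moves left, action 1 moves right; the ends of the chain are walls."""
        move = 1 if action == 1 else -1
        self.state = min(self.n, max(1, self.state + move))
        if self.state == 1:
            reward = self.small_reward
        elif self.state == self.n:
            reward = self.large_reward
        else:
            reward = 0.0
        return self.state, reward

def rollout(env, policy, horizon):
    """Total reward collected by policy(state) -> action over a fixed horizon."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state, reward = env.step(policy(state))
        total += reward
    return total

env = ChainEnv(n=10)
stay_left = rollout(env, lambda s: 0, horizon=19)      # the local optimum: hover at s1
always_right = rollout(env, lambda s: 1, horizon=19)   # commit to walking to sN
```

Under these assumed numbers, staying at s1 pays a small reward immediately, while the much larger return requires many consecutive steps to the right with no reward along the way, which undirected per-step noise rarely produces.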

Experiments
Chain Environment
Lower is better.
Parameter space noise outperforms both ε-greedy and
bootstrapped DQN.

Experiments
Atari
Parameter space noise outperforms ε-greedy in games that
require exploration

Experiments
Continuous Control with DDPG
Parameter space noise outperforms action space noise in
HalfCheetah (the other networks fall into a local optimum).
Not much difference in the other environments, because
their rewards are well-shaped, so exploration is not crucial
there.
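
The difference between the two noise types can be seen on a single repeated state. This is a sketch under assumed values: the linear policy, `theta`, `state`, and `sigma` are hypothetical and only stand in for a DDPG actor.

```python
import random

def linear_policy(params, state):
    """Hypothetical deterministic policy standing in for a DDPG actor."""
    return sum(w * s for w, s in zip(params, state))

rng = random.Random(1)
theta = [0.5, -0.2]
state = [1.0, 2.0]
sigma = 0.1

# Action space noise: fresh Gaussian noise is added to the action at every step.
action_noise_actions = [linear_policy(theta, state) + rng.gauss(0.0, sigma)
                        for _ in range(5)]

# Parameter space noise: one perturbation, reused for every step of the episode.
theta_tilde = [w + rng.gauss(0.0, sigma) for w in theta]
param_noise_actions = [linear_policy(theta_tilde, state) for _ in range(5)]
```

With action noise the agent jitters around the same action each time it visits the state; with parameter noise it behaves consistently within the episode, and the deviation from the unperturbed policy depends on the state.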

Experiments
Continuous Control with DDPG
Harder environments with sparse rewards
Two environments in which only parameter noise gets a
non-zero reward

Experiments
Continuous Control with TRPO
Parameter space noise is slightly better in HalfCheetah, and
significantly better in Walker2D.
A poorly chosen noise variance seems to prevent learning, and
each environment has a different optimal variance.

Experiments
Continuous Control with TRPO
Parameter space noise works well in sparse reward
environments.

Summary
Parameter space noise is a simple method that allows directed
exploration.
Applicable to both on-policy and off-policy methods
Orthogonal to advances such as Double DQN, Dueling
Networks or TRPO.

Discussion
No comparison with sophisticated exploration methods
If this works, why did no one try using dropout in policy
networks/DQN?
What does this imply about the parameter space of a neural
network?
Is there a connection between this and recent results linking
parameter noise to variational inference?
