Parameter Space Noise for Exploration
1. Parameter Space Noise for Exploration
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
November 02, 2017
5. Exploration in RL
Exploration in multi-armed bandits is simply choosing a suboptimal
arm. How do we explore in RL environments?
Naive approaches:
ε-greedy actions in DQN (a minimal sketch follows after this list)
Entropy loss in policy gradient methods
More sophisticated approaches:
Density Modelling
Dynamics Modelling
Self-supervised curiosity
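For concreteness, here is a minimal sketch of the ε-greedy rule from the list above; the Q-values and the ε value are illustrative stand-ins, not anything from the paper:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon, take a uniformly random action (explore);
    # otherwise take the greedy highest-Q action (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: Q-estimates for 4 actions, 10% exploration
action = epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.1)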
6. Parameter Space Noise for Exploration
Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon
Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel,
Marcin Andrychowicz
9. Proposed Method
θ̃ = θ + N(0, σ²I)
We perturb the policy parameters at the beginning of each episode and
keep the perturbation fixed for the entire rollout.
Off-policy
Gather experience with the perturbed parameters θ̃ = θ + N(0, σ²I), and
update the network with the unperturbed θ (a minimal sketch follows).
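A minimal numpy sketch of this episode-level perturbation, assuming a linear policy and a dummy environment loop (theta, sigma, and the shapes are illustrative placeholders):

import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=(8, 4))  # unperturbed policy weights (illustrative shape)
sigma = 0.1                      # noise scale sigma

def act(params, obs):
    # Greedy action under a linear score function (stand-in for a policy net).
    return int(np.argmax(obs @ params))

for episode in range(3):
    # Perturb once at the start of the episode: theta_tilde = theta + N(0, sigma^2 I)
    theta_tilde = theta + sigma * rng.normal(size=theta.shape)
    obs = rng.normal(size=8)            # stand-in for env.reset()
    for t in range(5):                  # stand-in rollout
        action = act(theta_tilde, obs)  # the perturbation stays fixed all episode
        obs = rng.normal(size=8)        # stand-in for env.step(action)
    # Off-policy: the collected transitions go to a replay buffer, and
    # gradient updates are applied to the unperturbed theta, not theta_tilde.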
On-policy
Given a policy πθ(a|s) with θ ∼ N(φ, Σ), the policy gradient is

∇_{φ,Σ} E_τ[R(τ)] ≈ (1/N) Σ_i Σ_{t=0}^{T−1} ∇_{φ,Σ} log π(a_t | s_t; φ + ε_i Σ^{1/2}) R_t(τ_i)

where ε_i ∼ N(0, I) and R_t(τ_i) is the return of rollout τ_i from step t
(a sketch follows).
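A sketch of this estimator for a linear-softmax policy with diagonal Σ = σ²I, showing only the gradient with respect to φ (the Σ gradient follows from the same reparametrization); the states and returns here are dummy stand-ins:

import numpy as np

rng = np.random.default_rng(1)
n_obs, n_act, sigma = 4, 3, 0.1
phi = np.zeros((n_obs, n_act))  # mean parameters phi

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # For pi(a|s) proportional to exp(s @ theta[:, a]):
    # d log pi / d theta = outer(s, onehot(a) - pi(.|s))
    p = softmax(s @ theta)
    g = -np.outer(s, p)
    g[:, a] += s
    return g

grad_phi = np.zeros_like(phi)
N, T = 8, 5
for i in range(N):
    theta_i = phi + sigma * rng.normal(size=phi.shape)  # theta_i ~ N(phi, sigma^2 I)
    for t in range(T):
        s = rng.normal(size=n_obs)  # stand-in for the environment state
        a = int(rng.choice(n_act, p=softmax(s @ theta_i)))
        R_t = rng.normal()          # stand-in for the return-to-go R_t(tau_i)
        # grad wrt phi of log pi(a|s; phi + eps_i Sigma^{1/2}) = grad at theta_i
        grad_phi += grad_log_pi(theta_i, s, a) * R_t
grad_phi /= N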
10. Experiments
Chain Environment
A simple environment in which directed exploration is required
to perform well
The agent starts at s1; rewards are given only at s1 and sN.
It is easy to fall into the local optimum of staying at s1 (see the sketch below).
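A minimal sketch of such a chain MDP; the state count, reward values, and horizon here are illustrative, not the paper's exact settings:

import numpy as np

class Chain:
    # States s1..sN on a line; action 0 moves left, action 1 moves right.
    # A small reward at s1 and a large reward at sN, so a greedy agent
    # can get stuck collecting the small reward at s1.
    def __init__(self, n=10, horizon=20):
        self.n, self.horizon = n, horizon

    def reset(self):
        self.state, self.t = 1, 0  # start at s1
        return self.state

    def step(self, action):
        self.state = max(1, self.state - 1) if action == 0 else min(self.n, self.state + 1)
        self.t += 1
        reward = 0.01 if self.state == 1 else (1.0 if self.state == self.n else 0.0)
        return self.state, reward, self.t >= self.horizon

env = Chain(n=10)
state = env.reset()
state, reward, done = env.step(1)  # move right, toward the large reward at sN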
13. Experiments
Continuous Control with DDPG
Parameter space noise outperforms action space noise in
HalfCheetah (the other networks fall into a local minimum).
There is not much difference in the other environments: their
rewards are well-shaped, so exploration is not really crucial there.
14. Experiments
Continuous Control with DDPG
Harder environments with sparse rewards
Two environments in which only parameter space noise achieves a
non-zero reward
15. Experiments
Continuous Control with TRPO
Parameter space noise is slightly better in HalfCheetah, and
significantly better in Walker2D.
A wrong variance setting seems to prevent learning entirely, and
each environment has a different optimal variance.
17. Summary
Parameter space noise is a simple method that allows directed
exploration.
Applicable to both on-policy and off-policy methods
Orthogonal to advances such as Double DQN, Dueling
Networks or TRPO.
18. Discussion
No comparison with sophisticated exploration methods
If this works, why has no one tried using dropout in policy
networks/DQN?
What does this imply about the parameter space of a neural
network?
Is there a connection between this and recent results linking
parameter noise to variational inference?