Parameter Space Noise for Exploration
1. Parameter Space Noise for Exploration
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
November 02, 2017
5. Exploration in RL
Exploration in multi-armed bandits is simply choosing a suboptimal
arm. How do we explore in RL environments?
Naive approaches:
ε-greedy actions in DQN (a minimal sketch follows after this list)
Entropy loss in policy gradient methods
More sophisticated approaches:
Density Modelling
Dynamics Modelling
Self-supervised curiosity
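For concreteness, here is a minimal sketch of the ε-greedy rule from the list above; the Q-values and the ε value are illustrative stand-ins, not anything from the paper:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon, take a uniformly random action (explore);
    # otherwise take the greedy highest-Q action (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: Q-estimates for 4 actions, 10% exploration
action = epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.1)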
6. Parameter Space Noise for Exploration
Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon
Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel,
Marcin Andrychowicz
9. Proposed Method
θ̃ = θ + N(0, σ²I)
We perturb the policy parameters at the beginning of each episode and
keep the perturbation fixed for the entire rollout.
Off-policy
Gather experience with the perturbed parameters θ̃ = θ + N(0, σ²I), and
update the network with the unperturbed θ (a minimal sketch follows).
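A minimal numpy sketch of this episode-level perturbation, assuming a linear policy and a dummy environment loop (theta, sigma, and the shapes are illustrative placeholders):

import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=(8, 4))  # unperturbed policy weights (illustrative shape)
sigma = 0.1                      # noise scale sigma

def act(params, obs):
    # Greedy action under a linear score function (stand-in for a policy net).
    return int(np.argmax(obs @ params))

for episode in range(3):
    # Perturb once at the start of the episode: theta_tilde = theta + N(0, sigma^2 I)
    theta_tilde = theta + sigma * rng.normal(size=theta.shape)
    obs = rng.normal(size=8)            # stand-in for env.reset()
    for t in range(5):                  # stand-in rollout
        action = act(theta_tilde, obs)  # the perturbation stays fixed all episode
        obs = rng.normal(size=8)        # stand-in for env.step(action)
    # Off-policy: the collected transitions go to a replay buffer, and
    # gradient updates are applied to the unperturbed theta, not theta_tilde.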
On-policy
Given a policy πθ(a|s) with θ ∼ N(φ, Σ), the policy gradient is

∇_{φ,Σ} E_τ[R(τ)] ≈ (1/N) Σ_i Σ_{t=0}^{T−1} ∇_{φ,Σ} log π(a_t | s_t; φ + ε_i Σ^{1/2}) R_t(τ_i)

where ε_i ∼ N(0, I) and R_t(τ_i) is the return of rollout τ_i from step t
(a sketch follows).
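A sketch of this estimator for a linear-softmax policy with diagonal Σ = σ²I, showing only the gradient with respect to φ (the Σ gradient follows from the same reparametrization); the states and returns here are dummy stand-ins:

import numpy as np

rng = np.random.default_rng(1)
n_obs, n_act, sigma = 4, 3, 0.1
phi = np.zeros((n_obs, n_act))  # mean parameters phi

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # For pi(a|s) proportional to exp(s @ theta[:, a]):
    # d log pi / d theta = outer(s, onehot(a) - pi(.|s))
    p = softmax(s @ theta)
    g = -np.outer(s, p)
    g[:, a] += s
    return g

grad_phi = np.zeros_like(phi)
N, T = 8, 5
for i in range(N):
    theta_i = phi + sigma * rng.normal(size=phi.shape)  # theta_i ~ N(phi, sigma^2 I)
    for t in range(T):
        s = rng.normal(size=n_obs)  # stand-in for the environment state
        a = int(rng.choice(n_act, p=softmax(s @ theta_i)))
        R_t = rng.normal()          # stand-in for the return-to-go R_t(tau_i)
        # grad wrt phi of log pi(a|s; phi + eps_i Sigma^{1/2}) = grad at theta_i
        grad_phi += grad_log_pi(theta_i, s, a) * R_t
grad_phi /= N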
10. Experiments
Chain Environment
A simple environment in which directed exploration is required
to perform well
The agent starts at s1; rewards are given only at s1 and sN.
It is easy to fall into the local optimum of staying at s1 (see the sketch below).
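A minimal sketch of such a chain MDP; the state count, reward values, and horizon here are illustrative, not the paper's exact settings:

import numpy as np

class Chain:
    # States s1..sN on a line; action 0 moves left, action 1 moves right.
    # A small reward at s1 and a large reward at sN, so a greedy agent
    # can get stuck collecting the small reward at s1.
    def __init__(self, n=10, horizon=20):
        self.n, self.horizon = n, horizon

    def reset(self):
        self.state, self.t = 1, 0  # start at s1
        return self.state

    def step(self, action):
        self.state = max(1, self.state - 1) if action == 0 else min(self.n, self.state + 1)
        self.t += 1
        reward = 0.01 if self.state == 1 else (1.0 if self.state == self.n else 0.0)
        return self.state, reward, self.t >= self.horizon

env = Chain(n=10)
state = env.reset()
state, reward, done = env.step(1)  # move right, toward the large reward at sN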
13. Experiments
Continuous Control with DDPG
Parameter space noise outperforms action space noise in
HalfCheetah (the other networks fall into a local minimum).
There is not much difference in the other environments: their
rewards are well-shaped, so exploration is not really crucial there.
14. Experiments
Continuous Control with DDPG
Harder environments with sparse rewards
Two environments in which only parameter space noise achieves a
non-zero reward
15. Experiments
Continuous Control with TRPO
Parameter space noise is slightly better in HalfCheetah, and
significantly better in Walker2D.
A wrong variance setting seems to prevent learning entirely, and
each environment has a different optimal variance.
17. Summary
Parameter space noise is a simple method that allows directed
exploration.
Applicable to both on-policy and off-policy methods
Orthogonal to advances such as Double DQN, Dueling
Networks or TRPO.
18. Discussion
No comparison with sophisticated exploration methods
If this works, why has no one tried using dropout in policy
networks/DQN?
What does this imply about the parameter space of a neural
network?
Is there a connection between this and recent results linking
parameter noise to variational inference?