Machine Learning LABoratory
Seungjoon Lee. 2023-09-29. sjlee1218@postech.ac.kr
NovelD: A Simple yet Effective
Exploration Criterion
Neurips 2021. Paper Summary
1
Paper in One Figure
2
Caution!!!
• This is material I prepared to summarize a paper for my personal research meeting.
• Some of the contents may be incorrect!
• Some contributions and experiments are excluded intentionally because they are not directly related to my research interest.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr (for corrections or additions to the material, ideas to develop this paper, or others).
3
Contents
• Introduction
• Methods
• Experiments
• Conclusion
4
Intro
5
Situations
• RL cannot explore well in sparse-reward environments.
• Novelty-based RL exploration methods incentivize exploration using novelty
as intrinsic rewards.
6
Complications
• If a novelty-based RL agent meets an unknown region, it explores the region thoroughly until the region's novelty becomes low (DFS manner).
• It focuses on a tree rather than the forest, slowing down RL exploration.
• If the state space is large, the novelty-based RL agent forgets already-explored regions and goes back to them.
• It gets trapped in some regions, so its state visit counts become imbalanced.
7
Question & Hypothesis
• Question:
• Can we design a novelty-based intrinsic reward method that makes the state visit counts uniform and explores more broadly?
• Hypothesis:
• Intrinsic reward (IR) based on the novelty difference can make the visit counts uniform and push the boundary of the known regions consistently.
• The IR based on the novelty difference is relatively robust to the forgetting of the neural network (NN).
8
Contributions
• The authors show that novelty-based methods explore in a DFS-like manner and get stuck in some large state spaces.
• Intrinsic rewards based on the novelty difference accelerate RL exploration:
• by pushing the boundaries of known regions consistently in a BFS-like manner,
• by making state visit counts uniform,
• by being tolerant to the forgetting of the agent.
9
Methods
10
Problem Formulation
• Episodic MDP with finite horizon: $(S, A, P, R, \gamma, T)$
• $S$: observation space
• $T$: finite horizon
• A policy $\pi(a \mid o)$ that maximizes $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$ is considered, where $r_t = r^e_t + \alpha r^i_t$.
11
Methods Outline
Desires
• The new intrinsic reward should:
• force an agent to push the boundary/frontier of the known regions.
• force an agent to make state visit counts uniform.
12
Methods Outline
• Intrinsic reward calculation + Novelty estimation + RL agent
• Intrinsic reward calculation: novelty difference
• Novelty estimation: Random Network Distillation (RND)
• RL: PPO
13
Methods - Intrinsic Reward Calculation
• Intrinsic rewards (IR) are calculated from the novelty difference (NovelD):
• $r^i_t(s_t, a_t, s_{t+1}) = \max\left[\mathrm{novelty}(s_{t+1}) - \alpha \cdot \mathrm{novelty}(s_t),\ 0\right] \cdot \mathbb{1}\left[N_e(s_{t+1}) = 1\right]$
• $\mathrm{novelty}(\cdot)$ could be any novelty measure for a state.
• $N_e(s)$ is the state visit count within one episode.
• So, NovelD gives IR only when $s_{t+1}$ is visited for the first time in this episode (a minimal code sketch follows below).
14
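A minimal sketch of this reward rule, written by me for illustration rather than taken from the authors' code (the value of alpha is a placeholder):

```python
def noveld_intrinsic_reward(novelty_s, novelty_s_next, episodic_count_s_next, alpha=0.5):
    """NovelD intrinsic reward for one transition (s_t, a_t, s_{t+1}).

    novelty_s, novelty_s_next: scalar novelty estimates for s_t and s_{t+1} (e.g. from RND).
    episodic_count_s_next: number of visits to s_{t+1} within the current episode.
    alpha: scaling on novelty(s_t); the value here is a placeholder, not the paper's setting.
    """
    clipped_difference = max(novelty_s_next - alpha * novelty_s, 0.0)
    first_visit = 1.0 if episodic_count_s_next == 1 else 0.0  # episodic restriction 1[N_e(s_{t+1}) = 1]
    return clipped_difference * first_visit

# A first visit to a much more novel state gives positive IR; revisiting it in the same episode gives zero.
r_new = noveld_intrinsic_reward(0.1, 0.9, episodic_count_s_next=1)      # > 0
r_revisit = noveld_intrinsic_reward(0.1, 0.9, episodic_count_s_next=2)  # == 0
```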
Methods - Novelty Estimation
• Novelty of $s$ is estimated by RND, which assigns high novelty to unfamiliar states.
• $\mathrm{novelty}(s) = \lVert f_{\mathrm{fixed}}(s) - f_\psi(s) \rVert^2$
• Target function $f_{\mathrm{fixed}} : S \to \mathbb{R}^k$ (randomly initialized and kept fixed).
• Predictor function $f_\psi : S \to \mathbb{R}^k$ (trained to match the target on visited states); a compact sketch follows below.
15
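A compact RND sketch under these definitions (the MLP form and network sizes are my own assumptions; the paper uses its own architectures):

```python
import torch
import torch.nn as nn

class RNDNovelty(nn.Module):
    """novelty(s) = || f_fixed(s) - f_psi(s) ||^2 with a frozen random target and a trained predictor."""

    def __init__(self, obs_dim, embed_dim=128):
        super().__init__()
        make_mlp = lambda: nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.target = make_mlp()      # f_fixed: randomly initialized, never trained
        self.predictor = make_mlp()   # f_psi: trained to match the target on visited states
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, obs):           # obs: (batch, obs_dim)
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return ((pred_feat - target_feat) ** 2).sum(dim=-1)  # per-state novelty

# Training step for the predictor (the novelty itself is the regression loss):
# loss = rnd(obs_batch).mean(); loss.backward(); optimizer.step()
```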
Methods - RL Agent
Training of RL agent
• RL agent: PPO
• Value loss: $L(\phi) = \sum_t \left[ y_t - V_\phi(s_t) \right]^2$, where
• $y_t = A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) + V^{\pi_{\theta_{\mathrm{old}}}}(s_t)$,
• $A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) = \sum_{k=0}^{\infty} (\lambda\gamma)^k \delta_{t+k}$, $\quad \delta_{t+k} = r(s_{t+k}, a_{t+k}) + \gamma V^{\pi_{\theta_{\mathrm{old}}}}(s_{t+k+1}) - V^{\pi_{\theta_{\mathrm{old}}}}(s_{t+k})$,
• $r_k = r^e_k + r^i_k$.
• Policy loss: $L(\theta) = \min\left( \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t),\ \mathrm{clip}\left( \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon \right) A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) \right)$
16
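A minimal sketch of the advantage and value-target computation implied by the formulas above (a generic GAE recursion on the combined reward; names and hyperparameter values are illustrative, not the paper's):

```python
import numpy as np

def gae_and_value_targets(rewards_e, rewards_i, values, gamma=0.99, lam=0.95, ir_scale=0.5):
    """rewards_e, rewards_i: length-T arrays of extrinsic / intrinsic rewards.
    values: length-(T+1) array of V_old(s_t), including a bootstrap value for the final state."""
    r = np.asarray(rewards_e, dtype=float) + ir_scale * np.asarray(rewards_i, dtype=float)
    T = len(r)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V_old(s_{t+1}) - V_old(s_t)
        delta = r[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running      # A_t = sum_k (lam * gamma)^k * delta_{t+k}
        adv[t] = running
    y = adv + np.asarray(values[:-1], dtype=float)   # y_t = A_t + V_old(s_t), target for the value loss
    return adv, y
```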
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \approx 0$ for many states
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$.
• For simplicity, let's assume $\alpha = 1$.
• Both methods behave similarly.
17
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \gg 0$
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$
• The naive novelty method can yield high rewards in both of the scenarios below.
• So, if the agent meets an unfamiliar region, it can easily maximize its reward by thoroughly exploring only that region (DFS manner).
18
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \gg 0$
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$
• The novelty-difference method can yield high rewards only in the right-side scenario of the figure below (a numeric illustration follows after this slide).
• So the agent should get out of the known region, even if its knowledge of that region is still rough (BFS manner).
19
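A small numeric illustration of the contrast, with made-up novelty values (the numbers are mine, purely for exposition, and assume $\alpha = 1$):

```latex
\text{Lingering inside an already-entered unfamiliar room } (\mathrm{novelty}(s_t)=0.9,\ \mathrm{novelty}(s_{t+1})=0.85):\\
\quad r^i_{\text{naive}} = \mathrm{novelty}(s_t) = 0.9, \qquad
      r^i_{\text{NovelD}} = \max(0.85 - 0.9,\ 0) = 0. \\[4pt]
\text{Crossing the frontier from a known state into an unknown one } (\mathrm{novelty}(s_t)=0.1,\ \mathrm{novelty}(s_{t+1})=0.9):\\
\quad r^i_{\text{naive}} = \mathrm{novelty}(s_t) = 0.1, \qquad
      r^i_{\text{NovelD}} = \max(0.9 - 0.1,\ 0) = 0.8.
```

The naive reward keeps paying out on every step whose starting state is novel, so lingering inside one unfamiliar room stays profitable; NovelD pays only for the jump across the frontier.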
Methods Analysis - Pushing Boundaries
Why?
• The NovelD reward forces an agent to push the boundary/frontier of the
known regions.
• Because the agent can get high rewards at the boundary between the known and unknown regions.
20
Methods Analysis - Pushing Boundaries
So what?
• So what? Why does pushing boundaries help RL exploration?
• NovelD forces the agent to visit states that have never been explored so far.
• So the agent visits genuinely new states outside the known regions, even while its knowledge of those regions is still rough.
21
Methods Analysis - Uniform Visit Counts
Why?
• The NovelD reward forces an agent to make uniform state visit counts.
• Because the agent is forced to act so that the novelty signal becomes flat.
• This is done by making the uncertainty flat across all states, which in turn requires uniform visit counts. (Analogy: making $\ln t \,/\, N(s)$ equal for all $s$ in UCB; the bonus is written out below.)
22
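For reference, the UCB bonus that the analogy refers to (the standard bandit form; the link to NovelD is only the slide's analogy, not a formal result):

```latex
a_t = \arg\max_a \left[ \hat{\mu}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],
\qquad \text{so the bonuses } \sqrt{\tfrac{\ln t}{N_t(a)}} \text{ are equal across arms exactly when the counts } N_t(a) \text{ are equal.}
```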
Methods Analysis - Uniform Visit Counts
So what?
• So what? Why do uniform visit counts help RL exploration?
• (My own conjecture) It is known that high entropy of the visited-state-count distribution helps RL exploration.
• (My own conjecture) The value function would be approximated well if the visit counts are uniform, in the setting with extrinsic rewards.
23
Methods Analysis - Tolerance to Forgetting
Why?
• If an agent forgets an explored region, the neighbors of that region would be forgotten too.
• So, the novelties increase in the region and its neighbors together.
• So, the novelty differences stay low, and the incentive to re-explore the explored-but-forgotten regions is low.
24
Experiments
25
PoC: Why does NovelD Accelerate Exploration?
• NovelD's intrinsic rewards accelerate RL exploration, shown by:
• 1) The boundaries of the known regions are pushed outward by NovelD.
• 2) The visit counts of states become uniform under NovelD.
• in a pure exploration setting (w/o extrinsic rewards).
26
PoC
Environment
• Environment: MiniGrid
• 2D grid-world environments with goal-oriented tasks.
• Randomized, procedurally generated environment.
• Reward is positive only when reaching the final goal.
• Action space is discrete.
• NovelD uses bird's-eye-view full observations, not the partial observations in the agent's view (a setup sketch follows below).
27
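A minimal environment-setup sketch, assuming the gym-minigrid package (the environment ID, wrapper name, and return signature are my assumptions and may differ across versions and from the authors' code):

```python
import gym
import gym_minigrid                                   # registers the MiniGrid-* environments (assumed installed)
from gym_minigrid.wrappers import FullyObsWrapper     # bird's-eye full grid instead of the agent-view crop

env = FullyObsWrapper(gym.make("MiniGrid-MultiRoom-N6-v0"))  # illustrative multi-room task
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # discrete actions; reward > 0 only at the goal
```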
PoC - Pushed Boundary in Pure Exploration
Claim
• Claim:
• NovelD forces an agent to push the boundary of the explored regions,
which accelerates RL exploration.
28
PoC - Pushed Boundary in Pure Exploration
Results
• Pure exploration in MiniGrid
• The NovelD agent gets high IR at the boundary of the explored region, and pushes the high-IR region outward consistently.
• The RND agent cannot push the high-IR regions outward clearly.
29
[Figure: empirical IR plots after different checkpoints (after start, after entering the 2nd room, after entering the 3rd room)]
PoC - Uniform State Visit Counts in Pure Exploration
Claims
• Claim:
• NovelD forces an agent to make uniform state visit counts, which
accelerates RL exploration.
30
PoC - Uniform State Visit Counts in Pure Exploration
Results
• Visit count $N(s)$ is analyzed in one fixed env.
• NovelD makes the visit counts uniform after some stabilization steps of the encoder used in the novelty calculation.
• RND makes visit counts non-uniform, going back and forth over explored regions to understand them thoroughly.
31
[Figure: normalized visit-count heat maps $N(s)/Z$ over agent locations]
PoC - Uniform State Visit Counts in Pure Exploration
Results
• Visit count $N(s)$ is analyzed in one fixed env.
• NovelD makes the visit-count distribution in each room have high entropy.
• $\mathcal{H}(\rho_{\mathrm{room}}(s))$, where $\rho_{\mathrm{room}}(s) = N(s) \,/\, \sum_{s' \in S_{\mathrm{room}}} N(s')$.
32
[Figure: $\mathcal{H}(\rho_{\mathrm{room}}(s))$ after some env steps for RND / NovelD; entropy gets lower with RND after 3M env steps, and higher in most rooms with NovelD]
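A small sketch of this per-room entropy metric, assuming we already have per-cell visit counts and a cell-to-room mapping (the data structures and names are mine):

```python
import numpy as np

def per_room_visit_entropy(visit_counts, room_of_cell):
    """Entropy H(rho_room) of the within-room visit-count distribution.

    visit_counts: dict cell -> N(s), visit count of that cell.
    room_of_cell: dict cell -> room id.
    Returns: dict room id -> entropy in nats; higher means more uniform visits inside the room.
    """
    entropies = {}
    for room in set(room_of_cell.values()):
        counts = np.array([n for cell, n in visit_counts.items() if room_of_cell[cell] == room], dtype=float)
        total = counts.sum()
        if total == 0:
            entropies[room] = 0.0
            continue
        rho = counts / total                  # rho_room(s) = N(s) / sum_{s' in room} N(s')
        rho = rho[rho > 0]
        entropies[room] = float(-(rho * np.log(rho)).sum())
    return entropies
```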
Experiments with Extrinsic Rewards
• Experiments with extrinsic reward in MiniGrid envs:
• The training environment is randomly initialized at each episode to have different entity locations and colors.
• Test performance is evaluated.
• The results are averaged across four seeds and 32 randomly initialized environments.
33
Experiments with Extrinsic Rewards
• NovelD solves hard games within a small number of steps even when the state space is large.
• Other algorithms cannot solve them when the state space becomes larger (larger rooms, more rooms).
34
Experiments with Extrinsic Rewards
• NovelD improves sample efficiency in easy envs, and solves hard envs.
35
Experiments - Noisy-TV in MiniGrid
• Noisy-TV in MiniGrid env:
• Some walls of the env change color randomly at every time step.
• Empirically, NovelD's performance does not degrade in the noisy-TV setup.
36
Conclusion
37
Conclusion
• Intrinsic rewards by NovelD accelerate RL exploration:
• by pushing boundaries of known regions consistently in a BFS-like manner,
• by making uniform state visit count,
• by being tolerant to the forgetting of the agent.
• NovelD outperforms other algorithms in terms of sample efficiency in various environments (MiniGrid, Atari, NetHack).
38
Limitations
• [implementation] NovelD uses the fully observable state in MiniGrid, not the agent's partial observation.
• If observations are the same in different contexts (locations), NovelD would not explore the unexplored region.
• If the env is noisy, $\mathrm{novelty}(s)$ is high for all $s$, so NovelD cannot get meaningful intrinsic rewards.
• The NovelD agent could get meaningless dense intrinsic rewards from observations that are different views of the same object. [ref]
• NovelD is not tested in continuous-action RL domains.
39
Semantic Exploration from Language Abstractions: https://arxiv.org/abs/2204.05080