Machine Learning LABoratory
Seungjoon Lee. 2023-09-29. sjlee1218@postech.ac.kr
NovelD: A Simple yet Effective
Exploration Criterion
Neurips 2021. Paper Summary
1
Paper in One Figure
2
Caution!!!
• This is material I prepared to summarize a paper for my personal research meeting.
• Some of the contents may be incorrect!
• Some contributions and experiments are excluded intentionally because they are not directly related to my research interest.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr (for corrections or additions to the material, ideas to develop this paper, or others).
3
Contents
• Introduction
• Methods
• Experiments
• Conclusion
4
Intro
5
Situations
• RL cannot explore well in sparse-reward environments.
• Novelty-based RL exploration methods incentivize exploration using novelty
as intrinsic rewards.
6
Complications
• If a novelty-based RL agent meets an unknown region, it explores the region thoroughly until the region's novelty becomes low (DFS manner).
• It focuses on a tree rather than the forest, slowing down RL exploration.
• If the state space is large, the novelty-based RL agent forgets already-explored regions and goes back to them.
• It gets trapped in some regions, so its state visit counts become imbalanced.
7
Question & Hypothesis
• Question:
• Can we design a novelty-based intrinsic reward method that makes the state visit counts uniform and explores more broadly?
• Hypothesis:
• Intrinsic reward (IR) based on the novelty difference can make the visit counts uniform and push the boundary of the known regions consistently.
• The IR based on the novelty difference is relatively robust to the forgetting of the neural network (NN).
8
Contributions
• The authors show that novelty-based methods explore in a DFS-like manner and get stuck in some large state spaces.
• Intrinsic rewards based on the novelty difference accelerate RL exploration:
• by pushing the boundaries of known regions consistently in a BFS-like manner,
• by making state visit counts uniform,
• by being tolerant to the forgetting of the agent.
9
Methods
10
Problem Formulation
• Episodic MDP with finite horizon: $(S, A, P, R, \gamma, T)$
• $S$: observation space
• $T$: finite horizon
• A policy $\pi(a \mid o)$ that maximizes $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$ is considered, where $r_t = r^e_t + \alpha r^i_t$.
11
Methods Outline
Desires
• The new intrinsic reward should:
• force an agent to push the boundary/frontier of the known regions.
• force an agent to make state visit counts uniform.
12
Methods Outline
• Intrinsic reward calculation + Novelty estimation + RL agent
• Intrinsic reward calculation: novelty difference
• Novelty estimation: Random Network Distillation (RND)
• RL: PPO
13
Methods - Intrinsic Reward Calculation
• Intrinsic rewards (IR) are calculated from the novelty difference (NovelD):
• $r^i_t(s_t, a_t, s_{t+1}) = \max\left[\mathrm{novelty}(s_{t+1}) - \alpha \cdot \mathrm{novelty}(s_t),\ 0\right] \cdot \mathbb{1}\left[N_e(s_{t+1}) = 1\right]$
• $\mathrm{novelty}(\cdot)$ could be any novelty measure for a state.
• $N_e(s)$ is the state visit count within one episode.
• So, NovelD gives IR only when $s_{t+1}$ is visited for the first time in this episode (a minimal code sketch follows below).
14
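A minimal sketch of this reward rule, written by me for illustration rather than taken from the authors' code (the value of alpha is a placeholder):

```python
def noveld_intrinsic_reward(novelty_s, novelty_s_next, episodic_count_s_next, alpha=0.5):
    """NovelD intrinsic reward for one transition (s_t, a_t, s_{t+1}).

    novelty_s, novelty_s_next: scalar novelty estimates for s_t and s_{t+1} (e.g. from RND).
    episodic_count_s_next: number of visits to s_{t+1} within the current episode.
    alpha: scaling on novelty(s_t); the value here is a placeholder, not the paper's setting.
    """
    clipped_difference = max(novelty_s_next - alpha * novelty_s, 0.0)
    first_visit = 1.0 if episodic_count_s_next == 1 else 0.0  # episodic restriction 1[N_e(s_{t+1}) = 1]
    return clipped_difference * first_visit

# A first visit to a much more novel state gives positive IR; revisiting it in the same episode gives zero.
r_new = noveld_intrinsic_reward(0.1, 0.9, episodic_count_s_next=1)      # > 0
r_revisit = noveld_intrinsic_reward(0.1, 0.9, episodic_count_s_next=2)  # == 0
```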
Methods - Novelty Estimation
• Novelty of $s$ is estimated by RND, which assigns high novelty to unfamiliar states.
• $\mathrm{novelty}(s) = \lVert f_{\mathrm{fixed}}(s) - f_\psi(s) \rVert^2$
• Target function $f_{\mathrm{fixed}} : S \to \mathbb{R}^k$ (randomly initialized and kept fixed).
• Predictor function $f_\psi : S \to \mathbb{R}^k$ (trained to match the target on visited states); a compact sketch follows below.
15
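A compact RND sketch under these definitions (the MLP form and network sizes are my own assumptions; the paper uses its own architectures):

```python
import torch
import torch.nn as nn

class RNDNovelty(nn.Module):
    """novelty(s) = || f_fixed(s) - f_psi(s) ||^2 with a frozen random target and a trained predictor."""

    def __init__(self, obs_dim, embed_dim=128):
        super().__init__()
        make_mlp = lambda: nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.target = make_mlp()      # f_fixed: randomly initialized, never trained
        self.predictor = make_mlp()   # f_psi: trained to match the target on visited states
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, obs):           # obs: (batch, obs_dim)
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return ((pred_feat - target_feat) ** 2).sum(dim=-1)  # per-state novelty

# Training step for the predictor (the novelty itself is the regression loss):
# loss = rnd(obs_batch).mean(); loss.backward(); optimizer.step()
```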
Methods - RL Agent
Training of RL agent
• RL agent: PPO
• Value loss: $L(\phi) = \sum_t \left[ y_t - V_\phi(s_t) \right]^2$, where
• $y_t = A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) + V^{\pi_{\theta_{\mathrm{old}}}}(s_t)$,
• $A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) = \sum_{k=0}^{\infty} (\lambda\gamma)^k \delta_{t+k}$, $\quad \delta_{t+k} = r(s_{t+k}, a_{t+k}) + \gamma V^{\pi_{\theta_{\mathrm{old}}}}(s_{t+k+1}) - V^{\pi_{\theta_{\mathrm{old}}}}(s_{t+k})$,
• $r_k = r^e_k + r^i_k$.
• Policy loss: $L(\theta) = \min\left( \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t),\ \mathrm{clip}\left( \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon \right) A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t) \right)$
16
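A minimal sketch of the advantage and value-target computation implied by the formulas above (a generic GAE recursion on the combined reward; names and hyperparameter values are illustrative, not the paper's):

```python
import numpy as np

def gae_and_value_targets(rewards_e, rewards_i, values, gamma=0.99, lam=0.95, ir_scale=0.5):
    """rewards_e, rewards_i: length-T arrays of extrinsic / intrinsic rewards.
    values: length-(T+1) array of V_old(s_t), including a bootstrap value for the final state."""
    r = np.asarray(rewards_e, dtype=float) + ir_scale * np.asarray(rewards_i, dtype=float)
    T = len(r)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V_old(s_{t+1}) - V_old(s_t)
        delta = r[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running      # A_t = sum_k (lam * gamma)^k * delta_{t+k}
        adv[t] = running
    y = adv + np.asarray(values[:-1], dtype=float)   # y_t = A_t + V_old(s_t), target for the value loss
    return adv, y
```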
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \approx 0$ for many states
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$.
• For simplicity, let's assume $\alpha = 1$.
• Both methods behave similarly.
17
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \gg 0$
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$
• The naive novelty method can yield high rewards in both of the scenarios below.
• So, if the agent meets an unfamiliar region, it can easily maximize its reward by thoroughly exploring only that region (DFS manner).
18
Methods - Novelty Difference v.s. Novelty
When $\mathrm{novelty}(s_t) \gg 0$
• $r^i(s_t, a_t, s_{t+1}) = \mathrm{novelty}(s_{t+1}) - \mathrm{novelty}(s_t)$ v.s. $r^i(s_t) = \mathrm{novelty}(s_t)$
• The novelty-difference method can yield high rewards only in the right-side scenario of the figure below (a numeric illustration follows after this slide).
• So the agent should get out of the known region, even if its knowledge of that region is still rough (BFS manner).
19
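A small numeric illustration of the contrast, with made-up novelty values (the numbers are mine, purely for exposition, and assume $\alpha = 1$):

```latex
\text{Lingering inside an already-entered unfamiliar room } (\mathrm{novelty}(s_t)=0.9,\ \mathrm{novelty}(s_{t+1})=0.85):\\
\quad r^i_{\text{naive}} = \mathrm{novelty}(s_t) = 0.9, \qquad
      r^i_{\text{NovelD}} = \max(0.85 - 0.9,\ 0) = 0. \\[4pt]
\text{Crossing the frontier from a known state into an unknown one } (\mathrm{novelty}(s_t)=0.1,\ \mathrm{novelty}(s_{t+1})=0.9):\\
\quad r^i_{\text{naive}} = \mathrm{novelty}(s_t) = 0.1, \qquad
      r^i_{\text{NovelD}} = \max(0.9 - 0.1,\ 0) = 0.8.
```

The naive reward keeps paying out on every step whose starting state is novel, so lingering inside one unfamiliar room stays profitable; NovelD pays only for the jump across the frontier.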
Methods Analysis - Pushing Boundaries
Why?
• The NovelD reward forces an agent to push the boundary/frontier of the
known regions.
• Because the agent can get high rewards at the boundary between the known and unknown regions.
20
Methods Analysis - Pushing Boundaries
So what?
• So what? Why does pushing boundaries help RL exploration?
• NovelD forces the agent to visit states that have never been explored so far.
• So the agent visits genuinely new states outside the known regions, even while its knowledge of those regions is still rough.
21
Methods Analysis - Uniform Visit Counts
Why?
• The NovelD reward forces an agent to make uniform state visit counts.
• Because the agent is forced to act so that the novelty signal becomes flat.
• This is done by making the uncertainty flat across all states, which in turn requires uniform visit counts. (Analogy: making $\ln t \,/\, N(s)$ equal for all $s$ in UCB; the bonus is written out below.)
22
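For reference, the UCB bonus that the analogy refers to (the standard bandit form; the link to NovelD is only the slide's analogy, not a formal result):

```latex
a_t = \arg\max_a \left[ \hat{\mu}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],
\qquad \text{so the bonuses } \sqrt{\tfrac{\ln t}{N_t(a)}} \text{ are equal across arms exactly when the counts } N_t(a) \text{ are equal.}
```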
Methods Analysis - Uniform Visit Counts
So what?
• So what? Why do uniform visit counts help RL exploration?
• (My own conjecture) It is known that high entropy of the visited-state-count distribution helps RL exploration.
• (My own conjecture) The value function would be approximated well if the visit counts are uniform, in the setting with extrinsic rewards.
23
Methods Analysis - Tolerance to Forgetting
Why?
• If an agent forgets an explored region, the neighbors of that region would be forgotten too.
• So, the novelties increase in the region and its neighbors together.
• So, the novelty differences stay low, and the incentive to re-explore the explored-but-forgotten regions is low.
24
Experiments
25
PoC: Why does NovelD Accelerate Exploration?
• NovelD's intrinsic rewards accelerate RL exploration, shown by:
• 1) The boundaries of the known regions are pushed outward by NovelD.
• 2) The visit counts of states become uniform under NovelD.
• in a pure exploration setting (w/o extrinsic rewards).
26
PoC
Environment
• Environment: MiniGrid
• 2D grid-world environments with goal-oriented tasks.
• Randomized, procedurally generated environment.
• Reward is positive only when reaching the final goal.
• Action space is discrete.
• NovelD uses bird's-eye-view full observations, not the partial observations in the agent's view (a setup sketch follows below).
27
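A minimal environment-setup sketch, assuming the gym-minigrid package (the environment ID, wrapper name, and return signature are my assumptions and may differ across versions and from the authors' code):

```python
import gym
import gym_minigrid                                   # registers the MiniGrid-* environments (assumed installed)
from gym_minigrid.wrappers import FullyObsWrapper     # bird's-eye full grid instead of the agent-view crop

env = FullyObsWrapper(gym.make("MiniGrid-MultiRoom-N6-v0"))  # illustrative multi-room task
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # discrete actions; reward > 0 only at the goal
```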
PoC - Pushed Boundary in Pure Exploration
Claim
• Claim:
• NovelD forces an agent to push the boundary of the explored regions,
which accelerates RL exploration.
28
PoC - Pushed Boundary in Pure Exploration
Results
• Pure exploration in MiniGrid
• The NovelD agent gets high IR at the boundary of the explored region, and pushes the high-IR region outward consistently.
• The RND agent cannot push the high-IR regions outward clearly.
29
[Figure: empirical IR plots after different checkpoints (after start, after entering the 2nd room, after entering the 3rd room)]
PoC - Uniform State Visit Counts in Pure Exploration
Claims
• Claim:
• NovelD forces an agent to make uniform state visit counts, which
accelerates RL exploration.
30
PoC - Uniform State Visit Counts in Pure Exploration
Results
• Visit count $N(s)$ is analyzed in one fixed env.
• NovelD makes the visit counts uniform after some stabilization steps of the encoder used in the novelty calculation.
• RND makes visit counts non-uniform, going back and forth over explored regions to understand them thoroughly.
31
[Figure: normalized visit-count heat maps $N(s)/Z$ over agent locations]
PoC - Uniform State Visit Counts in Pure Exploration
Results
• Visit count $N(s)$ is analyzed in one fixed env.
• NovelD makes the visit-count distribution in each room have high entropy.
• $\mathcal{H}(\rho_{\mathrm{room}}(s))$, where $\rho_{\mathrm{room}}(s) = N(s) \,/\, \sum_{s' \in S_{\mathrm{room}}} N(s')$.
32
[Figure: $\mathcal{H}(\rho_{\mathrm{room}}(s))$ after some env steps for RND / NovelD; entropy gets lower with RND after 3M env steps, and higher in most rooms with NovelD]
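A small sketch of this per-room entropy metric, assuming we already have per-cell visit counts and a cell-to-room mapping (the data structures and names are mine):

```python
import numpy as np

def per_room_visit_entropy(visit_counts, room_of_cell):
    """Entropy H(rho_room) of the within-room visit-count distribution.

    visit_counts: dict cell -> N(s), visit count of that cell.
    room_of_cell: dict cell -> room id.
    Returns: dict room id -> entropy in nats; higher means more uniform visits inside the room.
    """
    entropies = {}
    for room in set(room_of_cell.values()):
        counts = np.array([n for cell, n in visit_counts.items() if room_of_cell[cell] == room], dtype=float)
        total = counts.sum()
        if total == 0:
            entropies[room] = 0.0
            continue
        rho = counts / total                  # rho_room(s) = N(s) / sum_{s' in room} N(s')
        rho = rho[rho > 0]
        entropies[room] = float(-(rho * np.log(rho)).sum())
    return entropies
```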
Experiments with Extrinsic Rewards
• Experiments with extrinsic reward in MiniGrid envs:
• The training environment is randomly initialized at each episode to have different entity locations and colors.
• Test performance is evaluated.
• The results are averaged across four seeds and 32 randomly initialized environments.
33
Experiments with Extrinsic Rewards
• NovelD solves hard games within a small number of steps even when the state space is large.
• Other algorithms cannot solve them when the state space becomes larger (larger rooms, more rooms).
34
Experiments with Extrinsic Rewards
• NovelD improves sample efficiency in easy envs, and solves hard envs.
35
Experiments - Noisy-TV in MiniGrid
• Noisy-TV in MiniGrid env:
• Some walls of the env change color randomly at every time step.
• Empirically, NovelD's performance does not degrade in the noisy-TV setup.
36
Conclusion
37
Conclusion
• Intrinsic rewards by NovelD accelerate RL exploration:
• by pushing boundaries of known regions consistently in a BFS-like manner,
• by making uniform state visit count,
• by being tolerant to the forgetting of the agent.
• NovelD outperforms other algorithms in terms of sample efficiency in various environments (MiniGrid, Atari, NetHack).
38
Limitations
• [implementation] NovelD uses the fully observable state in MiniGrid, not the agent's partial observation.
• If observations are the same in different contexts (locations), NovelD would not explore the unexplored region.
• If the env is noisy, $\mathrm{novelty}(s)$ is high for all $s$, so NovelD cannot get meaningful intrinsic rewards.
• The NovelD agent could get meaningless dense intrinsic rewards from observations that are different views of the same object. [ref]
• NovelD is not tested in continuous-action RL domains.
39
Semantic Exploration from Language Abstractions: https://arxiv.org/abs/2204.05080