AAMAS’24 Tutorial 9
Unlocking Exploration:
Self-Motivated Agents
Thrive on Memory-
Driven Curiosity
2
Why This Topic?
 Deep Learning in RL: Deep RL Agents excel in
complex tasks but need many interactions to
learn
 Practicality Issues: Extensive learning steps
hinder RL's use in real-world scenarios
 Exploration Optimization: Improving exploration
is key for RL's real-world application
 Memory and Learning: Memory-based
exploration can speed up learning and advance
AI
Generated by DALL-E 3
3
About Us
 Authors: Hung Le, Hoang Nguyen and Dai Do
 Our lab: A2I2, Deakin University
 Hung Le is a research lecturer at Deakin
University, leading research on deep sequential
models and reinforcement learning
 Hoang Nguyen is a second-year PhD student at
A2I2, specializing in reinforcement learning and
causality
 Dai Do is a second-year PhD student at A2I2,
specializing in reinforcement learning and large
language models
A2I2 Lab Foyer, Waurn Ponds
4
About the Tutorial
This tutorial is based on my previous presentations,
expanding on topics covered in the following talks:
 Memory-Based Reinforcement Learning.
AJCAI’22
 Memory for Lean Reinforcement Learning. FPT
Software AI Center. 2022
 Neural machine reasoning. IJCAI’21
 From deep learning to deep reasoning. KDD’21
 My Blogs: https://hungleai.substack.com/
Generated by DALL-E 3
5
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction 👈 [We are here]
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes,
including a 20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven
Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
Reinforcement Learning
Fundamentals and
Exploration Inefficiency
PART A
Generated by DALL-E 3
7
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Key components and frameworks 👈 [We are here]
 Classic exploration
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
8
Reinforcement Learning Basics
 In reinforcement learning (RL), an agent
interacts with the environment, taking
actions a, receiving a reward r, and
moving to a new state s
 The agent is tasked with maximizing the
accumulated rewards or returns R over
time by finding optimal actions (policy)
9
Reinforcement Learning Concepts
 Policy π : maps state s to action a
 Return (discounted) G or R: the cumulative
(weighted) sum of rewards
 State value function V: the expected discounted
return starting with state s following policy π
 State-action value function Q: the expected return starting from state s, taking action a, and thereafter following policy π
Classic RL algorithms: Value learning
10
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms for
connectionist reinforcement learning." Machine learning 8, no. 3 (1992):
229-256.
 Basic idea: before finding optimal policy,
we find the value function
 Learn (action) value function:
 V(s)
 Q(s,a)
 Estimate V(s)=E(∑R from s)
 Estimate Q(s,a)=E(∑R from s,a)
 Given Q(s,a)
→ choose action that maximizes the value
(ε-greedy policy)
RL algorithms: Q-Learning
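Below is a minimal tabular Q-learning sketch with an ε-greedy behaviour policy, assuming a small discrete environment with a Gymnasium-style API; `env`, `n_states`, and `n_actions` are illustrative placeholders:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn Q(s,a), then act (mostly) greedily w.r.t. it."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore at random with probability epsilon
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```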
Classic RL algorithm: Policy gradient
 Basic idea: directly optimise the policy as a
function of states
 Need to estimate the gradient of the
objective function E(∑R) w.r.t the
parameters of the policy
 Focus on optimisation techniques
11
REINFORCE
(policy gradient)
RL algorithms: Policy Gradient
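A compact REINFORCE sketch, assuming PyTorch, a discrete action space, and a Gymnasium-style `env`; the network size, learning rate, and episode budget are illustrative:

```python
import torch
import torch.nn as nn

def reinforce(env, obs_dim, n_actions, episodes=500, gamma=0.99, lr=1e-3):
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                           nn.Linear(64, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        logps, rewards = [], []
        s, _ = env.reset()
        done = False
        while not done:
            logits = policy(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            logps.append(dist.log_prob(a))
            s, r, terminated, truncated, _ = env.step(a.item())
            rewards.append(r)
            done = terminated or truncated
        # discounted return G_t for every step of the episode
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        # policy-gradient estimate of the objective E[sum of rewards]
        loss = -(torch.stack(logps) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```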
General RL algorithms
12
Q-Learning vs Policy Gradient
Both require
exploration
to collect data for
training
13
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Key components and frameworks
 Classic exploration 👈 [We are here]
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
14
ε-greedy
 ε-greedy is the simplest exploration
strategy that works in theory
 It heavily relies on pure randomness
and biased estimates of action values
Q, and thus is sample-inefficient in
practice
 We often go with what we assume is
best, but sometimes, we take a
random chance to explore other
options. This is one example of an
optimistic strategy
It is used in Q-learning
15
ε-greedy: Problems
 It is not surprising that this strategy might struggle in real-world scenarios: being overly optimistic when your estimates are imprecise can be risky.
 It may lead to getting stuck in a local
optimum and missing out on
discovering the global one with the
highest returns.
Benchmarking ε-greedy (red line) and other exploration methods on Montezuma’s
Revenge. Taïga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G.
Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning
environment." arXiv preprint arXiv:1908.02388 (2019).
ε-greedy
16
Upper Confidence Bound (UCB)
 One way to address the problem of over-
optimism is to consider the uncertainty of the
estimation
 We do not want to miss an action with a
currently low estimated value and high
uncertainty, as it may possess a higher value:
 What we need is to guarantee:
Hoeffding’s Inequality
How to estimate
uncertainty?
Implicitly, with a large value of the exploration-
exploitation trade-off parameter c, the chosen
action is more likely to deviate from the greedy
action, leading to increased exploration.
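A minimal UCB1-style sketch for a bandit, where the visit count N(a) serves as the uncertainty estimate; `pull(a)` is a hypothetical function returning a stochastic reward:

```python
import numpy as np

def ucb(pull, n_actions, steps=1000, c=2.0):
    """Choose argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    Q = np.zeros(n_actions)   # running value estimates
    N = np.zeros(n_actions)   # visit counts (uncertainty proxy)
    for t in range(1, steps + 1):
        if (N == 0).any():                       # play every arm once first
            a = int(np.argmin(N))
        else:
            bonus = c * np.sqrt(np.log(t) / N)   # larger c => more exploration
            a = int(np.argmax(Q + bonus))
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # incremental mean update
    return Q, N
```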
17
Thompson Sampling
 When additional assumptions about the reward distribution are
available, an action can be chosen based on the probability that
it is optimal (probability matching strategy)
 Thompson sampling is one way to implement the strategy:
1. Assume the reward follows a distribution p(r|a, θ) where θ
is the parameter whose prior is p(θ)
2. Given the set of past observations Dt made of pairs {(ai, ri) | i = 1, 2, ..., t}, we update the posterior using Bayes' rule
3. Given the posterior, we can estimate the action value
4. We can compute the probability of choosing action a
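A Beta-Bernoulli Thompson-sampling sketch, assuming 0/1 rewards with a Beta prior; `pull(a)` is again a hypothetical reward function:

```python
import numpy as np

def thompson_sampling(pull, n_actions, steps=1000):
    """Sample theta from the posterior, act greedily on the sample, update the posterior."""
    alpha = np.ones(n_actions)    # Beta posterior: successes + 1
    beta = np.ones(n_actions)     # Beta posterior: failures + 1
    for _ in range(steps):
        theta = np.random.beta(alpha, beta)   # one posterior sample per arm
        a = int(np.argmax(theta))             # probability matching in action form
        r = pull(a)                           # observe a Bernoulli reward
        alpha[a] += r
        beta[a] += 1 - r
    return alpha, beta
```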
18
Information Gain
 Information Gain (IG) measures the change in the
amount of information (measured in entropy H) of a
latent variable
 The latent variable often refers to the parameter of
the model θ after seeing observation (e.g., reward r)
caused by some action a
 A big drop in the entropy means the observation
makes the model more predictable and less
uncertain
 Our goal is to balance minimizing expected regret in the current period against acquiring new information about the observation model
Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-
directed sampling." Advances in Neural Information Processing Systems 27
(2014).
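In symbols (one standard way to write it, not taken verbatim from the slide), the information gain of observing reward r after action a is the drop in posterior entropy:

IG(a, r) = H( p(θ | D) ) − H( p(θ | D ∪ {(a, r)}) )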
19
Application: Multi-arm bandit
 There are multiple actions to take (the bandit's arms)
 After taking an action, the agent observes a reward
 Goal: maximize cumulative rewards or minimize cumulative regret.
Generated by DALL-E 3
20
Limitations of Classical Exploration
❌ Scalability Issues: Most are specifically designed for
bandit problems, and thus, they are hard to apply in
large-scale or high-dimensional problems (e.g., Atari
games), resulting in increased computational
demands that can be impractical
❌ Assumption Sensitivity: These methods heavily rely
on specific assumptions about reward distributions or
system dynamics, limiting their adaptability when
assumptions do not hold
❌ Vulnerability to Uncertainty: They may struggle in
dynamic environments with complex reward
structures or frequent changes, leading to suboptimal
performance Generated by DALL-E 3
21
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 Hard Exploration Problems 👈 [We are here]
 Simple exploring solutions
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
 Task:
 Agent searches for the key
 Agent picks up the key
 Agent opens the door to access the room
 Agent finds the box in the room
 Reward:
 If the agent reaches the box, it gets a +1 reward
22
https://github.com/maximecb/gym-minigrid
→ How can the agent learn such a complicated policy using this simple, sparse reward?
Modern RL Environments are Complicated
23
Why is Scaling a Big Problem?
 Practical environments often involve huge
continuous state and action spaces
 Classical approaches cannot be implemented or fail
to hold their theoretical properties in these settings
Doom environment: continuous high-dimensional
state space (source)
Mujoco environment: continuous action space (source).
24
Challenging Environments for Exploration
 Environments that require long-term memory from the agent:
 Maze navigation with conditions such as finding the objects that have the same color as the wall
 Remembering the shortest path to objects experienced in the past
 Noisy environments:
 Noisy-TV: a random TV distracts the RL agent from its main task due to its noisy screen
https://github.com/jurgisp/memory-maze
Noisy-TV (source)
25
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 Hard exploration problems
 Simple exploring solutions 👈 [We are here]
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
26
Entropy Maximization
 In the era of deep learning, neural networks are used for
approximating functions, including parameterizing value and
policy functions in RL
 ε-greedy is less straightforward for policy gradient methods
 An entropy loss term is introduced in the objective function to
penalize overly deterministic policies. This encourages diverse
exploration, avoiding suboptimal actions by maximizing the
bonus entropy loss term
❌ It may also impede the optimization of other losses, especially
the main objective
❌ The entropy loss does not enforce different levels of exploration for different tasks
It is used in PPO, A3C, …
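A sketch of how the entropy term typically enters the policy-gradient loss (PyTorch; the coefficient value is illustrative):

```python
import torch

def pg_loss_with_entropy(logits, actions, advantages, ent_coef=0.01):
    """Policy-gradient loss minus an entropy bonus that penalizes
    overly deterministic policies (A3C/PPO-style)."""
    dist = torch.distributions.Categorical(logits=logits)
    pg = -(dist.log_prob(actions) * advantages).mean()
    entropy = dist.entropy().mean()
    return pg - ent_coef * entropy   # maximizing entropy = subtracting it from the loss
```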
27
Noisy Networks
 Another method to add randomness to the
policy is to add noise to the weights of the
neural networks
 Throughout training, noise samples are drawn
and added to the weights for both forward and
backward propagation
❌ Although Noisy Networks can vary the degree of exploration across tasks, adapting exploration at the state level remains out of reach.
Certain states with higher uncertainty may require
more exploration, while others may not
An example of a noisy linear layer. Here w is the weight matrix and b is
the bias vector. The parameters µw, µb, σw and σb are the learnables of
the network, whereas εw and εb are noise variables. Fortunato, Meire,
Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband,
Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin,
Charles Blundell, and Shane Legg. "Noisy Networks for Exploration."
arXiv preprint arXiv:1706.10295 (2017)
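A simplified noisy linear layer in PyTorch with independent Gaussian noise per weight (the factorized-noise trick of the paper is omitted; the sigma initialization is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    def __init__(self, in_f, out_f, sigma0=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_f, in_f).uniform_(-1, 1) / in_f ** 0.5)
        self.sigma_w = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_f))
        self.sigma_b = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)   # fresh noise each forward/backward pass
        eps_b = torch.randn_like(self.sigma_b)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                        self.mu_b + self.sigma_b * eps_b)
```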
28
End of Part A
 QA
 Demo
Generated by DALL-E 3
Intrinsic Motivation:
Surprise and Novelty
PART B
Generated by DALL-E 3
30
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks 👈 [We are here]
 Reward shaping and the role of memory
 A taxonomy of memory-driven intrinsic exploration
 Deliberate Memory for Surprise-driven Exploration
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
 No curiosity, random exploration
 epsilon-greedy
 “Tractable” exploration
 Somehow optimize exploration,
e.g. UCB, Thompson sampling
 Only doable for simple
environments
31
 Approximate “tractable” exploration (count-based)
 Scalable to harder environments
 Intrinsic motivation exploration
(SOTA)
Predictive
Novelty or surprise-based curiosity
Causal
…
https://cmutschler.de/rl
Frameworks
32
Reward Shaping
 Entropy loss or noise injected into the policy/value parameters has the limitation that the level of exploration is not explicitly conditioned on fine-grained factors such as states or actions
 Solution: intrinsic reward bonuses assign higher
internal rewards to state-action pairs that
require higher exploration and vice versa
 The final reward for the agent will be the
weighted sum of the intrinsic reward and the
external (environment) reward
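In symbols, with β a weighting hyperparameter:

r_t = r_t^ext + β · r_t^int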
 Animals can travel long distances until they find food
 Humans can navigate to an address in an unfamiliar city
 What motivates these agents to
explore?
intrinsic motivation
curiosity, hunch
intrinsic reward
33
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Intrinsic Motivation in Biological World
34
What Does Intrinsic Reward Represent?
 Novelty
 It is inherent for biological agents to be motivated by new things.
 Tracking the occurrences of a state provides a novelty indicator, with
increased occurrences signaling less novelty
 Surprise
 Surprise emerges when there's a discrepancy between expectations
and the observed or experienced reality
 Build a model of the environment, predicting the next state given the
current state and action
 The intrinsic reward is the prediction error itself
 This reward increases when the model encounters difficulty in
predicting or expresses surprise at the current observation.
Novelty
Surprise
35
The Role of Memory
 Biological agents inherently possess memory to
monitor events:
 Drawing from previous experiences, they
discern novelty in observations
 Utilizing their prior understanding of the
world, they identify unexpected observations
 RL agents can be equipped with memory:
 Event-based Memory: Episodic Memory
 Semantic Memory: World Model
Novelty
Surprise
MEMORY
36
A Taxonomy of Memory for RL Exploration
37
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction👈 [We are here]
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
38
Forward Dynamics Prediction
 Build a model of the environment, predicting the
next state given the current state and action
 This kind of model, also known as forward
dynamics or world model
 C: actor vs M: world model. M predicts the consequences of C's actions; C acts to make M fail
 As a result:
 If C's actions result in repeated, boring consequences → M predicts them well
 C must explore novel consequences https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
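A minimal prediction-error ("surprise") intrinsic reward from a learned forward model (PyTorch; the architecture and sizes are illustrative):

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state from (state, action); its error is the curiosity bonus."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, obs_dim))

    def intrinsic_reward(self, s, a, s_next):
        pred = self.net(torch.cat([s, a], dim=-1))
        return ((pred - s_next) ** 2).mean(dim=-1)         # per-transition prediction error

    def loss(self, s, a, s_next):
        return self.intrinsic_reward(s, a, s_next).mean()  # M is trained to reduce its own surprise
```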
39
Learning Progress as Intrinsic Reward
 The learning progress is estimated by comparing
the mean error rate of the prediction model
during the current moving window to the mean
error rate of the previous window
 The two windows are offset by 𝜏 steps
 It can mitigate Noisy-TV problems
Pierre-Yves Oudeyer & Frederic Kaplan. “How can we define intrinsic motivation?” Conf. on Epigenetic
Robotics, 2008.
40
Deep Dynamic Models
 Use a neural network f that takes a representation of the current state and action to predict the
next state
 The representation is shaped through unsupervised training, i.e., state reconstruction task, using
an autoencoder’s hidden state
 The network f, fed with the autoencoder’s hidden state, is trained to minimize the prediction
error, which is the norm of the difference between the predicted state and the true state
Intrinsic reward
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement
Learning with Deep Predictive Models. In NIPS 2015.
 Forward model on feature space
 The feature space ignores irrelevant, uncontrollable factors
 The consequences depend on:
 Action (controllable)
 Environment (uncontrollable)
 We want a state embedding that captures the controllable factors
41
https://blog.dataiku.com/curiosity-driven-learning-through-next-state-prediction
Deepak Pathak et al.: Curiosity-driven Exploration by Self-Supervised Prediction. ICML 2017
ICM
More Complicated Model
Inverse dynamics representation learning
42
43
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises👈 [We are here]
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
 The prediction target
is stochastic
 Information necessary for the
prediction is missing
 Model class of predictors is too
limited to fit the complexity of
the target function
 Both the totally predictable and
the fundamentally unpredictable will
get boring
44
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
When Predictive Surprise Fails
45
https://favpng.com/
Ideas for Improvements
Reward M's progress instead of its error
If the consequence is too hard or too easy to predict, M makes no improvement → no reward
Remember all experiences
“Store” all experienced consequences, including stochastic ones
Global or local memory, like humans
Better Representations
Representation Learning
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
46
Random Network Distillation (RND)
 The intrinsic reward is defined through the task of
predicting the output of a fixed (target) network
 Target Network’s weights are random
 By predicting the target output, the Predictor Network
tries to “remember” the randomized state
 If old state reappears, it can be predicted easily by the
Predictor Network
 RND obviates Noisy-TV since the target network can be
chosen to be deterministic and inside the model-class
of the predictor network.
Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network
distillation." In International Conference on Learning Representations. 2018.
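A compact RND sketch in PyTorch: a frozen random target network and a trainable predictor; the prediction error is the intrinsic reward (network sizes are illustrative):

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():        # target stays random and frozen
            p.requires_grad_(False)

    def intrinsic_reward(self, obs):
        err = (self.predictor(obs) - self.target(obs)) ** 2
        return err.mean(dim=-1)   # small for frequently seen states, large for novel ones

# training: minimize intrinsic_reward(obs).mean() w.r.t. the predictor parameters only
```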
47
Noisy-TV in Atari games
 In Montezuma’s Revenge, the agent oscillates between two rooms
 This leads to an irreducibly high prediction error, as the non-determinism of sticky actions
makes it impossible to know whether, once the agent is close to crossing a room boundary,
making one extra step will result in it staying in the same room, or crossing to the next one
 This is a manifestation of the ‘noisy TV’ problem
48
Latent World Model
 A world model for exploration should be robust against stochasticity and able to extrapolate the state dynamics → its prediction error can then be a measurement of novelty
 Train the WM on a latent representation space. This space is shaped by unsupervised learning into a zero-centered distribution with an identity covariance matrix → robust to stochastic elements and arranged to respect the temporal distance of observations
 The WM error is computed in latent space
49
Latent World Model: Sample-efficient Atari Benchmark
50
Bayesian Surprise
 Surprise can be interpreted from a Bayesian statistics perspective
 Similar to IG idea, the aim is to minimize uncertainty about the dynamics, formalized as maximizing
the cumulative reduction in entropy
 The reduction of entropy per time step, also known as mutual information, I(𝚯;St+1|ξt,at)
 Here θ denotes the parameters of the dynamics model 𝚯. Because we are interested in an intrinsic reward for a given timestep, we can define:
Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "Vime:
Variational information maximizing exploration." Advances in neural information processing systems 29
(2016).
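One standard way to write the resulting per-step bonus (stated here for completeness) is the KL divergence between the parameter posterior after and before observing the new state:

r_t^i = D_KL[ p(θ | ξ_t, a_t, s_{t+1}) || p(θ | ξ_t) ]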
51
Variational Bayesian Surprise
 The KL involves computing the posterior p(θ|st+1),
which is generally intractable
 Use variational inference for approximating the
posterior, alternative variational distribution q(θ; 𝜙)
 𝜙 is the variational parameter
 This is equivalent to parameterizing the dynamics
model as a Bayesian neural network (BNN) with
weight distributions maintained as a fully factorized
Gaussian. Train 𝜙:
52
Bayesian Surprise Benchmarking
53
Bayesian Learning Progress
 Training a BNN is complicated; there are alternative Bayesian views on surprise
 Formulating the objective of the RL agent as jointly
maximizing expected return and surprise
 P is the true dynamics model and P𝜙 is the learned
dynamics model
 The objective can be translated to maximizing the
bonus reward per step
Joshua Achiam and Shankar Sastry. 2017. Surprise-based intrinsic motivation for deep reinforcement
learning. arXiv preprint arXiv:1703.01732 (2017).
54
Bayesian Learning Progress: Approximation Solutions
 In practice, we do not know P. Need
approximation
 Prediction error: measures the error in log
probability instead of the norm of the
difference between the predicted and the
reality
 Learning progress written in the form of log
probability
 To train the dynamics model P𝜙, solve the
constrained optimization
By introducing the KL constraint, the
posterior model is prevented from diverging
too far from the prior, thereby preventing
the generation of unstable intrinsic rewards.
55
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement 👈 [We are here]
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
56
Intrinsic Motivation via Disagreement
 An alternative method with forward dynamics
involves using the variance of the prediction
rather than the error
 This requires multiple prediction models
trained to minimize the forward dynamics
prediction errors,
 Use the empirical variance (disagreement) of
their predictions as the intrinsic reward
 The higher the variance, the more uncertain
about the observation  need to explore
more
Deepak Pathak, et al. “Self-Supervised Exploration via Disagreement.” In ICML 2019.
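A sketch of the disagreement bonus using an ensemble of forward models (PyTorch; the ensemble size and architecture are illustrative):

```python
import torch
import torch.nn as nn

class DisagreementBonus(nn.Module):
    """Intrinsic reward = variance of next-state predictions across an ensemble."""
    def __init__(self, obs_dim, act_dim, n_models=5, hidden=128):
        super().__init__()
        self.models = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, obs_dim))
            for _ in range(n_models)])

    def intrinsic_reward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        preds = torch.stack([m(x) for m in self.models])   # (n_models, batch, obs_dim)
        return preds.var(dim=0).mean(dim=-1)               # high disagreement => explore more

    def loss(self, s, a, s_next):
        # each member is trained on the usual forward-dynamics error
        x = torch.cat([s, a], dim=-1)
        return sum(((m(x) - s_next) ** 2).mean() for m in self.models)
```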
57
Bayesian Disagreement
 Bayesian surprises are defined by a specific dynamics model
 What happens if we consider a distribution of models? Bayesian surprise of a policy becomes:
P(T) is the transition distribution of the
environment and P(T|𝜙) is the transition
distribution according to the dynamics
model.
Prediction error averaging considers all
transition models and possible
predictions
Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active
Exploration. In Proceedings of the 36th International Conference on Machine Learning,
ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779–5788.
P(S|s,a,t) is the dynamics model learned from a transition
dynamics t
58
How to Compute u?
 The term u(s,a) turns out to be the Jensen-
Shannon Divergence of a set of learned
dynamics from a transition dynamics t
 JSD can be approximated by employing N
dynamics models:
 For each P parameterized by Gaussian
distribution 𝓝i(µi,Σi), we need another layer of
approximation to compute u(s,a) by replacing
the Shannon entropy with Rényi entropy and use
the corresponding Jensen-Rényi Divergence
(JRD)
59
Surprise-based Exploration: Pros and Cons
👍 Dynamics models can be trained easily these
days. There are many works on that topic
👍 Advanced methods can somehow handle Noisy-
TV
❌ Focusing on the forward dynamics error is not
effective in driving the exploration, especially when
the world model is not good and always predicts
wrongly
❌ Advanced methods such as ensembles or
learning progress are compute-expensive to cope
with Noisy-TV
Generated by DALL-E 3
60
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break👈 [We are here]
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
61
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Count-based memory 👈 [We are here]
 Episodic memory
 Hybrid memory
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
62
Novelty via Counting
 Humans want to explore novel places, make new
friends, and buy new stuff. It is inherent for humans to
be motivated by new things
 How to translate this intrinsic motivation to RL agents?
 Tracking the occurrences of a state, N(s), provides a novelty indicator: more occurrences means less novelty
 ri(s,a) = N(s)^(-0.5), where N counts the number of times s appears
❌ Empirical counting in continuous state spaces is impractical due to the rarity of exact state visits, resulting in N(s)=0 most of the time
63
Density-based State Counting
 Use a density function of the state to estimate
its occurrences
 Let ρ(x)=ρ(s=x|s1:n) be the density of state x given s1:n, and ρ’(x)=ρ(s=x|s1:n x) the density of x after observing one more occurrence of x following s1:n
 Define N̂(x) and n̂ as the “pseudo-count” of x and the pseudo-total count, before and after an occurrence of x
Assumption: the true density of x stays the same before and after an occurrence of x
Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.
"Unifying count-based exploration and intrinsic motivation." Advances in neural information processing
systems 29 (2016).
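Under the assumption above (the density of x is consistent before and after one more visit), the pseudo-count can be solved for in closed form, as in Bellemare et al. (2016):

ρ_n(x) = N̂_n(x) / n̂,   ρ'_n(x) = (N̂_n(x) + 1) / (n̂ + 1)   ⟹   N̂_n(x) = ρ_n(x) (1 − ρ'_n(x)) / (ρ'_n(x) − ρ_n(x))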
64
Pseudo State Count
 In practice, in a huge state space, ρ’n(x)≈0, we can
rewrite the pseudo-count:
 PG means predictive gain, which is computed as:
 Resembles the information gain: the difference
between the expectation of a posterior and prior
distribution
 To extend to counting state-action pairs, one can concatenate the action representation with the state representation.
65
Hash Count
 If counting each exact state is challenging, why
not partition the continuous state space into
manageable blocks?
 By using a function 𝜙 mapping a state to a code,
we can count the occurrence of the code instead
of the state
 “Distant” states should be counted separately while “similar” states are merged → use SimHash as the mapping function
 Higher k means fewer collisions, and thus states are more distinguishable
Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman,
Filip DeTurck, and Pieter Abbeel. "# exploration: A study of count-based exploration for deep
reinforcement learning." Advances in neural information processing systems 30 (2017).
sgn is the sign function, A is a k × d matrix with i.i.d. entries drawn from a standard Gaussian distribution, and g is some transformation function
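A minimal SimHash counting sketch in NumPy, taking g to be the identity; k and the bonus scale are illustrative:

```python
import numpy as np
from collections import defaultdict

class SimHashCount:
    """phi(s) = sgn(A g(s)) with A ~ N(0,1)^{k x d}; count codes instead of raw states."""
    def __init__(self, state_dim, k=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))   # larger k => fewer hash collisions
        self.counts = defaultdict(int)

    def bonus(self, s, beta=1.0):
        code = tuple((self.A @ s > 0).astype(int))     # binary code of the state
        self.counts[code] += 1
        return beta / np.sqrt(self.counts[code])       # count-based intrinsic reward
```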
66
Hash Count: Cope with High-dimensional States
 Representation learning is employed to capture a good g through an autoencoder network and reconstruction learning
 The network aims to reconstruct the original
state input s, and the hidden representation b(s)
will be used to compute g(s)=round(b(s))
 Another regularization term prevents the
corresponding bit in the binary code from
flipping throughout the agent's lifetime
67
Change Counting
 To encourage the agent to explore novel
state-action pairs meaningfully, we can
assess changes caused by activities and
prioritize those that signify novelty
 c(s,s’) as the environment change caused
by a transition (s, a, s’)
 Combines state count and change count,
resulting in the intrinsic reward:
Change count (last row) vs norm of change (middle row) vs state count (top row).
Change count suffers less from attracting to meaningless activities.
Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta.
"Interesting object, curious agent: Learning task-agnostic exploration.
" Advances in Neural Information Processing Systems 34 (2021): 20516-20530.
68
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Count-based memory
 Episodic memory👈 [We are here]
 Hybrid memory
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
69
Alternative Novelty Measurement
❌ An inherent constraint of count-based methods lies
in the approximation error between the pseudo-count
and the true count
 Different novelty criteria: novel observations are
those that demand effort to reach, typically beyond
the already explored areas of the environment
 Measure the effort in environmental steps,
estimating it with a neural network that predicts
the steps between two observations
 To capture the explored areas of the environment,
Use an episodic memory initialized empty at the
start of each episode
Novelty through reachability concept. An observation is novel if it can only reach
those in the memory in more than k steps. Savinov, Nikolay, Anton Raichuk,
Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain
Gelly. "Episodic curiosity through reachability." arXiv preprint arXiv:1810.02274
(2018).
if the maximum reachability score of the given observation
is greater than a threshold k, we can regard it as novel
70
Episodic Curiosity: Memory Workflow
 Define the threshold k, use reachability
network to classify whether two observations
are separated by more or less than k steps
 After training, the reachability network is used
to estimate the novelty of the current
observation in the episode given the episodic
memory M, which finally is used to compute
the intrinsic reward
 Using a function F to aggregate the reachability
scores between the current observation and
those in the memory leads to the intrinsic
reward. F can be max or 90-th percentile
71
Explicit Memory of Positions (only for Atari games)
 Collect the agent’s position from game RAM to
indicate where on the grid an agent has visited
 White sections in the curiosity grid (middle)
show which locations have been visited; the
unvisited black sections yield an exploration
bonus when touched.
 The network receives both game input (left) and
curiosity grid (middle) and must learn how to
form a map of where the agent has been
(hypothetical illustration, right)
Stanton, Christopher, and Jeff Clune. "Deep curiosity search: Intra-life exploration can improve
performance on challenging deep reinforcement learning problems." arXiv preprint
arXiv:1806.00553 (2018).
72
Novelty Connection to Surprise
 Theoretically, an overparameterized autoencoder whose task is to reconstruct its input is
equivalent to an associative memory (Adityanarayanan et al., PNAS 2020)
 We can train an autoencoder that takes the state as input to reconstruct and use its
reconstruction error as an indicator of life-long novelty
 Greater errors signify a higher level of novelty in states, indicating that the autoencoder has not
encountered these states frequently enough to learn their successful reconstruction effectively
 Related to RND: the intrinsic reward is still the reconstruction error. But this time, the
reconstructed target is no longer the original input. Instead, it is a transformed version of the
input
73
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Count-based memory
 Episodic memory
 Hybrid memory 👈 [We are here]
 Replay Memory
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
74
Surprise + Novelty
 Never Give Up agent (NGU) combines
existing surprise and novelty components
from the literature cleverly:
1. State representation learning via inverse
dynamics (ICM)
2. Life-long novelty module using RND
3. Episodic novelty using episodic memory
inspired by EC
 The implementation of the episodic
memory in NGU is new.
The dynamics model f is employed to produce the
representations for the novelty modules. Two types of
novelty are combined to produce the final intrinsic reward.
Badia, Adrià Puigdomènech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven
Kapturowski, Olivier Tieleman et al. "Never give up: Learning directed exploration strategies." arXiv
preprint arXiv:2002.06038 (2020).
75
NGU: Episodic Novelty
 Encourages the exploration of novel states
within an episode simply via nearest-neighbor
matching.
 As a result, the agent will not revisit the same
state in an episode twice. This concept is
different from lifelong novelty
 The closer the current state is to its neighbors,
the higher the similarity and thus, the smaller
the reward
 Hybrid Intrinsic Reward:
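A simplified sketch of the episodic bonus and its combination with a lifelong novelty signal, loosely following NGU (the kernel, constants, and the clipping range L are illustrative; the paper's distance normalization is omitted):

```python
import numpy as np

def episodic_novelty(memory, emb, k=10, c=0.001, eps=0.001):
    """Inverse kernel similarity to the k nearest embeddings seen so far this episode."""
    if memory:
        d2 = np.sum((np.asarray(memory) - emb) ** 2, axis=1)
        sim = np.sum(eps / (np.sort(d2)[:k] + eps))   # similarity to nearest neighbours
    else:
        sim = 0.0
    memory.append(emb)                                # memory is reset every episode
    return 1.0 / np.sqrt(sim + c)

def hybrid_reward(r_episodic, alpha_lifelong, L=5.0):
    # a lifelong novelty signal (e.g. RND-based) modulates the episodic bonus
    return r_episodic * np.clip(alpha_lifelong, 1.0, L)
```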
76
Agent57: Exploration at Scale
 An upgraded version of the NGU:
 Splitting the value function into two separate value functions for external and internal rewards
 A population of policies (and value functions) is
trained, each characterized by a distinct pair of
exploration parameters:
 N is the size of the population. 𝛾j is the discount
factor hyperparameters and βj is the intrinsic
reward coefficient hyperparameters. Adapted by a
meta controller (bandit algo.) Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan
Daniel Guo, and Charles Blundell. "Agent57: Outperforming the atari human benchmark." In International
conference on machine learning, pp. 507-517. PMLR, 2020.
77
Agent 57: Atari Benchmark
78
Cluster Memory for Counting
 Parametric methods for counting have problems: slow adaptation and catastrophic forgetting
 Count estimation should provide a long-term visitation-based exploration bonus while retaining responsiveness to the most recent experience → a finite slot-based container M stores representations and their corresponding counters C
 The RECODE memory is updated by either:
 Adding a new embedding (atom) with count 1, or
 Updating the nearest atom and increasing its count
Kernel is non-zero for all
neighbours within a radius
79
RECODE: Representation Learning
 The transformer takes masked sequences of
length k consisting of actions and embedded
observations as inputs and tries to reconstruct
the missing embeddings in the output
 The reconstructed embeddings at time t − 1 and
t are then used to build a 1-step action-
prediction classifier
 Similar to ICM’s inverse dynamics
Saade, Alaa, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo
Sarra, Oliver Groth, Michal Valko, and Bilal Piot. "Unlocking the Power of Representations in Long-term
Novelty-based Exploration." In The Twelfth International Conference on Learning Representations. 2024.
80
Novelty of Surprise
 The norm of the prediction error (surprise norm) is
not good (e.g., as in Noisy-TV)
 A new metric: surprise novelty, the error of
reconstructing surprise (the error of state
prediction)
 This requires a surprise generator, such as a dynamics model, to produce the surprise vector u, i.e., the difference vector between the prediction and the reality
 Then, inter and intra-episode novelty scores are
estimated by a system of memory, called Surprise
Memory (SM), consisting of an autoencoder
network W and episodic memory M, respectively
Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise:
Improving Exploration Through Surprise Novelty. In AAMAS, 2024.
81
Benefit of Hybrid Systems
 Marries the best of both worlds:
 Dynamic Prediction
 Novelty Estimation
 Combine intrinsic rewards:
 Long-term, global, inter-episodes
 Short-term, local, intra-episodes
 Limitation of dynamic prediction using deep models
can be compensated with non-parametric memory
approaches
 Noisy-TV problems are mitigated but not completely
solved
Noisy-TV: a random TV will distract the RL agent from its main
task due to high surprise (source).
82
However, Is Intrinsic Reward Really Good?
Taiga et al. On bonus based exploration methods in the arcade
learning environment. In ICLR 2019
 Intrinsic motivation relies heavily on memory concepts (global, local, …)
 The performance of IR agents is very good, but…
 They require more samples to train (10^9 or 10^10 steps is the norm)
Is the goal of IR to enable sample efficiency?
Overfitting to Montezuma’s Revenge?
Depending on architecture and tuning, standard exploration is generally OK
83
Issues with Intrinsic Motivation
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
84
Reflection on Memory
 Surprise:
 Memory is hidden inside dynamics models,
memorizing the seen observations to make
the prediction available
 This memory is long-term, semantic, and slow to update
 Novelty:
 Memory is obvious as a slot-based matrix,
nearest neighbour estimator, counter ….
 This memory is often short-term, instance-
based and adaptive to changes in the
environments
(Diagram: memory underlies both surprise- and novelty-based intrinsic exploration; can memory drive exploration directly?)
85
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay 👈 [We are here]
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
86
A Direct Exploration Mechanism
 Two major issues have hindered the ability of
previous algorithms to explore:
 Detachment: loses track of interesting areas
to explore from
 Derailment: the exploratory mechanisms of the algorithm prevent it from returning to previously visited states
 The role of memory is simplified:
 Store past states
 Retrieve states to explore
 Replay
(Diagram: Replay Memory → Sampled States → Exploration)
87
Go-Explore
 Detachment can be addressed by memory: keep track of areas by grouping similar states into cells
 Similar to Hash count
 Map a state to a cell
 Each cell has a score indicating sampling probability
 Derailment can be addressed by Simulator (only suitable for
Atari)
 Sample a cell’s state from the memory
 Simulator resets the state of the agent to the cell’s state
 The memory is updated with new cells during exploration
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
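A toy sketch of the Go-Explore archive with count-based cell selection; the cell function and the selection weight are simplified stand-ins for the engineered versions described on the next slide:

```python
import numpy as np

class GoExploreArchive:
    """Map states to cells, remember a state for each cell, and sample rarely visited
    cells to return to (via the simulator) and explore from."""
    def __init__(self, cell_fn, seed=0):
        self.cell_fn = cell_fn          # e.g. downscaled observation -> hashable cell id
        self.cells = {}                 # cell id -> {'state': ..., 'visits': int}
        self.rng = np.random.default_rng(seed)

    def update(self, state):
        cell = self.cell_fn(state)
        entry = self.cells.setdefault(cell, {'state': state, 'visits': 0})
        entry['visits'] += 1

    def sample_cell(self):
        ids = list(self.cells)
        # simplified selection weight: prefer cells that have been visited less often
        w = np.array([1.0 / np.sqrt(self.cells[c]['visits'] + 1) for c in ids])
        return ids[self.rng.choice(len(ids), p=w / w.sum())]
```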
88
Go-Explore: Engineering Details
 State to Cell: downscaled cell with adaptive downscaling
parameters for robustness:
 Calculating how a sample of recent frames would be
grouped into cells
 Selecting the values that result in the best cell
distribution (manually designed)
 The selection probability of a cell at each step is proportional to its selection weight → count-based
 Domain knowledge weight: (1) the number of horizontal neighbours
to the cell present in the archive (h); (2) a key bonus: for each location
 Train from demonstrations: backward algorithm places the agent close
to the end of the trajectory and runs PPO until the performance of the
agent matches that of the demonstration
Cseen is the number of exploration steps in which
that cell is visited
89
Addressing Cell Limitations
❌ Cell design is not obvious; it requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task
 Latent Go-Explore: Go-Explore operates without cells:
 A latent representation is learned simultaneously with the exploration
 Sampling of the final goal is based on a non-parametric density model of the latent space
 Replace simulator with goal-based exploration
Quentin Gallouédec and Emmanuel Dellandréa. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. PMLR, 10571–10586.
90
Latent Go-Explore: Details
 Representation Learning:
 ICM’s Inverse Dynamics
 Forward Dynamics
 Vector Quantized Variational Autoencoder
Reconstruction
 Density Estimation to sample goals:
 Goals must be at the edge of the yet-unexplored areas
 Goals must be reachable (already visited)
 Use a particle-based entropy estimator to estimate a density score and rank
Goal sampling follows a geometric law on the rank, where p is a hyperparameter. The higher the rank (denser region), the less novel → sampled less often
91
Go-Explore Family is current SOTA on Atari
92
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay👈 [We are here]
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
93
Imitation Learning: Exploration via Exploitation
 Exploiting past good experiences can improve learning
 Self-imitation learning imitates the agent’s own past good decisions
 Memory is a replay buffer that stores experiences
 A policy is learned to imitate state-action pairs in the replay buffer only when the return in the past episode is greater than the agent’s value estimate (performance-based)
 If the past return is greater than the agent’s value estimate (R > Vθ), the agent learns to choose the action it chose in the past in the given state (see the loss sketch below).
Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In International
conference on machine learning, pp. 3878-3887. PMLR, 2018.
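The corresponding self-imitation losses clip the advantage at zero, so only transitions whose past return exceeds the current value estimate contribute (up to coefficients, as in Oh et al., 2018):

L_policy = − log π_θ(a|s) · (R − V_θ(s))_+ ,   L_value = ½ ‖(R − V_θ(s))_+‖² ,   where (x)_+ = max(x, 0)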
94
Goal-based Exploration Using Memory
 Generate new trajectories visiting novel states
by editing or augmenting the trajectories stored
in the memory from past experiences
 A sequence-to-sequence model with an
attention mechanism learns to ‘translate’ the
demonstration trajectory to a sequence of
actions and generate a new trajectory
Sample using count-based novelty
Insert a new trajectory if its ending differs significantly
Otherwise, replace with the higher-return trajectory
Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee.
2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural
Information Processing Systems 33 (2020), 4333–4345.
Diverse Trajectory-
conditioned Self-
Imitation Learning
(DTSIL)
95
DTSIL: Policy Learning
 Train a trajectory-conditioned policy πθ(at|e≤t, ot, g) that
should flexibly imitate any given trajectory g
 To imitate, assign rim as the imitation reward (0.1) if the
state is similar
 After visiting the last (non-terminal) state in the
demonstration, the agent performs random exploration
(r=0) to encourage exploration
 Policy Gradient training:
Further imitation encouragement
96
DTSIL: Performance when Combined with Count
97
Replay Memory: Pros and Cons
👍 Replay memory provides a direct exploration
mechanism without intrinsic reward
👍 Sampling strategies are built upon previous
works
❌ Make additional assumptions such as simulator
or the availability of demonstrations
❌ Often requires a goal-conditioned policy and multiple sub-training stages
Generated by DALL-E 3
98
End of Part B
 QA
 Demo
Generated by DALL-E 3
Advanced Topics
PART C
Generated by DALL-E 3
100
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration👈 [We are here]
 Language-assisted RL
 LLM-based exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
Beyond the state space: Language-guided exploration
 Why language?
 Humans are able to learn quickly
in new environments due to a
rich set of commonsense priors
about the world
→ reflected in language
 Read the game instructions to avoid trial and error.
Abstraction
Compositional
Generalization
101
102
Luketina, Jelena, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas,
Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. "A survey of reinforcement
learning informed by natural language." arXiv preprint arXiv:1906.03926 (2019).
How Language can be used in RL
 Make a cake from tools and materials
 A pure RL agent needs to try thousands of settings until it finds the desired characteristics
 If the RL agent reads and follows the recipe, it may take one trial to succeed
103
https://www.moresteam.com/toolbox/design-of-experiments.cfm
A more practical use case
104
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Language-assisted RL 👈 [We are here]
 LLM-based exploration
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
105
Gated-attention for Task-oriented Language
Grounding
 Task-oriented language grounding: extract meaningful representations from natural language instructions and map them to visual elements and actions
 Task: given an initial image (in pixels) and an instruction -> guide the agent to move towards the desired object
 Two main modules:
 State Processing: process image and language jointly to obtain the state
 Policy Learning: use a policy to map states to corresponding actions
Generated by DALL-E 3
106
State Processing Module
 Use Gated-Attention instead of concatenation to jointly represent image and language information as one state
 The language instruction goes through a fully-connected layer to match the image feature dimension, producing the Attention Vector
 Each element of the Attention Vector is expanded to an (H × W) matrix to match the feature map of each image channel
 The final representation is obtained via the element-wise product of the image and language representations (see the sketch below)
Generated by DALL-E 3
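A sketch of the gated-attention fusion in PyTorch; the shapes follow the description above, and a sigmoid is assumed to produce the per-channel gates:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Fuse an image feature map with a language instruction via element-wise gating."""
    def __init__(self, lang_dim, n_channels):
        super().__init__()
        self.fc = nn.Linear(lang_dim, n_channels)   # attention vector: one gate per channel

    def forward(self, img_feat, lang_emb):
        # img_feat: (B, C, H, W), lang_emb: (B, lang_dim)
        gate = torch.sigmoid(self.fc(lang_emb))      # (B, C)
        gate = gate.unsqueeze(-1).unsqueeze(-1)      # broadcast over the H x W feature map
        return img_feat * gate                       # element-wise product
```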
107
Policy Module
 Actions:
 Turn (left, right)
 Move forward
 Policy Architecture: two variants
 Behaviour Cloning: Uses target and object locations and
orientations in every state to select optimal action
 A3C: use a deep neural network to learn policy and value
function. Gives action based on policy 𝜋(𝑎|𝐼𝐿, 𝐿)
Generated by DALL-E 3
108
Results
Generated by DALL-E 3
- With Gated-Attention, agent learns faster and achieves better accuracy (success rate of reaching correct object before
episode terminates)
- As environment gets harder, more exploration is needed -> A3C with GA performs better than Imitation Learning, where
little exploration is done
109
Semantic Exploration from Language Abstractions and
Pretrained Representations
 Novelty-based exploration methods suffer in high-dimensional
visual state spaces
 e.g.: different viewpoints of one place in 3D can map to distinct
visual states/ features, despite being semantically similar
 Language can be a useful abstraction for exploration, as it
coarsens the state space in a way that reflects the semantics of
environment
 Solution: Use vision-language pretrained representations to
guide semantic exploration in 3D
Generated by DALL-E 3
Example of a state (picture) with language
description (caption). Note how the
caption focuses on important aspects of
the state
Example of how many states can be
conveyed with one text caption
110
Intrinsic Reward Design
 State 𝑠𝑡: embedding of Image, denoted as 𝑂𝑉
 Goal: Described by a text instruction, denoted as 𝑔
 Caption is encoded by a pretrained language
encoder, output embedding is denoted as 𝑂𝐿; only
used to calculate intrinsic reward. Note that agent
never observes 𝑂𝐿.
 Intrinsic reward is goal-agnostic; computed with
access to state representation (either 𝑂𝑉 or 𝑂𝐿)
 Add intrinsic reward for two exploration algorithms:
 Never Give Up (NGU; Badia et al.)
 Random Network Distillation (RND; Burda et al.)
Generated by DALL-E 3
Badia, Adrià Puigdomènech, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038 (2020).
Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint
arXiv:1810.12894, 2018.
111
Never Give Up (NGU)
 State representations (either O_L or O_V, used to compute the intrinsic reward) along the trajectory are written to a memory buffer
 Novelty is a function of the L2 distance between the current state and its k-nearest neighbours in the buffer. The intrinsic reward is higher for larger distances (see the sketch below)
 To influence exploration: modify the embedding function
 Originally, the embedding function is learned
 Variants:
 Vis-NGU & LSE-NGU: use visual embeddings (O_V)
 Lang-NGU: use language embeddings (O_L)
Generated by DALL-E 3
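A minimal NumPy sketch of the episodic-novelty bonus just described: the mean L2 distance to the k nearest stored embeddings. The function name, the default k, and the bare mean-distance form (without NGU's kernel and normalization) are simplifying assumptions.

```python
import numpy as np

def episodic_novelty(embedding, memory, k=10):
    """Intrinsic bonus from distances to the k nearest stored embeddings.

    embedding: (D,) current state representation (e.g. O_V or O_L)
    memory:    (N, D) representations written along the trajectory
    A larger mean distance to the k nearest neighbours -> larger bonus.
    """
    if len(memory) == 0:
        return 1.0  # everything is novel at the start of an episode
    dists = np.linalg.norm(memory - embedding, axis=1)   # L2 distances
    knn = np.sort(dists)[: min(k, len(dists))]           # k nearest states
    return float(knn.mean())                             # bonus grows with distance
```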
112
Random Network Distillation (RND)
 The intrinsic reward is the prediction error of a trainable
network regressing onto the output of a frozen, randomly
initialized target network (see the sketch below)
 The trainable (predictor) network is trained independently of the
policy network
 As training progresses, frequently visited states yield
less intrinsic reward because their prediction error shrinks
Generated by DALL-E 3
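A minimal PyTorch sketch of the RND bonus just described; the network sizes, learning rate, and the per-batch predictor update shown here are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 64, 32
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target network stays frozen and random
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus(obs):
    """Intrinsic reward = error of the predictor on the frozen target's output."""
    with torch.no_grad():
        tgt = target(obs)
    err = ((predictor(obs) - tgt) ** 2).mean(dim=-1)   # per-state prediction error
    opt.zero_grad()
    err.mean().backward()   # train the predictor; visited states become less novel
    opt.step()
    return err.detach()     # use the (detached) error as the intrinsic reward
```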
113
Results
 Example of how language representations
(orange line) can help the agent explore better
than visual representations in a 3D
environment.
 With visual representations, the agent
struggles with different views of the same
scene.
Generated by DALL-E 3
114
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Language-assisted RL
 LLM-based exploration👈 [We are here]
 Causal discovery for exploration
 Closing Remarks
 QA and Demo
115
ELLM: Guiding Pretraining in RL with LLM
 Many distinct actions can lead to similar outcomes -> Intrinsically Motivated RL (IM-RL):
explore outcomes rather than actions
 Competence-based IM (CB-IM): maximize the diversity of skills mastered by the agent
 CB-IM aims to optimize an intrinsic return R_int (a general form is sketched after this list)
 Given that, CB-IM algorithms train a goal-conditioned policy π(a | o, g) that maximizes R_int
 The goal distribution G and R_int(o, a, o′ | g) must be defined so that three properties are satisfied:
1. Diverse
2. Common-sense sensitive (e.g., chop a tree > drink a tree)
3. Context sensitive (e.g., only chop the tree when the tree is in view)
Generated by DALL-E 3
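A sketch of the general shape of the CB-IM objective referenced above; the discounting and the exact expectation structure are illustrative assumptions rather than the paper's precise formula.

```latex
\max_{\pi} \;\; \mathbb{E}_{g \sim G}\,
\mathbb{E}_{\pi(\cdot \mid o, g)}\!\left[ \sum_{t} \gamma^{t}\,
R_{\mathrm{int}}\!\left(o_t, a_t, o_{t+1} \mid g\right) \right]
```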
116
Why ELLM?
 Previous methods hand-define R_int and G, and use various motivations to guide goal
sampling g ~ G: novelty, learning progress, intermediate difficulty
 ELLM: alleviates the need for environment-specific hand-coded definitions of R_int and
G by using a language model (LM):
 Language-based goal representations
 Language-model-based goal generation
Generated by DALL-E 3
117
Architecture
 Goal representation: prompt the LLM with the available actions and a description of the current observation ->
construct the prompt -> the LLM generates goals
 Open-ended goal generation: ask the LLM to generate goals freely
 Closed-form: ask the LLM yes/no questions (e.g., should the agent do X?)
Generated by DALL-E 3
118
Intrinsic Reward Design
 Reward LLM goals: use the similarity between a generated
goal and the description of the agent's transition, C_transition
 If there are multiple goals -> reward the agent with the
goal most similar to the transition description (see the sketch below):
Generated by DALL-E 3
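A minimal sketch of this similarity-based reward, assuming cosine similarity between sentence embeddings and a max over the suggested goals; the embed() encoder and the threshold value are illustrative assumptions, not ELLM's exact implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def ellm_style_reward(goal_texts, transition_caption, embed, threshold=0.5):
    """Intrinsic reward = similarity of the transition caption to the closest LLM goal.

    embed: a sentence encoder mapping text -> vector (assumed given).
    """
    c = embed(transition_caption)
    sims = [cosine(embed(g), c) for g in goal_texts]   # similarity to each suggested goal
    best = max(sims) if sims else 0.0
    return best if best > threshold else 0.0           # reward only sufficiently close matches
```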
119
ELLM: Results
Generated by DALL-E 3
120
Intrinsically Guided Exploration from LLMs (IGE-LLMs)
 For long, sequential tasks with sparse rewards, an intrinsic reward can help guide policy learning
towards exploration, alongside the extrinsic reward that remains the main policy driver
 An LLM can be used as an evaluator of the potential future reward of every action a, mapping each (s, a)
pair directly to an intrinsic reward r_i (a sketch of the reward combination follows below)
 The total reward is given as r_c = r_e + λ_i · r_i · w_i, where r_e is the external reward, λ_i is a controlling
factor and w_i is a linearly decaying weight
Generated by DALL-E 3
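A minimal sketch of combining an LLM-rated intrinsic reward with the extrinsic reward under a linearly decaying weight, following the formula above; the decay schedule and the hypothetical llm_rate_action helper are assumptions for illustration.

```python
def combined_reward(r_ext, r_int, step, lam=0.1, decay_steps=100_000):
    """r_c = r_e + lambda_i * r_i * w_i with a linearly decaying weight w_i."""
    w = max(0.0, 1.0 - step / decay_steps)   # linearly decaying weight
    return r_ext + lam * r_int * w

# r_int would come from the LLM's rating of the (state, action) pair, e.g.:
# r_int = llm_rate_action(state_description, action)   # hypothetical helper
```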
121
Prompting
Generated by DALL-E 3
 Example of an input prompt used to evaluate possible
actions in the DeepSea environment
 The LLM is given the current position and the possible
actions, and is asked to rate every possible next action
122
Benefit of Intrinsic Reward
 The LLM improves traditional exploration
methods
 Using the LLM to generate actions directly
exhibits significant errors (grey lines in
the right graph), even with advanced
LLMs (GPT-4)
 However, when the LLM is used only to shape the
intrinsic reward, it helps exploration, especially
in harder environments, and results in
better performance
Generated by DALL-E 3
123
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration👈 [We are here]
 Statistical approaches
 Deep learning approaches
 Closing Remarks
 QA and Demo
124
Deakin University CRICOS Provider Code: 00113B
What is causality?
 The relationship between cause and effect.
 Two fundamental questions:
 Causal discovery: what evidence is required to infer
cause-effect relationships?
 Causal inference: given causal information,
what inferences can be drawn?
 Structural Causal Models (SCMs) framework
(Pearl, 2009a) (the standard form is recalled below)
 Causal graph
Figure 1: Example of SCM and causal graph for the scurvy problem.
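For reference, the standard SCM form in textbook notation (not taken from the slides): each variable is a function of its parents in the causal graph and an independent exogenous noise term.

```latex
X_i \;:=\; f_i\!\left(\mathrm{PA}_i,\, U_i\right), \qquad i = 1, \dots, n
```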
125
Deakin University CRICOS Provider Code: 00113B
Why causality and RL?
 Understanding cause-effect relationships reduces unnecessary exploratory actions, and thus improves sample
efficiency.
 Ex: do not move toward the door before obtaining the key.
 Improves interpretability.
 Ex: why does the policy prioritize obtaining the key?
 Improves generalizability.
126
Deakin University CRICOS Provider Code: 00113B
Interpreting Causality in RL Environment
 Taking action A can affect the reward R.
 State S is the context variable that affects both
action A and reward R.
 The most common is .
 U is an unknown confounder variable.
 Methods can be categorized by the techniques used to improve
exploration or to measure causality:
 statistical vs. deep learning methods.
Figure 2: Causality in the RL environment.
127
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Statistical approaches👈 [We are here]
 Deep learning approaches
 Closing Remarks
 QA and Demo
128
Deakin University CRICOS Provider Code: 00113B
Causal influence detection for improving
efficiency in reinforcement learning.
Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal influence detection for
improving efficiency in reinforcement learning. Advances in Neural
Information Processing Systems, 34, 22905-22918.
129
Deakin University CRICOS Provider Code: 00113B
Causal Action Influence Detection (CAI)
 As mentioned previously, .
 Decompose the state S into N components.
 One-step transition graph:
 How do we detect when the action influences the next
state S'?
Figure 3: Global causal graph (fully connected).
Figure 4: Example of situation-dependent control.
130
Deakin University CRICOS Provider Code: 00113B
Causal Action Influence Detection (CAI) (cont.)
 Conditional Mutual Information (CMI): quantifies how strongly the action influences a next-state component, given the current state (a sketch of the quantity follows below)
 Estimation of CAI: estimate a forward model
from data and use it to approximate the CMI
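The following sketch writes CAI as the standard conditional mutual information between the action and the j-th next-state component; treating the inner distribution P(S'_j | s, a) as a learned forward model is an assumption about how the estimate is formed, and the paper's exact estimator may differ.

```latex
C^{j}(s) \;=\; I\!\left(S'_{j};\, A \mid S = s\right)
\;=\; \mathbb{E}_{a}\!\left[\, D_{\mathrm{KL}}\!\Big( P\big(S'_{j} \mid s, a\big) \,\Big\|\, P\big(S'_{j} \mid s\big) \Big) \right],
\qquad P\big(S'_{j} \mid s\big) = \mathbb{E}_{a'}\!\left[ P\big(S'_{j} \mid s, a'\big) \right]
```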
131
Deakin University CRICOS Provider Code: 00113B
Using CAI to Improve exploration in RL?
 CAI as an intrinsic reward.
 Active exploration with CAI.
 CAI experience replay.
 Experiments on three environments: FetchPush,
FetchPickAndPlace, FetchRotTable. The goal is the
coordinates at which the object must be placed.
 Baseline RL algorithm: DDPG + HER.
Figure 5: FetchPickAndPlace Environment.
Figure 6: FetchRotTable Environment.
132
Deakin University CRICOS Provider Code: 00113B
CAI as Intrinsic Reward
 Use CAI as a reward signal.
 Use it on its own or combined with the task reward.
Figure 7: Bonus reward improves performance on
FetchPickAndPlace.
133
Deakin University CRICOS Provider Code: 00113B
Active Exploration with CAI
 Replace random exploration with causal exploration.
 Choose the action with the highest contribution to the CAI
calculation.
Figure 8: Performance of active exploration in
FetchPickAndPlace depending on the fraction of
exploratory actions chosen actively from a total of 30%
(epsilon) exploratory actions.
Figure 9: Experiment comparing exploration
strategies on FetchPickAndPlace. The combination of
active exploration and reward bonus yields the largest
sample efficiency.
134
Deakin University CRICOS Provider Code: 00113B
CAI Experience Replay
 Episodes are chosen for replay from the buffer
guided by their causal (inverse) ranking (a sketch follows below).
 This ranking determines the probability of sampling any state from
episode i (of M episodes) in the replay buffer
(with T the episode length).
Figure 10: Comparison of CAI-P with baselines (energy-based method with privileged
information (EBP), prioritized experience replay (PER), and HER without prioritization)
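A minimal sketch of rank-based episode prioritization of the kind described above, assuming each episode carries a scalar causal-influence score; the exponent alpha and the exact weighting used by CAI-P are not reproduced here.

```python
import numpy as np

def episode_sampling_probs(cai_scores, alpha=1.0):
    """Rank-based sampling probabilities: episodes with higher causal
    influence get a lower rank index and hence a higher probability."""
    cai_scores = np.asarray(cai_scores, dtype=float)
    ranks = np.empty_like(cai_scores)
    ranks[np.argsort(-cai_scores)] = np.arange(1, len(cai_scores) + 1)  # rank 1 = most influential
    weights = 1.0 / ranks ** alpha
    return weights / weights.sum()

# Example: sample an episode index, then a state uniformly inside it.
probs = episode_sampling_probs([0.8, 0.1, 0.4])
episode_idx = np.random.choice(len(probs), p=probs)
```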
135
Tutorial Outline
 Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
 Welcome and Introduction
 Reinforcement Learning Basics
 Exploring Challenges in Deep RL
 QA and Demo
 Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
 Principles and Frameworks
 Deliberate Memory for Surprise-driven Exploration
 Forward dynamics prediction
 Advanced dynamics-based surprises
 Ensemble and disagreement
 Break
 RAM-like Memory for Novelty-based
Exploration
 Replay Memory
 Novelty-based Replay
 Performance-based Replay
 QA and Demo
 Part C: Advanced Topics (60 minutes)
 Language-guided exploration
 Causal discovery for exploration
 Statistical approaches
 Deep learning approaches👈 [We are
here]
 Closing Remarks
 QA and Demo
136
Deakin University CRICOS Provider Code: 00113B
Causality-driven hierarchical structure
discovery for reinforcement learning.
Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... & Chen, Y. (2022).
Causality-driven hierarchical structure discovery for reinforcement
learning. Advances in Neural Information Processing Systems, 35, 20064-
20076.
137
Deakin University CRICOS Provider Code: 00113B
Structural Causal Representation Learning
 Environment with multiple objects.
 Ex: having wood and stone allows making an axe.
 How can we measure causality between these
objects?
 Model the SCM of the objects between adjacent
timesteps.
Figure 11: Example of environment with
multi-objects and causal graph
138
Deakin University CRICOS Provider Code: 00113B
Structural Causal Representation Learning (cont.)
 Simpler case with 4 objects A, B, C, D.
 A is the object of interest.
 Need a forward/transition model, with its own learnable parameters.
 Need a masking function (otherwise we don't know
which objects affect A).
 The mask is parameterized over the M objects, where M is
the number of objects (a sketch follows below).
Figure 12: Example of SCM representation
learning (with interested object A).
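A minimal PyTorch-style sketch of a masked forward model of the kind described: a learnable soft mask over objects gates which object features feed the prediction for the object of interest. The parameterization, network sizes, and the thresholding rule are illustrative assumptions; the paper's exact alternating optimization is not reproduced.

```python
import torch
import torch.nn as nn

class MaskedForwardModel(nn.Module):
    """Predict the next state of one object from a masked set of objects."""
    def __init__(self, num_objects, obj_dim, hidden=64):
        super().__init__()
        # One learnable logit per candidate parent object (the "mask")
        self.mask_logits = nn.Parameter(torch.zeros(num_objects))
        self.net = nn.Sequential(
            nn.Linear(num_objects * obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obj_dim),
        )

    def forward(self, obj_states):
        # obj_states: (B, num_objects, obj_dim)
        mask = torch.sigmoid(self.mask_logits)        # soft mask in [0, 1]
        gated = obj_states * mask.view(1, -1, 1)      # zero out non-parent objects
        return self.net(gated.flatten(start_dim=1))   # predicted next state of object A

# After training, a candidate edge j -> A can be read off by thresholding sigmoid(mask_logits[j]).
```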
139
Deakin University CRICOS Provider Code: 00113B
Structural Causal Representation Learning (cont.)
 Iterative process.
 Fix one set of parameters (model or mask) while optimizing the other.
 After optimization finishes, extract the edges from the learned mask:
140
Deakin University CRICOS Provider Code: 00113B
CDHRL Framework
Figure 14: CDHRL Framework.
141
Deakin University CRICOS Provider Code: 00113B
Hierarchical Causal Subgoal Training
 Whenever new subgoals are added, train a policy for them.
 Decide whether a subgoal is reachable from the
current state with the current policy.
 If the subgoal is reachable within a certain number of timesteps,
add it to the subgoal set.
Figure 13: Example of subgoal hierarchy
given causal graph.
142
Deakin University CRICOS Provider Code: 00113B
Figure 16: Results on Minigrid-2d (left) and Eden (right).
Figure 15: Environment Minigrid-2d (left) and Eden (right).
 An upper controller
policy is trained to
select subgoals from the
current subgoal set and
maximize the task reward.
 The upper controller is a
multi-level DQN with
HER.
143
Deakin University CRICOS Provider Code: 00113B
Disentangling causal effects for
hierarchical reinforcement learning.
Corcoll, O., & Vicente, R. (2020). Disentangling causal effects for hierarchical
reinforcement learning. arXiv preprint arXiv:2010.01351.
144
Deakin University CRICOS Provider Code: 00113B
Controlled Effect Disentanglement
 The total effect, i.e., the change in environment state,
combines dynamic effects and controllable effects.
 Is the next state an outcome of the action,
or did it just happen by accident?
 We care about the controllable effects.
 Based on the Average Treatment Effect (ATE), the controllable
effect is separated from the "normality", the effect expected
regardless of the agent's action (see the sketch below).
Figure 17: The relationship between total effects, dynamic
effects, and controllable effects.
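A sketch of this disentanglement in ATE-like form, under the assumption that the controllable effect is what remains of the total effect after subtracting the "normality" (the effect expected over actions); the paper's exact estimator may differ.

```latex
e^{\mathrm{total}}(s, a) = s' - s, \qquad
e^{c}(s, a) \;\approx\; e^{\mathrm{total}}(s, a)
\;-\; \underbrace{\mathbb{E}_{a'}\!\left[ e^{\mathrm{total}}(s, a') \right]}_{\text{normality}}
```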
145
Deakin University CRICOS Provider Code: 00113B
 We cannot calculate the total effect for every action.
 Use a neural network to estimate it.
 Learn a vector representation of the effect.
Figure 18: Total effects modelling architecture.
146
Deakin University CRICOS Provider Code: 00113B
Exploration with controllable effects as goal
Diagram components (Figure 19): an effect-sampling policy, an action-taking policy, and a model of the distribution of effects.
Figure 19: Components of causal effects for hierarchical
reinforcement learning.
147
Deakin University CRICOS Provider Code: 00113B
Controllable Effect Distribution Learning
 Train a Variational Autoencoder (VAE).
 Approximate the distribution of controllable effects.
Figure 20: VAE architecture to learn effect distribution.
148
Deakin University CRICOS Provider Code: 00113B
Training to Select Goal and Reach Goal
 Train the goal-selection policy using DQN on its collected data.
 Use it to select a sub-effect as the subgoal.
 Train the goal-reaching policy using DQN on its collected data.
Figure 21: Architecture learning to select effects as
subgoals.
Figure 22: Architecture learning to select actions to reach
subgoals.
149
Deakin University CRICOS Provider Code: 00113B
3 levels of task difficulty:
 Task T: go to the target location.
 Task BT: go to the target location
while carrying a ball.
 Task CBT: pick up the ball, put it in the
chest, and go to the target.
Figure 23: Comparison with the DQN baseline on the 3 tasks.
CEHRL can learn the complex task, while DQN cannot.
Figure 24: Random-effect vs. random-action
exploration.
150
Deakin University CRICOS Provider Code: 00113B
End of Part C
 QA
 Demo
Generated by DALL-E 3
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity

More Related Content

Similar to Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity

An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Universitat Politècnica de Catalunya
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
What is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfWhat is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfAiblogtech
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Universitat Politècnica de Catalunya
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Jisu Han
 
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learningSKS
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxarchayacb21
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
Reinforcement Learning.pdf
Reinforcement Learning.pdfReinforcement Learning.pdf
Reinforcement Learning.pdfhemayadav41
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Flavian Vasile
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learningTonmoy Bhagawati
 
Presentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaPresentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaLuca Marignati
 

Similar to Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity (20)

An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Hierarchical Object Detection with Deep Reinforcement Learning
Hierarchical Object Detection with Deep Reinforcement LearningHierarchical Object Detection with Deep Reinforcement Learning
Hierarchical Object Detection with Deep Reinforcement Learning
 
What is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfWhat is Reinforcement Learning.pdf
What is Reinforcement Learning.pdf
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona ...
 
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
Preference learning for guiding the tree searches in continuous POMDPs (CoRL ...
 
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learning
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement Learning.pdf
Reinforcement Learning.pdfReinforcement Learning.pdf
Reinforcement Learning.pdf
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Presentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaPresentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in Informatica
 

More from Hung Le

Memory-based Reinforcement Learning
Memory-based Reinforcement LearningMemory-based Reinforcement Learning
Memory-based Reinforcement LearningHung Le
 
Memory for Lean Reinforcement Learning.pdf
Memory for Lean Reinforcement Learning.pdfMemory for Lean Reinforcement Learning.pdf
Memory for Lean Reinforcement Learning.pdfHung Le
 
Episodic Policy Gradient Training
Episodic Policy Gradient TrainingEpisodic Policy Gradient Training
Episodic Policy Gradient TrainingHung Le
 
Model Based Episodic Memory
Model Based Episodic MemoryModel Based Episodic Memory
Model Based Episodic MemoryHung Le
 
Self-Attentive Associative Memory
Self-Attentive Associative MemorySelf-Attentive Associative Memory
Self-Attentive Associative MemoryHung Le
 
Neural Stored-program Memory
 Neural Stored-program Memory Neural Stored-program Memory
Neural Stored-program MemoryHung Le
 

More from Hung Le (6)

Memory-based Reinforcement Learning
Memory-based Reinforcement LearningMemory-based Reinforcement Learning
Memory-based Reinforcement Learning
 
Memory for Lean Reinforcement Learning.pdf
Memory for Lean Reinforcement Learning.pdfMemory for Lean Reinforcement Learning.pdf
Memory for Lean Reinforcement Learning.pdf
 
Episodic Policy Gradient Training
Episodic Policy Gradient TrainingEpisodic Policy Gradient Training
Episodic Policy Gradient Training
 
Model Based Episodic Memory
Model Based Episodic MemoryModel Based Episodic Memory
Model Based Episodic Memory
 
Self-Attentive Associative Memory
Self-Attentive Associative MemorySelf-Attentive Associative Memory
Self-Attentive Associative Memory
 
Neural Stored-program Memory
 Neural Stored-program Memory Neural Stored-program Memory
Neural Stored-program Memory
 

Recently uploaded

Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfSkillCertProExams
 
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxThe Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxMogul Press
 
ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024SkillCertProExams
 
Breathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxBreathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxFamilyWorshipCenterD
 
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfMicrosoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfSkillCertProExams
 
Understanding Poverty: A Community Questionnaire
Understanding Poverty: A Community QuestionnaireUnderstanding Poverty: A Community Questionnaire
Understanding Poverty: A Community Questionnairebazilnaeem7
 
OC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa AnaOC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa AnaRahsaan L. Browne
 
Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.bazilnaeem7
 
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfKinben Innovation Private Limited
 
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxDAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxFamilyWorshipCenterD
 

Recently uploaded (10)

Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
 
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxThe Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
 
ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024
 
Breathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxBreathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptx
 
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfMicrosoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
 
Understanding Poverty: A Community Questionnaire
Understanding Poverty: A Community QuestionnaireUnderstanding Poverty: A Community Questionnaire
Understanding Poverty: A Community Questionnaire
 
OC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa AnaOC Streetcar Final Presentation-Downtown Santa Ana
OC Streetcar Final Presentation-Downtown Santa Ana
 
Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.
 
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
 
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxDAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
 

Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity

  • 1. AAMAS’24 Tutorial 9 Unlocking Exploration: Self-Motivated Agents Thrive on Memory- Driven Curiosity
  • 2. 2 Why This Topic?  Deep Learning in RL: Deep RL Agents excel in complex tasks but need many interactions to learn  Practicality Issues: Extensive learning steps hinder RL's use in real-world scenarios  Exploration Optimization: Improving exploration is key for RL's real-world application  Memory and Learning: Memory-based exploration can speed up learning and advance AI Generated by DALL-E 3
  • 3. 3 About Us  Authors: Hung Le, Hoang Nguyen and Dai Do  Our lab: A2I2, Deakin University  Hung Le is a research lecturer at Deakin University, leading research on deep sequential models and reinforcement learning  Hoang Nguyen is a second-year PhD student at A2I2, specializing in reinforcement learning and causality  Dai Do is a second-year PhD student at A2I2, specializing in reinforcement learning and large language models A2I2 Lab Foyer, Waurn Ponds
  • 4. 4 Deakin University CRICOS Provider Code: 00113B About the Tutorial This tutorial is based on my previous presentations, expanding on topics covered in the following talks:  Memory-Based Reinforcement Learning. AJCAI’22  Memory for Lean Reinforcement Learning. FPT Software AI Center. 2022  Neural machine reasoning. IJCAI’21  From deep learning to deep reasoning. KDD’21  My Blogs: https://hungleai.substack.com/ Generated by DALL-E 3
  • 5. 5 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction 👈 [We are here]  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 6. Reinforcement Learning Fundamentals and Exploration Inefficiency PART A Generated by DALL-E 3
  • 7. 7 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Key components and frameworks 👈 [We are here]  Classic exploration  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 8. 8 Reinforcement Learning Basics  In reinforcement learning (RL), an agent interacts with the environment, taking actions a, receiving a reward r, and moving to a new state s  The agent is tasked with maximizing the accumulated rewards or returns R over time by finding optimal actions (policy)
  • 9. 9 Reinforcement Learning Concepts  Policy π : maps state s to action a  Return (discounted) G or R: the cumulative (weighted) sum of rewards  State value function V: the expected discounted return starting with state s following policy π  State-action value function Q: the expected return starting from , taking the action , and thereafter following policy π
  • 10. Classic RL algorithms: Value learning 10 Q-learning (temporal difference-TD) Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8, no. 3 (1992): 279-292. Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine learning 8, no. 3 (1992): 229-256.  Basic idea: before finding optimal policy, we find the value function  Learn (action) value function:  V(s)  Q(s,a)  Estimate V(s)=E(∑R from s)  Estimate Q(s,a)=E(∑R from s,a)  Given Q(s,a) → choose action that maximizes the value (ε-greedy policy) RL algorithms: Q-Learning
  • 11. Classic RL algorithm: Policy gradient  Basic idea: directly optimise the policy as a function of states  Need to estimate the gradient of the objective function E(∑R) w.r.t the parameters of the policy  Focus on optimisation techniques 11 REINFORCE (policy gradient) RL algorithms: Policy Gradient
  • 12. General RL algorithms 12 Q-Learning vs Policy Gradient Both require exploration to collect data for training
  • 13. 13 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Key components and frameworks  Classic exploration 👈 [We are here]  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 14. 14 Deakin University CRICOS Provider Code: 00113B ε-greedy  ε-greedy is the simplest exploration strategy that works in theory  It heavily relies on pure randomness and biased estimates of action values Q, and thus is sample-inefficient in practice  We often go with what we assume is best, but sometimes, we take a random chance to explore other options. This is one example of an optimistic strategy It is used in Q-learning
  • 15. 15 Deakin University CRICOS Provider Code: 00113B ε-greedy: Problems  It is not surprising why this strategy might struggle in real-world scenarios: being overly optimistic when your estimation is imprecise can be risky.  It may lead to getting stuck in a local optimum and missing out on discovering the global one with the highest returns. Benchmarking ε-greedy (red line) and other exploration method on Montezuma’s Revenge. Taïga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G. Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning environment." arXiv preprint arXiv:1908.02388 (2019). ε-greedy
  • 16. 16 Deakin University CRICOS Provider Code: 00113B Upper Confidence Bound (UCB)  One way to address the problem of over- optimism is to consider the uncertainty of the estimation  We do not want to miss an action with a currently low estimated value and high uncertainty, as it may possess a higher value:  What we need is to guarantee: Hoeffding’s Inequality How to estimate uncertainty? Implicitly, with a large value of the exploration- exploitation trade-off parameter c, the chosen action is more likely to deviate from the greedy action, leading to increased exploration. t -c
  • 17. 17 Deakin University CRICOS Provider Code: 00113B Thompson Sampling  When additional assumptions about the reward distribution are available, an action can be chosen based on the probability that it is optimal (probability matching strategy)  Thompson sampling is one way to implement the strategy: 1. Assume the reward follows a distribution p(r|a, θ) where θ is the parameter whose prior is p(θ) 2. Given the set of past observations Dt is made of triplets {(ai, ri)|i=1,2..,t}, we update the posterior using Bayes rule 3. Given the posterior, we can estimate the action value 4. We can compute the probability of choosing action a 2 3 4
  • 18. 18 Deakin University CRICOS Provider Code: 00113B Information Gain  Information Gain (IG) measures the change in the amount of information (measured in entropy H) of a latent variable  The latent variable often refers to the parameter of the model θ after seeing observation (e.g., reward r) caused by some action a  A big drop in the entropy means the observation makes the model more predictable and less uncertain  Our goal is to find a harmony between minimizing expected regret in the current period and acquiring new information about the observation model Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information- directed sampling." Advances in Neural Information Processing Systems 27 (2014).
  • 19. 19 Deakin University CRICOS Provider Code: 00113B Application: Multi-arm bandit  There are multiple actions to take (bandit’s arm)  After taking one action, agent observes a reward  Maximizing cumulated rewards or minimizing cumulated regrets. Generated by DALL-E 3
  • 20. 20 Deakin University CRICOS Provider Code: 00113B Limitations of Classical Exploration ❌ Scalability Issues: Most are specifically designed for bandit problems, and thus, they are hard to apply in large-scale or high-dimensional problems (e.g., Atari games), resulting in increased computational demands that can be impractical ❌ Assumption Sensitivity: These methods heavily rely on specific assumptions about reward distributions or system dynamics, limiting their adaptability when assumptions do not hold ❌ Vulnerability to Uncertainty: They may struggle in dynamic environments with complex reward structures or frequent changes, leading to suboptimal performance Generated by DALL-E 3
  • 21. 21 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  Hard Exploration Problems 👈 [We are here]  Simple exploring solutions  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 22.  Task:  Agent searches for the key  Agent picks the key  Agent open the door to access the room  Agent finds the box in the room  Reward:  If the agent reaches the box, get +1 reward 22 https://github.com/maximecb/gym-minigrid → How to learn such complicated policies using the simple reward? Modern RL Environments are Complicated
  • 23. 23 Deakin University CRICOS Provider Code: 00113B Why is Scaling a Big Problem?  Practical environments often involve huge continuous state and action spaces  Classical approaches cannot be implemented or fail to hold their theoretical properties in these settings Doom environment: continuous high-dimensional state space (source) Mujoco environment: continuous action space (source).
  • 24. 24 Deakin University CRICOS Provider Code: 00113B Challenging Environments for Exploration  Environments require long-term memory of agents:  Maze navigation with conditions such as finding the objects that have the same color as the wall  Remember the shortest path to the objects experienced in the past  Noisy environments:  Noisy-TV: a random TV will distract the RL agent from its main task due to noisy screen https://github.com/jurgisp/memory-maze Noisy-TV (source)
  • 25. 25 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  Hard exploration problems  Simple exploring solutions 👈 [We are here]  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 26. 26 Deakin University CRICOS Provider Code: 00113B Entropy Maximization  In the era of deep learning, neural networks are used for approximating functions, including parameterizing value and policy functions in RL  ε-greedy is less straightforward for policy gradient methods  An entropy loss term is introduced in the objective function to penalize overly deterministic policies. This encourages diverse exploration, avoiding suboptimal actions by maximizing the bonus entropy loss term ❌ It may also impede the optimization of other losses, especially the main objective ❌ The entropy loss does not enforce different level of exploration for different tasks It is used in PPO, A3C, …
  • 27. 27 Deakin University CRICOS Provider Code: 00113B Noisy Networks  Another method to add randomness to the policy is to add noise to the weights of the neural networks  Throughout training, noise samples are drawn and added to the weights for both forward and backward propagation ❌ Although Noisy Networks can vary exploration degree across tasks, adapting exploration at the state level is far from reachable. Certain states with higher uncertainty may require more exploration, while others may not An example of a noisy linear layer. Here w is the matrix weight and b is the bias vector. The parameters µw, µb, σw and σb are the learnables of the network whereas εw and εw are noise variables. Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves Vlad Mnih, Remi Munos Demis Hassabis Olivier Pietquin, Charles Blundell, and Shane Legg. "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295 (2017)
  • 28. 28 Deakin University CRICOS Provider Code: 00113B End of Part A  QA  Demo Generated by DALL-E 3
  • 29. Intrinsic Motivation: Surprise and Novelty PART B Generated by DALL-E 3 Generated by DALL-E 3
  • 30. 30 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks 👈 [We are here]  Reward shaping and the role of memory  A taxonomy of memory-driven intrinsic exploration  Deliberate Memory for Surprise-driven Exploration  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 31.  No curiosity, random exploration  epsilon-greedy  “Tractable” exploration  Somehow optimize exploration, e.g. UCB, Thomson sampling  Only doable for simple environments 31  Approximate “trackable” exploration (count-based)  Scalable to harder environments  Intrinsic motivation exploration (SOTA) Predictive Novel or supersite-based curiosity Causal … https://cmutschler.de/rl Frameworks
  • 32. 32 Deakin University CRICOS Provider Code: 00113B Reward Shaping  Entropy loss or inject noise into the policy/value parameters with the limitation that the level of exploration is not explicitly conditioned on fine- grant factors such as states or actions  Solution: intrinsic reward bonuses assign higher internal rewards to state-action pairs that require higher exploration and vice versa  The final reward for the agent will be the weighted sum of the intrinsic reward and the external (environment) reward
  • 33.  Animal can travel for long distance till they find food  Human can navigate to go to an address in a strange city  What motivates these agents to explore? intrinsic motivation curiosity, hunch intrinsic reward 33 https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/ Intrinsic Motivation in Biological World
  • 34. 34 Deakin University CRICOS Provider Code: 00113B What Does Intrinsic Reward Represent?  Novelty  It is inherent for biological agents to be motivated by new things.  Tracking the occurrences of a state provides a novelty indicator, with increased occurrences signaling less novelty  Surprise  Surprise emerges when there's a discrepancy between expectations and the observed or experienced reality  Build a model of the environment, predicting the next state given the current state and action  The intrinsic reward is the prediction error itself  This reward increases when the model encounters difficulty in predicting or expresses surprise at the current observation. Novelty Surprise
  • 35. 35 Deakin University CRICOS Provider Code: 00113B The Role of Memory  Biological agents inherently possess memory to monitor events:  Drawing from previous experiences, they discern novelty in observations  Utilizing their prior understanding of the world, they identify unexpected observations  RL agents can be quipped with memory:  Event-based Memory: Episodic Memory  Semantic Memory: World Model Novelty Surprise MEMORY
  • 36. 36 Deakin University CRICOS Provider Code: 00113B A Taxonomy of Memory for RL Exploration
  • 37. 37 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction👈 [We are here]  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 38. 38 Deakin University CRICOS Provider Code: 00113B Forward Dynamics Prediction  Build a model of the environment, predicting the next state given the current state and action  This kind of model, also known as forward dynamics or world model  C: actor vs M: world model. M predicts consequences of C. C acts to make M fail  As a result:  If C action results in repeated and boring consequences  M predict well  C must explore novel consequence https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
  • 39. 39 Deakin University CRICOS Provider Code: 00113B Learning Progress as Intrinsic Reward  The learning progress is estimated by comparing the mean error rate of the prediction model during the current moving window to the mean error rate of the previous window  The two windows are different by 𝜏 steps  In can mitigate Noisy-TV problems Pierre-Yves Oudeyer & Frederic Kaplan. “How can we define intrinsic motivation?” Conf. on Epigenetic Robotics, 2008.
  • 40. 40 Deakin University CRICOS Provider Code: 00113B Deep Dynamic Models  Use a neural network f that takes a representation of the current state and action to predict the next state  The representation is shaped through unsupervised training, i.e., state reconstruction task, using an autoencoder’s hidden state  The network f, fed with the autoencoder’s hidden state, is trained to minimize the prediction error, which is the norm of the difference between the predicted state and the true state Intrinsic reward Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. In NIPS 2015.
  • 41.  Forward model on feature space  Feature space ignore irrelevant, uncontrollable factors  The consequences depend on  Action (controllable)  Environment (uncontrollable)  We want the state embedding enables controllable space 41 https://blog.dataiku.com/curiosity-driven-learning-through-next-state-prediction Deepak Pathak et al.: Curiosity-driven Exploration by Self-Supervised Prediction. ICML 2017 ICM More Complicated Model Inverse dynamic representation learning
  • 42. 42
  • 43. 43 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises👈 [We are here]  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 44.  The prediction target is stochastic  Information necessary for the prediction is missing  Model class of predictors is too limited to fit the complexity of the target function  Both the totally predictable and the fundamentally unpredictable will get boring 44 https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/ When Predictive Surprise Fails
  • 45. 45 https://favpng.com/ Ideas for Improvements Reward M’s progress instead of error No M’s improvement if the consequence is too hard or too easy to predict  no reward Remember all experiences “Store” all experienced consequence, including stochastic ones Global or Local memory Like human Better Representations Representation Learning https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
  • 46. 46 Deakin University CRICOS Provider Code: 00113B Random Network Distillation (RND)  The intrinsic reward is defined through the task of predicting the output of a fixed (target) network  Target Network’s weights are random  By predicting the target output, the Predictor Network tries to “remember” the randomized state  If old state reappears, it can be predicted easily by the Predictor Network  RND obviates Noisy-TV since the target network can be chosen to be deterministic and inside the model-class of the predictor network. Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network distillation." In International Conference on Learning Representations. 2018.
  • 47. 47 Deakin University CRICOS Provider Code: 00113B Noisy-TV in Atari games  Montezuma’s Revenge the agent oscillates between two rooms  This leads to an irreducibly high prediction error, as the non-determinism of sticky actions makes it impossible to know whether, once the agent is close to crossing a room boundary, making one extra step will result in it staying in the same room, or crossing to the next one  This is a manifestation of the ‘noisy TV’ problem
  • 48. 48 Deakin University CRICOS Provider Code: 00113B Latent World Model  World model for exploration should be robust against stochasticity and able to extrapolate the state dynamics  prediction error can be a measurement for novelty  Train WM on latent representation space. This space is shaped by unsupervised learning  a zero-centered distribution with a covariance matrix equal to the identity  robust to stochastic elements and is arranged respecting the temporal distance of observations  WM error is computed in latent space
  • 49. 49 Deakin University CRICOS Provider Code: 00113B Latent World Model: Sample-efficient Atari Benchmark
  • 50. 50 Deakin University CRICOS Provider Code: 00113B Bayesian Surprise  Surprise can be interpreted from a Bayesian statistics perspective  Similar to IG idea, the aim is to minimize uncertainty about the dynamics, formalized as maximizing the cumulative reduction in entropy  The reduction of entropy per time step, also known as mutual information, I(𝚯;St+1|ξt,at)  Here θ is the parameters of the dynamics model 𝚯. Because we are interested in finding intrinsic reward for a given timestep, we can define: Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "Vime: Variational information maximizing exploration." Advances in neural information processing systems 29 (2016).
  • 51. 51 Deakin University CRICOS Provider Code: 00113B Variational Bayesian Surprise  The KL involves computing the posterior p(θ|st+1), which is generally intractable  Use variational inference for approximating the posterior, alternative variational distribution q(θ; 𝜙)  𝜙 is the variational parameter  This is equivalent to parameterizing the dynamics model as a Bayesian neural network (BNN) with weight distributions maintained as a fully factorized Gaussian. Train 𝜙:
  • 52. 52 Deakin University CRICOS Provider Code: 00113B Bayesian Surprise Benchmarking
  • 53. 53 Deakin University CRICOS Provider Code: 00113B Bayesian Learning Progress  Training a BNN is complicated, there are different Bayesian views on surprise  Formulating the objective of the RL agent as jointly maximizing expected return and surprise  P is the true dynamics model and P𝜙 is the learned dynamics model  The objective can be translated to maximizing the bonus reward per step Joshua Achiam and Shankar Sastry. 2017. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732 (2017).
  • 54. 54 Deakin University CRICOS Provider Code: 00113B Bayesian Learning Progress: Approximation Solutions  In practice, we do not know P. Need approximation  Prediction error: measures the error in log probability instead of the norm of the difference between the predicted and the reality  Learning progress written in the form of log probability  To train the dynamics model P𝜙, solve the constrained optimization By introducing the KL constraint, the posterior model is prevented from diverging too far from the prior, thereby preventing the generation of unstable intrinsic rewards.
  • 55. 55 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement 👈 [We are here]  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 56. 56 Deakin University CRICOS Provider Code: 00113B Intrinsic Motivation via Disagreement  An alternative method with forward dynamics involves using the variance of the prediction rather than the error  This requires multiple prediction models trained to minimize the forward dynamics prediction errors,  Use the empirical variance (disagreement) of their predictions as the intrinsic reward  The higher the variance, the more uncertain about the observation  need to explore more Deepak Pathak, et al. “Self-Supervised Exploration via Disagreement.” In ICML 2019.
  • 57. 57 Deakin University CRICOS Provider Code: 00113B Bayesian Disagreement  Bayesian surprises are defined by a specific dynamics model  What happens if we consider a distribution of models? Bayesian surprise of a policy becomes: P(T) is the transition distribution of the environment and P(T|𝜙) is the transition distribution according to the dynamics model. Prediction error averaging considers all transition models and possible predictions Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active Exploration. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779–5788. P(S|s,a,t) is the dynamics model learned from a transition dynamics t
  • 58. 58 Deakin University CRICOS Provider Code: 00113B How to Compute u?  The term u(s,a) turns out to be the Jensen-Shannon Divergence among the dynamics models sampled from the transition-model distribution  The JSD can be approximated by employing N dynamics models:  For each P parameterized by a Gaussian distribution 𝓝i(µi,Σi), we need another layer of approximation to compute u(s,a), replacing the Shannon entropy with the Rényi entropy and using the corresponding Jensen-Rényi Divergence (JRD)
  • 59. 59 Deakin University CRICOS Provider Code: 00113B Surprise-based Exploration: Pros and Cons 👍 Dynamics models can be trained easily these days; there is a large body of work on the topic 👍 Advanced methods can partially handle Noisy-TV ❌ Forward dynamics error is a weak driver of exploration, especially when the world model is poor and consistently mispredicts ❌ Advanced methods such as ensembles or learning progress are computationally expensive ways to cope with Noisy-TV Generated by DALL-E 3
  • 60. 60 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break👈 [We are here]  RAM-like Memory for Novelty-based Exploration  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 61. 61 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Count-based memory 👈 [We are here]  Episodic memory  Hybrid memory  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 62. 62 Deakin University CRICOS Provider Code: 00113B Novelty via Counting  Humans want to explore novel places, make new friends, and buy new stuff. It is inherent for humans to be motivated by new things  How to translate this intrinsic motivation to RL agents?  Tracking the occurrences of a state, N(s), provides a novelty indicator: more occurrences means less novelty  ri(s,a) = N(s)^(-1/2), where N counts the number of times s has appeared ❌ Empirical counting in continuous state spaces is impractical due to the rarity of exact state revisits, resulting in N(s)=0 most of the time
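For illustration, a minimal tabular version of this bonus, assuming states are hashable (e.g., small discrete grids):

```python
from collections import defaultdict

class CountBonus:
    """Tabular count-based exploration bonus r_i(s) = N(s)^(-1/2)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1          # N(s) is updated on every visit
        return self.counts[state] ** -0.5
```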
  • 63. 63 Deakin University CRICOS Provider Code: 00113B Density-based State Counting  Use a density function of the state to estimate its occurrences  Let ρ(x)=ρ(s=x|s1:n) be the density of state x given the history s1:n, and ρ’(x)=ρ(s=x|s1:n x) the density of x after one more occurrence of x has been observed  The pseudo-count N̂(x) and pseudo-count total n̂ are defined so that the density model behaves like an empirical count: ρ(x) = N̂(x)/n̂ before, and ρ’(x) = (N̂(x)+1)/(n̂+1) after, an occurrence of x Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying count-based exploration and intrinsic motivation." Advances in Neural Information Processing Systems 29 (2016).
  • 64. 64 Deakin University CRICOS Provider Code: 00113B Pseudo State Count  In practice, in a huge state space ρ’n(x) ≈ 0, so we can rewrite the pseudo-count:  PG is the prediction gain, computed as:  It resembles the information gain: the difference between expectations under the posterior and prior distributions  To extend to state-action pairs, one can concatenate the action representation with the state representation (see the sketch below).
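A small sketch of the pseudo-count computed from a density model, under the definitions above; the 0.01 constant in the bonus is the value commonly used in the count-based literature and should be treated as an assumption here:

```python
import math

def pseudo_count(rho, rho_prime):
    """Exact pseudo-count N(x) = rho(x)(1 - rho'(x)) / (rho'(x) - rho(x)),
    where rho/rho_prime are the densities before/after one more occurrence of x."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def pseudo_count_from_pg(prediction_gain):
    """Approximation when rho'(x) is tiny: N(x) ~ 1 / (exp(PG) - 1), PG = log rho' - log rho."""
    return 1.0 / (math.exp(prediction_gain) - 1.0)

def count_bonus(n_hat, eps=0.01):
    """Exploration bonus r_i = (N + eps)^(-1/2)."""
    return (n_hat + eps) ** -0.5
```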
  • 65. 65 Deakin University CRICOS Provider Code: 00113B Hash Count  If counting each exact state is challenging, why not partition the continuous state space into manageable blocks?  By using a function 𝜙 mapping a state to a code, we can count the occurrences of the code instead of the state  “Distant” states are counted separately while “similar” states are merged  SimHash is used as the mapping function  A higher k means fewer code collisions, and thus states are more finely distinguished (see the sketch below) Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems 30 (2017). sgn is the sign function, A is a k × d matrix with i.i.d. entries drawn from a standard Gaussian distribution, and g is some transformation function
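A minimal sketch of SimHash counting under these definitions; g defaults to the identity, and the β/√n bonus form follows the count-based convention:

```python
import numpy as np
from collections import defaultdict

class SimHashCount:
    """phi(s) = sgn(A g(s)) with A a k x d matrix of i.i.d. standard Gaussians;
    states mapping to the same k-bit code share one counter."""
    def __init__(self, state_dim, k=16, g=lambda s: s, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))
        self.g = g
        self.counts = defaultdict(int)

    def bonus(self, state, beta=1.0):
        code = tuple(np.sign(self.A @ self.g(np.asarray(state, dtype=float))).astype(int))
        self.counts[code] += 1
        return beta / np.sqrt(self.counts[code])
```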
  • 66. 66 Deakin University CRICOS Provider Code: 00113B Hash Count: Cope with High-dimensional States  Representation learning is employed to obtain a good g, via an autoencoder network and a reconstruction objective  The network aims to reconstruct the original state input s, and the hidden representation b(s) is used to compute g(s)=round(b(s))  An additional regularization term prevents bits of the binary code from flipping throughout the agent's lifetime
  • 67. 67 Deakin University CRICOS Provider Code: 00113B Change Counting  To encourage the agent to explore novel state-action pairs meaningfully, we can assess the changes caused by its activities and prioritize those that signify novelty  Define c(s,s’) as the environment change caused by a transition (s, a, s’)  Combining state count and change count results in the intrinsic reward: Change count (last row) vs norm of change (middle row) vs state count (top row). Change count suffers less from attraction to meaningless activities. Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta. "Interesting object, curious agent: Learning task-agnostic exploration." Advances in Neural Information Processing Systems 34 (2021): 20516-20530.
  • 68. 68 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Count-based memory  Episodic memory👈 [We are here]  Hybrid memory  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 69. 69 Deakin University CRICOS Provider Code: 00113B Alternative Novelty Measurement ❌ An inherent constraint of count-based methods lies in the approximation error between the pseudo-count and the true count  A different novelty criterion: novel observations are those that demand effort to reach, typically beyond the already explored areas of the environment  Measure the effort in environment steps, estimating it with a neural network that predicts the number of steps between two observations  To capture the explored areas of the environment, use an episodic memory initialized empty at the start of each episode Novelty through reachability concept. An observation is novel if it can only reach those in the memory in more than k steps. Savinov, Nikolay, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. "Episodic curiosity through reachability." arXiv preprint arXiv:1810.02274 (2018). If the maximum reachability score of the given observation is greater than a threshold k, we can regard it as novel
  • 70. 70 Deakin University CRICOS Provider Code: 00113B Episodic Curiosity: Memory Workflow  Define the threshold k and use the reachability network to classify whether two observations are separated by more or fewer than k steps  After training, the reachability network is used to estimate the novelty of the current observation in the episode given the episodic memory M, which is finally used to compute the intrinsic reward  A function F aggregates the reachability scores between the current observation and those in the memory to produce the intrinsic reward. F can be the max or the 90-th percentile
  • 71. 71 Deakin University CRICOS Provider Code: 00113B Explicit Memory of Positions (only for Atari games)  Collect the agent’s position from game RAM to indicate where on the grid an agent has visited  White sections in the curiosity grid (middle) show which locations have been visited; the unvisited black sections yield an exploration bonus when touched.  The network receives both game input (left) and curiosity grid (middle) and must learn how to form a map of where the agent has been (hypothetical illustration, right) Stanton, Christopher, and Jeff Clune. "Deep curiosity search: Intra-life exploration can improve performance on challenging deep reinforcement learning problems." arXiv preprint arXiv:1806.00553 (2018).
  • 72. 72 Deakin University CRICOS Provider Code: 00113B Novelty Connection to Surprise  Theoretically, an overparameterized autoencoder whose task is to reconstruct its input is equivalent to an associative memory (Adityanarayanan et al., PNAS 2020)  We can train an autoencoder that takes the state as input to reconstruct and use its reconstruction error as an indicator of life-long novelty  Greater errors signify a higher level of novelty in states, indicating that the autoencoder has not encountered these states frequently enough to learn their successful reconstruction effectively  Related to RND: the intrinsic reward is still the reconstruction error. But this time, the reconstructed target is no longer the original input. Instead, it is a transformed version of the input
  • 73. 73 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Count-based memory  Episodic memory  Hybrid memory 👈 [We are here]  Replay Memory  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 74. 74 Deakin University CRICOS Provider Code: 00113B Surprise + Novelty  Never Give Up agent (NGU) combines existing surprise and novelty components from the literature cleverly: 1. State representation learning via inverse dynamics (ICM) 2. Life-long novelty module using RND 3. Episodic novelty using episodic memory inspired by EC  The implementation of the episodic memory in NGU is new. The dynamics model f is employed to produce the representations for the novelty modules. Two types of novelty are combined to produce the final intrinsic reward. Badia, Adrià Puigdomènech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman et al. "Never give up: Learning directed exploration strategies." arXiv preprint arXiv:2002.06038 (2020).
  • 75. 75 Deakin University CRICOS Provider Code: 00113B NGU: Episodic Novelty  Encourages the exploration of novel states within an episode via simple nearest-neighbor matching.  As a result, the agent is discouraged from revisiting the same state twice within an episode; this is distinct from lifelong novelty  The closer the current state is to its stored neighbors, the higher the similarity and thus the smaller the reward (see the sketch below)  Hybrid Intrinsic Reward:
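A simplified sketch of the episodic bonus, assuming controllable-state embeddings are appended to an in-episode memory; the kernel normalization here uses the mean of the current k-NN distances, whereas NGU maintains a running average, so treat the details as assumptions:

```python
import numpy as np

def episodic_novelty_bonus(embedding, memory, k=10, eps=1e-3, c=1e-3):
    """Inverse kernel similarity to the k nearest embeddings stored this episode."""
    if not memory:
        memory.append(embedding)
        return 1.0
    d2 = np.sort(np.array([np.sum((embedding - m) ** 2) for m in memory]))[:k]
    kernel = eps / (d2 / (d2.mean() + 1e-8) + eps)   # similarity to each of the k-NN
    memory.append(embedding)
    return 1.0 / np.sqrt(kernel.sum() + c)           # small when neighbors are close
```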
  • 76. 76 Deakin University CRICOS Provider Code: 00113B Agent57: Exploration at Scale  An upgraded version of NGU:  Splits the value function into two separate value functions for extrinsic and intrinsic rewards  A population of policies (and value functions) is trained, each characterized by a distinct pair of exploration parameters:  N is the size of the population, 𝛾j is the discount factor and βj is the intrinsic reward coefficient of the j-th member, adapted by a meta-controller (a bandit algorithm) Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. "Agent57: Outperforming the atari human benchmark." In International conference on machine learning, pp. 507-517. PMLR, 2020.
  • 77. 77 Deakin University CRICOS Provider Code: 00113B Agent 57: Atari Benchmark
  • 78. 78 Deakin University CRICOS Provider Code: 00113B Cluster Memory for Counting  Parametric methods for counting have problems: slow adaptation and catastrophic forgetting  RECODE's count estimation provides a long-term visitation-based exploration bonus while retaining responsiveness to the most recent experience  A finite slot-based container M stores representations and corresponding counters C  The RECODE memory is updated by either:  adding a new embedding (atom) with count 1, or  updating the nearest atom and increasing its count The kernel is non-zero for all neighbours within a radius
  • 79. 79 Deakin University CRICOS Provider Code: 00113B RECODE: Representation Learning  The transformer takes masked sequences of length k consisting of actions and embedded observations as inputs and tries to reconstruct the missing embeddings in the output  The reconstructed embeddings at time t − 1 and t are then used to build a 1-step action- prediction classifier  Similar to ICM’s inverse dynamics Saade, Alaa, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, and Bilal Piot. "Unlocking the Power of Representations in Long-term Novelty-based Exploration." In The Twelfth International Conference on Learning Representations. 2024.
  • 80. 80 Deakin University CRICOS Provider Code: 00113B Novelty of Surprise  The norm of the prediction error (surprise norm) is not a reliable signal (e.g., under Noisy-TV)  A new metric, surprise novelty: the error of reconstructing the surprise, where the surprise is itself the state-prediction error  This requires a surprise generator, such as a dynamics model, to produce the surprise vector u, i.e., the difference vector between prediction and reality  Then inter- and intra-episode novelty scores are estimated by a memory system called Surprise Memory (SM), consisting of an autoencoder network W and an episodic memory M, respectively Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise: Improving Exploration Through Surprise Novelty." In AAMAS, 2024.
  • 81. 81 Deakin University CRICOS Provider Code: 00113B Benefit of Hybrid Systems  Marries the best of both worlds:  dynamics prediction  novelty estimation  Combines intrinsic rewards:  long-term, global, inter-episode  short-term, local, intra-episode  Limitations of dynamics prediction with deep models can be compensated for by non-parametric memory approaches  Noisy-TV problems are mitigated but not completely solved Noisy-TV: a random TV will distract the RL agent from its main task due to high surprise (source).
  • 82. 82 Deakin University CRICOS Provider Code: 00113B However, Is Intrinsic Reward Really Good? Taiga et al. On bonus-based exploration methods in the arcade learning environment. In ICLR 2019  Intrinsic motivation relies heavily on memory concepts (global, local, …)  The performance of IR agents is very good, but…  they require many samples to train (10^9 or 10^10 steps is the norm); isn't the goal of IR to enable sample efficiency?  Are they overfitting to Montezuma's Revenge?  Depending on architecture and tuning, plain exploration is generally competitive
  • 83. 83 Issues with Intrinsic Motivation Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
  • 84. 84 Deakin University CRICOS Provider Code: 00113B Reflection on Memory  Surprise:  memory is hidden inside the dynamics model, memorizing seen observations so that predictions can be made  this memory is long-term, semantic and slow to update  Novelty:  memory is explicit, as a slot-based matrix, nearest-neighbour estimator, counter, …  this memory is often short-term, instance-based and adaptive to changes in the environment (Diagram: memory underlies both surprise and novelty, which drive intrinsic exploration; can memory drive exploration directly?)
  • 85. 85 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay 👈 [We are here]  Performance-based Replay  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 86. 86 Deakin University CRICOS Provider Code: 00113B A Direct Exploration Mechanism  Two major issues have hindered the ability of previous algorithms to explore:  Detachment: the agent loses track of interesting areas to explore from  Derailment: the exploratory mechanisms of the algorithm prevent it from returning to previously visited states  The role of memory is simplified:  store past states  retrieve states to explore  replay (Diagram: Replay Memory → Sampled States → Exploration)
  • 87. 87 Deakin University CRICOS Provider Code: 00113B Go-Explore  Detachment is addressed by memory: keep track of areas by grouping similar states into cells  similar to hash counting  map a state to a cell  each cell has a score indicating its sampling probability  Derailment is addressed by the simulator (only suitable for Atari):  sample a cell's state from the memory  the simulator resets the agent to the cell's state  the memory is updated with new cells during exploration Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
  • 88. 88 Deakin University CRICOS Provider Code: 00113B Go-Explore: Engineering Details  State to cell: downscale frames, with adaptive downscaling parameters for robustness:  calculate how a sample of recent frames would be grouped into cells  select the values that result in the best cell distribution (manually designed criterion)  The selection probability of a cell at each step is proportional to its selection weight  count-based (see the sketch below)  Domain-knowledge weights: (1) the number of horizontal neighbours of the cell present in the archive (h); (2) a key bonus for each location  Train from demonstrations: a backward algorithm places the agent close to the end of the trajectory and runs PPO until the performance of the agent matches that of the demonstration Cseen is the number of exploration steps in which that cell is visited
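A minimal sketch of the archive and count-based cell selection; the exact weight 1/√(Cseen + 1) and the "keep the first snapshot" rule are simplifying assumptions (Go-Explore combines several sub-scores and keeps the best snapshot per cell):

```python
import numpy as np
from collections import defaultdict

class CellArchive:
    """Map states to cells, remember a restorable snapshot per cell, and sample
    cells with weight ~ 1 / sqrt(C_seen + 1) so rarely visited cells are preferred."""
    def __init__(self, to_cell):
        self.to_cell = to_cell            # e.g. downscale + discretize the frame
        self.snapshots = {}               # cell -> simulator state to restore
        self.seen = defaultdict(int)      # cell -> C_seen

    def update(self, state, snapshot):
        cell = self.to_cell(state)
        self.seen[cell] += 1
        self.snapshots.setdefault(cell, snapshot)

    def sample_cell(self, rng=np.random.default_rng()):
        cells = list(self.snapshots)
        w = np.array([1.0 / np.sqrt(self.seen[c] + 1.0) for c in cells])
        idx = rng.choice(len(cells), p=w / w.sum())
        return cells[idx], self.snapshots[cells[idx]]
```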
  • 89. 89 Deakin University CRICOS Provider Code: 00113B Addressing Cell Limitations ❌ Cell design is not obvious: it requires detailed knowledge of the observation space, the dynamics of the environment, and the downstream task  Latent Go-Explore: Go-Explore operating without cells:  a latent representation is learned simultaneously with the exploration  sampling of the final goal is based on a non-parametric density model of the latent space  the simulator is replaced with goal-based exploration Quentin Gallouédec and Emmanuel Dellandréa. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. PMLR, 10571–10586.
  • 90. 90 Deakin University CRICOS Provider Code: 00113B Latent Go-Explore: Details  Representation learning:  ICM's inverse dynamics  forward dynamics  vector-quantized variational autoencoder reconstruction  Density estimation to sample goals:  goals must be at the edge of the yet-unexplored areas  goals must be reachable (already visited)  Use a particle-based entropy estimator to score density, rank the visited latents, and sample with a geometric law over the rank (p is a hyperparameter); the higher the rank (denser region), the less novel the latent and the less often it is sampled (see the sketch below)
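A small sketch of geometric goal sampling over the density rank, under the description above; `latents` is assumed to be a NumPy array and `density_scores` any per-latent density estimate (e.g., from a particle-based entropy estimator):

```python
import numpy as np

def sample_goal_by_rank(latents, density_scores, p=0.1, rng=np.random.default_rng()):
    """Rank latents from sparsest to densest and sample a rank with a geometric law,
    so novel (low-density) regions are chosen as goals more often."""
    order = np.argsort(density_scores)            # sparsest (most novel) latents first
    ranks = np.arange(1, len(order) + 1)
    probs = (1 - p) ** (ranks - 1) * p            # geometric distribution over ranks
    probs /= probs.sum()                          # renormalize over the finite support
    return latents[order[rng.choice(len(order), p=probs)]]
```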
  • 91. 91 Deakin University CRICOS Provider Code: 00113B Go-Explore Family is current SOTA on Atari
  • 92. 92 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay👈 [We are here]  QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 93. 93 Deakin University CRICOS Provider Code: 00113B Imitation Learning: Exploration via Exploitation  Exploiting past good experiences can drive learning  Self-imitation learning imitates the agent's own past good decisions  Memory is a replay buffer that stores experiences  A policy is learned to imitate state-action pairs in the replay buffer only when the return of the past episode is greater than the agent's value estimate (performance-based)  If the past return exceeds the agent's value estimate (R > Vθ), the agent learns to choose the action it chose in the past in the given state (see the sketch below). Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In International conference on machine learning, pp. 3878-3887. PMLR, 2018.
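A minimal sketch of the self-imitation losses for a discrete-action actor-critic, following the max(R − V, 0) rule described above; the batch shapes and the 0.5 value-loss coefficient are assumptions:

```python
import torch
import torch.nn.functional as F

def self_imitation_losses(logits, values, actions, returns):
    """logits: (B, A), values: (B,), actions: (B,) long, returns: (B,) past returns.
    Only transitions whose past return exceeds the current value estimate contribute."""
    clipped_adv = (returns - values).clamp(min=0)            # max(R - V, 0)
    log_prob = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_prob * clipped_adv.detach()).mean()  # imitate only good actions
    value_loss = 0.5 * (clipped_adv ** 2).mean()             # push V up toward good returns
    return policy_loss, value_loss
```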
  • 94. 94 Deakin University CRICOS Provider Code: 00113B Goal-based Exploration Using Memory  Generate new trajectories visiting novel states by editing or augmenting the trajectories stored in memory from past experiences  A sequence-to-sequence model with an attention mechanism learns to 'translate' the demonstration trajectory to a sequence of actions and generate a new trajectory  Sample demonstrations using count-based novelty; insert a new trajectory if its ending differs significantly, otherwise replace the stored trajectory if the new one has a higher return Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee. 2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020), 4333–4345. Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL)
  • 95. 95 Deakin University CRICOS Provider Code: 00113B DTSIL: Policy Learning  Train a trajectory-conditioned policy πθ(at|e≤t, ot, g) that can flexibly imitate any given trajectory g  To imitate, assign an imitation reward r^im (0.1) whenever the visited state is similar to the demonstration  After visiting the last (non-terminal) state of the demonstration, the agent performs random exploration (r=0) to encourage further exploration  Policy-gradient training: further imitation encouragement
  • 96. 96 Deakin University CRICOS Provider Code: 00113B DTSIL: Performance when Combined with Count
  • 97. 97 Deakin University CRICOS Provider Code: 00113B Replay Memory: Pros and Cons 👍 Replay memory provides a direct exploration mechanism without intrinsic rewards 👍 Sampling strategies build upon previous works ❌ Makes additional assumptions, such as access to a resettable simulator or the availability of demonstrations ❌ Often requires a goal-conditioned policy and multiple training stages Generated by DALL-E 3
  • 98. 98 Deakin University CRICOS Provider Code: 00113B End of Part B  QA  Demo Generated by DALL-E 3
  • 99. Advanced Topics PART C Generated by DALL-E 3 Generated by DALL-E 3 Generated by DALL-E 3
  • 100. 100 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration👈 [We are here]  Language-assisted RL  LLM-based exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 101. Beyond the state space: Language-guided exploration  Why language?  Humans are able to learn quickly in new environments due to a rich set of commonsense priors about the world → reflected in language  Reading the instructions of a game avoids trial and error.  Language offers abstraction, compositionality and generalization 101
  • 102. 102 Luketina, Jelena, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. "A survey of reinforcement learning informed by natural language." arXiv preprint arXiv:1906.03926 (2019). How Language can be used in RL
  • 103.  Make a cake from tools and materials  A pure RL agent needs to try thousands of settings until it finds the desired characteristics  If the RL agent reads and follows the recipe, it may succeed in a single trial 103 https://www.moresteam.com/toolbox/design-of-experiments.cfm A more practical use case
  • 104. 104 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Language-assisted RL 👈 [We are here]  LLM-based exploration  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 105. 105 Gated-attention for Task-oriented Language Grounding  Task-oriented language grounding: extract meaningful representations from natural language instructions and map them to visual elements and actions  Task: given an initial image (pixels) and an instruction -> guide the agent to move towards the desired object  Two main modules:  State Processing: process image and language jointly to obtain the state  Policy Learning: use a policy to map states to the corresponding actions Generated by DALL-E 3
  • 106. 106 State Processing Module  Use gated attention instead of concatenation to jointly represent image and language information as one state  The language instruction goes through a fully-connected layer to match the image channel dimension, producing the attention vector  Each element of the attention vector is expanded to an (𝐻 × 𝑊) matrix to match the corresponding image feature map  The final representation is obtained via an element-wise product between the image and language representations (see the sketch below) Generated by DALL-E 3
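A minimal PyTorch-style sketch of this gated-attention fusion; the layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Project the instruction embedding to one gate per image channel,
    then multiply the gates element-wise with the image feature maps."""
    def __init__(self, lang_dim, num_channels):
        super().__init__()
        self.to_gate = nn.Linear(lang_dim, num_channels)

    def forward(self, image_feats, lang_emb):
        # image_feats: (B, C, H, W), lang_emb: (B, lang_dim)
        gate = torch.sigmoid(self.to_gate(lang_emb))   # (B, C) attention vector
        gate = gate.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1), broadcast over H x W
        return image_feats * gate                      # gated state representation
```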
  • 107. 107 Policy Module  Actions:  Turn (left, right)  Move forward  Policy architecture: two variants  Behaviour cloning: uses the target and object locations and orientations in every state to select the optimal action  A3C: uses a deep neural network to learn the policy and value function; actions are drawn from the policy 𝜋(𝑎|𝐼, 𝐿) Generated by DALL-E 3
  • 108. 108 Results Generated by DALL-E 3 - With gated attention, the agent learns faster and achieves better accuracy (success rate of reaching the correct object before the episode terminates) - As the environment gets harder, more exploration is needed -> A3C with GA performs better than imitation learning, where little exploration is done
  • 109. 109 Semantic Exploration from Language Abstractions and Pretrained Representations  Novelty-based exploration methods suffer in high-dimensional visual state spaces  e.g., different viewpoints of one place in 3D can map to distinct visual states/features, despite being semantically similar  Language can be a useful abstraction for exploration, as it coarsens the state space in a way that reflects the semantics of the environment  Solution: use vision-language pretrained representations to guide semantic exploration in 3D Generated by DALL-E 3 Example of a state (picture) with language description (caption). Note how the caption focuses on important aspects of the state Example of how many states can be conveyed with one text caption
  • 110. 110 Intrinsic Reward Design  State 𝑠𝑡: embedding of the image, denoted 𝑂𝑉  Goal: described by a text instruction, denoted 𝑔  The caption is encoded by a pretrained language encoder; the output embedding, denoted 𝑂𝐿, is only used to calculate the intrinsic reward. Note that the agent never observes 𝑂𝐿.  The intrinsic reward is goal-agnostic; it is computed from the state representation (either 𝑂𝑉 or 𝑂𝐿)  The intrinsic reward is added to two exploration algorithms:  Never Give Up (NGU; Badia et al.)  Random Network Distillation (RND; Burda et al.) Generated by DALL-E 3 Badia, Adrià Puigdomènech, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038 (2020). Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • 111. 111 Never Give Up (NGU)  State representations (used to compute the intrinsic reward; either 𝑂𝐿 or 𝑂𝑉) along the trajectory are written to a memory buffer  Novelty is a function of the L2 distance between the current state and the k-nearest states in the buffer. The intrinsic reward is higher for larger distances  To influence exploration: modify the embedding function  Originally, the embedding function is learned  Variants:  Vis-NGU & LSE-NGU: use visual embeddings (𝑂𝑉).  Lang-NGU: use language embeddings (𝑂𝐿). Generated by DALL-E 3
  • 112. 112 Random Network Distillation (RND)  The intrinsic reward is the prediction error of a trainable network trying to match the output of a frozen, randomly initialized target network  The trainable network learns independently from the policy network  As training progresses, frequently visited states yield less intrinsic reward due to reduced prediction errors (see the sketch below) Generated by DALL-E 3
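A minimal sketch of RND for vector observations; the MLP sizes are arbitrary assumptions, and the same error serves both as the intrinsic reward and as the predictor's training loss:

```python
import torch
import torch.nn as nn

def small_mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class RND(nn.Module):
    """Predictor is trained to match a frozen random target; error = intrinsic reward."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = small_mlp(obs_dim, feat_dim)
        self.predictor = small_mlp(obs_dim, feat_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)        # the target network stays fixed

    def forward(self, obs):
        # Mean squared error per observation; backpropagate it to train the predictor.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```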
  • 113. 113 Results  Example of how language representations (orange line) can help the agent explore better than visual representations in a 3D environment.  Using visual representations, the agent struggles with different views of the same scene. Generated by DALL-E 3
  • 114. 114 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Language-assisted RL  LLM-based exploration👈 [We are here]  Causal discovery for exploration  Closing Remarks  QA and Demo
  • 115. 115 ELLM: Guiding Pretraining in RL with LLM  Many distinct actions can lead to similar outcomes -> intrinsically motivated RL (IM-RL): explore outcomes rather than actions  Competence-based IM (CB-IM): maximize the diversity of skills mastered by the agent  CB-IM aims to optimize 𝑅𝑖𝑛𝑡:  given a goal distribution G and reward 𝑅𝑖𝑛𝑡, CB-IM algorithms train a goal-conditioned policy 𝜋(𝑎|𝑜, 𝑔) that maximizes 𝑅𝑖𝑛𝑡  G and 𝑅𝑖𝑛𝑡(𝑜, 𝑎, 𝑜′|𝑔) must be defined such that three properties are satisfied: 1. diverse 2. common-sense sensitive (e.g., chop a tree > drink a tree) 3. context sensitive (e.g., only chop the tree when the tree is in view) Generated by DALL-E 3
  • 116. 116 Why ELLM?  Previous methods hand-define 𝑅𝑖𝑛𝑡 and G, and use various motivations to guide goal sampling 𝑔~𝐺: novelty, learning progress, intermediate difficulty  ELLM: alleviate the need for environment-specific hand-coded definitions of 𝑅𝑖𝑛𝑡 and G with LM:  Language-based goal representations  Language-model-based goal generation Generated by DALL-E 3
  • 117. 117 Architecture  Goal representation: prompt the LLM with the available actions & a description of the current observation -> construct the prompt -> the LLM generates goals  Open-ended goal generation: ask the LLM to generate goals  Closed-form: ask the LLM yes/no questions (e.g., should the agent do X? yes/no?) Generated by DALL-E 3
  • 118. 118 Intrinsic Reward Design  Reward LLM-suggested goals via the similarity between a generated goal and the description of the agent's transition 𝐶𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛  If there are multiple goals -> reward the agent according to the goal most similar to the transition description (see the sketch below): Generated by DALL-E 3
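A small sketch of this reward, assuming the goal and transition captions have already been embedded by a pretrained sentence encoder; the similarity threshold is an assumption (ELLM additionally filters out goals that were already achieved):

```python
import numpy as np

def ellm_style_reward(goal_embeddings, transition_embedding, threshold=0.5):
    """Cosine similarity between the transition caption and the closest LLM goal;
    similarities below the threshold yield no reward."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    best = max(cosine(g, transition_embedding) for g in goal_embeddings)
    return best if best > threshold else 0.0
```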
  • 120. 120 Intrinsically Guided Exploration from LLMs (IGE-LLMs)  For long, sequential tasks with sparse rewards, an intrinsic reward can help guide policy learning towards exploration, alongside the main policy driver, the extrinsic reward  An LLM can be used as an evaluator of the potential future value of all actions a, mapping every (s,a) pair directly to an intrinsic reward 𝑟𝑖  The total reward is 𝑟𝑐 = 𝑟𝑒 + 𝜆𝑖 𝑟𝑖 𝑤𝑖, where 𝑟𝑒 is the external reward, 𝜆𝑖 is a controlling factor and 𝑤𝑖 is a linearly decaying weight Generated by DALL-E 3
  • 121. 121 Prompting Generated by DALL-E 3  Example of an input prompt to evaluate possible actions in the DeepSea environment  The LLM is given the current position and the possible actions, and is asked to rate every possible next action
  • 122. 122 Benefit of Intrinsic Reward  The LLM improves traditional exploration methods  Using the LLM to generate actions directly exhibits significant errors (grey lines on the right graph), even with advanced LLMs (GPT-4)  However, when used only as an intrinsic reward, it helps with exploration, especially in harder environments, and results in better performance Generated by DALL-E 3
  • 123. 123 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration👈 [We are here]  Statistical approaches  Deep learning approaches  Closing Remarks  QA and Demo
  • 124. 124 Deakin University CRICOS Provider Code: 00113B What is causality?  The relationship between cause and effect.  Two fundamental questions:  Causal discovery: what evidence is required to infer cause-effect relationships?  Causal inference: given causal information, what inferences can be drawn?  The Structural Causal Models (SCMs) framework (Pearl, 2009a).  Causal graph. Figure 1: Example of an SCM and causal graph for the scurvy problem.
  • 125. 125 Deakin University CRICOS Provider Code: 00113B Why causality and RL?  Understanding cause and effect reduces unnecessary exploratory actions, thus improving sample efficiency.  Ex: do not move toward the door before obtaining the key.  Improves interpretability.  Ex: why does the policy prioritize obtaining the key?  Generalizability.
  • 126. 126 Deakin University CRICOS Provider Code: 00113B Interpreting Causality in an RL Environment  Taking action A can affect the reward R.  State S is the context variable that affects both action A and reward R.  U is the unknown confounding variable.  Methods are categorized by the techniques used to improve exploration or by the causality-measurement techniques.  Statistical vs. deep learning methods. Figure 2: Causality in an RL environment.
  • 127. 127 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Statistical approaches👈 [We are here]  Deep learning approaches  Closing Remarks  QA and Demo
  • 128. 128 Deakin University CRICOS Provider Code: 00113B Causal influence detection for improving efficiency in reinforcement learning. Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal influence detection for improving efficiency in reinforcement learning. Advances in Neural Information Processing Systems, 34, 22905-22918.
  • 129. 129 Deakin University CRICOS Provider Code: 00113B Causal Action Influence Detection (CAI)  As mentioned previously, the state S is decomposed into N components.  One-step transition graph:  How to detect when the action influences the next-state component S’? Figure 3: Globally, the causal graph is fully connected. Figure 4: Example of situation-dependent control.
  • 130. 130 Deakin University CRICOS Provider Code: 00113B Causal Action Influence Detection (CAI) (cont.)  CAI is measured with Conditional Mutual Information (CMI) between the action and a next-state component, conditioned on the current state:  Estimation of CAI: estimate a forward model from data and use it to approximate the CMI (see the sketch below).
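A sketch of one way to estimate such a CMI score by Monte Carlo, assuming a discrete action set and a learned Gaussian forward model `forward_model(state, a) -> (mu, sigma)` over a next-state component (arrays of shape (d,)); this mirrors the "average KL to the action-marginal" form, but the details are assumptions rather than the paper's exact estimator:

```python
import numpy as np
from scipy.stats import norm

def cai_score(forward_model, state, actions, n_samples=32, rng=np.random.default_rng()):
    """Estimate I(A; S' | s) for a uniform action distribution:
    average over actions of E_{s'~P(.|s,a)}[log P(s'|s,a) - log mean_a' P(s'|s,a')]."""
    params = [forward_model(state, a) for a in actions]          # list of (mu, sigma)
    total = 0.0
    for mu, sigma in params:
        samples = rng.normal(mu, sigma, size=(n_samples,) + np.shape(mu))
        log_p = norm.logpdf(samples, mu, sigma).sum(axis=-1)     # log P(s'|s,a)
        log_mix = np.logaddexp.reduce(
            [norm.logpdf(samples, m, s).sum(axis=-1) for m, s in params], axis=0
        ) - np.log(len(actions))                                 # log of the uniform mixture
        total += np.mean(log_p - log_mix)
    return total / len(actions)
```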
  • 131. 131 Deakin University CRICOS Provider Code: 00113B Using CAI to Improve Exploration in RL  CAI as an intrinsic reward.  Active exploration with CAI.  CAI experience replay.  Experiments on three environments: FetchPush, FetchPickAndPlace, FetchRotTable. The goal is the coordinates the object must reach.  Baseline RL algorithm: DDPG + HER. Figure 5: FetchPickAndPlace Environment. Figure 6: FetchRotTable Environment.
  • 132. 132 Deakin University CRICOS Provider Code: 00113B CAI as Intrinsic Reward  Use CAI as a reward signal.  Use it on its own or together with the task reward. Figure 7: The bonus reward improves performance on FetchPickAndPlace.
  • 133. 133 Deakin University CRICOS Provider Code: 00113B Active Exploration with CAI  Replace random exploration with causal exploration.  Choose action with highest contribution to CAI calculation. Figure 8: Performance of active exploration in FetchPickAndPlace depending on the fraction of exploratory actions chosen actively from a total of 30% (epsilon) exploratory actions. Figure 9: Experiment comparing exploration strategies on FetchPickAndPlace. The combination of active exploration and reward bonus yields the largest sample efficiency.
  • 134. 134 Deakin University CRICOS Provider Code: 00113B CAI Experience Replay  Choose episodes for replay from the replay buffer guided by a causal (inverse) ranking.  This defines the probability of sampling any state from episode i (of M episodes) in the replay buffer (with T being the episode length). Figure 10: Comparison of CAI-P with baselines (energy-based method with privileged information (EBP), prioritized experience replay (PER), and HER without prioritization)
  • 135. 135 Tutorial Outline  Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes)  Welcome and Introduction  Reinforcement Learning Basics  Exploring Challenges in Deep RL  QA and Demo  Part B: Surprise and Novelty (110 minutes, including a 20-minute break)  Principles and Frameworks  Deliberate Memory for Surprise-driven Exploration  Forward dynamics prediction  Advanced dynamics-based surprises  Ensemble and disagreement  Break  RAM-like Memory for Novelty-based Exploration  Replay Memory  Novelty-based Replay  Performance-based Replay QA and Demo  Part C: Advanced Topics (60 minutes)  Language-guided exploration  Causal discovery for exploration  Statistical approaches  Deep learning approaches👈 [We are here]  Closing Remarks  QA and Demo
  • 136. 136 Deakin University CRICOS Provider Code: 00113B Causality-driven hierarchical structure discovery for reinforcement learning. Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... & Chen, Y. (2022). Causality-driven hierarchical structure discovery for reinforcement learning. Advances in Neural Information Processing Systems, 35, 20064- 20076.
  • 137. 137 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning  Environments with multiple objects.  Ex: having wood and stone can make an axe.  How to measure causality between these objects?  Model the SCM of objects between adjacent timesteps. Figure 11: Example of a multi-object environment and its causal graph
  • 138. 138 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning (cont.)  A simpler case with 4 objects A, B, C, D.  A is the object of interest.  We need a forward/transition model, with its own parameters.  We also need a masking function (otherwise we don't know which objects affect A), parameterized per object, where M is the number of objects. Figure 12: Example of SCM representation learning (with object of interest A).
  • 139. 139 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning (cont.)  An iterative process.  Fix one set of parameters while optimizing the other.  After optimization finishes, extract the edges:
  • 140. 140 Deakin University CRICOS Provider Code: 00113B CDHRL Framework Figure 14: CDHRL Framework.
  • 141. 141 Deakin University CRICOS Provider Code: 00113B Hierarchical Causal Subgoal Training  Whenever new subgoals are added, train the corresponding subgoal policies.  Decide whether a subgoal is reachable from the current state with the current policy.  If it is reachable within a certain number of timesteps, add it to the subgoal set. Figure 13: Example of a subgoal hierarchy given the causal graph.
  • 142. 142 Deakin University CRICOS Provider Code: 00113B Figure 16: Results on Minigrid-2d (left) and Eden (right). Figure 15: Environment Minigrid-2d (left) and Eden (right).  An upper controller policy is trained to select subgoals from the current subgoal set and maximize the task reward.  The upper controller is a multi-level DQN with HER.
  • 143. 143 Deakin University CRICOS Provider Code: 00113B Disentangling causal effects for hierarchical reinforcement learning. Corcoll, O., & Vicente, R. (2020). Disentangling causal effects for hierarchical reinforcement learning. arXiv preprint arXiv:2010.01351.
  • 144. 144 Deakin University CRICOS Provider Code: 00113B Controlled Effect Disentanglement  The total effect, the change in environment state, involves dynamic effects and controllable effects.  Is the next state an outcome of the action, or just an accident?  We care about controllable effects.  Based on the Average Treatment Effect (ATE), controllable effects are separated from the 'normality', i.e., what would happen regardless of the chosen action. Figure 17: The relationship between total effects, dynamic effects, and controllable effects.
  • 145. 145 Deakin University CRICOS Provider Code: 00113B  We cannot calculate the total effect for every action.  Use a neural network to estimate it.  Learn a vector representation of the effect. Figure 18: Total-effect modelling architecture.
  • 146. 146 Deakin University CRICOS Provider Code: 00113B Exploration with controllable effects as goals  Components: a model of the distribution of effects, a policy that samples an effect (subgoal), and a policy that takes actions to realize it. Figure 19: Components of causal effects for hierarchical reinforcement learning.
  • 147. 147 Deakin University CRICOS Provider Code: 00113B Controllable Effect Distribution Learning  Train a variational autoencoder.  Approximate the distribution of controllable effects. Figure 20: VAE architecture to learn the effect distribution.
  • 148. 148 Deakin University CRICOS Provider Code: 00113B Training to Select Goals and Reach Goals  Train an effect-selection policy using DQN.  Use it to select sub-effects as subgoals.  Train an action policy using DQN to reach the selected subgoal. Figure 21: Architecture learning to select effects as subgoals. Figure 22: Architecture learning to select actions to reach subgoals.
  • 149. 149 Deakin University CRICOS Provider Code: 00113B Three levels:  Task T: go to the target location.  Task BT: go to the target location while carrying a ball.  Task CBT: pick up the ball, put it in the chest, and go to the target. Figure 23: Comparison with the DQN baseline on the three tasks; CEHRL can learn the complex task, while DQN cannot. Figure 24: Random-effect vs. random-action exploration.
  • 150. 150 Deakin University CRICOS Provider Code: 00113B End of Part C  QA  Demo Generated by DALL-E 3