Despite remarkable successes in various domains such as robotics and games, Reinforcement Learning (RL) still struggles with exploration inefficiency. For example, in hard Atari games, state-of-the-art agents often require billions of trial actions, equivalent to years of practice, while a moderately skilled human player can achieve the same score in just a few hours of play. This contrast emerges from the difference in exploration strategies between humans, leveraging memory, intuition and experience, and current RL agents, primarily relying on random trials and errors. This tutorial reviews recent advances in enhancing RL exploration efficiency through intrinsic motivation or curiosity, allowing agents to navigate environments without external rewards. Unlike previous surveys, we analyze intrinsic motivation through a memory-centric perspective, drawing parallels between human and agent curiosity, and providing a memory-driven taxonomy of intrinsic motivation approaches.
The talk consists of three main parts. Part A provides a brief introduction to RL basics, delves into the historical context of the explore-exploit dilemma, and raises the challenge of exploration inefficiency. In Part B, we present a taxonomy of self-motivated agents leveraging deliberate, RAM-like, and replay memory models to compute surprise, novelty, and goals, respectively. Part C explores advanced topics, presenting recent methods using language models and causality for exploration. Whenever possible, case studies and hands-on coding demonstrations will be presented.
2. 2
Why This Topic?
Deep Learning in RL: Deep RL agents excel in complex tasks but need many interactions to learn
Practicality Issues: Extensive learning steps hinder RL's use in real-world scenarios
Exploration Optimization: Improving exploration is key for RL's real-world application
Memory and Learning: Memory-based exploration can speed up learning and advance AI
Generated by DALL-E 3
3. 3
About Us
Authors: Hung Le, Hoang Nguyen and Dai Do
Our lab: A2I2, Deakin University
Hung Le is a research lecturer at Deakin
University, leading research on deep sequential
models and reinforcement learning
Hoang Nguyen is a second-year PhD student at
A2I2, specializing in reinforcement learning and
causality
Dai Do is a second-year PhD student at A2I2,
specializing in reinforcement learning and large
language models
A2I2 Lab Foyer, Waurn Ponds
4. 4
About the Tutorial
This tutorial is based on my previous presentations,
expanding on topics covered in the following talks:
Memory-Based Reinforcement Learning.
AJCAI’22
Memory for Lean Reinforcement Learning. FPT
Software AI Center. 2022
Neural machine reasoning. IJCAI’21
From deep learning to deep reasoning. KDD’21
My Blogs: https://hungleai.substack.com/
Generated by DALL-E 3
5. 5
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction 👈 [We are here]
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes,
including a 20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven
Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
7. 7
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Key components and frameworks 👈 [We are here]
Classic exploration
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
8. 8
Reinforcement Learning Basics
In reinforcement learning (RL), an agent
interacts with the environment, taking
actions a, receiving a reward r, and
moving to a new state s
The agent is tasked with maximizing the
accumulated rewards or returns R over
time by finding optimal actions (policy)
9. 9
Reinforcement Learning Concepts
Policy π : maps state s to action a
Return (discounted) G or R: the cumulative
(weighted) sum of rewards
State value function V: the expected discounted
return starting with state s following policy π
State-action value function Q: the expected return starting from state s, taking action a, and thereafter following policy π
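For reference, a sketch of these quantities in LaTeX notation (discount factor γ assumed):
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \quad V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s], \quad Q^{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]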
10. Classic RL algorithms: Value learning
10
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms for
connectionist reinforcement learning." Machine learning 8, no. 3 (1992):
229-256.
Basic idea: before finding the optimal policy, we first find the value function
Learn the (action) value function:
V(s): estimate V(s) = E(∑R from s)
Q(s,a): estimate Q(s,a) = E(∑R from s,a)
Given Q(s,a) → choose the action that maximizes the value (ε-greedy policy)
RL algorithms: Q-Learning
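A minimal Python sketch of the tabular Q-learning (TD) update and the ε-greedy policy described above; the environment interface, learning rate, and discount factor are illustrative assumptions, not the tutorial's reference code.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target bootstraps from the greedy value of the next state
    td_target = r + gamma * np.max(Q[s_next])
    # Move the current estimate toward the TD target
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # With probability eps explore randomly, otherwise act greedily w.r.t. Q
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))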
11. Classic RL algorithm: Policy gradient
Basic idea: directly optimise the policy as a
function of states
Need to estimate the gradient of the
objective function E(∑R) w.r.t the
parameters of the policy
Focus on optimisation techniques
11
REINFORCE
(policy gradient)
RL algorithms: Policy Gradient
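A minimal sketch of the REINFORCE objective: the policy-gradient loss is the negative sum of log-probabilities weighted by discounted returns; the episode data format and discount factor are illustrative assumptions.

import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # log_probs: list of log pi(a_t|s_t) tensors for one episode; rewards: list of floats
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    log_probs = torch.stack(log_probs)
    # Minimizing this loss performs gradient ascent on E[sum of rewards]
    return -(log_probs * returns).sum()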
13. 13
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Key components and frameworks
Classic exploration 👈 [We are here]
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
14. 14
ε-greedy
ε-greedy is the simplest exploration
strategy that works in theory
It heavily relies on pure randomness
and biased estimates of action values
Q, and thus is sample-inefficient in
practice
We often go with what we assume is
best, but sometimes, we take a
random chance to explore other
options. This is one example of an
optimistic strategy
It is used in Q-learning
15. 15
ε-greedy: Problems
It is not surprising that this strategy might struggle in real-world scenarios:
being overly optimistic when your
estimation is imprecise can be risky.
It may lead to getting stuck in a local
optimum and missing out on
discovering the global one with the
highest returns.
Benchmarking ε-greedy (red line) and other exploration methods on Montezuma's
Revenge. Taïga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G.
Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning
environment." arXiv preprint arXiv:1908.02388 (2019).
ε-greedy
16. 16
Upper Confidence Bound (UCB)
One way to address the problem of over-optimism is to consider the uncertainty of the estimation
We do not want to miss an action with a
currently low estimated value and high
uncertainty, as it may possess a higher value:
What we need is a guarantee on how far the estimate can deviate from the true value: Hoeffding's Inequality provides such a bound
How to estimate uncertainty?
Implicitly, with a large value of the exploration-exploitation trade-off parameter c, the chosen action is more likely to deviate from the greedy action, leading to increased exploration.
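A minimal sketch of UCB1-style action selection for a bandit, using the Hoeffding-style square-root bonus discussed above; the array shapes and the trade-off constant c are illustrative assumptions.

import numpy as np

def ucb_action(q_hat, counts, t, c=2.0):
    # q_hat[a]: empirical mean reward of arm a; counts[a]: times arm a was pulled
    if np.any(counts == 0):
        return int(np.argmin(counts))  # pull each arm at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_hat + bonus))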
17. 17
Thompson Sampling
When additional assumptions about the reward distribution are
available, an action can be chosen based on the probability that
it is optimal (probability matching strategy)
Thompson sampling is one way to implement the strategy:
1. Assume the reward follows a distribution p(r|a, θ) where θ
is the parameter whose prior is p(θ)
2. Given the set of past observations Dt made of pairs {(ai, ri) | i = 1, 2, ..., t}, we update the posterior using Bayes' rule
3. Given the posterior, we can estimate the action value
4. We can compute the probability of choosing action a
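A minimal sketch of Thompson sampling for Bernoulli-reward bandits, following the probability-matching recipe above; the Beta(1,1) prior and the success/failure bookkeeping are illustrative assumptions.

import numpy as np

def thompson_action(successes, failures):
    # One posterior sample per arm from Beta(successes+1, failures+1), then act greedily on the samples
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def thompson_update(successes, failures, a, r):
    # r is a binary reward observed for the chosen arm a
    if r > 0:
        successes[a] += 1
    else:
        failures[a] += 1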
18. 18
Information Gain
Information Gain (IG) measures the change in the
amount of information (measured in entropy H) of a
latent variable
The latent variable often refers to the parameter of
the model θ after seeing observation (e.g., reward r)
caused by some action a
A big drop in the entropy means the observation
makes the model more predictable and less
uncertain
Our goal is to find a harmony between minimizing
expected regret in the current period and acquiring
new information about the observation model
Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-
directed sampling." Advances in Neural Information Processing Systems 27
(2014).
19. 19
Application: Multi-arm bandit
There are multiple actions to take (bandit’s arm)
After taking one action, the agent observes a reward
Maximize cumulative rewards or minimize cumulative regrets.
Generated by DALL-E 3
20. 20
Limitations of Classical Exploration
❌ Scalability Issues: Most are specifically designed for
bandit problems, and thus, they are hard to apply in
large-scale or high-dimensional problems (e.g., Atari
games), resulting in increased computational
demands that can be impractical
❌ Assumption Sensitivity: These methods heavily rely
on specific assumptions about reward distributions or
system dynamics, limiting their adaptability when
assumptions do not hold
❌ Vulnerability to Uncertainty: They may struggle in
dynamic environments with complex reward
structures or frequent changes, leading to suboptimal
performance Generated by DALL-E 3
21. 21
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
Hard Exploration Problems 👈 [We are here]
Simple exploring solutions
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
22. Task:
Agent searches for the key
Agent picks up the key
Agent opens the door to access the room
Agent finds the box in the room
Reward:
If the agent reaches the box, get +1
reward
22
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated
policies using the simple reward?
Modern RL Environments are Complicated
23. 23
Why is Scaling a Big Problem?
Practical environments often involve huge
continuous state and action spaces
Classical approaches cannot be implemented or fail
to hold their theoretical properties in these settings
Doom environment: continuous high-dimensional
state space (source)
Mujoco environment: continuous action space (source).
24. 24
Challenging Environments for Exploration
Environments that require long-term memory from agents:
Maze navigation with conditions such as finding objects that have the same color as the wall
Remembering the shortest path to objects encountered in the past
Noisy environments:
Noisy-TV: a random TV will distract the RL agent from its main task due to its noisy screen
https://github.com/jurgisp/memory-maze
Noisy-TV (source)
25. 25
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
Hard exploration problems
Simple exploring solutions 👈 [We are here]
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
26. 26
Entropy Maximization
In the era of deep learning, neural networks are used for
approximating functions, including parameterizing value and
policy functions in RL
ε-greedy is less straightforward for policy gradient methods
An entropy loss term is introduced in the objective function to
penalize overly deterministic policies. This encourages diverse
exploration, avoiding suboptimal actions by maximizing the
bonus entropy loss term
❌ It may also impede the optimization of other losses, especially
the main objective
❌ The entropy loss does not enforce different levels of exploration for different tasks
It is used in PPO, A3C, …
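A minimal sketch of adding an entropy bonus to a policy-gradient loss, as done in PPO/A3C-style objectives; the coefficient value and the distribution object are illustrative assumptions.

import torch

def pg_loss_with_entropy(log_probs, advantages, dist, ent_coef=0.01):
    # Standard policy-gradient term
    pg = -(log_probs * advantages).mean()
    # Entropy bonus: subtracting it penalizes overly deterministic policies
    entropy = dist.entropy().mean()
    return pg - ent_coef * entropy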
27. 27
Noisy Networks
Another method to add randomness to the
policy is to add noise to the weights of the
neural networks
Throughout training, noise samples are drawn
and added to the weights for both forward and
backward propagation
❌ Although Noisy Networks can vary the degree of exploration across tasks, adapting exploration at the state level remains out of reach.
Certain states with higher uncertainty may require
more exploration, while others may not
An example of a noisy linear layer. Here w is the weight matrix and b is the bias vector. The parameters µw, µb, σw and σb are the learnables of the network whereas εw and εb are noise variables. Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295 (2017)
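A minimal sketch of a noisy linear layer with independent Gaussian noise on weights and biases, following the parameterization in the caption; the initial σ value and module structure are illustrative assumptions (the paper also proposes a factorized-noise variant).

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_f, out_f, sigma0=0.017):
        super().__init__()
        bound = 1.0 / in_f ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.mu_b = nn.Parameter(torch.zeros(out_f))
        self.sigma_w = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.sigma_b = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x):
        # Fresh noise is drawn each forward pass and used in both forward and backward passes
        w = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        b = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return x @ w.t() + b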
30. 30
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks 👈 [We are here]
Reward shaping and the role of memory
A taxonomy of memory-driven intrinsic exploration
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
31. No curiosity, random exploration:
epsilon-greedy
"Tractable" exploration:
Somehow optimize exploration, e.g., UCB, Thompson sampling
Only doable for simple environments
Approximate "tractable" exploration (count-based):
Scalable to harder environments
Intrinsic motivation exploration (SOTA):
Predictive
Novelty- or surprise-based curiosity
Causal
…
https://cmutschler.de/rl
Frameworks
32. 32
Reward Shaping
Entropy loss or noise injection into the policy/value parameters has the limitation that the level of exploration is not explicitly conditioned on fine-grained factors such as states or actions
Solution: intrinsic reward bonuses assign higher
internal rewards to state-action pairs that
require higher exploration and vice versa
The final reward for the agent will be the
weighted sum of the intrinsic reward and the
external (environment) reward
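A minimal sketch of this reward mixing; the weighting coefficient beta is an illustrative assumption.

def shaped_reward(r_ext, r_int, beta=0.01):
    # Weighted sum of the environment reward and the intrinsic bonus
    return r_ext + beta * r_int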
33. Animals can travel long distances until they find food
Humans can navigate to an address in an unfamiliar city
What motivates these agents to explore?
intrinsic motivation
curiosity, hunch
intrinsic reward
33
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Intrinsic Motivation in Biological World
34. 34
What Does Intrinsic Reward Represent?
Novelty
It is inherent for biological agents to be motivated by new things.
Tracking the occurrences of a state provides a novelty indicator, with
increased occurrences signaling less novelty
Surprise
Surprise emerges when there's a discrepancy between expectations
and the observed or experienced reality
Build a model of the environment, predicting the next state given the
current state and action
The intrinsic reward is the prediction error itself
This reward increases when the model encounters difficulty in
predicting or expresses surprise at the current observation.
Novelty
Surprise
35. 35
The Role of Memory
Biological agents inherently possess memory to
monitor events:
Drawing from previous experiences, they
discern novelty in observations
Utilizing their prior understanding of the
world, they identify unexpected observations
RL agents can be equipped with memory:
Event-based Memory: Episodic Memory
Semantic Memory: World Model
Novelty
Surprise
MEMORY
37. 37
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction👈 [We are here]
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
38. 38
Forward Dynamics Prediction
Build a model of the environment, predicting the
next state given the current state and action
This kind of model is also known as forward dynamics or a world model
C: actor vs. M: world model. M predicts the consequences of C's actions; C acts to make M fail
As a result:
If C's actions result in repeated and boring consequences, M predicts them well
So C must explore novel consequences https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
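A minimal sketch of a forward-dynamics (world-model) intrinsic reward: the bonus is the model's next-state prediction error; network sizes, the action encoding, and the MSE error are illustrative assumptions.

import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        # a is assumed to be a float tensor (e.g., a one-hot encoded action)
        return self.net(torch.cat([s, a], dim=-1))

def surprise_reward(model, s, a, s_next):
    # Surprise = error of the world model's next-state prediction
    with torch.no_grad():
        pred = model(s, a)
    return ((pred - s_next) ** 2).mean(dim=-1)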
39. 39
Learning Progress as Intrinsic Reward
The learning progress is estimated by comparing
the mean error rate of the prediction model
during the current moving window to the mean
error rate of the previous window
The two windows are offset by 𝜏 steps
It can mitigate the Noisy-TV problem
Pierre-Yves Oudeyer & Frederic Kaplan. “How can we define intrinsic motivation?” Conf. on Epigenetic
Robotics, 2008.
40. 40
Deep Dynamic Models
Use a neural network f that takes a representation of the current state and action to predict the
next state
The representation is shaped through unsupervised training, i.e., state reconstruction task, using
an autoencoder’s hidden state
The network f, fed with the autoencoder’s hidden state, is trained to minimize the prediction
error, which is the norm of the difference between the predicted state and the true state
Intrinsic reward
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement
Learning with Deep Predictive Models. In NIPS 2015.
41. Forward model on feature space
The feature space ignores irrelevant, uncontrollable factors
The consequences depend on:
Action (controllable)
Environment (uncontrollable)
We want a state embedding that captures only the controllable factors
41
https://blog.dataiku.com/curiosity-driven-learning-through-next-state-prediction
Deepak Pathak et al.: Curiosity-driven Exploration by Self-Supervised Prediction. ICML 2017
ICM
More Complicated Model
Inverse dynamics representation learning
43. 43
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises👈 [We are here]
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
44. The prediction target
is stochastic
Information necessary for the
prediction is missing
Model class of predictors is too
limited to fit the complexity of
the target function
Both the totally predictable and
the fundamentally unpredictable will
get boring
44
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
When Predictive Surprise Fails
45. 45
https://favpng.com/
Ideas for Improvements
Reward M's progress instead of its error
If the consequence is too hard or too easy to predict, M makes no progress → no reward
Remember all experiences
"Store" all experienced consequences, including stochastic ones
Global or local memory, like humans
Better representations
Representation learning
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
46. 46
Random Network Distillation (RND)
The intrinsic reward is defined through the task of
predicting the output of a fixed (target) network
Target Network’s weights are random
By predicting the target output, the Predictor Network
tries to “remember” the randomized state
If old state reappears, it can be predicted easily by the
Predictor Network
RND obviates Noisy-TV since the target network can be
chosen to be deterministic and inside the model-class
of the predictor network.
Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network
distillation." In International Conference on Learning Representations. 2018.
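A minimal sketch of Random Network Distillation: a trainable predictor regresses onto the output of a fixed, randomly initialized target network, and the prediction error serves as both the intrinsic reward and the predictor's training loss; network sizes are illustrative assumptions.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False  # the target stays random and fixed

    def intrinsic_reward(self, obs):
        # Prediction error shrinks for states seen (and "remembered") many times
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)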
47. 47
Noisy-TV in Atari games
In Montezuma's Revenge, the agent oscillates between two rooms
This leads to an irreducibly high prediction error, as the non-determinism of sticky actions
makes it impossible to know whether, once the agent is close to crossing a room boundary,
making one extra step will result in it staying in the same room, or crossing to the next one
This is a manifestation of the ‘noisy TV’ problem
48. 48
Latent World Model
A world model for exploration should be robust against stochasticity and able to extrapolate the state dynamics → the prediction error can be a measurement of novelty
Train the WM on a latent representation space. This space is shaped by unsupervised learning into a zero-centered distribution with a covariance matrix equal to the identity → robust to stochastic elements and arranged to respect the temporal distance of observations
The WM error is computed in latent space
50. 50
Bayesian Surprise
Surprise can be interpreted from a Bayesian statistics perspective
Similar to IG idea, the aim is to minimize uncertainty about the dynamics, formalized as maximizing
the cumulative reduction in entropy
The reduction of entropy per time step, also known as mutual information, I(𝚯;St+1|ξt,at)
Here θ is the parameters of the dynamics model 𝚯. Because we are interested in finding intrinsic
reward for a given timestep, we can define:
Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "Vime:
Variational information maximizing exploration." Advances in neural information processing systems 29
(2016).
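For reference, the per-step intrinsic reward this defines (as in VIME) can be sketched in LaTeX as the information gain about the dynamics parameters, with ξ_t denoting the history up to time t:
r^{i}_t = D_{KL}\!\left( p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \right)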
51. 51
Variational Bayesian Surprise
The KL involves computing the posterior p(θ|st+1),
which is generally intractable
Use variational inference for approximating the
posterior, alternative variational distribution q(θ; 𝜙)
𝜙 is the variational parameter
This is equivalent to parameterizing the dynamics
model as a Bayesian neural network (BNN) with
weight distributions maintained as a fully factorized
Gaussian. Train 𝜙:
53. 53
Bayesian Learning Progress
Training a BNN is complicated; there are different Bayesian views on surprise
Formulating the objective of the RL agent as jointly
maximizing expected return and surprise
P is the true dynamics model and P𝜙 is the learned
dynamics model
The objective can be translated to maximizing the
bonus reward per step
Joshua Achiam and Shankar Sastry. 2017. Surprise-based intrinsic motivation for deep reinforcement
learning. arXiv preprint arXiv:1703.01732 (2017).
54. 54
Bayesian Learning Progress: Approximation Solutions
In practice, we do not know P. Need
approximation
Prediction error: measures the error in log
probability instead of the norm of the
difference between the predicted and the
reality
Learning progress written in the form of log
probability
To train the dynamics model P𝜙, solve the
constrained optimization
By introducing the KL constraint, the
posterior model is prevented from diverging
too far from the prior, thereby preventing
the generation of unstable intrinsic rewards.
55. 55
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement 👈 [We are here]
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
56. 56
Intrinsic Motivation via Disagreement
An alternative method with forward dynamics
involves using the variance of the prediction
rather than the error
This requires multiple prediction models
trained to minimize the forward dynamics
prediction errors,
Use the empirical variance (disagreement) of
their predictions as the intrinsic reward
The higher the variance, the more uncertain
about the observation need to explore
more
Deepak Pathak, et al. “Self-Supervised Exploration via Disagreement.” In ICML 2019.
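A minimal sketch of the disagreement-based intrinsic reward: an ensemble of forward models is trained on the same transitions and the variance of their predictions is the bonus; the ensemble size and the ForwardModel class from the earlier sketch are illustrative assumptions.

import torch

def disagreement_reward(models, s, a):
    # models: list of forward-dynamics networks (e.g., the ForwardModel sketched earlier)
    with torch.no_grad():
        preds = torch.stack([m(s, a) for m in models])  # [n_models, batch, state_dim]
    # Empirical variance across ensemble members, averaged over state dimensions
    return preds.var(dim=0).mean(dim=-1)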
57. 57
Bayesian Disagreement
Bayesian surprises are defined by a specific dynamics model
What happens if we consider a distribution of models? Bayesian surprise of a policy becomes:
P(T) is the transition distribution of the
environment and P(T|𝜙) is the transition
distribution according to the dynamics
model.
Prediction error averaging considers all
transition models and possible
predictions
Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active
Exploration. In Proceedings of the 36th International Conference on Machine Learning,
ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779–5788.
P(S|s,a,t) is the dynamics model learned from a transition
dynamics t
58. 58
How to Compute u?
The term u(s,a) turns out to be the Jensen-
Shannon Divergence of a set of learned
dynamics from a transition dynamics t
JSD can be approximated by employing N
dynamics models:
For each P parameterized by Gaussian
distribution 𝓝i(µi,Σi), we need another layer of
approximation to compute u(s,a) by replacing
the Shannon entropy with Rényi entropy and use
the corresponding Jensen-Rényi Divergence
(JRD)
59. 59
Surprise-based Exploration: Pros and Cons
👍 Dynamics models can be trained easily these
days. There are many works on that topic
👍 Advanced methods can somehow handle Noisy-
TV
❌ Focusing on the forward dynamics error is not
effective in driving the exploration, especially when
the world model is not good and always predicts
wrongly
❌ Advanced methods such as ensembles or
learning progress are compute-expensive to cope
with Noisy-TV
Generated by DALL-E 3
60. 60
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break👈 [We are here]
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
61. 61
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory 👈 [We are here]
Episodic memory
Hybrid memory
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
62. 62
Novelty via Counting
Humans want to explore novel places, make new
friends, and buy new stuff. It is inherent for humans to
be motivated by new things
How to translate this intrinsic motivation to RL agents?
Tracking the occurrences of a state (N(s)) provides a novelty indicator, with increased occurrences meaning less novelty
ri(s,a) = N(s)^(-1/2), where N counts the number of times s appears
❌ Empirical counting in continuous state spaces is impractical due to the rarity of exact state visits, resulting in N(s)=0 most of the time
63. 63
Density-based State Counting
Use a density function of the state to estimate its occurrences
Let ρ(x) = ρ(s = x | s1:n) be the density of state x given s1:n, and ρ′(x) = ρ(s = x | s1:n x) be the density of x after observing one more occurrence of x following s1:n
Define N̂(x) and n̂ as the "pseudo-count" of x and the pseudo-total count before and after an occurrence of x, so that ρ(x) = N̂(x)/n̂ and ρ′(x) = (N̂(x)+1)/(n̂+1)
The true density of x is assumed to stay the same before and after an occurrence of x
Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.
"Unifying count-based exploration and intrinsic motivation." Advances in neural information processing
systems 29 (2016).
64. 64
Pseudo State Count
In practice, in a huge state space, ρ’n(x)≈0, we can
rewrite the pseudo-count:
PG means predictive gain, which is computed as:
It resembles the information gain: the difference between the predictions of the posterior and the prior distribution
Extend to count state-action pairs, one can
concatenate the action representation with the state
representation.
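A sketch of the resulting pseudo-count expression under these definitions (following Bellemare et al., 2016), in LaTeX:
\hat{N}(x) = \frac{\rho(x)\,\big(1 - \rho'(x)\big)}{\rho'(x) - \rho(x)} \approx \big( e^{\mathrm{PG}(x)} - 1 \big)^{-1}, \qquad \mathrm{PG}(x) = \log \rho'(x) - \log \rho(x)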
65. 65
Hash Count
If counting each exact state is challenging, why
not partition the continuous state space into
manageable blocks?
By using a function 𝜙 mapping a state to a code,
we can count the occurrences of the code instead of the state
"Distant" states are counted separately while "similar" states are merged → use SimHash as the mapping function
Higher k → fewer hash collisions, and thus states are more distinguishable
Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman,
Filip DeTurck, and Pieter Abbeel. "# exploration: A study of count-based exploration for deep
reinforcement learning." Advances in neural information processing systems 30 (2017).
sgn is the sign function, A is a k × d matrix with i.i.d. entries drawn from a standard Gaussian distribution, and g is some transformation function
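A minimal sketch of SimHash-based counting: states are hashed with a random projection, counts are kept per hash code, and the bonus is 1/√count; the projection dimension k and the identity transformation g are illustrative assumptions.

import numpy as np
from collections import defaultdict

class SimHashCount:
    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # i.i.d. Gaussian projection
        self.counts = defaultdict(int)

    def bonus(self, s):
        code = tuple(np.sign(self.A @ s).astype(int))  # phi(s) = sgn(A g(s)), with g = identity here
        self.counts[code] += 1
        return 1.0 / np.sqrt(self.counts[code])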
66. 66
Hash Count: Cope with High-dimensional States
Representation learning is employed to capture a good g through an autoencoder network and reconstruction learning
The network aims to reconstruct the original
state input s, and the hidden representation b(s)
will be used to compute g(s)=round(b(s))
Another regularization term prevents the
corresponding bit in the binary code from
flipping throughout the agent's lifetime
67. 67
Change Counting
To encourage the agent to explore novel
state-action pairs meaningfully, we can
assess changes caused by activities and
prioritize those that signify novelty
c(s,s’) as the environment change caused
by a transition (s, a, s’)
Combines state count and change count,
resulting in the intrinsic reward:
Change count (last row) vs norm of change (middle row) vs state count (top row).
Change count suffers less from attracting to meaningless activities.
Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta.
"Interesting object, curious agent: Learning task-agnostic exploration.
" Advances in Neural Information Processing Systems 34 (2021): 20516-20530.
68. 68
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory
Episodic memory👈 [We are here]
Hybrid memory
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
69. 69
Alternative Novelty Measurement
❌ An inherent constraint of count-based methods lies
in the approximation error between the pseudo-count
and the true count
Different novelty criteria: novel observations are
those that demand effort to reach, typically beyond
the already explored areas of the environment
Measure the effort in environmental steps,
estimating it with a neural network that predicts
the steps between two observations
To capture the explored areas of the environment, use an episodic memory initialized empty at the start of each episode
Novelty through reachability concept. An observation is novel if it can only reach
those in the memory in more than k steps. Savinov, Nikolay, Anton Raichuk,
Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain
Gelly. "Episodic curiosity through reachability." arXiv preprint arXiv:1810.02274
(2018).
if the maximum reachability score of the given observation
is greater than a threshold k, we can regard it as novel
70. 70
Episodic Curiosity: Memory Workflow
Define the threshold k, use reachability
network to classify whether two observations
are separated by more or less than k steps
After training, the reachability network is used
to estimate the novelty of the current
observation in the episode given the episodic
memory M, which finally is used to compute
the intrinsic reward
Using a function F to aggregate the reachability
scores between the current observation and
those in the memory leads to the intrinsic
reward. F can be max or 90-th percentile
71. 71
Explicit Memory of Positions (only for Atari games)
Collect the agent’s position from game RAM to
indicate where on the grid an agent has visited
White sections in the curiosity grid (middle)
show which locations have been visited; the
unvisited black sections yield an exploration
bonus when touched.
The network receives both game input (left) and
curiosity grid (middle) and must learn how to
form a map of where the agent has been
(hypothetical illustration, right)
Stanton, Christopher, and Jeff Clune. "Deep curiosity search: Intra-life exploration can improve
performance on challenging deep reinforcement learning problems." arXiv preprint
arXiv:1806.00553 (2018).
72. 72
Novelty Connection to Surprise
Theoretically, an overparameterized autoencoder whose task is to reconstruct its input is
equivalent to an associative memory (Adityanarayanan et al., PNAS 2020)
We can train an autoencoder that takes the state as input to reconstruct and use its
reconstruction error as an indicator of life-long novelty
Greater errors signify a higher level of novelty in states, indicating that the autoencoder has not
encountered these states frequently enough to learn their successful reconstruction effectively
Related to RND: the intrinsic reward is still the reconstruction error. But this time, the
reconstructed target is no longer the original input. Instead, it is a transformed version of the
input
73. 73
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory
Episodic memory
Hybrid memory 👈 [We are here]
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
74. 74
Surprise + Novelty
Never Give Up agent (NGU) combines
existing surprise and novelty components
from the literature cleverly:
1. State representation learning via inverse
dynamics (ICM)
2. Life-long novelty module using RND
3. Episodic novelty using episodic memory
inspired by EC
The implementation of the episodic
memory in NGU is new.
The dynamics model f is employed to produce the
representations for the novelty modules. Two types of
novelty are combined to produce the final intrinsic reward.
Badia, Adrià Puigdomènech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven
Kapturowski, Olivier Tieleman et al. "Never give up: Learning directed exploration strategies." arXiv
preprint arXiv:2002.06038 (2020).
75. 75
NGU: Episodic Novelty
Encourages the exploration of novel states
within an episode simply via nearest-neighbor
matching.
As a result, the agent will not revisit the same
state in an episode twice. This concept is
different from lifelong novelty
The closer the current state is to its neighbors,
the higher the similarity and thus, the smaller
the reward
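A minimal sketch of this episodic novelty bonus via k-nearest-neighbour matching in the episodic memory; the inverse-distance kernel and constants are illustrative assumptions rather than NGU's exact formulation.

import numpy as np

def episodic_novelty(memory, emb, k=10, eps=1e-3):
    # memory: list of embeddings stored so far in the current episode
    if memory:
        d = np.array([np.sum((m - emb) ** 2) for m in memory])
        d = np.sort(d)[:k] / (d.mean() + 1e-8)       # normalized k-NN distances
        similarity = np.sum(eps / (d + eps))          # closer neighbours -> higher similarity
        reward = 1.0 / np.sqrt(similarity + 1e-8)     # higher similarity -> smaller reward
    else:
        reward = 1.0
    memory.append(emb)
    return reward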
Hybrid Intrinsic Reward:
76. 76
Agent57: Exploration at Scale
An upgraded version of the NGU:
Splitting the value function into 2 separate value functions for external and internal rewards
A population of policies (and value functions) is
trained, each characterized by a distinct pair of
exploration parameters:
N is the size of the population. 𝛾j is the discount
factor hyperparameters and βj is the intrinsic
reward coefficient hyperparameters. Adapted by a
meta controller (bandit algo.) Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan
Daniel Guo, and Charles Blundell. "Agent57: Outperforming the atari human benchmark." In International
conference on machine learning, pp. 507-517. PMLR, 2020.
78. 78
Cluster Memory for Counting
Parametric methods for counting have problems: slow adaptation and catastrophic forgetting
Count estimation provides a long-term visitation-based exploration bonus while retaining responsiveness to the most recent experience → a finite slot-based container M stores representations and corresponding counters C
The RECODE memory is updated by either:
Adding a new embedding (atom) with count 1, or
Updating the nearest atom and increasing its count
Kernel is non-zero for all
neighbours within a radius
79. 79
RECODE: Representation Learning
The transformer takes masked sequences of
length k consisting of actions and embedded
observations as inputs and tries to reconstruct
the missing embeddings in the output
The reconstructed embeddings at time t − 1 and
t are then used to build a 1-step action-
prediction classifier
Similar to ICM’s inverse dynamics
Saade, Alaa, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo
Sarra, Oliver Groth, Michal Valko, and Bilal Piot. "Unlocking the Power of Representations in Long-term
Novelty-based Exploration." In The Twelfth International Conference on Learning Representations. 2024.
80. 80
Novelty of Surprise
The norm of the prediction error (surprise norm) is
not good (e.g., as in Noisy-TV)
A new metric: surprise novelty, the error of
reconstructing surprise (the error of state
prediction)
This requires a surprise generator such as a
dynamics model to produce the surprise vector u, i.e., the difference vector between the prediction and reality
Then, inter and intra-episode novelty scores are
estimated by a system of memory, called Surprise
Memory (SM), consisting of an autoencoder
network W and episodic memory M, respectively
Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise:
Improving Exploration Through Surprise Novelty. In AAMAS, 2024.
81. 81
Benefit of Hybrid Systems
Marry the goodness of both worlds:
Dynamic Prediction
Novelty Estimation
Combine intrinsic rewards:
Long-term, global, inter-episodes
Short-term, local, intra-episodes
Limitation of dynamic prediction using deep models
can be compensated with non-parametric memory
approaches
Noisy-TV problems are mitigated but not completely
solved
Noisy-TV: a random TV will distract the RL agent from its main
task due to high surprise (source).
82. 82
However, Is Intrinsic Reward Really Good?
Taiga et al. On bonus based exploration methods in the arcade
learning environment. In ICLR 2019
Intrinsic motivation rewards rely heavily on the memory concept (global, local, …)
The performance of IR agents is very good, but …
They require more samples to train (10^9 or 10^10 steps is the norm)
Isn't the goal of IR to enable sample efficiency?
Overfitting to Montezuma's Revenge?
Depending on architecture and tuning, plain exploration is generally OK
83. 83
Issues with Intrinsic Motivation
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
84. 84
Reflection on Memory
Surprise:
Memory is hidden inside dynamics models, memorizing the seen observations to make prediction possible
This memory is long-term, semantic, and slow to update
Novelty:
Memory is explicit, e.g., a slot-based matrix, nearest-neighbour estimator, counter, …
This memory is often short-term, instance-based, and adaptive to changes in the environment
[Diagram: memory underlies both surprise and novelty in intrinsic exploration; memory → exploration?]
85. 85
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay 👈 [We are here]
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
86. 86
A Direct Exploration Mechanism
Two major issues have hindered the ability of
previous algorithms to explore:
Detachment: loses track of interesting areas
to explore from
Derailment: the exploratory mechanisms of the algorithm prevent it from returning to previously visited states
The role of memory is simplified:
Store past states
Retrieve states to explore
[Diagram: replay memory → sampled states → exploration]
87. 87
Go-Explore
Detachment can be addressed by Memory: keep track of areas by grouping similar states into cells
Similar to Hash count
Map a state to a cell
Each cell has a score indicating sampling probability
Derailment can be addressed by Simulator (only suitable for
Atari)
Sample a cell’s state from the memory
Simulator resets the state of the agent to the cell’s state
The memory is updated with new cells during exploration
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
88. 88
Go-Explore: Engineering Details
State to Cell: downscaled cell with adaptive downscaling
parameters for robustness:
Calculating how a sample of recent frames would be
grouped into cells
Selecting the values that result in the best cell
distribution (manually designed)
The selection probability of a cell at each step is proportional to its selection weight → count-based
Domain knowledge weight: (1) the number of horizontal neighbours
to the cell present in the archive (h); (2) a key bonus: for each location
Train from demonstrations: backward algorithm places the agent close
to the end of the trajectory and runs PPO until the performance of the
agent matches that of the demonstration
Cseen is the number of exploration steps in which
that cell is visited
89. 89
Addressing Cell Limitations
❌ Cell design is not obvious → it requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task
Latent Go-Explore: Go-Explore operates without cells:
A latent representation is learned simultaneously with the exploration
Sampling of the final goal is based on a non-parametric density model of the latent space
Replace simulator with goal-based exploration
Quentin Gallouédec and Emmanuel Dellandréa. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. PMLR, 10571–10586.
90. 90
Latent Go-Explore: Details
Representation Learning:
ICM’s Inverse Dynamics
Forward Dynamics
Vector Quantized Variational Autoencoder
Reconstruction
Density Estimation to sample goals:
Goals must be at the edge of the yet unexplored
areas
Goal is reachable (already visited)
Use particle-based entropy estimator to estimate
density score and rank
A geometric law over the rank is used for sampling; p is a hyperparameter. The higher the rank (denser region), the less novel the state → sample it less
92. 92
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay👈 [We are here]
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
93. 93
Imitation Learning: Exploration via Exploitation
Exploiting past good experiences can indirectly drive deeper exploration
Self-imitation learning imitates the agent's own past good decisions
Memory is a replay buffer that stores experiences
Learns a policy to imitate state-action pairs in the replay
buffer only when the return in the past episode is
greater than the agent’s value estimate (performance-
based)
If the return in the past is greater than the agent’s value
estimate (R > Vθ), the agent learns to choose the action
chosen in the past in the given state.
Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In International
conference on machine learning, pp. 3878-3887. PMLR, 2018.
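A minimal sketch of the self-imitation policy loss: past actions are imitated only when their stored return exceeds the current value estimate (R > V); the clipping and averaging details are illustrative assumptions.

import torch

def sil_policy_loss(log_prob, returns, values):
    # Imitate only transitions whose past return beats the current value estimate
    advantage = torch.clamp(returns - values, min=0.0).detach()
    return -(log_prob * advantage).mean()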
94. 94
Goal-based Exploration Using Memory
Generate new trajectories visiting novel states
by editing or augmenting the trajectories stored
in the memory from past experiences
A sequence-to-sequence model with an
attention mechanism learns to ‘translate’ the
demonstration trajectory to a sequence of
actions and generate a new trajectory
Sample using count-based novelty
Insert a new trajectory if its ending is significantly different
Otherwise, replace with the higher-return trajectory
Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee.
2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural
Information Processing Systems 33 (2020), 4333–4345.
Diverse Trajectory-
conditioned Self-
Imitation Learning
(DTSIL)
95. 95
DTSIL: Policy Learning
Train a trajectory-conditioned policy πθ(at|e≤t, ot, g) that
should flexibly imitate any given trajectory g
To imitate, assign rim as the imitation reward (0.1) if the
state is similar
After visiting the last (non-terminal) state in the
demonstration, the agent performs random exploration
(r=0) to encourage exploration
Policy Gradient training:
Further imitation encouragement
97. 97
Replay Memory: Pros and Cons
👍 Replay memory provides a direct exploration
mechanism without intrinsic reward
👍 Sampling strategies are built upon previous
works
❌ Make additional assumptions such as simulator
or the availability of demonstrations
❌ Often requires a goal-based policy and multiple sub-trainings
Generated by DALL-E 3
100. 100
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration👈 [We are here]
Language-assisted RL
LLM-based exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
101. Beyond the state space: Language-guided exploration
Why language?
Humans are able to learn quickly
in new environments due to a
rich set of commonsense priors
about the world
→ reflected in language
Read the instructions of the game to avoid trial and error.
Abstraction
Compositional
Generalization
101
102. 102
Luketina, Jelena, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas,
Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. "A survey of reinforcement
learning informed by natural language." arXiv preprint arXiv:1906.03926 (2019).
How Language can be used in RL
103. Make a cake from tools and materials
A pure RL agent needs to try thousands of settings until it finds the desired characteristics
If the RL agent reads and follows the recipe, it may take one trial to succeed
103
https://www.moresteam.com/toolbox/design-of-experiments.cfm
A more practical use case
104. 104
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Language-assisted RL 👈 [We are here]
LLM-based exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
105. 105
Gated-attention for Task-oriented Language
Grounding
Task-oriented Language Grounding: extract meaningful representations from natural language instructions and map them to visual elements and actions
Task: given an initial image in pixels and an instruction -> guide the agent to move towards the desired object
2 main modules:
State Processing: process image and language jointly to obtain a state
Policy Learning: use a policy to map states to corresponding actions
Generated by DALL-E 3
106. 106
State Processing Module
Use Gated-Attention instead of concatenation to jointly represent image and language information as one state
The language instruction goes through a fully-connected layer to match the image dimension, producing the Attention Vector
Each element of the Attention Vector is expanded to an (H × W) matrix to match the feature map of each image channel
The final representation is obtained via an element-wise product between the image and language representations
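A minimal sketch of this gated-attention fusion: a sigmoid attention vector computed from the instruction gates the image feature maps channel-wise; layer sizes and the sigmoid choice are illustrative assumptions.

import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.fc = nn.Linear(text_dim, n_channels)

    def forward(self, image_feats, text_emb):
        # image_feats: [B, C, H, W]; text_emb: [B, text_dim]
        gate = torch.sigmoid(self.fc(text_emb))   # attention vector, one value per channel
        gate = gate.unsqueeze(-1).unsqueeze(-1)    # expand to match the H x W feature maps
        return image_feats * gate                  # element-wise product of image and language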
Generated by DALL-E 3
107. 107
Policy Module
Actions:
Turn (left, right)
Move forward
Policy Architecture: two variants
Behaviour Cloning: Uses target and object locations and
orientations in every state to select optimal action
A3C: use a deep neural network to learn policy and value
function. Gives action based on policy 𝜋(𝑎|𝐼𝐿, 𝐿)
Generated by DALL-E 3
108. 108
Results
Generated by DALL-E 3
- With Gated-Attention, agent learns faster and achieves better accuracy (success rate of reaching correct object before
episode terminates)
- As environment gets harder, more exploration is needed -> A3C with GA performs better than Imitation Learning, where
little exploration is done
109. 109
Semantic Exploration from Language Abstractions and
Pretrained Representations
Novelty-based exploration methods suffer in high-dimensional
visual state spaces
e.g.: different viewpoints of one place in 3D can map to distinct
visual states/ features, despite being semantically similar
Language can be a useful abstraction for exploration, as it
coarsens the state space in a way that reflects the semantics of
environment
Solution: Use vision-language pretrained representations to
guide semantic exploration in 3D
Generated by DALL-E 3
Example of a state (picture) with language
description (caption). Note how the
caption focuses on important aspects of
the state
Example of how many states can be
conveyed with one text caption
110. 110
Intrinsic Reward Design
State 𝑠𝑡: embedding of Image, denoted as 𝑂𝑉
Goal: Described by a text instruction, denoted as 𝑔
Caption is encoded by a pretrained language
encoder, output embedding is denoted as 𝑂𝐿; only
used to calculate intrinsic reward. Note that agent
never observes 𝑂𝐿.
Intrinsic reward is goal-agnostic; computed with
access to state representation (either 𝑂𝑉 or 𝑂𝐿)
Add intrinsic reward for two exploration algorithms:
Never Give Up (NGU; Badia et al.)
Random Network Distillation (RND; Burda et al.)
Generated by DALL-E 3
Badia, Adrià Puigdomènech, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038 (2020).
Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint
arXiv:1810.12894, 2018.
111. 111
Never Give Up (NGU)
State representations (either 𝑂𝐿 or 𝑂𝑉, used to compute the intrinsic reward) along the trajectory are written to a memory buffer
Novelty is a function of the L2 distance between the current state and its k-nearest neighbours in the buffer; the intrinsic reward is higher for larger distances (see the sketch below)
To influence exploration: modify the embedding function
Originally, the embedding function is learned
Variants:
Vis-NGU & LSE-NGU: use visual embeddings (𝑂𝑉)
Lang-NGU: use language embeddings (𝑂𝐿)
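A hedged sketch of NGU-style episodic novelty: the intrinsic reward grows with the k-nearest-neighbour distance between the current embedding and the embeddings already stored in the episodic buffer. The embedding function phi may be learned, visual (Vis-NGU) or language-based (Lang-NGU); the kernel constants below are illustrative, not the paper's exact values.

import numpy as np

def episodic_novelty(embedding: np.ndarray, buffer: list, k: int = 10, eps: float = 1e-3) -> float:
    if not buffer:
        return 1.0
    dists = np.sort([np.sum((embedding - e) ** 2) for e in buffer])[:k]   # k nearest squared distances
    dists = dists / (np.mean(dists) + 1e-8)          # normalize by the mean neighbour distance
    kernel = eps / (dists + eps)                     # inverse kernel: close neighbours give large values
    return 1.0 / (np.sqrt(np.sum(kernel)) + 1e-8)    # reward is larger when neighbours are far away

# Per step: r_int = episodic_novelty(phi(obs), buffer); buffer.append(phi(obs))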
112. 112
Random Network Distillation (RND)
The intrinsic reward is the prediction error between a
trainable predictor network and a frozen, randomly initialized
target network
The predictor is trained independently of the
policy network
As training progresses, frequently-visited states yield
less intrinsic reward because their prediction error shrinks (see the sketch below)
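A minimal sketch of the RND intrinsic reward, assuming small MLPs; the network sizes and learning rate are illustrative.

import torch
import torch.nn as nn

obs_dim, feat_dim = 64, 32
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
for p in target.parameters():
    p.requires_grad = False                          # the target network stays fixed
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_step(obs: torch.Tensor) -> float:
    error = ((predictor(obs) - target(obs).detach()) ** 2).mean()
    opt.zero_grad(); error.backward(); opt.step()    # frequently-visited states get lower error
    return error.item()                              # use as intrinsic reward r_i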
113. 113
Results
Example of how language representations
(orange line) help the agent explore better
than visual representations in a 3D
environment.
With visual representations, the agent
struggles with different views of the same
scene.
114. 114
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Language-assisted RL
LLM-based exploration👈 [We are here]
Causal discovery for exploration
Closing Remarks
QA and Demo
115. 115
ELLM: Guiding Pretraining in RL with LLM
Many distinct actions can lead to similar outcomes -> Intrinsically Motivated RL (IM-RL):
explore outcomes rather than actions
Competence-based IM (CB-IM): maximize the diversity of skills mastered by the agent
CB-IM optimizes an intrinsic reward 𝑅𝑖𝑛𝑡(𝑜, 𝑎, 𝑜′|𝑔) defined over goals 𝑔 sampled from a goal distribution G
Given those, CB-IM algorithms train a goal-conditioned policy 𝜋(𝑎|𝑜, 𝑔) that maximizes 𝑅𝑖𝑛𝑡
The goal distribution G and 𝑅𝑖𝑛𝑡(𝑜, 𝑎, 𝑜′|𝑔) must be defined such that three properties are satisfied:
1. Diverse
2. Common-sense sensitive (e.g., chop a tree > drink a tree)
3. Context sensitive (e.g., only chop the tree when the tree is in view)
116. 116
Why ELLM?
Previous methods hand-define 𝑅𝑖𝑛𝑡 and G, and use various motivations to guide goal
sampling 𝑔~𝐺: novelty, learning progress, intermediate difficulty
ELLM alleviates the need for environment-specific hand-coded definitions of 𝑅𝑖𝑛𝑡 and
G by using a language model for:
Language-based goal representations
Language-model-based goal generation
117. 117
Architecture
Goal generation: construct a prompt from the available actions and a description of the current observation ->
the LLM generates goals
Open-ended goal generation: ask the LLM to propose goals freely
Closed-form: ask the LLM yes/no questions (e.g., should the agent do X? yes/no) (see the illustrative prompts below)
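Illustrative prompts for the two goal-generation modes; the wording below is ours, not the paper's exact prompt, and the captioner producing obs_description is assumed.

def open_ended_prompt(obs_description: str, actions: list) -> str:
    return (
        f"You are an agent in a survival game.\n"
        f"Valid action verbs: {', '.join(actions)}.\n"
        f"Current observation: {obs_description}\n"
        f"Suggest a list of useful goals to pursue next:"
    )

def closed_form_prompt(obs_description: str, candidate_goal: str) -> str:
    return (
        f"Current observation: {obs_description}\n"
        f"Should the agent try to {candidate_goal}? Answer yes or no:"
    )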
118. 118
Intrinsic Reward Design
Reward LLM goals by the semantic similarity between a generated
goal and the description of the agent's transition 𝐶𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛
If there are multiple goals -> reward the agent with the
most similar goal to the transition description (see the sketch below)
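A hedged sketch of the similarity-based reward, assuming a SentenceBERT-style encoder; the model name "all-MiniLM-L6-v2" and the thresholding detail are our assumptions, not the paper's exact settings.

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed encoder choice

def ellm_reward(transition_caption: str, goals: list, threshold: float = 0.5) -> float:
    c = encoder.encode([transition_caption])[0]
    g = encoder.encode(goals)
    sims = g @ c / (np.linalg.norm(g, axis=1) * np.linalg.norm(c) + 1e-8)   # cosine similarities
    best = float(np.max(sims))                       # reward w.r.t. the most similar goal
    return best if best > threshold else 0.0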
120. 120
Intrinsically Guided Exploration from LLMs (IGE-LLMs)
For long, sequential tasks with sparse rewards, an intrinsic reward can guide policy learning
towards exploration, alongside the main policy driver, the extrinsic reward
An LLM can be used as an evaluator of the potential future reward of every action a, mapping each (s, a)
pair directly to an intrinsic reward 𝑟𝑖
The total reward is 𝑟𝑐 = 𝑟𝑒 + 𝜆𝑖 𝑟𝑖 𝑤𝑖, where 𝑟𝑒 is the external reward, 𝜆𝑖 is a controlling factor and 𝑤𝑖 is a linearly decaying weight (see the sketch below)
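A minimal sketch of the reward combination; r_i is assumed to come from an LLM rating of the (s, a) pair queried once and cached, and the linear decay schedule is illustrative.

def combined_reward(r_e: float, r_i: float, step: int, total_steps: int,
                    lambda_i: float = 1.0) -> float:
    w_i = max(0.0, 1.0 - step / total_steps)   # linearly decaying weight
    return r_e + lambda_i * r_i * w_i          # r_c = r_e + lambda_i * r_i * w_i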
121. 121
Prompting
Example of an input prompt for evaluating possible
actions in the DeepSea environment.
The LLM is given the current position and the possible
actions, and is asked to rate
every possible next action.
122. 122
Benefit of Intrinsic Reward
LLMs improve traditional exploration
methods
Using the LLM to generate actions directly
exhibits significant errors (grey lines in
the right graph), even with advanced
LLMs (GPT-4)
However, when used only as an intrinsic reward,
the LLM helps with exploration, especially
in harder environments, and results in
better performance
123. 123
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration👈 [We are here]
Statistical approaches
Deep learning approaches
Closing Remarks
QA and Demo
124. 124
What is causality?
The relationship between cause and effect.
It raises two fundamental questions:
Causal discovery: what evidence is required to infer
cause-effect relationships?
Causal inference: given causal information,
what inferences can be drawn?
Structural Causal Models (SCMs) framework
(Pearl, 2009a).
Causal graph: a directed graph with edges pointing from causes to effects.
Figure 1: Example of an SCM and causal graph for the scurvy problem.
125. 125
Why causality and RL?
Understanding cause and effect reduces unnecessary exploratory actions, and thus improves sample
efficiency.
e.g., do not move toward the door before obtaining the key.
It improves interpretability.
e.g., why does the policy prioritize obtaining the key?
It improves generalizability.
126. 126
Interpreting Causality in RL Environment
Taking action A can affect the reward R.
State S is the context variable that affects both
the action A and the reward R.
The most common formulation is the one shown in Figure 2,
where U is an unknown confounder variable.
We categorize methods by how they improve
exploration or by how they measure causality:
statistical vs deep learning methods.
Figure 2: Causality in an RL environment.
127. 127
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Statistical approaches👈 [We are here]
Deep learning approaches
Closing Remarks
QA and Demo
128. 128
Causal influence detection for improving
efficiency in reinforcement learning.
Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal influence detection for
improving efficiency in reinforcement learning. Advances in Neural
Information Processing Systems, 34, 22905-22918.
129. 129
Causal Action Influence Detection (CAI)
As mentioned previously, the action A and the state S jointly determine the next state.
Decompose the state S into N components.
One-step transition graph:
How to detect when the action influences the next
state S'?
Figure 3: Global causal graph (fully connected).
Figure 4: Example of situation-dependent control.
130. 130
Causal Action Influence Detection (CAI) (cont.)
CAI measures the influence of the action on state component j with
Conditional Mutual Information (CMI): C_j(s) = I(A; S'_j | S = s).
Estimation of CAI: learn a forward model p(s' | s, a)
from data and approximate the CMI with it (see the sketch below).
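A hedged sketch of estimating CAI with a learned Gaussian forward model and a Monte Carlo approximation of the CMI; the paper uses more careful mixture approximations, so this is a simplification.

import torch
import torch.distributions as D

def cai_score(forward_model, state, actions, j: int) -> torch.Tensor:
    """forward_model(state, action) -> (mean, std) of p(s' | s, a), per state component."""
    cond = []
    for a in actions:                                    # sampled candidate actions
        mean, std = forward_model(state, a)
        cond.append(D.Normal(mean[j], std[j]))
    # Approximate the marginal p(s'_j | s) by the mixture over the sampled actions.
    mix = D.MixtureSameFamily(
        D.Categorical(torch.ones(len(cond))),
        D.Normal(torch.stack([d.mean for d in cond]), torch.stack([d.stddev for d in cond])))
    # CMI is approximated by the mean over actions of KL(p(s'_j|s,a) || p(s'_j|s)), via samples.
    kls = []
    for d in cond:
        x = d.sample((64,))
        kls.append((d.log_prob(x) - mix.log_prob(x)).mean())
    return torch.stack(kls).mean()   # large value: the action causally influences s'_j

# Usage: add as a bonus, r = r_task + beta * cai_score(...), or use the score on its own.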
131. 131
Using CAI to Improve Exploration in RL
CAI as an intrinsic reward.
Active exploration with CAI.
CAI-prioritized experience replay.
Experiments on 3 environments: FetchPush,
FetchPickAndPlace, FetchRotTable. The goal is the
coordinate the object must reach.
Baseline RL algorithm: DDPG + HER.
Figure 5: FetchPickAndPlace Environment.
Figure 6: FetchRotTable Environment.
132. 132
CAI as Intrinsic Reward
Use CAI as a reward signal.
Use it on its own or together with the task reward.
Figure 7: Bonus reward improves performance on
FetchPickAndPlace.
133. 133
Active Exploration with CAI
Replace random exploration with causal exploration:
choose the action with the highest contribution to the CAI
score (see the sketch below).
Figure 8: Performance of active exploration in
FetchPickAndPlace depending on the fraction of
exploratory actions chosen actively from a total of 30%
(epsilon) exploratory actions.
Figure 9: Experiment comparing exploration
strategies on FetchPickAndPlace. The combination of
active exploration and reward bonus yields the largest
sample efficiency.
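A hedged sketch of the active-exploration rule: on an exploratory step, pick the candidate action whose predicted distribution over state component j diverges most from the action-marginal. The moment-matched Gaussian marginal here is our simplification of the full CMI computation.

import random
import torch
import torch.distributions as D

def active_exploratory_action(forward_model, state, candidate_actions, j: int,
                              active_fraction: float = 0.5):
    if random.random() > active_fraction:
        return random.choice(candidate_actions)          # keep a random component
    dists = [D.Normal(*[t[j] for t in forward_model(state, a)]) for a in candidate_actions]
    means = torch.stack([d.mean for d in dists])
    stds = torch.stack([d.stddev for d in dists])
    # Moment-matched Gaussian approximation of the action-marginal p(s'_j | s).
    marginal = D.Normal(means.mean(), (stds.pow(2).mean() + means.var(unbiased=False)).sqrt())
    kls = torch.stack([D.kl_divergence(d, marginal) for d in dists])
    return candidate_actions[int(torch.argmax(kls))]     # most causally influential action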
134. 134
CAI Experience Replay
Choose episodes for replay from the replay buffer
guided by a causal (inverse) ranking: episodes are ranked by their
aggregate CAI score, and the probability of sampling any state from
episode i (of the M episodes in the buffer, each of length T) decreases
with the episode's rank (see the sketch below).
Figure 10: Comparison of CAI-P with baselines (energy-based method with privileged
information (EBP), prioritized experience replay (PER), and HER without prioritization)
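A hedged sketch of rank-based causal prioritization for the replay buffer; the exact weighting used in the paper may differ, this only shows the inverse-rank idea.

import numpy as np

def episode_sampling_probs(episode_cai_scores: np.ndarray) -> np.ndarray:
    """episode_cai_scores: per-episode aggregate CAI (e.g., mean over the episode's steps)."""
    order = np.argsort(-episode_cai_scores)              # descending: rank 1 = most influence
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    weights = 1.0 / ranks                                # inverse-rank weighting
    return weights / weights.sum()                       # probability per episode

# Sampling a state: pick episode i with these probabilities, then a timestep uniformly in [0, T).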
135. 135
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Statistical approaches
Deep learning approaches👈 [We are here]
Closing Remarks
QA and Demo
136. 136
Causality-driven hierarchical structure
discovery for reinforcement learning.
Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... & Chen, Y. (2022).
Causality-driven hierarchical structure discovery for reinforcement
learning. Advances in Neural Information Processing Systems, 35, 20064-
20076.
137. 137
Structural Causal Representation Learning
Environment with multiple objects.
e.g., having wood and stone can make an axe.
How to measure causality between these
objects?
Model the SCM of objects between adjacent
timesteps.
Figure 11: Example of an environment with
multiple objects and its causal graph.
138. 138
Structural Causal Representation Learning (cont.)
Consider a simpler case with 4 objects A, B, C, D,
where A is the object of interest.
We need a forward/transition model that predicts the next state of A,
parameterized by a neural network.
We need a masking function (otherwise we don't know
which objects affect A),
parameterized by a mask over the M objects, where M is
the number of objects.
Figure 12: Example of SCM representation
learning (with object of interest A).
139. 139
Structural Causal Representation Learning (cont.)
Iterative process:
fix one parameter set while optimizing the other.
After optimization finishes, extract the causal edges (see the hedged sketch below).
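A heavily hedged sketch of the alternating optimization: fix the mask while fitting the forward model, then fix the forward model while fitting a relaxed (sigmoid) mask, and finally threshold the mask to extract edges into the object of interest. The parameterization, losses, threshold and the data.sample() batch source are illustrative assumptions, not the paper's exact method.

import torch
import torch.nn as nn

M, obj_dim = 4, 8                                     # M objects (A, B, C, D), each obj_dim-dimensional
forward_net = nn.Sequential(nn.Linear(M * obj_dim, 64), nn.ReLU(), nn.Linear(64, obj_dim))
mask_logits = nn.Parameter(torch.zeros(M))            # one gate per potential parent object

def predict_next_A(objects):                          # objects: (B, M, obj_dim)
    gated = objects * torch.sigmoid(mask_logits).view(1, M, 1)
    return forward_net(gated.flatten(1))              # predicted next state of object A

def train(data, steps: int = 1000):
    opt_f = torch.optim.Adam(forward_net.parameters(), lr=1e-3)
    opt_m = torch.optim.Adam([mask_logits], lr=1e-2)
    for step in range(steps):
        objects, next_A = data.sample()               # assumed batch sampler of transitions
        loss = ((predict_next_A(objects) - next_A) ** 2).mean()
        loss = loss + 1e-3 * torch.sigmoid(mask_logits).sum()   # sparsity pressure on the mask
        opt_f.zero_grad(); opt_m.zero_grad()
        loss.backward()
        (opt_f if step % 2 == 0 else opt_m).step()    # alternate which parameter set is optimized
    return (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten()   # indices of A's parents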
141. 141
Hierarchical Causal Subgoal Training
Whenever new subgoals are added, train a goal-conditioned policy for them.
Decide whether a subgoal is reachable from the
current state with the current policy.
If it is reachable within a certain number of timesteps,
add it to the subgoal set.
Figure 13: Example of a subgoal hierarchy
given the causal graph.
142. 142
Figure 16: Results on Minigrid-2d (left) and Eden (right).
Figure 15: Environments Minigrid-2d (left) and Eden (right).
An upper controller policy is trained to select subgoals from the current subgoal set and maximize the task reward.
The upper controller is a multi-level DQN with HER.
143. 143
Disentangling causal effects for
hierarchical reinforcement learning.
Corcoll, O., & Vicente, R. (2020). Disentangling causal effects for hierarchical
reinforcement learning. arXiv preprint arXiv:2010.01351.
144. 144
Controlled Effect Disentanglement
The total effect, the change in environment state,
comprises dynamic effects and controllable effects:
is the next state an outcome of the action,
or would it have happened anyway?
We care about controllable effects.
Based on the Average Treatment Effect (ATE):
the "normality" is the expected effect regardless of the chosen action,
and the controllable effect is the deviation from it.
Figure 17: The relationship between total effects, dynamic
effects, and controllable effects.
145. 145
We cannot compute the total effect for every action,
so a neural network is used to estimate it,
and a vector representation of the effect is learned (see the hedged sketch below).
Figure 18: Total-effect modelling architecture.
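A hedged sketch of separating controllable from dynamic effects: a network predicts the total effect of (s, a), the "normality" is taken as the predicted effect averaged over actions, and the controllable effect is the deviation from it. Treating the action-marginal as the normality, and the architecture, are our simplifications.

import torch
import torch.nn as nn

class EffectModel(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.total = nn.Sequential(nn.Linear(state_dim + n_actions, hid), nn.ReLU(),
                                   nn.Linear(hid, state_dim))     # total effect given (s, a)
        self.n_actions = n_actions

    def controllable_effect(self, s: torch.Tensor, a_onehot: torch.Tensor) -> torch.Tensor:
        total = self.total(torch.cat([s, a_onehot], dim=-1))
        # Normality: average predicted effect over all actions (what happens "anyway").
        eye = torch.eye(self.n_actions)
        all_effects = torch.stack([self.total(torch.cat([s, eye[i].expand(s.shape[0], -1)], dim=-1))
                                   for i in range(self.n_actions)])
        normality = all_effects.mean(0)
        return total - normality                                  # what the action itself caused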
146. 146
Exploration with controllable effects as goals
Components: an effect-sampling policy, an action-taking policy, and a model of the distribution of effects.
Figure 19: Components of causal effects for hierarchical reinforcement learning.
147. 147
Controllable Effect Distribution Learning
Train a Variational Autoencoder to
approximate the distribution of controllable effects (see the sketch below).
Figure 20: VAE architecture to learn the effect distribution.
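A minimal VAE sketch for learning the distribution of controllable effects so that effects can be sampled as subgoals; dimensions and loss weights are illustrative, and the paper's exact architecture may differ.

import torch
import torch.nn as nn

class EffectVAE(nn.Module):
    def __init__(self, effect_dim: int, latent_dim: int = 8, hid: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(effect_dim, hid), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hid, latent_dim), nn.Linear(hid, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hid), nn.ReLU(), nn.Linear(hid, effect_dim))

    def forward(self, effect):
        h = self.enc(effect)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return ((recon - effect) ** 2).sum(-1).mean() + kl         # reconstruction + KL loss

    def sample_effect(self, n: int = 1):
        return self.dec(torch.randn(n, self.mu.out_features))      # sample effects as subgoals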
148. 148
Training to Select Goals and Reach Goals
Train a high-level controller with DQN to select a controllable effect as the subgoal.
Train a low-level policy with DQN to select actions that reach the chosen subgoal.
Figure 21: Architecture learning to select effects as
subgoals.
Figure 22: Architecture learning to select actions to reach
subgoals.
149. 149
Three levels of task difficulty:
Task T: go to the target location.
Task BT: go to the target location
while carrying a ball.
Task CBT: pick up the ball, put it in the
chest, and go to the target.
Figure 23: Comparison with the DQN baseline on the 3 tasks.
CEHRL can learn the complex task, while DQN cannot.
Figure 24: Random-effect vs random-action exploration.