Despite remarkable successes in various domains such as robotics and games, Reinforcement Learning (RL) still struggles with exploration inefficiency. For example, in hard Atari games, state-of-the-art agents often require billions of trial actions, equivalent to years of practice, while a moderately skilled human player can achieve the same score in just a few hours of play. This contrast emerges from the difference in exploration strategies between humans, leveraging memory, intuition and experience, and current RL agents, primarily relying on random trials and errors. This tutorial reviews recent advances in enhancing RL exploration efficiency through intrinsic motivation or curiosity, allowing agents to navigate environments without external rewards. Unlike previous surveys, we analyze intrinsic motivation through a memory-centric perspective, drawing parallels between human and agent curiosity, and providing a memory-driven taxonomy of intrinsic motivation approaches.
The talk consists of three main parts. Part A provides a brief introduction to RL basics, delves into the historical context of the explore-exploit dilemma, and raises the challenge of exploration inefficiency. In Part B, we present a taxonomy of self-motivated agents leveraging deliberate, RAM-like, and replay memory models to compute surprise, novelty, and goals, respectively. Part C explores advanced topics, presenting recent methods using language models and causality for exploration. Whenever possible, case studies and hands-on coding demonstrations will be presented.
2. 2
Why This Topic?
Deep Learning in RL: Deep RL agents excel in complex tasks but need many interactions to learn
Practicality Issues: Extensive learning steps hinder RL's use in real-world scenarios
Exploration Optimization: Improving exploration is key for RL's real-world application
Memory and Learning: Memory-based exploration can speed up learning and advance AI
Generated by DALL-E 3
3. 3
About Us
Authors: Hung Le, Hoang Nguyen and Dai Do
Our lab: A2I2, Deakin University
Hung Le is a research lecturer at Deakin
University, leading research on deep sequential
models and reinforcement learning
Hoang Nguyen is a second-year PhD student at
A2I2, specializing in reinforcement learning and
causality
Dai Do is a second-year PhD student at A2I2,
specializing in reinforcement learning and large
language models
A2I2 Lab Foyer, Waurn Ponds
4. 4
About the Tutorial
This tutorial is based on my previous presentations,
expanding on topics covered in the following talks:
Memory-Based Reinforcement Learning.
AJCAI’22
Memory for Lean Reinforcement Learning. FPT
Software AI Center. 2022
Neural machine reasoning. IJCAI’21
From deep learning to deep reasoning. KDD’21
My Blogs: https://hungleai.substack.com/
Generated by DALL-E 3
5. 5
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction 👈 [We are here]
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes,
including a 20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven
Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
7. 7
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Key components and frameworks 👈 [We are here]
Classic exploration
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
8. 8
Reinforcement Learning Basics
In reinforcement learning (RL), an agent
interacts with the environment, taking
actions a, receiving a reward r, and
moving to a new state s
The agent is tasked with maximizing the
accumulated rewards or returns R over
time by finding optimal actions (policy)
9. 9
Reinforcement Learning Concepts
Policy π : maps state s to action a
Return (discounted) G or R: the cumulative
(weighted) sum of rewards
State value function V: the expected discounted
return starting with state s following policy π
State-action value function Q: the expected return starting from state s, taking action a, and thereafter following policy π
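For reference, a sketch of these quantities in LaTeX notation (discount factor γ assumed):
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \quad V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s], \quad Q^{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]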
10. Classic RL algorithms: Value learning
10
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms for
connectionist reinforcement learning." Machine learning 8, no. 3 (1992):
229-256.
Basic idea: before finding the optimal policy, we first find the value function
Learn the (action) value function:
V(s): estimate V(s) = E(∑R from s)
Q(s,a): estimate Q(s,a) = E(∑R from s,a)
Given Q(s,a) → choose the action that maximizes the value (ε-greedy policy)
RL algorithms: Q-Learning
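A minimal Python sketch of the tabular Q-learning (TD) update and the ε-greedy policy described above; the environment interface, learning rate, and discount factor are illustrative assumptions, not the tutorial's reference code.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target bootstraps from the greedy value of the next state
    td_target = r + gamma * np.max(Q[s_next])
    # Move the current estimate toward the TD target
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # With probability eps explore randomly, otherwise act greedily w.r.t. Q
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))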
11. Classic RL algorithm: Policy gradient
Basic idea: directly optimise the policy as a
function of states
Need to estimate the gradient of the
objective function E(∑R) w.r.t the
parameters of the policy
Focus on optimisation techniques
11
REINFORCE
(policy gradient)
RL algorithms: Policy Gradient
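A minimal sketch of the REINFORCE objective: the policy-gradient loss is the negative sum of log-probabilities weighted by discounted returns; the episode data format and discount factor are illustrative assumptions.

import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # log_probs: list of log pi(a_t|s_t) tensors for one episode; rewards: list of floats
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    log_probs = torch.stack(log_probs)
    # Minimizing this loss performs gradient ascent on E[sum of rewards]
    return -(log_probs * returns).sum()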
13. 13
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Key components and frameworks
Classic exploration 👈 [We are here]
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
14. 14
ε-greedy
ε-greedy is the simplest exploration
strategy that works in theory
It heavily relies on pure randomness
and biased estimates of action values
Q, and thus is sample-inefficient in
practice
We often go with what we assume is
best, but sometimes, we take a
random chance to explore other
options. This is one example of an
optimistic strategy
It is used in Q-learning
15. 15
ε-greedy: Problems
It is not surprising that this strategy might struggle in real-world scenarios:
being overly optimistic when your
estimation is imprecise can be risky.
It may lead to getting stuck in a local
optimum and missing out on
discovering the global one with the
highest returns.
Benchmarking ε-greedy (red line) and other exploration methods on Montezuma's
Revenge. Taïga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G.
Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning
environment." arXiv preprint arXiv:1908.02388 (2019).
ε-greedy
16. 16
Upper Confidence Bound (UCB)
One way to address the problem of over-optimism is to consider the uncertainty of the estimation
We do not want to miss an action with a
currently low estimated value and high
uncertainty, as it may possess a higher value:
What we need is a guarantee on how far the estimate can deviate from the true value: Hoeffding's Inequality provides such a bound
How to estimate uncertainty?
Implicitly, with a large value of the exploration-exploitation trade-off parameter c, the chosen action is more likely to deviate from the greedy action, leading to increased exploration.
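A minimal sketch of UCB1-style action selection for a bandit, using the Hoeffding-style square-root bonus discussed above; the array shapes and the trade-off constant c are illustrative assumptions.

import numpy as np

def ucb_action(q_hat, counts, t, c=2.0):
    # q_hat[a]: empirical mean reward of arm a; counts[a]: times arm a was pulled
    if np.any(counts == 0):
        return int(np.argmin(counts))  # pull each arm at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_hat + bonus))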
17. 17
Thompson Sampling
When additional assumptions about the reward distribution are
available, an action can be chosen based on the probability that
it is optimal (probability matching strategy)
Thompson sampling is one way to implement the strategy:
1. Assume the reward follows a distribution p(r|a, θ) where θ
is the parameter whose prior is p(θ)
2. Given the set of past observations Dt made of pairs {(ai, ri) | i = 1, 2, ..., t}, we update the posterior using Bayes' rule
3. Given the posterior, we can estimate the action value
4. We can compute the probability of choosing action a
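A minimal sketch of Thompson sampling for Bernoulli-reward bandits, following the probability-matching recipe above; the Beta(1,1) prior and the success/failure bookkeeping are illustrative assumptions.

import numpy as np

def thompson_action(successes, failures):
    # One posterior sample per arm from Beta(successes+1, failures+1), then act greedily on the samples
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def thompson_update(successes, failures, a, r):
    # r is a binary reward observed for the chosen arm a
    if r > 0:
        successes[a] += 1
    else:
        failures[a] += 1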
18. 18
Information Gain
Information Gain (IG) measures the change in the
amount of information (measured in entropy H) of a
latent variable
The latent variable often refers to the parameter of
the model θ after seeing observation (e.g., reward r)
caused by some action a
A big drop in the entropy means the observation
makes the model more predictable and less
uncertain
Our goal is to find a harmony between minimizing
expected regret in the current period and acquiring
new information about the observation model
Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-
directed sampling." Advances in Neural Information Processing Systems 27
(2014).
19. 19
Application: Multi-arm bandit
There are multiple actions to take (bandit’s arm)
After taking one action, the agent observes a reward
Maximize cumulative rewards or minimize cumulative regrets.
Generated by DALL-E 3
20. 20
Limitations of Classical Exploration
❌ Scalability Issues: Most are specifically designed for
bandit problems, and thus, they are hard to apply in
large-scale or high-dimensional problems (e.g., Atari
games), resulting in increased computational
demands that can be impractical
❌ Assumption Sensitivity: These methods heavily rely
on specific assumptions about reward distributions or
system dynamics, limiting their adaptability when
assumptions do not hold
❌ Vulnerability to Uncertainty: They may struggle in
dynamic environments with complex reward
structures or frequent changes, leading to suboptimal
performance Generated by DALL-E 3
21. 21
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
Hard Exploration Problems 👈 [We are here]
Simple exploring solutions
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
22. Task:
Agent searches for the key
Agent picks up the key
Agent opens the door to access the room
Agent finds the box in the room
Reward:
If the agent reaches the box, get +1
reward
22
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated
policies using the simple reward?
Modern RL Environments are Complicated
23. 23
Why is Scaling a Big Problem?
Practical environments often involve huge
continuous state and action spaces
Classical approaches cannot be implemented or fail
to hold their theoretical properties in these settings
Doom environment: continuous high-dimensional
state space (source)
Mujoco environment: continuous action space (source).
24. 24
Challenging Environments for Exploration
Environments that require long-term memory from agents:
Maze navigation with conditions such as finding objects that have the same color as the wall
Remembering the shortest path to objects encountered in the past
Noisy environments:
Noisy-TV: a random TV will distract the RL agent from its main task due to its noisy screen
https://github.com/jurgisp/memory-maze
Noisy-TV (source)
25. 25
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
Hard exploration problems
Simple exploring solutions 👈 [We are here]
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
26. 26
Entropy Maximization
In the era of deep learning, neural networks are used for
approximating functions, including parameterizing value and
policy functions in RL
ε-greedy is less straightforward for policy gradient methods
An entropy loss term is introduced in the objective function to
penalize overly deterministic policies. This encourages diverse
exploration, avoiding suboptimal actions by maximizing the
bonus entropy loss term
❌ It may also impede the optimization of other losses, especially
the main objective
❌ The entropy loss does not enforce different levels of exploration for different tasks
It is used in PPO, A3C, …
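A minimal sketch of adding an entropy bonus to a policy-gradient loss, as done in PPO/A3C-style objectives; the coefficient value and the distribution object are illustrative assumptions.

import torch

def pg_loss_with_entropy(log_probs, advantages, dist, ent_coef=0.01):
    # Standard policy-gradient term
    pg = -(log_probs * advantages).mean()
    # Entropy bonus: subtracting it penalizes overly deterministic policies
    entropy = dist.entropy().mean()
    return pg - ent_coef * entropy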
27. 27
Noisy Networks
Another method to add randomness to the
policy is to add noise to the weights of the
neural networks
Throughout training, noise samples are drawn
and added to the weights for both forward and
backward propagation
❌ Although Noisy Networks can vary the degree of exploration across tasks, adapting exploration at the state level remains out of reach.
Certain states with higher uncertainty may require
more exploration, while others may not
An example of a noisy linear layer. Here w is the weight matrix and b is the bias vector. The parameters µw, µb, σw and σb are the learnables of the network whereas εw and εb are noise variables. Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295 (2017)
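A minimal sketch of a noisy linear layer with independent Gaussian noise on weights and biases, following the parameterization in the caption; the initial σ value and module structure are illustrative assumptions (the paper also proposes a factorized-noise variant).

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_f, out_f, sigma0=0.017):
        super().__init__()
        bound = 1.0 / in_f ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.mu_b = nn.Parameter(torch.zeros(out_f))
        self.sigma_w = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.sigma_b = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x):
        # Fresh noise is drawn each forward pass and used in both forward and backward passes
        w = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        b = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return x @ w.t() + b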
30. 30
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks 👈 [We are here]
Reward shaping and the role of memory
A taxonomy of memory-driven intrinsic exploration
Deliberate Memory for Surprise-driven Exploration
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
31. No curiosity, random exploration:
epsilon-greedy
"Tractable" exploration:
Somehow optimize exploration, e.g., UCB, Thompson sampling
Only doable for simple environments
Approximate "tractable" exploration (count-based):
Scalable to harder environments
Intrinsic motivation exploration (SOTA):
Predictive
Novelty- or surprise-based curiosity
Causal
…
https://cmutschler.de/rl
Frameworks
32. 32
Reward Shaping
Entropy loss or noise injection into the policy/value parameters has the limitation that the level of exploration is not explicitly conditioned on fine-grained factors such as states or actions
Solution: intrinsic reward bonuses assign higher
internal rewards to state-action pairs that
require higher exploration and vice versa
The final reward for the agent will be the
weighted sum of the intrinsic reward and the
external (environment) reward
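A minimal sketch of this reward mixing; the weighting coefficient beta is an illustrative assumption.

def shaped_reward(r_ext, r_int, beta=0.01):
    # Weighted sum of the environment reward and the intrinsic bonus
    return r_ext + beta * r_int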
33. Animals can travel long distances until they find food
Humans can navigate to an address in an unfamiliar city
What motivates these agents to explore?
intrinsic motivation
curiosity, hunch
intrinsic reward
33
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Intrinsic Motivation in Biological World
34. 34
What Does Intrinsic Reward Represent?
Novelty
It is inherent for biological agents to be motivated by new things.
Tracking the occurrences of a state provides a novelty indicator, with
increased occurrences signaling less novelty
Surprise
Surprise emerges when there's a discrepancy between expectations
and the observed or experienced reality
Build a model of the environment, predicting the next state given the
current state and action
The intrinsic reward is the prediction error itself
This reward increases when the model encounters difficulty in
predicting or expresses surprise at the current observation.
Novelty
Surprise
35. 35
The Role of Memory
Biological agents inherently possess memory to
monitor events:
Drawing from previous experiences, they
discern novelty in observations
Utilizing their prior understanding of the
world, they identify unexpected observations
RL agents can be equipped with memory:
Event-based Memory: Episodic Memory
Semantic Memory: World Model
Novelty
Surprise
MEMORY
37. 37
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction👈 [We are here]
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
38. 38
Forward Dynamics Prediction
Build a model of the environment, predicting the
next state given the current state and action
This kind of model is also known as forward dynamics or a world model
C: actor vs. M: world model. M predicts the consequences of C's actions; C acts to make M fail
As a result:
If C's actions result in repeated and boring consequences, M predicts them well
So C must explore novel consequences https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
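A minimal sketch of a forward-dynamics (world-model) intrinsic reward: the bonus is the model's next-state prediction error; network sizes, the action encoding, and the MSE error are illustrative assumptions.

import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        # a is assumed to be a float tensor (e.g., a one-hot encoded action)
        return self.net(torch.cat([s, a], dim=-1))

def surprise_reward(model, s, a, s_next):
    # Surprise = error of the world model's next-state prediction
    with torch.no_grad():
        pred = model(s, a)
    return ((pred - s_next) ** 2).mean(dim=-1)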
39. 39
Learning Progress as Intrinsic Reward
The learning progress is estimated by comparing
the mean error rate of the prediction model
during the current moving window to the mean
error rate of the previous window
The two windows are offset by 𝜏 steps
It can mitigate the Noisy-TV problem
Pierre-Yves Oudeyer & Frederic Kaplan. “How can we define intrinsic motivation?” Conf. on Epigenetic
Robotics, 2008.
40. 40
Deep Dynamic Models
Use a neural network f that takes a representation of the current state and action to predict the
next state
The representation is shaped through unsupervised training, i.e., state reconstruction task, using
an autoencoder’s hidden state
The network f, fed with the autoencoder’s hidden state, is trained to minimize the prediction
error, which is the norm of the difference between the predicted state and the true state
Intrinsic reward
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement
Learning with Deep Predictive Models. In NIPS 2015.
41. Forward model on feature space
The feature space ignores irrelevant, uncontrollable factors
The consequences depend on:
Action (controllable)
Environment (uncontrollable)
We want a state embedding that captures only the controllable factors
41
https://blog.dataiku.com/curiosity-driven-learning-through-next-state-prediction
Deepak Pathak et al.: Curiosity-driven Exploration by Self-Supervised Prediction. ICML 2017
ICM
More Complicated Model
Inverse dynamics representation learning
43. 43
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises👈 [We are here]
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
44. The prediction target
is stochastic
Information necessary for the
prediction is missing
Model class of predictors is too
limited to fit the complexity of
the target function
Both the totally predictable and
the fundamentally unpredictable will
get boring
44
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
When Predictive Surprise Fails
45. 45
https://favpng.com/
Ideas for Improvements
Reward M's progress instead of its error
If the consequence is too hard or too easy to predict, M makes no progress → no reward
Remember all experiences
"Store" all experienced consequences, including stochastic ones
Global or local memory, like humans
Better representations
Representation learning
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
46. 46
Random Network Distillation (RND)
The intrinsic reward is defined through the task of
predicting the output of a fixed (target) network
Target Network’s weights are random
By predicting the target output, the Predictor Network
tries to “remember” the randomized state
If old state reappears, it can be predicted easily by the
Predictor Network
RND obviates Noisy-TV since the target network can be
chosen to be deterministic and inside the model-class
of the predictor network.
Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network
distillation." In International Conference on Learning Representations. 2018.
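A minimal sketch of Random Network Distillation: a trainable predictor regresses onto the output of a fixed, randomly initialized target network, and the prediction error serves as both the intrinsic reward and the predictor's training loss; network sizes are illustrative assumptions.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False  # the target stays random and fixed

    def intrinsic_reward(self, obs):
        # Prediction error shrinks for states seen (and "remembered") many times
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)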
47. 47
Noisy-TV in Atari games
In Montezuma's Revenge, the agent oscillates between two rooms
This leads to an irreducibly high prediction error, as the non-determinism of sticky actions
makes it impossible to know whether, once the agent is close to crossing a room boundary,
making one extra step will result in it staying in the same room, or crossing to the next one
This is a manifestation of the ‘noisy TV’ problem
48. 48
Latent World Model
A world model for exploration should be robust against stochasticity and able to extrapolate the state dynamics → the prediction error can be a measurement of novelty
Train the WM on a latent representation space. This space is shaped by unsupervised learning into a zero-centered distribution with a covariance matrix equal to the identity → robust to stochastic elements and arranged to respect the temporal distance of observations
The WM error is computed in latent space
50. 50
Bayesian Surprise
Surprise can be interpreted from a Bayesian statistics perspective
Similar to IG idea, the aim is to minimize uncertainty about the dynamics, formalized as maximizing
the cumulative reduction in entropy
The reduction of entropy per time step, also known as mutual information, I(𝚯;St+1|ξt,at)
Here θ is the parameters of the dynamics model 𝚯. Because we are interested in finding intrinsic
reward for a given timestep, we can define:
Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "Vime:
Variational information maximizing exploration." Advances in neural information processing systems 29
(2016).
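For reference, the per-step intrinsic reward this defines (as in VIME) can be sketched in LaTeX as the information gain about the dynamics parameters, with ξ_t denoting the history up to time t:
r^{i}_t = D_{KL}\!\left( p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \right)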
51. 51
Variational Bayesian Surprise
The KL involves computing the posterior p(θ|st+1),
which is generally intractable
Use variational inference for approximating the
posterior, alternative variational distribution q(θ; 𝜙)
𝜙 is the variational parameter
This is equivalent to parameterizing the dynamics
model as a Bayesian neural network (BNN) with
weight distributions maintained as a fully factorized
Gaussian. Train 𝜙:
53. 53
Bayesian Learning Progress
Training a BNN is complicated; there are different Bayesian views on surprise
Formulating the objective of the RL agent as jointly
maximizing expected return and surprise
P is the true dynamics model and P𝜙 is the learned
dynamics model
The objective can be translated to maximizing the
bonus reward per step
Joshua Achiam and Shankar Sastry. 2017. Surprise-based intrinsic motivation for deep reinforcement
learning. arXiv preprint arXiv:1703.01732 (2017).
54. 54
Bayesian Learning Progress: Approximation Solutions
In practice, we do not know P. Need
approximation
Prediction error: measures the error in log
probability instead of the norm of the
difference between the predicted and the
reality
Learning progress written in the form of log
probability
To train the dynamics model P𝜙, solve the
constrained optimization
By introducing the KL constraint, the
posterior model is prevented from diverging
too far from the prior, thereby preventing
the generation of unstable intrinsic rewards.
55. 55
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement 👈 [We are here]
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
56. 56
Intrinsic Motivation via Disagreement
An alternative method with forward dynamics
involves using the variance of the prediction
rather than the error
This requires multiple prediction models
trained to minimize the forward dynamics
prediction errors,
Use the empirical variance (disagreement) of
their predictions as the intrinsic reward
The higher the variance, the more uncertain
about the observation need to explore
more
Deepak Pathak, et al. “Self-Supervised Exploration via Disagreement.” In ICML 2019.
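A minimal sketch of the disagreement-based intrinsic reward: an ensemble of forward models is trained on the same transitions and the variance of their predictions is the bonus; the ensemble size and the ForwardModel class from the earlier sketch are illustrative assumptions.

import torch

def disagreement_reward(models, s, a):
    # models: list of forward-dynamics networks (e.g., the ForwardModel sketched earlier)
    with torch.no_grad():
        preds = torch.stack([m(s, a) for m in models])  # [n_models, batch, state_dim]
    # Empirical variance across ensemble members, averaged over state dimensions
    return preds.var(dim=0).mean(dim=-1)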
57. 57
Bayesian Disagreement
Bayesian surprises are defined by a specific dynamics model
What happens if we consider a distribution of models? Bayesian surprise of a policy becomes:
P(T) is the transition distribution of the
environment and P(T|𝜙) is the transition
distribution according to the dynamics
model.
Prediction error averaging considers all
transition models and possible
predictions
Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active
Exploration. In Proceedings of the 36th International Conference on Machine Learning,
ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779–5788.
P(S|s,a,t) is the dynamics model learned from a transition
dynamics t
58. 58
How to Compute u?
The term u(s,a) turns out to be the Jensen-
Shannon Divergence of a set of learned
dynamics from a transition dynamics t
JSD can be approximated by employing N
dynamics models:
For each P parameterized by Gaussian
distribution 𝓝i(µi,Σi), we need another layer of
approximation to compute u(s,a) by replacing
the Shannon entropy with Rényi entropy and use
the corresponding Jensen-Rényi Divergence
(JRD)
59. 59
Surprise-based Exploration: Pros and Cons
👍 Dynamics models can be trained easily these
days. There are many works on that topic
👍 Advanced methods can somehow handle Noisy-
TV
❌ Focusing on the forward dynamics error is not
effective in driving the exploration, especially when
the world model is not good and always predicts
wrongly
❌ Advanced methods such as ensembles or
learning progress are compute-expensive to cope
with Noisy-TV
Generated by DALL-E 3
60. 60
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break👈 [We are here]
RAM-like Memory for Novelty-based
Exploration
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
61. 61
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory 👈 [We are here]
Episodic memory
Hybrid memory
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
62. 62
Novelty via Counting
Humans want to explore novel places, make new
friends, and buy new stuff. It is inherent for humans to
be motivated by new things
How to translate this intrinsic motivation to RL agents?
Tracking the occurrences of a state (N(s)) provides a novelty indicator, with increased occurrences meaning less novelty
ri(s,a) = N(s)^(-1/2), where N counts the number of times s appears
❌ Empirical counting in continuous state spaces is impractical due to the rarity of exact state visits, resulting in N(s)=0 most of the time
63. 63
Density-based State Counting
Use a density function of the state to estimate its occurrences
Let ρ(x) = ρ(s = x | s1:n) be the density of state x given s1:n, and ρ′(x) = ρ(s = x | s1:n x) be the density of x after observing one more occurrence of x following s1:n
Define N̂(x) and n̂ as the "pseudo-count" of x and the pseudo-total count before and after an occurrence of x, so that ρ(x) = N̂(x)/n̂ and ρ′(x) = (N̂(x)+1)/(n̂+1)
The true density of x is assumed to stay the same before and after an occurrence of x
Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.
"Unifying count-based exploration and intrinsic motivation." Advances in neural information processing
systems 29 (2016).
64. 64
Pseudo State Count
In practice, in a huge state space, ρ’n(x)≈0, we can
rewrite the pseudo-count:
PG means predictive gain, which is computed as:
It resembles the information gain: the difference between the predictions of the posterior and the prior distribution
Extend to count state-action pairs, one can
concatenate the action representation with the state
representation.
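A sketch of the resulting pseudo-count expression under these definitions (following Bellemare et al., 2016), in LaTeX:
\hat{N}(x) = \frac{\rho(x)\,\big(1 - \rho'(x)\big)}{\rho'(x) - \rho(x)} \approx \big( e^{\mathrm{PG}(x)} - 1 \big)^{-1}, \qquad \mathrm{PG}(x) = \log \rho'(x) - \log \rho(x)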
65. 65
Hash Count
If counting each exact state is challenging, why
not partition the continuous state space into
manageable blocks?
By using a function 𝜙 mapping a state to a code,
we can count the occurrences of the code instead of the state
"Distant" states are counted separately while "similar" states are merged → use SimHash as the mapping function
Higher k → fewer hash collisions, and thus states are more distinguishable
Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman,
Filip DeTurck, and Pieter Abbeel. "# exploration: A study of count-based exploration for deep
reinforcement learning." Advances in neural information processing systems 30 (2017).
sgn is the sign function, A is a k × d matrix with i.i.d. entries drawn from a standard Gaussian distribution, and g is some transformation function
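A minimal sketch of SimHash-based counting: states are hashed with a random projection, counts are kept per hash code, and the bonus is 1/√count; the projection dimension k and the identity transformation g are illustrative assumptions.

import numpy as np
from collections import defaultdict

class SimHashCount:
    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # i.i.d. Gaussian projection
        self.counts = defaultdict(int)

    def bonus(self, s):
        code = tuple(np.sign(self.A @ s).astype(int))  # phi(s) = sgn(A g(s)), with g = identity here
        self.counts[code] += 1
        return 1.0 / np.sqrt(self.counts[code])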
66. 66
Hash Count: Cope with High-dimensional States
Representation learning is employed to capture a good g through an autoencoder network and reconstruction learning
The network aims to reconstruct the original
state input s, and the hidden representation b(s)
will be used to compute g(s)=round(b(s))
Another regularization term prevents the
corresponding bit in the binary code from
flipping throughout the agent's lifetime
67. 67
Change Counting
To encourage the agent to explore novel
state-action pairs meaningfully, we can
assess changes caused by activities and
prioritize those that signify novelty
c(s,s’) as the environment change caused
by a transition (s, a, s’)
Combines state count and change count,
resulting in the intrinsic reward:
Change count (last row) vs norm of change (middle row) vs state count (top row).
Change count suffers less from attracting to meaningless activities.
Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta.
"Interesting object, curious agent: Learning task-agnostic exploration.
" Advances in Neural Information Processing Systems 34 (2021): 20516-20530.
68. 68
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory
Episodic memory👈 [We are here]
Hybrid memory
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
69. 69
Alternative Novelty Measurement
❌ An inherent constraint of count-based methods lies
in the approximation error between the pseudo-count
and the true count
Different novelty criteria: novel observations are
those that demand effort to reach, typically beyond
the already explored areas of the environment
Measure the effort in environmental steps,
estimating it with a neural network that predicts
the steps between two observations
To capture the explored areas of the environment, use an episodic memory initialized empty at the start of each episode
Novelty through reachability concept. An observation is novel if it can only reach
those in the memory in more than k steps. Savinov, Nikolay, Anton Raichuk,
Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain
Gelly. "Episodic curiosity through reachability." arXiv preprint arXiv:1810.02274
(2018).
if the maximum reachability score of the given observation
is greater than a threshold k, we can regard it as novel
70. 70
Episodic Curiosity: Memory Workflow
Define the threshold k, use reachability
network to classify whether two observations
are separated by more or less than k steps
After training, the reachability network is used
to estimate the novelty of the current
observation in the episode given the episodic
memory M, which finally is used to compute
the intrinsic reward
Using a function F to aggregate the reachability
scores between the current observation and
those in the memory leads to the intrinsic
reward. F can be max or 90-th percentile
71. 71
Explicit Memory of Positions (only for Atari games)
Collect the agent’s position from game RAM to
indicate where on the grid an agent has visited
White sections in the curiosity grid (middle)
show which locations have been visited; the
unvisited black sections yield an exploration
bonus when touched.
The network receives both game input (left) and
curiosity grid (middle) and must learn how to
form a map of where the agent has been
(hypothetical illustration, right)
Stanton, Christopher, and Jeff Clune. "Deep curiosity search: Intra-life exploration can improve
performance on challenging deep reinforcement learning problems." arXiv preprint
arXiv:1806.00553 (2018).
72. 72
Novelty Connection to Surprise
Theoretically, an overparameterized autoencoder whose task is to reconstruct its input is
equivalent to an associative memory (Adityanarayanan et al., PNAS 2020)
We can train an autoencoder that takes the state as input to reconstruct and use its
reconstruction error as an indicator of life-long novelty
Greater errors signify a higher level of novelty in states, indicating that the autoencoder has not
encountered these states frequently enough to learn their successful reconstruction effectively
Related to RND: the intrinsic reward is still the reconstruction error. But this time, the
reconstructed target is no longer the original input. Instead, it is a transformed version of the
input
73. 73
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Count-based memory
Episodic memory
Hybrid memory 👈 [We are here]
Replay Memory
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
74. 74
Surprise + Novelty
Never Give Up agent (NGU) combines
existing surprise and novelty components
from the literature cleverly:
1. State representation learning via inverse
dynamics (ICM)
2. Life-long novelty module using RND
3. Episodic novelty using episodic memory
inspired by EC
The implementation of the episodic
memory in NGU is new.
The dynamics model f is employed to produce the
representations for the novelty modules. Two types of
novelty are combined to produce the final intrinsic reward.
Badia, Adrià Puigdomènech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven
Kapturowski, Olivier Tieleman et al. "Never give up: Learning directed exploration strategies." arXiv
preprint arXiv:2002.06038 (2020).
75. 75
NGU: Episodic Novelty
Encourages the exploration of novel states
within an episode simply via nearest-neighbor
matching.
As a result, the agent will not revisit the same
state in an episode twice. This concept is
different from lifelong novelty
The closer the current state is to its neighbors,
the higher the similarity and thus, the smaller
the reward
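A minimal sketch of this episodic novelty bonus via k-nearest-neighbour matching in the episodic memory; the inverse-distance kernel and constants are illustrative assumptions rather than NGU's exact formulation.

import numpy as np

def episodic_novelty(memory, emb, k=10, eps=1e-3):
    # memory: list of embeddings stored so far in the current episode
    if memory:
        d = np.array([np.sum((m - emb) ** 2) for m in memory])
        d = np.sort(d)[:k] / (d.mean() + 1e-8)       # normalized k-NN distances
        similarity = np.sum(eps / (d + eps))          # closer neighbours -> higher similarity
        reward = 1.0 / np.sqrt(similarity + 1e-8)     # higher similarity -> smaller reward
    else:
        reward = 1.0
    memory.append(emb)
    return reward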
Hybrid Intrinsic Reward:
76. 76
Agent57: Exploration at Scale
An upgraded version of the NGU:
Splitting the value function into 2 separate value functions for external and internal rewards
A population of policies (and value functions) is
trained, each characterized by a distinct pair of
exploration parameters:
N is the size of the population. 𝛾j is the discount
factor hyperparameters and βj is the intrinsic
reward coefficient hyperparameters. Adapted by a
meta controller (bandit algo.) Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan
Daniel Guo, and Charles Blundell. "Agent57: Outperforming the atari human benchmark." In International
conference on machine learning, pp. 507-517. PMLR, 2020.
78. 78
Cluster Memory for Counting
Parametric methods for counting have problems: slow adaptation and catastrophic forgetting
Count estimation provides a long-term visitation-based exploration bonus while retaining responsiveness to the most recent experience → a finite slot-based container M stores representations and corresponding counters C
The RECODE memory is updated by either:
Adding a new embedding (atom) with count 1, or
Updating the nearest atom and increasing its count
Kernel is non-zero for all
neighbours within a radius
79. 79
RECODE: Representation Learning
The transformer takes masked sequences of
length k consisting of actions and embedded
observations as inputs and tries to reconstruct
the missing embeddings in the output
The reconstructed embeddings at time t − 1 and
t are then used to build a 1-step action-
prediction classifier
Similar to ICM’s inverse dynamics
Saade, Alaa, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo
Sarra, Oliver Groth, Michal Valko, and Bilal Piot. "Unlocking the Power of Representations in Long-term
Novelty-based Exploration." In The Twelfth International Conference on Learning Representations. 2024.
80. 80
Novelty of Surprise
The norm of the prediction error (surprise norm) is
not good (e.g., as in Noisy-TV)
A new metric: surprise novelty, the error of
reconstructing surprise (the error of state
prediction)
This requires a surprise generator such as a
dynamics model to produce the surprise vector u, i.e., the difference vector between the prediction and reality
Then, inter and intra-episode novelty scores are
estimated by a system of memory, called Surprise
Memory (SM), consisting of an autoencoder
network W and episodic memory M, respectively
Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise:
Improving Exploration Through Surprise Novelty. In AAMAS, 2024.
81. 81
Benefit of Hybrid Systems
Marry the goodness of both worlds:
Dynamic Prediction
Novelty Estimation
Combine intrinsic rewards:
Long-term, global, inter-episodes
Short-term, local, intra-episodes
Limitation of dynamic prediction using deep models
can be compensated with non-parametric memory
approaches
Noisy-TV problems are mitigated but not completely
solved
Noisy-TV: a random TV will distract the RL agent from its main
task due to high surprise (source).
82. 82
However, Is Intrinsic Reward Really Good?
Taiga et al. On bonus based exploration methods in the arcade
learning environment. In ICLR 2019
Intrinsic motivation rewards rely heavily on the memory concept (global, local, …)
The performance of IR agents is very good, but …
They require more samples to train (10^9 or 10^10 steps is the norm)
Isn't the goal of IR to enable sample efficiency?
Overfitting to Montezuma's Revenge?
Depending on architecture and tuning, plain exploration is generally OK
83. 83
Issues with Intrinsic Motivation
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
84. 84
Reflection on Memory
Surprise:
Memory is hidden inside dynamics models, memorizing the seen observations to make prediction possible
This memory is long-term, semantic, and slow to update
Novelty:
Memory is explicit, e.g., a slot-based matrix, nearest-neighbour estimator, counter, …
This memory is often short-term, instance-based, and adaptive to changes in the environment
[Diagram: memory underlies both surprise and novelty in intrinsic exploration; memory → exploration?]
85. 85
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay 👈 [We are here]
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
86. 86
A Direct Exploration Mechanism
Two major issues have hindered the ability of
previous algorithms to explore:
Detachment: loses track of interesting areas
to explore from
Derailment: the exploratory mechanisms of the algorithm prevent it from returning to previously visited states
The role of memory is simplified:
Store past states
Retrieve states to explore
[Diagram: replay memory → sampled states → exploration]
87. 87
Go-Explore
Detachment can be addressed by Memory: keep track of areas by grouping similar states into cells
Similar to Hash count
Map a state to a cell
Each cell has a score indicating sampling probability
Derailment can be addressed by Simulator (only suitable for
Atari)
Sample a cell’s state from the memory
Simulator resets the state of the agent to the cell’s state
The memory is updated with new cells during exploration
Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
88. 88
Go-Explore: Engineering Details
State to Cell: downscaled cell with adaptive downscaling
parameters for robustness:
Calculating how a sample of recent frames would be
grouped into cells
Selecting the values that result in the best cell
distribution (manually designed)
The selection probability of a cell at each step is proportional to its selection weight → count-based
Domain knowledge weight: (1) the number of horizontal neighbours
to the cell present in the archive (h); (2) a key bonus: for each location
Train from demonstrations: backward algorithm places the agent close
to the end of the trajectory and runs PPO until the performance of the
agent matches that of the demonstration
Cseen is the number of exploration steps in which
that cell is visited
89. 89
Addressing Cell Limitations
❌ Cell design is not obvious → it requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task
Latent Go-Explore: Go-Explore operates without cells:
A latent representation is learned simultaneously with the exploration
Sampling of the final goal is based on a non-parametric density model of the latent space
Replace simulator with goal-based exploration
Quentin Gallouédec and Emmanuel Dellandréa. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. PMLR, 10571–10586.
90. 90
Latent Go-Explore: Details
Representation Learning:
ICM’s Inverse Dynamics
Forward Dynamics
Vector Quantized Variational Autoencoder
Reconstruction
Density Estimation to sample goals:
Goals must be at the edge of the yet unexplored
areas
Goal is reachable (already visited)
Use particle-based entropy estimator to estimate
density score and rank
A geometric law over the rank is used for sampling; p is a hyperparameter. The higher the rank (denser region), the less novel the state → sample it less
92. 92
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay👈 [We are here]
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
93. 93
Imitation Learning: Exploration via Exploitation
Exploiting past good experiences can indirectly drive deeper exploration
Self-imitation learning imitates the agent's own past good decisions
Memory is a replay buffer that stores experiences
Learns a policy to imitate state-action pairs in the replay
buffer only when the return in the past episode is
greater than the agent’s value estimate (performance-
based)
If the return in the past is greater than the agent’s value
estimate (R > Vθ), the agent learns to choose the action
chosen in the past in the given state.
Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In International
conference on machine learning, pp. 3878-3887. PMLR, 2018.
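A minimal sketch of the self-imitation policy loss: past actions are imitated only when their stored return exceeds the current value estimate (R > V); the clipping and averaging details are illustrative assumptions.

import torch

def sil_policy_loss(log_prob, returns, values):
    # Imitate only transitions whose past return beats the current value estimate
    advantage = torch.clamp(returns - values, min=0.0).detach()
    return -(log_prob * advantage).mean()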
94. 94
Goal-based Exploration Using Memory
Generate new trajectories visiting novel states
by editing or augmenting the trajectories stored
in the memory from past experiences
A sequence-to-sequence model with an
attention mechanism learns to ‘translate’ the
demonstration trajectory to a sequence of
actions and generate a new trajectory
Sample using count-based novelty
Insert a new trajectory if its ending is significantly different
Otherwise, replace with the higher-return trajectory
Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee.
2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural
Information Processing Systems 33 (2020), 4333–4345.
Diverse Trajectory-
conditioned Self-
Imitation Learning
(DTSIL)
95. 95
DTSIL: Policy Learning
Train a trajectory-conditioned policy πθ(at|e≤t, ot, g) that
should flexibly imitate any given trajectory g
To imitate, assign rim as the imitation reward (0.1) if the
state is similar
After visiting the last (non-terminal) state in the
demonstration, the agent performs random exploration
(r=0) to encourage exploration
Policy Gradient training:
Further imitation encouragement
97. 97
Replay Memory: Pros and Cons
👍 Replay memory provides a direct exploration
mechanism without intrinsic reward
👍 Sampling strategies are built upon previous
works
❌ Make additional assumptions such as simulator
or the availability of demonstrations
❌ Often requires a goal-based policy and multiple sub-trainings
Generated by DALL-E 3
100. 100
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration👈 [We are here]
Language-assisted RL
LLM-based exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
101. Beyond the state space: Language-guided exploration
Why language?
Humans are able to learn quickly
in new environments due to a
rich set of commonsense priors
about the world
→ reflected in language
Read the instructions of the game to avoid trial and error.
Abstraction
Compositional
Generalization
101
102. 102
Luketina, Jelena, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas,
Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. "A survey of reinforcement
learning informed by natural language." arXiv preprint arXiv:1906.03926 (2019).
How Language can be used in RL
103. Make a cake from tools and materials
A pure RL agent needs to try thousands of settings until it finds the desired characteristics
If the RL agent reads and follows the recipe, it may take one trial to succeed
103
https://www.moresteam.com/toolbox/design-of-experiments.cfm
A more practical use case
104. 104
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Language-assisted RL 👈 [We are here]
LLM-based exploration
Causal discovery for exploration
Closing Remarks
QA and Demo
105. 105
Gated-attention for Task-oriented Language
Grounding
Task-oriented Language Grounding: extract meaningful representations from natural language instructions and map them to visual elements and actions
Task: given an initial image in pixels and an instruction -> guide the agent to move towards the desired object
2 main modules:
State Processing: process image and language jointly to obtain a state
Policy Learning: use a policy to map states to corresponding actions
Generated by DALL-E 3
106. 106
State Processing Module
Use Gated-Attention instead of concatenation to jointly represent image and language information as one state
The language instruction goes through a fully-connected layer to match the image dimension, producing the Attention Vector
Each element of the Attention Vector is expanded to an (H × W) matrix to match the feature map of each image channel
The final representation is obtained via an element-wise product between the image and language representations
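A minimal sketch of this gated-attention fusion: a sigmoid attention vector computed from the instruction gates the image feature maps channel-wise; layer sizes and the sigmoid choice are illustrative assumptions.

import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.fc = nn.Linear(text_dim, n_channels)

    def forward(self, image_feats, text_emb):
        # image_feats: [B, C, H, W]; text_emb: [B, text_dim]
        gate = torch.sigmoid(self.fc(text_emb))   # attention vector, one value per channel
        gate = gate.unsqueeze(-1).unsqueeze(-1)    # expand to match the H x W feature maps
        return image_feats * gate                  # element-wise product of image and language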
Generated by DALL-E 3
107. 107
Policy Module
Actions:
Turn (left, right)
Move forward
Policy Architecture: two variants
Behaviour Cloning: Uses target and object locations and
orientations in every state to select optimal action
A3C: use a deep neural network to learn policy and value
function. Gives action based on policy 𝜋(𝑎|𝐼𝐿, 𝐿)
Generated by DALL-E 3
108. 108
Results
Generated by DALL-E 3
- With Gated-Attention, agent learns faster and achieves better accuracy (success rate of reaching correct object before
episode terminates)
- As environment gets harder, more exploration is needed -> A3C with GA performs better than Imitation Learning, where
little exploration is done
109. 109
Semantic Exploration from Language Abstractions and
Pretrained Representations
Novelty-based exploration methods suffer in high-dimensional
visual state spaces
e.g.: different viewpoints of one place in 3D can map to distinct
visual states/ features, despite being semantically similar
Language can be a useful abstraction for exploration, as it
coarsens the state space in a way that reflects the semantics of
environment
Solution: Use vision-language pretrained representations to
guide semantic exploration in 3D
Generated by DALL-E 3
Example of a state (picture) with language
description (caption). Note how the
caption focuses on important aspects of
the state
Example of how many states can be
conveyed with one text caption
110. 110
Intrinsic Reward Design
State 𝑠𝑡: embedding of Image, denoted as 𝑂𝑉
Goal: Described by a text instruction, denoted as 𝑔
Caption is encoded by a pretrained language
encoder, output embedding is denoted as 𝑂𝐿; only
used to calculate intrinsic reward. Note that agent
never observes 𝑂𝐿.
Intrinsic reward is goal-agnostic; computed with
access to state representation (either 𝑂𝑉 or 𝑂𝐿)
Add intrinsic reward for two exploration algorithms:
Never Give Up (NGU; Badia et al.)
Random Network Distillation (RND; Burda et al.)
Generated by DALL-E 3
Badia, Adrià Puigdomènech, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038 (2020).
Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint
arXiv:1810.12894, 2018.
111. 111
Never Give Up (NGU)
State representations (either 𝑂𝐿 or 𝑂𝑉, used to compute the intrinsic reward) along the trajectory are written to a memory buffer
Novelty is a function of the L2 distance between the current state and its k-nearest neighbours in the buffer; the intrinsic reward is higher for larger distances (see the sketch below)
To influence exploration: modify the embedding function
Originally, the embedding function is learned
Variants:
Vis-NGU & LSE-NGU: use visual embeddings (𝑂𝑉)
Lang-NGU: use language embeddings (𝑂𝐿)
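A hedged sketch of NGU-style episodic novelty: the intrinsic reward grows with the k-nearest-neighbour distance between the current embedding and the embeddings already stored in the episodic buffer. The embedding function phi may be learned, visual (Vis-NGU) or language-based (Lang-NGU); the kernel constants below are illustrative, not the paper's exact values.

import numpy as np

def episodic_novelty(embedding: np.ndarray, buffer: list, k: int = 10, eps: float = 1e-3) -> float:
    if not buffer:
        return 1.0
    dists = np.sort([np.sum((embedding - e) ** 2) for e in buffer])[:k]   # k nearest squared distances
    dists = dists / (np.mean(dists) + 1e-8)          # normalize by the mean neighbour distance
    kernel = eps / (dists + eps)                     # inverse kernel: close neighbours give large values
    return 1.0 / (np.sqrt(np.sum(kernel)) + 1e-8)    # reward is larger when neighbours are far away

# Per step: r_int = episodic_novelty(phi(obs), buffer); buffer.append(phi(obs))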
112. 112
Random Network Distillation (RND)
The intrinsic reward is the prediction error between a
trainable predictor network and a frozen, randomly initialized
target network
The predictor is trained independently of the
policy network
As training progresses, frequently-visited states yield
less intrinsic reward because their prediction error shrinks (see the sketch below)
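A minimal sketch of the RND intrinsic reward, assuming small MLPs; the network sizes and learning rate are illustrative.

import torch
import torch.nn as nn

obs_dim, feat_dim = 64, 32
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
for p in target.parameters():
    p.requires_grad = False                          # the target network stays fixed
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_step(obs: torch.Tensor) -> float:
    error = ((predictor(obs) - target(obs).detach()) ** 2).mean()
    opt.zero_grad(); error.backward(); opt.step()    # frequently-visited states get lower error
    return error.item()                              # use as intrinsic reward r_i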
113. 113
Results
Example of how language representations
(orange line) help the agent explore better
than visual representations in a 3D
environment.
With visual representations, the agent
struggles with different views of the same
scene.
114. 114
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Language-assisted RL
LLM-based exploration👈 [We are here]
Causal discovery for exploration
Closing Remarks
QA and Demo
115. 115
ELLM: Guiding Pretraining in RL with LLM
Many distinct actions can lead to similar outcomes -> Intrinsically Motivated RL (IM-RL):
explore outcomes rather than actions
Competence-based IM (CB-IM): maximize the diversity of skills mastered by the agent
CB-IM optimizes an intrinsic reward 𝑅𝑖𝑛𝑡(𝑜, 𝑎, 𝑜′|𝑔) defined over goals 𝑔 sampled from a goal distribution G
Given those, CB-IM algorithms train a goal-conditioned policy 𝜋(𝑎|𝑜, 𝑔) that maximizes 𝑅𝑖𝑛𝑡
The goal distribution G and 𝑅𝑖𝑛𝑡(𝑜, 𝑎, 𝑜′|𝑔) must be defined such that three properties are satisfied:
1. Diverse
2. Common-sense sensitive (e.g., chop a tree > drink a tree)
3. Context sensitive (e.g., only chop the tree when the tree is in view)
116. 116
Why ELLM?
Previous methods hand-define 𝑅𝑖𝑛𝑡 and G, and use various motivations to guide goal
sampling 𝑔~𝐺: novelty, learning progress, intermediate difficulty
ELLM alleviates the need for environment-specific hand-coded definitions of 𝑅𝑖𝑛𝑡 and
G by using a language model for:
Language-based goal representations
Language-model-based goal generation
117. 117
Architecture
Goal generation: construct a prompt from the available actions and a description of the current observation ->
the LLM generates goals
Open-ended goal generation: ask the LLM to propose goals freely
Closed-form: ask the LLM yes/no questions (e.g., should the agent do X? yes/no) (see the illustrative prompts below)
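Illustrative prompts for the two goal-generation modes; the wording below is ours, not the paper's exact prompt, and the captioner producing obs_description is assumed.

def open_ended_prompt(obs_description: str, actions: list) -> str:
    return (
        f"You are an agent in a survival game.\n"
        f"Valid action verbs: {', '.join(actions)}.\n"
        f"Current observation: {obs_description}\n"
        f"Suggest a list of useful goals to pursue next:"
    )

def closed_form_prompt(obs_description: str, candidate_goal: str) -> str:
    return (
        f"Current observation: {obs_description}\n"
        f"Should the agent try to {candidate_goal}? Answer yes or no:"
    )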
118. 118
Intrinsic Reward Design
Reward LLM goals by the semantic similarity between a generated
goal and the description of the agent's transition 𝐶𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛
If there are multiple goals -> reward the agent with the
most similar goal to the transition description (see the sketch below)
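A hedged sketch of the similarity-based reward, assuming a SentenceBERT-style encoder; the model name "all-MiniLM-L6-v2" and the thresholding detail are our assumptions, not the paper's exact settings.

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed encoder choice

def ellm_reward(transition_caption: str, goals: list, threshold: float = 0.5) -> float:
    c = encoder.encode([transition_caption])[0]
    g = encoder.encode(goals)
    sims = g @ c / (np.linalg.norm(g, axis=1) * np.linalg.norm(c) + 1e-8)   # cosine similarities
    best = float(np.max(sims))                       # reward w.r.t. the most similar goal
    return best if best > threshold else 0.0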
120. 120
Intrinsically Guided Exploration from LLMs (IGE-LLMs)
For long, sequential tasks with sparse rewards, an intrinsic reward can guide policy learning
towards exploration, alongside the main policy driver, the extrinsic reward
An LLM can be used as an evaluator of the potential future reward of every action a, mapping each (s, a)
pair directly to an intrinsic reward 𝑟𝑖
The total reward is 𝑟𝑐 = 𝑟𝑒 + 𝜆𝑖 𝑟𝑖 𝑤𝑖, where 𝑟𝑒 is the external reward, 𝜆𝑖 is a controlling factor and 𝑤𝑖 is a linearly decaying weight (see the sketch below)
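A minimal sketch of the reward combination; r_i is assumed to come from an LLM rating of the (s, a) pair queried once and cached, and the linear decay schedule is illustrative.

def combined_reward(r_e: float, r_i: float, step: int, total_steps: int,
                    lambda_i: float = 1.0) -> float:
    w_i = max(0.0, 1.0 - step / total_steps)   # linearly decaying weight
    return r_e + lambda_i * r_i * w_i          # r_c = r_e + lambda_i * r_i * w_i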
121. 121
Prompting
Example of an input prompt for evaluating possible
actions in the DeepSea environment.
The LLM is given the current position and the possible
actions, and is asked to rate
every possible next action.
122. 122
Benefit of Intrinsic Reward
LLMs improve traditional exploration
methods
Using the LLM to generate actions directly
exhibits significant errors (grey lines in
the right graph), even with advanced
LLMs (GPT-4)
However, when used only as an intrinsic reward,
the LLM helps with exploration, especially
in harder environments, and results in
better performance
123. 123
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration👈 [We are here]
Statistical approaches
Deep learning approaches
Closing Remarks
QA and Demo
124. 124
What is causality?
The relationship between cause and effect.
It raises two fundamental questions:
Causal discovery: what evidence is required to infer
cause-effect relationships?
Causal inference: given causal information,
what inferences can be drawn?
Structural Causal Models (SCMs) framework
(Pearl, 2009a).
Causal graph: a directed graph with edges pointing from causes to effects.
Figure 1: Example of an SCM and causal graph for the scurvy problem.
125. 125
Why causality and RL?
Understanding cause and effect reduces unnecessary exploratory actions, and thus improves sample
efficiency.
e.g., do not move toward the door before obtaining the key.
It improves interpretability.
e.g., why does the policy prioritize obtaining the key?
It improves generalizability.
126. 126
Interpreting Causality in RL Environment
Taking action A can affect the reward R.
State S is the context variable that affects both
the action A and the reward R.
The most common formulation is the one shown in Figure 2,
where U is an unknown confounder variable.
We categorize methods by how they improve
exploration or by how they measure causality:
statistical vs deep learning methods.
Figure 2: Causality in an RL environment.
127. 127
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Statistical approaches👈 [We are here]
Deep learning approaches
Closing Remarks
QA and Demo
128. 128
Causal influence detection for improving
efficiency in reinforcement learning.
Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal influence detection for
improving efficiency in reinforcement learning. Advances in Neural
Information Processing Systems, 34, 22905-22918.
129. 129
Causal Action Influence Detection (CAI)
As mentioned previously, the action A and the state S jointly determine the next state.
Decompose the state S into N components.
One-step transition graph:
How to detect when the action influences the next
state S'?
Figure 3: Global causal graph (fully connected).
Figure 4: Example of situation-dependent control.
130. 130
Causal Action Influence Detection (CAI) (cont.)
CAI measures the influence of the action on state component j with
Conditional Mutual Information (CMI): C_j(s) = I(A; S'_j | S = s).
Estimation of CAI: learn a forward model p(s' | s, a)
from data and approximate the CMI with it (see the sketch below).
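A hedged sketch of estimating CAI with a learned Gaussian forward model and a Monte Carlo approximation of the CMI; the paper uses more careful mixture approximations, so this is a simplification.

import torch
import torch.distributions as D

def cai_score(forward_model, state, actions, j: int) -> torch.Tensor:
    """forward_model(state, action) -> (mean, std) of p(s' | s, a), per state component."""
    cond = []
    for a in actions:                                    # sampled candidate actions
        mean, std = forward_model(state, a)
        cond.append(D.Normal(mean[j], std[j]))
    # Approximate the marginal p(s'_j | s) by the mixture over the sampled actions.
    mix = D.MixtureSameFamily(
        D.Categorical(torch.ones(len(cond))),
        D.Normal(torch.stack([d.mean for d in cond]), torch.stack([d.stddev for d in cond])))
    # CMI is approximated by the mean over actions of KL(p(s'_j|s,a) || p(s'_j|s)), via samples.
    kls = []
    for d in cond:
        x = d.sample((64,))
        kls.append((d.log_prob(x) - mix.log_prob(x)).mean())
    return torch.stack(kls).mean()   # large value: the action causally influences s'_j

# Usage: add as a bonus, r = r_task + beta * cai_score(...), or use the score on its own.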
131. 131
Using CAI to Improve Exploration in RL
CAI as an intrinsic reward.
Active exploration with CAI.
CAI-prioritized experience replay.
Experiments on 3 environments: FetchPush,
FetchPickAndPlace, FetchRotTable. The goal is the
coordinate the object must reach.
Baseline RL algorithm: DDPG + HER.
Figure 5: FetchPickAndPlace Environment.
Figure 6: FetchRotTable Environment.
132. 132
CAI as Intrinsic Reward
Use CAI as a reward signal.
Use it on its own or together with the task reward.
Figure 7: Bonus reward improves performance on
FetchPickAndPlace.
133. 133
Active Exploration with CAI
Replace random exploration with causal exploration:
choose the action with the highest contribution to the CAI
score (see the sketch below).
Figure 8: Performance of active exploration in
FetchPickAndPlace depending on the fraction of
exploratory actions chosen actively from a total of 30%
(epsilon) exploratory actions.
Figure 9: Experiment comparing exploration
strategies on FetchPickAndPlace. The combination of
active exploration and reward bonus yields the largest
sample efficiency.
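A hedged sketch of the active-exploration rule: on an exploratory step, pick the candidate action whose predicted distribution over state component j diverges most from the action-marginal. The moment-matched Gaussian marginal here is our simplification of the full CMI computation.

import random
import torch
import torch.distributions as D

def active_exploratory_action(forward_model, state, candidate_actions, j: int,
                              active_fraction: float = 0.5):
    if random.random() > active_fraction:
        return random.choice(candidate_actions)          # keep a random component
    dists = [D.Normal(*[t[j] for t in forward_model(state, a)]) for a in candidate_actions]
    means = torch.stack([d.mean for d in dists])
    stds = torch.stack([d.stddev for d in dists])
    # Moment-matched Gaussian approximation of the action-marginal p(s'_j | s).
    marginal = D.Normal(means.mean(), (stds.pow(2).mean() + means.var(unbiased=False)).sqrt())
    kls = torch.stack([D.kl_divergence(d, marginal) for d in dists])
    return candidate_actions[int(torch.argmax(kls))]     # most causally influential action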
134. 134
CAI Experience Replay
Choose episodes for replay from the replay buffer
guided by a causal (inverse) ranking: episodes are ranked by their
aggregate CAI score, and the probability of sampling any state from
episode i (of the M episodes in the buffer, each of length T) decreases
with the episode's rank (see the sketch below).
Figure 10: Comparison of CAI-P with baselines (energy-based method with privileged
information (EBP), prioritized experience replay (PER), and HER without prioritization)
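A hedged sketch of rank-based causal prioritization for the replay buffer; the exact weighting used in the paper may differ, this only shows the inverse-rank idea.

import numpy as np

def episode_sampling_probs(episode_cai_scores: np.ndarray) -> np.ndarray:
    """episode_cai_scores: per-episode aggregate CAI (e.g., mean over the episode's steps)."""
    order = np.argsort(-episode_cai_scores)              # descending: rank 1 = most influence
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    weights = 1.0 / ranks                                # inverse-rank weighting
    return weights / weights.sum()                       # probability per episode

# Sampling a state: pick episode i with these probabilities, then a timestep uniformly in [0, T).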
135. 135
Tutorial Outline
Part A: Reinforcement Learning Fundamentals and
Exploration Inefficiency (30 minutes)
Welcome and Introduction
Reinforcement Learning Basics
Exploring Challenges in Deep RL
QA and Demo
Part B: Surprise and Novelty (110 minutes, including a
20-minute break)
Principles and Frameworks
Deliberate Memory for Surprise-driven Exploration
Forward dynamics prediction
Advanced dynamics-based surprises
Ensemble and disagreement
Break
RAM-like Memory for Novelty-based
Exploration
Replay Memory
Novelty-based Replay
Performance-based Replay
QA and Demo
Part C: Advanced Topics (60 minutes)
Language-guided exploration
Causal discovery for exploration
Statistical approaches
Deep learning approaches👈 [We are here]
Closing Remarks
QA and Demo
136. 136
Causality-driven hierarchical structure
discovery for reinforcement learning.
Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... & Chen, Y. (2022).
Causality-driven hierarchical structure discovery for reinforcement
learning. Advances in Neural Information Processing Systems, 35, 20064-
20076.
137. 137
Structural Causal Representation Learning
Environment with multiple objects.
e.g., having wood and stone can make an axe.
How to measure causality between these
objects?
Model the SCM of objects between adjacent
timesteps.
Figure 11: Example of an environment with
multiple objects and its causal graph.
138. 138
Structural Causal Representation Learning (cont.)
Consider a simpler case with 4 objects A, B, C, D,
where A is the object of interest.
We need a forward/transition model that predicts the next state of A,
parameterized by a neural network.
We need a masking function (otherwise we don't know
which objects affect A),
parameterized by a mask over the M objects, where M is
the number of objects.
Figure 12: Example of SCM representation
learning (with object of interest A).
139. 139
Structural Causal Representation Learning (cont.)
Iterative process:
fix one parameter set while optimizing the other.
After optimization finishes, extract the causal edges (see the hedged sketch below).
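A heavily hedged sketch of the alternating optimization: fix the mask while fitting the forward model, then fix the forward model while fitting a relaxed (sigmoid) mask, and finally threshold the mask to extract edges into the object of interest. The parameterization, losses, threshold and the data.sample() batch source are illustrative assumptions, not the paper's exact method.

import torch
import torch.nn as nn

M, obj_dim = 4, 8                                     # M objects (A, B, C, D), each obj_dim-dimensional
forward_net = nn.Sequential(nn.Linear(M * obj_dim, 64), nn.ReLU(), nn.Linear(64, obj_dim))
mask_logits = nn.Parameter(torch.zeros(M))            # one gate per potential parent object

def predict_next_A(objects):                          # objects: (B, M, obj_dim)
    gated = objects * torch.sigmoid(mask_logits).view(1, M, 1)
    return forward_net(gated.flatten(1))              # predicted next state of object A

def train(data, steps: int = 1000):
    opt_f = torch.optim.Adam(forward_net.parameters(), lr=1e-3)
    opt_m = torch.optim.Adam([mask_logits], lr=1e-2)
    for step in range(steps):
        objects, next_A = data.sample()               # assumed batch sampler of transitions
        loss = ((predict_next_A(objects) - next_A) ** 2).mean()
        loss = loss + 1e-3 * torch.sigmoid(mask_logits).sum()   # sparsity pressure on the mask
        opt_f.zero_grad(); opt_m.zero_grad()
        loss.backward()
        (opt_f if step % 2 == 0 else opt_m).step()    # alternate which parameter set is optimized
    return (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten()   # indices of A's parents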
141. 141
Hierarchical Causal Subgoal Training
Whenever new subgoals are added, train a goal-conditioned policy for them.
Decide whether a subgoal is reachable from the
current state with the current policy.
If it is reachable within a certain number of timesteps,
add it to the subgoal set.
Figure 13: Example of a subgoal hierarchy
given the causal graph.
142. 142
Figure 16: Results on Minigrid-2d (left) and Eden (right).
Figure 15: Environments Minigrid-2d (left) and Eden (right).
An upper controller policy is trained to select subgoals from the current subgoal set and maximize the task reward.
The upper controller is a multi-level DQN with HER.
143. 143
Disentangling causal effects for
hierarchical reinforcement learning.
Corcoll, O., & Vicente, R. (2020). Disentangling causal effects for hierarchical
reinforcement learning. arXiv preprint arXiv:2010.01351.
144. 144
Controlled Effect Disentanglement
The total effect, the change in environment state,
comprises dynamic effects and controllable effects:
is the next state an outcome of the action,
or would it have happened anyway?
We care about controllable effects.
Based on the Average Treatment Effect (ATE):
the "normality" is the expected effect regardless of the chosen action,
and the controllable effect is the deviation from it.
Figure 17: The relationship between total effects, dynamic
effects, and controllable effects.
145. 145
We cannot compute the total effect for every action,
so a neural network is used to estimate it,
and a vector representation of the effect is learned (see the hedged sketch below).
Figure 18: Total-effect modelling architecture.
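A hedged sketch of separating controllable from dynamic effects: a network predicts the total effect of (s, a), the "normality" is taken as the predicted effect averaged over actions, and the controllable effect is the deviation from it. Treating the action-marginal as the normality, and the architecture, are our simplifications.

import torch
import torch.nn as nn

class EffectModel(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.total = nn.Sequential(nn.Linear(state_dim + n_actions, hid), nn.ReLU(),
                                   nn.Linear(hid, state_dim))     # total effect given (s, a)
        self.n_actions = n_actions

    def controllable_effect(self, s: torch.Tensor, a_onehot: torch.Tensor) -> torch.Tensor:
        total = self.total(torch.cat([s, a_onehot], dim=-1))
        # Normality: average predicted effect over all actions (what happens "anyway").
        eye = torch.eye(self.n_actions)
        all_effects = torch.stack([self.total(torch.cat([s, eye[i].expand(s.shape[0], -1)], dim=-1))
                                   for i in range(self.n_actions)])
        normality = all_effects.mean(0)
        return total - normality                                  # what the action itself caused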
146. 146
Exploration with controllable effects as goals
Components: an effect-sampling policy, an action-taking policy, and a model of the distribution of effects.
Figure 19: Components of causal effects for hierarchical reinforcement learning.
147. 147
Controllable Effect Distribution Learning
Train a Variational Autoencoder to
approximate the distribution of controllable effects (see the sketch below).
Figure 20: VAE architecture to learn the effect distribution.
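A minimal VAE sketch for learning the distribution of controllable effects so that effects can be sampled as subgoals; dimensions and loss weights are illustrative, and the paper's exact architecture may differ.

import torch
import torch.nn as nn

class EffectVAE(nn.Module):
    def __init__(self, effect_dim: int, latent_dim: int = 8, hid: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(effect_dim, hid), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hid, latent_dim), nn.Linear(hid, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hid), nn.ReLU(), nn.Linear(hid, effect_dim))

    def forward(self, effect):
        h = self.enc(effect)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return ((recon - effect) ** 2).sum(-1).mean() + kl         # reconstruction + KL loss

    def sample_effect(self, n: int = 1):
        return self.dec(torch.randn(n, self.mu.out_features))      # sample effects as subgoals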
148. 148
Training to Select Goals and Reach Goals
Train a high-level controller with DQN to select a controllable effect as the subgoal.
Train a low-level policy with DQN to select actions that reach the chosen subgoal.
Figure 21: Architecture learning to select effects as
subgoals.
Figure 22: Architecture learning to select actions to reach
subgoals.
149. 149
Three levels of task difficulty:
Task T: go to the target location.
Task BT: go to the target location
while carrying a ball.
Task CBT: pick up the ball, put it in the
chest, and go to the target.
Figure 23: Comparison with the DQN baseline on the 3 tasks.
CEHRL can learn the complex task, while DQN cannot.
Figure 24: Random-effect vs random-action exploration.