The document discusses memory-based reinforcement learning. It begins with background on reinforcement learning and classic RL algorithms like Q-learning and policy gradients. It then discusses challenges with deep RL approaches that lack memory. Different types of memory are proposed to address these challenges, including episodic memory, semantic memory, and working memory. Memory-based approaches are shown to improve sample efficiency and performance on tasks like Atari games. The role of memory is also discussed for exploration, handling partial observability, and hyperparameter optimization in reinforcement learning.
3. What is Reinforcement Learning (RL)?
● Agent interacts with the environment
● (S, A) → (S′, R) (MDP)
● The transition can be stochastic or deterministic
● Find a policy π(S) → A that maximizes the expected return E(∑R) from the environment
4. A grid-world example
● The state space is discrete: 6 states corresponding to 6 locations on the map
● The action space is discrete: 4 actions corresponding to 4 movements
● The reward can be "nothing", "poison", "1 cheese" or "3 cheese", converted to scalars: 0, -1, 1, 3
● The transition in this case is deterministic, corresponding to the outcome of movements
○ It can be stochastic in other cases
○ E.g. at (0,0), moving left may result in (0,1) or (1,1) with equal probability
https://huggingface.co/blog/deep-rl-q-part2
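A minimal Python sketch of such a tabular MDP; the 2x3 layout, the reward placement and the deterministic transition rule below are hypothetical stand-ins for the map on the slide.

```python
import random

# States 0..5 are the 6 grid locations; actions are the 4 movements.
N_STATES, ACTIONS = 6, ["up", "down", "left", "right"]
REWARD = {0: 0, 1: 0, 2: -1, 3: 0, 4: 1, 5: 3}   # poison = -1, cheese = 1 or 3

def step(state, action):
    """Deterministic transition on a hypothetical 2x3 grid."""
    row, col = divmod(state, 3)
    if action == "up":    row = max(row - 1, 0)
    if action == "down":  row = min(row + 1, 1)
    if action == "left":  col = max(col - 1, 0)
    if action == "right": col = min(col + 1, 2)
    next_state = row * 3 + col
    return next_state, REWARD[next_state]

# A random-policy rollout: the return is the (undiscounted) sum of rewards.
state, ret = 0, 0
for _ in range(10):
    state, r = step(state, random.choice(ACTIONS))
    ret += r
print("return:", ret)
```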
5. Classic RL algorithms: Value learning
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms
for connectionist reinforcement learning." Machine learning 8, no. 3
(1992): 229-256.
● Basic idea: before finding optimal
policy, we find the value function
● Learn (action) value function:
○ V(s)
○ Q(s,a)
● V(s)=E(∑R from s)
● Q(s,a)=E(∑R from s,a)
● Given Q(s,a)
→ choose action that maximizes the
value (ε-greedy policy)
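A minimal sketch of the tabular Q-learning (TD) update described above; the environment interface (reset/step) and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: the Q-table acts as a simple value memory."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            td_target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```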
6. Classic RL algorithm: Policy gradient
● Basic idea: directly optimise the policy as
a function of states
● Need to estimate the gradient of the
objective function E(∑R) w.r.t the
parameters of the policy
● Focus on optimisation techniques
● No memory
REINFORCE
(policy gradient)
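A minimal REINFORCE sketch in PyTorch-style Python: ascend the gradient of E(∑R) via log-probability-weighted returns. The network architecture and the single-trajectory update are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a state vector to a categorical distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a trajectory of (state, action, reward) tuples."""
    states, actions, rewards = zip(*trajectory)
    # Discounted return G_t for every timestep
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    actions = torch.tensor(actions)
    # Policy-gradient loss: -E[ log pi(a|s) * G ]
    log_probs = policy(states).log_prob(actions)
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```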
8. Do we have memory in value learning?
● Q-table in value learning can be considered as a memory
● It remembers “how good a state-action pair is on average”
● The memory is very basic, non-smooth and redundant
9. Challenges in RL: the optimal policy can be complex
● Task:
○ The agent searches for the key
○ The agent picks up the key
○ The agent opens the door to access the room
○ The agent finds the box in the room
● Reward:
○ If the agent reaches the box, it gets a +1 reward
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated policies from such a simple reward?
10. Short Answer: just learn from many trials (data)!
Examples: chess, self-driving cars, video games, robotics
13. Limitation of training with big data
● High cost
○ Training time (days to months)
○ GPU (dozens to hundreds)
● Requires simulators
○ Agents are trained in simulation (millions to billions of steps)
○ The cost of one step of simulation can be high
○ Some real environments simply don't have a simulator
● Trained agents are unlike humans
○ Unsafe exploration
○ Weird behaviors
○ Fail to generalize
14. Human vs RL Agents in Atari games
● Human:
○ A few hours of practice to reach moderate performance
○ Doesn't forget how to play old games while learning new ones
○ Can play any game
● RL Agents (DQN-based):
○ 21 trillion hours of training to beat humans (AlphaZero), equivalent to 11,500 years of human practice
○ Catastrophic forgetting when learning games sequentially
○ Despite extensive training, there are still games they fail at
17. What is memory?
● Memory is the ability to efficiently store,
retain and recall information
● Brain memory stores items, events and
high-level structures
● Computer memory stores data,
programs and temporary variables
18. Memory in neural networks
● Long-term memory
○ Semantic memory: storing data in the neural network weights
○ Episodic memory: storing episodic events in a matrix memory
● Short-term memory
○ Associative memory: key-value binding as in a Hopfield network or a Transformer layer
○ Working memory: matrix memory in a memory-augmented neural network
● Functional memory
○ Memory stores programs
○ Memory of models, mixture of experts, ...
19. Semantic memory
● A feed-forward neural network can be
viewed as a semantic memory
○ Data is stored in the weights of the network via backpropagation
○ Data is read by forwarding the input
○ It can be an associative memory as well
● A table that stores the statistics of the data can also be a semantic memory (value table)
y = Wx
20. Working Memory
● Recurrent neural networks contain a working memory (the hidden state)
○ The hidden state captures past inputs
○ The prediction is made based on the hidden state
● Advanced versions of RNNs
○ GRU/LSTM
○ MANN (memory-augmented neural networks)
21. Episodic Memory
● Often implemented as a matrix or a table
● Can be a key-value memory
● Accessed via attention or analogy search
● Supports neural networks in making predictions
22. Properties of memories
1. Working memory: short-term lifespan, quick plasticity (built instantly). Example: if one episode is one day, the memory lasts for one day.
2. Episodic memory: long-term lifespan (persists across the agent's lifetime, for several years), quick plasticity (built instantly).
3. Semantic memory: long-term lifespan (persists across the agent's lifetime, for several years), slow plasticity (takes time to build).
25. Semantic Memory in RL
● The human brain implements RL
● Dopamine neurons reflect a reward prediction error (TD learning)
● What is the memory in the brain that stores V?
○ A value table is not scalable
○ Maybe a value model → DQN (semantic memory)
26. DQN: Replay buffer is an episodic memory
• Stores experiences: (s, a, r, s′) tuples
• Reads memory via replay sampling
• Memory content is used to train the action-value network Q
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
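A minimal replay-buffer sketch; the capacity, batch size and FIFO eviction shown here are illustrative assumptions rather than the exact DQN settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """Episodic memory of (s, a, r, s', done) transitions for off-policy training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def write(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform replay sampling; the batch is used to fit the Q-network
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```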
27. DQN’s memories are better than Q-table, but …
● Inefficiency:
○ Learning semantic memory (Q network)
is slow (gradient descent)
○ Optimise many parameters
● Bootstrap noise:
○ The target is the network’s output
○ If the network is not well trained, the target is noisy
● Replay buffer:
○ Stores raw observations
○ Needs many sampling iterations
28. Alternative: episodic control paradigm
[Diagram: the environment produces the current experience, e.g. (St, At); memory write stores experiences together with their final returns; memory read supplies values that drive the policy]
● Episodic memory is a key-value memory
● Directly binds experience to return → refers to experiences with high returns to make decisions
29. Model-free episodic control: K-nearest neighbors
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory, first-in-first-out eviction
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
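A rough sketch of the model-free episodic control idea under simplifying assumptions: one fixed-size, first-in-first-out table per action maps state embeddings (from a pretrained φ) to the best return seen so far, and values for unseen states are estimated by k-nearest-neighbour averaging. The capacity and k are placeholders.

```python
import numpy as np

class EpisodicControlMemory:
    """Per-action key-value store: state embedding -> best observed return."""
    def __init__(self, capacity=10_000, k=11):
        self.keys, self.values = [], []   # FIFO buffers
        self.capacity, self.k = capacity, k

    def write(self, embedding, ret):
        # Keep the maximum return seen for (approximately) this state
        for i, key in enumerate(self.keys):
            if np.allclose(key, embedding):
                self.values[i] = max(self.values[i], ret)
                return
        if len(self.keys) >= self.capacity:      # first-in-first-out eviction
            self.keys.pop(0); self.values.pop(0)
        self.keys.append(embedding); self.values.append(ret)

    def estimate(self, embedding):
        # k-nearest-neighbour average over stored returns (no gradient descent)
        if not self.keys:
            return 0.0
        dists = np.linalg.norm(np.array(self.keys) - embedding, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.array(self.values)[nearest]))
```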
31. Limitation of model-free episodic memory
● Near-deterministic assumption
○ Assumes a clean environment
○ Stores only the best (maximum) return
● Sample inefficiency:
○ Stores state-action values, which requires experiencing every action to gain experience
● Fixed combination of episodic and parametric values
○ The episodic contribution weight is unchanged across observations
○ Requires manual tuning of the weight
What if the state is partially
observable and the number of
actions is large?
32. Model-based episodic memory
● Learn a model of trajectories using
self-supervised training
○ Model=LSTM
○ Learn to reconstruct past state-action
given current trajectory and query
● The trained LSTM is used to generate
trajectory representations
→ counterfactual trajectory
→ imagine actions
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
33. Discrete-action environment: Atari benchmark
Evaluation metric:
Normalised score = (model score - random-play score) / (human score - random-play score)
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
~60 games
34. Sample efficiency test on Atari games
[Plot: normalised scores at different sample budgets: model-free (10M), hybrid (40M), model-based (10M), DQN (200M)]
36. When the state is not enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state=position, does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Memory of all past
observations/actions
[Figure: full map vs. the agent's observed state; an RNN serves as the policy model, its hidden state summarising past observations, trained with policy gradient]
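A minimal sketch of a recurrent policy for partially observable settings: the GRU hidden state serves as the working memory that summarises past observations. The dimensions and single-step interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """GRU-based policy: the hidden state is a working memory over past observations."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim, hidden)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        # Update the working memory with the current observation,
        # then act from the memory rather than from the raw observation alone.
        h = self.gru(obs, h)
        dist = torch.distributions.Categorical(logits=self.head(h))
        return dist, h
```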
37. Building a better working memory for a better context
● External memory: longer-term, stores more
● Unsupervised training to learn the read-write operations
Wayne, Greg, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja,
Agnieszka Grabska-Barwinska, Jack Rae et al. "Unsupervised predictive
memory in a goal-directed agent." arXiv preprint arXiv:1803.10760 (2018).
39. It is useful for memory-based decision processes
40. Benchmark: Navigation with Distraction
Hung, Chia-Chun, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi
Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. "Optimizing
agent behavior over long time scales by transporting value." Nature
communications 10, no. 1 (2019): 1-12.
45. Exploration issue in RL
● Rewards can be very sparse
○ RL agents cannot learn anything until
they collect the first reward
○ Explore forever?
● Sometimes the reward function in complicated real-world problems is unknown
○ No simulator available
○ Exploring freely in the real world is unsafe
→ Sample inefficiency
→ Need efficient exploration
46. Need exploration mechanisms to enable sample efficiency!
Aubret, A., L. Matignon, and S. Hassas. "A survey on intrinsic motivation in reinforcement learning."
47. In the biological world, agents cope with this problem very well
● Animals can travel long distances until they find food
● Humans can navigate to an address in an unfamiliar city
○ intrinsic motivation
○ curiosity, hunch
○ intrinsic reward
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
48. Agents should be motivated towards
“interesting” consequences
● C: actor vs. M: world model
● M predicts the consequences of C's actions
● As a result:
▪ If C's actions lead to repeated, boring consequences, M predicts them well
▪ So C must explore novel consequences
● Memory:
▪ To learn the world model
▪ To know whether something is novel or old
https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
49. M: Forward model learns dynamics
(semantic memory)
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. 2015
Novelty if the prediction error is high (intrinsic reward)
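A minimal sketch of this forward-model curiosity: a learned dynamics model predicts the next state, and the prediction error becomes the intrinsic reward. Network sizes and the action encoding (e.g. one-hot) are assumptions.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Semantic memory of dynamics: predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    # High prediction error => the consequence is novel => large intrinsic reward
    with torch.no_grad():
        pred = model(state, action)
    return torch.mean((pred - next_state) ** 2, dim=-1)
```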
50. When novelty as prediction error is useless
● The prediction target is stochastic
● Information necessary for the prediction
is missing
→ Both the totally predictable and the
fundamentally unpredictable will get boring
→Solution: Remember all experiences
● “Store” all observations, including
stochastic ones in working, semantic or
episodic memory
● Instead of predicting, try recalling from
the memory
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
51. Working memory: Store visited in-episode states
●Novelty through reachability:
▪ Boring if reachable from states in memory in
less than k steps
● Learn a classifier: reachable or unreachable
○ Collect pairs of states from a trajectory
○ Label whether one state is reachable from the other within k steps
Savinov et al. "Episodic Curiosity through Reachability." In ICLR 2018.
High if unreachable
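A rough sketch of the reachability bonus under simplifying assumptions: an in-episode memory of embeddings plus a trained reachability network r_net (a placeholder here) that scores whether the current observation is reachable within k steps from each stored state; the bonus is paid only when nothing in memory can reach it.

```python
import torch

def curiosity_bonus(r_net, memory, obs_embedding, threshold=0.5, bonus=1.0):
    """Intrinsic reward from reachability: high if the current observation is
    NOT reachable within k steps from any state stored this episode."""
    if len(memory) == 0:
        memory.append(obs_embedding)
        return bonus
    with torch.no_grad():
        mem = torch.stack(memory)                   # (M, d) stored embeddings
        cur = obs_embedding.expand_as(mem)          # (M, d) current embedding, tiled
        # r_net is assumed to output the probability that `cur` is reachable
        # from each memory entry within k steps
        reach_probs = r_net(mem, cur).squeeze(-1)   # (M,)
    if reach_probs.max() < threshold:               # novel: far from everything seen
        memory.append(obs_embedding)                # write the novel state to memory
        return bonus
    return 0.0
```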
52. Exploration with working memory is better
[Plot: exploration performance with no intrinsic reward, with intrinsic reward via dynamics prediction, and with intrinsic reward via working memory]
54. Semantic memory: distillation into the neural network's weights
● Target network: randomly transforms the state
● Predictor network: tries to remember the transformed state
○ A global memory
○ A random TV is not a problem
■ It remembers all noisy channels
https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
Burda et al. Random Network Distillation: a new take on Curiosity-Driven Learning, In ICLR 2019
High if it cannot distill
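A minimal random-network-distillation sketch: a fixed random target network transforms the observation, a predictor network is trained to reproduce that transform, and the remaining error is the exploration bonus. Network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: novelty = error in distilling a random target."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():   # the random target is never trained
            p.requires_grad = False

    def forward(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Per-observation prediction error: used as the intrinsic reward,
        # and its mean as the loss for training the predictor
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)
```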
55. Episodic memory: explore from stored good states
● Archive: a memory of good states (state, score) → sample one
● Purely random exploration from this state → collect more states
● Update the archive
And many other tricks: imitation learning, goal-based policies, ...
Adrian Ecoffet et al.: First return, then explore. Nature 2021
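A rough sketch of the archive loop in the spirit of "first return, then explore", under simplifying assumptions: cells index downsampled states, the archive keeps the best score per cell, and exploration from a restored state is purely random. The environment's save/restore interface and the cell function cell_fn are placeholders.

```python
import random

def go_explore(env, cell_fn, iterations=1000, explore_steps=100):
    """Archive-based exploration: return to a stored promising state, then explore."""
    archive = {}                                   # cell -> (score, saved env state)
    state = env.reset()
    archive[cell_fn(state)] = (0.0, env.save_state())

    for _ in range(iterations):
        # 1. Sample a cell from the archive and restore the simulator to it
        cell = random.choice(list(archive.keys()))
        score, saved = archive[cell]
        env.restore_state(saved)
        # 2. Purely random exploration from that state
        for _ in range(explore_steps):
            state, reward, done = env.step(env.sample_random_action())
            score += reward
            new_cell = cell_fn(state)
            # 3. Update the archive if the cell is new or reached with a better score
            if new_cell not in archive or score > archive[new_cell][0]:
                archive[new_cell] = (score, env.save_state())
            if done:
                break
    return archive
```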
58. Episodic memory for hyperparameter optimisation
●RL is very sensitive to hyperparameters
●SOTA performance is achieved with extensive
hyperparameter tuning
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint
arXiv:1708.04133.
[Table: DQN's list of hyperparameters is enormous!]
59. Limitation of memory-less optimiser
● The optimiser has no context of the training process
● Treated as a stateless bandit or a greedy optimisation
○ Ignoring the context prevents the use of episodic experiences that can be critical in optimisation and planning
○ E.g. the hyperparameters that helped overcome a past local optimum in the loss surface can be reused when the learning algorithm falls into a similar local optimum
How to build the context (the key in
key-value memory)?
60. Optimising hyperparameters as episodic RL
● At each policy update, the hyper-agent:
○ Observes the training context (hyper-state)
○ Configures the RL algorithm with suitable hyperparameters ψ (hyper-action)
○ Trains the RL agent with ψ and observes the learning progress (hyper-reward)
● The goal of the Hyper-RL is the same as the main RL's: to maximize the return of the RL agent
○ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (hyper-return)
KEY: experienced hyper-state/action | VALUE: outcome hyper-returns
Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung
Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training."
AAAI (2022).
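A rough sketch of the key-value idea, not the exact algorithm of the paper: a memory binds (hyper-state, hyper-action) keys to observed hyper-returns, and the hyper-agent picks the hyperparameter configuration whose nearest stored hyper-states promise the highest return. The distance metric and k are assumptions.

```python
import numpy as np

class HyperMemory:
    """Key-value episodic memory: (hyper-state, hyper-action) -> hyper-return."""
    def __init__(self, k=5):
        self.keys, self.values, self.k = [], [], k

    def write(self, hyper_state, action_id, hyper_return):
        self.keys.append((np.asarray(hyper_state), action_id))
        self.values.append(hyper_return)

    def estimate(self, hyper_state, action_id):
        # Average the stored hyper-returns of the k nearest hyper-states
        # that used the same hyper-action
        dists, vals = [], []
        for (s, a), v in zip(self.keys, self.values):
            if a == action_id:
                dists.append(np.linalg.norm(s - hyper_state)); vals.append(v)
        if not vals:
            return 0.0
        idx = np.argsort(dists)[: self.k]
        return float(np.mean(np.array(vals)[idx]))

def select_hyper_action(memory, hyper_state, n_actions):
    """Pick the hyperparameter configuration with the best estimated hyper-return."""
    return int(np.argmax([memory.estimate(hyper_state, a) for a in range(n_actions)]))
```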
61. Hyper-state representation learning
● Compress the parameters/gradients to a vector hyper-state s
● VAE learns to reconstruct s
● The latent vector is the hyper-state representation
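A minimal VAE sketch for compressing a hyper-state vector s into a latent representation; the dimensions and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HyperStateVAE(nn.Module):
    """Compress a hyper-state s (flattened parameters/gradients) into a latent code."""
    def __init__(self, s_dim, z_dim=32, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, s_dim))

    def forward(self, s):
        h = self.enc(s)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation
        recon = self.dec(z)
        # Reconstruction + KL terms of the ELBO; mu serves as the hyper-state representation
        recon_loss = ((recon - s) ** 2).mean()
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mu, recon_loss + kl
```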
64. Policy gradient optimisation
Issues with naïve Policy Gradient
● High variance and instability
● The gradient may not accurately reflect the
policy gain when the policy changes
substantially
Trust-region optimization is a solution
● The new policy should be inside a small
trust region around the last sampling policy
(old policy)
● Bound KL(π_θ_old || π_θ_new) (TRPO, PPO, ...)
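A minimal sketch of the trust-region idea as a KL-penalised surrogate objective (in the spirit of TRPO/PPO, not either exact algorithm); the penalty coefficient β and the frozen snapshot of the old policy are illustrative assumptions.

```python
import torch

def kl_penalized_pg_loss(policy, old_policy, states, actions, advantages, beta=1.0):
    """Surrogate objective with a KL penalty keeping the new policy near the old one."""
    dist = policy(states)                      # current policy distribution
    with torch.no_grad():
        old_dist = old_policy(states)          # snapshot taken before the update
    # Importance-weighted policy-gradient surrogate
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()
    # Trust-region term: KL(pi_old || pi_new), kept small via the beta penalty
    kl = torch.distributions.kl_divergence(old_dist, dist).mean()
    return -(surrogate - beta * kl)
```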
66. When the old policy is bad
● Bounding makes the new policy stuck in the same local optimum as the old policy
● Relying on one old policy is not enough
→ Need to store many past policies and rely on all of them
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, and Svetha
Venkatesh. "Memory-Constrained Policy Optimization." NeurIPS (2022).
67. Use two trust regions instead of one
[Figure: the PG objective with a backup trust region built from a virtual policy, in addition to the usual trust region around the old policy]
68. Memory of policy networks
- Build a memory of past policies. Choose ψ from the policy memory via attention
- f_ϕ is a neural network parameterized by ϕ that outputs softmax attention weights over the M past policies
- v is a "context" vector capturing different relations among θ, θ_old and ψ
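A rough sketch of attention over a memory of M past policy parameter vectors to form a virtual policy ψ; the three relation features used for the context vector v and the network f_phi are simplified placeholders for the construction in the paper.

```python
import torch
import torch.nn as nn

class PolicyMemoryAttention(nn.Module):
    """Attend over M stored past-policy parameter vectors to form a virtual policy psi."""
    def __init__(self, context_dim=3, hidden=64):
        super().__init__()
        self.f_phi = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, theta, theta_old, memory):
        # memory: (M, P) matrix of flattened past policy parameters
        # Context vector v per memory slot: simple relations among theta, theta_old, psi_i
        v = torch.stack([
            torch.stack([torch.norm(theta - m), torch.norm(theta_old - m),
                         torch.dot(theta - theta_old, m - theta_old)])
            for m in memory])                                        # (M, 3)
        weights = torch.softmax(self.f_phi(v).squeeze(-1), dim=0)    # (M,) attention
        psi = weights @ memory                   # convex combination of past policies
        return psi, weights
```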
71. In summary
● Memory assists RL agents in many forms:
○ Semantic
○ Working
○ Episodic
● And in many tasks:
○ Store experiences
○ Exploration
○ State representation
○ Optimisation (hyperparameters, policy)
72. What’s next? Life-long memory
● So far, the memory lifespan is restricted to an episode (working memory) or a
task (episodic or semantic memory)
● A real memory will span tasks and domains:
○ Playing 60 Atari games in a row
○ Learning MuJoCo, then learning Atari
● It requires a new kind of memory that supports different representations from different scenarios
● The amount of events and information is large
○ Efficient memory access mechanisms
○ Effective memory selection
73. What’s next? Dynamic memory
● Current memory is fixed size (table, matrix, neural network)
○ It is not enough when the observations are dense
○ It is redundant when the observations are sparse
● Can we build a dynamic memory that automatically grows and shrinks depending on the context?
○ Memory reads and writes will be more precise
○ No noise stored in the memory
74. What’s next? Hierarchical memory
● Current memory models are generally flat, supporting single-step access
● To remember details, it needs several steps of recall:
○ A coarse-grained chunk of steps
○ A specific step in the chunk
● Remember different timescales
○ Events from recent timesteps
○ Events from a distant episode
75. What’s next? Abstract memory
● Current memory models store specific events, states, actions or representations of them
● To excel in diverse tasks, it is critical to capture abstract concepts:
○ Goals (e.g. use the red key to open the red door)
○ Relationships (e.g. climbing the ladder and picking up the key are required to pass the level)
○ High-level objects (e.g. anything that blocks the door is an obstacle)
● It is unclear how artificial memory can store these complex concepts
76. What’s next? Complementary learning system
● A system of multiple memory kinds
● The memories communicate and transfer knowledge:
○ Episodic memory distills events into semantic knowledge
○ Working memory distills temporary information into long-term memory
● How to design an efficient and biologically plausible system of memories is an open problem
77. What’s next? Other testbeds for memory
● Continual RL
● Meta-RL
● Few-shot-RL