Model-Based Episodic Memory Induces
Dynamic Hybrid Controls
Authors: Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, Svetha
Venkatesh
Presented by Hung Le
Reinforcement learning
Image source: Wikipedia
1. Model-based RL
2. Model-free RL
3. Episodic RL
Episodic control is the third way.
Episodic memory (hippocampus):
• Stores instances of experiences
• Fast learning
• Heuristic / suboptimal
Questions that episodic memory can answer:
What did you have for breakfast this morning?
Which action did the agent take that resulted in a high return?
Typical episodic control paradigm
[Diagram: the current experience is used to read from memory; the memory stores experiences and their returns and produces a value for the policy, which acts in the environment; new experiences and returns are written back to memory.]
• Key-value episodic memory
• Key = experience, which can be anything from a single state to the whole trajectory
• Value = return / estimated value (see the sketch below)
Image source: Sutton & Barto, Reinforcement Learning: An Introduction
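To make the key-value picture concrete, here is a minimal, illustrative sketch of such a memory (not the paper's implementation); the fixed capacity, ring-buffer overwrite, and k-nearest-neighbor averaging are assumptions chosen for brevity.

```python
import numpy as np

class KeyValueEpisodicMemory:
    """Minimal key-value episodic memory (illustrative sketch, not the paper's code).

    Keys are fixed-size embeddings of experiences (a single state or a whole
    trajectory); values are the returns observed for those experiences.
    """

    def __init__(self, key_dim, capacity=10000):
        self.keys = np.zeros((capacity, key_dim))
        self.values = np.zeros(capacity)
        self.capacity = capacity
        self.count = 0

    def write(self, key, episode_return):
        # Overwrite the oldest slot once full (simple ring buffer, an assumption).
        idx = self.count % self.capacity
        self.keys[idx] = key
        self.values[idx] = episode_return
        self.count += 1

    def read(self, query, k=5):
        # Value estimate = mean return of the k nearest stored keys.
        n = min(self.count, self.capacity)
        if n == 0:
            return 0.0
        dists = np.linalg.norm(self.keys[:n] - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(self.values[nearest].mean())
```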
Hybrid design of episodic and model-free RL
(Complementary learning systems)
[Diagram: a slowly updated parametric network and a rapidly updated episodic memory jointly determine the action; the episodic memory supports rapid learning from new experience.]
Image source: internet, Neural Episodic Control
Limitations
• Near-deterministic assumption
→ stores only the best return
• Sample inefficiency
→ stores state-action values, which demands experiencing all actions to make reliable decisions
→ updates one memory slot at a time, so value propagation is slow
• Fixed combination of episodic and parametric values
→ the episodic contribution weight is unchanged across observations and requires manual tuning
Our contribution
• Episodic memory of trajectory values
→ stores trajectory representations instead of states, to handle noisy and partially observable (POMDP) settings
• Memory-based value estimation mechanism
→ Memory read: mixes the average and the max return of the nearest neighbors to balance pessimistic and optimistic estimates
→ Memory write: weighted-averaging write to multiple slots
→ Memory refine: bootstrapped memory updates hasten value propagation
• Dynamic hybrid control
→ a neural network learns to weight the episodic value against DQN’s value
→ conditioned on the current trajectory
Trajectory representation learning
• The trajectory model is an LSTM
• The hidden state τ⃗ is the trajectory representation
• Self-supervised learning (see the sketch below):
→ recall past events given the preceding event as a query (reconstruction loss)
→ two trajectories that share more common transitions are closer in the representation space
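As a rough, hedged illustration of this component (assumed sizes and a simplified objective, not the authors' code): an LSTM consumes flattened (observation, action, reward) transitions, its final hidden state plays the role of τ⃗, and a linear decoder trained to reconstruct the following transition stands in for the paper's query-based recall loss.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Illustrative LSTM trajectory model; dimensions and decoder are assumptions."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        # Each time step is a flattened (observation, action, reward) transition.
        self.lstm = nn.LSTM(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        # Linear decoder reconstructs the next transition (self-supervised signal).
        self.decoder = nn.Linear(hidden_dim, obs_dim + act_dim + 1)

    def forward(self, transitions):
        # transitions: (batch, time, obs_dim + act_dim + 1)
        outputs, (h, _) = self.lstm(transitions)
        tau = h[-1]            # trajectory representation for each sequence
        return tau, outputs

    def reconstruction_loss(self, transitions):
        # Predict transition t+1 from the hidden state after transition t.
        _, outputs = self.forward(transitions)
        pred = self.decoder(outputs[:, :-1])
        return nn.functional.mse_loss(pred, transitions[:, 1:])
```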
Memory reading
• (a) Average over neighbors: pessimistic
• (b) Max over neighbors: optimistic
• Randomly select (a) or (b) with probability p (see the sketch below)
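A hedged sketch of this read rule; k, p, and the Euclidean neighbor search are assumptions, and the paper may combine the two estimates differently.

```python
import numpy as np

def read_value(keys, values, query, k=5, p=0.5, rng=np.random):
    """Estimate a value from the k nearest neighbors of `query` (illustrative).

    With probability p use the pessimistic estimate (mean of neighbor returns),
    otherwise the optimistic one (max of neighbor returns).
    """
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    neighbor_returns = values[nearest]
    if rng.random() < p:
        return float(neighbor_returns.mean())   # (a) pessimistic
    return float(neighbor_returns.max())        # (b) optimistic
```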
Memory writing
At the end of an episode, update the values of multiple key neighbors so that they move toward the episode return, at rates determined by their distances to the written key (see the sketch below).
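One way to realize such a distance-weighted, multi-slot write, as an illustrative sketch; the Gaussian kernel, the number of neighbors, and the learning rate are assumptions.

```python
import numpy as np

def write_return(keys, values, new_key, episode_return, k=5, lr=0.5):
    """Move the values of the k nearest slots toward the episode return (illustrative).

    Each neighbor is updated at a rate that shrinks with its distance to the
    new key, so closer slots move faster toward the observed return.
    """
    dists = np.linalg.norm(keys - new_key, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] ** 2)        # closer slots get larger weights
    weights /= weights.sum() + 1e-8
    values[nearest] += lr * weights * (episode_return - values[nearest])
    return values
```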
Memory refining
• Refine the memory value at any step
• Bootstrapped update
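A minimal sketch of such a bootstrapped refinement, assuming a one-step target built from the current reward and the memory's own estimate of the next state's value; the discount and step size are assumptions.

```python
def refine(values, slot_idx, reward, next_value, gamma=0.99, alpha=0.1):
    """Bootstrapped refinement of one memory slot (illustrative sketch).

    The stored value moves toward the target r + gamma * V(next), so value
    information propagates between episodes instead of waiting for full returns.
    """
    target = reward + gamma * next_value
    values[slot_idx] += alpha * (target - values[slot_idx])
    return values
```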
Episodic value estimation via memory-based planning
• What is the value of taking action a from state s?
• The next observation is approximated by the trajectory representation that follows action a
• The value of that representation is queried from the memory
• The immediate reward r is estimated by a reward model (see the sketch below)
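Putting these pieces together, a hedged sketch of the one-step, memory-based value estimate; the rollout model, reward model, memory interface, and discount are assumed interfaces, not the paper's exact API.

```python
def episodic_q_value(tau, action, rollout_model, reward_model, memory, gamma=0.99):
    """Estimate Q(s, a) by one step of memory-based planning (illustrative).

    tau           : current trajectory representation
    rollout_model : predicts the next trajectory representation after `action` (assumed)
    reward_model  : predicts the immediate reward for (tau, action) (assumed)
    memory        : episodic memory exposing a read(key) value lookup (assumed)
    """
    next_tau = rollout_model(tau, action)   # approximate the post-action representation
    r_hat = reward_model(tau, action)       # estimated immediate reward
    return r_hat + gamma * memory.read(next_tau)
```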
MBEC agent in navigation tasks
Combining episodic and parametric value functions
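A hedged sketch of this dynamic combination; the paper's exact rule may differ, and here I assume a sigmoid gate conditioned on the trajectory representation that produces a convex mix of the two value estimates.

```python
import torch
import torch.nn as nn

class DynamicHybridValue(nn.Module):
    """Illustrative gating between episodic and parametric (DQN) value estimates."""

    def __init__(self, tau_dim):
        super().__init__()
        # The gate is conditioned on the current trajectory representation.
        self.gate = nn.Sequential(nn.Linear(tau_dim, 1), nn.Sigmoid())

    def forward(self, tau, q_dqn, q_episodic):
        # beta in (0, 1): how much to trust the episodic estimate for this trajectory.
        beta = self.gate(tau)
        return (1 - beta) * q_dqn + beta * q_episodic
```

Because the gate depends on the trajectory, the agent can lean on episodic values where they are reliable and fall back on the parametric estimate elsewhere, without manual tuning of a fixed weight.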
MBEC + DQN = MBEC++
MBEC++ in noisy classical control tasks
MBEC++ in POMDP and Atari tasks
Human normalized scores (mean/median) at
10 million frames for all and a subset of 25
games.
Key takeaways about our episodic memory
• Storing distributed trajectory representations produced by a trajectory model
• Memory-based planning with fast value-propagating memory writing and refining
• Dynamic consolidation of episodic values into the parametric value function
• Good results in:
  • Noisy environments
  • Atari games
  • POMDPs
Thank you
thai.le@deakin.edu.au
A²I²
Deakin University
Geelong Waurn Ponds Campus, Geelong, VIC 3220
Hung Le
