Model-Based Episodic Memory
1. Model-Based Episodic Memory Induces Dynamic Hybrid Controls
Authors: Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, Svetha Venkatesh
Presented by Hung Le
2. Reinforcement learning
Image source: Wikipedia
1. Model-based RL
2. Model-free RL
3. Episodic RL (the third way: episodic control)
Episodic memory (hippocampus):
• Stores instances of experiences
• Fast learning
• Heuristic/suboptimal
Questions that episodic memory can answer:
What did you have for breakfast this morning?
Which action did the agent take that resulted in a high return?
3. Typical episodic control paradigm
(Diagram: the current experience reads from a memory storing experiences and their returns; the retrieved value informs the policy, which acts in the environment, and new experiences are written back to memory.)
• Key-value episodic memory (a minimal sketch follows below)
• Key = the experience, which can be anything from a single state to the whole trajectory
• Value = the return or an estimated value
Image Source: Sutton & Barto Book: Reinforcement Learning: An Introduction
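To make this paradigm concrete, below is a minimal sketch of a kNN key-value episodic memory in the typical (MFEC/NEC-style) design: reads average the values of the k nearest keys, and writes keep the best return seen for a key. The class name, capacity handling, and thresholds are illustrative assumptions, not the implementation of any specific paper.

```python
import numpy as np

class EpisodicMemory:
    """Minimal kNN key-value episodic memory (typical MFEC/NEC-style scheme)."""

    def __init__(self, key_dim, capacity=10000, k=5):
        self.keys = np.zeros((capacity, key_dim), dtype=np.float32)
        self.values = np.zeros(capacity, dtype=np.float32)
        self.size, self.capacity, self.k = 0, capacity, k

    def write(self, key, episodic_return):
        # Typical scheme: one slot per key, keep the best (max) return seen.
        if self.size > 0:
            d = np.linalg.norm(self.keys[:self.size] - key, axis=1)
            i = int(d.argmin())
            if d[i] < 1e-6:  # key already stored: max-update its value
                self.values[i] = max(self.values[i], episodic_return)
                return
        i = self.size if self.size < self.capacity else np.random.randint(self.capacity)
        self.keys[i], self.values[i] = key, episodic_return
        self.size = min(self.size + 1, self.capacity)

    def read(self, key):
        # Value estimate: mean return of the k nearest stored experiences.
        if self.size == 0:
            return 0.0
        d = np.linalg.norm(self.keys[:self.size] - key, axis=1)
        neighbors = d.argsort()[:self.k]
        return float(self.values[neighbors].mean())
```

The limitations on slide 5 refer to exactly these design choices: the max-update write and the one-slot-at-a-time update.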
4. Hybrid design of episodic and model-free RL
(Complementary learning systems)
(Diagram: an episodic memory module provides rapid learning; it is updated alongside a parametric learner and contributes to action selection.)
Image source: internet, Neural Episodic Control
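Typical hybrid designs combine the two value estimates with a fixed weight; a one-line sketch, where the constant lam and the linear form are assumptions:

```python
def hybrid_q(q_parametric, q_episodic, lam=0.3):
    # Fixed linear combination: the same weight lam for every observation.
    return (1.0 - lam) * q_parametric + lam * q_episodic
```

Slide 5 criticizes precisely this fixed weight; slide 6 replaces it with a learned, trajectory-conditioned one.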
5. Limitations
• Near-deterministic assumption: only the best return is stored
• Sample inefficiency: storing state-action values demands experiencing all actions to make reliable decisions, and updating one memory slot at a time slows value propagation
• Fixed combination of episodic and parametric values: the episodic contribution weight is the same for every observation and requires manual tuning
6. Our contribution
• Episodic memory of trajectory values
Store trajectory representations instead of states, to handle noisy and partially observable (POMDP) settings
• Memory-based value estimation mechanism
Memory read: mix the average and max return of nearest neighbors for balance
Memory write: weighted-averaging writes to multiple slots
Memory refine: bootstrapped memory updates to hasten value propagation
• Dynamic hybrid control (sketched below)
A neural network learns to weight the episodic value against the DQN value, conditioned on the current trajectory
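A minimal sketch of the dynamic hybrid control, assuming a small sigmoid-gated MLP; layer sizes are illustrative and not taken from the paper:

```python
import torch
import torch.nn as nn

class DynamicHybrid(nn.Module):
    """Gate network: maps the trajectory representation tau to a weight
    beta in (0, 1) that mixes the episodic value with the DQN value."""

    def __init__(self, tau_dim, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(tau_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, tau, q_episodic, q_dqn):
        beta = self.gate(tau)  # weight conditioned on the current trajectory
        return beta * q_episodic + (1.0 - beta) * q_dqn
```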
7. Trajectory representation learning
• The trajectory model is an LSTM
• Its hidden state τ is the trajectory representation
• Self-supervised learning: recall a past event given its preceding event as the query (reconstruction loss)
• Two trajectories that share more common transitions are closer in the representation space
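A sketch of the trajectory model under the stated assumptions: an LSTM whose final hidden state serves as τ, trained self-supervised to reconstruct the event at a past position from its preceding event. The dimensions and the concatenation-based decoder are illustrative.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """LSTM trajectory encoder; hidden state tau represents the trajectory."""

    def __init__(self, obs_dim, tau_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, tau_dim, batch_first=True)
        self.decoder = nn.Linear(tau_dim + obs_dim, obs_dim)

    def forward(self, events):
        # events: (batch, T, obs_dim); tau summarizes the whole trajectory.
        _, (h, _) = self.lstm(events)
        return h[-1]  # tau, shape (batch, tau_dim)

    def recall_loss(self, events, t):
        # Query with the event preceding position t; recall the event at t.
        tau = self.forward(events)
        query, target = events[:, t - 1], events[:, t]
        recalled = self.decoder(torch.cat([tau, query], dim=-1))
        return nn.functional.mse_loss(recalled, target)
```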
8. Memory reading
• (a) Average over neighbors: pessimistic
• (b) Max over neighbors: optimistic
• Randomly select (a) or (b) with probability p (sketched below)
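A sketch of this read rule, assuming Euclidean kNN and inverse-distance weights for the average branch:

```python
import numpy as np

def memory_read(keys, values, query, k=5, p=0.5, rng=np.random):
    """With probability p return the max (optimistic) neighbor value,
    otherwise a distance-weighted average (pessimistic)."""
    d = np.linalg.norm(keys - query, axis=1)
    neighbors = d.argsort()[:k]
    if rng.random() < p:
        return float(values[neighbors].max())  # (b) optimistic
    w = 1.0 / (d[neighbors] + 1e-8)            # closer neighbors count more
    return float(np.average(values[neighbors], weights=w))
```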
9. Memory writing
At the end of an episode, update the values of multiple key neighbors so that they move toward the episodic return, at speeds determined by their distances to the written key (sketched below).
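A sketch of the write rule, assuming an exponential distance kernel so that nearer slots move toward the return faster:

```python
import numpy as np

def memory_write(keys, values, key, episodic_return, k=5, lr=0.5):
    """Move the values of several nearest neighbors toward the observed
    return, with step sizes that shrink with distance to the written key."""
    d = np.linalg.norm(keys - key, axis=1)
    neighbors = d.argsort()[:k]
    w = np.exp(-d[neighbors])        # nearer slots update faster
    w /= w.sum() + 1e-8
    values[neighbors] += lr * w * (episodic_return - values[neighbors])
```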
11. Episodic value estimation via memory-based planning
• What is the value of taking action a from state s?
• The next observation is approximated by the trajectory representation after following action a
• The value of the resulting trajectory is queried from the memory
• The current reward r is estimated by a learned reward model (sketched below)
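A sketch of the one-step computation Q(s, a) ≈ r̂(τ, a) + γ · M(τ′), where τ′ is the trajectory representation rolled forward under action a; the callable interfaces and γ are assumptions:

```python
from typing import Callable, Sequence
import numpy as np

def plan_value(
    step: Callable[[np.ndarray, int], np.ndarray],  # (tau, a) -> tau': rollout
    reward: Callable[[np.ndarray, int], float],     # learned reward model r_hat
    mem_read: Callable[[np.ndarray], float],        # episodic memory lookup M
    tau: np.ndarray,
    actions: Sequence[int],
    gamma: float = 0.99,
) -> np.ndarray:
    """Q(s, a) ~ r_hat(tau, a) + gamma * M(tau') for each candidate action."""
    return np.array([reward(tau, a) + gamma * mem_read(step(tau, a))
                     for a in actions])
```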
16. MBEC++ in POMDP and Atari tasks
Human-normalized scores (mean/median) at 10 million frames, over all games and over a subset of 25 games.
17. Key takeaways about our episodic memory
• Storing distributed trajectory representations produced by a trajectory model
• Memory-based planning, with memory writing and refining that propagate values fast
• Dynamic consolidation of episodic values into the parametric value function
• Good results in:
• Noisy environments
• Atari games
• POMDPs