Presented by Dr. Hung Le
Memory-based
Reinforcement Learning
1
Background
2
What is Reinforcement Learning (RL)?
● Agent interacts with environment
● (S, A) ⇒ (S’, R) transitions (MDP)
● The transition can be stochastic or
deterministic
● Find a policy π(S) → A to maximize
expected return E(∑R) from the
environment
3
A grid-world example
● The state space is discrete. We have 6 states
corresponding to 6 locations in the map
● The action space is discrete. We have 4 actions
corresponding to 4 movements
● The reward can be “nothing”, “poison”, “1
cheese” or “3 cheese”. We can convert these to
scalars: 0, -1, 1, 3
● The transition in this case is deterministic,
corresponding to the outcome of the movements
(a minimal sketch of this MDP follows the slide)
○ It can be stochastic in other cases
○ E.g. at (0,0), moving left may result in
(0,1) or (1,1) with equal probability
4
https://huggingface.co/blog/deep-rl-q-part2
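Below is a minimal Python sketch of this grid-world as a deterministic MDP. Only the sizes (6 states, 4 actions) and the reward scalars (0, -1, 1, 3) come from the slide; the exact layout and reward placement are illustrative assumptions.

```python
# A minimal sketch of the 2x3 grid-world as a deterministic MDP.
# The layout and reward placement are illustrative assumptions; only the
# sizes (6 states, 4 actions) and reward scalars (0, -1, 1, 3) come from the slide.

N_ROWS, N_COLS = 2, 3
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Reward received on entering a cell (illustrative placement).
REWARDS = {
    (0, 0): 0, (0, 1): 0, (0, 2): 3,   # 3 cheese
    (1, 0): 0, (1, 1): -1, (1, 2): 1,  # poison, 1 cheese
}

def step(state, action):
    """Deterministic transition: move if inside the grid, otherwise stay put."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < N_ROWS and 0 <= c < N_COLS else state
    return next_state, REWARDS[next_state]

print(step((0, 0), "right"))   # -> ((0, 1), 0)
```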
Classic RL algorithms: Value learning
5
Q-learning
(temporal difference-TD)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms
for connectionist reinforcement learning." Machine learning 8, no. 3
(1992): 229-256.
● Basic idea: before finding the optimal
policy, we first learn the value function
● Learn the (action) value function:
○ V(s)
○ Q(s,a)
● V(s) = E(∑R from s)
● Q(s,a) = E(∑R from s,a)
● Given Q(s,a)
→ choose the action that maximises the
value (ε-greedy policy); see the tabular
sketch below
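The following is a minimal tabular Q-learning sketch with an ε-greedy policy. It reuses `step` and `ACTIONS` from the grid-world sketch above; the learning rate, discount factor and ε are illustrative values, not taken from the slides.

```python
import random
from collections import defaultdict

# A minimal sketch of tabular Q-learning (TD) with an epsilon-greedy policy,
# reusing `step` and ACTIONS from the grid-world sketch above.
# alpha, gamma and epsilon are illustrative hyperparameters.

Q = defaultdict(float)                  # the Q-table: a simple memory of (s, a) values
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def act(state):
    if random.random() < epsilon:                        # explore
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit: argmax_a Q(s, a)

def td_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One interaction step:
s = (0, 0)
a = act(s)
s_next, r = step(s, a)
td_update(s, a, r, s_next)
```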
Classic RL algorithms: Policy gradient
● Basic idea: directly optimise the policy as
a function of states
● Need to estimate the gradient of the
objective function E(∑R) w.r.t. the
parameters of the policy (see the
REINFORCE sketch below)
● Focuses on optimisation techniques
● No memory
6
REINFORCE
(policy gradient)
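Below is a minimal sketch of a REINFORCE update for one episode, using the gradient estimate ∇J ≈ ∑_t ∇log π(a_t|s_t) G_t. The network sizes and learning rate are illustrative assumptions.

```python
import torch
from torch.distributions import Categorical

# A minimal sketch of the REINFORCE gradient estimate for one episode:
# grad J ≈ sum_t grad log pi(a_t|s_t) * G_t, where G_t is the return from step t.
# The 4-dim state and 3 actions are arbitrary illustrative sizes.

policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """states: [T, 4] tensor, actions: [T] long tensor, rewards: list of T floats."""
    # Compute discounted returns G_t backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).sum()   # ascend E[sum R] by descending this loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```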
General RL algorithms
7
Do we have memory in value learning?
● The Q-table in value learning can be considered a memory
● It remembers “how good a state-action pair is on average”
● This memory is very basic, non-smooth and redundant
8
Challenges in RL: the optimal policy can be
complex
● Task:
○ Agent searches for the key
○ Agent picks up the key
○ Agent opens the door to access the
room
○ Agent finds the box in the room
● Reward:
○ If the agent reaches the box, it gets a +1
reward
9
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated
policies using the simple reward?
Short Answer: just learn from many trials (data)!
Chess
Self-driving
car
Video
games
Robotics
10
Deep RL: Value/Policy are neural networks
11
Example: RL agent plays video game
12
Game
simulation
DQN
Limitation of training with big data
● High cost
○ Training time (days to months)
○ GPUs (dozens to hundreds)
● Requires simulators
○ Agents are trained in simulation (millions to billions of steps)
○ The cost of one simulation step can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration
○ Weird behaviors
○ Fail to generalize
13
Human vs RL Agents in Atari games
● Human:
○ A few hours of practice to reach
moderate performance
○ Doesn’t forget how to play old games
while learning new ones
○ Can play any game
● RL Agents (DQN-based):
○ 21 trillion hours of training to beat
humans (AlphaZero), equivalent to
11,500 years of human practice
○ Catastrophic forgetting if learning
games sequentially
○ Despite extensive training, there are
still games they fail at
14
What is missing?
15
Taxonomy of memories
16
What is memory?
● Memory is the ability to efficiently store,
retain and recall information
● Brain memory stores items, events and
high-level structures
● Computer memory stores data,
programs and temporary variables
17
Memory in neural networks
18
Long-term memory
● Semantic memory: storing data in the neural
network weights
● Episodic memory: storing episodic events in
matrix memory

Short-term memory
● Associative memory: key-value binding as in
Hopfield Networks or a Transformer layer
● Working memory: matrix memory in
memory-augmented neural networks

Functional memory
● Memory that stores programs
● Memory of models, mixture of experts, …
Semantic memory
● A feed-forward neural network can be
viewed as a semantic memory
○ Data is stored in the weights of the
network via backpropagation
○ Data is read by forwarding the
input
○ It can be an associative memory as
well
● A table that stores statistics of the data can
also be a semantic memory (value
table)
19
y=Wx
Working Memory
● Recurrent neural networks
contain working memory (the hidden
state)
○ The hidden state captures
past inputs
○ The prediction is made
based on the hidden state
● Advanced versions of RNNs
○ GRU/LSTM
○ MANN
20
Episodic Memory
● Often implemented as a matrix or
table
● Can be a key-value memory
● Accessed via attention or analogy
search
● Supports neural networks in
making predictions
21
Properties of memories
22
Memory | Lifespan | Plasticity | Example
1. Working memory | Short-term | Quick | If one episode is one day, the memory lasts for one day; it is built instantly
2. Episodic memory | Long-term | Quick | Persists across the agent’s lifetime, lasting for several years; it is built instantly
3. Semantic memory | Long-term | Slow | Persists across the agent’s lifetime, lasting for several years; it takes time to build
How can it help RL?
23
Memory as experiences
24
Semantic Memory in RL
● The human brain implements RL
● Dopamine neurons reflect a reward
prediction error (TD learning)
● What is the memory in the brain that
stores V?
○ A value table is not scalable
○ Maybe a value model → DQN
(semantic memory)
25
DQN: Replay buffer is an episodic memory
• Stores experiences as (s, a, r, s’) tuples
• Memory is read via replay sampling
• Memory content is used to train the
Action-Value Network Q (a minimal buffer sketch follows)
26
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
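A minimal sketch of the replay buffer as an episodic memory is shown below: (s, a, r, s′) tuples are written after every step and read back by uniform sampling. The capacity, batch size, and the `td_loss` helper mentioned in the usage comment are illustrative assumptions.

```python
import random
from collections import deque

# A minimal sketch of DQN's replay buffer as an episodic memory:
# experiences (s, a, r, s') are written after each step and read back
# by uniform sampling to train the Q-network. Capacity and batch size
# are illustrative.

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def write(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def read(self, batch_size=32):
        """Uniformly sample a mini-batch of past experiences."""
        return random.sample(self.buffer, batch_size)

# Typical usage in the training loop (schematic):
#   buffer.write(s, a, r, s_next, done)
#   batch = buffer.read(32)
#   loss = td_loss(q_network, target_network, batch)   # hypothetical helper
```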
DQN’s memories are better than a Q-table, but …
● Inefficiency:
○ Learning the semantic memory (Q network)
is slow (gradient descent)
○ Many parameters to optimise
● Bootstrap noise:
○ The target is the network’s own output
○ If the network is not well trained, the target
is noisy
● Replay buffer:
○ Stores raw observations
○ Needs many sampling iterations
27
Alternative: episodic control paradigm
[Diagram: the environment yields the current experience, e.g. (St, At), …; experiences and their final
returns are written to the memory; memory reads inform the value and the policy]
28
● Episodic memory is a key-value memory
● It directly binds an experience to its return → the agent
refers to experiences that had high returns to make
decisions
Model-free episodic control: K-nearest neighbors
29
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory,
first-in-first-out
● No parameters to learn
(pretrained 𝜙)
● Quick value estimation (a minimal
kNN sketch follows)
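Below is a minimal sketch of MFEC-style value estimation: a per-action table binds state embeddings 𝜙(s) to observed returns, and a query is valued by averaging its k nearest stored neighbours. The buffer size, k, and the FIFO eviction details are illustrative; the original also keeps only the best return for an exactly matching key.

```python
import numpy as np

# A minimal sketch of model-free episodic control style value estimation:
# a table binds state embeddings phi(s) to returns, and unseen states are
# valued by averaging their k nearest stored neighbours.

class EpisodicMemory:
    def __init__(self, capacity=10_000, k=11):
        self.keys, self.values = [], []
        self.capacity, self.k = capacity, k

    def write(self, key, ret):
        self.keys.append(np.asarray(key))
        self.values.append(ret)
        if len(self.keys) > self.capacity:      # first-in-first-out eviction
            self.keys.pop(0)
            self.values.pop(0)

    def estimate(self, key):
        """Q estimate: average return of the k nearest stored keys."""
        if not self.keys:
            return 0.0
        dists = np.linalg.norm(np.stack(self.keys) - np.asarray(key), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.values[i] for i in nearest]))

# One memory per action; at decision time pick argmax_a memory[a].estimate(phi(s)).
```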
Hybrid: episodic memory + semantic memory (DQN)
30
Lin, Zichuan, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. "Episodic Memory Deep Q-Networks." IJCAI (2018).
Episodic
TD learning
Limitation of model-free episodic memory
● Near-deterministic assumption
○ Assumes a clean environment
○ Stores the best return
● Sample inefficiency:
○ Stores state-action values, which demands
experiencing every action to gain experience
● Fixed combination of episodic and parametric
values
○ The episodic contribution weight is unchanged for
different observations
○ Requires manual tuning of the weight
31
What if the state is partially
observable and the number of
actions is large?
Model-based episodic memory
● Learn a model of trajectories using
self-supervised training
○ Model = LSTM
○ Learn to reconstruct a past state-action pair
given the current trajectory and a query
● The trained LSTM is used to generate
trajectory representations
→ counterfactual trajectories
→ imagined actions
32
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
Discrete-action environment: Atari benchmark
33
Evaluation metric:
Normalised score = (model’s score − random-play
score) / (human score − random-play score)
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
~60 games
Sample efficiency test on Atari games
34
Model-free (10M)
Hybrid (40M)
Model-based (10M)
DQN (200M)
Memory to build context
35
When the state is not enough …
● Partially observable environments:
○ States do not contain all the
information required for the optimal action
○ E.g. state = position does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Keep a memory of all past
observations/actions (see the RNN policy sketch below)
36
[Figure: the full map vs. the observed state; the RNN hidden state summarises past observations and the
RNN serves as the policy model, trained with policy gradient]
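Below is a minimal sketch of an RNN policy for a partially observable environment: the LSTM hidden state acts as working memory over past observations, and actions are sampled from logits computed on that state. The observation size and number of actions are illustrative assumptions.

```python
import torch
from torch.distributions import Categorical

# A minimal sketch of an RNN policy: the LSTM hidden state is a working memory
# summarizing past observations, and the action is chosen from that memory
# rather than from the raw (partial) observation. Sizes are illustrative.

class RecurrentPolicy(torch.nn.Module):
    def __init__(self, obs_dim=16, hidden_dim=64, n_actions=4):
        super().__init__()
        self.rnn = torch.nn.LSTMCell(obs_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, state):
        h, c = self.rnn(obs, state)          # update working memory with the new observation
        logits = self.head(h)                # act on the memory, not the raw observation
        return logits, (h, c)

policy = RecurrentPolicy()
h = c = torch.zeros(1, 64)                   # empty memory at episode start
obs = torch.randn(1, 16)
logits, (h, c) = policy(obs, (h, c))
action = Categorical(logits=logits).sample()
```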
Building better working memory
for a better context
37
● External memory: longer-term,
stores more
● Unsupervised training to learn the
read-write operations
Wayne, Greg, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja,
Agnieszka Grabska-Barwinska, Jack Rae et al. "Unsupervised predictive
memory in a goal-directed agent." arXiv preprint arXiv:1803.10760 (2018).
Unsupervised training on the memory
38
It is useful for memory-based decision process
39
Benchmark: Navigation with Distraction
40
Hung, Chia-Chun, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi
Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. "Optimizing
agent behavior over long time scales by transporting value." Nature
communications 10, no. 1 (2019): 1-12.
Memory is critical for distracting observations
41
Memory with attention is beneficial
42
Break and QA
43
Memory for exploration
44
Exploration issue in RL
● Rewards can be very sparse
○ RL agents cannot learn anything until
they collect the first reward
○ Explore forever?
● Sometimes the reward function of a
complicated real-world problem is
unknown
○ No simulator available
○ Exploring freely in the real world is unsafe
→ Sample inefficiency
→ Need efficient exploration
Need exploration mechanisms
to enable sample efficiency!
46
Aubret, A., L. Matignon, and S. Hassas. "A survey on intrinsic motivation in reinforcement learning."
In the biological world, agents cope with this
problem very well
● Animals can travel long
distances until they find food
● Humans can navigate to an
address in a strange city
○ intrinsic motivation
○ curiosity, hunches
○ intrinsic reward
47
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Agents should be motivated towards
“interesting” consequences
● C: actor vs. M: world model
● M predicts the consequences of C’s actions
● As a result:
▪ If C’s actions result in repeated,
boring consequences → M predicts them
well
▪ C must explore novel
consequences
● Memory:
▪ To learn the world model
▪ To know whether something is novel or old
48
https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
M: Forward model learns the dynamics
(semantic memory)
49
Stadie, Levine, Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." 2015.
Novelty if the prediction error is high
(intrinsic reward); a minimal sketch follows
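A minimal sketch of this idea: a learned forward model predicts the next state embedding from (state, action), and its prediction error serves as the intrinsic reward. All dimensions and the optimiser settings are illustrative assumptions.

```python
import torch

# A minimal sketch of curiosity via a learned forward model (semantic memory):
# the model predicts the next state embedding from (state, action), and the
# prediction error is used as an intrinsic reward. Dimensions are illustrative.

state_dim, action_dim = 16, 4
forward_model = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, state_dim),
)
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward_and_update(s, a_onehot, s_next):
    """High prediction error => novel transition => high intrinsic reward."""
    pred = forward_model(torch.cat([s, a_onehot], dim=-1))
    error = ((pred - s_next) ** 2).mean()
    optimizer.zero_grad()
    error.backward()
    optimizer.step()                       # keep learning the dynamics
    return error.detach().item()
```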
When novelty as prediction error is useless
● The prediction target is stochastic
● Information necessary for the prediction
is missing
→ Both the totally predictable and the
fundamentally unpredictable get boring
→ Solution: remember all experiences
● “Store” all observations, including
stochastic ones, in working, semantic or
episodic memory
● Instead of predicting, try recalling from
the memory
50
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
Working memory: Store visited in-episode states
● Novelty through reachability:
▪ A state is boring if it is reachable from states in
memory in fewer than k steps
● Learn to classify: reachable or unreachable
○ Collect 2 states from a trajectory
○ Create a label indicating whether one is reachable
from the other (a minimal sketch follows)
51
Savinov et al. "Episodic Curiosity through Reachability." In ICLR 2018.
High if unreachable
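Below is a minimal sketch of reachability-based curiosity: an in-episode memory of state embeddings plus a comparator network (trained separately on reachable/unreachable pairs) that scores whether the current state is reachable from anything in memory. The sizes and threshold are illustrative assumptions.

```python
import torch

# A minimal sketch of novelty through reachability: an embedding memory of
# states visited this episode and a comparator network that classifies whether
# one state is reachable from another within k steps. The bonus is high when
# the current state is unreachable from everything in memory.

emb_dim = 32
comparator = torch.nn.Sequential(              # trained offline on (s_i, s_j, label) pairs
    torch.nn.Linear(2 * emb_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)
episode_memory = []                            # embeddings of in-episode states

def curiosity_bonus(state_emb, threshold=0.5):
    if episode_memory:
        pairs = torch.stack([torch.cat([m, state_emb]) for m in episode_memory])
        reachable = comparator(pairs).max()    # most reachable memory entry
        bonus = 1.0 if reachable < threshold else 0.0
    else:
        bonus = 1.0                            # empty memory: everything is novel
    episode_memory.append(state_emb.detach())  # store the visited state
    return bonus
```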
Exploration with working memory is better
52
Legend: no intrinsic reward | intrinsic reward via dynamics prediction | intrinsic reward via working memory
Deepmind’s Maze benchmark
53
Bad behavior
Good behavior
Semantic memory:
distillation into the neural network’s weights
● Target network: randomly
transforms the state
● Predictor network: tries to
remember the transformed
state
○ A global memory
○ A random TV is not a problem
■ It remembers all noisy
channels
(a minimal RND sketch follows this slide)
54
https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
Burda et al. Random Network Distillation: a new take on Curiosity-Driven Learning, In ICLR 2019
High if
cannot
distill
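A minimal sketch of Random Network Distillation follows: a fixed random target network transforms the state, a predictor network is trained to reproduce that transformation, and states not yet distilled into the predictor’s weights yield a high intrinsic reward. Sizes and the learning rate are illustrative assumptions.

```python
import torch

# A minimal sketch of Random Network Distillation (RND): a fixed, randomly
# initialized target network transforms the state, and a predictor network is
# trained to reproduce that transformation. States the predictor cannot yet
# "distill" yield high error, i.e. a high intrinsic reward. Sizes are illustrative.

obs_dim, feat_dim = 16, 32
target = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, feat_dim))
predictor = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                    # the target stays random and fixed
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus(obs):
    error = ((predictor(obs) - target(obs)) ** 2).mean()
    optimizer.zero_grad()
    error.backward()
    optimizer.step()                           # distill the observation into the weights
    return error.detach().item()               # high if not yet distilled => novel
```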
Episodic memory: explore from a
stored good state
● Archive: a memory of good states
(state, score) → sample one
● Purely random exploration from this
state → collect more states
● Update the archive
Plus many other tricks: imitation learning,
goal-based policies, …
55
Adrian Ecoffet et al.: First return, then explore. Nature 2021
Atari game: Superhuman performance
56
Episodic
Semantic
Working
Montezuma’s Revenge benchmark
Memory for optimisation
57
Episodic memory for hyperparameter optimisation
●RL is very sensitive to hyperparameters
●SOTA performance is achieved with extensive
hyperparameter tuning
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint
arXiv:1708.04133.
58
DQN
The hyperparameter
space is enormous!
Limitation of memory-less optimiser
● The optimiser has no context of the training process
● Hyperparameter tuning is treated as a stateless bandit or greedy optimisation
○ Ignoring the context prevents the use of episodic experiences that can be
critical in optimisation and planning
○ E.g. the hyperparameters that helped overcome a past local optimum in
the loss surface can be reused when the learning algorithm falls into a
similar local optimum
59
How to build the context (the key in
key-value memory)?
Optimising hyperparameter as episodic RL
● At each policy update, the hyper-agent:
○ Observes the training context – the hyper-state
○ Configures the RL algorithm with suitable hyperparameters ψ – the hyper-action
○ Trains the RL agent with ψ and observes the learning progress – the hyper-reward
● The goal of the Hyper-RL is the same as the main RL’s: to maximize the return of the RL agent
○ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward
(hyper-return)
60
KEY: experienced hyper-state/action | VALUE: outcome (hyper-returns)
Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung
Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training."
AAAI (2022).
Hyper-state representation learning
● Compress the parameters/gradients into a vector hyper-state s
● A VAE learns to reconstruct s
● The latent vector is the hyper-state representation (a minimal sketch follows)
61
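Below is a minimal sketch of such a hyper-state VAE: the compressed parameter/gradient vector s is reconstructed by the VAE, and the latent mean is used as the hyper-state representation. All sizes are illustrative assumptions, not the architecture used in the paper.

```python
import torch

# A minimal sketch of hyper-state representation learning with a VAE:
# the compressed parameter/gradient vector s is reconstructed, and the latent
# mean serves as the hyper-state representation. Sizes are illustrative.

class HyperStateVAE(torch.nn.Module):
    def __init__(self, in_dim=256, latent_dim=16):
        super().__init__()
        self.enc = torch.nn.Linear(in_dim, 64)
        self.mu = torch.nn.Linear(64, latent_dim)
        self.logvar = torch.nn.Linear(64, latent_dim)
        self.dec = torch.nn.Sequential(torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(),
                                       torch.nn.Linear(64, in_dim))

    def forward(self, s):
        h = torch.relu(self.enc(s))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1).mean()
        loss = ((recon - s) ** 2).mean() + kl
        return mu, loss       # mu = hyper-state representation; loss trains the VAE
```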
Continuous-action environment:
Mujoco benchmark
62
Metric: a positive reward is allocated based on the
distance moved forward and a negative reward
for moving backward.
63
Policy gradient optimisation
64
Issues with naïve Policy Gradient
● High variance and unstable updates
● The gradient may not accurately reflect the
policy gain when the policy changes
substantially
Trust-region optimization is a solution
● The new policy should stay inside a small
trust region around the last sampling policy
(old policy)
● Bound KL(π_θ_old ‖ π_θ_new) (TRPO,
PPO, …); a minimal clipped-surrogate sketch follows
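As an illustration of the trust-region idea, below is a minimal sketch of PPO’s clipped surrogate objective, which keeps the new policy close to the old sampling policy by clipping the probability ratio; ε is an illustrative value.

```python
import torch

# A minimal sketch of a trust-region-style update via PPO's clipped surrogate:
# the probability ratio between new and old policies is clipped so the new
# policy stays close to the sampling (old) policy. epsilon is illustrative.

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Pessimistic bound: take the minimum of clipped and unclipped objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Usage: advantages and old_log_probs come from rollouts under the old policy;
# new_log_probs are recomputed with the current parameters each epoch.
```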
65
What is wrong with these trust-region methods?
66
When the old policy is bad
● Bounding keeps the new policy stuck in the same
local optimum as the old policy
● Relying on one old policy is not enough
→ Need to store many past policies and rely on all of
them
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, and Svetha
Venkatesh. "Memory-Constrained Policy Optimization." NeurIPS (2022).
67
Use two trust regions instead of one
Backup Trust Region from Virtual Policy
PG Objective
68
Memory of policy networks
- Build a memory of past policies. Choose 𝜓 from the policy memory via attention
(a minimal sketch follows this slide)
- fϕ is a neural network parameterized by ϕ that outputs softmax attention
weights over the M past policies
- v is a “context” vector capturing different relations among θ, θold and ψ
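Below is a minimal sketch of attention over a memory of past policies: a small network f_ϕ maps a context vector v to softmax weights over the M stored policies, and the virtual policy ψ is their weighted combination. All sizes are illustrative and this is only a schematic of the idea, not the paper’s implementation.

```python
import torch

# A minimal sketch of choosing a virtual policy psi from a memory of M past
# policies via softmax attention: f_phi maps a context vector v to attention
# weights, and psi is the weighted combination of stored policy parameters.
# All sizes are illustrative.

M, param_dim, ctx_dim = 5, 128, 32
policy_memory = torch.randn(M, param_dim)            # flattened past policy parameters
f_phi = torch.nn.Sequential(torch.nn.Linear(ctx_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, M))

def virtual_policy(v):
    weights = torch.softmax(f_phi(v), dim=-1)         # attention over the M past policies
    psi = weights @ policy_memory                     # weighted mixture of stored policies
    return psi                                        # used to build the backup trust region

psi = virtual_policy(torch.randn(ctx_dim))
```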
69
[Figures: final performance on Atari and Mujoco]
Conclusion
70
In summary
● Memory assists RL agents in many forms:
○ Semantic
○ Working
○ Episodic
● And in many tasks:
○ Store experiences
○ Exploration
○ State representation
○ Optimisation (hyperparameters, policy)
71
What’s next? Life-long memory
● So far, the memory lifespan is restricted to an episode (working memory) or a
task (episodic or semantic memory)
● A real memory would span across tasks and domains:
○ Playing 60 Atari games in a row
○ Learning Mujoco and then learning Atari
● This requires a new kind of memory that supports different representations from
different scenarios
● The number of events and the amount of information are large
○ Efficient memory access mechanisms
○ Effective memory selection
72
What’s next? Dynamic memory
● Current memory is of fixed size (table, matrix, neural network)
○ It is not enough when the observations are dense
○ It is redundant when the observations are sparse
● Can we build a dynamic memory that automatically grows and shrinks depending
on the context?
○ Memory reads and writes would be more precise
○ No noise would be stored in the memory
73
What’s next? Hierarchical memory
● Current memory models are generally flat, supporting single-step access
● To remember details, several steps of recall are needed:
○ A coarse-grained chunk of steps
○ A specific step in the chunk
● Remember different timescales
○ Events from recent timesteps
○ Events from a distant episode
74
What’s next? Abstract memory
● Current memory models store specific events, states, actions or
representations of them
● To excel in diverse tasks, it is critical to capture abstract concepts:
○ Goals (e.g. use the red key to open the red door)
○ Relationships (e.g. climbing the ladder and picking up the key are required to
pass the level)
○ High-level objects (e.g. anything that blocks the door is an obstacle)
● It is unclear how artificial memory can store these complex concepts
75
What’s next? Complementary learning system
● A system of multiple kinds of memory
● The memories communicate and transfer knowledge:
○ Episodic memory distills events into semantic knowledge
○ Working memory distills temporary information into long-term memory
● How to design an efficient and biologically plausible system of memory is an
open problem
76
What’s next? Other testbeds for memory
● Continual RL
● Meta-RL
● Few-shot RL
77
Demo and QA
https://github.com/thaihungle/AJCAI22-Tutorial
78
Our team at A2I2 is hiring!
Contact thai.le@deakin.edu.au for PhD scholarships.
79