Memory for Lean
Reinforcement Learning
Presented by Hung Le
1
Background
2
What is Reinforcement Learning (RL)?
● Agent interacts with
environment
● S+A=>S’+R (MDP)
● Find a policy π(S) → A to
maximize return ∑R from the
environment
● Learn (action) value function:
○ V(s)
○ Q(s,a)
→ choose action that maximizes
the value (e.g. ε-greedy policy)
3
https://crl.causalai.net/
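To make the ε-greedy policy above concrete, here is a minimal sketch of ε-greedy action selection over a tabular Q(s, a); the table shape and the value of ε are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q is assumed to be a 2-D array of shape [n_states, n_actions] (tabular case).
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit: action maximizing Q(s, a)

# Usage: a toy table with 5 states and 3 actions
Q = np.zeros((5, 3))
action = epsilon_greedy_action(Q, state=0, epsilon=0.1)
```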
Classic (memory-less) RL algorithms
4
https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146
Q-learning
(temporal difference)
REINFORCE
(policy gradient)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms
for connectionist reinforcement learning." Machine learning 8, no. 3
(1992): 229-256.
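As a concrete companion to the Q-learning box, here is a minimal tabular temporal-difference update; the learning rate alpha and discount gamma are illustrative defaults (REINFORCE, the policy-gradient counterpart, is not sketched here).

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One temporal-difference update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)).

    Q is a numpy array of shape [n_states, n_actions].
    """
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```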
Challenge: the RL task can be complicated while
the reward is simple
● Task:
○ Agent searches for the key
○ Agent picks up the key
○ Agent opens the door to access the room
○ Agent finds the box in the room
● Reward:
○ If the agent reaches the box, get +1 reward
5
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated
policies from such a simple reward?
Short Answer: just learn from many trials (data)!
Chess
Self-driving
car
Video
games
Robotics
6
Example: RL agent plays video game
7
Game
simulation
Limitation of training with big data
● High cost
○ Training time (days to months)
○ GPU (dozens to hundreds)
● Requires simulators
○ Agents are trained in simulation (millions to billions
of steps)
○ The cost of one step of simulation can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration
○ Weird behaviors
○ Fail to generalize
8
Human vs RL Agents in Atari games
● Human:
○ A few hours of practice to reach
moderate performance
○ Don’t forget how to play old games
while learning new ones
○ Can play any game
● RL Agents:
○ 21 trillion hours of training to beat
humans (AlphaZero), equivalent to
11,500 years of human practice
○ Catastrophic forgetting when learning
games sequentially
○ Despite endless training, there are still
games they fail at
9
What is missing?
10
Memory as experiences
Memory for exploration
Memory to build context
Memory as experiences
11
What is memory?
● Memory is the ability to efficiently
store, retain and recall information
● Brain memory stores items, events
and high-level structures
● Computer memory stores data,
programs and temporary variables
12
What are examples of
memory in neural networks?
Memory in neural networks
13
● Long-term memory
○ Semantic memory: storing data in the neural network weights
○ Episodic memory: storing episodic events in matrix memory
● Short-term memory
○ Associative memory: key-value binding as in a Hopfield Network or Transformer layer
○ Working memory: matrix memory in memory-augmented neural networks
● Functional memory
○ Memory stores programs
○ Memory of models, mixture of experts, …
Properties of memories
14
Memory type | Lifespan | Plasticity | Example
1. Working memory | Short-term | Quick | 1 episode is one day; lasts for 1 day; memory is built instantly
2. Episodic memory | Long-term | Quick | Persists across the agent’s lifetime; lasts for several years; memory is built instantly
3. Semantic memory | Long-term | Slow | Persists across the agent’s lifetime; lasts for several years; memory takes time to build
Memory in RL
● The human brain implements RL
● Dopamine neurons reflect a
reward prediction error (TD
learning)
● What is the memory in the brain
that stores V?
○ A value table is not scalable
○ Maybe a value model → DQN
(semantic memory)
15
DQN: Replay buffer is an episodic memory
• Store experiences: (s,a,r,s’) tuple
• Read memory via replay sampling
• Memory content is used to train
Action-Value Network Q
16
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
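A minimal sketch of a DQN-style replay buffer storing (s, a, r, s', done) tuples and sampling random minibatches to train the action-value network; the capacity and batch size here are illustrative, not the values from the Nature DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Episodic memory for DQN: store transitions, sample them for training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```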
DQN’s memories are better than tables, but …
● Inefficiency:
○ Learning semantic memory (Q network) is
slow (gradient descent)
○ Optimize many parameters
● Bootstrap noise:
○ The target is the network’s output
○ If network is not well trained, the target is
noisy
● Replay buffer:
○ Stores raw observations
○ Needs many sampling iterations
17
Alternative: episodic control paradigm
(Diagram) Current experience, e.g. (St, At); memory read over stored (experience, final return) pairs; value; policy; environment; memory write.
● Episodic memory is a key-value memory
● Directly binds experiences to returns → refers to
experiences with high returns to make
decisions
Model-free episodic control: K-nearest neighbors
19
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory,
first-in-first-out
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
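A simplified sketch of the model-free episodic control idea: a fixed-size, per-action key-value memory binding state embeddings 𝜙(s) to observed returns, with values estimated by averaging the k nearest neighbours. This omits the exact-match, max-return update of the original paper; the capacity, k, and embedding function are illustrative assumptions.

```python
import numpy as np

class EpisodicController:
    """Per-action key-value memory: keys are embeddings phi(s), values are returns."""

    def __init__(self, n_actions, capacity=10_000, k=11):
        self.k = k
        self.capacity = capacity
        self.keys = [[] for _ in range(n_actions)]     # embeddings per action
        self.values = [[] for _ in range(n_actions)]   # returns per action

    def write(self, phi_s, action, episodic_return):
        keys, values = self.keys[action], self.values[action]
        keys.append(np.asarray(phi_s))
        values.append(episodic_return)
        if len(keys) > self.capacity:                  # first-in-first-out eviction
            keys.pop(0)
            values.pop(0)

    def estimate(self, phi_s, action):
        """Q_EC(s, a): average return of the k nearest stored neighbours."""
        keys, values = self.keys[action], self.values[action]
        if not keys:
            return 0.0
        dists = np.linalg.norm(np.stack(keys) - np.asarray(phi_s), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([values[i] for i in nearest]))
```

No gradient descent is needed for the value estimate itself, which is why this style of control is quick to learn from few episodes.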
Hybrid: episodic memory + DQN
20
Lin, Zichuan, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. "Episodic Memory Deep Q-Networks." IJCAI (2018).
Episodic
TD learning
Contribution
weight
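A hedged illustration of mixing an episodic return into the usual TD target through a contribution weight; the actual EMDQN loss differs (it adds an episodic regularizer term), so treat this only as a sketch of the "contribution weight" idea, not the paper's method.

```python
def hybrid_target(r, q_next_max, episodic_return, done, gamma=0.99, lam=0.5):
    """Mix a one-step TD target with an episodic-memory return.

    lam is the episodic contribution weight; in practice it is fixed by hand,
    which is one of the limitations discussed on the next slide.
    """
    td_target = r if done else r + gamma * q_next_max
    return (1.0 - lam) * td_target + lam * episodic_return
```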
Limitation of model-free episodic memory
● Near-deterministic assumption
○ Assumes a clean (near-deterministic) env.
○ Stores only the best return
● Sample inefficiency:
○ Stores state-action values, which demands experiencing all
actions to gain experience
● Fixed combination between episodic and
parametric values
○ The episodic contribution weight is unchanged across different
observations
○ Requires manual tuning of the weight
21
What if the state is partially
observable and the number of
actions is large?
Model-based episodic memory
● Learn a model of trajectories using
self-supervised training
○ Model=LSTM
○ Learn to reconstruct past state-action
given current trajectory and query
● The trained LSTM is used to
generate trajectory
representations
→ counterfactual trajectory
→ imagine actions
22
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
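A rough self-supervised trajectory-model sketch in the spirit of the description above: an LSTM encodes the trajectory, and a decoder reconstructs the state-action at a queried past time step. The architecture and dimensions are illustrative assumptions, not the model of the NeurIPS 2021 paper.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """LSTM trajectory encoder + decoder that reconstructs a queried past (state, action)."""

    def __init__(self, state_dim=8, action_dim=4, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + action_dim),
        )

    def forward(self, traj, query_t):
        # traj: [batch, T, state_dim + action_dim], query_t: [batch, 1] time index
        _, (h, _) = self.encoder(traj)            # h: [1, batch, hidden_dim]
        ctx = torch.cat([h[-1], query_t], dim=-1)
        return self.decoder(ctx)                  # reconstructed past [state, action]

# Self-supervised reconstruction loss against the true (state, action) at the queried step
model = TrajectoryModel()
traj = torch.randn(16, 10, 12)                    # 16 trajectories of length 10
query_t = torch.randint(0, 10, (16, 1)).float()
target = traj[torch.arange(16), query_t.squeeze(1).long()]
loss = nn.functional.mse_loss(model(traj, query_t), target)
```

The trained encoder's hidden states can then serve as trajectory representations for the episodic memory, which is what enables the counterfactual "imagined action" reads mentioned above.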
Sample efficiency test on Atari game
23
(Figure) Learning curves: Model-free (10M), Hybrid (40M), Model-based (10M), DQN (200M).
Episodic memory for hyperparameter
optimization
●RL is very sensitive to hyperparameters
●SOTA performance is achieved with extensive
hyperparameter tuning
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint
arXiv:1708.04133.
24
(Figure) Table of DQN hyperparameters, not to mention the neural network architecture.
Limitation of memory-less optimizer
● The optimizer has no context of training in the optimization process
● Treated as a stateless bandit or greedy optimization
○ Ignoring the context prevents the use of episodic experiences that can be critical in
optimization and planning
○ E.g. the hyperparameters that helped overcome a past local optimum in the loss
surface can be reused when the learning algorithm falls into a similar local optimum
25
How to build the context (memory key)?
Optimizing hyperparameters as episodic RL
● At each policy update, the hyper-agent:
○ Observes the training context – the hyper-state
○ Configures the RL algorithm with suitable hyperparameters ψ – the hyper-action
○ Trains the RL agent with ψ and observes the learning progress – the hyper-reward
● The goal of the Hyper-RL is the same as the main RL’s: to maximize the
return of the RL agent
○ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (hyper-return)
26
KEY: experienced hyper-state/action | VALUE: outcome hyper-returns
Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung
Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training."
AAAI (2022).
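A rough sketch of the KEY | VALUE idea above: an episodic memory that binds (hyper-state, hyper-action) features to observed hyper-returns and scores candidate hyperparameter settings by nearest-neighbour lookup. This only illustrates the episodic flavour; it is not the algorithm of the AAAI 2022 paper, and the feature encodings are assumptions.

```python
import numpy as np

class HyperEpisodicMemory:
    """Keys: concatenated (hyper-state, hyper-action) features. Values: hyper-returns."""

    def __init__(self, k=5):
        self.k = k
        self.keys, self.values = [], []

    def write(self, hyper_state, hyper_action, hyper_return):
        self.keys.append(np.concatenate([hyper_state, hyper_action]))
        self.values.append(hyper_return)

    def score(self, hyper_state, hyper_action):
        """Estimate the hyper-return of a candidate hyper-action by kNN lookup."""
        if not self.keys:
            return 0.0
        query = np.concatenate([hyper_state, hyper_action])
        dists = np.linalg.norm(np.stack(self.keys) - query, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.values[i] for i in nearest]))

    def select(self, hyper_state, candidates):
        """Pick the candidate hyperparameter setting with the best estimated hyper-return."""
        return max(candidates, key=lambda psi: self.score(hyper_state, np.asarray(psi)))
```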
27
Memory to build context
28
When the state is not enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state = position does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Memory of all past observations/actions
● Policy gradient
29
(Figure) Full map vs. observed state; RNN hidden state; RNN as policy model.
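A minimal recurrent-policy sketch: a GRU hidden state summarizes past observations and acts as the agent's working memory. Dimensions are illustrative assumptions, and training (e.g. with a policy-gradient loss) is not shown.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """RNN as policy model: the hidden state carries memory over past observations."""

    def __init__(self, obs_dim=16, n_actions=4, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, h=None):
        # obs_seq: [batch, T, obs_dim]; h carries the memory across calls
        out, h = self.rnn(obs_seq, h)
        logits = self.policy_head(out)            # action logits at every step
        return torch.distributions.Categorical(logits=logits), h

# One step of acting: feed the latest observation, keep the hidden state
policy = RecurrentPolicy()
dist, h = policy(torch.randn(1, 1, 16))
action = dist.sample()                            # trained with a policy-gradient loss
```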
Building better working memory for a better context
30
● External memory: longer-term,
store more
● Unsupervised training to learn
read-write operation
Wayne, Greg, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae et al.
"Unsupervised predictive memory in a goal-directed agent." arXiv preprint arXiv:1803.10760 (2018).
Relational memory to capture state relationships
31
Dot-product attention
Outer-product attention
Zambaldi, Vinicius, David Raposo, Adam Santoro, Victor Bapst, Yujia Li,
Igor Babuschkin, Karl Tuyls et al. "Deep reinforcement learning with
relational inductive biases." In ICLR 2018.
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive
associative memory." In ICML 2020.
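A minimal dot-product-attention read over memory slots, which is the basic operation behind relational memory; the outer-product attention of self-attentive associative memory is not shown, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, memory):
    """Relational read: dot-product attention of a query over memory slots.

    query:  [batch, d]
    memory: [batch, n_slots, d]
    Returns a weighted sum of memory slots (the retrieved relational context).
    """
    scores = torch.einsum('bd,bnd->bn', query, memory) / memory.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)           # attention over slots
    return torch.einsum('bn,bnd->bd', weights, memory)

# Example: 2 queries, each attending over 8 memory slots of dimension 32
context = dot_product_attention(torch.randn(2, 32), torch.randn(2, 8, 32))
```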
Suitable for multi-agent
or sparse POMDP
32
(Figure) Skip4 vs. No Skip.
Memory for exploration
33
Challenges of RL
● Rewards can be very sparse
○ RL agents cannot learn anything
until they collect the first reward
○ Explore forever?
● Sometimes the reward function in
complicated real-world problems is
unknown
○ Don’t have a simulator
○ Only known when we explore
○ Of course, if the agent explores long
enough, it will learn eventually
→ Sample inefficiency
34
Need exploration mechanisms to enable
sample efficiency!
35
Aubret, A., L. Matignon, and S. Hassas. "A survey on intrinsic motivation in reinforcement learning."
In the biological world, agents cope with
this problem very well
● Animals can travel long
distances until they find food
● Humans can navigate to
an address in a strange city
○ intrinsic motivation
○ curiosity, hunch
○ intrinsic reward
36
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Agents should be motivated towards
“interesting” consequences
● C: actor vs M: world model
● M predicts consequences of C
● As a result:
▪ If C’s actions result in repeated and boring
consequences → M predicts well
▪ C must explore novel consequences
● Memory:
▪ To learn the world model
▪ To know whether something is novel or old
37
https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
M: Forward model learns dynamics (no memory)
38
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. 2015
Novelty if prediction
error is high (intrinsic
reward)
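A minimal forward-model sketch where the intrinsic reward is the prediction error on the next state; the network sizes and the one-hot action encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Forward dynamics model: predict the next state from (state, action)."""

    def __init__(self, state_dim=16, action_dim=4, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1))

def intrinsic_reward(model, s, a_onehot, s_next):
    """High when the consequence of the action is poorly predicted (novel)."""
    with torch.no_grad():
        pred = model(s, a_onehot)
    return ((pred - s_next) ** 2).mean(dim=-1)    # per-sample prediction error
```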
When novelty as prediction error is useless
● The prediction target is stochastic
● Information necessary for the prediction is
missing
→ Both the totally predictable and the
fundamentally unpredictable will get boring
→Solution: Remember all experiences
● “Store” all observations, including
stochastic ones in working, semantic or
episodic memory
● Instead of predicting, try recalling from the
memory
39
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
Working memory: Store visited in-episode states
●Novelty through reachability:
▪ Boring if reachable from states in memory in less than
k steps
●Learn to classify: reachable or unreachable
○ Collect pairs of states from a trajectory
○ Create a label indicating whether one is reachable from the other
40
Savinov et al. "Episodic Curiosity through Reachability." In ICLR 2018.
High if unreachable
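A rough sketch of the reachability idea: a small comparator classifies whether two observation embeddings are within k steps of each other, and the episodic-curiosity bonus is high when the current embedding looks unreachable from everything in the in-episode memory. The aggregation rule, threshold, and sizes are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ReachabilityComparator(nn.Module):
    """Classify whether two observation embeddings are within k environment steps.

    Training pairs come from trajectories: close pairs -> label 1, far pairs -> label 0.
    """

    def __init__(self, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, e1, e2):
        # e1, e2: [N, emb_dim]
        return torch.sigmoid(self.net(torch.cat([e1, e2], dim=-1)))  # P(reachable)

def curiosity_bonus(comparator, current_emb, memory_embs, threshold=0.5):
    """current_emb: [1, emb_dim]; memory_embs: [N, emb_dim] of in-episode states.

    The bonus is high when the current state looks unreachable from the whole memory.
    """
    with torch.no_grad():
        expanded = current_emb.expand(memory_embs.shape[0], -1)
        probs = comparator(expanded, memory_embs)
    return 1.0 if probs.max().item() < threshold else 0.0
```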
Exploration with working memory is better
41
(Figure) Learning curves: no intrinsic reward; intrinsic reward via dynamics prediction; intrinsic reward via working memory.
Semantic memory: distillation into neural network weights
● Target network: randomly
transforms the state
● Predictor network: tries to
remember the transformed
state
○ A global memory
○ A random TV is not a problem
■ Remembers all noisy
channels
42
https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
Burda et al. Random Network Distillation: a new take on Curiosity-Driven Learning, In ICLR 2019
High intrinsic reward if the state cannot be distilled
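A minimal random-network-distillation sketch: a fixed random target network and a trained predictor; the intrinsic reward is the predictor's error, which stays high for rarely visited states. The network sizes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_net(obs_dim=16, emb_dim=32):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

# The random target is never trained; the predictor is trained to "distill" it.
target = make_net()
for p in target.parameters():
    p.requires_grad_(False)
predictor = make_net()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_intrinsic_reward(obs):
    """High when the predictor cannot yet distill the random features of obs."""
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1).detach()

def rnd_update(obs):
    """Train the predictor on visited observations (the 'semantic memory' write)."""
    loss = ((predictor(obs) - target(obs)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```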
Episodic memory: explore
from stored good states
● Archive: a memory of good states
(state, score) → sample one
● Purely random exploration from this
state → collect more states
● Update the archive
And many other tricks: imitation
learning, goal-based policy, …
43
Adrian Ecoffet et al.: First return, then explore. Nature 2021
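A rough sketch of the archive idea: a dictionary keyed by a (downsampled) cell representation, each entry storing the best-scoring state seen for that cell plus how to return to it. The cell function and the uniform selection rule here are illustrative placeholders, not the heuristics of the paper.

```python
import random

class Archive:
    """Go-Explore-style archive sketch: memory of promising states keyed by cell."""

    def __init__(self):
        self.cells = {}   # cell_key -> {"state": ..., "score": ..., "trajectory": ...}

    def update(self, cell_key, state, score, trajectory):
        """Keep only the best-scoring entry seen for each cell."""
        best = self.cells.get(cell_key)
        if best is None or score > best["score"]:
            self.cells[cell_key] = {"state": state, "score": score, "trajectory": trajectory}

    def sample(self):
        """Pick a stored good state to return to before exploring randomly from it."""
        return random.choice(list(self.cells.values()))
```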
Superhuman performance
44
(Figure) Episodic, semantic, and working memory; robotics and video games.
Conclusion
● Memory assists RL agents in many
forms:
○ Semantic
○ Working
○ Episodic
● And in many tasks:
○ Store experiences
○ Exploration
○ State representation
45
Thank you!
46
Our team at A2I2 is hiring!
Contact thai.le@deakin.edu.au for PhD scholarships.
