Memory in Autonomous Agents:
From Reinforcement Learning to Large Language Models
Presented by Dr. Hung Le
1
https://uni-sydney.zoom.us/j/89711610330
2
Why This Topic?
❑ Autonomous agents are emerging as AI systems
capable of perceiving, deciding, and acting
independently, offering powerful applications.
❑ From RL agents mastering chess and Atari
games to robots and LLMs tackling reasoning
and research tasks.
❑ Practicality issues: Training agents is hugely
costly, and performance degrades on
real-world challenges
❑ Memory for agents: Memory-augmented
systems can learn faster and achieve
unprecedented capability
Generated by DALL-E 3
3
About Me
❑ Presenter: Dr. Hung Le
❑ Our lab: A2I2, Deakin University
❑ Hung Le is a DECRA Fellow and a lecturer at Deakin
University, leading research on deep sequential models
and reinforcement learning
A2I2 Lab Foyer, Waurn Ponds
Background
4
What is Reinforcement Learning (RL)?
❏ Agent interacts with the environment
❏ (S, A) → (S′, R): a Markov Decision Process (MDP)
❏ The transition can be stochastic or
deterministic
❏ Find a policy π(S) → A that maximizes the
expected return E(∑R) from the
environment (see the sketch below)
5
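The loop above is easy to make concrete. Below is a minimal sketch, assuming a made-up two-state MDP and a uniform-random policy (neither is from the talk), showing how a policy π(S) → A interacts with an environment and how the return ∑R is accumulated and averaged to estimate E(∑R).

```python
# Minimal sketch of the agent-environment loop: the 2-state MDP, its rewards,
# and the uniform-random policy are illustrative assumptions, not from the talk.
import random

def toy_env_step(state, action):
    """Stochastic transition (S, A) -> (S', R): action 1 in state 0 usually pays off."""
    if state == 0 and action == 1:
        next_state = 1 if random.random() < 0.8 else 0
    else:
        next_state = 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """A uniform-random policy pi(S) -> A; RL would improve this to maximize return."""
    return random.choice([0, 1])

def rollout(horizon=10):
    """Run one episode and accumulate the return sum_t R_t."""
    state, episode_return = 0, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = toy_env_step(state, action)
        episode_return += reward
    return episode_return

# Monte-Carlo estimate of the expected return E(sum R) under this policy
print(sum(rollout() for _ in range(1000)) / 1000)
```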
Deep RL Agent: Value/Policy Are Neural Networks
6
Example: RL Agent Plays Video Game
7
Game
simulation
DQN
Atari Benchmark
8
Evaluation metric:
Normalised score = (model score − random-play score) / (human score − random-play score)
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
~60 games
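As a quick sanity check, the metric can be computed with a one-line helper; the scores plugged in below are made-up numbers for illustration, not results from the paper.

```python
def normalised_score(model_score, random_score, human_score):
    """Human-normalised score from Mnih et al. (2015): 0 = random play, 1 = human level."""
    return (model_score - random_score) / (human_score - random_score)

# Illustrative numbers only (not from the paper): model 400 vs. random 150 and human 300
print(normalised_score(400.0, 150.0, 300.0))   # ~1.67, i.e., above human level
```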
Limitation of RL Agents
● High cost
○ Training time (days to months)
○ GPUs (dozens to hundreds)
● Require simulators or big data
○ Agents are trained in simulation (millions to billions
of steps)
○ The cost for one step of simulation can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration, unethical actions
○ Weird behaviors, hallucination
○ Fail to generalize
● RL agents (DQN-based):
○ 21 trillion hours of training to beat humans
(AlphaZero), equivalent to 11,500 years of
human practice
9
LLM Agent: Value/Policy Are LLMs
10
Example: LLM Agents Generate Answers
11
https://cobusgreyling.substack.com/p/comparing-llm-agents-to-chains-differences
12
LLM Agents for Research
● Scientists interact with the agents
○ Seeding ideas for exploration (initial state)
○ Giving feedback (reward) in natural language
● The worker agents act:
○ Use web search + specialized AI models
○ Enhance grounding & hypothesis quality
● Supervisor agent role:
○ Parses goals into research plans
○ Assigns specialized agents to worker queue
○ Allocates compute resources
https://research.google/blog/accelerating-scientific-breakthroughs-with-
an-ai-co-scientist/
13
LLM Agents Are Much More Expensive
● Training cost is reaching $1 billion
● Inference cost for GPT-4-like LLMs:
$140–150 million/year
● LLM agent systems often use multiple
models → multiplying costs
● Same issues as RL agents:
○ Unsafe exploration, unethical actions
○ Weird behaviors, hallucination
○ Fail to generalize
https://www.linkedin.com/pulse/uncovering-hidden-costs-ai-queries-dr-ayman-al-rifaei-1kmsf?
What Is Missing?
14
What Is Memory?
❏ Memory is the ability to efficiently store,
retain and recall information
❏ Brain memory stores items, events and
high-level structures
❏ Computer memory stores data,
programs and temporary variables
15
Simple Taxonomy of Memories
16
Type (Lifespan / Plasticity / Example):
1. Working memory — short-term, quick plasticity
● Lasts for one episode (e.g., one episode is one day)
● Memory is built instantly
2. Episodic memory — long-term, quick plasticity
● Persists across the agent’s lifetime (several years)
● Memory is built instantly
3. Semantic memory — long-term, slow plasticity
● Persists across the agent’s lifetime (several years)
● Takes time to build
Episodic Memory for Agents
17
Episodic Control Paradigm
[Diagram: current experiences, e.g., (St, At), …, and their final returns are written to episodic memory; memory reads inform the value/policy acting in the environment.]
18
❏ Episodic memory is a key-value memory
❏ It directly binds experience to return → the agent refers
to past experiences with high returns to make
decisions
Model-free Episodic Control: K-nearest Neighbors
19
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory
First-in, first-out replacement
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
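A minimal sketch of the episodic-control idea on this slide, assuming a generic embedding function in place of the pretrained 𝜙, brute-force k-nearest-neighbour search, and a small FIFO buffer; it is not the paper's exact implementation.

```python
# Hedged sketch of model-free episodic control (Blundell et al., 2016), simplified:
# a plain dict per action, brute-force Euclidean k-NN, and a generic embed()
# standing in for the pretrained feature map phi.
import numpy as np

K = 11            # neighbours used for the value estimate
CAPACITY = 5000   # fixed memory size; oldest entries are evicted first (FIFO)

memory = {a: [] for a in range(4)}   # action -> list of (key, best return seen)

def embed(state):
    """Stand-in for the pretrained embedding phi(s); here a fixed random projection."""
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(len(state), 16))
    return np.asarray(state) @ proj

def write(state, action, episode_return):
    """Bind the experience to its observed return, keeping the max per key."""
    key, buf = embed(state), memory[action]
    for i, (k, r) in enumerate(buf):
        if np.allclose(k, key):
            buf[i] = (k, max(r, episode_return))
            return
    if len(buf) >= CAPACITY:
        buf.pop(0)                   # first-in, first-out eviction
    buf.append((key, episode_return))

def q_estimate(state, action):
    """Quick value estimate: average the returns of the K nearest stored keys."""
    key, buf = embed(state), memory[action]
    if not buf:
        return 0.0
    dists = [np.linalg.norm(k - key) for k, _ in buf]
    nearest = np.argsort(dists)[:K]
    return float(np.mean([buf[i][1] for i in nearest]))

write([0.1, 0.2, 0.3], action=2, episode_return=5.0)
print(q_estimate([0.1, 0.2, 0.31], action=2))   # ~5.0, with no parameter learning needed
```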
Sample Efficiency Test on Atari Games
20
[Plot legend: Model-free (10M), Hybrid (40M), Model-based (10M), DQN (200M)]
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
21
LLM Alignment During Inference
● “I need your help now” should be
steered to “I beseech thee, assist me this
very moment!”
● Episodic memory stores the LLM’s
input–output representations
● At test time, relevant output
representations are retrieved to
construct the local steering vector for
each token
Do et al. "Dynamic Steering With Episodic Memory For Large Language
Models." Findings of ACL 2025. https://aclanthology.org/2025.findings-acl.706/
Local steering
(episodic)
Global steering
(semantics)
Only need
10-100 samples!
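A hedged sketch of the retrieval step, not the exact method of Do et al. (2025): cached input representations act as keys, cached output representations as values, and a soft k-nearest-neighbour read builds the local steering vector added to each token's hidden state. The tensor names, `alpha`, and the temperature are illustrative assumptions.

```python
# Hedged sketch: build a per-token local steering vector from an episodic memory of
# cached (input representation -> output/steering representation) pairs.
import torch

def local_steering_vector(hidden, memory_keys, memory_values, k=8, tau=0.1):
    """hidden: (d,) current token representation; memory_keys/values: (N, d)."""
    sims = torch.nn.functional.cosine_similarity(memory_keys, hidden.unsqueeze(0), dim=-1)
    topk = sims.topk(k)
    weights = torch.softmax(topk.values / tau, dim=0)          # soft k-NN weights
    return (weights.unsqueeze(-1) * memory_values[topk.indices]).sum(dim=0)

def steer(hidden, memory_keys, memory_values, alpha=1.0):
    """Add the retrieved local steering direction to the token's hidden state."""
    return hidden + alpha * local_steering_vector(hidden, memory_keys, memory_values)

# Toy usage: 32 cached samples with 768-dimensional representations
keys, values = torch.randn(32, 768), torch.randn(32, 768)
print(steer(torch.randn(768), keys, values).shape)             # torch.Size([768])
```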
Working Memory for Agents
22
When the State Is Not Enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state=position, does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Keep a memory of all past
observations/actions
● Train the recurrent policy with policy gradient
(see the sketch below)
23
[Figure: the agent observes only part of the full map; an RNN hidden state summarizes past observations, and the RNN serves as the policy model.]
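A minimal sketch of the recurrent-policy idea, assuming a GRU whose hidden state serves as working memory over past observations; the sizes and the categorical action head are illustrative choices, not a specific agent from the talk.

```python
# Hedged sketch: an RNN policy for a partially observable environment; the GRU
# hidden state acts as working memory summarizing past observations.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, h0=None):
        """obs_seq: (batch, time, obs_dim) -> per-step action logits and last hidden state."""
        h_seq, h_last = self.gru(obs_seq, h0)
        return self.action_head(h_seq), h_last

policy = RecurrentPolicy()
obs = torch.randn(1, 5, 8)                                   # 5 partial observations
logits, h = policy(obs)
action = torch.distributions.Categorical(logits=logits[:, -1]).sample()
print(action)   # the logits (and hence the policy gradient) depend on the whole history
```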
Working Memory: Recurrence or Attention?
24
Ni, Tianwei, Benjamin Eysenbach, and Ruslan Salakhutdinov. "Recurrent Model-Free RL Can Be a
Strong Baseline for Many POMDPs." In International Conference on Machine Learning, pp.
16691-16723. PMLR, 2022
[Figure labels: Calibration, Update]
25
Stable Working Memory: Fast and Powerful
● Stable Hadamard Memory (SHM)
introduces a special calibration matrix Ct
(see the paper for its exact definition), where:
● θt: trainable parameters that are
randomly selected for each timestep
● vc(xt): a mapping function (e.g., a linear
transformation)
● This dynamic design keeps updates
stable by ensuring that the cumulative
product of calibration matrices is
bounded
Le, Hung, Kien Do, Dung Nguyen, Sunil Gupta, and Svetha Venkatesh. "Stable
Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement
Learning." ICLR, 2025.
Long-term credit
assignment
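The exact SHM equations are given in the figure and the paper; the sketch below only illustrates the general pattern of a Hadamard (element-wise) memory update of the form Mt = Ct ⊙ Mt−1 + Ut with a per-step calibration matrix built from randomly selected θt and a mapping vc(xt). That reading of the slide, and the specific form of Ct used here, are assumptions rather than the paper's definition.

```python
# Illustrative sketch only: a Hadamard-style memory update M_t = C_t * M_{t-1} + U_t
# with an input-dependent calibration matrix C_t. The way C_t is built below is a
# simplification and is NOT the exact SHM formulation from Le et al. (2025).
import torch
import torch.nn as nn

class HadamardMemory(nn.Module):
    def __init__(self, input_dim=16, mem_rows=8, mem_cols=16, n_thetas=32):
        super().__init__()
        self.thetas = nn.Parameter(torch.randn(n_thetas, mem_rows))  # pool of theta_t
        self.v_c = nn.Linear(input_dim, mem_cols)                    # mapping v_c(x_t)
        self.v_u = nn.Linear(input_dim, mem_rows * mem_cols)         # update term
        self.mem_shape = (mem_rows, mem_cols)

    def step(self, memory, x_t):
        # theta_t: trainable parameters randomly selected for this timestep
        theta_t = self.thetas[torch.randint(0, self.thetas.shape[0], (1,)).item()]
        # Calibration kept close to 1 so repeated element-wise products stay controlled
        c_t = 1.0 + torch.tanh(torch.outer(theta_t, self.v_c(x_t)))
        u_t = self.v_u(x_t).view(self.mem_shape)
        return c_t * memory + u_t          # Hadamard (element-wise) update

mem = torch.zeros(8, 16)
model = HadamardMemory()
for x in torch.randn(5, 16):               # a short sequence of inputs
    mem = model.step(mem, x)
print(mem.shape)                            # torch.Size([8, 16])
```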
Integrated Memory Systems
for Challenging Tasks
26
27
Classic RL with Hybrid Memory Systems
● Addressing challenging hard-exploration
problems:
○ Montezuma's Revenge
○ Noisy TV
● Working memory: novelty estimation within an
episode
● Episodic memory: novelty estimation across
episodes
● Semantic Memory: Surprise estimation via
prediction error
Noisy-TV: a random TV will distract the RL agent from its main
task due to high surprise (source).
28
Example: Memories for Novelty of Surprise
❑ The norm of the prediction error (the surprise norm) is
not a reliable novelty signal (e.g., it stays high in Noisy-TV)
❑ A new metric, surprise novelty: the error of
reconstructing the surprise (which is itself the error
of state prediction)
❑ This requires a surprise generator, such as a
dynamics model, to produce the surprise vector u,
i.e., the difference between the prediction and
reality
❑ Inter- and intra-episode novelty scores are then
estimated by a memory system called Surprise
Memory (SM), consisting of an autoencoder
network W and an episodic memory M, respectively
(sketched below)
Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond
Surprise: Improving Exploration Through Surprise Novelty." In AAMAS,
2024.
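A hedged sketch of the surprise-novelty computation: a dynamics model produces the surprise vector u as its prediction error, and an autoencoder W standing in for the Surprise Memory reconstructs u, with the reconstruction error serving as the novelty signal. The episodic-memory component M and the actual network sizes are omitted or simplified here.

```python
# Hedged sketch: surprise novelty = reconstruction error of the surprise vector u,
# where u is the prediction error of a dynamics model. The episodic-memory part of
# SM is omitted; only the autoencoder W is sketched, with illustrative sizes.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state; its error is the surprise vector u."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, state_dim))
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class SurpriseAutoencoder(nn.Module):
    """W: reconstructs familiar surprise patterns; high error => novel surprise."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.enc = nn.Linear(state_dim, 8)
        self.dec = nn.Linear(8, state_dim)
    def forward(self, u):
        return self.dec(torch.relu(self.enc(u)))

dyn, W = DynamicsModel(), SurpriseAutoencoder()
state, action, next_state = torch.randn(16), torch.randn(1), torch.randn(16)

u = next_state - dyn(state, action)          # surprise vector (prediction error)
surprise_norm = u.norm()                     # the weak baseline signal
surprise_novelty = (W(u) - u).norm()         # intrinsic signal used instead
print(surprise_norm.item(), surprise_novelty.item())
```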
29
LLM Agents Also Have a System of Memories
❏ Semantic memory: knowledge stored in the
LLM's weights or an external knowledge base
❏ Working memory: the context stored in the
prompt, accessed via the attention mechanism
❏ Episodic memory: an external memory module that
stores past experiences (see the sketch below)
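A schematic sketch of how the three memories could interact in a single agent step; `embed`, `call_llm`, and the toy distance function are hypothetical placeholders rather than components of any specific framework mentioned in the talk.

```python
# Hedged sketch: one agent step combining semantic memory (the LLM's weights),
# working memory (the prompt context), and episodic memory (an external store).
# embed() and call_llm() are hypothetical placeholders, not a real framework.
from typing import List, Tuple

episodic_store: List[Tuple[list, str]] = []   # (embedding, past experience text)

def embed(text: str) -> list:
    """Placeholder embedding; a real system would use a sentence encoder."""
    return [float(ord(c)) for c in text[:32]]

def call_llm(prompt: str) -> str:
    """Stub standing in for an actual LLM call (semantic memory lives in the weights)."""
    return "stub answer"

def retrieve(query: str, k: int = 3) -> List[str]:
    """Nearest-neighbour read from episodic memory (toy squared-distance ranking)."""
    q = embed(query)
    ranked = sorted(episodic_store,
                    key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[0], q)))
    return [text for _, text in ranked[:k]]

def agent_step(task: str) -> str:
    recalled = retrieve(task)                                     # episodic memory read
    prompt = "\n".join(["Past experiences:", *recalled,
                        "Task:", task])                           # working memory = context
    answer = call_llm(prompt)                                     # semantic memory = weights
    episodic_store.append((embed(task), f"{task} -> {answer}"))   # episodic memory write
    return answer

print(agent_step("Summarize yesterday's meeting notes"))
```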
30
Memory for Experience Recall (EXPEL)
1. Gather Experiences: Retry tasks with
past reflections to build a memory
pool.
2. Reflect & Improve: Learn from failed
and successful attempts.
3. Recall Examples: Retrieve similar past
trajectories for guidance.
4. Extract Insights: Generate high-level
lessons from successes/failures.
5. Task Inference: Use insights + retrieved
examples to improve reasoning.
Zhao, Andrew, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao
Huang. "ExpeL: LLM Agents Are Experiential Learners." In Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 38, no. 17, pp. 19632-19642. 2024.
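A hedged sketch of the recall step (steps 3-5 above): distilled insights plus the most similar past trajectories are packed into the prompt for a new task. The similarity measure, the example insights, and the trajectory format are illustrative assumptions, not ExpeL's implementation.

```python
# Hedged sketch of an ExpeL-style recall step: retrieve similar past trajectories
# and prepend distilled insights to the prompt for a new task.
from difflib import SequenceMatcher

insights = ["Check preconditions before acting.",
            "Prefer actions that worked in similar tasks."]
trajectory_pool = [("find the mug", "go to kitchen -> open cabinet -> SUCCESS"),
                   ("heat the soup", "use microwave without opening it -> FAIL")]

def similarity(a: str, b: str) -> float:
    """Toy text similarity standing in for an embedding-based retriever."""
    return SequenceMatcher(None, a, b).ratio()

def build_prompt(task: str, k: int = 1) -> str:
    # Recall the k most similar past trajectories for guidance
    recalled = sorted(trajectory_pool, key=lambda t: -similarity(task, t[0]))[:k]
    lines = ["Insights learned from past successes/failures:"]
    lines += [f"- {i}" for i in insights]
    lines += ["Similar past trajectories:"]
    lines += [f"- {t}: {traj}" for t, traj in recalled]
    lines += [f"New task: {task}", "Plan your actions step by step."]
    return "\n".join(lines)

print(build_prompt("find the cup"))
```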
31
Spatial Memory for Navigation
● The LLM considers the current set of
memories (R0:i) and the question (Q) to
generate a function call f and a query q
which retrieves m memories.
● Each memory contains position, time,
and caption information to be used as
further context.
● The LLM can make up to k queries per
iteration, retrieving up to k × m memories and
updating the context (see the sketch below)
● Check: Can the question be answered?
○ No: Repeat querying with updated
context
○ Yes: Summarize relevant info & generate
final answer
Anwar, Abrar, John Welsh, Joydeep Biswas, Soha Pouya, and Yan Chang.
"ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for
Robot Navigation." ICRA (2025).
32
Spatial Memory in Action
33
Memory for Learning to Reason
❏ Challenge: Tiny LLMs (≤1B parameters)
struggle with RL due to sparse rewards,
poor exploration, and frequent incorrect
outputs.
❏ Solution: Memory-R+, a
memory-augmented RL framework
inspired by human episodic memory,
storing both successes and failures.
❏ Mechanism: Intrinsic motivation guides
reasoning using nearest-neighbor
memory retrieval for success (exploit) and
failure (explore) patterns (sketched below)
Le, Hung, Dai Do, Dung Nguyen, and Svetha Venkatesh. "Reasoning Under 1
Billion: Memory-Augmented Reinforcement Learning for Large Language Models."
arXiv preprint arXiv:2504.02273 (2025).
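A hedged sketch of the intrinsic-motivation idea: a new reasoning trace is rewarded for being close to stored successes (exploit) and far from stored failures (explore), measured by k-nearest-neighbour distances over trace embeddings. The exact scoring and combination in Memory-R+ differ; the names and weights here are illustrative.

```python
# Hedged sketch: intrinsic motivation from episodic memories of successes and
# failures, using k-nearest-neighbour distances over reasoning-trace embeddings.
import numpy as np

success_memory: list = []   # embeddings of traces that led to correct answers
failure_memory: list = []   # embeddings of traces that failed

def knn_distance(query: np.ndarray, memory: list, k: int = 5) -> float:
    """Mean distance to the k nearest stored traces (1.0 if the memory is empty)."""
    if not memory:
        return 1.0
    dists = sorted(np.linalg.norm(np.stack(memory) - query, axis=1))
    return float(np.mean(dists[:k]))

def intrinsic_reward(trace_embedding: np.ndarray,
                     w_exploit: float = 1.0, w_explore: float = 1.0) -> float:
    exploit = -knn_distance(trace_embedding, success_memory)  # stay close to successes
    explore = knn_distance(trace_embedding, failure_memory)   # move away from failures
    return w_exploit * exploit + w_explore * explore

def store(trace_embedding: np.ndarray, succeeded: bool) -> None:
    (success_memory if succeeded else failure_memory).append(trace_embedding)

store(np.random.randn(32), succeeded=True)
print(intrinsic_reward(np.random.randn(32)))
```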
34
Emerging Capabilities for Tiny LLMs
35
Discussion and QA
● Memory is essential for agents, both classic RL
and LLM-based
● Three basic types of memory: working, episodic, and semantic
● Memory makes agents more efficient
and robust in challenging environments:
○ Long-horizon tasks
○ Multi-step reasoning
○ Noisy environments
Generated by DALL-E 3
