Memory in Autonomous Agents:
From Reinforcement Learning to Large Language Models
Presented by Dr. Hung Le
1
https://uni-sydney.zoom.us/j/89711610330
2
Why This Topic?
❑ Autonomous agents are emerging as AI systems
capable of perceiving, deciding, and acting
independently, offering powerful applications.
❑ From RL agents mastering chess and Atari
games to robots and LLMs tackling reasoning
and research tasks.
❑ Practicality issues: Training agents is hugely
costly, and performance degrades on
real-world challenges
❑ Memory for agents: Memory-augmented
systems can learn faster and achieve
unprecedented capability
Generated by DALL-E 3
3
About Me
❑ Presenter: Dr. Hung Le
❑ Our lab: A2I2, Deakin University
❑ Hung Le is a DECRA Fellow and a lecturer at Deakin
University, leading research on deep sequential models
and reinforcement learning
A2I2 Lab Foyer, Waurn Ponds
Background
4
What is Reinforcement Learning (RL)?
❏ Agent interacts with the environment
❏ (S, A) → (S′, R): a Markov Decision Process (MDP)
❏ The transition can be stochastic or
deterministic
❏ Find a policy π(S) → A that maximizes the
expected return E(∑R) from the
environment (see the sketch below)
5
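The loop above is easy to make concrete. Below is a minimal sketch, assuming a made-up two-state MDP and a uniform-random policy (neither is from the talk), showing how a policy π(S) → A interacts with an environment and how the return ∑R is accumulated and averaged to estimate E(∑R).

```python
# Minimal sketch of the agent-environment loop: the 2-state MDP, its rewards,
# and the uniform-random policy are illustrative assumptions, not from the talk.
import random

def toy_env_step(state, action):
    """Stochastic transition (S, A) -> (S', R): action 1 in state 0 usually pays off."""
    if state == 0 and action == 1:
        next_state = 1 if random.random() < 0.8 else 0
    else:
        next_state = 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """A uniform-random policy pi(S) -> A; RL would improve this to maximize return."""
    return random.choice([0, 1])

def rollout(horizon=10):
    """Run one episode and accumulate the return sum_t R_t."""
    state, episode_return = 0, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = toy_env_step(state, action)
        episode_return += reward
    return episode_return

# Monte-Carlo estimate of the expected return E(sum R) under this policy
print(sum(rollout() for _ in range(1000)) / 1000)
```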
Deep RL Agent: Value/Policy Are Neural Networks
6
Example: RL Agent Plays Video Game
7
Game
simulation
DQN
Atari Benchmark
8
Evaluation metric:
Normalised score = (model score − random-play score) / (human score − random-play score)
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
~60 games
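As a quick sanity check, the metric can be computed with a one-line helper; the scores plugged in below are made-up numbers for illustration, not results from the paper.

```python
def normalised_score(model_score, random_score, human_score):
    """Human-normalised score from Mnih et al. (2015): 0 = random play, 1 = human level."""
    return (model_score - random_score) / (human_score - random_score)

# Illustrative numbers only (not from the paper): model 400 vs. random 150 and human 300
print(normalised_score(400.0, 150.0, 300.0))   # ~1.67, i.e., above human level
```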
Limitation of RL Agents
● High cost
○ Training time (days to months)
○ GPUs (dozens to hundreds)
● Require simulators or big data
○ Agents are trained in simulation (millions to billions
of steps)
○ The cost for one step of simulation can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration, unethical actions
○ Weird behaviors, hallucination
○ Fail to generalize
● RL agents (DQN-based):
○ 21 trillion hours of training to beat humans
(AlphaZero), equivalent to 11,500 years of
human practice
9
LLM Agent: Value/Policy Are LLMs
10
Example: LLM Agents Generate Answers
11
https://cobusgreyling.substack.com/p/comparing-llm-agents-to-chains-differences
12
LLM Agents for Research
● Scientists interact with the agents
○ Seeding ideas for exploration (initial state)
○ Giving feedback (reward) in natural language
● The worker agents act:
○ Use web search + specialized AI models
○ Enhance grounding & hypothesis quality
● Supervisor agent role:
○ Parses goals into research plans
○ Assigns specialized agents to worker queue
○ Allocates compute resources
https://research.google/blog/accelerating-scientific-breakthroughs-with-
an-ai-co-scientist/
13
LLM Agents Are Much More Expensive
● Training cost is reaching $1 billion
● Inference cost for GPT-4-like LLMs:
$140–150 million/year
● LLM agent systems often use multiple
models → multiplying costs
● Same issues as RL agents:
○ Unsafe exploration, unethical actions
○ Weird behaviors, hallucination
○ Fail to generalize
https://www.linkedin.com/pulse/uncovering-hidden-costs-ai-queries-dr-ayman-al-rifaei-1kmsf?
What Is Missing?
14
What Is Memory?
❏ Memory is the ability to efficiently store,
retain and recall information
❏ Brain memory stores items, events and
high-level structures
❏ Computer memory stores data,
programs and temporary variables
15
Simple Taxonomy of Memories
16
Type (Lifespan / Plasticity / Example):
1. Working memory — short-term, quick plasticity
● Lasts for one episode (e.g., one episode is one day)
● Memory is built instantly
2. Episodic memory — long-term, quick plasticity
● Persists across the agent’s lifetime (several years)
● Memory is built instantly
3. Semantic memory — long-term, slow plasticity
● Persists across the agent’s lifetime (several years)
● Takes time to build
Episodic Memory for Agents
17
Episodic Control Paradigm
[Diagram: current experiences, e.g., (St, At), …, and their final returns are written to episodic memory; memory reads inform the value/policy acting in the environment.]
18
❏ Episodic memory is a key-value memory
❏ It directly binds experience to return → the agent refers
to past experiences with high returns to make
decisions
Model-free Episodic Control: K-nearest Neighbors
19
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory
First-in, first-out replacement
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
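A minimal sketch of the episodic-control idea on this slide, assuming a generic embedding function in place of the pretrained 𝜙, brute-force k-nearest-neighbour search, and a small FIFO buffer; it is not the paper's exact implementation.

```python
# Hedged sketch of model-free episodic control (Blundell et al., 2016), simplified:
# a plain dict per action, brute-force Euclidean k-NN, and a generic embed()
# standing in for the pretrained feature map phi.
import numpy as np

K = 11            # neighbours used for the value estimate
CAPACITY = 5000   # fixed memory size; oldest entries are evicted first (FIFO)

memory = {a: [] for a in range(4)}   # action -> list of (key, best return seen)

def embed(state):
    """Stand-in for the pretrained embedding phi(s); here a fixed random projection."""
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(len(state), 16))
    return np.asarray(state) @ proj

def write(state, action, episode_return):
    """Bind the experience to its observed return, keeping the max per key."""
    key, buf = embed(state), memory[action]
    for i, (k, r) in enumerate(buf):
        if np.allclose(k, key):
            buf[i] = (k, max(r, episode_return))
            return
    if len(buf) >= CAPACITY:
        buf.pop(0)                   # first-in, first-out eviction
    buf.append((key, episode_return))

def q_estimate(state, action):
    """Quick value estimate: average the returns of the K nearest stored keys."""
    key, buf = embed(state), memory[action]
    if not buf:
        return 0.0
    dists = [np.linalg.norm(k - key) for k, _ in buf]
    nearest = np.argsort(dists)[:K]
    return float(np.mean([buf[i][1] for i in nearest]))

write([0.1, 0.2, 0.3], action=2, episode_return=5.0)
print(q_estimate([0.1, 0.2, 0.31], action=2))   # ~5.0, with no parameter learning needed
```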
Sample Efficiency Test on Atari Games
20
[Plot legend: Model-free (10M), Hybrid (40M), Model-based (10M), DQN (200M)]
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
21
LLM Alignment During Inference
● “I need your help now” should be
steered to “I beseech thee, assist me this
very moment!”
● Episodic memory stores the LLM’s
input–output representations
● At test time, relevant output
representations are retrieved to
construct the local steering vector for
each token
Do et al. "Dynamic Steering With Episodic Memory For Large Language
Models." Findings of ACL 2025. https://aclanthology.org/2025.findings-acl.706/
Local steering
(episodic)
Global steering
(semantics)
Only need
10-100 samples!
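A hedged sketch of the retrieval step, not the exact method of Do et al. (2025): cached input representations act as keys, cached output representations as values, and a soft k-nearest-neighbour read builds the local steering vector added to each token's hidden state. The tensor names, `alpha`, and the temperature are illustrative assumptions.

```python
# Hedged sketch: build a per-token local steering vector from an episodic memory of
# cached (input representation -> output/steering representation) pairs.
import torch

def local_steering_vector(hidden, memory_keys, memory_values, k=8, tau=0.1):
    """hidden: (d,) current token representation; memory_keys/values: (N, d)."""
    sims = torch.nn.functional.cosine_similarity(memory_keys, hidden.unsqueeze(0), dim=-1)
    topk = sims.topk(k)
    weights = torch.softmax(topk.values / tau, dim=0)          # soft k-NN weights
    return (weights.unsqueeze(-1) * memory_values[topk.indices]).sum(dim=0)

def steer(hidden, memory_keys, memory_values, alpha=1.0):
    """Add the retrieved local steering direction to the token's hidden state."""
    return hidden + alpha * local_steering_vector(hidden, memory_keys, memory_values)

# Toy usage: 32 cached samples with 768-dimensional representations
keys, values = torch.randn(32, 768), torch.randn(32, 768)
print(steer(torch.randn(768), keys, values).shape)             # torch.Size([768])
```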
Working Memory for Agents
22
When the State Is Not Enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state=position, does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Keep a memory of all past
observations/actions
● Train the recurrent policy with policy gradient
(see the sketch below)
23
[Figure: the agent observes only part of the full map; an RNN hidden state summarizes past observations, and the RNN serves as the policy model.]
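A minimal sketch of the recurrent-policy idea, assuming a GRU whose hidden state serves as working memory over past observations; the sizes and the categorical action head are illustrative choices, not a specific agent from the talk.

```python
# Hedged sketch: an RNN policy for a partially observable environment; the GRU
# hidden state acts as working memory summarizing past observations.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, h0=None):
        """obs_seq: (batch, time, obs_dim) -> per-step action logits and last hidden state."""
        h_seq, h_last = self.gru(obs_seq, h0)
        return self.action_head(h_seq), h_last

policy = RecurrentPolicy()
obs = torch.randn(1, 5, 8)                                   # 5 partial observations
logits, h = policy(obs)
action = torch.distributions.Categorical(logits=logits[:, -1]).sample()
print(action)   # the logits (and hence the policy gradient) depend on the whole history
```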
Working Memory: Recurrence or Attention?
24
Ni, Tianwei, Benjamin Eysenbach, and Ruslan Salakhutdinov. "Recurrent Model-Free RL Can Be a
Strong Baseline for Many POMDPs." In International Conference on Machine Learning, pp.
16691-16723. PMLR, 2022
[Figure labels: Calibration, Update]
25
Stable Working Memory: Fast and Powerful
● Stable Hadamard Memory (SHM)
introduces a special calibration matrix Ct
(see the paper for its exact definition), where:
● θt: trainable parameters that are
randomly selected for each timestep
● vc(xt): a mapping function (e.g., a linear
transformation)
● This dynamic design keeps updates
stable by ensuring that the cumulative
product of calibration matrices is
bounded
Le, Hung, Kien Do, Dung Nguyen, Sunil Gupta, and Svetha Venkatesh. "Stable
Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement
Learning." ICLR, 2025.
Long-term credit
assignment
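The exact SHM equations are given in the figure and the paper; the sketch below only illustrates the general pattern of a Hadamard (element-wise) memory update of the form Mt = Ct ⊙ Mt−1 + Ut with a per-step calibration matrix built from randomly selected θt and a mapping vc(xt). That reading of the slide, and the specific form of Ct used here, are assumptions rather than the paper's definition.

```python
# Illustrative sketch only: a Hadamard-style memory update M_t = C_t * M_{t-1} + U_t
# with an input-dependent calibration matrix C_t. The way C_t is built below is a
# simplification and is NOT the exact SHM formulation from Le et al. (2025).
import torch
import torch.nn as nn

class HadamardMemory(nn.Module):
    def __init__(self, input_dim=16, mem_rows=8, mem_cols=16, n_thetas=32):
        super().__init__()
        self.thetas = nn.Parameter(torch.randn(n_thetas, mem_rows))  # pool of theta_t
        self.v_c = nn.Linear(input_dim, mem_cols)                    # mapping v_c(x_t)
        self.v_u = nn.Linear(input_dim, mem_rows * mem_cols)         # update term
        self.mem_shape = (mem_rows, mem_cols)

    def step(self, memory, x_t):
        # theta_t: trainable parameters randomly selected for this timestep
        theta_t = self.thetas[torch.randint(0, self.thetas.shape[0], (1,)).item()]
        # Calibration kept close to 1 so repeated element-wise products stay controlled
        c_t = 1.0 + torch.tanh(torch.outer(theta_t, self.v_c(x_t)))
        u_t = self.v_u(x_t).view(self.mem_shape)
        return c_t * memory + u_t          # Hadamard (element-wise) update

mem = torch.zeros(8, 16)
model = HadamardMemory()
for x in torch.randn(5, 16):               # a short sequence of inputs
    mem = model.step(mem, x)
print(mem.shape)                            # torch.Size([8, 16])
```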
Integrated Memory Systems
for Challenging Tasks
26
27
Classic RL with Hybrid Memory Systems
● Addressing challenging hard-exploration
problems:
○ Montezuma's Revenge
○ Noisy TV
● Working memory: novelty estimation within an
episode
● Episodic memory: novelty estimation across
episodes
● Semantic Memory: Surprise estimation via
prediction error
Noisy-TV: a random TV will distract the RL agent from its main
task due to high surprise (source).
28
Example: Memories for Novelty of Surprise
❑ The norm of the prediction error (the surprise norm) is
not a reliable novelty signal (e.g., it stays high in Noisy-TV)
❑ A new metric, surprise novelty: the error of
reconstructing the surprise (which is itself the error
of state prediction)
❑ This requires a surprise generator, such as a
dynamics model, to produce the surprise vector u,
i.e., the difference between the prediction and
reality
❑ Inter- and intra-episode novelty scores are then
estimated by a memory system called Surprise
Memory (SM), consisting of an autoencoder
network W and an episodic memory M, respectively
(sketched below)
Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond
Surprise: Improving Exploration Through Surprise Novelty." In AAMAS,
2024.
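A hedged sketch of the surprise-novelty computation: a dynamics model produces the surprise vector u as its prediction error, and an autoencoder W standing in for the Surprise Memory reconstructs u, with the reconstruction error serving as the novelty signal. The episodic-memory component M and the actual network sizes are omitted or simplified here.

```python
# Hedged sketch: surprise novelty = reconstruction error of the surprise vector u,
# where u is the prediction error of a dynamics model. The episodic-memory part of
# SM is omitted; only the autoencoder W is sketched, with illustrative sizes.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state; its error is the surprise vector u."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, state_dim))
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class SurpriseAutoencoder(nn.Module):
    """W: reconstructs familiar surprise patterns; high error => novel surprise."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.enc = nn.Linear(state_dim, 8)
        self.dec = nn.Linear(8, state_dim)
    def forward(self, u):
        return self.dec(torch.relu(self.enc(u)))

dyn, W = DynamicsModel(), SurpriseAutoencoder()
state, action, next_state = torch.randn(16), torch.randn(1), torch.randn(16)

u = next_state - dyn(state, action)          # surprise vector (prediction error)
surprise_norm = u.norm()                     # the weak baseline signal
surprise_novelty = (W(u) - u).norm()         # intrinsic signal used instead
print(surprise_norm.item(), surprise_novelty.item())
```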
29
LLM Agents Also Have a System of Memories
❏ Semantic memory: knowledge stored in the
LLM's weights or an external knowledge base
❏ Working memory: the context stored in the
prompt, accessed via the attention mechanism
❏ Episodic memory: an external memory module that
stores past experiences (see the sketch below)
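A schematic sketch of how the three memories could interact in a single agent step; `embed`, `call_llm`, and the toy distance function are hypothetical placeholders rather than components of any specific framework mentioned in the talk.

```python
# Hedged sketch: one agent step combining semantic memory (the LLM's weights),
# working memory (the prompt context), and episodic memory (an external store).
# embed() and call_llm() are hypothetical placeholders, not a real framework.
from typing import List, Tuple

episodic_store: List[Tuple[list, str]] = []   # (embedding, past experience text)

def embed(text: str) -> list:
    """Placeholder embedding; a real system would use a sentence encoder."""
    return [float(ord(c)) for c in text[:32]]

def call_llm(prompt: str) -> str:
    """Stub standing in for an actual LLM call (semantic memory lives in the weights)."""
    return "stub answer"

def retrieve(query: str, k: int = 3) -> List[str]:
    """Nearest-neighbour read from episodic memory (toy squared-distance ranking)."""
    q = embed(query)
    ranked = sorted(episodic_store,
                    key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[0], q)))
    return [text for _, text in ranked[:k]]

def agent_step(task: str) -> str:
    recalled = retrieve(task)                                     # episodic memory read
    prompt = "\n".join(["Past experiences:", *recalled,
                        "Task:", task])                           # working memory = context
    answer = call_llm(prompt)                                     # semantic memory = weights
    episodic_store.append((embed(task), f"{task} -> {answer}"))   # episodic memory write
    return answer

print(agent_step("Summarize yesterday's meeting notes"))
```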
30
Memory for Experience Recall (EXPEL)
1. Gather Experiences: Retry tasks with
past reflections to build a memory
pool.
2. Reflect & Improve: Learn from failed
and successful attempts.
3. Recall Examples: Retrieve similar past
trajectories for guidance.
4. Extract Insights: Generate high-level
lessons from successes/failures.
5. Task Inference: Use insights + retrieved
examples to improve reasoning.
Zhao, Andrew, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao
Huang. "ExpeL: LLM Agents Are Experiential Learners." In Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 38, no. 17, pp. 19632-19642. 2024.
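A hedged sketch of the recall step (steps 3-5 above): distilled insights plus the most similar past trajectories are packed into the prompt for a new task. The similarity measure, the example insights, and the trajectory format are illustrative assumptions, not ExpeL's implementation.

```python
# Hedged sketch of an ExpeL-style recall step: retrieve similar past trajectories
# and prepend distilled insights to the prompt for a new task.
from difflib import SequenceMatcher

insights = ["Check preconditions before acting.",
            "Prefer actions that worked in similar tasks."]
trajectory_pool = [("find the mug", "go to kitchen -> open cabinet -> SUCCESS"),
                   ("heat the soup", "use microwave without opening it -> FAIL")]

def similarity(a: str, b: str) -> float:
    """Toy text similarity standing in for an embedding-based retriever."""
    return SequenceMatcher(None, a, b).ratio()

def build_prompt(task: str, k: int = 1) -> str:
    # Recall the k most similar past trajectories for guidance
    recalled = sorted(trajectory_pool, key=lambda t: -similarity(task, t[0]))[:k]
    lines = ["Insights learned from past successes/failures:"]
    lines += [f"- {i}" for i in insights]
    lines += ["Similar past trajectories:"]
    lines += [f"- {t}: {traj}" for t, traj in recalled]
    lines += [f"New task: {task}", "Plan your actions step by step."]
    return "\n".join(lines)

print(build_prompt("find the cup"))
```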
31
Spatial Memory for Navigation
● The LLM considers the current set of
memories (R0:i) and the question (Q) to
generate a function call f and a query q
which retrieves m memories.
● Each memory contains position, time,
and caption information to be used as
further context.
● The LLM can make up to k queries per
iteration, retrieving up to k × m memories and
updating the context (see the sketch below)
● Check: Can the question be answered?
○ No: Repeat querying with updated
context
○ Yes: Summarize relevant info & generate
final answer
Anwar, Abrar, John Welsh, Joydeep Biswas, Soha Pouya, and Yan Chang.
"ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for
Robot Navigation." ICRA (2025).
32
Spatial Memory in Action
33
Memory for Learning to Reason
❏ Challenge: Tiny LLMs (≤1B parameters)
struggle with RL due to sparse rewards,
poor exploration, and frequent incorrect
outputs.
❏ Solution: Memory-R+, a
memory-augmented RL framework
inspired by human episodic memory,
storing both successes and failures.
❏ Mechanism: Intrinsic motivation guides
reasoning using nearest-neighbor
memory retrieval for success (exploit) and
failure (explore) patterns (sketched below)
Le, Hung, Dai Do, Dung Nguyen, and Svetha Venkatesh. "Reasoning Under 1
Billion: Memory-Augmented Reinforcement Learning for Large Language Models."
arXiv preprint arXiv:2504.02273 (2025).
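A hedged sketch of the intrinsic-motivation idea: a new reasoning trace is rewarded for being close to stored successes (exploit) and far from stored failures (explore), measured by k-nearest-neighbour distances over trace embeddings. The exact scoring and combination in Memory-R+ differ; the names and weights here are illustrative.

```python
# Hedged sketch: intrinsic motivation from episodic memories of successes and
# failures, using k-nearest-neighbour distances over reasoning-trace embeddings.
import numpy as np

success_memory: list = []   # embeddings of traces that led to correct answers
failure_memory: list = []   # embeddings of traces that failed

def knn_distance(query: np.ndarray, memory: list, k: int = 5) -> float:
    """Mean distance to the k nearest stored traces (1.0 if the memory is empty)."""
    if not memory:
        return 1.0
    dists = sorted(np.linalg.norm(np.stack(memory) - query, axis=1))
    return float(np.mean(dists[:k]))

def intrinsic_reward(trace_embedding: np.ndarray,
                     w_exploit: float = 1.0, w_explore: float = 1.0) -> float:
    exploit = -knn_distance(trace_embedding, success_memory)  # stay close to successes
    explore = knn_distance(trace_embedding, failure_memory)   # move away from failures
    return w_exploit * exploit + w_explore * explore

def store(trace_embedding: np.ndarray, succeeded: bool) -> None:
    (success_memory if succeeded else failure_memory).append(trace_embedding)

store(np.random.randn(32), succeeded=True)
print(intrinsic_reward(np.random.randn(32)))
```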
34
Emerging Capabilities for Tiny LLMs
35
Discussion and QA
● Memory is essential for agents, both classic RL
and LLM-based
● Three basic types of memory: working, episodic, and semantic
● Memory makes agents more efficient
and robust in challenging environments:
○ Long-horizon tasks
○ Multi-step reasoning
○ Noisy environments
Generated by DALL-E 3
