Memory for Lean
Reinforcement Learning
Presented by Hung Le
1
Background
2
What is Reinforcement Learning (RL)?
● Agent interacts with
environment
● S+A=>S’+R (MDP)
● Find a policy π(S) → A to
maximize return ∑R from the
environment
● Learn (action) value function:
○ V(s)
○ Q(s,a)
→ choose action that maximizes
the value (e.g. ε-greedy policy)
3
https://crl.causalai.net/
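To make the ε-greedy policy above concrete, here is a minimal sketch of ε-greedy action selection over a tabular Q(s, a); the table shape and the value of ε are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q is assumed to be a 2-D array of shape [n_states, n_actions] (tabular case).
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit: action maximizing Q(s, a)

# Usage: a toy table with 5 states and 3 actions
Q = np.zeros((5, 3))
action = epsilon_greedy_action(Q, state=0, epsilon=0.1)
```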
Classic (memory-less) RL algorithms
4
https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146
Q-learning
(temporal difference)
REINFORCE
(policy gradient)
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8, no. 3 (1992): 279-292.
Williams, Ronald J. "Simple statistical gradient-following algorithms
for connectionist reinforcement learning." Machine learning 8, no. 3
(1992): 229-256.
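As a concrete companion to the Q-learning box, here is a minimal tabular temporal-difference update; the learning rate alpha and discount gamma are illustrative defaults (REINFORCE, the policy-gradient counterpart, is not sketched here).

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One temporal-difference update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)).

    Q is a numpy array of shape [n_states, n_actions].
    """
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```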
Challenge: the RL task can be complicated while
the reward is simple
● Task:
○ Agent searches for the key
○ Agent picks up the key
○ Agent opens the door to access the room
○ Agent finds the box in the room
● Reward:
○ If the agent reaches the box, get +1 reward
5
https://github.com/maximecb/gym-minigrid
→ How to learn such complicated
policies from such a simple reward?
Short Answer: just learn from many trials (data)!
Chess
Self-driving
car
Video
games
Robotics
6
Example: RL agent plays video game
7
Game
simulation
Limitation of training with big data
● High cost
○ Training time (days to months)
○ GPU (dozens to hundreds)
● Requires simulators
○ Agents are trained in simulation (millions to billions
of steps)
○ The cost of one step of simulation can be high
○ Some real environments simply don’t have a simulator
● Trained agents are unlike humans
○ Unsafe exploration
○ Weird behaviors
○ Fail to generalize
8
Human vs RL Agents in Atari games
● Human:
○ A few hours of practice to reach
moderate performance
○ Don’t forget how to play old games
while learning new ones
○ Can play any game
● RL Agents:
○ 21 trillion hours of training to beat
humans (AlphaZero), equivalent to
11,500 years of human practice
○ Catastrophic forgetting when learning
games sequentially
○ Despite endless training, there are still
games they fail at
9
What is missing?
10
Memory as experiences
Memory for exploration
Memory to build context
Memory as experiences
11
What is memory?
● Memory is the ability to efficiently
store, retain and recall information
● Brain memory stores items, events
and high-level structures
● Computer memory stores data,
programs and temporary variables
12
What are examples of
memory in neural networks?
Memory in neural networks
13
● Long-term memory
○ Semantic memory: storing data in the neural network weights
○ Episodic memory: storing episodic events in matrix memory
● Short-term memory
○ Associative memory: key-value binding as in a Hopfield Network or Transformer layer
○ Working memory: matrix memory in memory-augmented neural networks
● Functional memory
○ Memory stores programs
○ Memory of models, mixture of experts, …
Properties of memories
14
Memory type | Lifespan | Plasticity | Example
1. Working memory | Short-term | Quick | 1 episode is one day; lasts for 1 day; memory is built instantly
2. Episodic memory | Long-term | Quick | Persists across the agent’s lifetime; lasts for several years; memory is built instantly
3. Semantic memory | Long-term | Slow | Persists across the agent’s lifetime; lasts for several years; memory takes time to build
Memory in RL
● The human brain implements RL
● Dopamine neurons reflect a
reward prediction error (TD
learning)
● What is the memory in the brain
that stores V?
○ A value table is not scalable
○ Maybe a value model → DQN
(semantic memory)
15
DQN: Replay buffer is an episodic memory
• Store experiences: (s,a,r,s’) tuple
• Read memory via replay sampling
• Memory content is used to train
Action-Value Network Q
16
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
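A minimal sketch of a DQN-style replay buffer storing (s, a, r, s', done) tuples and sampling random minibatches to train the action-value network; the capacity and batch size here are illustrative, not the values from the Nature DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Episodic memory for DQN: store transitions, sample them for training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```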
DQN’s memories are better than tables, but …
● Inefficiency:
○ Learning semantic memory (Q network) is
slow (gradient descent)
○ Optimize many parameters
● Bootstrap noise:
○ The target is the network’s output
○ If network is not well trained, the target is
noisy
● Replay buffer:
○ Stores raw observations
○ Needs many sampling iterations
17
Alternative: episodic control paradigm
(Diagram) Current experience, e.g. (St, At); memory read over stored (experience, final return) pairs; value; policy; environment; memory write.
● Episodic memory is a key-value memory
● Directly binds experiences to returns → refers to
experiences with high returns to make
decisions
Model-free episodic control: K-nearest neighbors
19
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan
Wierstra, and Demis Hassabis. "Model-free episodic control." NeurIPS (2016).
Fixed-size memory,
first-in-first-out
● No need to learn parameters
(pretrained 𝜙)
● Quick value estimation
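A simplified sketch of the model-free episodic control idea: a fixed-size, per-action key-value memory binding state embeddings 𝜙(s) to observed returns, with values estimated by averaging the k nearest neighbours. This omits the exact-match, max-return update of the original paper; the capacity, k, and embedding function are illustrative assumptions.

```python
import numpy as np

class EpisodicController:
    """Per-action key-value memory: keys are embeddings phi(s), values are returns."""

    def __init__(self, n_actions, capacity=10_000, k=11):
        self.k = k
        self.capacity = capacity
        self.keys = [[] for _ in range(n_actions)]     # embeddings per action
        self.values = [[] for _ in range(n_actions)]   # returns per action

    def write(self, phi_s, action, episodic_return):
        keys, values = self.keys[action], self.values[action]
        keys.append(np.asarray(phi_s))
        values.append(episodic_return)
        if len(keys) > self.capacity:                  # first-in-first-out eviction
            keys.pop(0)
            values.pop(0)

    def estimate(self, phi_s, action):
        """Q_EC(s, a): average return of the k nearest stored neighbours."""
        keys, values = self.keys[action], self.values[action]
        if not keys:
            return 0.0
        dists = np.linalg.norm(np.stack(keys) - np.asarray(phi_s), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([values[i] for i in nearest]))
```

No gradient descent is needed for the value estimate itself, which is why this style of control is quick to learn from few episodes.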
Hybrid: episodic memory + DQN
20
Lin, Zichuan, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. "Episodic Memory Deep Q-Networks." IJCAI (2018).
Episodic
TD learning
Contribution
weight
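A hedged illustration of mixing an episodic return into the usual TD target through a contribution weight; the actual EMDQN loss differs (it adds an episodic regularizer term), so treat this only as a sketch of the "contribution weight" idea, not the paper's method.

```python
def hybrid_target(r, q_next_max, episodic_return, done, gamma=0.99, lam=0.5):
    """Mix a one-step TD target with an episodic-memory return.

    lam is the episodic contribution weight; in practice it is fixed by hand,
    which is one of the limitations discussed on the next slide.
    """
    td_target = r if done else r + gamma * q_next_max
    return (1.0 - lam) * td_target + lam * episodic_return
```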
Limitation of model-free episodic memory
● Near-deterministic assumption
○ Assumes a clean (near-deterministic) env.
○ Stores only the best return
● Sample inefficiency:
○ Stores state-action values, which demands experiencing all
actions to gain experience
● Fixed combination between episodic and
parametric values
○ The episodic contribution weight is unchanged across different
observations
○ Requires manual tuning of the weight
21
What if the state is partially
observable and the number of
actions is large?
Model-based episodic memory
● Learn a model of trajectories using
self-supervised training
○ Model=LSTM
○ Learn to reconstruct past state-action
given current trajectory and query
● The trained LSTM is used to
generate trajectory
representations
→ counterfactual trajectory
→ imagine actions
22
Le, Hung, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. "Model-Based Episodic
Memory Induces Dynamic Hybrid Controls." NeurIPS (2021).
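A rough self-supervised trajectory-model sketch in the spirit of the description above: an LSTM encodes the trajectory, and a decoder reconstructs the state-action at a queried past time step. The architecture and dimensions are illustrative assumptions, not the model of the NeurIPS 2021 paper.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """LSTM trajectory encoder + decoder that reconstructs a queried past (state, action)."""

    def __init__(self, state_dim=8, action_dim=4, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + action_dim),
        )

    def forward(self, traj, query_t):
        # traj: [batch, T, state_dim + action_dim], query_t: [batch, 1] time index
        _, (h, _) = self.encoder(traj)            # h: [1, batch, hidden_dim]
        ctx = torch.cat([h[-1], query_t], dim=-1)
        return self.decoder(ctx)                  # reconstructed past [state, action]

# Self-supervised reconstruction loss against the true (state, action) at the queried step
model = TrajectoryModel()
traj = torch.randn(16, 10, 12)                    # 16 trajectories of length 10
query_t = torch.randint(0, 10, (16, 1)).float()
target = traj[torch.arange(16), query_t.squeeze(1).long()]
loss = nn.functional.mse_loss(model(traj, query_t), target)
```

The trained encoder's hidden states can then serve as trajectory representations for the episodic memory, which is what enables the counterfactual "imagined action" reads mentioned above.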
Sample efficiency test on Atari game
23
(Figure) Learning curves: Model-free (10M), Hybrid (40M), Model-based (10M), DQN (200M).
Episodic memory for hyperparameter
optimization
●RL is very sensitive to hyperparameters
●SOTA performance is achieved with extensive
hyperparameter tuning
Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint
arXiv:1708.04133.
24
(Figure) Table of DQN hyperparameters, not to mention the neural network architecture.
Limitation of memory-less optimizer
● The optimizer has no context of training in the optimization process
● Treated as a stateless bandit or greedy optimization
○ Ignoring the context prevents the use of episodic experiences that can be critical in
optimization and planning
○ E.g. the hyperparameters that helped overcome a past local optimum in the loss
surface can be reused when the learning algorithm falls into a similar local optimum
25
How to build the context (memory key)?
Optimizing hyperparameters as episodic RL
● At each policy update, the hyper-agent:
○ Observes the training context – the hyper-state
○ Configures the RL algorithm with suitable hyperparameters ψ – the hyper-action
○ Trains the RL agent with ψ and observes the learning progress – the hyper-reward
● The goal of the Hyper-RL is the same as the main RL’s: to maximize the
return of the RL agent
○ At a hyper-state, find the hyper-action that maximizes the accumulated hyper-reward (hyper-return)
26
KEY: experienced hyper-state/action | VALUE: outcome hyper-returns
Le, Hung, Majid Abdolshah, Thommen K. George, Kien Do, Dung
Nguyen, and Svetha Venkatesh. "Episodic Policy Gradient Training."
AAAI (2022).
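A rough sketch of the KEY | VALUE idea above: an episodic memory that binds (hyper-state, hyper-action) features to observed hyper-returns and scores candidate hyperparameter settings by nearest-neighbour lookup. This only illustrates the episodic flavour; it is not the algorithm of the AAAI 2022 paper, and the feature encodings are assumptions.

```python
import numpy as np

class HyperEpisodicMemory:
    """Keys: concatenated (hyper-state, hyper-action) features. Values: hyper-returns."""

    def __init__(self, k=5):
        self.k = k
        self.keys, self.values = [], []

    def write(self, hyper_state, hyper_action, hyper_return):
        self.keys.append(np.concatenate([hyper_state, hyper_action]))
        self.values.append(hyper_return)

    def score(self, hyper_state, hyper_action):
        """Estimate the hyper-return of a candidate hyper-action by kNN lookup."""
        if not self.keys:
            return 0.0
        query = np.concatenate([hyper_state, hyper_action])
        dists = np.linalg.norm(np.stack(self.keys) - query, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.values[i] for i in nearest]))

    def select(self, hyper_state, candidates):
        """Pick the candidate hyperparameter setting with the best estimated hyper-return."""
        return max(candidates, key=lambda psi: self.score(hyper_state, np.asarray(psi)))
```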
27
Memory to build context
28
When the state is not enough …
● Partially Observable Environments:
○ States do not contain all required
information for optimal action
○ E.g. state = position does not contain
velocity
● Ways to improve:
○ Build richer state representations
○ Memory of all past observations/actions
● Policy gradient
29
(Figure) Full map vs. observed state; RNN hidden state; RNN as policy model.
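A minimal recurrent-policy sketch: a GRU hidden state summarizes past observations and acts as the agent's working memory. Dimensions are illustrative assumptions, and training (e.g. with a policy-gradient loss) is not shown.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """RNN as policy model: the hidden state carries memory over past observations."""

    def __init__(self, obs_dim=16, n_actions=4, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, h=None):
        # obs_seq: [batch, T, obs_dim]; h carries the memory across calls
        out, h = self.rnn(obs_seq, h)
        logits = self.policy_head(out)            # action logits at every step
        return torch.distributions.Categorical(logits=logits), h

# One step of acting: feed the latest observation, keep the hidden state
policy = RecurrentPolicy()
dist, h = policy(torch.randn(1, 1, 16))
action = dist.sample()                            # trained with a policy-gradient loss
```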
Building better working memory for a better context
30
● External memory: longer-term,
store more
● Unsupervised training to learn
read-write operation
Wayne, Greg, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae et al.
"Unsupervised predictive memory in a goal-directed agent." arXiv preprint arXiv:1803.10760 (2018).
Relational memory to capture state relationships
31
Dot-product attention
Outer-product attention
Zambaldi, Vinicius, David Raposo, Adam Santoro, Victor Bapst, Yujia Li,
Igor Babuschkin, Karl Tuyls et al. "Deep reinforcement learning with
relational inductive biases." In ICLR 2018.
Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive
associative memory." In ICML 2020.
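A minimal dot-product-attention read over memory slots, which is the basic operation behind relational memory; the outer-product attention of self-attentive associative memory is not shown, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, memory):
    """Relational read: dot-product attention of a query over memory slots.

    query:  [batch, d]
    memory: [batch, n_slots, d]
    Returns a weighted sum of memory slots (the retrieved relational context).
    """
    scores = torch.einsum('bd,bnd->bn', query, memory) / memory.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)           # attention over slots
    return torch.einsum('bn,bnd->bd', weights, memory)

# Example: 2 queries, each attending over 8 memory slots of dimension 32
context = dot_product_attention(torch.randn(2, 32), torch.randn(2, 8, 32))
```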
Suitable for multi-agent
or sparse POMDP
32
(Figure) Skip4 vs. No Skip.
Memory for exploration
33
Challenges of RL
● Rewards can be very sparse
○ RL agents cannot learn anything
until they collect the first reward
○ Explore forever?
● Sometimes the reward function in
complicated real-world problems is
unknown
○ Don’t have a simulator
○ Only known when we explore
○ Of course, if the agent explores long
enough, it will learn eventually
→ Sample inefficiency
34
Need exploration mechanisms to enable
sample efficiency!
35
Aubret, A., L. Matignon, and S. Hassas. "A survey on intrinsic motivation in reinforcement learning."
In the biological world, agents cope with
this problem very well
● Animals can travel long
distances until they find food
● Humans can navigate to
an address in a strange city
○ intrinsic motivation
○ curiosity, hunch
○ intrinsic reward
36
https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/
Agents should be motivated towards
“interesting” consequences
● C: actor vs M: world model
● M predicts consequences of C
● As a result:
▪ If C’s actions result in repeated and boring
consequences → M predicts well
▪ C must explore novel consequences
● Memory:
▪ To learn the world model
▪ To know whether something is novel or old
37
https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
M: Forward model learns dynamics (no memory)
38
Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. 2015
Novelty if prediction
error is high (intrinsic
reward)
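A minimal forward-model sketch where the intrinsic reward is the prediction error on the next state; the network sizes and the one-hot action encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Forward dynamics model: predict the next state from (state, action)."""

    def __init__(self, state_dim=16, action_dim=4, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1))

def intrinsic_reward(model, s, a_onehot, s_next):
    """High when the consequence of the action is poorly predicted (novel)."""
    with torch.no_grad():
        pred = model(s, a_onehot)
    return ((pred - s_next) ** 2).mean(dim=-1)    # per-sample prediction error
```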
When novelty as prediction error is useless
● The prediction target is stochastic
● Information necessary for the prediction is
missing
→ Both the totally predictable and the
fundamentally unpredictable will get boring
→Solution: Remember all experiences
● “Store” all observations, including
stochastic ones in working, semantic or
episodic memory
● Instead of predicting, try recalling from the
memory
39
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/
Working memory: Store visited in-episode states
●Novelty through reachability:
▪ Boring if reachable from states in memory in less than
k steps
●Learn to classify: reachable or unreachable
○ Collect pairs of states from a trajectory
○ Create a label indicating whether one is reachable from the other
40
Savinov et al. "Episodic Curiosity through Reachability." In ICLR 2018.
High if unreachable
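A rough sketch of the reachability idea: a small comparator classifies whether two observation embeddings are within k steps of each other, and the episodic-curiosity bonus is high when the current embedding looks unreachable from everything in the in-episode memory. The aggregation rule, threshold, and sizes are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ReachabilityComparator(nn.Module):
    """Classify whether two observation embeddings are within k environment steps.

    Training pairs come from trajectories: close pairs -> label 1, far pairs -> label 0.
    """

    def __init__(self, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, e1, e2):
        # e1, e2: [N, emb_dim]
        return torch.sigmoid(self.net(torch.cat([e1, e2], dim=-1)))  # P(reachable)

def curiosity_bonus(comparator, current_emb, memory_embs, threshold=0.5):
    """current_emb: [1, emb_dim]; memory_embs: [N, emb_dim] of in-episode states.

    The bonus is high when the current state looks unreachable from the whole memory.
    """
    with torch.no_grad():
        expanded = current_emb.expand(memory_embs.shape[0], -1)
        probs = comparator(expanded, memory_embs)
    return 1.0 if probs.max().item() < threshold else 0.0
```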
Exploration with working memory is better
41
(Figure) Learning curves: no intrinsic reward; intrinsic reward via dynamics prediction; intrinsic reward via working memory.
Semantic memory: distillation into neural network weights
● Target network: randomly
transforms the state
● Predictor network: tries to
remember the transformed
state
○ A global memory
○ A random TV is not a problem
■ Remembers all noisy
channels
42
https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
Burda et al. Random Network Distillation: a new take on Curiosity-Driven Learning, In ICLR 2019
High intrinsic reward if the state cannot be distilled
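A minimal random-network-distillation sketch: a fixed random target network and a trained predictor; the intrinsic reward is the predictor's error, which stays high for rarely visited states. The network sizes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_net(obs_dim=16, emb_dim=32):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

# The random target is never trained; the predictor is trained to "distill" it.
target = make_net()
for p in target.parameters():
    p.requires_grad_(False)
predictor = make_net()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_intrinsic_reward(obs):
    """High when the predictor cannot yet distill the random features of obs."""
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1).detach()

def rnd_update(obs):
    """Train the predictor on visited observations (the 'semantic memory' write)."""
    loss = ((predictor(obs) - target(obs)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```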
Episodic memory: explore
from stored good states
● Archive: a memory of good states
(state, score) → sample one
● Purely random exploration from this
state → collect more states
● Update the archive
And many other tricks: imitation
learning, goal-based policy, …
43
Adrian Ecoffet et al.: First return, then explore. Nature 2021
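A rough sketch of the archive idea: a dictionary keyed by a (downsampled) cell representation, each entry storing the best-scoring state seen for that cell plus how to return to it. The cell function and the uniform selection rule here are illustrative placeholders, not the heuristics of the paper.

```python
import random

class Archive:
    """Go-Explore-style archive sketch: memory of promising states keyed by cell."""

    def __init__(self):
        self.cells = {}   # cell_key -> {"state": ..., "score": ..., "trajectory": ...}

    def update(self, cell_key, state, score, trajectory):
        """Keep only the best-scoring entry seen for each cell."""
        best = self.cells.get(cell_key)
        if best is None or score > best["score"]:
            self.cells[cell_key] = {"state": state, "score": score, "trajectory": trajectory}

    def sample(self):
        """Pick a stored good state to return to before exploring randomly from it."""
        return random.choice(list(self.cells.values()))
```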
Superhuman performance
44
(Figure) Episodic, semantic, and working memory; robotics and video games.
Conclusion
● Memory assists RL agents in many
forms:
○ Semantic
○ Working
○ Episodic
● And in many tasks:
○ Store experiences
○ Exploration
○ State representation
45
Thank you!
46
Our team at A2I2 is hiring!
Contact thai.le@deakin.edu.au for PhD scholarships.
