On Collective Reinforcement Learning
Techniques, Challenges, and Opportunities
Gianluca Aguzzi gianluca.aguzzi@unibo.it
Dipartimento di Informatica – Scienza e Ingegneria (DISI)
Alma Mater Studiorum – Università di Bologna
13/12/2021
Outline
1 Introduction
2 From Single Agent To Multi-Agent
3 Learning in CAS
4 Independent Learners
5 Joint Action Learners
6 Advanced Topics
Introduction
Motivation
Learning is a key mechanism to drive adaptivity
Intelligent agents improve their performance with experience
We would like to improve adaptiveness with raw experience and, hence,
We would like to bring the Reinforcement Learning methodology into CAS
Lecture goals
Understanding the challenges related to the Multi-Agent System (MAS) domain
Show techniques, patterns, and architectures applied in Collective Systems
Hands-on with some practical examples of Collective Learning
Single-Agent Learning
What have you seen so far . . .
Markov Decision Process (MDP) to model the agent-environment
interactions
Find a learning process that eventually leads to an optimal policy π∗
Q-Learning (and value-based approaches in general) as a prominent
algorithm to reach convergence
. . . But plain RL works only under some conditions
Reward hypothesis
Full environment observability and Markovian property
Stationary environment
State/action space should be small enough to be stored in memory
(otherwise, we should leverage function approximators)
Partially observable environments
Definition
An agent does not have perfect and complete knowledge of the state of
the environment
Non-stationary environments
Definition
The environment model (e.g. the random variable associated with it)
changes over time.
MDPs are stationary by definition . . .
. . . But real-world environment dynamics can change over time (e.g.
markets, city traffic, routing networks)
Practically, it seems that RL (in particular Temporal Difference (TD)
methods) works well even in this case
But, we lose convergence guarantees.
From Single Agent To Multi-Agent
Multi-Agent Reinforcement Learning (MARL)
Multiple agents learn together the best policy that maximises a long-term
reward signal.
Considerations
If multiple agents exist, but only one agent learns by experience, then
it is a single-agent problem (e.g. single-player videogames)
So, MAS + Learning does not imply MARL, but MARL =⇒ MAS
Stochastic Games (or Markov Games)
Extension of MDPs to the MAS context
Common abstraction in MARL algorithms
Definition
A stochastic game is a tuple ⟨N, S, {A_i}_{i∈{1,...,N}}, P, {R_i}_{i∈{1,...,N}}⟩
N is the number of agents (N > 1)
S is the global environment state space
A_i is the action space of agent i; A := A_1 × · · · × A_N is the joint action space
P : S × A → P(S) is the state transition function; P(S) is a discrete probability distribution over S (a probability for each s ∈ S)
R_i : S × A × S → R is the reward signal of agent i
Typical time evolution: S_0, A_0, R_1, S_1, . . .
Stochastic games: Example
Rock-Paper-Scissors
N = 2
A_1 = A_2 = {Rock, Paper, Scissors}
S = { } (the game is effectively stateless)
R (agent 1, agent 2):
             Rock      Paper     Scissors
Rock         0, 0      −1, 1     1, −1
Paper        1, −1     0, 0      −1, 1
Scissors     −1, 1     1, −1     0, 0
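A minimal sketch (not from the slides; the names payoff and step are ours) of how this stochastic game can be encoded: the dictionary maps every joint action to the pair of rewards (R_1, R_2).

```python
# Illustrative encoding of the Rock-Paper-Scissors stochastic game (hypothetical names).
ACTIONS = ("Rock", "Paper", "Scissors")

# payoff[(a1, a2)] -> (R1, R2): rewards of agent 1 and agent 2 for that joint action.
payoff = {
    ("Rock", "Rock"): (0, 0),      ("Rock", "Paper"): (-1, 1),     ("Rock", "Scissors"): (1, -1),
    ("Paper", "Rock"): (1, -1),    ("Paper", "Paper"): (0, 0),     ("Paper", "Scissors"): (-1, 1),
    ("Scissors", "Rock"): (-1, 1), ("Scissors", "Paper"): (1, -1), ("Scissors", "Scissors"): (0, 0),
}

def step(a1, a2):
    """One play of the (stateless) game: the joint action alone determines the rewards."""
    return payoff[(a1, a2)]

print(step("Rock", "Scissors"))  # (1, -1): agent 1 wins
```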
MARL Systems: Task type
Cooperative
Agents share the same reward function (R1 = · · · = RN) in order to
accomplish a collective goal
MARL Systems: Task type
Competitive
Agents compete with each other to maximise a long-term return
Board Games, Video games, . . .
MARL Systems: Task type
Mixed
Agents can both compete and cooperate in order to maximise a global
reward function
Also called General Sum games
On Cooperative Tasks
Homogeneous
Each agent has the same capabilities (A_1 = · · · = A_N)
The overall goal is to find the best policy, which is the same for each
agent (π∗ = π∗_1 = π∗_2 = · · · = π∗_N)
Heterogeneous
Each agent could have different capabilities (in the worst case,
A_1 ≠ · · · ≠ A_N)
Each agent has its own local policy, which should be optimised according to the
global collective goal
MARL Systems: Learning Scheme
Centralised Training and Centralised Execution (CTCE)
One agent with a global view of the system (in the cloud? on a server?)
Nodes send their perceptions to that central agent
With these perceptions, it builds a global state of the system
With the current policy, it chooses the actions that the nodes should
perform (Control)
In the next time step, it evaluates the reward function and updates the
policy accordingly (Learn)
Are the nodes agents?
Used mainly in offline (i.e. simulation-time) settings
MARL Systems: Learning Scheme
Centralised Training and Centralised Execution (CTCE)
[Diagram: Agents 0 … N interact with the Environment; a central Learner receives states and rewards (S, R) from all agents and sends back their actions (A)]
MARL Systems: Learning Scheme
Decentralised Training and Decentralised Execution (DTDE)
Each node has its local policy/value table
They can perceive the environment state (or they can observe a part
of it)
With the current state, they perform an action following their local
policy (Control)
In the next time step, they update their policy following a local reward
function (Learn)
Used both in offline and online (i.e. deployment-time) settings
MARL Systems: Learning Scheme
Decentralised Training and Decentralised Execution (DTDE)
[Diagram: Agents 0 … N each observe the Environment state (S), act on it (A), and learn (L) locally]
MARL Systems: Learning Scheme
Centralised Training and Decentralised Execution (CTDE)
A pattern: learning at simulation time, execution online
Simulation time
Each node follows the typical o_t, a_t, o_{t+1}, a_{t+1}, . . . trajectory using a
local policy
After an episode, this trajectory (or something derived from it) is
sent to a central learner
Using a global view of the system, the learner improves the agents' policies
The policies are then shared back with the agents
Execution time
Each agent has the local policy distilled during simulation time
With it, the agents act in the environment
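A schematic sketch of the CTDE loop under toy assumptions (the environment, reward, and every name below are invented for illustration): agents collect trajectories with the single shared policy, a central learner improves that policy from all of their data, and only the distilled policy is deployed.

```python
import random

# Toy CTDE sketch (all names and the environment are assumptions, not from the slides).
N_AGENTS, ACTIONS = 3, [0, 1]
shared_q = {}  # central learner's table: (observation, action) -> value, shared by all agents

def local_episode(policy_q, length=5):
    """One agent's trajectory o_t, a_t, r_t, ... collected with the current shared policy."""
    traj, obs = [], 0
    for _ in range(length):
        if random.random() < 0.2 or all((obs, a) not in policy_q for a in ACTIONS):
            act = random.choice(ACTIONS)                                   # explore
        else:
            act = max(ACTIONS, key=lambda a: policy_q.get((obs, a), 0.0))  # exploit
        reward = 1.0 if act == obs % 2 else 0.0                            # toy reward
        traj.append((obs, act, reward))
        obs = (obs + 1) % 4                                                # toy observation dynamics
    return traj

# Simulation time: the central learner improves the shared policy from everyone's trajectories.
for _ in range(200):
    for traj in [local_episode(shared_q) for _ in range(N_AGENTS)]:
        for obs, act, reward in traj:
            old = shared_q.get((obs, act), 0.0)
            shared_q[(obs, act)] = old + 0.1 * (reward - old)   # simple bandit-style update

# Execution time: each agent keeps only the distilled greedy policy; no central learner needed.
deployed = {obs: max(ACTIONS, key=lambda a: shared_q.get((obs, a), 0.0)) for obs in range(4)}
print(deployed)
```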
MARL Systems: Learning Scheme
Centralised Training and Decentralised Execution (CTDE)
[Diagram: Agents 0 … N act (A) on the Environment; their states and rewards (S, R) are also collected by a central Learner during training]
MARL Systems: Successful Applications
OpenAI Hide And Seek:
https://openai.com/blog/emergent-tool-use/
MARL Systems: Successful Applications
Capture The Flag: https:
//deepmind.com/blog/article/capture-the-flag-science
Learning in CAS
Collective Adaptive System (CAS)
Recap
Systems composed of a (possibly) large set of components that execute a
collective task by relying on component interactions and show inherent
adaptiveness.
Learning in CAS [1]
Constraints
The collective goal could be accomplished through competition and/or
cooperation
The system could be heterogeneous or homogeneous
The number of agents is not bounded (openness)
Distributed control – i.e. no central authority exists
[1] Mirko D'Angelo et al. "On learning in collective self-adaptive systems: state of practice and a 3D framework". In: ACM, 2019, pp. 13–24. url: https://doi.org/10.1109/SEAMS.2019.00012
Learning in CAS: Challenges
CASs are partially observable
Each agent can only perceive a part of the system through its sensors.
Learning in CASs makes the environment non-stationary
Each agent learns concurrently =⇒ from the point of view of a single agent, the
environment is in continuous evolution.
The curse of dimensionality – MAS combinatorial nature
When we have to deal with a large number of agents, the overall
state-action space grows exponentially with the number of agents, so
a central learner cannot solve the learning problem.
Learning in CAS: Challenges
Multi-Agent credit assignment problem
Typically, in CAS, a global reward function exists. But it is hard to
understand the influence of a single agent on the overall collective
behaviour.
Sample efficiency
Action and state spaces are very large in CASs =⇒ the problem of
sample efficiency (i.e. how many samples does RL need to reach a
good policy?) arises, as in the Deep Learning context.
On sample efficiency: Example of learning time
A present-day problem . . .
Nowadays, Deep Learning techniques are used to train complex neural
networks
When applied to Reinforcement Learning, the training phase requires
millions of interactions for an agent to learn
In the multi-agent setting it is even worse . . .
In the previous example [2], 30 agents were trained over 450k games!
[2] Max Jaderberg et al. "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning". In: CoRR abs/1807.01281 (2018). arXiv: 1807.01281. url: http://arxiv.org/abs/1807.01281
On scalability: Single Agent Learner Example
Learning setting
A central agent (i.e. in the cloud? In a server?) sees the entire system
A standard tabular RL algorithm creates a table of size |S × A|
But the system-wide action space is the Cartesian product of the agents'
action spaces, so the cardinality of A is |A_i|^N (assuming identical individual action spaces A_i)
With 100 agents and 10 possible actions each, we already reach an
action space with more joint actions than there are particles in the universe
(∼ 10^80).
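A quick, purely illustrative sanity check of this count:

```python
# Joint action-space size for a centralised tabular learner (illustrative arithmetic).
n_agents, actions_per_agent = 100, 10
joint_actions = actions_per_agent ** n_agents    # |A_i|^N = 10^100
print(joint_actions > 10 ** 80)                  # True: more joint actions than ~10^80 particles
```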
Learning in CAS: A Motivating Example
Robot Collision Avoidance [3]
https://sites.google.com/view/drlmaca
[3] Pinxin Long et al. "Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning". In: IEEE, 2018, pp. 6252–6259. url: https://doi.org/10.1109/ICRA.2018.8461113
Learning in CAS: Toolkits
MAgent [4]
https://github.com/geek-ai/MAgent
[4] Lianmin Zheng et al. "MAgent: A many-agent reinforcement learning platform for artificial collective intelligence". In: 2018
Learning in CAS: today's focus
Constraints
Learning in cooperative systems: i.e. each entity shares the same
reward function
Learning in homogeneous systems: i.e. each entity is interchangeable
with any other
We do not consider partial observability as a core problem
Homogeneous systems: implications
+ The optimal policy is the same for the whole system
+ During the learning phase, the system can use different policies (e.g.
to mitigate the sample-efficiency problem)
+ We reduce the action space
+ We reduce the non-stationarity problem
Learning in CAS: A simplified model
Each agent is seen as a tuple (M, N)
M is a local Markov decision process (S is an observation space shared
with the entire system)
N is the neighbourhood perceived by the agent
The global system state is unknown
Agents want to learn a policy π(a|s, N) to share with the entire
system
When agents do not consider their neighbours, we call them
Independent learners
When agents consider the neighbours' actions to choose their local action,
we call them Joint action learners
Learning in CAS: Can we use standard RL algorithms?
In a centralised training, decentralised execution setting
Each agent explores the environment with the same policy φ
A central learner improves the policy following the agents' perceptions
When a good policy is found, it is deployed in the concrete system
Then, the central learner is not required anymore
Indeed yes . . .
. . . But this allows only offline learning (CTDE setting)
It practically leads to lazy agents – the exploration part is limited
It does not consider other agents' actions – each agent acts independently
of the others
Independent Learners
Configuration
Each agent concurrently learns an optimal local policy
This should lead to a globally optimal policy as an emergent behaviour
The homogeneity is driven by the same reward function and the same
action/observation spaces
More agents =⇒ more exploration
They do not converge to the global optimum, but good practical performance is
reported in various works: (Tumer and Agogino 2007; Tumer, Agogino,
and David H. Wolpert 2002; Wang et al. 2006)
Implications
+ Scalable: the learning algorithm does not depend on the system size
+ Easy to implement: the algorithm is the same as in the single-agent
context
+ Offline and Online: can be used both at simulation time and at
deployment time (CTDE and DTDE are conceptually supported)
− Increased non-stationarity: the environment dynamics change
continuously as the agents learn new policies
Decentralised Q-Learning [5]
Idea
Each agent is a Q-Learner and treats the other agents as part of
the environment
Considerations
+ Simplest extension of Q-Learning to the Multi-Agent domain
− Needs a Greedy in the Limit of Infinite Exploration (GLIE) policy
to reach good performance
− Complex and application-dependent parameter tuning
[5] Ming Tan. "Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents". In: ed. by Paul E. Utgoff. Morgan Kaufmann, 1993, pp. 330–337. doi: 10.1016/b978-1-55860-307-3.50049-6. url: https://doi.org/10.1016/b978-1-55860-307-3.50049-6
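A minimal sketch of such an independent learner (our own illustration; the class name and parameters are assumptions, and epsilon is kept fixed for brevity where a GLIE policy would decay it over time):

```python
import random
from collections import defaultdict

class IndependentQLearner:
    """Each agent runs its own copy and treats the other agents as part of the environment."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # local (state, action) -> value table
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def act(self, state):
        # Epsilon-greedy; a GLIE schedule would decay self.epsilon towards zero over time.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        target = reward + self.gamma * max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```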
Hysteretic Q-Learning [6]
Idea
Decentralised Q-Learning does not take the non-stationarity problem
into account
An idea could be to apply a correction factor that reduces the effect of
non-stationarity during the concurrent learning phases
Hysteretic Q-Learning uses a hysteresis heuristic to handle this
problem
It assumes an optimistic behaviour, giving more weight to good
actions than to bad ones (which may be due to the exploration of nearby agents)
[6] Laëtitia Matignon et al. "Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams". In: IEEE, 2007, pp. 64–69. url: https://doi.org/10.1109/IROS.2007.4399095
Hysteretic Q-Learning
Implementation
Use two learning rates, α and β, with α > β
The Q-table update is based on the temporal-difference error of a
state/action pair: δ(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)
When δ > 0 the outcome was better than expected, so we use α as the
learning rate, β otherwise
The Q-table update becomes:
Q(s_t, a_t) ← Q(s_t, a_t) + α δ(s_t, a_t)   if δ(s_t, a_t) > 0
Q(s_t, a_t) ← Q(s_t, a_t) + β δ(s_t, a_t)   otherwise
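A sketch of the hysteretic update, reusing the tabular structure of the previous sketch (q is a defaultdict mapping (state, action) to a value; the names and default rates are ours):

```python
def hysteretic_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, beta=0.01, gamma=0.9):
    """Two learning rates with alpha > beta: positive TD errors are trusted more than negative ones."""
    delta = reward + gamma * max(q[(next_state, a)] for a in actions) - q[(state, action)]
    rate = alpha if delta > 0 else beta   # optimism: shrink the impact of bad outcomes
    q[(state, action)] += rate * delta
```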
Hysteretic Q-Learning
Considerations
+ It improves standard Q-Learning performance without adding more
complexity
+ It is currently used in Deep Reinforcement Learning to handle complex
systems (Omidshafiei et al. 2017)
− It is a heuristic, so it does not have any theoretical guarantees
Other extensions follow this idea and suffer from the same pros and
cons. The most famous are Distributed Q-Learning (Lauer et al.
2000), Lenient Q-Learning (Panait et al. 2006), and Win-Or-Learn
Fast (Bowling et al. 2002)
QD-Learning [7]
Idea
Each agent knows its local actions and can communicate only with a
restricted neighbourhood
The Q update depends on the Q-values of the neighbours
To do that, the system maintains Q-tables of different time steps;
Q-tables are indexed as Q^i_t (i.e. the Q-table of agent i
at time step t)
It uses innovation and consensus terms
[7] Soummya Kar et al. "QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations". In: IEEE Trans. Signal Process. 61.7 (2013), pp. 1848–1862. url: https://doi.org/10.1109/TSP.2013.2241057
QD-Learning
Update rule
consensus(s_t, a_t) = Σ_{i∈N} (Q^me_t(s_t, a_t) − Q^i_t(s_t, a_t))
innovation(s_t, a_t) = r_t + γ max_{a'} Q^me_t(s_{t+1}, a') − Q^me_t(s_t, a_t)
innovation is the standard Q-learning temporal-difference term
Finally, each agent updates its own table as:
Q^me_{t+1}(s_t, a_t) = Q^me_t(s_t, a_t) − β · consensus(s_t, a_t) + α · innovation(s_t, a_t)
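A sketch of this update for a single agent (our illustration; q_mine and q_neighbours are assumed to be dictionaries from (state, action) to values, the latter holding the neighbours' tables at the current step):

```python
def qd_update(q_mine, q_neighbours, state, action, reward, next_state, actions,
              alpha=0.1, beta=0.01, gamma=0.9):
    """One QD-Learning step: innovation (local TD error) plus consensus with the neighbours."""
    innovation = (reward
                  + gamma * max(q_mine.get((next_state, a), 0.0) for a in actions)
                  - q_mine.get((state, action), 0.0))
    consensus = sum(q_mine.get((state, action), 0.0) - q_n.get((state, action), 0.0)
                    for q_n in q_neighbours)
    q_mine[(state, action)] = (q_mine.get((state, action), 0.0)
                               - beta * consensus + alpha * innovation)
```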
Discussion
It is shown that it converges asymptotically to an optimal policy . . .
. . . but only under constraints on network connectivity
. . . with full state observability
. . . with specific constraints on α and β
. . . with a predefined neighbourhood graph
Joint Action Learners
From Independent to Joint actions
Independent learners tend to make the environment highly
non-stationary
Furthermore, they do not consider other agents ← collective behaviour
only through emergence (QD-Learning is an exception)
Ideally, we want to reduce non-stationarity and to consider other
agents by creating policies that observe other agents' actions (typically
denoted π_{i,N})
This does not scale with the agent population.
A possibility to reduce the space is to consider only the neighbours (a
typical CAS setting)
Neighbourhood Q-Learning [8]
Q-Learning mixed with joint action spaces
Each agent improves a Q-function that also considers the actions taken by its
neighbourhood (a_N)
A way to handle this is to consider the extended state k_t = (s_t, a_N)
Q(k_t, a_t) ← Q(k_t, a_t) + α (r_t + γ max_{a'} Q(k_{t+1}, a') − Q(k_t, a_t))
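A sketch of the extended-state bookkeeping (our illustration; sorting the neighbours' actions is just one way to obtain a hashable, order-independent key):

```python
from collections import defaultdict

q = defaultdict(float)   # (extended state, action) -> value

def extended_state(obs, neighbour_actions):
    """k = (s, a_N): local observation plus the neighbours' last actions."""
    return (obs, tuple(sorted(neighbour_actions)))

def nq_update(obs, neigh_acts, action, reward, next_obs, next_neigh_acts,
              actions, alpha=0.1, gamma=0.9):
    k, k_next = extended_state(obs, neigh_acts), extended_state(next_obs, next_neigh_acts)
    target = reward + gamma * max(q[(k_next, a)] for a in actions)
    q[(k, action)] += alpha * (target - q[(k, action)])
```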
Implications
+ Considering neighbourhood actions =⇒ reduces the non-stationarity
problem
− The Q-table depends on the agent's neighbourhood =⇒ its size
increases exponentially with the neighbourhood size
[8] Alexander Zai et al. Deep reinforcement learning in action. Manning Publications, 2020
Just scratching the surface :)
A recap
The approaches shown only partially handle the collective learning problem
(scalability, cooperation, non-stationarity)
They are used nowadays, but in combination with other approaches
(neural networks, replay buffers, difference rewards) . . .
. . . But, to date, no theoretical solution exists for large-scale
collective learning
Note: we do not cover the game-theoretic background ← it is
currently the main interest in the literature (problem: those works
consider only a few agents)
Advanced Topics
State of the art: on advanced topics
Deep Learning as a mechanism to solve complex tasks
Nowadays, Deep Learning architectures are typically used in MARL
(e.g. in MAgent: a random baseline vs. a deep-learning policy)
For handling non-stationarity, actor-critic methodologies are used
For handling partial observability, recurrent networks are applied
For handling large state/action spaces, deep architectures are
considered
− These are all heuristics =⇒ an approach might not work in a different
application
On Actor-Critic: Policy Gradient Methods
What we have seen so far are value-based methods =⇒ they learn a value
function that guides the agent's policy
Why can we not learn the policy directly?
In some cases, the value function could be very complex while the
policy is simple
Furthermore, value-based methods lead to deterministic policies, but in
some cases only a stochastic policy is optimal (e.g. in
Rock-Paper-Scissors)
On Actor-Critic: Policy Gradient Methods
Idea
Optimise the policy π(a|s, θ) w.r.t. the expected return by gradient
ascent (or, equivalently, descent on −J).
π(a|s, θ) is read as: the probability of action a given the state s
and the weights θ
Generally, the objective function is J(θ) = E[ Σ_{t=0}^{T} γ^t R_t ]
The weights are then updated as θ_{t+1} = θ_t + α ∇J(θ_t) =⇒
standard stochastic gradient ascent
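A minimal REINFORCE-style sketch on a single-state (bandit) task with a softmax policy; the toy environment and every name below are assumptions made for illustration:

```python
import math, random

theta = [0.0, 0.0, 0.0]          # softmax preferences, one per action
true_reward = [1.0, 0.2, -0.5]   # toy environment: expected reward of each action

def policy(params):
    exps = [math.exp(p) for p in params]
    z = sum(exps)
    return [e / z for e in exps]                            # pi(a | theta)

for _ in range(2000):
    probs = policy(theta)
    action = random.choices(range(3), weights=probs)[0]
    ret = true_reward[action] + random.gauss(0, 0.1)        # sampled return G
    for a in range(3):                                      # grad log pi = 1{a=action} - pi(a)
        theta[a] += 0.05 * ret * ((1.0 if a == action else 0.0) - probs[a])

print(max(range(3), key=lambda a: policy(theta)[a]))        # should print 0, the best action
```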
Considerations
+ Convergence proofs to a local optimum
− Tends to have high variance in the gradient estimates
On Actor-Critic: Learning description
Idea
Combine Policy Gradient and Value-Based methods
The Critic (the value function) is updated like in standard value-based
approaches (e.g. via TD errors)
The Actor (the policy) is updated using the policy-gradient rule, based
on the Critic's estimate
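A sketch of one actor-critic step in the tabular case (our illustration; actor maps a state to a list of action preferences, critic maps a state to a value estimate):

```python
import math

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(actor, critic, state, action, reward, next_state,
                      alpha_actor=0.05, alpha_critic=0.1, gamma=0.9):
    """The critic's TD error both updates the value estimate and drives the policy gradient."""
    td_error = reward + gamma * critic.get(next_state, 0.0) - critic.get(state, 0.0)
    critic[state] = critic.get(state, 0.0) + alpha_critic * td_error      # critic update
    probs = softmax(actor[state])
    for a in range(len(actor[state])):                                    # actor update
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        actor[state][a] += alpha_actor * td_error * grad_log_pi
```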
On Actor-Critic: MARL settings
Idea
The Actor is a local shared policy that takes observations as input and
returns local actions
The Critic, instead, is placed on a central server
The Critic has a global view of the system =⇒ it can use overall
state information
The Actor is thus influenced by global data, but at deployment time it
does not use them
Considerations
+ A balance between fully decentralised and fully centralised methods
− Requires simulation =⇒ cannot be used in online systems
Handling Partial Observability
As mentioned, CASs are partially observable . . .
. . . But the methods proposed do not handle this issue at all
Recurrent Neural Networks (i.e. neural networks with memory) are
an emerging trend to handle it
In MAS settings, RNNs and hysteretic updates are used together to
handle both non-stationarity and partial observability [9]
− It is not clear whether RNNs are a real solution for partial observability
[9] Shayegan Omidshafiei et al. "Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability". In: vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 2681–2690. url: http://proceedings.mlr.press/v70/omidshafiei17a.html
Handling large scale systems
The number of agents was not a central concern in early MARL works
Independent learners are typically deployed when we have to deal with
many agents . . .
. . . but this makes the learning process noisy
Lately, a novel trend has emerged to handle this problem: the so-called
Mean-Field Reinforcement Learning [10]
[10] Yaodong Yang et al. "Mean Field Multi-Agent Reinforcement Learning". In: vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 5567–5576. url: http://proceedings.mlr.press/v80/yang18d.html
Handling large scale systems: On Mean-Field Reinforcement Learning
The learning problem is set up as a joint-action-learner problem
In Mean-Field theory, the system is modelled as two agents: the main agent
and the rest
In Mean-Field Reinforcement Learning, the rest is summarised by the
mean action taken by the neighbourhood (see the sketch below)
+ It has theoretical guarantees
+ It can easily be used also in CAS
− The approaches proposed are offline: no online adaptation.
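A sketch of how the mean action can be computed and used as part of the Q-table key (our illustration; the one-hot averaging and the rounding are assumptions made to keep the key space small):

```python
def mean_action(neighbour_actions, n_actions):
    """Average of the neighbours' one-hot encoded actions: a fixed-size summary of 'the rest'."""
    counts = [0.0] * n_actions
    for a in neighbour_actions:
        counts[a] += 1.0
    return tuple(round(c / len(neighbour_actions), 2) for c in counts)

# The Q-table is then indexed by (state, own_action, mean_action), so its size no longer
# grows with the number of neighbours, only with the (discretised) mean-action space.
print(mean_action([0, 2, 2], 3))   # (0.33, 0.0, 0.67)
```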
Handling Multi-Agent credit assignment problem
In CAS settings we typically have a global reward function
However, it is hard to understand the influence of a single agent's action on that signal
This is known as the Multi-Agent credit assignment problem
It is tackled by the COIN approach (David H Wolpert et al. 1999) . . .
. . . but it requires running several simulations to understand each agent's
influence
COMA [11] is a novel approach that uses an Actor-Critic setting with a
counterfactual baseline
[11] Jakob N. Foerster et al. "Counterfactual Multi-Agent Policy Gradients". In: AAAI Press, 2018, pp. 2974–2982. url: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193
References
[BV02] Michael H. Bowling and Manuela M. Veloso. “Multiagent
learning using a variable learning rate”. In: Artif. Intell. 136.2
(2002), pp. 215–250. doi:
10.1016/S0004-3702(02)00121-2. url:
https://doi.org/10.1016/S0004-3702(02)00121-2.
[DAn+19] Mirko D’Angelo et al. “On learning in collective self-adaptive
systems: state of practice and a 3D framework”. In: ACM,
2019, pp. 13–24. url:
https://doi.org/10.1109/SEAMS.2019.00012.
[Foe+18] Jakob N. Foerster et al. “Counterfactual Multi-Agent Policy
Gradients”. In: AAAI Press, 2018, pp. 2974–2982. url:
https://www.aaai.org/ocs/index.php/AAAI/AAAI18/
paper/view/17193.
[GD21] Sven Gronauer and Klaus Diepold. “Multi-agent deep
reinforcement learning: a survey”. In: Artificial Intelligence
Review (2021), pp. 1–49.
[Jad+18] Max Jaderberg et al. “Human-level performance in first-person
multiplayer games with population-based deep reinforcement
learning”. In: CoRR abs/1807.01281 (2018). arXiv:
1807.01281. url: http://arxiv.org/abs/1807.01281.
[KMP13] Soummya Kar, José M. F. Moura, and H. Vincent Poor.
“QD-Learning: A Collaborative Distributed Strategy for
Multi-Agent Reinforcement Learning Through Consensus +
Innovations”. In: IEEE Trans. Signal Process. 61.7 (2013),
pp. 1848–1862. url:
https://doi.org/10.1109/TSP.2013.2241057.
[Lon+18] Pinxin Long et al. “Towards Optimally Decentralized
Multi-Robot Collision Avoidance via Deep Reinforcement
Learning”. In: IEEE, 2018, pp. 6252–6259. url:
https://doi.org/10.1109/ICRA.2018.8461113.
[LR00] Martin Lauer and Martin A. Riedmiller. “An Algorithm for
Distributed Reinforcement Learning in Cooperative
Multi-Agent Systems”. In: Morgan Kaufmann, 2000,
pp. 535–542.
[MLL07] Laëtitia Matignon, Guillaume J Laurent, and
Nadine Le Fort-Piat. “Hysteretic q-learning: an algorithm for
decentralized reinforcement learning in cooperative multi-agent
teams”. In: IEEE, 2007, pp. 64–69. url:
https://doi.org/10.1109/IROS.2007.4399095.
[NNN18] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi.
“Deep Reinforcement Learning for Multi-Agent Systems: A
Review of Challenges, Solutions and Applications”. In: CoRR
abs/1812.11794 (2018). arXiv: 1812.11794. url:
http://arxiv.org/abs/1812.11794.
[OH19] Afshin Oroojlooyjadid and Davood Hajinezhad. “A Review of
Cooperative Multi-Agent Deep Reinforcement Learning”. In:
CoRR abs/1908.03963 (2019). arXiv: 1908.03963. url:
http://arxiv.org/abs/1908.03963.
[Omi+17] Shayegan Omidshafiei et al. “Deep Decentralized Multi-task
Multi-Agent Reinforcement Learning under Partial
Observability”. In: vol. 70. Proceedings of Machine Learning
Research. PMLR, 2017, pp. 2681–2690. url: http:
//proceedings.mlr.press/v70/omidshafiei17a.html.
[PSL06] Liviu Panait, Keith Sullivan, and Sean Luke. “Lenient learners
in cooperative multiagent systems”. In: 5th International Joint
Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2006), Hakodate, Japan, May 8-12, 2006. Ed. by
Hideyuki Nakashima et al. ACM, 2006, pp. 801–803. doi:
10.1145/1160633.1160776. url:
https://doi.org/10.1145/1160633.1160776.
[Sos+16] Adrian Sosic et al. “Inverse Reinforcement Learning in Swarm
Systems”. In: CoRR abs/1602.05450 (2016). arXiv:
1602.05450. url: http://arxiv.org/abs/1602.05450.
[TA07] Kagan Tumer and Adrian K. Agogino. “Distributed
agent-based air traffic flow management”. In: IFAAMAS, 2007,
p. 255. url: https://doi.org/10.1145/1329125.1329434.
[Tan93] Ming Tan. “Multi-Agent Reinforcement Learning: Independent
versus Cooperative Agents”. In: ed. by Paul E. Utgoff. Morgan
Kaufmann, 1993, pp. 330–337. doi:
10.1016/b978-1-55860-307-3.50049-6. url: https:
//doi.org/10.1016/b978-1-55860-307-3.50049-6.
[TAW02] Kagan Tumer, Adrian K. Agogino, and David H. Wolpert.
“Learning sequences of actions in collectives of autonomous
agents”. In: ACM, 2002, pp. 378–385. url:
https://doi.org/10.1145/544741.544832.
[WS06] Ying Wang and Clarence W. de Silva. “Multi-robot
Box-pushing: Single-Agent Q-Learning vs. Team Q-Learning”.
In: IEEE, 2006, pp. 3694–3699. url:
https://doi.org/10.1109/IROS.2006.281729.
[WT99] David H Wolpert and Kagan Tumer. “An introduction to
collective intelligence”. In: arXiv preprint cs/9908014 (1999).
[Yan+18] Yaodong Yang et al. “Mean Field Multi-Agent Reinforcement
Learning”. In: vol. 80. Proceedings of Machine Learning
Research. PMLR, 2018, pp. 5567–5576. url:
http://proceedings.mlr.press/v80/yang18d.html.
[ZB20] Alexander Zai and Brandon Brown. Deep reinforcement
learning in action. Manning Publications, 2020.
[Zhe+18] Lianmin Zheng et al. “MAgent: A many-agent reinforcement
learning platform for artificial collective intelligence”. In: 2018.

On Collective Reinforcement Learning

  • 1.
    On Collective ReinforcementLearning Techniques, Challenges, and Opportunities Gianluca Aguzzi gianluca.aguzzi@unibo.it Dipartimento di Informatica – Scienza e Ingegneria (DISI) Alma Mater Studiorum – Università di Bologna 13/12/2021 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 1 / 53
  • 2.
    Outline 1 Introduction 2 FromSingle Agent To Multi-Agent 3 Learning in CAS 4 Independent Learners 5 Joint Action Learners 6 Advanced Topics Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 2 / 53
  • 3.
    Introduction Introduction Motivation Learning is akey mechanism to drive adaptivity Intelligent agents improve their performance with experience We would like to improve adaptiveness with raw experience and, hence We would like to bring Reinforcement Learning methodology in CAS Lecture goals Understanding the challenges related to the Multi-Agent System (MAS) domain Show Techniques, Patterns, and Architecture applied in Collective Systems Hands-on in some practical example of Collective Learning Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 3 / 53
  • 4.
    Introduction Single-Agent Learning What haveyou seen so far . . . Markov Decision Process (MDP) to model the agent-environment interactions Find a learning process that eventually leads to an optimal policy π∗ Q-Learning (in general value-based approaches) as a prominent algorithm to reach converge . . . But plain RL works only under some conditions Reward hypothesis Full environment observability and Markovian property Stationary environment State/action space should be small enough to be stored in memory (otherwise, we should leverage function approximators) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 4 / 53
  • 5.
    Introduction Partial Observable environments Definition Anagent does not have a perfect and complete knowledge of the state of the environment Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 5 / 53
  • 6.
    Introduction Non-stationary environments Definition The environmentmodel (e.g. the random variable associated with it) changes over time. MDP are Stationary by definition . . . . . . But, real-case environment dynamics could change over time (e.g. markets, city traffic, routing networks) Practically, it seems that RL (in particular Temporal Difference (TD) methods) works well even in this case But, we lose convergence guarantees. Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 6 / 53
  • 7.
    From Single AgentTo Multi-Agent From Single-Agent To Multi-Agent Multi-Agent Reinforcement Learning (MARL) Multiple agents learn together the best policy that maximises a long term reward signal. Considerations If multiple agents exist, but only one agent learns by experience, then it is a single agent problem (e.g. single-player videogames) So, MAS + Learning 6 =⇒ MARL, but MARL =⇒ MAS Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 7 / 53
  • 8.
    From Single AgentTo Multi-Agent Stochastic Game S (or Markov games) Extension of MDP to the MAS context Common abstraction in MARL algorithms Definition S is a tuple < N, S, {Ai }i∈{1,...,N}, P, {Ri }i∈{1,...,N} > N is the number of agents (|N| > 1) S is the global environment state Ai is the action state of agent i. A := A1 × · · · × AN is the joint action space P : SxA → P(S) is the state transaction. P is a discrete probabilistic distribution (associate for each s ∈ S a probability) Ri : S × A × S → R is the reward signal Typical time evolution: S0, A0, R1, S1, . . . Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 8 / 53
  • 9.
    From Single AgentTo Multi-Agent Stochastic games: Example Paper Rock Scissor N = 2 A1 = A2 = {Paper, Rock, Scissor} S = { } R = Rock Paper Scissor " # Rock 0, 0 −1, 1 1, −1 Paper 1, −1 0, 0 −1, 1 Scissor −1, 1 1, −1 0, 0 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 9 / 53
  • 10.
    From Single AgentTo Multi-Agent MARL Systems: Task type Cooperative Agents share the same reward function (R1 = · · · = RN) in order to accomplish a collective goal Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 10 / 53
  • 11.
    From Single AgentTo Multi-Agent MARL Systems: Task type Competitive Agents compete with each other to maximise a long term return Board Games, Video games, . . . Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 11 / 53
  • 12.
    From Single AgentTo Multi-Agent MARL Systems: Task type Mixed Agents can both compete and cooperate in order to maximise a global reward function Also called General Sum games Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 12 / 53
  • 13.
    From Single AgentTo Multi-Agent On Cooperative Task Homogeneous Each agent has same capabilities (A1 = · · · = AN) The overall goal is to find the best policy that is the same for each agent (π∗ = π∗ 1 = π∗ 2 = · · · = π∗ N) Heterogeneous Each agent could have different capabilities (in the worst case, A1 6= . . . 6= AN) Each agent has its local policy that should be maximised following the global collective goal Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 13 / 53
  • 14.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Centralised Training and Centralised Execution (CTCE) One agent with a global view of the system (in the cloud? in a server?) Nodes send their perception to that central agent With this perceptions, it creates a global state of the system With the current policy, it chooses the actions that nodes should performance (Control) In the next time step, it evaluates the reward function and updates the policy accordingly (Learn) Are the nodes agents? Used mainly in offline (i.e. at simulation time) setting Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 14 / 53
  • 15.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Centralised Training and Centralised Execution (CTCE) Agent  0 Agent N . . . Environment Learner S, R A S, R A A S, R S, R A Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 15 / 53
  • 16.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Decentralised Training and Decentralised Execution (DTDE) Each node has its local policy/value table They can perceive the environment state (or they can observe a part of it) With the current state, they perform an action following their local policy (Control) In the next time step, they update their policy following a local reward function (Learn) Both used in offline and online (i.e. deployment time) setting Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 16 / 53
  • 17.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Decentralised Training and Decentralised Execution (DTDE) A Agent  0 Agent N . . . Environment S A S L L Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 17 / 53
  • 18.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Centralised Training and Decentralised Execution (CTDE) A simulation time learning online execution pattern Simulation time Each node follows the typical ot, at, ot+1, at+1, . . . trajectory using a local policy After an episode, this trajectory (or something derived from it) will be sent to a central learner It, using a global view of the system, improves the policies of the agents An then the policies will be shared with the agents Execution time Each agent has the local policy distilled during the simulation time With it, they act in the environment Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 18 / 53
  • 19.
    From Single AgentTo Multi-Agent MARL Systems: Learning Scheme Decentralised Training and Decentralised Execution (DTDE) Agent  0 Agent N . . . Environment Learner A A S, R S, R S, R S, R Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 19 / 53
  • 20.
    From Single AgentTo Multi-Agent MARL Systems: Successful Applications OpenAI Hide And Seek: https://openai.com/blog/emergent-tool-use/ Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 20 / 53
  • 21.
    From Single AgentTo Multi-Agent MARL Systems: Successful Applications Capture The Flag: https: //deepmind.com/blog/article/capture-the-flag-science Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 21 / 53
  • 22.
    Learning in CAS CollectiveAdaptive System (CAS) Recap Systems composed by a (possibly) large set of components executing a collective task relying on component interactions and showing inherent adaptiveness. Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 22 / 53
  • 23.
    Learning in CAS Learningin CAS 1 Constraints The collective goal could be accomplished through competition and/or cooperation The system could be heterogeneous or homogeneous The agent number is not bounded (openness) Distributed control – i.e. no central authority exists 1 Mirko D’Angelo et al. “On learning in collective self-adaptive systems: state of practice and a 3D framework”. In: ACM, 2019, pp. 13–24. url: https://doi.org/10.1109/SEAMS.2019.00012 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 23 / 53
  • 24.
    Learning in CAS Learningin CAS: Challenges CASs are partial observable Each agent could only perceive a part of the system through its sensors. Learning in CASs make the environments non-stationary Each agent learns concurrently =⇒ by the eye of an agent the environment is in continuous evolution. The curse of dimensionality – MAS combinatorial nature When we have to deal with a large number of agents, the overall state-action space is increasing exponentially to the number of agents — so a central learner cannot solve the learning problem. Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 24 / 53
  • 25.
    Learning in CAS Learningin CAS: Challenges Multi-Agent credit assignment problem Typically, in CAS, a global reward function exists. But it is hard to understand the influence of a single agent to the overall collective behaviour. Sample efficiency Action-space and state-space are very large in CASs =⇒ the problem of sample efficiency (i.e. how many samples does the RL need to reach a good policy?) arise as in the Deep Learning context. Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 25 / 53
  • 26.
    Learning in CAS Onsample efficiency: Example of learning time A Nowadays problem . . . Nowadays, Deep Learning Techniques are used to train complex neural networks When applied to Reinforcement Learning, the training phase requires millions of interactions for an agent to learn In the Multi-Agent Setting is even worst . . . In the previous example2, they train 30 agents in 450k games! 2 Max Jaderberg et al. “Human-level performance in first-person multiplayer games with population-based deep reinforcement learning”. In: CoRR abs/1807.01281 (2018). arXiv: 1807.01281. url: http://arxiv.org/abs/1807.01281 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 26 / 53
  • 27.
    Learning in CAS Onscalability: Single Agent Learner Example Learning setting A central agent (i.e. in the cloud? In a server?) sees the entire system Standard RL algorithm (tabular) create a table with the size of |S × A| But the system-wide action space is the cartesian product of the agent action-space, so the A cardinality is |A|N With 100 agents with 10 possible actions, we have already reached an action space with more actions then the particles in the universe (∼ 1080 ®). Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 27 / 53
  • 28.
    Learning in CAS Learningin CAS: A Motivating Example Robot Collision Avoidance 3 https://sites.google.com/view/drlmaca 3 Pinxin Long et al. “Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning”. In: IEEE, 2018, pp. 6252–6259. url: https://doi.org/10.1109/ICRA.2018.8461113 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 28 / 53
  • 29.
    Learning in CAS Learningin CAS: Toolkits MAgent 4 https://github.com/geek-ai/MAgent 4 Lianmin Zheng et al. “MAgent: A many-agent reinforcement learning platform for artificial collective intelligence”. In: 2018 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 29 / 53
  • 30.
    Learning in CAS Learningin CAS: Our today focus Constraints Learning in cooperative systems: i.e. each entity shares the same reward functions Learning in homogeneous systems: i.e. each entity is interchangeable with each other o We do not consider partial observability as a core problem Homogenous system: Implications The optimal policy is the same for the whole system During the learning phases, the system can use different policies (e.g. to reduce sample efficiency) We reduce the action space We reduce the non-stationarity problem Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 30 / 53
  • 31.
    Learning in CAS Learningin CAS: A simplified model Each agent is seen as a tuple (M, N) M is a local Markov decision process (S is an observation space in common with the entire system) N is the neighbourhood perceived by an agents The global state system state is unknown Agents want to learn a policy π(a|s, N) to share with the entire system When agents do not consider neighbours, we call the system as Independent learners When agents consider neighbours actions to choose their local action, we call the system Joint action learners Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 31 / 53
  • 32.
    Learning in CAS Learningin CAS: Can we use standard RL algorithms? In a centralised training decentralised execution settings Each agent explores the environment with the same policy φ Central learners improve the policy following the agents’ perceptions When a good policy is found, it is deployed in a concrete system Then, the central learner is not required anymore Indeed yes . . . . . . But this allows only offline learning (CTDE setting) This practically lead to lazy agent – the exploration part is limited Does not consider other agent actions – each agent act independently from each other Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 32 / 53
  • 33.
    Independent Learners Independent Learners Configuration Eachagent concurrently learns an optimal local policy This led to the global optimal policies as an emergent behaviour The homogeneity is driven by the same reward function and the same action/observation spaces More agent =⇒ more exploration Not converge to global optimum, but good pratical performance are founded in various works: (Tumer and Agogino 2007; Tumer, Agogino, and David H. Wolpert 2002; Wang et al. 2006) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 33 / 53
  • 34.
    Independent Learners Independent Learners Implications Scalable: the learning algorithm does not depend on the system size Easy to implement: the algorithm is the same developed for single agent context Offline and Online: Can be used both at simulation time or at deployment time (CTDE, DTDE are conceptually supported) Increase non-stationarity: the environment dynamics change continuously as the agent learns a new policy Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 34 / 53
  • 35.
    Independent Learners Decentralised Q-Learning5 Idea Eachagent is a Q-Learner and he does consider other agents as a part of the environment Considerations Simplest extension of Q-Learning into the Multi-Agent domain Need to use Greedy in the Limit of Infinite Exploration (GLIE) policy to reach good performance Complex and application-dependant parameters tuning 5 Ming Tan. “Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents”. In: ed. by Paul E. Utgoff. Morgan Kaufmann, 1993, pp. 330–337. doi: 10.1016/b978-1-55860-307-3.50049-6. url: https://doi.org/10.1016/b978-1-55860-307-3.50049-6 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 35 / 53
  • 36.
    Independent Learners Hysteretic Q-Learning6 Idea DecentralisedQ-Learning does not take into consideration non-stationarity problem An idea could be to apply a correction factor that reduces the non-stationarity during the concurrent learning phases Hysteretic Q-Learning uses a hysteresis heuristic to handle this problem It supposes an optimistic behaviour, giving more weight to good action then the bad actions (due to the exploration of nearby agents) 6 Laëtitia Matignon et al. “Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams”. In: IEEE, 2007, pp. 64–69. url: https://doi.org/10.1109/IROS.2007.4399095 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 36 / 53
  • 37.
    Independent Learners Hysteretic Q-Learning Implementation Usetwo learning factors, α and β, with α β The Q-table update is evaluated to consider the goodness of an action/space pair: δ(s, a) = rt + γ ∗ argmaxa0 Q(s0, a0) − Q(st, at) When δ 0 it means that we improve our experience, so we use α as learning rate, β otherwise The Q-table update became: Q(st, at) = ( Q(st, at) + αδ(st, at) if(δ(st, at)) 0 Q(st, at) + βδ(st, at) otherwise Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 37 / 53
  • 38.
    Independent Learners Hysteretic Q-Learning Idea It improves standard Q-Learning performance without adding more complexity It is currently used in Deep Reinforcement Learning to handle complex systems (Omidshafiei et al. 2017) It is a heuristic, so it does not have any theoretical guarantees Other extensions follow this idea and suffer from the same pros and cons. The most famous are Distributed Q-Learning (Lauer et al. 2000), Lenient Q-Learning (Panait et al. 2006), Win-Or-Learn Fast (Bowling et al. 2002) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 38 / 53
  • 39.
    Independent Learners QD-Learning7 Idea Each agentknow its local actions and can communicate only with a restricted neighbourhood The Q update depends on the Q-values of the neighbours To do that, the system mantains Q-tables of different time steps. Q-tables are indexed with: Qi t (in English: the Q-table of the agent i at the time step t) It uses innovation and consensus 7 Soummya Kar et al. “QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations”. In: IEEE Trans. Signal Process. 61.7 (2013), pp. 1848–1862. url: https://doi.org/10.1109/TSP.2013.2241057 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 39 / 53
  • 40.
    Independent Learners QD-Learning Update rule consensus(st,at) = P i∈N (Qme t (st, at) − Qi t(st, at)) innovation(st, at) = (rt + γ ∗ argmaxa0 Qme t (s0, a0) − Qmet(st, at)) innovation is the standard Q advantage evaluation Finally, each agent update their table as: Qme t+1(st, at) = Qme t − β ∗ consensus(st, at) + α ∗ innovation(st, at) Discussion It is shown that it converges asymptotically to an optimal policy . . . . . . under constraints of networks connection . . . with full state observability . . . with specific constraints on α and β . . . with a predefined neighbourhood graph Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 40 / 53
  • 41.
    Joint Action Learners JointAction Learners From Independent to Joint actions Independent learners tend to make the environment highly non-stationarity Furthermore, they do not consider other agents ← collective behaviour through emergence (QD-learning is an exception) Ideally, we can reduce non-stationarity and we want to consider other agents by creating policies that observe other agents actions (typically denoted by: πi,N ) This does not scale with the agent population. A possibility to reduce the space is to consider only the neighbours (in a typically CAS settings) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 41 / 53
  • 42.
    Joint Action Learners NeighbourhoodQ-Learning8 Q-Learning mixed with Joint action spaces Each agent improve a Q-function consider also the action taken by the neighbourhood (aN ) A way to handle that, is to consider the state as kt = (s, aN ) Q(kt, a) = Q(kt, a) + α ∗ (rt + γ ∗ argmaxa0 Q(k0, a0)) Implications Consider neighbourhood actions =⇒ reduce the non-stationarity problem Q-table depends on agent neighbourhood =⇒ the Q-table size increase exponentially with the neighbourhood 8 Alexander Zai et al. Deep reinforcement learning in action. Manning Publications, 2020 Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 42 / 53
  • 43.
    Joint Action Learners Justa scratch to the surface :) A recap The approaches handle only partially the collective learning problem (scalability, cooperative, non-stationarity) They are used nowadays but in combination with other approaches (neural networks, reply buffer, difference reward) . . . . . . But today no theoretical solutions exists yet for large scale collective learning Note: we do not cover the game theoretical background ← it is current the main interest in literature (problem =⇒ these works consider few agents) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 43 / 53
  • 44.
    Advanced Topics State ofthe art: on advanced topics Deep Learning as a mechanism to solve complex tasks Nowadays, Deep Learning architectures are typically used in MARL — example in MAgent (random) (deep learning policy) For handling non-stationarity, actor-critic methodologies are used For handling partial-observability, recurrent networks are applied For handling large-state/action space, deep architectures are considered are all heuristic =⇒ some approach could do not work in different application Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 44 / 53
  • 45.
    Advanced Topics On Actor-Critic:Policy Gradient Methods What we see so far are value-based methods =⇒ learn value function that guide the agent’s policy ­ Why we cannot learn the policy directly? In some cases, the value function could be very complex but not the policy one Furthermore, value-based methods led to determistic policy, but in some cases only a stochastic policy is the best one (e.g. in Rock paper scissor) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 45 / 53
  • 46.
    Advanced Topics On Actor-Critic:Policy Gradient Methods Idea Optimizing the policy (π(a|s, θ)) w.r.t the expected return by gradient descent (or ascent). π(a|s, θ) is read as: the probability of the action a knowing the state s and the weight θ Generally, the loss function is described as: J(θ) = E PT i=0(γ ∗ Ri ) The weight then, are updated as: θt+1 = θt + α∇J(θt) =⇒ standard Stachocastic gradient descent (ascent) Considerations Convergence proof to local optimal Tends to have high bias (underfitting) Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 46 / 53
  • 47.
    Advanced Topics On Actor-Critic:Learning description Idea Combine Policy Gradient and Value-Based Methods The Critic (the value function) is updated like standard value approaches (e.g. TD errors) The Actor (the policy) is updated using the policy gradient update based on the Critic estimation Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 47 / 53
  • 48.
    Advanced Topics On Actor-Critic:MARL settings Idea The Actor is a local shared policy that takes observations as input and returns local actions The Critic instead, is placed in a central server The Critic has a global view of the system =⇒ it could use overall state information The Actor is then influenced by global data, but at deployment time it does not use them Considerations A balanced between fully decentralised and centralised method Require simulations =⇒ cannot be used in online systems Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 48 / 53
Advanced Topics
Handling Partial Observability
As mentioned, CAS are partially observable . . .
. . . But the methods proposed so far do not handle this issue at all
Recurrent Neural Networks (i.e. Neural Networks with memory) are an emerging trend to handle this issue
In MAS settings, RNNs and Hysteretic updates are used together to handle both non-stationarity and partial observability 9
It is not clear whether RNNs are a full solution to Partial Observability
9 Shayegan Omidshafiei et al. “Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability”. In: vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 2681–2690. url: http://proceedings.mlr.press/v70/omidshafiei17a.html
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 49 / 53
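A sketch of a recurrent Q-network in this spirit, where a GRU cell carries a hidden state across observations so that the Q-values are conditioned on the agent's history; layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-values conditioned on the observation history through a GRU hidden state (a sketch)."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)
        self.hidden_dim = hidden_dim

    def initial_hidden(self, batch_size=1):
        return torch.zeros(batch_size, self.hidden_dim)

    def forward(self, obs, hidden):
        # obs: (batch, obs_dim), hidden: (batch, hidden_dim); the returned hidden state is
        # carried to the next step, so the Q-values depend on the whole observation history.
        hidden = self.rnn(obs, hidden)
        return self.q_head(hidden), hidden
```

A hysteretic variant would then apply two learning rates to the resulting TD error: a larger one when the error is positive and a smaller one when it is negative.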
Advanced Topics
Handling large-scale systems
The number of agents was not a central concern in the first MARL works
Independent learners are typically deployed when we have to deal with many agents . . .
. . . but this makes the learning process noisy
Lately, a novel trend has emerged to handle this problem, the so-called Mean-Field Reinforcement Learning10
10 Yaodong Yang et al. “Mean Field Multi-Agent Reinforcement Learning”. In: vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 5567–5576. url: http://proceedings.mlr.press/v80/yang18d.html
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 50 / 53
Advanced Topics
Handling large-scale systems: On Mean-Field Reinforcement Learning
The learning problem is set up as in joint action learners
In Mean-Field approaches, the system is modelled as two agents: the main agent and “the rest”
In Mean-Field Reinforcement Learning, “the rest” is summarised as the mean action taken by the neighbourhood
It has theoretical guarantees
It can easily be applied also in CAS
The approaches proposed are offline (no online adaptation).
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 51 / 53
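A simplified tabular sketch of the idea (not the published algorithm, which uses a Boltzmann policy and a soft mean-field value): each agent keys its Q-table on (s, a, ā), where ā is the discretised mean of the neighbours' one-hot actions. All names are illustrative.

```python
import numpy as np
from collections import defaultdict

def mean_neighbour_action(neighbour_actions, n_actions):
    """Mean action a_bar: the average of the neighbours' one-hot encoded actions, discretised for keys."""
    one_hot = np.zeros(n_actions)
    for a in neighbour_actions:
        one_hot[a] += 1.0
    return tuple(np.round(one_hot / max(len(neighbour_actions), 1), 2))

class MeanFieldQLearner:
    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)     # Q[(s, a, a_bar)]
        self.n_actions, self.alpha, self.gamma = n_actions, alpha, gamma

    def update(self, s, a, a_bar, reward, s_next, a_bar_next):
        # Greedy bootstrap over the agent's own action, keeping the neighbours' mean action fixed
        best_next = max(self.q[(s_next, a2, a_bar_next)] for a2 in range(self.n_actions))
        td = reward + self.gamma * best_next - self.q[(s, a, a_bar)]
        self.q[(s, a, a_bar)] += self.alpha * td
```

The key point of the design is that the table grows with the number of distinct mean actions rather than with the full joint action space, which is what makes the approach usable with many agents.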
Advanced Topics
Handling the Multi-Agent credit assignment problem
In CAS settings, we typically have a global reward function
It is hard to understand the influence of each agent's action on that signal
This is known as the Multi-Agent credit assignment problem
It is tackled by the COIN approach (David H. Wolpert et al. 1999) . . .
. . . but it requires running several simulations to understand each agent's influence
COMA11 is a novel approach that uses an Actor-Critic setting with a counterfactual baseline
11 Jakob N. Foerster et al. “Counterfactual Multi-Agent Policy Gradients”. In: AAAI Press, 2018, pp. 2974–2982. url: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021 52 / 53
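A small sketch of the counterfactual advantage at the heart of COMA: the value of the taken joint action is compared against a baseline that marginalises the agent's own action under its policy, while keeping the other agents' actions fixed. The arrays are assumed to be produced by a centralised critic and by the agent's policy; the numbers at the end are purely illustrative.

```python
import numpy as np

def counterfactual_advantage(q_joint, policy_i, taken_action_i):
    """Counterfactual advantage for agent i (a sketch).

    q_joint:  shape (n_actions_i,), Q(s, (a_-i, a_i')) for every alternative action a_i'
              of agent i, with the other agents' actions a_-i held fixed.
    policy_i: shape (n_actions_i,), pi_i(a_i' | obs_i).
    """
    baseline = float(np.dot(policy_i, q_joint))      # expected value if agent i follows its policy
    return float(q_joint[taken_action_i]) - baseline

# Illustrative numbers: the chosen action beats the agent's expected contribution by 0.9
q = np.array([1.0, 2.5, 0.5])
pi = np.array([0.2, 0.5, 0.3])
advantage = counterfactual_advantage(q, pi, taken_action_i=1)   # 2.5 - 1.6 = 0.9
```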
References
References I
[BV02] Michael H. Bowling and Manuela M. Veloso. “Multiagent learning using a variable learning rate”. In: Artif. Intell. 136.2 (2002), pp. 215–250. doi: 10.1016/S0004-3702(02)00121-2. url: https://doi.org/10.1016/S0004-3702(02)00121-2.
[DAn+19] Mirko D’Angelo et al. “On learning in collective self-adaptive systems: state of practice and a 3D framework”. In: ACM, 2019, pp. 13–24. url: https://doi.org/10.1109/SEAMS.2019.00012.
[Foe+18] Jakob N. Foerster et al. “Counterfactual Multi-Agent Policy Gradients”. In: AAAI Press, 2018, pp. 2974–2982. url: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References II
[GD21] Sven Gronauer and Klaus Diepold. “Multi-agent deep reinforcement learning: a survey”. In: Artificial Intelligence Review (2021), pp. 1–49.
[Jad+18] Max Jaderberg et al. “Human-level performance in first-person multiplayer games with population-based deep reinforcement learning”. In: CoRR abs/1807.01281 (2018). arXiv: 1807.01281. url: http://arxiv.org/abs/1807.01281.
[KMP13] Soummya Kar, José M. F. Moura, and H. Vincent Poor. “QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations”. In: IEEE Trans. Signal Process. 61.7 (2013), pp. 1848–1862. url: https://doi.org/10.1109/TSP.2013.2241057.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References III
[Lon+18] Pinxin Long et al. “Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning”. In: IEEE, 2018, pp. 6252–6259. url: https://doi.org/10.1109/ICRA.2018.8461113.
[LR00] Martin Lauer and Martin A. Riedmiller. “An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems”. In: Morgan Kaufmann, 2000, pp. 535–542.
[MLL07] Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. “Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams”. In: IEEE, 2007, pp. 64–69. url: https://doi.org/10.1109/IROS.2007.4399095.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References IV
[NNN18] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. “Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications”. In: CoRR abs/1812.11794 (2018). arXiv: 1812.11794. url: http://arxiv.org/abs/1812.11794.
[OH19] Afshin Oroojlooyjadid and Davood Hajinezhad. “A Review of Cooperative Multi-Agent Deep Reinforcement Learning”. In: CoRR abs/1908.03963 (2019). arXiv: 1908.03963. url: http://arxiv.org/abs/1908.03963.
[Omi+17] Shayegan Omidshafiei et al. “Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability”. In: vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 2681–2690. url: http://proceedings.mlr.press/v70/omidshafiei17a.html.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References V
[PSL06] Liviu Panait, Keith Sullivan, and Sean Luke. “Lenient learners in cooperative multiagent systems”. In: 5th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2006), Hakodate, Japan, May 8-12, 2006. Ed. by Hideyuki Nakashima et al. ACM, 2006, pp. 801–803. doi: 10.1145/1160633.1160776. url: https://doi.org/10.1145/1160633.1160776.
[Sos+16] Adrian Sosic et al. “Inverse Reinforcement Learning in Swarm Systems”. In: CoRR abs/1602.05450 (2016). arXiv: 1602.05450. url: http://arxiv.org/abs/1602.05450.
[TA07] Kagan Tumer and Adrian K. Agogino. “Distributed agent-based air traffic flow management”. In: IFAAMAS, 2007, p. 255. url: https://doi.org/10.1145/1329125.1329434.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References VI
[Tan93] Ming Tan. “Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents”. In: ed. by Paul E. Utgoff. Morgan Kaufmann, 1993, pp. 330–337. doi: 10.1016/b978-1-55860-307-3.50049-6. url: https://doi.org/10.1016/b978-1-55860-307-3.50049-6.
[TAW02] Kagan Tumer, Adrian K. Agogino, and David H. Wolpert. “Learning sequences of actions in collectives of autonomous agents”. In: ACM, 2002, pp. 378–385. url: https://doi.org/10.1145/544741.544832.
[WS06] Ying Wang and Clarence W. de Silva. “Multi-robot Box-pushing: Single-Agent Q-Learning vs. Team Q-Learning”. In: IEEE, 2006, pp. 3694–3699. url: https://doi.org/10.1109/IROS.2006.281729.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021
References
References VII
[WT99] David H. Wolpert and Kagan Tumer. “An introduction to collective intelligence”. In: arXiv preprint cs/9908014 (1999).
[Yan+18] Yaodong Yang et al. “Mean Field Multi-Agent Reinforcement Learning”. In: vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 5567–5576. url: http://proceedings.mlr.press/v80/yang18d.html.
[ZB20] Alexander Zai and Brandon Brown. Deep reinforcement learning in action. Manning Publications, 2020.
[Zhe+18] Lianmin Zheng et al. “MAgent: A many-agent reinforcement learning platform for artificial collective intelligence”. In: 2018.
Aguzzi (DISI, Univ. Bologna) On Collective Reinforcement Learning 13/12/2021