Multi-Agent Reinforcement Learning
Seolho Kim
Contents
● Introduction
○ What is Multi-agent RL?
● Background
○ (Single-agent) Reinforcement Learning
○ Game Theory
● Multi-Agent Reinforcement Learning
○ Why is multi-agent RL hard to train?
○ Baseline
○ Cooperation
○ Zero-Sum
○ General-Sum
● References
Introduction
What is Multi-agent RL?
- Reinforcement Learning is a promising way to solve sequential decision-making
problems.
source : https://now.sen.go.kr/2016/12/03.php
source : https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Introduction
What is Multi-agent RL?
- We can expand it by adding multiple agents to solve more complex problems.
source : https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
source : https://www.youtube.com/watch?v=kopoLzvh5jY
Introduction
What is Multi-agent RL?
(Diagram: a grid with axes "Problem size" and "Number of Agents". Tabular Solution Methods, e.g. Game Theory and Dynamic Programming, cover small problems; Approximate Solution Methods, e.g. Monte Carlo and TD learning, cover larger ones.)
Reinforcement Learning
- Reinforcement learning is a problem, a class of solution methods that work well on the problem, and
the field that studies this problem and its solution methods.
- Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize
a numerical reward signal. The learner is not told which actions to take, but instead must discover
which actions yield the most reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward but also the next situation and, through that, all
subsequent rewards.
Background
Reinforcement Learning
source : Sutton, Reinforcement learning: An introduction
Background
Background
Reinforcement Learning
- RL is framed as an infinite-horizon discounted Markov Decision Process (MDP)
- (infinite horizon) MDP
- Find a policy
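The formulas on this slide were lost in extraction; a standard statement of the tuple and the objective (notation assumed) is:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$$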
source :
https://en.wikipedia.org/wiki/Markov_decision_process
Background
Reinforcement Learning
- Value function
- Action value function
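The definitions did not survive extraction; the standard forms are:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$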
Background
Reinforcement Learning
- value based
- policy based
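The slide's equations were lost; standard examples of the two families are the Q-learning update (value based) and the policy gradient (policy based):

$$Q(s, a) \leftarrow Q(s, a) + \alpha\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a)\big]$$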
Background
Game Theory
- The study of mathematical models of strategic interaction among decision makers, where several
self-interested players must make choices that potentially affect the interests of other players.
- This seminar only covers non-cooperative games with complete information.
Background
Game Theory
- Normal form representation
- A set of players
- All possible strategies for player i
- Utility function for each player
- Goal
- maximizing their own expected utility (payoff)
- depending on their beliefs.
- Assume “All players are rational.”
Background
Game Theory
- Strategies (analogous to policies)
- pure strategies
- Select a single strategy with certainty
- mixed strategies
- Randomize over the set of available actions according to some
probability distribution
- beliefs
Game Theory
- Suppose 2 rational, non-cooperative players
Background
Prisoner's dilemma (each cell shows (row player payoff, column player payoff)):

                    column player
                    A           B
row player   a     (-1,-1)     (-3,0)
             b     (0,-3)      (-2,-2)
Game Theory
- Best Response
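The definition on this slide was lost in extraction; in standard notation, player i's best response to the other players' strategies s_{-i} is:

$$BR_i(s_{-i}) = \arg\max_{s_i \in S_i} u_i(s_i, s_{-i})$$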
Background
Background
Game Theory
- Nash Equilibrium
- If each player has chosen a strategy — an action plan choosing their own actions based on what
has happened so far in the game — and no player can increase their own expected payoff by
changing their strategy while the other players keep theirs unchanged, then the current set of
strategy choices constitutes a Nash equilibrium.
- A strategy profile is a Nash equilibrium if
- Mutual best responses
- Rationality + Correct beliefs
- Every finite game has at least one Nash equilibrium.
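In symbols (standard form; the slide's formula was lost), a strategy profile (s_1^*, ..., s_n^*) is a Nash equilibrium if

$$u_i(s_i^{*}, s_{-i}^{*}) \;\ge\; u_i(s_i, s_{-i}^{*}) \qquad \text{for every player } i \text{ and every } s_i \in S_i$$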
Game Theory
- Find the Nash equilibria
Background
                    column player
                    A           B
row player   a     (5,3)       (1,0)
             b     (0,1)       (2,4)
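A tiny Python helper (illustrative only; the payoff numbers are the ones on this slide) that checks every cell for mutual best responses:

```python
# Find the pure-strategy Nash equilibria of the 2x2 game above.
from itertools import product

payoffs = {  # (row action, column action) -> (row payoff, column payoff)
    ("a", "A"): (5, 3), ("a", "B"): (1, 0),
    ("b", "A"): (0, 1), ("b", "B"): (2, 4),
}
row_strats, col_strats = ("a", "b"), ("A", "B")

def is_nash(r, c):
    row_u, col_u = payoffs[(r, c)]
    # Row player cannot gain by deviating while the column player keeps c ...
    row_ok = all(payoffs[(r2, c)][0] <= row_u for r2 in row_strats)
    # ... and the column player cannot gain by deviating while the row player keeps r.
    col_ok = all(payoffs[(r, c2)][1] <= col_u for c2 in col_strats)
    return row_ok and col_ok

print([rc for rc in product(row_strats, col_strats) if is_nash(*rc)])
# -> [('a', 'A'), ('b', 'B')] : both cells are mutual best responses.
```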
Background
Game Theory
- Extensive form
- The players of a game
- What each player can do at each of their moves
- The payoffs received by every player for every possible combination of moves
- + What each player knows for every move
- + For every player every opportunity they have to move
- Subgame Perfect Equilibrium
- Backward Induction (Bellman Equation)
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- A game in normal form and a game in extensive form can carry the same
information.
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- We can use a value function at each node
Background
Game Theory
- Common Knowledge
- There is common knowledge of p in a group of agents G when all the agents in G know p, they all know
that they know p, they all know that they all know that they know p, and so on ad infinitum.
- Event E, each player P1,P2.
- P1 knows E.
- P2 knows E.
- P1 knows P2 knows E.
- P2 knows P1 knows E.
- P1 knows P2 knows (P1 knows E)
- P2 knows P1 knows (P2 knows E)
- ...
Background
Game Theory
- Common Knowledge Example
- Three girls are sitting in a circle, each wearing a red or white hat. Each can see the color of all
hats except their own. Now suppose they are all wearing red hats. It is said that if the teacher
announces that at least one of the hats is red, and then sequentially asks each girl if she
knows the color of her hat, the third girl questioned can know her hat color.
Red hat puzzle
Background
Game Theory
- Common Knowledge Example
- Each girl A, B, C has an information set.
- The teacher made the announcement and girl A didn't answer, so RWW can't be the answer.
Background
Game Theory
- Common Knowledge Example
- Girl B didn't answer, so RRW and WRW can't be the answer.
- Girl C can answer that her hat color is red.
Game Theory
- Repeated Iterations
Background
(Figure: game tree for 2 iterations of the stage game. Player 1 moves, then player 2, and the stage-game tree is repeated at each of the four outcomes of the first iteration.)
2 iterations
Background
Game Theory
- Finitely Repeated Iterations
- A non-equilibrium strategy of the stage game can become an equilibrium of the repeated game when the
stage game has more than one Nash equilibrium, because the threat of punishment reduces the incentive to deviate.
- Infinitely Repeated Iterations
- With a discount factor, player i's payoff diminishes over time.
- It can then be that the preferred strategy is not to play a Nash strategy of the stage game, but
to cooperate and play a socially optimal strategy.
Why is multi-agent RL hard to train?
- Credit Assignment Problem
- One of MARL's biggest challenges is the credit assignment problem. In cooperative settings, the
environment gives a single global scalar reward, so inferring which agent contributed to it requires
more consideration than in the single-agent case.
- Environment
- non-stationary
- Each agent's ongoing training makes the environment non-stationary from the perspective of the
other agents.
- Interaction limitation
- How each agent communicates with the others.
Multi-Agent Reinforcement Learning
Why is multi-agent RL hard to train?
- Goal setting
- Cooperation
- Zero-sum
- General-sum
- Need to learn to reciprocate
Multi-Agent Reinforcement Learning
Setting
Multi-Agent Reinforcement Learning
source :
Foerster. Multi Agent Reinforcement learning(2019)
Setting
- Centralized Training Decentralized Execution
- During centralized training, the agent receives additional information, as well as local
information. And the agent uses only local information when it execution.
- Recurrent Network to deal with POMDP
- In POMDP, agent needs to infer state well, so it encode the previous history
information.
- Deep Recurrent Q-Learning for Partially Observable MDPs
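A minimal PyTorch sketch (assumed sizes and layer choices, not the paper's exact architecture) of a recurrent Q-network that summarizes the action-observation history with a GRU, as in DRQN:

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden)  # obs + last action (one-hot)
        self.rnn = nn.GRUCell(hidden, hidden)                 # carries the history summary
        self.fc_out = nn.Linear(hidden, n_actions)            # one Q-value per action

    def forward(self, obs, last_action, h):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action], dim=-1)))
        h = self.rnn(x, h)          # h acts as an encoding of the history so far
        return self.fc_out(h), h    # Q-values and updated hidden state
```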
Multi-Agent Reinforcement Learning
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
- Each agent independently learns its own
Q-network on Pong.
- The other agent is treated as part of the
environment.
- Independent Actor-Critic (IAC) follows the
same idea.
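An illustrative sketch of the independent-learning idea (not the paper's Pong setup): two tabular Q-learners repeatedly play the prisoner's dilemma from the earlier slide, each treating the other as part of the environment:

```python
import random

payoff = {  # (row action, column action) -> (row reward, column reward)
    (0, 0): (-1, -1), (0, 1): (-3, 0),
    (1, 0): (0, -3),  (1, 1): (-2, -2),
}
Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[agent][action]; the repeated game is stateless
alpha, eps = 0.1, 0.1

for t in range(10_000):
    acts = [max(range(2), key=lambda a: Q[i][a]) if random.random() > eps
            else random.randrange(2) for i in range(2)]
    rews = payoff[tuple(acts)]
    for i in range(2):          # independent Q-learning update, no joint model
        Q[i][acts[i]] += alpha * (rews[i] - Q[i][acts[i]])

print(Q)  # both agents typically drift toward defecting (action 1)
```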
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Multi-Agent Reinforcement Learning
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with Deep Reinforcement Learning(2015)
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Cooperation
Competition
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent Policy
Gradients(2017)(COMA)
- Centralized critic, parameter-sharing actors.
- Each actor's gradient:
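The gradient and advantage on this slide did not survive extraction; roughly, in the paper's notation, each actor follows the centralized counterfactual advantage:

$$g = \mathbb{E}_{\pi}\Big[\sum_a \nabla_{\theta} \log \pi(u^a \mid \tau^a)\, A^a(s, \mathbf{u})\Big], \qquad A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi(u'^a \mid \tau^a)\, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)$$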
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
- Credit Assignment Problem
- Shaped reward
- Use a default action c? No
- Advantage function
- Iterate to obtain every action value? No
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
- Algorithm
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- QMIX: Monotonic Value Function
Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
- Value Decomposition Networks (VDN)
- Q sum
- QMIX
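The factorisation formulas on this slide were lost in extraction; roughly, VDN sums per-agent utilities while QMIX mixes them under a monotonicity constraint (notation assumed):

$$Q_{tot}^{VDN}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a), \qquad Q_{tot}^{QMIX} = f_{mix}\big(Q_1, \dots, Q_n; s\big), \quad \frac{\partial Q_{tot}}{\partial Q_a} \ge 0 \;\; \forall a$$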
source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Uses common knowledge to control agents hierarchically.
- Dec-POMDP
- Decentralized Partially Observable Markov Decision Processes
- The state is composed of a number of entities.
- In state s, a binary mask specifies all entities that agent a can see.
- Every group member (agent) computes the common knowledge independently, using prior
knowledge and the commonly known trajectory (the random seed is also common knowledge).
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Delegation Action
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Central-V
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning(2018)
- To improve data efficiency, a replay buffer is introduced; replay implicitly assumes that the conditions
at the recorded time step still hold when the sample is reused.
- If we can use the true state information, then the Bellman equation can be formulated as:
- Record data together with the time it was collected
- Calculate an importance-weighted loss:
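The equation itself was lost in extraction; as a rough reconstruction from the paper, with the other agents' joint policy at replay time t_r and at collection time t_i, the importance-weighted loss is approximately:

$$\mathcal{L}(\theta) \approx \frac{1}{b} \sum_{i=1}^{b} \frac{\pi^{t_r}_{-a}(\mathbf{u}_{-a} \mid s)}{\pi^{t_i}_{-a}(\mathbf{u}_{-a} \mid s)} \Big[\big(y_i^{DQN} - Q(s, u; \theta)\big)^2\Big]$$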
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- But we can't! (All agents are in a partially observable environment.)
- So we define a new game that is specified by
- an augmented state (with the action-observation history added) and a reward function.
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- The Q function can only be updated approximately in the partially observable
setting (the exact update is intractable!)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- Importance sampling is an approximation, and its variance is hard to control.
- Instead, use the idea of Hyper Q-learning!
- Include the other agents' policies in the observation.
- This is hard to scale -> use a fingerprint instead! (e.g. training iteration number,
exploration rate)
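A small sketch of the fingerprint trick (the normalization and feature choice are assumptions, not the paper's exact recipe): each observation stored in the replay buffer is augmented with the training iteration and exploration rate, which record when the transition was generated:

```python
import numpy as np

def add_fingerprint(obs, train_iter, epsilon, max_iter):
    # Low-dimensional stand-in for "what the other agents' policies looked like"
    fingerprint = np.array([train_iter / max_iter, epsilon], dtype=np.float32)
    return np.concatenate([obs, fingerprint])   # Q-network input = obs + fingerprint

obs = np.zeros(16, dtype=np.float32)             # dummy local observation
obs_fp = add_fingerprint(obs, train_iter=5000, epsilon=0.05, max_iter=100_000)
# A replayed transition now carries the training "time" at which it was collected,
# which disambiguates the other agents' (changing) policies.
```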
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- RIAL
- Action : U + M
- environment action U
- message M
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- DIAL
- Action : U + M
- environment action U
- message M
- C-Net
- Q network
- message network
- DRU
- After noise is added, it passes
through a sigmoid function.
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
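A minimal PyTorch sketch of the DRU, assuming a noise scale sigma: during centralized training the message is noised and squashed with a sigmoid (so gradients can flow through the communication channel), and at decentralized execution it is discretized:

```python
import torch

def dru(message, sigma=2.0, training=True):
    if training:
        # noisy, differentiable channel used during centralized learning
        return torch.sigmoid(message + sigma * torch.randn_like(message))
    # hard 1-bit message at execution time
    return (message > 0).float()
```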
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
- DIAL
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Zero-Sum
- Mastering the game of Go with deep neural networks and tree search(2016)
vs
- Grandmaster level in StarCraft II using multi-agent reinforcement
learning(2019)
- League
- Main Agents
- Main exploiter agents
- League exploiter agents
- Prioritized fictitious self-play
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Suppose there are 2 players; each has its own policy parameters.
- If we could access all parameter values, then we could iteratively calculate:
- Instead, with a step size, naive learner 1's parameter update rule is:
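The update rule itself was lost in extraction; roughly, in the paper's notation, with step size δ the naive learner (NL) simply ascends its own value:

$$\theta^{1}_{i+1} = \theta^{1}_{i} + \delta\, \nabla_{\theta^{1}} V^{1}(\theta^{1}_{i}, \theta^{2}_{i})$$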
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Unlike the NL, the LOLA learner learns to optimize (with respect to player 1):
- Assuming the opponent's update is small, a first-order Taylor expansion results in:
- By substituting the opponent's naive learning step:
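Roughly (paper notation; the slide's equations were lost), LOLA optimizes player 1's value after the opponent's anticipated learning step, and the Taylor expansion gives:

$$V^{1}\!\big(\theta^{1}, \theta^{2} + \Delta\theta^{2}\big), \qquad \Delta\theta^{2} = \eta\, \nabla_{\theta^{2}} V^{2}(\theta^{1}, \theta^{2})$$

$$V^{1}(\theta^{1}, \theta^{2} + \Delta\theta^{2}) \;\approx\; V^{1}(\theta^{1}, \theta^{2}) + (\Delta\theta^{2})^{\top} \nabla_{\theta^{2}} V^{1}(\theta^{1}, \theta^{2})$$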
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning rule :
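Reconstructed roughly from the paper: differentiating the surrogate above with respect to θ¹ gives the LOLA update (step size δ),

$$\theta^{1}_{i+1} = \theta^{1}_{i} + \delta\, \nabla_{\theta^{1}} V^{1} + \delta\eta\, \big(\nabla_{\theta^{2}} V^{1}\big)^{\top} \nabla_{\theta^{1}} \nabla_{\theta^{2}} V^{2}$$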
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Naive learner :
- Second order :
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Complete LOLA policy-gradient update:
- When the opponent's parameters can't be accessed:
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Tit-for-tat strategy
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
Naive Learner VS LOLA
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
Reference
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
2. Wikipedia contributors. (2021, July 17). Markov decision process. In Wikipedia, The Free
Encyclopedia. Retrieved 05:59, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=1034067020
3. Zhu, H., Nel, A., & Ferreira, H. (2015). Competitive Spectrum Pricing under Centralized Dynamic
Spectrum Allocation. Advances in Wireless Technologies and Telecommunication, 884–908.
https://doi.org/10.4018/978-1-4666-6571-2.ch034
4. Bonanno, G. (2018). Game Theory: Volume 1: Basic Concepts (2nd ed.). CreateSpace
Independent Publishing Platform.
5. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:09, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
6. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:10, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
7. Wikipedia contributors. (2021, July 8). Common knowledge (logic). In Wikipedia, The Free
Encyclopedia. Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Common_knowledge_(logic)&oldid=1032661454
8. Wikipedia contributors. (2021, March 2). Repeated game. In Wikipedia, The Free Encyclopedia.
Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Repeated_game&oldid=1009754520
9. Foerster, J. N. (2018). Deep multi-agent reinforcement learning [PhD thesis]. University of Oxford
Reference
10. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017).
Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4),
e0172395. https://doi.org/10.1371/journal.pone.0172395
11. Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson,
(2018). Counterfactual Multi-Agent Policy Gradients, AAAI Conference on Artificial Intelligence
12. Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., & Whiteson, S. (2018).
QMIX - monotonic value function factorisation for deep multi-agent reinforcement learning. In
International conference on machine learning.
13. Christian A. Schroeder de Witt, Jakob N. Foerster, Gregory Farquhar, Philip H. S. Torr, Wendelin
Boehmer, and Shimon Whiteson(2018). Multi-Agent Common Knowledge Reinforcement Learning.
arXiv:1810.11702 [cs] URL http://arxiv.org/abs/1810.11702
Reference
14. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H.S., Kohli, P. & Whiteson, S.. (2017).
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. Proceedings of the 34th
International Conference on Machine Learning, in Proceedings of Machine Learning Research
70:1146-1155 Available from http://proceedings.mlr.press/v70/foerster17b.html
15. J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson(2016). Learning to communicate with
deep multi-agent reinforcement learning. CoRR, abs/1605.06676,
16.Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016).
Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
https://doi.org/10.1038/nature16961
17. J. N. Foerster et al.(2017), Learning with opponent-learning awareness. arXiv:1709.04326 [cs.AI]
Reference
18. Chung-san, R. (2016, December 3). In the 'AlphaGo era', how should our education move forward?
['알파고 시대' 우리 교육, 어떻게 나아가야 하나?]. Seoul Metropolitan Office of Education. https://now.sen.go.kr/2016/12/03.php
19. DeepMind. (2020, May 31). Agent57: Outperforming the human Atari benchmark.
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
20. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. (2019, January 24). DeepMind.
https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
21. Multi-Agent Hide and Seek. (2019, September 17). [Video]. YouTube.
https://www.youtube.com/watch?v=kopoLzvh5jY
22. Tayagkrischelle, T. (2014, September 13). game theorA6 [Slides]. Slideshare.
https://www.slideshare.net/tayagkrischelle/game-theora6
23. Lanctot, M. [Laber Labs]. (2020, May 16). Multi-agent Reinforcement Learning - Laber Labs
Workshop [Video]. YouTube. https://www.youtube.com/watch?v=rbZBBTLH32o