Multi-Agent Reinforcement Learning
Seolho Kim
Contents
● Introduction
○ What is Multi-agent RL?
● Background
○ (Single-agent) Reinforcement Learning
○ Game Theory
● Multi-Agent Reinforcement Learning
○ Why is multi-agent RL hard to train?
○ Baseline
○ Cooperation
○ Zero-Sum
○ General-Sum
● References
Introduction
What is Multi-agent RL?
- Reinforcement Learning is a promising way to solve sequential decision-making
problems.
source : https://now.sen.go.kr/2016/12/03.php
source : https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Introduction
What is Multi-agent RL?
- We can expand it by adding multiple agents to solve more complex problems.
source : https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
source : https://www.youtube.com/watch?v=kopoLzvh5jY
Introduction
What is Multi-agent RL?
(Diagram: a grid with axes "Problem size" and "Number of Agents". Tabular Solution Methods, e.g. Game Theory and Dynamic Programming, cover small problems; Approximate Solution Methods, e.g. Monte Carlo and TD learning, cover larger ones.)
Reinforcement Learning
- Reinforcement learning is a problem, a class of solution methods that work well on the problem, and
the field that studies this problem and its solution methods.
- Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize
a numerical reward signal. The learner is not told which actions to take, but instead must discover
which actions yield the most reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward but also the next situation and, through that, all
subsequent rewards.
Background
Reinforcement Learning
source : Sutton, Reinforcement learning: An introduction
Background
Background
Reinforcement Learning
- RL is framed as an infinite-horizon discounted Markov Decision Process (MDP)
- (infinite horizon) MDP
- Find a policy
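The formulas on this slide were lost in extraction; a standard statement of the tuple and the objective (notation assumed) is:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$$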
source :
https://en.wikipedia.org/wiki/Markov_decision_process
Background
Reinforcement Learning
- Value function
- Action value function
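The definitions did not survive extraction; the standard forms are:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$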
Background
Reinforcement Learning
- value based
- policy based
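The slide's equations were lost; standard examples of the two families are the Q-learning update (value based) and the policy gradient (policy based):

$$Q(s, a) \leftarrow Q(s, a) + \alpha\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a)\big]$$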
Background
Game Theory
- The study of mathematical models of strategic interaction among decision makers, where several
self-interested players must make choices that potentially affect the interests of other players.
- This seminar only covers non-cooperative games with complete information.
Background
Game Theory
- Normal form representation
- A set of players
- All possible strategies for player i
- Utility function for each player
- Goal
- maximizing their own expected utility (payoff)
- depending on their beliefs.
- Assume “All players are rational.”
Background
Game Theory
- Strategies (analogous to policies)
- pure strategies
- Select a single strategy with certainty
- mixed strategies
- Randomize over the set of available actions according to some
probability distribution
- beliefs
Game Theory
- Suppose 2 rational, non-cooperative players
Background
Prisoner's dilemma (each cell shows (row player payoff, column player payoff)):

                    column player
                    A           B
row player   a     (-1,-1)     (-3,0)
             b     (0,-3)      (-2,-2)
Game Theory
- Best Response
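The definition on this slide was lost in extraction; in standard notation, player i's best response to the other players' strategies s_{-i} is:

$$BR_i(s_{-i}) = \arg\max_{s_i \in S_i} u_i(s_i, s_{-i})$$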
Background
Background
Game Theory
- Nash Equilibrium
- If each player has chosen a strategy — an action plan choosing their own actions based on what
has happened so far in the game — and no player can increase their own expected payoff by
changing their strategy while the other players keep theirs unchanged, then the current set of
strategy choices constitutes a Nash equilibrium.
- A strategy profile is a Nash equilibrium if
- Mutual best responses
- Rationality + Correct beliefs
- Every finite game has at least one Nash equilibrium.
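In symbols (standard form; the slide's formula was lost), a strategy profile (s_1^*, ..., s_n^*) is a Nash equilibrium if

$$u_i(s_i^{*}, s_{-i}^{*}) \;\ge\; u_i(s_i, s_{-i}^{*}) \qquad \text{for every player } i \text{ and every } s_i \in S_i$$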
Game Theory
- Find the Nash equilibria
Background
                    column player
                    A           B
row player   a     (5,3)       (1,0)
             b     (0,1)       (2,4)
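A tiny Python helper (illustrative only; the payoff numbers are the ones on this slide) that checks every cell for mutual best responses:

```python
# Find the pure-strategy Nash equilibria of the 2x2 game above.
from itertools import product

payoffs = {  # (row action, column action) -> (row payoff, column payoff)
    ("a", "A"): (5, 3), ("a", "B"): (1, 0),
    ("b", "A"): (0, 1), ("b", "B"): (2, 4),
}
row_strats, col_strats = ("a", "b"), ("A", "B")

def is_nash(r, c):
    row_u, col_u = payoffs[(r, c)]
    # Row player cannot gain by deviating while the column player keeps c ...
    row_ok = all(payoffs[(r2, c)][0] <= row_u for r2 in row_strats)
    # ... and the column player cannot gain by deviating while the row player keeps r.
    col_ok = all(payoffs[(r, c2)][1] <= col_u for c2 in col_strats)
    return row_ok and col_ok

print([rc for rc in product(row_strats, col_strats) if is_nash(*rc)])
# -> [('a', 'A'), ('b', 'B')] : both cells are mutual best responses.
```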
Background
Game Theory
- Extensive form
- The players of a game
- What each player can do at each of their moves
- The payoffs received by every player for every possible combination of moves
- + What each player knows for every move
- + For every player every opportunity they have to move
- Subgame Perfect Equilibrium
- Backward Induction (Bellman Equation)
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- A game in normal form and a game in extensive form can carry the same
information.
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- We can use a value function at each node
Background
Game Theory
- Common Knowledge
- There is common knowledge of p in a group of agents G when all the agents in G know p, they all know
that they know p, they all know that they all know that they know p, and so on ad infinitum.
- Event E, each player P1,P2.
- P1 knows E.
- P2 knows E.
- P1 knows P2 knows E.
- P2 knows P1 knows E.
- P1 knows P2 knows (P1 knows E)
- P2 knows P1 knows (P2 knows E)
- ...
Background
Game Theory
- Common Knowledge Example
- Three girls are sitting in a circle, each wearing a red or white hat. Each can see the color of all
hats except their own. Now suppose they are all wearing red hats. It is said that if the teacher
announces that at least one of the hats is red, and then sequentially asks each girl if she
knows the color of her hat, the third girl questioned can know her hat color.
Red hat puzzle
Background
Game Theory
- Common Knowledge Example
- Each girl A, B, C has an information set.
- The teacher made the announcement and girl A didn't answer, so RWW can't be the answer.
Background
Game Theory
- Common Knowledge Example
- Girl B didn't answer, so RRW and WRW can't be the answer.
- Girl C can answer that her hat color is red.
Game Theory
- Repeated Iterations
Background
(Figure: game tree for 2 iterations of the stage game. Player 1 moves, then player 2, and the stage-game tree is repeated at each of the four outcomes of the first iteration.)
2 iterations
Background
Game Theory
- Finitely Repeated Iterations
- A non-equilibrium strategy of the stage game can become an equilibrium of the repeated game when the
stage game has more than one Nash equilibrium, because the threat of punishment reduces the incentive to deviate.
- Infinitely Repeated Iterations
- With a discount factor, player i's payoff diminishes over time.
- It can then be that the preferred strategy is not to play a Nash strategy of the stage game, but
to cooperate and play a socially optimal strategy.
Why is multi-agent RL hard to train?
- Credit Assignment Problem
- One of MARL's biggest challenges is the credit assignment problem. In cooperative settings, the
environment gives a single global scalar reward, so inferring which agent contributed to it requires
more consideration than in the single-agent case.
- Environment
- non-stationary
- Each agent's ongoing training makes the environment non-stationary from the perspective of the
other agents.
- Interaction limitation
- How each agent communicates with the others.
Multi-Agent Reinforcement Learning
Why is multi-agent RL hard to train?
- Goal setting
- Cooperation
- Zero-sum
- General-sum
- Need to learn to reciprocate
Multi-Agent Reinforcement Learning
Setting
Multi-Agent Reinforcement Learning
source :
Foerster. Multi Agent Reinforcement learning(2019)
Setting
- Centralized Training Decentralized Execution
- During centralized training, the agent receives additional information, as well as local
information. And the agent uses only local information when it execution.
- Recurrent Network to deal with POMDP
- In POMDP, agent needs to infer state well, so it encode the previous history
information.
- Deep Recurrent Q-Learning for Partially Observable MDPs
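A minimal PyTorch sketch (assumed sizes and layer choices, not the paper's exact architecture) of a recurrent Q-network that summarizes the action-observation history with a GRU, as in DRQN:

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden)  # obs + last action (one-hot)
        self.rnn = nn.GRUCell(hidden, hidden)                 # carries the history summary
        self.fc_out = nn.Linear(hidden, n_actions)            # one Q-value per action

    def forward(self, obs, last_action, h):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action], dim=-1)))
        h = self.rnn(x, h)          # h acts as an encoding of the history so far
        return self.fc_out(h), h    # Q-values and updated hidden state
```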
Multi-Agent Reinforcement Learning
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
- Each agent independently learns its own
Q-network on Pong.
- The other agent is treated as part of the
environment.
- Independent Actor-Critic (IAC) follows the
same idea.
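An illustrative sketch of the independent-learning idea (not the paper's Pong setup): two tabular Q-learners repeatedly play the prisoner's dilemma from the earlier slide, each treating the other as part of the environment:

```python
import random

payoff = {  # (row action, column action) -> (row reward, column reward)
    (0, 0): (-1, -1), (0, 1): (-3, 0),
    (1, 0): (0, -3),  (1, 1): (-2, -2),
}
Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[agent][action]; the repeated game is stateless
alpha, eps = 0.1, 0.1

for t in range(10_000):
    acts = [max(range(2), key=lambda a: Q[i][a]) if random.random() > eps
            else random.randrange(2) for i in range(2)]
    rews = payoff[tuple(acts)]
    for i in range(2):          # independent Q-learning update, no joint model
        Q[i][acts[i]] += alpha * (rews[i] - Q[i][acts[i]])

print(Q)  # both agents typically drift toward defecting (action 1)
```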
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Multi-Agent Reinforcement Learning
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with Deep Reinforcement Learning(2015)
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Cooperation
Competition
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent Policy
Gradients(2017)(COMA)
- Centralized critic, parameter-sharing actors.
- Each actor's gradient:
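The gradient and advantage on this slide did not survive extraction; roughly, in the paper's notation, each actor follows the centralized counterfactual advantage:

$$g = \mathbb{E}_{\pi}\Big[\sum_a \nabla_{\theta} \log \pi(u^a \mid \tau^a)\, A^a(s, \mathbf{u})\Big], \qquad A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi(u'^a \mid \tau^a)\, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)$$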
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
- Credit Assignment Problem
- Shaped reward
- Use a default action c? No
- Advantage function
- Iterate to obtain every action value? No
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
- Algorithm
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- QMIX: Monotonic Value Function
Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
- Value Decomposition Networks (VDN)
- Q sum
- QMIX
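The factorisation formulas on this slide were lost in extraction; roughly, VDN sums per-agent utilities while QMIX mixes them under a monotonicity constraint (notation assumed):

$$Q_{tot}^{VDN}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a), \qquad Q_{tot}^{QMIX} = f_{mix}\big(Q_1, \dots, Q_n; s\big), \quad \frac{\partial Q_{tot}}{\partial Q_a} \ge 0 \;\; \forall a$$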
source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Uses common knowledge to control agents hierarchically.
- Dec-POMDP
- Decentralized Partially Observable Markov Decision Processes
- The state is composed of a number of entities.
- In state s, a binary mask specifies all entities that agent a can see.
- Every group member (agent) computes the common knowledge independently, using prior
knowledge and the commonly known trajectory (the random seed is also common knowledge).
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Delegation Action
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Central-V
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning(2018)
- To improve data efficiency, a replay buffer is introduced; replay implicitly assumes that the conditions
at the recorded time step still hold when the sample is reused.
- If we can use the true state information, then the Bellman equation can be formulated as:
- Record data together with the time it was collected
- Calculate an importance-weighted loss:
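The equation itself was lost in extraction; as a rough reconstruction from the paper, with the other agents' joint policy at replay time t_r and at collection time t_i, the importance-weighted loss is approximately:

$$\mathcal{L}(\theta) \approx \frac{1}{b} \sum_{i=1}^{b} \frac{\pi^{t_r}_{-a}(\mathbf{u}_{-a} \mid s)}{\pi^{t_i}_{-a}(\mathbf{u}_{-a} \mid s)} \Big[\big(y_i^{DQN} - Q(s, u; \theta)\big)^2\Big]$$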
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- But we can't! (All agents are in a partially observable environment.)
- So we define a new game that is specified by
- an augmented state (with the action-observation history added) and a reward function.
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- The Q function can only be updated approximately in the partially observable
setting (the exact update is intractable!)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- Importance sampling is an approximation, and its variance is hard to control.
- Instead, use the idea of Hyper Q-learning!
- Include the other agents' policies in the observation.
- This is hard to scale -> use a fingerprint instead! (e.g. training iteration number,
exploration rate)
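A small sketch of the fingerprint trick (the normalization and feature choice are assumptions, not the paper's exact recipe): each observation stored in the replay buffer is augmented with the training iteration and exploration rate, which record when the transition was generated:

```python
import numpy as np

def add_fingerprint(obs, train_iter, epsilon, max_iter):
    # Low-dimensional stand-in for "what the other agents' policies looked like"
    fingerprint = np.array([train_iter / max_iter, epsilon], dtype=np.float32)
    return np.concatenate([obs, fingerprint])   # Q-network input = obs + fingerprint

obs = np.zeros(16, dtype=np.float32)             # dummy local observation
obs_fp = add_fingerprint(obs, train_iter=5000, epsilon=0.05, max_iter=100_000)
# A replayed transition now carries the training "time" at which it was collected,
# which disambiguates the other agents' (changing) policies.
```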
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- RIAL
- Action : U + M
- environment action U
- message M
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- DIAL
- Action : U + M
- environment action U
- message M
- C-Net
- Q network
- message network
- DRU
- After noise is added, it passes
through a sigmoid function.
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
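A minimal PyTorch sketch of the DRU, assuming a noise scale sigma: during centralized training the message is noised and squashed with a sigmoid (so gradients can flow through the communication channel), and at decentralized execution it is discretized:

```python
import torch

def dru(message, sigma=2.0, training=True):
    if training:
        # noisy, differentiable channel used during centralized learning
        return torch.sigmoid(message + sigma * torch.randn_like(message))
    # hard 1-bit message at execution time
    return (message > 0).float()
```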
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
- DIAL
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Zero-Sum
- Mastering the game of Go with deep neural networks and tree search(2016)
vs
- Grandmaster level in StarCraft II using multi-agent reinforcement
learning(2019)
- League
- Main Agents
- Main exploiter agents
- League exploiter agents
- Prioritized fictitious self-play
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Suppose there are 2 players; each has its own policy parameters.
- If we could access all parameter values, then we could iteratively calculate:
- Instead, with a step size, naive learner 1's parameter update rule is:
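The update rule itself was lost in extraction; roughly, in the paper's notation, with step size δ the naive learner (NL) simply ascends its own value:

$$\theta^{1}_{i+1} = \theta^{1}_{i} + \delta\, \nabla_{\theta^{1}} V^{1}(\theta^{1}_{i}, \theta^{2}_{i})$$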
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Unlike the NL, the LOLA learner learns to optimize (with respect to player 1):
- Assuming the opponent's update is small, a first-order Taylor expansion results in:
- By substituting the opponent's naive learning step:
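Roughly (paper notation; the slide's equations were lost), LOLA optimizes player 1's value after the opponent's anticipated learning step, and the Taylor expansion gives:

$$V^{1}\!\big(\theta^{1}, \theta^{2} + \Delta\theta^{2}\big), \qquad \Delta\theta^{2} = \eta\, \nabla_{\theta^{2}} V^{2}(\theta^{1}, \theta^{2})$$

$$V^{1}(\theta^{1}, \theta^{2} + \Delta\theta^{2}) \;\approx\; V^{1}(\theta^{1}, \theta^{2}) + (\Delta\theta^{2})^{\top} \nabla_{\theta^{2}} V^{1}(\theta^{1}, \theta^{2})$$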
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning rule :
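Reconstructed roughly from the paper: differentiating the surrogate above with respect to θ¹ gives the LOLA update (step size δ),

$$\theta^{1}_{i+1} = \theta^{1}_{i} + \delta\, \nabla_{\theta^{1}} V^{1} + \delta\eta\, \big(\nabla_{\theta^{2}} V^{1}\big)^{\top} \nabla_{\theta^{1}} \nabla_{\theta^{2}} V^{2}$$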
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Naive learner :
- Second order :
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Complete LOLA policy-gradient update:
- When the opponent's parameters can't be accessed:
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Tit-for-tat strategy
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
Naive Learner VS LOLA
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
Reference
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
2. Wikipedia contributors. (2021, July 17). Markov decision process. In Wikipedia, The Free
Encyclopedia. Retrieved 05:59, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=1034067020
3. Zhu, H., Nel, A., & Ferreira, H. (2015). Competitive Spectrum Pricing under Centralized Dynamic
Spectrum Allocation. Advances in Wireless Technologies and Telecommunication, 884–908.
https://doi.org/10.4018/978-1-4666-6571-2.ch034
4. Bonanno, G. (2018). Game Theory: Volume 1: Basic Concepts (2nd ed.). CreateSpace
Independent Publishing Platform.
5. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:09, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
6. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:10, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
7. Wikipedia contributors. (2021, July 8). Common knowledge (logic). In Wikipedia, The Free
Encyclopedia. Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Common_knowledge_(logic)&oldid=1032661454
8. Wikipedia contributors. (2021, March 2). Repeated game. In Wikipedia, The Free Encyclopedia.
Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Repeated_game&oldid=1009754520
9. Foerster, J. N. (2018). Deep multi-agent reinforcement learning [PhD thesis]. University of Oxford
Reference
10. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017).
Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4),
e0172395. https://doi.org/10.1371/journal.pone.0172395
11. Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson,
(2018). Counterfactual Multi-Agent Policy Gradients, AAAI Conference on Artificial Intelligence
12. Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., & Whiteson, S. (2018).
QMIX - monotonic value function factorisation for deep multi-agent reinforcement learning. In
International conference on machine learning.
13. Christian A. Schroeder de Witt, Jakob N. Foerster, Gregory Farquhar, Philip H. S. Torr, Wendelin
Boehmer, and Shimon Whiteson(2018). Multi-Agent Common Knowledge Reinforcement Learning.
arXiv:1810.11702 [cs] URL http://arxiv.org/abs/1810.11702
Reference
14. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H.S., Kohli, P. & Whiteson, S.. (2017).
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. Proceedings of the 34th
International Conference on Machine Learning, in Proceedings of Machine Learning Research
70:1146-1155 Available from http://proceedings.mlr.press/v70/foerster17b.html
15. J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson(2016). Learning to communicate with
deep multi-agent reinforcement learning. CoRR, abs/1605.06676,
16.Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016).
Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
https://doi.org/10.1038/nature16961
17. J. N. Foerster et al.(2017), Learning with opponent-learning awareness. arXiv:1709.04326 [cs.AI]
Reference
18. Chung-san, R. (2016, December 3). In the 'AlphaGo era', how should our education move forward?
['알파고 시대' 우리 교육, 어떻게 나아가야 하나?]. Seoul Metropolitan Office of Education. https://now.sen.go.kr/2016/12/03.php
19. DeepMind. (2020, May 31). Agent57: Outperforming the human Atari benchmark.
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
20. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. (2019, January 24). DeepMind.
https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
21. Multi-Agent Hide and Seek. (2019, September 17). [Video]. YouTube.
https://www.youtube.com/watch?v=kopoLzvh5jY
22. Tayagkrischelle, T. (2014, September 13). game theorA6 [Slides]. Slideshare.
https://www.slideshare.net/tayagkrischelle/game-theora6
23. Lanctot, M. [Laber Labs]. (2020, May 16). Multi-agent Reinforcement Learning - Laber Labs
Workshop [Video]. YouTube. https://www.youtube.com/watch?v=rbZBBTLH32o