1. Reading ICLR2020 paper
“A Generalized Training Approach For
Multi-agent Learning”
Presenter: 37
Date: May 10th, 2020
Content: Bread House Seminar
Place: zoom
2. Overview
• Investigated Policy-Space Response Oracles (PSRO)
• Utilized α-Rank instead of the computation of Nash equilibria
❖ Established convergence guarantees in several game classes
❖ Identified links between Nash equilibria and α-Rank
❖ α-Rank achieves faster convergence than approximate Nash solvers
• Background knowledge (we learn today):
#Game theory (two- or multi-player, zero- or general-sum) #Nash equilibria #computation of Nash equilibria
#Reinforcement Learning #PSRO #α-Rank #PageRank #Markov Matrix #Kuhn and Leduc Poker #MuJoCo soccer
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
3. Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
4. (Recap) #Nash equilibrium
• A Beautiful Mind (2001)
• “Recall the lessons of Adam Smith, the father of modern economics.”
“In competition, individual ambition serves the common good.”
“Exactly! Every man for himself, gentlemen.”
• Nash replies, “Because the best result will come from everyone in the group doing what’s best for himself and the group.”
6. (Recap) examples of simple famous games in Game Theory
• Prisoner’s dilemma
• The Nash equilibrium of the game is mutual betrayal
• The Nash equilibrium is not Pareto efficient in this case; the other three outcomes are Pareto efficient
A’s and B’s payoff function:
                 B stays silent   B betrays
A stays silent   (-1, -1)         (-5, 0)
A betrays        (0, -5)          (-3, -3)
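The equilibrium can be checked mechanically: a cell is a pure-strategy Nash equilibrium when neither player can gain by deviating unilaterally. A minimal sketch (not from the paper), using the payoff matrices above:

```python
# Find pure-strategy Nash equilibria of a bimatrix game by checking
# that each player's action is a best response to the other's.
A = [[-1, -5],   # row player (A): payoffs for (silent, betray)
     [ 0, -3]]
B = [[-1,  0],   # column player (B)
     [-5, -3]]

def pure_nash(A, B):
    eqs = []
    rows, cols = len(A), len(A[0])
    for i in range(rows):
        for j in range(cols):
            row_best = all(A[i][j] >= A[k][j] for k in range(rows))
            col_best = all(B[i][j] >= B[i][k] for k in range(cols))
            if row_best and col_best:
                eqs.append((i, j))
    return eqs

# Action 0 = stay silent, action 1 = betray.
print(pure_nash(A, B))  # → [(1, 1)]: mutual betrayal
```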
9. (Recap) #Nash’s Existence Theorem
https://www.jstor.org/stable/pdf/1969529.pdf?refreqid=excelsior%3Aee23262bab98861eceb01bc78e973f05
10. (Recap) examples of simple famous games in Game Theory
• Chicken game (also known as hawk-dove or snowdrift game)
• Nash equilibria in pure strategies are:
A swerves and B goes straight / A goes straight and B swerves
• The mixed strategy of 99% swerve and 1% straight is also a Nash equilibrium for both players
A’s and B’s payoff function:
           Swerve     Straight
Swerve     (0, 0)     (-1, +1)
Straight   (+1, -1)   (-100, -100)
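The 99%/1% mix follows from the indifference condition: a player only mixes when both actions yield the same expected payoff. A quick verification with exact fractions (illustrative, not from the talk), using the payoffs above:

```python
from fractions import Fraction

# Row player's payoffs when the opponent plays (swerve, straight); symmetric game.
swerve_payoffs   = [Fraction(0), Fraction(-1)]
straight_payoffs = [Fraction(1), Fraction(-100)]

# If the opponent swerves with probability p = 99/100, both actions tie:
#   0*p + (-1)*(1-p) = 1*p + (-100)*(1-p)  =>  99 = 100*p  =>  p = 99/100
p = Fraction(99, 100)
ev_swerve   = swerve_payoffs[0] * p + swerve_payoffs[1] * (1 - p)
ev_straight = straight_payoffs[0] * p + straight_payoffs[1] * (1 - p)
print(ev_swerve, ev_straight)  # both -1/100, so mixing 99%/1% is a best response
```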
11. (Recap) examples of simple famous games in Game Theory
• Stag hunt game
• Nash equilibria in the game are (S, S), (H, H), or (50% S 50% H, 50% S 50% H)
• This game describes a conflict between safety and social cooperation
A’s and B’s payoff function:
                   Stag (Cooperate)   Hare (Defect)
Stag (Cooperate)   (4, 4)             (1, 3)
Hare (Defect)      (3, 1)             (2, 2)
12. (Recap) examples of simple famous games in Game Theory
• Matching pennies / Rock-Paper-Scissors
• The Nash equilibrium of matching pennies is (50% H, 50% T) for both players; in Rock-Paper-Scissors it is the uniform mix (1/3 R, 1/3 P, 1/3 S)
• Zero-sum game
A’s and B’s payoff function:
          Heads      Tails
Heads     (+1, -1)   (-1, +1)
Tails     (-1, +1)   (+1, -1)

A’s and B’s payoff function:
           Rock       Paper      Scissors
Rock       (0, 0)     (-1, +1)   (+1, -1)
Paper      (+1, -1)   (0, 0)     (-1, +1)
Scissors   (-1, +1)   (+1, -1)   (0, 0)
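The uniform mix can be verified the same way: against (1/3, 1/3, 1/3), every pure reply earns the same expected payoff, so no deviation helps. A small illustrative check:

```python
from fractions import Fraction

RPS = [[0, -1, +1],   # row player's payoffs: Rock, Paper, Scissors (rows)
       [+1, 0, -1],   # vs. Rock, Paper, Scissors (columns)
       [-1, +1, 0]]

uniform = [Fraction(1, 3)] * 3
payoffs = [sum(q * RPS[i][j] for j, q in enumerate(uniform)) for i in range(3)]
print(payoffs)  # all three expected payoffs are 0: no pure deviation is profitable
```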
13. (Recap) #The computation of Nash equilibria
• Several computational algorithms exist to search for Nash equilibria
- Two players
❖ Support enumeration (finds all Nash equilibria; applicable up to tens of strategies)
❖ Vertex enumeration (finds all Nash equilibria; applicable up to tens of strategies)
❖ Lemke-Howson (finds one Nash equilibrium; scales to hundreds of strategies)
❖ …
- Multiple players
❖ McLennan-Tourky (finds one Nash equilibrium; a few players with a few strategies each)
Lemke, Carlton E., and Joseph T. Howson, Jr. “Equilibrium points of bimatrix games.” Journal of the Society for Industrial
and Applied Mathematics 12.2 (1964): 413-423.
16. Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
17. Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T. (2019), ICLR2019. Emergent Coordination Through Competition arXiv https://arxiv.org/abs/1902.07151
• Study on the emergence of cooperative behaviors in RL agents
- Introduced a challenging competitive multi-agent soccer game (with continuous simulated physics)
- Used decentralized population-based training with co-play (PBT) and evaluated agents with Nash averaging
• background: #MARL(Multi-Agent Reinforcement Learning), #Markov game, #PBT, (#Elo rating, #Nash averaging)
• PBT (Jaderberg+ 2017, Jaderberg+ 2018)
• A method to optimize hyper-parameters via a population of simultaneously learning agents: during training, poorly performing agents inherit parameters from stronger agents, with additional mutation
• Was extended to incorporate co-play for MARL: subsets of agents are selected from the population to play together in multi-agent games; each agent treats the other agents as part of its environment
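A minimal sketch of the PBT exploit/explore step described above (not the authors' implementation; `pbt_step`, the `lr` hyper-parameter, and the 25% swap fraction are illustrative assumptions):

```python
import random

def pbt_step(population, mutate_scale=1.2, frac=0.25, rng=random):
    """Exploit/explore: the bottom `frac` of agents copy the top agents' settings."""
    ranked = sorted(population, key=lambda a: a['score'], reverse=True)
    n_swap = max(1, int(len(ranked) * frac))
    for loser, winner in zip(ranked[-n_swap:], ranked[:n_swap]):
        loser['lr'] = winner['lr']                                   # exploit: inherit
        loser['lr'] *= rng.choice([1 / mutate_scale, mutate_scale])  # explore: mutate
    return population

# Four agents with an assumed learning-rate hyper-parameter and fitness score.
pop = [{'lr': lr, 'score': s}
       for lr, s in zip([1e-3, 1e-2, 1e-4, 1e-1], [0.9, 0.5, 0.1, 0.3])]
pbt_step(pop)  # the score-0.1 agent now holds a mutated copy of lr = 1e-3
```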
https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer
• Reward Shaping
• Sparse environment rewards for scoring and conceding (goal and concede)
• vel-to-ball: the player’s linear velocity projected onto the unit vector from the player towards the ball
• vel-ball-to-goal: the ball’s linear velocity projected onto the unit vector from the ball towards the center of the opponent’s goal
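These shaping terms are dot products of a velocity with a unit direction vector. A sketch of vel-to-ball under assumed 2-D geometry (the function name and signature are illustrative, not the released environment code):

```python
import math

def vel_to_ball(player_pos, player_vel, ball_pos):
    """Player's linear velocity projected onto the unit vector toward the ball."""
    diff = [b - p for b, p in zip(ball_pos, player_pos)]
    norm = math.sqrt(sum(d * d for d in diff))
    unit = [d / norm for d in diff]
    return sum(v * u for v, u in zip(player_vel, unit))

# Running straight at a ball 3 m away at 2 m/s yields shaping reward 2.0.
print(vel_to_ball([0.0, 0.0], [2.0, 0.0], [3.0, 0.0]))  # → 2.0
```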
• Experiment
• 32 agents in the population
• For each 2v2 training match, 4 agents were selected uniformly from the population
• Evaluated with Nash-averaging evaluators
#MuJoCo soccer - Emergent Coordination Through Competition
18. #MuJoCo soccer - Emergent Coordination Through Competition
• The evolution of behavior statistics indicates coordination with teammates 😁
• Passes/interceptions between players more than 10 m apart increase dramatically
KL divergence incurred by replacing a subset
of state with counterfactual information.
• Question: “had a subset of the observation been different, how
much would I have changed my policy?”
• Quantified the dependency of agent’s policy on the subset of
the observation space
• Measured the KL divergence in agent’s policy distribution
(counterfactual policy divergence)
• Result:
• Ball position is the strongest factor
• The opponent-0/1 positions incur less divergence than the teammate position
• This suggests that the coordinating teammate’s position is an important part of the game dynamics determining each player’s action
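The counterfactual policy divergence reduces to a KL divergence between two action distributions. A sketch with made-up illustrative probabilities (the environment's actual policies are over continuous actions, so the numbers and the discrete form are assumptions):

```python
import math

def kl(p, q):
    """Discrete KL divergence D(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy_true       = [0.7, 0.2, 0.1]    # illustrative action distribution
policy_ball_moved = [0.2, 0.5, 0.3]    # assumed policy after altering the ball position
policy_opp_moved  = [0.65, 0.25, 0.1]  # assumed policy after altering an opponent position

# Larger KL = the policy depends more strongly on that part of the observation.
print(kl(policy_true, policy_ball_moved) > kl(policy_true, policy_opp_moved))  # → True
```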
19. Elo Rating System
• Elo Rating System:
a method for calculating the relative skill levels of
players in zero-sum games such as chess
• Named after its creator, Arpad Elo
• Adopted by several games and organizations
• The United States Chess Federation (USCF) in 1960
• World Chess Federation (FIDE) in 1970
• American college football, Major League Baseball,
FIFA World Cup, etc
• Problem: if there are many scissors players in a rock-paper-scissors world, rock players get a high Elo rating.
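The standard Elo update (general background, not specific to this talk): the expected score is logistic in the rating difference, and ratings move toward observed results.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update player A's rating given result score_a (1 = win, 0.5 = draw, 0 = loss)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a)

# Equal-rated players: a win moves the winner up by k/2 points.
print(elo_update(1000, 1000, 1.0))  # → 1016.0
```

This transitivity assumption is exactly what breaks in the rock-paper-scissors example above: repeatedly beating scissors players inflates a rock player's rating even though paper players would beat them.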
The Elo rating system in chess Nicholas R. Moloney (with Mariia Koroliuk)
Here is the famous scene from The Social Network (2010), where Eduardo
Saverin gives Mark Zuckerberg the algorithm he needs to code Facemash.
Eduardo then writes the formula on the window of the Harvard dorm room
http://www.fbmovie.com/
20. #Reinforcement Learning
AlphaGo - The Movie | Full Documentary: https://youtu.be/WXuK6gekU1Y?t=2363
(channel: https://www.youtube.com/channel/UCP7jMXSY2xbc3KCAE0MHQ-A)
Lectures: Reinforcement Learning 10: Classic Games Case Study by David Silver
https://www.youtube.com/watch?v=ld28AU7DDB4&list=PLqYmG7hTraZBKeNJ-JE_eyJHZ7XgBoAyb&index=10
https://qiita.com/icoxfog417/items/242439ecd1a477ece312
23. Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
24. #PSRO - Policy-Space Response Oracles
25. Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., Graepel, T. (2017), NIPS 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement
Learning arXiv https://arxiv.org/abs/1711.00832
• PSRO (Lanctot+ 2017)
• The meta-game grows by adding policies (“oracles”) that approximate best responses to the meta-strategies of the other players
• A natural generalization of Double Oracle (DO) and fictitious self-play
• Linked to empirical game-theoretic analysis (EGTA)
• Double Oracle (DO)
• Double Oracle solves a sequence of (two-player, normal-form) sub-games induced by the strategy subsets at time t
• Introduced in the paper “Planning in the Presence of Cost Functions Controlled by an Adversary”, ICML 2003
• Applied in the paper “Algorithms for Computing Strategies in Two-Player Simultaneous Move Games”, AI 2016
Bošanský, B., Lisý, V., Lanctot, M., Čermák, J., Winands, M. (2016). Algorithms for computing strategies in two-player simultaneous move games Artificial Intelligence 237(), 1-40.
https://dx.doi.org/10.1016/j.artint.2016.03.005
McMahan, H. B., Gordon, G. J., Blum, A. (2003). Planning in the Presence of Cost Functions Controlled by an Adversary. ICML 2003.
#PSRO - Policy-Space Response Oracles
Example of application of Double Oracle (DO)
to two-player simultaneous move game
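A runnable sketch of the Double Oracle loop on a zero-sum matrix game. For simplicity it assumes every restricted sub-game has a pure saddle point; the real algorithm instead computes a mixed equilibrium of the restricted game. The payoff matrix is illustrative:

```python
def pure_saddle(A, rows, cols):
    """Return (i, j) with A[i][j] a pure saddle point of the restricted game."""
    for i in rows:
        for j in cols:
            if (all(A[i][j] <= A[i][k] for k in cols) and
                    all(A[i][j] >= A[k][j] for k in rows)):
                return i, j
    raise ValueError("restricted game has no pure saddle point")

def double_oracle(A, row0=0, col0=0):
    rows, cols = {row0}, {col0}
    while True:
        i, j = pure_saddle(A, rows, cols)
        # Oracles: pure best responses against the opponent's restricted equilibrium.
        br_row = max(range(len(A)), key=lambda k: A[k][j])
        br_col = min(range(len(A[0])), key=lambda k: A[i][k])
        if br_row in rows and br_col in cols:
            return A[i][j]  # no new strategies: the restricted equilibrium is global
        rows.add(br_row)
        cols.add(br_col)

A = [[4, 2, 3],
     [1, 0, 2],
     [5, 2, 4]]
print(double_oracle(A))  # → 2, the value of the game
```

PSRO generalizes this loop: the best-response oracle becomes an RL training run, and the restricted-game solver becomes the meta-strategy solver (Nash, or α-Rank in this paper).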
26. Results - MuJoCo Soccer Game
PSRO(α-Rank, RL) and PSRO(Uniform, RL) agents
8 best trained agents, 3 vs. 3 soccer game
PSRO(α-Rank, RL) and self-play-based training agents
8 best trained agents, 2 vs. 2 soccer game
27. #PSRO(Nash, BR) vs PSRO(α-Rank, BR) vs PSRO(α-Rank, PBR)
• PSRO(Nash, BR) will eventually return an NE in two-player zero-sum games [McMahan+, 2003]
• How about PSRO(α-Rank, BR)? - No, a counterexample can be shown
29. #PSRO(Nash, BR) vs PSRO(α-Rank, BR) vs PSRO(α-Rank, PBR)
• PSRO(α-Rank, BR) leads to the algorithm terminating with strategy set {A, B, C, D} and not
discovering strategy X in the sink strongly-connected component.
• How do we fix the issue? - We define the Preference-Based Best Response (PBR) oracle
30. #α-Rank
31. #PageRank
PageRank is an algorithm used by Google Search to rank web pages in its search engine results. It was named after Larry Page, one of the founders of Google.
32. #PageRank
PageRank is computed from the iteration of the Google matrix.
Perron-Frobenius theorem: if all entries of an n × n matrix A are positive, then A has a unique maximal eigenvalue, and the corresponding eigenvector has strictly positive entries.
The PageRank Citation Ranking: Bringing Order to the Web, 1998, http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
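The iteration can be sketched directly: repeatedly apply the Google matrix G = d·M + (1−d)/n to a rank vector until it converges to the Perron eigenvector. A toy example with an assumed link structure (and assuming every page has at least one outgoing link):

```python
def pagerank(links, d=0.85, iters=100):
    """links[i] = list of pages that page i links to (each page needs >= 1 link)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n  # teleportation term of the Google matrix
        for page, outs in enumerate(links):
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new
    return rank

# Toy web: pages 0 and 2 link to page 1; page 1 links back to page 0.
links = [[1], [0], [1]]
ranks = pagerank(links)
print(max(range(3), key=lambda i: ranks[i]))  # → 1, the most-linked page
```

α-Rank reuses this machinery: it computes the stationary distribution of a Markov matrix over strategy profiles, just as PageRank computes the stationary distribution over web pages.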