Reading ICLR2020 paper
“A Generalized Training Approach For
Multi-agent Learning”
Presenter: 37
Date: May 10th, 2020
Content: Bread House Seminar
Place: zoom
Overview
• investigated Policy-Space Response Oracles (PSRO)
• utilized α-Rank instead of the computation of Nash equilibria
❖ established convergence guarantees in several game classes
❖ identified links between Nash equilibria and α-Rank
❖ α-Rank achieves faster convergence than approximate Nash solvers
• Background knowledge (we learn today):
#Game theory (two- or multi-player, zero- or general-sum) #Nash equilibria #computation of Nash equilibria
#Reinforcement Learning #PSRO #α-Rank #PageRank #Markov Matrix #Kuhn and Leduc Poker #MuJoCo soccer
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
• A Beautiful Mind (2001)
• “Recall the lessons of Adam Smith, the father of modern economics”
“In competition, individual ambition serves the common good.”
“Exactly! Every man for himself, gentlemen.”
Nash says, “Because the best result will come from everyone in the group
doing what’s best for himself and the group”
(Recap) #Nash equilibrium
(Recap) #Nash equilibrium
(Recap) examples of simple famous games in Game Theory
• Prisoner’s dilemma
• The Nash equilibrium of the game is mutual betrayal
• The Nash equilibrium is not Pareto efficient in this case; the other three outcomes are Pareto efficient
A’s and B’s
Payoff function
B stays silent B betrays
A stays silent (-1, -1) (-5, 0)
A betrays (0, -5) (-3, -3)
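The equilibrium can be verified directly from the payoff matrices. A minimal Python check (my own illustration, not from the slides) confirming that mutual betrayal is a best response for both players:

```python
import numpy as np

# Row player A's payoff matrix: rows = A's actions (silent, betray),
# columns = B's actions (silent, betray).
A = np.array([[-1, -5],
              [ 0, -3]])
B = A.T  # the game is symmetric, so B's payoffs are A transposed

# (betray, betray) is a Nash equilibrium iff betraying is each player's
# best response to the other's betrayal:
assert A[:, 1].argmax() == 1  # A's best response to "B betrays" is betray (-3 > -5)
assert B[1, :].argmax() == 1  # B's best response to "A betrays" is betray
```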
(Recap) Best Response
(Recap) #Nash’s Existence Theorem
https://www.dominos.jp/menu-pizza
(Recap) #Nash’s Existence Theorem
https://www.jstor.org/stable/pdf/1969529.pdf?refreqid=excelsior%3Aee23262bab98861eceb01bc78e973f05
(Recap) examples of simple famous games in Game Theory
• Chicken game (also known as hawk-dove or snowdrift game)
• The pure-strategy Nash equilibria are
A swerves and B goes straight / A goes straight and B swerves
• The mixed strategy of 99% swerve and 1% straight is also a Nash equilibrium for both players (verified below)
A’s and B’s
Payoff function
Swerve Straight
Swerve (0, 0) (-1, +1)
Straight (+1, -1) (-100, -100)
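The 99%/1% mix comes from the indifference condition: a player mixes so that the opponent gets the same expected payoff from both actions. A quick check of the arithmetic:

```python
# If B swerves with probability p, A's expected payoffs are:
#   E[swerve]   = 0*p + (-1)*(1 - p)   = p - 1
#   E[straight] = 1*p + (-100)*(1 - p) = 101*p - 100
# Setting them equal (p - 1 = 101*p - 100) gives p = 99/100.
p = 99 / 100
E_swerve = 0 * p + (-1) * (1 - p)
E_straight = 1 * p + (-100) * (1 - p)
assert abs(E_swerve - E_straight) < 1e-9  # indifferent, so mixing is a best response
print(E_swerve, E_straight)  # both -0.01
```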
(Recap) examples of simple famous games in Game Theory
• Stag hunt game
• Nash equilibria in the game are (S, S), (H, H), or (50% S 50% H, 50% S 50% H)
• This game describes a conflict between safety and social cooperation
A’s and B’s
Payoff function
Stag (Cooperate) Hare (Defect)
Stag (Cooperate) (4, 4) (1, 3)
Hare (Defect) (3, 1) (2, 2)
(Recap) examples of simple famous games in Game Theory
• Matching pennies / Rock-Paper-Scissors
• The Nash equilibrium of matching pennies is (50% H 50% T, 50% H 50% T); for rock-paper-scissors it is the uniform mix (1/3, 1/3, 1/3)
• Both are zero-sum games
A’s and B’s
Payoff function
Heads Tails
Heads (+1, -1) (-1, +1)
Tails (-1, +1) (+1, -1)
A’s and B’s
Payoff
function
Rock Paper Scissors
Rock (0, 0) (-1, +1) (+1, -1)
Paper (+1, -1) (0, 0) (-1, +1)
Scissors (-1, +1) (+1, -1) (0, 0)
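Mixed equilibria of small matrix games like these can be computed mechanically; a sketch using the open-source nashpy library (my choice of tool, not one the slides mention) recovers the uniform rock-paper-scissors equilibrium:

```python
import numpy as np
import nashpy as nash

# Row player's payoffs for Rock-Paper-Scissors; the game is zero-sum,
# so nashpy derives the column player's payoffs as -A.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
rps = nash.Game(A)

for sigma_row, sigma_col in rps.support_enumeration():
    print(sigma_row, sigma_col)  # unique equilibrium: [1/3 1/3 1/3] for both players
```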
(Recap) #The computation of Nash equilibria
• Several computational algorithms exist for finding Nash equilibria (a worked example follows below)
- Two players
❖ Support enumeration (finds all Nash equilibria; practical up to tens of strategies)
❖ Vertex enumeration (finds all Nash equilibria; practical up to tens of strategies)
❖ Lemke-Howson (finds one Nash equilibrium; practical up to hundreds of strategies)
❖ …
- Multiple players
❖ McLennan-Tourky (finds one Nash equilibrium; a few players with a few strategies)
Lemke, Carlton E., and Joseph T. Howson, Jr. “Equilibrium points of bimatrix games.” Journal of the Society for Industrial
and Applied Mathematics 12.2 (1964): 413-423.
(Recap) #The computation of Nash equilibria
(Recap) #The computation of Nash equilibria
https://vknight.org/gt/chapters/04/
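The two-player algorithms above are implemented in the open-source nashpy library (the course notes linked on the slide are by its author). A minimal example on a hypothetical 2×2 coordination game, contrasting support enumeration with Lemke-Howson:

```python
import numpy as np
import nashpy as nash

# A hypothetical 2x2 coordination game (payoffs chosen for illustration only).
A = np.array([[3, 1], [0, 2]])  # row player's payoffs
B = np.array([[2, 1], [0, 3]])  # column player's payoffs
game = nash.Game(A, B)

# Support enumeration returns all equilibria (two pure, one mixed) ...
print(list(game.support_enumeration()))

# ... while Lemke-Howson follows a single path and returns one equilibrium.
print(game.lemke_howson(initial_dropped_label=0))
```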
Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T. (2019), ICLR2019. Emergent Coordination Through Competition arXiv https://arxiv.org/abs/1902.07151
• Study on the emergence of cooperative behaviors in RL agents
- Introduced a challenging competitive multi-agent soccer game (with continuous simulated physics)
- Used decentralized population-based training with co-play (PBT) and evaluated agents with Nash averaging
• background: #MARL(Multi-Agent Reinforcement Learning), #Markov game, #PBT, (#Elo rating, #Nash averaging)
• PBT (Jaderberg+ 2017, Jaderberg+ 2018)
• A method to optimize hyper-parameters via a population of simultaneously learning agents: during
training, poorly performing agents inherit parameters from stronger agents, with additional mutation
• PBT was extended to incorporate co-play for MARL: subsets of agents are selected from the population to play
together in multi-agent games, and each agent treats the other agents as part of its environment (see the sketch below)
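A minimal sketch of the exploit/explore step described above; the dict-based agent representation and the eval_fn/mutate_fn hooks are placeholders, not the papers' actual implementation:

```python
import copy
import random

def pbt_step(population, eval_fn, mutate_fn, frac=0.2):
    # Rank agents by fitness; the bottom `frac` copy from a random top agent.
    ranked = sorted(population, key=eval_fn, reverse=True)
    cut = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:cut], ranked[-cut:]
    for agent in bottom:
        parent = random.choice(top)
        agent["params"] = copy.deepcopy(parent["params"])  # exploit: inherit weights
        agent["hypers"] = mutate_fn(parent["hypers"])      # explore: perturb hypers
    return population
```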
https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer?&
• Reward Shaping (the projections are sketched below)
• Sparse environment rewards for scoring and conceding a goal
• vel-to-ball: the player’s linear velocity projected onto the unit vector from the player towards the ball
• vel-ball-to-goal: the ball’s linear velocity projected onto the unit vector from the ball towards the center of the opponent’s goal
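Both shaping terms are simple velocity projections; a sketch under assumed numpy conventions (function and argument names are illustrative, not the environment's API):

```python
import numpy as np

def vel_to_ball(player_pos, player_vel, ball_pos):
    # Project the player's linear velocity onto the unit vector
    # pointing from the player towards the ball.
    direction = ball_pos - player_pos
    direction = direction / np.linalg.norm(direction)
    return float(np.dot(player_vel, direction))

# vel-ball-to-goal is the same projection applied to the ball's velocity and
# the unit vector from the ball to the center of the opponent's goal.
```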
• Experiment
• 32 agents in the population
• For each 2v2 training match, 4 agents were
selected uniformly from the population
• Evaluated with Nash-averaging evaluators
#MuJoCo soccer - Emergent Coordination Through Competition
#MuJoCo soccer - Emergent Coordination Through Competition
Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., Graepel, T. (2019), ICLR2019. Emergent Coordination Through Competition arXiv https://arxiv.org/abs/1902.07151
• The evolution of the behavior statistics indicates coordination with teammates 😁
• Passes/interceptions between players more than 10 m apart increase dramatically
https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer?&
KL divergence incurred by replacing a subset
of state with counterfactual information.
• Question: “had a subset of the observation been different, how
much would I have changed my policy?”
• Quantified the dependency of the agent’s policy on subsets of
the observation space
• Measured as the KL divergence in the agent’s policy distribution
(counterfactual policy divergence)
• Result:
• ball-position is a strong factor
• The opponent-0/1 positions incur less divergence than the
teammate position
• This suggests that the teammate’s position is an important part of
the game dynamics determining each player’s action (a minimal
sketch of the probe follows below)
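A minimal sketch of the probe, assuming a policy that returns a vector of discrete action probabilities (the paper's policies are more complex; this only illustrates the quantity being measured):

```python
import numpy as np

def counterfactual_policy_divergence(policy, obs, obs_counterfactual):
    # KL(pi(.|obs) || pi(.|obs')) where obs' replaces one subset of the
    # observation (e.g. ball position) with counterfactual values.
    p = policy(obs)
    q = policy(obs_counterfactual)
    return float(np.sum(p * np.log(p / q)))
```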
Elo Rating System
• Elo Rating System:
a method for calculating the relative skill levels of
players in zero-sum games such as chess
• Named after its creator, Arpad Elo
• Adopted in several games and organizations
• The United States Chess Federation (USCF) in 1960
• World Chess Federation (FIDE) in 1970
• American college football, Major League Baseball,
FIFA World Cup, etc
• Problem: if there are many scissors players in a rock-paper-
scissors world, rock players get a high Elo rating (Elo assumes
transitive skill and cannot capture such cyclic match-ups); the
standard update rule is sketched below
The Elo rating system in chess Nicholas R. Moloney (with Mariia Koroliuk)
Here is the famous scene from The Social Network (2010), where Eduardo
Saverin gives Mark Zuckerberg the algorithm he needs to code Facemash.
Eduardo then writes the code on the window of the Harvard dorm room
http://www.fbmovie.com/
#Reinforcement Learning
AlphaGo - The Movie | Full Documentary: https://youtu.be/WXuK6gekU1Y?t=2363
Lectures: https://www.youtube.com/channel/UCP7jMXSY2xbc3KCAE0MHQ-A
Reinforcement Learning 10: Classic Games Case Study by David Silver:
https://www.youtube.com/watch?v=ld28AU7DDB4&list=PLqYmG7hTraZBKeNJ-JE_eyJHZ7XgBoAyb&index=10
https://qiita.com/icoxfog417/items/242439ecd1a477ece312
#Reinforcement Learning
Reinforcement Learning 6: Policy Gradients and Actor Critics by Hado van Hasselt:
https://www.youtube.com/watch?v=bRfUxQs6xIM&list=PLqYmG7hTraZBKeNJ-JE_eyJHZ7XgBoAyb&index=6
#Reinforcement Learning
Reinforcement Learning 10: Classic Games Case Study by David Silver:
https://www.youtube.com/watch?v=ld28AU7DDB4&list=PLqYmG7hTraZBKeNJ-JE_eyJHZ7XgBoAyb&index=10
Key points
• #Game theory (two- or multi-player, zero- or general-sum)
• #Nash equilibria
• #computation of Nash equilibria
• #Reinforcement Learning
• #PSRO (Policy-Space Response Oracles)
• #α-Rank
• #PageRank
• #MuJoCo soccer
#PSRO - Policy-Space Response Oracles
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., Graepel, T. (2017), NIPS 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement
Learning arXiv https://arxiv.org/abs/1711.00832
• PSRO (Lanctot+ 2017)
• The meta-game grows by adding policies (“oracles”) that approximate best responses to the meta-strategies of the
other players (a minimal skeleton follows below)
• A natural generalization of Double Oracle (DO) and fictitious self-play
• Linked to empirical game-theoretic analysis (EGTA)
• Double Oracle (DO)
• Double Oracle solves the (two-player, normal-form) sub-games induced by the players’ strategy subsets at each time t
• Introduced in “Planning in the presence of cost functions controlled by an adversary”, ICML 2003
• Applied in “Algorithms for computing strategies in two-player simultaneous move games”, Artificial Intelligence 2016
Bošanský, B., Lisý, V., Lanctot, M., Čermák, J., Winands, M. (2016). Algorithms for computing strategies in two-player simultaneous move games Artificial Intelligence 237(), 1-40.
https://dx.doi.org/10.1016/j.artint.2016.03.005
Planning in the Presence of Cost Functions Controlled by an Adversary
#PSRO - Policy-Space Response Oracles
Example of application of Double Oracle (DO)
to two-player simultaneous move game
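Putting the pieces together, a minimal PSRO skeleton as described above. This is a reading-notes sketch: the game interface, meta_solver (Nash, α-Rank, uniform, ...), and oracle (approximate best response, e.g. an RL trainer) are placeholder hooks, not the paper's code:

```python
def psro(game, initial_policies, meta_solver, oracle, iterations):
    policies = [list(p) for p in initial_policies]  # one growing policy set per player
    meta_strategy = None
    for _ in range(iterations):
        payoffs = game.empirical_payoffs(policies)  # simulate all profile match-ups
        meta_strategy = meta_solver(payoffs)        # distribution over each policy set
        for player in range(game.num_players):
            # Train a new policy against opponents sampled from the meta-strategy.
            new_policy = oracle(game, player, policies, meta_strategy)
            policies[player].append(new_policy)     # grow the meta-game
    return policies, meta_strategy
```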
Results - MuJoCo Soccer Game
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
PSRO(α-Rank, RL) and PSRO(Uniform, RL) agents: 8 best trained agents, 3 vs. 3 soccer game
PSRO(α-Rank, RL) and self-play-based training agents: 8 best trained agents, 2 vs. 2 soccer game
#PSRO(Nash, BR) vs PSRO(α-Rank, BR) vs PSRO(α-Rank, PBR)
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
• PSRO(Nash, BR) will eventually return an NE in two-player zero-sum games [McMahan+, 2003]
• How about PSRO(α-Rank, BR)? - No, a counterexample can be shown
#PSRO(Nash, BR) vs PSRO(α-Rank, BR) vs PSRO(α-Rank, PBR)
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
#PSRO(Nash, BR) vs PSRO(α-Rank, BR) vs PSRO(α-Rank, PBR)
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
• PSRO(α-Rank, BR) can terminate with the strategy set {A, B, C, D} without ever
discovering strategy X in the sink strongly-connected component
• How do we fix the issue? - by defining the Preference-Based Best Response (PBR) oracle
#α-Rank
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., Munos, R. (2019). A Generalized Training
Approach for Multiagent Learning arXiv https://arxiv.org/abs/1909.12823
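α-Rank scores joint strategy profiles by the stationary distribution of an evolutionary Markov chain over profiles. A sketch of the final step only, assuming the row-stochastic transition matrix has already been built from the payoff tables and that the chain is irreducible:

```python
import numpy as np

def alpha_rank_scores(transition_matrix):
    # The stationary distribution pi satisfies pi @ M = pi, i.e. pi is the
    # eigenvector of M.T for eigenvalue 1 (the largest for a stochastic matrix).
    eigenvalues, eigenvectors = np.linalg.eig(transition_matrix.T)
    top = np.argmax(eigenvalues.real)
    stationary = np.abs(eigenvectors[:, top].real)
    return stationary / stationary.sum()  # each profile's mass is its alpha-Rank score
```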
#PageRank
PageRank is an algorithm used by Google Search to rank web pages in their search engine
results. PageRank was named after Larry Page, one of the founders of Google
#PageRank
PageRank is computed by iterating the Google matrix.
Perron-Frobenius theorem: if all entries of an n × n matrix A are positive, then A has a unique
eigenvalue of maximum modulus, and the corresponding eigenvector can be chosen with all positive entries.
The PageRank Citation Ranking: Bringing Order to the Web, 1998, http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
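A minimal power-iteration sketch of the idea (the adjacency convention, entry [i, j] = 1 if page j links to page i, is my choice; damping 0.85 is the value used in the original paper):

```python
import numpy as np

def pagerank(adjacency, damping=0.85, iterations=100):
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=0).astype(float)
    out_degree[out_degree == 0] = 1.0          # avoid division by zero for dangling pages
    M = adjacency / out_degree                 # column-stochastic link matrix
    G = damping * M + (1.0 - damping) / n      # the Google matrix: all entries positive,
    rank = np.full(n, 1.0 / n)                 # so Perron-Frobenius guarantees a unique
    for _ in range(iterations):                # positive principal eigenvector
        rank = G @ rank
    return rank
```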