QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning
2021.4.30 정민재
Multi-agent system (MAS) setting
't Hoen, Pieter Jan, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and
Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
• Cooperative
- The agents pursue a common goal
• Competitive
- Non-aligned goals
- Individual agents seek only to maximize their own gains
MAS setting
● Challenge
○ The joint action space grows exponentially with the number of agents
■ e.g., N agents with 4 discrete actions each: 4^N joint actions
○ Agents' partial observability and communication constraints
■ Agents cannot access the full state
NEED: decentralized policies
[Figure: each agent has 4 individual actions (Up, Down, Right, Left); for two agents these combine into 4² = 16 joint actions: (Up, Down), (Up, Left), (Up, Up), (Up, Right), (Down, Down), (Down, Up), …]
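A quick sketch of this combinatorial blow-up (the action names are just the ones from the figure above):

```python
from itertools import product

# Joint action space for N agents, each with 4 discrete actions, grows as 4**N.
actions = ["Up", "Down", "Right", "Left"]
for n_agents in range(1, 5):
    joint_actions = list(product(actions, repeat=n_agents))
    print(n_agents, len(joint_actions))   # 1 -> 4, 2 -> 16, 3 -> 64, 4 -> 256
```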
Centralized Training, Decentralized Execution (CTDE)
● During training in simulation or a laboratory environment, we can use the global state or extra state information and lift the communication constraint; execution remains decentralized
[Figure: centralized training vs. decentralized execution]
CTDE Approach
How do we learn the joint action-value function 𝑸𝒕𝒐𝒕 and extract decentralized policies?
● Independent Q-learning (IQL) [Tan 1993]: learn independent, individual action-value functions; each agent acts greedily w.r.t. its own 𝑸𝒂
+ Simplest option
+ Learns decentralized policies trivially
- Cannot handle the non-stationary case: the other agents are also learning and changing their strategies, so convergence is not guaranteed
● Value Decomposition Networks (VDN) [Sunehag 2017]: learn a centralized but factored action-value function, 𝑸𝒕𝒐𝒕 = Qa1 + Qa2 + Qa3, with per-agent greedy policies
+ Learns 𝑸𝒕𝒐𝒕
+ Easy to extract decentralized policies
- Limited representation capacity
- Does not use additional global state information
● Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]: learn a centralized, full-state action-value function
+ Learns 𝑸𝒕𝒐𝒕 directly with an actor-critic framework
- On-policy: sample-inefficient
- Less scalable
● Goal: learn 𝑸𝒕𝒐𝒕 -> figure out the effectiveness of the agents' actions, and extract decentralized policies <- joint-action-space growth, local observability, communication constraints
● QMIX!
Background
DEC-POMDP (decentralized partially observable Markov decision process)
𝑠 ∈ 𝑆 : state
𝑢 ∈ 𝑈 : joint action
𝑃(𝑠′|𝑠, 𝑢) : transition function
𝑟(𝑠, 𝑢) : reward function
𝑛 : number of agents
𝑎 ∈ 𝐴 : agent
𝑧 ∈ 𝑍 : observation
𝑂(𝑠, 𝑎) : observation function
𝛾 : discount factor
𝜏𝑎 ∈ 𝑇 : action-observation history
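Written compactly, a restatement of the tuple for reference (assuming the standard Dec-POMDP formulation used in the QMIX paper):

$$G = \langle S, U, P, r, Z, O, n, \gamma \rangle, \qquad \mathbf{u} \in \mathbf{U} \equiv U^{n}, \qquad \tau^{a} \in T \equiv (Z \times U)^{*}$$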
Background: Value decomposition
● Factored joint value function [Guestrin 2001]
- The joint value is written as a sum of local value functions, each depending only on a subset of the agents
- A factored value function reduces the number of parameters that have to be learned and improves scalability
● VDN's full factorisation [Sunehag 2017]
- 𝑄𝑡𝑜𝑡 is the sum of per-agent terms 𝑄𝑎
- Each 𝑄𝑎 is a utility function, not a value function: it does not estimate an expected return on its own
Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
https://youtu.be/W_9kcQmaWjo
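A minimal numeric sketch of VDN's additive factorisation, 𝑄𝑡𝑜𝑡 = Σ𝑎 𝑄𝑎; the per-agent utilities below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical per-agent utilities Q_a(tau_a, .) for two agents with two actions each.
q_agent = [np.array([1.0, 3.0]),
           np.array([2.0, 0.5])]

# Decentralized greedy action for each agent.
greedy = tuple(int(np.argmax(q)) for q in q_agent)

# VDN's centralized Q_tot is just the sum over agents, so the joint argmax
# coincides with the per-agent argmaxes (additivity is a monotonic mixing).
q_tot = q_agent[0][:, None] + q_agent[1][None, :]
assert np.unravel_index(np.argmax(q_tot), q_tot.shape) == greedy
```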
QMIX: Key Idea
Key idea: VDN's full factorisation is not necessary
• Consistency holds if a global argmax performed on 𝑄𝑡𝑜𝑡 yields the same result as a set of individual argmax operations performed on each 𝑄𝑎
• Assumption: the environment is not adversarial
• VDN's representation also satisfies this
• QMIX generalizes it to the larger family of monotonic functions
How do we ensure this?
QMIX: Monotonicity constraint
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
The argmax joint action of 𝑸𝒕𝒐𝒕 is the set of individual argmax actions of the 𝑸𝒂:

$$\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
\arg\max_{u^{1}} Q_{1}(\tau^{1}, u^{1}) \\
\vdots \\
\arg\max_{u^{n}} Q_{n}(\tau^{n}, u^{n})
\end{pmatrix}$$

If the monotonicity constraint
$$\frac{\partial Q_{tot}}{\partial Q_{a}} \ge 0, \quad \forall a$$
holds, then the argmax consistency above follows.
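A small numeric check of this property: any mixing with non-negative weights (one simple member of the monotonic family; the weights and Q-values below are arbitrary illustrations) preserves the per-agent argmaxes.

```python
import numpy as np

rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=5), rng.normal(size=5)   # hypothetical per-agent Q-values
w1, w2, b = 0.7, 1.3, 0.2                         # non-negative weights => dQ_tot/dQ_a >= 0

# Q_tot over all joint actions; monotonic in each agent's Q, so the joint argmax
# is exactly the pair of individual argmaxes.
q_tot = w1 * q1[:, None] + w2 * q2[None, :] + b
joint_argmax = np.unravel_index(np.argmax(q_tot), q_tot.shape)
assert joint_argmax == (int(np.argmax(q1)), int(np.argmax(q2)))
```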
QMIX: Architecture
● QMIX represents 𝑄𝑡𝑜𝑡 with an architecture consisting of agent networks, a mixing network, and hypernetworks
QMIX: agent network
• DRQN (Hausknecht 2015)
- Deals with partial observability
• Last action $u^{a}_{t-1}$ as an input
- Included because the policies are stochastic during training
• Agent ID (optional)
- Enables heterogeneous policies
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
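A minimal PyTorch sketch of such an agent network (layer sizes and the choice of a GRU cell are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """DRQN-style agent network: observation + one-hot last action (+ optional agent ID)
    go through a linear layer, a recurrent cell, and a head producing Q_a(tau_a, .)."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)    # recurrence handles partial observability
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, inputs, hidden):
        x = torch.relu(self.fc_in(inputs))
        h = self.rnn(x, hidden)                          # carries the action-observation history
        return self.fc_out(h), h
```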
QMIX: Mixing Network
• Hypernetworks
- Turn the state 𝑠𝑡 into the weights of the mixing network
- Allow 𝑄𝑡𝑜𝑡 to depend on the extra state information
• Absolute operation on the generated weights
- Ensures the monotonicity constraint
• Why not pass 𝒔𝒕 directly into the mixing network?
- It would force 𝑠𝑡 through the monotonic function too, overly constraining it and reducing representational capacity
• ELU activation
- A negative input is likely to remain negative; with ReLU it would be zeroed out by the mixing network
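A PyTorch sketch of the mixing network with its hypernetworks (the embedding size and exact hypernetwork depths are illustrative; torch.abs on the generated weights is what enforces the monotonicity constraint):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Mixes per-agent Qs into Q_tot. Hypernetworks map the global state s_t to the
    mixing weights; absolute values keep those weights non-negative (monotonicity)."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # ELU, not ReLU
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                          # (batch, 1, 1)
        return q_tot.view(-1)
```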
QMIX: algorithm
Initialization -> rollout an episode -> sample episodes from the replay buffer -> update 𝑄𝑡𝑜𝑡 -> update the target network
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020):
1-51.
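A minimal sketch of one such 𝑄𝑡𝑜𝑡 update (the replay-buffer layout, tensor shapes, and the use of a feed-forward agent network instead of the recurrent one are simplifying assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def qmix_update(batch, agent_net, mixer, target_agent, target_mixer, optimizer, gamma=0.99):
    # batch tensors: obs (B, n, obs_dim), state (B, state_dim), actions (B, n) long,
    # reward (B,), done (B,), plus next_obs and next_state with matching shapes.
    B, n, _ = batch["obs"].shape

    q = agent_net(batch["obs"].reshape(B * n, -1)).reshape(B, n, -1)
    chosen_q = q.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)   # (B, n)
    q_tot = mixer(chosen_q, batch["state"])                               # (B,)

    with torch.no_grad():                                                 # TD target from target nets
        next_q = target_agent(batch["next_obs"].reshape(B * n, -1)).reshape(B, n, -1)
        next_max = next_q.max(dim=-1).values                              # per-agent greedy utilities
        y = batch["reward"] + gamma * (1 - batch["done"]) * target_mixer(next_max, batch["next_state"])

    loss = F.mse_loss(q_tot, y)                                           # squared TD error on Q_tot
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```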
Representational complexity
[Figure: payoff matrix of a non-monotonic matrix game with the learned QMIX 𝑄𝑡𝑜𝑡 and the learned VDN 𝑄𝑡𝑜𝑡]
• Any value function in which an agent's best action depends on the other agents' actions at the same time step will not factorise perfectly with QMIX
• The monotonicity constraint prevents QMIX from representing non-monotonic functions
https://youtu.be/W_9kcQmaWjo
Representational complexity
https://youtu.be/W_9kcQmaWjo
Shimon Whiteson
• Even though VDN cannot represent the middle example exactly, it can approximate it with a value function from the left game
• Should we care about the games that sit in the middle?
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
• It matters because of bootstrapping
Representational complexity
• QMIX still learns the correct maximum over the Q-values
[Figure: payoff matrix and the learned QMIX 𝑄𝑡𝑜𝑡]
• Less bootstrapping error results in better action selection in earlier states
Let's see this with a two-step game.
• Two-Step Game
[Figure: two-step game payoff matrices with the 𝑄𝑡𝑜𝑡 learned by VDN and by QMIX, and the resulting greedy policies]
- VDN's greedy policy: (A, ·) at the first step, then (A, B) or (B, B), for a return of 7
- QMIX's greedy policy: (B, ·) at the first step, then (B, B), for the optimal return of 8
- QMIX's higher representational capacity, the ability to express the more complex situation, yields a better strategy than VDN
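A tiny sketch of the two-step game's payoffs (the 7 and 8 are the returns quoted above; the remaining entries of state 2B's matrix are illustrative placeholders):

```python
import numpy as np

# Agent 1's first action selects the next state: A -> state 2A, B -> state 2B.
payoff_2A = np.full((2, 2), 7.0)          # every joint action in 2A pays 7
payoff_2B = np.array([[0.0, 1.0],
                      [1.0, 8.0]])        # only the coordinated (B, B) pays 8

# A factorisation that undervalues state 2B greedily picks A and settles for 7;
# one that can represent 2B's structure picks B and earns the optimal 8.
print(payoff_2A.max(), payoff_2B.max())   # 7.0 8.0
```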
Experiment: SC2
• StarCraft II has a rich set of complex micro-actions that allow learning complex interactions between collaborating agents
• The SC2LE environment mitigates many of the practical difficulties of using a game as an RL platform
[Figure: micromanagement scenario with allied units (learned agents) versus enemy units]
https://youtu.be/HIqS-r4ZRGg
Experiment: SC2
• Observation (within sight range)
- distance
- relative x, y
- unit_type
• Action
- move[direction]
- attack[enemy_id] (within shooting range)
- stop
- noop
• Reward
- joint reward: total damage dealt at each time step
- bonus 1: 10 for killing each opponent
- bonus 2: 100 for killing all opponents
• Global state (hidden from agents)
- distance from the center, health, shield, cooldown, last action
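A minimal interaction sketch with the SMAC benchmark that accompanies the paper (the map name and the random action selection are placeholders for illustration):

```python
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="3m")
env_info = env.get_env_info()
n_agents = env_info["n_agents"]

env.reset()
terminated = False
while not terminated:
    obs = env.get_obs()        # per-agent partial observations (decentralized execution)
    state = env.get_state()    # global state, used only for centralized training
    actions = []
    for a in range(n_agents):
        avail = np.nonzero(env.get_avail_agent_actions(a))[0]   # mask of legal actions
        actions.append(np.random.choice(avail))                 # placeholder policy
    reward, terminated, info = env.step(actions)                # shared team reward
env.close()
```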
Experiment: SC2 - main results
• IQL: highly unstable <- non-stationarity of the environment
• VDN: better than IQL in every experimental setup; learns to focus fire
• QMIX: superior in the heterogeneous agent settings
[Figure: win rates on homogeneous and heterogeneous maps; the initial hump on the heterogeneous maps corresponds to learning a simple strategy first]
Experiment: SC2 - ablation results
• QMIX-NS: without hypernetworks -> tests the significance of the extra state information
• QMIX-Lin: removes the hidden layer -> tests the necessity of non-linear mixing
• VDN-S: adds a state-dependent term to the sum of the 𝑄𝑎 -> tests the significance of utilizing the state 𝒔𝒕
[Figure: ablation results on homogeneous and heterogeneous maps]
• Non-linear factorization is not always required
Conclusion - QMIX
• Centralized training, decentralized execution
• Allows a rich joint action-value function 𝑄𝑡𝑜𝑡 while ensuring the monotonicity constraint
• Outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment
More information
https://youtu.be/W_9kcQmaWjo
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent
reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.