QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning
2021.4.30 정민재
Multi agent system setting (MAS)
't Hoen, Pieter Jan, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and
Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
• Cooperative
- The agents pursue a common goal
• Competitive
- Non-aligned goals
- Individual agents seek only to maximize their own gains
MAS setting
● Challenge
○ The joint action space grows exponentially with the number of agents
■ Ex) 𝑁 agents with 4 discrete actions each: 4^𝑁 joint actions
○ Partial observability and communication constraints of the agents
■ Agents cannot access the full state
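NEED: decentralized policy
[Figure: a single agent chooses among 4 actions (Up, Down, Right, Left); two agents already face 16 joint actions such as (Up, Down), (Up, Left), (Up, Up), (Up, Right), (Down, Down), (Down, Up), ⋮]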
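The growth of the joint action space can be checked directly. The sketch below enumerates joint actions for the four movement actions named on the slide; `joint_actions` is a hypothetical helper introduced only for this illustration.

```python
from itertools import product

# The four discrete actions shown on the slide.
ACTIONS = ("Up", "Down", "Right", "Left")


def joint_actions(n_agents, actions=ACTIONS):
    """Return every joint action (one action per agent) as a list of tuples."""
    return list(product(actions, repeat=n_agents))


# One agent has 4 choices, two agents have 4**2 = 16, and so on.
for n in (1, 2, 3):
    print(n, "agent(s):", len(joint_actions(n)), "joint actions")
```

Enumeration like this is only feasible for a handful of agents, which is the motivation for policies that each pick from their own small action set.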
Centralized Training Decentralized Execution (CTDE)
● In simulation or a laboratory environment, training can use the global state or other extra state information, and the communication constraints can be lifted
[Figure: centralized training, decentralized execution]
CTDE Approach
How to learn the action-value function 𝑸𝒕𝒐𝒕 and extract decentralized policies?
• Independent Q-learning (IQL) [Tan 1993]
Learn an independent individual action-value function for each agent
+ Simplest option
+ Decentralized policies are learned trivially
- Cannot handle non-stationarity: the other agents are also learning and changing their strategies -> no convergence guarantee
[Figure: policy 1, policy 2 and policy 3 learned separately]
• Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]
Learn a centralized, full-state action-value function
+ Learns 𝑸𝒕𝒐𝒕 directly within an actor-critic framework
- On-policy: sample inefficient
- Less scalable
[Figure: a single 𝑸𝒕𝒐𝒕 feeding policy 1, policy 2 and policy 3]
• Value decomposition network (VDN) [Sunehag 2017]
Learn a centralized but factored action-value function
+ Learns 𝑸𝒕𝒐𝒕
+ Easy to extract decentralized policies
- Limited representational capacity
- Does not use additional global state information
[Figure: 𝑸𝒕𝒐𝒕 = 𝑸𝒂𝟏 + 𝑸𝒂𝟐 + 𝑸𝒂𝟑; each policy is greedy with respect to its own 𝑸𝒂]
● Learn 𝑸𝒕𝒐𝒕 -> figure out the effectiveness of the agents' actions
● Extract decentralized policies <- joint-action space growth, local observability, communication constraints
QMIX!
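To make the "easy to extract decentralized policy" point for the additive (VDN-style) case concrete, here is a minimal NumPy sketch. The function names and the toy utilities are assumptions made for this example; no learning is involved, it only compares the greedy joint action found centrally with the one found agent by agent.

```python
import numpy as np


def vdn_q_tot(agent_qs):
    """Joint table Q_tot[u1, ..., un] = sum_a Q_a[u_a] (additive factorisation)."""
    q_tot = np.zeros([len(q) for q in agent_qs])
    for axis, q in enumerate(agent_qs):
        shape = [1] * len(agent_qs)
        shape[axis] = len(q)
        # Broadcast each agent's utilities along its own axis of the joint table.
        q_tot = q_tot + np.asarray(q, dtype=float).reshape(shape)
    return q_tot


def greedy_decentralised(agent_qs):
    """Each agent takes the argmax of its own utilities: n small searches."""
    return tuple(int(np.argmax(q)) for q in agent_qs)


def greedy_centralised(q_tot):
    """Argmax over the full joint table: one search over all joint actions."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(q_tot), q_tot.shape))


# Toy per-agent utilities (hypothetical numbers).
agent_qs = [np.array([0.1, 0.9, 0.3]), np.array([1.0, -2.0]), np.array([0.0, 0.5, 0.2, 0.4])]
print(greedy_decentralised(agent_qs), greedy_centralised(vdn_q_tot(agent_qs)))
```

Both searches return the same joint action, but the decentralised one never builds the joint table.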
Background
DEC-POMDP (decentralized partially observable Markov decision process)
𝑠 ∈ 𝑆 : state
𝑢 ∈ 𝑈 : joint action
𝑃(𝑠′ | 𝑠, 𝑢) : transition function
𝑟(𝑠, 𝑢) : reward function
𝑛 : number of agents
𝑎 ∈ 𝐴 : agent
𝑧 ∈ 𝑍 : observation
𝑂(𝑠, 𝑎) : observation function
𝛾 : discount rate
𝜏^𝑎 ∈ 𝑇 : action-observation history of agent 𝑎
Background: Value decomposition
• Factored joint value function [Guestrin 2001]
- Each factor depends on only a subset of the agents
- A factored value function reduces the number of parameters that have to be learned
- Improves scalability
• VDN: full factorization [Sunehag 2017]
- Each 𝑄𝑎 is a utility function, not a value function
- On its own it does not estimate an expected return
Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
https://youtu.be/W_9kcQmaWjo
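Written out with the DEC-POMDP symbols from the background slide, the two factorisations can be sketched as follows. The index set $E$ and the notation $\mathbf{u}^{e}$ for the actions of the agents in subset $e$ are assumptions introduced here for illustration; the second line is the fully factored, one-term-per-agent special case used by VDN.

```latex
% Factored joint value function: one term per subset e of agents
Q_{tot}(s, \mathbf{u}) = \sum_{e \in E} Q_e(s, \mathbf{u}^{e})

% VDN: full factorisation into per-agent utilities
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^{a}, u^{a})
```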
QMIX: Key Idea
Key idea: the full factorization of VDN is not necessary
• It is enough to ensure consistency: a global argmax performed on 𝑄𝑡𝑜𝑡 yields the same result as a set of individual argmax operations performed on each 𝑄𝑎
• Assumption: the environment is not adversarial
• VDN's representation also satisfies this
• QMIX's representation generalizes to the larger family of monotonic functions
How to ensure this?
QMIX: Monotonicity constraint
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
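The argmax joint action of 𝑸𝒕𝒐𝒕 is the set of individual argmax actions of each 𝑸𝒂
If ∂𝑄𝑡𝑜𝑡 / ∂𝑄𝑎 ≥ 0 for every agent 𝑎 (𝑄𝑡𝑜𝑡 is monotonic in each 𝑄𝑎)
Then argmax over 𝒖 of 𝑄𝑡𝑜𝑡(𝝉, 𝒖) = (argmax over 𝑢^1 of 𝑄1(𝜏^1, 𝑢^1), …, argmax over 𝑢^𝑛 of 𝑄𝑛(𝜏^𝑛, 𝑢^𝑛))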
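The implication can be checked by brute force on a toy case. In the sketch below, `monotonic_mix` and `non_monotonic_mix` are hypothetical hand-written mixing functions (not learned networks) and the utilities are made-up numbers; the point is only that the joint argmax matches the per-agent argmaxes when the mixer is increasing in every input, and can differ when it is not.

```python
from itertools import product

import numpy as np


def joint_argmax(mix, agent_qs):
    """Brute-force argmax of Q_tot = mix([Q_1[u1], ..., Q_n[un]]) over joint actions."""
    ranges = (range(len(q)) for q in agent_qs)
    return max(
        product(*ranges),
        key=lambda joint: mix([q[u] for q, u in zip(agent_qs, joint)]),
    )


def individual_argmax(agent_qs):
    """Each agent maximises its own utility independently."""
    return tuple(int(np.argmax(q)) for q in agent_qs)


def monotonic_mix(qs):
    # Increasing in both inputs: positive weights plus an increasing non-linearity.
    return 2.0 * qs[0] + 0.5 * qs[1] + np.tanh(qs[0] + qs[1])


def non_monotonic_mix(qs):
    # Decreasing in the second input, so the constraint is violated.
    return qs[0] - qs[1]


agent_qs = [np.array([1.0, 3.0, 2.0]), np.array([0.5, -1.0])]
print(individual_argmax(agent_qs))               # per-agent greedy actions
print(joint_argmax(monotonic_mix, agent_qs))      # agrees with the line above
print(joint_argmax(non_monotonic_mix, agent_qs))  # disagrees for agent 2
```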
QMIX: Architecture
● QMIX represents 𝑄𝑡𝑜𝑡 using an architecture consisting of agent networks, a mixing network, and a set of hypernetworks
QMIX: agent network
• DRQN (Hausknecht 2015)
- Deals with partial observability
• Previous action 𝑢^𝑎_{𝑡−1} as input
- Needed because policies are stochastic during training
• Agent ID (optional)
- Allows heterogeneous policies
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
QMIX: Mixing Network
• Hypernetwork
- Transforms the state 𝑠𝑡 into the weights of the mixing network
- Allows 𝑄𝑡𝑜𝑡 to depend on the extra state information
• Absolute operation
- Applied to the generated weights so that they are non-negative, which ensures the monotonicity constraint
• Why not pass 𝒔𝒕 directly into the mixing network?
- That would overly constrain 𝑠𝑡 by forcing it through the monotonic function, reducing representational capacity
• ELU activation
- Negative inputs are likely to remain negative, and with ReLU they would be zeroed by the mixing network
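A forward pass of such a mixer can be sketched in NumPy as follows. This is a minimal illustration with random, untrained parameters: the class name `QMixer`, the embedding size of 32, the bias-free linear hypernetworks and the two-layer form of the final bias are all assumptions of this sketch rather than a statement of the reference implementation.

```python
import numpy as np


def elu(x):
    """ELU: identity for positive inputs, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


class QMixer:
    """State-conditioned monotonic mixer (forward pass only, untrained)."""

    def __init__(self, n_agents, state_dim, embed_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.n_agents, self.embed_dim = n_agents, embed_dim

        def init(*shape):
            return rng.normal(scale=0.1, size=shape)

        # Hypernetworks: linear maps from the state to mixer parameters.
        self.hyper_w1 = init(state_dim, n_agents * embed_dim)
        self.hyper_b1 = init(state_dim, embed_dim)
        self.hyper_w2 = init(state_dim, embed_dim)
        self.hyper_b2_in = init(state_dim, embed_dim)
        self.hyper_b2_out = init(embed_dim)

    def __call__(self, agent_qs, state):
        # abs() keeps the mixing weights non-negative (monotonicity constraint).
        w1 = np.abs(state @ self.hyper_w1).reshape(self.n_agents, self.embed_dim)
        w2 = np.abs(state @ self.hyper_w2)
        # Biases are not constrained, so they may be negative.
        b1 = state @ self.hyper_b1
        b2 = np.maximum(state @ self.hyper_b2_in, 0.0) @ self.hyper_b2_out
        hidden = elu(agent_qs @ w1 + b1)
        return float(hidden @ w2 + b2)


mixer = QMixer(n_agents=3, state_dim=5)
rng = np.random.default_rng(1)
print(mixer(rng.normal(size=3), rng.normal(size=5)))
```

The state only enters through the generated weights and biases, so 𝑄𝑡𝑜𝑡 can depend on it freely while staying monotonic in each agent's value.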
QMIX: algorithm
• Initialization
• Roll out an episode
• Sample episodes from the replay buffer
• Update 𝑄𝑡𝑜𝑡
• Update the target network
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020):
1-51.
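The "Update 𝑄𝑡𝑜𝑡" step is a DQN-style temporal-difference update on the mixed value. The sketch below shows only the target and loss arithmetic; `td_targets`, `td_loss`, the array shapes, and the use of a mean rather than a sum over the batch are assumptions of this example, and `target_mixer` stands for any callable that maps per-agent values and a state to a scalar.

```python
import numpy as np


def td_targets(rewards, terminated, next_agent_qs, next_states, target_mixer, gamma=0.99):
    """y = r + gamma * max_u' Q_tot(tau', u', s'), using target-network values.

    next_agent_qs has shape (batch, n_agents, n_actions).
    """
    # With a monotonic mixer the joint max can be taken per agent before mixing.
    best_next = next_agent_qs.max(axis=-1)
    next_q_tot = np.array([target_mixer(q, s) for q, s in zip(best_next, next_states)])
    # No bootstrapping from terminal transitions.
    return rewards + gamma * (1.0 - terminated) * next_q_tot


def td_loss(q_tot, targets):
    """Mean squared TD error over the batch."""
    return float(np.mean((targets - q_tot) ** 2))


# Toy batch of two transitions; a plain sum stands in for the target mixer.
rewards = np.array([1.0, 0.0])
terminated = np.array([0.0, 1.0])
next_agent_qs = np.array([[[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]],
                          [[5.0, 5.0, 5.0], [1.0, 1.0, 1.0]]])
next_states = np.zeros((2, 4))
y = td_targets(rewards, terminated, next_agent_qs, next_states,
               target_mixer=lambda q, s: q.sum(), gamma=0.5)
print(y, td_loss(np.array([3.0, 1.0]), y))
```

"Update target" then amounts to periodically copying the online parameters into the target networks.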
Representational complexity
[Figure: learned 𝑄𝑡𝑜𝑡 of QMIX and of VDN]
• If an agent's best action depends on the actions the other agents take at the same time step, the value function will not factorize perfectly with QMIX
• The monotonicity constraint prevents QMIX from representing non-monotonic functions
https://youtu.be/W_9kcQmaWjo
Representational complexity
https://youtu.be/W_9kcQmaWjo
Shimon Whiteson
• Even though VDN cannot represent the example in the middle exactly, it can approximate it with a value function from the left game
• Should we care about the games that are in the middle?
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
• It matters because of bootstrapping
Representational complexity
• QMIX still learns the correct maximum over the Q-values
[Figure: payoff matrix and learned QMIX 𝑄𝑡𝑜𝑡]
• Less bootstrapping error results in better action selection in earlier states
Let's see with two-step games
• Two-Step Game
[Figure: learned 𝑄𝑡𝑜𝑡 of VDN and of QMIX]
Representational complexity
The ability to express complex situations
• Greedy policy of VDN: (A, ∙) -> (A,B) or (B,B), return 7 in both cases
• Greedy policy of QMIX: (B, ∙) -> (B,B), return 8
QMIX's higher representational capacity -> better strategy than VDN
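The gap in representational capacity can be illustrated on a single 2x2 payoff matrix. The matrix below is an assumed example whose best cell is worth 8, and the functions are hypothetical helpers: `best_additive_fit` returns the least-squares fit of the form Q_1[u1] + Q_2[u2] (the best a sum of utilities can do), while `monotonic_table` uses a hand-picked monotonic mixing of two utilities that reproduces the matrix exactly.

```python
import numpy as np

# Assumed second-stage payoff matrix (rows: agent 1's action, columns: agent 2's).
PAYOFF = np.array([[0.0, 1.0],
                   [1.0, 8.0]])


def best_additive_fit(payoff):
    """Least-squares fit of payoff[i, j] by a_i + b_j (row mean + column mean - grand mean)."""
    payoff = np.asarray(payoff, dtype=float)
    row = payoff.mean(axis=1, keepdims=True)
    col = payoff.mean(axis=0, keepdims=True)
    return row + col - payoff.mean()


def monotonic_table(q1, q2):
    """Q_tot = Q_1 + Q_2 + 6 * Q_1 * Q_2, increasing in both inputs for Q_a >= 0."""
    return np.add.outer(q1, q2) + 6.0 * np.outer(q1, q2)


additive = best_additive_fit(PAYOFF)
monotonic = monotonic_table(np.array([0.0, 1.0]), np.array([0.0, 1.0]))
print(additive)   # [[-1.5, 2.5], [2.5, 6.5]]
print(monotonic)  # [[0, 1], [1, 8]]
```

The additive fit still ranks the best cell first, but it values it at 6.5 instead of 8. If the other branch of a two-step game pays 7, a learner that bootstraps from 6.5 prefers that branch, which matches the greedy behaviour listed above.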
Experiment: SC2
• StarCraft II has a rich set of complex micro-actions that allow the learning of complex interactions between collaborating agents
• The SC2LE environment mitigates many of the practical difficulties in using a game as an RL platform
[Figure: allied units vs. enemy units]
https://youtu.be/HIqS-r4ZRGg
Experiment: SC2
• Observation (Sight range)
- distance
- relative x, y
- unit_type
• Action
- move[direction]
- attack[enemy_id] (Shooting range)
- stop
- noop
• Reward
- joint reward: total damage dealt (at each time step)
- bonus 1: 10 (for killing each opponent)
- bonus 2: 100 (for killing all opponents)
• Global state (hidden from agents)
- distance from center, health, shield, cooldown, last action
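As a small worked example of the reward bullets, the per-step team reward could be computed as below. `joint_reward` and its argument names are hypothetical; the bonus values default to the figures listed on this slide and are exposed as parameters in case a different scale is used.

```python
def joint_reward(damage_dealt, enemies_killed, all_enemies_dead,
                 kill_bonus=10.0, win_bonus=100.0):
    """Shared reward for one time step: damage plus kill and win bonuses."""
    reward = float(damage_dealt) + kill_bonus * enemies_killed
    if all_enemies_dead:
        reward += win_bonus
    return reward


# 12.5 damage and one kill this step, battle still ongoing.
print(joint_reward(12.5, 1, False))  # 22.5
# 5 damage, the last two enemies die this step.
print(joint_reward(5.0, 2, True))    # 125.0
```

Every agent receives this same number, which is what makes the task fully cooperative.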
Experiment: SC2 - main results
• IQL: highly unstable <- non-stationarity of the environment
• VDN: better than IQL in every experimental setup; learns to focus fire
• QMIX: superior in the heterogeneous agent setting
[Figure: learning curves for heterogeneous and homogeneous scenarios; the initial hump corresponds to learning the simple strategy]
Experiment : SC2 - ablation result
• QMIX-NS: without hypernetworks -> significance of extra state information
• QMIX-Lin: hidden layer removed -> necessity of non-linear mixing
• VDN-S: adds a state-dependent term to the sum of the 𝑄𝑎 -> significance of utilizing the state 𝒔𝒕
[Figure: ablation results for heterogeneous and homogeneous scenarios; non-linear factorization is not always required]
Conclusion - QMIX
• Centralized training, decentralized execution
[Figure: training vs. execution]
• Allows a rich joint action-value function 𝑄𝑡𝑜𝑡 while enforcing the monotonicity constraint
• QMIX outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment
More information
https://youtu.be/W_9kcQmaWjo
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent
reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.

  • 2. Multi-agent system (MAS) setting
      • Cooperative
        - The agents pursue a common goal
      • Competitive
        - Non-aligned goals
        - Individual agents seek only to maximize their own gains
      't Hoen, Pieter Jan, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
  • 3. MAS setting
      ● Challenges
        ○ The joint action space grows exponentially with the number of agents
          ■ Ex) N agents with 4 discrete actions each: $4^N$ joint actions
        ○ Partial observability and communication constraints
          ■ Agents cannot access the full state
      ● Hence the need for decentralized policies
      [Figure: one agent has 4 actions (Up, Down, Right, Left); two agents have 16 joint actions: (Up, Down), (Up, Left), (Up, Up), (Up, Right), (Down, Down), (Down, Up), ⋮]
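The growth of the joint action space can be checked by brute-force enumeration. The following is a minimal sketch; the function name `joint_actions` is hypothetical and the four actions are the ones named on the slide.

```python
from itertools import product

# The four discrete actions from the slide's example.
ACTIONS = ("Up", "Down", "Right", "Left")

def joint_actions(n_agents):
    """Return every joint action as a tuple with one entry per agent."""
    return list(product(ACTIONS, repeat=n_agents))

for n in (1, 2, 3):
    # The count is 4 ** n: 4, 16, 64, ...
    print(n, len(joint_actions(n)))
```

Enumerating like this is only feasible for a handful of agents, which is the reason the deck argues for decentralized policies.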
  • 4. Centralized Training, Decentralized Execution (CTDE)
      ● In simulation or a laboratory environment, training can use the global state or extra state information, and the communication constraints can be lifted
      [Figure panels: Centralized training, Decentralized execution]
  • 5. CTDE approaches
      How do we learn the action-value function $Q_{tot}$ and extract decentralized policies?
      ● Learning $Q_{tot}$ -> figure out the effectiveness of the agents' actions
      ● Extracting decentralized policies <- joint-action space growth problem, local observability, communication constraints
      ● QMIX!
      Independent Q-learning (IQL) [Tan 1993]: learn an independent individual action-value function for each agent (each policy is greedy with respect to its own $Q_a$)
        + Simplest option
        + Decentralized policies are learned trivially
        - Cannot handle non-stationarity: the other agents are also learning and changing their strategies, so convergence is not guaranteed
      Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]: learn a centralized, full-state action-value function
        + Learns $Q_{tot}$ directly within an actor-critic framework
        - On-policy: sample inefficient
        - Less scalable
      Value decomposition network (VDN) [Sunehag 2017]: learn a centralized but factored action-value function ($Q_{tot} = Q_{a1} + Q_{a2} + Q_{a3}$)
        + Learns $Q_{tot}$
        + Easy to extract decentralized policies
        - Limited representational capacity
        - Does not use additional global state information
  • 6. Background: DEC-POMDP (decentralized partially observable Markov decision process)
      - $s \in S$: state
      - $u \in U$: joint action
      - $P(s' \mid s, u)$: transition function
      - $r(s, u)$: reward function
      - $n$: number of agents
      - $a \in A$: agent
      - $z \in Z$: observation
      - $O(s, a)$: observation function
      - $\gamma$: discount rate
      - $\tau^a \in T$: action-observation history
  • 7. Background: Value decomposition
      ● Factored joint value function [Guestrin 2001]
        - Each factor depends on a subset of the agents
        - A factored value function reduces the number of parameters that have to be learned
        - Improves scalability
      ● VDN: full factorization
        - Each per-agent term is a utility function, not a value function: it does not estimate an expected return
      https://youtu.be/W_9kcQmaWjo
      Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
      Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
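To make the full factorization concrete, here is a minimal sketch of a VDN-style additive decomposition, in which $Q_{tot}$ is the plain sum of per-agent utilities. The utility values and the function names are hypothetical and chosen only for illustration. The sketch shows that each agent greedily maximizing its own utility recovers the joint action that maximizes the sum.

```python
from itertools import product

# Hypothetical per-agent utilities Q_a(u_a); one dict per agent.
agent_qs = [
    {"Up": 0.2, "Down": 1.5, "Left": -0.3},
    {"Up": 2.0, "Down": 0.1, "Left": 0.4},
    {"Up": -1.0, "Down": 0.0, "Left": 0.7},
]

def q_tot(joint_action):
    """Additive (VDN-style) joint value: sum of the per-agent utilities."""
    return sum(q[u] for q, u in zip(agent_qs, joint_action))

def decentralized_greedy():
    """Each agent picks the argmax of its own utility, with no coordination."""
    return tuple(max(q, key=q.get) for q in agent_qs)

def centralized_greedy():
    """Brute-force argmax of Q_tot over the whole joint action space."""
    return max(product(*(q.keys() for q in agent_qs)), key=q_tot)

print(decentralized_greedy())  # ('Down', 'Up', 'Left')
print(centralized_greedy())    # ('Down', 'Up', 'Left')
```

The decentralized version needs one argmax per agent, whereas the centralized version scans every joint action.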
  • 8. QMIX: Key idea
      ● Key idea: the full factorization of VDN is not necessary
        • Consistency holds if a global argmax performed on $Q_{tot}$ yields the same result as a set of individual argmax operations performed on each $Q_a$
        • Assumption: the environment is not adversarial
        • VDN's representation also satisfies this
        • QMIX's representation generalizes to the larger family of monotonic functions
      ● How to ensure this?
  • 9. QMIX: Monotonicity constraint
      ● If the monotonicity constraint holds, then the argmax joint action of $Q_{tot}$ is the set of individual argmax actions of each $Q_a$
      Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
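The formulas on this slide did not survive in the transcript. As stated in the cited QMIX paper, the consistency condition and the monotonicity constraint that is sufficient for it can be written as follows (notation as in the DEC-POMDP background slide, with bold symbols denoting joint quantities):

```latex
% Consistency: the joint greedy action equals the tuple of per-agent greedy actions
\[
\operatorname*{argmax}_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
\operatorname*{argmax}_{u^1} Q_1(\tau^1, u^1) \\
\vdots \\
\operatorname*{argmax}_{u^n} Q_n(\tau^n, u^n)
\end{pmatrix}
\]

% Monotonicity: a sufficient condition for the consistency above
\[
\frac{\partial Q_{tot}}{\partial Q_a} \ge 0, \quad \forall a \in A
\]
```

VDN's sum is the special case in which every partial derivative equals 1.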
  • 10. QMIX: Architecture ● QMIX represents $Q_{tot}$ using an architecture consisting of agent networks, a mixing network, and a set of hypernetworks
  • 11. QMIX: Agent network
      • DRQN (Hausknecht 2015)
        - Deals with partial observability
      • Previous action $u^a_{t-1}$ as input
        - Needed because policies are stochastic during training
      • Agent ID (optional)
        - Allows heterogeneous policies
      Hausknecht, Matthew, and Peter Stone. "Deep recurrent Q-learning for partially observable MDPs." arXiv preprint arXiv:1507.06527 (2015).
  • 12. QMIX: Mixing network
      • Hypernetwork
        - Transforms the state $s_t$ into the weights of the mixing network
        - Allows $Q_{tot}$ to depend on the extra state information
      • Absolute value operation
        - Ensures the monotonicity constraint
      • Why not pass $s_t$ directly into the mixing network?
        - That would overly constrain $s_t$ by forcing it through the monotonic function, reducing representational capacity
      • ELU activation
        - If ReLU were used, a negative input, which is likely to remain negative, would be zeroed by the mixing network
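The mixing step can be sketched in plain Python as a forward pass only. This is an illustrative simplification, not the authors' implementation: each hypernetwork here is a single random linear map of the state, and the class name `Mixer`, the helper names, and the embedding size are assumptions. The structural points from the slide are kept: the weights come from the state, the absolute value makes them non-negative, and the hidden layer uses ELU.

```python
import math
import random

def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def init_linear(rng, n_in, n_out):
    """Random (weights, biases) for a linear map; stands in for a trained hypernetwork."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def linear(params, x):
    """Apply a linear map: W x + b."""
    weights, biases = params
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

class Mixer:
    """Hypothetical one-hidden-layer monotonic mixer (forward pass only)."""

    def __init__(self, n_agents, state_dim, embed_dim=4, seed=0):
        rng = random.Random(seed)
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = init_linear(rng, state_dim, n_agents * embed_dim)
        self.hyper_b1 = init_linear(rng, state_dim, embed_dim)
        self.hyper_w2 = init_linear(rng, state_dim, embed_dim)
        self.hyper_b2 = init_linear(rng, state_dim, 1)

    def q_tot(self, agent_qs, state):
        n, m = self.n_agents, self.embed_dim
        # Hypernetworks turn the state into mixing weights; abs() keeps them >= 0.
        w1 = [abs(w) for w in linear(self.hyper_w1, state)]
        w2 = [abs(w) for w in linear(self.hyper_w2, state)]
        # Biases are left unconstrained; they do not affect monotonicity.
        b1 = linear(self.hyper_b1, state)
        b2 = linear(self.hyper_b2, state)[0]
        hidden = [
            elu(sum(agent_qs[a] * w1[a * m + j] for a in range(n)) + b1[j])
            for j in range(m)
        ]
        return sum(h * w for h, w in zip(hidden, w2)) + b2

mixer = Mixer(n_agents=3, state_dim=2)
state = [0.5, -1.0]
print(mixer.q_tot([1.0, -2.0, 0.3], state))
```

Because every weight is non-negative and ELU is increasing, raising any single agent's value can never lower the mixed output, while the state still influences the result through the weights and biases.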
  • 13. QMIX: Algorithm
      [Pseudocode figure, annotated with these stages:]
        1. Initialization
        2. Rollout episode
        3. Episode sampling
        4. Update $Q_{tot}$
        5. Update target
      Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
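The pseudocode itself is not in the transcript. For the "Update $Q_{tot}$" stage, the cited paper trains end-to-end on a DQN-style loss, which can be written as below; $b$ is the batch size and $\theta^-$ denotes the parameters of the target network that the "Update target" stage refreshes.

```latex
\[
\mathcal{L}(\theta) = \sum_{i=1}^{b} \left( y_i^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s; \theta) \right)^2,
\qquad
y^{tot} = r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}', s'; \theta^-)
\]
```

Thanks to the monotonicity constraint, the maximization over $\mathbf{u}'$ reduces to one argmax per agent rather than a search over the joint action space.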
  • 14. Representational complexity
      • A value function in which an agent's best action depends on the actions of the other agents at the same time step will not factorize perfectly with QMIX
      • The monotonicity constraint prevents QMIX from representing non-monotonic functions
      [Figure panels: Learned QMIX $Q_{tot}$, Learned VDN $Q_{tot}$]
      https://youtu.be/W_9kcQmaWjo
  • 15. Representational complexity
      • Even though VDN cannot represent the middle example exactly, it can approximate it with a value function from the left game
      • Should we care about the games that are in the middle?
      https://youtu.be/W_9kcQmaWjo (Shimon Whiteson)
  • 16. Representational complexity
      • It matters because of bootstrapping
      • QMIX still learns the correct maximum over the Q-values
      • Less bootstrapping error results in better action selection in earlier states
      • Let's see this with a two-step game
      [Figure panels: Payoff matrix, Learned QMIX $Q_{tot}$]
      Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
  • 17. Representational complexity
      • Two-step game: tests the ability to express a complex situation
      • Greedy policy learned by VDN: (A, ∙) -> (A, B) or (B, B), return 7
      • Greedy policy learned by QMIX: (B, ∙) -> (B, B), return 8
      • QMIX's higher representational capacity yields a better strategy than VDN
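A rough way to see where the 7 versus 8 comes from is to fit the second-step payoffs with a purely additive model, as VDN must. The payoff matrices below are not in the transcript; they are recalled from the cited paper's two-step game and should be treated as an assumption. The least-squares fit with uniform weighting is a stand-in for VDN training, not a reproduction of it, and all function names are hypothetical.

```python
# Assumed second-step payoffs (rows: agent 1's action, columns: agent 2's action).
STATE_2A = [[7, 7], [7, 7]]  # reached when agent 1 picks A in the first step
STATE_2B = [[0, 1], [1, 8]]  # reached when agent 1 picks B in the first step

def additive_fit(payoff):
    """Least-squares fit of payoff[i][j] by Q1[i] + Q2[j] under uniform weighting.

    For a full table this is row mean + column mean - grand mean.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_mean = [sum(row) / n_cols for row in payoff]
    col_mean = [sum(payoff[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    grand = sum(row_mean) / n_rows
    return [[row_mean[i] + col_mean[j] - grand for j in range(n_cols)] for i in range(n_rows)]

def best(matrix):
    """Largest entry of a payoff or value table."""
    return max(max(row) for row in matrix)

fit_2b = additive_fit(STATE_2B)
print(fit_2b)                                         # [[-1.5, 2.5], [2.5, 6.5]]
print(best(STATE_2A), best(STATE_2B), best(fit_2b))   # 7 8 6.5
```

Under these assumptions the additive estimate of the best payoff in the second state is 6.5, below the guaranteed 7 of the first state, so bootstrapping from it favours action A in the first step even though the true optimum is 8.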
  • 18. Experiment: SC2
      • StarCraft II has a rich set of complex micro-actions that allow the learning of complex interactions between collaborating agents
      • The SC2LE environment mitigates many of the practical difficulties in using the game as an RL platform
      [Figure labels: allies, enemy] https://youtu.be/HIqS-r4ZRGg
  • 19. Experiment: SC2
      • Observation (within sight range)
        - distance
        - relative x, y
        - unit_type
      • Action
        - move[direction]
        - attack[enemy_id] (within shooting range)
        - stop
        - noop
      • Reward
        - joint reward: total damage (each time step)
        - bonus1: 10 (for killing each opponent)
        - bonus2: 100 (for killing all opponents)
      • Global state (hidden from agents)
        - distance from center, health, shield, cooldown, last action
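The reward rule on this slide amounts to a small piece of arithmetic, sketched below. The function and parameter names are hypothetical, and the slide does not say whether the damage term is scaled or normalized, so the sketch uses it as-is.

```python
KILL_BONUS = 10    # bonus1: awarded for each opponent killed
WIN_BONUS = 100    # bonus2: awarded once all opponents are killed

def joint_reward(damage_dealt, enemies_killed, all_enemies_dead):
    """Shared reward for one time step, following the slide's three terms."""
    reward = damage_dealt + KILL_BONUS * enemies_killed
    if all_enemies_dead:
        reward += WIN_BONUS
    return reward

# One opponent killed this step, others still alive.
print(joint_reward(damage_dealt=12.5, enemies_killed=1, all_enemies_dead=False))  # 22.5
```

All agents receive this same scalar, which is what makes the task cooperative.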
  • 20. Experiment: SC2 - main results
      • IQL: highly unstable <- non-stationarity of the environment
      • VDN: better than IQL in every experimental setup, learns to focus fire
      • QMIX: superior in the heterogeneous agent setting
      [Figure panels: Heterogeneous, Homogeneous. Annotation: initial hump, learning the simple strategy]
  • 21. Experiment: SC2 - ablation results
      • QMIX-NS: without hypernetworks -> significance of extra state information
      • QMIX-Lin: removing the hidden layer -> necessity of non-linear mixing
      • VDN-S: adding a state-dependent term to the sum of the $Q_a$ -> significance of utilizing the state $s_t$
      [Figure panels: Heterogeneous, Homogeneous. Annotation: nonlinear factorization is not always required]
  • 22. Conclusion - QMIX
      • Centralized training, decentralized execution
      • Allows a rich joint action-value function $Q_{tot}$ while enforcing the monotonicity constraint
      • Outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment
      [Figure panels: Training, Execution]
  • 23. More information https://youtu.be/W_9kcQmaWjo Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.