QMIX: monotonic value function factorization paper review
1. QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning
2021.4.30 정민재
2. Multi agent system setting (MAS)
’t Hoen, Pieter Jan, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
• Cooperative
- The agents pursue a common goal
• Competitive
- Non-aligned goals
- Individual agents seek only to maximize their own gains
3. MAS setting
● Challenge
○ The joint action space grows exponentially with the number of agents
■ Ex) N agents with 4 discrete actions each: 4^N joint actions
○ Partial observability and communication constraints
■ Agents cannot access the full state
⇒ Need decentralized policies
(Figure: with 4 individual actions — Up, Down, Left, Right — 2 agents already face 16 joint actions: (Up, Down), (Up, Left), (Up, Up), …)
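The blow-up above can be checked directly: enumerating joint actions with `itertools` shows that N agents with 4 discrete actions each yield 4^N joint actions (a minimal sketch, not from the paper):

```python
from itertools import product

ACTIONS = ["up", "down", "left", "right"]  # 4 discrete actions per agent

def joint_action_space(n_agents):
    """Enumerate every joint action for n_agents agents."""
    return list(product(ACTIONS, repeat=n_agents))

# 4^N growth: 4, 16, 64, 256, 1024, ...
for n in range(1, 6):
    assert len(joint_action_space(n)) == 4 ** n
print(len(joint_action_space(2)))  # 2 agents -> 16 joint actions
```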
4. Centralized Training, Decentralized Execution (CTDE)
● During centralized training we can use the global state or extra state information and remove communication constraints, since training happens in simulated or laboratory environments; execution remains decentralized
5. CTDE Approaches
How can we learn the joint action-value function 𝑸𝒕𝒐𝒕 and extract decentralized policies?

● Independent Q-learning (IQL) [Tan 1993]: learn independent individual action-value functions; each agent acts greedily with respect to its own 𝑸𝒂
+ Simplest option
+ Learns decentralized policies trivially
- Cannot handle the non-stationary case: the other agents are also learning and changing their strategies, so there is no convergence guarantee

● Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]: learn a centralized, full-state action-value function
+ Learns 𝑸𝒕𝒐𝒕 directly with an actor-critic framework
- On-policy, hence sample-inefficient
- Less scalable

● Value Decomposition Networks (VDN) [Sunehag 2017]: learn a centralized but factored action-value function, 𝑸𝒕𝒐𝒕 = 𝑸𝒂𝟏 + 𝑸𝒂𝟐 + 𝑸𝒂𝟑
+ Learns 𝑸𝒕𝒐𝒕
+ Easy to extract decentralized policies (greedy with respect to each 𝑸𝒂)
- Limited representational capacity
- Does not use additional global state information

● Learning 𝑸𝒕𝒐𝒕 lets us figure out the effectiveness of the agents' actions; extracting decentralized policies addresses the joint-action-space growth, local observability, and communication constraints -> QMIX!
6. Background
Dec-POMDP (decentralized partially observable Markov decision process)
𝑠 ∈ 𝑆 : state
𝑢 ∈ 𝑈 : joint action
𝑃(𝑠′|𝑠, 𝑢) : transition function
𝑟(𝑠, 𝑢) : reward function
𝑛 : number of agents
𝑎 ∈ 𝐴 : agent
𝑧 ∈ 𝑍 : observation
𝑂(𝑠, 𝑎) : observation function
𝛾 : discount factor
𝜏^𝑎 ∈ 𝑇 : action-observation history
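The tuple above can be written down as a small container to keep the symbols straight; the field names here are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DecPOMDP:
    """Container for the Dec-POMDP tuple (S, U, P, r, Z, O, n, gamma)."""
    states: list            # S
    joint_actions: list     # U
    transition: Callable    # P(s' | s, u)
    reward: Callable        # r(s, u), shared by all agents (cooperative)
    observations: list      # Z
    observe: Callable       # O(s, a): per-agent observation
    n_agents: int           # n
    gamma: float = 0.99     # discount factor
```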
7. Background: Value decomposition
Factored joint value function [Guestrin 2001]: factoring the value function over subsets of the agents reduces the number of parameters that have to be learned and improves scalability.
VDN fully factors the joint action-value function: 𝑄𝑡𝑜𝑡(𝜏, 𝑢) = Σ𝑎 𝑄𝑎(𝜏^𝑎, 𝑢^𝑎)
Each 𝑄𝑎 is a utility function, not a value function: it does not estimate an expected return by itself.
Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
https://youtu.be/W_9kcQmaWjo
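VDN's additive factorization can be sketched in a few lines: summing per-agent utilities gives 𝑄𝑡𝑜𝑡, and because addition is monotonic, greedy per-agent action selection matches the joint argmax (illustrative numbers, not from the paper):

```python
from itertools import product

# Per-agent utilities Q_a(u_a): 2 agents, 2 actions each (made-up values)
q1 = {"A": 1.0, "B": 3.0}
q2 = {"A": 2.0, "B": 0.5}

def q_tot(u1, u2):
    """VDN: Q_tot is the sum of the individual utilities."""
    return q1[u1] + q2[u2]

# Joint argmax over all joint actions ...
joint_best = max(product(q1, q2), key=lambda u: q_tot(*u))
# ... equals the tuple of individual greedy actions
greedy = (max(q1, key=q1.get), max(q2, key=q2.get))
assert joint_best == greedy == ("B", "A")
```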
8. QMIX: Key Idea
Key idea: the full factorization of VDN is not necessary.
• Consistency holds if a global argmax performed on 𝑄𝑡𝑜𝑡 yields the same result as the set of individual argmax operations performed on each 𝑄𝑎
Assumption: the environment is not adversarial.
VDN's additive representation also satisfies this, but QMIX generalizes it to the larger family of monotonic functions.
How can we ensure this?
9. QMIX: Monotonicity constraint
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
If ∂𝑄𝑡𝑜𝑡/∂𝑄𝑎 ≥ 0, ∀𝑎 ∈ 𝐴,
then argmax_𝑢 𝑄𝑡𝑜𝑡(𝜏, 𝑢) = (argmax_{𝑢^1} 𝑄1(𝜏^1, 𝑢^1), …, argmax_{𝑢^𝑛} 𝑄𝑛(𝜏^𝑛, 𝑢^𝑛)),
i.e. the argmax joint action of 𝑸𝒕𝒐𝒕 is the set of individual argmax actions of the 𝑸𝒂.
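The constraint can be checked numerically: any mixing function that is non-decreasing in each 𝑄𝑎 keeps the joint argmax equal to the per-agent argmaxes. A toy monotonic mixer (not the paper's network):

```python
from itertools import product

def monotonic_mix(qs):
    """A toy mixer that is non-decreasing in every input (non-negative weights)."""
    return 0.7 * qs[0] + 1.3 * qs[1] + 0.1 * max(qs[0], 0.0)

q1 = {"A": 0.2, "B": 1.5}
q2 = {"A": -0.4, "B": 0.9}

# Global argmax on the mixed Q_tot ...
joint_best = max(product(q1, q2),
                 key=lambda u: monotonic_mix((q1[u[0]], q2[u[1]])))
# ... matches the individual greedy choices
greedy = (max(q1, key=q1.get), max(q2, key=q2.get))
assert joint_best == greedy == ("B", "B")
```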
10. QMIX: Architecture
● QMIX represents 𝑄𝑡𝑜𝑡 using an architecture consisting of agent networks, a mixing network, and hypernetworks
11. QMIX: agent network
• DRQN [Hausknecht 2015]
- Recurrence deals with partial observability
• Last action 𝑢_{𝑡−1}^𝑎 as input
- Enables stochastic policies during training
• Agent ID (optional)
- Enables heterogeneous policies
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
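The agent network's per-step input concatenates the observation, the one-hot previous action, and (optionally) a one-hot agent ID; a minimal sketch with made-up sizes:

```python
def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def agent_input(obs, last_action, agent_id, n_actions=4, n_agents=3):
    """Concatenate observation, one-hot last action u_{t-1}^a, one-hot agent ID."""
    return obs + one_hot(last_action, n_actions) + one_hot(agent_id, n_agents)

x = agent_input(obs=[0.1, 0.5], last_action=2, agent_id=0)
assert len(x) == 2 + 4 + 3   # obs dim + action one-hot + agent-ID one-hot
```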
12. QMIX: Mixing Network
• Hypernetworks
- Turn the state 𝑠𝑡 into the weights of the mixing network
- Allow 𝑄𝑡𝑜𝑡 to depend on the extra state information
• Absolute operation on the weights
- Ensures the monotonicity constraint
• Why not pass 𝒔𝒕 directly into the mixing network?
- That would overly constrain 𝑠𝑡 by forcing it through the monotonic function, reducing representational capacity
• ELU activation
- A negative input is likely to remain negative, whereas it would be zeroed by the mixing network if ReLU were used
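The forward pass can be sketched with NumPy: hypernetworks (here plain random linear maps of the state, purely illustrative) produce the mixing weights, absolute values enforce monotonicity, and ELU is the hidden non-linearity. This simplified sketch omits the paper's state-dependent bias term, which does not affect monotonicity:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_AGENTS, HIDDEN = 6, 3, 8

# Hypernetwork parameters: linear maps from the state to mixing weights
W1_hyper = rng.standard_normal((STATE_DIM, N_AGENTS * HIDDEN))
b1_hyper = rng.standard_normal((STATE_DIM, HIDDEN))
W2_hyper = rng.standard_normal((STATE_DIM, HIDDEN))

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def mix(agent_qs, state):
    """Monotonic mixing: |weights| guarantee dQ_tot/dQ_a >= 0."""
    w1 = np.abs(state @ W1_hyper).reshape(N_AGENTS, HIDDEN)
    b1 = state @ b1_hyper
    w2 = np.abs(state @ W2_hyper)
    hidden = elu(agent_qs @ w1 + b1)
    return float(hidden @ w2)

s = rng.standard_normal(STATE_DIM)
qs = np.array([0.2, -0.1, 0.5])
base = mix(qs, s)
# Monotonicity check: increasing any Q_a never decreases Q_tot
for a in range(N_AGENTS):
    bumped = qs.copy(); bumped[a] += 1.0
    assert mix(bumped, s) >= base
```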
13. QMIX: algorithm
1) Initialization
2) Rollout episode
3) Episode sampling (from the replay buffer)
4) Update 𝑄𝑡𝑜𝑡
5) Update target network
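The core of the update step (4) is a standard DQN-style TD target computed with the target network; a minimal sketch, with the networks abstracted away:

```python
GAMMA = 0.99  # discount factor

def td_target(reward, done, q_tot_target_next):
    """y = r + gamma * max_u' Q_tot_target(s', u'); no bootstrap at terminal."""
    return reward + (0.0 if done else GAMMA * q_tot_target_next)

def td_loss(q_tot, target):
    """Squared TD error minimized w.r.t. the online parameters."""
    return (q_tot - target) ** 2

y = td_target(reward=10.0, done=False, q_tot_target_next=5.0)
assert abs(y - 14.95) < 1e-9
assert td_target(reward=1.0, done=True, q_tot_target_next=99.0) == 1.0
```

Periodically (step 5) the target network's parameters are replaced by a copy of the online network's, which keeps the bootstrap target stable.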
14. Representational complexity
(Figure: learned QMIX 𝑄𝑡𝑜𝑡 vs. learned VDN 𝑄𝑡𝑜𝑡)
• A value function in which an agent's best action depends on the other agents' actions at the same time step will not factorize perfectly with QMIX
• The monotonicity constraint prevents QMIX from representing non-monotonic functions
16. Representational complexity
• QMIX still learns the correct maximum over the Q-values
(Figure: payoff matrix and the learned QMIX 𝑄𝑡𝑜𝑡)
• The representational limit matters because of bootstrapping: less bootstrapping error results in better action selection in earlier states
Let's see this with a two-step game.
17. Representational complexity: Two-Step Game
Representational complexity: the ability to express complex situations.
• Agent 1's first action selects the next state: (A, ∙) -> state 2A, (B, ∙) -> state 2B
• In state 2A every joint action pays 7; in state 2B only the joint action (B, B) pays 8
• VDN's greedy policy settles for the payoff of 7, while QMIX's higher representational capacity yields a better strategy than VDN with payoff 8
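The two-step game can be encoded and solved by brute force; the payoffs below follow the paper's construction (state 2A pays 7 for every joint action, state 2B pays 8 only for (B, B)):

```python
from itertools import product

# Step 1: agent 1's action picks the next state (step-1 reward is 0)
NEXT_STATE = {"A": "2A", "B": "2B"}
# Step 2 payoff matrices
PAYOFF = {
    "2A": {(u1, u2): 7 for u1, u2 in product("AB", repeat=2)},
    "2B": {("A", "A"): 0, ("A", "B"): 1, ("B", "A"): 1, ("B", "B"): 8},
}

def episode_return(first_action, u1, u2):
    return PAYOFF[NEXT_STATE[first_action]][(u1, u2)]

# Brute-force optimum: go to 2B and both play B, for a return of 8
best = max(((f, u1, u2) for f in "AB" for u1, u2 in product("AB", repeat=2)),
           key=lambda p: episode_return(*p))
assert best == ("B", "B", "B") and episode_return(*best) == 8
# A policy that ends up in state 2A can do no better than 7
assert max(PAYOFF["2A"].values()) == 7
```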
18. Experiment: SC2
• StarCraft II has a rich set of complex micro-actions that allow the learning of complex interactions between collaborating agents
• The SC2LE environment mitigates many of the practical difficulties in using a game as an RL platform
https://youtu.be/HIqS-r4ZRGg
19. Experiment: SC2
• Observation (within sight range)
- distance
- relative x, y
- unit_type
• Action
- move[direction]
- attack[enemy_id] (within shooting range)
- stop
- noop
• Reward
- joint reward: total damage dealt (each time step)
- bonus 1: 10 for killing each opponent
- bonus 2: 100 for killing all opponents
• Global state (hidden from agents)
- distance from center, health, shield, cooldown, last action
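The shaped reward above can be written as a small function (the constants 10 and 100 come from the slide; the function and variable names are illustrative):

```python
def step_reward(damage_dealt, kills_this_step, all_enemies_dead,
                kill_bonus=10, win_bonus=100):
    """Joint reward: total damage each step, +10 per kill, +100 for a full wipe."""
    r = damage_dealt + kill_bonus * kills_this_step
    if all_enemies_dead:
        r += win_bonus
    return r

assert step_reward(damage_dealt=35.0, kills_this_step=2, all_enemies_dead=False) == 55.0
assert step_reward(damage_dealt=0.0, kills_this_step=1, all_enemies_dead=True) == 110.0
```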
20. Experiment: SC2 - main results
• IQL: highly unstable, due to the non-stationarity of the environment
• VDN: better than IQL in every experimental setup; learns focused fire
• QMIX: superior in the heterogeneous-agent settings
(Figure: learning curves on homogeneous and heterogeneous maps; the heterogeneous curves show an initial hump from first learning a simple strategy)
21. Experiment: SC2 - ablation results
• QMIX-NS: without hypernetworks -> shows the significance of the extra state information
• QMIX-Lin: removes the hidden layer -> tests the necessity of non-linear mixing
• VDN-S: adds a state-dependent term to the sum of the 𝑄𝑎 -> shows the significance of utilizing the state 𝒔𝒕
(Figure: ablation curves on homogeneous and heterogeneous maps; non-linear factorization is not always required)
22. Conclusion - QMIX
• Centralized training, decentralized execution
• Allows a rich joint action-value function 𝑄𝑡𝑜𝑡 while ensuring the monotonicity constraint
• Outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment