QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning
2021.4.30 정민재
Multi agent system setting (MAS)
Jan’t Hoen, Pieter, et al. "An overview of cooperative and competitive multiagent learning." International Workshop on Learning and
Adaption in Multi-Agent Systems. Springer, Berlin, Heidelberg, 2005.
• Cooperative
- The agents pursue a common goal
• Competitive
- Non-aligned goals
- Individual agents seek only to maximize their own gains
MAS setting
● Challenge
○ The joint action space grows exponentially with the number of agents
■ EX) N agents with 4 discrete actions each: $4^N$ joint actions
○ Agents' partial observability and communication constraints
■ Cannot access the full state
-> NEED: decentralized policies
[Figure: with 4 individual actions (Up, Down, Left, Right), two agents already face $4^2 = 16$ joint actions]
Centralized Training Decentralized Execution(CTDE)
● We can use the global state or extra state information, and remove communication constraints, during training in simulated or laboratory environments
[Figure: centralized training vs. decentralized execution]
CTDE Approaches
How can we learn a joint action-value function 𝑄𝑡𝑜𝑡 and extract decentralized policies from it?
● Independent Q-learning (IQL) [Tan 1993]: learn independent individual action-value functions
+ Simplest option
+ Decentralized policies come trivially (each agent acts greedily on its own 𝑄𝑎)
- Cannot handle non-stationarity: the other agents are also learning and changing their strategies -> no convergence guarantee
● Counterfactual multi-agent policy gradient (COMA) [Foerster 2018]: learn a centralized, full-state action-value function
+ Learns 𝑄𝑡𝑜𝑡 directly within an actor-critic framework
- On-policy: sample inefficient
- Less scalable
● Value Decomposition Networks (VDN) [Sunehag 2017]: learn a centralized but factored action-value function (𝑄𝑡𝑜𝑡 = 𝑄𝑎1 + 𝑄𝑎2 + 𝑄𝑎3)
+ Learns 𝑄𝑡𝑜𝑡
+ Easy to extract decentralized policies (greedy on each 𝑄𝑎)
- Limited representational capacity
- Does not use additional global state information
● Learning 𝑄𝑡𝑜𝑡 -> captures how effective the agents' joint actions are
● Extracting decentralized policies <- handles the joint-action space growth, local observability, and communication constraints
-> QMIX!
Background
DEC-POMDP (decentralized partially observable Markov decision process)
𝑠 ∈ 𝑆 : state
𝑢 ∈ 𝑈 : joint action
𝑃(𝑠′|𝑠, 𝑢) : transition function
𝑟(𝑠, 𝑢) : reward function
𝑛 : number of agents
𝑎 ∈ 𝐴 : agent
𝑧 ∈ 𝑍 : observation
𝑂(𝑠, 𝑎) : observation function
𝛾 : discount factor
𝜏𝑎 ∈ 𝑇 : action-observation history
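Putting these together, the Dec-POMDP in the paper is the tuple

$G = \langle S, U, P, r, Z, O, n, \gamma \rangle$

where at each step every agent 𝑎 picks an action 𝑢𝑎 ∈ 𝑈, forming the joint action 𝐮, and conditions its policy on its own action-observation history 𝜏𝑎.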
Background: Value decomposition
● Factored joint value function [Guestrin 2001]
- 𝑄𝑡𝑜𝑡 is factored into terms that each depend on only a subset of the agents
- A factored value function reduces the number of parameters that have to be learned -> improves scalability
● VDN: full factorization [Sunehag 2017]
- Each 𝑄𝑎 is a utility function, not a value function: on its own it does not estimate an expected return
Guestrin, Carlos, Daphne Koller, and Ronald Parr. "Multiagent Planning with Factored MDPs." NIPS. Vol. 1. 2001.
Sunehag, Peter, et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
https://youtu.be/W_9kcQmaWjo
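Concretely, VDN factors the joint action-value function as a plain sum of per-agent utilities:

$Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a)$

so each agent can act greedily on its own 𝑄𝑎 without any communication at execution time.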
QMIX: Key Idea
Key idea: the full factorization of VDN is not necessary to extract decentralized policies
• Consistency holds if a global argmax performed on 𝑄𝑡𝑜𝑡 yields the same result as a set of individual argmax operations performed on each 𝑄𝑎
• Assumption: the environment is not adversarial
• VDN's additive representation satisfies this consistency, but QMIX generalizes it to the larger family of monotonic functions
How do we ensure this?
QMIX: Monotonicity constraint
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
The argmax joint action of 𝑄𝑡𝑜𝑡 is the set of individual argmax actions of the 𝑄𝑎:

$\operatorname{argmax}_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \begin{pmatrix} \operatorname{argmax}_{u^1} Q_1(\tau^1, u^1) \\ \vdots \\ \operatorname{argmax}_{u^n} Q_n(\tau^n, u^n) \end{pmatrix}$

This holds if 𝑄𝑡𝑜𝑡 is monotonic in each 𝑄𝑎:

$\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a$
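As a toy sanity check (not from the paper), the snippet below verifies numerically that a mixing function which is monotonically increasing in each 𝑄𝑎 keeps the decentralized greedy actions consistent with the global argmax over joint actions; the specific mixer and its weights are arbitrary:

```python
# Toy check: monotonic mixing preserves the decentralized greedy joint action.
import itertools
import numpy as np

def mixer(qs):
    # Any function with non-negative partial derivatives w.r.t. each Q_a,
    # e.g. a positively weighted sum followed by a monotonic nonlinearity.
    return np.tanh(0.7 * qs[0] + 1.3 * qs[1])

rng = np.random.default_rng(0)
q1 = rng.normal(size=4)   # agent 1's utilities over its 4 actions
q2 = rng.normal(size=4)   # agent 2's utilities over its 4 actions

# Decentralized greedy actions: each agent argmaxes its own utilities
greedy = (int(np.argmax(q1)), int(np.argmax(q2)))

# Global argmax over the 16 joint actions of Q_tot = mixer(Q_1, Q_2)
joint = max(itertools.product(range(4), range(4)),
            key=lambda u: mixer((q1[u[0]], q2[u[1]])))

assert joint == greedy   # consistency holds under monotonic mixing
```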
QMIX: Architecture
● QMIX represents 𝑄𝑡𝑜𝑡 with an architecture consisting of agent networks, a mixing network, and hypernetworks
QMIX: agent network
• DRQN [Hausknecht 2015]
- Deals with partial observability
• The agent's last action 𝑢𝑡−1 as input
- Allows stochastic policies during training
• Agent ID (optional)
- Allows heterogeneous policies
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
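A minimal sketch of such an agent network, assuming PyTorch; the class name, layer sizes, and input layout (observation concatenated with the one-hot last action and optional agent ID) are illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn

class AgentDRQN(nn.Module):
    """DRQN-style agent network: MLP -> GRU -> per-action utilities."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # recurrence handles partial observability
        self.fc_out = nn.Linear(hidden_dim, n_actions)  # one utility per discrete action

    def forward(self, obs_and_last_action, hidden_state):
        x = torch.relu(self.fc_in(obs_and_last_action))
        h = self.rnn(x, hidden_state)   # carry the action-observation history in h
        q_a = self.fc_out(h)            # Q_a(tau^a, .) for all actions
        return q_a, h
```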
QMIX: Mixing Network
• Hypernetworks
- Turn the state 𝑠𝑡 into the weights of the mixing network
- Allow 𝑄𝑡𝑜𝑡 to depend on the extra state information
• Absolute operation on the hypernetwork outputs
- Ensures the monotonicity constraint (non-negative mixing weights)
• Why not pass 𝒔𝒕 directly into the mixing network?
- That would force 𝑠𝑡 through the monotonic function, overly constraining it and reducing representational capacity
• ELU activation
- A negative input is likely to remain negative; with ReLU it would be zeroed out by the mixing network
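A minimal sketch of the mixing network and its hypernetworks, assuming PyTorch; layer sizes are illustrative. The absolute value on the hypernetwork outputs is what enforces ∂𝑄𝑡𝑜𝑡/∂𝑄𝑎 ≥ 0, and the state only enters through the hypernetworks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Mixes per-agent utilities into Q_tot with state-conditioned, non-negative weights."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: global state -> mixing weights / biases
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        return q_tot.view(bs, 1)
```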
QMIX: algorithm
● Initialization
● Roll out an episode with the current agent networks
● Sample episodes from the replay buffer
● Update 𝑄𝑡𝑜𝑡 towards the TD target
● Update the target networks
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
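A rough sketch of one 𝑄𝑡𝑜𝑡 update corresponding to the steps above, assuming PyTorch; the agent networks are treated here as simple feed-forward Q-networks (the DRQN recurrence and hidden states are omitted for brevity), and the `batch` layout and helper names are hypothetical:

```python
import torch

def qmix_update(agent_nets, mixer, target_agent_nets, target_mixer,
                batch, optimizer, gamma=0.99):
    """One TD update of Q_tot on a batch of sampled transitions (hypothetical batch layout)."""
    obs, actions, rewards, next_obs, states, next_states, dones = batch
    n_agents = len(agent_nets)

    # Per-agent utilities Q_a(tau^a, u^a) of the actions actually taken
    chosen_qs = torch.stack(
        [agent_nets[a](obs[a]).gather(1, actions[a]) for a in range(n_agents)], dim=1
    ).squeeze(-1)                                  # (batch, n_agents)
    q_tot = mixer(chosen_qs, states)               # (batch, 1)

    # Target: per-agent greedy utilities mixed by the target mixing network
    with torch.no_grad():
        next_qs = torch.stack(
            [target_agent_nets[a](next_obs[a]).max(dim=1).values for a in range(n_agents)], dim=1
        )                                          # (batch, n_agents)
        target = rewards + gamma * (1 - dones) * target_mixer(next_qs, next_states)

    loss = torch.mean((q_tot - target) ** 2)       # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Periodically, the target networks would be refreshed with copies of the current parameters (the "update target" step).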
Representational complexity
[Figure: learned QMIX 𝑄𝑡𝑜𝑡 vs. learned VDN 𝑄𝑡𝑜𝑡 on a non-monotonic payoff matrix]
• A value function in which an agent's best action depends on the other agents' actions at the same time step will not factorize perfectly with QMIX
• The monotonicity constraint prevents QMIX from representing non-monotonic functions
https://youtu.be/W_9kcQmaWjo
Representational complexity
https://youtu.be/W_9kcQmaWjo
Shimon Whiteson:
• Even though VDN cannot represent the middle example exactly, it can approximate it with a value function from the left game
• Should we care about the games that fall in the middle?
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent reinforcement learning." Journal of Machine Learning Research 21.178
(2020): 1-51.
• It matters because of bootstrapping
Representational complexity
• QMIX still learns the correct maximum over the Q-values
[Figure: payoff matrix and the learned QMIX 𝑄𝑡𝑜𝑡]
• Less bootstrapping error results in better action selection in earlier states
Let's see this with a two-step game
• Two-Step Game
[Figure: learned VDN and QMIX 𝑄𝑡𝑜𝑡 values and the resulting greedy policies]
Representational complexity: the ability to express complex situations
- VDN's greedy policy: (A, ∙) at the first step, then (A,B) or (B,B) at the second step -> return 7
- QMIX's greedy policy: (B, ∙) at the first step, then (B,B) at the second step -> return 8
• QMIX's higher representational capacity yields a better strategy than VDN
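For reference, the payoff matrices of the paper's two-step game (the first agent's first-step action selects which matrix game is played at the second step):

$\text{State 2A (via } A\text{)}: \begin{pmatrix} 7 & 7 \\ 7 & 7 \end{pmatrix} \qquad \text{State 2B (via } B\text{)}: \begin{pmatrix} 0 & 1 \\ 1 & 8 \end{pmatrix}$

The optimal return is therefore 8 (take B, then (B,B)), while any policy that moves to State 2A is capped at 7.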
Experiment: SC2
• StarCraft II has a rich set of complex micro-actions that allow the learning of complex interactions between collaborating agents
• The SC2LE environment mitigates many of the practical difficulties in using a game as an RL platform
[Figure: micromanagement scenario, allied units vs. enemy units]
https://youtu.be/HIqS-r4ZRGg
Experiment: SC2
• Observation (within sight range)
- distance
- relative x, y
- unit_type
• Action
- move[direction]
- attack[enemy_id] (within shooting range)
- stop
- noop
• Reward
- joint reward: total damage dealt at each time step
- bonus 1: 10 for killing each opponent
- bonus 2: 100 for killing all opponents
• Global state (hidden from agents)
- distance from center, health, shield, cooldown, last action
Experiment: SC2 - main results
• IQL: highly unstable <- non-stationarity of the environment
• VDN: better than IQL in every experimental setup, learns to focus fire
• QMIX: superior in the heterogeneous agent settings
• Heterogeneous scenarios: an initial hump in the learning curves corresponds to learning a simple strategy first
[Plots: win rate over training on heterogeneous and homogeneous scenarios]
Experiment: SC2 - ablation results
• QMIX-NS: without hypernetworks -> significance of the extra state information
• QMIX-Lin: removing the hidden layer -> necessity of non-linear mixing
• VDN-S: adding a state-dependent term to the sum of the 𝑄𝑎 -> significance of utilizing the state 𝒔𝒕
[Plots: ablation results on heterogeneous and homogeneous scenarios; non-linear factorization is not always required]
Conclusion - QMIX
• Centralized training - decentralized execution
[Figure: centralized training vs. decentralized execution]
• Allows a rich joint action-value function 𝑄𝑡𝑜𝑡 while ensuring the monotonicity constraint
• Outperforms VDN on decentralized unit micromanagement tasks in the SC2 environment
More information
https://youtu.be/W_9kcQmaWjo
Rashid, Tabish, et al. "Monotonic value function factorisation for deep multi-agent
reinforcement learning." Journal of Machine Learning Research 21.178 (2020): 1-51.
