Presentation on “Multiagent Bidirectional-
Coordinated Nets for Learning to Play
StarCraft Combat Games”
Kiho Suh

Modulabs, June 22nd 2017
About Paper
• Published on March 29th 2017
(v1)

• Updated on June 20th 2017 (v3)

• Alibaba, University College
London

• https://arxiv.org/pdf/1703.10069.pdf
Motivation
• Single-agent AI has achieved impressive results (Atari, Baduk (Go), Texas Hold'em).
• But is single-agent success enough to reach Artificial General Intelligence?
• Real-world intelligence involves many AI agents interacting and collaborating with each other.
• The real-time strategy (RTS) game "StarCraft" is used as the testbed.
• The goal is not to play a full game, but to learn "StarCraft" combat (micromanagement) tasks.
• A joint learning approach with a parameter space shared across agents is proposed.
How can multiple agents learn to collaborate?
• Collaboration between agents requires communication.
• The communication protocol itself has to be learned.
• Proposal: a multi-agent bidirectionally-coordinated network
(BiCNet) with a vectorized extension of the actor-critic formulation
What does the paper do?
• BiCNet coordinates an arbitrary number of agents of different types.
• Learning follows an evaluation-decision-making (actor-critic) process.
• Parameter sharing enables dynamic grouping of agents.
• Human-level coordination strategies emerge among the AI agents.
• BiCNet learns these strategies without any labeled data or human demonstrations.
Demo video: https://www.youtube.com/watch?v=kW2q15MNFug
Related works
• Jakob Foerster, Yannis M Assael, Nando de Freitas, and
Shimon Whiteson. Learning to communicate with deep
multi-agent reinforcement learning. NIPS 2016.
• Sainbayar Sukhbaatar, Rob Fergus, et al. Learning
multiagent communication with backpropagation. NIPS
2016.
Differentiable Inter-Agent Learning (Jakob Foerster et al. 2016)
• Each agent has a recurrent Q-network, and at every time-step it sends a message that is transferred to the other agents.
• During centralised training, gradients are transferred back through these messages, across agents and time-steps.
• Agents coordinate only through this communication channel; each agent still conditions on its own observation and action history.
Differentiable Inter-Agent Learning & Reinforced Inter-Agent Learning (Jakob Foerster et al. 2016)
• They address multi-agent settings where the environment appears non-stationary because the other agents are learning at the same time.
• However, they have not been applied to a large-scale real-time strategy (RTS) game such as StarCraft.
CommNet (Sainbayar Sukhbaatar et al. 2016)
• A multi-agent communication model trained end-to-end.
• Passes the averaged message over the agent modules between layers.
• Fully symmetric, so it lacks the ability to handle heterogeneous agent types.
BiCNet
Stochastic Game of N agents and M opponents
• S: the state space of the environment
• A_i: the action space of controlled agent i, i ∈ [1, N]
• B_j: the action space of enemy j, j ∈ [1, M]
• T : S × A^N × B^M → S: the deterministic transition function of the environment
• R_i : S × A^N × B^M → ℝ: the reward function of agent/enemy i, i ∈ [1, N+M]
* For simplicity, all agents (controlled and enemy) are assumed to share the same action space.
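As a minimal sketch, the game tuple can be written down as plain type signatures (Python; the names and the concrete State/Action types are assumptions for illustration, not from the paper):

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]                 # S: global game state (positions, health, ...)
Action = Tuple[float, float, float]     # A_i = B_j: 3-d continuous action (see later slide)

# T : S x A^N x B^M -> S, deterministic transition of the environment
TransitionFn = Callable[[State, Sequence[Action], Sequence[Action]], State]

# R_i : S x A^N x B^M -> R, reward of agent/enemy i
RewardFn = Callable[[State, Sequence[Action], Sequence[Action]], float]
```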
Global Reward
• A continuous action space is used to reduce the redundancy in modeling the large discrete action space.
• The reward is shaped so that each agent receives useful feedback during combat.
• Global reward: a single team-level reward shared by all controlled agents.
Definition of Reward Function
• Eq. (1) defines the global reward of the controlled agents; the enemies' global reward is its negative, so the two sides always sum to 0: a zero-sum game!
• The reward is computed from reduced health levels: health lost by the enemies counts positively, health lost by the controlled agents counts negatively (reduced health level of each agent j).
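A minimal sketch of a health-difference global reward of this kind (Python; the exact normalisation used in Eq. (1) may differ, so treat this as an illustrative assumption):

```python
def global_reward(prev_health, curr_health, n_controlled):
    """Zero-sum style global reward built from reduced health levels.

    prev_health / curr_health: unit health before and after the step,
    controlled agents first, then enemies. Returns the controlled side's
    reward; the enemy side would receive the negative of this value.
    """
    reduced = [p - c for p, c in zip(prev_health, curr_health)]  # health lost per unit
    own_loss = sum(reduced[:n_controlled])
    enemy_loss = sum(reduced[n_controlled:])
    return enemy_loss - own_loss

# example: 3 Marines (ours) vs. 1 Super Zergling
print(global_reward([40, 40, 40, 300], [35, 40, 28, 250], n_controlled=3))  # 33
```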
Minimax Game
• The controlled agents learn a policy that maximizes the expected sum of discounted rewards.
• The enemies' joint policy minimizes that same expected sum.
• Together these define the optimal action-state value function.
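In symbols, this is the standard zero-sum objective (notation reconstructed here for clarity, not copied from the paper): the controlled joint policy a_θ maximizes and the enemy joint policy b_φ minimizes

$$\max_{\theta}\;\min_{\phi}\;\mathbb{E}\left[\sum_{t \ge 0} \gamma^{t}\, r\big(s_t,\, a_{\theta}(s_t),\, b_{\phi}(s_t)\big)\right],$$

and the optimal action-state value function is defined at this saddle point.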
Sampled historical state-action pairs (s, b) of the enemies
• Minimax Q-learning is hard in practice: directly modelling the Q-function of Eq. (2) over both sides is difficult.
• Instead, fictitious play is used to approximate the enemies' policy bφ.
  - The enemies are played by the built-in game AI, and the controlled agents treat them as a fixed player when learning the Q-function of Eq. (2).
  - Concretely, a deterministic policy bφ is fitted by supervised learning.
• The policy network for bφ is trained on sampled historical state-action pairs (s, b) of the enemies.
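A minimal sketch of this supervised fit (PyTorch; the network shape, squared-error loss, and dummy data are assumptions for illustration):

```python
import torch
import torch.nn as nn

STATE_DIM, M, ACTION_DIM = 32, 5, 3        # illustrative sizes, not from the paper

# deterministic enemy policy b_phi: state -> joint enemy action (M enemies x 3-d action)
b_phi = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                      nn.Linear(128, M * ACTION_DIM))
opt = torch.optim.Adam(b_phi.parameters(), lr=1e-3)

# sampled historical state-action pairs (s, b) of the enemies (random stand-ins here)
states = torch.randn(256, STATE_DIM)
enemy_actions = torch.randn(256, M * ACTION_DIM)

for _ in range(100):                        # supervised regression of b_phi
    loss = nn.functional.mse_loss(b_phi(states), enemy_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```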
Simpler MDP problem
With the enemies' policy fixed, the Stochastic Game of Eq. (2) reduces to a simpler MDP for the controlled agents.
Eq. (1)
• The global reward of Eq. (1) only captures the team-level zero-sum game; it is not a local collaboration reward function that directly drives team collaboration.
• With a single global signal it is hard for each individual agent to learn how to collaborate.
• Therefore Eq. (1) is extended so that each agent also receives a local reward based on the agents around it.
Extension of formulation of Eq. (1)
• For each agent i, consider its top-K(i) nearest neighbours.
• Each of these K neighbours contributes to agent i's local reward.
• The top-K set of an agent changes dynamically as units move and die.
• Eq. (1) is then replaced by this localized formulation.
Bellman equation for agent i
• There are N such equations, one for each agent i ∈ {1, ..., N}.
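Each of these takes the usual form (notation reconstructed; r_i is the local top-K reward from the previous slide and b_φ the fixed enemy policy):

$$Q_i^{\pi_\theta}(s, a) \;=\; \mathbb{E}\Big[\, r_i\big(s, a, b_{\phi}(s)\big) \;+\; \gamma\, Q_i^{\pi_\theta}\big(s',\, \pi_\theta(s')\big) \Big], \qquad i \in \{1, \dots, N\}.$$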
Objective as an expectation
• Because the action space is continuous, standard model-free policy iteration with an explicit maximization over actions is infeasible.
• Instead, the policy follows the gradient of Qi, using a vectorized version of the deterministic policy gradient (DPG).
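For reference, the single-agent deterministic policy gradient that is being vectorized here (standard DPG form, not copied from the paper):

$$\nabla_{\theta} J(\theta) \;=\; \mathbb{E}_{s}\Big[\, \nabla_{\theta}\, \mu_{\theta}(s)\; \nabla_{a} Q^{\mu}(s, a)\,\big|_{a=\mu_{\theta}(s)} \Big]$$

BiCNet stacks this over the N agents, so the gradients from every agent's Q_i flow into the shared policy parameters θ.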
Final Equation (Actor)
• The gradients of all agents' rewards (Q-values) are backpropagated to each agent's action, and from the actions the gradients are backpropagated further into the shared policy parameters.
Final Equation (Critic)
• Off-policy deterministic actor-critic.
• The critic learns the off-policy action-value function from sampled transitions.
Actor-Critic networks
• Ready to use SGD to compute the updates for both the actor and critic networks.
• Both updates are obtained by backprop through the networks (sketch below).
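A minimal DDPG-style sketch of these two SGD steps (PyTorch; shapes, the bootstrapped target, and the lack of target networks are assumptions for illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn

S, A, GAMMA = 32, 3, 0.99                     # illustrative sizes
actor  = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, A), nn.Tanh())
critic = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, 1))
opt_a = torch.optim.Adam(actor.parameters(),  lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s2):                      # one off-policy minibatch
    # critic: regress Q(s, a) towards the bootstrapped target
    with torch.no_grad():
        target = r + GAMMA * critic(torch.cat([s2, actor(s2)], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # actor: follow the deterministic policy gradient through the critic
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

B = 64                                        # usage with a dummy batch
update(torch.randn(B, S), torch.randn(B, A), torch.randn(B, 1), torch.randn(B, S))
```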
BiCNet
• Both the actor and the critic are built on a bi-directional RNN over the agents.
Design of the two networks
• Parameters are shared across agents, so the learned communication does not depend on which agent occupies which slot, and the parameter count stays fixed as agents are added.
• Because of this sharing, the number of agents can differ between training and test.
• The bi-directional RNN serves as the communication channel among the agents (see the sketch after this list).
• Full dependency among agents, because the gradients from all the actions in Eq. (9) are efficiently propagated through the entire networks.
• Not fully symmetric: certain social conventions and roles are maintained by fixing the order in which agents join the RNN, which also resolves any possible tie between multiple optimal joint actions.
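A minimal sketch of the actor side of this idea: per-agent features pass through a shared bidirectional GRU over the agent dimension, and each hidden state is decoded into that agent's action. The layer sizes and the choice of GRU are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BiRNNActor(nn.Module):
    """Shared parameters; works for any number of agents N."""
    def __init__(self, obs_dim=32, hidden=64, act_dim=3):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden)
        # bidirectional RNN over the *agent* dimension = communication channel
        self.rnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, act_dim)

    def forward(self, obs):                   # obs: (batch, N_agents, obs_dim)
        h, _ = self.rnn(torch.relu(self.embed(obs)))
        return torch.tanh(self.head(h))       # (batch, N_agents, act_dim) in [-1, 1]

actor = BiRNNActor()
actions = actor(torch.randn(2, 5, 32))        # same network, here used with 5 agents
print(actions.shape)                          # torch.Size([2, 5, 3])
```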
Experiments
• The BiCNet-controlled agents fight against StarCraft's built-in AI in a range of combat scenarios.
Experiments
• easy combats
- {3 Marines vs. 1 Super Zergling}
- {3 Wraiths vs. 3 Mutalisks}
• difficult combats
- {5 Marines vs. 5 Marines}
- {15 Marines vs. 16 Marines}
- {20 Marines vs. 30 Zerglings}
- {10 Marines vs. 13 Zerglings}
- {15 Wraiths vs. 17 Wraiths}
• heterogeneous combats
- {2 Dropships and 2 Tanks vs. 1 Ultralisk}
(Unit images: Marine, Zergling, Wraith, Mutalisk, Dropship, Ultralisk, Siege Tank; all images are from http://starcraft.wikia.com/wiki/)
Baselines
• Independent controller (IND): each agent is controlled independently; there is no communication between agents.
• Fully-connected (FC): agents communicate through fully-connected layers, so the architecture is tied to a fixed number of agents.
• CommNet: the averaged-message multi-agent communication model described earlier.
• GreedyMDP with Episodic Zero-Order Optimization (GMEZO): conducts collaboration through a greedy update over MDP agents, and adds episodic noise in the parameter space for exploration.
Action space for each individual agent
• A 3-dimensional real vector.
• 1st dimension: ranges from -1 to 1
  - greater than or equal to 0: the agent attacks
  - otherwise: the agent moves
• 2nd and 3rd dimensions: degree and distance, together indicating the destination to which the agent should move or attack, relative to its current location.
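A minimal sketch of decoding such an action vector into a game command (the exact scaling of degree and distance is an assumption for illustration):

```python
import math

def decode_action(action, x, y, max_range=10.0):
    """action = (attack_or_move, degree, distance), each in [-1, 1];
    (x, y) is the agent's current position on the map."""
    a, deg, dist = action
    angle = deg * math.pi                     # [-1, 1] -> [-pi, pi]
    radius = (dist + 1) / 2 * max_range       # [-1, 1] -> [0, max_range]
    tx, ty = x + radius * math.cos(angle), y + radius * math.sin(angle)
    cmd = "attack" if a >= 0.0 else "move"    # 1st dimension selects the command
    return cmd, (tx, ty)

print(decode_action((0.3, 0.5, -0.2), x=100.0, y=50.0))
```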
Training
• Nadam optimizer
• learning rate = 0.002
• 800 episodes (more than 700k steps)
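As a sketch, the same optimiser setting in Keras (whether the authors used Keras is not stated on this slide; the snippet just mirrors the listed hyperparameters):

```python
from tensorflow.keras.optimizers import Nadam

optimizer = Nadam(learning_rate=0.002)   # Nadam with the reported learning rate
```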
Simple Experiment
• tested on 100 independent games
• skip frame: how many frames are skipped between successive control decisions for the agents' actions
• batch_size = 32 (highest mean Q-value after 600k training steps) combined with skip_frame = 2 (highest mean Q-value between 300k and 600k steps) gives the highest winning rate.
Simple Experiment
• Letting 4~6 agents work together as a group makes it efficient to control individual agents while maximizing damage output.
• Fig. 3: a group size of 4~5 achieves the best performance.
• Fig. 4 shows the convergence speed by plotting the winning rate against the number of training episodes.
Performance Comparison
• BiCNet is trained over 100k steps
• measuring the performance as the average winning rate on 100 test
games
• when the number of agents goes beyond 10, the margin of
performance between BiCNet and the second best starts to increase
Performance Comparison
• In “5M vs. 5M”, the key factor to win is to “focus fire” on the weak.
• Because BiCNet has a built-in design for dynamic grouping, a small number of agents (such as “5M vs. 5M”) does not suffice to show its advantages on large-scale collaborations.
• For “5M vs. 5M”, BiCNet needs only 10 combats to learn the idea of “focus fire,” achieving an 85% win rate, whereas CommNet needs at least 50 episodes and reaches a much lower winning rate.
Visualization
• The “3 Marines vs. 1 Super Zergling” scenario, after the coordinated cover attack has been learned.
• Values collected from the last hidden layer of the well-trained critic network over 10k steps.
• The collected activations are projected to 2-D with t-SNE.
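A minimal sketch of this kind of visualisation (scikit-learn; the activations here are random stand-ins and the hidden size of 64 is an assumption):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

hidden = np.random.randn(2000, 64)           # stand-in for critic last-hidden-layer values
emb = TSNE(n_components=2, perplexity=30).fit_transform(hidden)
plt.scatter(emb[:, 0], emb[:, 1], s=2)
plt.title("t-SNE of critic hidden activations")
plt.show()
```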
Strategies to Experiment
• Move without collision
• Hit and run
• Cover attack
• Focus fire without overkill
• Collaboration between heterogeneous agents
Each of these coordination tasks is illustrated in the following slides.
Demo video: https://www.youtube.com/watch?v=kW2q15MNFug
Coordinated moves without collision (3 Marines
(ours) vs. 1 Super Zergling)
• Early in training, the agents move in a rather uncoordinated way; in particular, when two agents are close to each other, one agent may unintentionally block the other's path.
• After 40k steps (around 50 episodes), the number of collisions drops dramatically.
Winning rate against difficult settings
Hit and Run tactics (3 Marines (ours) vs. 1 Zealot)
Move agents away when under attack, and fight back when they feel safe again.
Coordinated Cover Attack (4 Dragoons (ours) vs. 2
Ultralisks)
• Let one agent draw fire or attention from the enemies.
• In the meantime, the other agents can take advantage of the time or distance gap to cause more harm.
Coordinated Cover Attack (3 Marines (ours) vs. 1
Zergling)
Focus fire without overkill (15 Marines (ours) vs. 16
Marines)
• How to efficiently allocate the attacking resources becomes important.
• The grouping design in the policy network serves as the key factor for BiCNet to learn
“focus fire without overkill.”
• Even as our unit count decreases, each group is dynamically reassigned so that 3~5 units keep focusing their attacks on the same enemy.
Collaborations between heterogeneous agents (2
Dropships and 2 tanks vs. 1 Ultralisk)
• StarCraft has a wide variety of unit types.
• Heterogeneous collaboration can be easily handled within BiCNet.
Further to Investigate after this paper
• Strong correlation between the specified reward and the
learned policies
• How the policies are communicated over the networks
among agents
• Whether there is a specific language that may have
emerged
• Nash equilibrium when both sides are played by deep multi-agent models.
