1
Deep Multi-agent Reinforcement Learning
Presenter: Daewoo Kim
LANADA, KAIST
2
 Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S., "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," NIPS 2016.
 Gupta, J. K., Egorov, M., Kochenderfer, M., "Cooperative Multi-Agent Control Using Deep Reinforcement Learning," Adaptive Learning Agents (ALA) 2017.
 Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments," NIPS 2017.
 Hausknecht, M. J., "Cooperation and Communication in Multiagent Deep Reinforcement Learning," 2016.
Papers
3
 We live in a multi-agent world
 Multi-agent RL
– Cooperative behaviors of multiple agents are not easily learned by a single agent
– How, then, can we make multiple agents cooperate via RL?
Motivation
4
Example: Predator-prey
5
Example: Half Field Offense
6
 Multiple single agents (Baseline)
 Centralized (Baseline)
 Multi-agent RL with communication
 Distributed Multi-agent RL
 Ad hoc teamwork
How to Run Multiple Agents
7
 Training: a single agent trains in its own environment
 Execution: multiple (identical) single agents run in the same environment (see the code sketch after the diagram below)
Multiple Single Agent (Naïve approach)
[Diagram: Training — each agent 𝑖 interacts with its own environment, updating its actor and critic networks from state, reward, and action. Execution — all agents act in one shared environment using only their actor networks.]
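To make this baseline concrete, here is a minimal sketch (my own illustration, not code from the cited papers): each agent is trained alone against its own copy of the environment, and the trained policies are then dropped together into one shared environment. The `DummyEnv` and `Agent` classes, episode length, and action count are placeholder assumptions.

```python
import random

class DummyEnv:
    """Stand-in environment: 10-step episodes, per-agent scalar observations."""
    def __init__(self, n_agents=1):
        self.n = n_agents
    def reset(self):
        self.t = 0
        return [0.0] * self.n
    def step(self, actions):
        self.t += 1
        obs = [float(self.t)] * self.n
        reward = random.random()                  # one reward signal per step
        return obs, reward, self.t >= 10

class Agent:
    """Toy agent: random policy with a no-op update hook standing in for actor/critic learning."""
    def act(self, obs):
        return random.randrange(4)                # an actor network would choose the action here
    def update(self, obs, action, reward, next_obs):
        pass                                      # an actor/critic update would go here

# Training: each agent learns alone in its own copy of the environment.
agents = [Agent() for _ in range(3)]
for agent in agents:
    env = DummyEnv(n_agents=1)
    obs, done = env.reset(), False
    while not done:
        a = agent.act(obs[0])
        next_obs, r, done = env.step([a])
        agent.update(obs[0], a, r, next_obs[0])
        obs = next_obs

# Execution: the independently trained agents act together in one shared environment.
shared_env = DummyEnv(n_agents=3)
obs, done = shared_env.reset(), False
while not done:
    actions = [ag.act(o) for ag, o in zip(agents, obs)]
    obs, reward, done = shared_env.step(actions)
```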
8
 Multiple agents are controlled by a single controller (agent)
 The state and action spaces of all agents are concatenated
 The large joint state and action spaces make learning challenging (see the sketch below)
Centralized
[Diagram: the controller receives the concatenated states of all agents and a shared reward, and outputs the concatenated actions; a single actor network and critic network are updated.]
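A short sketch (my own illustration, with made-up dimensions) of why the centralized baseline scales poorly: the controller observes the concatenation of all per-agent states and must output the joint action, so with discrete actions the joint action space grows exponentially with the number of agents.

```python
import numpy as np

n_agents, obs_dim, n_actions = 3, 8, 5

# The controller sees one concatenated state vector ...
per_agent_obs = [np.random.randn(obs_dim) for _ in range(n_agents)]
joint_state = np.concatenate(per_agent_obs)       # shape: (n_agents * obs_dim,) = (24,)

# ... and must output one joint action covering every agent.
joint_action = [np.random.randint(n_actions) for _ in range(n_agents)]

# The discrete joint action space has n_actions ** n_agents combinations:
# 5 ** 3 = 125 here, and it explodes as agents are added.
print(joint_state.shape, n_actions ** n_agents)
```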
9
 Two players are on one team
 They share the reward (score)
 How can they cooperate? → With communication
Multi-agent RL with communication
[Diagram: both agents receive an observation and the shared reward and choose actions; Agent 1 sends the message "Pass me!" to Agent 2, who sends no message back.]
10
 Two players are on one team with a shared reward
 They cannot communicate with each other
 If they practice together for a long time, they can learn to cooperate without communication
Distributed Multi-agent RL
[Diagram: both agents receive an observation and the shared reward and choose actions; there is no communication channel.]
11
 Two players are on one team with a shared reward
 They do not know each other
 Pre-coordinating a team may not always be possible
Ad Hoc Teamwork
[Diagram: both agents receive an observation and the shared reward and choose actions, without any prior coordination.]
12
Summary
| | Multiple single agents | Ad hoc teamwork | Distributed multi-agent RL | Multi-agent RL with comm. | Centralized RL |
|---|---|---|---|---|---|
| Agent | Multiple | Multiple | Multiple | Multiple | Single |
| Training | Separately | Separately | Together | Together | Together |
| Execution | Separately | Separately | Separately | Together with comm. | Together |
| Reward | Indep. | Shared | Shared/Indep. | Shared | Shared |
| Cooperation | No | Yes | Yes | Yes | Yes |

(Part 1 of this talk covers distributed multi-agent RL; Part 2 covers multi-agent RL with communication.)
13
 Naïve approach: concurrent learning
– Centralized training with decentralized (distributed) execution
– Each agent's policy is independent
– Each agent maintains its own actor and critic networks
– All agents share the reward (see the sketch after the diagram below)
Distributed Multi-agent RL
[Diagram: Training — each agent observes the state and the shared reward and updates its own actor and critic networks. Execution — each agent selects actions using only its actor network.]
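A sketch of the concurrent scheme under the assumptions above: every agent keeps its own actor and critic, and every agent's update is driven by the single shared team reward. The tiny networks, dimensions, and placeholder loss are my own choices; the slides do not prescribe a specific architecture or learning rule.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

class ActorCritic(nn.Module):
    """Per-agent actor and critic; each agent owns an independent copy."""
    def __init__(self):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                   nn.Linear(32, act_dim), nn.Tanh())
        self.critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(),
                                    nn.Linear(32, 1))

agents = [ActorCritic() for _ in range(n_agents)]
optims = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in agents]

# One (fake) transition: per-agent observations and a single shared reward.
obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
shared_reward = torch.tensor([[1.0]])

for agent, opt, o in zip(agents, optims, obs):
    action = agent.actor(o)                       # decentralized execution uses only this
    q = agent.critic(torch.cat([o, action], dim=-1))
    # Placeholder loss: push each agent's Q toward the shared reward
    # (a real method would use TD targets and a policy-gradient term).
    loss = (q - shared_reward).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```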
14
 One agent learns well, but the other agent shows no ability
 Concurrent. One agent always learns how to perform the task before the other, so the other has less chance to learn
 Centralized. The centralized controller learns to use one agent exclusively for scoring goals and learns to walk the second agent away from the ball entirely
Result: Concurrent and Centralized
[Video stills: concurrent and centralized behavior]
15
1. Parameter sharing [2]
– Agents share the weights of the actor and critic networks
– Updates from both agents train the same network, so the agent with fewer chances to learn is trained as well
2. Multi-agent DDPG (MADDPG) [3]
– Why should agents share a reward?
– Agents can have arbitrary reward structures, including conflicting rewards in a competitive setting
– Observations are shared during training
Two Approaches
[2] Gupta, J. K., Egorov, M., Kochenderfer, M. “Cooperative Multi-Agent Control Using Deep
Reinforcement Learning”. Adaptive Learning Agents (ALA) 2017.
[3] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I. “Multi-Agent Actor-Critic for
Mixed Cooperative-Competitive Environments.” NIPS 2017
16
 Share the weights of the actor and critic networks between agents (see the code sketch after the diagram below)
– Leads to similar behaviors between agents
– Encourages both agents to participate even when the goal is achievable by a single agent
– Reduces the total number of parameters
Parameter Sharing
[Diagram: Agents 1 and 2 each hold actor and critic networks whose parameters are shared between them; both observe their own state, take actions, and receive the shared reward.]
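A sketch of parameter sharing in the spirit of [2] (the layer sizes, the one-hot agent index, and the helper below are my own assumptions): every agent evaluates the same actor and critic instances, so gradients from both agents' experience update a single set of weights.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

# One actor and one critic, shared by all agents.
shared_actor = nn.Sequential(nn.Linear(obs_dim + n_agents, 32), nn.ReLU(),
                             nn.Linear(32, act_dim), nn.Tanh())
shared_critic = nn.Sequential(nn.Linear(obs_dim + n_agents + act_dim, 32), nn.ReLU(),
                              nn.Linear(32, 1))      # used the same way during training

def act(agent_id, obs):
    """All agents call the same network; a one-hot agent index disambiguates them."""
    one_hot = torch.zeros(1, n_agents)
    one_hot[0, agent_id] = 1.0
    return shared_actor(torch.cat([obs, one_hot], dim=-1))

# Both agents' forward passes go through the same parameters, so an update driven
# by agent 0's experience also benefits agent 1.
a0 = act(0, torch.randn(1, obs_dim))
a1 = act(1, torch.randn(1, obs_dim))
print(sum(p.numel() for p in shared_actor.parameters()), "actor parameters in total")
```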
17
Design choices
 Share lower-layer parameters
– Same low-level processing of state features
– Specialization in the higher layers of the network allows each agent to develop a unique policy
 Share both the critic and actor networks
 How many layers to share?
– Two layers in this case
Parameter Sharing
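A sketch of the lower-layer-sharing design choice (layer widths are my own choice): the first two layers process state features identically for every agent, while a small per-agent head on top lets each agent specialize its policy.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

class SharedTrunkActor(nn.Module):
    """Two shared lower layers plus one small output head per agent for specialization."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(               # shared by all agents (the "two layers" choice)
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.heads = nn.ModuleList(               # one unique output head per agent
            [nn.Linear(64, act_dim) for _ in range(n_agents)]
        )

    def forward(self, agent_id, obs):
        return torch.tanh(self.heads[agent_id](self.trunk(obs)))

actor = SharedTrunkActor()
action_for_agent_0 = actor(0, torch.randn(1, obs_dim))
action_for_agent_1 = actor(1, torch.randn(1, obs_dim))
```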
18
 Shows cooperative behaviors
 The shared weights from the first agent allow the second
agent to learn to score
Result: Parameter Sharing
19
 Proposes a general-purpose multi-agent learning algorithm
1. Agents use only local information (i.e., their own observations) at execution time
→ Centralized training with decentralized execution
2. Applicable not only to cooperative interaction but also to competitive or mixed interaction
→ Each agent has its own reward, and the observations of all agents are shared during training
Multi-agent RL: MADDPG
[Diagram: Training — each agent receives its own reward (rewards are not shared) and updates its own actor and critic networks. Execution — each agent acts from its own observation using only its actor network.]
20
 𝑁: number of agents
 𝒙: state
 𝑜𝑖: observation of agent 𝑖
 𝑎𝑖: action of agent 𝑖
 𝑟𝑖: reward of agent 𝑖
 One sample from the experience replay buffer: (𝒙, 𝑎1, … , 𝑎𝑁, 𝑟1, … , 𝑟𝑁, 𝒙′), where 𝒙′ is the next state (see the sketch below)
Model
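As a concrete illustration of this model (my own sketch, not code from [3]), a replay-buffer transition can be stored as the state, every agent's action and reward, and the next state:

```python
import random
from collections import namedtuple, deque

# One MADDPG-style sample: state x, all agents' actions and rewards, next state x'.
Transition = namedtuple("Transition", ["x", "actions", "rewards", "x_next"])

class ReplayBuffer:
    """Shared buffer of joint transitions used during centralized training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def add(self, x, actions, rewards, x_next):
        self.buffer.append(Transition(x, actions, rewards, x_next))
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.add(x=[0.1, 0.2], actions=[[1.0], [0.0]], rewards=[1.0, -1.0], x_next=[0.3, 0.1])
```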
21
 Policies of the 𝑁 agents are parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and let $\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_N\}$ be the (deterministic, continuous) policies
 Goal: find $\theta_i$ maximizing the expected return for agent $i$, $J(\theta_i) = \mathbb{E}[R_i]$
 Gradient of $J(\theta_i)$, as given in [3]:
$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\big[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\big]$, where $\mathcal{D}$ is the experience replay buffer
Decentralized Actor Network
[Diagram: the actor network for agent 𝑖 (parameters 𝜃𝑖) maps its own observation to an action; the gradient signal comes from the critic.]
22
 Centralized action-value function for agent $i$: $Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N)$, which takes the state and the actions of all agents as input
 The centralized action-value function is updated as (following [3]; see the sketch below):
$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\big[\big(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y\big)^2\big], \quad y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(o_j)}$
where $\boldsymbol{\mu}'$ is the set of target policies with delayed parameters
Centralized Critic Network
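The sketch below illustrates the structure of the two updates above under my own simplifications (a single fake transition, toy dimensions, no target networks, the state taken as the concatenated observations): each agent's critic sees the full state and all agents' actions, while its actor is updated from its own observation only.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim, gamma = 2, 4, 2, 0.95
state_dim = n_agents * obs_dim                     # here x is the concatenated observations

actors  = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                         nn.Linear(32, act_dim), nn.Tanh()) for _ in range(n_agents)]
critics = [nn.Sequential(nn.Linear(state_dim + n_agents * act_dim, 32), nn.ReLU(),
                         nn.Linear(32, 1)) for _ in range(n_agents)]
actor_opts  = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

# One fake transition (x, a_1..a_N, r_1..r_N, x'), batch size 1.
obs      = [torch.randn(1, obs_dim) for _ in range(n_agents)]
obs_next = [torch.randn(1, obs_dim) for _ in range(n_agents)]
acts     = [torch.randn(1, act_dim) for _ in range(n_agents)]
rewards  = [torch.tensor([[1.0]]), torch.tensor([[-1.0]])]   # per-agent rewards, not shared
x, x_next = torch.cat(obs, dim=-1), torch.cat(obs_next, dim=-1)

for i in range(n_agents):
    # Centralized critic: y = r_i + gamma * Q_i(x', a'_1..a'_N) with a'_j = mu_j(o'_j).
    with torch.no_grad():
        next_acts = [actors[j](obs_next[j]) for j in range(n_agents)]
        y = rewards[i] + gamma * critics[i](torch.cat([x_next] + next_acts, dim=-1))
    q = critics[i](torch.cat([x] + acts, dim=-1))
    critic_loss = (q - y).pow(2).mean()
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Decentralized actor: ascend Q_i(x, a_1..a_N) with a_i = mu_i(o_i).
    new_acts = [a.detach() for a in acts]
    new_acts[i] = actors[i](obs[i])                # only agent i's action carries gradient
    actor_loss = -critics[i](torch.cat([x] + new_acts, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```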
23
Result: MADDPG
[Figure: predator–prey experiment — average number of prey touches by predator per episode with 𝑁 = 𝐿 = 3, where the prey (adversaries) are slightly (30%) faster.]
24
 Multiple agents run in an environment
 Goal: maximize their shared utility
 Can communication improve performance? (see the sketch after the diagram below)
– What kind of information should they exchange?
– What if the channel capacity among agents is limited?
Multi-agent RL with communication [1]
[Diagram: both agents observe the environment and the shared reward; each has an action-select and a message-select module, and they exchange messages over a limited-capacity channel.]
[1] Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S. “Learning to Communicate with Deep Multi-Agent
Reinforcement Learning,” NIPS 2016
25
 𝑛 prisoners have been newly ushered into prison
 Each will be placed in an isolated cell
 Each day, the manager chooses one prisoner uniformly at random
– That prisoner has the chance to toggle the light bulb (communication)
– He also has the option of announcing that he believes all prisoners have been chosen by the manager at some point in time (action)
 If he is right, everybody goes home; otherwise, all die
Switch Riddle Problem
26
• Multi-agent: 𝑛 agents with a 1-bit communication channel
• State: 𝑛-bit array recording whether the 𝑖-th prisoner has been chosen
• Action: 'Announce' / 'None'
• Reward: +1 (freedom) / 0 (episode expires) / −1 (all die)
• Observation: 'None'
• Communication: switch (1 bit)
Multi-agent RL with comm.
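A sketch of the Switch Riddle as a multi-agent environment, following the formalization above; the episode limit of 4𝑛 days and the exact observation handed to the chosen prisoner are my own reading of the riddle, not taken from [1].

```python
import random

class SwitchRiddle:
    """n prisoners, a 1-bit switch; the prisoner in the room may 'Announce' or do nothing."""
    def __init__(self, n=3, max_days=None):
        self.n = n
        self.max_days = max_days if max_days is not None else 4 * n  # assumed episode limit

    def reset(self):
        self.has_been_chosen = [False] * self.n    # hidden n-bit state
        self.switch = 0                            # 1-bit communication channel
        self.day = 0
        return self._next_prisoner()

    def _next_prisoner(self):
        self.current = random.randrange(self.n)    # chosen uniformly at random each day
        self.has_been_chosen[self.current] = True
        return self.current, self.switch           # only the chosen prisoner sees the switch

    def step(self, announce, new_switch):
        """announce: bool action; new_switch: the bit the prisoner leaves on the switch."""
        self.switch = new_switch
        self.day += 1
        if announce:
            reward = 1 if all(self.has_been_chosen) else -1   # freedom or everyone dies
            return None, reward, True
        if self.day >= self.max_days:
            return None, 0, True                   # episode expires
        return self._next_prisoner(), 0, False

env = SwitchRiddle(n=3)
obs = env.reset()
obs, reward, done = env.step(announce=False, new_switch=1)
```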
27
 Agents can discover communication protocols through
Deep RL
 Protocols can be extracted and understood
Result
28
Thank you
More comments and questions at
dwkim@lanada.kaist.ac.kr

Editor's Notes

  • #4 Motivation. All of you may already know single-agent RL well, such as Q-learning, policy gradient, and so on. However, single-agent RL applies only in specific settings such as games, while we live in a multi-agent world: these settings all contain multiple agents that interact with each other. So today I'll explain reinforcement learning when there are multiple agents.
  • #7 Before explaining multi-agent RL, I would like to describe five classes of running multiple agents. It is hard to call the first two multi-agent RL, but I'll cover them for comparison. I'll briefly go through each class and look at its features.