1
Deep Multi-agent Reinforcement Learning
Presenter: Daewoo Kim
LANADA, KAIST
2
 Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S., "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," NIPS 2016.
 Gupta, J. K., Egorov, M., Kochenderfer, M., "Cooperative Multi-Agent Control Using Deep Reinforcement Learning," Adaptive Learning Agents (ALA) 2017.
 Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments," NIPS 2017.
 Hausknecht, M. J., "Cooperation and Communication in Multiagent Deep Reinforcement Learning," 2016.
Papers
3
 We live in a multi-agent world
 Multi-agent RL
– Cooperative behaviors of multiple agents are not easily learned by a single agent
– How, then, can we make multiple agents cooperate via RL?
Motivation
4
Example: Predator-prey
5
Example: Half Field Offense
6
 Multiple single agents (Baseline)
 Centralized (Baseline)
 Multi-agent RL with communication
 Distributed Multi-agent RL
 Ad hoc teamwork
How to Run Multiple Agents
7
 Training: a single agent trains in its own environment
 Execution: multiple (identical) single agents run in the same environment (see the code sketch after the diagram below)
Multiple Single Agent (Naïve approach)
[Diagram: Training — each agent 𝑖 interacts with its own environment, updating its actor and critic networks from state, reward, and action. Execution — all agents act in one shared environment using only their actor networks.]
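To make this baseline concrete, here is a minimal sketch (my own illustration, not code from the cited papers): each agent is trained alone against its own copy of the environment, and the trained policies are then dropped together into one shared environment. The `DummyEnv` and `Agent` classes, episode length, and action count are placeholder assumptions.

```python
import random

class DummyEnv:
    """Stand-in environment: 10-step episodes, per-agent scalar observations."""
    def __init__(self, n_agents=1):
        self.n = n_agents
    def reset(self):
        self.t = 0
        return [0.0] * self.n
    def step(self, actions):
        self.t += 1
        obs = [float(self.t)] * self.n
        reward = random.random()                  # one reward signal per step
        return obs, reward, self.t >= 10

class Agent:
    """Toy agent: random policy with a no-op update hook standing in for actor/critic learning."""
    def act(self, obs):
        return random.randrange(4)                # an actor network would choose the action here
    def update(self, obs, action, reward, next_obs):
        pass                                      # an actor/critic update would go here

# Training: each agent learns alone in its own copy of the environment.
agents = [Agent() for _ in range(3)]
for agent in agents:
    env = DummyEnv(n_agents=1)
    obs, done = env.reset(), False
    while not done:
        a = agent.act(obs[0])
        next_obs, r, done = env.step([a])
        agent.update(obs[0], a, r, next_obs[0])
        obs = next_obs

# Execution: the independently trained agents act together in one shared environment.
shared_env = DummyEnv(n_agents=3)
obs, done = shared_env.reset(), False
while not done:
    actions = [ag.act(o) for ag, o in zip(agents, obs)]
    obs, reward, done = shared_env.step(actions)
```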
8
 Multiple agents are controlled by a single controller (agent)
 The state and action spaces of all agents are concatenated
 The large joint state and action spaces make learning challenging (see the sketch below)
Centralized
[Diagram: the controller receives the concatenated states of all agents and a shared reward, and outputs the concatenated actions; a single actor network and critic network are updated.]
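A short sketch (my own illustration, with made-up dimensions) of why the centralized baseline scales poorly: the controller observes the concatenation of all per-agent states and must output the joint action, so with discrete actions the joint action space grows exponentially with the number of agents.

```python
import numpy as np

n_agents, obs_dim, n_actions = 3, 8, 5

# The controller sees one concatenated state vector ...
per_agent_obs = [np.random.randn(obs_dim) for _ in range(n_agents)]
joint_state = np.concatenate(per_agent_obs)       # shape: (n_agents * obs_dim,) = (24,)

# ... and must output one joint action covering every agent.
joint_action = [np.random.randint(n_actions) for _ in range(n_agents)]

# The discrete joint action space has n_actions ** n_agents combinations:
# 5 ** 3 = 125 here, and it explodes as agents are added.
print(joint_state.shape, n_actions ** n_agents)
```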
9
 Two players are on one team
 They share the reward (score)
 How can they cooperate? → With communication
Multi-agent RL with communication
[Diagram: both agents receive an observation and the shared reward and choose actions; Agent 1 sends the message "Pass me!" to Agent 2, who sends no message back.]
10
 Two players are on one team with a shared reward
 They cannot communicate with each other
 If they practice together for a long time, they can learn to cooperate without communication
Distributed Multi-agent RL
[Diagram: both agents receive an observation and the shared reward and choose actions; there is no communication channel.]
11
 Two players are on one team with a shared reward
 They do not know each other
 Pre-coordinating a team may not always be possible
Ad Hoc Teamwork
[Diagram: both agents receive an observation and the shared reward and choose actions, without any prior coordination.]
12
Summary
| | Multiple single agents | Ad hoc teamwork | Distributed multi-agent RL | Multi-agent RL with comm. | Centralized RL |
|---|---|---|---|---|---|
| Agent | Multiple | Multiple | Multiple | Multiple | Single |
| Training | Separately | Separately | Together | Together | Together |
| Execution | Separately | Separately | Separately | Together with comm. | Together |
| Reward | Indep. | Shared | Shared/Indep. | Shared | Shared |
| Cooperation | No | Yes | Yes | Yes | Yes |

(Part 1 of this talk covers distributed multi-agent RL; Part 2 covers multi-agent RL with communication.)
13
 Naïve approach: concurrent learning
– Centralized training with decentralized (distributed) execution
– Each agent's policy is independent
– Each agent maintains its own actor and critic networks
– All agents share the reward (see the sketch after the diagram below)
Distributed Multi-agent RL
[Diagram: Training — each agent observes the state and the shared reward and updates its own actor and critic networks. Execution — each agent selects actions using only its actor network.]
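A sketch of the concurrent scheme under the assumptions above: every agent keeps its own actor and critic, and every agent's update is driven by the single shared team reward. The tiny networks, dimensions, and placeholder loss are my own choices; the slides do not prescribe a specific architecture or learning rule.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

class ActorCritic(nn.Module):
    """Per-agent actor and critic; each agent owns an independent copy."""
    def __init__(self):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                   nn.Linear(32, act_dim), nn.Tanh())
        self.critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(),
                                    nn.Linear(32, 1))

agents = [ActorCritic() for _ in range(n_agents)]
optims = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in agents]

# One (fake) transition: per-agent observations and a single shared reward.
obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
shared_reward = torch.tensor([[1.0]])

for agent, opt, o in zip(agents, optims, obs):
    action = agent.actor(o)                       # decentralized execution uses only this
    q = agent.critic(torch.cat([o, action], dim=-1))
    # Placeholder loss: push each agent's Q toward the shared reward
    # (a real method would use TD targets and a policy-gradient term).
    loss = (q - shared_reward).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```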
14
 One agent learns well, but the other agent shows no ability
 Concurrent. One agent always learns how to perform the task before the other, so the other has less chance to learn
 Centralized. The centralized controller learns to use one agent exclusively for scoring goals and learns to walk the second agent away from the ball entirely
Result: Concurrent and Centralized
[Video stills: concurrent and centralized behavior]
15
1. Parameter sharing [2]
– Agents share the weights of the actor and critic networks
– Updates from both agents train the same network, so the agent with fewer chances to learn is trained as well
2. Multi-agent DDPG (MADDPG) [3]
– Why should agents share a reward?
– Agents can have arbitrary reward structures, including conflicting rewards in a competitive setting
– Observations are shared during training
Two Approaches
[2] Gupta, J. K., Egorov, M., Kochenderfer, M. “Cooperative Multi-Agent Control Using Deep
Reinforcement Learning”. Adaptive Learning Agents (ALA) 2017.
[3] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I. “Multi-Agent Actor-Critic for
Mixed Cooperative-Competitive Environments.” NIPS 2017
16
 Share the weights of the actor and critic networks between agents (see the code sketch after the diagram below)
– Leads to similar behaviors between agents
– Encourages both agents to participate even when the goal is achievable by a single agent
– Reduces the total number of parameters
Parameter Sharing
[Diagram: Agents 1 and 2 each hold actor and critic networks whose parameters are shared between them; both observe their own state, take actions, and receive the shared reward.]
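A sketch of parameter sharing in the spirit of [2] (the layer sizes, the one-hot agent index, and the helper below are my own assumptions): every agent evaluates the same actor and critic instances, so gradients from both agents' experience update a single set of weights.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

# One actor and one critic, shared by all agents.
shared_actor = nn.Sequential(nn.Linear(obs_dim + n_agents, 32), nn.ReLU(),
                             nn.Linear(32, act_dim), nn.Tanh())
shared_critic = nn.Sequential(nn.Linear(obs_dim + n_agents + act_dim, 32), nn.ReLU(),
                              nn.Linear(32, 1))      # used the same way during training

def act(agent_id, obs):
    """All agents call the same network; a one-hot agent index disambiguates them."""
    one_hot = torch.zeros(1, n_agents)
    one_hot[0, agent_id] = 1.0
    return shared_actor(torch.cat([obs, one_hot], dim=-1))

# Both agents' forward passes go through the same parameters, so an update driven
# by agent 0's experience also benefits agent 1.
a0 = act(0, torch.randn(1, obs_dim))
a1 = act(1, torch.randn(1, obs_dim))
print(sum(p.numel() for p in shared_actor.parameters()), "actor parameters in total")
```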
17
Design choices
 Share lower-layer parameters
– Same low-level processing of state features
– Specialization in the higher layers of the network allows each agent to develop a unique policy
 Share both the critic and actor networks
 How many layers to share?
– Two layers in this case
Parameter Sharing
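A sketch of the lower-layer-sharing design choice (layer widths are my own choice): the first two layers process state features identically for every agent, while a small per-agent head on top lets each agent specialize its policy.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 6, 2, 2

class SharedTrunkActor(nn.Module):
    """Two shared lower layers plus one small output head per agent for specialization."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(               # shared by all agents (the "two layers" choice)
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.heads = nn.ModuleList(               # one unique output head per agent
            [nn.Linear(64, act_dim) for _ in range(n_agents)]
        )

    def forward(self, agent_id, obs):
        return torch.tanh(self.heads[agent_id](self.trunk(obs)))

actor = SharedTrunkActor()
action_for_agent_0 = actor(0, torch.randn(1, obs_dim))
action_for_agent_1 = actor(1, torch.randn(1, obs_dim))
```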
18
 Shows cooperative behaviors
 The shared weights from the first agent allow the second
agent to learn to score
Result: Parameter Sharing
19
 Proposes a general-purpose multi-agent learning algorithm
1. Agents use only local information (i.e., their own observations) at execution time
→ Centralized training with decentralized execution
2. Applicable not only to cooperative interaction but also to competitive or mixed interaction
→ Each agent has its own reward, and the observations of all agents are shared during training
Multi-agent RL: MADDPG
[Diagram: Training — each agent receives its own reward (rewards are not shared) and updates its own actor and critic networks. Execution — each agent acts from its own observation using only its actor network.]
20
 𝑁: number of agents
 𝒙: state
 𝑜𝑖: observation of agent 𝑖
 𝑎𝑖: action of agent 𝑖
 𝑟𝑖: reward of agent 𝑖
 One sample from the experience replay buffer: (𝒙, 𝑎1, … , 𝑎𝑁, 𝑟1, … , 𝑟𝑁, 𝒙′), where 𝒙′ is the next state (see the sketch below)
Model
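As a concrete illustration of this model (my own sketch, not code from [3]), a replay-buffer transition can be stored as the state, every agent's action and reward, and the next state:

```python
import random
from collections import namedtuple, deque

# One MADDPG-style sample: state x, all agents' actions and rewards, next state x'.
Transition = namedtuple("Transition", ["x", "actions", "rewards", "x_next"])

class ReplayBuffer:
    """Shared buffer of joint transitions used during centralized training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def add(self, x, actions, rewards, x_next):
        self.buffer.append(Transition(x, actions, rewards, x_next))
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.add(x=[0.1, 0.2], actions=[[1.0], [0.0]], rewards=[1.0, -1.0], x_next=[0.3, 0.1])
```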
21
 Policies of the 𝑁 agents are parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and let $\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_N\}$ be the (deterministic, continuous) policies
 Goal: find $\theta_i$ maximizing the expected return for agent $i$, $J(\theta_i) = \mathbb{E}[R_i]$
 Gradient of $J(\theta_i)$, as given in [3]:
$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\big[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\big]$, where $\mathcal{D}$ is the experience replay buffer
Decentralized Actor Network
[Diagram: the actor network for agent 𝑖 (parameters 𝜃𝑖) maps its own observation to an action; the gradient signal comes from the critic.]
22
 Centralized action-value function for agent $i$: $Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N)$, which takes the state and the actions of all agents as input
 The centralized action-value function is updated as (following [3]; see the sketch below):
$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\big[\big(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y\big)^2\big], \quad y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(o_j)}$
where $\boldsymbol{\mu}'$ is the set of target policies with delayed parameters
Centralized Critic Network
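The sketch below illustrates the structure of the two updates above under my own simplifications (a single fake transition, toy dimensions, no target networks, the state taken as the concatenated observations): each agent's critic sees the full state and all agents' actions, while its actor is updated from its own observation only.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim, gamma = 2, 4, 2, 0.95
state_dim = n_agents * obs_dim                     # here x is the concatenated observations

actors  = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                         nn.Linear(32, act_dim), nn.Tanh()) for _ in range(n_agents)]
critics = [nn.Sequential(nn.Linear(state_dim + n_agents * act_dim, 32), nn.ReLU(),
                         nn.Linear(32, 1)) for _ in range(n_agents)]
actor_opts  = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

# One fake transition (x, a_1..a_N, r_1..r_N, x'), batch size 1.
obs      = [torch.randn(1, obs_dim) for _ in range(n_agents)]
obs_next = [torch.randn(1, obs_dim) for _ in range(n_agents)]
acts     = [torch.randn(1, act_dim) for _ in range(n_agents)]
rewards  = [torch.tensor([[1.0]]), torch.tensor([[-1.0]])]   # per-agent rewards, not shared
x, x_next = torch.cat(obs, dim=-1), torch.cat(obs_next, dim=-1)

for i in range(n_agents):
    # Centralized critic: y = r_i + gamma * Q_i(x', a'_1..a'_N) with a'_j = mu_j(o'_j).
    with torch.no_grad():
        next_acts = [actors[j](obs_next[j]) for j in range(n_agents)]
        y = rewards[i] + gamma * critics[i](torch.cat([x_next] + next_acts, dim=-1))
    q = critics[i](torch.cat([x] + acts, dim=-1))
    critic_loss = (q - y).pow(2).mean()
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Decentralized actor: ascend Q_i(x, a_1..a_N) with a_i = mu_i(o_i).
    new_acts = [a.detach() for a in acts]
    new_acts[i] = actors[i](obs[i])                # only agent i's action carries gradient
    actor_loss = -critics[i](torch.cat([x] + new_acts, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```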
23
Result: MADDPG
[Figure: predator–prey experiment — average number of prey touches by predator per episode with 𝑁 = 𝐿 = 3, where the prey (adversaries) are slightly (30%) faster.]
24
 Multiple agents run in an environment
 Goal: maximize their shared utility
 Can communication improve performance? (see the sketch after the diagram below)
– What kind of information should they exchange?
– What if the channel capacity among agents is limited?
Multi-agent RL with communication [1]
[Diagram: both agents observe the environment and the shared reward; each has an action-select and a message-select module, and they exchange messages over a limited-capacity channel.]
[1] Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S. “Learning to Communicate with Deep Multi-Agent
Reinforcement Learning,” NIPS 2016
25
 𝑛 prisoners have been newly ushered into prison
 Each will be placed in an isolated cell
 Each day, the manager chooses one prisoner uniformly at random
– That prisoner has the chance to toggle the light bulb (communication)
– He also has the option of announcing that he believes all prisoners have been chosen by the manager at some point in time (action)
 If he is right, everybody goes home; otherwise, all die
Switch Riddle Problem
26
• Multi-agent: 𝑛 agents with a 1-bit communication channel
• State: 𝑛-bit array recording whether the 𝑖-th prisoner has been chosen
• Action: 'Announce' / 'None'
• Reward: +1 (freedom) / 0 (episode expires) / −1 (all die)
• Observation: 'None'
• Communication: switch (1 bit)
Multi-agent RL with comm.
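A sketch of the Switch Riddle as a multi-agent environment, following the formalization above; the episode limit of 4𝑛 days and the exact observation handed to the chosen prisoner are my own reading of the riddle, not taken from [1].

```python
import random

class SwitchRiddle:
    """n prisoners, a 1-bit switch; the prisoner in the room may 'Announce' or do nothing."""
    def __init__(self, n=3, max_days=None):
        self.n = n
        self.max_days = max_days if max_days is not None else 4 * n  # assumed episode limit

    def reset(self):
        self.has_been_chosen = [False] * self.n    # hidden n-bit state
        self.switch = 0                            # 1-bit communication channel
        self.day = 0
        return self._next_prisoner()

    def _next_prisoner(self):
        self.current = random.randrange(self.n)    # chosen uniformly at random each day
        self.has_been_chosen[self.current] = True
        return self.current, self.switch           # only the chosen prisoner sees the switch

    def step(self, announce, new_switch):
        """announce: bool action; new_switch: the bit the prisoner leaves on the switch."""
        self.switch = new_switch
        self.day += 1
        if announce:
            reward = 1 if all(self.has_been_chosen) else -1   # freedom or everyone dies
            return None, reward, True
        if self.day >= self.max_days:
            return None, 0, True                   # episode expires
        return self._next_prisoner(), 0, False

env = SwitchRiddle(n=3)
obs = env.reset()
obs, reward, done = env.step(announce=False, new_switch=1)
```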
27
 Agents can discover communication protocols through
Deep RL
 Protocols can be extracted and understood
Result
28
Thank you
More comments and questions at
dwkim@lanada.kaist.ac.kr

Editor's Notes

  • #4 Motivation. All of you may already know single-agent RL well, such as Q-learning, policy gradient, and so on. However, single-agent RL applies only in specific settings such as games, while we live in a multi-agent world: these settings all contain multiple agents that interact with each other. So today I'll explain reinforcement learning when there are multiple agents.
  • #7 Before explaining multi-agent RL, I would like to describe five classes of running multiple agents. It is hard to call the first two multi-agent RL, but I'll cover them for comparison. I'll briefly go through each class and look at its features.