1
Deep Multi-agent
Reinforcement Learning
Presenter: Daewoo Kim
LANADA, KAIST
2
Papers
• Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S. "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," NIPS 2016
• Gupta, J. K., Egorov, M., Kochenderfer, M. "Cooperative Multi-Agent Control Using Deep Reinforcement Learning," Adaptive Learning Agents (ALA) 2017
• Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments," NIPS 2017
• Hausknecht, M. J. "Cooperation and Communication in Multiagent Deep Reinforcement Learning," 2016
3
Motivation
• We live in a multi-agent world
• Multi-agent RL
– Cooperative behaviors of multiple agents are not easily learned by a single agent
– So how can we make multiple agents learn to cooperate through RL?
4
Example: Predator-prey
5
Example: Half Field Offense
6
How to Run Multiple Agents
• Multiple single agents (Baseline)
• Centralized (Baseline)
• Multi-agent RL with communication
• Distributed multi-agent RL
• Ad hoc teamwork
7
Multiple Single Agents (Naïve approach)
• Training: each agent trains alone in its own copy of the environment
• Execution: multiple (identical) single agents run in the same environment
[Diagram: during training, each agent i independently updates its own actor and critic networks from its own state, reward, and action; at execution, the trained actor networks of agents 1, 2, …, i act side by side in one shared environment]
8
Centralized
• Multiple agents are controlled by a single controller (agent)
• State & action spaces are concatenated
• The large joint state and action spaces make learning challenging
[Diagram: a single controller receives the concatenated states of all agents and the shared reward, and its actor and critic networks output the concatenated actions of all agents. A code sketch follows.]
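To make the centralized baseline concrete, here is a minimal PyTorch sketch; the class name CentralizedActor, layer sizes, and activations are illustrative assumptions, not taken from the slides. A single controller maps the concatenation of all agents' states to the concatenation of all agents' actions, so the input and output widths grow with the number of agents.

```python
# Minimal sketch of the centralized baseline: one controller sees the
# concatenation of all agents' states and emits the concatenation of all
# agents' actions. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class CentralizedActor(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
        super().__init__()
        # Input/output sizes grow linearly with the number of agents,
        # which is what makes this baseline hard to scale.
        self.net = nn.Sequential(
            nn.Linear(n_agents * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * act_dim), nn.Tanh(),
        )
        self.act_dim = act_dim

    def forward(self, per_agent_obs):                 # list of N tensors [obs_dim]
        joint_obs = torch.cat(per_agent_obs, dim=-1)  # concatenated state
        joint_act = self.net(joint_obs)               # concatenated action
        return joint_act.split(self.act_dim, dim=-1)  # N tensors [act_dim]
```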
9
Multi-agent RL with communication
• Two players form one team
• They share the reward (the score)
• How can they cooperate? → With communication
[Diagram: each agent receives its own observation and the shared reward, selects an action, and can send a message to the other agent, e.g. "Pass me!" while the other agent sends no message]
10
Distributed Multi-agent RL
• Two players form one team with a shared reward
• They cannot communicate with each other
• If they practice together for a long time, they can learn to cooperate without communication
[Diagram: each agent receives its own observation and the shared reward and selects its action independently, with no communication channel]
11
Ad Hoc Teamwork
• Two players form one team with a shared reward
• They do not know each other
• Pre-coordinating the team may not always be possible
[Diagram: each agent receives its own observation and the shared reward and must cooperate with an unknown teammate, without prior joint training]
12
Summary

             | Multiple single agent | Ad hoc teamwork | Distributed multi-agent RL | Multi-agent RL with comm. | Centralized RL
Agent        | Multiple              | Multiple        | Multiple                   | Multiple                  | Single
Training     | Separately            | Separately      | Together                   | Together                  | Together
Execution    | Separately            | Separately      | Separately                 | Together with comm.       | Together
Reward       | Indep.                | Shared          | Shared/Indep.              | Shared                    | Shared
Cooperation  | No                    | Yes             | Yes                        | Yes                       | Yes

Part 1 of this talk: distributed multi-agent RL. Part 2: multi-agent RL with communication.
13
Distributed Multi-agent RL
• Naïve approach: concurrent learning
– Centralized training with decentralized (distributed) execution
– Each agent's policy is independent
– Each agent maintains its own actor and critic networks
– All agents share the reward
[Diagram: during training, each agent updates its own actor and critic networks from its own state and action and the shared reward; at execution, each agent's actor network alone maps its state to an action. A code sketch follows.]
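Below is a minimal sketch of the concurrent baseline, assuming PyTorch; the names (Actor, Critic, concurrent_update) and hyperparameters are illustrative, and target networks, replay buffers, and exploration noise are omitted. The point is only the structure: each agent trains its own actor-critic pair, but every agent's TD target uses the one shared team reward.

```python
# Minimal sketch of the "concurrent" baseline: every agent keeps its own
# actor and critic and learns from its own observation/action, but all
# agents receive the same shared team reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def concurrent_update(agents, batch, gamma=0.95):
    # agents: list of (actor, critic, actor_opt, critic_opt) tuples.
    # batch: per-agent obs/act/next_obs plus ONE shared reward for the team.
    for i, (actor, critic, a_opt, c_opt) in enumerate(agents):
        obs, act, next_obs = batch["obs"][i], batch["act"][i], batch["next_obs"][i]
        with torch.no_grad():
            target = batch["shared_reward"] + gamma * critic(next_obs, actor(next_obs))
        # Critic regression toward the shared-reward TD target.
        c_loss = F.mse_loss(critic(obs, act), target)
        c_opt.zero_grad(); c_loss.backward(); c_opt.step()
        # Actor ascends its own critic's value estimate.
        a_loss = -critic(obs, actor(obs)).mean()
        a_opt.zero_grad(); a_loss.backward(); a_opt.step()
```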
14
Result: Concurrent and Centralized
• One agent learns well, but the other shows no ability
• Concurrent: one agent always learns the task before the other, and the other then has little chance to learn
• Centralized: the centralized controller learns to use one agent exclusively for scoring goals, and learns to walk the second agent away from the ball entirely
[Videos: concurrent and centralized behaviors]
15
Two Approaches
1. Parameter sharing [2]
– Agents share the weights of the actor and critic networks
– The agent with less chance to learn still gets updated, because the shared network is trained with both agents' experience
2. Multi-agent DDPG (MADDPG) [3]
– Why should agents have to share a reward?
– Agents can have arbitrary reward structures, including conflicting rewards in a competitive setting
– The observations of all agents are shared during training
[2] Gupta, J. K., Egorov, M., Kochenderfer, M. “Cooperative Multi-Agent Control Using Deep
Reinforcement Learning”. Adaptive Learning Agents (ALA) 2017.
[3] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I. “Multi-Agent Actor-Critic for
Mixed Cooperative-Competitive Environments.” NIPS 2017
16
Parameter Sharing
• Share the weights of the actor and critic networks between agents
– Leads to similar behaviors across agents
– Encourages both agents to participate even though the goal is achievable by a single agent
– Reduces the total number of parameters
[Diagram: both agents act in the environment through the same shared actor and critic networks, and both receive the shared reward. A code sketch follows.]
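A minimal sketch of parameter sharing, assuming PyTorch: all agents act through one shared actor and one shared critic, so either agent's experience updates the same weights. Appending a one-hot agent index to the observation is a common way to let agents still behave differently; it is an assumption here, not something stated on the slide.

```python
# Minimal sketch of parameter sharing: all agents act through ONE actor and
# ONE critic, so experience from either agent updates the same weights.
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 2, 16, 4

shared_actor = nn.Sequential(
    nn.Linear(obs_dim + n_agents, 64), nn.ReLU(),
    nn.Linear(64, act_dim), nn.Tanh())

shared_critic = nn.Sequential(
    nn.Linear(obs_dim + n_agents + act_dim, 64), nn.ReLU(),
    nn.Linear(64, 1))

def act(agent_id, obs):
    # A one-hot agent index (an assumed convention, not from the slide)
    # lets the shared network still produce agent-specific behavior.
    one_hot = torch.zeros(n_agents)
    one_hot[agent_id] = 1.0
    return shared_actor(torch.cat([obs, one_hot], dim=-1))

# Both agents' transitions flow into the same optimizer, so the agent with
# fewer useful experiences still benefits from the other's updates.
optimizer = torch.optim.Adam(
    list(shared_actor.parameters()) + list(shared_critic.parameters()), lr=1e-3)
```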
17
Parameter Sharing: Design Choices
• Share the lower-layer parameters
– Same low-level processing of state features
– Specialization in the higher layers of the network allows each agent to develop a unique policy
• Share both the critic and the actor networks
• How many layers should be shared?
– Two layers in this case (see the sketch below)
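A minimal sketch of the lower-layer sharing design choice, assuming PyTorch and illustrative sizes: two shared lower layers process state features, while each agent keeps its own unshared output head so it can specialize.

```python
# Minimal sketch of lower-layer sharing: a shared trunk handles low-level
# state-feature processing, and a per-agent head allows distinct policies.
import torch.nn as nn

class SharedTrunkPolicy(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim, hidden=64):
        super().__init__()
        # Two shared lower layers (matching the "two layers" choice above).
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # One unshared output head per agent for specialization.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, act_dim) for _ in range(n_agents))

    def forward(self, agent_id, obs):
        return self.heads[agent_id](self.trunk(obs))
```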
18
Result: Parameter Sharing
• Shows cooperative behaviors
• The shared weights from the first agent allow the second agent to learn to score
19
Multi-agent RL: MADDPG
• Proposes a general-purpose multi-agent learning algorithm
1. Agents use only local information (i.e. their own observations) at execution time
→ Centralized training with decentralized execution
2. Applicable not only to cooperative interaction but also to competitive or mixed interaction
→ Each agent has its own reward, and the observations of all agents are shared during training
[Diagram: during training, each agent has its own actor and critic networks and its own (unshared) reward; at execution, each agent's actor network acts from its own state alone. A sketch of the network shapes follows.]
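A minimal sketch of the MADDPG network shapes, assuming PyTorch; names and sizes are illustrative and this is not the authors' reference implementation. Each agent's actor consumes only that agent's observation, while each agent's critic consumes all observations and all actions, which is only possible during centralized training.

```python
# Minimal sketch of MADDPG's network shapes: decentralized actors,
# centralized critics.
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, own_obs):                       # local information only
        return self.net(own_obs)

class CentralizedCritic(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)    # all obs + all actions
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_acts):             # used only at training time
        return self.net(torch.cat(all_obs + all_acts, dim=-1))
```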
20
Model
• 𝑁: number of agents
• 𝒙: state
• 𝑜𝑖: observation of agent 𝑖
• 𝑎𝑖: action of agent 𝑖
• 𝑟𝑖: reward of agent 𝑖
• One sample (replay buffer entry): (𝒙, 𝑎1, …, 𝑎 𝑁, 𝑟1, …, 𝑟 𝑁, 𝒙′)
21
Decentralized Actor Network
• The policies of the 𝑁 agents are parameterized by 𝜃 = {𝜃1, … , 𝜃 𝑁},
and let 𝝁 = {𝜇1, … , 𝜇 𝑁} be the (continuous, deterministic) policies
• Goal: find 𝜃𝑖 maximizing the expected return 𝐽(𝜃𝑖) = 𝔼[𝑅𝑖] for agent 𝑖
• The gradient of 𝐽(𝜃𝑖) is given below
[Diagram: the actor network for agent 𝑖 (parameters 𝜃𝑖) maps the agent's observation to its action; the centralized critic supplies the gradient signal]
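The gradient of the expected return with respect to 𝜃𝑖, as given in the MADDPG paper (Lowe et al., NIPS 2017), where 𝒟 is the replay buffer of samples (𝒙, 𝑎1, …, 𝑎 𝑁, 𝑟1, …, 𝑟 𝑁, 𝒙′):

```latex
% Deterministic policy gradient for agent i (reconstructed from the MADDPG paper;
% the slide's original equation image is not available in this transcript).
\nabla_{\theta_i} J(\mu_i) =
  \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\!\left[
    \nabla_{\theta_i}\mu_i(a_i \mid o_i)\,
    \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N)
    \big|_{a_i=\mu_i(o_i)}
  \right]
```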
22
Centralized Critic Network
• Centralized action-value function 𝑄𝑖 for agent 𝑖: it takes the state and the actions of all agents as input
• The centralized action-value function is updated by minimizing the TD loss given below
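The update, as given in the MADDPG paper: agent 𝑖's centralized critic is regressed toward a TD target built from the target policies 𝜇′𝑗 and the target critic:

```latex
% Centralized critic update (reconstructed from the MADDPG paper; the slide's
% original equation image is not available in this transcript).
\mathcal{L}(\theta_i) =
  \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\!\left[
    \left( Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y \right)^2
  \right],
\qquad
y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \ldots, a_N')
      \big|_{a_j' = \mu_j'(o_j)}
```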
23
Result: MADDPG
Predator–prey experiment: average number of prey touches by the predators per episode, with 𝑁 = 𝐿 = 3, where the prey (adversaries) are slightly (30%) faster
24
Multi-agent RL with communication [1]
• Multiple agents run in an environment
• Goal: maximize their shared utility
• Can communication improve performance?
– What kind of information should they exchange?
– What if the channel capacity among agents is limited?
[Diagram: each agent receives its own observation and the shared reward and, besides selecting an action, selects a message to send to the other agent over a limited-capacity channel]
[1] Foerster, J. N., Assael, Y. M., de Freitas, N., Whiteson, S. "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," NIPS 2016
25
Switch Riddle Problem
• 𝑛 new prisoners arrive at a prison
• Each is placed in an isolated cell
• Each day, the warden chooses one prisoner uniformly at random
– That prisoner may toggle a light bulb (communication)
– That prisoner may also announce that he believes every prisoner has been chosen at some point in time (action)
• If the announcement is correct, everybody goes free; otherwise all die
26
Multi-agent RL with comm.
• Multi-Agent: 𝑛 agents with a 1-bit communication channel
• State: an 𝑛-bit array recording whether the 𝑖-th prisoner has been chosen yet
• Action: 'Announce' / 'None'
• Reward: +1 (freedom) / 0 (episode expires) / −1 (all die)
• Observation: 'None'
• Communication: the switch (1 bit)
(A minimal environment sketch follows.)
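A minimal Python sketch of the switch-riddle dynamics described above; the reward values follow the slide, while max_days is an assumed episode cap rather than the original setup's time limit.

```python
# Minimal sketch of the switch riddle: an n-bit hidden state records who has
# been chosen, a 1-bit bulb is the only communication channel, and a correct
# "announce" frees everyone while a wrong one ends the episode with -1.
import random

class SwitchRiddle:
    def __init__(self, n_agents, max_days=20):
        self.n, self.max_days = n_agents, max_days

    def reset(self):
        self.been_chosen = [False] * self.n    # hidden state: n-bit array
        self.bulb_on = False                   # the 1-bit channel
        self.day = 0
        return self._next_day()

    def _next_day(self):
        # The warden picks one prisoner uniformly at random each day.
        self.day += 1
        self.chosen = random.randrange(self.n)
        self.been_chosen[self.chosen] = True
        # Observation for the chosen prisoner: the current bulb state.
        return self.chosen, self.bulb_on

    def step(self, toggle, announce):
        """Apply the chosen prisoner's decisions; return (reward, done, obs)."""
        if toggle:                             # communication action
            self.bulb_on = not self.bulb_on
        if announce:                           # environment action
            reward = 1 if all(self.been_chosen) else -1
            return reward, True, None
        if self.day >= self.max_days:
            return 0, True, None               # episode expires
        return 0, False, self._next_day()
```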
27
Result
• Agents can discover communication protocols through deep RL
• The learned protocols can be extracted and understood
28
Thank you
More comments and questions at
dwkim@lanada.kaist.ac.kr
Editor's Notes
1. Motivation. You may all know single-agent RL well: Q-learning, policy gradient, and so on. But single-agent RL is used mostly in narrow settings such as games, while we live in a multi-agent world: all of these systems contain multiple agents that interact with each other. So today I will explain reinforcement learning when there are multiple agents.
2. Before explaining multi-agent RL, I will describe five classes of running multiple agents. It is hard to call the first two multi-agent RL, but I include them for comparison; I will briefly explain each and point out its features.