Multi-Agent Actor-Critic for Mixed
Cooperative-Competitive Environments
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb,
Pieter Abbeel, Igor Mordatch
NIPS 2017
Presenter: Yusuke Nakata (Chiba University)
• Multi-agent systems
• Reinforcement learning
• Proposed method
2
Applications of multi-agent systems
• Multi-robot control
• Multiplayer games
• Analysis of social dilemmas
3
Advantages of multi-agent systems
• Problem-solving ability
• May be able to solve problems that a single agent cannot solve alone
• Adaptability
• Respond to changes in the problem by adding or modifying agents
• Robustness
• Other agents compensate when one agent fails
• Parallelism
• Asynchronous processing speeds up the overall computation
• Modularity
• Reuse existing agents to reduce design cost
4
Reference: http://kodamaforest.blog112.fc2.com/blog-entry-57.html
Examples of Multi-agent Environment
5
Reinforcement learning
• Q-Learning
• Assumes an MDP
• Policy gradient
• High variance
• Actor-Critic
• Actor learns the policy
• Critic learns the value function
τ : trajectory
6
Bias and Variance
[Figure: four target-board panels illustrating the combinations Bias: Low / Variance: Low, Bias: Low / Variance: High, Bias: High / Variance: Low, Bias: High / Variance: High]
7
Proposed method
Advantages
• Policies use only local information (i.e., their own observations)
• Applicable to cooperative, competitive, and mixed environments
How
• Extend the actor-critic framework
• The critic uses extra information about the other agents' policies
• The actor uses only local information
8
Proposed method
How
• Extend the actor-critic framework
• The critic uses extra information about the other agents' policies
• The actor uses only local information
After the policies have been learned, the critic is no longer needed.
→ High parallelism
→ Short execution time
9
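To make the execution-time claim concrete (following the paper's notation, deterministic-policy case): after training, each agent only evaluates its own policy on its own observation; the centralized critic is used during learning only.

\[
a_i = \mu_{\theta_i}(o_i), \qquad i = 1, \dots, N,
\]

so all agents can act in parallel from purely local information, with no critic and no communication at execution time.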
Related work
• Agents independently learn with Q-Learning
• Non-stationary dynamics
• Agents independently learn with policy gradient
• High variance
• Sharing policy parameters
• Only for homogeneous agents; not applicable to competitive settings
• Inputting other agents' policies
• Optimistic and hysteretic Q-function updates
10
Partially Observable Markov games
• Number of Agents:
• Set of States:
• Set of actions:
• Set of observations:
• Stochastic policy:
• Transition function:
• Reward:
11
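The symbols on this slide did not survive extraction; they are reconstructed here following the paper's Markov-game notation:

\[
\begin{aligned}
&N \text{ agents}, \quad \text{states } \mathcal{S}, \quad \text{actions } \mathcal{A}_1,\dots,\mathcal{A}_N, \quad \text{observations } \mathcal{O}_1,\dots,\mathcal{O}_N,\\
&\text{stochastic policies } \pi_{\theta_i}\colon \mathcal{O}_i \times \mathcal{A}_i \to [0,1],\\
&\text{transition function } \mathcal{T}\colon \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathcal{S},\\
&\text{rewards } r_i\colon \mathcal{S} \times \mathcal{A}_i \to \mathbb{R}, \qquad
\text{return } R_i = \sum_{t=0}^{T} \gamma^t r_i^t .
\end{aligned}
\]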
Background
• Q-Learning and Deep Q-Networks
• Policy gradient
• Deep deterministic policy gradient
12
Q-learning and Deep Q-Networks
• Q-Learning
• DQN
13
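The equations on this slide were lost in extraction; the standard forms used in the paper's background are:

\[
Q^{\pi}(s,a) = \mathbb{E}\!\left[R \mid s_t = s,\, a_t = a\right], \qquad
Q^{\pi}(s,a) = \mathbb{E}_{s'}\!\left[r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\!\left[Q^{\pi}(s',a')\right]\right],
\]

and DQN learns \(Q^{*}\) by minimizing

\[
\mathcal{L}(\theta) = \mathbb{E}_{s,a,r,s'}\!\left[\big(Q^{*}(s,a \mid \theta) - y\big)^{2}\right], \qquad
y = r + \gamma \max_{a'} \bar{Q}^{*}(s',a'),
\]

where \(\bar{Q}\) is a target network and transitions are sampled from an experience replay buffer.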
Q-learning and Deep Q-Networks
• Difficulty: non-stationarity
• Difficulty for DQN: the experience replay buffer cannot be used
• When other agents' policies change, the state transition dynamics change
14
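The condition behind this slide, as stated in the paper: from any single agent's perspective the environment is non-stationary because the transition probabilities depend on the other agents' (changing) policies, which invalidates the Markov assumption behind experience replay:

\[
P(s' \mid s, a, \pi_1, \dots, \pi_N) \neq P(s' \mid s, a, \pi_1', \dots, \pi_N')
\quad \text{whenever any } \pi_i \neq \pi_i'.
\]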
Policy gradient
• Objective function:
• Gradient:
• REINFORCE:
• Actor-Critic:
15
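The slide's equations were lost; the standard policy-gradient forms from the paper's background are:

\[
J(\theta) = \mathbb{E}_{s \sim p^{\pi},\, a \sim \pi_{\theta}}[R], \qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim p^{\pi},\, a \sim \pi_{\theta}}\!\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s,a)\right].
\]

REINFORCE estimates \(Q^{\pi}(s,a)\) with the sampled return \(R_t = \sum_{i=t}^{T} \gamma^{\,i-t} r_i\) (unbiased but high variance); actor-critic instead learns an approximation of \(Q^{\pi}\) (the critic) and plugs it into the gradient above (the actor).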
Deep deterministic policy gradient
• Deterministic policy:
• Gradient:
• Off-policy algorithm
16
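The deterministic policy gradient used by DDPG (equations lost from the slide; form as in the paper):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a = \mu_{\theta}(s)}\right],
\]

with samples drawn from a replay buffer \(\mathcal{D}\), which makes the algorithm off-policy; target networks for both \(\mu\) and \(Q\) are used to stabilize learning.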
Methods
17
Methods
Centralized training
Decentralized execution
18
Gradient of Actor
• Multi-agent
• Single agent
19
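The two gradients compared on this slide (equations lost in extraction), written for deterministic policies as in the paper, with \(x = (o_1, \dots, o_N)\):

Multi-agent (agent \(i\), centralized critic):
\[
\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}}\!\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}\!\left(x, a_1, \dots, a_N\right)\big|_{a_i = \mu_i(o_i)}\right].
\]

Single agent (DDPG):
\[
\nabla_{\theta} J(\mu) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_{\theta} \mu(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a = \mu(s)}\right].
\]

The key difference: the centralized critic \(Q_i^{\mu}\) takes all agents' observations and actions, while the policy \(\mu_i\) still sees only the local observation \(o_i\).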
Gradient of Critic
• Multi-agent
• Single agent
20
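The centralized critic is trained with the following regression (from the paper), compared with the usual single-agent TD target:

Multi-agent (agent \(i\)):
\[
\mathcal{L}(\theta_i) = \mathbb{E}_{x, a, r, x'}\!\left[\big(Q_i^{\mu}(x, a_1, \dots, a_N) - y\big)^{2}\right], \qquad
y = r_i + \gamma\, Q_i^{\mu'}\!\left(x', a_1', \dots, a_N'\right)\big|_{a_j' = \mu_j'(o_j')},
\]
where \(\mu' = (\mu'_1, \dots, \mu'_N)\) are target policies with delayed parameters.

Single agent:
\[
\mathcal{L}(\theta) = \mathbb{E}_{s,a,r,s'}\!\left[\big(Q(s,a) - (r + \gamma\, Q'(s', \mu'(s')))\big)^{2}\right].
\]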
21
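A minimal PyTorch sketch of one MADDPG-style update step, to make the centralized-critic / decentralized-actor structure concrete. This is not the authors' implementation: the network sizes, continuous actions, two agents, and the pre-sampled batch are all simplifying assumptions, and target-network soft updates are omitted.

```python
# MADDPG-style update sketch (PyTorch). A simplified illustration, not the paper's code.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM, GAMMA = 2, 8, 2, 0.95

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

# Decentralized actors: each maps its own observation o_i to an action a_i.
actors        = [mlp(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]
target_actors = [mlp(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]
# Centralized critics: each sees every agent's observation and action.
critics        = [mlp(N_AGENTS * (OBS_DIM + ACT_DIM), 1) for _ in range(N_AGENTS)]
target_critics = [mlp(N_AGENTS * (OBS_DIM + ACT_DIM), 1) for _ in range(N_AGENTS)]
actor_opts  = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def update(obs, acts, rews, next_obs):
    """One gradient step. Each argument is a list with one tensor per agent:
    obs[i], next_obs[i]: (B, OBS_DIM); acts[i]: (B, ACT_DIM); rews[i]: (B, 1)."""
    all_obs      = torch.cat(obs, dim=1)
    all_next_obs = torch.cat(next_obs, dim=1)
    next_acts = [ta(no) for ta, no in zip(target_actors, next_obs)]
    for i in range(N_AGENTS):
        # Centralized critic: regress Q_i(x, a_1..a_N) onto the TD target y.
        y = rews[i] + GAMMA * target_critics[i](
            torch.cat([all_next_obs] + next_acts, dim=1)).detach()
        q = critics[i](torch.cat([all_obs] + list(acts), dim=1))
        critic_loss = ((q - y) ** 2).mean()
        critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

        # Decentralized actor: ascend Q_i with respect to agent i's own action,
        # feeding the policy only the local observation o_i.
        cur_acts = [a.detach() for a in acts]
        cur_acts[i] = actors[i](obs[i])
        actor_loss = -critics[i](torch.cat([all_obs] + cur_acts, dim=1)).mean()
        actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
    # (Target networks would be soft-updated toward the learned networks here.)

# Example call with random data, batch size B = 32:
B = 32
update(obs=[torch.randn(B, OBS_DIM) for _ in range(N_AGENTS)],
       acts=[torch.randn(B, ACT_DIM) for _ in range(N_AGENTS)],
       rews=[torch.randn(B, 1) for _ in range(N_AGENTS)],
       next_obs=[torch.randn(B, OBS_DIM) for _ in range(N_AGENTS)])
```

Note that only the critics take the joint observation-action input; at execution time the actors alone are evaluated, each on its own observation.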
Experiments
22
Experiments
24
Cooperative communication
25
Cooperative communication
• Proposed method
• Successfully learns the policy
• Traditional RL
• Listener ignores the speaker and moves to the middle
• Reason: lack of a consistent gradient signal
26
Experiments
Preventing eavesdropping
27
Conclusions
• Agents learn with a centralized critic and act with decentralized policies
• Useful in both cooperative and competitive environments
• The input space of Q grows linearly with the number of agents
28

An introduction to "Multi-agent actor-critic for mixed cooperative-competitive environments"

Editor's Notes

  • #24 LiveSlide Site https://www.youtube.com/watch?time_continue=64&v=QCmBo91Wy64