Imagination-Augmented Agents for Deep Reinforcement Learning (Seongjae Choi)
I will introduce a paper from DeepMind on the I2A architecture: Imagination-Augmented Agents for Deep Reinforcement Learning.
These slides were presented at the Deep Learning Study group in DAVIAN LAB.
Paper link: https://arxiv.org/abs/1707.06203
I am sharing the lecture materials from the March course "Theory and Practice of Reinforcement Learning" (강화학습의 이론과 실제).
1. Dynamic Programming
2. Policy iteration
3. Value iteration
4. Monte Carlo method
5. Temporal-Difference Learning
6. Sarsa
7. Q-learning
8. Introduction to the Keras deep learning framework and building a Super Mario environment
9. Building an AI Super Mario agent with DQN
The lectures followed this outline.
The Breakout explanation was based on code by Hyuk Ryeol Yang (양혁렬).
Since a new environment has since been released, items 8 and 9 can be skipped.
Materials for the new environment will be written and uploaded by the weekend.
The main obstacle in Bayesian statistics and Bayesian machine learning is computing the posterior distribution, which is intractable in many contexts. Today there are two main approaches that avoid computing the posterior directly: sampling methods (e.g., MCMC) and variational inference. Compared to variational inference, MCMC takes more time and is vulnerable to high-dimensional parameters; however, it is simple and comes with convergence guarantees. I'll briefly introduce several methods people use in applications.
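As a small illustration (my addition, not part of the original slides), here is a minimal random-walk Metropolis-Hastings sampler in Python; the target log-density and the step size `scale` are assumptions chosen only for the example.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=5000, scale=0.5, seed=0):
    """Random-walk Metropolis-Hastings: sample from an unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + scale * rng.standard_normal(x.shape)  # symmetric proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x.copy())
    return np.array(samples)

# Example: sample from a standard 2-D Gaussian (log-density known up to a constant).
draws = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
print(draws.mean(axis=0), draws.std(axis=0))
```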
Differential Geometry for Machine Learning (SEMINAR GROOT)
References:
Differential Geometry of Curves and Surfaces, Manfredo P. Do Carmo (2016)
Differential Geometry by Claudio Arezzo
Youtube: https://youtu.be/tKnBj7B2PSg
What is a Manifold?
Youtube: https://youtu.be/CEXSSz0gZI4
Shape analysis (MIT spring 2019) by Justin Solomon
Youtube: https://youtu.be/GEljqHZb30c
Tensor Calculus
Youtube: https://youtu.be/kGXr1SF3WmA
Manifolds: A Gentle Introduction, and Hyperbolic Geometry and Poincaré Embeddings by Brian Keng
Links: http://bjlkeng.github.io/posts/manifolds/, http://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/
Statistical Learning models for Manifold-Valued measurements with application to computer vision and neuroimaging by Hyunwoo J. Kim
Understanding Black-box Predictions via Influence Functions (SEMINAR GROOT)
Pang Wei Koh and Percy Liang
"Understanding Black-box Predictions via Influence Functions", ICML 2017 Best Paper
References:
https://youtu.be/0w9fLX_T6tY
https://arxiv.org/abs/1703.04730
Attention Is All You Need (NIPS 2017)
(Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin)
paper link: https://arxiv.org/pdf/1706.03762.pdf
Reference:
https://youtu.be/mxGCEWOxfe8 (by Minsuk Heo)
https://youtu.be/5vcj8kSwBCY (Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention)
1. COMA: Counterfactual Multi-Agent Policy Gradients
Jakob N. Foerster*, Gregory Farquhar*, Triantafyllos Afouras,
Nantas Nardelli, Shimon Whiteson
University of Oxford, United Kingdom
AAAI 2018
Presenter: Minseop Kim (김민섭)
github.com/minseop4898
minseop4898@naver.com
2. What is multi-agent?
• There are multiple decision-making agents.
• A global reward exists.
• Each agent has its own local action-observation history.
• The agents must cooperate to maximize the cumulative sum (return) of the global reward.
(Figure: single-agent vs. multi-agent setting)
3. Multi-Agent game G (partially observable setting)
• G = ⟨S, U, P, r, Z, O, n, γ⟩
• n: number of agents; a ∈ A ≡ {1, ..., n}: agent a.
• u^a ∈ U: agent a's action; u ∈ U ≡ U^n: joint action.
• s ∈ S: true environment state.
• P(s' | s, u): state transition probability.
• r(s, u): global reward.
• γ: discount factor.
• z ∈ Z: agent's local observation, drawn from the observation function O(s, a).
• τ^a: agent's action-observation history.
• π^a(u^a | τ^a): stochastic policy.
4. Centralized training (critic) of decentralized policies (actors)
• Centralized policy
• The joint action space grows exponentially with the number of agents.
• It becomes impossible to respect each agent's local partial observation.
• It cannot handle the multi-agent credit assignment problem (explained later).
• Decentralized value functions
• Each agent has its own value function.
• Each value function is computed from the agent's own action u^a, not the joint action u.
• Because it is a value function of each agent's local history τ^a, it cannot use the global state s.
• That is, it cannot evaluate the value of the global state and joint action, Q(s, u).
-> Propose a centralized value function (critic) with decentralized policies (actors).
5. Multi-agent credit assignment problem
• Multi-agent credit assignment problem
• Determining how much a particular agent's action contributed to the global reward.
• Naive way: each agent's policy gradient is weighted by a TD-error computed from the shared global reward,
g^a = ∇_θ log π^a(u^a | τ^a) (r + γ V(s_{t+1}) − V(s_t)).
• Because this TD-error is computed from the global reward, it does not solve the multi-agent credit assignment problem.
-> A different approach is needed.
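To make the limitation concrete, here is a minimal sketch (my addition, not from the slides) of that naive gradient in PyTorch: every agent's log-probability is weighted by the same global TD-error, so an agent whose action contributed nothing still receives an identical learning signal. The tensor names and shapes are hypothetical.

```python
import torch

def naive_actor_critic_loss(log_probs, value, next_value, global_reward, gamma=0.99):
    """Naive multi-agent actor-critic: one shared TD-error scales every agent's gradient.

    log_probs: list of log pi^a(u^a | tau^a), one scalar tensor per agent.
    value, next_value: centralized V(s_t), V(s_{t+1}); global_reward: scalar r.
    """
    # TD-error computed from the shared global reward (no per-agent credit).
    td_error = global_reward + gamma * next_value.detach() - value
    # Every agent receives exactly the same advantage signal.
    actor_loss = -sum(lp * td_error.detach() for lp in log_probs)
    critic_loss = td_error.pow(2)
    return actor_loss + critic_loss
```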
6. Difference rewards (Wolpert and Tumer, 2002)
• Each agent receives a shaped reward instead of the global reward.
• Shaped reward: D^a = r(s, u) − r(s, (u^{−a}, c^a)), where c^a is a default action.
• D^a measures how much agent a's chosen action increased or decreased the global reward relative to the default action,
• so it can solve the multi-agent credit assignment problem.
• However, computing r(s, (u^{−a}, c^a)) requires running the environment once more -> a separate counterfactual simulation is needed.
• The criterion for choosing the default action c^a is ambiguous.
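A hedged sketch of the extra evaluation this approach needs; `simulate_reward` and `default_action` are hypothetical stand-ins for a resettable reward model of the environment.

```python
def difference_reward(simulate_reward, state, joint_action, agent, default_action):
    """Difference reward D^a = r(s, u) - r(s, (u^{-a}, c^a)).

    The reward must be evaluated twice; the second evaluation is a separate
    counterfactual simulation in which agent `agent` takes the default action.
    """
    r_actual = simulate_reward(state, joint_action)             # r(s, u)
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action                      # replace u^a with c^a
    r_default = simulate_reward(state, tuple(counterfactual))   # r(s, (u^{-a}, c^a))
    return r_actual - r_default
```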
7. Approximating difference evaluations with local information (Colby, Curran, and Tumer, 2015)
• A function approximator estimates the counterfactual term r(s, (u^{−a}, c^a)).
• The extra counterfactual simulation environment is no longer needed.
• A default action c^a still has to be specified.
• The function approximator adds computation and memory overhead.
• Approximation error is possible.
8. Counterfactual baseline
• The core idea of this paper.
• Use Q instead of V and introduce a new baseline function, which resolves the problems above.
• Baseline function: marginalize over all actions agent a could take,
b(s, u^{−a}) = Σ_{u'^a} π^a(u'^a | τ^a) Q(s, (u^{−a}, u'^a)).
• The advantage function A^a(s, u) = Q(s, u) − b(s, u^{−a}) captures how much higher the Q value of agent a's chosen action is than the average Q value over all actions it could have taken -> solves the multi-agent credit assignment problem.
• No extra simulation, reward model, or user-designed default action is needed.
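A minimal sketch of this advantage for a single agent (my addition), assuming the critic already returns a vector of Q values over agent a's actions with the other agents' actions u^{−a} held fixed.

```python
import numpy as np

def counterfactual_advantage(q_values, policy_probs, chosen_action):
    """COMA advantage for one agent.

    q_values:     Q(s, (u^{-a}, u'^a)) for every action u'^a of agent a, shape [|U|].
    policy_probs: pi^a(u'^a | tau^a) for every action, shape [|U|].
    """
    baseline = np.dot(policy_probs, q_values)   # b(s, u^{-a}): expectation over agent a's actions
    return q_values[chosen_action] - baseline   # A^a(s, u) = Q(s, u) - b(s, u^{-a})
```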
9. Problems with using Q
• Q(state, action)
• Input: state, action.
• Output: the Q value of that (state, action) pair.
• Function approximator Q(state) in deep RL
• Input: state.
• Output: Q values for every action available in that state.
• Done for computational efficiency.
• Function approximator Q(state) in multi-agent deep RL
• Input: state.
• Output: one value per joint action, i.e. the size of the joint action space (impractical to train).
• Function approximator Q(state, u^{−a}) in multi-agent deep RL
• Proposed in this paper.
• Input: state and the other agents' actions u^{−a}.
• Output: the size of agent a's action space |U| (trainable).
• The counterfactual baseline can be computed with a single forward pass.
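A hedged PyTorch-style sketch of such a critic. The inputs (global state, the other agents' one-hot actions, the agent id) follow the slides, but the layer sizes and exact feature concatenation are assumptions.

```python
import torch
import torch.nn as nn

class COMACritic(nn.Module):
    """Centralized critic Q(s, u^{-a}): one Q value per action of agent a."""

    def __init__(self, state_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        # Input: global state + other agents' actions (one-hot) + agent id (one-hot).
        in_dim = state_dim + (n_agents - 1) * n_actions + n_agents
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, (u^{-a}, u'^a)) for every u'^a
        )

    def forward(self, state, other_actions_onehot, agent_id_onehot):
        x = torch.cat([state, other_actions_onehot, agent_id_onehot], dim=-1)
        return self.net(x)  # one forward pass yields every value needed for the baseline
```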
10. Information flow of COMA
• Bounded softmax
• The policy mixes the softmax output with a uniform distribution, (1 − ε)·softmax + ε/|U|, where ε is annealed linearly from 0.5 to 0.02 over episodes.
• Used for exploration.
• Parameter sharing of the policy network
• Used for efficiency.
• The agent's id is fed in as an additional input.
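A small sketch of this bounded softmax and its epsilon schedule (my reconstruction; the exact mixing form is an assumption inferred from the slide).

```python
import numpy as np

def bounded_softmax(logits, epsilon):
    """Mix the softmax policy with a uniform distribution so every action keeps
    probability at least epsilon / |U| (used for exploration)."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return (1.0 - epsilon) * probs + epsilon / len(logits)

def epsilon_schedule(episode, total_episodes, start=0.5, end=0.02):
    """Anneal epsilon linearly from 0.5 to 0.02 over the training episodes."""
    frac = min(episode / max(total_episodes - 1, 1), 1.0)
    return start + frac * (end - start)
```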
11. Experimental setup (StarCraft)
• Action (discrete)
• Move, attack, stop, no operation
• Restricted field of agent’s view
• Each agent's field of view is restricted to its attack range.
• A setting that enforces each agent's partial observability.
• Makes the game harder.
• Agent’s local observation
• Features of the units within the agent's field of view.
• Distance, relative x, relative y, unit type
• Global state
• Features of all units.
• Absolute x, absolute y, unit type
• Extra global state features
• Each unit's HP and cooldown.
12. Experiment
• Central-QV
• Decentralized actor, centralized critic
• Learns both Q and V
• Advantage function : Q – V
• Central-V
• Decentralized actor, centralized critic
• Learns only V
• Advantage function : TD-error
• IAC-V
• Decentralized actor-critic
• Uses a state value function V
• IAC-Q
• Decentralized actor-critic
• Uses an action value function Q