2014: Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
2016: Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
2017: Vinyals, Oriol, et al. "StarCraft II: A New Challenge for Reinforcement Learning." arXiv preprint arXiv:1708.04782 (2017).
Multi-agent settings that require cooperation or competition
Self-driving cars, conversational AI, large-scale factory robots, …
StarCraft
Peng, Peng, et al. "Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games." arXiv preprint arXiv:1703.10069 (2017).
Communication
Mordatch, Igor, and Pieter Abbeel. "Emergence of Grounded Compositional Language in Multi-Agent Populations." arXiv preprint arXiv:1703.04908 (2017).
https://blog.openai.com/learning-to-communicate/
Every agent passes a message to all the other agents
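A toy sketch of this broadcast channel, assuming continuous message vectors for simplicity (Mordatch & Abbeel actually learn discrete symbols); `CommAgent` and all dimensions here are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CommAgent(nn.Module):
    def __init__(self, obs_dim, msg_dim, n_agents, n_actions):
        super().__init__()
        in_dim = obs_dim + msg_dim * (n_agents - 1)   # own obs + others' messages
        self.policy = nn.Linear(in_dim, n_actions)    # action head
        self.speaker = nn.Linear(in_dim, msg_dim)     # message head
    def forward(self, obs, inbox):
        x = torch.cat([obs] + inbox, dim=-1)
        return self.policy(x), self.speaker(x)        # action logits, new message

n_agents, obs_dim, msg_dim = 3, 4, 2
agents = [CommAgent(obs_dim, msg_dim, n_agents, n_actions=5) for _ in range(n_agents)]
msgs = [torch.zeros(1, msg_dim) for _ in range(n_agents)]   # last round's messages
obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
# every agent reads all OTHER agents' messages, then acts and speaks
outs = [agent(obs[i], [m for j, m in enumerate(msgs) if j != i])
        for i, agent in enumerate(agents)]
```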
Actor-Critic + Centralized Q-value
Share the internal information of the other agents
Lowe, Ryan, et al. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments." arXiv preprint arXiv:1706.02275 (2017).
https://blog.openai.com/learning-to-cooperate-compete-and-communicate/
Centralized Q-value
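A minimal sketch of the centralized-critic idea from Lowe et al.: each actor acts from its own observation only, while a training-time critic scores the joint observation-action of all agents. Layer sizes and the `Actor`/`CentralizedCritic` names are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: conditions on its OWN observation only."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Training-time critic: sees every agent's observation and action."""
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_acts):
        # all_obs: (batch, n_agents, obs_dim); all_acts: (batch, n_agents, act_dim)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.net(x)                            # joint action-value

n_agents, obs_dim, act_dim = 3, 8, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critic = CentralizedCritic(n_agents, obs_dim, act_dim)
obs = torch.randn(4, n_agents, obs_dim)               # a batch of joint observations
acts = torch.stack([a(obs[:, i]) for i, a in enumerate(actors)], dim=1)
q = critic(obs, acts)                                 # (4, 1)
```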
Sparse Reward
Kulkarni, Tejas D., et al. "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation." Advances in Neural Information Processing Systems. 2016.
Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." arXiv preprint arXiv:1703.01161 (2017).
A non-zero reward comes only after roughly 30 correct actions
Feedback
You must climb down the rope, avoid the skull, and climb the ladder to grab the key, which is worth 100 points
Bacon, Pierre-Luc, Jean Harb, and Doina Precup. "The Option-Critic Architecture." AAAI. 2017.
[Figure: Non-hierarchical RL vs. Hierarchical RL. In non-hierarchical RL, a single policy maps each state to an action a_t and receives reward r_t. In hierarchical RL, a meta-controller chooses a goal Ω (Goal 1: grab the rope, Goal 2: climb down the ladder, Goal 3: jump) and a low-level controller issues the primitive actions a_{1,t}, a_{2,t}, a_{3,t} that achieve the current goal.]
Hierarchical RL solved Montezuma's Revenge well
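A control-flow sketch of the two-level loop in the spirit of Kulkarni et al.'s h-DQN. `ToyEnv`, `meta_policy`, `policy`, and `reached` are hypothetical stand-ins; the actual method trains both levels with DQN, the controller on intrinsic rewards and the meta-controller on the sparse extrinsic reward:

```python
import random

GOALS = ["grab rope", "climb down ladder", "jump"]    # illustrative subgoals

class ToyEnv:                                         # stand-in environment
    def reset(self):
        return 0
    def step(self, action):
        s = random.randrange(10)
        return s, float(s == 9), s == 9, {}           # sparse extrinsic reward

def meta_policy(state):
    return random.choice(GOALS)                       # meta-controller: pick a goal

def policy(state, goal):
    return random.randrange(4)                        # controller: primitive action

def reached(state, goal):
    return random.random() < 0.2                      # placeholder goal test

def run_episode(env, max_steps=100):
    state, total, t = env.reset(), 0.0, 0
    while t < max_steps:
        goal = meta_policy(state)                     # high level chooses a goal
        while t < max_steps and not reached(state, goal):
            state, r_ext, done, _ = env.step(policy(state, goal))
            # the controller would receive intrinsic reward 1 when `reached` fires;
            # the meta-controller is trained on the sparse extrinsic reward
            total += r_ext
            t += 1
            if done:
                return total
    return total

print(run_episode(ToyEnv()))
```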
A problem that cannot be solved by memorization
Weber, Théophane, et al. "Imagination-Augmented Agents for Deep Reinforcement Learning." arXiv preprint arXiv:1707.06203 (2017).
https://deepmind.com/blog/agents-imagine-and-plan/
The agent imagines, via an internal simulation, what would actually happen before acting
Model-free RL (Deep Q-learning, Policy Gradient, …) + Model-based RL (Imagination)
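A rough sketch of that combination, assuming a tiny learned transition model and one fixed-action imagined rollout per action; the real I2A uses a distilled rollout policy and far larger modules:

```python
import torch
import torch.nn as nn

class EnvModel(nn.Module):
    """Learned transition model: predicts next state and reward."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.f = nn.Linear(state_dim + n_actions, state_dim + 1)
        self.n_actions = n_actions
    def forward(self, s, a):
        onehot = torch.eye(self.n_actions)[a]
        out = self.f(torch.cat([s, onehot], dim=-1))
        return out[..., :-1], out[..., -1:]           # next state, reward

class I2A(nn.Module):
    def __init__(self, state_dim, n_actions, rollout_len=3):
        super().__init__()
        self.model = EnvModel(state_dim, n_actions)
        self.encoder = nn.GRU(state_dim + 1, 32, batch_first=True)
        self.model_free = nn.Linear(state_dim, 32)    # ordinary model-free path
        self.policy = nn.Linear(32 + n_actions * 32, n_actions)
        self.n_actions, self.rollout_len = n_actions, rollout_len
    def forward(self, s):
        codes = []
        for a0 in range(self.n_actions):              # one imagined rollout per action
            sim, steps = s, []
            a = torch.full((s.shape[0],), a0, dtype=torch.long)
            for _ in range(self.rollout_len):         # roll the internal model forward
                sim, r = self.model(sim, a)
                steps.append(torch.cat([sim, r], dim=-1))
            _, h = self.encoder(torch.stack(steps, dim=1))
            codes.append(h[-1])                       # summary of this imagined future
        x = torch.cat([self.model_free(s)] + codes, dim=-1)
        return self.policy(x)                         # model-free path + imagination

logits = I2A(state_dim=4, n_actions=2)(torch.randn(5, 4))
```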
Duan, Yan, et al. "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning." arXiv preprint arXiv:1611.02779 (2016).
https://www.youtube.com/playlist?list=PLp24ODExrsVeA-ZnOQhdhX6X7ed5H_W4q
One game = one episode
Even when an episode ends, the recurrent information is not reset but carried over
Define N episodes as one trial
Learn, across the N episodes, how to find the optimal way to play
Each new trial plays a new game (here, a new map)
RL²: Recurrent Network
Optimize the return of the trial, not the return of a single episode
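A minimal sketch of the RL² interaction loop, assuming a gym-like `env` stand-in: the GRU state is reset only at trial boundaries, the input carries the previous action, reward, and done flag, and the trial return is what gets optimized:

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # input: observation + previous action (one-hot) + previous reward + done flag
        self.rnn = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.head = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions
    def forward(self, obs, prev_a, prev_r, prev_d, h):
        x = torch.cat([obs, torch.eye(self.n_actions)[prev_a],
                       torch.tensor([prev_r, prev_d])])
        h = self.rnn(x.unsqueeze(0), h)
        return torch.distributions.Categorical(logits=self.head(h)[0]), h

def run_trial(env, policy, n_episodes):
    """One trial = n_episodes in the SAME sampled MDP; hidden state persists."""
    h = torch.zeros(1, 64)                        # reset ONLY at trial boundaries
    trial_return, prev_a, prev_r, prev_d = 0.0, 0, 0.0, 1.0
    for _ in range(n_episodes):
        obs, done = env.reset(), False            # episode reset: env only, not h
        while not done:
            dist, h = policy(torch.as_tensor(obs, dtype=torch.float32),
                             prev_a, prev_r, prev_d, h)
            a = dist.sample().item()
            obs, r, done, _ = env.step(a)
            prev_a, prev_r, prev_d = a, r, float(done)
            trial_return += r                     # RL² optimizes the TRIAL return
    return trial_return
```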
Model-Agnostic Meta-Learning
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400 (2017).
Learn several tasks simultaneously to find a central point of the weights
Then adapt to a new task with a single gradient update
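A minimal second-order MAML step on a toy linear-regression task; the task tuples and squared loss are illustrative assumptions, not the paper's RL setup:

```python
import torch

def maml_step(w, tasks, inner_lr=0.01, outer_lr=0.001):
    """w: dict of parameters; tasks: list of (x_train, y_train, x_val, y_val)."""
    outer_grads = {k: torch.zeros_like(v) for k, v in w.items()}
    for x_tr, y_tr, x_val, y_val in tasks:
        # inner step: adapt a copy of w to this task with one gradient update
        loss = ((x_tr @ w["W"] + w["b"] - y_tr) ** 2).mean()
        grads = torch.autograd.grad(loss, list(w.values()), create_graph=True)
        w_fast = {k: v - inner_lr * g for (k, v), g in zip(w.items(), grads)}
        # outer objective: loss of the ADAPTED weights on held-out task data
        val_loss = ((x_val @ w_fast["W"] + w_fast["b"] - y_val) ** 2).mean()
        for k, g in zip(w, torch.autograd.grad(val_loss, list(w.values()))):
            outer_grads[k] += g / len(tasks)
    # move the shared initialization (the "central point" of the weights)
    return {k: v - outer_lr * outer_grads[k] for k, v in w.items()}

w = {"W": torch.randn(1, 1, requires_grad=True),
     "b": torch.zeros(1, requires_grad=True)}
tasks = [(torch.randn(10, 1), torch.randn(10, 1),
          torch.randn(10, 1), torch.randn(10, 1)) for _ in range(4)]
w = maml_step(w, tasks)
```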
Drawback
However, training the Pointer Network requires additional supervision:
which segment contains the piece of the manual
…
…
Attention
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.
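A minimal soft-attention sketch in the spirit of Xu et al.; a dot-product score is used here for brevity where the paper scores locations with a small MLP:

```python
import torch
import torch.nn.functional as F

def soft_attention(query, features):
    # query: (batch, d); features: (batch, n, d), e.g. image regions or segments
    scores = torch.einsum("bd,bnd->bn", query, features)   # dot-product scores
    alpha = F.softmax(scores, dim=-1)                      # attention weights
    context = torch.einsum("bn,bnd->bd", alpha, features)  # weighted sum
    return context, alpha

ctx, alpha = soft_attention(torch.randn(2, 16), torch.randn(2, 5, 16))
```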
Gated-Attention + A3C
Chaplot, Devendra Singh, et al. "Gated-Attention Architectures for Task-Oriented Language Grounding." arXiv preprint arXiv:1706.07230 (2017).
https://sites.google.com/view/gated-attention/home
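A sketch of the Gated-Attention fusion: the instruction embedding is turned into a sigmoid gate that scales the image feature maps channel-wise. Dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, n_channels, instr_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(instr_dim, n_channels), nn.Sigmoid())
    def forward(self, img_feats, instr_emb):
        # img_feats: (batch, C, H, W); instr_emb: (batch, instr_dim)
        g = self.gate(instr_emb)                    # (batch, C), values in [0, 1]
        return img_feats * g[:, :, None, None]      # channel-wise gating

fused = GatedAttention(64, 32)(torch.randn(2, 64, 7, 7), torch.randn(2, 32))
```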
Self-Supervision + A3C
Hermann, Karl Moritz, et al. "Grounded language learning in a simulated 3D world." arXiv preprint arXiv:1706.06551 (2017).
https://www.youtube.com/watch?v=wJjdu1bPJ04
An agent that must understand even the relations between objects
Berthelot, David, Tom Schumm, and Luke Metz. "Began: Boundary equilibrium generative adversarial networks." arXiv preprint arXiv:1703.10717 (2017).
https://github.com/carpedm20/BEGAN-tensorflow
Kim, Taeksoo, et al. "Learning to discover cross-domain relations with generative adversarial networks." arXiv preprint arXiv:1703.05192 (2017).
https://github.com/carpedm20/DiscoGAN-pytorch
Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial training." arXiv preprint arXiv:1612.07828 (2016).
https://github.com/carpedm20/simulated-unsupervised-tensorflow