
Yuandong Tian at AI Frontiers : Planning in Reinforcement Learning

Deep Reinforcement Learning (DRL) has made strong progress on many tasks, such as board games, robotics, navigation, neural architecture search, etc. I will present our recently open-sourced DRL frameworks that facilitate game research and development. Our framework is scalable: we reproduce AlphaGoZero and AlphaZero using 2000 GPUs, achieving a super-human Go AI that beat four top-30 professional players. We also demonstrate the usability of our platform by training agents in real-time strategy games, where they show interesting behaviors with a small amount of resources.

  1. Planning in Reinforcement Learning. Yuandong Tian, Research Scientist, Facebook AI Research.
  2. AI works in a lot of situations: Medical, Translation, Personalization, Surveillance, Object Recognition, Smart Design, Speech Recognition, Board games.
  3. What AI still needs to improve: very little supervised data, complicated environments, lots of corner cases, and an exponential space to explore. Example domains: Home Robotics, Autonomous Driving, ChatBot, StarCraft, Question Answering, Common Sense.
  4. What AI still needs to improve. [Figure: performance vs. effort, showing a scary trend of slowing down below human level: initial enthusiasm ("It really works! All in AI!"), "Man, we need more data", trying all possible hacks ("How can that be…"), despair ("No way, it doesn't work").]
  5. What AI still needs to improve. [Same performance-vs-effort figure as the previous slide.] We need novel algorithms.
  6. Reinforcement Learning: an agent interacts with an environment, taking actions and receiving states and rewards [R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction]. Example domains: Atari Games, Go, DoTA 2, Doom, Quake 3, StarCraft. (A minimal sketch of this loop follows below.)
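For readers who want the agent/environment loop from this slide in code form, here is a minimal sketch in Python. The `env.reset()`/`env.step()` interface and the `RandomAgent` class are hypothetical placeholders chosen for illustration; they are not part of any framework mentioned in the talk.

```python
import random

class RandomAgent:
    """A trivial agent: picks uniformly random actions and learns nothing."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)

    def observe(self, state, action, reward, next_state, done):
        pass  # a learning agent would update its policy/value estimates here

def run_episode(env, agent):
    """One episode of the loop: state -> action -> (reward, next state) -> ..."""
    state = env.reset()                               # assumed environment interface
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)   # environment returns reward and next state
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward
```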
  7. Why is planning important? Just one example.
  8. AlphaGo Zero: generate training data from self-play replays, update the model, and repeat, with zero human knowledge [Silver et al., Mastering the game of Go without human knowledge, Nature 2017]. (A schematic version of this loop follows below.)
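The closed loop on this slide (self-play generates data, the data updates the model) can be outlined as below. This is a schematic sketch, not the actual AlphaGo Zero code: `new_game`, `mcts` and `train_step` are assumed callables standing in for the game rules, the MCTS search, and the network update.

```python
import random

def self_play_games(new_game, mcts_policy, n_games):
    """Generate training data as (state, search policy, final outcome) tuples.

    mcts_policy(game) -> dict mapping legal moves to search probabilities.
    """
    data = []
    for _ in range(n_games):
        game, trajectory = new_game(), []
        while not game.over():
            pi = mcts_policy(game)                      # visit-count policy from MCTS
            trajectory.append((game.state(), pi))
            moves, probs = zip(*pi.items())
            game.play(random.choices(moves, probs)[0])  # sample a move from pi
        z = game.winner()                               # final outcome, e.g. +1 / -1
        data.extend((s, p, z) for s, p in trajectory)
    return data

def training_loop(model, new_game, mcts, train_step, n_iters, games_per_iter):
    """Alternate between generating self-play data and updating the model.

    mcts(model, game) -> dict mapping moves to search probabilities.
    train_step(model, data) -> updated model fitted to (state, policy, outcome) data.
    """
    for _ in range(n_iters):
        data = self_play_games(new_game, lambda g: mcts(model, g), games_per_iter)
        model = train_step(model, data)
    return model
```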
  9. AlphaGo Zero strength. 3-day version: 4.9M games, 1600 rollouts/move, 20-block ResNet; defeats AlphaGo Lee. 40-day version: 29M games, 1600 rollouts/move, 40-block ResNet; defeats AlphaGo Master by 89:11.
  10. ELF: Extensive, Lightweight and Flexible framework for game research. Larry Zitnick, Qucheng Gong, Wenling Shang, Yuxin Wu, Yuandong Tian. https://github.com/facebookresearch/ELF [Y. Tian et al., ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games, NIPS 2017]
  11. ELF: a simple for-loop. The same action/state/reward loop between agent and environment, with the environment side in C++ and the agent side in Python.
  12. How ELF works: many game threads (C++) run in parallel, and their states are grouped into batches that the Python side consumes, one model forward pass per batch. (A sketch of this producer/consumer batching pattern follows below.)
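To make the batching idea concrete, here is a minimal producer/consumer sketch in Python: game threads push states onto a shared queue, and a consumer loop collects fixed-size batches for the model. This is a simplified illustration of the pattern only; ELF's real implementation does this in C++ with its own APIs, and `new_game`, `game.observe()` and `game.step()` below are hypothetical.

```python
import queue
import threading

state_queue = queue.Queue()

def game_thread(game_id, new_game, n_steps):
    """Producer (stand-in for a C++ game thread): push (game_id, state) items."""
    game = new_game()
    for _ in range(n_steps):
        state_queue.put((game_id, game.observe()))   # hypothetical game API
        game.step(game.default_action())

def batched_inference(policy, batch_size, n_batches):
    """Consumer (Python side): group states so the model runs once per batch."""
    for _ in range(n_batches):
        batch = [state_queue.get() for _ in range(batch_size)]
        ids, states = zip(*batch)
        actions = policy(list(states))               # single batched forward pass
        # In ELF, each action is then routed back to its originating game thread.

# Launching producers (illustrative):
# for i in range(8):
#     threading.Thread(target=game_thread, args=(i, new_game, 1000)).start()
```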
  13. Distributed ELF: a training server sends requests (game parameters) to many evaluation/self-play clients and receives experiences back, putting AlphaGoZero (more synchronization) and AlphaZero (less synchronization) into the same framework.
  14. ELF OpenGo: the system can be trained with 2000 GPUs in 2 weeks; decent performance against professional players and strong bots; abundant ablation analysis; decoupled design, with code highly reusable for other games. We open-source the code and the pre-trained model for the Go and ML community: https://github.com/pytorch/ELF (simple tutorial in the experimental branch: tutorial, tutorial_distri).
  15. ELF OpenGo performance.
     Vs top professional players (single GPU, 80k rollouts, 50 seconds per move; players offered unlimited thinking time), 20-0 overall:
       Name            ELO (world rank)   Result
       Kim Ji-seok     3590 (#3)          5-0
       Shin Jin-seo    3570 (#5)          5-0
       Park Yeonghun   3481 (#23)         5-0
       Choi Cheolhan   3466 (#30)         5-0
     Vs strong bot (LeelaZero [158603eb, 192x15, Apr. 25, 2018]): 980 wins, 18 losses (98.2%).
     Vs professional players: single GPU, 2k rollouts, 27-0 against Taiwanese pros.
  16. Planning is how new knowledge is created. [Figure: a game runs from random moves at game start toward meaningful moves at game end, where the reward signal is; the bot is already dan level even if the opening doesn't make much sense. Plot: win rate against a bot without planning grows from 50% toward 100% as the number of planning rollouts increases.] Training is almost always constrained by model capacity (which is why 40 blocks > 20 blocks).
  17. Planning is how new knowledge is created: one-step look-ahead, T-step look-ahead, tree search, Monte Carlo sampling, or learning a neural network that directly predicts the optimal value/policy. Temporal Difference (TD) in Reinforcement Learning (the standard update is written out below).
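The TD update referenced on this slide did not survive extraction; for reference, the textbook one-step TD(0) rule and its T-step look-ahead target are:

```latex
% One-step TD(0) update of the value estimate V:
V(s_t) \leftarrow V(s_t) + \alpha \big[ r_t + \gamma V(s_{t+1}) - V(s_t) \big]

% T-step look-ahead target (bootstrap after T steps):
G_t^{(T)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1} r_{t+T-1} + \gamma^{T} V(s_{t+T})
```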
  18. Planning is how game AI is created: from the current game situation, extensive search/planning evaluates the consequences at the leaves (black wins / white wins). [Example position: Lufei Ruan vs. Yifan Hou (2010).] Classic approaches: alpha-beta pruning + iterative deepening, Monte Carlo Tree Search, … (a compact alpha-beta sketch follows below).
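As a reminder of what alpha-beta pruning computes, here is a compact negamax-with-alpha-beta sketch. `children` and `evaluate` are assumed, game-specific callbacks, so this is a generic textbook routine rather than the engine behind any result in the talk.

```python
def alphabeta(state, depth, alpha, beta, children, evaluate):
    """Negamax search with alpha-beta pruning.

    children(state) -> successor states for the player to move.
    evaluate(state) -> heuristic value from the perspective of the player to move.
    """
    succ = children(state)
    if depth == 0 or not succ:
        return evaluate(state)
    best = float("-inf")
    for child in succ:
        # Negate the child's value: what is good for the opponent is bad for us.
        value = -alphabeta(child, depth - 1, -beta, -alpha, children, evaluate)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:
            break  # prune: the opponent will never let the game reach this branch
    return best
```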
  19. How to plan without a known model? If you have a ground-truth dynamics model (current state + action to take → next state): infinite access to the exact world model, the ability to query any (s, a) pair, … If you don't: only (human/expert) trajectories and no world model, limited access to the world model, no ability to restart the model or query arbitrary (s, a) pairs, noisy signals from the world model, … In that case, build one.
  20. Navigation to a target. Yi Wu, Georgia Gkioxari, Yuxin Wu. How to plan the trajectory?
  21. Build a semantic model. Task: find "oven". The model is an incomplete graph of the environment over concepts such as outdoor, living room, sofa, car, chair, dining room, kitchen, oven [Y. Wu et al., Learning and Planning with a Semantic Model, submitted to ICLR 2019].
  22. Build a semantic model via Bayesian inference: from learning experience 𝑌, infer the posterior 𝑃(𝑧|𝑌) over the connections between concepts (outdoor, living room, sofa, car, chair, dining room, kitchen, oven), with edge probabilities such as 0.7, 0.95, 0.8, 0.5, 0.6. Next step: "kitchen". (A per-edge posterior-update sketch follows below.)
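A minimal way to picture this posterior update is a Beta-Bernoulli model per edge of the semantic graph: each relation z (e.g. "the kitchen is reachable from the living room") gets a prior, and every success or failure observed during exploration updates it. The Beta-Bernoulli choice and all names below are my own simplification for illustration, not necessarily the model used in the paper.

```python
# Beta-Bernoulli posterior over semantic-graph edges (simplified illustration).
class EdgeBelief:
    def __init__(self, prior_success=1.0, prior_failure=1.0):
        # Beta(alpha, beta) prior over P(z = 1), i.e. "target B is reachable from room A".
        self.alpha = prior_success
        self.beta = prior_failure

    def update(self, observed_success):
        """Bayesian update from one piece of experience Y (reached the target or not)."""
        if observed_success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def posterior_mean(self):
        """P(z | Y): posterior probability that the connection holds."""
        return self.alpha / (self.alpha + self.beta)

# Example: belief about "kitchen reachable from living room" after some experience.
edge = EdgeBelief()
for outcome in [True, True, False, True]:   # hypothetical exploration outcomes
    edge.update(outcome)
print(round(edge.posterior_mean(), 2))      # -> 0.67
```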
  23. LEAPS: LEArning and Planning with a Semantic model. [Figure: a graph linking the living room to the kitchen, chair, sofa and dining room, with edge probabilities 𝑃(𝑧kitchen,living room), 𝑃(𝑧sofa,living room), 𝑃(𝑧chair,living room), 𝑃(𝑧dining,living room); after evidence arrives, e.g. 𝑃(𝑧sofa,living room|𝑌) = 1, the agent re-plans the trajectory and explores further.] (A planning sketch over such a graph follows below.)
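Planning over such a graph can be sketched as a shortest-path search where each edge cost is -log of its current posterior probability, so the chosen sub-goal sequence is the one most likely to succeed. This is a generic Dijkstra sketch under that assumption, not the exact planner from LEAPS, and the example probabilities are made up.

```python
import heapq
import math

def most_likely_path(edges, start, goal):
    """Dijkstra over -log(p): maximizes the product of edge success probabilities.

    edges: dict mapping node -> list of (neighbor, probability) pairs.
    Returns (success_probability, path) or (0.0, None) if the goal is unreachable.
    """
    frontier = [(0.0, start, [start])]          # (accumulated -log prob, node, path)
    best_cost = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return math.exp(-cost), path
        for nbr, p in edges.get(node, []):
            if p <= 0.0:
                continue
            new_cost = cost - math.log(p)
            if new_cost < best_cost.get(nbr, float("inf")):
                best_cost[nbr] = new_cost
                heapq.heappush(frontier, (new_cost, nbr, path + [nbr]))
    return 0.0, None

# Hypothetical posterior edge probabilities between rooms/objects.
graph = {
    "living room": [("kitchen", 0.7), ("dining room", 0.5)],
    "dining room": [("kitchen", 0.8)],
    "kitchen": [("oven", 0.95)],
}
print(most_likely_path(graph, "living room", "oven"))
# -> (~0.665, ['living room', 'kitchen', 'oven'])
```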
  24. House3D: built on the SUNCG dataset, 45K scenes, all objects fully labeled. https://github.com/facebookresearch/House3D Rendered modalities: RGB image, depth, segmentation mask, top-down map.
  25. Learning the Prior between Different Rooms.
  26. Test Performance on ConceptNav.
  27. Case study: go to "outdoor". [Graph over Birth, Outdoor, Living room and Garage with prior edge probabilities 𝑃(𝑧): 0.12, 0.38, 0.76, 0.73, 0.28.] Sub-goal: Outdoor.
  28. Case study, continued. The "Outdoor" sub-goal failed; the posterior 𝑃(𝑧|𝐸) drops the corresponding edge to 0.01 (remaining edges: 0.12, 0.38, 0.73, 0.28).
  29. Case study, continued. Next sub-goal: Garage. It also failed; that edge drops to 0.08 in the posterior 𝑃(𝑧|𝐸) (remaining edges: 0.12, 0.01, 0.73, 0.28).
  30. Case study, continued. Next sub-goal: Living Room. Success; the posterior 𝑃(𝑧|𝐸) raises that edge to 0.99 (remaining edges: 0.12, 0.08, 0.01, 0.28).
  31. Case study, continued. Final sub-goal: Outdoor, now reached via the living room. Success; posterior 𝑃(𝑧|𝐸) edges: 0.99, 0.08, 0.01, 0.28, 0.99.
  32. Improving Existing Plans.
  33. Iterative LQR: what if the dynamics are nonlinear? Linearize around the current trajectory, solve for a new policy, then sample according to the new policy and repeat. (A sketch of the inner LQR solve follows below.)
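For reference, the inner step that iterative LQR repeats is a finite-horizon LQR solve. The sketch below implements the standard discrete-time Riccati backward pass for linear(ized) dynamics and quadratic cost using numpy; it is a generic textbook routine offered as context, not the speaker's implementation, and the double-integrator example is made up.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon discrete LQR: returns feedback gains K_t with u_t = -K_t x_t.

    Dynamics:  x_{t+1} = A x_t + B u_t   (in iLQR, A and B come from linearizing
    the nonlinear dynamics around the current nominal trajectory).
    Cost:      sum_t  x_t^T Q x_t + u_t^T R u_t
    """
    P = Q.copy()                 # value-function Hessian at the final step
    gains = []
    for _ in range(horizon):
        # Riccati recursion, going backward in time.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return list(reversed(gains))

# Tiny example: double integrator (position, velocity) with one control input.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
K0 = lqr_backward_pass(A, B, Q, R, horizon=50)[0]
print(K0)   # feedback gain applied at the first time step
```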
  34. Non-differentiable plans: directly predicting combinatorial solutions, e.g. computing convex hulls with a seq2seq model [O. Vinyals et al., Pointer Networks, NIPS 2015], or scheduling a job to the i-th slot and training with policy gradient [H. Mao et al., Resource Management with Deep Reinforcement Learning, ACM Workshop on Hot Topics in Networks, 2016]. (A minimal policy-gradient sketch follows below.)
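Since the outputs here are discrete and non-differentiable, training typically uses the score-function (REINFORCE) policy gradient. Below is a minimal sketch of that estimator with a baseline; `sample_action` and `reward` are generic placeholders standing in for the scheduler or pointer network, introduced only for illustration.

```python
def reinforce_gradient(sample_action, reward, states, baseline=0.0):
    """Score-function (REINFORCE) gradient estimate over a batch of states.

    sample_action(state) -> (action, grad_log_prob), where grad_log_prob is the
    gradient of log pi(action | state) w.r.t. the policy parameters (e.g. a NumPy array).
    reward(state, action) -> scalar return for the sampled discrete decision.
    """
    grad = None
    for s in states:
        a, grad_logp = sample_action(s)
        advantage = reward(s, a) - baseline        # baseline reduces variance
        term = advantage * grad_logp               # (R - b) * d log pi / d theta
        grad = term if grad is None else grad + term
    return grad / len(states)                      # Monte Carlo average over the batch
```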
  35. Neural network rewriter: is it simpler to improve the solution progressively? At each step, starting from the current solution 𝑠_t, sample a region 𝑔_t ∼ SP(𝑔_t) with 𝑔_t ⊂ 𝑠_t (Score Predictor), sample a rewriting rule 𝑎_t ∼ RS(𝑔_t) (Rule Selector), and apply 𝑠_{t+1} = 𝑓(𝑠_t, 𝑔_t, 𝑎_t). Components: input encoder, score predictor, rule selector. [X. Chen and Y. Tian, Learning to Progressively Plan, submitted to ICLR 2019] (A sketch of this rewrite loop follows below.)
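The rewrite loop itself is straightforward to outline. The sketch below follows the sampling steps on the slide, with `score_predictor`, `rule_selector`, `apply_rule` and `cost` as assumed callables standing in for the learned components and the objective; it is an illustration of the loop, not the paper's implementation.

```python
import math
import random

def progressive_rewrite(initial_solution, score_predictor, rule_selector,
                        apply_rule, cost, n_steps):
    """Iteratively improve a solution: pick a region, pick a rule, rewrite.

    score_predictor(solution) -> {region: score} over candidate regions g_t ⊂ s_t.
    rule_selector(solution, region) -> a rewriting rule a_t for that region.
    apply_rule(solution, region, rule) -> new solution s_{t+1} = f(s_t, g_t, a_t).
    cost(solution) -> scalar objective (lower is better).
    """
    best = current = initial_solution
    for _ in range(n_steps):
        scores = score_predictor(current)
        regions = list(scores)
        weights = [math.exp(scores[g]) for g in regions]      # softmax-style sampling
        region = random.choices(regions, weights=weights)[0]  # g_t ~ SP
        rule = rule_selector(current, region)                 # a_t ~ RS(g_t)
        current = apply_rule(current, region, rule)           # s_{t+1} = f(s_t, g_t, a_t)
        if cost(current) < cost(best):
            best = current                                    # keep the best solution seen
    return best
```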
  36. Job scheduling. [Figure: jobs with durations and resource demands (e.g. T=2, S=1; T=3, S=2; T=1, S=3) packed onto two resources over time, the resulting slowdowns, and the corresponding graph representation of the schedule.]
  37. Job scheduling with the neural rewriter. [Figure: the schedule state 𝑠_t is embedded with a DAG-LSTM; fully-connected heads implement the Score Predictor (region scores, e.g. g2: 0.1, g4: 0.7, g5: 0.2) and the Rule Selector (softmax over rewriting actions), producing 𝑔_t, 𝑎_t and the next state 𝑠_{t+1}.]
  38. Job scheduling results (lower is better; LB = lower bound, UB = upper bound):
       #Resources:               2       5       10      20
       Shortest Job First        4.80    5.83    5.58    5.00
       Shortest First Search     4.25    5.05    5.54    4.98
       DeepRM                    2.81    6.52    9.20    10.18
       Neural Rewriter (Ours)    2.80    3.36    4.50    4.63
       Optim (LB)                2.57    3.02    4.08    4.26
       Earliest Job First (UB)   11.11   13.62   22.13   24.23
  39. Expression simplification with the neural rewriter. [Figure: the expression tree 𝑠_t is encoded with a Tree-LSTM embedding; fully-connected heads implement the Score Predictor and the Rule Selector (softmax over rules a1…a19, e.g. a5: constant reduction), selecting a sub-tree 𝑔_t and a rule 𝑎_t to produce 𝑠_{t+1}; the example rewrites a sub-expression of a "≤ min(…)" tree.]
  40. Expression simplification results:
       Method                       Expr Len Reduction   Tree Size Reduction
       Halide Rule-based Rewriter   36.13                9.68
       Heuristic Search             43.27                12.09
       Neural Rewriter (Ours)       46.98                13.53
  41. Example: simplify 5 ≤ max(𝑣, 3) + 3. Rewriting the max gives (5 ≤ 𝑣 + 3) OR (5 ≤ 3 + 3); the second disjunct is true, so the whole expression simplifies to True.
  42. Future directions: multi-agent RL, hierarchical RL (e.g. admiral/general, captain, lieutenant levels of command), RL systems, model-based RL (policy optimization and model estimation), RL applications, RL for optimization.
  43. How to do well in Reinforcement Learning? Experience with (distributed) systems, strong math skills, strong coding skills, and parameter-tuning skills.
  44. Thanks!
