
Value iteration networks

These slides introduce Value Iteration Networks, presented at NIPS 2016.


  1. Value Iteration Networks A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, Dept. of Electrical Engineering and Computer Sciences, UC Berkeley. Presenter: Keisuke Fujimoto (Twitter @peisuke)
  2. Value Iteration Networks Purpose: machine-learning-based robot path planning; the learned planner works in new environments that are not in the training data set. Strategy: predict the optimal action; the method learns the reward of each place and selects actions that collect high rewards. Result: planning on 28 x 28 grid maps; also applicable to continuous-control robots. [Figure: the network maps map, pose, velocity, and goal to an action] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, Dept. of Electrical Engineering and Computer Sciences, UC Berkeley. Presenter: Keisuke Fujimoto (ABEJA)
  3. Background Target: autonomous robots, e.g. manipulation, navigation, and transfer robots. Problem: policies learned by reinforcement learning do not work outside of their training environments. [Figure: a manipulation robot reaching a target object and a navigation robot heading to a goal]
  4. Contribution • Value Iteration Networks (VIN) • Model-free training: no robot dynamics model is required. • Generalized action prediction: the policy also works in environments not seen during training. • Key approach: represent value-iteration planning by a CNN, which predicts a reward map and computes the sum of future rewards.
  5. Overview of VIN Input: state of the robot (pose, velocity), goal, map (left fig.). Output: action (direction, motor torque). Strategy: determine the optimal action using the predicted rewards (right fig.). [Figure: robot state and map on the left, reward map on the right]
  6. Reward propagation • The action can be determined from the sum of future rewards, which is generated by propagating rewards over the map. [Figure: one-step propagation example on a grid map with obstacle cells (reward -10) and a goal cell (reward 1); a left-move action propagates 0.9 into the cell to the right of the goal, an up-move action propagates 0.9 into the cell below it]
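A minimal NumPy sketch of the one-step propagation idea on this slide. The grid values, the discount of 0.9, and the exact update rule (the value a cell gets from a move is 0.9 times the reward-plus-value of the destination cell) are illustrative simplifications, not the learned operation from the paper:

```python
import numpy as np

def propagate_move(reward, value, action, gamma=0.9):
    """One-step reward propagation for a single grid-move action (toy version)."""
    incoming = gamma * (reward + value)      # value of landing in each cell
    q = np.full_like(value, -1e9)            # moves that leave the grid stay very low
    if action == 'left':                     # destination is the cell to the left
        q[:, 1:] = incoming[:, :-1]
    elif action == 'right':
        q[:, :-1] = incoming[:, 1:]
    elif action == 'up':                     # destination is the cell above
        q[1:, :] = incoming[:-1, :]
    elif action == 'down':
        q[:-1, :] = incoming[1:, :]
    return q

# Toy map: obstacles (reward -10) in the left column, goal (reward 1) in the centre.
reward = np.array([[-10., 0., 0.],
                   [-10., 1., 0.],
                   [-10., 0., 0.]])
value = np.zeros_like(reward)
print(propagate_move(reward, value, 'left'))  # the cell right of the goal receives 0.9
```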
  7. Determination of action • After propagation, each cell keeps the maximum reward over the per-action propagated maps (middle fig.). • The optimal action is then determined from the propagated rewards at the current robot pose (right fig.). [Figure: left-move and up-move propagation maps, their element-wise max, and the fully propagated reward map with the current robot pose marked]
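Continuing the toy sketch above, action selection is then just an argmax over the per-action propagated values at the robot's cell (the helper name and inputs below are illustrative):

```python
def best_action(q_maps, robot_pos):
    """Return the action whose propagated value is highest at the robot's cell."""
    i, j = robot_pos
    return max(q_maps, key=lambda action: q_maps[action][i, j])

# q_maps can be built with the previous sketch, e.g.
#   q_maps = {a: propagate_move(reward, value, a) for a in ('left', 'right', 'up', 'down')}
# After several propagation sweeps, best_action(q_maps, robot_pos) points toward the goal.
```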
  8. Value Iteration Module • Reward propagation is implemented with a convolutional neural network. • The input is the reward map and the output is the map of summed future rewards. • Q is the hidden per-action reward map obtained by convolution; V is the summed-future-reward map obtained by a max over the Q channels. [Figure: reward map -> convolution -> Q -> max -> V, applied recurrently]
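A minimal NumPy/SciPy sketch of this module, assuming one 3x3 kernel per action for the reward map and one for the current value map; the paper learns these kernels end-to-end as part of the network, so the random weights, shapes, and helper name here are only placeholders:

```python
import numpy as np
from scipy.signal import convolve2d

def vi_module(reward_map, kernels_r, kernels_v, num_iters=20):
    """Value-iteration module: repeat (convolution -> max over action channels).

    reward_map          : (H, W) predicted reward map
    kernels_r, kernels_v: (A, 3, 3) per-action kernels applied to the reward map
                          and to the current value map
    Returns the final value map V (H, W) and the last hidden Q maps (A, H, W).
    """
    value = np.zeros_like(reward_map)
    q = np.zeros((len(kernels_r),) + reward_map.shape)
    for _ in range(num_iters):
        q = np.stack([convolve2d(reward_map, kr, mode='same')
                      + convolve2d(value, kv, mode='same')
                      for kr, kv in zip(kernels_r, kernels_v)])   # Q: (A, H, W)
        value = q.max(axis=0)                                     # V: max over actions
    return value, q

rng = np.random.default_rng(0)
V, Q = vi_module(rng.normal(size=(28, 28)),
                 kernels_r=0.1 * rng.normal(size=(8, 3, 3)),
                 kernels_v=0.1 * rng.normal(size=(8, 3, 3)))
```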
  9. Value Iteration Networks • Deep architecture of Value Iteration Networks. • The input is the map and the robot state; fR predicts the reward map. • The attention module crops the value map around the robot position. • 𝜓 outputs the optimal action.
  10. Attention function • The attention module crops a subset of the values around the current robot pose. • The optimal action depends only on the values near the current robot pose. • Thanks to this attention module, predicting the optimal action becomes easier. [Figure: propagated reward map with a 3 x 3 area selected around the robot's cell]
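A minimal sketch of the attention step, assuming it is implemented as a simple crop of the per-action value maps around the robot's grid cell; the window radius and array shapes below are illustrative:

```python
import numpy as np

def attention_crop(q_maps, robot_pos, radius=1):
    """Crop a (2*radius+1) x (2*radius+1) window around the robot's cell.

    q_maps   : (A, H, W) per-action value maps from the VI module
    robot_pos: (row, col) of the robot on the grid
    """
    i, j = robot_pos
    padded = np.pad(q_maps, ((0, 0), (radius, radius), (radius, radius)),
                    mode='edge')                     # repeat border values near edges
    return padded[:, i:i + 2 * radius + 1, j:j + 2 * radius + 1]

# Example: 8 actions on a 28 x 28 map, robot at cell (20, 5) -> window of shape (8, 3, 3).
window = attention_crop(np.zeros((8, 28, 28)), robot_pos=(20, 5))
```

Only this small window is fed to the final fully connected layers, which is what keeps the action prediction local to the robot's pose.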
  11. Grid-World Domain Environment: occupancy grid maps; test sizes range from 8x8 to 28x28; the number of VI recurrences is 20 for the 28x28 maps; the training data set contains 5000 maps with 7 trajectories each. Network architecture: map and goal -> CNN -> reward map -> VI module -> attention at the current position -> FC layer -> action; a 3-layer net with 150 hidden nodes, 10 channels in the Q-layer, 80 parameters. Baselines: a CNN-based Deep Q-Network and direct action prediction using an FCN.
  12. Results of Grid-World Domain [Figure: predicted path, predicted reward map, and summed-future-reward map]
  13. Mars Rover Navigation Environment: • A rover navigates the surface of Mars. • The path is predicted from the surface image alone, without explicit obstacle information. • Success rate is 90.3%. [Figure: red points mark sharp elevation changes; at prediction time, VIN does not use this elevation information]
  14. Continuous Control Environment: • The method is applied to a continuous control space. • The grid size is 28x28. • The inputs are continuous-valued position and velocity. • The output is a 2D continuous control command. [Figure: comparison of final distance to the goal; this result is from the authors' presentation]
  15. WebNav Challenge Environment: • Navigate website links to reach the page described by a query. • Features: average word embeddings. • An approximate graph is used for planning. Evaluation: • Success rate within the top-4 predictions. • Test set 1: start from the index page. • Test set 2: start from a random page. Result:
  16. Conclusion Purpose: • Machine-learning-based robot path planning. Method: • Learn the reward of each place and predict the action using the propagated rewards. Result: • VIN policies learn an approximate planning computation relevant for solving the task. • The approach generalizes from grid-worlds to continuous control and even to navigating Wikipedia links.
  17. Code: https://github.com/peisuke/vin This code is implemented in Chainer! Twitter: @peisuke We are hiring!! https://www.wantedly.com/companies/abeja
