RISECamp 2017: Reinforcement Learning Concepts

  1. Roy Fox, RISECamp, 7 Sep 2017: Reinforcement Learning Concepts
  2. Sequential Decision Making [figure contrasting nonsequential and sequential decisions]
  3. Example: Maze Navigation • State: agent location (s_t) • Action: where to move (a_t) • Reward: prize for reaching the target, cost for hitting a wall (r_t); a toy version is sketched below
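The maze example maps naturally onto a tiny environment object. Below is a minimal sketch; the grid layout, the reward values, and the `MazeEnv` class with its `reset`/`step` methods are illustrative assumptions, not code from the talk.

```python
import numpy as np

class MazeEnv:
    """Toy grid maze: the state is the agent's (row, col) location.
    Layout, rewards, and method names are illustrative assumptions."""

    GRID = np.array([[0, 0, 1],   # 0 = free cell, 1 = wall
                     [1, 0, 0],
                     [0, 0, 0]])
    GOAL = (2, 2)
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.state = (0, 0)          # fixed start corner
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        # Hitting the boundary or a wall: stay put and pay a cost
        if not (0 <= nr < 3 and 0 <= nc < 3) or self.GRID[nr, nc] == 1:
            return self.state, -1.0, False
        self.state = (nr, nc)
        done = self.state == self.GOAL
        reward = 1.0 if done else 0.0  # prize for reaching the target
        return self.state, reward, done
```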
  4. Markov Decision Process (MDP) • Trajectory: sequence of states, actions, and rewards s_0, a_0, r_0, s_1, a_1, r_1, s_2, … • State dynamics: p(s_{t+1} | s_t, a_t) • Policy: deterministic a_t = π(s_t), or stochastic π(a_t | s_t) • Reward: r_t = r(s_t, a_t)
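To make the four MDP ingredients concrete, here is a toy tabular instance; the two-state, two-action numbers in `P`, `R`, and `pi` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP; all numbers are made up.
P = np.array([[[0.9, 0.1],    # p(s' | s=0, a=0)
               [0.2, 0.8]],   # p(s' | s=0, a=1)
              [[0.5, 0.5],    # p(s' | s=1, a=0)
               [0.0, 1.0]]])  # p(s' | s=1, a=1)
R = np.array([[1.0, 0.0],     # r(s, a)
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],    # stochastic policy pi(a | s=0)
               [0.4, 0.6]])   # pi(a | s=1)

def sample_transition(s):
    """One step of the MDP under pi: a ~ pi(.|s), r = r(s,a), s' ~ p(.|s,a)."""
    a = rng.choice(2, p=pi[s])
    r = R[s, a]
    s_next = rng.choice(2, p=P[s, a])
    return a, r, s_next
```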
  5. "Rolling Out" • Environment: Reset() → get initial state s_0; Step(a_t) → get reward r(s_t, a_t) and draw next state s_{t+1} ~ p(s_{t+1} | s_t, a_t) • Agent policy: Action(s_t) → draw next action a_t ~ π(a_t | s_t); a minimal rollout loop is sketched below
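A minimal version of the rollout loop the slide describes, assuming an `env` exposing `reset`/`step` (like the `MazeEnv` sketch above) and a hypothetical `agent` whose `action` method samples from π(a_t | s_t):

```python
def rollout(env, agent, max_steps=100):
    """Roll out one trajectory s0, a0, r0, s1, a1, r1, ... and return it."""
    trajectory = []
    s = env.reset()                    # Reset() -> initial state s_0
    for _ in range(max_steps):
        a = agent.action(s)            # Action(s_t) -> a_t ~ pi(a_t | s_t)
        s_next, r, done = env.step(a)  # Step(a_t) -> r(s_t, a_t), s_{t+1}
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory
```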
  6. Example: Game of Go • State: board • Action: place stone • Reward: captures • Environment: can be simulated
  7. Example: Autonomous Vehicle • State: cameras, GPS • Action: steer, accelerate • Reward: speed, constraint satisfaction • Environment: physical
  8. Example: Surgical Robot • State: endoscope image, joint angles • Action: change in joint angles • Reward: task success • Environment: physical
  9. Policy Evaluation • Return: R = r_0 + γ·r_1 + γ²·r_2 + ⋯ • Discount 0 ≤ γ ≤ 1: prefer early rewards, late costs • Policy value is its expected return: V_π = E[R] [figure: sampled returns R = 7, 5, 7, 7, 5, 7, averaging ≈ 6.3]
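A sketch of Monte Carlo policy evaluation under these definitions, reusing the `rollout` helper from above; the discount value and episode count are arbitrary assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """R = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory's rewards."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

def evaluate_policy(env, agent, episodes=1000, gamma=0.99):
    """Monte Carlo estimate of V_pi = E[R]: average returns over rollouts."""
    returns = [discounted_return([r for _, _, r in rollout(env, agent)], gamma)
               for _ in range(episodes)]
    return sum(returns) / len(returns)
```

With the six sampled returns from the slide's figure (7, 5, 7, 7, 5, 7, taken as undiscounted), the estimate of V_π is their average, about 6.3.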
  10. Value Function • A policy's value function is its expected return given the current state and action: Q_π(s, a) = E[R | s, a] • The optimal policy satisfies π(s) = argmax_a Q_π(s, a)
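Given a tabular Q_π, extracting the greedy policy is a single argmax per state; a minimal sketch, assuming Q is stored as an array:

```python
import numpy as np

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a), for Q stored as an |S| x |A| array."""
    return np.argmax(Q, axis=1)
```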
  11. Value Iteration • Bellman equation: Q(s_t, a_t) = r(s_t, a_t) + γ·E[max_{a_{t+1}} Q(s_{t+1}, a_{t+1})] • Iterate to convergence (tabular sketch below) • Final policy is π(s) = argmax_a Q(s, a)
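A tabular sketch of value iteration, assuming known dynamics and reward tables in the `P[s, a, s']` / `R[s, a]` layout of the MDP sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Iterate the Bellman backup
    Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) * max_a' Q(s',a')
    to convergence, given known tables P[s,a,s'] and R[s,a]."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V_next = Q.max(axis=1)          # max_{a'} Q(s', a') for each s'
        Q_new = R + gamma * P @ V_next  # Bellman backup for all (s, a) at once
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```

The final policy is then the argmax over actions, as in the greedy-policy sketch above.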
  12. Representing Value • How can Q(s, a) be represented for a large state space? • Approximate Q with deep representations • A Deep Q Net generalizes to unseen states
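A minimal sketch of a deep Q representation, assuming PyTorch; the layer sizes and two-hidden-layer architecture are arbitrary choices, since the slide names no specific network.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximate Q(s, .): map a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)   # shape: (batch, n_actions)

# Greedy action for an unseen state: the network generalizes across states
q = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)
action = q(state).argmax(dim=1)
```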
  13. Policy-Based RL • Represent π(a | s) with a deep network • Iteratively evaluate the policy's value and improve the policy against it (evaluate ↔ improve loop)
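One concrete instance of this evaluate-and-improve loop is REINFORCE, which the slides do not name explicitly; a minimal sketch, assuming PyTorch and a softmax policy network:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Represent pi(a | s) with a deep network: state -> action probabilities."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

def reinforce_step(policy, optimizer, states, actions, returns):
    """One evaluate-and-improve step (REINFORCE): raise the log-probability
    of each taken action in proportion to the return that followed it."""
    probs = policy(states)                                  # (T, n_actions)
    log_p = torch.log(probs.gather(1, actions[:, None]))[:, 0]
    loss = -(log_p * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```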
  14. RISECamp, 7 Sep 2017: Questions?