### RISECamp 2017: Reinforcement Learning Concepts

1. Roy Fox, RISECamp, 7 Sep 2017: Reinforcement Learning Concepts
2. Sequential Decision Making (figure: nonsequential vs. sequential decision making)
3. Example: Maze Navigation • State: agent location • Action: where to move • Reward: prize for reaching the target, cost for hitting a wall (figure: maze annotated with state s_t, action a_t, reward r_t)
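
To make the maze example concrete, here is a minimal gridworld sketch in Python. The grid layout, the +1.0 target prize, and the -0.1 wall cost are illustrative assumptions, not values from the talk:

```python
# Minimal maze environment sketch: state is the agent's location,
# actions are moves, reward is a prize at the goal and a cost for walls.
WALL, FREE, GOAL = "#", " ", "G"
GRID = ["#####",
        "#  G#",
        "# # #",
        "#   #",
        "#####"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class MazeEnv:
    def reset(self):
        self.pos = (3, 1)          # state: the agent's location
        return self.pos

    def step(self, action):
        r, c = self.pos
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if GRID[nr][nc] == WALL:   # cost for hitting a wall; stay in place
            return self.pos, -0.1, False
        self.pos = (nr, nc)
        if GRID[nr][nc] == GOAL:   # prize for reaching the target
            return self.pos, 1.0, True
        return self.pos, 0.0, False
```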
4. Markov Decision Process (MDP) • Trajectory: sequence of states, actions, and rewards: s_0, a_0, r_0, s_1, a_1, r_1, s_2, … • State dynamics: p(s_{t+1} | s_t, a_t) • Policy: deterministic a_t = π(s_t); stochastic π(a_t | s_t) • Reward: r_t = r(s_t, a_t)
5. "Rolling Out" • Environment • Reset() → get initial state s_0 • Step(a_t) → get reward r(s_t, a_t), draw next state s_{t+1} ~ p(s_{t+1} | s_t, a_t) • Agent policy • Action(s_t) → draw next action a_t ~ π(a_t | s_t)
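
The Reset/Step/Action interface on this slide maps directly to a rollout loop. A minimal sketch (the lowercase Gym-style method names and the random stand-in policy are assumptions; any environment with this interface works, e.g. the MazeEnv above):

```python
import random

class RandomPolicy:
    """Stand-in for π(a_t | s_t): pick uniformly among actions."""
    def __init__(self, actions):
        self.actions = actions

    def action(self, state):
        return random.choice(self.actions)   # draw a_t ~ π(a_t | s_t)

def rollout(env, policy, max_steps=100):
    trajectory = []
    s = env.reset()                          # get initial state s_0
    for _ in range(max_steps):
        a = policy.action(s)                 # agent draws next action
        s_next, r, done = env.step(a)        # env returns reward, draws next state
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory                        # s_0, a_0, r_0, s_1, a_1, r_1, ...

# Usage with the maze sketch above:
# traj = rollout(MazeEnv(), RandomPolicy(list(MOVES)))
```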
6. Example: Game of Go • State: board • Action: place stone • Reward: captures • Environment: can be simulated
7. Example: Autonomous Vehicle • State: cameras, GPS • Action: steer, accelerate • Reward: speed, constraint satisfaction • Environment: physical
8. Example: Surgical Robot • State: endoscope image, joint angles • Action: change in joint angles • Reward: task success • Environment: physical
9. Policy Evaluation • Return: R = r_0 + γ r_1 + γ² r_2 + ⋯ • Discount 0 ≤ γ ≤ 1: prefer early rewards, late costs • Policy value is its expected return: V_π = E[R] (figure: worked example computing R from a reward sequence)
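
A short sketch of computing the discounted return for a finite reward sequence (the rewards and γ = 0.9 here are made up):

```python
def discounted_return(rewards, gamma):
    """R = r_0 + γ r_1 + γ² r_2 + ⋯ for a finite reward list."""
    R = 0.0
    for r in reversed(rewards):   # fold from the back: R ← r_t + γ R
        R = r + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: a prize two steps away
```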
10. Value Function • The policy's value function Q_π(s, a) = E[R | s, a] is its expected return given the current state and action • The optimal policy satisfies π(s) = argmax_a Q_π(s, a)
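
The argmax in the last bullet is easy to picture with a tabular Q; a tiny sketch with made-up numbers (two states, two actions):

```python
import numpy as np

Q = np.array([[0.5, 1.2],    # Q(s0, a0), Q(s0, a1)
              [0.9, 0.3]])   # Q(s1, a0), Q(s1, a1)

greedy_policy = Q.argmax(axis=1)   # π(s) = argmax_a Q(s, a), per state
print(greedy_policy)               # [1 0]
```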
11. Value Iteration • Bellman equation: Q(s_t, a_t) = r(s_t, a_t) + γ E[max_{a_{t+1}} Q(s_{t+1}, a_{t+1})] • Iterate to convergence • Final policy is π(s) = argmax_a Q(s, a)
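
A sketch of tabular Q-value iteration on a tiny made-up MDP: iterate the Bellman backup until Q stops changing, then extract the greedy policy with the same argmax as in the previous sketch. The three-state dynamics and rewards are illustrative assumptions:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = p(s' | s, a) and R[s, a] = r(s, a); both made up here.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = P[0, 1, 2] = 1.0        # deterministic transitions from s0
P[1, :, 0] = P[2, :, 2] = 1.0        # s1 returns to s0; s2 absorbs
R = np.array([[0.0, 0.0],
              [1.0, 1.0],            # reward for acting in s1
              [0.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):                        # iterate to convergence
    Q_new = R + gamma * P @ Q.max(axis=1)    # Bellman backup
    converged = np.abs(Q_new - Q).max() < 1e-8
    Q = Q_new
    if converged:
        break

policy = Q.argmax(axis=1)            # final policy π(s) = argmax_a Q(s, a)
```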
12. Representing Value • How can we represent Q(s, a) for a large state space? • Approximate Q with deep representations • A deep Q-network generalizes to unseen states
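
A minimal sketch of the function-approximation idea: a small numpy network standing in for a deep Q-network, with one semi-gradient update toward the Bellman target. The layer sizes, learning rate, and γ are made-up illustration values, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, hidden = 4, 2, 16
W1 = rng.normal(0, 0.1, (hidden, state_dim))   # hidden layer weights
W2 = rng.normal(0, 0.1, (n_actions, hidden))   # one output per action

def q_values(s):
    h = np.maximum(0.0, W1 @ s)     # ReLU hidden layer
    return W2 @ h, h                # Q(s, ·) estimates and activations

def q_update(s, a, r, s_next, gamma=0.99, lr=0.01):
    """One semi-gradient step toward the Bellman target for (s, a, r, s')."""
    global W1, W2
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    target = r + gamma * q_next.max()          # Bellman target (held fixed)
    td_error = target - q[a]
    grad_q = np.zeros(n_actions); grad_q[a] = td_error
    grad_h = (W2.T @ grad_q) * (h > 0)         # backprop through the ReLU
    W2 += lr * np.outer(grad_q, h)             # descend the squared TD error
    W1 += lr * np.outer(grad_h, s)

# Synthetic transition just to show the update runs:
s, s2 = rng.normal(size=state_dim), rng.normal(size=state_dim)
q_update(s, a=0, r=1.0, s_next=s2)
```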
13. Policy-Based RL • Represent π(a|s) with a deep network • Iteratively evaluate and improve π (figure: evaluate/improve loop between policy and value)
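
One standard way to realize the evaluate/improve loop for a differentiable policy is a score-function gradient. The talk does not name a specific algorithm, so this REINFORCE-style sketch with a linear softmax policy is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions = 4, 2
theta = rng.normal(0, 0.1, (n_actions, state_dim))   # linear policy params

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()                        # π(· | s), a softmax

def reinforce_update(trajectory, lr=0.01, gamma=0.99):
    """Ascend an estimate of ∇E[R] from one (s, a, r) trajectory."""
    global theta
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                     # return from this step onward
        p = policy(s)
        grad_logits = -p
        grad_logits[a] += 1.0                 # ∇ log π(a|s) w.r.t. the logits
        theta += lr * G * np.outer(grad_logits, s)

# Synthetic one-step trajectory (feature-vector state) to show the update runs:
s = rng.normal(size=state_dim)
reinforce_update([(s, 0, 1.0)])
```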
14. RISECamp, 7 Sep 2017: Questions?