# Between winning slow and losing fast.


Published: BMAC presentation, Department of Computer Science, Colorado State University, Feb. 08, 2010.



1. Winning slow, losing fast, and in between. Reinaldo A. Uribe Muriel. Colorado State University, Prof. C. Anderson; Oita University, Prof. K. Shibata; Universidad de Los Andes, Prof. F. Lozano. February 8, 2010.
2. It’s all fun and games until someone proves a theorem. Outline: 1. Fun and games. 2. A theorem. 3. An algorithm.
3. A game: Snakes & Ladders. Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/) The player advances the number of steps indicated by a die. Landing on a snake’s mouth sends the player back to the tail. Landing on a ladder’s bottom moves the player forward to the top. Goal: reaching state 100.
4. Boring! (No skill required, only luck.)
5. Variation: Decision Snakes and Ladders. Sets of “win” and “loss” terminal states. Actions: either “advance” or “retreat,” to be decided before throwing the die.
6. Reinforcement Learning: finding the optimal policy. “Natural” rewards: ±1 on “win”/“loss”, 0 otherwise. The optimal policy maximizes total expected reward. Dynamic programming quickly finds the optimal policy. Probability of winning: pw = 0.97222…
7. But...
8. Claim: It is not always desirable to find the optimal policy for that problem.
9. Hint: the mean episode length of the optimal policy is d = 84.58333 steps.
10. Optimal policy revisited. Seek winning.
12. Optimal policy revisited. Seek winning. Avoid losing.
13. Optimal policy revisited. Seek winning. Avoid losing. Stay safe.
17. A simple, yet powerful idea. Introduce a step-punishment term −r_step so the agent has an incentive to terminate faster.
18. At time t,
    r(t) = +1 − r_step on “win”; −1 − r_step on “loss”; −r_step otherwise.
19. Origin: maze rewards, −1 except on termination. Problem: r_step = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
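As a minimal sketch, the shifted reward can be written out directly; the string outcome labels and the function signature are my assumptions, not part of the original formulation:

```python
def reward(outcome, r_step):
    """Per-step reward with a step-punishment term r_step.

    outcome: "win" or "loss" on termination, anything else mid-episode.
    """
    if outcome == "win":
        return 1.0 - r_step   # terminal +1, shifted down by r_step
    if outcome == "loss":
        return -1.0 - r_step  # terminal -1, shifted down by r_step
    return -r_step            # every non-terminal step costs r_step
```

With r_step = 0 this reduces to the “natural” ±1 rewards; with the terminal rewards dropped it reduces to the classic maze scheme.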
20. Better than optimal? Optimal policy for r_step = 0.
21. Optimal policy for r_step = 0.08701.
22. pw = 0.48673 (was 0.97222; now 50.06% of it). d = 11.17627 (was 84.58333; now 13.21% of it).
23. This policy maximizes pw/d.
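The claimed improvement in the pw/d ratio can be checked directly from the numbers quoted on the slides:

```python
# Figures quoted on the slides for Decision Snakes and Ladders.
pw_old, d_old = 0.97222, 84.58333   # value-optimal policy, r_step = 0
pw_new, d_new = 0.48673, 11.17627   # optimal policy for r_step = 0.08701

# The new policy wins about half as often but terminates much faster,
# so its winning rate per step is nearly four times higher.
assert pw_new / d_new > pw_old / d_old
```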
24. Chess: White wins*. Uribe Muriel, Journal of Fabricated Results, Vol. 06, No. 8, 2010.
25. *In 10^8 ply.
26. Such a game visits only about the fifth root of the total number of valid states¹, but, if a ply takes one second, an average game will last three years and two months. ¹Shannon, 1950.
27. Certainly unlikely to be the case, but finding policies of maximum winning probability does in fact remain the usual goal in RL.
28. A discount factor γ, used to ensure values are finite, has an effect on episode length, but the effect is unpredictable and suboptimal (for the pw/d problem).
29. Main result. For a general ±1-rewarded problem, there exists an r*_step for which the value-optimal solution maximizes pw/d and the value of the initial state is −1:
    ∃ r*_step such that π* = argmax_{π∈Π} v = argmax_{π∈Π} pw/d, with v(s0) = v* = −1.
30. Stating the obvious. Every policy has a mean episode length d ≥ 1 and a probability of winning 0 ≤ pw ≤ 1.
31. v = 2pw − 1 − r_step·d
32. (Lemma: extensible to vectors using indicator variables.)
33. The proof rests on a solid foundation of duh!
34. Key substitution: the w−l space. w = pw/d, l = (1 − pw)/d.
35. Each policy is represented by a unique point in the w−l plane.
36. The policy cloud is bounded by the triangle with vertices (1, 0), (0, 1), and (0, 0).
37. Execution and speed in the w−l space. Winning probability: pw = w/(w + l). Mean episode length: d = 1/(w + l).
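A small sketch of the change of variables (the function names are mine, assuming each policy's pw and d are known):

```python
def to_wl(pw, d):
    """Map (winning probability, mean episode length) to the w-l plane."""
    return pw / d, (1.0 - pw) / d

def from_wl(w, l):
    """Invert the substitution: w + l = 1/d, so d = 1/(w + l)."""
    return w / (w + l), 1.0 / (w + l)
```

The two maps are inverses, so a policy's (pw, d) pair survives a round trip through the w−l plane.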
38. Proof outline: value in the w−l space. v = (w − l − r_step)/(w + l).
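This expression agrees term by term with v = 2pw − 1 − r_step·d from slide 31, since w + l = 1/d and w − l = (2pw − 1)/d. A quick numeric check, using the figures quoted earlier:

```python
def value_pw_d(pw, d, r_step):
    # Value as a function of winning probability and mean episode length.
    return 2.0 * pw - 1.0 - r_step * d

def value_wl(w, l, r_step):
    # The same value, written in w-l coordinates.
    return (w - l - r_step) / (w + l)

pw, d, r_step = 0.48673, 11.17627, 0.08701
w, l = pw / d, (1.0 - pw) / d

# Both forms give the same number, which is close to -1 here,
# consistent with the main result v(s0) = -1 at the optimum.
v1 = value_pw_d(pw, d, r_step)
v2 = value_wl(w, l, r_step)
```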
39. So... All value level sets intersect at the same point, (r_step, −r_step). There is a one-to-one relationship between values and slopes. The value (for all r_step), mean-episode-length, and winning-probability level sets are lines. Optimal policies lie on the convex hull of the policy cloud.
40. And done! π* = argmax_π pw/d = argmax_π w (vertical level sets). When v_t ≈ −1, we’re there.
41. Algorithm.
    Set ε; initialize π0; r_step ← 0.
    Repeat:
        find π+ and v+_π (solving from π0 by any RL method);
        r_step ← r_step + μ[v+_π(s0) + 1];
        π0 ← π+;
    until |v+_π(s0) + 1| < ε.
42. On termination, π+ ≈ π*. The r_step update uses a learning rate μ > 0.
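A runnable sketch of the loop on a made-up, finite “policy cloud”, where each hypothetical policy is summarized by its (pw, d) pair and the inner RL solve is replaced by brute-force value maximization; all names and numbers below are illustrative assumptions, not from the talk:

```python
# Toy policy cloud: policy name -> (winning probability, mean episode length).
policies = {
    "reckless": (0.40, 5.0),    # low pw, terminates fast
    "cautious": (0.97, 85.0),   # high pw, very slow
    "balanced": (0.60, 12.0),
}

def value(pw, d, r_step):
    # v = 2*pw - 1 - r_step * d
    return 2.0 * pw - 1.0 - r_step * d

def solve(r_step):
    # Stand-in for the inner RL solve: the value-optimal policy at this r_step.
    return max(policies, key=lambda p: value(*policies[p], r_step))

eps, mu, r_step = 1e-6, 0.01, 0.0
for _ in range(100_000):
    best = solve(r_step)
    v0 = value(*policies[best], r_step)
    if abs(v0 + 1.0) < eps:          # stop once v(s0) is within eps of -1
        break
    r_step += mu * (v0 + 1.0)        # learning-rate update from slide 42

# The fixed point selects the policy of maximum pw/d over the cloud.
ratio = lambda p: policies[p][0] / policies[p][1]
assert best == max(policies, key=ratio)
```

At the fixed point, v = 2pw − 1 − r_step·d = −1 gives r_step = 2pw/d for the winning-rate-optimal policy, matching the termination test.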
43. Optimal r_step update. Minimizing the interval of r_step uncertainty in the next iteration requires solving a minmax problem: either the root of an 8th-degree polynomial in r_step, or the zero of the difference of two rational functions of order 4 (easy using the secant method). O(log(1/ε)) complexity.
44. Extensions. Problems solvable through a similar method:
    Convex (linear) tradeoff: π* = argmax_{π∈Π} {α·pw − (1 − α)·d}.
    Greedy tradeoff: π* = argmax_{π∈Π} (2pw − 1)/d.
    Arbitrary tradeoffs: π* = argmax_{π∈Π} (α·pw − β)/d.
    Asymmetric rewards: r_win = a, r_loss = −b; a, b ≥ 0.
    Games with tie outcomes.
    Games with multiple win/loss rewards.
45. A harder family of problems: maximize the probability of having won before n steps / m episodes. Why? Non-linear level sets / non-convex functions in the w−l space.
46. Outline of future research: towards robustness. Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.
47. Defining policy neighbourhoods.
48. 1. Continuous/discrete statewise action neighbourhoods. 2. Discrete policy neighbourhoods for structured tasks. 3. General policy neighbourhoods.
49. Feature-robustness.
50. 1. Value/speed/execution neighbourhoods in the w−l space. 2. Robustness as a trading-off of features.
51. Can traditional Reinforcement Learning methods still be used to handle the learning?
52. Thank you. muriel@cs.colostate.edu / r-uribe@uniandes.edu.co. Untitled, by Li Wei, School of Design, Oita University, 2009.