Alborz

Knowledge Diffusion Network
Visiting Lecturer Program (114)
Speaker: Alborz Geramifard
Ph.D. Candidate
Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning

Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran


  1. 1. Incremental Least-Squares Temporal Difference Learning Alborz Geramifard September, 2007 alborz@cs.ualberta.ca
  2. 2. Incremental Least-Squares Temporal Difference Learning Alborz Geramifard September, 2007 alborz@cs.ualberta.ca
  3. 3. Acknowledgement Richard Sutton Michael Bowling Martin Zinkevich
  4. 4. Agencia
  5. 5. Agencia Generous Wise Kind Forgetful
  6. 6. Agencia Envirocia
  7. 7. Agencia Agriculture Politics Research ... Envirocia
  8. 8. The Ambassador of Envirocia
  9. 9. The Vazir of Agencia The Ambassador of Envirocia
  10. 10. The Vazir of Agencia The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
  11. 11. The Vazir of Agencia Having consultants for various subjects might be the answer ... The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
  12. 12. Consultants
  13. 13. Consultants
  14. 14. Ambassador
  15. 15. Vazir Ambassador
  16. 16. Vazir Ambassador “I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ...”
  17. 17. Agencia Envirocia
  18. 18. Reinforcement Learning Framework Observation Agent Action Reward Environment
  19. 19. Observation Agent Action Reward Environment
  20. 20. Observation Agent Action Reward Environment
  21. 21. Observation Agent Action Reward Environment Goal
  22. 22. Observation Agent Action Reward Environment Goal
  23. 23. Observation Agent Action Reward Environment Goal: More Complicated State Space, More Complicated Tasks
  24. 24. Summary
  25. 25. Data Efficiency Speed Summary
  26. 26. Summary Data Efficiency Part I: King Speed
  27. 27. Summary Part II: King + Consultants Data Efficiency Part I: King Speed
  28. 28. Summary Part II: King + Consultants Data Efficiency ? Part I: King Speed
  29. 29. Summary LSTD Data Efficiency TD Speed
  30. 30. Summary LSTD Data Efficiency iLSTD TD Speed
  31. 31. Contributions
  32. 32. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  33. 33. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  34. 34. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  35. 35. Outline
  36. 36. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  37. 37. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  38. 38. Markov Decision Process ⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩
  39. 39. Markov Decision Process ⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩: example MDP over degree states (B.S., M.S., Ph.D.) with Working and Studying actions, transition probabilities, and rewards (diagram)
  40. 40. Markov Decision Process ⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩: sample trajectory B.S., Working, +60, B.S., Studying, -50, M.S., ...
  41. 41. Policy Evaluation
  42. 42. Policy Evaluation Policy Improvement
  43. 43. Policy Evaluation Policy Improvement
  44. 44. Policy Evaluation Policy Improvement
  45. 45. Notation: scalars in regular type (r_{t+1}, V^π(s)); vectors in bold lower case (φ(s), µ(θ)); matrices in bold upper case (A_t, Ã)
  46. 46. Policy Evaluation: V^π(s) = E[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_0 = s, π ]
  47. 47. Linear Function Approximation: V(s) = θ · φ(s) = Σ_{i=1}^{n} θ_i φ_i(s)
  48. 48. Sparsity of features. Sparsity: only k features are active at any given moment, k << n
  49. 49. Sparsity of features. Sparsity: only k features are active at any given moment, k << n. Acrobot [Sutton 96]: 48 << 18,648
  50. 50. Sparsity of features. Sparsity: only k features are active at any given moment, k << n. Acrobot [Sutton 96]: 48 << 18,648; Card game [Bowling et al. 02]: 3 << 10^6
  51. 51. Sparsity of features. Sparsity: only k features are active at any given moment, k << n. Acrobot [Sutton 96]: 48 << 18,648; Card game [Bowling et al. 02]: 3 << 10^6; Keepaway soccer [Stone et al. 05]: 416 << 10^4
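To make the sparse representation concrete, here is a minimal Python sketch (my own illustration, not from the talk): if a state is stored as the list of its k active binary feature indices, then V(s) = θ · φ(s) reduces to summing k entries of θ.

```python
import numpy as np

n = 100                      # total number of features
theta = np.zeros(n)          # weight vector

def value(active_indices, theta):
    """V(s) = theta . phi(s) for binary sparse features: only the k
    active components contribute, so this costs O(k) instead of O(n)."""
    return theta[active_indices].sum()

phi_s = [3, 57]              # a state with k = 2 active features
print(value(phi_s, theta))   # 0.0 before any learning
```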
  52. 52. Temporal Difference Learning TD(0)
  53. 53. Temporal Difference Learning TD(0): transition s_t →(a, r) s_{t+1} under π [Sutton 88]
  54. 54. Temporal Difference Learning TD(0), tabular representation: δ_t = r_t + γ V(s_{t+1}) − V(s_t); V_{t+1}(s_t) = V_t(s_t) + α_t δ_t [Sutton 88]
  55. 55. Temporal Difference Learning TD(0), linear function approximation: θ_{t+1} = θ_t + α_t φ(s_t) δ_t(V) [Sutton 88]
  56. 56. TD(0) Properties: computational complexity O(k) per time step; data inefficient (uses only the last transition)
  57. 57. TD(0) Properties: computational complexity O(k) per time step (constant); data inefficient (uses only the last transition)
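As an illustration of the O(k) per-step cost, a minimal TD(0) update with linear function approximation and binary sparse features might look like the sketch below (my own code, not the talk's; the index-list representation of φ is an assumption carried over from the sparsity example above).

```python
import numpy as np

def td0_step(theta, phi_s, phi_s_next, r, alpha, gamma):
    """One TD(0) update with linear function approximation:
    delta = r + gamma * V(s') - V(s), theta <- theta + alpha * delta * phi(s).
    phi_s / phi_s_next are lists of active (binary) feature indices,
    so only k weights are read and written per step."""
    delta = r + gamma * theta[phi_s_next].sum() - theta[phi_s].sum()
    theta[phi_s] += alpha * delta
    return theta

theta = np.zeros(100)
theta = td0_step(theta, phi_s=[3, 57], phi_s_next=[4, 58],
                 r=-3.0, alpha=0.1, gamma=1.0)
```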
  58. 58. Least-Squares TD (LSTD) Sum of TD updates
  59. 59. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates
  60. 60. Least-Squares TD (LSTD) [Bradtke, Barto 96]. Sum of TD updates: µ_t(θ) = Σ_{i=1}^{t} φ_i δ_i(V_θ)
  61. 61. Least-Squares TD (LSTD) [Bradtke, Barto 96]. µ_t(θ) = Σ_{i=1}^{t} φ_i r_{i+1} − [ Σ_{i=1}^{t} φ_i (φ_i − γφ_{i+1})^T ] θ, with b_t = Σ_{i=1}^{t} φ_i r_{i+1} and A_t = Σ_{i=1}^{t} φ_i (φ_i − γφ_{i+1})^T
  62. 62. Least-Squares TD (LSTD) [Bradtke, Barto 96]. µ_t(θ) = b_t − A_t θ
  63. 63. Least-Squares TD (LSTD) [Bradtke, Barto 96]. Setting µ_t(θ) = 0 gives θ = A^{−1} b
  64. 64. LSTD Properties: computational complexity O(n^2) per time step; data efficient (looks through all data)
  65. 65. LSTD Properties: computational complexity O(n^2) per time step (quadratic); data efficient (looks through all data)
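For comparison, a rough LSTD sketch (again my own illustration, with dense feature vectors) shows where the quadratic cost comes from: every transition adds a rank-one term to the n × n matrix A.

```python
import numpy as np

def lstd(transitions, n, gamma):
    """LSTD [Bradtke & Barto 96] sketch: accumulate
    A = sum_i phi_i (phi_i - gamma * phi_{i+1})^T  and  b = sum_i r_{i+1} phi_i,
    then solve A theta = b.  Each rank-one update of A is O(n^2); an online
    version keeps A^{-1} via Sherman-Morrison to stay O(n^2) per step.
    `transitions` is an iterable of (phi, r, phi_next) dense vectors;
    A is assumed invertible (implementations often add a small ridge term)."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)
```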
  66. 66. Outline
  67. 67. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  68. 68. The New Approach
  69. 69. The New Approach: A and b change on each iteration.
  70. 70. The New Approach: A and b change on each iteration. µ_t(θ) = Σ_{i=1}^{t} φ_i r_{i+1} − [ Σ_{i=1}^{t} φ_i (φ_i − γφ_{i+1})^T ] θ = b_t − A_t θ
  71. 71. The New Approach: incremental updates b_t = b_{t−1} + ∆b_t with ∆b_t = r_t φ_t, and A_t = A_{t−1} + ∆A_t with ∆A_t = φ_t (φ_t − γφ_{t+1})^T
  72. 72. Incremental LSTD µt (θ) = bt − At θ
  73. 73. Incremental LSTD µt (θ) = bt − At θ [Geramifard, Bowling, Sutton 06]
  74. 74. Incremental LSTD: µ_t(θ) = b_t − A_t θ. Two cases: fixed θ, and fixed A and b [Geramifard, Bowling, Sutton 06]
  75. 75. Incremental LSTD, fixed θ: µ_t(θ) = µ_{t−1}(θ) + ∆b_t − (∆A_t) θ [Geramifard, Bowling, Sutton 06]
  76. 76. Incremental LSTD, fixed A and b: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t) [Geramifard, Bowling, Sutton 06]
  77. 77. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
  78. 78. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ?
  79. 79. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  80. 80. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  81. 81. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  82. 82. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ), which has n dimensions?
  83. 83. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ)? O(n^2)
  84. 84. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ)? O(n^2)
  85. 85. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ)? O(n^2)
  86. 86. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ)? O(n^2)
  87. 87. iLSTD: µ_t(θ_{t+1}) = µ_t(θ_t) − A_t (∆θ_t). How to change θ? Descent in the direction of µ_t(θ)? O(n^2)
  88. 88. iLSTD Algorithm (Algorithm 5: iLSTD):
      s ← s_0, A ← 0, µ ← 0, t ← 0
      Initialize θ arbitrarily
      repeat
        Take action according to π and observe r, s'
        t ← t + 1
        ∆b ← φ(s) r
        ∆A ← φ(s) (φ(s) − γ φ(s'))^T
        A ← A + ∆A
        µ ← µ + ∆b − (∆A) θ
        for i from 1 to m do
          j ← choose an index of µ using a dimension selection mechanism
          θ_j ← θ_j + α µ_j
          µ ← µ − α µ_j A e_j
        end for
        s ← s'
      end repeat
  89. 89. iLSTD Algorithm (Algorithm 5: iLSTD)
  90. 90. iLSTD Algorithm (Algorithm 5: iLSTD)
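For readers who prefer code, the following Python sketch is my own transcription of Algorithm 5, not the authors' released implementation. It uses dense NumPy vectors and a non-zero random dimension-selection placeholder; a faithful implementation would exploit feature sparsity to reach the O(mn + k^2) per-step cost claimed on the next slides.

```python
import numpy as np

def ilstd_step(A, mu, theta, phi, phi_next, r, gamma, alpha, m, rng):
    """One time step of iLSTD, following Algorithm 5: fold the new transition
    into A and mu, then take m single-dimension descent steps on theta,
    keeping mu consistent with the updated theta."""
    delta_b = r * phi
    delta_A = np.outer(phi, phi - gamma * phi_next)
    A += delta_A
    mu += delta_b - delta_A @ theta
    for _ in range(m):
        nz = np.flatnonzero(mu)          # non-zero random dimension selection
        if nz.size == 0:
            break
        j = rng.choice(nz)
        step = alpha * mu[j]
        theta[j] += step                 # theta_j <- theta_j + alpha * mu_j
        mu -= step * A[:, j]             # mu <- mu - alpha * mu_j * (A e_j)
    return A, mu, theta
```

A caller would initialize A = np.zeros((n, n)), mu = np.zeros(n), theta = np.zeros(n), and rng = np.random.default_rng(0), invoke the function once per observed transition, and carry s ← s' in the surrounding loop.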
  91. 91. iLSTD: per-time-step computational complexity O(mn + k^2); more data efficient than TD
  92. 92. iLSTD: per-time-step computational complexity O(mn + k^2), where m is the number of iterations per time step; more data efficient than TD
  93. 93. iLSTD: per-time-step computational complexity O(mn + k^2), where m is the number of iterations per time step and n is the number of features; more data efficient than TD
  94. 94. iLSTD: per-time-step computational complexity O(mn + k^2), where m is the number of iterations per time step, n is the number of features, and k is the maximum number of active features; more data efficient than TD
  95. 95. iLSTD: per-time-step computational complexity O(mn + k^2) (linear), where m is the number of iterations per time step, n is the number of features, and k is the maximum number of active features; more data efficient than TD
  96. 96. iLSTD Theorem : iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µt is non-zero are selected in the limit an infinite number of times.
  97. 97. Empirical Results
  98. 98. Settings
  99. 99. Settings: averaged over 30 runs; same random seed for all methods; sparse matrix representation; iLSTD uses non-zero random selection and one descent per iteration
  100. 100. Boyan Chain
  101. 101. Boyan Chain: N-state chain with reward -3 per transition and -2 on the final transition into the terminal state; feature vectors interpolate linearly (values 1, 0.75, 0.5, 0.25, 0) between anchor states (diagram) [Boyan 99]
  102. 102. Boyan Chain variants: n = 4 (Small), n = 25 (Medium), n = 100 (Large) [Boyan 99]
  103. 103. Boyan Chain variants: n = 4 (Small), n = 25 (Medium), n = 100 (Large); k = 2 [Boyan 99]
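As a concrete reference for the domain, here is a minimal Python sketch of the Boyan chain as it is usually described (based on [Boyan 99] and the interpolated feature values visible on the slide; the exact way the talk scales the chain to n = 25 and n = 100 features is my assumption).

```python
import numpy as np

def boyan_features(s, n):
    """Feature vector for state s of a Boyan chain with n features.
    States run 0..4(n-1); every fourth state is an anchor with a unit
    feature vector, and states in between interpolate linearly between
    the two neighbouring anchors (values 0.25, 0.5, 0.75)."""
    phi = np.zeros(n)
    lo, d = divmod(s, 4)
    phi[lo] = 1.0 - d / 4.0
    if d:
        phi[lo + 1] = d / 4.0
    return phi

def boyan_step(s, rng):
    """One transition: from state s > 1 move to s-1 or s-2 with equal
    probability (reward -3); from state 1 move to terminal state 0 (reward -2)."""
    if s > 1:
        return s - rng.integers(1, 3), -3.0
    return 0, -2.0

rng = np.random.default_rng(0)
s = 4 * (4 - 1)                    # top state of the small (n = 4) chain
print(boyan_features(s, 4), boyan_step(s, rng))
```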
  104. 104. Small Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  105. 105. Small Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  106. 106. Medium Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  107. 107. Medium Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  108. 108. Large Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  109. 109. Large Boyan Chain: RMS of V(s) over all states vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  111. 111. Mountain Car
  112. 112. Mountain Car: tile coding, n = 10,000, k = 10; Position = -1 (Easy), Position = -0.5 (Hard) (diagram) [for details see RL-Library]
  113. 113. Easy Mountain Car: loss vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots); Loss = ||b* − A* θ||_2^2
  114. 114. Hard Mountain Car: loss vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  115. 115. Hard Mountain Car: loss vs. episodes and vs. time (s) for TD, iLSTD, and LSTD (plots)
  116. 116. Running Time (Constant / Linear / Exponential): per-time-step time (ms) for TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car domains (bar chart)
  117. 117. Running Time (Constant / Linear / Exponential): per-time-step time (ms) for TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car domains (bar chart)
  118. 118. Outline
  119. 119. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  120. 120. Dimension Selection Random ✓
  121. 121. Dimension Selection Random ✓ Greedy
  122. 122. Dimension Selection Random ✓ Greedy ε-Greedy
  123. 123. Dimension Selection Random ✓ Greedy ε-Greedy Boltzmann
  124. 124. Greedy Dimension Selection Pick the one with highest value of |µt (i)|
  125. 125. Greedy Dimension Selection Pick the one with highest value of |µt (i)| Not proven to converge.
  126. 126. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy
  127. 127. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy Convergence proof applies.
  128. 128. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random
  129. 129. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random
  130. 130. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random ψ×m
  131. 131. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m
  132. 132. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m Convergence proof applies.
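To make the selection rules concrete, here is a small Python sketch (my own illustration; eps and tau mirror the slides' ε and τ). Note that the slides additionally mix the Boltzmann rule with non-zero random selection (the ψ × m term), which this sketch omits.

```python
import numpy as np

def select_dimension(mu, method, rng, eps=0.1, tau=1.0):
    """Choose the dimension j of mu to descend on in iLSTD's inner loop.
    'random'    : uniform over the non-zero entries of mu
    'greedy'    : argmax_i |mu_i|            (not proven to converge)
    'e-greedy'  : greedy with prob. 1 - eps, non-zero random otherwise
    'boltzmann' : sample non-zero entries with prob. proportional to exp(|mu_i| / tau)"""
    nz = np.flatnonzero(mu)
    if method == "random":
        return rng.choice(nz)
    if method == "greedy":
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "e-greedy":
        if rng.random() < eps:
            return rng.choice(nz)
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "boltzmann":
        p = np.exp(np.abs(mu[nz]) / tau)
        return rng.choice(nz, p=p / p.sum())
    raise ValueError(method)
```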
  133. 133. Empirical Results ε-Greedy: ε =.1 Boltzmann: ψ = 10^-9, τ = 1
  134. 134. Empirical Results (ε-Greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1): error measure for Random, Greedy, ε-Greedy, and Boltzmann selection on Boyan Chain (Small, Medium, Large) and Mountain Car (Easy, Hard) (bar chart)
  135. 135. Empirical Results: error measure for Random, Greedy, ε-Greedy, and Boltzmann selection on Boyan Chain and Mountain Car (bar chart)
  136. 136. Empirical Results: error measure for Random, Greedy, ε-Greedy, and Boltzmann selection on Boyan Chain and Mountain Car (bar chart)
  137. 137. Running Time: clock time per step (ms) for Random, Greedy, Boltzmann, and ε-greedy selection on Boyan chain (Small, Medium, Large) and Mountain car (Easy, Hard) (bar chart)
  138. 138. Outline
  139. 139. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  140. 140. Conclusion
  141. 141. Conclusion LSTD Data Efficiency TD Speed
  142. 142. Conclusion LSTD Data Efficiency iLSTD TD Speed
  143. 143. Conclusion LSTD Data Efficiency iLSTD TD Speed
  144. 144. Conclusion LSTD Data Efficiency iLSTD TD Speed
  145. 145. Conclusion LSTD Data Efficiency iLSTD TD Speed
  146. 146. Conclusion No learning rate! LSTD Data Efficiency iLSTD TD Speed
  147. 147. Questions ?
  148. 148. Questions ...
  149. 149. Questions ... What if someone uses batch-LSTD?
  150. 150. Questions ... What if someone uses batch-LSTD? Why does iLSTD take simple descent steps?
  151. 151. Questions ... What if someone uses batch-LSTD? Why does iLSTD take simple descent steps? Hmm ... What about control?
  152. 152. Thanks ...
