Alborz

    Alborz - Presentation Transcript

    1. Incremental Least-Squares Temporal Difference Learning Alborz Geramifard September, 2007 alborz@cs.ualberta.ca
    3. Acknowledgement Richard Sutton Michael Bowling Martin Zinkevich
    4. Agencia
    5. Agencia Generous Wise Kind Forgetful
    6. Agencia Envirocia
    7. Agencia Agriculture Politics Research ... Envirocia
    8. The Ambassador of Envirocia
    9. The Vazir of Agencia The Ambassador of Envirocia
    10. The Vazir of Agencia The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
    11. The Vazir of Agencia Having consultants for various subjects might be the answer ... The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
    12. Consultants
    14. Ambassador
    15. Vazir Ambassador
    16. Vazir Ambassador “I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from the majesty ...”
    17. Agencia Envirocia
    18. Reinforcement Learning Framework Observation Agent Action Reward Environment
    19. Observation Agent Action Reward Environment
    21. Observation Agent Action Reward Environment Goal
    23. Observation Agent Action Reward Environment Goal: More Complicated State Space, More Complicated Tasks
    24. Summary
    25. Data Efficiency Speed Summary
    26. Summary Data Efficiency Part I: King Speed
    27. Summary Part II: King + Consultants Data Efficiency Part I: King Speed
    28. Summary Part II: King + Consultants Data Efficiency ? Part I: King Speed
    29. Summary LSTD Data Efficiency TD Speed
    30. Summary LSTD Data Efficiency iLSTD TD Speed
    31. Contributions
    32. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
    35. Outline
    36. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
    40. Markov Decision Process: $S, A, P^a_{ss'}, R^a_{ss'}, \gamma$. [Figure: example MDP with states B.S., M.S., and Ph.D., actions Working and Studying, and transitions labeled with a probability and a reward, such as 100%,+60.] Sample trajectory: B.S., Working, +60, B.S., Studying, -50, M.S., ...
    41. Policy Evaluation
    42. Policy Evaluation Policy Improvement
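    45. Notation: scalars in regular type ($r_{t+1}$, $V^\pi(s)$); vectors in bold lower case ($\boldsymbol{\phi}(s)$, $\boldsymbol{\mu}(\boldsymbol{\theta})$); matrices in bold upper case ($\tilde{\mathbf{A}}_t$, $\mathbf{A}$)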
    46. Policy Evaluation: $V^\pi(s) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s, \pi\right]$
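The quantity inside the expectation is the discounted return of one trajectory. A minimal sketch, assuming a finite reward sequence; the function name `discounted_return` and the reward values are illustrative and not from the talk:

```python
def discounted_return(rewards, gamma):
    """Sum gamma^(t-1) * r_t for t = 1..T over one finite reward sequence."""
    g = 0.0
    # Work backwards so each step is g = r + gamma * g (Horner's rule).
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative rewards only; averaging such returns over many
# trajectories from the same start state would estimate V^pi(s).
g = discounted_return([60.0, -50.0, 85.0], gamma=0.5)
```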
    47. Linear Function Approximation: $V(s) = \theta \cdot \phi(s) = \sum_{i=1}^{n} \theta_i \phi_i(s)$
    51. Sparsity of features: only $k$ features are active at any given moment, with $k \ll n$. Acrobot [Sutton 96]: 48 << 18,648. Card game [Bowling et al. 02]: 3 << 10^6. Keep away soccer [Stone et al. 05]: 416 << 10^4.
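Sparsity is what makes the linear value estimate cheap: only the active features need to be touched. A small sketch, assuming the feature vector is stored as a dictionary from feature index to value (a representation chosen here for illustration, not taken from the slides):

```python
def value(theta, active):
    """Return theta . phi(s), where phi(s) is given as {feature index: value}.

    The cost is O(k) in the number of active features, not O(n).
    """
    return sum(theta[i] * x for i, x in active.items())

theta = [0.5, -1.0, 2.0, 0.0, 3.0]   # n = 5 weights
phi_s = {1: 1.0, 4: 0.5}             # k = 2 active features
v = value(theta, phi_s)
```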
    52. Temporal Difference Learning TD(0)
    55. Temporal Difference Learning, TD(0) [Sutton 88]: transition $s_t \xrightarrow{a,\,r} s_{t+1}$ under policy $\pi$. Tabular representation: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, $V_{t+1}(s_t) = V_t(s_t) + \alpha_t \delta_t$. Linear function approximation: $\theta_{t+1} = \theta_t + \alpha_t \phi(s_t) \delta_t(V_\theta)$
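The linear TD(0) update can be sketched in a few lines. This is an illustration under assumptions: sparse feature vectors are dictionaries from index to value, and the name `td0_update` and the sample transition are hypothetical.

```python
def td0_update(theta, phi_s, phi_next, r, gamma, alpha):
    """Apply one linear TD(0) step to theta in place and return the TD error."""
    v_s = sum(theta[i] * x for i, x in phi_s.items())
    v_next = sum(theta[i] * x for i, x in phi_next.items())
    delta = r + gamma * v_next - v_s          # TD error
    for i, x in phi_s.items():                # only active features change
        theta[i] += alpha * x * delta
    return delta

theta = [0.0, 0.0, 0.0]
delta = td0_update(theta, {0: 1.0}, {1: 1.0}, r=-3.0, gamma=1.0, alpha=0.1)
```

Each step reads and writes only the active features, which is where the O(k) per-step cost comes from.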
    57. TD(0) Properties: computational complexity is $O(k)$ per time step (constant); data inefficient (uses only the last transition).
    63. Least-Squares TD (LSTD) [Bradtke, Barto 96]: sum of TD updates. $\mu_t(\theta) = \sum_{i=1}^{t} \phi_i \delta_i(V_\theta) = \sum_{i=1}^{t} \phi_i r_{i+1} - \sum_{i=1}^{t} \phi_i (\phi_i - \gamma \phi_{i+1})^T \theta = b_t - A_t \theta$, where $b_t = \sum_{i=1}^{t} \phi_i r_{i+1}$ and $A_t = \sum_{i=1}^{t} \phi_i (\phi_i - \gamma \phi_{i+1})^T$. Setting $\mu_t(\theta) = 0$ gives $\theta = A^{-1} b$.
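As a rough illustration of the batch solution $\theta = A^{-1} b$, the sketch below accumulates A and b from a list of transitions and solves the linear system with plain Gauss-Jordan elimination. The dense list-of-lists representation, the helper names `lstd` and `solve`, and the two-state example are assumptions made for this sketch.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular; more data is needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def lstd(transitions, n, gamma):
    """Accumulate A and b over (phi, r, phi_next) tuples, then solve for theta."""
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for phi, r, phi_next in transitions:
        d = [p - gamma * q for p, q in zip(phi, phi_next)]
        for i in range(n):
            b[i] += phi[i] * r
            for j in range(n):
                A[i][j] += phi[i] * d[j]
    return solve(A, b)

# Hypothetical two-state chain with one-hot features; the terminal
# state is represented by the zero feature vector.
data = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
theta = lstd(data, n=2, gamma=0.9)
```

The outer-product accumulation and the solve are what make this approach quadratic or worse in n, in contrast to the O(k) cost of TD(0).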
    65. LSTD Properties: computational complexity is $O(n^2)$ per time step (quadratic); data efficient (looks through all data).
    68. The New Approach
    71. The New Approach: the A and b matrices change on each iteration. $\mu_t(\theta) = \sum_{i=1}^{t} \phi_i r_{i+1} - \sum_{i=1}^{t} \phi_i (\phi_i - \gamma \phi_{i+1})^T \theta = b_t - A_t \theta$. Incremental form: $b_t = b_{t-1} + \Delta b_t$ with $\Delta b_t = r_t \phi_t$, and $A_t = A_{t-1} + \Delta A_t$ with $\Delta A_t = \phi_t (\phi_t - \gamma \phi_{t+1})^T$.
    76. Incremental LSTD [Geramifard, Bowling, Sutton 06]: $\mu_t(\theta) = b_t - A_t \theta$. For fixed $\theta$: $\mu_t(\theta) = \mu_{t-1}(\theta) + \Delta b_t - (\Delta A_t)\theta$. For fixed $A$ and $b$: $\mu_t(\theta_{t+1}) = \mu_t(\theta_t) - A_t(\Delta\theta_t)$.
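Both update rules follow directly from $\mu_t(\theta) = b_t - A_t\theta$. The short derivation below is a worked restatement, assuming $b_t = b_{t-1} + \Delta b_t$, $A_t = A_{t-1} + \Delta A_t$, and $\theta_{t+1} = \theta_t + \Delta\theta_t$:

```latex
\begin{align*}
\mu_t(\theta)
  &= (b_{t-1} + \Delta b_t) - (A_{t-1} + \Delta A_t)\,\theta \\
  &= \underbrace{b_{t-1} - A_{t-1}\theta}_{\mu_{t-1}(\theta)}
     + \Delta b_t - (\Delta A_t)\,\theta
  && \text{(fixed } \theta\text{)} \\[4pt]
\mu_t(\theta_{t+1})
  &= b_t - A_t(\theta_t + \Delta\theta_t) \\
  &= \mu_t(\theta_t) - A_t\,\Delta\theta_t
  && \text{(fixed } A_t,\ b_t\text{)}
\end{align*}
```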
    87. iLSTD: $\mu_t(\theta_{t+1}) = \mu_t(\theta_t) - A_t(\Delta\theta_t)$. How to change $\theta$? Descent in the direction of $\mu_t(\theta)$? $O(n^2)$
    88. iLSTD Algorithm (Algorithm 5: iLSTD)
        s ← s0, A ← 0, µ ← 0, t ← 0
        Initialize θ arbitrarily
        repeat
            Take action according to π and observe r, s'
            t ← t + 1
            ∆b ← φ(s) r
            ∆A ← φ(s)(φ(s) − γφ(s'))^T
            A ← A + ∆A
            µ ← µ + ∆b − (∆A)θ
            for i from 1 to m do
                j ← choose an index of µ using a dimension selection mechanism
                θ_j ← θ_j + α µ_j
                µ ← µ − α µ_j A e_j
            end for
            s ← s'
        end repeat
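One pass through the loop body of the algorithm can be sketched in Python as follows. This is an illustrative version under assumptions: dense lists stand in for the sparse matrix representation, non-zero random selection is used as the dimension selection mechanism, and the name `ilstd_step` and the two-state example are hypothetical.

```python
import random

def ilstd_step(A, mu, theta, phi, r, phi_next, gamma, alpha, m, rng):
    """Process one transition: update A and mu, then take m single-dimension steps."""
    n = len(theta)
    d = [p - gamma * q for p, q in zip(phi, phi_next)]    # phi(s) - gamma * phi(s')
    d_theta = sum(dj * tj for dj, tj in zip(d, theta))    # (phi - gamma phi')^T theta
    for i in range(n):
        if phi[i] != 0.0:
            for j in range(n):
                A[i][j] += phi[i] * d[j]                  # A <- A + dA
            mu[i] += phi[i] * r - phi[i] * d_theta        # mu <- mu + db - (dA) theta
    for _ in range(m):
        nonzero = [j for j in range(n) if mu[j] != 0.0]
        if not nonzero:
            break
        j = rng.choice(nonzero)                           # non-zero random selection
        step = alpha * mu[j]
        theta[j] += step                                  # theta_j <- theta_j + alpha mu_j
        for i in range(n):
            mu[i] -= step * A[i][j]                       # mu <- mu - alpha mu_j A e_j

rng = random.Random(0)
n, gamma = 2, 0.9
A = [[0.0] * n for _ in range(n)]
mu = [0.0] * n
theta = [0.0] * n
# Hypothetical two-state episode with one-hot features and a zero terminal vector.
episode = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
for _ in range(50):
    for phi, r, phi_next in episode:
        ilstd_step(A, mu, theta, phi, r, phi_next, gamma, alpha=0.01, m=1, rng=rng)
```

Throughout the run, `mu` stays equal to b − Aθ for the accumulated data, which is the invariant the incremental updates are designed to maintain.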
    95. iLSTD: per-time-step computational complexity is $O(mn + k^2)$ (linear), where $m$ is the number of iterations per time step, $n$ is the number of features, and $k$ is the maximum number of active features. More data efficient than TD.
    96. iLSTD Theorem : iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µt is non-zero are selected in the limit an infinite number of times.
    97. Empirical Results
    99. Settings: averaged over 30 runs; same random seed for all methods; sparse matrix representation. iLSTD: non-zero random selection, one descent per iteration.
    103. Boyan Chain [Boyan 99]: n = 4 (Small), n = 25 (Medium), n = 100 (Large); k = 2. [Figure: chain of states N, ..., 5, 4, 3, 2, 1, 0 with transition rewards and the feature values of each state.]
    104. Small Boyan Chain [Figure: two log-scale plots of the RMS of V(s) over all states for TD, iLSTD, and LSTD, one against episodes (0 to 1000) and one against time in seconds (0 to 3).]
    106. Medium Boyan Chain [Figure: two log-scale plots of the RMS of V(s) over all states for TD, iLSTD, and LSTD, one against episodes (0 to 1000) and one against time in seconds (0 to 20).]
    108. Large Boyan Chain [Figure: two log-scale plots of the RMS of V(s) over all states for TD, iLSTD, and LSTD, one against episodes (0 to 1000) and one against time in seconds (0 to 100).]
    112. Mountain Car: tile coding, n = 10,000, k = 10. Position = -1 (Easy), Position = -.5 (Hard). [For details see RL-Library]
    113. Easy Mountain Car [Figure: log-scale plots of the loss for TD, iLSTD, and LSTD against episodes (0 to 1000) and against time in seconds (0 to 350).] Loss function: $\text{Loss} = \|b^* - A^*\theta\|_2$
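The loss on this slide is the Euclidean norm of the residual b* − A*θ. A minimal sketch of that computation, assuming A* and b* are already available as dense lists (how they were obtained is not stated in the transcript); the name `loss` and the sample values are illustrative:

```python
import math

def loss(A_star, b_star, theta):
    """Return the 2-norm of b* - A* theta for dense list inputs."""
    residual = [
        bi - sum(a * t for a, t in zip(row, theta))
        for row, bi in zip(A_star, b_star)
    ]
    return math.sqrt(sum(x * x for x in residual))

# Illustrative values only.
value_at_zero = loss([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0], [0.0, 0.0])
```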
    114. Hard Mountain Car [Figure: log-scale plots of the loss for TD, iLSTD, and LSTD against episodes (0 to 1000) and against time in seconds (0 to 1200).]
    116. Running Time [Figure: bar chart of time (ms) for TD, iLSTD, and LSTD on the Boyan Chain and Mountain Car problems, annotated Constant, Linear, and Exponential.]
    120. Dimension Selection Random ✓
    121. Dimension Selection Random ✓ Greedy
    122. Dimension Selection Random ✓ Greedy ε-Greedy
    123. Dimension Selection Random ✓ Greedy ε-Greedy Boltzmann
    124. Greedy Dimension Selection Pick the one with highest value of |µt (i)|
    125. Greedy Dimension Selection Pick the one with highest value of |µt (i)| Not proven to converge.
    126. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy
    127. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy Convergence proof applies.
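The random, greedy, and ε-greedy mechanisms can be sketched as small index-choosing functions over µ. The function names and the fallback to greedy when every component is zero are assumptions made for this illustration.

```python
import random

def nonzero_random(mu, rng):
    """Pick uniformly among indices whose mu component is non-zero."""
    candidates = [j for j, v in enumerate(mu) if v != 0.0]
    return rng.choice(candidates) if candidates else None

def greedy(mu):
    """Pick the index with the largest |mu_j|."""
    return max(range(len(mu)), key=lambda j: abs(mu[j]))

def epsilon_greedy(mu, epsilon, rng):
    """With probability epsilon use non-zero random selection, else greedy."""
    if rng.random() < epsilon:
        j = nonzero_random(mu, rng)
        if j is not None:
            return j
    return greedy(mu)

rng = random.Random(0)
mu = [0.1, -5.0, 2.0]
j = epsilon_greedy(mu, epsilon=0.1, rng=rng)
```

Because the random branch is taken with positive probability, every non-zero dimension keeps being selected, which matches the condition the convergence theorem asks of a selection method.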
    128. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random
    130. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random ψ×m
    131. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m
    132. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m Convergence proof applies.
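The transcript does not spell out the distribution, so the sketch below makes an explicit assumption: index j is drawn with probability proportional to exp(|µ_j| / τ). The role of ψ and the mixing with non-zero random selection are not modeled here, and the name `boltzmann_index` is hypothetical.

```python
import math
import random

def boltzmann_index(mu, tau, rng):
    """Sample an index with weight exp(|mu_j| / tau) (assumed form)."""
    top = max(abs(v) for v in mu)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp((abs(v) - top) / tau) for v in mu]
    return rng.choices(range(len(mu)), weights=weights, k=1)[0]

rng = random.Random(0)
j = boltzmann_index([0.0, 1000.0], tau=1.0, rng=rng)
```

A small τ makes the choice close to greedy, while a large τ makes it close to uniform.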
    133. Empirical Results ε-Greedy: ε =.1 Boltzmann: ψ = 10^-9, τ = 1
    134. Empirical Results: ε-Greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1. [Figure: log-scale bar chart of the error measure for Random, Greedy, ε-Greedy, and Boltzmann selection on Boyan Chain (Small, Medium, Large) and Mountain Car (Easy, Hard).]
    137. Running Time [Figure: bar chart of clock time per step (ms) for Random, Greedy, Boltzmann, and ε-Greedy selection on Boyan chain (Small, Medium, Large) and Mountain car (Easy, Hard).]
    140. Conclusion
    141. Conclusion LSTD Data Efficiency TD Speed
    142. Conclusion LSTD Data Efficiency iLSTD TD Speed
    146. Conclusion No learning rate! LSTD Data Efficiency iLSTD TD Speed
    147. Questions ?
    148. Questions ...
    149. Questions ... What if someone uses batch-LSTD?
    150. Questions ... What if someone uses batch-LSTD? Why does iLSTD take simple descent?
    151. Questions ... What if someone uses batch-LSTD? Why does iLSTD take simple descent? Hmm ... What about control?
    152. Thanks ...

    Knowledge Diffusion Network
    Visiting Lecturer Progr
