Knowledge Diffusion Network
Visiting Lecturer Program (114)
Speaker: Alborz Geramifard, Ph.D. Candidate
Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning

Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran

Presentation Transcript

  • Incremental Least-Squares Temporal Difference Learning. Alborz Geramifard, September 2007, alborz@cs.ualberta.ca
  • Acknowledgements: Richard Sutton, Michael Bowling, Martin Zinkevich
  • Agencia: a kingdom whose king is generous, wise, and kind, but forgetful.
  • Agencia and its neighbor Envirocia: agriculture, politics, research, ...
  • The Vazir of Agencia meets the Ambassador of Envirocia. Ambassador: “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.” Vazir: having consultants for various subjects might be the answer ...
  • Consultants
  • The Vazir and the Ambassador meet again. Ambassador: “I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ...”
  • Agencia and Envirocia
  • Reinforcement Learning Framework. [Diagram: the agent receives an observation and a reward from the environment and sends back an action; a goal state is marked.] Scaling up means more complicated state spaces and more complicated tasks.
  • Summary (data efficiency vs. speed): Part I, the King alone, is fast but forgetful; Part II, the King + Consultants, is data efficient but slow. Can we have both?
  • Summary (data efficiency vs. speed): TD is fast, LSTD is data efficient, and iLSTD aims to combine both.
  • Contributions: iLSTD, a new policy evaluation algorithm; extension with eligibility traces; running time analysis; dimension selection methods; proof of convergence; empirical results.
  • Outline: Motivation, Introduction, The New Approach, Dimension Selection, Conclusion.
  • Markov Decision Process: $\langle S, A, P^a_{ss'}, R^a_{ss'}, \gamma \rangle$. [Diagram: an example MDP with states B.S., M.S., and Ph.D. and actions Working and Studying; edges carry transition probabilities and rewards, e.g. Working in B.S. gives 100%, +60.] A sample trajectory: B.S., Working, +60, B.S., Studying, -50, M.S., ...
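To make the tuple concrete, here is a minimal Python sketch of sampling a trajectory from an MDP given as a transition table. The table is hypothetical, only loosely modeled on the slide's B.S./M.S./Ph.D. diagram; its probabilities and rewards are illustrative, not the slide's exact numbers.

```python
# Sampling from a hypothetical MDP: P[state][action] -> (prob, reward, next).
import random

P = {
    "B.S.":  {"Working":  [(1.0, +60, "B.S.")],
              "Studying": [(0.7, -200, "M.S."), (0.3, -200, "B.S.")]},
    "M.S.":  {"Working":  [(1.0, +85, "M.S.")],
              "Studying": [(0.8, -50, "Ph.D."), (0.2, -50, "M.S.")]},
    "Ph.D.": {"Working":  [(1.0, +120, "Ph.D.")]},
}

def step(state, action):
    """Sample (reward, next_state) from the transition distribution."""
    r, acc = random.random(), 0.0
    for prob, reward, nxt in P[state][action]:
        acc += prob
        if r <= acc:
            return reward, nxt
    return P[state][action][-1][1:]  # numerical safety fallback

# Prints a short trajectory in the spirit of the slide's example.
state = "B.S."
for action in ["Working", "Studying"]:
    reward, state = step(state, action)
    print(action, reward, state)
```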
  • Policy Evaluation ↔ Policy Improvement (the generalized policy iteration loop); the focus here is policy evaluation.
  • Notation: scalars in regular type ($r_{t+1}$, $V^\pi(s)$); vectors in bold lower case ($\boldsymbol{\phi}(s)$, $\boldsymbol{\mu}(\boldsymbol{\theta})$); matrices in bold upper case ($\mathbf{A}_t$, $\tilde{\mathbf{A}}$).
  • Policy Evaluation: $V^\pi(s) = E_\pi\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0 = s \right]$
  • Linear Function Approximation: $V(s) = \boldsymbol{\theta} \cdot \boldsymbol{\phi}(s) = \sum_{i=1}^{n} \theta_i \phi_i(s)$
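With sparse binary features, the dot product above reduces to summing the weights of the active features. A minimal sketch (the names and sizes are illustrative, and binary features are assumed):

```python
# Linear value estimate V(s) = theta . phi(s) with sparse binary features:
# only the k active indices matter, so the estimate costs O(k), not O(n).
import numpy as np

n = 100                      # total number of features
theta = np.zeros(n)          # weight vector

def value(active_indices, theta):
    """V(s) = sum of theta_i over the active (phi_i(s) = 1) features."""
    return theta[active_indices].sum()

phi_s = [3, 17, 42]          # indices of the k active features for state s
print(value(phi_s, theta))   # 0.0 before any learning
```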
  • Sparsity of features. Sparsity: only k features are active at any given moment, with $k \ll n$. Acrobot [Sutton 96]: 48 << 18,648. Card game [Bowling et al. 02]: 3 << 10^6. Keepaway soccer [Stone et al. 05]: 416 << 10^4.
  • Temporal Difference Learning, TD(0) [Sutton 88]. On each transition $s_t \xrightarrow{a,\,r} s_{t+1}$ under $\pi$. Tabular representation: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, $V_{t+1}(s_t) = V_t(s_t) + \alpha_t \delta_t$. Linear function approximation: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_t \boldsymbol{\phi}(s_t)\, \delta_t(V)$.
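A minimal sketch of this TD(0) update in Python. The `features` callback returning φ(s) as a NumPy vector, and the step-size and discount values, are assumptions for illustration, not part of the slide.

```python
# One TD(0) step with linear function approximation.
import numpy as np

def td0_update(theta, s, r, s_next, features, alpha=0.1, gamma=0.95):
    """theta <- theta + alpha * phi(s) * delta, per the slide's equations."""
    phi, phi_next = features(s), features(s_next)
    delta = r + gamma * theta.dot(phi_next) - theta.dot(phi)  # TD error
    return theta + alpha * phi * delta
```

Because φ(s) has only k non-zero entries, the update effectively touches O(k) weights, which matches the complexity claim on the next slide.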
  • TD(0) Properties: computational complexity is constant, O(k) per time step; data inefficient, since it uses only the last transition.
  • Least-Squares TD (LSTD) [Bradtke, Barto 96]. Sum of TD updates: $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \sum_{i=1}^{t} \boldsymbol{\phi}_i \delta_i(V_{\boldsymbol{\theta}}) = \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i r_{i+1}}_{\mathbf{b}_t} - \underbrace{\sum_{i=1}^{t} \boldsymbol{\phi}_i (\boldsymbol{\phi}_i - \gamma \boldsymbol{\phi}_{i+1})^T}_{\mathbf{A}_t} \boldsymbol{\theta} = \mathbf{b}_t - \mathbf{A}_t \boldsymbol{\theta}$. Setting $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \mathbf{0}$ gives $\boldsymbol{\theta} = \mathbf{A}^{-1} \mathbf{b}$.
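A minimal sketch of batch LSTD under these definitions, assuming transitions are given as (φ, r, φ′) triples of dense NumPy vectors. A real implementation would exploit sparsity and guard against a singular A (e.g. by adding a small regularizer εI), which this sketch omits.

```python
# Batch LSTD: accumulate A and b, then solve A theta = b.
import numpy as np

def lstd(transitions, n, gamma=0.95):
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)  # A_t accumulation
        b += r * phi                                # b_t accumulation
    return np.linalg.solve(A, b)                    # theta = A^{-1} b
```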
  • LSTD Properties: computational complexity is quadratic, O(n²) per time step; data efficient, since it effectively looks through all the data.
  • Outline: Motivation, Introduction, The New Approach, Dimension Selection, Conclusion.
  • The New Approach. The A and b matrices change on each iteration: $\mathbf{b}_t = \mathbf{b}_{t-1} + \underbrace{r_t \boldsymbol{\phi}_t}_{\Delta\mathbf{b}_t}$, $\mathbf{A}_t = \mathbf{A}_{t-1} + \underbrace{\boldsymbol{\phi}_t (\boldsymbol{\phi}_t - \gamma \boldsymbol{\phi}_{t+1})^T}_{\Delta\mathbf{A}_t}$.
  • Incremental LSTD [Geramifard, Bowling, Sutton 06]. $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \mathbf{b}_t - \mathbf{A}_t \boldsymbol{\theta}$. With $\boldsymbol{\theta}$ fixed: $\boldsymbol{\mu}_t(\boldsymbol{\theta}) = \boldsymbol{\mu}_{t-1}(\boldsymbol{\theta}) + \Delta\mathbf{b}_t - (\Delta\mathbf{A}_t) \boldsymbol{\theta}$. With $\mathbf{A}$ and $\mathbf{b}$ fixed: $\boldsymbol{\mu}_t(\boldsymbol{\theta}_{t+1}) = \boldsymbol{\mu}_t(\boldsymbol{\theta}_t) - \mathbf{A}_t (\Delta\boldsymbol{\theta}_t)$.
  • iLSTD: $\boldsymbol{\mu}_t(\boldsymbol{\theta}_{t+1}) = \boldsymbol{\mu}_t(\boldsymbol{\theta}_t) - \mathbf{A}_t (\Delta\boldsymbol{\theta}_t)$. How should θ change? Descending in the full direction of $\boldsymbol{\mu}_t(\boldsymbol{\theta})$ touches all n dimensions, so maintaining $\boldsymbol{\mu}$ after such a step costs O(n²); iLSTD instead updates a small number of dimensions at a time.
  • iLSTD Algorithm:
    0   s ← s₀, A ← 0, µ ← 0, t ← 0
    1   Initialize θ arbitrarily
    2   repeat
    3       Take action according to π and observe r, s′
    4       t ← t + 1
    5       ∆b ← φ(s) r
    6       ∆A ← φ(s) (φ(s) − γ φ(s′))ᵀ
    7       A ← A + ∆A
    8       µ ← µ + ∆b − (∆A) θ
    9       for i from 1 to m do
    10          j ← choose an index of µ using a dimension selection mechanism
    11          θⱼ ← θⱼ + α µⱼ
    12          µ ← µ − α µⱼ A eⱼ
    13      end for
    14      s ← s′
    15  end repeat
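A minimal NumPy sketch of the loop above. The `env`, `pi`, and `features` interfaces and the `select_dimension` callback are assumptions for illustration, not part of the original listing.

```python
# iLSTD over one episode, mirroring lines 0-15 of the listing above.
import numpy as np

def ilstd_episode(env, pi, features, theta, alpha=0.01, gamma=0.95, m=1,
                  select_dimension=None):
    n = theta.size
    A = np.zeros((n, n))
    mu = np.zeros(n)
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step(pi(s))   # act according to pi; observe r, s'
        phi, phi_next = features(s), features(s_next)
        dA = np.outer(phi, phi - gamma * phi_next)   # Delta A
        A += dA
        mu += r * phi - dA.dot(theta)       # mu <- mu + Delta b - (Delta A) theta
        for _ in range(m):                  # m dimension updates per time step
            j = select_dimension(mu)        # dimension selection mechanism
            theta[j] += alpha * mu[j]
            mu -= alpha * mu[j] * A[:, j]   # mu <- mu - alpha mu_j A e_j
        s = s_next
    return theta
```

The saving is in the inner loop: changing a single dimension of θ updates µ through one column of A, an O(n) operation, instead of the O(n²) full-descent update.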
  • iLSTD per-time-step computational complexity: O(mn + k²), linear in n, where m is the number of descent iterations per time step, n is the number of features, and k is the maximum number of active features. More data efficient than TD.
  • Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µₜ is non-zero are selected in the limit an infinite number of times.
  • Empirical Results
  • Settings: results averaged over 30 runs; same random seed for all methods; sparse matrix representation. iLSTD: non-zero random selection, one descent per iteration.
  • Boyan Chain [Boyan 99]. [Diagram: an N-state chain; every transition yields reward -3 except the final ones, which yield -2 and 0; feature values interpolate linearly between anchor states through 1, 0.75, 0.5, 0.25, 0.] Problem sizes: n = 4 (Small), n = 25 (Medium), n = 100 (Large); k = 2.
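A sketch of Boyan-chain-style features consistent with the interpolation values shown (1, 0.75, 0.5, 0.25, 0): each state activates at most two adjacent anchor features, giving k = 2. The exact state indexing here is an assumption.

```python
# Coarse features that interpolate linearly between n evenly spaced anchors.
import numpy as np

def boyan_features(state, num_states, n):
    """phi(state): at most two adjacent anchor features are non-zero."""
    phi = np.zeros(n)
    pos = state / (num_states - 1) * (n - 1)  # position on the anchor grid
    lo = int(pos)
    frac = pos - lo
    phi[lo] = 1.0 - frac                      # e.g. 0.75 / 0.25 splits
    if lo + 1 < n:
        phi[lo + 1] = frac
    return phi
```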
  • Small Boyan Chain. [Plots: RMS error of V(s) over all states vs. episodes (left) and vs. wall-clock time in seconds (right) for TD, iLSTD, and LSTD.]
  • Medium Boyan Chain. [Plots: RMS error of V(s) over all states vs. episodes (left) and vs. wall-clock time in seconds (right) for TD, iLSTD, and LSTD.]
  • Large Boyan Chain. [Plots: RMS error of V(s) over all states vs. episodes (left) and vs. wall-clock time in seconds (right) for TD, iLSTD, and LSTD.]
  • Mountain Car. [Diagram: the car must reach the goal at the top of the mountain.] Tile coding with n = 10,000 and k = 10. Position = -1 (Easy); position = -0.5 (Hard). [For details see RL-Library]
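For context, a minimal sketch of tile coding over the 2-D mountain car state (position, velocity). The ranges, tile counts, and offsets below are assumptions chosen for illustration; the actual experiments used a standard tile-coding implementation, and with 10 tilings each state activates k = 10 features, as on the slide.

```python
# Tile coding: each of num_tilings offset grids contributes one active tile.
import numpy as np

def tile_indices(pos, vel, num_tilings=10, tiles_per_dim=10,
                 pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07)):
    """Return one active tile index per tiling (k = num_tilings)."""
    x = (pos - pos_range[0]) / (pos_range[1] - pos_range[0])
    y = (vel - vel_range[0]) / (vel_range[1] - vel_range[0])
    active = []
    for t in range(num_tilings):
        off = t / num_tilings  # each tiling shifted by a fraction of a tile
        i = int(x * tiles_per_dim + off) % tiles_per_dim
        j = int(y * tiles_per_dim + off) % tiles_per_dim
        active.append(t * tiles_per_dim**2 + i * tiles_per_dim + j)
    return active  # phi has 1s exactly at these indices
```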
  • Easy Mountain Car. [Plots: loss vs. episodes and vs. time (s) for TD, iLSTD, and LSTD.] $\mathrm{Loss} = \|\mathbf{b}^* - \mathbf{A}^* \boldsymbol{\theta}\|_2^2$
  • Hard Mountain Car. [Plots: loss vs. episodes and vs. time (s) for TD, iLSTD, and LSTD.]
  • Running Time. [Bar chart: per-time-step running time in ms for TD, iLSTD, and LSTD (annotated constant, linear, and exponential, respectively) across the Boyan Chain and Mountain Car domains; TD stays near 0 ms while LSTD reaches roughly 155 ms per step.]
  • Outline: Motivation, Introduction, The New Approach, Dimension Selection, Conclusion.
  • Dimension Selection: Random ✓, Greedy, ε-Greedy, Boltzmann.
  • Greedy Dimension Selection: pick the dimension with the highest value of |µₜ(i)|. Not proven to converge.
  • ε-Greedy Dimension Selection: with probability ε, non-zero random; with probability 1−ε, greedy. The convergence proof applies.
  • Boltzmann Component Selection: a Boltzmann distribution over the components combined with non-zero random selection (ψ × m). The convergence proof applies.
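Minimal sketches of the selection mechanisms, assuming µ is a NumPy vector. The exact mixing with non-zero random selection (the ψ × m detail) is simplified away here, so treat these as illustrative.

```python
# Dimension selection mechanisms for iLSTD's inner loop.
import numpy as np

def greedy(mu):
    """Pick the dimension with the largest |mu_i| (not proven to converge)."""
    return int(np.argmax(np.abs(mu)))

def nonzero_random(mu, rng=np.random):
    """Pick uniformly among the dimensions where mu is non-zero."""
    return int(rng.choice(np.flatnonzero(mu)))

def epsilon_greedy(mu, epsilon=0.1, rng=np.random):
    """With probability epsilon explore (non-zero random), else be greedy."""
    if rng.random() < epsilon:
        return nonzero_random(mu, rng)
    return greedy(mu)

def boltzmann(mu, tau=1.0, rng=np.random):
    """Sample a dimension with probability proportional to exp(|mu_i| / tau)."""
    prefs = np.abs(mu) / tau
    p = np.exp(prefs - prefs.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(mu), p=p))
```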
  • Empirical Results: ε-Greedy with ε = .1; Boltzmann with ψ = 10^-9, τ = 1. [Bar chart: error measure (log scale) for Random, Greedy, ε-Greedy, and Boltzmann selection on the Small/Medium/Large Boyan Chain and Easy/Hard Mountain Car.]
  • Running Time. [Bar chart: clock time per step in ms for Random, Greedy, Boltzmann, and ε-Greedy selection on the Small/Medium/Large Boyan Chain and Easy/Hard Mountain Car.]
  • Outline: Motivation, Introduction, The New Approach, Dimension Selection, Conclusion.
  • Conclusion (data efficiency vs. speed): LSTD is data efficient, and needs no learning rate; TD is fast; iLSTD offers both data efficiency and speed.
  • Questions? ... What if someone uses batch LSTD? Why does iLSTD take simple descent steps? Hmm ... what about control?
  • Thanks ...