SlideShare a Scribd company logo
1 of 156
Download to read offline
Incremental
     Least-Squares
     Temporal Difference
     Learning
Alborz Geramifard
 September, 2007
  alborz@cs.ualberta.ca
Incremental
     Least-Squares
     Temporal Difference
     Learning
Alborz Geramifard
 September, 2007
  alborz@cs.ualberta.ca
Acknowledgement

          Richard Sutton

Michael Bowling

          Martin Zinkevich
Agencia
Agencia



          Generous

          Wise

          Kind

          Forgetful
Agencia




Envirocia
Agencia     Agriculture

            Politics

            Research

            ...




Envirocia
The Ambassador of Envirocia
The Vazir of Agencia




The Ambassador of Envirocia
The Vazir of Agencia




The Ambassador of Envirocia
                       “Your king is generous, wise,
                       and kind, but he is forgetful.
                       Sometimes he makes decisions
                       which are not wise in light of
                       our earlier conversations.”
The Vazir of Agencia

                       Having consultants for various
                       subjects might be the answer ...




The Ambassador of Envirocia
                       “Your king is generous, wise,
                       and kind, but he is forgetful.
                       Sometimes he makes decisions
                       which are not wise in light of
                       our earlier conversations.”
Consultants
Consultants
Ambassador
Vazir




Ambassador
Vazir




Ambassador
             “I should admit that recently the king
             makes decent judgments, but as the
             number of our meetings increases, it
             takes a while for me to hear back from
             the majesty ...”
Agencia




Envirocia
Reinforcement Learning Framework
   Observation
                   Agent
                               Action
      Reward



                 Environment
Observation
                Agent
                            Action
   Reward



              Environment
Observation
                Agent
                            Action
   Reward



              Environment
Observation
                       Agent
                                   Action
          Reward



                     Environment




Goal
Observation
                       Agent
                                   Action
          Reward



                     Environment




Goal
Observation
                                                            Agent
                                                                        Action
                      Mo                       Reward

                         re   Co
                                  mp                      Environment
                                    lica
                                          ted
                                              S   tate
                                                             Spa
                                   Goal
                                                                ce


Mo
   re   Co
           mplic
                 a   ted
                         Tas
                             ks
Summary
Data Efficiency




Speed
                        Summary
Summary
Data Efficiency




                          Part I: King

                 Speed
Summary
                Part II: King + Consultants
Data Efficiency




                                  Part I: King

                    Speed
Summary
                Part II: King + Consultants
Data Efficiency

                           ?

                                  Part I: King

                    Speed
Summary
                LSTD
Data Efficiency




                          TD
                 Speed
Summary
                LSTD
Data Efficiency
                         iLSTD


                             TD
                 Speed
Contributions
Contributions
  iLSTD: A new policy evaluation
  algorithm

  Extension with eligibility traces

  Running time analysis

  Dimension selection methods

  Proof of convergence

  Empirical results
Contributions
  iLSTD: A new policy evaluation
  algorithm

  Extension with eligibility traces

  Running time analysis

  Dimension selection methods

  Proof of convergence

  Empirical results
Contributions
  iLSTD: A new policy evaluation
  algorithm

  Extension with eligibility traces

  Running time analysis

  Dimension selection methods

  Proof of convergence

  Empirical results
Outline
Outline
 Motivation

 Introduction

 The New Approach

 Dimension Selection

 Conclusion
Outline
 Motivation

 Introduction

 The New Approach

 Dimension Selection

 Conclusion
Markov Decision Process
           a       a
    S, A, Pss   , Rss   ,γ
Markov Decision Process
              a                             a
       S, A, Pss                         , Rss                ,γ
                                   10%,-300                        100%,+120
  100%,+60
                                                                       Working
      Working
                                        30%,-200

             Studying                                   70%,-200
                                                                   Ph.D.
   B.S.                           M.S.
                        80%,-50              Studying


          10%,-50                    Working

                                  100%,+85
Markov Decision Process
                  a                             a
           S, A, Pss                         , Rss                ,γ
                                       10%,-300                        100%,+120
      100%,+60
                                                                           Working
          Working
                                            30%,-200

                 Studying                                   70%,-200
                                                                       Ph.D.
       B.S.                           M.S.
                            80%,-50              Studying


              10%,-50                    Working

                                      100%,+85


B.S., Working, +60, B.S. Studying, -50, M.S. , ...
Policy Evaluation
Policy Evaluation



Policy Improvement
Policy Evaluation



Policy Improvement
Policy Evaluation



Policy Improvement
Notation

                                 rt+1
                         V (s)
                          π
Scalar     Regular

                         φ(s)    µ(θ)
Vector Bold Lower Case

                                 ˜
                         At      A
Matrix Bold Upper Case
Policy Evaluation

            ∞
V (s) = E                   rt s0 = s, π
 π                    t−1
                  γ
            t=1
Linear
Function Approximation
                     n
V (s) = θ · φ(s) =         θi φi (s)
                     i=1
Sparsity of features
                 k features are active at any given
Sparsity: Only
moment.

                 k           n
Sparsity of features
                 k features are active at any given
Sparsity: Only
moment.

                 k           n
 Acrobot [Sutton 96]: 48 << 18,648
Sparsity of features
                 k features are active at any given
Sparsity: Only
moment.

                 k           n
 Acrobot [Sutton 96]: 48 << 18,648

 Card game [Bowling et al. 02]: 3 << 10^6
Sparsity of features
                 k features are active at any given
Sparsity: Only
moment.

                 k           n
 Acrobot [Sutton 96]: 48 << 18,648

 Card game [Bowling et al. 02]: 3 << 10^6

 Keep away soccer [Stone et al. 05]: 416 << 10^4
Temporal Difference Learning
          TD(0)
Temporal Difference Learning
          TD(0)
             (π) a,r
                       st+1
        st




                              [Sutton 88]
Temporal Difference Learning
          TD(0)
                 (π) a,r
                           st+1
            st
   Tabular Representation

δt = rt + γV (st+1 ) − V (st ).
Vt+1 (st ) = Vt (st ) + αt δt


                                  [Sutton 88]
Temporal Difference Learning
          TD(0)
                 (π) a,r
                           st+1
            st
   Tabular Representation

δt = rt + γV (st+1 ) − V (st ).
Vt+1 (st ) = Vt (st ) + αt δt
   Linear Function Approximation
θ t+1 = θ t + αt φ(st )δt (V )
                                   [Sutton 88]
TD(0) Properties
Computational complexity


  O(k) per time step
Data inefficient

 Only last transition
TD(0) Properties
Computational complexity
                           Constant
  O(k) per time step
Data inefficient

 Only last transition
Least-Squares TD (LSTD)
  Sum of TD updates
Least-Squares TD (LSTD)
                      [Bradtke, Barto 96]
  Sum of TD updates
Least-Squares TD (LSTD)
                                [Bradtke, Barto 96]
  Sum of TD updates
             t
 µt (θ) =         φi δi (Vθ )
            i=1
Least-Squares TD (LSTD)
                                      [Bradtke, Barto 96]
  Sum of TD updates
             t
 µt (θ) =         φi δi (Vθ )
            i=1
             t                   t
       =                              φi (φi − γφi+1 ) θ
                                                    T
                  φi ri+1 −
            i=1                 i=1

                                           At
                  bt
Least-Squares TD (LSTD)
                                      [Bradtke, Barto 96]
  Sum of TD updates
             t
 µt (θ) =         φi δi (Vθ )
            i=1
             t                   t
       =                              φi (φi − γφi+1 ) θ
                                                    T
                  φi ri+1 −
            i=1                 i=1

                                           At
                  bt

       = bt − At θ
Least-Squares TD (LSTD)
                                      [Bradtke, Barto 96]
  Sum of TD updates
             t
 µt (θ) =         φi δi (Vθ )
            i=1
             t                   t
       =                              φi (φi − γφi+1 ) θ
                                                     T
                  φi ri+1 −
            i=1                 i=1

                                           At
                  bt

       = bt − At θ
µt (θ) = 0                                      −1
                                        θ=A          b
LSTD Properties
 Computational complexity

   O(n ) per time step
          2

 Data efficient

  Look through all data
LSTD Properties
                 Quadratic
 Computational complexity

   O(n ) per time step
      2

 Data efficient

  Look through all data
Outline
Outline
 Motivation

 Introduction

 The New Approach

 Dimension Selection

 Conclusion
The New Approach
The New Approach

  A and b matrices change on each iteration.
The New Approach

   A and b matrices change on each iteration.
              t                 t
  µt (θ) =                           φi (φi − γφi+1 ) θ
                                                   T
                   φi ri+1 −
             i=1               i=1

                                          At
                   bt
The New Approach

   A and b matrices change on each iteration.
              t                 t
  µt (θ) =                           φi (φi − γφi+1 ) θ
                                                   T
                   φi ri+1 −
             i=1               i=1

                                          At
                   bt
 bt = bt−1 + rt φt
                        ∆bt
 At = At−1 + φt (φt − γφt+1 )                  T


                                ∆At
Incremental LSTD
µt (θ) = bt − At θ
Incremental LSTD
µt (θ) = bt − At θ




   [Geramifard, Bowling, Sutton 06]
Incremental LSTD
          µt (θ) = bt − At θ
Fixed θ




Fixed A and b




                [Geramifard, Bowling, Sutton 06]
Incremental LSTD
           µt (θ) = bt − At θ
 Fixed θ

µt (θ) = µt−1 (θ) + ∆bt − (∆At )θ.

 Fixed A and b




                 [Geramifard, Bowling, Sutton 06]
Incremental LSTD
           µt (θ) = bt − At θ
 Fixed θ

µt (θ) = µt−1 (θ) + ∆bt − (∆At )θ.

 Fixed A and b

 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
                 [Geramifard, Bowling, Sutton 06]
iLSTD
µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

 Descent in the direction of µt (θ) ?




              n
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

                                        O(n )
                                           2
 Descent in the direction of µt (θ) ?




              n
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

                                        O(n )
                                           2
 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

                                        O(n )
                                           2
 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

                                        O(n )
                                           2
 Descent in the direction of µt (θ) ?
iLSTD
 µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
How to change θ ?

                                        O(n )
                                           2
 Descent in the direction of µt (θ) ?
chosen.

                    iLSTD Algorithm
Algorithm 5: iLSTD                                                      Co
 0 s ← s0 , A ← 0, µ ← 0, t ← 0
 1 Initialize θ arbitrarily
 2 repeat
 3     Take action according to π and observe r, s
       t←t+1
 4
       ∆b ← φ(s)r
 5
       ∆A ← φ(s)(φ(s) − γφ(s ))T
 6
       A ← A + ∆A
 7
       µ ← µ + ∆b − (∆A)θ
 8
       for i from 1 to m do
 9
 10       j ← choose an index of µ using a dimension selection mechanism
          θj ← θj + αµj
 11
          µ ← µ − αµj Aej
 12
 13 end for
 14 s ← s
 15 end repeat
chosen.

                    iLSTD Algorithm
Algorithm 5: iLSTD                                                      Co
 0 s ← s0 , A ← 0, µ ← 0, t ← 0
 1 Initialize θ arbitrarily
 2 repeat
 3     Take action according to π and observe r, s
       t←t+1
 4
       ∆b ← φ(s)r
 5
       ∆A ← φ(s)(φ(s) − γφ(s ))T
 6
       A ← A + ∆A
 7
       µ ← µ + ∆b − (∆A)θ
 8
       for i from 1 to m do
 9
 10       j ← choose an index of µ using a dimension selection mechanism
          θj ← θj + αµj
 11
          µ ← µ − αµj Aej
 12
 13 end for
 14 s ← s
 15 end repeat
chosen.

                    iLSTD Algorithm
Algorithm 5: iLSTD                                                      Co
 0 s ← s0 , A ← 0, µ ← 0, t ← 0
 1 Initialize θ arbitrarily
 2 repeat
 3     Take action according to π and observe r, s
       t←t+1
 4
       ∆b ← φ(s)r
 5
       ∆A ← φ(s)(φ(s) − γφ(s ))T
 6
       A ← A + ∆A
 7
       µ ← µ + ∆b − (∆A)θ
 8
       for i from 1 to m do
 9
 10       j ← choose an index of µ using a dimension selection mechanism
          θj ← θj + αµj
 11
          µ ← µ − αµj Aej
 12
 13 end for
 14 s ← s
 15 end repeat
iLSTD

 Per-time-step computational complexity

O(mn + k )      2




 More data efficient than TD
iLSTD

 Per-time-step computational complexity

O(mn + k )       2



    Number of iterations per time step
 More data efficient than TD
iLSTD

 Per-time-step computational complexity

O(mn + k )       2


       Number of features
    Number of iterations per time step
 More data efficient than TD
iLSTD

 Per-time-step computational complexity

O(mn + k )       2

              Maximum number of active features
       Number of features
    Number of iterations per time step
 More data efficient than TD
iLSTD

 Per-time-step computational complexity

O(mn + k )       2
                                         Linear
              Maximum number of active features
       Number of features
    Number of iterations per time step
 More data efficient than TD
iLSTD
    Theorem : iLSTD converges with
probability one to the same solution as TD,
under the usual step-size conditions, for any
dimension selection method such that all
dimensions for which µt is non-zero are
selected in the limit an infinite number of
times.
Empirical Results
Settings
Settings
Averaged over 30 runs
Same random seed for all methods
Sparse matrix representation
iLSTD
 Non-zero random selection
 One descent per iteration
Boyan Chain
Boyan Chain
              -3               -3      -3     -3
         -3

N                       5 -3    4 -3   3 -3    2 -2   10   0
                   -3
    -3
1                       0        0      0       0     0    0
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
0                       1      0.75    0.5    0.25    0    0
0                       0      0.25    0.5    0.75    1    0




                                                   [Boyan 99]
Boyan Chain
              -3               -3      -3     -3
         -3

N                       5 -3    4 -3   3 -3    2 -2   10   0
                   -3
    -3
1                       0        0      0       0     0    0
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
0                       1      0.75    0.5    0.25    0    0
0                       0      0.25    0.5    0.75    1    0


    n = 4 (Small)
    n = 25 (Medium)
    n = 100 (Large)
                                                   [Boyan 99]
Boyan Chain
              -3               -3      -3     -3
         -3

N                       5 -3    4 -3   3 -3    2 -2   10   0
                   -3
    -3
1                       0        0      0       0     0    0
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
.                       .        .      .       .     .    .
0                       1      0.75    0.5    0.25    0    0
0                       0      0.25    0.5    0.75    1    0


    n = 4 (Small)                             k=2
    n = 25 (Medium)
    n = 100 (Large)
                                                   [Boyan 99]
Small Boyan Chain
                                                                                                             2
                               0
                                                                                                            10
                              10
RMS of V(s) over all states




                                                                              RMS of V(s) over all states
                                                                                                             1
                                                                                                            10


                                                                                                                           iLSTD
                                                                TD
                               −1                                                                            0
                              10                                                                            10



                                                                                                                                         TD
                                                                LSTD
                                       iLSTD                                                                 −1
                                                                                                                                                 LSTD
                                                                                                            10



                               −2                                                                            −2
                              10                                                                            10
                                   0     200   400        600    800   1000                                      0   0.5     1       1.5     2   2.5    3
                                                     Episodes                                                                      Time(s)
Small Boyan Chain
                                                                                                             2
                               0
                                                                                                            10
                              10
RMS of V(s) over all states




                                                                              RMS of V(s) over all states
                                                                                                             1
                                                                                                            10


                                                                                                                           iLSTD
                                                                TD
                               −1                                                                            0
                              10                                                                            10



                                                                                                                                         TD
                                                                LSTD
                                       iLSTD                                                                 −1
                                                                                                                                                 LSTD
                                                                                                            10



                               −2                                                                            −2
                              10                                                                            10
                                   0     200   400        600    800   1000                                      0   0.5     1       1.5     2   2.5    3
                                                     Episodes                                                                      Time(s)
Medium Boyan Chain
                               2                                                                            3
                              10                                                                           10
RMS of V(s) over all states




                                                                             RMS of V(s) over all states
                                                                                                            2
                                                                                                           10
                               1
                              10


                                                                                                            1
                                                                                                           10
                                                                                                                        iLSTD
                               0
                                          iLSTD                      TD
                              10
                                                                                                                            TD
                                                                                                            0
                                                                                                           10

                                                                                                                                          LSTD
                                       LSTD
                               −1                                                                           −1
                              10                                                                           10
                                   0    200   400        600   800    1000                                      0   5       10       15      20
                                                    Episodes                                                               Time(s)
Medium Boyan Chain
                               2                                                                            3
                              10                                                                           10
RMS of V(s) over all states




                                                                             RMS of V(s) over all states
                                                                                                            2
                                                                                                           10
                               1
                              10


                                                                                                            1
                                                                                                           10
                                                                                                                        iLSTD
                               0
                                          iLSTD                      TD
                              10
                                                                                                                            TD
                                                                                                            0
                                                                                                           10

                                                                                                                                          LSTD
                                       LSTD
                               −1                                                                           −1
                              10                                                                           10
                                   0    200   400        600   800    1000                                      0   5       10       15      20
                                                    Episodes                                                               Time(s)
Large Boyan Chain
                               3                                                                          3
                              10                                                                         10
RMS of V(s) over all states




                               2




                                                                           RMS of V(s) over all states
                                                                                                          2
                              10                                                                         10



                                                              TD
                                                                                                                              TD
                               1                                                                          1
                              10                                                                         10

                                             iLSTD
                                                                                                                                        LSTD
                               0                                                                          0
                              10                                                                         10


                                                      LSTD                                                                  iLSTD
                               −1                                                                         −1
                              10                                                                         10
                                   0   200   400        600   800   1000                                      0   20   40       60     80   100
                                                   Episodes                                                                  Time(s)
Large Boyan Chain
                               3                                                                          3
                              10                                                                         10
RMS of V(s) over all states




                               2




                                                                           RMS of V(s) over all states
                                                                                                          2
                              10                                                                         10



                                                              TD
                                                                                                                              TD
                               1                                                                          1
                              10                                                                         10

                                             iLSTD
                                                                                                                                        LSTD
                               0                                                                          0
                              10                                                                         10


                                                      LSTD                                                                  iLSTD
                               −1                                                                         −1
                              10                                                                         10
                                   0   200   400        600   800   1000                                      0   20   40       60     80   100
                                                   Episodes                                                                  Time(s)
Large Boyan Chain
                               3                                                                          3
                              10                                                                         10
RMS of V(s) over all states




                               2




                                                                           RMS of V(s) over all states
                                                                                                          2
                              10                                                                         10



                                                              TD
                                                                                                                              TD
                               1                                                                          1
                              10                                                                         10

                                             iLSTD
                                                                                                                                        LSTD
                               0                                                                          0
                              10                                                                         10


                                                      LSTD                                                                  iLSTD
                               −1                                                                         −1
                              10                                                                         10
                                   0   200   400        600   800   1000                                      0   20   40       60     80   100
                                                   Episodes                                                                  Time(s)
Mountain Car
Mountain Car
                         Goal




Mountain Car            Tile coding
                        n = 10,000
Position = -1 (Easy)
                        k = 10
Position = -.5 (Hard)

                [For details see RL-Library]
Easy Mountain Car
                                                                        7
                     6
                                                                       10
                10




                                                  TD
                                                                        6
                                                                       10
                               iLSTD
                 4
                10

                                                                                               TD



                                                                Loss
   Loss




                     5
                10

                                                  TD                                                     LSTD
                                                                        5
                                                 LSTD                  10
                               iLSTD
Loss Function




                                                                                              iLSTD
                 3
                10

                                                                        4
                     4
                                                                       10
                10
                                                 LSTD                       0   50   100   150    200   250   300   350
                     0   200    400       600      800   1000
                                                                                             Time (s)
                                      Episode



                                                                       ∗
                                  Loss = ||b − A θ||2                                  ∗
                 2
                10
                     0   200    400        600     800   1000
                                      Episodes
Hard Mountain Car
                                                           8
                                                          10
        7
       10




                                      TD
                                                           7
                                                          10




                                                   Loss
Loss




                            iLSTD
        6
       10
                                                                                                 LSTD
                                                                          TD
                                                           6
                      LSTD                                10


                                                                               iLSTD
        5                                                  5
       10                                                 10
            0   200   400       600   800   1000               0   200   400      600      800    1000   1200
                            Episode                                             Time (s)
Hard Mountain Car
                                                           8
                                                          10
        7
       10




                                      TD
                                                           7
                                                          10




                                                   Loss
Loss




                            iLSTD
        6
       10
                                                                                                 LSTD
                                                                          TD
                                                           6
                      LSTD                                10


                                                                               iLSTD
        5                                                  5
       10                                                 10
            0   200   400       600   800   1000               0   200   400      600      800    1000   1200
                            Episode                                             Time (s)
Running Time                                     Constant
                                                                               Linear
                                                                               Exponential



            155                                                                  154
                              TD
            124
                              iLSTD
             93
Time (ms)




                              LSTD
             62

             31                                                   34


              0                                                            9
                                                              7        5
                                                         5
                                                     2
                                                 0
                                         0
                          0
                      0             0
                  0            0             0
                                     m



                                               d


                                                             sy



                                                                         d
                    all




                                             ar




                                                                       ar
                                   iu




                                                         Ea
                  Sm




                                             H




                                                                       H
                              ed
                              M




                      Boyan Chain                        Mountain Car
Running Time                                     Constant
                                                                               Linear
                                                                               Exponential



            155                                                                  154
                              TD
            124
                              iLSTD
             93
Time (ms)




                              LSTD
             62

             31                                                   34


              0                                                            9
                                                              7        5
                                                         5
                                                     2
                                                 0
                                         0
                          0
                      0             0
                  0            0             0
                                     m



                                               d


                                                             sy



                                                                         d
                    all




                                             ar




                                                                       ar
                                   iu




                                                         Ea
                  Sm




                                             H




                                                                       H
                              ed
                              M




                      Boyan Chain                        Mountain Car
Outline
Outline
  Motivation

  Introduction

  The New Approach

  Dimension Selection

  Conclusion
Dimension Selection

Random ✓
Dimension Selection

Random ✓

Greedy
Dimension Selection

Random ✓

Greedy

ε-Greedy
Dimension Selection

Random ✓

Greedy

ε-Greedy

Boltzmann
Greedy Dimension Selection
   Pick the one with highest value of

             |µt (i)|
Greedy Dimension Selection
   Pick the one with highest value of

             |µt (i)|
    Not proven to converge.
ε-Greedy Dimension
     Selection
  ε : Non-Zero Random

  (1-ε) : Greedy
ε-Greedy Dimension
     Selection
  ε : Non-Zero Random

  (1-ε) : Greedy

  Convergence proof applies.
Boltzmann Component
      Selection
 Boltzmann Distribution +
 Non-Zero Random
Boltzmann Component
      Selection
 Boltzmann Distribution +
 Non-Zero Random
Boltzmann Component
      Selection
 Boltzmann Distribution +
 Non-Zero Random



ψ×m
Boltzmann Component
      Selection
 Boltzmann Distribution +
 Non-Zero Random
           Boltzmann Distribution


ψ×m
Boltzmann Component
      Selection
 Boltzmann Distribution +
 Non-Zero Random
            Boltzmann Distribution


ψ×m

 Convergence proof applies.
Empirical Results
    ε-Greedy: ε =.1

    Boltzmann: ψ = 10^-9, τ = 1
Empirical Results
                   8
                  10
                       ε-Greedy: ε =.1
                          Random
                          Greedy
                           !−Greedy
                   6
                  10
                       Boltzmann: ψ = 10^-9, τ = 1
                           Boltzmann
  Error Measure




                   4
                  10



                   2
                  10



                   0
                  10



                   −2
                  10
                         Small   Medium     Large   Easy       Hard
                                 Boyan Chain          Mountain Car
Empirical Results
                   8
                  10
                       ε-Greedy: ε =.1
                          Random
                          Greedy
                           !−Greedy
                   6
                  10
                       Boltzmann: ψ = 10^-9, τ = 1
                           Boltzmann
  Error Measure




                   4
                  10



                   2
                  10



                   0
                  10



                   −2
                  10
                         Small   Medium     Large   Easy       Hard
                                 Boyan Chain          Mountain Car
Empirical Results
                   8
                  10
                       ε-Greedy: ε =.1
                          Random
                          Greedy
                           !−Greedy
                   6
                  10
                       Boltzmann: ψ = 10^-9, τ = 1
                           Boltzmann
  Error Measure




                   4
                  10



                   2
                  10



                   0
                  10



                   −2
                  10
                         Small   Medium     Large   Easy       Hard
                                 Boyan Chain          Mountain Car
Running Time
                       Random           Greedy                    Boltzmann
                                                      !-greedy
                      11.0
Clocktime/step (ms)




                       8.8


                       6.6


                       4.4


                       2.2


                        0
                                  l




                                                       e




                                                             sy




                                                                     d
                                            um
                              al




                                                                   ar
                                                    rg



                                                           Ea
                             Sm




                                            i



                                                 La




                                                                  H
                                         ed
                                        M




                                      Boyan chain          Mountain car
Outline
Outline
 Motivation

 Introduction

 The New Approach

 Dimension Selection

 Conclusion
Conclusion
Conclusion
                 LSTD
Data Efficiency




                             TD
                  Speed
Conclusion
                 LSTD
Data Efficiency
                          iLSTD



                             TD
                  Speed
Conclusion
                 LSTD
Data Efficiency
                          iLSTD



                             TD
                  Speed
Conclusion
                 LSTD
Data Efficiency
                          iLSTD



                             TD
                  Speed
Conclusion
                 LSTD
Data Efficiency
                          iLSTD



                             TD
                  Speed
Conclusion
                          No learning rate!
                 LSTD
Data Efficiency
                              iLSTD



                                  TD
                  Speed
Questions ?
Questions ...
Questions ...
   What if someone uses batch-LSTD?
Questions ...
   What if someone uses batch-LSTD?

   Why iLSTD takes simple descent?
Questions ...
   What if someone uses batch-LSTD?

   Why iLSTD takes simple descent?

   Hmm ... What about control?
Thanks ...

More Related Content

More from knowdiff

Ut talk feb 2017
Ut talk   feb 2017Ut talk   feb 2017
Ut talk feb 2017knowdiff
 
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...knowdiff
 
Scheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data localityScheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data localityknowdiff
 
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...knowdiff
 
Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing knowdiff
 
Amin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change ScienceAmin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change Scienceknowdiff
 
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...knowdiff
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systemsknowdiff
 
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...knowdiff
 
Narjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objectsNarjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objectsknowdiff
 
Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)knowdiff
 
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...knowdiff
 
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)knowdiff
 
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...knowdiff
 
Mehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics DilemmasMehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics Dilemmasknowdiff
 
Hossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on GraphsHossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on Graphsknowdiff
 
Dr. Amir Nejat
Dr. Amir NejatDr. Amir Nejat
Dr. Amir Nejatknowdiff
 

More from knowdiff (17)

Ut talk feb 2017
Ut talk   feb 2017Ut talk   feb 2017
Ut talk feb 2017
 
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
Ali khalili: Towards an Open Linked Data-based Infrastructure for Studying Sc...
 
Scheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data localityScheduling for cloud systems with multi level data locality
Scheduling for cloud systems with multi level data locality
 
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
Amin Milani Fard: Directed Model Inference for Testing and Analysis of Web Ap...
 
Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing Knowledge based economy and power of crowd sourcing
Knowledge based economy and power of crowd sourcing
 
Amin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change ScienceAmin tayyebi: Big Data and Land Use Change Science
Amin tayyebi: Big Data and Land Use Change Science
 
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
Mehdi Rezagholizadeh: Image Sensor Modeling: Color Measurement at Low Light L...
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
 
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
Seyed Mehdi mohaghegh: Modelling material use within the low carbon energy pa...
 
Narjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objectsNarjess Afzaly: Model Your Problem with Graphs and Generate your objects
Narjess Afzaly: Model Your Problem with Graphs and Generate your objects
 
Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)Computational methods applications in air pollution modeling (Dr. Yadghar)
Computational methods applications in air pollution modeling (Dr. Yadghar)
 
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...
Somaz Kolahi : Functional Dependencies: Redundancy Analysis and Correcting Vi...
 
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
Uncalibrated Image-Based Robotic Visual Servoing (knowdiff.net)
 
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
Knowdiff visiting lecturer 140 (Azad Shademan): Uncalibrated Image-Based Robo...
 
Mehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics DilemmasMehran Shaghaghi: Quantum Mechanics Dilemmas
Mehran Shaghaghi: Quantum Mechanics Dilemmas
 
Hossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on GraphsHossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on Graphs
 
Dr. Amir Nejat
Dr. Amir NejatDr. Amir Nejat
Dr. Amir Nejat
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Alborz

  • 1. Incremental Least-Squares Temporal Difference Learning Alborz Geramifard September, 2007 alborz@cs.ualberta.ca
  • 2. Incremental Least-Squares Temporal Difference Learning Alborz Geramifard September, 2007 alborz@cs.ualberta.ca
  • 3.
  • 4. Acknowledgement Richard Sutton Michael Bowling Martin Zinkevich
  • 5.
  • 7. Agencia Generous Wise Kind Forgetful
  • 9. Agencia Agriculture Politics Research ... Envirocia
  • 10. The Ambassador of Envirocia
  • 11. The Vazir of Agencia The Ambassador of Envirocia
  • 12. The Vazir of Agencia The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
  • 13. The Vazir of Agencia Having consultants for various subjects might be the answer ... The Ambassador of Envirocia “Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”
  • 14.
  • 19. Vazir Ambassador “I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from the majesty ...”
  • 21. Reinforcement Learning Framework Observation Agent Action Reward Environment
  • 22. Observation Agent Action Reward Environment
  • 23. Observation Agent Action Reward Environment
  • 24. Observation Agent Action Reward Environment Goal
  • 25. Observation Agent Action Reward Environment Goal
  • 26. Observation Agent Action Mo Reward re Co mp Environment lica ted S tate Spa Goal ce Mo re Co mplic a ted Tas ks
  • 29. Summary Data Efficiency Part I: King Speed
  • 30. Summary Part II: King + Consultants Data Efficiency Part I: King Speed
  • 31. Summary Part II: King + Consultants Data Efficiency ? Part I: King Speed
  • 32. Summary LSTD Data Efficiency TD Speed
  • 33. Summary LSTD Data Efficiency iLSTD TD Speed
  • 35. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  • 36. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  • 37. Contributions iLSTD: A new policy evaluation algorithm Extension with eligibility traces Running time analysis Dimension selection methods Proof of convergence Empirical results
  • 39. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  • 40. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  • 41. Markov Decision Process a a S, A, Pss , Rss ,γ
  • 42. Markov Decision Process a a S, A, Pss , Rss ,γ 10%,-300 100%,+120 100%,+60 Working Working 30%,-200 Studying 70%,-200 Ph.D. B.S. M.S. 80%,-50 Studying 10%,-50 Working 100%,+85
  • 43. Markov Decision Process a a S, A, Pss , Rss ,γ 10%,-300 100%,+120 100%,+60 Working Working 30%,-200 Studying 70%,-200 Ph.D. B.S. M.S. 80%,-50 Studying 10%,-50 Working 100%,+85 B.S., Working, +60, B.S. Studying, -50, M.S. , ...
  • 48. Notation rt+1 V (s) π Scalar Regular φ(s) µ(θ) Vector Bold Lower Case ˜ At A Matrix Bold Upper Case
  • 49. Policy Evaluation ∞ V (s) = E rt s0 = s, π π t−1 γ t=1
  • 50. Linear Function Approximation n V (s) = θ · φ(s) = θi φi (s) i=1
  • 51. Sparsity of features k features are active at any given Sparsity: Only moment. k n
  • 52. Sparsity of features k features are active at any given Sparsity: Only moment. k n Acrobot [Sutton 96]: 48 << 18,648
  • 53. Sparsity of features k features are active at any given Sparsity: Only moment. k n Acrobot [Sutton 96]: 48 << 18,648 Card game [Bowling et al. 02]: 3 << 10^6
  • 54. Sparsity of features k features are active at any given Sparsity: Only moment. k n Acrobot [Sutton 96]: 48 << 18,648 Card game [Bowling et al. 02]: 3 << 10^6 Keep away soccer [Stone et al. 05]: 416 << 10^4
  • 56. Temporal Difference Learning TD(0) (π) a,r st+1 st [Sutton 88]
  • 57. Temporal Difference Learning TD(0) (π) a,r st+1 st Tabular Representation δt = rt + γV (st+1 ) − V (st ). Vt+1 (st ) = Vt (st ) + αt δt [Sutton 88]
  • 58. Temporal Difference Learning TD(0) (π) a,r st+1 st Tabular Representation δt = rt + γV (st+1 ) − V (st ). Vt+1 (st ) = Vt (st ) + αt δt Linear Function Approximation θ t+1 = θ t + αt φ(st )δt (V ) [Sutton 88]
  • 59. TD(0) Properties Computational complexity O(k) per time step Data inefficient Only last transition
  • 60. TD(0) Properties Computational complexity Constant O(k) per time step Data inefficient Only last transition
  • 61. Least-Squares TD (LSTD) Sum of TD updates
  • 62. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates
  • 63. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates t µt (θ) = φi δi (Vθ ) i=1
  • 64. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates t µt (θ) = φi δi (Vθ ) i=1 t t = φi (φi − γφi+1 ) θ T φi ri+1 − i=1 i=1 At bt
  • 65. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates t µt (θ) = φi δi (Vθ ) i=1 t t = φi (φi − γφi+1 ) θ T φi ri+1 − i=1 i=1 At bt = bt − At θ
  • 66. Least-Squares TD (LSTD) [Bradtke, Barto 96] Sum of TD updates t µt (θ) = φi δi (Vθ ) i=1 t t = φi (φi − γφi+1 ) θ T φi ri+1 − i=1 i=1 At bt = bt − At θ µt (θ) = 0 −1 θ=A b
  • 67. LSTD Properties Computational complexity O(n ) per time step 2 Data efficient Look through all data
  • 68. LSTD Properties Quadratic Computational complexity O(n ) per time step 2 Data efficient Look through all data
  • 70. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  • 72. The New Approach A and b matrices change on each iteration.
  • 73. The New Approach A and b matrices change on each iteration. t t µt (θ) = φi (φi − γφi+1 ) θ T φi ri+1 − i=1 i=1 At bt
  • 74. The New Approach A and b matrices change on each iteration. t t µt (θ) = φi (φi − γφi+1 ) θ T φi ri+1 − i=1 i=1 At bt bt = bt−1 + rt φt ∆bt At = At−1 + φt (φt − γφt+1 ) T ∆At
  • 75. Incremental LSTD µt (θ) = bt − At θ
  • 76. Incremental LSTD µt (θ) = bt − At θ [Geramifard, Bowling, Sutton 06]
  • 77. Incremental LSTD µt (θ) = bt − At θ Fixed θ Fixed A and b [Geramifard, Bowling, Sutton 06]
  • 78. Incremental LSTD µt (θ) = bt − At θ Fixed θ µt (θ) = µt−1 (θ) + ∆bt − (∆At )θ. Fixed A and b [Geramifard, Bowling, Sutton 06]
  • 79. Incremental LSTD µt (θ) = bt − At θ Fixed θ µt (θ) = µt−1 (θ) + ∆bt − (∆At )θ. Fixed A and b µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). [Geramifard, Bowling, Sutton 06]
  • 80. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ).
  • 81. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ?
  • 82. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  • 83. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  • 84. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ?
  • 85. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? Descent in the direction of µt (θ) ? n
  • 86. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? O(n ) 2 Descent in the direction of µt (θ) ? n
  • 87. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? O(n ) 2 Descent in the direction of µt (θ) ?
  • 88. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? O(n ) 2 Descent in the direction of µt (θ) ?
  • 89. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? O(n ) 2 Descent in the direction of µt (θ) ?
  • 90. iLSTD µt (θ t+1 ) = µt (θ t ) − At (∆θ t ). How to change θ ? O(n ) 2 Descent in the direction of µt (θ) ?
  • 91. chosen. iLSTD Algorithm Algorithm 5: iLSTD Co 0 s ← s0 , A ← 0, µ ← 0, t ← 0 1 Initialize θ arbitrarily 2 repeat 3 Take action according to π and observe r, s t←t+1 4 ∆b ← φ(s)r 5 ∆A ← φ(s)(φ(s) − γφ(s ))T 6 A ← A + ∆A 7 µ ← µ + ∆b − (∆A)θ 8 for i from 1 to m do 9 10 j ← choose an index of µ using a dimension selection mechanism θj ← θj + αµj 11 µ ← µ − αµj Aej 12 13 end for 14 s ← s 15 end repeat
  • 92. chosen. iLSTD Algorithm Algorithm 5: iLSTD Co 0 s ← s0 , A ← 0, µ ← 0, t ← 0 1 Initialize θ arbitrarily 2 repeat 3 Take action according to π and observe r, s t←t+1 4 ∆b ← φ(s)r 5 ∆A ← φ(s)(φ(s) − γφ(s ))T 6 A ← A + ∆A 7 µ ← µ + ∆b − (∆A)θ 8 for i from 1 to m do 9 10 j ← choose an index of µ using a dimension selection mechanism θj ← θj + αµj 11 µ ← µ − αµj Aej 12 13 end for 14 s ← s 15 end repeat
  • 93. chosen. iLSTD Algorithm Algorithm 5: iLSTD Co 0 s ← s0 , A ← 0, µ ← 0, t ← 0 1 Initialize θ arbitrarily 2 repeat 3 Take action according to π and observe r, s t←t+1 4 ∆b ← φ(s)r 5 ∆A ← φ(s)(φ(s) − γφ(s ))T 6 A ← A + ∆A 7 µ ← µ + ∆b − (∆A)θ 8 for i from 1 to m do 9 10 j ← choose an index of µ using a dimension selection mechanism θj ← θj + αµj 11 µ ← µ − αµj Aej 12 13 end for 14 s ← s 15 end repeat
  • 94. iLSTD Per-time-step computational complexity O(mn + k ) 2 More data efficient than TD
  • 95. iLSTD Per-time-step computational complexity O(mn + k ) 2 Number of iterations per time step More data efficient than TD
  • 96. iLSTD Per-time-step computational complexity O(mn + k ) 2 Number of features Number of iterations per time step More data efficient than TD
  • 97. iLSTD Per-time-step computational complexity O(mn + k ) 2 Maximum number of active features Number of features Number of iterations per time step More data efficient than TD
  • 98. iLSTD Per-time-step computational complexity O(mn + k ) 2 Linear Maximum number of active features Number of features Number of iterations per time step More data efficient than TD
  • 99. iLSTD Theorem : iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which µt is non-zero are selected in the limit an infinite number of times.
  • 102. Settings Averaged over 30 runs Same random seed for all methods Sparse matrix representation iLSTD Non-zero random selection One descent per iteration
  • 104. Boyan Chain -3 -3 -3 -3 -3 N 5 -3 4 -3 3 -3 2 -2 10 0 -3 -3 1 0 0 0 0 0 0 . . . . . . . . . . . . . . . . . . . . . 0 1 0.75 0.5 0.25 0 0 0 0 0.25 0.5 0.75 1 0 [Boyan 99]
  • 105. Boyan Chain -3 -3 -3 -3 -3 N 5 -3 4 -3 3 -3 2 -2 10 0 -3 -3 1 0 0 0 0 0 0 . . . . . . . . . . . . . . . . . . . . . 0 1 0.75 0.5 0.25 0 0 0 0 0.25 0.5 0.75 1 0 n = 4 (Small) n = 25 (Medium) n = 100 (Large) [Boyan 99]
  • 106. Boyan Chain -3 -3 -3 -3 -3 N 5 -3 4 -3 3 -3 2 -2 10 0 -3 -3 1 0 0 0 0 0 0 . . . . . . . . . . . . . . . . . . . . . 0 1 0.75 0.5 0.25 0 0 0 0 0.25 0.5 0.75 1 0 n = 4 (Small) k=2 n = 25 (Medium) n = 100 (Large) [Boyan 99]
  • 107. Small Boyan Chain 2 0 10 10 RMS of V(s) over all states RMS of V(s) over all states 1 10 iLSTD TD −1 0 10 10 TD LSTD iLSTD −1 LSTD 10 −2 −2 10 10 0 200 400 600 800 1000 0 0.5 1 1.5 2 2.5 3 Episodes Time(s)
  • 108. Small Boyan Chain 2 0 10 10 RMS of V(s) over all states RMS of V(s) over all states 1 10 iLSTD TD −1 0 10 10 TD LSTD iLSTD −1 LSTD 10 −2 −2 10 10 0 200 400 600 800 1000 0 0.5 1 1.5 2 2.5 3 Episodes Time(s)
  • 109. Medium Boyan Chain 2 3 10 10 RMS of V(s) over all states RMS of V(s) over all states 2 10 1 10 1 10 iLSTD 0 iLSTD TD 10 TD 0 10 LSTD LSTD −1 −1 10 10 0 200 400 600 800 1000 0 5 10 15 20 Episodes Time(s)
  • 110. Medium Boyan Chain 2 3 10 10 RMS of V(s) over all states RMS of V(s) over all states 2 10 1 10 1 10 iLSTD 0 iLSTD TD 10 TD 0 10 LSTD LSTD −1 −1 10 10 0 200 400 600 800 1000 0 5 10 15 20 Episodes Time(s)
  • 111. Large Boyan Chain 3 3 10 10 RMS of V(s) over all states 2 RMS of V(s) over all states 2 10 10 TD TD 1 1 10 10 iLSTD LSTD 0 0 10 10 LSTD iLSTD −1 −1 10 10 0 200 400 600 800 1000 0 20 40 60 80 100 Episodes Time(s)
  • 112. Large Boyan Chain 3 3 10 10 RMS of V(s) over all states 2 RMS of V(s) over all states 2 10 10 TD TD 1 1 10 10 iLSTD LSTD 0 0 10 10 LSTD iLSTD −1 −1 10 10 0 200 400 600 800 1000 0 20 40 60 80 100 Episodes Time(s)
  • 113. Large Boyan Chain 3 3 10 10 RMS of V(s) over all states 2 RMS of V(s) over all states 2 10 10 TD TD 1 1 10 10 iLSTD LSTD 0 0 10 10 LSTD iLSTD −1 −1 10 10 0 200 400 600 800 1000 0 20 40 60 80 100 Episodes Time(s)
  • 115. Mountain Car Goal Mountain Car Tile coding n = 10,000 Position = -1 (Easy) k = 10 Position = -.5 (Hard) [For details see RL-Library]
  • 116. Easy Mountain Car 7 6 10 10 TD 6 10 iLSTD 4 10 TD Loss Loss 5 10 TD LSTD 5 LSTD 10 iLSTD Loss Function iLSTD 3 10 4 4 10 10 LSTD 0 50 100 150 200 250 300 350 0 200 400 600 800 1000 Time (s) Episode ∗ Loss = ||b − A θ||2 ∗ 2 10 0 200 400 600 800 1000 Episodes
  • 117. Hard Mountain Car 8 10 7 10 TD 7 10 Loss Loss iLSTD 6 10 LSTD TD 6 LSTD 10 iLSTD 5 5 10 10 0 200 400 600 800 1000 0 200 400 600 800 1000 1200 Episode Time (s)
  • 118. Hard Mountain Car 8 10 7 10 TD 7 10 Loss Loss iLSTD 6 10 LSTD TD 6 LSTD 10 iLSTD 5 5 10 10 0 200 400 600 800 1000 0 200 400 600 800 1000 1200 Episode Time (s)
  • 119. Running Time Constant Linear Exponential 155 154 TD 124 iLSTD 93 Time (ms) LSTD 62 31 34 0 9 7 5 5 2 0 0 0 0 0 0 0 0 m d sy d all ar ar iu Ea Sm H H ed M Boyan Chain Mountain Car
  • 120. Running Time Constant Linear Exponential 155 154 TD 124 iLSTD 93 Time (ms) LSTD 62 31 34 0 9 7 5 5 2 0 0 0 0 0 0 0 0 m d sy d all ar ar iu Ea Sm H H ed M Boyan Chain Mountain Car
  • 122. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  • 127. Greedy Dimension Selection Pick the one with highest value of |µt (i)|
  • 128. Greedy Dimension Selection Pick the one with highest value of |µt (i)| Not proven to converge.
  • 129. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy
  • 130. ε-Greedy Dimension Selection ε : Non-Zero Random (1-ε) : Greedy Convergence proof applies.
  • 131. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random
  • 132. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random
  • 133. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random ψ×m
  • 134. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m
  • 135. Boltzmann Component Selection Boltzmann Distribution + Non-Zero Random Boltzmann Distribution ψ×m Convergence proof applies.
  • 136. Empirical Results ε-Greedy: ε =.1 Boltzmann: ψ = 10^-9, τ = 1
  • 137. Empirical Results 8 10 ε-Greedy: ε =.1 Random Greedy !−Greedy 6 10 Boltzmann: ψ = 10^-9, τ = 1 Boltzmann Error Measure 4 10 2 10 0 10 −2 10 Small Medium Large Easy Hard Boyan Chain Mountain Car
  • 138. Empirical Results 8 10 ε-Greedy: ε =.1 Random Greedy !−Greedy 6 10 Boltzmann: ψ = 10^-9, τ = 1 Boltzmann Error Measure 4 10 2 10 0 10 −2 10 Small Medium Large Easy Hard Boyan Chain Mountain Car
  • 139. Empirical Results 8 10 ε-Greedy: ε =.1 Random Greedy !−Greedy 6 10 Boltzmann: ψ = 10^-9, τ = 1 Boltzmann Error Measure 4 10 2 10 0 10 −2 10 Small Medium Large Easy Hard Boyan Chain Mountain Car
  • 140. Running Time Random Greedy Boltzmann !-greedy 11.0 Clocktime/step (ms) 8.8 6.6 4.4 2.2 0 l e sy d um al ar rg Ea Sm i La H ed M Boyan chain Mountain car
  • 142. Outline Motivation Introduction The New Approach Dimension Selection Conclusion
  • 144. Conclusion LSTD Data Efficiency TD Speed
  • 145. Conclusion LSTD Data Efficiency iLSTD TD Speed
  • 146. Conclusion LSTD Data Efficiency iLSTD TD Speed
  • 147. Conclusion LSTD Data Efficiency iLSTD TD Speed
  • 148. Conclusion LSTD Data Efficiency iLSTD TD Speed
  • 149. Conclusion No learning rate! LSTD Data Efficiency iLSTD TD Speed
  • 150.
  • 153. Questions ... What if someone uses batch-LSTD?
  • 154. Questions ... What if someone uses batch-LSTD? Why iLSTD takes simple descent?
  • 155. Questions ... What if someone uses batch-LSTD? Why iLSTD takes simple descent? Hmm ... What about control?