100 things I know.
    Part I of III


  Reinaldo Uribe M


    Mar. 4, 2012
SMDP Problem Description.

  1. In a Markov Decision Process, a (learning) agent is embedded
  in an environment and takes actions that affect that environment.

       States: s ∈ S.
       Actions: a ∈ A_s;  A = βˆͺ_{s∈S} A_s.
       (Stationary) system dynamics:
       transition from s to s' after taking a, with probability
       P^a_{ss'} = p(s' | s, a).
       Rewards: R^a_{ss'}.  Def. r(s, a) = E[ R^a_{ss'} | s, a ].


  At time t, the agent is in state s_t, takes action a_t, transitions to
  state s_{t+1} and observes reinforcement r_{t+1} with expectation
  r(s_t, a_t).
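
  For concreteness, a minimal sketch (my own, not from the slides) of how such
  dynamics and rewards might be stored and sampled; the two-state chain and the
  action names are made up for illustration.

```python
import random

# Hypothetical two-state MDP, purely for illustration.
# P[s][a] lists (next_state, probability) pairs; R[s][a] is the expected reward r(s, a).
P = {0: {"stay": [(0, 0.9), (1, 0.1)], "go": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)],           "go": [(0, 1.0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 0.5, "go": 0.0}}

def step(s, a):
    """Sample s' ~ p(.|s, a) and return (s', r) with r taken as its expectation r(s, a)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s][a]
```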
SMDP Problem Description.

  2. Policies, value and optimal policies.
      An element Ο€ of the policy space Ξ  indicates what action,
      Ο€(s), to take at each state.
       The value of a policy from a given state, v^Ο€(s), is the expected
       cumulative reward received starting in s and following Ο€:

                  v^Ο€(s) = E[ Ξ£_{t=0}^{∞} Ξ³^t r(s_t, Ο€(s_t)) | s_0 = s, Ο€ ]

      0 < Ξ³ ≀ 1 is a discount factor.
       An optimal policy, Ο€βˆ—, has maximum value at every state:

                      Ο€βˆ—(s) ∈ argmax_{Ο€βˆˆΞ } v^Ο€(s)       βˆ€s

                      vβˆ—(s) = v^{Ο€βˆ—}(s) β‰₯ v^Ο€(s)        βˆ€Ο€ ∈ Ξ 
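
  A minimal sketch (my own illustration) of computing v^Ο€ for a tabular model by
  iterating the Bellman expectation backup; the P, R dictionaries follow the
  hypothetical format sketched above.

```python
def policy_value(P, R, pi, gamma=0.95, tol=1e-8):
    """Iterative policy evaluation: v(s) = r(s, pi(s)) + gamma * E[ v(s') ]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = pi[s]
            new_v = R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:          # stop once values have (numerically) converged
            return v
```

  For example, `policy_value(P, R, {0: "go", 1: "go"})` evaluates the always-"go"
  policy on the toy chain above.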
SMDP Problem Description.



  3. Discount
      Makes infinite-horizon value bounded if rewards are bounded.
      Ostensibly makes rewards received sooner more desirable than
      those received later.
       But exponential terms make analysis awkward and harder...
       ... and Ξ³ has unexpected, undesirable effects, as shown in
       Uribe et al. 2011.

      Therefore, hereon Ξ³ = 1.
      See section Discount, at the end, for discussion.
SMDP Problem Description.


  4. Average reward models.
          A more natural long term measure of optimality exists
      for such cyclical tasks, based on maximizing the average
      reward per action. Mahadevan 1996

             ρ^Ο€(s) = lim_{nβ†’βˆž} (1/n) E[ Ξ£_{t=0}^{n-1} r(s_t, Ο€(s_t)) | s_0 = s, Ο€ ]

   Optimal policy:

                       Οβˆ—(s) β‰₯ ρ^Ο€(s)    βˆ€s, Ο€ ∈ Ξ 

  Remark: All actions equally costly.
SMDP Problem Description

  5. Semi-Markov Decision Process: usual approach, transition
  times.
       The agent is in state s_t and takes action Ο€(s_t) at decision epoch t.
       After an average of N_t units of time, the system evolves to
       state s_{t+1} and the agent observes r_{t+1} with expectation
       r(s_t, Ο€(s_t)).
       In general, N_t = N_t(s_t, a_t, s_{t+1}).
       Gain (of a policy at a state):

                              E[ Ξ£_{t=0}^{n-1} r(s_t, Ο€(s_t)) | s_0 = s, Ο€ ]
          ρ^Ο€(s) = lim_{nβ†’βˆž}  ──────────────────────────────────────────────
                              E[ Ξ£_{t=0}^{n-1} N_t | s_0 = s, Ο€ ]

       Optimizing gain still maximizes average reward per action, but
       actions are no longer equally weighted. (Unless all N_t = 1.)
SMDP Problem Description

  6.a Semi-Markov Decision Process: explicit action costs.
      Taking an action takes time, costs money, or consumes
      energy. (Or any combination thereof)
       Either way, a real-valued cost k_{t+1}, not necessarily related to
       the process rewards.
       Cost can depend on a, s and (less commonly in practice) s'.
       Generally, actions have positive cost. We simply require all
       policies to have positive expected cost.
       W.l.o.g. the magnitude of the smallest nonzero average action
       cost is forced to be unity:

                         |k(a, s)| β‰₯ 1   βˆ€ k(a, s) β‰  0
SMDP Problem Description

  6.b Semi-Markov Decision Process: explicit action costs.
       Cost of a policy from a state:

             c^Ο€(s) = lim_{nβ†’βˆž} E[ Ξ£_{t=0}^{n-1} k(s_t, Ο€(s_t)) | s_0 = s, Ο€ ]

       So c^Ο€(s) > 0   βˆ€ Ο€ ∈ Ξ , s.

       N_t = k(s_t, Ο€(s_t)). Only their definition/interpretation
       changes.
       Gain:
                                v^Ο€(s)/n
                     ρ^Ο€(s)  =  ────────
                                c^Ο€(s)/n
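
   A minimal simulation sketch (my own) of estimating the gain as the ratio of
   accumulated rewards to accumulated costs; `env_step` and `pi` are assumed
   interfaces, and in ARRL every cost is 1 so the denominator is just the number
   of actions taken.

```python
def estimate_gain(env_step, pi, s0, n_steps=100_000):
    """Monte Carlo estimate of rho^pi(s0) = (sum of rewards) / (sum of costs).

    env_step(s, a) is assumed to return (s_next, reward, cost);
    pi is a function mapping states to actions.
    """
    s, total_r, total_c = s0, 0.0, 0.0
    for _ in range(n_steps):
        a = pi(s)
        s, r, c = env_step(s, a)
        total_r += r
        total_c += c
    return total_r / total_c
```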
SMDP Problem Description

  7. Optimality of Ο€βˆ—:

  Ο€βˆ— ∈ Ξ  with gain

                                 E[ Ξ£_{t=0}^{n-1} r(s_t, Ο€βˆ—(s_t)) | s_0 = s, Ο€βˆ— ]     v^{Ο€βˆ—}(s)
   ρ^{Ο€βˆ—}(s) = Οβˆ—(s) = lim_{nβ†’βˆž} ─────────────────────────────────────────────────  =  ─────────
                                 E[ Ξ£_{t=0}^{n-1} k(s_t, Ο€βˆ—(s_t)) | s_0 = s, Ο€βˆ— ]     c^{Ο€βˆ—}(s)

  is optimal if

                        Οβˆ—(s) β‰₯ ρ^Ο€(s)   βˆ€s, Ο€ ∈ Ξ ,

  as it was in ARRL.

  Notice that the optimal policy doesn’t necessarily maximize v^Ο€ or
  minimize c^Ο€; it only optimizes their ratio.
SMDP Problem Description

  8. Policies in ARRL and SMDPs are evaluated using the
  average-adjusted sum of rewards:
           H^Ο€(s) = lim_{nβ†’βˆž} E[ Ξ£_{t=0}^{n-1} ( r(s_t, Ο€(s_t)) βˆ’ ρ^Ο€(s) ) | s_0 = s, Ο€ ]

  Puterman 1994, Abounadi et al. 2001, Ghavamzadeh & Mahadevan 2007




         This signals the existence of bias-optimal policies that, while
         gain-optimal, also maximize the transient rewards received
         before entering the recurrent class.
         We are interested in gain-optimal policies only.
         (It is hard enough...)
SMDP Problem Description


  9. The Unichain Property
      A process is unichain if every policy has a single, unique
      recurrent class.
      I.e., if for every policy, all recurrent states communicate
      with each other.
      All methods rely on the unichain property. (Because, if it
      holds:)
      ρ^Ο€(s) is constant for all s: ρ^Ο€(s) = ρ^Ο€.
      Gain and value expressions simplify. (See next)
      However, deciding if a problem is unichain is NP-Hard.
      Tsitsiklis 2003
SMDP Problem Description

  10. Unichain property under recurrent states.   Feinberg & Yang, 2010

      A state is recurrent if it belongs to a recurrent class of every
      policy.
      A recurrent state can be found, or proven not to exist, in
      polynomial time.
      If a recurrent state exists, determining whether the unichain
      property holds can be done in polynomial time.
      (We are not actually going to do it, since it requires knowledge
      of the system dynamics, but it is good to know!)
      Recurrent states seem useful. In fact, the existence of a recurrent
      state is more critical to our purposes than the unichain
      property.
      Both will be required in principle for our methods/analysis,
      until their necessity is further qualified in section Unichain
      Considerations below.
Intermission
Generic Learning Algorithm
   11. The relevant expressions under our assumptions simplify, losing
   dependence on s_0.

   The following Bellman equation holds for the average-adjusted state
   value:

              H^Ο€(s) = r(s, Ο€(s)) βˆ’ k(s, Ο€(s)) ρ^Ο€ + E_Ο€[ H^Ο€(s') ]        (1)

   Ghavamzadeh & Mahadevan 2007



   Reinforcement Learning methods exploit Eq. (1), running the
   process and substituting:
         state values with state-action pair values,
         expected rewards and costs with observed ones,
         ρ^Ο€ with a running estimate,
         H^Ο€(s') with its current estimate.
Generic Learning Algorithm



   12.

   Algorithm 1 Generic SMDP solver
     Initialize
     repeat forever
         Act
         Do RL to find value of current Ο€   Usually 1-step Q-learning
         Update ρ.
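
   A Python sketch of Algorithm 1 as I read it, with hypothetical interfaces for
   the environment, the value update and the gain update; the Ξ΅-greedy behaviour
   policy is my own simple choice.

```python
import random

def generic_smdp_solver(env, rl_value_update, update_gain, n_steps=100_000, eps=0.1):
    """Sketch of Algorithm 1: repeat { act; do RL on current pi; update rho }.

    Assumed (hypothetical) interfaces:
      env.reset() -> s,  env.actions(s) -> list,  env.step(s, a) -> (s_next, r, c)
      rl_value_update(Q, transition, rho)       # e.g. 1-step Q-learning (slide 13)
      update_gain(rho, Q, transition) -> rho    # algorithm-specific (slides 14.a / 14.b)
    """
    Q, rho = {}, 0.0
    s = env.reset()
    for _ in range(n_steps):
        acts = env.actions(s)
        # epsilon-greedy action selection over the current value estimates
        if random.random() < eps:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda b: Q.get((s, b), 0.0))
        s2, r, c = env.step(s, a)
        transition = (s, a, r, c, s2)
        rl_value_update(Q, transition, rho)       # value learning for the current policy
        rho = update_gain(rho, Q, transition)     # gain update
        s = s2
    return Q, rho
```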
Generic Learning Algorithm

   13.
          Model-based state value update:

                     H^{t+1}(s_t) ← max_a [ r(s_t, a) + E_a[ H^t(s_{t+1}) ] ]

              E_a emphasizes that the expected value of the next state depends on
              the action chosen/taken.

          Model-free state-action pair value update:

                Q^{t+1}(s_t, a_t) ← (1 βˆ’ Ξ³_t) Q^t(s_t, a_t)
                                  + Ξ³_t [ r_{t+1} βˆ’ ρ^t c_{t+1} + max_a Q^t(s_{t+1}, a) ]

              In ARRL, c_t = 1  βˆ€t.
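
   As a concrete (hypothetical) instance of the model-free update above: one
   tabular backup with the nudged target r βˆ’ ρc + max_a Q(s', a); `actions` is
   the assumed set of actions available at s', and the constant learning rate is
   a simplification.

```python
def q_update(Q, transition, rho, actions, lr=0.1):
    """Average-adjusted 1-step Q-learning:
       Q(s, a) <- (1 - lr) Q(s, a) + lr [ r - rho * c + max_b Q(s', b) ]."""
    s, a, r, c, s2 = transition
    best_next = max((Q.get((s2, b), 0.0) for b in actions), default=0.0)
    Q[(s, a)] = (1 - lr) * Q.get((s, a), 0.0) + lr * (r - rho * c + best_next)
```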
Generic Learning Algorithm
   14.a Table of algorithms. ARRL

     Algorithm                            Gain update

     AAC                                            Ξ£_{i=0}^{t} r(s_i, Ο€^i(s_i))
     Jalali and Ferguson 1989             ρ^{t+1} ← ────────────────────────────
                                                               t + 1

     R–Learning                           ρ^{t+1} ← (1 βˆ’ Ξ±) ρ^t
     Schwartz 1993                                  + Ξ± [ r_{t+1} + max_a Q^t(s_{t+1}, a) βˆ’ max_a Q^t(s_t, a) ]

     H–Learning                           ρ^{t+1} ← (1 βˆ’ Ξ±_t) ρ^t + Ξ±_t [ r_{t+1} βˆ’ H^t(s_t) + H^t(s_{t+1}) ]
     Tadepalli and Ok 1998                Ξ±_{t+1} ← Ξ±_t / (Ξ±_t + 1)

     SSP Q-Learning                       ρ^{t+1} ← ρ^t + Ξ±_t min_a Q^t(sΜ‚, a)
     Abounadi et al. 2001

     HAR                                            Ξ£_{i=0}^{t} r(s_i, Ο€^i(s_i))
     Ghavamzadeh and Mahadevan 2007       ρ^{t+1} ← ────────────────────────────
                                                               t + 1
Generic Learning Algorithm


   14.b Table of algorithms. SMDPRL


     Algorithm                            Gain update

     SMART                                          Ξ£_{i=0}^{t} r(s_i, Ο€^i(s_i))
     Das et al. 1999                      ρ^{t+1} ← ────────────────────────────
     MAX-Q                                          Ξ£_{i=0}^{t} c(s_i, Ο€^i(s_i))
     Ghavamzadeh and Mahadevan 2001
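
    A minimal sketch (my own) of the empirical-ratio gain update shared by
    SMART/MAX-Q (and, with unit costs, by AAC and HAR): keep running sums and
    set ρ to their ratio.

```python
class RatioGain:
    """Running gain estimate rho = (sum of rewards) / (sum of costs).
    With every cost equal to 1 this is the average reward per action (ARRL case)."""
    def __init__(self):
        self.sum_r = 0.0
        self.sum_c = 0.0

    def update(self, r, c=1.0):
        self.sum_r += r
        self.sum_c += c
        return self.sum_r / self.sum_c
```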
SSP Q-Learning
  15. Stochastic Shortest Path Q-Learning
      Most interesting. ARRL
      If unichain and there exists a recurrent state sΜ‚ (Assumption 2.1):

               SSP Q-learning is based on the observation that
           the average cost under any stationary policy is
           simply the ratio of expected total cost and expected
           time between two successive visits to the reference
           state [sΜ‚]

      Thus, they propose (after Bertsekas 1998) making the process
      episodic, splitting sΜ‚ into the (unique) initial and terminal
      states.
      If the Assumption holds, termination has probability 1.
      Only the value/cost of the initial state are important.
      The optimal solution β€œcan be shown to happen” when H(sΜ‚) = 0.
      (See next section)
SSP Q-Learning
  16. SSPQ ρ update.

                      ρ^{t+1} ← ρ^t + Ξ±_t min_a Q^t(sΜ‚, a),

  where

                      Ξ£_t Ξ±_t β†’ ∞;      Ξ£_t Ξ±_t^2 < ∞.

  But it is hard to prove boundedness of {ρ^t}, so it is suggested instead that

                      ρ^{t+1} ← Ξ“[ ρ^t + Ξ±_t min_a Q^t(sΜ‚, a) ],

  with Ξ“(Β·) a projection onto [βˆ’K, K] and Οβˆ— ∈ (βˆ’K, K).
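
   A small sketch (my own, assuming a tabular Q and a known action set) of the
   projected SSPQ gain update, with clipping to [βˆ’K, K] playing the role of Ξ“:

```python
def sspq_gain_update(rho, Q, s_hat, actions, alpha_t, K=100.0):
    """rho <- Gamma( rho + alpha_t * min_a Q(s_hat, a) ), Gamma = clipping to [-K, K].
    K is assumed large enough that the optimal gain lies in (-K, K)."""
    m = min(Q.get((s_hat, a), 0.0) for a in actions)
    return max(-K, min(K, rho + alpha_t * m))
```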
A Critique


   17. Complexity.
       Unknown.
       While RL is PAC.


   18. Convergence.
        Not always guaranteed (e.g. R-Learning).
        When proven, it is asymptotic:
        convergence to the optimal policy/value if all state-action
        pairs are visited infinitely often.
        Proofs usually depend on decaying learning rates, which
        make learning even slower.
A Critique


   19. Convergence of ρ updates.
            ... while the second β€œslow” iteration gradually guides
       [ρt ] to the desired value.
          Abounadi et al. 2001




        It is the slow one!
        It must be, so that the value of the current policy is
        approximated well enough for improvement.
        It is initially biased towards the (likely poor) returns
        observed at the start.
        A long time must probably pass following the optimal policy
        for ρ to converge to its actual value.
Our method

  20.
        Favours an understanding of the βˆ’Ο term, either alone in
        ARRL or as a factor of costs in SMDPs, not so much as an
        approximation to average rewards but as a punishment for
        taking actions, which must be made β€œworth it” by the rewards.
        I.e. nudging.
         Exploits the splitting of SSP Q-Learning, in order to focus on
         the value/cost of a single state, sΜ‚.
         Thus, it also assumes the existence of a recurrent state, and
         that the unichain property holds. (For the time being)

         Attempts to ensure accelerated convergence of the ρ updates,
         in a context in which certain, efficient convergence can be
         easily introduced.
Intermission
Fractional programming

    21. So, β€˜Bertsekas splitting’ of sΜ‚ into initial s_I and terminal s_T.
    Then, from s_I
         Any policy Ο€ ∈ Ξ  has an expected return until termination
         v^Ο€(s_I),
         and an expected cost until termination c^Ο€(s_I).
         The ARRL problem, then, becomes  max_{Ο€βˆˆΞ }  v^Ο€(s_I) / c^Ο€(s_I)


    Lemma

               argmax_{Ο€βˆˆΞ }  v^Ο€(s_I) / c^Ο€(s_I)  =  argmax_{Ο€βˆˆΞ }  [ v^Ο€(s_I) + Οβˆ— (βˆ’c^Ο€(s_I)) ]

    for Οβˆ— such that  max_{Ο€βˆˆΞ } [ v^Ο€(s_I) + Οβˆ— (βˆ’c^Ο€(s_I)) ] = 0
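
    A short sketch of why the Lemma holds (the standard fractional-programming
    argument, in my own wording): since every policy has positive expected cost,

```latex
\frac{v^\pi(s_I)}{c^\pi(s_I)} \;\le\; \rho^\ast
\quad\Longleftrightarrow\quad
v^\pi(s_I) - \rho^\ast\, c^\pi(s_I) \;\le\; 0,
\qquad \rho^\ast := \max_{\pi\in\Pi}\frac{v^\pi(s_I)}{c^\pi(s_I)},
```

    with equality exactly for the policies attaining the maximum ratio. Hence
    max_{Ο€βˆˆΞ } [ v^Ο€(s_I) βˆ’ Οβˆ— c^Ο€(s_I) ] = 0 and the nudged objective and the
    ratio share the same maximizers.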
Fractional programming




   22. Implications.
        Assume the gain Οβˆ— is known.
        Then the nonlinear SMDP problem reduces to RL,
        which is better studied, well understood, simpler, and for
        which sophisticated, efficient algorithms exist.
        If we simply use the reward (r βˆ’ Οβˆ— c)(s, a, s').
        Problem: Οβˆ— is usually not known.
Nudging

  23. Idea:
      Separate reinforcement learning (leave it to the pros) from
      updating ρ.
      Thus, value-learning becomes method-free.
      We can use any old RL method.

      Gain update is actually the most critical step.
      Punish too little, and the agent will not care about hurrying,
      only about collecting reward.
      Punish too much, and the agent will only care about finishing
      already.

      In that sense, (r βˆ’ ρc) is like picking fruit inside a maze.
Nudging



  24. The problem reduces to a sequence of RL problems.
      For a sequence of (temporarily fixed) ρ_k.
      Some of the methods already provide an indication of the sign
      of the ρ updates.
      We just don’t hurry to update ρ after taking a single action.

      Plus, the method comes armed with a termination condition:
      as soon as H^k(s_I) = 0, then Ο€^k = Ο€βˆ—.
Nudging



  25.

  Algorithm 2 Nudged SSP Learning
    Initialize
    repeat
         Set reward scheme to (r βˆ’ cρ)
         Solve by any RL method.
         Update ρ                        From current H^Ο€(s_I)
     until H^Ο€(s_I) = 0
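
  A minimal Python sketch of Algorithm 2 (my own reading, with hypothetical
  `solve_rl` and `H_value` helpers and a simple illustrative ρ step, not the
  update proposed later in the talk):

```python
def nudged_ssp_learning(env, solve_rl, H_value, rho0=0.0, step=0.1, tol=1e-6, max_outer=100):
    """Sketch of Algorithm 2 (Nudged SSP Learning).

    Assumed (hypothetical) interfaces:
      solve_rl(env, rho) -> (policy, value_fn)   # any episodic RL method on reward r - rho*c
      H_value(value_fn, s) -> float              # average-adjusted value of state s
    """
    rho, policy = rho0, None
    for _ in range(max_outer):
        policy, value_fn = solve_rl(env, rho)      # inner RL solve with nudged reward r - rho*c
        H_sI = H_value(value_fn, env.s_I)
        if abs(H_sI) < tol:                        # H(s_I) = 0  =>  current policy is gain optimal
            break
        rho += step * H_sI                         # nudge rho up while H(s_I) > 0, down if negative
    return policy, rho
```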
w βˆ’ l space


   26. D.
   We will propose a method for updating ρ and show that it
   minimizes uncertainty between steps. For that, we will use a
   transformation that extends the work of our CIG paper. But first:

   Let D be a bound on the magnitude of the unnudged reward:

                       D β‰₯ sup_{Ο€βˆˆΞ } { H^Ο€(s_I) | ρ = 0 }
                      βˆ’D ≀ inf_{Ο€βˆˆΞ } { H^Ο€(s_I) | ρ = 0 }

   Observe that the interval (βˆ’D, D) bounds Οβˆ—, but the upper bound is tight
   only in ARRL, when all of the D reward is received in a single step from s_I.
w βˆ’ l space



    27. All policies Ο€ ∈ Ξ , from (that is, at) s_I, have:
        real expected value |v^Ο€(s_I)| ≀ D.
        positive cost c^Ο€(s_I) β‰₯ 1.



    28.a w βˆ’ l transformation:

                  D + v^Ο€(s_I)                D βˆ’ v^Ο€(s_I)
            w  =  ─────────────         l  =  ─────────────
                   2 c^Ο€(s_I)                  2 c^Ο€(s_I)
w βˆ’ l space

    28.b w βˆ’ l plane.

    [Figure: the w βˆ’ l plane; both the w and l axes range from 0 to D.]
w βˆ’ l space

    29. Properties:
        w, l β‰₯ 0
        w, l ≀ D
        w + l = D / c^Ο€(s_I) ≀ D
        v^Ο€(s_I) = D    β‡’   l = 0
        v^Ο€(s_I) = βˆ’D   β‡’   w = 0
        lim_{c^Ο€(s_I)β†’βˆž} (w, l) = (0, 0)


    30. Inverse transformation:

                        w^Ο€ βˆ’ l^Ο€                          1
          v^Ο€(s_I) = D ───────────        c^Ο€(s_I) = D ───────────
                        w^Ο€ + l^Ο€                       w^Ο€ + l^Ο€
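
    A small sketch (my own) of the transformation and its inverse, with a
    round-trip check on made-up numbers:

```python
def to_wl(v, c, D):
    """Map (v, c) at s_I to the w-l plane: w = (D + v) / (2c), l = (D - v) / (2c)."""
    return (D + v) / (2 * c), (D - v) / (2 * c)

def from_wl(w, l, D):
    """Inverse map: v = D (w - l) / (w + l), c = D / (w + l)."""
    return D * (w - l) / (w + l), D / (w + l)

# Round-trip check on arbitrary (made-up) numbers.
D, v, c = 10.0, 3.0, 4.0
w, l = to_wl(v, c, D)
assert abs(from_wl(w, l, D)[0] - v) < 1e-9
assert abs(from_wl(w, l, D)[1] - c) < 1e-9
```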
Intermission
w βˆ’ l space

  31. Value.

                      w^Ο€ βˆ’ l^Ο€
      v^Ο€(s_I) = D ─────────────
                      w^Ο€ + l^Ο€

      Level sets are lines.
      w–axis, expected D.
      l–axis, expected βˆ’D.
      w = l, expected 0.
      Optimization β†’ fanning from l = 0.
      Not convex, but splits the space.
      So optimizers are vertices of the convex hull of policies.

      [Figure: value level-set lines in the w βˆ’ l plane, labelled from βˆ’D through βˆ’0.5D, 0 and 0.5D to D.]
w βˆ’ l space

  32. Cost.

                         1
      c^Ο€(s_I) = D ───────────
                    w^Ο€ + l^Ο€

      Level sets are lines with slope βˆ’1.
      w + l = D, expected cost 1.
      Cost decreases with distance to the origin.
      Cost optimizers (both max and min) are also vertices.

      [Figure: cost level-set lines in the w βˆ’ l plane, labelled 1, 2, 4 and 8.]
w βˆ’ l space




    33. The origin.
        Corresponds to policies of infinite expected cost.
        These mean the problem is not unichain or s_I is not recurrent.
        And they are troublesome for optimizing value.

        So, under our assumptions, the origin does not belong to the
        space.
Nudged value in the w βˆ’ l space
   34. SMDP problem in w βˆ’ l.

             v^Ο€(s_I)             D (w^Ο€ βˆ’ l^Ο€)/(w^Ο€ + l^Ο€)
     argmax ──────────  =  argmax ──────────────────────────  =  argmax ( w^Ο€ βˆ’ l^Ο€ )
      Ο€βˆˆΞ    c^Ο€(s_I)        Ο€βˆˆΞ    D / (w^Ο€ + l^Ο€)                Ο€βˆˆΞ 


   [Figure: cloud of policies in the w βˆ’ l plane, with level-set lines labelled βˆ’D/2, 0 and D/2.]
Nudged value in the w βˆ’ l space



   35. Nudged value, for some ρ.

                    argmax  [ v^Ο€(s_I) βˆ’ ρ c^Ο€(s_I) ]
                     Ο€βˆˆΞ 

                                  w^Ο€ βˆ’ l^Ο€            1
                  = argmax  [ D ───────────  βˆ’  ρ D ─────────── ]
                     Ο€βˆˆΞ          w^Ο€ + l^Ο€           w^Ο€ + l^Ο€

                                  w^Ο€ βˆ’ l^Ο€ βˆ’ ρ
                  = argmax    D ────────────────
                     Ο€βˆˆΞ            w^Ο€ + l^Ο€
Nudged value in the w βˆ’ l space



   36. Nudged value level sets
   (For a set ρ and all policies Ο€ with a given nudged value hΜ‚)

                   D βˆ’ hΜ‚             D
          l^Ο€  =  ────────  w^Ο€  βˆ’  ──────── ρ
                   D + hΜ‚            D + hΜ‚

   Lines!

   Slope depends only on hΜ‚ (i.e., not on ρ)
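
   The line formula follows directly from the nudged value on the previous
   slide; a one-line derivation (my own algebra):

```latex
\hat h \;=\; D\,\frac{w - l - \rho}{w + l}
\;\Longrightarrow\;
\hat h\,(w + l) \;=\; D\,(w - l - \rho)
\;\Longrightarrow\;
l \;=\; \frac{D - \hat h}{D + \hat h}\,w \;-\; \frac{D}{D + \hat h}\,\rho .
```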
Nudged value in the w βˆ’ l space




   37. Pencil of lines
   For a set ρ, any two hΜ‚ and ȟ level-set lines have intersection:

                                  ( ρ/2 , βˆ’Ο/2 )

   Pencil of lines with that vertex.
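
   A quick check (my own) that (ρ/2, βˆ’Ο/2) lies on every level-set line, whatever hΜ‚:

```latex
\left.\;\frac{D - \hat h}{D + \hat h}\,w - \frac{D}{D + \hat h}\,\rho\;\right|_{w=\rho/2}
\;=\; \frac{(D - \hat h)\,\rho - 2D\rho}{2\,(D + \hat h)}
\;=\; -\frac{\rho}{2}\,,
\qquad\text{independently of } \hat h .
```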
Nudged value in the w βˆ’ l space


   38. Zero nudged value.

                 D βˆ’ 0           D
        l^Ο€  =  ─────── w^Ο€  βˆ’  ───── ρ
                 D + 0          D + 0

        l^Ο€  =  w^Ο€ βˆ’ ρ

     Unity slope.
     Negative values above, positive below.
     If the whole cloud is above w = l, some negative nudging is the
     optimizer. (Encouragement)

     [Figure: the zero nudged-value line in the w βˆ’ l plane, passing through the pencil vertex (ρ/2, βˆ’Ο/2).]
Nudged value in the w βˆ’ l space


    [Figure: the policy cloud in the w βˆ’ l plane.]
                                        ● ●
                                         ●     ●
                                               ●     ●● ●   ●




                                                                    w   D
Nudged value in the w βˆ’ l space




   40. Initial bounds on Οβˆ— .

                                βˆ’D ≀ Οβˆ— ≀ D

   (Duh! but nice geometry)
Enclosing triangle

 41. Definition.
   ABC such that:
       ABC βŠ‚ w βˆ’ l space.
       (wβˆ— , lβˆ— ) ∈ ABC.
       Slope of AB segment, unity.
       wA ≀ wB
       wA ≀ wC

   42. Nomenclature.
   [Figure: enclosing triangle ABC in the w βˆ’ l plane, segment slopes labelled mΞ± , mΞ² and mΞ³ ; both axes bounded by D.]
Enclosing Triangle


   43. (New) bounds on ρ.
   Def. XΞΆ , the slope-mΞΆ projection of a point X(wX , lX ) onto the w = βˆ’l line:

                           XΞΆ = (mΞΆ wX βˆ’ lX ) / (mΞΆ + 1)

   Bounds:

                            AΞ± = BΞ± ≀ Οβˆ— ≀ CΞ±
                  wA βˆ’ lA = wB βˆ’ lB ≀ Οβˆ— ≀ wC βˆ’ lC


   44. So, collinearity (of A, B and C) implies optimality.
   (Even if there are multiple optima)
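   A minimal Python sketch of the slope projection in item 43 and the checks in
   items 43-44; the triangle is assumed to be given by the (w, l) coordinates of
   its vertices, and all function names are mine (illustrative, not from the slides):

       def slope_projection(w_x, l_x, m_zeta):
           """Project X = (w_x, l_x) along slope m_zeta onto the w = -l line (item 43)."""
           return (m_zeta * w_x - l_x) / (m_zeta + 1.0)

       def rho_bounds(A, B, C):
           """Bounds on rho* from an enclosing triangle; A, B, C are (w, l) pairs."""
           (wA, lA), (wB, lB), (wC, lC) = A, B, C
           lower = wA - lA           # = wB - lB, because the AB segment has unit slope
           upper = wC - lC
           return lower, upper

       def is_collinear(A, B, C, tol=1e-9):
           """Collinearity of A, B and C signals optimality (item 44)."""
           (wA, lA), (wB, lB), (wC, lC) = A, B, C
           cross = (wB - wA) * (lC - lA) - (lB - lA) * (wC - wA)
           return abs(cross) < tol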
Right and left uncertainty



   45. Iterating inside an enclosing triangle.
     1   Set ρ̂ to some value within the bounds
         (wA βˆ’ lA ≀ ρ̂ ≀ wC βˆ’ lC ).
     2   Solve problem with rewards (r βˆ’ ρ̂ c).


   46. Optimality.
   If h(sI ) = 0:
   Done!
   The optimal policy of the current (nudged) problem solves the SMDP and
   the termination condition has been met.
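   A schematic sketch of the loop in items 45-46, assuming a hypothetical black-box
   solver solve_rl(rho_hat) that runs any RL method on the episodic problem with
   nudged rewards (r βˆ’ ρ̂ c) and returns the resulting h(sI ), and a hypothetical
   pick_rho(lo, hi) that chooses the next ρ̂ (e.g. by optimal nudging, below):

       def nudged_ssp(rho_lo, rho_hi, solve_rl, pick_rho, tol=1e-6):
           while True:
               rho_hat = pick_rho(rho_lo, rho_hi)   # choose a nudge within the bounds
               h_sI = solve_rl(rho_hat)             # RL with rewards (r - rho_hat * c)
               if abs(h_sI) < tol:
                   return rho_hat                   # item 46: current policy solves the SMDP
               elif h_sI > 0:
                   rho_lo = rho_hat                 # gain exceeds rho_hat: right uncertainty
               else:
                   rho_hi = rho_hat                 # gain below rho_hat: left uncertainty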
Right and left uncertainty

   47.a If h(sI ) > 0
   Right uncertainty.


                        [Figure: enclosing triangle ABC with auxiliary points S and T illustrating the right uncertainty y1 .]
Right and left uncertainty

   47.b Right uncertainty.
   Derivation:

          y1 = SΞ± βˆ’ TΞ±
             = (1/2) ( (1 βˆ’ mΞ² ) wS βˆ’ (1 βˆ’ mΞ³ ) wT βˆ’ (mΞ³ βˆ’ mΞ² ) wC )

   Maximization:

    y1βˆ— = ( 2s √( ab (ρ/2 βˆ’ CΞ² )(ρ/2 βˆ’ CΞ³ ) ) + a (ρ/2 βˆ’ CΞ² ) + b (ρ/2 βˆ’ CΞ³ ) ) / c

    s = sign(mΞ² βˆ’ mΞ³ )
    a = (1 βˆ’ mΞ³ )(mΞ² + 1)
    b = (1 βˆ’ mΞ² )(mΞ³ + 1)
    c = b βˆ’ a = 2(mΞ³ βˆ’ mΞ² )
Right and left uncertainty

   48.a If h(sI ) < 0
   Left uncertainty.


                             [Figure: enclosing triangle ABC with auxiliary points R and Q illustrating the left uncertainty y2 .]
Right and left uncertainty




   48.b Left uncertainty.
   Is maximal where expected.
   (When the value level set crosses B)

                     y2 = RΞ± βˆ’ QΞ±
                     y2βˆ— = ( (ρ/2 βˆ’ BΞ± ) / (ρ/2 βˆ’ BΞ³ ) ) (BΞ± βˆ’ BΞ³ )
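   A direct Python transcription of the two maximization formulas (items 47.b and
   48.b). The square root inside y1βˆ— is my reading of the extracted equation, and
   the slope projections CΞ² , CΞ³ , BΞ± , BΞ³ are assumed to be precomputed with the
   projection of item 43:

       import math

       def max_right_uncertainty(rho, m_beta, m_gamma, C_beta, C_gamma):
           s = math.copysign(1.0, m_beta - m_gamma)
           a = (1 - m_gamma) * (m_beta + 1)
           b = (1 - m_beta) * (m_gamma + 1)
           c = b - a                                  # = 2 * (m_gamma - m_beta)
           x, y = rho / 2 - C_beta, rho / 2 - C_gamma
           return (2 * s * math.sqrt(a * b * x * y) + a * x + b * y) / c

       def max_left_uncertainty(rho, B_alpha, B_gamma):
           return (rho / 2 - B_alpha) / (rho / 2 - B_gamma) * (B_alpha - B_gamma)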
Right and left uncertainty




   49. Fundamental lemma.

   As ρ̂ grows, the maximal right uncertainty is monotonically
   decreasing and the maximal left uncertainty is monotonically
   increasing, and both are non-negative with minimum 0.
Optimal nudging

   50.
         Find ρ̂ (between the bounds, obviously) such that the
         maximum resulting uncertainty, either left or right, is minimal.
         Since both are monotonic and have minimum 0, the minimax is
         attained when the maximal left and right uncertainties are equal.
         Remark: bear in mind this (↑) is the worst case. The iteration can
         terminate immediately.
         ρ̂ estimates the gain, but is neither biased towards observations
         (initial or otherwise) nor slowly updated.

         Optimal nudging is β€œoptimal” in the sense that with this
         update the maximum uncertainty range of the resulting ρ values is
         minimal.
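   A sketch, under the lemma of item 49, of the optimal-nudging choice: bisect on ρ
   within the current bounds until the maximal right and left uncertainties
   (computed as above) coincide. The slides obtain the same point in closed form by
   intersecting two conics (items 57-61); bisection here is only a stand-in:

       def optimal_nudge(rho_lo, rho_hi, right_unc, left_unc, iters=60):
           # right_unc decreases and left_unc increases in rho (item 49),
           # so their difference changes sign exactly once in [rho_lo, rho_hi].
           lo, hi = rho_lo, rho_hi
           for _ in range(iters):
               mid = 0.5 * (lo + hi)
               if right_unc(mid) > left_unc(mid):
                   lo = mid                 # right uncertainty still dominates
               else:
                   hi = mid
           return 0.5 * (lo + hi)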
Optimal nudging




   51. Enclosing triangle into enclosing triangle.


   52. Strictly smaller (both in area and, importantly, in resulting
   uncertainty).
Obtaining an initial enclosing triangle


   53. Setting ρ(0) = 0 and solving.
       Maximizes reward irrespective of cost. (Usual RL problem)
       Can be interpreted geometrically as fanning from the w axis
       to find the policy whose (w, l) coordinates subtend the
       smallest angle.
       The resulting optimizer maps to a point somewhere along a
       line with intercept at the origin.

   54. The optimum of the SMDP problem lies on or above that line, never
   behind it.
   Otherwise, contradiction.
Obtaining an initial enclosing triangle




   56. Either way, after iteration 0, the uncertainty is reduced by at least half.
Conic intersection


   57. Maximum right uncertainty is a conic!

                     [  c            βˆ’(b + a)                  βˆ’CΞ± c          ]  [ r   ]
      [ r  y1βˆ—  1 ]  [ βˆ’(b + a)       c                         CΞ² a + CΞ³ b   ]  [ y1βˆ— ]  = 0
                     [ βˆ’CΞ± c          CΞ² a + CΞ³ b               CΞ±Β² c         ]  [ 1   ]


   58. Maximum left uncertainty is a conic!

                     [  0             1             BΞ³ βˆ’ CΞ³             ]  [ r   ]
      [ r  y2βˆ—  1 ]  [  1             0            βˆ’BΞ³                  ]  [ y2βˆ— ]  = 0
                     [  BΞ³ βˆ’ CΞ³      βˆ’BΞ³           βˆ’2BΞ± (BΞ³ βˆ’ CΞ³ )      ]  [ 1   ]
Conic intersection




   59. Intersecting them is easy.

   60. And cheap. (Requires in principle constant time and simple
   matrix operations)

   61. So plug it in!
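   A minimal numpy sketch of the two conic matrices of items 57-58 and of the
   quadratic-form check that a pair (r, yβˆ— ) lies on a conic; a, b, c and the
   projections are as in the previous sketches. This is only the representation,
   not the (constant-time) intersection routine the slides refer to:

       import numpy as np

       def right_conic(a, b, c, C_alpha, C_beta, C_gamma):
           return np.array([
               [ c,            -(b + a),                  -C_alpha * c             ],
               [-(b + a),       c,                         C_beta * a + C_gamma * b],
               [-C_alpha * c,   C_beta * a + C_gamma * b,  C_alpha**2 * c          ],
           ])

       def left_conic(B_alpha, B_gamma, C_gamma):
           return np.array([
               [0.0,                1.0,      B_gamma - C_gamma                 ],
               [1.0,                0.0,     -B_gamma                           ],
               [B_gamma - C_gamma, -B_gamma, -2 * B_alpha * (B_gamma - C_gamma)],
           ])

       def on_conic(Q, r, y, tol=1e-8):
           v = np.array([r, y, 1.0])
           return abs(v @ Q @ v) < tol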
Termination Criteria



   62.
         We want to reduce uncertainty to Ξ΅.
         Because it is a good idea. (Right?)
         So there’s your termination condition right there.

   63. Alternatively, stop when |h(k) (sI )| < Ξ΅.

   64. In any case, if the same policy remains optimal and the sign of
   its nudged value changes between iterations, stop:
   It is the optimal solution of the SMDP problem.
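   The three stopping rules of items 62-64, gathered in a small sketch (names are
   illustrative; the uncertainty is the current width of the ρ bounds):

       def should_stop(uncertainty, h_sI, prev_h_sI, same_policy, eps):
           if uncertainty < eps:                       # item 62
               return True
           if abs(h_sI) < eps:                         # item 63
               return True
           if same_policy and h_sI * prev_h_sI < 0:    # item 64: sign change, same policy
               return True
           return False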
Finding D



  65. A quick and dirty method:
    1   Maximize cost (or episode length, when all costs equal 1).
    2   Multiply by the largest unsigned reinforcement.
  66. So, at most one more RL problem.

  67. If D is estimated too large, wider initial bounds and longer
  computation, but ok.
  68. If D is estimated too small (by other methods, of course),
  points outside the triangle in w βˆ’ l space. (But where?)
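   A two-line sketch of the bound in items 65-66, assuming a hypothetical solver
   max_expected_cost() for the auxiliary RL problem (maximize cost, or episode
   length when every cost is 1) and an iterable rewards of the possible one-step
   reinforcements:

       def quick_and_dirty_D(max_expected_cost, rewards):
           # upper-bounds |v(sI)| for every policy, hence a valid D
           return max_expected_cost() * max(abs(r) for r in rewards)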
Recurring state + unichain considerations

   69. Feinberg and Yang: Deciding whether the unichain condition
   holds can be done in polynomial time if a recurring state exists.

   70. Existence of a recurring state is common in practice.

   71. (Future work) It can maybe be induced using Ρ–MDPs.
   (Maybe).

   72. At least one case in which lacking the unichain property is not a problem: games.
       Certainty of positive policies.
       Non-positive chains.

   73. Happens! (See experiments)
Complexity



   74. Discounted RL is PAC (–efficient).

   75. In the problem size parameters (|S|, |A|) and 1/(1 βˆ’ Ξ³).

   76. Episodic undiscounted RL is also PAC.
   (Following similar arguments, but slightly more intricate
   derivations)

   77. So we call a PAC (–efficient) method a number of times.
Complexity



   78. The most worstest case foreverest: the chosen ρ(k) does not reduce the
   uncertainty at all.

   79. Reducing it in half is a better bound for our method.

   80. ... and it is a tight bound...

   81. ... in cases that are nearly optimal from the outset.

   82. So, at worst, log(1/Ξ΅) calls to a PAC method:
   PAC!
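   A brief worked version of the claim, under the halving guarantee of items 56
   and 79 and the initial bounds of item 40 (uncertainty at most 2D):

       2D / 2^k ≀ Ξ΅   if and only if   k β‰₯ logβ‚‚ (2D / Ξ΅) = O(log(1/Ξ΅)),

   so O(log(1/Ξ΅)) calls to the PAC RL solver suffice.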
Complexity




   83. Whoops, we proved complexity! That’s a first for SMDP
   (or ARRL, for that matter).


   84. And we inherit convergence from invoked RL, so there’s
   also that.
Typically much faster



   85. Worst case happens when we are β€œalready there”.

   86. Otherwise, depends, but certainly better.

   87. The multi-iteration reduction in uncertainty is much better than a factor
   of 0.5, because the per-iteration factors accumulate geometrically.

   88. Empirical complexity better than the already very good upper
   bound.
Bibliography I




   S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine
       Learning, 22(1):159–195, 1996.
   Reinaldo Uribe, Fernando Lozano, Katsunari Shibata, and Charles Anderson. Discount and speed/execution
       tradeoffs in Markov decision process games. In Computational Intelligence and Games (CIG), 2011 IEEE
       Conference on, pages 79–86. IEEE, 2011.

More Related Content

What's hot

Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
Fabian Pedregosa
Β 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
PK Lehre
Β 
Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility
African Growth and Development Policy (AGRODEP) Modeling Consortium
Β 
The Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysisThe Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysis
Daniel Bruggisser
Β 
Slides universitΓ© Laval, Actuariat, Avril 2011
Slides universitΓ© Laval, Actuariat, Avril 2011Slides universitΓ© Laval, Actuariat, Avril 2011
Slides universitΓ© Laval, Actuariat, Avril 2011Arthur Charpentier
Β 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
OptimalpolicyhandoutNBER
Β 
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climate
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climateMartin Roth: A spatial peaks-over-threshold model in a nonstationary climate
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climateJiΕ™Γ­ Ε mΓ­da
Β 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Marketsguasoni
Β 
Chester Nov08 Terry Lynch
Chester Nov08 Terry LynchChester Nov08 Terry Lynch
Chester Nov08 Terry LynchTerry Lynch
Β 
MM framework for RL
MM framework for RLMM framework for RL
MM framework for RL
Sung Yub Kim
Β 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse Problems
Gabriel PeyrΓ©
Β 
SLC 2015 talk improved version
SLC 2015 talk improved versionSLC 2015 talk improved version
SLC 2015 talk improved version
Zheng Mengdi
Β 
Exchange confirm
Exchange confirmExchange confirm
Exchange confirmNBER
Β 
PAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ WarwickPAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ Warwick
Pierre Jacob
Β 
Savage-Dickey paradox
Savage-Dickey paradoxSavage-Dickey paradox
Savage-Dickey paradox
Christian Robert
Β 
Lecture on nk [compatibility mode]
Lecture on nk [compatibility mode]Lecture on nk [compatibility mode]
Lecture on nk [compatibility mode]NBER
Β 
some thoughts on divergent series
some thoughts on divergent seriessome thoughts on divergent series
some thoughts on divergent series
genius98
Β 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
Christian Robert
Β 

What's hot (20)

Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
Β 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
Β 
Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility
Β 
The Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysisThe Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysis
Β 
Slides universitΓ© Laval, Actuariat, Avril 2011
Slides universitΓ© Laval, Actuariat, Avril 2011Slides universitΓ© Laval, Actuariat, Avril 2011
Slides universitΓ© Laval, Actuariat, Avril 2011
Β 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
Optimalpolicyhandout
Β 
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climate
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climateMartin Roth: A spatial peaks-over-threshold model in a nonstationary climate
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climate
Β 
Relaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete MarketsRelaxed Utility Maximization in Complete Markets
Relaxed Utility Maximization in Complete Markets
Β 
Chester Nov08 Terry Lynch
Chester Nov08 Terry LynchChester Nov08 Terry Lynch
Chester Nov08 Terry Lynch
Β 
MM framework for RL
MM framework for RLMM framework for RL
MM framework for RL
Β 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse Problems
Β 
SLC 2015 talk improved version
SLC 2015 talk improved versionSLC 2015 talk improved version
SLC 2015 talk improved version
Β 
Exchange confirm
Exchange confirmExchange confirm
Exchange confirm
Β 
PAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ WarwickPAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ Warwick
Β 
Slides smart-2015
Slides smart-2015Slides smart-2015
Slides smart-2015
Β 
Savage-Dickey paradox
Savage-Dickey paradoxSavage-Dickey paradox
Savage-Dickey paradox
Β 
Lecture on nk [compatibility mode]
Lecture on nk [compatibility mode]Lecture on nk [compatibility mode]
Lecture on nk [compatibility mode]
Β 
some thoughts on divergent series
some thoughts on divergent seriessome thoughts on divergent series
some thoughts on divergent series
Β 
1
11
1
Β 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
Β 

Viewers also liked

Hardware hacking for software people
Hardware hacking for software peopleHardware hacking for software people
Hardware hacking for software people
Dobrica Pavlinuőić
Β 
SRS-PDR-CDR_Rev_1_3
SRS-PDR-CDR_Rev_1_3SRS-PDR-CDR_Rev_1_3
SRS-PDR-CDR_Rev_1_3Zack Lyman
Β 
Between winning slow and losing fast.
Between winning slow and losing fast.Between winning slow and losing fast.
Between winning slow and losing fast.
r-uribe
Β 
Punishment and Grace: On the Economics of Tax Amnesties
Punishment and Grace: On the Economics of Tax AmnestiesPunishment and Grace: On the Economics of Tax Amnesties
Punishment and Grace: On the Economics of Tax Amnesties
Nugroho Adi
Β 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radio
Eueung Mulyana
Β 
Capital Asset Pricing Model (CAPM)
Capital Asset Pricing Model (CAPM)Capital Asset Pricing Model (CAPM)
Capital Asset Pricing Model (CAPM)
Heickal Pradinanta
Β 
Financial Crime Compliance at Standard Chartered
Financial Crime Compliance at Standard CharteredFinancial Crime Compliance at Standard Chartered
Financial Crime Compliance at Standard Chartered
TEDxMongKok
Β 
A Glimpse into Developing Software-Defined Radio by Python
A Glimpse into Developing Software-Defined Radio by PythonA Glimpse into Developing Software-Defined Radio by Python
A Glimpse into Developing Software-Defined Radio by Python
Albert Huang
Β 
Economic crimes
Economic crimesEconomic crimes
Economic crimes
Konstantin Eryomin
Β 
Financial Crimes
Financial CrimesFinancial Crimes
Financial Crimes
Animesh Shaw
Β 
Economics of crime model
Economics of crime modelEconomics of crime model
Economics of crime model
Ha Bui
Β 
Raspberry Pi and Amateur Radio
Raspberry Pi and Amateur RadioRaspberry Pi and Amateur Radio
Raspberry Pi and Amateur Radio
Kevin Hooke
Β 
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
OECD Directorate for Financial and Enterprise Affairs
Β 
Capital Asset Pricing Model
Capital Asset Pricing ModelCapital Asset Pricing Model
Capital Asset Pricing Model
Rod Medallon
Β 
British economy presentation
British economy presentationBritish economy presentation
British economy presentation
COLUMDAE
Β 
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
OECD Directorate for Financial and Enterprise Affairs
Β 
British Economy
British EconomyBritish Economy
British Economy
Prateek Lohia
Β 
Micro and Macro Economics
Micro and Macro EconomicsMicro and Macro Economics
Micro and Macro Economics
nabeelhaiderkhan
Β 

Viewers also liked (19)

Hardware hacking for software people
Hardware hacking for software peopleHardware hacking for software people
Hardware hacking for software people
Β 
SRS-PDR-CDR_Rev_1_3
SRS-PDR-CDR_Rev_1_3SRS-PDR-CDR_Rev_1_3
SRS-PDR-CDR_Rev_1_3
Β 
Between winning slow and losing fast.
Between winning slow and losing fast.Between winning slow and losing fast.
Between winning slow and losing fast.
Β 
Punishment and Grace: On the Economics of Tax Amnesties
Punishment and Grace: On the Economics of Tax AmnestiesPunishment and Grace: On the Economics of Tax Amnesties
Punishment and Grace: On the Economics of Tax Amnesties
Β 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radio
Β 
Capital Asset Pricing Model (CAPM)
Capital Asset Pricing Model (CAPM)Capital Asset Pricing Model (CAPM)
Capital Asset Pricing Model (CAPM)
Β 
Financial Crime Compliance at Standard Chartered
Financial Crime Compliance at Standard CharteredFinancial Crime Compliance at Standard Chartered
Financial Crime Compliance at Standard Chartered
Β 
A Glimpse into Developing Software-Defined Radio by Python
A Glimpse into Developing Software-Defined Radio by PythonA Glimpse into Developing Software-Defined Radio by Python
A Glimpse into Developing Software-Defined Radio by Python
Β 
Economic crimes
Economic crimesEconomic crimes
Economic crimes
Β 
Financial Crimes
Financial CrimesFinancial Crimes
Financial Crimes
Β 
Economics of crime model
Economics of crime modelEconomics of crime model
Economics of crime model
Β 
Raspberry Pi and Amateur Radio
Raspberry Pi and Amateur RadioRaspberry Pi and Amateur Radio
Raspberry Pi and Amateur Radio
Β 
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
Sanctions in Anti-trust cases – Prof. John M. Connor – Purdue University, US ...
Β 
Economic offences
Economic offencesEconomic offences
Economic offences
Β 
Capital Asset Pricing Model
Capital Asset Pricing ModelCapital Asset Pricing Model
Capital Asset Pricing Model
Β 
British economy presentation
British economy presentationBritish economy presentation
British economy presentation
Β 
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
Sanctions in Anti-trust cases – Prof. Hwang LEE – Korean University School of...
Β 
British Economy
British EconomyBritish Economy
British Economy
Β 
Micro and Macro Economics
Micro and Macro EconomicsMicro and Macro Economics
Micro and Macro Economics
Β 

Similar to 100 things I know

Introducing Zap Q-Learning
Introducing Zap Q-Learning   Introducing Zap Q-Learning
Introducing Zap Q-Learning
Sean Meyn
Β 
lecture6.ppt
lecture6.pptlecture6.ppt
lecture6.ppt
AbhiYadav655132
Β 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Per Kristian Lehre
Β 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
PK Lehre
Β 
week10_Reinforce.pdf
week10_Reinforce.pdfweek10_Reinforce.pdf
week10_Reinforce.pdf
YuChianWu
Β 
Policy-Gradient for deep reinforcement learning.pdf
Policy-Gradient for  deep reinforcement learning.pdfPolicy-Gradient for  deep reinforcement learning.pdf
Policy-Gradient for deep reinforcement learning.pdf
21522733
Β 
stochastic processes assignment help
stochastic processes assignment helpstochastic processes assignment help
stochastic processes assignment help
Statistics Homework Helper
Β 
Lec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methodsLec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methods
Ronald Teo
Β 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
Β 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-Likelihoods
Stefano Cabras
Β 
MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)
Arnaud de Myttenaere
Β 
RL unit 5 part 1.pdf
RL unit 5 part 1.pdfRL unit 5 part 1.pdf
RL unit 5 part 1.pdf
ChandanaVemulapalli2
Β 
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALESNONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
Tahia ZERIZER
Β 
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
The Statistical and Applied Mathematical Sciences Institute
Β 
Numerical Methods
Numerical MethodsNumerical Methods
Numerical Methods
Teja Ande
Β 
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning AutomataA New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
infopapers
Β 
Lecture_9.pdf
Lecture_9.pdfLecture_9.pdf
Lecture_9.pdf
BrofessorPaulNguyen
Β 
Unbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloUnbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte Carlo
JeremyHeng10
Β 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
θ¬™η›Š 黃
Β 
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares MethodNonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Tasuku Soma
Β 

Similar to 100 things I know (20)

Introducing Zap Q-Learning
Introducing Zap Q-Learning   Introducing Zap Q-Learning
Introducing Zap Q-Learning
Β 
lecture6.ppt
lecture6.pptlecture6.ppt
lecture6.ppt
Β 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Β 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Β 
week10_Reinforce.pdf
week10_Reinforce.pdfweek10_Reinforce.pdf
week10_Reinforce.pdf
Β 
Policy-Gradient for deep reinforcement learning.pdf
Policy-Gradient for  deep reinforcement learning.pdfPolicy-Gradient for  deep reinforcement learning.pdf
Policy-Gradient for deep reinforcement learning.pdf
Β 
stochastic processes assignment help
stochastic processes assignment helpstochastic processes assignment help
stochastic processes assignment help
Β 
Lec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methodsLec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methods
Β 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Β 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-Likelihoods
Β 
MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)
Β 
RL unit 5 part 1.pdf
RL unit 5 part 1.pdfRL unit 5 part 1.pdf
RL unit 5 part 1.pdf
Β 
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALESNONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
Β 
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Β 
Numerical Methods
Numerical MethodsNumerical Methods
Numerical Methods
Β 
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning AutomataA New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
Β 
Lecture_9.pdf
Lecture_9.pdfLecture_9.pdf
Lecture_9.pdf
Β 
Unbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloUnbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte Carlo
Β 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Β 
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares MethodNonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Β 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
Β 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
Β 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
Β 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
Β 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
Β 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
Β 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
Β 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
Β 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
Β 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
Β 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Β 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
Β 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
Β 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
Β 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
Β 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
Β 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
Β 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
Β 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
Β 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
Β 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
Β 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Β 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Β 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
Β 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
Β 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Β 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Β 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Β 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
Β 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Β 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Β 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Β 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Β 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Β 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Β 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Β 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Β 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Β 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Β 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Β 

100 things I know

  • 1. 100 things I know. Part I of III Reinaldo Uribe M Mar. 4, 2012
  • 2. SMDP Problem Description. 1. In a Markov Decision Process, a (learning) agent is embedded in an envionment and takes actions that affect that environment. States: s ∈ S. Actions: a ∈ As ; A = s∈S As . (Stationary) system dynamics: transition from s to s after taking a, with probability a Pss = p(s |s, a) Rewards: Ra . Def. r(s, a) = E Ra | s, a ss ss At time t, the agent is in state st , takes action at , transitions to state st+1 and observes reinforcement rt+1 with expectation r(st , at ).
  • 3. SMDP Problem Description. 2. Policies, value and optimal policies. An element Ο€ of the policy space Ξ  indicates what action, Ο€(s), to take at each state. The value of a policy from a given state, v Ο€ (s) is the expected cumulative reward received starting in s and following Ο€: ∞ v Ο€ (s) = E Ξ³ t r(st , Ο€(st )) | s0 = s, Ο€ t=0 0 < Ξ³ ≀ 1 is a discount factor. An optimal policy, Ο€ βˆ— has maximum value at every state: Ο€ βˆ— (s) ∈ argmax v Ο€ (s) βˆ€s Ο€βˆˆΞ  Ο€βˆ— v βˆ— (s) = v (s) β‰₯ v Ο€ (s) βˆ€Ο€ ∈ Ξ 
  • 4. SMDP Problem Description. 3. Discount Makes infinite-horizon value bounded if rewards are bounded. Ostensibly makes rewards received sooner more desirable than those received later. But, exponential terms make analysis awkward and harder... ... and Ξ³ has unexpected, undesirable effects, as shown in Uribe et al. 2011 Therefore, hereon Ξ³ = 1. See section Discount, at the end, for discussion.
  • 5. SMDP Problem Description. 4. Average reward models. A more natural long term measure of optimality exists for such cyclical tasks, based on maximizing the average reward per action. Mahadevan 1996 nβˆ’1 1 ρπ (s) = lim E r(st , Ο€(st )) | s0 = 0, Ο€ nβ†’βˆž n t=0 Optimal policy: Οβˆ— (s) β‰₯ ρπ (s) βˆ€s, Ο€ ∈ Ξ  Remark: All actions equally costly.
  • 6. SMDP Problem Description 5. Semi-Markov Decision Process: usual approach, transition times. Agent is in state st and takes action Ο€(st ) at decision epoch t. After an average of Nt units of time, the sistem evolves to state st+1 and the agent observes rt+1 with expectation r(st , Ο€(st )). In general, Nt (st , at , st+1 ). Gain (of a policy at a state): nβˆ’1 Ο€ E t=0 r(st , Ο€(st )) | s0 = s, Ο€ ρ (s) = lim nβ†’βˆž nβˆ’1 E t=0 Nt | s0 = s, Ο€ Optimizing gain still maximizes average reward per action, but actions are no longer equally weighted. (Unless all Nt = 1)
  • 7. SMDP Problem Description 6.a Semi-Markov Decision Process: explicit action costs. Taking an action takes time, costs money, or consumes energy. (Or any combination thereof) Either way, real valued cost kt+1 not necessarily related to process rewards. Cost can depend on a, s and (less common in practice) s . Generally, actions have positive cost. We simply require all policies to have positive expected cost. Wlog the magnitude of the smallest nonzero average action cost is forced to be unity: |k(a, s)| β‰₯ 1 βˆ€k(a, s) = 0
  • 8. SMDP Problem Description 6.b Semi-Markov Decision Process: explicit action costs. Cost of a policy from a state: nβˆ’1 cΟ€ (s) = lim E k(st , Ο€(st )) | s0 = s, Ο€ nβ†’βˆž t=0 So cΟ€ (s) > 0 βˆ€Ο€ ∈ Ξ , s. Nt = k(st , Ο€(st )). Only their definition/interpretation changes. Gain v Ο€ (s)/n ρπ (s) = cΟ€ (s)/n
  • 9. SMDP Problem Description 7. Optimality of Ο€ βˆ— : Ο€ βˆ— ∈ Ξ  with gain nβˆ’1 E t=0 r(st , Ο€(st )) | s0 = s, Ο€ βˆ— βˆ— v Ο€ (s) Ο€βˆ— βˆ— ρ (s) = ρ (s) = lim = Ο€βˆ— nβ†’βˆž nβˆ’1 = s, Ο€ βˆ— c (s) E t=0 k(st , Ο€(st )) | s0 is optimal if Οβˆ— (s) β‰₯ ρπ (s) βˆ€s, Ο€ ∈ Ξ , as it was in ARRL. Notice that the optimal policy doesn’t necessarily maximize v Ο€ or minimize cΟ€ . Only optimizes their ratio.
  • 10. SMDP Problem Description 8. Policies in ARRL and SMDPs are evaluated using the average-adjusted sum of rewards: nβˆ’1 H Ο€ (s) = lim E (r(st , Ο€(st )) βˆ’ ρπ (s)) | s0 = s, Ο€ nβ†’βˆž t=0 Puterman 1994, Abounadi et al. 2001, Ghavamzadeh & Mahadevan 2007 This signals the existence of bias optimal policies that, while gain optimal, also maximize the transitory rewards received before entering recurrence. We are interested in gain-optimal policies only. (It is hard enough...)
  • 11. SMDP Problem Description 9. The Unichain Property A process is unichain if every policy has a single, unique recurrent class. I.e. if for every policy, all recurrent states communicate between them. All methods rely on the unichain property. (Because, if it holds:) ρπ (s) is constant for all s. ρπ (s) = ρπ Gain and value expressions simplify. (See next) However, deciding if a problem is unichain is NP-Hard. Tsitsiklis 2003
  • 12. SMDP Problem Description 10. Unichain property under recurrent states. Feinberg & Yang, 2010 A state is recurrent if it belongs to a recurrent class of every policy. A recurrent state can be found, or proven not to exist, in polynomial time. If a recurrent state exists, determining whether the unichain property holds can be done in polynomial time. (We are not going to actually do it–it requires knowledge of the system dynamics–but good to know!) Recurrent states seem useful. In fact, existence of a recurrent state is more critical to our purposes that the unichain property. Both will be required in principle for our methods/analysis, until their necessity is furher qualified in section Unichain Considerations below.
  • 14. Generic Learning Algorithm 11. The relevant expressions under our assumptions simplify, losing dependence on s0 The following Bellman equation holds for average-adjusted state value: H Ο€ (s) = r(s, Ο€(s)) βˆ’ k(s, Ο€(s))(ρπ ) + EΟ€ H Ο€ (s ) (1) Ghavamzadeh & Mahadevan 2007 Reinforcement Learning methods exploit Eq. (1), running the process and substituting: State for state-action pair value. Expected for obseved reward and cost. ρπ for an estimate. H Ο€ (s ) for its current estimate.
  • 15. Generic Learning Algorithm 12. Algorithm 1 Generic SMDP solver Initialize repeat forever Act Do RL to find value of current Ο€ Usually 1-step Q-learning Update ρ.
  • 16. Generic Learning Algorithm 13. Model-based state value update: H t+1 (st ) ← max r(st , a) + Ea H t (st+1 ) a Ea emphasizes that expected value of next state depends on action chosen/taken. Model free state-action pair value update: Qt+1 (st , at ) ← (1 βˆ’ Ξ³t ) Qt (st , at )+ Ξ³t rt+1 βˆ’ ρt ct+1 + max Qt (st+1 , a) a In ARRL, ct = 1 βˆ€t
  • 17. Generic Learning Algorithm 14.a Table of algorithms. ARRL Algorithm Gain update t AAC r(si , Ο€ i (si )) i=0 Jalali and Ferguson 1989 ρt+1 ← t+1 t+1 R–Learning ρ ← (1 βˆ’ Ξ±)ρt + Schwartz 1993 Ξ± rt+1 + max Qt (st+1 , a) βˆ’ max Qt (st , a) a a H–Learning ρt+1 ← (1βˆ’Ξ±t )ρt +Ξ±t rt+1 βˆ’ H t (st ) + H t (st+1 ) Ξ±t Tadepalli and Ok 1998 Ξ±t+1 ← Ξ±t + 1 SSP Q-Learning ρt+1 ← ρt + Ξ±t min Qt (Λ†, a) s Abounadi et al. 2001 a t HAR r(si , Ο€ i (si )) i=0 Ghavamzadeh and Mahadevan 2007 ρt+1 ← t+1
  • 18. Generic Learning Algorithm 14.b Table of algorithms. SMDPRL Algorithm Gain update SMART t Das et al. 1999 r(si , Ο€ i (si )) i=0 ρt+1 ← t MAX-Q c(si , Ο€ i (si )) Ghavamzadeh and Mahadevan 2001 i=0
  • 19. SSP Q-Learning 15. Stochastic Shortest Path Q-Learning Most interesting. ARRL If unichain and exists s recurrent (Assumption 2.1 ): Λ† SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of expected total cost and expected time between two successive visits to the reference state [Λ†] s Thus, they propose (after Bertsekas 1998) making the process episodic, splitting s into the (unique) initial and terminal Λ† states. If the Assumption holds, termination has probability 1. Only the value/cost of the initial state are important. Optimal solution β€œcan be shown to happen” when H(Λ†) = 0. s (See next section)
  • 20. SSP Q-Learning 16. SSPQ ρ update. ρt+1 ← ρt + Ξ±t min Qt (Λ†, a), s a where 2 Ξ±t β†’ ∞; Ξ±t < ∞. t t But it is hard to prove boundedness of {ρt }, so suggested instead ρt+1 ← Ξ“ ρt + Ξ±t min Qt (Λ†, a) , s a with Ξ“(Β·) a projection to [βˆ’K, K] and Οβˆ— ∈ (βˆ’K, K).
  • 21. A Critique 17. Complexity. Unknown. While RL is PAC. 18. Convergence. Not always guaranteed (ex. R-Learning). When proven, asymptotic: convergence to the optimal policy/value if all state-action pairs are visited infinite times. Usually proven depending on decaying learning rates, which make learning even slower.
  • 22. A Critique 19. Convergence of ρ updates. ... while the second β€œslow” iteration gradually guides [ρt ] to the desired value. Abounadi et al. 2001 It is the slow one! Must be so for sufficient approximation of current policy value for improvement. Initially biased towards (likely poor) observed returns at the start. A long time must probably pass following the optimal policy for ρ to converge to actual value.
  • 23. Our method 20. Favours an understanding of the βˆ’Ο term, either alone in ARRL or as a factor of costs in SMDPs, not so much as an approximation to average rewards but as a punishment for taking actions, which must be made β€œworth it” by the rewards. I.e. nudging. Exploits the splitting of SSP Q-Learning, in order to focus on the value/cost of a single state, s. Λ† Thus, also assumes the existence of a recurrent state, and that the unichain policy holds. (For the time being) Attempts to ensure an accelerated convergence of ρ updates. In a context in which certain, efficient convergence can be easily introduced.
  • 25. Fractional programming 21. So, β€˜Bertsekas splitting’ of s into initial sI and terminal sT . Λ† Then, from sI Any policy Ο€ ∈ Ξ  has an expected return until termination v Ο€ (sI ), and an expected cost until termination cΟ€ (sI ). v Ο€ (sI ) The ARRL problem, then, becomes max Ο€ Ο€βˆˆΞ  c (sI ) Lemma v Ο€ (sI ) argmax = argmax v Ο€ (s) + Οβˆ— (βˆ’cΟ€ (s)) Ο€βˆˆΞ  cΟ€ (sI ) Ο€βˆˆΞ  For Οβˆ— such that max v Ο€ (s) + ρ(βˆ’cΟ€ (s)) = 0 Ο€βˆˆΞ 
  • 26. Fractional programming 22. Implications. Assume the gain, Οβˆ— is known. Then, the nonlinear SMDP problem reduces to RL. Which is better studied, well understood, simpler, and for which sophisticated, efficient algorithms exist. If we only use (r βˆ’ Οβˆ— c)(s, a, s ). Problem: Οβˆ— is usually not known.
  • 27. Nudging 23. Idea: Separate reinforcement learning (leave it to the pros) from updating ρ. Thus, value-learning becomes method-free. We can use any old RL method. Gain update is actually the most critical step. Punish too little, and the agent will not care about hurrying, only collecting reward. Punish too much, and the agent will only care about finishing already. In that sense, (r βˆ’ ρc) is like picking fruit inside a maze.
  • 28. Nudging 24. The problem reduces to a sequence of RL problems. For a sequence of (temporarily fixed) ρk Some of the methods already provide an indication of the sign of ρ updates. We just don’t hurry to update ρ after taking a single action. Plus the method comes armed with a termination condition: As soon as H k (sI ) = 0 then Ο€ k = Ο€ βˆ— .
  • 29. Nudging 25. Algorithm 2 Nudged SSP Learning Initialize repeat Set reward scheme to (r βˆ’ cρ) Solve by any RL method. Update ρ From current H Ο€ (sI ) until H Ο€ (sI ) = 0
  • 30. w βˆ’ l space 26. D We will propose a method for updating ρ and show that it minimizes uncertainty between steps. For that, we will use a transformation that extends the work of our CIG paper. But First. Let D be a bound on the magnitude of unnudged reward D β‰₯ lim sup{H Ο€ (sI ) | ρ = 0} Ο€βˆˆΞ  D ≀ lim inf {H Ο€ (sI ) | ρ = 0} Ο€βˆˆΞ  Observe interval (βˆ’D, D) bounds Οβˆ— but the upper bound is tight only in ARRL if all of D reward is received in a single step from sI .
  • 31. w βˆ’ l space 27. All policies Ο€ ∈ Ξ , from (that is, at) sI have: real expected value |v Ο€ (sI )| ≀ D. positive cost cΟ€ (sI ) β‰₯ 1 28.a w βˆ’ l transformation: D+v Ο€ (sI ) Dβˆ’v Ο€ (sI ) w= 2cΟ€ (sI ) l= 2cΟ€ (sI )
  • 32. w βˆ’ l space 28.b The w βˆ’ l plane. [Figure: the w βˆ’ l plane, with both axes ranging from 0 to D.]
  • 33. w βˆ’ l space 29. Properties:
    $w, l \ge 0$;  $w, l \le D$;
    $w + l = \dfrac{D}{c^\pi(s_I)} \le D$;
    $v^\pi(s_I) = D \Rightarrow l = 0$;  $v^\pi(s_I) = -D \Rightarrow w = 0$;
    $\lim_{c^\pi(s_I)\to\infty} (w, l) = (0, 0)$.
  30. Inverse transformation:
  $$v^\pi(s_I) = D\,\frac{w^\pi - l^\pi}{w^\pi + l^\pi}, \qquad c^\pi(s_I) = \frac{D}{w^\pi + l^\pi}.$$
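  A small Python sketch of the transformation and its inverse (the names to_wl and from_wl are mine; it assumes the bounds above, |v| ≀ D and c β‰₯ 1). It also checks the identity w βˆ’ l = v/c that the following slides rely on.

  def to_wl(v, c, D):
      """Map a policy's expected value v and cost c (at s_I) to w-l coordinates."""
      assert abs(v) <= D and c >= 1
      return (D + v) / (2 * c), (D - v) / (2 * c)

  def from_wl(w, l, D):
      """Inverse transformation back to (value, cost)."""
      return D * (w - l) / (w + l), D / (w + l)

  # Tiny sanity check: w - l equals the ratio v / c (the SMDP objective).
  v, c, D = 3.0, 4.0, 10.0
  w, l = to_wl(v, c, D)
  assert abs((w - l) - v / c) < 1e-12
  assert all(abs(a - b) < 1e-12 for a, b in zip(from_wl(w, l, D), (v, c)))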
  • 35. w βˆ’ l space 31. Value.
  $$v^\pi(s_I) = D\,\frac{w^\pi - l^\pi}{w^\pi + l^\pi}$$
  Level sets are lines: on the w–axis, expected value D; on the l–axis, expected value βˆ’D; on w = l, expected value 0. Optimization β†’ fanning from l = 0. Not convex, but each level set splits the space, so optimizers are vertices of the convex hull of policies. [Figure: value level-set lines in the w βˆ’ l plane, labelled from βˆ’D to D.]
  • 36. w βˆ’ l space 32. Cost.
  $$c^\pi(s_I) = \frac{D}{w^\pi + l^\pi}$$
  Level sets are lines with slope βˆ’1. On w + l = D, expected cost 1. Cost decreases with distance to the origin. Cost optimizers (both max and min) are also vertices. [Figure: cost level-set lines (costs 1, 2, 4, 8) in the w βˆ’ l plane.]
  • 37. w βˆ’ l space 33. The origin. Policies of infinite expected cost: they mean the problem is not unichain or $s_I$ is not recurrent, and they are troublesome for optimizing value. So, under our assumptions, the origin does not belong to the space.
  • 38. Nudged value in the w βˆ’ l space 34. The SMDP problem in w βˆ’ l:
  $$\arg\max_{\pi\in\Pi} \frac{v^\pi(s_I)}{c^\pi(s_I)}
  \;=\; \arg\max_{\pi\in\Pi} \frac{D\,\dfrac{w^\pi - l^\pi}{w^\pi + l^\pi}}{\dfrac{D}{w^\pi + l^\pi}}
  \;=\; \arg\max_{\pi\in\Pi}\; w^\pi - l^\pi$$
  [Figure: scatter of a cloud of policies in the w βˆ’ l plane; axis ticks at D/2 and D.]
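  In fact, substituting the transformation gives the ratio exactly (a one-line check, not spelled out on the slide):
  \[
  w^\pi - l^\pi
  \;=\; \frac{D + v^\pi(s_I)}{2\,c^\pi(s_I)} - \frac{D - v^\pi(s_I)}{2\,c^\pi(s_I)}
  \;=\; \frac{v^\pi(s_I)}{c^\pi(s_I)},
  \]
  so maximizing $w^\pi - l^\pi$ is the same as maximizing the gain.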
  • 39. Nudged value in the w βˆ’ l space 35. Nudged value, for some ρ:
  $$\arg\max_{\pi\in\Pi}\; v^\pi(s_I) - \rho\,c^\pi(s_I)
  \;=\; \arg\max_{\pi\in\Pi}\; D\,\frac{w^\pi - l^\pi}{w^\pi + l^\pi} - \rho\,\frac{D}{w^\pi + l^\pi}
  \;=\; \arg\max_{\pi\in\Pi}\; D\,\frac{w^\pi - l^\pi - \rho}{w^\pi + l^\pi}$$
  • 40. Nudged value in the w βˆ’ l space 36. Nudged value level sets. (For a set $\hat\rho$ and all policies Ο€ with a given nudged value $\hat h$:)
  $$l^\pi = \frac{D - \hat h}{D + \hat h}\,w^\pi - \frac{D}{D + \hat h}\,\hat\rho$$
  Lines! The slope depends only on $\hat h$ (i.e., not on $\hat\rho$).
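  One rearrangement makes the level-set claim explicit (my derivation; fix the nudged value at $\hat h$ and solve for $l$):
  \[
  D\,\frac{w - l - \hat\rho}{w + l} = \hat h
  \;\Longrightarrow\;
  l\,(D + \hat h) = w\,(D - \hat h) - D\hat\rho
  \;\Longrightarrow\;
  l = \frac{D - \hat h}{D + \hat h}\,w - \frac{D}{D + \hat h}\,\hat\rho .
  \]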
  • 41. Nudged value in the w βˆ’ l space 37. Pencil of lines. For a set $\hat\rho$, the level-set lines of any two nudged values $\hat h$ and $\check h$ intersect at
  $$\left(\frac{\hat\rho}{2},\; -\frac{\hat\rho}{2}\right),$$
  a pencil of lines with that vertex.
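  A quick check (not on the slide) that this point lies on the level-set line for every $\hat h$: substituting $w = \hat\rho/2$ gives
  \[
  \frac{D - \hat h}{D + \hat h}\cdot\frac{\hat\rho}{2} - \frac{D}{D + \hat h}\,\hat\rho
  \;=\; \frac{(D - \hat h) - 2D}{2\,(D + \hat h)}\,\hat\rho
  \;=\; -\frac{\hat\rho}{2},
  \]
  independently of $\hat h$, so all level-set lines for a fixed $\hat\rho$ pass through $(\hat\rho/2, -\hat\rho/2)$.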
  • 42. Nudged value in the w βˆ’ l space 38. Zero nudged value.
  $$l^\pi = \frac{D - 0}{D + 0}\,w^\pi - \frac{D}{D + 0}\,\hat\rho \;=\; w^\pi - \hat\rho$$
  Unity slope. Negative nudged values above the line, positive below. If the whole policy cloud lies above w = l, some negative nudging (encouragement) is the optimizer. [Figure: the zero-value line of unity slope through the vertex $(\hat\rho/2, -\hat\rho/2)$ in the w βˆ’ l plane.]
  • 43. Nudged value in the w βˆ’ l space [Figure: the cloud of policies in the w βˆ’ l plane.]
  • 44. Nudged value in the w βˆ’ l space 40. Initial bounds on ρ*: βˆ’D ≀ ρ* ≀ D. (Duh! But nice geometry.)
  • 45. Enclosing triangle 41. Definition. A triangle ABC such that: ABC βŠ‚ w βˆ’ l space; $(w^*, l^*) \in ABC$; the slope of segment AB is unity; $w_A \le w_B$; $w_A \le w_C$. 42. Nomenclature: the side slopes are labelled $m_\alpha$, $m_\beta$, $m_\gamma$, with $m_\alpha = 1$ the slope of AB. [Figure: an enclosing triangle ABC in the w βˆ’ l plane with its side slopes labelled.]
  • 46. Enclosing Triangle 43. (New) bounds on ρ. Def.: the slope-$m_\zeta$ projection of a point $X(w_X, l_X)$ onto the w = βˆ’l line is
  $$X_\zeta = \frac{m_\zeta\,w_X - l_X}{m_\zeta + 1}.$$
  Bounds:
  $$A_\alpha = B_\alpha \;\le\; \rho^* \;\le\; C_\alpha$$
  $$w_A - l_A = w_B - l_B \;\le\; \rho^* \;\le\; w_C - l_C$$
  44. So, collinearity (of A, B and C) implies optimality. (Even if there are multiple optima.)
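  A small Python helper for the projection and the resulting bounds, written directly from the definitions above (project and rho_bounds are hypothetical names; a sketch, not the paper's implementation).

  def project(w_x, l_x, m_zeta):
      """Slope-m_zeta projection of the point (w_x, l_x) onto the w = -l line."""
      return (m_zeta * w_x - l_x) / (m_zeta + 1.0)

  def rho_bounds(A, B, C):
      """Bounds on rho* from an enclosing triangle; A, B, C are (w, l) pairs.

      Uses the w - l form of the bounds from the slide:
      w_A - l_A = w_B - l_B <= rho* <= w_C - l_C.
      """
      lower = A[0] - A[1]          # equals B[0] - B[1], since AB has unity slope
      upper = C[0] - C[1]
      return lower, upper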
  • 47. Right and left uncertainty 45. Iterating inside an enclosing triangle. 1. Set $\hat\rho$ to some value within the bounds ($w_A - l_A \le \hat\rho \le w_C - l_C$). 2. Solve the problem with rewards $(r - \hat\rho c)$. 46. Optimality. If $h(s_I) = 0$: done! The optimal policy found for the current problem solves the SMDP, and the termination condition has been met.
  • 48. Right and left uncertainty 47.a If $h(s_I) > 0$: right uncertainty. [Figure: enclosing triangle ABC with points S and T on its sides and the right-uncertainty length $y_1$ marked.]
  • 49. Right and left uncertainty 47.b Right uncertainty. Derivation:
  $$y_1 = S_\alpha - T_\alpha = \tfrac{1}{2}\bigl((1 - m_\beta)w_S - (1 - m_\gamma)w_T - (m_\gamma - m_\beta)w_C\bigr)$$
  Maximization:
  $$y_1^* = \frac{2s\sqrt{a\,b\,(\rho/2 - C_\beta)(\rho/2 - C_\gamma)} + a(\rho/2 - C_\beta) + b(\rho/2 - C_\gamma)}{c}$$
  with
  $$s = \operatorname{sign}(m_\beta - m_\gamma),\quad a = (1 - m_\gamma)(m_\beta + 1),\quad b = (1 - m_\beta)(m_\gamma + 1),\quad c = b - a = 2(m_\gamma - m_\beta).$$
  • 50. Right and left uncertainty 48.a If $h(s_I) < 0$: left uncertainty. [Figure: enclosing triangle ABC with point R and the left-uncertainty length $y_2$ marked.]
  • 51. Right and left uncertainty 48.b Left uncertainty. It is maximum where expected (when the value level set crosses B):
  $$y_2 = R_\alpha - Q_\alpha, \qquad y_2^* = (B_\alpha - B_\gamma)\,\frac{\rho/2 - B_\alpha}{\rho/2 - B_\gamma}.$$
  • 52. Right and left uncertainty 49. Fundamental lemma. As $\hat\rho$ grows, the maximal right uncertainty is monotonically decreasing and the maximal left uncertainty is monotonically increasing, and both are non-negative with minimum 0.
  • 53. Optimal nudging 50. Find $\hat\rho$ (between the bounds, obviously) such that the maximum resulting uncertainty, either left or right, is minimized. Since both are monotonic and have minimum 0, the min-max is attained where the maximal left and right uncertainties are equal. Remark: bear in mind this (↑) is the worst case; the iteration can terminate immediately. $\hat\rho$ is a gain estimate, but neither biased towards observations (initial or otherwise) nor slowly updated. Optimal nudging is β€œoptimal” in the sense that, with this update, the maximum uncertainty range of the resulting ρ values is minimum.
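  A minimal sketch of this equalization step, assuming the two maximal-uncertainty curves are available as functions of $\hat\rho$ (right_unc and left_unc are hypothetical callables, monotonically decreasing and increasing respectively, per the fundamental lemma); bisection on their difference finds the min-max point.

  def optimal_rho(left_unc, right_unc, rho_lo, rho_hi, tol=1e-9):
      """Return rho_hat in [rho_lo, rho_hi] where max(left, right) uncertainty is minimal.

      Relies on the fundamental lemma: right_unc is decreasing, left_unc is
      increasing, both non-negative, so the min-max sits where they are equal.
      """
      lo, hi = rho_lo, rho_hi
      while hi - lo > tol:
          mid = 0.5 * (lo + hi)
          if right_unc(mid) > left_unc(mid):
              lo = mid          # right uncertainty still dominates: move right
          else:
              hi = mid          # left uncertainty dominates: move left
      return 0.5 * (lo + hi)

  The slides obtain this point in closed form via the conic intersection below; the bisection above is only a generic numerical stand-in.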
  • 54. Optimal nudging 51. Enclosing triangle into enclosing triangle. 52. Strictly smaller (in both area and, importantly, resulting uncertainty).
  • 55. Obtaining an initial enclosing triangle 53. Setting $\rho^{(0)} = 0$ and solving maximizes reward irrespective of cost (the usual RL problem). It can be interpreted geometrically as fanning from the w axis to find the policy whose w, l coordinates subtend the smallest angle. The resulting optimizer maps to a point somewhere along a line with intercept at the origin. 54. The optimum of the SMDP problem is above, but not behind, that line; otherwise, contradiction.
  • 56. Obtaining an initial enclosing triangle 56. Either way, after iteration 0, the uncertainty is reduced by at least half.
  • 57. Conic intersection 57. Maximum right uncertainty is a conic!
  $$\begin{pmatrix} r & y_1^* & 1 \end{pmatrix}
  \begin{pmatrix}
  c & -(b + a) & -C_\alpha c\\
  -(b + a) & c & (C_\beta a + C_\gamma b)\\
  -C_\alpha c & (C_\beta a + C_\gamma b) & C_\alpha^2 c
  \end{pmatrix}
  \begin{pmatrix} r \\ y_1^* \\ 1 \end{pmatrix} = 0$$
  58. Maximum left uncertainty is a conic!
  $$\begin{pmatrix} r & y_2^* & 1 \end{pmatrix}
  \begin{pmatrix}
  0 & 1 & (B_\gamma - C_\gamma)\\
  1 & 0 & -B_\gamma\\
  (B_\gamma - C_\gamma) & -B_\gamma & -2B_\alpha (B_\gamma - C_\gamma)
  \end{pmatrix}
  \begin{pmatrix} r \\ y_2^* \\ 1 \end{pmatrix} = 0$$
  • 58. Conic intersection 59. Intersecting them is easy. 60. And cheap. (Requires in principle constant time and simple matrix operations) 61. So plug it in!
  • 59. Termination Criteria 62. We want to reduce the uncertainty to Ξ΅. Because it is a good idea. (Right?) So there's your termination condition right there. 63. Alternatively, stop when $|h^{(k)}(s_I)| < \varepsilon$. 64. In any case, if the same policy remains optimal and the sign of its nudged value changes between iterations, stop: it is the optimal solution of the SMDP problem.
  • 60. Finding D 65. A quick and dirty method: 1. Maximize cost (or episode length, if all costs equal 1). 2. Multiply by the largest unsigned reinforcement. 66. So, at most one more RL problem. 67. If D is estimated too large: wider initial bounds and longer computation, but OK. 68. If D is estimated too small (by other methods, of course): points may fall outside the triangle in w βˆ’ l space. (But where?)
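  A sketch of the quick-and-dirty estimate (estimate_D is a hypothetical helper; it assumes the cost-maximizing auxiliary problem has already been solved and the set of one-step reinforcements is known).

  def estimate_D(max_expected_cost, rewards):
      """Quick-and-dirty bound D: (max expected episode cost) x (largest |reward|).

      max_expected_cost: value of the auxiliary RL problem that maximizes cost
                         (or expected episode length when all costs equal 1).
      rewards:           iterable of possible one-step reinforcements.
      """
      return max_expected_cost * max(abs(r) for r in rewards)

  # Example: episodes of at most 50 steps, rewards in {-1, 0, +5}  ->  D = 250.
  D = estimate_D(50, [-1, 0, 5])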
  • 61. Recurring state + unichain considerations 69. Feinberg and Yang: deciding whether the unichain condition holds can be done in polynomial time if a recurring state exists. 70. The existence of a recurring state is common in practice. 71. (Future work) It can maybe be induced using Ρ–MDPs. (Maybe.) 72. At least one case in which not being unichain is no problem: games. Certainty of positive policies. Non-positive chains. 73. Happens! (See experiments.)
  • 62. Complexity 74. Discounted RL is PAC(–efficient). 75. In the problem-size parameters (|S|, |A|) and 1/(1 βˆ’ Ξ³). 76. Episodic undiscounted RL is also PAC. (Following similar arguments, but with slightly more intricate derivations.) 77. So we call a PAC(–efficient) method a number of times.
  • 63. Complexity 78. The worst worst case when choosing $\rho^{(k)}$ is not reducing the uncertainty at all. 79. Reducing it by half is a better bound for our method. 80. ... and it is a tight bound... 81. ... in cases that are nearly optimal from the outset. 82. So, at worst, log(1/Ξ΅) calls to a PAC method: PAC!
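  For concreteness (my arithmetic, just unpacking the halving argument): if the initial uncertainty interval on ρ* has width 2D and every outer iteration at least halves it, reaching width Ξ΅ needs
  \[
  2D \cdot 2^{-k} \le \varepsilon
  \;\Longleftrightarrow\;
  k \ge \log_2\!\frac{2D}{\varepsilon},
  \qquad\text{i.e. } O\!\bigl(\log\tfrac{1}{\varepsilon}\bigr)
  \]
  calls to the inner PAC learner.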
  • 64. Complexity 83. Whoops, we proved complexity! That's a first for SMDP (or ARRL, for that matter). 84. And we inherit convergence from the invoked RL method, so there's also that.
  • 65. Typically much faster 85. The worst case happens when we are β€œalready there”. 86. Otherwise, it depends, but it is certainly better. 87. The multi-iteration reduction in uncertainty is much better than 0.5 per iteration, because it accumulates geometrically. 88. Empirical complexity is better than the already very good upper bound.
  • 66. Bibliography I
  S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1):159–195, 1996.
  Reinaldo Uribe, Fernando Lozano, Katsunari Shibata, and Charles Anderson. Discount and speed/execution tradeoffs in Markov decision process games. In Computational Intelligence and Games (CIG), 2011 IEEE Conference on, pages 79–86. IEEE, 2011.