100 things I know.
    Part I of III


  Reinaldo Uribe M


    Mar. 4, 2012
SMDP Problem Description.

  1. In a Markov Decision Process, a (learning) agent is embedded
  in an environment and takes actions that affect that environment.

       States: s ∈ S.
       Actions: a ∈ A_s;  A = ⋃_{s∈S} A_s.
       (Stationary) system dynamics: transition from s to s' after
       taking a, with probability P^a_{ss'} = p(s' | s, a).
       Rewards: R^a_{ss'}.  Def. r(s, a) = E[ R^a_{ss'} | s, a ].


  At time t, the agent is in state s_t, takes action a_t, transitions to
  state s_{t+1} and observes reinforcement r_{t+1} with expectation
  r(s_t, a_t).
SMDP Problem Description.

  2. Policies, value and optimal policies.
      An element π of the policy space Π indicates what action,
      π(s), to take at each state.
      The value of a policy from a given state, v^π(s), is the expected
      cumulative reward received starting in s and following π:

                 v^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s, π ]

      0 < γ ≤ 1 is a discount factor.
      An optimal policy, π*, has maximum value at every state:

                     π*(s) ∈ argmax_{π∈Π} v^π(s)        ∀s

                     v*(s) = v^{π*}(s) ≥ v^π(s)         ∀π ∈ Π
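
  One way to make the value definition concrete is to solve the linear Bellman
  system v^π = r^π + γ P^π v^π for a small, fully known MDP. A minimal Python
  sketch; the two-state MDP, its policy and γ are made up purely for illustration:

      # Policy evaluation on a toy 2-state MDP (all numbers illustrative).
      import numpy as np

      gamma = 0.9                      # later slides set gamma = 1
      # Dynamics and expected rewards *under the fixed policy pi*:
      P_pi = np.array([[0.8, 0.2],
                       [0.1, 0.9]])    # P_pi[s, s'] = p(s' | s, pi(s))
      r_pi = np.array([1.0, -0.5])     # r(s, pi(s))

      # v_pi solves  v = r + gamma * P v
      v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
      print(v_pi)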
SMDP Problem Description.



  3. Discount
      Makes infinite-horizon value bounded if rewards are bounded.
      Ostensibly makes rewards received sooner more desirable than
      those received later.
      But exponential terms make analysis awkward and harder...
      ... and γ has unexpected, undesirable effects, as shown in
      Uribe et al. 2011.
      Therefore, hereon γ = 1.
      See section Discount, at the end, for discussion.
SMDP Problem Description.


  4. Average reward models.
          A more natural long term measure of optimality exists
      for such cyclical tasks, based on maximizing the average
      reward per action. Mahadevan 1996

             ρ^π(s) = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n-1} r(s_t, π(s_t)) | s_0 = s, π ]

  Optimal policy:

                      ρ*(s) ≥ ρ^π(s)    ∀s, π ∈ Π

  Remark: All actions are equally costly.
SMDP Problem Description

  5. Semi-Markov Decision Process: usual approach, transition
  times.
      The agent is in state s_t and takes action π(s_t) at decision epoch t.
      After an average of N_t units of time, the system evolves to
      state s_{t+1} and the agent observes r_{t+1} with expectation
      r(s_t, π(s_t)).
      In general, N_t = N_t(s_t, a_t, s_{t+1}).
      Gain (of a policy at a state):

               ρ^π(s) = lim_{n→∞}  E[ Σ_{t=0}^{n-1} r(s_t, π(s_t)) | s_0 = s, π ]
                                  / E[ Σ_{t=0}^{n-1} N_t | s_0 = s, π ]

      Optimizing gain still maximizes average reward per action, but
      actions are no longer equally weighted. (Unless all N_t = 1.)
SMDP Problem Description

  6.a Semi-Markov Decision Process: explicit action costs.
      Taking an action takes time, costs money, or consumes
      energy. (Or any combination thereof.)
      Either way, a real-valued cost k_{t+1}, not necessarily related to
      the process rewards.
      The cost can depend on a, s and (less commonly in practice) s'.
      Generally, actions have positive cost. We simply require all
      policies to have positive expected cost.
      Wlog, the magnitude of the smallest nonzero average action
      cost is forced to be unity:

                        |k(a, s)| ≥ 1    ∀ k(a, s) ≠ 0
SMDP Problem Description

  6.b Semi-Markov Decision Process: explicit action costs.
      Cost of a policy from a state:

               c^π(s) = lim_{n→∞} E[ Σ_{t=0}^{n-1} k(s_t, π(s_t)) | s_0 = s, π ]

      So c^π(s) > 0    ∀π ∈ Π, s.

      Costs take the place of the transition times, N_t = k(s_t, π(s_t)).
      Only their definition/interpretation changes.
      Gain:

                               ρ^π(s) = (v^π(s)/n) / (c^π(s)/n)
SMDP Problem Description

  7. Optimality of π*:

  π* ∈ Π with gain

   ρ^{π*}(s) = ρ*(s) = lim_{n→∞}  E[ Σ_{t=0}^{n-1} r(s_t, π*(s_t)) | s_0 = s, π* ]
                                 / E[ Σ_{t=0}^{n-1} k(s_t, π*(s_t)) | s_0 = s, π* ]
                     = v^{π*}(s) / c^{π*}(s)

  is optimal if

                       ρ*(s) ≥ ρ^π(s)    ∀s, π ∈ Π,

  as it was in ARRL.

  Notice that the optimal policy doesn’t necessarily maximize v^π or
  minimize c^π; it only optimizes their ratio.
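
  A tiny numeric illustration of that last point (all numbers made up): suppose
  three policies from s have

       v^A(s) = 10, c^A(s) = 10   ⇒   ρ^A = 1
       v^B(s) =  6, c^B(s) =  3   ⇒   ρ^B = 2
       v^C(s) =  0, c^C(s) =  1   ⇒   ρ^C = 0

  B is gain-optimal although A has the largest value and C the smallest cost.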
SMDP Problem Description

  8. Policies in ARRL and SMDPs are evaluated using the
  average-adjusted sum of rewards:
                                      n−1
          H π (s) = lim E                   (r(st , π(st )) − ρπ (s)) | s0 = s, π
                         n→∞
                                      t=0

  Puterman 1994, Abounadi et al. 2001, Ghavamzadeh & Mahadevan 2007




         This signals the existence of bias optimal policies that, while
         gain optimal, also maximize the transitory rewards received
         before entering recurrence.
         We are interested in gain-optimal policies only.
         (It is hard enough...)
SMDP Problem Description


  9. The Unichain Property
      A process is unichain if every policy has a single, unique
      recurrent class.
      I.e. if, for every policy, all recurrent states communicate
      with each other.
      All methods rely on the unichain property, because, if it
      holds:
      ρ^π(s) is constant for all s: ρ^π(s) = ρ^π.
      Gain and value expressions simplify. (See next.)
      However, deciding if a problem is unichain is NP-Hard.
      Tsitsiklis 2003
SMDP Problem Description

  10. Unichain property under recurrent states.   Feinberg & Yang, 2010

      A state is recurrent if it belongs to a recurrent class of every
      policy.
      A recurrent state can be found, or proven not to exist, in
      polynomial time.
      If a recurrent state exists, determining whether the unichain
      property holds can be done in polynomial time.
      (We are not going to actually do it–it requires knowledge of
      the system dynamics–but good to know!)
      Recurrent states seem useful. In fact, existence of a recurrent
      state is more critical to our purposes than the unichain
      property.
      Both will be required in principle for our methods/analysis,
      until their necessity is further qualified in section Unichain
      Considerations below.
Intermission
Generic Learning Algorithm
   11. The relevant expressions under our assumptions simplify, losing
   dependence on s_0.

   The following Bellman equation holds for the average-adjusted state
   value:

              H^π(s) = r(s, π(s)) − k(s, π(s)) ρ^π + E_π[ H^π(s') ]        (1)

   Ghavamzadeh & Mahadevan 2007



   Reinforcement Learning methods exploit Eq. (1), running the
   process and substituting:
         state values with state-action pair values,
         expected rewards and costs with observed ones,
         ρ^π with an estimate,
         and H^π(s') with its current estimate.
Generic Learning Algorithm



   12.

   Algorithm 1 Generic SMDP solver
     Initialize
     repeat forever
         Act
         Do RL to find value of current π   Usually 1-step Q-learning
         Update ρ.
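
   A sketch of this loop in Python. The environment interface
   env.step(s, a) -> (s2, r, c), the ε-greedy behaviour and the ρ update used
   here (an empirical reward/cost ratio, one of the updates tabulated later)
   are illustrative assumptions, not a fixed prescription:

       import numpy as np

       def generic_smdp_solver(env, n_states, n_actions, s0, steps=100_000,
                               alpha=0.1, eps=0.1, seed=0):
           rng = np.random.default_rng(seed)
           Q = np.zeros((n_states, n_actions))   # average-adjusted action values
           rho, total_r, total_c = 0.0, 0.0, 0.0
           s = s0
           for _ in range(steps):                # "repeat forever", truncated
               # Act: eps-greedy on the current Q
               a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
               s2, r, c = env.step(s, a)
               # Do RL to find the value of the current policy (1-step Q-learning)
               Q[s, a] += alpha * (r - rho * c + np.max(Q[s2]) - Q[s, a])
               # Update rho: cumulative reward over cumulative cost so far
               total_r, total_c = total_r + r, total_c + c
               rho = total_r / max(total_c, 1e-12)
               s = s2
           return Q, rho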
Generic Learning Algorithm

   13.
         Model-based state value update:

                    H^{t+1}(s_t) ← max_a [ r(s_t, a) + E_a[ H^t(s_{t+1}) ] ]

             E_a emphasizes that the expected value of the next state depends
             on the action chosen/taken.


         Model-free state-action pair value update:

               Q^{t+1}(s_t, a_t) ← (1 − γ_t) Q^t(s_t, a_t)
                                 + γ_t [ r_{t+1} − ρ^t c_{t+1} + max_a Q^t(s_{t+1}, a) ]

             (Here γ_t is a step size, not the discount.)
             In ARRL, c_t = 1  ∀t.
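
   The model-free update as a Python function (array shapes and the name
   step_size are assumptions of this sketch):

       import numpy as np

       def average_adjusted_q_update(Q, s, a, r, c, s_next, rho, step_size):
           """One model-free update of the average-adjusted action values.

           Q         : (n_states, n_actions) array of current estimates
           r, c      : observed reward and cost of the transition
           rho       : current gain estimate
           step_size : the gamma_t of the slide (a learning rate)
           """
           target = r - rho * c + np.max(Q[s_next])
           Q[s, a] = (1.0 - step_size) * Q[s, a] + step_size * target
           return Q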
Generic Learning Algorithm
    14.a Table of algorithms. ARRL

    Algorithm                            Gain update

    AAC                                  ρ^{t+1} ← ( Σ_{i=0}^{t} r(s_i, π^i(s_i)) ) / (t + 1)
    Jalali and Ferguson 1989

    R–Learning                           ρ^{t+1} ← (1 − α) ρ^t
    Schwartz 1993                                + α [ r_{t+1} + max_a Q^t(s_{t+1}, a) − max_a Q^t(s_t, a) ]

    H–Learning                           ρ^{t+1} ← (1 − α_t) ρ^t + α_t [ r_{t+1} − H^t(s_t) + H^t(s_{t+1}) ];
    Tadepalli and Ok 1998                α_{t+1} ← α_t / (α_t + 1)

    SSP Q-Learning                       ρ^{t+1} ← ρ^t + α_t min_a Q^t(ŝ, a)
    Abounadi et al. 2001

    HAR                                  ρ^{t+1} ← ( Σ_{i=0}^{t} r(s_i, π^i(s_i)) ) / (t + 1)
    Ghavamzadeh and Mahadevan 2007
Generic Learning Algorithm


    14.b Table of algorithms. SMDPRL

    Algorithm                            Gain update

    SMART
    Das et al. 1999                      ρ^{t+1} ← ( Σ_{i=0}^{t} r(s_i, π^i(s_i)) )
    MAX-Q                                        / ( Σ_{i=0}^{t} c(s_i, π^i(s_i)) )
    Ghavamzadeh and Mahadevan 2001
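
   For concreteness, two of the tabulated gain updates as Python functions
   (straight transcriptions; variable names are my own):

       import numpy as np

       def r_learning_gain(rho, alpha, r, Q, s, s_next):
           """R-Learning gain update (Schwartz 1993)."""
           return (1.0 - alpha) * rho + alpha * (r + np.max(Q[s_next]) - np.max(Q[s]))

       def smart_gain(total_reward, total_cost):
           """SMART / MAX-Q gain update: empirical reward-to-cost ratio."""
           return total_reward / total_cost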
SSP Q-Learning
  15. Stochastic Shortest Path Q-Learning
      Most interesting. ARRL
      If the process is unichain and there exists a recurrent state ŝ
      (Assumption 2.1):
               SSP Q-learning is based on the observation that
           the average cost under any stationary policy is
           simply the ratio of expected total cost and expected
           time between two successive visits to the reference
           state [ŝ]

      Thus, they propose (after Bertsekas 1998) making the process
      episodic, splitting ŝ into the (unique) initial and terminal
      states.
      If the Assumption holds, termination has probability 1.
      Only the value/cost of the initial state is important.
      The optimal solution “can be shown to happen” when H(ŝ) = 0.
      (See next section.)
SSP Q-Learning
  16. SSPQ ρ update.

                      ρ^{t+1} ← ρ^t + α_t min_a Q^t(ŝ, a),

  where

                      Σ_t α_t = ∞;    Σ_t α_t² < ∞.


  But it is hard to prove boundedness of {ρ^t}, so instead they suggest

                      ρ^{t+1} ← Γ( ρ^t + α_t min_a Q^t(ŝ, a) ),

  with Γ(·) a projection onto [−K, K] and ρ* ∈ (−K, K).
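
   A sketch of the projected update in Python (the value of K and the
   step-size schedule are illustrative choices, not prescribed above):

       import numpy as np

       def sspq_rho_update(rho, Q, s_hat, t, K=100.0):
           """Projected SSP Q-learning gain update.

           rho   : current gain estimate
           Q     : (n_states, n_actions) action-value estimates
           s_hat : index of the reference (recurrent) state
           t     : step counter, for a Robbins-Monro step size
           K     : projection bound; must satisfy rho* in (-K, K)
           """
           alpha_t = 1.0 / (t + 1)                  # sum = inf, sum of squares < inf
           candidate = rho + alpha_t * np.min(Q[s_hat])
           return float(np.clip(candidate, -K, K))  # Gamma: projection onto [-K, K]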
A Critique


   17. Complexity.
       Unknown.
       While RL is PAC.


   18. Convergence.
       Not always guaranteed (e.g. R-Learning).
       When proven, asymptotic:
       convergence to the optimal policy/value if all state-action
       pairs are visited infinitely often.
       Usually proven under decaying learning rates, which
       make learning even slower.
A Critique


   19. Convergence of ρ updates.
            ... while the second “slow” iteration gradually guides
       [ρ^t] to the desired value.
          Abounadi et al. 2001




       It is the slow one!
       It must be, so that the value of the current policy is approximated
       well enough before improvement.
       It is initially biased towards the (likely poor) returns observed at
       the start.
       A long time of following the optimal policy will probably be needed
       for ρ to converge to its actual value.
Our method

  20.
        Favours an understanding of the −ρ term, either alone in
        ARRL or as a factor of the costs in SMDPs, not so much as an
        approximation to the average reward but as a punishment for
        taking actions, which must be made “worth it” by the rewards.
        I.e. nudging.
        Exploits the splitting of SSP Q-Learning, in order to focus on
        the value/cost of a single state, ŝ.
        Thus, it also assumes the existence of a recurrent state, and
        that the unichain property holds. (For the time being.)

        Attempts to ensure accelerated convergence of the ρ updates,
        in a context in which certain, efficient convergence can be
        easily introduced.
Intermission
Fractional programming

   21. So, ‘Bertsekas splitting’ of ŝ into initial s_I and terminal s_T.
   Then, from s_I:
        Any policy π ∈ Π has an expected return until termination,
        v^π(s_I),
        and an expected cost until termination, c^π(s_I).
        The ARRL problem, then, becomes  max_{π∈Π}  v^π(s_I) / c^π(s_I).


   Lemma

              argmax_{π∈Π}  v^π(s_I) / c^π(s_I)  =  argmax_{π∈Π}  v^π(s_I) + ρ* (−c^π(s_I))

   for ρ* such that  max_{π∈Π}  v^π(s_I) + ρ* (−c^π(s_I)) = 0.
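
   A short sanity check of the Lemma (a standard fractional-programming,
   Dinkelbach-style argument, added here for completeness; it only uses
   c^π(s_I) > 0):

       Let ρ* satisfy  max_{π∈Π} [ v^π(s_I) − ρ* c^π(s_I) ] = 0, attained by some π°.
       Then for every π,  v^π(s_I) − ρ* c^π(s_I) ≤ 0, and since c^π(s_I) > 0,
       v^π(s_I) / c^π(s_I) ≤ ρ*, with equality exactly when π attains the max.
       So the maximizers of the ratio and of the nudged objective coincide,
       and the optimal gain is ρ*.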
Fractional programming




   22. Implications.
       Assume the gain ρ* is known.
       Then the nonlinear SMDP problem reduces to RL,
       which is better studied, well understood, simpler, and for
       which sophisticated, efficient algorithms exist,
       if we simply use the nudged reward (r − ρ* c)(s, a, s').
       Problem: ρ* is usually not known.
Nudging

  23. Idea:
      Separate reinforcement learning (leave it to the pros) from
      updating ρ.
      Thus, value-learning becomes method-agnostic:
      we can use any old RL method.

      The gain update is actually the most critical step.
      Punish too little, and the agent will not care about hurrying,
      only about collecting reward.
      Punish too much, and the agent will only care about finishing
      already.

      In that sense, (r − ρc) is like picking fruit inside a maze.
Nudging



  24. The problem reduces to a sequence of RL problems,
      for a sequence of (temporarily fixed) ρ^k.
      Some of the methods already provide an indication of the sign
      of the ρ updates.
      We just don't hurry to update ρ after taking a single action.

      Plus, the method comes armed with a termination condition:
      as soon as H^k(s_I) = 0, then π^k = π*.
Nudging



  25.

  Algorithm 2 Nudged SSP Learning
    Initialize
    repeat
        Set reward scheme to (r − cρ)
        Solve by any RL method
        Update ρ                        From current H^π(s_I)
    until H^π(s_I) = 0
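
   A sketch of this outer loop in Python. The inner solve_rl (any RL method
   run on the nudged reward r − ρc, made episodic at s_I) and update_rho are
   placeholders for pieces discussed elsewhere in these notes:

       def nudged_ssp_learning(solve_rl, update_rho, s_I,
                               rho0=0.0, tol=1e-6, max_rounds=100):
           """Outer loop of Nudged SSP Learning (sketch).

           solve_rl(rho)         : solves the RL problem with reward (r - rho*c)
                                   and returns H, the average-adjusted values.
           update_rho(rho, H_sI) : returns the next gain estimate from H(s_I).
           """
           rho = rho0
           for _ in range(max_rounds):
               H = solve_rl(rho)              # set reward scheme and solve
               if abs(H[s_I]) < tol:          # termination: H(s_I) = 0
                   break
               rho = update_rho(rho, H[s_I])  # update rho from current H(s_I)
           return rho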
w − l space


   26. D.
   We will propose a method for updating ρ and show that it
   minimizes uncertainty between steps. For that, we will use a
   transformation that extends the work of our CIG paper. But first:

   Let D be a bound on the magnitude of the unnudged reward,

                       D ≥ sup_{π∈Π} { H^π(s_I) | ρ = 0 },
                      −D ≤ inf_{π∈Π} { H^π(s_I) | ρ = 0 }.

   Observe that the interval (−D, D) bounds ρ*, but the upper bound is
   tight only in ARRL, if all of the reward D is received in a single
   step from s_I.
w − l space



   27. All policies π ∈ Π, from (that is, at) s_I have:
       real expected value |v^π(s_I)| ≤ D,
       positive cost c^π(s_I) ≥ 1.



   28.a w − l transformation:

             w = (D + v^π(s_I)) / (2 c^π(s_I)),    l = (D − v^π(s_I)) / (2 c^π(s_I))
w − l space

   28.b The w − l plane.

   [Figure: the (w, l) plane; both axes range from 0 to D.]
w − l space

   29. Properties:
       w, l ≥ 0
       w, l ≤ D
       w + l = D / c^π(s_I) ≤ D
       v^π(s_I) = D   ⇒   l = 0
       v^π(s_I) = −D  ⇒   w = 0
       lim_{c^π(s_I)→∞} (w, l) = (0, 0)


   30. Inverse transformation:

             v^π(s_I) = D (w^π − l^π) / (w^π + l^π),      c^π(s_I) = D / (w^π + l^π)
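
   The transformation and its inverse as a small Python sketch, with a
   round-trip check on made-up numbers:

       def to_wl(v, c, D):
           """(value, cost) at s_I  ->  (w, l) coordinates."""
           return (D + v) / (2 * c), (D - v) / (2 * c)

       def from_wl(w, l, D):
           """(w, l) coordinates  ->  (value, cost) at s_I."""
           return D * (w - l) / (w + l), D / (w + l)

       # Round trip with arbitrary illustrative numbers (|v| <= D, c >= 1):
       D, v, c = 10.0, 4.0, 2.5
       w, l = to_wl(v, c, D)
       assert abs(from_wl(w, l, D)[0] - v) < 1e-12
       assert abs(from_wl(w, l, D)[1] - c) < 1e-12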
Intermission
w − l space

 31. Value.

                 v^π(s_I) = D (w^π − l^π) / (w^π + l^π)

      Level sets are lines.
      w–axis: expected value D.
      l–axis: expected value −D.
      w = l: expected value 0.
      Optimization → fanning from l = 0.
      Not convex, but splits the space,
      so optimizers are vertices of the convex hull of policies.

   [Figure: value level sets (lines through the origin, labelled −D, −0.5D, 0,
   0.5D) in the (w, l) plane.]
w − l space

 32. Cost.

                 c^π(s_I) = D / (w^π + l^π)

      Level sets are lines with slope −1.
      w + l = D: expected cost 1.
      Cost decreases with distance to the origin.
      Cost optimizers (both max and min) are also vertices.

   [Figure: cost level sets (slope −1 lines, labelled 1, 2, 4, 8) in the
   (w, l) plane.]
w − l space




   33. The origin.
       Policies of infinite expected cost.
       They mean the problem is not unichain or s_I is not recurrent,
       and they are troublesome for optimizing value.

       So, under our assumptions, the origin does not belong to the
       space.
Nudged value in the w − l space
   34. SMDP problem in w − l.

          argmax_{π∈Π}  v^π(s_I) / c^π(s_I)
              = argmax_{π∈Π}  [ D (w^π − l^π) / (w^π + l^π) ] / [ D / (w^π + l^π) ]
              = argmax_{π∈Π}  w^π − l^π


   [Figure: cloud of policies in the (w, l) plane with level sets of w − l
   (labelled −D/2, 0, D/2).]
Nudged value in the w − l space



   35. Nudged value, for some ρ.

                    argmax_{π∈Π}  v^π(s_I) − ρ c^π(s_I)
                  = argmax_{π∈Π}  D (w^π − l^π) / (w^π + l^π)  −  ρ D / (w^π + l^π)
                  = argmax_{π∈Π}  D (w^π − l^π − ρ) / (w^π + l^π)
Nudged value in the w − l space



   36. Nudged value level sets
   (For a set ρ and all policies π̂ with a given nudged value ĥ;
   obtained by solving ĥ = D (w − l − ρ) / (w + l) for l:)

                 l^{π̂} = ((D − ĥ) / (D + ĥ)) w^{π̂}  −  (D / (D + ĥ)) ρ

   Lines!

   The slope depends only on ĥ (i.e., not on ρ).
Nudged value in the w − l space




   37. Pencil of lines
   For a set ρ, any two level-set lines, for ĥ and ȟ, intersect at

                                ( ρ/2, −ρ/2 ).

   Pencil of lines with that vertex.
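
   A quick check (added for completeness) that the vertex does not depend
   on ĥ: substituting w = ρ/2 into the level-set equation of item 36,

       l = ((D − ĥ) / (D + ĥ)) (ρ/2) − (D / (D + ĥ)) ρ
         = ρ [ (D − ĥ) − 2D ] / [ 2 (D + ĥ) ]
         = −ρ (D + ĥ) / [ 2 (D + ĥ) ]
         = −ρ/2,

   for every ĥ ≠ −D, so all the level-set lines pass through (ρ/2, −ρ/2).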
Nudged value in the w − l space


      38. Zero nudged value.
          Setting ĥ = 0 in the level-set equation:

                   l^{π̂} = ((D − 0) / (D + 0)) w^{π̂} − (D / (D + 0)) ρ
                   l^{π̂} = w^{π̂} − ρ

          Unity slope.
          Negative values above, positive below.
          If the whole cloud is above w = l, some negative nudging is
          the optimizer. (Encouragement.)

   [Figure: the zero-nudged-value line l = w − ρ (slope 1) in the (w, l)
   plane; the pencil vertex (ρ/2, −ρ/2) lies on it.]
Nudged value in the w − l space


   [Figure: cloud of policies in the (w, l) plane.]
Nudged value in the w − l space




   40. Initial bounds on ρ∗ .

                                −D ≤ ρ∗ ≤ D

   (Duh! but nice geometry)
Enclosing triangle

   41. Definition.
       A triangle ABC such that:
           ABC ⊂ w − l space.
           (w∗ , l∗ ) ∈ ABC.
           Slope of the AB segment, unity.
           wA ≤ wB
           wA ≤ wC

   42. Nomenclature.
       [Figure: enclosing triangle ABC in the w − l space, vertices A, B and C
       marked; the slopes mα , mβ and mγ labelled; scale D.]
Enclosing Triangle


   43. (New) bounds on ρ.
   Def. Xζ : the slope-mζ projection of a point X(wX , lX ) onto the line w = −l:

                            Xζ = (mζ wX − lX ) / (mζ + 1)

   Bounds:

                            Aα = Bα ≤ ρ∗ ≤ Cα
                  wA − lA = wB − lB ≤ ρ∗ ≤ wC − lC


   44. So, collinearity (of A, B and C) implies optimality.
   (Even if there are multiple optima)
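
   A minimal Python sketch of the projection and bounds of items 43–44 (the
   function names, vertex coordinates and example values are mine, not from
   the slides):

       def project(w, l, m):
           """Slope-m projection of the point (w, l) onto the line w = -l,
           as in item 43: X_zeta = (m*w - l) / (m + 1)."""
           return (m * w - l) / (m + 1.0)

       def rho_bounds(A, B, C):
           """Bounds of item 43: w_A - l_A = w_B - l_B <= rho* <= w_C - l_C.
           A, B, C are (w, l) vertices of an enclosing triangle (AB has unit slope)."""
           lower = A[0] - A[1]     # equals B[0] - B[1], since AB has unit slope
           upper = C[0] - C[1]
           return lower, upper

       # Example with made-up vertices; collinear A, B, C would give
       # lower == upper, i.e. optimality (item 44).
       A, B, C = (0.2, 0.1), (0.5, 0.4), (0.8, 0.3)
       print(rho_bounds(A, B, C))   # (0.1, 0.5)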
Right and left uncertainty



   45. Iterating inside an enclosing triangle.
     1   Set ρ̂ to some value within the bounds
         (wA − lA ≤ ρ̂ ≤ wC − lC ).
     2   Solve problem with rewards (r − ρ̂ c).


   46. Optimality.
   If h(sI ) = 0: done!
   The policy that is optimal for the current nudged problem solves the
   SMDP, and the termination condition has been met.
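
   A minimal sketch of one pass of this iteration, assuming a plug-in RL
   solver; solve_rl, pick_rho and the return convention are placeholders of
   mine, not part of the method's description:

       def nudged_iteration(solve_rl, lower, upper, pick_rho):
           """One pass of item 45: choose rho_hat inside the current bounds,
           solve the RL problem with rewards (r - rho_hat * c), and report the
           nudged value h(s_I) of the resulting policy."""
           rho_hat = pick_rho(lower, upper)        # e.g. the optimal-nudging choice
           policy, h_sI = solve_rl(rho_hat)        # any RL method (item 23)
           if h_sI == 0:                           # item 46: done, SMDP solved
               return policy, rho_hat, "optimal"
           side = "right" if h_sI > 0 else "left"  # items 47-48: which uncertainty
           return policy, rho_hat, side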
Right and left uncertainty

   47.a If h(sI ) > 0:
   Right uncertainty.

   [Figure: enclosing triangle ABC with points S and T marked; y1 (the right
   uncertainty) is the gap Sα − Tα between their projections onto w = −l.]
Right and left uncertainty

   47.b Right uncertainty.
   Derivation:

       y1 = Sα − Tα
          = (1/2) ((1 − mβ )wS − (1 − mγ )wT − (mγ − mβ )wC )

   Maximization:

       y1∗ = ( 2s √( ab (ρ/2 − Cβ )(ρ/2 − Cγ ) ) + a(ρ/2 − Cβ ) + b(ρ/2 − Cγ ) ) / c

       s = sign(mβ − mγ )
       a = (1 − mγ )(mβ + 1)
       b = (1 − mβ )(mγ + 1)
       c = (b − a) = 2(mγ − mβ )
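
   A direct transcription of the maximization above as a Python helper. The
   square root over the product term is my reading of the garbled formula
   (it is consistent with the conic of item 57); everything else follows the
   printed definitions:

       import math

       def max_right_uncertainty(rho, m_beta, m_gamma, C_beta, C_gamma):
           """y1* of item 47.b: maximal right uncertainty for a given rho.
           Assumes rho lies inside the current bounds, so the product under
           the root is non-negative, and m_beta != m_gamma (non-degenerate)."""
           s = math.copysign(1.0, m_beta - m_gamma)
           a = (1.0 - m_gamma) * (m_beta + 1.0)
           b = (1.0 - m_beta) * (m_gamma + 1.0)
           c = b - a                               # equals 2*(m_gamma - m_beta)
           prod = a * b * (rho / 2.0 - C_beta) * (rho / 2.0 - C_gamma)
           return (2.0 * s * math.sqrt(prod)
                   + a * (rho / 2.0 - C_beta)
                   + b * (rho / 2.0 - C_gamma)) / c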
Right and left uncertainty

   48.a If h(sI ) < 0:
   Left uncertainty.

   [Figure: enclosing triangle ABC with the point R marked next to A; y2 (the
   left uncertainty) is the gap Rα − Qα between projections onto w = −l.]
Right and left uncertainty




   48.b Left uncertainty.
   It is maximal where expected.
   (When the value level set crosses B.)

       y2 = Rα − Qα
       y2∗ = ((ρ/2 − Bα ) / (ρ/2 − Bγ )) (Bα − Bγ )
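
   The corresponding helper for the left case, transcribing the printed
   closed form (a sketch; the projections Bα and Bγ are assumed to come from
   the current enclosing triangle):

       def max_left_uncertainty(rho, B_alpha, B_gamma):
           """y2* of item 48.b: maximal left uncertainty for a given rho."""
           return (rho / 2.0 - B_alpha) / (rho / 2.0 - B_gamma) * (B_alpha - B_gamma)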
Right and left uncertainty




   49. Fundamental lemma.

   As ρ̂ grows, maximal right uncertainty is monotonically decreasing and
   maximal left uncertainty is monotonically increasing, and both are
   non-negative with minimum 0.
Optimal nudging

   50.
         Find ρ̂ (between the bounds, obviously) such that the maximum
         resulting uncertainty, either left or right, is minimized.
         Since both are monotonic and have minimum 0, the min-max is attained
         when the maximum left and right uncertainties are equal.
         Remark: bear in mind this (↑) is the worst case. It can
         terminate immediately.
         ρ̂ is gain, but neither biased towards observations (initial or
         otherwise), nor slowly updated.

         Optimal nudging is “optimal” in the sense that with this update the
         maximum uncertainty range of resulting ρ values is minimum.
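
   Because of the monotonicity in item 49, the equalizing ρ̂ can be located
   by bisection between the current bounds. A sketch, reusing the two
   uncertainty helpers above (tolerance and argument names are mine):

       def optimal_nudging_rho(lower, upper, right_unc, left_unc, tol=1e-9):
           """Bisection for the rho_hat at which maximal left and right
           uncertainties coincide (item 50): the min-max choice."""
           lo, hi = lower, upper
           while hi - lo > tol:
               mid = 0.5 * (lo + hi)
               # right_unc decreases and left_unc increases with rho (item 49)
               if right_unc(mid) > left_unc(mid):
                   lo = mid     # still too much right uncertainty: move right
               else:
                   hi = mid
           return 0.5 * (lo + hi)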
Optimal nudging




   51. Each iteration maps an enclosing triangle into a new enclosing triangle.


   52. Strictly smaller (both in area and, importantly, in resulting
   uncertainty).
Obtaining an initial enclosing triangle


   53. Setting ρ(0) = 0 and solving.
       Maximizes reward irrespective of cost. (Usual RL problem)
       Can be interpreted geometrically as fanning from the w axis
       to find the policy with w, l coordinates that subtends the
       smallest angle.
       The resulting optimizer maps to a point somewhere along a
       line with intercept at the origin.

   54. The optimum of the SMDP problem lies above, but never behind, that
   line.
   (Else, contradiction.)
Obtaining an initial enclosing triangle




   56. Either way, after iteration 0, the uncertainty is reduced by at least half.
Conic intersection


   57. Maximum right uncertainty is a conic!

                       ⎡     c        −(b + a)      −Cα c      ⎤ ⎡  r  ⎤
       [ r   y1∗   1 ] ⎢  −(b + a)       c        Cβ a + Cγ b  ⎥ ⎢ y1∗ ⎥ = 0
                       ⎣   −Cα c    Cβ a + Cγ b      Cα² c     ⎦ ⎣  1  ⎦


   58. Maximum left uncertainty is a conic!

                       ⎡     0          1         Bγ − Cγ       ⎤ ⎡  r  ⎤
       [ r   y2∗   1 ] ⎢     1          0          −Bγ          ⎥ ⎢ y2∗ ⎥ = 0
                       ⎣  Bγ − Cγ     −Bγ     −2Bα (Bγ − Cγ )   ⎦ ⎣  1  ⎦
Conic intersection




   59. Intersecting them is easy.

   60. And cheap. (Requires in principle constant time and simple
   matrix operations)

   61. So plug it in!
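
   One cheap way to realize items 59–60 numerically: write both conics as
   3×3 symmetric matrices and hand the pair of quadratic forms to a generic
   root finder. A sketch only; the matrix entries follow items 57–58, while
   the scipy call and the starting guess are my additions:

       import numpy as np
       from scipy.optimize import fsolve

       def right_conic(a, b, c, C_alpha, C_beta, C_gamma):
           """3x3 symmetric matrix of item 57."""
           return np.array([
               [c,             -(b + a),                  -C_alpha * c],
               [-(b + a),       c,                         C_beta * a + C_gamma * b],
               [-C_alpha * c,   C_beta * a + C_gamma * b,  C_alpha**2 * c],
           ])

       def left_conic(B_alpha, B_gamma, C_gamma):
           """3x3 symmetric matrix of item 58."""
           d = B_gamma - C_gamma
           return np.array([
               [0.0,  1.0,      d],
               [1.0,  0.0,     -B_gamma],
               [d,   -B_gamma, -2.0 * B_alpha * d],
           ])

       def quad(M, r, y):
           """Evaluate [r y 1] M [r y 1]^T."""
           x = np.array([r, y, 1.0])
           return float(x @ M @ x)

       def intersect(M1, M2, guess=(0.0, 0.0)):
           """Numerically intersect the two conics from a starting guess."""
           return fsolve(lambda v: [quad(M1, *v), quad(M2, *v)], guess)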
Termination Criteria



   62.
         We want to reduce uncertainty to ε.
         Because it is a good idea. (Right?)
         So there’s your termination condition right there.

   63. Alternatively, stop when |h(k) (sI )| < ε.

   64. In any case, if the same policy remains optimal and the sign of
   its nudged value changes between iterations, stop:
   It is the optimal solution of the SMDP problem.
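
   The three stopping tests of items 62–64, collected in one small predicate
   (a sketch; argument names are mine):

       def should_stop(uncertainty, h_sI, prev_h_sI, same_policy, eps):
           """Termination tests of items 62-64."""
           if uncertainty <= eps:                      # item 62: bounds tight enough
               return True
           if abs(h_sI) < eps:                         # item 63: nudged value ~ 0
               return True
           if same_policy and h_sI * prev_h_sI < 0:    # item 64: same policy, sign change
               return True
           return False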
Finding D



  65. A quick and dirty method:
    1   Maximize cost (or episode length, if all costs equal 1).
    2   Multiply by the largest unsigned reinforcement.
  66. So, at most one more RL problem.

  67. If D is estimated too large: wider initial bounds and longer
  computation, but OK.
  68. If D is estimated too small (by other methods, of course): some
  policies map to points outside the triangle in w − l space. (But where?)
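
   The quick-and-dirty estimate of item 65 as code, assuming the maximal
   expected episode cost and the largest unsigned reinforcement are available
   (both placeholders):

       def estimate_D(max_expected_cost, max_abs_reward):
           """Item 65: D <= (max episode cost) * (largest |reward|).
           Overestimating D only widens the initial bounds (item 67)."""
           return max_expected_cost * max_abs_reward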
Recurrent state + unichain considerations

   69. Feinberg and Yang: deciding whether the unichain condition
   holds can be done in polynomial time if a recurrent state exists.

   70. Existence of a recurrent state is common in practice.

   71. (Future work) It can maybe be induced using ε–MDPs.
   (Maybe).

   72. At least one case in which lacking the unichain property is no
       problem: games.
       Certainty of positive policies.
       Non-positive chains.

   73. Happens! (See experiments)
Complexity



   74. Discounted RL is PAC (–efficient).

   75. In problem size parameters (|S|, |A|) and 1/(1 − γ).

   76. Episodic undiscounted RL is also PAC.
   (Following similar arguments, but slightly more intricate
   derivations)

   77. So we call a PAC (–efficient) method a number of times.
Complexity



   78. The worst case of all (“most worstest case foreverest”) is when the
   choice of ρ(k) does not reduce uncertainty at all.

   79. Reducing it in half is a better bound for our method.

   80. ... and it is a tight bound...

   81. ... in cases that are nearly optimal from the outset.

   82. So, at worst, log(1/ε) calls to PAC:
   PAC!
Complexity




   83. Whoops, we proved a complexity bound! That’s a first for SMDP
   (or ARRL, for that matter).


   84. And we inherit convergence from the invoked RL method, so there’s
   also that.
Typically much faster



   85. The worst case happens when we are “already there”.

   86. Otherwise, it depends, but it is certainly better.

   87. The multi-iteration reduction in uncertainty is much better than 0.5
   per iteration, because the reductions accumulate geometrically.

   88. Empirical complexity is better than the already very good upper
   bound.
Bibliography I




   S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine
       Learning, 22(1):159–195, 1996.
   Reinaldo Uribe, Fernando Lozano, Katsunari Shibata, and Charles Anderson. Discount and speed/execution
       tradeoffs in Markov decision process games. In Computational Intelligence and Games (CIG), 2011 IEEE
       Conference on, pages 79–86. IEEE, 2011.
