Optimal Nudging. Presentation UD.

Transcript

  • 1. Optimal Nudging A new approach to solving SMDPs Reinaldo Uribe M Universidad de los Andes — Oita University Colorado State University Nov. 11, 2013
  • 2. Snakes & Ladders Player advances the number of steps indicated by a die. Landing on a snake’s mouth sends the player back to the tail. Landing on a ladder’s bottom moves the player forward to the top. Goal: reaching state 100.
  • 3. Snakes & Ladders Player advances the number of steps indicated by a die. Boring! (No skill required, only luck.) Landing on a snake’s mouth sends the player back to the tail. Landing on a ladder’s bottom moves the player forward to the top. Goal: reaching state 100.
  • 4. Variation: Decision Snakes and Ladders Sets of “win” and “loss” terminal states. Actions: either “advance” or “go back,” to be decided before throwing the die.
  • 5. Reinforcement Learning: finding an optimal policy. “Natural” rewards: ±1 on “win”/“lose”, 0 otherwise. The optimal policy maximizes total expected reward. Dynamic programming quickly finds the optimal policy. Probability of winning: pw = 0.97222…
  • 6. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards.
  • 7. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards. Policies and policy value.
  • 8. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards. Policies and policy value. Max winning probability = max earnings.
  • 9. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards. Policies and policy value. Max winning probability = max earnings. Taking an action incurs a cost (in units different from rewards).
  • 10. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards. Policies and policy value. Max winning probability = max earnings. Taking an action incurs a cost (in units different from rewards). Different actions may have different costs.
  • 11. We know a lot! Markov Decision Process: States, Actions, Transition Probabilities, Rewards. Policies and policy value. Max winning probability = max earnings. Taking an action incurs a cost (in units different from rewards). Different actions may have different costs. Semi-Markov model with average rewards.
  • 12. Better than optimal? (Old optimal policy)
  • 13. Better than optimal? (Optimal policy) with average reward ρ = 0.08701
  • 14. Better than optimal? (Optimal policy) with average reward ρ = 0.08701 pw = 0.48673 (was 0.97222 — 50.06%) d = 11.17627 (was 84.58333 — 13.21%)
  • 15. Better than optimal? (Optimal policy) with average reward ρ = 0.08701 pw = 0.48673 (was 0.97222 — 50.06%) d = 11.17627 (was 84.58333 — 13.21%) This policy maximizes pw / d.
  • 16. So, how are average-reward optimal policies found?
    Algorithm 1: Generic SMDP solver
      Initialize
      repeat forever:
        Act
        Do RL to find the value of the current π
        Update ρ
    Usually 1-step Q-learning. Average-adjusted Q-learning:
      Q_{t+1}(s_t, a_t) ← (1 − γ_t) Q_t(s_t, a_t) + γ_t [ r_{t+1} − ρ_t c_{t+1} + max_a Q_t(s_{t+1}, a) ]
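    A minimal sketch of the average-adjusted update above, with a tabular Q stored as a NumPy array; the function name and argument layout are mine, not the slides':

        # One step of average-adjusted Q-learning for an SMDP.
        import numpy as np

        def adjusted_q_step(Q, s, a, r, c, s_next, rho, gamma_t):
            # Q_{t+1}(s,a) <- (1 - gamma_t) Q_t(s,a)
            #                 + gamma_t [ r - rho*c + max_a' Q_t(s',a') ]
            target = r - rho * c + np.max(Q[s_next])
            Q[s, a] = (1.0 - gamma_t) * Q[s, a] + gamma_t * target
            return Q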
  • 17. Generic Learning Algorithm. Table of ARRL algorithms and their gain updates:
      AAC (Jalali and Ferguson, 1989): ρ_{t+1} ← Σ_{i=0}^{t} r(s_i, π_i(s_i)) / (t + 1)
      R-Learning (Schwartz, 1993): ρ_{t+1} ← (1 − α) ρ_t + α [ r_{t+1} + max_a Q_t(s_{t+1}, a) − max_a Q_t(s_t, a) ]
      SSP Q-Learning (Abounadi et al., 2001): ρ_{t+1} ← ρ_t + α_t min_a Q_t(ŝ, a)
      H-Learning (Tadepalli and Ok, 1998): ρ_{t+1} ← (1 − α_t) ρ_t + α_t [ r_{t+1} − H_t(s_t) + H_t(s_{t+1}) ], with α_{t+1} ← α_t / (α_t + 1)
      HAR (Ghavamzadeh and Mahadevan, 2007): ρ_{t+1} ← Σ_{i=0}^{t} r(s_i, π_i(s_i)) / (t + 1)
  • 18. Generic Learning Algorithm. Table of SMDP RL algorithms and their gain updates:
      SMART (Das et al., 1999): ρ_{t+1} ← Σ_{i=0}^{t} r(s_i, π_i(s_i)) / Σ_{i=0}^{t} c(s_i, π_i(s_i))
      MAX-Q (Ghavamzadeh and Mahadevan, 2001): ρ_{t+1} ← Σ_{i=0}^{t} r(s_i, π_i(s_i)) / Σ_{i=0}^{t} c(s_i, π_i(s_i))
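    The gain updates in the two tables above reduce to one-liners. A minimal sketch of two representative ones (SMART's ratio update and R-Learning's relative update); the function names are illustrative:

        import numpy as np

        def smart_gain(rewards, costs):
            # SMART: total observed reward divided by total observed cost.
            return np.sum(rewards) / np.sum(costs)

        def r_learning_gain(rho, alpha, r, Q, s, s_next):
            # R-Learning: rho <- (1-alpha)*rho + alpha*[r + max_a Q(s',a) - max_a Q(s,a)]
            return (1.0 - alpha) * rho + alpha * (r + np.max(Q[s_next]) - np.max(Q[s]))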
  • 19. Nudging
    Algorithm 2: Nudged Learning
      Initialize (π, ρ, Q)
      repeat:
        Set reward scheme to (r − ρc)
        Solve by any RL method
        Update ρ
      until Q^π(s_I) = 0
  • 20. Nudging
    Algorithm 3: Nudged Learning
      Initialize (π, ρ, Q)
      repeat:
        Set reward scheme to (r − ρc)
        Solve by any RL method
        Update ρ
      until Q^π(s_I) = 0
    Note: “by any RL method” refers to a well-studied problem for which better algorithms (both practical and with theoretical guarantees) exist. ρ can (and will) be updated optimally.
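    A minimal sketch of the nudged outer loop above, assuming a black-box solve_with_rl that stands in for “any RL method” and a placeholder update_rho; the optimal ρ update (the conic intersection described later) is not reproduced here:

        # Outer loop of Nudged Learning.
        def nudged_learning(solve_with_rl, update_rho, s_init, rho=0.0, tol=1e-6):
            while True:
                # Solve the task with nudged rewards r - rho * c by any RL method.
                policy, Q = solve_with_rl(lambda r, c: r - rho * c)
                q_init = max(Q[s_init].values())   # Q^pi at the initial state
                if abs(q_init) < tol:              # stop when Q^pi(s_I) = 0
                    return policy, rho
                rho = update_rho(rho, q_init)      # nudge the gain estimate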
  • 21. The w − l space. Definition (policy π has expected average reward v^π and expected average cost c^π; let D be a bound on the absolute value of v^π):
      w^π = (D + v^π) / (2 c^π),   l^π = (D − v^π) / (2 c^π)
    [Figure: scatter of policies mapped into the w–l plane, axes w and l marked at D.]
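    The map from (v^π, c^π) to (w^π, l^π) is a one-line computation; a small sketch (names are mine):

        # Map a policy's average reward v and average cost c into the w-l plane,
        # given a bound D on |v|.
        def to_wl(v, c, D):
            w = (D + v) / (2.0 * c)
            l = (D - v) / (2.0 * c)
            return w, l

        # Example: v = 0.5, c = 2.0, D = 1.0 maps to (w, l) = (0.375, 0.125);
        # note that w + l = D / c and w - l = v / c.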
  • 22. The w − l space. Value and Cost (same definition: w^π = (D + v^π) / (2 c^π), l^π = (D − v^π) / (2 c^π)).
    [Figure: level curves of policy value (ranging from −D to D) and of policy cost in the w–l plane.]
  • 23. The w − l space. Nudged value (same definition: w^π = (D + v^π) / (2 c^π), l^π = (D − v^π) / (2 c^π)).
    [Figure: scatter of policies in the w–l plane with level lines of the nudged value at −D/2, 0 and D/2.]
  • 24. The w − l space. As a projective transformation.
  • 25. The w − l space. As a projective transformation.
    [Figure: policies plotted by episode length (horizontal axis) against policy value (vertical axis, from −D to D).]
  • 26. The w − l space. As a projective transformation.
    [Figure: the episode-length vs. policy-value plot alongside its projective image in the w–l plane.]
  • 27. Sample task: two states, continuous actions. State s1: a1 ∈ [0, 1], r1 = 1 + (a1 − 0.5)^2, c1 = 1 + a1.
  • 28. Sample task: two states, continuous actions. State s1: a1 ∈ [0, 1], r1 = 1 + (a1 − 0.5)^2, c1 = 1 + a1. State s2: a2 ∈ [0, 1], r2 = 1 + a2, c2 = 1 + (a2 − 0.5)^2.
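    A small sketch of this example's average reward as a function of the policy (a1, a2), assuming the two states are visited alternately so the gain is per-cycle reward over per-cycle cost (that alternation is my reading of the example, not stated on the slide):

        # Average reward of policy (a1, a2) for the two-state example,
        # assuming the chain alternates s1 -> s2 -> s1 -> ...
        def average_reward(a1, a2):
            r = (1 + (a1 - 0.5) ** 2) + (1 + a2)      # reward per cycle
            c = (1 + a1) + (1 + (a2 - 0.5) ** 2)      # cost per cycle
            return r / c

        # Example: average_reward(0.0, 1.0) = 3.25 / 2.25 ≈ 1.444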
  • 29. Sample task: two states, continuous actions. Policy Space (Actions).
    [Figure: the unit square of policies, a1 on the horizontal axis and a2 on the vertical axis, both in [0, 1].]
  • 30. Sample task: two states, continuous actions. Policy Values and Costs.
    [Figure: policy value and policy cost plotted over the policy space.]
  • 31. Sample task: two states, continuous actions. Policy Manifold in w − l.
    [Figure: the image of the policy space in the w–l plane; axes marked at D/2.]
  • 32. And the rest... Neat geometry: the problems become linear in w − l and are easily exploited with straightforward algebra and calculus. Updating the average reward between iterations can be optimized: it becomes finding the (or rather an) intersection of two conics, which can be solved in O(1) time.
  • 33. And the rest... Neat geometry: the problems become linear in w − l and are easily exploited with straightforward algebra and calculus. Updating the average reward between iterations can be optimized: it becomes finding the (or rather an) intersection of two conics, which can be solved in O(1) time. In the worst case the uncertainty is halved at each iteration; typically the reduction is much better. Little extra complexity is added to methods that are already PAC.
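    The talk's optimal ρ update solves a conic intersection in O(1) time; as a simpler stand-in that still shows the worst-case halving of uncertainty, here is a plain bisection on ρ (a generic illustration, not the method from the slides):

        # Bracket the gain rho* for which Q^pi(s_I) = 0 and halve the
        # bracket each iteration (the worst-case behaviour quoted on slide 33).
        def bisect_rho(q_at_init, lo, hi, iters=30):
            # q_at_init(rho) returns Q^pi(s_I) after solving with rewards r - rho*c.
            for _ in range(iters):
                mid = 0.5 * (lo + hi)
                if q_at_init(mid) > 0:   # some policy beats gain mid: rho* is higher
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)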
  • 34. Thank you. r-uribe@uniandes.edu.co Untitled by Li Wei, School of Design, Oita University, 2009.