Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

- MAPPING INDIVIDUAL AND COLLECTIVE R... by Anant Kumar 363 views
- decision making criterion by Gaurav Sonkar 39023 views
- Decision theory by Aditya Mahagaonkar 27785 views
- Modal Vebs I by esthercurielmartin 1693 views
- Robust Policy Computation in Reward... by Kevin Regan 1120 views
- Bad Things May Be Good for You: cr... by Alan Dix 548 views

2,476 views

2,338 views

2,338 views

Published on

Talk given for University of Toronto Machine Learning Seminar. Fall 2009.

No Downloads

Total views

2,476

On SlideShare

0

From Embeds

0

Number of Embeds

267

Shares

0

Downloads

33

Comments

0

Likes

2

No embeds

No notes for slide

transitions are learnable - do not change from user to user

transitions are learnable - do not change from user to user

transitions are learnable - do not change from user to user

- 1. Regret-Based Reward Elicitation for Markov Decision Processes Kevin Regan University of Toronto Craig Boutilier
- 2. Introduction 2 Motivation
- 3. Introduction 3 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards
- 4. Introduction 4 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards Specifying dynamics a priori can be difﬁcult • We can learn a model of the world in either an ofﬂine or online (reinforcement learning) setting
- 5. Introduction 5 Motivation Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards In some simple cases reward can be thought of as being directly “observed” • For instance: the reward in a robot navigation problem corresponding to the distance travelled
- 6. Introduction 6 Motivation Except in some simple cases, the speciﬁcation of reward functions for MDPs is problematic • Rewards can vary user-to-user • Preferences about which states/actions are “good” and “bad” need to be translated into precise numerical reward • Time consuming to specify reward for all states/actions Example domain: assistive technology
- 7. Introduction 7 Motivation However, • Near-optimal policies can be found without a fully speciﬁed reward function • We can bound the performance of a policy using regret
- 8. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
- 9. Decision Theory 9 Utility Given A decision maker (DM) A set of possible outcomes Θ A set of lotteries L of the form: l ≡ 〈 p1 , x1 , p2 , x2 ,…, pn , xn 〉 where xi ∈Θ, ∑p i =1 i l ≡ 〈x1 , p, x2 〉 = 〈 p, x1 ,(1 − p), x2 〉 Compound lotteries l 1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉 l2 y l1= x l2 = z
- 10. Decision Theory 9 Utility Given A decision maker (DM) A set of possible outcomes Θ A set of lotteries L of the form: l ≡ 〈 p1 , x1 , p2 , x2 ,…, pn , xn 〉 where xi ∈Θ, ∑p i =1 i l ≡ 〈x1 , p, x2 〉 = 〈 p, x1 ,(1 − p), x2 〉 Compound lotteries l 1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉 = 〈0.75, x, 0.15, y, 0.1, z〉 y l2 y z l1= x l2 = z l1= x
- 11. Decision Theory 10 Utility Axioms Completeness Transitivity Independence Continuity
- 12. Decision Theory 11 Utility Axioms Completeness For x, y ∈Θ Transitivity It is the case that either: Independence x is weakly preferred to y : x ± y Continuity y is weakly preferred to x : x y One is indifferent : x ~ y
- 13. Decision Theory 12 Utility Axioms Completeness For any x, y, z ∈Θ Transitivity If x ± y and y ± z Independence Then x ± z Continuity
- 14. Decision Theory 13 Utility Axioms Completeness For every l 1 , l 2 , l 3 ∈L and p ∈(0,1) Transitivity If l 1 f l 2 Independence Then 〈l 1 , p, l 3 〉 f 〈l 2 , p, l 3 〉 Continuity
- 15. Decision Theory 14 Utility Axioms Completeness For every l 1 , l 2 , l 3 ∈L Transitivity If l 1 f l 2 f l 3 Independence Then for some p ∈(0,1) : Continuity l 2 ~ 〈l 1 , p, l 3 〉
- 16. Decision Theory 15 Utility Axioms Completeness There exists a utility function u : Θ → ° Transitivity Such that: Independence u(x) ≥ u(y) ⇔ x ± y Continuity n u(l ) = 〈 p1 , x1 ,…, pn , xn 〉 = ∑ pi u(xi ) i The utility of a lottery is the expected utility of its outcomes
- 17. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
- 18. Preference Elicitation 17 Queries Ranking Please order this subset of outcomes Standard Gamble 〈x1 , x2 ,…, xm 〉 Bound u(x1 ) ≥ u(x2 ) ≥ u(x3 ) ≥ L ≥ u(xm )
- 19. Preference Elicitation 18 Queries Ranking Please choose a p for which you Standard Gamble are indifferent between y and the Bound lottery 〈x ï , p, x ⊥ 〉 ï ⊥ y ~ 〈x , p, x 〉 u(y) = p
- 20. Preference Elicitation 19 Queries Ranking Please choose a p for which y is at Standard Gamble least as good as the lottery 〈x ï ,b, x ⊥ 〉 Bound y ± 〈x ï ,b, x ⊥ 〉 u(y) ≥ b
- 21. Preference Elicitation 20 Preference Elicitation Rather than fully specifying a utility function, we 1. Make decision w.r.t. an imprecisely speciﬁed utility function 2. Perform elicitation until we are satisﬁed with the decision Prob Make Decision Satisﬁed? YES Done Util NO User Select Query
- 22. Preference Elicitation 21 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max max u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 23. Preference Elicitation 22 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max max u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 24. Preference Elicitation 23 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max min u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 25. Preference Elicitation 24 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg max min u(x) Minimax Regret x∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 26. Preference Elicitation 25 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 27. Preference Elicitation 26 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 x2 7 7 1 x3 2 2 2
- 28. Preference Elicitation 27 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 x3 2 2 2
- 29. Preference Elicitation 28 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 x3 2 2 2
- 30. Preference Elicitation 29 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2
- 31. Preference Elicitation 30 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2
- 32. Preference Elicitation 31 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2 6
- 33. Preference Elicitation 32 Robust Decision Criteria Maximax Given a set of feasible utility functions U Maximin arg min max max u(x ') − u(x) Minimax Regret x∈Θ x '∈Θ u∈U u2 u3 Max u1 Regret x1 8 2 1 5 x2 7 7 1 1 x3 2 2 2 6
- 34. Preference Elicitation 33 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions
- 35. Preference Elicitation 34 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions φ arg max E [u(x)] u x∈Θ
- 36. Preference Elicitation 35 Bayesian Decision Criteria Expected Utility Assuming we have a prior φ over Value At Risk potential utility functions φ ( ) arg max max Pr Eu [u(x)] ≥ δ ≥ η x∈Θ δ 90% η = 90% 10% δ
- 37. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
- 38. Markov Decision Processes 37 Markov Decision Process (MDP) S - Set of States at at+1 A - Set of Actions st st+1 st+2 … Pr(s ' | a, s) - Transitions rt rt+1 rt+2 α - Starting State Distribution γ - Discount Factor WORLD r(s) - Reward [or r(s, a) ] States Actions AGENT
- 39. Markov Decision Processes 37 Markov Decision Process (MDP) S - Set of States at at+1 A - Set of Actions st st+1 st+2 … Known Pr(s ' | a, s) - Transitions rt rt+1 rt+2 α - Starting State Distribution γ - Discount Factor WORLD ? r(s) - Reward [or r(s, a) ] States Actions AGENT
- 40. Markov Decision Processes 38 MDP - Policies Policy A stationary policy π maps each state to an action For inﬁnite horizon MDPs, every policy is a stationary policy Policy Given a policy π , the value of a state is Value π ∞ t V (s0 ) = E ∑ γ r π , s0 t=0
- 41. Markov Decision Processes 39 MDP - Computing Value Function The value of a policy can be found by successive approximation V0π (s) = r(s, aπ ) V1π (s) = r(s, aπ ) +γ ∑ Pr( s′ | s, aπ )V0π (s ') s' M M M Vkπ (s) = r(s, aπ ) +γ ∑ Pr( s′ | s, aπ )V π (s ') k−1 s' There will exist a ﬁxed point π π V (s) = r(s, aπ ) +γ ∑ Pr(s ' | s, aπ )V ( s′ ) s'
- 42. Markov Decision Processes 40 MDP - Optimal Value Functions Optimal We wish to ﬁnd the optimal policy π* Policy * π′ π : V ≥V ∀π ' π* π* Bellman V (s) = max r(s, aπ * ) +γ ∑ Pr( s′ | s, aπ * )V (s ') a s' Equation
- 43. Markov Decision Processes 41 Value Iteration Algorithm Yields an Ú− optimal policy 1. initialize V0 , set n = 0, choose Ú> 0 2. For each s : Vn+1 (s) = max r(s, a) +γ ∑ Pr( s′ | s, a)Vn (s ') a s' (1 − γ ) 3. If Vn+1 − Vn > Ú : 2γ increment n and return to step 2 We can recover the policy by ﬁnding the best one step action π (s) = arg max r(s, a) +γ ∑ Pr( s′ | s, a)V (s ') a s'
- 44. Markov Decision Processes 42 Linear Programming Formulation minimize V ∑ α (s)V (s) s subject to V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s'
- 45. Markov Decision Processes 43 MDP - Occupancy Frequencies f (s, a) An occupancy frequency f (s, a) expresses the total discounted probability of being in state s and taking action a Valid ∑ f (s , a) = ∑ ∑ Pr(s 0 0 | s, a) f (s, a) − α (s0 ) ∀s0 a s a f (s, a)
- 46. Markov Decision Processes 44 LP - Occupancy Frequency min. V ∑ α (s)V (s) s subj: V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s' max. f ∑ ∑ f (s, a)r(s, a) s a subj: ∑ f (s , a) − γ ∑ ∑ Pr(s 0 0 | s, a) f (s, a) = α (s0 ) ∀s0 a s a
- 47. Markov Decision Processes 44 LP - Occupancy Frequency ∑ ∑ f (s, a)r(s, a) = ∑ α (s)V (s) s a s min. V ∑ α (s)V (s) s subj: V (s) ≥ r(s, a) + γ ∑ Pr(s ' | s, a)V (s ') ∀a, s s' max. f ∑ ∑ f (s, a)r(s, a) s a subj: ∑ f (s , a) − γ ∑ ∑ Pr(s 0 0 | s, a) f (s, a) = α (s0 ) ∀s0 a s a
- 48. Markov Decision Processes 45 MDP Summary Slide Policies Over the past couple of decades, there has Dynamics been lot of work done on scaling MDPs Rewards Factored Models Decomposition Linear Approximation
- 49. Markov Decision Processes 46 MDP Summary Slide Policies To use these algorithms we need a model of Dynamics the dynamics (transition function). There are techniques for: Rewards Deriving models of dynamics from data. Finding policies that are robust to inaccurate transition models
- 50. Markov Decision Processes 47 MDP Summary Slide Policies There has been comparatively little work on Dynamics specifying rewards Rewards Finding policies that are robust to imprecise reward models Eliciting reward information from users
- 51. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
- 52. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
- 53. Text 50 Current Work MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query
- 54. Model : MDP 51 MDP - Reward Uncertainty We quantify the strict uncertainty over reward with a set of feasible reward functions R We specify R using a set of linear inequalities forming a polytope Where do these inequalities come from? Bound queries: Is r(s,a) > b? Policy comparisons: Is fπ ·r > fπ′ ·r ?
- 55. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
- 56. Computation 53 Minimax Regret Original min max max g ·r − f ·r Formulation f∈F g∈F r∈R Benders’ minimize δ f∈F , δ Decomposition subject to : δ ≥ g ·r − f ·r ∀ g ∈F r ∈R
- 57. Computation 54 Minimax Regret Original min max max g ·r − f ·r Formulation f∈F g∈F r∈R Benders’ minimize δ f∈F , δ Decomposition subject to : δ ≥ g ·r − f ·r ∀ g ∈V ( F ) r ∈V ( R ) Maximums will exist at the vertices of F and R Rather than enumerating an exponential number of vertices we use constraint generation
- 58. Computation 55 Minimax Regret - Constraint Generation 1. We limit adversary • Player minimizes regret w.r.t. a small set of adversary responses 2. We untie adversary’s hands • Adversary ﬁnds maximum regret w.r.t. player’s policy • Add adversary’s choice of r and g to set of adversary responses Done when: untying adversary’s hands yields no improvement • ie. regret of player minimizing = regret of adversary maximizing
- 59. Computation 56 Constraint Generation - Player 1. Limit adversary minimize δ f∈F , δ subject to : δ ≥ g ·r − f ·r ∀ 〈 g, r 〉 ∈GEN
- 60. Computation 57 Constraint Generation - Adversary 2. Untie adversary’s hands: Given player policy f max max g ·r − f ·r g∈F r∈R This formulation is a non-convex linear program We reformulate as a mixed integer linear program
- 61. (indeed, it is the maximally violated such constraint). So it Computation 58 is added to Gen and the process repeats. Constraint Generation ,-R) is realized by the following MIP, Computation of MR(f Adversary using value and Q-functions:1 2. maximize α · V − r · f (9) Q,V,I,r subject to: Qa = ra + γPa V ∀a∈A V ≥ Qa ∀a∈A (10) V ≤ (1 − Ia )Ma + Qa ∀a∈A (11) Cr ≤ d X Ia = 1 (12) a Ia (s) ∈ {0, 1} ∀a, s (13) ⊥ Ma = M − Ma Only tractablerepresents the adversary’s policy, with Ia (s) de- Here I for small Markov Decision Problems noting the probability of action a being taken at state s
- 62. ) ! " # $ % & ' ( )* )) Computation 59 +,-./012314565/7 Figure 2: Scaling of constraint generation with number of states. Approximating Minimax Regret 9.:54;<.0=//1/0>7/7470?5@09.A/.40<670*+,-./01203454.6 )78) We )7(# the Max Regret MIP formulation relax 9.:54;<.0=//1/ )7() The )7)# value of the resulting policy is no longer exact, however, resulting reward still feasible. We ﬁnd optimal policy w.r.t. to )7)) resulting reward # ! " $ % & ' () *+,-./01203454.6 9.:54;<.0=//1/0>7/7470?;B;,5@09.A/.40<670*+,-./01203454.6 )78) 9.:54;<.0=//1/ )7(# )7() )7)# )7)) ! " # $ % & ' () *+,-./01203454.6 Figure 3: Relative approximation error of linear relaxation
- 63. Computation 60 Scaling (Log Scale) 89-/1<7=1+,-./012314565/7 )***** >?6@51A9B9-6?1C/D0/5 EFF02?9-65/1A9B9-6?1C/D0/5 )**** )*** 89-/1:-7; )** )* ) ! " # $ % & ' ( )* )) +,-./012314565/7 Figure 2: Scaling of constraint generation with number of states.
- 64. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
- 65. Reward Elicitation 62 Reward Elicitation MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query
- 66. Reward Elicitation 63 Bound Queries Query Is r(s,a) > b? where b is a point between the upper and lower bounds of r(s,a) Gap Δ(s, a) = max r '(s, a) − min r(s, a) r' r At each step of elicitation we need to select the s, a parameters and b using the gap:
- 67. Reward Elicitation 64 Selecting Bound Queries Halve the Largest Gap (HLG) Current Solution (CS) Select the s,a with the Use the current solution g(s,a) largest gap Δ(s,a) [or f(s,a)] of the minimax regret calculation to weight Set b to the midpoint of the each gap Δ(s,a) interval for r(s,a) Select the s,a with the largest weighted gap g(s,a)Δ(s, a) Set b to the midpoint of the interval for r(s,a)
- 68. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
- 69. Evaluation 66 Experimental Setup Randomly generated MDPs Semi-sparse random transition function, discount factor of 0.95 Random true reward drawn from ﬁxed interval, upper and lower bounds on reward drawn randomly All results are averaged over 20 runs 10 states 5 actions
- 70. Evaluation 67 Elicitation Effectiveness We examine the combination of each criteria for robust policies with each of the elicitation strategies Minimax Regret Halve the Largest Gap ƒ (MMR) (HLG) Maximin Regret Current Solution (MR) (CS)
- 71. Evaluation 68 Max Regret - Random MDP /01+2(3)(4+567+,'-.()+89+&'():(6 %" 0.35 #$ 0.12 /01:-:;+<+=>? /:;:-01+<+=>? 0.30 %! /01:-:;+<+@A #! 0.10 /:;:-01+<+@A $" 0.25 1 0.08 /01+2(3)(4 2)'(+3(4)(5 Max Regret True Regret $! 0.20 0 0.06 #" 0.15 0.04 / #! 0.10 56+>+?@A 34+>+?@A $ 0.02 " 0.05 56+>+BC 34+>+BC ! ! "! %!! !1 "! #!! #"! $!! $"! %!! ! &'()*+,'-.()
- 72. Evaluation 69 True Regret (Loss) - Random MDP 2)'(+3(4)(5+678+,'-.()+9:+&'();(7 #$ 0.12 -:;+<+=>? <=>;-;?+@+ABC 01+<+=>? <;?;-=>+@+ABC -:;+<+@A #! 0.10 <=>;-;?+@+DE 01+<+@A <;?;-=>+@+DE 1 0.08 2)'(+3(4)(5 True Regret 0 0.06 0.04 / $ 0.02 ! "! %!! 1 ! "! #!! #"! $!! $"! %!! &'()*+,'-.()
- 73. Evaluation 70 Queries per Reward Point - Random MDP <45;1=/7,0>0?+./4.509./0/.67/80914:; $&! 700 $!! 600 Most of reward 500 #&! space unexplored *+,-./0120/.67/80914:;5 #!! 400 "&! 300 We repeatedly query a small "!! 200 set of “high impact” reward points &! 100 ! ! " # $ % & ' ( ) *+,-./01203+./4.5
- 74. Evaluation 71 Autonomic Computing Setup Host 1 Demand Resource 2 Hosts Total 3 Demand levels Resource 3 Units of Resource M Model Host k Demand Resource 90 States 10 Actions
- 75. Evaluation 72 Max Regret - Autonomic Computing Queries vs. Max Regret 0.7 0.12 Maximin Minimax Regret 0.6 0.10 0.5 0.08 True Regret Max Regret 0.4 0.06 0.3 0.04 0.2 0.02 0.1 egret 0.0 0.00 1000 1 0 200 400 600 800 1000 0 Queries
- 76. Evaluation 73 True Regret (Loss) - Autonomic Computing Queries vs. True Regret 0.12 Maximin egret Minimax Regret 0.10 0.08 True Regret 0.06 0.04 0.02 0.00 1000 0 1 200 400 600 800 1000 Queries
- 77. Outline 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work A. Imprecise Reward Speciﬁcation B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
- 78. Introduction 75 Overview MDP Compute Satisﬁed? YES Done Robust Policy R NO User Select Query
- 79. Introduction 75 Contributions Compute 1. A technique for ﬁnding robust policies using Satisﬁed? YES Done Robust Policy minimax regret NO 2. A simple elicitation procedure that quickly leads to Select Query near-optimal/optimal policies
- 80. Conclusion 76 Future Work Bottleneck: Adversary’s max regret computation Scaling Idea: The set Γ of adversary policies g that will ever be a regret maximizing response can be small Factored MDPs Approaches that uses Γ to Richer efﬁciently compute max regret Queries We have An algorithm to ﬁnd Γ A theorem that shows the algorithm runs in time polynomial in the number of policies found
- 81. Conclusion 77 Future Work Scaling Working with Factored MDPs will Factored Model problems in a more natural way MDPs Richer Allow us to use lower the dimensionality of Queries the reward functions Leverage existing techniques for scaling MDPs that take advantage of factored
- 82. Conclusion 78 Future Work In state s which action would you like to take? Scaling Factored MDPs In state s do you prefer action a1 to a2 ? Richer Queries Do you prefer sequence s1 , a1 , s2 , a2 ,…sk to ′ ′ ′ ′ ′ s , a , s , a ,…s ? 1 1 2 2 k
- 83. Conclusion 79 Future Work Do you prefer tradeoff Scaling f (s2 , a3 ) = f1 amount of time doing (s2 , a3 ) and f (s1 , a4 ) = f2 amount of time doing (s1 , a4 ) Factored or MDPs f ′ (s2 , a3 ) = f 1 amount of time doing (s2 , a3 ) and ′ Richer f ′ (s1 , a4 ) = f ′2 amount of time doing (s1 , a4 ) ? Queries f1 s Cab Available f1 s No Street Car f2 f2 a Take Cab a Waiting s No Street Car s Cab Available f2 f1 f1 f2 a Waiting a Take Cab
- 84. Thank you.
- 85. Regret-Based Reward Elicitation for Markov Decision Processes Kevin M Regan University of Toronto Craig Boutilier
- 86. f g r ax min r·f (7) subject to: γE f + α = 0 Appendix 82 F r∈R γE g + α = 0 Full Formulation on the adversary. If MR(f , R) = MMR (R) then the con- to com uncertainty in any MDP pa- Cr ≤ at straint for g, r is satisﬁed d the current solution, and in- mine k has focused on uncertainty deed all unexpressed constraints must be satisﬁed as well. have t This is equivalent to a minimization: the of eliciting information The process then terminates with minimax optimal solu- freque rewards is left unaddressed. tion minimize δ MR(f , R) > MMR (R), implying that f . Otherwise, (8) exact f ,δ the constraint for g, r is violated in the current relaxation Master uted for uncertain transition (indeed, it is the r · g − r · f violated suchF, r ∈ R So it subject to: maximally ≤ δ ∀ g ∈ constraint). We ha an alt riterion by decomposing the is added to Gen and the+ α = 0 repeats. γE f process sarial nd using dynamic program- ization to ﬁnd the worst case Computation of MR(f , R) is realized by the following MIP, (for a ]. McMahan, Gordon, and This corresponds Q-functions:1 dual LP formulation of using value and to the standard for m rogramming approach to ef- an MDP with the addition of adversarial policy constraints. imatio maximize α · V − r · f (9) n value of an MDP (we em- The inﬁnite number of constraints can be reduced: ﬁrst we Q,V,I,r tice): need only retain as potentially active those ∀ a ∈ A subject to: Qa = ra + γPa V constraints for the in ch to ours below). Delage vertices of polytope R; Qa for any r ∈ R, weaonly require V ≥ and ∀ ∈A (10) tors. oblem of uncertainty over re- V ≤ (1 − a )M + Qa ∀a∈A ∗ the constraint correspondingIto itsaoptimal policy gr . How- (11) does n functions) in the presence of policy rcentile criterion, which can ever, vertex enumeration is not feasible; so we apply Ben- Cr ≤ d Subproblem than maximin. They also ders’ decomposition [2]a to iteratively generate constraints. X I =1 (12) constr remai ng rewards using sampling to At each iteration, two optimizations are solved. The master a value e of information of noisy in- Ia (s) ∈ {0, of ∀a, s problem solves a relaxation 1} program (8) using only a (13) that is ard space. The percentile ap- small subset of the constraints,M⊥ Ma = M − corresponding to a subset a this s n nor does it offer a bound on Gen of all g, r pairs; we call these generated constraints. lution es ([20]) also adopt maximin Initially, this set is arbitrary (e.g., empty). with Ia (s) de- Here I represents the adversary’s policy, Intuitively, in lem s noting the probability of action a being taken at state s
- 87. Evaluation 83 Maximin Value - Random MDP 2345-56+738'(+9:;+,'-.()+<=+&'()5(: #!! 1.00 %" 0.35 0.30 %! 0.95 1" $" 0.25 0.90 1! 2345-56+738'( Maximin Value /01+2(3)(4 Max Regret $! 0.20 0.85 0" #" 0.15 0! 0.80 #! 0.10 2345-56+>+?@A 0.75 /" 2565-34+>+?@A " 0.05 2345-56+>+BC 2565-34+>+BC /! 0.70 1 ! ! "! #!! #"! $!! $"! %!! ! &'()*+,'-.()
- 88. Computation 84 Regret Gap vs Time 3.4567/859/:1;/+,-./!/3.<= "&# "$# "## *# 3.4567/859 Regret Gap (# &# $# # !$# !"### # "### $### %### &### '### (### )### *### +,-./0-12

No public clipboards found for this slide

Be the first to comment