# Regret-Based Reward Elicitation for Markov Decision Processes

Talk given for University of Toronto Machine Learning Seminar. Fall 2009.

Published in: Technology, Economy & Finance
• Markov decision processes are an extremely useful model for decision making in stochastic environments. To use them we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics in both offline and online settings.
• Rewards are often assumed to be directly observable parts of the world. My perspective is that "rewards are in people's heads": in some cases there is a simple mapping between what you want (in your head) and the world (i.e., finding the shortest path).
• In some simple cases, rewards can be thought of as being directly observable, for instance the distance travelled in a robot navigation problem where we are trying to get a robot from point A to point B. When I am in my car trying to get from point A to point B, I want the path with the fewest stoplights; someone else may want the path with the nicest scenery, while someone else may sacrifice some stoplights for some scenery. Reward is a surrogate for subjective preferences... (flip slide: sometimes it's easy, but...)
• The dynamics, in combination with simple bounds on the reward function, lead to areas of reward space that do not have a high impact on the value of a policy
• Maximin is common but we use regret
• So for f2, no matter what the instantiation of reward, by changing policy the player only stood to gain by one. This is an intuitive measure.
• Robust MDP literature often assumes transitions are known
transitions are learnable - do not change from user to user
• Convergence properties
• explain how constraints create max
• Question: Is it worth spending more time to familiarize the audience with "occupancy frequencies"?
• I will vocally mention other representations
• In the voice over I will explain how each reformulation proceeds from the previous expression
• I would also like to give a clear intuition as to why this is inherently hard.
• On average less than 10% error
• I will also note that on a 90-state MDP with 16 actions the relaxation computes minimax regret in less than 3 seconds.
• Here I will review the preference elicitation process
• Note that it is useful in non-sequential
• Now I have left out the autonomic computing results, due to lack of time. If there is a little time, after giving the results for the random MDPs I can state that we have similar results for a large MDP instance.
• 20 runs: 20 MDPs with 10 states and 5 actions
• We have a partially specified reward function, and we compute a robust policy that minimizes maximum regret. We then elicit information about the reward function, which leads to a better policy. We continue this process until we have an optimal policy or our regret guarantee is small enough.
• I will give the context in the voice over. The main idea on this slide is that: in practice constraint generation quickly converges. To segue to the next slide I will recall that we still need to solve a MIP with |S||A| variables and constraints, thus we developed an approximation.
### Regret-Based Reward Elicitation for Markov Decision Processes

1. Regret-Based Reward Elicitation for Markov Decision Processes. Kevin Regan and Craig Boutilier, University of Toronto
2. Introduction: Motivation
3. Introduction: Motivation. Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards
4. Introduction: Motivation. Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards. Specifying dynamics a priori can be difficult • We can learn a model of the world in either an offline or online (reinforcement learning) setting
5. Introduction: Motivation. Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments • Model requires dynamics and rewards. In some simple cases reward can be thought of as being directly "observed" • For instance: the reward in a robot navigation problem corresponding to the distance travelled
6. Introduction: Motivation. Except in some simple cases, the specification of reward functions for MDPs is problematic • Rewards can vary user-to-user • Preferences about which states/actions are "good" and "bad" need to be translated into precise numerical reward • Time consuming to specify reward for all states/actions. Example domain: assistive technology
7. Introduction: Motivation. However, • Near-optimal policies can be found without a fully specified reward function • We can bound the performance of a policy using regret
8. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
9. Decision Theory: Utility. Given: a decision maker (DM); a set of possible outcomes $\Theta$; a set of lotteries $L$ of the form $l \equiv \langle p_1, x_1, p_2, x_2, \ldots, p_n, x_n \rangle$ where $x_i \in \Theta$ and $\sum_{i=1}^{n} p_i = 1$. Binary shorthand: $l \equiv \langle x_1, p, x_2 \rangle = \langle p, x_1, (1-p), x_2 \rangle$. Compound lotteries: $l_1 = \langle 0.75, x, 0.25, l_2 \rangle$ with $l_2 = \langle 0.6, y, 0.4, z \rangle$. [Diagram: lottery trees for $l_1$ and $l_2$]
10. Decision Theory: Utility. Same as slide 9, with the compound lottery reduced to a simple one: $l_1 = \langle 0.75, x, 0.25, \langle 0.6, y, 0.4, z \rangle \rangle = \langle 0.75, x, 0.15, y, 0.1, z \rangle$. [Diagram: the tree for $l_1$ with $l_2$ expanded in place]
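The reduction of a compound lottery on slide 10 can be checked mechanically. The following is a minimal sketch, assuming a lottery is represented as a list of `(probability, item)` pairs where an item is either an outcome label or a nested lottery; the function name `flatten` and this representation are illustrative choices, not something taken from the talk.

```python
from collections import defaultdict


def flatten(lottery, weight=1.0, out=None):
    """Reduce a (possibly compound) lottery to a simple one.

    A lottery is a list of (probability, item) pairs; an item is either an
    outcome label (str) or another lottery (list). This encoding is assumed.
    """
    if out is None:
        out = defaultdict(float)
    for p, item in lottery:
        if isinstance(item, list):
            # Nested lottery: multiply probabilities along the branch.
            flatten(item, weight * p, out)
        else:
            out[item] += weight * p
    return dict(out)


# l1 = <0.75, x, 0.25, <0.6, y, 0.4, z>> from the slide.
l2 = [(0.6, "y"), (0.4, "z")]
l1 = [(0.75, "x"), (0.25, l2)]
simple = flatten(l1)
print(simple)  # probabilities for x, y, z
```

The resulting probabilities should match the slide's $\langle 0.75, x, 0.15, y, 0.1, z \rangle$ up to floating-point rounding.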
11. Decision Theory: Utility Axioms. Completeness, Transitivity, Independence, Continuity
12. Decision Theory: Utility Axioms (Completeness). For $x, y \in \Theta$ it is the case that either $x$ is weakly preferred to $y$: $x \succeq y$; or $y$ is weakly preferred to $x$: $y \succeq x$; or one is indifferent: $x \sim y$
13. Decision Theory: Utility Axioms (Transitivity). For any $x, y, z \in \Theta$: if $x \succeq y$ and $y \succeq z$ then $x \succeq z$
14. Decision Theory: Utility Axioms (Independence). For every $l_1, l_2, l_3 \in L$ and $p \in (0,1)$: if $l_1 \succ l_2$ then $\langle l_1, p, l_3 \rangle \succ \langle l_2, p, l_3 \rangle$
15. Decision Theory: Utility Axioms (Continuity). For every $l_1, l_2, l_3 \in L$: if $l_1 \succ l_2 \succ l_3$ then for some $p \in (0,1)$: $l_2 \sim \langle l_1, p, l_3 \rangle$
16. Decision Theory: Utility Axioms. There exists a utility function $u : \Theta \to \mathbb{R}$ such that $u(x) \ge u(y) \Leftrightarrow x \succeq y$ and $u(l) = u(\langle p_1, x_1, \ldots, p_n, x_n \rangle) = \sum_{i=1}^{n} p_i u(x_i)$. The utility of a lottery is the expected utility of its outcomes
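As a small illustration of the last statement, the utility of a simple lottery is the probability-weighted sum of outcome utilities. This is only a sketch: the utility values below are made up for the example and the function name `expected_utility` is hypothetical.

```python
def expected_utility(lottery, u):
    """Expected utility of a simple lottery.

    lottery: list of (probability, outcome) pairs.
    u: dict mapping each outcome to its utility (values assumed for the demo).
    """
    total_p = sum(p for p, _ in lottery)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("lottery probabilities must sum to 1")
    return sum(p * u[x] for p, x in lottery)


# Hypothetical utilities on a 0..1 scale.
u = {"x": 1.0, "y": 0.5, "z": 0.0}
lottery = [(0.75, "x"), (0.15, "y"), (0.10, "z")]
value = expected_utility(lottery, u)
print(value)
```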
17. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
18. Preference Elicitation: Queries (Ranking). "Please order this subset of outcomes $\langle x_1, x_2, \ldots, x_m \rangle$": $u(x_1) \ge u(x_2) \ge u(x_3) \ge \cdots \ge u(x_m)$
19. Preference Elicitation: Queries (Standard Gamble). "Please choose a $p$ for which you are indifferent between $y$ and the lottery $\langle x^{\top}, p, x^{\perp} \rangle$": $y \sim \langle x^{\top}, p, x^{\perp} \rangle$, so $u(y) = p$ (with utilities normalized so that $u(x^{\top}) = 1$ and $u(x^{\perp}) = 0$)
20. Preference Elicitation: Queries (Bound). "Is $y$ at least as good as the lottery $\langle x^{\top}, b, x^{\perp} \rangle$?": $y \succeq \langle x^{\top}, b, x^{\perp} \rangle$ implies $u(y) \ge b$
21. Preference Elicitation: Preference Elicitation. Rather than fully specifying a utility function, we 1. Make a decision w.r.t. an imprecisely specified utility function 2. Perform elicitation until we are satisfied with the decision. [Flowchart: Prob, Util → Make Decision → Satisfied? → YES: Done / NO: Select Query → User]
22-23. Preference Elicitation: Robust Decision Criteria (Maximax). Given a set of feasible utility functions $U$: $\arg\max_{x \in \Theta} \max_{u \in U} u(x)$
24-25. Preference Elicitation: Robust Decision Criteria (Maximin). Given a set of feasible utility functions $U$: $\arg\max_{x \in \Theta} \min_{u \in U} u(x)$
26-33. Preference Elicitation: Robust Decision Criteria (Minimax Regret). Given a set of feasible utility functions $U$: $\arg\min_{x \in \Theta} \max_{x' \in \Theta} \max_{u \in U} u(x') - u(x)$. Slides 22-33 share one example table; the Max Regret column is filled in row by row over slides 28-33:

|    | u1 | u2 | u3 | Max Regret |
|----|----|----|----|------------|
| x1 | 8  | 2  | 1  | 5          |
| x2 | 7  | 7  | 1  | 1          |
| x3 | 2  | 2  | 2  | 6          |
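The three robust criteria can be evaluated directly on the example table from these slides. The sketch below assumes the table is stored as a dict from outcome to its utilities under u1, u2, u3; the function names are illustrative.

```python
# Rows: outcomes; columns: utilities under u1, u2, u3 (values from the slides).
table = {
    "x1": [8, 2, 1],
    "x2": [7, 7, 1],
    "x3": [2, 2, 2],
}


def maximax(t):
    """Outcome with the best best-case utility."""
    return max(t, key=lambda x: max(t[x]))


def maximin(t):
    """Outcome with the best worst-case utility."""
    return max(t, key=lambda x: min(t[x]))


def max_regret(t, x):
    """Largest loss of x relative to the best outcome under any single u."""
    n = len(t[x])
    best = [max(row[i] for row in t.values()) for i in range(n)]
    return max(best[i] - t[x][i] for i in range(n))


def minimax_regret(t):
    """Outcome whose max regret is smallest."""
    return min(t, key=lambda x: max_regret(t, x))


regrets = {x: max_regret(table, x) for x in table}
print(maximax(table), maximin(table), minimax_regret(table), regrets)
```

On this table the three criteria pick three different outcomes, which is a convenient way to see how they differ.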
34. Preference Elicitation: Bayesian Decision Criteria (Expected Utility, Value at Risk). Assuming we have a prior $\phi$ over potential utility functions
35. Preference Elicitation: Bayesian Decision Criteria (Expected Utility). Assuming we have a prior $\phi$ over potential utility functions: $\arg\max_{x \in \Theta} E_u^{\phi}[u(x)]$
36. Preference Elicitation: Bayesian Decision Criteria (Value at Risk). Assuming we have a prior $\phi$ over potential utility functions: $\arg\max_{x \in \Theta} \max_{\delta} \Pr^{\phi}\left( E_u[u(x)] \ge \delta \right) \ge \eta$. [Figure: distribution with $\eta = 90\%$ of the mass above $\delta$ and 10% below]
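A rough sketch of the two Bayesian criteria follows, under several assumptions that are not in the talk: the prior $\phi$ is a discrete distribution over the three utility functions of the earlier example table, the prior weights are invented, and value at risk is read as "the largest $\delta$ such that the utility is at least $\delta$ with probability at least $\eta$".

```python
# Utilities of each outcome under u1, u2, u3 (earlier example table).
table = {"x1": [8, 2, 1], "x2": [7, 7, 1], "x3": [2, 2, 2]}
prior = [0.5, 0.45, 0.05]  # hypothetical prior over u1, u2, u3
eta = 0.9


def expected_value(values, prior):
    """Expected utility of one outcome under the discrete prior."""
    return sum(p * v for p, v in zip(prior, values))


def value_at_risk(values, prior, eta):
    """Largest delta with Pr(u(x) >= delta) >= eta under the discrete prior."""
    best = None
    for delta in sorted(set(values)):
        mass = sum(p for p, v in zip(prior, values) if v >= delta)
        if mass >= eta:
            best = delta
    return best


eu_choice = max(table, key=lambda x: expected_value(table[x], prior))
var = {x: value_at_risk(table[x], prior, eta) for x in table}
var_choice = max(var, key=var.get)
print(eu_choice, var, var_choice)
```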
37. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
38. Markov Decision Processes: Markov Decision Process (MDP). $S$: set of states; $A$: set of actions; $\Pr(s' \mid a, s)$: transitions; $\alpha$: starting state distribution; $\gamma$: discount factor; $r(s)$: reward [or $r(s, a)$]. [Diagram: agent and world exchanging actions and states, with the trajectory $s_t, s_{t+1}, s_{t+2}, \ldots$, actions $a_t, a_{t+1}$ and rewards $r_t, r_{t+1}, r_{t+2}$]
39. Markov Decision Processes: Markov Decision Process (MDP). Same as slide 38, with the transitions $\Pr(s' \mid a, s)$ marked "Known" and the reward $r(s)$ marked "?"
40. Markov Decision Processes: MDP - Policies. Policy: a stationary policy $\pi$ maps each state to an action. For infinite horizon MDPs, there is always an optimal policy that is stationary. Policy value: given a policy $\pi$, the value of a state is $V^{\pi}(s_0) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid \pi, s_0 \right]$
41. Markov Decision Processes: MDP - Computing Value Function. The value of a policy can be found by successive approximation: $V_0^{\pi}(s) = r(s, a_\pi)$; $V_1^{\pi}(s) = r(s, a_\pi) + \gamma \sum_{s'} \Pr(s' \mid s, a_\pi) V_0^{\pi}(s')$; $\ldots$; $V_k^{\pi}(s) = r(s, a_\pi) + \gamma \sum_{s'} \Pr(s' \mid s, a_\pi) V_{k-1}^{\pi}(s')$. There will exist a fixed point: $V^{\pi}(s) = r(s, a_\pi) + \gamma \sum_{s'} \Pr(s' \mid s, a_\pi) V^{\pi}(s')$
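The successive approximation on slide 41 can be sketched in a few lines of plain Python. The two-state MDP below is a made-up toy, and the nested-list encoding (`P[s][a][s_next]`, `r[s][a]`) is an assumption for illustration only.

```python
def evaluate_policy(P, r, policy, gamma, tol=1e-10, max_iter=100000):
    """Iterate V_k(s) = r(s, a_pi) + gamma * sum_s' P(s'|s, a_pi) V_{k-1}(s').

    P[s][a][t] is the probability of moving from s to t under action a,
    r[s][a] is the reward, policy[s] is the action chosen in s.
    """
    n = len(P)
    V = [r[s][policy[s]] for s in range(n)]  # V_0
    for _ in range(max_iter):
        V_new = [
            r[s][policy[s]]
            + gamma * sum(P[s][policy[s]][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new  # (approximate) fixed point
        V = V_new
    return V


# Toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0],   # state 0: no reward
     [1.0, 0.0]]   # state 1: reward 1 for staying
V = evaluate_policy(P, r, policy=[1, 0], gamma=0.9)
print(V)
```

For this policy (move to state 1, then stay) the fixed point can be checked by hand: $V(1) = 1/(1-0.9) = 10$ and $V(0) = 0.9 \cdot 10 = 9$.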
42. Markov Decision Processes: MDP - Optimal Value Functions. Optimal policy: we wish to find the optimal policy $\pi^*$ such that $V^{\pi^*} \ge V^{\pi'} \ \forall \pi'$. Bellman equation: $V^{\pi^*}(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) V^{\pi^*}(s') \right]$
43. Markov Decision Processes: Value Iteration Algorithm. Yields an $\epsilon$-optimal policy. 1. Initialize $V_0$, set $n = 0$, choose $\epsilon > 0$. 2. For each $s$: $V_{n+1}(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) V_n(s') \right]$. 3. If $\lVert V_{n+1} - V_n \rVert > \epsilon \frac{(1 - \gamma)}{2\gamma}$: increment $n$ and return to step 2. We can recover the policy by finding the best one-step action: $\pi(s) = \arg\max_a \left[ r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) V(s') \right]$
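A minimal sketch of the value iteration steps above, using the stopping threshold $\epsilon(1-\gamma)/(2\gamma)$ from the slide and the max norm (an assumption, since the slide does not name the norm). The toy MDP and the list-based encoding are illustrative only.

```python
def value_iteration(P, r, gamma, eps=1e-6):
    """Value iteration with the stopping rule from the slide.

    P[s][a][t]: transition probability, r[s][a]: reward (assumed encoding).
    Returns the final value estimate and a greedy policy.
    """
    n_s, n_a = len(P), len(P[0])

    def backup(V, s, a):
        # One-step lookahead value of taking a in s.
        return r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))

    V = [0.0] * n_s                                  # step 1
    threshold = eps * (1 - gamma) / (2 * gamma)
    while True:
        V_next = [max(backup(V, s, a) for a in range(n_a))
                  for s in range(n_s)]               # step 2
        change = max(abs(x - y) for x, y in zip(V_next, V))
        V = V_next
        if change <= threshold:                      # step 3
            break
    policy = [max(range(n_a), key=lambda a: backup(V, s, a))
              for s in range(n_s)]
    return V, policy


# Toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0], [1.0, 0.0]]
V, policy = value_iteration(P, r, gamma=0.9)
print(V, policy)
```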
44. Markov Decision Processes: Linear Programming Formulation. $\min_{V} \sum_s \alpha(s) V(s)$ subject to $V(s) \ge r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) V(s') \ \ \forall a, s$
45. Markov Decision Processes: MDP - Occupancy Frequencies. An occupancy frequency $f(s, a)$ expresses the total discounted probability of being in state $s$ and taking action $a$. Valid $f(s, a)$: $\sum_a f(s_0, a) - \gamma \sum_s \sum_a \Pr(s_0 \mid s, a) f(s, a) = \alpha(s_0) \ \ \forall s_0$
46. Markov Decision Processes: LP - Occupancy Frequency. Value form: $\min_{V} \sum_s \alpha(s) V(s)$ subject to $V(s) \ge r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) V(s') \ \ \forall a, s$. Occupancy form: $\max_{f} \sum_s \sum_a f(s, a) r(s, a)$ subject to $\sum_a f(s_0, a) - \gamma \sum_s \sum_a \Pr(s_0 \mid s, a) f(s, a) = \alpha(s_0) \ \ \forall s_0$
47. Markov Decision Processes: LP - Occupancy Frequency. Same as slide 46, with the identity $\sum_s \sum_a f(s, a) r(s, a) = \sum_s \alpha(s) V(s)$ added
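The identity on slide 47 and the validity constraint on slide 45 can be checked numerically for a fixed policy. The sketch below accumulates the discounted state distribution over time instead of solving a linear system; the toy MDP, the truncation tolerance, and all function names are assumptions made for the example.

```python
def occupancy(P, policy, alpha, gamma, tol=1e-12):
    """f(s, a): total discounted probability of being in s and taking a."""
    n = len(P)
    f = [[0.0] * len(P[0]) for _ in range(n)]
    d = list(alpha)   # state distribution at the current time step
    discount = 1.0    # gamma ** t
    while discount > tol:
        for s in range(n):
            f[s][policy[s]] += discount * d[s]
        d = [sum(d[s] * P[s][policy[s]][t] for s in range(n))
             for t in range(n)]
        discount *= gamma
    return f


def flow_residual(P, f, alpha, gamma, s0):
    """Left side minus right side of the validity constraint for s0."""
    inflow = sum(P[s][a][s0] * f[s][a]
                 for s in range(len(P)) for a in range(len(P[0])))
    return sum(f[s0]) - gamma * inflow - alpha[s0]


# Toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0], [1.0, 0.0]]
alpha = [1.0, 0.0]            # always start in state 0
f = occupancy(P, policy=[1, 0], alpha=alpha, gamma=0.9)
f_dot_r = sum(f[s][a] * r[s][a] for s in range(2) for a in range(2))
print(f, f_dot_r)
```

For this policy the value of the start state is 9 (one step to reach state 1, then reward 1 forever at discount 0.9), so $f \cdot r$ should be close to $\alpha \cdot V = 9$.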
48. Markov Decision Processes: MDP Summary Slide (Policies). Over the past couple of decades, there has been a lot of work done on scaling MDPs: Factored Models; Decomposition; Linear Approximation
49. Markov Decision Processes: MDP Summary Slide (Dynamics). To use these algorithms we need a model of the dynamics (transition function). There are techniques for: deriving models of dynamics from data; finding policies that are robust to inaccurate transition models
50. Markov Decision Processes: MDP Summary Slide (Rewards). There has been comparatively little work on specifying rewards: finding policies that are robust to imprecise reward models; eliciting reward information from users
51. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work
52. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work: A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
53. Current Work. [Flowchart: MDP, R → Compute Robust Policy → Satisfied? → YES: Done / NO: Select Query → User]
54. Model: MDP - Reward Uncertainty. We quantify the strict uncertainty over reward with a set of feasible reward functions $R$. We specify $R$ using a set of linear inequalities forming a polytope. Where do these inequalities come from? Bound queries: Is $r(s,a) > b$? Policy comparisons: Is $f_{\pi} \cdot r > f_{\pi'} \cdot r$?
55. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work: A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
56. Computation: Minimax Regret. Original formulation: $\min_{f \in F} \max_{g \in F} \max_{r \in R} \; g \cdot r - f \cdot r$. Benders' decomposition: $\min_{f \in F,\, \delta} \delta$ subject to $\delta \ge g \cdot r - f \cdot r \ \ \forall g \in F,\ r \in R$
57. Computation: Minimax Regret. Original formulation: $\min_{f \in F} \max_{g \in F} \max_{r \in R} \; g \cdot r - f \cdot r$. Benders' decomposition: $\min_{f \in F,\, \delta} \delta$ subject to $\delta \ge g \cdot r - f \cdot r \ \ \forall g \in V(F),\ r \in V(R)$. Maximums will exist at the vertices of $F$ and $R$. Rather than enumerating an exponential number of vertices we use constraint generation
58. Computation: Minimax Regret - Constraint Generation. 1. We limit the adversary • Player minimizes regret w.r.t. a small set of adversary responses. 2. We untie the adversary's hands • Adversary finds maximum regret w.r.t. player's policy • Add adversary's choice of $r$ and $g$ to the set of adversary responses. Done when: untying the adversary's hands yields no improvement • i.e., regret of player minimizing = regret of adversary maximizing
59. Computation: Constraint Generation - Player. 1. Limit adversary: $\min_{f \in F,\, \delta} \delta$ subject to $\delta \ge g \cdot r - f \cdot r \ \ \forall \langle g, r \rangle \in \text{GEN}$
60. Computation: Constraint Generation - Adversary. 2. Untie adversary's hands: given player policy $f$, $\max_{g \in F} \max_{r \in R} \; g \cdot r - f \cdot r$. This formulation is a non-convex bilinear program. We reformulate it as a mixed integer linear program
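The player/adversary loop of slides 58-60 can be illustrated on a deliberately tiny problem. The sketch below is not the MDP formulation from the talk: to stay self-contained it replaces the player's LP and the adversary's MIP with brute-force search over a finite table (the one-shot decision example from the earlier slides), so "policies" are outcomes and "rewards" are the three utility functions. Only the loop structure is the point; all names are illustrative.

```python
def regret(table, x, g, k):
    """Regret of x when the adversary answers with option g and utility k."""
    return table[g][k] - table[x][k]


def player(table, gen):
    """Step 1: minimize regret against the generated (g, k) pairs only."""
    def limited(x):
        return max((regret(table, x, g, k) for g, k in gen), default=0)
    x = min(table, key=limited)
    return x, limited(x)


def adversary(table, x):
    """Step 2: unrestricted adversary finds the max-regret response to x."""
    n = len(next(iter(table.values())))
    pairs = [(g, k) for g in table for k in range(n)]
    best = max(pairs, key=lambda gk: regret(table, x, *gk))
    return best, regret(table, x, *best)


def constraint_generation(table, tol=1e-9):
    gen = []  # GEN: adversary responses generated so far
    while True:
        x, delta = player(table, gen)
        witness, mr = adversary(table, x)
        if mr <= delta + tol:   # untying the adversary yields no improvement
            return x, mr, gen
        gen.append(witness)     # add the violated constraint and repeat


table = {"x1": [8, 2, 1], "x2": [7, 7, 1], "x3": [2, 2, 2]}
choice, max_regret, generated = constraint_generation(table)
print(choice, max_regret, generated)
```

Even on this toy the loop stops after generating only a couple of the nine possible adversary responses, which is the behaviour the constraint generation approach relies on.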
61. Computation: Constraint Generation - Adversary. 2. Computation of $MR(f, R)$ is realized by the following MIP, using value and Q-functions: $\max_{Q, V, I, r} \; \alpha \cdot V - r \cdot f$ (9) subject to: $Q_a = r_a + \gamma P_a V \ \ \forall a \in A$; $V \ge Q_a \ \ \forall a \in A$ (10); $V \le (1 - I_a) M_a + Q_a \ \ \forall a \in A$ (11); $C r \le d$; $\sum_a I_a = 1$ (12); $I_a(s) \in \{0, 1\} \ \ \forall a, s$ (13); $M_a = M^{\top} - M_a^{\perp}$. Here $I$ represents the adversary's policy, with $I_a(s)$ denoting the probability of action $a$ being taken at state $s$. Only tractable for small Markov Decision Problems
62. Computation: Approximating Minimax Regret. We relax the Max Regret MIP formulation. The value of the resulting policy is no longer exact; however, the resulting reward is still feasible. We find the optimal policy w.r.t. the resulting reward. [Figure 3: Relative approximation error of linear relaxation]
63. Computation: Scaling (Log Scale). [Figure 2: Scaling of constraint generation with number of states.]
64. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work: A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
65. Reward Elicitation: Reward Elicitation. [Flowchart: MDP, R → Compute Robust Policy → Satisfied? → YES: Done / NO: Select Query → User]
66. Reward Elicitation: Bound Queries. Query: Is $r(s,a) > b$? where $b$ is a point between the upper and lower bounds of $r(s,a)$. Gap: $\Delta(s, a) = \max_{r'} r'(s, a) - \min_{r} r(s, a)$. At each step of elicitation we need to select the $s, a$ parameters and $b$ using the gap
67. Reward Elicitation: Selecting Bound Queries. Halve the Largest Gap (HLG): select the $s,a$ with the largest gap $\Delta(s,a)$; set $b$ to the midpoint of the interval for $r(s,a)$. Current Solution (CS): use the current solution $g(s,a)$ [or $f(s,a)$] of the minimax regret calculation to weight each gap $\Delta(s,a)$; select the $s,a$ with the largest weighted gap $g(s,a)\Delta(s,a)$; set $b$ to the midpoint of the interval for $r(s,a)$
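The two query-selection heuristics and the effect of an answer can be sketched as follows. This assumes the reward uncertainty is a simple box (independent lower and upper bounds per state-action pair), which is a special case of the polytope in the talk; the bounds, the occupancy weights and the function names are invented for illustration.

```python
def halve_largest_gap(lower, upper):
    """HLG: pick the (s, a) with the largest gap, query its midpoint."""
    sa = max(lower, key=lambda k: upper[k] - lower[k])
    return sa, (lower[sa] + upper[sa]) / 2


def current_solution(lower, upper, weights):
    """CS: weight each gap by the current solution's occupancy g(s, a)."""
    sa = max(lower, key=lambda k: weights[k] * (upper[k] - lower[k]))
    return sa, (lower[sa] + upper[sa]) / 2


def apply_answer(lower, upper, sa, b, yes):
    """Tighten the bounds after the user answers 'Is r(s,a) > b?'."""
    if yes:
        lower[sa] = b
    else:
        upper[sa] = b


# Hypothetical bounds on r(s, a) and occupancies from a regret computation.
lower = {("s1", "a1"): 0.0, ("s1", "a2"): 0.2, ("s2", "a1"): 0.4}
upper = {("s1", "a1"): 1.0, ("s1", "a2"): 0.6, ("s2", "a1"): 0.5}
weights = {("s1", "a1"): 0.0, ("s1", "a2"): 2.0, ("s2", "a1"): 5.0}

hlg_query = halve_largest_gap(lower, upper)
cs_query = current_solution(lower, upper, weights)
apply_answer(lower, upper, hlg_query[0], hlg_query[1], yes=True)
print(hlg_query, cs_query, lower[("s1", "a1")])
```

In this made-up instance HLG queries the pair with the widest interval even though the current solution never visits it, while CS prefers a narrower interval that carries more occupancy weight.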
68. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work: A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
69. Evaluation: Experimental Setup. Randomly generated MDPs: 10 states, 5 actions. Semi-sparse random transition function, discount factor of 0.95. Random true reward drawn from fixed interval, upper and lower bounds on reward drawn randomly. All results are averaged over 20 runs
70. Evaluation: Elicitation Effectiveness. We examine the combination of each criterion for robust policies with each of the elicitation strategies. Criteria: Minimax Regret (MMR), Maximin Regret (MR). Elicitation strategies: Halve the Largest Gap (HLG), Current Solution (CS)
71. Evaluation: Max Regret - Random MDP. [Figure: Max Regret plot]
72. Evaluation: True Regret (Loss) - Random MDP. [Figure: True Regret plot]
73. Evaluation: Queries per Reward Point - Random MDP. Most of reward space unexplored. We repeatedly query a small set of "high impact" reward points. [Figure: histogram]
74. Evaluation: Autonomic Computing Setup. 2 hosts total; 3 demand levels; 3 units of resource; 90 states; 10 actions. [Diagram: a resource model (M) allocating resource to hosts 1 to k, each with its own demand]
75. Evaluation: Max Regret - Autonomic Computing. [Figure: Queries vs. Max Regret; series: Maximin, Minimax Regret; x-axis: Queries, 0 to 1000]
76. Evaluation: True Regret (Loss) - Autonomic Computing. [Figure: Queries vs. True Regret; series: Maximin, Minimax Regret; x-axis: Queries, 0 to 1000]
77. Outline: 1. Decision Theory 2. Preference Elicitation 3. MDPs 4. Current Work: A. Imprecise Reward Specification B. Computing Robust Policies C. Preference Elicitation D. Evaluation E. Future Work
78. Introduction: Overview. [Flowchart: MDP, R → Compute Robust Policy → Satisfied? → YES: Done / NO: Select Query → User]
79. Introduction: Contributions. 1. A technique for finding robust policies using minimax regret. 2. A simple elicitation procedure that quickly leads to near-optimal/optimal policies. [Flowchart: Compute Robust Policy → Satisfied? → YES: Done / NO: Select Query]
80. Conclusion: Future Work (Scaling). Bottleneck: adversary's max regret computation. Idea: the set $\Gamma$ of adversary policies $g$ that will ever be a regret maximizing response can be small. Approaches that use $\Gamma$ to efficiently compute max regret. We have: an algorithm to find $\Gamma$; a theorem that shows the algorithm runs in time polynomial in the number of policies found
81. Conclusion: Future Work (Factored MDPs). Working with factored MDPs will: model problems in a more natural way; allow us to lower the dimensionality of the reward functions; leverage existing techniques for scaling MDPs that take advantage of factored
82. Conclusion: Future Work (Richer Queries). In state $s$ which action would you like to take? In state $s$ do you prefer action $a_1$ to $a_2$? Do you prefer sequence $s_1, a_1, s_2, a_2, \ldots, s_k$ to $s'_1, a'_1, s'_2, a'_2, \ldots, s'_k$?
83. Conclusion: Future Work (Richer Queries). Do you prefer the tradeoff $f(s_2, a_3) = f_1$ amount of time doing $(s_2, a_3)$ and $f(s_1, a_4) = f_2$ amount of time doing $(s_1, a_4)$, or $f'(s_2, a_3) = f'_1$ amount of time doing $(s_2, a_3)$ and $f'(s_1, a_4) = f'_2$ amount of time doing $(s_1, a_4)$? [Diagram: example with states "Cab Available" and "No Street Car" and actions "Take Cab" and "Waiting", annotated with $f_1$ and $f_2$]
84. Thank you.
85. Regret-Based Reward Elicitation for Markov Decision Processes. Kevin M Regan and Craig Boutilier, University of Toronto
86. Appendix: Full Formulation. [Screenshot of the paper, with the master problem and subproblem highlighted; the recoverable text follows.]

This is equivalent to a minimization: $\min_{f, \delta} \delta$ (8) subject to: $r \cdot g - r \cdot f \le \delta \ \ \forall g \in F,\ r \in R$; $\gamma E f + \alpha = 0$.

This corresponds to the standard dual LP formulation of an MDP with the addition of adversarial policy constraints. The infinite number of constraints can be reduced: first we need only retain as potentially active those constraints for the vertices of polytope $R$; and for any $r \in R$, we only require the constraint corresponding to its optimal policy $g^*_r$. However, vertex enumeration is not feasible; so we apply Benders' decomposition [2] to iteratively generate constraints. At each iteration, two optimizations are solved. The master problem solves a relaxation of program (8) using only a small subset of the constraints, corresponding to a subset Gen of all $\langle g, r \rangle$ pairs; we call these generated constraints. Initially, this set is arbitrary (e.g., empty).

Subproblem: computation of $MR(f, R)$ is realized by the following MIP, using value and Q-functions: $\max_{Q, V, I, r} \; \alpha \cdot V - r \cdot f$ (9) subject to: $Q_a = r_a + \gamma P_a V \ \ \forall a \in A$; $V \ge Q_a \ \ \forall a \in A$ (10); $V \le (1 - I_a) M_a + Q_a \ \ \forall a \in A$ (11); $C r \le d$; $\sum_a I_a = 1$ (12); $I_a(s) \in \{0, 1\} \ \ \forall a, s$ (13); $M_a = M^{\top} - M_a^{\perp}$. Here $I$ represents the adversary's policy, with $I_a(s)$ denoting the probability of action $a$ being taken at state $s$.

If $MR(f, R) = MMR(R)$ then the constraint for $g, r$ is satisfied at the current solution, and indeed all unexpressed constraints must be satisfied as well. The process then terminates with minimax optimal solution $f$. Otherwise, $MR(f, R) > MMR(R)$, implying that the constraint for $g, r$ is violated in the current relaxation (indeed, it is the maximally violated such constraint). So it is added to Gen and the process repeats.
87. Evaluation: Maximin Value - Random MDP. [Figure: Maximin Value plot, shown alongside the Max Regret plot]
88. Computation: Regret Gap vs Time. [Figure: Regret Gap plotted against time]