Robust Policy Computation in Reward-uncertain MDPs using Nondominated Policies



  1. 1. Robust Policy Computation in Reward-uncertain MDPs using Nondominated Policies. Kevin Regan (University of Toronto) and Craig Boutilier.
  2. 2. Introduction: Motivation. Setting: computational approaches to sequential decision making under uncertainty, specifically MDPs. These approaches require • a model of dynamics • a model of rewards
  3. 3. Introduction: Motivation. Except in some simple cases, the specification of rewards is problematic: • preferences about which states/actions are "good" and "bad" need to be translated into precise numerical rewards • it is time consuming to specify rewards for all states/actions • rewards can vary from user to user
  4. 4. Introduction: Motivation. Given an MDP with an imprecise specification of reward, we wish to efficiently compute a robust policy.
  5. 5. Introduction: Reward Elicitation. [Flowchart: from the MDP and the current reward knowledge, compute a decision and a decision-quality measure; if satisfied, done; otherwise select a query, pose it to the user, and fold the response back into the reward knowledge.]
  6. 6. Outline 1. Imprecise Reward MDPs 2. Minimax Regret Optimal Policies for IRMDPs 3. Using Nondominated Policies 4. Generating Nondominated Policies
  7. 7. Model: MDP. Markov Decision Process (MDP): S: set of states; A: set of actions; Pr(s′ | s, a): transition probabilities; r(s, a): reward; β: starting state distribution; γ: discount factor. [Agent-environment diagram: the agent emits actions a_t, the world returns states s_t and rewards r_t.]
  8. 8. Model: MDP. The same MDP, but now the dynamics (S, A, Pr(s′ | s, a), β, γ) are known while the reward r(s, a) is unknown. [Same agent-environment diagram, with the reward marked "?".]
  9. 9. Markov Decision Processes: Policies. A (stationary) policy π maps each state to an action. Given a policy π, its value is V^π = E[ Σ_{t=0}^∞ γ^t r_t | π, β ].
  10. 10. Model: MDP. Occupancy Frequencies. The occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a. Given f^π we can recover the policy: π(s, a) = f^π(s, a) / Σ_{a′} f^π(s, a′). The value of the optimal policy π* corresponds to the value of the optimal f*: V^{π*}_r = r · f*.
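To make the occupancy-frequency view concrete, here is a minimal NumPy sketch for a small tabular MDP; the arrays, numbers, and function name are illustrative, not taken from the talk. It solves the linear balance equations for a deterministic policy, recovers the policy from f, and evaluates the policy's value as the dot product r · f.

```python
import numpy as np

def occupancy_frequencies(P, pi, beta, gamma):
    """Occupancy frequencies f(s, a) of a deterministic policy pi.

    P[s, a, s'] = Pr(s' | s, a), beta[s] = starting distribution, pi[s] = action.
    Solves (I - gamma * P_pi^T) f_states = beta, the system behind
    f = beta + gamma * P_pi^T f, then places the state occupancies
    on the actions the policy actually takes.
    """
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi, :]                 # S x S transition matrix under pi
    f_states = np.linalg.solve(np.eye(S) - gamma * P_pi.T, beta)
    f = np.zeros((S, A))
    f[np.arange(S), pi] = f_states
    return f

# Tiny illustrative 2-state, 2-action MDP (numbers are made up).
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
beta, gamma = np.array([1.0, 0.0]), 0.95
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                        # r(s, a)

f = occupancy_frequencies(P, np.array([1, 1]), beta, gamma)
policy = f / f.sum(axis=1, keepdims=True)         # pi(s, a) = f(s, a) / sum_a' f(s, a')
value = float((r * f).sum())                      # V = r · f
```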
  11. 11. Markov Decision Processes: The Imprecise Reward MDP. We replace a known reward function with a set of feasible reward functions R ≡ {r | Ar ≤ b}, specified by a set of linear inequalities forming a polytope. Where do these inequalities come from? Bound queries: is r(s, a) > b? Policy comparisons: is f_π · r > f_π′ · r?
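The two query types translate directly into rows of (A, b). A small sketch of how responses could be accumulated into the polytope; the helper names, indices, and vectors below are hypothetical, chosen only to illustrate the encoding.

```python
import numpy as np

# Rewards are vectors over (s, a) pairs, flattened to length n (size is illustrative).
n = 4
A_rows, b_vals = [], []

def add_bound_query(idx, bound, answer_yes):
    """Fold the answer to 'Is r[idx] > bound?' into the polytope A r <= b."""
    row = np.zeros(n)
    if answer_yes:                 # r[idx] > bound   ->   -r[idx] <= -bound
        row[idx] = -1.0
        A_rows.append(row); b_vals.append(-bound)
    else:                          # r[idx] <= bound
        row[idx] = 1.0
        A_rows.append(row); b_vals.append(bound)

def add_policy_comparison(f, f_prime, prefers_f):
    """Fold the answer to 'Is f·r > f_prime·r?' into the polytope."""
    # 'yes' means f·r >= f_prime·r, i.e. (f_prime - f)·r <= 0; 'no' flips the sign.
    diff = np.asarray(f_prime, dtype=float) - np.asarray(f, dtype=float)
    if not prefers_f:
        diff = -diff
    A_rows.append(diff); b_vals.append(0.0)

add_bound_query(idx=2, bound=0.5, answer_yes=True)
add_policy_comparison(f=[1.0, 0.0, 2.0, 0.0], f_prime=[0.0, 1.5, 0.0, 1.5], prefers_f=True)

A, b = np.vstack(A_rows), np.array(b_vals)    # feasible reward set R = {r | A r <= b}
```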
  12. 12. Robust Policy Optimization: Robust Policy Criterion. Given a policy f, its max regret is MR(f) = max_{g∈F} max_{r∈R} g·r − f·r = max_{r∈R} g*_r·r − f·r. We use minimax regret: MMR = min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r = min_{f∈F} max_{r∈R} g*_r·r − f·r. The player chooses a policy f; the adversary finds a reward r, along with the optimal policy g*_r for that reward, that maximizes regret given the player's choice f.
  13. 13. Nondominated Policies. A policy f is nondominated w.r.t. the set of reward functions R if and only if there exists r ∈ R such that f·r ≥ f′·r for all f′ ∈ F. Let Γ denote the set of nondominated policies for a fixed R. For any IRMDP and policy f: arg max_{g∈F} [ max_{r∈R} g·r − f·r ] ∈ Γ.
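The nondominated test against a finite set of competitor policies reduces to one LP. Below is a minimal sketch using scipy.optimize.linprog, assuming small dense arrays and a bounded polytope R = {r | Ar ≤ b}; the function name and tolerance are ours, not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

def is_nondominated(f, others, A, b, tol=1e-9):
    """Does some r in R = {r | A r <= b} satisfy f·r >= f'·r for every f' in `others`?

    Solved as the LP  max_{r, delta} delta  s.t.  delta <= (f - f')·r for all f',
    A r <= b.  f is nondominated w.r.t. the finite set `others` iff the optimum
    is >= 0.  Assumes R is bounded so the LP has a finite optimum.
    """
    f = np.asarray(f, dtype=float)
    n = len(f)
    c = np.r_[np.zeros(n), -1.0]                       # variables [r, delta]; minimize -delta
    # Constraint delta <= (f - f')·r  rewritten as  (f' - f)·r + delta <= 0.
    rows = [np.r_[np.asarray(fp, dtype=float) - f, 1.0] for fp in others]
    A_ub = np.vstack(rows + [np.c_[A, np.zeros(len(b))]])
    b_ub = np.r_[np.zeros(len(others)), b]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    return res.success and -res.fun >= -tol
```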
  14. 14. Nondominated Policies. V(r) = max_{f∈F} f·r = max_{f∈Γ} f·r. [Plot: value as a function of reward; each policy f1, ..., f5 is a line, and the nondominated policies form the upper envelope.]
  16. 16. Nondominated Policies. [Same plot, with a dominated policy highlighted: its value line lies below the upper envelope for every feasible reward.]
  17. 17. Nondominated Policies: analogy with POMDPs. IRMDP: V(r) = max_f f·r over reward space. POMDP: V(b) = max_α b·α over belief space. [Side-by-side plots: policies f1, ..., f5 as lines over reward; α-vectors α1, ..., α5 as lines over belief.]
  18. 18. Computation: Minimax Regret. Original formulation: min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r. We reformulate using Benders' decomposition: minimize δ over f ∈ F, δ subject to δ ≥ g·r − f·r for all g ∈ F, r ∈ R.
  19. 19. Computation: Minimax Regret. The same program with the constraints restricted to vertices: minimize δ over f ∈ F, δ subject to δ ≥ g·r − f·r for all g ∈ V(F), r ∈ V(R). • The maximums are attained at vertices of F and R • Rather than enumerating an exponential number of vertices, we use constraint generation
  20. 20. Computation: Minimax Regret - Constraint Generation (see the sketch below). 1. Limit the adversary: the player minimizes regret w.r.t. a limited set of adversary choices (constraints). 2. Untie the adversary's hands: the adversary finds maximum regret w.r.t. the player's policy, and the adversary's choice of r and g is added to the set of adversary choices. Done when untying the adversary's hands yields no improvement, i.e. the regret of the minimizing player equals the regret of the maximizing adversary.
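A minimal sketch of this alternation, assuming (purely for illustration) that both the player's candidate policies and the adversary's reward choices are given as finite lists of vectors; in the actual formulation the master problem optimizes over the full policy polytope and the subproblem over the reward polytope R.

```python
import numpy as np

def minimax_regret_cg(policies, reward_vertices, tol=1e-9):
    """Constraint-generation loop over finite candidate sets (illustration only).

    `policies`:        candidate occupancy-frequency vectors for the player (F).
    `reward_vertices`: candidate reward vectors for the adversary (vertices of R).
    """
    F = [np.asarray(f, dtype=float) for f in policies]
    R = [np.asarray(r, dtype=float) for r in reward_vertices]
    best = [max(g @ r for g in F) for r in R]          # g_r^* · r for each vertex r

    generated = [0]                                    # adversary choices generated so far
    while True:
        # Master: minimize regret against the generated constraints only.
        f, master = min(((f, max(best[i] - f @ R[i] for i in generated)) for f in F),
                        key=lambda t: t[1])
        # Subproblem: untie the adversary's hands -- max regret over all of R.
        i_new, sub = max(((i, best[i] - f @ R[i]) for i in range(len(R))),
                         key=lambda t: t[1])
        if sub <= master + tol:                        # no improvement: done
            return f, master
        generated.append(i_new)
```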
  21. 21. Computation: Adversary Computation - Max Regret. Given a player's policy f, Regan & Boutilier (2009) used a MIP to find max regret. We instead iteratively compute the best reward for each g ∈ Γ: max_{g∈Γ} [ maximize_r g·r − f·r subject to Ar ≤ b ].
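A sketch of that computation with scipy.optimize.linprog, one small LP per nondominated policy; the function name is ours, Γ is assumed to be a list of occupancy-frequency vectors, and R is assumed bounded.

```python
import numpy as np
from scipy.optimize import linprog

def max_regret(f, Gamma, A, b):
    """Max regret of the player's policy f, using the nondominated set Gamma.

    For each g in Gamma, solve the LP  max_r (g - f)·r  s.t.  A r <= b,
    i.e. one small LP per candidate adversary policy rather than a single MIP.
    Returns (regret, adversary policy g, witness reward r).
    """
    f = np.asarray(f, dtype=float)
    best = (-np.inf, None, None)
    for g in Gamma:
        c = -(np.asarray(g, dtype=float) - f)          # linprog minimizes
        res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * len(f))
        if res.success and -res.fun > best[0]:
            best = (-res.fun, np.asarray(g, dtype=float), res.x)
    return best
```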
  22. 22. Computation: Minimax Regret - Results
  23. 23. Generating Γ: πWitness Algorithm. We take inspiration from the Witness algorithm for POMDPs (Littman, Kaelbling, Cassandra, 1998). We define the local adjustment f^{s:a} of a policy in terms of occupancy frequencies as: f^{s:a} = β(s)( e^{s:a} + γ Σ_{s′} Pr(s′ | s, a) f[s′] ) + (1 − β(s)) f. Theorem: let Γ̃ ⊊ Γ be a (strictly) partial set of nondominated policies. Then there is an f ∈ Γ̃, an (s, a), and an r ∈ R such that f^{s:a}·r > f′·r for all f′ ∈ Γ̃.
  24. 24. Generating Γ: Nondominated Policies. [Value-vs-reward plots showing the partial set of nondominated policies found so far.]
  27. 27. Generating Γ: Nondominated Policies. [The same plot with a witness reward marked: a reward at which some local adjustment of a policy in the partial set beats every policy found so far.]
  28. 28. Generating Γ: Nondominated Policies. [The optimal policy for the witness reward is computed and added to the set.]
  29. 29. Generating Γ: πWitness Algorithm - Complexity. The runtime of πWitness is polynomial in the size of the inputs S, A, R and of the output Γ. In general, minimax regret computation for MDPs is NP-hard (Xu & Mannor 2009). We show that for any class of MDPs with a polynomial number of nondominated policies, the minimax regret computation is polynomial.
  30. 30. Generating Γ: πWitness Algorithm - Results. We assessed performance using factored MDPs: (1) fixed reward dimension, r(s) = r1(x1) + r2(x2), varying the number of states; (2) fixed state dimension, s = ⟨x1, x2, x3, x4, x5, x6⟩, varying the reward dimension.

      Table 1: Varying Number of States (mean µ and standard deviation σ)
      State Size | Vectors µ | Vectors σ | Runtime µ (s) | Runtime σ (s)
      4          | 3.463     | 2.231     | 0.064         | 0.045
      8          | 3.772     | 3.189     | 0.145         | 0.144
      16         | 7.157     | 5.743     | 0.433         | 0.329
      32         | 7.953     | 6.997     | 1.228         | 1.062
      64         | 11.251    | 9.349     | 4.883         | 3.981

      Table 2: Varying Dimension of Reward Space
      Reward Dim. | Vectors µ | Vectors σ | Runtime µ (s) | Runtime σ (s)
      2           | 2.050     | 0.887     | 1.093         | 0.634
      4           | 10.20     | 10.05     | 4.554         | 4.483
      6           | 759.6     | 707.4     | 1178          | 1660
      8           | 6116      | 5514      | 80642         | 77635
  32. 32. Generating Γ: πWitness Algorithm - Discussion. The πWitness algorithm can be run once, offline, to generate the set of nondominated policies. During elicitation we wish to quickly recompute minimax regret as information is gathered and R changes. Constraint generation using Γ renders the minimax regret computation tractable when the number of nondominated policies is small. πWitness can be used as an anytime algorithm, at any point using the partial set of nondominated policies to approximate minimax regret.
  33. 33. Generating Γ: πWitness Algorithm - Error Bound. We show a bound on the error in minimax regret using a bound on the error in value: ε_MMR(Γ̃) ≤ ε_V(Γ̃) ≡ max_{r∈R} [ V_Γ(r) − V_Γ̃(r) ].
  34. 34. Generating Γ: πWitness Algorithm - Error Bound. [Plot: value vs. reward, illustrating ε_V(Γ̃) as the largest gap between the upper envelope over Γ and the upper envelope over Γ̃.]
  35. 35. Generating Γ: πWitness Algorithm - Anytime Results
  36. 36. Summary: Contributions. A novel algorithm for computing minimax regret leveraging Γ. πWitness: a polytime algorithm for generating Γ. Factored reward allows πWitness to scale to large state spaces. Small approximate sets Γ̃ can yield tractable, high-quality approximations to minimax regret.
  37. 37. Summary: Current Directions. We have developed a method for computing tight bounds on ε_V and thus on ε_MMR. The method yields an approach to anytime nondominated policy generation that directly minimizes error. We are looking at policy generation in conjunction with reward elicitation. We have techniques for modifying the partial set of nondominated policies during elicitation to improve the quality of the approximation.
  38. 38. EXTRA
  39. 39. Generating Γ: πWitness Algorithm.
      Algorithm 1: The πWitness algorithm
        r ← some arbitrary r ∈ R
        f ← findBest(r)
        Γ ← { f }; agenda ← { f }
        while agenda is not empty do
            f ← next item in agenda
            foreach s, a do
                rw ← findWitnessReward(f^{s:a}, Γ)
                while witness found do
                    fbest ← findBest(rw)
                    add fbest to Γ; add fbest to agenda
                    rw ← findWitnessReward(f^{s:a}, Γ)
      The agenda holds the policies whose local adjustments have not all been explored yet.
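A direct Python transcription of the loop structure above, as a sketch rather than the authors' implementation; findBest and findWitnessReward are passed in as callables (an MDP solve and the witness check, respectively), since their internals are not given on this slide.

```python
def pi_witness(r0, find_best, find_witness_reward, state_actions):
    """Skeleton of Algorithm 1 with the helpers supplied by the caller.

      find_best(r)                      -> optimal occupancy frequencies for reward r
      find_witness_reward(f, (s, a), G) -> a reward at which the s:a adjustment of f
                                           beats every policy in G, or None
      state_actions                     -> iterable of (s, a) pairs
    """
    f = find_best(r0)
    Gamma, agenda = [f], [f]
    while agenda:                                  # policies with unexplored adjustments
        f = agenda.pop()
        for sa in state_actions:
            rw = find_witness_reward(f, sa, Gamma)
            while rw is not None:                  # keep adding policies until no witness
                f_best = find_best(rw)
                Gamma.append(f_best)
                agenda.append(f_best)
                rw = find_witness_reward(f, sa, Gamma)
    return Gamma
```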
  40. 40. Markov Decision Processes: Reward Elicitation. [Same elicitation-loop flowchart as slide 5: compute a decision and quality measure from the MDP and current reward knowledge; if not satisfied, select a query, ask the user, and incorporate the response.]
  41. 41. Model: MDP. Occupancy Frequencies. An occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a. f* = arg max_{f∈F} r·f and π* = arg max_π β·V^π; the value of the optimal π* corresponds to the value of the optimal f*: β·V^{π*}_r = r·f*. Notes: 1. Each policy induces an occupancy frequency. 2. We can recover the policy from the occupancy frequency. 3. Be clear that the value function is the dot product of reward and occupancy frequency.
  42. 42. Robust Policy Optimization: Example - Minimax Regret. Choose the occupancy frequencies (policy) that minimize worst-case regret: f* = arg min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r. The player picks a row (a policy); the adversary picks a column (a reward) together with the optimal policy for that reward.

           r1   r2   r3   Max Regret
      f1    8    2    1       5
      f2    7    7    1       1
      f3    2    2    2       6

      The minimax-regret policy is f2, with worst-case regret 1.
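A three-line check of the table, treating the three rewards shown as the adversary's only choices:

```python
import numpy as np

# Values f_i · r_j from the example table (rows: policies f1-f3, columns: rewards r1-r3).
V = np.array([[8., 2., 1.],
              [7., 7., 1.],
              [2., 2., 2.]])

best_per_reward = V.max(axis=0)                 # adversary's best value per reward: [8, 7, 2]
max_regret = (best_per_reward - V).max(axis=1)  # per-policy max regret: [5, 1, 6]
print(max_regret, max_regret.argmin())          # minimax-regret policy is f2 (index 1)
```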
  48. 48. Directions: Computing Robust Policies. The minimax regret criterion can be applied to computing policies: arg min_π max_{π′} max_{r∈R} V^{π′}_r − V^π_r. It offers a number of desirable properties: 1. it gives a (non-probabilistic) guarantee; 2. it is less conservative than maximin; 3. the relative comparison between the current choice and the best possible choice is an intuitive measure. Ongoing work has developed several novel approaches to computing minimax regret for Markov decision processes.
  49. 49. Model Uncertainty: Robust MDPs [McMahan, Gordon & Blum 2005]. Unknown model parameters: rewards. Decision criterion: maximin. Uses a linear programming approach with constraint generation: π* = arg max_π min_{R∈R} E[ Σ_t γ^t R(x_t) | π ], formulated as: maximize δ over δ, π subject to δ ≤ V^π_R for all R ∈ R.
  50. 50. Model Uncertainty: Robust MDPs [Delage & Mannor 2007]. Unknown model parameters: transitions & rewards. Decision criterion: the percentile criterion. They solve for rewards in the form of a Gaussian as an SOCP, and give an approximation for transitions in the form of Dirichlets: maximize y over π, y subject to Pr( E[ Σ_{t=0}^∞ γ^t r_t(x_t) | π ] ≥ y ) ≥ η.
  51. 51. Robust Policy Optimization: Robust Policy Criteria. The maximin value or "security level" is defined as: max_{f∈F} min_{r∈R} f·r. We use minimax regret, defined as: min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r.
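Applied to the finite example from the minimax-regret slides, the two criteria disagree: maximin picks f3 while minimax regret picks f2, which is the sense in which regret is less conservative. A quick illustrative check using that same table:

```python
import numpy as np

# Same illustrative table of values f_i · r_j as in the example slides.
V = np.array([[8., 2., 1.],
              [7., 7., 1.],
              [2., 2., 2.]])

maximin_pick = V.min(axis=1).argmax()                     # f3: best security level (value 2)
regret_pick = (V.max(axis=0) - V).max(axis=1).argmin()    # f2: smallest worst-case regret (1)
print(maximin_pick, regret_pick)                          # index 2 (f3) vs. index 1 (f2)
```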
