The document describes regret-based reward elicitation for Markov decision processes. It motivates using MDPs to model decision making problems and notes that specifying reward functions can be challenging. It discusses using preference elicitation to iteratively refine an imprecise reward function, and describes various decision criteria that are robust to uncertainty in the reward function, such as maximin, minimax regret, expected utility, and value at risk.
3. Introduction 3
Motivation
Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments
• Model requires dynamics and rewards
4. Introduction 4
Motivation
Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments
• Model requires dynamics and rewards
Specifying dynamics a priori can be difficult
• We can learn a model of the world in either an offline or online (reinforcement learning) setting
5. Introduction 5
Motivation
Markov Decision Processes have proven to be an extremely useful model for decision making in stochastic environments
• Model requires dynamics and rewards
In some simple cases reward can be thought of as being directly "observed"
• For instance: the reward in a robot navigation problem corresponding to the distance travelled
6. Introduction 6
Motivation
Except in some simple cases, the specification of reward functions for MDPs is problematic
• Rewards can vary user-to-user
• Preferences about which states/actions are "good" and "bad" need to be translated into precise numerical rewards
• It is time consuming to specify reward for all states/actions
Example domain: assistive technology
7. Introduction 7
Motivation
However,
• Near-optimal policies can be found without a fully specified reward function
• We can bound the performance of a policy using regret
8. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
9. Decision Theory 9
Utility
Given: a decision maker (DM), a set of possible outcomes Θ, and a set of lotteries L of the form:
l ≡ 〈p1, x1, p2, x2, …, pn, xn〉 where xi ∈ Θ and Σi pi = 1
Two-outcome shorthand: l ≡ 〈x1, p, x2〉 = 〈p, x1, (1 − p), x2〉
Compound lotteries (an outcome may itself be a lottery):
l1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉
(tree diagram: l1 branches to x and to the sub-lottery l2 over y and z)
10. Decision Theory 9
Utility
A compound lottery reduces to a simple one by multiplying probabilities through:
l1 = 〈0.75, x, 0.25, 〈0.6, y, 0.4, z〉〉 = 〈0.75, x, 0.15, y, 0.1, z〉
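The reduction above can be sketched in a few lines of Python; this is an illustrative helper (not from the deck), representing a lottery as a list of (probability, outcome) pairs where an outcome may itself be a lottery.

```python
def flatten(lottery):
    """Flatten a (possibly compound) lottery into a simple one.

    A lottery is a list of (probability, outcome) pairs; an outcome may
    itself be a lottery (a list), giving a compound lottery.
    """
    simple = {}
    for p, x in lottery:
        if isinstance(x, list):                 # nested sub-lottery
            for q, y in flatten(x):             # recurse, then weight by p
                simple[y] = simple.get(y, 0.0) + p * q
        else:
            simple[x] = simple.get(x, 0.0) + p
    return [(prob, outcome) for outcome, prob in simple.items()]

# l1 = <0.75, x, 0.25, <0.6, y, 0.4, z>>  reduces to  <0.75, x, 0.15, y, 0.1, z>
l1 = [(0.75, "x"), (0.25, [(0.6, "y"), (0.4, "z")])]
print(flatten(l1))
```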
11. Decision Theory 10
Utility
Axioms Completeness
Transitivity
Independence
Continuity
12. Decision Theory 11
Utility
Axioms: Completeness, Transitivity, Independence, Continuity
Completeness: For x, y ∈ Θ it is the case that either
• x is weakly preferred to y: x ⪰ y
• y is weakly preferred to x: y ⪰ x
• or one is indifferent: x ~ y
13. Decision Theory 12
Transitivity: For any x, y, z ∈ Θ, if x ⪰ y and y ⪰ z, then x ⪰ z
14. Decision Theory 13
Independence: For every l1, l2, l3 ∈ L and p ∈ (0,1), if l1 ≻ l2, then 〈l1, p, l3〉 ≻ 〈l2, p, l3〉
15. Decision Theory 14
Continuity: For every l1, l2, l3 ∈ L, if l1 ≻ l2 ≻ l3, then for some p ∈ (0,1): l2 ~ 〈l1, p, l3〉
16. Decision Theory 15
Given the axioms, there exists a utility function u : Θ → ℝ such that:
u(x) ≥ u(y) ⇔ x ⪰ y
u(l) = u(〈p1, x1, …, pn, xn〉) = Σi pi u(xi)
The utility of a lottery is the expected utility of its outcomes
17. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
18. Preference Elicitation 17
Queries: Ranking, Standard Gamble, Bound
Ranking: Please order this subset of outcomes 〈x1, x2, …, xm〉
u(x1) ≥ u(x2) ≥ u(x3) ≥ ⋯ ≥ u(xm)
19. Preference Elicitation 18
Standard Gamble: Please choose a p for which you are indifferent between y and the lottery 〈x⊤, p, x⊥〉 (where x⊤ and x⊥ are the best and worst outcomes, with u(x⊤) = 1 and u(x⊥) = 0)
y ~ 〈x⊤, p, x⊥〉
u(y) = p
20. Preference Elicitation 19
Bound: Is y at least as good as the lottery 〈x⊤, b, x⊥〉?
y ⪰ 〈x⊤, b, x⊥〉
u(y) ≥ b
21. Preference Elicitation 20
Preference Elicitation
Rather than fully specifying a utility function, we
1. Make a decision w.r.t. an imprecisely specified utility function
2. Perform elicitation until we are satisfied with the decision
(loop diagram: probabilities and utilities feed Make Decision, then Satisfied? If YES, Done; if NO, Select Query, ask the User, refine the utility, and decide again)
22. Preference Elicitation 21
Robust Decision Criteria: Maximax, Maximin, Minimax Regret
Maximax: Given a set of feasible utility functions U,
argmax_{x∈Θ} max_{u∈U} u(x)
24. Preference Elicitation 23
Maximin:
argmax_{x∈Θ} min_{u∈U} u(x)
26. Preference Elicitation 25
Minimax Regret:
argmin_{x∈Θ} max_{x′∈Θ} max_{u∈U} u(x′) − u(x)
Running example (three outcomes, three feasible utility functions; slides 22-33 build up the Max Regret column):
      u1   u2   u3   Max Regret
x1    8    2    1    5
x2    7    7    1    1
x3    2    2    2    6
Maximax selects x1 (utility up to 8), maximin selects x3 (guaranteed utility 2), and minimax regret selects x2 (max regret 1).
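The three criteria can be checked against the running example with a short Python sketch (illustrative code, not from the deck, using the table's values):

```python
# Feasible utility functions over three outcomes (the slide's table).
U = {
    "u1": {"x1": 8, "x2": 7, "x3": 2},
    "u2": {"x1": 2, "x2": 7, "x3": 2},
    "u3": {"x1": 1, "x2": 1, "x3": 2},
}
outcomes = ["x1", "x2", "x3"]

def maximax(U, outcomes):
    # best outcome under the most optimistic feasible utility
    return max(outcomes, key=lambda x: max(u[x] for u in U.values()))

def maximin(U, outcomes):
    # best outcome under the most pessimistic feasible utility
    return max(outcomes, key=lambda x: min(u[x] for u in U.values()))

def max_regret(x, U, outcomes):
    # worst-case utility shortfall of x against any alternative x'
    return max(u[xp] - u[x] for u in U.values() for xp in outcomes)

def minimax_regret(U, outcomes):
    return min(outcomes, key=lambda x: max_regret(x, U, outcomes))

print(maximax(U, outcomes), maximin(U, outcomes), minimax_regret(U, outcomes))
# prints: x1 x3 x2
```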
34. Preference Elicitation 33
Bayesian Decision Criteria: Expected Utility, Value at Risk
Assuming we have a prior φ over potential utility functions:
35. Preference Elicitation 34
Expected Utility:
argmax_{x∈Θ} E_φ[u(x)]
36. Preference Elicitation 35
Value at Risk:
argmax_{x∈Θ} max_δ δ subject to Pr_φ(u(x) ≥ δ) ≥ η
(illustration: with η = 90%, choose the largest δ such that 90% of the probability mass of u(x) lies above δ)
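With the prior represented by sampled utility functions, both criteria reduce to simple computations. A hedged Python sketch (the sample values are hypothetical, chosen so that the two criteria disagree):

```python
import math

def expected_utility_choice(samples, outcomes):
    """argmax_x of the mean of u(x) over sampled utility functions."""
    return max(outcomes,
               key=lambda x: sum(u[x] for u in samples) / len(samples))

def var_level(values, eta):
    """Largest delta such that at least an eta fraction of values are >= delta."""
    v = sorted(values)
    k = math.ceil(eta * len(v))      # need k samples at or above delta
    return v[len(v) - k]

def value_at_risk_choice(samples, outcomes, eta=0.9):
    return max(outcomes, key=lambda x: var_level([u[x] for u in samples], eta))

# Hypothetical prior: five sampled utility functions over two outcomes.
samples = [
    {"a": 10, "b": 6}, {"a": 9, "b": 6}, {"a": 8, "b": 5},
    {"a": 1, "b": 5},  {"a": 0, "b": 4},
]
outcomes = ["a", "b"]
print(expected_utility_choice(samples, outcomes))        # "a": mean 5.6 vs 5.2
print(value_at_risk_choice(samples, outcomes, eta=0.8))  # "b": safer tail
```

Here "a" wins on expected utility but "b" wins at risk level η = 0.8, which is exactly the kind of risk sensitivity the value-at-risk criterion is meant to capture.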
37. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
38. Markov Decision Processes 37
Markov Decision Process (MDP)
S - set of states
A - set of actions
Pr(s′ | s, a) - transitions
α - starting state distribution
γ - discount factor
r(s) - reward [or r(s, a)]
(diagram: the AGENT sends actions a_t, a_{t+1}, … to the WORLD and receives states s_t, s_{t+1}, … and rewards r_t, r_{t+1}, …)
39. Markov Decision Processes 37
Same model, but now the dynamics Pr(s′ | s, a) are known while the reward r(s) is unknown (marked "?")
40. Markov Decision Processes 38
MDP - Policies
Policy: A stationary policy π maps each state to an action
For infinite-horizon discounted MDPs, there is always an optimal policy that is stationary
Policy Value: Given a policy π, the value of a state is
V^π(s0) = E[ Σ_{t=0}^∞ γ^t r_t | π, s0 ]
41. Markov Decision Processes 39
MDP - Computing the Value Function
The value of a policy can be found by successive approximation:
V_0^π(s) = r(s, a_π)
V_1^π(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V_0^π(s′)
⋮
V_k^π(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V_{k−1}^π(s′)
There will exist a fixed point:
V^π(s) = r(s, a_π) + γ Σ_{s′} Pr(s′ | s, a_π) V^π(s′)
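The successive-approximation scheme is a few lines of Python for finite MDPs (an illustrative sketch; the data layout and the tiny chain example are assumptions, not the deck's code):

```python
def evaluate_policy(P, r, pi, gamma, tol=1e-10):
    """Successive approximation of V^pi for a finite MDP.

    P[a][s][t] is Pr(t | s, a); r[s][a] is the reward; pi[s] is the action
    chosen by a stationary policy. Iterates V_k until a numerical fixed point.
    """
    n = len(pi)
    V = [r[s][pi[s]] for s in range(n)]                      # V_0
    while True:
        Vn = [r[s][pi[s]] + gamma * sum(P[pi[s]][s][t] * V[t] for t in range(n))
              for s in range(n)]
        if max(abs(Vn[s] - V[s]) for s in range(n)) < tol:
            return Vn
        V = Vn

# Tiny two-state, one-action chain: reward 1 in state 0, 0 in state 1.
P = [[[0.0, 1.0], [0.0, 1.0]]]     # action 0 always moves to state 1
r = [[1.0], [0.0]]
pi = [0, 0]
print(evaluate_policy(P, r, pi, 0.9))
```

On the chain, the fixed point is V(1) = 0 + 0.9·V(1) = 0 and V(0) = 1 + 0.9·V(1) = 1.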
42. Markov Decision Processes 40
MDP - Optimal Value Functions
We wish to find the optimal policy π*:
π* : V^{π*} ≥ V^{π′} ∀π′
Bellman Equation:
V^{π*}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^{π*}(s′) ]
43. Markov Decision Processes 41
Value Iteration Algorithm
Yields an ε-optimal policy:
1. Initialize V_0, set n = 0, choose ε > 0
2. For each s:
V_{n+1}(s) = max_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_n(s′) ]
3. If ‖V_{n+1} − V_n‖ > ε(1 − γ) / (2γ): increment n and return to step 2
We can recover the policy by finding the best one-step action:
π(s) = argmax_a [ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ]
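A minimal sketch of the algorithm for finite MDPs (illustrative Python with the slide's stopping rule; the two-state example is hypothetical):

```python
def value_iteration(P, r, gamma, eps=1e-6):
    """Value iteration yielding an eps-optimal policy.

    P[a][s][t] is Pr(t | s, a); r[s][a] is the reward. Stops when the
    sup-norm change drops to eps*(1-gamma)/(2*gamma), as on the slide.
    """
    nA, nS = len(P), len(r)
    V = [0.0] * nS
    while True:
        Q = [[r[s][a] + gamma * sum(P[a][s][t] * V[t] for t in range(nS))
              for a in range(nA)] for s in range(nS)]
        Vn = [max(Q[s]) for s in range(nS)]
        if max(abs(Vn[s] - V[s]) for s in range(nS)) <= eps * (1 - gamma) / (2 * gamma):
            # recover the greedy one-step policy
            pi = [max(range(nA), key=lambda a: Q[s][a]) for s in range(nS)]
            return Vn, pi
        V = Vn

# Two states, two actions: in state 0, action 1 pays 1 and stays put;
# action 0 pays 0 and moves to the absorbing, zero-reward state 1.
P = [[[0.0, 1.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 1.0], [0.0, 0.0]]
V, pi = value_iteration(P, r, 0.9)
print(V, pi)   # V(0) approaches 1/(1-0.9) = 10; policy stays in state 0
```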
44. Markov Decision Processes 42
Linear Programming Formulation
minimize_V Σ_s α(s) V(s)
subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ∀a, s
45. Markov Decision Processes 43
MDP - Occupancy Frequencies
An occupancy frequency f(s, a) expresses the total discounted probability of being in state s and taking action a
A valid f(s, a) satisfies:
Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0) ∀s0
46. Markov Decision Processes 44
LP - Occupancy Frequency
Primal:
minimize_V Σ_s α(s) V(s)
subject to: V(s) ≥ r(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′) ∀a, s
Dual:
maximize_f Σ_s Σ_a f(s, a) r(s, a)
subject to: Σ_a f(s0, a) − γ Σ_s Σ_a Pr(s0 | s, a) f(s, a) = α(s0) ∀s0
47. Markov Decision Processes 44
By LP duality, the optimal objective values coincide:
Σ_s Σ_a f(s, a) r(s, a) = Σ_s α(s) V(s)
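The identity can be sanity-checked numerically on a toy MDP: compute V^π by successive approximation and f by unrolling the discounted state distribution, then compare the two objectives. This is an illustrative sketch (the MDP is hypothetical); in fact α·V^π = f^π·r holds for any fixed policy π, since both sides equal the expected discounted return.

```python
def policy_value(P, r, pi, gamma, iters=2000):
    """V^pi by successive approximation; P[a][s][t] = Pr(t|s,a)."""
    nS = len(r)
    V = [0.0] * nS
    for _ in range(iters):
        V = [r[s][pi[s]] + gamma * sum(P[pi[s]][s][t] * V[t] for t in range(nS))
             for s in range(nS)]
    return V

def occupancy(P, pi, alpha, gamma, iters=2000):
    """Discounted occupancy f(s, pi(s)); other actions carry zero mass."""
    nS = len(alpha)
    f = [0.0] * nS
    d = list(alpha)                    # d[s] = gamma^t * Pr(s_t = s)
    for _ in range(iters):
        f = [f[s] + d[s] for s in range(nS)]
        d = [gamma * sum(d[s] * P[pi[s]][s][t] for s in range(nS))
             for t in range(nS)]
    return f

# Toy MDP: two states, a single action, fixed policy.
P = [[[0.5, 0.5], [0.2, 0.8]]]
r = [[2.0], [1.0]]
alpha = [1.0, 0.0]
pi = [0, 0]
gamma = 0.9

V = policy_value(P, r, pi, gamma)
f = occupancy(P, pi, alpha, gamma)
lhs = sum(f[s] * r[s][pi[s]] for s in range(2))   # sum_{s,a} f(s,a) r(s,a)
rhs = sum(alpha[s] * V[s] for s in range(2))      # sum_s alpha(s) V(s)
print(lhs, rhs)   # equal up to numerical tolerance
```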
48. Markov Decision Processes 45
MDP Summary Slide (Policies, Dynamics, Rewards)
Over the past couple of decades there has been a lot of work done on scaling MDPs:
• Factored models
• Decomposition
• Linear approximation
49. Markov Decision Processes 46
To use these algorithms we need a model of the dynamics (transition function). There are techniques for:
• Deriving models of dynamics from data
• Finding policies that are robust to inaccurate transition models
50. Markov Decision Processes 47
There has been comparatively little work on specifying rewards:
• Finding policies that are robust to imprecise reward models
• Eliciting reward information from users
51. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work
52. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
53. Current Work 50
Current Work
(loop diagram: an MDP with reward polytope R feeds Compute Robust Policy, then Satisfied? If YES, Done; if NO, Select Query, ask the User, and refine R)
54. Model : MDP 51
MDP - Reward Uncertainty
We quantify the strict uncertainty over reward with a set of feasible reward functions R
We specify R using a set of linear inequalities forming a polytope
Where do these inequalities come from?
• Bound queries: Is r(s, a) > b?
• Policy comparisons: Is f_π · r > f_π′ · r?
55. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
56. Computation 53
Minimax Regret
Original Formulation:
min_{f∈F} max_{g∈F} max_{r∈R} g·r − f·r
Benders' Decomposition:
minimize_{f∈F, δ} δ
subject to: δ ≥ g·r − f·r ∀g ∈ F, r ∈ R
57. Computation 54
The maximums will exist at the vertices of F and R, so the constraints can be restricted to them:
subject to: δ ≥ g·r − f·r ∀g ∈ V(F), r ∈ V(R)
Rather than enumerating an exponential number of vertices we use constraint generation
58. Computation 55
Minimax Regret - Constraint Generation
1. We limit the adversary
• Player minimizes regret w.r.t. a small set of adversary responses
2. We untie the adversary's hands
• Adversary finds maximum regret w.r.t. the player's policy
• Add the adversary's choice of r and g to the set of adversary responses
Done when untying the adversary's hands yields no improvement
• i.e. regret of the minimizing player = regret of the maximizing adversary
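On small finite sets the loop can be written out directly. This is a toy Python sketch, not the deck's method: the player's LP and the adversary's MIP from the next slides are replaced by brute-force search over a finite decision set F and a finite set of reward/utility vertices, reusing the table from the decision-criteria slides.

```python
def regret(g, f, r):
    """Regret of choice f against alternative g under reward/utility r."""
    return r[g] - r[f]

def constraint_generation(F, R_vertices):
    """Minimax regret by alternating a limited player and an untied adversary."""
    gen = [(F[0], R_vertices[0])]            # arbitrary initial constraint set
    while True:
        # 1. Player: minimize regret w.r.t. the generated adversary responses.
        f = min(F, key=lambda c: max(regret(g, c, r) for g, r in gen))
        player_regret = max(regret(g, f, r) for g, r in gen)
        # 2. Adversary: find max regret w.r.t. the player's choice, hands untied.
        g, r = max(((g, r) for g in F for r in R_vertices),
                   key=lambda gr: regret(gr[0], f, gr[1]))
        if regret(g, f, r) <= player_regret + 1e-12:
            return f, player_regret          # no improvement: done
        gen.append((g, r))                   # add the violated constraint

F = ["x1", "x2", "x3"]
R_vertices = [{"x1": 8, "x2": 7, "x3": 2},
              {"x1": 2, "x2": 7, "x3": 2},
              {"x1": 1, "x2": 1, "x3": 2}]
print(constraint_generation(F, R_vertices))   # -> ('x2', 1)
```

The loop terminates because each iteration adds a constraint the player was violating, and there are finitely many; it stops exactly when the player's relaxed regret matches the adversary's max regret.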
59. Computation 56
Constraint Generation - Player
1. Limit adversary:
minimize_{f∈F, δ} δ
subject to: δ ≥ g·r − f·r ∀〈g, r〉 ∈ Gen
60. Computation 57
Constraint Generation - Adversary
2. Untie adversary's hands: given player policy f,
max_{g∈F} max_{r∈R} g·r − f·r
The bilinear objective makes this formulation non-convex; we reformulate it as a mixed integer linear program
61. Computation 58
Constraint Generation - Adversary
Computation of MR(f, R) is realized by the following MIP, using value and Q-functions:
maximize_{Q,V,I,r} α·V − r·f (9)
subject to: Q_a = r_a + γ P_a V ∀a ∈ A
V ≥ Q_a ∀a ∈ A (10)
V ≤ (1 − I_a) M_a + Q_a ∀a ∈ A (11)
C r ≤ d
Σ_a I_a = 1 (12)
I_a(s) ∈ {0, 1} ∀a, s (13)
M_a = M⊤ − M_a⊥
Here I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s
Only tractable for small Markov Decision Problems
62. Computation 59
Approximating Minimax Regret
We relax the Max Regret MIP formulation
The value of the resulting policy is no longer exact; however, the resulting reward is still feasible, and we find the optimal policy w.r.t. that reward
(figures: relative approximation error vs. number of states, for max regret and for minimax regret)
Figure 3: Relative approximation error of linear relaxation
63. Computation 60
Scaling (Log Scale)
(figure: run time in ms, log scale, vs. number of states, comparing exact and approximate minimax regret)
Figure 2: Scaling of constraint generation with number of states.
64. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
65. Reward Elicitation 62
Reward Elicitation
(loop diagram: an MDP with reward polytope R feeds Compute Robust Policy, then Satisfied? If YES, Done; if NO, Select Query, ask the User, and refine R)
66. Reward Elicitation 63
Bound Queries
Query: Is r(s, a) > b? where b is a point between the upper and lower bounds of r(s, a)
Gap: Δ(s, a) = max_{r′∈R} r′(s, a) − min_{r∈R} r(s, a)
At each step of elicitation we need to select the parameters s, a and b using the gap
67. Reward Elicitation 64
Selecting Bound Queries
Halve the Largest Gap (HLG):
• Select the s, a with the largest gap Δ(s, a)
• Set b to the midpoint of the interval for r(s, a)
Current Solution (CS):
• Use the current solution g(s, a) [or f(s, a)] of the minimax regret calculation to weight each gap Δ(s, a)
• Select the s, a with the largest weighted gap g(s, a) Δ(s, a)
• Set b to the midpoint of the interval for r(s, a)
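Both strategies are an argmax over state-action pairs. A small illustrative sketch (lo/hi map each (s, a) to the current reward bounds, g is the adversary's occupancy frequency; all values here are hypothetical):

```python
def hlg_query(lo, hi):
    """Halve the Largest Gap: query the midpoint of the widest reward interval."""
    sa = max(lo, key=lambda k: hi[k] - lo[k])
    return sa, (lo[sa] + hi[sa]) / 2.0

def cs_query(lo, hi, g):
    """Current Solution: weight each gap by the occupancy frequency g(s, a)."""
    sa = max(lo, key=lambda k: g[k] * (hi[k] - lo[k]))
    return sa, (lo[sa] + hi[sa]) / 2.0

# Hypothetical bounds on r(s, a) for three state-action pairs.
lo = {("s1", "a1"): 0.0, ("s1", "a2"): 2.0, ("s2", "a1"): 1.0}
hi = {("s1", "a1"): 10.0, ("s1", "a2"): 4.0, ("s2", "a1"): 9.0}
g  = {("s1", "a1"): 0.1, ("s1", "a2"): 0.5, ("s2", "a1"): 2.0}

print(hlg_query(lo, hi))    # widest gap:   ("s1", "a1"), b = 5.0
print(cs_query(lo, hi, g))  # weighted gap: ("s2", "a1"), b = 5.0
```

The answer to the query then tightens one side of the chosen interval (shrinking the polytope R), and minimax regret is recomputed.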
68. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
69. Evaluation 66
Experimental Setup
Randomly generated MDPs: semi-sparse random transition function, discount factor of 0.95
Random true reward drawn from a fixed interval; upper and lower bounds on reward drawn randomly
All results are averaged over 20 runs
10 states, 5 actions
70. Evaluation 67
Elicitation Effectiveness
We examine the combination of each criterion for robust policies with each of the elicitation strategies:
Criteria: Minimax Regret (MMR), Maximin (MR)
Strategies: Halve the Largest Gap (HLG), Current Solution (CS)
73. Evaluation 70
Queries per Reward Point - Random MDP
(figure: number of queries per reward point; most of the reward space is left unexplored while a small set of "high impact" reward points is queried repeatedly)
74. Evaluation 71
Autonomic Computing
Setup: k hosts, each with a demand level and a share of a total resource pool
• 2 hosts
• 3 demand levels
• 3 units of resource
Model: 90 states, 10 actions
77. Outline 1. Decision Theory
2. Preference Elicitation
3. MDPs
4. Current Work A. Imprecise Reward Specification
B. Computing Robust Policies
C. Preference Elicitation
D. Evaluation
E. Future Work
78. Introduction 75
Overview
(loop diagram: an MDP with reward polytope R feeds Compute Robust Policy, then Satisfied? If YES, Done; if NO, Select Query, ask the User, and refine R)
79. Introduction 75
Contributions
1. A technique for finding robust policies using minimax regret
2. A simple elicitation procedure that quickly leads to near-optimal/optimal policies
80. Conclusion 76
Future Work: Scaling
Bottleneck: the adversary's max regret computation
Idea: the set Γ of adversary policies g that will ever be a regret-maximizing response can be small
We have:
• An algorithm to find Γ
• A theorem showing the algorithm runs in time polynomial in the number of policies found
• Approaches that use Γ to efficiently compute max regret
81. Conclusion 77
Future Work: Factored MDPs
Working with factored MDPs will:
• Model problems in a more natural way
• Allow us to lower the dimensionality of the reward functions
• Leverage existing techniques for scaling MDPs that take advantage of factored structure
82. Conclusion 78
Future Work: Richer Queries
• In state s, which action would you like to take?
• In state s, do you prefer action a1 to a2?
• Do you prefer the sequence s1, a1, s2, a2, …, sk to s′1, a′1, s′2, a′2, …, s′k?
83. Conclusion 79
Future Work: Richer Queries (tradeoffs)
Do you prefer the tradeoff
f(s2, a3) = f1 amount of time doing (s2, a3) and f(s1, a4) = f2 amount of time doing (s1, a4)
or
f′(s2, a3) = f′1 amount of time doing (s2, a3) and f′(s1, a4) = f′2 amount of time doing (s1, a4)?
(example diagram: states "Cab Available" and "No Street Car" with actions "Take Cab" and "Waiting", swapping the amounts of time f1 and f2 between the two state-action pairs)
87. f g r
ax min r·f (7) subject to: γE f + α = 0 Appendix 82
F r∈R
γE g + α = 0
Full Formulation
on the adversary. If MR(f , R) = MMR (R) then the con- to com
uncertainty in any MDP pa- Cr ≤ at
Much prior work has focused on uncertainty in MDPs, where robust solutions are computed for uncertain transition models; the question of eliciting information about rewards is left unaddressed. One can tackle a maximin criterion by decomposing the problem and using dynamic programming, with an inner optimization to find the worst case. McMahan, Gordon, and Blum use value iteration and a linear programming approach to efficiently compute the maximin value of an MDP (we compare their approach to ours below). Delage and Mannor address the problem of uncertainty over rewards (and transition functions) in the presence of the percentile criterion, which can be less conservative than maximin. They also consider eliciting rewards, using sampling to compute the value of information of noisy information about reward space. The percentile approach, however, does not offer a bound on solution quality. Other approaches ([20]) also adopt maximin.

The infinite number of constraints in the minimax regret program can be reduced: first, we need only retain as potentially active the constraints for the vertices of polytope R; second, for any r ∈ R, we only require the constraint corresponding to its optimal policy g*_r. However, vertex enumeration is not feasible, so we apply Benders' decomposition [2] to iteratively generate constraints. At each iteration, two optimizations are solved.

The master problem solves a relaxation of program (8) using only a small subset of the constraints, corresponding to a subset Gen of all ⟨g, r⟩ pairs; we call these the generated constraints. Initially, this set is arbitrary (e.g., empty). The master problem is the minimization

    minimize_{f,δ}  δ                                  (8)
    subject to:  r·g − r·f ≤ δ      ∀ ⟨g, r⟩ ∈ Gen
                 γE f + α = 0

where the second constraint ensures f is a valid occupancy frequency.

The subproblem computes MR(f, R), the maximum regret of the master solution f. Its computation is realized by the following MIP, which corresponds to the standard dual LP formulation of an MDP (using Q-functions) with the addition of adversarial policy constraints:

    maximize_{Q,V,I,r}  α·V − r·f                      (9)
    subject to:  Q_a = r_a + γP_a V         ∀ a ∈ A
                 V ≥ Q_a                    ∀ a ∈ A    (10)
                 V ≤ (1 − I_a)M⊥_a + Q_a    ∀ a ∈ A    (11)
                 Σ_a I_a = 1                           (12)
                 I_a(s) ∈ {0, 1}            ∀ a, s     (13)
                 Cr ≤ d

where M⊥_a = M − M_a is a big-M constant chosen large enough that constraint (11) is vacuous whenever I_a(s) = 0. Here I represents the adversary's policy, with I_a(s) denoting the probability of action a being taken at state s.

If the constraint for the ⟨g, r⟩ pair returned by the subproblem is satisfied by the current solution, then indeed all unexpressed constraints must be satisfied as well, and the process terminates with the minimax-optimal solution. Otherwise MR(f, R) > MMR(R) for the current relaxation, implying that the constraint for ⟨g, r⟩ is violated (indeed, it is the maximally violated such constraint); it is added to Gen and the process repeats.
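This constraint-generation loop can be sketched in code. The following is a minimal illustrative implementation, not the paper's: the toy MDP and interval reward bounds are invented, and the adversary subproblem is brute-forced over deterministic policies and reward-box vertices instead of solving the MIP (9)-(13), which is feasible only for tiny MDPs.

```python
import itertools

import numpy as np
from scipy.optimize import linprog

# Toy MDP (all numbers illustrative): 2 states, 2 actions, discount 0.9.
S, A, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.2, 0.8]],   # P[a, s, s'] for action 0
              [[0.5, 0.5], [0.9, 0.1]]])  # ... and for action 1
alpha = np.array([0.5, 0.5])              # initial state distribution
r_lo = np.zeros((A, S))                   # interval (box) reward uncertainty:
r_hi = np.ones((A, S))                    # r_lo <= r <= r_hi componentwise

def occupancy(g):
    """Occupancy frequencies F[a, s] of a deterministic policy g."""
    Ppi = np.array([P[g[s], s] for s in range(S)])        # transitions under g
    f_s = np.linalg.solve(np.eye(S) - gamma * Ppi.T, alpha)
    F = np.zeros((A, S))
    for s in range(S):
        F[g[s], s] = f_s[s]
    return F

def max_regret(f):
    """Adversary subproblem, brute-forced: maximize r.g - r.f over
    deterministic policies g and box-vertex rewards r (MIP stand-in)."""
    best = (-np.inf, None)
    for g in itertools.product(range(A), repeat=S):
        c = occupancy(g) - f                  # objective is linear in r ...
        r = np.where(c > 0, r_hi, r_lo)       # ... so a box vertex is optimal
        val = float((c * r).sum())
        if val > best[0]:
            best = (val, (occupancy(g), r))
    return best

# Constraint generation: alternate the master LP (8) and the subproblem.
n = A * S + 1                                 # variables: f (flattened), delta
A_eq = np.zeros((S, n))                       # occupancy-frequency constraints
for sp in range(S):
    for a in range(A):
        for s in range(S):
            A_eq[sp, a * S + s] = float(s == sp) - gamma * P[a, s, sp]
Gen = []                                      # generated <g, r> pairs
for _ in range(200):
    rows, rhs = [], []
    for Fg, r in Gen:                         # encode r.g - r.f <= delta
        row = np.zeros(n)
        row[:-1], row[-1] = -r.ravel(), -1.0
        rows.append(row)
        rhs.append(-(Fg * r).sum())
    res = linprog(np.eye(n)[-1], A_ub=rows or None, b_ub=rhs or None,
                  A_eq=A_eq, b_eq=alpha, bounds=[(0, None)] * n)
    f, delta = res.x[:-1].reshape(A, S), res.x[-1]
    mr, pair = max_regret(f)
    if mr <= delta + 1e-8:                    # no violated constraint remains:
        break                                 # f is minimax optimal
    Gen.append(pair)

print(f"minimax regret ~ {delta:.4f} after {len(Gen)} generated constraints")
```

Each round either certifies that no ⟨g, r⟩ constraint is violated, so the master's f is minimax optimal, or adds the maximally violated constraint to Gen, mirroring the termination test above.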
[Figure: Regret Gap vs. Time]
Editor's Notes
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use it we need to know dynamics and the rewards involved. A lot of work has been done on learning dynamics both in an offline and online setting.
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use one, we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics, both in offline and online settings. Rewards are often assumed to be directly observable parts of the world. My perspective is that "rewards are in people's heads": in some cases there is a simple mapping between what you want (in your head) and the world (i.e., finding the shortest path).
Markov decision processes are an extremely useful model for decision making in stochastic environments. To use one, we need to know the dynamics and the rewards involved. A lot of work has been done on learning dynamics, both in offline and online settings. In some simple cases, rewards can be thought of as directly observable: for instance, the distance travelled in a robot navigation problem where we are trying to get a robot from point A to point B. But when I am in my car trying to get from point A to point B, I want the path with the fewest stoplights; someone else may want the path with the nicest scenery, while someone else may sacrifice some stoplights for some scenery. Reward is a surrogate for subjective preferences. (Flip slide: sometimes it's easy, but...)
The dynamics, in combination with simple bounds on the reward function, lead to areas of reward space not having a high impact on the value of a policy.
Maximin is common but we use regret
So for f2, no matter what the instantiation of the reward, the player only stood to gain one by changing policy. This is an intuitive measure.
The robust MDP literature often assumes transitions are known.
Transitions are learnable; they do not change from user to user.
Convergence properties
explain how constraints create max
Question: Is it worth spending more time to familiarize the audience with "occupancy frequencies"?
I will vocally mention other representations
In the voice over I will explain how each reformulation proceeds from the previous expression
I would also like to give a clear intuition as to why this is inherently hard.
On average less than 10% error
I will also note that on a 90-state MDP with 16 actions, the relaxation computes minimax regret in less than 3 seconds.
Here I will review the preference elicitation process
Note that it is useful in non-sequential
Now, I have left out the autonomic computing results due to lack of time. If there is a little time after giving the results for the random MDPs, I can state that we have similar results for a large MDP instance.
20 runs: 20 MDPs, each with 10 states and 5 actions.
We have a partially specified reward function and compute a robust policy that minimizes maximum regret. We then elicit information about the reward function, which leads to a better policy. We continue this process until we have an optimal policy or the regret guarantee is small enough.
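The elicitation loop described in this note can be sketched on a one-step (bandit) problem, where minimax regret has a closed form. Everything here is illustrative: the hidden reward, the interval bounds, and the widest-interval bound query are stand-ins, not the paper's exact procedures.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = rng.random(4)                 # hidden user reward, one value per action
lo, hi = np.zeros(4), np.ones(4)       # current interval bounds: lo <= r <= hi

def minimax_regret(lo, hi):
    """Closed-form minimax regret for committing to a single action a:
    the adversary sets r[a] = lo[a] and every rival to its upper bound."""
    mr = [max(0.0, float(np.delete(hi, a).max()) - lo[a])
          for a in range(len(lo))]
    a = int(np.argmin(mr))
    return a, mr[a]

queries = 0
a, gap = minimax_regret(lo, hi)
while gap > 0.05:                      # stop once the regret guarantee is small
    i = int(np.argmax(hi - lo))        # query the widest reward interval
    mid = (lo[i] + hi[i]) / 2
    if true_r[i] >= mid:               # bound query: "is r(i) at least mid?"
        lo[i] = mid                    # yes: tighten the lower bound
    else:
        hi[i] = mid                    # no: tighten the upper bound
    queries += 1
    a, gap = minimax_regret(lo, hi)

print(f"recommend action {a} after {queries} queries; "
      f"regret guarantee {gap:.3f}")
```

Each answer halves one reward interval, the robust recommendation is recomputed, and the loop stops as soon as the minimax-regret guarantee drops below the threshold, just as in the sequential setting.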
I will give the context in the voice over. The main idea on this slide is that, in practice, constraint generation converges quickly. To segue to the next slide, I will recall that we still need to solve a MIP with |S||A| variables and constraints; thus we developed an approximation.