Decision Making Under
Uncertainty
Russell and Norvig: ch 16, 17
CMSC 671 – Fall 2005
material from Lise Getoor, Jean-Claude
Latombe, and Daphne Koller
Decision Making Under
Uncertainty
Many environments have multiple
possible outcomes
Some of these outcomes may be good;
others may be bad
Some may be very likely; others unlikely
What’s a poor agent to do??
Non-Deterministic vs.
Probabilistic Uncertainty
Non-deterministic model: an action has a set of possible
outcomes {a, b, c}
 ⇒ choose the decision that is best for the worst case
 (~ adversarial search)
Probabilistic model: an action has outcomes a, b, c with
probabilities pa, pb, pc
 ⇒ choose the decision that maximizes expected utility value
Expected Utility
Random variable X with n values x1,…,xn
and distribution (p1,…,pn)
E.g.: X is the state reached after doing
an action A under uncertainty
Function U of X
E.g., U is the utility of a state
The expected utility of A is
EU[A] = Σi=1,…,n P(xi|A) U(xi)
[Diagram: action A1 taken in state s0 leads to states s1, s2, s3 with
probabilities 0.2, 0.7, 0.1 and utilities 100, 50, 70]
U(S0) = 100 x 0.2 + 50 x 0.7 + 70 x 0.1
= 20 + 35 + 7
= 62
One State/One Action Example
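A quick sketch of this computation in Python (the probabilities and utilities are the ones from the diagram above):

```python
# Expected utility of action A1 from s0, using the numbers in the example above.
outcomes_A1 = [(0.2, 100), (0.7, 50), (0.1, 70)]  # (P(state | A1), U(state))

eu_A1 = sum(p * u for p, u in outcomes_A1)
print(eu_A1)  # 62.0
```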
[Diagram: from s0, action A1 again leads to s1, s2, s3 with probabilities
0.2, 0.7, 0.1 and utilities 100, 50, 70; action A2 leads to two states with
probabilities 0.2 and 0.8, the new state s4 having utility 80]
• U1(S0) = 62
• U2(S0) = 74
• U(S0) = max{U1(S0),U2(S0)}
= 74
One State/Two Actions Example
[Diagram: the same two actions as above, but now A1 incurs cost 5 and A2
incurs cost 25]
• U1(S0) = 62 – 5 = 57
• U2(S0) = 74 – 25 = 49
• U(S0) = max{U1(S0),U2(S0)}
= 57
Introducing Action Costs
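A minimal sketch of maximizing expected utility with action costs. The probabilities, utilities, and costs come from the slides above; the exact composition of A2's two outcomes is not fully shown, so the 0.2 branch is assumed to reach the utility-50 state (this reproduces U2(S0) = 74):

```python
# Maximum expected utility over actions, with a fixed cost per action.
# Outcome distributions follow the diagram; the 0.2 branch of A2 is an
# assumption (it reproduces the slide's value U2 = 74 before costs).
actions = {
    "A1": {"cost": 5,  "outcomes": [(0.2, 100), (0.7, 50), (0.1, 70)]},
    "A2": {"cost": 25, "outcomes": [(0.2, 50), (0.8, 80)]},
}

def expected_utility(action):
    spec = actions[action]
    return sum(p * u for p, u in spec["outcomes"]) - spec["cost"]

print({a: expected_utility(a) for a in actions})   # {'A1': 57.0, 'A2': 49.0}
print("MEU action:", max(actions, key=expected_utility))  # A1
```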
MEU Principle
A rational agent should choose the
action that maximizes the agent’s expected
utility
This is the basis of the field of decision
theory
The MEU principle provides a
normative criterion for rational
choice of action
Not quite…
Must have complete model of:
 Actions
 Utilities
 States
Even if you have a complete model, the decision
problem will generally be computationally intractable
In fact, a truly rational agent takes into account the
utility of reasoning as well---bounded rationality
Nevertheless, great progress has been made in this
area recently, and we are able to solve much more
complex decision-theoretic problems than ever before
We’ll look at
Decision-Theoretic Planning
 Simple decision making (ch. 16)
 Sequential decision making (ch. 17)
Axioms of Utility Theory
Orderability
 (A > B) ∨ (A < B) ∨ (A ~ B)
Transitivity
 (A > B) ∧ (B > C) ⇒ (A > C)
Continuity
 A > B > C ⇒ ∃p [p,A; 1-p,C] ~ B
Substitutability
 A ~ B ⇒ [p,A; 1-p,C] ~ [p,B; 1-p,C]
Monotonicity
 A > B ⇒ (p ≥ q ⇔ [p,A; 1-p,B] ≥ [q,A; 1-q,B])
Decomposability
 [p,A; 1-p, [q,B; 1-q, C]] ~ [p,A; (1-p)q, B; (1-p)(1-q), C]
Money Versus Utility
Money ≠ Utility
 More money is better, but not always in a
linear relationship to the amount of money
Expected Monetary Value (EMV) of a lottery L
Risk-averse – U(L) < U(S_EMV(L))
Risk-seeking – U(L) > U(S_EMV(L))
Risk-neutral – U(L) = U(S_EMV(L))
(S_EMV(L) is the state of having the EMV of lottery L for sure)
Value Function
Provides a ranking of alternatives, but not a
meaningful metric scale
Also known as an “ordinal utility function”
Remember the expectiminimax example:
 Sometimes, only relative judgments (value
functions) are necessary
 At other times, absolute judgments (utility
functions) are required
Multiattribute Utility Theory
A given state may have multiple utilities
 ...because of multiple evaluation criteria
 ...because of multiple agents (interested
parties) with different utility functions
We will talk about this more later in the
semester, when we discuss multi-agent
systems and game theory
Decision Networks
Extend BNs to handle actions and
utilities
Also called influence diagrams
Use BN inference methods to solve
Perform Value of Information
calculations
Decision Networks cont.
Chance nodes: random variables, as in
BNs
Decision nodes: actions that decision
maker can take
Utility/value nodes: the utility of the
outcome state.
R&N example
Umbrella Network
[Decision network: chance nodes weather and forecast, decision node
take/don’t take, chance node have umbrella, utility node happiness]
P(rain) = 0.4
Forecast model p(f|w):
 f = sunny | w = rain: 0.3
 f = rainy | w = rain: 0.7
 f = sunny | w = no rain: 0.8
 f = rainy | w = no rain: 0.2
Umbrella model:
 P(have | take) = 1.0
 P(~have | ~take) = 1.0
Utilities:
 U(have, rain) = -25
 U(have, ~rain) = 0
 U(~have, rain) = -100
 U(~have, ~rain) = 100
Evaluating Decision Networks
Set the evidence variables for current state
For each possible value of the decision node:
 Set decision node to that value
 Calculate the posterior probability of the parent
nodes of the utility node, using BN inference
 Calculate the resulting utility for action
Return the action with the highest utility
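A schematic version of this loop. The inference routine `infer_outcome_dist` is a hypothetical stand-in for whatever BN inference method is used; nothing here is tied to a specific library:

```python
# Schematic evaluation of a single-decision network.
# infer_outcome_dist(evidence, decision) is a hypothetical stand-in for a
# BN inference call returning P(parents of the utility node | evidence, decision).
def best_action(decision_values, evidence, infer_outcome_dist, utility):
    best_d, best_eu = None, float("-inf")
    for d in decision_values:
        dist = infer_outcome_dist(evidence, d)   # posterior over utility-node parents
        eu = sum(p * utility(outcome, d) for outcome, p in dist.items())
        if eu > best_eu:
            best_d, best_eu = d, eu
    return best_d, best_eu
```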
Decision Making:
Umbrella Network
[Same umbrella decision network and parameters as on the previous slide]
Should I take my umbrella??
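As a hedged worked answer for the case where no forecast has been observed, using only the prior P(rain) = 0.4 and the utility table above:

```python
# Take vs. don't take, before any forecast is observed.
p_rain = 0.4
U = {("have", "rain"): -25, ("have", "no rain"): 0,
     ("not have", "rain"): -100, ("not have", "no rain"): 100}

eu_take = p_rain * U[("have", "rain")] + (1 - p_rain) * U[("have", "no rain")]
eu_leave = p_rain * U[("not have", "rain")] + (1 - p_rain) * U[("not have", "no rain")]
print(eu_take, eu_leave)  # -10.0 20.0: with these numbers, leaving it behind is better
```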
Value of Information (VOI)
Suppose an agent’s current knowledge is E. The
value of the current best action α is

EU(α | E) = maxA Σi U(Resulti(A)) P(Resulti(A) | E, Do(A))

The value of the new best action (after new evidence
E’ is obtained):

EU(αE’ | E, E’) = maxA Σi U(Resulti(A)) P(Resulti(A) | E, E’, Do(A))

The value of information for E’ is therefore:

VOI(E’) = Σk P(E’ = ek | E) EU(αek | E, E’ = ek) − EU(α | E)
Value of Information:
Umbrella Network
[Same umbrella decision network and parameters as above]
What is the value of knowing the weather forecast?
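A sketch of that calculation, applying the VOI formula from the previous slide to the umbrella network's numbers (Bayes' rule for P(rain | forecast), then the best expected utility for each forecast value):

```python
# Value of information of the forecast: Bayes' rule for P(rain | f),
# then the best expected utility for each forecast value.
p_rain = 0.4
p_f_given_w = {("sunny", "rain"): 0.3, ("rainy", "rain"): 0.7,
               ("sunny", "no rain"): 0.8, ("rainy", "no rain"): 0.2}
U = {("have", "rain"): -25, ("have", "no rain"): 0,
     ("not have", "rain"): -100, ("not have", "no rain"): 100}

def meu(p_r):
    """Best expected utility when P(rain) = p_r."""
    eu_take = p_r * U[("have", "rain")] + (1 - p_r) * U[("have", "no rain")]
    eu_leave = p_r * U[("not have", "rain")] + (1 - p_r) * U[("not have", "no rain")]
    return max(eu_take, eu_leave)

voi = -meu(p_rain)                      # subtract EU of acting without the forecast
for f in ("sunny", "rainy"):
    p_f = (p_f_given_w[(f, "rain")] * p_rain
           + p_f_given_w[(f, "no rain")] * (1 - p_rain))
    p_rain_given_f = p_f_given_w[(f, "rain")] * p_rain / p_f
    voi += p_f * meu(p_rain_given_f)
print(voi)  # about 9 with these numbers
```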
Sequential Decision Making
Finite Horizon
Infinite Horizon
Simple Robot Navigation Problem
• In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model
• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
• With probability 0.8 the robot moves up one square (if the
robot is already in the top row, then it does not move)
Probabilistic Transition Model
• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
• With probability 0.8 the robot moves up one square (if the
robot is already in the top row, then it does not move)
• With probability 0.1 the robot moves right one square (if the
robot is already in the rightmost column, then it does not move)
Probabilistic Transition Model
• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
• With probability 0.8 the robot moves up one square (if the
robot is already in the top row, then it does not move)
• With probability 0.1 the robot moves right one square (if the
robot is already in the rightmost column, then it does not move)
• With probability 0.1 the robot moves left one square (if the
robot is already in the leftmost column, then it does not move)
Markov Property
The transition probabilities depend only
on the current state, not on the previous
history (how that state was reached)
Sequence of Actions
• Planned sequence of actions: (U, R)
[Grid: the robot starts in square [3,2] of the 4x3 world]
Sequence of Actions
• Planned sequence of actions: (U, R)
• U is executed
[Grid: after U is executed from [3,2], the robot may be in [3,3], [4,2], or [3,2]]
Histories
• Planned sequence of actions: (U, R)
• U has been executed
• R is executed
• There are 9 possible sequences of states
– called histories – and 6 possible final states
for the robot!
[Tree of states: after U the robot may be in [3,3], [4,2], or [3,2]; after R
it may end in [4,3], [3,3], [4,2], [4,1], [3,2], or [3,1]]
Probability of Reaching the Goal
•P([4,3] | (U,R).[3,2]) =
P([4,3] | R.[3,3]) x P([3,3] | U.[3,2])
+ P([4,3] | R.[4,2]) x P([4,2] | U.[3,2])
Note importance of Markov property
in this derivation
•P([3,3] | U.[3,2]) = 0.8
•P([4,2] | U.[3,2]) = 0.1
•P([4,3] | R.[3,3]) = 0.8
•P([4,3] | R.[4,2]) = 0.1
•P([4,3] | (U,R).[3,2]) = 0.65
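The same reasoning in code: a small helper that marginalizes over intermediate states step by step (this is exactly where the Markov property is used), followed by the direct arithmetic check. The transition-table format T is an assumed representation, not something defined on the slides:

```python
# Probability of ending in a goal square after a fixed action sequence,
# marginalizing over intermediate states one step at a time (the Markov
# property lets each step depend only on the current state). T is an
# assumed transition table: T[(state, action)] -> {next_state: probability}.
def prob_reach(goal, action_seq, start, T):
    dist = {start: 1.0}
    for a in action_seq:
        nxt = {}
        for s, p in dist.items():
            for s2, q in T[(s, a)].items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return dist.get(goal, 0.0)

# Direct check of the derivation above: via [3,3] plus via [4,2].
print(0.8 * 0.8 + 0.1 * 0.1)  # 0.65
```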
Utility Function
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
[Grid: terminal squares [4,3] = +1 and [4,2] = -1]
Utility Function
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
Utility Function
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] or [4,2] are terminal states
Utility of a History
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] or [4,2] are terminal states
• The utility of a history is defined by the utility of the last
state (+1 or –1) minus n/25, where n is the number of moves
Utility of an Action Sequence
• Consider the action sequence (U,R) from [3,2]
Utility of an Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some
probability
Utility of an Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some
probability
• The utility of the sequence is the expected utility of the histories:
U = Σh Uh P(h)
Optimal Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some
probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility
Optimal Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some
probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility
• But is the optimal action sequence what we want to
compute?
only if the sequence is executed blindly!
Accessible or
observable state
Repeat:
 s ← sensed state
 If s is terminal then exit
 a ← choose action (given s)
 Perform a
Reactive Agent Algorithm
Policy (Reactive/Closed-Loop Strategy)
• A policy P is a complete mapping from states to actions
Repeat:
 s ← sensed state
 If s is terminal then exit
 a ← P(s)
 Perform a
Reactive Agent Algorithm
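The same loop as a small Python sketch; `sense_state`, `is_terminal`, and `perform` are hypothetical environment hooks:

```python
# Reactive (closed-loop) execution of a policy P: state -> action.
def run_policy(P, sense_state, is_terminal, perform):
    while True:
        s = sense_state()
        if is_terminal(s):
            return s
        perform(P[s])
```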
Optimal Policy
• A policy P is a complete mapping from states to actions
• The optimal policy P* is the one that always yields a
history (ending at a terminal state) with maximal
expected utility
Makes sense because of Markov property
Note that [3,2] is a “dangerous”
state that the optimal policy
tries to avoid
Optimal Policy
• A policy P is a complete mapping from states to actions
• The optimal policy P* is the one that always yields a
history with maximal expected utility
This problem is called a
Markov Decision Problem (MDP)
How to compute P*?
Additive Utility
History H = (s0,s1,…,sn)
The utility of H is additive iff:
U(s0,s1,…,sn) = R(0) + U(s1,…,sn) = Σi R(i)
(R(i) is the reward received at step i)
Additive Utility
History H = (s0,s1,…,sn)
The utility of H is additive iff:
U(s0,s1,…,sn) = R(0) + U(s1,…,sn) = Σi R(i)
Robot navigation example:
 R(n) = +1 if sn = [4,3]
 R(n) = -1 if sn = [4,2]
 R(i) = -1/25 if i = 0, …, n-1
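A small sketch of this additive utility for the navigation example, with the per-step reward of -1/25 and the ±1 terminal rewards listed above:

```python
# Additive utility of a history (s0, ..., sn) for the robot example:
# each of the n moves costs 1/25, and the final (terminal) square is worth +1 or -1.
def history_utility(history):
    terminal_reward = {"[4,3]": 1.0, "[4,2]": -1.0}[history[-1]]
    n_moves = len(history) - 1
    return terminal_reward - n_moves / 25

print(history_utility(["[3,2]", "[3,3]", "[4,3]"]))  # 1 - 2/25 = 0.92
```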
Principle of Max Expected Utility
History H = (s0,s1,…,sn)
Utility of H: U(s0,s1,…,sn) = Σi R(i)
First-step analysis ⇒
U(i) = R(i) + maxa Σk P(k | a.i) U(k)
P*(i) = arg maxa Σk P(k | a.i) U(k)
Value Iteration
Initialize the utility of each non-terminal state si to
U0(i) = 0
For t = 0, 1, 2, …, do:
Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)
Value Iteration
Initialize the utility of each non-terminal state si to
U0(i) = 0
For t = 0, 1, 2, …, do:
Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)
[Plot: Ut([3,1]) rises from 0 and converges to about 0.611 within roughly 30 iterations]
[Grid of converged utilities: bottom row 0.705, 0.655, 0.611, 0.388;
middle row 0.762, 0.660 with terminal -1 at [4,2]; top row 0.812, 0.868, 0.918
with terminal +1 at [4,3]]
Note the importance
of terminal states and
connectivity of the
state-transition graph
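A compact value-iteration sketch. The MDP is passed in as generic tables (T for transitions, R for rewards), so this illustrates the update rule above rather than encoding the specific grid world:

```python
# Value iteration for the update U_{t+1}(i) = R(i) + max_a sum_k P(k | a.i) U_t(k).
# Assumed input format: T[(s, a)] is a dict {next_state: probability}, R[s] is the
# reward in state s, and `terminals` is the set of terminal states.
def value_iteration(states, actions, T, R, terminals, iters=50):
    U = {s: (R[s] if s in terminals else 0.0) for s in states}
    for _ in range(iters):
        new_U = {}
        for s in states:
            if s in terminals:
                new_U[s] = R[s]
                continue
            best = max(
                sum(p * U[k] for k, p in T[(s, a)].items())
                for a in actions
            )
            new_U[s] = R[s] + best
        U = new_U
    return U
```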
Policy Iteration
Pick a policy P at random
Policy Iteration
Pick a policy P at random
Repeat:
 Compute the utility of each state for P
Ut+1(i) ← R(i) + Σk P(k | P(i).i) Ut(k)
Policy Iteration
Pick a policy P at random
Repeat:
 Compute the utility of each state for P
Ut+1(i) ← R(i) + Σk P(k | P(i).i) Ut(k)
 Compute the policy P’ given these utilities
P’(i) = arg maxa Σk P(k | a.i) U(k)
Policy Iteration
Pick a policy P at random
Repeat:
 Compute the utility of each state for P
Ut+1(i) ← R(i) + Σk P(k | P(i).i) Ut(k)
 Compute the policy P’ given these utilities
P’(i) = arg maxa Σk P(k | a.i) U(k)
 If P’ = P then return P
Or solve the set of linear equations:
U(i) = R(i) + Σk P(k | P(i).i) U(k)
(often a sparse system)
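A corresponding policy-iteration sketch. Policy evaluation here uses the iterative update from the slide for a fixed number of sweeps rather than solving the linear system:

```python
import random

# Policy iteration; policy evaluation uses the iterative update from the slide
# (a fixed number of sweeps) instead of solving the linear system.
def policy_iteration(states, actions, T, R, terminals, eval_sweeps=30):
    P = {s: random.choice(list(actions)) for s in states if s not in terminals}
    while True:
        # Policy evaluation: U(i) = R(i) + sum_k P(k | P(i).i) U(k)
        U = {s: (R[s] if s in terminals else 0.0) for s in states}
        for _ in range(eval_sweeps):
            for s in P:
                U[s] = R[s] + sum(p * U[k] for k, p in T[(s, P[s])].items())
        # Policy improvement: P'(i) = argmax_a sum_k P(k | a.i) U(k)
        new_P = {
            s: max(actions, key=lambda a: sum(p * U[k] for k, p in T[(s, a)].items()))
            for s in P
        }
        if new_P == P:
            return P, U
        P = new_P
```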
Infinite Horizon
In many problems, e.g., the robot
navigation example, histories are
potentially unbounded and the same
state can be reached many times
One trick:
Use discounting to make the infinite-horizon
problem mathematically tractable
What if the robot lives forever?
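One common form of that trick: weight rewards t steps in the future by γ^t for some discount factor 0 < γ < 1, so the utility of even an unbounded history is a convergent geometric series. A sketch of the discounted backup (γ = 0.9 is an assumed value, not one given on the slide):

```python
# Discounted backup: U(i) <- R(i) + gamma * max_a sum_k P(k | a.i) U(k).
# With 0 < gamma < 1, infinite-horizon utilities are bounded by a geometric
# series, so value iteration still converges.
def discounted_backup(s, U, actions, T, R, gamma=0.9):
    return R[s] + gamma * max(
        sum(p * U[k] for k, p in T[(s, a)].items()) for a in actions
    )
```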
Example: Tracking a Target
• The robot must keep
the target in view
• The target’s trajectory
is not known in advance
• The environment may
or may not be known
An optimal policy cannot be computed ahead
of time:
- The environment might be unknown
- The environment may only be partially observable
- The target may not wait
⇒ A policy must be computed “on-the-fly”
POMDP (Partially Observable Markov Decision Problem)
• A sensing operation returns multiple
states, with a probability distribution
• Choosing the action that maximizes the
expected utility of this state distribution
assuming “state utilities” computed as
above is not good enough, and actually
does not make sense (is not rational)
Example: Target Tracking
There is uncertainty
in the robot’s and target’s
positions; this uncertainty
grows with further motion
There is a risk that the target
may escape behind the corner,
requiring the robot to move
appropriately
But there is a positioning
landmark nearby. Should
the robot try to reduce its
position uncertainty?
Summary
Decision making under uncertainty
Utility function
Optimal policy
Maximal expected utility
Value iteration
Policy iteration