This document presents lecture slides on planning in probabilistic domains, where each action has multiple possible outcomes and each outcome has a probability. A solution policy is safe if it reaches the goal with probability 1 and unsafe if it reaches the goal with probability strictly between 0 and 1; safe solutions may be acyclic or cyclic, while an unsafe solution gets stuck short of the goal with some nonzero probability. The slides cover expected cost, policy iteration, value iteration, AO* and LAO*, determinization-based replanning (FF-Replan, RFF), and bandit-based methods (UCB, UCT).
Automated Planning with Probabilistic Models
Dana Nau and Vikas Shivashankar: Lecture slides for Automated Planning and Acting. Updated 5/10/15.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Chapter 6: Deliberation with Probabilistic Domain Models
Dana S. Nau and Vikas Shivashankar, University of Maryland

Probabilistic Planning Domain

● Actions have multiple possible outcomes
Ø Each outcome has a probability
● Several possible action representations
Ø Bayes nets, probabilistic operators, …
● Book doesn’t commit to any representation
Ø Only deals with the underlying semantics
● Σ = (S,A,γ,Pr,cost)
Ø S = set of states
Ø A = set of actions
Ø γ : S × A → 2^S
Ø Pr(s′ | s, a) = probability of going to state s′ if we perform a in s
• Require Pr(s′ | s, a) > 0 for every s′ in γ(s,a)
Ø cost : S × A → ℝ⁺
• cost(s,a) = cost of action a in state s
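
Since the book commits only to this semantics, the following Python sketch is just one possible concrete rendering of Σ = (S,A,γ,Pr,cost); the class and field names (SSPDomain, gamma, prob, applicable) are illustrative assumptions, not from the book.

from dataclasses import dataclass

@dataclass
class SSPDomain:
    states: set    # S
    actions: set   # A
    gamma: dict    # (s, a) -> set of possible next states, i.e. γ(s, a)
    prob: dict     # (s, a, s') -> Pr(s' | s, a), required > 0 on γ(s, a)
    cost: dict     # (s, a) -> positive cost of performing a in s

    def applicable(self, s):
        # Actions with at least one possible outcome in state s
        return {a for a in self.actions if self.gamma.get((s, a))}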

Notation from Chapter 5

● Policy π : Sπ → A
Ø Sπ ⊆ S
Ø ∀s ∈ Sπ , π(s) ∈ Applicable(s)
● γ̂(s,π) = {s and all descendants of s reachable by π}
Ø the transitive closure of γ under π
● Graph(s,π) = rooted graph induced by π
Ø {nodes} = γ̂(s,π)
Ø {edges} = {(s,s′) | s ∈ Sπ, s′ ∈ γ(s,π(s))}
Ø root = s
● leaves(s,π) = {states in γ̂(s,π) that aren't in Sπ}
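
As a minimal sketch (reusing the SSPDomain class above, with a policy π represented as a dict from Sπ to actions), γ̂(s,π) and leaves(s,π) can be computed by breadth-first search over the policy graph; the helper names are mine, not the book's.

from collections import deque

def reachable(dom, s, pi):
    # γ̂(s, π): s and all states reachable from s by following π
    seen, frontier = {s}, deque([s])
    while frontier:
        u = frontier.popleft()
        if u in pi:                     # π specifies an action at u
            for v in dom.gamma[(u, pi[u])]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen

def leaves(dom, s, pi):
    # States in γ̂(s, π) at which π specifies no action
    return {u for u in reachable(dom, s, pi) if u not in pi}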

Stochastic Systems

● Stochastic shortest path (SSP) problem: a triple (Σ, s0, Sg)
● Solution for (Σ, s0, Sg):
Ø policy π such that s0 ∈ Sπ and leaves(s0,π) ⋂ Sg ≠ ∅
● π is closed if π doesn’t stop at non-goal states unless no action is applicable
Ø for every state s in γ̂(s0,π), at least one of the following holds:
• s ∈ Sπ (i.e., π specifies an action at s)
• s ∈ Sg (i.e., s is a goal state)
• Applicable(s) = ∅ (i.e., there are no applicable actions at s)
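
These definitions transcribe directly into checks; a sketch reusing reachable and leaves from above (illustrative, not the book's code):

def is_solution(dom, s0, Sg, pi):
    # s0 ∈ Sπ and leaves(s0, π) ∩ Sg ≠ ∅
    return s0 in pi and bool(leaves(dom, s0, pi) & Sg)

def is_closed(dom, s0, Sg, pi):
    # Every reachable state has an action, is a goal, or is a dead end
    return all(u in pi or u in Sg or not dom.applicable(u)
               for u in reachable(dom, s0, pi))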

Policies

[Figure: example domain with states s1–s6 and move actions (including move(r1,l2,l1)); Start = s1, Goal = s4]

● Robot r1 starts at location l1
Ø s0 = s1 in the diagram
● Objective is to get r1 to location l4
Ø Sg = {s4}
● π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
Ø Solution?
Ø Closed?

Histories

[Figure: same example domain; Start = s1, Goal = s4]

● History: a sequence of states, starting at s0
σ = 〈s0, s1, s2, s3, …, sh〉
or (not in book): σ = 〈s0, s1, s2, s3, …〉
● Let H(π) = {all histories that can be produced by following π from s0 to a state in leaves(s0,π)}
● If σ ∈ H(π) then Pr(σ | π) = ∏i≥0 Pr(si+1 | si, π(si))
Ø Thus ∑σ∈H(π) Pr(σ | π) = 1
● Probability that π will stop at a goal state:
Ø Pr(Sg | π) = ∑ {Pr(σ | π) | σ ∈ H(π) and σ ends at a state in Sg}
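
For an acyclic policy, H(π) is finite and Pr(Sg | π) can be computed by enumerating histories; here is a sketch under that assumption (a cyclic policy has infinitely many histories, so this enumeration would not terminate on one).

def goal_probability(dom, s0, Sg, pi):
    total = 0.0
    stack = [(s0, 1.0)]                  # (current state, Pr of the prefix)
    while stack:
        s, p = stack.pop()
        if s not in pi:                  # history ends at a leaf
            if s in Sg:
                total += p
            continue
        for s2 in dom.gamma[(s, pi[s])]:
            stack.append((s2, p * dom.prob[(s, pi[s], s2)]))
    return total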

Unsafe Solutions

● A solution π is unsafe if
Ø 0 < Pr(Sg | π) < 1
● Example:
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
● H(π1) contains two histories:
Ø σ1 = 〈s1, s2, s3, s4〉, with Pr(σ1 | π1) = 1 × 0.8 × 1 = 0.8
Ø σ2 = 〈s1, s2, s5〉, with Pr(σ2 | π1) = 1 × 0.2 = 0.2
● Pr(Sg | π1) = 0.8
[Figure: same example domain; s5 is an explicit dead end]

Example

[Figure: same example domain; Start = s1, Goal = s4]

● Example:
π = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, move(r1,l4,l1)), (s5, move(r1,l5,l1))}
● What is Pr(Sg | π)?

Expected Cost

[Figure: same example domain; Start = s1, Goal = s4]

● cost(s,a) = cost of using a in s
● Example:
Ø cost(s,a) = 1 for each
“horizontal” action
Ø cost(s,a) = 100 for each
“vertical” action
● Cost of a history:
Ø Let σ = 〈s0, s1, … 〉 ∈ H(π)
Ø cost(σ | π) = ∑i ≥ 0 cost(si,π(si))
● Let π be a safe solution
● Expected cost of following π to a goal:
Ø Vπ(s) = 0 if s is a goal
Ø Vπ(s) = cost(s,π(s)) + ∑s′∈γ(s,π(s)) Pr(s′ | s,π(s)) Vπ(s′) otherwise
● If s = s0 then
Ø Vπ(s0) = ∑σ ∈ H(π) cost(σ | π) Pr(σ | s0, π)
[Diagram: action π(s) leads from s to possible outcomes s1, …, sn]
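
A sketch of computing Vπ from the recursive equation above by fixed-point iteration (one could equally solve it as a linear system); it assumes π is a safe solution defined at every non-goal state, so the iteration converges, and reuses the earlier SSPDomain sketch.

def evaluate_policy(dom, Sg, pi, eps=1e-9, max_sweeps=100000):
    V = {s: 0.0 for s in dom.states}     # Vπ(s) = 0 at goals stays fixed
    for _ in range(max_sweeps):
        delta = 0.0
        for s, a in pi.items():
            if s in Sg:
                continue
            v = dom.cost[(s, a)] + sum(dom.prob[(s, a, s2)] * V[s2]
                                       for s2 in dom.gamma[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            break
    return V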

Planning as Optimization

● Let π and π′ be safe solutions
● π dominates π′ if Vπ(s) ≤ Vπ′(s) for every state where both π and π′ are defined
Ø i.e., Vπ(s) ≤ Vπ′(s) for every s in Sπ ∩ Sπ′
● π is optimal if π dominates every safe solution π′
● V*(s) = min{Vπ(s) | π is a safe solution for which π(s) is defined}
   = expected cost of getting from s to a goal using an optimal safe solution
● Optimality principle (also called Bellman's theorem):
Ø V*(s) = 0, if s is a goal
Ø V*(s) = mina∈Applicable(s){cost(s,a) + ∑s′∈γ(s,a) Pr(s′ | s,a) V*(s′)}, otherwise

Policy Iteration

● Let (Σ,s0,Sg) be a safe SSP (i.e., Sg is reachable from every state)
● Let π be a safe solution that is defined at every state in S
● Let s be a state, and let a ∈ Applicable(s)
Ø Cost-to-go: expected cost at s if we start with a, and use π afterward
Ø Qπ(s,a) = cost(s,a) + ∑s′∈γ(s,a) Pr(s′ | s,a) Vπ(s′)
● For every s, let π′(s) ∈ argmina∈Applicable(s) Qπ(s,a)
Ø Then π′ is a safe solution and dominates π
● PI(Σ,s0,Sg,π0)
π ← π0
loop
compute Vπ (n equations and n unknowns, where n = |S|)
for every non-goal state s do
π′(s) ← any action in argmina∈Applicable(s) Qπ(s,a)
if π′ = π then return π
π ← π′
● Converges in a finite number of iterations
Tie-breaking rule: if
π(s) ∈ argmina∈Applicable(s) Qπ(s,a),
then use π′(s) = π(s)
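
A Python sketch of PI, using the iterative evaluate_policy from earlier instead of solving the n-by-n linear system, and applying the tie-breaking rule; the function names are mine.

def q_value(dom, V, s, a):
    # Qπ(s,a) = cost(s,a) + Σ_{s′ ∈ γ(s,a)} Pr(s′ | s,a) Vπ(s′)
    return dom.cost[(s, a)] + sum(dom.prob[(s, a, s2)] * V[s2]
                                  for s2 in dom.gamma[(s, a)])

def policy_iteration(dom, Sg, pi0):
    pi = dict(pi0)                       # π ← π0
    while True:
        V = evaluate_policy(dom, Sg, pi)
        new_pi = {}
        for s in dom.states - Sg:
            qs = {a: q_value(dom, V, s, a) for a in dom.applicable(s)}
            best = min(qs.values())
            if s in pi and abs(qs[pi[s]] - best) < 1e-12:
                new_pi[s] = pi[s]        # tie-breaking: keep π(s)
            else:
                new_pi[s] = min(qs, key=qs.get)
        if new_pi == pi:                 # if π′ = π then return π
            return pi
        pi = new_pi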

Value Iteration (Synchronous Version)

● Let (Σ,s0,Sg) be a safe SSP
● Start with an arbitrary cost V(s) for each s and a small η > 0
VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    Vold ← V
    for every non-goal state s do
      for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) Vold(s′)
      V(s) ← mina∈Applicable(s) Q(s,a)
    if maxs∈S∖Sg |V(s) – Vold(s)| < η then exit the loop
  for every non-goal state s do
    π(s) ← argmina∈Applicable(s) Q(s,a)
  return π

● |V(s) – Vold(s)| is the residual of s
● maxs∈S∖Sg |V(s) – Vold(s)| is the residual
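
A sketch of the synchronous algorithm in Python, matching the pseudocode above (each sweep computes new values from the previous sweep's Vold); it reuses q_value from the policy-iteration sketch.

def value_iteration(dom, Sg, V0=None, eta=0.2):
    V = dict(V0) if V0 else {s: 0.0 for s in dom.states}
    while True:
        Vold = dict(V)
        for s in dom.states - Sg:
            V[s] = min(dom.cost[(s, a)]
                       + sum(dom.prob[(s, a, s2)] * Vold[s2]
                             for s2 in dom.gamma[(s, a)])
                       for a in dom.applicable(s))
        if max(abs(V[s] - Vold[s]) for s in dom.states - Sg) < eta:
            break                        # residual < η
    pi = {s: min(dom.applicable(s), key=lambda a: q_value(dom, V, s, a))
          for s in dom.states - Sg}
    return pi, V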

Example

[Figure: same example domain; Start = s1, Goal = s4]

● aij = the action that moves from si to sj
Ø e.g., a12 = move(r1,l1,l2)
● η = 0.2
● V(s) = 0 for all s
Q(s1, a12) = 100 + 0 = 100
Q(s1, a14) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s2, a21) = 100 + 0 = 100
Q(s2, a23) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s3, a32) = 1 + 0 = 1
Q(s3, a34) = 100 + 0 = 100
min = 1
Q(s5, a52) = 1 + 0 = 1
Q(s5, a54) = 100 + 0 = 100
min = 1
residual = max(1–0, 1–0, 1–0, 1–0) = 1

Example (continued)

[Figure: same example domain; Start = s1, Goal = s4]

● V(s1) = 1¾; V(s2) = 3; V(s3) = 3; V(s4) = 0; V(s5) = 3
Q(s1, a12) = 100 + 3 = 103
Q(s1, a14) = 1 + (½×1¾ + ½×0) = 1⅞
min = 1⅞
Q(s2, a21) = 100 + 1¾ = 101¾
Q(s2, a23) = 1 + (½×3 + ½×3) = 4
min = 4
Q(s3, a32) = 1 + 3 = 4
Q(s3, a34) = 100 + 0 = 100
min = 4
Q(s5, a52) = 1 + 3 = 4
Q(s5, a54) = 100 + 0 = 100
min = 4
residual = max(1⅞ – 1¾, 4 – 3, 4 – 3, 4 – 3) = 1
● How long before residual < η = 0.2?
● How long if the "vertical" actions cost 10 instead of 100?
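
To answer these questions empirically, here is a hypothetical transcription of the example domain, inferred from the Q-value computations above (½/½ outcomes for a14 and a23, "horizontal" cost 1, "vertical" cost 100); it is not given in this form on the slides.

S = {"s1", "s2", "s3", "s4", "s5"}
gamma = {("s1", "a12"): {"s2"}, ("s1", "a14"): {"s1", "s4"},
         ("s2", "a21"): {"s1"}, ("s2", "a23"): {"s3", "s5"},
         ("s3", "a32"): {"s2"}, ("s3", "a34"): {"s4"},
         ("s5", "a52"): {"s2"}, ("s5", "a54"): {"s4"}}
prob = {("s1", "a14", "s1"): 0.5, ("s1", "a14", "s4"): 0.5,
        ("s2", "a23", "s3"): 0.5, ("s2", "a23", "s5"): 0.5}
prob.update({(s, a, next(iter(dests))): 1.0
             for (s, a), dests in gamma.items() if len(dests) == 1})
cost = {(s, a): 100 if a in {"a12", "a21", "a34", "a54"} else 1
        for (s, a) in gamma}
dom = SSPDomain(S, {a for (_, a) in gamma}, gamma, prob, cost)
pi, V = value_iteration(dom, Sg={"s4"}, eta=0.2)

Instrumenting the loop to count sweeps suggests why convergence is slow: V(s2), V(s3), and V(s5) climb by 1 per sweep until the 100-cost "vertical" route caps them, so the residual stays at 1 for on the order of a hundred sweeps; with vertical cost 10 the climb stops near 10, so convergence is roughly ten times faster.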
Discussion
● Policy iteration computes an entire policy in each iteration,
and computes values based on that policy
Ø More work per iteration, because it needs to solve a set of simultaneous
equations
Ø Usually converges in a smaller number of iterations
● Value iteration computes new values in each iteration,
and chooses a policy based on those values
Ø In general, the values are not the values that one would get from the chosen
policy or any other policy
Ø Less work per iteration, because it doesn’t need to solve a set of equations
Ø Usually takes more iterations to converge
● What I showed you was the synchronous version of Value Iteration
• For each s, compute new values of Q and V using Vold
Ø Asynchronous version: compute new values of Q and V using V
• New values may depend on which nodes have already been updated

Value Iteration

● Synchronous version:

VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    Vold ← V
    for every s ∈ S ∖ Sg do
      for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) Vold(s′)
      V(s) ← mina∈Applicable(s) Q(s,a)
      π(s) ← argmina∈Applicable(s) Q(s,a)
    if maxs∈S∖Sg |V(s) – Vold(s)| < η then return π

● |V(s) – Vold(s)| is the residual of s
● maxs∈S∖Sg |V(s) – Vold(s)| is the residual
● Asynchronous version:

VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    r ← 0 // the residual
    for every s ∈ S ∖ Sg do
      r ← max(r, Bellman-Update(s,V,π))
    if r < η then return π

Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|

● In both versions, start with an arbitrary cost V(s) for each s and a small η > 0
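
A sketch of the asynchronous version in Python: bellman_update reads the current V rather than a frozen Vold, so updates within a sweep see earlier ones. It reuses q_value from the policy-iteration sketch.

def bellman_update(dom, V, pi, s):
    v_old = V[s]
    qs = {a: q_value(dom, V, s, a) for a in dom.applicable(s)}
    pi[s] = min(qs, key=qs.get)          # π(s) ← argmin_a Q(s,a)
    V[s] = qs[pi[s]]                     # V(s) ← min_a Q(s,a)
    return abs(V[s] - v_old)             # residual of s

def async_value_iteration(dom, Sg, V, eta=0.2):
    # V: arbitrary initial cost per state, e.g. {s: 0.0 for s in dom.states}
    pi = {}
    while True:
        r = 0.0                          # the residual
        for s in dom.states - Sg:
            r = max(r, bellman_update(dom, V, pi, s))
        if r < eta:
            return pi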

Discussion (Continued)

● For both, the number of iterations is polynomial in the number of states
Ø But the number of states is usually quite large
Ø In each iteration, need to examine the entire state space
● Thus, these algorithms can take huge amounts of time and space
● Use search techniques to avoid searching the entire space

AO* (requires Σ to be acyclic)

AO∗(Σ,s0,Sg,h)
  π ← ∅; V(s0) ← h(s0)
  Envelope ← {s0} // all generated states
  loop
    if leaves(s0,π) ⊆ Sg then return π
    select s ∈ leaves(s0,π) ∖ Sg
    for all a ∈ Applicable(s) do
      for all s′ ∈ γ(s,a) ∖ Envelope do
        V(s′) ← h(s′); add s′ to Envelope
    AO-Update(s,V,π)

AO-Update(s,V,π)
  Z ← {s} // set of nodes that need updating
  while Z ≠ ∅ do
    select any s ∈ Z such that γ(s,π(s)) ∩ Z = ∅
    remove s from Z
    Bellman-Update(s,V,π)
    Z ← Z ∪ {s′ ∈ Sπ | s ∈ γ(s′,π(s′))}
● h is the heuristic function
Ø Must have h(s) = 0 for every s in Sg
Ø Example: h(s) = 0 for all s
Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|
AO*
(requires
Σ
to
be
acyclic)
[Figure: acyclic example domain with states s1–s6, Start and Goal marked; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5; Dom(π) indicated]
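
Below is a rough Python sketch of AO*'s expand-then-update cycle, reusing leaves and bellman_update from the earlier sketches; the leaf selection is arbitrary, and the heuristic h is passed as a function with h(s) = 0 on goals. Treat it as an illustration of the control flow above, not a faithful implementation.

def ao_star(dom, s0, Sg, h):
    pi, V = {}, {s0: h(s0)}
    envelope = {s0}                      # all generated states
    while True:
        frontier = leaves(dom, s0, pi) - Sg
        if not frontier:                 # leaves(s0,π) ⊆ Sg
            return pi
        s = next(iter(frontier))         # select a non-goal leaf
        for a in dom.applicable(s):      # expand s
            for s2 in dom.gamma[(s, a)] - envelope:
                V[s2] = h(s2)
                envelope.add(s2)
        Z = {s}                          # AO-Update: propagate to ancestors
        while Z:
            u = next(u for u in Z
                     if u not in pi or not (dom.gamma[(u, pi[u])] & Z))
            Z.remove(u)
            bellman_update(dom, V, pi, u)
            Z |= {p for p in pi if u in dom.gamma[(p, pi[p])]}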

LAO* (can handle cycles)

[Figure: same example domain; Dom(π) indicated]

Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|
LAO∗(Σ,s0,Sg,h)
  π ← ∅; V(s0) ← h(s0)
  Envelope ← {s0} // all generated states
  loop
    if leaves(s0,π) ⊆ Sg then return π
    select s ∈ leaves(s0,π) ∖ Sg
    for all a ∈ Applicable(s) do
      for all s′ ∈ γ(s,a) ∖ Envelope do
        V(s′) ← h(s′); add s′ to Envelope
    LAO-Update(s,V,π)
LAO-Update(s,V,π)
  Z ← {s} ∪ {s′ ∈ γ̂(s0,π) | s ∈ γ̂(s′,π)} // s plus all ancestors of s reachable from s0 using π
  for every s ∈ Z do Bellman-Update(s,V,π)
  leavesold ← leaves(s0,π)
  rmax ← η + 1
  loop until leaves(s0,π) ⊈ leavesold or rmax ≤ η
    rmax ← max{Bellman-Update(s,V,π) | s ∈ Sπ}

Planning and Acting

● Run-Lookahead(Σ,s0,Sg)
Ø s ← s0
Ø while s ∉ Sg and Applicable(s) ≠ ∅ do
• a ← Lookahead(s,θ)
• perform action a
• s ← observe resulting state
● One possibility: use FF-Replan from Chapter 5
● Problem: FF-Replan doesn't know about probabilities of outcomes
Ø May choose actions that are likely to produce bad outcomes
Ø e.g., a14 in the example in the figure below
FF-Replan(Σ, s, Sg)
  while s ∉ Sg and Applicable(s) ≠ ∅ do
    if πd undefined for s then do
      πd ← Forward-search(Σd, s, Sg)
    apply action πd(s)
    s ← observe resulting state

(Figure 5.22 in the book: online determinization planning and acting algorithm.)
[Figure: example domain with states s1–s6, Start and Goal marked; action costs c = 1, 100, 10, 1, 1000; outcome probabilities 0.2/0.8 and 0.9/0.1]

Improving on FF-Replan

● RFF algorithm:
Ø Don’t just generate one outcome
Ø Generate all "likely" outcomes and plan for them too
• i.e., outcomes s with Pr(s | s0, π) ≥ θ
Ø Det-Plan should be something like FF-Replan (Figure 5.22), shown on the previous slide
Ø If θ ≤ 0.9 then RFF will notice the problem

[Figure: same example domain as the previous slide; the states RFF plans for are a subset of Dom(π)]

Multi-Arm Bandit

● Statistical model of sequential experiments
Ø Name comes from a traditional slot machine
(one-armed bandit)
● Multiple actions
Ø Each action provides a reward from a
probability distribution associated with
that specific action
Ø Objective: maximize the expected utility of a sequence of actions
● Exploitation vs exploration dilemma:
Ø Exploitation: choosing an action that you already know about, because
you think it’s likely to give you a high reward
Ø Exploration: choosing an action that you don’t know much about, in
hopes that maybe it will produce a better reward than the actions you
already know about

UCB (Upper Confidence Bound) Algorithm

● Let
Ø xi = average reward you've gotten from arm i
Ø ti = number of times you've tried arm i
Ø t = ∑i ti
● loop
Ø if there are one or more arms that have not been played, then play one of them
Ø else play the arm i that has the highest value of xi + 2√((log t)/ti)
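
A sketch of the UCB rule in Python; pull(i) is an assumed environment callback that plays arm i and returns a sampled reward, and num_plays is the experiment budget.

import math

def ucb(pull, k, num_plays):
    totals = [0.0] * k                   # sum of rewards per arm
    counts = [0] * k                     # t_i: times arm i has been tried
    for t in range(1, num_plays + 1):
        if 0 in counts:                  # play each unplayed arm first
            i = counts.index(0)
        else:                            # maximize x_i + 2·sqrt(log t / t_i)
            i = max(range(k),
                    key=lambda j: totals[j] / counts[j]
                                  + 2 * math.sqrt(math.log(t) / counts[j]))
        totals[i] += pull(i)
        counts[i] += 1
    return max(range(k), key=lambda j: totals[j] / max(counts[j], 1))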

UCT Algorithm

● UCT (with a few corrections)
● Recursive UCB computation
to compute Q(s,a)
● Anytime algorithm
Ø Call repeatedly until
time runs out
● At end, choose action
argmina Q(s,a)
[Figure: same acyclic example domain; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]
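
Since the slide's pseudocode is not reproduced in this transcript, here is a rough reconstruction of a UCT rollout for SSPs matching the bullets above: a recursive rollout that selects actions by a UCB rule adapted for minimization and averages sampled costs into Q(s,a). sample_outcome and the depth limit are my additions.

import math, random

def sample_outcome(dom, s, a):
    states = list(dom.gamma[(s, a)])
    weights = [dom.prob[(s, a, s2)] for s2 in states]
    return random.choices(states, weights)[0]

def uct_rollout(dom, Sg, s, Q, N, depth):
    if s in Sg or depth == 0 or not dom.applicable(s):
        return 0.0
    acts = list(dom.applicable(s))
    untried = [a for a in acts if N.get((s, a), 0) == 0]
    if untried:
        a = random.choice(untried)
    else:
        n = sum(N[(s, b)] for b in acts)
        a = min(acts, key=lambda b: Q[(s, b)]                 # UCB, adapted
                - 2 * math.sqrt(math.log(n) / N[(s, b)]))     # for minimization
    c = dom.cost[(s, a)] + uct_rollout(dom, Sg, sample_outcome(dom, s, a),
                                       Q, N, depth - 1)
    N[(s, a)] = N.get((s, a), 0) + 1
    Q[(s, a)] = Q.get((s, a), 0.0) + (c - Q.get((s, a), 0.0)) / N[(s, a)]
    return c

Being anytime, it is called repeatedly (uct_rollout(dom, Sg, s0, Q, N, depth)) until time runs out, and the actor then executes the action minimizing Q[(s0, a)].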

Kluge for Use in Unsafe Domains

● Modification for domains in which some states are unsafe
Ø Avoid unsafe plans by refusing to choose actions that lead to dead ends:
if Applicable(s) = ∅ then return ∞
● Problem: it's too cautious
Ø Will return ∞ if there are no safe plans
[Figure: same example domain; action costs c = 1, 100, 10, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]

UCT as an Acting Procedure

● Suppose that
Ø You don't know Pr
Ø You can restart your actor as many times as you want
● Can modify UCT to be an acting procedure
Ø Use it to explore the environment
Ø Key change: where UCT would sample an outcome of a, instead execute a and observe s′

[Figure: same example domain; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]