This document presents lecture slides on planning in probabilistic domains, where each action has multiple possible outcomes and each outcome has a probability. A solution policy is safe if it reaches the goal with probability 1 and unsafe if it reaches the goal with probability strictly between 0 and 1; safe solutions may be acyclic or cyclic, while an unsafe solution gets stuck short of the goal with some nonzero probability. The slides cover expected cost, policy iteration, value iteration, AO* and LAO*, determinization-based replanning (FF-Replan, RFF), and bandit-based methods (UCB, UCT).
Automated Planning with Probabilistic Models
Dana Nau and Vikas Shivashankar: Lecture slides for Automated Planning and Acting. Updated 5/10/15.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Chapter 6: Deliberation with Probabilistic Domain Models
Dana S. Nau and Vikas Shivashankar, University of Maryland

Probabilistic Planning Domain

● Actions have multiple possible outcomes
Ø Each outcome has a probability
● Several possible action representations
Ø Bayes nets, probabilistic operators, …
● Book doesn’t commit to any representation
Ø Only deals with the underlying semantics
● Σ = (S,A,γ,Pr,cost)
Ø S = set of states
Ø A = set of actions
Ø γ : S × A → 2^S
Ø Pr(s′ | s, a) = probability of going to state s′ if we perform a in s
• Require Pr(s′ | s, a) > 0 for every s′ in γ(s,a)
Ø cost : S × A → ℝ⁺
• cost(s,a) = cost of action a in state s
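
Since the book commits only to this semantics, the following Python sketch is just one possible concrete rendering of Σ = (S,A,γ,Pr,cost); the class and field names (SSPDomain, gamma, prob, applicable) are illustrative assumptions, not from the book.

from dataclasses import dataclass

@dataclass
class SSPDomain:
    states: set    # S
    actions: set   # A
    gamma: dict    # (s, a) -> set of possible next states, i.e. γ(s, a)
    prob: dict     # (s, a, s') -> Pr(s' | s, a), required > 0 on γ(s, a)
    cost: dict     # (s, a) -> positive cost of performing a in s

    def applicable(self, s):
        # Actions with at least one possible outcome in state s
        return {a for a in self.actions if self.gamma.get((s, a))}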

Notation from Chapter 5

● Policy π : Sπ → A
Ø Sπ ⊆ S
Ø ∀s ∈ Sπ , π(s) ∈ Applicable(s)
● γ̂(s,π) = {s and all descendants of s reachable by π}
Ø the transitive closure of γ under π
● Graph(s,π) = rooted graph induced by π
Ø {nodes} = γ̂(s,π)
Ø {edges} = {(s,s′) | s ∈ Sπ, s′ ∈ γ(s,π(s))}
Ø root = s
● leaves(s,π) = {states in γ̂(s,π) that aren't in Sπ}
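
As a minimal sketch (reusing the SSPDomain class above, with a policy π represented as a dict from Sπ to actions), γ̂(s,π) and leaves(s,π) can be computed by breadth-first search over the policy graph; the helper names are mine, not the book's.

from collections import deque

def reachable(dom, s, pi):
    # γ̂(s, π): s and all states reachable from s by following π
    seen, frontier = {s}, deque([s])
    while frontier:
        u = frontier.popleft()
        if u in pi:                     # π specifies an action at u
            for v in dom.gamma[(u, pi[u])]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen

def leaves(dom, s, pi):
    # States in γ̂(s, π) at which π specifies no action
    return {u for u in reachable(dom, s, pi) if u not in pi}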

Stochastic Systems

● Stochastic shortest path (SSP) problem: a triple (Σ, s0, Sg)
● Solution for (Σ, s0, Sg):
Ø policy π such that s0 ∈ Sπ and leaves(s0,π) ⋂ Sg ≠ ∅
● π is closed if π doesn’t stop at non-goal states unless no action is applicable
Ø for every state s in γ̂(s0,π), at least one of the following holds:
• s ∈ Sπ (i.e., π specifies an action at s)
• s ∈ Sg (i.e., s is a goal state)
• Applicable(s) = ∅ (i.e., there are no applicable actions at s)
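
These definitions transcribe directly into checks; a sketch reusing reachable and leaves from above (illustrative, not the book's code):

def is_solution(dom, s0, Sg, pi):
    # s0 ∈ Sπ and leaves(s0, π) ∩ Sg ≠ ∅
    return s0 in pi and bool(leaves(dom, s0, pi) & Sg)

def is_closed(dom, s0, Sg, pi):
    # Every reachable state has an action, is a goal, or is a dead end
    return all(u in pi or u in Sg or not dom.applicable(u)
               for u in reachable(dom, s0, pi))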

Policies

[Figure: example domain with states s1–s6 and move actions (including move(r1,l2,l1)); Start = s1, Goal = s4]

● Robot r1 starts at location l1
Ø s0 = s1 in the diagram
● Objective is to get r1 to location l4
Ø Sg = {s4}
● π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
Ø Solution?
Ø Closed?

Histories

[Figure: same example domain; Start = s1, Goal = s4]

● History: a sequence of states, starting at s0
σ = 〈s0, s1, s2, s3, …, sh〉
or (not in book): σ = 〈s0, s1, s2, s3, …〉
● Let H(π) = {all histories that can be produced by following π from s0 to a state in leaves(s0,π)}
● If σ ∈ H(π) then Pr(σ | π) = ∏i≥0 Pr(si+1 | si, π(si))
Ø Thus ∑σ∈H(π) Pr(σ | π) = 1
● Probability that π will stop at a goal state:
Ø Pr(Sg | π) = ∑ {Pr(σ | π) | σ ∈ H(π) and σ ends at a state in Sg}
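
For an acyclic policy, H(π) is finite and Pr(Sg | π) can be computed by enumerating histories; here is a sketch under that assumption (a cyclic policy has infinitely many histories, so this enumeration would not terminate on one).

def goal_probability(dom, s0, Sg, pi):
    total = 0.0
    stack = [(s0, 1.0)]                  # (current state, Pr of the prefix)
    while stack:
        s, p = stack.pop()
        if s not in pi:                  # history ends at a leaf
            if s in Sg:
                total += p
            continue
        for s2 in dom.gamma[(s, pi[s])]:
            stack.append((s2, p * dom.prob[(s, pi[s], s2)]))
    return total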

Unsafe Solutions

● A solution π is unsafe if
Ø 0 < Pr(Sg | π) < 1
● Example:
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
● H(π1) contains two histories:
Ø σ1 = 〈s1, s2, s3, s4〉, with Pr(σ1 | π1) = 1 × 0.8 × 1 = 0.8
Ø σ2 = 〈s1, s2, s5〉, with Pr(σ2 | π1) = 1 × 0.2 = 0.2
● Pr(Sg | π1) = 0.8
[Figure: same example domain; s5 is an explicit dead end]

Example

[Figure: same example domain; Start = s1, Goal = s4]

● Example:
π = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, move(r1,l4,l1)), (s5, move(r1,l5,l1))}
● What is Pr(Sg | π)?

Expected Cost

[Figure: same example domain; Start = s1, Goal = s4]

● cost(s,a) = cost of using a in s
● Example:
Ø cost(s,a) = 1 for each
“horizontal” action
Ø cost(s,a) = 100 for each
“vertical” action
● Cost of a history:
Ø Let σ = 〈s0, s1, … 〉 ∈ H(π)
Ø cost(σ | π) = ∑i ≥ 0 cost(si,π(si))
● Let π be a safe solution
● Expected cost of following π to a goal:
Ø Vπ(s) = 0 if s is a goal
Ø Vπ(s) = cost(s,π(s)) + ∑s′∈γ(s,π(s)) Pr(s′ | s,π(s)) Vπ(s′) otherwise
● If s = s0 then
Ø Vπ(s0) = ∑σ ∈ H(π) cost(σ | π) Pr(σ | s0, π)
[Diagram: action π(s) leads from s to possible outcomes s1, …, sn]
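
A sketch of computing Vπ from the recursive equation above by fixed-point iteration (one could equally solve it as a linear system); it assumes π is a safe solution defined at every non-goal state, so the iteration converges, and reuses the earlier SSPDomain sketch.

def evaluate_policy(dom, Sg, pi, eps=1e-9, max_sweeps=100000):
    V = {s: 0.0 for s in dom.states}     # Vπ(s) = 0 at goals stays fixed
    for _ in range(max_sweeps):
        delta = 0.0
        for s, a in pi.items():
            if s in Sg:
                continue
            v = dom.cost[(s, a)] + sum(dom.prob[(s, a, s2)] * V[s2]
                                       for s2 in dom.gamma[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            break
    return V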

Planning as Optimization

● Let π and π′ be safe solutions
● π dominates π′ if Vπ(s) ≤ Vπ′(s) for every state where both π and π′ are defined
Ø i.e., Vπ(s) ≤ Vπ′(s) for every s in Sπ ∩ Sπ′
● π is optimal if π dominates every safe solution π′
● V*(s) = min{Vπ(s) | π is a safe solution for which π(s) is defined}
   = expected cost of getting from s to a goal using an optimal safe solution
● Optimality principle (also called Bellman's theorem):
Ø V*(s) = 0, if s is a goal
Ø V*(s) = mina∈Applicable(s){cost(s,a) + ∑s′∈γ(s,a) Pr(s′ | s,a) V*(s′)}, otherwise

Policy Iteration

● Let (Σ,s0,Sg) be a safe SSP (i.e., Sg is reachable from every state)
● Let π be a safe solution that is defined at every state in S
● Let s be a state, and let a ∈ Applicable(s)
Ø Cost-to-go: expected cost at s if we start with a, and use π afterward
Ø Qπ(s,a) = cost(s,a) + ∑s′∈γ(s,a) Pr(s′ | s,a) Vπ(s′)
● For every s, let π′(s) ∈ argmina∈Applicable(s) Qπ(s,a)
Ø Then π′ is a safe solution and dominates π
● PI(Σ,s0,Sg,π0)
π ← π0
loop
compute Vπ (n equations and n unknowns, where n = |S|)
for every non-goal state s do
π′(s) ← any action in argmina∈Applicable(s) Qπ(s,a)
if π′ = π then return π
π ← π′
● Converges in a finite number of iterations
Tie-breaking rule: if
π(s) ∈ argmina∈Applicable(s) Qπ(s,a),
then use π′(s) = π(s)
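
A Python sketch of PI, using the iterative evaluate_policy from earlier instead of solving the n-by-n linear system, and applying the tie-breaking rule; the function names are mine.

def q_value(dom, V, s, a):
    # Qπ(s,a) = cost(s,a) + Σ_{s′ ∈ γ(s,a)} Pr(s′ | s,a) Vπ(s′)
    return dom.cost[(s, a)] + sum(dom.prob[(s, a, s2)] * V[s2]
                                  for s2 in dom.gamma[(s, a)])

def policy_iteration(dom, Sg, pi0):
    pi = dict(pi0)                       # π ← π0
    while True:
        V = evaluate_policy(dom, Sg, pi)
        new_pi = {}
        for s in dom.states - Sg:
            qs = {a: q_value(dom, V, s, a) for a in dom.applicable(s)}
            best = min(qs.values())
            if s in pi and abs(qs[pi[s]] - best) < 1e-12:
                new_pi[s] = pi[s]        # tie-breaking: keep π(s)
            else:
                new_pi[s] = min(qs, key=qs.get)
        if new_pi == pi:                 # if π′ = π then return π
            return pi
        pi = new_pi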

Value Iteration (Synchronous Version)

● Let (Σ,s0,Sg) be a safe SSP
● Start with an arbitrary cost V(s) for each s and a small η > 0
VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    Vold ← V
    for every non-goal state s do
      for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) Vold(s′)
      V(s) ← mina∈Applicable(s) Q(s,a)
    if maxs∈S∖Sg |V(s) – Vold(s)| < η then exit the loop
  for every non-goal state s do
    π(s) ← argmina∈Applicable(s) Q(s,a)
  return π

● |V(s) – Vold(s)| is the residual of s
● maxs∈S∖Sg |V(s) – Vold(s)| is the residual
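
A sketch of the synchronous algorithm in Python, matching the pseudocode above (each sweep computes new values from the previous sweep's Vold); it reuses q_value from the policy-iteration sketch.

def value_iteration(dom, Sg, V0=None, eta=0.2):
    V = dict(V0) if V0 else {s: 0.0 for s in dom.states}
    while True:
        Vold = dict(V)
        for s in dom.states - Sg:
            V[s] = min(dom.cost[(s, a)]
                       + sum(dom.prob[(s, a, s2)] * Vold[s2]
                             for s2 in dom.gamma[(s, a)])
                       for a in dom.applicable(s))
        if max(abs(V[s] - Vold[s]) for s in dom.states - Sg) < eta:
            break                        # residual < η
    pi = {s: min(dom.applicable(s), key=lambda a: q_value(dom, V, s, a))
          for s in dom.states - Sg}
    return pi, V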

Example

[Figure: same example domain; Start = s1, Goal = s4]

● aij = the action that moves from si to sj
Ø e.g., a12 = move(r1,l1,l2)
● η = 0.2
● V(s) = 0 for all s
Q(s1, a12) = 100 + 0 = 100
Q(s1, a14) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s2, a21) = 100 + 0 = 100
Q(s2, a23) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s3, a32) = 1 + 0 = 1
Q(s3, a34) = 100 + 0 = 100
min = 1
Q(s5, a52) = 1 + 0 = 1
Q(s5, a54) = 100 + 0 = 100
min = 1
residual = max(1–0, 1–0, 1–0, 1–0) = 1

Example (continued)

[Figure: same example domain; Start = s1, Goal = s4]

● V(s1) = 1¾; V(s2) = 3; V(s3) = 3; V(s4) = 0; V(s5) = 3
Q(s1, a12) = 100 + 3 = 103
Q(s1, a14) = 1 + (½×1¾ + ½×0) = 1⅞
min = 1⅞
Q(s2, a21) = 100 + 1¾ = 101¾
Q(s2, a23) = 1 + (½×3 + ½×3) = 4
min = 4
Q(s3, a32) = 1 + 3 = 4
Q(s3, a34) = 100 + 0 = 100
min = 4
Q(s5, a52) = 1 + 3 = 4
Q(s5, a54) = 100 + 0 = 100
min = 4
residual = max(1⅞ – 1¾, 4 – 3, 4 – 3, 4 – 3) = 1
● How long before residual < η = 0.2?
● How long if the "vertical" actions cost 10 instead of 100?
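
To answer these questions empirically, here is a hypothetical transcription of the example domain, inferred from the Q-value computations above (½/½ outcomes for a14 and a23, "horizontal" cost 1, "vertical" cost 100); it is not given in this form on the slides.

S = {"s1", "s2", "s3", "s4", "s5"}
gamma = {("s1", "a12"): {"s2"}, ("s1", "a14"): {"s1", "s4"},
         ("s2", "a21"): {"s1"}, ("s2", "a23"): {"s3", "s5"},
         ("s3", "a32"): {"s2"}, ("s3", "a34"): {"s4"},
         ("s5", "a52"): {"s2"}, ("s5", "a54"): {"s4"}}
prob = {("s1", "a14", "s1"): 0.5, ("s1", "a14", "s4"): 0.5,
        ("s2", "a23", "s3"): 0.5, ("s2", "a23", "s5"): 0.5}
prob.update({(s, a, next(iter(dests))): 1.0
             for (s, a), dests in gamma.items() if len(dests) == 1})
cost = {(s, a): 100 if a in {"a12", "a21", "a34", "a54"} else 1
        for (s, a) in gamma}
dom = SSPDomain(S, {a for (_, a) in gamma}, gamma, prob, cost)
pi, V = value_iteration(dom, Sg={"s4"}, eta=0.2)

Instrumenting the loop to count sweeps suggests why convergence is slow: V(s2), V(s3), and V(s5) climb by 1 per sweep until the 100-cost "vertical" route caps them, so the residual stays at 1 for on the order of a hundred sweeps; with vertical cost 10 the climb stops near 10, so convergence is roughly ten times faster.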
Discussion
● Policy iteration computes an entire policy in each iteration,
and computes values based on that policy
Ø More work per iteration, because it needs to solve a set of simultaneous
equations
Ø Usually converges in a smaller number of iterations
● Value iteration computes new values in each iteration,
and chooses a policy based on those values
Ø In general, the values are not the values that one would get from the chosen
policy or any other policy
Ø Less work per iteration, because it doesn’t need to solve a set of equations
Ø Usually takes more iterations to converge
● What I showed you was the synchronous version of Value Iteration
• For each s, compute new values of Q and V using Vold
Ø Asynchronous version: compute new values of Q and V using V
• New values may depend on which nodes have already been updated

Value Iteration

● Synchronous version:

VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    Vold ← V
    for every s ∈ S ∖ Sg do
      for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) Vold(s′)
      V(s) ← mina∈Applicable(s) Q(s,a)
      π(s) ← argmina∈Applicable(s) Q(s,a)
    if maxs∈S∖Sg |V(s) – Vold(s)| < η then return π

● |V(s) – Vold(s)| is the residual of s
● maxs∈S∖Sg |V(s) – Vold(s)| is the residual
● Asynchronous version:

VI(Σ,s0,Sg,V)
  π ← ∅
  loop
    r ← 0 // the residual
    for every s ∈ S ∖ Sg do
      r ← max(r, Bellman-Update(s,V,π))
    if r < η then return π

Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|

● In both versions, start with an arbitrary cost V(s) for each s and a small η > 0
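
A sketch of the asynchronous version in Python: bellman_update reads the current V rather than a frozen Vold, so updates within a sweep see earlier ones. It reuses q_value from the policy-iteration sketch.

def bellman_update(dom, V, pi, s):
    v_old = V[s]
    qs = {a: q_value(dom, V, s, a) for a in dom.applicable(s)}
    pi[s] = min(qs, key=qs.get)          # π(s) ← argmin_a Q(s,a)
    V[s] = qs[pi[s]]                     # V(s) ← min_a Q(s,a)
    return abs(V[s] - v_old)             # residual of s

def async_value_iteration(dom, Sg, V, eta=0.2):
    # V: arbitrary initial cost per state, e.g. {s: 0.0 for s in dom.states}
    pi = {}
    while True:
        r = 0.0                          # the residual
        for s in dom.states - Sg:
            r = max(r, bellman_update(dom, V, pi, s))
        if r < eta:
            return pi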

Discussion (Continued)

● For both, the number of iterations is polynomial in the number of states
Ø But the number of states is usually quite large
Ø In each iteration, need to examine the entire state space
● Thus, these algorithms can take huge amounts of time and space
● Use search techniques to avoid searching the entire space

AO* (requires Σ to be acyclic)

AO∗(Σ,s0,Sg,h)
  π ← ∅; V(s0) ← h(s0)
  Envelope ← {s0} // all generated states
  loop
    if leaves(s0,π) ⊆ Sg then return π
    select s ∈ leaves(s0,π) ∖ Sg
    for all a ∈ Applicable(s) do
      for all s′ ∈ γ(s,a) ∖ Envelope do
        V(s′) ← h(s′); add s′ to Envelope
    AO-Update(s,V,π)

AO-Update(s,V,π)
  Z ← {s} // set of nodes that need updating
  while Z ≠ ∅ do
    select any s ∈ Z such that γ(s,π(s)) ∩ Z = ∅
    remove s from Z
    Bellman-Update(s,V,π)
    Z ← Z ∪ {s′ ∈ Sπ | s ∈ γ(s′,π(s′))}
● h is the heuristic function
Ø Must have h(s) = 0 for every s in Sg
Ø Example: h(s) = 0 for all s
Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|
AO*
(requires
Σ
to
be
acyclic)
[Figure: acyclic example domain with states s1–s6, Start and Goal marked; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5; Dom(π) indicated]
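
Below is a rough Python sketch of AO*'s expand-then-update cycle, reusing leaves and bellman_update from the earlier sketches; the leaf selection is arbitrary, and the heuristic h is passed as a function with h(s) = 0 on goals. Treat it as an illustration of the control flow above, not a faithful implementation.

def ao_star(dom, s0, Sg, h):
    pi, V = {}, {s0: h(s0)}
    envelope = {s0}                      # all generated states
    while True:
        frontier = leaves(dom, s0, pi) - Sg
        if not frontier:                 # leaves(s0,π) ⊆ Sg
            return pi
        s = next(iter(frontier))         # select a non-goal leaf
        for a in dom.applicable(s):      # expand s
            for s2 in dom.gamma[(s, a)] - envelope:
                V[s2] = h(s2)
                envelope.add(s2)
        Z = {s}                          # AO-Update: propagate to ancestors
        while Z:
            u = next(u for u in Z
                     if u not in pi or not (dom.gamma[(u, pi[u])] & Z))
            Z.remove(u)
            bellman_update(dom, V, pi, u)
            Z |= {p for p in pi if u in dom.gamma[(p, pi[p])]}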

LAO* (can handle cycles)

[Figure: same example domain; Dom(π) indicated]

Bellman-Update(s,V,π)
  vold ← V(s)
  for every a ∈ Applicable(s) do
    Q(s,a) ← cost(s,a) + ∑s′∈S Pr(s′ | s,a) V(s′)
  V(s) ← mina∈Applicable(s) Q(s,a)
  π(s) ← argmina∈Applicable(s) Q(s,a)
  return |V(s) – vold|
LAO∗(Σ,s0,Sg,h)
  π ← ∅; V(s0) ← h(s0)
  Envelope ← {s0} // all generated states
  loop
    if leaves(s0,π) ⊆ Sg then return π
    select s ∈ leaves(s0,π) ∖ Sg
    for all a ∈ Applicable(s) do
      for all s′ ∈ γ(s,a) ∖ Envelope do
        V(s′) ← h(s′); add s′ to Envelope
    LAO-Update(s,V,π)
LAO-Update(s,V,π)
  Z ← {s} ∪ {s′ ∈ γ̂(s0,π) | s ∈ γ̂(s′,π)} // s plus all ancestors of s reachable from s0 using π
  for every s ∈ Z do Bellman-Update(s,V,π)
  leavesold ← leaves(s0,π)
  rmax ← η + 1
  loop until leaves(s0,π) ⊈ leavesold or rmax ≤ η
    rmax ← max{Bellman-Update(s,V,π) | s ∈ Sπ}

Planning and Acting

● Run-Lookahead(Σ,s0,Sg)
Ø s ← s0
Ø while s ∉ Sg and Applicable(s) ≠ ∅ do
• a ← Lookahead(s,θ)
• perform action a
• s ← observe resulting state
● One possibility: use FF-Replan from Chapter 5
● Problem: FF-Replan doesn't know about probabilities of outcomes
Ø May choose actions that are likely to produce bad outcomes
Ø e.g., a14 in the example in the figure below
FF-Replan(Σ, s, Sg)
  while s ∉ Sg and Applicable(s) ≠ ∅ do
    if πd undefined for s then do
      πd ← Forward-search(Σd, s, Sg)
    apply action πd(s)
    s ← observe resulting state

(Figure 5.22 in the book: online determinization planning and acting algorithm.)
[Figure: example domain with states s1–s6, Start and Goal marked; action costs c = 1, 100, 10, 1, 1000; outcome probabilities 0.2/0.8 and 0.9/0.1]

Improving on FF-Replan

● RFF algorithm:
Ø Don’t just generate one outcome
Ø Generate all "likely" outcomes and plan for them too
• i.e., outcomes s with Pr(s | s0, π) ≥ θ
Ø Det-Plan should be something like FF-Replan (Figure 5.22), shown on the previous slide
Ø If θ ≤ 0.9 then RFF will notice the problem

[Figure: same example domain as the previous slide; the states RFF plans for are a subset of Dom(π)]

Multi-Arm Bandit

● Statistical model of sequential experiments
Ø Name comes from a traditional slot machine
(one-armed bandit)
● Multiple actions
Ø Each action provides a reward from a
probability distribution associated with
that specific action
Ø Objective: maximize the expected utility of a sequence of actions
● Exploitation vs exploration dilemma:
Ø Exploitation: choosing an action that you already know about, because
you think it’s likely to give you a high reward
Ø Exploration: choosing an action that you don’t know much about, in
hopes that maybe it will produce a better reward than the actions you
already know about

UCB (Upper Confidence Bound) Algorithm

● Let
Ø xi = average reward you've gotten from arm i
Ø ti = number of times you've tried arm i
Ø t = ∑i ti
● loop
Ø if there are one or more arms that have not been played, then play one of them
Ø else play the arm i that has the highest value of xi + 2√((log t)/ti)
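
A sketch of the UCB rule in Python; pull(i) is an assumed environment callback that plays arm i and returns a sampled reward, and num_plays is the experiment budget.

import math

def ucb(pull, k, num_plays):
    totals = [0.0] * k                   # sum of rewards per arm
    counts = [0] * k                     # t_i: times arm i has been tried
    for t in range(1, num_plays + 1):
        if 0 in counts:                  # play each unplayed arm first
            i = counts.index(0)
        else:                            # maximize x_i + 2·sqrt(log t / t_i)
            i = max(range(k),
                    key=lambda j: totals[j] / counts[j]
                                  + 2 * math.sqrt(math.log(t) / counts[j]))
        totals[i] += pull(i)
        counts[i] += 1
    return max(range(k), key=lambda j: totals[j] / max(counts[j], 1))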

UCT Algorithm

● UCT (with a few corrections)
● Recursive UCB computation
to compute Q(s,a)
● Anytime algorithm
Ø Call repeatedly until
time runs out
● At end, choose action
argmina Q(s,a)
[Figure: same acyclic example domain; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]
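
Since the slide's pseudocode is not reproduced in this transcript, here is a rough reconstruction of a UCT rollout for SSPs matching the bullets above: a recursive rollout that selects actions by a UCB rule adapted for minimization and averages sampled costs into Q(s,a). sample_outcome and the depth limit are my additions.

import math, random

def sample_outcome(dom, s, a):
    states = list(dom.gamma[(s, a)])
    weights = [dom.prob[(s, a, s2)] for s2 in states]
    return random.choices(states, weights)[0]

def uct_rollout(dom, Sg, s, Q, N, depth):
    if s in Sg or depth == 0 or not dom.applicable(s):
        return 0.0
    acts = list(dom.applicable(s))
    untried = [a for a in acts if N.get((s, a), 0) == 0]
    if untried:
        a = random.choice(untried)
    else:
        n = sum(N[(s, b)] for b in acts)
        a = min(acts, key=lambda b: Q[(s, b)]                 # UCB, adapted
                - 2 * math.sqrt(math.log(n) / N[(s, b)]))     # for minimization
    c = dom.cost[(s, a)] + uct_rollout(dom, Sg, sample_outcome(dom, s, a),
                                       Q, N, depth - 1)
    N[(s, a)] = N.get((s, a), 0) + 1
    Q[(s, a)] = Q.get((s, a), 0.0) + (c - Q.get((s, a), 0.0)) / N[(s, a)]
    return c

Being anytime, it is called repeatedly (uct_rollout(dom, Sg, s0, Q, N, depth)) until time runs out, and the actor then executes the action minimizing Q[(s0, a)].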

Kluge for Use in Unsafe Domains

● Modification for domains in which some states are unsafe
Ø Avoid unsafe plans by refusing to choose actions that lead to dead ends:
if Applicable(s) = ∅ then return ∞
● Problem: it's too cautious
Ø Will return ∞ if there are no safe plans
[Figure: same example domain; action costs c = 1, 100, 10, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]

UCT as an Acting Procedure

● Suppose that
Ø You don't know Pr
Ø You can restart your actor as many times as you want
● Can modify UCT to be an acting procedure
Ø Use it to explore the environment
Ø Key change: where UCT would sample an outcome of a, instead execute a and observe s′

[Figure: same example domain; action costs c = 1, 100, 10, 20, 1; outcome probabilities 0.2/0.8 and 0.5/0.5]