Dana Nau and Vikas Shivashankar: Lecture slides for Automated Planning and Acting (updated 4/24/15)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Chapter 4
Deliberation with Probabilistic Domain Models
Dana S. Nau and Vikas Shivashankar
University of Maryland
Probabilistic Planning Domain
●  Actions have multiple possible outcomes
Ø  Each outcome has a probability
●  Several possible action representations
Ø  Bayes nets, probabilistic operators, …
●  Book doesn’t commit to any representation
Ø  Only deals with the underlying semantics
●  Σ = (S,A,γ,Pr,cost)
Ø  S = set of states
Ø  A = set of actions
Ø  γ : S × A → 2S
Ø  Pr(s′ | s, a) = probability of going to state s′ if we perform a in s
•  Require Pr(s′ | s, a) > 0 for every s′ in γ(s,a)
Ø  cost: S × A → +
•  cost(s,a) = cost of action a in state s
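Not from the slides: a minimal Python sketch of one way the tuple Σ = (S, A, γ, Pr, cost) could be represented. All names here (SSPDomain, gamma, prob, cost, applicable) are illustrative assumptions, not notation from the book.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class SSPDomain:
    """Illustrative container for a probabilistic planning domain Σ = (S, A, γ, Pr, cost)."""
    states: FrozenSet[State]                              # S
    actions: FrozenSet[Action]                            # A
    gamma: Dict[Tuple[State, Action], FrozenSet[State]]   # γ(s,a) ⊆ S
    prob: Dict[Tuple[State, Action, State], float]        # Pr(s' | s, a), > 0 for every s' in γ(s,a)
    cost: Dict[Tuple[State, Action], float]               # cost(s, a)

    def applicable(self, s: State) -> List[Action]:
        """Actions a such that γ(s, a) is defined and nonempty."""
        return [a for a in self.actions if self.gamma.get((s, a))]
```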
Notation from Chapter 5
●  Policy π : Sπ → A
Ø  Sπ ⊆ S
Ø  ∀s ∈ Sπ , π(s) ∈ Applicable(s)
●  γ̂(s,π) = {s and all descendants of s reachable by π}
Ø  the transitive closure of γ with π
●  Graph(s,π) = rooted graph induced by π
Ø  {nodes} = γ̂(s,π)
Ø  {edges} = ∪a∈Applicable(s) {(s,s′) | s′ ∈ γ(s,a)}
Ø  root = s
●  leaves(s,π) = {states in γ̂(s,π) that aren’t in Sπ}
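As an illustration only (reusing the hypothetical SSPDomain sketch above, with a policy given as a dict from states to actions), γ̂(s,π) and leaves(s,π) amount to a simple reachability traversal:

```python
from collections import deque

def reachable_and_leaves(domain, s0, policy):
    """Return (γ̂(s0,π), leaves(s0,π)): states reachable from s0 under π, and those where π is undefined."""
    reachable = {s0}
    frontier = deque([s0])
    while frontier:
        s = frontier.popleft()
        if s not in policy:                        # π stops here; s is a leaf
            continue
        for s_next in domain.gamma[(s, policy[s])]:
            if s_next not in reachable:
                reachable.add(s_next)
                frontier.append(s_next)
    leaves = {s for s in reachable if s not in policy}
    return reachable, leaves
```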
Stochastic Systems
●  Stochastic shortest path (SSP) problem: a triple (Σ, s0, Sg)
●  Solution for (Σ, s0, Sg):
Ø  policy π such that s0 ∈ Sπ and leaves(s0,π) ⋂ Sg ≠ ∅
●  π is closed if π doesn’t stop at non-goal states unless no action is applicable
Ø  for every state in γ︎(s,π), either
•  s ∈ Sπ (i.e., π specifies an action at s)
•  s ∈ Sg (i.e., s is a goal state)
•  Applicable(s) = ∅ (i.e., there are no applicable actions at s)
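Continuing the same illustrative sketch (a hypothetical helper, not the book's code), the closedness condition above can be checked directly from the definitions:

```python
def is_closed(domain, s0, goals, policy):
    """True if π never stops at a reachable non-goal state that still has an applicable action."""
    reachable, _ = reachable_and_leaves(domain, s0, policy)
    for s in reachable:
        if s in policy or s in goals:
            continue
        if domain.applicable(s):                   # π stops here although it could act
            return False
    return True
```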
Policies
●  Robot r1 starts at location l1
Ø  s0 = s1 in the diagram
●  Objective is to get r1 to location l4
Ø  Sg = {s4}
●  π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
Ø  Solution?
Ø  Closed?
[Figure: example domain — robot r1 moves among locations l1–l5; Start = s1, Goal = s4]
Histories
●  History: a sequence of states, starting at s0
σ = 〈s0, s1, s2, s3, …, sh〉
or (not in book): σ = 〈s0, s1, s2, s3, …〉
●  Let H(π) = {all histories that can be produced by following π from s0 to a state in leaves(s0, π)}
●  If σ ∈ H(π) then Pr (σ | π) = ∏i ≥ 0 Pr (si+1 | si, π(si))
Ø  Thus ∑σ ∈ H(π) Pr (σ | π) = 1
●  Probability that π will stop at a goal state:
Ø  Pr (Sg | π) = ∑ {Pr (σ | π) | σ ∈ H(π) and σ ends at a state in Sg}
Unsafe Solutions
●  A solution π is unsafe if
Ø  0 < Pr (Sg | π) < 1
●  Example:
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
●  H(π1) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π1) = 1 × 0.8 × 1 = 0.8
Ø  σ2 = 〈s1, s2, s5〉   Pr (σ2 | π1) = 1 × 0.2 = 0.2
●  Pr (Sg | π1) = 0.8
[Figure: example domain with an explicit dead end; Start = s1, Goal = s4]
Unsafe Solutions
●  A solution π is unsafe if
Ø  0 < Pr (Sg | π) < 1
●  Example:
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l6)), (s6, move(r1,l6,l5))}
●  H(π2) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π2) = 1 × 0.8 × 1 = 0.8
Ø  σ3 = 〈s1, s2, s5, s6, s5, s6, …〉   Pr (σ3 | π2) = 1 × 0.2 × 1 × 1 × 1 × … = 0.2
●  Pr (Sg | π2) = 0.8
[Figure: example domain extended with state s6 = at(r1,l6) and a wait action; the figure labels this an implicit dead end; Start = s1, Goal = s4]
Safe Solutions
●  A solution π is safe if
Ø  Pr (Sg | π) = 1
●  An acyclic safe solution:
π3 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
●  H(π3) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π3) = 1 × 0.8 × 1 = 0.8
Ø  σ4 = 〈s1, s2, s5, s4〉   Pr (σ4 | π3) = 1 × 0.2 × 1 = 0.2
●  Pr (Sg | π3) = 0.8 + 0.2 = 1
Safe Solutions
●  A solution π is safe if
Ø  Pr (Sg | π) = 1
●  A cyclic safe solution:
π4 = {(s1, move(r1,l1,l4))}
●  H(π4) contains infinitely many histories:
Ø  σ5 = 〈s1, s4〉   Pr (σ5 | π4) = 0.5
Ø  σ6 = 〈s1, s1, s4〉   Pr (σ6 | π4) = 0.5 × 0.5 = 0.25
Ø  σ7 = 〈s1, s1, s1, s4〉   Pr (σ7 | π4) = 0.5 × 0.5 × 0.5 = 0.125
• • •
Ø  σ∞ = 〈s1, s1, s1, s1, …〉   Pr (σ∞ | π4) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0
●  Pr (Sg | π4) = 0.5 + 0.25 + 0.125 + … = 1
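For cyclic policies such as π4, Pr(Sg | π) can also be obtained without enumerating histories: it satisfies p(s) = 1 for goal states and p(s) = ∑s′ Pr(s′ | s, π(s)) p(s′) elsewhere, and iterating from 0 converges toward the goal-reaching probability. A rough sketch (an approximation by repeated sweeps, illustrative only, same hypothetical SSPDomain):

```python
def goal_reach_probability(domain, policy, goals, n_sweeps=1000):
    """Approximate Pr(Sg | π) from every state by iterating the reachability equations."""
    p = {s: (1.0 if s in goals else 0.0) for s in domain.states}
    for _ in range(n_sweeps):
        for s in domain.states:
            if s in goals or s not in policy:
                continue
            a = policy[s]
            p[s] = sum(domain.prob[(s, a, s2)] * p[s2] for s2 in domain.gamma[(s, a)])
    return p
```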
  
Example
●  Example:
π = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, move(r1,l4,l1)), (s5, move(r1,l5,l1))}
●  What is Pr (Sg | π)?
Expected Cost
●  cost(s,a) = cost of using a in s
●  Example:
Ø  cost(s,a) = 1 for each “horizontal” action
Ø  cost(s,a) = 100 for each “vertical” action
●  Cost of a history:
Ø  Let σ = 〈s0, s1, …〉 ∈ H(π)
Ø  cost(σ | π) = ∑i ≥ 0 cost(si, π(si))
●  Let π be a safe solution
●  Expected cost of following π to a goal:
Ø  Vπ(s) = 0 if s is a goal
Ø  Vπ(s) = cost(s,π(s)) + ∑s′∈γ(s,π(s)) Pr (s′ | s, π(s)) Vπ(s′) otherwise
●  If s = s0 then
Ø  Vπ(s0) = ∑σ ∈ H(π) cost(σ | π) Pr(σ | s0, π)
[Figure: applying π(s) in state s leads to successor states s1, …, sn]
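A corresponding sketch of evaluating Vπ numerically by sweeping the two equations above (illustrative; exact evaluation would solve the equivalent linear system instead):

```python
def policy_value(domain, policy, goals, n_sweeps=1000):
    """Approximate Vπ(s) for a safe solution π by repeated sweeps of the recursion."""
    V = {s: 0.0 for s in domain.states}
    for _ in range(n_sweeps):
        for s in domain.states:
            if s in goals or s not in policy:
                continue
            a = policy[s]
            V[s] = domain.cost[(s, a)] + sum(
                domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])
    return V
```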
Example
π3 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}

H(π3) contains two histories:
σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π3) = 0.8   cost(σ1 | π3) = 100 + 1 + 100 = 201
σ2 = 〈s1, s2, s5, s4〉   Pr (σ2 | π3) = 0.2   cost(σ2 | π3) = 100 + 1 + 100 = 201

Vπ3(s1) = 201 × 0.8 + 201 × 0.2 = 201
Safe Solutions
π4 = {(s1, move(r1,l1,l4))}
●  H(π4) contains infinitely many histories:
Ø  σ5 = 〈s1, s4〉   Pr (σ5 | π4) = 0.5   cost(σ5 | π4) = 1
Ø  σ6 = 〈s1, s1, s4〉   Pr (σ6 | π4) = 0.25   cost(σ6 | π4) = 2
Ø  σ7 = 〈s1, s1, s1, s4〉   Pr (σ7 | π4) = 0.125   cost(σ7 | π4) = 3
• • •
●  Vπ4(s1) = 1 × 0.5 + 2 × 0.25 + 3 × 0.125 + 4 × 0.0625 + … = 2
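The same value also follows in one line from the recursive definition of Vπ on the Expected Cost slide, without summing the series: with cost(s1, π4(s1)) = 1, outcome probabilities ½ and ½, and Vπ4(s4) = 0,

Vπ4(s1) = 1 + ½ Vπ4(s1) + ½ Vπ4(s4) = 1 + ½ Vπ4(s1), hence Vπ4(s1) = 2.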
Planning as Optimization
●  Let π and π′ be safe solutions
●  π dominates π′ if Vπ(s) ≤ Vπ′(s) for every state where both π and π′ are defined
Ø  i.e., Vπ(s) ≤ Vπ′(s) for every s in Sπ ∩ Sπ′
●  π is optimal if π dominates every safe solution π′
●  V*(s) = min{Vπ(s) | π is a safe solution for which π(s) is defined}
= expected cost of getting from s to a goal using an optimal safe solution
●  Optimality principle (also called Bellman’s theorem):
Ø  V*(s) = 0, if s is a goal
Ø  V*(s) = mina∈Applicable(s) {cost(s,a) + ∑s′ ∈ γ(s,a) Pr (s′ | s,a) V*(s′)}, otherwise
Policy Iteration
●  Let (Σ,s0,Sg) be a safe SSP (i.e., Sg is reachable from every state)
●  Let π be a safe solution that is defined at every state in S
●  Let s be a state, and let a ∈ Applicable(s)
Ø  Cost-to-go: expected cost at s if we start with a, and use π afterward
Ø  Qπ(s,a) = cost(s,a) + ∑s′ ∈ γ(s,a) Pr (s′ | s,a) Vπ(s′)
●  For every s, let π′(s) ∈ argmina∈Applicable(s) Qπ(s,a)
Ø  Then π′ is a safe solution and dominates π
●  PI(Σ, s0, Sg, π0)
    π ← π0
    loop
        compute Vπ (n equations and n unknowns, where n = |S|)
        for every non-goal state s do
            π′(s) ← any action in argmina∈Applicable(s) Qπ(s,a)
        if π′ = π then return π
        π ← π′
●  Converges in a finite number of iterations
Tie-breaking rule: if π(s) ∈ argmina∈Applicable(s) Qπ(s,a), then use π′(s) = π(s)
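A compact, runnable sketch of PI under these assumptions (illustrative names; it reuses the hypothetical policy_value sketch from the Expected Cost slide for the evaluation step, which approximates the linear system by iteration rather than solving it exactly):

```python
def q_value(domain, V, s, a):
    """Cost-to-go Qπ(s,a) = cost(s,a) + Σ Pr(s'|s,a) Vπ(s')."""
    return domain.cost[(s, a)] + sum(
        domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])

def policy_iteration(domain, goals, policy0):
    """Alternate policy evaluation and greedy improvement until π is stable."""
    policy = dict(policy0)
    while True:
        V = policy_value(domain, policy, goals)              # evaluation step
        new_policy = {}
        for s in domain.states:
            if s in goals:
                continue
            best = min(domain.applicable(s), key=lambda a: q_value(domain, V, s, a))
            # tie-breaking rule from the slide: keep π(s) if it is already a minimizer
            if q_value(domain, V, s, policy[s]) <= q_value(domain, V, s, best) + 1e-9:
                best = policy[s]
            new_policy[s] = best
        if new_policy == policy:
            return policy
        policy = new_policy
```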
Example
Start with
π0 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
Vπ(s4) = 0
Vπ(s3) = 100 + Vπ(s4) = 100
Vπ(s5) = 100 + Vπ(s4) = 100
Vπ(s2) = 1 + (0.8 Vπ(s3) + 0.2 Vπ(s5)) = 101
Vπ(s1) = 100 + Vπ(s2) = 201

Q(s1, move(r1,l1,l2)) = 100 + 101 = 201
Q(s1, move(r1,l1,l4)) = 1 + ½ × 201 + ½ × 0 = 101.5
argmin = move(r1,l1,l4)
Q(s2, move(r1,l2,l3)) = 1 + (0.8 × 100 + 0.2 × 100) = 101
Q(s2, move(r1,l2,l1)) = 100 + 201 = 301
argmin = move(r1,l2,l3)
Q(s3, move(r1,l3,l4)) = 100 + 0 = 100
Q(s3, move(r1,l3,l2)) = 100 + 101 = 201
argmin = move(r1,l3,l4)
Q(s5, move(r1,l5,l4)) = 100 + 0 = 100
Q(s5, move(r1,l5,l2)) = 100 + 101 = 201
argmin = move(r1,l5,l4)
Example
π = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
Vπ(s4) = 0
Vπ(s3) = 100 + Vπ(s4) = 100
Vπ(s5) = 100 + Vπ(s4) = 100
Vπ(s2) = 1 + (0.8 Vπ(s3) + 0.2 Vπ(s5)) = 101
Vπ(s1) = 1 + ½ Vπ(s1) + ½ Vπ(s4) = 2

Q(s1, move(r1,l1,l2)) = 100 + 101 = 201
Q(s1, move(r1,l1,l4)) = 1 + ½ × 2 + ½ × 0 = 2
argmin = move(r1,l1,l4)
Q(s2, move(r1,l2,l3)) = 1 + (0.8 × 100 + 0.2 × 100) = 101
Q(s2, move(r1,l2,l1)) = 100 + 2 = 102
argmin = move(r1,l2,l3)
Q(s3, move(r1,l3,l4)) = 100 + 0 = 100
Q(s3, move(r1,l3,l2)) = 100 + 101 = 201
argmin = move(r1,l3,l4)
Q(s5, move(r1,l5,l4)) = 100 + 0 = 100
Q(s5, move(r1,l5,l2)) = 100 + 101 = 201
argmin = move(r1,l5,l4)
Value Iteration (Synchronous Version)
●  Let (Σ,s0,Sg) be a safe SSP
●  Start with an arbitrary cost V(s) for each s and a small η > 0
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        Vold ← V
        for every non-goal state s do
            for every a ∈ Applicable(s) do
                Q(s,a) ← cost(s,a) + ∑s′ ∈ S Pr (s′ | s,a) Vold(s′)
            V(s) ← mina∈Applicable(s) Q(s,a)
            π(s) ← argmina∈Applicable(s) Q(s,a)
        if maxs ∈ S ∖ Sg |V(s) – Vold(s)| < η then exit the loop and return π
●  |V(s) – Vold(s)| is the residual of s
●  maxs ∈ S ∖ Sg |V(s) – Vold(s)| is the residual
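A matching Python sketch of the synchronous version (illustrative only, same hypothetical SSPDomain as before):

```python
def value_iteration(domain, goals, eta=0.2):
    """Synchronous VI: sweep all non-goal states using Vold until the residual drops below η."""
    V = {s: 0.0 for s in domain.states}          # arbitrary initial values
    policy = {}
    while True:
        V_old = dict(V)
        for s in domain.states:
            if s in goals:
                continue
            q = {a: domain.cost[(s, a)]
                    + sum(domain.prob[(s, a, s2)] * V_old[s2] for s2 in domain.gamma[(s, a)])
                 for a in domain.applicable(s)}
            best = min(q, key=q.get)
            V[s], policy[s] = q[best], best
        residual = max(abs(V[s] - V_old[s]) for s in domain.states if s not in goals)
        if residual < eta:
            return V, policy
```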
Example
●  aij = the action that moves from si to sj
Ø  e.g., a12 = move(r1,l1,l2))	
  
●  η = 0.2
●  V(s) = 0 for all s
Q(s1, a12) = 100 + 0 = 100
Q(s1, a14) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s2, a21) = 100 + 0 = 100
Q(s2, a23) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s3, a32) = 1 + 0 = 1
Q(s3, a34) = 100 + 0 = 100
min = 1
Q(s5, a52) = 1 + 0 = 1
Q(s5, a54) = 100 + 0 = 100
min = 1
residual = max(1–0, 1–0, 1–0, 1–0) = 1
Example
●  V(s1) = 1; V(s2) = 1; V(s3) = 1; V(s4) = 0; V(s5) = 1
Q(s1, a12) = 100 + 1 = 101
Q(s1, a14) = 1 + (½×1 + ½×0) = 1½
min = 1½
Q(s2, a21) = 100 + 1 = 101
Q(s2, a23) = 1 + (½×1 + ½×1) = 2
min = 2
Q(s3, a32) = 1 + 1 = 2
Q(s3, a34) = 100 + 0 = 100
min = 2
Q(s5, a52) = 1 + 1 = 2
Q(s5, a54) = 100 + 0 = 100
min = 2
residual = max(1½–1, 2–1, 2–1, 2–1) = 1
Example
●  V(s1) = 1½; V(s2) = 2; V(s3) = 2; V(s4) = 0; V(s5) = 2
Q(s1, a12) = 100 + 2 = 102
Q(s1, a14) = 1 + (½×1½ + ½×0) = 1¾
min = 1¾
Q(s2, a21) = 100 + 1½ = 101½
Q(s2, a23) = 1 + (½×2 + ½×2) = 3
min = 3
Q(s3, a32) = 1 + 2 = 3
Q(s3, a34) = 100 + 0 = 100
min = 3
Q(s5, a52) = 1 + 2 = 3
Q(s5, a54) = 100 + 0 = 100
min = 3
residual = max(1¾–1½, 3–2, 3–2, 3–2) = 1
Example
●  V(s1) = 1¾; V(s2) = 3; V(s3) = 3; V(s4) = 0; V(s5) = 3
Q(s1, a12) = 100 + 3 = 103
Q(s1, a14) = 1 + (½×1¾ + ½×0) = 1⅞
min = 1⅞
Q(s2, a21) = 100 + 1¾ = 101¾
Q(s2, a23) = 1 + (½×3 + ½×3) = 4
min = 4
Q(s3, a32) = 1 + 3 = 4
Q(s3, a34) = 100 + 0 = 100
min = 4
Q(s5, a52) = 1 + 3 = 4
Q(s5, a54) = 100 + 0 = 100
min = 4
residual = max(1⅞–1¾, 4–3, 4–3, 4–3) = 1
●  How long before residual < η = 0.2?
●  How long if the “vertical” actions cost 10 instead of 100?
Discussion
●  Policy iteration computes an entire policy in each iteration, and computes values based on that policy
Ø  More work per iteration, because it needs to solve a set of simultaneous equations
Ø  Usually converges in a smaller number of iterations
●  Value iteration computes new values in each iteration, and chooses a policy based on those values
Ø  In general, the values are not the values that one would get from the chosen policy or any other policy
Ø  Less work per iteration, because it doesn’t need to solve a set of equations
Ø  Usually takes more iterations to converge
●  What I showed you was the synchronous version of Value Iteration
•  For each s, compute new values of Q and V using Vold
Ø  Asynchronous version: compute new values of Q and V using V
•  New values may depend on which nodes have already been updated
Value Iteration
●  Start with an arbitrary cost V(s) for each s, and a small η > 0
●  Synchronous version:
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        Vold ← V
        for every s ∈ S ∖ Sg do
            for every a ∈ Applicable(s) do
                Q(s,a) ← cost(s,a) + ∑s′ ∈ S Pr (s′ | s,a) Vold(s′)
            V(s) ← mina∈Applicable(s) Q(s,a)
            π(s) ← argmina∈Applicable(s) Q(s,a)
        if maxs ∈ S ∖ Sg |V(s) – Vold(s)| < η then return π
●  maxs ∈ S ∖ Sg |V(s) – Vold(s)| is the residual
●  |V(s) – Vold(s)| is the residual of s
●  Asynchronous version:
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        r ← 0   // the residual
        for every s ∈ S ∖ Sg do
            r ← max(r, Bellman-Update(s,V,π))
        if r < η then return π
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
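And a corresponding sketch of the asynchronous version, where each Bellman update reads the latest V in place (illustrative only, same hypothetical SSPDomain):

```python
def bellman_update(domain, V, policy, s):
    """Update V(s) and π(s) in place; return the residual |V(s) - vold| of s."""
    v_old = V[s]
    q = {a: domain.cost[(s, a)]
            + sum(domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])
         for a in domain.applicable(s)}
    best = min(q, key=q.get)
    V[s], policy[s] = q[best], best
    return abs(V[s] - v_old)

def async_value_iteration(domain, goals, eta=0.2):
    """Asynchronous VI: repeated Bellman updates until a full sweep has residual < η."""
    V = {s: 0.0 for s in domain.states}
    policy = {}
    while True:
        residual = 0.0
        for s in domain.states:
            if s not in goals:
                residual = max(residual, bellman_update(domain, V, policy, s))
        if residual < eta:
            return V, policy
```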
Discussion (Continued)
●  For both, the number of iterations is polynomial in the number of states
Ø  But the number of states is usually quite large
Ø  In each iteration, need to examine the entire state space
●  Thus, these algorithms can take huge amounts of time and space
●  Use search techniques to avoid searching the entire space
AO* (requires Σ to be acyclic)
AO*(Σ, s0, Sg, h)
    π ← ∅; V ← ∅
    Envelope ← {s0}   // all generated states
    loop
        if leaves(s0,π) ⊆ Sg then return π
        select s ∈ leaves(s0,π) ∖ Sg
        for all a ∈ Applicable(s)
            for all s′ ∈ γ(s,a) ∖ Envelope do
                V(s′) ← h(s′); add s′ to Envelope
        AO-Update(s,V,π)
    return π
AO-Update(s,V,π)
    Z ← {s}   // ancestors of updated nodes
    while Z ≠ ∅ do
        select any s ∈ Z such that γ(s,π(s)) ∩ Z = ∅
        remove s from Z
        Bellman-Update(s,V,π)
        Z ← Z ∪ {s′ ∈ Sπ | s ∈ γ(s′,π(s′))}
●  h is the heuristic function
Ø  Must have h(s) = 0 for every s in Sg
Ø  Example: h(s) = 0 for all s
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
[Figure: small acyclic example graph over states s1–s6, with action costs c = 1, 100, 10, 20, 1 and outcome probabilities 0.2/0.8 and 0.5/0.5; Start and Goal marked]
LAO* (can handle cycles)
LAO*(Σ, s0, Sg, h)
    π ← ∅; V ← ∅
    Envelope ← {s0}   // all generated states
    loop
        if leaves(s0,π) ⊆ Sg then return π
        select s ∈ leaves(s0,π) ∖ Sg
        for all a ∈ Applicable(s)
            for all s′ ∈ γ(s,a) ∖ Envelope do
                V(s′) ← h(s′); add s′ to Envelope
        LAO-Update(s,V,π)
    return π
LAO-Update(s,V,π)
    Z ← {s} ∪ {s′ ∈ Sπ | s ∈ γ(s′,π)}
    for every s ∈ Z do Bellman-Update(s,V,π)
    loop
        residual ← max{Bellman-Update(s,V,π) | s ∈ Sπ}
        if leaves(s0,π) changed or residual ≤ η then break
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
