A 5-minute presentation on the thesis "Learning To Hop Using Guided Policy Search". Presented during the ETH Zurich Computer Science Master Ceremony 2017.
Learning To Hop Using Guided Policy Search / ETH Zurich Computer Science Master Ceremony
1. Julian Viereck. Supervisors: Felix Berkenkamp (1), Alexander Herzog (2), Ludovic Righetti (2), Prof. Andreas Krause (1)
(1) Learning & Adaptive Systems Group, Department of Computer Science, ETH Zurich
(2) Autonomous Motion Department, Max Planck Institute for Intelligent Systems, Tübingen
9 June 2017 | ETH Zurich Computer Science Master Ceremony | Zurich
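4. [Diagram labels: Actions, Environment, State, Reinforcement Learning, Reward]
$D_t = \left\{ (x_t, u_t, x_{t+1})_1, \dots, (x_t, u_t, x_{t+1})_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\pi_\theta(u_t \mid o_t)$, $\ell(x_t, u_t)$, $r_t$, $x_t$, $a_t$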
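The interaction loop on this slide (state in, action out, reward back) and the per-time-step sample set $D_t$ can be sketched as follows. This is a minimal illustration only: the linear toy environment (`A`, `B`), the feedback gain `K`, the quadratic reward, and the helper `rollout` are all hypothetical stand-ins and are not taken from the thesis.

```python
import numpy as np

def rollout(policy, x0, A, B, T, rng, noise_std=0.01):
    """Run one episode in a toy linear environment (hypothetical, for illustration)."""
    x = x0.copy()
    transitions, rewards = [], []
    for t in range(T):
        u = policy(x, t)                                   # action u_t
        x_next = A @ x + B @ u + noise_std * rng.standard_normal(x.shape)
        rewards.append(-(x @ x + 0.1 * (u @ u)))           # reward r_t (assumed quadratic)
        transitions.append(np.concatenate([x, u, x_next])) # one sample (x_t, u_t, x_{t+1})
        x = x_next
    return np.array(transitions), np.array(rewards)

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed toy dynamics
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -1.0]])             # assumed linear feedback gain
policy = lambda x, t: K @ x + 0.1 * rng.standard_normal(1)

J, T = 5, 20
runs = [rollout(policy, np.array([1.0, 0.0]), A, B, T, rng) for _ in range(J)]
D = np.stack([tr for tr, _ in runs])          # shape (J, T, 5): D[:, t] is the set D_t
returns = np.array([rw.sum() for _, rw in runs])

# Empirical Gaussian per time step, D_t ~ N(mu_D[t], sigma_D[t])
mu_D = D.mean(axis=0)
sigma_D = np.stack([np.cov(D[:, t].T) for t in range(T)])
```

Each slice `D[:, t]` holds the J samples for time step t, which is the quantity the Gaussian fit is computed from.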
5. [Diagram labels: Actions, ?, Environment, State, Cost, Dynamics, Guided Policy Search]
$D_t = \left\{ (x_t, u_t, x_{t+1})_1, \dots, (x_t, u_t, x_{t+1})_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\begin{bmatrix} x_t \\ u_t \\ x_{t+1} \end{bmatrix}$, $\pi_\theta(u_t \mid o_t)$, $\ell(x_t, u_t)$, $r_t$, $x_t$, $a_t$
6. [Diagram labels: Environment, Cost, Local Behavior, Dynamics, Global Behavior, multiple, State, Actions]
$D_t = \left\{ (x_t, u_t, x_{t+1})_1, \dots, (x_t, u_t, x_{t+1})_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\begin{bmatrix} x_t \\ u_t \\ x_{t+1} \end{bmatrix}$, $\pi_\theta(u_t \mid o_t)$, $\ell(x_t, u_t)$, $r_t$, $x_t$, $a_t$
7. [Diagram labels: Environment, Cost, Local Behavior, Dynamics, Global Behavior, multiple, State, Actions, Optimization objective]
$D_t = \left\{ (x_t, u_t, x_{t+1})_1, \dots, (x_t, u_t, x_{t+1})_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\begin{bmatrix} x_t \\ u_t \\ x_{t+1} \end{bmatrix}$, $\pi_\theta(u_t \mid o_t)$, $\ell(x_t, u_t)$, $r_t$, $x_t$, $a_t$
8. [Diagram labels: Environment, Cost, Dynamics, Local Behavior, Global Behavior, State, Actions, multiple]
$D_t = \left\{ (x_t, u_t, x_{t+1})_1, \dots, (x_t, u_t, x_{t+1})_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\begin{bmatrix} x_t \\ u_t \\ x_{t+1} \end{bmatrix}$, $\pi_\theta(u_t \mid o_t)$, $\ell(x_t, u_t)$, $r_t$, $x_t$, $a_t$
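16. GMM Prior
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
$\begin{pmatrix} x_t \\ u_t \\ x_{t+1} \end{pmatrix} \sim \mathcal{N}$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\pi_\theta(u_t \mid o_t)$
Dynamics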
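17. GMM Prior
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
$\begin{pmatrix} x_t \\ u_t \\ x_{t+1} \end{pmatrix} \sim \mathcal{N}$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\pi_\theta(u_t \mid o_t)$
Dynamics
→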
18. GMM Prior
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
$\begin{pmatrix} x_t \\ u_t \\ x_{t+1} \end{pmatrix} \sim \mathcal{N}$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\pi_\theta(u_t \mid o_t)$
Dynamics
→
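The argmax over a normal-inverse-Wishart (NIW) density on the GMM Prior slides can be illustrated with the standard conjugate update. The sketch below is an assumption-laden example, not the thesis code: `niw_map`, `mu0` and `psi0` are hypothetical names, `psi0` stands for a prior scale matrix that would be derived from the GMM covariance, and the divisor `nu_n + d + 2` is the mode of the joint NIW density, which may differ from the normalization used in the thesis.

```python
import numpy as np

def niw_map(samples, mu0, psi0, k0, n0):
    """MAP mean and covariance of a Gaussian under an NIW prior (standard conjugate update)."""
    n, d = samples.shape
    mean = samples.mean(axis=0)
    centered = samples - mean
    scatter = centered.T @ centered            # sum of outer products around the sample mean
    diff = mean - mu0
    k_n = k0 + n                               # posterior pseudo-count for the mean
    nu_n = n0 + n                              # posterior degrees of freedom
    mu_n = (k0 * mu0 + n * mean) / k_n
    psi_n = psi0 + scatter + (k0 * n / k_n) * np.outer(diff, diff)
    sigma_map = psi_n / (nu_n + d + 2)         # mode of the joint NIW posterior
    return mu_n, sigma_map

# Four 2-D samples with sample mean (1, 1); prior centred at the origin.
samples = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mu_map, sigma_map = niw_map(samples, mu0=np.zeros(2), psi0=np.eye(2), k0=4.0, n0=3.0)
```

With few samples per time step, the prior terms `k0` and `n0` dominate and pull the estimate toward the prior; with many samples the estimate approaches the empirical mean and covariance.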
37. Global Behavior
$D_t = \left\{ \begin{pmatrix} x_t \\ u_t \\ x_{t+1} \end{pmatrix}_1, \dots, \begin{pmatrix} x_t \\ u_t \\ x_{t+1} \end{pmatrix}_J \right\}$
$D_t \sim \mathcal{N}(\mu^D_t, \Sigma^D_t)$
$x_{t+1} \sim \mathcal{N}\left( \mu^D_{t, x_{t+1} \mid x_t, u_t},\ \Sigma^D_{t, x_{t+1} \mid x_t, u_t} \right)$
$x_{t+1} \sim \mathcal{N}\left( f_{xut} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_{ct},\ F_t \right)$
$p(x) = \mathcal{N}(x \mid \mu^D_t, \Sigma^D_t)$
$\mu^{D\star}_t, \Sigma^{D\star}_t = \arg\max \mathrm{NIW}(\mu, \Sigma \mid \mu^D_t, \Sigma^D_t, \mu^{gmm}_t, \Sigma^{gmm}_t, k_0, n_0)$
Symbols: $\hat{p}_i$, $p_i$, $p(x_{t+1} \mid x_t, u_t)$, $u_t$, $o_t$, $\begin{bmatrix} x_t \\ u_t \\ x_{t+1} \end{bmatrix}$, $\pi_\theta(u_t \mid o_t)$
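The step from the joint Gaussian over $(x_t, u_t, x_{t+1})$ to the linear-Gaussian dynamics with $f_{xut}$, $f_{ct}$ and $F_t$ is ordinary Gaussian conditioning. The following sketch shows that step under stated assumptions: `condition_dynamics` is a hypothetical helper, and the example joint distribution is constructed from made-up values (`A_true`, `c_true`, `F_true`) purely so the result can be checked.

```python
import numpy as np

def condition_dynamics(mu, sigma, n_in):
    """Condition a joint Gaussian over [x_t; u_t; x_{t+1}] on its first n_in entries [x_t; u_t]."""
    mu_in, mu_out = mu[:n_in], mu[n_in:]
    s_ii = sigma[:n_in, :n_in]                 # covariance of [x_t; u_t]
    s_io = sigma[:n_in, n_in:]                 # cross-covariance with x_{t+1}
    s_oo = sigma[n_in:, n_in:]                 # covariance of x_{t+1}
    f_xu = np.linalg.solve(s_ii, s_io).T       # Sigma_oi Sigma_ii^{-1}
    f_c = mu_out - f_xu @ mu_in                # constant offset
    F = s_oo - f_xu @ s_io                     # conditional covariance
    return f_xu, f_c, 0.5 * (F + F.T)          # symmetrize against round-off

# Build a joint Gaussian from known (made-up) linear dynamics, then recover them.
A_true = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]])   # acts on [x_t; u_t], 2 states + 1 action
c_true = np.array([0.05, -0.02])
F_true = 0.1 * np.eye(2)
mu_in = np.array([0.5, -0.2, 0.1])
s_ii = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 0.5]])
s_io = s_ii @ A_true.T
mu = np.concatenate([mu_in, A_true @ mu_in + c_true])
sigma = np.block([[s_ii, s_io], [s_io.T, A_true @ s_ii @ A_true.T + F_true]])

f_xu, f_c, F = condition_dynamics(mu, sigma, n_in=3)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical-stability choice; it relies on the input block of the covariance being symmetric and non-singular.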