1
Tatsuya Matsushima @__tmats__, Matsuo Lab
2
(SRL)
3

4
5
SRL
6
$a_t \in \mathcal{A}$
$o_t \in \mathcal{O}$
$\tilde{s}_t \in \tilde{\mathcal{S}}$, $s_t \in \mathcal{S}$
$s_t = \phi(o_{1:t})$
SRL
7
SRL
8
$$s_t = \phi(o_t; \theta_\phi)$$
$$\hat{o}_t = \phi^{-1}(s_t; \theta_{\phi^{-1}})$$
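The encoder/decoder pair above can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the linear maps, the hand-picked weights `W_enc` and `W_dec`, and the mean-squared reconstruction loss are all assumptions made for the example.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(o, W_enc):
    """s_t = phi(o_t; theta_phi), here a hypothetical linear map."""
    return matvec(W_enc, o)

def decode(s, W_dec):
    """o_hat_t = phi^{-1}(s_t; theta_{phi^{-1}}), also linear here."""
    return matvec(W_dec, s)

def reconstruction_loss(o, W_enc, W_dec):
    """Mean squared error between o_t and its reconstruction o_hat_t."""
    o_hat = decode(encode(o, W_enc), W_dec)
    return sum((x - y) ** 2 for x, y in zip(o, o_hat)) / len(o)

# Hand-picked weights (assumption): keep the first two of three observation dims.
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

print(reconstruction_loss([1.0, 2.0, 0.0], W_enc, W_dec))  # 0.0: nothing is lost
print(reconstruction_loss([1.0, 2.0, 3.0], W_enc, W_dec))  # 3.0: the third dim is dropped
```

In practice both maps would be neural networks trained to minimise this loss; the 2-dimensional state here only shows how compression forces the encoder to discard information.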
SRL
9
$$\hat{s}_{t+1} = f(s_t, a_t; \theta_{\mathrm{fwd}})$$
$$s_t = \phi(o_t; \theta_\phi)$$
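A forward model of this form might look like the sketch below. The linear choice of `f`, the parameter names `W`, `U`, `V`, and the squared-error loss are assumptions for illustration; the slide only fixes the signature of `f`.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward_model(s, a, W, U, V):
    """Predict s_hat_{t+1} = f(s_t, a_t), with f assumed linear: W s_t + U a_t + V."""
    Ws, Ua = matvec(W, s), matvec(U, a)
    return [x + y + z for x, y, z in zip(Ws, Ua, V)]

def forward_loss(s_next_hat, s_next):
    """Squared error between the predicted and the encoded next state."""
    return sum((p - q) ** 2 for p, q in zip(s_next_hat, s_next))

# Hypothetical parameters: 2-D state (position, velocity), 1-D action.
W = [[1.0, 0.1], [0.0, 1.0]]
U = [[0.0], [0.5]]
V = [0.0, 0.0]

s_t, a_t = [1.0, 2.0], [1.0]
s_hat = forward_model(s_t, a_t, W, U, V)   # approximately [1.2, 2.5]
print(s_hat, forward_loss(s_hat, [1.2, 2.5]))
```

Because the loss is computed in state space, its gradient also flows into the encoder parameters when `s_t` and `s_{t+1}` come from the encoder.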
SRL
10
$$s_t = \phi(o_t; \theta_\phi)$$
$$\hat{a}_t = g(s_t, s_{t+1}; \theta_{\mathrm{inv}})$$
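To make the role of the inverse model concrete, here is a small stand-in for `g`. Instead of a learned network it uses a nearest-effect lookup over a hypothetical discrete action set (`ACTION_EFFECTS` is invented for this example), which is enough to show the input and output of the mapping.

```python
# Hypothetical discrete actions: each shifts a 2-D state by a fixed offset.
ACTION_EFFECTS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}

def inverse_model(s_t, s_next):
    """a_hat_t = g(s_t, s_{t+1}): pick the action whose effect best explains the change."""
    delta = (s_next[0] - s_t[0], s_next[1] - s_t[1])

    def sq_error(action):
        dx, dy = ACTION_EFFECTS[action]
        return (delta[0] - dx) ** 2 + (delta[1] - dy) ** 2

    return min(ACTION_EFFECTS, key=sq_error)

print(inverse_model((0.0, 0.0), (0.9, 0.1)))  # right
```

The lookup only works if the state encodes what the agent can control, which is the property an inverse-model loss is meant to encourage in the representation.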


SRL
11
$$\mathrm{Loss} = \mathcal{L}_{\mathrm{prior}}(s_{1:n}; \theta_\phi \mid c)$$
$$s_t = \phi(o_t; \theta_\phi)$$
Why SRL?
12
13
14
15
16
17
$$\hat{s}_{t+1} = W s_t + U a_t + V$$
E2C [Watter+ 2015]

18
$$\hat{s}_{t+1} \sim \mathcal{N}(\mu = W s_t + U a_t + V, \sigma)$$
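The Gaussian transition can be illustrated with a scalar sketch. The 1-D state and action, the parameter values, and the helper names `transition_mean` and `sample_next_state` are assumptions chosen to keep the example short.

```python
import random

def transition_mean(s, a, W, U, V):
    """mu = W s_t + U a_t + V for scalar state and action."""
    return W * s + U * a + V

def sample_next_state(s, a, W, U, V, sigma, rng):
    """Draw s_hat_{t+1} ~ N(mu, sigma) around the linear prediction."""
    return rng.gauss(transition_mean(s, a, W, U, V), sigma)

rng = random.Random(0)  # fixed seed so the run is repeatable
mu = transition_mean(2.0, 1.0, W=1.0, U=0.5, V=0.1)
samples = [
    sample_next_state(2.0, 1.0, W=1.0, U=0.5, V=0.1, sigma=0.2, rng=rng)
    for _ in range(1000)
]
print(mu, sum(samples) / len(samples))  # the sample mean stays close to mu
```

Setting `sigma` to zero recovers the deterministic linear model, so the stochastic version strictly generalises it.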



World Model [Ha+ 2018]
19
20
ICM [Pathak+ 2017]


21
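$$\mathcal{L}_{\mathrm{fwd}}\big(\hat{\phi}(o_{t+1}), \hat{f}(\hat{\phi}(o_t), a_t)\big) = \frac{1}{2} \left\| \hat{f}(\hat{\phi}(o_t), a_t) - \hat{\phi}(o_{t+1}) \right\|_2^2$$
$$\min_{\theta_P, \theta_I, \theta_F} \left[ -\lambda \, \mathbb{E}_{\pi(s_t; \theta_P)} \left[ \sum_t r_t \right] + (1 - \beta) \mathcal{L}_{\mathrm{inv}} + \beta \mathcal{L}_{\mathrm{fwd}} \right]$$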
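The two expressions above translate directly into code. The sketch below only evaluates them for given numbers; the function names and the sample values are assumptions, and the feature vectors would normally come from the learned encoder.

```python
def icm_forward_loss(phi_next_pred, phi_next):
    """L_fwd = 1/2 * || f_hat(phi_hat(o_t), a_t) - phi_hat(o_{t+1}) ||_2^2."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(phi_next_pred, phi_next))

def icm_objective(expected_return, loss_inv, loss_fwd, lam, beta):
    """-lambda * E[sum_t r_t] + (1 - beta) * L_inv + beta * L_fwd (to be minimised)."""
    return -lam * expected_return + (1.0 - beta) * loss_inv + beta * loss_fwd

loss_fwd = icm_forward_loss([1.0, 2.0], [0.0, 0.0])
print(loss_fwd)  # 2.5
total = icm_objective(expected_return=10.0, loss_inv=1.0, loss_fwd=loss_fwd,
                      lam=0.1, beta=0.2)
print(total)  # about 0.3
```

`beta` trades the inverse loss against the forward loss, while `lam` sets how much the policy return matters relative to both.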
22
$$\min_{G, Q, \mathcal{M}} \max_{D} V(G, D) - \lambda I_{VLB}(G, Q)$$
23





24
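$$\mathcal{L}_{\mathrm{Slowness}}(D, \phi) = \mathbb{E}\left[ \|\Delta s_t\|^2 \right]$$
$$\mathcal{L}_{\mathrm{Variability}}(D, \phi) = \mathbb{E}\left[ e^{-\|s_{t_1} - s_{t_2}\|} \right]$$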

25
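$$\mathcal{L}_{\mathrm{Prop}}(D, \phi) = \mathbb{E}\left[ \left( \|\Delta s_{t_2}\| - \|\Delta s_{t_1}\| \right)^2 \,\middle|\, a_{t_1} = a_{t_2} \right]$$
$$\mathcal{L}_{\mathrm{Rep}}(D, \phi) = \mathbb{E}\left[ e^{-\|s_{t_2} - s_{t_1}\|^2} \, \|\Delta s_{t_2} - \Delta s_{t_1}\|^2 \,\middle|\, a_{t_1} = a_{t_2} \right]$$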
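The four robotic-prior losses (slowness, variability, proportionality, repeatability) can be evaluated on a short trajectory as sketched below. The expectations are approximated by plain averages over all time-step pairs, which is an assumption of this example, as are the helper names and the toy trajectory.

```python
import math
from itertools import combinations

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def robotic_prior_losses(states, actions):
    """states: s_0..s_T, actions: a_0..a_{T-1}; returns the four averaged losses."""
    T = len(actions)
    deltas = [sub(states[t + 1], states[t]) for t in range(T)]  # Delta s_t
    pairs = list(combinations(range(T), 2))                     # all (t1, t2), t1 < t2
    same = [(t1, t2) for t1, t2 in pairs if actions[t1] == actions[t2]]
    return {
        "slowness": mean(norm(d) ** 2 for d in deltas),
        "variability": mean(
            math.exp(-norm(sub(states[t1], states[t2]))) for t1, t2 in pairs
        ),
        "proportionality": mean(
            (norm(deltas[t2]) - norm(deltas[t1])) ** 2 for t1, t2 in same
        ),
        "repeatability": mean(
            math.exp(-norm(sub(states[t2], states[t1])) ** 2)
            * norm(sub(deltas[t2], deltas[t1])) ** 2
            for t1, t2 in same
        ),
    }

# Toy trajectory (assumption): two steps right, then one step up.
states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
actions = ["right", "right", "up"]
losses = robotic_prior_losses(states, actions)
print(losses)
```

Here the two "right" steps produce identical state changes, so the proportionality and repeatability terms are zero, while slowness and variability pull in opposite directions (small steps versus well-separated states).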
26
E2C [Watter+ 2015]: ✔ ✔ ✔ ✔
World Model [Ha+ 2018]: ✔ ✔ ✔
ICM [Pathak+ 2017]: ✔ ✔ ✔
Causal InfoGAN [Kurutach+ 2018]: ✔ ✔ ✔ ✔
VPN [Oh+ 2017]: ✔ ✔
Robotic Priors [Jonschkowski+ 2015]: ✔ ✔
27
Robotic Priors [Jonschkowski+ 2015]: slot car racing, 16×16×3, 2 (25)
E2C [Watter+ 2015]: cart-pole, 80×80×3, 8
ICM [Pathak+ 2017]: Mario Bros., 42×42×3, 2 (14)
28
$$\mathrm{KNN\text{-}MSE}(s) = \frac{1}{k} \sum_{s' \in \mathrm{KNN}(s, k)} \|\tilde{s} - \tilde{s}'\|^2$$
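A direct reading of the KNN-MSE formula is sketched below: neighbours are searched in the learned state space, and the error is measured between the corresponding ground-truth states. The function name, the index-based interface, and the toy data are assumptions for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_mse(i, learned, truth, k):
    """KNN-MSE for sample i.

    learned[j] is the learned state s_j, truth[j] the ground-truth state s~_j.
    The k nearest neighbours are found in the learned space; the error is the
    mean squared distance between the matching ground-truth states.
    """
    others = [j for j in range(len(learned)) if j != i]
    neighbours = sorted(others, key=lambda j: sq_dist(learned[i], learned[j]))[:k]
    return sum(sq_dist(truth[i], truth[j]) for j in neighbours) / k

# Toy data (assumption): 1-D learned states and 1-D ground-truth states.
learned = [[0.0], [0.1], [0.2], [5.0]]
truth = [[0.0], [1.0], [2.0], [9.0]]
print(knn_mse(0, learned, truth, k=2))  # 2.5
```

A low value means that points close together in the learned space are also close in the true state space, so the metric checks local consistency without requiring the two spaces to share a scale or orientation.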
29
30
31
32
S-RL Toolbox
33
34
35
Appendix
36
References
37
38
39

[DL Reading Group] State Representation Learning for Reinforcement Learning: Toward Acquiring Better "World Models"