Mastering Diverse Domains through World Models
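Shohei Taniguchi, Matsuo Lab
Bibliographic Information
Mastering Diverse Domains through World Models
• Authors
• Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap
• Overview
• An improved version (ver. 3) of Dreamer, a reinforcement learning method that uses a world model
• The first to succeed in collecting diamonds in Minecraft with reinforcement learning from scratch
https://arxiv.org/abs/2301.04104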
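Minecraft ObtainDiamond
• The task of obtaining a diamond in Minecraft
• Reward is given only when an intermediate item or the diamond is obtained
• A competition has been held at NeurIPS since 2019, and the task is regarded as a milestone for RL research
• Until now, no RL method trained from scratch had succeeded in obtaining a diamond
• Methods that use human demonstrations have succeeded
Outline
• Background
• World models x reinforcement learning
• PlaNet, Dreamer, DreamerV2
• DreamerV3
• Summary
Some of these slides are reused from the following deck:
https://www.slideshare.net/ShoheiTaniguchi2/ss-238325780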
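Challenges in Reinforcement Learning
Sample efficiency
• Training takes a very long time
• For robots and similar systems, training that frequently on real hardware is prohibitively costly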
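World Models x Reinforcement Learning
If a model of the environment can be learned with deep learning,
the environment can be simulated inside that model,
and a policy should be learnable there
➡ World model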
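World Models x Reinforcement Learning
Training flow
1. Collect data $D$ from the environment with policy $\pi$
2. Train the world model $p_\psi$ on $D$
3. Update the policy $\pi$ using the world model
• Repeat steps 1 to 3
$D = \{x_1, a_1, r_1, \dots, x_T, a_T, r_T\}$
$p_\psi(x_{1:T}, r_{1:T} \mid a_{1:T})$
https://arxiv.org/abs/1903.00374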
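The three-step loop above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a one-dimensional environment with a hidden gain, a one-parameter "world model" fitted by least squares, and a linear policy chosen by comparing imagined returns. It is not the method of any of the papers discussed here.

```python
import random

TRUE_GAIN = 0.5  # hidden environment parameter (hypothetical toy dynamics)


def env_step(x, a):
    """Real environment: x' = x + TRUE_GAIN * a, reward = -x'^2."""
    x_next = x + TRUE_GAIN * a
    return x_next, -x_next ** 2


def collect(policy_k, rng, steps=20):
    """Step 1: roll out the policy a = -k * x plus exploration noise and log the data."""
    data, x = [], rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        a = -policy_k * x + rng.gauss(0.0, 0.3)
        x_next, r = env_step(x, a)
        data.append((x, a, r, x_next))
        x = x_next
    return data


def fit_model(data):
    """Step 2: least-squares estimate of the gain b in the model x' = x + b * a."""
    num = sum(a * (x_next - x) for x, a, _, x_next in data)
    den = sum(a * a for _, a, _, _ in data)
    return num / den


def imagined_return(policy_k, gain, x0=1.0, horizon=10):
    """Roll the policy out inside the learned model only (no environment calls)."""
    x, total = x0, 0.0
    for _ in range(horizon):
        x = x + gain * (-policy_k * x)
        total += -x ** 2
    return total


def improve_policy(gain, candidates):
    """Step 3: keep the policy parameter with the best imagined return."""
    return max(candidates, key=lambda k: imagined_return(k, gain))


def train(iterations=3, seed=0):
    rng = random.Random(seed)
    policy_k, dataset, gain = 0.0, [], 0.0
    candidates = [i * 0.25 for i in range(17)]  # k in [0, 4]
    for _ in range(iterations):
        dataset += collect(policy_k, rng)            # 1. collect data D
        gain = fit_model(dataset)                    # 2. fit the world model
        policy_k = improve_policy(gain, candidates)  # 3. update the policy
    return gain, policy_k
```

Calling `train()` returns the fitted gain and the chosen policy parameter. The policy is only ever evaluated inside the learned model, which is where the sample-efficiency argument comes from.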
World Models
[Ha and Schmidhuber, 2018]
• The paper that kicked off the world-model line of research
• World model training: VAE + MDN-RNN
• Policy training: CMA-ES
• Details are omitted here (see the slides and links below)
https://www.slideshare.net/masa_s/ss-97848402
https://worldmodels.github.io/
https://arxiv.org/abs/1803.10122
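PlaNet
[Hafner et al., 2019]
• World model training:
• Recurrent State Space Model
• Policy training: CEM
• Performance roughly on par with model-free methods
Top: rollout in the real environment
Bottom: simulation by the world model
Experimental results on the DM Control Suite
https://arxiv.org/abs/1811.04551
https://planetrl.github.io/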
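For reference, CEM (the cross-entropy method) searches over action sequences by repeatedly sampling candidates, keeping the best ones, and refitting a Gaussian to them. The sketch below is a minimal illustration, not PlaNet's implementation: `rollout_return` is a hypothetical stand-in for a learned world model, and the population size, elite count, and iteration count are arbitrary assumptions.

```python
import random
import statistics


def rollout_return(x0, actions):
    """Toy stand-in for a learned world model: x' = x + a, reward = -x'^2."""
    x, total = x0, 0.0
    for a in actions:
        x = x + a
        total += -x ** 2
    return total


def cem_plan(x0, horizon=5, iters=10, pop=200, elite=20, seed=0):
    """Return the mean action sequence after a few CEM refinement rounds."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Rank candidates by the return predicted inside the model.
        samples.sort(key=lambda acts: rollout_return(x0, acts), reverse=True)
        elites = samples[:elite]
        # Refit the Gaussian to the elite set (per time step).
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean
```

With `x0 = 1.0` the best plan in this toy model is to move to the origin in the first step and stay there, so the returned sequence should start near -1.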
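Gaussian State Space Model
• A model that uses a normal distribution for the state transition probability
• $p_\psi(s_{t+1} \mid s_t, a_t) = \mathrm{Normal}\left(\mu_\psi(s_t, a_t), \mathrm{diag}(\sigma_\psi^2(s_t, a_t))\right)$
• The functions $\mu_\psi, \sigma_\psi^2$ are implemented with DNNs or similar
• Empirically this does not work well (vanishing gradients, etc.)
[Figure: graphical model with nodes o_t, a_t, r_t, s_t and o_{t+1}, a_{t+1}, r_{t+1}, s_{t+1}]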
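Recurrent State Space Model (RSSM)
• The state $s$ is modeled as two parts: $h$, which transitions deterministically, and $z$, which transitions stochastically (the equations below write the stochastic part as $s_t$)
• $h_{t+1} = f_\psi(h_t, s_t, a_t)$
• $p_\psi(s_t \mid h_t) = \mathrm{Normal}\left(\mu_\psi(h_t), \mathrm{diag}(\sigma_\psi^2(h_t))\right)$
• $f_\psi$ is an RNN-type function such as an LSTM
[Figure: graphical model with nodes x_t, a_t, r_t, s_t, h_t and x_{t+1}, a_{t+1}, r_{t+1}, s_{t+1}, h_{t+1}]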
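One imagined rollout through the two RSSM equations can be sketched as follows. This is a scalar toy under stated assumptions: a single tanh unit stands in for the RNN $f_\psi$, and the weights and the mean/standard-deviation functions are hypothetical placeholders rather than learned networks.

```python
import math
import random


def deterministic_step(h, s, a, w=(0.9, 0.5, 0.3)):
    """h_{t+1} = f(h_t, s_t, a_t); a one-unit tanh cell stands in for the RNN."""
    w_h, w_s, w_a = w
    return math.tanh(w_h * h + w_s * s + w_a * a)


def stochastic_params(h):
    """Mean and standard deviation of p(s_t | h_t), as simple functions of h_t."""
    mu = 0.5 * h
    sigma = 0.1 + 0.1 * abs(h)  # kept strictly positive
    return mu, sigma


def imagine(actions, h0=0.0, seed=0):
    """Alternate stochastic sampling and deterministic updates along an action sequence."""
    rng = random.Random(seed)
    h, trajectory = h0, []
    for a in actions:
        mu, sigma = stochastic_params(h)
        s = rng.gauss(mu, sigma)         # s_t ~ Normal(mu(h_t), sigma(h_t)^2)
        trajectory.append((h, s))
        h = deterministic_step(h, s, a)  # h_{t+1} = f(h_t, s_t, a_t)
    return trajectory
```

The deterministic path carries information across many steps without being resampled, while the stochastic part keeps the model able to represent uncertainty at each step.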
Recurrent State Space Model (RSSM)
Using the RSSM improves performance considerably
Dreamer
[Hafner et al., 2019]
• Builds on PlaNet and changes policy learning to an actor-critic scheme
• Uses the $\lambda$-return for the value function
• Large performance improvement over PlaNet
https://arxiv.org/abs/1912.01603
https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html
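Value Function Estimation
Bellman equation
$V^\pi(s_t) = \mathbb{E}_\pi\left[r(s_t, a_t)\right] + V^\pi(s_{t+1})$
Extending this to $n$ steps gives
$V_n^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{k=0}^{n-1} r(s_{t+k}, a_{t+k})\right] + V^\pi(s_{t+n})$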
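Value Function Estimation
$V_n^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{k=0}^{n-1} r(s_{t+k}, a_{t+k})\right] + V^\pi(s_{t+n})$
Taking an exponentially weighted average over $n = 1, \dots, \infty$ gives
$\bar{V}^\pi(s_t, \lambda) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_n^\pi(s_t)$
This is called the $\lambda$-return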
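Value Function Estimation
In Dreamer, the $\lambda$-return is used as the target for the value function
$\theta \leftarrow \theta - \eta_\theta \nabla_\theta \mathbb{E}_{p_\psi, \pi_\phi}\left[\left(V_\theta^{\pi_\phi}(s_t) - \bar{V}^\pi(s_t, \lambda)\right)^2\right]$
However, the sum in the exponential average is truncated at a suitable length (denoted $H$)
$\bar{V}^\pi(s_t, \lambda) \approx (1 - \lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_n^\pi(s_t) + \lambda^{H-1} V_H^\pi(s_t)$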
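The truncated $\lambda$-return can be computed directly from a list of imagined rewards and value estimates. The sketch below is a minimal illustration under assumptions: the $n$-step return is taken as the first $n$ rewards plus the bootstrap value $V(s_{t+n})$, discounting is omitted as on the slides, and the function names are hypothetical.

```python
def n_step_return(rewards, values, n):
    """First n rewards plus the bootstrap value at step n.

    rewards[k] is r(s_{t+k}, a_{t+k}); values[k] is the estimate of V(s_{t+k}).
    """
    return sum(rewards[:n]) + values[n]


def lambda_return(rewards, values, lam):
    """Exponentially weighted average of n-step returns, truncated at H = len(rewards)."""
    horizon = len(rewards)
    weighted = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, n)
        for n in range(1, horizon)
    )
    # The remaining weight lam^(H-1) goes to the longest return.
    return weighted + lam ** (horizon - 1) * n_step_return(rewards, values, horizon)
```

The weights $(1 - \lambda)\lambda^{n-1}$ for $n < H$ plus $\lambda^{H-1}$ sum to one, so $\lambda = 0$ recovers the one-step target and $\lambda = 1$ recovers the full $H$-step return.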
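Effect of the $\lambda$-Return
"No value" is the result of training with a policy gradient method
Using the $\lambda$-return improves performance regardless of $H$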
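DreamerV2
[Hafner et al., 2020]
An improved version of Dreamer
1. Uses a discrete categorical distribution for the latent variables
2. Adjusts the learning rate of the KL term so that the encoder is not over-regularized
• Achieves human-level performance on Atari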
Discrete Latent Variables
• PlaNet and DreamerV1 use continuous latent variables modeled with a normal distribution
• DreamerV2 switches to a discrete categorical distribution
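Discrete Latent Variables
• With discrete latents, the reparameterization trick can no longer be used to estimate gradients
• Gradients are estimated with the straight-through estimator instead
• The estimator is biased, but it is simple to implement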
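The straight-through idea is to use the sampled one-hot vector in the forward pass while backpropagating as if the sample were the probability vector itself. In an autodiff framework this is commonly written as `sample + probs - stop_gradient(probs)`. The sketch below spells the same thing out by hand in plain Python. The function names and the convention that `upstream[i]` holds the loss gradient with respect to the i-th one-hot entry are assumptions for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    """Forward pass: draw a category and return it as a one-hot vector."""
    idx = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]


def straight_through_grad(logits, upstream):
    """Backward pass: gradient w.r.t. logits, treating d(sample)/d(probs) as identity.

    With that substitution the gradient is the softmax Jacobian applied to
    `upstream`: grad_j = p_j * (upstream_j - sum_i p_i * upstream_i).
    """
    p = softmax(logits)
    baseline = sum(pi * gi for pi, gi in zip(p, upstream))
    return [pi * (gi - baseline) for pi, gi in zip(p, upstream)]
```

The bias comes from the substitution in the backward pass: the forward value is a discrete sample, but the gradient is that of a deterministic function of the probabilities.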
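KL Balancing
• In the world model loss, the KL term acts as a regularizer that pulls the encoder and the transition model's prior toward each other
• However, when the transition model is not yet well trained, especially early in training, this KL regularization becomes too strong and hinders learning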
KL Balancing
• This is mitigated by adjusting the learning rate of the KL term separately for the encoder and the transition model
• $\alpha$ is set to 0.8
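Concretely, KL balancing leaves the value of the KL term unchanged and only rescales the gradient each side receives. The sketch below assumes, following the DreamerV2 paper, that $\alpha$ weights the gradient flowing into the transition model's prior and $1 - \alpha$ the gradient flowing into the encoder's posterior. Both distributions are categorical and parameterized by logits. The function names are hypothetical and the gradients are written out analytically rather than with autodiff.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(q, p):
    """KL(q || p) between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def balanced_kl_grads(post_logits, prior_logits, alpha=0.8):
    """Return the KL value and the rescaled gradients for both sets of logits."""
    q, p = softmax(post_logits), softmax(prior_logits)
    value = kl(q, p)
    # d KL / d prior_logits = p - q, scaled by alpha (the prior learns faster).
    grad_prior = [alpha * (pi - qi) for qi, pi in zip(q, p)]
    # d KL / d post_logits_j = q_j * (log(q_j / p_j) - KL), scaled by 1 - alpha.
    grad_post = [
        (1.0 - alpha) * qi * (math.log(qi / pi) - value) for qi, pi in zip(q, p)
    ]
    return value, grad_post, grad_prior
```

With $\alpha = 0.8$ the prior is pulled toward the posterior four times as strongly as the posterior is pulled toward the prior, which is the intended remedy for a poorly trained prior regularizing the encoder too hard.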
Experiments
• Surpasses human performance on Atari
• Stronger than model-free methods such as DQN and Rainbow
Experiments
Ablation
• Categorical variables and KL balancing both have a large effect
DreamerV3
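• Adds several techniques to turn DreamerV2 into a more general-purpose method
• The goal is to train with the same hyperparameters every time, even when the domain changes
1. Transform observation and reward values with the symlog function
2. Normalize the $\lambda$-return values in the actor objective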
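Symlog Prediction
• When the domain changes, the scale of observations and rewards changes, so hyperparameters would have to be retuned each time
• To avoid this, values are passed through the symlog function, which brings them to roughly similar scales
• The function is invertible, so applying its inverse recovers the original value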
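A minimal sketch of the transform and its inverse, assuming the definition from the DreamerV3 paper, symlog(x) = sign(x) ln(|x| + 1), with inverse symexp(x) = sign(x) (exp(|x|) - 1):

```python
import math


def symlog(x):
    """Compress large magnitudes symmetrically around zero."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog."""
    return math.copysign(math.expm1(abs(x)), x)
```

Near zero the function is approximately the identity, and for large magnitudes it grows logarithmically, so rewards of 1 and 1000 end up within one order of magnitude of each other (about 0.69 and 6.9).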
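Normalizing the $\lambda$-Return
• When the actor is trained with entropy regularization, tuning the entropy coefficient is hard because it depends on the scale and sparsity of the rewards
• If the reward values can be normalized well, the entropy coefficient should be fixable regardless of domain
Normalizing the $\lambda$-Return
• The return is normalized by the width between its 5th and 95th percentiles
• Naively normalizing by the variance overestimates returns when rewards are sparse, so this form is used to reject outliers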
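The percentile-based scaling can be sketched as below. Two details are assumptions not stated on the slide: the scale is floored at 1 so that small returns are never amplified (as in the DreamerV3 paper), and the percentiles are computed over a single batch instead of being tracked with a running average. The helper names are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac


def normalize_returns(returns, low=5, high=95, floor=1.0):
    """Divide returns by the 5th-to-95th percentile range, floored at `floor`."""
    scale = percentile(returns, high) - percentile(returns, low)
    return [r / max(floor, scale) for r in returns]
```

When rewards are sparse, almost all returns are identical, so the percentile range collapses toward zero and the floor takes over. The rare nonzero return is then left at its original size instead of being blown up, which is the failure mode of dividing by a small standard deviation.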
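Experiments
• High performance across all domains and tasks with the same hyperparameters
Experiments
• Performance is also confirmed to scale with model size
Experiments
Future prediction by the world model
Experiments
• The first RL agent to succeed in obtaining a diamond in Minecraft
Summary
• Reviewed the development of Dreamer, a representative world-model method
• Frankly, V3 does feel like a collection of heuristics
• The results are impressive
[DL Reading Group (DL輪読会)] Mastering Diverse Domains through World Models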