NeRF-VAE: A Geometry Aware 3D Scene Generative Model
Shohei Taniguchi, Matsuo Lab

Overview
A NeRF that can reconstruct and generate unseen scenes
• Authors: Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Soňa Mokrá, Danilo J. Rezende
• DeepMind
• A model that uses NeRF as the generator of GQN
• The last author, Rezende, is one of the proposers of GQN
• ICML format

Outline
1. Background
• Neural Radiance Fields (NeRF)
• Generative Query Networks (GQN)
2. Method: NeRF-VAE
3. Experiments
4. Summary

Background

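NeRF
[Mildenhall et al., ECCV2020]
• A neural network (the scene function $F_\theta$) that takes a 3D coordinate $x$ and a viewing direction $d$ as input and outputs radiance $(r, g, b)$ and density $\sigma$
• Trained on photos taken from various angles
➡︎ Can then generate photos taken from other angles (novel view synthesis)
$$F_\theta : (x, d) \mapsto ((r, g, b), \sigma)$$
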
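NeRF
[Mildenhall et al., ECCV2020]
• Represents a scene as a function from 3D coordinates and viewing direction to radiance and density
• Once this function is known, images from arbitrary viewpoints can be generated with volume rendering (see Doi-san's slides [1, 2] for details)
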
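The volume rendering step can be sketched with the quadrature rule used in the NeRF paper: sample $i$ along a ray, with density $\sigma_i$, color $c_i$ and spacing $\delta_i$, contributes with weight $T_i (1 - e^{-\sigma_i \delta_i})$, where $T_i$ is the fraction of light that survives up to that sample. The function name `render_ray` and the toy inputs below are illustrative assumptions, not the authors' code.

```python
import math

def render_ray(colors, sigmas, deltas):
    """Composite per-sample colors along one ray (NeRF-style quadrature).

    colors: list of (r, g, b) tuples, one per sample along the ray
    sigmas: list of non-negative densities, one per sample
    deltas: list of distances between adjacent samples
    Returns the rendered (r, g, b) and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for color, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha  # light surviving past this segment
    return tuple(pixel), weights

# Toy ray: empty space, then a dense red surface, then a green one behind it.
rgb, w = render_ray(
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    sigmas=[0.0, 50.0, 50.0],
    deltas=[0.1, 0.1, 0.1],
)
```

In this toy ray the empty sample gets zero weight and the red surface occludes almost all of the green one behind it. Every operation is differentiable in the densities and colors, which is what allows gradients to flow back into the scene function.
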
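NeRF
[Mildenhall et al., ECCV2020]
• Training minimizes the squared error between the rendered image and the ground-truth image
• Volume rendering is differentiable, so the model can be trained end-to-end
• There are various tricks, e.g. in how the sample points used during rendering are chosen
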
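Written out, the training objective is a sum of squared errors over pixels (rays). The symbols below are notation introduced here for illustration and do not appear on the slides: $\mathcal{R}$ is the set of training rays, $\hat{C}_\theta(\mathbf{r})$ the color rendered along ray $\mathbf{r}$, and $C(\mathbf{r})$ the ground-truth pixel color.

$$\mathcal{L}(\theta) = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \hat{C}_\theta(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2$$
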
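NeRF
[Mildenhall et al., ECCV2020]
Pros
• A breakthrough as a representation of 3D scenes
• Previous representations, such as point clouds and meshes, were discrete and costly
• An implicit representation using a neural network can capture complex scenes in fine detail
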
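NeRF
[Mildenhall et al., ECCV2020]
Cons
• A model must be optimized separately for every scene
• Whenever an unseen scene is obtained, a model has to be trained for it
• Many images must be prepared for each scene
• Training takes 1 to 2 days per scene
• (Naturally) it cannot generate new scenes
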
GQN
[Eslami et al., 2018]
• A VAE that performs 3D scene reconstruction
• Can reconstruct a new scene quickly using an encoder
• The model is convolution-based
• See Suzuki-san's slides [3] for details

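GQN
[Eslami et al., 2018]
• Let $I$ be the image seen from viewpoint $c$, and represent the scene with a latent variable $z$
• Trained by maximizing the variational lower bound, as in a VAE
$$
\begin{aligned}
\log p(\{I_k\}_{k=1}^N \mid \{c_k\}_{k=1}^N)
&= \log \int p(z) \prod_{k=1}^N p(I_k \mid c_k, z)\, dz \\
&\ge \mathbb{E}_{q(z \mid \{I_k, c_k\}_{k=1}^N)} \left[ \sum_{k=1}^N \log p(I_k \mid c_k, z) \right] - D_{\mathrm{KL}}(q \,\|\, p)
\end{aligned}
$$
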
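A minimal sketch of how this lower bound is typically estimated in practice, assuming a diagonal Gaussian posterior $q$ and a standard normal prior $p(z)$ (the prior choice is an assumption here; the slides only state that $q$ is Gaussian). The decoder is represented by a hypothetical `log_likelihood` callable, and all function names are illustrative.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def elbo_estimate(mu, log_var, views, log_likelihood, rng):
    """Single-sample Monte Carlo estimate of the variational lower bound.

    views: list of (image, camera) pairs for one scene
    log_likelihood: callable (image, camera, z) -> log p(image | camera, z);
        a hypothetical stand-in for the decoder
    rng: a random.Random instance used to draw the posterior sample
    """
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    reconstruction = sum(log_likelihood(img, cam, z) for img, cam in views)
    return reconstruction - kl_diag_gaussian_to_standard(mu, log_var)

# The KL term vanishes when the posterior equals the prior
# and grows as the posterior mean moves away from zero.
kl_zero = kl_diag_gaussian_to_standard([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_diag_gaussian_to_standard([1.0, 0.0], [0.0, 0.0])
```

The reconstruction term sums over all $N$ views of the scene while the KL term is paid once per scene, matching the structure of the bound above.
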
GQN
[Eslami et al., 2018]
A new scene can be reconstructed quickly using the encoder ($q$)
$$p(I \mid c, \{I_k, c_k\}_{k=1}^M) \approx \mathbb{E}_{q(z \mid \{I_k, c_k\}_{k=1}^M)} \left[ p(I \mid c, z) \right]$$

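GQN
[Eslami et al., 2018]
Pros
• The encoder can reconstruct unseen scenes quickly (amortized inference)
• Training does not take that long either
Cons
• Because it uses no geometric information, the reconstructed images are not consistent with each other
• Cannot generate images as cleanly as NeRF

Method
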
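NeRF-VAE
• Extends NeRF with a latent variable and trains it like a VAE, so that unseen scenes can be reconstructed
• The latent variable is added to the input of the scene function
$$G_\theta(\cdot, z) : (x, d) \mapsto ((r, g, b), \sigma)$$
• The scene function's parameters $\theta$ learn structure common to all scenes, while the latent variable $z$ comes to capture the features of each individual scene
• Sampling from the prior $p(z)$ also makes it possible to generate new scenes
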
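To make the role of $z$ concrete, here is a toy sketch of a latent-conditioned scene function: one set of weights (standing in for $\theta$) is shared by all scenes, and $z$ is simply concatenated to the inputs $(x, d)$. The concatenation, the tiny untrained MLP, the standard normal prior and every name below are assumptions for illustration only; the actual model conditions on $z$ differently.

```python
import math
import random

def make_scene_function(latent_dim, hidden=16, seed=0):
    """Build a toy G_theta with random (untrained) weights shared across scenes."""
    rng = random.Random(seed)
    in_dim = 3 + 3 + latent_dim  # position x, direction d, latent z
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(4)]

    def scene_function(x, d, z):
        inputs = list(x) + list(d) + list(z)  # z is appended to the NeRF inputs
        # One hidden layer with ReLU activations.
        h = [max(0.0, sum(w * v for w, v in zip(row, inputs))) for row in w1]
        out = [sum(w * v for w, v in zip(row, h)) for row in w2]
        rgb = tuple(1.0 / (1.0 + math.exp(-o)) for o in out[:3])  # sigmoid
        sigma = math.log1p(math.exp(out[3]))  # softplus keeps density non-negative
        return rgb, sigma

    return scene_function

latent_dim = 8
G = make_scene_function(latent_dim)

# Two latents describe two scenes; drawing z from the (assumed) standard
# normal prior corresponds to generating a new scene.
rng = random.Random(1)
z_a = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
x, d = (0.1, -0.2, 0.3), (0.0, 0.0, 1.0)
out_a = G(x, d, z_a)
out_b = G(x, d, z_b)
```

The same function `G` is queried for both scenes; only the latent changes. Rendering a scene then amounts to running volume rendering on `G` with `z` held fixed.
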
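NeRF-VAE
Optimization
• Let $\hat{I}$ be the result of rendering from viewpoint $c$; the likelihood function is then
$$\hat{I} = \mathrm{render}(G_\theta(\cdot, z), c)$$
$$p_\theta(I \mid z, c) = \prod_{i,j} \mathcal{N}\left(I(i, j) \mid \hat{I}(i, j), \sigma_{\mathrm{lik}}^2\right)$$
• As with GQN, training maximizes the variational lower bound
$$\mathbb{E}_{q(z \mid \{I_k, c_k\}_{k=1}^N)} \left[ \sum_{k=1}^N \log p(I_k \mid c_k, z) \right] - D_{\mathrm{KL}}(q \,\|\, p)$$
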
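The per-pixel Gaussian likelihood can be evaluated directly. The sketch below assumes single-channel images stored as nested lists and a fixed noise level `sigma_lik`; the function name and the toy images are illustrative.

```python
import math

def gaussian_log_likelihood(image, rendered, sigma_lik):
    """log p(I | z, c): sum over pixels of log N(I(i,j) | I_hat(i,j), sigma_lik^2).

    image, rendered: 2D lists (rows of pixel intensities) with the same shape
    sigma_lik: fixed standard deviation of the pixel noise
    """
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma_lik ** 2)
    total = 0.0
    for row_true, row_pred in zip(image, rendered):
        for p_true, p_pred in zip(row_true, row_pred):
            total += log_norm - 0.5 * ((p_true - p_pred) / sigma_lik) ** 2
    return total

target = [[0.0, 1.0], [0.5, 0.5]]
perfect = gaussian_log_likelihood(target, target, sigma_lik=0.1)
blurry = gaussian_log_likelihood(target, [[0.5, 0.5], [0.5, 0.5]], sigma_lik=0.1)
```

Up to an additive constant, this log-likelihood is the negative squared error scaled by $1 / (2\sigma_{\mathrm{lik}}^2)$, so the reconstruction term of the lower bound plays the same role as NeRF's squared-error loss.
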
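NeRF-VAE
Implementation details
1. The encoder ($q$) embeds each image with a ResNet, averages the resulting features, and converts the average into the parameters of a normal distribution
2. Iterative amortized inference is used at encoder inference time
3. An attention-based architecture is used for the scene function $G_\theta(\cdot, z)$
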
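The first point, averaging per-view features before producing the Gaussian parameters, can be sketched as follows. The feature vectors are made-up numbers standing in for ResNet embeddings, and splitting the pooled vector into mean and log-variance halves is a hypothetical stand-in for the learned mapping.

```python
def pool_features(per_view_features):
    """Average per-view feature vectors into one scene-level feature."""
    n = len(per_view_features)
    dim = len(per_view_features[0])
    return [sum(f[i] for f in per_view_features) / n for i in range(dim)]

def to_gaussian_params(pooled):
    """Split the pooled feature into (mean, log-variance) halves.

    A hypothetical stand-in for the learned mapping to the parameters of q.
    """
    half = len(pooled) // 2
    return pooled[:half], pooled[half:]

# Hypothetical 4-dimensional features for three views of one scene
# (in the model these would come from a ResNet applied to each image).
features = [
    [1.0, 2.0, 0.0, -1.0],
    [3.0, 2.0, 0.0, -1.0],
    [2.0, 2.0, 0.0, -1.0],
]
mu, log_var = to_gaussian_params(pool_features(features))
shuffled = to_gaussian_params(pool_features(features[::-1]))
```

Because the views are averaged, the result does not depend on their order and the same encoder accepts any number of context views.
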
Experiments

Comparison with NeRF
• Works well even with fewer views than NeRF needs
• When there are enough views, NeRF looks better (as expected)

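Comparison with GQN (CONV-AR-VAE)
Rendering consistency
• GQN is not consistent (objects appear and disappear)
• The proposed method has geometric prior knowledge built in through NeRF, so it is always consistent
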
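Comparison with GQN (CONV-AR-VAE)
Out-of-distribution generalization
• GQN cannot properly render viewpoints it has not seen during training
• The proposed method generalizes well

Generating new scenes
• New scenes can also be generated by sampling from the prior
• In principle GQN should be able to do this too, but it probably cannot generate scenes this cleanly

Summary & Impressions
• Proposes NeRF-VAE, a model that can reconstruct and generate unseen scenes by combining NeRF and a VAE
• Enables consistent rendering and the generation of new scenes
Impressions
• A straightforward extension that looks promising, but the experimental results did not feel that strong
• If this scales to scenes as complex as those NeRF handles, it would be quite impressive
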
References
[1] [DL輪読会] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://www.slideshare.net/DeepLearningJP2016/dlnerf-representing-scenes-as-neural-radiance-fields-for-view-synthesis)
[2] [DL輪読会] Survey of Neural Radiance Field (NeRF) follow-up work (https://www.slideshare.net/DeepLearningJP2016/dlneural-radiance-field-nerf?ref=https://deeplearning.jp/)
[3] [DL輪読会] GQN and related work, and their relation to world models (https://www.slideshare.net/DeepLearningJP2016/dlgqn-111725780)


【DL輪読会】NeRF-VAE: A Geometry Aware 3D Scene Generative Model