Learning Spatial Common Sense with Geometry-Aware Recurrent Networks


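Novel View Synthesis
• The task of predicting an image from a new viewpoint given several observations
• Closely related to mental rotation, a phenomenon studied in brain science
  • Humans can rotate an image they hold in their mind
• The CVpaperchallenge presentation on Novel View Synthesis was interesting, so this talk introduces a related paper
Figure: the Shepard and Metzler experiment. Source: よくわかる認知科学 (Inui, Yoshikawa, and Kawaguchi, eds., 2011, Minerva Shobo), p. 61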

Example: Generative Query Network (GQN)
• Acquires a scene representation that aggregates spatial information by training on the novel view synthesis task
• Eslami et al. proposed a method that realizes this within a conditional VAE framework
• For Japanese-language explanations, the slides by Kaneko and Suzuki are easy to follow
  • https://www.slideshare.net/MasayaKaneko/neural-scene-representation-and-rendering-33d
  • https://www.slideshare.net/DeepLearningJP2016/dlgqn-111725780
S. Eslami et al. “Neural Scene Representation and Rendering”, Science, 2018.

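Novel View Synthesis
• The cvpaperchallenge presentation covered an overview of the task, a survey of papers, and open problems
• The open problems raised included the following
  • Category-agnostic novel view synthesis
  • Novel view synthesis for multiple objects
  • Novel view synthesis on real data
  • Generalization to unseen viewpoints
• Impression from the presentation: how to model the scene representation seems to be the key question in recent work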

Novel View Synthesis papers
• Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
• Visual Object Networks: Image Generation with Disentangled 3D Representations
• Transformable Bottleneck Networks
• DeepVoxels: Learning Persistent 3D Feature Embeddings
• Geometry-Aware Recurrent Neural Networks for Active Visual Recognition
• Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence
• Transformation-Grounded Image Generation Network for Novel 3D View Synthesis
• View Synthesis by Appearance Flow
• Learning Spatial Common Sense with Geometry-Aware Recurrent Networks (the paper presented today)

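Bibliographic information
• A paper from a research team at CMU
• The last author, Fragkiadaki, reportedly aims to make machines understand movies
  • This work is presumably positioned as a technique needed for that goal?
• Accepted as a CVPR 2019 oral
• The paper proposes a method that integrates multi-view information into a 3D latent representation
  • Introduces geometric knowledge into a deep learning model
  • Robust to occlusion
• Unless otherwise noted, figures are taken from the paper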

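Overview
• Proposes a method that integrates features extracted from multiple 2D images into a 3D latent representation
  • The method builds differentiable geometric operations (projection, unprojection, ego-motion estimation, etc.) into deep learning
  • The spatial correspondence between the real world and the 3D features is preserved
• The model is trained on the task of predicting novel views from a short image sequence
• It can additionally be trained for 3D segmentation and 3D object detection
  • In detection in particular, it accounts for object permanence (robust to occlusion)
• The authors conclude that this is a necessary technique for giving embodied visual agents spatial common sense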

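Background and motivation
• Recent image recognition models lack the sense of object permanence and the spatial reasoning ability that humans have
  • When objects pass each other in a video, the occluded object should still exist in the hidden region (permanence)
• Such abilities are not acquired through supervised learning on image + label data
→ A new model is needed
• Proposes the Geometry-aware RNN, which integrates a sequence of 2D images into 3D features
  1. Unproject 2D features into 3D space (unprojection)
  2. Estimate ego-motion
  3. Update the 3D scene features with a GRU
• The proposed method draws heavily on ideas from SLAM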

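Advantages of the proposed method
• High generalization performance on the novel view prediction task
  • Substantially outperforms a geometry-unaware method (GQN)
  • Note, however, that this is with ground-truth (GT) ego-motion rather than estimated ego-motion
• Also applicable to 3D segmentation and 3D object detection
  • Detection results are robust to temporary occlusion caused by viewpoint changes
→ A recognition method that understands object permanence!
• Implementation is publicly available
  • https://github.com/ricsonc/grnn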

Proposed method
• The proposed method (figure above) consists of four components
  1. Unprojection
  2. Egomotion estimation and stabilization
  3. Recurrent map update
  4. Projection and decoding

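Unprojection
1. Extract 2D feature maps with a CNN (2D U-net)
2. Unproject the 2D feature maps into 3D space
3. Build a 3D occupancy grid of the same size from the depth map (a tensor that is 1 where an object is present and 0 elsewhere)
4. Extract the 3D feature $\bar{V}_t$ with a 3D U-net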
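Steps 2 and 3 can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: a pinhole camera with a single focal length, a cubic camera-frame voxel grid, nearest-neighbour sampling, and the hypothetical names `unproject`, `focal`, `near`, `far`, and `half_width`. Each voxel copies the feature of the pixel it projects to, and the occupancy channel marks voxels whose depth matches the depth map.

```python
import numpy as np

def unproject(feat2d, depth, focal, grid=5, near=1.0, far=3.0, half_width=1.0):
    """Lift a 2D feature map (H, W, C) and a depth map (H, W) into a
    (grid, grid, grid, C + 1) voxel volume indexed as [z, y, x, channel]."""
    h, w, _ = feat2d.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0            # assumed principal point
    zs = np.linspace(near, far, grid)
    ys = np.linspace(-half_width, half_width, grid)
    xs = np.linspace(-half_width, half_width, grid)
    Z, Y, X = np.meshgrid(zs, ys, xs, indexing="ij")  # voxel centres

    # Project every voxel centre onto the image plane (pinhole model).
    u = np.rint(focal * X / Z + cx).astype(int)
    v = np.rint(focal * Y / Z + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)

    # Every voxel along a viewing ray receives that pixel's feature.
    feat3d = feat2d[v, u] * inside[..., None]

    # Occupancy: 1 where the measured depth falls inside the voxel, else 0.
    half_voxel = (far - near) / (grid - 1) / 2.0
    occupancy = (np.abs(depth[v, u] - Z) <= half_voxel) & inside
    return np.concatenate([feat3d, occupancy[..., None].astype(float)], axis=-1)

feat2d = np.ones((16, 16, 3))        # stand-in for 2D U-net features
depth = np.full((16, 16), 2.0)       # flat surface two units from the camera
volume = unproject(feat2d, depth, focal=8.0)
print(volume.shape)
```

In the pipeline on the slide, a volume like this would then go through the 3D U-net to give $\bar{V}_t$.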

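Egomotion estimation and stabilization
• Assumes the viewpoint changes only in angle, with the distance held fixed
• The 3D tensor $\bar{V}_t$ built from the new viewpoint is rotated by several different angles → $\bar{V}_{rot}$
  • One $\bar{V}_{rot}$ is created for each candidate angle
• Each candidate is scored by its inner product with the current 3D feature memory, and the highest-scoring angle is taken as the estimated ego-motion → $\bar{V}_t'$
  • In practice, a score-weighted average is taken
• The tensor is transformed again by the estimated angle and then fed into the GRU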
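The scoring procedure above can be sketched as below. To stay exact and short, the sketch assumes candidate rotations are multiples of 90° about the vertical axis (so `np.rot90` can be used without interpolation) and a cubic grid. The function name `estimate_rotation` and its arguments are hypothetical. The paper's candidate angles and resampling are not reproduced here.

```python
import numpy as np

def estimate_rotation(v_new, memory, ks=(0, 1, 2, 3)):
    """Score candidate rotations of v_new (D, H, W, C) against the memory.

    Returns the best candidate index k (rotation by k * 90 degrees),
    softmax weights over all candidates, and the stabilized tensor.
    """
    # Rotate in the (depth, width) plane, i.e. about the vertical axis.
    candidates = [np.rot90(v_new, k=k, axes=(0, 2)) for k in ks]
    # Inner product between each rotated tensor and the memory.
    scores = np.array([np.sum(c * memory) for c in candidates])
    # Softmax weights; a soft version would average with these weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    best = int(np.argmax(scores))
    return ks[best], weights, candidates[best]

rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 4, 4, 2))
v_new = np.rot90(memory, k=-1, axes=(0, 2))   # same scene, seen rotated by -90°
k, weights, stabilized = estimate_rotation(v_new, memory)
print(k, np.allclose(stabilized, memory))
```

The hard argmax is used here for readability. The slide notes that a score-weighted average is taken in practice, which keeps the operation differentiable.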

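Recurrent map update
• The 3D feature $\bar{V}_t'$, aligned by egomotion estimation, is fed into a 3D convolutional Gated Recurrent Unit (GRU) layer
• The hidden state, the 3D feature memory $m_t$, is updated at each step
• $m_t$ is initialized to 0
• For the novel view prediction task, simply averaging instead of using the GRU gave similar performance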
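Both update rules can be sketched in a few lines. The GRU below is a simplification that is an assumption of this sketch: it uses 1×1×1 kernels (an independent linear map at every voxel) rather than full 3D convolutions, and the parameter names `Wz`, `Wr`, and `Wh` are hypothetical. `mean_update` shows the averaging alternative mentioned in the last bullet.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(m, v, params):
    """One GRU step per voxel. m, v: (D, H, W, C); weights: (2C, C)."""
    x = np.concatenate([m, v], axis=-1)
    z = sigmoid(x @ params["Wz"])                       # update gate
    r = sigmoid(x @ params["Wr"])                       # reset gate
    cand = np.tanh(np.concatenate([r * m, v], axis=-1) @ params["Wh"])
    return (1.0 - z) * m + z * cand                     # new memory

def mean_update(m, v, t):
    """Running mean of the aligned features after t previous views."""
    return (t * m + v) / (t + 1)

rng = np.random.default_rng(0)
C = 3
params = {name: rng.normal(size=(2 * C, C)) for name in ("Wz", "Wr", "Wh")}
m0 = np.zeros((4, 4, 4, C))                             # memory starts at 0
v1, v2 = rng.normal(size=(2, 4, 4, 4, C))
m_gru = gru_update(m0, v1, params)
m_avg = mean_update(mean_update(m0, v1, 0), v2, 1)
print(m_gru.shape, np.allclose(m_avg, (v1 + v2) / 2))
```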

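Projection and decoding
• The part of the network that uses the resulting 3D feature memory $m_t$ to perform a task
• Given a query viewpoint $q$, transform $m_t$ accordingly
• Project it into 2D features for each depth value and stack them → $p_q$
• Decode $p_q$ with a convLSTM into the RGB image corresponding to $q$
• Object visibility is not given explicitly; it is left to the network to compute
→ The tasks considered are view prediction and 3D MaskRCNN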
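The projection step (one 2D feature map per depth value, stacked along channels) can be sketched as follows. It assumes the memory has already been rotated into the query camera frame, a pinhole camera, and nearest-neighbour sampling. The name `project` and its parameters are hypothetical, and the convLSTM decoder is not shown.

```python
import numpy as np

def project(memory, focal, out_hw, near=1.0, far=3.0, half_width=1.0):
    """Project a (D, H, W, C) memory, indexed [z, y, x, channel], into an
    (h, w, D * C) stack of per-depth 2D feature maps."""
    d, gh, gw, _ = memory.shape
    h, w = out_hw
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0              # assumed principal point
    zs = np.linspace(near, far, d)
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    planes = []
    for k, z in enumerate(zs):
        # Back-project each pixel to the 3D point at depth z.
        x = (u - cx) * z / focal
        y = (v - cy) * z / focal
        # Nearest voxel index inside the depth slice k.
        ix = np.rint((x + half_width) / (2 * half_width) * (gw - 1)).astype(int)
        iy = np.rint((y + half_width) / (2 * half_width) * (gh - 1)).astype(int)
        inside = (ix >= 0) & (ix < gw) & (iy >= 0) & (iy < gh)
        sampled = memory[k, np.clip(iy, 0, gh - 1), np.clip(ix, 0, gw - 1)]
        planes.append(sampled * inside[..., None])
    return np.concatenate(planes, axis=-1)

memory = np.zeros((5, 5, 5, 2))
memory[2] = 1.0                                         # features only at the middle depth
stack = project(memory, focal=8.0, out_hw=(16, 16))
print(stack.shape, stack[8, 8])
```

Because nothing in this step decides which depth is actually visible at a pixel, the decoder has to resolve visibility itself, which matches the remark on the slide.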

Projection and decoding
• For view prediction, $m_t$ is decoded as in the figure on the right to render the image
• For 3D MaskRCNN, ROI pooling is applied to the candidate regions of $m_t$, and those regions are decoded to produce object masks

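Experiments
• The experiments address the following questions
  1. Do GRNNs learn spatial common sense?
  2. Is a geometry-aware network architecture necessary for acquiring spatial common sense?
  3. How well do GRNNs perform?
• Spatial common sense here refers broadly to human spatial reasoning abilities in general
  • 3D shapes can be generated by inflating 2D planes
  • Scenes are composed of objects
  • 3D objects do not intersect
  • Objects do not suddenly cease to exist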

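View prediction
• The task of predicting the image from a new viewpoint given multiple input images
• Datasets used in the experiments
  • ShapeNet: training data prepared so that each scene contains two observed objects
  • Shepard-Metzler: Tetris-like block objects
  • Rooms-ring-camera dataset: scenes with objects placed at random in a room
• Baseline: GQN
  • To match conditions, the proposed method did not use GT depth maps but did use GT ego-motion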

View prediction
• The proposed method has lower reconstruction error
  • It predicts more accurately

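View prediction
• Setting where the number of objects is increased to four at test time only
• The results (figure above) demonstrate that the proposed method generalizes well: it predicts well even in an unseen setting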

View prediction
• Addition and subtraction of feature representations

3D object detection and segmentation
• Concretely, the task performed is instance segmentation
• The dataset is built from ShapeNet
• Evaluated with mean Average Precision (mAP)
• Four settings are compared
  • Geometry-unaware model + GT ego-motion + GT depth
  • GRNN + GT ego-motion + estimated depth
  • GRNN + GT ego-motion + GT depth
  • GRNN + estimated ego-motion + GT depth
  • Why is the setting where both are estimated not tried…?

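• GRNN + GT ego-motion + GT depth gives the best results (as expected)
• At mAP0.75, GRNN outperforms the geometry-unaware model
  • This shows the advantage of the geometry-aware proposed method
• Being able to estimate both ego-motion and depth would be even better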
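For reference, mAP0.75 is read here in the usual sense: a detection counts as correct only if its intersection over union (IoU) with a ground-truth object is at least 0.75. A minimal sketch of that matching test is below. Axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` are an assumption made for brevity, and the function names are hypothetical. The evaluation in the paper may use masks or a different box format.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0                      # boxes do not intersect
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

def is_match(pred, gt, threshold=0.75):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou_3d(pred, gt) >= threshold

gt = (0, 0, 0, 2, 2, 2)
shifted = (1, 0, 0, 3, 2, 2)                # shifted by half a box width
print(iou_3d(gt, shifted), is_match(shifted, gt))
```

A box shifted by half its width already drops to an IoU of one third, so the 0.75 threshold rewards accurate localization.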

• Integrating observations from multiple viewpoints yields detection that is robust to occlusion

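Summary
• Proposes a network architecture that builds 3D features from a sequence of 2D images in order to acquire spatial common sense
  • Realized by using differentiable geometric operations such as unprojection and ego-motion estimation
• Showed low reconstruction error and strong generalization on the novel view prediction task
• Confirmed occlusion-robust detection in 3D object detection & segmentation
• Future work
  • Models applicable to real data, dynamic scenes, etc.
  • Improving computational efficiency by exploiting the sparsity of the 4D tensor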

Related work
• Accepted at NeurIPS 2018
• A paper from the same research group
• Applies a similar system to an agent that actively changes its observation position
• Learned a policy that moves the viewpoint toward more informative directions
R. Cheng et al. “Geometry-Aware Recurrent Neural Networks for Active Visual Recognition”, NeurIPS, 2018.
R. Cheng et al. Supplemental materials of “Geometry-Aware Recurrent Neural Networks for Active Visual Recognition”, NeurIPS, 2018.

Bonus: Novel View Synthesis survey

View Synthesis by Appearance Flow
• Solves the novel view synthesis task by estimating a flow from the 2D source image.
• The overall framework is shown in the figure below. It can be trained end-to-end.
T. Zhou et al. “View Synthesis by Appearance Flow”, in ECCV, 2016.
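The core operation, copying source pixels to the target view according to a flow field, can be sketched as bilinear sampling. In this sketch the flow stores, for every target pixel, the absolute `(x, y)` source coordinate to read from. That convention and the name `warp_by_flow` are assumptions for illustration. In the method itself the flow would come from a network, which is not shown here.

```python
import numpy as np

def warp_by_flow(src, flow):
    """Build the target image by bilinearly sampling src (H, W, C) at the
    source coordinates flow[..., 0] = x and flow[..., 1] = y."""
    h, w, _ = src.shape
    x = np.clip(flow[..., 0], 0, w - 1)
    y = np.clip(flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * src[y0, x0] + wx * src[y0, x1]
    bottom = (1 - wx) * src[y1, x0] + wx * src[y1, x1]
    return (1 - wy) * top + wy * bottom

src = np.arange(16, dtype=float).reshape(4, 4, 1)
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
identity = np.stack([xs, ys], axis=-1).astype(float)
same = warp_by_flow(src, identity)                  # identity flow copies the image
shifted = warp_by_flow(src, identity + [1.0, 0.0])  # read one pixel to the right
print(np.allclose(same, src), np.allclose(shifted[:, :3], src[:, 1:]))
```

Because bilinear sampling is differentiable with respect to the flow, the whole pipeline can be trained end-to-end from a reconstruction loss.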

Transformation-Grounded Image Generation Network for Novel 3D View Synthesis
• For the novel view synthesis task, proposes a framework in which the parts of the object in the new view that are visible in the source image are copied from it, and the remaining parts are generated with a GAN. The network consists of a disocclusion-aware appearance flow network (DOAFN) and a completion network.
• Achieved better results than the earlier Appearance Flow Network (AFN).
E. Park et al. “Transformation-Grounded Image Generation Network for Novel 3D View Synthesis”, in CVPR, 2017.

Visual Object Networks: Image Generation with Disentangled 3D Representations
• Proposes a 3D-aware image generation method
• Generate a 3D shape → convert it to a depth image and mask for the target viewpoint → render to an image with a CNN given a texture code
J. Zhu et al. “Visual Object Networks: Image Generation with Disentangled 3D Representations”, in NeurIPS, 2018.

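Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence
• Proposes a method that generates a novel-view image from images at multiple viewpoints. The framework consists of a Flow Predictor and a Recurrent Pixel Generator: the former estimates the flow from the source images to the target image, and the latter tries to reconstruct the image directly from the input. Finally, the two are combined, weighted by confidence.
• Experiments used 3D CG objects, and the method was state of the art at the time.
S. Sun et al. “Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence”, in ECCV, 2018.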
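The final confidence-weighted combination can be sketched as a per-pixel weighted average over candidate images. Normalizing the confidences with a softmax across candidates is an assumption of this sketch, and `aggregate` is a hypothetical name. How the confidences themselves are predicted is not shown.

```python
import numpy as np

def aggregate(predictions, confidences):
    """Combine N candidate images (N, H, W, C) using per-pixel confidence
    scores (N, H, W), normalized with a softmax over the N candidates."""
    conf = confidences - confidences.max(axis=0, keepdims=True)  # stability
    weights = np.exp(conf)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * predictions).sum(axis=0)

# Two candidates: e.g. one from the flow branch, one from the pixel generator.
predictions = np.stack([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
equal = aggregate(predictions, np.zeros((2, 2, 2)))
confident = aggregate(predictions, np.stack([np.zeros((2, 2)), np.full((2, 2), 50.0)]))
print(equal[0, 0], confident[0, 0])
```

With equal confidences the result is the plain average. As one candidate's confidence grows, the output approaches that candidate, so each pixel can be taken from whichever branch is more reliable there.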

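Transformable Bottleneck Networks
• Proposes a method that lets a CNN perform 3D edits on 2D images.
• Extracts 3D features from the image, applies a transformation for the target pose, then projects to 2D and performs downstream tasks such as image reconstruction.
• This enables 3D-aware image editing that goes beyond rigid transformations.
K. Olszewski et al. “Transformable Bottleneck Networks”, 2019.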

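DeepVoxels: Learning Persistent 3D Feature Embeddings
• Proposes a method that condenses an image sequence into a single voxel representation.
• The framework processes data in the following order.
  • Extract 2D features from an image → lift the 2D features to 3D features → do this over the image sequence and integrate with a GRU → project the 3D features to the target viewpoint and reconstruct the image
• The whole framework is trained with this reconstruction error.
• The proposed method performs well on novel view synthesis.
V. Sitzmann et al. “DeepVoxels: Learning Persistent 3D Feature Embeddings”, in CVPR, 2019.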

References
• S. Eslami et al. “Neural Scene Representation and Rendering”, Science, 2018.
• T. Zhou et al. “View Synthesis by Appearance Flow”, in ECCV, 2016.
• E. Park et al. “Transformation-Grounded Image Generation Network for Novel 3D View Synthesis”, in CVPR, 2017.
• J. Zhu et al. “Visual Object Networks: Image Generation with Disentangled 3D Representations”, in NeurIPS, 2018.
• S. Sun et al. “Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence”, in ECCV, 2018.
• K. Olszewski et al. “Transformable Bottleneck Networks”, 2019.
• V. Sitzmann et al. “DeepVoxels: Learning Persistent 3D Feature Embeddings”, in CVPR, 2019.
• R. Cheng et al. “Geometry-Aware Recurrent Neural Networks for Active Visual Recognition”, NeurIPS, 2018.
• H. Tung et al. “Learning Spatial Common Sense with Geometry-Aware Recurrent Networks”, in CVPR, 2019.
