BA-Net: Dense Bundle Adjustment Network （3D勉強会@関東）

BA-Net: Dense Bundle Adjustment Network
C.Tang+, arXiv18
@denkiwakame
1
2018/09/29 3D勉強会@関東

about.me
@denkiwakame
● 〜2015 京都大学松山研究室（B4〜M2）
● 〜2017 某企業研
● 2017〜都内ベンチャー
Interests
● Generalized Camera Calibration [M.Nishimura+,ICCV15]
● MRF optimization (low-level vision)
● Model Compression
● GPGPU (CUDA)，SIMD
● Quantum Computing
2
[M.Nishimura+, ICCV15] A Linear Generalized Camera Calibration from Three Intersecting
Reference Planes

前回
● 3D 勉強会@関東 (2018/05/17)
○ https://togetter.com/li/1231482?page=5
3

前回
● CVPR2018 (2018/06/19-21)
4
CVPR18 honorable mention

CodeSLAM [M.Bloesch+,CVPR18]
● VAE によって depth の圧縮表現（code）を得る
5

● VAE によって depth の圧縮表現（code）を得る
6
● Pose (R,t) - depth optimization が解ける
○ gauss-newton 法で最適化（ヤコビアンが計算できる）
〜128dim
parametrized depth
pose
photometric error
geometric error

depth parametrization
● pose / depth estimation は BAと独立
○ joint optimization の結果が VAE に feedback されることはない
7
Bundle Adjustment / GN
(pose-geo optimization)
camera pose estimation

BA-Net : Dense Bundle Adjustment Network
● 特徴抽出，depth の圧縮表現学習，バンドル調整を全てNN
で行う
8
SfM の一連の要素を end-to-end でモデリング

Bundle Adjustment - バンドル調整とは？
● 基準点とカメラを結ぶ光線束（=bundle）を撮影画像を用いて
調整し，カメラの位置姿勢を最適化する
○ 画像からその背後にある幾何学的なパラメータを推定することが目的
○ 主要な対象は形状復元や姿勢推定だが，幅広い問題が対象となり得る
9
Bill Triggs, Philip McLauchlan, Richard Hartley and Andrew Fitzgibbon
Bundle Adjustment -- A Modern Synthesis

Bundle Adjustment - historical overview
● [T.Bill+,VA99] Bundle Adjustment - a modern
synthesis
10
Triggs, Bill, et al. "Bundle adjustment—a modern synthesis." International workshop on vision
algorithms. Springer, Berlin, Heidelberg, 1999. (鉄板)

Bundle Adjustment Revisit
● Geometric BA (e.g. ORB-SLAM)
○ Minimize Feature Reprojection Error
○ sparse
● Photometric BA (e.g. LSD-SLAM)
○ Minimize Photometric Error
○ dense/semi-dense
11
● Feature BA (proposed)
○ Minimize Learned-Feature Reprojection Error
○ dense

(1) Geometric BA
● Indirect methods (ORB-SLAM, ….)
○ 再投影誤差を最小化
○ 計算コストが低い
● Drawbacks
○ 画像上の reference point のみ考慮
○ ノイズや歪みに弱い (RANSAC必須)
12
Related Work

(2) Photometric BA
● Direct methods (LSD-SLAM, ...)
○ (再投影された) 画素の輝度値の差を最小化
○ テクスチャが疎らな環境でも有効
● Drawbacks
○ 初期値に依存（非凸性が強い）
○ 照明環境の変化に弱い（photometric calib 必須）
○ 動物体など，outlier に弱い
13
Related Work

(3) feature BA
● “Learned” direct methods
○ SfM に適する特徴を学習する
○ 再投影された learned-feature F の誤差を最小化
● Improvements
○ 照明環境の変化に頑健な特徴を学習で獲得できる
○ outlier (e.g. moving objects) に強い特徴を獲得
■ それはちょっと言い過ぎでは...?
14
Proposed
camera
parameters
depth
(prameterized)
inference from BA-Net
learned-feature from BA-Net

● Feature learning for SfM (2-view setting)
○ BA-Net の minor contribution
Network overview - feature & depth learning
15
depth parametrization
sub-network
feature pyramid
network
base-network
[F.Yu+,CVPR17]

Feature Pyramid
● DRN-54 から階層的な特徴を抽出（like FPN）
○ Ci : pretrained (DRN-54)
○ Fi : learned hierarchical feature
○ 〜128 dim ぐらいが精度のバランスが良かったらしい
16

Feature Pyramid
● “learned” feature によってマッチング適する（凸性の高い）
特徴を獲得
○ C3: DRN-54 feature (pretrained on IMAGENET)
○ F3: learned feature
17

Depth parametrization
● Why raw depth representation is intractable ?
○ too many parameters to be optimized (intractable)
■ 320x240 pixels input → 76.8k parameters !
■ pose-geo optimization が困難
○ difficulty in initialization
■ R,t のパラメータが不完全な初期状態では，non-overwrapped view に
depth が飛んでいってしまう
■ 個々のdepth を独立に扱う方式ではうまく学習できない
● Autoencoder で parametrize
○ CodeSLAM と異なり，VAE ではなく standard auto-encoder を使う
○ 128 dim の基底ベクトルとして圧縮表現を学習
18

● “differentiable” bundle optimization layer
○ BA-Net の major contribution
Network overview - geometric constraint
19
camera
parameters
depth
(parameterized)
camera pose
loss
depth map
loss

Levenverg-Marquardt algorithm revisit
● objective:
● updates:
○ 適当な初期値 x を与える
○ E(x) のヤコビ行列 J ，A=J^TJ の対角行列 D を計算
○ ↓を解いてΔx を算出
○ ΔE = E(x+Δx) - E(x) を計算後，x → x+Δx と更新
20
damping
factor
non-negative
diagonal matrixjacobian of E(x)
この手続きを NN 化し，backprop でパラメータ更新することを考える

● Difficulties:
○ LM法の反復解法は収束閾値に達すると打ち切り
■ 条件分岐を含む非連続関数になってしまう（微分不能）
○ 各反復試行において，目的関数に応じ λ を操作
■ E(x) が減少するまでλを大きくし続ける（減少した際にλを小さくする)
■ 条件分岐（ry
● Simple yet effective approach
○ 固定回数の反復を実装する (※)
○ lambda を network で推論
Differentiable LM layer
21
※ [J.Domke+, AISTATS12] Generic Methods for Optimization-Based Modeling

● previous iteration の x を元に，E(x) を計算
● J(x) をE(x)から計算，D(x) を A=J(x)^TJ(x) から計算
● damping factor λを算出
● λを用いてΔx を算出
22
Δx が F, λ, x の関数として表現
（F について differentiable）

● Lambda-prediction layers
○ 全画素から E(x) を集め，128dim として入力
○ 4層の FC層からなる（ReLU で >0 に強制）
○ E(x) から λに回帰する非線形関数を学習する?
23
※ 古典手法では E(x+Δx) - E(x) の値に応じて
ルールベースでλを設定

● Coarse-to-fine optimization
○ LSD-SLAM [J.ENgel+,ECCV14] や DSO
[J.Engel+,TPAMI17] と同様，Coarse-to-fine BA を適用
○ feature pyramid の各 layer で 5-iteration
○ 計 15-iteration を 1回の forward-pass で反復
24

関連：CRF as RNN
● Semantic Segmentation において，後段に平均場近似
（mean-field approximation）の手続きをRNNで実装
○ CRFの最適化は post-processing として当初分離されていた
○ RNN で平均場近似を実装することにより，CRF の学習と最適化が1つのネッ
トワークで行えるようになった
25[S.Zheng+, ICCV15] Conditional Random Fields as Recurrent Neural Networks

Training Objective
● Camera Pose Loss
○ 普通の rotation (quaternion), translation loss
● Depth Map Loss
○ berHu Loss [L.Zwald+,arXiv12] が単純なL2ノルムよりも良い
26

Dataset
● ScanNet
○ large-scale indoor dataset with 1513 sequences in
706 different scenes
■ training: 1413 sequences (547991 images)
■ testing: 100 sequences (2000 images)
● KITTI
○ generated camera poses from LibVISO2
27

Quantitative Evaluation (on ScanNet)
● outperformed DeMoN[B.Ummenhofer+,CVPR17]
○ ours*, DeMoN* : trained on SUN3D dataset
○ 別データ（SUN3D）で trained したモデルでも DeMoN を上回る
● outperformed also geometric/photometric BA
○ geometric BA が苦手な indoor scenes (テクスチャが貧しい) に有効な特徴
を学習できた
○ photometric BA よりも凸性の高い特徴が有効だった
28

Qualitative Evaluation (on ScanNet)
● DeMoN [B.Ummenhofer+,CVPR17] よりdepth が再現
29

Qualitative Evaluation (on KITTI)
● unsupervised [Wang+,CVPR18], supervised
[Godard+,CVPR17] 両方を outperformed
30
[Wang+,CVPR18] Learning depth from monocular videos using direct methods.
[C. Godard+,CVPR17] Unsupervised monocular depth estimation with left-right consistency.

Ablation Studies
● Learned-feature vs Pre-trained feature
○ 目的関数の凸性が向上し，収束が早まった
○ IMAGENET で pretrain した network とerror 率を比較
31

Ablation Studies
● BA-layer vs pose estimation
○ fixed depth map に対し pose estimation をした結果と比較
○ BA-layer (jointly optimize depth and camera pose) が優位
● to make the comparison fair:
○ どちらも learned-feature F を用いる
32

Ablation Studies
● Differentiable LM vs Damped gauss newton
○ λ=prefixed な damped gauss newton 法との比較
■ CodeSLAM[CVPR18] で利用
○ λが固定な方法に比べて，λを推論しつつ update する BA-Net の結果が大き
く上回る
● Again, for fair comparison,
○ 両方で learned-feature F を利用
33

Conclusion
● Contributions:
○ SfM に適した特徴を自動獲得する feature BA を提案
○ LM法を differentiable に変形した BA-layer を提案
○ learned-feature representation + depth parametrization + BA
optimization をend-to-end で行うことが可能になった
● Future work
○ multi-view extension
■ 本論文では two-view の検証のみ
■ 1 camera 1 network （intrinsic parameter に相当する特性を内包）
■ 原理上は easily extensible
34

Learning to solve NLLS [R.Clark+,ECCV18] との比較
● LS-Net: Learning to Solve Nonlinear Least Squares for Monocular Stereo
○ CodeSLAM [R.Clark+,CVPR18] グループの成果
○ gauss-newton の optimizer 自体を学習する（meta-learning）
● BA-Net: Dense Bundle Adjustment Network
○ LM法をNNで実装
35
Appendix

Meta-learning
● Learning to learn gradient descent by gradient descent [NIPS16]
○ https://github.com/deepmind/learning-to-learn
○ Optimizer を用いて（SGDとか）Optimizer 自体を学習する
36
Appendix
データに応じて，最適な g を学習できる！（はず）
なんだこの絵は...

Learning to solve NLLS [R.Clark+,ECCV18] との比較
● LS-Net: Learning to Solve Nonlinear Least Squares for Monocular Stereo
● BA-Net: Dense Bundle Adjustment Network
37
Appendix
Jacobian，残差を入力とする optimizer 自体をRNN-LSTM で学習
λのみNNで表現し，LM法は計算グラフとして実装

Meta-learning based optimization
● https://youtu.be/Q1Xm_fl2GeU
38
Appendix

updated version of BA-Net !
https://openreview.net/forum?id=B1gabhRcYX
39
Appendix

BA-Net: Dense Bundle Adjustment Network （3D勉強会@関東）

More Related Content

What's hot

Similar to BA-Net: Dense Bundle Adjustment Network （3D勉強会@関東）

BA-Net: Dense Bundle Adjustment Network （3D勉強会@関東）