Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

DL Seminar
Depth Prediction Without the Sensors: Leveraging
Structure for Unsupervised Learning from
Monocular Videos
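Graduate School of Information Science and Technology, Hokkaido University
Harmonious Systems Engineering Laboratory
First-year master's student: 森雄斗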

Paper information
• Title
– Depth Prediction Without the Sensors: Leveraging Structure
for Unsupervised Learning from Monocular Videos
• Venue
– AAAI 2019
• Authors
– Vincent Casser¹*, Soeren Pirk, Reza Mahjourian², Anelia Angelova
• Google Brain
• ¹ Institute for Applied Computational Science, Harvard University; Google Brain
• ² University of Texas at Austin; Google Brain
• GitHub
– https://github.com/tensorflow/models/tree/master/research/struct2depth
(TensorFlow 1.x)
• Website (struct2depth)
– https://sites.google.com/view/struct2depth

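Overview
• Estimates scene depth and robot ego-motion (the motion of the camera/robot)
from monocular camera input using unsupervised learning
• Achieves accuracy comparable to stereo-camera depth prediction and substantially
improves depth prediction in scenes with many moving objects
• Handles moving between different environments, such as indoor and outdoor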

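Background
• Predicting depth from camera video is important for indoor and outdoor
robot navigation (obstacle avoidance, path planning)
• Supervised learning of depth prediction requires expensive depth sensors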

Prior work
Unsupervised learning of depth and ego-motion from video
(Zhou et al. 2017)
• Uses a monocular camera rather than a stereo camera
• Predicts depth and ego-motion from camera images with deep neural
networks

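Proposed method
• Improves on the prior method
– Motion Model
• Models individual objects using instance segmentation
– Imposing Object Size Constraints
• Regularization based on object size prevents extreme errors
– Test Time Refinement Model
• Tuning parameters online enables domain transfer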

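Problem Setup
• Monocular camera images: $(I_1, I_2, I_3) \in \mathbb{R}^{H \times W \times 3}$
• Depth function $\theta: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H \times W}$
• Depth map $D_i = \theta(I_i)$
• Ego-motion network $\psi_E: \mathbb{R}^{2 \times H \times W \times 3} \to \mathbb{R}^{6}$
– Maps two RGB frames to a 6-DoF vector $(t_x, t_y, t_z, r_x, r_y, r_z)$
– PoseCNN: $E_{1 \to 2} = \psi_E(I_1, I_2)$
• Warping operator $\phi(I_i, D_j, E_{i \to j}) \to \hat{I}_{i \to j}$
– Estimates the next image from an image, a depth estimate, and the ego-motion
• Reconstruction loss: $L_{rec} = \min(\| \hat{I}_{1 \to 2} - I_2 \|)$
– The error is the difference between the estimated image and the actual image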
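The geometric core of the warping operator can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the camera intrinsics matrix `K`, the Euler-angle convention in `euler_to_rotation`, and the function names are all hypothetical, and the final step of sampling colors at the computed coordinates (for example with bilinear interpolation) is omitted.

```python
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix Rz @ Ry @ Rx from Euler angles in radians (one common convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def reproject(depth, K, ego):
    """Map every pixel of a frame with depth map `depth` (H, W) into another view.

    `ego` is the 6-vector (tx, ty, tz, rx, ry, rz). Returns an (H, W, 2) array
    holding the (x, y) pixel coordinates of each pixel in the other view.
    """
    H, W = depth.shape
    tx, ty, tz, rx, ry, rz = ego
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(float)
    # Back-project to 3D camera coordinates using the depth of each pixel
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Apply the rigid ego-motion transform
    cam = euler_to_rotation(rx, ry, rz) @ cam + np.array([[tx], [ty], [tz]])
    # Project back onto the image plane
    proj = K @ cam
    return (proj[:2] / proj[2:3]).T.reshape(H, W, 2)

# Hypothetical intrinsics and a flat scene 2 units away
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
coords_identity = reproject(depth, K, (0, 0, 0, 0, 0, 0))
coords_shifted = reproject(depth, K, (0.1, 0, 0, 0, 0, 0))
```

With zero ego-motion every pixel maps to itself; a sideways translation shifts each pixel by the focal length times the translation divided by depth, which is why nearby points move more than distant ones.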

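Algorithm Baseline
• $L_{rec} = \min(\| \hat{I}_{1 \to 2} - I_2 \|, \| \hat{I}_{3 \to 2} - I_2 \|)$
– Computes the error against the middle frame from both the previous and the next frame and takes the minimum [1]
• $L = \alpha_1 \sum_{i=0}^{3} L_{rec}^{(i)} + \alpha_2 L_{ssim}^{(i)} + \alpha_3 \frac{1}{2^i} L_{sm}^{(i)}$
– The total loss combines the reconstruction loss, an SSIM (image-quality) loss [2], and a
depth-map smoothness loss [3]
– $\alpha_1, \alpha_2, \alpha_3$: hyperparameters weighting the three terms; the sum over $i = 0, \dots, 3$ runs over 4 scales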
[1] Godard, C., Mac Aodha, O., and Brostow, G. J. "Unsupervised monocular depth
estimation with left-right consistency." Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017.
[2] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. "Image quality assessment: from error
visibility to structural similarity." IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612,
April 2004. doi: 10.1109/TIP.2003.819861.
[3] Zhou, T., et al. "Unsupervised learning of depth and ego-motion from video." Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
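The baseline losses on this slide can be sketched in NumPy as below. This is an illustration under assumptions rather than the reference implementation: the norm is taken to be a per-pixel L1 error, the minimum is taken per pixel, the sum over scales is assumed to cover all three terms, and the function names and the weights in the example call are hypothetical.

```python
import numpy as np

def min_reconstruction_loss(warp_prev, warp_next, target):
    """Per-pixel minimum of the two L1 photometric errors, averaged over the image."""
    err_prev = np.abs(warp_prev - target).mean(axis=-1)  # (H, W), mean over RGB
    err_next = np.abs(warp_next - target).mean(axis=-1)
    return np.minimum(err_prev, err_next).mean()

def total_loss(l_rec, l_ssim, l_sm, alphas):
    """Weighted sum over scales i = 0..3; smoothness is down-weighted by 1 / 2**i."""
    a1, a2, a3 = alphas
    return sum(a1 * l_rec[i] + a2 * l_ssim[i] + a3 * l_sm[i] / 2 ** i
               for i in range(len(l_rec)))

# Toy data: the middle frame is black, the two warped frames disagree with it
target = np.zeros((2, 2, 3))
warp_prev = np.ones((2, 2, 3))   # error 1 everywhere
warp_next = np.zeros((2, 2, 3))
warp_next[0] = 2.0               # error 2 in the top row, 0 in the bottom row
rec = min_reconstruction_loss(warp_prev, warp_next, target)

# Hypothetical per-scale loss values and unit weights
combined = total_loss([1, 1, 1, 1], [0, 0, 0, 0], [8, 8, 8, 8], alphas=(1.0, 1.0, 1.0))
```

Taking the minimum per pixel means a pixel that is occluded in one neighboring frame can still be explained by the other, so occlusions contribute less spurious error.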

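Motion Model
• Not only models objects in 3D but also predicts their motion in
3D: $\psi_M$
• Instance segmentation masks:
– $(S_{i,1}, S_{i,2}, S_{i,3}) \in \mathbb{N}^{H \times W}$
• Mask object motion out of the images
– $O_0(S)$
• Ego-motion model
– $V = O_0(S_1) \odot O_0(S_2) \odot O_0(S_3)$: the region that contains no objects in any of the 3 frames
– $E_{1 \to 2}, E_{2 \to 3} = \psi_E(I_1 \odot V, I_2 \odot V, I_3 \odot V)$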
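The static-region mask $V$ can be sketched in NumPy as follows. The sketch assumes that each segmentation map stores an integer instance id per pixel and that id 0 marks the background; both the id convention and the function names are assumptions made for illustration.

```python
import numpy as np

def background_mask(seg):
    """O_0(S): 1 where the pixel belongs to no object (id 0 assumed to be background)."""
    return (seg == 0).astype(np.float32)

def static_region(seg1, seg2, seg3):
    """V: pixels that are background in all three frames."""
    return background_mask(seg1) * background_mask(seg2) * background_mask(seg3)

def mask_image(image, V):
    """I ⊙ V: zero out every pixel outside the static region (broadcast over RGB)."""
    return image * V[..., None]

# Three tiny 2x2 segmentation maps with one object each in the first two frames
seg1 = np.array([[0, 1], [0, 0]])
seg2 = np.array([[0, 0], [2, 0]])
seg3 = np.zeros((2, 2), dtype=int)
V = static_region(seg1, seg2, seg3)
masked = mask_image(np.ones((2, 2, 3), dtype=np.float32), V)
```

A pixel survives only if no frame in the triplet has an object there, so the ego-motion network sees nothing that could have moved independently of the camera.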

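Motion Model
• Motion estimate for the $i$-th object:
– $M_{1 \to 2}^{(i)}, M_{2 \to 3}^{(i)} = \psi_M(\hat{I}_{1 \to 2} \odot O_i(\hat{S}_{1 \to 2}), I_2 \odot O_i(S_2), \hat{I}_{3 \to 2} \odot O_i(\hat{S}_{3 \to 2}))$
– $M_{1 \to 2}^{(i)}, M_{2 \to 3}^{(i)} \in \mathbb{R}^{6}$
– Ego-motion is not taken into account at this stage
• Final output ($i$ is the object index)
– (equation image not preserved)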

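Imposing Object Size Constraints
• In prior work, an object that moves together with the camera was judged to be
a static object infinitely far away
– Giving each instance-segmentation class prior knowledge of its size
avoids such extreme errors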
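One way to see how a size prior pins down depth is the pinhole-camera relation: an object of real height p that spans h pixels in an image with vertical focal length f_y lies at a depth of roughly f_y * p / h. The sketch below illustrates that relation and a simple penalty built on it. The slide does not give the loss formula, so the penalty form, the function names, and the numbers are assumptions for illustration only, not the paper's exact constraint.

```python
def approx_depth(focal_y_px, prior_height_m, pixel_height):
    """Pinhole relation: depth implied by a known real height and an observed pixel height."""
    return focal_y_px * prior_height_m / pixel_height

def size_constraint_penalty(predicted_depth_m, focal_y_px, prior_height_m, pixel_height):
    """Absolute gap between a predicted object depth and the depth implied by the size prior."""
    implied = approx_depth(focal_y_px, prior_height_m, pixel_height)
    return abs(predicted_depth_m - implied)

# Hypothetical numbers: a 1.5 m tall car spanning 54 pixels, focal length 720 px
print(approx_depth(720.0, 1.5, 54.0))                     # 20.0
# A near-infinite depth prediction for that car is heavily penalized
print(size_constraint_penalty(1000.0, 720.0, 1.5, 54.0))  # 980.0
```

Because the implied depth is finite for any visible object, a penalty of this kind pushes the network away from the "infinitely far" solution described above.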

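Test Time Refinement Model
• Rather than freezing the model weights during inference, tuning them online
is advantageous for autonomous systems
• Specifically, using 3-frame image sequences makes it possible to substantially
improve the quality of depth prediction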
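The idea of refining weights at inference time can be sketched with a toy example. Here a quadratic function stands in for the self-supervised loss computed on the current frames, and plain gradient descent stands in for the optimizer; the step count, learning rate, and all names are arbitrary assumptions, not the settings used by the authors.

```python
import numpy as np

def refine(weights, grad_fn, steps=20, lr=0.1):
    """Return a refined copy of `weights` after a few gradient steps.

    The input array is left untouched so the pretrained model can be reused
    as the starting point for the next sequence.
    """
    w = np.array(weights, dtype=float, copy=True)
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

# Toy stand-in for a self-supervised loss on the current frames: L(w) = ||w - target||^2
target = np.array([1.0, -2.0])

def loss(w):
    return float(np.sum((w - target) ** 2))

def grad(w):
    return 2.0 * (w - target)

pretrained = np.zeros(2)
refined = refine(pretrained, grad)
loss_before, loss_after = loss(pretrained), loss(refined)
```

No labels are needed for these updates because the loss is computed from the input frames themselves, which is what makes the refinement usable online.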

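Datasets for model evaluation
• KITTI dataset
– Used to evaluate depth estimation and ego-motion prediction
• Cityscapes dataset
– A dataset used for autonomous driving
– Contains many scenes with multiple moving objects
• Fetch Indoor Navigation dataset
– An indoor dataset
– Evaluated without fine-tuning after training on Cityscapes (above),
i.e., a test of adaptability to an unseen environment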

KITTI dataset
• Accuracy of the depth estimates
– Scores are high when the motion model (M) and online tuning (R) are introduced
(Table annotation: competing models that use a motion model)

Cityscapes
• Trained on the Cityscapes dataset and evaluated on KITTI
– As with training on KITTI, scores are high when the motion model (M) and online
tuning (R) are introduced

Fetch Indoor Navigation dataset
• Trained on the Cityscapes dataset and evaluated qualitatively on indoor data

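Summary
• Predicts depth and ego-motion using only a monocular camera
• Improves on prior work with the following techniques
– Motion Model
• Models individual objects using instance segmentation
– Imposing Object Size Constraints
• Regularization based on object size prevents extreme errors
– Test Time Refinement Model
• Tuning parameters online enables domain transfer
• Accuracy comparable to stereo-camera depth estimation
• Handles changes of environment, such as indoor and outdoor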

Supplement (video fps)
• Training
– Cityscapes: 8 fps
– KITTI dataset: varies by data
• Inference
– Not stated for the baseline and motion models
– With online refinement, runs on a GeForce 1080 Ti
at 50 FPS with batch size 4 and 30 FPS with batch size 1
Reference: https://sites.google.com/view/struct2depth
