# Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

Presentation slides from the 4th 3D Study Group @ Kanto (第4回3D勉強会@関東), 2019/02/23.

Published in: Science

1. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos [Casser+, AAAI '19]. 4th 3D Study Group @ Kanto. Shinya Sumikura (@sumicco_cv, @shinsumicco). 2019/02/23
2. [M1]
   - SfM, SLAM
   - 3D CV
   - bundle adjustment
   - C/C++
   - (CUDA)
   - (2011-) "GAN" = (GaN); (2016-) 3D CV
3. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos [Casser+, AAAI '19]
4. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos [Casser+, AAAI '19]
   1. depth, motion → Motion model
   2. depth → Object size constraints
   3. fine-tuning at inference → Test time refinement
5. SfMLearner? "Unsupervised Learning of Depth and Ego-Motion from Video" [Zhou+, CVPR '17]
   - training: Depth CNN and Pose CNN (depth, pose)
6. SfMLearner?
   1. 3 frames $(t-1, t, t+1)$
   2. depth for the 1 frame $t$
   3. poses $t \to t-1$, $(t \to t+1)$: $T_{t \to t-1}$, $T_{t \to t+1}$
   4. warp to frame $t$, loss
7. SfMLearner-related work
   - "Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints" [Mahjourian+, CVPR '18]: loss is L1 loss + SSIM loss; depth → 3D, ICP-based back-prop (ICP loss weight 0)
   - "Learning Depth from Monocular Videos using Direct Methods" [Wang+, CVPR '18]: pose via differentiable direct visual odometry (DDVO)
   - "Digging Into Self-Supervised Monocular Depth Estimation" [Godard+, arXiv]: Depth and Pose Encoder
   - "Monocular Depth Estimation: A Survey" [Bhoi, arXiv]
8. (From "Digging Into Self-Supervised Monocular Depth Estimation" [Godard+, arXiv])
9. Motion mask and 2D scene flow
   - motion mask: "SfM-Net: Learning of Structure and Motion from Video" [Vijayanarasimhan+, arXiv]: motion mask, object motion
   - 2D scene flow: "GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose" [Yin+, CVPR '18]: depth and pose → 2D rigid flow; residual flow; image warping
10. 3D scene flow
   - 3D scene flow: "Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding" [Yang+, ECCV '18]: 3D motion (= 3D scene flow) → image warping
   - 3D grid reasoning → 3D scene flow: "Learning Independent Object Motion from Unlabelled Stereoscopic Videos" [Cao+, arXiv] (2019/01/07): bounding box → 3D view frustum → 3D structure, object motion; depth, 3D scene flow, object mask
11. Approach
   1. instance segmentation (off-the-shelf) on the training images
   2. object motion per instance
   3. image warping with ego-motion + object motion
12. Part 1, Motion model
13. Part 1, Motion model
   - instance segmentation with Mask R-CNN
   - frame $t-1$: image $I_1$, instance seg. masks $S_{i,1}$; frame $t$: image $I_2$, masks $S_{i,2}$; frame $t+1$: image $I_3$, masks $S_{i,3}$
   - $i$: instance ID of the object
14. Part 1, Motion model: binary masks
   - static-scene binary mask: $O_0(S_2) = 1 - \bigcup_i S_{i,2}$
   - binary mask of instance ID $j$: $O_j(S_2) = S_{j,2}$
   - instance pixels → 1 (true), all others → 0 (false)
15. Part 1, Motion model: ego-motion
   - binary mask $V = O_0(S_1) \odot O_0(S_2) \odot O_0(S_3)$
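The two mask definitions on slides 14 and 15 can be sketched in NumPy as below. This is a minimal illustration only: the function names and the `(N, H, W)` array layout are assumptions made for this example, not the authors' code.

```python
import numpy as np

def static_mask(instance_masks):
    """O_0(S) = 1 - union_i S_i for one frame.

    instance_masks: array of shape (N, H, W) with values in {0, 1}
    (assumed layout for this sketch).
    """
    union = np.any(instance_masks.astype(bool), axis=0)
    return (~union).astype(np.float32)

def object_mask(instance_masks, j):
    """O_j(S) = S_j: binary mask of instance j (1-indexed, as on the slides)."""
    return instance_masks[j - 1].astype(np.float32)

def shared_static_mask(masks_1, masks_2, masks_3):
    """V = O_0(S_1) * O_0(S_2) * O_0(S_3): static in all three frames."""
    return static_mask(masks_1) * static_mask(masks_2) * static_mask(masks_3)

# Toy example: one instance moving left to right across a 2x3 image.
s1 = np.array([[[1, 0, 0], [0, 0, 0]]])
s2 = np.array([[[0, 1, 0], [0, 0, 0]]])
s3 = np.array([[[0, 0, 1], [0, 0, 0]]])
V = shared_static_mask(s1, s2, s3)
```

In the toy example every pixel of the top row is covered by the object in at least one frame, so only the bottom row survives in `V`.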
16. Part 1, Motion model: ego-motion
   - apply $V$ to every frame: $I_1 \odot V$, $I_2 \odot V$, $I_3 \odot V$
17. Part 1, Motion model: ego-motion
   - ego-motion estimator $\psi_E$: $E_{1 \to 2}, E_{2 \to 3} = \psi_E(I_1 \odot V, I_2 \odot V, I_3 \odot V)$
18. Part 1, Motion model: object motion per instance
   - image warping with depth $D_2$ and ego-motion $E_{1 \to 2}$, $E_{2 \to 3}$: $I_1 \to \hat{I}_{1 \to 2}$, $I_3 \to \hat{I}_{3 \to 2}$
   - the segmentation masks $S_{i,1}$, $S_{i,3}$ are warped as well
   - depth: $D_2 = \theta(I_2)$
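19. Part 1, Motion model: object motion per instance
   - for instance $i$, the warped images are masked with the binary masks and passed to the object motion estimator $\psi_M$
   - $M^{(i)}_{1 \to 2}, M^{(i)}_{2 \to 3} = \psi_M(\hat{I}_{1 \to 2} \odot O_i(\hat{S}_{1 \to 2}),\ I_2 \odot O_i(S_2),\ \hat{I}_{3 \to 2} \odot O_i(\hat{S}_{3 \to 2}))$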
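20. Part 1, Motion model: final warps $\hat{I}^{(F)}_{1 \to 2}$, $\hat{I}^{(F)}_{3 \to 2}$
   - $\hat{I}^{(F)}_{1 \to 2} = \hat{I}_{1 \to 2} \odot V + \sum_{i=1}^{N} \hat{I}^{(i)}_{1 \to 2} \odot O_i(S_2)$
   - first term: warping with depth $D_2$ and ego-motion $E_{1 \to 2}$
   - second term ($i$: instance ID): warping with depth $D_2$, ego-motion $E_{1 \to 2}$ and object motion $M^{(i)}_{1 \to 2}$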
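The composition on slide 20 takes the ego-motion warp on static pixels and each per-object warp on that object's pixels. A rough NumPy sketch follows, assuming `(H, W, 3)` images and `(H, W)` masks; `compose_final_warp` is a name invented for this example.

```python
import numpy as np

def compose_final_warp(ego_warp, V, object_warps, object_masks):
    """I^(F) = ego_warp * V + sum_i object_warps[i] * object_masks[i].

    ego_warp:     (H, W, 3) image warped with depth + ego-motion
    V:            (H, W) static-scene mask
    object_warps: list of (H, W, 3) images warped with object motion i
    object_masks: list of (H, W) masks O_i(S_2)
    """
    out = ego_warp * V[..., None]
    for warp_i, mask_i in zip(object_warps, object_masks):
        out = out + warp_i * mask_i[..., None]
    return out

# Toy 1x2 image: left pixel is static, right pixel belongs to one object.
ego_warp = np.full((1, 2, 3), 0.2)
obj_warp = np.full((1, 2, 3), 0.9)
V = np.array([[1.0, 0.0]])
O_1 = np.array([[0.0, 1.0]])
final = compose_final_warp(ego_warp, V, [obj_warp], [O_1])
```

Because the masks are disjoint, each output pixel comes from exactly one warp.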
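21. Part 1, Motion model: loss
   - $i$: scale No. (not the instance ID); $\alpha_j$: weights
   - $L = \sum_{i=0}^{3} \left( \alpha_1 L_{rec}^{(i)} + \alpha_2 L_{ssim}^{(i)} + \alpha_3 \frac{1}{2^i} L_{sm}^{(i)} \right)$
   - warping loss: $L_{rec} = \min\left( \| \hat{I}_{1 \to 2} - I_2 \|, \| \hat{I}_{3 \to 2} - I_2 \| \right)$
   - SSIM loss: $L_{ssim} = \frac{1 - \mathrm{SSIM}(I_a, I_b)}{2}$
   - smoothness loss: $L_{sm} = |\partial_x d^*| \, e^{-|\partial_x I|} + |\partial_y d^*| \, e^{-|\partial_y I|}$, with $d^* = d / \bar{d}$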
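As a sketch of how the terms on slide 21 combine: the snippet below takes the per-pixel minimum of the two warping errors and then sums the weighted terms over the four scales. The L1 norm, the per-pixel minimum, and the weights in the usage lines are assumptions for illustration; SSIM itself is not implemented here and is passed in as a precomputed value.

```python
import numpy as np

def min_reconstruction_loss(warp_1to2, warp_3to2, target):
    """L_rec: per-pixel minimum of the two L1 warping errors, then averaged."""
    err_1 = np.abs(warp_1to2 - target).mean(axis=-1)  # mean over channels
    err_3 = np.abs(warp_3to2 - target).mean(axis=-1)
    return float(np.minimum(err_1, err_3).mean())

def ssim_loss(ssim_value):
    """L_ssim = (1 - SSIM) / 2 for a precomputed SSIM value."""
    return (1.0 - ssim_value) / 2.0

def total_loss(rec, ssim, sm, alphas):
    """L = sum_{i=0..3} a1*rec[i] + a2*ssim[i] + a3*sm[i]/2^i."""
    a1, a2, a3 = alphas
    return sum(a1 * rec[i] + a2 * ssim[i] + a3 * sm[i] / 2 ** i
               for i in range(4))

# Toy values: two pixels, each better explained by a different source frame.
target = np.zeros((1, 2, 3))
warp_a = np.array([[[0.1] * 3, [0.5] * 3]])
warp_b = np.array([[[0.4] * 3, [0.2] * 3]])
rec = min_reconstruction_loss(warp_a, warp_b, target)

# Placeholder per-scale values and unit weights, chosen only for the example.
total = total_loss([1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], (1.0, 1.0, 1.0))
```

Taking the minimum lets a pixel that is occluded in one source frame be scored against the other frame instead.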
22. Part 2, Object size constraints
   - object size: 1.8 m, 1.7 m, 2.5 m
   - $p_t \in \mathbb{R}_{>0}$ ($t$: category ID): size of the instance in the real world
   - instance ID $i$ → category ID $t(i)$: $\mathbb{N} \to \mathbb{N}$
   - instance ID $i$ → $p_{t(i)}$: $\mathbb{N} \to \mathbb{R}_{>0}$
23. Part 2, Object size constraints
   - object size → depth
   - $f_y$ [pix]: focal length
   - $h(O_i(S))$ [pix]: height of the mask blob of instance ID $i$
   - $p_{t(i)}$ [m]: real-world size of instance ID $i$
   - approximate depth of instance ID $i$: $D_{approx}(p; h) = f_y \frac{p}{h}$ [m]
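The relation on slide 23 is the pinhole similar-triangles rule: an object of real height $p$ that appears $h$ pixels tall sits at depth $f_y p / h$. A small sketch, with the focal length and prior height below chosen arbitrarily for the example:

```python
import numpy as np

def mask_height_px(mask):
    """h(O_i(S)): vertical extent in pixels of a binary (H, W) mask blob."""
    rows = np.flatnonzero(mask.any(axis=1))
    if rows.size == 0:
        return 0
    return int(rows[-1] - rows[0] + 1)

def approx_depth(focal_y_px, prior_height_m, mask):
    """D_approx(p; h) = f_y * p / h, or None if the mask is empty."""
    h = mask_height_px(mask)
    if h == 0:
        return None
    return focal_y_px * prior_height_m / h

# Example: a blob 36 px tall, a 1.8 m prior, and f_y = 720 px (made-up values).
mask = np.zeros((100, 50))
mask[20:56, 10:30] = 1
depth_m = approx_depth(720.0, 1.8, mask)
```

With these numbers the estimate is 720 × 1.8 / 36 = 36 m.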
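24. Part 2, Object size constraints
   - constrain depth with $D_{approx}$: $L_{sc} = \sum_{i=1}^{N} \left\| \frac{D \odot O_i(S)}{\bar{D}} - \frac{D_{approx}(p_{t(i)}; h(O_i(S))) \odot O_i(S)}{\bar{D}} \right\|$
   - $D \odot O_i(S)$: depth of instance $i$ (via the binary mask of instance $i$); $D_{approx}(\cdot) \odot O_i(S)$: reference depth of instance $i$
   - $\bar{D}$: disparity normalization, against depth shrinking, as in "Learning Depth from Monocular Videos using Direct Methods" [Wang+, CVPR '18]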
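A minimal NumPy sketch of the size-constraint loss on slide 24. Two assumptions are made here that the slide does not state: the norm is taken as L1, and $\bar{D}$ is the mean of the predicted depth map.

```python
import numpy as np

def size_constraint_loss(depth, object_masks, approx_depths):
    """L_sc = sum_i || (D * O_i - D_approx_i * O_i) / mean(D) ||_1.

    depth:         (H, W) predicted depth map D
    object_masks:  list of (H, W) binary masks O_i(S)
    approx_depths: list of scalars D_approx(p_t(i); h(O_i(S)))
    """
    mean_depth = depth.mean()
    loss = 0.0
    for mask_i, d_i in zip(object_masks, approx_depths):
        diff = (depth * mask_i - d_i * mask_i) / mean_depth
        loss += float(np.abs(diff).sum())
    return loss

# Toy example: the object occupies the top row; its predicted depth is 10 m
# while the size prior suggests 12 m.
depth = np.array([[10.0, 10.0], [30.0, 30.0]])
O_1 = np.array([[1.0, 1.0], [0.0, 0.0]])
loss = size_constraint_loss(depth, [O_1], [12.0])
```

Dividing both terms by the same normalizer makes the penalty relative to the overall depth scale, so the loss only compares depths inside each object mask.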
25. Part 3, Test time refinement
   - refinement at inference
   - UP
   - KITTI
   - training on the 3 inference frames for $N$ steps
   - over-training ($N = 20$)
   - fine-tuning online
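The idea of test time refinement, a few extra gradient steps on the frames being inferred, can be sketched generically as below. This is a toy illustration only: `refine_at_test_time`, the plain gradient-descent update, and the quadratic stand-in loss are all assumptions for this example, and only the step count of 20 comes from the slide.

```python
import numpy as np

def refine_at_test_time(params, grad_fn, steps=20, lr=1e-4):
    """Run `steps` gradient-descent updates starting from trained weights.

    params:  trained weights (array-like)
    grad_fn: hypothetical callable returning the loss gradient at the
             current weights, evaluated on the test-time frames
    """
    refined = np.array(params, dtype=np.float64)  # copy; input stays untouched
    for _ in range(steps):
        refined = refined - lr * grad_fn(refined)
    return refined

# Stand-in loss ||w||^2 with gradient 2w, in place of the self-supervised loss.
toy_grad = lambda w: 2.0 * w
trained = np.array([1.0, -2.0])
refined = refine_at_test_time(trained, toy_grad, steps=20, lr=0.25)
```

With this toy gradient and learning rate each step halves the weights, which makes the effect of the fixed step budget easy to check by hand.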
26. Results
   - M: Motion model
   - R: Test time refinement
   - M+R: SoTA
28. Motion model
   - KITTI training → Cityscapes inference
   - Motion model: depth
   - worst
29. Motion model
   - Cityscapes training → KITTI evaluation
   - Motion model (M): domain
   - Test time refinement (R)
30. Motion model
   - object motion during training
   - 6DoF motion per instance
   - instance segmentation
   - instance ID
31. Test time refinement
   - baseline vs. refinement mode (R)
   - R: depth
   - zero-shot domain transfer
   - training: KITTI, inference: Cityscapes; training: KITTI, inference: KITTI
32. Indoor dataset
   - Cityscapes training → Indoor dataset inference
   - baseline vs. refinement model (R)
   - "outdoor (Cityscapes) → indoor" domain transfer
33. Summary
   - Motion model: depth
   - test time refinement
   - instance segmentation + motion: naïve
   - back-prop into the instance segmentation network?
   - End-to-end?
   - instance ID