Deep Virtual Stereo Odometry:
Leveraging Deep Depth Prediction for Monocular
Direct Sparse Odometry [ECCV2018(oral)]
The University of Tokyo
Aizawa Lab M1 Masaya Kaneko
Paper Reading Session @ AIST
1
Introduction
• Monocular Visual Odometry
– Camera trajectory estimation and 3D reconstruction from
image sequences obtained by a monocular camera
Direct Sparse Odometry [Engel+, PAMI’18]
2
Introduction
• Monocular Visual Odometry
– Prone to scale drift (unknown scale)
– Require sufficient motion parallax in successive frames
[Figure: scale drift; small parallax leads to incorrect depth estimation]
3
Introduction
• Typically, complex sensors are employed to avoid this issue.
– Active depth sensors (LiDAR, RGB-D camera)
– Stereo camera
• However, these sensors have the following disadvantages.
– Require larger calibration effort
– Increase the cost of the system
Velodyne
(https://velodynelidar.com/)
ZED
(https://www.stereolabs.com/)
4
Introduction
• If a priori knowledge about the environment is used, this issue
can be solved without complex sensors.
– Deep-learning-based approaches like CNN-SLAM [Tateno+, CVPR’17]
– The authors propose a method that adapts this approach to the
state-of-the-art VO method DSO (Direct Sparse Odometry).
https://drive.google.com/file/d/108CttbYiBqaI3b1jIJFTS26SzNfQqQNG/view
5
Problem setting
• Requirements
– At inference time, only a monocular camera can be used
(for monocular visual odometry).
– At training time, only inexpensive sensors are available.
• A mono/stereo camera is OK.
• Active sensors are too costly to use.
Sensors                         Inference   Training
Monocular camera                〇          〇
Stereo camera                   ×           〇
Active sensors (RGB-D, LiDAR)   ×           ×
• Deep Virtual Stereo Odometry
1. Train deep depth estimator using stereo camera
6
Proposed method
[Figure: StackNet takes the left input image and predicts left and right disparity maps; the training loss is computed using the stereo camera]
• Deep Virtual Stereo Odometry
1. Train deep depth estimator using stereo camera
2. At inference time, predicted disparities are used for depth
initialization in monocular DSO.
7
Proposed method
[Figure: StackNet predicts left/right disparities from the left input image; the disparities initialize the sparse depth-map estimation in monocular DSO, which produces the created map]
• Deep Virtual Stereo Odometry
1. Train deep depth estimator using stereo camera
2. At inference time, predicted disparities are used for depth
initialization in monocular DSO.
8
Proposed method
[Figure: same pipeline diagram as above]
9
1. Deep Monocular depth estimation
• 3 key ingredients
Network Architecture
1. StackNet: 2-stage refinement of the network predictions in a
stacked encoder-decoder architecture
Loss Function
2. Self-supervised learning: photoconsistency in a stereo setup
3. Supervised learning: accurate sparse depth reconstructions by
Stereo DSO used as GT
[Figure: StackNet predicts left and right disparities from the left input image; I_recons^right / I_recons^left are reconstructed via disp_right / disp_left; Stereo DSO's reconstructed result serves as sparse GT]
10
1. Deep Monocular depth estimation
• Network Architecture
– StackNet (SimpleNet + ResidualNet)
11
1. Deep Monocular depth estimation
• Loss Function
– Linear combination of 5 terms at each image scale
1. Self-supervised loss
2. Supervised loss
3. Left-right disparity consistency loss
4. Disparity smoothness regularization
5. Occlusion regularization
12
1. Deep Monocular depth estimation
• Loss Function
1. Self-supervised loss
• Measures the quality of the reconstructed images
[Figure: I_recons^right is reconstructed from I_left via disp_right, and I_recons^left from I_right via disp_left]
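The slide does not spell this loss out, so here is a minimal NumPy sketch of the idea, assuming MonoDepth-style warping where the left image is reconstructed by sampling the right image at columns shifted by the predicted left disparity (in pixels). Plain L1 is used for brevity where the paper also uses SSIM; all names are illustrative.

```python
import numpy as np

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left view: sample the right image at column x - disp_left(x, y).

    img_right: (H, W) grayscale image; disp_left: (H, W) disparity in pixels."""
    h, w = img_right.shape
    xs = np.tile(np.arange(w, dtype=np.float64), (h, 1))
    src_x = xs - disp_left                       # corresponding column in the right image
    x0 = np.clip(np.floor(src_x).astype(int), 0, w - 2)
    frac = np.clip(src_x - x0, 0.0, 1.0)
    rows = np.tile(np.arange(h)[:, None], (1, w))
    # linear interpolation along x (a differentiable sampler in the actual training framework)
    return (1.0 - frac) * img_right[rows, x0] + frac * img_right[rows, x0 + 1]

def self_supervised_loss(img_left, img_right, disp_left):
    """Photoconsistency between the input left image and its reconstruction (plain L1 here)."""
    recons_left = warp_right_to_left(img_right, disp_left)
    return np.abs(img_left - recons_left).mean()
```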
13
1. Deep Monocular depth estimation
• Loss Function
2. Supervised loss
• Measures the deviation of the predicted disparity from
disparities estimated by Stereo DSO [Wang+, ICCV’17]
[Figure: Stereo DSO (using the stereo camera) produces a sparse reconstruction whose disparities are compared against the predicted left disparity]
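Stereo DSO reconstructs only a sparse set of points, so this supervision applies only where such a "ground-truth" disparity exists. A minimal sketch under that assumption (plain L1 as a placeholder for whatever robust norm the paper uses; names are illustrative):

```python
import numpy as np

def supervised_loss(disp_pred, disp_dso, valid_mask):
    """Deviation from Stereo DSO disparities, evaluated only at the sparse reconstructed pixels.

    disp_pred, disp_dso: (H, W) disparity maps; valid_mask: (H, W) bool, True where DSO has depth."""
    if not valid_mask.any():
        return 0.0
    return np.abs(disp_pred - disp_dso)[valid_mask].mean()
```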
14
1. Deep Monocular depth estimation
• Loss Function
3. Left-right disparity consistency loss
• Consistency loss proposed in MonoDepth [Godard+, CVPR’17]
[Figure: the left disparity disp_left should agree with the right disparity disp_right sampled at the corresponding location (and vice versa)]
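A minimal sketch of the MonoDepth-style consistency check, assuming disparities in pixels: the left disparity at (x, y) should agree with the right disparity sampled at the column it points to (names are illustrative).

```python
import numpy as np

def sample_along_x(field, x_shift):
    """Sample a 2-D field at columns x - x_shift(x, y), with linear interpolation along x."""
    h, w = field.shape
    xs = np.tile(np.arange(w, dtype=np.float64), (h, 1))
    src_x = xs - x_shift
    x0 = np.clip(np.floor(src_x).astype(int), 0, w - 2)
    frac = np.clip(src_x - x0, 0.0, 1.0)
    rows = np.tile(np.arange(h)[:, None], (1, w))
    return (1.0 - frac) * field[rows, x0] + frac * field[rows, x0 + 1]

def lr_consistency_loss(disp_left, disp_right):
    """disp_left at (x, y) should match disp_right sampled at (x - disp_left(x, y), y)."""
    disp_right_projected = sample_along_x(disp_right, disp_left)
    return np.abs(disp_left - disp_right_projected).mean()
```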
15
1. Deep Monocular depth estimation
• Loss Function
4. Disparity smoothness regularization
• Predicted disparity map should be locally smooth
5. Occlusion regularization
• Disparities in occluded areas should be zero
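A minimal sketch of both regularizers, assuming grayscale images and disparity maps as NumPy arrays. The edge-aware weighting follows the common MonoDepth-style form, and the occlusion term is written simply as a penalty on disparity magnitude, which is one way to realize "disparities in occluded areas should be zero" (the paper's exact form may differ).

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness: penalize disparity gradients, down-weighted across image edges."""
    ddx = np.abs(np.diff(disp, axis=1))
    ddy = np.abs(np.diff(disp, axis=0))
    idx = np.abs(np.diff(img.astype(np.float64), axis=1))
    idy = np.abs(np.diff(img.astype(np.float64), axis=0))
    return (ddx * np.exp(-idx)).mean() + (ddy * np.exp(-idy)).mean()

def occlusion_reg(disp):
    """Push disparities toward zero where nothing else constrains them (e.g. occluded areas)."""
    return np.abs(disp).mean()
```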
16
1. Deep Monocular depth estimation
• Experimental Result
– Outperform the state-of-the-art semi-supervised method
by Kuznietsov et al.
17
1. Deep Monocular depth estimation
• Experimental Result
– Their results contain more detail and deliver comparable
predictions on thin structures such as poles.
• Deep Virtual Stereo Odometry
1. Train deep depth estimator using stereo camera
2. At inference time, predicted disparities are used for depth
initialization in monocular DSO.
18
Proposed method (described)
[Figure: same pipeline diagram as before]
• Deep Virtual Stereo Odometry
1. Train deep depth estimator using stereo camera
2. At inference time, predicted disparities are used for
depth initialization in monocular DSO.
19
Proposed method (described)
[Figure: same pipeline diagram as before]
20
2. Deep Virtual Stereo Odometry
• Monocular DSO + Deep Disparity prediction
– Disparities are used in 2 key ways
1. Frame initialization / point selection
2. Left-right constraints into windowed optimization in
Monocular DSO
21
2. Deep Virtual Stereo Odometry
• Monocular DSO + Deep Disparity prediction
– Disparities are used in 2 key ways
1. Frame initialization / point selection
2. Left-right constraints into windowed optimization in
Monocular DSO
– First, we give an overview of monocular DSO
Monocular DSO
22
DSO (Direct Sparse Odometry)
• Novel direct sparse Visual Odometry method
– Direct: seamless ability to use & reconstruct all points
instead of only corners
– Sparse: efficient, joint optimization of all parameters
[Figure: ORB-SLAM [Mur-Artal+, MVIGRO’14] is feature-based and sparse; LSD-SLAM [Engel+, ECCV’14] is direct and semi-dense; DSO takes the benefits of both approaches]
[1] https://drive.google.com/file/d/108CttbYiBqaI3b1jIJFTS26SzNfQqQNG/view
23
• Direct sparse model
DSO - Model Formulation-
[Figure: a point p with inverse depth d_p in the reference frame I'_i (pose T_i, exposure time t_i) is back-projected by Π_c^{-1} and projected by Π_c to the point p' in the target frame I'_j (pose T_j, exposure time t_j)]
24
• Direct sparse model
DSO - Model Formulation-
[Figure: same model diagram, highlighting the pixel neighbourhood N_p around p]
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
25
• Direct sparse model
DSO - Model Formulation-
[Figure: same model diagram]
Target Variables
- Camera poses T_i, T_j
- Inverse depth d_p
- Camera intrinsics c
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
26
• Direct sparse model
DSO - Model Formulation-
[Figure: same model diagram]
Target Variables
- Camera poses T_i, T_j
- Inverse depth d_p
- Camera intrinsics c
Error between irradiances B = I/t
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
27
• Photometric calibration
– Feature-based methods focus only on geometric calibration and
widely ignore photometric calibration (features are invariant to it).
– In direct methods, this calibration is very important!
DSO - Model Formulation-
Observed pixel value I_i
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
28
• Photometric calibration
– Feature-based methods focus only on geometric calibration and
widely ignore photometric calibration (features are invariant to it).
– In direct methods, this calibration is very important!
DSO - Model Formulation-
[Figure: observed pixel value I_i → photometric calibration (response calibration with hardware gamma G, vignette calibration with vignette V) → photometrically corrected image I'_i]
I_i(x) = G(t_i · V(x) · B_i(x))  ⇒  I'_i(x) := t_i · B_i(x) = G^{-1}(I_i(x)) / V(x)
29
• Photometric calibration
– Feature-based methods focus only on geometric calibration and
widely ignore photometric calibration (features are invariant to it).
– In direct methods, this calibration is very important!
DSO - Model Formulation-
[Figure: observed pixel value I_i → photometric calibration (response calibration with hardware gamma G, vignette calibration with vignette V) → photometrically corrected image I'_i → divide by exposure time t_i → irradiance B_i (a consistent value)]
I_i(x) = G(t_i · V(x) · B_i(x))  ⇒  I'_i(x) := t_i · B_i(x) = G^{-1}(I_i(x)) / V(x)
B_i(x) = I'_i(x) / t_i
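A minimal sketch of this correction, assuming the inverse response G^{-1} is available as a 256-entry lookup table and the vignette map V is known; names and the toy values are illustrative, not taken from the DSO code.

```python
import numpy as np

def photometric_correction(img_raw, inv_response, vignette, exposure):
    """Undo the response function and vignetting, then normalize by exposure time.

    img_raw:      (H, W) uint8 observed pixel values I_i(x)
    inv_response: length-256 lookup table approximating G^-1
    vignette:     (H, W) attenuation map V(x) in (0, 1]
    exposure:     exposure time t_i in seconds"""
    corrected = inv_response[img_raw] / vignette   # I'_i(x) = G^-1(I_i(x)) / V(x)
    irradiance = corrected / exposure              # B_i(x)  = I'_i(x) / t_i
    return corrected, irradiance

# Toy usage with a linear response and no vignetting (illustrative values only):
img = np.random.randint(0, 256, (4, 6), dtype=np.uint8)
corrected, irradiance = photometric_correction(
    img, np.linspace(0.0, 1.0, 256), np.ones((4, 6)), exposure=0.01)
```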
30
• Direct sparse model
DSO - Model Formulation-
[Figure: same model diagram]
Error between irradiances B = I/t
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
31
• Direct sparse model (when photometric calibration is not available)
– Additionally estimate affine lighting parameters
DSO - Model Formulation-
[Figure: same model diagram, but the reference frame I_i and the target frame I_j now carry affine lighting parameters (a_i, b_i) and (a_j, b_j) instead of known exposure times]
Error between affinely corrected raw pixel values
[1] https://people.eecs.berkeley.edu/~chaene/cvpr17tut/SLAM.pdf
32
• Direct sparse model (when photometric calibration is not available)
– Additionally estimate affine lighting parameters
DSO - Model Formulation-
[Figure: same model diagram with affine lighting parameters]
Error between affinely corrected raw pixel values
Target Variables
- Camera poses T_i, T_j
- Inverse depth d_p
- Camera intrinsics c
- Brightness parameters a_i, b_i, a_j, b_j
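A minimal sketch of the per-point machinery the diagram describes, assuming a pinhole camera: back-projection with inverse depth, the relative pose, re-projection, and the affine-corrected brightness residual following the DSO formulation. The Huber weighting and the 8-pixel neighbourhood N_p are omitted; names are illustrative.

```python
import numpy as np

def project_point(p, inv_depth, K, R, t):
    """Back-project pixel p = (u, v) with inverse depth d_p, move it with the relative pose
    (R, t) from reference to target frame, and project it again with intrinsics K."""
    u, v = p
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X_ref = np.array([(u - cx) / fx, (v - cy) / fy, 1.0]) / inv_depth   # Pi_c^-1(p, d_p)
    X_tgt = R @ X_ref + t
    return np.array([fx * X_tgt[0] / X_tgt[2] + cx,
                     fy * X_tgt[1] / X_tgt[2] + cy])                    # p' = Pi_c(...)

def photometric_residual(I_ref, I_tgt, p, p_proj, a_i, b_i, a_j, b_j, t_i=1.0, t_j=1.0):
    """Brightness residual between p in the reference frame and its projection p' in the
    target frame, after affine brightness correction (a, b) and exposure normalization."""
    u, v = np.round(p).astype(int)
    u2, v2 = np.round(p_proj).astype(int)           # nearest-neighbour lookup; DSO interpolates
    scale = (t_j * np.exp(a_j)) / (t_i * np.exp(a_i))
    return float((I_tgt[v2, u2] - b_j) - scale * (I_ref[v, u] - b_i))
```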
33
• Direct sparse model
DSO - Model Formulation-
[Figure: the total error sums over all host frames F = {1, 2, 3, 4}, the points P_i hosted in each frame I_i, and the observations obs(p) of each point]
Target Variables
- Camera poses T_i, T_j
- Inverse depth d_p
- Camera intrinsics c
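For reference, the total photometric energy implied by this figure, as written in the DSO paper, sums a per-point error E_pj over all host frames, the points hosted in each frame, and the frames observing each point:

$$E_{\text{photo}} = \sum_{i \in \mathcal{F}} \; \sum_{\mathbf{p} \in \mathcal{P}_i} \; \sum_{j \in \text{obs}(\mathbf{p})} E_{\mathbf{p}j}$$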
34
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
DSO - System Overview -
35
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
DSO - System Overview -
Tracking + Depth estimation
- From the active N_f (= 7) keyframes
- Multi-scale image pyramid + constant motion model
36
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
DSO - System Overview -
Tracking + Depth estimation
- Each frame keeps N_p = 2000 points
Point selection's two criteria:
1. Well distributed in the image?
2. High image gradient magnitude?
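DSO's actual point picker uses region-adaptive thresholds over several passes; the sketch below only illustrates the two criteria above, assuming a simple fixed grid and a fixed threshold (both illustrative).

```python
import numpy as np

def select_points(img, cell=16, grad_thresh=8.0):
    """Pick at most one high-gradient pixel per cell so points stay well distributed."""
    gx = np.zeros(img.shape, dtype=np.float64)
    gy = np.zeros(img.shape, dtype=np.float64)
    gx[:, 1:-1] = 0.5 * (img[:, 2:].astype(np.float64) - img[:, :-2])
    gy[1:-1, :] = 0.5 * (img[2:, :].astype(np.float64) - img[:-2, :])
    grad_mag = np.hypot(gx, gy)

    points = []
    h, w = img.shape
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            block = grad_mag[y0:y0 + cell, x0:x0 + cell]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            if block[dy, dx] > grad_thresh:            # criterion 2: strong image gradient
                points.append((x0 + dx, y0 + dy))      # criterion 1: one candidate per cell
    return points
```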
37
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
DSO - System Overview -
Whether a new KF is required or not?
- Similar strategy to ORB-SLAM
Three criteria:
1. Change in the field of view?
2. Occlusions / disocclusions?
3. Change in camera exposure time?
If these conditions are met, the tracked frame is inserted as a keyframe.
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
38
DSO - System Overview -
Windowed optimization (BA)
- Minimize the photometric error over the active keyframes
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
39
DSO - System Overview -
Marginalization
- Old variables are marginalized out to keep the computation bounded
- Schur complement
Black: marginalized points
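A minimal sketch of the Schur complement step in its standard textbook form (not code from DSO): the variables to be marginalized are eliminated from the Gauss-Newton system, leaving an equivalent smaller system that acts as a prior on the remaining variables.

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Remove the variables indexed by `marg` from the system H dx = b via the Schur
    complement, leaving an equivalent smaller system over the variables in `keep`."""
    H_kk = H[np.ix_(keep, keep)]
    H_km = H[np.ix_(keep, marg)]
    H_mm = H[np.ix_(marg, marg)]
    H_mm_inv = np.linalg.inv(H_mm)      # a real system would add damping / use a stable solve
    H_new = H_kk - H_km @ H_mm_inv @ H_km.T
    b_new = b[keep] - H_km @ H_mm_inv @ b[marg]
    return H_new, b_new
```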
40
• System overview of DSO
1. Frame Tracking
2. Keyframe Creation
3. Windowed Optimization
4. Marginalization
DSO - System Overview -
Monocular DSO
41
• 2 key ingredients in DVSO
1. Frame initialization / point selection
2. Left-right constraints into windowed optimization in
Monocular DSO
2. Deep Virtual Stereo Odometry
42
• 2 key ingredients in DVSO
1. Frame initialization / point selection
• StackNet’s prediction is used as the initial depth value in
monocular DSO, similar to Stereo DSO (a conversion sketch follows).
2. Deep Virtual Stereo Odometry
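Under a rectified pinhole model with focal length f_x and the stereo baseline b of the training rig, depth = f_x · b / disparity, i.e. inverse depth = disparity / (f_x · b). A minimal sketch of that conversion; the constants are only illustrative (roughly KITTI-like), not values from the paper.

```python
import numpy as np

def disparity_to_inverse_depth(disp_left, fx, baseline, min_disp=0.1):
    """Convert a predicted disparity map (pixels) into inverse depth (1/m) for initialization."""
    disp = np.maximum(disp_left, min_disp)       # avoid division by ~0 near infinity
    return disp / (fx * baseline)                # inverse depth = disp / (fx * b)

# Illustrative values only (roughly KITTI-like): fx ~ 720 px, baseline ~ 0.54 m
inv_depth = disparity_to_inverse_depth(np.full((2, 3), 36.0), fx=720.0, baseline=0.54)
```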
43
• 2 key ingredients in DVSO
1. Frame initialization / point selection
• The left and right disparities predicted by StackNet are used for
point selection
• To filter out pixels in occluded areas (e_lr > 1; see the sketch below)
2. Deep Virtual Stereo Odometry
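A minimal sketch of that filter, reusing the left-right disagreement map e_lr from the consistency-loss sketch earlier; pixels where e_lr exceeds 1 pixel are treated as occluded and skipped during point selection (names are illustrative).

```python
import numpy as np

def lr_error_map(disp_left, disp_right):
    """Pixelwise left-right disparity disagreement e_lr (nearest-neighbour sampling, for brevity)."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    src_x = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    rows = np.tile(np.arange(h)[:, None], (1, w))
    return np.abs(disp_left - disp_right[rows, src_x])

def selectable_mask(disp_left, disp_right, thresh=1.0):
    """True where a pixel may be selected; pixels with e_lr > 1 px are treated as occluded."""
    return lr_error_map(disp_left, disp_right) <= thresh
```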
44
• 2 key ingredients in DVSO
2. Additional Constraints in Optimization
• Monocular DSO’s total energy (described earlier)
2. Deep Virtual Stereo Odometry
45
• 2 key ingredients in DVSO
2. Additional Constraints in Optimization
• Introduce a novel virtual stereo term for each point
• To check whether the optimized depth is consistent with
the disparity prediction of StackNet (a simplified sketch follows).
2. Deep Virtual Stereo Odometry
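The paper phrases this as an energy term coupled into the windowed optimization; the sketch below only illustrates the consistency idea stated on the slide, as a robust residual between DSO's current inverse depth and the inverse depth implied by the predicted disparity. This is a simplification, not the paper's exact term, and the Huber norm is only an illustrative choice.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber penalty, used here only as an illustrative robust norm."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r * r, delta * (a - 0.5 * delta))

def virtual_stereo_residual(inv_depth_opt, disp_pred, fx, baseline):
    """Simplified stand-in: penalize disagreement between the inverse depth currently
    estimated by the windowed optimization and the one implied by StackNet's disparity."""
    inv_depth_net = disp_pred / (fx * baseline)
    return huber(inv_depth_opt - inv_depth_net)
```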
46
• 2 key ingredients in DVSO
2. Additional Constraints in Optimization
• The total energy is the sum of the original error and the virtual
stereo term (written out schematically below)
2. Deep Virtual Stereo Odometry
[Figure: total energy = original error + virtual stereo term]
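Schematically, and keeping the slide's wording, the objective becomes the original windowed photometric energy plus a virtual stereo term for every hosted point; writing the virtual stereo term as E† and its coupling weight as λ (both symbols are my notation, not necessarily the paper's):

$$E_{\text{total}} = \sum_{i \in \mathcal{F}} \sum_{\mathbf{p} \in \mathcal{P}_i} \Big( \lambda\, E^{\dagger}_{\mathbf{p}} + \sum_{j \in \text{obs}(\mathbf{p})} E_{\mathbf{p}j} \Big)$$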
47
Experimental result
• KITTI Odometry Benchmark
– Comparison with SoTA Stereo VO
– Achieve performance comparable to stereo methods!
48
Experimental result
• KITTI Odometry Benchmark
– Comparison with SoTA Stereo VO
– Achieve performance comparable to stereo methods!
49
Experimental result
• KITTI Odometry Benchmark
– Comparison with SoTA Monocular/Stereo VO
50
Experimental result
• KITTI Odometry Benchmark
– Comparison with deep learning approaches
– Clearly outperform SoTA deep learning based VO methods!
51
Experimental result
• KITTI Odometry Benchmark
– Localization and Mapping Result
52
Conclusion
• They present a novel monocular VO system, DVSO.
– Recover metric scale and reduce scale drift with only a
single camera
– Outperform SoTA monocular VO
– Achieve comparable results to stereo VO
• Future work
– Fine-tune the network inside the odometry pipeline end-to-end.
– Investigate how well the proposed approach generalizes
to other cameras and environments
