June 13, 2019, SSII2019 Organized Session: Multimodal 4D Sensing. "The Current State of SLAM Technology for End Users." Speaker: Tomoyuki Mukasa (Research Scientist, Rakuten Institute of Technology)
https://confit.atlas.jp/guide/event/ssii2019/static/organized#OS2
2. 2
Career timeline: Ph.D. Student → Engineer → Researcher (2012, 2015)
3D Reconstruction & Motion Analysis
Tomoyuki MUKASA, Ph.D., 3D Vision Researcher
VR for Exhibition
AR for Tourism
AR/VR/HCI for e-commerce
4. 4
Mission of Our Groups:
Create New User Experiences Applicable to Rakuten Services
Contributing to existing businesses
Exploring new ideas
Increasing tech-brand awareness
Using Computer Vision & Human Computer Interaction
5. 5
The 3 R’s Of Computer Vision
Example labels from an image: woman, red, blouse (category and attributes)
6. 6
The 3 Main Points for End-users
• Easy access on mobile
  • AR furniture app
  • Web AR/SLAM
• Improving the experience on mobile
  • Dense 3D reconstruction
  • Occlusion-aware AR
  • Manipulatable AR
• Understanding & manipulation of the environment around the user
  • Delivery robots
7. 7
Easy access on mobile
• AR furniture app
• Web AR/SLAM
10. 11
Floor Detection & IMU Fusion
Need to be tracked in 3D!
Almost solved in ARKit/ARCore…
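ARKit/ARCore-style floor detection can be thought of as fitting a horizontal plane to the sparse SLAM map points, with the IMU supplying the gravity direction (the plane normal), so only the offset along gravity has to be estimated. The sketch below illustrates that idea; the histogram binning, threshold, and function names are assumptions for illustration, not how ARKit/ARCore actually implement it.

```python
# Minimal floor-detection sketch: gravity from the IMU fixes the plane normal,
# so the floor reduces to a 1D offset along the gravity direction.
import numpy as np

def detect_floor(points, gravity, inlier_thresh=0.03):
    """points: Nx3 sparse map points from SLAM. gravity: 3-vector from the IMU (pointing down).
    Returns (floor_offset_along_gravity, inlier_mask)."""
    g = gravity / np.linalg.norm(gravity)
    heights = points @ g                          # signed offset of each point along gravity
    # Take the densest horizontal layer as the floor candidate
    # (a real system would also require it to lie below the camera).
    hist, edges = np.histogram(heights, bins=50)
    i = np.argmax(hist)
    candidate = 0.5 * (edges[i] + edges[i + 1])
    inliers = np.abs(heights - candidate) < inlier_thresh
    return heights[inliers].mean(), inliers
```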
11. 12
What is still missing?
Merchants’ pages → 3D models
SLAM w/ scale estimation
Advanced visualization w/ inpainting & relighting
AR app for everyone
E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces," SIGGRAPH Asia, 2016.
12. 13
ARKit / ARCore
What is still missing?
Merchants’ pages → 3D models
SLAM w/ scale estimation
Advanced visualization w/ inpainting & relighting
AR app for everyone
E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces," SIGGRAPH Asia, 2016.
14. 15
Simplest Web AR
Pros:
• No need to install a native app
• Easy to create with just HTML (+ JavaScript)
Cons:
• Marker-based
• Needs a recent environment (iOS 11+ Safari, Android 5+ Chrome)
Implementation
• AR.js + A-Frame
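A minimal sketch of what "just HTML (+ JavaScript)" means in practice: a single A-Frame + AR.js page showing a box on the built-in Hiro marker, wrapped here in a tiny Python server so it can be opened locally. The script URLs and library versions are assumptions and may need updating; outside of localhost, camera access generally requires HTTPS.

```python
# Serve a one-file, marker-based Web AR page (A-Frame + AR.js) on http://localhost:8000.
import http.server

PAGE = """<!DOCTYPE html>
<html>
  <head>
    <script src="https://aframe.io/releases/0.9.2/aframe.min.js"></script>
    <script src="https://jeromeetienne.github.io/AR.js/aframe/build/aframe-ar.js"></script>
  </head>
  <body style="margin: 0;">
    <!-- 'arjs' enables AR.js on the A-Frame scene; 'hiro' is a built-in marker preset -->
    <a-scene embedded arjs>
      <a-marker preset="hiro">
        <a-box position="0 0.5 0" color="tomato"></a-box>
      </a-marker>
      <a-entity camera></a-entity>
    </a-scene>
  </body>
</html>"""

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE.encode("utf-8"))

if __name__ == "__main__":
    # Camera access is allowed on localhost without HTTPS in most browsers.
    http.server.HTTPServer(("", 8000), Handler).serve_forever()
```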
15. 16
About 1K people tried @ Spartan Race in Sendai (2018/12/15)
• AR photo booth: 240 groups
• AR lottery: 510 people
2K people tried @ Japan Open (2018/10/02-07)
16. 17
AR message card (2019/5/12, 6/16): trials on Mother's Day & Father's Day
AR Quiz (2019/5/16): trial @ Tokyo Dome
AR Lottery (2018, 2019/3, 4): R-mobile campaign
20.
Pipeline diagram (components): input image; object detection & recognition; surface orientation; partial view alignment; 3D pose estimation; plane fitting; 3D scene initialization; room geometry; objects in 3D scene; walls initialized with unknown scale
21. 24
Improving the experience on mobile
• Dense 3D reconstruction
• Occlusion-aware AR
• Manipulatable AR
23. 26
Dense Visual Monocular SLAM
• Direct method based on photo consistency
• Multi-baseline stereo using GPU
• Getting easier to run on the latest mobile devices, but still unattractive from the end-user point of view because of energy consumption, etc.
R. A. Newcombe, S. J. Lovegrove and A. J. Davison,
"DTAM: Dense tracking and mapping in real-time," ICCV, 2011
24. 27
Depth prediction by CNN
D. Eigen, C. Puhrsch, and R. Fergus.
“Depth map prediction from a single image using a multi-scale deep network.”
NIPS, 2014.
M. Kaneko, K. Sakurada and K. Aizawa.
“MeshDepth: Disconnected Mesh-based Deep Depth Prediction.”
ArXiv, 2019.
Global Coarse-Scale Network + Local Fine-Scale Network
Disconnected mesh representation
25. 28
SLAM + Depth prediction
Semi-dense SLAM + prediction (CNN-SLAM)
Compact and optimisable representation of dense geometry (CodeSLAM)
K. Tateno, F. Tombari, I. Laina and N. Navab, "CNN-SLAM: Real-Time Dense
Monocular SLAM with Learned Depth Prediction," CVPR, 2017.
M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger and A. J. Davison.
“CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM.”
CVPR, 2018.
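One concrete reason to combine SLAM with learned depth, and a recurring theme of this deck ("SLAM w/ scale estimation"): monocular SLAM reconstructs geometry only up to an unknown scale, while a depth CNN predicts metric depth. A minimal sketch of recovering a global scale by comparing the two at tracked feature locations; the median-ratio estimator and the function name are illustrative assumptions, not the exact procedure of CNN-SLAM or CodeSLAM.

```python
# Estimate a global metric scale for an up-to-scale SLAM map from CNN depth.
import numpy as np

def estimate_global_scale(slam_depths, cnn_depths):
    """slam_depths: depths of map points at tracked features, in arbitrary SLAM units.
    cnn_depths: CNN-predicted metric depths sampled at the same pixel locations (meters)."""
    slam_depths = np.asarray(slam_depths, dtype=float)
    cnn_depths = np.asarray(cnn_depths, dtype=float)
    valid = (slam_depths > 0) & (cnn_depths > 0)
    ratios = cnn_depths[valid] / slam_depths[valid]
    return np.median(ratios)   # multiply SLAM depths/translations by this to get meters
```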
26. 29
Mesh CNN-SLAM (ICCV WS 2017)
System diagram: five threads (image capturing & visualization, 2D tracking, 3D mapping, depth prediction, depth fusion), split between CLIENT-SIDE and SERVER-SIDE. Monocular visual SLAM and depth prediction by CNN feed the 3D reconstruction; depth fusion is done by surface mesh deformation (ARAP deformation) over key-frames along the timeline t, t+1, t+2, t+3, t+4.
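A hypothetical sketch of the client-side key-frame handling described above and in the paper excerpt on the following slides: ORB-SLAM proposes key-frames based on visual change, and only those that are sufficiently separated in space and time are forwarded to the depth-prediction server. The thresholds and data layout are assumptions.

```python
# Spatio-temporal key-frame filtering before sending frames to the server.
import numpy as np

MIN_TRANSLATION_M = 0.15   # minimum camera motion between sent key-frames (assumed)
MIN_TIME_GAP_S = 0.5       # minimum time gap between sent key-frames (assumed)

def select_keyframes_to_send(keyframes):
    """keyframes: list of dicts with 'timestamp' (s) and 'pose' (4x4 camera-to-world)."""
    sent = []
    for kf in keyframes:
        if not sent:
            sent.append(kf)
            continue
        last = sent[-1]
        translation = np.linalg.norm(kf["pose"][:3, 3] - last["pose"][:3, 3])
        time_gap = kf["timestamp"] - last["timestamp"]
        if translation >= MIN_TRANSLATION_M and time_gap >= MIN_TIME_GAP_S:
            sent.append(kf)   # forward this key-frame (image + pose) to the server
    return sent
```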
27. 30
Mesh CNN-SLAM (ICCV WS 2017)
Figure 4. (Top) Distribution of weights w_i for the deformation and (bottom) the corresponding textured mesh. Larger intensity values in the top figure indicate higher weights.
4. Experiments
[...] frames detected by ORB-SLAM, because these are selected based on visual changes. We filter out those key-frames using a spatio-temporal distance criterion similar to other feature-based approaches, e.g., PTAM, and send them to the server.
The key-frames are processed on the server, and the depth image for each frame is estimated by the CNN architecture. In the fusion process, we convert the depth images to a refined mesh sequence, as shown at the bottom of Figure 5. On the other hand, we also make the ground-truth mesh sequence correspond to the refined one from the raw depth maps captured by the depth sensor. We compute residual errors between the refined mesh and the ground truth, as shown in Table 2 and Figure 6. We can observe that our framework efficiently reduces the residual errors for all sequences: both the average and the median of the residual errors fall within the range from about two thirds to a half.
We also evaluate the absolute scale estimated from depth prediction, as shown in the rightmost column of Table 2. The average error of the estimated scales for our six office scenes is 20% of the ground-truth scale.
5. Conclusion
In this paper, we proposed a framework fusing the result of geometric measurement, i.e., feature-based monocular [...]
28. 31
Mesh CNN-SLAM (ICCV WS 2017)
Sofa area 1, Sofa area 2, Sofa area 3, Desk area 1, Desk area 2, Meeting room
Figure 5. Input data for our depth fusion and the reconstructed scenes. From top to bottom row: color images, feature tracking result of SLAM, corresponding ground truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes, respectively.
Table 2 (column headers): Scene | Mesh from CNN depth map: Mean, Median, Std dev | Refined mesh by our method: Mean, Median, Std dev, Scale
29. 32
Mesh CNN-SLAM (ICCV WS 2017)
Table 1. Properties of individual reconstruction methods and of their combination, which retains desirable properties of each.
Method: Monocular visual SLAM (feature based) | 3D reconstruction: sparse (scene-complexity dependent) | Computational complexity: low (runs on mobile device) | Accuracy: high | Scale: none
Method: CNN-based depth prediction | 3D reconstruction: dense (estimated for each pixel) | Computational complexity: high (a few seconds for each frame) | Accuracy: medium (training-data dependent) | Scale: available
Method: Proposed framework | 3D reconstruction: dense (estimated for each pixel) | Computational complexity: high (but only visual SLAM runs on mobile device) | Accuracy: high | Scale: available
Curless et al. proposed to use averaging of truncated signed distance functions (TSDF) for depth fusion [3], which is simple yet effective and is used in a large number of reconstruction pipelines including KinectFusion [21].
Mesh deformation techniques are widely used in graphics and vision. In particular, linear variational mesh deformation techniques were developed for editing detailed high-resolution meshes, like those produced by scanning real-world objects [2]. For local detail preservation, mesh deformations that are locally as-rigid-as-possible (ARAP) have been proposed. The ARAP method by Sorkine et al. [25] [...]
3.1. Monocular visual SLAM
Although our framework is compatible with any type of feature-based monocular visual SLAM method, we employ ORB-SLAM [20] because of its robustness and accuracy. ORB-SLAM incorporates three parallel threads: tracking, mapping and loop closing. The tracking is in charge of localizing the camera in every frame and deciding when to insert a new key-frame. The mapping processes new key-frames and performs local bundle adjustment for reconstruction. The loop closing searches for loops with every new key-frame.
Each key-frame K_t is associated with the camera pose T_kt at time t, the locations of ORB features p2D(t), and correspond[...]
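The excerpt above points to TSDF averaging (Curless et al.) as the standard depth-fusion operator behind pipelines such as KinectFusion. A minimal sketch of the weighted running-average update per voxel; the truncation distance, array layout, and validity rule are illustrative assumptions.

```python
# Weighted running-average TSDF update for one depth observation.
import numpy as np

TRUNCATION_M = 0.05  # truncate signed distances beyond +/- 5 cm of the surface (assumed)

def update_tsdf(tsdf, weights, signed_distances, new_weight=1.0):
    """Fuse one depth observation into the TSDF volume.

    tsdf, weights, signed_distances: arrays of the same shape, one value per voxel.
    signed_distances holds each voxel's distance to the observed surface along the
    camera ray (positive in front of the surface, negative behind it)."""
    d = np.clip(signed_distances / TRUNCATION_M, -1.0, 1.0)   # normalised, truncated SDF
    valid = signed_distances > -TRUNCATION_M                   # skip voxels far behind the surface
    w_new = weights + new_weight * valid
    tsdf_new = np.where(valid,
                        (tsdf * weights + d * new_weight) / np.maximum(w_new, 1e-6),
                        tsdf)
    return tsdf_new, w_new
```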
30. 33
Occlusion-aware AR
• Optical-flow based depth edge approach
• Disparity + Bilateral grid approach
• CNN-based approach
A. Holynski, J. Kopf.
“Fast Depth Densification for Occlusion-aware Augmented Reality.”
SIGGRAPH Asia 2018
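Whatever produces the dense depth (optical-flow densification, depth from motion, or a CNN), occlusion-aware AR ends with a per-pixel depth test between the real scene and the rendered virtual content. A minimal compositing sketch, assuming a densified scene depth map and the virtual object's colour and depth buffers are already available; array names are illustrative.

```python
# Depth-tested compositing of virtual content over a camera frame.
import numpy as np

def composite_occlusion_aware(camera_rgb, scene_depth, virtual_rgb, virtual_depth):
    """All inputs are HxW(x3) arrays; depths in meters, np.inf where the virtual object
    does not cover the pixel. Virtual pixels are drawn only where they are closer to
    the camera than the real scene."""
    virtual_in_front = virtual_depth < scene_depth          # per-pixel depth test
    mask = virtual_in_front[..., None]                       # broadcast over RGB channels
    return np.where(mask, virtual_rgb, camera_rgb)
```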
31. 34
Optical-flow based depth edge approach (Facebook)
A. Holynski, J. Kopf.
“Fast Depth Densification for Occlusion-aware Augmented Reality.”
SIGGRAPH Asia 2018
(U-Washington + FB)
32. 35
Disparity + Bilateral grid approach (Google)
J. Valentin, et al. “Depth from motion for smartphone AR.” ACM Trans. Graph, 2018.
33. 36
CNN-based approach (Niantic)
C. Godard, O. Mac Aodha, and G. J. Brostow.
“Digging Into Self-Supervised Monocular Depth Estimation.” arXiv, 2018.
C. Godard, O. Mac Aodha, and G. J. Brostow.
“Unsupervised Monocular Depth Estimation with Left-Right Consistency.” CVPR, 2017.
• Monodepth: Unsupervised Monocular Depth Estimation with Left-Right Consistency
• Monodepth2: Self-Supervised Monocular Depth Estimation
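The left-right consistency idea behind Monodepth, sketched for a single image row: the disparity predicted for the left view should agree with the right-view disparity map sampled at the disparity-shifted position. This numpy version with nearest-neighbour sampling is a simplification of the paper's differentiable, bilinearly sampled loss, which is also combined with appearance and smoothness terms.

```python
# Simplified left-right disparity consistency term for one image row.
import numpy as np

def lr_consistency_loss_row(disp_left, disp_right):
    """disp_left, disp_right: 1D arrays of disparities (in pixels) for one image row,
    both predicted from the left image as in Monodepth."""
    w = disp_left.shape[0]
    xs = np.arange(w)
    # Position in the right-view disparity map corresponding to each left pixel.
    sample_x = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    return np.mean(np.abs(disp_left - disp_right[sample_x]))
```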
34. 37
CNN-based approach
M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia.
“Towards real-time unsupervised monocular depth estimation on cpu.”
IROS, 2018.
C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth
estimation with left-right consistency. In CVPR, 2017.
• PyD-Net: Pyramidal features extractor to reduce complexity
• Based on Monodepth
• Customized for CPU
36. 39
Spatial consistency for disocclusion
P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng and N. Snavely.
“Pushing the Boundaries of View Extrapolation with Multiplane Images.” CVPR, 2019.
• View synthesis based on Multiplane Image (MPI)
Cf. Multiplane camera in animation
• Novel view extrapolations with plausible disocclusions
• Consistency between rendered views
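The rendering step at the heart of a Multiplane Image is simple: a stack of fronto-parallel RGBA layers at fixed depths is alpha-composited back to front with the standard "over" operator. A minimal sketch; warping the layers into a novel viewpoint, where the extrapolation and disocclusion handling of the cited work happens, is omitted here.

```python
# Back-to-front "over" compositing of MPI layers.
import numpy as np

def composite_mpi(rgba_layers):
    """rgba_layers: array of shape (D, H, W, 4), ordered from nearest to farthest plane.
    Returns the composited HxWx3 image."""
    out = np.zeros(rgba_layers.shape[1:3] + (3,))
    # Iterate from the farthest plane to the nearest one, blending with the 'over' operator.
    for layer in rgba_layers[::-1]:
        color, alpha = layer[..., :3], layer[..., 3:4]
        out = color * alpha + out * (1.0 - alpha)
    return out
```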
37. 40
Temporal consistency for disocclusion
R. Xu, X. Li, B. Zhou and C. C. Loy. “Deep Flow-Guided Video Inpainting.” CVPR, 2019.
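In the spirit of flow-guided video inpainting, temporal consistency comes from propagating known pixels along optical flow into the missing region of the current frame. A deliberately simple, single-step numpy sketch with nearest-neighbour sampling; the cited method completes the flow field itself, propagates over many frames in both directions, and inpaints whatever remains with a network.

```python
# Fill masked pixels of the current frame by following flow back to the previous frame.
import numpy as np

def fill_holes_from_previous_frame(curr_frame, prev_frame, flow_to_prev, hole_mask):
    """curr_frame, prev_frame: HxWx3; flow_to_prev: HxWx2 flow from current to previous
    frame (dx, dy); hole_mask: HxW boolean, True where pixels are missing."""
    h, w = hole_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow_to_prev[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow_to_prev[..., 1]).astype(int), 0, h - 1)
    filled = curr_frame.copy()
    filled[hole_mask] = prev_frame[src_y[hole_mask], src_x[hole_mask]]
    return filled
```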
38. 41
Learning Human Depth for Disocclusion
Z. Li, T. Dekel, F. Cole, R. Tucker,
N. Snavely, C. Liu and W. T. Freeman.
“Learning the Depths of Moving People
by Watching Frozen People.”
CVPR, 2019.
Apple also revealed a "People Occlusion" feature for ARKit 3 at WWDC 2019.
39. 42
Manipulation of the viewpoint & appearance
M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely and R. M-Brualla.
“Neural Rendering in the Wild.” CVPR, 2019
Total Scene Capture
• Encode the 3D structure of the scene, enabling rendering from an arbitrary viewpoint.
• Capture all possible appearances of the scene and allow rendering the scene under any of them.
• Understand the location and appearance of transient objects in the scene, and allow for reproducing or omitting them.
40. 43
Understanding & manipulation of the environment around the user
• Delivery robots