Despite the recent successes of 3D reconstruction, most research focuses mainly on acquiring precise geometry.
However, many computer graphics applications such as AR/VR need more than scene geometry: surface color and semantics are required to provide a richer user experience, yet existing 3D reconstruction methods leave such auxiliary information out of consideration.
This talk will present our two approaches to reconstructing the color and semantic information of 3D indoor scenes, as follows:
Junho Jeon, Yeongyu Jung, Haejoon Kim, Seungyong Lee, "Texture map generation for 3D reconstructed scenes", The Visual Computer (CGI 2016), Vol. 32, No. 5, May 2016.
Junho Jeon, Jinwoong Jung, Jungeon Kim, Seungyong Lee, "Semantic Reconstruction: Reconstruction of Semantically Segmented 3D Meshes via Volumetric Semantic Fusion", Computer Graphics Forum (Pacific Graphics 2018), Vol. 37, No. 7, October 2018.
CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps - NAVER Engineering
Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information. This task is inherently challenging since many photos have only a few, possibly ambiguous, cues to their geolocation. Recent work has cast this task as a classification problem by partitioning the earth into a set of discrete cells that correspond to geographic regions. The granularity of this partitioning presents a critical trade-off: using fewer but larger cells results in lower location accuracy, while using more but smaller cells reduces the number of training examples per class and increases model size, making the model prone to overfitting. To tackle this issue, we propose a simple but effective algorithm, combinatorial partitioning, which generates a large number of fine-grained output classes by intersecting multiple coarse-grained partitionings of the earth. Each classifier votes for the fine-grained classes that overlap with its respective coarse-grained ones. This technique allows us to predict locations at a fine scale while maintaining sufficient training examples per class. Our algorithm achieves state-of-the-art performance in location recognition on multiple benchmark datasets.
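The voting scheme over intersected partitions can be sketched in a few lines. The longitude-only partitionings, boundaries, and classifier scores below are hypothetical toy values for illustration, not the cells or models used in the paper:

```python
from itertools import product

def cell_id(lng, boundaries):
    """Index of the half-open cell [b_i, b_{i+1}) containing lng."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= lng < boundaries[i + 1]:
            return i
    raise ValueError("lng out of range")

# Two hypothetical coarse partitionings of longitude with different
# boundaries; their intersections yield finer regions.
A = [-180, -90, 0, 90, 180]   # 4 coarse cells
B = [-180, -60, 60, 180]      # 3 coarse cells

def fine_class(lng):
    """A fine-grained class is the pair of overlapping coarse cells."""
    return (cell_id(lng, A), cell_id(lng, B))

def predict(scores_a, scores_b):
    """Each coarse classifier votes for the fine classes overlapping its
    cells; the fine class with the highest summed score wins."""
    best, best_score = None, float("-inf")
    for ia, ib in product(range(len(A) - 1), range(len(B) - 1)):
        # keep only geometrically non-empty intersections
        lo, hi = max(A[ia], B[ib]), min(A[ia + 1], B[ib + 1])
        if lo >= hi:
            continue
        s = scores_a[ia] + scores_b[ib]
        if s > best_score:
            best, best_score = (ia, ib), s
    return best
```

For example, `predict([0.1, 0.2, 0.6, 0.1], [0.1, 0.2, 0.7])` selects the intersection of A's third cell and B's third cell, a region narrower than either coarse cell alone.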
SeedNet: Automatic Seed Generation with Deep Reinforcement Learning for Robus... - NAVER Engineering
This paper proposes a seed generation technique based on deep reinforcement learning to solve the interactive segmentation problem. One of the key issues in interactive segmentation is minimizing user intervention. The proposed system generates artificial seeds on the user's behalf; the user only needs to provide initial seed information. Because the ambiguity in defining an optimal seed point makes supervised training difficult, we overcome this with reinforcement learning: we define an MDP tailored to the seed generation problem and successfully train a deep Q-network. Trained on the MSRA10K dataset, our method shows superior performance compared to the inaccurate initial results of existing segmentation algorithms.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitations of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
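The coarse-to-fine composition of residual affine transformations described above amounts to multiplying homogeneous matrices level by level. The three residual transforms below are hypothetical illustrations, not values from the paper:

```python
def to3x3(a):
    """Lift a 2x3 affine [[a, b, tx], [c, d, ty]] to a homogeneous 3x3."""
    return [a[0][:], a[1][:], [0.0, 0.0, 1.0]]

def compose(f, g):
    """Matrix product f @ g for 3x3 homogeneous affines (apply g, then f)."""
    return [[sum(f[i][k] * g[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Hypothetical residual affine transforms at coarse, middle, fine levels.
levels = [
    [[1.0, 0.00, 4.0], [0.00, 1.0, -2.0]],  # coarse: mostly translation
    [[1.1, 0.00, 0.5], [0.00, 0.9,  0.0]],  # middle: mild scaling
    [[1.0, 0.05, 0.0], [-0.05, 1.0, 0.1]],  # fine: small shear/rotation
]

# The final transformation is the composition of all residual levels.
final = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for t in levels:
    final = compose(to3x3(t), final)
```

Because each level only refines the previous estimate, the per-level residuals stay small, which is what lets the pyramid impose smoothness naturally.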
This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a crucial technology for realizing HCI, AR, and similar applications. Although many researchers have proposed methods to improve accuracy, accuracy has remained limited by the similar appearance of fingers, occlusions, and the complexity of diverse finger motions. To overcome the limitations of existing methods, this paper changes the input and output representations they use. Unlike most prior work, which takes a 2D depth image as input and directly regresses the 3D coordinates of hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. We use an encoder-decoder 3D CNN for this, and thanks to the changed input and output representations, the proposed model achieves the best performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
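The change of representation (voxelized depth input, per-joint 3D heatmap output) can be sketched as follows. The grid size, coordinate range, and heatmap values are hypothetical toy choices, and the real model predicts the heatmaps with a 3D CNN rather than receiving them:

```python
def voxelize(points, grid=4, lo=0.0, hi=4.0):
    """Bin 3D points (hypothetical metric range [lo, hi)) into a binary
    occupancy grid of grid^3 cells, the input form of voxel-based models."""
    step = (hi - lo) / grid
    occ = [[[0 for _ in range(grid)] for _ in range(grid)]
           for _ in range(grid)]
    for x, y, z in points:
        i, j, k = (int((v - lo) // step) for v in (x, y, z))
        if all(0 <= t < grid for t in (i, j, k)):
            occ[i][j][k] = 1
    return occ

def heatmap_argmax(heat):
    """Joint estimate = voxel with the highest heatmap response."""
    g = len(heat)
    return max(((i, j, k) for i in range(g)
                for j in range(g) for k in range(g)),
               key=lambda t: heat[t[0]][t[1]][t[2]])

# Toy demo: two depth points and one synthetic per-joint heatmap.
occ = voxelize([(0.5, 0.5, 0.5), (3.5, 0.1, 0.1)])
heat = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
heat[2][1][3] = 0.9   # hypothetical peak response for one joint
```

Reading the joint location off the heatmap peak, instead of regressing coordinates directly, is what makes the output spatially grounded in the input volume.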
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F... - Simone Ercoli
I presented an interesting paper during the Vision and Multimedia Reading Group: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition (pdf).
It is a thorough evaluation of features extracted from the activations of a deep convolutional network trained on a large-scale dataset.
This is a work by Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell from UC Berkeley.
Scaling up Deep Learning Based Super Resolution Algorithms - Xiaoyong Zhu
Super-resolution is a process for obtaining one or more high-resolution images from one or more low-resolution observations. It has been used for many applications, including satellite and aerial imaging, medical image processing, ultrasound imaging, line fitting, automated mosaicking, infrared imaging, facial image improvement, text image improvement, compressed image and video enhancement, and fingerprint image enhancement. While research on super-resolution began in the 1970s, recently, with the power of deep learning, many notable new methods have been created, including SRCNN, SRResNet, and lately SRGANs, which use generative adversarial networks. However, since these approaches require a lot of images to train the deep learning network, they are extremely compute-intensive. Fortunately, with the power of the cloud, you can easily scale up the compute resources as needed, making the algorithm converge faster.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Speaker: Bongsoo Choy (Chris Choy, PhD student, Stanford University)
Date: August 2017
Reviewer and workshop organizer for CVPR, ICCV, NIPS, TIP, ICRA, TPAMI, 3DV, etc.
Overview:
3D perception ranges broadly from simple geometric understanding of objects to high-level cognition such as semantic scene understanding and inferring the relationships between objects. In this talk, I'll present a broad class of works, from low-level to high-level cognition tasks, that encompass 3D perception.
Super Resolution in the Deep Learning Era - Jaejun Yoo
Abstract:
Image restoration (IR) is one of the fundamental problems in low-level vision, encompassing denoising, deblurring, super-resolution, and more. In today's talk, I will focus on the super-resolution task. There are two main streams in super-resolution research: traditional model-based optimization and discriminative learning methods. I will present the pros and cons of both methods and their recent developments in the research field. Finally, I will provide a mathematical view that explains both methods in a single holistic framework, while achieving the best of both worlds. The last slide summarizes the remaining problems that are yet to be solved in the field.
Single Image Depth Estimation Using Frequency Domain Analysis and Deep Learning - Ahan M R
Using machine learning and deep learning techniques, we train a ResNet-based CNN for depth estimation with discrete Fourier domain analysis, and present results along with an explanation of the loss function and code snippets.
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad... - Wanjin Yu
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning Part 3: prior embedding deep super resolution
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
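As a back-of-the-envelope check on the parameter count, most of the roughly 60 million parameters sit in the fully-connected layers. The layer shapes below follow the commonly cited AlexNet configuration:

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch

def fc_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# First conv layer: 96 filters of 11x11 over 3 input channels.
first_conv = conv_params(3, 96, 11)   # 34,944 parameters

# The two largest fully-connected layers dominate the total:
fc6 = fc_params(256 * 6 * 6, 4096)    # ~37.8M (from the last conv volume)
fc7 = fc_params(4096, 4096)           # ~16.8M
```

Together fc6 and fc7 already account for over 54 million parameters, which is why dropout is applied precisely in the fully-connected layers.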
Attentive Semantic Alignment with Offset-Aware Correlation Kernels - NAVER Engineering
Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. One recent approach to this problem is to estimate the parameters of a global transformation model that densely aligns one image to the other. Since an entire correlation map between all feature pairs across images is typically used to predict such a global transformation, noisy features from different backgrounds, clutter, and occlusion distract the predictor from correctly estimating the alignment. This is a challenging issue, in particular in semantic correspondence, where a large degree of image variation is often involved. In this paper, we introduce an attentive semantic alignment method that focuses on reliable correlations, filtering out distractors. For effective attention, we also propose an offset-aware correlation kernel that learns to capture translation-invariant local transformations when computing correlation values over spatial locations. Experiments demonstrate the effectiveness of the attentive model and offset-aware kernel, and the proposed model combining both techniques achieves state-of-the-art performance.
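The correlation map the abstract refers to is, at its core, a similarity score between every pair of spatial positions across the two feature maps. A minimal sketch with tiny hypothetical features, and without the proposed offset-aware kernel or attention:

```python
def correlation_map(feat_a, feat_b):
    """Dense normalized correlation between two feature maps, each given
    as a list of (position, feature-vector) pairs: one score per pair."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    def norm(u):
        return dot(u, u) ** 0.5 or 1.0   # guard against zero vectors
    return {(pa, pb): dot(va, vb) / (norm(va) * norm(vb))
            for pa, va in feat_a for pb, vb in feat_b}

# Toy 2-channel features at a couple of spatial positions.
feat_a = [((0, 0), [1.0, 0.0]), ((0, 1), [0.0, 1.0])]
feat_b = [((0, 0), [1.0, 0.0])]
corr = correlation_map(feat_a, feat_b)
```

A noisy background feature produces spurious high entries in this map, which is exactly what the paper's attention mechanism learns to suppress.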
Convolutional Neural Networks: Popular Architectures - ananth
In this presentation we look at some of the popular architectures, such as ResNet, that have been successfully used for a variety of applications. Starting from AlexNet and VGG, which showed that deep learning architectures can deliver unprecedented accuracies for image classification and localization tasks, we review more recent architectures such as ResNet, GoogLeNet (Inception), and the more recent SENet, which have won ImageNet competitions.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Explores the type of structure learned by Convolutional Neural Networks, the applications where they're most valuable and a number of appropriate mental models for understanding deep learning.
Modern Convolutional Neural Network Techniques for Image Segmentation - Gioele Ciaparrone
Recently, convolutional neural networks have been successfully applied to image segmentation tasks. Here we present some of the most recent techniques that have increased accuracy in such tasks. First we describe the Inception architecture and its evolution, which allowed increasing the width and depth of the network without increasing the computational burden. We then show how to adapt classification networks into fully convolutional networks able to perform pixel-wise classification for segmentation tasks. We finally introduce the hypercolumn technique to further improve the state of the art on various fine-grained localization tasks.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an image semantic segmentation project targeting satellite imagery. The goal was to predict pixel-wise segmentation maps for various objects in satellite imagery, including buildings, water bodies, and roads. The data was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented the FCN, U-Net, and SegNet deep learning architectures for this task.
Enhanced Deep Residual Networks for Single Image Super-Resolution - NAVER Engineering
Speaker: Heewon Kim (PhD student, Seoul National University)
Date: September 2017
Currently in the combined MS/PhD program in Electrical and Computer Engineering, Seoul National University
Best Paper Award of NTIRE 2017 Workshop: Challenge Track
Overview:
Single-image super-resolution is a research field that restores a low-resolution image to its high-resolution original. Common real-world examples include keeping a small region of a social media photo sharp when it is greatly enlarged, or producing a full-resolution image from a thumbnail.
In this talk, after reviewing research directions before and after deep learning, we will examine our team's work, which won the 2nd NTIRE Workshop Challenge at CVPR 2017, focusing on an analysis of the network architecture.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis - taeseon ryu
This paper is a 3D-aware model. With StyleGAN, when you want to edit a particular feature, you can find the latent vector corresponding to the input and modify that latent vector to change the corresponding feature, such as the mouth. Building on this concept,
the GANSpace paper attempted to edit even spatial information given an input. Looking at the results, rotation appears to be reasonably well learned, but the output is sometimes perceived as a different person. This is what we mean by the features not being disentangled: instead of changing only the desired feature, other features change as well. This paper was created to address that problem by making the model understand 3D more effectively.
From Experimentation to Production: The Future of WebGL - FITC
Presented at FITC Toronto 2017
More info at http://fitc.ca/event/to17/
Hector Arellano, Firstborn
Morgan Villedieu, Firstborn
Overview
You don’t need an advanced degree in graphics engineering to use WebGL as a robust solution in your web design and development. During this talk you will discover how to harness the power of WebGL for real-world application.
Objective
Discover real-world applications for advanced WebGL techniques
Target Audience
Designers or developers excited to conquer the complexity associated with WebGL
Five Things Audience Members Will Learn
Explore the outer limits of physics effects, shaders and experimentation
Understand how these techniques can be applied to transform 3D to 2D shadows and post-processing
Render real-time liquid in WebGL
Use DOM as a texture so you get the power of WebGL without having to worry about a fallback system
Master the basics by utilizing libraries
a collection of terminologies used in the game development industry, from my point of view any one who intends to work in that business should understand them.
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...Gurbinder Gill
In this presentation we propose the parallel implementation of template matching using Full Search using NCC as a measure using the concept of pre-computed sum-tables referred to as FNCC for high resolution images on NVIDIA’s Graphics Processing Units (GP-GPU’s)
A Certain Slant of Light - Past, Present and Future Challenges of Global Illu...Electronic Arts / DICE
Global illumination (GI) has been an ongoing quest in games. The perpetual tug-of-war between visual quality and performance often forces developers to take the latest and greatest from academia and tailor it to push the boundaries of what has been realized in a game product. Many elements need to align for success, including image quality, performance, scalability, interactivity, ease of use, as well as game-specific and production challenges.
First we will paint a picture of the current state of global illumination in games, addressing how the state of the union compares to the latest and greatest research. We will then explore various GI challenges that game teams face from the art, engineering, pipelines and production perspective. The games industry lacks an ideal solution, so the goal here is to raise awareness by being transparent about the real problems in the field. Finally, we will talk about the future. This will be a call to arms, with the objective of uniting game developers and researchers on the same quest to evolve global illumination in games from being mostly static, or sometimes perceptually real-time, to fully real-time.
This presentation was given at SIGGRAPH 2017 by Colin Barré-Brisebois (EA SEED) as part of the Open Problems in Real-Time Rendering course.
Past, Present and Future Challenges of Global Illumination in GamesColin Barré-Brisebois
Global illumination (GI) has been an ongoing quest in games. The perpetual tug-of-war between visual quality and performance often forces developers to take the latest and greatest from academia and tailor it to push the boundaries of what has been realized in a game product. Many elements need to align for success, including image quality, performance, scalability, interactivity, ease of use, as well as game-specific and production challenges.
First we will paint a picture of the current state of global illumination in games, addressing how the state of the union compares to the latest and greatest research. We will then explore various GI challenges that game teams face from the art, engineering, pipelines and production perspective. The games industry lacks an ideal solution, so the goal here is to raise awareness by being transparent about the real problems in the field. Finally, we will talk about the future. This will be a call to arms, with the objective of uniting game developers and researchers on the same quest to evolve global illumination in games from being mostly static, or sometimes perceptually real-time, to fully real-time.
DTAM: Dense Tracking and Mapping in Real-Time, Robot vision GroupLihang Li
This is the slides about DTAM for my group meeting report, hope it does help to anyone who will want to implement DTAM and need to understand it deeply.
Similar to Color and 3D Semantic Reconstruction of Indoor Scenes from RGB-D stream (20)
비행기 설계를 왜 통일 해야 할까?
디자인 시스템을 하는 이유
비행기들이 다 용도가 다르다...어떻게 설계하지?
맥락이 다른 페이지와 패턴
경유지까지 아직 멀었다... 언제 수리하지?
디자인 시스템을 적용하는 시점
엔지니어랑 얘기해서 정비해야하는데...어떻게 수리하지?
디자인 시스템을 적용하는 프로세스
비행기 설계가 바뀐걸 어떻게 알리지?
디자인 시스템의 전파
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
Color and 3D Semantic Reconstruction of Indoor Scenes from RGB-D stream
1. Color and 3D Semantic Reconstruction
of Indoor Scenes from RGB-D Streams
Junho Jeon (전준호)
CG Lab. POSTECH
Tech Talk @ NAVER
2018.12.10
2. 3D Reconstruction
• Capture shape and appearance of real objects and environments
• Produce 3D models for applications such as virtual/augmented reality, 3D printing
3. 3D Reconstruction using RGB-D Sensor
• Geometric reconstruction methods have developed rapidly and are available for large-scale scenes
▫ But they mainly focus on acquiring accurate geometry
KinectFusion [Newcombe 2011], Voxel hashing [Nießner 2013], Elastic fragments [Zhou 2013], Robust reconstruction [Choi 2015]
4. Auxiliary Information of 3D Indoor Scene
• Surface color
• Object class
• Lighting condition
• Sound
Color and semantic reconstruction → rich UX
5. Contributions – Color Reconstruction
• Texture Map Generation for 3D Reconstructed Scenes
▫ Reconstruct clean and sharp surface color of the 3D reconstructed scene
▫ Light-weight color representation for reconstructed scenes
▫ Texture coordinates optimization to acquire sharp texture map
Texture map generation for 3D reconstruction
6. Contributions – Semantic Reconstruction
• Reconstruction of semantically segmented 3D meshes
▫ Predict per-vertex object class of the 3D reconstructed scene
▫ Volumetric semantic fusion of frame-by-frame semantic predictions
▫ Adaptive integration and CRF optimization for robust labeling
3D semantic reconstruction
7. Texture Map Generation
for 3D Reconstructed Scenes
Junho Jeon, Yeongyu Jung, Haejoon Kim,
Seungyong Lee
The Visual Computer (CGI 2016)
8. 3D Reconstruction using RGB-D Sensor
• Available for very large-scale scenes
▫ But no or inaccurate color information!
Robust reconstruction [Choi 2015], BundleFusion [Dai 2017]
9. Color Reconstruction
• Naïve color blending introduces blurring, ghosting, etc.
▫ Incorrect camera poses
▫ Lens distortions
▫ Misaligned RGB-D images
• Goal: precisely reconstruct the color from RGB-D stream
Blurry color from volumetric blending
10. Previous work: Color Map Optimization
• Zhou and Koltun, TOG 2014
▫ Project RGB stream onto mesh to get vertex color
▫ Optimize camera pose & warping function for images clean vertex color
▫ Limitation: method is based on vertex colors
→ Time-consuming optimization (takes 5 mins.)
→ Inefficient rendering
* Images from Zhou's slides
11. Our Approach
• Color reconstruction based on texture mapping
▫ Generating texture map for simplified mesh
▫ Optimize texture map to maximize photometric consistency
▫ GPU-based parallel solver
100x faster color reconstruction!
Efficient rendering
14. Preprocessing
• Geometric model reconstruction
▫ Dense scene reconstruction with point of interest [Zhou 2013]
▫ Any other 3D reconstruction method can be used
• Model simplification
▫ Original mesh consists of more than 1M faces
→ Inefficient texture mapping
→ Further processing becomes extremely time-consuming
▫ Surface simplification using quadric error metrics [Garland 1997]
Dense scene reconstruction [Zhou 2013]; mesh simplification (460K to 23K faces)
15. Spatiotemporal Key Frame Sampling
[Pipeline figure, stage (2) highlighted: RGB-D stream (color + depth) → (1) preprocessing (simplified 3D reconstructed mesh) → (2) key frame sampling (spatio-temporally sampled key frames) → (3) texture map generation (sub-textures, global texture) → (4) texture map optimization (refined global texture map) → rendering result]
16. Spatiotemporal Key Frame Sampling
• Input color stream
▫ A lot of redundant data, color images suffer from motion blurs
• Temporal sampling
▫ Sample less blurry key frames based on Blurriness [Crété-Roffet 2007]
• Spatial sampling
▫ Uniqueness: the region of an image that cannot be covered by the other images
▫ Sample by iteratively eliminating the image with minimum uniqueness
Temporal sampling with blurriness; overlapping (red) and unique (blue) regions
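The two-stage sampling above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `blurriness` scores stand in for the metric of [Crété-Roffet 2007], and `uniqueness` is assumed precomputed (the actual method would recompute coverage after each elimination).

```python
# Spatiotemporal key frame sampling (illustrative sketch).
# Temporal: within each window of frames, keep the least blurry one.
# Spatial: greedily drop key frames whose content is covered by the others.

def temporal_sampling(frames, blurriness, window=10):
    """Pick the sharpest frame (minimum blurriness score) in each temporal window."""
    keys = []
    for start in range(0, len(frames), window):
        chunk = list(range(start, min(start + window, len(frames))))
        keys.append(min(chunk, key=lambda i: blurriness[i]))
    return keys

def spatial_sampling(keys, uniqueness, min_uniqueness=0.1):
    """Iteratively eliminate the key frame with the smallest uniqueness
    (i.e. the frame most redundantly covered by the remaining frames)."""
    keys = list(keys)
    while len(keys) > 1:
        worst = min(keys, key=lambda i: uniqueness[i])
        if uniqueness[worst] >= min_uniqueness:
            break
        keys.remove(worst)
    return keys
```

In the real pipeline, uniqueness would be re-evaluated after every removal, since eliminating one frame changes what the others must cover.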
17. Texture Map Generation
[Pipeline figure, stage (3) highlighted: RGB-D stream (color + depth) → (1) preprocessing (simplified 3D reconstructed mesh) → (2) key frame sampling (spatio-temporally sampled key frames) → (3) texture map generation (sub-textures, global texture) → (4) texture map optimization (refined global texture map) → rendering result]
18. Texture Map Generation
• UV unwrapping of the mesh for the global texture map
▫ Get global texture coordinates for every vertex
• Estimate color by blending key frames
▫ Sub-texture map by projecting mesh to each camera
▫ Blended sub-texture becomes global texture
[Diagram: mesh → UV unwrapping → global texture coordinates; mesh projection to each key frame → sub-textures → weighted blending → global texture map]
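The blending step above can be sketched per texel as a weighted average of sub-texture samples. This is a minimal sketch; the choice of per-sample weights (e.g. favoring frontal, well-focused views) is an assumption here, not the paper's exact weighting.

```python
# Weighted blending of sub-texture samples into one global texel (illustrative sketch).
# Each key frame i that sees the texel contributes a color c_i with weight w_i.

def blend_texel(samples):
    """samples: list of (color, weight) pairs for one global texel,
    where color is an (r, g, b) tuple. Returns the weighted average color,
    or None if the texel is unobserved."""
    total_w = sum(w for _, w in samples)
    if total_w == 0:
        return None
    return tuple(
        sum(c[k] * w for c, w in samples) / total_w
        for k in range(3)
    )
```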
19. Global Texture Map Optimization
[Pipeline figure, stage (4) highlighted: RGB-D stream (color + depth) → (1) preprocessing (simplified 3D reconstructed mesh) → (2) key frame sampling (spatio-temporally sampled key frames) → (3) texture map generation (sub-textures, global texture) → (4) texture map optimization (refined global texture map) → rendering result]
20. Global Texture Map Optimization
• Generated texture map also suffers from blurring, ghosting, etc.
▫ Inconsistent color blending from different sub-textures
• Optimize sub-texture coordinates to be consistent
→ Sharper & cleaner global texture map
Consistent vs. inconsistent blending
21. Global Texture Map Optimization
• Search new sub-texture coordinates of each vertex
• Energy formulation for photometric consistency
▫ For every face, blended global texture should be consistent with sub-textures
▫ Consider consistency of sampled points on each face
• Non-linear least squares problem
▫ Needs to be solved with the Gauss-Newton method
[Equation annotations: sub-texture coordinates (the variables), sub-texture intensity, blended global texture intensity, sub-textures of face f]
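The energy annotated above can be written out as follows. This is a reconstruction from the slide's labels, not necessarily the paper's exact notation: here G is the blended global texture intensity, T_i the intensity of sub-texture i, S(f) the sub-textures of face f, P(f) the sampled points on face f, and the sub-texture coordinates u_i are the variables.

```latex
E(\{u_i\}) \;=\; \sum_{f} \sum_{p \in P(f)} \sum_{i \in S(f)}
  \big(\, G(p) - T_i(u_i(p)) \,\big)^2
```

Because each T_i is sampled at the optimized coordinates u_i(p), the residuals are non-linear in the variables, which is why a Gauss-Newton solver is needed.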
22. GPU-based Alternating Solver
• Applying the naïve Gauss-Newton method is non-trivial
▫ Infeasible to solve directly due to the # of variables
• Exploit locality of the problem to parallelize the optimization
▫ Assuming the 1-ring neighborhood of v is fixed, the optimization of the sub-texture coordinates u_v is independent of the other vertices
▫ Schwarz Alternating Method
While keeping boundary variables fixed, update the inner variables
Independent optimizations are propagated iteratively
[Figure: 1-ring neighborhood of v; propagation of independent local optimizations]
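The alternating scheme can be illustrated on a toy quadratic energy over a graph (not the texture energy itself): each vertex is updated in closed form with its 1-ring held fixed, and repeated sweeps propagate the updates across the mesh, in the spirit of the Schwarz alternating method.

```python
# Schwarz-style alternating minimization on a toy quadratic energy (illustrative sketch):
#   E(x) = sum_i (x_i - t_i)^2 + sum_{(i,j) in edges} (x_i - x_j)^2
# Each local subproblem (vertex i with its neighbors held fixed) has the closed-form
# update below; independent local solves are swept repeatedly so that updates propagate.

def alternating_solve(targets, edges, sweeps=100):
    n = len(targets)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    x = list(targets)  # initialize at the data term
    for _ in range(sweeps):
        for i in range(n):
            # minimize E over x_i with all neighbors held fixed
            x[i] = (targets[i] + sum(x[j] for j in nbrs[i])) / (1 + len(nbrs[i]))
    return x
```

On a GPU, the per-vertex solves within one sweep can run in parallel over an independent set of vertices, since each update only reads its fixed 1-ring.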
23. Experimental Results
• Tested on various 3D reconstructed models
• Intel i7 4.0 GHz, 16 GB RAM, NVIDIA Titan X
* Models from Zhou et al.
36. Summary
• Texture map generation for color reconstruction of 3D indoor scene
▫ Texture map generation maximizing the photometric consistency of mapping
▫ Spatiotemporal sampling for faster processing and sharper texture map
▫ Efficient optimization based on a parallel Gauss-Newton solver on GPU
→ Directly applicable to computer graphics applications
37. Semantic Reconstruction:
Reconstruction of Semantically Segmented
3D Meshes via Volumetric Semantic Fusion
Junho Jeon, Jinwoong Jung, Jungeon Kim,
Seungyong Lee
Computer Graphics Forum (Pacific Graphics 2018)
38. Reconstruction of Semantic Information
• Virtual/augmented reality → interaction with 3D scenes
• A single connected 3D model is not suitable
• Requires individually segmented object models
Semantic segmentation on 3D reconstructed scene
[Figure: interaction with a 3D scene vs. a single connected 3D model; labeled classes: sofa, floor, shelves, wall]
39. Semantic Segmentation on 2D Image
• Pixel-wise annotation of semantic object class
• Well-established network architectures and datasets
▫ PASCAL, MS COCO, Mapillary, Places, …
• Has shown successful performance
Places and Mapillary datasets
40. 3D Semantic Segmentation
• Point-wise (vertex-wise) annotation on 3D scene model
• Deep learning on 3D data is not straightforward
▫ Unstructured point cloud, mesh with complex topology
• Lack of annotated 3D reconstructed model dataset
▫ Recently, an annotated dataset was released (ScanNet)
[Figure: reconstructed 3D model with per-vertex annotation (floor, bed, wall, chair, picture)]
41. Related Work – 3D CNN-based Methods
• Represent input geometry as a uniform voxel grid
▫ Binary occupancy grid or distance field
• Direct feature extraction and classification w/ 3D CNN
• High memory consumption → only low-resolution segmentations
Fully convolutional 3D CNN architecture (images courtesy of [Qi 2017]); voxel-based semantic segmentation [Dai 2017]
42. Related Work – Point-based Methods
• Unstructured point cloud → ordered sequence of vectors
▫ Point set grouping, slice pooling, max pooling
• Feature extraction and classification w/ MLP or RNN
• Misses geometric detail (may miss small object classes)
PointNet++ [Qi 2017]; RSNet [Huang 2018]
43. Our Approach: Semantic Reconstruction
• 3D (geometry) reconstruction: fusion of multiple geometry measurements (depth images)
• 3D semantic reconstruction: fusion of multiple 2D semantic predictions
[Figures: multiple depth images → dense surface reconstruction; multiple semantic predictions → 3D semantic reconstruction]
44. Volumetric Fusion of Semantic Information
• Review: Volumetric Fusion of 3D Geometry
• Geometry representation using a uniform voxel grid
▫ Each voxel stores TSDF value (geometry information)
Uniform voxel grid
45. Volumetric Fusion of Semantic Information
• Review: Volumetric Fusion of 3D Geometry
• Geometry representation using a uniform voxel grid
▫ Each voxel stores TSDF value (geometry information)
• Merge noisy measurements on a single voxel grid w/ estimated camera poses
▫ Volumetric denoising of the reconstructed geometry (TSDF values)
[Figure: uniform voxel grid; multi-frame geometric fusion (images courtesy of Newcombe's slides)]
46. Volumetric Fusion of Semantic Information
• Each voxel holds a semantic probability distribution (20 classes)
▫ Volumetric fusion of multi-frame semantic predictions
• Seamless integration into the 3D (geometry) reconstruction process
[Pipeline: RGB-D stream → CNN-based 2D semantic segmentation → stream of semantic predictions → volumetric semantic fusion]
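The per-voxel fusion can be sketched as a weighted running average of class distributions, analogous to TSDF weight accumulation. This is a minimal sketch; the per-observation weight `w` is where the depth- and boundary-based reliability weights of the following slides would plug in.

```python
# Volumetric semantic fusion (illustrative sketch): each voxel keeps a class
# probability distribution and an accumulated weight; every frame's projected
# 2D prediction is merged as a weighted running average.

NUM_CLASSES = 20

class SemanticVoxel:
    def __init__(self):
        self.p = [1.0 / NUM_CLASSES] * NUM_CLASSES  # fused distribution (uniform prior)
        self.weight = 0.0

    def integrate(self, prediction, w=1.0):
        """prediction: per-class probabilities from one frame's 2D CNN output,
        projected onto this voxel. w: per-observation reliability weight."""
        if self.weight == 0.0:
            self.p = list(prediction)
        else:
            total = self.weight + w
            self.p = [(self.weight * a + w * b) / total
                      for a, b in zip(self.p, prediction)]
        self.weight += w
```

This update is order-independent up to floating-point error, so predictions can be integrated incrementally as frames arrive, just like geometric TSDF fusion.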
51. CNN-based 2D Semantic Segmentation
• RGB-D semantic segmentation: RDFNet [Park 2017]
• Stream for 3D reconstruction differs from still images
▫ Captured close to objects, may suffer from motion blur
[Figures: images from the ScanNet dataset (reconstruction) vs. the NYU-D dataset (still images)]
52. CNN-based 2D Semantic Segmentation
• RGB-D semantic segmentation: RDFNet [Park 2017]
• Stream for 3D reconstruction differs from still images
▫ Captured close to objects, may suffer from motion blur
• Fine-tuning on the ScanNet dataset [Dai et al. 2017]
▫ Drastically improves segmentation quality
[Figure: input, original RDFNet [Park 2017], fine-tuned RDFNet]
53. Adaptive Volumetric Semantic Fusion
• 2D predictions & camera poses may have errors
▫ Weighted average of class probability for a voxel
Volumetric semantic fusion
54. Adaptive Volumetric Semantic Fusion
• 2D predictions & camera poses may have errors
▫ Weighted average of class probability for a voxel
• Depth-based reliability weight
▫ Network accuracy depends on the pixel depth (i.e., relative scale)
▫ Pixels close to the camera contribute less to the result
55. Adaptive Volumetric Semantic Fusion
• 2D predictions & camera poses may have errors
▫ Weighted average of class probability for a voxel
• Foreground boundary weight
▫ Unreliable predictions around misaligned object boundaries
▫ Prohibit wall/floor labels for foreground object pixels
[Figure: input color, input depth, depth weights, foreground weights; unreliable predictions near the wall (background) / object (foreground) boundary]
56. Reconstruction of Semantically Labeled 3D Mesh
• Marching cubes to extract a reconstructed 3D mesh from the volumetric representation
▫ Bilinear interpolation of fused probabilities at voxels
▫ Each vertex has 20 object class probabilities
57. Reconstruction of Semantically Labeled 3D Mesh
• Marching cubes to extract a reconstructed 3D mesh from the volumetric representation
▫ Bilinear interpolation of fused probabilities at voxels
▫ Each vertex has 20 object class probabilities
• Select maximum probability class for each vertex to obtain an initial segmentation
[Figure: initial segmentation result; probability visualization for major classes (floor, wall, chair, bed, others)]
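The initial labeling step can be sketched as follows: interpolate the fused voxel distributions at a mesh vertex, then take the maximum-probability class. This is a generic sketch; the interpolation weights stand in for the interpolation of fused probabilities the slide describes.

```python
# Initial per-vertex labeling (illustrative sketch): interpolate the fused
# per-class distributions of the voxels surrounding a marching-cubes vertex,
# then select the class with maximum probability.

def interpolate_distributions(dists, weights):
    """Linearly combine several per-class distributions with the given weights."""
    n = len(dists[0])
    return [sum(w * d[k] for d, w in zip(dists, weights)) for k in range(n)]

def vertex_label(dists, weights):
    """Return the index of the maximum-probability class at the vertex."""
    p = interpolate_distributions(dists, weights)
    return max(range(len(p)), key=lambda k: p[k])
```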
58. CRF-based Label Regularization
• Integrated, but noisy 3D segmentation results
▫ Each 2D segmentation individually considers only a limited FOV
• CRF optimization to determine final class labels
• Consider the global context of the reconstructed scene w/ geometry (surface normals), appearance (colors), and semantic similarity using a confusion matrix of the CNN
[Figure: input; naïve vs. adaptive integration (no CRF); final result]
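The regularization can be expressed as a standard pairwise CRF energy over vertex labels. This is a generic sketch; the exact potentials in the paper combine surface normals, colors, and the CNN's confusion matrix as described above.

```latex
E(\ell) \;=\; \sum_{v} \psi_u(\ell_v) \;+\; \sum_{(v,w) \in \mathcal{N}} \psi_p(\ell_v, \ell_w),
\qquad \psi_u(\ell_v) = -\log p_v(\ell_v)
```

Here p_v is the fused class distribution at vertex v, and the pairwise term ψ_p penalizes label disagreement between neighboring vertices, weighted by their geometric (normal), appearance (color), and semantic similarity.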
59. Experimental Setting
• 2D semantic segmentation: RDFNet for RGB-D stream (Caffe) [Park 2017]
• Camera pose estimation & 3D volumetric fusion: BundleFusion [Dai 2017]
• NVIDIA GeForce Titan X with 12GB VRAM
62. Segmentation Result of Large-scale Scenes
• Incremental integration enables semantic reconstruction of large-scale 3D scenes
[Figure: reconstructed scene and segmented result]
65. Quantitative Evaluation
• Global voxel classification accuracy with majority voting
▫ Improves on the previous method by a large gap (+6.86%)
▫ Tested on ScanNet dataset (312 test scenes)
Configurations Accuracy
Voxel-based labeling [Dai 2017] 73.0%
Naïve integration without CRF 79.02%
Adaptive integration without CRF 79.28%
Naïve integration with CRF 79.79%
Adaptive integration with CRF 79.86%
66. Quantitative Evaluation
• Global voxel classification accuracy with majority voting
▫ Improves on the previous method by a large gap (+6.86%)
▫ Tested on ScanNet dataset (312 test scenes)
• Adaptive integration & CRF appear only marginally effective on this metric
▫ They mainly affect object boundaries: visually critical, but covering only a small portion of the data
Configurations Accuracy
Voxel-based labeling [Dai 2017] 73.0%
Naïve integration without CRF 79.02%
Adaptive integration without CRF 79.28%
Naïve integration with CRF 79.79%
Adaptive integration with CRF 79.86%
68. 2D Projection of 3D Segmentation
• Fusion & regularization improve semantic segmentation results
• We can render 2D semantic maps from the segmented 3D model
• Original 2D segmentation vs. rendered 2D results
▫ Tested on ScanNet dataset (53K frames from 312 test scenes)
Method                     Pixel Acc.  Mean Acc.  Mean IoU
Original RDFNet            60.44       47.32      29.34
Fine-tuned RDFNet (2D)     73.55       59.82      45.60
Our result (rendered 2D)   77.18       63.20      50.69
[Figure: quantitative comparison; input image, CNN results, our result]
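The three metrics in the table follow the standard definitions, which can be computed from a confusion matrix. This sketch uses the usual formulas (assumed, not taken from the paper): pixel accuracy is the diagonal mass, mean accuracy averages per-class recall, and mean IoU averages per-class intersection over union.

```python
# Standard semantic segmentation metrics from a confusion matrix (illustrative sketch).
# conf[g][p] counts pixels of ground-truth class g predicted as class p.

def segmentation_metrics(conf):
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[k][k] for k in range(n))
    pixel_acc = correct / total
    # Per-class accuracy: diagonal over ground-truth row sums (classes present only).
    class_acc = [conf[k][k] / sum(conf[k]) for k in range(n) if sum(conf[k]) > 0]
    mean_acc = sum(class_acc) / len(class_acc)
    # Per-class IoU: true positives over (row sum + column sum - true positives).
    ious = []
    for k in range(n):
        union = sum(conf[k]) + sum(conf[g][k] for g in range(n)) - conf[k][k]
        if union > 0:
            ious.append(conf[k][k] / union)
    mean_iou = sum(ious) / len(ious)
    return pixel_acc, mean_acc, mean_iou
```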
69. 3D Scene Completion and Manipulation
• Class-wise (semantic) 3D scene manipulation
• Scene completion
• Object modification
[Figure: input scene, semantic mesh, floor filling, object removal]
70. Summary
• Volumetric semantic fusion integrating 2D semantic predictions
→ exploits the success of 2D CNNs & data
• Adaptive integration based on depth and scene structure
→ compensates for the uncertainty of network predictions
• CRF-based label regularization using geometric and photometric information
→ refines the final result
71. Summary
• Volumetric semantic fusion integrating 2D semantic predictions
→ exploits the success of 2D CNNs & data
• Adaptive integration based on depth and scene structure
→ compensates for the uncertainty of network predictions
• CRF-based label regularization using geometric and photometric information
→ refines the final result
• Limitation
▫ 2D semantic segmentation requires heavy computation
▫ Multiple GPUs are needed to achieve real-time performance
72. Summary and Future Work
Color and 3D Semantic Reconstruction
of Indoor Scenes from RGB-D Streams
73. Summary
• 3D Reconstruction of auxiliary information
▫ Beyond the geometric reconstruction of the indoor scene
▫ Useful for a rich user experience in VR/AR applications
74. Summary
• 3D Color and Semantic Reconstruction of Indoor Scenes from RGB-D Streams
▫ Efficient and accurate color representation
Texture map generation using spatiotemporal key frame sampling and texture coordinate optimization
(Future work) Optimizing the texture map considering geometric and photometric consistency together
▫ Per-vertex dense semantic class information
3D semantic segmentation on a reconstructed scene via volumetric semantic fusion
(Future work) 3D instance segmentation of the reconstructed scene for individual object meshes
[Figures: texture map generation; semantic reconstruction]