1. DEPTH FUSION FROM RGB AND
DEPTH SENSORS IV
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
■ Single-Photon 3D Imaging with Deep Sensor Fusion
■ Deep RGB-D Canonical Correlation Analysis For Sparse Depth Completion
■ Confidence Propagation through CNNs for Guided Sparse Depth
Regression
■ Learning Guided Convolutional Network for Depth Completion
■ DFineNet: Ego-Motion Estimation and Depth Refinement from Sparse,
Noisy Depth Input with RGB Guidance
■ PLIN: A Network for Pseudo-LiDAR Point Cloud Interpolation
■ Depth Completion from Sparse LiDAR Data with Depth-Normal
Constraints
3. Single-Photon 3D Imaging with Deep
Sensor Fusion
■ Active illumination time-of-flight sensors in particular have become widely used to
estimate a 3D representation of a scene.
■ However, the maximum range, density of acquired spatial samples, and overall
acquisition time of these sensors are fundamentally limited by the minimum signal
required to estimate depth reliably.
■ A data-driven method for photon-efficient 3D imaging which leverages sensor fusion
and computational reconstruction to rapidly and robustly estimate a dense depth map
from low photon counts.
■ This sensor fusion approach uses measurements of single photon arrival times from a
LR single-photon detector array and an intensity image from a conventional HR camera.
■ Using a multi-scale deep convolutional network, it jointly processes the raw
measurements from both sensors and outputs a high-resolution depth map.
2018
4. Single-Photon 3D Imaging with Deep
Sensor Fusion
Single-photon 3D imaging systems measure a spatio-temporal volume containing photon counts (left) that
include ambient light, noise, and photons emitted by a pulsed laser into the scene and reflected back to the
detector. Conventional depth estimation techniques, such as log-matched filtering (center left), estimate a depth
map from these counts. However, depth estimation is a non-convex and challenging problem, especially for
extremely low photon counts observed in fast or long-range 3D imaging systems. Here is a data-driven approach
to solve this depth estimation problem and explore deep sensor fusion approaches that use an intensity image of
the scene to optimize the robustness (center right) and resolution (right) of the depth estimation.
5. Single-Photon 3D Imaging with Deep
Sensor Fusion
The denoising branch (left) takes as input the 3D volume of photon counts and processes it at multiple scales using
a series of 3D conv layers. The resulting features from each resolution scale are concatenated together and
optionally concatenated with additional features from an intensity image in a sensor fusion approach. A further set
of 3D conv layers regresses a normalized illumination pulse, censoring the BG photon events. A differentiable
argmax operator is used to localize the ToF of the estimated illumination pulse and determine the depth. In the
image-guided upsampling branch (right), the network predicts HF differences between an upsampled LF depth map
and the HR depth map using multi-scale guidance from HF features of the intensity image. The entire network is
trainable end-to-end for depth estimation and upsampling from raw photon counts and an intensity image.
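The differentiable argmax can be implemented as a soft-argmax along the time axis; a minimal PyTorch sketch, assuming (B, T, H, W) per-bin pulse scores and a temperature parameter tau (both are illustrative assumptions, not from the paper):

```python
import torch

def soft_argmax_tof(pulse_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable time-of-flight localization (sketch).

    pulse_logits: (B, T, H, W) per-bin scores for the estimated
    illumination pulse; returns (B, H, W) sub-bin ToF indices.
    """
    B, T, H, W = pulse_logits.shape
    # Softmax over the time axis turns scores into a per-pixel distribution.
    probs = torch.softmax(pulse_logits / tau, dim=1)
    # Expected bin index = sum_t t * p(t); differentiable w.r.t. the logits.
    bins = torch.arange(T, dtype=probs.dtype, device=probs.device).view(1, T, 1, 1)
    return (probs * bins).sum(dim=1)
```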
6. Single-Photon 3D Imaging with Deep
Sensor Fusion
(a) photo of setup (b) imaging optics (c) illumination optics
Single-photon imaging prototype. (a) both the imaging optics (bottom) and illumination optics (top). The
illumination and imaging optics are aligned in a rectified setup to perform energy-efficient epipolar scanning.
(b) A dichroic short-pass filter reflects light above 500 nm to a PointGrey vision camera, and transmits light of
all remaining wavelengths through a 450 nm laser line filter and onto a 1D array of 256 SPAD pixels. The galvo
mirror angle controls the scanline imaging the scene. (c) A cylindrical lens creates a vertical laser line, and the
galvo mirror determines the position of this laser line within the scene.
7. Single-Photon 3D Imaging with Deep
Sensor Fusion
Reconstruction results for four scenes: checkerboard, elephant, lamp, and bouncing ball.
8. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
■ Correlation For Completion Network (CFCNet), an end-to-end deep model to do the sparse
depth completion task with RGB info.
■ A 2D deep canonical correlation analysis as network constraints to ensure encoders of RGB
and depth capture the most similar semantics.
■ It transforms the RGB features to the depth domain, and the complementary RGB info is
used to complete the missing depth info.
■ A completed dense depth map is viewed as composed of two parts.
■ One is the sparse depth, which is observable and used as the input; the other is
non-observable and recovered by the task.
■ Also, the corresponding full RGB image of the depth map can be decomposed into two parts,
one is called the sparse RGB, which holds the corresponding RGB values at the observable
locations in the sparse depth, and the other part is complementary RGB, which is the
subtraction of the sparse RGB from the full RGB images.
■ During training, CFCNet learns the relationship between sparse depth and sparse RGB and
uses the learned knowledge to recover non-observable depth from complementary RGB.
2019,6
9. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
The input 0-1 sparse mask represents the sparse pattern of the depth measurements. The complementary mask is the complement of the
sparse mask. The full image is separated into a sparse RGB and a complementary RGB by the mask, and these are fed with the masks into the networks.
10. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
■ CFCNet takes in sparse depth map, sparse RGB, and complementary RGB.
■ Sparsity-aware Attentional Convolutions (SAConv) are used in VGG16-like encoders.
■ SAConv is inspired by the local attention mask, which introduces a segmentation-aware
mask to let the convolution "focus" on signals consistent with the segmentation mask.
■ In order to propagate info from reliable sources, sparsity masks make the convolution
operations attend to signals from reliable locations.
■ The difference from the local attention mask is that SAConv does not apply mask normalization.
■ Mask normalization is avoided because it affects the stability of the later 2D2CCA
calculations: repeated normalization produces numerically small extracted features.
■ Also, a max-pooling operation is applied to the masks after every SAConv to keep track of visibility.
■ If there is at least one nonzero value visible to a convolutional kernel, the max-pooling
evaluates the value at that position to 1.
11. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
SAConv. The ⊙ is for Hadamard product. The
⊗ is for convolution. The + is for elementwise
addition. The kernel size is 3 × 3 and stride is 1
for both convolution and max-pooling.
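A minimal PyTorch sketch of an SAConv-style layer consistent with this caption (Hadamard product with the 0-1 mask, a 3 × 3 stride-1 convolution with no mask normalization, and max-pooling on the mask to track visibility); the class and argument names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class SAConv(nn.Module):
    """Sparsity-aware attentional convolution (sketch).

    Applies the Hadamard product of features and a 0-1 visibility mask,
    convolves the result (no mask normalization), and max-pools the mask
    so a position stays visible if any input under the kernel was visible.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        # Hadamard product keeps only signals from reliable (observed) locations.
        y = self.conv(x * mask)
        # Max-pooling on the 0-1 mask propagates visibility to the next layer.
        return y, self.pool(mask)
```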
2D Deep Canonical Correlation Analysis (2D2CCA): the features of the two branches are summarized by
full-rank covariance matrices, the cross-covariance between the branches gives their correlation, and the
total loss function combines the 2D2CCA correlation term with the transformer loss (sketched below).
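A hedged reconstruction of the 2D2CCA objective in the standard trace-correlation form (the regularizer r that keeps the covariances full-rank and the loss weight λ are assumptions, not from the slide):

```latex
% Covariance matrices of RGB features X and depth features Y (regularized to stay full-rank)
\Sigma_{XX} = \tfrac{1}{m-1}\sum_i (X_i-\bar{X})(X_i-\bar{X})^{\top} + rI, \quad
\Sigma_{YY} = \tfrac{1}{m-1}\sum_i (Y_i-\bar{Y})(Y_i-\bar{Y})^{\top} + rI
% Cross-covariance and trace correlation
\Sigma_{XY} = \tfrac{1}{m-1}\sum_i (X_i-\bar{X})(Y_i-\bar{Y})^{\top}, \qquad
\mathrm{corr}(X,Y) = \frac{\operatorname{tr}(\Sigma_{XY})}{\sqrt{\operatorname{tr}(\Sigma_{XX})\,\operatorname{tr}(\Sigma_{YY})}}
% Total loss: maximize correlation, plus the transformer (domain transfer) loss
\mathcal{L} = -\,\mathrm{corr}(X,Y) + \lambda\,\mathcal{L}_{trans}
```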
12. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
■ Most multi-modal deep learning approaches simply concatenate or element-wise add
bottleneck features.
■ However, when the extracted semantics and the ranges of feature values differ between
modalities, direct concatenation or addition of multi-modal data sources does not
always yield better performance than a single-modal data source.
■ To avoid this problem, use encoders to extract higher-level semantics from two branches.
■ 2D2CCA ensures the extracted features from two branches are maximally correlated.
■ The intuition is to capture the same semantics from the RGB and depth domains.
■ Next, use a transformer network to transform extracted features from RGB domain to
depth domain, making extracted features from different sources share the same
numerical range.
■ During the training phase, use features of sparse depth and corresponding sparse RGB
image to calculate the 2D2CCA loss and transformer loss.
13. Deep RGB-D Canonical Correlation
Analysis For Sparse Depth Completion
(a) RGB image and (b) 500-point sparse depth as inputs. (c) Completed depth maps. (d) Results from the MIT method.
14. Learning Guided Convolutional Network for
Depth Completion
■ Dense depth perception is critical for autonomous driving and other robotics applications.
■ It is thus necessary to complete the sparse LiDAR data, where a synchronized guidance
RGB image is often used to facilitate this completion.
■ Inspired by guided image filtering, a guided network predicts kernel weights from the
guidance image.
■ These predicted kernels are then applied to extract the depth image features.
■ In this way, the network generates content-dependent and spatially-variant kernels for
multi-modal feature fusion.
■ Dynamically generated spatially-variant kernels could lead to prohibitive GPU memory
consumption and computation overhead.
■ A convolution factorization is designed to reduce computation and memory consumption.
■ GPU memory reduction makes it possible for feature fusion to work in multi-stage scheme.
2019,8
15. Learning Guided Convolutional Network for
Depth Completion
The network architecture includes two sub-networks: GuideNet in orange and DepthNet in blue. A convolution
layer is added at the beginning of both GuideNet and DepthNet, as well as at the end of DepthNet. The light orange and blue are
the encoder stages, while the corresponding dark ones are the decoder stages of GuideNet and DepthNet, respectively. The
ResBlock represents the basic residual block structure with two sequential 3 × 3 convolutional layers.
16. Learning Guided Convolutional Network for
Depth Completion
Guided Convolution Module. (a) The overall pipeline of the guided convolution module. Given image features as input,
the filter generation layer dynamically produces guided kernels, which are further applied to the input depth features
to output new depth features. (b) The details of the convolution between guided kernels and input depth features.
The convolution is factorized into two stages, a channel-wise convolution and a cross-channel convolution, as in the sketch below.
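A minimal PyTorch sketch of this factorization under stated assumptions (per-pixel channel-wise 3 × 3 kernels generated from guidance features, followed by an ordinary 1 × 1 cross-channel convolution); names and shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedConv(nn.Module):
    """Content-dependent, spatially-variant kernel fusion (sketch)."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        # Stage 1: generate one k*k kernel per channel and per pixel from guidance.
        self.kernel_gen = nn.Conv2d(channels, channels * k * k, kernel_size=1)
        # Stage 2: cross-channel mixing with an ordinary 1x1 convolution.
        self.cross = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, guide: torch.Tensor, depth_feat: torch.Tensor):
        B, C, H, W = depth_feat.shape
        kernels = self.kernel_gen(guide).view(B, C, self.k * self.k, H, W)
        # Unfold depth features into k*k neighborhoods at every pixel.
        patches = F.unfold(depth_feat, self.k, padding=self.k // 2)
        patches = patches.view(B, C, self.k * self.k, H, W)
        # Channel-wise convolution: weighted sum over each neighborhood.
        out = (kernels * patches).sum(dim=2)
        return self.cross(out)
```

Keeping the dynamically generated weights channel-wise (rather than full C × C × k × k kernels per pixel) is what reduces the GPU memory and computation overhead.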
17. Learning Guided Convolutional Network for
Depth Completion
Qualitative comparison with state-of-the-art methods on KITTI test set
18. DFineNet: Ego-Motion Estimation and Depth Refinement
from Sparse, Noisy Depth Input with RGB Guidance
■ Depth estimation is an important capability for autonomous vehicles to understand
and reconstruct 3D environments as well as avoid obstacles during the execution.
■ Accurate depth sensors such as LiDARs are often heavy and expensive and can only
provide sparse depth, while lighter depth sensors such as stereo cameras are noisier
in comparison.
■ This is an end-to-end learning algorithm that is capable of using sparse, noisy input
depth for refinement and depth completion.
■ The model also produces the camera pose as a byproduct, making it a great solution
for autonomous systems.
■ The approach is evaluated on both indoor and outdoor datasets.
■ 2019,8.
19. DFineNet: Ego-Motion Estimation and Depth
Refinement from Sparse, Noisy Depth Input with RGB
Guidance
An example of sparse, noisy depth input (1st row), the
3D visualization of the ground-truth depth (2nd row)
and the 3D visualization of the output of the model
(bottom). The RGB image (1st row) is overlaid with the
sparse, noisy depth input for visualization.
20. DFineNet: Ego-Motion Estimation and Depth Refinement
from Sparse, Noisy Depth Input with RGB Guidance
It refines sparse & noisy depth input (the 3rd row) to output dense depth of high quality (bottom row).
21. DFineNet: Ego-Motion Estimation and Depth Refinement
from Sparse, Noisy Depth Input with RGB Guidance
Network Architecture
The network consists of two branches: one CNN to learn the function that estimates the depth (ψd),
and one CNN to learn the function that estimates the pose (θp). This network takes as input the image
sequence and corresponding sparse depth maps and outputs the transformation as well as the dense
depth map. During training, the two sets of parameters are simultaneously updated by the training
signal detailed below. The depth CNN is the revised depth network of Ma et al. from MIT.
22. DFineNet: Ego-Motion Estimation and Depth Refinement
from Sparse, Noisy Depth Input with RGB Guidance
■ Supervised Loss
■ Photometric Loss
■ Masked Photometric Loss
■ Smoothness Loss
– Derived from SfM-Net
■ Total Loss: a weighted combination of the above terms (see the sketch below)
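A hedged reconstruction of these terms in the usual self-supervised depth-refinement form (the projection notation, the validity masks V and M, and the weights λ are assumptions, not copied from the slide):

```latex
% Supervised loss on the set V of pixels with valid sparse depth
\mathcal{L}_{sup} = \frac{1}{|V|} \sum_{p \in V} \lVert \hat{D}(p) - D^{*}(p) \rVert_2^2
% Photometric loss from warping a source view I_s into the target view I_t
\mathcal{L}_{ph} = \frac{1}{N} \sum_{p} \big\lvert I_t(p) - I_s\!\big(\mathrm{proj}(p, \hat{D}, \hat{T}_{t \to s})\big) \big\rvert
% Masked photometric loss: the same term restricted by a validity mask M
\mathcal{L}_{mph} = \frac{1}{\sum_p M(p)} \sum_{p} M(p)\, \big\lvert I_t(p) - I_s\!\big(\mathrm{proj}(p, \hat{D}, \hat{T}_{t \to s})\big) \big\rvert
% Edge-aware smoothness on the predicted depth (in the spirit of SfM-Net)
\mathcal{L}_{sm} = \sum_{p} \lvert \nabla^{2} \hat{D}(p) \rvert
% Total loss (weights assumed)
\mathcal{L} = \lambda_{1}\mathcal{L}_{sup} + \lambda_{2}\mathcal{L}_{mph} + \lambda_{3}\mathcal{L}_{sm}
```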
23. DFineNet: Ego-Motion Estimation and Depth Refinement
from Sparse, Noisy Depth Input with RGB Guidance
Qualitative results of this method (left), of RGB guide & certainty (middle), ranked 1st, and of MIT’s Ma et al. (right), ranked 7th.
24. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
■ 2019,8
■ Generally, convolutional neural networks (CNNs) process data on a regular grid, e.g. data
generated by ordinary cameras.
■ Designing CNNs for sparse and irregularly spaced input data is still an open research
problem with numerous applications in autonomous driving, robotics, and surveillance.
■ An algebraically-constrained normalized convolution layer for CNNs with highly sparse
input that has a smaller number of network parameters compared to related work.
■ Strategies for determining the confidence from the convolution operation and
propagating it to consecutive layers.
■ An objective function that simultaneously minimizes the data error while maximizing the
output confidence.
■ To integrate structural information, fusion strategies combine depth and RGB
information in the normalized convolution network framework.
■ In addition, the output confidence is used as auxiliary information to improve the results.
25. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
Scene depth completion pipeline on an example image. The input to the pipeline is a very sparse projected LiDAR
point cloud, an input confidence map which has zeros at missing pixels and ones otherwise, and an RGB image. The
sparse point cloud input and the input confidence are fed to a multi-scale unguided network that acts as a generic
estimator for the data. Afterwards, the continuous output confidence map is concatenated with the RGB image and
fed to a feature extraction network. The output from the unguided network and the RGB feature extraction networks
are concatenated and fed to a fusion network which produces the final dense depth map.
26. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
The standard convolution layer in CNN frameworks can be replaced by a normalized convolution layer
with minor modifications. First, the layer takes in two inputs simultaneously, the data and its confidence.
The forward pass is then modified, and the back-propagation is modified to include a derivative term for
the non-negativity enforcement function. To propagate the confidence to consecutive layers, the already-
calculated denominator term is normalized by the sum of the filter elements.
Normalized Convolution layer that takes in two inputs, i.e. data
and confidence and outputs a data term and a confidence term.
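A minimal PyTorch sketch of such a normalized convolution layer with confidence propagation, assuming a softplus non-negativity enforcement on the filter weights (layer and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedConv2d(nn.Module):
    """Normalized convolution: data weighted by its confidence (sketch)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.k, self.eps = k, eps

    def forward(self, x: torch.Tensor, conf: torch.Tensor):
        # Non-negativity enforcement on the applicability (filter) weights.
        w = F.softplus(self.weight)
        num = F.conv2d(x * conf, w, padding=self.k // 2)
        den = F.conv2d(conf, w, padding=self.k // 2)
        out = num / (den + self.eps) + self.bias.view(1, -1, 1, 1)
        # Propagated confidence: the denominator normalized by the filter sum.
        conf_out = den / w.sum(dim=(1, 2, 3)).view(1, -1, 1, 1)
        return out, conf_out
```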
27. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
The multi-scale architecture for the task of unguided scene depth completion that utilizes normalized convolution
layers. Downsampling is performed using max pooling on confidence maps and the indices of the pooled pixels are
used to select the pixels with highest confidences from the feature maps. Different scales are fused by upsampling
the coarser scale and concatenating it with the finer scale. A normalized convolution layer is then used to fuse the
feature maps based on the confidence information. Finally, a 1 × 1 normalized convolution layer is used to merge
different channels into one channel and produce a dense output and an output confidence map.
28. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
(a) A multi-stream architecture that
contains a stream for depth and
another stream for RGB + Output
Confidence feature extraction.
Afterwards, a fusion network combines
both streams to produce the final
dense output. (b) A multi-scale
encoder-decoder architecture where
depth is fed to the unguided network
followed by an encoder, and the output
confidence and RGB image are
concatenated and then fed to a similar
encoder. Both streams have skip-
connections to the decoder between
the corresponding scales. (c) is similar
to (a), but with early fusion, and (d) is
similar to (b), but with early fusion.
29. Confidence Propagation through CNNs
for Guided Sparse Depth Regression
(a) RGB input, (b) method MS-Net[LF]-L2 (gd), (c) Sparse-
to-Dense (gd) and (d) HMS-Net (gd). For each one, the top
row shows the prediction. Method MS-Net[LF]-L2 (gd) performs
slightly better, while Sparse-to-Dense produces smoother
edges due to the use of a smoothness loss.
30. PLIN: A Network for Pseudo-LiDAR Point
Cloud Interpolation
■ LiDAR can provide dependable 3D spatial information at a low frequency (around 10 Hz)
and has been widely applied in the fields of autonomous driving and UAVs.
■ However, the camera, which runs at a higher frequency (around 20-30 Hz), has to be slowed
down to match the LiDAR in a multi-sensor system.
■ A Pseudo-LiDAR interpolation network (PLIN) increases the effective frequency of LiDAR sensors.
■ PLIN can generate temporally and spatially high-quality point cloud sequences to match
the high frequency of cameras.
■ For this goal, it uses a coarse interpolation stage guided by consecutive sparse depth maps
and the motion relationship, and a refined interpolation stage guided by the realistic scene.
■ Using this coarse-to-fine cascade structure, the method can progressively perceive
multi-modal info and generate accurate intermediate point clouds.
■ This is the first deep framework for Pseudo-LiDAR point cloud interpolation, which shows
appealing applications in navigation systems equipped with LiDAR and cameras.
2019,9
31. PLIN: A Network for Pseudo-LiDAR Point
Cloud Interpolation
Overall pipeline of the proposed method.
PLIN aims to address the frequency
mismatch between camera and LiDAR
sensors, generating both
temporally and spatially high-quality
point cloud sequences. This method
takes three consecutive color images
and two sparse depth maps as inputs,
and interpolates an intermediate dense
depth map, which is further transformed
into a Pseudo-LiDAR point cloud using
camera intrinsic parameters.
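Back-projecting the interpolated dense depth map into a Pseudo-LiDAR point cloud needs only the pinhole intrinsics; a minimal NumPy sketch (the function name and the intrinsic parameters are placeholders):

```python
import numpy as np

def depth_to_pseudo_lidar(depth: np.ndarray, fx: float, fy: float,
                          cx: float, cy: float) -> np.ndarray:
    """Back-project a dense depth map (H, W) to an (N, 3) point cloud."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    # Pinhole inverse projection: x = (u - cx) z / fx, y = (v - cy) z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # keep pixels with valid (positive) depth
```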
32. PLIN: A Network for Pseudo-LiDAR Point
Cloud Interpolation
Overview of the Pseudo-LiDAR interpolation network (PLIN). The whole architecture consists of three modules,
including the motion guidance module, scene guidance module and transformation module.
33. PLIN: A Network for Pseudo-LiDAR Point
Cloud Interpolation
Results of the interpolated depth map obtained by PLIN. For each example, it shows the intermediate color image,
sparse depth map, dense depth map, and the result. This method can recover the original depth information and
generate much denser distributions.
34. PLIN: A Network for Pseudo-LiDAR Point
Cloud Interpolation
It shows the color image, the interpolated dense depth map, two views of the generated Pseudo-LiDAR, and
enlarged areas. The complete network produces a more accurate depth map, and the distribution and shape of
the Pseudo-LiDAR are more similar to those of the GT point cloud.
35. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
■ Depth completion aims to recover dense depth maps from sparse depth
measurements.
■ It is of increasing importance for autonomous driving and draws growing attention
from the vision community.
■ Most existing methods directly train a network to learn a mapping from sparse
depth inputs to dense depth maps, which has difficulties in utilizing the 3D
geometric constraints and handling practical sensor noise.
■ To regularize the depth completion and improve the robustness against noise, a
unified CNN framework 1) models the geometric constraints between depth and
surface normal in a diffusion module and 2) predicts the confidence of sparse
LiDAR measurements to mitigate the impact of noise.
■ Specifically, the encoder-decoder backbone predicts surface normals, coarse depth
and confidence of LiDAR inputs simultaneously, which are subsequently inputted
into the diffusion refinement module to obtain the final completion results.
2019,10
36. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
From sparse LiDAR measurements and color images (a-b), this model first infers the maps of coarse depth
and normal (c-d), and then recurrently refines the initial depth estimation by enforcing the constraints
between depth and normals. Moreover, to address the noise in practical LiDAR measurements (g), a decoder
branch is employed to predict the confidences (h) of the sparse inputs for better regularization.
37. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
The prediction network first predicts maps of surface normal N, coarse depth D and confidence M of the sparse depth input with a
shared-weight encoder and independent decoders. Then, the sparse depth input D̄ and coarse depth D are transformed to the
plane-origin distance space as P̄ and P. Next, the refinement network, an anisotropic diffusion module, refines the coarse depth
map D in the plane-origin distance subspace to enforce the constraints between depth and normal and to incorporate info from
the confident sparse depth inputs. During the refinement, the diffusion conductance depends on the similarity in the guidance feature
map G. Finally, the refined P is inversely transformed back to obtain the refined depth map Dr when the diffusion is finished.
38. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
Differentiable diffusion block. In each
refinement iteration, high-dimensional feature
vectors (e.g., of dimension 64) in guidance
feature map G are independently transformed via
two different functions f and g (modeled as two
convolution layers followed by normalization).
Then, the conductance from each location xi (in
plane-origin distance map P) to its neighboring K
pixels (xj ∈ Ni) are calculated. Finally, the diffusion
is performed through a convolution operation
with the kernels defined by the previously
computed conductance. Through such diffusion,
depth completion results are regularized by the
constraint between depth and normal.
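A hedged sketch of one diffusion iteration consistent with this caption; the Gaussian/softmax form of the conductance is an assumption (the slide only says it is computed from the transformed guidance features f and g):

```latex
% Conductance from location x_i to a neighboring pixel x_j (sketch)
c_{ij} = \frac{\exp\!\big(-\lVert f(G(x_i)) - g(G(x_j)) \rVert_2^2\big)}
              {\sum_{x_k \in \mathcal{N}_i} \exp\!\big(-\lVert f(G(x_i)) - g(G(x_k)) \rVert_2^2\big)}
% One diffusion step on the plane-origin distance map P
P^{(t+1)}(x_i) = \sum_{x_j \in \mathcal{N}_i} c_{ij}\, P^{(t)}(x_j)
```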
39. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
The loss terms:
■ negative cosine loss (on predicted surface normals)
■ L2 reconstruction loss
■ L2 depth loss
■ L2 refinement reconstruction loss
The overall loss function is a weighted sum of these terms; the relation between depth and normal is
established via the tangent plane equation (see the sketch below).
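A hedged reconstruction of the depth-normal relation and the loss composition (the weights λ are assumptions): with camera intrinsics K and homogeneous pixel coordinate x̃, the 3D point X = D(x) K⁻¹ x̃ lies on the tangent plane defined by the normal N(x) and the plane-origin distance P(x):

```latex
% Tangent plane equation relating depth and normal
N(x)^{\top} X = P(x), \qquad X = D(x)\, K^{-1} \tilde{x}
% hence the plane-origin distance used in the refinement:
P(x) = D(x)\, N(x)^{\top} K^{-1} \tilde{x}
% Negative cosine loss on normals against the ground truth N* (sketch)
\mathcal{L}_{N} = -\frac{1}{n} \sum_{x} N(x)^{\top} N^{*}(x)
% Overall loss: weighted sum of the listed terms (weights assumed)
\mathcal{L} = \mathcal{L}_{N} + \lambda_{1}\mathcal{L}_{depth} + \lambda_{2}\mathcal{L}_{recon} + \lambda_{3}\mathcal{L}_{refine}
```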
40. Depth Completion from Sparse LiDAR
Data with Depth-Normal Constraints
Qualitative comparison with other
methods. For each method, the whole
completion result is provided, as well
as zoom-in views of details and
error maps for better comparison.
The normal prediction and confidence
prediction of this method are also
provided for better illustration.