Depth Fusion from RGB and Depth Sensors by Deep Learning
1. Depth Fusion from RGB
and Depth Sensors
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• 1. Sparsity Invariant CNNs
• 2. Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image
• 3. Self-Supervised Sparse-to-Dense: Depth Completion from LiDAR and Monocular Camera
• 4. Fusion of stereo and still monocular depth estimates in a self-supervised learning context
• 5. Deep Depth Completion of a Single RGB-D Image
• 6. Estimating Depth from RGB and Sparse Sensing
• Appendix: InterpoNet, a brain inspired NN for optic flow dense interpolation
3. Sparsity Invariant CNNs
• CNNs operating on sparse inputs for depth completion from sparse laser scan data.
• Traditional convolutional networks perform poorly when applied to sparse data even when
the location of missing data is provided to the network.
• This is a simple yet effective sparse convolution layer that explicitly considers the location of
missing data during the convolution operation.
• The network architecture is evaluated in synthetic and real experiments against various baseline approaches.
• Compared to dense baselines, the sparse convolution network generalizes well to novel
datasets and is invariant to the level of sparsity in the data.
• A dataset derived from the KITTI benchmark, comprising over 94k depth-annotated RGB images.
• The dataset allows for training and evaluating depth completion and depth prediction
techniques in challenging real-world settings.
4. Sparsity Invariant CNNs
Figure: sparse inputs (a, shown visually enhanced) lead to noisy results when processed with a standard ConvNet (c); (b) shows the ground truth. In contrast, the sparse convolution network (d) predicts smooth and accurate depth maps by explicitly considering sparsity during convolution.
5. Sparsity Invariant CNNs
Sparse Convolutional Network. (a) The input to the network is a sparse depth map (yellow) and a binary
observation mask (red). It passes through several sparse convolution layers (dashed) with decreasing kernel
sizes from 11×11 to 3 × 3. (b) Schematic of our sparse convolution operation. Here, ⊙ denotes elementwise
multiplication, ∗ convolution, 1/x inversion and “max pool” the max pooling operation. The input feature can be
single channel or multi-channel.
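To make the operation concrete, here is a minimal PyTorch sketch of a sparsity-invariant convolution in the spirit of this slide. The module name, channel handling, and the epsilon guard are my own choices; the bias is added after normalization, following the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    """Sparsity-invariant convolution: convolve only observed pixels,
    renormalize by the observed-pixel count, and propagate the mask."""

    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Max pooling propagates validity: an output pixel is valid if any
        # input pixel under the kernel window was valid.
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, 1, H, W) float observation map.
        x = self.conv(x * mask)                          # conv of x ⊙ mask
        ones = torch.ones(1, 1, *self.conv.kernel_size, device=x.device)
        norm = F.conv2d(mask, ones, padding=self.conv.padding[0])
        x = x / norm.clamp(min=1e-8)                     # the "1/x" normalization
        x = x + self.bias.view(1, -1, 1, 1)
        return x, self.pool(mask)
```

Per the slide, several such layers would be stacked with kernel sizes decreasing from 11×11 down to 3×3, each passing the propagated mask to the next.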
8. Sparse-to-Dense: Depth Prediction from
Sparse Depth Samples and a Single Image
• Dense depth prediction from a sparse set of depth measurements and a single RGB image.
• Introduce additional sparse depth samples, either acquired with a low-resolution depth
sensor or computed via visual SLAM algorithms.
• Use a single deep regression network to learn directly from the raw RGB-D data, and explore the impact of the number of depth samples on prediction accuracy.
• Two applications: a plug-in module in SLAM to convert sparse maps to dense maps, and
super-resolution for LiDARs.
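Training pairs are synthesized by sampling a sparse subset of the dense ground-truth depth. A minimal sketch of that sampling step, assuming uniform random sampling (the function name is illustrative):

```python
import torch

def sample_sparse_depth(depth, num_samples):
    """Keep `num_samples` random valid pixels of a dense (H, W) depth map,
    zeroing everything else; used to synthesize the sparse training input."""
    sparse = torch.zeros_like(depth)
    valid = torch.nonzero(depth > 0)                 # (N, 2) pixel coordinates
    idx = torch.randperm(valid.size(0))[:num_samples]
    ys, xs = valid[idx, 0], valid[idx, 1]
    sparse[ys, xs] = depth[ys, xs]
    return sparse
```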
9. Sparse-to-Dense: Depth Prediction from
Sparse Depth Samples and a Single Image
CNN architecture for NYU-Depth-v2 and KITTI datasets, respectively.
10. Sparse-to-Dense: Depth Prediction from
Sparse Depth Samples and a Single Image
Figure: predictions on KITTI. Rows: RGB input images; RGB-based prediction; sparse-depth-only prediction (200 samples, no RGB); RGB-d prediction (200 sparse depth samples plus RGB); ground-truth depth.
12. Self-Supervised Sparse-to-Dense: Depth
Completion from LiDAR and Monocular Camera
• Depth completion faces 3 main challenges: 1) the irregularly spaced pattern in the sparse depth input, 2) the difficulty in handling multiple sensor modalities, and 3) the lack of dense, pixel-level ground truth depth labels.
• A deep regression model learns a mapping from sparse depth (plus RGB images) to dense depth.
• A self-supervised training framework requires only sequences of RGB and sparse depth images, without the need for dense depth labels.
Given (a) sparse LiDAR scans and (b) a color image, the network estimates a dense depth image. Semi-dense depth labels, (d) and (e), are used by a highly scalable, self-supervised framework for training such networks.
13. Self-Supervised Sparse-to-Dense: Depth
Completion from LiDAR and Monocular Camera
The encoder consists of a sequence of convolutions with increasing filter banks to down-sample the
feature spatial resolutions. The decoder, on the other hand, has a reversed structure with transposed
convolutions to up-sample the spatial resolutions. The input sparse depth and the color image are
separately processed by their initial convolutions. The convolved outputs are concatenated into a single
tensor, which acts as input to the residual blocks of ResNet-34. Output from each of the encoding layers
is passed to, via skip connections, the corresponding decoding layers. A final 1x1 convolution filter
produces a single prediction image with the same resolution as the network input. All convolutions are followed by batch normalization and ReLU, except the last layer.
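A compact sketch of the late-fusion input stage described above; the layer widths and module names are illustrative, and the ResNet-34 encoder and transposed-convolution decoder are abridged to one stage each.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class RGBdNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Separate initial convolutions per modality, then concatenation.
        self.conv_rgb = conv_bn_relu(3, 48)
        self.conv_d = conv_bn_relu(1, 16)
        # Encoder (ResNet-34 residual blocks in the paper, abridged here)
        # and decoder with transposed convolutions to restore resolution.
        self.encoder = conv_bn_relu(64, 128, stride=2)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # Final 1x1 convolution without BN/ReLU gives the depth prediction.
        self.out = nn.Conv2d(64, 1, 1)

    def forward(self, rgb, sparse_d):
        x = torch.cat([self.conv_rgb(rgb), self.conv_d(sparse_d)], dim=1)
        return self.out(self.decoder(self.encoder(x)))
```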
14. Self-Supervised Sparse-to-Dense: Depth
Completion from LiDAR and Monocular Camera
A model-based self-supervised training framework for depth completion. This framework requires only a
synchronized sequence of color/intensity images from a monocular camera and sparse depth images from
LiDAR. White rectangles are variables, red is the depth network to be trained, blue are deterministic
computational blocks (without learnable parameters), and green are loss functions. During training, the
current data frame RGBd1 and a nearby data frame RGB2 are both used to provide supervision signals. At
inference time, only the current frame RGBd1 is needed to produce a depth prediction pred1.
15. Self-Supervised Sparse-to-Dense: Depth
Completion from LiDAR and Monocular Camera
• The depth loss penalizes the prediction only at pixels where sparse depth is observed.
• Given the camera intrinsic matrix K, any pixel p1 in the current frame 1 has a corresponding projection in frame 2.
• A synthetic color image is produced by warping the nearby frame with bilinear interpolation.
• The final photometric loss compares this warped image against the current frame.
• A smoothness loss regularizes the prediction, and the final loss function for the entire self-supervised framework is a weighted sum of the three terms (see the reconstruction below).
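The loss formulas on this slide were images and did not survive extraction. Below is a hedged LaTeX reconstruction based on the paper's published formulation; the weights β1, β2 and the exact norms are from memory, not verbatim from the slide.

```latex
% Hedged reconstruction of the self-supervised training losses
\begin{align*}
\mathcal{L}_{\text{depth}} &= \big\| \mathbb{1}_{\{d > 0\}} \odot (\text{pred}_1 - d) \big\|_2^2
  && \text{(penalize only observed sparse depths)} \\
p_2 &\sim K \, T_{1\to 2} \, \text{pred}_1(p_1) \, K^{-1} p_1
  && \text{(projection of } p_1 \text{ into frame 2)} \\
\mathcal{L}_{\text{photo}} &= \big\| \text{warp}(\text{RGB}_2) - \text{RGB}_1 \big\|_1
  && \text{(bilinear warp of the nearby frame)} \\
\mathcal{L}_{\text{smooth}} &= \big\| \nabla^2 \text{pred}_1 \big\|_1
  && \text{(second-order smoothness)} \\
\mathcal{L} &= \mathcal{L}_{\text{depth}} + \beta_1 \mathcal{L}_{\text{photo}} + \beta_2 \mathcal{L}_{\text{smooth}}
\end{align*}
```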
17. Fusion of stereo and still monocular depth
estimates in a self-supervised learning context
• Self-supervised learning in which stereo vision
depth estimates serve as targets for a CNN that
transforms a single image to a depth map.
• After training, the stereo and mono estimates are
fused with a method that preserves high
confidence stereo estimates, while leveraging
CNN estimates in the low-confidence regions.
• Even rather limited CNNs can help provide stereo
vision equipped robots with more reliable depth
maps for autonomous navigation.
Self-supervised learning (SSL)
18. Fusion of stereo and still monocular depth
estimates in a self-supervised learning context
The regions where stereo vision is ’blind’ can be unveiled by the monocular estimator, as in those
areas a still mono estimator has a priori no constraints to make a valid depth prediction. Note that
the scene and obstacle are quite close to the camera. In large outdoor scenes with obstacles
further away, the proportion of occluded areas will be much smaller.
19. Fusion of stereo and still monocular depth
estimates in a self-supervised learning context
• The monocular depth estimation is performed with the Fully Convolutional Network (FCN).
• The basis is the well known VGG network, which is pruned of its fully connected layers.
• There are 5 main principles behind the fusion operation:
• (i) as CNN is better at estimating relative depths, its output should be scaled to the stereo range;
• (ii) when a pixel is occluded only monocular estimates are preserved;
• (iii) when stereo is considered reliable, its estimates are preserved;
• (iv)/(v) when in a region of low stereo confidence, if the relative depth estimates are
dissimilar/similar, then the CNN is trusted more/the stereo is trusted more.
• Since stereo vision involves finding correspondences along the same row, it relies on vertical contrasts.
• Convolve with a vertical Sobel filter and apply a threshold to obtain a binary map. This map is
subsequently convolved with a Gaussian blur filter of a relatively large size and renormalized;
• After the merging operation a median filter with a 5 × 5 kernel is used to smooth the final
depth map and reduce even more overall noise.
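A minimal sketch of the confidence map and merging steps described above, using OpenCV and NumPy. The threshold, blur size, scaling rule, and soft-blend formula are assumptions beyond what the slide states; only the vertical Sobel filter, Gaussian blur, renormalization, occlusion handling, and final 5×5 median filter come from the text.

```python
import cv2
import numpy as np

def stereo_confidence(gray, thresh=30, blur_size=31):
    """Vertical-contrast confidence map: stereo matching along rows relies
    on vertical gradients, so a strong vertical Sobel response suggests
    reliable stereo depth."""
    sobel_v = np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3))
    binary = (sobel_v > thresh).astype(np.float32)
    conf = cv2.GaussianBlur(binary, (blur_size, blur_size), 0)
    return conf / max(conf.max(), 1e-6)                 # renormalize to [0, 1]

def fuse(stereo_d, mono_d, conf, occluded):
    """Keep stereo where confident, mono where occluded, blend elsewhere."""
    # Principle (i): scale the CNN output to the stereo depth range.
    valid = (~occluded) & (stereo_d > 0)
    mono_d = mono_d * (stereo_d[valid].mean() / max(mono_d[valid].mean(), 1e-6))
    fused = conf * stereo_d + (1.0 - conf) * mono_d     # soft blend (assumed)
    fused[occluded] = mono_d[occluded]                  # principle (ii)
    # Final 5x5 median filter to smooth the merged map and reduce noise.
    return cv2.medianBlur(fused.astype(np.float32), 5)
```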
20. Fusion of stereo and still monocular depth
estimates in a self-supervised learning context
1) The RGB image.
2) Stereo depth map.
3) Still-mono depth map.
4) The merged depth map.
5) Confidence map (red: high stereo confidence; blue: mono).
6) Difference in error against ground truth between mono and stereo (red: high mono errors; blue: high stereo errors).
7) Velodyne depth map.
21. Deep Depth Completion of a Single RGB-D Image
• The goal is to complete the depth channel of an RGB-D image.
• To train a deep network that takes an RGB image as input and predicts dense surface
normals and occlusion boundaries.
• Those predictions are then combined with raw depth observations provided by the RGB-D
camera to solve for depths for all pixels, including those missing in the original observation.
• A depth completion benchmark dataset, where holes are filled in training data through the
rendering of surface reconstructions created from multi-view RGB-D scans.
22. Deep Depth Completion of a Single RGB-D Image
1) prediction of surface normals and occlusion boundaries only from color, and 2) optimization of
global surface structure from those predictions with soft constraints provided by observed depths.
23. Deep Depth Completion of a Single RGB-D Image
Depth Completion Dataset. Depth
completions are computed from multi-
view surface reconstructions of large
indoor environments. Bottom: the raw
color and depth channels with the
rendered depth for the viewpoint marked
as the red dot. The rendered mesh
(colored by vertex in large image) is
created by combining RGB-D images
from a variety of other views spread
throughout the scene (yellow dots),
which collaborate to fill holes when
rendered to the red dot view.
24. Deep Depth Completion of a Single RGB-D Image
Using surface normals to solve for depth completion. (a) An example of where depth cannot
be solved from surface normal. (b) The area missing depth is marked in red. The red arrow
shows paths on which depth cannot be integrated from surface normals. However, in real-world
images, there are usually many paths through connected neighboring pixels (along floors,
ceilings, etc.) over which depths can be integrated (green arrows).
25. Deep Depth Completion of a Single RGB-D Image
• The model is an FCN built on a VGG-16 backbone with a symmetric encoder and decoder.
• It is also equipped with short-cut connections and shared pooling masks for corresponding
max pooling and unpooling layers, which are critical for learning local image features.
• Train the network with “ground truth” surface normals and silhouette boundaries computed
from the reconstructed mesh.
• Define the observed pixels as the ones with depth data from both the raw sensor and the
rendered mesh, and the unobserved pixels as the ones with depth from the rendered mesh
but not the raw sensor.
• For any given set of pixels (observed, unobserved, or both), train models with a loss for only
those pixels by masking out the gradients on other pixels during BP.
• The network learns to predict normals better from color than depth, even if the network is
given an extra channel containing a binary mask indicating which pixels observe depth.
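The "shared pooling masks" can be expressed in PyTorch with max-pool indices reused by the corresponding unpooling layer. A minimal sketch, abridged to a single encoder/decoder stage (the real model stacks several VGG-16 stages; names and widths here are illustrative):

```python
import torch.nn as nn

class NormalNet(nn.Module):
    """One encoder/decoder stage showing shared pooling masks:
    MaxPool2d returns indices that the matching MaxUnpool2d reuses."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(True))
        self.head = nn.Conv2d(64, 3, 3, padding=1)   # 3-channel surface normals

    def forward(self, x):
        f = self.enc(x)
        p, idx = self.pool(f)                        # keep the pooling mask
        u = self.unpool(p, idx, output_size=f.shape) # reuse it when unpooling
        u = self.dec(u) + f                          # short-cut connection
        return self.head(u)
```

Masking the loss to a chosen pixel set (observed, unobserved, or both) amounts to zeroing the per-pixel loss outside that set before backpropagation, as the slide describes.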
26. Deep Depth Completion of a Single RGB-D Image
• After predicting the surface normal image N and
occlusion boundary image B, solve a system of
equations to complete the depth image D.
• The objective function is defined as the weighted sum of squared errors with four terms (see the hedged reconstruction below):
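The equation itself was an image; the following is a hedged LaTeX reconstruction from the paper (Zhang et al.), written from memory. B down-weights the normal term near predicted occlusion boundaries, and is presumably what the slide counts as the fourth term.

```latex
% Hedged reconstruction of the global depth optimization
\begin{align*}
E &= \lambda_D E_D + \lambda_S E_S + \lambda_N E_N \\
E_D &= \sum_{p \in T_{\text{obs}}} \| D(p) - D_0(p) \|^2
  && \text{(stay close to observed raw depths)} \\
E_S &= \sum_{p} \sum_{q \in N(p)} \| D(p) - D(q) \|^2
  && \text{(neighboring depths vary smoothly)} \\
E_N &= \sum_{p} \sum_{q \in N(p)} B(p)\, \big| \langle v(p,q),\, N(p) \rangle \big|^2
  && \text{(tangents orthogonal to predicted normals)}
\end{align*}
```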
Input & GT Zhang et al. Laina et al. Chakrabarti et al.
27. Estimating Depth from RGB and Sparse Sensing
• A deep model that can produce dense depth maps given an RGB image with known depth
at a very sparse set of pixels.
The objective is to densify a sparse depth map (with additional cues from an RGB image), hence the model is called Deep Depth Densification, or D3.
28. Estimating Depth from RGB and Sparse Sensing
• A parametrization of the sparse depth input that accommodates sparse input patterns.
• It allows for varying such patterns not only across different deep models but even within the
same model during training and testing.
• Inputs to parametrization:
• I(x, y) and D(x, y): RGB vector-valued image I and ground truth depth D
• Both maps have dimensions H×W. Invalid values in D are encoded as zero.
• M(x,y): Binary pattern mask of dimensions H×W, where M(x,y) = 1 defines (x,y) locations of our
desired depth samples.
• All points where M(x,y) = 1 must correspond to valid depth points (D(x, y) > 0).
• From I, D and M, form 2 maps for the sparse depth input, S1(x,y) and S2(x,y).
• Both maps have dimension H×W;
• S1(x,y) is a NN (nearest neighbor) fill of the sparse depth M(x,y)∗D(x,y).
• S2(x, y) is the Euclidean distance transform of M(x, y), i.e. the L2 distance between (x, y) and the closest point (x′, y′) where M(x′, y′) = 1.
• The final parametrization of the sparse depth input is the concatenation of S1(x,y) and S2(x,y).
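A minimal sketch of the (S1, S2) parametrization with SciPy; `distance_transform_edt` conveniently returns both the distance map and the coordinates of the nearest sampled pixel, which gives the NN fill directly. The function name is illustrative.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def d3_parametrization(depth, mask):
    """Build the two-channel sparse-depth input (S1, S2).

    depth: (H, W) ground-truth depth; mask: (H, W) binary sample pattern,
    nonzero only where depth is valid (D > 0)."""
    # EDT over the complement of the mask: for every pixel, the distance
    # to the nearest sample (S2), plus that sample's coordinates.
    s2, nearest = distance_transform_edt(mask == 0, return_indices=True)
    s1 = depth[nearest[0], nearest[1]]      # NN fill of M(x,y) * D(x,y)
    return np.stack([s1, s2], axis=0)       # (2, H, W) network input
```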
30. Estimating Depth from RGB and Sparse Sensing
Both regular and irregular sparse patterns in S1 (top) and S2 (bottom). Dark points
in S2 correspond to the pixels where there is access to depth information.
31. Estimating Depth from RGB and Sparse Sensing
• For regular grid patterns, minimal spatial bias is ensured when choosing the mask M(x,y) by enforcing equal spacing between subsequent pattern points in both the x and y directions.
• Such a strategy is convenient when one model must accommodate images of different resolutions.
• For ease of interpretation, use sparse patterns close to an integer level of downsampling;
• It is beneficial to vary the sparse pattern M(x,y) during training.
• Such a schedule begins training at 6 times the desired sparse pattern density and smoothly
decays towards the final density as training progresses.
• Also train with randomly varying sampling densities at each training step.
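A sketch of such a density schedule. The exponential form is an assumption; the slide only states a smooth decay from 6x the desired density down to the final density.

```python
def pattern_density(step, total_steps, final_density, start_factor=6.0):
    """Smoothly decay the sampling density from start_factor * final_density
    at the start of training to final_density at the end (exponential
    interpolation assumed)."""
    t = min(step / total_steps, 1.0)
    return final_density * start_factor ** (1.0 - t)
```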
36. InterpoNet, a brain inspired NN for optic flow
dense interpolation
• Sparse-to-dense interpolation for optical flow is a fundamental phase in the pipeline of
most of the leading optical flow estimation algorithms.
• The current SoA method for interpolation, EpicFlow, is a local averaging method based on an edge-aware geodesic distance.
• This is a data-driven sparse-to-dense interpolation algorithm based on FCN.
• Taking inspiration from the filling-in process in the visual cortex, it introduces lateral dependencies between neurons and multi-layer supervision into the learning process.
• The main branch of the network consists of ten layers, each applying a 7x7 convolution filter
followed by an ELU (exponential linear unit) non-linearity.
• The input to the entire algorithm is a set of sparse and noisy matches, e.g. from FlowFields (FF), CPM-Flow (CPM), DiscreteFlow (DF), or DeepMatching (DM).
• From the matches, produce a sparse flow map of size h×w×2 for the image pair.
37. InterpoNet, a brain inspired NN for optic flow
dense interpolation
InterpoNet
38. InterpoNet, a brain inspired NN for optic flow
dense interpolation
• Inspired by the observation that neuronal filling-in takes place at many layers of the visual system hierarchy, detour networks connect each and every layer directly to the loss function.
• During training, the loss function served as top down information pushing each layer to
perform interpolation in the best possible manner.
• The detour networks were kept simple: aside from the main branch of the network, each layer's activations were transformed into a two-channel flow map using a single convolutional layer with linear activations.
• Each of the flow maps produced by the detour networks was compared to the ground truth
flow map using the EPE and LD losses.
• The final network loss was the weighted sum of all the losses.
• For inference, use only the last detour layer output - the one connected to the last layer of
the network’s main branch.
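A sketch of the main branch with detour heads and the multi-layer supervision. The EPE (average endpoint error) loss is standard; the LD loss is omitted here, the detour weights are placeholders, and the input is simplified to the 2-channel sparse flow map (the paper also feeds a validity mask).

```python
import torch
import torch.nn as nn

class InterpoNetSketch(nn.Module):
    """Ten 7x7 conv + ELU layers; each layer also feeds a 'detour'
    1x1 conv with linear activation producing a 2-channel flow map."""

    def __init__(self, ch=32, depth=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(2 if i == 0 else ch, ch, 7, padding=3) for i in range(depth)])
        self.detours = nn.ModuleList(
            [nn.Conv2d(ch, 2, 1) for _ in range(depth)])
        self.act = nn.ELU()

    def forward(self, sparse_flow):
        flows = []
        x = sparse_flow                        # (B, 2, H, W) sparse flow map
        for layer, detour in zip(self.layers, self.detours):
            x = self.act(layer(x))
            flows.append(detour(x))            # one flow map per layer
        return flows                           # inference uses flows[-1]

def epe(flow, gt):
    """Average endpoint error between predicted and ground-truth flow."""
    return torch.norm(flow - gt, dim=1).mean()

def detour_loss(flows, gt, weights):
    """Final loss: weighted sum of the per-layer detour losses."""
    return sum(w * epe(f, gt) for f, w in zip(flows, weights))
```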