Deep VO and SLAM
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• DeepVO: Towards End-to-End Visual
Odometry with Deep Recurrent CNNs
• UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
• VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
• Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
• DeepTAM: Deep Tracking and Mapping
• Visual SLAM for Automated Driving:
Exploring the Applications of Deep Learning
• DeepFusion: Real-Time Dense 3D
Reconstruction for Monocular SLAM using
Single-View Depth and Gradient Predictions
• Training Deep SLAM on Single Frames
• DeepVIO: Self-supervised Deep Learning of
Monocular VIO using 3D Geometric
Constraints
• DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
• Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
• Deep Direct Visual Odometry
• Learning By Inertia: Self-supervised
Monocular VO For Road Vehicles
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent CNNs
• Most of existing VO algorithms are developed under a standard pipeline including feature
extraction, feature matching, motion estimation, local optimisation, etc.
• Although some of them have demonstrated superior performance, they usually need to be
carefully designed and specifically fine-tuned to work well in different environments.
• Some prior knowledge is also required to recover an absolute scale for monocular VO.
• This work is an end-to-end framework for monocular VO by using deep Recurrent
Convolutional Neural Networks (RCNNs).
• Since it is trained and deployed in an end-to-end manner, it infers poses directly from a
sequence of raw RGB images (videos) without adopting any module in the conventional VO
pipeline.
• Based on the RCNNs, it not only automatically learns effective feature representation for the
VO problem through CNN, but also implicitly models sequential dynamics and relations
using deep RNN.
• The end-to-end Deep Learning technique can be a viable complement to the traditional VO
systems.
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent CNNs
Architectures of the conventional
feature based mono VO and the end-
to-end method. In the e2e method,
RCNN takes a sequence of RGB images
(video) as input and learns features by
CNN for RNN based sequential
modelling to estimate poses.
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent CNNs
It takes a video clip or a monocular image sequence as input. Two consecutive images are stacked together to
form a tensor for the deep RCNN to learn how to extract motion information and estimate poses. Specifically,
the image tensor is fed into the CNN to produce an effective feature for the monocular VO, which is then
passed through a RNN for sequential learning. Each image pair yields a pose estimate at each time step
through the network. The VO system develops over time and estimates new poses as images are captured.
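As a rough illustration of the pipeline described above, the sketch below stacks consecutive frame pairs, extracts features with a CNN and feeds them to an LSTM that regresses a 6-DoF relative pose per pair. The layer sizes and the translation-plus-Euler-angles output are assumptions for illustration; the authors' actual CNN and RNN configurations differ.

```python
# Minimal sketch of a DeepVO-style RCNN (assumed layer sizes, not the
# paper's exact configuration).
import torch
import torch.nn as nn

class DeepVOSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN encoder over a stacked pair of RGB frames (6 input channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # RNN models the sequential dynamics across time steps.
        self.rnn = nn.LSTM(input_size=256 * 4 * 4, hidden_size=512,
                           num_layers=2, batch_first=True)
        self.pose = nn.Linear(512, 6)   # 3 translation + 3 Euler angles

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        # Stack consecutive frames along channels: (B, T-1, 6, H, W).
        pairs = torch.cat([video[:, :-1], video[:, 1:]], dim=2)
        feats = self.cnn(pairs.flatten(0, 1)).flatten(1)
        feats = feats.view(b, t - 1, -1)
        hidden, _ = self.rnn(feats)
        return self.pose(hidden)              # relative pose per image pair

poses = DeepVOSketch()(torch.rand(1, 5, 3, 192, 640))   # -> (1, 4, 6)
```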
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent CNNs
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
• Data-driven VO or DL-based VO has drawn significant attention due to its potential in
learning capability and its robustness to camera parameters and challenging environments.
• VO-related unsupervised deep learning research mainly focuses on depth estimation,
inspired by the image warping technique “spatial transformer”.
• A monocular visual odometry (VO) system called UnDeepVO.
• UnDeepVO is able to estimate the 6-DoF pose of a monocular camera and the depth of its
view by using deep neural networks.
• There are two features of the proposed UnDeepVO:
• The unsupervised deep learning scheme, and the absolute scale recovery.
• Train UnDeepVO by using stereo image pairs to recover the scale but test it by using
consecutive monocular images as a monocular system.
• The loss function defined for training the networks is based on spatial and temporal dense
information.
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
After training with unlabeled stereo
images, UnDeepVO can simultaneously
perform visual odometry and depth
estimation with monocular images. The
estimated 6-DoF poses and depth maps
are both scaled without the need for
scale post-processing.
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
Architecture of UnDeepVO
The system is composed of a pose estimator and a depth estimator.
The pose estimator is a VGG-based CNN architecture.
Since rotation (Euler angles) has high nonlinearity, it is usually
more difficult to train than translation. For supervised
training, a popular solution is to give a bigger weight to the
rotational loss as a way of normalization. In order to better
train the rotation with unsupervised learning, the translation
and the rotation are decoupled with two separate groups of fully-
connected layers after the last convolutional layer. This makes it
possible to introduce a weight that normalizes the rotation and
translation predictions for better performance.
The depth estimator is mainly based on an encoder-decoder
architecture to generate dense depth maps. The depth
estimator of UnDeepVO is designed to directly predict depth
maps, because training trials report that the whole
system converges more easily when trained in this way.
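A minimal sketch of the decoupled pose heads described above is given below; the feature dimension, hidden sizes and the value of the rotation weight are assumptions for illustration, not the paper's VGG-based configuration.

```python
# Sketch of decoupled rotation/translation pose heads after the last
# convolutional layer (illustrative sizes and weighting).
import torch
import torch.nn as nn

class PoseHeadSketch(nn.Module):
    def __init__(self, feat_dim=1024, rot_weight=100.0):
        super().__init__()
        # Two separate FC groups: one for translation, one for rotation.
        self.trans = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                   nn.Linear(512, 3))
        self.rot = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 3))
        # Weight that rescales the rotation branch so its contribution
        # is comparable to translation during unsupervised training.
        self.rot_weight = rot_weight

    def forward(self, conv_feat):                  # conv_feat: (B, feat_dim)
        t = self.trans(conv_feat)                  # metric translation
        r = self.rot_weight * self.rot(conv_feat)  # weighted Euler angles
        return t, r

t, r = PoseHeadSketch()(torch.rand(2, 1024))
```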
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
Training scheme of UnDeepVO. The pose and depth
estimators take stereo images as inputs to estimate 6-DoF
poses and depth maps, respectively. The total loss including
spatial losses and temporal losses can then be calculated based
on raw RGB images, estimated depth maps and poses.
UnDeepVO is trained with losses through BP. Since the losses
are built on geometric constraints rather than labeled data,
UnDeepVO is trained in an unsupervised manner. Its total loss
includes spatial image losses and temporal image losses.
The spatial image losses drive the network to recover scaled
depth maps by using stereo image pairs, while the temporal
image losses are designed to minimize the errors on camera
motion by using two consecutive monocular images.
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
UnDeepVO: Monocular Visual Odometry
through Unsupervised Deep Learning
VO results with UnDeepVO. None of the methods listed in the table used loop closure. Note
that monocular VISO2-M and ORB-SLAM-M (without loop closure) did not work with image size
416 × 128; their results were obtained with image size 1242 × 376. 7-DoF (6-DoF + scale)
alignment with the ground truth is applied for SfMLearner and ORB-SLAM-M.
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
• A fundamental requirement for mobile robot autonomy is the ability to navigate accurately
where no GPS signals are available.
• One of the most promising approaches to achieving this goal is the fusion of
images from a monocular camera with measurements from an inertial measurement unit (IMU).
• These VIO approaches still suffer from strict calibration and synchronization requirements.
• It presents an on-manifold sequence-to- sequence learning approach to motion estimation
using visual and inertial sensors.
• It is an end-to-end trainable method for visual-inertial odometry which fuses the data
at an intermediate feature-representation level.
• This method has numerous advantages over traditional approaches.
• Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU as
well as eliminating the need for manual calibration between the IMU and camera.
• This model naturally and elegantly incorporates domain specific information which significantly
mitigates drift.
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
Comparison between a standard visual-
inertial odometry framework and this
learning-based approach. Elements in
blue need to be specified during setup.
The parameters of VINet are hidden from
the user and fully learned from data.
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
The VINet architecture for visual-inertial odometry. The network consists of a core LSTM processing the pose
output at camera-rate and an IMU LSTM processing data at the IMU rate.
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
• The model consists of a CNN-RNN network which has been tailored to the task of visual-
inertial odometry estimation.
• The entire network is differentiable and thus trainable end-to-end for the purpose of ego-
motion estimation.
• The input to the network is monocular RGB images and IMU data, a 6-dimensional
vector containing the x, y, z components of acceleration (from the accelerometer) and of
angular velocity (from the gyroscope).
• The output of the network is a 7 dimensional vector - a 3 dimensional translation and 4
dimensional orientation quaternion - representing the change in pose of the robot from the
start of the sequence.
• In essence, the network learns the mapping which transforms input sequences of images
and IMU data to poses.
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
SE(3) composition layer - a parameter-free
layer which concatenates transformations
between frames on SE(3).
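The composition layer itself is parameter-free; a minimal sketch using 4×4 homogeneous matrices is shown below. The network's pose output is translation plus quaternion, which can be converted to this matrix form, so the representation here is only for brevity.

```python
# Parameter-free SE(3) composition: the accumulated global pose is updated
# by composing it with each new relative transformation.
import torch

def se3_compose(T_global, T_rel):
    """Concatenate transformations on SE(3): T_new = T_global @ T_rel."""
    return T_global @ T_rel

def pose_to_matrix(t, R):
    """Build a 4x4 transform from a 3x3 rotation R and translation t."""
    T = torch.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Accumulate per-frame pose increments into a trajectory.
T = torch.eye(4)
for t_rel, R_rel in [(torch.tensor([0.1, 0.0, 1.0]), torch.eye(3))] * 3:
    T = se3_compose(T, pose_to_matrix(t_rel, R_rel))
print(T[:3, 3])   # translation after three identical increments
```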
VINet: Visual-Inertial Odometry as a
Sequence-to-Sequence Learning Problem
6D MAV reconstructed
trajectory using VINet
compared to OKVIS
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
• Monocular visual odometry approaches that purely rely on geometric cues are prone to
scale drift and require sufficient motion parallax in successive frames for motion estimation
and 3D reconstruction.
• This work leverages deep monocular depth prediction to overcome limitations of geometry-
based monocular visual odometry.
• It incorporates deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual
stereo measurements.
• For depth prediction, it designs a deep network that refines predicted depth from a single
image in a two-stage process.
• It trains the network in a semi-supervised way on photo consistency in stereo images and on
consistency with accurate sparse depth reconstructions from Stereo DSO.
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
It builds on three key ingredients: self-supervised learning from photo consistency in a stereo setup, supervised
learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network
predictions in a stacked encoder-decoder architecture. Semi-Supervised Deep Monocular Depth Estimation using a
deep network, called StackNet since it stacks two sub-networks, SimpleNet and ResidualNet. Both sub-networks
are fully convolutional deep neural networks adopted from DispNet with an encoder-decoder scheme. ResidualNet
has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted
by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully
applied to related deep learning tasks.
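A toy sketch of the two-stage refinement idea follows: ResidualNet sees the input image together with SimpleNet's prediction and outputs an additive residual. The tiny encoder-decoders and channel counts below are placeholders, not the DispNet-based sub-networks of the paper.

```python
# Sketch of stacked residual disparity refinement (illustrative networks).
import torch
import torch.nn as nn

def tiny_encdec(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_ch, 3, padding=1),
    )

class StackNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.simple = tiny_encdec(3, 2)        # left + right disparity
        self.residual = tiny_encdec(3 + 2, 2)  # image + SimpleNet output

    def forward(self, image):
        disp0 = self.simple(image)
        # ResidualNet learns an additive residual that refines the
        # first-stage disparity prediction.
        disp = disp0 + self.residual(torch.cat([image, disp0], dim=1))
        return disp0, disp

disp0, disp = StackNetSketch()(torch.rand(1, 3, 128, 416))
```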
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
• Deep Virtual Stereo Odometry (DVSO) builds on the windowed sparse direct bundle
adjustment formulation of monocular DSO.
• It uses disparity predictions for DSO in two key ways: firstly, it initializes depth maps of new
keyframes from the disparities; beyond this rather straightforward approach, it also
incorporates virtual direct image alignment constraints into the windowed direct bundle
adjustment of DSO.
• It obtains these constraints by warping images with the depth estimated by bundle
adjustment and the right disparities predicted by the deep network StackNet, assuming a
virtual stereo setup.
• The point selection strategy of DVSO is similar to monocular DSO, while it also introduces a
left-right consistency check to filter out the pixels which likely lie in occluded areas.
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
Architecture of DVSO. Every new frame is used for visual odometry and fed into the StackNet to predict left and
right disparity. The predicted left and right disparities are used for depth initialization, while the right disparity is
used to form the virtual stereo term in direct sparse bundle adjustment.
Deep Virtual Stereo Odometry: Leveraging
Deep Depth Prediction for Monocular DSO
DeepTAM: Deep Tracking and Mapping
• It is a system for keyframe-based dense camera tracking and depth
map estimation that is entirely learned.
• For tracking, estimate small pose increments between the current
camera image and a synthetic viewpoint.
• This significantly simplifies the learning problem and alleviates the
dataset bias for camera motions.
• Generating a large number of pose hypotheses leads to more accurate
predictions.
• For mapping, accumulate information in a cost volume centered at the
current depth estimate.
• The mapping network then combines the cost volume and the
keyframe image to update the depth prediction, thereby effectively
making use of depth measurements and image-based priors.
DeepTAM: Deep Tracking and Mapping
The tracking network uses an encoder-decoder architecture with direct connections between the encoding
and decoding parts. The decoder is shared by two tasks: optical flow prediction and the generation of
pose hypotheses. The optical flow prediction head is a small stack of two convolution layers and is only active during
training, to stimulate the learning of motion features. The pose hypotheses generation part is a stack of down-
sampling convolution layers followed by a fully connected layer, which then splits into N = 64 fully connected
branches sharing parameters to estimate the 64 pose vectors.
Tracking Network
DeepTAM: Deep Tracking and Mapping
Overview of the tracking networks and the incremental pose estimation. A coarse-to-fine approach is applied to
efficiently estimate the current camera pose. Three tracking networks are trained, each specialized for a distinct resolution
level corresponding to the input image dimensions (80 × 60), (160 × 120) and (320 × 240). Each network computes a
pose increment with respect to a pose guess. Each tracking network uses the latest pose guess to generate a
virtual keyframe at its resolution level, thereby indirectly tracking the camera with respect to the
original keyframe (IK, DK). The final pose estimate is computed as the product of all incremental pose updates.
Tracking Networks
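The incremental, coarse-to-fine tracking loop can be sketched as below; render_virtual_keyframe and tracking_net are hypothetical stand-ins for the warping step and the three per-level networks.

```python
# Coarse-to-fine incremental tracking: each level renders a virtual keyframe
# at the current pose guess and predicts a small pose increment; the final
# pose is the product of all increments.
import torch

def render_virtual_keyframe(keyframe, pose_guess, size):
    # Stand-in: warp (I_K, D_K) to the pose guess at the given resolution.
    return keyframe

def tracking_net(size, virtual_kf, current_image):
    # Stand-in for the per-level network: returns a 4x4 pose increment.
    return torch.eye(4)

def track(keyframe, current_image, pose_guess):
    pose = pose_guess.clone()
    for size in [(80, 60), (160, 120), (320, 240)]:   # coarse to fine
        virtual_kf = render_virtual_keyframe(keyframe, pose, size)
        increment = tracking_net(size, virtual_kf, current_image)
        pose = pose @ increment     # accumulate incremental pose updates
    return pose

pose = track(keyframe=None, current_image=None, pose_guess=torch.eye(4))
```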
DeepTAM: Deep Tracking and Mapping
• Train a network to use the matching cost info. in the cost volume, combining it with the
image-based scene priors to obtain more accurate and robust depth estimates.
• For cost-volume-based methods, accuracy is limited by the number of depth labels N.
• Use an adaptive narrow band strategy to increase the sampling density while keeping the
number of labels constant.
• The narrow band allows to recover more details in the depth map, but also requires a good
initialization and regularization to keep the band in the right place.
• The cost volume for the narrow band is recomputed for a small selection of frames to search
again for a better depth estimate.
• The narrow band of depth labels is defined centered at the previous depth estimate dprev,
spread over the narrow band width σnb.
DeepTAM: Deep Tracking and Mapping
Mapping consists of a fixed band module and a narrow band module. Fixed band module: it takes the keyframe image
IK (320 × 240 × 3) and the cost volume Cfb (320 × 240 × 32), generated with 32 depth labels equally spaced in the range
[0.01, 2.5], as inputs and outputs an interpolation factor sfb (320 × 240 × 1). The fixed band depth estimate is
Dfb = (1 − sfb)·dmin + sfb·dmax. Narrow band module: it is run iteratively; in each iteration it builds a cost volume Cnb from
a set of depth labels distributed with a band width σnb of 0.0125. The narrow band module itself consists of two encoder-decoder pairs.
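The two depth read-outs described above can be sketched as follows; the exact spacing of the labels inside the narrow band is an assumption for illustration.

```python
# Turning network outputs into depth values (values follow the description
# above; cost volumes and networks themselves are omitted).
import torch

d_min, d_max = 0.01, 2.5        # fixed-band depth range (32 labels)
sigma_nb = 0.0125               # narrow-band width

def fixed_band_depth(s_fb):
    """Interpolation factor s_fb in [0,1] -> depth in [d_min, d_max]."""
    return (1.0 - s_fb) * d_min + s_fb * d_max

def narrow_band_labels(d_prev, n_labels=32):
    """Depth labels centered on the previous estimate d_prev, spread over a
    band of width sigma_nb (one assumed way to distribute the labels)."""
    offsets = torch.linspace(-1.0, 1.0, n_labels) * sigma_nb
    return d_prev.unsqueeze(-1) + offsets       # (..., n_labels)

s_fb = torch.rand(240, 320)                     # per-pixel network output
depth_fb = fixed_band_depth(s_fb)               # fixed-band estimate
labels = narrow_band_labels(depth_fb)           # labels for next iteration
```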
DeepTAM: Deep Tracking and Mapping
Visual SLAM for Automated Driving:
Exploring the Applications of Deep Learning
• Recently, there has been progress on using CNN models for geometric vision tasks like depth
estimation, optical flow prediction or motion segmentation.
• However, Visual SLAM remains one of the areas of automated driving where CNNs are
not yet mature for deployment in commercial automated driving systems.
• This work explores how deep learning can be used to replace parts of the classical Visual
SLAM pipeline.
• Firstly, it describes the building blocks of the Visual SLAM pipeline, composed of standard
geometric vision tasks.
• Then it provides an overview of Visual SLAM use cases for automated driving based on the
authors’ experience in commercial deployment.
• Finally, it discusses the opportunities of using Deep Learning to improve upon state-of-the-
art classical methods.
Visual SLAM for Automated Driving:
Exploring the Applications of Deep Learning
Visual SLAM for Automated Driving:
Exploring the Applications of Deep Learning
DeepFusion: Real-Time Dense 3D Reconstruction for Monocular
SLAM using Single-View Depth and Gradient Predictions
• While the keypoint-based maps created by sparse monocular SLAM systems are
useful for camera tracking, dense 3D reconstructions may be desired for many
robotic tasks.
• Solutions involving depth cameras are limited in range and to indoor spaces, and
dense reconstruction systems based on minimising the photometric error between
frames are typically poorly constrained and suffer from scale ambiguity.
• To address these issues, a proposed 3D reconstruction system leverages the output
of a CNN to produce fully dense depth maps for keyframes that include metric scale.
• DeepFusion is capable of producing real-time dense reconstructions on a GPU.
• It fuses the output of a semi-dense multi-view stereo algorithm with the depth and
gradient predictions of a CNN in a probabilistic fashion, using learned uncertainties
produced by the network.
• While the network only needs to be run once per keyframe, the depth map is optimised
with each new frame so as to constantly make use of new geometric constraints.
DeepFusion: Real-Time Dense 3D Reconstruction for Monocular
SLAM using Single-View Depth and Gradient Predictions
DeepFusion represents the observed geometry with a series of keyframe depth maps. With each new RGB image,
the system obtains the pose from a monocular SLAM system (ORB-SLAM2) and then updates the semi-dense
depth estimates for the active keyframe. If the camera has translated more than a threshold, or the semi-dense
estimation has fewer than a minimum number of inliers, a new keyframe is created. To maintain a high frame rate,
the network outputs are only generated once per keyframe: using a CNN, it predicts the log-depth, log-depth
gradients and associated uncertainties from the new keyframe image. If a new keyframe is not created, then the
current semi-dense depth map and network outputs are fused to update the current depth map.
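A toy illustration of the fusion step under a simplifying assumption: both sources give per-pixel log-depth with Gaussian uncertainty and are combined by precision weighting. The real system additionally makes use of the predicted log-depth gradients and re-optimises the keyframe depth map with each new frame.

```python
# Toy precision-weighted fusion of semi-dense log-depth with CNN-predicted
# log-depth, using per-pixel uncertainties (a simplification of the
# probabilistic fusion described above).
import numpy as np

def fuse_log_depth(logd_semi, var_semi, logd_cnn, var_cnn):
    """Pixels without a semi-dense measurement (NaN) fall back to the CNN."""
    w_semi = 1.0 / var_semi
    w_cnn = 1.0 / var_cnn
    fused = (w_semi * logd_semi + w_cnn * logd_cnn) / (w_semi + w_cnn)
    return np.where(np.isnan(logd_semi), logd_cnn, fused)

logd_cnn = np.log(np.full((240, 320), 2.0))     # CNN prior: 2 m everywhere
var_cnn = np.full((240, 320), 0.25)             # learned uncertainty
logd_semi = np.full((240, 320), np.nan)         # semi-dense: sparse coverage
logd_semi[100:140, 100:200] = np.log(1.5)
var_semi = np.full((240, 320), 0.05)

fused = fuse_log_depth(logd_semi, var_semi, logd_cnn, var_cnn)
```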
DeepFusion: Real-Time Dense 3D Reconstruction for Monocular
SLAM using Single-View Depth and Gradient Predictions
Comparison of reconstruction accuracy in terms of percentage of correct depth values (within 10% of ground
truth) on ICL-NUIM and TUM RGB-D datasets (TUM/seq1: fr3/long office household, TUM/seq2: fr3 no
structure texture near with loop, TUM/seq3: fr3/structure texture far). LSD-SLAM (BS) is LSD-SLAM
bootstrapped with a ground truth depth map, and REMODE uses LSD-SLAM (BS) poses and keyframes.
Training Deep SLAM on Single Frames
• Collecting ground truth poses to train learning-based visual odometry and SLAM methods is
difficult and expensive.
• This could be resolved by training in an unsupervised mode, but there is still a large gap
between performance of unsupervised and supervised methods.
• This work focuses on generating synthetic data for deep learning-based visual odometry
and SLAM methods that take optical flow as an input.
• It produces training data in the form of optical flow that corresponds to an arbitrary camera
movement between a real frame and a virtual frame (see the sketch after this list).
• For synthesizing data, it uses depth maps either captured by a depth sensor or estimated from stereo.
• It trains the visual odometry model on synthetic data and does not use ground truth poses,
so the model can be considered unsupervised.
• It can also be classified as monocular, since no depth maps are used at inference.
• It also presents a simple way to convert any visual odometry model into a SLAM method based on frame
matching and graph optimization.
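A minimal sketch of the synthesis step: given a real frame's depth map and intrinsics K, sample an arbitrary virtual camera motion (R, t) and compute where each pixel would land in the virtual view; the pixel displacement is the training flow and (R, t) is its ground-truth pose. Function and variable names are illustrative.

```python
# Synthesize optical flow between a real frame and a virtual frame by
# back-projecting with the depth map, moving by a sampled pose, and
# re-projecting with the pinhole model.
import numpy as np

def synth_flow(depth, K, R, t):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project to 3D with the real depth, move by the virtual pose,
    # then re-project into the virtual camera.
    pts = (np.linalg.inv(K) @ pix[..., None])[..., 0] * depth[..., None]
    pts = (R @ pts[..., None])[..., 0] + t
    proj = (K @ pts[..., None])[..., 0]
    uv_new = proj[..., :2] / proj[..., 2:3]
    return uv_new - pix[..., :2]        # flow from real to virtual frame

K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
depth = np.full((240, 320), 10.0)
flow = synth_flow(depth, K, np.eye(3), np.array([0.5, 0.0, 0.0]))
```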
Training Deep SLAM on Single Frames
Architecture of visual odometry model
Training Deep SLAM on Single Frames
SLAM Architecture
Training Deep SLAM on Single Frames
Metrics on KITTI for unsupervised methods.
Ground truth and estimated EuRoC trajectory
DeepVIO: Self-supervised Deep Learning of Monocular
Visual Inertial Odometry using 3D Geometric Constraints
• This work is a self-supervised deep learning network for monocular visual inertial
odometry (named DeepVIO).
• DeepVIO provides absolute trajectory estimation by directly merging 2D optical flow
feature (OFF) and IMU data.
• Specifically, it firstly estimates the depth and dense 3D point cloud of each scene by
using stereo sequences, and then obtains 3D geometric constraints including 3D
optical flow and 6-DoF pose as supervisory signals.
• In DeepVIO training, the 2D optical flow network is constrained by the projection of its
corresponding 3D optical flow, and the LSTM-style IMU pre-integration network and
the fusion network are learned by minimizing the loss functions from ego-motion
constraints.
• Furthermore, it employs an IMU status update scheme to improve IMU pose
estimation through updating the additional gyroscope and accelerometer biases.
• Compared to the traditional methods, DeepVIO reduces the impact of inaccurate
Camera-IMU calibration, unsynchronized data and missing data.
DeepVIO: Self-supervised Deep Learning of Monocular
Visual Inertial Odometry using 3D Geometric Constraints
3D optical flow and ego-motion are used as 3D
geometric constraints to supervise the 2D optical
flow learned by the 2D flow network and the
ego-motions estimated by the IMU pre-integration
network and the VI fusion network; the state of the
IMU is updated when it receives feedback from
the VI fusion network.
DeepVIO: Self-supervised Deep Learning of Monocular
Visual Inertial Odometry using 3D Geometric Constraints
The DeepVIO mainly consists of CNN-Flow, LSTM-IMU and FC-Fusion networks, which jointly compute
continuous trajectories from monocular images and IMU data. OFF and 2D optical flow are calculated by the
CNN-Flow network. The relative 6-DoF pose between the adjacent two frames is calculated by IMU pre-
integration network (LSTM-IMU) through the IMU data and status. Finally, the concatenated features from OFF
and IMU-se3 are fed into the FC-Fusion network to calculate the trajectory of the monocular camera.
The pretrained depth network (e.g., PSMNet) is first applied to estimate dense depth images, from which it
recovers 3D point clouds. Next, the 6-DoF relative pose and 3D optical flow are calculated via the well-known ICP
method. Moreover, it synthesizes a 2D optical flow by projecting the 3D optical flow into its view.
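A small sketch of how such a 2D flow supervision target could be formed by projecting the 3D optical flow into the camera view (pinhole projection only; the depth network and ICP outputs are assumed given, and the intrinsics and points shown are placeholders).

```python
# Project ICP-derived 3D optical flow into the image to obtain a 2D flow
# target for the flow network.
import numpy as np

def project(points, K):
    uvw = (K @ points[..., None])[..., 0]
    return uvw[..., :2] / uvw[..., 2:3]

def flow_2d_from_3d(points, flow_3d, K):
    """2D flow target = projection of moved points minus projection of
    the original points."""
    return project(points + flow_3d, K) - project(points, K)

K = np.array([[718.0, 0, 607], [0, 718.0, 185], [0, 0, 1]])
points = np.array([[1.0, 0.5, 10.0], [-2.0, 0.2, 15.0]])   # from stereo depth
flow_3d = np.array([[0.0, 0.0, -1.0], [0.1, 0.0, -1.0]])   # from ICP
target_flow = flow_2d_from_3d(points, flow_3d, K)
```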
DeepVIO: Self-supervised Deep Learning of Monocular
Visual Inertial Odometry using 3D Geometric Constraints
KITTI 09 EuRoC MH04
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
• As the foundation of driverless vehicles and intelligent robots, SLAM has attracted much
attention these days.
• However, non-geometric modules of traditional SLAM algorithms are limited by data
association tasks and have become a bottleneck preventing the development of SLAM.
• To deal with such problems, many researchers turn to Deep Learning for help.
• But most of these studies are limited to virtual datasets or specific environments, and even
sacrifice efficiency for accuracy.
• DF-SLAM uses deep local feature descriptors obtained by the neural network as a substitute
for traditional hand-made features.
• Since it adopts a shallow network to extract local descriptors and keeps the other modules the same
as in the original SLAM system, DF-SLAM can still run in real time on a GPU.
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
• It derives the tracking thread from Visual Odometry algorithms.
• Tracking takes charge of constructing data associations between adjacent frames using
visual feature matching.
• Afterward, it initializes frames with the help of data associations and estimates the
camera pose using the epipolar geometric constraint.
• It also decides whether new keyframes are needed.
• If lost, global relocalization is performed based on the same sort of features.
• Local Mapping runs regularly to optimize camera poses and map points.
• It receives information constructed by the tracking thread and reconstructs a partial 3D map.
• If loops are detected, the Loop Closure thread will then optimize the whole graph
and close the loop.
• The frame with a high matching score is selected as a candidate loop closing frame, which is
used to complete loop closing and global optimization.
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
• The first step is to extract interest points, utilizing the TFeat network to describe the region
around keypoints and generate a normalized 128-D float descriptor.
• Features extracted are stored in every frame and passed to tracking, mapping and loop
closing threads.
• The localization algorithm is based on DBoW.
• To speed up the system, a Visual Vocabulary is employed, as in numerous computer vision
applications.
• The vocabulary is trained, based on DBoW, using the feature descriptors extracted by the DF
(Deep Feature) method.
• Therefore, it assigns a word vector and feature vector for each frame and calculates their
similarity more easily.
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
The architecture of TFeat
There are only two convolutional layers followed by Tanh non-linearity in each branch. Max pooling is
added after the first convolutional layer to reduce parameters and further speed up the network. The
last layer is a fully connected layer that outputs a 128-D descriptor, L2-normalized to unit length.
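A sketch of a TFeat-style branch matching the description above; the patch size and kernel sizes are assumptions, since the text only fixes the layer types and the 128-D L2-normalized output.

```python
# TFeat-style local descriptor branch: two conv layers with Tanh, max
# pooling after the first conv, FC layer, L2 normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFeatSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=7)   # Tanh non-linearity
        self.pool = nn.MaxPool2d(2)                     # after first conv
        self.conv2 = nn.Conv2d(32, 64, kernel_size=6)
        self.fc = nn.Linear(64 * 8 * 8, 128)            # 128-D descriptor

    def forward(self, patch):                           # (B, 1, 32, 32)
        x = self.pool(torch.tanh(self.conv1(patch)))
        x = torch.tanh(self.conv2(x))
        desc = self.fc(x.flatten(1))
        return F.normalize(desc, p=2, dim=1)            # unit length

desc = TFeatSketch()(torch.rand(4, 1, 32, 32))           # -> (4, 128)
```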
DF-SLAM: A Deep-Learning Enhanced Visual
SLAM System based on Deep Local Features
With loop closure added vs. without loop closure
Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
• This is a self-supervised learning framework for visual odometry (VO) that incorporates correlation of
consecutive frames and takes advantage of adversarial learning.
• Previous methods tackle self-supervised VO as a local SfM problem that recovers depth from a single
image and relative poses from image pairs by minimizing photometric loss between warped and
captured images.
• As single-view depth estimation is an ill-posed problem, and photometric loss is incapable of
discriminating distortion artifacts of warped images, the estimated depth is vague and pose is
inaccurate.
• This framework learns a compact representation of frame-to-frame correlation, which is updated by
incorporating sequential information.
• The updated representation is used for depth estimation.
• Besides, VO is tackled as a self-supervised image generation task, taking advantage of GAN.
• The generator learns to estimate depth and pose to generate a warped target image. The
discriminator evaluates the quality of the generated image with high-level structural perception,
which overcomes the problem of pixel-wise loss in previous methods. Experiments on KITTI and
Cityscapes datasets show that the method obtains more accurate depth with details preserved,
and the predicted pose outperforms state-of-the-art self-supervised methods significantly.
Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
The network extracts optical flow into a compact code, which is incorporated by LSTM to aggregate
historical information and refine previous estimations. Depth and pose estimation is regarded as an
image conditioned generative task, and the refined code is provided as input signal. The geometric
inference is used to reconstruct a warped image by view synthesis and evaluated by a discriminator.
Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
The encoder compresses optical flow of two consecutive images into a compact code, which is aggregated
and refined by LSTM. The DepthNet estimates depth conditioned on the refined code and input image. The
estimated depth is concatenated with the image for pose and mask prediction, while the authenticity of the
warped image is judged by the discriminator. The discriminator is excluded during the test phase.
Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
Sequential Adversarial Learning for Self-
Supervised Deep Visual Odometry
Deep Direct Visual Odometry
• Different kinds of approaches have been proposed to solve VO problems, including direct
methods, semi-direct methods and feature-based methods.
• Monocular direct visual odometry (DVO) relies heavily on high-quality images and a good
initial pose estimate for an accurate tracking process, which means that DVO may fail if the
image quality is poor or the initial value is incorrect.
• This study is a new architecture to overcome the above limitations by embedding deep
learning into DVO.
• Compared with the traditional VO methods, deep learning models do not rely on high-
precision feature correspondences or high-quality images.
• A self-supervised network architecture for effectively predicting 6-DOF pose, called PoseNet,
is proposed, and the pose prediction is incorporated into Direct Sparse Odometry (DSO) for
robust initialization and tracking.
• A soft-attention model and STM (selective transfer model) module are used to improve the
feature manipulation ability for accurate pose regression.
Deep Direct Visual Odometry
Self-supervised network architecture. (a) To achieve a better pose prediction, 7 convolution layers with kernel
size 3 are used for feature extraction, followed by fully connected layers and an attention model. (b) A soft-attention
model is used for feature association and selection. The reweighted features are used to predict the 6-DOF relative
pose. (c) An STM model is used to replace the common skip connection between encoder and decoder and to
selectively transfer characteristics in the DepthNet. (d) The single-frame DepthNet adopts the encoder-decoder
framework with a selective transfer model, and the kernel size is 3 for all convolution and deconvolution layers.
Deep Direct Visual Odometry
The self-supervised training framework. There are only two components in the loss function L used as the
supervisory signal during training: the view reconstruction consistency loss Lc and the
depth smoothness loss Lsmooth.
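A sketch of such a two-term objective, assuming an L1 photometric term and an edge-aware smoothness term; the warping that produces the reconstructed image from the predicted pose and depth is taken as given, and the weighting is illustrative.

```python
# Two-term self-supervised loss: view reconstruction + depth smoothness.
import torch

def view_reconstruction_loss(target, warped):
    return (target - warped).abs().mean()

def depth_smoothness_loss(depth, image):
    # Penalize depth gradients, down-weighted at image edges.
    dx_d = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dy_d = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(target, warped, depth, weight_smooth=0.1):
    return view_reconstruction_loss(target, warped) + \
           weight_smooth * depth_smoothness_loss(depth, target)

target, warped = torch.rand(1, 3, 128, 416), torch.rand(1, 3, 128, 416)
loss = total_loss(target, warped, torch.rand(1, 1, 128, 416))
```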
Deep Direct Visual Odometry
The DDSO (deep direct sparse odometry) pipeline. This work augments the DSO framework with the pose
prediction module (PoseNet). Every new frame is fed into the proposed PoseNet together with the last frame to
regress a relative pose estimate. The predicted pose is used to improve initialization and tracking in DSO.
Deep Direct Visual Odometry
Learning By Inertia: Self-supervised Monocular
VO For Road Vehicles
• The method, called iDVO (inertia-embedded deep visual odometry), is a self-supervised
learning-based monocular visual odometry (VO) method for road vehicles.
• When modelling the geometric consistency within adjacent frames, most deep VO methods
ignore the temporal continuity of the camera pose, which results in severe jagged
fluctuations in the velocity curves.
• Based on the observation that road vehicles tend to exhibit smooth dynamics most of the
time, the inertia loss function penalizes abnormal motion variation, which helps the model
learn the continuity of long-term camera ego-motion (see the sketch after this list).
• Based on the recurrent convolutional neural network (RCNN) architecture, this method
implicitly models the dynamics of road vehicles and the temporal consecutiveness by the
extended LSTM block.
• Furthermore, it develops a dynamic hard-edge mask to handle the non-consistency under
fast camera motion by blocking the boundary part, which makes the overall
non-consistency mask more efficient.
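A minimal sketch of an inertia-style loss, under the assumption that it penalizes frame-to-frame variation of the predicted relative motion so the network learns the smooth dynamics of a road vehicle; the paper's exact formulation may differ.

```python
# Inertia-style loss: penalize the change of predicted motion between
# consecutive time steps of a clip.
import torch

def inertia_loss(rel_poses):
    """rel_poses: (B, T, 6) predicted relative poses along a clip."""
    velocity_change = rel_poses[:, 1:] - rel_poses[:, :-1]
    return velocity_change.abs().mean()

loss = inertia_loss(torch.rand(2, 10, 6))
```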
Learning By Inertia: Self-supervised Monocular
VO For Road Vehicles
The CNN-RCNN dual networks structure takes the sequences as input, and outputs the per-pixel depth
with 6-DoF poses sequentially. The total mask used in view synthesis is the combination of the computed
dynamic hard-edge mask and the estimated explainability mask. Both networks can be tested individually.
Learning By Inertia: Self-supervised Monocular
VO For Road Vehicles
The CNN-RCNN network structure. The correspondence between colors and layers/operations is shown in the
legend at the bottom. Left: the DispNet architecture adopted for the depth estimator CNN.
Right: the pose-prediction RCNN; its decoder part is for multi-scale explainability mask prediction.
Learning By Inertia: Self-supervised Monocular
VO For Road Vehicles
The sketch of dynamic hard-edge mask (the red part). For better display, the frame-to-frame
time between these 3 frames is ∼0.5 s, so the mask shown is much larger than the actual mask.
Because the networks are trained on video captured from moving vehicles (static frames are removed by
optical flow), some obstacles at the edges of one frame might not be captured in the next frame (forward
motion). In reconstruction, some pixels at the edge of the source frame will therefore not be mapped within the
boundary of the target frame.
To generate the non-consistency mask efficiently and precisely, the dynamic hard-edge mask (DHEM) is used.
The DHEM is a hard mask at the edge of the source frame, which together with the soft estimated
explainability mask forms a complete mask. The hard mask blocks the edge pixels by ignoring them
during reconstruction.
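One way such a mask could be formed is sketched below: a boundary band of the source frame, whose width grows with the estimated motion, is zeroed out so those pixels are ignored in reconstruction. The band-width rule is an assumption for illustration.

```python
# Dynamic hard-edge mask: zero out a boundary band whose width scales with
# the estimated motion magnitude.
import numpy as np

def dynamic_hard_edge_mask(height, width, motion_magnitude, scale=20.0):
    band = int(min(scale * motion_magnitude, min(height, width) // 4))
    mask = np.ones((height, width), dtype=np.float32)
    if band > 0:
        mask[:band, :] = 0.0
        mask[-band:, :] = 0.0
        mask[:, :band] = 0.0
        mask[:, -band:] = 0.0
    return mask   # multiply into the reconstruction loss with the soft mask

mask = dynamic_hard_edge_mask(128, 416, motion_magnitude=1.2)
```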
Learning By Inertia: Self-supervised Monocular
VO For Road Vehicles
Deep VO and SLAM

Deep VO and SLAM

  • 1.
    Deep VO andSLAM Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2.
    Outline • DeepVO: TowardsEnd-to-End Visual Odometry with Deep Recurrent CNNs • UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning • VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem • Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular DSO • DeepTAM: Deep Tracking and Mapping • Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning • DeepFusion: Real-Time Dense 3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions • Training Deep SLAM on Single Frames • DeepVIO: Self-supervised Deep Learning of Monocular VIO using 3D Geometric Constraints • DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features • Sequential Adversarial Learning for Self- Supervised Deep Visual Odometry • Deep Direct Visual Odometry • Learning By Inertia: Self-supervised Monocular VO For Road Vehicles
  • 3.
    DeepVO: Towards End-to-EndVisual Odometry with Deep Recurrent CNNs • Most of existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, local optimisation, etc. • Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. • Some prior knowledge is also required to recover an absolute scale for monocular VO. • This work is a end-to-end framework for monocular VO by using deep Recurrent Convolutional Neural Networks (RCNNs). • Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module in the conventional VO pipeline. • Based on the RCNNs, it not only automatically learns effective feature representation for the VO problem through CNN, but also implicitly models sequential dynamics and relations using deep RNN. • The end-to-end Deep Learning technique can be a viable complement to the traditional VO systems.
  • 4.
    DeepVO: Towards End-to-EndVisual Odometry with Deep Recurrent CNNs Architectures of the conventional feature based mono VO and the end- to-end method. In the e2e method, RCNN takes a sequence of RGB images (video) as input and learns features by CNN for RNN based sequential modelling to estimate poses.
  • 5.
    DeepVO: Towards End-to-EndVisual Odometry with Deep Recurrent CNNs It takes a video clip or a monocular image sequence as input. Two consecutive images are stacked together to form a tensor for the deep RCNN to learn how to extract motion information and estimate poses. Specifically, the image tensor is fed into the CNN to produce an effective feature for the monocular VO, which is then passed through a RNN for sequential learning. Each image pair yields a pose estimate at each time step through the network. The VO system develops over time and estimates new poses as images are captured.
  • 6.
    DeepVO: Towards End-to-EndVisual Odometry with Deep Recurrent CNNs
  • 7.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning • Data-driven VO or DL based VO has drawn significant attention due to its potentials in learning capability and the robustness to camera parameters and challenging environments. • VO related unsupervised deep learning research mainly focuses on depth estimation, inspired by the image wrap technique “spatial transformer”. • A monocular visual odometry (VO) system called UnDeepVO. • UnDeepVO is able to estimate the 6-DoF pose of a monocular camera and the depth of its view by using deep neural networks. • There are two features of the proposed UnDeepVO: • The unsupervised deep learning scheme, and the absolute scale recovery. • Train UnDeepVO by using stereo image pairs to recover the scale but test it by using consecutive monocular images as a monocular system. • The loss function defined for training the networks is based on spatial and temporal dense information.
  • 8.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning After training with unlabeled stereo images, UnDeepVO can simultaneously perform visual odometry and depth estimation with monocular images. The estimated 6-DoF poses and depth maps are both scaled without the need for scale post- processing.
  • 9.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning Architecture of UnDeepVO The system is composed of pose and depth estimator. For the pose estimator, it is a VGG-based CNN architecture. Since rotation (Euler angles) has high nonlinearity, it is usually difficult to train compared with translation. For supervised training, a popular solution is to give a bigger weight to the rotational loss as a way of normalization. In order to better train the rotation with unsupervised learning, decouple the translation and the rotation with two separate groups of fully- connected layers after the last convolutional layer. This enables to introduce a weight normalizing the rotation and the translation predictions for better performance. The depth estimator is mainly based on an encoder- decoder architecture to generate dense depth maps. The depth estimator of UnDeepVO is designed to directly predict depth maps. This is because training trails report that the whole system is easier to converge when training in this way.
  • 10.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning Training scheme of UnDeepVO. The pose and depth estimators take stereo images as inputs to estimate 6-DoF poses and depth maps, respectively. The total loss including spatial losses and temporal losses can then be calculated based on raw RGB images, estimated depth maps and poses. UnDeepVO is trained with losses through BP. Since the losses are built on geometric constraints rather than labeled data, UnDeepVO is trained in an unsupervised manner. Its total loss includes spatial image losses and temporal image losses. The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images.
  • 11.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning
  • 12.
    UnDeepVO: Monocular VisualOdometry through Unsupervised Deep Learning VO results with UnDeepVO. All the methods listed in the table did not use loop closure. Note that monocular VISO2-M and ORB-SLAM-M (without loop closure) did not work with image size 416 × 128, the results were obtained with image size 1242×376. 7-DoF (6-DoF + scale) alignment with the ground-truth is applied for SfMLearner and ORB-SLAM-M.
  • 13.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem • A fundamental requirement for mobile robot autonomy is the ability to be able to accurately navigate where no GPS signals are available. • One of the most promising approaches to achieving this goal is through the fusion of images from a monocular camera and inertial measurement unit. • These VIO approaches still suffer from strict calibration and synchronization requirements. • It presents an on-manifold sequence-to- sequence learning approach to motion estimation using visual and inertial sensors. • It is end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. • This method has numerous advantages over traditional approaches. • Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU as well as eliminating the need for manual calibration between the IMU and camera. • This model naturally and elegantly incorporates domain specific information which significantly mitigates drift.
  • 14.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem Comparison between a standard visual- inertial odometry framework and this learning-based approach. Elements in blue need to be specified during setup. The parameters of VINet are hidden from the user and fully learned from data.
  • 15.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem The VINet architecture for visual-inertial odometry. The network consists of a core LSTM processing the pose output at camera-rate and an IMU LSTM processing data at the IMU rate.
  • 16.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem • The model consists of an CNN-RNN network which has been tailored to the task of visual- inertial odometry estimation. • The entire network is differentiable and thus trainable end-to-end for the purpose of ego- motion estimation. • The input to the network is monocular RGB images and IMU data which is a 6 dimensional vector containing the x, y, z components of acceleration and angular velocity measured using a gyroscope. • The output of the network is a 7 dimensional vector - a 3 dimensional translation and 4 dimensional orientation quaternion - representing the change in pose of the robot from the start of the sequence. • In essence, the network learns the mapping which transforms input sequences of images and IMU data to poses
  • 17.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem SE(3) composition layer - a parameter-free layer which concatenates transformations between frames on SE(3).
  • 18.
    VINet: Visual-Inertial Odometryas a Sequence-to-Sequence Learning Problem 6D MAV reconstructed trajectory using the VINet compared to OK-VIS
  • 19.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO • Monocular visual odometry approaches that purely rely on geometric cues are prone to scale drift and require sufficient motion parallax in successive frames for motion estimation and 3D reconstruction. • This work leverages deep monocular depth prediction to overcome limitations of geometry- based monocular visual odometry. • It incorporates deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements. • For depth prediction, it designs a deep network that refines predicted depth from a single image in a two-stage process. • It trains the network in a semi-supervised way on photo consistency in stereo images and on consistency with accurate sparse depth reconstructions from Stereo DSO.
  • 20.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO It builds on three key ingredients: self-supervised learning from photo consistency in a stereo setup, supervised learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network predictions in a stacked encoder-decoder architecture. Semi-Supervised Deep Monocular Depth Estimation using a deep network, called StackNet since it stacks two sub-networks, SimpleNet and ResidualNet. Both sub-networks are fully convolutional deep neural network adopted from DispNet with an encoder-decoder scheme. ResidualNet has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully applied to related deep learning tasks.
  • 21.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO
  • 22.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO • Deep Virtual Stereo Odometry (DVSO) builds on the windowed sparse direct bundle adjustment formulation of monocular DSO. • It uses disparity predictions for DSO in two key ways: Firstly, it initialize depth maps of new keyframes from the disparities; Beyond this rather straight-forward approach, it also incorporate virtual direct image alignment constraints into the windowed direct bundle adjustment of DSO. • It obtains these constraints by warping images with the estimated depth by bundle adjustment and the predicted right disparities by the deep network StackNet assuming a virtual stereo setup. • The point selection strategy of DVSO is similar to monocular DSO, while it also introduced a left-right consistency check to filter out the pixels which likely lie in the occluded area
  • 23.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO Architecture of DVSO. Every new frame is used for visual odometry and fed into the StackNet to predict left and right disparity. The predicted left and right disparities are used for depth initialization, while the right disparity is used to form the virtual stereo term in direct sparse bundle adjustment.
  • 24.
    Deep Virtual StereoOdometry: Leveraging Deep Depth Prediction for Monocular DSO [46]. Yin, X., Wang, X., Du, X., Chen, Q. “Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields”. CVPR 2017 [49]. Zhou, T., Brown, M., Snavely, N., Lowe, D.G. “Unsupervised learning of depth and ego-motion from video”. CVPR 2017
  • 25.
    DeepTAM: Deep Trackingand Mapping • It is a system for keyframe-based dense camera tracking and depth map estimation that is entirely learned. • For tracking, estimate small pose increments between the current camera image and a synthetic viewpoint. • This significantly simplifies the learning problem and alleviates the dataset bias for camera motions. • Generating a large number of pose hypotheses leads to more accurate predictions. • For mapping, accumulate information in a cost volume centered at the current depth estimate. • The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively making use of depth measurements and image-based priors.
  • 26.
    DeepTAM: Deep Trackingand Mapping The tracking network uses an encoder-decoder type architecture with direct connections between the encoding and decoding part. The decoder is used by two tasks, which are optical flow prediction and the generation of pose hypotheses. The optical flow prediction is a small stack of two convolution layers and is only active during training to stimulate the generation of motion features. The pose hypotheses generation part is a stack of down- sampling convolution layers followed by a fully connected layer, which then splits into N = 64 fully connected branches sharing parameters to estimate the 64 pose vectors. Tracking Network
  • 27.
    DeepTAM: Deep Trackingand Mapping Overview of the tracking networks and the incremental pose estimation. They apply a coarse-to-fine approach to efficiently estimate the current camera pose. They train 3 tracking networks each specialized for a distinct resolution level corresponding to the input image dimensions (80 × 60), (160 × 120) and (320 × 240). Each network computes a pose increment with respect to a pose guess. Each of the tracking networks uses the latest pose guess to generate a virtual keyframe at the respective resolution level and thereby indirectly tracking the camera with respect to the original keyframe (IK, DK). The final pose estimate is computed as the product of all incremental pose updates. Tracking Networks
  • 28.
    DeepTAM: Deep Trackingand Mapping • Train a network to use the matching cost info. in the cost volume, combining it with the image-based scene priors to obtain more accurate and robust depth estimates. • For cost-volume-based methods, accuracy is limited by the number of depth labels N. • Use an adaptive narrow band strategy to increase the sampling density while keeping the number of labels constant. • The narrow band allows to recover more details in the depth map, but also requires a good initialization and regularization to keep the band in the right place. • To recompute the cost volume for the narrow band for a small selection of frames and search again for a better depth estimate. • Define the narrow band of depth labels centered at the previous depth estimate dprev as narrow band width
  • 29.
    DeepTAM: Deep Trackingand Mapping Mapping consists of a fixed band and narrow band module. Fixed band module: it takes the keyframe image IK(320 × 240 × 3) and the cost volume Cfb (320 × 240 × 32) generated with 32 depth labels equally spaced in the range [0.01, 2.5] as inputs and outputs an interpolation factor sfb (320 × 240 × 1). The fixed band depth estimation is Dfb = (1−sfb)·dmin +sfb ·dmax. Narrow band module: The narrow band module is run iteratively; in each iteration build a cost volume Cnb from a set of depth labels distributed with a band width σnb of 0.0125. It consists of two encoder-decoder pairs.
  • 30.
  • 31.
    Visual SLAM forAutomated Driving: Exploring the Applications of Deep Learning • Recently, there is progress on using CNN models for geometric vision tasks like depth estimation, optical flow prediction or motion segmentation. • However, Visual SLAM remains to be one of the areas of automated driving where CNNs are not mature for deployment in commercial automated driving systems. • This work explores how deep learning can be used to replace parts of the classical Visual SLAM pipeline. • Firstly, it describes the building blocks of Visual SLAM pipeline composed of standard geometric vision tasks. • Then it provides an overview of Visual SLAM use cases for automated driving based on the authors’ experience in commercial deployment. • Finally, it discusses the opportunities of using Deep Learning to improve upon state-of-the- art classical methods.
  • 32.
    Visual SLAM forAutomated Driving: Exploring the Applications of Deep Learning
  • 33.
    Visual SLAM forAutomated Driving: Exploring the Applications of Deep Learning
  • 34.
    DeepFusion: Real-Time Dense3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions • While the keypoint-based maps created by sparse monocular SLAM systems are useful for camera tracking, dense 3D reconstructions may be desired for many robotic tasks. • Solutions involving depth cameras are limited in range and to indoor spaces, and dense reconstruction systems based on minimising the photometric error between frames are typically poorly constrained and suffer from scale ambiguity. • To address these issues, a proposed 3D reconstruction system leverages the output of a CNN to produce fully dense depth maps for keyframes that include metric scale. • DeepFusion, is capable of producing real-time dense reconstructions on a GPU. • It fuses the output of a semi- dense multi-view stereo algorithm with the depth and gradient predictions of a CNN in a probabilistic fashion, using learned uncertainties produced by the network. • While the network only needs to be run once per keyframe, it is optimised for the depth map with each new frame so as to constantly make use of new geometric constraints.
  • 35.
    DeepFusion: Real-Time Dense3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions DeepFusion represents the observed geometry with a series of keyframe depth maps. With each new RGB image, the system obtains the pose from a monocular SLAM system (ORB-SLAM2) and then updates the semi-dense depth estimates for the active keyframe. If the camera has translated more than the largest value or had fewer than the minimal inliers in the semi-dense estimation, a new keyframe is created. To maintain a high frame rate, the network outputs are only generated once per keyframe. Using a CNN, it predict the log-depth, log-depth gradients and associated uncertainties from the new keyframe image. If a new keyframe is not created, then the current semi- dense depth map and network outputs are fused to update the current depth map.
  • 36.
    DeepFusion: Real-Time Dense3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions Comparison of reconstruction accuracy in terms of percentage of correct depth values (within 10% of ground truth) on ICL-NUIM and TUM RGB-D datasets (TUM/seq1: fr3/long office household, TUM/seq2: fr3 no structure texture near with loop, TUM/seq3: fr3/structure texture far). LSD-SLAM (BS) is LSD-SLAM bootstrapped with a ground truth depth map, and REMODE uses LSD-SLAM (BS) poses and keyframes.
  • 37.
    Training Deep SLAMon Single Frames • Collecting ground truth poses to train learning-based visual odometry and SLAM methods is difficult and expensive. • This could be resolved by training in an unsupervised mode, but there is still a large gap between performance of unsupervised and supervised methods. • This work focuses on generating synthetic data for deep learning-based visual odometry and SLAM methods that take optical flow as an input. • It produces training data in a form of optical flow that corresponds to arbitrary camera movement between a real frame and a virtual frame. • For synthesizing data, use depth maps either by a depth sensor or estimated from stereo. • It trains visual odometry model on synthetic data and do not use ground truth poses hence this model can be considered unsupervised. • Also it can be classified as monocular as no use of depth maps on inference. • A simple way to convert any visual odometry model into a SLAM method based on frame matching and graph optimization.
  • 38.
    Training Deep SLAMon Single Frames Architecture of visual odometry model
  • 39.
    Training Deep SLAMon Single Frames SLAM Architecture
  • 40.
    Training Deep SLAMon Single Frames Metrics on KITTI for unsupervised methods. Ground truth and estimated EuRoC trajectory
  • 41.
    DeepVIO: Self-supervised DeepLearning of Monocular Visual Inertial Odometry using 3D Geometric Constraints • This work is a self-supervised deep learning network for monocular visual inertial odometry (named DeepVIO). • DeepVIO provides absolute trajectory estimation by directly merging 2D optical flow feature (OFF) and IMU data. • Specifically, it firstly estimates the depth and dense 3D point cloud of each scene by using stereo sequences, and then obtains 3D geometric constraints including 3D optical flow and 6-DoF pose as supervisory signals. • In DeepVIO training, 2D optical flow network is constrained by the projection of its corresponding 3D optical flow, and LSTM- style IMU pre-integration network and the fusion network are learned by minimizing the loss functions from ego-motion constraints. • Furthermore, it employs an IMU status update scheme to improve IMU pose estimation through updating the additional gyroscope and accelerometer bias. • Compared to the traditional methods, DeepVIO reduces the impacts of inaccurate Camera- IMU calibrations, unsynchronized and missing data.
DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints
Ego-motion and 3D optical flow are used as 3D geometric constraints to supervise the 2D optical flow learned by the 2D flow network and the ego-motions estimated by the IMU pre-integration network and the VI fusion network; the state of the IMU is updated when it receives feedback from the VI fusion network.
DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints
DeepVIO mainly consists of CNN-Flow, LSTM-IMU and FC-Fusion networks, which jointly compute continuous trajectories from monocular images and IMU data. OFF and 2D optical flow are calculated by the CNN-Flow network. The relative 6-DoF pose between two adjacent frames is calculated by the IMU pre-integration network (LSTM-IMU) from the IMU data and status. Finally, the concatenated features from OFF and IMU-se3 are fed into the FC-Fusion network to calculate the trajectory of the monocular camera. A pretrained depth network (e.g., PSMNet) is first applied to estimate dense depth images, from which 3D point clouds are recovered. Next, the 6-DoF relative pose and 3D optical flow are calculated via the well-known ICP method. Moreover, a 2D optical flow is synthesized by projecting the 3D optical flow into the camera view.
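The 3D-to-2D projection that produces the supervisory 2D optical flow can be sketched as below, assuming the points and their ICP-derived 3D displacements are given in the camera frame at time t; the function and its names are illustrative only.

```python
import numpy as np

def project_3d_flow(points, flow_3d, K):
    """Project a 3D optical-flow field into a 2D flow supervisory signal.

    points  : Nx3 scene points in the camera frame at time t.
    flow_3d : Nx3 per-point 3D displacement from t to t+1 (e.g. from ICP).
    K       : 3x3 camera intrinsics.

    Returns Nx2 pixel displacements used to supervise the 2D flow network.
    """
    def project(p):
        q = (K @ p.T).T
        return q[:, :2] / q[:, 2:3]

    uv_t = project(points)
    uv_t1 = project(points + flow_3d)
    return uv_t1 - uv_t
```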
DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints
Trajectory results on KITTI 09 and EuRoC MH04.
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
• As the foundation of driverless vehicles and intelligent robots, SLAM has attracted much attention these days.
• However, non-geometric modules of traditional SLAM algorithms are limited by data association tasks and have become a bottleneck preventing the development of SLAM.
• To deal with such problems, many researchers turn to Deep Learning for help.
• But most of these studies are limited to virtual datasets or specific environments, and even sacrifice efficiency for accuracy.
• DF-SLAM uses deep local feature descriptors obtained by a neural network as a substitute for traditional hand-crafted features.
• Since it adopts a shallow network to extract local descriptors and keeps the rest of the system the same as the original SLAM pipeline, DF-SLAM can still run in real time on a GPU.
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
• It derives the tracking thread from visual odometry algorithms.
• Tracking takes charge of constructing data associations between adjacent frames using visual feature matching.
• Afterwards, it initializes frames with the help of data associations and estimates the camera localization using the epipolar geometric constraint.
• It also decides whether new keyframes are needed.
• If tracking is lost, global relocalization is performed based on the same sort of features.
• Local Mapping is run regularly to optimize camera poses and map points.
• It receives information constructed by the tracking thread and reconstructs a partial 3D map.
• If loops are detected, the Loop Closure thread takes over to optimize the whole graph and close the loop.
• The frame with a high matching score is selected as a candidate loop-closing frame, which is used to complete loop closing and global optimization.
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
• The first step is to extract interest points, using the TFeat network to describe the region around key points and generate a normalized 128-D float descriptor.
• Extracted features are stored in every frame and passed to the tracking, mapping and loop closing threads.
• The localization algorithm is based on DBoW.
• To speed up the system, a visual vocabulary is employed, as in numerous computer vision applications.
• The vocabulary, based on DBoW, is trained using the feature descriptors extracted by DF (Deep Feature) methods.
• Therefore, it assigns a word vector and a feature vector to each frame and calculates their similarity more easily (a scoring sketch follows below).
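For reference, the similarity between two such bag-of-words vectors can be computed with the L1 score commonly used with DBoW-style vocabularies; the snippet below is an assumption about the scoring, not DF-SLAM's exact implementation.

```python
import numpy as np

def bow_l1_score(v1, v2):
    """Similarity between two bag-of-words vectors of visual-word weights.

    v1, v2 : 1-D arrays indexed by vocabulary word (e.g. TF-IDF weights).
    Returns a score in [0, 1]; 1 means identical word distributions.
    """
    v1 = v1 / np.sum(np.abs(v1))
    v2 = v2 / np.sum(np.abs(v2))
    return 1.0 - 0.5 * np.sum(np.abs(v1 - v2))
```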
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
The architecture of TFeat: there are only two convolutional layers, each followed by a Tanh non-linearity, in each branch. Max pooling is added after the first convolutional layer to reduce parameters and further speed up the network. As the last layer, a fully connected layer outputs a 128-D descriptor that is L2-normalized to unit length.
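A minimal PyTorch sketch of one branch of such a network is shown below. The 32x32 patch size and the filter sizes/counts are assumptions; only the overall structure (two conv + Tanh layers, max pooling after the first, and a fully connected layer producing an L2-normalized 128-D descriptor) follows the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFeatBranch(nn.Module):
    """One branch of a TFeat-style descriptor network (a hedged sketch)."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=7)   # 32x32 -> 26x26
        self.pool = nn.MaxPool2d(2)                    # 26x26 -> 13x13
        self.conv2 = nn.Conv2d(32, 64, kernel_size=6)  # 13x13 -> 8x8
        self.fc = nn.Linear(64 * 8 * 8, 128)

    def forward(self, patch):
        x = torch.tanh(self.conv1(patch))
        x = self.pool(x)
        x = torch.tanh(self.conv2(x))
        x = self.fc(x.flatten(1))
        return F.normalize(x, p=2, dim=1)  # unit-length 128-D descriptor
```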
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
Trajectories with loop closure added vs. without loop closure.
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry
• This is a self-supervised learning framework for visual odometry (VO) that incorporates the correlation of consecutive frames and takes advantage of adversarial learning.
• Previous methods tackle self-supervised VO as a local SfM problem that recovers depth from a single image and relative poses from image pairs by minimizing the photometric loss between warped and captured images.
• As single-view depth estimation is an ill-posed problem, and the photometric loss is incapable of discriminating distortion artifacts of warped images, the estimated depth is vague and the pose is inaccurate.
• This framework learns a compact representation of frame-to-frame correlation, which is updated by incorporating sequential information.
• The updated representation is used for depth estimation.
• Besides, VO is tackled as a self-supervised image generation task, taking advantage of GANs.
• The generator learns to estimate depth and pose to generate a warped target image. The discriminator evaluates the quality of the generated image with high-level structural perception, which overcomes the problem of pixel-wise losses in previous methods.
• Experiments on the KITTI and Cityscapes datasets show that this method obtains more accurate depth with details preserved, and the predicted pose significantly outperforms state-of-the-art self-supervised methods.
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry
The network extracts optical flow into a compact code, which is incorporated by an LSTM to aggregate historical information and refine previous estimations. Depth and pose estimation is regarded as an image-conditioned generative task, with the refined code provided as the input signal. The geometric inference is used to reconstruct a warped image by view synthesis, which is evaluated by a discriminator.
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry
The encoder compresses the optical flow of two consecutive images into a compact code, which is aggregated and refined by an LSTM. The DepthNet estimates depth conditioned on the refined code and the input image. The estimated depth is concatenated with the image for pose and mask prediction, while the authenticity of the warped image is judged by the discriminator. The discriminator is excluded during the test phase.
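A hedged sketch of how the warped image could be supervised both photometrically and adversarially is given below; the GAN variant (non-saturating BCE on logits) and the loss weight lam are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adversarial_vo_losses(discriminator, warped, target, lam=0.01):
    """Loss terms for treating VO as an image-generation task (a sketch).

    warped : target view synthesized from the source view using the
             predicted depth and pose.
    target : the real captured target image.
    """
    # Generator side: photometric reconstruction plus an adversarial term
    # that asks the discriminator to rate the warped image as real.
    photometric = (warped - target).abs().mean()
    d_on_warped = discriminator(warped)
    g_adv = F.binary_cross_entropy_with_logits(
        d_on_warped, torch.ones_like(d_on_warped))
    g_loss = photometric + lam * g_adv

    # Discriminator side: real captured images vs. detached warped images.
    d_real = discriminator(target)
    d_fake = discriminator(warped.detach())
    d_loss = 0.5 * (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
        F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    return g_loss, d_loss
```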
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry
Deep Direct Visual Odometry
• Different kinds of approaches have been proposed to solve VO problems, including direct methods, semi-direct methods and feature-based methods.
• Monocular direct visual odometry (DVO) relies heavily on high-quality images and a good initial pose estimate for an accurate tracking process, which means that DVO may fail if the image quality is poor or the initial value is incorrect.
• This study is a new architecture to overcome the above limitations by embedding deep learning into DVO.
• Compared with traditional VO methods, deep learning models do not rely on high-precision feature correspondences or high-quality images.
• A self-supervised network architecture for effectively predicting 6-DoF pose, called PoseNet, is proposed, and its pose prediction is incorporated into Direct Sparse Odometry (DSO) for a robust initialization and tracking process.
• A soft-attention model and an STM (selective transfer model) module are used to improve the feature manipulation ability for accurate pose regression.
Deep Direct Visual Odometry
Self-supervised network architecture. (a) To achieve a better pose prediction, 7 convolution layers with kernel size 3 are used for feature extraction, followed by fully connected layers and the attention model. (b) A soft-attention model is used for feature association and selection; the reweighted features are used to predict the 6-DoF relative pose. (c) An STM module is used to replace the common skip connection between encoder and decoder and selectively transfers characteristics in DepthNet. (d) The single-frame DepthNet adopts an encoder-decoder framework with the selective transfer model, and the kernel size is 3 for all convolution and deconvolution layers.
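A minimal sketch of a channel-wise soft-attention block of this kind is shown below; the exact form of the attention in the paper may differ, so treat the layer choices as assumptions.

```python
import torch.nn as nn

class SoftAttention(nn.Module):
    """Channel-wise soft attention: features are reweighted by learned
    weights in [0, 1] before being passed to the pose regressor."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global context per channel
            nn.Conv2d(channels, channels, 1), # per-channel score
            nn.Sigmoid(),
        )

    def forward(self, feat):
        return feat * self.score(feat)        # reweighted features
```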
Deep Direct Visual Odometry
The self-supervised training framework. There are only two components in the loss function L used as the supervisory signal during training: the view reconstruction consistency loss Lc and the depth smoothness loss Lsmooth.
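A plausible form of these two terms, written as a sketch with common choices (L1 photometric error and edge-aware smoothness), is given below; the exact weighting and formulation in the paper may differ.

```python
import torch

def view_reconstruction_loss(warped, target):
    """L1 photometric consistency between synthesized and captured views (Lc)."""
    return (warped - target).abs().mean()

def depth_smoothness_loss(disp, image):
    """Edge-aware smoothness (Lsmooth): depth gradients are penalized less
    where the image has strong gradients. disp and image are BxCxHxW."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```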
Deep Direct Visual Odometry
The DDSO (deep direct sparse odometry) pipeline. This work augments the DSO framework with the pose prediction module (PoseNet). Every new frame is fed into the proposed PoseNet together with the last frame to regress a relative pose estimate. The predicted pose is used to improve initialization and tracking in DSO.
Learning By Inertia: Self-supervised Monocular VO For Road Vehicles
• The method, called iDVO (inertia-embedded deep visual odometry), is a self-supervised learning based monocular visual odometry (VO) for road vehicles.
• When modelling the geometric consistency within adjacent frames, most deep VO methods ignore the temporal continuity of the camera pose, which results in very severe jagged fluctuations in the velocity curves.
• With the observation that road vehicles tend to exhibit smooth dynamic characteristics most of the time, the inertia loss function describes abnormal motion variation, which helps the model learn consecutiveness from long-term camera ego-motion (see the sketch below).
• Based on the recurrent convolutional neural network (RCNN) architecture, this method implicitly models the dynamics of road vehicles and the temporal consecutiveness with an extended LSTM block.
• Furthermore, it develops a dynamic hard-edge mask to handle the non-consistency in fast camera motion by blocking the boundary part, which makes generating the overall non-consistency mask more efficient.
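A minimal sketch of such an inertia term is given below: it simply penalizes abrupt changes in the predicted frame-to-frame translations over a sequence. The exact form used in iDVO may differ; this is an illustrative assumption.

```python
import torch

def inertia_loss(translations):
    """Penalize abrupt changes in per-frame translation (velocity) over a
    sequence, encouraging the smooth dynamics expected of road vehicles.

    translations : Tx3 tensor of predicted frame-to-frame translations.
    """
    velocity_change = translations[1:] - translations[:-1]
    return velocity_change.abs().mean()
```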
Learning By Inertia: Self-supervised Monocular VO For Road Vehicles
The CNN-RCNN dual-network structure takes the sequences as input and outputs the per-pixel depth with 6-DoF poses sequentially. The total mask used in view synthesis combines the computed dynamic hard-edge mask and the estimated explainability mask. Both networks can be tested individually.
Learning By Inertia: Self-supervised Monocular VO For Road Vehicles
The CNN-RCNN network structure. The correspondence between color and layer/operation is shown in the legend at the bottom. Left: the DispNet architecture is adopted for the depth estimator CNN. Right: the pose-prediction RCNN; the decoder part is for multi-scale explainability mask prediction.
Learning By Inertia: Self-supervised Monocular VO For Road Vehicles
The sketch of the dynamic hard-edge mask (the red part). For better display, the frame-to-frame time between these 3 frames is ∼0.5 s, so the mask is shown much larger than the actual mask. Because the networks are trained on video captured from moving vehicles (static frames are removed by optical flow), some obstacles at the edges of one frame might not be captured in the next frame (forward motion). In reconstruction, some pixels at the edge of the source frame will not be mapped within the boundary of the target frame. To generate the non-consistency mask efficiently and precisely, the dynamic hard-edge mask (DHEM) is used. The DHEM is a hard mask at the edge of the source frame which, together with the soft estimated explainability mask, forms the complete mask. The hard mask blocks the edge pixels by ignoring the pixels inside the mask during reconstruction.
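A minimal sketch of such a hard border mask is shown below; treating the border width as a fixed parameter is a simplification, since in practice it could be scaled with the estimated camera motion.

```python
import numpy as np

def dynamic_hard_edge_mask(height, width, border):
    """Binary mask that ignores a border of the source frame during view
    reconstruction; pixels near the edge may leave the field of view in
    the next frame. `border` is the mask width in pixels.
    """
    mask = np.ones((height, width), dtype=bool)
    mask[:border, :] = False
    mask[-border:, :] = False
    mask[:, :border] = False
    mask[:, -border:] = False
    return mask
```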
Learning By Inertia: Self-supervised Monocular VO For Road Vehicles