Estimation of Camera Trajectory with
Visual Odometry
Research Internship
Submitted in Fulfilment of the
Requirements for the Academic Degree
M.Sc.
Dept. of Computer Science
Chair of Computer Engineering
Submitted by: Anutam Majumder
Student ID: 456540
Date: 07.05.2021
Supervising tutor: Shadi Saleh
Batbayar Batseren
Contents

Contents
List of Figures
1 Introduction
2 Research Questions and Objectives / Research Challenges
3 Literature Review
3.0.1 Visual SLAM
3.0.2 Deep Learning based Methods
4 Methodology
4.0.1 Architecture of the CNN-RNN network
4.0.2 Feature extraction method of Convolutional Neural Network
4.0.3 Sequential Modelling method of Recurrent Neural Network
4.0.4 Cost Function and Optimisation
5 Results
5.0.1 Dataset
5.0.2 Training and Testing
6 Evaluation
Bibliography
List of Figures

1.1 Conventional Visual Odometry Pipeline
2.1 The red boxes show data, and the blue boxes show operations on the data. At the highest level, consecutive video frames are used to calculate an optical flow image, which is then fed into a convolutional neural network that outputs the odometry information needed to create a map.
3.1 A block diagram showing the main components of (a) a VO and (b) a filter-based SLAM system
4.1 Proposed architecture of the CNN-RNN based monocular VO system. The dimensions of the tensors shown are examples based on the image dimensions of the KITTI dataset; the CNN dimensions vary according to the size of the input image. Camera image credit: KITTI dataset.
4.2 Configuration of the CNN
4.3 Folded and unfolded LSTM structure
5.1 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.2 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.3 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.4 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.5 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
6.1 Performance of our Monocular VO model compared to VISO2-M and VISO2-S
1 Introduction
One of the fundamental needs of humans and mobile agents is to localize and map
themselves with respect to the environment they operate in. Humans are able to
perceive their location and pose via multi-modal sensory perception [2] in a complex
three-dimensional space. This ability to perceive self-motion and the surroundings
plays a vital role in developing human cognition. Similarly, artificially intelligent
agents or robots should be able to perceive the environment and predict their pose
and location using on-board sensors. Visual Odometry (VO) has emerged as one of
the most essential techniques for pose estimation and robot localisation. It estimates
the ego-motion of a camera by integrating the relative motion between images into
global poses [30].
Figure 1.1: Conventional Visual Odometry Pipeline
Deep learning has emerged as the first choice for tackling computer vision problems
over the last decade, but its potential has not yet been fully exploited for the visual
odometry problem. Most deep learning models deal only with recognition and
classification, which enables Convolutional Neural Networks (CNNs) to extract
appearance information from images. The limitation of such appearance-based
implementations is that these VO systems function only in the environments they
were trained in. A visual odometry system predicts the pose and localization correctly
when feature-based maps provide more accuracy than appearance-based maps [20].
A VO algorithm should be capable of modelling motion dynamics by reading changes
across a sequence of images and drawing connections between them, rather than
operating on a single image. This implies a requirement for sequential learning, which
a CNN alone is incapable of handling. In this work, this problem is addressed by
using a deep CNN to create feature maps, followed by an RNN that creates
correspondences between the learned features. The proposed method is capable of
learning features through the CNN and updating them according to the previous
states through the RNN. It is an end-to-end solution to the visual odometry problem
and does not require any module from the classical VO pipeline (not even camera
calibration).
2 Research Questions And
Objectives / Research Challenges
One of the main challenging aspects of conventional visual odometry methods is
that they utilize sparse key points to track features across moving pixels. Traditional
odometry methods map the results and place the vehicle on the map through a
process called Simultaneous Localization and Mapping (SLAM). However, slow
algorithmic performance and high memory usage prevent these systems from meeting
the requirements of extended data acquisition and processing. In this work, pixel
movements are calculated using optical flow instead of tracking sparse key points as
in the conventional visual odometry pipeline. Optical flow computes pixel movements
in both the horizontal and vertical directions. Linear movement between frames is
assumed for a video stream with small incremental movements as the camera updates,
and a proportional relationship between pixel movements and the physical movement
of the camera is established [26].
Figure 2.1: The red boxes show data, and the blue boxes show operations on the
data. At the highest level, consecutive video frames are used to calculate
an optical flow image, which is then fed into a convolutional neural
network that outputs the odometry information needed to create a map.
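To make the optical flow step concrete, the dense flow between two consecutive frames can be computed, for example, with OpenCV's Farneback method. The following is a minimal sketch, not code from the original work; the function name and parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def dense_flow(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between two consecutive grayscale frames.

    Returns an (H, W, 2) array of per-pixel horizontal and vertical
    displacements, usable as a 2-channel input image for a CNN.
    """
    # Positional args: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags (values here are common defaults).
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```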
The main contributions of this work are as follows:
• A fully converged deep learning approach to Visual Odometry calculation.
• An accurate Visual Odometry system capable of prediction in real time.
3 Literature Review
Visual Odometry can be defined as the process of estimating a robot’s translational
and rotational motion with respect to a reference frame from observations of a
sequence of images of its environment. Visual Odometry can be described as a
particular case of a technique called structure from motion (SFM). SFM handles the
problem of 3D reconstruction of the structure of the environment from a sequence of
images. It can also predict the camera poses from sequentially ordered or unordered
image sets [33]. However, in SFM the final refinement and global optimization step
of both the camera poses and the structure is computationally expensive and is
usually performed off-line, whereas in visual odometry the camera poses are estimated
in real time. Visual Odometry techniques can be categorized into Monocular [5]
and Stereo Camera [23] methods. These methods can be further categorized into
feature matching (matching features over a number of images) [23], feature tracking
[9] (matching features in adjacent frames) and optical flow techniques. These
categorizations are based upon the intensity of all pixels or of specific regions in
sequential images.
The technique of estimating a robot’s ego-motion by observing a sequence of incoming
frames started in the 1980s with Moravec [25] at Stanford University. Moravec
introduced a stereo vision based technique in which a single camera slid on a rail
in a move-and-stop fashion, enabling the robot to extract image features (corners)
in the first image. The camera then slid on the rail in a direction perpendicular to
the robot’s motion, repeating the process until 9 images were captured. Features
were matched between the 9 images using Normalized Cross Correlation (NCC)
and used to reconstruct the 3D structure. The reconstructed 3D points were then
observed from different locations, and the data was aligned to calculate the camera
motion transformation.
The scope of the above work was later extended by Matthies and Shafer [24]. They
derived an error model using 3D Gaussian distributions instead of using the scalar
model used in the earlier method. Other stereo VO implementations followed in
later studies. For example, in [29] maximum likelihood ego-motion
for modeling the error was introduced for localization of a rover over long distances.
In [21] a method was described for rover localization that took in raw image data
instead of geometric data for motion estimation.
The term “Visual Odometry” was first coined by Nister et al. [28]; its similarity
to the concept of wheel odometry led to the naming. Methods for obtaining camera
motion from visual input in both monocular and stereo systems were proposed.
These methods could estimate camera motion in the presence of outliers, and an
outlier rejection scheme using RANSAC [10] was proposed to remove them.
Figure 3.1: A block diagram showing the main components of (a) a VO and (b) a
filter-based SLAM system
The above method was also the first to successfully track features across all frames
instead of matching features only between two consecutive frames, as in earlier
methods, thereby avoiding feature drift during cross-correlation based tracking. A
RANSAC-based motion estimation using the 3D-to-2D reprojection error was also
proposed; 3D-to-2D reprojection errors were shown to give better estimates than
3D-to-3D errors [12].
Scaramuzza and Siegwart carried out another important study in visual odometry
[32]. In their research, a monocular omni-directional camera was used and the
motion estimates obtained by two approaches were fused. In the first approach, SIFT
features were extracted and RANSAC was used for outlier removal [22]. In the second
approach, appearance based methods were used [6]. Appearance based techniques
can handle outdoor spaces efficiently and robustly, and they avoid error-prone feature
extraction and matching.
A stereo VO system for outdoor navigation was proposed by Kaess et al. [16]. Sparse
features were obtained by feature matching and were segregated, based on flow, into
close features and far features. The reason for the separation is that small changes in
camera translation do not influence points that are far away. The far points were used
to recover the rotation transformation using a two-point RANSAC, while the close
points were used to recover the translation using a one-point RANSAC.
In the next section, another important technique for motion estimation, Visual
SLAM, is discussed.
3.0.1 Visual SLAM
Another significant approach to robot localization and mapping is Visual SLAM.
SLAM is the process by which a robot localizes itself in an unknown environment
while constructing a map of its surroundings. SLAM has been researched over the
last couple of decades with different solutions using different sensors, including sonar
sensors, IR sensors and laser scanners.
Recently, research interest in V-SLAM has grown because passive low-cost video
sensors provide rich visual information compared to laser scanners. However, more
sophisticated algorithms are required for processing images and extracting the
necessary information. Thanks to advances in CPU and GPU technologies, real-time
implementation of the required complex algorithms is no longer difficult. A variety
of solutions including monocular [8], stereo, omni-directional [18], time-of-flight
(TOF) [34] and combined color and depth (RGB-D) [12] cameras have been proposed.
Davison et al. [8] used a single monocular camera and constructed a map by
extracting sparse features of the environment with the Shi and Tomasi operator [35].
New features were matched to those already observed using a normalized sum-of-
squared-difference correlation technique. The use of a single monocular camera meant
that the absolute scale of structures could not be obtained, and the camera had to be
calibrated. An Extended Kalman Filter (EKF) was utilized for state estimation, and
only a limited number of features were extracted and tracked.
Se et al. [13] proposed a vision based method for mobile robot localization and
mapping that used SIFT for feature extraction.
CV-SLAM (Ceiling Vision SLAM) is another technique, in which the camera points
upwards towards the ceiling. The advantages of CV-SLAM compared to frontal-view
V-SLAM are fewer interactions with moving obstacles and steady observation of
features. Jeong et al. [15] were the first to introduce CV-SLAM. They employed a
single monocular camera and extracted corner features with the Harris corner detector
[11]. A landmark orientation estimation technique was then applied to align the
currently observed landmarks with previously stored ones, using an NCC method.
There has been increasing interest in dense 3D reconstruction of the environment
as compared to sparse 2D and 3D SLAM. Newcombe and Davison [27] obtained a
dense 3D model of the environment in real time using a single monocular camera,
but their method is limited to small environments. Henry et al. [14] implemented
an RGB-D mapping approach utilizing an RGB-D camera (i.e. the Microsoft Kinect)
to obtain a dense 3D reconstruction of the environment and estimate the
6-degree-of-freedom (6DOF) camera pose. The method extracted Features from
Accelerated Segment Test (FAST) [31] features in each frame and matched them
with features from the previous frame using Calonder descriptors [3]. A RANSAC
alignment step then obtained a subset of feature matches (inliers) corresponding to
a consistent rigid transformation. This transformation was used as an initial guess
in the Iterative Closest Point (ICP) [4] algorithm, which refined the transformation
obtained by RANSAC. Sparse Bundle Adjustment (SBA) was also applied to obtain
a globally consistent map, and loop closure was detected by matching the current
frame to previously collected key-frames.
Bachrach et al. [1] proposed a VO and SLAM system for unmanned aerial vehicles
(UAVs) utilizing an RGB-D camera. The method relied on extracting FAST features
from sequential preprocessed images at different pyramid levels. This step was
followed by an initial rotation estimate that limited the size of the search window for
feature matching. The matching was performed by finding the mutually lowest sum
of squared difference (SSD) score between the descriptor vectors. A greedy algorithm
refined the matches and obtained the inlier set, which was then used to estimate the
motion between frames. To reduce drift in the motion estimates, they suggested
matching the current frame to a selected key-frame instead of matching consecutive
frames.
In the above section, we discussed various approaches to solving the V-SLAM
problem. Earlier methods generally focused on sparse 2D and 3D SLAM due to the
limitations of available computational resources. More recently, interest has shifted
towards dense 3D reconstruction of the environment thanks to technological advances
and the availability of efficient optimization methods.
These studies indicate that traditional machine learning techniques are inefficient
when solving for big, highly non-linear, high-dimensional data, e.g., RGB images.
Deep learning techniques, which automatically learn suitable feature representations
from large-scale datasets, provide an alternative solution to the VO problem.
3.0.2 Deep Learning based Methods
Deep learning technologies have achieved significant results in localisation related
applications in recent years. CNNs are mainly utilised for appearance based place
recognition [37]. K. Konda and R. Memisevic [19] first realised DL based VO through
synchrony detection between image sequences and features. A softmax function was
utilized after estimating depth and velocity to predict the discretised changes of
direction and velocity. This method provides a very good estimate for DL based
stereo VO, but it formulates VO as a classification problem rather than as pose
regression.
Kendall et al. [17] proposed a solution for camera relocalisation from a single image.
Their method fine-tuned CNNs on images of a specific scene, labelling these images
by Structure from Motion (SFM). This process is time-consuming and labour
intensive for large-scale scenarios.
In their implementation, a trained CNN model serves as an appearance “map” of
the scene. The problem with this approach is that the model needs to be re-trained
or fine-tuned for each new environment, which makes it unsuitable for widespread
use and is one of the biggest hindrances to applying deep learning techniques to VO.
To overcome this problem, an alternative method was proposed in which the CNNs
were provided with dense optical flow instead of RGB images [7]. Three different
CNN architectures were developed using optical flow networks to learn appropriate
features for VO. These methods led to a robust VO capable of handling even blurred
and under-exposed images. However, the proposed CNNs require pre-processed dense
optical flow as input and therefore cannot benefit from end-to-end learning, as is
true to the nature of deep learning models. The above methods are also inappropriate
for real-time applications.
CNNs alone are incapable of modelling sequential information. Therefore, none of
the previous works considers image sequences or videos for learning. In this work,
we tackle this by utilizing RNNs, which are capable of sequential learning.
4 Methodology
Our approach to solving the above problem utilizes a deep CNN-RNN framework,
composed of CNN based feature extraction and RNN based sequential modelling.
Since a CNN is incapable of handling sequential information, the RNN layers are
added to process the features already extracted by the CNN.
Figure 4.1: Proposed architecture of the CNN-RNN based monocular VO system.
The dimensions of the tensors shown are examples based on the image
dimensions of the KITTI dataset; the CNN dimensions vary according
to the size of the input image. Camera image credit: KITTI dataset.
4.0.1 Architecture of the CNN-RNN network
Some popular deep neural network architectures, such as VGGNet [36] and
GoogLeNet, which were originally developed for computer vision tasks, produce
good performance. These architectures are trained to learn knowledge from
appearance and image context. But the fundamentals of visual odometry are rooted
in epipolar geometry, which is not closely associated with appearance. So simply
adopting the current popular DNN architectures for the VO problem is impractical.
A framework should be employed which can learn geometric feature representations
to address VO and other geometric problems. At the same time, connections among
consecutive image frames should be derived, since a VO system evolves over time and
produces results based on image sequences acquired during movement. The
proposed model takes these two requirements into consideration. The architecture
of the proposed visual odometry system is shown in Fig. 4.1. The proposed
architecture takes a video clip or a monocular image sequence as its input. At each
time step, the RGB image frame is pre-processed by subtracting the mean RGB
values of the training set and resizing it to a size that is a multiple of 64. Two
consecutive images from the KITTI dataset are stacked together, forming a tensor
for the deep CNN-RNN network, based upon which the model learns to extract
motion information and estimate poses. This image tensor is fed into the CNN to
produce effective features for the monocular VO. These feature maps are then passed
through an RNN for sequential learning. Each image pair yields a pose estimate
at each time step through the network. As more and more images are captured
from the KITTI dataset, the VO system evolves over time and estimates new poses.
The advantage of this CNN-RNN based architecture is that it allows simultaneous
feature extraction and sequential modelling of VO through the combination of CNN
and RNN, and it does not require any separate preprocessing module.
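As an illustration of the input construction described above, the sketch below builds the stacked two-frame tensor; it is an assumed implementation consistent with the text (the helper names are ours), not the author's code:

```python
import numpy as np
import cv2

def make_input_tensor(img_t0, img_t1, mean_rgb):
    """Mean-subtract two consecutive RGB frames, resize them so both
    dimensions are multiples of 64, and stack them along the channel
    axis into a single 6-channel (2 x RGB), channel-first tensor."""
    def prep(img):
        h, w = img.shape[:2]
        img = cv2.resize(img, ((w // 64) * 64, (h // 64) * 64))
        return img.astype(np.float32) - mean_rgb  # broadcasts over H, W
    stacked = np.concatenate([prep(img_t0), prep(img_t1)], axis=2)
    return stacked.transpose(2, 0, 1)  # (6, H, W)
```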
4.0.2 Feature extraction method of Convolutional Neural
Network
To learn effective feature maps from the input tensors, a CNN is developed to
perform feature extraction on two consecutive monocular RGB images from the
KITTI dataset. The features learned are geometric rather than appearance based,
because the VO system needs to be deployed in unknown environments.
The configuration of the CNN is outlined in Figure 4.2.
Figure 4.2: Configuration of the CNN
The proposed CNN architecture has 9 convolutional layers. Each convolutional
layer is followed by a rectified linear unit (ReLU) activation, except the last one
(Conv6), so there are 17 layers in total. The convolution filter sizes in the network
gradually reduce from 7 × 7 to 5 × 5 to 3 × 3 to capture small and interesting
features. Zero padding is used, and the number of channels increases at each layer
to learn various features.
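A minimal PyTorch sketch of such a CNN is given below. The text specifies only the filter sizes, the zero padding, the growing channel counts and the missing ReLU after Conv6; the exact channel counts and strides here are assumptions in the spirit of Figure 4.2:

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, stride, relu=True):
    """Zero-padded convolution, optionally followed by ReLU."""
    layers = [nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return layers

# 9 convolutional layers; every one except the last (Conv6) is followed
# by a ReLU, giving 17 layers in total. Filter sizes shrink from 7x7 to
# 5x5 to 3x3 while channel counts grow (channels/strides are assumed).
cnn = nn.Sequential(
    *conv(6, 64, 7, 2),                  # Conv1: stacked image pair in
    *conv(64, 128, 5, 2),                # Conv2
    *conv(128, 256, 5, 2),               # Conv3
    *conv(256, 256, 3, 1),               # Conv3_1
    *conv(256, 512, 3, 2),               # Conv4
    *conv(512, 512, 3, 1),               # Conv4_1
    *conv(512, 512, 3, 2),               # Conv5
    *conv(512, 512, 3, 1),               # Conv5_1
    *conv(512, 1024, 3, 2, relu=False),  # Conv6: no ReLU
)
```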
4.0.3 Sequential Modelling method of Recurrent Neural
Network
After the CNN layers have extracted the feature maps, a deep RNN is employed to
perform sequential learning, i.e., to model the dynamics and relations among the
sequence of feature maps prepared by the CNN.
The VO problem involves a temporal model (motion model) and sequential data
(the image sequence). Therefore an RNN, which is well suited to modelling sequential
data, is employed for this task: estimating the pose of the current image frame
benefits from the information encapsulated in previous frames.
However, an RNN is not suitable for directly learning sequential representations
from high-dimensional raw data such as images. Therefore, the proposed system
adopts the appealing CNN-RNN architecture, with the CNN features as the input
of the RNN.
Figure 4.3: Folded and unfolded LSTM structure
An RNN maintains a memory of its hidden states over time and has feedback loops
among them, which makes the current hidden state a function of the previous ones.
Hence, the RNN can find the relationship between the current input and the previous
states in the sequence. Given a convolutional feature x_k at time k, the RNN updates
at time step k by
14
4 Methodology
h_k = \mathcal{H}(W_{xh} x_k + W_{hh} h_{k-1} + b_h)
y_k = W_{hy} h_k + b_y

where h_k is the hidden state and y_k is the output at time k, the W terms denote
the weight matrices of the hidden states and outputs, the b terms are bias vectors,
and \mathcal{H} is an element-wise nonlinear activation function.
Long Short-Term Memory (LSTM) is capable of learning long-term dependencies
through its memory gates and units [38]. An LSTM based RNN is therefore used
to find correlations among images taken over the long trajectories of the KITTI
dataset, as required for visual odometry.
The LSTM determines which previous hidden states are discarded or retained when
updating the current state; this is how the RNN predicts motion during pose
estimation. The folded LSTM and its unfolded version over time are shown in
Fig. 4.3, which also shows the internal structure of an LSTM unit. After unfolding
the LSTM, each LSTM unit is updated with a time step. Given the input x_k at
time k, and the hidden state h_{k-1} and the memory cell c_{k-1} of the previous
LSTM unit, the LSTM updates at each time step k according to
i_k = \sigma(W_{xi} x_k + W_{hi} h_{k-1} + b_i)
f_k = \sigma(W_{xf} x_k + W_{hf} h_{k-1} + b_f)
g_k = \tanh(W_{xg} x_k + W_{hg} h_{k-1} + b_g)
c_k = f_k \odot c_{k-1} + i_k \odot g_k
o_k = \sigma(W_{xo} x_k + W_{ho} h_{k-1} + b_o)
h_k = o_k \odot \tanh(c_k)
In the above expressions, \odot denotes the element-wise product of two vectors, \sigma
denotes the sigmoid non-linearity, \tanh denotes the hyperbolic tangent non-linearity,
the W terms denote the corresponding weight matrices, the b terms denote bias
vectors, and i_k, f_k, g_k, c_k and o_k are the input gate, forget gate, input modulation
gate, memory cell and output gate at time k, respectively.
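The update equations above can be transcribed directly into code. The sketch below is only a notational mirror of the formulas (in practice torch.nn.LSTMCell implements the same update):

```python
import torch

class LSTMCellFromEquations(torch.nn.Module):
    """Direct transcription of the LSTM update equations above."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map produces all four gate pre-activations:
        # W_xi/W_xf/W_xg/W_xo (with biases) and W_hi/W_hf/W_hg/W_ho.
        self.W_x = torch.nn.Linear(input_size, 4 * hidden_size)
        self.W_h = torch.nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x_k, h_prev, c_prev):
        z = self.W_x(x_k) + self.W_h(h_prev)
        i_k, f_k, g_k, o_k = z.chunk(4, dim=-1)
        i_k, f_k, o_k = i_k.sigmoid(), f_k.sigmoid(), o_k.sigmoid()
        g_k = g_k.tanh()                # input modulation gate
        c_k = f_k * c_prev + i_k * g_k  # memory cell update
        h_k = o_k * c_k.tanh()          # hidden state
        return h_k, c_k
```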
Our LSTM structure can handle the long-term dependencies required to correctly
analyze the long image sequences of the KITTI dataset. But the proposed
architecture still needs depth in its network layers to learn high-level representations
and model the complex dynamics between frames. To address this, the deep RNN is
constructed by stacking two LSTM layers, with the hidden states of one LSTM being
the input of the other, as in Fig. 4.1. Each of the LSTM layers has 1000 hidden
states. The deep RNN outputs a pose estimate at each time step based on the visual
features generated by the CNN. New poses are predicted over time as the camera
moves and more images are captured.
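A compact sketch of this recurrent part in PyTorch follows; the 6-DoF linear output head is an assumption consistent with the pose representation of Section 4.0.4, not a detail stated in the text:

```python
import torch

class PoseRNN(torch.nn.Module):
    """Two stacked LSTM layers with 1000 hidden states each, followed by
    a linear head regressing a 6-DoF pose (3 translations and 3 Euler
    angles) at every time step."""
    def __init__(self, feature_size):
        super().__init__()
        self.rnn = torch.nn.LSTM(feature_size, 1000,
                                 num_layers=2, batch_first=True)
        self.head = torch.nn.Linear(1000, 6)  # assumed output layer

    def forward(self, features):              # (batch, time, feature_size)
        out, _ = self.rnn(features)
        return self.head(out)                 # (batch, time, 6)
```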
4.0.4 Cost Function and Optimisation
The proposed CNN-RNN based VO system computes the conditional probability of
the poses Y_t = (y_1, ..., y_t) given a sequence of monocular RGB images
X_t = (x_1, ..., x_t) up to time t:

p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t)
To estimate the poses of the VO system correctly, we find the optimal parameters
\theta^* for which the DNN maximises the above probability:

\theta^* = \arg\max_{\theta} p(Y_t | X_t; \theta)

To learn the parameters \theta of the DNN, the Euclidean distance between the
ground truth pose (p_k, \varphi_k) at time k and its estimate (\hat{p}_k, \hat{\varphi}_k)
is minimised.
The loss function is composed of the Mean Square Error (MSE) of all positions p
and orientations \varphi:

\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \|\hat{p}_k - p_k\|_2^2 + \kappa \|\hat{\varphi}_k - \varphi_k\|_2^2
where \|\cdot\| is the 2-norm, \kappa (100 in the experiments) is a scale factor that
balances the weights of positions and orientations, and N is the number of samples.
The orientation \varphi is represented by Euler angles rather than a quaternion, since
a quaternion is subject to an extra unit constraint which hinders the optimisation
problem of DL. We also find that in practice using quaternions degrades the
orientation estimates to some extent.
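A sketch of this loss in PyTorch is shown below; the [x, y, z, roll, pitch, yaw] layout of the 6-DoF pose vector is an assumption for illustration:

```python
import torch

def pose_loss(pred, target, kappa=100.0):
    """MSE over positions plus kappa-weighted MSE over Euler angles.

    pred, target: (batch, time, 6) tensors, assumed to be laid out as
    [x, y, z, roll, pitch, yaw] per time step.
    """
    pos_err = (pred[..., :3] - target[..., :3]).pow(2).sum(-1)
    ang_err = (pred[..., 3:] - target[..., 3:]).pow(2).sum(-1)
    return (pos_err + kappa * ang_err).mean()
```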
5 Results
5.0.1 Dataset
The KITTI VO/SLAM benchmark has 22 sequences of images, of which 11
(Sequences 00-10) are associated with ground truth. The other 11 sequences
(Sequences 11-21) are provided only with raw sensor data. Our model was trained
on the image sequences 00, 02, 08 and 09 of the KITTI dataset, as these sequences
are relatively long. The trajectories are segmented into different lengths to generate
more data for training, producing 7410 samples in total. The trained model is tested
on Sequences 04, 05, 06, 07 and 10 for evaluation.
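A sketch of such trajectory segmentation is given below; the sub-sequence lengths and step size are placeholders, since the report does not state them:

```python
import random

def segment_trajectory(num_frames, min_len=5, max_len=8, step=1):
    """Cut one long trajectory into overlapping sub-sequences of varying
    length to multiply the amount of training data.

    Returns (start, end) frame-index pairs with an exclusive end.
    """
    samples = []
    for start in range(0, num_frames - min_len, step):
        end = min(start + random.randint(min_len, max_len), num_frames)
        samples.append((start, end))
    return samples
```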
The KITTI VO dataset provides translational and rotational errors for all possible
sub-sequences of length (100, ..., 800) meters. It also provides an evaluation table
which ranks methods according to the average of those values, where errors are
measured in percent (for translation) and in degrees per meter (for rotation). The
advantage of using this dataset is that our model is trained on all possible sequence
lengths and can hence predict the location and orientation more accurately across
unknown environments. This dataset also makes it easier to compare the accuracy
of our model's predictions with other state-of-the-art technologies.
5.0.2 Training and Testing
This section of the report aims to describe the training and testing methodology
adopted. For the sake of training our model so that it works in all environments,
the training sequences Sequence 00, 01, 02, 05, 08 and 09 which are relatively long are
used for training. These sequences covers all a lot different trajectories and patterns
of navigation route of the car. This factor, along with the fact that these sequences
are long enables our model to get trained on different scenarios. The sequences
00, 01, 02, 05, 08 and 09 contains 4541, 1101, 4661, 2761, 4071 and 1591 images
respectively. We evaluated our trained model with the test sequences 04, 05, 07, 09
and 10 and plotted the prediction with respect to the ground truth value to check
the accuracy of the trained model. The estimated VO trajectories corresponding to
the previous testing are given in Fig. 5.1,5.2,5.3,5.4 and 5.5.
Figure 5.1: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.2: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.3: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.4: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.5: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence
6 Evaluation
We can conclude that our model produces relatively accurate and consistent
trajectories compared to the ground truth. It demonstrates that scale can be
estimated better in an end-to-end fashion using our model than with prior estimation
methods based on factors such as camera height. Note that our VO model performs
no scale estimation or post-alignment to the ground truth to obtain the absolute
poses: the scale is maintained by the CNN-RNN network itself and is learned during
the end-to-end training. For conventional monocular VO methods, recovering an
accurate and robust scale is difficult, so our model suggests an appealing advantage
of DL based VO methods. The detailed performance of the algorithms on the testing
sequences is summarised in Figure 6.1. The KITTI Vision Benchmark Suite's table
was utilized to compare the performance of our model with two state-of-the-art
visual odometry algorithms: Library for Visual Odometry 2 (Stereo Version, Active
Matching) [VISO2-S] and Library for Visual Odometry 2 (Monocular Version)
[VISO2-M].
Figure 6.1: Performance of our Monocular VO model compared to VISO2-M and
VISO2-S
The comparison indicates that our deep learning model achieves more robust results
than the monocular VISO2. t_rel denotes the average translational RMSE drift (in
percent) over lengths of 100 m to 800 m, and r_rel denotes the average rotational
RMSE drift (in degrees per 100 m) over the same lengths.
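For reference, the sketch below shows one simplified way such a drift metric can be computed from position sequences; it is an assumption about the metric's implementation, and the official KITTI devkit differs in detail (it uses full SE(3) relative poses and averages over all segment lengths from 100 m to 800 m):

```python
import numpy as np

def translation_drift(gt_xyz, pred_xyz, seg_len=100.0):
    """Average translational drift (%) over sub-paths of ~seg_len meters.

    gt_xyz, pred_xyz: (N, 3) arrays of camera positions per frame.
    """
    # Cumulative distance travelled along the ground-truth path.
    steps = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)
    dists = np.concatenate([[0.0], np.cumsum(steps)])
    errors = []
    for i in range(len(gt_xyz)):
        # First frame j that lies seg_len meters further along the path.
        j = int(np.searchsorted(dists, dists[i] + seg_len))
        if j >= len(gt_xyz):
            break
        gt_delta = gt_xyz[j] - gt_xyz[i]
        pred_delta = pred_xyz[j] - pred_xyz[i]
        errors.append(np.linalg.norm(pred_delta - gt_delta) / seg_len)
    return 100.0 * float(np.mean(errors)) if errors else float("nan")
```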
Based on the findings of our work, as evaluated against the KITTI Vision Benchmark
Suite, we can conclude that this work presents a monocular VO algorithm based on
deep learning. Utilizing the power of deep recurrent neural networks, we were able
to achieve simultaneous representation learning and sequential modelling of the
monocular VO by combining CNNs with RNNs. Our approach does not depend on
any module of the conventional VO algorithms (not even camera calibration) for
pose estimation, and it is trained in an end-to-end manner. We also verified on the
KITTI VO benchmark that it can produce accurate VO results with precise scale
and can work well in completely new scenarios. Our method was able to produce
promising results in this area, but we stress that it is not expected to replace the
classic geometry based approach. For now, it can be a viable complement, i.e.,
incorporating geometry with the representations, knowledge and models learnt by
the DNNs to further improve VO in terms of accuracy and, more importantly,
robustness. More research combining classical geometry based and deep learning
based approaches is required.
Bibliography
[1] Bachrach, A., Prentice, S., He, R., Henry, P., Huang, A.S., Krainin, M.,
Maturana, D., Fox, D., Roy, N.: Estimation, planning, and mapping for
autonomous flight using an rgb-d camera in gps-denied environments (2012),
https://doi.org/10.1177/0278364912455256
[2] Bertelson, P., Gelder, B., Spence, C., Driver, J.: The Psychology of Multimodal
Perception, pp. 141–177 (04 2004)
[3] Besl, P., McKay, N.D.: A method for registration of 3-d shapes (1992)
[4] Besl, P., McKay, N.D.: A method for registration of 3-d shapes (1992)
[5] Campbell, J., Sukthankar, R., Nourbakhsh, I.R., Pahwa, A.: A robust visual
odometry and precipice detection system using consumer-grade monocular
vision (2005), http://dblp.uni-trier.de/db/conf/icra/icra2005.html#CampbellSNP05
[6] Comport, A., Malis, E., Rives, P.: Real-time quadrifocal visual odometry (2010)
[7] Costante, G., Mancini, M., Valigi, P., Ciarfuglia, T.A.: Exploring representation
learning with cnns for frame-to-frame ego-motion estimation (2016)
[8] Davison: Real-time simultaneous localisation and mapping with a single camera
(2003)
[9] Dornhege, C., Kleiner, A.: Visual odometry for tracked vehicles (01 2006)
[10] Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model
fitting with applications to image analysis and automated cartography (1981),
http://publication.wilsonwong.me/load.php?id=233282275
[11] Harris, C., Stephens, M.: A combined corner and edge detector (1988)
[12] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[13] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[14] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[15] Jeong, W., Lee, K.M.: Cv-slam: a new ceiling vision-based slam technique
(2005)
[16] Kaess, M., Ni, K., Dellaert, F.: Flow separation for fast and robust stereo
odometry (2009)
[17] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera
relocalization (09 2015)
[18] Kim, S., Oh, S.Y.: Slam in indoor environments using omni-directional vertical
and horizontal line features (01 2008)
[19] Konda., K., Memisevic., R.: Learning visual odometry with a convolutional
network (2015)
[20] Krombach, N., Droeschel, D., Houben, S., Behnke, S.: Feature-based visual
odometry prior for real-time semi-dense stereo SLAM (2018),
http://arxiv.org/abs/1810.07768
[21] Lacroix, S., Mallet, A., Chatila, R., Gallo, L.: Rover Self Localization in
Planetary-Like Environments (Aug 1999)
[22] Lowe, D.: Distinctive image features from scale-invariant keypoints (11 2004)
[23] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[24] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[25] Moravec, H.P.: Obstacle avoidance and navigation in the real world by a seeing
robot rover (Jun 2018), https://kilthub.cmu.edu/articles/journal_contribution/Obstacle_avoidance_and_navigation_in_the_real_world_by_a_seeing_robot_rover/6557033/1
[26] Muller, P., Savakis, A.: Flowdometry: An optical flow and deep learning based
approach to visual odometry (03 2017)
[27] Newcombe, R.A., Davison, A.J.: Live dense reconstruction with a single moving
camera
[28] Nister, D., Naroditsky, O., Bergen, J.: Visual odometry (2004)
[29] Olson, C., Matthies, L., Schoppers, M., Maimone, M.: Rover navigation using
stereo ego-motion (06 2003)
[30] Pillai, S., Leonard, J.J.: Towards visual ego-motion learning in robots (2017),
http://arxiv.org/abs/1705.10279
[31] Rosten, E., Drummond, T.: Machine learning for high-speed corner detection
(2006)
[32] Scaramuzza, D., Siegwart, R.: Appearance-guided monocular omnidirectional
visual odometry for outdoor ground vehicles (11 2008)
[33] Scaramuzza, D., Fraundorfer, F.: Visual odometry: part I - the first 30 years
and fundamentals (03 2011)
[34] Shi, J., Tomasi: Good features to track (1994)
[35] Shi, J., Tomasi: Good features to track (1994)
[36] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (09 2014)
[37] Sünderhauf, N., Shirazi, S., Jacobson, A., Pepperell, E., Dayoub, F., Upcroft,
B., Milford, M.: Place recognition with convnet landmarks: Viewpoint-robust,
condition-robust, training-free (07 2015)
[38] Zaremba, W., Sutskever, I.: Learning to execute. CoRR abs/1410.4615 (2014),
http://arxiv.org/abs/1410.4615
chaitra-1.pptx fake news detection using machine learning
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 

Visual Odometry Research Using Deep Learning

VISO2-S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1 Introduction

One of the fundamental needs of humans, mobile devices, and autonomous agents is to localize themselves and map their surroundings with respect to the environment they operate in. Humans perceive their location and pose in complex three-dimensional space through multi-modal sensory perception [2], and this ability to perceive self-motion and the surroundings plays a vital role in developing human cognition. Similarly, artificially intelligent agents and robots should be able to perceive their environment and estimate their pose and location using on-board sensors. Visual Odometry (VO) has emerged as one of the most essential techniques for pose estimation and robot localisation. It estimates the ego-motion of a camera by integrating the relative motion between consecutive images into global poses [30].

[Figure 1.1: Conventional Visual Odometry Pipeline]

Deep learning has become the first choice for tackling computer vision problems over the last decade, but its potential has not yet been fully exploited for the visual odometry problem. Most deep learning models deal with recognition and classification problems, for which Convolutional Neural Networks (CNNs) extract appearance information from images. The limitation of this appearance-based approach is that the resulting VO systems only function in the environments they were trained in; a VO system predicts pose and location correctly only where feature-based maps provide more accuracy than appearance-based maps [20]. A VO algorithm should instead model motion dynamics by reading changes across a sequence of images and drawing connections between frames, rather than inferring from a single image. This calls for sequential learning, which a CNN alone cannot handle. In this work, the problem is addressed by using a deep CNN to create feature maps, followed by an RNN that creates correspondences between the learned features over time. The proposed method learns features through the CNN and updates them according to previous states through the RNN. It is an end-to-end solution to the visual odometry problem and does not require any module from the classical VO pipeline (not even camera calibration).
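To make "integrating relative motion into global poses" concrete, the short sketch below (our illustration, not a component of the proposed system) chains per-frame relative transforms, represented as 4 x 4 homogeneous matrices, into a global trajectory:

```python
import numpy as np

def compose_trajectory(relative_poses):
    """Chain 4x4 relative transforms (frame k-1 -> frame k) into
    global camera poses, starting from the identity."""
    global_pose = np.eye(4)
    trajectory = [global_pose.copy()]
    for T in relative_poses:
        global_pose = global_pose @ T  # accumulate relative motion
        trajectory.append(global_pose.copy())
    return trajectory

# Toy example: five steps of 1 m forward motion along the z axis.
step = np.eye(4)
step[2, 3] = 1.0
print(compose_trajectory([step] * 5)[-1][:3, 3])  # -> [0. 0. 5.]
```

In the learned system described later, the relative transforms come from the network's pose estimates rather than from a geometric solver.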
2 Research Questions And Objectives / Research Challenges

One of the main challenges of conventional visual odometry methods is that they rely on sparse key points to track features across moving pixels. Traditional odometry methods map the results and place the vehicle on the map through a process called Simultaneous Localization and Mapping (SLAM), but slow algorithmic performance and high memory usage prevent such systems from meeting the requirements of extended data acquisition and processing. In this work, pixel movement is computed using optical flow instead of tracking sparse key points as in the conventional visual odometry pipeline. Optical flow measures pixel movement in both the horizontal and vertical directions. For a video stream with small incremental movements between camera updates, motion between frames is assumed to be linear, and a proportional relationship between pixel movement and the physical movement of the camera is established [26].

[Figure 2.1: The red boxes show data, and the blue boxes show operations on the data. At the highest level, consecutive video frames are used to calculate an optical flow image, which is then fed into a convolutional neural network that outputs the odometry information needed to create a map.]

The main contributions of this work are as follows:

• A fully converged deep learning approach to visual odometry calculation.
• An accurate visual odometry system capable of prediction in real time.
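The report does not state which optical flow algorithm is assumed, so the following sketch uses OpenCV's Farneback method as a stand-in to show how a dense flow field with horizontal and vertical components can be computed from two consecutive frames; the parameters are generic defaults rather than values from this work:

```python
import cv2

def dense_flow(prev_bgr, next_bgr):
    """Dense optical flow between two consecutive frames: returns an
    H x W x 2 array holding the horizontal and vertical displacement
    of every pixel, which can replace raw RGB input to a CNN."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Farneback parameters below are common defaults, not tuned
    # values from this report.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```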
3 Literature Review

Visual Odometry can be defined as the process of estimating a robot's translational and rotational motion with respect to a reference frame from observations of a sequence of images of its environment. It can be described as a particular case of a technique called Structure from Motion (SFM). SFM handles the problem of 3D reconstruction of the structure of the environment from a sequence of images, and can also estimate camera poses from sequentially ordered or unordered image sets [33]. However, in SFM the final refinement and global optimization step over both the camera poses and the structure is computationally expensive and usually performed off-line, whereas in visual odometry the camera poses are estimated in real time.

Visual Odometry techniques can be categorized into monocular [5] and stereo camera [23] methods. These can be further categorized into feature matching (matching features over a number of images) [23], feature tracking [9] (matching features in adjacent frames), and optical flow techniques, the categorization being based on the intensity of all pixels or of specific regions in sequential images.

Estimating a robot's ego-motion from a sequence of incoming frames began in the 1980s with Moravec [25] at Stanford University. Moravec introduced a stereo-vision-based technique in which a single camera slid along a rail in a move-and-stop fashion, enabling the robot to extract image features (corners) in the first image. The camera then slid along the rail in a direction perpendicular to the robot's motion, repeating the process until 9 images were captured. Features were matched between the 9 images using Normalized Cross Correlation (NCC) and used to reconstruct the 3D structure. The reconstructed 3D points were observed from different locations, and the data was aligned to calculate the camera motion transformation.

The scope of this work was later extended by Matthies and Shafer [24], who derived an error model using 3D Gaussian distributions instead of the scalar model used in the earlier method. Other stereo VO implementations followed. For example, in [29] a maximum-likelihood ego-motion error model was introduced for localizing a rover over long distances, and in [21] a method was described for rover localization that took raw image data instead of geometric data for motion estimation.

The term "Visual Odometry" was first coined by Nister et al. [28], named for its similarity to the concept of wheel odometry. They proposed methods for obtaining camera motion from visual input in both monocular and stereo systems, capable of estimating camera motion in the presence of outliers; an outlier rejection scheme using RANSAC [10] was proposed to remove the outliers.
[Figure 3.1: A block diagram showing the main components of (a) a VO and (b) a filter-based SLAM system]

Their method could also, for the first time, track features across all frames instead of matching features only between consecutive frame pairs as in earlier methods, which avoids feature drift during cross-correlation-based tracking. A RANSAC-based motion estimation using the 3D-to-2D reprojection error was proposed; 3D-to-2D reprojection errors were shown to give better estimates than 3D-to-3D errors [12].

Scaramuzza and Siegwart [32] performed another important study in visual odometry. They used a monocular omni-directional camera and fused the motion estimates obtained by two approaches. In the first approach, SIFT features were extracted and RANSAC was used for outlier removal [22]. In the second approach, appearance-based methods were used [6]. Appearance-based techniques can handle outdoor spaces efficiently and robustly, and they avoid error-prone feature extraction and matching.

A stereo VO system for outdoor navigation was proposed by Kaess et al. [16]. Sparse features were obtained by feature matching and separated in the flow into close features and far features; the reason for the separation is that small camera translations barely influence points that are far away. The far points were used to recover the rotation using a two-point RANSAC, and the close points were used to recover the translation using a one-point RANSAC.

In the next section another important technique for motion estimation, Visual SLAM, is discussed.

3.0.1 Visual SLAM

Another significant approach to robot localization and mapping is Visual SLAM. SLAM is the process by which a robot localizes itself in an unknown environment while incrementally constructing a map of its surroundings. SLAM has been researched over the last couple of decades with solutions using different sensors, including sonar sensors, IR sensors, and laser scanners.
Recently, research interest in VSLAM has grown because passive low-cost video sensors provide rich visual information compared to laser scanners. However, more sophisticated algorithms are required to process images and extract the necessary information. Thanks to advances in CPU and GPU technology, real-time implementation of the required complex algorithms is no longer difficult. A variety of solutions have been proposed, including monocular [8], stereo, omni-directional [18], time-of-flight (TOF) [34], and combined color and depth (RGB-D) cameras [12].

Davison et al. [8] used a single monocular camera and constructed a map by extracting sparse features of the environment with a Shi and Tomasi operator [35]; new features were matched to those already observed using a normalized sum-of-squared-difference correlation technique. The use of a single monocular camera meant that the absolute scale of structures could not be obtained, and the camera had to be calibrated. An Extended Kalman Filter (EKF) was utilized for state estimation, and only a limited number of features were extracted and tracked. Se et al. [13] proposed a vision-based method for mobile robot localization and mapping that used SIFT for feature extraction.

CV-SLAM (Ceiling Vision SLAM) is another technique, investigated by pointing the camera upwards towards the ceiling. Its advantages over frontal-view V-SLAM are fewer interactions with moving obstacles and steady observation of features. Jeong et al. [15] were the first to introduce cv-SLAM. They employed a single monocular camera and extracted corner features with the Harris corner detector [11]. After detection, a landmark orientation estimation technique combined with an NCC method was used to align the currently observed landmarks with previously stored ones.

There has been increasing interest in dense 3D reconstruction of the environment as compared to sparse 2D and 3D SLAM. Newcombe and Davison [27] obtained a dense 3D model of the environment in real time using a single monocular camera, but their method is limited to small environments. Henry et al. [14] implemented an RGB-D mapping approach utilizing an RGB-D camera (i.e. Microsoft Kinect) to obtain a dense 3D reconstruction of the environment while estimating the 6 degree-of-freedom (6DOF) camera pose. The method extracted Features from Accelerated Segment Test (FAST) [31] in each frame and matched them with features from the previous frame using the Calonder descriptors [3]. A RANSAC alignment step then obtained a subset of feature matches (inliers) corresponding to a consistent rigid transformation, which was used as an initial guess in the Iterative Closest Point (ICP) [4] algorithm to refine the transformation. Sparse Bundle Adjustment (SBA) was also applied to obtain a globally consistent map, and loop closure was detected by matching the current frame to previously collected key-frames.

Bachrach et al. [1] proposed a VO and SLAM system for unmanned air vehicles (UAVs) using an RGB-D camera.
Their method relied on extracting FAST features from sequential preprocessed images at different pyramid levels. This step was followed by an initial rotation estimation that limited the size of the search window for feature matching. Matching was performed by finding the mutual lowest sum-of-squared-difference (SSD) score between the descriptor vectors; a greedy algorithm refined the matches and produced the inlier set, which was then used to estimate the motion between frames. To reduce drift in the motion estimates, they suggested matching the current frame to a selected key-frame instead of matching consecutive frames.

The above discussion covered various approaches to the V-SLAM problem. Earlier methods generally focused on sparse 2D and 3D SLAM due to the limited computational resources available; more recently, interest has shifted towards dense 3D reconstruction of the environment thanks to technological advances and the availability of efficient optimization methods. All these studies found that traditional machine learning techniques are inefficient for big or highly non-linear, high-dimensional data such as RGB images. Deep learning, which automatically learns suitable feature representations from large-scale datasets, provides an alternative solution to the VO problem.

3.0.2 Deep Learning based Methods

Deep learning technologies have achieved significant results in localisation-related applications over the past decade. CNNs are mainly utilised for appearance-based place recognition [37]. K. Konda and R. Memisevic [19] first realised DL-based VO through synchrony detection between image sequences and features. A softmax function was applied after estimating depth and velocity to predict the discretised changes of direction and velocity. This method provides a good estimate from DL-based stereo VO, but it formulates VO as a classification problem rather than as pose regression.

Kendall et al. [17] proposed a solution for camera relocalisation from a single image. Their method fine-tuned CNNs on images of a specific scene, labelling the images by Structure from Motion (SFM), a process that is time-consuming and labour-intensive for large-scale scenarios. In their implementation, a trained CNN model serves as an appearance "map" of the scene. The problem with this approach is that the model needs to be re-trained or fine-tuned for each new environment, so it is not suitable for widespread use; this remains one of the biggest hindrances to applying deep learning to VO. To overcome this problem, an alternative method was proposed in which the CNNs were provided with dense optical flow instead of RGB images [7]. Three different CNN architectures were developed on optical flow inputs to learn appropriate features for VO. These methods led to a robust VO capable of handling even blurred and under-exposed images. However, the proposed CNNs require pre-processed dense optical flow as input, which cannot benefit from end-to-end learning, as is true to the nature of deep learning models.
The above methods are also inappropriate for real-time applications. Moreover, CNNs alone are incapable of modelling sequential information, and none of the previous works considers image sequences or videos for learning. In this work, we tackle this by utilizing RNNs, which are capable of sequential learning.
4 Methodology

Our approach to this problem utilizes a deep CNN-RNN framework composed of CNN-based feature extraction and RNN-based sequential modelling. Since a CNN is incapable of handling sequential information, the RNN layers are added to process the features already extracted by the CNN.

[Figure 4.1: Proposed architecture of the CNN-RNN based monocular VO system. The dimensions of the tensors shown are examples based on the image dimensions of the KITTI dataset; the CNN ones vary according to the size of the input image. Camera image credit: KITTI dataset.]

4.0.1 Architecture of the CNN-RNN network

Popular deep neural network architectures such as VGGNet [36] and GoogLeNet, which were originally developed for computer vision tasks, produce good performance because they are trained to learn knowledge from appearance and image context. But the fundamentals of visual odometry are rooted in epipolar geometry, not in appearance, so simply adopting the current popular DNN architectures for the VO problem is impractical. A framework is needed that can learn geometric feature representations to address VO and other geometric problems. At the same time, connections among consecutive image frames should be derived, since a VO system evolves over time and produces results based on image sequences acquired during movement. The proposed model takes both requirements into consideration.

The architecture of the proposed visual odometry system is shown in Fig. 4.1. It takes a video clip or a monocular image sequence as input. At each time step, the RGB image frame is pre-processed by subtracting the mean RGB values of the training set and resizing to a size that is a multiple of 64. Two consecutive images from the KITTI dataset are stacked together to form a tensor for the deep CNN-RNN network, from which the model learns to extract motion information and estimate poses. The image tensor is fed into the CNN to produce effective features for monocular VO, and these feature maps are then passed through an RNN for sequential learning. Each image pair yields a pose estimate at each time step, and as more images are captured, the VO system evolves over time and estimates new poses. The advantage of this architecture is that it allows simultaneous feature extraction and sequential modelling of VO through the combination of CNN and RNN and, beyond the simple normalisation above, it does not require any preprocessing.
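A minimal sketch of this input preparation step; the mean RGB values are placeholders (the report does not list them), and 1280 x 384 is chosen as a KITTI-like resolution whose sides are multiples of 64:

```python
import numpy as np
import cv2

# Placeholder mean: the report subtracts the training-set mean RGB
# but does not list the actual values.
MEAN_RGB = np.array([90.0, 95.0, 93.0], dtype=np.float32)

def preprocess_pair(img1, img2, size=(1280, 384)):
    """Normalise two consecutive RGB frames and stack them along the
    channel axis into one (H, W, 6) tensor, with width and height
    chosen as multiples of 64."""
    frames = []
    for img in (img1, img2):
        resized = cv2.resize(img, size).astype(np.float32)
        frames.append(resized - MEAN_RGB)  # assumes RGB channel order
    return np.concatenate(frames, axis=-1)
```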
4.0.2 Feature extraction method of Convolutional Neural Network

To learn effective feature maps from the input tensors, a CNN is developed to perform feature extraction on two consecutive monocular RGB images of the KITTI dataset. The learned features are geometric rather than appearance-based, because the VO system must be deployable in unknown environments. The configuration of the CNN is outlined in Fig. 4.2.

[Figure 4.2: Configuration of the CNN]

The proposed CNN architecture has 9 convolutional layers, each followed by a rectified linear unit (ReLU) activation except for the Conv6 layer, giving 17 layers in total. The convolution filter sizes gradually reduce from 7 × 7 to 5 × 5 to 3 × 3 to capture small, interesting features. Zero padding is used, and the number of channels increases at each layer so that the network can learn a variety of features.
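Since Figure 4.2 is not reproduced here, the channel sizes in the following PyTorch sketch are assumptions borrowed from the FlowNet-style layout commonly used for such networks; only the nine-layer count, the shrinking kernel sizes, the zero padding, and the missing ReLU after Conv6 are taken from the text:

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, stride, relu=True):
    """Conv block with zero padding; the ReLU is optional so the
    last layer (Conv6) can be left linear, as in the report."""
    layers = [nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class FeatureCNN(nn.Module):
    """Nine-layer CNN over a stacked image pair (6 input channels).
    Channel counts are assumptions, not values from Figure 4.2."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv(6,    64, 7, 2),              # Conv1
            conv(64,  128, 5, 2),              # Conv2
            conv(128, 256, 5, 2),              # Conv3
            conv(256, 256, 3, 1),              # Conv3_1
            conv(256, 512, 3, 2),              # Conv4
            conv(512, 512, 3, 1),              # Conv4_1
            conv(512, 512, 3, 2),              # Conv5
            conv(512, 512, 3, 1),              # Conv5_1
            conv(512, 1024, 3, 2, relu=False)  # Conv6, no ReLU
        )

    def forward(self, x):
        return self.net(x)
```

This gives 9 convolutions and 8 ReLUs, matching the 17 layers counted in the text.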
4.0.3 Sequential Modelling method of Recurrent Neural Network

After the CNN layers have extracted the feature maps, a deep RNN is employed to perform sequential learning, i.e., to model the dynamics and relations among the sequence of feature maps prepared by the CNN. The VO problem involves a temporal model (the motion model) and sequential data (the image sequence), so an RNN, which excels at modelling sequential data, is employed for this task: estimating the pose of the current image frame benefits from the information encapsulated in previous frames. However, an RNN is not suitable for directly learning sequential representations from high-dimensional raw data such as images. Therefore, the proposed system adopts the CNN-RNN architecture, with the CNN features as the input of the RNN.

[Figure 4.3: Folded and unfolded LSTM structure]

An RNN maintains a memory of its hidden states over time and has feedback loops among states, which make the current hidden state a function of the previous ones. Hence, the RNN can relate the current input to the previous states in the sequence. Given a convolutional feature x_k at time k, the RNN updates at time step k as
h_k = \mathcal{H}(W_{xh} x_k + W_{hh} h_{k-1} + b_h)
y_k = W_{hy} h_k + b_y

where h_k is the hidden state and y_k the output at time k, the W terms denote the weight matrices of the hidden states and outputs, the b terms are bias vectors, and \mathcal{H} is an element-wise nonlinear activation function.

Long Short-Term Memory (LSTM) is capable of learning long-term dependencies through memory gates and units [38], so it is used to find correlations among images taken over the long trajectories of the KITTI dataset, as visual odometry requires. An LSTM determines which previous hidden states are discarded or retained when updating the current state; this is how the RNN can predict the motion during pose estimation. The folded LSTM and its unfolded version over time, together with the internal structure of an LSTM unit, are shown in Fig. 4.3. After unfolding, each LSTM unit corresponds to one time step. Given the input x_k at time k and the hidden state h_{k-1} and memory cell c_{k-1} of the previous LSTM unit, the LSTM updates at time step k according to

i_k = \sigma(W_{xi} x_k + W_{hi} h_{k-1} + b_i)
f_k = \sigma(W_{xf} x_k + W_{hf} h_{k-1} + b_f)
g_k = \tanh(W_{xg} x_k + W_{hg} h_{k-1} + b_g)
c_k = f_k \odot c_{k-1} + i_k \odot g_k
o_k = \sigma(W_{xo} x_k + W_{ho} h_{k-1} + b_o)
h_k = o_k \odot \tanh(c_k)

where \odot denotes the element-wise product of two vectors, \sigma the sigmoid non-linearity, \tanh the hyperbolic tangent non-linearity, the W terms the corresponding weight matrices, the b terms the bias vectors, and i_k, f_k, g_k, c_k and o_k are the input gate, forget gate, input modulation gate, memory cell and output gate at time k, respectively.

This LSTM structure can handle the long-term dependencies needed to analyse the long sequences of the KITTI dataset correctly, but the architecture still needs depth to learn high-level representations and to model the complex dynamics between frames. To address this, the deep RNN is constructed by stacking two LSTM layers, with the hidden states of one LSTM serving as the input of the other, as in Fig. 4.1. Each LSTM layer has 1000 hidden states. The deep RNN outputs a pose estimate at each time step based on the visual features generated by the CNN; new poses are predicted over time as the camera moves and more images are captured.
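A compact PyTorch sketch of this recurrent part: two stacked LSTM layers with 1000 hidden units each, regressing a 6-DoF pose per time step. The flattened CNN feature dimension is an assumption, since it depends on the input image size:

```python
import torch.nn as nn

class PoseRNN(nn.Module):
    """Two stacked LSTM layers (1000 hidden units each) mapping the
    flattened CNN features of every image pair to a 6-DoF pose
    (3 translations + 3 Euler angles)."""
    def __init__(self, feature_dim=4096, hidden=1000):
        super().__init__()
        self.rnn = nn.LSTM(feature_dim, hidden, num_layers=2,
                           batch_first=True)
        self.head = nn.Linear(hidden, 6)

    def forward(self, features, state=None):
        # features: (batch, time, feature_dim). The hidden state is
        # carried across time steps so each pose estimate depends on
        # the previous ones.
        out, state = self.rnn(features, state)
        return self.head(out), state
```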
4.0.4 Cost Function and Optimisation

The proposed CNN-RNN based VO system computes the conditional probability of the poses Y_t = (y_1, \ldots, y_t) given a sequence of monocular RGB images X_t = (x_1, \ldots, x_t) up to time t:

p(Y_t \mid X_t) = p(y_1, \ldots, y_t \mid x_1, \ldots, x_t)

To estimate the poses correctly, the DNN seeks the optimal parameters \theta^* that maximise this probability:

\theta^* = \arg\max_{\theta} p(Y_t \mid X_t; \theta)

To learn the parameters \theta^*, the Euclidean distance between the ground-truth pose (p_k, \varphi_k) at time k and its estimate (\hat{p}_k, \hat{\varphi}_k) is minimised. The loss function is composed of the Mean Square Error (MSE) of all positions p and orientations \varphi:

\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \lVert \hat{p}_k - p_k \rVert_2^2 + \kappa \lVert \hat{\varphi}_k - \varphi_k \rVert_2^2

where \lVert \cdot \rVert_2 is the 2-norm, \kappa (100 in the experiments) is a scale factor balancing the weights of positions and orientations, and N is the number of samples. The orientation \varphi is represented by Euler angles rather than a quaternion, since a quaternion is subject to an extra unit constraint that hinders the optimisation problem of DL; in practice we also find that using quaternions degrades the orientation estimate to some extent.
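A direct translation of this loss into PyTorch, with \kappa = 100 as stated above; the layout of the 6-DoF pose vector (three positions followed by three Euler angles) is our assumption for illustration:

```python
import torch

def pose_loss(pred, target, kappa=100.0):
    """MSE over positions plus a kappa-scaled MSE over Euler-angle
    orientations. pred/target: (batch, time, 6) tensors, where
    [..., :3] holds positions and [..., 3:] holds orientations
    (an assumed layout, not specified in the report)."""
    pos_err = torch.sum((pred[..., :3] - target[..., :3]) ** 2, dim=-1)
    ori_err = torch.sum((pred[..., 3:] - target[..., 3:]) ** 2, dim=-1)
    return torch.mean(pos_err + kappa * ori_err)
```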
5 Results

5.0.1 Dataset

The KITTI VO/SLAM benchmark has 22 sequences of images, of which 11 (Sequences 00-10) come with ground truth; the remaining sequences (Sequences 11-21) provide only raw sensor data. Our model was trained on image sequences 00, 02, 08 and 09 of the KITTI dataset, as these sequences are relatively long. The trajectories were segmented into different lengths to generate more data for training, producing 7410 samples in total. The trained model was tested on Sequences 04, 05, 06, 07 and 10 for evaluation.

The KITTI VO benchmark provides translational and rotational errors for all possible subsequences of length (100, ..., 800) metres, together with an evaluation table that ranks methods by the average of those values, where errors are measured in percent (for translation) and in degrees per metre (for rotation). The advantage of this dataset is that our model is trained over all possible subsequence lengths and hence can predict location and orientation more accurately in unknown environments; it also makes it easier to compare the accuracy of our model's predictions with other state-of-the-art techniques.

5.0.2 Training and Testing

This section describes the training and testing methodology adopted. So that the model works across different environments, the relatively long Sequences 00, 01, 02, 05, 08 and 09 were used for training. These sequences cover many different trajectories and navigation patterns of the car, which, together with their length, lets the model train on diverse scenarios. Sequences 00, 01, 02, 05, 08 and 09 contain 4541, 1101, 4661, 2761, 4071 and 1591 images respectively. We evaluated the trained model on the test sequences 04, 05, 07, 09 and 10 and plotted the predictions against the ground truth to check the accuracy of the trained model. The estimated VO trajectories for these tests are given in Figs. 5.1, 5.2, 5.3, 5.4 and 5.5, after the data-preparation sketch below.
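Ground-truth targets for such training are typically derived from the KITTI pose files, in which each line stores a flattened 3 x 4 pose matrix. A minimal sketch (our illustration, not code from this work) of loading the poses and converting them into per-frame relative motions:

```python
import numpy as np

def load_kitti_poses(path):
    """Read a KITTI ground-truth file: one flattened 3x4 pose matrix
    per line, extended here to 4x4 homogeneous form."""
    poses = []
    with open(path) as f:
        for line in f:
            T = np.eye(4)
            T[:3, :] = np.array(line.split(), dtype=np.float64).reshape(3, 4)
            poses.append(T)
    return poses

def relative_motions(poses):
    """Per-frame relative transforms used as training targets."""
    return [np.linalg.inv(a) @ b for a, b in zip(poses, poses[1:])]
```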
[Figure 5.1: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.2: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.3: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.4: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.5: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]
6 Evaluation

We can conclude that our model produces relatively accurate and consistent trajectories with respect to the ground truth. This demonstrates that scale can be estimated better in an end-to-end fashion with our model than with prior methods that estimate it from factors such as camera height. Note that our VO model performs no scale estimation or post-alignment to the ground truth to obtain the absolute poses: the scale is maintained by the CNN-RNN network itself and is learned during end-to-end training. For conventional monocular VO methods, recovering accurate and robust scale is difficult, so this suggests an appealing advantage of the DL-based VO approach.

The detailed performance of the algorithms on the testing sequences is summarised in Figure 6.1. The KITTI Vision Benchmark Suite's evaluation table was used to compare the performance of our model with two state-of-the-art visual odometry algorithms: the Library for Visual Odometry 2, stereo version with active matching (VISO2-S) [?], and its monocular version (VISO2-M) [?].

[Figure 6.1: Performance of our Monocular VO model compared to VISO2-M and VISO2-S]

The comparison indicates that our deep learning model achieves more robust results than the monocular VISO2. Here t_rel denotes the average translational RMSE drift in percent over trajectory lengths of 100 m to 800 m, and r_rel denotes the average rotational RMSE drift (degrees per 100 m) over the same lengths.
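For intuition, the sketch below computes a simplified, position-only version of such a drift metric; it is a reduced approximation of the official KITTI evaluation (which compares full relative transforms over several subsequence lengths), not the benchmark code itself:

```python
import numpy as np

def translational_drift_percent(gt_pos, pred_pos, length=100.0):
    """Position-only approximation of translational drift: for every
    subsequence of roughly `length` metres of ground-truth travel,
    compare the predicted displacement to the true displacement and
    express the error as a percentage of the distance travelled."""
    steps = np.linalg.norm(np.diff(gt_pos, axis=0), axis=1)
    dist = np.concatenate(([0.0], np.cumsum(steps)))
    errors = []
    for i in range(len(gt_pos)):
        # First index whose travelled distance exceeds dist[i] + length.
        j = int(np.searchsorted(dist, dist[i] + length))
        if j >= len(gt_pos):
            break
        gt_disp = gt_pos[j] - gt_pos[i]
        pred_disp = pred_pos[j] - pred_pos[i]
        errors.append(np.linalg.norm(pred_disp - gt_disp)
                      / (dist[j] - dist[i]) * 100.0)
    return float(np.mean(errors)) if errors else 0.0
```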
Based on the findings of our work, as evaluated against the KITTI Vision Benchmark Suite, we conclude that this work presents a monocular VO algorithm based on deep learning. Utilizing the power of deep recurrent neural networks, we achieve simultaneous representation learning and sequential modelling of monocular VO by combining CNNs with RNNs. Our approach does not depend on any module of the conventional VO pipeline (not even camera calibration) for pose estimation, and it is trained in an end-to-end manner. The KITTI VO benchmark further verifies that it can produce accurate VO results with precise scale and can work well in completely new scenarios.

Our method produces promising results in this area, but we stress that it is not expected to replace the classic geometry-based approach. For now, it can serve as a viable complement: incorporating geometry with the representations, knowledge and models learnt by DNNs can further improve VO in terms of accuracy and, more importantly, robustness. More research on combining classical geometry-based and deep learning-based approaches is required.
Bibliography

[1] Bachrach, A., Prentice, S., He, R., Henry, P., Huang, A.S., Krainin, M., Maturana, D., Fox, D., Roy, N.: Estimation, planning, and mapping for autonomous flight using an RGB-D camera in GPS-denied environments (2012), https://doi.org/10.1177/0278364912455256
[2] Bertelson, P., Gelder, B., Spence, C., Driver, J.: The Psychology of Multimodal Perception, pp. 141-177 (04 2004)
[3] Besl, P., McKay, N.D.: A method for registration of 3-D shapes (1992)
[4] Besl, P., McKay, N.D.: A method for registration of 3-D shapes (1992)
[5] Campbell, J., Sukthankar, R., Nourbakhsh, I.R., Pahwa, A.: A robust visual odometry and precipice detection system using consumer-grade monocular vision (2005), http://dblp.uni-trier.de/db/conf/icra/icra2005.html#CampbellSNP05
[6] Comport, A., Malis, E., Rives, P.: Real-time quadrifocal visual odometry (2010)
[7] Costante, G., Mancini, M., Valigi, P., Ciarfuglia, T.A.: Exploring representation learning with CNNs for frame-to-frame ego-motion estimation (2016)
[8] Davison, A.J.: Real-time simultaneous localisation and mapping with a single camera (2003)
[9] Dornhege, C., Kleiner, A.: Visual odometry for tracked vehicles (01 2006)
[10] Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography (1981), http://publication.wilsonwong.me/load.php?id=233282275
[11] Harris, C., Stephens, M.: A combined corner and edge detector (1988)
[12] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[13] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[14] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[15] Jeong, W., Lee, K.M.: CV-SLAM: a new ceiling vision-based SLAM technique (2005)
[16] Kaess, M., Ni, K., Dellaert, F.: Flow separation for fast and robust stereo odometry (2009)
[17] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization (09 2015)
[18] Kim, S., Oh, S.Y.: SLAM in indoor environments using omni-directional vertical and horizontal line features (01 2008)
[19] Konda, K., Memisevic, R.: Learning visual odometry with a convolutional network (2015)
[20] Krombach, N., Droeschel, D., Houben, S., Behnke, S.: Feature-based visual odometry prior for real-time semi-dense stereo SLAM (2018), http://arxiv.org/abs/1810.07768
[21] Lacroix, S., Mallet, A., Chatila, R., Gallo, L.: Rover self localization in planetary-like environments (Aug 1999)
[22] Lowe, D.: Distinctive image features from scale-invariant keypoints (11 2004)
[23] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[24] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[25] Moravec, H.P.: Obstacle avoidance and navigation in the real world by a seeing robot rover (Jun 2018), https://kilthub.cmu.edu/articles/journal_contribution/Obstacle_avoidance_and_navigation_in_the_real_world_by_a_seeing_robot_rover/6557033/1
[26] Muller, P., Savakis, A.: Flowdometry: An optical flow and deep learning based approach to visual odometry (03 2017)
[27] Newcombe, R.A., Davison, A.J.: Live dense reconstruction with a single moving camera
[28] Nister, D., Naroditsky, O., Bergen, J.: Visual odometry (2004)
[29] Olson, C., Matthies, L., Schoppers, M., Maimone, M.: Rover navigation using stereo ego-motion (06 2003)
[30] Pillai, S., Leonard, J.J.: Towards visual ego-motion learning in robots (2017), http://arxiv.org/abs/1705.10279
[31] Rosten, E., Drummond, T.: Machine learning for high-speed corner detection (2006)
[32] Scaramuzza, D., Siegwart, R.: Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles (11 2008)
[33] Scaramuzza, D., Fraundorfer, F.: Visual odometry: Part I - the first 30 years and fundamentals (03 2011)
[34] Shi, J., Tomasi, C.: Good features to track (1994)
[35] Shi, J., Tomasi, C.: Good features to track (1994)
[36] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (09 2014)
[37] Sünderhauf, N., Shirazi, S., Jacobson, A., Pepperell, E., Dayoub, F., Upcroft, B., Milford, M.: Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free (07 2015)
[38] Zaremba, W., Sutskever, I.: Learning to execute. CoRR abs/1410.4615 (2014), http://arxiv.org/abs/1410.4615