Estimation of Camera Trajectory with
Visual Odometry
Research Internship
Submitted in Fulfilment of the
Requirements for the Academic Degree
M.Sc.
Dept. of Computer Science
Chair of Computer Engineering
Submitted by: Anutam Majumder
Student ID: 456540
Date: 07.05.2021
Supervising tutor: Shadi Saleh
Batbayar Batseren
Contents

Contents
List of Figures
1 Introduction
2 Research Questions and Objectives / Research Challenges
3 Literature Review
3.0.1 Visual SLAM
3.0.2 Deep Learning based Methods
4 Methodology
4.0.1 Architecture of the CNN-RNN network
4.0.2 Feature extraction method of Convolutional Neural Network
4.0.3 Sequential Modelling method of Recurrent Neural Network
4.0.4 Cost Function and Optimisation
5 Results
5.0.1 Dataset
5.0.2 Training and Testing
6 Evaluation
Bibliography
List of Figures

1.1 Conventional Visual Odometry Pipeline
2.1 The red boxes show data, and the blue boxes show operations on the data. At the highest level, consecutive video frames are used to calculate an optical flow image, which is then fed into a convolutional neural network that outputs the odometry information needed to create a map.
3.1 A block diagram showing the main components of (a) a VO and (b) a filter-based SLAM system
4.1 Proposed architecture of the CNN-RNN based monocular VO system. The dimensions of the tensors shown are examples based on the image dimensions of the KITTI dataset; the CNN dimensions vary according to the size of the input image. Camera image credit: KITTI dataset.
4.2 Configuration of the CNN
4.3 Folded and unfolded LSTM structure
5.1 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.2 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.3 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.4 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
5.5 Predicted path (in red) of image sequence 05 plotted against the ground truth value (in green) of the sequence
6.1 Performance of our Monocular VO model compared to VISO2-M and VISO2-S
1 Introduction
One of the fundamental needs of humans and mobile agents is to localize and map
themselves with respect to the environment they operate in. Humans are able to
perceive their location and pose via multi-modal sensory perception [2] in a complex
three-dimensional space. This ability to perceive self-motion and the surroundings
plays a vital role in developing human cognition. Similarly, artificially intelligent
agents or robots should be able to perceive the environment and predict their pose
and location using on-board sensors. Visual Odometry (VO) has emerged as one of
the most essential techniques for pose estimation and robot localisation. It estimates
the ego-motion of a camera by integrating the relative motion between images into
global poses [30].
Figure 1.1: Conventional Visual Odometry Pipeline
Deep learning has emerged as the first choice for tackling computer vision problems
over the last decade, but its potential has not yet been fully exploited for the visual
odometry problem. Most deep learning models deal only with recognition and
classification, which enables Convolutional Neural Networks (CNNs) to extract
appearance information from images. The limitation of such appearance-based
implementations is that these VO systems function only in the environments they
were trained in. A visual odometry system predicts the pose and localization correctly
when feature-based maps provide more accuracy than appearance-based maps [20].
A VO algorithm should be capable of modelling motion dynamics by reading changes
across a sequence of images and drawing connections between them, rather than
operating on a single image. This implies a requirement for sequential learning, which
a CNN alone is incapable of handling. In this work, this problem is addressed by
using a deep CNN to create feature maps, followed by an RNN that creates
correspondences between the learned features. The proposed method is capable of
learning features through the CNN and updating them according to the previous
states through the RNN. It is an end-to-end solution to the visual odometry problem
and does not require any module from the classical VO pipeline (not even camera
calibration).
2 Research Questions And
Objectives / Research Challenges
One of the main challenging aspects of conventional visual odometry methods is
that they utilize sparse key points to track features across moving pixels. Traditional
odometry methods map the results and place the vehicle on the map through a
process called Simultaneous Localization and Mapping (SLAM). However, slow
algorithmic performance and high memory usage prevent these systems from meeting
the requirements of extended data acquisition and processing. In this work, pixel
movements are calculated using optical flow instead of tracking sparse key points as
in the conventional visual odometry pipeline. Optical flow computes pixel movements
in both the horizontal and vertical directions. Linear movement between frames is
assumed for a video stream with small incremental movements as the camera updates,
and a proportional relationship between pixel movements and the physical movement
of the camera is established [26].
Figure 2.1: The red boxes show data, and the blue boxes show operations on the
data. At the highest level, consecutive video frames are used to calculate
an optical flow image, which is then fed into a convolutional neural
network that outputs the odometry information needed to create a map.
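To make the optical flow step concrete, the dense flow between two consecutive frames can be computed, for example, with OpenCV's Farneback method. The following is a minimal sketch, not code from the original work; the function name and parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def dense_flow(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between two consecutive grayscale frames.

    Returns an (H, W, 2) array of per-pixel horizontal and vertical
    displacements, usable as a 2-channel input image for a CNN.
    """
    # Positional args: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags (values here are common defaults).
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```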
The main contributions of this work are as follows:
• A fully converged deep learning approach to Visual Odometry calculation.
• An accurate Visual Odometry system capable of prediction in real time.
3 Literature Review
Visual Odometry can be defined as the process of estimating a robot’s translational
and rotational motion with respect to a reference frame from observations of a
sequence of images of its environment. Visual Odometry can be described as a
particular case of a technique called structure from motion (SFM). SFM handles the
problem of 3D reconstruction of the structure of the environment from a sequence of
images. It can also predict the camera poses from sequentially ordered or unordered
image sets [33]. However, in SFM the final refinement and global optimization step
of both the camera poses and the structure is computationally expensive and is
usually performed off-line, whereas in visual odometry the camera poses are estimated
in real time. Visual Odometry techniques can be categorized into Monocular [5]
and Stereo Camera [23] methods. These methods can be further categorized into
feature matching (matching features over a number of images) [23], feature tracking
[9] (matching features in adjacent frames) and optical flow techniques. These
categorizations are based upon the intensity of all pixels or of specific regions in
sequential images.
The technique of estimating a robot’s ego-motion by observing a sequence of incoming
frames started in the 1980s with Moravec [25] at Stanford University. Moravec
introduced a stereo vision based technique in which a single camera slid on a rail
in a move-and-stop fashion, enabling the robot to extract image features (corners)
in the first image. The camera then slid on the rail in a direction perpendicular to
the robot’s motion, repeating the process until 9 images were captured. Features
were matched between the 9 images using Normalized Cross Correlation (NCC)
and used to reconstruct the 3D structure. The reconstructed 3D points were then
observed from different locations, and the data was aligned to calculate the camera
motion transformation.
The scope of the above work was later extended by Matthies and Shafer [24]. They
derived an error model using 3D Gaussian distributions instead of using the scalar
model used in the earlier method. Other stereo VO implementations followed in
later studies. For example, in [29] maximum likelihood ego-motion
for modeling the error was introduced for localization of a rover over long distances.
In [21] a method was described for rover localization that took in raw image data
instead of geometric data for motion estimation.
The term “Visual Odometry” was first coined by Nister et al. [28]; its similarity
to the concept of wheel odometry led to the naming. Methods for obtaining camera
motion from visual input in both monocular and stereo systems were proposed.
These methods could estimate camera motion in the presence of outliers, and an
outlier rejection scheme using RANSAC [10] was proposed to remove them.
Figure 3.1: A block diagram showing the main components of (a) a VO and (b) a
filter-based SLAM system
The above method was also the first to successfully track features across all frames
instead of matching features only between two consecutive frames, as in earlier
methods, thereby avoiding feature drift during cross-correlation based tracking. A
RANSAC-based motion estimation using the 3D-to-2D reprojection error was also
proposed; 3D-to-2D reprojection errors were shown to give better estimates than
3D-to-3D errors [12].
Scaramuzza and Siegwart carried out another important study in visual odometry
[32]. In their research, a monocular omni-directional camera was used and the
motion estimates obtained by two approaches were fused. In the first approach, SIFT
features were extracted and RANSAC was used for outlier removal [22]. In the second
approach, appearance based methods were used [6]. Appearance based techniques
can handle outdoor spaces efficiently and robustly, and they avoid error-prone feature
extraction and matching.
A stereo VO system for outdoor navigation was proposed by Kaess et al. [16]. Sparse
features were obtained by feature matching and were segregated, based on flow, into
close features and far features. The reason for the separation is that small changes in
camera translation do not influence points that are far away. The far points were used
to recover the rotation transformation using a two-point RANSAC, while the close
points were used to recover the translation using a one-point RANSAC.
In the next section, another important technique for motion estimation, Visual
SLAM, is discussed.
3.0.1 Visual SLAM
Another significant approach to robot localization and mapping is Visual SLAM.
SLAM is the process by which a robot localizes itself in an unknown environment
while constructing a map of its surroundings. SLAM has been researched over the
last couple of decades with different solutions using different sensors, including sonar
sensors, IR sensors and laser scanners.
Recently, research interest in V-SLAM has grown because passive low-cost video
sensors provide rich visual information compared to laser scanners. However, more
sophisticated algorithms are required for processing images and extracting the
necessary information. Thanks to advances in CPU and GPU technologies, real-time
implementation of the required complex algorithms is no longer difficult. A variety
of solutions including monocular [8], stereo, omni-directional [18], time-of-flight
(TOF) [34] and combined color and depth (RGB-D) [12] cameras have been proposed.
Davison et al. [8] used a single monocular camera and constructed a map by
extracting sparse features of the environment with the Shi and Tomasi operator [35].
New features were matched to those already observed using a normalized sum-of-
squared-difference correlation technique. The use of a single monocular camera meant
that the absolute scale of structures could not be obtained, and the camera had to be
calibrated. An Extended Kalman Filter (EKF) was utilized for state estimation, and
only a limited number of features were extracted and tracked.
Se et al. [13] proposed a vision based method for mobile robot localization and
mapping that used SIFT for feature extraction.
CV-SLAM (Ceiling Vision SLAM) is another technique, in which the camera points
upwards towards the ceiling. The advantages of CV-SLAM compared to frontal-view
V-SLAM are fewer interactions with moving obstacles and steady observation of
features. Jeong et al. [15] were the first to introduce CV-SLAM. They employed a
single monocular camera and extracted corner features with the Harris corner detector
[11]. A landmark orientation estimation technique was then applied to align the
currently observed landmarks with previously stored ones, using an NCC method.
There has been increasing interest in dense 3D reconstruction of the environment
as compared to sparse 2D and 3D SLAM. Newcombe and Davison [27] obtained a
dense 3D model of the environment in real time using a single monocular camera,
but their method is limited to small environments. Henry et al. [14] implemented
an RGB-D mapping approach utilizing an RGB-D camera (i.e. the Microsoft Kinect)
to obtain a dense 3D reconstruction of the environment and estimate the
6-degree-of-freedom (6DOF) camera pose. The method extracted Features from
Accelerated Segment Test (FAST) [31] features in each frame and matched them
with features from the previous frame using Calonder descriptors [3]. A RANSAC
alignment step then obtained a subset of feature matches (inliers) corresponding to
a consistent rigid transformation. This transformation was used as an initial guess
in the Iterative Closest Point (ICP) [4] algorithm, which refined the transformation
obtained by RANSAC. Sparse Bundle Adjustment (SBA) was also applied to obtain
a globally consistent map, and loop closure was detected by matching the current
frame to previously collected key-frames.
Bachrach et al. [1] proposed a VO and SLAM system for unmanned aerial vehicles
(UAVs) utilizing an RGB-D camera. The method relied on extracting FAST features
from sequential preprocessed images at different pyramid levels. This step was
followed by an initial rotation estimate that limited the size of the search window for
feature matching. The matching was performed by finding the mutually lowest sum
of squared difference (SSD) score between the descriptor vectors. A greedy algorithm
refined the matches and obtained the inlier set, which was then used to estimate the
motion between frames. To reduce drift in the motion estimates, they suggested
matching the current frame to a selected key-frame instead of matching consecutive
frames.
In the above section, we discussed various approaches to solving the V-SLAM
problem. Earlier methods generally focused on sparse 2D and 3D SLAM due to the
limitations of available computational resources. More recently, interest has shifted
towards dense 3D reconstruction of the environment thanks to technological advances
and the availability of efficient optimization methods.
These studies indicate that traditional machine learning techniques are inefficient
when solving for big, highly non-linear, high-dimensional data, e.g., RGB images.
Deep learning techniques, which automatically learn suitable feature representations
from large-scale datasets, provide an alternative solution to the VO problem.
3.0.2 Deep Learning based Methods
Deep learning technologies have achieved significant results in localisation related
applications in recent years. CNNs are mainly utilised for appearance based place
recognition [37]. K. Konda and R. Memisevic [19] first realised DL based VO through
synchrony detection between image sequences and features. A softmax function was
utilized after estimating depth and velocity to predict the discretised changes of
direction and velocity. This method provides a very good estimate for DL based
stereo VO, but it formulates VO as a classification problem rather than as pose
regression.
Kendall et al. [17] proposed a solution for camera relocalisation from a single image.
Their method fine-tuned CNNs on images of a specific scene, labelling these images
by Structure from Motion (SFM). This process is time-consuming and labour
intensive for large-scale scenarios.
In their implementation, a trained CNN model serves as an appearance “map” of
the scene. The problem with this approach is that the model needs to be re-trained
or fine-tuned for each new environment, which makes it unsuitable for widespread
use and is one of the biggest hindrances to applying deep learning techniques to VO.
To overcome this problem, an alternative method was proposed in which the CNNs
were provided with dense optical flow instead of RGB images [7]. Three different
CNN architectures were developed using optical flow networks to learn appropriate
features for VO. These methods led to a robust VO capable of handling even blurred
and under-exposed images. However, the proposed CNNs require pre-processed dense
optical flow as input and therefore cannot benefit from end-to-end learning, as is
true to the nature of deep learning models. The above methods are also inappropriate
for real-time applications.
CNNs alone are incapable of modelling sequential information. Therefore, none of
the previous works considers image sequences or videos for learning. In this work,
we tackle this by utilizing RNNs, which are capable of sequential learning.
4 Methodology
Our approach to solving the above problem utilizes a deep CNN-RNN framework,
composed of CNN based feature extraction and RNN based sequential modelling.
Since a CNN is incapable of handling sequential information, the RNN layers are
added to process the features already extracted by the CNN.
Figure 4.1: Proposed architecture of the CNN-RNN based monocular VO system.
The dimensions of the tensors shown are examples based on the image
dimensions of the KITTI dataset; the CNN dimensions vary according
to the size of the input image. Camera image credit: KITTI dataset.
4.0.1 Architecture of the CNN-RNN network
Some popular deep neural network architectures, such as VGGNet [36] and
GoogLeNet, which were originally developed for computer vision tasks, produce
good performance. These architectures are trained to learn knowledge from
appearance and image context. But the fundamentals of visual odometry are rooted
in epipolar geometry, which is not closely associated with appearance. So simply
adopting the current popular DNN architectures for the VO problem is impractical.
A framework should be employed which can learn geometric feature representations
to address VO and other geometric problems. At the same time, connections among
consecutive image frames should be derived, since a VO system evolves over time and
produces results based on image sequences acquired during movement. The
proposed model takes these two requirements into consideration. The architecture
of the proposed visual odometry system is shown in Fig. 4.1. The proposed
architecture takes a video clip or a monocular image sequence as its input. At each
time step, the RGB image frame is pre-processed by subtracting the mean RGB
values of the training set and resizing it to a size that is a multiple of 64. Two
consecutive images from the KITTI dataset are stacked together, forming a tensor
for the deep CNN-RNN network, based upon which the model learns to extract
motion information and estimate poses. This image tensor is fed into the CNN to
produce effective features for the monocular VO. These feature maps are then passed
through an RNN for sequential learning. Each image pair yields a pose estimate
at each time step through the network. As more and more images are captured
from the KITTI dataset, the VO system evolves over time and estimates new poses.
The advantage of this CNN-RNN based architecture is that it allows simultaneous
feature extraction and sequential modelling of VO through the combination of CNN
and RNN, and it does not require any separate preprocessing module.
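As an illustration of the input construction described above, the sketch below builds the stacked two-frame tensor; it is an assumed implementation consistent with the text (the helper names are ours), not the author's code:

```python
import numpy as np
import cv2

def make_input_tensor(img_t0, img_t1, mean_rgb):
    """Mean-subtract two consecutive RGB frames, resize them so both
    dimensions are multiples of 64, and stack them along the channel
    axis into a single 6-channel (2 x RGB), channel-first tensor."""
    def prep(img):
        h, w = img.shape[:2]
        img = cv2.resize(img, ((w // 64) * 64, (h // 64) * 64))
        return img.astype(np.float32) - mean_rgb  # broadcasts over H, W
    stacked = np.concatenate([prep(img_t0), prep(img_t1)], axis=2)
    return stacked.transpose(2, 0, 1)  # (6, H, W)
```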
4.0.2 Feature extraction method of Convolutional Neural
Network
To learn effective feature maps from the input tensors, a CNN is developed to
perform feature extraction on two consecutive monocular RGB images from the
KITTI dataset. The features learned are geometric rather than appearance based,
because the VO system needs to be deployed in unknown environments.
The configuration of the CNN is outlined in Figure 4.2.
Figure 4.2: Configuration of the CNN
The proposed CNN architecture has 9 convolutional layers. Each convolutional
layer is followed by a rectified linear unit (ReLU) activation, except the last one
(Conv6), so there are 17 layers in total. The convolution filter sizes in the network
gradually reduce from 7 × 7 to 5 × 5 to 3 × 3 to capture small and interesting
features. Zero padding is used, and the number of channels increases at each layer
to learn various features.
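A minimal PyTorch sketch of such a CNN is given below. The text specifies only the filter sizes, the zero padding, the growing channel counts and the missing ReLU after Conv6; the exact channel counts and strides here are assumptions in the spirit of Figure 4.2:

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, stride, relu=True):
    """Zero-padded convolution, optionally followed by ReLU."""
    layers = [nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return layers

# 9 convolutional layers; every one except the last (Conv6) is followed
# by a ReLU, giving 17 layers in total. Filter sizes shrink from 7x7 to
# 5x5 to 3x3 while channel counts grow (channels/strides are assumed).
cnn = nn.Sequential(
    *conv(6, 64, 7, 2),                  # Conv1: stacked image pair in
    *conv(64, 128, 5, 2),                # Conv2
    *conv(128, 256, 5, 2),               # Conv3
    *conv(256, 256, 3, 1),               # Conv3_1
    *conv(256, 512, 3, 2),               # Conv4
    *conv(512, 512, 3, 1),               # Conv4_1
    *conv(512, 512, 3, 2),               # Conv5
    *conv(512, 512, 3, 1),               # Conv5_1
    *conv(512, 1024, 3, 2, relu=False),  # Conv6: no ReLU
)
```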
4.0.3 Sequential Modelling method of Recurrent Neural
Network
After the CNN layers have extracted the feature maps, a deep RNN is employed to
perform sequential learning, i.e., to model the dynamics and relations among the
sequence of feature maps prepared by the CNN.
The VO problem involves a temporal model (motion model) and sequential data
(the image sequence). Therefore an RNN, which is well suited to modelling sequential
data, is employed for this task: estimating the pose of the current image frame
benefits from the information encapsulated in previous frames.
However, an RNN is not suitable for directly learning sequential representations
from high-dimensional raw data such as images. Therefore, the proposed system
adopts the appealing CNN-RNN architecture, with the CNN features as the input
of the RNN.
Figure 4.3: Folded and unfolded LSTM structure
An RNN maintains a memory of its hidden states over time and has feedback loops
among them, which makes the current hidden state a function of the previous ones.
Hence, the RNN can find the relationship between the current input and the previous
states in the sequence. Given a convolutional feature x_k at time k, the RNN updates
at time step k by
14
4 Methodology
h_k = \mathcal{H}(W_{xh} x_k + W_{hh} h_{k-1} + b_h)
y_k = W_{hy} h_k + b_y

where h_k is the hidden state and y_k is the output at time k, the W terms denote
the weight matrices of the hidden states and outputs, the b terms are bias vectors,
and \mathcal{H} is an element-wise nonlinear activation function.
Long Short-Term Memory (LSTM) is capable of learning long-term dependencies
through its memory gates and units [38]. An LSTM based RNN is therefore used
to find correlations among images taken over the long trajectories of the KITTI
dataset, as required for visual odometry.
The LSTM determines which previous hidden states are discarded or retained when
updating the current state; this is how the RNN predicts motion during pose
estimation. The folded LSTM and its unfolded version over time are shown in
Fig. 4.3, which also shows the internal structure of an LSTM unit. After unfolding
the LSTM, each LSTM unit is updated with a time step. Given the input x_k at
time k, and the hidden state h_{k-1} and the memory cell c_{k-1} of the previous
LSTM unit, the LSTM updates at each time step k according to
i_k = \sigma(W_{xi} x_k + W_{hi} h_{k-1} + b_i)
f_k = \sigma(W_{xf} x_k + W_{hf} h_{k-1} + b_f)
g_k = \tanh(W_{xg} x_k + W_{hg} h_{k-1} + b_g)
c_k = f_k \odot c_{k-1} + i_k \odot g_k
o_k = \sigma(W_{xo} x_k + W_{ho} h_{k-1} + b_o)
h_k = o_k \odot \tanh(c_k)
In the above expressions, \odot denotes the element-wise product of two vectors, \sigma
denotes the sigmoid non-linearity, \tanh denotes the hyperbolic tangent non-linearity,
the W terms denote the corresponding weight matrices, the b terms denote bias
vectors, and i_k, f_k, g_k, c_k and o_k are the input gate, forget gate, input modulation
gate, memory cell and output gate at time k, respectively.
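The update equations above can be transcribed directly into code. The sketch below is only a notational mirror of the formulas (in practice torch.nn.LSTMCell implements the same update):

```python
import torch

class LSTMCellFromEquations(torch.nn.Module):
    """Direct transcription of the LSTM update equations above."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map produces all four gate pre-activations:
        # W_xi/W_xf/W_xg/W_xo (with biases) and W_hi/W_hf/W_hg/W_ho.
        self.W_x = torch.nn.Linear(input_size, 4 * hidden_size)
        self.W_h = torch.nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x_k, h_prev, c_prev):
        z = self.W_x(x_k) + self.W_h(h_prev)
        i_k, f_k, g_k, o_k = z.chunk(4, dim=-1)
        i_k, f_k, o_k = i_k.sigmoid(), f_k.sigmoid(), o_k.sigmoid()
        g_k = g_k.tanh()                # input modulation gate
        c_k = f_k * c_prev + i_k * g_k  # memory cell update
        h_k = o_k * c_k.tanh()          # hidden state
        return h_k, c_k
```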
Our LSTM structure can handle the long-term dependencies required to correctly
analyze the long image sequences of the KITTI dataset. But the proposed
architecture still needs depth in its network layers to learn high-level representations
and model the complex dynamics between frames. To address this, the deep RNN is
constructed by stacking two LSTM layers, with the hidden states of one LSTM being
the input of the other, as in Fig. 4.1. Each of the LSTM layers has 1000 hidden
states. The deep RNN outputs a pose estimate at each time step based on the visual
features generated by the CNN. New poses are predicted over time as the camera
moves and more images are captured.
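A compact sketch of this recurrent part in PyTorch follows; the 6-DoF linear output head is an assumption consistent with the pose representation of Section 4.0.4, not a detail stated in the text:

```python
import torch

class PoseRNN(torch.nn.Module):
    """Two stacked LSTM layers with 1000 hidden states each, followed by
    a linear head regressing a 6-DoF pose (3 translations and 3 Euler
    angles) at every time step."""
    def __init__(self, feature_size):
        super().__init__()
        self.rnn = torch.nn.LSTM(feature_size, 1000,
                                 num_layers=2, batch_first=True)
        self.head = torch.nn.Linear(1000, 6)  # assumed output layer

    def forward(self, features):              # (batch, time, feature_size)
        out, _ = self.rnn(features)
        return self.head(out)                 # (batch, time, 6)
```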
4.0.4 Cost Function and Optimisation
The proposed CNN-RNN based VO system computes the conditional probability of
the poses Y_t = (y_1, ..., y_t) given a sequence of monocular RGB images
X_t = (x_1, ..., x_t) up to time t:

p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t)
To estimate the poses of the VO system correctly, we find the optimal parameters
\theta^* for which the DNN maximises the above probability:

\theta^* = \arg\max_{\theta} p(Y_t | X_t; \theta)

To learn the parameters \theta of the DNN, the Euclidean distance between the
ground truth pose (p_k, \varphi_k) at time k and its estimate (\hat{p}_k, \hat{\varphi}_k)
is minimised.
The loss function is composed of the Mean Square Error (MSE) of all positions p
and orientations \varphi:

\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \|\hat{p}_k - p_k\|_2^2 + \kappa \|\hat{\varphi}_k - \varphi_k\|_2^2
where \|\cdot\| is the 2-norm, \kappa (100 in the experiments) is a scale factor that
balances the weights of positions and orientations, and N is the number of samples.
The orientation \varphi is represented by Euler angles rather than a quaternion, since
a quaternion is subject to an extra unit constraint which hinders the optimisation
problem of DL. We also find that in practice using quaternions degrades the
orientation estimates to some extent.
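A sketch of this loss in PyTorch is shown below; the [x, y, z, roll, pitch, yaw] layout of the 6-DoF pose vector is an assumption for illustration:

```python
import torch

def pose_loss(pred, target, kappa=100.0):
    """MSE over positions plus kappa-weighted MSE over Euler angles.

    pred, target: (batch, time, 6) tensors, assumed to be laid out as
    [x, y, z, roll, pitch, yaw] per time step.
    """
    pos_err = (pred[..., :3] - target[..., :3]).pow(2).sum(-1)
    ang_err = (pred[..., 3:] - target[..., 3:]).pow(2).sum(-1)
    return (pos_err + kappa * ang_err).mean()
```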
5 Results
5.0.1 Dataset
The KITTI VO/SLAM benchmark has 22 sequences of images, of which 11
(Sequences 00-10) are associated with ground truth. The other 11 sequences
(Sequences 11-21) are provided only with raw sensor data. Our model was trained
on the image sequences 00, 02, 08 and 09 of the KITTI dataset, as these sequences
are relatively long. The trajectories are segmented into different lengths to generate
more data for training, producing 7410 samples in total. The trained model is tested
on Sequences 04, 05, 06, 07 and 10 for evaluation.
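A sketch of such trajectory segmentation is given below; the sub-sequence lengths and step size are placeholders, since the report does not state them:

```python
import random

def segment_trajectory(num_frames, min_len=5, max_len=8, step=1):
    """Cut one long trajectory into overlapping sub-sequences of varying
    length to multiply the amount of training data.

    Returns (start, end) frame-index pairs with an exclusive end.
    """
    samples = []
    for start in range(0, num_frames - min_len, step):
        end = min(start + random.randint(min_len, max_len), num_frames)
        samples.append((start, end))
    return samples
```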
The KITTI VO dataset provides translational and rotational errors for all possible
sub-sequences of length (100, ..., 800) meters. It also provides an evaluation table
which ranks methods according to the average of those values, where errors are
measured in percent (for translation) and in degrees per meter (for rotation). The
advantage of using this dataset is that our model is trained on all possible sequence
lengths and can hence predict the location and orientation more accurately across
unknown environments. This dataset also makes it easier to compare the accuracy
of our model's predictions with other state-of-the-art technologies.
5.0.2 Training and Testing
This section of the report aims to describe the training and testing methodology
adopted. For the sake of training our model so that it works in all environments,
the training sequences Sequence 00, 01, 02, 05, 08 and 09 which are relatively long are
used for training. These sequences covers all a lot different trajectories and patterns
of navigation route of the car. This factor, along with the fact that these sequences
are long enables our model to get trained on different scenarios. The sequences
00, 01, 02, 05, 08 and 09 contains 4541, 1101, 4661, 2761, 4071 and 1591 images
respectively. We evaluated our trained model with the test sequences 04, 05, 07, 09
and 10 and plotted the prediction with respect to the ground truth value to check
the accuracy of the trained model. The estimated VO trajectories corresponding to
the previous testing are given in Fig. 5.1,5.2,5.3,5.4 and 5.5.
Figure 5.1: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.2: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.3: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.4: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence

Figure 5.5: Predicted path (in red) of image sequence 05 plotted against the ground
truth value (in green) of the sequence
6 Evaluation
We can conclude that our model produces relatively accurate and consistent
trajectories compared to the ground truth. It demonstrates that scale can be
estimated better in an end-to-end fashion using our model than with prior estimation
methods based on factors such as camera height. Note that our VO model performs
no scale estimation or post-alignment to the ground truth to obtain the absolute
poses: the scale is maintained by the CNN-RNN network itself and is learned during
the end-to-end training. For conventional monocular VO methods, recovering an
accurate and robust scale is difficult, so our model suggests an appealing advantage
of DL based VO methods. The detailed performance of the algorithms on the testing
sequences is summarised in Figure 6.1. The KITTI Vision Benchmark Suite's table
was utilized to compare the performance of our model with two state-of-the-art
visual odometry algorithms: Library for Visual Odometry 2 (Stereo Version, Active
Matching) [VISO2-S] and Library for Visual Odometry 2 (Monocular Version)
[VISO2-M].
Figure 6.1: Performance of our Monocular VO model compared to VISO2-M and
VISO2-S
The comparison indicates that our deep learning model achieves more robust results
than the monocular VISO2. t_rel denotes the average translational RMSE drift (in
percent) over lengths of 100 m to 800 m, and r_rel denotes the average rotational
RMSE drift (in degrees per 100 m) over the same lengths.
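For reference, the sketch below shows one simplified way such a drift metric can be computed from position sequences; it is an assumption about the metric's implementation, and the official KITTI devkit differs in detail (it uses full SE(3) relative poses and averages over all segment lengths from 100 m to 800 m):

```python
import numpy as np

def translation_drift(gt_xyz, pred_xyz, seg_len=100.0):
    """Average translational drift (%) over sub-paths of ~seg_len meters.

    gt_xyz, pred_xyz: (N, 3) arrays of camera positions per frame.
    """
    # Cumulative distance travelled along the ground-truth path.
    steps = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)
    dists = np.concatenate([[0.0], np.cumsum(steps)])
    errors = []
    for i in range(len(gt_xyz)):
        # First frame j that lies seg_len meters further along the path.
        j = int(np.searchsorted(dists, dists[i] + seg_len))
        if j >= len(gt_xyz):
            break
        gt_delta = gt_xyz[j] - gt_xyz[i]
        pred_delta = pred_xyz[j] - pred_xyz[i]
        errors.append(np.linalg.norm(pred_delta - gt_delta) / seg_len)
    return 100.0 * float(np.mean(errors)) if errors else float("nan")
```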
Based on the findings of our work, as evaluated against the KITTI Vision Benchmark
Suite, we can conclude that this work presents a monocular VO algorithm based on
deep learning. Utilizing the power of deep recurrent neural networks, we were able
to achieve simultaneous representation learning and sequential modelling of the
monocular VO by combining CNNs with RNNs. Our approach does not depend on
any module of the conventional VO algorithms (not even camera calibration) for
pose estimation, and it is trained in an end-to-end manner. We also verified on the
KITTI VO benchmark that it can produce accurate VO results with precise scale
and can work well in completely new scenarios. Our method was able to produce
promising results in this area, but we stress that it is not expected to replace the
classic geometry based approach. For now, it can be a viable complement, i.e.,
incorporating geometry with the representations, knowledge and models learnt by
the DNNs to further improve VO in terms of accuracy and, more importantly,
robustness. More research combining classical geometry based and deep learning
based approaches is required.
Bibliography
[1] Bachrach, A., Prentice, S., He, R., Henry, P., Huang, A.S., Krainin, M.,
Maturana, D., Fox, D., Roy, N.: Estimation, planning, and mapping for
autonomous flight using an rgb-d camera in gps-denied environments (2012),
https://doi.org/10.1177/0278364912455256
[2] Bertelson, P., Gelder, B., Spence, C., Driver, J.: The Psychology of Multimodal
Perception, pp. 141–177 (04 2004)
[3] Besl, P., McKay, N.D.: A method for registration of 3-d shapes (1992)
[4] Besl, P., McKay, N.D.: A method for registration of 3-d shapes (1992)
[5] Campbell, J., Sukthankar, R., Nourbakhsh, I.R., Pahwa, A.: A robust visual
odometry and precipice detection system using consumer-grade monocular
vision (2005), http://dblp.uni-trier.de/db/conf/icra/icra2005.html#CampbellSNP05
[6] Comport, A., Malis, E., Rives, P.: Real-time quadrifocal visual odometry (2010)
[7] Costante, G., Mancini, M., Valigi, P., Ciarfuglia, T.A.: Exploring representation
learning with cnns for frame-to-frame ego-motion estimation (2016)
[8] Davison: Real-time simultaneous localisation and mapping with a single camera
(2003)
[9] Dornhege, C., Kleiner, A.: Visual odometry for tracked vehicles (01 2006)
[10] Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model
fitting with applications to image analysis and automated cartography (1981),
http://publication.wilsonwong.me/load.php?id=233282275
[11] Harris, C., Stephens, M.: A combined corner and edge detector (1988)
[12] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[13] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[14] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using
kinect-style depth cameras for dense 3d modeling of indoor environments (04
2012)
[15] Jeong, W., Lee, K.M.: Cv-slam: a new ceiling vision-based slam technique
(2005)
[16] Kaess, M., Ni, K., Dellaert, F.: Flow separation for fast and robust stereo
odometry (2009)
[17] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera
relocalization (09 2015)
[18] Kim, S., Oh, S.Y.: Slam in indoor environments using omni-directional vertical
and horizontal line features (01 2008)
[19] Konda., K., Memisevic., R.: Learning visual odometry with a convolutional
network (2015)
[20] Krombach, N., Droeschel, D., Houben, S., Behnke, S.: Feature-based visual
odometry prior for real-time semi-dense stereo SLAM (2018),
http://arxiv.org/abs/1810.07768
[21] Lacroix, S., Mallet, A., Chatila, R., Gallo, L.: Rover Self Localization in
Planetary-Like Environments (Aug 1999)
[22] Lowe, D.: Distinctive image features from scale-invariant keypoints (11 2004)
[23] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[24] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[25] Moravec, H.P.: Obstacle avoidance and navigation in the real world by a seeing
robot rover (Jun 2018), https://kilthub.cmu.edu/articles/journal_contribution/Obstacle_avoidance_and_navigation_in_the_real_world_by_a_seeing_robot_rover/6557033/1
[26] Muller, P., Savakis, A.: Flowdometry: An optical flow and deep learning based
approach to visual odometry (03 2017)
[27] Newcombe, R.A., Davison, A.J.: Live dense reconstruction with a single moving
camera
[28] Nister, D., Naroditsky, O., Bergen, J.: Visual odometry (2004)
[29] Olson, C., Matthies, L., Schoppers, M., Maimone, M.: Rover navigation using
stereo ego-motion (06 2003)
[30] Pillai, S., Leonard, J.J.: Towards visual ego-motion learning in robots (2017),
http://arxiv.org/abs/1705.10279
[31] Rosten, E., Drummond, T.: Machine learning for high-speed corner detection
(2006)
[32] Scaramuzza, D., Siegwart, R.: Appearance-guided monocular omnidirectional
visual odometry for outdoor ground vehicles (11 2008)
[33] Scaramuzza, D., Fraundorfer, F.: Visual odometry: part I - the first 30 years
and fundamentals (03 2011)
[34] Shi, J., Tomasi: Good features to track (1994)
[35] Shi, J., Tomasi: Good features to track (1994)
[36] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (09 2014)
[37] Sünderhauf, N., Shirazi, S., Jacobson, A., Pepperell, E., Dayoub, F., Upcroft,
B., Milford, M.: Place recognition with convnet landmarks: Viewpoint-robust,
condition-robust, training-free (07 2015)
[38] Zaremba, W., Sutskever, I.: Learning to execute. CoRR abs/1410.4615 (2014),
http://arxiv.org/abs/1410.4615
chaitra-1.pptx fake news detection using machine learning
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 

Visual Odometry Research Using Deep Learning

VISO2-S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1 Introduction

One of the fundamental needs of humans, mobile devices, and autonomous agents is to localize themselves and map their surroundings with respect to the environment they operate in. Humans perceive their location and pose in complex three-dimensional space through multi-modal sensory perception [2], and this ability to perceive self-motion and the surroundings plays a vital role in developing human cognition. Similarly, artificially intelligent agents and robots should be able to perceive their environment and estimate their pose and location using on-board sensors. Visual Odometry (VO) has emerged as one of the most essential techniques for pose estimation and robot localisation. It estimates the ego-motion of a camera by integrating the relative motion between consecutive images into global poses [30].

[Figure 1.1: Conventional Visual Odometry Pipeline]

Deep learning has become the first choice for tackling computer vision problems over the last decade, but its potential has not yet been fully exploited for the visual odometry problem. Most deep learning models deal with recognition and classification problems, for which Convolutional Neural Networks (CNNs) extract appearance information from images. The limitation of this appearance-based approach is that the resulting VO systems only function in the environments they were trained in; a VO system predicts pose and location correctly only where feature-based maps provide more accuracy than appearance-based maps [20]. A VO algorithm should instead model motion dynamics by reading changes across a sequence of images and drawing connections between frames, rather than inferring from a single image. This calls for sequential learning, which a CNN alone cannot handle. In this work, the problem is addressed by using a deep CNN to create feature maps, followed by an RNN that creates correspondences between the learned features over time. The proposed method learns features through the CNN and updates them according to previous states through the RNN. It is an end-to-end solution to the visual odometry problem and does not require any module from the classical VO pipeline (not even camera calibration).
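To make "integrating relative motion into global poses" concrete, the short sketch below (our illustration, not a component of the proposed system) chains per-frame relative transforms, represented as 4 x 4 homogeneous matrices, into a global trajectory:

```python
import numpy as np

def compose_trajectory(relative_poses):
    """Chain 4x4 relative transforms (frame k-1 -> frame k) into
    global camera poses, starting from the identity."""
    global_pose = np.eye(4)
    trajectory = [global_pose.copy()]
    for T in relative_poses:
        global_pose = global_pose @ T  # accumulate relative motion
        trajectory.append(global_pose.copy())
    return trajectory

# Toy example: five steps of 1 m forward motion along the z axis.
step = np.eye(4)
step[2, 3] = 1.0
print(compose_trajectory([step] * 5)[-1][:3, 3])  # -> [0. 0. 5.]
```

In the learned system described later, the relative transforms come from the network's pose estimates rather than from a geometric solver.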
2 Research Questions And Objectives / Research Challenges

One of the main challenges of conventional visual odometry methods is that they rely on sparse key points to track features across moving pixels. Traditional odometry methods map the results and place the vehicle on the map through a process called Simultaneous Localization and Mapping (SLAM), but slow algorithmic performance and high memory usage prevent such systems from meeting the requirements of extended data acquisition and processing. In this work, pixel movement is computed using optical flow instead of tracking sparse key points as in the conventional visual odometry pipeline. Optical flow measures pixel movement in both the horizontal and vertical directions. For a video stream with small incremental movements between camera updates, motion between frames is assumed to be linear, and a proportional relationship between pixel movement and the physical movement of the camera is established [26].

[Figure 2.1: The red boxes show data, and the blue boxes show operations on the data. At the highest level, consecutive video frames are used to calculate an optical flow image, which is then fed into a convolutional neural network that outputs the odometry information needed to create a map.]

The main contributions of this work are as follows:

• A fully converged deep learning approach to visual odometry calculation.
• An accurate visual odometry system capable of prediction in real time.
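The report does not state which optical flow algorithm is assumed, so the following sketch uses OpenCV's Farneback method as a stand-in to show how a dense flow field with horizontal and vertical components can be computed from two consecutive frames; the parameters are generic defaults rather than values from this work:

```python
import cv2

def dense_flow(prev_bgr, next_bgr):
    """Dense optical flow between two consecutive frames: returns an
    H x W x 2 array holding the horizontal and vertical displacement
    of every pixel, which can replace raw RGB input to a CNN."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Farneback parameters below are common defaults, not tuned
    # values from this report.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```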
3 Literature Review

Visual Odometry can be defined as the process of estimating a robot's translational and rotational motion with respect to a reference frame from observations of a sequence of images of its environment. It can be described as a particular case of a technique called Structure from Motion (SFM). SFM handles the problem of 3D reconstruction of the structure of the environment from a sequence of images, and can also estimate camera poses from sequentially ordered or unordered image sets [33]. However, in SFM the final refinement and global optimization step over both the camera poses and the structure is computationally expensive and usually performed off-line, whereas in visual odometry the camera poses are estimated in real time.

Visual Odometry techniques can be categorized into monocular [5] and stereo camera [23] methods. These can be further categorized into feature matching (matching features over a number of images) [23], feature tracking [9] (matching features in adjacent frames), and optical flow techniques, the categorization being based on the intensity of all pixels or of specific regions in sequential images.

Estimating a robot's ego-motion from a sequence of incoming frames began in the 1980s with Moravec [25] at Stanford University. Moravec introduced a stereo-vision-based technique in which a single camera slid along a rail in a move-and-stop fashion, enabling the robot to extract image features (corners) in the first image. The camera then slid along the rail in a direction perpendicular to the robot's motion, repeating the process until 9 images were captured. Features were matched between the 9 images using Normalized Cross Correlation (NCC) and used to reconstruct the 3D structure. The reconstructed 3D points were observed from different locations, and the data was aligned to calculate the camera motion transformation.

The scope of this work was later extended by Matthies and Shafer [24], who derived an error model using 3D Gaussian distributions instead of the scalar model used in the earlier method. Other stereo VO implementations followed. For example, in [29] a maximum-likelihood ego-motion error model was introduced for localizing a rover over long distances, and in [21] a method was described for rover localization that took raw image data instead of geometric data for motion estimation.

The term "Visual Odometry" was first coined by Nister et al. [28], named for its similarity to the concept of wheel odometry. They proposed methods for obtaining camera motion from visual input in both monocular and stereo systems, capable of estimating camera motion in the presence of outliers; an outlier rejection scheme using RANSAC [10] was proposed to remove the outliers.
[Figure 3.1: A block diagram showing the main components of (a) a VO and (b) a filter-based SLAM system]

Their method could also, for the first time, track features across all frames instead of matching features only between consecutive frame pairs as in earlier methods, which avoids feature drift during cross-correlation-based tracking. A RANSAC-based motion estimation using the 3D-to-2D reprojection error was proposed; 3D-to-2D reprojection errors were shown to give better estimates than 3D-to-3D errors [12].

Scaramuzza and Siegwart [32] performed another important study in visual odometry. They used a monocular omni-directional camera and fused the motion estimates obtained by two approaches. In the first approach, SIFT features were extracted and RANSAC was used for outlier removal [22]. In the second approach, appearance-based methods were used [6]. Appearance-based techniques can handle outdoor spaces efficiently and robustly, and they avoid error-prone feature extraction and matching.

A stereo VO system for outdoor navigation was proposed by Kaess et al. [16]. Sparse features were obtained by feature matching and separated in the flow into close features and far features; the reason for the separation is that small camera translations barely influence points that are far away. The far points were used to recover the rotation using a two-point RANSAC, and the close points were used to recover the translation using a one-point RANSAC.

In the next section another important technique for motion estimation, Visual SLAM, is discussed.

3.0.1 Visual SLAM

Another significant approach to robot localization and mapping is Visual SLAM. SLAM is the process by which a robot localizes itself in an unknown environment while incrementally constructing a map of its surroundings. SLAM has been researched over the last couple of decades with solutions using different sensors, including sonar sensors, IR sensors, and laser scanners.
Recently, research interest in VSLAM has grown because passive low-cost video sensors provide rich visual information compared to laser scanners. However, more sophisticated algorithms are required to process images and extract the necessary information. Thanks to advances in CPU and GPU technology, real-time implementation of the required complex algorithms is no longer difficult. A variety of solutions have been proposed, including monocular [8], stereo, omni-directional [18], time-of-flight (TOF) [34], and combined color and depth (RGB-D) cameras [12].

Davison et al. [8] used a single monocular camera and constructed a map by extracting sparse features of the environment with a Shi and Tomasi operator [35]; new features were matched to those already observed using a normalized sum-of-squared-difference correlation technique. The use of a single monocular camera meant that the absolute scale of structures could not be obtained, and the camera had to be calibrated. An Extended Kalman Filter (EKF) was utilized for state estimation, and only a limited number of features were extracted and tracked. Se et al. [13] proposed a vision-based method for mobile robot localization and mapping that used SIFT for feature extraction.

CV-SLAM (Ceiling Vision SLAM) is another technique, investigated by pointing the camera upwards towards the ceiling. Its advantages over frontal-view V-SLAM are fewer interactions with moving obstacles and steady observation of features. Jeong et al. [15] were the first to introduce cv-SLAM. They employed a single monocular camera and extracted corner features with the Harris corner detector [11]. After detection, a landmark orientation estimation technique combined with an NCC method was used to align the currently observed landmarks with previously stored ones.

There has been increasing interest in dense 3D reconstruction of the environment as compared to sparse 2D and 3D SLAM. Newcombe and Davison [27] obtained a dense 3D model of the environment in real time using a single monocular camera, but their method is limited to small environments. Henry et al. [14] implemented an RGB-D mapping approach utilizing an RGB-D camera (i.e. Microsoft Kinect) to obtain a dense 3D reconstruction of the environment while estimating the 6 degree-of-freedom (6DOF) camera pose. The method extracted Features from Accelerated Segment Test (FAST) [31] in each frame and matched them with features from the previous frame using the Calonder descriptors [3]. A RANSAC alignment step then obtained a subset of feature matches (inliers) corresponding to a consistent rigid transformation, which was used as an initial guess in the Iterative Closest Point (ICP) [4] algorithm to refine the transformation. Sparse Bundle Adjustment (SBA) was also applied to obtain a globally consistent map, and loop closure was detected by matching the current frame to previously collected key-frames.

Bachrach et al. [1] proposed a VO and SLAM system for unmanned air vehicles (UAVs) using an RGB-D camera.
Their method relied on extracting FAST features from sequential preprocessed images at different pyramid levels. This step was followed by an initial rotation estimation that limited the size of the search window for feature matching. Matching was performed by finding the mutual lowest sum-of-squared-difference (SSD) score between the descriptor vectors; a greedy algorithm refined the matches and produced the inlier set, which was then used to estimate the motion between frames. To reduce drift in the motion estimates, they suggested matching the current frame to a selected key-frame instead of matching consecutive frames.

The above discussion covered various approaches to the V-SLAM problem. Earlier methods generally focused on sparse 2D and 3D SLAM due to the limited computational resources available; more recently, interest has shifted towards dense 3D reconstruction of the environment thanks to technological advances and the availability of efficient optimization methods. All these studies found that traditional machine learning techniques are inefficient for big or highly non-linear, high-dimensional data such as RGB images. Deep learning, which automatically learns suitable feature representations from large-scale datasets, provides an alternative solution to the VO problem.

3.0.2 Deep Learning based Methods

Deep learning technologies have achieved significant results in localisation-related applications over the past decade. CNNs are mainly utilised for appearance-based place recognition [37]. K. Konda and R. Memisevic [19] first realised DL-based VO through synchrony detection between image sequences and features. A softmax function was applied after estimating depth and velocity to predict the discretised changes of direction and velocity. This method provides a good estimate from DL-based stereo VO, but it formulates VO as a classification problem rather than as pose regression.

Kendall et al. [17] proposed a solution for camera relocalisation from a single image. Their method fine-tuned CNNs on images of a specific scene, labelling the images by Structure from Motion (SFM), a process that is time-consuming and labour-intensive for large-scale scenarios. In their implementation, a trained CNN model serves as an appearance "map" of the scene. The problem with this approach is that the model needs to be re-trained or fine-tuned for each new environment, so it is not suitable for widespread use; this remains one of the biggest hindrances to applying deep learning to VO. To overcome this problem, an alternative method was proposed in which the CNNs were provided with dense optical flow instead of RGB images [7]. Three different CNN architectures were developed on optical flow inputs to learn appropriate features for VO. These methods led to a robust VO capable of handling even blurred and under-exposed images. However, the proposed CNNs require pre-processed dense optical flow as input, which cannot benefit from end-to-end learning, as is true to the nature of deep learning models.
The above methods are also inappropriate for real-time applications. Moreover, CNNs alone are incapable of modelling sequential information, and none of the previous works considers image sequences or videos for learning. In this work, we tackle this by utilizing RNNs, which are capable of sequential learning.
4 Methodology

Our approach to this problem utilizes a deep CNN-RNN framework composed of CNN-based feature extraction and RNN-based sequential modelling. Since a CNN is incapable of handling sequential information, the RNN layers are added to process the features already extracted by the CNN.

[Figure 4.1: Proposed architecture of the CNN-RNN based monocular VO system. The dimensions of the tensors shown are examples based on the image dimensions of the KITTI dataset; the CNN ones vary according to the size of the input image. Camera image credit: KITTI dataset.]

4.0.1 Architecture of the CNN-RNN network

Popular deep neural network architectures such as VGGNet [36] and GoogLeNet, which were originally developed for computer vision tasks, produce good performance because they are trained to learn knowledge from appearance and image context. But the fundamentals of visual odometry are rooted in epipolar geometry, not in appearance, so simply adopting the current popular DNN architectures for the VO problem is impractical. A framework is needed that can learn geometric feature representations to address VO and other geometric problems. At the same time, connections among consecutive image frames should be derived, since a VO system evolves over time and produces results based on image sequences acquired during movement. The proposed model takes both requirements into consideration.

The architecture of the proposed visual odometry system is shown in Fig. 4.1. It takes a video clip or a monocular image sequence as input. At each time step, the RGB image frame is pre-processed by subtracting the mean RGB values of the training set and resizing to a size that is a multiple of 64. Two consecutive images from the KITTI dataset are stacked together to form a tensor for the deep CNN-RNN network, from which the model learns to extract motion information and estimate poses. The image tensor is fed into the CNN to produce effective features for monocular VO, and these feature maps are then passed through an RNN for sequential learning. Each image pair yields a pose estimate at each time step, and as more images are captured, the VO system evolves over time and estimates new poses. The advantage of this architecture is that it allows simultaneous feature extraction and sequential modelling of VO through the combination of CNN and RNN and, beyond the simple normalisation above, it does not require any preprocessing.
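A minimal sketch of this input preparation step; the mean RGB values are placeholders (the report does not list them), and 1280 x 384 is chosen as a KITTI-like resolution whose sides are multiples of 64:

```python
import numpy as np
import cv2

# Placeholder mean: the report subtracts the training-set mean RGB
# but does not list the actual values.
MEAN_RGB = np.array([90.0, 95.0, 93.0], dtype=np.float32)

def preprocess_pair(img1, img2, size=(1280, 384)):
    """Normalise two consecutive RGB frames and stack them along the
    channel axis into one (H, W, 6) tensor, with width and height
    chosen as multiples of 64."""
    frames = []
    for img in (img1, img2):
        resized = cv2.resize(img, size).astype(np.float32)
        frames.append(resized - MEAN_RGB)  # assumes RGB channel order
    return np.concatenate(frames, axis=-1)
```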
4.0.2 Feature extraction method of Convolutional Neural Network

To learn effective feature maps from the input tensors, a CNN is developed to perform feature extraction on two consecutive monocular RGB images of the KITTI dataset. The learned features are geometric rather than appearance-based, because the VO system must be deployable in unknown environments. The configuration of the CNN is outlined in Fig. 4.2.

[Figure 4.2: Configuration of the CNN]

The proposed CNN architecture has 9 convolutional layers, each followed by a rectified linear unit (ReLU) activation except for the Conv6 layer, giving 17 layers in total. The convolution filter sizes gradually reduce from 7 × 7 to 5 × 5 to 3 × 3 to capture small, interesting features. Zero padding is used, and the number of channels increases at each layer so that the network can learn a variety of features.
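Since Figure 4.2 is not reproduced here, the channel sizes in the following PyTorch sketch are assumptions borrowed from the FlowNet-style layout commonly used for such networks; only the nine-layer count, the shrinking kernel sizes, the zero padding, and the missing ReLU after Conv6 are taken from the text:

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, stride, relu=True):
    """Conv block with zero padding; the ReLU is optional so the
    last layer (Conv6) can be left linear, as in the report."""
    layers = [nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class FeatureCNN(nn.Module):
    """Nine-layer CNN over a stacked image pair (6 input channels).
    Channel counts are assumptions, not values from Figure 4.2."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv(6,    64, 7, 2),              # Conv1
            conv(64,  128, 5, 2),              # Conv2
            conv(128, 256, 5, 2),              # Conv3
            conv(256, 256, 3, 1),              # Conv3_1
            conv(256, 512, 3, 2),              # Conv4
            conv(512, 512, 3, 1),              # Conv4_1
            conv(512, 512, 3, 2),              # Conv5
            conv(512, 512, 3, 1),              # Conv5_1
            conv(512, 1024, 3, 2, relu=False)  # Conv6, no ReLU
        )

    def forward(self, x):
        return self.net(x)
```

This gives 9 convolutions and 8 ReLUs, matching the 17 layers counted in the text.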
4.0.3 Sequential Modelling method of Recurrent Neural Network

After the CNN layers have extracted the feature maps, a deep RNN is employed to perform sequential learning, i.e., to model the dynamics and relations among the sequence of feature maps prepared by the CNN. The VO problem involves a temporal model (the motion model) and sequential data (the image sequence), so an RNN, which excels at modelling sequential data, is employed for this task: estimating the pose of the current image frame benefits from the information encapsulated in previous frames. However, an RNN is not suitable for directly learning sequential representations from high-dimensional raw data such as images. Therefore, the proposed system adopts the CNN-RNN architecture, with the CNN features as the input of the RNN.

[Figure 4.3: Folded and unfolded LSTM structure]

An RNN maintains a memory of its hidden states over time and has feedback loops among states, which make the current hidden state a function of the previous ones. Hence, the RNN can relate the current input to the previous states in the sequence. Given a convolutional feature x_k at time k, the RNN updates at time step k as
h_k = \mathcal{H}(W_{xh} x_k + W_{hh} h_{k-1} + b_h)
y_k = W_{hy} h_k + b_y

where h_k is the hidden state and y_k the output at time k, the W terms denote the weight matrices of the hidden states and outputs, the b terms are bias vectors, and \mathcal{H} is an element-wise nonlinear activation function.

Long Short-Term Memory (LSTM) is capable of learning long-term dependencies through memory gates and units [38], so it is used to find correlations among images taken over the long trajectories of the KITTI dataset, as visual odometry requires. An LSTM determines which previous hidden states are discarded or retained when updating the current state; this is how the RNN can predict the motion during pose estimation. The folded LSTM and its unfolded version over time, together with the internal structure of an LSTM unit, are shown in Fig. 4.3. After unfolding, each LSTM unit corresponds to one time step. Given the input x_k at time k and the hidden state h_{k-1} and memory cell c_{k-1} of the previous LSTM unit, the LSTM updates at time step k according to

i_k = \sigma(W_{xi} x_k + W_{hi} h_{k-1} + b_i)
f_k = \sigma(W_{xf} x_k + W_{hf} h_{k-1} + b_f)
g_k = \tanh(W_{xg} x_k + W_{hg} h_{k-1} + b_g)
c_k = f_k \odot c_{k-1} + i_k \odot g_k
o_k = \sigma(W_{xo} x_k + W_{ho} h_{k-1} + b_o)
h_k = o_k \odot \tanh(c_k)

where \odot denotes the element-wise product of two vectors, \sigma the sigmoid non-linearity, \tanh the hyperbolic tangent non-linearity, the W terms the corresponding weight matrices, the b terms the bias vectors, and i_k, f_k, g_k, c_k and o_k are the input gate, forget gate, input modulation gate, memory cell and output gate at time k, respectively.

This LSTM structure can handle the long-term dependencies needed to analyse the long sequences of the KITTI dataset correctly, but the architecture still needs depth to learn high-level representations and to model the complex dynamics between frames. To address this, the deep RNN is constructed by stacking two LSTM layers, with the hidden states of one LSTM serving as the input of the other, as in Fig. 4.1. Each LSTM layer has 1000 hidden states. The deep RNN outputs a pose estimate at each time step based on the visual features generated by the CNN; new poses are predicted over time as the camera moves and more images are captured.
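A compact PyTorch sketch of this recurrent part: two stacked LSTM layers with 1000 hidden units each, regressing a 6-DoF pose per time step. The flattened CNN feature dimension is an assumption, since it depends on the input image size:

```python
import torch.nn as nn

class PoseRNN(nn.Module):
    """Two stacked LSTM layers (1000 hidden units each) mapping the
    flattened CNN features of every image pair to a 6-DoF pose
    (3 translations + 3 Euler angles)."""
    def __init__(self, feature_dim=4096, hidden=1000):
        super().__init__()
        self.rnn = nn.LSTM(feature_dim, hidden, num_layers=2,
                           batch_first=True)
        self.head = nn.Linear(hidden, 6)

    def forward(self, features, state=None):
        # features: (batch, time, feature_dim). The hidden state is
        # carried across time steps so each pose estimate depends on
        # the previous ones.
        out, state = self.rnn(features, state)
        return self.head(out), state
```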
4.0.4 Cost Function and Optimisation

The proposed CNN-RNN based VO system computes the conditional probability of the poses Y_t = (y_1, \ldots, y_t) given a sequence of monocular RGB images X_t = (x_1, \ldots, x_t) up to time t:

p(Y_t \mid X_t) = p(y_1, \ldots, y_t \mid x_1, \ldots, x_t)

To estimate the poses correctly, the DNN seeks the optimal parameters \theta^* that maximise this probability:

\theta^* = \arg\max_{\theta} p(Y_t \mid X_t; \theta)

To learn the parameters \theta^*, the Euclidean distance between the ground-truth pose (p_k, \varphi_k) at time k and its estimate (\hat{p}_k, \hat{\varphi}_k) is minimised. The loss function is composed of the Mean Square Error (MSE) of all positions p and orientations \varphi:

\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \lVert \hat{p}_k - p_k \rVert_2^2 + \kappa \lVert \hat{\varphi}_k - \varphi_k \rVert_2^2

where \lVert \cdot \rVert_2 is the 2-norm, \kappa (100 in the experiments) is a scale factor balancing the weights of positions and orientations, and N is the number of samples. The orientation \varphi is represented by Euler angles rather than a quaternion, since a quaternion is subject to an extra unit constraint that hinders the optimisation problem of DL; in practice we also find that using quaternions degrades the orientation estimate to some extent.
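A direct translation of this loss into PyTorch, with \kappa = 100 as stated above; the layout of the 6-DoF pose vector (three positions followed by three Euler angles) is our assumption for illustration:

```python
import torch

def pose_loss(pred, target, kappa=100.0):
    """MSE over positions plus a kappa-scaled MSE over Euler-angle
    orientations. pred/target: (batch, time, 6) tensors, where
    [..., :3] holds positions and [..., 3:] holds orientations
    (an assumed layout, not specified in the report)."""
    pos_err = torch.sum((pred[..., :3] - target[..., :3]) ** 2, dim=-1)
    ori_err = torch.sum((pred[..., 3:] - target[..., 3:]) ** 2, dim=-1)
    return torch.mean(pos_err + kappa * ori_err)
```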
5 Results

5.0.1 Dataset

The KITTI VO/SLAM benchmark has 22 sequences of images, of which 11 (Sequences 00-10) come with ground truth; the remaining sequences (Sequences 11-21) provide only raw sensor data. Our model was trained on image sequences 00, 02, 08 and 09 of the KITTI dataset, as these sequences are relatively long. The trajectories were segmented into different lengths to generate more data for training, producing 7410 samples in total. The trained model was tested on Sequences 04, 05, 06, 07 and 10 for evaluation.

The KITTI VO benchmark provides translational and rotational errors for all possible subsequences of length (100, ..., 800) metres, together with an evaluation table that ranks methods by the average of those values, where errors are measured in percent (for translation) and in degrees per metre (for rotation). The advantage of this dataset is that our model is trained over all possible subsequence lengths and hence can predict location and orientation more accurately in unknown environments; it also makes it easier to compare the accuracy of our model's predictions with other state-of-the-art techniques.

5.0.2 Training and Testing

This section describes the training and testing methodology adopted. So that the model works across different environments, the relatively long Sequences 00, 01, 02, 05, 08 and 09 were used for training. These sequences cover many different trajectories and navigation patterns of the car, which, together with their length, lets the model train on diverse scenarios. Sequences 00, 01, 02, 05, 08 and 09 contain 4541, 1101, 4661, 2761, 4071 and 1591 images respectively. We evaluated the trained model on the test sequences 04, 05, 07, 09 and 10 and plotted the predictions against the ground truth to check the accuracy of the trained model. The estimated VO trajectories for these tests are given in Figs. 5.1, 5.2, 5.3, 5.4 and 5.5, after the data-preparation sketch below.
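Ground-truth targets for such training are typically derived from the KITTI pose files, in which each line stores a flattened 3 x 4 pose matrix. A minimal sketch (our illustration, not code from this work) of loading the poses and converting them into per-frame relative motions:

```python
import numpy as np

def load_kitti_poses(path):
    """Read a KITTI ground-truth file: one flattened 3x4 pose matrix
    per line, extended here to 4x4 homogeneous form."""
    poses = []
    with open(path) as f:
        for line in f:
            T = np.eye(4)
            T[:3, :] = np.array(line.split(), dtype=np.float64).reshape(3, 4)
            poses.append(T)
    return poses

def relative_motions(poses):
    """Per-frame relative transforms used as training targets."""
    return [np.linalg.inv(a) @ b for a, b in zip(poses, poses[1:])]
```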
[Figure 5.1: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.2: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.3: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.4: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]

[Figure 5.5: Predicted path (in red) of image sequence 05 plotted against the ground truth (in green) of the sequence]
6 Evaluation

We can conclude that our model produces relatively accurate and consistent trajectories with respect to the ground truth. This demonstrates that scale can be estimated better in an end-to-end fashion with our model than with prior methods that estimate it from factors such as camera height. Note that our VO model performs no scale estimation or post-alignment to the ground truth to obtain the absolute poses: the scale is maintained by the CNN-RNN network itself and is learned during end-to-end training. For conventional monocular VO methods, recovering accurate and robust scale is difficult, so this suggests an appealing advantage of the DL-based VO approach.

The detailed performance of the algorithms on the testing sequences is summarised in Figure 6.1. The KITTI Vision Benchmark Suite's evaluation table was used to compare the performance of our model with two state-of-the-art visual odometry algorithms: the Library for Visual Odometry 2, stereo version with active matching (VISO2-S) [?], and its monocular version (VISO2-M) [?].

[Figure 6.1: Performance of our Monocular VO model compared to VISO2-M and VISO2-S]

The comparison indicates that our deep learning model achieves more robust results than the monocular VISO2. Here t_rel denotes the average translational RMSE drift in percent over trajectory lengths of 100 m to 800 m, and r_rel denotes the average rotational RMSE drift (degrees per 100 m) over the same lengths.
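For intuition, the sketch below computes a simplified, position-only version of such a drift metric; it is a reduced approximation of the official KITTI evaluation (which compares full relative transforms over several subsequence lengths), not the benchmark code itself:

```python
import numpy as np

def translational_drift_percent(gt_pos, pred_pos, length=100.0):
    """Position-only approximation of translational drift: for every
    subsequence of roughly `length` metres of ground-truth travel,
    compare the predicted displacement to the true displacement and
    express the error as a percentage of the distance travelled."""
    steps = np.linalg.norm(np.diff(gt_pos, axis=0), axis=1)
    dist = np.concatenate(([0.0], np.cumsum(steps)))
    errors = []
    for i in range(len(gt_pos)):
        # First index whose travelled distance exceeds dist[i] + length.
        j = int(np.searchsorted(dist, dist[i] + length))
        if j >= len(gt_pos):
            break
        gt_disp = gt_pos[j] - gt_pos[i]
        pred_disp = pred_pos[j] - pred_pos[i]
        errors.append(np.linalg.norm(pred_disp - gt_disp)
                      / (dist[j] - dist[i]) * 100.0)
    return float(np.mean(errors)) if errors else 0.0
```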
Based on the findings of our work, as evaluated against the KITTI Vision Benchmark Suite, we conclude that this work presents a monocular VO algorithm based on deep learning. Utilizing the power of deep recurrent neural networks, we achieve simultaneous representation learning and sequential modelling of monocular VO by combining CNNs with RNNs. Our approach does not depend on any module of the conventional VO pipeline (not even camera calibration) for pose estimation, and it is trained in an end-to-end manner. The KITTI VO benchmark further verifies that it can produce accurate VO results with precise scale and can work well in completely new scenarios.

Our method produces promising results in this area, but we stress that it is not expected to replace the classic geometry-based approach. For now, it can serve as a viable complement: incorporating geometry with the representations, knowledge and models learnt by DNNs can further improve VO in terms of accuracy and, more importantly, robustness. More research on combining classical geometry-based and deep learning-based approaches is required.
Bibliography

[1] Bachrach, A., Prentice, S., He, R., Henry, P., Huang, A.S., Krainin, M., Maturana, D., Fox, D., Roy, N.: Estimation, planning, and mapping for autonomous flight using an RGB-D camera in GPS-denied environments (2012), https://doi.org/10.1177/0278364912455256
[2] Bertelson, P., Gelder, B., Spence, C., Driver, J.: The Psychology of Multimodal Perception, pp. 141-177 (04 2004)
[3] Besl, P., McKay, N.D.: A method for registration of 3-D shapes (1992)
[4] Besl, P., McKay, N.D.: A method for registration of 3-D shapes (1992)
[5] Campbell, J., Sukthankar, R., Nourbakhsh, I.R., Pahwa, A.: A robust visual odometry and precipice detection system using consumer-grade monocular vision (2005), http://dblp.uni-trier.de/db/conf/icra/icra2005.html#CampbellSNP05
[6] Comport, A., Malis, E., Rives, P.: Real-time quadrifocal visual odometry (2010)
[7] Costante, G., Mancini, M., Valigi, P., Ciarfuglia, T.A.: Exploring representation learning with CNNs for frame-to-frame ego-motion estimation (2016)
[8] Davison, A.J.: Real-time simultaneous localisation and mapping with a single camera (2003)
[9] Dornhege, C., Kleiner, A.: Visual odometry for tracked vehicles (01 2006)
[10] Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography (1981), http://publication.wilsonwong.me/load.php?id=233282275
[11] Harris, C., Stephens, M.: A combined corner and edge detector (1988)
[12] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[13] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[14] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments (04 2012)
[15] Jeong, W., Lee, K.M.: CV-SLAM: a new ceiling vision-based SLAM technique (2005)
[16] Kaess, M., Ni, K., Dellaert, F.: Flow separation for fast and robust stereo odometry (2009)
[17] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization (09 2015)
[18] Kim, S., Oh, S.Y.: SLAM in indoor environments using omni-directional vertical and horizontal line features (01 2008)
[19] Konda, K., Memisevic, R.: Learning visual odometry with a convolutional network (2015)
[20] Krombach, N., Droeschel, D., Houben, S., Behnke, S.: Feature-based visual odometry prior for real-time semi-dense stereo SLAM (2018), http://arxiv.org/abs/1810.07768
[21] Lacroix, S., Mallet, A., Chatila, R., Gallo, L.: Rover self localization in planetary-like environments (Aug 1999)
[22] Lowe, D.: Distinctive image features from scale-invariant keypoints (11 2004)
[23] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[24] Matthies, L., Shafer, S.: Error modeling in stereo navigation (June 1987)
[25] Moravec, H.P.: Obstacle avoidance and navigation in the real world by a seeing robot rover (Jun 2018), https://kilthub.cmu.edu/articles/journal_contribution/Obstacle_avoidance_and_navigation_in_the_real_world_by_a_seeing_robot_rover/6557033/1
[26] Muller, P., Savakis, A.: Flowdometry: An optical flow and deep learning based approach to visual odometry (03 2017)
[27] Newcombe, R.A., Davison, A.J.: Live dense reconstruction with a single moving camera
[28] Nister, D., Naroditsky, O., Bergen, J.: Visual odometry (2004)
[29] Olson, C., Matthies, L., Schoppers, M., Maimone, M.: Rover navigation using stereo ego-motion (06 2003)
[30] Pillai, S., Leonard, J.J.: Towards visual ego-motion learning in robots (2017), http://arxiv.org/abs/1705.10279
[31] Rosten, E., Drummond, T.: Machine learning for high-speed corner detection (2006)
[32] Scaramuzza, D., Siegwart, R.: Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles (11 2008)
[33] Scaramuzza, D., Fraundorfer, F.: Visual odometry: Part I - the first 30 years and fundamentals (03 2011)
[34] Shi, J., Tomasi, C.: Good features to track (1994)
[35] Shi, J., Tomasi, C.: Good features to track (1994)
[36] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (09 2014)
[37] Sünderhauf, N., Shirazi, S., Jacobson, A., Pepperell, E., Dayoub, F., Upcroft, B., Milford, M.: Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free (07 2015)
[38] Zaremba, W., Sutskever, I.: Learning to execute. CoRR abs/1410.4615 (2014), http://arxiv.org/abs/1410.4615