Davide Scaramuzza
Visual Odometry and SLAM:
past, present, and the robust-perception age
References
➤ Scaramuzza, D., Fraundorfer, F., Visual Odometry: Part I - The First 30 Years and Fundamentals, IEEE Robotics and Automation Magazine, Volume 18, Issue 4, 2011.
➤ Fraundorfer, F., Scaramuzza, D., Visual Odometry: Part II - Matching, Robustness, and Applications, IEEE Robotics and Automation Magazine, Volume 19, Issue 1, 2012.
➤ Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I.D., Leonard, J.J., Simultaneous Localization And Mapping: Present, Future, and the Robust-Perception Age, IEEE Transactions on Robotics (conditionally accepted), 2016.
Outline
➤ Theory
➤ Open Source Algorithms
➤ Event-based Vision
Input: image sequence (or video stream) from one or more cameras attached to a moving vehicle
Output: camera trajectory (3D structure is a plus)
VO is the process of incrementally estimating the pose of the vehicle by examining the changes that motion induces on the images of its onboard cameras.
➤ 1980: first known real-time VO implementation on a robot, in Hans Moravec's PhD thesis (NASA/JPL) for Mars rovers, using one sliding camera (sliding stereo).
➤ 1980 to 2000: VO research was dominated by NASA/JPL in preparation for the 2004 Mars mission (see papers by Matthies, Olson, etc. from JPL).
➤ 2004: VO used on a robot on another planet: Mars rovers Spirit and Opportunity.
➤ 2004: VO was revived in the academic environment by David Nistér's «Visual Odometry» paper. The term VO became popular (Nistér later became head of MS HoloLens before moving to Tesla in 2014).
➤ Sufficient illumination in the environment
➤ Dominance of static scene over moving objects
➤ Enough texture to allow apparent motion to be extracted
➤ Sufficient scene overlap between consecutive frames
Is any of these scenes good for VO? Why?
SFM VSLAM VO
SFM is more general than VO and tackles the problem of 3D reconstruction and 6DOF pose estimation from unordered image sets.
Example: reconstruction from 3 million images from Flickr.com, on a cluster of 250 computers, 24 hours of computation!
Paper: "Building Rome in a Day", ICCV'09
➤ VO is a particular case of SFM
➤ VO focuses on estimating the 3D motion of the camera sequentially (as a new frame arrives) and in real time
➤ Terminology: sometimes SFM is used as a synonym of VO
➤ Visual Odometry
  ▪ Focus on incremental estimation / local consistency
➤ Visual SLAM: Simultaneous Localization And Mapping
  ▪ Focus on globally consistent estimation
  ▪ Visual SLAM = visual odometry + loop detection + graph optimization
➤ The choice between VO and V-SLAM depends on the tradeoff between performance, consistency, and simplicity of implementation.
➤ VO trades off consistency for real-time performance, without the need to keep track of the entire previous history of the camera.
(Figure: visual odometry vs. visual SLAM trajectories; image courtesy of [Clemente et al., RSS'07])
1. Compute the relative motion T_k from image I_{k-1} to image I_k
2. Concatenate the relative motions to recover the full trajectory of camera poses C_0, C_1, ..., C_n
3. An optimization over the last m poses can be done to refine the trajectory locally (Pose-Graph or Bundle Adjustment)

$$T_k = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}, \qquad C_n = C_{n-1} T_n$$
How do we estimate the relative motion T_k between images I_{k-1} and I_k?
โ€œAn Invitation to 3D Visionโ€, Ma, Soatto, Kosecka, Sastry, Springer, 2003
๐‘‡๐‘˜
Direct Image Alignment
It minimizes the per-pixel intensity difference:

$$T_{k,k-1} = \arg\min_T \sum_i \left\| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \right\|^2_\sigma$$

➤ Dense: DTAM [Newcombe et al. '11], 300,000+ pixels
➤ Semi-dense: LSD [Engel et al. 2014], ~10,000 pixels
➤ Sparse: SVO [Forster et al. 2014, TRO'16], 100-200 features × 4×4 patch ≈ 2,000 pixels

Irani & Anandan, "All About Direct Methods," Vision Algorithms: Theory and Practice, Springer, 2000
Feature-based methods
1. Extract & match features (+ RANSAC)
2. Minimize the reprojection error:

$$T_{k,k-1} = \arg\min_T \sum_i \left\| \mathbf{u}'_i - \pi(\mathbf{p}_i) \right\|^2_\Sigma$$

Direct methods
1. Minimize the photometric error (see the sketch below):

$$T_{k,k-1} = \arg\min_T \sum_i \left\| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \right\|^2_\sigma$$

where $\mathbf{u}'_i = \pi\!\left(T \cdot \pi^{-1}(\mathbf{u}_i)\, d_i\right)$, i.e., pixel $\mathbf{u}_i$ is back-projected with its depth $d_i$, transformed by $T$, and re-projected into the new image.

[Jin, Favaro, Soatto '03], [Silveira, Malis, Rives, TRO'08], [Newcombe et al., ICCV '11], [Engel et al., ECCV'14], [Forster et al., ICRA'14]
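To make the warping function above concrete, here is a minimal numpy sketch (my illustration under simplifying assumptions: pinhole camera without distortion, known per-pixel depth for I_{k-1}, nearest-neighbor intensity lookup) that evaluates the photometric cost for one candidate pose:

```python
import numpy as np

def photometric_cost(I_k, I_km1, depth, K, T):
    """Evaluate sum_i ||I_k(u'_i) - I_{k-1}(u_i)||^2 for one candidate 4x4 pose T.
    depth: per-pixel depth map of I_{k-1}; K: 3x3 camera intrinsics."""
    h, w = I_km1.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)])   # homogeneous pixels
    P = np.linalg.inv(K) @ pix * depth.ravel()                # back-project: pi^-1(u) * d
    P = T[:3, :3] @ P + T[:3, 3:4]                            # transform into frame k
    p = K @ P                                                 # project: pi(...)
    u2 = np.round(p[0] / p[2]).astype(int)
    v2 = np.round(p[1] / p[2]).astype(int)
    ok = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h) & (p[2] > 0)
    r = I_k[v2[ok], u2[ok]] - I_km1.ravel()[ok]               # photometric residuals
    return np.sum(r ** 2)
```

A real direct method minimizes this cost over T with Gauss-Newton on an analytic Jacobian; the sketch only evaluates it.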
Feature-based methods (extract & match features + RANSAC, then minimize the reprojection error):
✓ Large frame-to-frame motions
✓ Accuracy: efficient optimization of structure and motion (Bundle Adjustment)
✗ Slow due to costly feature extraction and matching
✗ Matching outliers (RANSAC)

Direct methods (minimize the photometric error):
✓ All information in the image can be exploited (precision, robustness)
✓ Increasing the camera frame rate reduces the computational cost per frame
✗ Limited frame-to-frame motion
✗ Joint optimization of dense structure and motion too expensive

[Jin, Favaro, Soatto '03], [Silveira, Malis, Rives, TRO'08], [Newcombe et al., ICCV '11], [Engel et al., ECCV'14], [Forster et al., ICRA'14]
VO computes the camera path incrementally (pose after pose):
Image sequence → Feature detection → Feature matching (tracking) → Motion estimation (2D-2D / 3D-3D / 3D-2D) → Local optimization
The first stages form the front-end; the local optimization is the back-end.
➤ The front-end is responsible for:
  ▪ Feature extraction, matching, and outlier removal
  ▪ Loop closure detection
➤ The back-end is responsible for pose and structure optimization (e.g., iSAM, g2o, Google Ceres)
Within this pipeline, feature detection and matching produce feature tracks across the image sequence. Example feature tracks:
Feature Extraction
➤ A corner is defined as the intersection of one or more edges
  ▪ A corner has high localization accuracy
  ▪ Corner detectors are good for VO
  ▪ It is less distinctive than a blob
  ▪ E.g., Harris, Shi-Tomasi, SUSAN, FAST
➤ A blob is any other image pattern that is not a corner and that significantly differs from its neighbors in intensity and texture
  ▪ It has lower localization accuracy than a corner
  ▪ Blob detectors are better for place recognition
  ▪ It is more distinctive than a corner
  ▪ E.g., MSER, LoG, DoG (SIFT), SURF, CenSurE
➤ Descriptor: a distinctive feature identifier
  ▪ Standard descriptor: squared patch of pixel intensity values
  ▪ Gradient or difference-based descriptors: SIFT, SURF, ORB, BRIEF, BRISK
Which of the patches below can be matched reliably?
➤ How do we identify corners?
➤ We can easily recognize a corner by looking through a small window
➤ Shifting the window in any direction should give a large change in intensity in at least 2 directions:
  ▪ "flat" region: no intensity change
  ▪ "edge": no change along the edge direction
  ▪ "corner": significant change in at least 2 directions (a minimal response check is sketched below)
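For instance, a few lines of OpenCV compute the Harris corner response and threshold it; `frame.png` is a hypothetical input image:

```python
import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# Harris response R per pixel: large R where intensity changes in 2 directions
R = cv2.cornerHarris(np.float32(img), blockSize=3, ksize=3, k=0.04)
corners = np.argwhere(R > 0.01 * R.max())   # keep only strong responses
print(f"{len(corners)} corner candidates")
```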
FAST: Features from Accelerated Segment Test [Rosten et al., PAMI 2010]
➤ Examines the intensity of pixels on a circle around a candidate pixel C
➤ C is a FAST corner if a set of N contiguous pixels on the circle are:
  ▪ all brighter than intensity_of(C) + threshold, or
  ▪ all darker than intensity_of(C) - threshold
• Typical FAST mask: test for 9 contiguous pixels in a 16-pixel circle
• Very fast detector: on the order of 100 Megapixels/second
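FAST ships with OpenCV; a minimal usage sketch (the image file is a placeholder):

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# Segment test with non-maximum suppression; threshold as described above
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(img, None)
print(f"{len(keypoints)} FAST corners")
```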
SIFT detector (location + scale)
SIFT responds to local regions that look like a Difference of Gaussians (≈ Laplacian of Gaussian):

$$DoG = G(x, y, k\sigma) - G(x, y, \sigma) \approx LoG$$

SIFT keypoints are local extrema in both location and scale of the DoG:
• Detect maxima and minima of the Difference-of-Gaussians in scale space
• Each point is compared to its 8 neighbors in the current image and 9 neighbors each in the scales above and below
• For each maximum or minimum found, the output is the location and the scale.
SIFT descriptor
• Scale Invariant Feature Transform, invented by David Lowe [IJCV, 2004]
• Descriptor computation:
  - Divide the patch into 4×4 sub-patches: 16 cells
  - Compute a histogram of gradient orientations (8 reference angles) for all pixels inside each sub-patch
  - Resulting SIFT descriptor: 4×4×8 = 128 values
  - Descriptor matching: Euclidean distance between descriptor vectors (i.e., SSD)
David G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.
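With OpenCV (≥ 4.4, where SIFT lives in the main module), detection, description, and L2 matching look like this; the image files are placeholders:

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
# Euclidean (L2) distance matching, as described above
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
print(f"{len(matches)} SIFT matches")
```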
BRIEF descriptor [Calonder et al., ECCV 2010]
• Binary Robust Independent Elementary Features
• Goal: high speed (in description and matching)
• Binary descriptor formation:
  • Smooth the image
  • For each detected keypoint (e.g., FAST), sample 256 intensity pairs p = (p1, p2) within a squared patch around the keypoint
  • For each pair p: if I(p1) < I(p2), set bit p of the descriptor to 1, else set it to 0
• The sampling pattern is generated randomly only once; then the same pattern is used for all patches
• Not scale/rotation invariant
• Allows very fast Hamming-distance matching: count the number of bits that differ between the matched descriptors
Calonder, Lepetit, Strecha, Fua, BRIEF: Binary Robust Independent Elementary Features, ECCV'10
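The idea fits in a few lines of numpy; this is my sketch (fixed random offsets, keypoints assumed at least 15 px from the border), not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random test-pair pattern, generated once and reused for all patches (as above):
# 256 pairs of (dy, dx) offsets within a 31x31 patch
PAIRS = rng.integers(-15, 16, size=(256, 2, 2))

def brief(img_smoothed, kp):
    """256-bit BRIEF descriptor for keypoint kp = (row, col) on a pre-smoothed image."""
    y, x = kp
    a = img_smoothed[y + PAIRS[:, 0, 0], x + PAIRS[:, 0, 1]]
    b = img_smoothed[y + PAIRS[:, 1, 0], x + PAIRS[:, 1, 1]]
    return a < b                          # boolean vector = binary descriptor

def hamming(d1, d2):
    """Number of differing bits between two descriptors."""
    return np.count_nonzero(d1 != d2)
```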
ORB descriptor [Rublee et al., ICCV 2011]
➤ Oriented FAST and Rotated BRIEF
➤ Alternative to SIFT or SURF, designed for fast computation
➤ Keypoint detector based on FAST
➤ BRIEF descriptors are steered according to the keypoint orientation (to provide rotation invariance)
➤ Good binary features are learned by minimizing the correlation on a set of training patches.
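A typical OpenCV usage sketch (image files are placeholders); since ORB descriptors are binary, matching uses the Hamming distance:

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)   # best matches first
print(f"{len(matches)} ORB matches")
```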
BRISK descriptor [Leutenegger, Chli, Siegwart, ICCV 2011]
• Binary Robust Invariant Scalable Keypoints
• Detects corners in scale-space using FAST
• Rotation and scale invariant
• Binary, formed by pairwise intensity comparisons (like BRIEF)
• A pattern defines the intensity comparisons in the keypoint neighborhood
  • Red circles: size of the smoothing kernel applied
  • Blue circles: smoothed pixel value used
• Short- and long-distance pairs are compared for orientation assignment and descriptor formation
• Detection and descriptor speed: ~10 times faster than SURF
• Slower than BRIEF, but scale- and rotation-invariant

ORB & BRISK:
• 128-to-256-bit binary descriptors
• Fast to extract and match (Hamming distance)
• Good for relocalization and loop detection
• Multi-scale detection → the same point appears on several scales
Detector   | Descriptor | Accuracy | Relocalization & loop closing | Efficiency
-----------|------------|----------|-------------------------------|-----------
Harris     | Patch      | ++++     | -                             | +++
Shi-Tomasi | Patch      | ++++     | -                             | +++
SIFT       | SIFT       | ++       | ++++                          | +
SURF       | SURF       | ++       | ++++                          | ++
FAST       | BRIEF      | ++++     | +++                           | ++++
ORB        | ORB        | ++++     | +++                           | ++++
FAST       | BRISK      | ++++     | +++                           | ++++
Motion estimation: given the matched features, estimate the relative transformations T_{k,k-1}, T_{k+1,k}, ... between the camera poses C_{k-1}, C_k, C_{k+1}, ...
Motion from Image Feature Correspondences (2D-2D)
➤ Both feature points f_{k-1} and f_k are specified in 2D
➤ The minimal-case solution involves 5 point correspondences
➤ The solution is found by minimizing the reprojection error
➤ Popular algorithms: 8- and 5-point algorithms [Hartley'97, Nistér'06] (see the sketch below)
Motion from 3D Structure and Image Correspondences (3D-2D)
➤ f_{k-1} is specified in 3D and f_k in 2D
➤ This problem is known as camera resection or PnP (Perspective from n Points)
➤ The minimal-case solution involves 3 correspondences (+1 for disambiguating the 4 solutions)
➤ The solution is found by minimizing the reprojection error
➤ Popular algorithms: P3P [Gao'03, Kneip'11] (see the sketch below)
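A hedged OpenCV sketch (P3P as the minimal solver inside RANSAC; inputs are assumed given, with no lens distortion):

```python
import cv2
import numpy as np

def relative_motion_3d2d(pts3d, pts2d, K):
    """pts3d: Nx3 triangulated landmarks; pts2d: Nx2 observations in the new frame."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_P3P)
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> rotation matrix
    return R, tvec, inliers
```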
Motion from 3D-3D Point Correspondences (point cloud registration)
➤ Both f_{k-1} and f_k are specified in 3D. To do this, it is necessary to triangulate the 3D points (e.g., use a stereo camera)
➤ The minimal-case solution involves 3 non-collinear correspondences
➤ The solution is found by minimizing the 3D-3D Euclidean distance
➤ Popular algorithms: [Arun'87] for global registration (sketched below); ICP for local refinement or Bundle Adjustment (BA)
Type of correspondences | Monocular | Stereo
2D-2D                   | X         | X
3D-3D                   |           | X
3D-2D                   | X         | X
Typical visual odometry pipeline used in many algorithms [Nistér'04, PTAM'07, LIBVISO'08, LSD-SLAM'14, SVO'14, ORB-SLAM'15]:
Keyframe 1 + Keyframe 2 → initial point cloud (bootstrapping); current frame → new triangulated points → new keyframe
➤ When frames are taken at nearby positions relative to the scene distance, 3D points will exhibit large depth uncertainty:
Small baseline → large depth uncertainty; large baseline → small depth uncertainty
➤ When frames are taken at nearby positions relative to the scene distance, 3D points will exhibit large depth uncertainty
➤ One way to avoid this is to skip frames until the average uncertainty of the 3D points decreases below a certain threshold. The selected frames are called keyframes
➤ Rule of thumb (a minimal sketch follows): add a keyframe when

    keyframe distance / average depth > threshold (~10-20%)
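The rule of thumb in code; a sketch, with the 15% threshold an assumption within the stated 10-20% range:

```python
import numpy as np

def need_new_keyframe(t_cur, t_last_kf, depths, thresh=0.15):
    """New keyframe when the ratio of baseline (distance to the last keyframe)
    to average scene depth exceeds the threshold.
    t_cur, t_last_kf: 3D camera positions; depths: depths of tracked points."""
    baseline = np.linalg.norm(t_cur - t_last_kf)
    return baseline / np.mean(depths) > thresh
```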
➤ Matched points are usually contaminated by outliers
➤ Causes of outliers: image noise, occlusions, blur, changes in viewpoint and illumination
➤ For the camera motion to be estimated accurately, outliers must be removed
➤ This is the task of robust estimation
Example, before vs. after removing the outliers:
➤ Error at the loop closure: 6.5 m
➤ Error in orientation: 5 deg
➤ Trajectory length: 400 m
Outliers can be removed using RANSAC [Fischler & Bolles, 1981]; a generic skeleton is sketched below.
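A hedged, model-agnostic RANSAC skeleton (my sketch; `data` is assumed to be a numpy array of correspondences, and `fit`/`residuals` are supplied by the caller):

```python
import numpy as np

def ransac(data, fit, residuals, n_min, thresh, p=0.999, w=0.5):
    """Generic RANSAC [Fischler & Bolles, 1981]. fit(sample) -> model;
    residuals(model, data) -> per-point errors; w: assumed inlier ratio;
    p: desired probability of drawing one all-inlier minimal sample."""
    n_iter = int(np.ceil(np.log(1 - p) / np.log(1 - w ** n_min)))  # iteration bound
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(data), dtype=bool)
    for _ in range(n_iter):
        sample = data[rng.choice(len(data), n_min, replace=False)]
        inliers = residuals(fit(sample), data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if not best_inliers.any():
        return None, best_inliers
    return fit(data[best_inliers]), best_inliers   # refit on all inliers
```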
Local optimization (back-end)
The front-end computes the camera path incrementally (pose after pose), producing the poses C_0, C_1, ..., C_n; the back-end refines them.
Pose-Graph Optimization
➤ So far we assumed that the transformations are between consecutive frames
➤ Transformations can also be computed between non-adjacent frames T_{ij} (e.g., when features from previous keyframes are still observed). They can be used as additional constraints to improve the camera poses by minimizing the following (see the sketch below):

$$\sum_i \sum_j \left\| C_i - T_{ij}\, C_j \right\|^2$$

➤ For efficiency, only the last m keyframes are used
➤ Gauss-Newton or Levenberg-Marquardt are typically used to minimize it. For large graphs, efficient open-source tools exist: g2o, GTSAM, Google Ceres
Bundle Adjustment (BA)
➤ Similar to pose-graph optimization, but it also optimizes the 3D points:

$$X^i, C_k = \arg\min_{X^i, C_k} \sum_{i,k} \rho_H\!\left( \mathbf{p}^i_k - \pi(X^i, C_k) \right)$$

➤ ρ_H() is a robust cost function (e.g., the Huber cost) to downweight wrong matches
➤ In order not to get stuck in local minima, the initialization should be close to the minimum
➤ Gauss-Newton or Levenberg-Marquardt can be used
➤ Very costly: e.g., with 1k images and 100k points, 1 s per LM iteration. For large graphs, efficient open-source software exists: GTSAM, g2o, Google Ceres
➤ BA is more precise than pose-graph optimization because it adds additional constraints (landmark constraints)
➤ But it is more costly: O((qM + lN)^3), with M and N the number of points and camera poses, and q and l the number of parameters per point and per camera pose. Workarounds:
  ▪ A small window size limits the number of parameters in the optimization and thus makes real-time bundle adjustment possible.
  ▪ The computational complexity can be reduced by optimizing only over the camera parameters while keeping the 3D landmarks fixed (motion-only BA).
➤ Relocalization problem:
  ▪ During VO, tracking can be lost (due to occlusions, low texture, quick motion, illumination change)
  ▪ Solution: re-localize the camera pose and continue
➤ Loop closing problem:
  ▪ When you go back to a previously mapped area:
    - Loop detection: to avoid map duplication
    - Loop correction: to compensate for the accumulated drift
  ▪ In both cases you need a place recognition technique
Visual Place Recognition
➤ Goal: find the images most similar to a query image in a database of N images
➤ Complexity: N²·M²/2 feature comparisons (worst-case scenario)
  ▪ Each image must be compared with all other images!
  ▪ N is the number of all images collected by a robot; M is the number of features per image
  ▪ Example: 1 image per meter of travelled distance over a 100 m² house with one robot and 100 features per image → M = 100, N = 100 → N²M²/2 ≈ 50 million feature comparisons!
Solution: use an inverted file index! Complexity reduces to N·M
["Video Google", Sivic & Zisserman, ICCV'03]
["Scalable Recognition with a Vocabulary Tree", Nister & Stewenius, CVPR'06]
See also FAB-MAP and Galvez-Lopez'12 (DBoW2)
Indexing local features: inverted file index
➤ For text documents, an efficient way to find all pages on which a word occurs is to use an index
➤ We want to find all images in which a feature occurs
➤ To use this idea, we need to map our features to "visual words" (a toy sketch follows the next slide)
Building the visual vocabulary: image collection → extract features → cluster the descriptors in descriptor space; the cluster centers are the visual words.
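A toy sketch of the inverted file index (my illustration; real systems use a vocabulary tree such as DBoW2). Each image is assumed already quantized into a set of visual-word IDs, so a query only touches the images that share words with it:

```python
from collections import defaultdict

index = defaultdict(set)          # visual word id -> set of image ids

def add_image(image_id, word_ids):
    for w in word_ids:
        index[w].add(image_id)

def query(word_ids):
    """Vote for database images sharing visual words with the query."""
    votes = defaultdict(int)
    for w in word_ids:
        for img in index[w]:
            votes[img] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

add_image(0, {3, 17, 42})
add_image(1, {3, 99})
print(query({3, 42}))   # -> [(0, 2), (1, 1)]
```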
Limitations of VO-SLAM systems
➤ Limitations:
  ▪ Monocular (i.e., absolute scale is unknown)
  ▪ Requires a reasonably illuminated area
  ▪ Motion blur
  ▪ Needs texture: will fail with large plain walls
  ▪ Map is too sparse for interaction with the environment
➤ Extensions:
  ▪ IMU for robustness and absolute scale estimation
  ▪ Stereo: real scale and more robustness to quick motions
  ▪ Semi-dense or dense mapping for environment interaction
  ▪ Event-based cameras for high-speed motions and HDR environments
  ▪ Learning for improved reliability
Visual-Inertial Fusion
Absolute Scale Determination
➤ The absolute pose x is known up to a scale s, thus

$$x = s\,\tilde{x}$$

➤ The IMU provides accelerations, thus

$$v(t) = v_0 + \int a(t)\,dt$$

➤ Differentiating the first equation and equating the two:

$$s\,\dot{\tilde{x}} = v_0 + \int a(t)\,dt$$

➤ As shown in [Martinelli, TRO'12], for 6DOF motion both s and v_0 can be determined in closed form from a single feature observation and 3 views (a toy sketch follows)
➤ This is used to initialize the absolute scale [Kaiser, ICRA'16]
➤ The scale can then be tracked with:
  ▪ EKF [Mourikis & Roumeliotis, IJRR'10], [Weiss, JFR'13]
  ▪ Non-linear optimization methods [Leutenegger, RSS'13], [Forster, RSS'15]
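As a toy, 1-D illustration of the relation above (my sketch, not the closed-form solution of [Martinelli, TRO'12]), s and v_0 can be recovered by linear least squares from a few visual velocities and IMU velocity integrals:

```python
import numpy as np

def estimate_scale(xdot_vo, A):
    """Solve s * xdot_vo[j] - v0 = A[j] for (s, v0), where xdot_vo are
    up-to-scale visual velocities and A[j] = integral of a(t) dt at time t_j."""
    M = np.column_stack([xdot_vo, -np.ones_like(xdot_vo)])
    theta, *_ = np.linalg.lstsq(M, A, rcond=None)
    return theta[0], theta[1]   # s, v0

s, v0 = estimate_scale(np.array([0.5, 0.6, 0.7]), np.array([1.0, 1.2, 1.4]))
print(s, v0)   # -> s = 2.0, v0 = 0.0
```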
Visual-Inertial Fusion [RSS'15]
➤ Fusion solved as a non-linear optimization problem over IMU residuals and reprojection residuals
➤ Increased accuracy over filtering methods
➤ Comparison with previous works (Google Tango, OKVIS): accuracy of 0.1% of the travel distance
Forster, Carlone, Dellaert, Scaramuzza, IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation, Robotics: Science and Systems'15, Best Paper Award Finalist
Video: https://youtu.be/CsJkci5lfco
Open source: SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam (instructions here: http://arxiv.org/pdf/1512.02363)
Open-source
VO & VSLAM algorithms
Intro: visual odometry algorithms
➤ Popular visual odometry and SLAM algorithms:
  ▪ ORB-SLAM (University of Zaragoza, 2015)
    - ORB-SLAM2 (2016) supports stereo and RGB-D cameras
  ▪ LSD-SLAM (Technical University of Munich, 2014)
  ▪ DSO (Technical University of Munich, 2016)
  ▪ SVO (University of Zurich, 2014/2016)
    - SVO 2.0 (2016) supports wide-angle, stereo, and multiple cameras
ORB-SLAM
Large-scale Feature-based SLAM
[Mur-Artal, Montiel, Tardos, TRO'15]
➤ It combines all together:
  ▪ Tracking
  ▪ Mapping
  ▪ Loop closing
  ▪ Relocalization (DBoW)
  ▪ Final optimization
➤ ORB: FAST corner + Oriented Rotated BRIEF descriptor
  ▪ Binary descriptor
  ▪ Very fast to compute and compare
➤ Real-time (30 Hz)
ORB-SLAM: overview

ORB-SLAM: ORB feature
➤ ORB: Oriented FAST and Rotated BRIEF
  ▪ 256-bit binary descriptor
  ▪ Fast to extract and match (Hamming distance)
  ▪ Good for tracking, relocalization, and loop detection
  ▪ Multi-scale detection: the same point appears on several scales
ORB-SLAM: tracking
➤ For every new frame:
  ▪ First track w.r.t. the last frame: find matches from the last frame in the new frame → PnP
  ▪ Then track w.r.t. the local map: find matches from local keyframes in the new frame → PnP
ORB-SLAM: mapping
➤ Map representation:
  ▪ Keyframe poses
  ▪ Points: 3D positions, descriptor, observations in frames
➤ Functions of the mapping part:
  ▪ Triangulate new points
  ▪ Remove redundant keyframes/points
  ▪ Optimize poses and points
Q: why do we need keyframes instead of just using points?
ORB-SLAM: video
LSD-SLAM
Large-scale Semi-Dense SLAM
[Engel, Schoeps, Cremers, ECCV'14]
➤ Direct (photometric error) + semi-dense formulation:
  ▪ 3D geometry represented as semi-dense depth maps
  ▪ Optimizes a photometric error
  ▪ Separately optimizes poses (direct image alignment) and geometry (pixel-wise filtering)
➤ Includes:
  ▪ Loop closing
  ▪ Relocalization
  ▪ Final optimization
➤ Real-time (30 Hz)

LSD-SLAM: overview
➤ Direct image alignment
➤ Depth refinement and regularization
Instead of using features, LSD-SLAM uses pixels with large gradients.
LSD-SLAM: direct image alignment
➤ New frame aligned w.r.t. the last keyframe; keyframe aligned w.r.t. the global map
• Finds the pose that minimizes the photometric error r_p over all selected pixels, weighted by the photometric covariance
• Also minimizes a geometric error: the distance between points in the current keyframe and points in the global map
LSD-SLAM: depth refinement/regularization
➤ Depth estimation via per-pixel stereo:
  ▪ Using the estimated pose from image alignment, stereo matching can be performed for each pixel
  ▪ The stereo matching result is used to refine the depth
➤ Regularization:
  ▪ Average using adjacent depths
  ▪ Remove outliers and spurious estimations: visually appealing

LSD-SLAM: video
DSO
Direct Sparse Odometry
[Engel, Koltun, Cremers, arXiv'16]
DSO: tracking front-end
➤ Direct image alignment:
  • Uses points with large gradients
  • Incorporates photometric correction: robust to exposure-time changes
  • Uses the exposure times t_i, t_j to compensate for exposure-time changes
  • Uses an affine transformation if no exposure time is known
DSO: optimization back-end
➤ Sliding-window estimator:
  ▪ Not full bundle adjustment
  ▪ Only keeps a fixed-length window (e.g., 3 keyframes) of past frames
  ▪ Instead of simply dropping the states that leave the window, those states are marginalized
➤ Advantages:
  • Helps improve accuracy
  • Still able to operate in real time

DSO: video
SVO
Fast, Semi-Direct Visual Odometry
[Forster, Pizzoli, Scaramuzza, ICRA'14, TRO'16]
SVO: overview
➤ Direct (minimizes photometric error):
  • Frame-to-frame motion estimation
  • Frame-to-keyframe pose refinement
➤ Feature-based (minimizes reprojection error):
  • Corners and edgelets
➤ Mapping:
  • Probabilistic depth estimation
➤ Extensions:
  • Omnidirectional cameras
  • Multi-camera systems
  • IMU pre-integration
  • Dense reconstruction → REMODE
SVO: Semi-Direct Visual Odometry [ICRA'14]
➤ Direct:
  • Frame-to-frame motion estimation
  • Frame-to-keyframe pose refinement
➤ Feature-based mapping:
  ▪ Feature extraction only for every keyframe
  ▪ Probabilistic depth estimation of 3D points
Probabilistic depth estimation in SVO
Depth filter:
• A depth filter is instantiated for every new feature
• Recursive Bayesian depth estimation
• Epipolar search using ZMSSD
The measurement likelihood models outliers:
• Two-dimensional distribution: depth d and inlier ratio ρ
• Mixture of a Gaussian and a uniform distribution
• Parameterized in inverse depth
➤ Based on the model by Vogiatzis & Hernandez (2011), but with inverse depth
➤ Modeling the full posterior over (d, ρ) as a dense 2D histogram would be very expensive; instead, it is approximated by a parametric model that describes the pixel depth at time k.
SVO: video: https://youtu.be/hR8uq1RTUfA

SVO for autonomous drone navigation
RMS error: 5 mm, height: 1.5 m (down-looking camera)
Faessler, Fontana, Forster, Mueggler, Pizzoli, Scaramuzza, Autonomous, Vision-based Flight and Live Dense 3D Mapping with a Quadrotor Micro Aerial Vehicle, Journal of Field Robotics, 2015.
Speed: 4 m/s, height: 1.5 m (down-looking camera)
Videos: https://www.youtube.com/watch?v=4X6Voft4Z_0 and https://youtu.be/fXy4P3nvxHQ
SVO on 4 fisheye cameras from the AUDI dataset
[Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
Video: https://www.youtube.com/watch?v=gr00Bf0AP1k

Processing times of SVO [Forster et al., TRO'16]
• Laptop (Intel i7, 2.8 GHz): 400 frames per second
• Embedded ARM Cortex-A9, 1.7 GHz: up to 70 frames per second

Source code
➤ Open source, available at: github.com/uzh-rpg/rpg_svo
➤ Works with and without ROS
➤ Closed-source professional edition (SVO 2.0): available for companies
Summary: feature-based vs. direct

Feature-based (ORB-SLAM, part of SVO/DSO):
1. Feature extraction
2. Feature matching
3. RANSAC + P3P
4. Reprojection error minimization:

$$T_{k,k-1} = \arg\min_T \sum_i \left\| \mathbf{u}'_i - \pi(\mathbf{p}_i) \right\|^2$$

✓ Large frame-to-frame motions
✗ Slow (20-30 Hz) due to costly feature extraction and matching
✗ Not robust to high-frequency and repetitive texture
✗ Outliers

Direct approaches (LSD, DSO, SVO):
1. Minimize photometric error:

$$T_{k,k-1} = \arg\min_T \sum_i \left\| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \right\|^2$$

✓ Every pixel in the image can be exploited (precision, robustness)
✓ Increasing the camera frame rate reduces the computational cost per frame
✗ Limited to small frame-to-frame motion
Comparison among SVO, DSO, ORB-SLAM, LSD-SLAM [Forster, TRO'16]
➤ For a thorough evaluation, please refer to the [Forster, TRO'16] paper, where all these algorithms are evaluated in terms of accuracy against ground truth and timing on several datasets: EUROC, TUM-RGB-D, ICL-NUIM (accuracy and timing plots for the EUROC dataset are reported there)
[Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
Review of direct image alignment
It minimizes the per-pixel intensity difference:

$$T_{k,k-1} = \arg\min_T \sum_i \left\| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \right\|^2_\sigma$$

➤ Dense: DTAM [Newcombe et al. '11], 300,000+ pixels
➤ Semi-dense: LSD [Engel et al. 2014], ~10,000 pixels
➤ Sparse: SVO [Forster et al. 2014, TRO'16], 100-200 features × 4×4 patch ≈ 2,000 pixels

Irani & Anandan, "All About Direct Methods," Vision Algorithms: Theory and Practice, Springer, 2000
➤ Goal: study the magnitude of the perturbation from which image-to-model alignment is able to converge, as a function of the distance to the reference image
➤ The performance in this experiment is a measure of robustness: successful pose estimation from large initial perturbations shows that the algorithm can deal with rapid camera motions
➤ 1000 Blender simulations
➤ Alignment is considered converged when the estimated relative pose is closer than 0.1 m to ground truth
➤ Result: the difference between semi-dense and dense image alignment is marginal. This is because pixels that exhibit no intensity gradient are not informative for the optimization (their Jacobians are zero).
  ◦ Using all pixels becomes useful only when considering motion blur and image defocus
(Plot: convergence rate [%] vs. distance between frames, down to 30% image overlap)
[Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
Summary: keyframe-based vs. filter-based methods
➤ Why the parallel (tracking + mapping) structure in all these algorithms?
  ▪ Mapping is often expensive: local BA, loop detection and graph optimization, one depth filter per feature
  ▪ Tracking uses the best map available in real time [1]
➤ Why not filter-based methods?
  ▪ Keyframe-based methods give more accuracy per unit of computing time [2]
  ▪ Filters are still useful in visual-inertial fusion: MSCKF, ROVIO
[1] Klein and Murray, "Parallel Tracking and Mapping for Small AR Workspaces."
[2] Strasdat, Montiel, Davison, "Visual SLAM: Why Filter?"
Error Propagation
➤ The uncertainty of the camera pose C_k is a combination of the uncertainty at C_{k-1} (black solid ellipse) and the uncertainty of the transformation T_k (gray dashed ellipse):

$$C_k = f(C_{k-1}, T_k)$$

➤ To first order, the combined covariance is

$$\Sigma_k = J_C\, \Sigma_{k-1}\, J_C^\top + J_T\, \Sigma_{T_k}\, J_T^\top$$

where $J_C$ and $J_T$ are the Jacobians of $f$ with respect to $C_{k-1}$ and $T_k$ (see the sketch below).
➤ The camera-pose uncertainty always increases when concatenating transformations. Thus, it is important to keep the uncertainties of the individual transformations small.
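A minimal sketch of this first-order propagation (my illustration, for a planar pose (x, y, θ) composed with a relative motion (dx, dy, dθ) expressed in the previous pose's frame):

```python
import numpy as np

def propagate(state, Sigma_prev, delta, Sigma_T):
    x, y, th = state
    dx, dy, dth = delta
    c, s = np.cos(th), np.sin(th)
    # Jacobian of the composition w.r.t. the previous pose ...
    Jc = np.array([[1, 0, -s * dx - c * dy],
                   [0, 1,  c * dx - s * dy],
                   [0, 0, 1]])
    # ... and w.r.t. the relative transformation
    Jt = np.array([[c, -s, 0],
                   [s,  c, 0],
                   [0,  0, 1]])
    new_state = np.array([x + c * dx - s * dy, y + s * dx + c * dy, th + dth])
    Sigma = Jc @ Sigma_prev @ Jc.T + Jt @ Sigma_T @ Jt.T
    return new_state, Sigma
```

Since both terms are positive semi-definite, the covariance can only grow, matching the statement above.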
Commercial applications of SVO
➤ Autonomous inspection of bridges and power masts; project with Parrot on autonomous vision-based navigation (Albris drone). Video: https://youtu.be/mYKrR8pihAQ
➤ Dacuda VR solutions: fully immersive virtual reality with 6-DoF for VR and AR content (running on an iPhone), powered by SVO: https://www.youtube.com/watch?v=k0MLs5mqRNo
➤ Dacuda 3DAround iPhone app
➤ Zurich-Eye (www.zurich-eye.com): vision-based localization and mapping solutions for mobile robots. Started in Sep. 2015, became Facebook-Oculus R&D Zurich in Sep. 2016
Event-based Vision

Open problems and challenges in agile robotics
Current flight maneuvers achieved with onboard cameras are still too slow compared with those attainable by birds or FPV drone-race pilots.
▪ At the current state, the agility of a robot is limited by the latency and temporal discretization of its sensing pipeline.
▪ Currently, average robot-vision algorithms have latencies of 50-200 ms. This puts a hard bound on the agility of the platform.
▪ Can we create a low-latency, low-discretization perception pipeline?
  - Yes, if we combine cameras with event-based sensors
(Timeline: frame → computation → command; latency and temporal discretization between frames)
To go faster, we need faster sensors!
[Censi & Scaramuzza, «Low Latency, Event-based Visual Odometry», ICRA'14]
Human vision system
➤ 130 million photoreceptors, but only 2 million axons!

Dynamic Vision Sensor (DVS)
➤ Event-based camera developed by Tobi Delbruck's group (ETH & UZH)
➤ Temporal resolution: 1 µs
➤ High dynamic range: 120 dB
➤ Low power: 20 mW
➤ Cost: 2,500 EUR
[Lichtsteiner, Posch, Delbruck. A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. 2008]
(Image: the solar eclipse of March 2015 captured by a DVS; courtesy of iniLabs)
Camera vs. DVS
▪ A traditional camera outputs frames at fixed time intervals.
▪ By contrast, a DVS outputs asynchronous events at microsecond resolution. An event is generated each time a single pixel detects an intensity change:

$$\text{event} = \left( t,\; x,\; y,\; \operatorname{sign}\!\left(\frac{dI(x,y)}{dt}\right) \right)$$

where the sign is +1 or -1.
Lichtsteiner, Posch, Delbruck. A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. 2008
Camera vs. Dynamic Vision Sensor: video with DVS explanation: http://youtu.be/LauQ6LWTkxM
DVS operating principle [Lichtsteiner, ISCAS'09]
The pixel voltage is proportional to the log intensity, $V = \log I(t)$. ON and OFF events are generated any time a single pixel sees a change in brightness larger than C:

$$|\Delta \log I| \geq C$$

The intensity signal at the event time can be reconstructed by integrating ±C [Cook et al., IJCNN'11], [Kim et al., BMVC'15]. A toy simulation of this principle follows.
[Lichtsteiner, Posch, Delbruck. A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. 2008]
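My sketch of a single DVS pixel: it fires an ON/OFF event whenever its log intensity drifts by more than C from the level at which the last event fired, and the reference is updated by ±C (the same integration used to reconstruct intensity):

```python
import numpy as np

def dvs_events(log_I, timestamps, C=0.15):
    """Simulate one pixel: emit (t, polarity) events from a log-intensity signal."""
    events, ref = [], log_I[0]
    for t, v in zip(timestamps, log_I):
        while abs(v - ref) >= C:
            polarity = 1 if v > ref else -1   # ON (+1) or OFF (-1)
            ref += polarity * C               # integrate +/- C, as described above
            events.append((t, polarity))
    return events

t = np.linspace(0.0, 1.0, 100)
print(len(dvs_events(np.log(1.0 + t), t)))    # brightening ramp -> ON events
```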
Pose tracking and intensity reconstruction from a DVS

Dynamic Vision Sensor (DVS)
Advantages:
• Low latency (~1 µs)
• High dynamic range (120 dB instead of 60 dB)
• Very low bandwidth (only intensity changes are transmitted): ~200 Kb/s
• Low storage capacity, processing time, and power
Disadvantages:
• Requires totally new vision algorithms
• No intensity information (only binary intensity changes)
Lichtsteiner, Posch, Delbruck. A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. 2008
Generative model [Censi & Scaramuzza, ICRA'14]
The generative model tells us that the probability that an event is generated depends on the scalar product between the image gradient ∇I and the apparent motion u Δt:

$$P(e) \propto \langle \nabla I,\; \mathbf{u}\,\Delta t \rangle$$

Generative event model:

$$\langle \nabla \log I,\; \mathbf{u}\,\Delta t \rangle = \pm C$$

[Event-based Camera Pose Tracking using a Generative Event Model, arXiv]
[Censi & Scaramuzza, Low Latency, Event-based Visual Odometry, ICRA'14]

Event-based 6DoF pose estimation results: video: https://youtu.be/iZZ77F-hwzs
Robustness to illumination changes and high-speed motion
BMVC'16: EMVS: Event-based Multi-View Stereo, Best Industry Paper Award. Video: https://www.youtube.com/watch?v=EUX3Tfx0KKE
Possible future sensing architecture
[Censi & Scaramuzza, Low Latency, Event-based Visual Odometry, ICRA'14]

DAVIS: Dynamic and Active-pixel Vision Sensor [Brandli'14]
Combines an event camera with a frame-based camera in the same pixel array: CMOS frames plus DVS events over time.
Brandli, Berner, Yang, Liu, Delbruck, "A 240×180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor." IEEE Journal of Solid-State Circuits, 2014.
Event-based feature tracking [IROS'16]
➤ Extract Harris corners on frames
➤ Track corners using event-based Iterative Closest Points (ICP)

Event-based, sparse visual odometry [IROS'16]
Video: https://youtu.be/RDu5eldW8i8
IROS'16: Low-Latency Visual Odometry using Event-based Feature Tracks, Best Application Paper Award finalist
Conclusions
➤ VO & SLAM theory is well established
➤ The biggest challenges today are reliability and robustness:
  ▪ HDR scenes
  ▪ High-speed motion
  ▪ Low-texture scenes
➤ Which VO/SLAM is best?
  ▪ It depends on the task and on how you measure performance!
    - E.g., VR/AR/MR vs. robotics
➤ 99% of SLAM algorithms are passive: we need active SLAM!
➤ Event cameras open enormous possibilities! Standard cameras have been studied for 50 years!
  ▪ Ideal for high-speed motion estimation and robustness to HDR illumination changes
VO (i.e., no loop closing)
➤ Modified PTAM (feature-based, mono): http://wiki.ros.org/ethzasl_ptam
➤ LIBVISO2 (feature-based, mono and stereo): http://www.cvlibs.net/software/libviso
➤ SVO (semi-direct; mono, stereo, multi-camera): https://github.com/uzh-rpg/rpg_svo
➤ DSO (direct sparse odometry): https://github.com/JakobEngel/dso
VIO
➤ ROVIO (tightly coupled EKF): https://github.com/ethz-asl/rovio
➤ OKVIS (non-linear optimization): https://github.com/ethz-asl/okvis
➤ SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam
  ▪ Instructions here: http://arxiv.org/pdf/1512.02363
VSLAM
➤ ORB-SLAM (feature-based, mono and stereo): https://github.com/raulmur/ORB_SLAM
➤ LSD-SLAM (semi-dense, direct, mono): https://github.com/tum-vision/lsd_slam
➤ GTSAM: https://collab.cc.gatech.edu/borg/gtsam?destination=node%2F299
➤ g2o: https://openslam.org/g2o.html
➤ Google Ceres Solver: http://ceres-solver.org/
VO (i.e., no loop closing)
➤ Modified PTAM (Weiss et al.) (feature-based, mono): http://wiki.ros.org/ethzasl_ptam
➤ SVO (Forster et al.) (semi-direct; mono, stereo, multi-camera): https://github.com/uzh-rpg/rpg_svo
IMU-vision fusion
➤ Multi-Sensor Fusion Package (MSF) (Weiss et al.) (EKF, loosely coupled): http://wiki.ros.org/ethzasl_sensor_fusion
➤ SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam
  ▪ Instructions here: http://arxiv.org/pdf/1512.02363
➤ OKVIS (non-linear optimization): https://github.com/ethz-asl/okvis
➤ Open source: MAVMAP: https://github.com/mavmap/mavmap
➤ Closed source: Pix4D: https://pix4d.com/
➤ DBoW2: https://github.com/dorian3d/DBoW2
➤ FABMAP: http://mrg.robots.ox.ac.uk/fabmap/
VO datasets
➤ Malaga dataset: http://www.mrpt.org/malaga_dataset_2009
➤ KITTI dataset: http://www.cvlibs.net/datasets/kitti/
VIO datasets
These datasets include ground-truth 6-DOF poses from Vicon and synchronized IMU and images:
➤ EUROC MAV dataset (forward-facing stereo): http://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets
➤ RPG-UZH dataset (downward-facing monocular): http://rpg.ifi.uzh.ch/datasets/dalidation.bag
More
➤ Check out also: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
ย 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
ย 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
ย 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
ย 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
ย 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
ย 
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
Model Call Girl in Narela Delhi reach out to us at ๐Ÿ”8264348440๐Ÿ”
ย 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
ย 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
ย 

=iros16tutorial_2.pdf

  • 1. Davide Scaramuzza Visual Odometry and SLAM: past, present, and the robust-perception age
  • 2. References ➢ Scaramuzza, D., Fraundorfer, F., Visual Odometry: Part I - The First 30 Years and Fundamentals, IEEE Robotics and Automation Magazine, Volume 18, issue 4, 2011. PDF ➢ Fraundorfer, F., Scaramuzza, D., Visual Odometry: Part II - Matching, Robustness, and Applications, IEEE Robotics and Automation Magazine, Volume 19, issue 1, 2012. PDF ➢ C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I.D. Reid, J.J. Leonard, Simultaneous Localization And Mapping: Present, Future, and the Robust-Perception Age, IEEE Transactions on Robotics (conditionally accepted), 2016. PDF
  • 3. Outline ➢ Theory ➢ Open Source Algorithms ➢ Event-based Vision
  • 4. Input: an image sequence (or video stream) from one or more cameras attached to a moving vehicle. Output: the camera trajectory (3D structure is a plus). VO is the process of incrementally estimating the pose of the vehicle by examining the changes that motion induces on the images of its onboard cameras.
  • 5. ➢ 1980: first known real-time VO implementation on a robot, in Hans Moravec's PhD thesis (NASA/JPL), for Mars rovers, using one sliding camera (sliding stereo) ➢ 1980 to 2000: VO research was dominated by NASA/JPL in preparation for the 2004 Mars mission (see papers from Matthies, Olson, etc. from JPL) ➢ 2004: VO used on a robot on another planet: Mars rovers Spirit and Opportunity ➢ 2004: VO was revived in the academic environment by David Nistér's «Visual Odometry» paper. The term VO became popular (and Nistér became head of Microsoft HoloLens before moving to Tesla in 2014)
  • 6. ➢ 1980: first known real-time VO implementation on a robot, in Hans Moravec's PhD thesis (NASA/JPL), for Mars rovers, using one sliding camera (sliding stereo) ➢ 1980 to 2000: VO research was dominated by NASA/JPL in preparation for the 2004 Mars mission (see papers from Matthies, Olson, etc. from JPL) ➢ 2004: VO used on a robot on another planet: Mars rovers Spirit and Opportunity ➢ 2004: VO was revived in the academic environment by David Nistér's «Visual Odometry» paper. The term VO became popular (and Nistér became head of Microsoft HoloLens before moving to Tesla in 2014)
  • 7. ➢ Sufficient illumination in the environment ➢ Dominance of static scene over moving objects ➢ Enough texture to allow apparent motion to be extracted ➢ Sufficient scene overlap between consecutive frames. Is any of these scenes good for VO? Why?
  • 8. SFM, VSLAM, VO
  • 9. SFM is more general than VO and tackles the problem of 3D reconstruction and 6DOF pose estimation from unordered image sets. Reconstruction from 3 million images from Flickr.com: a cluster of 250 computers, 24 hours of computation! Paper: "Building Rome in a Day", ICCV'09
  • 10. ➢ VO is a particular case of SFM ➢ VO focuses on estimating the 3D motion of the camera sequentially (as each new frame arrives) and in real time ➢ Terminology: sometimes SFM is used as a synonym of VO
  • 11. ➢ Visual Odometry ▪ focus on incremental estimation / local consistency ➢ Visual SLAM: Simultaneous Localization And Mapping ▪ focus on globally consistent estimation ▪ Visual SLAM = visual odometry + loop detection + graph optimization ➢ The choice between VO and V-SLAM depends on the tradeoff between performance and consistency, and on simplicity of implementation ➢ VO trades off consistency for real-time performance, without the need to keep track of the entire previous history of the camera. Image courtesy of [Clemente et al., RSS'07]
  • 12. 1. Compute the relative motion $T_k$ from image $I_{k-1}$ to image $I_k$, with $T_k = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}$ 2. Concatenate the relative transforms to recover the full trajectory: $C_n = C_{n-1} T_n$ 3. An optimization over the last $m$ poses can be done to refine the trajectory locally (pose-graph optimization or bundle adjustment). A minimal sketch of the concatenation step follows below.
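A minimal NumPy sketch of step 2, chaining hypothetical 4×4 relative transforms; the identity rotations and unit translations are made-up placeholders for the estimated $T_k$:

```python
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform T = [R t; 0 1]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical relative motions T_k (identity rotations, unit steps along x)
relative_motions = [make_T(np.eye(3), np.array([1.0, 0.0, 0.0])) for _ in range(5)]

C = np.eye(4)            # C_0: the world frame
trajectory = [C]
for T_k in relative_motions:
    C = C @ T_k          # C_n = C_{n-1} T_n
    trajectory.append(C)

print(trajectory[-1][:3, 3])   # final camera position: [5. 0. 0.]
```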
  • 13. How do we estimate the relative motion $T_k$? Image $I_{k-1}$, image $I_k$, transform $T_k$. "An Invitation to 3D Vision", Ma, Soatto, Kosecka, Sastry, Springer, 2003
  • 14. Direct Image Alignment: it minimizes the per-pixel intensity difference, $T_{k,k-1} = \arg\min_T \sum_i \| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \|^2_\sigma$. Sparse: SVO [Forster et al. 2014, TRO'16], 100-200 features × 4×4 patch ≈ 2,000 pixels. Semi-dense: LSD [Engel et al. 2014], ~10,000 pixels. Dense: DTAM [Newcombe et al. '11], 300,000+ pixels. Irani & Anandan, "All About Direct Methods," Vision Algorithms: Theory and Practice, Springer, 2000
  • 15. Direct Image Alignment: it minimizes the per-pixel intensity difference, $T_{k,k-1} = \arg\min_T \sum_i \| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \|^2_\sigma$. Sparse: SVO [Forster et al. 2014], 100-200 features × 4×4 patch ≈ 2,000 pixels. Semi-dense: LSD-SLAM [Engel et al. 2014], ~10,000 pixels. Dense: DTAM [Newcombe et al. '11], 300,000+ pixels. Irani & Anandan, "All About Direct Methods," Vision Algorithms: Theory and Practice, Springer, 2000
  • 16. Feature-based methods: 1. Extract & match features (+ RANSAC) 2. Minimize the reprojection error, $T_{k,k-1} = \arg\min_T \sum_i \| \mathbf{u}'_i - \pi(\mathbf{p}_i) \|^2_\Sigma$. Direct methods: 1. Minimize the photometric error, $T_{k,k-1} = \arg\min_T \sum_i \| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \|^2_\sigma$, where $\mathbf{u}'_i = \pi\big(T \cdot \pi^{-1}(\mathbf{u}_i) \cdot d_i\big)$. [Jin, Favaro, Soatto '03], [Silveira, Malis, Rives, TRO'08], [Newcombe et al., ICCV'11], [Engel et al., ECCV'14], [Forster et al., ICRA'14]
  • 17. Feature-based methods (extract & match features + RANSAC, then minimize the reprojection error): ✓ large frame-to-frame motions ✓ accuracy: efficient optimization of structure and motion (bundle adjustment) ✗ slow due to costly feature extraction and matching ✗ matching outliers (RANSAC needed). Direct methods (minimize the photometric error): ✓ all information in the image can be exploited (precision, robustness) ✓ increasing the camera frame-rate reduces the computational cost per frame ✗ limited frame-to-frame motion ✗ joint optimization of dense structure and motion is too expensive. [Jin, Favaro, Soatto '03], [Silveira, Malis, Rives, TRO'08], [Newcombe et al., ICCV'11], [Engel et al., ECCV'14], [Forster et al., ICRA'14]. A sketch of the photometric residuals follows below.
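To make the photometric cost concrete, here is a hypothetical NumPy sketch of the direct residuals $r_i = I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i)$ for a pinhole camera with intrinsics K; it uses nearest-neighbour sampling for brevity, whereas a real system would use sub-pixel interpolation and a robust norm:

```python
import numpy as np

def photometric_residuals(I_k, I_km1, pixels, depths, T, K):
    """Residuals r_i = I_k(u'_i) - I_{k-1}(u_i), with the warp
    u'_i = pi(T * pi^-1(u_i, d_i)) of each pixel through candidate pose T."""
    K_inv = np.linalg.inv(K)
    h, w = I_k.shape
    res = []
    for (u, v), d in zip(pixels, depths):
        p = d * (K_inv @ np.array([u, v, 1.0]))   # back-project to 3D
        p2 = T[:3, :3] @ p + T[:3, 3]             # move into frame k
        uv2 = K @ p2
        u2, v2 = int(uv2[0] / uv2[2]), int(uv2[1] / uv2[2])  # project
        if 0 <= v2 < h and 0 <= u2 < w:
            res.append(float(I_k[v2, u2]) - float(I_km1[v, u]))
    return np.array(res)
```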
  • 18. VO computes the camera path incrementally (pose after pose): image sequence → feature detection → feature matching (tracking) → motion estimation (2D-2D, 3D-3D, 3D-2D) → local optimization. Front-end / back-end
  • 19. ➢ The front-end is responsible for ▪ feature extraction, matching, and outlier removal ▪ loop-closure detection ➢ The back-end is responsible for the pose and structure optimization (e.g., iSAM, g2o, Google Ceres)
  • 20. VO computes the camera path incrementally (pose after pose): image sequence → feature detection → feature matching (tracking) → motion estimation (2D-2D, 3D-3D, 3D-2D) → local optimization. Example feature tracks
  • 21. Feature Extraction
  • 22. ➢ A corner is defined as the intersection of one or more edges ▪ a corner has high localization accuracy ▪ corner detectors are good for VO ▪ it is less distinctive than a blob ▪ e.g., Harris, Shi-Tomasi, SUSAN, FAST ➢ A blob is any other image pattern that is not a corner and that significantly differs from its neighbors in intensity and texture ▪ it has lower localization accuracy than a corner ▪ blob detectors are better for place recognition ▪ it is more distinctive than a corner ▪ e.g., MSER, LoG, DoG (SIFT), SURF, CenSurE ➢ Descriptor: a distinctive feature identifier ▪ standard descriptor: a square patch of pixel intensity values ▪ gradient- or difference-based descriptors: SIFT, SURF, ORB, BRIEF, BRISK
  • 23. Which of the patches below can be matched reliably?
  • 24. ➢ How do we identify corners? ➢ We can easily recognize the point by looking through a small window ➢ Shifting the window in any direction should give a large change in intensity in at least 2 directions. "Flat" region: no intensity change; "edge": no change along the edge direction; "corner": significant change in at least 2 directions
  • 25. [Rosten et al., PAMI 2010] ➢ FAST: Features from Accelerated Segment Test ➢ Studies the intensity of the pixels on a circle around a candidate pixel C ➢ C is a FAST corner if a set of N contiguous pixels on the circle are all brighter than intensity_of(C) + threshold, or all darker than intensity_of(C) − threshold • Typical FAST mask: test for 9 contiguous pixels in a 16-pixel circle • Very fast detector: on the order of 100 megapixels/second
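FAST is available off the shelf in OpenCV; a minimal sketch (the file name is a placeholder for any grayscale frame):

```python
import cv2

img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(img, None)       # segment test on a 16-pixel circle
print(len(keypoints), 'FAST corners')
```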
  • 26. SIFT responds to local regions that look like a Difference of Gaussians (≈ Laplacian of Gaussian) across scales $\sigma, k\sigma, k^2\sigma, \dots$: $\mathrm{DoG} = G(x, y, k\sigma) - G(x, y, \sigma) \approx \mathrm{LoG}$
  • 27. SIFT detector (location + scale). SIFT keypoints: local extrema in both location and scale of the DoG • Detect maxima and minima of the difference-of-Gaussians in scale space • Each point is compared to its 8 neighbors in the current image and 9 neighbors each in the scales above and below • For each maximum or minimum found, the output is the location and the scale
  • 28. SIFT descriptor • Scale Invariant Feature Transform • Invented by David Lowe [IJCV, 2004] • Descriptor computation: divide the patch into 4×4 sub-patches (16 cells); compute a histogram of gradient orientations (8 reference angles) for all pixels inside each sub-patch; the resulting SIFT descriptor has 4×4×8 = 128 values • Descriptor matching: Euclidean distance between the descriptor vectors (i.e., SSD). David G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004
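In OpenCV (≥ 4.4, where SIFT is back in the main module) the detector and the 128-dimensional descriptor come together; a minimal sketch with a placeholder file name:

```python
import cv2

img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kps, desc = sift.detectAndCompute(img, None)
print(desc.shape)        # (num_keypoints, 128): one 128-vector per keypoint
```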
  • 29. [Calonder et al., ECCV 2010] • Binary Robust Independent Elementary Features • Goal: high speed (in description and matching) • Binary descriptor formation: smooth the image; for each detected keypoint (e.g., FAST), sample 256 intensity pairs $p = (p_1, p_2)$ within a square patch around the keypoint; for each pair $p$, if $I(p_1) < I(p_2)$ set bit $p$ of the descriptor to 1, else set it to 0 • The sampling pattern is generated randomly only once; the same pattern is then used for all patches • Not scale/rotation invariant • Allows very fast Hamming-distance matching: count the number of bits that differ between the matched descriptors. [Calonder, Lepetit, Strecha, Fua, BRIEF: Binary Robust Independent Elementary Features, ECCV'10]
  • 30. ➢ Oriented FAST and Rotated BRIEF ➢ Alternative to SIFT or SURF, designed for fast computation ➢ Keypoint detector based on FAST ➢ BRIEF descriptors are steered according to the keypoint orientation (to provide rotation invariance) ➢ Good binary features are learned by minimizing the correlation on a set of training patches. ORB descriptor [Rublee et al., ICCV 2011]
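A minimal sketch of ORB extraction and Hamming-distance matching in OpenCV (file names are placeholders; crossCheck keeps only mutual best matches, a cheap stand-in for a ratio test):

```python
import cv2

img1 = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('frame2.png', cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)   # 256-bit binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors -> Hamming distance (count of differing bits)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```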
  • 31. [Leutenegger, Chli, Siegwart, ICCV 2011] • Binary Robust Invariant Scalable Keypoints • Detects corners in scale space using FAST • Rotation and scale invariant • Binary, formed by pairwise intensity comparisons (like BRIEF) • A pattern defines the intensity comparisons in the keypoint neighborhood: red circles show the size of the smoothing kernel applied, blue circles the smoothed pixel value used • Short- and long-distance pairs are compared for orientation assignment & descriptor formation • Detection and description: ~10 times faster than SURF • Slower than BRIEF, but scale- and rotation-invariant
  • 32. ORB & BRISK: • 128-to-256-bit binary descriptors • Fast to extract and match (Hamming distance) • Good for relocalization and loop detection • Multi-scale detection → the same point appears on several scales.
Detector | Descriptor | Accuracy | Relocalization & loop closing | Efficiency
Harris | Patch | ++++ | - | +++
Shi-Tomasi | Patch | ++++ | - | +++
SIFT | SIFT | ++ | ++++ | +
SURF | SURF | ++ | ++++ | ++
FAST | BRIEF | ++++ | +++ | ++++
ORB | ORB | ++++ | +++ | ++++
FAST | BRISK | ++++ | +++ | ++++
  • 33. VO computes the camera path incrementally (pose after pose): image sequence → feature detection → feature matching (tracking) → motion estimation (2D-2D, 3D-3D, 3D-2D) → local optimization
  • 34. Motion from Image Feature Correspondences ➢ Both feature points $f_{k-1}$ and $f_k$ are specified in 2D ➢ The minimal-case solution involves 5 point correspondences ➢ The solution is found by minimizing the reprojection error ➢ Popular algorithms: the 8- and 5-point algorithms [Hartley'97, Nistér'06]. A sketch follows below.
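A hypothetical end-to-end 2D-2D example with OpenCV: synthesize a scene with a known relative pose, then recover it with the 5-point algorithm inside RANSAC (all numbers are made up for the demo):

```python
import cv2
import numpy as np

# Synthetic scene: 3D points in front of the first camera
rng = np.random.default_rng(0)
P = rng.uniform([-1, -1, 4], [1, 1, 8], size=(100, 3))
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
rvec_true = np.array([0.0, 0.1, 0.0])          # small yaw between the views
t_true = np.array([0.5, 0.0, 0.0])             # baseline along x

pts1, _ = cv2.projectPoints(P, np.zeros(3), np.zeros(3), K, None)
pts2, _ = cv2.projectPoints(P, rvec_true, t_true, K, None)
pts1, pts2 = pts1.reshape(-1, 2), pts2.reshape(-1, 2)

# 5-point algorithm + RANSAC, then cheirality check to pick the right (R, t)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
# t is recovered only up to scale: the monocular scale ambiguity
```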
  • 35. Motion from 3D Structure and Image Correspondences ➢ $f_{k-1}$ is specified in 3D and $f_k$ in 2D ➢ This problem is known as camera resectioning or PnP (Perspective-n-Point) ➢ The minimal-case solution involves 3 correspondences (+1 for disambiguating among the 4 solutions) ➢ The solution is found by minimizing the reprojection error ➢ Popular algorithms: P3P [Gao'03, Kneip'11]. A sketch follows below.
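A hypothetical PnP example in OpenCV, again on synthetic data; solvePnPRansac samples minimal P3P sets and rejects outliers by reprojection error:

```python
import cv2
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))   # known 3D map points
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
rvec_true = np.array([0.0, 0.05, 0.0])
t_true = np.array([0.2, 0.0, 0.1])
u, _ = cv2.projectPoints(X, rvec_true, t_true, K, None)  # 2D observations

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    X, u.reshape(-1, 2), K, None,
    flags=cv2.SOLVEPNP_P3P, reprojectionError=2.0)
R, _ = cv2.Rodrigues(rvec)     # camera pose: x_cam = R x_world + tvec
```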
  • 36. Motion from 3D-3D Point Correspondences (point-cloud registration) ➢ Both $f_{k-1}$ and $f_k$ are specified in 3D; to do this, it is necessary to triangulate the 3D points (e.g., with a stereo camera) ➢ The minimal-case solution involves 3 non-collinear correspondences ➢ The solution is found by minimizing the 3D-3D Euclidean distance ➢ Popular algorithms: [Arun'87] for global registration, ICP for local refinement, or bundle adjustment (BA). A sketch of the closed-form alignment follows below.
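A minimal NumPy sketch of the Arun'87 (SVD-based) closed-form alignment of two corresponding 3D point sets, checked on a made-up rigid motion:

```python
import numpy as np

def align_3d_3d(P, Q):
    """Find R, t minimizing sum_i || q_i - (R p_i + t) ||^2 for
    corresponding Nx3 point sets P, Q (Arun'87 / Umeyama)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1., 1., np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    t = mu_q - R @ mu_p
    return R, t

rng = np.random.default_rng(2)
P = rng.normal(size=(30, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthonormal matrix
R_true *= np.sign(np.linalg.det(R_true))           # force det = +1 (rotation)
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = align_3d_3d(P, Q)
print(np.allclose(R, R_true), np.round(t, 3))      # True [ 1. -2.  0.5]
```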
  • 37.
Type of correspondences | Monocular | Stereo
2D-2D | X | X
3D-3D | — | X
3D-2D | X | X
  • 38. Typical visual odometry pipeline used in many algorithms [Nistér'04, PTAM'07, LIBVISO'08, LSD-SLAM'14, SVO'14, ORB-SLAM'15]: keyframe 1 and keyframe 2 bootstrap the initial point cloud; new points are triangulated as new keyframes are added, and the current frame is tracked against them
  • 39. ➢ When frames are taken at nearby positions compared to the scene distance, 3D points will exhibit large uncertainty. Small baseline → large depth uncertainty; large baseline → small depth uncertainty
  • 40. ➢ When frames are taken at nearby positions compared to the scene distance, 3D points will exhibit large uncertainty ➢ One way to avoid this consists of skipping frames until the average uncertainty of the 3D points decreases below a certain threshold. The selected frames are called keyframes ➢ Rule of thumb: add a keyframe when (keyframe distance) / (average depth) > threshold (~10-20%), as in the sketch below
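A tiny sketch of that keyframe rule; the threshold and the numbers are illustrative only:

```python
import numpy as np

def is_new_keyframe(cam_pos, last_kf_pos, avg_scene_depth, thresh=0.15):
    """Insert a keyframe when the distance to the last keyframe exceeds
    ~10-20% of the average scene depth (rule of thumb from the slide)."""
    return np.linalg.norm(cam_pos - last_kf_pos) / avg_scene_depth > thresh

print(is_new_keyframe(np.array([0.6, 0.0, 0.0]), np.zeros(3), 3.0))
# True: 0.6 m travelled vs. 3 m average depth = 20%
```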
  • 41. ➢ Matched points are usually contaminated by outliers ➢ Causes of outliers: ▪ image noise ▪ occlusions ▪ blur ▪ changes in viewpoint and illumination ➢ For the camera motion to be estimated accurately, the outliers must be removed ➢ This is the task of robust estimation
  • 42. Outliers can be removed using RANSAC [Fischler & Bolles, 1981] ➢ Error at the loop closure: 6.5 m ➢ Error in orientation: 5 deg ➢ Trajectory length: 400 m (before removing the outliers vs. after removing the outliers). A generic RANSAC sketch follows below.
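A generic RANSAC loop in NumPy, exercised on a made-up robust line-fitting problem (OpenCV runs the same scheme internally for E-matrix and PnP estimation):

```python
import numpy as np

def ransac(data, fit, residuals, n_min, thresh, iters=200, seed=0):
    """Generic RANSAC [Fischler & Bolles, 1981]: repeatedly fit a model to a
    random minimal sample and keep the model with the largest inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(data), dtype=bool)
    for _ in range(iters):
        sample = data[rng.choice(len(data), n_min, replace=False)]
        model = fit(sample)
        inliers = residuals(model, data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit(data[best_inliers]), best_inliers   # refit on all inliers

# Toy usage: robust line fit y = a*x + b with 30% gross outliers
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.05, 100)
y[:30] += rng.uniform(-20, 20, 30)                 # corrupt 30 points
data = np.column_stack([x, y])
fit = lambda d: np.polyfit(d[:, 0], d[:, 1], 1)
res = lambda m, d: np.abs(np.polyval(m, d[:, 0]) - d[:, 1])
model, inliers = ransac(data, fit, res, n_min=2, thresh=0.2)
print(model)   # close to [2, 1]
```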
  • 43. VO computes the camera path incrementally (pose after pose): image sequence → feature detection → feature matching (tracking) → motion estimation (2D-2D, 3D-3D, 3D-2D) → local optimization (back-end). Front-end / back-end
  • 44. Pose-Graph Optimization ➢ So far we assumed that the transformations are between consecutive frames ➢ Transformations can also be computed between non-adjacent frames $T_{ij}$ (e.g., when features from previous keyframes are still observed). They can be used as additional constraints to improve the camera poses, by minimizing $\sum_i \sum_j \left\| C_i - T_{ij} C_j \right\|^2$ ➢ For efficiency, only the last $m$ keyframes are used ➢ Gauss-Newton or Levenberg-Marquardt are typically used to minimize it. For large graphs, efficient open-source tools exist: g2o, GTSAM, Google Ceres. A toy example follows below.
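To see why an extra loop edge helps, here is a deliberately tiny 1D pose graph with made-up odometry measurements, solved with SciPy least squares; real systems optimize over SE(3) with g2o/GTSAM/Ceres:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy 1D pose graph: 4 poses on a line, odometry edges + one loop edge.
# Each edge (i, j, z_ij) measures the offset x_j - x_i (hypothetical values).
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 0.9), (0, 3, 3.0)]

def residuals(x_free):
    x = np.concatenate([[0.0], x_free])    # anchor x_0 = 0 to fix the gauge
    return [x[j] - x[i] - z for i, j, z in edges]

sol = least_squares(residuals, x0=np.array([1.0, 2.1, 3.0]))
print(np.round(sol.x, 3))   # corrected poses: the loop edge redistributes drift
```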
  • 45. Bundle Adjustment (BA) ➢ Similar to pose-graph optimization, but it also optimizes the 3D points: $\{X^i, C_k\} = \arg\min_{X^i, C_k} \sum_{i,k} \rho_H\big( \mathbf{p}_k^i - \pi(X^i, C_k) \big)$ ➢ $\rho_H(\cdot)$ is a robust cost function (e.g., the Huber cost) to downweight wrong matches ➢ In order not to get stuck in local minima, the initialization should be close to the minimum ➢ Gauss-Newton or Levenberg-Marquardt can be used ➢ Very costly: e.g., with 1k images and 100k points, about 1 s per LM iteration. For large graphs, efficient open-source software exists: GTSAM, g2o, Google Ceres
  • 46. ➢ BA is more precise than pose-graph optimization because it adds additional constraints (landmark constraints) ➢ But it is more costly: $O\big((qM + lN)^3\big)$, with $M$ and $N$ the number of points and camera poses and $q$ and $l$ the number of parameters for points and camera poses. Workarounds: ▪ a small window size limits the number of parameters in the optimization and thus makes real-time bundle adjustment possible ▪ the computational complexity can be reduced by optimizing only over the camera parameters and keeping the 3D landmarks fixed (motion-only BA)
  • 47. ➢ Relocalization problem: ▪ during VO, tracking can be lost (due to occlusions, low texture, quick motion, illumination changes) ▪ solution: re-localize the camera pose and continue ➢ Loop-closing problem: when you go back to a previously mapped area ▪ loop detection: to avoid map duplication ▪ loop correction: to compensate for the accumulated drift ▪ in both cases you need a place-recognition technique
  • 48. Visual Place Recognition ➢ Goal: find the most similar images to a query image in a database of $N$ images ➢ Complexity: $N^2 M^2 / 2$ feature comparisons (worst-case scenario) ▪ each image must be compared with all other images ▪ $N$ is the number of all images collected by a robot. Example: 1 image per meter of travelled distance over a 100 m² house with one robot and 100 features per image → $M = 100$, $N = 100$ → $N^2 M^2 / 2 \approx 50$ million feature comparisons! Solution: use an inverted file index! The complexity reduces to $N \cdot M$ (a sketch follows below). ["Video Google", Sivic & Zisserman, ICCV'03] ["Scalable Recognition with a Vocabulary Tree", Nistér & Stewénius, CVPR'06]. See also FABMAP and Gálvez-López'12 (DBoW2)
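A toy inverted-file index in Python over hypothetical visual-word ids, showing why the query cost drops: only images sharing a word with the query collect votes:

```python
from collections import defaultdict, Counter

# Hypothetical toy database: image id -> visual-word ids of its features
database = {0: [3, 17, 17, 42], 1: [5, 17, 99], 2: [3, 42, 42, 7]}

inverted = defaultdict(set)          # visual word -> images containing it
for img_id, words in database.items():
    for w in words:
        inverted[w].add(img_id)

def query(words):
    """Vote only over images that share at least one word with the query,
    instead of comparing the query against every image."""
    votes = Counter()
    for w in words:
        for img_id in inverted[w]:
            votes[img_id] += 1
    return votes.most_common()

print(query([17, 42]))               # e.g. [(0, 2), (1, 1), (2, 1)]
```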
  • 49. Indexing local features: inverted file index ➢ For text documents, an efficient way to find all pages on which a word occurs is to use an index ➢ We want to find all images in which a feature occurs ➢ To use this idea, we need to map our features to "visual words"
  • 50. Building the Visual Vocabulary: image collection → extract features → cluster the descriptors in descriptor space. Examples of visual words
  • 51. Limitations of VO/SLAM systems ➢ Limitations ▪ monocular (i.e., the absolute scale is unknown) ▪ requires a reasonably illuminated area ▪ motion blur ▪ needs texture: will fail with large plain walls ▪ the map is too sparse for interaction with the environment ➢ Extensions ▪ IMU for robustness and absolute-scale estimation ▪ stereo: real scale and more robustness to quick motions ▪ semi-dense or dense mapping for environment interaction ▪ event-based cameras for high-speed motions and HDR environments ▪ learning for improved reliability
  • 52. Visual-Inertial Fusion
  • 53. Absolute Scale Determination ➢ The absolute pose $x$ is known up to a scale $s$, thus $x = s\,\tilde{x}$ ➢ The IMU provides accelerations, thus $v = v_0 + \int a(t)\,dt$ ➢ Differentiating the first relation and equating the two gives $s\,\dot{\tilde{x}} = v_0 + \int a(t)\,dt$ ➢ As shown in [Martinelli, TRO'12], for 6DOF motion both $s$ and $v_0$ can be determined in closed form from a single feature observation and 3 views ➢ This is used to initialize the absolute scale [Kaiser, ICRA'16] ➢ The scale can then be tracked with ▪ an EKF [Mourikis & Roumeliotis, IJRR'10], [Weiss, JFR'13] ▪ non-linear optimization methods [Leutenegger, RSS'13], [Forster, RSS'15]
  • 54. Visual-Inertial Fusion [RSS'15] ➢ Fusion solved as a non-linear optimization problem over IMU residuals and reprojection residuals ➢ Increased accuracy over filtering methods. Forster, Carlone, Dellaert, Scaramuzza, IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation, Robotics: Science and Systems'15, Best Paper Award finalist
  • 55. Comparison with Previous Works: Google Tango, proposed, OKVIS. Accuracy: 0.1% of the travel distance. Forster, Carlone, Dellaert, Scaramuzza, IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation, Robotics: Science and Systems'15, Best Paper Award finalist. YouTube: https://youtu.be/CsJkci5lfco. Open source: SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam, instructions at http://arxiv.org/pdf/1512.02363
  • 56. Open-source VO & VSLAM algorithms
  • 57. Intro: visual odometry algorithms ➢ Popular visual odometry and SLAM algorithms ▪ ORB-SLAM (University of Zaragoza, 2015) - ORB-SLAM2 (2016) supports stereo and RGB-D cameras ▪ LSD-SLAM (Technical University of Munich, 2014) ▪ DSO (Technical University of Munich, 2016) ▪ SVO (University of Zurich, 2014/2016) - SVO 2.0 (2016) supports wide-angle, stereo, and multiple cameras
  • 59. ➢ It combines all of the following: ▪ tracking ▪ mapping ▪ loop closing ▪ relocalization (DBoW) ▪ final optimization ➢ ORB: FAST corner + Oriented Rotated BRIEF descriptor ▪ binary descriptor ▪ very fast to compute and compare ➢ Real-time (30 Hz)
  • 61. ORB-SLAM: the ORB feature ➢ ORB: Oriented FAST and Rotated BRIEF ▪ 256-bit binary descriptor ▪ fast to extract and match (Hamming distance) ▪ good for tracking, relocalization, and loop detection ▪ multi-scale detection: the same point appears on several scales
  • 62. ORB-SLAM: tracking ➢ For every new frame: ▪ first track w.r.t. the last frame: find matches from the last frame in the new frame → PnP ▪ then track w.r.t. the local map: find matches from local keyframes in the new frame → PnP
  • 63. ORB-SLAM: mapping ➢ Map representation ▪ keyframe poses ▪ points: 3D positions, descriptor, observations in frames ➢ Functions of the mapping part ▪ triangulate new points ▪ remove redundant keyframes/points ▪ optimize poses and points. Q: why do we need keyframes instead of just using points?
  • 65. LSD-SLAM: Large-Scale Direct (semi-dense) SLAM [Engel, Schoeps, Cremers, ECCV'14]
  • 66. ➢ Direct (photometric error) + semi-dense formulation ▪ 3D geometry represented as semi-dense depth maps ▪ optimizes a photometric error ▪ separately optimizes poses (direct image alignment) and geometry (pixel-wise filtering) ➢ Includes: ▪ loop closing ▪ relocalization ▪ final optimization ➢ Real-time (30 Hz)
  • 67. LSD-SLAM: overview ➢ Direct image alignment ➢ Depth refinement and regularization. Instead of using features, LSD-SLAM uses pixels with large gradients
  • 68. LSD-SLAM: direct image alignment ➢ New frame w.r.t. the last keyframe ➢ Keyframe w.r.t. the global map • Finds the pose that minimizes the photometric error $r_p$ over all selected pixels • Weighted by the photometric covariance • Also minimizes a geometric error: the distance between the points in the current keyframe and the points in the global map
  • 69. LSD-SLAM: depth refinement/regularization ➢ Depth estimation: per-pixel stereo ▪ using the estimated pose from image alignment, stereo matching can be performed for each pixel ▪ the stereo-matching result is used to refine the depth ➢ Regularization ▪ average using adjacent depths ▪ remove outliers and spurious estimations: visually appealing
  • 71. DSO: Direct Sparse Odometry [Engel, Koltun, Cremers, arXiv'16]
  • 72. DSO: tracking front-end ➢ Direct image alignment • uses points with large gradients • incorporates photometric correction: robust to exposure-time changes • uses the exposure times $t_i$, $t_j$ to compensate for exposure-time changes • uses an affine transformation if no exposure time is known
  • 73. DSO: optimization back-end ➢ Sliding-window estimator ▪ not full bundle adjustment ▪ only keeps a fixed-length window (e.g., 3 keyframes) of past frames ▪ instead of simply dropping the states that leave the window, it marginalizes them. Advantages: • helps improve accuracy • still able to operate in real time
  • 75. SVO: Fast, Semi-Direct Visual Odometry [Forster, Pizzoli, Scaramuzza, ICRA'14, TRO'16]
  • 76. SVO: overview. Direct (minimizes the photometric error): • corners and edgelets • frame-to-frame motion estimation. Feature-based (minimizes the reprojection error): • frame-to-keyframe pose refinement. Mapping: • probabilistic depth estimation. Extensions: • omnidirectional cameras • multi-camera systems • IMU pre-integration • dense mapping → REMODE
  • 77. SVO: Semi-Direct Visual Odometry [ICRA'14]. Direct: • frame-to-frame motion estimation. Feature-based: • frame-to-keyframe pose refinement
  • 78. SVO: Semi-Direct Visual Odometry [ICRA'14]. Direct: • frame-to-frame motion estimation. Feature-based: • frame-to-keyframe pose refinement. Mapping ➢ feature extraction only for every keyframe ➢ probabilistic depth estimation of 3D points
  • 79. Probabilistic Depth Estimation in SVO. The measurement likelihood models outliers: • 2-dimensional distribution: depth $d$ and inlier ratio $\rho$ • mixture of a Gaussian + a uniform distribution • inverse depth. Depth filter: • a depth filter for every new feature • recursive Bayesian depth estimation • epipolar search using ZMSSD
  • 80. Probabilistic Depth Estimation in SVO ➢ Based on the model by (Vogiatzis & Hernandez, 2011), but with inverse depth ➢ Modeling the posterior over depth $d$ and inlier ratio $\rho$ as a dense 2D histogram is very expensive; the posterior is instead approximated by a parametric model that describes the pixel depth at time k
  • 82. SVO for Autonomous Drone Navigation. RMS error: 5 mm, height: 1.5 m (down-looking camera). Speed: 4 m/s, height: 1.5 m (down-looking camera). Faessler, Fontana, Forster, Mueggler, Pizzoli, Scaramuzza, Autonomous, Vision-based Flight and Live Dense 3D Mapping with a Quadrotor Micro Aerial Vehicle, Journal of Field Robotics, 2015. Videos: https://www.youtube.com/watch?v=4X6Voft4Z_0, https://youtu.be/fXy4P3nvxHQ
  • 83. SVO on 4 fisheye cameras from the AUDI dataset [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]. Video: https://www.youtube.com/watch?v=gr00Bf0AP1k
  • 84. Processing Times of SVO [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
  • 85. Processing Times of SVO: laptop (Intel i7, 2.8 GHz): 400 frames per second; embedded ARM Cortex-A9, 1.7 GHz: up to 70 frames per second. Source code ➢ open source, available at github.com/uzh-rpg/rpg_svo ➢ works with and without ROS ➢ closed-source professional edition (SVO 2.0) available for companies
  • 86. Summary: feature-based vs. direct. Feature-based (ORB-SLAM, part of SVO/DSO): 1. feature extraction 2. feature matching 3. RANSAC + P3P 4. reprojection-error minimization $T_{k,k-1} = \arg\min_T \sum_i \| \mathbf{u}'_i - \pi(\mathbf{p}_i) \|^2$ ✓ large frame-to-frame motions ✗ slow (20-30 Hz) due to costly feature extraction and matching ✗ not robust to high-frequency and repetitive texture ✗ outliers. Direct approaches (LSD, DSO, SVO): 1. minimize the photometric error $T_{k,k-1} = \arg\min_T \sum_i \| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \|^2$ ✓ every pixel in the image can be exploited (precision, robustness) ✓ increasing the camera frame-rate reduces the computational cost per frame ✗ limited to small frame-to-frame motion
  • 87. Comparison among SVO, DSO, ORB-SLAM, LSD-SLAM [Forster, TRO'16] ➢ See the next two slides ➢ For a thorough evaluation, please refer to the [Forster, TRO'16] paper, where all these algorithms are evaluated in terms of accuracy against ground truth and timing on several datasets: EUROC, TUM-RGB-D, ICL-NUIM [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
  • 88. Accuracy (EUROC Dataset) [Forster, TRO'16] [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
  • 89. Timing (EUROC Dataset) [Forster, TRO'16] [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
  • 90. Review of Direct Image Alignment: it minimizes the per-pixel intensity difference, $T_{k,k-1} = \arg\min_T \sum_i \| I_k(\mathbf{u}'_i) - I_{k-1}(\mathbf{u}_i) \|^2_\sigma$. Sparse: SVO [Forster et al. 2014, TRO'16], 100-200 features × 4×4 patch ≈ 2,000 pixels. Semi-dense: LSD [Engel et al. 2014], ~10,000 pixels. Dense: DTAM [Newcombe et al. '11], 300,000+ pixels. Irani & Anandan, "All About Direct Methods," Vision Algorithms: Theory and Practice, Springer, 2000
  • 91. ➢ Goal: study the magnitude of the perturbation for which image-to-model alignment is capable of converging, as a function of the distance to the reference image ➢ The performance in this experiment is a measure of robustness: successful pose estimation from large initial perturbations shows that the algorithm is capable of dealing with rapid camera motions ➢ 1000 Blender simulations ➢ The alignment is considered converged when the estimated relative pose is closer than 0.1 m to the ground truth ➢ Result: the difference between semi-dense and dense image alignment is marginal, because pixels that exhibit no intensity gradient are not informative for the optimization (their Jacobians are zero) ◦ using all pixels becomes useful only when considering motion blur and image defocus. (Plot: convergence [%] vs. distance between frames, down to 30% overlap.) [Forster et al., «SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems», TRO'16]
  • 92. Summary: keyframe-based vs. filter-based methods ➢ Why the parallel tracking/mapping structure in all these algorithms? ▪ Mapping is often expensive: local BA, loop detection and graph optimization, a depth filter per feature ▪ Using the best map available for real-time tracking [1] ➢ Why not filter-based methods? ▪ Keyframe-based methods give more accuracy per unit of computing time [2] ▪ Filters are still useful in visual-inertial fusion: MSCKF, ROVIO. [1] Klein and Murray, "Parallel Tracking and Mapping for Small AR Workspaces." [2] Strasdat, Montiel, and Davison, "Visual SLAM: Why Filter?"
  • 93. Error Propagation
  • 94. (Figure: concatenated camera poses $C_{k-1}$, $C_k$, $C_{k+1}$ linked by relative transforms $T_{k,k-1}$, $T_{k+1,k}$, each with its uncertainty ellipse.)
  • 95. ➢ The uncertainty of the camera pose $C_k$ is a combination of the uncertainty at $C_{k-1}$ (black solid ellipse) and the uncertainty of the transformation $T_k$ (gray dashed ellipse) ➢ $C_k = f(C_{k-1}, T_k)$ ➢ The combined covariance $\Sigma_k$ is the first-order propagation $\Sigma_k = J \begin{bmatrix} \Sigma_{k-1} & 0 \\ 0 & \Sigma_{T_k} \end{bmatrix} J^\top$, where $J$ is the Jacobian of $f$ with respect to $C_{k-1}$ and $T_k$ ➢ The camera-pose uncertainty always increases when concatenating transformations; thus, it is important to keep the uncertainties of the individual transformations small. A numeric sketch follows below.
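A minimal NumPy sketch of that first-order propagation; the 1D toy numbers are made up and only illustrate that the covariance grows monotonically:

```python
import numpy as np

def propagate_covariance(J_C, J_T, Sigma_prev, Sigma_T):
    """First-order propagation for C_k = f(C_{k-1}, T_k):
    Sigma_k = J_C Sigma_{k-1} J_C^T + J_T Sigma_T J_T^T,
    i.e. the block form J diag(Sigma_{k-1}, Sigma_T) J^T."""
    return J_C @ Sigma_prev @ J_C.T + J_T @ Sigma_T @ J_T.T

Sigma = np.array([[0.0]])                 # start with a perfectly known pose
for _ in range(10):                       # concatenate 10 noisy transforms
    Sigma = propagate_covariance(np.eye(1), np.eye(1), Sigma, np.array([[0.01]]))
print(Sigma)                              # [[0.1]]: uncertainty only grows
```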
  • 96. Commercial Applications of SVO
  • 97. Application: Autonomous Inspection of Bridges and Power Masts. Project with Parrot: autonomous vision-based navigation, Albris drone. Video: https://youtu.be/mYKrR8pihAQ
  • 98. Dacuda VR solutions ➢ Fully immersive virtual reality with 6-DoF for VR and AR content (running on an iPhone): https://www.youtube.com/watch?v=k0MLs5mqRNo ➢ Powered by SVO
  • 100. Zurich-Eye – www.zurich-eye.com. Vision-based localization and mapping solutions for mobile robots. Started in Sep. 2015, became Facebook-Oculus R&D Zurich in Sep. 2016
  • 102. Open Problems and Challenges in Agile Robotics. Current flight maneuvers achieved with onboard cameras are still too slow compared with those attainable by birds or FPV pilots. FPV drone race
  • 103. ▪ At the current state, the agility of a robot is limited by the latency and temporal discretization of its sensing pipeline ▪ Currently, average robot-vision algorithms have latencies of 50-200 ms; this puts a hard bound on the agility of the platform ▪ Can we create a low-latency, low-discretization perception pipeline? Yes, if we combine cameras with event-based sensors. To go faster, we need faster sensors! (Figure: frame-based pipeline with its latency, computation time, and temporal discretization between frames and commands.) [Censi & Scaramuzza, «Low Latency, Event-based Visual Odometry», ICRA'14]
  • 104. Human Vision System ➢ 130 million photoreceptors ➢ but only 2 million axons!
  • 105. Dynamic Vision Sensor (DVS) ➢ Event-based camera developed by Tobi Delbruck's group (ETH & UZH) ➢ Temporal resolution: 1 μs ➢ High dynamic range: 120 dB ➢ Low power: 20 mW ➢ Cost: 2,500 EUR. [Lichtsteiner, Posch, Delbruck, A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor, 2008]. Image of the solar eclipse (March'15) captured by a DVS (courtesy of iniLabs)
  • 106. Camera vs DVS ▪ A traditional camera outputs frames at fixed time intervals ▪ By contrast, a DVS outputs asynchronous events at microsecond resolution; an event is generated each time a single pixel detects an intensity change. Event: $\big(t, \langle x, y \rangle, \operatorname{sign}\!\big(\frac{dI(x,y)}{dt}\big)\big)$, i.e., timestamp, pixel location, and polarity (+1 or −1). Lichtsteiner, Posch, Delbruck, A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor, 2008
  • 107. Camera vs Dynamic Vision Sensor ➢ Video with DVS explanation: http://youtu.be/LauQ6LWTkxM
  • 108. Camera vs Dynamic Vision Sensor ➢ Video with DVS explanation: http://youtu.be/LauQ6LWTkxM
  • 109. DVS Operating Principle [Lichtsteiner, ISCAS'09]: each pixel tracks $V = \log I(t)$, and events are generated any time a single pixel sees a change in log intensity larger than the contrast threshold $C$: $|\Delta \log I| \geq C$ (ON events for increases, OFF events for decreases). The intensity signal at the event time can be reconstructed by integration of $\pm C$; see the per-pixel sketch below. [Lichtsteiner, Posch, Delbruck, 2008] [Cook et al., IJCNN'11] [Kim et al., BMVC'15]
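A toy per-pixel simulation of that principle under the stated model; the signal and the threshold value are made up:

```python
import numpy as np

def simulate_events(log_I, t, C=0.15):
    """Emit an event (t, polarity) each time the log intensity of one pixel
    drifts by more than the contrast threshold C since the last event."""
    events, ref = [], log_I[0]
    for k in range(1, len(log_I)):
        while abs(log_I[k] - ref) >= C:
            polarity = 1 if log_I[k] > ref else -1   # ON / OFF event
            ref += polarity * C                      # integrate +-C steps
            events.append((t[k], polarity))
    return events

t = np.linspace(0.0, 1.0, 200)
log_I = np.log(1.5 + np.sin(2 * np.pi * t))          # one pixel's log intensity
print(len(simulate_events(log_I, t)), 'events from one pixel')
```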
  • 110. Pose Tracking and Intensity Reconstruction from a DVS
  • 111. Dynamic Vision Sensor (DVS). Advantages: • low latency (~1 microsecond) • high dynamic range (120 dB instead of 60 dB) • very low bandwidth (only intensity changes are transmitted): ~200 Kb/s • low storage capacity, processing time, and power. Disadvantages: • requires totally new vision algorithms • no absolute intensity information (only binary intensity changes). Lichtsteiner, Posch, Delbruck, A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor, 2008
  • 112. Generative Model [Censi & Scaramuzza, ICRA'14]. The generative model says that the probability that an event is generated depends on the scalar product between the gradient $\nabla I$ and the apparent motion $\mathbf{u}\Delta t$: $P(e) \propto \langle \nabla I, \mathbf{u}\Delta t \rangle$. Generative event model: $\langle \nabla \log I, \mathbf{u}\Delta t \rangle = C$. [Event-based Camera Pose Tracking using a Generative Event Model, arXiv] [Censi & Scaramuzza, Low Latency, Event-based Visual Odometry, ICRA'14]
  • 113. Event-based 6DoF Pose Estimation Results [Event-based Camera Pose Tracking using a Generative Event Model, arXiv] [Censi & Scaramuzza, Low Latency, Event-based Visual Odometry, ICRA'14]. Video: https://youtu.be/iZZ77F-hwzs
  • 114. Robustness to Illumination Changes and High-Speed Motion. BMVC'16: EMVS: Event-based Multi-View Stereo, Best Industry Paper Award. Video: https://www.youtube.com/watch?v=EUX3Tfx0KKE
  • 115. Possible Future Sensing Architecture [Censi & Scaramuzza, Low Latency, Event-based Visual Odometry, ICRA'14]
  • 116. DAVIS: Dynamic and Active-pixel Vision Sensor [Brandli'14]. Combines an event camera with a frame-based camera in the same pixel array: DVS events and CMOS frames over time. Brandli, Berner, Yang, Liu, Delbruck, "A 240×180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor," IEEE Journal of Solid-State Circuits, 2014
  • 117. Event-based Feature Tracking [IROS'16] ➢ Extract Harris corners on images ➢ Track the corners using event-based Iterative Closest Points (ICP). IROS'16: Low-Latency Visual Odometry using Event-based Feature Tracks, Best Application Paper Award finalist
  • 118. Event-based, Sparse Visual Odometry [IROS'16]. Video: https://youtu.be/RDu5eldW8i8. IROS'16: Event-based Feature Tracking for Low-Latency Visual Odometry, Best Application Paper Award finalist
  • 119. Conclusions ➢ VO & SLAM theory is well established ➢ The biggest challenges today are reliability and robustness ▪ HDR scenes ▪ high-speed motion ▪ low-texture scenes ➢ Which VO/SLAM is best? ▪ It depends on the task and on how you measure the performance! - e.g., VR/AR/MR vs. robotics ➢ 99% of SLAM algorithms are passive: we need active SLAM! ➢ Event cameras open enormous possibilities (standard cameras have been studied for 50 years!) ▪ ideal for high-speed motion estimation and robustness to HDR illumination changes
  • 120. VO (i.e., no loop closing) ➢ Modified PTAM (feature-based, mono): http://wiki.ros.org/ethzasl_ptam ➢ LIBVISO2 (feature-based, mono and stereo): http://www.cvlibs.net/software/libviso ➢ SVO (semi-direct, mono, stereo, multi-camera): https://github.com/uzh-rpg/rpg_svo ➢ DSO (direct sparse odometry): https://github.com/JakobEngel/dso VIO ➢ ROVIO (tightly coupled EKF): https://github.com/ethz-asl/rovio ➢ OKVIS (non-linear optimization): https://github.com/ethz-asl/okvis ➢ SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam ▪ instructions here: http://arxiv.org/pdf/1512.02363 VSLAM ➢ ORB-SLAM (feature-based, mono and stereo): https://github.com/raulmur/ORB_SLAM ➢ LSD-SLAM (semi-dense, direct, mono): https://github.com/tum-vision/lsd_slam
  • 121. ➢ GTSAM: https://collab.cc.gatech.edu/borg/gtsam?destination=node%2F299 ➢ g2o: https://openslam.org/g2o.html ➢ Google Ceres Solver: http://ceres-solver.org/
  • 122. VO (i.e., no loop closing) ➢ Modified PTAM (Weiss et al.) (feature-based, mono): http://wiki.ros.org/ethzasl_ptam ➢ SVO (Forster et al.) (semi-direct, mono, stereo, multi-camera): https://github.com/uzh-rpg/rpg_svo IMU-vision fusion ➢ Multi-Sensor Fusion Package (MSF) (Weiss et al.) (EKF, loosely coupled): http://wiki.ros.org/ethzasl_sensor_fusion ➢ SVO + GTSAM (Forster et al., RSS'15) (optimization-based, pre-integrated IMU): https://bitbucket.org/gtborg/gtsam ▪ instructions here: http://arxiv.org/pdf/1512.02363 ➢ OKVIS (non-linear optimization): https://github.com/ethz-asl/okvis
  • 123. ➢ Open source: ▪ MAVMAP: https://github.com/mavmap/mavmap ➢ Closed source: ▪ Pix4D: https://pix4d.com/
  • 124. ➢ DBoW2: https://github.com/dorian3d/DBoW2 ➢ FABMAP: http://mrg.robots.ox.ac.uk/fabmap/
  • 125. VO datasets ➢ Malaga dataset: http://www.mrpt.org/malaga_dataset_2009 ➢ KITTI dataset: http://www.cvlibs.net/datasets/kitti/ VIO datasets (these include ground-truth 6-DOF poses from Vicon and synchronized IMU and images) ➢ EUROC MAV dataset (forward-facing stereo): http://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets ➢ RPG-UZH dataset (downward-facing monocular): http://rpg.ifi.uzh.ch/datasets/dalidation.bag More ➢ Check out also: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
  • 126. Davide Scaramuzza – University of Zurich – Robotics and Perception Group - rpg.ifi.uzh.ch
  • 127. Davide Scaramuzza – University of Zurich – Robotics and Perception Group - rpg.ifi.uzh.ch