This document provides an overview of a 3-view camera self-calibration and 3D reconstruction algorithm. It begins with feature detection in each image, then associates features across the three views and robustly estimates the trifocal tensor using RANSAC. The tensor is used to compute compatible projective camera matrices, which are then rectified to metric cameras. An initial 3D point cloud is constructed and refined through bundle adjustment. The algorithm estimates the camera's intrinsic parameters and reconstructs the 3D scene from three uncalibrated images, without the use of calibration targets.
Three View Self Calibration and 3D Reconstruction
1. Camera Self Calibration and Reconstruction from Three Views
Peter Abeles
Date: May 2019
Copyright (C) 2019 Peter Abeles
2. Problem Statement
• 3-View Camera Self Calibration and 3D Reconstruction
• Focus on one possible solution. There are many others.
• Provides theory for the "ThreeViewEstimateMetricScene" class in BoofCV
• Self Calibration
  • Given a set of images of the same scene, estimate the camera's intrinsic parameters
  • e.g. focal length, lens distortion, etc.
  • Some basic assumptions are allowed
    • Known pixel aspect ratio
    • Known camera model, e.g. pinhole or fisheye
  • No known calibration targets allowed
    • e.g. no chessboard patterns
• Scene Reconstruction
  • 3D location of observed features
  • Camera pose, i.e. rotation and translation
• In most situations, to do self calibration you also need to do scene reconstruction
  • Exceptions include pure rotations and the use of vanishing points

Outputs: Intrinsic $K_1, \dots, K_n$; Pose $(R_1, T_1), \dots, (R_n, T_n)$; Scene $X_1, \dots, X_m$
3. Required Background
• This is an advanced topic in 3D computer vision
• You need to already be familiar with the following:
  • Pinhole camera, epipolar geometry, projective geometry, fundamental matrices, homographies, SVD, EVD, 3D point clouds, RANSAC
• Hartley and Zisserman, "Multiple View Geometry in Computer Vision" 2nd ed.
  • Classic and most extensive text on this subject
  • Not required, but it is cited many times in this presentation
  • Sometimes referred to as simply H&Z
4. Why Just Three Views?
• 3-View contains all the interesting self calibration math
• N-View is all about managing very large and complex data structures
  • N-View's math is more forgiving and "less interesting"
• 2-View is much more difficult than 3-View
  • With 3 views, geometry alone can eliminate many of the false associations
  • Many well known self calibration approaches require 3 or more views
  • We will come back to 2-View again at the end
• 1-View requires very strong assumptions but is possible in specific situations
(Figure: 2-View, 3-View, and N-View camera configurations)
5. What input images are ideal?
• Plenty of scene texture
• Translational camera motion
  • Can't reconstruct from pure rotation
• Large baseline between views
  • Better triangulation
• Small baseline between views
  • Better feature matching
• Avoid extreme lighting conditions
6. 3-View Algorithm Overview (Part 1)
Feature Detection → Associate 3-View → Robust fit Trifocal Tensor
• Feature Detection: use the best possible feature detector, e.g. SIFT or SURF. There's also promising research in deep learning features.
• Associate 3-View: nearest-neighbor association, accepted only if 1→2→3→1.
• Robust fit: RANSAC with a linear fit for the trifocal tensor, and reprojection error computed via triangulation with non-linear refinement.
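Since the slides describe but do not spell out the robust-fitting stage, the following Python/NumPy sketch illustrates the idea. Note that `fit_trifocal_linear` and `reprojection_error` are hypothetical stand-ins for the linear trifocal solver and the triangulation-based error described above; this is not BoofCV's actual API.

```python
import numpy as np

def ransac_trifocal(obs1, obs2, obs3, fit_trifocal_linear,
                    reprojection_error, iters=500, threshold=2.0, rng=None):
    """obs1..obs3: (N, 2) arrays of associated pixel observations."""
    rng = rng or np.random.default_rng()
    n = obs1.shape[0]
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        # The linear trifocal solver needs at least 7 point triplets
        sample = rng.choice(n, size=7, replace=False)
        model = fit_trifocal_linear(obs1[sample], obs2[sample], obs3[sample])
        if model is None:
            continue
        # Score by reprojection error after triangulation + refinement
        err = reprojection_error(model, obs1, obs2, obs3)
        inliers = err < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-fit on the largest inlier set found
    return (fit_trifocal_linear(obs1[best_inliers], obs2[best_inliers],
                                obs3[best_inliers]), best_inliers)
```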
7. 3-View Algorithm Overview (Part 2)
Camera Matrices from Trifocal Tensor → Absolute Dual Quadric → Metric Rectification of Camera Matrices

Camera matrices from the trifocal tensor (arbitrary projective frame, compatible across all views):

$P_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \quad P_2 = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix}, \quad P_3 = \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \\ b_{31} & b_{32} & b_{33} & b_{34} \end{bmatrix}$

Absolute dual quadric $Q^*_\infty$ (notoriously unstable to estimate):

$w^* = P_i \, Q^*_\infty \, P_i^T, \quad w^* = K K^T, \quad Q^*_\infty = H \tilde{I} H^T$

Use the absolute dual quadric to compute a rectifying homography $H$, then rectify the camera matrices:

$P_i^M = P_i H$

$K$ is the 3x3 intrinsic matrix, $w^*$ is the dual image of the absolute conic, $H$ is the rectifying homography with $\tilde{I} = \mathrm{diag}(1, 1, 1, 0)$, and $P_i^M$ is the metric camera matrix.
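As a rough illustration of the equations above, here is a NumPy sketch of the metric-rectification step. It assumes the absolute dual quadric `Q` (4x4, rank 3) has already been estimated from the self-calibration constraints; the function names are illustrative and not from any particular library.

```python
import numpy as np

def intrinsics_from_diac(w):
    """Recover upper-triangular K from w* = K K^T via Cholesky on w^-1."""
    if w[2, 2] < 0:                                # fix overall sign first
        w = -w
    L = np.linalg.cholesky(np.linalg.inv(w))       # w^-1 = L L^T, L lower tri.
    K = np.linalg.inv(L).T                         # then w = K K^T, K upper tri.
    return K / K[2, 2]

def metric_upgrade(P_list, Q):
    """Upgrade compatible projective cameras using Q*_inf = H diag(1,1,1,0) H^T."""
    vals, vecs = np.linalg.eigh(Q)
    order = np.argsort(np.abs(vals))[::-1]         # put the ~0 eigenvalue last
    vals, vecs = np.abs(vals[order]), vecs[:, order]
    vals[3] = 1.0                                  # rank-3: replace the zero eigenvalue
    H = vecs @ np.diag(np.sqrt(vals))              # rectifying homography
    Ks = [intrinsics_from_diac(P @ Q @ P.T) for P in P_list]
    return [P @ H for P in P_list], Ks, H
```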
8. 3-View Algorithm Overview (Part 3)
Construct Initial Estimate of Scene → Bundle Adjustment → Bundle Adjustment Again
• Construct Initial Estimate: triangulate 3D points from the previously estimated camera locations.
• Bundle Adjustment: optimize all parameters at once using an efficient sparse bundle adjustment (SBA).
• Bundle Adjustment Again: adjust the initial camera orientations and run SBA multiple times.

$x = P_i^M X, \quad P_i^M = K_i [R_i \,|\, T_i]$

[1] Figures have been "borrowed" from the SBA Library documentation. The SBA Library was not used in this project.
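To make the objective concrete, here is a toy sketch of the reprojection residual that bundle adjustment minimizes, assuming pinhole cameras parameterized as $(K_i, R_i, T_i)$. In practice the cameras and points are flattened into one parameter vector and handed to a sparse non-linear least-squares solver (e.g. scipy.optimize.least_squares); the deck itself uses its own SBA implementation, so this is only an illustration.

```python
import numpy as np

def project(K, R, T, X):
    """Pinhole projection x = K [R|T] X for 3D points X with shape (N, 3)."""
    x = (K @ (R @ X.T + T.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

def reprojection_residuals(points3d, cameras, observations):
    """cameras: list of (K, R, T); observations[i]: (N, 2) pixels in view i."""
    residuals = [(project(K, R, T, points3d) - obs).ravel()
                 for (K, R, T), obs in zip(cameras, observations)]
    return np.concatenate(residuals)
```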
9. Stereo Processing
Rectify Stereo Pair → Dense Stereo Disparity → 3D Point Cloud
• Rectify: warp the images so that the epipoles are at infinity and epipolar lines become horizontal.
• Disparity: compute the disparity of every pixel in the image. Disparity is the difference in position between the left and right images.
• Point cloud: going from disparity to a 3D point cloud is simple.
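A minimal sketch of that "simple" disparity-to-cloud step, assuming a rectified pair with focal length `f` (pixels), baseline `B`, and principal point `(cx, cy)`; these parameter names are illustrative.

```python
import numpy as np

def disparity_to_cloud(disparity, f, B, cx, cy):
    """Back-project a dense disparity map into 3D points (camera frame)."""
    v, u = np.indices(disparity.shape)
    valid = disparity > 0                 # zero disparity -> point at infinity
    z = f * B / disparity[valid]          # depth is inversely proportional to disparity
    x = (u[valid] - cx) * z / f
    y = (v[valid] - cy) * z / f
    return np.column_stack([x, y, z])
```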
10. 3D Point Cloud
• 3-View structure with a 3D cloud from 2 of the views
• Created using the algorithm being explained in this presentation
• Source code, images, and pre-built binaries are available
  • See end of presentation for links
• Notice how some of the images don't quite look right?
  • e.g. stretched oddly
• Why don't they look right?
  • Converged to a local minimum and got the focal length wrong
  • Insufficient camera geometry
  • e.g. baseline too small
11. Feature Detection
• Feature detectors find salient features inside an image:
  • 2D pixel coordinate
  • N-tuple descriptor
• A rotation and scale invariant feature detector and descriptor is recommended, e.g. SIFT [1]
• High quality implementations are required
  • A 10% drop in feature stability significantly increases the instability of the overall system
• BoofCV's SURF [2] was used here [3]
(Figure: Hessian determinant intensity and selected SURF features)
[1] Lowe, David G. "Object recognition from local scale-invariant features." ICCV 1999.
[2] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "SURF: Speeded Up Robust Features." ECCV 2006.
[3] Abeles, P. "Speeding up SURF." ISVC 2013.
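For a runnable example of the detect-and-describe step, here is a sketch using OpenCV's SIFT. The deck itself used BoofCV's SURF, so OpenCV is a swapped-in stand-in here, chosen only because it gives a short Python example.

```python
import cv2

def detect_features(image_path):
    """Detect salient features: a 2D pixel coordinate plus an N-tuple descriptor."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    # Each keypoint carries an (x, y) location; each SIFT descriptor is a 128-tuple
    return [kp.pt for kp in keypoints], descriptors
```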
12. 2-View Feature Association

Features are described by a location (x, y) and an N-tuple:
$F_i = [x_1, \dots, x_N]$

Each feature in image A is associated with a feature in image B by minimizing the error:
$\arg\min_j \| F_i^A - F_j^B \|$

A match is only accepted if the two features are mutually each other's best choice (a sketch follows below).
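A minimal sketch of mutual nearest-neighbor association, assuming descriptors are rows of NumPy arrays descA (MxN) and descB (KxN):

import numpy as np

def mutual_matches(descA, descB):
    """Return (i, j) pairs where i and j are each other's best match."""
    # Pairwise Euclidean distances between all descriptors
    dist = np.linalg.norm(descA[:, None, :] - descB[None, :, :], axis=2)
    best_b = dist.argmin(axis=1)   # best match in B for each feature in A
    best_a = dist.argmin(axis=0)   # best match in A for each feature in B
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]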
13. 3-View Association
• Two-view association is performed for each image pair
  • Pairs: 1-2, 2-3, 3-1
• A feature is only accepted if it is tracked successfully around all 3 views back to itself (a sketch follows below)
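A minimal sketch of the 1→2→3→1 loop check, assuming m12, m23, m31 are dicts mapping a feature index in one view to its mutual match in the next:

def loop_consistent(m12, m23, m31):
    """Keep only features that track around all three views back to themselves."""
    triples = []
    for i1, i2 in m12.items():
        i3 = m23.get(i2)
        if i3 is not None and m31.get(i3) == i1:
            triples.append((i1, i2, i3))
    return triples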
14. Metric vs Projective Camera

Metric
• Is a projective camera with additional constraints
• A projective camera can be "elevated" to metric by a rectifying homography H
• Reconstruction is in Euclidean space
• Defined uniquely up to scale

Projective
• Any 3x4 matrix of rank 3
• Reconstruction will be in projective space
  • Very odd appearance
  • Points easily move in and out of infinity
• Uniquely defined up to a homography (see the numerical check below)

$P^M = K[R \mid T]$
$P^M = PH$

K is the 3x3 upper triangular intrinsic matrix
R is the 3x3 rotation matrix
T is the 3x1 translation vector
H is the 4x4 projective-to-metric homography in 3-space
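To make the projective ambiguity concrete, here is a small numerical check (my own sketch, not from the presentation): warping the camera by any invertible 4x4 H while un-warping the points leaves every image observation unchanged, which is exactly why a projective camera is only defined up to a homography.

import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 4))            # arbitrary rank-3 projective camera
X = rng.normal(size=(4, 10))           # homogeneous 3D points
H = rng.normal(size=(4, 4))            # arbitrary invertible homography

x1 = P @ X
x2 = (P @ H) @ (np.linalg.inv(H) @ X)
assert np.allclose(x1, x2)             # identical image observations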
15. Trifocal Tensor: Introduction
• Used to remove false associations and get compatible projective cameras
• The trifocal tensor plays an analogous role in three views to the one the fundamental matrix plays in two views [1]
• It is defined entirely by camera pose and intrinsic parameters
• Given a trifocal tensor you can:
  • Transfer: given a point or line in two of the views, estimate its location in the third
  • Extract fundamental matrices $F_{21}, F_{31}, F_{32}$
  • Extract compatible projective camera matrices $P_1, P_2, P_3$

[Figure: relationship between a point and three views [2]]

[1] Hartley and Zisserman, "Multiple View Geometry in Computer Vision"
[2] Figure by Marc Pollefeys
16. Trifocal Tensor: Math Background

In matrix notation a trifocal tensor is described by a set of three 3x3 matrices:
$\{T_1, T_2, T_3\}, \quad T_i \in \mathbb{R}^{3\times3}$

Relationship to Fundamental Matrices:
$F_{21} = [e']_\times [T_1, T_2, T_3]\, e''$
$F_{31} = [e'']_\times [T_1^T, T_2^T, T_3^T]\, e'$

Relationship to Camera Matrices:
$P_1 = [I \mid 0]$
$P_2 = [\,[T_1, T_2, T_3]\, e'' \mid e'\,]$
$P_3 = [\,(e'' e''^T - I)[T_1^T, T_2^T, T_3^T]\, e' \mid e''\,]$

Given a point correspondence across the three views, $x \leftrightarrow x' \leftrightarrow x''$, the following is true (a numerical check follows below):
$[x']_\times \left( \sum_i x^i T_i \right) [x'']_\times = 0_{3\times3}$

$[A]_\times$ is the 3x3 skew symmetric matrix of vector A. $e'$ and $e''$ are the epipoles in the second and third images. An epipole is the image in one view of the other camera's center, i.e. where the baseline between the two cameras pierces the image plane.
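A small numerical check of the point constraint above (a sketch under my own conventions): it assumes the tensor is stored as a (3,3,3) NumPy array T with T[i] the i-th 3x3 slice, and x, xp, xpp are homogeneous pixel coordinates in views 1, 2, 3.

import numpy as np

def skew(a):
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def trifocal_residual(T, x, xp, xpp):
    """Frobenius norm of [x']_x (sum_i x^i T_i) [x'']_x; ~0 for true matches."""
    M = sum(x[i] * T[i] for i in range(3))
    return np.linalg.norm(skew(xp) @ M @ skew(xpp))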
17. Trifocal Tensor: Reprojection Error
• A reprojection error is needed for RANSAC (a skeleton sketch follows below)
• Attempted a few different methods:
  • Point Transfer
  • Triangulation
  • Triangulation with Refinement
• Triangulation with Refinement worked the best

1. Extract camera matrices from the trifocal tensor
2. Triangulate the 3D point X in projective space (see page 312 in H&Z for the DLT)
3. Refine the 3D point by minimizing the reprojection error
4. Select inliers using the squared reprojection error

$P_2 = [\,[T_1, T_2, T_3]\, e'' \mid e'\,]$
$P_3 = [\,(e'' e''^T - I)[T_1^T, T_2^T, T_3^T]\, e' \mid e''\,]$

$\min_X \sum_i \| x_i - P_i X \|^2$

$P_i$ is a 3x4 projective camera matrix. $x_i$ is the pixel observation of feature X.
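A bare-bones RANSAC skeleton for this robust-fit step. fit_trifocal_linear, cameras_from_trifocal, and triangulate_refine are hypothetical helpers standing in for the linear 7-point fit, the camera extraction, and the DLT plus non-linear triangulation described on this slide; this is a sketch of the control flow, not BoofCV's implementation.

import numpy as np

def dehom(v):
    return v[:2] / v[2]

def ransac_trifocal(obs, iters=500, thresh=2.0, rng=np.random.default_rng(0)):
    """obs: list of (x, x', x'') pixel triples. Returns the best inlier set."""
    best_inliers = []
    for _ in range(iters):
        sample = [obs[i] for i in rng.choice(len(obs), 7, replace=False)]
        T = fit_trifocal_linear(sample)           # hypothetical linear fit
        P1, P2, P3 = cameras_from_trifocal(T)     # hypothetical extraction
        inliers = []
        for triple in obs:
            X = triangulate_refine((P1, P2, P3), triple)   # hypothetical DLT+refine
            err = sum(np.linalg.norm(xi - dehom(P @ X))**2
                      for P, xi in zip((P1, P2, P3), triple))
            if err < thresh**2:                   # squared reprojection error test
                inliers.append(triple)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers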
18. Fundamental Matrices Instead of Trifocal?
• Can't you just compute $F_{21}, F_{31}, F_{32}$ instead of a trifocal tensor?
  • Yes, and it is probably the more popular approach
  • Many if not most 3D vision libraries don't even have a trifocal tensor!
• Disadvantages of the Fundamental Matrix approach:
  • Applying the epipolar constraint three times is less effective than applying a trifocal constraint once
  • Projective camera matrices found by decomposing $F_{21}$ and $F_{31}$ will not be compatible
    • Section 15.4 in Hartley and Zisserman
    • Page 301 in Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry, "An Invitation to 3-D Vision"
19. Fundamental Matrices Instead of Trifocal?

[Figure: two objects in front of a camera, shown in views 1, 2, 3, with the epipolar line and the apparent locations of the objects]
• The apparent locations all lie along the same epipolar line
• They pass epipolar tests, but would fail a trifocal test

When you decompose a fundamental matrix into two projective camera matrices, they have the following relationship with the metric camera matrices:
$F_{21}: \; P_1^M = [I \mid 0] H_{21}, \quad P_2^M = P_2 H_{21}$
$F_{31}: \; P_1^M = [I \mid 0] H_{31}, \quad P_3^M = P_3 H_{31}$
$F_{32}: \; P_2^M = [I \mid 0] H_{32}, \quad P_3^M = P_3 H_{32}$

• Notice how the rectifying homography $H_{ij}$ is different for each decomposition?
• It is possible to find a transform for each decomposition which will take them to a single compatible frame
• This is similar to scale ambiguity
20. Absolute Dual Quadratic (Part 1)

The ADQ is used to upgrade a projective camera into a metric camera.

A metric camera matrix $P_i^M$ is related to a projective camera matrix $P_i$ by a homography H:
$P_i^M = P_i H, \quad P_i^M = K_i [R_i \mid T_i]$   (1)

We can select the origin/initial projective camera matrix arbitrarily; to make the math easier we define it as follows:
$P_1 = [I \mid 0], \quad P_1^M = K_1 [I \mid 0]$   (2)

From this it follows that
$H = \begin{bmatrix} K_1 & 0 \\ v^T & 1 \end{bmatrix}$   (3)

Then we define $\pi_\infty = (p^T, 1)^T$ and $P_i = [A_i \mid a_i]$, from which the plane at infinity is derived:
$\pi_\infty = H^{-T} (0, 0, 0, 1)^T, \quad p = -K_1^{-T} v$   (4)

$\pi_\infty$ is the plane at infinity. $H$ is the 4x4 projective-to-metric homography. $v$ is an arbitrary 3x1 vector. $P_i^M$ is a 3x4 metric camera matrix. $P_i$ is a 3x4 projective camera matrix.
21. Absolute Dual Quadratic (Part 2)

Using equations (1), (3), and (4) from the previous slide you can derive
$K_i K_i^T = (A_i - a_i p^T)\, K_1 K_1^T\, (A_i - a_i p^T)^T, \quad K_i K_i^T \in \mathbb{R}^{3\times3}$

From this we define the dual image of the absolute conic $w_i^*$:
$w_i^* = K_i K_i^T = P_i Q_\infty^* P_i^T = (A_i - a_i p^T)\, w_1^*\, (A_i - a_i p^T)^T$

and the Absolute Dual Quadratic $Q_\infty^*$, which is a 4x4 matrix:
$Q_\infty^* = H \tilde{I} H^T = \begin{bmatrix} w_1^* & -w_1^* p \\ -p^T w_1^* & p^T w_1^* p \end{bmatrix} \in \mathbb{R}^{4\times4}$

Notice how $w_i^*$ for view $i$ only depends on the unknowns $p$ and $w_1^*$? The final step is to use the known structure of $w_1^*$ and the already found values of $P_i$ to compute $Q_\infty^*$.

$K_i$ is a 3x3 upper triangular intrinsic camera matrix. $A_i$ is the 3x3 sub-matrix of $P_i$. $p$ is from $\pi_\infty = (p^T, 1)^T$. $\tilde{I}$ is diag(1,1,1,0). $P_i$ is a 3x4 projective camera matrix. $H$: see the past 20 slides. (You know, complaining that every variable isn't defined on every slide is really silly.)
22. Solving for Absolute Dual Quadratic (Part 1)

Intrinsic camera matrix with zero skew:
$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$

Known structure of $w^*$ from K:
$w^* = K_i K_i^T = \begin{bmatrix} f_x^2 + c_x^2 & c_x c_y & c_x \\ c_x c_y & f_y^2 + c_y^2 & c_y \\ c_x & c_y & 1 \end{bmatrix}$

• Now assume the principal point $(c_x, c_y)$ is (0, 0)
  • This can be accomplished by assuming the image center is $(c_x, c_y)$ and subtracting it from all pixel coordinates
  • Calibration is much less sensitive to errors in the principal point

$w^* = \begin{bmatrix} f_x^2 & 0 & 0 \\ 0 & f_y^2 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

These known zeros will now be used to solve for $Q_\infty^*$.

$(f_x, f_y)$ is the camera's focal length.
23. Solving for Absolute Dual Quadratic (Part 2)
• You now have $P_i$ for 3 views and you know that $K_i$ has zero skew and a zero principal point
• Using the known zeros in $w_i^*$ gives you 3 equations for each view
• $Q_\infty^*$ is a symmetric 4x4 matrix and can be parameterized by 10 unknowns
• With a bit of algebra it's possible to reformat (2) into a linear system and solve for the null space using SVD (a sketch follows below)
  • With 3 constraints per view this would require 4 views to solve
  • If you add the constraint $f_x = f_y$ then only 3 views are needed
• The equations are quite ugly and I wrote Sage Math code for generating them
  • See (unreleased) BoofCV technical report [1]

$w_i^* = P_i Q_\infty^* P_i^T$   (1)
$(P_i Q_\infty^* P_i^T)_{12} = 0, \quad (P_i Q_\infty^* P_i^T)_{13} = 0, \quad (P_i Q_\infty^* P_i^T)_{23} = 0$   (2)

[1] Promise that I'll release this "soon"
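A sketch of that linear-system step under the same zero-skew, centered-principal-point assumptions. The row construction and the optional f_x = f_y row are my own illustration of the technique, not BoofCV's code; each known zero of w_i* becomes one row of A q = 0 in the 10 parameters of the symmetric Q, whose null space the SVD recovers.

import numpy as np

# index pairs of the upper triangle of a symmetric 4x4 (10 parameters)
SYM = [(r, c) for r in range(4) for c in range(r, 4)]

def q_from_vec(q):
    Q = np.zeros((4, 4))
    for val, (r, c) in zip(q, SYM):
        Q[r, c] = Q[c, r] = val
    return Q

def solve_adq(cameras):
    """cameras: list of 3x4 projective P_i. Returns Q_inf* up to scale."""
    rows = []
    for P in cameras:
        for (k, l) in [(0, 1), (0, 2), (1, 2)]:     # the known zeros of w*
            # coefficient of each q-parameter in (P Q P^T)[k, l]
            rows.append([P[k, r] * P[l, c] + (P[k, c] * P[l, r] if r != c else 0)
                         for (r, c) in SYM])
        # optional f_x = f_y constraint: (P Q P^T)[0,0] - (P Q P^T)[1,1] = 0
        rows.append([(P[0, r] * P[0, c] - P[1, r] * P[1, c]) * (2 if r != c else 1)
                     for (r, c) in SYM])
    _, _, Vt = np.linalg.svd(np.array(rows))
    return q_from_vec(Vt[-1])                        # null-space vector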
24. Projective to Metric

Recall that the absolute dual quadratic $Q_\infty^*$ can be decomposed into
$Q_\infty^* = H \tilde{I} H^T$
where $H$ is a 4x4 projective-to-metric homography and $\tilde{I}$ is diag(1,1,1,0).

H can thus be found using eigenvalue decomposition [1] or by directly solving for $w_1^*$, which is the same as solving for $K_1$. In this example we used the latter.

Untested Matlab code for the direct solution:

A = chol(inv(Q(1:3,1:3)))'   % upper-left 3x3 block of Q is w1* = K1*K1'
K = inv(A)                   % recover K1 from the Cholesky factor
K = K./K(3,3)                % normalize so that K(3,3) = 1
p = -inv(K*K')*Q(1:3,4)      % plane at infinity: Q(1:3,4) = -w1* * p
H = [K [0;0;0];-p'*K 1]      % assemble H = [K1 0; v' 1] with v' = -p'*K1

BoofCV's implementation contains additional manipulations to scale variables and handle degenerate situations.

[1] Page 463 in Hartley and Zisserman, "Multiple View Geometry in Computer Vision"
25. Initial Reconstruction
• We now have for each view:
  • $K_i$ intrinsic parameters
  • $R_i$ and $T_i$ rotation and translation
  • $R_1$ and $T_1$ are the identity and (0,0,0)
• What we need are the 3D locations of each feature
  • Found using triangulation (a sketch follows below)

3-View Metric Triangulation: initial estimate using DLT (page 312 in H&Z), followed by non-linear refinement of the residual error:
$\text{residual} = x - K[R \mid T]\, X$
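A sketch of the standard homogeneous DLT triangulation (the H&Z page 312 method), assuming cams is a list of 3x4 matrices K[R|T] and pts the matching pixel observations:

import numpy as np

def triangulate_dlt(cams, pts):
    """Linear initial estimate of the 3D point X from >= 2 views."""
    rows = []
    for P, (u, v) in zip(cams, pts):
        rows.append(u * P[2] - P[0])   # u * (p3 . X) - (p1 . X) = 0
        rows.append(v * P[2] - P[1])   # v * (p3 . X) - (p2 . X) = 0
    _, _, Vt = np.linalg.svd(np.array(rows))
    X = Vt[-1]
    return X[:3] / X[3]                # dehomogenize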
26. Refine Reconstruction
• The initial reconstruction is very crude and not good enough to compute a dense 3D cloud from
• Improve the initial reconstruction using sparse bundle adjustment [1] (a sketch of the objective follows below)
• Sensitive to initial parameters
  • Run multiple times
  • Orientation physically impossible? Try flipping the initial orientation
  • Remove outlier 3D features and run again
  • Select the solution with minimal residual error that is physically possible

Verbose output from BoofCV's implementation (fx is the sum of residual error; ~1000x improvement):

Steps  fx         change      |step|     f-test     g-test     tr-ratio  lambda
0      8.815E+04  0.000E+00   0.000E+00  0.000E+00  0.000E+00  0.00      1.00E-03
1      2.157E+02  -8.794E+04  3.118E+00  9.976E-01  2.832E+00  1.000     3.33E-04
…
15     3.253E+01  -4.782E-05  8.401E-01  1.470E-06  8.591E-03  0.745     2.47E-07
16     3.253E+01  -3.092E-05  8.615E-01  9.507E-07  7.473E-03  0.595     2.45E-07
Converged f-test

Quality of the solution is strongly dependent on the initial focal length estimates:

      Before  After
f1    405.4   340.9
f2    406.0   340.4
f3    405.5   340.4

[1] Triggs, Bill, et al. "Bundle Adjustment - A Modern Synthesis." 1999.
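A sketch of the bundle-adjustment objective (not BoofCV's SBA): camera poses and points are packed into one parameter vector and handed to a trust-region least-squares solver. K is held fixed here for brevity, and the rotation is parameterized as a rotation vector via scipy.spatial.transform.Rotation.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, K, obs, n_cams, n_pts):
    """obs: list of (cam_idx, pt_idx, u, v). Returns stacked pixel residuals."""
    poses = params[:n_cams * 6].reshape(n_cams, 6)   # rotation vector + translation
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    out = []
    for c, p, u, v in obs:
        R = Rotation.from_rotvec(poses[c, :3]).as_matrix()
        x = K @ (R @ pts[p] + poses[c, 3:])          # project K[R|T] X
        out.extend([x[0] / x[2] - u, x[1] / x[2] - v])
    return np.array(out)

# result = least_squares(residuals, x0, args=(K, obs, n_cams, n_pts), method="trf")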
27. Stereo Processing

Rectify Stereo Pair → Dense Stereo Disparity → 3D Point Cloud

Rectify Stereo Pair: distort images so that epipolar lines are at infinity. [1]
Dense Stereo Disparity: compute the disparity of every pixel in the image. Disparity is the difference between the left and right images. [2]
3D Point Cloud: going from disparity to a 3D point cloud is simple.

• Most implementations use Multiview Stereo for dense reconstruction
• Here dense two-view stereo was used due to availability

[1] A. Fusiello, E. Trucco, and A. Verri, "A Compact Algorithm for Rectification of Stereo Pairs." Machine Vision and Applications, 2000
[2] Heiko Hirschmuller, Peter R. Innocent, and Jon Garibaldi. "Real-Time Correlation-Based Stereo Vision with Reduced Border Errors." Int. J. Comput. Vision 47, 1-3, 2002
28. How Practical is this?
• Reconstruction from three views has no known truly stable numerical solution
• The literature is a bit of a graveyard
  • Paper Y presents a new solution and mentions they were unable to replicate the results in X
  • Paper Z presents a new solution and mentions they were unable to replicate the results in Y
• Having identical image features but changing their order will produce different solutions
  • Discovered when a concurrent feature detector was added
• Most reconstruction tools require many more than 3 views and often "cheat" by using a known focal length from EXIF data
• The algorithm presented here converges to a reasonable solution about 70% of the time, if given good input images
29. Two View Reconstruction
• It's possible to follow much the same pipeline with two views and create a 3D reconstruction!
• False positive associations are much more numerous
  • The epipolar constraint is insufficient
• Difficult to obtain an initial estimate for K
  • Guess and check often works (see the sketch below)
• Even less stable than the 3-View case
  • Converges about 35% of the time with an ideal scene

[Figures: highly distorted stereo-rectified image from 2-View; successful stereo rectification]
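A sketch of the guess-and-check idea for the two-view intrinsics: sweep a range of plausible focal lengths and keep the guess with the lowest reprojection error. The focal-length heuristic and the helper reconstruct_two_view are my own hypothetical stand-ins for running the two-view pipeline with a fixed K.

import numpy as np

def guess_focal(width, height, obs, scales=(0.5, 0.75, 1.0, 1.5, 2.0)):
    cx, cy = width / 2.0, height / 2.0
    best = None
    for s in scales:
        f = s * max(width, height)           # plausible focal-length guess
        K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
        err = reconstruct_two_view(K, obs)   # hypothetical: returns mean error
        if best is None or err < best[0]:
            best = (err, K)
    return best[1]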
30. Topics Not Discussed
• Homogeneous coordinates vs regular 2D and 3D
  • When should you use which type?
• How to parameterize rotation matrices
  • Rodrigues coordinates were used
• Proper normalization of input data
  • In general, keep everything at a magnitude of about 1
• Exception handling
  • What do you do if a matrix is degenerate?
• Recent research
  • New geometric algorithms
  • Deep learning based approaches
31. Papers and Software
Books / Papers
• Multiple View Geometry
• An Invitation to 3D Vision
• Build Rome in a Day
• Bundle Adjustment - A Modern Synthesis
• Direct Methods for Sparse Linear
Systems
Libraries
• BoofCV (Used here)
• Ceres Solver
• COLMAP
• Theia SFM
• Patch-based Multi-view Stereo
• Alice Vision
• SBA
A lot of research has gone into this topic. Here are some places to start learning more.
Apologies to all the papers/libraries that are missed!
32. Source Code and Applications

Presented results were generated using examples and demonstration code found in BoofCV (GitHub)
• Source code for three view:
  https://boofcv.org/index.php?title=Example_Three_View_Stereo_Uncalibrated
• Source code for two view:
  https://boofcv.org/index.php?title=Example_Stereo_Uncalibrated
• More robust and complex three view solution:
  https://github.com/lessthanoptimal/BoofCV/blob/master/main/boofcv-sfm/src/main/java/boofcv/alg/sfm/structure/ThreeViewEstimateMetricScene.java

Pre-built applications are available
• Demonstration application (link)
  • SFM 3D -> DemoThreeViewStereoApp
  • Can dynamically adjust some settings
  • Open images on your computer
• Pre-built example code can be run using the same application
Editor's Notes
A common theme in computer vision in recent years is that more data is better than more advanced algorithms.
This lesson was learned a while ago in Multiview reconstruction. The focus in that field has been in how to get a crude initial estimate of the scene, then use a ton of data to refine and recover from initial mistakes.
In contrast, this presentation is focused on how to get the best initial estimate given only a little bit of data.
For the next several slides the math is going to get a bit more intense. Instead of going over each derivation in detail, I will focus on what they let us find.
SBA is a specialized form of least-squares optimization. It is typically implemented using the Schur complement and a trust-region algorithm like Levenberg-Marquardt. Different techniques are used depending on the size of your dataset.
For an algorithm to be of practical use it needs to work quite often on real data. Oftentimes algorithms presented in books and papers are not practical and the results are cherry-picked. There is a proud tradition in computer vision of selecting the one time an algorithm worked from the 20 times it failed. The goal is just to show that it can work at all, right?