Journey to Structure from Motion
Ja-Keoung Koo
Image Laboratory at Chung-Ang University
http://image.cau.ac.kr
* This presentation is about my study in 2014.
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Introduction
'Structure from Motion'?
Recover 3D information from 2D images (captured by a moving camera).
Visual SLAM (simultaneous localization and mapping) in robotics.
Tracking and mapping (e.g. PTAM, ISMAR 2007; DTAM, ICCV
2011)
Visual odometry (cf. wheel odometry)
Applications: Building Rome in a Day (large scale)
(Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian
Curless, Steven M. Seitz and Richard Szeliski, Communications of the
ACM 2011) (..., ICCV 2009)
Applications: Autodesk 123D Catch
Smartphone as a 3D scanner
Introduction
We will see:
usual SfM pipeline
with additional devices (IMU, rolling shutter, RGBD)
different approaches (direct, variational)
* Main reference: Y. Ma et al., 2004, An Invitation to 3-D Vision.
Figures without a reference are from this book.
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Image Formation: Rigid body motion
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
In homogeneous coordinates,
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
Image Formation: Rigid body motion (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[ X_w = R X_c + T, \qquad X_c = R^T (X_w - T) \]
Camera center \(O_c\) in the world coordinate system:
\[ X_w = R\,0 + T = T \]
\(O_w\) in the camera coordinate system:
\[ X_c = R^T (0 - T) = -R^T T \]
View direction:
\[ \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\;\Longrightarrow\;
\left( R \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} + T \right) - T = R(:, 3) \]
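For concreteness, here is a minimal MATLAB sketch of these identities; the rotation angle and translation below are made-up values, and [R T] maps camera coordinates to world coordinates as on this slide:

theta = pi/6;                                   % hypothetical rotation about the y-axis
R = [cos(theta) 0 sin(theta); 0 1 0; -sin(theta) 0 cos(theta)];
T = [1; 2; 3];                                  % hypothetical camera position

Oc_world = R*[0;0;0] + T;                       % camera center in the world frame: T
Ow_cam   = R'*([0;0;0] - T);                    % world origin in the camera frame: -R'*T
view_dir = (R*[0;0;1] + T) - T;                 % viewing direction: third column of R
assert(norm(view_dir - R(:,3)) < 1e-12)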
Image Formation: Perspective projection
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
In homogeneous coordinates (∼ means equality up to scale),
\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
= \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
\sim \begin{bmatrix} X/Z \\ Y/Z \\ 1 \end{bmatrix}
\]
We have lost the structure Z!
There are no camera-related parameters here.
x, y: retinal (image or normalized) coordinates.
x', y': pixel coordinates.
Image Formation
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
e.g. our lens has a focal length f = 4.16 mm.
There is nothing related to pixels here!
Image Formation (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
For example, given that the horizontal pixel size of the sensor is 3.75 µm/pixel,
\[ s_x = \frac{1}{3.75 \times 10^{-6}} \;\text{pixel/m}. \]
So the focal length in pixels is
\[ f \cdot s_x = 4.16 \times 10^{-3} \cdot \frac{1}{3.75 \times 10^{-6}} \approx 1109. \]
Think of the pixel coordinate as a virtual coordinate.
Image Formation (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
Image Formation
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
What elements depend on the camera (sensor + lens)?
\[ K \equiv
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
K is called the camera intrinsic matrix.
The procedure of estimating K is called camera calibration.
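A minimal MATLAB sketch assembling K from the numbers in the earlier example (f = 4.16 mm, 3.75 µm pixels); the principal point values are hypothetical:

f  = 4.16e-3;                   % focal length [m]
sx = 1/3.75e-6; sy = sx;        % [pixel/m]
ox = 640; oy = 360;             % hypothetical principal point [pixel]

Ks = [sx 0 ox; 0 sy oy; 0 0 1];
Kf = [f 0 0; 0 f 0; 0 0 1];
K  = Ks*Kf;                     % K(1,1) = f*sx, about 1109 pixels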
3D Reconstruction from Images
Pixel coordinate → Retinal coordinate → Camera coordinate → World coordinate
(not exactly (−1, −1) ∼ (1, 1))
\[ x' = Kx, \qquad K^{-1} \underbrace{x'}_{\text{observation}} = \underbrace{x}_{\text{normalized}} \]
3D Reconstruction from Images
Pixel coordinate → Retinal coordinate → Camera coordinate → World coordinate
Note that
\[ \pi : (X, Y, Z) \mapsto (x', y'), \qquad
x' = \frac{X \cdot f_x}{Z} + o_x, \quad y' = \frac{Y \cdot f_y}{Z} + o_y \]
If we know the 3-D scale factor Z and the camera intrinsic matrix K with respect to the reference camera,
\[ \pi^{-1} : (x', y', Z) \mapsto (X, Y, Z), \qquad
X = \frac{x' - o_x}{f_x} Z, \quad Y = \frac{y' - o_y}{f_y} Z, \quad Z = Z \]
If we know the camera motion, then we may get the 3-D structure for each view.
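A minimal MATLAB sketch of π and π⁻¹, assuming calibrated intrinsics f_x, f_y, o_x, o_y (hypothetical numbers below):

fx = 1109; fy = 1109; ox = 640; oy = 360;       % hypothetical intrinsics
project  = @(X)    [X(1)*fx/X(3) + ox; X(2)*fy/X(3) + oy];
backproj = @(u, Z) [(u(1)-ox)/fx*Z; (u(2)-oy)/fy*Z; Z];

X = [0.2; -0.1; 2.0];                           % a 3-D point in camera coordinates
u = project(X);                                 % forward projection loses Z ...
assert(norm(backproj(u, X(3)) - X) < 1e-12)     % ... but X is recoverable once Z is known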
Simpler Models
Orthographic camera model.
\[ \begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \]
Sometimes authors assume the simplest model and go further.
(Ullman, 1979)
: Proof of existence of a solution for SfM under orthography
(Tomasi and Kanade IJCV 1992)
: Factorization Method under orthography
Simpler Models (cont.)
Weak-perspective projection model.
\[ \begin{bmatrix} x \\ y \end{bmatrix} = s
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \]
Complex Models
FOV model
(Devernay and Faugeras 2001, Straight lines have to be straight)
\[ \pi : (X, Y, Z) \mapsto (x', y'), \qquad
x' = \frac{X \cdot f_x}{Z} \frac{r'}{r} + o_x, \quad
y' = \frac{Y \cdot f_y}{Z} \frac{r'}{r} + o_y, \]
where \( r = \sqrt{\frac{X^2 + Y^2}{Z^2}} \) and \( r' = \frac{1}{w} \arctan\!\left( 2 r \tan \frac{w}{2} \right) \).
...
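A minimal sketch of the FOV projection, where w is the model's single distortion parameter (all numbers below are hypothetical):

fx = 1109; fy = 1109; ox = 640; oy = 360; w = 0.9;   % hypothetical parameters
X = 0.2; Y = -0.1; Z = 2.0;                          % a hypothetical 3-D point

r  = sqrt(X^2 + Y^2)/Z;                 % undistorted radius in the retinal plane
rd = (1/w)*atan(2*r*tan(w/2));          % distorted radius r'
u  = (X*fx/Z)*(rd/r) + ox;              % distorted pixel coordinates
v  = (Y*fy/Z)*(rd/r) + oy;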
Examples
If K = I?
If o_x = 0, o_y = 0? (a screenshot? a cropped image?)
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Assumption
For a while, we assume that
The object (or scene) is static.
We have a calibrated camera. Simply, K = I.
There are only two views.
Notation.
λ : scalar representing 3-D depth
X : 3-D vector
x : 2-D vector
R ∈ SO(3) : 3×3 rotation matrix (R is orthogonal, det(R) = 1)
T : 3-D translation vector
\(\hat{A}\) : 3×3 skew-symmetric matrix such that \(\hat{A} v = A \times v\)
Epipolar Constraint
Given corresponding 2-D coordinates x1, x2, we want to find R, T, λ.
Let X1, X2 be the 3-D coordinates of a point p in each frame. Then we can write
\[ X_2 = R X_1 + T \quad\Longrightarrow\quad \lambda_2 x_2 = R \lambda_1 x_1 + T \]
Premultiplying by \(\hat{T}\),
\[ \lambda_2 \hat{T} x_2 = \hat{T} R \lambda_1 x_1 \]
Since \(\hat{T} x_2 = T \times x_2\) is perpendicular to the vector \(x_2\),
\[ \langle x_2, \hat{T} x_2 \rangle = x_2^T \hat{T} x_2 = 0. \]
Thus,
\[ \langle x_2, T \times R x_1 \rangle = 0 \]
Epipolar Constraint (cont.)
\[ x_2^T \hat{T} R x_1 = 0 \quad \text{(epipolar constraint)} \]
\[ E = \hat{T} R \quad \text{(essential matrix)} \]
\[ \mathcal{E} \doteq \{\, \hat{T} R \mid R \in SO(3),\; T \in \mathbb{R}^3 \,\} \subset \mathbb{R}^{3 \times 3} \quad \text{(essential space)} \]
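A minimal MATLAB sketch of the constraint on synthetic data (pose and point are made up); hat implements the skew-symmetric operator:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];

R = expm(hat([0.05; -0.1; 0.02]));      % a small rotation (matrix exponential)
T = [0.3; 0.1; -0.2];                   % hypothetical translation
E = hat(T)*R;                           % essential matrix

X1 = [0.4; -0.2; 3.0];                  % a 3-D point in frame 1
X2 = R*X1 + T;                          % the same point in frame 2
x1 = X1/X1(3); x2 = X2/X2(3);           % normalized image coordinates
disp(x2'*E*x1)                          % ~0 up to round-off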
Kronecker product and stack of matrices
The following is useful when A is the only unknown:
\[ x_2^T A x_1 = 0 \iff (x_1 \otimes x_2)^T A^s = 0 \]
(We saw this in the conjugate gradient lecture: x2 and x1 are A-orthogonal, or conjugate.)
\[ x_1 \otimes x_2 = a = [x_1 x_2,\; x_1 y_2,\; x_1 z_2,\; y_1 x_2,\; y_1 y_2,\; y_1 z_2,\; z_1 x_2,\; z_1 y_2,\; z_1 z_2]^T \in \mathbb{R}^9 \quad \text{(Kronecker product)} \]
The Kronecker product of two vectors is also a vector.
We can stack the elements of a matrix column-wise:
\[ E^s = [e_{11}, e_{21}, e_{31}, e_{12}, e_{22}, e_{32}, e_{13}, e_{23}, e_{33}]^T \in \mathbb{R}^9 \quad \text{(stack of a matrix } E\text{)} \]
The stack of a matrix is a vector.
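A minimal numeric check of this identity in MATLAB (random data):

x1 = randn(3,1); x2 = randn(3,1); A = randn(3,3);
lhs = x2'*A*x1;
rhs = kron(x1, x2)'*A(:);               % A(:) stacks A column-wise, i.e. A^s
disp(abs(lhs - rhs))                    % ~0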
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
Idea: decompose known and unknown, approximate E, and then recover R, T.
Input: n image correspondences \((x_1^j, x_2^j)\), j = 1, 2, ..., n (n ≥ 8).
From the input data, we can set
\[ \chi = [a^1, a^2, \cdots, a^n]^T \in \mathbb{R}^{n \times 9}, \qquad a^j = x_1^j \otimes x_2^j \in \mathbb{R}^9. \]
From the epipolar constraint \(x_2^T \hat{T} R x_1 = 0\), we can rewrite this as
\[ \chi E^s = 0. \]
(Rank theorem) When A is an m-by-n matrix,
\[ \operatorname{rank}(A) + \dim \operatorname{Nul}(A) = n. \]
If \(\operatorname{rank}(\chi) = 8\), then
\[ \operatorname{rank}(\chi) + \dim \operatorname{Nul}(\chi) = 8 + \dim \operatorname{Nul}(\chi) = 9, \]
so \(\dim \operatorname{Nul}(\chi) = 1\) and we can get \(E^s\) up to scale. This is why we need at least 8 points.
8-point algorithm (cont.)
Approximation of E → Project onto the essential space → Recover R, T
Find the vector \(E^s \in \mathbb{R}^9\) of unit length (we can take any scale) such that \(\|\chi E^s\|\) is minimized.
Claim: \(E^s\) should be the eigenvector of \(\chi^T \chi\) that corresponds to its smallest eigenvalue.
\(\chi^T \chi\) is symmetric and hence orthogonally diagonalizable (its eigenvalues are \(\sigma^2\)):
\[ \|\chi E^s\|^2 = (E^s)^T (\chi^T \chi E^s) = (E^s)^T (\sigma^2 E^s) = \sigma^2 \]
So if σ is minimized, \(\|\chi E^s\|\) is also minimized. In this way, we can get \(E^s\).
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
Theorem. E is an essential matrix if and only if \(E = U \Sigma V^T\) with \(\Sigma = \operatorname{diag}\{\sigma, \sigma, 0\}\), where σ > 0 and U, V ∈ SO(3).
So we just have to replace the singular values with {1, 1, 0}.
% Construct chi matrix from n correspondences (featx1, featy1) <-> (featx2, featy2)
chi = zeros(n, 9);
for i = 1:n
    chi(i, :) = kron([featx1(i); featy1(i); 1], [featx2(i); featy2(i); 1])';
end
% Find E_stacked that minimizes |chi * E_stacked| (last right-singular vector)
[U_chi, S_chi, V_chi] = svd(chi);
E_stacked = V_chi(:, 9);
E_approx = reshape(E_stacked, 3, 3);
% Project E onto the essential space: replace singular values with {1, 1, 0}
[U, S, V] = svd(E_approx);
S(1,1) = 1; S(2,2) = 1; S(3,3) = 0;
% Get essential matrix
E = U * S * V';
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
(Theorem) There are 4 candidates (E and −E each give two):
\[ (\hat{T}, R) = \left( U R_Z\!\left(\pm\tfrac{\pi}{2}\right) \Sigma\, U^T,\;
U R_Z^T\!\left(\pm\tfrac{\pi}{2}\right) V^T \right) \]
Which candidate is the solution we want? Choose the one for which the depths λ are positive.
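A minimal sketch of the four-candidate decomposition, continuing from the snippet above (E with singular values {1, 1, 0}); R_Z(±π/2) is written out explicitly:

Rz = [0 -1 0; 1 0 0; 0 0 1];            % Rz(+pi/2); Rz(-pi/2) = Rz'
[U, S, V] = svd(E);
if det(U) < 0, U = -U; end              % keep U, V in SO(3) (a flip swaps E for -E)
if det(V) < 0, V = -V; end

R1 = U*Rz'*V';  T1hat = U*Rz *S*U';     % candidate 1
R2 = U*Rz *V';  T2hat = U*Rz'*S*U';     % candidate 2
T1 = [T1hat(3,2); T1hat(1,3); T1hat(2,1)];   % read T off the skew-symmetric T1hat
T2 = [T2hat(3,2); T2hat(1,3); T2hat(2,1)];
% The other two candidates use -T1, -T2 (from -E); keep the one with positive depths.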
Reconstruction from Two Calibrated Views
Approximation of E → Project onto the essential space → Recover R, T → Reconstruction
\[ \lambda_2^j x_2^j = \lambda_1^j R x_1^j + \gamma T, \qquad j = 1, 2, \dots, n \]
To eliminate \(\lambda_2^j\), multiply both sides by \(\hat{x}_2^j\):
\[ 0 = \lambda_1^j \hat{x}_2^j R x_1^j + \gamma \hat{x}_2^j T \]
Decompose known and unknown:
\[ M^j \lambda^j \doteq \begin{bmatrix} \hat{x}_2^j R x_1^j & \hat{x}_2^j T \end{bmatrix}
\begin{bmatrix} \lambda_1^j \\ \gamma \end{bmatrix} = 0 \qquad (1) \]
Moreover, we can set
\[ v = [\lambda_1^1, \lambda_1^2, \dots, \lambda_1^n, \gamma]^T \in \mathbb{R}^{n+1}, \qquad
M = \begin{bmatrix}
\hat{x}_2^1 R x_1^1 & 0 & \cdots & 0 & \hat{x}_2^1 T \\
0 & \hat{x}_2^2 R x_1^2 & \cdots & 0 & \hat{x}_2^2 T \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \hat{x}_2^n R x_1^n & \hat{x}_2^n T
\end{bmatrix}, \qquad M v = 0 \]
The linear least-squares estimate of v is simply the eigenvector of \(M^T M\) that corresponds to its smallest eigenvalue.
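A minimal MATLAB sketch of this step; x1s, x2s are assumed given 3 x n matrices of normalized homogeneous correspondences, with R, T the recovered motion:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
M = zeros(3*n, n+1);
for j = 1:n
    rows = 3*(j-1)+(1:3);
    M(rows, j)   = hat(x2s(:,j)) * R * x1s(:,j);
    M(rows, n+1) = hat(x2s(:,j)) * T;
end
[~, ~, Vm] = svd(M);
v = Vm(:, end);                          % [lambda_1^1, ..., lambda_1^n, gamma]
v = v * sign(v(end));                    % fix the overall sign so that gamma > 0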
Reconstruction from Two Calibrated Views
Approximation of E → Project onto the essential space → Recover R, T → Reconstruction → Bundle adjustment
Theory is beautiful, but we have noisy data \(\tilde{x}_1, \tilde{x}_2\) and some outliers.
Multiple-view: Assumption
Now, we assume that
The object (or scene) is static.
We have a calibrated camera. Simply, K = I.
There are multiple views.
All the corresponding points are observable in multiple views.
For example, when the baseline is not large, we can track features using the KLT tracker (Lucas and Kanade, 1981; Shi and Tomasi, 1994).
Then, choose the features that are visible in all frames.
The Multiple-view Matrix of a Point
\[ \mathcal{I}\lambda =
\begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_m \end{bmatrix}
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix}
= \begin{bmatrix} \Pi_1 \\ \Pi_2 \\ \vdots \\ \Pi_m \end{bmatrix} X = \Pi X, \]
which is of the form
\[ \mathcal{I}\lambda = \Pi X \]
For a more compact formulation, we introduce the matrix
\[ \mathcal{I}^\perp =
\begin{bmatrix} \hat{x}_1 & 0 & \cdots & 0 \\ 0 & \hat{x}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{x}_m \end{bmatrix}
\in \mathbb{R}^{3m \times 3m}, \]
which has the property of "annihilating" \(\mathcal{I}\):
\[ \mathcal{I}^\perp \mathcal{I} = 0. \]
The Multiple-view Matrix of a Point (cont.)
We can premultiply the above to obtain
\[ \mathcal{I}^\perp \Pi X = 0. \]
Thus the vector X is in the null space of the matrix
\[ W_p \doteq \mathcal{I}^\perp \Pi =
\begin{bmatrix} \hat{x}_1 \Pi_1 \\ \hat{x}_2 \Pi_2 \\ \vdots \\ \hat{x}_m \Pi_m \end{bmatrix}
\in \mathbb{R}^{3m \times 4} \]
To have a nontrivial solution, we must have
\[ \operatorname{rank}(W_p) \le 3 \]
The rank of the matrix \(W_p\) is not affected if we multiply by a full-rank matrix \(D_p \in \mathbb{R}^{4 \times 5}\) as follows:
\[ W_p D_p =
\begin{bmatrix} \hat{x}_1 \Pi_1 \\ \hat{x}_2 \Pi_2 \\ \vdots \\ \hat{x}_m \Pi_m \end{bmatrix}
\begin{bmatrix} x_1 & \hat{x}_1^T & 0 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix}
\hat{x}_1 x_1 & \hat{x}_1 \hat{x}_1^T & 0 \\
\hat{x}_2 R_2 x_1 & \hat{x}_2 R_2 \hat{x}_1^T & \hat{x}_2 T_2 \\
\hat{x}_3 R_3 x_1 & \hat{x}_3 R_3 \hat{x}_1^T & \hat{x}_3 T_3 \\
\vdots & \vdots & \vdots \\
\hat{x}_m R_m x_1 & \hat{x}_m R_m \hat{x}_1^T & \hat{x}_m T_m
\end{bmatrix}, \qquad \hat{x}_1 x_1 = 0. \]
The Multiple-view Matrix of a Point (cont.)
This means that \(\operatorname{rank}(W_p) \le 3\) if and only if the submatrix
\[ M_p \doteq
\begin{bmatrix}
\hat{x}_2 R_2 x_1 & \hat{x}_2 T_2 \\
\hat{x}_3 R_3 x_1 & \hat{x}_3 T_3 \\
\vdots & \vdots \\
\hat{x}_m R_m x_1 & \hat{x}_m T_m
\end{bmatrix} \in \mathbb{R}^{3(m-1) \times 2} \]
has \(\operatorname{rank}(M_p) \le 1\). \(M_p\) is called the multiple-view matrix associated with a point p. It involves both the image \(x_1\) in the first view and the coimages \(\hat{x}_i\) in the remaining views.
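A minimal sketch that builds M_p for one point and checks its rank; xs (3 x m, normalized coordinates of the point in each view) and the motions Rs{i}, Ts{i} relative to view 1 are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
Mp = zeros(3*(m-1), 2);
for i = 2:m
    rows = 3*(i-2)+(1:3);
    Mp(rows, 1) = hat(xs(:,i)) * Rs{i} * xs(:,1);
    Mp(rows, 2) = hat(xs(:,i)) * Ts{i};
end
rank(Mp)                                 % 1 for noise-free, consistent data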
Multiple-view Factorization of Point Features
\(\operatorname{rank}(M_p) \le 1\) states that the two columns of \(M_p\) are linearly dependent. For the j-th point \(p^j\) this implies
\[ \begin{bmatrix} \hat{x}_2^j R_2 x_1^j \\ \hat{x}_3^j R_3 x_1^j \\ \vdots \\ \hat{x}_m^j R_m x_1^j \end{bmatrix}
+ \alpha^j
\begin{bmatrix} \hat{x}_2^j T_2 \\ \hat{x}_3^j T_3 \\ \vdots \\ \hat{x}_m^j T_m \end{bmatrix} = 0 \]
for some parameter \(\alpha^j \in \mathbb{R}\), j = 1, ..., n. Each row in the above equation can be obtained from \(\lambda_i^j x_i^j = \lambda_1^j R_i x_1^j + T_i\) by multiplying with \(\hat{x}_i^j\):
\[ \hat{x}_i^j R_i x_1^j + \hat{x}_i^j T_i / \lambda_1^j = 0, \qquad j = 1, 2, \dots, n. \]
Therefore, \(\alpha^j = 1/\lambda_1^j\) is nothing but the inverse of the depth of point \(p^j\) with respect to the first frame.
Motion Estimation from Known Structure
Assume we have the depths of the points and thus their inverses \(\alpha^j\) (i.e. known structure). Then the above equation is linear in the camera motion parameters \(R_i\) and \(T_i\). Using the stack notation, for each fixed i = 2, 3, ..., m,
\[ P_i \begin{bmatrix} R_i^s \\ T_i \end{bmatrix} \doteq
\begin{bmatrix}
{x_1^1}^T \otimes \hat{x}_i^1 & \alpha^1 \hat{x}_i^1 \\
{x_1^2}^T \otimes \hat{x}_i^2 & \alpha^2 \hat{x}_i^2 \\
\vdots & \vdots \\
{x_1^n}^T \otimes \hat{x}_i^n & \alpha^n \hat{x}_i^n
\end{bmatrix}
\begin{bmatrix} R_i^s \\ T_i \end{bmatrix} = 0 \in \mathbb{R}^{3n} \]
It can be shown that if the \(\alpha^j\) are known, the matrix \(P_i \in \mathbb{R}^{3n \times 12}\) has rank 11 when n ≥ 6 points in general position are given. In that case, the null space of \(P_i\) is unique up to a scale factor, and so is the projection matrix \(\Pi_i = [R_i, T_i]\).
This does not completely decouple structure and motion: structure information is needed first.
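A minimal sketch of this linear system for one view i; x1s and xis (3 x n normalized coordinates in views 1 and i) and alpha are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
P_i = zeros(3*n, 12);
for j = 1:n
    rows = 3*(j-1)+(1:3);
    P_i(rows, 1:9)   = kron(x1s(:,j)', hat(xis(:,j)));   % acts on Ri stacked column-wise
    P_i(rows, 10:12) = alpha(j) * hat(xis(:,j));
end
[~, ~, Vp] = svd(P_i);
sol = Vp(:, end);                        % null vector, up to scale
Ri  = reshape(sol(1:9), 3, 3);           % still needs projection onto SO(3)
Ti  = sol(10:12);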
Structure Estimation from Known Motion
In turn, if the camera motion \(\Pi_i = (R_i, T_i)\), i = 1, ..., m is known, we can estimate the structure. The least-squares solution of the above equation is given by
\[ \alpha^j = - \frac{\sum_{i=2}^{m} (\hat{x}_i^j T_i)^T \hat{x}_i^j R_i x_1^j}{\sum_{i=2}^{m} \| \hat{x}_i^j T_i \|^2}, \qquad j = 1, \dots, n. \]
In this way one can iteratively estimate structure and motion, estimating
one while keeping the other fixed.
For initialization, we need the structure parameters from the eight-point algorithm.
While the equation for Πi makes use of the two frames 1 and i only, the
structure parameter estimation takes into account all frames. This can be
done either in batch mode or recursively.
As for the two-view case, such spectral approaches do not guarantee
optimality in the presence of noise and uncertainty.
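A minimal sketch of the closed-form depth estimate above; xs{i} (3 x n normalized coordinates in view i) and the motions Rs{i}, Ts{i} are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
alpha = zeros(n, 1);
for j = 1:n
    num = 0; den = 0;
    for i = 2:m
        a = hat(xs{i}(:,j)) * Ts{i};
        b = hat(xs{i}(:,j)) * Rs{i} * xs{1}(:,j);
        num = num + a' * b;
        den = den + a' * a;              % ||hat(x_i^j) T_i||^2
    end
    alpha(j) = -num / den;
end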
Incremental Structure from Motion
History
(Kruppa, 1916)
8-point algorithm
5-point algorithm
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Biological Motivation: Human Inertial Sensors
The utricle measures acceleration in the horizontal direction and the
saccule measures in the vertical direction.
The semicircular canals detect angular velocity of the head and are
oriented in three orthogonal planes. [1]
http://www.nebraskamed.com/health-library/3d-medical-atlas/302/balance-and-equilibrium
IMU sensors
An IMU (inertial measurement unit) measures acceleration, angular velocity, and other quantities using an accelerometer, a gyroscope, and other sensors.
Thanks to the development of MEMS (Micro-Electro-Mechanical Systems) technology, IMU sensors are now easy to find.
For example, since the iPhone 4 (2010), almost all smartphones have been equipped with accelerometer, gyroscope, and magnetic sensors.
Camera meets IMU
KLT tracker with IMU
We want to understand it, implement it, and test it on our own data.
Hwangbo, Kim, and Kanade, 2009 “Inertial-Aided KLT Feature Tracking
for a Moving Camera.” [2]
https://www.youtube.com/watch?v=a81WzJONPGA
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Summary
Research is ongoing for each stage:
camera model, calibration, feature matching or tracking, two view,
multi view, bundle adjustment, dense matching, visualization, ...
Useful if there are additional devices:
IMU(motion), RGBD(structure), ...
Different approaches:
factorization method
feature-less direct approach
variational approach
What we have assumed:
static scene, calibrated camera
What we have not considered:
...
References
[1] P. Corke, J. Lobo, and J. Dias. An Introduction to Inertial and Visual
Sensing. The International Journal of Robotics Research,
26(6):519–535, June 2007.
[2] Myung Hwangbo, Jun-Sik Kim, and Takeo Kanade. Inertial-aided KLT feature tracking for a moving camera. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1909–1916. IEEE, 2009.
