Journey to Structure from Motion
Ja-Keoung Koo
Image Laboratory at Chung-Ang University
http://image.cau.ac.kr
* This presentation is about my study in 2014.
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Introduction
'Structure from Motion'?
Recover 3D information from 2D images (captured by a moving camera).
Visual SLAM (simultaneous localization and mapping) in robotics.
Tracking and mapping (e.g. PTAM, ISMAR 2007; DTAM, ICCV
2011)
Visual odometry (cf. wheel odometry)
Applications: Building Rome in a Day (large scale)
(Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian
Curless, Steven M. Seitz and Richard Szeliski, Communications of the
ACM 2011) (..., ICCV 2009)
Applications: Autodesk 123D Catch
Smartphone as a 3D scanner
Introduction
We will see:
usual SfM pipeline
with additional devices (IMU, rolling shutter, RGBD)
different approaches (direct, variational)
* Main reference: Y. Ma et al., 2004, An Invitation to 3-D Vision.
Figures without a reference are from this book.
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Image Formation: Rigid body motion
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
In homogeneous coordinates,
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
Image Formation: Rigid body motion (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[ X_w = R X_c + T, \qquad X_c = R^T (X_w - T) \]
Camera center \(O_c\) in the world coordinate system:
\[ X_w = R\,0 + T = T \]
\(O_w\) in the camera coordinate system:
\[ X_c = R^T (0 - T) = -R^T T \]
View direction:
\[ \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\;\Longrightarrow\;
\left( R \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} + T \right) - T = R(:, 3) \]
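For concreteness, here is a minimal MATLAB sketch of these identities; the rotation angle and translation below are made-up values, and [R T] maps camera coordinates to world coordinates as on this slide:

theta = pi/6;                                   % hypothetical rotation about the y-axis
R = [cos(theta) 0 sin(theta); 0 1 0; -sin(theta) 0 cos(theta)];
T = [1; 2; 3];                                  % hypothetical camera position

Oc_world = R*[0;0;0] + T;                       % camera center in the world frame: T
Ow_cam   = R'*([0;0;0] - T);                    % world origin in the camera frame: -R'*T
view_dir = (R*[0;0;1] + T) - T;                 % viewing direction: third column of R
assert(norm(view_dir - R(:,3)) < 1e-12)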
Image Formation: Perspective projection
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
In homogeneous coordinates (∼ means equality up to scale),
\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
= \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
\sim \begin{bmatrix} X/Z \\ Y/Z \\ 1 \end{bmatrix}
\]
We have lost the structure Z!
There are no camera-related parameters here.
x, y: retinal (image or normalized) coordinates.
x', y': pixel coordinates.
Image Formation
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
e.g. our lens has a focal length f = 4.16 mm.
There is nothing related to pixels here!
Image Formation (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
For example, given that the horizontal pixel size of the sensor is 3.75 µm/pixel,
\[ s_x = \frac{1}{3.75 \times 10^{-6}} \;\text{pixel/m}. \]
So the focal length in pixels is
\[ f \cdot s_x = 4.16 \times 10^{-3} \cdot \frac{1}{3.75 \times 10^{-6}} \approx 1109. \]
Think of the pixel coordinate as a virtual coordinate.
Image Formation (cont.)
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
Image Formation
Pixel coordinate ← Retinal coordinate ← Camera coordinate ← World coordinate
\[
\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\]
What elements depend on the camera (sensor + lens)?
\[ K \equiv
\begin{bmatrix} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
K is called the camera intrinsic matrix.
The procedure of estimating K is called camera calibration.
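A minimal MATLAB sketch assembling K from the numbers in the earlier example (f = 4.16 mm, 3.75 µm pixels); the principal point values are hypothetical:

f  = 4.16e-3;                   % focal length [m]
sx = 1/3.75e-6; sy = sx;        % [pixel/m]
ox = 640; oy = 360;             % hypothetical principal point [pixel]

Ks = [sx 0 ox; 0 sy oy; 0 0 1];
Kf = [f 0 0; 0 f 0; 0 0 1];
K  = Ks*Kf;                     % K(1,1) = f*sx, about 1109 pixels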
3D Reconstruction from Images
Pixel coordinate → Retinal coordinate → Camera coordinate → World coordinate
(not exactly (−1, −1) ∼ (1, 1))
\[ x' = Kx, \qquad K^{-1} \underbrace{x'}_{\text{observation}} = \underbrace{x}_{\text{normalized}} \]
3D Reconstruction from Images
Pixel coordinate → Retinal coordinate → Camera coordinate → World coordinate
Note that
\[ \pi : (X, Y, Z) \mapsto (x', y'), \qquad
x' = \frac{X \cdot f_x}{Z} + o_x, \quad y' = \frac{Y \cdot f_y}{Z} + o_y \]
If we know the 3-D scale factor Z and the camera intrinsic matrix K with respect to the reference camera,
\[ \pi^{-1} : (x', y', Z) \mapsto (X, Y, Z), \qquad
X = \frac{x' - o_x}{f_x} Z, \quad Y = \frac{y' - o_y}{f_y} Z, \quad Z = Z \]
If we know the camera motion, then we may get the 3-D structure for each view.
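A minimal MATLAB sketch of π and π⁻¹, assuming calibrated intrinsics f_x, f_y, o_x, o_y (hypothetical numbers below):

fx = 1109; fy = 1109; ox = 640; oy = 360;       % hypothetical intrinsics
project  = @(X)    [X(1)*fx/X(3) + ox; X(2)*fy/X(3) + oy];
backproj = @(u, Z) [(u(1)-ox)/fx*Z; (u(2)-oy)/fy*Z; Z];

X = [0.2; -0.1; 2.0];                           % a 3-D point in camera coordinates
u = project(X);                                 % forward projection loses Z ...
assert(norm(backproj(u, X(3)) - X) < 1e-12)     % ... but X is recoverable once Z is known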
Simpler Models
Orthographic camera model.
\[ \begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \]
Sometimes authors assume the simplest model and go further.
(Ullman, 1979)
: Proof of existence of a solution for SfM under orthography
(Tomasi and Kanade IJCV 1992)
: Factorization Method under orthography
Simpler Models (cont.)
Weak-perspective projection model.
\[ \begin{bmatrix} x \\ y \end{bmatrix} = s
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \]
Complex Models
FOV model
(Devernay and Faugeras 2001, Straight lines have to be straight)
\[ \pi : (X, Y, Z) \mapsto (x', y'), \qquad
x' = \frac{X \cdot f_x}{Z} \frac{r'}{r} + o_x, \quad
y' = \frac{Y \cdot f_y}{Z} \frac{r'}{r} + o_y, \]
where \( r = \sqrt{\frac{X^2 + Y^2}{Z^2}} \) and \( r' = \frac{1}{w} \arctan\!\left( 2 r \tan \frac{w}{2} \right) \).
...
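A minimal sketch of the FOV projection, where w is the model's single distortion parameter (all numbers below are hypothetical):

fx = 1109; fy = 1109; ox = 640; oy = 360; w = 0.9;   % hypothetical parameters
X = 0.2; Y = -0.1; Z = 2.0;                          % a hypothetical 3-D point

r  = sqrt(X^2 + Y^2)/Z;                 % undistorted radius in the retinal plane
rd = (1/w)*atan(2*r*tan(w/2));          % distorted radius r'
u  = (X*fx/Z)*(rd/r) + ox;              % distorted pixel coordinates
v  = (Y*fy/Z)*(rd/r) + oy;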
Examples
If K = I?
If o_x = 0, o_y = 0? (a screenshot? a cropped image?)
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Assumption
For a while, we assume that
The object (or scene) is static.
We have a calibrated camera. Simply, K = I.
There are only two views.
Notation.
λ : scalar representing 3-D depth
X : 3-D vector
x : 2-D vector
R ∈ SO(3) : 3×3 rotation matrix (R is orthogonal, det(R) = 1)
T : 3-D translation vector
\(\hat{A}\) : 3×3 skew-symmetric matrix such that \(\hat{A} v = A \times v\)
Epipolar Constraint
Given corresponding 2-D coordinates x1, x2, we want to find R, T, λ.
Let X1, X2 be the 3-D coordinates of a point p in each frame. Then we can write
\[ X_2 = R X_1 + T \quad\Longrightarrow\quad \lambda_2 x_2 = R \lambda_1 x_1 + T \]
Premultiplying by \(\hat{T}\),
\[ \lambda_2 \hat{T} x_2 = \hat{T} R \lambda_1 x_1 \]
Since \(\hat{T} x_2 = T \times x_2\) is perpendicular to the vector \(x_2\),
\[ \langle x_2, \hat{T} x_2 \rangle = x_2^T \hat{T} x_2 = 0. \]
Thus,
\[ \langle x_2, T \times R x_1 \rangle = 0 \]
Epipolar Constraint (cont.)
\[ x_2^T \hat{T} R x_1 = 0 \quad \text{(epipolar constraint)} \]
\[ E = \hat{T} R \quad \text{(essential matrix)} \]
\[ \mathcal{E} \doteq \{\, \hat{T} R \mid R \in SO(3),\; T \in \mathbb{R}^3 \,\} \subset \mathbb{R}^{3 \times 3} \quad \text{(essential space)} \]
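A minimal MATLAB sketch of the constraint on synthetic data (pose and point are made up); hat implements the skew-symmetric operator:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];

R = expm(hat([0.05; -0.1; 0.02]));      % a small rotation (matrix exponential)
T = [0.3; 0.1; -0.2];                   % hypothetical translation
E = hat(T)*R;                           % essential matrix

X1 = [0.4; -0.2; 3.0];                  % a 3-D point in frame 1
X2 = R*X1 + T;                          % the same point in frame 2
x1 = X1/X1(3); x2 = X2/X2(3);           % normalized image coordinates
disp(x2'*E*x1)                          % ~0 up to round-off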
Kronecker product and stack of matrices
The following is useful when A is the only unknown:
\[ x_2^T A x_1 = 0 \iff (x_1 \otimes x_2)^T A^s = 0 \]
(We saw this in the conjugate gradient lecture: x2 and x1 are A-orthogonal, or conjugate.)
\[ x_1 \otimes x_2 = a = [x_1 x_2,\; x_1 y_2,\; x_1 z_2,\; y_1 x_2,\; y_1 y_2,\; y_1 z_2,\; z_1 x_2,\; z_1 y_2,\; z_1 z_2]^T \in \mathbb{R}^9 \quad \text{(Kronecker product)} \]
The Kronecker product of two vectors is also a vector.
We can stack the elements of a matrix column-wise:
\[ E^s = [e_{11}, e_{21}, e_{31}, e_{12}, e_{22}, e_{32}, e_{13}, e_{23}, e_{33}]^T \in \mathbb{R}^9 \quad \text{(stack of a matrix } E\text{)} \]
The stack of a matrix is a vector.
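A minimal numeric check of this identity in MATLAB (random data):

x1 = randn(3,1); x2 = randn(3,1); A = randn(3,3);
lhs = x2'*A*x1;
rhs = kron(x1, x2)'*A(:);               % A(:) stacks A column-wise, i.e. A^s
disp(abs(lhs - rhs))                    % ~0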
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
Idea: decompose known and unknown, approximate E, and then recover R, T.
Input: n image correspondences \((x_1^j, x_2^j)\), j = 1, 2, ..., n (n ≥ 8).
From the input data, we can set
\[ \chi = [a^1, a^2, \cdots, a^n]^T \in \mathbb{R}^{n \times 9}, \qquad a^j = x_1^j \otimes x_2^j \in \mathbb{R}^9. \]
From the epipolar constraint \(x_2^T \hat{T} R x_1 = 0\), we can rewrite this as
\[ \chi E^s = 0. \]
(Rank theorem) When A is an m-by-n matrix,
\[ \operatorname{rank}(A) + \dim \operatorname{Nul}(A) = n. \]
If \(\operatorname{rank}(\chi) = 8\), then
\[ \operatorname{rank}(\chi) + \dim \operatorname{Nul}(\chi) = 8 + \dim \operatorname{Nul}(\chi) = 9, \]
so \(\dim \operatorname{Nul}(\chi) = 1\) and we can get \(E^s\) up to scale. This is why we need at least 8 points.
8-point algorithm (cont.)
Approximation of E → Project onto the essential space → Recover R, T
Find the vector \(E^s \in \mathbb{R}^9\) of unit length (we can take any scale) such that \(\|\chi E^s\|\) is minimized.
Claim: \(E^s\) should be the eigenvector of \(\chi^T \chi\) that corresponds to its smallest eigenvalue.
\(\chi^T \chi\) is symmetric and hence orthogonally diagonalizable (its eigenvalues are \(\sigma^2\)):
\[ \|\chi E^s\|^2 = (E^s)^T (\chi^T \chi E^s) = (E^s)^T (\sigma^2 E^s) = \sigma^2 \]
So if σ is minimized, \(\|\chi E^s\|\) is also minimized. In this way, we can get \(E^s\).
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
Theorem. E is an essential matrix if and only if \(E = U \Sigma V^T\) with \(\Sigma = \operatorname{diag}\{\sigma, \sigma, 0\}\), where σ > 0 and U, V ∈ SO(3).
So we just have to replace the singular values with {1, 1, 0}.
% Construct chi matrix from n correspondences (featx1, featy1) <-> (featx2, featy2)
chi = zeros(n, 9);
for i = 1:n
    chi(i, :) = kron([featx1(i); featy1(i); 1], [featx2(i); featy2(i); 1])';
end
% Find E_stacked that minimizes |chi * E_stacked| (last right-singular vector)
[U_chi, S_chi, V_chi] = svd(chi);
E_stacked = V_chi(:, 9);
E_approx = reshape(E_stacked, 3, 3);
% Project E onto the essential space: replace singular values with {1, 1, 0}
[U, S, V] = svd(E_approx);
S(1,1) = 1; S(2,2) = 1; S(3,3) = 0;
% Get essential matrix
E = U * S * V';
8-point algorithm
Approximation of E → Project onto the essential space → Recover R, T
(Theorem) There are 4 candidates (E and −E each give two):
\[ (\hat{T}, R) = \left( U R_Z\!\left(\pm\tfrac{\pi}{2}\right) \Sigma\, U^T,\;
U R_Z^T\!\left(\pm\tfrac{\pi}{2}\right) V^T \right) \]
Which candidate is the solution we want? Choose the one for which the depths λ are positive.
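A minimal sketch of the four-candidate decomposition, continuing from the snippet above (E with singular values {1, 1, 0}); R_Z(±π/2) is written out explicitly:

Rz = [0 -1 0; 1 0 0; 0 0 1];            % Rz(+pi/2); Rz(-pi/2) = Rz'
[U, S, V] = svd(E);
if det(U) < 0, U = -U; end              % keep U, V in SO(3) (a flip swaps E for -E)
if det(V) < 0, V = -V; end

R1 = U*Rz'*V';  T1hat = U*Rz *S*U';     % candidate 1
R2 = U*Rz *V';  T2hat = U*Rz'*S*U';     % candidate 2
T1 = [T1hat(3,2); T1hat(1,3); T1hat(2,1)];   % read T off the skew-symmetric T1hat
T2 = [T2hat(3,2); T2hat(1,3); T2hat(2,1)];
% The other two candidates use -T1, -T2 (from -E); keep the one with positive depths.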
Reconstruction from Two Calibrated Views
Approximation of E → Project onto the essential space → Recover R, T → Reconstruction
\[ \lambda_2^j x_2^j = \lambda_1^j R x_1^j + \gamma T, \qquad j = 1, 2, \dots, n \]
To eliminate \(\lambda_2^j\), multiply both sides by \(\hat{x}_2^j\):
\[ 0 = \lambda_1^j \hat{x}_2^j R x_1^j + \gamma \hat{x}_2^j T \]
Decompose known and unknown:
\[ M^j \lambda^j \doteq \begin{bmatrix} \hat{x}_2^j R x_1^j & \hat{x}_2^j T \end{bmatrix}
\begin{bmatrix} \lambda_1^j \\ \gamma \end{bmatrix} = 0 \qquad (1) \]
Moreover, we can set
\[ v = [\lambda_1^1, \lambda_1^2, \dots, \lambda_1^n, \gamma]^T \in \mathbb{R}^{n+1}, \qquad
M = \begin{bmatrix}
\hat{x}_2^1 R x_1^1 & 0 & \cdots & 0 & \hat{x}_2^1 T \\
0 & \hat{x}_2^2 R x_1^2 & \cdots & 0 & \hat{x}_2^2 T \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \hat{x}_2^n R x_1^n & \hat{x}_2^n T
\end{bmatrix}, \qquad M v = 0 \]
The linear least-squares estimate of v is simply the eigenvector of \(M^T M\) that corresponds to its smallest eigenvalue.
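A minimal MATLAB sketch of this step; x1s, x2s are assumed given 3 x n matrices of normalized homogeneous correspondences, with R, T the recovered motion:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
M = zeros(3*n, n+1);
for j = 1:n
    rows = 3*(j-1)+(1:3);
    M(rows, j)   = hat(x2s(:,j)) * R * x1s(:,j);
    M(rows, n+1) = hat(x2s(:,j)) * T;
end
[~, ~, Vm] = svd(M);
v = Vm(:, end);                          % [lambda_1^1, ..., lambda_1^n, gamma]
v = v * sign(v(end));                    % fix the overall sign so that gamma > 0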
Reconstruction from Two Calibrated Views
Approximation of E → Project onto the essential space → Recover R, T → Reconstruction → Bundle adjustment
Theory is beautiful, but we have noisy data \(\tilde{x}_1, \tilde{x}_2\) and some outliers.
Multiple-view: Assumption
Now, we assume that
The object (or scene) is static.
We have a calibrated camera. Simply, K = I.
There are multiple views.
All the corresponding points are observable in multiple views.
For example, when the baseline is not large, we can track features using the KLT tracker (Lucas and Kanade, 1981; Shi and Tomasi, 1994).
Then, choose the features that are visible in all frames.
The Multiple-view Matrix of a Point
\[ \mathcal{I}\lambda =
\begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_m \end{bmatrix}
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix}
= \begin{bmatrix} \Pi_1 \\ \Pi_2 \\ \vdots \\ \Pi_m \end{bmatrix} X = \Pi X, \]
which is of the form
\[ \mathcal{I}\lambda = \Pi X \]
For a more compact formulation, we introduce the matrix
\[ \mathcal{I}^\perp =
\begin{bmatrix} \hat{x}_1 & 0 & \cdots & 0 \\ 0 & \hat{x}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{x}_m \end{bmatrix}
\in \mathbb{R}^{3m \times 3m}, \]
which has the property of "annihilating" \(\mathcal{I}\):
\[ \mathcal{I}^\perp \mathcal{I} = 0. \]
The Multiple-view Matrix of a Point (cont.)
We can premultiply the above to obtain
\[ \mathcal{I}^\perp \Pi X = 0. \]
Thus the vector X is in the null space of the matrix
\[ W_p \doteq \mathcal{I}^\perp \Pi =
\begin{bmatrix} \hat{x}_1 \Pi_1 \\ \hat{x}_2 \Pi_2 \\ \vdots \\ \hat{x}_m \Pi_m \end{bmatrix}
\in \mathbb{R}^{3m \times 4} \]
To have a nontrivial solution, we must have
\[ \operatorname{rank}(W_p) \le 3 \]
The rank of the matrix \(W_p\) is not affected if we multiply by a full-rank matrix \(D_p \in \mathbb{R}^{4 \times 5}\) as follows:
\[ W_p D_p =
\begin{bmatrix} \hat{x}_1 \Pi_1 \\ \hat{x}_2 \Pi_2 \\ \vdots \\ \hat{x}_m \Pi_m \end{bmatrix}
\begin{bmatrix} x_1 & \hat{x}_1^T & 0 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix}
\hat{x}_1 x_1 & \hat{x}_1 \hat{x}_1^T & 0 \\
\hat{x}_2 R_2 x_1 & \hat{x}_2 R_2 \hat{x}_1^T & \hat{x}_2 T_2 \\
\hat{x}_3 R_3 x_1 & \hat{x}_3 R_3 \hat{x}_1^T & \hat{x}_3 T_3 \\
\vdots & \vdots & \vdots \\
\hat{x}_m R_m x_1 & \hat{x}_m R_m \hat{x}_1^T & \hat{x}_m T_m
\end{bmatrix}, \qquad \hat{x}_1 x_1 = 0. \]
The Multiple-view Matrix of a Point (cont.)
This means that \(\operatorname{rank}(W_p) \le 3\) if and only if the submatrix
\[ M_p \doteq
\begin{bmatrix}
\hat{x}_2 R_2 x_1 & \hat{x}_2 T_2 \\
\hat{x}_3 R_3 x_1 & \hat{x}_3 T_3 \\
\vdots & \vdots \\
\hat{x}_m R_m x_1 & \hat{x}_m T_m
\end{bmatrix} \in \mathbb{R}^{3(m-1) \times 2} \]
has \(\operatorname{rank}(M_p) \le 1\). \(M_p\) is called the multiple-view matrix associated with a point p. It involves both the image \(x_1\) in the first view and the coimages \(\hat{x}_i\) in the remaining views.
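A minimal sketch that builds M_p for one point and checks its rank; xs (3 x m, normalized coordinates of the point in each view) and the motions Rs{i}, Ts{i} relative to view 1 are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
Mp = zeros(3*(m-1), 2);
for i = 2:m
    rows = 3*(i-2)+(1:3);
    Mp(rows, 1) = hat(xs(:,i)) * Rs{i} * xs(:,1);
    Mp(rows, 2) = hat(xs(:,i)) * Ts{i};
end
rank(Mp)                                 % 1 for noise-free, consistent data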
Multiple-view Factorization of Point Features
\(\operatorname{rank}(M_p) \le 1\) states that the two columns of \(M_p\) are linearly dependent. For the j-th point \(p^j\) this implies
\[ \begin{bmatrix} \hat{x}_2^j R_2 x_1^j \\ \hat{x}_3^j R_3 x_1^j \\ \vdots \\ \hat{x}_m^j R_m x_1^j \end{bmatrix}
+ \alpha^j
\begin{bmatrix} \hat{x}_2^j T_2 \\ \hat{x}_3^j T_3 \\ \vdots \\ \hat{x}_m^j T_m \end{bmatrix} = 0 \]
for some parameter \(\alpha^j \in \mathbb{R}\), j = 1, ..., n. Each row in the above equation can be obtained from \(\lambda_i^j x_i^j = \lambda_1^j R_i x_1^j + T_i\) by multiplying with \(\hat{x}_i^j\):
\[ \hat{x}_i^j R_i x_1^j + \hat{x}_i^j T_i / \lambda_1^j = 0, \qquad j = 1, 2, \dots, n. \]
Therefore, \(\alpha^j = 1/\lambda_1^j\) is nothing but the inverse of the depth of point \(p^j\) with respect to the first frame.
Motion Estimation from Known Structure
Assume we have the depths of the points and thus their inverses \(\alpha^j\) (i.e. known structure). Then the above equation is linear in the camera motion parameters \(R_i\) and \(T_i\). Using the stack notation, for each fixed i = 2, 3, ..., m,
\[ P_i \begin{bmatrix} R_i^s \\ T_i \end{bmatrix} \doteq
\begin{bmatrix}
{x_1^1}^T \otimes \hat{x}_i^1 & \alpha^1 \hat{x}_i^1 \\
{x_1^2}^T \otimes \hat{x}_i^2 & \alpha^2 \hat{x}_i^2 \\
\vdots & \vdots \\
{x_1^n}^T \otimes \hat{x}_i^n & \alpha^n \hat{x}_i^n
\end{bmatrix}
\begin{bmatrix} R_i^s \\ T_i \end{bmatrix} = 0 \in \mathbb{R}^{3n} \]
It can be shown that if the \(\alpha^j\) are known, the matrix \(P_i \in \mathbb{R}^{3n \times 12}\) has rank 11 when n ≥ 6 points in general position are given. In that case, the null space of \(P_i\) is unique up to a scale factor, and so is the projection matrix \(\Pi_i = [R_i, T_i]\).
This does not completely decouple structure and motion: structure information is needed first.
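A minimal sketch of this linear system for one view i; x1s and xis (3 x n normalized coordinates in views 1 and i) and alpha are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
P_i = zeros(3*n, 12);
for j = 1:n
    rows = 3*(j-1)+(1:3);
    P_i(rows, 1:9)   = kron(x1s(:,j)', hat(xis(:,j)));   % acts on Ri stacked column-wise
    P_i(rows, 10:12) = alpha(j) * hat(xis(:,j));
end
[~, ~, Vp] = svd(P_i);
sol = Vp(:, end);                        % null vector, up to scale
Ri  = reshape(sol(1:9), 3, 3);           % still needs projection onto SO(3)
Ti  = sol(10:12);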
Structure Estimation from Known Motion
In turn, if the camera motion \(\Pi_i = (R_i, T_i)\), i = 1, ..., m is known, we can estimate the structure. The least-squares solution of the above equation is given by
\[ \alpha^j = - \frac{\sum_{i=2}^{m} (\hat{x}_i^j T_i)^T \hat{x}_i^j R_i x_1^j}{\sum_{i=2}^{m} \| \hat{x}_i^j T_i \|^2}, \qquad j = 1, \dots, n. \]
In this way one can iteratively estimate structure and motion, estimating
one while keeping the other fixed.
For initialization, we need the structure parameters from the eight-point algorithm.
While the equation for Πi makes use of the two frames 1 and i only, the
structure parameter estimation takes into account all frames. This can be
done either in batch mode or recursively.
As for the two-view case, such spectral approaches do not guarantee
optimality in the presence of noise and uncertainty.
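A minimal sketch of the closed-form depth estimate above; xs{i} (3 x n normalized coordinates in view i) and the motions Rs{i}, Ts{i} are assumed given:

hat = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];
alpha = zeros(n, 1);
for j = 1:n
    num = 0; den = 0;
    for i = 2:m
        a = hat(xs{i}(:,j)) * Ts{i};
        b = hat(xs{i}(:,j)) * Rs{i} * xs{1}(:,j);
        num = num + a' * b;
        den = den + a' * a;              % ||hat(x_i^j) T_i||^2
    end
    alpha(j) = -num / den;
end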
Incremental Structure from Motion
History
(Kruppa, 1916)
8-point algorithm
5-point algorithm
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Biological Motivation: Human Inertial Sensors
The utricle measures acceleration in the horizontal direction and the
saccule measures in the vertical direction.
The semicircular canals detect angular velocity of the head and are
oriented in three orthogonal planes. [1]
http://www.nebraskamed.com/health-library/3d-medical-atlas/302/balance-and-equilibrium
IMU sensors
An IMU (inertial measurement unit) measures acceleration, angular velocity, and other quantities using an accelerometer, a gyroscope, and other sensors.
Thanks to the development of MEMS (Micro-Electro-Mechanical Systems) technology, IMU sensors are now easy to find.
For example, since the iPhone 4 (2010), almost all smartphones have been equipped with accelerometer, gyroscope, and magnetic sensors.
Camera meets IMU
KLT tracker with IMU
We want to understand it, implement it, and test it on our own data.
Hwangbo, Kim, and Kanade, 2009 “Inertial-Aided KLT Feature Tracking
for a Moving Camera.” [2]
https://www.youtube.com/watch?v=a81WzJONPGA
1. Introduction
2. Image Formation
3. SfM with Calibrated Camera
Two-view
Multiple-view
Pipeline
Revisited
4. SfM with IMU
KLT tracker with IMU
toy example: estimating homography
5. SfM with Rolling Shutter Camera
6. Direct Approach
7. Epilogue
Summary
Research is ongoing for each stage:
camera model, calibration, feature matching or tracking, two view,
multi view, bundle adjustment, dense matching, visualization, ...
Useful if there are additional devices:
IMU(motion), RGBD(structure), ...
Different approaches:
factorization method
feature-less direct approach
variational approach
What we have assumed:
static scene, calibrated camera
What we have not considered:
...
References
[1] P. Corke, J. Lobo, and J. Dias. An Introduction to Inertial and Visual
Sensing. The International Journal of Robotics Research,
26(6):519–535, June 2007.
[2] Myung Hwangbo, Jun-Sik Kim, and Takeo Kanade. Inertial-aided KLT feature tracking for a moving camera. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1909–1916. IEEE, 2009.
