30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion:
Reconstruction and Tracking of
Non-Rigid Scenes in Real-Time
Richard A. Newbomb Dieter Fox Steven M. Seitz
Hiroki Mizuno
July 25 '15
1第30回（後編）コンピュータビジョン勉強会@関東 CVPR2015読み会

選んだ理由 (1/2)
•  一番初めに目に入った
第30回（後編）コンピュータビジョン勉強会@関東 CVPR2015読み会 2

選んだ理由 (2/2)
•  著者がすごい
Richard A. Newbombe
DTAM, KinectFusionの著者
Surreal VisionのCo-Founder
Dieter Fox
確率ロボティクスの著者
Steven M. Seitz
Multi View Stereo界隈の大御所

What's DynamicFusion
•  論文タイトル:
–  DynamicFusion:
Reconstruction and Tracking of Non-rigid Scenes in
Real-Time
•  Motivation:
–  "How can we generalise KinectFusion to reconstruct
and track dynamic, non-rigid scenes in real-time?"
•  ざっくり言うと:
"KinectFusionをDynamicなシーンに拡張"

というわけで
まずは
KinectFusion
の話をします

KinectFusion
•  KinectFusion:
Real-Time Dense Surface Mapping and Tracking
–  ISMAR 2011で発表 (Best Paper Award)
–  Richard A. Newbombe 他 (MSRの人が数名連名)
•  ざっくり概要
–  KinectのDepth Mapを入力としたSLAM
–  Depth Mapをリアルタイムに融合して、綺麗な面を生成
KinectFusion: Real-Time Dense Surface Mapping and Tracking⇤
Richard A. Newcombe
Imperial College London
Shahram Izadi
Microsoft Research
Otmar Hilliges
Microsoft Research
David Molyneaux
Microsoft Research
Lancaster University
David Kim
Microsoft Research
Newcastle University
Andrew J. Davison
Pushmeet Kohli
Microsoft Research
Jamie Shotton
Microsoft Research
Steve Hodges
Microsoft Research
Andrew Fitzgibbon
Microsoft Research
Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure.
Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison
is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).

Result
https://www.youtube.com/watch?v=quGhaggn3cQ

Overview:
Rk Tg,k
Rk
Tg,k-1
Input
Measurement Pose
Estimation
Update
Reconstruction
Surface
Prediction
Compute
Surface Nertex and
Normal Maps
ICP of Predicted
and Measured
Surface
Integrate Surface
Measurement
into Global TSDF
Ray-cast
TSDF to Compute
Surface Prediction
Tg,k-1
k
Vk-1
S
Nk-1
^ ^
Vk
Nk
Figure 3: Overall system workﬂow.
Surface reconstruction update: The global scene fusion pro-
cess, where given the pose determined by tracking the depth data
from a new sensor frame, the surface measurement is integrated into
the scene model maintained with a volumetric, truncated signed dis-
tance function (TSDF) representation.
Surface prediction: Unlike frame-to-frame pose estimation as
performed in [15], we close the loop between mapping and local-
isation by tracking the live depth frame against the globally fused
model. This is performed by raycasting the signed distance func-
tion into the estimated frame to provide a dense surface prediction
against which the live depth map is aligned.
Figure 4: A s
showing the tru
ﬁeld around the
had a valid mea
bouring map ve
Nk(u) = n
⇥
(V
on dy-
moving
move.
ects.
UIST’11, October 16–19, 2011, Santa Barbara, CA, USA

Mapping as Surface Reconstruction
Rk Tg,k
Rk
Tg,k-1
Input
Measurement Pose
Estimation
Update
Reconstruction
Surface
Prediction
Compute
Surface Nertex and
Normal Maps
ICP of Predicted
and Measured
Surface
Integrate Surface
Measurement
into Global TSDF
Ray-cast
TSDF to Compute
Surface Prediction
Tg,k-1
k
Vk-1
S
Nk-1
^ ^
Vk
Nk
Figure
showin
ﬁeld ar

Mapping as Surface Reconstruction (1/4)
•  カメラのポーズがわかっている前提で、
Live FrameをGlobal Sceneに統合 (Fusion)
•  Sceneの表現:
–  Volumetric Truncated Signed Distance Function
(TSDF)※
–  Volumeデータ構造
–  面の位置をzeroとして、各Voxelに面までの距離を符
号付で格納
※ B. Curless and M. Levoy, A Volumetric for Building Complex Models from Range Images In ACM Transactions
on Graphics (SIGGRAPH), 1996.

•  TSDF概要
–  複数視点からのDepth mapを統合するアルゴリズム
–  ある視点から得られたDepth mapをTSDFに変換
–  別の視点からのDepth mapも同様にTSDFに変換
–  これらの重み付和をとることで、複数視点からのDepth mapが統合される
–  全部の視点が統合されたら、Marching Cubeなどで面を取り出すことが可能
Sensor
Near Far
x
x
Volume
Range surface
Zero-crossing
(isosurface)
x
x
New zero-crossing
Distance
from
surface
(a) (b)
Figure 2. Unweighted signed distance functions in 3D. (a) A range sensor
looking down the x-axis observes a range image, shown here as a recon-
structed range surface. Following one line of sight down the x-axis, we
r
1
d
1
(x)
w1
(x)
r
2
w2
(x)
d
2
(x)
W(x)
D(x)
R
(a) (b)
x x
Sensor
Figure 3. Signed distance and weight functions in one dimension. (a) The
sensor looks down the x-axis and takes two measurements, and .
and are the signed distance proﬁles, and and
are the weight functions. In 1D, we might expect two sensor measure-
ments to have the same weight magnitudes, but we have shown them to be
of different magnitude here to illustrate how the proﬁles combine in the
general case. (b) is a weighted combination of and ,
and is the sum of the weight functions. Given this formulation, the
zero-crossing, , becomes the weighted combination of and and
(b) (c)
(e) (f)
Isosurface
Sensor
n2
n1
Dmax
Dmin
(a)
(d)
Sensor
Figure 4. Combination of signed distance and weight functions in two
dimensions. (a) and (d) are the signed distance and weight functions, re-
spectively, generated for a range image viewed from the sensor line of
sight shown in (d). The signed distance functions are chosen to vary be-
Sen
Figure 5
pute the w
casting a
We obtai
and )
sensor (li
column o

•  Live Frame (Depth map) からTSDFへの変換
resents the distance to the nearest surface point.
Although efficient algorithms exist for computing the true dis-
crete SDF for a given set of point measurements (complexity is
linear in the the number of voxels) [22], sophisticated implemen-
tations are required to achieve top performance on GPU hardware,
without which real-time computation is not possible for a reason-
able size volume. Instead, we use a projective truncated signed
distance function that is readily computed and trivially parallelis-
able. For a raw depth map Rk with a known pose Tg,k, its global
frame projective TSDF [FRk
,WRk
] at a point p in the global frame
g is computed as,
FRk
(p) = Y
⇣
l 1
k(tg,k pk2 Rk(x)
⌘
, (6)
l = kK 1
˙xk2 , (7)
x =
j
p
⇣
KT 1
g,kp
⌘k
, (8)
Y(h) =
(
min
⇣
1, h
µ
⌘
sgn(h) iff h µ
null otherwise
(9)
We use a nearest neighbour lookup b.c instead of interpolating
the depth value, to prevent smearing of measurements at depth dis-
continuities. 1/l converts the ray distance to p to a depth (we found
no considerable difference in using SDF values computed using dis-
tances along the ray or along the optical axis). Y performs the SDF
truncation. The truncation function is scaled to ensure that a sur-
face measurement (zero crossing in the SDF) is represented by at
least one non truncated voxel value in the discretised volume ei-
ther side of the surface. Also, the support is increased linearly with
distance from the sensor center to support correct representation
of noisier measurements. The associated weight WRk
(p) is propor-
tional to cos(q)/Rk(x), where q is the angle between the associated
pixel ray direction and the surface normal measurement in the local
k
Wk 1(p)+WRk
(p
Wk(p) = Wk 1(p)+WRk
(p)
No update on the global TSDF is performed fo
from unmeasurable regions specified in Equation
provides weighting of the TSDF proportional to
surface measurement, we have also found that i
letting WRk
(p) = 1, resulting in a simple average,
sults. Moreover, by truncating the updated weigh
Wh ,
Wk(p) min(Wk 1(p)+WRk
(p),W
a moving average surface reconstruction can be o
reconstruction in scenes with dynamic object mot
Although a large number of voxels can be vis
project into the current image, the simplicity of
operation time is memory, not computation, bound
GPU hardware over 65 gigavoxels/second (⇡ 2m
update for a 5123 voxel reconstruction) can be up
bits per component in S(p), although experiment
ified that as few as 6 bits are required for the SD
we note that the raw depth measurements are used
rather than the bilateral filtered version used in t
ponent, described later in section 3.5. The early
desired high frequency structure and noise alike
duce the ability to reconstruct finer scale structure
3.4 Surface Prediction from Ray Casting
With the most up-to-date reconstruction availabl
ity to compute a dense surface prediction by rend
encoded in the zero level set Fk = 0 into a virtua
current estimate Tg,k. The surface prediction is s
and normal map ˆVk and ˆNk in frame of referenc
able size volume. Instead, we use a projective truncated signed
distance function that is readily computed and trivially parallelis-
able. For a raw depth map Rk with a known pose Tg,k, its global
rame projective TSDF [FRk
,WRk
] at a point p in the global frame
g is computed as,
FRk
(p) = Y
⇣
l 1
k(tg,k pk2 Rk(x)
⌘
, (6)
l = kK 1
˙xk2 , (7)
x =
j
p
⇣
KT 1
g,kp
⌘k
, (8)
Y(h) =
(
min
⇣
1, h
µ
⌘
sgn(h) iff h µ
null otherwise
(9)
he depth value, to prevent smearing of measurements at depth dis-
ances along the ray or along the optical axis). Y performs the SDF
runcation. The truncation function is scaled to ensure that a sur-
ace measurement (zero crossing in the SDF) is represented by at
east one non truncated voxel value in the discretised volume ei-
provides weig
surface measu
letting WRk
(p
sults. Moreov
Wh ,
W
a moving ave
reconstruction
Although a
project into th
operation time
GPU hardwar
update for a 5
bits per comp
ified that as fe
we note that th
rather than th
ponent, descr
desired high f
duce the abilit
3.4 Surfac
: 点pにおけるTSDF value : 点pにおける重み
(p) is propor-
frame.
The projective TSDF measurement is only correct exactly at the
surface FRk
(p) = 0 or if there is only a single point measurement
in isolation. When a surface is present the closest point along a
ray could be another surface point not on the ray associated with
the pixel in Equation 8. It has been shown that for points close
to the surface, a correction can be applied by scaling the SDF by
cos(q) [11]. However, we have found that approximation within
the truncation region for 100s or more fused TSDFs from multiple
viewpoints (as performed here) converges towards an SDF with a
pseudo-Euclidean metric that does not hinder mapping and tracking
performance.
ponent, described late
desired high frequenc
duce the ability to rec
3.4 Surface Pred
With the most up-to-
ity to compute a dens
encoded in the zero le
current estimate Tg,k.
and normal map ˆVk a
the subsequent camer
As we have a dense
SDF, a per pixel rayc
responding ray, Tg,kK
depth for the pixel a
ve for a visible surf
Marching also stops
mately when exiting t
face measurement at t
For points on or ve
is assumed that the gr
zero level set, and so
no considerable difference in using
tances along the ray or along the op
truncation. The truncation functio
face measurement (zero crossing i
least one non truncated voxel valu
ther side of the surface. Also, the s
distance from the sensor center to
of noisier measurements. The assoc
tional to cos(q)/Rk(x), where q is
pixel ray direction and the surface n
frame.
The projective TSDF measurem
surface FRk
(p) = 0 or if there is o
in isolation. When a surface is pr
ray could be another surface point
the pixel in Equation 8. It has be
to the surface, a correction can be
cos(q) [11]. However, we have fo
the truncation region for 100s or m
viewpoints (as performed here) co
pseudo-Euclidean metric that does
performance.
the depth value, to prevent smearing of measurements at depth dis-
(p) is propor-
frame.
The projective TSDF measurement is only correct exactly at the
surface FRk
(p) = 0 or if there is only a single point measurement
in isolation. When a surface is present the closest point along a
ray could be another surface point not on the ray associated with
the pixel in Equation 8. It has been shown that for points close
to the surface, a correction can be applied by scaling the SDF by
cos(q) [11]. However, we have found that approximation within
the truncation region for 100s or more fused TSDFs from multiple
viewpoints (as performed here) converges towards an SDF with a
pseudo-Euclidean metric that does not hinder mapping and tracking
performance.
we note that the
rather than the b
ponent, describe
desired high fre
duce the ability
3.4 Surface P
With the most u
ity to compute a
encoded in the z
current estimate
and normal map
the subsequent c
As we have a
SDF, a per pixe
responding ray,
depth for the p
ve for a visible
Marching also s
mately when exi
face measureme
For points on
is assumed that
zero level set, an
=
θ : Rayと法線のなす角
p
x
Global TSDF
Raw Depth map
µ
Rk(x)

実際のところ、でいい結果になるらしい。
(5)
h map Rk) pro-
reconstructed.
a depth mea-
f the measured
ong each depth
ee space (here
ixel ray). Sec-
ained in the re-
he camera ray.
of uncertainty
 µ. A TSDF
n measurement
e within visible
urface interface
e points farther
e the SDF rep-
g the true dis-
(complexity is
ted implemen-
GPU hardware,
e for a reason-
uncated signed
ally parallelis-
Tg,k, its global
e global frame
depth map, which can be seen as de-noising the global TSDF from
multiple noisy TSDF measurements. Under an L2 norm the de-
noised (fused) surface results as the zero-crossings of the point-wise
SDF F minimising:
min
F2F
Â
k
kWRk
FRk
F)k2. (10)
Given that the focus of our work is on real-time sensor tracking and
surface reconstruction we must maintain interactive frame-rates.
(For a 640x480 depth stream at 30fps the equivalent of over 9
million new point measurements are made per second). Storing
a weight Wk(p) with each value allows an important aspect of the
global minimum of the convex L2 de-noising metric to be exploited
for real-time fusion; that the solution can be obtained incrementally
as more data terms are added using a simple weighted running av-
erage [7], defined point-wise {p|FRk
(p) 6= null}:
Fk(p) =
Wk 1(p)Fk 1(p)+WRk
(p)FRk
(p)
Wk 1(p)+WRk
(p)
(11)
Wk(p) = Wk 1(p)+WRk
(p) (12)
No update on the global TSDF is performed for values resulting
from unmeasurable regions specified in Equation 9. While Wk(p)
provides weighting of the TSDF proportional to the uncertainty of
surface measurement, we have also found that in practice simply
letting WRk
(p) = 1, resulting in a simple average, provides good re-
sults. Moreover, by truncating the updated weight over some value
Wh ,
Wk(p) min(Wk 1(p)+WRk
(p),Wh ) , (13)
a weight Wk(p) with each value allows an important aspect of the
global minimum of the convex L2 de-noising metric to be exploited
for real-time fusion; that the solution can be obtained incrementally
as more data terms are added using a simple weighted running av-
(p) 6= null}:
Fk(p) =
Wk 1(p)Fk 1(p)+WRk
(p)FRk
(p)
Wk 1(p)+WRk
(p)
(11)
Wk(p) = Wk 1(p)+WRk
(p) (12)
No update on the global TSDF is performed for values resulting
from unmeasurable regions specified in Equation 9. While Wk(p)
provides weighting of the TSDF proportional to the uncertainty of
surface measurement, we have also found that in practice simply
letting WRk
(p) = 1, resulting in a simple average, provides good re-
sults. Moreover, by truncating the updated weight over some value
Wh ,
Given that the focus of our work is on real-time sensor trackin
surface reconstruction we must maintain interactive frame-
(For a 640x480 depth stream at 30fps the equivalent of o
million new point measurements are made per second). S
a weight Wk(p) with each value allows an important aspect
global minimum of the convex L2 de-noising metric to be exp
for real-time fusion; that the solution can be obtained increme
as more data terms are added using a simple weighted runnin
(p) 6= null}:
Fk(p) =
Wk 1(p)Fk 1(p)+WRk
(p)FRk
(p)
Wk 1(p)+WRk
(p)
Wk(p) = Wk 1(p)+WRk
(p)
No update on the global TSDF is performed for values res
from unmeasurable regions specified in Equation 9. While W
provides weighting of the TSDF proportional to the uncertai
Global SceneとLive FrameのTSDFのFusion
重みの更新
FusionされたTSDF Value
Global Model Live Frame

Surface Prediction
Rk Tg,k
Rk
Tg,k-1
Input
Measurement Pose
Estimation
Update
Reconstruction
Surface
Prediction
Compute
Surface Nertex and
Normal Maps
ICP of Predicted
and Measured
Surface
Integrate Surface
Measurement
into Global TSDF
Ray-cast
TSDF to Compute
Surface Prediction
Tg,k-1
k
Vk-1
S
Nk-1
^ ^
Vk
Nk
Figure
showin
ﬁeld ar

Surface Prediction
•  Global TSDFから任意視点におけるSurfaceを取り出す
•  Pose Estimationに必要
•  Ray Casting (Ray marching)
–  距離関数と親和性の高いレンダリング方法
–  カメラからRayを飛ばす
–  Minimum depthから初めて、zero交差が見つかったら
止める。
–  法線の計算:
Figure 6: Demonstration of the space skipping ray casting. (Left)
pixel iteration count are shown where for each pixel the ray is tra-
versed in steps of at most one voxel (white equals 480 increments
and black 60). (middle) ray marching steps are drastically reduced
by skipping empty space according to the minimum truncation µ
(white equals 70 iterations and black 10 ⇡ 6⇥ speedup). Marching
steps can be seen to increase around the surface interface where
the signed distance function has not been truncated. (Right) Normal
map at resulting surface intersection.
along which p was found can be computed directly from Fk using a
numerical derivative of the SDF:
Rg,k
ˆNk = ˆN
g
k(u) = n
⇥
—F(p)
⇤
, —F =

∂F
∂x
,
∂F
∂y
,
∂F
∂z
>
(14)
Further, this derivative is scaled in each dimension to ensure cor-
rect isotropy given potentially arbitrary voxel resolutions and re-
construction dimensions.
Since the rendering of the surface is restricted to provide phys-
the surface interface using eqn. 15.
Figure 6: Demonstration of the space skipping ray casting. (Left)
pixel iteration count are shown where for each pixel the ray is tra-
versed in steps of at most one voxel (white equals 480 increments
and black 60). (middle) ray marching steps are drastically reduced
by skipping empty space according to the minimum truncation µ
(white equals 70 iterations and black 10 ⇡ 6⇥ speedup). Marching
steps can be seen to increase around the surface interface where
the signed distance function has not been truncated. (Right) Normal
map at resulting surface intersection.
along which p was found can be computed directly from Fk using a
numerical derivative of the SDF:
Rg,k
ˆNk = ˆN
g
k(u) = n
⇥
—F(p)
⇤
, —F =

∂F
∂x
,
∂F
∂y
,
∂F
∂z
>
(14)
Further, this derivative is scaled in each dimension to ensure cor-
rect isotropy given potentially arbitrary voxel resolutions and re-
construction dimensions.
Since the rendering of the surface is restricted to provide phys-
ically plausible measurement predictions, there is a minimum and
maximum rendering range the ray needs to traverse corresponding
3
L
p
t
d
p
l
a
t
f
t
p
h
t
a
d
s
f
m
a
d
t
(
p
u
E

Surface Measurement
Rk Tg,k
Rk
Tg,k-1
Input
Measurement Pose
Estimation
Update
Reconstruction
Surface
Prediction
Compute
Surface Nertex and
Normal Maps
ICP of Predicted
and Measured
Surface
Integrate Surface
Measurement
into Global TSDF
Ray-cast
TSDF to Compute
Surface Prediction
Tg,k-1
k
Vk-1
S
Nk-1
^ ^
Vk
Nk
Figure
showin
ﬁeld ar

Surface Measurement (1/4)
•  Live Frameに対する前処理：
1.  De-noise (Bilateral Filter)
2.  各ピクセルごとにVertexとNormalを生成
3.  Pyramid生成 (L=1, 2, 3)

•  De-noise
provides calibrated depth measurements Rk(u) 2 R at each image
pixel u = (u,v)> in the image domain u 2 U ⇢ R2 such that pk =
Rk(u)K 1 ˙u is a metric point measurement in the sensor frame of
reference k. We apply a bilateral filter [27] to the raw depth map to
obtain a discontinuity preserved depth map with reduced noise Dk,
Dk(u) =
1
Wp
Â
q2U
Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) ,
(2)
where Ns (t) = exp( t2s 2) and Wp is a normalizing constant.
We back-project the filtered depth values into the sensor’s frame
of reference to obtain a vertex map Vk,
Vk(u) = Dk(u)K 1
˙u . (3)
Since each frame from the depth sensor is a surface measurement
on a regular grid, we compute, using a cross product between neigh-
[7]
the
fac
vis
on
ple
a g
low
wit
fus
Sk(
rec
olu
res
vox
and
ra calibration matrix
e into image pixels.
e projection of p 2
to obtain q 2 R2 =
denote homogeneous
depth map Rk which
u) 2 R at each image
⇢ R2 such that pk =
the sensor frame of
the raw depth map to
th reduced noise Dk,
) Rk(q)k2)Rk(q) ,
(2)
corresponding depth map level. We note that
global co-ordinate frame transform Tg,k associ
measurement, the global frame vertex is V
g
k(
the equivalent mapping of normal vectors int
N
g
k(u) = Rg,kNk(u).
3.3 Mapping as Surface Reconstructio
Each consecutive depth frame, with an associa
estimate, is fused incrementally into one sing
using the volumetric truncated signed distan
[7]. In a true signed distance function, the v
the signed distance to the closest zero crossin
face), taking on positive and increasing valu
visible surface into free space, and negative a
on the non-visible side. The result of averagin
ple 3D point clouds (or surface measurements)
a global frame is a global surface fusion.
An example given in Figure 4 demonstrate
lows us to represent arbitrary genus surface
: Raw Depth map
We will also use a single constant camera calibration matri
K that transforms points on the sensor plane into image pixel
The function q = p(p) performs perspective projection of p
R3 = (x,y,z)> including dehomogenisation to obtain q 2 R2 =
(x/z,y/z)>. We will also use a dot notation to denote homogeneou
vectors ˙u := (u>|1)>
3.2 Surface Measurement
At time k a measurement comprises a raw depth map Rk whic
provides calibrated depth measurements Rk(u) 2 R at each imag
Rk(u)K 1 ˙u is a metric point measurement in the sensor frame o
reference k. We apply a bilateral filter [27] to the raw depth map t
obtain a discontinuity preserved depth map with reduced noise Dk
Dk(u) =
1
Wp
Â
q2U
Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q)
(2
: Pixel
maps the camera coordinate frame at time k into the global frame
g, such that a point pk 2 R3 in the camera frame is transferred into
the global co-ordinate frame via pg = Tg,kpk.
We will also use a single constant camera calibration matrix
K that transforms points on the sensor plane into image pixels.
The function q = p(p) performs perspective projection of p 2
(x/z,y/z)>. We will also use a dot notation to denote homogeneous
At time k a measurement comprises a raw depth map Rk which
Dk(u) =
1
Wp
Â
q2U
(2)
central pixel to ens
aries. Subsequently
Vl2[1...L], Nl2[1...L]
corresponding dep
global co-ordinate
measurement, the
the equivalent map
N
g
k(u) = Rg,kNk(u)
3.3 Mapping as
Each consecutive d
estimate, is fused
using the volumet
[7]. In a true sign
the signed distance
face), taking on p
visible surface into
on the non-visible
ple 3D point cloud
a global frame is a
An example giv
lows us to repres
within the volume.
fusion of the regis
Sk(p) where p 2 R
(x/z,y/z)>. We will also use a dot notation to denote homogeneous
At time k a measurement comprises a raw depth map Rk which
Dk(u) =
1
Wp
Â
q2U
(2)
Vk(u) = Dk(u)K 1
˙u . (3)
Since each frame from the depth sensor is a surface measurement
on a regular grid, we compute, using a cross product between neigh-
the equivalent mappin
N
g
k(u) = Rg,kNk(u).
3.3 Mapping as S
Each consecutive dept
estimate, is fused incr
using the volumetric
[7]. In a true signed
the signed distance to
face), taking on posit
visible surface into fre
on the non-visible side
ple 3D point clouds (o
a global frame is a glo
An example given
lows us to represent
within the volume. W
fusion of the registere
Sk(p) where p 2 R3 is
reconstructed. A disc
olution is stored in gl
reside. From here on w
voxel/memory elemen
and will refer only to t
so use a dot notation to denote homogeneous
urement
ment comprises a raw depth map Rk which
pth measurements Rk(u) 2 R at each image
e image domain u 2 U ⇢ R2 such that pk =
c point measurement in the sensor frame of
a bilateral filter [27] to the raw depth map to
preserved depth map with reduced noise Dk,
(ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) ,
(2)
t2s 2) and Wp is a normalizing constant.
e filtered depth values into the sensor’s frame
a vertex map Vk,
Vk(u) = Dk(u)K 1
˙u . (3)
m the depth sensor is a surface measurement
the equivalent mapping of norm
N
g
k(u) = Rg,kNk(u).
3.3 Mapping as Surface R
Each consecutive depth frame,
estimate, is fused incrementally
using the volumetric truncated
[7]. In a true signed distance
the signed distance to the close
face), taking on positive and i
visible surface into free space,
on the non-visible side. The res
ple 3D point clouds (or surface
a global frame is a global surfac
An example given in Figure
lows us to represent arbitrary
within the volume. We will den
fusion of the registered depth m
Sk(p) where p 2 R3 is a global
reconstructed. A discretization
olution is stored in global GPU
reside. From here on we assum
voxel/memory elements and th
: 正規化
要するに、Depthに対するBilateral Filter

•  Vertex map
Wp
Â
q2U
(2)
(t) = exp( t2s 2) and Wp is a normalizing constant.
k-project the filtered depth values into the sensor’s frame
ce to obtain a vertex map Vk,
Vk(u) = Dk(u)K 1
˙u . (3)
h frame from the depth sensor is a surface measurement
ar grid, we compute, using a cross product between neigh-
a global frame is a global surfac
An example given in Figure
lows us to represent arbitrary
within the volume. We will den
fusion of the registered depth m
Sk(p) where p 2 R3 is a global
reconstructed. A discretization
olution is stored in global GPU
reside. From here on we assume
voxel/memory elements and th
and will refer only to the contin
•  Normal map
De-noise後のDepth mapを使って、点 u を3次元に投影
n pro-
h data
ed into
ed dis-
ion as
local-
fused
func-
diction
using
d cur-
ses all
ame at
Figure 4: A slice through the truncated signed distance volume
showing the truncated function F > µ (white), the smooth distance
field around the surface interface F = 0 and voxels that have not yet
had a valid measurement(grey) as detailed in eqn. 9.
bouring map vertices, the corresponding normal vectors,
Nk(u) = n
⇥
(Vk(u+1,v) Vk(u,v))⇥(Vk(u,v+1) Vk(u,v))
⇤
,
(4)
where n[x] = x/kxk2.
We also define a vertex validity mask: Mk(u) 7! 1 for each pixel
where a depth measurement transforms to a valid vertex; otherwise
if a depth measurement is missing Mk(u) 7! 0. The bilateral filtered
version of the depth map greatly increases the quality of the normal
maps produced, improving the data association required in tracking
described in Section 3.5.
We compute an L = 3 level multi-scale representation of the sur-
face measurement in the form of a vertex and normal map pyramid.
First a depth map pyramid Dl2[1...L] is computed. Setting the bottom
各頂点の3次元座標から、法線を計算 (近接ピクセル3次元座標の外積)
Figure 4: A slice through the truncated signed distance volume
showing the truncated function F > µ (white), the smooth distance
field around the surface interface F = 0 and voxels that have not yet
had a valid measurement(grey) as detailed in eqn. 9.
bouring map vertices, the corresponding normal vectors,
Nk(u) = n
⇥
(Vk(u+1,v) Vk(u,v))⇥(Vk(u,v+1) Vk(u,v))
⇤
,
(4)
where n[x] = x/kxk2.
We also define a vertex validity mask: Mk(u) 7! 1 for each pixel
where a depth measurement transforms to a valid vertex; otherwise
if a depth measurement is missing Mk(u) 7! 0. The bilateral filtered
version of the depth map greatly increases the quality of the normal
maps produced, improving the data association required in tracking
described in Section 3.5.
We compute an L = 3 level multi-scale representation of the sur-
:正規化 (単位ベクトル化)

•  Pyramid生成
–  3 Level
–  L = 1 : 生の解像度
–  L+1 : Lの半分の解像度
–  各レベルごとに、Vertex / Normal Mapを計算して保持

Sensor Pose Estimation
Rk Tg,k
Rk
Tg,k-1
Input
Measurement Pose
Estimation
Update
Reconstruction
Surface
Prediction
Compute
Surface Nertex and
Normal Maps
ICP of Predicted
and Measured
Surface
Integrate Surface
Measurement
into Global TSDF
Ray-cast
TSDF to Compute
Surface Prediction
Tg,k-1
k
Vk-1
S
Nk-1
^ ^
Vk
Nk
Figure
showin
ﬁeld ar

Sensor Pose Estimation (1/4)
•  Live FrameのPoseを推定
–  Live Frameの全Pixel情報を使って推定
•  仮定:
–  Frame間の移動量は十分小さい
•  Solution : ICP
–  Live FrameのVertex/Normal (De-noised) と、Surface
Prediction間の6DOFを推定
–  Point-Plane Optimization

•  The global point-plane energy
(14)
ure cor-
and re-
e phys-
um and
ponding
[0.4,8]
perty of
er pixel
ed volu-
can be
ver, due
a con-
simple
kipping
a good
est sur-
n march
ave +ve
to-frame tracking is obtained simply by setting ( ˆVk 1, ˆNk 1)
(Vk 1,Nk 1) which is used in our experimental section for a com-
parison between frame-to-frame and frame-model tracking.
Utilising the surface prediction, the global point-plane energy,
under the L2 norm for the desired camera pose estimate Tg,k is:
E(Tg,k) = Â
u2U
Wk(u)6=null
⇣
Tg,k
˙Vk(u) ˆV
g
k 1 (û)
⌘>
ˆN
g
k 1 (û)
2
, (16)
where each global frame surface prediction is obtained using the
previous fixed pose estimate Tg,k 1. The projective data as-
sociation algorithm produces the set of vertex correspondences
{Vk(u), ˆVk 1(û)|W(u) 6= null} by computing the perspectively pro-
jected point, û = p(KeTk 1,k
˙Vk(u)) using an estimate for the frame-
frame transform eTz
k 1,k = T 1
g,k 1
eTz
g,k and testing the predicted and
measured vertex and normal for compatibility. A threshold on the
distance of vertices and difference in normal values suffices to re-
ject grossly incorrect correspondences, also illustrated in Figure 7:
W(u) 6= null iff
8
<
:
Mk(u) = 1, and
keTz
g,k
˙Vk(u) ˆV
g
k 1 (û)k2  ed, and
heRz
g,kNk(u), ˆN
g
k 1 (û)i  eq .
(17)
rithm [4] to obtain correspondence and the
r pose optimisation. Second, modern GPU
y parrallelised processing pipeline, so that
point-plane optimisation can use all of the
ements.
r metric in combination with correspon-
rojective data association was first demon-
odelling system by [23] where frame-to-
(with a fixed camera) for depth map align-
instead track the current sensor frame by
easurement (Vk,Nk) against the model pre-
s frame ( ˆVk 1, ˆNk 1). We note that frame-
ained simply by setting ( ˆVk 1, ˆNk 1)
sed in our experimental section for a com-
o-frame and frame-model tracking.
prediction, the global point-plane energy,
he desired camera pose estimate Tg,k is:
g,k
˙Vk(u) ˆV
g
k 1 (û)
⌘>
ˆN
g
k 1 (û)
2
, (16)
e surface prediction is obtained using the
algorithm [4] to obtain correspondence and the
] for pose optimisation. Second, modern GPU
ully parrallelised processing pipeline, so that
nd point-plane optimisation can use all of the
asurements.
rror metric in combination with correspon-
g projective data association was first demon-
e modelling system by [23] where frame-to-
sed (with a fixed camera) for depth map align-
we instead track the current sensor frame by
e measurement (Vk,Nk) against the model pre-
ious frame ( ˆVk 1, ˆNk 1). We note that frame-
obtained simply by setting ( ˆVk 1, ˆNk 1)
is used in our experimental section for a com-
me-to-frame and frame-model tracking.
ce prediction, the global point-plane energy,
or the desired camera pose estimate Tg,k is:
⇣
Tg,k
˙Vk(u) ˆV
g
k 1 (û)
⌘>
ˆN
g
k 1 (û)
2
, (16)
ame surface prediction is obtained using the
: Live frame
: Model prediction from the previous frame
shapes in
base. For
or virtual
to form
s range
the range
planning
the ICP
st widely
a similar
Chen92]).
a recent
inal ICP
92], each
the other
oint error
between
rocess is
r it stops
2] used a
ization is
e tangent
-to-point
-to-plane
squares
Press92].
orithm is
hers have
e former
of the
Pottmann
between points in the source surface and points in the destination
surfaces. For example, for each point on the source surface, the
nearest point on the destination surface is chosen as its
correspondence [Besl92] (see [Rusinkiewicz01] for other
approaches to find point correspondences). The output of an ICP
iteration is a 3D rigid-body transformation M that transforms the
source points such that the total error between the corresponding
points, under a certain chosen error metric, is minimal.
When the point-to-plane error metric is used, the object of
minimization is the sum of the squared distance between each
source point and the tangent plane at its corresponding destination
point (see Figure 1). More specifically, if si = (six, siy, siz, 1)T
is a
source point, di = (dix, diy, diz, 1)T
is the corresponding destination
point, and ni = (nix, niy, niz, 0)T
is the unit normal vector at di, then
the goal of each ICP iteration is to find Mopt such that
( )( )∑ •−⋅=
i
iii
2
opt minarg ndsMM M (1)
where M and Mopt are 4×4 3D rigid-body transformation matrices.
Figure 1: Point-to-plane error between two surfaces.
tangent
plane
s1
source
point
destination
point
d1
n1
unit
normal
s2
d2
n2
s3
d3
n3
destination
surface
source
surface
l1
l2
l3
Low, Kok-Lim, Linear least-squares optimization for
Point-to-Plane ICP surface Registration Chapel Hill,
University of North Carolina, 2004.

•  Iterative solution
•  パラメータの更新量の計算
ve value before stepping over the surface
-up obtained is demonstrated in Figure 6
of steps required for each pixel to inter-
standard marching.
ions can be obtained by solving a ray/tri-
1] that requires solving a cubic polyno-
e we use a simple approximation. Given
ntersect the SDF where F+
t and F+
t+Dt are
DF values either side of the zero crossing
nd t + Dt from its starting point, we find
where ed and eq are threshold parameters of our system. eTz=
g,
initialised with the previous frame pose Tg,k.
An iterative solution, eTz
g,k for z > 0 is obtained by minimi
the energy of a linearised version of (16) around the previous
timate eTz 1
g,k . Using the small angle assumption for an increme
transform:
eTz
inc =
h
eRz etz
i
=
2
4
1 a g tx
a 1 b ty
g b 1 tz
3
5 ,
g a surface measurement with (center) and with-
filtering applied to the raw depth map. W(u) = null
unpredicted/unmeasured points shown in white.
rm is simply eTz
g,k = eTz
inc
eTz 1
g,k . Writing the update
vector,
x = (b,g,a,tx,ty,tz)>
2 R6
(19)
urrent global frame vertex estimates for all pixels
eV
g
k(u) = eTz 1
g,k
˙Vk(u), we minimise the linearised
g the incremental point transfer:
u) = eRz eV
g
k(u)+ etz = G(u)x+ eV
g
k(u) , (20)
trix G is formed with the skew-symmetric matrix
quately constrained.
the magnitude of incr
small angle assumpt
fails, the system is pl
Relocalisation Ou
re-localisation schem
known sensor pose is
user instructed to alig
played. While runni
validity checks are pa
4 EXPERIMENTS
We have conducted a
formance of our syst
tem’s ability to keep
extensively in our su
4.1 Metrically Co
Our tracking and ma
rithm for a given are
Figure 7: Example of point-plane outliers as person steps into par-
tially reconstructed scene (left). Outliers from compatibility checks
(Equation 17) using a surface measurement with (center) and with-
out (right) bilateral filtering applied to the raw depth map. W(u) = null
are light grey with unpredicted/unmeasured points shown in white.
an updated transform is simply eTz
g,k = eTz
inc
eTz 1
eTz
inc as a parameter vector,
2 R6
(19)
and updating the current global frame vertex estimates for all pixels
{u|W(u) 6= null}, eV
g
k(u) = eTz 1
g,k
error function using the incremental point transfer:
eTz
g,k
˙Vk(u) = eRz eV
g
g
k(u) , (20)
can be
does no
the line
free DO
quality
a check
quately
the mag
small a
fails, th
Relo
re-loca
known
user ins
played.
validity
4 EX
We hav
forman
tem’s a
extensi
Figure 7: Example of point-plane outliers as person steps into par-
tially reconstructed scene (left). Outliers from compatibility checks
(Equation 17) using a surface measurement with (center) and with-
out (right) bilateral filtering applied to the raw depth map. W(u) = null
are light grey with unpredicted/unmeasured points shown in white.
an updated transform is simply eTz
g,k = eTz
inc
eTz 1
eTz
inc as a parameter vector,
2 R6
(19)
g
k(u) = eTz 1
g,k
eTz
g,k
˙Vk(u) = eRz eV
g
g
k(u) , (20)
where the 3⇥6 matrix G is formed with the skew-symmetric matrix
form of eV
g
k(u):
G(u) =
h
eV
g
k(u)
i
⇥
I3⇥3 . (21)
An iteration is obtained by solving:
sensor motion increases the assum
of the point-plane error metric an
can be broken. Also, if the curren
does not provide point-plane pairs
the linear system then an arbitrar
free DOFs can be obtained. Both
quality reconstruction and trackin
a check on the null space of the no
quately constrained. We also perfo
the magnitude of incremental trans
small angle assumption was not d
fails, the system is placed into re-l
Relocalisation Our current imp
re-localisation scheme, whereby i
known sensor pose is used to prov
user instructed to align the incomi
played. While running the pose o
validity checks are passed tracking
4 EXPERIMENTS
We have conducted a number of ex
formance of our system. These an
tem’s ability to keep track during v
extensively in our submitted video
4.1 Metrically Consistent Re
Our tracking and mapping system
rithm for a given area of reconstr
investigating its ability to form m
trajectories containing local loop cl
global joint-estimation. We are als
system to scale gracefully with d
resources.
he 3⇥6 matrix G is formed with the skew-symmetric matrix
eV
g
k(u):
G(u) =
h
eV
g
k(u)
i
⇥
I3⇥3 . (21)
ation is obtained by solving:
min
x2R6
Â
Wk(u)6=null
kEk2
2 (22)
E = ˆN
g
k 1(û)>
⇣
G(u)x+ eV
g
k(u) ˆV
g
k 1(û)
⌘
(23)
mputing the derivative of the objective function with respect
parameter vector x and setting to zero, a 6 ⇥ 6 symmetric
ystem is generated for each vertex-normal element corre-
nce:
⇣ ⌘
4.1 M
Our tra
rithm f
investig
trajecto
global
system
resourc
To i
perime
ing a t
then sp
⇡ 19 s
our sys
region
tained i
a preci
easily e
x y z
g
k(u) = eTz 1
g,k
eTz
g,k
˙Vk(u) = eRz eV
g
g
k(u) , (20)
where the 3⇥6 matrix G is formed with the skew-symmetric matrix
form of eV
g
k(u):
G(u) =
h
eV
g
k(u)
i
⇥
I3⇥3 . (21)
An iteration is obtained by solving:
min
x2R6
Â
Wk(u)6=null
kEk2
2 (22)
E = ˆN
g
k 1(û)>
⇣
G(u)x+ eV
g
k(u) ˆV
g
k 1(û)
⌘
(23)
By computing the derivative of the objective function with respect
to the parameter vector x and setting to zero, a 6 ⇥ 6 symmetric
linear system is generated for each vertex-normal element corre-
ICPの更新量 :
Energy Function:

•  Coarse-to-fine Estimation
–  3Level pyramid
–  各レベルにおいて、更新回数の上限を設定
•  L3 : 4回
•  L2 : 5回
•  L1 : 10回

Result
https://www.youtube.com/watch?v=quGhaggn3cQ

ここまでが
KinectFusion
ここからが
DynamicFusion

DynamicFusion: Overview (1/3)
•  KinectFusionをDynamicなシーンに拡張
KinectFusion: Real-Time Dense Surface Mapping and Trac
Richard A. Newcombe
Shahram Izadi
Microsoft Research
Otmar Hilliges
Microsoft Research
David Molyneaux
Microsoft Research
Lancaster University
D
Micr
Newc
Andrew J. Davison
Pushmeet Kohli
Microsoft Research
Jamie Shotton
Microsoft Research
Steve Hodges
Microsoft Research
Andrew
Micros
Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sen
Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the
is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).
ABSTRACT
We present a system for accurate real-time mapping of complex and
arbitrary indoor scenes in variable lighting conditions, using only a
moving low-cost depth camera and commodity graphics hardware.
We fuse all of the depth data streamed from a Kinect sensor into
a single global implicit surface model of the observed scene in
real-time. The current sensor pose is simultaneously obtained by
tracking the live depth frame relative to the global model using a
coarse-to-ﬁne iterative closest point (ICP) algorithm, which uses
all of the observed depth data available. We demonstrate the advan-
tages of tracking against the growing full surface model compared
with frame-to-frame tracking, obtaining tracking and mapping re-
sults in constant time within room sized scenes with limited drift
and high accuracy. We also show both qualitative and quantitative
1 INTRODUCTION
Real-time infrastructure-free tracking of a hand
simultaneously mapping the physical scene in h
new possibilities for augmented and mixed reali
In computer vision, research on structure fr
and multi-view stereo (MVS) has produced ma
sults, in particular accurate camera tracking and
tions (e.g. [10]), and increasingly reconstruction
(e.g. [24]). However, much of this work was not
time applications.
Research on simultaneous localisation and ma
focused more on real-time markerless tracking
construction based on the input of a single com
monocular RGB camera. Such ‘monocular SLA
MonoSLAM [8] and the more accurate Parallel

•  基本的はKinectFusion
–  シーン表現:
Truncated Signed Distance Function (TSDF)
–  入力データはDepth Map
–  Pose推定しながらDepth MapをFusion
•  拡張ポイント
–  Dynamicなシーンの表現 :
Canonical Space + Warp-Field
–  Canonical Space : 基準空間
–  Warp-Field : Live Frameと基準空間の間の変形
(Deformation)

(a) Initial Frame at t = 0s (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s (
(d) Canonical Model (e) Canonical model warped into its live frame (f
Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconst
scene (d,e). To achieve this, we estimate a volumetric warp (motion) ﬁeld that transforms the canonical model spa
enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconst
neously, the structure of the warp ﬁeld is constructed as a set of sparse 6D transformation nodes that are smoothl

Result
https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be

DynamicFusion Outline
1.  Dense Non-rigid Warp Field
–  Dynamicなシーンの表現方法について
2.  Dense Non-rigid Surface Fusion
–  どうやってLive FrameをFusionするのか
3.  Estimating the Warp-Field State
–  Warp-Fieldの推定方法
4.  Extending the Warp-Field
–  Warp-Field構造の逐次追加方法

Dense Non-rigid Warp Field (1/5)
•  Live FrameとCanonical Model間の変形
•  単純に考えると
–  Volumeの各点に6D Transformationを設定
–  2563のVoxelとすると、推定する必要のあるパラメー
タ数:
2563 x 6 = 16,777,216 x 6 = 1億 / frame
–  パラメータ数が爆発。実現不能。
•  疎な6D変形ノードと補間による表現を導入

•  疎な6D変形ノード
: Embedded Deformation Graph
–  Polygon MeshをGraph構造で表現
–  Rigid-as-possibleな変形を実現
•  変形の補間
: Dual-Quaternion Blending
–  ある点xcの変形をk近傍ノードの重み付平均で表現
–  アーチファクトの少ないSkinningを高速に実現

•  Embedded Deformation Graph
–  Graphics分野の手法
–  Rigid-as-possibleな変形を実現
–  Polygon MeshをGraphで表現。ある点が移動した場合に、接続
された点が連動して動く
R.W.Sumner,J.Schmid,andM.Pauly.Embeddeddeformation for shape manipulation. ACM Trans. Graph., 26(3):80, 2007. 2, 4, 6
mply
orm
fer-
tion
ount
r to
tion
odes
ape
ica-
(2)
mely
like
nce
uce
000;
dis-
ons
User edit
g0 g1 g2 g3 g4
˜g0 ˜g1
˜g2
˜g3
˜g4
Econ
Econ +EregEcon +Ereg +Erot
R2(g3 g2)+g2 +t2
R2(g1 g2)+g2 +t2
˜g2 = g2 +t2
Figure 2: A simple deformation graph shows the effect of the three
terms of the objective function. The quadrilaterals at each graph
node illustrate the deformation induced by the corresponding afﬁne
transformation. Without the rotational term, unnatural shearing

•  Dual-Quaternion Blending
–  Graphics分野の手法
–  高速でアーチファクトの少ないSkinningを実現
–  できるだけ体積を保存するように補間
–  2つのQuaternionを使って、回転と平行移動を表現
ns as bases and define the dense volumet-
n through interpolation. Due to its compu-
y and high quality interpolation capability
ternion blending DQB [11], to define our
xc) ⌘ SE3(DQB(xc)) , (1)
ed average over unit dual quaternion trans-
mply DQB(xc) ⌘
P
k2N (xc) wk(xc)ˆqkc
k
P
k2N (xc) wk(xc)ˆqkck
,
Wt(vc)(v>
c , 1)>
and (n
that scaling of space can
function, since compres
resented by neighbourin
diverging directions. F
out any rigid body trans
the volume, e.g., due to
troduce the explicit warp
Tlw, and compose this
(e) Canonical model warped into its live frame (f) Model Normals
online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving
mate a volumetric warp (motion) field that transforms the canonical model space into the live frame,
done, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simulta-
eld is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through
onical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and
nsformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample
ond of scene motion together with a coordinate frame showing the rigid body component of the scene
node to model surface distance where increased distance is mapped to a lighter value.
nd rotation results in signif-
struction. For each canoni-
transforms that point from
-rigidly deformed frame of
the warp function for each
n must be efficiently opti-
ensely sample the volume,
E(3) field at the resolution
(TSDF) geometric repre-
DF volume reconstruction
2563
voxels would require
eters per frame, about 10
with each unit dual-quaternion ˆqkc 2 R8
. Here, N (x) are
the k-nearest transformation nodes to the point x and wk :
R3
7! R defines a weight that alters the radius of influence
of each node and SE3(.) converts from quaternions back to
an SE(3) transformation matrix. The state of the warp-field
Wt at time t is defined by the values of a set of n defor-
mation nodes N t
warp = {dgv, dgw, dgse3}t. Each of the
i = 1..n nodes has a position in the canonical frame dgi
v 2
R3
, its associated transformation Tic = dgi
se3, and a ra-
dial basis weight dgw that controls the extent of the trans-
formation wi(xc) = exp kdgi
v xck2
/ 2(dgi
w)2
.
Each radius parameter dgi
w is set to ensure the node’s in-
fluence overlaps with neighbouring nodes, dependent on
the sampling sparsity of nodes, which we describe in de-
: Dual-Quaternion
L. Kavan, S. Collins, J. Zˇa ́ra, and C. O’Sullivan. Skinning with Dual Quaternions. In Proceedings of the 2007 Symposium
on Interactive 3D Graphics and Games, I3D ’07, pages 39–46, New York, NY, USA, 2007. ACM. 3
Linear Blending Dual Quaternion Blending
図は以下より転用
http://rodolphe-vaillant.fr/?e=29

•  疎な6D変換ノードのパラメータ：
3
f each node and SE3(.) converts from quaternions back to
n SE(3) transformation matrix. The state of the warp-field
ation nodes Nt
= 1..n nodes has a position in the canonical frame dgi
v 2
3
se3, and a ra-
al basis weight dgw that controls the extent of the trans-
ormation wi(xc) = exp kdgi
v xck2
/ 2(dgi
w)2
.
ach radius parameter dgi
uence overlaps with neighbouring nodes, dependent on
e sampling sparsity of nodes, which we describe in de-
il in section (3.4). Since the warp function defines a
gid body transformation for all supported space, both posi-
.
Here, N (x) are
oint x and wk :
ius of influence
ernions back to
f the warp-field
set of n defor-
}t. Each of the
al frame dgi
v 2
gi
se3, and a ra-
nt of the trans-
2(dgi
w)2
.
the node’s in-
dependent on
describe in de-
ction defines a
pace, both posi-
is transformed,
s in signif-
ch canoni-
point from
ed frame of
on for each
ently opti-
he volume,
resolution
tric repre-
onstruction
uld require
, about 10
Fusion al-
ransforma-
ion of the
nd to move
sparse set
. Here, N (x) are
the k-nearest transformation nodes to the point x and wk :
R3
mation nodes Nt
v 2
R3
se3, and a ra-
v xck2
/ 2(dgi
w)2
.
the sampling sparsity of nodes, which we describe in de-
tail in section (3.4). Since the warp function defines a
rigid body transformation for all supported space, both posi-
tion and any associated orientation of space is transformed,
e.g., the vertex vc from a surface with orientation or nor-
>
d to a lighter value.
ion ˆqkc 2 R8
. Here, N (x) are
n nodes to the point x and wk :
hat alters the radius of influence
nverts from quaternions back to
atrix. The state of the warp-field
the values of a set of n defor-
gv, dgw, dgse3}t. Each of the
n in the canonical frame dgi
v 2
mation Tic = dgi
se3, and a ra-
controls the extent of the trans-
kdgi
v xck2
/ 2(dgi
w)2
.
is set to ensure the node’s in-
hbouring nodes, dependent on
odes, which we describe in de-
e the warp function defines a
h unit dual-quaternion ˆqkc 2 R8
. Here, N (x) are
rest transformation nodes to the point x and wk :
defines a weight that alters the radius of influence
ode and SE3(.) converts from quaternions back to
transformation matrix. The state of the warp-field
me t is defined by the values of a set of n defor-
odes N t
nodes has a position in the canonical frame dgi
v 2
ssociated transformation Tic = dgi
se3, and a ra-
weight dgw that controls the extent of the trans-
n wi(xc) = exp kdgi
v xck2
/ 2(dgi
w)2
.
ius parameter dgi
verlaps with neighbouring nodes, dependent on
ling sparsity of nodes, which we describe in de-
: Canonical Frameにおける位置
: Canonical - Live Frame間Transform
: ノードの影響範囲コントロール
6D transformation nodes that are smoothly interpolated through
rame warp field estimate enables the progressively denoised and
-time (e). In (e) we also visualise motion trails for a sub-sample
coordinate frame showing the rigid body component of the scene
re increased distance is mapped to a lighter value.
. Here, N (x) are
he k-nearest transformation nodes to the point x and wk :
R3
mation nodes N t
v 2
R3
se3, and a ra-
v xck2
/ 2(dgi
w)2
.
he sampling sparsity of nodes, which we describe in de-
DQBの重み

•  Warp-Field:
our complete warp-field is then given as:
Wt(xc) = TlwSE3(DQB(xc)). (2
3.2. Dense Non-Rigid Surface Fusion
We now describe how, given the model-to-frame warp
field Wt, we update our canonical model geometry. Ou
reconstruction into the canonical space S is represented by
the sampled TSDF V : S 7! R2
within a voxel domain
S ⇢ N3
. The sampled function holds for each voxel x 2
S corresponding to the sampled point xc, a tuple V(x) 7!
>
点xcのWarpe warp-field is then given as:
Wt(xc) = TlwSE3(DQB(xc)). (2)
Non-Rigid Surface Fusion
describe how, given the model-to-frame warp
we update our canonical model geometry. Our
on into the canonical space S is represented by
2
ou
al
ta
di
pr
1
k
fu
K
and outputs a real-time dense reconstruction of the moving
at transforms the canonical model space into the live frame,
used into a single rigid TSDF reconstruction (d,f). Simulta-
ransformation nodes that are smoothly interpolated through
e warp field estimate enables the progressively denoised and
(e). In (e) we also visualise motion trails for a sub-sample
dinate frame showing the rigid body component of the scene
creased distance is mapped to a lighter value.
each unit dual-quaternion ˆqkc 2 R8
. Here, N (x) are
-nearest transformation nodes to the point x and wk :
! R defines a weight that alters the radius of influence
ch node and SE3(.) converts from quaternions back to
E(3) transformation matrix. The state of the warp-field
at time t is defined by the values of a set of n defor-
on nodes Nt
1..n nodes has a position in the canonical frame dgi
v 2
its associated transformation Tic = dgi
se3, and a ra-
basis weight dgw that controls the extent of the trans-
: Cameraの移動に伴う、シーン全体の変形
: Dual-QuaternionからSE3(ユークリッド空間の運動群)
のへの変換

Dense Non-Rigid Surface Fusion (1/3)
•  Live FrameのDepthからCanonical Modelを更新
–  Warp-Fieldは得られているという前提
•  更新方法:
–  Warp-Fieldを使って、Canonical ModelをLive Frame
に合わせて変形
–  そのTSDFと、Live FrameのTSDFをFusion

warped center into the depth frame. This allows the TSDF
for a point in the canonical frame to be updated by com-
puting the projective TSDF in the deforming frame without
having to resample a warped TSDF in the live frame. The
projective signed distance at the warped canonical point is:
psdf(xc) =
h
K 1
Dt (uc)
⇥
u>
c , 1
⇤>
i
z
[xt]z , (3)
where uc = ⇡ (Kxt) is the pixel into which the voxel cen-
ter projects. We compute distance along the optical (z) axis
of the camera frame using the z component denoted [.]z.
K is the known 3 ⇥ 3 camera intrinsic matrix, and ⇡ per-
forms perspective projection. For each voxel x, we update
the TSDF to incorporate the projective SDF observed in the
warped frame using TSDF fusion:
V(x)t =
(
[v0
(x), w0
(x)]>
, if psdf(dc(x)) > ⌧
V(x)t 1, otherwise
(4)
where dc(.) transforms a discrete voxel point into the con-
tinuous TSDF domain. The truncation distance ⌧ > 0 and
the updated TSDF value is given by the weighted averaging
We es
dgse3 in
current re
that is mi
E(Wt, V
Our data
cost Dat
isation te
tion fields
tween tra
The coup
transform
ularisatio
model int
enables a
when giv
sistent de
space. We
psdf(xc) =
h
K 1
Dt (uc)
⇥
u>
c , 1
⇤>
i
z
[xt]z , (3)
V(x)t =
(
[v0
(x), w0
(x)]>
V(x)t 1, otherwise
(4)
scheme [5], with the weight truncation introduced in [18]:
v0
(x) =
v(x)t 1w(x)t 1 + min(⇢, ⌧)w(x)
w(x)t 1 + w(x)
⇢ = psdf(dc(x))
w0
(x) = min(w(x)t 1 + w(x), wmax) . (5)
Unlike the static fusion scenario where the weight w(x)
encodes the uncertainty of the depth value observed at the
projected pixel in the depth frame, we also account for un-
certainty associated with the warp function at xc. In the
E(Wt, V, Dt, E) = Data(Wt
Our data term consists of a
cost Data(Wt, V, Dt) which
isation term Reg(Wt, E) tha
tion fields, and ensures as-rigid
tween transformation nodes c
The coupling of a data-term f
transformations with a rigid-a
ularisation is a form of the e
model introduced in [25]. Th
enables a trade-off between re
when given high quality data,
sistent deformation of non or
space. We defined these terms
3.3.1 Dense Non-Rigid ICP
Our aim is to estimate all non
eters Tic and Tlw that warp t
live frame. We achieve this
rigid alignment of the curren
tracted from the canonical vol
live frame’s depth map.
Surface Prediction and Da
zero level set of the TSDF V is
and stored as a polygon mesh w
having to resample a warped TSDF in the live frame. The
psdf(xc) =
h
K 1
Dt (uc)
⇥
u>
c , 1
⇤>
i
z
[xt]z , (3)
V(x)t =
(
[v0
(x), w0
(x)]>
V(x)t 1, otherwise
(4)
scheme [5], with the weight truncation introduced in [18]:
v0
(x) =
v(x)t 1w(x)t 1 + min(⇢, ⌧)w(x)
w(x)t 1 + w(x)
⇢ = psdf(dc(x))
w0
(x) = min(w(x)t 1 + w(x), wmax) . (5)
Unlike the static fusion scenario where the weight w(x)
encodes the uncertainty of the depth value observed at the
that is minimised by o
E(Wt, V, Dt, E) = D
Our data term consi
cost Data(Wt, V, D
isation term Reg(W
tion fields, and ensure
tween transformation
The coupling of a da
transformations with
ularisation is a form
model introduced in
enables a trade-off be
when given high qua
sistent deformation o
space. We defined the
3.3.1 Dense Non-R
Our aim is to estimat
eters Tic and Tlw tha
live frame. We achi
rigid alignment of th
tracted from the cano
live frame’s depth ma
Surface Predictio
Live FrameのProjective Signed Distance Function:
TSDFのFusion:
Truncate:

Non-rigid scene deformation
(a) Live frame t = 0 (b) Live Frame t = 1 (c) Canonical 7! Live (d) Live fram
Figure 3: An illustration of how each point in the canonical frame maps, through a
when observing a deforming scene. In (a) the ﬁrst view of a dynamic scene is observ
initialized to the identity transform and the three rays shown in the live frame also ma
deforms in the live frame (b), the warp function transforms each point from the can

Estimating the Warp-Field State (1/5)
•  Depth map Dt からdgse3を推定
DF
m-
ut
he
s:
3)
n-
xis
]z.
er-
ate
he
4)
We estimate the current values of the transformations
dgse3 in Wt given a newly observed depth map Dt and the
current reconstruction V by constructing an energy function
that is minimised by our desired parameters:
E(Wt, V, Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6)
Our data term consists of a dense model-to-frame ICP
cost Data(Wt, V, Dt) which is coupled with a regular-
isation term Reg(Wt, E) that penalises non-smooth mo-
tion fields, and ensures as-rigid-as-possible deformation be-
tween transformation nodes connected by the edge set E.
The coupling of a data-term formed from linearly blended
transformations with a rigid-as-possible graph based reg-
ularisation is a form of the embedded deformation graph
model introduced in [25]. The regularisation parameter
enables a trade-off between relaxing rigidity over the field
when given high quality data, and ensuring a smooth con-
sistent deformation of non or noisily observed regions of
これを最小化する各パラメータを推定する
Energy データ項
ICP cost
正規化項
滑らかでない動き
に対するペナルティ
Wt : Warp Field
: 現在のSurface (Polygon Mesh)
Dt : Live frame Depth map
ε : Edge set (ノードの接続)
ace into the live (non-rigid) frame (see Figure 3).
hnique greatly simplifies the non-rigid reconstruc-
cess over methods where all frames are explicitly
nto a canonical frame. Furthermore, given a cor-
p field, then, since all TSDF updates are computed
stances in the camera frame, the non-rigid projec-
DF fusion approach maintains the optimality guar-
or surface reconstruction from noisy observations
y proved for the static reconstruction case in [6].
imating the Warp-field State Wt
stimate the current values of the transformations
Wt given a newly observed depth map Dt and the
econstruction V by constructing an energy function
inimised by our desired parameters:
, Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6)
a term consists of a dense model-to-frame ICP
ta(Wt, V, Dt) which is coupled with a regular-

•  現在のSurfaceを抽出
–  TSDFからZero Level setをMarching cubeで抽出
https://ja.wikipedia.org/wiki/マーチングキューブ法

•  ICP Cost
–  Canonical Modelを現在のTでレンダリング
–  Live FrameのDepth mapとのエラーをPoint-Planeで計算
transform close, modulo observation noise, to the live sur-
face vl : ⌦ 7! R3
, formed by back projection of the depth
image [vl(u)>
, 1]>
= K 1
Dt(u)[u>
, 1]>
. This can be
quantified by a per pixel dense model-to-frame point-plane
error, which we compute under the robust Tukey penalty
function data, summed over the predicted image domain
⌦:
Data(W, V, Dt) ⌘
X
u2⌦
data ˆn>
u (ˆvu vl˜u) .(7)
augment. The transformed model vertex v(u) is simply
˜Tu
= W(v(u)), producing the current canonical to live
frame point-normal predictions ˆvu = ˜Tu
v(u) and ˆnu =
˜Tu
n(u), and data-association of that model point-normal
is made with a live frame point-normal through perspective
projection into the pixel ˜u = ⇡(Kˆvu).
We note that, ignoring the negligible cost of rendering
the geometry ˆVw, the ability to extract, predict, and perform
projective data association with the currently visible canoni-
cal geometry leads to a data-term evaluation that has a com-
into v
which
assoc
insuffi
logue
in opt
How
ometr
dynam
make
it def
We
betwe
betwe
tion t
contin
larisa
Reg(
Point-Plane Error
face vl : ⌦ 7! R3
image [vl(u)>
, 1]>
= K 1
Dt(u)[u>
, 1]>
. This can be
⌦:
Data(W, V, Dt) ⌘
X
u2⌦
data ˆn>
u (ˆvu vl˜u) .(7)
˜Tu
v(u) and ˆnu =
˜Tu
cal geometry leads to a data-term evaluation that has a com-
putational complexity with an upper bound in the number
which n
associat
insuffici
logue to
in optim
How sh
ometry?
dynamic
make us
it deform
We u
between
between
tion term
continui
larisatio
Reg(W
where E
ed by back projection of the depth
K 1
Dt(u)[u>
, 1]>
. This can be
dense model-to-frame point-plane
te under the robust Tukey penalty
d over the predicted image domain
X
u2⌦
data ˆn>
u (ˆvu vl˜u) .(7)
med model vertex v(u) is simply
ucing the current canonical to live
dictions ˆvu = ˜Tu
v(u) and ˆnu =
ciation of that model point-normal
e point-normal through perspective
˜u = ⇡(Kˆvu).
ng the negligible cost of rendering
lity to extract, predict, and perform
on with the currently visible canoni-
ata-term evaluation that has a com-
associated data term. In any case,
insufficient geometric texture in t
logue to the aperture problem in o
in optimisation of the transform pa
How should we constrain the mot
ometry? Whilst the fully correct m
dynamics and, where applicable, t
make use of a simpler model of un
it deforms in a piece-wise smooth
We use a deformation graph bas
between transformation nodes, wh
between nodes i and j adds a rigi
tion term to the total error being m
continuity preserving Huber penal
larisation term sums over all pair-w
Reg(W, E) ⌘
nX
i=0
X
j2E(i)
↵ij reg
ometry should
to the live sur-
on of the depth
. This can be
me point-plane
Tukey penalty
image domain
ˆvu vl˜u) .(7)
v(u) is simply
nonical to live
(u) and ˆnu =
l point-normal
ugh perspective
st of rendering
ct, and perform
into view. However, nodes affecting canonical space within
which no currently observed surface resides will have no
associated data term. In any case, noise, missing data and
insufficient geometric texture in the live frame – an ana-
logue to the aperture problem in optical-flow – will result
in optimisation of the transform parameters being ill-posed.
How should we constrain the motion of non-observed ge-
ometry? Whilst the fully correct motion depends on object
dynamics and, where applicable, the subject’s volition, we
make use of a simpler model of unobserved geometry: that
it deforms in a piece-wise smooth way.
We use a deformation graph based regularization defined
between transformation nodes, where an edge in the graph
between nodes i and j adds a rigid-as-possible regularisa-
tion term to the total error being minimized, under the dis-
continuity preserving Huber penalty reg. The total regu-
larisation term sums over all pair-wise connected nodes:
Reg(W, E) ⌘
nX X
↵ij reg Ticdgj
Tjcdgj
, (
transform close, modulo observation noise, to the live sur-
face vl : ⌦ 7! R3
image [vl(u)>
, 1]>
= K 1
Dt(u)[u>
, 1]>
. This can be
⌦:
Data(W, V, Dt) ⌘
X
u2⌦
data ˆn>
u (ˆvu vl˜u) .(7)
˜Tu
v(u) and ˆnu =
˜Tu
Ω : ピクセルドメイン
ine. This results in a prediction of
eometry that is currently predicted
frame: P(ˆVc). We store this pre-
ages {v, n} : ⌦ 7! P(ˆVc), where
of the predicted images, storing the
me vertices and normals.
ormation parameters for the current
cted-to-be-visible geometry should
o observation noise, to the live sur-
med by back projection of the depth
K 1
Dt(u)[u>
, 1]>
. This can be
l dense model-to-frame point-plane
ute under the robust Tukey penalty
d over the predicted image domain
⌘
X
u2⌦
data ˆn>
u (ˆvu vl˜u) .(7)
med model vertex v(u) is simply
point-plane data-term evaluation
3.3.2 Warp-field Regularizat
It is crucial for our non-rigid TS
timate a deformation not only o
but over all space within S. T
of new regions of the scene surf
into view. However, nodes affec
which no currently observed su
associated data term. In any ca
insufficient geometric texture in
logue to the aperture problem i
in optimisation of the transform
How should we constrain the m
ometry? Whilst the fully correc
dynamics and, where applicable
make use of a simpler model of
it deforms in a piece-wise smoo
: robust Tukey Penalty function

•  Live Frameから見えていない部分について
•  Deformation Graphで求める:
–  ノードの接続情報をもとに、全体でRigid-as-possible
になるように正規化
imply
o live
ˆnu =
ormal
ective
dering
rform
anoni-
com-
umber
data-
trans-
Reg(W, E) ⌘
nX
i=0
X
j2E(i)
↵ij reg Ticdgj
v Tjcdgj
v , (8)
where E defines the regularisation graph topology, and ↵ij
defines the weight associated with the edge, which we set
to ↵ij = max(dgi
w, dgj
w).
ε(i) : i番目のノードの接続エッジ
nodes affecting canonical space within
observed surface resides will have no
In any case, noise, missing data and
c texture in the live frame – an ana-
problem in optical-flow – will result
transform parameters being ill-posed.
train the motion of non-observed ge-
ully correct motion depends on object
applicable, the subject’s volition, we
r model of unobserved geometry: that
wise smooth way.
ion graph based regularization defined
on nodes, where an edge in the graph
j adds a rigid-as-possible regularisa-
error being minimized, under the dis-
Huber penalty reg. The total regu-
over all pair-wise connected nodes:
X
2E(i)
↵ij reg Ticdgj
v Tjcdgj
v , (8)
egularisation graph topology, and ↵ij
sociated with the edge, which we set
dgj
w).
: Huber penalty
ometry? Whilst the fully correct motion depends on object
dynamics and, where applicable, the subject’s volition, we
Reg(W, E) ⌘
nX
i=0
X
j2E(i)
↵ij reg Ticdgj
v Tjcdgj
v , (8)
where E defines the regularisation graph topology, and ↵ij
defines the weight associated with the edge, which we set
to ↵ij = max(dgi
w, dgj
w).

•  Hierarchical Deformation Tree:
–  Deformation Graphをピラミッド化
–  Coarse-to-Fineな最適化
•  Embedded Deformation Graph (Original)
–  Regularizationの影響範囲をk-nearest neighborで定義
•  それに比べて
–  計算コスト・安定性の面でメリット

Efficient Optimization
•  まず、KinectFusionと同じ手順で、Camera
Pose (Tlw)を推定
•  次に、Eの最小化を行うことで、各Deformation
NodeのWarp-Fieldを求める。
•  Gauss-Newton法で解く
–  Gauss-Newton法: 二乗和の最小化に特化したNewton
法
–  ヘッセ行列 (2階微分) をヤコビ行列 (1階微分) で近似
–  コレスキー分解を使って効率的に解く
approach origi-
-rigidly deform-
t, we transform
warp into the live
arry through the
y projecting the
llows the TSDF
pdated by com-
g frame without
live frame. The
nonical point is:
[xt]z , (3)
h the voxel cen-
e optical (z) axis
ent denoted [.]z.
trix, and ⇡ per-
el x, we update
observed in the
x)) > ⌧
using distances in the camera frame, the non-rigid projec-
tive TSDF fusion approach maintains the optimality guar-
antees for surface reconstruction from noisy observations
originally proved for the static reconstruction case in [6].
3.3. Estimating the Warp-field State Wt
We estimate the current values of the transformations
dgse3 in Wt given a newly observed depth map Dt and the
current reconstruction V by constructing an energy function
that is minimised by our desired parameters:
E(Wt, V, Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6)
Our data term consists of a dense model-to-frame ICP
cost Data(Wt, V, Dt) which is coupled with a regular-
isation term Reg(Wt, E) that penalises non-smooth mo-
tion fields, and ensures as-rigid-as-possible deformation be-
tween transformation nodes connected by the edge set E.
The coupling of a data-term formed from linearly blended
transformations with a rigid-as-possible graph based reg-
ularisation is a form of the embedded deformation graph
model introduced in [25]. The regularisation parameter
enables a trade-off between relaxing rigidity over the field
Energy関数 :

Extending the Warp-Field
•  Inserting New Deformation Nodes:
–  Live FrameのSurface Fusionが終わったら、Warp Nodeの調整を
行う
–  Polygon Meshを捜査して、現在のNodeの影響範囲にいない頂点
が見つかったら、Nodeを追加する
–  追加されたNodeの初期Warpパラメータは、DQBで計算
•  Updating the Regularization Graph:
–  Hierarchical Deformation Treeの構築
–  Level 0 = Warp Node
–  next level:
•  各Regularization Nodeを捜査して、Nodeを間引き
•  WarpはDQBで再計算

Result
https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be

Limitation
•  Quickly move from closed to open topology
–  Canonical Modelが閉じたTopologyである場合に、そ
こから開くと失敗しやすい
–  Regularization GraphのEdgeが分離できない
–  逆に、開いた状態から閉じるは表現できる
•  とはいえ、GraphのEdgeは繋がらない
•  高速な移動
•  Live Frameから見えていない範囲の移動

Conclusion
•  DynamicFusion
–  DynamicなシーンでSLAM
–  リアルタイムに3D Reconstruction
•  Contribution
–  Non-rigid volumetric TSDF fusion
–  Estimate 6D warp-field in Realtime

30th コンピュータビジョン勉強会@関東 DynamicFusion

More Related Content

What's hot

Similar to 30th コンピュータビジョン勉強会@関東 DynamicFusion

Recently uploaded

30th コンピュータビジョン勉強会@関東 DynamicFusion