Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

5,519 views

Published on

第30回 (後編) コンピュータビジョン勉強会@関東

CVPR論文読み会

DynamicFusion

Published in:
Technology

No Downloads

Total views

5,519

On SlideShare

0

From Embeds

0

Number of Embeds

542

Shares

0

Downloads

142

Comments

0

Likes

14

No embeds

No notes for slide

- 1. DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time Richard A. Newbomb Dieter Fox Steven M. Seitz Hiroki Mizuno July 25 '15 1第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会
- 2. 選んだ理由 (1/2) • 一番初めに目に入った 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 2
- 3. 選んだ理由 (2/2) • 著者がすごい 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 3 Richard A. Newbombe DTAM, KinectFusionの著者 Surreal VisionのCo-Founder Dieter Fox 確率ロボティクスの著者 Steven M. Seitz Multi View Stereo界隈の大御所
- 4. What's DynamicFusion • 論文タイトル: – DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time • Motivation: – "How can we generalise KinectFusion to reconstruct and track dynamic, non-rigid scenes in real-time?" • ざっくり言うと: "KinectFusionをDynamicなシーンに拡張" 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 4
- 5. 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 5 というわけで まずは KinectFusion の話をします
- 6. KinectFusion • KinectFusion: Real-Time Dense Surface Mapping and Tracking – ISMAR 2011で発表 (Best Paper Award) – Richard A. Newbombe 他 (MSRの人が数名連名) • ざっくり概要 – KinectのDepth Mapを入力としたSLAM – Depth Mapをリアルタイムに融合して、綺麗な面を生成 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 6 KinectFusion: Real-Time Dense Surface Mapping and Tracking⇤ Richard A. Newcombe Imperial College London Shahram Izadi Microsoft Research Otmar Hilliges Microsoft Research David Molyneaux Microsoft Research Lancaster University David Kim Microsoft Research Newcastle University Andrew J. Davison Imperial College London Pushmeet Kohli Microsoft Research Jamie Shotton Microsoft Research Steve Hodges Microsoft Research Andrew Fitzgibbon Microsoft Research Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).
- 7. Result 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 7 https://www.youtube.com/watch?v=quGhaggn3cQ
- 8. Overview: 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 8 Rk Tg,k Rk Tg,k-1 Input Measurement Pose Estimation Update Reconstruction Surface Prediction Compute Surface Nertex and Normal Maps ICP of Predicted and Measured Surface Integrate Surface Measurement into Global TSDF Ray-cast TSDF to Compute Surface Prediction Tg,k-1 k Vk-1 S Nk-1 ^ ^ Vk Nk Figure 3: Overall system workﬂow. Surface reconstruction update: The global scene fusion pro- cess, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model maintained with a volumetric, truncated signed dis- tance function (TSDF) representation. Surface prediction: Unlike frame-to-frame pose estimation as performed in [15], we close the loop between mapping and local- isation by tracking the live depth frame against the globally fused model. This is performed by raycasting the signed distance func- tion into the estimated frame to provide a dense surface prediction against which the live depth map is aligned. Figure 4: A s showing the tru ﬁeld around the had a valid mea bouring map ve Nk(u) = n ⇥ (V on dy- moving move. ects. UIST’11, October 16–19, 2011, Santa Barbara, CA, USA
- 9. Mapping as Surface Reconstruction 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 9 Rk Tg,k Rk Tg,k-1 Input Measurement Pose Estimation Update Reconstruction Surface Prediction Compute Surface Nertex and Normal Maps ICP of Predicted and Measured Surface Integrate Surface Measurement into Global TSDF Ray-cast TSDF to Compute Surface Prediction Tg,k-1 k Vk-1 S Nk-1 ^ ^ Vk Nk Figure 3: Overall system workﬂow. Surface reconstruction update: The global scene fusion pro- cess, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into Figure showin ﬁeld ar
- 10. Mapping as Surface Reconstruction (1/4) • カメラのポーズがわかっている前提で、 Live FrameをGlobal Sceneに統合 (Fusion) • Sceneの表現: – Volumetric Truncated Signed Distance Function (TSDF)※ – Volumeデータ構造 – 面の位置をzeroとして、各Voxelに面までの距離を符 号付で格納 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 10 ※ B. Curless and M. Levoy, A Volumetric for Building Complex Models from Range Images In ACM Transactions on Graphics (SIGGRAPH), 1996.
- 11. Mapping as Surface Reconstruction (2/4) • TSDF概要 – 複数視点からのDepth mapを統合するアルゴリズム – ある視点から得られたDepth mapをTSDFに変換 – 別の視点からのDepth mapも同様にTSDFに変換 – これらの重み付和をとることで、複数視点からのDepth mapが統合される – 全部の視点が統合されたら、Marching Cubeなどで面を取り出すことが可能 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 11 Sensor Near Far x x Volume Range surface Zero-crossing (isosurface) x x New zero-crossing Distance from surface (a) (b) Figure 2. Unweighted signed distance functions in 3D. (a) A range sensor looking down the x-axis observes a range image, shown here as a recon- structed range surface. Following one line of sight down the x-axis, we r 1 d 1 (x) w1 (x) r 2 w2 (x) d 2 (x) W(x) D(x) R (a) (b) x x Sensor Figure 3. Signed distance and weight functions in one dimension. (a) The sensor looks down the x-axis and takes two measurements, and . and are the signed distance proﬁles, and and are the weight functions. In 1D, we might expect two sensor measure- ments to have the same weight magnitudes, but we have shown them to be of different magnitude here to illustrate how the proﬁles combine in the general case. (b) is a weighted combination of and , and is the sum of the weight functions. Given this formulation, the zero-crossing, , becomes the weighted combination of and and (b) (c) (e) (f) Isosurface Sensor n2 n1 Dmax Dmin (a) (d) Sensor Figure 4. Combination of signed distance and weight functions in two dimensions. (a) and (d) are the signed distance and weight functions, re- spectively, generated for a range image viewed from the sensor line of sight shown in (d). The signed distance functions are chosen to vary be- Sen Figure 5 pute the w casting a We obtai and ) sensor (li column o
- 12. Mapping as Surface Reconstruction (3/4) • Live Frame (Depth map) からTSDFへの変換 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 12 resents the distance to the nearest surface point. Although efﬁcient algorithms exist for computing the true dis- crete SDF for a given set of point measurements (complexity is linear in the the number of voxels) [22], sophisticated implemen- tations are required to achieve top performance on GPU hardware, without which real-time computation is not possible for a reason- able size volume. Instead, we use a projective truncated signed distance function that is readily computed and trivially parallelis- able. For a raw depth map Rk with a known pose Tg,k, its global frame projective TSDF [FRk ,WRk ] at a point p in the global frame g is computed as, FRk (p) = Y ⇣ l 1 k(tg,k pk2 Rk(x) ⌘ , (6) l = kK 1 ˙xk2 , (7) x = j p ⇣ KT 1 g,kp ⌘k , (8) Y(h) = ( min ⇣ 1, h µ ⌘ sgn(h) iff h µ null otherwise (9) We use a nearest neighbour lookup b.c instead of interpolating the depth value, to prevent smearing of measurements at depth dis- continuities. 1/l converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using dis- tances along the ray or along the optical axis). Y performs the SDF truncation. The truncation function is scaled to ensure that a sur- face measurement (zero crossing in the SDF) is represented by at least one non truncated voxel value in the discretised volume ei- ther side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight WRk (p) is propor- tional to cos(q)/Rk(x), where q is the angle between the associated pixel ray direction and the surface normal measurement in the local k Wk 1(p)+WRk (p Wk(p) = Wk 1(p)+WRk (p) No update on the global TSDF is performed fo from unmeasurable regions speciﬁed in Equation provides weighting of the TSDF proportional to surface measurement, we have also found that i letting WRk (p) = 1, resulting in a simple average, sults. Moreover, by truncating the updated weigh Wh , Wk(p) min(Wk 1(p)+WRk (p),W a moving average surface reconstruction can be o reconstruction in scenes with dynamic object mot Although a large number of voxels can be vis project into the current image, the simplicity of operation time is memory, not computation, bound GPU hardware over 65 gigavoxels/second (⇡ 2m update for a 5123 voxel reconstruction) can be up bits per component in S(p), although experiment iﬁed that as few as 6 bits are required for the SD we note that the raw depth measurements are used rather than the bilateral ﬁltered version used in t ponent, described later in section 3.5. The early desired high frequency structure and noise alike duce the ability to reconstruct ﬁner scale structure 3.4 Surface Prediction from Ray Casting With the most up-to-date reconstruction availabl ity to compute a dense surface prediction by rend encoded in the zero level set Fk = 0 into a virtua current estimate Tg,k. The surface prediction is s and normal map ˆVk and ˆNk in frame of referenc able size volume. Instead, we use a projective truncated signed distance function that is readily computed and trivially parallelis- able. For a raw depth map Rk with a known pose Tg,k, its global rame projective TSDF [FRk ,WRk ] at a point p in the global frame g is computed as, FRk (p) = Y ⇣ l 1 k(tg,k pk2 Rk(x) ⌘ , (6) l = kK 1 ˙xk2 , (7) x = j p ⇣ KT 1 g,kp ⌘k , (8) Y(h) = ( min ⇣ 1, h µ ⌘ sgn(h) iff h µ null otherwise (9) We use a nearest neighbour lookup b.c instead of interpolating he depth value, to prevent smearing of measurements at depth dis- continuities. 1/l converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using dis- ances along the ray or along the optical axis). Y performs the SDF runcation. The truncation function is scaled to ensure that a sur- ace measurement (zero crossing in the SDF) is represented by at east one non truncated voxel value in the discretised volume ei- provides weig surface measu letting WRk (p sults. Moreov Wh , W a moving ave reconstruction Although a project into th operation time GPU hardwar update for a 5 bits per comp iﬁed that as fe we note that th rather than th ponent, descr desired high f duce the abilit 3.4 Surfac : 点pにおけるTSDF value : 点pにおける重み continuities. 1/l converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using dis- tances along the ray or along the optical axis). Y performs the SDF truncation. The truncation function is scaled to ensure that a sur- face measurement (zero crossing in the SDF) is represented by at least one non truncated voxel value in the discretised volume ei- ther side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight WRk (p) is propor- tional to cos(q)/Rk(x), where q is the angle between the associated pixel ray direction and the surface normal measurement in the local frame. The projective TSDF measurement is only correct exactly at the surface FRk (p) = 0 or if there is only a single point measurement in isolation. When a surface is present the closest point along a ray could be another surface point not on the ray associated with the pixel in Equation 8. It has been shown that for points close to the surface, a correction can be applied by scaling the SDF by cos(q) [11]. However, we have found that approximation within the truncation region for 100s or more fused TSDFs from multiple viewpoints (as performed here) converges towards an SDF with a pseudo-Euclidean metric that does not hinder mapping and tracking performance. ponent, described late desired high frequenc duce the ability to rec 3.4 Surface Pred With the most up-to- ity to compute a dens encoded in the zero le current estimate Tg,k. and normal map ˆVk a the subsequent camer As we have a dense SDF, a per pixel rayc responding ray, Tg,kK depth for the pixel a ve for a visible surf Marching also stops mately when exiting t face measurement at t For points on or ve is assumed that the gr zero level set, and so no considerable difference in using tances along the ray or along the op truncation. The truncation functio face measurement (zero crossing i least one non truncated voxel valu ther side of the surface. Also, the s distance from the sensor center to of noisier measurements. The assoc tional to cos(q)/Rk(x), where q is pixel ray direction and the surface n frame. The projective TSDF measurem surface FRk (p) = 0 or if there is o in isolation. When a surface is pr ray could be another surface point the pixel in Equation 8. It has be to the surface, a correction can be cos(q) [11]. However, we have fo the truncation region for 100s or m viewpoints (as performed here) co pseudo-Euclidean metric that does performance. We use a nearest neighbour lookup b.c instead of interpolating the depth value, to prevent smearing of measurements at depth dis- continuities. 1/l converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using dis- tances along the ray or along the optical axis). Y performs the SDF truncation. The truncation function is scaled to ensure that a sur- face measurement (zero crossing in the SDF) is represented by at least one non truncated voxel value in the discretised volume ei- ther side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight WRk (p) is propor- tional to cos(q)/Rk(x), where q is the angle between the associated pixel ray direction and the surface normal measurement in the local frame. The projective TSDF measurement is only correct exactly at the surface FRk (p) = 0 or if there is only a single point measurement in isolation. When a surface is present the closest point along a ray could be another surface point not on the ray associated with the pixel in Equation 8. It has been shown that for points close to the surface, a correction can be applied by scaling the SDF by cos(q) [11]. However, we have found that approximation within the truncation region for 100s or more fused TSDFs from multiple viewpoints (as performed here) converges towards an SDF with a pseudo-Euclidean metric that does not hinder mapping and tracking performance. we note that the rather than the b ponent, describe desired high fre duce the ability 3.4 Surface P With the most u ity to compute a encoded in the z current estimate and normal map the subsequent c As we have a SDF, a per pixe responding ray, depth for the p ve for a visible Marching also s mately when exi face measureme For points on is assumed that zero level set, an = θ : Rayと法線のなす角 p x Global TSDF Raw Depth map µ Rk(x)
- 13. Mapping as Surface Reconstruction (4/4) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 13 実際のところ、 でいい結果になるらしい。 (5) h map Rk) pro- reconstructed. a depth mea- f the measured ong each depth ee space (here ixel ray). Sec- ained in the re- he camera ray. of uncertainty µ. A TSDF n measurement e within visible urface interface e points farther e the SDF rep- g the true dis- (complexity is ted implemen- GPU hardware, e for a reason- uncated signed ally parallelis- Tg,k, its global e global frame depth map, which can be seen as de-noising the global TSDF from multiple noisy TSDF measurements. Under an L2 norm the de- noised (fused) surface results as the zero-crossings of the point-wise SDF F minimising: min F2F Â k kWRk FRk F)k2. (10) Given that the focus of our work is on real-time sensor tracking and surface reconstruction we must maintain interactive frame-rates. (For a 640x480 depth stream at 30fps the equivalent of over 9 million new point measurements are made per second). Storing a weight Wk(p) with each value allows an important aspect of the global minimum of the convex L2 de-noising metric to be exploited for real-time fusion; that the solution can be obtained incrementally as more data terms are added using a simple weighted running av- erage [7], deﬁned point-wise {p|FRk (p) 6= null}: Fk(p) = Wk 1(p)Fk 1(p)+WRk (p)FRk (p) Wk 1(p)+WRk (p) (11) Wk(p) = Wk 1(p)+WRk (p) (12) No update on the global TSDF is performed for values resulting from unmeasurable regions speciﬁed in Equation 9. While Wk(p) provides weighting of the TSDF proportional to the uncertainty of surface measurement, we have also found that in practice simply letting WRk (p) = 1, resulting in a simple average, provides good re- sults. Moreover, by truncating the updated weight over some value Wh , Wk(p) min(Wk 1(p)+WRk (p),Wh ) , (13) a weight Wk(p) with each value allows an important aspect of the global minimum of the convex L2 de-noising metric to be exploited for real-time fusion; that the solution can be obtained incrementally as more data terms are added using a simple weighted running av- erage [7], deﬁned point-wise {p|FRk (p) 6= null}: Fk(p) = Wk 1(p)Fk 1(p)+WRk (p)FRk (p) Wk 1(p)+WRk (p) (11) Wk(p) = Wk 1(p)+WRk (p) (12) No update on the global TSDF is performed for values resulting from unmeasurable regions speciﬁed in Equation 9. While Wk(p) provides weighting of the TSDF proportional to the uncertainty of surface measurement, we have also found that in practice simply letting WRk (p) = 1, resulting in a simple average, provides good re- sults. Moreover, by truncating the updated weight over some value Wh , Given that the focus of our work is on real-time sensor trackin surface reconstruction we must maintain interactive frame- (For a 640x480 depth stream at 30fps the equivalent of o million new point measurements are made per second). S a weight Wk(p) with each value allows an important aspect global minimum of the convex L2 de-noising metric to be exp for real-time fusion; that the solution can be obtained increme as more data terms are added using a simple weighted runnin erage [7], deﬁned point-wise {p|FRk (p) 6= null}: Fk(p) = Wk 1(p)Fk 1(p)+WRk (p)FRk (p) Wk 1(p)+WRk (p) Wk(p) = Wk 1(p)+WRk (p) No update on the global TSDF is performed for values res from unmeasurable regions speciﬁed in Equation 9. While W provides weighting of the TSDF proportional to the uncertai Global SceneとLive FrameのTSDFのFusion 重みの更新 FusionされたTSDF Value Global Model Live Frame
- 14. Surface Prediction 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 14 Rk Tg,k Rk Tg,k-1 Input Measurement Pose Estimation Update Reconstruction Surface Prediction Compute Surface Nertex and Normal Maps ICP of Predicted and Measured Surface Integrate Surface Measurement into Global TSDF Ray-cast TSDF to Compute Surface Prediction Tg,k-1 k Vk-1 S Nk-1 ^ ^ Vk Nk Figure 3: Overall system workﬂow. Surface reconstruction update: The global scene fusion pro- cess, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into Figure showin ﬁeld ar
- 15. Surface Prediction • Global TSDFから任意視点におけるSurfaceを取り出す • Pose Estimationに必要 • Ray Casting (Ray marching) – 距離関数と親和性の高いレンダリング方法 – カメラからRayを飛ばす – Minimum depthから初めて、zero交差が見つかったら 止める。 – 法線の計算: 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 15 Figure 6: Demonstration of the space skipping ray casting. (Left) pixel iteration count are shown where for each pixel the ray is tra- versed in steps of at most one voxel (white equals 480 increments and black 60). (middle) ray marching steps are drastically reduced by skipping empty space according to the minimum truncation µ (white equals 70 iterations and black 10 ⇡ 6⇥ speedup). Marching steps can be seen to increase around the surface interface where the signed distance function has not been truncated. (Right) Normal map at resulting surface intersection. along which p was found can be computed directly from Fk using a numerical derivative of the SDF: Rg,k ˆNk = ˆN g k(u) = n ⇥ —F(p) ⇤ , —F = ∂F ∂x , ∂F ∂y , ∂F ∂z > (14) Further, this derivative is scaled in each dimension to ensure cor- rect isotropy given potentially arbitrary voxel resolutions and re- construction dimensions. Since the rendering of the surface is restricted to provide phys- the surface interface using eqn. 15. Figure 6: Demonstration of the space skipping ray casting. (Left) pixel iteration count are shown where for each pixel the ray is tra- versed in steps of at most one voxel (white equals 480 increments and black 60). (middle) ray marching steps are drastically reduced by skipping empty space according to the minimum truncation µ (white equals 70 iterations and black 10 ⇡ 6⇥ speedup). Marching steps can be seen to increase around the surface interface where the signed distance function has not been truncated. (Right) Normal map at resulting surface intersection. along which p was found can be computed directly from Fk using a numerical derivative of the SDF: Rg,k ˆNk = ˆN g k(u) = n ⇥ —F(p) ⇤ , —F = ∂F ∂x , ∂F ∂y , ∂F ∂z > (14) Further, this derivative is scaled in each dimension to ensure cor- rect isotropy given potentially arbitrary voxel resolutions and re- construction dimensions. Since the rendering of the surface is restricted to provide phys- ically plausible measurement predictions, there is a minimum and maximum rendering range the ray needs to traverse corresponding 3 L p t d p l a t f t p h t a d s f m a d t ( p u E
- 16. Surface Measurement 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 16 Rk Tg,k Rk Tg,k-1 Input Measurement Pose Estimation Update Reconstruction Surface Prediction Compute Surface Nertex and Normal Maps ICP of Predicted and Measured Surface Integrate Surface Measurement into Global TSDF Ray-cast TSDF to Compute Surface Prediction Tg,k-1 k Vk-1 S Nk-1 ^ ^ Vk Nk Figure 3: Overall system workﬂow. Surface reconstruction update: The global scene fusion pro- cess, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into Figure showin ﬁeld ar
- 17. Surface Measurement (1/4) • Live Frameに対する前処理： 1. De-noise (Bilateral Filter) 2. 各ピクセルごとにVertexとNormalを生成 3. Pyramid生成 (L=1, 2, 3) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 17
- 18. Surface Measurement (2/4) • De-noise 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 18 provides calibrated depth measurements Rk(u) 2 R at each image pixel u = (u,v)> in the image domain u 2 U ⇢ R2 such that pk = Rk(u)K 1 ˙u is a metric point measurement in the sensor frame of reference k. We apply a bilateral ﬁlter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise Dk, Dk(u) = 1 Wp Â q2U Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) , (2) where Ns (t) = exp( t2s 2) and Wp is a normalizing constant. We back-project the ﬁltered depth values into the sensor’s frame of reference to obtain a vertex map Vk, Vk(u) = Dk(u)K 1 ˙u . (3) Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neigh- [7] the fac vis on ple a g low wit fus Sk( rec olu res vox and ra calibration matrix e into image pixels. e projection of p 2 to obtain q 2 R2 = denote homogeneous depth map Rk which u) 2 R at each image ⇢ R2 such that pk = the sensor frame of the raw depth map to th reduced noise Dk, ) Rk(q)k2)Rk(q) , (2) corresponding depth map level. We note that global co-ordinate frame transform Tg,k associ measurement, the global frame vertex is V g k( the equivalent mapping of normal vectors int N g k(u) = Rg,kNk(u). 3.3 Mapping as Surface Reconstructio Each consecutive depth frame, with an associa estimate, is fused incrementally into one sing using the volumetric truncated signed distan [7]. In a true signed distance function, the v the signed distance to the closest zero crossin face), taking on positive and increasing valu visible surface into free space, and negative a on the non-visible side. The result of averagin ple 3D point clouds (or surface measurements) a global frame is a global surface fusion. An example given in Figure 4 demonstrate lows us to represent arbitrary genus surface : Raw Depth map We will also use a single constant camera calibration matri K that transforms points on the sensor plane into image pixel The function q = p(p) performs perspective projection of p R3 = (x,y,z)> including dehomogenisation to obtain q 2 R2 = (x/z,y/z)>. We will also use a dot notation to denote homogeneou vectors ˙u := (u>|1)> 3.2 Surface Measurement At time k a measurement comprises a raw depth map Rk whic provides calibrated depth measurements Rk(u) 2 R at each imag pixel u = (u,v)> in the image domain u 2 U ⇢ R2 such that pk = Rk(u)K 1 ˙u is a metric point measurement in the sensor frame o reference k. We apply a bilateral ﬁlter [27] to the raw depth map t obtain a discontinuity preserved depth map with reduced noise Dk Dk(u) = 1 Wp Â q2U Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) (2 where Ns (t) = exp( t2s 2) and Wp is a normalizing constant. : Pixel maps the camera coordinate frame at time k into the global frame g, such that a point pk 2 R3 in the camera frame is transferred into the global co-ordinate frame via pg = Tg,kpk. We will also use a single constant camera calibration matrix K that transforms points on the sensor plane into image pixels. The function q = p(p) performs perspective projection of p 2 R3 = (x,y,z)> including dehomogenisation to obtain q 2 R2 = (x/z,y/z)>. We will also use a dot notation to denote homogeneous vectors ˙u := (u>|1)> 3.2 Surface Measurement At time k a measurement comprises a raw depth map Rk which provides calibrated depth measurements Rk(u) 2 R at each image pixel u = (u,v)> in the image domain u 2 U ⇢ R2 such that pk = Rk(u)K 1 ˙u is a metric point measurement in the sensor frame of reference k. We apply a bilateral ﬁlter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise Dk, Dk(u) = 1 Wp Â q2U Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) , (2) where Ns (t) = exp( t2s 2) and Wp is a normalizing constant. We back-project the ﬁltered depth values into the sensor’s frame of reference to obtain a vertex map Vk, central pixel to ens aries. Subsequently Vl2[1...L], Nl2[1...L] corresponding dep global co-ordinate measurement, the the equivalent map N g k(u) = Rg,kNk(u) 3.3 Mapping as Each consecutive d estimate, is fused using the volumet [7]. In a true sign the signed distance face), taking on p visible surface into on the non-visible ple 3D point cloud a global frame is a An example giv lows us to repres within the volume. fusion of the regis Sk(p) where p 2 R R3 = (x,y,z)> including dehomogenisation to obtain q 2 R2 = (x/z,y/z)>. We will also use a dot notation to denote homogeneous vectors ˙u := (u>|1)> 3.2 Surface Measurement At time k a measurement comprises a raw depth map Rk which provides calibrated depth measurements Rk(u) 2 R at each image pixel u = (u,v)> in the image domain u 2 U ⇢ R2 such that pk = Rk(u)K 1 ˙u is a metric point measurement in the sensor frame of reference k. We apply a bilateral ﬁlter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise Dk, Dk(u) = 1 Wp Â q2U Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) , (2) where Ns (t) = exp( t2s 2) and Wp is a normalizing constant. We back-project the ﬁltered depth values into the sensor’s frame of reference to obtain a vertex map Vk, Vk(u) = Dk(u)K 1 ˙u . (3) Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neigh- the equivalent mappin N g k(u) = Rg,kNk(u). 3.3 Mapping as S Each consecutive dept estimate, is fused incr using the volumetric [7]. In a true signed the signed distance to face), taking on posit visible surface into fre on the non-visible side ple 3D point clouds (o a global frame is a glo An example given lows us to represent within the volume. W fusion of the registere Sk(p) where p 2 R3 is reconstructed. A disc olution is stored in gl reside. From here on w voxel/memory elemen and will refer only to t so use a dot notation to denote homogeneous urement ment comprises a raw depth map Rk which pth measurements Rk(u) 2 R at each image e image domain u 2 U ⇢ R2 such that pk = c point measurement in the sensor frame of a bilateral ﬁlter [27] to the raw depth map to preserved depth map with reduced noise Dk, (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) , (2) t2s 2) and Wp is a normalizing constant. e ﬁltered depth values into the sensor’s frame a vertex map Vk, Vk(u) = Dk(u)K 1 ˙u . (3) m the depth sensor is a surface measurement the equivalent mapping of norm N g k(u) = Rg,kNk(u). 3.3 Mapping as Surface R Each consecutive depth frame, estimate, is fused incrementally using the volumetric truncated [7]. In a true signed distance the signed distance to the close face), taking on positive and i visible surface into free space, on the non-visible side. The res ple 3D point clouds (or surface a global frame is a global surfac An example given in Figure lows us to represent arbitrary within the volume. We will den fusion of the registered depth m Sk(p) where p 2 R3 is a global reconstructed. A discretization olution is stored in global GPU reside. From here on we assum voxel/memory elements and th : 正規化 要するに、Depthに対するBilateral Filter
- 19. Surface Measurement (3/4) • Vertex map 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 19 Wp Â q2U Nss (ku qk2)Nsr (kRk(u) Rk(q)k2)Rk(q) , (2) (t) = exp( t2s 2) and Wp is a normalizing constant. k-project the ﬁltered depth values into the sensor’s frame ce to obtain a vertex map Vk, Vk(u) = Dk(u)K 1 ˙u . (3) h frame from the depth sensor is a surface measurement ar grid, we compute, using a cross product between neigh- a global frame is a global surfac An example given in Figure lows us to represent arbitrary within the volume. We will den fusion of the registered depth m Sk(p) where p 2 R3 is a global reconstructed. A discretization olution is stored in global GPU reside. From here on we assume voxel/memory elements and th and will refer only to the contin • Normal map De-noise後のDepth mapを使って、点 u を3次元に投影 n pro- h data ed into ed dis- ion as local- fused func- diction using d cur- ses all ame at Figure 4: A slice through the truncated signed distance volume showing the truncated function F > µ (white), the smooth distance ﬁeld around the surface interface F = 0 and voxels that have not yet had a valid measurement(grey) as detailed in eqn. 9. bouring map vertices, the corresponding normal vectors, Nk(u) = n ⇥ (Vk(u+1,v) Vk(u,v))⇥(Vk(u,v+1) Vk(u,v)) ⇤ , (4) where n[x] = x/kxk2. We also deﬁne a vertex validity mask: Mk(u) 7! 1 for each pixel where a depth measurement transforms to a valid vertex; otherwise if a depth measurement is missing Mk(u) 7! 0. The bilateral ﬁltered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5. We compute an L = 3 level multi-scale representation of the sur- face measurement in the form of a vertex and normal map pyramid. First a depth map pyramid Dl2[1...L] is computed. Setting the bottom 各頂点の3次元座標から、法線を計算 (近接ピクセル3次元座標の外積) Figure 4: A slice through the truncated signed distance volume showing the truncated function F > µ (white), the smooth distance ﬁeld around the surface interface F = 0 and voxels that have not yet had a valid measurement(grey) as detailed in eqn. 9. bouring map vertices, the corresponding normal vectors, Nk(u) = n ⇥ (Vk(u+1,v) Vk(u,v))⇥(Vk(u,v+1) Vk(u,v)) ⇤ , (4) where n[x] = x/kxk2. We also deﬁne a vertex validity mask: Mk(u) 7! 1 for each pixel where a depth measurement transforms to a valid vertex; otherwise if a depth measurement is missing Mk(u) 7! 0. The bilateral ﬁltered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5. We compute an L = 3 level multi-scale representation of the sur- :正規化 (単位ベクトル化)
- 20. Surface Measurement (4/4) • Pyramid生成 – 3 Level – L = 1 : 生の解像度 – L+1 : Lの半分の解像度 – 各レベルごとに、Vertex / Normal Mapを計算して保持 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 20
- 21. Sensor Pose Estimation 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 21 Rk Tg,k Rk Tg,k-1 Input Measurement Pose Estimation Update Reconstruction Surface Prediction Compute Surface Nertex and Normal Maps ICP of Predicted and Measured Surface Integrate Surface Measurement into Global TSDF Ray-cast TSDF to Compute Surface Prediction Tg,k-1 k Vk-1 S Nk-1 ^ ^ Vk Nk Figure 3: Overall system workﬂow. Surface reconstruction update: The global scene fusion pro- cess, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into Figure showin ﬁeld ar
- 22. Sensor Pose Estimation (1/4) • Live FrameのPoseを推定 – Live Frameの全Pixel情報を使って推定 • 仮定: – Frame間の移動量は十分小さい • Solution : ICP – Live FrameのVertex/Normal (De-noised) と、Surface Prediction間の6DOFを推定 – Point-Plane Optimization 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 22
- 23. Sensor Pose Estimation (2/4) • The global point-plane energy 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 23 (14) ure cor- and re- e phys- um and ponding [0.4,8] perty of er pixel ed volu- can be ver, due a con- simple kipping a good est sur- n march ave +ve to-frame tracking is obtained simply by setting ( ˆVk 1, ˆNk 1) (Vk 1,Nk 1) which is used in our experimental section for a com- parison between frame-to-frame and frame-model tracking. Utilising the surface prediction, the global point-plane energy, under the L2 norm for the desired camera pose estimate Tg,k is: E(Tg,k) = Â u2U Wk(u)6=null ⇣ Tg,k ˙Vk(u) ˆV g k 1 (ˆu) ⌘> ˆN g k 1 (ˆu) 2 , (16) where each global frame surface prediction is obtained using the previous ﬁxed pose estimate Tg,k 1. The projective data as- sociation algorithm produces the set of vertex correspondences {Vk(u), ˆVk 1(ˆu)|W(u) 6= null} by computing the perspectively pro- jected point, ˆu = p(KeTk 1,k ˙Vk(u)) using an estimate for the frame- frame transform eTz k 1,k = T 1 g,k 1 eTz g,k and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and difference in normal values sufﬁces to re- ject grossly incorrect correspondences, also illustrated in Figure 7: W(u) 6= null iff 8 < : Mk(u) = 1, and keTz g,k ˙Vk(u) ˆV g k 1 (ˆu)k2 ed, and heRz g,kNk(u), ˆN g k 1 (ˆu)i eq . (17) rithm [4] to obtain correspondence and the r pose optimisation. Second, modern GPU y parrallelised processing pipeline, so that point-plane optimisation can use all of the ements. r metric in combination with correspon- rojective data association was ﬁrst demon- odelling system by [23] where frame-to- (with a ﬁxed camera) for depth map align- instead track the current sensor frame by easurement (Vk,Nk) against the model pre- s frame ( ˆVk 1, ˆNk 1). We note that frame- ained simply by setting ( ˆVk 1, ˆNk 1) sed in our experimental section for a com- o-frame and frame-model tracking. prediction, the global point-plane energy, he desired camera pose estimate Tg,k is: g,k ˙Vk(u) ˆV g k 1 (ˆu) ⌘> ˆN g k 1 (ˆu) 2 , (16) e surface prediction is obtained using the algorithm [4] to obtain correspondence and the ] for pose optimisation. Second, modern GPU ully parrallelised processing pipeline, so that nd point-plane optimisation can use all of the asurements. rror metric in combination with correspon- g projective data association was ﬁrst demon- e modelling system by [23] where frame-to- sed (with a ﬁxed camera) for depth map align- we instead track the current sensor frame by e measurement (Vk,Nk) against the model pre- ious frame ( ˆVk 1, ˆNk 1). We note that frame- obtained simply by setting ( ˆVk 1, ˆNk 1) is used in our experimental section for a com- me-to-frame and frame-model tracking. ce prediction, the global point-plane energy, or the desired camera pose estimate Tg,k is: ⇣ Tg,k ˙Vk(u) ˆV g k 1 (ˆu) ⌘> ˆN g k 1 (ˆu) 2 , (16) ame surface prediction is obtained using the : Live frame : Model prediction from the previous frame shapes in base. For or virtual to form s range the range planning the ICP st widely a similar Chen92]). a recent inal ICP 92], each the other oint error between rocess is r it stops 2] used a ization is e tangent -to-point -to-plane squares Press92]. orithm is hers have e former of the Pottmann between points in the source surface and points in the destination surfaces. For example, for each point on the source surface, the nearest point on the destination surface is chosen as its correspondence [Besl92] (see [Rusinkiewicz01] for other approaches to find point correspondences). The output of an ICP iteration is a 3D rigid-body transformation M that transforms the source points such that the total error between the corresponding points, under a certain chosen error metric, is minimal. When the point-to-plane error metric is used, the object of minimization is the sum of the squared distance between each source point and the tangent plane at its corresponding destination point (see Figure 1). More specifically, if si = (six, siy, siz, 1)T is a source point, di = (dix, diy, diz, 1)T is the corresponding destination point, and ni = (nix, niy, niz, 0)T is the unit normal vector at di, then the goal of each ICP iteration is to find Mopt such that ( )( )∑ •−⋅= i iii 2 opt minarg ndsMM M (1) where M and Mopt are 4×4 3D rigid-body transformation matrices. Figure 1: Point-to-plane error between two surfaces. tangent plane s1 source point destination point d1 n1 unit normal s2 d2 n2 s3 d3 n3 destination surface source surface l1 l2 l3 Low, Kok-Lim, Linear least-squares optimization for Point-to-Plane ICP surface Registration Chapel Hill, University of North Carolina, 2004.
- 24. Sensor Pose Estimation (3/4) • Iterative solution • パラメータの更新量の計算 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 24 ve value before stepping over the surface -up obtained is demonstrated in Figure 6 of steps required for each pixel to inter- standard marching. ions can be obtained by solving a ray/tri- 1] that requires solving a cubic polyno- e we use a simple approximation. Given ntersect the SDF where F+ t and F+ t+Dt are DF values either side of the zero crossing nd t + Dt from its starting point, we ﬁnd where ed and eq are threshold parameters of our system. eTz= g, initialised with the previous frame pose Tg,k. An iterative solution, eTz g,k for z > 0 is obtained by minimi the energy of a linearised version of (16) around the previous timate eTz 1 g,k . Using the small angle assumption for an increme transform: eTz inc = h eRz etz i = 2 4 1 a g tx a 1 b ty g b 1 tz 3 5 , g a surface measurement with (center) and with- ﬁltering applied to the raw depth map. W(u) = null unpredicted/unmeasured points shown in white. rm is simply eTz g,k = eTz inc eTz 1 g,k . Writing the update vector, x = (b,g,a,tx,ty,tz)> 2 R6 (19) urrent global frame vertex estimates for all pixels eV g k(u) = eTz 1 g,k ˙Vk(u), we minimise the linearised g the incremental point transfer: u) = eRz eV g k(u)+ etz = G(u)x+ eV g k(u) , (20) trix G is formed with the skew-symmetric matrix quately constrained. the magnitude of incr small angle assumpt fails, the system is pl Relocalisation Ou re-localisation schem known sensor pose is user instructed to alig played. While runni validity checks are pa 4 EXPERIMENTS We have conducted a formance of our syst tem’s ability to keep extensively in our su 4.1 Metrically Co Our tracking and ma rithm for a given are Figure 7: Example of point-plane outliers as person steps into par- tially reconstructed scene (left). Outliers from compatibility checks (Equation 17) using a surface measurement with (center) and with- out (right) bilateral ﬁltering applied to the raw depth map. W(u) = null are light grey with unpredicted/unmeasured points shown in white. an updated transform is simply eTz g,k = eTz inc eTz 1 g,k . Writing the update eTz inc as a parameter vector, x = (b,g,a,tx,ty,tz)> 2 R6 (19) and updating the current global frame vertex estimates for all pixels {u|W(u) 6= null}, eV g k(u) = eTz 1 g,k ˙Vk(u), we minimise the linearised error function using the incremental point transfer: eTz g,k ˙Vk(u) = eRz eV g k(u)+ etz = G(u)x+ eV g k(u) , (20) can be does no the line free DO quality a check quately the mag small a fails, th Relo re-loca known user ins played. validity 4 EX We hav forman tem’s a extensi Figure 7: Example of point-plane outliers as person steps into par- tially reconstructed scene (left). Outliers from compatibility checks (Equation 17) using a surface measurement with (center) and with- out (right) bilateral ﬁltering applied to the raw depth map. W(u) = null are light grey with unpredicted/unmeasured points shown in white. an updated transform is simply eTz g,k = eTz inc eTz 1 g,k . Writing the update eTz inc as a parameter vector, x = (b,g,a,tx,ty,tz)> 2 R6 (19) and updating the current global frame vertex estimates for all pixels {u|W(u) 6= null}, eV g k(u) = eTz 1 g,k ˙Vk(u), we minimise the linearised error function using the incremental point transfer: eTz g,k ˙Vk(u) = eRz eV g k(u)+ etz = G(u)x+ eV g k(u) , (20) where the 3⇥6 matrix G is formed with the skew-symmetric matrix form of eV g k(u): G(u) = h eV g k(u) i ⇥ I3⇥3 . (21) An iteration is obtained by solving: sensor motion increases the assum of the point-plane error metric an can be broken. Also, if the curren does not provide point-plane pairs the linear system then an arbitrar free DOFs can be obtained. Both quality reconstruction and trackin a check on the null space of the no quately constrained. We also perfo the magnitude of incremental trans small angle assumption was not d fails, the system is placed into re-l Relocalisation Our current imp re-localisation scheme, whereby i known sensor pose is used to prov user instructed to align the incomi played. While running the pose o validity checks are passed tracking 4 EXPERIMENTS We have conducted a number of ex formance of our system. These an tem’s ability to keep track during v extensively in our submitted video 4.1 Metrically Consistent Re Our tracking and mapping system rithm for a given area of reconstr investigating its ability to form m trajectories containing local loop cl global joint-estimation. We are als system to scale gracefully with d resources. he 3⇥6 matrix G is formed with the skew-symmetric matrix eV g k(u): G(u) = h eV g k(u) i ⇥ I3⇥3 . (21) ation is obtained by solving: min x2R6 Â Wk(u)6=null kEk2 2 (22) E = ˆN g k 1(ˆu)> ⇣ G(u)x+ eV g k(u) ˆV g k 1(ˆu) ⌘ (23) mputing the derivative of the objective function with respect parameter vector x and setting to zero, a 6 ⇥ 6 symmetric ystem is generated for each vertex-normal element corre- nce: ⇣ ⌘ 4.1 M Our tra rithm f investig trajecto global system resourc To i perime ing a t then sp ⇡ 19 s our sys region tained i a preci easily e x y z and updating the current global frame vertex estimates for all pixels {u|W(u) 6= null}, eV g k(u) = eTz 1 g,k ˙Vk(u), we minimise the linearised error function using the incremental point transfer: eTz g,k ˙Vk(u) = eRz eV g k(u)+ etz = G(u)x+ eV g k(u) , (20) where the 3⇥6 matrix G is formed with the skew-symmetric matrix form of eV g k(u): G(u) = h eV g k(u) i ⇥ I3⇥3 . (21) An iteration is obtained by solving: min x2R6 Â Wk(u)6=null kEk2 2 (22) E = ˆN g k 1(ˆu)> ⇣ G(u)x+ eV g k(u) ˆV g k 1(ˆu) ⌘ (23) By computing the derivative of the objective function with respect to the parameter vector x and setting to zero, a 6 ⇥ 6 symmetric linear system is generated for each vertex-normal element corre- ICPの更新量 : Energy Function:
- 25. Sensor Pose Estimation (4/4) • Coarse-to-fine Estimation – 3Level pyramid – 各レベルにおいて、更新回数の上限を設定 • L3 : 4回 • L2 : 5回 • L1 : 10回 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 25
- 26. Result 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 26 https://www.youtube.com/watch?v=quGhaggn3cQ
- 27. 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 27 ここまでが KinectFusion ここからが DynamicFusion
- 28. DynamicFusion: Overview (1/3) • KinectFusionをDynamicなシーンに拡張 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 28 KinectFusion: Real-Time Dense Surface Mapping and Trac Richard A. Newcombe Imperial College London Shahram Izadi Microsoft Research Otmar Hilliges Microsoft Research David Molyneaux Microsoft Research Lancaster University D Micr Newc Andrew J. Davison Imperial College London Pushmeet Kohli Microsoft Research Jamie Shotton Microsoft Research Steve Hodges Microsoft Research Andrew Micros Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sen Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system). ABSTRACT We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by tracking the live depth frame relative to the global model using a coarse-to-ﬁne iterative closest point (ICP) algorithm, which uses all of the observed depth data available. We demonstrate the advan- tages of tracking against the growing full surface model compared with frame-to-frame tracking, obtaining tracking and mapping re- sults in constant time within room sized scenes with limited drift and high accuracy. We also show both qualitative and quantitative 1 INTRODUCTION Real-time infrastructure-free tracking of a hand simultaneously mapping the physical scene in h new possibilities for augmented and mixed reali In computer vision, research on structure fr and multi-view stereo (MVS) has produced ma sults, in particular accurate camera tracking and tions (e.g. [10]), and increasingly reconstruction (e.g. [24]). However, much of this work was not time applications. Research on simultaneous localisation and ma focused more on real-time markerless tracking construction based on the input of a single com monocular RGB camera. Such ‘monocular SLA MonoSLAM [8] and the more accurate Parallel
- 29. DynamicFusion: Overview (2/3) • 基本的はKinectFusion – シーン表現: Truncated Signed Distance Function (TSDF) – 入力データはDepth Map – Pose推定しながらDepth MapをFusion • 拡張ポイント – Dynamicなシーンの表現 : Canonical Space + Warp-Field – Canonical Space : 基準空間 – Warp-Field : Live Frameと基準空間の間の変形 (Deformation) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 29
- 30. DynamicFusion: Overview (3/3) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 30 (a) Initial Frame at t = 0s (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s ( (d) Canonical Model (e) Canonical model warped into its live frame (f Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconst scene (d,e). To achieve this, we estimate a volumetric warp (motion) ﬁeld that transforms the canonical model spa enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconst neously, the structure of the warp ﬁeld is constructed as a set of sparse 6D transformation nodes that are smoothl
- 31. Result 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 31 https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be
- 32. DynamicFusion Outline 1. Dense Non-rigid Warp Field – Dynamicなシーンの表現方法について 2. Dense Non-rigid Surface Fusion – どうやってLive FrameをFusionするのか 3. Estimating the Warp-Field State – Warp-Fieldの推定方法 4. Extending the Warp-Field – Warp-Field構造の逐次追加方法 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 32
- 33. DynamicFusion Outline 1. Dense Non-rigid Warp Field – Dynamicなシーンの表現方法について 2. Dense Non-rigid Surface Fusion – どうやってLive FrameをFusionするのか 3. Estimating the Warp-Field State – Warp-Fieldの推定方法 4. Extending the Warp-Field – Warp-Field構造の逐次追加方法 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 33
- 34. Dense Non-rigid Warp Field (1/5) • Live FrameとCanonical Model間の変形 • 単純に考えると – Volumeの各点に6D Transformationを設定 – 2563のVoxelとすると、推定する必要のあるパラメー タ数: 2563 x 6 = 16,777,216 x 6 = 1億 / frame – パラメータ数が爆発。実現不能。 • 疎な6D変形ノードと補間による表現を導入 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 34
- 35. Dense Non-rigid Warp Field (2/5) • 疎な6D変形ノード : Embedded Deformation Graph – Polygon MeshをGraph構造で表現 – Rigid-as-possibleな変形を実現 • 変形の補間 : Dual-Quaternion Blending – ある点xcの変形をk近傍ノードの重み付平均で表現 – アーチファクトの少ないSkinningを高速に実現 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 35
- 36. Dense Non-rigid Warp Field (3/5) • Embedded Deformation Graph – Graphics分野の手法 – Rigid-as-possibleな変形を実現 – Polygon MeshをGraphで表現。ある点が移動した場合に、接続 された点が連動して動く 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 36 R.W.Sumner,J.Schmid,andM.Pauly.Embeddeddeformation for shape manipulation. ACM Trans. Graph., 26(3):80, 2007. 2, 4, 6 mply orm fer- tion ount r to tion odes ape ica- (2) mely like nce uce 000; dis- ons User edit g0 g1 g2 g3 g4 ˜g0 ˜g1 ˜g2 ˜g3 ˜g4 Econ Econ +EregEcon +Ereg +Erot R2(g3 g2)+g2 +t2 R2(g1 g2)+g2 +t2 ˜g2 = g2 +t2 Figure 2: A simple deformation graph shows the effect of the three terms of the objective function. The quadrilaterals at each graph node illustrate the deformation induced by the corresponding afﬁne transformation. Without the rotational term, unnatural shearing
- 37. Dense Non-rigid Warp Field (4/5) • Dual-Quaternion Blending – Graphics分野の手法 – 高速でアーチファクトの少ないSkinningを実現 – できるだけ体積を保存するように補間 – 2つのQuaternionを使って、回転と平行移動を表現 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 37 ns as bases and deﬁne the dense volumet- n through interpolation. Due to its compu- y and high quality interpolation capability ternion blending DQB [11], to deﬁne our xc) ⌘ SE3(DQB(xc)) , (1) ed average over unit dual quaternion trans- mply DQB(xc) ⌘ P k2N (xc) wk(xc)ˆqkc k P k2N (xc) wk(xc)ˆqkck , Wt(vc)(v> c , 1)> and (n that scaling of space can function, since compres resented by neighbourin diverging directions. F out any rigid body trans the volume, e.g., due to troduce the explicit warp Tlw, and compose this (e) Canonical model warped into its live frame (f) Model Normals online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving mate a volumetric warp (motion) ﬁeld that transforms the canonical model space into the live frame, done, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simulta- eld is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through onical frame (c). The resulting per-frame warp ﬁeld estimate enables the progressively denoised and nsformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample ond of scene motion together with a coordinate frame showing the rigid body component of the scene node to model surface distance where increased distance is mapped to a lighter value. nd rotation results in signif- struction. For each canoni- transforms that point from -rigidly deformed frame of the warp function for each n must be efﬁciently opti- ensely sample the volume, E(3) ﬁeld at the resolution (TSDF) geometric repre- DF volume reconstruction 2563 voxels would require eters per frame, about 10 with each unit dual-quaternion ˆqkc 2 R8 . Here, N (x) are the k-nearest transformation nodes to the point x and wk : R3 7! R deﬁnes a weight that alters the radius of inﬂuence of each node and SE3(.) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-ﬁeld Wt at time t is deﬁned by the values of a set of n defor- mation nodes N t warp = {dgv, dgw, dgse3}t. Each of the i = 1..n nodes has a position in the canonical frame dgi v 2 R3 , its associated transformation Tic = dgi se3, and a ra- dial basis weight dgw that controls the extent of the trans- formation wi(xc) = exp kdgi v xck2 / 2(dgi w)2 . Each radius parameter dgi w is set to ensure the node’s in- ﬂuence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in de- : Dual-Quaternion L. Kavan, S. Collins, J. Zˇa ́ra, and C. O’Sullivan. Skinning with Dual Quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, I3D ’07, pages 39–46, New York, NY, USA, 2007. ACM. 3 Linear Blending Dual Quaternion Blending 図は以下より転用 http://rodolphe-vaillant.fr/?e=29
- 38. Dense Non-rigid Warp Field (4/5) • 疎な6D変換ノードのパラメータ： 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 38 3 7! R deﬁnes a weight that alters the radius of inﬂuence f each node and SE3(.) converts from quaternions back to n SE(3) transformation matrix. The state of the warp-ﬁeld Wt at time t is deﬁned by the values of a set of n defor- ation nodes Nt warp = {dgv, dgw, dgse3}t. Each of the = 1..n nodes has a position in the canonical frame dgi v 2 3 , its associated transformation Tic = dgi se3, and a ra- al basis weight dgw that controls the extent of the trans- ormation wi(xc) = exp kdgi v xck2 / 2(dgi w)2 . ach radius parameter dgi w is set to ensure the node’s in- uence overlaps with neighbouring nodes, dependent on e sampling sparsity of nodes, which we describe in de- il in section (3.4). Since the warp function deﬁnes a gid body transformation for all supported space, both posi- . Here, N (x) are oint x and wk : ius of inﬂuence ernions back to f the warp-ﬁeld set of n defor- }t. Each of the al frame dgi v 2 gi se3, and a ra- nt of the trans- 2(dgi w)2 . the node’s in- dependent on describe in de- ction deﬁnes a pace, both posi- is transformed, s in signif- ch canoni- point from ed frame of on for each ently opti- he volume, resolution tric repre- onstruction uld require , about 10 Fusion al- ransforma- ion of the nd to move sparse set with each unit dual-quaternion ˆqkc 2 R8 . Here, N (x) are the k-nearest transformation nodes to the point x and wk : R3 7! R deﬁnes a weight that alters the radius of inﬂuence of each node and SE3(.) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-ﬁeld Wt at time t is deﬁned by the values of a set of n defor- mation nodes Nt warp = {dgv, dgw, dgse3}t. Each of the i = 1..n nodes has a position in the canonical frame dgi v 2 R3 , its associated transformation Tic = dgi se3, and a ra- dial basis weight dgw that controls the extent of the trans- formation wi(xc) = exp kdgi v xck2 / 2(dgi w)2 . Each radius parameter dgi w is set to ensure the node’s in- ﬂuence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in de- tail in section (3.4). Since the warp function deﬁnes a rigid body transformation for all supported space, both posi- tion and any associated orientation of space is transformed, e.g., the vertex vc from a surface with orientation or nor- > d to a lighter value. ion ˆqkc 2 R8 . Here, N (x) are n nodes to the point x and wk : hat alters the radius of inﬂuence nverts from quaternions back to atrix. The state of the warp-ﬁeld the values of a set of n defor- gv, dgw, dgse3}t. Each of the n in the canonical frame dgi v 2 mation Tic = dgi se3, and a ra- controls the extent of the trans- kdgi v xck2 / 2(dgi w)2 . is set to ensure the node’s in- hbouring nodes, dependent on odes, which we describe in de- e the warp function deﬁnes a h unit dual-quaternion ˆqkc 2 R8 . Here, N (x) are rest transformation nodes to the point x and wk : deﬁnes a weight that alters the radius of inﬂuence ode and SE3(.) converts from quaternions back to transformation matrix. The state of the warp-ﬁeld me t is deﬁned by the values of a set of n defor- odes N t warp = {dgv, dgw, dgse3}t. Each of the nodes has a position in the canonical frame dgi v 2 ssociated transformation Tic = dgi se3, and a ra- weight dgw that controls the extent of the trans- n wi(xc) = exp kdgi v xck2 / 2(dgi w)2 . ius parameter dgi w is set to ensure the node’s in- verlaps with neighbouring nodes, dependent on ling sparsity of nodes, which we describe in de- : Canonical Frameにおける位置 : Canonical - Live Frame間Transform : ノードの影響範囲コントロール 6D transformation nodes that are smoothly interpolated through rame warp ﬁeld estimate enables the progressively denoised and -time (e). In (e) we also visualise motion trails for a sub-sample coordinate frame showing the rigid body component of the scene re increased distance is mapped to a lighter value. with each unit dual-quaternion ˆqkc 2 R8 . Here, N (x) are he k-nearest transformation nodes to the point x and wk : R3 7! R deﬁnes a weight that alters the radius of inﬂuence of each node and SE3(.) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-ﬁeld Wt at time t is deﬁned by the values of a set of n defor- mation nodes N t warp = {dgv, dgw, dgse3}t. Each of the i = 1..n nodes has a position in the canonical frame dgi v 2 R3 , its associated transformation Tic = dgi se3, and a ra- dial basis weight dgw that controls the extent of the trans- formation wi(xc) = exp kdgi v xck2 / 2(dgi w)2 . Each radius parameter dgi w is set to ensure the node’s in- ﬂuence overlaps with neighbouring nodes, dependent on he sampling sparsity of nodes, which we describe in de- DQBの重み
- 39. Dense Non-rigid Warp Field (5/5) • Warp-Field: 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 39 our complete warp-ﬁeld is then given as: Wt(xc) = TlwSE3(DQB(xc)). (2 3.2. Dense Non-Rigid Surface Fusion We now describe how, given the model-to-frame warp ﬁeld Wt, we update our canonical model geometry. Ou reconstruction into the canonical space S is represented by the sampled TSDF V : S 7! R2 within a voxel domain S ⇢ N3 . The sampled function holds for each voxel x 2 S corresponding to the sampled point xc, a tuple V(x) 7! > 点xcのWarpe warp-ﬁeld is then given as: Wt(xc) = TlwSE3(DQB(xc)). (2) Non-Rigid Surface Fusion describe how, given the model-to-frame warp we update our canonical model geometry. Our on into the canonical space S is represented by 2 ou al ta di pr 1 k fu K and outputs a real-time dense reconstruction of the moving at transforms the canonical model space into the live frame, used into a single rigid TSDF reconstruction (d,f). Simulta- ransformation nodes that are smoothly interpolated through e warp ﬁeld estimate enables the progressively denoised and (e). In (e) we also visualise motion trails for a sub-sample dinate frame showing the rigid body component of the scene creased distance is mapped to a lighter value. each unit dual-quaternion ˆqkc 2 R8 . Here, N (x) are -nearest transformation nodes to the point x and wk : ! R deﬁnes a weight that alters the radius of inﬂuence ch node and SE3(.) converts from quaternions back to E(3) transformation matrix. The state of the warp-ﬁeld at time t is deﬁned by the values of a set of n defor- on nodes Nt warp = {dgv, dgw, dgse3}t. Each of the 1..n nodes has a position in the canonical frame dgi v 2 its associated transformation Tic = dgi se3, and a ra- basis weight dgw that controls the extent of the trans- : Cameraの移動に伴う、シーン全体の変形 : Dual-QuaternionからSE3(ユークリッド空間の運動群) のへの変換
- 40. DynamicFusion Outline 1. Dense Non-rigid Warp Field – Dynamicなシーンの表現方法について 2. Dense Non-rigid Surface Fusion – どうやってLive FrameをFusionするのか 3. Estimating the Warp-Field State – Warp-Fieldの推定方法 4. Extending the Warp-Field – Warp-Field構造の逐次追加方法 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 40
- 41. Dense Non-Rigid Surface Fusion (1/3) • Live FrameのDepthからCanonical Modelを更新 – Warp-Fieldは得られているという前提 • 更新方法: – Warp-Fieldを使って、Canonical ModelをLive Frame に合わせて変形 – そのTSDFと、Live FrameのTSDFをFusion 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 41
- 42. Dense Non-Rigid Surface Fusion (2/3) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 42 warped center into the depth frame. This allows the TSDF for a point in the canonical frame to be updated by com- puting the projective TSDF in the deforming frame without having to resample a warped TSDF in the live frame. The projective signed distance at the warped canonical point is: psdf(xc) = h K 1 Dt (uc) ⇥ u> c , 1 ⇤> i z [xt]z , (3) where uc = ⇡ (Kxt) is the pixel into which the voxel cen- ter projects. We compute distance along the optical (z) axis of the camera frame using the z component denoted [.]z. K is the known 3 ⇥ 3 camera intrinsic matrix, and ⇡ per- forms perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion: V(x)t = ( [v0 (x), w0 (x)]> , if psdf(dc(x)) > ⌧ V(x)t 1, otherwise (4) where dc(.) transforms a discrete voxel point into the con- tinuous TSDF domain. The truncation distance ⌧ > 0 and the updated TSDF value is given by the weighted averaging We es dgse3 in current re that is mi E(Wt, V Our data cost Dat isation te tion ﬁelds tween tra The coup transform ularisatio model int enables a when giv sistent de space. We projective signed distance at the warped canonical point is: psdf(xc) = h K 1 Dt (uc) ⇥ u> c , 1 ⇤> i z [xt]z , (3) where uc = ⇡ (Kxt) is the pixel into which the voxel cen- ter projects. We compute distance along the optical (z) axis of the camera frame using the z component denoted [.]z. K is the known 3 ⇥ 3 camera intrinsic matrix, and ⇡ per- forms perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion: V(x)t = ( [v0 (x), w0 (x)]> , if psdf(dc(x)) > ⌧ V(x)t 1, otherwise (4) where dc(.) transforms a discrete voxel point into the con- tinuous TSDF domain. The truncation distance ⌧ > 0 and the updated TSDF value is given by the weighted averaging scheme [5], with the weight truncation introduced in [18]: v0 (x) = v(x)t 1w(x)t 1 + min(⇢, ⌧)w(x) w(x)t 1 + w(x) ⇢ = psdf(dc(x)) w0 (x) = min(w(x)t 1 + w(x), wmax) . (5) Unlike the static fusion scenario where the weight w(x) encodes the uncertainty of the depth value observed at the projected pixel in the depth frame, we also account for un- certainty associated with the warp function at xc. In the E(Wt, V, Dt, E) = Data(Wt Our data term consists of a cost Data(Wt, V, Dt) which isation term Reg(Wt, E) tha tion ﬁelds, and ensures as-rigid tween transformation nodes c The coupling of a data-term f transformations with a rigid-a ularisation is a form of the e model introduced in [25]. Th enables a trade-off between re when given high quality data, sistent deformation of non or space. We deﬁned these terms 3.3.1 Dense Non-Rigid ICP Our aim is to estimate all non eters Tic and Tlw that warp t live frame. We achieve this rigid alignment of the curren tracted from the canonical vol live frame’s depth map. Surface Prediction and Da zero level set of the TSDF V is and stored as a polygon mesh w having to resample a warped TSDF in the live frame. The projective signed distance at the warped canonical point is: psdf(xc) = h K 1 Dt (uc) ⇥ u> c , 1 ⇤> i z [xt]z , (3) where uc = ⇡ (Kxt) is the pixel into which the voxel cen- ter projects. We compute distance along the optical (z) axis of the camera frame using the z component denoted [.]z. K is the known 3 ⇥ 3 camera intrinsic matrix, and ⇡ per- forms perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion: V(x)t = ( [v0 (x), w0 (x)]> , if psdf(dc(x)) > ⌧ V(x)t 1, otherwise (4) where dc(.) transforms a discrete voxel point into the con- tinuous TSDF domain. The truncation distance ⌧ > 0 and the updated TSDF value is given by the weighted averaging scheme [5], with the weight truncation introduced in [18]: v0 (x) = v(x)t 1w(x)t 1 + min(⇢, ⌧)w(x) w(x)t 1 + w(x) ⇢ = psdf(dc(x)) w0 (x) = min(w(x)t 1 + w(x), wmax) . (5) Unlike the static fusion scenario where the weight w(x) encodes the uncertainty of the depth value observed at the that is minimised by o E(Wt, V, Dt, E) = D Our data term consi cost Data(Wt, V, D isation term Reg(W tion ﬁelds, and ensure tween transformation The coupling of a da transformations with ularisation is a form model introduced in enables a trade-off be when given high qua sistent deformation o space. We deﬁned the 3.3.1 Dense Non-R Our aim is to estimat eters Tic and Tlw tha live frame. We achi rigid alignment of th tracted from the cano live frame’s depth ma Surface Predictio Live FrameのProjective Signed Distance Function: TSDFのFusion: Truncate:
- 43. Dense Non-Rigid Surface Fusion (3/3) 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 43 Non-rigid scene deformation (a) Live frame t = 0 (b) Live Frame t = 1 (c) Canonical 7! Live (d) Live fram Figure 3: An illustration of how each point in the canonical frame maps, through a when observing a deforming scene. In (a) the ﬁrst view of a dynamic scene is observ initialized to the identity transform and the three rays shown in the live frame also ma deforms in the live frame (b), the warp function transforms each point from the can
- 44. DynamicFusion Outline 1. Dense Non-rigid Warp Field – Dynamicなシーンの表現方法について 2. Dense Non-rigid Surface Fusion – どうやってLive FrameをFusionするのか 3. Estimating the Warp-Field State – Warp-Fieldの推定方法 4. Extending the Warp-Field – Warp-Field構造の逐次追加方法 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 44
- 45. Estimating the Warp-Field State (1/5) • Depth map Dt からdgse3を推定 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 45 DF m- ut he s: 3) n- xis ]z. er- ate he 4) We estimate the current values of the transformations dgse3 in Wt given a newly observed depth map Dt and the current reconstruction V by constructing an energy function that is minimised by our desired parameters: E(Wt, V, Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6) Our data term consists of a dense model-to-frame ICP cost Data(Wt, V, Dt) which is coupled with a regular- isation term Reg(Wt, E) that penalises non-smooth mo- tion ﬁelds, and ensures as-rigid-as-possible deformation be- tween transformation nodes connected by the edge set E. The coupling of a data-term formed from linearly blended transformations with a rigid-as-possible graph based reg- ularisation is a form of the embedded deformation graph model introduced in [25]. The regularisation parameter enables a trade-off between relaxing rigidity over the ﬁeld when given high quality data, and ensuring a smooth con- sistent deformation of non or noisily observed regions of これを最小化する各パラメータを推定する Energy データ項 ICP cost 正規化項 滑らかでない動き に対するペナルティ Wt : Warp Field : 現在のSurface (Polygon Mesh) Dt : Live frame Depth map ε : Edge set (ノードの接続) ace into the live (non-rigid) frame (see Figure 3). hnique greatly simpliﬁes the non-rigid reconstruc- cess over methods where all frames are explicitly nto a canonical frame. Furthermore, given a cor- p ﬁeld, then, since all TSDF updates are computed stances in the camera frame, the non-rigid projec- DF fusion approach maintains the optimality guar- or surface reconstruction from noisy observations y proved for the static reconstruction case in [6]. imating the Warp-ﬁeld State Wt stimate the current values of the transformations Wt given a newly observed depth map Dt and the econstruction V by constructing an energy function inimised by our desired parameters: , Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6) a term consists of a dense model-to-frame ICP ta(Wt, V, Dt) which is coupled with a regular-
- 46. Estimating the Warp-Field State (2/5) • 現在のSurfaceを抽出 – TSDFからZero Level setをMarching cubeで抽出 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 46 https://ja.wikipedia.org/wiki/マーチングキューブ法
- 47. Estimating the Warp-Field State (3/5) • ICP Cost – Canonical Modelを現在のTでレンダリング – Live FrameのDepth mapとのエラーをPoint-Planeで計算 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 47 transform close, modulo observation noise, to the live sur- face vl : ⌦ 7! R3 , formed by back projection of the depth image [vl(u)> , 1]> = K 1 Dt(u)[u> , 1]> . This can be quantiﬁed by a per pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function data, summed over the predicted image domain ⌦: Data(W, V, Dt) ⌘ X u2⌦ data ˆn> u (ˆvu vl˜u) .(7) augment. The transformed model vertex v(u) is simply ˜Tu = W(v(u)), producing the current canonical to live frame point-normal predictions ˆvu = ˜Tu v(u) and ˆnu = ˜Tu n(u), and data-association of that model point-normal is made with a live frame point-normal through perspective projection into the pixel ˜u = ⇡(Kˆvu). We note that, ignoring the negligible cost of rendering the geometry ˆVw, the ability to extract, predict, and perform projective data association with the currently visible canoni- cal geometry leads to a data-term evaluation that has a com- into v which assoc insufﬁ logue in opt How ometr dynam make it def We betwe betwe tion t contin larisa Reg( Point-Plane Error face vl : ⌦ 7! R3 , formed by back projection of the depth image [vl(u)> , 1]> = K 1 Dt(u)[u> , 1]> . This can be quantiﬁed by a per pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function data, summed over the predicted image domain ⌦: Data(W, V, Dt) ⌘ X u2⌦ data ˆn> u (ˆvu vl˜u) .(7) augment. The transformed model vertex v(u) is simply ˜Tu = W(v(u)), producing the current canonical to live frame point-normal predictions ˆvu = ˜Tu v(u) and ˆnu = ˜Tu n(u), and data-association of that model point-normal is made with a live frame point-normal through perspective projection into the pixel ˜u = ⇡(Kˆvu). We note that, ignoring the negligible cost of rendering the geometry ˆVw, the ability to extract, predict, and perform projective data association with the currently visible canoni- cal geometry leads to a data-term evaluation that has a com- putational complexity with an upper bound in the number which n associat insufﬁci logue to in optim How sh ometry? dynamic make us it deform We u between between tion term continui larisatio Reg(W where E ed by back projection of the depth K 1 Dt(u)[u> , 1]> . This can be dense model-to-frame point-plane te under the robust Tukey penalty d over the predicted image domain X u2⌦ data ˆn> u (ˆvu vl˜u) .(7) med model vertex v(u) is simply ucing the current canonical to live dictions ˆvu = ˜Tu v(u) and ˆnu = ciation of that model point-normal e point-normal through perspective ˜u = ⇡(Kˆvu). ng the negligible cost of rendering lity to extract, predict, and perform on with the currently visible canoni- ata-term evaluation that has a com- associated data term. In any case, insufﬁcient geometric texture in t logue to the aperture problem in o in optimisation of the transform pa How should we constrain the mot ometry? Whilst the fully correct m dynamics and, where applicable, t make use of a simpler model of un it deforms in a piece-wise smooth We use a deformation graph bas between transformation nodes, wh between nodes i and j adds a rigi tion term to the total error being m continuity preserving Huber penal larisation term sums over all pair-w Reg(W, E) ⌘ nX i=0 X j2E(i) ↵ij reg ometry should to the live sur- on of the depth . This can be me point-plane Tukey penalty image domain ˆvu vl˜u) .(7) v(u) is simply nonical to live (u) and ˆnu = l point-normal ugh perspective st of rendering ct, and perform into view. However, nodes affecting canonical space within which no currently observed surface resides will have no associated data term. In any case, noise, missing data and insufﬁcient geometric texture in the live frame – an ana- logue to the aperture problem in optical-ﬂow – will result in optimisation of the transform parameters being ill-posed. How should we constrain the motion of non-observed ge- ometry? Whilst the fully correct motion depends on object dynamics and, where applicable, the subject’s volition, we make use of a simpler model of unobserved geometry: that it deforms in a piece-wise smooth way. We use a deformation graph based regularization deﬁned between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisa- tion term to the total error being minimized, under the dis- continuity preserving Huber penalty reg. The total regu- larisation term sums over all pair-wise connected nodes: Reg(W, E) ⌘ nX X ↵ij reg Ticdgj Tjcdgj , ( transform close, modulo observation noise, to the live sur- face vl : ⌦ 7! R3 , formed by back projection of the depth image [vl(u)> , 1]> = K 1 Dt(u)[u> , 1]> . This can be quantiﬁed by a per pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function data, summed over the predicted image domain ⌦: Data(W, V, Dt) ⌘ X u2⌦ data ˆn> u (ˆvu vl˜u) .(7) augment. The transformed model vertex v(u) is simply ˜Tu = W(v(u)), producing the current canonical to live frame point-normal predictions ˆvu = ˜Tu v(u) and ˆnu = ˜Tu n(u), and data-association of that model point-normal is made with a live frame point-normal through perspective projection into the pixel ˜u = ⇡(Kˆvu). We note that, ignoring the negligible cost of rendering the geometry ˆVw, the ability to extract, predict, and perform projective data association with the currently visible canoni- Ω : ピクセルドメイン ine. This results in a prediction of eometry that is currently predicted frame: P(ˆVc). We store this pre- ages {v, n} : ⌦ 7! P(ˆVc), where of the predicted images, storing the me vertices and normals. ormation parameters for the current cted-to-be-visible geometry should o observation noise, to the live sur- med by back projection of the depth K 1 Dt(u)[u> , 1]> . This can be l dense model-to-frame point-plane ute under the robust Tukey penalty d over the predicted image domain ⌘ X u2⌦ data ˆn> u (ˆvu vl˜u) .(7) med model vertex v(u) is simply point-plane data-term evaluation 3.3.2 Warp-ﬁeld Regularizat It is crucial for our non-rigid TS timate a deformation not only o but over all space within S. T of new regions of the scene surf into view. However, nodes affec which no currently observed su associated data term. In any ca insufﬁcient geometric texture in logue to the aperture problem i in optimisation of the transform How should we constrain the m ometry? Whilst the fully correc dynamics and, where applicable make use of a simpler model of it deforms in a piece-wise smoo : robust Tukey Penalty function
- 48. Estimating the Warp-Field State (4/5) • Live Frameから見えていない部分について • Deformation Graphで求める: – ノードの接続情報をもとに、全体でRigid-as-possible になるように正規化 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 48 imply o live ˆnu = ormal ective dering rform anoni- com- umber data- trans- make use of a simpler model of unobserved geometry: that it deforms in a piece-wise smooth way. We use a deformation graph based regularization deﬁned between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisa- tion term to the total error being minimized, under the dis- continuity preserving Huber penalty reg. The total regu- larisation term sums over all pair-wise connected nodes: Reg(W, E) ⌘ nX i=0 X j2E(i) ↵ij reg Ticdgj v Tjcdgj v , (8) where E deﬁnes the regularisation graph topology, and ↵ij deﬁnes the weight associated with the edge, which we set to ↵ij = max(dgi w, dgj w). ε(i) : i番目のノードの接続エッジ nodes affecting canonical space within observed surface resides will have no In any case, noise, missing data and c texture in the live frame – an ana- problem in optical-ﬂow – will result transform parameters being ill-posed. train the motion of non-observed ge- ully correct motion depends on object applicable, the subject’s volition, we r model of unobserved geometry: that wise smooth way. ion graph based regularization deﬁned on nodes, where an edge in the graph j adds a rigid-as-possible regularisa- error being minimized, under the dis- Huber penalty reg. The total regu- over all pair-wise connected nodes: X 2E(i) ↵ij reg Ticdgj v Tjcdgj v , (8) egularisation graph topology, and ↵ij sociated with the edge, which we set dgj w). : Huber penalty ometry? Whilst the fully correct motion depends on object dynamics and, where applicable, the subject’s volition, we make use of a simpler model of unobserved geometry: that it deforms in a piece-wise smooth way. We use a deformation graph based regularization deﬁned between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisa- tion term to the total error being minimized, under the dis- continuity preserving Huber penalty reg. The total regu- larisation term sums over all pair-wise connected nodes: Reg(W, E) ⌘ nX i=0 X j2E(i) ↵ij reg Ticdgj v Tjcdgj v , (8) where E deﬁnes the regularisation graph topology, and ↵ij deﬁnes the weight associated with the edge, which we set to ↵ij = max(dgi w, dgj w).
- 49. Estimating the Warp-Field State (5/5) • Hierarchical Deformation Tree: – Deformation Graphをピラミッド化 – Coarse-to-Fineな最適化 • Embedded Deformation Graph (Original) – Regularizationの影響範囲をk-nearest neighborで定義 • それに比べて – 計算コスト・安定性の面でメリット 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 49
- 50. Efficient Optimization • まず、KinectFusionと同じ手順で、Camera Pose (Tlw)を推定 • 次に、Eの最小化を行うことで、各Deformation NodeのWarp-Fieldを求める。 • Gauss-Newton法で解く – Gauss-Newton法: 二乗和の最小化に特化したNewton 法 – ヘッセ行列 (2階微分) をヤコビ行列 (1階微分) で近似 – コレスキー分解を使って効率的に解く 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 50 approach origi- -rigidly deform- t, we transform warp into the live arry through the y projecting the llows the TSDF pdated by com- g frame without live frame. The nonical point is: [xt]z , (3) h the voxel cen- e optical (z) axis ent denoted [.]z. trix, and ⇡ per- el x, we update observed in the x)) > ⌧ using distances in the camera frame, the non-rigid projec- tive TSDF fusion approach maintains the optimality guar- antees for surface reconstruction from noisy observations originally proved for the static reconstruction case in [6]. 3.3. Estimating the Warp-ﬁeld State Wt We estimate the current values of the transformations dgse3 in Wt given a newly observed depth map Dt and the current reconstruction V by constructing an energy function that is minimised by our desired parameters: E(Wt, V, Dt, E) = Data(Wt, V, Dt) + Reg(Wt, E) . (6) Our data term consists of a dense model-to-frame ICP cost Data(Wt, V, Dt) which is coupled with a regular- isation term Reg(Wt, E) that penalises non-smooth mo- tion ﬁelds, and ensures as-rigid-as-possible deformation be- tween transformation nodes connected by the edge set E. The coupling of a data-term formed from linearly blended transformations with a rigid-as-possible graph based reg- ularisation is a form of the embedded deformation graph model introduced in [25]. The regularisation parameter enables a trade-off between relaxing rigidity over the ﬁeld Energy関数 :
- 51. DynamicFusion Outline 1. Dense Non-rigid Warp Field – Dynamicなシーンの表現方法について 2. Dense Non-rigid Surface Fusion – どうやってLive FrameをFusionするのか 3. Estimating the Warp-Field State – Warp-Fieldの推定方法 4. Extending the Warp-Field – Warp-Field構造の逐次追加方法 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 51
- 52. Extending the Warp-Field • Inserting New Deformation Nodes: – Live FrameのSurface Fusionが終わったら、Warp Nodeの調整を 行う – Polygon Meshを捜査して、現在のNodeの影響範囲にいない頂点 が見つかったら、Nodeを追加する – 追加されたNodeの初期Warpパラメータは、DQBで計算 • Updating the Regularization Graph: – Hierarchical Deformation Treeの構築 – Level 0 = Warp Node – next level: • 各Regularization Nodeを捜査して、Nodeを間引き • WarpはDQBで再計算 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 52
- 53. Result 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 53 https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be
- 54. Limitation • Quickly move from closed to open topology – Canonical Modelが閉じたTopologyである場合に、そ こから開くと失敗しやすい – Regularization GraphのEdgeが分離できない – 逆に、開いた状態から閉じるは表現できる • とはいえ、GraphのEdgeは繋がらない • 高速な移動 • Live Frameから見えていない範囲の移動 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 54
- 55. Conclusion • DynamicFusion – DynamicなシーンでSLAM – リアルタイムに3D Reconstruction • Contribution – Non-rigid volumetric TSDF fusion – Estimate 6D warp-field in Realtime 第30回（後編） コンピュータビジョン勉強会@関東 CVPR2015読み会 55

No public clipboards found for this slide

Be the first to comment