
Depth estimation: do we need to throw old things away?

Outline of the talk: CNNs for depth estimation based on the human visual system, and CNNs inspired by conventional methods
Case1: Cross-channel stereo matching
Case2: Depth from light field
Case3: Multiview stereo
Conclusion


  1. 1. Depth estimation: Do we need to throw old things away? Hae-Gon Jeon (전해곤) 1 Assistant Professor
  2. 2. My Research Timeline. 2011: MS course; 2013~2018: Ph.D. course; 2018~Present: Post-doc. Research threads: coded exposure imaging for motion deblurring [ICCV'13, ICCV'15, IJCV'17, TIP'17]; light-field imaging [ECCV'14, CVPR'15, ICCVW'15, PAMI'17, SPL'17, TPAMI'19, CVPR'18]; depth + denoising in low light [CVPR'16, CVPR'18, TIP'19]; depth from small motion [ICIP'15, ICCV'15, CVPR'16, ECCV'16, SPL'17, CVPR'17, TPAMI'19]; visual and AI systems for rescue robotics (highly accurate 3D maps, optimized path generation from real map information) [ICRA'19, IROS'19 (1/2), submitted to IROS'19 (1/2); IEEE TPAMI (major revision); IEEE TIP (major revision)]. 2
  3. 3. 3 Traditional Stereo Matching
  4. 4. 4 Traditional Stereo Matching
  5. 5. 5 Traditional Stereo Matching
  6. 6. 6 Traditional Stereo Matching
  7. 7. 7 DispNet (FlowNet) Encoder Decoder
  8. 8. 8 Image: https://myweb.rollins.edu/jsiry/Visual_Cortex.html Human Visual System
  9. 9. 9 Human Visual System and DispNet N, Mayer et al., A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation, CVPR16
  10. 10. 10 Human Visual System: the retina is a multilayered membrane containing millions of light-sensitive cells that detect the image and translate it into a series of electrical signals. The optic nerves from both eyes join at the optic chiasma, where the information from the two retinas is correlated. Humans constantly scan objects in their field of view, usually resulting in a perceived image that is uniformly sharp.
  11. 11. 11 DispNet: an end-to-end disparity estimation network (no separate optimization needed). Convolution layers: identical processing streams for the two images; this architecture constrains the network to first produce meaningful representations of each image separately. Correlation layer: multiplicative patch comparisons between the two feature maps, with no trainable weights. Upconvolutional layers: high-level information passed from coarser feature maps, combined with fine local information provided by lower-layer feature maps.
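The correlation layer is the only part of DispNet with no trainable weights, so it can be written down directly. Below is a minimal NumPy sketch of the idea — a channel-wise dot product between the left feature map and horizontally shifted versions of the right feature map; the feature shapes and the displacement range are illustrative assumptions, not DispNet's exact configuration.

```python
import numpy as np

def correlation_layer(feat_left, feat_right, max_disp=4):
    """Multiplicative comparison of two feature maps (C, H, W) over
    horizontal displacements 0..max_disp; no trainable weights."""
    C, H, W = feat_left.shape
    corr = np.zeros((max_disp + 1, H, W), dtype=feat_left.dtype)
    for d in range(max_disp + 1):
        # shift the right feature map by d pixels, then take the channel-wise dot product
        shifted = np.zeros_like(feat_right)
        shifted[:, :, d:] = feat_right[:, :, :W - d]
        corr[d] = (feat_left * shifted).sum(axis=0) / C
    return corr  # (max_disp + 1, H, W) correlation volume

# usage: corr = correlation_layer(np.random.rand(64, 32, 32), np.random.rand(64, 32, 32))
```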
  12. 12. 12 Human Visual System and DispNet: both first encode the two images, then estimate the correlation between them, and finally extract high-level information.
  13. 13. 13 Is DispNet the best?? KITTI stereo evaluation 2012 (as of 2018/04/23), comparing DispNet (CVPR 16) and PSMNet (CVPR 18) among others:
      Rank  Method               Out-Noc  Runtime
      1     PSMNet, CVPR18       1.49%    0.41s
      2     iResNet-i2, CVPR18   1.71%    0.12s
      15    MC-CNN-acrt, JMLR16  2.43%    67s
      27    Content_CNN, CVPR16  3.07%    0.7s
      28    Deep Embed, ICCV15   3.10%    3s
      46    DispNetC, CVPR16     4.11%    0.06s
  14. 14. 18 Asymmetric stereo Light-field camera Monocular camera [CVPR’16, Silver Prize of Samsung Humantech Paper Award, Submitted to IEEE TIP] [ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, Robustness champion of CVPR’17 workshop] [ICCV’15, CVPR’16, ECCV’16, CVPR’18, ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, IEEE TPAMI under minor revision] Today’s Talk
  15. 15. Stereo Matching with Color and Monochrome cameras Publications • Stereo Matching with Color and Monochrome Cameras in Low-light Conditions Hae-Gon Jeon, Joon-Young Lee, Sunghoon Im, Hyowon Ha and In So Kweon IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016 • HumanTech Paper Award 2016, Silver Prize • CMSNet: Deep Color and Monochrome Stereo Hae-Gon Jeon, Sunghoon Im, Joon-Young Lee and Martial Hebert Submitted to IEEE Transactions on Image Processing 19
  16. 16. Low-light imaging 1: burst photography [Ziwei Liu et al., SIGGRAPH Asia 14; Sam Hasinoff et al., SIGGRAPH Asia 16]. Results from Google Camera HDR+ (Sam Hasinoff et al., SIGGRAPH Asia 16); a burst shot with a short exposure suffers from under-exposure. Courtesy of S. Im, H.-G. Jeon and I. S. Kweon [Submitted to CVPR 18]. 20
  17. 17. Low-light imaging 2: multi-spectral fusion. Multi-spectral video fusion [IEEE TIP 07]: twin IR/visible cameras, temporal smoothing, and a cross-bilateral filter fuse the noisy visible image and the IR image into the output. 21
  18. 18. Issues with depth from an RGB-W image pair: 1) different spectral sensitivities, 2) severe noise. (Figure: quantum efficiency (%) versus wavelength (300–1000 nm) for the color camera's R/G/B channels and the monochrome camera's gray channel.) Color camera (RGB): color information, but reduced sharpness and vulnerability to noise. Monochrome camera (W): no color information, but a sharp image and robustness to noise. (Panels: RGB2Gray vs. monochrome vs. the proposed RGB-W stereo.) 22
  19. 19. Commercial product 23
  20. 20. Solution — overview of the proposed method: (1) input pair (monochrome and color), (2) gain map, (3) decolorization, (4) disparity map by iterative gain adjustment, (5) refined map by tree-based filtering, (6) high-quality color image. 24
  21. 21. Tractable solution: decolorization and gain compensation (a linear, global gain compensation is impossible due to the different spectral sensitivities). The decolorized image is $I_\gamma = \omega_r I_r + \omega_g I_g + \omega_b I_b$ with $\omega_r + \omega_g + \omega_b = 1$, $\omega_{r,g,b} \geq 0$, $\omega_{r,g,b} \in \{0.1, 0.2, \cdots, 1.0\}$, where $I_{r,g,b}$ are the three color channels and $\omega_{r,g,b}$ the per-channel weights. The decolorization cost combines: 1) contrast preservation, $E_c(\gamma) = \| G(I, I) - G(I, \bar{I}_\gamma) \|_1$, where $G$ is the guided output image; and 2) noise suppression, $E_n(\gamma) = \frac{\|\nabla_u I_\gamma\|_1 + \|\nabla_v I_\gamma\|_1}{\|\nabla_u I_\gamma\|_2 + \|\nabla_v I_\gamma\|_2}$, where $\nabla_{u,v}$ are the image gradients in the horizontal ($u$) and vertical ($v$) directions. (Comparison: RGB2Gray vs. contrast-only vs. proposed.) 25
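To make the discrete weight search concrete, here is a hedged NumPy sketch: it enumerates candidate channel weights summing to one and scores each decolorized image with the noise-suppression ratio $E_n$ above. The contrast-preservation term (which needs a guided filter) is omitted, and all function names are illustrative, so this is a structural sketch rather than the paper's implementation.

```python
import itertools
import numpy as np

def grads(img):
    gu = np.diff(img, axis=1, append=img[:, -1:])   # horizontal gradient
    gv = np.diff(img, axis=0, append=img[-1:, :])   # vertical gradient
    return gu, gv

def noise_term(gray):
    # E_n: ratio of L1 to L2 gradient norms (small for noise-suppressed images)
    gu, gv = grads(gray)
    return (np.abs(gu).sum() + np.abs(gv).sum()) / \
           (np.linalg.norm(gu) + np.linalg.norm(gv) + 1e-8)

def decolorize(rgb, step=0.1):
    """Search channel weights (w_r, w_g, w_b) summing to 1, in steps of 0.1,
    that minimize the decolorization cost (contrast term omitted here)."""
    best, best_cost = None, np.inf
    weights = np.arange(0.0, 1.0 + 1e-9, step)
    for wr, wg in itertools.product(weights, weights):
        wb = 1.0 - wr - wg
        if wb < -1e-9:
            continue
        gray = wr * rgb[..., 0] + wg * rgb[..., 1] + wb * rgb[..., 2]
        cost = noise_term(gray)
        if cost < best_cost:
            best, best_cost = gray, cost
    return best
```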
  22. 22. RGB-W stereo matching. The matching cost blends brightness consistency and edge similarity: $\mathcal{V}(x, l) = \alpha \mathcal{V}_{SAD}(x, l) + (1 - \alpha) \mathcal{V}_{SIE}(x, l)$, where $\mathcal{V}_{SAD}(x, l) = \sum_{x \in \Omega_x} \min(|I_L - I_R^{\gamma}(x + d)|, \tau_1)$ (sum of absolute differences, robust to image noise) and $\mathcal{V}_{SIE}(x, l) = \sum_{x \in \Omega_x} \min(|J(I_L) - J(I_R^{\gamma}(x + d))|, \tau_2)$ (sum of informative edges, robust to non-linear intensity variation), with $J(I) = \frac{|\sum_{x \in \Omega_x} \nabla I(x)|}{\sum_{x \in \Omega_x} |\nabla I(x)| + 0.5}$. Here $\Omega_x$ is a supporting window centered at pixel $x$, $d$ is the disparity, and $\tau_{1,2}$ are truncation values. The sum of signed gradients cancels out image noise while the sum of absolute gradients measures how strong the edges are, so $J(I)$ is an informative edge map compared with the conventional gradient map $\nabla I$. 26
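A hedged NumPy/SciPy sketch of this cost follows: box filtering stands in for the supporting window, np.roll stands in for the disparity shift, and the 0.5 offset and truncation values are assumptions carried over from the reconstructed formula, so treat it as an illustration of the SAD/SIE blend rather than the released code.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def informative_edges(img, win=5):
    """J(I): signed gradients cancel noise inside a window, absolute gradients
    measure edge strength, so the ratio highlights consistent (informative) edges."""
    g = np.gradient(img.astype(np.float64), axis=1)
    signed = uniform_filter(g, size=win)
    absolute = uniform_filter(np.abs(g), size=win)
    return np.abs(signed) / (absolute + 0.5)

def matching_cost(left, right_gamma, d, alpha=0.5, tau1=0.1, tau2=0.1, win=5):
    """Blend of truncated SAD (brightness consistency) and truncated SIE (edge
    similarity) for one disparity d; right_gamma is the gain-compensated,
    decolorized right image."""
    shifted = np.roll(right_gamma, d, axis=1)
    sad = uniform_filter(np.minimum(np.abs(left - shifted), tau1), size=win)
    jl, jr = informative_edges(left, win), informative_edges(shifted, win)
    sie = uniform_filter(np.minimum(np.abs(jl - jr), tau2), size=win)
    return alpha * sad + (1 - alpha) * sie
```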
  23. 23. Experiment setup 27
  24. 24. Quantitative evaluation (bad pixel rates against structured-light ground truth, under dark and bright illumination). Baselines: ANCC (Heo et al., Robust stereo matching using adaptive normalized cross-correlation, IEEE PAMI 2011), DASC (Kim et al., DASC: Dense adaptive self-correlation descriptor for multimodal and multi-spectral correspondence, CVPR 2015), JDMCC (Heo et al., Joint depth map and color consistency estimation for stereo images with different illuminations and cameras, IEEE PAMI 2013), and CCNG (Holloway et al., Generalized assorted camera arrays: Robust cross-channel registration and applications, IEEE TIP 2015). The proposed method gives the lowest bad pixel rate in both conditions (11.55% and 19.24%, versus 21.45–41.71% for the baselines). 28
  25. 25. Quantitative evaluation on a second scene (bad pixel rates under dark and bright illumination, against structured-light ground truth): ANCC 40.13% / 44.83%, JDMCC 17.07% / 24.23%, Proposed 11.79% / 14.48%, CCNG 19.20% / 24.45%, DASC 19.52% / 41.71%; the proposed method is again the most accurate. 29
  26. 26. Evaluations (numbers in parentheses are bad pixel rates). Dataset Reindeer: ANCC (40.10%), DASC (39.85%), JDMCC (32.86%), CCNG (31.80%), Proposed (8.89%). Dataset Moebius: ANCC (26.58%), DASC (26.91%), JDMCC (18.54%), CCNG (18.44%), Proposed (15.14%). 30
  27. 27. High-quality color image recovery — colorization method: the Y channel of the color image guides U and V channel mapping over SLIC super-pixels, and the colorization result is then enhanced to produce the final high-quality color image. 31
  28. 28. Colorized and enhanced image Input color image High-quality color image recovery 32
  29. 29. 9 Applications 33
  30. 30. 34 Problem: the pipeline ((1) input pair, (2) gain map, (3) decolorization, (4) disparity map by iterative gain adjustment, (5) refined map by tree-based filtering, (6) high-quality color image) depends on many hand-tuned parameters: 1. gain threshold, 2. matching window size, 3. balance value, 4. number of iterations, 5. smoothness parameter, 6. number of super-pixels, 7. color similarity.
  31. 31. 35 Deep Material Stereo [CVPR’18]
  32. 32. 36 CNN version of RGB-W Stereo Image recovery Depth estimation Encoder Consistency
  33. 33. 37 CNN version of RGB-W stereo, overall pipeline: the left monochrome (W) view and the right color (C) view are denoised (denoised mono and denoised chrominance), disparity and occlusion are estimated, an initial colorization is produced, and occlusion-aware colorization yields the final color image.
  34. 34. 38 Effect of occlusion handling in the colorization: the final color image with occlusion reasoning versus without.
  35. 35. 39 Denoising sub-network: stacked Conv 3x3x64 (stride 1), BatchNorm, and ReLU layers predict the residual — i.e., the noise — which is then subtracted from the noisy input (left mono / right color) to give the denoised mono and chrominance; residual learning is a good fit for a Gaussian noise model.
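As a hedged PyTorch sketch of this residual ("subtract") denoiser — a DnCNN-style stack whose depth and channel counts are assumptions, since the slide only names the Conv3x3x64 / BatchNorm / ReLU / Subtract ingredients:

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Predicts the noise residual and subtracts it from the noisy input."""
    def __init__(self, in_ch=1, feat=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(in_ch, feat, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(feat, feat, 3, stride=1, padding=1),
                       nn.BatchNorm2d(feat), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(feat, in_ch, 3, stride=1, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        residual = self.body(noisy)   # estimated noise
        return noisy - residual       # the "Subtract" step from the slide

# usage: clean = ResidualDenoiser()(torch.randn(1, 1, 64, 64))
```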
  36.–40. (Slides 40–44: result figures only.)
  41. 41. 45 RGB NIR Estimated disparity
  42. 42. 46 Asymmetric stereo Light-field camera Monocular camera [CVPR’16, Silver Prize of Samsung Humantech Paper Award, Submitted to IEEE TIP] [ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, Robustness champion of CVPR’17 workshop] [ICCV’15, CVPR’16, ECCV’16, CVPR’18, ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, IEEE TPAMI under minor revision] Today’s Talk
  43. 43. Depth from Single Light Field Images — Publications • Accurate Depth Map Estimation from a Lenslet Light Field Camera, Hae-Gon Jeon, Jaesik Park, Gyeongmin Choe, Jinsun Park, Yunsu Bok, Yu-Wing Tai and In So Kweon, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2015 • Depth from a Light Field Image with Learning-based Matching Costs, Hae-Gon Jeon, Jaesik Park, Gyeongmin Choe, Jinsun Park, Yunsu Bok, Yu-Wing Tai and In So Kweon, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Feb 2019 • Depth Estimation Challenge: Robustness Champion, CVPR workshop on Light Field for Computer Vision • EPINET: A Fully-Convolutional Neural Network using Epipolar Geometry for Depth from Light Field Images, Changha Shin, Hae-Gon Jeon, Youngjin Yoon, In So Kweon and Seon Joo Kim, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018 47
  44. 44. Epipolar plane image (EPI): depth is estimated from the slopes of lines in the (synthetic) EPI of a light-field image [Wanner and Goldluecke PAMI 13, Tao et al. ICCV 13, Tao et al. CVPR 15, Wang et al. CVPR 16, Williem et al. CVPR 16, Heber et al. CVPR 17]. 48
  45. 45. Commercial light-field camera: the main lens focuses rays from the object onto a microlens array in front of the sensor, so the angular information of the rays is captured on a single sensor (each microlens blocks direct penetration of light). Sensor size 3280x3280; sub-aperture image 328x328. Problem 1: reduced spatial resolution. Problem 2: increased photon noise. 49
  46. 46. Real-world EPI: epipolar plane images from a plenoptic camera are corrupted by noise and aliasing, and show vertical luminance changes caused by the circular micro-lenses. 50
  47. 47. Multiview stereo-based approach [CVPR'15]: flipping adjacent sub-aperture views. The baseline is very narrow (physically 0.45 mm, within 1 px), so sub-pixel shifts are generated with the Fourier shift theorem [Averbuch and Keller, "A unified approach to FFT based image registration", IEEE TIP 2003]: $\mathcal{F}\{I(\mathbb{x} + \Delta\mathbb{x})\} = \mathcal{F}\{I(\mathbb{x})\} e^{2\pi i \Delta\mathbb{x}}$, hence $I(\mathbb{x} + \Delta\mathbb{x}) = \mathcal{F}^{-1}\{\mathcal{F}\{I(\mathbb{x})\} e^{2\pi i \Delta\mathbb{x}}\}$, where $\mathcal{F}$ is the Fourier transform and $\Delta\mathbb{x}$ the sub-pixel displacement. This achieves roughly 1/100-pixel precision (compare bilinear and bicubic interpolation against the original). 51
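A minimal NumPy sketch of this sub-pixel resampling via the Fourier shift theorem; the function name and the 2-D formulation are assumptions for illustration.

```python
import numpy as np

def subpixel_shift(img, dy, dx):
    """Resample a 2-D image at sub-pixel offsets (dy, dx) using the Fourier
    shift theorem: multiplying the spectrum by a linear phase ramp shifts the image."""
    H, W = img.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    phase = np.exp(2j * np.pi * (dy * fy + dx * fx))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * phase))

# usage: shifted = subpixel_shift(view, 0.0, 0.01)   # 1/100-pixel horizontal shift
```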
  48. 48. Cost volume: for each depth label, a matching cost $f(\text{reference view}, \text{shifted target view})$ is computed per sub-aperture image and accumulated into a cost volume, using the sum of absolute differences (SAD) and the sum of gradient differences (GRAD). 52
  49. 49. Quantitative evaluation of interpolation methods (bad pixel ratio, lower is better) over seven HCI scenes (Buddha, Buddha2, Mona, Papillon, Still life, Horses, Medieval): Bilinear 5.37, 16.2, 11.3, 7.38, 2.89, 11.18, 9.03; Bicubic 5.33, 15.35, 9.02, 7.4, 2.35, 6.65, 8.73; Phase shift (ours) 4.69, 9.88, 8.91, 6.06, 2.27, 6.22, 6.38 — the phase-shift values are the lowest on every scene (error maps shown at 0.2% and 1% thresholds against GT). 53
  50. 50. Refinements: the conventional stereo-matching chain — sub-aperture images → cost volume → cost aggregation [Rhemann et al., CVPR 2011] → graph-cuts → iterative refinement — applied to the center view, compared with the proposed method [CVPR 2015]. 54
  51. 51. Quantitative evaluation against depth from structured light (emitted pattern): center view, synthesized view, and absolute error in pixels (0–1 scale) for ours. 55
  52. 52. Qualitative evaluation — comparison to the Lytro built-in depth (center view, Lytro built-in, GCDL, LAGC, CADC, ours) and on a Raytrix camera (central view, GCDL, LAGC, ours). 56
  53. 53. Qualitative evaluation: center view, 3D mesh, and measured distances in 3D at actual scale, for the Lytro Illum and for our simple-lens light-field camera dataset, with and without distortion correction. 57
  54. 54. There are still problems — Problem 1: severe vignetting; Problem 2: severe noise. 1. It is hard to find accurate correspondences under radiometric distortion and severe noise ⇒ use various hand-crafted matching costs. 2. Which matching cost is the correct one? ⇒ predict the correct matching cost using two random forests. 3. Does it work on real-world light-field images? ⇒ generate a realistic dataset based on the imaging pipeline of the Lytro camera. 58
  55. 55. Overview of the proposed method [TPAMI'19]: 1. realistic light-field image generation, emulating the imaging pipeline of the Lytro camera; 2. cost volumes built using phase shift, overcoming the inherent degradation of light-field images caused by the microlens array; 3. random forest 1 (classification), selecting the dominant matching costs from q = [SAD, GRAD, Census, ZNCC, ...]; 4. random forest 2 (regression), predicting a disparity value with sub-pixel precision. 59
  56. 56. 60 Light-field camera geometric calibration [Y. Bok, H.-G. Jeon, and I. S. Kweon, Geometric Calibration of Micro-Lens-Based Light-Field Cameras using Line Features, ECCV 2014, IEEE TPAMI 2017]. Line features are detected directly in the raw image rather than in sub-aperture images: a line $x \sin\theta + y \cos\theta + t = 0$ is fitted indirectly by finding the best match of a $(\theta, t)$ template, and each micro-lens with center $(u_c, v_c)$ contributes the constraint $a(u - u_c) + b(v - v_c) + c = 0$, where the observed point is obtained from the projections of adjacent corners as the point closest to the micro-lens center. The projection model relates a 3D point $(X, Y, Z)$, its projection through the main lens, the projected micro-lens center, and the point $(x, y)$ on the sensor. Results are compared against Dansereau et al., ICCV13.
  57. 57. Data Generation Vignetting Map Noise-free multi-view images Vignetting map from averaged white plane images Sub-aperture image with vignetting map 61
  58. 58. Data generation — lenslet image generation: from the sub-aperture images with the vignetting map applied, a pixel is extracted from each sub-aperture image and these pixels are aggregated into a lenslet. 62
  59. 59. Data generation — add noise: the color image is converted to a raw image, and the noise level (standard deviation as a function of intensity) is estimated for each color channel (red, green 1, green 2, blue) [Y. Schechner et al., "Multiplexing for optimal lighting", IEEE TPAMI 2007]. 63
  60. 60. Data generation — realistic sub-aperture image generation: the noisy raw image is demosaiced, and the pixels of each lenslet are rearranged into the corresponding sub-aperture images. 64
  61. 61. Effectiveness of the augmented training dataset: depth profiles with and without the realistic noise model, compared to simple Gaussian noise. 65
  62. 62. Training Set http://hci-lightfield.iwr.uni-heidelberg.de/ Antinous, Range: [ -3.3, 2.8 ] Boardgames, Range: [ -1.8, 1.6 ] Dishes, Range: [ -3.1, 3.5 ] Greek, Range: [ -3.5, 3.1 ] Kitchen, Range: [ -1.6, 1.8 ] Medieval2, Range: [ -1.7, 2.0 ] Museum, Range: [ -1.5, 1.3 ] Pens, Range: [ -1.7, 2.0 ] Pillows, Range: [ -1.7, 1.8 ] Platonic, Range: [ -1.7, 1.5 ] Rosemary, Range: [ -1.8, 1.8 ] Table, Range: [ -2.0, 1.6 ] Tomb, Range: [ -1.5, 1.9 ] Tower, Range: [ -3.6, 3.5 ] Town, Range: [ -1.6, 1.6 ] Vinyl, Range: [ -1.6, 1.2 ] 66
  63. 63. Cost Volumes Matching Costs Sum of Absolute Difference (SAD) Zero-mean Normalized Cross correlation (ZNCC) Census Transform (Census) Sum of Gradient Difference (GRAD) + Robust to image noise; act as averaged filter + Compensate for differences in both gain and offset + Synergy with other matching costs + imposing higher weights at edge boundaries + Tolerate radiometric distortions H. Hirschmuller and D. Scharstein, “Evaluation of stereo matching costs on images with radiometric differences,” IEEE TPAMI 2009. 67
  64. 64. Cost volumes, matching group 1: the matching cost $f(\text{reference view}, \text{target view})$ is accumulated over one group of sub-aperture images into a cost volume indexed by depth label. 68
  65. 65. Cost volumes, matching group 2: the same construction applied to a second group of sub-aperture images. 69
  66. 66. Cost volumes: the computed cost volumes, one per matching group and matching cost (sum of absolute difference (SAD), zero-mean normalized cross correlation (ZNCC), census transform (Census), sum of gradient difference (GRAD)). 70
  67. 67. Cost Volumes Computed Cost Volumes Disparities from each cost volume via Winner-Takes-All 71
  68. 68. Computed cost volumes: the per-pixel depth labels estimated from each cost volume — and from blended costs such as SAD+GRAD, GRAD+Census, and Census+SAD with $\alpha \in [0, 1.0]$ — are vectorized together with the ground-truth depth label, giving multiple disparity hypotheses per pixel [Campbell et al., "Using multiple hypotheses to improve depth-maps for multi-view stereo", ECCV 2008]. 72
  69. 69. Computed cost volumes: a second example of the same vectorization, again producing multiple disparity hypotheses per pixel from the individual and blended cost volumes. 73
  70. 70. Training a random forest: the vectorized hypothesis vector q for each pixel is fed to random forest 1 for classification. 74
  71. 71. Random forest 1 — classification: a set of important matching costs is retrieved from q using the permutation importance measure [L. Breiman, "Random forests", Machine Learning], evaluated over matching groups 1–4. + Removes unnecessary matching costs. + Yields a better prediction model. 75
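A hedged scikit-learn sketch of this selection step: a random forest is trained on per-pixel matching-cost vectors and permutation importance ranks the cost columns. The feature layout, the binary "correct label" target, and the number of kept costs are all placeholder assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# q: one row per pixel, one column per matching-cost hypothesis (SAD, GRAD, Census, ZNCC, blends, ...)
# y: 1 if that pixel's winner-takes-all label matches the ground-truth label, else 0 (illustrative target)
q = np.random.rand(5000, 11)
y = np.random.randint(0, 2, 5000)

forest = RandomForestClassifier(n_estimators=100).fit(q, y)
imp = permutation_importance(forest, q, y, n_repeats=10, random_state=0)

# keep the matching costs whose permutation hurts accuracy the most
keep = np.argsort(imp.importances_mean)[::-1][:4]
print("dominant matching-cost columns:", keep)
```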
  72. 72. Random forest 2 — regression: the selected cost vector is the input of a random forest for regression, which estimates the disparity value with sub-pixel precision; it is compared against SAD+GRAD [H.-G. Jeon et al., IEEE CVPR 2015] refined with a weighted median filter [Z. Ma et al., IEEE ICCV 2013]. 76
  73. 73. Real-world examples (Lytro Illum), compared with Wanner and Goldluecke (IEEE TPAMI 14), Yu et al. (ICCV 13), Williem et al. (CVPR 16), Wang et al. (IEEE TPAMI 16), Tao et al. (IEEE TPAMI 17), ours (CVPR 15), and the proposed method. 77
  74. 74. Which matching costs are selected? Different blends are chosen in different regions: (1-α)SAD + αCensus, (1-α)Census + αGRAD, (1-α)GRAD + αSAD, and ZNCC are picked depending on whether the region is smooth, a depth discontinuity, or strongly vs. weakly textured. 78
  75. 75. Benchmark: bad pixel ratio (>0.07 px) and mean square error (as of 2017.05.23) — robustness champion!! 79
  76. 76. Qualitative evaluation on different input setups. Kim et al., "Scene Reconstruction from High Spatio-Angular Resolution Light Fields", SIGGRAPH 2013, use a DSLR camera mounted on a motorized linear stage; center view, Kim et al. (number of inputs: 51) vs. proposed (number of inputs: 9). 80
  77. 77. Qualitative evaluation on different input setups Samsung Galaxy Note 8 Input images SGM Proposed H. Hirschmuller. Stereo processing by semiglobal matching and mutual information, IEEE PAMI 2008 81
  78. 78. One more Problem… Runtime Very slow … 82
  79. 79. Convolutional Neural Network DispNet, CVPR 16 PSMNet, CVPR 18 EdgeStereo, ArXiv 83
  80. 80. EPINET [CVPR'18]: the light-field views along four angular directions (stacks I0°, I45°, I90°, I135°) are each processed by their own stream of 3 convolutional blocks (70 feature maps), concatenated into 280 feature maps, passed through 8 further convolutional blocks (Conv, ReLU, BN layers), and a last convolutional block outputs the disparity map. 84
  81. 81. Lack of Data Antinous, Range: [ -3.3, 2.8 ] Boardgames, Range: [ -1.8, 1.6 ] Dishes, Range: [ -3.1, 3.5 ] Greek, Range: [ -3.5, 3.1 ] Kitchen, Range: [ -1.6, 1.8 ] Medieval2, Range: [ -1.7, 2.0 ] Museum, Range: [ -1.5, 1.3 ] Pens, Range: [ -1.7, 2.0 ] Pillows, Range: [ -1.7, 1.8 ] Platonic, Range: [ -1.7, 1.5 ] Rosemary, Range: [ -1.8, 1.8 ] Table, Range: [ -2.0, 1.6 ] Tomb, Range: [ -1.5, 1.9 ] Tower, Range: [ -3.6, 3.5 ] Town, Range: [ -1.6, 1.6 ] Vinyl, Range: [ -1.6, 1.2 ] 85
  82. 82. Data augmentation: view-shift, rotation, and scale augmentation; when the image width L is scaled to L/2 or L/3, the disparity D scales to D/2 or D/3 accordingly. 86
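A small NumPy sketch of the scale part of this augmentation (naive striding stands in for proper resampling, and the array shapes are assumptions): downscaling the views by a factor shrinks the disparity values by the same factor, as noted above.

```python
import numpy as np

def scale_augment(views, disparity, factor=2):
    """Downscale every sub-aperture view and the disparity map by `factor`;
    disparity values shrink by the same factor (D -> D/factor when L -> L/factor)."""
    small_views = views[..., ::factor, ::factor]           # (V, H/f, W/f) naive striding
    small_disp = disparity[::factor, ::factor] / factor    # disparities are in pixels
    return small_views, small_disp

# usage: v2, d2 = scale_augment(np.random.rand(81, 512, 512), np.random.rand(512, 512))
```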
  83. 83. Results • 4D Light Field Benchmark • 16 synthetic light-field images (9x9 views) • depth/disparity maps for the training scenes • http://hci-lightfield.iwr.uni-heidelberg.de 87
  84. 84. 4D Light field Benchmark: http://hci-lightfield.iwr.uni-heidelberg.de/ Evaluations 88
  85. 85. Evaluations on the Cotton, Boxes, and Dots scenes: disparity maps and bad-pixel maps (>0.03) for (a) GT, (b) EPI2, (c) LF_OCC, (d) LF, (e) EPI1, (f) CAE, (g) SPO, (h) SC_GC, (i) RPRF5, (j) OFSY_330, (k) PS_RF, and (l) Proposed. 89
  86. 86. Real-world comparisons (center view plus methods (b)–(g) and ours). B: Globally consistent depth labeling of 4D light fields, S. Wanner and B. Goldluecke. C: Accurate depth map estimation from a lenslet light field camera, H.-G. Jeon et al. D: Robust light field depth estimation for noisy scene with occlusion, W. Williem et al. E: Occlusion-aware depth estimation using light-field cameras, T.-C. Wang. F: Shape estimation from shading, defocus, and correspondence using light-field angular coherence, Tao et al. G: Line assisted light field triangulation and stereo matching, Z. Yu et al. 90
  87. 87. Unknown geometry: Rotation, translation Two Research Topics 91
  88. 88. 92 Asymmetric stereo Light-field camera Monocular camera [CVPR’16, Silver Prize of Samsung Humantech Paper Award, Submitted to IEEE TIP] [ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, Robustness champion of CVPR’17 workshop] [ICCV’15, CVPR’16, ECCV’16, CVPR’18, ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE TPAMI’19, IEEE TPAMI under minor revision] Today’s Talk
  89. 89. Depth from Small Motion Video Clip Publications (Co-author papers) • High Quality Structure from Small Motion for Rolling Shutter Cameras Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo and In So Kweon IEEE International Conference on Computer Vision (ICCV), Dec 2015 • High-quality Depth from Uncalibrated Small Motion Clip [Oral presentation] Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon and In So Kweon IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Jun 2016 • All-around Depth from Small Motion with A Spherical Panoramic Camera Sunghoon Im, Hyowon Ha, François Rameau, Hae-Gon Jeon, Gyeongmin Choe and In So Kweon European Conference on Computer Vision (ECCV), Oct 2016 • Robust Depth Estimation from Auto Bracketed Images Sunghoon Im, Hae-Gon Jeon and In So Kweon IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018 • Accurate 3D Reconstruction from Small Motion Clip for Rolling Shutter Cameras Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo and In So Kweon IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Apr 2019 • DPSNet: End-to-end Deep Plane Sweep Stereo Sunghoon Im, Hae-Gon Jeon, Steve Lin and In So Kweon International Conference on Learning Representations (ICLR), May 2019 93
  90. 90. 94
  91. 91. Yu & Gallup (CVPR14) Yu and Gallup, "3d reconstruction from accidental motion“, CVPR2014 95
  92. 92. 96
  93. 93. Depth from Small Motion: 1. feature extraction & tracking, 2. sparse 3D reconstruction, 3. dense 3D reconstruction. 97
  94. 94. Sparse 3D reconstruction [ICCV'15] — bundle adjustment minimizes the reprojection error $C(\mathbf{P}, \mathbf{X}) = \sum_{i=1}^{N_I} \sum_{j=1}^{N_J} \| \mathbf{x}_{ij} - \phi(\mathbf{K} \mathbf{P}_{ij} \mathbf{X}_j) \|^2$, where $\mathbf{x}_{ij} = [u, v, 1]^T$ is a 2D image coordinate, $\mathbf{X}_j = [X, Y, Z, 1]^T$ a world coordinate, $\phi([X, Y, Z]^T) = [X/Z, Y/Z, 1]^T$, $N_I$ the number of images, $N_J$ the number of features, $\mathbf{K} = \begin{bmatrix} f_x & \alpha & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ the intrinsic matrix, and $\mathbf{P}_{ij}$ the extrinsic matrix. Under the small-angle approximation the rotation matrix is $\mathbf{R}_{ij} = \mathbf{R}(\mathbf{r}_{ij}) = \begin{bmatrix} 1 & -r_{ij}^z & r_{ij}^y \\ r_{ij}^z & 1 & -r_{ij}^x \\ -r_{ij}^y & r_{ij}^x & 1 \end{bmatrix}$ and $\mathbf{P}_{ij} = [\mathbf{R}(\mathbf{r}_{ij}) \mid \mathbf{t}_{ij}]$, with the rolling-shutter rotation and translation interpolated between frames: $\mathbf{r}_{ij} = \mathbf{r}_i + w(\mathbf{r}_{i+1} - \mathbf{r}_i)$ and $\mathbf{t}_{ij} = \mathbf{t}_i + w(\mathbf{t}_{i+1} - \mathbf{t}_i)$. The Jacobian with rolling shutter is correspondingly larger than without (2 x the number of re-projection points rows by the number of refined values columns). 98
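To make the cost concrete, here is a hedged NumPy sketch of the reprojection residual for a single image under the small-angle rotation approximation; the rolling-shutter interpolation is omitted and the function names are placeholders, not the authors' code.

```python
import numpy as np

def small_angle_rotation(r):
    """R ~ I + [r]_x for a small rotation r = (rx, ry, rz)."""
    rx, ry, rz = r
    return np.array([[1.0, -rz,  ry],
                     [ rz, 1.0, -rx],
                     [-ry,  rx, 1.0]])

def reprojection_cost(r, t, K, X, x):
    """C = sum_j || x_j - phi(K [R|t] X_j) ||^2 for one image.
    X: (N, 3) world points, x: (N, 2) observed pixel coordinates."""
    R = small_angle_rotation(r)
    Xc = X @ R.T + t                  # points in camera coordinates
    proj = Xc @ K.T                   # homogeneous pixel coordinates
    uv = proj[:, :2] / proj[:, 2:3]   # phi: perspective division
    return ((x - uv) ** 2).sum()

# In bundle adjustment this cost is summed over all images and minimized jointly
# over the poses {r_i, t_i} and 3-D points {X_j} (e.g. with scipy.optimize.least_squares).
```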
  95. 95. Sparse 3D reconstruction results: input image sequence, Yu & Gallup (CVPR14), and the proposed method. 99
  96. 96. Dense 3D reconstruction — results with only color smoothness (input image vs. reconstruction). 100
  97. 97. Dense 3D reconstruction — geometry guidance term. Key idea: neighboring pixels with similar color should have similar normals, so normal vectors guide the 3D positions of neighboring pixels. The guidance term is $E_g(\mathbf{D}) = \sum_p \sum_{q \in W_p} w_{pq}^g \left( D_p - \frac{\mathbf{n}_p \cdot \hat{\mathbf{X}}_q}{\mathbf{n}_p \cdot \hat{\mathbf{X}}_p} D_q \right)^2$, which penalizes $\mathbf{n}_p \cdot (D_q \hat{\mathbf{X}}_q - D_p \hat{\mathbf{X}}_p)$, i.e. the deviation of $q$'s 3D point from the local plane at $p$, with weights $w_{pq}^g = \frac{1}{N_g} \exp\!\left( -\frac{\|\mathbf{n}_p - \mathbf{n}_q\|}{\gamma_g} \right)$. Here $\hat{\mathbf{X}}_p = [x, y, 1]$ is the normalized image coordinate of pixel $p$, $D_p$ its depth value, $\mathbf{n}_p$ the normal vector of its 3D point, $W_p$ the 8-neighborhood, and $N_g$, $\gamma_g$ constants. The full energy is $E(\mathbf{D}) = E_d(\mathbf{D}) + \lambda_c E_c(\mathbf{D}) + \lambda_g E_g(\mathbf{D})$. (Figures: input image, sparse 3D points, normals of the 3D points, normal map.) 101
  98. 98. Dense 3D reconstruction results: input image, sparse 3D points, the conventional method, and the proposed method. 102
  99. 99. Dense 3D reconstruction on the dataset from Yu & Gallup: Yu & Gallup (CVPR14) versus the proposed method. 103
  100. 100. Further results on the dataset from Yu & Gallup: Yu & Gallup (CVPR14) versus the proposed method. 104
  101. 101. Application 105
  102. 102. There are still Problems 1. Exact definition of small motion 2. Blurry depth 106
  103. 103. Small motion issue [CVPR'16]: reference image, our depth map (b = 1.5), ground truth, and the depth difference; the ratio of camera motion to the distance of the closest object is 1:100. 107
  104. 104. Solution to blurry depth: plane sweeping. Given the intrinsic and extrinsic camera parameters, the other views are warped onto planes swept from near to far (P1, P2, P3) and averaged with the reference view; at the correct depth the mean image and its intensity profile are sharp, while at wrong depths they are flat (blurred). 108
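A hedged NumPy sketch of one plane-sweep step: for a candidate depth, a source view is warped to the reference view through the plane-induced homography $H = K (R - t n^T / d) K^{-1}$ of a fronto-parallel plane. The pose convention, nearest-neighbour sampling, and function names are assumptions for illustration.

```python
import numpy as np

def plane_sweep_warp(src, K, R, t, depth):
    """Warp a source view onto the reference view through a fronto-parallel plane
    at `depth`, using H = K (R - t n^T / depth) K^{-1} with n = (0, 0, 1).
    R, t are assumed to map reference-camera coordinates to source-camera coordinates."""
    n = np.array([0.0, 0.0, 1.0])
    H = K @ (R - np.outer(t, n) / depth) @ np.linalg.inv(K)
    h, w = src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ref = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # reference pixels (homogeneous)
    sx, sy, sw = H @ ref                                      # corresponding source pixels
    sx, sy = np.rint(sx / sw).astype(int), np.rint(sy / sw).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    warped = np.zeros(h * w)
    warped[valid] = src[sy[valid], sx[valid]]                 # nearest-neighbour sampling
    return warped.reshape(h, w)

# Sweeping: warp all views for each candidate depth, average them, and keep the depth
# whose mean image is sharpest (flat profile = wrong depth, sharp profile = correct depth).
```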
  105. 105. Depth quantization error [TPAMI'19]: caused by the quantized depth range used in plane sweeping. 109
  106. 106. Adaptive matching window [TPAMI'19]: the blurry initial depth is used to estimate a per-pixel min-max depth range; a confidence map with weights in [0, 1] controls the adaptive depth range, and a parameter controls the steepness of the exponential weighting function. 110
  107. 107. Results Kinect2 Ours Error map 111
  108. 108. Results: Yu and Gallup, ours (ICCV 15), ours (CVPR 16), and ours (IEEE TPAMI). 112
  109. 109. One more Problem… Runtime Camera pose: 2s + Surface normal: 5min + Plane sweeping: 5min + Refinement: 1s About 10 min 113
  110. 110. Common pipeline of traditional approaches: cost computation → cost volume → cost aggregation → graph-cuts → iterative refinement (with input, guide, and output images). Cost computation may be parametric (AD, SAD, BT, mean filter, Laplacian of Gaussian, bilateral filtering, ZSAD, NCC, ZNCC), non-parametric (rank filter, soft-rank filter, census filter, ordinal), or based on mutual information (hierarchical MI). The optimization minimizes an energy of the form $E = E_{data} + E_{smooth} + E_{consistency} + E_{regularization}$, and the winning label $l_r$ is refined to sub-pixel precision with $l^* = l_r - \frac{C(l_+) - C(l_-)}{2\,(C(l_+) + C(l_-) - 2 C(l_r))}$. This pipeline underlies Accurate Depth Map Estimation from a Lenslet Light Field Camera (Hae-Gon Jeon et al., CVPR, Jun 2015) and Stereo Matching with Color and Monochrome Cameras in Low-light Conditions (Hae-Gon Jeon et al., CVPR, Jun 2016), in contrast to a fully end-to-end process. 114
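The sub-pixel refinement step is small enough to write out; a minimal Python sketch (the function name is assumed) of the parabola fit around the winning label:

```python
def subpixel_label(cost, l):
    """Parabola fit through the winning label l and its neighbours:
    l* = l - (C(l+1) - C(l-1)) / (2 * (C(l+1) + C(l-1) - 2*C(l)))."""
    c0, c_minus, c_plus = cost[l], cost[l - 1], cost[l + 1]
    denom = 2.0 * (c_plus + c_minus - 2.0 * c0)
    return l if denom == 0 else l - (c_plus - c_minus) / denom
```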
  111. 111. Overview of DPSNet [ICLR19] — a network designed after the traditional plane sweeping algorithm. A 2D CNN extracts 1/4-downsampled features (W x H x CH) from the 4W x 4H x 3 reference image and each i-th pair image; the pair features are warped through the l-th plane (sweep, l = 1..L) and concatenated with the reference features into a volume (W x H x 2CH x L, no learnable parameters); a 3D CNN generates the cost volume (W x H x L); the per-pair cost volumes (i ∈ {1, .., N}) are averaged (no learnable parameters); a 2D CNN performs cost aggregation and upsampling; and a softmax depth regression (no learnable parameters) outputs the 4W x 4H x 1 depth map. 115
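The parameter-free depth regression at the end can be sketched directly; below is a hedged NumPy version of a soft-argmax over the swept depth labels (the depth values and cost shapes are placeholders):

```python
import numpy as np

def soft_argmax_depth(cost_volume, depth_values):
    """cost_volume: (L, H, W) aggregated costs; depth_values: (L,) swept depths.
    Returns the per-pixel expectation of depth under softmax(-cost)."""
    logits = -cost_volume                            # low cost -> high probability
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return np.tensordot(depth_values, prob, axes=(0, 0))   # (H, W) regressed depth

# usage: depth = soft_argmax_depth(np.random.rand(64, 32, 32), np.linspace(0.5, 10, 64))
```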
  112. 112. Training & test process: at test time the same process is applied to the reference image paired with each i-th image (i = 1, ..., N); the N cost volumes are added iteratively and then averaged before cost aggregation and depth regression. 116
  113. 113. Deep cost aggregation: each cost-volume slice is filtered by a context network (2D convolutions) guided by the reference image features, producing an initial-plus-residual aggregated volume — inspired by traditional cost-volume filtering [Rhemann et al., CVPR 2011]. The same shared weights are used for all slices. 117
  114. 114. Reference GT depth Estimated Depth Slice of volume along a label (far/close) Slice of volume along the green row in reference image (x: Column, y: Cost layer) Before Aggregation After Aggregation Ablation Study: Cost Aggregation 118
  115. 115. Ablation study: cost aggregation — confidence measures. Winner Margin (WM): the difference between the maximum and the second-largest response. Curvature (CUR): the difference near the maximum response. Depth-map evaluation reports both lower-is-better and higher-is-better metrics. 119
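For reference, a hedged NumPy sketch of the two confidence measures named above, computed per pixel over a probability or cost volume; the exact normalizations used in the paper may differ.

```python
import numpy as np

def winner_margin(prob):
    """WM: difference between the largest and second-largest probability per pixel."""
    top2 = np.sort(prob, axis=0)[-2:]          # (2, H, W), ascending order
    return top2[1] - top2[0]

def curvature(cost):
    """CUR: curvature of the cost around the winning label (a sharper minimum
    indicates a more confident match)."""
    l = np.clip(np.argmin(cost, axis=0), 1, cost.shape[0] - 2)
    rows, cols = np.indices(l.shape)
    return cost[l - 1, rows, cols] + cost[l + 1, rows, cols] - 2 * cost[l, rows, cols]
```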
  116. 116. Ablation study: number of input images — error metrics and depth-map results with respect to the number of views (reference, GT depth, 2-view, 3-view, 4-view). 120
  117. 117. MVS, SUN3D, RGBD, Scenes11 datasets Reference GT depth DeMoN CVPR17 COLMAP CVPR16 DeepMVS CVPR18 Ours Experimental Results 121
  118. 118. Quantitative Evaluation Depth map evaluation Lower is better Higher is better Experimental Results 122
  119. 119. Summary. Step 1 — Solution: propose new ideas (iterative decolorization, phase shift, rolling-shutter bundle adjustment → heavy computational burden) and cascaded optimization (cost aggregation, graph-cuts, weighted median filtering, tree-based filtering → needs careful tuning of user parameters). Step 2 — Maturity: handle the remaining issues (photometric distortion via random forest prediction, depth quantization error via an adaptive matching window) → still suffering from computational issues. Step 3 — Breakthrough: design CNNs via re-search — CMSNet (a fraction of the cost of iterative stereo matching), EPINET (merging traditional approaches), DPSNet (inspired by the traditional plane sweeping algorithm) → no user parameters in the test phase, fast depth prediction, accurate results, and new applications, at the cost of a large number of training parameters. 123 Personal website: https://sites.google.com/site/hgjeoncv/home
