Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

CNN vs SIFT-based Visual Localization - Laura Leal-Taixé - UPC Barcelona 2018

299 views

Published on

https://telecombcn-dl.github.io/2018-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Published in: Data & Analytics
  • Be the first to comment

CNN vs SIFT-based Visual Localization - Laura Leal-Taixé - UPC Barcelona 2018

  1. 1. Prof. Dr. Laura Leal-Taixé Out with the Old? CNN vs. SIFT-based visual localization
  2. 2. Visual Localization Compute exact position and orientation of query image (relative to 3D scene model)
  3. 3. Classic Localization Pipeline Extract Local Features Establish 2D-3D Matches Estimate Camera Pose (RANSAC) Learning visual localization?
  4. 4. Where do we get training data? Structure-from-Motion gives us plenty of data TRAINING TESTING
  5. 5. Related work: PoseNet Kendall et al. PoseNet:. ICCV 2015 Non-Linear Embedding Pretrained y R2048 Linear Regression p R3 FC FC q R4
  6. 6. Outdoor localization: SIFT wins • Experiments on Cambridge Landmarks SIFT CNN Walch, Hazirbas, Leal-Taixé, Sattler, Hilsenbeck, Cremers, Image-based localization using LSTMs for structured feature correlation. ICCV, 2017
  7. 7. Indoor localization: SIFT suffers • Experiments on 7Scenes Number of images that it failed to localize SIFT CNN
  8. 8. SIFT-based methods do not work at all! Our new dataset TUM-LSI: SIFT dies https://github.com/NavVisResearch/NavVis-Indoor-Dataset
  9. 9. • One separate model is needed per scene Limitations of current methods Pretrained y R2048 p R3 FC FC q R4 Kendall et al. PoseNet:. ICCV 2015 Brachmann et al. DSAC:. CVPR 2017
  10. 10. • One separate model is needed per scene • The most used loss function requires the setting of a hyperparameter which is again scene dependent Limitations of current methods Pretrained y R2048 p R3 FC FC q R4 DOES NOT SCALE
  11. 11. • L2 loss pose label pose prediction camera center position camera rotation in quaternion scaling factor to balance the weighting between the rotation error and the position error PoseNet loss functions Kendall et al. PoseNet:. ICCV 2015 Need to find its value per scene à can range from 200 to 2500 No geometry information
  12. 12. • Geometric loss function: Minimize re-projection error of 3D points visible in image PoseNet loss functions Kendall et al. Geometric Loss Functions for Camera Pose Regression with Deep Learning. CVPR 2017 Network needs to be pre-trained on L2 loss One network per scene More accurate results
  13. 13. Relative Pose Estimation • Use a neural network to predict relative poses • Loss function: PoseNet L2 loss Laskar et al. Camera relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. ICCVW 2017 Loss still depends on a scene-dependent hyperparameter One network for all scenes
  14. 14. • One separate model is needed per scene • The most used loss function requires the setting of a hyperparameter which is again scene dependent Limitations of current methods
  15. 15. • One separate model is needed per scene • The most used loss function requires the setting of a hyperparameter which is again scene dependent Proposed method Relative Pose Estimation with Geometric Matching Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018 Essential Matrix Regression One network for all scenes No hyperparameter on the loss
  16. 16. • One separate model is needed per scene • The most used loss function requires the setting of a hyperparameter which is again scene dependent Proposed method Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018 Essential Matrix Regression One network for all scenes No hyperparameter on the loss Relative Pose Estimation with Geometric Matching
  17. 17. • Siamese architecture – Shared weights on the input image pair – Feature concatenation Regressing relative poses
  18. 18. I. Rocco et al. “Convolutional neural network architecture for geometric matching. CVPR 2017. Geometric matching layer • Mimic the geometrical matching process Features from the siamese net Fixed operation, no learnable parameters
  19. 19. Relative Pose Estimation with Geometric Matching • One separate model is needed per scene • The most used loss function requires the setting of a hyperparameter which is again scene dependent Proposed method Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018 One network for all scenes No hyperparameter on the loss Essential Matrix Regression
  20. 20. • Essential matrix between query and training image • L2 loss function on the entries of the essential matrix The loss is hyperparameter free! Essential Matrix loss Fixing the scale of translations to get consistent predictions
  21. 21. Are we really learning an essential matrix? • Ratio between first two eigenvalues is close to 1?
  22. 22. Are we really learning an essential matrix? • Ratio between first two eigenvalues is close to 1? • Third eigenvalue close to 0?
  23. 23. Are we really learning an essential matrix? • Ratio between first two eigenvalues is close to 1? • Third eigenvalue close to 0? • We learn a good approximation of a true essential matrix • Feature-based methods also project the resulting matrix to a rank 2 matrix (due to noise) • Forcing this projection does not have any effect on the results on any of the tested datasets
  24. 24. A CNN model for Image retrieval to identify 30 nearest neighbors for the query image F. Radenovic ́et al.. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples.. ECCV 2016 Full localization pipeline
  25. 25. EssNet to regress essential matrices and estimate relative poses Full localization pipeline
  26. 26. Triangulation of the position and averaging of the rotation within a RANSAC loop. Full localization pipeline
  27. 27. Comparison to SOA • Experiments on outdoor scenes, Cambridge Landmarks
  28. 28. Comparison to SOA • Experiments on outdoor scenes, Cambridge Landmarks • SOA positional error with better rotational error
  29. 29. Comparison to SOA • Experiments on outdoor scenes, Cambridge Landmarks • Only geometric loss is more accurate in rotation but 21% less accurate in translation
  30. 30. Comparison to SOA • Experiments on outdoor scenes, Cambridge Landmarks One single network One network trained per scene
  31. 31. What can we do with more data? • Training relative poses allows us to leverage more data Trained on Cambridge Landmarks Pre-trained on a new dataset and fine- tuned on Cambridge Landmarks
  32. 32. Out with the Old? • SIFT-based methods still outperform CNN-based methods but there is potential in indoor scenes • EssNet: relative pose estimation – mimics the visual localization pipeline à takes advantage of multiple view geometry – One network to perform localization in any scene • Generalization still has a long way to go: if no images of the test scene are shown to the network, performance drops 40
  33. 33. Thank you! Prof. Laura Leal-Taixé https://dvl.in.tum.de

×