Burnaev and Notchenko. Skoltech. Bridging gap between 2D and 3D with Deep Learning
  1. Bridging the gap between 2D and 3D with Deep Learning. Evgeny Burnaev (PhD), assoc. prof., Skoltech <e.burnaev@skoltech.ru>; Alexandr Notchenko, PhD student <a.notchenko@skoltech.ru>
  2. [1]
  3. ImageNet top-5 error over the years: deep-learning-based methods vs. feature-based methods vs. human performance
  4. Supervised Deep Learning on 2D data. Tasks: image classification, detection, segmentation, pose estimation. Supervision: class labels, object detection boxes, segmentation contours, structure of a "skeleton" on the image.
  5. But the world is in 3D
  6. 3D deep learning is gaining popularity. Workshops: ● Deep Learning for Robotic Vision Workshop, CVPR 2017 ● Geometry Meets Deep Learning, ECCV 2016 ● 3D Deep Learning Workshop @ NIPS 2016 ● Large Scale 3D Data: Acquisition, Modelling and Analysis, CVPR 2016 ● 3D from a Single Image, CVPR 2015. Google Scholar results for "3D" "Deep Learning" by year: 2012: 410; 2013: 627; 2014: 1210; 2015: 2570; 2016: 5440.
  7. Representations of 3D data for deep learning. Many 2D projections: (+) preserve surface texture, and many 2D DL methods exist; (-) redundant representation, vulnerable to optical illusions. Voxels: (+) simple, can be sparse, have volumetric properties; (-) lose surface properties. Point cloud: (+) can be sparse; (-) loses surface and volumetric properties. 2.5D images: (+) cheap measurement devices, sense depth; (-) self-occlusion of bodies in a scene, a lot of measurement noise.
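As a concrete illustration of the voxel representation compared above, here is a minimal sketch, assuming NumPy (the `voxelize` helper and the unit-cube example are hypothetical, not from the talk), that bins a point cloud into a binary occupancy grid:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Points are normalized into the unit cube with a uniform scale
    (preserving aspect ratio); each point marks its voxel as occupied.
    """
    pts = np.asarray(points, dtype=np.float64)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    scale = (maxs - mins).max() or 1.0
    idx = ((pts - mins) / scale * (resolution - 1)).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# The 8 corners of a unit cube occupy only 8 of the 32^3 voxels.
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)])
grid = voxelize(corners)
print(grid.sum())  # 8
```

This also makes the "losing surface properties" drawback visible: the grid records only occupancy, discarding normals and texture.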
  8. [6]
  9. [2]
  10. 3D shape as a dense point cloud
  11. Learning Rich Features from RGB-D Images for Object Detection and Segmentation [10]
  12. Latest developments in the SLAM family of methods
  13. LSD-SLAM (Large-Scale Direct Monocular Simultaneous Localization and Mapping) [5]: direct (feature-less) monocular SLAM
  14. ElasticFusion: dense SLAM without a pose graph [7]
  15. DynamicFusion [9]. The technique won the prestigious CVPR 2015 best paper award.
  16. Problems of SLAM algorithms: ● They don't represent objects (they only know surfaces) ● Mostly dense representations (require a lot of data) ● The whole scene is one big surface, so e.g. objects that are close to each other cannot be separated.
  17. 3D Shape Retrieval
  18. 3D Design Phase. There exist massive repositories of 3D CAD models, e.g. GrabCAD: chairs, mechanical parts.
  19. 3D Design Phase. Designers spend about 60% of their time searching for the right information. Massive and complex CAD models are usually archived in a disorderly way in enterprises, which makes design reuse difficult. 3D model retrieval can significantly shorten product lifecycles.
  20. 3D shape-based model retrieval. 3D models are complex, so there are no clear search rules. Text-based search has its limitations: 3D models are often poorly annotated. There is some commercial software for 3D CAD modeling, e.g. ➢ Exalead OnePart by Dassault Systèmes, ➢ Geolus Search by Siemens PLM, and others. However, the methods used ➢ are time-consuming, ➢ are often based on hand-crafted descriptors, ➢ can be limited to a specific class of shapes, ➢ are not robust to scaling, rotations, etc.
  21. Sparse 3D Convolutional Neural Networks for Large-Scale Shape Retrieval. Alexandr Notchenko, Ermek Kapushev, Evgeny Burnaev. Presented at the 3D Deep Learning Workshop at NIPS 2016.
  22. Sparsity of the voxel representation. A 30³ voxel grid is already enough to recognize a simple shape, and with texture information it would be even easier. Sparsity across all classes of the ModelNet40 train set at voxel resolution 40 is only 5.5%.
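A minimal sketch of why sparse storage pays off, assuming NumPy (the hollow-box shape is a made-up stand-in, not a ModelNet40 model): instead of the full dense grid, keep only the indices of occupied voxels, the form a sparse 3D CNN consumes.

```python
import numpy as np

# A hollow 40^3 occupancy grid: only the box surface is occupied.
res = 40
dense = np.zeros((res, res, res), dtype=bool)
dense[[0, -1], :, :] = True
dense[:, [0, -1], :] = True
dense[:, :, [0, -1]] = True

# Sparse (coordinate-list) form: a (K, 3) array of occupied indices.
coords = np.argwhere(dense)
sparsity = len(coords) / dense.size
print(f"occupied voxels: {len(coords)} of {dense.size} ({sparsity:.1%})")
```

Even for this crude shape only ~14% of voxels are occupied; for real ModelNet40 surfaces the quoted figure is far lower (5.5%), so storing coordinates instead of the dense tensor saves both memory and convolution work.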
  23. Shape retrieval pipeline: a query shape passes through the sparse 3D CNN to produce a feature vector (e.g. V_plane for a plane), which is compared by cosine distance against precomputed feature vectors of the dataset (V_car, V_person, ...) to rank the retrieved items.
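The retrieval step reduces to a cosine-distance nearest-neighbour search over precomputed descriptors. A minimal sketch, assuming NumPy; the `retrieve` helper and the toy 2-D vectors are hypothetical, standing in for CNN feature vectors:

```python
import numpy as np

def retrieve(query_vec, db_vecs, k=3):
    """Return indices of the k database vectors closest to the query
    under cosine distance (1 - cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    dist = 1.0 - db @ q          # cosine distance to every database item
    return np.argsort(dist)[:k]  # indices of the k nearest items

# Toy database of hypothetical shape descriptors.
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(retrieve(np.array([1.0, 0.05]), db, k=2))  # [0 1]
```

Because the database vectors are precomputed, only the query needs a CNN forward pass at search time; ranking is a single matrix-vector product.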
  24. Triplet loss. The representation can be efficiently learned by minimizing the triplet loss. A triplet is a set (a, p, n), where ● a is the anchor object ● p is a positive object, similar to the anchor ● n is a negative object, not similar to the anchor. The loss is L(a, p, n) = max(0, d(a, p) - d(a, n) + α), where α is a margin parameter and d(a, p), d(a, n) are the distances between p and a and between n and a.
  25. Our approach: ● use very large resolutions and sparse representations ● use triplet learning for 3D shapes ● use the large-scale shape datasets ModelNet and ShapeNet.
  26. Represent a voxel shape as a vector
  27. Obligatory t-SNE
  28. Conclusions: ● For small shape datasets, voxels and sparse 3D tensors can work. ● Voxels don't scale to hundreds of classes and lose texture information. ● They cannot encode complicated object domains.
  29. Problems for the next 5 years
  30. Autonomous Vehicles
  31. Augmented (Mixed) Reality
  32. Robotics in human environments
  33. Robotic Control in Human Environments
  34. Commodity sensors to create 2.5D images: Intel RealSense series, Asus Xtion Pro, Microsoft Kinect v2, Structure Sensor
  35. What do they have in common?
  36. What do they have in common? They require understanding the whole scene.
  37. The problem of "holistic" scene understanding
  38. The problem of "holistic" scene understanding. Lin D., Fidler S., Urtasun R. "Holistic scene understanding for 3D object detection with RGBD cameras." Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1417-1424. ● Human environments are designed by humans ● Most of the objects are created by humans ● Context provides information via joint probability functions ● Textures are caused by materials and can therefore explain the function and structure of an object.
  39. Connecting 3 families of CV algorithms is inevitable: learnable computer vision systems (deep learning), geometric computer vision (SLAMs), probabilistic computer vision (Bayesian methods).
  40. Connecting 3 families of CV algorithms is inevitable: at the intersection of learnable computer vision systems (deep learning), geometric computer vision (SLAMs), and probabilistic computer vision (Bayesian methods) lies Probabilistic Inverse Graphics.
  41. Probabilistic Inverse Graphics ● takes into account setting information (shop: shelves and products | street: buildings, cars, pedestrians) ● makes maximum likelihood estimates from data and a model (or gives directions on how best to reduce uncertainty) ● learns the structure of objects (materials and textures / 3D shape / intrinsic dynamics).
  42. Thank you. Alexandr Notchenko, Ermek Kapushev, Evgeny Burnaev
  43. Citations and Links
     1. Hinton, G., Bengio, Y., & LeCun, Y. "Deep Learning." NIPS 2015 Tutorial.
     2. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. "3D ShapeNets: A Deep Representation for Volumetric Shapes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912-1920.
     3. Nash, C., & Williams, C. "Generative Models of Part-Structured 3D Objects."
     4. Qin, Fei-wei, et al. "A deep learning approach to the classification of 3D CAD models." Journal of Zhejiang University SCIENCE C 15.2 (2014): 91-106.
     5. Engel, J., Schöps, T., & Cremers, D. "LSD-SLAM: Large-Scale Direct Monocular SLAM." European Conference on Computer Vision, Springer International Publishing, 2014.
     6. Su, H., et al. "Multi-view convolutional neural networks for 3D shape recognition." Proceedings of the IEEE International Conference on Computer Vision, 2015.
     7. Whelan, T., et al. "ElasticFusion: Dense SLAM Without A Pose Graph." Robotics: Science and Systems, Vol. 11, 2015.
     8. Notchenko, A., Kapushev, E., & Burnaev, E. "Sparse 3D Convolutional Neural Networks for Large-Scale Shape Retrieval." arXiv preprint arXiv:1611.09159 (2016).
     9. Newcombe, R. A., Fox, D., & Seitz, S. M. "DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
     10. Gupta, S., et al. "Learning rich features from RGB-D images for object detection and segmentation." European Conference on Computer Vision, Springer International Publishing, 2014.
