
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonDL 2020

Image segmentation is a classic computer vision task that aims at labeling pixels with semantic classes. These slides give an overview of the basic deep learning approaches to this challenge and present the main subtasks (semantic, instance and panoptic segmentation) and related datasets.

Presented at the International Summer School on Deep Learning (ISSonDL) 2020, held online and organized by the University of Gdansk (Poland) from 30 August to 2 September.

http://2020.dl-lab.eu/virtual-summer-school-on-deep-learning/


  1. 1. Image Segmentation with Deep Learning Xavier Giro-i-Nieto UPC & BSC Barcelona Carles Ventura UOC Barcelona
  2. 2. Xavier Giro-i-Nieto Associate Professor at Universitat Politecnica de Catalunya (UPC) in Barcelona, Catalonia. IDEAI Center for Intelligent Data Science & Artificial Intelligence @DocXavi xavier.giro@upc.edu
  3. 3. https://sites.google.com/view/dlbcn2018/home https://sites.google.com/view/dlbcn2019/home Deep Learning Barcelona Symposium
  4. 4. Foundations ● MSc course [2017] [2018] [2019] ● BSc course [2018] [2019] [2020] Multimedia Applications Vision: [2016] [2017][2018][2019] Language & Speech: [2017] [2018] [2019] Reinforcement Learning ● [2020 Spring] [2020 Autumn] Deep Learning @ UPC TelecomBCN
  5. 5. 4th (face-to-face) & 5th edition (online) start November 2020. Sign up here. Online Postgraduate Course Àgata Lapedriza (UOC) Xavier Giró (UPC-BSC) Xavier Suau (Apple) Marta Ruiz (UPC) Carles Ventura (UOC) Jordi Pons (Dolby) Jordi Torres (BSC) Elisenda Bou (Vilynx) Daniel Fojo (Glovo)
  6. 6. Acknowledgements 6 Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya [DLCV 2016] Verónica Vilaplana veronica.vilaplana@upc.edu Associate Professor Universitat Politècnica de Catalunya [DLCV 2017] Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center [DLCV 2018] [DLCV 2018]
  7. 7. From Image to Pixel Classification (Segmentation): Image Classification, Object Detection, Image Segmentation. Labels in all three examples: “chair”, “bin”. Slide inspired by the cs231n lecture from Stanford University.
  8. 8. Segmentation: Define the accurate boundaries of all objects in an image by predicting a class label for each pixel.
  9. 9. ● Autonomous driving Segmentation Applications
  10. 10. ● Medical imaging Image source: DRIVE Digital Retinal Image Vessel Extraction Segmentation Applications
  11. 11. ● Robotic applications Segmentation Applications
  12. 12. ● Scene understanding Segmentation Applications
  13. 13. Outline From Global to Local-scale Image Classification Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 13
  14. 14. 14 Figure: Jeremy Jordan (2018) From Image to Pixel Classification (Segmentation)
  15. 15. From Image to Pixel Classification (Segmentation) 15
  16. 16. From Image to Pixel Classification (Segmentation). Naive approach: Train a sliding window classifier. Extract patch → run through a CNN → classify center pixel (“cow”) → repeat for every pixel. Slide: CS231n (Stanford University)
  17. 17. From Image to Pixel Classification (Segmentation). Naive approach: Train a sliding window classifier. Extract patch → run through a CNN → classify center pixel (“cow”) → repeat for every pixel. Slide: CS231n (Stanford University)
  18. 18. CNN Convolutionize: Run “fully convolutional” network to get all pixels at once. 18 From Global to Local-scale Image Classification Slide: CS231n (Stanford University)
  19. 19. CNN Convolutionize: Run “fully convolutional” network to get all pixels at once. 19 Slide concept: CS231n (Stanford University) From Global to Local-scale Image Classification
  20. 20. From Global to Local-scale Image Classification. Convolutionize: Formulate each neuron in a fully connected (FC) layer as a convolutional filter (kernel) of a convolutional layer. Example: for a 3x2x2 input tensor (RGB image of 2x2), 2 fully connected neurons hold 3x2x2 * 2 weights; they can be rewritten as 2 convolutional filters of 3x2x2 (same size as the input tensor), which also hold 3x2x2 * 2 weights.
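The FC-to-convolution rewrite on this slide can be checked numerically. The following is a minimal pure-Python sketch, not the authors' code: the helper names (`flatten`, `reshape`, `fc_forward`, `conv_valid`, `random_tensor`) and the random weights are illustrative assumptions, and biases are omitted. It shows that two FC neurons over a 3x2x2 input give the same outputs as two 3x2x2 filters, and that the same filters produce a spatial map when the input is larger.

```python
import random


def flatten(x):
    """Flatten a C x H x W nested list in channel, row, column order."""
    return [v for channel in x for row in channel for v in row]


def reshape(flat, C, H, W):
    """Inverse of flatten: rebuild a C x H x W nested list."""
    return [[[flat[(c * H + i) * W + j] for j in range(W)]
             for i in range(H)] for c in range(C)]


def fc_forward(x_flat, weights):
    """Fully connected layer without bias: one dot product per neuron."""
    return [sum(w * v for w, v in zip(row, x_flat)) for row in weights]


def conv_valid(x, filters):
    """Stride-1, no-padding cross-correlation; returns one 2D map per filter."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    maps = []
    for f in filters:
        kh, kw = len(f[0]), len(f[0][0])
        maps.append([[sum(f[c][u][v] * x[c][i + u][j + v]
                          for c in range(C)
                          for u in range(kh)
                          for v in range(kw))
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])
    return maps


def random_tensor(rng, C, H, W):
    """Hypothetical helper: a C x H x W tensor of values in [0, 1]."""
    return [[[rng.uniform(0.0, 1.0) for _ in range(W)]
             for _ in range(H)] for _ in range(C)]


rng = random.Random(0)
C, H, W = 3, 2, 2  # RGB image of 2x2, as on the slide

# Two FC neurons, each with 3*2*2 weights, reused as two 3x2x2 filters.
fc_weights = [[rng.uniform(-1.0, 1.0) for _ in range(C * H * W)] for _ in range(2)]
conv_filters = [reshape(w, C, H, W) for w in fc_weights]

small = random_tensor(rng, C, H, W)
fc_out = fc_forward(flatten(small), fc_weights)  # 2 scalars
conv_out = conv_valid(small, conv_filters)       # 2 maps of size 1x1

# A larger input yields a map of local responses instead of a single score.
large = random_tensor(rng, C, 4, 5)
heatmaps = conv_valid(large, conv_filters)       # 2 maps of size 3x4
```

On the 2x2 input each filter fits exactly once, so the convolution reduces to the FC dot product; on the 4x5 input the same weights slide over 3x4 positions.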
  21. 21. 21 A model trained for image classification on low-definition images can provide local response when fed with high-definition images. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified) From Global to Local-scale Image Classification
  22. 22. From Global to Local-scale Image Classification. Convolutionize: Run “fully convolutional” network to get all pixels at once... Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified)
  23. 23. From Global to Local-scale Image Classification. The FC to Conv redefinition allows generating heatmaps of the class prediction over the input images. Campos, V., Jou, B., & Giro-i-Nieto, X. From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction. Image and Vision Computing (2017).
  24. 24. 24 From Global to Local-scale Image Classification Limitation: Pooling layers in the CNN will decrease the spatial definition of the output. Figure: Alicja Kwasniewska (ISSonDL 2020)
  25. 25. 25 From Global to Local-scale Image Classification CNN Limitation: Pooling layers in the CNN will decrease the spatial definition of the output. Slide concept: CS231n (Stanford University)
  26. 26. Outline From Global to Local-scale Image Classification Semantic Segmentation ● Deconvolution (or transposed convolution) ● Skip Connections ● Dilated Convolutions Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 26
  27. 27. Semantic Segmentation Label every pixel! Don’t differentiate instances (cows) Classic computer vision problem 27 Slide: CS231n (Stanford University)
  28. 28. Instance Segmentation Detect instances, give category, label pixels “simultaneous detection and segmentation” (SDS) Labels are class-aware and instance-aware 28 Slide: CS231n (Stanford University)
  29. 29. Outline Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation Methods ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 29
  30. 30. Semantic Segmentation. Slide Credit: https://www.jeremyjordan.me/semantic-segmentation/
  31. 31. Semantic Segmentation 31 CNN Limitation of convolutionizing CNNs for image classification: Pooling layers in the CNN will decrease the spatial definition of the output. Slide concept: CS231n (Stanford University)
  32. 32. Learnable upsampling. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
  33. 33. 33 Slide: Alicja Kwasniewska (ISSonDL 2020) Learnable Upsample: Transposed Convolution
  34. 34. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 34 Slide credit: CS231n (Stanford University)
  35. 35. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input 35 Slide credit: CS231n (Stanford University)
  36. 36. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input 36 Slide credit: CS231n (Stanford University)
  37. 37. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 37 Slide credit: CS231n (Stanford University)
  38. 38. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input 38 Slide credit: CS231n (Stanford University)
  39. 39. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input 39 Slide credit: CS231n (Stanford University)
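The output sizes quoted in these reminder slides follow from the standard convolution size formula. As a worked check, with input size $I$, kernel size $K$, padding $P$ and stride $S$ (symbols introduced here for illustration, not taken from the slides):

```latex
O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1

% 3 x 3 convolution, stride 1, pad 1, input 4 x 4:
O = \left\lfloor \frac{4 + 2 \cdot 1 - 3}{1} \right\rfloor + 1 = 4

% 3 x 3 convolution, stride 2, pad 1, input 4 x 4:
O = \left\lfloor \frac{4 + 2 \cdot 1 - 3}{2} \right\rfloor + 1 = 2
```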
  40. 40. 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 40 Slide credit: CS231n (Stanford University) Learnable upsampling with Transposed Convolutions
  41. 41. 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values Learnable Upsample: Transposed Convolution 41 Slide credit: CS231n (Stanford University)
  42. 42. Learnable Upsample: Transposed Convolution Slide Credit: CS231n 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values Sum where output overlaps 42
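The two phrases on these slides, "input gives weight for filter values" and "sum where output overlaps", can be made concrete with a small single-channel sketch in pure Python. This is an illustration under stated assumptions, not a library implementation: `transposed_conv2d` is a hypothetical helper, the all-ones kernel is chosen only to make the overlaps easy to read, and the 4 x 4 output is obtained by cropping `pad` rows and columns from the top/left of the full 5 x 5 result.

```python
def transposed_conv2d(x, kernel, stride, pad, out_size):
    """Single-channel transposed convolution on square inputs.

    Each input value scales a copy of the kernel, which is pasted into the
    output at a position determined by the stride; overlapping pastes are
    summed. Assumption: `pad` is cropped from the top/left only, and the
    result is cut to out_size x out_size.
    """
    n, k = len(x), len(kernel)
    full = (n - 1) * stride + k  # output size before cropping
    canvas = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            for u in range(k):
                for v in range(k):
                    # Input gives weight for filter values; sum where output overlaps.
                    canvas[i * stride + u][j * stride + v] += x[i][j] * kernel[u][v]
    return [row[pad:pad + out_size] for row in canvas[pad:pad + out_size]]


x = [[1, 2],
     [3, 4]]                              # 2 x 2 input
kernel = [[1, 1, 1] for _ in range(3)]    # 3 x 3 filter (all ones, for readability)

out = transposed_conv2d(x, kernel, stride=2, pad=1, out_size=4)
for row in out:
    print(row)
```

With this kernel, output cells covered by a single paste carry one input value, while the cell where all four pastes overlap holds their sum (1 + 2 + 3 + 4 = 10). In a trained network the kernel values are learned, which is what makes the upsampling "learnable".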
  43. 43. Learnable Upsample: Transposed Convolution Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. ICCV 2015. “Regular” VGG “Upside down” VGG 43
  44. 44. Learnable Upsample. Limitation of upsampling from deep CNN layers: Deeper layers are specialized for higher-level semantic tasks, not for capturing the fine-grained details required for segmentation. Figure: Highest activations along CNN depth.
  45. 45. Skip Connections. Solution: Combine predictions from features at different depths (“skip connections”). Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
  46. 46. Skip connections to intermediate layers. #U-Net Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI 2015
  47. 47. 47 Receptive Field Receptive field: Part of the input data that is visible to a neuron. It increases as we stack more convolutional layers (i.e. neurons in deeper layers have larger receptive fields). André Araujo, Wade Norris, Jack Sim, “Computing Receptive Fields of Convolutional Neural Networks”. Distill.pub 2019. Problem: Receptive field may be limited, and pixel-wise predictions at the deepest layer may not be aware of the whole image.
  48. 48. 48 Receptive Field: Dilated (atrous) convolutions Slide: Alicja Kwasniewska (ISSonDL 2020)
  49. 49. Dilated Convolutions ● By adding more layers with exponentially increasing dilation factors: ○ The receptive field grows exponentially. ○ The number of learnable parameters (filter weights) grows linearly. Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. ICLR 2016.
  50. 50. Dilated Convolutions. Source: https://github.com/vdumoulin/conv_arithmetic
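The exponential-versus-linear claim about dilated convolutions can be verified with a short calculation. The sketch below assumes stride-1 layers, single-channel 3 x 3 filters, and dilation factors that double at every layer; the function name `receptive_field` is illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (one side, in pixels) of stacked stride-1 convolutions.

    Each layer with dilation d extends the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


layers = 4
plain = receptive_field(3, [1] * layers)                         # no dilation
dilated = receptive_field(3, [2 ** i for i in range(layers)])    # dilations 1, 2, 4, 8
weights = 3 * 3 * layers  # identical in both cases: 9 weights per layer

print(plain, dilated, weights)
```

Both stacks hold the same number of weights, which grows linearly with depth, but the dilated stack covers 31 x 31 pixels after four layers while the plain stack covers 9 x 9.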
  51. 51. Dilated Convolutions + Spatial Pyramid Pooling (SPP) 51 #SPP He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 2015. #PSPNet Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. CVPR 2017.
  52. 52. State-of-the-art models 52 ● DeepLab v3+: Atrous Convolutions + Spatial Pyramid Pooling + Encoder-Decoder #DeepLabv3+ Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV 2018
  53. 53. Outline From Global to Local-scale Image Classification Semantic Segmentation ● Deconvolution (or transposed convolution) ● Skip Connections ● Dilated Convolution Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 53
  54. 54. Proposal-based 54 Typical object detection/segmentation pipelines: Object proposal Refinement and Classification Dog 0.85 Cat 0.80 Dog 0.75 Cat 0.90
  55. 55. Proposal-based 55 Typical object detection/segmentation pipelines: Object proposal Refinement and Classification Dog 0.85 Cat 0.80 Dog 0.75 Cat 0.90 NMS: Non-Maximum Suppression
  56. 56. Proposal-based 56 Typical object detection/segmentation pipelines: Object proposal Refinement and Classification Dog 0.85 Cat 0.80 Dog 0.75 Cat 0.90 Binary Map Binary Map
  57. 57. Proposal-based. External segment proposals. Mask out background with mean image. Similar to R-CNN, but with segment proposals. Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014. Slide Credit: CS231n
  58. 58. Proposal based: Detection - Faster R-CNN Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals 58 Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 Learn proposals end-to-end sharing parameters with the classification network
  59. 59. He et al. Mask R-CNN. ICCV 2017 Proposal-based Instance Segmentation: Mask R-CNN Faster R-CNN for Pixel Level Segmentation as a parallel prediction of masks and class labels 59
  60. 60. Mask R-CNN He et al. Mask R-CNN. ICCV 2017 Object Detection Object Detection and Segmentation
  61. 61. Mask R-CNN: RoI Align. RoI Pool from Fast R-CNN: hi-res input image (3 x 800 x 600, with region proposal) → convolution and pooling → hi-res conv features (C x H x W, with region proposal) → max-pool within each grid cell → RoI conv features (C x h x w for the region proposal) → fully-connected layers, which expect low-res conv features of size C x h x w. Problem: x/16 & rounding → misalignment! + not differentiable. He et al. Mask R-CNN. ICCV 2017
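The misalignment caused by "x/16 & rounding" can be illustrated by sampling a feature map at one point in two ways. The sketch below is an assumption-laden illustration, not the Mask R-CNN implementation: the 4 x 4 feature map, the pixel coordinates and the helper `bilinear_sample` are made up, and quantization is modeled here as truncation. RoI Align keeps the fractional coordinate and reads the feature map with bilinear interpolation instead.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x).

    Assumes 0 <= y <= H - 1 and 0 <= x <= W - 1.
    """
    H, W = len(feat), len(feat[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0][x0] * (1 - dy) * (1 - dx)
            + feat[y0][x1] * (1 - dy) * dx
            + feat[y1][x0] * dy * (1 - dx)
            + feat[y1][x1] * dy * dx)


STRIDE = 16  # total downsampling between image and feature map, as on the slide

# Toy feature map whose value encodes its position: feat[i][j] = 10*i + j.
feat = [[10.0 * i + j for j in range(4)] for i in range(4)]

py, px = 36, 24                      # a point of the region proposal, in image pixels
fy, fx = py / STRIDE, px / STRIDE    # 2.25, 1.5 on the feature map

quantized = feat[int(fy)][int(fx)]       # RoI Pool style: snap to a grid cell
aligned = bilinear_sample(feat, fy, fx)  # RoI Align style: keep the fraction

print(quantized, aligned)
```

Snapping (2.25, 1.5) to cell (2, 1) is equivalent to moving the point from (36, 24) to (32, 16) in image pixels, which is the misalignment the slide refers to; the interpolated value varies smoothly with the coordinate instead.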
  62. 62. 62
  63. 63. Limitations of Proposal-based models: 1. Two objects might share the same bounding box: only one will be kept after the NMS step. 2. The choice of NMS threshold is application dependent. 3. The same pixel can be assigned to multiple instances. 4. The number of predictions is limited by the number of proposals.
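The first two limitations are easier to see with non-maximum suppression written out. The following is a minimal sketch of greedy, class-agnostic NMS in pure Python; the boxes, scores and thresholds are invented for illustration, and the function names `iou` and `nms` are not taken from any particular library.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by decreasing score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes (possibly two distinct objects) and one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.90, 0.80, 0.75]

strict = nms(boxes, scores, iou_threshold=0.5)  # second box is suppressed
loose = nms(boxes, scores, iou_threshold=0.7)   # second box survives

print(strict, loose)
```

The first two boxes have an IoU of about 0.68, so whether the second detection survives depends entirely on the threshold. If those boxes belonged to two different objects, the stricter setting would discard a correct detection.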
  64. 64. Single-shot Instance Segmentation ● Improving RetinaNet (single-shot object detector) in three ways: ○ Integrating instance mask prediction ○ Making the loss function adaptive and more stable ○ Including hard examples in training #RetinaMask Fu et al. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. ArXiv 2019
  65. 65. 65 CNN Cat A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
  66. 66. 66 Cat Grass Stone CNN RNN CNN CNN RNN
  67. 67. 67 CNN RNN CNN CNN RNN CNN CNN CNN
  68. 68. Recurrent Instance Segmentation: Sequential mask generation. Romera-Paredes, B., & Torr, P. H. S. Recurrent Instance Segmentation. ECCV 2016
  69. 69. Recurrent Instance Segmentation. Salvador, A., Bellver, M., Campos, V., Baradad, M., Marqués, F., Torres, J., & Giro-i-Nieto, X. (2018). From Pixels to Object Sequences: Recurrent Semantic Instance Segmentation.
  70. 70. Recurrent Instance Segmentation #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019. time (frame sequence) space (object sequence)
  71. 71. Outline Segmentation Datasets Segmentation Applications Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation ● Proposal-Based ● Recurrent ● DETR Panoptic Segmentation 71
  72. 72. Semantic + Instance = Panoptic Segmentation. #PS Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2019). Panoptic segmentation. CVPR 2019.
  73. 73. Panoptic Segmentation: methods 73 ● UPSNet: A Unified Panoptic Segmentation Network Mask R-CNN design #UPSNET Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., & Urtasun, R. (2019). Upsnet: A unified panoptic segmentation network. CVPR 2019.
  74. 74. Panoptic Segmentation: methods ● UPSNet: A Unified Panoptic Segmentation Network. Xiong et al. UPSNet: A Unified Panoptic Segmentation Network. CVPR 2019
  75. 75. Summary Semantic Segmentation Methods ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation Methods ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 75
  76. 76. Latest advances ● Bolya et al. YOLACT Real-time Instance Segmentation. ICCV 2019 ● #Axial-DeepLab Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., & Chen, L. C. (2020). Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020. ● #SOLO Wang, X., Kong, T., Shen, C., Jiang, Y., & Li, L. (2019). Solo: Segmenting objects by locations. ECCV 2020 ● Fast Semantic Segmentation with MobileNet in PyTorch. 76
  77. 77. Segmentation Datasets. Pascal Visual Object Classes: ● 20 categories ● +10,000 images ● Semantic segmentation GT ● Instance segmentation GT. Pascal Context: ● 540 categories ● +10,000 images ● Dense annotations ● Semantic segmentation GT ● Objects + stuff
  78. 78. Segmentation Datasets. COCO (Common Objects in Context): ● Real indoor & outdoor scenes ● 80 categories ● +300,000 images ● 2M instances ● Partial annotations ● Semantic segmentation GT ● Instance segmentation GT ● Objects, but no stuff. ADE20K: ● Real general scenes ● +150 categories ● +22,000 images ● Semantic segmentation GT ● Instance + parts segmentation GT ● Objects and stuff
  79. 79. Segmentation Datasets. Open Images V6: ● Real general scenes ● 350 categories ● +950,000 images ● 2,700,000 instance segmentations ● Instance segmentation GT ● Objects
  80. 80. Segmentation Datasets. LVIS: ● Real general scenes ● 1,000 categories ● 164,000 images ● 2,200,000 instance segmentations ● 11.2 object instances from 3.4 categories per image on average (more complex images than Open Images and MS COCO) ● Instance segmentation GT ● Objects
  81. 81. Segmentation Datasets. CityScapes: ● Real driving scenes ● 30 categories ● +25,000 images ● 20,000 partial annotations ● 5,000 dense annotations ● Semantic segmentation GT ● Instance segmentation GT ● Depth, GPS and other metadata ● Objects and stuff. Mapillary Vistas Dataset: ● Real driving scenes covering 6 continents with a variety of weather/season/time of day/camera/viewpoint ● 152 categories ● 25,000 images ● Semantic segmentation GT ● Instance + parts segmentation GT ● Objects and stuff
  82. 82. Our research
  83. 83. Hands on Carles Ventura cventuraroy@uoc.edu Lecturer Universitat Oberta de Catalunya
