Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonDL 2020

Universitat Politècnica de Catalunya
Image Segmentation with Deep Learning
Xavier Giro-i-Nieto
UPC & BSC Barcelona
Carles Ventura
UOC Barcelona
Xavier Giro-i-Nieto
Associate Professor at Universitat Politècnica de Catalunya (UPC) in Barcelona, Catalonia.
IDEAI Center for Intelligent Data Science & Artificial Intelligence
@DocXavi
xavier.giro@upc.edu
https://sites.google.com/view/dlbcn2018/home https://sites.google.com/view/dlbcn2019/home
Deep Learning Barcelona Symposium
Foundations
● MSc course [2017] [2018] [2019]
● BSc course [2018] [2019] [2020]
Multimedia Applications
Vision: [2016] [2017][2018][2019]
Language & Speech: [2017] [2018] [2019]
Reinforcement Learning
● [2020 Spring] [2020 Autumn]
Deep Learning @ UPC TelecomBCN
4th (face-to-face) & 5th edition (online) start November 2020. Sign up here.
Online Postgraduate Course
Àgata Lapedriza (UOC)
Xavier Giró (UPC-BSC)
Xavier Suau (Apple)
Marta Ruiz (UPC)
Carles Ventura (UOC)
Jordi Pons (Dolby)
Jordi Torres (BSC)
Elisenda Bou (Vilynx)
Daniel Fojo (Glovo)
Acknowledgements
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
[DLCV 2016]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
[DLCV 2017]
Míriam Bellver
miriam.bellver@bsc.edu
PhD Candidate
Barcelona Supercomputing Center
[DLCV 2018] [DLCV 2018]
From image to pixels classification (segmentation)
Slide inspired by cs231n lecture from Stanford University.
[Figure: Image Classification, Object Detection and Image Segmentation of the same scene, each labelled “chair”, “bin”]
Segmentation
Segmentation: Define the accurate boundaries of all objects in an image by predicting a class for each pixel (a class map).
Segmentation Applications
● Autonomous driving
● Medical imaging (Image source: DRIVE, Digital Retinal Images for Vessel Extraction)
● Robotic applications
● Scene understanding
Outline
From Global to Local-scale Image Classification
Semantic Segmentation
● Deconvolution (or transposed convolution)
● Dilated Convolution
● Skip Connections
Instance Segmentation
● Proposal-Based
● Recurrent
● Instance Embedding
Panoptic Segmentation
Figure: Jeremy Jordan (2018)
From Image to Pixel Classification (Segmentation)
From Image to Pixel Classification (Segmentation)
Naive approach: Train a sliding window classifier.
Extract patch → Run through a CNN → Classify center pixel (e.g. “COW”) → Repeat for every pixel
Slide: CS231n (Stanford University)
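The sliding window idea can be sketched in a few lines of NumPy. This is only an illustration: `toy_classifier` is a hypothetical stand-in for a trained CNN, and the patch size and reflect padding are assumptions, not something the slides specify.

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch=5):
    """Label every pixel by classifying the patch centred on it."""
    half = patch // 2
    # Reflect-pad so that border pixels also get a full patch (an assumption).
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + patch, x:x + patch]   # extract patch
            labels[y, x] = classify_patch(window)        # classify center pixel
    return labels

def toy_classifier(window):
    """Hypothetical stand-in for a CNN: class 1 if the patch is bright."""
    return int(window.mean() > 0.5)

image = np.zeros((8, 8, 3))
image[:, 4:] = 1.0                 # right half of the image is bright
label_map = sliding_window_segment(image, toy_classifier)
```

The classifier runs once per pixel (64 calls for this 8 x 8 image), which is why the approach is considered naive.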
From Global to Local-scale Image Classification
Convolutionize: Run “fully convolutional” network to get all pixels at once.
Slide concept: CS231n (Stanford University)
Convolutionize: Formulate each neuron in a fully connected (FC) layer as a
convolutional filter (kernel) of a convolutional layer:
Input: 3x2x2 tensor (RGB image of 2x2)
2 fully connected neurons: 3x2x2 * 2 weights
2 convolutional filters of 3 x 2 x 2 (same size as input tensor): 3x2x2 * 2 weights
A model trained for image classification on low-definition images can provide local
responses when fed with high-definition images.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR
2015. (original figure has been modified)
Campos, V., Jou, B., & Giro-i-Nieto, X. From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction.
Image and Vision Computing (2017).
The FC to Conv redefinition allows generating heatmaps of the class prediction over
the input images.
Limitation: Pooling layers in the CNN will decrease the spatial definition of the output.
Figure: Alicja Kwasniewska (ISSonDL 2020)
Slide concept: CS231n (Stanford University)
Semantic Segmentation
Label every pixel!
Don’t differentiate instances (cows)
Classic computer vision problem
Slide: CS231n (Stanford University)
Instance Segmentation
Detect instances, give category, label pixels
“simultaneous detection and segmentation” (SDS)
Labels are class-aware and instance-aware
Slide: CS231n (Stanford University)
Slide Credit: https://www.jeremyjordan.me/semantic-segmentation/
Semantic Segmentation
Limitation of convolutionizing CNNs for image classification:
Pooling layers in the CNN will decrease the spatial definition of the output.
Slide concept: CS231n (Stanford University)
Learnable upsampling
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
Slide: Alicja Kwasniewska (ISSonDL 2020)
Learnable Upsample: Transposed Convolution
Reminder: Convolutional Layer
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Slide credit: CS231n (Stanford University)
Reminder: Convolutional Layer
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
Slide credit: CS231n (Stanford University)
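The input and output sizes quoted in these reminder slides follow from the usual size formula for convolution and pooling layers. A minimal sketch (the 224-pixel input with five 2x2 poolings is an assumed VGG-like example, not taken from the slides):

```python
def conv_output_size(n, kernel, stride=1, pad=0):
    """Output size along one axis of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1

same = conv_output_size(4, kernel=3, stride=1, pad=1)   # 4 x 4 input stays 4 x 4
half = conv_output_size(4, kernel=3, stride=2, pad=1)   # 4 x 4 input becomes 2 x 2

# Assumed example: five 2x2, stride-2 poolings on a 224-pixel-wide input.
size = 224
for _ in range(5):
    size = conv_output_size(size, kernel=2, stride=2)
```

Each stride-2 layer halves the spatial definition, which is the limitation that the upsampling techniques try to undo.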
Learnable Upsample: Transposed Convolution
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter values
Sum where output overlaps
Slide credit: CS231n (Stanford University)
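The two steps named on the slide (each input value weights a copy of the filter, and overlapping copies are summed) can be sketched in NumPy. Reaching exactly 4 x 4 from a 2 x 2 input with a 3 x 3 filter, stride 2 and pad 1 needs one extra output row and column; the `output_padding=1` parameter below is an assumption made for that purpose, following a convention common in deep learning frameworks.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=2, pad=1, output_padding=1):
    """Single-channel transposed convolution by 'stamping' scaled kernels."""
    h, w = x.shape
    k = kernel.shape[0]
    # Canvas large enough to hold every stamped copy of the kernel.
    full = np.zeros(((h - 1) * stride + k + output_padding,
                     (w - 1) * stride + k + output_padding))
    for i in range(h):
        for j in range(w):
            # The input value gives the weight for the filter values;
            # "+=" sums the contributions where stamped copies overlap.
            full[i * stride:i * stride + k,
                 j * stride:j * stride + k] += x[i, j] * kernel
    # Padding in a transposed convolution crops the border of the canvas.
    return full[pad:full.shape[0] - pad, pad:full.shape[1] - pad]

x = np.array([[1.0, 2.0], [3.0, 4.0]])
out = conv_transpose2d(x, np.ones((3, 3)))   # shape (4, 4)
```

With an all-ones filter the output entry at row 1, column 1 is 10, the sum of all four inputs, because every stamped copy overlaps there. In a network the filter values are learned rather than fixed.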
Learnable Upsample: Transposed Convolution
Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. ICCV 2015.
“Regular” VGG “Upside down” VGG
Limitation of upsampling from deep CNN layers: Deeper layers
are specialized for higher-level semantic tasks, not for capturing
the fine-grained details required for segmentation.
Highest activations along CNN depth
Learnable Upsample
Skip Connections
Solution: Combine predictions from features at different depths (“skip connections”).
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
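One way to picture the combination is as an element-wise sum of a coarse score map from a deep layer, upsampled, and a finer score map from a shallower layer. The sketch below uses nearest-neighbour upsampling and random scores purely for illustration; the shapes, the stride values in the comments and the upsampling choice are assumptions (the cited paper uses learnable upsampling).

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Nearest-neighbour upsampling of a (classes, H, W) score map."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
deep_scores = rng.normal(size=(3, 2, 2))      # coarse map from a deep layer (e.g. stride 32)
shallow_scores = rng.normal(size=(3, 4, 4))   # finer map from an earlier layer (e.g. stride 16)

# Combine predictions from different depths by summing aligned score maps.
fused = upsample_nearest(deep_scores, 2) + shallow_scores
label_map = fused.argmax(axis=0)              # per-pixel class decision
```

The fused map has the resolution of the shallower layer while still using the semantic evidence from the deeper one.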
#U-Net Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI 2015
Skip connections to intermediate layers
Receptive Field
Receptive field: Part of the input data that is visible to a neuron.
It increases as we stack more convolutional layers (i.e. neurons in deeper layers
have larger receptive fields).
André Araujo, Wade Norris, Jack Sim, “Computing Receptive Fields of Convolutional Neural Networks”. Distill.pub
2019.
Problem: Receptive field may be limited, and pixel-wise predictions at
the deepest layer may not be aware of the whole image.
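How the receptive field grows with depth can be computed with a short recurrence: each layer adds (kernel - 1) steps, measured in input pixels, and strides multiply the step size of later layers. The layer configurations below are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field along one axis for a list of (kernel, stride) layers."""
    rf, jump = 1, 1                 # jump: distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # the layer widens the field by (k - 1) jumps
        jump *= stride              # striding spreads later layers further apart
    return rf

three_convs = receptive_field([(3, 1)] * 3)                 # three 3x3 convolutions
with_pooling = receptive_field([(3, 1), (2, 2), (3, 1)])    # conv, 2x2 pooling, conv
```

Three stacked 3 x 3 convolutions see a 7 x 7 input window, so without striding or dilation the field only grows linearly with depth.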
Receptive Field: Dilated (atrous) convolutions
Slide: Alicja Kwasniewska (ISSonDL 2020)
Dilated Convolutions
● By stacking layers with exponentially increasing dilation:
○ The receptive field grows exponentially.
○ The number of learnable parameters (filter weights) grows linearly.
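A quick numerical check of both claims, assuming 3 x 3 filters whose dilation doubles at every layer (1, 2, 4, ...), as in the cited paper by Yu and Koltun. The parameter count is per input/output channel pair.

```python
def dilated_stack(num_layers, kernel=3):
    """Receptive field and weight count after each layer, dilations 1, 2, 4, ..."""
    rf, params = 1, 0
    history = []
    for layer in range(num_layers):
        dilation = 2 ** layer
        rf += (kernel - 1) * dilation   # dilation stretches the filter footprint
        params += kernel * kernel       # but the number of weights per layer is fixed
        history.append((layer + 1, dilation, rf, params))
    return history

history = dilated_stack(4)
# Each entry: (layer, dilation, receptive field along one axis, cumulative weights)
```

For four layers the receptive field goes 3, 7, 15, 31 while the weight count goes 9, 18, 27, 36.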
Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. ICLR 2016.
Dilated Convolutions
Source: https://github.com/vdumoulin/conv_arithmetic
Dilated Convolutions + Spatial Pyramid Pooling (SPP)
#SPP He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual
recognition. TPAMI 2015.
#PSPNet Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. CVPR 2017.
State-of-the-art models
● DeepLab v3+: Atrous Convolutions + Spatial Pyramid Pooling + Encoder-Decoder
#DeepLabv3+ Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous
separable convolution for semantic image segmentation. ECCV 2018
Proposal-based
Typical object detection/segmentation pipelines:
Object proposal → Refinement and Classification
Example detections: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90
NMS: Non-Maximum Suppression
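A minimal greedy NMS sketch in NumPy. The scores reuse the values shown on the slide, while the box coordinates and the 0.5 IoU threshold are made up for illustration. For brevity it is class-agnostic; detectors usually apply NMS per class.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

# Hypothetical boxes: two overlapping "dog" boxes and two overlapping "cat" boxes.
boxes = np.array([[10, 10, 50, 50],     # Dog 0.85
                  [60, 10, 100, 50],    # Cat 0.80
                  [12, 12, 52, 52],     # Dog 0.75
                  [62, 12, 102, 52]],   # Cat 0.90
                 dtype=float)
scores = np.array([0.85, 0.80, 0.75, 0.90])
kept = nms(boxes, scores)
```

Here `kept` holds the indices of the Cat 0.90 and Dog 0.85 boxes; the two lower-scoring duplicates are suppressed.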
Proposal-based
For segmentation, the pipeline also outputs a Binary Map for each detection.
Proposal-based
Similar to R-CNN, but with segment proposals
External segment proposals
Mask out background with mean image
Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014
Slide Credit: CS231n
Proposal based: Detection - Faster R-CNN
Learn proposals end-to-end sharing parameters with the classification network
[Figure: Faster R-CNN architecture: Conv layers (Conv5_3), Region Proposal Network (RPN Proposals), RoI Pooling, FC6, FC7, FC8, Class probabilities]
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
Proposal-based Instance Segmentation: Mask R-CNN
Faster R-CNN for Pixel Level Segmentation as a parallel prediction of masks and class labels
[Figure panels: Object Detection; Object Detection and Segmentation]
He et al. Mask R-CNN. ICCV 2017
Mask R-CNN: RoI Align
RoI Pool from Fast R-CNN
Hi-res input image: 3 x 800 x 600 with region proposal
Convolution and Pooling
Hi-res conv features: C x H x W with region proposal
Max-pool within each grid cell
RoI conv features: C x h x w for region proposal
Fully-connected layers expect low-res conv features: C x h x w
x/16 & rounding → misalignment! + not differentiable
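The misalignment can be made concrete with a toy example: a proposal coordinate is divided by the stride of 16 and rounded, whereas an RoI Align style lookup keeps the fractional coordinate and interpolates bilinearly. The feature map, the coordinate 203 and the single sampling point are assumptions for illustration; Mask R-CNN samples several points per bin and aggregates them.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

stride = 16
x_image = 203.0                           # hypothetical proposal edge, in image pixels
x_exact = x_image / stride                # 12.6875 on the feature map
x_rounded = int(np.round(x_exact))        # RoI Pool style quantization
misalignment_px = abs(x_rounded * stride - x_image)   # error measured back in image pixels

# Toy feature map whose value encodes its own position: feature[y, x] = 20 * y + x.
feature = np.arange(400, dtype=float).reshape(20, 20)
pooled_value = feature[3, x_rounded]                   # value read after rounding
aligned_value = bilinear_sample(feature, 3.0, x_exact) # value at the exact location
```

Rounding moves the proposal edge by 5 image pixels in this example, while the interpolated lookup reads the feature at the exact fractional position.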
Limitations of Proposal-based models
1. Two objects might share the same bounding box: Only one will be kept after NMS step.
2. Choice of NMS threshold is application dependent.
3. Same pixel can be assigned to multiple instances.
4. Number of predictions is limited by the number of proposals.
Single-shot Instance Segmentation
● Improving RetinaNet (single-shot object detector) in three ways:
○ Integrating instance mask prediction
○ Making the loss function adaptive and more stable
○ Including hard examples in training
#RetinaMask Fu et al. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. ArXiv 2019
[Figure: CNN → “Cat”]
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
[Figure: CNN + RNN → “Cat”, “Grass”, “Stone”]
Recurrent Instance Segmentation
Romera-Paredes, B., & Torr, P. H. S. Recurrent Instance Segmentation. ECCV 2016.
Sequential mask generation
Salvador, A., Bellver, M., Campos, V., Baradad, M., Marqués, F., Torres, J., & Giro-i-Nieto, X. (2018). From Pixels to Object Sequences: Recurrent Semantic Instance Segmentation.
#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
time (frame sequence)
space (object sequence)
Outline
Segmentation Datasets
Segmentation Applications
Semantic Segmentation
● Deconvolution (or transposed convolution)
● Dilated Convolution
● Skip Connections
Instance Segmentation
● Proposal-Based
● Recurrent
● DETR
Panoptic Segmentation
Semantic + Instance = Panoptic Segmentation
#PS Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2019). Panoptic segmentation. CVPR 2019.
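As a rough illustration of the "Semantic + Instance" idea, the sketch below merges a semantic class map with a list of instance masks into a single map where every pixel gets exactly one (class, instance) pair. The `class_id * 1000 + instance_id` encoding and the rule that earlier (more confident) instances win overlaps are assumptions chosen for this example, not the method of any paper cited here.

```python
import numpy as np

def merge_panoptic(semantic, instances, label_divisor=1000):
    """semantic: (H, W) class ids. instances: list of (class_id, boolean mask),
    assumed sorted by decreasing confidence."""
    panoptic = semantic.astype(np.int64) * label_divisor   # "stuff" pixels: instance id 0
    taken = np.zeros(semantic.shape, dtype=bool)
    for instance_id, (class_id, mask) in enumerate(instances, start=1):
        free = mask & ~taken          # a pixel belongs to at most one instance
        panoptic[free] = class_id * label_divisor + instance_id
        taken |= free
    return panoptic

# Toy scene: class 0 = grass (stuff), class 1 = cow (thing), two overlapping cows.
semantic = np.array([[0, 1, 1, 1],
                     [0, 1, 1, 1]])
cow_a = np.zeros((2, 4), dtype=bool)
cow_a[:, 1:3] = True
cow_b = np.zeros((2, 4), dtype=bool)
cow_b[:, 2:4] = True
panoptic = merge_panoptic(semantic, [(1, cow_a), (1, cow_b)])
```

In the result, grass pixels are 0, the first cow is 1001 and the second cow is 1002; the contested column goes to the first cow.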
Panoptic Segmentation: methods
● UPSNet: A Unified Panoptic Segmentation Network
Mask R-CNN design
#UPSNET Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., & Urtasun, R. (2019). UPSNet: A unified panoptic segmentation network. CVPR 2019.
Summary
Semantic Segmentation Methods
● Deconvolution (or transposed convolution)
● Dilated Convolution
● Skip Connections
Instance Segmentation Methods
● Proposal-Based
● Recurrent
● Instance Embedding
Panoptic Segmentation
Latest advances
● Bolya et al. YOLACT: Real-time Instance Segmentation. ICCV 2019
● #Axial-DeepLab Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., & Chen, L. C. (2020). Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020.
● #SOLO Wang, X., Kong, T., Shen, C., Jiang, Y., & Li, L. (2019). SOLO: Segmenting objects by locations. ECCV 2020.
● Fast Semantic Segmentation with MobileNet in PyTorch.
Segmentation Datasets
Pascal Visual Object Classes
● 20 categories
● +10,000 images
● Semantic segmentation GT
● Instance segmentation GT
Pascal Context
● 540 categories
● +10,000 images
● Dense annotations
● Semantic segmentation GT
● Objects + stuff
COCO Common Objects in Context
● Real indoor & outdoor scenes
● 80 categories
● +300,000 images
● 2M instances
● Partial annotations
● Semantic segmentation GT
● Instance segmentation GT
● Objects, but no stuff
ADE20K
● Real general scenes
● +150 categories
● +22,000 images
● Semantic segmentation GT
● Instance + parts segmentation GT
● Objects and stuff
Open Images V6
● Real general scenes
● 350 categories
● +950,000 images
● 2,700,000 instance segmentations
● Instance segmentation GT
● Objects
LVIS
● Real general scenes
● 1,000 categories
● 164,000 images
● 2,200,000 instance segmentations
● 11.2 object instances from 3.4 categories on average per image (more complex images than Open Images and MS COCO)
● Instance segmentation GT
● Objects
CityScapes
● Real driving scenes
● 30 categories
● +25,000 images
● 20,000 partial annotations
● 5,000 dense annotations
● Semantic segmentation GT
● Instance segmentation GT
● Depth, GPS and other metadata
● Objects and stuff
Mapillary Vistas Dataset
● Real driving scenes covering 6 continents with variety of weather/season/time of day/camera/viewpoint
● 152 categories
● 25,000 images
● Semantic segmentation GT
● Instance + parts segmentation GT
● Objects and stuff
Our research
Hands on
Carles Ventura
cventuraroy@uoc.edu
Lecturer
Universitat Oberta de Catalunya
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonDL 2020

  • 1. Image Segmentation with Deep Learning Xavier Giro-i-Nieto UPC & BSC Barcelona Carles Ventura UOC Barcelona
  • 2. Xavier Giro-i-Nieto Associate Professor at Universitat Politecnica de Catalunya (UPC) in Barcelona, Catalonia. IDEAI Center for Intelligent Data Science & Artificial Intelligence @DocXavi xavier.giro@upc.edu
  • 4. Foundations ● MSc course [2017] [2018] [2019] ● BSc course [2018] [2019] [2020] Multimedia Applications Vision: [2016] [2017][2018][2019] Language & Speech: [2017] [2018] [2019] Reinforcement Learning ● [2020 Spring] [2020 Autumn] Deep Learning @ UPC TelecomBCN
  • 5. 4th (face-to-face) & 5th edition (online) start November 2020. Sign up here. Online Postgraduate Course Àgata Lapedriza (UOC) Xavier Giró (UPC-BSC) Xavier Suau (Apple) Marta Ruiz (UPC) Carles Ventura (UOC) Jordi Pons (Dolby) Jordi Torres (BSC) Elisenda Bou (Vilynx) Daniel Fojo (Glovo)
  • 6. Acknowledgements 6 Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya [DLCV 2016] Verónica Vilaplana veronica.vilaplana@upc.edu Associate Professor Universitat Politècnica de Catalunya [DLCV 2017] Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center [DLCV 2018] [DLCV 2018]
  • 7. From image to pixels classification (segmentation) 7 Slide inspired by cs231n lecture from Stanford University. Image Segmentation Object Detection Image Classification “chair”, “bin” “chair” “bin” “chair” “bin”
  • 8. Segmentation Segmentation: Define the accurate boundaries of all objects in an image predicting a class map for each pixel 8
  • 10. ● Medical imaging Image source: DRIVE Digital Retinal Image Vessel Extraction Segmentation Applications
  • 13. Outline From Global to Local-scale Image Classification Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 13
  • 14. 14 Figure: Jeremy Jordan (2018) From Image to Pixel Classification (Segmentation)
  • 15. From Image to Pixel Classification (Segmentation) 15
  • 16. Slide: CS231n (Stanford University) CNN COW Extract patch Run through a CNN Classify center pixel Repeat for every pixel 16 From Image to Pixel Classification (Segmentation) Naive approach: Train a sliding window classifier.
  • 17. Slide: CS231n (Stanford University) CNN COW Extract patch Run through a CNN Classify center pixel Repeat for every pixel 17 From Image to Pixel Classification (Segmentation) Naive approach: Train a sliding window classifier.
  • 18. CNN Convolutionize: Run “fully convolutional” network to get all pixels at once. 18 From Global to Local-scale Image Classification Slide: CS231n (Stanford University)
  • 19. CNN Convolutionize: Run “fully convolutional” network to get all pixels at once. 19 Slide concept: CS231n (Stanford University) From Global to Local-scale Image Classification
  • 20. Convolutionize: Formulate each neuron in a fully connected (FC) layer as a convolutional filter (kernel) of a convolutional layer: 20 3x2x2 tensor (RGB image of 2x2) 2 fully connected neurons 3x2x2 * 2 weights 2 convolutional filters of 3 x 2 x 2 (same size as input tensor) 3x2x2 * 2 weights From Global to Local-scale Image Classification
  • 21. 21 A model trained for image classification on low-definition images can provide local response when fed with high-definition images. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified) From Global to Local-scale Image Classification
  • 22. 22Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified) From Global to Local-scale Image Classification CNN Convolutionize: Run “fully convolutional” network to get all pixels at once...
  • 23. 23 From Global to Local-scale Image Classification Campos, V., Jou, B., & Giro-i-Nieto, X. . From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction. Image and Vision Computing. (2017) The FC to Conv redefinition allows generating heatmaps of the class prediction over the input images.
  • 24. 24 From Global to Local-scale Image Classification Limitation: Pooling layers in the CNN will decrease the spatial definition of the output. Figure: Alicja Kwasniewska (ISSonDL 2020)
  • 25. 25 From Global to Local-scale Image Classification CNN Limitation: Pooling layers in the CNN will decrease the spatial definition of the output. Slide concept: CS231n (Stanford University)
  • 26. Outline From Global to Local-scale Image Classification Semantic Segmentation ● Deconvolution (or transposed convolution) ● Skip Connections ● Dilated Convolutions Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 26
  • 27. Semantic Segmentation Label every pixel! Don’t differentiate instances (cows) Classic computer vision problem 27 Slide: CS231n (Stanford University)
  • 28. Instance Segmentation Detect instances, give category, label pixels “simultaneous detection and segmentation” (SDS) Labels are class-aware and instance-aware 28 Slide: CS231n (Stanford University)
  • 29. Outline Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections Instance Segmentation Methods ● Proposal-Based ● Recurrent ● Instance Embedding Panoptic Segmentation 29
  • 31. Semantic Segmentation 31 CNN Limitation of convolutionizing CNNs for image classification: Pooling layers in the CNN will decrease the spatial definition of the output. Slide concept: CS231n (Stanford University)
  • 32. Learnable upsampling 32Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
  • 33. 33 Slide: Alicja Kwasniewska (ISSonDL 2020) Learnable Upsample: Transposed Convolution
  • 34. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 34 Slide credit: CS231n (Stanford University)
  • 35. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input 35 Slide credit: CS231n (Stanford University)
  • 36. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input 36 Slide credit: CS231n (Stanford University)
  • 37. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 37 Slide credit: CS231n (Stanford University)
  • 38. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input 38 Slide credit: CS231n (Stanford University)
  • 39. Reminder: Convolutional Layer Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input 39 Slide credit: CS231n (Stanford University)
  • 40. 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 40 Slide credit: CS231n (Stanford University) Learnable upsampling with Transposed Convolutions
  • 41. 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values Learnable Upsample: Transposed Convolution 41 Slide credit: CS231n (Stanford University)
  • 42. Learnable Upsample: Transposed Convolution Slide Credit: CS231n 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values Sum where output overlaps 42
  • 43. Learnable Upsample: Transposed Convolution. Figure labels: “Regular” VGG, “Upside down” VGG. Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. ICCV 2015.
  • 44. Learnable Upsample. Limitation of upsampling from deep CNN layers: deeper layers are specialized for higher-level semantic tasks, not for capturing the fine-grained details required for segmentation. Figure: highest activations along CNN depth.
  • 45. Skip Connections. Solution: combine predictions from features at different depths (“skip connections”). Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015.
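One way to combine predictions from different depths is to upsample the coarse score map and add it to a finer one. The sketch below assumes single-channel score maps and nearest-neighbour upsampling for brevity; the helper names are illustrative, and real models typically use learned or bilinear upsampling and may concatenate features instead of summing them.

```python
def upsample_nearest(score, factor):
    """Repeat every value `factor` times along both spatial axes."""
    return [[v for v in row for _ in range(factor)]
            for row in score for _ in range(factor)]


def fuse(coarse, fine):
    """Upsample the coarse (deep) map to the fine (shallow) resolution and sum them."""
    factor = len(fine) // len(coarse)
    up = upsample_nearest(coarse, factor)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1.0, 2.0],
          [3.0, 4.0]]                      # 2 x 2 scores from a deep layer
fine = [[0.5] * 4 for _ in range(4)]       # 4 x 4 scores from a shallower layer
fused = fuse(coarse, fine)
for row in fused:
    print(row)
```

The fused map keeps the semantic evidence of the deep layer while inheriting the spatial resolution of the shallow one.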
  • 46. Skip connections to intermediate layers. #U-Net Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI 2015
  • 47. Receptive Field. Receptive field: part of the input data that is visible to a neuron. It increases as we stack more convolutional layers (i.e. neurons in deeper layers have larger receptive fields). Problem: the receptive field may be limited, and pixel-wise predictions at the deepest layer may not be aware of the whole image. André Araujo, Wade Norris, Jack Sim, “Computing Receptive Fields of Convolutional Neural Networks”. Distill.pub 2019.
  • 48. Receptive Field: Dilated (atrous) convolutions. Slide: Alicja Kwasniewska (ISSonDL 2020)
  • 49. Dilated Convolutions ● By adding more layers (with the dilation factor doubling at each layer): ○ The receptive field grows exponentially. ○ The number of learnable parameters (filter weights) grows linearly. Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. ICLR 2016.
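The exponential-versus-linear claim can be checked with a small calculation. The sketch assumes stride-1 layers with 3 x 3 filters, a single channel, and dilation factors that double at each layer (1, 2, 4, 8); these settings are illustrative.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field size after each stride-1 layer with the given dilations."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += (kernel - 1) * d   # each layer widens the field by (kernel - 1) * dilation
        sizes.append(rf)
    return sizes


dilated = receptive_field([1, 2, 4, 8])   # doubling dilation
plain = receptive_field([1, 1, 1, 1])     # ordinary convolutions
params = [9 * (layer + 1) for layer in range(4)]  # 3 x 3 weights per layer, accumulated

print(dilated)  # grows roughly as a power of two
print(plain)    # grows by 2 per layer
print(params)   # grows by 9 per layer in both cases
```

Both stacks use the same number of weights, but only the dilated one covers a wide context after a few layers.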
  • 51. Dilated Convolutions + Spatial Pyramid Pooling (SPP). #SPP He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 2015. #PSPNet Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. CVPR 2017.
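The idea behind spatial pyramid pooling is to pool the same feature map over grids of several resolutions, so that context is summarized at multiple scales. A minimal single-channel sketch follows; the pyramid levels (1, 2, 4) and the use of max pooling are assumptions for illustration, and the function names are hypothetical.

```python
def adaptive_max_pool(fmap, bins):
    """Max-pool a 2D map into a bins x bins grid, whatever the input size."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, ((i + 1) * h + bins - 1) // bins  # floor start, ceil end
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, ((j + 1) * w + bins - 1) // bins
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Concatenate the pooled grids of every pyramid level into one vector."""
    feats = []
    for bins in levels:
        for row in adaptive_max_pool(fmap, bins):
            feats.extend(row)
    return feats


small = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 map
large = [[r * 8 + c for c in range(8)] for r in range(6)]   # 6 x 8 map
feats_small = spatial_pyramid_pool(small)
feats_large = spatial_pyramid_pool(large)
print(len(feats_small), len(feats_large))
```

Both inputs produce a vector of the same length (1 + 4 + 16 values), independent of the spatial size of the feature map.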
  • 52. State-of-the-art models ● DeepLab v3+: Atrous Convolutions + Spatial Pyramid Pooling + Encoder-Decoder. #DeepLabv3+ Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV 2018
  • 53. Outline. From Global to Local-scale. Image Classification. Semantic Segmentation ● Deconvolution (or transposed convolution) ● Skip Connections ● Dilated Convolution. Instance Segmentation ● Proposal-Based ● Recurrent ● Instance Embedding. Panoptic Segmentation
  • 54. Proposal-based. Typical object detection/segmentation pipelines: object proposal, then refinement and classification. Example detections: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90.
  • 55. Proposal-based. Typical object detection/segmentation pipelines: object proposal, then refinement and classification. Example detections: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90. NMS: Non-Maximum Suppression.
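Non-Maximum Suppression is usually implemented greedily: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. The sketch below assumes boxes in (x1, y1, x2, y2) format, a single class, and an IoU threshold of 0.5; the boxes and scores are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, in decreasing score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.85, 0.75, 0.90]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68 and has a lower score, so it is suppressed. If those two boxes belonged to two different, heavily overlapping objects, one of them would be lost.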
  • 56. Proposal-based. Typical object detection/segmentation pipelines: object proposal, then refinement and classification. Example detections: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90. Each instance is output as a binary map.
  • 57. Proposal-based. Similar to R-CNN, but with segment proposals. External segment proposals. Mask out background with mean image. Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014. Slide credit: CS231n
  • 58. Proposal-based: Detection - Faster R-CNN. Learn proposals end-to-end, sharing parameters with the classification network. Diagram labels: Conv layers, Conv5_3, Region Proposal Network, RPN Proposals, RoI Pooling, FC6, FC7, FC8, Class probabilities. Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
  • 59. Proposal-based Instance Segmentation: Mask R-CNN. Faster R-CNN for pixel-level segmentation as a parallel prediction of masks and class labels. He et al. Mask R-CNN. ICCV 2017
  • 60. Mask R-CNN. Object Detection; Object Detection and Segmentation. He et al. Mask R-CNN. ICCV 2017
  • 61. Mask R-CNN: RoI Align. RoI Pool from Fast R-CNN: hi-res input image (3 x 800 x 600) with region proposal → convolution and pooling → hi-res conv features (C x H x W) with region proposal → max-pool within each grid cell → RoI conv features (C x h x w) for region proposal → fully-connected layers, which expect low-res conv features (C x h x w). Problem: x/16 & rounding → misalignment, and not differentiable. He et al. Mask R-CNN. ICCV 2017
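The misalignment caused by "x/16 & rounding" can be seen on a toy feature map. The sketch below compares reading the feature at a quantized coordinate with reading it by bilinear interpolation at the exact fractional coordinate, which is the idea RoI Align builds on. It is a simplified illustration: the stride of 16 comes from the slide, while the feature values, the sampled point, and the use of a single sample (RoI Align averages several samples per bin) are assumptions.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D map at a non-negative fractional (y, x) position."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


STRIDE = 16                                    # image pixels per feature-map cell
fmap = [[10.0 * c for c in range(4)] for _ in range(4)]  # values increase left to right

x_image, y_image = 40.0, 16.0                  # a point of the proposal, in image pixels
x_feat, y_feat = x_image / STRIDE, y_image / STRIDE      # 2.5 and 1.0 on the feature map

quantized = fmap[int(y_feat)][int(x_feat)]     # RoI Pool style: drop the fractional part
aligned = bilinear_sample(fmap, y_feat, x_feat)  # RoI Align style: keep it
print(quantized, aligned)
```

Dropping the fractional part shifts the sampled location by half a cell, which is 8 pixels in the input image. That is a large error for a pixel-level mask.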
  • 63. Limitations of Proposal-based models 1. Two objects might share the same bounding box: only one will be kept after the NMS step. 2. The choice of NMS threshold is application dependent. 3. The same pixel can be assigned to multiple instances. 4. The number of predictions is limited by the number of proposals.
  • 64. Single-shot Instance Segmentation ● Improving RetinaNet (single-shot object detector) in three ways: ○ Integrating instance mask prediction ○ Making the loss function adaptive and more stable ○ Including hard examples in training. #RetinaMask Fu et al. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. ArXiv 2019
  • 65. Figure: a CNN predicting “Cat”. A. Krizhevsky, I. Sutskever, G. E. Hinton. “Imagenet classification with deep convolutional neural networks”. NIPS 2012
  • 68. Recurrent Instance Segmentation. Sequential mask generation. Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016
  • 69. Recurrent Instance Segmentation. Salvador, A., Bellver, M., Campos, V., Baradad, M., Marqués, F., Torres, J., & Giro-i-Nieto, X. (2018). From Pixels to Object Sequences: Recurrent Semantic Instance Segmentation.
  • 70. Recurrent Instance Segmentation. Recurrence along time (frame sequence) and space (object sequence). #RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
  • 71. Outline. Segmentation Datasets. Segmentation Applications. Semantic Segmentation ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections. Instance Segmentation ● Proposal-Based ● Recurrent ● DETR. Panoptic Segmentation
  • 72. Semantic + Instance = Panoptic Segmentation. #PS Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2019). Panoptic segmentation. CVPR 2019.
  • 73. Panoptic Segmentation: methods ● UPSNet: A Unified Panoptic Segmentation Network (Mask R-CNN design). #UPSNET Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., & Urtasun, R. (2019). UPSNet: A unified panoptic segmentation network. CVPR 2019.
  • 74. Panoptic Segmentation: methods ● UPSNet: A Unified Panoptic Segmentation Network. Xiong et al. UPSNet: A Unified Panoptic Segmentation Network. CVPR 2019
  • 75. Summary. Semantic Segmentation Methods ● Deconvolution (or transposed convolution) ● Dilated Convolution ● Skip Connections. Instance Segmentation Methods ● Proposal-Based ● Recurrent ● Instance Embedding. Panoptic Segmentation
  • 76. Latest advances ● Bolya et al. YOLACT: Real-time Instance Segmentation. ICCV 2019 ● #Axial-DeepLab Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., & Chen, L. C. (2020). Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020. ● #SOLO Wang, X., Kong, T., Shen, C., Jiang, Y., & Li, L. (2019). SOLO: Segmenting objects by locations. ECCV 2020 ● Fast Semantic Segmentation with MobileNet in PyTorch.
  • 77. Segmentation Datasets. Pascal Visual Object Classes: ● 20 categories ● +10,000 images ● Semantic segmentation GT ● Instance segmentation GT. Pascal Context: ● 540 categories ● +10,000 images ● Dense annotations ● Semantic segmentation GT ● Objects + stuff
  • 78. Segmentation Datasets. COCO (Common Objects in Context): ● Real indoor & outdoor scenes ● 80 categories ● +300,000 images ● 2M instances ● Partial annotations ● Semantic segmentation GT ● Instance segmentation GT ● Objects, but no stuff. ADE20K: ● Real general scenes ● +150 categories ● +22,000 images ● Semantic segmentation GT ● Instance + parts segmentation GT ● Objects and stuff
  • 79. Segmentation Datasets. Open Images V6: ● Real general scenes ● 350 categories ● +950,000 images ● 2,700,000 instance segmentations ● Instance segmentation GT ● Objects
  • 80. Segmentation Datasets. LVIS: ● Real general scenes ● 1,000 categories ● 164,000 images ● 2,200,000 instance segmentations ● 11.2 object instances from 3.4 categories on average per image (more complex images than Open Images and MS COCO) ● Instance segmentation GT ● Objects
  • 81. Segmentation Datasets. CityScapes: ● Real driving scenes ● 30 categories ● +25,000 images ● 20,000 partial annotations ● 5,000 dense annotations ● Semantic segmentation GT ● Instance segmentation GT ● Depth, GPS and other metadata ● Objects and stuff. Mapillary Vistas Dataset: ● Real driving scenes covering 6 continents with a variety of weather/season/time of day/camera/viewpoint ● 152 categories ● 25,000 images ● Semantic segmentation GT ● Instance + parts segmentation GT ● Objects and stuff