https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection and image captioning.
19. Learnable Upsample: Deconvolutional Layer
Slide Credit: CS231n
3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter values.
20. Learnable Upsample: Deconvolutional Layer
Slide Credit: CS231n
3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter values; sum where the output overlaps.
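The two slides above can be sketched in NumPy (a minimal sketch, not the CS231n code): each input value weights a copy of the 3 x 3 filter, copies are placed at stride-2 offsets, and overlapping contributions are summed. Frameworks then crop the padding; e.g. PyTorch's nn.ConvTranspose2d(kernel_size=3, stride=2, padding=1, output_padding=1) trims the full 5 x 5 map below down to the slide's 4 x 4.

```python
import numpy as np

def transposed_conv2d(x, k, stride=2):
    """Each input value weights a copy of the kernel placed at stride
    offsets; overlapping contributions are summed in the output."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * k
    return out

x = np.ones((2, 2))      # 2 x 2 input
k = np.ones((3, 3))      # 3 x 3 filter
y = transposed_conv2d(x, k, stride=2)
print(y.shape)           # (5, 5) before cropping the padding
print(y[2, 2])           # 4.0 -- four filter copies overlap at the centre
```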
21. Learnable Upsample: Deconvolutional Layer
Warning: checkerboard effect when the kernel size is not divisible by the stride.
Source: distill.pub
22. Learnable Upsample: Deconvolutional Layer
stride = 2, kernel_size = 3
Warning: checkerboard effect when the kernel size is not divisible by the stride.
Source: distill.pub
24. Better Upsampling: Subpixel
Re-arrange the features of the previous convolutional layer to form a higher-resolution output.
Shi et al. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. CVPR 2016
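The re-arrangement can be sketched in NumPy (a sketch assuming the channel ordering used by torch.nn.PixelShuffle; Shi et al. call it the periodic shuffling operator):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Re-arrange a (C*r^2, H, W) feature map into (C, H*r, W*r): each
    group of r^2 channels supplies the r x r sub-pixels of one output
    position (channel ordering as in torch.nn.PixelShuffle)."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    out = x.reshape(c, r, r, h, w)      # split channels into sub-pixel offsets
    out = out.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return out.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # 4 channels = 1 output channel x (r=2)^2
y = pixel_shuffle(x, 2)
print(y.shape)       # (1, 4, 4)
print(y[0, :2, :2])  # [[0 4], [8 12]] -- one value from each input channel
```

Because the convolution producing the r^2 channel groups runs at the low resolution, this avoids both the cost of convolving at high resolution and the transposed-convolution overlap pattern.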
26. Skip Connections
Slide Credit: CS231n
Skip connections = better results: they recover low-level features from the early layers.
Long et al. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
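A minimal sketch of an FCN-style skip connection (hypothetical shapes; the real FCN uses learned bilinear upsampling and 1 x 1 score layers): the coarse scores from a deep layer are upsampled and summed with scores computed from an earlier, higher-resolution layer.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (the real FCN uses learned
    bilinear upsampling)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# Hypothetical per-layer class-score maps (num_classes x H x W).
rng = np.random.default_rng(0)
score_deep = rng.standard_normal((21, 8, 8))     # coarse scores, deep layer
score_early = rng.standard_normal((21, 16, 16))  # finer scores, earlier layer

# Skip connection: upsample the deep scores and sum them with the
# scores from the earlier, higher-resolution layer.
fused = upsample2x(score_deep) + score_early
print(fused.shape)   # (21, 16, 16)
```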
27. Dilated Convolutions
Yu & Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016
Structural change in convolutional layers for dense prediction problems (e.g. image segmentation):
● The receptive field grows exponentially as more layers are added → more context information in deeper layers w.r.t. regular convolutions
● The number of parameters increases only linearly as more layers are added
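The two bullet points can be checked with a little receptive-field arithmetic (a sketch for a stack of 3 x 3 convolutions whose dilation doubles per layer, as in Yu & Koltun):

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of k x k convolutions: each layer
    extends it by (k - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

layers = 5
regular = receptive_field([1] * layers)                     # dilation 1 everywhere
dilated = receptive_field([2 ** i for i in range(layers)])  # dilations 1, 2, 4, 8, 16

print(regular)  # 11 -- grows linearly with depth
print(dilated)  # 63 -- grows exponentially with depth
# Both stacks contain the same number of 3 x 3 filters, so the
# parameter count grows only linearly with depth in either case.
```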
29. Instance Segmentation
More challenging than semantic segmentation:
● The number of objects is variable
● There is no unique match between predicted and ground-truth objects (instance IDs cannot be used)
Several lines of attack:
● Proposal-based methods
● Recurrent neural networks
30. Proposal-based
Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014. Slide Credit: CS231n
Similar to R-CNN, but with external segment proposals: the background is masked out with the mean image.
32. Proposal-based Instance Segmentation: MNC
Dai et al. Instance-aware Semantic Segmentation via Multi-task Network Cascades. CVPR 2016
Won the COCO 2015 challenge (with ResNet).
Faster R-CNN for pixel-level segmentation in a multi-stage cascade strategy:
1. Region proposal network (RPN)
2. Reshape boxes to a fixed size; figure/ground logistic regression
3. Mask out the background, predict the object class
The entire model is learned end-to-end!
33. Proposal-based Instance Segmentation: MNC
Dai et al. Instance-aware Semantic Segmentation via Multi-task Network Cascades. CVPR 2016
Predictions vs. ground truth
34. Proposal-based Instance Segmentation: Mask R-CNN
He et al. Mask R-CNN. arXiv Mar 2017
Faster R-CNN for pixel-level segmentation as a parallel prediction of masks and class labels.
35. Mask R-CNN
He et al. Mask R-CNN. arXiv Mar 2017
● Classification & box detection losses are identical to those in Faster R-CNN
● A new loss term is added for mask prediction: the network outputs a K x m x m volume, where K is the number of categories and m is the size of the (square) mask
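The mask term can be sketched as follows (a minimal NumPy sketch, not the reference implementation): per RoI, only the m x m slice corresponding to the ground-truth class is penalized, with an average per-pixel sigmoid binary cross-entropy, so classes do not compete within the mask.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """Mask R-CNN mask loss for one RoI: the head outputs K x m x m
    logits (one m x m mask per class); only the slice of the
    ground-truth class is penalized, with average per-pixel sigmoid
    binary cross-entropy."""
    logits = mask_logits[gt_class]        # select the gt-class slice: m x m
    p = 1.0 / (1.0 + np.exp(-logits))     # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

K, m = 80, 28                         # e.g. 80 COCO categories, 28 x 28 masks
logits = np.zeros((K, m, m))          # all-zero logits -> p = 0.5 everywhere
gt = np.ones((m, m))
print(round(mask_loss(logits, gt, gt_class=3), 4))  # 0.6931 = log(2)
```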
36. Mask R-CNN: RoI Align
He et al. Mask R-CNN. arXiv Mar 2017
Reminder: RoI Pool from Fast R-CNN
● Hi-res input image (3 x 800 x 600) with a region proposal
● Convolution and pooling → hi-res conv features (C x H x W) with the region proposal
● The fully-connected layers expect low-res conv features (C x h x w), so the proposal is projected onto the feature map (coordinates divided by 16 and rounded) and max-pooled within each grid cell → RoI conv features (C x h x w) for the region proposal
● The /16 scaling and rounding cause misalignment, and the rounding is not differentiable!
37. Mask R-CNN: RoI Align
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Use bilinear interpolation instead of cropping + max-pool. The sampling grid is given by the box coordinates, as an affine mapping with θ₁₂ = θ₂₁ = 0 (translation + scale only).
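The core operation can be sketched in NumPy (a sketch; the hypothetical bilinear_sample below samples one continuous location, whereas RoI Align averages several such samples per output bin):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a continuous (y, x) location by
    bilinear interpolation of its 4 nearest cells. Unlike RoI Pool's
    rounding, this is exact and differentiable w.r.t. the features."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
print(bilinear_sample(feat, 0.5, 0.5))  # 1.5 -- the average of the 4 cells
```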
42. Recurrent Instance Segmentation: Coverage Loss
Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016. Slide Credit: M. Baradad, ReadCV@UPC
1. Compute the IoU for all pairs of predicted masks Ŷt and ground-truth masks Yt.
43. Recurrent Instance Segmentation: Coverage Loss
Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016. Slide Credit: M. Baradad, ReadCV@UPC
1. Compute the IoU for all pairs of predicted/ground-truth masks, e.g.:
      Y1    Y2    Y3   ...
Ŷ1   0.9   0.0   0.0  ...
Ŷ2   0.1   0.8   0.1  ...
...
44. Recurrent Instance Segmentation: Coverage Loss
Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016. Slide Credit: M. Baradad, ReadCV@UPC
2. Find the best matching between predicted and ground-truth masks.
Loss: the sum of the intersections over the union for the best matching, multiplied by -1 (so that better coverage yields a lower loss).
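Steps 1-2 can be sketched in NumPy (a sketch with a hypothetical 3 x 3 IoU matrix; the paper solves the matching as an assignment problem with the Hungarian algorithm, here brute force over permutations for clarity):

```python
import numpy as np
from itertools import permutations

def coverage_loss(iou):
    """Find the one-to-one prediction/ground-truth matching that
    maximizes the summed IoU and return its negation (lower is
    better). Brute force over permutations; the paper uses the
    Hungarian algorithm for the same assignment problem."""
    n = iou.shape[0]
    best = max(sum(iou[i, p[i]] for i in range(n))
               for p in permutations(range(n)))
    return -best

# Hypothetical IoU matrix (rows: predictions, columns: ground truth).
iou = np.array([[0.9, 0.0, 0.0],
                [0.1, 0.8, 0.1],
                [0.0, 0.1, 0.7]])
print(round(coverage_loss(iou), 2))  # -2.4 = -(0.9 + 0.8 + 0.7)
```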
45. Recurrent Instance Segmentation: Coverage Loss
Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016. Slide Credit: M. Baradad, ReadCV@UPC
3. Also take the confidence scores into account, e.g. s1 = 0.93, s2 = 0.73, s3 = 0.86, s4 = 0.63, s5 = 0.56. Each score st is penalized with the binary cross-entropy fBCE(st, [t ≤ n]), where [·] is the Iverson bracket (1 if the condition is true and 0 otherwise) and n is the number of ground-truth objects.
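The score term can be sketched as follows (a sketch under the reconstruction above; scores taken from the slide, n = 3 ground-truth objects assumed):

```python
import numpy as np

def score_loss(scores, n_objects):
    """Each confidence score s_t should be 1 while objects remain
    (t <= n) and 0 afterwards; the target is the Iverson bracket
    [t <= n] and the penalty is binary cross-entropy."""
    eps = 1e-7
    loss = 0.0
    for t, s in enumerate(scores, start=1):
        target = 1.0 if t <= n_objects else 0.0  # Iverson bracket [t <= n]
        loss += -(target * np.log(s + eps) +
                  (1 - target) * np.log(1 - s + eps))
    return loss

scores = [0.93, 0.73, 0.86, 0.63, 0.56]  # the slide's example scores
print(round(score_loss(scores, n_objects=3), 3))  # 2.353
```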