A brief introduction to recent segmentation methods
1. A Brief Introduction to Recent Segmentation Methods
Shunta Saito
Researcher at Preferred Networks, Inc.
2. Semantic Segmentation?
• Classifying every pixel, so it is also called “pixel labeling”
* Feature Selection and Learning for Semantic Segmentation (Caner Hazirbas), Master’s thesis, Technical University of Munich, 2014.
[Figure: an example image in which every pixel is assigned a class label (letters such as B, C, R, S mark the different classes)]
3. Typical formulation
• When tackling this problem with a CNN, it is usually formulated as:
Image → CNN → Prediction ↔ Label (cross-entropy loss)
• The loss is calculated for each pixel independently
• This leads to the problem: “How can the model consider context when making the prediction for a single pixel?” (see the loss sketch below)
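To make the per-pixel independence concrete, here is a minimal numpy sketch of the loss (function and variable names are illustrative, not from any paper): softmax cross-entropy is computed at every pixel separately and then averaged, so any context has to come from the network itself.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: float array of shape (C, H, W) -- raw class scores per pixel
    labels: int array of shape (H, W)      -- ground-truth class per pixel
    """
    # Softmax over the class axis, computed per pixel
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    h, w = labels.shape
    # Pick the probability of the correct class at every pixel
    p_true = probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    # Each pixel contributes an independent -log p term
    return -np.log(p_true).mean()

# Toy usage: 3 classes on a 4x4 prediction map
logits = np.random.randn(3, 4, 4)
labels = np.random.randint(0, 3, size=(4, 4))
print(pixelwise_cross_entropy(logits, labels))
```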
4. Common problems
• How to leverage context information
• How to use low-level features in upper layers to make detailed predictions
• How to produce dense predictions
5. “Fully Convolutional Networks for Semantic Segmentation”, Jonathan Long, Evan Shelhamer, et al., appeared on arXiv on Nov. 14, 2014
Fully Convolutional Network
(1: Reinterpret classification as a coarse prediction)
• The fully connected layers in a classification network can be viewed as convolutions with kernels that cover their entire input regions
• The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation
See Caffe’s “Net Surgery” example: https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb
[Figure: the convolutionalized classifier: a 6x6 convolution over 256 channels, followed by two 1x1 convolutions with 4096 channels each. If the input is 451x451, the output is an 8x8 map of 1000 channels]
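A minimal numpy sketch of this reinterpretation, using the AlexNet-like shapes from the figure above (256 channels, 6x6 spatial size, 4096 outputs): reshaping the fully connected weight matrix into convolution kernels gives exactly the same output at the single valid position, and a spatial map of scores on larger inputs.

```python
import numpy as np

# A fully connected layer taking a flattened (256, 6, 6) feature map
# to 4096 outputs has a weight matrix of shape (4096, 256*6*6).
W_fc = np.random.randn(4096, 256 * 6 * 6)

# The same computation as a convolution: reshape the rows into
# (out_channels, in_channels, kH, kW) kernels covering the whole input.
W_conv = W_fc.reshape(4096, 256, 6, 6)

x = np.random.randn(256, 6, 6)            # one 6x6 input feature map
y_fc = W_fc @ x.ravel()                   # fully connected output, shape (4096,)
y_conv = np.tensordot(W_conv, x, axes=3)  # "convolution" at the one valid position
print(np.allclose(y_fc, y_conv))          # True: identical up to reshaping
# On a larger input the conv version slides, yielding a spatial map of
# class scores instead of a single vector -- a coarse dense prediction.
```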
6. Fully Convolutional Network (2: Coarse to dense)
• One possible way: the “shift-and-stitch” trick proposed in the OverFeat paper (Dec. 21, 2013)
OverFeat was the winner of the localization task of ILSVRC 2013 (not detection)
Shift the input and stitch (= “interlace”) the outputs, as in the sketch below
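A toy numpy sketch of shift-and-stitch. The “network” here is just a 2x2 max pool standing in for a CNN with a total downsampling factor of f = 2, and np.roll wraps around at the borders where a real implementation would pad; both are simplifications.

```python
import numpy as np

def coarse_net(x):
    """Stand-in for a CNN whose total downsampling factor is f=2:
    a 2x2 max pool with stride 2 (hypothetical toy 'network')."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def shift_and_stitch(x, f=2):
    """OverFeat's trick: run the coarse net on all f*f shifted copies of
    the input and interlace the coarse outputs into a full-resolution map."""
    h, w = x.shape
    out = np.empty((h, w))
    for dy in range(f):
        for dx in range(f):
            # np.roll wraps around; a real implementation pads instead
            shifted = np.roll(np.roll(x, -dy, axis=0), -dx, axis=1)
            out[dy::f, dx::f] = coarse_net(shifted)  # stitch = interlace
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
print(shift_and_stitch(x).shape)  # (6, 6): dense output from a coarse net
```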
8. Fully Convolutional Network (2: Coarse to dense)
• Another way: decreasing subsampling (e.g., in max pooling layers)
‣ This has a tradeoff:
‣ The filters see finer information, but have smaller receptive fields and take longer to compute (due to the larger feature maps)
movies and images are from: http://cs231n.github.io/convolutional-networks/
9. Fully Convolutional Network (2: Coarse to dense)
• Instead of the approaches listed above, they finally employed upsampling to make the coarse predictions denser
• In a sense, upsampling with a factor f is convolution with a fractional input stride of 1/f
• A natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of f
[Figure: upsampling by deconvolution, i.e., backwards convolution with an output stride of f]
10. Fully Convolutional Network
(3: Patch-wise training or whole-image training)
• Whole-image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image
• Patchwise training is loss sampling
• They performed spatial sampling of the loss by making an independent choice to ignore each final-layer cell with some probability 1 − p
• To avoid changing the effective batch size, they simultaneously increased the number of images per batch by a factor of 1/p (a sketch of this loss sampling follows below)
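A minimal numpy sketch of this loss sampling (names are illustrative): each final-layer cell is kept independently with probability p, and only the kept cells contribute to the loss.

```python
import numpy as np

def sampled_pixel_loss(per_pixel_loss, p, rng=np.random):
    """FCN-style loss sampling: independently keep each final-layer cell
    with probability p and ignore the rest (a minimal sketch).

    per_pixel_loss: float array (H, W) of per-cell loss values
    """
    keep = rng.random_sample(per_pixel_loss.shape) < p
    # Average only over the sampled cells
    return per_pixel_loss[keep].mean()

# With p = 0.25 only ~1/4 of the cells contribute, so the number of
# images per batch would be raised by 1/p = 4 to keep the effective
# batch size unchanged.
loss_map = np.random.rand(16, 16)
print(sampled_pixel_loss(loss_map, p=0.25))
```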
11. Fully Convolutional Network (4: Skip connection)
Nov. 14, 2014
• Fuse coarse, semantic information and local, appearance information (see the fusion sketch below)
[Figure: the skip connections added to the network; legend: deconvolution (initialized as bilinear upsampling, and learned), bilinear upsampling (fixed)]
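A shape-level numpy sketch of the fusion, following the paper’s FCN-16s variant (layer names like score_pool4 follow the paper; the 2x upsampling here is nearest-neighbor for brevity, where FCN actually uses the learned, bilinear-initialized deconvolution shown earlier):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) score map
    (stand-in for FCN's learned deconvolution)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

n_classes = 21
score_conv7 = np.random.randn(n_classes, 10, 10)  # coarse, semantic scores
score_pool4 = np.random.randn(n_classes, 20, 20)  # finer, more local scores
                                                  # (a 1x1 conv on pool4)

# Skip connection: upsample the coarse stream 2x and add the finer stream
fused = upsample2x(score_conv7) + score_pool4     # shape (21, 20, 20)
# The fused map is then upsampled (by 16x in FCN-16s) to the input size.
print(fused.shape)
```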
12. Fully Convolutional Network (5: Training scheme)
1. Prepare a model trained on ILSVRC12 (1000-class image classification)
2. Discard the final classifier layer
3. Convolutionalize all remaining fully connected layers
4. Append a 1x1 convolution with the target number of classes as its channel count
Other training settings:
• MomentumSGD (momentum: 0.9)
• Batch size: 20
• Fixed LR: 10^-4 for FCN-VGG16 (doubled LR for biases)
• Weight decay: 5^-4
• Zero-initialize the class scoring layer
• Fine-tuning was applied to all layers
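A minimal Chainer sketch of these optimizer settings (the model here is a one-layer placeholder standing in for the convolutionalized FCN, and the doubled bias LR is omitted for brevity):

```python
import chainer
from chainer import links as L, optimizers

# Placeholder standing in for the convolutionalized FCN
model = L.Convolution2D(3, 21, ksize=1)

optimizer = optimizers.MomentumSGD(lr=1e-4, momentum=0.9)  # fixed LR
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.WeightDecay(5.0 ** -4))  # 5^-4
# Training then iterates over batches of 20 images, fine-tuning all
# layers, with the class scoring layer zero-initialized beforehand.
```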
13. Fully Convolutional Network
Summary
1. Replacing all fully connected layers with convolutions
2. Upsampling by backwards convolution, a.k.a. deconvolution (and fixed bilinear upsampling)
3. Applying skip connections to use local, appearance information in the final layer
14. Deconvolution Network
• “Learning Deconvolution Network for Semantic Segmentation”, Hyeonwoo Noh, et al., appeared on arXiv on May 17, 2015
* “Unpooling” here corresponds to Chainer’s “Upsampling2D” function (a sketch follows below)
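A numpy sketch of that unpooling (DeconvNet calls the recorded argmax positions “switches”; shapes assume a 2x2 window with stride 2): max pooling remembers where each maximum came from, and unpooling puts each pooled value back at that location, with zeros elsewhere.

```python
import numpy as np

def max_pool_with_switches(x):
    """2x2/stride-2 max pooling that also returns the argmax 'switches'."""
    h, w = x.shape
    patches = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    switches = patches.argmax(axis=1)  # position of each max in its window
    pooled = patches.max(axis=1).reshape(h // 2, w // 2)
    return pooled, switches

def max_unpool(pooled, switches):
    """Place each pooled value back at its switch location, zeros elsewhere."""
    h, w = pooled.shape
    patches = np.zeros((h * w, 4))
    patches[np.arange(h * w), switches] = pooled.ravel()
    return patches.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)

x = np.random.rand(4, 4)
pooled, sw = max_pool_with_switches(x)
print(max_unpool(pooled, sw))  # sparse 4x4 map; nonzeros at the max locations
```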
15. Deconvolution Network
• Fully Convolutional Network (FCN) has limitations:
• The fixed-size receptive field yields inconsistent labels for large objects
➡ Skip connections can’t solve this because there is an inherent trade-off between boundary details and semantics
• Interpolating the 16 x 16 output to the original input size produces blurred results
➡ The absence of a deep deconvolution network trained on a large dataset makes it difficult to reconstruct the highly nonlinear structures of object boundaries accurately
So let’s build a deep deconvolution network to perform proper upsampling
21. SegNet
• “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, Nov. 2, 2015
22. SegNet
• The training procedure is a bit complicated
• Encoder-decoder “pair-wise” training
There’s a Chainer implementation, pfnet-research/chainer-segnet: https://github.com/pfnet-research/chainer-segnet
23. Dilated convolutions
• “Multi-Scale Context Aggregation by Dilated Convolutions”, Fisher Yu, Vladlen Koltun, Nov. 23, 2015
• a.k.a. à trous convolution (“convolution with holes”)
• Enlarges the receptive field without losing resolution (see the sketch at the end of this section)
The figure is from “WaveNet: A Generative Model for Raw Audio”
24. Dilated convolutions
• For example, the feature maps of ResNet are downsampled 5 times; 4 of the 5 downsamplings are done by convolutions with a stride of 2, and only one is done by pooling with a stride of 2
[Figure: feature map resolutions shrink 1/2 → 1/4 → 1/8 → 1/16 → 1/32]
25. Dilated convolutions
• By using dilated convolutions instead of the vanilla stride-2 convolutions, the resolution reached after the early downsampling can be kept the same (1/8) all the way to the end
[Figure: feature map resolutions stay at 1/2 → 1/4 → 1/8 → 1/8 → 1/8]
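Finally, a naive numpy sketch of a 2-D dilated convolution (direct and unoptimized): with kernel size k and dilation d the receptive field grows to k + (k − 1)(d − 1), while stride 1 and matching zero padding keep the output at the input resolution.

```python
import numpy as np

def dilated_conv2d(x, w, dilate=1):
    """Naive stride-1 dilated convolution with zero padding so the
    output resolution equals the input's (a sketch, not optimized)."""
    k = w.shape[0]                    # square kernel, k x k
    eff = k + (k - 1) * (dilate - 1)  # effective (dilated) kernel size
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * dilate, j * dilate
            out += w[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

x = np.random.rand(8, 8)
w = np.ones((3, 3)) / 9.0
y = dilated_conv2d(x, w, dilate=2)  # receptive field 5x5, output still 8x8
print(y.shape)                      # (8, 8): no resolution is lost
```

Chainer provides the same operation as chainer.links.DilatedConvolution2D.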