Summary
• Introduces the Pyramid Pooling Module, which captures global context with sub-region awareness

Why did I choose this paper?
• Presented in CVPR 2017
• 1st place in ImageNet Scene Parsing Challenge 2016 (ADE20K)
• Was 1st place on the Cityscapes leaderboard
• Now it's in 2nd place (I noticed this last week!)

Agenda
1. Common building blocks in semantic segmentation
2. Major Issue
3. Prior Work
4. Pyramid Pooling Module
5. Experiment results

Semantic Segmentation
• Predict pixel-wise labels from natural images
• Each pixel in an image belongs to an object class
• So it's not instance-aware!

Common Building Blocks (1)
Fully convolutional network (FCN) [1]
• A deep convolutional neural network which doesn't include any fully-connected layers
• Almost all recent methods are based on FCN
• Typically pre-trained with ImageNet under classification problem setting
[1] "Fully Convolutional Networks for Semantic Segmentation", PAMI 2016

Dilated convolution [2]
• Widen receptive field without reducing feature map resolution
• Important for leveraging global context prior efficiently
[2] "Multi-Scale Context Aggregation by Dilated Convolutions", ICLR 2016

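As a rough illustration of why dilation widens the receptive field without shrinking the feature map, the sketch below computes the extent covered by a dilated kernel and the resulting output size along one axis. The function names are made up for this example and do not come from any library.

```python
def effective_kernel_size(ksize, dilation):
    """Extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return ksize + (ksize - 1) * (dilation - 1)


def conv_output_size(in_size, ksize, stride=1, pad=0, dilation=1):
    """Output length of a convolution along one axis (floor division)."""
    k_eff = effective_kernel_size(ksize, dilation)
    return (in_size + 2 * pad - k_eff) // stride + 1


# A 3x3 kernel with dilation 2 covers a 5x5 window, with dilation 4 a 9x9 one.
print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
print(effective_kernel_size(3, 4))  # 9

# With stride 1 and pad equal to the dilation, a 3x3 conv keeps the resolution.
print(conv_output_size(60, 3, stride=1, pad=2, dilation=2))  # 60
```

So the window seen by each output pixel grows with the dilation rate while the number of weights and the feature map size stay the same.
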
Multi-scale feature ensemble
• Higher-layer features contain more semantic meaning and less location information
• Combining multi-scale features can improve the performance [3]
[3] "Hypercolumns for Object Segmentation and Fine-grained Localization", CVPR 2015

Conditional random field (CRF)
• Post-processing to refine the segmentation result (DeepLab [4])
• Some following methods refined the network via end-to-end modeling (DPN [5], CRF as RNN [6], Detections and Superpixels [7])
[4] "Semantic image segmentation with deep convolutional nets and fully connected crfs", ICLR 2015
[5] "Semantic image segmentation via deep parsing network", ICCV 2015
[6] "Conditional random fields as recurrent neural networks", ICCV 2015
[7] "Higher order conditional random fields in deep neural networks", ECCV 2016

Global average pooling (GAP)
• ParseNet [8] showed that global average pooling with FCN can improve semantic segmentation results
• But the global descriptors used in the paper are not representative enough for some challenging datasets like ADE20K
[8] "Parsenet: Looking wider to see better", ICLR 2016

Major Issue (1)
Mismatched relationship
• Co-occurrent visual patterns imply some contexts
• e.g., an airplane is likely to fly in the sky, not over a road
• Lacking the ability to collect contextual information increases the chance of misclassification
• In the right figure, FCN predicts the boat in the yellow box as a "car" based on its appearance

Major Issue (2)
Confusing Classes
• There are confusing classes in major datasets: field and earth; mountain and hill; wall, house, building and skyscraper, etc.
• Even an expert human annotator still makes 17.6% pixel error on ADE20K [9]
• FCN predicts the object in the box as part skyscraper and part building, but the whole object should be either skyscraper or building, not both
• Utilizing the relationship between classes is important
[9] "Semantic understanding of scenes through the ADE20K dataset", CVPR 2017

Major Issue (3)
Inconspicuous Classes
• Small objects like streetlight and signboard are inconspicuous and hard to find, though they may be important
• Predictions for big objects may be discontinuous: FCN couldn't correctly label the pillow, which has a similar appearance to the sheet
• To improve performance for small or very big objects, more attention should be paid to sub-regions

Summary of Issues
• Use co-occurrent visual patterns as context
• Consider relationship between classes
• Sub-regions should be paid more attention

Prior Work
Global Average Pooling (GAP) [10]
• The receptive field of ResNet is already larger than the input image, so GAP sounds good for summarizing all the information
• But pixels in an image may belong to various objects of different sizes, so directly fusing them into a single vector may lose the spatial relation and cause ambiguity
[10] "Parsenet: Looking wider to see better", ICLR 2016

Prior Work
Spatial Pyramid Pooling (SPP) [11]
• Apply pooling with different kernel/stride sizes to the feature maps
• Then flatten and concatenate the pooling results to make a fixed-length representation
• There is still a loss of context information
[11] "Spatial pyramid pooling in deep convolutional networks for visual recognition", ECCV 2014
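
To make the "fixed-length representation" point concrete, here is a minimal pure-Python sketch of SPP on a single-channel feature map. The bin boundaries, the pyramid levels (1, 2, 4), and the function names are assumptions chosen for illustration, not the settings of any particular implementation.

```python
import math


def adaptive_pool(fmap, bins, reduce_fn=max):
    """Pool a 2D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            window = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def spp_vector(fmap, levels=(1, 2, 4)):
    """Flatten and concatenate the pooled grids of every pyramid level."""
    vec = []
    for bins in levels:
        pooled = adaptive_pool(fmap, bins)
        vec.extend(v for row in pooled for v in row)
    return vec


small = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 input
large = [[r * 9 + c for c in range(9)] for r in range(6)]  # 6x9 input

# Both inputs give 1 + 4 + 16 = 21 values regardless of their size.
print(len(spp_vector(small)), len(spp_vector(large)))  # 21 21
```

Flattening is what throws away the 2D layout here, which is the context loss mentioned above.
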

Pyramid Pooling Module
• A hierarchical global prior, containing information at different scales and varying among different sub-regions
• The Pyramid Pooling Module builds this global scene prior on top of the final-layer feature map
• Use 1x1 conv to reduce the number of channels at each pyramid level
• Then upsample (bilinear) them to the same size as the original feature map and concatenate them all with it
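
The steps above (pool, 1x1 conv, bilinear upsample, concatenate) can be sketched in pure Python as follows. This is only an illustration under several assumptions: the 1x1 conv weights are a fixed averaging placeholder instead of learned parameters, each level keeps 1/4 of the input channels, and the bin boundaries and corner-aligned bilinear sampling are my own choices. All function names are hypothetical.

```python
import math


def adaptive_avg_pool(channel, bins):
    """Average-pool one HxW channel (list of rows) into a bins x bins grid."""
    h, w = len(channel), len(channel[0])
    out = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            window = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def conv1x1(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wt * ch[y][x] for wt, ch in zip(w_out, channels))
              for x in range(w)] for y in range(h)] for w_out in weights]


def bilinear_resize(channel, out_h, out_w):
    """Bilinear upsampling of one channel (corner-aligned sampling grid)."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1, wy = min(y0 + 1, in_h - 1), fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1, wx = min(x0 + 1, in_w - 1), fx - x0
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def pyramid_pooling(features, bins=(1, 2, 3, 6)):
    """features: list of C channels, each an HxW list of rows."""
    n_ch, h, w = len(features), len(features[0]), len(features[0][0])
    reduced = max(1, n_ch // len(bins))
    # Placeholder weights standing in for the learned 1x1 conv parameters.
    weights = [[1.0 / n_ch] * n_ch for _ in range(reduced)]
    out = list(features)  # keep the original feature map
    for b in bins:
        pooled = [adaptive_avg_pool(ch, b) for ch in features]
        squeezed = conv1x1(pooled, weights)
        out.extend(bilinear_resize(ch, h, w) for ch in squeezed)
    return out


# 4 channels of size 12x12; channel c is filled with the constant c + 1.
features = [[[float(c + 1)] * 12 for _ in range(12)] for c in range(4)]
fused = pyramid_pooling(features)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 8 12 12
```

Unlike SPP, nothing is flattened: every pyramid level goes back to HxW, so each pixel gets context from its own sub-region at several scales.
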

Implementation details (1)
• Average pooling is done at four pyramid levels, with 1x1, 2x2, 3x3, and 6x6 output bins (ksize and stride are set accordingly)
• A pre-trained ResNet model with dilated convolution is used as the feature extractor (the output size will be 1/8 of the input image)
• They use two losses:
1. softmax loss between the final layer and labels
2. softmax loss between an intermediate output of ResNet and labels [12] (weighted by 0.4)
[12] "Relay backpropagation for effective learning of deep convolutional neural networks", ECCV 2016
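
How the two losses combine can be sketched like this. The per-pixel logits are plain Python lists and the helper names are hypothetical; only the 0.4 weight on the auxiliary term comes from the slide.

```python
import math


def softmax_cross_entropy(logits, label):
    """Softmax loss for one pixel, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def mean_loss(logits_per_pixel, labels):
    """Average the per-pixel softmax loss over all pixels."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(logits_per_pixel, labels)]
    return sum(losses) / len(losses)


def total_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    """Main loss plus the auxiliary loss scaled by aux_weight."""
    return mean_loss(main_logits, labels) + aux_weight * mean_loss(aux_logits, labels)


# Two pixels, three classes (toy numbers).
main_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
aux_logits = [[1.0, 0.8, 0.2], [0.1, 0.9, 0.4]]
labels = [0, 1]
print(total_loss(main_logits, aux_logits, labels))
```

The auxiliary branch is only a training aid, so at test time only the final-layer prediction would be used.
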

Optimization
• MomentumSGD with weight decay
• Momentum: 0.9
• Weight decay: 0.0001
• LR scheduling: "poly" policy, lr = base_lr * (1 - iter / max_iter)^power, where base_lr = 0.01 and power = 0.9
Training iterations
• ADE20K: 150K
• PASCAL VOC: 30K
• Cityscapes: 90K

Dataset augmentation
• Random mirror
• Random resize between 0.5 and 2
• Random rotation between -10 and 10 degrees
• Random Gaussian blur for ADE20K and PASCAL VOC

Implementation details (4)
• An appropriately large "cropsize" can yield good performance
• "batchsize" in the batch normalization layer is of great importance:
Dataset       Cropsize     Batchsize
ADE20K        473 x 473    16
PASCAL VOC    473 x 473    16
Cityscapes    713 x 713    16

Implementation details (5)
MultiNode Batch Normalization
• To increase the "batchsize" in batch normalization layers, they used a custom BN layer applied on data gathered from multiple GPUs using OpenMPI
• We have Akiba-san's implementation of multi-node batch normalization!
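
The idea behind multi-node batch normalization can be sketched without any MPI: each worker sends its local sum, sum of squares, and count, and the all-reduced totals give the mean and variance of the whole batch. This is a toy single-process simulation with hypothetical function names, not how the authors' layer or ChainerMN is actually implemented.

```python
def local_stats(values):
    """Per-worker statistics for one channel: sum, sum of squares, count."""
    return sum(values), sum(v * v for v in values), len(values)


def allreduce_sum(stats_per_worker):
    """Stand-in for an MPI all-reduce: element-wise sum over workers."""
    return tuple(sum(parts) for parts in zip(*stats_per_worker))


def global_mean_var(stats_per_worker):
    """Mean and (biased) variance of the batch spread over all workers."""
    total, total_sq, count = allreduce_sum(stats_per_worker)
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


def normalize(values, mean, var, eps=1e-5):
    """Normalize local activations with the shared global statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Four workers (e.g. GPUs), each holding two activations of one channel.
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mean, var = global_mean_var([local_stats(w) for w in workers])
print(mean, var)  # 4.5 5.25
print(normalize(workers[0], mean, var))
```

Each worker alone would estimate its statistics from 2 samples; after the all-reduce every worker normalizes with statistics from all 8.
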

ImageNet Scene Parsing Challenge 2016
• Dataset: ADE20K
• 150 classes and 1,038 image-level labels
• 20,000/2,000/3,000 pixel-level labels for train/val/test

Ablation Study for
• Average pooling works better than max pooling in all settings
• Pooling with the pyramid module outperforms global pooling
• With dimension reduction (DR; reducing the number of channels after pyramid pooling), the performance is further enhanced

Ablation Study for Auxiliary Loss
• Set the auxiliary loss weight between 0 and 1 and compared the final results
• A weight of 0.4 yields the best performance

Ablation Study for the depth of ResNet
Deeper is better

More Detailed Performance Analysis
Additional processing              Improvement (% in mIoU)
Data augmentation (DA)             +1.54
Auxiliary loss (AL)                +1.41
Pyramid pooling module (PSP)       +4.45
Use deeper ResNet (50 to 269)      +2.13
Multi-scale testing (MS)           +1.13
• For multi-scale testing, they create predictions at 6 different scales (0.5, 0.75, 1, 1.25, 1.5, and 1.75) and take the average of them.
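
The averaging step of multi-scale testing can be sketched as below. It assumes the per-scale class probabilities have already been resized back to the input resolution (the network forward pass and the resizing are left out), and the function names and toy numbers are made up.

```python
def average_predictions(prob_maps):
    """Average a list of HxWxC probability maps, one map per test scale."""
    n = len(prob_maps)
    h, w, c = len(prob_maps[0]), len(prob_maps[0][0]), len(prob_maps[0][0][0])
    return [[[sum(p[y][x][k] for p in prob_maps) / n for k in range(c)]
             for x in range(w)] for y in range(h)]


def argmax_labels(prob_map):
    """Pick the most probable class index for every pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in prob_map]


# Toy 1x2 image with 2 classes, predicted at 3 scales.
scale_a = [[[0.9, 0.1], [0.4, 0.6]]]
scale_b = [[[0.6, 0.4], [0.3, 0.7]]]
scale_c = [[[0.2, 0.8], [0.6, 0.4]]]

avg = average_predictions([scale_a, scale_b, scale_c])
print(argmax_labels(avg))  # [[0, 1]]
```

Here scale_c alone would label both pixels wrongly relative to the consensus, and averaging over scales smooths that out.
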

Results on PASCAL VOC 2012
• Extended with the Semantic Boundaries Dataset (SBD) [13], they used 10582, 1449, and 1456 images for train/val/test
• Mismatched relationship: For "aeroplane" and "sky" in the second and third rows, PSPNet finds missing parts.
• Confusing classes: For "cows" in row one, the baseline model treats it as "horse" and "dog" while PSPNet corrects these errors
• Inconspicuous objects: For "person", "bottle" and "plant" in the following rows, PSPNet performs well on these small-size object classes compared to the baseline model
[13] "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html

Results on PASCAL VOC 2012
• Comparing PSPNet with previous best-performing methods on the testing set based on two settings, i.e., with or without pre-training on MS-COCO dataset

Results on Cityscapes
• The Cityscapes dataset consists of 2975, 500, and 1525 train/val/test images (19 classes)
• 20000 coarsely annotated images are also available (in the table below, ‡ means they are used)

Thank you for your attention
• The official repository doesn't include any training code
• My own implementation for both training and testing is ready:
• mitmul/chainer-pspnet: https://github.com/mitmul/chainer-pspnet
• Now I'm training a model to ensure the reproducibility
• Once the reproduction work is finished, I'll send the code to ChainerCV
• In the semantic segmentation task,
• the input image is large (713 for PSPNet on Cityscapes)
• an appropriate batchsize, e.g., 16 or so, is important for batch normalization
• As the authors said, distributed batch normalization seems to be important in multi-GPU training
• So ChainerMN is now a necessary tool for such large-scale datasets and deep models
• It means that we need more GPU machines connected with InfiniBand

[unofficial] Pyramid Scene Parsing Network (CVPR 2017)