[unofficial] Pyramid Scene Parsing Network (CVPR 2017)


Introduction to the Pyramid Scene Parsing Network (presented at CVPR 2017)


  1. Pyramid Scene Parsing Network. Hengshuang Zhao¹, Jianping Shi², Xiaojuan Qi¹, Xiaogang Wang¹, Jiaya Jia¹ (¹The Chinese University of Hong Kong, ²SenseTime Group Limited). Presentation: Shunta Saito. Slides powered by Deckset.
  2. Summary • Introduces the Pyramid Pooling Module, which captures global context while remaining aware of sub-regions
  3. Why did I choose this paper? • Presented at CVPR 2017 • 1st place in the ImageNet Scene Parsing Challenge 2016 (ADE20K) • Was in 1st place on the Cityscapes leaderboard • It is now in 2nd place (I noticed this last week!)
  4. Agenda 1. Common building blocks in semantic segmentation 2. Major issues 3. Prior work 4. Pyramid Pooling Module 5. Experimental results
  5. Semantic Segmentation • Predict pixel-wise labels from natural images • Each pixel in an image is assigned an object class • So it is not instance-aware!
  6. Common Building Blocks (1) Fully convolutional network (FCN)¹ • A deep convolutional neural network that does not include any fully-connected layers • Almost all recent methods are based on FCNs • Typically pre-trained on ImageNet in a classification setting ¹ "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015
  7. Common Building Blocks (2) Dilated convolution¹ • Widens the receptive field without reducing the feature map resolution (a small sketch follows) • Important for leveraging the global context prior efficiently ¹ "Multi-Scale Context Aggregation by Dilated Convolutions", ICLR 2016
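A minimal Chainer sketch of the idea (the layer sizes here are my own illustration): a 3x3 kernel with dilation 2 covers a 5x5 window, yet with matching padding the feature map keeps its resolution.

```python
import numpy as np
import chainer.links as L

x = np.zeros((1, 3, 64, 64), dtype=np.float32)

# A 3x3 kernel with dilate=2 sees a 5x5 window; pad=2 keeps the
# 64x64 resolution, so the receptive field grows without downsampling.
conv = L.DilatedConvolution2D(3, 16, ksize=3, stride=1, pad=2, dilate=2)
print(conv(x).shape)  # (1, 16, 64, 64)
```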
  8. Common Building Blocks (3) Multi-scale feature ensemble • Higher-layer features contain more semantic meaning and less location information • Combining features from multiple scales can improve performance
  9. Common Building Blocks (4) Conditional random field (CRF) • Post-processing to refine the segmentation result (DeepLab¹) • Some follow-up methods integrate the refinement into end-to-end models (DPN², CRF-as-RNN³, Detections and Superpixels⁴) ¹ "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs", ICLR 2015 ² "Semantic image segmentation via deep parsing network", ICCV 2015 ³ "Conditional random fields as recurrent neural networks", ICCV 2015 ⁴ "Higher order conditional random fields in deep neural networks", ECCV 2016
  10. Common Building Blocks (5) Global average pooling (GAP) • ParseNet¹ showed that global average pooling combined with an FCN can improve semantic segmentation results • But the global descriptors used in that paper are not representative enough for challenging datasets like ADE20K ¹ "ParseNet: Looking Wider to See Better", arXiv 2015
  11. Major Issue (1) Mismatched relationship • Co-occurrent visual patterns imply context • e.g., an airplane is likely to fly in the sky, not over a road • Lacking the ability to collect contextual information increases the chance of misclassification • In the figure on the slide, FCN predicts the boat in the yellow box as a "car" based on its appearance
  12. Major Issue (2) Confusing classes • Major datasets contain confusing class pairs: field and earth; mountain and hill; wall, house, building and skyscraper, etc. • Even an expert human annotator makes a 17.6% pixel error on ADE20K¹ • FCN predicts the object in the box as part skyscraper and part building, but the whole object should be either a skyscraper or a building, not both • Utilizing the relationship between classes is important ¹ "Scene Parsing through ADE20K Dataset", CVPR 2017
  13. Major Issue (3) Inconspicuous classes • Small objects like streetlights and signboards are inconspicuous and hard to find, yet they may be important • Predictions on very big objects can be discontinuous: FCN fails to label the pillow correctly because its appearance is similar to the sheet • To improve performance on very small or very big objects, sub-regions should be paid more attention
  14. Summary of Issues • Use co-occurrent visual patterns as context • Consider the relationship between classes • Pay more attention to sub-regions
  15. Prior Work Global Average Pooling (GAP)¹ • The receptive field of ResNet is already larger than the input image, so GAP sounds like a good way to summarize all the information • But the pixels in an image may belong to various objects of different sizes, so directly fusing them into a single vector may lose the spatial relations and cause ambiguity (see the sketch below) ¹ "ParseNet: Looking Wider to See Better", arXiv 2015
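A minimal sketch of the concern, in Chainer with made-up tensor sizes: ParseNet-style fusion appends the same global vector at every location, so objects of different sizes all see one undifferentiated summary.

```python
import numpy as np
import chainer.functions as F

feat = np.zeros((1, 512, 60, 60), dtype=np.float32)

# Collapse the whole map into a single 512-d descriptor ...
g = F.average_pooling_2d(feat, ksize=(60, 60))   # (1, 512, 1, 1)
# ... then tile it back and fuse it with the local features.
# Every location receives the *same* global vector, so the spatial
# relations between differently sized objects are lost.
g = F.tile(g, (1, 1, 60, 60))
fused = F.concat([feat, g], axis=1)              # (1, 1024, 60, 60)
```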
  16. Prior Work Spatial Pyramid Pooling (SPP)¹ • Pool the feature maps with several different kernel/stride sizes • Then flatten and concatenate the pooled results into a fixed-length representation (see the sketch below) • Context information is still lost ¹ "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", ECCV 2014
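A minimal sketch of SPP-style pooling (the grid sizes here are my choice; the original SPP-net used max pooling):

```python
import numpy as np
import chainer.functions as F

feat = np.zeros((1, 256, 24, 24), dtype=np.float32)

# Pool the same map into 1x1, 2x2 and 4x4 grids, flatten each
# result and concatenate them into one fixed-length vector.
vecs = []
for bins in (1, 2, 4):
    k = 24 // bins  # kernel size == stride, so the output is bins x bins
    y = F.max_pooling_2d(feat, ksize=k, stride=k)
    vecs.append(F.reshape(y, (1, -1)))
spp = F.concat(vecs, axis=1)
print(spp.shape)  # (1, 256 * (1 + 4 + 16)) == (1, 5376)
```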
  17. Pyramid Pooling Module • A hierarchical global prior, containing information at different scales and varying across different sub-regions • The Pyramid Pooling Module builds this global scene prior on top of the final-layer feature map
  18. Pyramid Pooling Module • Use 1x1 convolutions to reduce the number of channels of each pooled map • Then bilinearly upsample them to the input size and concatenate everything (see the sketch below)
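Putting the two slides together, here is a minimal Chainer sketch of the module (class and variable names are mine, not from the official code):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L


class PyramidPoolingModule(chainer.Chain):

    """A minimal sketch of the pyramid pooling module."""

    def __init__(self, in_ch, levels=(1, 2, 3, 6)):
        super(PyramidPoolingModule, self).__init__()
        self.levels = levels
        out_ch = in_ch // len(levels)  # 1x1 convs reduce the channels
        with self.init_scope():
            for i in range(len(levels)):
                setattr(self, 'conv{}'.format(i),
                        L.Convolution2D(in_ch, out_ch, ksize=1))

    def __call__(self, x):
        _, _, h, w = x.shape
        ys = [x]
        for i, n in enumerate(self.levels):
            # Average-pool into an n x n grid, reduce channels with a
            # 1x1 conv, then bilinearly upsample back to the input size.
            ksize = (h // n, w // n)
            y = F.average_pooling_2d(x, ksize, stride=ksize)
            y = getattr(self, 'conv{}'.format(i))(y)
            y = F.resize_images(y, (h, w))
            ys.append(y)
        return F.concat(ys, axis=1)  # in_ch * 2 channels in total


x = np.zeros((1, 2048, 60, 60), dtype=np.float32)
print(PyramidPoolingModule(2048)(x).shape)  # (1, 4096, 60, 60)
```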
  19. Implementation details (1) • Average pooling is applied at four pyramid levels with output bin sizes 1x1, 2x2, 3x3, and 6x6 (kernel size and stride set accordingly) • A pre-trained ResNet with dilated convolution is used as the feature extractor (the output size is 1/8 of the input image) • They use two losses (sketched below): 1. a softmax loss between the final layer output and the labels, and 2. an auxiliary softmax loss between an intermediate ResNet output and the labels, weighted by 0.4
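A minimal sketch of how the two losses might be combined (the function and variable names are hypothetical):

```python
import chainer.functions as F

def pspnet_loss(y_final, y_aux, t, aux_weight=0.4):
    # Main softmax loss on the final prediction, plus an auxiliary
    # softmax loss on an intermediate ResNet output, weighted by 0.4.
    # The auxiliary branch is used during training only.
    return (F.softmax_cross_entropy(y_final, t)
            + aux_weight * F.softmax_cross_entropy(y_aux, t))
```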
  20. Implementation details (2) • Optimization: MomentumSGD (momentum 0.9, weight decay 0.0001) • LR scheduling: the "poly" policy, lr = base_lr × (1 − iter/max_iter)^power, with base_lr = 0.01 and power = 0.9 (see the sketch below)
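The "poly" schedule in code (a one-liner sketch):

```python
def poly_lr(iteration, max_iteration, base_lr=0.01, power=0.9):
    # Learning rate decays from base_lr to 0 over max_iteration steps.
    return base_lr * (1 - float(iteration) / max_iteration) ** power

print(poly_lr(0, 90000))      # 0.01
print(poly_lr(45000, 90000))  # ~0.0054
```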
  21. Implementation details (3) • Training iterations: ADE20K 150K, PASCAL VOC 30K, Cityscapes 90K • Data augmentation: random mirror, random resize between 0.5 and 2, random rotation between -10 and 10 degrees, and random Gaussian blur (for ADE20K and PASCAL VOC; see the sketch below)
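A hedged sketch of such an augmentation pipeline using OpenCV (the exact parameters and probabilities are my guesses, not the authors'; `label` is assumed to be a uint8 class map):

```python
import numpy as np
import cv2

def augment(img, label):
    # Random horizontal mirror.
    if np.random.rand() < 0.5:
        img, label = img[:, ::-1].copy(), label[:, ::-1].copy()
    # Random resize between 0.5x and 2x (nearest-neighbor for labels).
    s = np.random.uniform(0.5, 2.0)
    img = cv2.resize(img, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
    label = cv2.resize(label, None, fx=s, fy=s, interpolation=cv2.INTER_NEAREST)
    # Random rotation between -10 and 10 degrees.
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0),
                                np.random.uniform(-10, 10), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    label = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST)
    # Random Gaussian blur (ADE20K and PASCAL VOC only).
    if np.random.rand() < 0.5:
        img = cv2.GaussianBlur(img, (5, 5), 0)
    return img, label
```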
  22. Implementation details (4) • An appropriately large "cropsize" can yield good performance • The "batchsize" in the batch normalization layers is of great importance • Cropsize: 473 x 473 for ADE20K and PASCAL VOC, 713 x 713 for Cityscapes • Batchsize: 16 for all datasets
  23. Implementation details (5) Distributed Batch Normalization • To increase the "batchsize" in batch normalization layers, they used a custom BN layer applied on data gathered from multiple GPUs using OpenMPI • We have Akiba-san's implementation of distributed batch normalization! (see the sketch below)
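If I remember correctly, that implementation is exposed in ChainerMN as a multi-node batch normalization link; a minimal sketch of how it would be used (treat the exact API as an assumption):

```python
import chainermn

comm = chainermn.create_communicator()  # MPI-backed communicator

# BN statistics are computed over the samples gathered from all
# workers, so the effective batchsize is per-GPU batch x #workers.
# (Assumed API: chainermn.links.MultiNodeBatchNormalization.)
bn = chainermn.links.MultiNodeBatchNormalization(512, comm)
```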
  24. ImageNet Scene Parsing Challenge 2016 • Dataset: ADE20K • 150 classes and 1,038 image-level labels • 20,000/2,000/3,000 pixel-level labeled images for train/val/test
  25. Ablation Study for the Pyramid Pooling Module • Average pooling works better than max pooling in all settings • Pooling with the pyramid outperforms global pooling • With dimension reduction (DR; reducing the number of channels after pyramid pooling), performance is further enhanced
  26. Ablation Study for the Auxiliary Loss • They set the auxiliary loss weight between 0 and 1 and compared the final results • A weight of 0.4 yields the best performance
  27. Ablation Study for the ResNet Depth • Deeper is better
  28. More Detailed Performance Analysis • Data augmentation (DA): +1.54 mIoU • Auxiliary loss (AL): +1.41 mIoU • Pyramid pooling module (PSP): +4.45 mIoU • Deeper ResNet (50 → 269): +2.13 mIoU • Multi-scale testing (MS): +1.13 mIoU • For multi-scale testing, they make predictions at 6 different scales (0.5, 0.75, 1, 1.25, 1.5, and 1.75) and average them (see the sketch below)
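A minimal sketch of the multi-scale testing procedure (`predict` is a hypothetical helper that returns a per-pixel probability map at the same resolution as its input):

```python
import numpy as np
import cv2

def multi_scale_predict(predict, img,
                        scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    h, w = img.shape[:2]
    acc = None
    for s in scales:
        scaled = cv2.resize(img, None, fx=s, fy=s)
        prob = predict(scaled)               # (h*s, w*s, n_class)
        prob = cv2.resize(prob, (w, h))      # back to the input size
        acc = prob if acc is None else acc + prob
    return acc / len(scales)                 # averaged probabilities
```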
  29. Results on PASCAL VOC 2012 • Extended with the Semantic Boundaries Dataset (SBD)¹: 10,582, 1,449, and 1,456 images for train/val/test • Mismatched relationship: for "aeroplane" and "sky" in the second and third rows, PSPNet finds the missing parts • Confusing classes: for "cows" in row one, the baseline model treats it as "horse" and "dog" while PSPNet corrects these errors • Inconspicuous classes: for "person", "bottle" and "plant" in the following rows, PSPNet performs well on these small-object classes compared to the baseline model ¹ "Semantic Contours from Inverse Detectors", ICCV 2011
  30. Results on PASCAL VOC 2012 • PSPNet is compared with the previous best-performing methods on the test set under two settings, i.e., with and without pre-training on the MS-COCO dataset
  31. Results on Cityscapes • The Cityscapes dataset consists of 2,975, 500, and 1,525 train/val/test images (19 classes) • 20,000 coarsely annotated images are also available (in the table below, ‡ means they are used)
  32. Thank you for your attention • The official repository doesn't include any training code • My own implementation of both training and testing is ready: mitmul/segmentation • I'm now training a model to verify reproducibility • Once the reproduction work is finished, I'll send the code to ChainerCV • Training on the Cityscapes dataset takes over 20 days on 8 GPUs, even with a ResNet50-based PSPNet (they used ResNet101 for Cityscapes) • ChainerMN is now a necessary tool for such large-scale datasets and deep models • So we need more GPU machines connected to each other with InfiniBand