Weakly supervised semantic segmentation


  1. Semantic Segmentation with Limited Annotation. Zhedong Zheng, 24 Feb 2018.
  2. What can we learn from (from Stephen Chow’s film)
  3. (image-only slide)
  4. Related Works: 1. Simple Does It: Weakly Supervised Instance and Semantic Segmentation (CVPR 2017) [weak supervision]; 2. Colorful Image Colorization (ECCV 2016 oral) [self-supervision]
  5. Related Works (repeated to introduce the first paper): 1. Simple Does It: Weakly Supervised Instance and Semantic Segmentation (CVPR 2017) [weak supervision]; 2. Colorful Image Colorization (ECCV 2016 oral) [self-supervision]
  6. What
  7. How: start from object bounding-box annotations
  8. Recall several rules: 1. Background: pixels covered by no bounding box are background. 2. Object extent: bboxes are instance-level and delimit each object’s extent. 3. Objectness: spatial continuity / contrasting boundary.
  9. How to begin? If two boxes overlap, we assume the smaller one is in front (see the box-initialization sketch after this transcript).
  10. How to begin? (continued)
  11. Post-process: • any pixel outside the bbox is discarded; • if IoU < 50%, re-initialize from the box; • refine with DenseCRF (a sketch of the first two rules also follows the transcript).
  12. Result. “Naïve” denotes training without the post-processing.
  13. Result
  14. Result
  15. Related Works (repeated to introduce the second paper): 1. Simple Does It: Weakly Supervised Instance and Semantic Segmentation (CVPR 2017) [weak supervision]; 2. Colorful Image Colorization (ECCV 2016 oral) [self-supervision]
  16. (image-only slide)
  17. Grayscale image: L channel. Color information: ab channels.
  18. Grayscale image: L channel; “free” supervisory signal: ab channels; concatenate (L, ab) for the output. Semantics? Higher-level abstraction? (see the Lab-split sketch after this transcript)
  19. Inherent ambiguity: grayscale input
  20. Inherent ambiguity: our output vs. ground truth
  21. Better loss function (colors in ab space, continuous): • regression with an L2 loss is inadequate; • use multinomial classification; • class rebalancing to encourage learning of rare colors.
  22. Better loss function (colors in ab space, discrete): • regression with an L2 loss is inadequate; • use multinomial classification; • class rebalancing to encourage learning of rare colors (see the quantization sketch after the Editor’s Notes).
  23. Failure cases
  24. Biases
  25. Evaluation. Quantitative, visual quality: per-pixel accuracy; perceptual realism; semantic interpretability. Quantitative, representation learning: task generalization (ImageNet classification); task & dataset generalization (PASCAL classification, detection, segmentation). Qualitative, visual quality: low-level stimuli; legacy grayscale photos. Qualitative, representation learning: hidden unit activations.
  26. Hidden unit (conv5) activations: human faces, dog faces, flowers
  27. Dataset & task generalization on PASCAL VOC. [Bar chart: performance shown as % of the gap from Gaussian initialization (0%) to ImageNet-label pretraining (100%) on classification, detection, and segmentation, comparing autoencoder, Krähenbühl et al., Agrawal et al., Wang & Gupta, Pathak et al., Donahue et al., Doersch et al., and ours.]
  28. Amateur Family Photo, 1956.
  29. Amateur Family Photo, 1956.
  30. Henri Cartier-Bresson, Sunday on the Banks of the River Seine, 1938.
  31. Henri Cartier-Bresson, Sunday on the Banks of the River Seine, 1938.
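
The following sketches make the two pipelines above concrete. They are minimal illustrations written for this summary, not the authors' released code; all function names, box formats, and constants are assumptions.

First, the rectangle initialization from slides 7-10: pixels covered by no box are background, and where boxes overlap, the smaller box is painted last so that it ends up in front.

```python
import numpy as np

def boxes_to_seed_mask(boxes, classes, height, width):
    """Build an initial semantic label map from bounding boxes
    (a sketch of the rules on slides 8-10; names are hypothetical).

    - Pixels covered by no box stay background (label 0).
    - Where boxes overlap, the smaller box is assumed to be in front,
      so boxes are painted from largest to smallest.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    areas = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in boxes]
    for idx in np.argsort(areas)[::-1]:      # largest box first
        x1, y1, x2, y2 = boxes[idx]
        mask[y1:y2, x1:x2] = classes[idx]    # smaller boxes overwrite
    return mask

# Example: the smaller box (painted last) wins inside the larger one.
seed = boxes_to_seed_mask(
    boxes=[(10, 10, 200, 150), (120, 40, 180, 120)],
    classes=[13, 15],   # hypothetical class ids
    height=256, width=256)
```

Second, the per-round post-processing from slide 11. DenseCRF refinement is listed on the slide but omitted here:

```python
import numpy as np

def clean_prediction(pred_mask, box, class_id):
    """Post-process one predicted instance between training rounds
    (a sketch of slide 11, not the authors' exact procedure):
      1. discard any predicted pixel outside the bounding box;
      2. if the kept segment overlaps its box with IoU < 50%,
         re-initialize the instance to the full box.
    """
    x1, y1, x2, y2 = box
    inst = (pred_mask == class_id)

    kept = np.zeros_like(inst)                 # rule 1: clip to the box
    kept[y1:y2, x1:x2] = inst[y1:y2, x1:x2]

    box_mask = np.zeros_like(inst)             # rule 2: IoU check
    box_mask[y1:y2, x1:x2] = True
    inter = np.logical_and(kept, box_mask).sum()
    union = np.logical_or(kept, box_mask).sum()
    if union == 0 or inter / union < 0.5:
        kept = box_mask                        # reset to the annotation
    return kept
```

Third, the free supervision from slides 17-18: any RGB image splits into a grayscale input (the L channel) and a color target (the ab channels) in the Lab color space. This sketch uses scikit-image and assumes float RGB in [0, 1]:

```python
import numpy as np
from skimage import color

def make_colorization_pair(rgb):
    """Split an RGB image (float, in [0, 1]) into the network input
    (L channel) and the free supervisory target (ab channels)."""
    lab = color.rgb2lab(rgb)        # H x W x 3, channels (L, a, b)
    return lab[..., :1], lab[..., 1:]

def reassemble(L, ab_pred):
    """Concatenate the input L with predicted ab and convert to RGB."""
    return color.lab2rgb(np.concatenate([L, ab_pred], axis=-1))
```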

Editor's Notes

  • So formally, we are working in the Lab color space. The grayscale information is contained in the L, or lightness, channel of the image and is the input to our system.

    The output is the ab, or color channels.

    We’re looking to learn the mapping from L to ab using a CNN.

    We can then take the predicted ab channels, concatenate them with the input, and hopefully get a plausible colorization of the input image. This is the graphics benefit of this problem.

  • We note that any image can be broken up into its grayscale and color components, and in this manner, can serve as a free supervisory signal for training a CNN. So perhaps by learning to color, we can achieve a deep representation which has higher level abstractions, or semantics.

    Now, this learning problem is less straightforward than one may expect.
  • For example, consider this grayscale image.
  • This is the output after passing it through our system. Now, it seems to look plausible. Now here is the ground truth. So notice that these two look very different. But even though red and blue are far apart in ab space, we are just as happy with the red colorization as we are with the blue, and perhaps the red is even better...
  • This indicates that any loss which assumes a unimodal output distribution, such as an L2 regression loss, is likely to be inadequate.
  • We reformulate the problem as multinomial classification. We divide the output ab space into discrete bins of size 10 (a quantization sketch follows these notes).
  • The system does have some interesting failure cases. We find that many man-made objects can be multiple colors. The system sometimes has a difficult time deciding which one to go with, leading to this type of tie-dye effect.
  • Also, we find other curious behaviors and biases. For example, when the system sees a dog, it sometimes expects a tongue underneath. Even when there is none, it will just go ahead and hallucinate one for us anyways.
  • Due to time constraints, we will not be able to discuss all of the tests, but please come by our poster for more details.
  • We also see units which correspond to more “thing” categories, such as human and dog faces, and flowers. The network was able to discover these units in an unsupervised regime.
  • The y=0 line shows the performance if we initialize the network using Gaussian weights.

    The performance we are hoping to match is if we use imagenet labels to train the system. We will see how well each of these methods make up the difference between Gaussian initialization and using Imagenet labels.

    One method for learning features is the autoencoder, which relies on a bottleneck. Autoencoder features are not very semantically meaningful. Stacked k-means initialization, as implemented by Krähenbühl et al., makes up some of the ground.

    Previous self-supervision methods are shown here: inpainting, bidirectional GAN, relative context prediction.

    Finally, our method, outside of the Doersch detection result, performs competitively relative to other self-supervision methods. We found this result surprising, as our project was primarily focused on the graphics task of colorization. However, note the large gap between self-supervision methods and pre-training on ImageNet. There is still work to be done to achieve strong semantic representations without the benefit of labels.
  • This is an amateur family photo from the 1950s, showing my father and great-grandfather.
  • This is a professional photograph from Henri Cartier-Bresson.
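
The discrete bins and class rebalancing mentioned above can be sketched as well. Below is a minimal illustration, assuming an ab gamut of roughly [-110, 110], the bin size of 10 stated in the talk, and a mixing weight lam = 0.5; the paper additionally keeps only in-gamut bins, which is skipped here for simplicity.

```python
import numpy as np

GRID = 10                    # bin size from the talk
AB_MIN, AB_MAX = -110, 110   # assumed ab gamut extent

def ab_to_class(ab):
    """Quantize continuous ab values (H x W x 2) into grid cells so
    that colorization becomes multinomial classification."""
    n = (AB_MAX - AB_MIN) // GRID                      # bins per axis
    a_idx = np.clip((ab[..., 0] - AB_MIN) // GRID, 0, n - 1)
    b_idx = np.clip((ab[..., 1] - AB_MIN) // GRID, 0, n - 1)
    return (a_idx * n + b_idx).astype(np.int64)

def rebalancing_weights(class_prior, lam=0.5):
    """Per-class weights that up-weight rare (saturated) colors:
    mix the empirical color prior with a uniform distribution,
    invert, and normalize so the expected weight under the prior
    is 1. The exact mixing value is an assumption here."""
    q = len(class_prior)
    mixed = (1 - lam) * class_prior + lam / q
    w = 1.0 / mixed
    return w / (w * class_prior).sum()
```

Each pixel's cross-entropy loss is then scaled by the weight of its ground-truth bin.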
