Panoptic Segmentation
@CVPR2019
22nd July 2019
AI System Group
Kosuke Kuzuoka
● Profile
○ Kosuke Kuzuoka
○ 22 years old
● Experience
○ June 2018 - Present
AI Research Engineer at DeNA Co., Ltd.
○ March 2017 - June 2018
R&D manager at CONCORE’S, inc.
● Interests
○ Self Driving Cars
○ Computer Vision
Who I am
Facebook GitHub LinkedIn
Panoptic Segmentation
Semantic segmentation can:
- Segment object classes without
separating individual instances
- Segment every pixel in the
input image
Instance segmentation can:
- Segment each object instance
with its own boundary
- Segment objects inside each RoI
(Region of Interest)
A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation
Panoptic Segmentation
Every instance that belongs to a thing class (people,
cars, etc.) needs to be identified (instance
segmentation), while every pixel that belongs to a
stuff class (sky, road, etc.) needs to be correctly
classified (semantic segmentation)
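A minimal sketch of what such an output could look like, assuming each pixel is labelled with a (class, instance id) pair. The class set, the `to_panoptic` helper and the use of 0 as the shared id for stuff pixels are illustrative assumptions, not the data format of any particular dataset.

```python
# Hypothetical thing classes; everything else is treated as stuff.
THING_CLASSES = {"person", "car"}


def to_panoptic(semantic, instance_ids):
    """Combine a semantic map and an instance-id map into panoptic labels.

    semantic:     H x W nested list of class names (one per pixel)
    instance_ids: H x W nested list of ints (ids only matter for things)
    Returns an H x W nested list of (class_name, instance_id) tuples.
    """
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance_ids):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            # Stuff pixels carry no instance identity, so they all get id 0.
            row.append((cls, inst if cls in THING_CLASSES else 0))
        panoptic.append(row)
    return panoptic


# One image row: sky, two different cars, road.
labels = to_panoptic([["sky", "car", "car", "road"]], [[0, 1, 2, 0]])
```

The two car pixels keep distinct ids, while sky and road are only classified.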
What’s the challenge in this task?
Panoptic Segmentation
Semantic segmentation:
● FCN (Fully Convolutional Network) and DC
(Dilated Convolution) are widely used in
high-accuracy semantic segmentation networks
● Each pixel is classified by producing an output
feature map with the same height and width as
the input image; only the channel depth differs
Instance segmentation:
● The first part of the network produces
class-agnostic boxes (RoIs), which are then
classified by the second part of the network
● Box refinement and pixel classification are
applied to each RoI produced by the RPN
(Region Proposal Network)
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
networks for semantic segmentation
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN
A panoptic segmentation network is difficult
to design, because the two architectures differ!
Is there any dataset for this task?
● Cityscapes
○ 5000 images (2975 train, 500 val and 1525 test)
○ 19 classes (8 thing classes and 11 stuff classes)
● ADE20k
○ 25k images (20k train, 2k val and 3k test)
○ 150 classes (100 thing classes and 50 stuff classes)
● Mapillary Vistas
○ 25k images (18k train, 2k val and 5k test)
○ 65 classes (37 thing classes and 28 stuff classes)
Panoptic Segmentation
New task, new evaluation metric!
Panoptic Feature Pyramid Network
A predicted segment whose IoU with a GT
segment is greater than 0.5 is considered a TP
The predicted class must also match the GT
class; otherwise the prediction is an FP
RQ (Recognition Quality) is the F1
score of the segment matching,
while SQ (Segmentation
Quality) is the mean IoU of the TP
segments. PQ (Panoptic Quality) is
their product: PQ = SQ x RQ.
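The metric can be sketched in a few lines. This is a simplified, single-class version: it assumes the matching has already been done, so the inputs are the IoUs of the TP pairs plus the FP and FN counts, and the function name `panoptic_quality` is hypothetical. In the paper PQ is computed per class and then averaged over classes, which this sketch omits.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every TP (prediction, GT) pair, each > 0.5
    num_fp:       number of unmatched predicted segments
    num_fn:       number of unmatched GT segments
    """
    num_tp = len(matched_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if num_tp == 0 or denominator == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp   # mean IoU over TP segments
    rq = num_tp / denominator         # F1 score of the matching
    return sq * rq, sq, rq


# Two matches with IoU 0.75 each, two false positives, two false negatives.
pq, sq, rq = panoptic_quality([0.75, 0.75], num_fp=2, num_fn=2)
# (pq, sq, rq) == (0.375, 0.75, 0.5)
```

Because SQ only averages over matched segments, a network can score a high SQ with few detections; RQ penalises the missed and spurious segments.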
● An end-to-end panoptic segmentation network proposed by FAIR (Facebook AI
Research)
● Extends Mask R-CNN to semantic segmentation by attaching a newly
proposed semantic segmentation branch
● Achieved highly competitive accuracy compared to other single-network
panoptic segmentation methods, with less memory usage
Panoptic Feature Pyramid Network
What are the motivations?
● Most panoptic segmentation networks rely on separate backbone networks
because the two architectures differ (not end-to-end)
● Because the backbone networks are separate, they don't share weights, so
inference takes too much time
Panoptic Feature Pyramid Network
Solve semantic segmentation with an instance
segmentation network plus simple modifications!
SOLVED
Alexander Kirillov, Ross Girshick, Kaiming He, Piotr Dollár. Panoptic Feature Pyramid Networks
Panoptic Feature Pyramid Network
Each RoI is used for
classification and pixel
segmentation by the instance
segmentation branch
Feature maps from FPN are
used for pixel level
classification by the semantic
segmentation branch
Panoptic Feature Pyramid Network
● ResNet FPN as the backbone network
● Backbone network is pre-trained on ImageNet dataset
● Output strides are set to 32, 16, 8 and 4
● Feature maps are used for both the instance segmentation
branch and the semantic segmentation branch
(Figure labels: FPN feature maps with 256 channels at 1/32, 1/16 and 1/8 scale)
Panoptic Feature Pyramid Network
Instance segmentation branch:
● RoI pooling, box refinement and pixel-level segmentation
are applied to each RoI from the RPN
● Same design as the original Mask R-CNN
Semantic segmentation branch:
● The goal of this branch is to produce a single
feature map by merging feature maps of
different sizes
● 3x3 conv, GN (Group Normalization), ReLU and
bilinear interpolation bring the feature maps
to the same size and depth
● The feature maps are combined by
element-wise addition; finally 1x1 conv,
bilinear interpolation and softmax are
applied for pixel-level classification
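The merging step can be traced with a small shape calculation. This sketch assumes that every FPN level is brought to 1/4 scale by repeated 2x upsampling stages, and that the merged map has 128 channels; both numbers and the function names are assumptions made for illustration, so check them against the paper before relying on them.

```python
def upsampling_stages(strides=(32, 16, 8, 4), target_stride=4):
    """Number of 2x upsampling stages each FPN level needs to reach target_stride."""
    stages = {}
    for stride in strides:
        ratio = stride // target_stride
        # ratio is a power of two, so bit_length() - 1 equals log2(ratio).
        stages[stride] = ratio.bit_length() - 1
    return stages


def merged_shape(image_hw, target_stride=4, channels=128):
    """Shape (C, H, W) of the single feature map after element-wise addition."""
    height, width = image_hw
    return (channels, height // target_stride, width // target_stride)


stages = upsampling_stages()
shape = merged_shape((512, 1024))
```

Because every level ends at the same size and depth, the element-wise addition is well defined; the final bilinear interpolation then restores the input resolution.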
● Evaluated mIoU for semantic segmentation tasks using Mask R-CNN +
semantic branch (Semantic FPN)
● Evaluated mIoU and AP for semantic and instance segmentation tasks using
Mask R-CNN + semantic / instance branch
● Evaluated PQ and compared to other single panoptic segmentation networks
Panoptic Feature Pyramid Network
Experiments?
Panoptic Feature Pyramid Network
● Semantic FPN achieved competitive results on the Cityscapes and MS COCO datasets with less memory usage
● The results suggest that an instance segmentation architecture can be turned into a semantic
segmentation network with a relatively small change
● Because Semantic FPN doesn't use DC (Dilated Conv), it is more efficient than networks that do
Panoptic Feature Pyramid Network
● Balancing the loss weight λ is important for the end-to-end network to perform well on both instance and
semantic segmentation tasks
● The results suggest that if λ is set properly, the instance segmentation results benefit from the semantic
segmentation branch and vice versa
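The role of λ can be illustrated with a sketch of a weighted multi-task loss. It assumes the total loss is a weighted sum of the instance branch losses (classification, box and mask) and the semantic branch loss; the function name, the argument names and the default weights are hypothetical and are not the values tuned in the paper.

```python
def combined_loss(loss_cls, loss_box, loss_mask, loss_sem,
                  lambda_instance=1.0, lambda_semantic=0.5):
    """Weighted sum of instance and semantic segmentation losses.

    lambda_instance and lambda_semantic are placeholder weights; in practice
    they are tuned so that neither task dominates training.
    """
    instance_loss = loss_cls + loss_box + loss_mask
    return lambda_instance * instance_loss + lambda_semantic * loss_sem


total = combined_loss(0.5, 0.25, 0.25, 2.0)
```

Raising the semantic weight pushes the shared backbone toward features that help pixel-level classification, and lowering it favours the instance branch.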
Panoptic Feature Pyramid Network
● The proposed network outperformed other single-network panoptic segmentation methods by a large margin on both
Cityscapes and MS COCO
● The margin is larger for thing classes than for stuff classes, because Panoptic FPN is
basically Mask R-CNN with a semantic branch
One more panoptic segmentation network!
● An end-to-end panoptic segmentation network from Uber ATG
● Mask R-CNN is used for instance and semantic segmentation by attaching a
semantic segmentation head to it
● Outputs from the instance segmentation head and semantic segmentation
head are merged by a newly proposed head, called the panoptic head
● Achieved higher PQ when compared to other panoptic segmentation networks
on COCO and Cityscapes datasets
UPSNet: A Unified Panoptic Segmentation Network
UPSNet: A Unified Panoptic Segmentation Network
● Like Panoptic FPN, UPSNet uses a single
backbone network
● ResNet50 FPN is used for the backbone
network
● The output strides of the FPN are 4, 8, 16 and 32
● These feature maps are used for the instance
head and semantic head
Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel Urtasun. UPSNet: A Unified Panoptic Segmentation Network
UPSNet: A Unified Panoptic Segmentation Network
Semantic head:
● Deformable conv is applied to the outputs of the FPN
● Upsampling brings all feature maps to the
same size after the deformable conv
● The feature maps are concatenated, followed
by a 1x1 conv to classify each pixel
● The goal of the semantic head is to classify
every pixel in the image, while not affecting
thing class predictions
● Cross entropy loss and RoI loss are used for
the semantic head
Instance head:
● Thing classes are classified and detected
by this head, just as in Mask R-CNN
● The goal of this head is to classify thing
classes by extracting instance-aware features
What’s new in this network?
UPSNet: A Unified Panoptic Segmentation Network
● Xstuff from the semantic branch is used to classify stuff classes, and is directly
mapped to the panoptic logits, Z
● Xmask is obtained by cropping Xthing from the semantic branch with the GT
bounding box region
● The output of the instance branch (Yi) is added to Xmask to get a pixel-level
classification result for each instance
● The class of each pixel is determined by taking the argmax over the channel axis
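The steps above can be sketched on plain nested lists. This is a simplified illustration, not the UPSNet implementation: it assumes each Yi has already been resized and zero-padded to the full image size, takes the boxes as given, and leaves out any additional channels the real panoptic head may use. The function name `panoptic_head` and the input layout are hypothetical.

```python
def panoptic_head(x_stuff, x_thing, instances):
    """Build panoptic logits Z and return the per-pixel argmax channel.

    x_stuff:   list of H x W logit maps, one per stuff class
    x_thing:   list of H x W logit maps, one per thing class
    instances: list of (thing_class, (y0, x0, y1, x1), y_i), where y_i is an
               H x W map of mask logits for that instance
    Channels 0..N_stuff-1 are stuff classes; channel N_stuff + i is instance i.
    """
    height, width = len(x_stuff[0]), len(x_stuff[0][0])
    # Stuff logits are copied directly into Z.
    z = [[row[:] for row in channel] for channel in x_stuff]
    for thing_class, (y0, x0, y1, x1), y_i in instances:
        channel = []
        for r in range(height):
            row = []
            for c in range(width):
                inside = y0 <= r < y1 and x0 <= c < x1
                # X_mask: thing logits of the instance's class, cropped to its box.
                x_mask = x_thing[thing_class][r][c] if inside else 0.0
                row.append(x_mask + y_i[r][c])
            channel.append(row)
        z.append(channel)
    # Argmax over the channel axis for every pixel.
    return [
        [max(range(len(z)), key=lambda k: z[k][r][c]) for c in range(width)]
        for r in range(height)
    ]


# Toy 1 x 4 image: one stuff class, one thing class, two instances.
x_stuff = [[[2.0, 2.0, 0.0, 0.0]]]
x_thing = [[[0.0, 1.0, 3.0, 3.0]]]
instances = [
    (0, (0, 1, 1, 3), [[0.0, 0.5, 1.0, 0.0]]),
    (0, (0, 3, 1, 4), [[0.0, 0.0, 0.0, 2.0]]),
]
result = panoptic_head(x_stuff, x_thing, instances)
# result == [[0, 0, 1, 2]]
```

Pixels 0 and 1 stay with the stuff channel, while pixels 2 and 3 go to the first and second instance respectively, so instances of the same thing class end up in separate channels.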
● Evaluated on MS COCO and Cityscapes dataset using PQ
● Compared with other panoptic segmentation networks on Cityscapes dataset
● Compared with ensemble panoptic segmentation network model using MS
COCO dataset
UPSNet: A Unified Panoptic Segmentation Network
UPSNet: A Unified Panoptic Segmentation Network
● UPSNet achieved competitive results on
the COCO dataset (upper figure)
with significantly fewer parameters
(almost half, as mentioned in the paper)
● Even though the other networks use
ensembling, UPSNet achieved a
competitive PQ on the MS COCO dataset
(lower figure)
● Thing classes especially benefited from
the Mask R-CNN architecture
Let’s summarise!
● Panoptic segmentation is a relatively new task and has been steadily
gaining popularity
● Panoptic segmentation networks will be available in PyTorch in a future
release
● Large scale datasets are publicly available for panoptic segmentation networks
(MS COCO, Cityscapes etc.)
Summary
Thanks!
