Susang Kim(healess@kaist.ac.kr)
Object Detection
DetectoRS: Detecting Objects with Recursive Feature Pyramid
and Switchable Atrous Convolution
Object Detection Milestones
Object Detection in 20 Years: A Survey : https://arxiv.org/pdf/1905.05055.pdf
Two-stage detectors first find RoIs (via Selective Search, a Region Proposal Network, etc.) and then
perform bounding-box regression and classification on those candidates; one-stage detectors predict
classification and bounding-box regression directly from predefined anchor boxes.
Along the multi-stage line of detectors that refine RoIs over several stages,
this approach evolved into Cascade R-CNN and Hybrid Task Cascade,
the baseline models of DetectoRS.
The timeline of DL-based segmentation
Image Segmentation Using Deep Learning: A Survey : https://arxiv.org/pdf/2001.05566.pdf
Segmentation, widely used in medical imaging,
performs pixel-level prediction by producing a
segmentation map through upsampling +
skip combining (coarse -> dense).
1) Fully convolutional networks
2) Convolutional models with graphical models
3) Encoder-decoder based models
4) Multi-scale and pyramid network based models
5) R-CNN based models (for instance segmentation)
6) Dilated convolutional models and DeepLab family
7) Recurrent neural network based models
8) Attention-based models
9) Other models
Cascade Mask R-CNN (CVPR 2018)
Cascade R-CNN: High Quality Object Detection and Instance Segmentation : https://arxiv.org/pdf/1906.09756.pdf
Object detection typically fixes the IoU threshold at 0.5, but even lower-IoU
proposals still carry useful information, and since better bounding boxes also
improve segmentation, Cascade R-CNN sequentially extracts new information from
the boxes produced at the previous stage to improve performance.
Where Faster R-CNN uses a single classifier, Cascade R-CNN uses n classifiers
(a multi-stage detector).
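The multi-stage idea above can be sketched in a few lines of Python. This is a hedged toy illustration, not the actual Cascade R-CNN code: the real detector regresses (refines) each surviving box before passing it to the next stage, while here the boxes are only filtered by increasingly strict IoU thresholds.

```python
# Toy sketch of the Cascade R-CNN idea: each stage keeps only proposals
# whose IoU with the ground truth exceeds an increasing threshold, so
# later stages operate on progressively higher-quality boxes.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cascade_filter(proposals, gt, thresholds=(0.5, 0.6, 0.7)):
    """Keep only proposals that survive each stage's stricter IoU cut."""
    surviving = proposals
    for t in thresholds:
        surviving = [p for p in surviving if iou(p, gt) >= t]
    return surviving

gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (3, 3, 13, 13)]
print(cascade_filter(proposals, gt))  # only the exact box survives 0.7
```

The paper's thresholds (0.5, 0.6, 0.7) are the ones Cascade R-CNN uses for its three stages.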
Multi-stage detector HTC (CVPR 2019)
Hybrid Task Cascade (HTC) won the 2018 COCO Challenge instance segmentation task and is used as
the baseline of this work (the MMDetection team: SenseTime and The Chinese University of Hong Kong).
It improves on the parallel structure of Cascade Mask R-CNN by interleaving the bounding-box and mask
branches at each stage and connecting the mask features across stages, strengthening the information flow.
Hybrid Task Cascade for Instance Segmentation : https://arxiv.org/pdf/1901.07518.pdf
DetectoRS
Published on arXiv in June 2020, the paper
proposes a new backbone design that applies
the "looking and thinking twice" idea,
long used in modern object detectors,
at the macro level (Recursive Feature Pyramid)
and at the micro level (Switchable Atrous
Convolution).
On COCO test-dev, DetectoRS
achieves state-of-the art 55.7% box
AP for object detection, 48.5% mask
AP for instance segmentation, and
50.0% PQ for panoptic segmentation.
(2020.07)
As of March 2021 the leaderboard is led by
Noah CV Lab (Huawei) at 58.8%.
https://cocodataset.org/#detection-leaderboard
Introduction
The Recursive Feature Pyramid (RFP) applies Looking and Thinking Twice at the macro level, while
Switchable Atrous Convolution (SAC) applies different atrous rates via switch functions at the micro level.
Combining RFP and SAC yields strong performance across a variety of tasks.
Looking and Thinking Twice
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Cao_Look_and_Think_ICCV_2015_paper.pdf
Presented in Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks (ICCV 2015):
a person often identifies an object at a single glance, but when one look is not enough, looking at the image
again more carefully makes the object recognizable; the paper formalizes this as a feedback network.
By “looking and thinking twice”, both human recognition and
detection performances increase significantly, especially in
images with cluttered background.
Feature Pyramid Networks for Object Detection(CVPR 2017)
FPN is a practical and accurate solution for multi-scale object
detection. The alternatives: a) detecting objects at every level of an
image pyramid is slow; b) using a single compressed feature map is
fast but less accurate; c) detecting from the features of each layer
(the SSD-style pyramid would reuse the multi-scale feature maps
from different layers).
FPN addresses these problems by tying low and high resolutions
together, reusing higher-level features at lower levels, and thus
exploiting multi-scale features efficiently
(bottom-up pathway + top-down pathway and lateral connections).
Similar architectures adopting top-down and skip connections
Similar architectures adopting top-down and skip connections
https://arxiv.org/pdf/1612.03144.pdf
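The top-down pathway with lateral connections can be sketched numerically. A minimal sketch, assuming nearest-neighbour 2x upsampling and identity lateral connections (the real FPN uses 1x1 convolutions on the laterals and 3x3 convolutions on the merged maps):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling of an (H, W) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(laterals):
    """laterals: bottom-up features ordered coarse -> fine, each 2x larger.
    Each finer level adds its lateral to the upsampled coarser output."""
    outputs = [laterals[0]]  # coarsest level passes through unchanged
    for lat in laterals[1:]:
        outputs.append(lat + upsample2x(outputs[-1]))
    return outputs

c5 = np.ones((2, 2))  # toy coarse backbone feature
c4 = np.ones((4, 4))  # toy finer backbone feature
p5, p4 = fpn_top_down([c5, c4])
print(p4)  # every element is 2: lateral (1) + upsampled top-down (1)
```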
Recursive Feature Pyramid
ResNet has four stages, each of which is composed of several similar blocks.
EfficientDet instead uses BiFPN layers.
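The recursion itself can be shown with toy numbers. In this hypothetical sketch the backbone stages and the neck are trivial arithmetic placeholders; the point is only the unrolled feedback loop, where the FPN outputs of step t re-enter the backbone stages at step t+1 (the paper unrolls the recursion, typically with T = 2):

```python
# Minimal numeric sketch of the Recursive Feature Pyramid: the FPN output
# of one pass is fed back into the backbone stages on the next pass.
# Stage and neck bodies are toy placeholders, not the paper's layers.

def backbone(x, feedback):
    """Two toy 'stages'; feedback from the previous FPN pass is added in."""
    s1 = x * 2 + feedback[0]
    s2 = s1 + 1 + feedback[1]
    return [s1, s2]

def fpn(features):
    """Toy neck: identity here, just to close the feedback loop."""
    return features

def recursive_feature_pyramid(x, unroll_steps=2):
    feedback = [0, 0]                    # first pass sees no feedback
    for _ in range(unroll_steps):
        feats = backbone(x, feedback)
        feedback = fpn(feats)            # becomes next pass's feedback
    return feedback

print(recursive_feature_pyramid(1))  # second pass differs from the first
```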
Atrous Spatial Pyramid Pooling (ASPP)
Atrous Spatial Pyramid Pooling (ASPP) is used to implement the
connecting module R.
Each of the four branches yields a feature with 1/4 of the output channels.
ASPP is plugged into FPN to enrich features, similar to the
mini-DeepLab design in Seamless.
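The channel bookkeeping of the connecting module can be sketched as follows. The branch bodies here are placeholders (the real module uses a 1x1 convolution, atrous 3x3 convolutions, and a global-pooling branch); what the sketch preserves is that four parallel branches each emit out_channels/4 channels and are concatenated back to out_channels:

```python
import numpy as np

def aspp_connect(x, out_channels=8):
    """x: (C, H, W) feature map; returns (out_channels, H, W).
    Three 'conv' branches plus one global-pooling branch, each c channels."""
    c = out_channels // 4
    h, w = x.shape[1:]
    branches = []
    for rate in (1, 3, 6):  # toy stand-ins for 1x1 / atrous conv branches
        branches.append(np.full((c, h, w), x.mean() * rate))
    pooled = x.mean()       # toy stand-in for the global-pooling branch
    branches.append(np.full((c, h, w), pooled))
    return np.concatenate(branches, axis=0)

y = aspp_connect(np.ones((4, 5, 5)))
print(y.shape)  # (8, 5, 5): four branches of 2 channels each
```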
Atrous(Dilated) Convolution
"Dilated" means expanded, and "atrous" comes from the French à trous
("with holes"): holes in the receptive field let the kernel see a wider
area (a 5x5 view with only a 3x3 amount of computation). Because it
preserves spatial features, it is widely used for segmentation.
At the micro level, this paper computes weights at different rates
through Switchable Atrous Convolution (SAC).
https://medium.com/hitchhikers-guide-to-deep-learning/10-introduction-to-deep-learning-with-computer-vision-types-of-convolutions-atrous-convolutions-3cf142f77bc0
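The "holes" can be made concrete with a small numpy implementation. A minimal single-channel sketch (valid padding, 3x3 kernel): with rate r the kernel samples a (2r+1)x(2r+1) window while still performing only nine multiplications per output pixel, which is exactly the "5x5 view at 3x3 cost" above for r = 2:

```python
import numpy as np

def atrous_conv2d(x, w, rate=1):
    """Valid-mode dilated convolution; x: (H, W), w: (3, 3) kernel."""
    k = 3
    span = (k - 1) * rate + 1                 # effective receptive field
    h, w_out = x.shape[0] - span + 1, x.shape[1] - span + 1
    y = np.zeros((h, w_out))
    for i in range(h):
        for j in range(w_out):
            # striding by `rate` leaves holes of size rate-1 in the window
            patch = x[i:i + span:rate, j:j + span:rate]
            y[i, j] = (patch * w).sum()
    return y

x = np.ones((7, 7))
w = np.ones((3, 3))
print(atrous_conv2d(x, w, rate=2).shape)  # (3, 3): a 5x5 field per output
print(atrous_conv2d(x, w, rate=2)[0, 0])  # 9.0: still only 9 weights used
```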
Switchable Atrous Convolution
We use y = Conv(x, w, r) to denote the convolutional operation with weight w and atrous rate r which takes x
as its input and outputs y. Then, we can convert a convolutional layer to SAC as follows.
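Under those definitions, the SAC transform is S(x) · Conv(x, w, 1) + (1 − S(x)) · Conv(x, w + Δw, 3). In this hedged numpy sketch the switch S is a toy global sigmoid rather than the paper's learned per-pixel switch (5x5 average pooling followed by a 1x1 convolution), and the pre-/post-global-context modules are omitted:

```python
import numpy as np

def dilated_conv(x, w, rate):
    """Valid-mode 3x3 dilated convolution on an (H, W) array."""
    span = 2 * rate + 1
    h, wd = x.shape[0] - span + 1, x.shape[1] - span + 1
    y = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            y[i, j] = (x[i:i + span:rate, j:j + span:rate] * w).sum()
    return y

def sac(x, w, dw, rate_big=3):
    """y = S(x) * Conv(x, w, 1) + (1 - S(x)) * Conv(x, w + dw, rate_big)."""
    y1 = dilated_conv(x, w, 1)           # small-rate branch, shared weight w
    y2 = dilated_conv(x, w + dw, rate_big)  # large-rate branch, w plus delta
    # crop the rate-1 output so the two branches align spatially
    crop = (y1.shape[0] - y2.shape[0]) // 2
    y1 = y1[crop:crop + y2.shape[0], crop:crop + y2.shape[1]]
    s = 1.0 / (1.0 + np.exp(-x.mean()))  # toy global switch in (0, 1)
    return s * y1 + (1 - s) * y2

out = sac(np.ones((9, 9)), np.ones((3, 3)) * 0.1, np.zeros((3, 3)))
print(out.shape)  # (3, 3)
```

Sharing w between the two branches (with only Δw extra) is what lets the paper convert every pretrained 3x3 convolution in the backbone into SAC without retraining from scratch.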
Ablation Studies & Implementation Details
COCO dataset - Train : train2017 (115k labeled images) / Test : val2017
Network - HTC(ResNet and parameter set) + DetectoRS with mmdetection
Runtime - NVIDIA TITAN RTX (Turing)
train models for 12 epochs with the learning rate multiplied by 0.1 after 8 and 12 epochs.
40 epochs with the learning rate multiplied by 0.1 after 36 and 39 epochs.
Soft-NMS is used
Results are reported with and without test-time augmentation (TTA): horizontal flip and multi-scale testing.
Experiments
Average Precision (AP):
AP: AP at IoU=.50:.05:.95 (primary challenge metric)
AP^IoU=.50: AP at IoU=.50 (PASCAL VOC metric)
AP^IoU=.75: AP at IoU=.75 (strict metric)
AP Across Scales:
AP^small: AP for small objects (area < 32²)
AP^medium: AP for medium objects (32² < area < 96²)
AP^large: AP for large objects (area > 96²)
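The primary metric can be made concrete: COCO's headline AP averages the AP computed at ten IoU thresholds, 0.50 to 0.95 in steps of 0.05. The per-threshold AP values below are made-up numbers purely to show the averaging; real evaluation uses pycocotools over the full precision-recall surface:

```python
# COCO primary metric: average AP over the 10 IoU thresholds .50:.05:.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
print(thresholds)  # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# Dummy per-threshold APs: stricter IoU thresholds yield lower AP.
ap_per_threshold = {t: max(0.0, round(0.9 - t, 2)) for t in thresholds}
coco_ap = sum(ap_per_threshold.values()) / len(ap_per_threshold)
print(round(coco_ap, 4))  # 0.18
```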
Panoptic Segmentation (PQ metrics)
https://medium.com/@danielmechea/panoptic-segmentation-the-panoptic-quality-metric-d69a6c3ace30
Experiments
Submitted on 3 Jun 2020
Submitted on 20 Nov 2019
one-stage detectors
two or multi-stage detectors
HTC baseline
TTA: applying augmentation at test time (the original
image plus flips, scales, etc.) and aggregating the
results to improve performance.
Best performance on
bounding-box detection.
Conclusion
In this paper, motivated by the design philosophy of looking and thinking twice, we have proposed
DetectoRS, which includes Recursive Feature Pyramid and Switchable Atrous Convolution. Recursive
Feature Pyramid implements thinking twice at the macro level, where the outputs of FPN are brought
back to each stage of the bottom-up backbone through feedback connections. Switchable Atrous
Convolution instantiates looking twice at the micro level, where the inputs are convolved with two
different atrous rates. DetectoRS is tested on COCO for object detection, instance segmentation and
panoptic segmentation. It sets new state-of-the-art results on all these tasks.
RFP → similar to human
visual perception that
selectively enhances or
suppresses neuron
activations, is able to find
occluded objects more easily
for which the nearby context
information is more critical.
SAC → increase the
field-of-view as needed, is
more capable of detecting
large objects in the images.
Thanks
Any Questions?
You can send mail to
Susang Kim(healess@kaist.ac.kr)
