Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Object Detection Beyond Mask R-CNN and RetinaNet I

33 views

Published on

ICME2019 Tutorial: Object Detection Beyond Mask R-CNN and RetinaNet I

Published in: Internet
  • Be the first to comment

  • Be the first to like this

Object Detection Beyond Mask R-CNN and RetinaNet I

  1. 1. Gang Yu 旷 视 研 究 院 Object Detection in Recent 3 Years Beyond RetinaNet and Mask R-CNN
  2. 2. Schedule of Tutorial • Lecture 1: Beyond RetinaNet and Mask R-CNN (Gang Yu) • Lecture 2: AutoML for Object Detection (Xiangyu Zhang) • Lecture 3: Finegrained Visual Analysis (Xiu-shen Wei)
  3. 3. Outline • Introduction to Object Detection • Modern Object detectors • One Stage detector vs Two-stage detector • Challenges • Backbone • Head • Pretraining • Scale • Batch Size • Crowd • NAS • Fine-Grained • Conclusion
  4. 4. Outline • Introduction to Object Detection • Modern Object detectors • One Stage detector vs Two-stage detector • Challenges • Backbone • Head • Pretraining • Scale • Batch Size • Crowd • NAS • Fine-Grained • Conclusion
  5. 5. What is object detection?
  6. 6. What is object detection?
  7. 7. Detection - Evaluation Criteria Average Precision (AP) and mAP Figures are from wikipedia
  8. 8. Detection - Evaluation Criteria mmAP Figures are from http://cocodataset.org
  9. 9. How to perform a detection? • Sliding window: enumerate all the windows (up to millions of windows) • VJ detector: cascade chain • Fully Convolutional network • shared computation Robust Real-time Object Detection; Viola, Jones; IJCV 2001 http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf
  10. 10. General Detection Before Deep Learning • Feature + classifier • Feature • Haar Feature • HOG (Histogram of Gradient) • LBP (Local Binary Pattern) • ACF (Aggregated Channel Feature) • … • Classifier • SVM • Bootsing • Random Forest
  11. 11. Traditional Hand-crafted Feature: HoG
  12. 12. Traditional Hand-crafted Feature: HoG
  13. 13. General Detection Before Deep Learning Traditional Methods • Pros • Efficient to compute (e.g., HAAR, ACF) on CPU • Easy to debug, analyze the bad cases • reasonable performance on limited training data • Cons • Limited performance on large dataset • Hard to be accelerated by GPU
  14. 14. Deep Learning for Object Detection Based on the whether following the “proposal and refine” • One Stage • Example: Densebox, YOLO (YOLO v2), SSD, Retina Net • Keyword: Anchor, Divide and conquer, loss sampling • Two Stage • Example: RCNN (Fast RCNN, Faster RCNN), RFCN, FPN, MaskRCNN • Keyword: speed, performance
  15. 15. A bit of History Image Feature Extractor classification localization (bbox) One stage detector Densebox (2015) UnitBox (2016) EAST (2017) YOLO (2015) Anchor Free Anchor importedYOLOv2 (2016) SSD (2015) RON(2017) RetinaNet(2017) DSSD (2017) two stages detector Image Feature Extractor classification localization (bbox) Proposal classification localization (bbox) Refine RCNN (2014) Fast RCNN(2015) Faster RCNN (2015) RFCN (2016) MultiBox(2014) RFCN++ (2017) FPN (2017) Mask RCNN (2017) OverFeat(2013)
  16. 16. Outline • Introduction to Object Detection • Modern Object detectors • One Stage detector vs Two-stage detector • Challenges • Backbone • Head • Pretraining • Scale • Batch Size • Crowd • NAS • Fine-Grained • Conclusion
  17. 17. Modern Object detectors Backbone Head • Modern object detectors • RetinaNet • f1-f7 for backbone, f3-f7 with 4 convs for head • FPN with ROIAlign • f1-f6 for backbone, two fcs for head • Recall vs localization • One stage detector: Recall is high but compromising the localization ability • Two stage detector: Strong localization ability Postprocess NMS
  18. 18. One Stage detector: RetinaNet • FPN Structure • Focal loss Focal Loss for Dense Object Detection, Lin etc, ICCV 2017 Best student paper
  19. 19. One Stage detector: RetinaNet • FPN Structure • Focal loss Focal Loss for Dense Object Detection, Lin etc, ICCV 2017 Best student paper
  20. 20. Two-Stage detector: FPN/Mask R-CNN • FPN Structure • ROIAlign Mask R-CNN, He etc, ICCV 2017 Best paper
  21. 21. What is next for object detection? • The pipeline seems to be mature • There still exists a large gap between existing state-of-arts and product requirements • The devil is in the detail
  22. 22. Outline • Introduction to Object Detection • Modern Object detectors • One Stage detector vs Two-stage detector • Challenges • Backbone • Head • Pretraining • Scale • Batch Size • Crowd • NAS • Fine-Grained • Conclusion
  23. 23. Challenges Overview • Backbone • Head • Pretraining • Scale • Batch Size • Crowd • NAS • Fine-grained Backbone Head Postprocess NMS
  24. 24. Challenges - Backbone • Backbone network is designed for classification task but not for localization task • Receptive Field vs Spatial resolution • Only f1-f5 is pretrained but randomly initializing f6 and f7 (if applicable)
  25. 25. Backbone - DetNet • DetNet: A Backbone network for Object Detection, Li etc, 2018, https://arxiv.org/pdf/1804.06215.pdf
  26. 26. Backbone - DetNet
  27. 27. Backbone - DetNet
  28. 28. Backbone - DetNet
  29. 29. Backbone - DetNet
  30. 30. Backbone - DetNet
  31. 31. Challenges - Head • Speed is significantly improved for the two-stage detector • RCNN - > Fast RCNN -> Faster RCNN - > RFCN • How to obtain efficient speed as one stage detector like YOLO, SSD? • Small Backbone • Light Head
  32. 32. Head – Light head RCNN • Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017, https://arxiv.org/pdf/1711.07264.pdf Code: https://github.com/zengarden/light_head_rcnn
  33. 33. Head – Light head RCNN • Backbone • L: Resnet101 • S: Xception145 • Thin Feature map • L:C_{mid} = 256 • S: C_{mid} =64 • C_{out} = 10 * 7 * 7 • R-CNN subnet • A fc layer is connected to the PS ROI pool/Align
  34. 34. Head – Light head RCNN
  35. 35. Head – Light head RCNN
  36. 36. Head – Light head RCNN • Mobile Version • ThunderNet: Towards Real-time Generic Object Detection, Qin etc, Arxiv 2019 • https://arxiv.org/abs/1903.11752
  37. 37. Pretraining – Objects365 • ImageNet pretraining is usually employed for backbone training • Training from Scratch • Scratch Det claims GN/BN is important • Rethinking ImageNet Pretraining validates that training time is important
  38. 38. Pretraining – Objects365 • Objects365 Dataset
  39. 39. Pretraining – Objects365 • Pretraining with Objects365 vs ImageNet vs from Sctratch
  40. 40. Pretraining – Objects365 • Pretraining on Backbone or Pretraining on both backbone and head
  41. 41. Pretraining – Objects365 • Results on VOC Detection & VOC Segmentation
  42. 42. Pretraining – Objects365 • Summary • Pretraining is important to reduce the training time • Pretraining with a large dataset is beneficial for the performance
  43. 43. Challenges - Scale • Scale variations is extremely large for object detection
  44. 44. Challenges - Scale • Scale variations is extremely large for object detection • Previous works • Divide and Conquer: SSD, DSSD, RON, FPN, … • Limited Scale variation • Scale Normalization for Image Pyramids, Singh etc, CVPR2018 • Slow inference speed • How to address extremely large scale variation without compromising inference speed?
  45. 45. Scale - SFace • SFace: An Efficient Network for Face Detection in Large Scale Variations, 2018, http://cn.arxiv.org/pdf/1804.06559.pdf • Anchor-based: • Good localization for the scales which are covered by anchors • Difficult to address all the scale ranges of faces • Anchor-free: • Able to cover various face scales • Not good for the localization ability
  46. 46. Scale - SFace
  47. 47. Scale - SFace
  48. 48. Scale - SFace
  49. 49. Scale - SFace • Summary: • Integrate anchor-based and anchor-free for the scale issue • A new benchmark for face detection with large scale variations: 4K Face
  50. 50. Challenges - Batchsize • Small mini-batchsize for general object detection • 2 for R-CNN, Faster RCNN • 16 for RetinaNet, Mask RCNN • Problem with small mini-batchsize • Long training time • Insufficient BN statistics • Inbalanced pos/neg ratio
  51. 51. Batchsize – MegDet • MegDet: A Large Mini-Batch Object Detector, CVPR2018, https://arxiv.org/pdf/1711.07240.pdf
  52. 52. Batchsize – MegDet • Techniques • Learning rate warmup • Cross-GPU Batch Normalization
  53. 53. Challenges - Crowd • NMS is a post-processing step to eliminate multiple responses on one object instance • Reasonable for mild crowdness like COCO and VOC • Will Fail in the case when the objects are in a crowd
  54. 54. Challenges - Crowd • A few works have been devoted to this topic • Softnms, Bodla etc, ICCV 2017, http://www.cs.umd.edu/~bharat/snms.pdf • Relation Networks, Hu etc, CVPR 2018, https://arxiv.org/pdf/1711.11575.pdf • Lacking a good benchmark for evaluation in the literature
  55. 55. Crowd - CrowdHuman • CrowdHuman: A Benchmark for Detecting Human in a Crowd, 2018, https://arxiv.org/pdf/1805.00123.pdf, http://www.crowdhuman.org/ • A benchmark with Head, Visible Human, Full body bounding-box • Generalization ability for other head/pedestrian datasets • Crowdness
  56. 56. Crowd - CrowdHuman
  57. 57. Crowd-CrowdHuman
  58. 58. Crowd-CrowdHuman • Generalization • Head • Pedestrian • COCO
  59. 59. Conclusion • The task of object detection is still far from solved • Details are important to further improve the performance • Backbone • Head • Pretraining • Scale • Batchsize • Crowd • The improvement of object detection will be a significantly boost for the computer vision industry
  60. 60. 广告部分 • Megvii Detection 知乎专栏 Email: yugang@megvii.com

×