
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcelona 2020

These slides provide an overview of the most popular approaches to date for solving object detection with deep neural networks. They review both two-stage approaches such as R-CNN, Fast R-CNN and Faster R-CNN, and one-stage approaches such as YOLO and SSD. They also include pointers to relevant datasets (Pascal VOC, COCO, ILSVRC, Open Images) and the definition of the Average Precision (AP) metric.

Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/


  1. 1. Object Detection Computer Vision 2 Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Spring 2020
  2. 2. Acknowledgements 2 Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya [UPC TelecomBCN 2016] [UPC TelecomBCN 2017]
  3. 3. Acknowledgements 3 [UPC TelecomBCN 2018] Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center Universitat Politècnica de Catalunya Andreu Girbau andreu.girbau@upc.edu PhD Candidate Universitat Politècnica de Catalunya AutomaticTV [UPC TelecomBCN 2019]
  4. 4. Outline 4 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage
  5. 5. Recap Figure from Charles Ollion - Olivier Grisel
  6. 6. Recap Figure from Charles Ollion - Olivier Grisel
  7. 7. Recap Figure from Charles Ollion - Olivier Grisel
  8. 8. Recap Figure from Charles Ollion - Olivier Grisel
  9. 9. Object Detection CAT, DOG, DUCK The task of assigning a label and a bounding box to all objects in the image: 1. We don’t know the number of objects in advance 2. Object detection relies on object proposals and object classification 9
  10. 10. Object Detection as Classification Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 10
  11. 11. Object Detection as Classification Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 11
  12. 12. Object Detection as Classification Classes = [cat, dog, duck] Cat ? YES Dog ? NO Duck? NO 12
  13. 13. Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 13 Object Detection as Classification
  14. 14. Challenge: A very large number of possibilities: ● position ● scale ● aspect ratio 14 Object Detection as Classification Question: Do you think it is feasible to evaluate all possibilities?
  15. 15. Challenge: A very large number of possibilities: ● position ● scale ● aspect ratio Solution: If your classifier is fast enough, go for it 15 Object Detection as Classification
  16. 16. Object Detection with ConvNets? ConvNets are computationally demanding. We can’t test all positions & scales! Solution: Look at a tiny subset of positions. Choose them wisely :) 16
  17. 17. Outline 17 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage
  18. 18. Classic Datasets 18 PASCAL: 20 categories, 6k training images, 6k validation images, 10k test images. ILSVRC: 200 categories, 456k training images, 60k validation + test images. COCO: 80 categories, 200k training images, 60k val + test images.
  19. 19. Classic Datasets
  20. 20. Classic Datasets
  21. 21. Open Images Dataset Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset]
  22. 22. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset] Open Images Dataset v6 PASCAL: 20 categories, 6k training images, 6k validation images, 10k test images. ILSVRC: 200 categories, 456k training images, 60k validation + test images. COCO: 80 categories, 200k training images, 60k val + test images.
  23. 23. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset] Open Images Dataset v6
  24. 24. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset] Images with a large number of different classes annotated (11 on the left, 7 on the right). Open Images Dataset v6
  25. 25. Outline 25 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage 5. Software implementations
  26. 26. 26 Evaluation metrics: Intersection over Union (IoU) ● aka Jaccard index ● Size of the intersection divided by the size of the union ● Evaluates localization quality Figure: Pyimagesearch
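To make the definition concrete, here is a minimal, framework-free sketch of IoU for two axis-aligned boxes; the function name and the (x1, y1, x2, y2) box convention are our own choices, not fixed by the slides:

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) of two (x1, y1, x2, y2) boxes."""
    # Coordinates of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```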
  27. 27. 27 Metric: Average Precision (AP) for Object Detection Consider the case in which your object detection algorithm provides you with: ● the coordinates of each bounding box ● a confidence score for each bounding box 0.7 0.9 Predictions 0.5
  28. 28. 28 Rank your predictions based on the confidence score of your object detection algorithm: 0.7 0.9 0.9 0.7 #1 #2 #3 Predictions Metric: Average Precision (AP) for Object Detection 0.5 0.5
  29. 29. 29 Set a criterion to identify whether your predictions are correct. Typically, a minimum IoU with respect to the bounding boxes from the ground-truth annotation. ○ For example, IoU > 0.5. This is referred to as AP0.5 . ○ Other popular options: AP0.75 , or a range of IoU thresholds [0.5:0.95] in 0.05 steps ○ Each GT box can only be assigned to one predicted box. 0.7 0.9 0.9 0.7 #1 #2 #3 Ground truth True Positive (TP) False Positive (FP) 0.5 0.5 Confidence score Metric: Average Precision (AP) for Object Detection
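One possible greedy implementation of this criterion, reusing iou() from above (the helper name and the exact matching order are our own; the slides only fix the IoU > threshold rule and the one-match-per-GT-box constraint):

```python
def label_detections(pred_boxes, gt_boxes, iou_thr=0.5):
    """Label ranked detections as TP/FP; each GT box is matched at most once."""
    matched = set()
    flags = []
    for p in pred_boxes:  # assumed already ranked by confidence
        # Find the best still-unmatched ground-truth box for this detection.
        best, best_iou = None, 0.0
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > best_iou:
                best, best_iou = j, iou(p, g)
        if best is not None and best_iou > iou_thr:
            matched.add(best)
            flags.append(True)   # True Positive
        else:
            flags.append(False)  # False Positive (duplicate or bad localization)
    return flags
```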
  30. 30. 30 Compute the points of the Precision-Recall curve by considering as decision thresholds (Thr) the confidence scores of the ranked detections. Rank Correct ? 1 True 2 False 3 True Ground truth True Positive (TP) False Positive (FP) or False Negative (FN) 0.7 0.9 0.5 Threshold Precision Recall 0.9 1/1 1/4 0.7 1/2 1/4 0.5 2/3 2/4 Metric: Average Precision (AP) for Object Detection
  31. 31. 31 In the object detection case, some GT objects may never be matched by any prediction. We may consider that trying to find these missing objects with an infinite number of object proposals would drop precision to ≈0.0, but would eventually find all objects, so recall would reach 1.0. Table inspired by: Jonathan Hui, “mAP (mean Average Precision) for Object Detection” (Medium 2018) Ground truth True Positive (TP) False Positive (FP) or False Negative (FN) 0.7 0.9 0.5 Threshold Precision Recall 0.9 1/1 1/4 0.7 1/2 1/4 0.5 2/3 2/4 0.0 ⋍ 0 1 Rank Correct ? 1 True 2 False 3 True ∞ True(s) Metric: Average Precision (AP) for Object Detection
  32. 32. 32 Threshold Precision Recall 0.9 1/1 1/4 0.7 1/2 1/4 0.5 2/3 2/4 0.0 ⋍ 0 1 Rank Correct ? 1 True 2 False 3 True ∞ True(s) Metric: Average Precision (AP) for Object Detection [Figure: the resulting Precision-Recall curve, precision (y-axis) vs. recall (x-axis).]
  33. 33. 33 “The precision at each recall level r is interpolated by taking the maximum precision (...) for which the corresponding recall exceeds r.” (from Pascal VOC) [ref] [ref] Everingham, Mark, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010. Metric: Average Precision (AP) for Object Detection Threshold Precision Recall 0.9 1/1 1/4 0.7 1/2 1/4 0.5 2/3 2/4 0.0 ⋍ 0 1 Rank Correct ? 1 True 2 False 3 True ∞ True(s) [Figure: interpolated Precision-Recall curve.]
  34. 34. 34 Actually, not all PR pairs need to be computed, because AP for object detection only requires the PR pairs related to True Positives: Threshold Precision Recall 0.9 1/1 1/4 0.7 1/2 1/4 0.5 2/3 2/4 0.0 ⋍ 0 1 Rank Correct ? 1 True 2 False 3 True ∞ True(s) Metric: Average Precision (AP) for Object Detection [Figure: Precision-Recall curve with the True-Positive points highlighted.]
  35. 35. 35 ● The AP metric approximates the area of the PR curve. ● There are different methods for this approximation that may cause inconsistencies between implementations. ● Popular ones ○ (suggested) “the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ...1]” ○ “weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight” (scikit-learn). [ref] Everingham, Mark, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010. Metric: Average Precision (AP) for Object Detection
  36. 36. 36 In our work, we adopt the approach from Pascal VOC: ● AP is “the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ...1]” Threshold Precision Recall 0.9 1/1 1/4 0.5 2/3 2/4 0.0 ⋍ 0 1 Recall Precision 0.0 1.00 0.1 1.00 0.2 1.00 0.3 0.67 0.4 0.67 0.5 0.00 ... 0.00 1.0 0.00 AP 0.39 [Figure: 11-point interpolated Precision-Recall curve.] Metric: Average Precision (AP) for Object Detection
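A sketch of this 11-point AP that reproduces the worked example above. Two assumptions, both taken from the slides rather than from any particular library: the limiting point (recall 1.0 at precision ≈ 0, the "∞ True(s)" row) is appended explicitly, and recall must strictly exceed each level r, as in the Pascal VOC quote:

```python
import numpy as np

def eleven_point_ap(tp_flags, num_gt):
    """11-point interpolated AP for a list of TP/FP flags ranked by confidence."""
    flags = np.asarray(tp_flags, dtype=bool)
    tp, fp = np.cumsum(flags), np.cumsum(~flags)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Limiting point from the slides: infinite proposals -> recall 1, precision ~0.
    precision, recall = np.append(precision, 0.0), np.append(recall, 1.0)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):  # recall levels 0.0, 0.1, ..., 1.0
        above = precision[recall > r]    # "the corresponding recall exceeds r"
        ap += (above.max() if above.size else 0.0) / 11.0
    return ap

print(eleven_point_ap([True, False, True], num_gt=4))  # ~0.39, as in the table
```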
  37. 37. 37 What if your object detection algorithm does not provide any confidence score? #1 #2 #3 Predictions Metric: Average Precision w/o confidence scores?
  38. 38. 38 If your object detection algorithm does not provide any confidence score: ● Generate N random rankings (e.g., N=10) and compute the AP for each of them. ● Average the obtained APs. #1 #2 #3 #1 #2 #3 #1 #2 #3 AP1 AP2 APN ... AP Metric: Average Precision w/o confidence scores
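A sketch of this randomization, reusing eleven_point_ap() from above; the function name, default N and fixed seed are our own illustrative choices:

```python
import random

def ap_without_scores(correct_flags, num_gt, n_runs=10, seed=0):
    """Average AP over N random orderings of unranked detections."""
    rng = random.Random(seed)
    flags = list(correct_flags)
    aps = []
    for _ in range(n_runs):
        rng.shuffle(flags)  # one random rank for this run
        aps.append(eleven_point_ap(flags, num_gt))
    return sum(aps) / n_runs
```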
  39. 39. 39 Evaluation metrics: mean Average Precision (mAP) In the case of Q multiple classes (e.g., car, bike, person…), the mAP averages the AP(q) of each class: ● Further reading: ○ Tarang Shah, “Measuring Object Detection models — mAP — What is Mean Average Precision?” (Medium 2018)
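The averaging itself (shown as a formula image in the original deck) reduces to a per-class mean, mAP = (1/Q) * sum of AP(q) over the Q classes; a minimal sketch with made-up AP values:

```python
def mean_average_precision(ap_per_class):
    # mAP = (1/Q) * sum over the Q classes of AP(q).
    return sum(ap_per_class.values()) / len(ap_per_class)

# Illustrative (made-up) per-class APs:
print(mean_average_precision({"car": 0.61, "bike": 0.48, "person": 0.72}))  # ~0.60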
  40. 40. 40 Evaluation metrics: Average Precision (AP) You can obtain implementations of this Average Precision for Object Detection from: TensorFlow, Microsoft COCO dataset API
  41. 41. Outline 41 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage 5. Software implementations
  42. 42. Outline 42 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage 5. Software implementations
  43. 43. Object Detection There are two main families: ● Two-Stage: Region proposal and then classification ● Single-Stage: A grid in the image where each cell is a proposal
  44. 44. Region Proposals ● Find “blobby” image regions that are likely to contain objects ● “Class-agnostic” object detector Slide Credit: CS231n 44
  45. 45. Region Proposals 45 Typical object detection/segmentation pipelines: Object proposal Refinement and Classification Dog 0.85 Cat 0.80 Dog 0.75 Cat 0.90
  46. 46. Region Proposals 46 Typical object detection/segmentation pipelines: Object proposal Refinement and Classification Dog 0.85 Cat 0.80 Dog 0.75 Cat 0.90 NMS: Non-Maximum Suppression
  47. 47. Region Proposals: from pixels #SS Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. IJCV 2013 47
  48. 48. Region Proposals: from pixels #MCG Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2016). Multiscale combinatorial grouping for image segmentation and object proposal generation. TPAMI 2016 48
  49. 49. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014. 49 R-CNN
  50. 50. R-CNN 50 We expect: We get: Non Maximum Suppression + score threshold
  51. 51. R-CNN + Non Maximum Suppression (NMS) 51 #DPM Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part-based models. TPAMI 2009. Figure: Adrian Rosebrock
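A minimal sketch of the greedy NMS used in these pipelines, reusing iou() from the evaluation section; the 0.5 overlap threshold is a common default, not a value fixed by the slides, and NMS is applied per class:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy per-class NMS: keep the best-scoring box, drop its near-duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest-scoring remaining box
        keep.append(best)
        # Suppress remaining boxes that overlap the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```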
  52. 52. 52 Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014. R-CNN
  53. 53. R-CNN: Problems 1. Slow at test-time: need to run full forward pass of CNN for each region proposal 2. SVMs and regressors are post-hoc: CNN features not updated in response to SVMs and regressors Slide Credit: CS231n 53
  54. 54. Fast R-CNN: Girshick. Fast R-CNN. ICCV 2015 Solution: Share computation of convolutional layers between region proposals for an image R-CNN Problem #1: Slow at test-time: need to run full forward pass of CNN for each region proposal 54
  55. 55. Fast R-CNN Solution: Train it all together end to end R-CNN Problems #2 & #3: SVMs and regressors are post-hoc. Complex training. 55 Girshick. Fast R-CNN. ICCV 2015 -Softmax over (K+1) classes and 4 box offsets -Positive boxes are the ones with the largest Intersection over Union with the ground truth
  56. 56. Fast R-CNN: RoI-Pooling Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal (variable size) Fully-connected layers Max-pool within each grid cell RoI conv features: C x h x w for region proposal (fixed size) Fully-connected layers expect low-res conv features: C x h x w Slide Credit: CS231n 56Girshick Fast R-CNN. ICCV 2015
  57. 57. RoI pooling allows 1) propagating gradients only through the regions of interest, and 2) efficient computation. Input: convolutional map + N regions of interest Output: tensor of N x 7 x 7 x depth features Fast R-CNN: RoI-Pooling
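In PyTorch this operation is available off the shelf via torchvision; a small usage sketch, where the feature-map size and the 1/16 stride of a VGG-16 conv5_3 map are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 38)   # batch of 1, C x H x W conv features
# Each RoI is (batch_index, x1, y1, x2, y2) in input-image coordinates.
rois = torch.tensor([[0, 64.0, 32.0, 448.0, 256.0],
                     [0, 16.0, 16.0, 200.0, 300.0]])
# spatial_scale maps image coordinates onto the feature map (stride 16 here).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]): N x depth x 7 x 7, channels first
```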
  58. 58. Slide Credit: CS231n 58 Fast R-CNN R-CNN Fast R-CNN Training Time: 84 hours 9.5 hours (Speedup) 1x 8.8x Test time per image 47 seconds 0.32 seconds (Speedup) 1x 146x mAP (VOC 2007) 66.0 66.9 Using VGG-16 CNN on Pascal VOC 2007 dataset Faster! FASTER! Better!
  59. 59. Fast R-CNN: Limitation Slide Credit: CS231n R-CNN Fast R-CNN Test time per image 47 seconds 0.32 seconds (Speedup) 1x 146x Test time per image with Selective Search 50 seconds 2 seconds (Speedup) 1x 25x Test-time speeds do not include region proposals 59
  60. 60. Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals Fast R-CNN 60 Learn proposals end-to-end sharing parameters with the classification network #Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015. Faster R-CNN
  61. 61. Faster R-CNN Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals 61 Learn proposals end-to-end sharing parameters with the classification network This network is called the Region Proposal Network (RPN), and the proposals are learnt!! #Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
  62. 62. Faster R-CNN replaces selective search (SS) with the Region Proposal Network (RPN), which is trained jointly. Faster R-CNN
  63. 63. Region Proposal Network (RPN) Objectness scores (object/no object) Bounding Box Regression In practice, k = 9 (3 different scales and 3 aspect ratios) 63 #Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
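A sketch of how those k = 9 anchors can be generated at one feature-map location, using the three scales and three aspect ratios reported in the paper; the exact parameterization below (ratio as height/width, area kept close to scale squared) is one common convention:

```python
import itertools

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """The 9 Faster R-CNN anchors as (x1, y1, x2, y2), centred at (cx, cy)."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s / r ** 0.5  # ratio r = h / w, keeping the area close to s * s
        h = s * r ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors  # 3 scales x 3 aspect ratios = 9 boxes

print(len(make_anchors(400, 300)))  # 9
```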
  64. 64. Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 R-CNN Fast R-CNN Faster R-CNN Test time per image (with proposals) 50 seconds 2 seconds 0.2 seconds (Speedup) 1x 25x 250x mAP (VOC 2007) 66.0 66.9 66.9 Slide Credit: CS231n 64 Faster R-CNN
  65. 65. Mask R-CNN: Object Detection + Instance Segmentation 65 He et al. Mask R-CNN. ICCV 2017
  66. 66. Next lecture: Instance & Image Segmentation 66 Source: Detectron2 Carles Ventura
  67. 67. Two-stage vs Single-stage methods 67 Two-stage detectors are computationally too intensive and too slow for real-time applications (Faster R-CNN: 7 FPS). [Figure: two-stage pipeline, from image pixels to object proposal generation, per-BBOX pixel/feature resampling, and a high-quality classifier.]
  68. 68. Two-stage vs Single-stage methods 68 [Figure: the same two-stage pipeline, for comparison.] Instead of having two networks (a Region Proposal Network + a classifier network) as in two-stage methods, one-stage architectures predict bounding boxes and confidences for multiple categories directly with a single network.
  69. 69. Outline 69 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage 5. Software implementations
  70. 70. One-stage methods 70 Problem: Too many positions & scales to test Previously… :
  71. 71. Overfeat 71#OverFeat Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. "Overfeat: Integrated recognition, localization and detection using convolutional networks." ICLR 2014
  72. 72. One-stage methods 72 Problem: Too many positions & scales to test Solution: If your classifier is fast enough, go for it Previously… :
  73. 73. 73 Problem: Too many positions & scales to test Modern detectors parallelize feature extraction across all locations. Region classification is not slow anymore! Previously… : One-stage methods
  74. 74. YOLO: You Only Look Once 74 #YOLO Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 Proposal-free object detection pipeline S x S grid on input For each cell of the S x S grid, predict: ● B boxes with confidence scores (5 x B values) + C class probabilities
  75. 75. 75Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 Proposal-free object detection pipeline S x S grid on input Bounding boxes + confidence Class probability map Final detections YOLO: You Only Look Once
  76. 76. 76Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 Proposal-free object detection pipeline S x S grid on input Bounding boxes + confidence Class probability map Final detections Final detections: Cj * prob(c) > threshold YOLO: You Only Look Once
  77. 77. Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 77 YOLO: You Only Look Once
  78. 78. YOLO: You Only Look Once 78 Each cell predicts: - For each bounding box: - 4 coordinates (x, y, w, h) - 1 confidence value - Some number of class probabilities For Pascal VOC: - 7x7 grid - 2 bounding boxes / cell - 20 classes 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
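A shape-level sketch of how that 7 x 7 x 30 output can be split, with a random tensor standing in for the network; the layout chosen here (boxes first, then class probabilities) follows the slide's accounting but is our own assumption:

```python
import torch

S, B, C = 7, 2, 20                        # grid, boxes per cell, classes (Pascal VOC)
pred = torch.randn(S, S, B * 5 + C)       # dummy output: 7 x 7 x 30 = 1470 values
boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # per box: (x, y, w, h, confidence)
class_probs = pred[..., B * 5:]                # per-cell conditional class probabilities
# Class-specific scores to be thresholded for the final detections (previous slides):
scores = boxes[..., 4].unsqueeze(-1) * class_probs.unsqueeze(2)  # Cj * prob(c)
print(scores.shape)  # torch.Size([7, 7, 2, 20])
```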
  79. 79. SSD: Single Shot MultiBox Detector Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016 79 Same idea as YOLO, plus several predictors at different stages of the network to cover different receptive fields.
  80. 80. YOLOv2 80 Redmon & Farhadi. YOLO9000: Better, Faster, Stronger. CVPR 2017
  81. 81. YOLOv3 81 YOLO v2 + residual blocks + skip connections + upsampling + detection at multiple scales
  82. 82. YOLOv4 82 Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection” arXiv 2020.
  83. 83. 83 #YOLO Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
  84. 84. Military Applications & Privacy Risks 84
  85. 85. RetinaNet 85 Matching proposal-based performance with a one-stage approach Problem of one-stage detectors? They evaluate many candidate locations, but only a few contain objects → IMBALANCE, making learning inefficient Focal loss: the key idea is to lower the loss weight for well-classified samples and increase it for difficult ones. Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
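A sketch of the binary focal loss from the paper, with the alpha and gamma defaults reported there; the modulating factor (1 - p_t)^gamma is what down-weights the easy, well-classified examples:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss (Lin et al., ICCV 2017).

    logits: raw network outputs; targets: float tensor of 0s and 1s, same shape.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```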
  86. 86. Overview 86
  87. 87. Neural Architectures for Object Detection 87 Two-stage methods ● R-CNN ● Fast R-CNN ● Faster R-CNN ● Mask R-CNN Single-stage methods ● YOLO ● SSD ● RetinaNet
  88. 88. Software implementations 88 Most models are publicly available ready to be used off-the-shelf. Model Framework Faster R-CNN [torchvision] (< suggested) [Detectron2] [Keras] RetinaNet [Detectron2] (< suggested) [Keras] Benchmark [TensorFlow Object Detection API] YOLOv3 [PyTorch] SSD [PyTorch] [Tutorial on Keras] Mask R-CNN [torchvision] (< suggested) [PyTorch] [Keras & TF] [tutorial]
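As an example of how little code the off-the-shelf route takes, a sketch using the suggested torchvision model; the pretrained=True flag matches the torchvision API at the time of these slides, and the input image is a dummy:

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 600, 800)    # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])   # one dict per image: boxes, labels, scores
print(predictions[0]["boxes"].shape)
```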
  89. 89. Software implementations 89 Wang, Xin, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. "Frustratingly Simple Few-Shot Object Detection." arXiv preprint arXiv:2003.06957 (2020). [code based on Detectron2] You will probably not be interested in the object classes defined in Pascal/COCO. You can adapt (fine-tune) existing models to your own object classes, as sketched below.
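A minimal fine-tuning sketch along the lines of the torchvision detection tutorial: keep the pretrained backbone and RPN, and swap the box-classification head for one sized to your classes; the class count below is illustrative:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # e.g., 2 custom object classes + background (illustrative)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# Replace the COCO classification head; the rest of the network keeps its weights.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# The model is now ready to be trained on the custom dataset.
```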
  90. 90. Software implementations for Mobile 90 TensorFlow Lite: Object Detection PyTorch Mobile (no specific solutions for object detection)
  91. 91. Software implementations 91 Jordi Torres, “TensorFlow or PyTorch?” (2020) [in Catalan]
  92. 92. Outline 92 1. Motivation 2. Datasets 3. Evaluation 4. Neural Architectures a. Two-stage b. Single-stage 5. Software implementations
  93. 93. Next lab: ImageNet models 93 Dani Fojo
  94. 94. Your questions 94
