Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)

980 views

Published on

https://telecombcn-dl.github.io/dlmm-2017-dcu/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Object Detection (D2L5 Insight@DCU Machine Learning Workshop 2017)

  1. 1. Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya Object Detection Day 2 Lecture 5
  2. 2. Object Detection CAT, DOG, DUCK The task of assigning a label and a bounding box to all objects in the image 2
  3. 3. Object Detection: Datasets 3 20 categories 6k training images 6k validation images 10k test images 200 categories 456k training images 60k validation + test images 80 categories 200k training images 60k val + test images
  4. 4. Object Detection as Classification Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 4
  5. 5. Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 5 Object Detection as Classification
  6. 6. Classes = [cat, dog, duck] Cat ? YES Dog ? NO Duck? NO 6 Object Detection as Classification
  7. 7. Classes = [cat, dog, duck] Cat ? NO Dog ? NO Duck? NO 7 Object Detection as Classification
  8. 8. Problem: Too many positions & scales to test Solution: If your classifier is fast enough, go for it 8 Object Detection as Classification
  9. 9. HOG: Histogram of Oriented Gradients Dalal and Triggs. Histograms of Oriented Gradients for Human Detection. CVPR 2005 9
  10. 10. Deformable Part Model Felzenszwalb et al, Object Detection with Discriminatively Trained Part Based Models, PAMI 2010 10
  11. 11. Object Detection with ConvNets? Convnets are computationally demanding. We can’t test all positions & scales ! Solution: Look at a tiny subset of positions. Choose them wisely :) 11
  12. 12. Region Proposals ● Find “blobby” image regions that are likely to contain objects ● “Class-agnostic” object detector ● Look for “blob-like” regions Slide Credit: CS231n 12
  13. 13. Region Proposals Selective Search (SS) Multiscale Combinatorial Grouping (MCG) [SS] Uijlings et al. Selective search for object recognition. IJCV 2013 [MCG] Arbeláez, Pont-Tuset et al. Multiscale combinatorial grouping. CVPR 2014 13
  14. 14. Object Detection with Convnets: R-CNN Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014 14
  15. 15. R-CNN Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014 1. Train network on proposals 2. Post-hoc training of SVMs & Box regressors on fc7 features 15
  16. 16. R-CNN 16 We expect: We get:
  17. 17. R-CNN Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014 1. Train network on proposals 2. Post-hoc training of SVMs & Box regressors on fc7 features 3. Non Maximum Suppression + score threshold 17
  18. 18. R-CNN Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014 18
  19. 19. R-CNN: Problems 1. Slow at test-time: need to run full forward pass of CNN for each region proposal 2. SVMs and regressors are post-hoc: CNN features not updated in response to SVMs and regressors 3. Complex multistage training pipeline Slide Credit: CS231n 19
  20. 20. Fast R-CNN Girshick Fast R-CNN. ICCV 2015 Solution: Share computation of convolutional layers between region proposals for an image R-CNN Problem #1: Slow at test-time: need to run full forward pass of CNN for each region proposal 20
  21. 21. Fast R-CNN: Sharing features Hi-res input image: 3 x 800 x 600 with region proposal Convolution and Pooling Hi-res conv features: C x H x W with region proposal Fully-connected layers Max-pool within each grid cell RoI conv features: C x h x w for region proposal Fully-connected layers expect low-res conv features: C x h x w Slide Credit: CS231n 21Girshick Fast R-CNN. ICCV 2015
  22. 22. Fast R-CNN Solution: Train it all at together E2E R-CNN Problem #2&3: SVMs and regressors are post-hoc. Complex training. 22Girshick Fast R-CNN. ICCV 2015
  23. 23. Fast R-CNN: End-to-end training 23Girshick Fast R-CNN. ICCV 2015 Predicted class scores True class scores True box coordinates Predicted box coordinates Log loss Smooth L1 loss Only for positive boxes
  24. 24. Fast R-CNN: Positive / Negative Samples 24Girshick Fast R-CNN. ICCV 2015 Positive samples are defined as those whose IoU overlap with a ground-truth bounding box is > 0.5. Negative examples are sampled from those that have a maximum IoU overlap with ground truth in the interval [0.1, 0.5). 25%/75% ratio for positive/negative samples in a minibatch.
  25. 25. Fast R-CNN Slide Credit: CS231n R-CNN Fast R-CNN Training Time: 84 hours 9.5 hours (Speedup) 1x 8.8x Test time per image 47 seconds 0.32 seconds (Speedup) 1x 146x mAP (VOC 2007) 66.0 66.9 Using VGG-16 CNN on Pascal VOC 2007 dataset Faster! FASTER! Better! 25
  26. 26. Fast R-CNN: Problem Slide Credit: CS231n R-CNN Fast R-CNN Test time per image 47 seconds 0.32 seconds (Speedup) 1x 146x Test time per image with Selective Search 50 seconds 2 seconds (Speedup) 1x 25x Test-time speeds don’t include region proposals 26
  27. 27. Faster R-CNN Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals 27 Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 Learn proposals end-to-end sharing parameters with the classification network
  28. 28. Faster R-CNN Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals Fast R-CNN 28 Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 Learn proposals end-to-end sharing parameters with the classification network
  29. 29. Region Proposal Network Objectness scores (object/no object) Bounding Box Regression In practice, k = 9 (3 different scales and 3 aspect ratios) 29 Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
  30. 30. Region Proposal Network: Loss function 30 Predicted probability of being an object for anchor i i = anchor index in minibatch Coordinates of the predicted bounding box for anchor i Ground truth objectness label True box coordinates Ncls = Number of anchors in minibatch (~ 256) Nreg = Number of anchor locations ( ~ 2400) Log loss Smooth L1 loss In practice = 10, so that both terms are roughly equally balanced
  31. 31. Region Proposal Network: Positive / Negative Samples 31 An anchor is labeled as positive if: (a) the anchor is the one with highest IoU overlap with a ground-truth box (b) the anchor has an IoU overlap with a ground-truth box higher than 0.7 Negative labels are assigned to anchors with IoU lower than 0.3 for all ground-truth boxes. 50%/50% ratio of positive/negative anchors in a minibatch.
  32. 32. Faster R-CNN: Training Conv layers Region Proposal Network FC6 Class probabilities FC7 FC8 RPN Proposals RoI Pooling Conv5_3 RPN Proposals 32 Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 RoI Pooling is not differentiable w.r.t box coordinates. Solutions: ● Alternate training ● Ignore gradient of classification branch w.r.t proposal coordinates ● Make pooling function differentiable
  33. 33. Faster R-CNN Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015 R-CNN Fast R-CNN Faster R-CNN Test time per image (with proposals) 50 seconds 2 seconds 0.2 seconds (Speedup) 1x 25x 250x mAP (VOC 2007) 66.0 66.9 66.9 Slide Credit: CS231n 33
  34. 34. Faster R-CNN 34 ● Faster R-CNN is the basis of the winners of COCO and ILSVRC 2015 object detection competitions. He et al. Deep residual learning for image recognition. CVPR 2016
  35. 35. YOLO: You Only Look Once 35Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 Proposal-free object detection pipeline
  36. 36. YOLO: You Only Look Once Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 36
  37. 37. YOLO: You Only Look Once 37 Each cell predicts: - For each bounding box: - 4 coordinates (x, y, w, h) - 1 confidence value - Some number of class probabilities For Pascal VOC: - 7x7 grid - 2 bounding boxes / cell - 20 classes 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
  38. 38. YOLO: Training 38Slide credit: YOLO Presentation @ CVPR 2016 For training, each ground truth bounding box is matched into the right cell
  39. 39. YOLO: Training 39Slide credit: YOLO Presentation @ CVPR 2016 For training, each ground truth bounding box is matched into the right cell
  40. 40. YOLO: Training 40Slide credit: YOLO Presentation @ CVPR 2016 Optimize class prediction in that cell: dog: 1, cat: 0, bike: 0, ...
  41. 41. YOLO: Training 41Slide credit: YOLO Presentation @ CVPR 2016 Predicted boxes for this cell
  42. 42. YOLO: Training 42Slide credit: YOLO Presentation @ CVPR 2016 Find the best one wrt ground truth bounding box, optimize it (i.e. adjust its coordinates to be closer to the ground truth’s coordinates)
  43. 43. YOLO: Training 43Slide credit: YOLO Presentation @ CVPR 2016 Increase matched box’s confidence, decrease non-matched boxes confidence
  44. 44. YOLO: Training 44Slide credit: YOLO Presentation @ CVPR 2016 Increase matched box’s confidence, decrease non-matched boxes confidence
  45. 45. YOLO: Training 45Slide credit: YOLO Presentation @ CVPR 2016 For cells with no ground truth detections, confidences of all predicted boxes are decreased
  46. 46. YOLO: Training 46Slide credit: YOLO Presentation @ CVPR 2016 For cells with no ground truth detections: ● Confidences of all predicted boxes are decreased ● Class probabilities are not adjusted
  47. 47. YOLO: Training, formally 47Slide credit: YOLO Presentation @ CVPR 2016 Bounding box coordinate regression Bounding box score prediction Class score prediction = 1 if box j and cell i are matched together, 0 otherwise = 1 if box j and cell i are NOT matched together = 1 if cell i has an object present
  48. 48. YOLO: You Only Look Once Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 48 Dog Bicycle Car Dining Table Predict class probability for each cell (conditioned on object P(car | object) )
  49. 49. YOLO: You Only Look Once Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 49 + NMS + Score threshold
  50. 50. SSD: Single Shot MultiBox Detector Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016 50 Same idea as YOLO, + several predictors at different stages in the network
  51. 51. SSD: Single Shot MultiBox Detector Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016 51 Similarly to Faster R-CNN, it uses box anchors to predict box coordinates as displacements
  52. 52. YOLOv2 52 Redmon & Farhadi. YOLO900: Better, Faster, Stronger. CVPR 2017
  53. 53. YOLOv2 53 Results on Pascal VOC 2007
  54. 54. YOLOv2 54 Results on COCO test-dev 2015
  55. 55. Summary 55 Proposal-based methods ● R-CNN ● Fast R-CNN ● Faster R-CNN ● SPPnet ● R-FCN Proposal-free methods ● YOLO, YOLOv2 ● SSD
  56. 56. Questions ?
  57. 57. A note on NMS 57 Tradeoff between recall/precision 0.8 0.6 0.7 Objectness scores
  58. 58. Soft NMS 58 Bodla, Singh et al. Improving Object Detection With One Line of Code. arXiv Apr 2017 Decay detection scores of contiguous objects instead of setting them to 0
  59. 59. Avoid NMS: Sequential box prediction 59 Stewart et al. End-To-End People Detection in Crowded Scenes. CVPR 2016 Predict boxes one after the other and learn when to stop

×