
Codetecon #KRK 3 - Object detection with Deep Learning


There has been enormous progress in object detection algorithms. From multi-stage models like R-CNN to end-to-end ones like SSD or YOLO, the accuracy of these methods has improved significantly. Current applications include pedestrian detection for cars and face detection on Facebook.
But that’s just the beginning. I am going to present the algorithms that solve the problem, show what is currently possible, and what will be possible in the near future.


  1. Object detection with Deep Learning, Matthew Opala
  2. AGENDA: Region proposal based models, Regression models, Localization & detection
  3. Localization & detection
  4. Computer Vision tasks: Classification, Classification & Localization (single object); Object detection, Instance segmentation (multiple objects). Credit: http://vision.stanford.edu/teaching/cs231n/slides/2016/winter1516_lecture8.pdf
  5. Computer Vision tasks: Classification, Classification & Localization (single object); Object detection, Instance segmentation (multiple objects)
  6. Classification & Localization. Classification: ◦ Input: image ◦ Output: class label ◦ Evaluation: accuracy. Localization: ◦ Input: image ◦ Output: Box(x, y, w, h) ◦ Evaluation: IoU. Example: CAT (x, y, w, h)
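Slide 6 names IoU (intersection over union) as the localization metric. A minimal sketch of that computation follows, assuming that (x, y) is the top-left corner of the box; the slide does not say which corner convention it uses, and the two example boxes are made up for illustration.

```python
from typing import Tuple

# (x, y, w, h); (x, y) is assumed to be the top-left corner
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Hypothetical prediction and ground truth
predicted = (10.0, 10.0, 20.0, 20.0)
ground_truth = (20.0, 20.0, 20.0, 20.0)
print(f"IoU = {iou(predicted, ground_truth):.3f}")
```

An IoU of 1 means a perfect match and 0 means no overlap at all.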
  7. Object detection ◦ Many objects of different classes in an image ◦ Needs variable-size output
  8. Detection as regression vs. detection as classification. Regression: ConvNet -> final conv feature maps -> classification head + regression head. Classification: region proposals -> crop & warp -> ConvNet -> final conv feature maps -> classifier
  9. Object detection models
  10. R-CNN
  11. Region proposals - selective search. Credit: Uijlings et al., “Selective Search for Object Recognition”, IJCV 2013
  12. RCNN - model: Input image -> Selective search -> Regions of Interest (RoI) -> Warped image regions -> ConvNet -> SVM + Bbox regressors
  13. RCNN - training ◦ Train a classification model ◦ Fine-tune it for detection ◦ Extract features ◦ Train a binary SVM for each class ◦ Train a linear regression model for each class
  14. RCNN - disadvantages ◦ Complex training pipeline ◦ Slow at test time - 50 s per image
  15. Fast R-CNN
  16. Input image -> ConvNet; Selective search -> RoI projection onto the feature map -> RoI pooling -> FC layers -> Softmax + Bbox regressors
  17. RoI Pooling, 8 x 8 feature map:
      0.81  0.40  0.70  0.32  0.50  0.20  0.40  0.50
      0.19  0.20  0.16  0.31  0.30  0.90  0.10  0.32
      0.37  0.31  0.29  0.53  0.78  0.81  0.22  0.24
      0.80  0.38  0.36  0.89  0.24  0.23  0.95  0.88
      0.66  0.71  0.42  0.11  0.11  0.08  0.23  0.50
      0.65  0.51  0.32  0.14  0.74  0.70  0.05  0.32
      0.90  0.70  0.57  0.53  0.64  0.23  0.02  0.19
      0.22  0.45  0.64  0.58  0.52  0.58  0.32  0.89
  18. RoI Pooling, output size 2 x 2, region of interest 7 x 5 (same feature map as slide 17)
  19. RoI Pooling, output size 2 x 2, region of interest 7 x 5 (same feature map as slide 17)
  20. RoI Pooling, output size 2 x 2, region of interest 7 x 5 (same feature map as slide 17)
  21. RoI Pooling, output size 2 x 2, region of interest 7 x 5, result:
      0.80  0.95
      0.90  0.74
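The RoI pooling slides can be reproduced with a short sketch. The transcript does not say where the 7 x 5 region sits on the feature map; the placement used here (rows 3 to 7, columns 0 to 6, zero-based) is an assumption, chosen because it yields the 2 x 2 result shown on slide 21. Splitting uneven regions with floor division is also an assumption about how the bins were formed.

```python
from typing import List

# 8 x 8 feature map from the slides, read row by row
FEATURE_MAP: List[List[float]] = [
    [0.81, 0.40, 0.70, 0.32, 0.50, 0.20, 0.40, 0.50],
    [0.19, 0.20, 0.16, 0.31, 0.30, 0.90, 0.10, 0.32],
    [0.37, 0.31, 0.29, 0.53, 0.78, 0.81, 0.22, 0.24],
    [0.80, 0.38, 0.36, 0.89, 0.24, 0.23, 0.95, 0.88],
    [0.66, 0.71, 0.42, 0.11, 0.11, 0.08, 0.23, 0.50],
    [0.65, 0.51, 0.32, 0.14, 0.74, 0.70, 0.05, 0.32],
    [0.90, 0.70, 0.57, 0.53, 0.64, 0.23, 0.02, 0.19],
    [0.22, 0.45, 0.64, 0.58, 0.52, 0.58, 0.32, 0.89],
]


def roi_max_pool(fmap: List[List[float]], top: int, left: int,
                 height: int, width: int,
                 out_h: int, out_w: int) -> List[List[float]]:
    """Max-pool a rectangular region into a fixed out_h x out_w grid."""
    # Floor division splits 5 rows into 2 + 3 and 7 columns into 3 + 4
    row_edges = [top + (i * height) // out_h for i in range(out_h + 1)]
    col_edges = [left + (j * width) // out_w for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        pooled_row = []
        for j in range(out_w):
            cells = [fmap[r][c]
                     for r in range(row_edges[i], row_edges[i + 1])
                     for c in range(col_edges[j], col_edges[j + 1])]
            pooled_row.append(max(cells))
        pooled.append(pooled_row)
    return pooled


# Assumed region placement: 7 wide, 5 high, starting at row 3, column 0
result = roi_max_pool(FEATURE_MAP, top=3, left=0, height=5, width=7,
                      out_h=2, out_w=2)
print(result)  # [[0.8, 0.95], [0.9, 0.74]]
```

Whatever the size of the region, the output is always 2 x 2, which is what lets the FC layers that follow accept proposals of any shape.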
  22. Fast R-CNN advantages ◦ Much simpler training ◦ Faster - 2 s per image
  23. Faster R-CNN
  24. Input image -> ConvNet -> Feature map -> Region proposal network -> Region proposals -> RoI pooling -> FC layers -> Softmax + Bbox regressors
  25. Faster R-CNN ◦ Fast enough for many applications: 140 ms per image
  26. YOLO
  27. Even Faster-RCNN is too slow for real-time
      Model         Time per image   FPS    Pascal 2007 mAP
      RCNN          20 s             0.05   0.66
      Fast-RCNN     2 s              0.5    0.7
      Faster-RCNN   140 ms           7      0.732
      YOLO v1       22 ms            45     0.63
      Fast YOLO v1  6.45 ms          155    0.53
      Credit: https://pjreddie.com/darknet/yolo
  28. At 50 km/h, distance traveled while one image is processed:
  29. 278 m RCNN
  30. 278 m RCNN, 28 m Fast-RCNN
  31. 278 m RCNN, 28 m Fast-RCNN, 1.95 m Faster-RCNN
  32. 278 m RCNN, 28 m Fast-RCNN, 1.95 m Faster-RCNN, 0.3 m YOLO
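The distances on slides 29 to 32 appear to be the per-image times from the comparison table multiplied by a speed of 50 km/h. A small sketch of that arithmetic, using the table's times (the function name is my own):

```python
SPEED_KMH = 50.0

# Seconds per image, taken from the comparison table on slide 27
SECONDS_PER_IMAGE = {
    "RCNN": 20.0,
    "Fast-RCNN": 2.0,
    "Faster-RCNN": 0.140,
    "YOLO": 0.022,
}


def distance_per_image(speed_kmh: float, seconds: float) -> float:
    """Meters traveled at speed_kmh while one image is processed."""
    return speed_kmh / 3.6 * seconds  # km/h -> m/s, then times seconds


for model, seconds in SECONDS_PER_IMAGE.items():
    meters = distance_per_image(SPEED_KMH, seconds)
    print(f"{model}: {meters:.2f} m per image, {1.0 / seconds:.2f} FPS")
```

The slide figures match these values up to rounding.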
  33. Split the image into an S x S grid
  34. Each cell predicts boxes (x, y, w, h) and confidences P(object)
  35. Each cell predicts boxes (x, y, w, h) and confidences P(object)
  36. Each cell predicts boxes (x, y, w, h) and confidences P(object)
  37. Each cell predicts class probabilities conditioned on an object being present, e.g. P(Car | object). Classes shown: Car, Bicycle, Dog, Dining table
  38. At test time we combine the box and class predictions
  39. After NMS and thresholding
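Slides 38 and 39 can be sketched in a few lines: each box's confidence is multiplied by the cell's conditional class probability, low scores are dropped, and greedy non-maximum suppression (NMS) removes overlapping duplicates. The boxes, probabilities, and both thresholds below are hypothetical, and only a single class is handled; this is an illustration, not the talk's implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner assumed


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5, score_threshold: float = 0.2) -> List[int]:
    """Return indices of boxes kept after thresholding and greedy NMS."""
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical predictions for the class "Car"
boxes = [(10.0, 10.0, 50.0, 50.0), (12.0, 12.0, 50.0, 50.0),
         (200.0, 200.0, 30.0, 30.0), (100.0, 100.0, 40.0, 40.0)]
box_confidences = [0.9, 0.8, 0.3, 0.7]       # P(object) per box
p_car_given_object = [0.9, 0.9, 0.5, 0.8]    # P(Car | object) of the box's cell

# Class-specific score = P(Car | object) * P(object)
scores = [c * p for c, p in zip(box_confidences, p_car_given_object)]
kept = nms(boxes, scores)
print(kept)  # [0, 3]
```

Box 1 is suppressed because it overlaps the higher-scoring box 0, and box 2 falls below the score threshold.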
  40. Model ◦ Image divided into S x S grid ◦ Within each grid cell predict: ▫ B boxes (4 coordinates + confidence) ▫ C class scores ◦ Regression from image to S x S x (5 * B + C) tensor. Credit: Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
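To make the S x S x (5 * B + C) formula concrete, here is a small sketch. The values S = 7, B = 2, C = 20 are the Pascal VOC settings as I recall them from the YOLO paper; they do not appear on the slide, so treat them as an assumption.

```python
from typing import Tuple


def yolo_output_shape(S: int, B: int, C: int) -> Tuple[int, int, int]:
    """Shape of the YOLO v1 output tensor: S x S x (5 * B + C)."""
    # Each of the B boxes contributes 4 coordinates + 1 confidence,
    # and each cell adds C class scores shared by its boxes.
    return (S, S, 5 * B + C)


# Assumed Pascal VOC configuration (not stated on the slide)
shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (7, 7, 30) 1470
```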
  41. During training we match examples to the correct cell
  42. Adjust the cell’s class probabilities: Dog = 1, Cat = 0, Bicycle = 0, ...
  43. Find the predicted bounding box with the highest IoU
  44. Increase its confidence
  45. Decrease the confidence of other boxes
  46. Decrease the confidence of boxes in cells without a ground truth detection
  47. Training details ◦ Pretrain Extraction Net on ImageNet (24 conv layers) ◦ SGD with decreasing learning rate ◦ Extensive data augmentation ◦ Leaky ReLUs ◦ Increase loss from bounding box coordinate predictions and decrease it for boxes that don’t contain objects ◦ Predict the square root of width and height instead of predicting them directly
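The last bullet on slide 47 is easier to see with numbers. A squared error on raw widths penalizes a 10-pixel mistake equally for a small and a large box, while a squared error on square roots penalizes the small box more. The box widths below are made up for illustration.

```python
import math


def direct_error(pred: float, true: float) -> float:
    """Squared error on the raw width (or height)."""
    return (pred - true) ** 2


def sqrt_error(pred: float, true: float) -> float:
    """Squared error on the square root of the width (or height)."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2


# Hypothetical (predicted, true) widths: same 10-pixel error on both boxes
small = (30.0, 20.0)
large = (210.0, 200.0)

print(direct_error(*small), direct_error(*large))  # identical penalties
print(sqrt_error(*small), sqrt_error(*large))      # small box penalized more
```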
  48. YOLO v2
  49. YOLO drawbacks ◦ Makes significantly more localization errors than Faster-RCNN ◦ Lower recall than region proposal based methods
  50. YOLO v2 ◦ Batch normalization ◦ High resolution classifier ◦ Convolutional anchor boxes ◦ K-means for choosing box priors ◦ Fine-grained features ◦ Multi-scale training
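The "K-means for choosing box priors" bullet can be sketched as follows. The YOLO v2 paper clusters ground truth box sizes with the distance 1 - IoU instead of Euclidean distance; the rest of this sketch (deterministic initialization from the first k boxes, a fixed iteration count, the sample sizes) is my own simplification for illustration.

```python
from typing import List, Tuple

Size = Tuple[float, float]  # (width, height)


def iou_wh(a: Size, b: Size) -> float:
    """IoU of two boxes that share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_priors(sizes: List[Size], k: int, iterations: int = 20) -> List[Size]:
    """Cluster box sizes; the nearest centroid is the one with highest IoU."""
    centroids = list(sizes[:k])  # simplification: deterministic initialization
    for _ in range(iterations):
        clusters: List[List[Size]] = [[] for _ in range(k)]
        for size in sizes:
            # Minimizing 1 - IoU is the same as maximizing IoU
            best = max(range(k), key=lambda i: iou_wh(size, centroids[i]))
            clusters[best].append(size)
        centroids = [
            (sum(s[0] for s in members) / len(members),
             sum(s[1] for s in members) / len(members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


# Hypothetical ground truth box sizes: three small and three large boxes
sizes = [(10.0, 12.0), (100.0, 90.0), (12.0, 10.0),
         (110.0, 100.0), (11.0, 11.0), (90.0, 110.0)]
priors = kmeans_priors(sizes, k=2)
print(priors)
```

With IoU as the similarity measure, large boxes do not dominate the clustering the way they would with Euclidean distance.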
  51. mAP and speed on VOC 2007. Credit: Redmon, Farhadi, “YOLO9000: Better, Faster, Stronger”, arXiv 2017
  52. YOLO 9000 - WordTree
  53. YOLO 9000: Hierarchical Classification ◦ Train Darknet-19 on WordTree ◦ Propagate ground truth labels up the tree ◦ Perform multiple softmaxes over co-hyponyms
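A toy version of the hierarchical classification on slide 53: one softmax per group of siblings (co-hyponyms) gives conditional probabilities, and the absolute probability of a node is the product along its path to the root. The five-node tree and the logits are hypothetical; the real WordTree is built from WordNet and is far larger.

```python
import math
from typing import Dict, List, Optional

# Hypothetical tiny tree, stored as child -> parent
PARENT: Dict[str, Optional[str]] = {
    "animal": None, "dog": "animal", "cat": "animal",
    "terrier": "dog", "hound": "dog",
}


def label_path(node: str) -> List[str]:
    """Propagate a label up the tree: the node and all of its ancestors."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def conditional_probs(logits: Dict[str, float]) -> Dict[str, float]:
    """One softmax per sibling group gives P(node | parent)."""
    groups: Dict[Optional[str], List[str]] = {}
    for node, parent in PARENT.items():
        groups.setdefault(parent, []).append(node)
    cond: Dict[str, float] = {}
    for siblings in groups.values():
        top = max(logits[n] for n in siblings)
        exps = {n: math.exp(logits[n] - top) for n in siblings}
        total = sum(exps.values())
        cond.update({n: e / total for n, e in exps.items()})
    return cond


def absolute_prob(node: str, cond: Dict[str, float]) -> float:
    """Multiply conditional probabilities along the path to the root."""
    p = 1.0
    for step in label_path(node):
        p *= cond[step]
    return p


# Hypothetical network outputs
logits = {"animal": 0.0, "dog": 1.0, "cat": 0.0, "terrier": 2.0, "hound": 0.0}
cond = conditional_probs(logits)
print(label_path("terrier"))
print(absolute_prob("terrier", cond))
```

The probabilities of "terrier" and "hound" add up to the probability of "dog", which is what allows a prediction to stop at a coarser level when the finer one is uncertain.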
  54. YOLO 9000 - Joint Classification and Detection training ◦ COCO detection + top 9000 classes from ImageNet ◦ On a detection image, backpropagate loss as normal ◦ On a classification image, only backpropagate loss at or above the corresponding level of the label ◦ ImageNet shares 44 categories with COCO ◦ Generalizes quite well to new animals (tiger 0.61 AP, fox 0.52 AP) ◦ Fails on clothing, e.g. “sunglasses”
  55. YOLO 9000: Visualizations
  56. Single Shot MultiBox Detector
  57. SSD - YOLO architecture comparison. Credit: Liu et al., “SSD: Single Shot MultiBox Detector”, arXiv 2016
  58. SSD detection ◦ A detection is described by four parameters (cx, cy, w, h) and a class category ◦ Each detector outputs a single value, so we need #classes + 4 detectors for a single detection
  59. Different “classes” of detection: aspect ratio 2:1, aspect ratio 1:2, aspect ratio 1:1
  60. Default boxes and aspect ratios
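The default boxes on slides 59 and 60 can be illustrated with the width and height formulas from the SSD paper, w = s * sqrt(a) and h = s / sqrt(a), where s is the scale of the layer and a is the aspect ratio (width : height). The scale value 0.2 is only an example.

```python
import math
from typing import Tuple


def default_box_size(scale: float, aspect_ratio: float) -> Tuple[float, float]:
    """Width and height of a default box, relative to the image size."""
    width = scale * math.sqrt(aspect_ratio)
    height = scale / math.sqrt(aspect_ratio)
    return width, height


SCALE = 0.2  # example scale, not taken from the slides

# The three aspect ratios shown on slide 59
for name, ratio in [("2:1", 2.0), ("1:2", 0.5), ("1:1", 1.0)]:
    w, h = default_box_size(SCALE, ratio)
    print(f"aspect ratio {name}: w = {w:.3f}, h = {h:.3f}")
```

All three boxes have the same area (scale squared); only their shape changes.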
  61. For each conv layer that is input to detection there are: (classes + 4) x #default boxes x m x n outputs
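A quick worked instance of the formula on slide 61. The numbers are illustrative assumptions: a 38 x 38 feature map with 4 default boxes per location and 21 classes (20 Pascal VOC classes plus background), which is how I recall the first SSD300 detection layer being configured.

```python
def ssd_layer_outputs(classes: int, default_boxes: int, m: int, n: int) -> int:
    """Number of outputs for one m x n feature map used for detection."""
    # Every default box at every location needs one score per class
    # plus 4 box offsets.
    return (classes + 4) * default_boxes * m * n


# Assumed example configuration (not stated on the slide)
outputs = ssd_layer_outputs(classes=21, default_boxes=4, m=38, n=38)
print(outputs)  # 144400
```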
  62. SSD Training ◦ Ground truth data needs to be assigned to specific outputs in the fixed set of detector outputs ◦ For each GT box we choose the default box with the best Jaccard overlap ◦ Hard negative mining ◦ Data augmentation ◦ Loss: cross-entropy + Smooth L1
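Two pieces of slide 62 are sketched below: matching a ground truth box to the default box with the best Jaccard overlap (IoU), and the Smooth L1 function used for the localization loss. Boxes use the (cx, cy, w, h) format from slide 58; the default boxes and the ground truth box are hypothetical, and hard negative mining and the cross-entropy term are left out.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), relative coordinates


def jaccard(a: Box, b: Box) -> float:
    """Jaccard overlap (IoU) of two center-format boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def best_default_box(gt: Box, defaults: List[Box]) -> int:
    """Index of the default box with the highest Jaccard overlap."""
    return max(range(len(defaults)), key=lambda i: jaccard(gt, defaults[i]))


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


# Hypothetical default boxes and ground truth box
defaults = [(0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5), (0.5, 0.5, 0.4, 0.8)]
ground_truth = (0.7, 0.7, 0.4, 0.4)

matched = best_default_box(ground_truth, defaults)
# Simplified localization loss on raw coordinate differences
loc_loss = sum(smooth_l1(g - d) for g, d in zip(ground_truth, defaults[matched]))
print(matched, loc_loss)
```

The SSD paper regresses encoded offsets relative to the default box rather than raw coordinate differences; the raw differences here only keep the sketch short.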
  63. Deconvolutional Single Shot Detector. Credit: Fu et al., “DSSD: Deconvolutional Single Shot Detector”, arXiv 2017
  64. Models comparison - according to DSSD paper
      Model          Network      Pascal 2007 mAP
      Faster-RCNN    ResNet-101   0.764
      R-FCN          ResNet-101   0.805
      SSD-300        VGG-16       0.775
      SSD-513        ResNet-101   0.806
      YOLO v2 - 544  Darknet-19   0.786
      DSSD-513       ResNet-101   0.815
  65. Recap ◦ Detection as regression or detection as classification ◦ Static-image detectors are already fast enough to work even on video ◦ Fast YOLO is the fastest detector ◦ State-of-the-art: ▫ ResNet-101 + SSD + deconvolutions
  66. Thanks! Q&A. You can contact us at: matthew.opala@craftinity.com
