


State of the art Detection methods

Published in: Data & Analytics


  1. Detection
  2. Object Detection: Intuition. Detection ≈ Localization + Classification
  3. Outline: R-CNN • SPP-Net • Fast R-CNN • Unified Approach
  4. Outline: R-CNN • SPP-Net • Fast R-CNN • Unified Approach
  5. R-CNN: Pipeline Overview. Step 1: Input an image. Step 2: Use selective search to obtain ~2k proposals. Step 3: Warp each proposal and apply a CNN to extract its features. Step 4: Use class-specific SVMs to score each proposal. Step 5: Rank the proposals and use NMS to get the bboxes. Step 6: Use class-specific regressors to refine the bboxes' positions. (Ross Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014)
  6. R-CNN: Performance on PASCAL VOC07. AlexNet (T-Net): 58.5 mAP • VGG-Net (O-Net): 66.0 mAP
  7. R-CNN: Limitations. Too slow: 13 s/image on a GPU or 53 s/image on a CPU, and 7x slower with VGG-Net. Proposals must be warped to a fixed size.
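Step 5 above ranks the scored proposals and suppresses near-duplicates. A minimal NumPy sketch of greedy non-maximum suppression (the IoU threshold value is illustrative, not from the slides):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) SVM scores.
    Returns indices of the boxes to keep, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```

Each surviving box suppresses all lower-scored boxes that overlap it beyond the threshold.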
  8. Outline: R-CNN • SPP-Net • Fast R-CNN • Unified Approach
  9. SPP-Net: Motivation. Cropping may lose some information about the object, and warping may change the object's appearance. (He et al., Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI 2015)
  10. SPP-Net: Spatial Pyramid Pooling (SPP) Layer. FC layers need a fixed-length input, while conv layers adapt to arbitrary input sizes. Thus we need a bridge between the conv and FC layers. Here comes the SPP layer.
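To make the bridge concrete, here is a minimal NumPy sketch of the SPP idea: max-pool the conv feature map into a fixed set of pyramid bins, so the output length is independent of the input's spatial size (the pyramid levels here are an illustrative choice):

```python
import numpy as np

def spp(feat, levels=(1, 2, 4)):
    """Spatial pyramid pooling over a conv feature map.

    feat: (C, H, W) conv activations with arbitrary H, W.
    Returns a fixed-length vector of C * sum(n*n for n in levels)
    values, regardless of the input spatial size.
    """
    C, H, W = feat.shape
    out = []
    for n in levels:
        # Split H and W into n roughly equal bins and max-pool each bin
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = feat[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(bin_.max(axis=(1, 2)))
    return np.concatenate(out)
```

Feature maps of different sizes produce vectors of the same length, which is exactly what the FC layers require.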
  11. SPP-Net: Training for Detection (1). Step 1: Generate an image pyramid and extract the conv feature map of the whole image at each scale. (Figure: image pyramid → conv → conv5 feature-map pyramid)
  12. SPP-Net: Training for Detection (2). Step 2: For each proposal, walk the image pyramid and find the projected version whose number of pixels is closest to 224x224 (for scale invariance in training). Step 3: Find the corresponding region in the conv5 feature map and use the SPP layer to pool it to a fixed size. Step 4: After obtaining all the proposals' features, fine-tune the FC layers only. Step 5: Train the class-specific SVMs.
  13. SPP-Net: Testing for Detection. Almost the same as R-CNN, except Step 3.
  14. SPP-Net: Performance. Speed: 64x faster than R-CNN using one scale, and 24x faster using a five-scale pyramid. mAP: +1.2 mAP vs. R-CNN.
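The scale selection in Step 2 can be sketched as follows. This assumes the SPP-Net convention that each pyramid "scale" is a target shorter-side length for resizing the image (the five scales are those used in the SPP-Net paper):

```python
def best_scale(w, h, short0, scales=(480, 576, 688, 864, 1200)):
    """Pick the pyramid scale at which a proposal projects closest
    to 224x224 pixels.

    w, h:   proposal size in the original image
    short0: the original image's shorter side; resizing to scale s
            multiplies every length by s / short0
    """
    # Proposal area after resizing the image to scale s: w*h*(s/short0)^2
    return min(scales, key=lambda s: abs(w * h * (s / short0) ** 2 - 224 * 224))
```

Pooling each proposal at the scale where it is roughly 224x224 keeps the conv features consistent with the network's training crop size.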
  15. SPP-Net: Limitations. 1. Training is a multi-stage pipeline (conv layers → FC layers → SVM → regressor). 2. Training is expensive in space and time (extracted features must be stored to disk).
  16. Outline: R-CNN • SPP-Net • Fast R-CNN • Unified Approach
  17. Fast R-CNN: Motivation. Joint training! (Ross Girshick, Fast R-CNN, arXiv tech report)
  18. Fast R-CNN: Joint Training Framework. Combine the feature extractor, classifier, and regressor in a unified framework.
  19. Fast R-CNN: RoI Pooling Layer ≈ a one-scale SPP layer.
  20. Fast R-CNN: Regression Loss. A smooth L1 loss, which is less sensitive to outliers than the L2 loss.
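Since RoI pooling is essentially a single-level SPP applied to one region, it can be sketched like this (the 7x7 output size is the common choice for VGG-style backbones; coordinates are assumed already projected to the feature map):

```python
import numpy as np

def roi_pool(feat, roi, out=7):
    """RoI pooling: one-level SPP restricted to a region of interest.

    feat: (C, H, W) conv feature map
    roi:  (x1, y1, x2, y2) in feature-map coordinates
    Returns a fixed (C, out, out) grid of max-pooled bins.
    """
    x1, y1, x2, y2 = roi
    region = feat[:, y1:y2, x1:x2]
    C, h, w = region.shape
    # Split the region into out x out roughly equal bins
    ys = np.linspace(0, h, out + 1).astype(int)
    xs = np.linspace(0, w, out + 1).astype(int)
    pooled = np.empty((C, out, out))
    for i in range(out):
        for j in range(out):
            pooled[:, i, j] = region[:, ys[i]:ys[i + 1],
                                     xs[j]:xs[j + 1]].max(axis=(1, 2))
    return pooled
```

Every RoI, whatever its size, comes out as the same fixed grid and can feed the shared FC layers.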
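The smooth L1 loss mentioned above is quadratic near zero (like L2) but linear for large residuals, so outlier regression targets contribute a bounded gradient. A minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 when |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)
```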
  21. Fast R-CNN: Scale Invariance. Image pyramids (multi-scale) vs. brute force (single scale). In practice, a single scale is good enough. (The main reason it is ~10x faster than SPP-Net.)
  22. Fast R-CNN: Other Tricks. SVD on FC layers: 30% speedup at test time with a small performance drop. Which layers to fine-tune? Fixing the shallow conv layers reduces training time with a small performance drop. Data augmentation: using VOC12 as an additional training set boosts mAP by ~3%.
  23. Fast R-CNN: Performance. Without data augmentation, mAP improves by only +0.9 on VOC07, but training and testing have been greatly sped up (training 9x, testing 213x vs. R-CNN). Without data augmentation, mAP improves by +2.3 on VOC12.
  24. Fast R-CNN: Discussion on #proposals. Are more proposals always better? No!
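The SVD trick above factorizes an FC weight matrix into two thinner layers. A minimal NumPy sketch of the idea (the truncation rank k is a tuning choice, not specified on the slide):

```python
import numpy as np

def truncate_fc(W, k):
    """Replace an FC weight matrix W (out x in) by two smaller layers
    using its rank-k SVD: W ~= U_k @ diag(s_k) @ Vt_k.

    One layer with out*in weights becomes two layers with
    k*(out + in) weights, cutting test-time multiply-adds when
    k is small.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(s[:k]) @ Vt[:k]   # first layer:  k x in
    W2 = U[:, :k]                  # second layer: out x k
    return W1, W2                  # W2 @ W1 approximates W
```

At full rank the factorization is exact; truncating k trades a little accuracy for speed, matching the "30% speedup with a small performance drop" claim.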
  25. Outline: R-CNN • SPP-Net • Fast R-CNN • Unified Approach
  26. Unified Approach: Motivation. No need for regions.
  27. Unified Approach: Framework • Move away from the classification network; use a deep network like GoogLeNet • Divide the image into a 7x7 grid • Each grid cell is responsible for predicting objects whose center falls in that cell • Predict the class probabilities and coordinates for the object • Testing time is reduced significantly since no region proposals are required • The loss function combines the class-probability error and the bounding-box regression error, as in Fast R-CNN
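The "responsible cell" rule above can be sketched in a few lines, assuming image-space center coordinates:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell responsible for an
    object whose center is (cx, cy) in image coordinates."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centers on the edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col
```

Only the cell returned here contributes the object's class and box terms to the loss.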
  28. Unified Approach: Training • Most grid parameters will tend toward zero, since one object contributes to only one grid cell • Introduce an extra probability for background vs. foreground • The probability error loss for a class is activated only for foreground • Optimize for Pr(class | obj) rather than Pr(class) • Final probabilities are calculated as Pr(obj) * Pr(class | obj)
  29. Unified Approach: Training • Run initial iterations minimizing Pr(obj) and Pr(class | obj) separately • Joint minimization can be run in later stages • The network predicts the bounding box from convolutions over the whole image, which introduces localization error • Penalize predictions with low IoU by rescaling the target confidence to the IoU instead of 1
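The IoU used as the rescaled confidence target can be computed as follows (a plain sketch over corner-format boxes):

```python
def iou(a, b):
    """IoU of axis-aligned boxes a, b = (x1, y1, x2, y2); used as the
    regression target for the predicted confidence instead of a
    constant 1."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Training the confidence toward the IoU makes a badly localized prediction carry a low score even when its class is right.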
  30. Unified Approach: Detection Layer
  31. Unified Approach: Detection Layer
  32. Unified Approach: Detection Layer
  33. Unified Approach: Network. A variant of GoogLeNet with pooling layers replaced by convolutional layers, which helps in localizing objects. A leaky ReLU, f(x) = x for x > 0 and 0.1x for x < 0, increases mAP. A logistic layer at the end enforces predictions within 0 to 1.
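The leaky ReLU from the slide, written out explicitly:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """f(x) = x for x > 0, slope * x otherwise (slope = 0.1 per the slide).

    Unlike a plain ReLU, negative inputs keep a small gradient, so
    units cannot die completely.
    """
    return np.where(x > 0, x, slope * x)
```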
  34. Unified Approach: Saliency • Predicts images at 45 fps • Competitive performance with Fast R-CNN using CaffeNet (mAP = 58.8) on VOC 2007 • Almost 95 times faster than Fast R-CNN • More details to be published in an upcoming paper
  35. THANKS