Alexander Zarichkovyi, "Faster than real-time face detection"

I will talk about the object and face detection problems, the evolution of different approaches to solving them, and the ideas behind each approach. I will also describe a meta-architecture that achieves state-of-the-art results on the face detection problem and runs faster than real time.

Published in: Technology

  1. Alexander Zarichkovyi, Ring Ukraine. Faster than real-time face detection
  2. About me ● Junior Researcher @ Ring Ukraine ● Student of “Kyiv Polytechnic Institute” (B.SE. Software Engineering) ● Love algorithms and programming competitions
  3. Agenda
     1. Object detection problem
        a. Why is the detection problem important?
        b. Face detection problem
        c. Datasets
        d. How to evaluate different object detection approaches?
     2. History of object detection architectures
        a. Viola–Jones object detection
        b. Classification based
        c. Regression based
        d. Cascade classification based
  4. Object detection problem
  5. Image classification (source: http://cs231n.github.io/classification/)
  6. Classification vs Detection: classification problem, detection problem (source: http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf)
  7. Why is object detection so important? Object detection results are mostly used as an input for other tasks: ● face recognition ● person recognition ● self-driving cars ● ...
  8. Face detection problem
  9. What is the face detection problem?
  10. How many faces do you see in the picture?
  11. Why is it difficult? ● Occlusions ● Lighting conditions ● Pose ● Diversity ● ...
  12. Datasets
  13. Pascal VOC 2007. 20 classes: • Person: person • Animal: bird, cat, cow, dog, horse, sheep • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor. Train/val size: VOC 2007 has 9,963 images containing 24,640 annotated objects. (source: The PASCAL Visual Object Classes Challenge: A Retrospective)
  14. WIDER FACE: A Face Detection Benchmark ● Consists of 32,203 images with 393,703 labeled faces ● The faces vary widely in appearance, pose and scale ● Multiple attributes are annotated: occlusion, pose and event categories, which allows in-depth analysis of existing algorithms (source: WIDER FACE: A Face Detection Benchmark)
  15. WIDER FACE. Annotations
  16. How good is your detection algorithm?
  17. Intersection over Union (Jaccard index) (source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/)
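
A minimal sketch of Intersection over Union for two axis-aligned boxes may help make the metric concrete. The function name `iou` and the `(x1, y1, x2, y2)` corner format are assumptions chosen for illustration; the slide itself does not prescribe a box format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2).

    The corner format is an assumption made for this sketch.
    """
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get zero intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

A detection is then typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold; the threshold value depends on the benchmark.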
  18. Evaluation plot: recall/precision curve (source: https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot)
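
One way to obtain the points of a recall/precision curve is to sort detections by confidence and accumulate true and false positives. The sketch below assumes each detection has already been marked as a true or false positive (for example by an IoU threshold); the function and parameter names are hypothetical.

```python
def precision_recall_points(scores, is_true_positive, num_ground_truth):
    """Return a list of (recall, precision) points, one per detection.

    scores            -- confidence of each detection
    is_true_positive  -- whether each detection matched a ground-truth object
    num_ground_truth  -- total number of ground-truth objects
    """
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    points = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        points.append((recall, precision))
    return points
```

Each point corresponds to placing the confidence threshold just below that detection's score.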
  19. Evolution of detectors
  20. Viola–Jones object detection architecture. Timeline: Viola-Jones detector (CVPR, 2001)
  21. Viola-Jones detector (1). Main principles: ● Integral image ● Scanning window ● Haar-like features ● Boosted feature selection ● Cascaded classifier (source: Rapid object detection using a boosted cascade of simple features)
  22. Viola-Jones detector (2): integral image
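
The integral image idea can be sketched in a few lines: each entry stores the sum of all pixels above and to the left, so the sum over any rectangle takes four lookups. This is an illustrative pure-Python version, not the original implementation; the zero-padded layout and the function names are assumptions.

```python
def integral_image(img):
    """Build a zero-padded integral image from a 2D list of numbers.

    ii[y][x] holds the sum of img[0:y][0:x], so ii has one extra row and column.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of rows above (same columns) plus the current row prefix.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

After the one-time pass over the image, every rectangle sum costs the same regardless of rectangle size, which is what makes evaluating many rectangular features cheap.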
  23. Viola-Jones detector (3): scanning window
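
A scanning window can be sketched as a generator that slides a square window over the image at several sizes. The base size, step and scale factor below are illustrative defaults, not values taken from the slides.

```python
def scanning_windows(img_w, img_h, win=24, step=4, scale=1.25):
    """Yield (x, y, size) for square windows at increasing sizes.

    win, step and scale are illustrative assumptions.
    """
    size = win
    while size <= min(img_w, img_h):
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield (x, y, size)
        # Grow the window; always by at least one pixel so the loop terminates.
        size = max(size + 1, int(size * scale))
```

Every yielded window would then be passed to the classifier, which is why the per-window cost has to be very low.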
  24. Viola-Jones detector (4): Haar-like features
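
As a rough illustration, a two-rectangle Haar-like feature is the difference between the pixel sums of two adjacent rectangles. The sketch below sums pixels directly for clarity; in the detector these sums would normally come from the integral image. The function names and the left-minus-right sign convention are assumptions.

```python
def region_sum(img, x, y, w, h):
    """Sum of the w-by-h region of a 2D list whose top-left pixel is (x, y)."""
    return sum(sum(row[x:x + w]) for row in img[y:y + h])


def haar_two_rect_horizontal(img, x, y, w, h):
    """Left half minus right half of a w-by-h window (w assumed even)."""
    half = w // 2
    left = region_sum(img, x, y, half, h)
    right = region_sum(img, x + half, y, half, h)
    return left - right
```

A large absolute value indicates a strong vertical edge inside the window; a uniform region gives zero.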
  25. Viola-Jones detector (5): boosted feature selection. [Figure: weighted combination of features; labels: feature importance (h_i ∈ ℝ), feature (α_i ∈ ℝ)]
  26. Viola-Jones detector (6): cascaded classifier
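
The cascade idea can be sketched as a chain of stages where a window is rejected as soon as any stage scores it below that stage's threshold. The representation of a stage as a `(score_fn, threshold)` pair is a hypothetical choice for this sketch; in the detector each stage would be a boosted classifier.

```python
def cascade_classify(window, stages):
    """Return True only if the window passes every stage.

    stages -- list of (score_fn, threshold) pairs, cheapest stage first
              (this representation is an assumption of the sketch).
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            # Early rejection: later, more expensive stages are never run.
            return False
    return True
```

Because most windows in an image are background, rejecting them in the first cheap stages is what keeps the average cost per window low.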
  27. Viola-Jones detector (7). Pros: ● Really fast (can run in real time on embedded devices) ● Low false-positive rate ● Easy to tune. Cons: ● Hand-crafted features ● Hard to train ● Low detection rate on non-frontal faces ● Detects only simple objects
  28. Classification based architectures
  29. Selective Search. Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013)
  30. Selective Search (1) (source: Selective Search for Object Recognition)
  31. Selective Search (2): Selective Search + SIFT + bag-of-words + SVMs = 35.1% mAP on PASCAL 2007
  32. Region-based CNN (R-CNN). Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013)
  33. R-CNN (1) ● Regions: ~2000 Selective Search proposals ● Feature extractor: AlexNet pre-trained on ImageNet, fine-tuned on PASCAL 2007 ● Bounding box regression to refine box locations ● Performance: mAP of 53.7% on PASCAL 2007 (source: Rich feature hierarchies for accurate object detection and semantic segmentation)
  34. R-CNN (2). Pros: ● Accurate ● Any architecture can be used as a feature extractor. Cons: ● Hard to train (many training objectives: softmax classifier, linear SVMs, bounding-box regressors; many of them are trained separately) ● Slow training (84 h on GPU) ● Slow inference (detection): 47 s / image with VGG-16 feature extractor
  35. Why so slow?
  36. The CNN is run many times per image! How can the CNN be run only once on the whole image?
  37. Spatial Pyramid Pooling Network (SPP-Net). Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014)
  38. SPP-Net (1): Spatial Pyramid Pooling layer ● For each region proposal, a 4-level spatial pyramid is used, with grids: ■ 1×1 ■ 2×2 ■ 3×3 ■ 6×6 ● A pooling operation is applied over each grid cell ● In total this gives 1 + 4 + 9 + 36 = 50 bins to pool the features from each feature map (source: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)
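
The bin arithmetic above can be checked with a small sketch of spatial pyramid pooling on a single 2D feature map. Max pooling is assumed as the pooling operation, and the way cell boundaries are rounded is one reasonable choice rather than the paper's exact rule; the function name is hypothetical.

```python
def spp_max_pool(feature_map, levels=(1, 2, 3, 6)):
    """Pool a 2D feature map into a fixed-length list, one value per bin.

    For each level n the map is split into an n-by-n grid and the maximum of
    each cell is kept, giving sum(n * n for n in levels) values in total.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            y0 = (i * h) // n                  # floor of the cell's top edge
            y1 = ((i + 1) * h + n - 1) // n    # ceiling of the cell's bottom edge
            for j in range(n):
                x0 = (j * w) // n
                x1 = ((j + 1) * w + n - 1) // n
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled
```

The output length is 50 for any input size, which is what lets region proposals of different shapes feed the same fully-connected layers.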
  39. SPP-Net (2). Pipeline: ● Input image ● Forward the whole image through the convolutional network ● Get the feature map of the image ● Take ROIs from the proposal method ● Apply the Spatial Pyramid Pooling layer to the feature map ● Fully-connected layers ● Classify regions and apply bounding box regressors
  40. SPP-Net (3): What’s good about SPP-net? It really is faster… [Figure: Pascal VOC 2007 results]
  41. SPP-Net (4): What’s wrong with SPP-net? ● Inherits the rest of R-CNN’s problems ● Introduces a new problem: parameters below the SPP layer cannot be updated during training [Figure labels: Trainable (3 layers), Frozen (13 layers)]
  42. Fast R-CNN. Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014; mAP: 59.2%); Fast R-CNN (Apr, 2015)
  43. Fast R-CNN (1) ● Fast test time, like SPP-net ● One network, trained in one stage ● Higher mean average precision than R-CNN and SPP-net (source: Fast R-CNN)
  44. Fast R-CNN (2). Comparison of R-CNN and Fast R-CNN (both use VGG-16 feature extractor):
                                                R-CNN         Fast R-CNN
      Training time                             84 hours      9.5 hours
      (Speedup)                                 1x            8.8x
      Test time per image (network only)        47 seconds    0.32 seconds
      (Speedup)                                 1x            146x
      mAP (VOC 2007)                            53.7%         66.9%
  45. But these timings do not include the time for Selective Search...
  46. Fast R-CNN (3). Comparison of R-CNN and Fast R-CNN (both use VGG-16 feature extractor):
                                                      R-CNN         Fast R-CNN
      Test time per image (network only)              47 seconds    0.32 seconds
      (Speedup)                                       1x            146x
      Test time per image (with Selective Search)     50 seconds    2 seconds
      (Speedup)                                       1x            25x
  47. How to speed up Selective Search?
  48. Rewrite it to use the GPU?
  49. Use other segmentation algorithms?
  50. Use a neural network!
  51. Faster R-CNN. Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014; mAP: 59.2%, FPS: 0.47); Fast R-CNN (Apr, 2015; mAP: 66.9%, FPS: 0.5); Faster R-CNN (Jun, 2015)
  52. Faster R-CNN (1): Region proposal network. ~100 FPS (source: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks)
  53. Faster R-CNN (2)
  54. Faster R-CNN (3). Comparison of R-CNN / Fast R-CNN / Faster R-CNN (all use VGG-16 feature extractor):
                                                R-CNN         Fast R-CNN    Faster R-CNN
      Test time per image (with proposals)      50 seconds    2 seconds     0.2 seconds
      (Speedup)                                 1x            25x           250x
      mAP (VOC 2007)                            53.7%         66.9%         69.9%
  55. Regression based architectures
  56. You Only Look Once (YOLO). Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014; mAP: 59.2%, FPS: 0.47); Fast R-CNN (Apr, 2015; mAP: 66.9%, FPS: 0.5); Faster R-CNN (Jun, 2015; mAP: 69.9%, FPS: 5); YOLO (Jun, 2015)
  57. YOLO (1): YOLO’s pipeline (source: You Only Look Once: Unified, Real-Time Object Detection)
  58. YOLO (2): YOLO architecture: bottom layers from GoogLeNet, followed by custom layers
  59. YOLO (3). Pros: ● Quite fast (~40 FPS on Nvidia Titan Black) ● End-to-end training ● Low error rate for foreground/background misclassification ● Learns a very general representation of objects. Cons: ● Less accurate than Fast R-CNN (63.9% mAP compared to 66.9%) ● Loss function is an approximation ● Cannot detect small objects ● Low detection rate for objects located close to each other [Figure: comparison of error types, Fast R-CNN vs YOLO]
  60. Single Shot MultiBox Detector (SSD). Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014; mAP: 59.2%, FPS: 0.47); Fast R-CNN (Apr, 2015; mAP: 66.9%, FPS: 0.5); Faster R-CNN (Jun, 2015; mAP: 69.9%, FPS: 5); YOLO (Jun, 2015; mAP: 63.9%, FPS: 40); SSD (Dec, 2015)
  61. SSD (1): SSD architecture (source: SSD: Single Shot MultiBox Detector)
  62. SSD (2): SSD detector example ● 3 default boxes for each cell ● Confidences for 21 classes (20 Pascal VOC 2007 classes + background) ● Regressors: apply the regressors to a default box to get the result
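
To make the per-cell outputs in the example concrete, the sketch below counts the values predicted for one feature-map cell and shows one common way to apply regressed offsets to a default box. The center-size parameterization in `decode_box` is an assumption (a widely used form, shown here without the variance scaling that some implementations add), and all names are hypothetical.

```python
import math

NUM_CLASSES = 21      # 20 Pascal VOC classes + background, as in the example
BOXES_PER_CELL = 3    # default boxes per cell, as in the example


def outputs_per_cell(boxes_per_cell=BOXES_PER_CELL, num_classes=NUM_CLASSES):
    """Each default box predicts class confidences plus 4 box offsets."""
    return boxes_per_cell * (num_classes + 4)


def decode_box(default_box, offsets):
    """Apply regressed offsets to a default box.

    default_box -- (cx, cy, w, h) of the default box
    offsets     -- (dx, dy, dw, dh); the encoding is an assumption of this sketch
    """
    cx, cy, w, h = default_box
    dx, dy, dw, dh = offsets
    # Shift the center relative to the box size, rescale width and height.
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))
```

With the numbers from the example, each cell produces 3 × (21 + 4) = 75 values.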
  63. SSD (3). Comparison of SSD with other detectors:
      Model                     mAP      FPS
      Faster R-CNN (VGG-16)     73.2%    7
      Faster R-CNN (ZF)         62.1%    17
      YOLO                      63.4%    45
      Tiny YOLO                 52.7%    155
      SSD300 (VGG-16)           72.1%    58
      SSD500 (VGG-16)           75.1%    23
      Pros: ● The best speed/accuracy trade-offs ● State-of-the-art results on all object detection datasets ● Works well with lightweight feature extractors (InceptionV2, SqueezeNet, MobileNet, ShuffleNet, etc.). Cons: ● Default boxes are a hyperparameter ● Works poorly with heavy feature extractors (ResNet-101, InceptionV4, VGG-16, etc.)
  64. Cascade classification based architectures
  65. Multi-task cascade NN (MTCNN). Timeline: Viola-Jones detector (CVPR, 2001); Selective Search (IJCV, 2013; mAP: 35.1%); R-CNN (Nov, 2013; mAP: 53.7%, FPS: 0.05); SPP-Net (Jun, 2014; mAP: 59.2%, FPS: 0.47); Fast R-CNN (Apr, 2015; mAP: 66.9%, FPS: 0.5); Faster R-CNN (Jun, 2015; mAP: 69.9%, FPS: 5); YOLO (Jun, 2015; mAP: 63.9%, FPS: 40); SSD (Dec, 2015; mAP: 72.1%, FPS: 58); MTCNN (Apr, 2016)
  66. MTCNN (1). Network speed and accuracy on crops:
      Network    Input size    FPS*     Validation accuracy
      P-Net      12x12         8000     94.6%
      R-Net      24x24         650      95.4%
      O-Net      48x48         220      95.4%
      * for the original network input size and batch size 1
      (source: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks)
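
Since the first network only sees 12x12 inputs, a full image is usually processed at several scales so that faces of different sizes shrink to roughly that size. The sketch below computes such a list of scales; the minimum face size of 20 pixels and the scale factor 0.709 are assumptions borrowed from commonly used public implementations, not values given in the slides, and the function name is hypothetical.

```python
def pyramid_scales(img_w, img_h, min_face_size=20, factor=0.709, net_input=12):
    """Scales at which to resize the image before running the 12x12 network.

    min_face_size and factor are assumed defaults; net_input is the
    first-stage input size from the table above.
    """
    # At the first scale a face of min_face_size pixels becomes net_input pixels.
    scale = net_input / min_face_size
    min_side = min(img_w, img_h) * scale
    scales = []
    # Stop once the resized image is smaller than the network input.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Raising the minimum face size shortens the pyramid, which is one of the speed/accuracy knobs in this kind of cascade.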
  67. MTCNN (2): MTCNN's network architectures; landmarks example
  68. MTCNN (3): recall/precision curve on the test set of the WIDER FACE dataset
  69. MTCNN (4). Pros: ● Really fast (100 FPS on GPU) ● Many speed/accuracy trade-offs ● State-of-the-art results on many face detection datasets (CelebA, FDDB, etc.). Cons: ● Hard to train ● Many hyperparameters ● Low detection rate for small faces ● Works poorly without landmarks
      Comparison of face detector models on the WIDER FACE test set:
      Model                     mAP      FPS
      MTCNN                     85.1%    100
      Faster R-CNN (VGG-16)     93.2%    5
      SSH (VGG-16)              91.9%    10
  70. Questions?
  71. Thanks for your attention! Contact information: alexander.zarichkovyi@ring.com
