For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2017-embedded-vision-summit-kim
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Minyoung Kim, Senior Research Engineer at Panasonic Silicon Valley Laboratory, presents the "A Fast Object Detector for ADAS using Deep Learning" tutorial at the May 2017 Embedded Vision Summit.
Object detection has been one of the most important research areas in computer vision for decades. Recently, deep neural networks (DNNs) have driven significant improvements in several machine learning domains, including computer vision, achieving state-of-the-art performance thanks to their modeling and generalization capabilities. However, it is still challenging to deploy such DNNs on embedded systems, for applications such as advanced driver assistance systems (ADAS), where computation power is limited.
Kim and her team focus on reducing the size of the network and required computations, and thus building a fast, real-time object detection system. They propose a fully convolutional neural network that can achieve at least 45 fps on 640x480 frames with competitive performance. With this network, there is no proposal generation step, which can cause a speed bottleneck; instead, a single forward propagation of the network approximates the locations of objects directly.
• Pros
• High performance
• Beat state-of-the-art records in many tasks, including image classification and detection
• Cons
• Require large training datasets
• Require high computational power
• Deep neural networks have millions of parameters
• Slower running time than most conventional algorithms
Object Detection with Deep Learning
Object Detection System
Building an Object Detection System
• Training Deep Neural Network for Classification
• Pedestrian detection: Binary classification
• Object Proposal Generation at different scales
• Generate box proposals (1000 ~ 2000 boxes)
• Selective Search*, Edge Boxes**
• Merge largely overlapping boxes
• Non-Maximum Suppression (NMS)
* J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders, IJCV 2013
** C. Lawrence Zitnick and Piotr Dollár, ECCV 2014 (Microsoft Research)
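The box-merging step above is standard non-maximum suppression. A minimal greedy sketch in plain Python follows; this is generic NMS logic, not PSVL's exact implementation, and the 0.5 threshold is the conventional default rather than a value taken from the slides.

```python
def iou(a, b):
    """Intersection-over-union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it heavily, repeat. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

With 1000–2000 proposals per frame, this merging step is cheap; the expensive part of the classic pipeline is generating the proposals in the first place, as the next slide shows.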
[Diagram: detection pipeline — Proposal Generation → merge boxes → run recognizer (Recognition Network) → Classification: Pedestrian / Background]
Time Consuming!
Proposal Generation & Scaling
• Region proposal
• Selective Search: 2 seconds per image (CPU)
• An order of magnitude slower than Edge Boxes
• Edge Boxes: 0.2 seconds per image
• Scaling
• Multiple forward propagations
• Bottleneck
• A forward propagation of an image
• Less than 0.1 seconds (GPU)
[Diagram: object detection system with the Proposal Generation and Scaling stages highlighted as bottlenecks]
PSVL Pedestrian Detection System
Our Pedestrian Detection System
[Diagram: INPUT → PSVL Neural Detector → OUTPUT, via a single forward propagation]
Our Pedestrian Detection System
• Recognition network converted to a fully convolutional network as the detector
• Add regression layer and fine-tune
• Detection by a single forward propagation
Train DNN for recognition
• GPU & Framework
• NVIDIA Titan X, NVIDIA Tesla K80
• Caffe*
• Network Architectures
• Modified GoogLeNet**
• 25~30 Convolutional layers
• Input: pedestrian and background patches (80x32)
• Output: Sigmoid or Softmax
• Dataset
• Caltech Pedestrian Detection Benchmark***
• 10 hours of 640(w) x 480(h) 30Hz video
• About 250,000 frames with a total of 350,000 bounding boxes
Recognition Network
* http://caffe.berkeleyvision.org/
** C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich (2014)
*** http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/
Convert recognition network to a fully convolutional network
Fully Convolutional Network
[Diagram: base network with fully connected layers (input size limited) vs. the same network with those layers converted to convolutional layers (kernel sliding, input size not limited)]
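The conversion amounts to a weight reshape: a fully connected layer that consumed a flattened feature map becomes a convolution whose kernel covers that same spatial extent, so it can then slide over larger inputs. A minimal NumPy sketch; the shapes below are hypothetical, since the deck's actual network is a modified GoogLeNet trained on 80x32 patches in Caffe:

```python
import numpy as np

def fc_to_conv_weights(fc_w, in_channels, kh, kw):
    """Reshape an FC weight matrix (out_features, in_features) into a
    conv kernel (out_features, in_channels, kh, kw). The conv layer
    produces the same outputs as the FC layer at each window position."""
    out_features, in_features = fc_w.shape
    assert in_features == in_channels * kh * kw
    return fc_w.reshape(out_features, in_channels, kh, kw)

# Hypothetical example: a binary (pedestrian / background) classifier
# head that consumed a flattened 256 x 5 x 2 feature map.
fc_w = np.random.randn(2, 256 * 5 * 2)
conv_w = fc_to_conv_weights(fc_w, 256, 5, 2)
```

Sliding the resulting 5x2 kernel over the feature map of a full 640x480 frame yields one pedestrian/background score per window position in a single forward pass, which is what removes the input-size limit.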
Regression Layer
• Regress bounding boxes on useful features
• Nx4 box coordinates data
• N: Feature Map resolution (NX x NY)
• Original GT Box: B = [x1, y1, x2, y2]
• New GT Box: B’ = rel(B) / m (m: multiplier of Window Size)
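One plausible reading of this target encoding, sketched below with hypothetical names (the deck does not spell out the window geometry, so the origin and window size here are assumptions): the ground-truth box is expressed relative to the responsible feature-map cell's window origin, then normalized by m times the window size.

```python
WINDOW = 120  # assumed sliding-window size in pixels (the slide's diagram shows 240 and 120)

def encode_gt(box, origin, m=2):
    """box: ground-truth [x1, y1, x2, y2] in image coordinates.
    origin: (ox, oy) of the window for the responsible feature-map cell.
    Returns B' = rel(B) / (m * WINDOW): window-relative coordinates in
    window-size units, one 4-vector per cell of the NX x NY feature map."""
    ox, oy = origin
    x1, y1, x2, y2 = box
    rel = [x1 - ox, y1 - oy, x2 - ox, y2 - oy]
    return [v / (m * WINDOW) for v in rel]
```

A box exactly filling an m-scaled window maps to [0, 0, 1, 1], which keeps the regression targets in a small, well-conditioned range.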
Fully Convolutional Network
[Diagram: output feature map of resolution NX x NY with 4 regressed box coordinates per cell; example windows of 240 and 120 pixels with m = 2]
Training the detector network
• Network Architectures
• Custom loss functions
• Feature map: cross-entropy loss with boosting
• Boosting
• Ped: correct results (TPs) + ground truths (FNs)
• True positive if IoU > 0.5
• False negative if a ground truth is not detected
• NonPed: FPs
• False positive if IoU < 0.5
• Regression: Euclidean loss with feature map data incorporated
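The TP/FP/FN bookkeeping behind this boosting scheme can be sketched as follows; this is generic matching logic with the slide's 0.5 IoU threshold, not PSVL's actual training code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def boosted_examples(dets, gts, thresh=0.5):
    """Split detections into TPs (IoU > thresh with a ground truth) and
    FPs; unmatched ground truths are FNs. Returns the boosted positive
    set (TPs + FNs) and the negative set (FPs) for the next round."""
    tps, fps, matched = [], [], set()
    for d in dets:
        best = max(range(len(gts)), key=lambda g: iou(d, gts[g]), default=None)
        if best is not None and iou(d, gts[best]) > thresh:
            tps.append(d)
            matched.add(best)
        else:
            fps.append(d)
    fns = [g for i, g in enumerate(gts) if i not in matched]
    return tps + fns, fps
```

Feeding the network's own false positives back as hard negatives, and its misses back as positives, concentrates the cross-entropy loss on the examples the current detector gets wrong.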
PSVL Neural Detector
[Diagram: 640x480 original images → fully convolutional network + regression layer → feature map and box coordinates]
Even fewer box predictions with Center-Height features
PSVL Neural Detector
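The slide gives no detail on the center-height parameterization, but the likely idea is that pedestrians have a roughly fixed aspect ratio, so a full box can be recovered from just (center_x, center_y, height) — three numbers instead of four. A hedged sketch; the 0.41 width/height ratio is the convention used with the Caltech benchmark and is an assumption here:

```python
ASPECT = 0.41  # assumed pedestrian width/height ratio (Caltech convention)

def box_from_center_height(cx, cy, h):
    """Recover [x1, y1, x2, y2] from a center point and a height,
    assuming a fixed pedestrian aspect ratio."""
    w = ASPECT * h
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```

Dropping the fourth regressed value shrinks the regression output and removes one degree of freedom the network would otherwise have to learn.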
Performance – Very Fast with Competitive Accuracy
• From DeepCascade paper1)
• DeepCascade: NVIDIA K20
• 15 fps
• Ours: NVIDIA GTX770
• 34 fps
• Speed adjustment to K20-equivalent2)
• 34 × 0.9699 ≈ 33 fps
• Ours: NVIDIA Titan X
• 51.422 fps w/o cuDNN
• 85.565 fps with cuDNN4
(*): Left-hand side for methods with unknown fps or less than 0.2 fps
(**): DeepCascade without extra data
(***): SpatialPooling+/Katamari methods use additional motion information
1) A. Angelova, A. Krizhevsky V. Vanhoucke, A. Ogale, D. Ferguson (2015)
2) http://caffe.berkeleyvision.org/performance_hardware.html
[Chart: Performance of Pedestrian Detection Methods (accuracy vs. speed); faster →, more accurate ↑; PSVL ND plotted among the fastest and most accurate methods]
Deploy PSVL ND on Google Nexus 9
• Processor
• NVIDIA Tegra K1
• GPU: NVIDIA Kepler with 192 CUDA cores
• Speed (without any optimization)
• Base resolution (600x390): 5 fps
• Lower resolution (280x240): 16 fps
ND on Portable Device
More Approaches
• Faster-RCNN (2015)*
• Region Proposal Network
• Adds only ~10 ms per image
• Anchor boxes
• Predicts offsets & confidences
Object Detection from others
* S. Ren, K. He, R. Girshick, J. Sun (NIPS 2015)
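The "offsets & confidences" predicted per anchor box follow the Faster R-CNN parameterization: (tx, ty) shift the anchor center in units of its size, and (tw, th) scale its width and height exponentially. A minimal decoding sketch (generic logic, not this talk's code):

```python
import math

def decode(anchor, t):
    """Decode Faster R-CNN-style offsets t = (tx, ty, tw, th) against
    an anchor box [x1, y1, x2, y2] into an absolute box."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    tx, ty, tw, th = t
    cx, cy = acx + tx * aw, acy + ty * ah       # shifted center
    w, h = aw * math.exp(tw), ah * math.exp(th)  # rescaled size
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```

Zero offsets reproduce the anchor itself, so the network only has to learn small corrections around a grid of fixed reference boxes.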
Similar Approaches
• YOLO (2016)*
• Fully convolutional network + fc layer + regression
• YOLO9000 (2017)**
• Improved localization/recall
• Removes the fully connected layer
* J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (CVPR 2016)
** J. Redmon and A. Farhadi (CVPR 2017)
OURS: PSVL Multiple-Object Detection System
• Fire modules* (SqueezeNet)
• Only 13 MB model size
• 16.5 fps on max scale (600x2200)
Performance (Speed, Accuracy)
* F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer (arXiv:1602.07360)