1. Haifeng SHEN
DiDi AI Labs
Zhengping CHE
DiDi AI Labs
Guangyu LI
DiDi AI Labs
Yuhong GUO
DiDi AI Labs
Carleton University
Jieping YE
DiDi AI Labs
Univ. of Michigan, Ann Arbor
17. LeNet
LeNet-5 (1998)
• A neural network architecture for handwritten and machine-printed character recognition in the 1990s
• Consists of seven layers, including
• Convolution operations
• Pooling operations
• Full connections
Yann LeCun, et al., Gradient-Based Learning Applied to Document Recognition, 1998
Bottom-right: https://engmrk.com/lenet-5-a-classic-cnn-architecture/
18. AlexNet
AlexNet (2012)
• ILSVRC 2012 winner (16.4% top-5 error)
• 60 million parameters and 650,000 neurons
• 8 learned layers: 5 convolutional and 3 fully-connected layers
• A 1000-way softmax layer after the last fully-connected layer
• Dropout and ReLU
• Trained in parallel on 2 GPUs
Alex Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
Bottom-right: Nitish Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014
19. VGGNet
• Six versions with 5 convolution groups and 11-19 layers
• VGG16 (138 million parameters) and VGG19
• Only 3x3 conv and 2x2 max-pooling layers before the FC layers
• Results @ ILSVRC 2014
• 1st in the localization task
• 2nd in the classification task (7.3% top-5 error)
VGGNet (2014)
Karen Simonyan, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
20. GoogLeNet
• ILSVRC 2014 winner (6.7% top-5 error)
• 22 layers with only 5 million model parameters
• Inception concept
• Multiple conv kernels: 1x1, 3x3, and 5x5
• 1x1 kernels for dimension reduction
• Better representational power + fewer network parameters
• More advanced Inception modules (V2, V3, and V4)
Inception-V1 Module
GoogLeNet (2014)
Christian Szegedy, et al., Going Deeper with Convolutions, 2015
21. ResNet
• 1st place in the ILSVRC 2015 classification task (3.6% top-5 error)
• Deeper model with fewer filters and lower complexity
• 34-layer baseline: 3.6 billion FLOPs, only 18% of VGG-19 (19.6 billion FLOPs)
• Up to 152 layers!
• Initialization, batch norm, residual block… (a sketch of the residual block follows below)
ResNet Block
ResNet (2015, top)
Kaiming He, et al., Deep Residual Learning for Image Recognition, 2016
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
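A minimal PyTorch sketch of the basic residual block (identity shortcut only; ResNet also uses projection shortcuts when the shape changes, which are omitted here):

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal 2-layer residual block: y = ReLU(F(x) + x), identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection adds the input back
```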
22. DenseNet
• L(L+1)/2 direct connections for an L-layer block (a connectivity sketch follows below)
• Fewer parameters and less computation
DenseNet Block
DenseNet (2016)
x_l = H_l([x_0, x_1, …, x_{l-1}])
Gao Huang, et al., Densely Connected Convolutional Networks, 2016
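A minimal PyTorch sketch of the dense connectivity x_l = H_l([x_0, …, x_{l-1}]), with H_l simplified to BN-ReLU-Conv (the paper's bottleneck and transition layers are omitted):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            c_in = in_channels + l * growth_rate  # grows with every layer
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c_in),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_in, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, ..., x_{l-1}]): concatenate all predecessors
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```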
23. SENet
• ILSVRC 2017 winner (2.251% top-5 error)
• Squeeze-and-excitation block (a sketch follows below)
• Squeeze: global average pooling
• Excitation: channel association
• Scale: channel attention
• Integration with modern architectures
Squeeze-and-Excitation Block
SENet (2017)
Jie Hu, et al., Squeeze-and-Excitation Networks, 2018
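A minimal PyTorch sketch of the squeeze / excitation / scale pipeline (reduction ratio 16 as in the paper; class and variable names are my own):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global avg pool) -> Excitation (two FCs) -> Scale (reweight channels)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # scale: channel attention applied to the input
```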
24. DLA: Deep Layer Aggregation
DLA (2018)
• Layer aggregation to better fuse information
• Iterative deep aggregation (IDA)
• Spatial fusion
• Resolutions and scales
• Hierarchical deep aggregation (HDA)
• Semantic fusion
• Channels and depths (modules)
Fisher Yu, et al., Deep Layer Aggregation, 2018
25. Classification Experiments
Classification Accuracy

Method | Car Brand Classification (66 classes) | Car Brand Classification (2,506 classes)
ResNet | 94.60% | -
SENet | 92.30% | -
DLA | 96.02% | 93.75%

• Dataset-1: 193,186 images of 66 classes, collected offline
• Dataset-2: 549,169 images of 2,506 classes, collected offline + online
• Similar settings to the Stanford Cars dataset
30. R-CNN: Regions with CNN Features
• Selective Search + CNN + SVM
• Started using CNN features instead of traditional hand-crafted features
• ~2k bottom-up region proposals from selective search
• Time-consuming: extracts features for every proposal separately
Ross Girshick, et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014
Bottom-Right: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf
R-CNN (2014)
31. Fast R-CNN
• One image + multiple RoIs processed in a single CNN forward pass
• RoI pooling: generates a fixed-size feature vector for each proposal
• Outputs: softmax probabilities + bounding-box regression offsets
• End-to-end training with a multi-task loss
Fast R-CNN (2015)
Right: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Ross Girshick, Fast R-CNN, 2015
32. Faster R-CNN
• Region proposal network (RPN) + Fast R-CNN
• RPN & detection network share full-image convolutional features
• Anchors with multiple scales and aspect ratios
Bottom-Left: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Shaoqing Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015
Faster R-CNN (2015)
Region Proposal Network
33. R-FCN: Region-based Fully Convolutional Networks
• Position-sensitive score map before RoI pooling
• 9 positions: {top, middle, bottom} x {left, center, right}
• Position-sensitive RoI pooling instead of standard RoI pooling
• Fully convolutional detection network instead of the fully-connected detection head in Faster R-CNN
Jifeng Dai, et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016
R-FCN (2016)
Position-Sensitive Score Map
34. Light-Head R-CNN
• Heavy head
• E.g., Faster R-CNN & R-FCN
• Intensive computation around RoI warping
• Light-Head R-CNN
• Thin feature maps from large separable convolution layers
• Cheap R-CNN subnet with a single FC layer
Zeming Li, et al., Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017
Light-Head R-CNN (2017)
"Heavy"-Head Detectors
Large Separable Convolution
35. FPN: Feature Pyramid Networks
• Bottom-up pathway
• Top-down pathway
• Lateral connections
Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, 2017
Different Feature Maps
FPN Block
• Feature pyramid: combination of
• Low-resolution, semantically strong features
• High-resolution, semantically weak features
36. Cascade R-CNN
• Multi-stage extension of R-CNN
• Trained sequentially using the output of the previous stage
• Cascaded bbox regression
• f(x, b) = f_T ∘ f_{T-1} ∘ ⋯ ∘ f_1(x, b)
• Cascaded detection
• A sequence of detectors trained with increasing IoU thresholds
Zhaowei Cai, et al., Cascade R-CNN: Delving into High Quality Object Detection, 2018
Cascade R-CNN
37. SNIP: Scale Normalization for Image Pyramids
• CNNs are not robust to changes in scale
• Multi-scale image pyramids for objects with different scales
• Detections from each scale are rescaled and combined using NMS (a generic NMS sketch follows below)
• Small objects from the high-resolution image
• Large objects from the low-resolution image
Bharat Singh, et al., An Analysis of Scale Invariance in Object Detection - SNIP, 2018
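For reference, a minimal greedy NMS sketch in Python (the generic algorithm the slide refers to, not SNIP-specific logic; the box format [x1, y1, x2, y2] and the threshold are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```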
38. YOLOv3 (2018)
YOLO: You Only Look Once
• End-to-end one-stage method
• Directly uses full images to predict each bounding box
• Extremely fast; runs in real time
• YOLOv2
• Darknet-19 backbone
• Anchor mechanism
• YOLOv3
• Multi-scale features
• Darknet-53 backbone
Joseph Redmon, et al., You Only Look Once: Unified, Real-Time Object Detection, 2016
Joseph Redmon, et al., YOLO9000: Better, Faster, Stronger, 2017
Joseph Redmon, et al., YOLOv3: An Incremental Improvement, 2018
Top-Left: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/
Bottom-Left: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b/
YOLO (2016)
39. SSD: Single Shot Detector
• Multiple feature maps with different resolutions and scales
• Improved speed/accuracy trade-off
Wei Liu, et al., SSD: Single Shot MultiBox Detector, 2016
SSD (2016)
YOLOv1
40. DSSD: Deconvolutional SSD
• Encoder-decoder hourglass structure
• Wide → Narrow → Wide
• Convolution and deconvolution modules
• Deconvolution: introduces additional large-scale context for object detection
• Two prediction modules
• Each with one residual block
Cheng-Yang Fu, et al., DSSD: Deconvolutional Single Shot Detector, 2017
SSD
DSSD (2017)
Selected Prediction Module
41. RetinaNet
• Focal loss instead of the standard cross-entropy loss
• Focuses training on a sparse set of hard examples
• FL(p_t) = -(1 - p_t)^γ log(p_t) (a sketch follows below)
Tsung-Yi Lin, et al., Focal Loss for Dense Object Detection, 2017
RetinaNet (2017)
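A direct transcription of the focal loss (binary form) in PyTorch; the paper additionally uses a class-balancing weight α, included here, and the numeric clamp is mine for stability:

```python
import torch

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probabilities in (0, 1); y: 0/1 targets of the same shape."""
    p_t = torch.where(y == 1, p, 1 - p)          # prob of the true class
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```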
42. RefineDet
Shifeng Zhang, et al., Single-Shot Refinement Neural Network for Object Detection, 2018
RefineDet (2018)
• Anchor refinement module
• Filters out easy negatives
• Coarsely adjusts anchors
• Object detection module
• Further improves regression
• Predicts multi-class labels
Transfer Connection Block
43. CornerNet
• Object as a pair of bounding box corners
• No need for anchor boxes
• Regression problem → corner prediction problem
• Corner pooling
• To better localize corners of the bounding box
Hei Law, et al., CornerNet: Detecting Objects as Paired Keypoints, 2018
CornerNet (2018)
Corner Pooling
44. MRFSWSnet:
• Multiple Receptive Field block (MRF): multiple receptive fields and more features for prediction
• Auxiliary Semantic Segmentation block (ASM): auxiliary semantic segmentation focusing on small objects
• Object Detection block (ODM): combines MRF and ASM with parallel training
• Loss function:
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Net
45. Experiments on MRFSWSnet
Method | Recall | Precision | F1 Score
Faster R-CNN | 97.57 | 96.47 | 97.01
RetinaNet | 97.80 | 97.80 | 97.80
Light-Head R-CNN | 97.71 | 95.13 | 96.40
YOLOv3 | 98.57 | 97.32 | 97.94
MRFSWSnet | 98.71 | 97.32 | 98.01

• Images collected by dash camera
• Detection of cellphone usage during driving
• 1,000 testing images
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
46. Challenges
• Depend on large amounts of labeled data, incurring expensive annotation costs
• Difficult to apply directly in new operation environments
• Computation-intensive; highly demanding of computational resources
• Complicated models are time/memory-consuming, which prevents usage in real-time operation systems (e.g., DMS)
47. Yuhong GUO DiDi AI Labs & Carleton University
Part II: Advanced Topics
50. Domain Adaptation/Transfer Learning
• Definition [Pan et al., IJCAI 2013]: the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks
S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI 2013.
Tan, Chuanqi, et al. "A Survey on Deep Transfer Learning", International Conference on Artificial Neural Networks, Springer, 2018.
51. Why Domain Adaptation
§ Successful application of ML in industry depends on learning from large amounts of labeled data
✗ Expensive and time-consuming to collect labels
✗ Difficult or dangerous to collect data in certain scenarios, e.g., autonomous driving
§ Domain Adaptation/Transfer Learning provides the essential ability of
✓ Reusing existing labeled resources
✓ Adapting to changing environments
✓ Learning from simulations
52. Transfer Learning vs Traditional ML
[Diagram] Transfer Learning/Domain Adaptation: train on domain/task A, test on a different domain/task B
[Diagram] Traditional ML ((semi-)supervised learning): train and test within the same domain/task
55. Adapting to New Domains
§ Reuse existing datasets, and hence their annotation information
• Object Recognition
• Object Detection
• Person Re-Identification
• Image Segmentation
• Image Classification …
56. Learning from Simulations
§ Gathering data and training models can be too expensive, too time-consuming, or too dangerous
§ Solution: create data and learn from simulations
• OpenAI's Universe will potentially allow us to train a self-driving car using GTA 5 or other video games.
• Training models on real robotics is too slow and expensive
http://ruder.io/transfer-learning/index.html
57. Common Datasets
§ Object recognition: Office-31, ImageCLEF-DA
§ Visual domain adaptation challenge dataset: VisDA-2017
§ Digits: MNIST, SVHN, USPS
§ Syn2Real dataset – a new dataset for object recognition [Peng et al., 2018]
60. Categories of DA Methods
Three main classes:
§ Reweighting/Instance-based methods
§ Feature-based/Representation learning methods
§ Parameter/Model-based methods
71. Theoretical Connection
§ A-distance: a measure of the distance between probability distributions
§ Bound on the target domain error (stated below)
Ben-David et al. "Analysis of Representations for Domain Adaptation", NIPS 2006.
Kifer et al. "Detecting Change in Data Streams", VLDB 2004.
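For reference, the bound has the following standard form (reproduced from Ben-David et al., not from the slide; λ denotes the error of the ideal joint hypothesis):

```latex
% Target error of hypothesis h, bounded by source error plus domain divergence:
\epsilon_T(h) \;\leq\; \epsilon_S(h) \;+\; d_{\mathcal{H}}\!\left(\mathcal{D}_S, \mathcal{D}_T\right) \;+\; \lambda
% \epsilon_S, \epsilon_T : source / target errors of h
% d_{\mathcal{H}}        : H-divergence between the domains (estimated via the A-distance)
% \lambda                : error of the ideal joint hypothesis on both domains
```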
73. Domain Adversarial Neural Network (DANN)
§ DANN: the adversarial objective is implemented via a gradient reversal layer (GRL); a minimal sketch follows below
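A minimal PyTorch sketch of a gradient reversal layer (identity in the forward pass, gradient scaled by -λ in the backward pass); an illustration, not the DANN authors' code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on backward,
    so the feature extractor is trained to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # no gradient for lambd

def grad_reverse(x, lambd=1.0):
    """Insert between the feature extractor and the domain classifier."""
    return GradReverse.apply(x, lambd)
```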
74. Model Sharing and Adversarial Adaptation
§ Adversarial Discriminative Domain Adaptation (ADDA): the source CNN is trained without sacrificing discriminativity
75. Reweighted Adversarial Adaptation [Chen et al., CVPR 18]
§ Re-weights the source domain label distribution to help reduce domain discrepancy and adapt the classifier
§ Reweighted adversarial loss (RAAN)
Chen, et al. "Re-Weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation", CVPR 2018.
76. Alternative Adversarial Terms
§ Maximum Classifier Discrepancy (MCD):
• Trains both classifiers and the generator to classify the source samples correctly
• Adversarial loss: prediction discrepancy between the classifiers on the target domain
K. Saito, et al. "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 2018.
81. Multi-Level Adversarial Adaptation
Object detection: DA-Faster-R-CNN
§ Adversarial loss via GRL at both the image level and the instance level
§ Consistency regularization between the two levels
Chen, et al. "Domain Adaptive Faster R-CNN for Object Detection in the Wild", CVPR 2018.
86. Cycle-Consistent Adversarial DA
§ Limitations of domain alignment techniques
§ CyCADA
Hoffman, et al. "CyCADA: Cycle-Consistent Adversarial Domain Adaptation", ICML 2018.
87. Cycle-Consistent Adversarial DA
Image-level GAN loss (green), feature-level GAN loss (orange), source and target semantic consistency losses (black), source cycle loss (red), and source task loss (purple).
Hoffman, et al. "CyCADA: Cycle-Consistent Adversarial Domain Adaptation", ICML 2018.
90. Pseudo-Label based Methods
Some positive applications in domain adaptation:
• Progressive domain adaptation for object detection
• For recognition: collaborative and adversarial networks
Zhang et al. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation", CVPR 2018.
Inoue et al. "Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation", CVPR 2018.
91. Summary
• Unsupervised domain adaptation has received a lot of attention
• Open domain learning remains challenging, but is starting to draw attention
• Most studies have focused on classification problems
• Much less effort has been made on more complex tasks such as object detection
93. Basics
Number of multiplications for one standard convolutional layer:
Input: D_F x D_F x M    Output: D_G x D_G x N
D_K: kernel size
M: number of input channels
N: number of output channels
D_G: output dimension
Multiplications: D_K x D_K x M x N x D_G x D_G
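As a quick sanity check of the formula above (sizes here are illustrative, chosen to match the Inception example a few slides later):

```python
def standard_conv_mults(d_k, m, n, d_g):
    """Multiplications in one standard conv layer:
    kernel area (D_K^2) x in/out channels (M x N) x output area (D_G^2)."""
    return d_k * d_k * m * n * d_g * d_g

# 3x3 kernel, 192 -> 256 channels, 28x28 output
print(standard_conv_mults(3, 192, 256, 28))  # 346,816,512 (= 442,368 weights x 784 positions)
```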
94. Basics
• Architecture design → lightweight models
• Use two 3x3 conv layers to replace one 5x5 conv layer: (3x3 + 3x3)/(5x5) = 18/25 ≈ 0.72
• Use two sequential 1xn and nx1 conv layers to replace an nxn conv layer: (1xn + nx1)/(nxn) = 2/n
97. Inception Module
Traditional 3x3 convolution block (previous layer → 3x3 convolution → output layer) vs. Inception module with dimension reduction (V1 block, from GoogLeNet)
Input: 28 x 28 x 192; Output: 28 x 28 x 256
#Model parameters:
• Traditional 3x3 conv: 3 x 3 x 192 x 256 = 442k
• Inception-V1 block: 1 x 1 x 192 x 64 + (1 x 1 x 192 x 96 + 3 x 3 x 96 x 128) + (1 x 1 x 192 x 16 + 5 x 5 x 16 x 32) + (0 (max-pooling) + 1 x 1 x 192 x 32) = 163k
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842, 2014.
98. Inception V1, V2, V3
• Use two 3x3 convs to replace a 5x5 conv
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842, 2014.
Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, http://arxiv.org/abs/1502.03167, 2015.
Szegedy et al. Rethinking the Inception Architecture for Computer Vision, http://arxiv.org/abs/1512.00567, 2015.
99. Xception
François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357, 2016-2017.
• Depthwise separable convolution
• (3 x 3 x 1 x M/7 x 112 x 112) x 7
100. SqueezeNet
Input: F x F x M
Squeeze:
• 1x1 convs; output: F x F x S (S < M)
Expand:
• 1x1 convs; output: F x F x e1
• 3x3 convs; output: F x F x e2
Concat: F x F x (e1 + e2)
Forrest N. Iandola, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360, 2016.
101. MobileNet V1:
• Standard convolution cost: D_K x D_K x M x N x D_F x D_F
• Depthwise separable conv:
(1) depthwise conv (1 filter per input channel): D_K x D_K x M x D_F x D_F
(2) pointwise conv (1x1 convs): M x N x D_F x D_F
• Computation reduction: 1/N + 1/D_K² (a code sketch of the block follows below)
Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. https://arxiv.org/abs/1704.04861, 2017.
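A minimal PyTorch sketch of the depthwise separable block just described; the BN/ReLU placement follows the common MobileNet-style pattern, and the helper name is my own:

```python
import torch.nn as nn

def depthwise_separable(m, n, stride=1):
    """Depthwise 3x3 conv (one filter per input channel, groups=m)
    followed by a 1x1 pointwise conv that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(m, m, 3, stride=stride, padding=1, groups=m, bias=False),  # depthwise
        nn.BatchNorm2d(m),
        nn.ReLU(inplace=True),
        nn.Conv2d(m, n, 1, bias=False),                                      # pointwise
        nn.BatchNorm2d(n),
        nn.ReLU(inplace=True),
    )
```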
103. MobileNet V1
• Uses conv with stride=2 to replace pooling
• Adds two hyper-parameters: width multiplier α and resolution multiplier ρ
• α = 1.0, 0.75, 0.5, 0.25; standard MobileNet when α = 1
104. MobileNet V2
Inverted residual block (MobileNetV2 vs. MobileNetV1): increase the number of channels (expand dimension), then reduce it
Linear bottlenecks: the nonlinear activation is removed in the low-dimensional space
Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. https://arxiv.org/abs/1801.04381, 2018.
105. ShuffleNet V1
• Pointwise group convolution (1x1 conv)
• Channel shuffle: helps information flow across feature channels (a sketch follows below)
• Uses a concat operation to concatenate channels
Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083, 2017.
#g: number of groups
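A minimal sketch of the channel shuffle operation (the standard reshape-transpose-flatten trick; an illustration, not the authors' code):

```python
import torch

def channel_shuffle(x, groups):
    """Permute channels so information flows across groups:
    (b, g*c, h, w) -> view as (b, g, c, h, w) -> transpose g and c -> flatten back."""
    b, channels, h, w = x.shape
    c = channels // groups
    return (x.view(b, groups, c, h, w)
             .transpose(1, 2).contiguous()
             .view(b, channels, h, w))
```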
106. ShuffleNet V1
Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
108. ShuffleNet V2
Reduce memory access cost:
• Channel split
• Remove group convolution
• Put the channel shuffle module after channel concatenation
Ningning Ma et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164, 2018.
109. Experiments - Classification

Model | mAP (%) | Precision (%) | Recall (%) | Size (MB) | Speed (ms/photo)
Server-based + Yolov2 | 99.62 | 99.60 | 99.65 | N/A | N/A
1.00x ShuffleNet V2 + Yolov2 | 96.43 | 97.16 | 96.83 | 5.20 | 80.00
0.50x ShuffleNet V2 + Yolov2 | 95.86 | 97.28 | 96.28 | 1.70 | 40.00
0.50x ShuffleNet V2 + SSD | 97.73 | 90.61 | 97.98 | 7.90 | 65.00
0.25x ShuffleNet V2 + SSD | 97.25 | 90.46 | 97.59 | 5.00 | 45.00
Category | Abbreviation
Front page of ID card | id_card_f
Back page of ID card | id_card_b
Front page of driver license | driver_license_f
Back page of driver license | driver_license_b
Front of main page in car license | car_license_f
Back of main page in car license | car_license_b
Supplementary page in car license | vehicle_license
Real car photo (whole car) | whole car
Real car photo (car plate) | plate
115. Application - Pay by smiling
• In Sep. 2017, Alibaba's Ant Financial affiliate and KFC China announced facial-recognition payment available for customers in the fast food restaurant chain's new KPRO store in Hangzhou.
• The "Smile to Pay" facial recognition payment solution at KFC enables customers to pay without their wallets.
https://www.jrzj.com/194328.html
116. Application - Check-in at station
Taiyuan South railway station, Beijing West railway station, Shanghai metro station
https://baijiahao.baidu.com/s?id=1552314447507461&wfr=spider&for=pc
http://www.sohu.com/a/220124437_99966914
http://dy.163.com/v2/article/detail/D5U3QH2P0525KG01.html
126. Overview - Detection & landmark dataset

Face detection dataset | Available | # Faces | # Images | Website | Remarks
FDDB | Public | 5,171 | 2,845 | http://vis-www.cs.umass.edu/fddb/ | Unconstrained faces
WiderFace | Public | 393,703 | 32,203 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace | Easy, Medium, Hard sets; a high degree of variability in scale, pose, and occlusion
MALF | Public | 11,931 | 5,250 | http://www.cbsr.ia.ac.cn/faceevaluation/ | Bounding boxes; Multi-Attribute Labelled Faces; pose and facial attributes
Caltech 10,000 Web Faces | Public | - | 10,524 | http://www.vision.caltech.edu/Image_Datasets/Caltech_10K_WebFaces/ | Collected from Google image search; 4 landmarks (two eyes, nose, and mouth)
PUT | Public | - | 9,971 | http://biometrics.put.poznan.pl/put-face-database/ | 30 landmarks, 194 contour points
AFLW | Public | 25,993 | - | https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | Collected from Flickr; 21 landmarks
127. Overview - Detection - MTCNN
Kaipeng Zhang et al. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. https://arxiv.org/abs/1604.02878v1.2016.
• Proposes a deep cascaded multi-task framework with three stages: P-Net, R-Net, and O-Net
• Each is a shallow network
• P-Net: proposal network; produces candidate windows quickly through a shallow CNN
• R-Net: refine network; rejects a large number of non-face windows through a more complex CNN
• O-Net: output network; uses a more powerful CNN to refine the result and output facial landmark positions
128. Overview - Detection - Face RFCN
Yitong Wang et al. Detecting Faces Using Region-based Fully Convolutional Networks. https://arxiv.org/abs/1709.05256. 2017.
• The framework is based on R-FCN
• Proposes a region-based face detector applying deep networks in a fully convolutional fashion
• Introduces additional smaller anchors and modifies the position-sensitive RoI pooling to a smaller size to suit the detection of tiny faces
• Proposes position-sensitive average pooling instead of normal average pooling for the final feature voting in R-FCN
• Uses a multi-scale training strategy and Online Hard Example Mining (OHEM)
129. Overview - Detection - PyramidBox
Xu Tang et al. PyramidBox: A Context-assisted Single Shot Face Detector. https://arxiv.org/abs/1803.07737?context=cs. 2018.
• Baidu proposes PyramidBox
• Extends the VGG16 backbone and generates feature maps at different levels
• Generates a series of anchors corresponding to larger regions around a face that contain more contextual information, such as head, shoulder, and body
130. Overview - Recognition - Dataset
Dataset | Available | # People | # Images | Website | Remarks
LFW | Public | 5K | 13K | http://vis-www.cs.umass.edu/lfw/#views | Labeled Faces in the Wild
YFD | Public | 1.5K | 3.4K (video) | https://www.cs.tau.ac.il/~wolf/ytfaces/ | YouTube Faces Database
CelebA (CelebFaces Attributes Dataset) | Public | 10K | 202K | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | Multimedia Lab, The Chinese University of Hong Kong
CASIA-WebFace | Public | 10K | 500K | http://www.cbsr.ia.ac.cn/english/CASIA-WebFace/CASIA-WebFace_Agreements.pdf |
MS-Celeb-1M | Public | 100K | 10M | https://www.msceleb.org |
VGGFace2 | Public | 9K | 3.3M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/ | Downloaded from Google Image Search; large variations in pose, age, illumination, ethnicity, and profession
Facebook | Private | 4K | 4,400K | N/A |
Google | Private | 8,000K | 100-200M | N/A |
132. Overview - Recognition - Results
Time | Method | Training size | Method description | LFW | Comments
1991 | Eigenfaces | <10k | Principal component analysis (PCA) | 60.02% |
2006 | LBP+CSML | <10k | Local binary pattern (LBP) + metric learning | 85.57% |
2013 | High-dim LBP | 0.1m | High-dim LBP + Joint Bayesian | 95.17% |
2014 | DeepFace | 4m | CNN + 3D face alignment | 97.35% | Facebook
2014 | DeepID | 0.2m | CNN + Softmax | 97.45% | CUHK
2015 | VGGFace | 2.6m | VGG + Softmax | 98.95% | Oxford
2015 | FaceNet | 200m | Inception + Triplet loss | 99.63% | Google
2015 | Ensemble face | 1.2m | CNN + Multi-patch + Deep metric | 99.77% | Baidu
2016 | Effective face | 2.5m | CNN + Augmentation | 98.06% | Pose + shape + expression
2017 | SphereFace | 0.5m | CNN + Angular-Softmax | 99.42% | Multiplicative angular margin: cos(mθ)
2018 | ArcFace | 6.8m | CNN + Additive angular margin | 99.83% | Additive angular margin: cos(θ + m)
2019 | Combined loss | N/A | cos(m1·θ + m2) - m3 | |
133. Overview - Recognition - DeepFace
Yaniv Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.
https://ieeexplore.ieee.org/document/6909616. CVPR 2014.
• CNN + DNN structure
• L4-L6 are locally connected layers without weight sharing, rather than standard convolutional layers
• The last two layers, F7 and F8, are fully connected
• Employs 3D face modeling to apply an affine transformation for 3D face alignment and obtain the frontal face
• More than 120 million parameters
• Trained on four million facial images belonging to more than 4,000 identities
134. Overview - Recognition - DeepID
Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Representation from Predicting 10,000 Classes. https://www.cv-
foundation.org/openaccess/content_cvpr_2014/papers/Sun_Deep_Learning_Face_2014_CVPR_paper.pdf. CVPR2014.
• Uses face patches; each patch is processed by its own ConvNet
• Each ConvNet has 4 layers
• 60 face patches over ten regions, three scales, and RGB or gray channels
• 60 ConvNets x two 160-dimensional vectors (patch and its flipped counterpart): a 19,200-dimensional vector in total for face verification
• Achieves 97.45% face verification accuracy on LFW
• Based on DeepID, the Chinese University of Hong Kong later provided DeepID2 and DeepID3
135. Overview - Recognition - FaceNet
Florian Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/abs/1503.03832. CVPR 2015.
• Google proposes this structure
• Directly uses a deep convolutional network
• Uses the triplet loss for training: minimizes the distance between an anchor and a positive, which share the same identity, and maximizes the distance between the anchor and a negative of a different identity (a sketch follows below)
• Uses the Euclidean distance to measure face similarity for verification
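A minimal sketch of the triplet loss described above (L2 normalization and margin follow the FaceNet formulation; the hard-triplet mining the paper relies on is omitted):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), averaged over the batch.
    Inputs are embedding batches of shape (B, D); they are L2-normalized first."""
    a, p, n = (F.normalize(t, dim=1) for t in (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_an = (a - n).pow(2).sum(dim=1)  # squared distance anchor-negative
    return F.relu(d_ap - d_an + margin).mean()
```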
136. Overview - Recognition - Ensemble Face
Jingtuo Liu et al. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. https://arxiv.org/pdf/1506.07310. 2015.
• Multi-patch feature extraction: 9 image patches, each centered at a different landmark on the face region
• Each patch: 9 convolution layers with a softmax layer at the end
• Concatenates the last convolution layer of each network to build a high-dimensional feature for the face representation
• A metric learning method with triplet loss is used for dimensionality reduction to 128/256 dimensions
• Achieves 99.77% accuracy on LFW under the 6,000-pair evaluation protocol
137. Overview - Recognition - Effective Face
Iacopo Masi et al. Do We Really Need to Collect Millions of Faces for Effective Face Recognition. https://arxiv.org/abs/1603.07057. CVPR 2016.
• Uses a single VGGNet with 19 layers
• Trains on both real and augmented data
• Uses the CASIA-WebFace collection and generates artificial data by introducing pose, shape, and expression variations
138. Overview - Recognition - Combined loss
Multiplicative angular margin: cos(m·θ)
Additive angular margin: cos(θ + m)
Additive cosine margin: cos(θ) - m
Combined loss: cos(m1·θ + m2) - m3 (a sketch follows below)
Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698, 2019.
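A sketch of how such margins modify the target-class logit, assuming an ArcFace-style setup where features and class weights are L2-normalized so logits equal cos θ (the scale s and margin defaults below are illustrative):

```python
import torch

def combined_margin_logits(cos_theta, labels, m1=1.0, m2=0.5, m3=0.0, s=64.0):
    """Apply cos(m1*theta + m2) - m3 to the target-class logit only, then scale by s.
    m1 > 1: SphereFace-style; m2 > 0: ArcFace-style; m3 > 0: CosFace-style."""
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    target = torch.cos(m1 * theta + m2) - m3
    one_hot = torch.zeros_like(cos_theta).scatter_(1, labels.view(-1, 1), 1.0)
    return s * torch.where(one_hot.bool(), target, cos_theta)
```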
139. Experiments - Combined loss
Test set | feature | softmax | sphereface | cosface | arcface | Combined loss
LFW | public | 98.75 | 99.52 | 99.50 | 99.55 | 99.60
7k | private | 93.60 | 95.45 | 95.90 | 96.72 | 97.13
50k | private | 93.28 | 95.93 | 95.50 | 97.08 | 96.90
zc | private | 99.18 | 99.37 | 99.45 | 99.57 | 99.52
avg | | 96.20 | 97.57 | 97.59 | 98.23 | 98.29

• 7k/50k: test sets extracted from the registered driver photo database; 3K positive pairs and 3K negative pairs are randomly selected from the 7k/50k drivers, respectively
• zc: test set randomly extracted from the premier driver photo database; 3K positive pairs and 3K negative pairs are randomly selected for testing
140. Experiments - Virtual learning
Virtual class learning drastically improves performance over the baseline softmax on both LFW and SLLFW, e.g., from 99.10% to 99.46% and from 94.59% to 95.85%, respectively.
Binghui Chen, Weihong Deng, Haifeng Shen. Virtual Class Enhanced Discriminative Embedding Learning. https://arxiv.org/abs/1811.12611. 2018
142. Experiments - Face detection
• The WIDER FACE dataset is a face detection benchmark collected from the publicly available WIDER dataset
• 32,203 images are chosen and 393,703 faces are labeled, with a high degree of variability in scale, pose, and occlusion, as depicted in the sample images
• Proposes the DFS method: uses semantically fused feature maps as contextual cues and constructs semantic segmentation for training supervision, to learn better representations
• Won 5 rank-1 results in April 2019
Widerface: http://shuoyang1213.me/WIDERFACE/index.html
Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, et al. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. https://arxiv.org/abs/1811.08557, 2018-2019.
144. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
145. Driving Scenarios vs. General Computer Vision
Data
• Multi-modal (i.e., multiple sensors including camera, LiDAR, GPS, IMU, etc.)
• Collected in 3D open areas (not indoor/lab environments)
• Ego-centric / first-person
Requirements
Opportunities
146. Vision Perception in Driving Scenario
What does Vision Perception do:
Detect, segment, track, and classify objects-of-interest in driving scenarios
Main components:
• Pedestrian
• Vehicle
• Road
• Traffic Sign / Light
149. Vision Perception – Pedestrian Detection
Pedestrian detection at 100 FPS
• Uses cascades
• Fast features
• Not a CNN-based model
Benenson et al. '12 "VeryFast"
100+ FPS detector. NO CNNs.
150. Vision Perception – Pedestrian Detection
Real-time pedestrian detection with CNNs
• Uses cascades
• Uses fast non-CNN features
• Uses CNNs for maximum accuracy with minimum speed sacrifice
Angelova et al. '15 "DeepCascades"
Real-time (15 FPS) with CNNs
151. Vision Perception – Pedestrian Detection
Occlusion-aware pedestrian detection
• Aggregation loss (enforces proposals to be close and located compactly)
• Occlusion-aware region of interest (PORoI) pooling (integrates prior structural information of the human body to handle occlusion)
• Based on Faster R-CNN
Zhang et al. '18 "OR-CNN"
State of the art (as of April 2019)
153. Vision Perception – Vehicle Detection
Vehicle detection in 3D from images
• Directly from 2D images
• Proposal generation as energy minimization
• Orientation estimation network
Chen et al. '16 "3D Bounding Box"
Breakthrough for 3D detection with a mono image
154. Vision Perception – Vehicle Detection
Multi-View 3D Object Detection
• Multi-sensor fusion
Chen et al. '17 "MV3D"
Impressive accuracy gain from multi-sensor fusion
155. Vision Perception – Vehicle Detection
Multi-level fusion based 3D object detection from mono images
• Simultaneously proposes a 2D RPN and predicts 3D location, orientation, and dimensions
Xu et al. '18 "Multi-level Fusion"
State of the art for 3D detection from mono camera images
156. Vision Perception – Road Segmentation
Joint semantic prediction
• KITTI Road Detection top performance, 2017
• Multi-task framework
• Real-time
• Uses RGB images only
Teichmann et al. '17 "MultiNet"
Speed + accuracy with RGB images only
157. Vision Perception – Road Segmentation
LIDAR-camera fusion
• KITTI Road Detection top performance, 2018
• Cross-fusion mechanism with an FCN
Caltagirone et al. '18 "LidCamNet"
LIDAR-camera fusion RULES
158. Vision Perception – Road Segmentation
LIDAR-camera fusion with LIDAR adaptation
• KITTI Road Detection current top performance
• Progressive LIDAR adaptation
Chen et al. '19 "PLARD"
State-of-the-art performance
160. Vision Perception – Traffic Sign Detection
IJCNN 2011 Traffic Sign Recognition Competition
• Ciresan et al. '11: 0.56% error
• Human: 1.16% error
• Non-CNN: 3.86% error
Ciresan et al. '11 "Traffic Sign Recognition"
Traffic sign recognition is EASY (super-human performance)
161. Vision Perception – Traffic Sign Detection
Detecting small signs from large images
• Break large images into small patches
• Small-Object-Sensitive CNN (SOS-CNN)
• Based on SSD
Meng et al. '17 "SOS-CNN"
Handles small objects
162. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
163. 3D Reconstruction in Driving Scenario
What does 3D Reconstruction do:
Recover the real-world location and pose of driving-scenario objects (2D to 3D)
Main Components
5 minutes of theoretical background (a little math)
169. 3D Reconstruction – Semantic Reconstruction
Kundu et al. '14 "Joint semantic and 3D reconstruction from monocular video"
Semantic + 3D reconstruction from a mono camera
170. 3D Reconstruction – Semantic Reconstruction
Cherabier et al. '16 "Multi-label semantic 3D reconstruction using voxel blocks"
Efficient dense semantic + 3D reconstruction
171. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
172. Driving Scenario Understanding
Honda Research Institute Driving Dataset
• 104 hours of real human driving records
• Driving behavior and causal reasoning annotations
Ramanishka et al. '18 "HDD"
First dataset towards driving scenario understanding
173. Driving Scenario Understanding
Driving attention prediction from video
• Focuses on the driver's attention
• In-car vs. in-lab tests
Xia et al. '18 "Predicting Driver Attention"
Introduces attention heat maps
175. GAIA Open Dataset
• Dataset: D²-City
• D²-City is a large-scale driving video dataset that provides more than 10k videos recorded in 720p HD or 1080p FHD from front-facing dashcams, with annotations for object detection and tracking.
• 1k videos: annotations of bounding boxes and tracking IDs of road objects in 12 categories
• 9k videos: bounding-box annotations in key frames