1. Haifeng SHEN
DiDi AI Labs
Zhengping CHE
DiDi AI Labs
Guangyu LI
DiDi AI Labs
Yuhong GUO
DiDi AI Labs
Carleton University
Jieping YE
DiDi AI Labs
Univ. of Michigan, Ann Arbor
17. LeNet
LeNet-5 (1998)
• A neural network architecture for handwritten and machine-printed character recognition in the 1990s
• Consists of seven layers, including
• Convolution operations
• Pooling operations
• Full connections
Yann LeCun, et al., Gradient-Based Learning Applied to Document Recognition, 1998
Bottom-right: https://engmrk.com/lenet-5-a-classic-cnn-architecture/
18. AlexNet
AlexNet (2012)
• ILSVRC 2012 winner (16.4% top-5 error)
• 60 million parameters and 650,000 neurons
• 8 learned layers: 5 convolutional and 3 fully-connected layers
• A 1000-way softmax layer after the last fully-connected layer
• Dropout and ReLU
• Trained in parallel on 2 GPUs
Alex Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
Bottom-right: Nitish Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014
19. VGGNet
• Six versions with 5 convolution groups and 11-19 layers
• VGG16 (138 million parameters) and VGG19
• Only 3x3 conv and 2x2 max-pooling layers before the FC layers
• Results @ ILSVRC 2014
• 1st in the localization task
• 2nd in the classification task (7.3% top-5 error)
VGGNet (2014)
Karen Simonyan, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
20. GoogLeNet
• ILSVRC 2014 winner (6.7% top-5 error)
• 22 layers with only 5 million model parameters
• Inception concept
• Multiple conv kernels: 1x1, 3x3, and 5x5
• 1x1 kernels for dimension reduction
• Better representational power + fewer network parameters
• More advanced Inception modules (V2, V3, and V4)
Inception-V1 Module
GoogLeNet (2014)
Christian Szegedy, et al., Going Deeper with Convolutions, 2015
21. ResNet
• 1st place in the ILSVRC 2015 classification task (3.6% top-5 error)
• Deeper model with fewer filters and lower complexity
• 34-layer baseline: 3.6 billion FLOPs, only 18% of VGG-19 (19.6 billion FLOPs)
• Up to 152 layers!
• Initialization, batch norm, residual block… (a sketch of the residual block follows below)
ResNet Block
ResNet (2015, top)
Kaiming He, et al., Deep Residual Learning for Image Recognition, 2016
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
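A minimal PyTorch sketch of the basic residual block (identity shortcut only; ResNet also uses projection shortcuts when the shape changes, which are omitted here):

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal 2-layer residual block: y = ReLU(F(x) + x), identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection adds the input back
```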
22. DenseNet
• L(L+1)/2 direct connections for an L-layer block (a connectivity sketch follows below)
• Fewer parameters and less computation
DenseNet Block
DenseNet (2016)
x_l = H_l([x_0, x_1, …, x_{l-1}])
Gao Huang, et al., Densely Connected Convolutional Networks, 2016
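A minimal PyTorch sketch of the dense connectivity x_l = H_l([x_0, …, x_{l-1}]), with H_l simplified to BN-ReLU-Conv (the paper's bottleneck and transition layers are omitted):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            c_in = in_channels + l * growth_rate  # grows with every layer
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c_in),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_in, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, ..., x_{l-1}]): concatenate all predecessors
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```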
23. SENet
• ILSVRC 2017 winner (2.251% top-5 error)
• Squeeze-and-excitation block (a sketch follows below)
• Squeeze: global average pooling
• Excitation: channel association
• Scale: channel attention
• Integration with modern architectures
Squeeze-and-Excitation Block
SENet (2017)
Jie Hu, et al., Squeeze-and-Excitation Networks, 2018
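A minimal PyTorch sketch of the squeeze / excitation / scale pipeline (reduction ratio 16 as in the paper; class and variable names are my own):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global avg pool) -> Excitation (two FCs) -> Scale (reweight channels)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # scale: channel attention applied to the input
```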
24. DLA: Deep Layer Aggregation
DLA (2018)
• Layer aggregation to better fuse information
• Iterative deep aggregation (IDA)
• Spatial fusion
• Resolutions and scales
• Hierarchical deep aggregation (HDA)
• Semantic fusion
• Channels and depths (modules)
Fisher Yu, et al., Deep Layer Aggregation, 2018
25. Classification Experiments
Classification Accuracy

Method | Car Brand Classification (66 classes) | Car Brand Classification (2,506 classes)
ResNet | 94.60% | -
SENet | 92.30% | -
DLA | 96.02% | 93.75%

• Dataset-1: 193,186 images of 66 classes, collected offline
• Dataset-2: 549,169 images of 2,506 classes, collected offline + online
• Similar settings to the Stanford Cars dataset
30. R-CNN: Regions with CNN Features
• Selective Search + CNN + SVM
• Started using CNN features instead of traditional hand-crafted features
• ~2k bottom-up region proposals from selective search
• Time-consuming: extracts features for every proposal separately
Ross Girshick, et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014
Bottom-Right: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf
R-CNN (2014)
31. Fast R-CNN
• One image + multiple RoIs processed in a single CNN forward pass
• RoI pooling: generates a fixed-size feature vector for each proposal
• Outputs: softmax probabilities + bounding-box regression offsets
• End-to-end training with a multi-task loss
Fast R-CNN (2015)
Right: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Ross Girshick, Fast R-CNN, 2015
32. Faster R-CNN
• Region proposal network (RPN) + Fast R-CNN
• RPN & detection network share full-image convolutional features
• Anchors with multiple scales and aspect ratios
Bottom-Left: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Shaoqing Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015
Faster R-CNN (2015)
Region Proposal Network
33. R-FCN: Region-based Fully Convolutional Networks
• Position-sensitive score map before RoI pooling
• 9 positions: {top, middle, bottom} x {left, center, right}
• Position-sensitive RoI pooling instead of standard RoI pooling
• Fully convolutional detection network instead of the fully-connected detection head in Faster R-CNN
Jifeng Dai, et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016
R-FCN (2016)
Position-Sensitive Score Map
34. Light-Head R-CNN
• Heavy head
• E.g., Faster R-CNN & R-FCN
• Intensive computation around RoI warping
• Light-Head R-CNN
• Thin feature maps from large separable convolution layers
• Cheap R-CNN subnet with a single FC layer
Zeming Li, et al., Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017
Light-Head R-CNN (2017)
"Heavy"-Head Detectors
Large Separable Convolution
35. FPN: Feature Pyramid Networks
• Bottom-up pathway
• Top-down pathway
• Lateral connections
Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, 2017
Different Feature Maps
FPN Block
• Feature pyramid: combination of
• Low-resolution, semantically strong features
• High-resolution, semantically weak features
36. Cascade R-CNN
• Multi-stage extension of R-CNN
• Trained sequentially using the output of the previous stage
• Cascaded bbox regression
• f(x, b) = f_T ∘ f_{T-1} ∘ ⋯ ∘ f_1(x, b)
• Cascaded detection
• A sequence of detectors trained with increasing IoU thresholds
Zhaowei Cai, et al., Cascade R-CNN: Delving into High Quality Object Detection, 2018
Cascade R-CNN
37. SNIP: Scale Normalization for Image Pyramids
• CNNs are not robust to changes in scale
• Multi-scale image pyramids for objects with different scales
• Detections from each scale are rescaled and combined using NMS (a generic NMS sketch follows below)
• Small objects from the high-resolution image
• Large objects from the low-resolution image
Bharat Singh, et al., An Analysis of Scale Invariance in Object Detection - SNIP, 2018
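For reference, a minimal greedy NMS sketch in Python (the generic algorithm the slide refers to, not SNIP-specific logic; the box format [x1, y1, x2, y2] and the threshold are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```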
38. YOLOv3 (2018)
YOLO: You Only Look Once
• End-to-end one-stage method
• Directly uses full images to predict each bounding box
• Extremely fast; runs in real time
• YOLOv2
• Darknet-19 backbone
• Anchor mechanism
• YOLOv3
• Multi-scale features
• Darknet-53 backbone
Joseph Redmon, et al., You Only Look Once: Unified, Real-Time Object Detection, 2016
Joseph Redmon, et al., YOLO9000: Better, Faster, Stronger, 2017
Joseph Redmon, et al., YOLOv3: An Incremental Improvement, 2018
Top-Left: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/
Bottom-Left: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b/
YOLO (2016)
39. SSD: Single Shot Detector
• Multiple feature maps with different resolutions and scales
• Improved speed/accuracy trade-off
Wei Liu, et al., SSD: Single Shot MultiBox Detector, 2016
SSD (2016)
YOLOv1
40. DSSD: Deconvolutional SSD
• Encoder-decoder hourglass structure
• Wide → Narrow → Wide
• Convolution and deconvolution modules
• Deconvolution: introduces additional large-scale context for object detection
• Two prediction modules
• Each with one residual block
Cheng-Yang Fu, et al., DSSD: Deconvolutional Single Shot Detector, 2017
SSD
DSSD (2017)
Selected Prediction Module
41. RetinaNet
• Focal loss instead of the standard cross-entropy loss
• Focuses training on a sparse set of hard examples
• FL(p_t) = -(1 - p_t)^γ log(p_t) (a sketch follows below)
Tsung-Yi Lin, et al., Focal Loss for Dense Object Detection, 2017
RetinaNet (2017)
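A direct transcription of the focal loss (binary form) in PyTorch; the paper additionally uses a class-balancing weight α, included here, and the numeric clamp is mine for stability:

```python
import torch

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probabilities in (0, 1); y: 0/1 targets of the same shape."""
    p_t = torch.where(y == 1, p, 1 - p)          # prob of the true class
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```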
42. RefineDet
Shifeng Zhang, et al., Single-Shot Refinement Neural Network for Object Detection, 2018
RefineDet (2018)
• Anchor refinement module
• Filters out easy negatives
• Coarsely adjusts anchors
• Object detection module
• Further improves regression
• Predicts multi-class labels
Transfer Connection Block
43. CornerNet
• Object as a pair of bounding box corners
• No need for anchor boxes
• Regression problem → corner prediction problem
• Corner pooling
• To better localize corners of the bounding box
Hei Law, et al., CornerNet: Detecting Objects as Paired Keypoints, 2018
CornerNet (2018)
Corner Pooling
44. MRFSWSnet:
• Multiple Receptive Field block (MRF): multiple receptive fields and more features for prediction
• Auxiliary Semantic Segmentation block (ASM): auxiliary semantic segmentation focusing on small objects
• Object Detection block (ODM): combines MRF and ASM with parallel training
• Loss function:
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Net
45. Experiments on MRFSWSnet
Method | Recall | Precision | F1 Score
Faster R-CNN | 97.57 | 96.47 | 97.01
RetinaNet | 97.80 | 97.80 | 97.80
Light-Head R-CNN | 97.71 | 95.13 | 96.40
YOLOv3 | 98.57 | 97.32 | 97.94
MRFSWSnet | 98.71 | 97.32 | 98.01

• Images collected by dash camera
• Detection of cellphone usage during driving
• 1,000 testing images
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
46. Challenges
• Depend on large amounts of labeled data, incurring expensive annotation costs
• Difficult to apply directly in new operation environments
• Computation-intensive; highly demanding of computational resources
• Complicated models are time/memory-consuming, which prevents usage in real-time operation systems (e.g., DMS)
47. Yuhong GUO DiDi AI Labs & Carleton University
Part II: Advanced Topics
50. Domain Adaptation/Transfer Learning
• Definition [Pan et al., IJCAI 2013]: the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks
S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI 2013.
Tan, Chuanqi, et al. "A Survey on Deep Transfer Learning", International Conference on Artificial Neural Networks, Springer, 2018.
51. Why Domain Adaptation
§ Successful application of ML in industry depends on learning from large amounts of labeled data
✗ Expensive and time-consuming to collect labels
✗ Difficult or dangerous to collect data in certain scenarios, e.g., autonomous driving
§ Domain Adaptation/Transfer Learning provides the essential ability of
✓ Reusing existing labeled resources
✓ Adapting to changing environments
✓ Learning from simulations
52. Transfer Learning vs Traditional ML
[Diagram] Transfer Learning/Domain Adaptation: train on domain/task A, test on a different domain/task B
[Diagram] Traditional ML ((semi-)supervised learning): train and test within the same domain/task
55. Adapting to New Domains
§ Reuse existing datasets, and hence their annotation information
• Object Recognition
• Object Detection
• Person Re-Identification
• Image Segmentation
• Image Classification …
56. Learning from Simulations
§ Gathering data and training models can be too expensive, too time-consuming, or too dangerous
§ Solution: create data and learn from simulations
• OpenAI's Universe will potentially allow us to train a self-driving car using GTA 5 or other video games.
• Training models on real robotics is too slow and expensive
http://ruder.io/transfer-learning/index.html
57. Common Datasets
§ Object recognition: Office-31, ImageCLEF-DA
§ Visual domain adaptation challenge dataset: VisDA-2017
§ Digits: MNIST, SVHN, USPS
§ Syn2Real dataset – a new dataset for object recognition [Peng et al., 2018]
60. Categories of DA Methods
Three main classes:
§ Reweighting/Instance-based methods
§ Feature-based/Representation learning methods
§ Parameter/Model-based methods
71. Theoretical Connection
§ A-distance: a measure of the distance between probability distributions
§ Bound on the target domain error (stated below)
Ben-David et al. "Analysis of Representations for Domain Adaptation", NIPS 2006.
Kifer et al. "Detecting Change in Data Streams", VLDB 2004.
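For reference, the bound has the following standard form (reproduced from Ben-David et al., not from the slide; λ denotes the error of the ideal joint hypothesis):

```latex
% Target error of hypothesis h, bounded by source error plus domain divergence:
\epsilon_T(h) \;\leq\; \epsilon_S(h) \;+\; d_{\mathcal{H}}\!\left(\mathcal{D}_S, \mathcal{D}_T\right) \;+\; \lambda
% \epsilon_S, \epsilon_T : source / target errors of h
% d_{\mathcal{H}}        : H-divergence between the domains (estimated via the A-distance)
% \lambda                : error of the ideal joint hypothesis on both domains
```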
73. Domain Adversarial Neural Network (DANN)
§ DANN: the adversarial objective is implemented via a gradient reversal layer (GRL); a minimal sketch follows below
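A minimal PyTorch sketch of a gradient reversal layer (identity in the forward pass, gradient scaled by -λ in the backward pass); an illustration, not the DANN authors' code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on backward,
    so the feature extractor is trained to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # no gradient for lambd

def grad_reverse(x, lambd=1.0):
    """Insert between the feature extractor and the domain classifier."""
    return GradReverse.apply(x, lambd)
```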
74. Model Sharing and Adversarial Adaptation
§ Adversarial Discriminative Domain Adaptation (ADDA): the source CNN is trained without sacrificing discriminativity
75. Reweighted Adversarial Adaptation [Chen et al., CVPR 18]
§ Re-weights the source domain label distribution to help reduce domain discrepancy and adapt the classifier
§ Reweighted adversarial loss (RAAN)
Chen, et al. "Re-Weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation", CVPR 2018.
76. Alternative Adversarial Terms
§ Maximum Classifier Discrepancy (MCD):
• Trains both classifiers and the generator to classify the source samples correctly
• Adversarial loss: prediction discrepancy between the classifiers on the target domain
K. Saito, et al. "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 2018.
81. Multi-Level Adversarial Adaptation
Object detection: DA-Faster-R-CNN
§ Adversarial loss via GRL at both the image level and the instance level
§ Consistency regularization between the two levels
Chen, et al. "Domain Adaptive Faster R-CNN for Object Detection in the Wild", CVPR 2018.
86. Cycle-Consistent Adversarial DA
§ Limitations of domain alignment techniques
§ CyCADA
Hoffman, et al. "CyCADA: Cycle-Consistent Adversarial Domain Adaptation", ICML 2018.
87. Cycle-Consistent Adversarial DA
Image-level GAN loss (green), feature-level GAN loss (orange), source and target semantic consistency losses (black), source cycle loss (red), and source task loss (purple).
Hoffman, et al. "CyCADA: Cycle-Consistent Adversarial Domain Adaptation", ICML 2018.
90. Pseudo-Label based Methods
Some positive applications in domain adaptation:
• Progressive domain adaptation for object detection
• For recognition: collaborative and adversarial networks
Zhang et al. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation", CVPR 2018.
Inoue et al. "Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation", CVPR 2018.
91. Summary
• Unsupervised domain adaptation has received a lot of attention
• Open domain learning remains challenging, but is starting to draw attention
• Most studies have focused on classification problems
• Much less effort has been made on more complex tasks such as object detection
93. Basics
Number of multiplications for one standard convolutional layer:
Input: D_F x D_F x M    Output: D_G x D_G x N
D_K: kernel size
M: number of input channels
N: number of output channels
D_G: output dimension
Multiplications: D_K x D_K x M x N x D_G x D_G
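As a quick sanity check of the formula above (sizes here are illustrative, chosen to match the Inception example a few slides later):

```python
def standard_conv_mults(d_k, m, n, d_g):
    """Multiplications in one standard conv layer:
    kernel area (D_K^2) x in/out channels (M x N) x output area (D_G^2)."""
    return d_k * d_k * m * n * d_g * d_g

# 3x3 kernel, 192 -> 256 channels, 28x28 output
print(standard_conv_mults(3, 192, 256, 28))  # 346,816,512 (= 442,368 weights x 784 positions)
```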
94. Basics
• Architecture design → lightweight models
• Use two 3x3 conv layers to replace one 5x5 conv layer: (3x3 + 3x3)/(5x5) = 18/25 ≈ 0.72
• Use two sequential 1xn and nx1 conv layers to replace an nxn conv layer: (1xn + nx1)/(nxn) = 2/n
97. Inception Module
Traditional 3x3 convolution block (previous layer → 3x3 convolution → output layer) vs. Inception module with dimension reduction (V1 block, from GoogLeNet)
Input: 28 x 28 x 192; Output: 28 x 28 x 256
#Model parameters:
• Traditional 3x3 conv: 3 x 3 x 192 x 256 = 442k
• Inception-V1 block: 1 x 1 x 192 x 64 + (1 x 1 x 192 x 96 + 3 x 3 x 96 x 128) + (1 x 1 x 192 x 16 + 5 x 5 x 16 x 32) + (0 (max-pooling) + 1 x 1 x 192 x 32) = 163k
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842, 2014.
98. Inception V1, V2, V3
• Use two 3x3 convs to replace a 5x5 conv
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842, 2014.
Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, http://arxiv.org/abs/1502.03167, 2015.
Szegedy et al. Rethinking the Inception Architecture for Computer Vision, http://arxiv.org/abs/1512.00567, 2015.
99. Xception
François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357, 2016-2017.
• Depthwise separable convolution
• (3 x 3 x 1 x M/7 x 112 x 112) x 7
100. SqueezeNet
Input: F x F x M
Squeeze:
• 1x1 convs; output: F x F x S (S < M)
Expand:
• 1x1 convs; output: F x F x e1
• 3x3 convs; output: F x F x e2
Concat: F x F x (e1 + e2)
Forrest N. Iandola, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360, 2016.
101. MobileNet V1:
• Standard convolution cost: D_K x D_K x M x N x D_F x D_F
• Depthwise separable conv:
(1) depthwise conv (1 filter per input channel): D_K x D_K x M x D_F x D_F
(2) pointwise conv (1x1 convs): M x N x D_F x D_F
• Computation reduction: 1/N + 1/D_K² (a code sketch of the block follows below)
Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. https://arxiv.org/abs/1704.04861, 2017.
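A minimal PyTorch sketch of the depthwise separable block just described; the BN/ReLU placement follows the common MobileNet-style pattern, and the helper name is my own:

```python
import torch.nn as nn

def depthwise_separable(m, n, stride=1):
    """Depthwise 3x3 conv (one filter per input channel, groups=m)
    followed by a 1x1 pointwise conv that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(m, m, 3, stride=stride, padding=1, groups=m, bias=False),  # depthwise
        nn.BatchNorm2d(m),
        nn.ReLU(inplace=True),
        nn.Conv2d(m, n, 1, bias=False),                                      # pointwise
        nn.BatchNorm2d(n),
        nn.ReLU(inplace=True),
    )
```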
103. MobileNet V1
• Uses conv with stride=2 to replace pooling
• Adds two hyper-parameters: width multiplier α and resolution multiplier ρ
• α = 1.0, 0.75, 0.5, 0.25; standard MobileNet when α = 1
104. MobileNet V2
Inverted residual block (MobileNetV2 vs. MobileNetV1): increase the number of channels (expand dimension), then reduce it
Linear bottlenecks: the nonlinear activation is removed in the low-dimensional space
Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. https://arxiv.org/abs/1801.04381, 2018.
105. ShuffleNet V1
• Pointwise group convolution (1x1 conv)
• Channel shuffle: helps information flow across feature channels (a sketch follows below)
• Uses a concat operation to concatenate channels
Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083, 2017.
#g: number of groups
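A minimal sketch of the channel shuffle operation (the standard reshape-transpose-flatten trick; an illustration, not the authors' code):

```python
import torch

def channel_shuffle(x, groups):
    """Permute channels so information flows across groups:
    (b, g*c, h, w) -> view as (b, g, c, h, w) -> transpose g and c -> flatten back."""
    b, channels, h, w = x.shape
    c = channels // groups
    return (x.view(b, groups, c, h, w)
             .transpose(1, 2).contiguous()
             .view(b, channels, h, w))
```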
106. ShuffleNet V1
Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
108. ShuffleNet V2
Reduce memory access cost:
• Channel split
• Remove group convolution
• Put the channel shuffle module after channel concatenation
Ningning Ma et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164, 2018.
109. Experiments - Classification

Model | mAP (%) | Precision (%) | Recall (%) | Size (MB) | Speed (ms/photo)
Server-based + Yolov2 | 99.62 | 99.60 | 99.65 | N/A | N/A
1.00x ShuffleNet V2 + Yolov2 | 96.43 | 97.16 | 96.83 | 5.20 | 80.00
0.50x ShuffleNet V2 + Yolov2 | 95.86 | 97.28 | 96.28 | 1.70 | 40.00
0.50x ShuffleNet V2 + SSD | 97.73 | 90.61 | 97.98 | 7.90 | 65.00
0.25x ShuffleNet V2 + SSD | 97.25 | 90.46 | 97.59 | 5.00 | 45.00
Category | Abbreviation
Front page of ID card | id_card_f
Back page of ID card | id_card_b
Front page of driver license | driver_license_f
Back page of driver license | driver_license_b
Front of main page in car license | car_license_f
Back of main page in car license | car_license_b
Supplementary page in car license | vehicle_license
Real car photo (whole car) | whole car
Real car photo (car plate) | plate
115. Application - Pay by smiling
• In Sep. 2017, Alibaba's Ant Financial affiliate and KFC China announced facial-recognition payment available for customers in the fast food restaurant chain's new KPRO store in Hangzhou.
• The "Smile to Pay" facial recognition payment solution at KFC enables customers to pay without their wallets.
https://www.jrzj.com/194328.html
116. Application - Check-in at station
Taiyuan South railway station, Beijing West railway station, Shanghai metro station
https://baijiahao.baidu.com/s?id=1552314447507461&wfr=spider&for=pc
http://www.sohu.com/a/220124437_99966914
http://dy.163.com/v2/article/detail/D5U3QH2P0525KG01.html
126. Overview - Detection & landmark dataset

Face detection dataset | Available | # Faces | # Images | Website | Remarks
FDDB | Public | 5,171 | 2,845 | http://vis-www.cs.umass.edu/fddb/ | Unconstrained faces
WiderFace | Public | 393,703 | 32,203 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace | Easy, Medium, Hard sets; a high degree of variability in scale, pose, and occlusion
MALF | Public | 11,931 | 5,250 | http://www.cbsr.ia.ac.cn/faceevaluation/ | Bounding boxes; Multi-Attribute Labelled Faces; pose and facial attributes
Caltech 10,000 Web Faces | Public | - | 10,524 | http://www.vision.caltech.edu/Image_Datasets/Caltech_10K_WebFaces/ | Collected from Google image search; 4 landmarks (two eyes, nose, and mouth)
PUT | Public | - | 9,971 | http://biometrics.put.poznan.pl/put-face-database/ | 30 landmarks, 194 contour points
AFLW | Public | 25,993 | - | https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | Collected from Flickr; 21 landmarks
127. Overview - Detection - MTCNN
Kaipeng Zhang et al. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. https://arxiv.org/abs/1604.02878v1.2016.
• Proposes a deep cascaded multi-task framework with three stages: P-Net, R-Net, and O-Net
• Each is a shallow network
• P-Net: proposal network; produces candidate windows quickly through a shallow CNN
• R-Net: refine network; rejects a large number of non-face windows through a more complex CNN
• O-Net: output network; uses a more powerful CNN to refine the result and output facial landmark positions
128. Overview - Detection - Face RFCN
Yitong Wang et al. Detecting Faces Using Region-based Fully Convolutional Networks. https://arxiv.org/abs/1709.05256. 2017.
• The framework is based on R-FCN
• Proposes a region-based face detector applying deep networks in a fully convolutional fashion
• Introduces additional smaller anchors and modifies the position-sensitive RoI pooling to a smaller size to suit the detection of tiny faces
• Proposes position-sensitive average pooling instead of normal average pooling for the final feature voting in R-FCN
• Uses a multi-scale training strategy and Online Hard Example Mining (OHEM)
129. Overview - Detection - PyramidBox
Xu Tang et al. PyramidBox: A Context-assisted Single Shot Face Detector. https://arxiv.org/abs/1803.07737?context=cs. 2018.
• Baidu proposes PyramidBox
• Extends the VGG16 backbone and generates feature maps at different levels
• Generates a series of anchors corresponding to larger regions around a face that contain more contextual information, such as head, shoulder, and body
130. Overview - Recognition - Dataset
Dataset | Available | # People | # Images | Website | Remarks
LFW | Public | 5K | 13K | http://vis-www.cs.umass.edu/lfw/#views | Labeled Faces in the Wild
YFD | Public | 1.5K | 3.4K (video) | https://www.cs.tau.ac.il/~wolf/ytfaces/ | YouTube Faces Database
CelebA (CelebFaces Attributes Dataset) | Public | 10K | 202K | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | Multimedia Lab, The Chinese University of Hong Kong
CASIA-WebFace | Public | 10K | 500K | http://www.cbsr.ia.ac.cn/english/CASIA-WebFace/CASIA-WebFace_Agreements.pdf |
MS-Celeb-1M | Public | 100K | 10M | https://www.msceleb.org |
VGGFace2 | Public | 9K | 3.3M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/ | Downloaded from Google Image Search; large variations in pose, age, illumination, ethnicity, and profession
Facebook | Private | 4K | 4,400K | N/A |
Google | Private | 8,000K | 100-200M | N/A |
132. Overview - Recognition - Results
Time | Method | Training size | Method description | LFW | Comments
1991 | Eigenfaces | <10k | Principal component analysis (PCA) | 60.02% |
2006 | LBP+CSML | <10k | Local binary pattern (LBP) + metric learning | 85.57% |
2013 | High-dim LBP | 0.1m | High-dim LBP + Joint Bayesian | 95.17% |
2014 | DeepFace | 4m | CNN + 3D face alignment | 97.35% | Facebook
2014 | DeepID | 0.2m | CNN + Softmax | 97.45% | CUHK
2015 | VGGFace | 2.6m | VGG + Softmax | 98.95% | Oxford
2015 | FaceNet | 200m | Inception + Triplet loss | 99.63% | Google
2015 | Ensemble face | 1.2m | CNN + Multi-patch + Deep metric | 99.77% | Baidu
2016 | Effective face | 2.5m | CNN + Augmentation | 98.06% | Pose + shape + expression
2017 | SphereFace | 0.5m | CNN + Angular-Softmax | 99.42% | Multiplicative angular margin: cos(mθ)
2018 | ArcFace | 6.8m | CNN + Additive angular margin | 99.83% | Additive angular margin: cos(θ + m)
2019 | Combined loss | N/A | cos(m1·θ + m2) - m3 | |
133. Overview - Recognition - DeepFace
Yaniv Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.
https://ieeexplore.ieee.org/document/6909616. CVPR 2014.
• CNN + DNN structure
• L4-L6 are locally connected layers without weight sharing, rather than standard convolutional layers
• The last two layers, F7 and F8, are fully connected
• Employs 3D face modeling to apply an affine transformation for 3D face alignment and obtain the frontal face
• More than 120 million parameters
• Trained on four million facial images belonging to more than 4,000 identities
134. Overview - Recognition - DeepID
Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Representation from Predicting 10,000 Classes. https://www.cv-
foundation.org/openaccess/content_cvpr_2014/papers/Sun_Deep_Learning_Face_2014_CVPR_paper.pdf. CVPR2014.
• Uses face patches; each patch is processed by its own ConvNet
• Each ConvNet has 4 layers
• 60 face patches over ten regions, three scales, and RGB or gray channels
• 60 ConvNets x two 160-dimensional vectors (patch and its flipped counterpart): a 19,200-dimensional vector in total for face verification
• Achieves 97.45% face verification accuracy on LFW
• Based on DeepID, the Chinese University of Hong Kong later provided DeepID2 and DeepID3
135. Overview - Recognition - FaceNet
Florian Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/abs/1503.03832. CVPR 2015.
• Google proposes this structure
• Directly uses a deep convolutional network
• Uses the triplet loss for training: minimizes the distance between an anchor and a positive, which share the same identity, and maximizes the distance between the anchor and a negative of a different identity (a sketch follows below)
• Uses the Euclidean distance to measure face similarity for verification
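A minimal sketch of the triplet loss described above (L2 normalization and margin follow the FaceNet formulation; the hard-triplet mining the paper relies on is omitted):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), averaged over the batch.
    Inputs are embedding batches of shape (B, D); they are L2-normalized first."""
    a, p, n = (F.normalize(t, dim=1) for t in (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_an = (a - n).pow(2).sum(dim=1)  # squared distance anchor-negative
    return F.relu(d_ap - d_an + margin).mean()
```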
136. Overview - Recognition - Ensemble Face
Jingtuo Liu et al. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. https://arxiv.org/pdf/1506.07310. 2015.
• Multi-patch feature extraction: 9 image patches, each centered at a different landmark on the face region
• Each patch: 9 convolution layers with a softmax layer at the end
• Concatenates the last convolution layer of each network to build a high-dimensional feature for the face representation
• A metric learning method with triplet loss is used for dimensionality reduction to 128/256 dimensions
• Achieves 99.77% accuracy on LFW under the 6,000-pair evaluation protocol
137. Overview - Recognition - Effective Face
Iacopo Masi et al. Do We Really Need to Collect Millions of Faces for Effective Face Recognition. https://arxiv.org/abs/1603.07057. CVPR 2016.
• Uses a single VGGNet with 19 layers
• Trains on both real and augmented data
• Uses the CASIA-WebFace collection and generates artificial data by introducing pose, shape, and expression variations
138. Overview - Recognition - Combined loss
Multiplicative angular margin: cos(m·θ)
Additive angular margin: cos(θ + m)
Additive cosine margin: cos(θ) - m
Combined loss: cos(m1·θ + m2) - m3 (a sketch follows below)
Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698, 2019.
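A sketch of how such margins modify the target-class logit, assuming an ArcFace-style setup where features and class weights are L2-normalized so logits equal cos θ (the scale s and margin defaults below are illustrative):

```python
import torch

def combined_margin_logits(cos_theta, labels, m1=1.0, m2=0.5, m3=0.0, s=64.0):
    """Apply cos(m1*theta + m2) - m3 to the target-class logit only, then scale by s.
    m1 > 1: SphereFace-style; m2 > 0: ArcFace-style; m3 > 0: CosFace-style."""
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    target = torch.cos(m1 * theta + m2) - m3
    one_hot = torch.zeros_like(cos_theta).scatter_(1, labels.view(-1, 1), 1.0)
    return s * torch.where(one_hot.bool(), target, cos_theta)
```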
139. Experiments - Combined loss
Test set | feature | softmax | sphereface | cosface | arcface | Combined loss
LFW | public | 98.75 | 99.52 | 99.50 | 99.55 | 99.60
7k | private | 93.60 | 95.45 | 95.90 | 96.72 | 97.13
50k | private | 93.28 | 95.93 | 95.50 | 97.08 | 96.90
zc | private | 99.18 | 99.37 | 99.45 | 99.57 | 99.52
avg | | 96.20 | 97.57 | 97.59 | 98.23 | 98.29

• 7k/50k: test sets extracted from the registered driver photo database; 3K positive pairs and 3K negative pairs are randomly selected from the 7k/50k drivers, respectively
• zc: test set randomly extracted from the premier driver photo database; 3K positive pairs and 3K negative pairs are randomly selected for testing
140. Experiments - Virtual learning
Virtual class learning drastically improves performance over the baseline softmax on both LFW and SLLFW, e.g., from 99.10% to 99.46% and from 94.59% to 95.85%, respectively.
Binghui Chen, Weihong Deng, Haifeng Shen. Virtual Class Enhanced Discriminative Embedding Learning. https://arxiv.org/abs/1811.12611. 2018
142. Experiments - Face detection
• The WIDER FACE dataset is a face detection benchmark collected from the publicly available WIDER dataset
• 32,203 images are chosen and 393,703 faces are labeled, with a high degree of variability in scale, pose, and occlusion, as depicted in the sample images
• Proposes the DFS method: uses semantically fused feature maps as contextual cues and constructs semantic segmentation for training supervision, to learn better representations
• Won 5 rank-1 results in April 2019
Widerface: http://shuoyang1213.me/WIDERFACE/index.html
Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, et al. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. https://arxiv.org/abs/1811.08557, 2018-2019.
144. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
145. Driving Scenarios vs. General Computer Vision
Data
• Multi-modal (i.e., multiple sensors including camera, LiDAR, GPS, IMU, etc.)
• Collected in 3D open areas (not indoor/lab environments)
• Ego-centric / first-person
Requirements
Opportunities
146. Vision Perception in Driving Scenario
What does Vision Perception do:
Detect, segment, track, and classify objects-of-interest in driving scenarios
Main components:
• Pedestrian
• Vehicle
• Road
• Traffic Sign / Light
149. Vision Perception – Pedestrian Detection
Pedestrian detection at 100 FPS
• Uses cascades
• Fast features
• Not a CNN-based model
Benenson et al. '12 "VeryFast"
100+ FPS detector. NO CNNs.
150. Vision Perception – Pedestrian Detection
Real-time pedestrian detection with CNNs
• Uses cascades
• Uses fast non-CNN features
• Uses CNNs for maximum accuracy with minimum speed sacrifice
Angelova et al. '15 "DeepCascades"
Real-time (15 FPS) with CNNs
151. Vision Perception – Pedestrian Detection
Occlusion-aware pedestrian detection
• Aggregation loss (enforces proposals to be close and located compactly)
• Occlusion-aware region of interest (PORoI) pooling (integrates prior structural information of the human body to handle occlusion)
• Based on Faster R-CNN
Zhang et al. '18 "OR-CNN"
State of the art (as of April 2019)
153. Vision Perception – Vehicle Detection
Vehicle detection in 3D from images
• Directly from 2D images
• Proposal generation as energy minimization
• Orientation estimation network
Chen et al. '16 "3D Bounding Box"
Breakthrough for 3D detection with a mono image
154. Vision Perception – Vehicle Detection
Multi-View 3D Object Detection
• Multi-sensor fusion
Chen et al. '17 "MV3D"
Impressive accuracy gain from multi-sensor fusion
155. Vision Perception – Vehicle Detection
Multi-level fusion based 3D object detection from mono images
• Simultaneously proposes a 2D RPN and predicts 3D location, orientation, and dimensions
Xu et al. '18 "Multi-level Fusion"
State of the art for 3D detection from mono camera images
156. Vision Perception – Road Segmentation
Joint semantic prediction
• KITTI Road Detection top performance, 2017
• Multi-task framework
• Real-time
• Uses RGB images only
Teichmann et al. '17 "MultiNet"
Speed + accuracy with RGB images only
157. Vision Perception – Road Segmentation
LIDAR-camera fusion
• KITTI Road Detection top performance, 2018
• Cross-fusion mechanism with an FCN
Caltagirone et al. '18 "LidCamNet"
LIDAR-camera fusion RULES
158. Vision Perception – Road Segmentation
LIDAR-camera fusion with LIDAR adaptation
• KITTI Road Detection current top performance
• Progressive LIDAR adaptation
Chen et al. '19 "PLARD"
State-of-the-art performance
160. Vision Perception – Traffic Sign Detection
IJCNN 2011 Traffic Sign Recognition Competition
• Ciresan et al. '11: 0.56% error
• Human: 1.16% error
• Non-CNN: 3.86% error
Ciresan et al. '11 "Traffic Sign Recognition"
Traffic sign recognition is EASY (super-human performance)
161. Vision Perception – Traffic Sign Detection
Detecting small signs from large images
• Break large images into small patches
• Small-Object-Sensitive CNN (SOS-CNN)
• Based on SSD
Meng et al. '17 "SOS-CNN"
Handles small objects
162. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
163. 3D Reconstruction in Driving Scenario
What does 3D Reconstruction do:
Recover the real-world location and pose of driving-scenario objects (2D to 3D)
Main Components
5 minutes of theoretical background (a little math)
169. 3D Reconstruction – Semantic Reconstruction
Kundu et al. '14 "Joint semantic and 3D reconstruction from monocular video"
Semantic + 3D reconstruction from a mono camera
170. 3D Reconstruction – Semantic Reconstruction
Cherabier et al. '16 "Multi-label semantic 3D reconstruction using voxel blocks"
Efficient dense semantic + 3D reconstruction
171. What can we learn from Driving Scenario?
• What is in a driving scenario?
• How far are they from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
172. Driving Scenario Understanding
Honda Research Institute Driving Dataset
• 104 hours of real human driving records
• Driving behavior and causal reasoning annotations
Ramanishka et al. '18 "HDD"
First dataset towards driving scenario understanding
173. Driving Scenario Understanding
Driving attention prediction from video
• Focuses on the driver's attention
• In-car vs. in-lab tests
Xia et al. '18 "Predicting Driver Attention"
Introduces attention heat maps
175. GAIA Open Dataset
• Dataset: D²-City
• D²-City is a large-scale driving video dataset that provides more than 10k videos recorded in 720p HD or 1080p FHD from front-facing dashcams, with annotations for object detection and tracking.
• 1k videos: annotations of bounding boxes and tracking IDs of road objects in 12 categories
• 9k videos: bounding-box annotations in key frames