Haifeng SHEN
DiDi AI Labs
Zhengping CHE
DiDi AI Labs
Guangyu LI
DiDi AI Labs
Yuhong GUO
DiDi AI Labs
Carleton University
Jieping YE
DiDi AI Labs
Univ. of Michigan, Ann Arbor
Part I: Introduction to Computer Vision
Zhengping CHE, DiDi AI Labs
• Computer Vision Basics
• Image Classification
• Object Detection
Introduction to Computer Vision
Computer Vision Basics
• Representation Learning
• Activation Functions
• Neural Network Structures
• Convolution Operators
• Pooling Layers
• Batch Normalization
Representation Learning
http://kaiminghe.com/cvpr17tutorial/cvpr2017_tutorial_kaiminghe.pdf
Neural Network Structures
Convolutional Neural Network
Deep Neural Network
Different Neural Networks
Top/Middle-left: http://cs231n.github.io/convolutional-networks/
Bottom-left: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Right: http://www.asimovinstitute.org/neural-network-zoo/
Recurrent Neural Network
Activation Functions
Top: https://theffork.com/activation-functions-in-neural-networks/
Bottom: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture04.pdf
Convolution Operators
-1 0 1
-2 0 2
-1 0 1
Vertical
-1 -2 -1
0 0 0
1 2 1
Horizontal
Sobel Operator
Laplacian Operator
0 -1 0
-1 4 -1
0 -1 0
-1 -1 -1
-1 8 -1
-1 -1 -1
Traditional Operators Convolution Operation
Right: http://cs231n.github.io/convolutional-networks/
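The Sobel kernels listed on this slide can be applied with a few lines of plain Python. This is a minimal sketch, assuming a tiny hand-made image and the cross-correlation form of "convolution" used in CNNs; the helper name `correlate2d` is hypothetical.

```python
# Sobel kernels exactly as listed on the slide.
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def correlate2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x4 image with a vertical edge between columns 1 and 2 (an assumption for illustration).
image = [[0, 0, 10, 10] for _ in range(4)]
vertical_response = correlate2d(image, SOBEL_VERTICAL)      # strong response on the edge
horizontal_response = correlate2d(image, SOBEL_HORIZONTAL)  # no horizontal edge, so zeros
```

The vertical kernel responds to the left-to-right intensity change, while the horizontal kernel stays at zero because every row of the toy image is identical.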
Convolution Operators (cont’d)
Left: Jifeng Dai, et al., Deformable Convolutional Networks, 2017
Right: https://towardsdatascience.com/review-drn-dilated-residual-networks-image-classification-semantic-segmentation-d527e1a8fb5/
Fisher Yu, et al., Multi-Scale Context Aggregation by Dilated Convolutions, 2016
Dilated Convolution
Standard Convolution
(dilation rate = 1)
Dilated Convolution
(dilation rate = 2)
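As a rough illustration of what the dilation rate changes, the sketch below computes the span covered by a dilated kernel. The function names are hypothetical; the formula is the usual k + (k - 1)(d - 1).

```python
def effective_kernel_size(kernel_size, dilation):
    """Span (in input pixels) covered by a kernel whose taps are `dilation` pixels apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def tap_offsets(kernel_size, dilation):
    """1-D offsets of the kernel taps relative to the first tap."""
    return [i * dilation for i in range(kernel_size)]


standard_span = effective_kernel_size(3, 1)  # dilation rate = 1
dilated_span = effective_kernel_size(3, 2)   # dilation rate = 2
dilated_taps = tap_offsets(3, 2)
```

A 3x3 kernel with dilation rate 2 still has nine weights but covers a 5x5 window, which is how dilated convolutions enlarge the receptive field without adding parameters.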
Deformable Convolution
Standard Convolution
Deformable Convolution
Deform. Conv. with Scaling
Deform. Conv. with Rotation
Pooling Layers
Top-left: http://deeplearning.stanford.edu/tutorial/supervised/Pooling/
Bottom-left: Matthew D. Zeiler, et al., Visualizing and Understanding Convolutional Networks, 2014
Right: http://fractalytics.io/rooftop-detection-with-keras-tensorflow/
Different Pooling Operations
Unpooling
Pooling
Pooling Layers (Cont’d)
Corner Pooling
Atrous Spatial Pyramid Pooling
Right: Hei Law, et al., Detecting Objects as Paired Keypoints, 2018
Top-left: Kaiming He, et al., Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, 2015
Bottom-left: Liang-Chieh Chen,et al., Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2017
Spatial Pyramid Pooling
Batch Normalization
Top-left: http://gradientscience.org/batchnorm/
Sergey Ioffe, et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
Bottom: Yuxin Wu, et al., Group Normalization, 2018
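FC Layer → Normalization → Scale & Shift → Activation
FC layer: $z = Wx + b$
Normalization: $\hat{z} = (z - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$
Scale & shift: $y = \gamma \hat{z} + \beta$
Activation: $a = f(y)$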
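The normalize-then-scale-and-shift step of batch normalization can be sketched for a single feature over a mini-batch as follows. This is a simplified training-time illustration, assuming scalar `gamma` and `beta` and ignoring the running statistics used at inference.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


# Toy mini-batch of four activations for one feature (values are illustrative).
activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
```

With `gamma=1` and `beta=0` the output has approximately zero mean and unit variance; learned values of `gamma` and `beta` let the network undo the normalization where that helps.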
Image Classification
• Datasets & Competitions
• Roadmap
• Classification Networks
• Experiments
Image Classification Datasets & Competitions
ImageNet, ILSVRC 2009-2017 ImageNet: http://www.image-net.org/
Second figure: https://principlesofdeeplearning.com/index.php/is-deep-learning-getting-too-deep/
Human
Datasets & Competitions (Cont’d)
MNIST: http://yann.lecun.com/exdb/mnist/
CIFAR-10 & CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
Dogs vs. Cats: https://www.kaggle.com/c/dogs-vs-cats
Stanford Cars: https://ai.stanford.edu/~jkrause/cars/car_dataset.html
iNaturalist Competition: https://sites.google.com/view/fgvc5/competitions/inaturalist
Plant Seedlings Classification: https://www.kaggle.com/c/plant-seedlings-classification
Image Classification Roadmap
1998: LeNet
2012: AlexNet
2014: VGGNet, GoogLeNet
2015: ResNet
2016: DenseNet
2017: SENet
2018: DLA
LeNet
LeNet-5 (1998)
• A neural network architecture for handwritten and machine-printed character recognition from the 1990s
• Consists of seven layers including
• Convolution operations
• Pooling operations
• Full connections
Yann LeCun, et al., Gradient-Based Learning Applied to Document Recognition, 1998
Bottom-right: https://engmrk.com/lenet-5-a-classic-cnn-architecture/
AlexNet
AlexNet (2012)
• ILSVRC 2012 winner (16.4% top-5 error)
• 60 million parameters and 650,000 neurons
• 8 learned layers: 5 convolutional and 3 fully-connected layers
• A 1000-way softmax layer after the last fully-connected layer
• Dropout and ReLU
• Trained in parallel on 2 GPUs
Alex Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
Bottom-right: Nitish Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014
VGGNet
• Six configurations with 11 to 19 weight layers, each organized into 5 groups of convolutions
• VGG16 (138 million parameters) and VGG19
• Only 3x3 conv and 2x2 max-pooling layers before FC layers
• Results @ ILSVRC 2014
• 1st in localization task
• 2nd in classification task (7.3% top-5 error)
VGGNet (2014)
Karen Simonyan, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
GoogLeNet
• ILSVRC 2014 winner (6.7% top-5 error)
• 22 layers with only 5 million model parameters
• Inception concept
• Multiple conv kernels including 1x1, 3x3, and 5x5
• 1x1 kernel for dimension reduction
• Better representational power + fewer network parameters
• More advanced Inception modules (V2, V3, and V4) Inception-V1 Module
GoogLeNet (2014)
Christian Szegedy, et al., Going Deeper with Convolutions, 2015
ResNet
• 1st place on the ILSVRC 2015 classification task (3.6% top-5 error)
• Deeper model with fewer filters and lower complexity
• 34-layer baseline
• 3.6 billion FLOPs
• only 18% of VGG-19 (19.6 billion FLOPs)
• Up to 152 layers!
• Initialization, batchnorm, residual block…
ResNet Block
ResNet (2015, top)
Kaiming He, et al., Deep Residual Learning for Image Recognition, 2016
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
DenseNet
• $L(L+1)/2$ direct connections for $L$ layers
• Fewer parameters and less computation
DenseNet Block
DenseNet (2016)
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
Gao Huang, et al., Densely Connected Convolutional Networks, 2016
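The connection count for a dense block can be checked by enumeration. The sketch below also shows how the number of input channels grows per layer; the growth rate `k` and initial channel count `k0` follow the DenseNet paper's notation and are not on the slide, and the numeric values are illustrative.

```python
def dense_connections(num_layers):
    """Layer l (1-based) receives the block input plus the outputs of layers 1..l-1."""
    return sum(layer for layer in range(1, num_layers + 1))


def input_channels(layer, k0, growth_rate):
    """Channels seen by `layer` when every earlier layer contributes `growth_rate` maps."""
    return k0 + growth_rate * (layer - 1)


connections_in_5_layers = dense_connections(5)
channels_at_layer_4 = input_channels(4, k0=64, growth_rate=32)
```

Enumerating the inputs of each layer gives the same total as the closed form $L(L+1)/2$.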
SENet
• ILSVRC 2017 winner (2.251% top-5 error)
• Squeeze-and-excitation block
• Squeeze: Global average pooling
• Excitation: Channel association
• Scale: Channel attention
• Integration with modern architectures
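The squeeze, excitation, and scale steps can be sketched in plain Python. This is a toy illustration, assuming two channels, a reduction to a single hidden unit, hand-picked weights, and no biases; the function name `se_block` is hypothetical.

```python
import math


def se_block(feature_maps, w1, w2):
    """feature_maps: list of C channels, each an H x W list of lists.
    w1: (C/r) x C weights, w2: C x (C/r) weights."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: FC + ReLU, then FC + sigmoid, producing one gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]


feature_maps = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, -1.0]]    # illustrative weights, 2 channels reduced to 1 hidden unit
w2 = [[1.0], [-1.0]]  # illustrative weights, 1 hidden unit expanded to 2 gates
scaled = se_block(feature_maps, w1, w2)
```

Because the gates depend only on per-channel averages, the block adds few parameters and can be attached to the output of an existing convolutional unit.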
Squeeze-and-Excitation Block
SENet (2017)
Jie Hu, et al., Squeeze-and-Excitation Networks, 2018
DLA: Deep Layer Aggregation
DLA (2018)
• Layer aggregation to better fuse information
• Iterative deep aggregation (IDA)
• Semantic fusion
• Resolutions and scales
• Hierarchical deep aggregation (HDA)
• Spatial fusion
• Channels and depths (modules)
Fisher Yu, et al., Deep Layer Aggregation, 2018
Classification Experiments
Classification Accuracy
| Method | Car brand classification, 66 classes | Car brand classification, 2506 classes |
|---|---|---|
| ResNet | 94.60% | - |
| SENet | 92.30% | - |
| DLA | 96.02% | 93.75% |
• Dataset-1
• 193186 images of 66 classes
• Collected offline
• Dataset-2
• 549169 images of 2506
classes
• Collected offline + online
• Similar settings to the Stanford
Cars dataset
Object Detection
• Introduction & Roadmap
• Region-Based Methods
• Region-Free Methods
• Experiments
Object Detection Introduction
Top-Left: http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf
Top-Right: https://www.hackerearth.com/blog/developers/object-detection-for-self-driving-cars/
MS COCO: http://cocodataset.org/#home
Open Images: https://storage.googleapis.com/openimages/web/index.html
Pascal VOC: http://host.robots.ox.ac.uk/pascal/VOC/
ImageNet: http://www.image-net.org/
Object Detection Roadmap
Timeline: … 2014 2015 2016 2017 2018
Region-Based Detection: R-CNN, SPPNet, Fast R-CNN, Faster R-CNN, R-FCN, FPN, Light-Head R-CNN, Cascade R-CNN, SNIP, SNIPER
Region-Free Detection: YOLOv1, SSD, YOLOv2, DSSD, RetinaNet, RefineDet, CornerNet, YOLOv3
Left: Zhengxia Zou, et al., Object Detection in 20 Years: A Survey, 2019
Region-Based / Region-Free Methods
• Region-based detection
• Two-stage method
• Higher accuracy
• Lower speed
• Complex computation
• R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, FPN, Cascade R-CNN, SNIP, SNIPER…
• Region-free detection
• One-stage method
• Lower accuracy
• Higher speed
• Light computation
• YOLO, SSD, DSSD, RetinaNet, RefineDet, CornerNet…
Jonathan Huang, et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017
R-CNN: Regions with CNN Features
• Selective Search + CNN + SVM
• Starts using CNN features instead of traditional features
• ~2k bottom-up region proposals from selective search
• Time-consuming
• Features are extracted for every proposal separately
Ross Girshick, et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014
Bottom-Right: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf
R-CNN (2014)
Fast R-CNN
• One image + multiple RoIs + a fully convolutional network
• RoI pooling: to generate fixed-size feature vector for each proposal
• Outputs: softmax probabilities + bounding-box regression offsets
• End-to-end training with a multi-task loss
Fast R-CNN (2015)
Right: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Ross Girshick, Fast R-CNN, 2015
Faster R-CNN
• Region proposal network (RPN) + Fast R-CNN
• RPN & detection network share full-image convolutional features
• Anchors with multiple scales and aspect ratios
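Anchor generation over scales and aspect ratios can be sketched as below. The defaults (a stride of 16, scales 8/16/32, ratios 0.5/1/2, giving 9 anchors per location) mirror common Faster R-CNN settings but are assumptions here, and `generate_anchors` is a hypothetical helper rather than any library's API.

```python
import math


def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per (scale, ratio); ratio = height / width."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2  # every ratio at this scale keeps the same area
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((width, height))
    return anchors


anchors = generate_anchors()
```

Each sliding-window position of the RPN is then scored against all of these reference boxes, centred at that position.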
Bottom-Left: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Shaoqing Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015
Faster R-CNN (2015) Region Proposal Network
R-FCN: Region-based Fully Convolutional Networks
• Position-sensitive score map before RoI pooling
• 9 positions: top/middle/bottom-left/center/right
• Position-sensitive RoI pooling instead of standard RoI pooling
• Fully convolutional detection network instead of the fully-connected detection network in Faster R-CNN
Jifeng Dai, et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016
R-FCN (2016)
Position-Sensitive Score Map
Light-Head R-CNN
• Heavy head
• E.g., Faster R-CNN & R-FCN
• Intensive computations around RoI warping
• Light-Head R-CNN
• Thin feature maps from large separable convolution layers
• Cheap R-CNN subnet with 1 FC-layer
Zeming Li, et al., Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017
Light-Head R-CNN (2017)
‘Heavy’-Head Detectors
Large Separable Convolution
FPN: Feature Pyramid Networks
• Bottom-up pathway
• Top-down pathway
• Lateral connection
Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, 2017
Different Feature Maps FPN Block
• Feature pyramid: Combination of
• Low-resolution, semantically strong features
• High-resolution, semantically weak features
Cascade R-CNN
• Multi-stage extension of R-CNN
• Trained sequentially using output of
the previous stage
• Cascaded bbox regression
• $f(x, b) = f_T \circ f_{T-1} \circ \cdots \circ f_1(x, b)$
• Cascaded detection
• A sequence of detectors trained with
increasing IoU thresholds
Zhaowei Cai, et al., Cascade R-CNN: Delving into High Quality Object Detection, 2018
Cascade R-CNN
SNIP: Scale Normalization for Image Pyramids
• CNNs are not robust to changes in scale
• Multi-scale image pyramids for objects
with different scales
• Detections from each scale are rescaled
and combined using NMS
• Small objects from high-resolution image
• Large objects from low-resolution image
Bharat Singh, Scale Invariance in Object Detection - SNIP, 2018
YOLOv3 (2018)
YOLO: You Only Look Once
• End-to-end one-stage method
• Directly use full images to predict each bounding box
• Extremely fast, runs at real-time speed
• YOLOv2
• Darknet19 backbone
• Anchor mechanism
• YOLOv3
• Multi-scale features
• Darknet53 backbone
Joseph Redmon, et al., You Only Look Once: Unified, Real-Time Object Detection, 2016
Joseph Redmon, et al., YOLO9000: Better, Faster, Stronger, 2017
Joseph Redmon, et al., YOLOv3: An Incremental Improvement, 2018
Top-Left: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/
Bottom-Left: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b/
YOLO (2016)
SSD: Single Shot Detector
• Multiple feature maps with different resolutions and scales
• Improved speed/accuracy trade-off
Wei Liu, et al., SSD: Single Shot MultiBox Detector, 2016
SSD (2016)
YOLOv1
DSSD: Deconvolutional SSD
• Encoder-decoder Hourglass structure
• Wide – Narrow – Wide
• Convolution and deconvolution modules
• Deconvolution: To introduce additional
large-scale context for object detection
• Two prediction modules
• Each with one residual block
Cheng-Yang Fu, et al., DSSD: Deconvolutional Single Shot Detector, 2017
SSD
DSSD (2017)
Selected Prediction Module
RetinaNet
• Focal loss instead of cross entropy function
• Focuses training on a sparse set of hard examples
• $FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$
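The effect of the focal loss modulating factor can be checked numerically. The sketch below implements the formula without the optional class-balancing weight; `gamma=2.0` is an assumed default and `p_t` denotes the model's probability for the true class.

```python
import math


def cross_entropy(p_t):
    """Standard cross entropy for the true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross entropy down-weighted by (1 - p_t)^gamma, so easy examples contribute less."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # well-classified example
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # badly classified example
```

With `gamma=0` the focal loss reduces to cross entropy; with larger `gamma` the loss of a confidently correct example shrinks much more than that of a hard one.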
Tsung-Yi Lin, Focal Loss for Dense Object Detection, 2017
RetinaNet (2017)
RefineDet
Shifeng Zhang, et al., Single-Shot Refinement Neural Network for Object Detection, 2018
RefineDet (2018)
• Anchor refinement module
• Filtering out easy negatives
• Coarsely adjusting anchors
• Object detection module
• Further improving regression
• Predicting multi-class labels
Transfer Connection Block
CornerNet
• Object as a pair of bounding box corners
• No need for anchor boxes
• Regression problem
→ Corner prediction problem
• Corner pooling
• To better localize corners of bounding box
Hei Law, et al, CornerNet: Detecting Objects as Paired Keypoints, 2018
CornerNet (2018)
Corner Pooling
• Multiple Receptive Field block (MRF): Multiple receptive field and more features for prediction
• Auxiliary Semantic Segmentation block (ASM): Auxiliary semantic segmentation focusing on small objects
• Object Detection block (ODM): Combining MRF and ASM with parallel training
• Loss function:
MRFSWSnet:
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
Multiple Receptive Field Small-Object-Focusing
Weakly-Supervised Segmentation Net
Experiments on MRFSWSnet
Method Recall Precision F1 Score
Faster R-CNN 97.57 96.47 97.01
RetinaNet 97.80 97.80 97.80
Light-Head R-CNN 97.71 95.13 96.40
YOLOv3 98.57 97.32 97.94
MRFSWSnet 98.71 97.32 98.01
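The F1 column can be reproduced from the recall and precision columns, since F1 is their harmonic mean. The sketch below recomputes it from the values in the table; small differences in the last digit come from rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (recall, precision, reported F1) copied from the table above.
table = {
    "Faster R-CNN": (97.57, 96.47, 97.01),
    "RetinaNet": (97.80, 97.80, 97.80),
    "Light-Head R-CNN": (97.71, 95.13, 96.40),
    "YOLOv3": (98.57, 97.32, 97.94),
    "MRFSWSnet": (98.71, 97.32, 98.01),
}

recomputed = {name: f1_score(p, r) for name, (r, p, _) in table.items()}
```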
• Images collected by dash camera
• Detection on cellphone usage during driving
• 1000 testing images
Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
• Depend on large amounts of labeled data, which incurs expensive annotation costs
• Difficult to apply directly in new operating environments
• Computation-intensive and highly demanding of computational resources
• Complicated models are time- and memory-consuming, which prevents their use in real-time operation systems (e.g., DMS)
Challenge
Yuhong GUO DiDi AI Labs & Carleton University
Part II: Advanced Topics
• Domain Adaptation
• Lightweight Models
Topics
Domain Adaptation
• Definition [Pan et al., IJCAI13 ]:
Ability of a system to recognize and apply knowledge and skills learned in
previous domains/tasks to novel domains/tasks
Domain Adaptation/Transfer Learning
S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI 2013.
Tan, Chuanqi, et al. "A survey on deep transfer learning." International Conference on Artificial Neural Networks. Springer, Cham, 2018.
§ Successful application of ML in industry depends on learning from large amounts of labeled data
Ø Labels are expensive and time-consuming to collect
Ø Data is difficult or dangerous to collect in certain scenarios, e.g., autonomous driving
§ Domain Adaptation/Transfer Learning provides the essential ability of
✓ Reusing existing labeled resources
✓ Adapting to changing environments
✓ Learning from simulations
Why Domain Adaptation
Transfer Learning vs Traditional ML
Transfer Learning/Domain Adaptation
Training
domain/task A
Test
domain/task B
Traditional ML
(Semi-)Supervised Learning
Training
domain/task A
Test
domain/task B
Motivation Examples
Different feature distributions
Different label spaces
Applications in Computer Vision
Adapting to New Domains
§ Reuse existing datasets, and hence their annotations
ØObject Recognition
ØObject Detection
ØPerson Re-Identification
ØImage Segmentation
ØImage Classification … ...
Learning from Simulations
§ Gathering data and training models is either too expensive, too time-consuming, or too dangerous
§ Solution: create data and learn from simulations
OpenAI's Universe will potentially allow us to train a
self-driving car using GTA 5 or other video games.
Training models on real robotics
is too slow and expensive
http://ruder.io/transfer-learning/index.html
Common Datasets
§ Object recognition:
Ø Office-31
Ø ImageCLEF-DA
§ Visual domain adaptation challenge dataset: VisDA-2017
§ Digits: MNIST, SVHN, USPS
§ Syn2Real dataset: a new dataset for object recognition [Peng et al., 2018]
Common Datasets
§ Semantic segmentation/object detection
Domain Adaptation Methods
Three main classes:
§ Reweighting/Instance-based Methods
§ Feature-based/Representation Learning Methods
§ Parameter/Model-based Methods
Categories of DA Methods
Start with Instance Reweighting
§ Context
§ Idea
§ $h(\cdot)$: prediction function; $x$: input; $y$: output
§ Expected risk in target domain:
Simple Math Analysis
§ Assume shared conditional distribution
§ To minimize target risk, source instance can be reweighted:
Covariate Shift
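The reweighting step under a shared conditional distribution can be written out as follows. This is the standard covariate-shift identity rather than a transcription of the slide; the loss symbol $\ell$ and the weight $w(x)$ are notation assumed here, with $P_S$ and $P_T$ denoting the source and target distributions.

```latex
\begin{aligned}
R_T(h) &= \mathbb{E}_{(x,y)\sim P_T}\big[\ell(h(x), y)\big] \\
       &= \mathbb{E}_{(x,y)\sim P_S}\!\left[\frac{P_T(x,y)}{P_S(x,y)}\,\ell(h(x), y)\right] \\
       &= \mathbb{E}_{(x,y)\sim P_S}\!\left[\frac{P_T(x)}{P_S(x)}\,\ell(h(x), y)\right]
          \quad \text{since } P_T(y \mid x) = P_S(y \mid x), \\
w(x)   &= \frac{P_T(x)}{P_S(x)}.
\end{aligned}
```

The second line requires $P_S(x, y) > 0$ wherever $P_T(x, y) > 0$, which is why a support assumption is needed.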
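§ Assume shared conditional distribution
§ In addition, note
Ø $P_S(y \mid x) = P_T(y \mid x)$
Ø $P_S(x) \neq P_T(x)$
§ Assumption of support:
Ø Reweighting breaks down if $\exists x$ with $P_T(x) > 0$ but $P_S(x) = 0$
Ø The support of $P_T$ should be contained in the support of $P_S$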
Assumptions
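§ Density ratio estimation
§ Direct weight estimation
Weight Estimation
$w = P_T / P_S \propto P(d = T \mid x) / P(d = S \mid x)$
$P(d = T \mid x)$: probability that $x$ comes from the target domain
$P(d = S \mid x)$: probability that $x$ comes from the source domain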
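One common way to obtain such weights is to train a classifier that separates source from target samples and convert its output into a ratio. The sketch below does this for 1-D Gaussian toy data with a hand-written logistic regression; the data, learning rate, and function names are all assumptions for illustration.

```python
import math
import random


def train_domain_classifier(source, target, lr=0.1, epochs=2000):
    """Fit P(d = target | x) = sigmoid(a * x + b) by batch gradient descent."""
    data = [(x, 0.0) for x in source] + [(x, 1.0) for x in target]
    a, b = 0.0, 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, d in data:
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - d) * x
            grad_b += p - d
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b


def importance_weight(x, a, b, n_source, n_target):
    """w(x) = (n_S / n_T) * P(d = target | x) / P(d = source | x)."""
    p_target = 1.0 / (1.0 + math.exp(-(a * x + b)))
    return (n_source / n_target) * p_target / (1.0 - p_target)


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(200)]  # toy source samples
target = [random.gauss(1.0, 1.0) for _ in range(200)]  # toy target samples, shifted mean
a, b = train_domain_classifier(source, target)
w_low = importance_weight(-1.0, a, b, len(source), len(target))
w_high = importance_weight(2.0, a, b, len(source), len(target))
```

Source points that look like target data (here, larger `x`) receive weights above 1, and points far from the target distribution are down-weighted.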
§ Maximum Mean Discrepancy (MMD)
• $\mathcal{F}$: function class in an RKHS $\mathcal{H}$ on input space $\mathcal{X}$
Learning Weights Directly: MMD
[Gretton et al. 2012]
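A sample-based estimate of MMD can be computed directly from kernel evaluations. The sketch below uses the biased estimator of squared MMD with a Gaussian (RBF) kernel on 1-D samples; the kernel choice, bandwidth, and toy data are assumptions.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


source_samples = [0.0, 0.1, -0.1]
target_samples = [3.0, 3.1, 2.9]
same_domain = mmd_squared(source_samples, source_samples)
cross_domain = mmd_squared(source_samples, target_samples)
```

The estimate is zero when both samples coincide and grows as the two sample sets move apart, which makes it usable as a discrepancy term to minimize.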
§ MMD for domain adaptation
Learning Weights Directly: MMD
§ Extend MMD to learn a representation function $\phi(x)$
Extend to Representation Learning
[Long et al., CVPR 13]
§ Representation learning methods have greater capacity for bridging domain discrepancy
§ Widely applied in transfer learning for computer vision tasks
§ Recent developments in representation-learning-based domain adaptation
Recent Feature-based Methods
§ Main idea:
Ø $\min_F \max_D \mathcal{L}_{adv}(F, D) = \mathbb{E}_{x \sim \mathcal{D}_S} \log D(F(x)) + \mathbb{E}_{x \sim \mathcal{D}_T} \log(1 - D(F(x)))$
Ø Goal: $p_S(F(x)) = p_T(F(x))$
Adversarial Loss-based Adaptation Framework
Goodfellow et al., 2014
§ A-distance: a measure of distance between probability distributions
§ Bound on target domain error
Theoretical Connection
Ben-David et al. "Analysis of Representations for Domain Adaptation”, NIPS 06
Kifer et al. Detecting change in data streams. In Very Large Databases (VLDB), 2004.
§ Main idea:
Ø $\min_{F,C} \max_{D} \mathcal{L} = \mathcal{L}_{adv}(F, D) + \lambda \mathcal{L}_{cls}$
Adversarial Loss-based Adaptation Framework
§ DANN: adversarial training is implemented via a gradient reversal layer (GRL)
Domain Adversarial Neural Network (DANN)
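The behaviour of a gradient reversal layer can be sketched without any deep learning framework: it is the identity in the forward pass and flips the sign of the gradient in the backward pass. The class below is a hypothetical, framework-free illustration; `lam` is an assumed trade-off coefficient.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies incoming gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated (and scaled) domain-classifier gradient.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
passed_through = grl.forward(features)
reversed_grad = grl.backward([1.0, -2.0, 0.0])
```

Placing this layer between the feature extractor and the domain classifier lets a single backward pass train the classifier to separate domains while pushing the features to confuse it.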
§ Adversarial Discriminative Domain Adaptation (ADDA)
The source CNN is trained without sacrificing any discriminability
Model Sharing and Adversarial Adaptation
§ Re-weight source domain label
distribution to help reduce domain
discrepancy and adapt classifier
§ Reweighted adversarial loss (RAAN)
Reweighted Adversarial Adaptation [Chen et al, CVPR 18]
§ Maximum Classifier Discrepancy (MCD):
§ Adversarial loss:
Target domain
prediction discrepancy
Alternative Adversarial Terms
K. Saito, et al. "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 18
Train both classifiers and generator to
classify the source samples correctly
Conditional Adversarial Domain Adaptation
§ Conditional Domain Adversarial Networks (CDANs) [NeurIPS 18]
DA Recognition Results
Question Raised: Transferability vs Discriminability
Batch Spectral Penalization (BSP)
Object detection DA-Faster-R-CNN
§ Adversarial loss via GRL at both image level and instance level
§ Consistency regularization between the two levels
Multi-Level Adversarial Adaptation
Chen, et al., CVPR 18
Object detection: Strong-Weak
Multi-Level Adversarial Alignment
Saito, et al., CVPR 19
Object detection
Multi-Level Adversarial Alignment
Saito, et al., CVPR 19
Object detection
DA Detection Results
§ Main idea:
Generative Model based Methods
§ Limitation of domain alignment techniques:
§ CyCADA:
Cycle-Consistent Adversarial DA
et al., ICML 18
Cycle-Consistent Adversarial DA
et al., ICML 18
image-level GAN loss (green), the feature level GAN loss (orange), the source and target semantic
consistency losses (black), the source cycle loss (red), and the source task loss (purple).
§ SBDA-GAN:
Symmetric Bi-Directional Adaptive GAN
et al., CVPR 18
DA Recognition Results
Pseudo-Label based Methods
Some positive applications in domain adaptation:
Ø Progressive domain adaptation for object detection
Ø For recognition:
Zhang et al. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation", CVPR 18
Inoue et al. "Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation", CVPR 18
• Unsupervised domain adaptation has received a lot of attention
• Open domain learning remains challenging but is starting to draw attention
• Most studies have focused on classification problems
• Much less effort has been devoted to more complex tasks such as object detection
Summary
Lightweight Models
Basics
Number of multiplications for one standard convolutional layer:
Input: $D_F \times D_F \times M$
Output: $D_G \times D_G \times N$
$D_K$: kernel size
M: number of input channels
N: number of output channels
$D_G$: output dimension
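With these definitions, a standard convolution needs kernel_size² × M × N multiplications per output position, over output_dim² positions. The sketch below assumes a square input and kernel; the function names, the stride/padding parameters, and the example sizes are illustrative.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of the output feature map for a square input and kernel."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_multiplications(kernel_size, in_channels, out_channels, output_dim):
    """kernel_size^2 * M * N multiplications at each of the output_dim^2 positions."""
    return kernel_size * kernel_size * in_channels * out_channels * output_dim * output_dim


# Example: 3x3 convolution, 192 -> 256 channels, on a 28x28 map with padding 1.
out_dim = output_size(28, 3, stride=1, padding=1)
mults = conv_multiplications(3, 192, 256, out_dim)
```

The count grows with the product of input and output channels, which is the term that lightweight architectures try to break apart.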
Basics
• Architecture design: lightweight models
Ø Use two 3 x 3 conv layers to replace one 5 x 5 conv layer:
(3x3 + 3x3)/(5x5)
Ø Use sequential 1 x n and n x 1 conv layers to replace an n x n conv layer:
(1 x n + n x 1)/(n x n)
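The two ratios above can be evaluated directly. The sketch counts kernel weights per input/output channel pair and ignores biases and any change in intermediate channel counts, which is a simplifying assumption.

```python
def stacked_3x3_vs_5x5():
    """Weights of two stacked 3x3 kernels relative to one 5x5 kernel."""
    return (3 * 3 + 3 * 3) / (5 * 5)


def asymmetric_vs_square(n):
    """Weights of a 1xn kernel followed by an nx1 kernel, relative to one nxn kernel."""
    return (1 * n + n * 1) / (n * n)


ratio_5x5 = stacked_3x3_vs_5x5()
ratio_7x7 = asymmetric_vs_square(7)
```

Two 3x3 layers cover the same 5x5 receptive field with 72% of the weights, and the 1xn/nx1 factorization costs only 2/n of the square kernel.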
Basics
• Architecture design– lightweight models
Ø pointwise convolution: use 1x1 conv layer (to reduce dimension)
Ø Depthwise separable convolution:
• Inception, Xception *
• SqueezeNet
• MobileNet / MobileNetV2
• ShuffleNet / ShuffleNetV2
Lightweight models
Inception Module
Inception module with dimension reduction
V1 block (from GoogLeNet)
Traditional 3X3
convolution block
Input: 28 X 28 X 192
Output: 28 X 28 X 256
#Model parameters:
3 X 3 X 192 X 256 = 442k
1 X 1 X 192 X 64
+1 X 1 X 192 X 96 + 3 X 3 X 96 X 128
+1 X 1 X 192 X 16 + 5 X 5 X 16 X 32
+0(maxpooling)+1 X 1 X 192 X 32 =163k
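The two parameter counts above can be verified with a short script. It counts convolution weights only and ignores biases, as the slide does; `conv_params` is a hypothetical helper.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


plain_3x3 = conv_params(3, 192, 256)

inception_v1 = (
    conv_params(1, 192, 64)                               # 1x1 branch
    + conv_params(1, 192, 96) + conv_params(3, 96, 128)   # 1x1 reduce, then 3x3
    + conv_params(1, 192, 16) + conv_params(5, 16, 32)    # 1x1 reduce, then 5x5
    + conv_params(1, 192, 32)                             # 1x1 after max pooling
)

output_channels = 64 + 128 + 32 + 32
```

Both blocks map 192 channels to 256, but the Inception module needs roughly 37% of the weights of the plain 3x3 convolution.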
Previous layer
3X3 convolution
output layer
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014.
Inception V1, V2, V3
Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014.
Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,
http://arxiv.org/abs/1502.03167. 2015.
Rethinking the Inception Architecture for Computer Vision, http://arxiv.org/abs/1512.00567. 2015.
• Use two 3 x 3 convs to replace a 5 x 5 conv
Xception
François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357. 2016-2017.
• Depthwise separable convolution
(3 x 3 x 1 x M/7 x 112 x 112) x 7
SqueezeNet
Input: F x F x M
Squeeze:
• 1x1 convs, output: F x F x S (S < M)
Expand:
• 1x1 convs, output: F x F x e1
• 3x3 convs, output: F x F x e2
Concat: F x F x (e1 + e2)
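The parameter saving of a fire module can be estimated from the shapes above. The sketch counts weights only (no biases); the example sizes M=96, S=16, e1=e2=64 are illustrative values chosen here, not taken from the slide.

```python
def fire_params(m, s, e1, e2):
    """Weights in a fire module: 1x1 squeeze (M->S), 1x1 expand (S->e1), 3x3 expand (S->e2)."""
    squeeze = 1 * 1 * m * s
    expand_1x1 = 1 * 1 * s * e1
    expand_3x3 = 3 * 3 * s * e2
    return squeeze + expand_1x1 + expand_3x3


def plain_3x3_params(m, out_channels):
    """Weights in a single 3x3 convolution with the same input and output channel counts."""
    return 3 * 3 * m * out_channels


fire = fire_params(m=96, s=16, e1=64, e2=64)
plain = plain_3x3_params(m=96, out_channels=64 + 64)
```

Squeezing to S channels first means the expensive 3x3 filters only see S inputs instead of M, which is where most of the saving comes from.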
Forrest N. Iandola,et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360. 2016
• Standard:
• Depthwise separable conv
(1) depthwise conv: 1 filter per input channel
(2) pointwise conv
1x1 convs
• Computation Reduction
MobileNet V1:
Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
https://arxiv.org/abs/1704.04861?context=cs. 2017.
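The computation reduction of a depthwise separable convolution can be checked numerically against the closed form 1/N + 1/K² given in the MobileNet paper. The function and parameter names below are hypothetical, and the example sizes are illustrative.

```python
def standard_cost(kernel, in_ch, out_ch, feature_dim):
    """Multiplications of a standard convolution on a feature_dim x feature_dim output."""
    return kernel * kernel * in_ch * out_ch * feature_dim * feature_dim


def depthwise_separable_cost(kernel, in_ch, out_ch, feature_dim):
    """Depthwise step (one kernel x kernel filter per input channel) plus 1x1 pointwise step."""
    depthwise = kernel * kernel * in_ch * feature_dim * feature_dim
    pointwise = in_ch * out_ch * feature_dim * feature_dim
    return depthwise + pointwise


standard = standard_cost(3, 192, 256, 28)
separable = depthwise_separable_cost(3, 192, 256, 28)
reduction = separable / standard
```

For 3x3 kernels the ratio is dominated by the 1/9 term, so the separable form needs roughly 8 to 9 times fewer multiplications.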
MobileNet V1
• Use conv with stride=2 to
replace pooling
• Add two hyperparameters: width multiplier α and resolution multiplier ρ
• α = 1.0, 0.75, 0.5, 0.25
• Standard MobileNet when α = 1
MobileNet V2
MobileNetV1
MobileNetV2
Increase # channels
Linear bottlenecks: remove the nonlinear activation in the low-dimensional layers
Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. https://arxiv.org/abs/1801.04381. 2018.
inverted residual block
Increase dim, then reduce dim
ShuffleNet V1
• pointwise group convolution (1x 1 Conv)
• Channel shuffle: helps information flow across feature channels
• Use concat operation to concatenate two different channels
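Channel shuffle amounts to reshaping the channel axis to (groups, n), transposing, and flattening. The sketch below applies that permutation to a plain list of channel indices; the function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each group in the next layer sees channels from every group."""
    n = len(channels) // groups
    if groups * n != len(channels):
        raise ValueError("number of channels must be divisible by groups")
    # Equivalent to reshape (groups, n) -> transpose -> flatten.
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
```

After the shuffle, consecutive channels alternate between the original groups, so a following group convolution mixes information across groups.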
Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
#g (groups)
ShuffleNet V1
ShuffleNet V1
ShuffleNet V2
Reduce memory
access cost:
• Channel Split (2g)
• remove group
convolution
• Put channel shuffle
module after
channel
concatenation
Ningning Ma et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164. 2018.
Experiments - Classification
| Model | mAP (%) | Precision (%) | Recall (%) | Size (MB) | Computation speed (ms/photo) |
|---|---|---|---|---|---|
| Server-based + Yolov2 | 99.62 | 99.60 | 99.65 | N/A | N/A |
| 1.00x ShuffleNet V2 + Yolov2 | 96.43 | 97.16 | 96.83 | 5.20 | 80.00 |
| 0.50x ShuffleNet V2 + Yolov2 | 95.86 | 97.28 | 96.28 | 1.70 | 40.00 |
| 0.50x ShuffleNet V2 + SSD | 97.73 | 90.61 | 97.98 | 7.90 | 65.00 |
| 0.25x ShuffleNet V2 + SSD | 97.25 | 90.46 | 97.59 | 5.00 | 45.00 |
| Category | Abbreviation |
|---|---|
| Front page of ID card | id_card_f |
| Back page of ID card | id_card_b |
| Front page of driver license | driver_license_f |
| Back page of driver license | driver_license_b |
| Front of main page in car license front | car_license_f |
| Back of main page in car license front | car_license_b |
| Supplementary page in car license | vehicle_license |
| Real car photo (whole car) | Whole car |
| Real car photo (car plate) | plate |
Experiments - Classification
#positive photos: 8K #negative photos: 8K
| Version | Backbone | Detection method | Size (MB) | mAP (%) | Precision (%) | Recall (%) | Error detection rate (%) |
|---|---|---|---|---|---|---|---|
| Floating-point version | 0.5*ShuffleNet V2 | YoloV2 | 1.70 | 97.86 | 98.81 | 98.00 | 0.125 |
| Fixed-point version | 0.5*ShuffleNet V2 | YoloV2 | 0.40 | 97.82 | 98.82 | 97.97 | 0.0625 |
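#positive photos: 8K

| Category | Floating-point precision (%) | Floating-point recall (%) | Fixed-point precision (%) | Fixed-point recall (%) |
|---|---|---|---|---|
| car | 98.87 | 96.41 | 98.97 | 96.11 |
| car_license_b | 98.70 | 99.00 | 99.10 | 99.00 |
| car_license_f | 99.90 | 97.70 | 99.80 | 97.90 |
| driver_license_b | 99.80 | 98.90 | 99.80 | 99.00 |
| driver_license_f | 99.49 | 98.50 | 99.19 | 98.50 |
| id_card_b | 99.90 | 99.00 | 99.90 | 99.00 |
| id_card_f | 99.50 | 99.10 | 99.50 | 99.10 |
| plate | 93.82 | 94.29 | 93.71 | 93.99 |
| vehicle_license | 99.30 | 99.10 | 99.40 | 99.10 |
| Average | 98.81 | 98.00 | 98.82 | 97.97 |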
Experiments - Embedded OCR
• Use ShuffleNet to replace ResNet50 as the backbone
Haifeng SHEN, DiDi AI Labs
Guangyu LI, DiDi AI Labs
Part III : Application
• Driver Identification
• Driving Scenario Understanding
Application
Driver identification
• Application
• Overview
• Experiments
Application - Pay by smiling
• In Sep. 2017, Alibaba's Ant Financial affiliate and KFC China announced facial-recognition payment for customers in the fast food restaurant chain's new KPRO store in Hangzhou.
• The "Smile to Pay" facial recognition payment solution at KFC enables customers to pay without their wallets.
https://www.jrzj.com/194328.html
Application - Check-in at station
Taiyuan South railway station
Beijing West railway station
Shanghai metro station
https://baijiahao.baidu.com/s?id=1552314447507461&wfr=spider&for=pc
http://www.sohu.com/a/220124437_99966914
http://dy.163.com/v2/article/detail/D5U3QH2P0525KG01.html
http://www.sohu.com/a/168709903_728989
Application - Pedestrian monitoring
Ningbo City uses face recognition for transportation surveillance and pedestrian monitoring.
Application - Driver monitoring
https://www.sohu.com/a/253263266_649849
Application - Other uses
https://www.globalrailwayreview.com/article/66120/train-stations-facial-recognition/
https://image.baidu.com/
Overview - Market
https://www.marketsandmarkets.com/Market-Reports/facial-recognition-market-995.html
Overview - features
Natural, Un-perceivable, Contact-less, Multiple
BIOMETRIC: "You are your own key"
https://image.baidu.com/
Overview - Challenges
Inter-class similarity
https://image.baidu.com/
Overview - Challenges
Illumination Expression
Occlusion Age
Pose
Other
Intra-class variability
Similarity
=0.18
https://image.baidu.com/
Overview - Framework
Verification
Overview - Framework
Identification
Overview - Detection & landmark dataset
| Face detection dataset | Available | # faces | # images | Website | Remarks |
|---|---|---|---|---|---|
| FDDB | Public | 5,171 | 2,845 | http://vis-www.cs.umass.edu/fddb/ | Unconstrained faces |
| WiderFace | Public | 393,703 | 32,203 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace | Easy, Medium, and Hard sets; a high degree of variability in scale, pose, and occlusion |
| MALF | Public | 11,931 | 5,250 | http://www.cbsr.ia.ac.cn/faceevaluation/ | Bounding boxes; Multi-Attribute Labelled Faces with pose and facial attributes |
| Caltech 10,000 Web Faces | Public | - | 10,524 | http://www.vision.caltech.edu/Image_Datasets/Caltech_10K_WebFaces/ | Collected from Google image search; 4 landmarks (two eyes, nose, and mouth) |
| PUT | Public | - | 9,971 | http://biometrics.put.poznan.pl/put-face-database/ | 30 landmarks, 194 contour points |
| AFLW | Public | 25,993 | - | https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | Collected from Flickr; 21 landmarks |
Overview - Detection - MTCNN
Kaipeng Zhang et al. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. https://arxiv.org/abs/1604.02878v1. 2016.
• Proposes a deep cascaded multi-task framework with three stages: P-Net, R-Net, and O-Net
• Each stage is a shallow network
• P-Net (proposal network): quickly produces candidate windows through a shallow CNN
• R-Net (refine network): refines the candidates, rejecting a large number of non-face windows through a more complex CNN
• O-Net (output network): uses a more powerful CNN to refine the result and output facial landmark positions
Overview - Detection - Face RFCN
Yitong Wang et al. Detecting Faces Using Region-based Fully Convolutional Networks. https://arxiv.org/abs/1709.05256. 2017.
• The framework is based on the R-FCN.
• Proposes a region-based face detector that applies deep networks in a fully convolutional fashion
• Introduces additional smaller anchors and shrinks the position-sensitive RoI pooling size to suit the detection of tiny faces
• Uses position-sensitive average pooling instead of normal average pooling for the last feature voting in R-FCN
• Uses a multi-scale training strategy and online hard example mining (OHEM)
Overview - Detection - PyramidBox
Xu Tang et al. PyramidBox: A Context-assisted Single Shot Face Detector. https://arxiv.org/abs/1803.07737?context=cs. 2018.
• Baidu proposes PyramidBox
• Extends the VGG16 backbone and generates feature maps at different levels
• Generates a series of anchors corresponding to larger regions around a face that contain more contextual information, such as the head, shoulders, and body
Overview - Recognition - Dataset
| Dataset | Available | # People | # Images | Website | Remarks |
|---|---|---|---|---|---|
| LFW | Public | 5K | 13K | http://vis-www.cs.umass.edu/lfw/#views | Labeled Faces in the Wild |
| YFD | Public | 1.5K | 3.4K (Video) | https://www.cs.tau.ac.il/~wolf/ytfaces/ | YouTube Faces Database |
| CelebA (CelebFaces Attributes Dataset) | Public | 10K | 202K | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | Multimedia Lab, The Chinese University of Hong Kong |
| CASIA-WebFace | Public | 10K | 500K | http://www.cbsr.ia.ac.cn/english/CASIA-WebFace/CASIA-WebFace_Agreements.pdf | |
| MS-Celeb-1M | Public | 100K | 10M | https://www.msceleb.org | |
| VGGFace2 | Public | 9K | 3.3M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/ | Downloaded from Google Image Search; large variations in pose, age, illumination, ethnicity, and profession |
| Facebook | Private | 4K | 4,400K | N/A | |
| Google | Private | 8000K | 100-200M | N/A | |
Overview - Recognition - Milestones
1888: Galton, Nature
1910: Galton, Nature
1965: Chan and Bledsoe, AFR
1991: Turk and Pentland, Eigenfaces
1997: Belhumeur P., Fisherface
2002: Liu C., Gabor feature
2006: Ahonen T., LBP
2009: Wright J., Sparse representation
2013: Chen D., High-dim LBP
2014: Sun Yi, DeepID
2014: Facebook, DeepFace
2015: Oxford, VGGFace
2015: Google, FaceNet
2015: Baidu, Ensemble Face
2016: Effective Face
2017: SphereFace
2018: ArcFace
2019: Combined loss
Overview - Recognition - Results
Time | Method | Training size | Method description | LFW | Comments
1991 | Eigenfaces | < 10K | Principal component analysis (PCA) | 60.02% |
2006 | LBP+CSML | < 10K | Local binary pattern (LBP) + Metric learning | 85.57% |
2013 | High-dim LBP | 0.1M | High-dim LBP + Joint Bayesian | 95.17% |
2014 | DeepFace | 4M | CNN + 3D face alignment | 97.35% | Facebook
2014 | DeepID | 0.2M | CNN + Softmax | 97.45% | CUHK
2015 | VGGFace | 2.6M | VGG + Softmax | 98.95% | Oxford
2015 | FaceNet | 200M | Inception + Triplet loss | 99.63% | Google
2015 | Ensemble Face | 1.2M | CNN + Multi-patch + Deep metric | 99.77% | Baidu
2016 | Effective Face | 2.5M | CNN + Augmentation | 98.06% | Pose + Shape + Expression
2017 | SphereFace | 0.5M | CNN + Angular-Softmax | 99.42% | Multiplicative angular margin: cos(mθ)
2018 | ArcFace | 6.8M | CNN + Additive angular margin | 99.83% | Additive angular margin: cos(θ + m)
2019 | Combined loss | N/A | | | cos(m1θ + m2) − m3
Overview - Recognition - DeepFace
Yaniv Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.
https://ieeexplore.ieee.org/document/6909616. CVPR 2014.
• CNN + DNN structure
• L4 - L6 are locally connected layers without weight sharing, rather than standard convolutional layers
• The last two layers, F7 and F8, are fully connected
• Employs 3D face modeling and an affine transformation for 3D face alignment, producing a frontal face
• More than 120 million parameters
• Trained on four million facial images belonging to more than 4,000 identities
Overview - Recognition - DeepID
Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Representation from Predicting 10,000 Classes. https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Sun_Deep_Learning_Face_2014_CVPR_paper.pdf. CVPR 2014.
• Uses a face-patch method, with one ConvNet per patch
• Each ConvNet has 4 convolutional layers
• 60 face patches: ten regions, three scales, and RGB or gray channels
• Each of the 60 ConvNets extracts two 160-dimensional vectors (from a patch and its horizontally flipped counterpart), giving a 160 × 2 × 60 = 19,200-dimensional vector for face verification
• Achieves 97.45% face verification accuracy on LFW
• Building on DeepID (DeepID1), the Chinese University of Hong Kong later proposed DeepID2 and DeepID3
Overview - Recognition - FaceNet
Florian Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/abs/1503.03832. CVPR 2015.
• Google proposes the structure.
• Directly use a deep convolutional network
• Use triplet loss for training: minimizes the distance between an anchor and a positive,
both of which have the same identity, and maximizes the distance between the
anchor and a negative of a different identity
• Use the Euclidean distance to measure the face similarity for verification.
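The triplet loss described above can be written as max(0, ||a − p||² − ||a − n||² + α). The sketch below is a minimal pure-Python illustration for a single triplet; the margin value 0.2 and the two-dimensional embeddings are assumptions chosen for the example, and a real system would compute this over mini-batches of network embeddings.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Toy unit-length embeddings (hypothetical values).
easy = triplet_loss((1.0, 0.0), (0.8, 0.6), (0.0, 1.0))  # negative already far away
hard = triplet_loss((1.0, 0.0), (0.0, 1.0), (0.8, 0.6))  # negative closer than positive
print(easy, hard)
```

Only the second triplet produces a gradient, which is why the choice of triplets matters for training: triplets that already satisfy the margin contribute nothing.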
Overview - Recognition - Ensemble Face
Jingtuo Liu et al. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. https://arxiv.org/pdf/1506.07310. 2015.
• Multi-patch feature extraction.
• 9 image patches, each centered at a different landmark on the face region.
• Each patch network: 9 convolution layers and a softmax layer at the end.
• The last convolution layers of all networks are concatenated to build a high-dimensional face representation.
• Metric learning with triplet loss reduces the feature to 128/256 dimensions.
• Achieves 99.77% accuracy on LFW under the 6,000-pair evaluation protocol.
Overview - Recognition - Effective Face
Iacopo Masi et al. Do We Really Need to Collect Millions of Faces for Effective Face Recognition. https://arxiv.org/abs/1603.07057. CVPR 2016.
• Uses a single VGGNet with 19 layers
• Trains on both real and augmented data
• Uses the CASIA-WebFace collection and generates artificial data by introducing pose, shape and expression variations
Overview - Recognition - Combined loss
Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698. 2019.
• Multiplicative angular margin (SphereFace): cos(mθ)
• Additive angular margin (ArcFace): cos(θ + m)
• Additive cosine margin (CosFace): cos(θ) − m
• Combined loss: cos(m1θ + m2) − m3
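The combined margin cos(m1θ + m2) − m3 replaces the target-class logit cos(θ) inside a scaled softmax cross-entropy. The sketch below shows this for a single sample in plain Python. The scale s and the default margin values are illustrative assumptions, not settings taken from the slides, and real implementations add safeguards for the case where m1θ + m2 exceeds π and the cosine stops being monotonic.

```python
import math
from typing import Sequence


def margin_logit(cos_theta: float, m1: float = 1.0, m2: float = 0.0, m3: float = 0.0) -> float:
    """Target-class logit cos(m1*theta + m2) - m3.

    (m, 0, 0) gives the multiplicative angular margin, (1, m, 0) the additive
    angular margin, and (1, 0, m) the additive cosine margin.
    """
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return math.cos(m1 * theta + m2) - m3


def combined_margin_loss(cosines: Sequence[float], target: int, s: float = 32.0,
                         m1: float = 1.0, m2: float = 0.3, m3: float = 0.2) -> float:
    """Softmax cross-entropy over scaled cosines, with the margin on the target class only."""
    logits = [s * (margin_logit(c, m1, m2, m3) if j == target else c)
              for j, c in enumerate(cosines)]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Hypothetical cosine similarities between one embedding and three class centers.
cosines = [0.5, 0.4, 0.1]
loss = combined_margin_loss(cosines, target=0)
plain = combined_margin_loss(cosines, target=0, m1=1.0, m2=0.0, m3=0.0)
print(plain, loss)
```

With the margin switched on, the same embedding incurs a larger loss, which pushes training toward a wider angular gap between classes.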
Experiments - Combined loss
Test set | Availability | softmax | sphereface | cosface | arcface | Combined loss
LFW | public | 98.75 | 99.52 | 99.50 | 99.55 | 99.60
7k | private | 93.60 | 95.45 | 95.90 | 96.72 | 97.13
50k | private | 93.28 | 95.93 | 95.50 | 97.08 | 96.90
zc | private | 99.18 | 99.37 | 99.45 | 99.57 | 99.52
avg | | 96.20 | 97.57 | 97.59 | 98.23 | 98.29
• 7k/50k: test sets extracted from the registered driver photo database. 3K positive pairs and 3K negative pairs are randomly selected from 7k/50k drivers, respectively.
• zc: test set randomly extracted from the premier driver photo database. 3K positive pairs and 3K negative pairs are randomly selected for testing.
Experiments - Virtual learning
Binghui Chen, Weihong Deng, Haifeng Shen. Virtual Class Enhanced Discriminative Embedding Learning. https://arxiv.org/abs/1811.12611. 2018
• Virtual class learning improves performance over the baseline softmax on both the LFW and SLLFW datasets, e.g. from 99.10% to 99.46% and from 94.59% to 95.85%, respectively.
Experiments - Fast face detection
[Figure: detection network with feature layers C3 to C11 (tick labels 80, 40, 20, 20, 10, 5, 3, 2, 1). Multiscale feature fusion with upsampling feeds object detection, which produces the detection result.]
• Multiscale features: C3+C4+C5+C7+Conv9+Conv11
• Combine up-sampling features: C3 + C3’, C4 + C4’, C5 + C5’
• Support batch image computation
• TensorRT optimization
Speed (ms/frame) | Batch size=1 | Batch size=64 | Batch size=100
Original | 22 | 12 | N/A
FP32 | 17 | 7 | 7
INT8 | 13 | 4 | 4

GPU Memory (GB/frame) | Batch size=1 | Batch size=64 | Batch size=100
Original | 1.40 | 0.188 | N/A
FP32 | 0.57 | 0.070 | 0.066
INT8 | 0.48 | 0.039 | 0.030

Detection (%) | Precision | Recall | F-score
Original | 97.90 | 97.00 | 97.47
FP32 | 97.90 | 97.10 | 97.48
INT8 | 97.85 | 96.96 | 97.40
Experiments - Face detection
• WIDER FACE is a face detection benchmark dataset whose images are selected from the publicly available WIDER dataset.
• It contains 32,203 images with 393,703 labeled faces that show a high degree of variability in scale, pose and occlusion.
• The proposed DFS method uses semantically fused feature maps as contextual cues and constructs semantic segmentation supervision for training, to learn better representations.
• Won 5 rank-1 results in April 2019.
WIDER FACE: http://shuoyang1213.me/WIDERFACE/index.html
Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, et al. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. https://arxiv.org/abs/1811.08557. 2018-2019.
Human Driving Scenarios
What can we learn from Driving Scenarios?
• What is in a driving scenario?
• How far are the objects from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
Driving Scenarios vs. General Computer Vision
Data
• Multi-modal (i.e. multiple sensors including camera, LiDAR, GPS, IMU, etc.)
• Collected from 3D Open Area (Not Indoor/Lab Environments)
• Ego-centric / First Person
Requirements
Opportunities
Vision Perception in Driving Scenario
What does Vision Perception do:
Detect, segment, track and classify objects of interest in driving scenarios
Main Components
• Pedestrian
• Vehicle
• Road
• Traffic Sign / Light
Vision Perception – Pedestrian Detection
Pedestrian detection at 100 FPS
• Uses cascades
• Uses fast features
• Not a CNN-based model
Benenson et al ’12 “VeryFast”
100+ FPS detector. NO CNNs.
Vision Perception – Pedestrian Detection
Real-time Pedestrian Detection with CNNs
• Uses Cascades
• Uses fast non-CNN features
• Uses CNNs for maximum accuracy with minimal loss of speed
Angelova et al ’15 “DeepCascades”
Real-time (15FPS) with CNNs
Vision Perception – Pedestrian Detection
Occlusion-aware pedestrian detection
• Aggregation loss: enforces proposals to be close to, and compactly located around, the corresponding objects
• Part occlusion-aware region of interest (PORoI) pooling: integrates prior structure information of the human body to handle occlusion
• Based on Faster R-CNN
Zhang et al ’18 “OR-CNN”
State of the Art (by April 2019)
Vision Perception – Vehicle Detection
Vehicle detection in 3D from image
• Directly from 2D image
• Proposal Generation as Energy Minimization
• Orientation Estimation Network
Chen et al ’16 “3D Bounding Box”
Breakthrough for 3D Detection with Mono Image
Vision Perception – Vehicle Detection
Multi-View 3D object Detection
• Multi-sensor fusion
Chen et al ’17 “MV3D”
Impressive accuracy gain from multi-sensor fusion
Vision Perception – Vehicle Detection
Multi-level Fusion based 3D Object Detection
from Mono Images
• Simultaneously generates 2D region proposals and predicts 3D location, orientation and dimensions
Xu et al ’18 “Multi-level Fusion”
State of the Art for 3D Detection from Mono Camera Images
Vision Perception – Road Segmentation
Joint Semantic Prediction
• KITTI Road Detection top performance 2017
• Multi-task framework
• Real-time
• Uses RGB image only
Teichmann et al ’17 “MultiNet”
Speed + Accuracy with RGB image only
Vision Perception – Road Segmentation
LIDAR-Camera Fusion
• KITTI Road Detection top performance 2018
• Cross Fusion mechanism with FCN
Caltagirone et al ’18 “LidCamNet”
LIDAR-Camera Fusion RULES
Vision Perception – Road Segmentation
LIDAR-Camera Fusion with LIDAR Adaptation
• KITTI Road Detection current top performance
• Progressive LIDAR Adaptation
Chen et al ’19 “PLARD”
State of the Art Performance
Vision Perception – Road Segmentation
State of the Arts on KITTI (by April 2019)
Vision Perception – Traffic Sign Detection
IJCNN 2011 Traffic Sign Recognition Competition
• Ciresan et al ’11: 0.56% error
• Human: 1.16% error
• Non-CNN: 3.86% error
Ciresan et al ’11 “Traffic Sign Recognition”
Traffic Sign Recognition is EASY (Super-human Performance)
Vision Perception – Traffic Sign Detection
Detecting Small Signs from Large Images
• Break the large image into small patches
• Small-Object-Sensitive-CNN (SOS-CNN)
• Based on SSD
Meng et al ’17 “SOS-CNN”
Handle Small Objects
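Breaking a large image into small patches can be sketched as computing a grid of overlapping windows that together cover the whole image. The patch size and stride below are hypothetical values chosen for illustration; they are not taken from the SOS-CNN paper. Each patch would be passed to the detector, and each detection shifted back by its window's top-left corner.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image pixels


def tile_windows(width: int, height: int, patch: int, stride: int) -> List[Window]:
    """Return overlapping square windows that cover a width x height image."""
    def starts(length: int) -> List[int]:
        if length <= patch:
            return [0]
        positions = list(range(0, length - patch + 1, stride))
        if positions[-1] != length - patch:
            positions.append(length - patch)  # last window sits flush with the border
        return positions

    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in starts(height)
        for x in starts(width)
    ]


windows = tile_windows(width=1000, height=600, patch=400, stride=300)
for window in windows:
    print(window)
```

Choosing a stride smaller than the patch size makes neighbouring windows overlap, so a small sign cut by one patch border is still fully visible in an adjacent patch.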
What can we learn from Driving Scenarios?
• What is in a driving scenario?
• How far are the objects from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
3D Reconstruction in Driving Scenario
What does 3D Reconstruction do:
Recover the real-world location and pose of driving scenario objects (2D to 3D)
Main Components
5 mins Theoretic Backgrounds (a little Math)
3D Reconstruction – Theoretic Backgrounds
• Perspective Projection
3D Reconstruction – Theoretic Backgrounds
• Internal Camera Parameters
3D Reconstruction – Theoretic Backgrounds
• External Camera Parameters
3D Reconstruction – Theoretic Backgrounds
• Camera Model for Perspective Projection
3D Reconstruction – Theoretic Backgrounds
• A Block Diagram
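The camera model named on these slides can be sketched in a few lines: the external parameters (rotation R, translation t) move a world point into the camera frame, perspective projection divides by depth, and the internal parameters (focal lengths fx, fy and principal point cx, cy) convert to pixel coordinates. The numeric values below are assumptions for illustration, and lens distortion and skew are ignored.

```python
from typing import Sequence, Tuple


def project_point(point_world: Sequence[float],
                  rotation: Sequence[Sequence[float]],
                  translation: Sequence[float],
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates with a pinhole camera."""
    # External parameters: world frame -> camera frame, X_c = R * X_w + t.
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then internal parameters map to pixels.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 0.5, 10.0), identity, (0.0, 0.0, 0.0),
                      fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(pixel)
```

The division by depth is what discards distance information, and 3D reconstruction has to recover it from additional cues such as multiple views, motion or scene priors.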
3D Reconstruction – Semantic Reconstruction
Kundu et al ’14 “Joint semantic and 3D reconstruction from monocular video”
Semantic + 3D Reconstruction from Mono Camera
3D Reconstruction – Semantic Reconstruction
Cherabier et al ’16 “Multi-label semantic 3d reconstruction using voxel blocks”
Efficient Dense Semantic + 3D Reconstruction
What can we learn from Driving Scenarios?
• What is in a driving scenario?
• How far are the objects from the ego-vehicle?
• How does the human driver interact with the environment?
Vision Perception
3D Reconstruction
Behavior Analysis
Driving Scenario Understanding
Honda Research Institute Driving Dataset
• 104 Hours Real Human Driving records
• Driving Behavior and Causal Reasoning annotation
Ramanishka et al ’18 “HDD”
First Dataset towards Driving Scenario Understanding
Driving Scenario Understanding
Driving Attention Prediction from Video
• Focus on Driver’s Attention
• In-car vs. in-lab test
Xia et al ’18 “Predicting Driver Attention”
Introduce Attention Heat Maps
Related Datasets
[Table: comparison of driving datasets [1]-[6], HDD [7] and D2-City [8]. Only fragments of the cell text survive: "Driving behavior & Causal reasoning / Traffic participants detection & tracking", "Camera, GPS, IMU", "95.9", "Suburban, urban and highway".]
GAIA Open Dataset
• Dataset: D²-City
• D²-City is a large-scale driving video dataset that provides more than 10k videos recorded in 720p HD or 1080p FHD from front-facing dashcams, with annotations for object detection and tracking.
• 1k videos: bounding boxes and tracking IDs of road objects annotated in 12 different categories.
• 9k videos: bounding boxes annotated in key frames.
Q & A
Thanks!


定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 

Recently uploaded (20)

定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 

Computer vision for transportation

  • 1. Haifeng SHEN DiDi AI Labs Zhengping CHE DiDi AI Labs Guangyu LI DiDi AI Labs Yuhong GUO DiDi AI Labs Carleton University Jieping YE DiDi AI Labs Univ. of Michigan, Ann Arbor
  • 2. Part I: Introduction to Computer Vision Zhengping CHE, DiDi AI Labs
  • 3. • Computer Vision Basics • Image Classification • Object Detection Introduction to Computer Vision
  • 4. Computer Vision Basics • Representation Learning • Activation Functions • Neural Network Structures • Convolution Operators • Pooling Layers • Batch Normalization
  • 6. Neural Network Structures Convolutional Neural Network Deep Neural Network Different Neural Networks Top/Middle-left: http://cs231n.github.io/convolutional-networks/ Bottom-left: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Right: http://www.asimovinstitute.org/neural-network-zoo/ Recurrent Neural Network
  • 7. Activation Functions Top: https://theffork.com/activation-functions-in-neural-networks/ Bottom: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture04.pdf
  • 8. Convolution Operators -1 0 1 -2 0 2 -1 0 1 Vertical -1 -2 -1 0 0 0 1 2 1 Horizontal Sobel Operator Laplacian Operator 0 -1 0 -1 4 -1 0 -1 0 -1 -1 -1 -1 8 -1 -1 -1 -1 Traditional Operators Convolution Operation Right: http://cs231n.github.io/convolutional-networks/
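The Sobel kernels on this slide can be applied with a plain sliding-window sum. Below is a minimal pure-Python sketch, assuming a small grayscale image stored as a list of lists; the helper name `correlate2d_valid` and the toy image are illustrative and not taken from the slides. Like most deep learning frameworks, it computes cross-correlation (the kernel is not flipped).

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Kernels as listed on the slide.
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9, 9] for _ in range(4)]

edges_v = correlate2d_valid(image, SOBEL_VERTICAL)    # strong response
edges_h = correlate2d_valid(image, SOBEL_HORIZONTAL)  # no response
```

The vertical kernel responds to the left-to-right intensity change, while the horizontal kernel sees identical rows and returns zeros.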
  • 9. Convolution Operators (cont’d) Left: Jifeng Dai, et al., Deformable Convolutional Networks, 2017 Right: https://towardsdatascience.com/review-drn-dilated-residual-networks-image-classification-semantic-segmentation-d527e1a8fb5/ Fisher Yu, et al., Multi-Scale Context Aggregation by Dilated Convolutions, 2016 Dilated Convolution Standard Convolution (dilation rate = 1) Dilated Convolution (dilation rate = 2) Deformable Convolution Standard Convolution Deformable Convolution Deform. Conv. with Scaling Deform. Conv. with Rotation
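To make the dilation rates on this slide concrete, the sketch below computes how far a dilated kernel reaches along one axis. The function names are illustrative assumptions; the relation k + (k - 1)(d - 1) is the usual effective-size formula for a kernel of size k and dilation rate d.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered along one axis by a kernel with the given dilation rate."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_taps(kernel_size, dilation):
    """Offsets (relative to the first tap) that the kernel samples along one axis."""
    return [i * dilation for i in range(kernel_size)]


standard_span = effective_kernel_size(3, 1)  # dilation rate = 1
dilated_span = effective_kernel_size(3, 2)   # dilation rate = 2
```

A 3x3 kernel with dilation rate 2 covers a 5x5 window while still using nine weights.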
  • 10. Pooling Layers Top-left: http://deeplearning.stanford.edu/tutorial/supervised/Pooling/ Bottom-left: Matthew D. Zeiler, et al., Visualizing and Understanding Convolutional Networks, 2014 Right: http://fractalytics.io/rooftop-detection-with-keras-tensorflow/ Different Pooling Operations Unpooling Pooling
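The pooling operations named on this slide can be sketched in a few lines. This is a minimal illustration, assuming a single 2-D feature map as a list of lists, a square window, and no padding; `pool2d`, `mean`, and the sample values are hypothetical.

```python
def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Apply `reduce` to each size x size window of a 2-D feature map."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [4, 2, 3, 1],
        [8, 6, 7, 5]]

max_pooled = pool2d(fmap, reduce=max)   # keeps the strongest activation
avg_pooled = pool2d(fmap, reduce=mean)  # keeps the average activation
```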
  • 11. Pooling Layers (Cont’d) Corner Pooling Atrous Spatial Pyramid Pooling Right: Hei Law, et al., Detecting Objects as Paired Keypoints, 2018 Top-left: Kaiming He, et al., Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, 2015 Bottom-left: Liang-Chieh Chen,et al., Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2017 Spatial Pyramid Pooling
  • 12. Batch Normalization Top-left: http://gradientscience.org/batchnorm/ Sergey Ioffe, et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015 Bottom: Yuxin Wu, et al., Group Normalization, 2018 Diagram labels: FC Layer, Normalization, Scale & Shift, Activation
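The normalization and scale-and-shift steps can be sketched as follows. This is a minimal training-mode illustration for a single feature over one mini-batch, following the standard formulation in the Ioffe et al. paper; the function name and the default `eps` are assumptions, and running statistics for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
```

With `gamma=1` and `beta=0` the output has approximately zero mean and unit variance; the learnable `gamma` and `beta` let the network undo the normalization if that is useful.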
  • 13. Image Classification • Datasets & Competitions • Roadmap • Classification Networks • Experiments
  • 14. Image Classification Datasets & Competitions ImageNet, ILSVRC 2009-2017 ImageNet: http://www.image-net.org/ Second figure: https://principlesofdeeplearning.com/index.php/is-deep-learning-getting-too-deep/ Human
  • 15. Datasets & Competitions (Cont’d) MNIST CIFAR-10 & CIFAR-100 Dogs vs. Cats Stanford Cars iNaturalist Competition Plant Seedlings Classification http://yann.lecun.com/exdb/mnist/ https://www.cs.toronto.edu/~kriz/cifar.html https://www.kaggle.com/c/dogs-vs-cats https://ai.stanford.edu/~jkrause/cars/car_dataset.html https://sites.google.com/view/fgvc5/competitions/inaturalist https://www.kaggle.com/c/plant-seedlings-classification
  • 16. Image Classification Roadmap … 1998 2012 2014 2015 2016 2017 LeNet VGGNet ResNet SENet AlexNet GoogLeNet DenseNet 2018 DLA
  • 17. LeNet LeNet-5 (1998) • A neural network architecture for handwritten and machine-printed character recognition in the 1990s • Consists of seven layers, including • Convolution operations • Pooling operations • Full connections Yann LeCun, et al., Gradient-Based Learning Applied to Document Recognition, 1998 Bottom-right: https://engmrk.com/lenet-5-a-classic-cnn-architecture/
  • 18. AlexNet AlexNet (2012) • ILSVRC 2012 winner (16.4% top-5 error) • 60 million parameters and 650,000 neurons • 8 learned layers: 5 convolutional and 3 fully-connected layers • A 1000-way softmax layer after the last fully-connected layer • Dropout and ReLU • Trained in parallel on 2 GPUs Alex Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012 Bottom-right: Nitish Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014
  • 19. VGGNet (2014) • Six configurations with 11 to 19 weight layers, each organized into 5 groups of convolutions • VGG16 (138 million parameters) and VGG19 • Only 3x3 conv and 2x2 max-pooling layers before FC layers • Results @ ILSVRC 2014 • 1st in localization task • 2nd in classification task (7.3% top-5 error) Karen Simonyan, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
  • 20. GoogLeNet • ILSVRC 2014 winner (6.7% top-5 error) • 22 layers with only 5 million model parameters • Inception concept • Multiple conv kernels including 1x1, 3x3, and 5x5 • 1x1 kernel for dimension reduction • Better representational power + fewer network parameters • More advanced Inception modules (V2, V3, and V4) Inception-V1 Module GoogLeNet (2014) Christian Szegedy, et al., Going Deeper with Convolutions, 2015
  • 21. ResNet • 1st place on the ILSVRC 2015 classification task (3.6% top-5 error) • Deeper model with fewer filters and lower complexity • 34-layer baseline • 3.6 billion FLOPs • only 18% of VGG-19 (19.6 billion FLOPs) • Up to 152 layers! • Initialization, batchnorm, residual block… ResNet Block ResNet (2015, top) Kaiming He, et al., Deep Residual Learning for Image Recognition, 2016 http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
  • 22. DenseNet • $L(L+1)/2$ direct connections for $L$ layers • Fewer parameters and less computation • $x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$ DenseNet Block DenseNet (2016) Gao Huang, et al., Densely Connected Convolutional Networks, 2016
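A short sketch of the dense connectivity arithmetic: an L-layer dense block has L(L+1)/2 direct connections, and because each layer concatenates all earlier feature maps, its input width grows linearly. The function names and the channel numbers in the example are illustrative assumptions.

```python
def dense_connections(num_layers):
    """Number of direct connections in a dense block with `num_layers` layers."""
    return num_layers * (num_layers + 1) // 2


def dense_block_input_channels(k0, growth_rate, num_layers):
    """Input channels seen by layers 1..L when each layer adds `growth_rate` maps."""
    return [k0 + growth_rate * (l - 1) for l in range(1, num_layers + 1)]


connections = dense_connections(5)
widths = dense_block_input_channels(k0=64, growth_rate=32, num_layers=3)
```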
  • 23. SENet • ILSVRC 2017 winner (2.251% top-5 error) • Squeeze-and-excitation block • Squeeze: Global average pooling • Excitation: Channel association • Scale: Channel attention • Integration with modern architectures Squeeze-and-Excitation Block SENet (2017) Jie Hu, et al., Squeeze-and-Excitation Networks, 2018
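The squeeze, excitation, and scale steps listed on this slide can be sketched in pure Python. This is a simplified illustration, assuming feature maps stored as nested lists, two bias-free fully connected layers, and hand-picked toy weights; none of these names or values come from the slides.

```python
import math


def se_block(feature_maps, w1, w2):
    """feature_maps: C channels, each an H x W list of lists.
    w1: (C/r) x C weights, w2: C x (C/r) weights (biases omitted)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel gates in (0, 1).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its gate (channel attention).
    scaled = [[[gates[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_maps)]
    return scaled, gates


features = [[[1.0, 1.0], [1.0, 1.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = se_block(features, w1=[[1.0, 1.0]], w2=[[0.0], [1.0]])
```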
  • 24. DLA: Deep Layer Aggregation DLA (2018) • Layer aggregation to better fuse information • Iterative deep aggregation (IDA) • Semantic fusion • Resolutions and scales • Hierarchical deep aggregation (HDA) • Spatial fusion • Channels and depths (modules) Fisher Yu, et al., Deep Layer Aggregation, 2018
  • 25. Classification Experiments • Dataset-1 • 193186 images of 66 classes • Collected offline • Dataset-2 • 549169 images of 2506 classes • Collected offline + online • Similar settings to the Stanford Cars dataset
      Classification accuracy for car brand classification:
      Method   66 classes   2506 classes
      ResNet   94.60%       -
      SENet    92.30%       -
      DLA      96.02%       93.75%
  • 26. Object Detection • Introduction & Roadmap • Region-Based Methods • Region-Free Methods • Experiments
  • 27. Object Detection Introduction Top-Left: http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf Top-Right: https://www.hackerearth.com/blog/developers/object-detection-for-self-driving-cars/ MS COCO http://cocodataset.org/#home Open Images https://storage.googleapis.com/openimages/web/index.html http://host.robots.ox.ac.uk/pascal/VOC/ Pascal VOC ImageNet http://www.image-net.org/
  • 28. Object Detection Roadmap … 2014 2015 2016 2017 2018 R-CNN SPPNet Fast R-CNN Faster R-CNN R-FCN FPN SNIPER YOLOv1 SSD DSSD RetinaNet RefineDet CornerNet YOLOv3 Light-Head R-CNN Cascade R-CNN SNIP Region-Based Detection Region-Free Detection YOLOv2 Left: Zhengxia Zou, et al., Object Detection in 20 Years: A Survey, 2019
  • 29. Region-Based / Region-Free Methods • Region-based detection • Two-stage method • Higher accuracy • Lower speed • Complex computation • R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, FPN, Cascade R-CNN, SNIP, SNIPER… • Region-free detection • One-stage method • Lower accuracy • Faster speed • Light computation • YOLO, SSD, DSSD, RetinaNet, RefineDet, CornerNet… Jonathan Huang, et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017
  • 30. R-CNN: Regions with CNN Features • Selective Search + CNN + SVM • Start to use CNN features instead of the traditional features • ~2k bottom-up region proposals from selective search • Time consuming • Extracting feature for every proposal separately Ross Girshick, et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 Bottom-Right: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf R-CNN (2014)
  • 31. Fast R-CNN • One image + multiple RoIs + a fully convolutional network • RoI pooling: generates a fixed-size feature vector for each proposal • Outputs: softmax probabilities + bounding-box regression offsets • End-to-end training with a multi-task loss Fast R-CNN (2015) Right: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf Ross Girshick, Fast R-CNN, 2015
  • 32. Faster R-CNN • Region proposal network (RPN) + Fast R-CNN • RPN & detection network share full-image convolutional features • Anchors with multiple scales and aspect ratios Bottom-Left: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf Shaoqing Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015 Faster R-CNN (2015) Region Proposal Network
  • 33. R-FCN: Region-based Fully Convolutional Networks • Position-sensitive score map before RoI pooling • 9 positions: top/middle/bottom-left/center/right • Position-sensitive RoI pooling instead of standard RoI pooling • Fully convolutional detection network instead of the fully-connected detection network in Faster R-CNN Jifeng Dai, et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016 R-FCN (2016) Position-Sensitive Score Map
  • 34. Light-Head R-CNN • Heavy head • E.g., Faster R-CNN & R-FCN • Intensive computations around RoI warping • Light-Head R-CNN • Thin feature maps from large separable convolution layers • Cheap R-CNN subnet with 1 FC-layer Zeming Li, et al., Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017 Light-Head R-CNN (2017) ‘Heavy’-Head Detectors Large Separable Convolution
  • 35. FPN: Feature Pyramid Networks • Bottom-up pathway • Top-down pathway • Lateral connection Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, 2017 Different Feature Maps FPN Block • Feature pyramid: Combination of • Low-resolution, semantically strong features • High-resolution, semantically weak features
  • 36. Cascade R-CNN • Multi-stage extension of R-CNN • Trained sequentially using output of the previous stage • Cascaded bbox regression • $f(x, b) = f_T \circ f_{T-1} \circ \cdots \circ f_1(x, b)$ • Cascaded detection • A sequence of detectors trained with increasing IoU thresholds Zhaowei Cai, et al., Cascade R-CNN: Delving into High Quality Object Detection, 2018 Cascade R-CNN
  • 37. SNIP: Scale Normalization for Image Pyramids • CNNs are not robust to changes in scale • Multi-scale image pyramids for objects with different scales • Detections from each scale are rescaled and combined using NMS • Small objects from high-resolution image • Large objects from low-resolution image Bharat Singh, Scale Invariance in Object Detection - SNIP, 2018
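Combining detections with NMS, as mentioned on this slide, relies on the intersection-over-union (IoU) between boxes. The sketch below shows a basic greedy NMS, assuming boxes given as `(x1, y1, x2, y2)` tuples with positive area and a hypothetical threshold of 0.5; it illustrates the general technique rather than the exact procedure used by SNIP.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop overlapping lower-scored ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed, while the distant third box survives.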
  • 38. YOLOv3 (2018) YOLO: You Only Look Once • End-to-end one-stage method • Directly uses full images to predict each bounding box • Extremely fast, running in real time • YOLOv2 • Darknet19 backbone • Anchor mechanism • YOLOv3 • Multi-scale features • Darknet53 backbone Joseph Redmon, et al., You Only Look Once: Unified, Real-Time Object Detection, 2016 Joseph Redmon, et al., YOLO9000: Better, Faster, Stronger, 2017 Joseph Redmon, et al., YOLOv3: An Incremental Improvement, 2018 Top-Left: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/ Bottom-Left: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b/ YOLO (2016)
  • 39. SSD: Single Shot Detector • Multiple feature maps with different resolutions and scales • Improved speed/accuracy trade-off Wei Liu, et al., SSD: Single Shot MultiBox Detector, 2016 SSD (2016) YOLOv1
  • 40. DSSD: Deconvolutional SSD • Encoder-decoder Hourglass structure • Wide – Narrow – Wide • Convolution and deconvolution modules • Deconvolution: To introduce additional large-scale context for object detection • Two prediction modules • Each with one residual block Cheng-Yang Fu, et al., DSSD: Deconvolutional Single Shot Detector, 2017 SSD DSSD (2017) Selected Prediction Module
  • 41. RetinaNet • Focal loss instead of the cross-entropy loss • Focuses training on a sparse set of hard samples • $FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$ Tsung-Yi Lin, Focal Loss for Dense Object Detection, 2017 RetinaNet (2017)
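A minimal sketch of the focal loss, FL(p_t) = -(1 - p_t)^γ log(p_t), compared with plain cross entropy. The function names are illustrative, the optional class-balancing weight from the paper is left out, and γ = 2 is used as the default.

```python
import math


def cross_entropy(p_t):
    """Cross entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross entropy down-weighted by (1 - p_t)^gamma for easy examples."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # well-classified sample
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # misclassified sample
```

With γ = 2, an easy example (p_t = 0.9) contributes about 1% of its cross-entropy loss, while a hard one (p_t = 0.1) keeps about 81%, which is how training concentrates on hard samples.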
  • 42. RefineDet Shifeng Zhang, et al., Single-Shot Refinement Neural Network for Object Detection, 2018 RefineDet (2018) • Anchor refinement module • Filtering out easy negatives • Coarsely adjusting anchors • Object detection module • Further improving regression • Predicting multi-class labels Transfer Connection Block
  • 43. CornerNet • Object as a pair of bounding box corners • No need for anchor boxes • Regression problem → Corner prediction problem • Corner pooling • To better localize corners of bounding box Hei Law, et al, CornerNet: Detecting Objects as Paired Keypoints, 2018 CornerNet (2018) Corner Pooling
  • 44. MRFSWSnet: Multiple Receptive Field Small-Object-Focusing Weakly-Supervised Segmentation Net Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019 • Multiple Receptive Field block (MRF): multiple receptive fields and more features for prediction • Auxiliary Semantic Segmentation block (ASM): auxiliary semantic segmentation focusing on small objects • Object Detection block (ODM): combining MRF and ASM with parallel training • Loss function
  • 45. Experiments on MRFSWSnet • Images collected by dash camera • Detection of cellphone usage during driving • 1000 testing images
      Method             Recall   Precision   F1 Score
      Faster R-CNN       97.57    96.47       97.01
      RetinaNet          97.80    97.80       97.80
      Light-Head R-CNN   97.71    95.13       96.40
      YOLOv3             98.57    97.32       97.94
      MRFSWSnet          98.71    97.32       98.01
      Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
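The F1 scores in this table are the harmonic mean of precision and recall. The sketch below recomputes them from the reported recall and precision values; the dictionary layout and the 0.02 tolerance (to allow for rounding in the slide) are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# method: (recall, precision, reported F1), values as given on the slide
results = {
    "Faster R-CNN": (97.57, 96.47, 97.01),
    "RetinaNet": (97.80, 97.80, 97.80),
    "Light-Head R-CNN": (97.71, 95.13, 96.40),
    "YOLOv3": (98.57, 97.32, 97.94),
    "MRFSWSnet": (98.71, 97.32, 98.01),
}

recomputed = {name: f1_score(p, r) for name, (r, p, _) in results.items()}
```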
  • 46. Challenges • Dependence on large amounts of labeled data, which incurs expensive annotation costs • Difficult to apply directly in new operating environments • Computation-intensive and highly demanding of computational resources • Complicated, time- and memory-consuming models, which prevents use in real-time operational systems (e.g., DMS)
  • 47. Yuhong GUO DiDi AI Labs & Carleton University Part II: Advanced Topics
  • 50. Domain Adaptation/Transfer Learning • Definition [Pan et al., IJCAI 13]: Ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI 2013. Tan, Chuanqi, et al. "A survey on deep transfer learning." International Conference on Artificial Neural Networks. Springer, Cham, 2018.
  • 51. Why Domain Adaptation • Successful application of ML in industry depends on learning from large amounts of labeled data • Expensive and time-consuming to collect labels • Difficult or dangerous to collect data in certain scenarios, e.g., autonomous driving • Domain Adaptation/Transfer Learning provides the essential ability of • Reusing existing labeled resources • Adapting to changing environments • Learning from simulations
  • 52. Transfer Learning vs Traditional ML Transfer Learning/Domain Adaptation Training domain/task A Test domain/task B Traditional ML (Semi-)Supervised Learning Training domain/task A Test domain/task B
  • 53. Motivation Examples Different feature distributions Different label spaces
  • 55. Adapting to New Domains • Reuse existing datasets and hence their annotation information • Object Recognition • Object Detection • Person Re-Identification • Image Segmentation • Image Classification …
  • 56. Learning from Simulations • Gathering data and training models is either too expensive, too time-consuming, or too dangerous • Solution: create data and learn from simulations • OpenAI's Universe will potentially allow us to train a self-driving car using GTA 5 or other video games • Training models on real robots is too slow and expensive http://ruder.io/transfer-learning/index.html
  • 57. Common Datasets • Object recognition: Office-31, ImageCLEF-DA • Visual domain adaptation challenge dataset VisDA-2017 • Digits: MNIST, SVHN, USPS • Syn2Real dataset: a new dataset for object recognition [Peng et al., 2018]
  • 58. Common Datasets • Semantic segmentation/object detection
  • 60. Categories of DA Methods Three main classes: • Reweighting/Instance-based Methods • Feature-based/Representation Learning Methods • Parameter/Model-based Methods
  • 61. Start with Instance Reweighting • Context • Idea
  • 62. Simple Math Analysis • h(·): prediction function; x: input; y: output • Expected risk in target domain
  • 63. Covariate Shift • Assume shared conditional distribution • To minimize target risk, source instances can be reweighted
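Under the shared-conditional assumption, the target risk equals the source risk with each instance weighted by the density ratio P_T(x) / P_S(x), which is the standard importance-weighting identity. The sketch below checks this on a toy discrete input space; all probabilities and loss values are made up for illustration.

```python
# Hypothetical marginal input distributions for the two domains.
p_source = {"a": 0.6, "b": 0.3, "c": 0.1}
p_target = {"a": 0.2, "b": 0.3, "c": 0.5}

# Expected loss of a fixed predictor h at each input. It is the same in both
# domains because P(y | x) is assumed to be shared (covariate shift).
loss = {"a": 0.1, "b": 0.4, "c": 0.9}

# Importance weights: ratio of target to source input density.
weights = {x: p_target[x] / p_source[x] for x in p_source}

source_risk = sum(p_source[x] * loss[x] for x in p_source)
target_risk = sum(p_target[x] * loss[x] for x in p_target)
reweighted_source_risk = sum(p_source[x] * weights[x] * loss[x] for x in p_source)
```

The unweighted source risk (0.27) underestimates the target risk (0.59), while the reweighted source risk matches it exactly.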
  • 64. Assumptions • Assume shared conditional distribution • Assumption of support
  • 65. Weight Estimation • Density ratio estimation • Direct weight estimation: $w = P_T / P_S \propto P(d = T \mid x) / P(d = S \mid x)$
  • 66. Learning Weights Directly: MMD • Maximum Mean Discrepancy (MMD) [Gretton et al. 2012]
  • 67. Learning Weights Directly: MMD • MMD for domain adaptation
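MMD compares two samples through the mean of their kernel embeddings. Below is a minimal sketch of the biased estimate of squared MMD for 1-D samples with an RBF kernel; the function names, the bandwidth `sigma`, and the sample values are illustrative assumptions.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))


def mmd2(xs, ys, kernel=rbf):
    """Biased estimate of squared MMD between two 1-D samples."""
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy


source = [0.0, 1.0, 2.0]
near_target = [1.0, 2.0, 3.0]
far_target = [5.0, 6.0, 7.0]

same = mmd2(source, source)
near = mmd2(source, near_target)
far = mmd2(source, far_target)
```

The estimate is zero for identical samples and grows as the two samples move apart, which is what makes it usable as a domain discrepancy measure.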
  • 68. Extend to Representation Learning • Extend MMD to learn a representation function $\phi(x)$ [Long et al., CVPR 13]
  • 69. Recent Feature-based Methods • Representation learning methods offer larger capacity for bridging domain discrepancy • Widely applied in transfer learning for computer vision tasks • Recent development of representation learning based domain adaptation
  • 70. Adversarial Loss-based Adaptation Framework • Main idea: $\min_F \max_D L_{adv}(F, D) = E_{x \sim D_S} \log D(F(x)) + E_{x \sim D_T} \log(1 - D(F(x)))$ • $p_S(F(x)) = p_T(F(x))$ Goodfellow et al., 2014
  • 71. Theoretical Connection • A-distance: a measure of distance between probability distributions • Bound on target domain error Ben-David et al. "Analysis of Representations for Domain Adaptation", NIPS 06 Kifer et al. Detecting change in data streams. In Very Large Databases (VLDB), 2004.
  • 72. Adversarial Loss-based Adaptation Framework • Main idea: $\min_{F, C} \max_D L = L_{adv}(F, D) + \lambda L_{cls}$
  • 73. Domain Adversarial Neural Network (DANN) • DANN: the adversarial objective is implemented via a GRL (gradient reversal layer)
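A gradient reversal layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass, so the feature extractor is pushed to maximize the domain classifier's loss. The sketch below shows that behaviour with a hand-written forward and backward method; the class name and the scaling factor `lam` are assumptions, and a real implementation would hook into a framework's autograd mechanism.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return list(x)

    def backward(self, grad_output):
        # Gradients from the domain classifier are reversed before they
        # reach the feature extractor.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features_out = grl.forward([1.0, -2.0])
grads_in = grl.backward([0.5, -1.0])
```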
  • 74. Model Sharing and Adversarial Adaptation • Adversarial Discriminative Domain Adaptation (ADDA): the source CNN is trained without sacrificing any discriminability
  • 75. Reweighted Adversarial Adaptation • Re-weight the source domain label distribution to help reduce domain discrepancy and adapt the classifier • Reweighted adversarial loss (RAAN) [Chen et al., CVPR 18]
  • 76. Alternative Adversarial Terms • Maximum Classifier Discrepancy (MCD) • Train both classifiers and the generator to classify the source samples correctly • Adversarial loss: target domain prediction discrepancy K. Saito, et al. "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 18
  • 77. Conditional Adversarial Domain Adaptation • Conditional Domain Adversarial Networks (CDANs) [NeurIPS 18]
  • 79. Question Raised: Transferability vs Discriminability
  • 81. Object detection DA-Faster-R-CNN Multi-Level Adversarial Adaptation • Adversarial loss via GRL at both image level and instance level • Consistency regularization at the two levels Chen, et al., CVPR 18
  • 82. Object detection: Strong-Weak Multi-Level Adversarial Alignment Saito, et al., CVPR 19
  • 83. Object detection Multi-Level Adversarial Alignment Saito, et al., CVPR 19
  • 86. Cycle-Consistent Adversarial DA • Limitation of domain alignment techniques • CyCADA et al., ICML 18
  • 87. Cycle-Consistent Adversarial DA Image-level GAN loss (green), feature-level GAN loss (orange), source and target semantic consistency losses (black), source cycle loss (red), and source task loss (purple). et al., ICML 18
  • 88. Symmetric Bi-Directional Adaptive GAN • SBDA-GAN et al., CVPR 18
  • 90. Pseudo-Label based Methods Some positive applications in domain adaptation: • Progressive domain adaptation for object detection • For recognition Zhang et al. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation", CVPR 18 Inoue et al. "Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation", CVPR 18
  • 91. Summary • Unsupervised domain adaptation has received a lot of attention • Open domain learning remains challenging but is starting to draw attention • Most studies have focused on classification problems • Much less effort has been devoted to more complex tasks such as object detection
  • 93. Basics Number of multiplications for one standard convolutional layer: Input: $D_I \times D_I \times M$ Output: $D_O \times D_O \times N$ $D_K$: kernel size M: number of input channels N: number of output channels $D_O$: output dimension
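With the quantities defined on this slide, a standard convolutional layer performs kernel_size² × M × N multiplications per output position, over output_dim² positions. A minimal sketch follows; the function and argument names are assumptions, and the example layer sizes are chosen only for illustration.

```python
def conv_multiplications(kernel_size, in_channels, out_channels, output_dim):
    """Multiplications for one standard conv layer with a square kernel and output."""
    per_output_value = kernel_size * kernel_size * in_channels
    output_values = output_dim * output_dim * out_channels
    return per_output_value * output_values


# Example: 3x3 kernel, 192 input channels, 256 output channels, 28x28 output.
mults = conv_multiplications(3, 192, 256, 28)
```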
  • 94. Basics • Architecture design: lightweight models • Use two 3 x 3 conv layers to replace one 5 x 5 conv layer: (3x3 + 3x3)/(5x5) • Use two sequential 1 x n and n x 1 conv layers to replace one n x n conv layer: (1xn + nx1)/(nxn)
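The two ratios on this slide can be evaluated directly. The sketch below uses exact fractions and counts weights per input/output channel pair; the function names are illustrative.

```python
from fractions import Fraction


def stacked_3x3_vs_5x5():
    """Cost of two 3x3 kernels relative to one 5x5 kernel."""
    return Fraction(3 * 3 + 3 * 3, 5 * 5)


def asymmetric_vs_square(n):
    """Cost of a 1xn kernel followed by an nx1 kernel, relative to one nxn kernel."""
    return Fraction(1 * n + n * 1, n * n)


ratio_5x5 = stacked_3x3_vs_5x5()     # 18/25
ratio_7x7 = asymmetric_vs_square(7)  # 2/7
```

Two stacked 3x3 layers cost 72% of a 5x5 layer, and the asymmetric factorization costs 2/n of an n x n layer, so the saving grows with kernel size.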
  • 95. Basics • Architecture design: lightweight models • Pointwise convolution: use a 1x1 conv layer (to reduce dimension) • Depthwise separable convolution
  • 96. • Inception, Xception * • SqueezeNet • MobileNet / MobileNetV2 • ShuffleNet / ShuffleNetV2 Lightweight models
  • 97. Inception Module. Input: 28 x 28 x 192. Output: 28 x 28 x 256. Traditional 3x3 convolution block (previous layer, 3x3 convolution, output layer), #model parameters: 3 x 3 x 192 x 256 = 442k. Inception module with dimension reduction (V1 block, from GoogLeNet), #model parameters: 1 x 1 x 192 x 64 + 1 x 1 x 192 x 96 + 3 x 3 x 96 x 128 + 1 x 1 x 192 x 16 + 5 x 5 x 16 x 32 + 0 (max pooling) + 1 x 1 x 192 x 32 = 163k. Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014.
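The two parameter counts on this slide can be reproduced with a few lines of arithmetic. This sketch counts weights only (no biases), as the slide does; the helper name `conv_params` and the list layout are illustrative.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


# Traditional block: one 3x3 convolution, 192 -> 256 channels.
traditional = conv_params(3, 192, 256)

# Inception V1 block with dimension reduction, branch by branch (numbers from the slide).
inception_branches = [
    conv_params(1, 192, 64),                              # 1x1 branch
    conv_params(1, 192, 96) + conv_params(3, 96, 128),    # 1x1 reduce, then 3x3
    conv_params(1, 192, 16) + conv_params(5, 16, 32),     # 1x1 reduce, then 5x5
    0 + conv_params(1, 192, 32),                          # max pooling (no weights), then 1x1
]
inception = sum(inception_branches)

if __name__ == "__main__":
    print(traditional, inception)
```

The output channels of the four branches (64 + 128 + 32 + 32) add up to the same 256 channels that the traditional block produces.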
  • 98. Inception V1, V2, V3 • Use two 3 x 3 convs to replace one 5 x 5 conv. Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014. Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, http://arxiv.org/abs/1502.03167. 2015. Rethinking the Inception Architecture for Computer Vision, http://arxiv.org/abs/1512.00567. 2015.
  • 99. Xception • Depthwise separable convolution • (3 x 3 x 1 x M/7 x 112 x 112) x 7. François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357. 2016-2017.
  • 100. SqueezeNet. Input: F x F x M. Squeeze: • 1x1 convs, output: F x F x S (S < M). Expand: • 1x1 convs, output: F x F x e1 • 3x3 convs, output: F x F x e2. Concat: F x F x (e1 + e2). Forrest N. Iandola, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360. 2016.
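The squeeze/expand structure above translates into a simple weight count. The sketch below compares it with a plain 3x3 convolution producing the same number of output channels; it counts weights only, and the channel sizes in the example are assumed values chosen for illustration, not figures from the slide.

```python
def fire_module_params(m: int, s: int, e1: int, e2: int) -> int:
    """Weights of a squeeze/expand module (no biases).

    m:  input channels
    s:  squeeze 1x1 output channels (s < m)
    e1: expand 1x1 output channels
    e2: expand 3x3 output channels
    """
    squeeze = 1 * 1 * m * s
    expand_1x1 = 1 * 1 * s * e1
    expand_3x3 = 3 * 3 * s * e2
    return squeeze + expand_1x1 + expand_3x3


def plain_3x3_params(m: int, c_out: int) -> int:
    """Weights of a single 3x3 convolution from m to c_out channels."""
    return 3 * 3 * m * c_out


if __name__ == "__main__":
    # Assumed example sizes: 96 input channels, squeeze to 16, expand to 64 + 64.
    fire = fire_module_params(m=96, s=16, e1=64, e2=64)
    plain = plain_3x3_params(m=96, c_out=64 + 64)
    print(fire, plain, plain / fire)
```

Most of the saving comes from the squeeze step: the expensive 3x3 filters see only S input channels instead of M.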
  • 101. MobileNet V1 • Standard convolution • Depthwise separable conv: (1) depthwise conv: one filter takes one input channel; (2) pointwise conv: 1x1 convs • Computation reduction. Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. https://arxiv.org/abs/1704.04861?context=cs. 2017.
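The computation reduction from depthwise separable convolution can be checked numerically. This is a minimal sketch that counts multiplications only, with `d_k` the kernel size, `m` and `n` the input and output channel counts, and `d_f` the feature-map size; the function names and the example sizes are illustrative.

```python
from fractions import Fraction


def standard_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiplications of a standard d_k x d_k convolution."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Depthwise step (one filter per input channel) plus pointwise 1x1 step."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


def reduction_ratio(d_k: int, m: int, n: int, d_f: int) -> Fraction:
    """Separable cost divided by standard cost; equals 1/n + 1/d_k**2."""
    return Fraction(depthwise_separable_cost(d_k, m, n, d_f),
                    standard_cost(d_k, m, n, d_f))


if __name__ == "__main__":
    # Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
    print(float(reduction_ratio(3, 512, 512, 14)))
```

For 3x3 kernels the ratio is dominated by the 1/9 term, so the separable layer needs roughly eight to nine times fewer multiplications than the standard one.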
  • 103. MobileNet V1 • Use conv with stride = 2 to replace pooling • Add two hyperparameters: width multiplier α and resolution multiplier ρ • α = 1.0, 0.75, 0.5, 0.25 • Standard MobileNet when α = 1
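The effect of the two multipliers on the cost of one depthwise separable layer can be sketched as follows. The width multiplier scales both channel counts and the resolution multiplier scales the feature-map size, as described in the MobileNets paper cited on the previous slide; the function name, the rounding of scaled sizes to integers, and the example layer are assumptions made for illustration.

```python
def mobilenet_layer_cost(d_k: int, m: int, n: int, d_f: int,
                         alpha: float = 1.0, rho: float = 1.0) -> int:
    """Multiplications of one depthwise separable layer with multipliers applied.

    alpha: width multiplier, scales input and output channels
    rho:   resolution multiplier, scales the feature-map size
    """
    m_a = int(round(alpha * m))    # thinned input channels
    n_a = int(round(alpha * n))    # thinned output channels
    d_r = int(round(rho * d_f))    # reduced feature-map size
    depthwise = d_k * d_k * m_a * d_r * d_r
    pointwise = m_a * n_a * d_r * d_r
    return depthwise + pointwise


if __name__ == "__main__":
    base = mobilenet_layer_cost(3, 32, 64, 112)
    for alpha in (1.0, 0.75, 0.5, 0.25):
        cost = mobilenet_layer_cost(3, 32, 64, 112, alpha=alpha)
        print(alpha, cost, cost / base)
```

Because the pointwise term dominates, the cost falls roughly with α squared, and it falls with ρ squared for any layer.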
  • 104. MobileNet V2 (compared with MobileNetV1) • Inverted residual block: increase the dimension (# channels), then reduce it • Linear bottlenecks: remove the nonlinear activation in the low-dimensional layer. Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. https://arxiv.org/abs/1801.04381. 2018.
  • 105. ShuffleNet V1 • Pointwise group convolution (1 x 1 conv) with g groups • Channel shuffle: helps information flow across feature channels • Use a concat operation to concatenate two different channels. Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
  • 106. ShuffleNet V1 Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
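The channel shuffle operation is commonly written as a reshape, a transpose and a second reshape. Below is a minimal NumPy sketch, assuming a channels-first (N, C, H, W) layout; the function name and the toy tensor are illustrative.

```python
import numpy as np


def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels so that each group feeds every group of the next layer.

    x has shape (N, C, H, W) and C must be divisible by `groups`.
    """
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    x = x.reshape(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(0, 2, 1, 3, 4)                # swap group and per-group axes
    return x.reshape(n, c, h, w)                  # flatten back to C channels


if __name__ == "__main__":
    # Toy tensor whose channel i is filled with the value i: 6 channels, 2 groups.
    x = np.arange(6, dtype=np.float32).reshape(1, 6, 1, 1)
    print(channel_shuffle(x, groups=2).ravel())
```

Without this step, stacked group convolutions would keep each group's channels isolated from the others.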
  • 108. ShuffleNet V2. Reduce memory access cost: • Channel split (2g) • Remove group convolution • Put the channel shuffle module after channel concatenation. Ningning Ma et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164. 2018.
  • 109. Experiments - Classification
    Model | mAP (%) | Precision (%) | Recall (%) | Size (MB) | Computation speed (ms/photo)
    Server-based + Yolov2 | 99.62 | 99.60 | 99.65 | N/A | N/A
    1.00x ShuffleNet V2 + Yolov2 | 96.43 | 97.16 | 96.83 | 5.20 | 80.00
    0.50x ShuffleNet V2 + Yolov2 | 95.86 | 97.28 | 96.28 | 1.70 | 40.00
    0.50x ShuffleNet V2 + SSD | 97.73 | 90.61 | 97.98 | 7.90 | 65.00
    0.25x ShuffleNet V2 + SSD | 97.25 | 90.46 | 97.59 | 5.00 | 45.00

    Category | Abbreviation
    Front page of ID card | id_card_f
    Back page of ID card | id_card_b
    Front page of driver license | driver_license_f
    Back page of driver license | driver_license_b
    Front of main page in car license front | car_license_f
    Back of main page in car license front | car_license_b
    Supplementary page in car license | vehicle_license
    Real car photo (whole car) | Whole car
    Real car photo (car plate) | plate
  • 110. Experiments - Classification
    #positive photos: 8K; #negative photos: 8K
    Version | Backbone | Detection method | Size (MB) | mAP (%) | Precision (%) | Recall (%) | Error detection rate (%)
    Floating-point version | 0.5*ShuffleNet V2 | YoloV2 | 1.70 | 97.86 | 98.81 | 98.00 | 0.125
    Fixed-point version | 0.5*ShuffleNet V2 | YoloV2 | 0.40 | 97.82 | 98.82 | 97.97 | 0.0625

    #positive photos: 8K (the first Precision/Recall pair averages to the floating-point row above, the second pair to the fixed-point row)
    Category | Precision (%) | Recall (%) | Precision (%) | Recall (%)
    car | 98.87 | 96.41 | 98.97 | 96.11
    car_license_b | 98.70 | 99.00 | 99.10 | 99.00
    car_license_f | 99.90 | 97.70 | 99.80 | 97.90
    driver_license_b | 99.80 | 98.90 | 99.80 | 99.00
    driver_license_f | 99.49 | 98.50 | 99.19 | 98.50
    id_card_b | 99.90 | 99.00 | 99.90 | 99.00
    id_card_f | 99.50 | 99.10 | 99.50 | 99.10
    plate | 93.82 | 94.29 | 93.71 | 93.99
    vehicle_license | 99.30 | 99.10 | 99.40 | 99.10
    Average | 98.81 | 98.00 | 98.82 | 97.97
  • 111. Experiments - Embedded OCR • Use ShuffleNet to replace ResNet50 as the backbone
  • 112. Part III: Application. Haifeng SHEN, DiDi AI Labs; Guangyu LI, DiDi AI Labs
  • 114. Driver identification • Application • Overview • Experiments
  • 115. Application - Pay by smiling • In Sep. 2017, Alibaba's Ant Financial affiliate and KFC China announced that facial-recognition payment was available to customers at the fast-food chain's new KPRO store in Hangzhou. • The "Smile to Pay" facial-recognition payment solution at KFC enables customers to pay without their wallets. https://www.jrzj.com/194328.html
  • 116. Application - Check-in at station. Taiyuan South railway station; Beijing West railway station; Shanghai metro station. https://baijiahao.baidu.com/s?id=1552314447507461&wfr=spider&for=pc http://www.sohu.com/a/220124437_99966914 http://dy.163.com/v2/article/detail/D5U3QH2P0525KG01.html
  • 117. Application - Pedestrian monitoring. Ningbo City uses face recognition for transportation surveillance and pedestrian monitoring. http://www.sohu.com/a/168709903_728989
  • 118. Application - Driver monitoring https://www.sohu.com/a/253263266_649849
  • 119. Application - Other uses https://www.globalrailwayreview.com/article/66120/train-stations-facial-recognition/ https://image.baidu.com/
  • 121. Overview - Features. Biometric: “You are your own key”. Natural; un-perceivable; contact-less; multiple. https://image.baidu.com/
  • 122. Overview - Challenges Inter-class similarity https://image.baidu.com/
  • 123. Overview - Challenges. Intra-class variability: illumination, expression, occlusion, age, pose, other. Similarity = 0.18. https://image.baidu.com/
  • 126. Overview - Detection & landmark dataset
    Face detection dataset | Available | # faces | # images | Website | Remarks
    FDDB | Public | 5,171 | 2,845 | http://vis-www.cs.umass.edu/fddb/ | Unconstrained faces
    WiderFace | Public | 393,703 | 32,203 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace | Easy, Medium and Hard sets; a high degree of variability in scale, pose and occlusion
    MALF | Public | 11,931 | 5,250 | http://www.cbsr.ia.ac.cn/faceevaluation/ | Multi-Attribute Labelled Faces; bounding box, pose and facial attributes
    Caltech 10,000 Web Faces | Public | - | 10,524 | http://www.vision.caltech.edu/Image_Datasets/Caltech_10K_WebFaces/ | Collected from Google image search; 4 landmarks (two eyes, nose and mouth)
    PUT | Public | - | 9,971 | http://biometrics.put.poznan.pl/put-face-database/ | 30 landmarks, 194 contour points
    AFLW | Public | - | 25,993 | https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | Collected from Flickr; 21 landmarks
  • 127. Overview - Detection - MTCNN • Proposes a deep cascaded multi-task framework with three stages: P-Net, R-Net and O-Net • Each stage is a shallow network • P-Net: proposal network, produces candidate windows quickly through a shallow CNN • R-Net: refine network, refines the candidates and rejects a large number of non-face windows through a more complex CNN • O-Net: output network, uses a more powerful CNN to refine the result and output facial landmark positions. Kaipeng Zhang et al. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. https://arxiv.org/abs/1604.02878v1. 2016.
  • 128. Overview - Detection - Face R-FCN • The framework is based on R-FCN • Proposes a region-based face detector that applies deep networks in a fully convolutional fashion • Introduces additional smaller anchors and reduces the position-sensitive RoI pooling to a smaller size to suit the detection of tiny faces • Proposes position-sensitive average pooling instead of normal average pooling for the last feature voting in R-FCN • Uses a multi-scale training strategy and an online hard example mining (OHEM) strategy. Yitong Wang et al. Detecting Faces Using Region-based Fully Convolutional Networks. https://arxiv.org/abs/1709.05256. 2017.
  • 129. Overview - Detection - PyramidBox • Proposed by Baidu • Extends the VGG16 backbone and generates feature maps at different levels • Generates a series of anchors corresponding to larger regions related to a face that contain more contextual information, such as the head, shoulders and body. Xu Tang et al. PyramidBox: A Context-assisted Single Shot Face Detector. https://arxiv.org/abs/1803.07737?context=cs. 2018.
  • 130. Overview - Recognition - Dataset
    Dataset | Available | # People | # images | Website | Remarks
    LFW | Public | 5K | 13K | http://vis-www.cs.umass.edu/lfw/#views | Labeled Faces in the Wild
    YFD | Public | 1.5K | 3.4K (video) | https://www.cs.tau.ac.il/~wolf/ytfaces/ | YouTube Faces Database
    CelebA (CelebFaces Attributes Dataset) | Public | 10K | 202K | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | Multimedia Lab, The Chinese University of Hong Kong
    CASIA-WebFace | Public | 10K | 500K | http://www.cbsr.ia.ac.cn/english/CASIA-WebFace/CASIA-WebFace_Agreements.pdf |
    MS-Celeb-1M | Public | 100K | 10M | https://www.msceleb.org |
    VGGFace2 | Public | 9K | 3.3M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/ | Downloaded from Google Image Search; large variations in pose, age, illumination, ethnicity and profession
    Facebook | Private | 4K | 4,400K | N/A |
    Google | Private | 8,000K | 100-200M | N/A |
  • 131. Overview - Recognition - Milestones. 1888: Galton, Nature; 1910: Galton, Nature; 1965: Chan, Bledsoe, AFR; 1991: Turk M. A., Eigenfaces; 1997: Belhumeur P., Fisherface; 2002: Liu C., Gabor feature; 2006: Ahonen T., LBP; 2009: Wright J., Sparse representation; 2013: Chen D., High-dim LBP; 2014: Sun Yi, DeepID; 2014: Facebook, DeepFace; 2015: Oxford, VGGFace; 2015: Google, FaceNet; 2015: Baidu, EnsembleFace; 2016: EffectiveFace; 2017: SphereFace; 2018: ArcFace; 2019: Combined loss
  • 132. Overview - Recognition - Results
    Time | Method | Training size | Method description | LFW | Comments
    1991 | Eigenfaces | < 10k | Principal component analysis (PCA) | 60.02% |
    2006 | LBP + CSML | < 10k | Local binary pattern (LBP) + metric learning | 85.57% |
    2013 | High-dim LBP | 0.1m | High-dim LBP + Joint Bayesian | 95.17% |
    2014 | DeepFace | 4m | CNN + 3D face alignment | 97.35% | Facebook
    2014 | DeepID | 0.2m | CNN + Softmax | 97.45% | CUHK
    2015 | VGGFace | 2.6m | VGG + Softmax | 98.95% | Oxford
    2015 | FaceNet | 200m | Inception + Triplet loss | 99.63% | Google
    2015 | Ensemble face | 1.2m | CNN + Multi-patch + Deep metric | 99.77% | Baidu
    2016 | Effective face | 2.5m | CNN + Augmentation | 98.06% | Pose + Shape + Expression
    2017 | SphereFace | 0.5m | CNN + Angular-Softmax | 99.42% | Multiplicative angular margin: cos(mθ)
    2018 | ArcFace | 6.8m | CNN + Additive angular margin | 99.83% | Additive angular margin: cos(θ + m)
    2019 | Combined loss | N/A | | | cos(m1θ + m2) − m3
  • 133. Overview - Recognition - DeepFace • CNN + DNN structure • L4 to L6 are locally connected layers without weight sharing, rather than standard convolutional layers • The last two layers, F7 and F8, are fully connected • Employs 3D face modeling and an affine transformation for 3D face alignment to obtain the frontal face • More than 120 million parameters • Trained on four million facial images belonging to more than 4,000 identities. Yaniv Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. https://ieeexplore.ieee.org/document/6909616. CVPR 2014.
  • 134. Overview - Recognition - DeepID • Uses a face patch method, with one ConvNet per patch • Each ConvNet has 4 layers • 60 face patches: ten regions, three scales, and RGB or gray channels • Each of the 60 ConvNets extracts two 160-dimensional vectors (the patch and its flipped counterpart), giving a 19,200-dimensional vector in total for face verification • Achieves 97.45% face verification accuracy on LFW • Building on DeepID1, the Chinese University of Hong Kong developed DeepID2 and DeepID3. Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Representation from Predicting 10,000 Classes. https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Sun_Deep_Learning_Face_2014_CVPR_paper.pdf. CVPR 2014.
  • 135. Overview - Recognition - FaceNet • Proposed by Google • Directly uses a deep convolutional network • Uses triplet loss for training: it minimizes the distance between an anchor and a positive, which share the same identity, and maximizes the distance between the anchor and a negative of a different identity • Uses the Euclidean distance to measure face similarity for verification. Florian Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/abs/1503.03832. CVPR 2015.
  • 136. Overview - Recognition - Ensemble Face • Multi-patch feature extraction • 9 image patches, each centered at a different landmark in the face region • Each patch network: 9 convolution layers and a softmax layer at the end • The last convolution layers of all networks are concatenated to build a high-dimensional feature for the face representation • A metric learning method with triplet loss reduces the feature to 128/256 dimensions • Achieves 99.77% accuracy on LFW under the 6,000-pair evaluation protocol. Jingtuo Liu et al. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. https://arxiv.org/pdf/1506.07310. 2015.
  • 137. Overview - Recognition - Effective Face • Uses a single VGGNet with 19 layers • Trained on both real and augmented data • Uses the CASIA WebFace collection and generates artificial data by introducing pose, shape and expression variations. Iacopo Masi et al. Do We Really Need to Collect Millions of Faces for Effective Face Recognition? https://arxiv.org/abs/1603.07057. CVPR 2016.
  • 138. Overview - Recognition - Combined loss. Multiplicative angular margin: cos(mθ). Additive angular margin: cos(θ + m). Additive cosine margin: cos(θ) − m. Combined loss: cos(m1θ + m2) − m3. Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698. 2019.
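The three margins can be viewed as special cases of the combined form cos(m1θ + m2) − m3 applied to the target-class logit. The sketch below only evaluates that expression; it is an assumption-laden illustration that omits the feature scaling, the softmax, and the handling needed when m1θ + m2 exceeds π, and the function name and margin values are illustrative.

```python
import numpy as np


def margin_logit(cos_theta, m1: float = 1.0, m2: float = 0.0, m3: float = 0.0):
    """Target-class logit cos(m1 * theta + m2) - m3, given cos(theta).

    m1 > 1, m2 = 0, m3 = 0 : multiplicative angular margin, cos(m * theta)
    m1 = 1, m2 > 0, m3 = 0 : additive angular margin, cos(theta + m)
    m1 = 1, m2 = 0, m3 > 0 : additive cosine margin, cos(theta) - m
    """
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))   # recover the angle
    return np.cos(m1 * theta + m2) - m3


if __name__ == "__main__":
    c = 0.5   # cosine between a feature and its class centre (theta = 60 degrees)
    print(margin_logit(c))                # no margin
    print(margin_logit(c, m1=2.0))        # multiplicative angular margin
    print(margin_logit(c, m2=0.5))        # additive angular margin
    print(margin_logit(c, m3=0.35))       # additive cosine margin
```

Each margin lowers the target logit for the same angle, so the network has to pull features closer to their class centre to reach the same score.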
  • 139. Experiments - Combined loss
    Test set | Feature | softmax | sphereface | cosface | arcface | Combined loss
    LFW | public | 98.75 | 99.52 | 99.50 | 99.55 | 99.60
    7k | private | 93.60 | 95.45 | 95.90 | 96.72 | 97.13
    50k | private | 93.28 | 95.93 | 95.50 | 97.08 | 96.90
    zc | private | 99.18 | 99.37 | 99.45 | 99.57 | 99.52
    avg | | 96.20 | 97.57 | 97.59 | 98.23 | 98.29
    • 7k/50k: the test sets are extracted from the registered driver photo database. 3K positive pairs and 3K negative pairs are randomly selected from the 7k/50k drivers respectively.
    • zc: the test set is randomly extracted from the premier driver photo database. 3K positive pairs and 3K negative pairs are randomly selected for testing.
  • 140. Experiments - Virtual learning • Improves performance over the baseline softmax on both the LFW and SLLFW datasets, e.g. from 99.10% to 99.46% and from 94.59% to 95.85%, respectively. Binghui Chen, Weihong Deng, Haifeng Shen. Virtual Class Enhanced Discriminative Embedding Learning. https://arxiv.org/abs/1811.12611. 2018.
  • 141. Experiments - Fast face detection
    Figure: network diagram with feature maps C3 to C11 (spatial sizes 80x80, 40x40, 20x20, 20x20, 10x10, 5x5, 3x3, 2x2, 1x1), upsampling, multiscale feature fusion, object detection, detection result.
    • Multiscale features: C3 + C4 + C5 + C7 + Conv9 + Conv11
    • Combine up-sampling features: C3 + C3’, C4 + C4’, C5 + C5’
    • Support batch image computation
    • TensorRT optimization

    Speed (ms/frame) | Batch size=1 | Batch size=64 | Batch size=100
    Original | 22 | 12 | N/A
    FP32 | 17 | 7 | 7
    INT8 | 13 | 4 | 4

    GPU memory (GB/frame) | Batch size=1 | Batch size=64 | Batch size=100
    Original | 1.40 | 0.188 | N/A
    FP32 | 0.57 | 0.070 | 0.066
    INT8 | 0.48 | 0.039 | 0.030

    Detection (%) | Precision | Recall | F-score
    Original | 97.90 | 97.00 | 97.47
    FP32 | 97.90 | 97.10 | 97.48
    INT8 | 97.85 | 96.96 | 97.40
  • 142. Experiments - Face detection • WIDER FACE is a face detection benchmark dataset, collected from the publicly available WIDER dataset • It contains 32,203 images and 393,703 labeled faces with a high degree of variability in scale, pose and occlusion, as depicted in the sample images • We propose the DFS method, which uses semantically fused feature maps as contextual cues and constructs a semantic segmentation branch for training supervision, to learn the best representations • Won 5 rank-1 results in April 2019. WIDER FACE: http://shuoyang1213.me/WIDERFACE/index.html Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, et al. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. https://arxiv.org/abs/1811.08557. 2018-2019.
  • 144. What can we learn from driving scenarios? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
  • 145. Driving Scenarios vs. General Computer Vision. Data: • Multi-modal (i.e. multiple sensors, including camera, LiDAR, GPS, IMU, etc.) • Collected in 3D open areas (not indoor/lab environments) • Ego-centric / first person. Requirements. Opportunities.
  • 146. Vision Perception in Driving Scenario. What does Vision Perception do: detect, segment, track and classify objects of interest in driving scenarios. Main components: • Pedestrian • Vehicle • Road • Traffic Sign / Light
  • 147. Vision Perception – Pedestrian Detection
  • 148. Vision Perception – Pedestrian Detection
  • 149. Vision Perception – Pedestrian Detection. Pedestrian detection at 100 FPS • Uses cascades • Fast features • Not a CNN-based model. Benenson et al. ’12, “VeryFast”: 100+ FPS detector, no CNNs.
  • 150. Vision Perception – Pedestrian Detection. Real-time pedestrian detection with CNNs • Uses cascades • Uses fast non-CNN features • Uses CNNs for maximum accuracy with minimal loss of speed. Angelova et al. ’15, “DeepCascades”: real-time (15 FPS) with CNNs.
  • 151. Vision Perception – Pedestrian Detection. Occlusion-aware pedestrian detection • Aggregation loss (enforces proposals to be close together and compactly located) • Occlusion-aware region of interest (PORoI) (integrates prior structure information of the human body to handle occlusion) • Based on Faster R-CNN. Zhang et al. ’18, “OR-CNN”: state of the art (as of April 2019).
  • 152. Vision Perception – Vehicle Detection
  • 153. Vision Perception – Vehicle Detection. Vehicle detection in 3D from an image • Directly from a 2D image • Proposal generation as energy minimization • Orientation estimation network. Chen et al. ’16, “3D Bounding Box”: a breakthrough for 3D detection from a mono image.
  • 154. Vision Perception – Vehicle Detection. Multi-view 3D object detection • Multi-sensor fusion. Chen et al. ’17, “MV3D”: impressive accuracy gain from multi-sensor fusion.
  • 155. Vision Perception – Vehicle Detection. Multi-level fusion based 3D object detection from mono images • Simultaneously proposes 2D regions (RPN) and predicts 3D location, orientation and dimensions. Xu et al. ’18, “Multi-level Fusion”: state of the art for 3D detection from mono camera images.
  • 156. Vision Perception – Road Segmentation. Joint semantic prediction • KITTI Road Detection top performance in 2017 • Multi-task framework • Real-time • Uses RGB images only. Teichmann et al. ’17, “MultiNet”: speed + accuracy with RGB images only.
  • 157. Vision Perception – Road Segmentation. LIDAR-camera fusion • KITTI Road Detection top performance in 2018 • Cross-fusion mechanism with FCN. Caltagirone et al. ’18, “LidCamNet”: LIDAR-camera fusion rules.
  • 158. Vision Perception – Road Segmentation. LIDAR-camera fusion with LIDAR adaptation • KITTI Road Detection current top performance • Progressive LIDAR adaptation. Chen et al. ’19, “PLARD”: state-of-the-art performance.
  • 159. Vision Perception – Road Segmentation. State of the art on KITTI (as of April 2019)
  • 160. Vision Perception – Traffic Sign Detection. IJCNN 2011 Traffic Sign Recognition Competition • Ciresan et al. ’11: 0.56% error • Human: 1.16% error • Non-CNN: 3.86% error. Ciresan et al. ’11, “Traffic Sign Recognition”: traffic sign recognition is easy (super-human performance).
  • 161. Vision Perception – Traffic Sign Detection. Detecting small signs in large images • Break the large image into small patches • Small-Object-Sensitive CNN (SOS-CNN) • Based on SSD. Meng et al. ’17, “SOS-CNN”: handles small objects.
  • 162. What can we learn from driving scenarios? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
  • 163. 3D Reconstruction in Driving Scenario. What does 3D Reconstruction do: recover the real-world location and pose of driving scenario objects (2D to 3D). Main components. 5 mins of theoretical background (a little math).
  • 164. 3D Reconstruction – Theoretical Background • Perspective Projection
  • 165. 3D Reconstruction – Theoretical Background • Internal Camera Parameters
  • 166. 3D Reconstruction – Theoretical Background • External Camera Parameters
  • 167. 3D Reconstruction – Theoretical Background • Camera Model for Perspective Projection
  • 168. 3D Reconstruction – Theoretical Background • A Block Diagram
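The slide images for the camera model are not recoverable from the transcript, so the following is only a standard pinhole-camera sketch of the pipeline these slides name: external parameters (rotation R, translation t) map world points into the camera frame, internal parameters (matrix K) map them to homogeneous pixel coordinates, and perspective division gives pixels. The function name and all numeric values are assumptions for illustration; lens distortion is ignored.

```python
import numpy as np


def project_points(points_world: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project (N, 3) world points to (N, 2) pixel coordinates.

    K: (3, 3) internal (intrinsic) matrix
    R: (3, 3) rotation, t: (3,) translation (external parameters)
    Points are assumed to lie in front of the camera (positive depth).
    """
    points_cam = points_world @ R.T + t          # world frame -> camera frame
    uvw = points_cam @ K.T                       # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]              # perspective division by depth


if __name__ == "__main__":
    # Assumed intrinsics: focal length 1000 px, principal point (640, 360).
    K = np.array([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)       # camera axes aligned with the world axes
    t = np.zeros(3)     # camera at the world origin
    pts = np.array([[0.0, 0.0, 10.0],    # on the optical axis, 10 m ahead
                    [1.0, 0.0, 10.0]])   # 1 m to the right at the same depth
    print(project_points(pts, K, R, t))
```

Going from 2D back to 3D, which is the reconstruction task, requires inverting this mapping, and that needs extra information such as depth, a second view, or scene priors.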
  • 169. 3D Reconstruction – Semantic Reconstruction. Kundu et al. ’14, “Joint semantic and 3D reconstruction from monocular video”: semantic + 3D reconstruction from a mono camera.
  • 170. 3D Reconstruction – Semantic Reconstruction. Cherabier et al. ’16, “Multi-label semantic 3D reconstruction using voxel blocks”: efficient dense semantic + 3D reconstruction.
  • 171. What can we learn from driving scenarios? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
  • 172. Driving Scenario Understanding. Honda Research Institute Driving Dataset • 104 hours of real human driving records • Driving behavior and causal reasoning annotation. Ramanishka et al. ’18, “HDD”: first dataset towards driving scenario understanding.
  • 173. Driving Scenario Understanding. Driving attention prediction from video • Focuses on the driver’s attention • In-car vs. in-lab test. Xia et al. ’18, “Predicting Driver Attention”: introduces attention heat maps.
  • 174. Related Datasets. Comparison table of driving datasets [1] to [7] (including HDD) and D2-City [8]. Recoverable entries: purpose: driving behavior & causal reasoning / traffic participants detection & tracking; sensors: camera, GPS, IMU; 95.9; scenarios: suburban, urban and highway.
  • 175. GAIA Open Dataset • Dataset: D²-City Dataset • D²-City is a large-scale driving video dataset that provides more than 10k videos recorded in 720p HD or 1080p FHD from front-facing dashcams, with annotations for object detection and tracking. • 1k videos: annotation of the bounding boxes and tracking IDs of road objects in 12 different categories. • 9k videos: annotation of the bounding boxes in key frames.
  • 176. Q & A