Computer vision for transportation

ICME2019 Tutorial: Computer vision for transportation

Published in: Internet
Computer vision for transportation

1. 1. Haifeng SHEN (DiDi AI Labs), Zhengping CHE (DiDi AI Labs), Guangyu LI (DiDi AI Labs), Yuhong GUO (DiDi AI Labs, Carleton University), Jieping YE (DiDi AI Labs, Univ. of Michigan, Ann Arbor)
2. 2. Part I: Introduction to Computer Vision Zhengping CHE, DiDi AI Labs
3. 3. • Computer Vision Basics • Image Classification • Object Detection Introduction to Computer Vision
4. 4. Computer Vision Basics • Representation Learning • Activation Functions • Neural Network Structures • Convolution Operators • Pooling Layers • Batch Normalization
5. 5. Representation Learning http://kaiminghe.com/cvpr17tutorial/cvpr2017_tutorial_kaiminghe.pdf
6. 6. Neural Network Structures Convolutional Neural Network Deep Neural Network Different Neural Networks Top/Middle-left: http://cs231n.github.io/convolutional-networks/ Bottom-left: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Right: http://www.asimovinstitute.org/neural-network-zoo/ Recurrent Neural Network
7. 7. Activation Functions Top: https://theffork.com/activation-functions-in-neural-networks/ Bottom: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture04.pdf
8. 8. Convolution Operators • Traditional operators • Sobel operator, vertical: [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] • Sobel operator, horizontal: [[-1, -2, -1], [0, 0, 0], [1, 2, 1]] • Laplacian operator: [[0, -1, 0], [-1, 4, -1], [0, -1, 0]] and [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]] • Convolution operation (figure) Right: http://cs231n.github.io/convolutional-networks/
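To make the traditional operators concrete, here is a minimal NumPy sketch that slides the two Sobel kernels from the slide over a toy image. The helper name `correlate2d_valid` and the toy image are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

# Sobel kernels as listed on the slide.
SOBEL_VERTICAL = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype=float)
SOBEL_HORIZONTAL = np.array([[-1, -2, -1],
                             [0, 0, 0],
                             [1, 2, 1]], dtype=float)


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding.

    This is cross-correlation, which is what CNN "convolution" layers compute.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy image (hypothetical): dark left half, bright right half, one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

vertical_response = correlate2d_valid(image, SOBEL_VERTICAL)
horizontal_response = correlate2d_valid(image, SOBEL_HORIZONTAL)
```

The vertical kernel responds only where the window straddles the edge, while the horizontal kernel gives zero everywhere because the toy image does not change from row to row.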
9. 9. Convolution Operators (cont’d) Left: Jifeng Dai, et al., Deformable Convolutional Networks, 2017 Right: https://towardsdatascience.com/review-drn-dilated-residual-networks-image-classification-semantic-segmentation-d527e1a8fb5/ Fisher Yu, et al., Multi-Scale Context Aggregation by Dilated Convolutions, 2016 Dilated Convolution Standard Convolution (dilation rate = 1) Dilated Convolution (dilation rate = 2) Deformable Convolution Standard Convolution Deformable Convolution Deform. Conv. with Scaling Deform. Conv. with Rotation
10. 10. Pooling Layers Top-left: http://deeplearning.stanford.edu/tutorial/supervised/Pooling/ Bottom-left: Matthew D. Zeiler, et al., Visualizing and Understanding Convolutional Networks, 2014 Right: http://fractalytics.io/rooftop-detection-with-keras-tensorflow/ Different Pooling Operations Unpooling Pooling
11. 11. Pooling Layers (Cont’d) Corner Pooling Atrous Spatial Pyramid Pooling Right: Hei Law, et al., Detecting Objects as Paired Keypoints, 2018 Top-left: Kaiming He, et al., Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, 2015 Bottom-left: Liang-Chieh Chen,et al., Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2017 Spatial Pyramid Pooling
12. 12. Batch Normalization Pipeline (figure): FC Layer → Normalization → Scale & Shift → Activation Top-left: http://gradientscience.org/batchnorm/ Sergey Ioffe, et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015 Bottom: Yuxin Wu, et al., Group Normalization, 2018
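The normalization and scale-and-shift steps named on this slide can be sketched as below. This is a training-time forward pass only, following the usual formulation from Ioffe et al.; the function name, `eps` value, and sample data are assumptions made for illustration.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x has shape (batch, features); gamma and beta have shape (features,).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalization
    return gamma * x_hat + beta              # scale & shift


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # hypothetical FC-layer outputs
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, every output feature has approximately zero mean and unit variance across the batch. A full layer would also track running statistics for inference, which this sketch omits.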
13. 13. Image Classification • Datasets & Competitions • Roadmap • Classification Networks • Experiments
14. 14. Image Classification Datasets & Competitions ImageNet, ILSVRC 2009-2017 ImageNet: http://www.image-net.org/ Second figure: https://principlesofdeeplearning.com/index.php/is-deep-learning-getting-too-deep/ Human
15. 15. Datasets & Competitions (Cont’d) MNIST CIFAR-10 & CIFAR-100 Dogs vs. Cats Stanford Cars iNaturalist Competition Plant Seedlings Classification http://yann.lecun.com/exdb/mnist/ https://www.cs.toronto.edu/~kriz/cifar.html https://www.kaggle.com/c/dogs-vs-cats https://ai.stanford.edu/~jkrause/cars/car_dataset.html https://sites.google.com/view/fgvc5/competitions/inaturalist https://www.kaggle.com/c/plant-seedlings-classification
16. 16. Image Classification Roadmap … 1998 2012 2014 2015 2016 2017 LeNet VGGNet ResNet SENet AlexNet GoogLeNet DenseNet 2018 DLA
17. 17. LeNet LeNet-5 (1998) • A neural network architecture for handwritten and machine-printed character recognition in 1990s • Consists of seven layers including • Convolution operations • Pooling operations • Full connections Yann LeCun, et al., Gradient-Based Learning Applied to Document Recognition, 1998 Bottom-right: https://engmrk.com/lenet-5-a-classic-cnn-architecture/
18. 18. AlexNet AlexNet (2012) • ILSVRC 2012 winner (16.4% top-5 error) • 60 million parameters and 650,000 neurons • 8 learned layers: 5 convolutional and 3 fully-connected layers • A 1000-way softmax layer after the last fully-connected layer • Dropout and ReLU • Trained in parallel on 2 GPUs Alex Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012 Bottom-right: Nitish Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014
19. 19. VGGNet VGGNet (2014) • Six configurations, each with 5 groups of convolutions, 11 to 19 layers deep • VGG16 (138 million parameters) and VGG19 • Only 3x3 conv and 2x2 max-pooling layers before FC layers • Results @ ILSVRC 2014 • 1st in localization task • 2nd in classification task (7.3% top-5 error) Karen Simonyan, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
20. 20. GoogLeNet • ILSVRC 2014 winner (6.7% top-5 error) • 22 layers with only 5 million model parameters • Inception concept • Multiple conv kernels including 1x1, 3x3, and 5x5 • 1x1 kernel for dimension reduction • Better representational power + fewer network parameters • More advanced Inception modules (V2, V3, and V4) Inception-V1 Module GoogLeNet (2014) Christian Szegedy, et al., Going Deeper with Convolutions, 2015
21. 21. ResNet • 1st place on the ILSVRC 2015 classification task (3.6% top-5 error) • Deeper model with fewer filters and lower complexity • 34-layer baseline • 3.6 billion FLOPs • only 18% of VGG-19 (19.6 billion FLOPs) • Up to 152 layers! • Initialization, batchnorm, residual block… ResNet Block ResNet (2015, top) Kaiming He, et al., Deep Residual Learning for Image Recognition, 2016 http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
22. 22. DenseNet • $L(L+1)/2$ direct connections for $L$ layers • Fewer parameters and less computation DenseNet Block DenseNet (2016) $x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$ Gao Huang, et al., Densely Connected Convolutional Networks, 2016
23. 23. SENet • ILSVRC 2017 winner (2.251% top-5 error) • Squeeze-and-excitation block • Squeeze: Global average pooling • Excitation: Channel association • Scale: Channel attention • Integration with modern architectures Squeeze-and-Excitation Block SENet (2017) Jie Hu, et al., Squeeze-and-Excitation Networks, 2018
24. 24. DLA: Deep Layer Aggregation DLA (2018) • Layer aggregation to better fuse information • Iterative deep aggregation (IDA) • Semantic fusion • Resolutions and scales • Hierarchical deep aggregation (HDA) • Spatial fusion • Channels and depths (modules) Fisher Yu, et al., Deep Layer Aggregation, 2018
25. 25. Classification Experiments • Dataset-1 • 193186 images of 66 classes • Collected offline • Dataset-2 • 549169 images of 2506 classes • Collected offline + online • Similar settings to the Stanford Cars dataset
    Classification accuracy:
    Method | Car brand classification with 66 classes | Car brand classification with 2506 classes
    ResNet | 94.60% | -
    SENet | 92.30% | -
    DLA | 96.02% | 93.75%
26. 26. Object Detection • Introduction & Roadmap • Region-Based Methods • Region-Free Methods • Experiments
27. 27. Object Detection Introduction Top-Left: http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf Top-Right: https://www.hackerearth.com/blog/developers/object-detection-for-self-driving-cars/ MS COCO http://cocodataset.org/#home Open Images https://storage.googleapis.com/openimages/web/index.html http://host.robots.ox.ac.uk/pascal/VOC/ Pascal VOC ImageNet http://www.image-net.org/
28. 28. Object Detection Roadmap … 2014 2015 2016 2017 2018 R-CNN SPPNet Fast R-CNN Faster R-CNN R-FCN FPN SNIPER YOLOv1 SSD DSSD RetinaNet RefineDet CornerNet YOLOv3 Light-Head R-CNN Cascade R-CNN SNIP Region-Based Detection Region-Free Detection YOLOv2 Left: Zhengxia Zou, et al., Object Detection in 20 Years: A Survey, 2019
29. 29. Region-Based / Region-Free Methods • Region-based detection • Two-stage method • Higher accuracy • Lower speed • Complex computation • R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, FPN, Cascade R-CNN, SNIP, SNIPER… • Region-free detection • One-stage method • Lower accuracy • Faster speed • Light computation • YOLO, SSD, DSSD, RetinaNet, RefineDet, CornerNet… Jonathan Huang, et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017
30. 30. R-CNN: Regions with CNN Features • Selective Search + CNN + SVM • Start to use CNN features instead of the traditional features • ~2k bottom-up region proposals from selective search • Time consuming • Extracting feature for every proposal separately Ross Girshick, et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 Bottom-Right: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf R-CNN (2014)
31. 31. Fast R-CNN • One image + multiple RoIs + a fully CNN • RoI pooling: to generate fixed-size feature vector for each proposal • Outputs: softmax probabilities + bounding-box regression offsets • End-to-end training with a multi-task loss Fast R-CNN (2015) Right: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf Ross Girshick, Fast R-CNN, 2015
32. 32. Faster R-CNN • Region proposal network (RPN) + Fast R-CNN • RPN & detection network share full-image convolutional features • Anchors with multiple scales and aspect ratios Bottom-Left: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf Shaoqing Ren, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015 Faster R-CNN (2015) Region Proposal Network
33. 33. R-FCN: Region-based Fully Convolutional Networks • Position-sensitive score map before RoI pooling • 9 positions: top/middle/bottom-left/center/right • Position-sensitive RoI pooling instead of standard RoI pooling • Fully convolutional detection network instead of the fully-connected detection network in Faster R-CNN Jifeng Dai, et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016 R-FCN (2016) Position-Sensitive Score Map
34. 34. Light-Head R-CNN • Heavy head • E.g., Faster R-CNN & R-FCN • Intensive computations around RoI warping • Light-Head R-CNN • Thin feature maps from large separable convolution layers • Cheap R-CNN subnet with 1 FC-layer Zeming Li, et al., Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017 Light-Head R-CNN (2017) ‘Heavy’-Head Detectors Large Separable Convolution
35. 35. FPN: Feature Pyramid Networks • Bottom-up pathway • Top-down pathway • Lateral connection Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, 2017 Different Feature Maps FPN Block • Feature pyramid: Combination of • Low-resolution, semantically strong features • High-resolution, semantically weak features
36. 36. Cascade R-CNN • Multi-stage extension of R-CNN • Trained sequentially using output of the previous stage • Cascaded bbox regression • $f(x, b) = f_T \circ f_{T-1} \circ \cdots \circ f_1(x, b)$ • Cascaded detection • A sequence of detectors trained with increasing IoU thresholds Zhaowei Cai, et al., Cascade R-CNN: Delving into High Quality Object Detection, 2018 Cascade R-CNN
37. 37. SNIP: Scale Normalization for Image Pyramids • CNNs are not robust to changes in scale • Multi-scale image pyramids for objects with different scales • Detections from each scale are rescaled and combined using NMS • Small objects from high-resolution image • Large objects from low-resolution image Bharat Singh, Scale Invariance in Object Detection - SNIP, 2018
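The detectors above rely on IoU thresholds and on NMS to merge overlapping detections. A minimal sketch of both follows, assuming boxes in `(x1, y1, x2, y2)` format and a greedy NMS variant; the 0.5 threshold and the sample boxes are illustrative, not values from the tutorial.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],
                  [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed and the indices kept are 0 and 2.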
38. 38. YOLO: You Only Look Once • End-to-end one-stage method • Directly uses full images to predict each bounding box • Extremely fast, running in real time • YOLOv2 • Darknet19 backbone • Anchor mechanism • YOLOv3 • Multi-scale features • Darknet53 backbone Figures: YOLO (2016), YOLOv3 (2018) Joseph Redmon, et al., You Only Look Once: Unified, Real-Time Object Detection, 2016 Joseph Redmon, et al., YOLO9000: Better, Faster, Stronger, 2017 Joseph Redmon, et al., YOLOv3: An Incremental Improvement, 2018 Top-Left: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/ Bottom-Left: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b/
39. 39. SSD: Single Shot Detector • Multiple feature maps with different resolutions and scales • Improved speed/accuracy trade-off Wei Liu, et al., SSD: Single Shot MultiBox Detector, 2016 SSD (2016) YOLOv1
40. 40. DSSD: Deconvolutional SSD • Encoder-decoder Hourglass structure • Wide – Narrow – Wide • Convolution and deconvolution modules • Deconvolution: To introduce additional large-scale context for object detection • Two prediction modules • Each with one residual block Cheng-Yang Fu, et al., DSSD: Deconvolutional Single Shot Detector, 2017 SSD DSSD (2017) Selected Prediction Module
41. 41. RetinaNet • Focal loss instead of the cross-entropy function • Focuses training on a sparse set of hard samples • $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$ Tsung-Yi Lin, et al., Focal Loss for Dense Object Detection, 2017 RetinaNet (2017)
42. 42. RefineDet • Anchor refinement module • Filtering out easy negatives • Coarsely adjusting anchors • Object detection module • Further improving regression • Predicting multi-class labels Transfer Connection Block Shifeng Zhang, et al., Single-Shot Refinement Neural Network for Object Detection, 2018 RefineDet (2018)
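A short sketch of the focal loss formula, compared with plain cross entropy. The default `gamma=2.0` and the clipping constant are assumptions chosen for illustration; the optional class-balancing weight from the paper is left out.

```python
import numpy as np


def cross_entropy(p_t):
    """Cross entropy for the probability p_t assigned to the true class."""
    return -np.log(np.clip(p_t, 1e-12, 1.0))


def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by the factor (1 - p_t) ** gamma."""
    p_t = np.clip(p_t, 1e-12, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)


easy, hard = 0.9, 0.1  # hypothetical well-classified and misclassified samples
easy_ratio = focal_loss(easy) / cross_entropy(easy)
hard_ratio = focal_loss(hard) / cross_entropy(hard)
```

With `gamma = 2`, the easy sample keeps only 1% of its cross-entropy loss while the hard sample keeps 81%, which is how training is steered toward hard samples. Setting `gamma = 0` recovers cross entropy.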
43. 43. CornerNet • Object as a pair of bounding box corners • No need for anchor boxes • Regression problem → Corner prediction problem • Corner pooling • To better localize corners of bounding box Hei Law, et al, CornerNet: Detecting Objects as Paired Keypoints, 2018 CornerNet (2018) Corner Pooling
44. 44. MRFSWSnet: Multiple Receptive Field Small-Object-Focusing Weakly-Supervised Segmentation Net • Multiple Receptive Field block (MRF): multiple receptive fields and more features for prediction • Auxiliary Semantic Segmentation block (ASM): auxiliary semantic segmentation focusing on small objects • Object Detection block (ODM): combining MRF and ASM with parallel training • Loss function: Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
45. 45. Experiments on MRFSWSnet • Images collected by dash camera • Detection of cellphone usage during driving • 1000 testing images
    Method | Recall | Precision | F1 Score
    Faster R-CNN | 97.57 | 96.47 | 97.01
    RetinaNet | 97.80 | 97.80 | 97.80
    Light-Head R-CNN | 97.71 | 95.13 | 96.40
    YOLOv3 | 98.57 | 97.32 | 97.94
    MRFSWSnet | 98.71 | 97.32 | 98.01
    Siyang Sun, et al., Multiple Receptive Fields and Small-Object-Focusing Weakly-Supervised Segmentation Network for Fast Object Detection, 2019
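The F1 scores in this table can be recomputed from the recall and precision values as a sanity check. The sketch below uses the standard harmonic-mean definition; the dictionary layout is only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (recall, precision, reported F1) copied from the slide.
reported = {
    "Faster R-CNN": (97.57, 96.47, 97.01),
    "RetinaNet": (97.80, 97.80, 97.80),
    "Light-Head R-CNN": (97.71, 95.13, 96.40),
    "YOLOv3": (98.57, 97.32, 97.94),
    "MRFSWSnet": (98.71, 97.32, 98.01),
}

recomputed = {
    name: f1_score(precision, recall)
    for name, (recall, precision, _) in reported.items()
}
```

Every recomputed value agrees with the reported one to within 0.01.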
46. 46. Challenges • Depends on large amounts of labeled data, which incurs expensive annotation costs • Difficult to apply directly in new operating environments • Computation-intensive and highly demanding of computational resources • Complicated models are time- and memory-consuming, which prevents their use in real-time operational systems (e.g., DMS)
47. 47. Yuhong GUO DiDi AI Labs & Carleton University Part II: Advanced Topics
48. 48. •Domain Adaptation •Lightweight Models Topics
50. 50. Domain Adaptation/Transfer Learning • Definition [Pan et al., IJCAI13]: the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI 2013. Tan, Chuanqi, et al. "A survey on deep transfer learning." International Conference on Artificial Neural Networks. Springer, Cham, 2018.
51. 51. Why Domain Adaptation • Successful application of ML in industry depends on learning from large amounts of labeled data ➢ Labels are expensive and time-consuming to collect ➢ Data is difficult or dangerous to collect in certain scenarios, e.g., autonomous driving • Domain adaptation/transfer learning provides the essential ability of ✓ Reusing existing labeled resources ✓ Adapting to changing environments ✓ Learning from simulations
53. 53. Motivation Examples • Different feature distributions • Different label spaces
54. 54. Applications in Computer Vision
55. 55. Adapting to New Domains • Reuse existing datasets, and hence their annotation information ➢ Object Recognition ➢ Object Detection ➢ Person Re-Identification ➢ Image Segmentation ➢ Image Classification …
56. 56. Learning from Simulations • Gathering data and training models can be too expensive, time-consuming, or dangerous • Solution: create data and learn from simulations ➢ OpenAI's Universe will potentially allow us to train a self-driving car using GTA 5 or other video games ➢ Training models on real robots is too slow and expensive http://ruder.io/transfer-learning/index.html
57. 57. Common Datasets • Object recognition ➢ Office-31 ➢ ImageCLEF-DA ➢ VisDA-2017: visual domain adaptation challenge dataset ➢ Digits: MNIST, SVHN, USPS ➢ Syn2Real: a new dataset for object recognition [Peng et al., 2018]
58. 58. Common Datasets • Semantic segmentation/object detection
60. 60. Categories of DA Methods Three main classes: • Reweighting/instance-based methods • Feature-based/representation learning methods • Parameter/model-based methods
61. 61. Start with Instance Reweighting • Context • Idea
62. 62. Simple Math Analysis • $h(\cdot)$: prediction function, $x$: input, $y$: output • Expected risk in target domain:
63. 63. Covariate Shift • Assume shared conditional distribution • To minimize target risk, source instances can be reweighted:
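The formulas for these two slides did not survive in the transcript. The standard importance-weighting argument they refer to can be written as below. The notation ($P_S$, $P_T$ for the source and target distributions, $\ell$ for the loss, $w$ for the weight) is assumed here and may differ from the original slides.

```latex
\begin{align*}
R_T(h) &= \mathbb{E}_{(x,y)\sim P_T}\big[\ell(h(x), y)\big] \\
       &= \iint P_T(x)\, P_T(y \mid x)\, \ell(h(x), y)\, dy\, dx \\
       &= \iint P_S(x)\, \frac{P_T(x)}{P_S(x)}\, P_S(y \mid x)\, \ell(h(x), y)\, dy\, dx
          && \text{(shared conditional: } P_T(y \mid x) = P_S(y \mid x)\text{)} \\
       &= \mathbb{E}_{(x,y)\sim P_S}\big[w(x)\, \ell(h(x), y)\big],
          \qquad w(x) = \frac{P_T(x)}{P_S(x)}.
\end{align*}
```

The third step needs $P_S(x) > 0$ wherever $P_T(x) > 0$, which is the role of a support assumption.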
64. 64. Assumptions • Assume shared conditional distribution • In addition, note: • Assumption of support:
65. 65. Weight Estimation • Density ratio estimation • Direct weight estimation $w = P_T / P_S \propto P(d = T \mid x) / P(d = S \mid x)$
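One common way to estimate such weights directly is to train a classifier that separates source from target samples and convert its odds into a density ratio via Bayes' rule. The sketch below uses a hand-written logistic regression on synthetic 1-D data. All names, the learning rate, and the data are assumptions, and this is not necessarily the estimator used in the tutorial.

```python
import numpy as np


def fit_domain_classifier(x_source, x_target, lr=0.1, steps=2000):
    """Logistic regression separating source (d=0) from target (d=1)."""
    x = np.vstack([x_source, x_target])
    d = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    x_aug = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    theta = np.zeros(x_aug.shape[1])
    for _ in range(steps):
        p_target = 1.0 / (1.0 + np.exp(-x_aug @ theta))
        theta -= lr * x_aug.T @ (p_target - d) / len(d)
    return theta


def importance_weights(x_source, theta, n_source, n_target):
    """w(x) = P(d=T|x) / P(d=S|x) * n_S / n_T, evaluated on source samples."""
    x_aug = np.hstack([x_source, np.ones((len(x_source), 1))])
    odds = np.exp(x_aug @ theta)  # classifier odds P(d=T|x) / P(d=S|x)
    return odds * n_source / n_target


rng = np.random.default_rng(0)
x_s = rng.normal(0.0, 1.0, size=(500, 1))  # synthetic source samples
x_t = rng.normal(1.0, 1.0, size=(500, 1))  # synthetic target samples, shifted right
theta = fit_domain_classifier(x_s, x_t)
weights = importance_weights(x_s, theta, len(x_s), len(x_t))
```

Source samples that look more like target samples (larger `x` here) receive larger weights, so a weighted source loss approximates the target risk.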
66. 66. Learning Weights Directly: MMD [Gretton et al. 2012] • Maximum Mean Discrepancy (MMD)
67. 67. Learning Weights Directly: MMD • MMD for domain adaptation
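For readers unfamiliar with MMD, a minimal sketch of the (biased) squared-MMD estimate with an RBF kernel is given below. The kernel choice, bandwidth `sigma`, and sample data are assumptions for illustration. The slides' own formulation, including how weights enter the objective, is not recoverable from the transcript.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())


rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 2))
same_domain = rng.normal(0.0, 1.0, size=(200, 2))
shifted_domain = rng.normal(2.0, 1.0, size=(200, 2))

mmd_same = mmd2(source, same_domain)
mmd_shifted = mmd2(source, shifted_domain)
```

The estimate is close to zero when both samples come from the same distribution and clearly larger when one sample is shifted, which is why MMD can be minimized to align domains.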
68. 68. Extend to Representation Learning • Extend MMD to learn a representation function $\phi(x)$ [Long et al., CVPR 13]
69. 69. Recent Feature-based Methods • Representation learning methods offer larger capacity for bridging domain discrepancy • Widely applied in transfer learning for computer vision tasks • Recent developments in representation-learning-based domain adaptation
70. 70. Adversarial Loss-based Adaptation Framework • Main idea: ➢ $\min_F \max_D L_{adv}(F, D) = \mathbb{E}_{x \sim P_S} \log D(F(x)) + \mathbb{E}_{x \sim P_T} \log(1 - D(F(x)))$ ➢ Goal: $p_S(F(x)) = p_T(F(x))$ Goodfellow et al., 2014
71. 71. Theoretical Connection • A-distance: a measure of distance between probability distributions • Bound on target domain error Ben-David et al. "Analysis of Representations for Domain Adaptation", NIPS 06 Kifer et al. Detecting change in data streams. In Very Large Databases (VLDB), 2004.
72. 72. Adversarial Loss-based Adaptation Framework • Main idea: ➢ $\min_{F,C} \max_D L = L_{adv}(F, D) + \lambda L_{cls}$
73. 73. Domain Adversarial Neural Network (DANN) • DANN: the adversarial objective is implemented via a gradient reversal layer (GRL)
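A gradient reversal layer is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The sketch below shows this with a hand-written forward/backward pair rather than a real autograd framework. The class name and the `lam` parameter are illustrative assumptions.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off factor applied to the reversed gradient

    def forward(self, features):
        return features.copy()

    def backward(self, grad_output):
        # The domain classifier's gradient is flipped before it reaches the
        # feature extractor, so the extractor learns to confuse the classifier.
        return -self.lam * grad_output


grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0])
forward_out = grl.forward(features)
reversed_grad = grl.backward(np.array([2.0, 4.0]))
```

In a deep learning framework the same behavior would be registered as a custom autograd operation placed between the feature extractor and the domain classifier.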
75. 75. Reweighted Adversarial Adaptation [Chen et al., CVPR 18] • Re-weight the source domain label distribution to help reduce domain discrepancy and adapt the classifier • Reweighted adversarial loss (RAAN)
76. 76. Alternative Adversarial Terms • Maximum Classifier Discrepancy (MCD) • Adversarial loss: target domain prediction discrepancy • Train both classifiers and the generator to classify the source samples correctly K. Saito, et al. "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 18
78. 78. DA Recognition Results
79. 79. Question Raised: Transferability vs. Discriminability
80. 80. Batch Spectral Penalization (BSP)
81. 81. Object detection: DA-Faster-R-CNN Multi-Level Adversarial Adaptation • Adversarial loss via GRL at both image level and instance level • Consistency regularization between the two levels Chen, et al., CVPR 18
82. 82. Object detection: Strong-Weak Multi-Level Adversarial Alignment Saito, et al., CVPR 19
83. 83. Object detection: Multi-Level Adversarial Alignment Saito, et al., CVPR 19
84. 84. Object detection DA Detection Results
85. 85. Generative Model based Methods • Main idea
86. 86. Cycle-Consistent Adversarial DA • Limitation of domain alignment techniques • CyCADA et al., ICML 18
87. 87. Cycle-Consistent Adversarial DA et al., ICML 18 Figure legend: image-level GAN loss (green), feature-level GAN loss (orange), source and target semantic consistency losses (black), source cycle loss (red), and source task loss (purple).
88. 88. Symmetric Bi-Directional Adaptive GAN • SBDA-GAN et al., CVPR 18
89. 89. DA Recognition Results
90. 90. Pseudo-Label based Methods Some positive applications in domain adaptation: ➢ Progressive domain adaptation for object detection ➢ For recognition: Zhang et al. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation", CVPR 18 Inoue et al. "Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation", CVPR 18
91. 91. Summary • Unsupervised domain adaptation has received a lot of attention • Open domain learning remains challenging, but is starting to draw attention • Most studies have focused on classification problems • Much less effort has been devoted to more complex tasks such as object detection
92. 92. Lightweight Models
93. 93. Basics Number of multiplications for one standard convolutional layer: • Input: $D_F \times D_F \times M$ • Output: $D_G \times D_G \times N$ • $D_K$: kernel size • $M$: number of input channels • $N$: number of output channels • $D_G$: output dimension
94. 94. Basics • Architecture design: lightweight models ➢ Use two 3 x 3 conv layers to replace one 5 x 5 conv layer: cost ratio (3x3 + 3x3)/(5x5) ➢ Use sequential 1 x n and n x 1 conv layers to replace an n x n conv layer: cost ratio (1xn + nx1)/(nxn)
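The cost ratios above follow from the multiplication count of a standard convolutional layer: kernel height x kernel width x input channels x output channels x output height x output width. The sketch below works through both factorizations, assuming equal input and output channel counts and padding that preserves the spatial size. The function name and the concrete sizes are illustrative.

```python
def conv_multiplications(kernel_h, kernel_w, in_channels, out_channels, out_size):
    """Multiplications for one standard conv layer with a square output map."""
    return kernel_h * kernel_w * in_channels * out_channels * out_size * out_size


channels, out_size = 32, 28  # hypothetical layer dimensions

# Two stacked 3x3 layers versus one 5x5 layer (same receptive field).
two_3x3 = 2 * conv_multiplications(3, 3, channels, channels, out_size)
one_5x5 = conv_multiplications(5, 5, channels, channels, out_size)
ratio_3x3_vs_5x5 = two_3x3 / one_5x5

# A 1xn layer followed by an nx1 layer versus one nxn layer, with n = 7.
n = 7
asymmetric = (conv_multiplications(1, n, channels, channels, out_size)
              + conv_multiplications(n, 1, channels, channels, out_size))
full_nxn = conv_multiplications(n, n, channels, channels, out_size)
ratio_asymmetric = asymmetric / full_nxn
```

The first ratio is 18/25 = 0.72 and the second is 2/n, so the saving from the asymmetric factorization grows with the kernel size.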
95. 95. Basics • Architecture design: lightweight models ➢ Pointwise convolution: use a 1x1 conv layer (to reduce dimension) ➢ Depthwise separable convolution
96. 96. • Inception, Xception * • SqueezeNet • MobileNet / MobileNetV2 • ShuffleNet / ShuffleNetV2 Lightweight models
97. 97. Inception Module • Traditional 3x3 convolution block (previous layer → 3x3 convolution → output layer): input 28 x 28 x 192, output 28 x 28 x 256, #model parameters: 3 x 3 x 192 x 256 = 442k • Inception module with dimension reduction (V1 block, from GoogLeNet): #model parameters: 1 x 1 x 192 x 64 + 1 x 1 x 192 x 96 + 3 x 3 x 96 x 128 + 1 x 1 x 192 x 16 + 5 x 5 x 16 x 32 + 0 (max pooling) + 1 x 1 x 192 x 32 = 163k Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014.
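The two parameter counts on this slide can be reproduced with a few lines of arithmetic. Biases are ignored, as on the slide, and the helper name is illustrative.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights in a square conv layer, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


# Plain 3x3 convolution: 192 input channels, 256 output channels.
plain_3x3 = conv_params(3, 192, 256)

# Inception V1 block with dimension reduction (four parallel branches).
inception_v1 = (
    conv_params(1, 192, 64)                              # 1x1 branch
    + conv_params(1, 192, 96) + conv_params(3, 96, 128)  # 1x1 reduce, then 3x3
    + conv_params(1, 192, 16) + conv_params(5, 16, 32)   # 1x1 reduce, then 5x5
    + conv_params(1, 192, 32)                            # 1x1 after max pooling
)

# Branch outputs are concatenated: 64 + 128 + 32 + 32 channels.
output_channels = 64 + 128 + 32 + 32
```

This gives 442,368 parameters for the plain block and 163,328 for the Inception block, matching the 442k and 163k on the slide, with the same 256 output channels.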
98. 98. Inception V1, V2, V3 • Use two 3 x 3 conv to replace 5 x 5 conv Szegedy et al. Going Deeper with Convolutions, https://arxiv.org/abs/1409.4842. 2014. Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, http://arxiv.org/abs/1502.03167. 2015. Szegedy et al. Rethinking the Inception Architecture for Computer Vision, http://arxiv.org/abs/1512.00567. 2015.
99. 99. Xception • Depthwise separable convolution • (3 x 3 x 1 x M/7 x 112 x 112) x 7 François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357. 2016-2017.
100. 100. SqueezeNet Input: F x F x M Squeeze: • 1x1 convs, output: F x F x S (S < M) Expand: • 1x1 convs, output: F x F x e1 • 3x3 convs, output: F x F x e2 Concatenate: F x F x (e1 + e2) Forrest N. Iandola, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. https://arxiv.org/abs/1602.07360. 2016
101. 101. MobileNet V1 • Standard convolution • Depthwise separable conv (1) depthwise conv: one filter per input channel (2) pointwise conv: 1x1 convs • Computation reduction Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. https://arxiv.org/abs/1704.04861?context=cs. 2017.
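The computation reduction from replacing a standard convolution with a depthwise plus pointwise pair can be checked numerically. The sketch follows the cost model of the MobileNets paper; the symbol names (`d_k` kernel size, `m` input channels, `n` output channels, `d_f` feature map size) and the example layer sizes are assumptions for illustration.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiplications for a standard d_k x d_k conv on a d_f x d_f map."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k, m, n, d_f):
    """Depthwise conv (one filter per input channel) plus 1x1 pointwise conv."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


d_k, m, n, d_f = 3, 64, 128, 56  # hypothetical layer dimensions
reduction = (depthwise_separable_cost(d_k, m, n, d_f)
             / standard_conv_cost(d_k, m, n, d_f))
closed_form = 1.0 / n + 1.0 / (d_k * d_k)
```

The ratio equals 1/N + 1/D_K^2, about 0.12 for this example, so a 3x3 depthwise separable layer needs roughly 8 to 9 times fewer multiplications than the standard layer.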
103. 103. MobileNet V1 • Use conv with stride=2 to replace pooling • Add two hyperparameters: width multiplier α and resolution multiplier ρ • α = 1.0, 0.75, 0.5, 0.25 • Standard MobileNet when α = 1
104. 104. MobileNet V2 MobileNetV1 MobileNetV2 Increase # channels Linear bottlenecks: removed nonlinear activation in the low dim Inverted residual block: increase dim, then reduce dim Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. https://arxiv.org/abs/1801.04381. 2018.
105. 105. ShuffleNet V1 • Pointwise group convolution (1x1 conv), #g (groups) • Channel shuffle: helps information flow across feature channels • Use a concat operation to concatenate two different channels Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
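Channel shuffle is usually implemented as a reshape, a transpose, and a reshape back. A minimal NumPy sketch for tensors in `(batch, channels, height, width)` layout follows; the layout and the toy input are assumptions.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave channels across groups so information can mix between them."""
    batch, channels, height, width = x.shape
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    x = x.reshape(batch, groups, channels // groups, height, width)
    x = x.transpose(0, 2, 1, 3, 4)  # swap the group axis and per-group channel axis
    return x.reshape(batch, channels, height, width)


# Toy input: 6 channels whose values equal their channel index.
x = np.arange(6, dtype=float).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(x, groups=2)
channel_order = shuffled[0, :, 0, 0].astype(int).tolist()
```

With 6 channels in 2 groups, the original order 0..5 becomes 0, 3, 1, 4, 2, 5, so each output group of the next grouped convolution sees channels from every input group.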
106. 106. ShuffleNet V1 Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. https://arxiv.org/abs/1707.01083. 2017.
107. 107. ShuffleNet V1
108. 108. ShuffleNet V2 Reduce memory access cost: • Channel Split (2g) • Remove group convolution • Put the channel shuffle module after channel concatenation Ningning Ma et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164. 2018.
109. 109. Experiments - Classification
    Model | mAP (%) | Precision (%) | Recall (%) | Size (MB) | Computation speed (ms/photo)
    Server-based + YOLOv2 | 99.62 | 99.60 | 99.65 | N/A | N/A
    1.00xShuffleNet V2 + YOLOv2 | 96.43 | 97.16 | 96.83 | 5.20 | 80.00
    0.50xShuffleNet V2 + YOLOv2 | 95.86 | 97.28 | 96.28 | 1.70 | 40.00
    0.50xShuffleNet V2 + SSD | 97.73 | 90.61 | 97.98 | 7.90 | 65.00
    0.25xShuffleNet V2 + SSD | 97.25 | 90.46 | 97.59 | 5.00 | 45.00

    Category | Abbreviation
    Front page of ID card | id_card_f
    Back page of ID card | id_card_b
    Front page of driver license | driver_license_f
    Back page of driver license | driver_license_b
    Front of main page in car license | car_license_f
    Back of main page in car license | car_license_b
    Supplementary page in car license | vehicle_license
    Real car photo (whole car) | Whole car
    Real car photo (car plate) | plate
110. 110. Experiments - Classification #positive photos: 8K #negative photos: 8K
    Version | Backbone | Detection method | Size (MB) | mAP (%) | Precision (%) | Recall (%) | Error detection rate (%)
    Floating-point version | 0.5*ShuffleNet V2 | YOLOv2 | 1.70 | 97.86 | 98.81 | 98.00 | 0.125
    Fixed-point version | 0.5*ShuffleNet V2 | YOLOv2 | 0.40 | 97.82 | 98.82 | 97.97 | 0.0625

    Per-category results, #positive photos: 8K (the averages match the floating-point and fixed-point rows above, in that order)
    Category | Precision (%) | Recall (%) | Precision (%) | Recall (%)
    car | 98.87 | 96.41 | 98.97 | 96.11
    car_license_b | 98.70 | 99.00 | 99.10 | 99.00
    car_license_f | 99.90 | 97.70 | 99.80 | 97.90
    driver_license_b | 99.80 | 98.90 | 99.80 | 99.00
    driver_license_f | 99.49 | 98.50 | 99.19 | 98.50
    id_card_b | 99.90 | 99.00 | 99.90 | 99.00
    id_card_f | 99.50 | 99.10 | 99.50 | 99.10
    plate | 93.82 | 94.29 | 93.71 | 93.99
    vehicle_license | 99.30 | 99.10 | 99.40 | 99.10
    Average | 98.81 | 98.00 | 98.82 | 97.97
111. 111. Experiments - Embedded OCR • Use ShuffleNet to replace ResNet50 as the backbone
112. 112. Haifeng SHEN, DiDi AI Labs Guangyu LI, DiDi AI Labs Part III : Application
113. 113. •Driver Identification •Driving Scenario Understanding Application
114. 114. Driver identification • Application • Overview • Experiments
115. 115. Application - Pay by smiling • In Sep. 2017, Alibaba's Ant Financial affiliate and KFC China announced facial-recognition payment for customers in the fast food restaurant chain's new KPRO store in Hangzhou. • The "Smile to Pay" facial recognition payment solution at KFC enables customers to pay without their wallets. https://www.jrzj.com/194328.html
116. 116. Application - Check-in at station Taiyuan South railway station, Beijing West railway station, Shanghai metro station https://baijiahao.baidu.com/s?id=1552314447507461&wfr=spider&for=pc http://www.sohu.com/a/220124437_99966914 http://dy.163.com/v2/article/detail/D5U3QH2P0525KG01.html
117. 117. http://www.sohu.com/a/168709903_728989 Application - Pedestrian monitoring Ningbo City uses face recognition for transportation surveillance and pedestrian monitoring.
118. 118. Application - Driver monitoring https://www.sohu.com/a/253263266_649849
119. 119. Application - Other uses https://www.globalrailwayreview.com/article/66120/train-stations-facial-recognition/ https://image.baidu.com/
120. 120. Overview - Market https://www.marketsandmarkets.com/Market-Reports/facial-recognition-market-995.html
121. 121. Overview - Features • Natural • Unperceivable • Contactless • Multiple Biometric: "You are your own key" https://image.baidu.com/
122. 122. Overview - Challenges Inter-class similarity https://image.baidu.com/
123. 123. Overview - Challenges Illumination Expression Occlusion Age Pose Other Intra-class variability Similarity =0.18 https://image.baidu.com/
124. 124. Overview - Framework Verification
125. 125. Overview - Framework Identification
126. 126. Overview - Detection & landmark dataset
    Face detection dataset | Available | # faces | # images | Website | Remarks
    FDDB | Public | 5,171 | 2,845 | http://vis-www.cs.umass.edu/fddb/ | Unconstrained faces
    WiderFace | Public | 393,703 | 32,203 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace | Easy, Medium, Hard sets; a high degree of variability in scale, pose and occlusion
    MALF | Public | 11,931 | 5,250 | http://www.cbsr.ia.ac.cn/faceevaluation/ | Bounding box; Multi-Attribute Labelled Faces; pose and facial attributes
    Caltech 10,000 Web Faces | Public | - | 10,524 | http://www.vision.caltech.edu/Image_Datasets/Caltech_10K_WebFaces/ | Collected from Google image search; 4 landmarks (two eyes, nose and mouth)
    PUT | Public | - | 9,971 | http://biometrics.put.poznan.pl/put-face-database/ | 30 landmarks, 194 contour points
    AFLW | Public | 25,993 | - | https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | Collected from Flickr; 21 landmarks
127. 127. Overview - Detection - MTCNN Kaipeng Zhang et al. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. https://arxiv.org/abs/1604.02878v1.2016. • propose a deep cascaded multi-task framework with three stages, P-Net, R- Net and O-Net. • Each is a shallow network. • P-Net: proposal network, produces candidate windows quickly through a shallow CNN • R-Net: refine network, refines the candidates to reject a large number of non-faces windows through a more complex CNN • O-Net: output network, use a more powerful CNN to refine the result and output facial landmarks positions
128. 128. Overview - Detection - Face RFCN Yitong Wang et al. Detecting Faces Using Region-based Fully Convolutional Networks. https://arxiv.org/abs/1709.05256. 2017. • The framework is based on the R-FCN. • propose a region-based face detector applying deep networks in a fully convolutional fashion • introduce additional smaller anchors and modify the position-sensitive RoI pooling to a smaller size for suiting the detection of the tiny faces. • propose to use position-sensitive average pooling instead of normal average pooling for the last feature voting in R-FCN • use multi-scale training strategy and online Hard Example Mining (OHEM) strategy.
129. 129. Overview - Detection - PyramidBox Xu Tang et al. PyramidBox: A Context-assisted Single Shot Face Detector. https://arxiv.org/abs/1803.07737?context=cs. 2018. • Baidu proposes PyramidBox. • Extends the VGG16 backbone to generate feature maps at different levels • Generates a series of anchors corresponding to larger regions around a face that contain more contextual information, such as the head, shoulders and body.
130. 130. Overview - Recognition - Dataset
Dataset | Available | # People | # images | Website | Remarks
LFW | Public | 5K | 13K | http://vis-www.cs.umass.edu/lfw/#views | Labeled Faces in the Wild
YFD | Public | 1.5K | 3.4K (Video) | https://www.cs.tau.ac.il/~wolf/ytfaces/ | YouTube Faces Database
CelebA (CelebFaces Attributes Dataset) | Public | 10K | 202K | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | Multimedia Lab, The Chinese University of Hong Kong
CASIA-WebFace | Public | 10K | 500K | http://www.cbsr.ia.ac.cn/english/CASIA-WebFace/CASIA-WebFace_Agreements.pdf |
MS-Celeb-1M | Public | 100K | 10M | https://www.msceleb.org |
VGGFace2 | Public | 9K | 3.3M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/ | Downloaded from Google Image Search; large variations in pose, age, illumination, ethnicity and profession
Facebook | Private | 4K | 4,400K | N/A |
Google | Private | 8,000K | 100-200M | N/A |
131. 131. Overview - Recognition - Milestones
• 1888: Galton, Nature
• 1910: Galton, Nature
• 1965: Chan, Bledsoe, AFR
• 1991: Turk MA, Eigenfaces
• 1997: Belhumeur P, Fisherface
• 2002: Liu C, Gabor feature
• 2006: Ahonen T, LBP
• 2009: Wright J, Sparse representation
• 2013: Chen D, High-dim LBP
• 2014: Sun Yi, DeepID
• 2014: Facebook, DeepFace
• 2015: Oxford, VGGFace
• 2015: Google, FaceNet
• 2015: Baidu, EnsembleFace
• 2016: EffectiveFace
• 2017: SphereFace
• 2018: ArcFace
• 2019: Combined loss
132. 132. Overview - Recognition - Results
Time | Method | Training size (images) | Method description | LFW accuracy | Comments
1991 | Eigenfaces | < 10K | Principal component analysis (PCA) | 60.02% |
2006 | LBP+CSML | < 10K | Local binary pattern (LBP) + Metric learning | 85.57% |
2013 | High-dim LBP | 0.1M | High-dim LBP + Joint Bayesian | 95.17% |
2014 | DeepFace | 4M | CNN + 3D face alignment | 97.35% | Facebook
2014 | DeepID | 0.2M | CNN + Softmax | 97.45% | CUHK
2015 | VGGFace | 2.6M | VGG + Softmax | 98.95% | Oxford
2015 | FaceNet | 200M | Inception + Triplet loss | 99.63% | Google
2015 | Ensemble face | 1.2M | CNN + Multi-patch + Deep metric | 99.77% | Baidu
2016 | Effective face | 2.5M | CNN + Augmentation | 98.06% | Pose + Shape + Expression
2017 | SphereFace | 0.5M | CNN + Angular-Softmax | 99.42% | Multiplicative angular margin: cos(mθ)
2018 | ArcFace | 6.8M | CNN + Additive angular margin | 99.83% | Additive angular margin: cos(θ + m)
2019 | Combined loss | N/A | | | cos(m1θ + m2) − m3
133. 133. Overview - Recognition - DeepFace Yaniv Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. https://ieeexplore.ieee.org/document/6909616. CVPR 2014. • CNN + DNN structure • L4 - L6 are locally connected layers without weight sharing, rather than standard convolutional layers • The last two layers, i.e. F7 and F8, are fully connected • Employs 3D face modeling to apply an affine transformation for 3D face alignment, producing a frontal face • More than 120 million parameters • Trained on four million facial images belonging to more than 4,000 identities
134. 134. Overview - Recognition - DeepID Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Representation from Predicting 10,000 Classes. https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Sun_Deep_Learning_Face_2014_CVPR_paper.pdf. CVPR 2014. • Uses a face-patch method, with one ConvNet per patch • Each ConvNet has 4 layers • 60 face patches: ten regions, three scales, and RGB or gray channels • Each of the 60 ConvNets extracts two 160-dimensional vectors (from the patch and its horizontally flipped counterpart), giving a 60 × 2 × 160 = 19,200-dimensional vector for face verification • Achieves 97.45% face verification accuracy on LFW • Building on DeepID1, the Chinese University of Hong Kong later released DeepID2 and DeepID3
135. 135. Overview - Recognition - FaceNet Florian Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/abs/1503.03832. CVPR 2015. • Google proposes the structure. • Directly use a deep convolutional network • Use triplet loss for training: minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity • Use the Euclidean distance to measure the face similarity for verification.
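The triplet loss described on this slide can be written as max(d(a, p) − d(a, n) + α, 0), where d is the squared Euclidean distance between embeddings and α is a margin. The sketch below assumes small hand-written 2-D unit vectors in place of real network embeddings; the margin value 0.2 and the example vectors are illustrative assumptions.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Toy unit-length embeddings (assumed values, not network outputs).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]         # same identity as the anchor
easy_negative = [0.0, 1.0]    # different identity, already far away
hard_negative = [0.96, 0.28]  # different identity, closer than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)
```

Only the hard triplet produces a non-zero loss, which is why triplet selection matters when training with this objective.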
136. 136. Overview - Recognition - Ensemble Face Jingtuo Liu et al. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. https://arxiv.org/pdf/1506.07310. 2015. • Multi-patch feature extraction • 9 image patches, each centered at a different landmark on the face region • Each patch: 9 convolution layers and a softmax layer at the end • Concatenates the last convolution layer of each network to build a high-dimensional feature for the face representation • A metric learning method with triplet loss reduces the feature to 128 or 256 dimensions • Achieves 99.77% accuracy on LFW under the 6,000-pair evaluation protocol
137. 137. Overview - Recognition - Effective Face Iacopo Masi et al. Do We Really Need to Collect Millions of Faces for Effective Face Recognition? https://arxiv.org/abs/1603.07057. ECCV 2016. • Uses a single 19-layer VGGNet • Trains on both real and augmented data • Uses the CASIA WebFace collection and generates artificial data by introducing pose, shape and expression variations
138. 138. Overview - Recognition - Combined loss Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698. 2019. • Multiplicative angular margin: cos(mθ) • Additive angular margin: cos(θ + m) • Additive cosine margin: cos(θ) − m • Combined loss: cos(m1θ + m2) − m3
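The margin variants on this slide can all be read as special cases of the combined target-class logit cos(m1θ + m2) − m3 given in the results table. The sketch below evaluates that expression for a single angle; the function name `margin_logit` and the margin values are illustrative assumptions, and the feature scaling and the handling of angles where m1θ + m2 exceeds π used in real training code are left out.

```python
import math


def margin_logit(cos_theta: float, m1: float = 1.0, m2: float = 0.0,
                 m3: float = 0.0) -> float:
    """Target-class logit cos(m1 * theta + m2) - m3, before any scaling.

    m1 = 1, m2 = 0, m3 = 0 gives plain softmax (cos theta);
    m1 > 1 only gives the multiplicative angular margin cos(m * theta);
    m2 > 0 only gives the additive angular margin cos(theta + m);
    m3 > 0 only gives the additive cosine margin cos(theta) - m.
    """
    # Clamp to guard against rounding error before taking the arc cosine.
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return math.cos(m1 * theta + m2) - m3


cos_theta = 0.5  # angle of 60 degrees between feature and class weight

plain = margin_logit(cos_theta)
multiplicative = margin_logit(cos_theta, m1=2.0)
additive_angular = margin_logit(cos_theta, m2=0.5)
additive_cosine = margin_logit(cos_theta, m3=0.35)
combined = margin_logit(cos_theta, m1=1.0, m2=0.3, m3=0.2)

for name, value in [("plain", plain), ("multiplicative", multiplicative),
                    ("additive angular", additive_angular),
                    ("additive cosine", additive_cosine),
                    ("combined", combined)]:
    print(f"{name}: {value:.4f}")
```

Every margin lowers the target-class logit relative to plain softmax for the same angle, which forces the network to learn a smaller angle to reach the same score.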
139. 139. Experiments - Combined loss
Test set | feature | softmax | sphereface | cosface | arcface | Combined loss
LFW | public | 98.75 | 99.52 | 99.50 | 99.55 | 99.60
7k | private | 93.60 | 95.45 | 95.90 | 96.72 | 97.13
50k | private | 93.28 | 95.93 | 95.50 | 97.08 | 96.90
zc | private | 99.18 | 99.37 | 99.45 | 99.57 | 99.52
avg | | 96.20 | 97.57 | 97.59 | 98.23 | 98.29
• 7k/50k: test sets extracted from the registered driver photo database; 3K positive pairs and 3K negative pairs are randomly selected from 7k and 50k drivers, respectively.
• zc: test set randomly extracted from the premier driver photo database; 3K positive pairs and 3K negative pairs are randomly selected for testing.
140. 140. Experiments - Virtual learning • Virtual class learning improves performance over the baseline softmax on both the LFW and SLLFW datasets, e.g. from 99.10% to 99.46% and from 94.59% to 95.85%, respectively. Binghui Chen, Weihong Deng, Haifeng Shen. Virtual Class Enhanced Discriminative Embedding Learning. https://arxiv.org/abs/1811.12611. 2018
141. 141. Experiments - Fast face detection
[Figure: multiscale feature fusion network. Feature maps C3 to C11 (80×80, 40×40, 20×20, 20×20, 10×10, 5×5, 3×3, 2×2, 1×1) with upsampling, feeding object detection and the detection result.]
• Multiscale features: C3+C4+C5+C7+Conv9+Conv11
• Combine up-sampling features: C3 + C3’, C4 + C4’, C5 + C5’
• Support batch image computation
• TensorRT Optimization
Speed (ms/frame) | Batch size=1 | Batch size=64 | Batch size=100
Original | 22 | 12 | N/A
FP32 | 17 | 7 | 7
INT8 | 13 | 4 | 4
GPU Memory (GB/frame) | Batch size=1 | Batch size=64 | Batch size=100
Original | 1.40 | 0.188 | N/A
FP32 | 0.57 | 0.070 | 0.066
INT8 | 0.48 | 0.039 | 0.030
Detection % | Precision | Recall | F-score
Original | 97.90 | 97.00 | 97.47
FP32 | 97.90 | 97.10 | 97.48
INT8 | 97.85 | 96.96 | 97.40
142. 142. Experiments - Face detection • WIDER FACE is a face detection benchmark dataset, collected from the publicly available WIDER dataset. • It contains 32,203 images with 393,703 labeled faces showing a high degree of variability in scale, pose and occlusion. • Proposes the DFS method, which uses semantically fused feature maps as contextual cues and adds semantic segmentation as training supervision to learn better representations • Won 5 rank-1 results in April 2019 Widerface: http://shuoyang1213.me/WIDERFACE/index.html Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, et al. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. https://arxiv.org/abs/1811.08557. 2018-2019.
143. 143. Human Driving Scenarios
144. 144. What can we learn from Driving Scenario? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
145. 145. Driving Scenarios vs. General Computer Vision Data • Multi-modal (i.e. multiple sensors including camera, LiDAR, GPS, IMU, etc.) • Collected from 3D open areas (not indoor/lab environments) • Ego-centric / first person Requirements Opportunities
146. 146. Vision Perception in Driving Scenario What does Vision Perception do: Detect, Segment, Track and Classify Objects-of-interest in Driving Scenarios Main Components • Pedestrian • Vehicle • Road • Traffic Sign / Light
147. 147. Vision Perception – Pedestrian Detection
148. 148. Vision Perception – Pedestrian Detection
149. 149. Vision Perception – Pedestrian Detection Pedestrian detection at 100FPS • Uses Cascades • Fast features • Not a CNN based model Benenson et al ’12 “VeryFast” 100+ FPS detector. NO CNNs.
150. 150. Vision Perception – Pedestrian Detection Real-time Pedestrian Detection with CNNs • Uses Cascades • Uses fast non-CNN features • Use CNNs for max accuracy with minimum speed sacrifice Angelova et al ’15 “DeepCascades” Real-time (15FPS) with CNNs
151. 151. Vision Perception – Pedestrian Detection Occlusion-aware pedestrian detection • Aggregation loss (enforces proposals to lie close to, and compactly around, their target objects) • Part occlusion-aware region of interest (PORoI) pooling (integrates prior structure information of the human body to handle occlusion) • Based on Faster R-CNN Zhang et al ’18 “OR-CNN” State of the Art (by April 2019)
152. 152. Vision Perception – Vehicle Detection
153. 153. Vision Perception – Vehicle Detection Vehicle detection in 3D from image • Directly from 2D image • Proposal Generation as Energy Minimization • Orientation Estimation Network Chen et al ’16 “3D Bounding Box” Breakthrough for 3D Detection with Mono Image
154. 154. Vision Perception – Vehicle Detection Multi-View 3D Object Detection • Multi-sensor fusion Chen et al ’17 “MV3D” Impressive accuracy gain from multi-sensor fusion
155. 155. Vision Perception – Vehicle Detection Multi-level Fusion based 3D Object Detection from Mono Images • Simultaneously proposes 2D regions (RPN) and predicts 3D location, orientation and dimensions Xu et al ’18 “Multi-level Fusion” State of the Art for 3D Detection from Mono Camera Images
156. 156. Vision Perception – Road Segmentation Joint Semantic Prediction • KITTI Road Detection top performance 2017 • Multi-task framework • Real-time • Uses RGB image only Teichmann et al ’17 “MultiNet” Speed + Accuracy with RGB image only
157. 157. Vision Perception – Road Segmentation LIDAR-Camera Fusion • KITTI Road Detection top performance 2018 • Cross Fusion mechanism with FCN Caltagirone et al ’18 “LidCamNet” LIDAR-Camera Fusion RULES
158. 158. Vision Perception – Road Segmentation LIDAR-Camera Fusion with LIDAR Adaptation • KITTI Road Detection current top performance • Progressive LIDAR Adaptation Chen et al ’19 “PLARD” State of the Art Performance
159. 159. Vision Perception – Road Segmentation State of the Art on KITTI (by April 2019)
160. 160. Vision Perception – Traffic Sign Detection IJCNN 2011 Traffic Sign Recognition Competition • Ciresan et al ’11: 0.56% error • Human: 1.16% error • Non-CNN: 3.86% Ciresan et al ’11 “Traffic Sign Recognition” Traffic Sign Recognition is EASY (Super-human Performance)
161. 161. Vision Perception – Traffic Sign Detection Detecting Small Signs from Large Images • Breaks large images into small patches • Small-Object-Sensitive-CNN (SOS-CNN) • Based on SSD Meng et al ’17 “SOS-CNN” Handle Small Objects
162. 162. What can we learn from Driving Scenario? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
163. 163. 3D Reconstruction in Driving Scenario What does 3D Reconstruction do: Recover real-world Location and Pose of Driving Scenario Objects (2D to 3D) Main Components 5 mins Theoretic Backgrounds (a little Math)
164. 164. 3D Reconstruction – Theoretic Backgrounds • Perspective Projection
165. 165. 3D Reconstruction – Theoretic Backgrounds • Internal Camera Parameters
166. 166. 3D Reconstruction – Theoretic Backgrounds • External Camera Parameters
167. 167. 3D Reconstruction – Theoretic Backgrounds • Camera Model for Perspective Projection
168. 168. 3D Reconstruction – Theoretic Backgrounds • A Block Diagram
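The slide images for this background block are not part of the transcript, so the notation below is an assumption rather than the presenters' own. As a rough sketch, the standard pinhole model chains the external parameters (rotation R and translation t, mapping world to camera coordinates) with the internal parameters (focal lengths fx, fy and principal point cx, cy) and a perspective division; skew and lens distortion are ignored here.

```python
from typing import List, Sequence, Tuple


def project_point(point_world: Sequence[float],
                  rotation: List[List[float]],
                  translation: Sequence[float],
                  fx: float, fy: float, cx: float, cy: float
                  ) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates with a pinhole camera."""
    # External parameters: world -> camera coordinates, X_c = R * X_w + t.
    xc = [
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the internal parameters.
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return (u, v)


# Assumed example: camera at the world origin looking along +Z,
# 1280x720 image with the principal point at its center.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([2.0, 1.0, 10.0], identity, [0.0, 0.0, 0.0],
                      fx=800.0, fy=800.0, cx=640.0, cy=360.0)
print(pixel)
```

Doubling the depth of the point halves its offset from the principal point, which is the depth ambiguity that 3D reconstruction from 2D images has to resolve.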
169. 169. 3D Reconstruction – Semantic Reconstruction Kundu et al ’14 “Joint semantic and 3D reconstruction from monocular video” Semantic + 3D Reconstruction from Mono Camera
170. 170. 3D Reconstruction – Semantic Reconstruction Cherabier et al ’16 “Multi-label semantic 3d reconstruction using voxel blocks” Efficient Dense Semantic + 3D Reconstruction
171. 171. What can we learn from Driving Scenario? • What is in a driving scenario? (Vision Perception) • How far are the objects from the ego-vehicle? (3D Reconstruction) • How does the human driver interact with the environment? (Behavior Analysis)
172. 172. Driving Scenario Understanding Honda Research Institute Driving Dataset • 104 hours of real human driving records • Driving behavior and causal reasoning annotations Ramanishka et al ’18 “HDD” First Dataset towards Driving Scenario Understanding
173. 173. Driving Scenario Understanding Driving Attention Prediction from Video • Focus on Driver’s Attention • In-car vs. in-lab test Xia et al ’18 “Predicting Driver Attention” Introduce Attention Heat Maps
174. 174. Related Datasets [Table: comparison of driving datasets [1] to [7], HDD and D2-City [8]. Recoverable entries: purpose (driving behavior & causal reasoning / traffic participants detection & tracking), sensors (camera, GPS, IMU), 95.9, scenarios (suburban, urban and highway).]
175. 175. GAIA Open Dataset • Dataset: D²-City • D²-City is a large-scale driving video dataset that provides more than 10k videos recorded in 720p HD or 1080p FHD from front-facing dashcams, with annotations for object detection and tracking. • 1k videos: bounding boxes and tracking IDs of road objects annotated in 12 categories • 9k videos: bounding boxes annotated in key frames
176. 176. Q & A
177. 177. Thanks!