
3D Perception for Autonomous Driving - Datasets and Algorithms -

Recent datasets for autonomous driving and 3D object detection algorithms.


  1. 3D Perception for Autonomous Driving - Datasets and Algorithms - Kazuyuki Miyazawa, AI R&D Group 2, AI System Dept., Mobility Technologies Co., Ltd.
  2. Who am I? @kzykmyzw Kazuyuki Miyazawa, Group Leader, AI R&D Group 2, AI System Dept., Mobility Technologies Co., Ltd. Past work experience: April 2019 - March 2020, AI Research Engineer @ DeNA Co., Ltd.; April 2010 - March 2019, Research Scientist @ Mitsubishi Electric Corp. Education: PhD in Information Science @ Tohoku University
  3. Agenda: 01 Autonomous Driving Datasets; 02 3D Object Detection Algorithms
  4. Preliminary (Today's Main Topic) - 3D Object Detection: Motivation ■ 2D bounding boxes are not sufficient ■ They lack 3D pose, occlusion information, and 3D location (comparison of 2D vs. 3D object detection, figure from http://www.cs.toronto.edu/~byang/)
  5. 01 Autonomous Driving Datasets
  6. KITTI [2012] Sensor Setup: ● GPS/IMU x 1 ● LiDAR (64ch) x 1 ● Grayscale Camera (1.4M) x 2 ● Color Camera (1.4M) x 2 http://www.cvlibs.net/datasets/kitti/
  7. KITTI [2012] (figure)
  8. 3D Object Detection ● 7,481 training images / point clouds ● 7,518 test images / point clouds ● 80,256 labeled objects. Annotations: type (Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, or DontCare); truncated (0 to 1, the fraction of the object leaving the image boundaries); occluded (0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown); alpha (observation angle of the object, ranging [-pi..pi]); bbox (2D bounding box of the object in the image); dimensions (3D object dimensions: height, width, length); location (3D object location x, y, z in camera coordinates); rotation_y (rotation ry around the Y-axis in camera coordinates, [-pi..pi])
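
As a concrete illustration of the label format above, here is a minimal sketch (not from the talk) that parses one line of a KITTI label file into these fields; the example line is hypothetical but follows the published format:

```python
def parse_kitti_label(line):
    """Parse one line of a KITTI label_2 file into the fields listed above."""
    f = line.split()
    return {
        'type': f[0],
        'truncated': float(f[1]),
        'occluded': int(f[2]),
        'alpha': float(f[3]),
        'bbox': [float(v) for v in f[4:8]],         # left, top, right, bottom (px)
        'dimensions': [float(v) for v in f[8:11]],  # height, width, length (m)
        'location': [float(v) for v in f[11:14]],   # x, y, z in camera coords (m)
        'rotation_y': float(f[14]),                 # rotation around Y-axis (rad)
    }

# Hypothetical example line:
print(parse_kitti_label(
    "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"))
```
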
  9. License
  10. Variants of KITTI: SemanticKITTI provides annotations that associate each LiDAR point with one of 28 semantic classes in all 22 sequences of the KITTI dataset (http://semantic-kitti.org/). Virtual KITTI contains 50 high-resolution monocular videos (21,260 frames) generated from five different virtual worlds in urban settings under different imaging and weather conditions (https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds/)
  11. ApolloScape [2017] Sensor Setup: ● GPS/IMU x 1 ● LiDAR x 2 ● Color Camera (9.2M) x 2 http://apolloscape.auto/
  12. ApolloScape [2017] tasks: Scene Parsing, 3D Car Instance, Lane Segmentation
  13. ApolloScape [2017] tasks: Self Localization, Stereo
  14. 3D Object Detection ● 53 min of training sequences ● 50 min of testing sequences ● 70K 3D fitted cars. Annotations: type (Small vehicle, Big vehicle, Pedestrian, Motorcyclist and Bicyclist, Traffic cones, Others); dimensions (3D object dimensions: height, width, length); location (3D object location x, y, z in relative coordinates); heading (steering angle in radians with respect to the direction of the object)
  15. License ■ To the extent that we authorize the Developer to use Datasets and subject to the terms of this Agreement, the Developer is entitled to use the Datasets only (i) for Developer's internal purposes of non-commercial research or teaching and (ii) in accordance with the terms of this Agreement. http://apolloscape.auto/license.html
  16. nuScenes [2019] Sensor Setup: ● GPS/IMU x 1 ● LiDAR (32ch) x 1 ● RADAR x 5 ● Color Camera (1.4M) x 6 https://www.nuscenes.org/
  17. Semantic Map ● Provides highly accurate, human-annotated semantic maps of the relevant areas ● 11 semantic classes ● Encourages the use of localization and semantic maps as strong priors for all tasks
  18. 3D Object Detection - 1.4M boxes in total. Database schema tables: category, attribute, visibility, instance, sensor, calibrated_sensor, ego_pose, log, scene, sample, sample_data, sample_annotation, map (figures: number of annotations per category; attribute distribution for selected categories)
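
This relational schema is what the official nuscenes-devkit exposes; a minimal sketch of walking it (assumes the devkit is installed and the v1.0-mini split is extracted under /data/nuscenes, both of which are placeholders):

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split; dataroot is a placeholder path.
nusc = NuScenes(version='v1.0-mini', dataroot='/data/nuscenes', verbose=True)

sample = nusc.sample[0]                                  # one annotated keyframe ("sample")
ann = nusc.get('sample_annotation', sample['anns'][0])   # one 3D box annotation
print(ann['category_name'], ann['translation'], ann['size'], ann['rotation'])
```
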
  19. License
  20. Argoverse [2019] Sensor Setup: ● GPS x 1 ● LiDAR (32ch) x 2 ● Color Camera (4.8M) x 2 ● Color Camera (2M) x 7 https://www.argoverse.org/
  21. Argoverse Maps: Vector Map (lane-level geometry); Rasterized Map (ground height); Rasterized Map (drivable area)
  22. 3D Object Detection (3D Tracking) ● Collection of 113 log segments with 3D object tracking annotations ● These log segments vary in length from 15 to 30 seconds and contain a total of 11,052 tracks ● Each sequence includes annotations for all objects within 5 meters of the "drivable area", the area in which it is possible for a vehicle to drive
  23. License
  24. Lyft Level 5 [2019] Sensor Setup (BETA_V0): ● LiDAR (40ch) x 3 ● WFOV Camera (1.2M) x 6 ● Long-focal-length Camera (1.7M) x 1. Sensor Setup (BETA_++): ● LiDAR (64ch) x 1 ● LiDAR (40ch) x 2 ● WFOV Camera (2M) x 6 ● Long-focal-length Camera (2M) x 1 https://level5.lyft.com/dataset/
  25. Semantic Map
  26. 3D Object Detection (same format as nuScenes) - 638K boxes in total. Schema tables: category, attribute, visibility, instance, sensor, calibrated_sensor, ego_pose, log, scene, sample, sample_data, sample_annotation, map. Classes: animal, bicycle, bus, car, emergency_vehicle, motorcycle, other_vehicle, pedestrian, truck
  27. License
  28. Audi Autonomous Driving Dataset (A2D2) [2020] Sensor Setup: ● GPS/IMU x 1 ● LiDAR (16ch) x 5 ● Color Camera (2.3M) x 6 https://www.a2d2.audi/a2d2/en.html
  29. Audi Autonomous Driving Dataset (A2D2) [2020] (figure)
  30. 3D Object Detection ● All images have corresponding LiDAR point clouds, of which 12,497 are annotated with 3D bounding boxes within the field of view of the front-center camera
  31. License
  32. Comparison (dataset statistics table with one dataset's name masked as "?"; these figures are based on Table 1 in https://arxiv.org/abs/1912.04838)
  33. Comparison (the masked dataset is revealed: Waymo; these figures are based on Table 1 in https://arxiv.org/abs/1912.04838)
  34. Waymo Open Dataset [2019] Sensor Setup: ● Mid-Range (~75m) LiDAR x 1 ● Short-Range (~20m) LiDAR x 4 ● Color Camera (2M) x 3 ● Color Camera (1.6M) x 2 https://waymo.com/open/
  35. Data Volume ● Contains 1,150 segments, each spanning 20 seconds: Train 798 segments w/ labels (757 GB); Validation 202 segments w/ labels (144 GB); Test 150 segments w/o labels (192 GB) ● Additionally, segments from a new location, only a subset of which are labeled, are provided for domain adaptation
  36. Data Format ● Each segment (20 sec) consists of ~200 frames (10 Hz) ● All the data for a segment is stored in a single tfrecord file and represented as protocol buffers. Frame fields: context (information shared among all frames in the scene, e.g., calibration parameters, stats); timestamp_micros (frame timestamp); pose (vehicle pose); images (camera images and metadata, e.g., pose, velocity, timestamp); lasers (range images); laser_labels (3D box annotations); projected_lidar_labels (the laser_labels projected to camera images); camera_labels (2D box annotations); no_label_zones (polygons representing areas without labels, e.g., the opposite side of a highway)
  37. Range Image: The point cloud of each LiDAR is encoded as a range image with two returns (1st return and 2nd return), each storing range, intensity, and elongation channels
  38. API & Tutorial in Colab: https://github.com/waymo-research/waymo-open-dataset https://colab.research.google.com/github/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb
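
A condensed sketch of reading one segment, following the official tutorial linked above (exact function signatures may differ across package versions; the path is a placeholder):

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset
from waymo_open_dataset.utils import frame_utils

# Each segment is one tfrecord of serialized Frame protos (~200 frames at 10 Hz).
dataset = tf.data.TFRecordDataset('/path/to/segment.tfrecord', compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    # Decode the range images and convert them to a 3D point cloud.
    range_images, camera_projections, range_image_top_pose = (
        frame_utils.parse_range_image_and_camera_projection(frame))
    points, cp_points = frame_utils.convert_range_image_to_point_cloud(
        frame, range_images, camera_projections, range_image_top_pose)
    print(frame.timestamp_micros, len(frame.laser_labels), len(points))
    break  # just the first frame
```
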
  39. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR
  40. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR
  41. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR + Short-range LiDAR (front)
  42. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR + Short-range LiDAR (right)
  43. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR + Short-range LiDAR (rear)
  44. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR + Short-range LiDAR (left)
  45. Data Visualization (LiDAR Point Cloud): Mid-range LiDAR + Short-range LiDARs (all)
  46. Data Visualization (Camera Images): Front Left 1920x1080, Front 1920x1080, Front Right 1920x1080, Side Left 1920x886, Side Right 1920x886
  47. 3D Object Detection ■ 3D LiDAR Labels: 3D 7-DOF bounding boxes in the vehicle frame with globally unique tracking IDs; classes: vehicles, pedestrians, cyclists, signs ■ 2D Camera Labels: not projections of the 3D labels; tight-fitting, axis-aligned 2D bounding boxes with globally unique tracking IDs; classes: vehicles, pedestrians, cyclists. Labeled object and tracking ID counts (Vehicle / Pedestrian / Cyclist / Sign): 3D objects 6.1M / 2.8M / 67K / 3.2M; 3D track IDs 60K / 23K / 620 / 23K; 2D objects 7.7M / 2.1M / 63K / -; 2D track IDs 164K / 45K / 1.3K / -
  48. 2D Label Samples
  49. 3D Label Samples
  50. LiDAR-to-Camera Projection ■ Camera and LiDAR data are well synchronized ■ LiDAR points can be projected onto camera images with rolling-shutter compensation
  51. Challenges
  52. Evaluation Metrics for 3D Object Detection: the P/R curve and Average Precision (AP), plus Average Precision weighted by Heading (APH), in which each true positive is weighted by its heading accuracy relative to the ground truth (https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html)
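
The heading-accuracy weight itself appeared only as an image on the slide; per the Waymo paper (https://arxiv.org/abs/1912.04838), a true positive is weighted by one minus the wrapped angular error normalized by pi. A small sketch of my reading of that weight (not the official evaluation code):

```python
import numpy as np

def heading_accuracy(theta_pred, theta_gt):
    """Weight in [0, 1]: 1 for a perfect heading, 0 for a heading off by pi."""
    err = np.abs(theta_pred - theta_gt) % (2 * np.pi)
    err = np.minimum(err, 2 * np.pi - err)  # wrap the error into [0, pi]
    return 1.0 - err / np.pi

print(heading_accuracy(0.0, 0.0))    # 1.0
print(heading_accuracy(np.pi, 0.0))  # 0.0 (opposite direction)
```
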
  53. License: To ensure the Dataset is only used for Non-Commercial Purposes, You agree ■ Not to distribute or publish any models trained on or refined using the Dataset, or the weights or biases from such trained models ■ Not to use or deploy the Dataset, any models trained on or refined using the Dataset, or the weights or biases from such trained models (i) in operation of a vehicle or to assist in the operation of a vehicle, (ii) in any Production Systems, or (iii) for any other primarily commercial purposes https://waymo.com/open/terms/
  54. 02 3D Object Detection Algorithms
  55. PointNet [C. Qi+, CVPR2017] ■ Designs a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input ■ Provides a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing https://arxiv.org/abs/1612.00593
  56. PointNet Architecture (overview figure)
  57. PointNet Architecture: A mini-network predicts an affine transformation matrix that is used to align the input point set, achieving invariance to geometric transformations
  58. PointNet Architecture: The same alignment approach is also applied in feature space
  59. PointNet Architecture: Max pooling is used as a symmetric function to aggregate the unordered point features
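
A minimal PyTorch sketch of this shared-MLP-plus-max-pooling core (the alignment mini-networks are omitted for brevity; an illustrative sketch, not the reference implementation):

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """PointNet core: a per-point MLP shared via 1x1 convolutions, followed by
    max pooling as the symmetric (order-invariant) aggregation function."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):                     # x: (B, 3, N) point coordinates
        feat = self.mlp(x)                    # (B, 1024, N) per-point features
        global_feat = feat.max(dim=2).values  # symmetric max pooling -> (B, 1024)
        return self.head(global_feat)

logits = MiniPointNet()(torch.randn(2, 3, 1024))  # 2 clouds of 1024 points
```
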
  60. VoxelNet [Y. Zhou+, CVPR2018] (LiDAR only) ■ Divides a point cloud into 3D voxels and transforms them into a unified feature representation ■ This descriptive volumetric representation is then connected to an RPN to generate detections (a voxel represents a value on a regular grid in three-dimensional space, https://en.wikipedia.org/wiki/Voxel) https://arxiv.org/abs/1711.06396
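
Before the VFE layers described on the next slide, points must first be grouped into voxels; a NumPy sketch of that grouping (voxel sizes and the per-voxel point cap are illustrative values, not the paper's exact configuration):

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_points=35):
    """Group (N, >=3) points into a dict keyed by integer voxel index,
    capping the number of points kept per voxel."""
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for c, p in zip(map(tuple, coords), points):
        bucket = voxels.setdefault(c, [])
        if len(bucket) < max_points:  # VoxelNet randomly subsamples crowded voxels
            bucket.append(p)
    return voxels
```
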
  61. Voxel Feature Encoding (VFE) Layer ● VFE enables inter-point interaction within a voxel by combining point-wise features with a locally aggregated feature ● Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information
  62. Convolutional Middle Layers ● Each convolutional middle layer applies 3D convolution, batch normalization, and ReLU sequentially ● The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description
  63. Region Proposal Network ● The first layer of each block downsamples the input feature map ● The output of every block is then upsampled to a fixed size and concatenated to construct a high-resolution feature map ● Finally, this feature map is mapped to the desired learning targets
  64. Evaluation on KITTI (performance comparison on the KITTI validation and test sets)
  65. SECOND (Sparsely Embedded CONvolutional Detection) [Y. Yan+, Sensors 2018] (LiDAR only) ■ Applies sparse convolution to greatly increase the speed of training and inference ■ Introduces a novel angle loss regression approach to solve the problem of the large loss generated when the angle prediction error is equal to π https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf
  66. Sparse Convolution Algorithm ■ Gather the necessary input to construct the matrix, perform GEMM, then scatter the data back ■ A GPU-based rule generation algorithm is proposed to construct the input-output index rule matrix
  67. Sine-Error Loss for Angle Regression ■ Directly predicting the radian offset suffers from an adversarial-example problem between the cases of 0 and π radians: they correspond to the same box but generate a large loss when one is misidentified as the other ■ This is solved by introducing a new angle loss regression based on the sine of the angle difference ■ To address the issue that this loss treats boxes with opposite directions as the same, a simple direction classifier is added to the output of the RPN
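
The regression target shown on the slide is the sine of the heading residual; a PyTorch sketch of my paraphrase of that loss:

```python
import torch
import torch.nn.functional as F

def sine_angle_loss(theta_pred, theta_gt):
    # sin(pred - gt) is ~0 both for a correct heading and for one flipped by
    # pi, so a pi-flipped box no longer incurs a huge loss; the separate
    # direction classifier then disambiguates the two orientations.
    residual = torch.sin(theta_pred - theta_gt)
    return F.smooth_l1_loss(residual, torch.zeros_like(residual))
```
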
  68. Evaluation on KITTI (performance comparison on the KITTI validation and test sets)
  69. PointPillars [A. Lang+, CVPR2019] (LiDAR only) ■ Proposes an encoder that learns a representation of point clouds organized in vertical columns (pillars) and generates a pseudo 2D image ■ The encoded features can be used with any standard 2D convolutional detection architecture, without computationally expensive 3D ConvNets https://arxiv.org/abs/1812.05784
  70. Point Cloud to Pseudo-Image: The point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars
  71. Point Cloud to Pseudo-Image: Create a dense tensor of size (D, P, N), where D is the dimension of the augmented LiDAR point (= 9), P is the number of non-empty pillars per sample, and N is the number of points per pillar
  72. Point Cloud to Pseudo-Image: Apply PointNet to generate a (C, P, N) sized feature tensor, followed by a max operation over the N points in each pillar to create an output tensor of size (C, P)
  73. Point Cloud to Pseudo-Image: The features are scattered back to the original pillar locations to create a pseudo-image of size (C, H, W), where H and W indicate the height and width of the canvas
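
A minimal sketch of that scatter step (tensor shapes follow the slides; an illustration rather than the official implementation):

```python
import torch

def scatter_to_canvas(pillar_features, coords, H, W):
    """pillar_features: (C, P) encoded pillars; coords: (P, 2) integer (y, x)
    pillar locations. Returns the (C, H, W) pseudo-image."""
    C, P = pillar_features.shape
    canvas = pillar_features.new_zeros(C, H * W)
    flat_idx = coords[:, 0] * W + coords[:, 1]  # (y, x) -> linear canvas index
    canvas[:, flat_idx] = pillar_features       # empty pillars remain zero
    return canvas.view(C, H, W)
```
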
  74. Backbone: A top-down network produces features at increasingly small spatial resolution; a second network performs upsampling and concatenation of the top-down features
  75. Detection Head: Single Shot Detector (SSD) is used, with additional regression targets (height and elevation)
  76. Evaluation on KITTI (performance comparison on the KITTI test set)
  77. Let's Try PointPillars on Waymo Open Dataset ■ Implementation ■ The official PointPillars implementation is forked from SECOND's implementation and is no longer maintained ■ Instead, SECOND's implementation now supports PointPillars ■ Format Conversion ■ SECOND's implementation only supports KITTI and nuScenes, so format conversion is the fastest way to use the Waymo Open Dataset ■ Several converters can be found on GitHub: Waymo_Kitti_Adapter, waymo_kitti_converter
  78. Vehicle Detection Results (these results are just for reference, because only part of the training set is used and hyperparameters are not tuned to the Waymo Open Dataset at all)
  79. Vehicle Detection Results (these results are just for reference, because only part of the training set is used and hyperparameters are not tuned to the Waymo Open Dataset at all)
  80. Results from the Leaderboard on Waymo Open Dataset: https://waymo.com/open/challenges/3d-detection/#
  81. Frustum PointNets [C. Qi+, CVPR2018] (LiDAR + Camera) ■ First generates 2D object region proposals in the RGB image using a CNN; each 2D region is then extruded to a 3D viewing frustum to get a point cloud ■ PointNet predicts a 3D bounding box for the object from the points in the frustum https://arxiv.org/abs/1711.08488
  82. Frustum Proposal ● Use an object detector on the RGB image to predict a 2D bounding box and lift it to a frustum with a known camera matrix ● Collect all points within the frustum to form a frustum point cloud
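
A rough sketch of collecting the frustum point cloud from a 2D box (assumes points are already in the camera frame and a 3x4 projection matrix P, as in KITTI calibration; the helper name is mine):

```python
import numpy as np

def frustum_points(points, P, box2d):
    """points: (N, 3) in camera coordinates; P: (3, 4) projection matrix;
    box2d: (x1, y1, x2, y2). Keep points whose image projection is in the box."""
    x1, y1, x2, y2 = box2d
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    uvw = pts_h @ P.T
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    in_box = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2) & (uvw[:, 2] > 0)
    return points[in_box]
```
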
  83. 3D Instance Segmentation: The object instance is segmented by binary classification of each point using PointNet
  84. Amodal 3D Box Estimation: Estimate the object's amodal, oriented 3D bounding box using a box regression PointNet; estimate the true center of the complete object and then transform the coordinates so that the predicted center becomes the origin
  85. Evaluation on KITTI (performance comparison on the KITTI validation and test sets)
  86. PV-RCNN [S. Shi+, CVPR2020] (LiDAR only) ■ The voxel-based operation efficiently encodes multi-scale feature representations and can generate high-quality 3D proposals, while the PointNet-based set abstraction operation preserves accurate location information with flexible receptive fields ■ Integrates the two operations via voxel-to-keypoint 3D scene encoding and keypoint-to-grid RoI feature abstraction https://arxiv.org/abs/1912.13192
  87. 3D Voxel CNN for Feature Encoding and Proposal Generation: Input points are first divided into voxels and gradually converted into feature volumes by a 3D sparse CNN; by converting the 3D feature volumes into 2D bird's-eye-view feature maps, high-quality 3D proposals are generated following anchor-based approaches
  88. Voxel-to-Keypoint Scene Encoding via Voxel Set Abstraction: A small number of keypoints is sampled from the point cloud; a PointNet-based set abstraction module encodes multi-scale semantic features from the 3D CNN feature volumes into the keypoints; each keypoint is then checked for being inside or outside a ground-truth 3D box, and the keypoint features are re-weighted accordingly
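
PV-RCNN samples those keypoints with farthest point sampling; a small NumPy sketch of that sampler (an illustration, not the paper's CUDA implementation):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k points, each maximizing its distance to those already chosen."""
    chosen = [0]                                   # start from an arbitrary point
    dist = np.full(len(points), np.inf)
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)                 # distance to nearest chosen point
        chosen.append(int(dist.argmax()))
    return points[chosen]
```
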
  89. Keypoint-to-Grid RoI Feature Abstraction for Proposal Refinement: An RoI-grid pooling module aggregates the keypoint features to the RoI-grid points with multiple receptive fields using PointNet
  90. Evaluation on KITTI / Waymo Open Dataset (performance comparison on the KITTI test set and the Waymo OD validation set)
  91. We Don't Need a Camera? 3D vehicle detection performance on the KITTI test set (moderate): LiDAR-only methods vs. LiDAR + Camera methods
  92. Summary ■ Autonomous Driving Datasets ■ KITTI is the most famous and most frequently used dataset for vehicle-related research; however, it is limited in size, and performance on it is saturating (> 80% AP) ■ More recent datasets provide much larger multi-modal sensor data and annotations, and some of them also provide semantic maps ■ Waymo Open Dataset is one of the largest and most diverse datasets ever released and provides high-quality (meta)data and annotations (but unfortunately, it is NOT commercial-friendly at all) ■ 3D Object Detection Algorithms ■ Recent 3D object detection algorithms re-purpose camera-based detection architectures, which have been greatly advanced by CNNs and many mature techniques such as region proposal ■ The two main streams are grid-based methods and point-based methods; a key component of the former is the 2D/3D CNN, and of the latter, PointNet ■ The current state of the art is dominated by LiDAR-only methods, and LiDAR-camera fusion methods lag behind
  93. Mobility Technologies Co., Ltd.
