BEV SEMANTIC
SEGMENTATION
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
OUTLINE
• Learning to Look around Objects for Top-View
Representations of Outdoor Scenes
• Monocular Semantic Occupancy Grid Mapping
with Convolutional Variational Enc-Dec Networks
• Cross-view Semantic Segmentation for Sensing
Surroundings
• MonoLayout: Amodal scene layout from a single
image
• Predicting Semantic Map Representations from
Images using Pyramid Occupancy Networks
• A Sim2Real DL Approach for the Transformation
of Images from Multiple Vehicle-Mounted Cameras
to a Semantically Segmented Image in BEV
• FISHING Net: Future Inference of Semantic
Heatmaps In Grids
• BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry and Semantic Point Cloud
• Lift, Splat, Shoot: Encoding Images from Arbitrary
Camera Rigs by Implicitly Unprojecting to 3D
• Understanding Bird’s-Eye View Semantic HD-maps
Using an Onboard Monocular Camera
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
• Estimating an occlusion-reasoned semantic scene layout in the top-view.
• This challenging problem requires an accurate understanding not only of the 3D geometry and the semantics of the visible scene, but also of the occluded areas.
• A convolutional neural network that learns to predict occluded portions of the scene layout by
looking around foreground objects like cars or pedestrians.
• But instead of hallucinating RGB values, directly predicting the semantics and depths in the
occluded areas enables a better transformation into the top-view.
• This initial top-view representation can be significantly enhanced by learning priors and rules about
typical road layouts from simulated or, if available, map data.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The inpainting CNN first encodes a masked image and the mask
itself. The extracted features are concatenated and two decoders
predict semantics and depth for visible and occluded pixels.
To train the inpainting CNN, FG objects are ignored since no GT is available for them (red), but masks are artificially added (green) over BG regions where full annotation is already available.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The process of mapping the semantic segmentation with corresponding
depth first into a 3D point cloud and then into the bird's eye view. The red
and blue circles illustrate corresponding locations in all views.
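
A minimal NumPy sketch of this mapping, assuming a pinhole camera with intrinsics (fx, fy, cx, cy) and illustrative grid extents and resolution:

import numpy as np

def semantics_depth_to_bev(depth, semantics, fx, fy, cx, cy,
                           x_range=(-20.0, 20.0), z_range=(0.0, 40.0), res=0.1):
    """Unproject per-pixel depth into a 3D point cloud, then bin the points
    into a top-view (BEV) grid storing one semantic label per cell."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole back-projection: pixel (u, v) with depth d -> camera coordinates.
    x = (u - cx) / fx * depth              # lateral
    y = (v - cy) / fy * depth              # height (could be used to filter points)
    z = depth                              # forward
    inside = (x >= x_range[0]) & (x < x_range[1]) & (z >= z_range[0]) & (z < z_range[1])
    cols = ((x[inside] - x_range[0]) / res).astype(int)
    rows = ((z[inside] - z_range[0]) / res).astype(int)
    bev = np.zeros((int((z_range[1] - z_range[0]) / res),
                    int((x_range[1] - x_range[0]) / res)), dtype=np.uint8)
    bev[rows, cols] = semantics[inside]    # last point wins per cell in this sketch
    return bev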
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network
receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data,
and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the
initial BEV map and a crop of OSM data (obtained via a noisy GPS and yaw estimate). The CNN predicts a warp
for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) We use a composition of similarity transform (left, “box") and a non-parametric warp (right, “flow") to
align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom)
Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping
functions, respectively: “box", ”flow", “box+flow", “box+flow (with regularization)". Note the importance of
composing the transformations and the induced regularization.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Examples of the BEV representation.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
• This work performs end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth.
• The network learns to predict four classes, as well as a camera to bird’s eye view mapping.
• At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual
information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian
coordinate system.
• The variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data (a minimal sketch follows this list).
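
A toy PyTorch sketch of the variational encoder-decoder idea described above; layer sizes, the 4-class output, and module names are illustrative assumptions rather than the paper's exact architecture:

import torch
import torch.nn as nn

class VariationalBEVMapper(nn.Module):
    """Toy variational encoder-decoder: a front-view RGB image is compressed
    to a small latent vector, sampled, and decoded into a top-view semantic
    occupancy grid (4 classes here)."""
    def __init__(self, latent_dim=128, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, front_view_rgb):
        h = self.encoder(front_view_rgb)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar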
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Illustration of the proposed variational encoder-decoder approach. From a single front-view
RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Some visualized mapping examples on the test set with different methods.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Cross-view Semantic Segmentation For Sensing
Surroundings
• Cross-view semantic segmentation: a framework named View Parsing Network (VPN) is proposed to address it.
• In the cross-view semantic segmentation task, the agent is trained to parse the first-view
observations into a top-down-view semantic map indicating the spatial location of all the
objects at pixel-level.
• The main issue of this task is the lack of real-world annotations for top-down-view data.
• To mitigate this, the VPN is trained in a 3D graphics environment and domain adaptation is used to transfer it to real-world data (a view-transformer sketch follows this list).
• Code and demo videos can be found at https://view-parsing-network.github.io.
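
A minimal PyTorch sketch of a cross-view transform in the spirit of VPN: each channel's flattened first-view spatial map is re-mixed by an MLP into the flattened top-view map (shapes and the module name are illustrative assumptions):

import torch
import torch.nn as nn

class ViewTransformer(nn.Module):
    """Sketch of a cross-view transform: every channel's flattened first-view
    spatial map is mapped by an MLP to a flattened top-view spatial map."""
    def __init__(self, in_hw=(32, 32), out_hw=(32, 32), hidden=1024):
        super().__init__()
        self.out_hw = out_hw
        self.mlp = nn.Sequential(
            nn.Linear(in_hw[0] * in_hw[1], hidden), nn.ReLU(),
            nn.Linear(hidden, out_hw[0] * out_hw[1]))

    def forward(self, feat):               # feat: (B, C, H, W) first-view features
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)      # treat each channel independently
        top = self.mlp(flat)               # (B, C, H_out * W_out)
        return top.view(b, c, *self.out_hw)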
Cross-view Semantic Segmentation For Sensing
Surroundings
Framework of the View
Parsing Network for
cross-view semantic
segmentation. The
simulation part shows
the architecture and
training scheme of VPN,
while the real-world part
demonstrates the
domain adaptation
process for transferring
VPN to the real world.
Cross-view Semantic Segmentation For Sensing
Surroundings
Qualitative results of sim-to-real adaptation: the source prediction before and after domain adaptation, the drivable-area prediction after adaptation, and the ground-truth drivable-area map.
MonoLayout: Amodal Scene Layout From A Single
Image
• Given a single color image captured from a driving platform, to predict the bird’s eye view layout
of the road and other traffic participants.
• The estimated layout should reason beyond what is visible in the image, and compensate for the
loss of 3D information due to projection.
• Amodal scene layout estimation involves hallucinating the scene layout even for parts of the world that are occluded in the image.
• MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image.
• The scene layout is represented as a multi-channel semantic occupancy grid, and adversarial feature learning is leveraged to "hallucinate" plausible completions for occluded parts (a minimal adversarial-training sketch follows this list).
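
A hedged PyTorch sketch of the adversarial-feature-learning idea: a patch discriminator scores predicted occupancy-grid layouts against samples drawn from a set of plausible layouts, pushing the decoder toward realistic completions; network sizes and the BCE formulation are illustrative assumptions, not MonoLayout's exact losses:

import torch
import torch.nn as nn

class LayoutDiscriminator(nn.Module):
    """Patch discriminator over single-channel occupancy-grid layouts."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1))   # patch-wise real/fake scores

    def forward(self, layout):
        return self.net(layout)

def adversarial_losses(disc, predicted_layout, plausible_layout):
    """Standard GAN losses: the discriminator separates predicted from
    plausible layouts; the layout decoder is trained to fool it."""
    bce = nn.BCEWithLogitsLoss()
    fake, real = disc(predicted_layout.detach()), disc(plausible_layout)
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    g_scores = disc(predicted_layout)                   # decoder wants these scored "real"
    g_loss = bce(g_scores, torch.ones_like(g_scores))
    return d_loss, g_loss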
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about
the amodal scene layout in bird’s eye view in real-time (30 fps). This approach, MonoLayout can
hallucinate regions of the static scene (road, sidewalks)—and traffic participants—that do not even
project to the visible regime of the image plane. Shown above are example images from the KITTI (left)
and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on
hallucinating occluded regions.
MonoLayout: Amodal Scene Layout From A Single
Image
Architecture: MonoLayout takes in a color image of an urban driving scenario and predicts an amodal scene layout in bird's eye view. The architecture comprises a context encoder, amodal layout decoders, and two discriminators.
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Amodal Scene Layout From A Single
Image
Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene
(road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding
objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is
accurately able to do so. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout
generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where most of the view is blocked by vehicles, and the scenes exhibit high dynamic range (HDR) and shadows.
MonoLayout: Amodal Scene Layout From A Single
Image
Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D Object detection
benchmark. From left to right, the column corresponds to the input image, Mono Occupancy, Mono3D,
OFT, MonoLayout, and the ground truth, respectively. While the other approaches miss cars (top row), split a single vehicle detection into two (second row), or produce stray detections off the road (third row), MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
MonoLayout: Amodal Scene Layout From A Single
Image
Amodal scene layout estimation on the Argoverse
dataset. The dataset comprises multiple challenging scenarios with low illumination and a large number of vehicles. MonoLayout is able to accurately produce sharp estimates of vehicle and road
layouts. (Sidewalks are not predicted here, as they
aren’t annotated in Argoverse).
MonoLayout: Amodal Scene Layout From A Single
Image
Trajectory forecasting: MonoLayout-
forecast accurately estimates future
trajectories of moving vehicles. (Left): In
each figure, the magenta cuboid shows the
initial position of the vehicle. MonoLayout-
forecast is pre-conditioned for 1 second
by observing the vehicle, at which point
(cyan cuboid) it starts forecasting future
trajectories (blue). The ground-truth
trajectory is shown in red for comparison.
(Right): Trajectories visualized in image
space. Notice how MonoLayout-forecast is
able to forecast trajectories accurately
despite the presence of moving obstacles
(top row), turns (middle row), and merging
traffic (bottom row).
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• Maps are typically built from separate vision-based elements: ground plane estimation, road segmentation, and 3D object detection.
• a simple, unified approach for estimating maps directly from monocular images using a single
end-to-end deep learning architecture
• For the maps themselves, a semantic Bayesian occupancy grid framework is adopted, allowing information to be trivially accumulated over multiple cameras and timesteps (a minimal log-odds fusion sketch follows this list).
• Code is available at http://github.com/tom-roddick/mono-semantic-maps.
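
A minimal NumPy sketch of Bayesian occupancy-grid fusion via log-odds accumulation, which is one standard way to combine per-camera, per-timestep class probabilities (the prior value and shapes are assumptions):

import numpy as np

def fuse_occupancy_logodds(grids, prior=0.5):
    """Fuse per-camera / per-timestep semantic occupancy probabilities into
    one Bayesian occupancy grid. `grids` is a list of arrays of shape
    (C, H, W) with values in (0, 1)."""
    eps = 1e-6
    prior_logit = np.log(prior / (1.0 - prior))
    fused = np.zeros_like(grids[0])
    for g in grids:
        g = np.clip(g, eps, 1.0 - eps)
        fused += np.log(g / (1.0 - g)) - prior_logit          # accumulate evidence
    return 1.0 / (1.0 + np.exp(-(fused + prior_logit)))       # back to probabilities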
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Given a set of surround-view images,
predict a full 360° bird's-eye-view
semantic map, which captures both
static elements like road and
sidewalk as well as dynamic actors
such as cars and pedestrians.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features
at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context
from lower pyramid layers. (3) A stack of dense transformer layers maps the image-based features into the bird's-eye view. (4) The top-down network processes the bird's-eye-view features and predicts the final semantic occupancy probabilities.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
This dense transformer layer first condenses the image-based features along the vertical dimension, whilst retaining the horizontal dimension. It then predicts a set of features along the depth axis in a polar coordinate system, which are then resampled to Cartesian coordinates.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• The dense transformer layer is motivated by the observation that, while the network needs a lot of vertical context to map features to the BEV, in the horizontal direction the relationship between BEV locations and image locations can be established using camera geometry.
• In order to retain the maximum amount of spatial information, the vertical and channel dimensions of the image feature map are collapsed to a bottleneck of size B, while the horizontal dimension W is preserved.
• A 1D convolution is then applied along the horizontal axis, and the resulting feature map is reshaped to give a tensor with channel, depth, and width dimensions (see the sketch after this list).
• However, this feature map, which is still in image-space coordinates, actually corresponds to a trapezoid in the orthographic BEV space due to perspective, so the final step is to resample it into a Cartesian frame using the known camera focal length f and horizontal offset u0.
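
An illustrative PyTorch sketch of such a dense-transformer-style layer, assuming the polar-to-Cartesian sampling grid is precomputed from the focal length f and offset u0; sizes and module names are not the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformerSketch(nn.Module):
    """Collapse the vertical image dimension into a bottleneck, expand along a
    polar depth axis with a 1D convolution, then resample polar -> Cartesian BEV."""
    def __init__(self, in_channels, in_height, bottleneck=64, out_channels=64, depth_bins=48):
        super().__init__()
        # Kernel spans the full (fixed) feature height, giving shape (B, bneck, 1, W).
        self.collapse = nn.Conv2d(in_channels, bottleneck, kernel_size=(in_height, 1))
        self.expand = nn.Conv1d(bottleneck, out_channels * depth_bins, kernel_size=3, padding=1)
        self.out_channels, self.depth_bins = out_channels, depth_bins

    def forward(self, feat, cart_grid):
        # feat: (B, C, H, W) image features; cart_grid: (B, Z, X, 2) normalized
        # sampling locations in the polar map, precomputed from f and u0.
        b, _, _, w = feat.shape
        x = self.collapse(feat).squeeze(2)                               # (B, bneck, W)
        x = self.expand(x)                                               # (B, out*Z, W)
        polar = x.view(b, self.out_channels, self.depth_bins, w)         # (B, C', Z, W) polar features
        return F.grid_sample(polar, cart_grid, align_corners=False)      # (B, C', Z, X) Cartesian BEV

# Hypothetical usage: bev = layer(image_features, cartesian_to_polar_grid)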
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• To obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras.
• The corrected BEV image is segmented into semantic classes and includes a prediction of
occluded areas.
• The neural network approach does not rely on manually labeled data, but is trained on a
synthetic dataset in such a way that it generalizes well to real-world data.
• By using semantically segmented images as input, reduce the reality gap between simulated and
real-world data and are able to show that the method can be successfully applied in the real
world.
• Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A homography can be applied to the four semantically segmented images from
vehicle-mounted cameras to transform them to BEV. This approach involves
learning to compute an accurate BEV image without visual distortions.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• For each vehicle camera, virtual rays are cast from its mount position to the edges of the
semantically segmented ground truth BEV image.
• The rays are only cast to edge pixels that lie within the specific camera’s field of view.
• All pixels along these rays are processed to determine their occlusion state according to the following rules (a simplified ray-walk sketch follows this list):
1. some semantic classes always block sight (e.g. building, truck);
2. some semantic classes never block sight (e.g. road);
3. cars block sight, except for taller objects behind them (e.g. truck, bus);
4. partially occluded objects remain completely visible;
5. objects are only labeled as occluded if they are occluded in all camera perspectives.
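
A simplified NumPy sketch of the per-camera ray walk implied by rules 1-3 (class ids and the rule set are illustrative assumptions; rules 4 and 5 would be applied when fusing the per-object and per-camera results):

import numpy as np

ALWAYS_BLOCK = {3, 4}      # e.g. building, truck (rule 1)
NEVER_BLOCK = {0}          # e.g. road (rule 2)
CAR = 2
TALL = {4, 5}              # e.g. truck, bus: still visible behind a car (rule 3)

def mark_occlusion_along_ray(labels, occluded, start, end, steps=200):
    """labels: (H, W) class-id grid; occluded: (H, W) bool grid updated in place;
    start/end: (row, col) cells of the camera mount and one BEV edge pixel."""
    blocker = None                                  # None, "car", or "hard"
    for t in np.linspace(0.0, 1.0, steps):
        r = int(round(start[0] + t * (end[0] - start[0])))
        c = int(round(start[1] + t * (end[1] - start[1])))
        cls = labels[r, c]
        # Visible if nothing blocks the ray so far, or only a car blocks it
        # and this cell contains a taller object (rule 3).
        visible = blocker is None or (blocker == "car" and cls in TALL)
        if not visible:
            occluded[r, c] = True                   # per-camera result; rule 5 is
                                                    # applied when fusing all cameras
        # Update the blocking state for the cells farther along the ray.
        if cls in ALWAYS_BLOCK:
            blocker = "hard"
        elif cls == CAR and blocker is None:
            blocker = "car"
        # NEVER_BLOCK classes leave the state unchanged (rule 2).
    return occluded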
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The uNetXST architecture has
separate encoder paths for each
input image (green paths). As part of
the skip-connection on each scale
level (violet paths), feature maps are
projectively transformed (v-block),
concatenated with the other input
streams (||-block), convolved, and
finally concatenated with upsampled
output of the decoder path. This
illustration shows a network with only two pooling and two upsampling layers; the actual trained network contains four of each.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The v-block resembles a Spatial Transformer unit.
Input feature maps from preceding convolutional
layers (orange grid layers) are projectively
transformed by the homographies obtained through
IPM (Inverse Perspective Mapping). The transformation
differs between the input streams for the different
cameras. Spatial consistency is established, since the
transformed feature maps all capture the same field
of view as the ground truth BEV. The transformed
feature maps are then concatenated into a single
feature map (cf. ||-block).
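
A short sketch of the projective transform plus concatenation, using kornia's warp_perspective as an assumed stand-in for the v-block and ||-block (the BEV size is illustrative):

import torch
import kornia.geometry.transform as KGT

def warp_feature_maps_to_bev(feature_maps, homographies, bev_size=(256, 512)):
    """Each camera stream's feature map (B, C, H, W) is warped into the common
    BEV frame with its IPM homography (B, 3, 3), then the streams are
    concatenated channel-wise."""
    warped = [KGT.warp_perspective(f, H, dsize=bev_size)
              for f, H in zip(feature_maps, homographies)]   # the v-blocks
    return torch.cat(warped, dim=1)                          # the ||-block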
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• End-to-end pipeline that performs semantic segmentation and short term prediction using a top
down representation.
• This approach consists of an ensemble of neural networks which take in sensor data from different
sensor modalities and transform them into a single common top-down semantic grid representation.
• This representation is favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information of the surrounding scene.
• Because the modalities share a single output representation, they can be easily aggregated to produce
a fused output.
• This work predicts short-term semantic grids but the framework can be extended to other tasks.
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net Architecture:
multiple neural networks, one for
each sensor modality (lidar, radar
and camera) take in a sequence
of input sensor data and output a
sequence of shared top-down
semantic grids representing 3
object classes (Vulnerable Road
Users (VRU), vehicles and
background). The sequences are
then fused using an aggregation
function to output a fused
sequence of semantic grids.
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The overall architecture consists of a neural network for each sensor modality.
• Across all modalities, the network architecture consists of an encoder decoder network with
convolutional layers.
• It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the
decoder.
• After the decoder, a single linear convolutional layer produces logits, and a softmax produces the final output probabilities for each of the three classes at each output timestep (a minimal per-modality sketch follows this list).
• It uses a slightly different encoder and decoder scheme for the vision network compared to the
lidar and radar networks to account for the pixel space features.
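
A toy PyTorch sketch of such a per-modality encoder-decoder with 2x2 average pooling, a linear convolution for logits, and a per-timestep softmax over the three classes (channel counts and depths are illustrative assumptions):

import torch
import torch.nn as nn

class ModalityGridNet(nn.Module):
    """Toy per-modality network: small encoder-decoder with 2x2 average
    pooling / upsampling, a 1x1 conv for logits, and a softmax over 3 classes
    for each output timestep."""
    def __init__(self, in_channels, timesteps=5, num_classes=3):
        super().__init__()
        self.timesteps, self.num_classes = timesteps, num_classes
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, timesteps * num_classes, 1)   # linear conv -> logits

    def forward(self, x):                       # x: (B, C_in, H, W) stacked input frames
        logits = self.head(self.decoder(self.encoder(x)))
        b, _, h, w = logits.shape
        logits = logits.view(b, self.timesteps, self.num_classes, h, w)
        return torch.softmax(logits, dim=2)     # class probabilities per timestep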
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Vision architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Lidar and Radar Architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The LiDAR features consist of: 1) binary lidar occupancy (1 if any lidar point is present in a given grid cell, 0 otherwise); 2) lidar density (log-normalized density of all lidar points present in a grid cell); 3) max z (largest height value of the lidar points in a given grid cell); 4) max z sliced (largest z value for each grid cell over 5 linear slices, e.g. 0-0.5 m, ..., 2.0-2.5 m). A minimal lidar-feature sketch follows this list.
• The radar features consist of: 1) binary radar occupancy (1 if any radar point is present in a given grid cell, 0 otherwise); 2) X, Y values of each radar return's Doppler velocity, compensated for the ego vehicle's motion; 3) radar cross section (RCS); 4) signal-to-noise ratio (SNR); 5) ambiguous Doppler interval.
• The dimensions of the images match the output resolution of 192 by 320.
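
A minimal NumPy sketch of rasterizing a lidar point cloud into a few of the BEV feature channels listed above (ranges, grid size, and axis conventions are illustrative assumptions):

import numpy as np

def lidar_bev_features(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                       grid=(192, 320)):
    """Rasterize a lidar point cloud (N, 3) into binary occupancy,
    log-normalized density, and max-z BEV channels."""
    h, w = grid
    rows = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * h).astype(int)
    cols = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * w).astype(int)
    valid = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    rows, cols, z = rows[valid], cols[valid], points[valid, 2]

    occupancy = np.zeros(grid, dtype=np.float32)
    density = np.zeros(grid, dtype=np.float32)
    max_z = np.full(grid, -np.inf, dtype=np.float32)
    for r, c, zz in zip(rows, cols, z):
        occupancy[r, c] = 1.0
        density[r, c] += 1.0
        max_z[r, c] = max(max_z[r, c], zz)
    density = np.log1p(density)                    # log-normalized density
    max_z[~np.isfinite(max_z)] = 0.0               # empty cells
    return np.stack([occupancy, density, max_z])   # (3, H, W)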
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Qualitative example: label (left); inputs for lidar, radar and vision; predictions for lidar, radar and vision.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV
from side RGB images.
• Two main challenges: the view transformation from side view to bird’s eye view, as well as
transfer learning to unseen domains.
• The two-stage perception pipeline explicitly predicts pixel depths and combines them with pixel
semantics in an efficient manner, allowing the model to leverage depth information to infer
objects’ spatial locations in the BEV.
• Transfer learning is achieved by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
BEV-Seg
pipeline
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• In the first stage, N RGB road scene images are captured by cameras at different angles and
individually pass through semantic segmentation network S and depth estimation network D.
• The resulting side semantic segmentations and depth maps are combined and projected into a
semantic point cloud.
• This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed
into a parser network to predict the final bird’s-eye segmentation.
• The rest of this section provides details on the various components of the pipeline.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic
segmentation.
• For monocular depth estimation, implement SORD using the same HRNet as the backbone.
• For both tasks, train the same model on all four views.
• The resulting semantic point cloud is projected height-wise onto a 512x512 image.
• Train a separate HRNet model as the parser network for the final bird’s-eye segmentation.
• Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target
domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the
target domain.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
Table 1: Segmentation Result on
BEVSEG-Carla. Oracle models have
ground truth given for specified inputs.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
• End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given
image data from an arbitrary number of cameras
• To “lift" each image individually into a frustum of features for each camera, then “splat" all
frustums into a rasterized bird's-eye view grid
• To learn how to represent images and how to fuse predictions from all cameras into a single
cohesive representation of the scene while being robust to calibration error
• Code: https://nv-tlabs.github.io/lift-splat-shoot
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate
frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation
(green). These BEV predictions are then projected back onto input images (dots on the left).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Traditionally, computer vision tasks such as semantic segmentation involve making predictions in
the same coordinate frame as the input image. In contrast, planning for self-driving generally
operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye-
view frame for end-to-end planning from multi-view images.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the “lift" step. For each pixel, it predicts a categorical distribution over depth (left) and a
context vector (top left). Features at each point along the ray are determined by their outer product (right).
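
A minimal PyTorch sketch of this outer product for a single image, assuming D discrete depth bins:

import torch

def lift_pixel_features(context, depth_logits):
    """Scatter every pixel's context vector along its camera ray by taking
    the outer product with a categorical depth distribution.
    context: (B, C, H, W), depth_logits: (B, D, H, W) -> (B, C, D, H, W)."""
    depth_prob = torch.softmax(depth_logits, dim=1)           # distribution over D depth bins
    return context.unsqueeze(2) * depth_prob.unsqueeze(1)     # per-pixel outer product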
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
In the “lift" step, a frustum-shaped point cloud is generated for each individual image (center-left). The
extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV
CNN processes the BEV representation for BEV semantic segmentation or planning (right).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the 1K trajectory templates that are "shot" onto the cost map during training and testing. During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates. During testing, the argmax of this distribution is chosen and the vehicle acts according to the chosen template.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Instead of the hard-margin loss proposed in NMP (Neural Motion Planner), planning is framed as classification over a set of K template trajectories. To leverage the cost-volume nature of the planning problem, the distribution over the K template trajectories is enforced to take the following form:
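
A plausible reconstruction of that softmax (Boltzmann) form, where c_{\tau_i}(o) denotes the accumulated cost of template \tau_i on the predicted cost map for observation o (an assumption consistent with the description above, not a verbatim quote):

p(\tau_i \mid o) = \frac{\exp\left(-c_{\tau_i}(o)\right)}{\sum_{j=1}^{K} \exp\left(-c_{\tau_j}(o)\right)}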
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
For a single time stamp, each camera is removed in turn to visualize how the loss of that camera affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area in front of the ego vehicle, and extrapolates the body of a car of which only a corner can be seen in the top-right camera view.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Qualitatively show how the model performs, given an entirely new camera rig at test time. Road
segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and
curves from observations from a single timestamp. The model does not have access to the speed of the
car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• online estimation of semantic BEV HD-maps using video input from a single onboard camera
• image-level understanding, BEV-level understanding, and aggregation of temporal information
Front-facing monocular camera
for Bird’s-eye View (BEV) HD-
map understanding
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
The method relies on three pillars and can be split into modules that process backbone features: first, the image-level branch, which is composed of two decoders, one processing the static HD-map and one the dynamic obstacles; second, the BEV temporal aggregation module, which fuses the three pillars and aggregates all the temporal and image-plane information in the BEV; and finally the BEV decoder.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
The temporal aggregation module combines information from all frames and all branches into one BEV feature map. Backbone features and image-level static estimates are projected to the BEV with the warping function AB, and a max (M) is applied over the batch dimension. The results are concatenated along the channel dimension. The reference-frame backbone features (highlighted in red) are used both in the max function and as a skip connection to the concatenation (a minimal sketch follows).
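
A minimal PyTorch sketch of this aggregation, assuming the per-frame features have already been warped into the reference BEV frame:

import torch

def temporal_bev_aggregation(bev_feats):
    """Per-frame BEV features are max-pooled over time and concatenated with
    the reference frame's own features (skip connection).
    bev_feats: (T, C, H, W), with index -1 as the reference frame."""
    pooled = bev_feats.max(dim=0).values          # max over the time/batch dimension
    reference = bev_feats[-1]                     # reference-frame features (skip)
    return torch.cat([pooled, reference], dim=0)  # concatenate channels -> (2C, H, W)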
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• The dataset also provides 3D bounding boxes of 23 object classes.
• In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways,
carpark area, road segment, and lane.
• For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian,
motorcycle, traffic cone and barrier.
• Even though a six-camera rig was used to capture the data, only the front camera is used for training and evaluation.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
HYDROPOWER - Hydroelectric power generation
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 

BEV Semantic Segmentation

  • 1. BEV SEMANTIC SEGMENTATION Yu Huang Sunnyvale, California Yu.huang07@gmail.com
  • 2. OUTLINE • Learning to Look around Objects for Top-View Representations of Outdoor Scenes • Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Enc-Dec Networks • Cross-view Semantic Segmentation for Sensing Surroundings • MonoLayout: Amodal scene layout from a single image • Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks • A Sim2Real DL Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in BEV • FISHING Net: Future Inference of Semantic Heatmaps In Grids • BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud • Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D • Understanding Bird’s-Eye View Semantic HD-maps Using an Onboard Monocular Camera
  • 3. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes • Estimating an occlusion-reasoned semantic scene layout in the top-view. • This challenging problem not only requires an accurate understanding of both the 3D geometry and the semantics of the visible scene, but also of occluded areas. • A convolutional neural network that learns to predict occluded portions of the scene layout by looking around foreground objects like cars or pedestrians. • But instead of hallucinating RGB values, directly predicting the semantics and depths in the occluded areas enables a better transformation into the top-view. • This initial top-view representation can be significantly enhanced by learning priors and rules about typical road layouts from simulated or, if available, map data.
  • 4. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes
  • 5. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The inpainting CNN first encodes a masked image and the mask itself. The extracted features are concatenated and two decoders predict semantics and depth for visible and occluded pixels. To train the inpainting CNN, ignore FG objects as no GT is available (red) but artificially add masks (green) over BG regions where full annotation is already available.
  • 6. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The process of mapping the semantic segmentation with corresponding depth first into a 3D point cloud and then into the bird's eye view. The red and blue circles illustrate corresponding locations in all views.
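A minimal sketch (assuming a standard pinhole camera model; the intrinsics matrix K and the grid parameters are illustrative) of how per-pixel semantics and depth can be back-projected into a 3D point cloud and then flattened onto a bird's-eye-view grid, in the spirit of the mapping step described above:

import numpy as np

def segmentation_depth_to_bev(semantics, depth, K, grid_res=0.1, grid_size=(400, 400)):
    """Back-project per-pixel semantics + depth to 3D, then rasterize into a BEV grid.
    semantics: (H, W) class ids; depth: (H, W) metric depth; K: 3x3 pinhole intrinsics.
    Camera convention: x right, y down, z forward; BEV grid axes are (forward, lateral)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth                                    # forward distance
    x = (u - K[0, 2]) * z / K[0, 0]              # lateral offset
    y = (v - K[1, 2]) * z / K[1, 1]              # height (could be used to drop points far above the road)

    rows = (z / grid_res).astype(int)                      # forward axis of the BEV grid
    cols = (x / grid_res + grid_size[1] / 2).astype(int)   # lateral axis, centered on the ego camera
    valid = (rows >= 0) & (rows < grid_size[0]) & (cols >= 0) & (cols < grid_size[1])

    bev = np.zeros(grid_size, dtype=np.int32)              # 0 = unknown / unobserved
    bev[rows[valid], cols[valid]] = semantics[valid]       # last point written wins per cell
    return bev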
  • 7. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data, and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the initial BEV map and a crop of OSM data (obtained via a given noisy GPS position and yaw estimate). The CNN predicts a warp for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
  • 8. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) We use a composition of a similarity transform (left, “box”) and a non-parametric warp (right, “flow”) to align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom) Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping functions, respectively: “box”, “flow”, “box+flow”, “box+flow (with regularization)”. Note the importance of composing the transformations and the induced regularization.
  • 9. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes Examples of the resulting BEV representation.
  • 10. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks • This work performs end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. • The network learns to predict four classes, as well as a camera-to-bird’s-eye-view mapping. • At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. • The variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data.
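A minimal PyTorch-style sketch of such a variational encoder-decoder (layer widths, latent size, and the class name VariationalGridMapper are illustrative assumptions, not the published configuration): the front view is encoded, a small latent vector is sampled with the reparameterization trick, and the decoder emits a top-view semantic grid.

import torch
import torch.nn as nn

class VariationalGridMapper(nn.Module):
    """Illustrative front-view -> top-view variational encoder-decoder."""
    def __init__(self, latent_dim=128, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(              # front-view RGB -> pooled feature vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(              # latent vector -> top-view semantic grid logits
            nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, image):
        h = self.encoder(image)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar                        # grid logits plus terms for the KL loss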
  • 11. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Illustration of the proposed variational encoder-decoder approach. From a single front-view RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
  • 12. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Some visualized mapping examples on the test set with different methods.
  • 13. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks
  • 14. Cross-view Semantic Segmentation For Sensing Surroundings • Cross-view semantic segmentation: a framework named View Parsing Network (VPN) is proposed to address it. • In the cross-view semantic segmentation task, the agent is trained to parse first-view observations into a top-down-view semantic map indicating the spatial location of all objects at the pixel level. • The main issue of this task is the lack of real-world annotations for top-down-view data. • To mitigate this, the VPN is trained in a 3D graphics environment and domain adaptation is used to transfer it to real-world data. • Code and demo videos can be found at https://view-parsing-network.github.io.
  • 15. Cross-view Semantic Segmentation For Sensing Surroundings Framework of the View Parsing Network for cross-view semantic segmentation. The simulation part shows the architecture and training scheme of VPN, while the real-world part demonstrates the domain adaptation process for transferring VPN to the real world.
  • 16. Cross-view Semantic Segmentation For Sensing Surroundings Qualitative results of sim-to-real adaptation. The results of source prediction before and after domain adaptation, drivable area prediction after adaptation, and the ground-truth drivable area map.
  • 17. MonoLayout: Amodal Scene Layout From A Single Image • Given a single color image captured from a driving platform, the task is to predict the bird’s eye view layout of the road and other traffic participants. • The estimated layout should reason beyond what is visible in the image, and compensate for the loss of 3D information due to projection. • Amodal scene layout estimation involves hallucinating the scene layout even for parts of the world that are occluded in the image. • MonoLayout, a deep NN for real-time amodal scene layout estimation from a single image. • The scene layout is represented as a multi-channel semantic occupancy grid, and adversarial feature learning is leveraged to “hallucinate” plausible completions for occluded image parts.
  • 18. MonoLayout: Amodal Scene Layout From A Single Image MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about the amodal scene layout in bird’s eye view in real time (30 fps). This approach, MonoLayout, can hallucinate regions of the static scene (road, sidewalks) and traffic participants that do not even project to the visible region of the image plane. Shown above are example images from the KITTI (left) and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on hallucinating occluded regions.
  • 19. MonoLayout: Amodal Scene Layout From A Single Image Architecture: MonoLayout takes in a color image of an urban driving scenario, and predicts an amodal scene layout in bird’s eye view. The architecture comprises a context encoder, amodal layout decoders, and two discriminators.
  • 20. MonoLayout: Amodal Scene Layout From A Single Image
  • 21. MonoLayout: Amodal Scene Layout From A Single Image Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene (road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is able to do so accurately. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where most of the view is blocked by vehicles, and the scenes exhibit high dynamic range (HDR) and shadows.
  • 22. MonoLayout: Amodal Scene Layout From A Single Image Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D object detection benchmark. From left to right, the columns correspond to the input image, Mono Occupancy, Mono3D, OFT, MonoLayout, and ground truth, respectively. While the other approaches miss out on detecting cars (top row), split a vehicle detection into two (second row), or produce stray detections off the road (third row), MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
  • 23. MonoLayout: Amodal Scene Layout From A Single Image Amodal scene layout estimation on the Argoverse dataset. The dataset comprises multiple challenging scenarios, with low illumination and a large number of vehicles. MonoLayout is able to produce sharp, accurate estimates of vehicles and road layouts. (Sidewalks are not predicted here, as they aren’t annotated in Argoverse.)
  • 24. MonoLayout: Amodal Scene Layout From A Single Image Trajectory forecasting: MonoLayout-forecast accurately estimates future trajectories of moving vehicles. (Left): In each figure, the magenta cuboid shows the initial position of the vehicle. MonoLayout-forecast is pre-conditioned for 1 second by observing the vehicle, at which point (cyan cuboid) it starts forecasting future trajectories (blue). The ground-truth trajectory is shown in red for comparison. (Right): Trajectories visualized in image space. Notice how MonoLayout-forecast is able to forecast trajectories accurately despite the presence of moving obstacles (top row), turns (middle row), and merging traffic (bottom row).
  • 25. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • Vision-based elements: ground plane estimation, road segmentation, and 3D object detection. • A simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture. • For the maps themselves, a semantic Bayesian occupancy grid framework is adopted, allowing information to be trivially accumulated over multiple cameras and timesteps (see the log-odds sketch below). • Code available at http://github.com/tom-roddick/mono-semantic-maps.
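A minimal sketch of how a semantic Bayesian occupancy grid can accumulate per-frame class probabilities via the standard log-odds update; this is the usual way such a framework fuses observations from multiple cameras and timesteps, though the exact fusion rule used by the authors may differ.

import numpy as np

class SemanticBayesianGrid:
    """Accumulate per-class occupancy probabilities over frames using log-odds."""
    def __init__(self, num_classes, grid_shape, prior=0.5):
        self.logodds = np.full((num_classes, *grid_shape), np.log(prior / (1 - prior)))

    def update(self, probs, eps=1e-6):
        """probs: (num_classes, H, W) per-frame occupancy probabilities, already warped to the ego frame."""
        p = np.clip(probs, eps, 1 - eps)
        self.logodds += np.log(p / (1 - p))          # Bayesian update: add the observation log-odds

    def posterior(self):
        return 1.0 / (1.0 + np.exp(-self.logodds))   # convert accumulated log-odds back to probabilities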
  • 26. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 27. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Given a set of surround-view images, predict a full 360° bird’s-eye-view semantic map, which captures both static elements like road and sidewalk as well as dynamic actors such as cars and pedestrians.
  • 28. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context from lower pyramid layers. (3) A stack of dense transformer layers maps the image-based features into the bird’s-eye view. (4) The top-down network processes the bird’s-eye-view features and predicts the final semantic occupancy probabilities.
  • 29. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks This dense transformer layer first condenses the image based features along the vertical dimension, whilst retaining the horizontal dimension. Then, predict a set of features along the depth axis in a polar coordinate system, which are then resampled to Cartesian coordinates.
  • 30. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • The dense transformer layer is inspired by the observation that, while the network needs a lot of vertical context to map features to the BEV, in the horizontal direction the relationship between BEV locations and image locations can be established using camera geometry. • In order to retain the maximum amount of spatial information, the vertical and channel dimensions of the image feature map are collapsed to a bottleneck of size B, while the horizontal dimension W is preserved. • A 1D convolution is then applied along the horizontal axis and the resulting feature map is reshaped into a tensor with a depth dimension in polar coordinates. • However, this feature map, which is still in image-space coordinates, actually corresponds to a trapezoid in the orthographic BEV space due to perspective, so the final step is to resample it into a Cartesian frame using the known camera focal length f and horizontal offset u0. A sketch of this layer follows below.
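A condensed PyTorch-style sketch of this dense transformer idea (tensor shapes, the class name, and the polar-to-Cartesian sampling grid are illustrative assumptions rather than the published implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformerSketch(nn.Module):
    """Collapse the vertical image axis, expand along depth, then resample polar -> Cartesian."""
    def __init__(self, in_channels, feat_height, bottleneck, out_channels, depth_bins):
        super().__init__()
        self.collapse = nn.Conv2d(in_channels, bottleneck, kernel_size=(feat_height, 1))
        self.expand = nn.Conv1d(bottleneck, out_channels * depth_bins, kernel_size=1)
        self.out_channels, self.depth_bins = out_channels, depth_bins

    def forward(self, feats, cart_to_polar_grid):
        # feats: (B, C, H, W) image-plane features
        B, _, _, W = feats.shape
        x = self.collapse(feats).squeeze(2)                       # (B, bottleneck, W): vertical axis collapsed
        x = self.expand(x)                                        # (B, out*depth, W): 1D conv along the horizontal axis
        polar = x.view(B, self.out_channels, self.depth_bins, W)  # polar BEV features: (depth bin, image column)
        # Resample the polar trapezoid onto a Cartesian BEV grid
        # cart_to_polar_grid: (B, Z, X, 2) normalized sampling coordinates built from f and u0
        return F.grid_sample(polar, cart_to_polar_grid, align_corners=False)

The cart_to_polar_grid would be precomputed from the focal length f and horizontal offset u0, mapping each Cartesian BEV cell to its (depth bin, image column) location.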
  • 31. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 32. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 33. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 34. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 35. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • To obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras. • The corrected BEV image is segmented into semantic classes and includes a prediction of occluded areas. • The neural network approach does not rely on manually labeled data, but is trained on a synthetic dataset in such a way that it generalizes well to real-world data. • By using semantically segmented images as input, the reality gap between simulated and real-world data is reduced, and the method is shown to be successfully applicable in the real world. • Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
  • 36. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV A homography can be applied to the four semantically segmented images from vehicle-mounted cameras to transform them to BEV. This approach involves learning to compute an accurate BEV image without visual distortions.
  • 37. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • For each vehicle camera, virtual rays are cast from its mount position to the edges of the semantically segmented ground truth BEV image. • The rays are only cast to edge pixels that lie within the specific camera’s field of view. • All pixels along these rays are processed to determine their occlusion state according to the following rules: 1. some semantic classes always block sight (e.g. building, truck); 2. some semantic classes never block sight (e.g. road); 3. cars block sight, except on taller objects behind them (e.g. truck, bus); 4. partially occluded objects remain completely visible; 5. objects are only labeled as occluded if they are occluded in all camera perspectives.
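A simplified sketch of this ray-casting procedure; the class ids, the line tracing, and the blocking rules are illustrative, and rules 3 and 4 (height-dependent blocking by cars, and keeping partially occluded objects fully visible) are omitted for brevity.

import numpy as np

ALWAYS_BLOCK = {1, 2}   # e.g. building, truck (hypothetical class ids)
NEVER_BLOCK = {0}       # e.g. road

def cast_ray_occlusion(label_grid, cam_cell, edge_cell):
    """Mark BEV cells along one ray from the camera cell as visible until a blocker is hit."""
    visible = np.zeros(label_grid.shape, dtype=bool)
    n = int(max(abs(edge_cell[0] - cam_cell[0]), abs(edge_cell[1] - cam_cell[1])))
    rows = np.linspace(cam_cell[0], edge_cell[0], n + 1).round().astype(int)
    cols = np.linspace(cam_cell[1], edge_cell[1], n + 1).round().astype(int)
    for r, c in zip(rows, cols):
        visible[r, c] = True                      # cells up to and including the first blocker are visible
        if label_grid[r, c] in ALWAYS_BLOCK:      # rule 1: these classes always block sight
            break
        # rule 2: classes in NEVER_BLOCK never terminate the ray
    return visible

# Rule 5: a cell counts as occluded only if it is invisible from every camera, i.e.
# occluded = ~(visible_cam1 | visible_cam2 | ... | visible_camN)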
  • 38. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The uNetXST architecture has separate encoder paths for each input image (green paths). As part of the skip-connection on each scale level (violet paths), feature maps are projectively transformed (v-block), concatenated with the other input streams (||-block), convolved, and finally concatenated with the upsampled output of the decoder path. This illustration shows a network with only two pooling and two upsampling layers; the actual trained network contains four of each.
  • 39. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The v-block resembles a Spatial Transformer unit. Input feature maps from preceding convolutional layers (orange grid layers) are projectively transformed by the homographies obtained through IPM (Inverse Perspective Mapping). The transformation differs between the input streams for the different cameras. Spatial consistency is established, since the transformed feature maps all capture the same field of view as the ground truth BEV. The transformed feature maps are then concatenated into a single feature map (cf. ||-block).
  • 40. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 41. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 42. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • End-to-end pipeline that performs semantic segmentation and short-term prediction using a top-down representation. • This approach consists of an ensemble of neural networks which take in sensor data from different sensor modalities and transform them into a single common top-down semantic grid representation. • This representation is favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information of the surrounding scene. • Because the modalities share a single output representation, they can be easily aggregated to produce a fused output. • This work predicts short-term semantic grids, but the framework can be extended to other tasks.
  • 43. FISHING Net: Future Inference Of Semantic Heatmaps In Grids FISHING Net Architecture: multiple neural networks, one for each sensor modality (lidar, radar and camera) take in a sequence of input sensor data and output a sequence of shared top-down semantic grids representing 3 object classes (Vulnerable Road Users (VRU), vehicles and background). The sequences are then fused using an aggregation function to output a fused sequence of semantic grids.
  • 44. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The overall architecture consists of a neural network for each sensor modality. • Across all modalities, the network architecture consists of an encoder-decoder network with convolutional layers (a schematic sketch follows below). • It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the decoder. • After the decoder, a single linear convolutional layer produces logits, and a softmax produces the final output probabilities for each of the three classes at each of the output timesteps. • It uses a slightly different encoder and decoder scheme for the vision network compared to the lidar and radar networks to account for the pixel-space features.
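A minimal sketch, following the description above, of one such per-modality encoder-decoder; channel widths, the number of scale levels, and the class name are illustrative assumptions.

import torch
import torch.nn as nn

class ModalityGridNet(nn.Module):
    """Encoder-decoder over stacked input frames -> per-timestep 3-class semantic grids."""
    def __init__(self, in_channels, out_timesteps, num_classes=3, width=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                                     # average pooling, pool size (2, 2)
            nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        # Single linear (1x1) convolution to logits; softmax over classes per output timestep
        self.head = nn.Conv2d(width, out_timesteps * num_classes, 1)
        self.out_timesteps, self.num_classes = out_timesteps, num_classes

    def forward(self, x):
        logits = self.head(self.decoder(self.encoder(x)))
        B, _, H, W = logits.shape
        logits = logits.view(B, self.out_timesteps, self.num_classes, H, W)
        return torch.softmax(logits, dim=2)                      # class probabilities per timestep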
  • 45. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Vision architecture
  • 46. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Lidar and Radar Architecture
  • 47. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The LiDAR features consist of: 1) Binary lidar occupancy (1 if any lidar point is present in a given grid cell, 0 otherwise). 2) Lidar density (log-normalized density of all lidar points present in a grid cell). 3) Max z (largest height value for lidar points in a given grid cell). 4) Max z sliced (largest z value for each grid cell over 5 linear slices, e.g. 0-0.5 m, ..., 2.0-2.5 m). • The Radar features consist of: 1) Binary radar occupancy (1 if any radar point is present in a given grid cell, 0 otherwise). 2) X, Y values for each radar return’s Doppler velocity compensated with the ego vehicle’s motion. 3) Radar cross section (RCS). 4) Signal-to-noise ratio (SNR). 5) Ambiguous Doppler interval. • The dimensions of the feature images match the output resolution of 192 by 320 (a rasterization sketch follows below).
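A rough numpy sketch of how the listed LiDAR features could be rasterized into the 192-by-320 grid; the cell size, grid centering, and slice boundaries are assumptions for illustration, and the radar features would be rasterized analogously.

import numpy as np

def lidar_bev_features(points, grid_shape=(192, 320), cell=0.5,
                       z_slices=(0.0, 0.5, 1.0, 1.5, 2.0, 2.5)):
    """points: (N, 3) lidar x, y, z in the ego frame. Returns a (C, H, W) feature grid."""
    H, W = grid_shape
    rows = (points[:, 0] / cell + H / 2).astype(int)
    cols = (points[:, 1] / cell + W / 2).astype(int)
    keep = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
    rows, cols, z = rows[keep], cols[keep], points[keep, 2]

    occupancy = np.zeros((H, W)); density = np.zeros((H, W))
    max_z = np.full((H, W), -np.inf)
    sliced = np.full((len(z_slices) - 1, H, W), -np.inf)
    for r, c, zz in zip(rows, cols, z):
        occupancy[r, c] = 1.0                                    # binary occupancy
        density[r, c] += 1.0
        max_z[r, c] = max(max_z[r, c], zz)                       # largest height in the cell
        for s in range(len(z_slices) - 1):                       # largest height per linear slice
            if z_slices[s] <= zz < z_slices[s + 1]:
                sliced[s, r, c] = max(sliced[s, r, c], zz)
    density = np.log1p(density)                                  # log-normalized point density
    max_z[max_z == -np.inf] = 0.0
    sliced[sliced == -np.inf] = 0.0
    return np.concatenate([occupancy[None], density[None], max_z[None], sliced], axis=0)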
  • 48. FISHING Net: Future Inference Of Semantic Heatmaps In Grids
  • 49. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Label input for lidar, radar, and vision; predictions for lidar, radar, and vision.
  • 50. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. • Two main challenges: the view transformation from side view to bird’s eye view, as well as transfer learning to unseen domains. • The 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects’ spatial locations in the BEV. • Transfer learning by abstracting high level geometric features and predicting an intermediate representation that is common across different domains.
  • 51. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud BEV-Seg pipeline
  • 52. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • In the first stage, N RGB road scene images are captured by cameras at different angles and individually pass through semantic segmentation network S and depth estimation network D. • The resulting side semantic segmentations and depth maps are combined and projected into a semantic point cloud. • This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed into a parser network to predict the final bird’s-eye segmentation. • The rest of this section provides details on the various components of the pipeline.
  • 53. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic segmentation. • For monocular depth estimation, implement SORD using the same HRNet as the backbone. • For both tasks, train the same model on all four views. • The resulting semantic point cloud is projected height-wise onto a 512x512 image. • Train a separate HRNet model as the parser network for the final bird’s-eye segmentation. • Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the target domain.
  • 54. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud Table 1: Segmentation Result on BEVSEG-Carla. Oracle models have ground truth given for specified inputs.
  • 55. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D • End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras • To “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird's-eye-view grid • To learn how to represent images and how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error • Code: https://nv-tlabs.github.io/lift-splat-shoot
  • 56. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation (green). These BEV predictions are then projected back onto input images (dots on the left).
  • 57. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Traditionally, computer vision tasks such as semantic segmentation involve making predictions in the same coordinate frame as the input image. In contrast, planning for self-driving generally operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye- view frame for end-to-end planning from multi-view images.
  • 58. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the “lift" step. For each pixel, it predicts a categorical distribution over depth (left) and a context vector (top left). Features at each point along the ray are determined by their outer product (right).
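A compact sketch of this “lift” computation, assuming per-pixel depth logits and context features are already predicted by a backbone (the function name and shapes are illustrative):

import torch

def lift_features(depth_logits, context):
    """depth_logits: (B, D, H, W); context: (B, C, H, W).
    Returns a frustum feature volume (B, C, D, H, W): the outer product of a categorical
    distribution over D depth bins and the per-pixel context vector."""
    depth_dist = torch.softmax(depth_logits, dim=1)              # categorical distribution over depth
    return depth_dist.unsqueeze(1) * context.unsqueeze(2)        # broadcasted outer product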
  • 59. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D In the “lift" step, a frustum-shaped point cloud is generated for each individual image (center-left). The extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV CNN processes the BEV representation for BEV semantic segmentation or planning (right).
  • 60. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the 1K trajectory templates that are “shot” onto the cost map during training and testing. During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates. During testing, the argmax of this distribution is chosen and the vehicle acts according to the chosen template.
  • 61. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Instead of the hard-margin loss proposed in NMP (Neural Motion Planner), planning is framed as classification over a set of K template trajectories. To leverage the cost-volume nature of the planning problem, enforce the distribution over K template trajectories to take the following form
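Assuming the standard Boltzmann (softmax over negative template cost) form implied by the description above, the distribution over the K templates would read approximately:

p(\tau_i \mid o) = \frac{\exp\left(-c_o(\tau_i)\right)}{\sum_{j=1}^{K} \exp\left(-c_o(\tau_j)\right)},

where c_o(\tau_i) is the cost of template trajectory \tau_i accumulated over the predicted BEV cost map for observation o; the exact notation in the paper may differ.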
  • 62. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D For a single timestamp, each of the cameras is removed in turn to visualize how the loss of that camera affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area in front of the ego vehicle and extrapolates the body of a car for which only a corner can be seen in the top-right camera.
  • 63. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Qualitatively show how the model performs, given an entirely new camera rig at test time. Road segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
  • 64. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and curves from observations from a single timestamp. The model does not have access to the speed of the car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
  • 65. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • Online estimation of semantic BEV HD-maps using video input from a single onboard camera. • Three pillars: image-level understanding, BEV-level understanding, and aggregation of temporal information. Front-facing monocular camera for Bird’s-eye View (BEV) HD-map understanding.
  • 66. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera It relies on three pillars and can be split into modules that process backbone features: first, the image-level branch, which is composed of two decoders, one processing the static HD-map and one the dynamic obstacles; second, the BEV temporal aggregation module, which fuses the three pillars and aggregates all the temporal and image-plane information in the BEV; and finally the BEV decoder.
  • 67. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera The temporal aggregation module combines information from all frames and all branches into one BEV feature map. Backbone features and image-level static estimates are projected to BEV with the warping function AB, and a max (M) is applied along the batch (time) dimension. The results are concatenated along the channel dimension. The reference-frame backbone features (highlighted in red) are used in the max function as well as in a skip connection to the concatenation.
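A schematic PyTorch-style sketch of the described aggregation; the warping function AB is assumed to be available as a precomputed sampling grid per frame, and the function name and shapes are illustrative.

import torch
import torch.nn.functional as F

def temporal_bev_aggregate(backbone_feats, static_feats, warp_grids, ref_index):
    """backbone_feats, static_feats: lists of (1, C, H, W) per-frame tensors in the image plane.
    warp_grids: list of (1, Hb, Wb, 2) sampling grids implementing the image->BEV warp AB per frame."""
    bev_backbone = [F.grid_sample(f, g, align_corners=False) for f, g in zip(backbone_feats, warp_grids)]
    bev_static = [F.grid_sample(f, g, align_corners=False) for f, g in zip(static_feats, warp_grids)]
    # Max over frames (the batch/time dimension), then concatenate channel-wise
    fused_backbone = torch.stack(bev_backbone, dim=0).max(dim=0).values
    fused_static = torch.stack(bev_static, dim=0).max(dim=0).values
    reference = bev_backbone[ref_index]                          # skip connection from the reference frame
    return torch.cat([fused_backbone, fused_static, reference], dim=1)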
  • 68. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • The dataset also provides 3D bounding boxes of 23 object classes. • In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways, carpark area, road segment, and lane. • For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, traffic cone and barrier. • Even though a six-camera rig was used to capture data, only use the front camera for training and evaluation.
  • 69. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera
  • 70. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera