BEV SEMANTIC
SEGMENTATION
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
OUTLINE
• Learning to Look around Objects for Top-View
Representations of Outdoor Scenes
• Monocular Semantic Occupancy Grid Mapping
with Convolutional Variational Enc-Dec Networks
• Cross-view Semantic Segmentation for Sensing
Surroundings
• MonoLayout: Amodal scene layout from a single
image
• Predicting Semantic Map Representations from
Images using Pyramid Occupancy Networks
• A Sim2Real DL Approach for the Transformation
of Images from Multiple Vehicle-Mounted Cameras
to a Semantically Segmented Image in BEV
• FISHING Net: Future Inference of Semantic
Heatmaps In Grids
• BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry and Semantic Point Cloud
• Lift, Splat, Shoot: Encoding Images from Arbitrary
Camera Rigs by Implicitly Unprojecting to 3D
• Understanding Bird’s-Eye View Semantic HD-maps
Using an Onboard Monocular Camera
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
• Estimating an occlusion-reasoned semantic scene layout in the top-view.
• This challenging problem requires an accurate understanding not only of the 3D geometry and the semantics of the visible scene, but also of the occluded areas.
• A convolutional neural network that learns to predict occluded portions of the scene layout by
looking around foreground objects like cars or pedestrians.
• But instead of hallucinating RGB values, directly predicting the semantics and depths in the
occluded areas enables a better transformation into the top-view.
• This initial top-view representation can be significantly enhanced by learning priors and rules about
typical road layouts from simulated or, if available, map data.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The inpainting CNN first encodes a masked image and the mask
itself. The extracted features are concatenated and two decoders
predict semantics and depth for visible and occluded pixels.
To train the inpainting CNN, FG objects are ignored since no GT is available for them (red), but masks are artificially added (green) over BG regions where full annotation is already available.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The process of mapping the semantic segmentation with corresponding
depth first into a 3D point cloud and then into the bird's eye view. The red
and blue circles illustrate corresponding locations in all views.
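
A minimal NumPy sketch of this mapping, assuming a pinhole camera with intrinsics (fx, fy, cx, cy) and illustrative grid extents and resolution:

import numpy as np

def semantics_depth_to_bev(depth, semantics, fx, fy, cx, cy,
                           x_range=(-20.0, 20.0), z_range=(0.0, 40.0), res=0.1):
    """Unproject per-pixel depth into a 3D point cloud, then bin the points
    into a top-view (BEV) grid storing one semantic label per cell."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole back-projection: pixel (u, v) with depth d -> camera coordinates.
    x = (u - cx) / fx * depth              # lateral
    y = (v - cy) / fy * depth              # height (could be used to filter points)
    z = depth                              # forward
    inside = (x >= x_range[0]) & (x < x_range[1]) & (z >= z_range[0]) & (z < z_range[1])
    cols = ((x[inside] - x_range[0]) / res).astype(int)
    rows = ((z[inside] - z_range[0]) / res).astype(int)
    bev = np.zeros((int((z_range[1] - z_range[0]) / res),
                    int((x_range[1] - x_range[0]) / res)), dtype=np.uint8)
    bev[rows, cols] = semantics[inside]    # last point wins per cell in this sketch
    return bev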
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network
receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data,
and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the
initial BEV map and a crop of OSM data (obtained via a noisy GPS and yaw estimate). The CNN predicts a warp
for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) We use a composition of similarity transform (left, “box") and a non-parametric warp (right, “flow") to
align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom)
Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping
functions, respectively: “box", ”flow", “box+flow", “box+flow (with regularization)". Note the importance of
composing the transformations and the induced regularization.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Examples of the BEV representation.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
• This work performs end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth.
• The network learns to predict four classes, as well as a camera to bird’s eye view mapping.
• At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual
information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian
coordinate system.
• The variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data (a minimal sketch follows this list).
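
A toy PyTorch sketch of the variational encoder-decoder idea described above; layer sizes, the 4-class output, and module names are illustrative assumptions rather than the paper's exact architecture:

import torch
import torch.nn as nn

class VariationalBEVMapper(nn.Module):
    """Toy variational encoder-decoder: a front-view RGB image is compressed
    to a small latent vector, sampled, and decoded into a top-view semantic
    occupancy grid (4 classes here)."""
    def __init__(self, latent_dim=128, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, front_view_rgb):
        h = self.encoder(front_view_rgb)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar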
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Illustration of the proposed variational encoder-decoder approach. From a single front-view
RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Some visualized mapping examples on the test set with different methods.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Cross-view Semantic Segmentation For Sensing
Surroundings
• Cross-view semantic segmentation: a framework named View Parsing Network (VPN) is proposed to address it.
• In the cross-view semantic segmentation task, the agent is trained to parse the first-view
observations into a top-down-view semantic map indicating the spatial location of all the
objects at pixel-level.
• The main issue of this task is the lack of real-world annotations for top-down-view data.
• To mitigate this, the VPN is trained in a 3D graphics environment and domain adaptation is used to transfer it to real-world data (a view-transformer sketch follows this list).
• Code and demo videos can be found at https://view-parsing-network.github.io.
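
A minimal PyTorch sketch of a cross-view transform in the spirit of VPN: each channel's flattened first-view spatial map is re-mixed by an MLP into the flattened top-view map (shapes and the module name are illustrative assumptions):

import torch
import torch.nn as nn

class ViewTransformer(nn.Module):
    """Sketch of a cross-view transform: every channel's flattened first-view
    spatial map is mapped by an MLP to a flattened top-view spatial map."""
    def __init__(self, in_hw=(32, 32), out_hw=(32, 32), hidden=1024):
        super().__init__()
        self.out_hw = out_hw
        self.mlp = nn.Sequential(
            nn.Linear(in_hw[0] * in_hw[1], hidden), nn.ReLU(),
            nn.Linear(hidden, out_hw[0] * out_hw[1]))

    def forward(self, feat):               # feat: (B, C, H, W) first-view features
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)      # treat each channel independently
        top = self.mlp(flat)               # (B, C, H_out * W_out)
        return top.view(b, c, *self.out_hw)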
Cross-view Semantic Segmentation For Sensing
Surroundings
Framework of the View
Parsing Network for
cross-view semantic
segmentation. The
simulation part shows
the architecture and
training scheme of VPN,
while the real-world part
demonstrates the
domain adaptation
process for transferring
VPN to the real world.
Cross-view Semantic Segmentation For Sensing
Surroundings
Qualitative results of sim-to-real adaptation: the source prediction before and after domain adaptation, the drivable-area prediction after adaptation, and the ground-truth drivable-area map.
MonoLayout: Amodal Scene Layout From A Single
Image
• Given a single color image captured from a driving platform, to predict the bird’s eye view layout
of the road and other traffic participants.
• The estimated layout should reason beyond what is visible in the image, and compensate for the
loss of 3D information due to projection.
• Amodal scene layout estimation involves hallucinating the scene layout even for parts of the world that are occluded in the image.
• MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image.
• The scene layout is represented as a multi-channel semantic occupancy grid, and adversarial feature learning is leveraged to "hallucinate" plausible completions for occluded parts (a minimal adversarial-training sketch follows this list).
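
A hedged PyTorch sketch of the adversarial-feature-learning idea: a patch discriminator scores predicted occupancy-grid layouts against samples drawn from a set of plausible layouts, pushing the decoder toward realistic completions; network sizes and the BCE formulation are illustrative assumptions, not MonoLayout's exact losses:

import torch
import torch.nn as nn

class LayoutDiscriminator(nn.Module):
    """Patch discriminator over single-channel occupancy-grid layouts."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1))   # patch-wise real/fake scores

    def forward(self, layout):
        return self.net(layout)

def adversarial_losses(disc, predicted_layout, plausible_layout):
    """Standard GAN losses: the discriminator separates predicted from
    plausible layouts; the layout decoder is trained to fool it."""
    bce = nn.BCEWithLogitsLoss()
    fake, real = disc(predicted_layout.detach()), disc(plausible_layout)
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    g_scores = disc(predicted_layout)                   # decoder wants these scored "real"
    g_loss = bce(g_scores, torch.ones_like(g_scores))
    return d_loss, g_loss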
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about
the amodal scene layout in bird’s eye view in real-time (30 fps). This approach, MonoLayout can
hallucinate regions of the static scene (road, sidewalks)—and traffic participants—that do not even
project to the visible regime of the image plane. Shown above are example images from the KITTI (left)
and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on
hallucinating occluded regions.
MonoLayout: Amodal Scene Layout From A Single
Image
Architecture: MonoLayout takes in a color image of an urban driving scenario and predicts an amodal scene layout in bird's eye view. The architecture comprises a context encoder, amodal layout decoders, and two discriminators.
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Amodal Scene Layout From A Single
Image
Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene
(road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding
objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is
accurately able to do so. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout
generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where most of the view is blocked by vehicles, and the scenes exhibit high dynamic range (HDR) and shadows.
MonoLayout: Amodal Scene Layout From A Single
Image
Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D Object detection
benchmark. From left to right, the column corresponds to the input image, Mono Occupancy, Mono3D,
OFT, MonoLayout, and the ground truth, respectively. While the other approaches miss cars (top row), split a single vehicle detection into two (second row), or produce stray detections off the road (third row), MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
MonoLayout: Amodal Scene Layout From A Single
Image
Amodal scene layout estimation on the Argoverse
dataset. The dataset comprises multiple challenging scenarios with low illumination and a large number of vehicles. MonoLayout is able to accurately produce sharp estimates of vehicle and road
layouts. (Sidewalks are not predicted here, as they
aren’t annotated in Argoverse).
MonoLayout: Amodal Scene Layout From A Single
Image
Trajectory forecasting: MonoLayout-
forecast accurately estimates future
trajectories of moving vehicles. (Left): In
each figure, the magenta cuboid shows the
initial position of the vehicle. MonoLayout-
forecast is pre-conditioned for 1 second
by observing the vehicle, at which point
(cyan cuboid) it starts forecasting future
trajectories (blue). The ground-truth
trajectory is shown in red for comparison.
(Right): Trajectories visualized in image
space. Notice how MonoLayout-forecast is
able to forecast trajectories accurately
despite the presence of moving obstacles
(top row), turns (middle row), and merging
traffic (bottom row).
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• Maps are typically built from separate vision-based elements: ground plane estimation, road segmentation, and 3D object detection.
• a simple, unified approach for estimating maps directly from monocular images using a single
end-to-end deep learning architecture
• For the maps themselves, a semantic Bayesian occupancy grid framework is adopted, allowing information to be trivially accumulated over multiple cameras and timesteps (a minimal log-odds fusion sketch follows this list).
• Code is available at http://github.com/tom-roddick/mono-semantic-maps.
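
A minimal NumPy sketch of Bayesian occupancy-grid fusion via log-odds accumulation, which is one standard way to combine per-camera, per-timestep class probabilities (the prior value and shapes are assumptions):

import numpy as np

def fuse_occupancy_logodds(grids, prior=0.5):
    """Fuse per-camera / per-timestep semantic occupancy probabilities into
    one Bayesian occupancy grid. `grids` is a list of arrays of shape
    (C, H, W) with values in (0, 1)."""
    eps = 1e-6
    prior_logit = np.log(prior / (1.0 - prior))
    fused = np.zeros_like(grids[0])
    for g in grids:
        g = np.clip(g, eps, 1.0 - eps)
        fused += np.log(g / (1.0 - g)) - prior_logit          # accumulate evidence
    return 1.0 / (1.0 + np.exp(-(fused + prior_logit)))       # back to probabilities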
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Given a set of surround-view images,
predict a full 360° bird's-eye-view
semantic map, which captures both
static elements like road and
sidewalk as well as dynamic actors
such as cars and pedestrians.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features
at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context
from lower pyramid layers. (3) A stack of dense transformer layers maps the image-based features into the bird's-eye view. (4) The top-down network processes the bird's-eye-view features and predicts the final semantic occupancy probabilities.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
This dense transformer layer first condenses the image-based features along the vertical dimension, whilst retaining the horizontal dimension. It then predicts a set of features along the depth axis in a polar coordinate system, which are then resampled to Cartesian coordinates.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• The dense transformer layer is motivated by the observation that, while the network needs a lot of vertical context to map features to the BEV, in the horizontal direction the relationship between BEV locations and image locations can be established using camera geometry.
• In order to retain the maximum amount of spatial information, the vertical and channel dimensions of the image feature map are collapsed to a bottleneck of size B, while the horizontal dimension W is preserved.
• A 1D convolution is then applied along the horizontal axis, and the resulting feature map is reshaped to give a tensor with channel, depth, and width dimensions (see the sketch after this list).
• However, this feature map, which is still in image-space coordinates, actually corresponds to a trapezoid in the orthographic BEV space due to perspective, so the final step is to resample it into a Cartesian frame using the known camera focal length f and horizontal offset u0.
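
An illustrative PyTorch sketch of such a dense-transformer-style layer, assuming the polar-to-Cartesian sampling grid is precomputed from the focal length f and offset u0; sizes and module names are not the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformerSketch(nn.Module):
    """Collapse the vertical image dimension into a bottleneck, expand along a
    polar depth axis with a 1D convolution, then resample polar -> Cartesian BEV."""
    def __init__(self, in_channels, in_height, bottleneck=64, out_channels=64, depth_bins=48):
        super().__init__()
        # Kernel spans the full (fixed) feature height, giving shape (B, bneck, 1, W).
        self.collapse = nn.Conv2d(in_channels, bottleneck, kernel_size=(in_height, 1))
        self.expand = nn.Conv1d(bottleneck, out_channels * depth_bins, kernel_size=3, padding=1)
        self.out_channels, self.depth_bins = out_channels, depth_bins

    def forward(self, feat, cart_grid):
        # feat: (B, C, H, W) image features; cart_grid: (B, Z, X, 2) normalized
        # sampling locations in the polar map, precomputed from f and u0.
        b, _, _, w = feat.shape
        x = self.collapse(feat).squeeze(2)                               # (B, bneck, W)
        x = self.expand(x)                                               # (B, out*Z, W)
        polar = x.view(b, self.out_channels, self.depth_bins, w)         # (B, C', Z, W) polar features
        return F.grid_sample(polar, cart_grid, align_corners=False)      # (B, C', Z, X) Cartesian BEV

# Hypothetical usage: bev = layer(image_features, cartesian_to_polar_grid)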
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• To obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras.
• The corrected BEV image is segmented into semantic classes and includes a prediction of
occluded areas.
• The neural network approach does not rely on manually labeled data, but is trained on a
synthetic dataset in such a way that it generalizes well to real-world data.
• By using semantically segmented images as input, reduce the reality gap between simulated and
real-world data and are able to show that the method can be successfully applied in the real
world.
• Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A homography can be applied to the four semantically segmented images from
vehicle-mounted cameras to transform them to BEV. This approach involves
learning to compute an accurate BEV image without visual distortions.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• For each vehicle camera, virtual rays are cast from its mount position to the edges of the
semantically segmented ground truth BEV image.
• The rays are only cast to edge pixels that lie within the specific camera’s field of view.
• All pixels along these rays are processed to determine their occlusion state according to the following rules (a simplified ray-walk sketch follows this list):
1. some semantic classes always block sight (e.g. building, truck);
2. some semantic classes never block sight (e.g. road);
3. cars block sight, except for taller objects behind them (e.g. truck, bus);
4. partially occluded objects remain completely visible;
5. objects are only labeled as occluded if they are occluded in all camera perspectives.
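
A simplified NumPy sketch of the per-camera ray walk implied by rules 1-3 (class ids and the rule set are illustrative assumptions; rules 4 and 5 would be applied when fusing the per-object and per-camera results):

import numpy as np

ALWAYS_BLOCK = {3, 4}      # e.g. building, truck (rule 1)
NEVER_BLOCK = {0}          # e.g. road (rule 2)
CAR = 2
TALL = {4, 5}              # e.g. truck, bus: still visible behind a car (rule 3)

def mark_occlusion_along_ray(labels, occluded, start, end, steps=200):
    """labels: (H, W) class-id grid; occluded: (H, W) bool grid updated in place;
    start/end: (row, col) cells of the camera mount and one BEV edge pixel."""
    blocker = None                                  # None, "car", or "hard"
    for t in np.linspace(0.0, 1.0, steps):
        r = int(round(start[0] + t * (end[0] - start[0])))
        c = int(round(start[1] + t * (end[1] - start[1])))
        cls = labels[r, c]
        # Visible if nothing blocks the ray so far, or only a car blocks it
        # and this cell contains a taller object (rule 3).
        visible = blocker is None or (blocker == "car" and cls in TALL)
        if not visible:
            occluded[r, c] = True                   # per-camera result; rule 5 is
                                                    # applied when fusing all cameras
        # Update the blocking state for the cells farther along the ray.
        if cls in ALWAYS_BLOCK:
            blocker = "hard"
        elif cls == CAR and blocker is None:
            blocker = "car"
        # NEVER_BLOCK classes leave the state unchanged (rule 2).
    return occluded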
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The uNetXST architecture has
separate encoder paths for each
input image (green paths). As part of
the skip-connection on each scale
level (violet paths), feature maps are
projectively transformed (v-block),
concatenated with the other input
streams (||-block), convolved, and
finally concatenated with upsampled
output of the decoder path. This
illustration shows a network with only two pooling and two upsampling layers; the actual trained network contains four of each.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The v-block resembles a Spatial Transformer unit.
Input feature maps from preceding convolutional
layers (orange grid layers) are projectively
transformed by the homographies obtained through
IPM (Inverse Perspective Mapping). The transformation
differs between the input streams for the different
cameras. Spatial consistency is established, since the
transformed feature maps all capture the same field
of view as the ground truth BEV. The transformed
feature maps are then concatenated into a single
feature map (cf. ||-block).
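
A short sketch of the projective transform plus concatenation, using kornia's warp_perspective as an assumed stand-in for the v-block and ||-block (the BEV size is illustrative):

import torch
import kornia.geometry.transform as KGT

def warp_feature_maps_to_bev(feature_maps, homographies, bev_size=(256, 512)):
    """Each camera stream's feature map (B, C, H, W) is warped into the common
    BEV frame with its IPM homography (B, 3, 3), then the streams are
    concatenated channel-wise."""
    warped = [KGT.warp_perspective(f, H, dsize=bev_size)
              for f, H in zip(feature_maps, homographies)]   # the v-blocks
    return torch.cat(warped, dim=1)                          # the ||-block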
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• End-to-end pipeline that performs semantic segmentation and short term prediction using a top
down representation.
• This approach consists of an ensemble of neural networks which take in sensor data from different
sensor modalities and transform them into a single common top-down semantic grid representation.
• This representation is favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information of the surrounding scene.
• Because the modalities share a single output representation, they can be easily aggregated to produce
a fused output.
• This work predicts short-term semantic grids but the framework can be extended to other tasks.
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net Architecture:
multiple neural networks, one for
each sensor modality (lidar, radar
and camera) take in a sequence
of input sensor data and output a
sequence of shared top-down
semantic grids representing 3
object classes (Vulnerable Road
Users (VRU), vehicles and
background). The sequences are
then fused using an aggregation
function to output a fused
sequence of semantic grids.
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The overall architecture consists of a neural network for each sensor modality.
• Across all modalities, the network architecture consists of an encoder decoder network with
convolutional layers.
• It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the
decoder.
• After the decoder, a single linear convolutional layer produces logits, and a softmax produces the final output probabilities for each of the three classes at each output timestep (a minimal per-modality sketch follows this list).
• It uses a slightly different encoder and decoder scheme for the vision network compared to the
lidar and radar networks to account for the pixel space features.
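
A toy PyTorch sketch of such a per-modality encoder-decoder with 2x2 average pooling, a linear convolution for logits, and a per-timestep softmax over the three classes (channel counts and depths are illustrative assumptions):

import torch
import torch.nn as nn

class ModalityGridNet(nn.Module):
    """Toy per-modality network: small encoder-decoder with 2x2 average
    pooling / upsampling, a 1x1 conv for logits, and a softmax over 3 classes
    for each output timestep."""
    def __init__(self, in_channels, timesteps=5, num_classes=3):
        super().__init__()
        self.timesteps, self.num_classes = timesteps, num_classes
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, timesteps * num_classes, 1)   # linear conv -> logits

    def forward(self, x):                       # x: (B, C_in, H, W) stacked input frames
        logits = self.head(self.decoder(self.encoder(x)))
        b, _, h, w = logits.shape
        logits = logits.view(b, self.timesteps, self.num_classes, h, w)
        return torch.softmax(logits, dim=2)     # class probabilities per timestep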
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Vision architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Lidar and Radar Architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The LiDAR features consist of: 1) binary lidar occupancy (1 if any lidar point is present in a given grid cell, 0 otherwise); 2) lidar density (log-normalized density of all lidar points present in a grid cell); 3) max z (largest height value of the lidar points in a given grid cell); 4) max z sliced (largest z value for each grid cell over 5 linear slices, e.g. 0-0.5 m, ..., 2.0-2.5 m). A minimal lidar-feature sketch follows this list.
• The radar features consist of: 1) binary radar occupancy (1 if any radar point is present in a given grid cell, 0 otherwise); 2) X, Y values of each radar return's Doppler velocity, compensated for the ego vehicle's motion; 3) radar cross section (RCS); 4) signal-to-noise ratio (SNR); 5) ambiguous Doppler interval.
• The dimensions of the images match the output resolution of 192 by 320.
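
A minimal NumPy sketch of rasterizing a lidar point cloud into a few of the BEV feature channels listed above (ranges, grid size, and axis conventions are illustrative assumptions):

import numpy as np

def lidar_bev_features(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                       grid=(192, 320)):
    """Rasterize a lidar point cloud (N, 3) into binary occupancy,
    log-normalized density, and max-z BEV channels."""
    h, w = grid
    rows = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * h).astype(int)
    cols = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * w).astype(int)
    valid = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    rows, cols, z = rows[valid], cols[valid], points[valid, 2]

    occupancy = np.zeros(grid, dtype=np.float32)
    density = np.zeros(grid, dtype=np.float32)
    max_z = np.full(grid, -np.inf, dtype=np.float32)
    for r, c, zz in zip(rows, cols, z):
        occupancy[r, c] = 1.0
        density[r, c] += 1.0
        max_z[r, c] = max(max_z[r, c], zz)
    density = np.log1p(density)                    # log-normalized density
    max_z[~np.isfinite(max_z)] = 0.0               # empty cells
    return np.stack([occupancy, density, max_z])   # (3, H, W)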
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Qualitative example: label (left); inputs for lidar, radar and vision; predictions for lidar, radar and vision.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV
from side RGB images.
• Two main challenges: the view transformation from side view to bird’s eye view, as well as
transfer learning to unseen domains.
• The two-stage perception pipeline explicitly predicts pixel depths and combines them with pixel
semantics in an efficient manner, allowing the model to leverage depth information to infer
objects’ spatial locations in the BEV.
• Transfer learning is achieved by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
BEV-Seg
pipeline
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• In the first stage, N RGB road scene images are captured by cameras at different angles and
individually pass through semantic segmentation network S and depth estimation network D.
• The resulting side semantic segmentations and depth maps are combined and projected into a
semantic point cloud.
• This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed
into a parser network to predict the final bird’s-eye segmentation.
• The rest of this section provides details on the various components of the pipeline.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic
segmentation.
• For monocular depth estimation, implement SORD using the same HRNet as the backbone.
• For both tasks, train the same model on all four views.
• The resulting semantic point cloud is projected height-wise onto a 512x512 image.
• Train a separate HRNet model as the parser network for the final bird’s-eye segmentation.
• Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target
domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the
target domain.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
Table 1: Segmentation Result on
BEVSEG-Carla. Oracle models have
ground truth given for specified inputs.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
• End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given
image data from an arbitrary number of cameras
• To “lift" each image individually into a frustum of features for each camera, then “splat" all
frustums into a rasterized bird's-eye view grid
• To learn how to represent images and how to fuse predictions from all cameras into a single
cohesive representation of the scene while being robust to calibration error
• Code: https://nv-tlabs.github.io/lift-splat-shoot
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate
frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation
(green). These BEV predictions are then projected back onto input images (dots on the left).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Traditionally, computer vision tasks such as semantic segmentation involve making predictions in
the same coordinate frame as the input image. In contrast, planning for self-driving generally
operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye-
view frame for end-to-end planning from multi-view images.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the “lift" step. For each pixel, it predicts a categorical distribution over depth (left) and a
context vector (top left). Features at each point along the ray are determined by their outer product (right).
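
A minimal PyTorch sketch of this outer product for a single image, assuming D discrete depth bins:

import torch

def lift_pixel_features(context, depth_logits):
    """Scatter every pixel's context vector along its camera ray by taking
    the outer product with a categorical depth distribution.
    context: (B, C, H, W), depth_logits: (B, D, H, W) -> (B, C, D, H, W)."""
    depth_prob = torch.softmax(depth_logits, dim=1)           # distribution over D depth bins
    return context.unsqueeze(2) * depth_prob.unsqueeze(1)     # per-pixel outer product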
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
In the “lift" step, a frustum-shaped point cloud is generated for each individual image (center-left). The
extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV
CNN processes the BEV representation for BEV semantic segmentation or planning (right).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the 1K trajectory templates that are "shot" onto the cost map during training and testing. During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates. During testing, the argmax of this distribution is chosen and the vehicle acts according to the chosen template.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Instead of the hard-margin loss proposed in NMP (Neural Motion Planner), planning is framed as classification over a set of K template trajectories. To leverage the cost-volume nature of the planning problem, the distribution over the K template trajectories is enforced to take the following form:
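
A plausible reconstruction of that softmax (Boltzmann) form, where c_{\tau_i}(o) denotes the accumulated cost of template \tau_i on the predicted cost map for observation o (an assumption consistent with the description above, not a verbatim quote):

p(\tau_i \mid o) = \frac{\exp\left(-c_{\tau_i}(o)\right)}{\sum_{j=1}^{K} \exp\left(-c_{\tau_j}(o)\right)}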
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
For a single time stamp, each camera is removed in turn to visualize how the loss of that camera affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area in front of the ego vehicle, and extrapolates the body of a car of which only a corner can be seen in the top-right camera view.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Qualitatively show how the model performs, given an entirely new camera rig at test time. Road
segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and
curves from observations from a single timestamp. The model does not have access to the speed of the
car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• online estimation of semantic BEV HD-maps using video input from a single onboard camera
• image-level understanding, BEV-level understanding, and aggregation of temporal information
Front-facing monocular camera
for Bird’s-eye View (BEV) HD-
map understanding
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
The method relies on three pillars and can be split into modules that process backbone features: first, the image-level branch, which is composed of two decoders, one processing the static HD-map and one the dynamic obstacles; second, the BEV temporal aggregation module, which fuses the three pillars and aggregates all the temporal and image-plane information in the BEV; and finally the BEV decoder.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
The temporal aggregation module combines information from all frames and all branches into one BEV feature map. Backbone features and image-level static estimates are projected to the BEV with the warping function AB, and a max (M) is applied over the batch dimension. The results are concatenated along the channel dimension. The reference-frame backbone features (highlighted in red) are used both in the max function and as a skip connection to the concatenation (a minimal sketch follows).
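
A minimal PyTorch sketch of this aggregation, assuming the per-frame features have already been warped into the reference BEV frame:

import torch

def temporal_bev_aggregation(bev_feats):
    """Per-frame BEV features are max-pooled over time and concatenated with
    the reference frame's own features (skip connection).
    bev_feats: (T, C, H, W), with index -1 as the reference frame."""
    pooled = bev_feats.max(dim=0).values          # max over the time/batch dimension
    reference = bev_feats[-1]                     # reference-frame features (skip)
    return torch.cat([pooled, reference], dim=0)  # concatenate channels -> (2C, H, W)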
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• The dataset also provides 3D bounding boxes of 23 object classes.
• In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways,
carpark area, road segment, and lane.
• For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian,
motorcycle, traffic cone and barrier.
• Even though a six-camera rig was used to capture the data, only the front camera is used for training and evaluation.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
HYDROPOWER - Hydroelectric power generation
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 

BEV Semantic Segmentation

  • 1. BEV SEMANTIC SEGMENTATION Yu Huang Sunnyvale, California Yu.huang07@gmail.com
  • 2. OUTLINE • Learning to Look around Objects for Top-View Representations of Outdoor Scenes • Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Enc-Dec Networks • Cross-view Semantic Segmentation for Sensing Surroundings • MonoLayout: Amodal scene layout from a single image • Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks • A Sim2Real DL Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in BEV • FISHING Net: Future Inference of Semantic Heatmaps In Grids • BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud • Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D • Understanding Bird’s-Eye View Semantic HD-maps Using an Onboard Monocular Camera
  • 3. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes • Estimating an occlusion-reasoned semantic scene layout in the top-view. • This challenging problem not only requires an accurate understanding of both the 3D geometry and the semantics of the visible scene, but also of occluded areas. • A convolutional neural network that learns to predict occluded portions of the scene layout by looking around foreground objects like cars or pedestrians. • But instead of hallucinating RGB values, directly predicting the semantics and depths in the occluded areas enables a better transformation into the top-view. • This initial top-view representation can be significantly enhanced by learning priors and rules about typical road layouts from simulated or, if available, map data.
  • 4. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes
  • 5. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The inpainting CNN first encodes a masked image and the mask itself. The extracted features are concatenated and two decoders predict semantics and depth for visible and occluded pixels. To train the inpainting CNN, ignore FG objects as no GT is available (red) but artificially add masks (green) over BG regions where full annotation is already available.
  • 6. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The process of mapping the semantic segmentation with corresponding depth first into a 3D point cloud and then into the bird's eye view. The red and blue circles illustrate corresponding locations in all views.
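A minimal sketch (assuming a standard pinhole camera model; the intrinsics matrix K and the grid parameters are illustrative) of how per-pixel semantics and depth can be back-projected into a 3D point cloud and then flattened onto a bird's-eye-view grid, in the spirit of the mapping step described above:

import numpy as np

def segmentation_depth_to_bev(semantics, depth, K, grid_res=0.1, grid_size=(400, 400)):
    """Back-project per-pixel semantics + depth to 3D, then rasterize into a BEV grid.
    semantics: (H, W) class ids; depth: (H, W) metric depth; K: 3x3 pinhole intrinsics.
    Camera convention: x right, y down, z forward; BEV grid axes are (forward, lateral)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth                                    # forward distance
    x = (u - K[0, 2]) * z / K[0, 0]              # lateral offset
    y = (v - K[1, 2]) * z / K[1, 1]              # height (could be used to drop points far above the road)

    rows = (z / grid_res).astype(int)                      # forward axis of the BEV grid
    cols = (x / grid_res + grid_size[1] / 2).astype(int)   # lateral axis, centered on the ego camera
    valid = (rows >= 0) & (rows < grid_size[0]) & (cols >= 0) & (cols < grid_size[1])

    bev = np.zeros(grid_size, dtype=np.int32)              # 0 = unknown / unobserved
    bev[rows[valid], cols[valid]] = semantics[valid]       # last point written wins per cell
    return bev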
  • 7. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data, and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the initial BEV map and a crop of OSM data (obtained via a given noisy GPS position and yaw estimate). The CNN predicts a warp for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
  • 8. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) We use a composition of a similarity transform (left, “box”) and a non-parametric warp (right, “flow”) to align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom) Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping functions, respectively: “box”, “flow”, “box+flow”, “box+flow (with regularization)”. Note the importance of composing the transformations and the induced regularization.
  • 9. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes Examples of the resulting BEV representation.
  • 10. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks • This work performs end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. • The network learns to predict four classes, as well as a camera-to-bird’s-eye-view mapping. • At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. • The variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data.
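A minimal PyTorch-style sketch of such a variational encoder-decoder (layer widths, latent size, and the class name VariationalGridMapper are illustrative assumptions, not the published configuration): the front view is encoded, a small latent vector is sampled with the reparameterization trick, and the decoder emits a top-view semantic grid.

import torch
import torch.nn as nn

class VariationalGridMapper(nn.Module):
    """Illustrative front-view -> top-view variational encoder-decoder."""
    def __init__(self, latent_dim=128, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(              # front-view RGB -> pooled feature vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(              # latent vector -> top-view semantic grid logits
            nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, image):
        h = self.encoder(image)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar                        # grid logits plus terms for the KL loss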
  • 11. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Illustration of the proposed variational encoder-decoder approach. From a single front-view RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
  • 12. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Some visualized mapping examples on the test set with different methods.
  • 13. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks
  • 14. Cross-view Semantic Segmentation For Sensing Surroundings • Cross-view semantic segmentation: a framework named View Parsing Network (VPN) is proposed to address it. • In the cross-view semantic segmentation task, the agent is trained to parse first-view observations into a top-down-view semantic map indicating the spatial location of all objects at the pixel level. • The main issue of this task is the lack of real-world annotations for top-down-view data. • To mitigate this, the VPN is trained in a 3D graphics environment and domain adaptation is used to transfer it to real-world data. • Code and demo videos can be found at https://view-parsing-network.github.io.
  • 15. Cross-view Semantic Segmentation For Sensing Surroundings Framework of the View Parsing Network for cross-view semantic segmentation. The simulation part shows the architecture and training scheme of VPN, while the real-world part demonstrates the domain adaptation process for transferring VPN to the real world.
  • 16. Cross-view Semantic Segmentation For Sensing Surroundings Qualitative results of sim-to-real adaptation. The results of source prediction before and after domain adaptation, drivable area prediction after adaptation, and the ground-truth drivable area map.
  • 17. MonoLayout: Amodal Scene Layout From A Single Image • Given a single color image captured from a driving platform, the task is to predict the bird’s eye view layout of the road and other traffic participants. • The estimated layout should reason beyond what is visible in the image, and compensate for the loss of 3D information due to projection. • Amodal scene layout estimation involves hallucinating the scene layout even for parts of the world that are occluded in the image. • MonoLayout, a deep NN for real-time amodal scene layout estimation from a single image. • The scene layout is represented as a multi-channel semantic occupancy grid, and adversarial feature learning is leveraged to “hallucinate” plausible completions for occluded image parts.
  • 18. MonoLayout: Amodal Scene Layout From A Single Image MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about the amodal scene layout in bird’s eye view in real time (30 fps). This approach, MonoLayout, can hallucinate regions of the static scene (road, sidewalks) and traffic participants that do not even project to the visible region of the image plane. Shown above are example images from the KITTI (left) and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on hallucinating occluded regions.
  • 19. MonoLayout: Amodal Scene Layout From A Single Image Architecture: MonoLayout takes in a color image of an urban driving scenario, and predicts an amodal scene layout in bird’s eye view. The architecture comprises a context encoder, amodal layout decoders, and two discriminators.
  • 20. MonoLayout: Amodal Scene Layout From A Single Image
  • 21. MonoLayout: Amodal Scene Layout From A Single Image Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene (road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is able to do so accurately. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where most of the view is blocked by vehicles, and the scenes exhibit high dynamic range (HDR) and shadows.
  • 22. MonoLayout: Amodal Scene Layout From A Single Image Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D object detection benchmark. From left to right, the columns correspond to the input image, Mono Occupancy, Mono3D, OFT, MonoLayout, and ground truth, respectively. While the other approaches miss out on detecting cars (top row), split a vehicle detection into two (second row), or produce stray detections off the road (third row), MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
  • 23. MonoLayout: Amodal Scene Layout From A Single Image Amodal scene layout estimation on the Argoverse dataset. The dataset comprises multiple challenging scenarios, with low illumination and a large number of vehicles. MonoLayout is able to produce sharp, accurate estimates of vehicles and road layouts. (Sidewalks are not predicted here, as they aren’t annotated in Argoverse.)
  • 24. MonoLayout: Amodal Scene Layout From A Single Image Trajectory forecasting: MonoLayout-forecast accurately estimates future trajectories of moving vehicles. (Left): In each figure, the magenta cuboid shows the initial position of the vehicle. MonoLayout-forecast is pre-conditioned for 1 second by observing the vehicle, at which point (cyan cuboid) it starts forecasting future trajectories (blue). The ground-truth trajectory is shown in red for comparison. (Right): Trajectories visualized in image space. Notice how MonoLayout-forecast is able to forecast trajectories accurately despite the presence of moving obstacles (top row), turns (middle row), and merging traffic (bottom row).
  • 25. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • Vision-based elements: ground plane estimation, road segmentation, and 3D object detection. • A simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture. • For the maps themselves, a semantic Bayesian occupancy grid framework is adopted, allowing information to be trivially accumulated over multiple cameras and timesteps (see the log-odds sketch below). • Code available at http://github.com/tom-roddick/mono-semantic-maps.
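A minimal sketch of how a semantic Bayesian occupancy grid can accumulate per-frame class probabilities via the standard log-odds update; this is the usual way such a framework fuses observations from multiple cameras and timesteps, though the exact fusion rule used by the authors may differ.

import numpy as np

class SemanticBayesianGrid:
    """Accumulate per-class occupancy probabilities over frames using log-odds."""
    def __init__(self, num_classes, grid_shape, prior=0.5):
        self.logodds = np.full((num_classes, *grid_shape), np.log(prior / (1 - prior)))

    def update(self, probs, eps=1e-6):
        """probs: (num_classes, H, W) per-frame occupancy probabilities, already warped to the ego frame."""
        p = np.clip(probs, eps, 1 - eps)
        self.logodds += np.log(p / (1 - p))          # Bayesian update: add the observation log-odds

    def posterior(self):
        return 1.0 / (1.0 + np.exp(-self.logodds))   # convert accumulated log-odds back to probabilities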
  • 26. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 27. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Given a set of surround-view images, predict a full 360° bird’s-eye-view semantic map, which captures both static elements like road and sidewalk as well as dynamic actors such as cars and pedestrians.
  • 28. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context from lower pyramid layers. (3) A stack of dense transformer layers maps the image-based features into the bird’s-eye view. (4) The top-down network processes the bird’s-eye-view features and predicts the final semantic occupancy probabilities.
  • 29. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks This dense transformer layer first condenses the image based features along the vertical dimension, whilst retaining the horizontal dimension. Then, predict a set of features along the depth axis in a polar coordinate system, which are then resampled to Cartesian coordinates.
  • 30. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • The dense transformer layer is inspired by the observation that, while the network needs a lot of vertical context to map features to the BEV, in the horizontal direction the relationship between BEV locations and image locations can be established using camera geometry. • In order to retain the maximum amount of spatial information, the vertical and channel dimensions of the image feature map are collapsed to a bottleneck of size B, while the horizontal dimension W is preserved. • A 1D convolution is then applied along the horizontal axis and the resulting feature map is reshaped into a tensor with a depth dimension in polar coordinates. • However, this feature map, which is still in image-space coordinates, actually corresponds to a trapezoid in the orthographic BEV space due to perspective, so the final step is to resample it into a Cartesian frame using the known camera focal length f and horizontal offset u0. A sketch of this layer follows below.
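A condensed PyTorch-style sketch of this dense transformer idea (tensor shapes, the class name, and the polar-to-Cartesian sampling grid are illustrative assumptions rather than the published implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformerSketch(nn.Module):
    """Collapse the vertical image axis, expand along depth, then resample polar -> Cartesian."""
    def __init__(self, in_channels, feat_height, bottleneck, out_channels, depth_bins):
        super().__init__()
        self.collapse = nn.Conv2d(in_channels, bottleneck, kernel_size=(feat_height, 1))
        self.expand = nn.Conv1d(bottleneck, out_channels * depth_bins, kernel_size=1)
        self.out_channels, self.depth_bins = out_channels, depth_bins

    def forward(self, feats, cart_to_polar_grid):
        # feats: (B, C, H, W) image-plane features
        B, _, _, W = feats.shape
        x = self.collapse(feats).squeeze(2)                       # (B, bottleneck, W): vertical axis collapsed
        x = self.expand(x)                                        # (B, out*depth, W): 1D conv along the horizontal axis
        polar = x.view(B, self.out_channels, self.depth_bins, W)  # polar BEV features: (depth bin, image column)
        # Resample the polar trapezoid onto a Cartesian BEV grid
        # cart_to_polar_grid: (B, Z, X, 2) normalized sampling coordinates built from f and u0
        return F.grid_sample(polar, cart_to_polar_grid, align_corners=False)

The cart_to_polar_grid would be precomputed from the focal length f and horizontal offset u0, mapping each Cartesian BEV cell to its (depth bin, image column) location.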
  • 31. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 32. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 33. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 34. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 35. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • To obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras. • The corrected BEV image is segmented into semantic classes and includes a prediction of occluded areas. • The neural network approach does not rely on manually labeled data, but is trained on a synthetic dataset in such a way that it generalizes well to real-world data. • By using semantically segmented images as input, the reality gap between simulated and real-world data is reduced, and the method is shown to be successfully applicable in the real world. • Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
  • 36. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV A homography can be applied to the four semantically segmented images from vehicle-mounted cameras to transform them to BEV. This approach involves learning to compute an accurate BEV image without visual distortions.
  • 37. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • For each vehicle camera, virtual rays are cast from its mount position to the edges of the semantically segmented ground truth BEV image. • The rays are only cast to edge pixels that lie within the specific camera’s field of view. • All pixels along these rays are processed to determine their occlusion state according to the following rules: 1. some semantic classes always block sight (e.g. building, truck); 2. some semantic classes never block sight (e.g. road); 3. cars block sight, except on taller objects behind them (e.g. truck, bus); 4. partially occluded objects remain completely visible; 5. objects are only labeled as occluded if they are occluded in all camera perspectives.
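A simplified sketch of this ray-casting procedure; the class ids, the line tracing, and the blocking rules are illustrative, and rules 3 and 4 (height-dependent blocking by cars, and keeping partially occluded objects fully visible) are omitted for brevity.

import numpy as np

ALWAYS_BLOCK = {1, 2}   # e.g. building, truck (hypothetical class ids)
NEVER_BLOCK = {0}       # e.g. road

def cast_ray_occlusion(label_grid, cam_cell, edge_cell):
    """Mark BEV cells along one ray from the camera cell as visible until a blocker is hit."""
    visible = np.zeros(label_grid.shape, dtype=bool)
    n = int(max(abs(edge_cell[0] - cam_cell[0]), abs(edge_cell[1] - cam_cell[1])))
    rows = np.linspace(cam_cell[0], edge_cell[0], n + 1).round().astype(int)
    cols = np.linspace(cam_cell[1], edge_cell[1], n + 1).round().astype(int)
    for r, c in zip(rows, cols):
        visible[r, c] = True                      # cells up to and including the first blocker are visible
        if label_grid[r, c] in ALWAYS_BLOCK:      # rule 1: these classes always block sight
            break
        # rule 2: classes in NEVER_BLOCK never terminate the ray
    return visible

# Rule 5: a cell counts as occluded only if it is invisible from every camera, i.e.
# occluded = ~(visible_cam1 | visible_cam2 | ... | visible_camN)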
  • 38. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The uNetXST architecture has separate encoder paths for each input image (green paths). As part of the skip-connection on each scale level (violet paths), feature maps are projectively transformed (v-block), concatenated with the other input streams (||-block), convolved, and finally concatenated with the upsampled output of the decoder path. This illustration shows a network with only two pooling and two upsampling layers; the actual trained network contains four of each.
  • 39. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The v-block resembles a Spatial Transformer unit. Input feature maps from preceding convolutional layers (orange grid layers) are projectively transformed by the homographies obtained through IPM (Inverse Perspective Mapping). The transformation differs between the input streams for the different cameras. Spatial consistency is established, since the transformed feature maps all capture the same field of view as the ground truth BEV. The transformed feature maps are then concatenated into a single feature map (cf. ||-block).
  • 40. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 41. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 42. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • End-to-end pipeline that performs semantic segmentation and short-term prediction using a top-down representation. • This approach consists of an ensemble of neural networks which take in sensor data from different sensor modalities and transform them into a single common top-down semantic grid representation. • This representation is favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information of the surrounding scene. • Because the modalities share a single output representation, they can be easily aggregated to produce a fused output. • This work predicts short-term semantic grids, but the framework can be extended to other tasks.
  • 43. FISHING Net: Future Inference Of Semantic Heatmaps In Grids FISHING Net Architecture: multiple neural networks, one for each sensor modality (lidar, radar and camera) take in a sequence of input sensor data and output a sequence of shared top-down semantic grids representing 3 object classes (Vulnerable Road Users (VRU), vehicles and background). The sequences are then fused using an aggregation function to output a fused sequence of semantic grids.
  • 44. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The overall architecture consists of a neural network for each sensor modality. • Across all modalities, the network architecture consists of an encoder-decoder network with convolutional layers (a schematic sketch follows below). • It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the decoder. • After the decoder, a single linear convolutional layer produces logits, and a softmax produces the final output probabilities for each of the three classes at each of the output timesteps. • It uses a slightly different encoder and decoder scheme for the vision network compared to the lidar and radar networks to account for the pixel-space features.
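A minimal sketch, following the description above, of one such per-modality encoder-decoder; channel widths, the number of scale levels, and the class name are illustrative assumptions.

import torch
import torch.nn as nn

class ModalityGridNet(nn.Module):
    """Encoder-decoder over stacked input frames -> per-timestep 3-class semantic grids."""
    def __init__(self, in_channels, out_timesteps, num_classes=3, width=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                                     # average pooling, pool size (2, 2)
            nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        # Single linear (1x1) convolution to logits; softmax over classes per output timestep
        self.head = nn.Conv2d(width, out_timesteps * num_classes, 1)
        self.out_timesteps, self.num_classes = out_timesteps, num_classes

    def forward(self, x):
        logits = self.head(self.decoder(self.encoder(x)))
        B, _, H, W = logits.shape
        logits = logits.view(B, self.out_timesteps, self.num_classes, H, W)
        return torch.softmax(logits, dim=2)                      # class probabilities per timestep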
  • 45. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Vision architecture
  • 46. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Lidar and Radar Architecture
  • 47. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The LiDAR features consist of: 1) Binary lidar occupancy (1 if any lidar point is present in a given grid cell, 0 otherwise). 2) Lidar density (log-normalized density of all lidar points present in a grid cell). 3) Max z (largest height value for lidar points in a given grid cell). 4) Max z sliced (largest z value for each grid cell over 5 linear slices, e.g. 0-0.5 m, ..., 2.0-2.5 m). • The Radar features consist of: 1) Binary radar occupancy (1 if any radar point is present in a given grid cell, 0 otherwise). 2) X, Y values for each radar return’s Doppler velocity compensated with the ego vehicle’s motion. 3) Radar cross section (RCS). 4) Signal-to-noise ratio (SNR). 5) Ambiguous Doppler interval. • The dimensions of the feature images match the output resolution of 192 by 320 (a rasterization sketch follows below).
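A rough numpy sketch of how the listed LiDAR features could be rasterized into the 192-by-320 grid; the cell size, grid centering, and slice boundaries are assumptions for illustration, and the radar features would be rasterized analogously.

import numpy as np

def lidar_bev_features(points, grid_shape=(192, 320), cell=0.5,
                       z_slices=(0.0, 0.5, 1.0, 1.5, 2.0, 2.5)):
    """points: (N, 3) lidar x, y, z in the ego frame. Returns a (C, H, W) feature grid."""
    H, W = grid_shape
    rows = (points[:, 0] / cell + H / 2).astype(int)
    cols = (points[:, 1] / cell + W / 2).astype(int)
    keep = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
    rows, cols, z = rows[keep], cols[keep], points[keep, 2]

    occupancy = np.zeros((H, W)); density = np.zeros((H, W))
    max_z = np.full((H, W), -np.inf)
    sliced = np.full((len(z_slices) - 1, H, W), -np.inf)
    for r, c, zz in zip(rows, cols, z):
        occupancy[r, c] = 1.0                                    # binary occupancy
        density[r, c] += 1.0
        max_z[r, c] = max(max_z[r, c], zz)                       # largest height in the cell
        for s in range(len(z_slices) - 1):                       # largest height per linear slice
            if z_slices[s] <= zz < z_slices[s + 1]:
                sliced[s, r, c] = max(sliced[s, r, c], zz)
    density = np.log1p(density)                                  # log-normalized point density
    max_z[max_z == -np.inf] = 0.0
    sliced[sliced == -np.inf] = 0.0
    return np.concatenate([occupancy[None], density[None], max_z[None], sliced], axis=0)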
  • 48. FISHING Net: Future Inference Of Semantic Heatmaps In Grids
  • 49. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Label input for lidar, radar, and vision; predictions for lidar, radar, and vision.
  • 50. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. • Two main challenges: the view transformation from side view to bird’s eye view, as well as transfer learning to unseen domains. • The 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects’ spatial locations in the BEV. • Transfer learning by abstracting high level geometric features and predicting an intermediate representation that is common across different domains.
  • 51. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud BEV-Seg pipeline
  • 52. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • In the first stage, N RGB road scene images are captured by cameras at different angles and individually pass through semantic segmentation network S and depth estimation network D. • The resulting side semantic segmentations and depth maps are combined and projected into a semantic point cloud. • This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed into a parser network to predict the final bird’s-eye segmentation. • The rest of this section provides details on the various components of the pipeline.
  • 53. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic segmentation. • For monocular depth estimation, implement SORD using the same HRNet as the backbone. • For both tasks, train the same model on all four views. • The resulting semantic point cloud is projected height-wise onto a 512x512 image. • Train a separate HRNet model as the parser network for the final bird’s-eye segmentation. • Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the target domain.
  • 54. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud Table 1: Segmentation Result on BEVSEG-Carla. Oracle models have ground truth given for specified inputs.
  • 55. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D • End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras • To “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird's-eye-view grid • To learn how to represent images and how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error • Code: https://nv-tlabs.github.io/lift-splat-shoot
  • 56. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation (green). These BEV predictions are then projected back onto input images (dots on the left).
  • 57. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Traditionally, computer vision tasks such as semantic segmentation involve making predictions in the same coordinate frame as the input image. In contrast, planning for self-driving generally operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye- view frame for end-to-end planning from multi-view images.
  • 58. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the “lift" step. For each pixel, it predicts a categorical distribution over depth (left) and a context vector (top left). Features at each point along the ray are determined by their outer product (right).
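A compact sketch of this “lift” computation, assuming per-pixel depth logits and context features are already predicted by a backbone (the function name and shapes are illustrative):

import torch

def lift_features(depth_logits, context):
    """depth_logits: (B, D, H, W); context: (B, C, H, W).
    Returns a frustum feature volume (B, C, D, H, W): the outer product of a categorical
    distribution over D depth bins and the per-pixel context vector."""
    depth_dist = torch.softmax(depth_logits, dim=1)              # categorical distribution over depth
    return depth_dist.unsqueeze(1) * context.unsqueeze(2)        # broadcasted outer product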
  • 59. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D In the “lift" step, a frustum-shaped point cloud is generated for each individual image (center-left). The extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV CNN processes the BEV representation for BEV semantic segmentation or planning (right).
  • 60. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the 1K trajectory templates that are “shot” onto the cost map during training and testing. During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates. During testing, the argmax of this distribution is chosen and the vehicle acts according to the chosen template.
  • 61. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Instead of the hard-margin loss proposed in NMP (Neural Motion Planner), planning is framed as classification over a set of K template trajectories. To leverage the cost-volume nature of the planning problem, enforce the distribution over K template trajectories to take the following form
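Assuming the standard Boltzmann (softmax over negative template cost) form implied by the description above, the distribution over the K templates would read approximately:

p(\tau_i \mid o) = \frac{\exp\left(-c_o(\tau_i)\right)}{\sum_{j=1}^{K} \exp\left(-c_o(\tau_j)\right)},

where c_o(\tau_i) is the cost of template trajectory \tau_i accumulated over the predicted BEV cost map for observation o; the exact notation in the paper may differ.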
  • 62. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D For a single timestamp, each of the cameras is removed in turn to visualize how the loss of that camera affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area in front of the ego vehicle and extrapolates the body of a car for which only a corner can be seen in the top-right camera.
  • 63. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Qualitatively show how the model performs, given an entirely new camera rig at test time. Road segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
  • 64. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and curves from observations from a single timestamp. The model does not have access to the speed of the car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
  • 65. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • Online estimation of semantic BEV HD-maps using video input from a single onboard camera. • Three pillars: image-level understanding, BEV-level understanding, and aggregation of temporal information. Front-facing monocular camera for Bird’s-eye View (BEV) HD-map understanding.
  • 66. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera It relies on three pillars and can be split into modules that process backbone features: first, the image-level branch, which is composed of two decoders, one processing the static HD-map and one the dynamic obstacles; second, the BEV temporal aggregation module, which fuses the three pillars and aggregates all the temporal and image-plane information in the BEV; and finally the BEV decoder.
  • 67. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera The temporal aggregation module combines information from all frames and all branches into one BEV feature map. Backbone features and image-level static estimates are projected to BEV with the warping function AB, and a max (M) is applied along the batch (time) dimension. The results are concatenated along the channel dimension. The reference-frame backbone features (highlighted in red) are used in the max function as well as in a skip connection to the concatenation.
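A schematic PyTorch-style sketch of the described aggregation; the warping function AB is assumed to be available as a precomputed sampling grid per frame, and the function name and shapes are illustrative.

import torch
import torch.nn.functional as F

def temporal_bev_aggregate(backbone_feats, static_feats, warp_grids, ref_index):
    """backbone_feats, static_feats: lists of (1, C, H, W) per-frame tensors in the image plane.
    warp_grids: list of (1, Hb, Wb, 2) sampling grids implementing the image->BEV warp AB per frame."""
    bev_backbone = [F.grid_sample(f, g, align_corners=False) for f, g in zip(backbone_feats, warp_grids)]
    bev_static = [F.grid_sample(f, g, align_corners=False) for f, g in zip(static_feats, warp_grids)]
    # Max over frames (the batch/time dimension), then concatenate channel-wise
    fused_backbone = torch.stack(bev_backbone, dim=0).max(dim=0).values
    fused_static = torch.stack(bev_static, dim=0).max(dim=0).values
    reference = bev_backbone[ref_index]                          # skip connection from the reference frame
    return torch.cat([fused_backbone, fused_static, reference], dim=1)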
  • 68. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • The dataset also provides 3D bounding boxes of 23 object classes. • In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways, carpark area, road segment, and lane. • For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, traffic cone and barrier. • Even though a six-camera rig was used to capture data, only use the front camera for training and evaluation.
  • 69. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera
  • 70. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera