Deep Learning’s Application
in Radar Signal Data II
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• A Deep Learning-based Radar and Camera Sensor Fusion Architecture for
Object Detection
• CNN based Road User Detection using the 3D Radar Cube
• Distant Vehicle Detection Using Radar and Vision
• Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in
Unseen Adverse Weather
• A Deep Learning Approach for Automotive Radar Interference Mitigation
• Deep Radar Detector
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
• The sensor quality of the camera is limited in severe weather conditions and through increased
sensor noise in sparsely lit areas and at night.
• Compared to camera sensors, radar sensors are more robust to environmental conditions such as
lighting changes, rain and fog.
• This approach enhances current 2D object detection networks by fusing camera data and
projected sparse radar data in the network layers.
• The radar sensor outputs a sparse 2D point cloud with associated radar characteristics.
• The data used includes the azimuth angle, the distance and the radar cross section (RCS).
• The proposed Camera Radar Fusion Net (CRF-Net) automatically learns at which level the fusion
of the sensor data is most beneficial for the detection result.
• Additionally, it introduces BlackIn, a training strategy inspired by Dropout, which focuses the
learning on a specific sensor type (a minimal sketch follows below).
• The code is available at: https://github.com/TUMFTM/CameraRadarFusionNet
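The following is a minimal sketch of a BlackIn-style sensor dropout, assuming an augmented H x W x C image whose first three channels are RGB and whose remaining channels are projected radar data; the blackout probability used here is an illustrative value, not the paper's setting.

```python
import numpy as np

def blackin(augmented_image, p_camera_blackout=0.2, rng=None):
    """BlackIn-style sensor dropout (sketch, not the authors' exact code).

    `augmented_image` is assumed to be H x W x C with channels 0-2 = RGB
    and the remaining channels = projected radar data (distance, RCS, ...).
    With probability `p_camera_blackout` the camera channels are zeroed so
    the network is forced to learn from the radar channels alone.
    """
    rng = rng or np.random.default_rng()
    out = augmented_image.copy()
    if rng.random() < p_camera_blackout:
        out[..., :3] = 0.0          # blank the RGB channels for this sample
    return out

# usage: applied per training sample, analogous to Dropout on whole sensors
sample = np.random.rand(360, 640, 5).astype(np.float32)
augmented = blackin(sample, p_camera_blackout=0.2)
```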
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
• It transforms the radar data from the 2D ground plane to a perpendicular image plane.
• The characteristics of the radar return are stored as pixel values in the augmented image.
• At the location of image pixels where no radar returns are present, the projected radar channel
values are set to the value 0.
• The input camera image consists of three channels (red, green, blue); the aforementioned radar
channels are then added to form the input for the neural network.
• The fields of view (FOV) of the three radars overlap with the FOV of the front-facing fish-eye camera.
• Then concatenate the point clouds of the three sensors into one and use this as the projected
radar input source.
• The radar detections give no information about the height at which they were received, which
makes it more difficult to fuse the two data types.
• The 3D coordinates of the radar detections are assumed to be returned from the ground plane
that the vehicle is driving on.
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
• The projections are then extended in the direction perpendicular to this plane, so as to account for
the vertical extent of the objects to be detected. (It detects traffic objects which can be
classified as cars, trucks, motorcycles, bicycles and pedestrians.)
• To cover the height of such object types, it assumes a height extension of the radar detections of
3 m when associating camera pixels with radar data (see the projection sketch below).
• The radar data is mapped with a pixel width of one into the image plane.
• The density of the radar data is increased by jointly fusing the last 13 radar cycles (around 1 s)
into its own data format, with ego-motion compensated for this projection.
• The radar channels (distance and RCS) are mapped to the same locations.
• The radar returns many detections coming from objects which are not relevant for the driving
task, such as ghost objects, irrelevant objects and ground detections.
• These detections are called clutter or noise for the task at hand.
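Below is a minimal sketch of this projection step, assuming a pinhole camera model with intrinsics K, a 4x4 radar-to-camera transform T_cam_radar, a radar z-axis pointing up, and per-detection meta values (distance, RCS); none of these names come from the released code.

```python
import numpy as np

def project_radar_to_image(radar_xyz, radar_meta, K, T_cam_radar,
                           image_shape, height_m=3.0):
    """Sketch of the ground-plane radar projection described above.
    Assumptions (not from the released code): pinhole intrinsics K (3x3),
    a rigid 4x4 transform T_cam_radar, a radar z-axis pointing up, and
    radar_meta holding one row of (distance, RCS) per detection."""
    h, w = image_shape
    channels = np.zeros((h, w, radar_meta.shape[1]), dtype=np.float32)

    def to_pixel(p_radar):
        p_cam = (T_cam_radar @ np.append(p_radar, 1.0))[:3]
        if p_cam[2] <= 0:                                  # behind the image plane
            return None
        uv = K @ (p_cam / p_cam[2])
        return int(round(uv[0])), int(round(uv[1]))

    for point, meta in zip(radar_xyz, radar_meta):
        bottom = to_pixel(point)                               # on the ground plane
        top = to_pixel(point + np.array([0.0, 0.0, height_m]))  # 3 m above it
        if bottom is None or top is None:
            continue
        u = bottom[0]
        if not (0 <= u < w):
            continue
        v0, v1 = sorted((top[1], bottom[1]))
        v0, v1 = max(0, v0), min(h, v1 + 1)
        channels[v0:v1, u, :] = meta   # a pixel-wide vertical line of radar values
    return channels
```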
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
(a) Without ground-truth noise filter (b) With ground-truth noise filter
A ground-truth noise filter is applied to the radar data, which removes all radar detections outside of the 3D ground-
truth bounding boxes, to show the general feasibility of the fusion concept with less clutter in the input signal.
An annotation filter (AF) is applied, so that the filtered ground-truth data only contains objects which yield at least
one radar detection. This is done by associating the 3D bounding boxes with radar points. The fusion approach is
expected to show its potential for those objects which are detectable in both modalities.
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
• The neural network architecture builds on RetinaNet with a VGG backbone.
• The network is extended to deal with the additional radar channels of the augmented image.
• The output is 2D regression of Bbox coordinates and a classification score for the Bbox.
• The network is trained using focal loss and the baseline uses a VGG feature extractor.
• The amount of information of one radar return is different from the information of a single pixel.
• The distance of an object to the ego-vehicle, as measured by the radar, can be considered more
relevant to the driving task than a simple color value of a pixel of a camera.
• In deeper layers of the neural network, the input data is compressed into a denser representation
which ideally contains all the relevant input information.
• As it is hard to quantify the abstraction level of the information provided by each of the two
sensor types, the network is designed so that it learns itself at which depth level the fusion of
the data is most beneficial to the overall loss minimization (a minimal sketch follows below).
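A minimal sketch of this deep-fusion idea (not the actual CRF-Net code): the projected radar channels are pooled to each feature-map resolution and concatenated to the camera features at every backbone level, so training itself can weight the depth at which the radar information is used. Layer widths below are placeholders.

```python
import torch
import torch.nn as nn

class DeepFusionBackbone(nn.Module):
    """Sketch of fusing radar channels at every backbone depth
    (layer widths are assumptions, not the CRF-Net configuration)."""
    def __init__(self, radar_channels=2, widths=(64, 128, 256)):
        super().__init__()
        blocks, in_ch = [], 3
        for out_ch in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch + radar_channels, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        self.pool_radar = nn.MaxPool2d(2)

    def forward(self, rgb, radar):
        x = rgb
        for block in self.blocks:
            x = block(torch.cat([x, radar], dim=1))  # fuse radar at this depth
            radar = self.pool_radar(radar)           # match the next resolution
        return x

# usage: the augmented radar channels share the camera resolution at the input
features = DeepFusionBackbone()(torch.randn(1, 3, 128, 256), torch.randn(1, 2, 128, 256))
```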
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
High-level structure of Camera Radar Fusion Net
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
(a) Baseline network detection (b) CRF-Net detection
Detection comparison of the baseline network (a) and the CRF-Net (b).
The baseline network does not detect the pedestrian on the left.
A Deep Learning-based Radar and Camera Sensor
Fusion Architecture for Object Detection
TABLE II: mAP scores of the baseline network and CameraRadarFusionNet. Configurations: (AF) -
Annotation filter, (GRF) - ground-truth radar filter, (NRM) - No radar meta data

Data                                   Network                        mAP
nuScenes                               Baseline image network         43.47%
nuScenes                               CRF-Net w/o BlackIn            43.6%
nuScenes                               CRF-Net                        43.95%
nuScenes                               Baseline image network (AF)    43.03%
nuScenes                               CRF-Net (AF)                   44.85%
nuScenes                               CRF-Net (AF, GRF)              55.99%
nuScenes                               CRF-Net (AF, GRF, NRM)         53.23%
TUM (Technical University of Munich)   Baseline image network         56.12%
TUM (Technical University of Munich)   CRF-Net                        57.50%
CNN based Road User Detection using the 3D Radar Cube
• Radars are attractive sensors for intelligent vehicles as they are relatively robust to weather and
lighting conditions (e.g. rain, snow, darkness) compared to camera and LIDAR sensors.
• Radars also have excellent range sensitivity and can measure radial object velocities directly using
the Doppler effect.
• This paper presents a radar-based, single-frame, multi-class detection method for moving road
users (pedestrian, cyclist, car), which utilizes low-level radar cube data.
• The method provides class information both on the radar target and object-level.
• Radar targets are classified individually after extending the target features with a cropped block of
the 3D radar cube around their positions, thereby capturing the motion of moving parts in the
local velocity distribution.
• A Convolutional Neural Network (CNN), RTCnet (Radar Target Classification Network), is proposed
for this classification step.
• Afterwards, object proposals are generated with a clustering step, which not only considers the
radar targets’ positions and velocities, but their calculated class scores as well.
CNN based Road User Detection using the 3D Radar Cube
Inputs (radar cube and radar targets,
top), main processing blocks (RTCnet
and object clustering, bottom left), and
outputs (classified radar targets and
object proposals, bottom right).
Classified radar targets are shown as
colored spheres at the sensor’s height.
Object proposals are visualized by a
convex hull around the clustered targets
on the ground plane and at 2 m.
CNN based Road User Detection using the 3D Radar Cube
• Pre-processing:
• A single frame of radar targets and the radar cube (low-level data) is fetched.
• Each radar target’s speed is compensated for ego-motion.
• Targets with low compensated (absolute) velocity are static and are filtered out.
• Then, corresponding target-level and low-level radar data are connected.
• That is, each remaining dynamic radar target is looked up in its corresponding
range/azimuth/Doppler bin, i.e. a grid cell in the radar cube, based on its range,
azimuth and (relative) velocity (r, α, vr).
• Afterwards, a 3D block of the radar cube is cropped around each radar target’s
grid cell with radius in range/azimuth/Doppler dimensions (L, W, H).
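A minimal sketch of this cropping step, assuming the radar cube is stored as a 3D array indexed by (range bin, azimuth bin, Doppler bin); the default radii here are placeholders, not the paper's (L, W, H) values.

```python
import numpy as np

def crop_radar_cube_block(radar_cube, target_rav_bins, radii=(1, 1, 2)):
    """Sketch of the RTCnet pre-processing crop (radii are placeholders).

    radar_cube      : 3D array indexed as [range_bin, azimuth_bin, doppler_bin]
    target_rav_bins : (range_bin, azimuth_bin, doppler_bin) of one radar target
    radii           : half-size of the cropped block in each dimension
    """
    # zero-pad so crops near the cube boundary keep a constant shape
    padded = np.pad(radar_cube, [(r, r) for r in radii],
                    mode="constant", constant_values=0.0)
    r, a, d = (idx + rad for idx, rad in zip(target_rav_bins, radii))
    return padded[r - radii[0]: r + radii[0] + 1,
                  a - radii[1]: a + radii[1] + 1,
                  d - radii[2]: d + radii[2] + 1]
```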
CNN based Road User Detection using the 3D Radar Cube
1) Down-sample range and azimuth dimensions: to encode the radar target’s spatial neighborhood’s Doppler
distribution into a tensor without extension in range or azimuth.
2) Process Doppler dimension: to extract class information from the speed distribution around the target.
3) Score calculation: use two fully connected layers with 128 nodes each to provide scores. The output layer
has either four nodes (one for each class) for multi-class classification or two for binary tasks.
CNN based Road User Detection using the 3D Radar Cube
• With 4 output nodes, it is possible to train the 3rd module to perform multi-class classification
directly.
• It implemented an ensemble voting system of binary classifiers (networks with two output nodes).
• Aside from training a single multi-class network, it trained One-vs-All (OvA) and One-vs-One (OvO)
binary classifiers for each class (e.g. car-vs-all) and pair of classes (e.g. car-vs-cyclist), 10 in total.
• The final prediction scores depend on the voting of all the binary models.
• OvO scores are weighted by the summation of the corresponding OvA scores to achieve a more
balanced result.
• To obtain proposals for object detection, the classified radar targets are clustered with DBSCAN,
incorporating the predicted class information, i.e. radar targets with bike/pedestrian/car predicted
labels are clustered in separate steps (sketched below).
• The advantage of clustering each class separately is that no universal parameter set is needed for
DBSCAN.
• Furthermore, swapping the clustering and classification step makes it possible to consider objects
with a single reflection.
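A minimal sketch of the class-wise clustering step using scikit-learn's DBSCAN; the per-class eps values and the (x, y, velocity) feature choice are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_class(targets_xy_v, class_labels, eps_per_class=None):
    """Sketch of the class-wise DBSCAN object-proposal step.

    targets_xy_v : N x 3 array of (x, y, compensated radial velocity)
    class_labels : N predicted labels in {"pedestrian", "cyclist", "car"}
    eps_per_class: per-class DBSCAN radius (placeholder values below)
    """
    eps_per_class = eps_per_class or {"pedestrian": 0.5, "cyclist": 1.0, "car": 2.0}
    proposals = []
    for cls, eps in eps_per_class.items():
        mask = class_labels == cls
        if not np.any(mask):
            continue
        pts = targets_xy_v[mask]
        # min_samples=1 keeps objects that yield only a single reflection
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(pts)
        for cluster_id in np.unique(labels):
            proposals.append((cls, pts[labels == cluster_id]))
    return proposals
```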
CNN based Road User Detection using the 3D Radar Cube
Examples of correctly classified radar targets by RTCnet, projected to image plane. Radar targets with
pedestrian/cyclist/car labels are marked by green/red/blue. Static objects and the class other are not shown.
Examples of radar targets misclassified by
RTCnet, caused by: flat surfaces acting as
mirrors and creating ghost targets (a),
unusual vehicles (b), partial
misclassification of an object's reflections
(c), and strong reflections nearby (d).
Distant Vehicle Detection Using Radar and Vision
• For autonomous vehicles to be able to operate successfully they need to be
aware of other vehicles with sufficient time to make safe, stable plans.
• Given the possible closing speeds between two vehicles, this necessitates the
ability to accurately detect distant vehicles.
• Many current image-based object detectors using convolutional neural networks
exhibit excellent performance on existing datasets such as KITTI.
• However, the performance of these networks falls when detecting small (distant)
objects.
• Here incorporating radar data can boost performance in these difficult situations.
• It also introduces an efficient automated method for training data generation
using cameras of different focal lengths.
Distant Vehicle Detection Using Radar and Vision
By using radar, detect vehicles even if
they are very small (top) or hard to
see (bottom). The inset images show
the difficult parts of the main scenes
and are taken from a synchronized
long focal length camera as part of the
training data generation. Detections
are shown in red, ground truth in blue.
Distant Vehicle Detection Using Radar and Vision
• To create the dataset, data is gathered using two cameras configured as a stereo pair and a third,
with a long focal length lens, positioned next to the left stereo camera.
• All three cameras are synchronized and collect 1280x960 RGB images at 30Hz.
• In addition, collect radar data using a Delphi ESR 2.5 pulse Doppler cruise control radar with a
scan frequency of 20Hz.
• The radar is dual-beam, operating a wide-angle medium-range beam (> 90 deg, > 50 m) and a long-
range forward-facing narrow beam (> 20 deg, > 100 m). Labels are generated in an automated manner.
Distant Vehicle Detection Using Radar and Vision
• To produce more accurate labels of distant vehicles, make use of two cameras of different focal
lengths. The first camera CA has a wide angle lens (short focal length) and is the camera in which
objects are to be detected when the system is deployed live on a vehicle. The second camera CB
has a much longer focal length and is mounted as close as physically possible to the first such that
their optical axes are approximately aligned.
• Object detections in CB can be transferred to CA without needing to know the object’s range by
exploiting the cameras’ close mounting.
• The radar internally performs target identification from the radar scans and outputs a set of
identified targets (access to the raw data is not available).
• Each target comprises measurements of range, bearing, range rate (radial velocity) and amplitude.
• Each radar scan contains a maximum of 64 targets from each of the two beams.
• To handle the varying number of targets, project the radar targets into camera CA giving two extra
image channels — range and range-rate.
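A minimal sketch of this projection into two extra image channels, assuming pinhole intrinsics K, a 4x4 radar-to-camera transform, and a planar (zero-height) radar scan; the disc drawing stands in for the small circles mentioned later in the text.

```python
import numpy as np

def draw_disc(img, u, v, radius, value):
    """Fill a small disc; its radius stands in for the radar's bearing/height uncertainty."""
    h, w = img.shape
    v0, v1 = max(0, v - radius), min(h, v + radius + 1)
    u0, u1 = max(0, u - radius), min(w, u + radius + 1)
    yy, xx = np.ogrid[v0:v1, u0:u1]
    img[v0:v1, u0:u1][(yy - v) ** 2 + (xx - u) ** 2 <= radius ** 2] = value

def radar_to_image_channels(targets, K, T_cam_radar, image_shape, radius_px=4):
    """Sketch of turning the radar target list into two extra image channels
    (range and range-rate). Assumptions: pinhole intrinsics K, a 4x4
    radar-to-camera transform, and targets given as rows of
    (range_m, bearing_rad, range_rate_mps) from a planar scan."""
    h, w = image_shape
    range_img = np.zeros((h, w), dtype=np.float32)
    rate_img = np.zeros((h, w), dtype=np.float32)
    for rng, bearing, rate in targets:
        # radar target in radar coordinates (planar scan, zero height)
        p_radar = np.array([rng * np.cos(bearing), rng * np.sin(bearing), 0.0, 1.0])
        p_cam = (T_cam_radar @ p_radar)[:3]
        if p_cam[2] <= 0:                       # behind the camera
            continue
        uv = K @ (p_cam / p_cam[2])
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < w and 0 <= v < h:
            draw_disc(range_img, u, v, radius_px, rng)
            draw_disc(rate_img, u, v, radius_px, rate)
    return np.stack([range_img, rate_img], axis=-1)
```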
Distant Vehicle Detection Using Radar and Vision
Example of bounding box transfer between two cameras of different focal lengths for training data generation. Left
shows the original bounding boxes found from the short focal length camera (vehicles are red, pedestrians blue).
Middle shows the original bounding boxes found from the long focal length camera. Right shows the combined set of
bounding boxes. The outline of the overlapping region is shown in green.
Note: To generate labels, use an implementation of the YOLO object detector trained on the KITTI dataset.
Distant Vehicle Detection Using Radar and Vision
• They mark each target position in the image as a small circle rather than a single pixel as this both
increases the influence of each point in the training process and reflects to some extent the
uncertainty of the radar measurements in both bearing and height.
• To simplify the learning process, before performing the projection they subtract the ego-motion
of the platform from the range rate measurement of each target.
• To calculate the ego-motion, use a conventional stereo visual odometry system. As the radar is
not synchronized with the cameras, take the closest ego-motion estimate to each radar scan.
• The radar is sparse and can be inconsistent: there is no guarantee that a moving vehicle will be
detected as a target.
• It is also noisy — occasional high range-rate targets will briefly appear without any apparent
relation to the environment.
• Nevertheless, there is sufficient information that the radar can provide a useful guide to vehicle location.
Distant Vehicle Detection Using Radar and Vision
Examples of automatically generated training data. Top shows the image with bounding
boxes from the object detections from the combined cameras. Middle shows the range
image generated from the radar scan and bottom shows the range-rate image.
Distant Vehicle Detection Using Radar and Vision
• It builds upon the SSD object detection framework, chosen as it represents a proven baseline for
single-stage detection networks.
• It constructs the network from ResNet blocks using the 18-layer variant.
• Using blocks from the larger ResNet variants added model complexity without increasing
performance, possibly due to the limited number of classes and training examples (relative to
ImageNet) meaning that larger models merely added redundant parameters.
• The radar data is included in two ways: firstly, by adding an additional branch for the radar input
and concatenating the features after the second image ResNet block; secondly, by adding the
same additional branch but without the max-pool and using element-wise addition to fuse the
features after the first image ResNet block.
• Compared with a combined five-channel input image, the branch configuration proved best, allowing
the development of separate radar and RGB features (a minimal fusion sketch follows below).
• Using a branch structure also offers the potential flexibility of re-using weights from the RGB
branch with different radar representations.
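A minimal PyTorch sketch of the concatenation-fusion variant described above; the layer widths, strides and exact fusion point are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class RadarFusionStem(nn.Module):
    """Sketch of a concatenation-fusion front end with a separate radar branch
    (layer sizes are assumptions, not the paper's exact configuration)."""
    def __init__(self, rgb_channels=3, radar_channels=2):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(rgb_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.radar_branch = nn.Sequential(
            nn.Conv2d(radar_channels, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # remaining ResNet-18/SSD blocks would consume the fused tensor
        self.post_fusion = nn.Conv2d(64 + 16, 128, 3, padding=1)

    def forward(self, rgb, radar):
        fused = torch.cat([self.rgb_branch(rgb), self.radar_branch(radar)], dim=1)
        return self.post_fusion(fused)

# usage: both inputs share the same spatial resolution before downsampling
rgb = torch.randn(1, 3, 256, 512)
radar = torch.randn(1, 2, 256, 512)   # range and range-rate channels
features = RadarFusionStem()(rgb, radar)
```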
Distant Vehicle Detection Using Radar and Vision
The network configuration for the
concatenation fusion, showing filter
sizes, strides, output channels and
image size for each level. For
networks using only RGB images, the
right-hand radar branch is removed.
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
• The fusion of multimodal sensor streams, such as camera, lidar, and radar, plays a critical role in
object detection for autonomous vehicles, which base their decision making on these inputs.
• While existing methods exploit redundant information under good conditions, they fail to do this
in adverse weather where the sensory streams can be asymmetrically distorted.
• These rare “edge-case” scenarios are not represented in available datasets, and existing fusion
architectures are not designed to handle them.
• This paper presents a multimodal dataset acquired over 10,000 km of driving in northern Europe.
• Though it is the first large multimodal dataset in adverse weather, with 100k labels for lidar,
camera, radar and gated NIR sensors, it does not facilitate training as extreme weather is rare.
• To this end, they present a deep fusion network for robust fusion without a large corpus of
labeled training data covering all asymmetric distortions.
• Departing from proposal-level fusion, it proposes a single-shot model that adaptively fuses
features, driven by measurement entropy.
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
Existing object detection methods, including
efficient Single-Shot detectors (SSD) , are trained on
automotive datasets that are biased towards good
weather conditions. While these methods work well
in good conditions, they fail in rare weather events
(top). Lidar-only detectors, such as the same SSD
model trained on projected lidar depth, might be
distorted due to severe backscatter in fog or snow
(center). These asymmetric distortions are a
challenge for fusion methods that rely on
redundant information. The proposed method
(bottom) learns to tackle unseen (potentially
asymmetric) distortions in multimodal data without
seeing training data of these rare scenarios.
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
Multimodal sensor response of RGB camera, scanning lidar, gated camera and radar in a fog
chamber with dense fog. Reference recordings under clear conditions are shown in the first row,
recordings in fog with visibility of 23 m are shown in the second row.
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
• Data Representation:
• The camera branch uses 3-plane RGB inputs, while the lidar and radar branches depart
from recent bird's eye view (BeV) projection schemes and raw point-cloud representations.
• Instead of using a depth-only input encoding, they provide depth, height and pulse intensity
as input to the lidar network.
• For the radar network, the radar is assumed to scan in a 2D plane orthogonal to the image
plane and parallel to the horizontal image dimension; the radar is therefore considered
invariant along the vertical image axis.
• To aid the multimodal fusion by matching the input projection, the scan is replicated along the
vertical image axis (see the encoding sketch below).
• Gated images are transformed to the image plane using a homography mapping.
• The input encoding allows for position and intensity-dependent fusion with pixelwise
correspondences in-between different streams.
• It encodes missing measurement samples with zero intensity.
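A minimal sketch of this radar input encoding; it assumes the scan has already been projected so that it yields one vector of radar channels per image column (the projection itself is not shown).

```python
import numpy as np

def radar_scan_to_image_plane(radar_columns, image_height):
    """Sketch of the radar input encoding described above: the planar radar
    scan gives one value per image column, is treated as invariant along the
    vertical image axis and is replicated over all rows; columns without a
    measurement keep zero intensity.

    radar_columns : (W, C) array, one row per image column (C radar channels),
                    already aligned to the camera's horizontal axis.
    """
    W, C = radar_columns.shape
    return np.broadcast_to(radar_columns[None, :, :], (image_height, W, C)).copy()
```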
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
• Feature Extraction:
• As feature extraction stack in each stream, use a modified VGG backbone.
• It reduces the number of channels by half and cuts the network at the conv4 layer.
• It uses 6 feature layers from conv4-10 as input to SSD detection layers.
• The feature maps decrease in size to a feature pyramid for detections at different scales.
• The activations of different feature extraction stacks are exchanged.
• To steer fusion towards the most reliable info, it provides the sensor entropy to each feature
exchange block.
• First, convolve the entropy, apply a sigmoid, multiply with the concatenated input features
from all sensors and finally concatenate the input entropy. The folding of entropy and
application of the sigmoid generates a multiplication matrix in the interval [0,1].
• This scales the concatenated features for each sensor individually based on the available info.
• Regions with low entropy can be attenuated, while entropy-rich regions are amplified.
• Doing so allows the network to adaptively fuse features in the feature extraction stack itself (a
minimal sketch of one exchange block follows below).
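A minimal PyTorch sketch of one entropy-steered feature exchange block, following the description above; the kernel size and the single-channel gate are assumptions.

```python
import torch
import torch.nn as nn

class EntropySteeredExchange(nn.Module):
    """Sketch of one entropy-steered feature exchange block as described
    above (channel and kernel sizes are assumptions)."""
    def __init__(self, entropy_channels):
        super().__init__()
        # fold the per-sensor entropy maps into a single gating map
        self.entropy_conv = nn.Conv2d(entropy_channels, 1, kernel_size=3, padding=1)

    def forward(self, sensor_features, sensor_entropies):
        features = torch.cat(sensor_features, dim=1)       # concatenate all streams
        entropy = torch.cat(sensor_entropies, dim=1)
        gate = torch.sigmoid(self.entropy_conv(entropy))   # values in [0, 1]
        gated = features * gate                            # scale by available info
        return torch.cat([gated, entropy], dim=1)          # re-attach the entropy

# usage: one entropy map per sensor stream, same spatial size as the features
feats = [torch.randn(1, 64, 32, 64), torch.randn(1, 64, 32, 64)]
entropies = [torch.randn(1, 1, 32, 64), torch.randn(1, 1, 32, 64)]
out = EntropySteeredExchange(entropy_channels=2)(feats, entropies)
```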
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
• Entropy-steered Fusion:
• To steer deep fusion towards redundant and reliable info, it introduces an entropy channel in
each sensor stream, instead of directly inferring the adverse weather type and strength.
• The steering process is learned purely on clean weather data, which contains different
illumination settings present in day to night-time conditions.
• No real adverse weather patterns are presented during training. Further, it drops sensor
streams randomly with probability 0.5 and sets their entropy to a constant zero value.
• Loss Functions
• The number of anchor boxes in different feature layers and their sizes play an important role
during training; the chosen configuration is given in the supplemental material.
• In total, each anchor box with class label yi and probability pi is trained using the cross-
entropy loss with softmax:
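The slide's formula image is not reproduced here; the standard softmax cross-entropy it refers to has the form

$$ \mathcal{L}_{cls} = -\sum_{i} y_i \log p_i, \qquad p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$

where y_i is the one-hot class label and z the class logits of the anchor box (any additional weighting used in the paper is not shown here).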
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
Normalized entropy
with respect to the
clear reference
recording for a gated
camera, RGB camera,
radar and lidar in
varying fog visibilities
(left) and changing
illumination (right).
Seeing Through Fog Without Seeing Fog: Deep
Multimodal Sensor Fusion in Unseen Adverse Weather
Quantitative detection AP on real unseen weather-affected data from the dataset, split
across weather conditions and difficulty levels (easy/moderate/hard).
A Deep Learning Approach for Automotive
Radar Interference Mitigation
• Recent popular radar technologies include Frequency Modulated Continuous Wave (FMCW) or
Chirp Sequence (CS) radars.
• Using the transmitted signal and the signal reflected by a target, the radar can capture the target's range and velocity.
• However, when interference signals exist, the noise floor increases, which severely affects the
detectability of target objects.
• Conventional signal processing methods for canceling the interference or reconstructing the
transmit signal are difficult to apply and have many restrictions.
• In this work, they propose an approach to mitigate interference using deep learning.
• Specifically, they apply an RNN model with GRUs, suitable for processing sequence data, to remove
interference and reconstruct the transmit signal simultaneously.
• It reconstructs the transmit signal even in the presence of various interference signals, and the
reconstructed signal can be used to detect objects through the Fast Fourier Transform (FFT).
• In particular, through the learned network, the signal processing can be done with matrix
calculations only, without any iterative structure. Also, it does not require an adaptive threshold.
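The following is a minimal PyTorch sketch of a GRU-based sequence model in the spirit of this approach; the I/Q input representation, layer sizes and the final FFT step are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GRUInterferenceMitigator(nn.Module):
    """Sketch of a GRU-based sequence model for interference mitigation
    (layer sizes are assumptions). It maps a noisy beat-signal sample
    sequence to a cleaned, reconstructed one."""
    def __init__(self, in_features=2, hidden=64, layers=2):
        super().__init__()
        self.gru = nn.GRU(in_features, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, in_features)   # reconstruct I/Q per sample

    def forward(self, x):
        # x: (batch, samples, 2) real/imaginary parts of the interfered signal
        h, _ = self.gru(x)
        return self.out(h)                          # reconstructed beat signal

# usage: the reconstructed time-domain signal is passed to the FFT
# (range processing) exactly as in the conventional pipeline
model = GRUInterferenceMitigator()
noisy = torch.randn(8, 1024, 2)        # 8 chirps of 1024 I/Q samples each
clean = model(noisy)
spectrum = torch.fft.fft(torch.complex(clean[..., 0], clean[..., 1]), dim=-1)
```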
A Deep Learning Approach for Automotive
Radar Interference Mitigation
CS waveform of transmit and received signal
A Deep Learning Approach for Automotive
Radar Interference Mitigation
Beat frequency
A Deep Learning Approach for Automotive
Radar Interference Mitigation
Interrupted transmit signal; interference occurs in region (a).
Interrupted beat signal; interference
occurs around samples 0 to 80.
A Deep Learning Approach for Automotive
Radar Interference Mitigation
A Deep Learning Approach for Automotive
Radar Interference Mitigation
Method I is the time-domain thresholding (TDT) method. Method II did not use an adaptive threshold.
Simulated power levels with respect to range. Two targets exist at ranges of 100 m and 120 m. Four interferences exist
at ranges of 40 m, 50 m, 60 m, and 70 m. Red circles are detected targets.
Deep Radar Detector
• While camera and LiDAR processing have been revolutionized since the introduction of deep
learning, radar processing still relies on classical tools.
• The radar-generated point clouds differ significantly from the LiDAR point clouds in two aspects:
• A) Viewpoint/pose variation – a point cloud of an object differs for even very similar object poses and
close viewpoint angles.
• B) Temporal variation – even with no pose variation, a point cloud of the same object varies over time.
• This paper introduces a deep learning (CNN-based) approach for radar processing, working
directly with the radar complex data.
• A significant challenge of applying DL to radar data is the lack of labeled data. Here, they rely
for training only on the radar calibration data and introduce new radar augmentation techniques.
• Applying deep learning on radar data has several advantages, such as eliminating the need for an
expensive radar calibration process each time and enabling classification of the detected objects
with almost zero-overhead.
Deep Radar Detector
Conventional radar signal processing flow
The sampled radar echoes are first transferred to the range-Doppler (RD) domain via the 2D fast Fourier transform
(FFT). Next, the radar signals in the RD map whose energy exceeds the detection threshold are declared as
detections. In the following beamforming processing block, the direction in azimuth and elevation to these
detections is estimated. Finally, detections are clustered, tracked and classified.
Deep Radar Detector
• This work proposes to use the radar calibration data, which contains the radar sensor array
responses to a known target located at a variety of angles.
• Typically, the radar is calibrated in an anechoic chamber with a point target (corner reflector).
• The radar is mounted to an accurate rotator to collect array responses at a variety of angles.
Deep Radar Detector
• The solution relies on a two-step detector as in the Faster R-CNN (Region CNN) model, whose
detection is performed using the following two steps:
• 1) Region Proposal Network (RPN) - proposes regions where it is likely to find objects. Each is
provided with its possible coarse location (using anchors).
• 2) Classifier - classifies the proposed objects and fine-tunes their locations (via regression).
• The detection task in the Range-Doppler map is formulated as a segmentation, in which each cell
(“pixel”) in the RD-map is labeled by the correct class.
• This model proposes the following two detection steps:
• RD-Net: detects, classifies and localizes all detections in the range-Doppler domain.
• Ang-Net: finds the azimuth and the elevation angles of each detection found by the RD-Net.
• The RD-Net, whose internal architecture adopts the 2D-U-Net, performs the segmentation task.
• The detections' classes and locations in the range-Doppler map, together with a global feature vector, are
then passed to the Ang-Net, which obtains the azimuth and elevation of each detection in the range-
Doppler map.
Deep Radar Detector
Raw Radar Frame Input (left); Network Input Radar Frame Output (right)
The radar targets at any range and Doppler can be augmented simply by shifting the phase of the raw radar
frame elements. This is easily done by multiplying the 2D-FFT window coefficients with a complex exponential
(before applying the FFT on the data, it is first passed through a window (e.g. Hamming) to reduce side-lobes).
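A minimal NumPy sketch of this augmentation based on the DFT shift theorem; the axis conventions and shift signs are assumptions.

```python
import numpy as np

def shift_target_in_rd(raw_frame, range_bin_shift, doppler_bin_shift):
    """Sketch of the phase-shift augmentation described above (DFT shift
    theorem): multiplying the raw fast-time/slow-time samples by complex
    exponentials moves every target by a chosen number of range and Doppler
    bins after the 2D FFT. Exact axis conventions are assumptions.

    raw_frame : (n_samples, n_chirps) complex array of raw radar echoes
    """
    n_samples, n_chirps = raw_frame.shape
    n = np.arange(n_samples)[:, None]    # fast-time index -> range axis
    m = np.arange(n_chirps)[None, :]     # slow-time index -> Doppler axis
    phase = np.exp(2j * np.pi * (range_bin_shift * n / n_samples +
                                 doppler_bin_shift * m / n_chirps))
    return raw_frame * phase

# usage: window (e.g. Hamming) and 2D FFT are applied after the augmentation,
# exactly as in the normal processing chain
raw = np.random.randn(256, 64) + 1j * np.random.randn(256, 64)
shifted = shift_target_in_rd(raw, range_bin_shift=10, doppler_bin_shift=-4)
window = np.outer(np.hamming(256), np.hamming(64))
rd_map = np.fft.fftshift(np.fft.fft2(shifted * window))
```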
Deep Radar Detector
DRD network flow. Radar frame is first passed to
the RD-Net for RD-domain detection (range &
Doppler) and global feature extraction. The
detections (location & class) are then passed to the
Ang-Net, which pools for each detection a 3x3
center crop from the radar frame. This crop is combined with the
global feature vector and class (extracted by the
RD-Net) to find the angle (azimuth & elevation) of
each detection.
The class-balanced cross-entropy loss is used in
the RD-Net. The two classification heads of the
Ang-Net use the regular cross-entropy loss. The
loss function used here is defined as:
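The slide's formula image is not reproduced here; a sketch of the loss structure it describes (class-balanced cross-entropy for the RD-Net plus two regular cross-entropy terms for the Ang-Net heads) is

$$ \mathcal{L} = \mathcal{L}_{\text{CB-CE}}^{\text{RD}} + \mathcal{L}_{\text{CE}}^{\text{azimuth}} + \mathcal{L}_{\text{CE}}^{\text{elevation}}, \qquad \mathcal{L}_{\text{CB-CE}}^{\text{RD}} = -\sum_{c} w_c \, y_c \log p_c $$

where w_c weights each class inversely to its frequency; the exact weights and any balancing factors follow the paper.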
Deep Radar Detector
DRD-Network Architecture. In the RD-Net a U-Net shaped network is used to detect all targets in the
RD domain. In the Ang-Net, for each detection, the network takes a 3x3xCh crop and filters it with a
3x3x256 Conv, resulting in a 1x1x256 vector. The vector is concatenated with the 1x1x512 global
feature vector extracted from the RD-Net and also with the class one-hot vector k. The concatenated
vector is then passed through 3 fc layers (fc1-3) and the output is split into 2 separate classification heads,
one for azimuth detection and the second for elevation detection.
Deep Radar Detector
Accuracy vs. SNR: range-Doppler accuracy,
azimuth accuracy, elevation accuracy.
Deep Learning’s Application in Radar Signal Data II

More Related Content

What's hot

light-detection-and-ranging(lidar)
 light-detection-and-ranging(lidar) light-detection-and-ranging(lidar)
light-detection-and-ranging(lidar)sandeep reddy
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainYu Huang
 
Applications of lidar technology
Applications of lidar technologyApplications of lidar technology
Applications of lidar technologySourabh Jain
 
Introduction to LiDAR
Introduction to LiDARIntroduction to LiDAR
Introduction to LiDARBob Champoux
 
IMED 2018: An intro to Remote Sensing and Machine Learning
IMED 2018: An intro to Remote Sensing and Machine LearningIMED 2018: An intro to Remote Sensing and Machine Learning
IMED 2018: An intro to Remote Sensing and Machine LearningLouisa Diggs
 
Itroduction to lidar ground, ballon&air born lidar
Itroduction to lidar ground, ballon&air born lidarItroduction to lidar ground, ballon&air born lidar
Itroduction to lidar ground, ballon&air born lidaranuarag1992
 
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GIS
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GISNDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GIS
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GISNorth Dakota GIS Hub
 
Rankin LiDAR presentation
Rankin LiDAR presentationRankin LiDAR presentation
Rankin LiDAR presentationJustin Farrow
 
The Global Positioning System (GPS)
The Global Positioning System (GPS) The Global Positioning System (GPS)
The Global Positioning System (GPS) Lotfy Helal
 
Global positioning system(GPS)
Global positioning system(GPS)Global positioning system(GPS)
Global positioning system(GPS)shifa
 
Lidar- light detection and ranging
Lidar- light detection and rangingLidar- light detection and ranging
Lidar- light detection and rangingKarthick Subramaniam
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIYu Huang
 
Global Positioning System
Global Positioning SystemGlobal Positioning System
Global Positioning SystemRishi Shukla
 

What's hot (20)

light-detection-and-ranging(lidar)
 light-detection-and-ranging(lidar) light-detection-and-ranging(lidar)
light-detection-and-ranging(lidar)
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Applications of lidar technology
Applications of lidar technologyApplications of lidar technology
Applications of lidar technology
 
LiDAR
LiDARLiDAR
LiDAR
 
Lidar
LidarLidar
Lidar
 
Introduction to LiDAR
Introduction to LiDARIntroduction to LiDAR
Introduction to LiDAR
 
IMED 2018: An intro to Remote Sensing and Machine Learning
IMED 2018: An intro to Remote Sensing and Machine LearningIMED 2018: An intro to Remote Sensing and Machine Learning
IMED 2018: An intro to Remote Sensing and Machine Learning
 
Itroduction to lidar ground, ballon&air born lidar
Itroduction to lidar ground, ballon&air born lidarItroduction to lidar ground, ballon&air born lidar
Itroduction to lidar ground, ballon&air born lidar
 
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GIS
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GISNDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GIS
NDGeospatialSummit2019 - Drone Based Lidar and the Future of Survey/GIS
 
How GPS Works ?
How GPS Works ? How GPS Works ?
How GPS Works ?
 
Rankin LiDAR presentation
Rankin LiDAR presentationRankin LiDAR presentation
Rankin LiDAR presentation
 
Radar
RadarRadar
Radar
 
The Global Positioning System (GPS)
The Global Positioning System (GPS) The Global Positioning System (GPS)
The Global Positioning System (GPS)
 
Global positioning system(GPS)
Global positioning system(GPS)Global positioning system(GPS)
Global positioning system(GPS)
 
Lidar- light detection and ranging
Lidar- light detection and rangingLidar- light detection and ranging
Lidar- light detection and ranging
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
Global Positioning System
Global Positioning SystemGlobal Positioning System
Global Positioning System
 
Satellite Navigation
Satellite Navigation Satellite Navigation
Satellite Navigation
 
Lidar and sensing
Lidar and sensingLidar and sensing
Lidar and sensing
 
Lidar
LidarLidar
Lidar
 

Similar to Deep Learning’s Application in Radar Signal Data II

fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving IYu Huang
 
3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous drivingYu Huang
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Yu Huang
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving IIYu Huang
 
Synthetic aperture radar
Synthetic aperture radarSynthetic aperture radar
Synthetic aperture radarMahesh pawar
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)Yu Huang
 
Udacity-Didi Challenge Finalists
Udacity-Didi Challenge FinalistsUdacity-Didi Challenge Finalists
Udacity-Didi Challenge FinalistsDavid Silver
 
Laser ScanningLaser scanning is an emerging data acquisition techn.pdf
Laser ScanningLaser scanning is an emerging data acquisition techn.pdfLaser ScanningLaser scanning is an emerging data acquisition techn.pdf
Laser ScanningLaser scanning is an emerging data acquisition techn.pdfanjaniar7gallery
 
Producing Geographic Data with LIDAR
Producing Geographic Data with LIDARProducing Geographic Data with LIDAR
Producing Geographic Data with LIDARKodi Volkmann
 
Large scale 3 d point cloud compression using adaptive radial distance predic...
Large scale 3 d point cloud compression using adaptive radial distance predic...Large scale 3 d point cloud compression using adaptive radial distance predic...
Large scale 3 d point cloud compression using adaptive radial distance predic...ieeepondy
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IVYu Huang
 
RadarBeamX3dVisualizationPresentation.pptx
RadarBeamX3dVisualizationPresentation.pptxRadarBeamX3dVisualizationPresentation.pptx
RadarBeamX3dVisualizationPresentation.pptxAliAbbadi3
 
Implementation of Adaptive Digital Beamforming using Cordic
Implementation of Adaptive Digital Beamforming using CordicImplementation of Adaptive Digital Beamforming using Cordic
Implementation of Adaptive Digital Beamforming using CordicEditor IJCATR
 
Lidar final ppt
Lidar final pptLidar final ppt
Lidar final pptrsarnagat
 
LiDAR technology
LiDAR technology LiDAR technology
LiDAR technology shlokdoshi
 
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...Aritra Sarkar
 
side-looking airborne radar
side-looking airborne radarside-looking airborne radar
side-looking airborne radarSneha Nalla
 
Remote Sensing Data Acquisition,Scanning/Imaging systems
Remote Sensing Data Acquisition,Scanning/Imaging systemsRemote Sensing Data Acquisition,Scanning/Imaging systems
Remote Sensing Data Acquisition,Scanning/Imaging systemsdaniyal rustam
 

Similar to Deep Learning’s Application in Radar Signal Data II (20)

fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
 
Synthetic aperture radar
Synthetic aperture radarSynthetic aperture radar
Synthetic aperture radar
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
Udacity-Didi Challenge Finalists
Udacity-Didi Challenge FinalistsUdacity-Didi Challenge Finalists
Udacity-Didi Challenge Finalists
 
Laser ScanningLaser scanning is an emerging data acquisition techn.pdf
Laser ScanningLaser scanning is an emerging data acquisition techn.pdfLaser ScanningLaser scanning is an emerging data acquisition techn.pdf
Laser ScanningLaser scanning is an emerging data acquisition techn.pdf
 
Producing Geographic Data with LIDAR
Producing Geographic Data with LIDARProducing Geographic Data with LIDAR
Producing Geographic Data with LIDAR
 
Large scale 3 d point cloud compression using adaptive radial distance predic...
Large scale 3 d point cloud compression using adaptive radial distance predic...Large scale 3 d point cloud compression using adaptive radial distance predic...
Large scale 3 d point cloud compression using adaptive radial distance predic...
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 
RadarBeamX3dVisualizationPresentation.pptx
RadarBeamX3dVisualizationPresentation.pptxRadarBeamX3dVisualizationPresentation.pptx
RadarBeamX3dVisualizationPresentation.pptx
 
Implementation of Adaptive Digital Beamforming using Cordic
Implementation of Adaptive Digital Beamforming using CordicImplementation of Adaptive Digital Beamforming using Cordic
Implementation of Adaptive Digital Beamforming using Cordic
 
Lidar final ppt
Lidar final pptLidar final ppt
Lidar final ppt
 
LiDAR technology
LiDAR technology LiDAR technology
LiDAR technology
 
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...
Elevation mapping using stereo vision enabled heterogeneous multi-agent robot...
 
side-looking airborne radar
side-looking airborne radarside-looking airborne radar
side-looking airborne radar
 
Remote Sensing Data Acquisition,Scanning/Imaging systems
Remote Sensing Data Acquisition,Scanning/Imaging systemsRemote Sensing Data Acquisition,Scanning/Imaging systems
Remote Sensing Data Acquisition,Scanning/Imaging systems
 

More from Yu Huang

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingYu Huang
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...Yu Huang
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingYu Huang
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingYu Huang
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationYu Huang
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and PredictionYu Huang
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIYu Huang
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VYu Huang
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVYu Huang
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduYu Huang
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the HoodYu Huang
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)Yu Huang
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingYu Huang
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?Yu Huang
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingYu Huang
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgYu Huang
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learningYu Huang
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymoYu Huang
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningYu Huang
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingYu Huang
 

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

Recently uploaded

Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Deep Learning’s Application in Radar Signal Data II

  • 1. Deep Learning’s Application in Radar Signal Data II Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection • CNN based Road User Detection using the 3D Radar Cube • Distant Vehicle Detection Using Radar and Vision • Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather • A Deep Learning Approach for Automotive Radar Interference Mitigation • Deep Radar Detector
  • 3. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection • The sensor quality of the camera is limited in severe weather conditions and by increased sensor noise in sparsely lit areas and at night. • Compared to camera sensors, radar sensors are more robust to environmental conditions such as lighting changes, rain and fog. • This approach enhances current 2D object detection networks by fusing camera data and projected sparse radar data in the network layers. • The radar sensor outputs a sparse 2D point cloud with associated radar characteristics. • The data used includes the azimuth angle, the distance and the radar cross section (RCS). • The proposed Camera Radar Fusion Net (CRF-Net) automatically learns at which level the fusion of the sensor data is most beneficial for the detection result. • Additionally, it introduces BlackIn, a training strategy inspired by Dropout, which focuses the learning on a specific sensor type. • The code is available at: https://github.com/TUMFTM/CameraRadarFusionNet
  • 4. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection • It transforms the radar data from the 2D ground plane to a perpendicular image plane. • The characteristics of the radar return are stored as pixel values in the augmented image. • At the location of image pixels where no radar returns are present, the projected radar channel values are set to the value 0. • The input camera image consists of three channels (red, green, blue); the aforementioned radar channels are then added as further input for the neural network. • The fields of view (FOV) of three radars overlap with the FOV of the front-facing fish-eye camera. • The point clouds of the three sensors are concatenated into one and used as the projected radar input source. • The radar detections give no information about the height at which they were received, which makes it harder to fuse the two data types. • The 3D coordinates of the radar detections are assumed to be returned from the ground plane that the vehicle is driving on.
  • 5. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection • The projections are then extended in the direction perpendicular to this plane, so as to account for the vertical extension of the objects to be detected. (It detects traffic objects which can be classified as cars, trucks, motorcycles, bicycles and pedestrians.) • To cover the height of such object types, a height extension of 3 m is assumed for the radar detections when associating camera pixels with radar data. • The radar data is mapped into the image plane with a pixel width of one. • The density of the radar data is increased by jointly fusing the last 13 radar cycles (around 1 s), with ego-motion compensated for this projection. • The radar channels (distance and RCS) are mapped to the same locations. • The radar returns many detections coming from objects which are not relevant for the driving task, such as ghost objects, irrelevant objects and ground detections. • These detections are called clutter or noise for the task at hand.
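The projection step described on slides 4 and 5 can be pictured with a minimal sketch (not the authors' code): each detection, assumed to lie on the ground plane, is extended 3 m upward and written as a one-pixel-wide column into two extra image channels holding distance and RCS. The camera matrix, coordinate conventions and detection array here are placeholders.

```python
# Minimal sketch of projecting ground-plane radar detections into two extra image channels.
# Assumptions: P is a 3x4 camera projection matrix, detections are given in camera coordinates
# as rows [x, y, z, rcs], and the camera y-axis points downward.
import numpy as np

def project_points(P, pts_3d):
    """Project Nx3 points in camera coordinates to pixel coordinates with a 3x4 matrix P."""
    homo = np.hstack([pts_3d, np.ones((pts_3d.shape[0], 1))])
    uvw = (P @ homo.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def radar_to_image_channels(P, detections, img_h, img_w, height_ext=3.0):
    """detections: Nx4 array [x, y, z, rcs]; returns (2, H, W) channels (distance, RCS)."""
    channels = np.zeros((2, img_h, img_w), dtype=np.float32)
    ground = detections[:, :3]
    top = ground.copy()
    top[:, 1] -= height_ext                       # extend 3 m upward ("up" is -y here)
    uv_g = project_points(P, ground).astype(int)
    uv_t = project_points(P, top).astype(int)
    dist = np.linalg.norm(ground, axis=1)
    for (ug, vg), (ut, vt), d, rcs in zip(uv_g, uv_t, dist, detections[:, 3]):
        if not (0 <= ug < img_w):
            continue
        v0, v1 = sorted((int(np.clip(vt, 0, img_h - 1)), int(np.clip(vg, 0, img_h - 1))))
        channels[0, v0:v1 + 1, ug] = d            # one-pixel-wide vertical line
        channels[1, v0:v1 + 1, ug] = rcs
    return channels
```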
  • 6. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection (a) Without ground-truth noise filter (b) With ground-truth noise filter A ground-truth noise filter is applied to the radar data, which removes all radar detections outside of the 3D ground-truth bounding boxes, to show the general feasibility of the fusion concept with less clutter in the input signal. An annotation filter (AF) is applied, so that the filtered ground-truth data only contains objects which yield at least one radar detection. This is done by associating the 3D bounding boxes with radar points. The fusion approach is expected to show its potential for those objects which are detectable in both modalities.
  • 7. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection • The neural network architecture builds on RetinaNet with a VGG backbone. • The network is extended to deal with the additional radar channels of the augmented image. • The output is a 2D regression of bounding-box coordinates and a classification score for each box. • The network is trained using focal loss, and the baseline uses a VGG feature extractor. • The amount of information in one radar return is different from the information in a single pixel. • The distance of an object to the ego-vehicle, as measured by the radar, can be considered more relevant to the driving task than a simple color value of a camera pixel. • In deeper layers of the neural network, the input data is compressed into a denser representation which ideally contains all the relevant input information. • As it is hard to quantify the abstraction level of the information provided by each of the two sensor types, the network is designed so that it learns by itself at which depth level the fusion of the data is most beneficial to the overall loss minimization.
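One way to realize the idea on slide 7 is to rescale the radar channels to every backbone feature-map resolution and concatenate them there, so the network's weights decide how much each level uses them. The sketch below is only an illustration of that fusion pattern; layer widths and the pooling choice are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of level-wise camera/radar feature fusion (illustrative sizes, not CRF-Net's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Concatenates down-scaled radar channels with an image feature map at one depth level."""
    def __init__(self, in_ch, radar_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + radar_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feat, radar):
        radar_ds = F.adaptive_max_pool2d(radar, feat.shape[-2:])   # match spatial size
        return F.relu(self.conv(torch.cat([feat, radar_ds], dim=1)))
```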
  • 8. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection High-level structure of Camera Radar Fusion Net
  • 9. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection (a) Baseline network detection (b) CRF-Net detection Detection comparison of the baseline network (a) and the CRF-Net (b). The baseline network does not detect the pedestrian on the left.
  • 10. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection TABLE II: mAP scores of the baseline network and CameraRadarFusionNet. Configurations: (AF) annotation filter, (GRF) ground-truth radar filter, (NRM) no radar meta data. nuScenes: Baseline image network 43.47%; CRF-Net w/o BlackIn 43.6%; CRF-Net 43.95%; Baseline image network (AF) 43.03%; CRF-Net (AF) 44.85%; CRF-Net (AF, GRF) 55.99%; CRF-Net (AF, GRF, NRM) 53.23%. TUM (Technical University of Munich): Baseline image network 56.12%; CRF-Net 57.50%.
  • 11. CNN based Road User Detection using the 3D Radar Cube • Radars are attractive sensors for intelligent vehicles as they are relatively robust to weather and lighting conditions (e.g. rain, snow, darkness) compared to camera and LIDAR sensors. • Radars also have excellent range sensitivity and can measure radial object velocities directly using the Doppler effect. • This paper presents a radar based, single frame, multi-class detection method for moving road users (pedestrian, cyclist, car), which utilizes low-level radar cube data. • The method provides class information both on the radar target and object-level. • Radar targets are classified individually after extending the target features with a cropped block of the 3D radar cube around their positions, thereby capturing the motion of moving parts in the local velocity distribution. • A Convolutional Neural Network (CNN), RTCnet (Radar Target Classification Network), is proposed for this classification step. • Afterwards, object proposals are generated with a clustering step, which not only considers the radar targets’ positions and velocities, but their calculated class scores as well.
  • 12. CNN based Road User Detection using the 3D Radar Cube Inputs (radar cube and radar targets, top), main processing blocks (RTCnet and object clustering, bottom left), and outputs (classified radar targets and object proposals, bottom right). Classified radar targets are shown as colored spheres at the sensor’s height. Object proposals are visualized by a convex hull around the clustered targets on the ground plane and at 2 m.
  • 13. CNN based Road User Detection using the 3D Radar Cube • Pre-processing: • A single frame of radar targets and the radar cube (low-level data) is fetched. • Each radar target’s speed is compensated for ego-motion. • Targets with low compensated (absolute) velocity are static and are filtered out. • Then, corresponding target-level and low-level radar data are connected. • That is, each remaining dynamic radar target is looked up in its corresponding range/azimuth/Doppler bin, i.e. a grid cell in the radar cube, based on its range, azimuth and (relative) velocity (r, α, vr). • Afterwards, a 3D block of the radar cube is cropped around each radar target’s grid cell with radii (L, W, H) in the range/azimuth/Doppler dimensions.
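The cropping step on slide 13 amounts to cutting a small padded block out of the 3D radar cube around each target's grid cell. A minimal sketch, with assumed cube layout and example radii, is shown below.

```python
# Sketch of cropping a (2L+1) x (2W+1) x (2H+1) block around a radar target's cube cell.
# Assumption: the cube is stored as (range_bins, azimuth_bins, doppler_bins); radii are examples.
import numpy as np

def crop_block(cube, r_idx, a_idx, d_idx, L=2, W=2, H=2):
    """Return the 3D block around cell (r_idx, a_idx, d_idx), zero-padded at the borders."""
    padded = np.pad(cube, ((L, L), (W, W), (H, H)), mode="constant")
    r, a, d = r_idx + L, a_idx + W, d_idx + H
    return padded[r - L:r + L + 1, a - W:a + W + 1, d - H:d + H + 1]
```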
  • 14. CNN based Road User Detection using the 3D Radar Cube 1) Down-sample range and azimuth dimensions: to encode the radar target’s spatial neighborhood’s Doppler distribution into a tensor without extension in range or azimuth. 2) Process Doppler dimension: to extract class information from the speed distribution around the target. 3) Score calculation: use two fully connected layers with 128 nodes each to provide scores. The output layer has either four nodes (one for each class) for multi-class classification or two for binary tasks.
  • 15. CNN based Road User Detection using the 3D Radar Cube • With 4 output nodes, it is possible to train the 3rd module to perform multi-class classification directly. • It also implements an ensemble voting system of binary classifiers (networks with two output nodes). • Besides training a single multi-class network, One-vs-All (OvA) and One-vs-One (OvO) binary classifiers are trained for each class (e.g. car-vs-all) and each pair of classes (e.g. car-vs-cyclist), 10 in total. • The final prediction scores depend on the voting of all the binary models. • OvO scores are weighted by the summation of the corresponding OvA scores to achieve a more balanced result. • To obtain proposals for object detection, the classified radar targets are clustered with DBSCAN incorporating the predicted class information, i.e. radar targets with bike/pedestrian/car predicted labels are clustered in separate steps. • The advantage of clustering each class separately is that no universal parameter set is needed for DBSCAN. • Furthermore, swapping the clustering and classification steps makes it possible to consider objects with a single reflection.
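The two mechanisms on slide 15 can be sketched as follows; this is an interpretation, not the paper's exact scheme. The OvO vote of a class pair is weighted by the sum of the two corresponding OvA scores, and clustering is run separately per predicted class so each class can use its own DBSCAN parameters (the `min_samples=1` choice below is illustrative).

```python
# Illustrative ensemble voting and class-wise DBSCAN clustering (assumed score conventions).
import numpy as np
from sklearn.cluster import DBSCAN

def ensemble_scores(ova, ovo, classes=("ped", "cyclist", "car", "other")):
    """ova[c]: OvA score of class c; ovo[(c, k)]: OvO probability of c against k."""
    scores = {}
    for c in classes:
        s = 0.0
        for k in classes:
            if k == c:
                continue
            p = ovo[(c, k)] if (c, k) in ovo else 1.0 - ovo[(k, c)]
            s += (ova[c] + ova[k]) * p            # OvO vote weighted by the OvA scores
        scores[c] = s
    return scores

def cluster_per_class(targets_xyv, labels, eps_by_class):
    """Cluster radar targets separately per predicted class label."""
    proposals = {}
    for c, eps in eps_by_class.items():
        pts = targets_xyv[labels == c]
        if len(pts):
            proposals[c] = DBSCAN(eps=eps, min_samples=1).fit_predict(pts)
    return proposals
```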
  • 16. CNN based Road User Detection using the 3D Radar Cube Examples of correctly classified radar targets by RTCnet, projected to the image plane. Radar targets with pedestrian/cyclist/car labels are marked in green/red/blue. Static objects and the class other are not shown. Examples of radar targets misclassified by RTCnet, caused by: flat surfaces acting as mirrors and creating ghost targets (a), unusual vehicles (b), partial misclassification of an object’s reflections (c), and strong reflections nearby (d).
  • 17. Distant Vehicle Detection Using Radar and Vision • For autonomous vehicles to be able to operate successfully they need to be aware of other vehicles with sufficient time to make safe, stable plans. • Given the possible closing speeds between two vehicles, this necessitates the ability to accurately detect distant vehicles. • Many current image-based object detectors using convolutional neural networks exhibit excellent performance on existing datasets such as KITTI. • However, the performance of these networks falls when detecting small (distant) objects. • Here incorporating radar data can boost performance in these difficult situations. • It also introduces an efficient automated method for training data generation using cameras of different focal lengths.
  • 18. Distant Vehicle Detection Using Radar and Vision By using radar, detect vehicles even if they are very small (top) or hard to see (bottom). The inset images show the difficult parts of the main scenes and are taken from a synchronized long focal length camera as part of the training data generation. Detections are shown in red, ground truth in blue.
  • 19. Distant Vehicle Detection Using Radar and Vision • To create the dataset, data is gathered using two cameras configured as a stereo pair and a third, with a long focal length lens, positioned next to the left stereo camera. • All three cameras are synchronized and collect 1280x960 RGB images at 30 Hz. • In addition, radar data is collected using a Delphi ESR 2.5 pulse-Doppler cruise control radar with a scan frequency of 20 Hz. • The radar is dual-beam, operating a wide-angle medium-range beam (> 90 deg, > 50 m) and a long-range forward-facing narrow beam (> 20 deg, > 100 m). • Labels are generated in an automated manner.
  • 20. Distant Vehicle Detection Using Radar and Vision • To produce more accurate labels of distant vehicles, make use of two cameras of different focal lengths. The first camera CA has a wide angle lens (short focal length) and is the camera in which objects are to be detected when the system is deployed live on a vehicle. The second camera CB has a much longer focal length and is mounted as close as physically possible to the first such that their optical axes are approximately aligned. • Object detections in CB can be transferred to CA without needing to know the object’s range by exploiting the cameras’ close mounting. • The radar internally performs target identification from the radar scans and outputs a set of identified targets (access to the raw data is not available). • Each target comprises measurements of range, bearing, range rate (radial velocity) and amplitude. • Each radar scan contains a maximum of 64 targets from each of the two beams. • To handle the varying number of targets, project the radar targets into camera CA giving two extra image channels — range and range-rate.
  • 21. Distant Vehicle Detection Using Radar and Vision Example of bounding box transfer between two cameras of different focal lengths for training data generation. Left shows the original bounding boxes found from the short focal length camera (vehicles are red, pedestrians blue). Middle shows the original bounding boxes found from the long focal length camera. Right shows the combined set of bounding boxes. The outline of the overlapping region is shown in green. Note: To generate labels, use an implementation of the YOLO object detector trained on the KITTI dataset.
  • 22. Distant Vehicle Detection Using Radar and Vision • They mark each target position in the image as a small circle rather than a single pixel, as this both increases the influence of each point in the training process and reflects to some extent the uncertainty of the radar measurements in both bearing and height. • To simplify the learning process, before performing the projection they subtract the ego-motion of the platform from the range-rate measurement of each target. • To calculate the ego-motion, a conventional stereo visual odometry system is used. As the radar is not synchronized with the cameras, the closest ego-motion estimate to each radar scan is taken. • The radar is sparse and can be inconsistent; there is no guarantee that a moving vehicle will be detected as a target. • It is also noisy: occasional high range-rate targets will briefly appear without any apparent relation to the environment. • Nevertheless, there is sufficient information for the radar to provide a useful guide to vehicle location.
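The input generation on slides 20 and 22 can be sketched roughly as below: each radar target is projected into the wide-angle camera and drawn as a small filled circle into two extra channels holding range and ego-motion-compensated range-rate. The projection and ego-motion helpers are hypothetical placeholders, as is the circle radius.

```python
# Rough sketch of building the two radar image channels (range, range-rate) from radar targets.
import numpy as np

def draw_disc(channel, u, v, value, radius=3):
    """Write `value` into a small filled circle around pixel (u, v)."""
    h, w = channel.shape
    ys, xs = np.ogrid[:h, :w]
    channel[(xs - u) ** 2 + (ys - v) ** 2 <= radius ** 2] = value

def radar_channels(targets, project_fn, ego_radial_speed_fn, img_h, img_w):
    """targets: iterable of (range_m, bearing_rad, range_rate_mps); helper functions assumed."""
    rng_ch = np.zeros((img_h, img_w), dtype=np.float32)
    rr_ch = np.zeros((img_h, img_w), dtype=np.float32)
    for rng, bearing, rate in targets:
        u, v = project_fn(rng, bearing)                       # pixel position in camera CA
        if 0 <= u < img_w and 0 <= v < img_h:
            draw_disc(rng_ch, u, v, rng)
            # subtract the platform's own radial speed along this bearing before drawing
            draw_disc(rr_ch, u, v, rate - ego_radial_speed_fn(bearing))
    return rng_ch, rr_ch
```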
  • 23. Distant Vehicle Detection Using Radar and Vision Examples of automatically generated training data. Top shows the image with bounding boxes from the object detections from the combined cameras. Middle shows the range image generated from the radar scan and bottom shows the range-rate image.
  • 24. Distant Vehicle Detection Using Radar and Vision • It builds upon the SSD object detection framework, chosen as it represents a proven baseline for single-stage detection networks. • It constructs the network from ResNet blocks using the 18-layer variant. • Using blocks from the larger ResNet variants added model complexity without increasing performance, possibly because the limited number of classes and training examples (relative to ImageNet) means that larger models merely add redundant parameters. • The radar data is included in two ways: firstly, by adding an additional branch for the radar input and concatenating the features after the second image ResNet block; secondly, by adding the same additional branch but without the max-pool and using element-wise addition to fuse the features after the first image ResNet block. • Compared with a combined five-channel input image, the branch configuration proved best, allowing the development of separate radar and RGB features. • Using a branch structure also offers the potential flexibility of re-using weights from the RGB branch with different radar representations.
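The two fusion variants on slide 24 can be sketched schematically; the blocks and channel widths below are assumptions standing in for the actual ResNet-18 stages, so this is a shape-level illustration rather than the paper's network.

```python
# Sketch of the two fusion options: concatenation after block 2, or element-wise addition
# after block 1. img_block*/radar_branch are assumed to be ResNet-style modules whose outputs
# match spatially.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, img_block1, img_block2, radar_branch, img_ch, radar_ch):
        super().__init__()
        self.img_block1, self.img_block2 = img_block1, img_block2
        self.radar_branch = radar_branch
        self.reduce = nn.Conv2d(img_ch + radar_ch, img_ch, kernel_size=1)  # back to image width

    def forward(self, rgb, radar):
        x = self.img_block2(self.img_block1(rgb))
        r = self.radar_branch(radar)
        return self.reduce(torch.cat([x, r], dim=1))          # concatenation fusion

class AddFusion(nn.Module):
    def __init__(self, img_block1, radar_branch):
        super().__init__()
        self.img_block1, self.radar_branch = img_block1, radar_branch

    def forward(self, rgb, radar):
        return self.img_block1(rgb) + self.radar_branch(radar)  # element-wise addition fusion
```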
  • 25. Distant Vehicle Detection Using Radar and Vision The network configuration for the concatenation fusion, showing filter sizes, strides, output channels and image size for each level. For networks using only RGB images, the right-hand radar branch is removed.
  • 26. Distant Vehicle Detection Using Radar and Vision
  • 27. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather • The fusion of multimodal sensor streams, such as camera, lidar, and radar, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. • While existing methods exploit redundant information under good conditions, they fail to do this in adverse weather where the sensory streams can be asymmetrically distorted. • These rare “edge-case” scenarios are not represented in available datasets, and existing fusion architectures are not designed to handle them. • This paper presents a multimodal dataset acquired over 10,000 km of driving in northern Europe. • Though it is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar and gated NIR sensors, it does not facilitate training, as extreme weather is rare. • To this end, they present a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. • Departing from proposal-level fusion, it proposes a single-shot model that adaptively fuses features, driven by measurement entropy.
  • 28. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather Existing object detection methods, including efficient Single-Shot Detectors (SSD), are trained on automotive datasets that are biased towards good weather conditions. While these methods work well in good conditions, they fail in rare weather events (top). Lidar-only detectors, such as the same SSD model trained on projected lidar depth, might be distorted due to severe backscatter in fog or snow (center). These asymmetric distortions are a challenge for fusion methods that rely on redundant information. The proposed method (bottom) learns to tackle unseen (potentially asymmetric) distortions in multimodal data without seeing training data of these rare scenarios.
  • 29. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather
  • 30. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather
  • 31. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather Multimodal sensor response of RGB camera, scanning lidar, gated camera and radar in a fog chamber with dense fog. Reference recordings under clear conditions are shown in the first row; recordings in fog with a visibility of 23 m are shown in the second row.
  • 32. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather • Data Representation: • The camera branch uses 3-plane RGB inputs, while the lidar and radar branches depart from recent bird’s-eye-view (BeV) projection schemes or raw point-cloud representations. • Instead of using a depth-only input encoding, depth, height and pulse intensity are provided as input to the lidar network. • For the radar network, the radar is assumed to scan in a 2D plane orthogonal to the image plane and parallel to the horizontal image dimension; the radar input is therefore considered invariant along the vertical image axis. • To aid the multimodal fusion by matching the input projection, the scan is replicated across the horizontal axis. • Gated images are transformed to the image plane using a homography mapping. • The input encoding allows for position- and intensity-dependent fusion with pixel-wise correspondences between the different streams. • Missing measurement samples are encoded with zero intensity.
  • 33. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather • Feature Extraction: • As the feature extraction stack in each stream, a modified VGG backbone is used. • It reduces the number of channels by half and cuts the network at the conv4 layer. • It uses 6 feature layers from conv4-10 as input to the SSD detection layers. • The feature maps decrease in size, forming a feature pyramid for detections at different scales. • The activations of the different feature extraction stacks are exchanged. • To steer the fusion towards the most reliable information, the sensor entropy is provided to each feature exchange block. • First, the entropy is convolved, a sigmoid is applied, the result is multiplied with the concatenated input features from all sensors, and finally the input entropy is concatenated. • Convolving the entropy and applying the sigmoid generates a multiplication matrix in the interval [0, 1]. • This scales the concatenated features for each sensor individually based on the available information. • Regions with low entropy can be attenuated, while entropy-rich regions are amplified. • Doing so allows features to be fused adaptively in the feature extraction stack itself.
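The entropy-steered exchange on slide 33 maps fairly directly to a small module; the sketch below assumes per-pixel entropy maps aligned with the feature maps and uses illustrative channel counts rather than the paper's exact layer sizes.

```python
# Minimal sketch of an entropy-steered feature exchange block: convolve the entropy, apply a
# sigmoid to get a [0, 1] multiplication matrix, scale the concatenated sensor features, and
# concatenate the entropy back onto the result.
import torch
import torch.nn as nn

class EntropyFusion(nn.Module):
    def __init__(self, feat_ch_total, ent_ch):
        super().__init__()
        self.ent_conv = nn.Conv2d(ent_ch, feat_ch_total, kernel_size=3, padding=1)

    def forward(self, sensor_feats, entropy):
        """sensor_feats: list of (B, C_i, H, W) tensors; entropy: (B, ent_ch, H, W)."""
        feats = torch.cat(sensor_feats, dim=1)               # concatenate all sensor streams
        gate = torch.sigmoid(self.ent_conv(entropy))         # multiplication matrix in [0, 1]
        return torch.cat([feats * gate, entropy], dim=1)     # re-attach the entropy channel
```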
  • 34. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather
  • 35. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather • Entropy-steered Fusion: • To steer the deep fusion towards redundant and reliable information, an entropy channel is introduced in each sensor stream, instead of directly inferring the adverse weather type and strength. • The steering process is learned purely on clean weather data, which contains the different illumination settings present from day to night-time conditions. • No real adverse weather patterns are presented during training. Further, sensor streams are dropped randomly with probability 0.5 and their entropy is set to a constant zero value. • Loss Functions • The number of anchor boxes in the different feature layers and their sizes play an important role during training; the chosen configuration is given in the supplemental material. • In total, each anchor box with class label yi and probability pi is trained using the cross-entropy loss with softmax.
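The sensor-drop augmentation on slide 35 can be sketched as a simple training-time function; whether the dropped stream is zeroed or removed entirely is an assumption here, the slide only states that streams are dropped with probability 0.5 and their entropy is set to zero.

```python
# Sketch of random sensor dropping during training (p = 0.5, entropy forced to zero).
import torch

def drop_sensors(streams, entropies, p=0.5, training=True):
    """streams/entropies: dicts of tensors keyed by sensor name with matching shapes."""
    if not training:
        return streams, entropies
    for name in list(streams.keys()):
        if torch.rand(1).item() < p:
            streams[name] = torch.zeros_like(streams[name])       # assumed: zero out the stream
            entropies[name] = torch.zeros_like(entropies[name])   # entropy set to constant zero
    return streams, entropies
```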
  • 36. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather Normalized entropy with respect to the clear reference recording for a gated camera, RGB camera, radar and lidar in varying fog visibilities (left) and changing illumination (right).
  • 37. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather Quantitative detection AP on real, unseen weather-affected data from the dataset, split across weather conditions and easy/moderate/hard difficulties.
  • 38. A Deep Learning Approach for Automotive Radar Interference Mitigation • Recent popular radar technologies include Frequency Modulated Continuous Wave (FMCW) or Chirp Sequence (CS) radars. • Using the transmit signal and the signal reflected by a target, the radar can capture the target’s range and velocity. • However, when interference signals exist, the noise floor increases and severely affects the detectability of target objects. • Conventional signal processing methods for canceling the interference or reconstructing the transmit signal are difficult and have many restrictions. • In this work, they propose an approach to mitigate interference using deep learning. • Specifically, they apply an RNN model with GRUs, suitable for processing sequence data, to remove interference and reconstruct the transmit signal simultaneously. • It reconstructs the transmit signal even in the presence of various interference signals, and the reconstructed signal can be used to detect objects through the Fast Fourier Transform (FFT). • In particular, through the learned network, signal processing can be done with matrix calculations alone, without any iterative structure. It also does not require any adaptive threshold.
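A hedged sketch of the idea on slide 38: a GRU-based sequence model maps the interference-corrupted samples (real/imaginary parts per time step) back to a clean signal, which is then passed to an FFT for detection. Layer sizes and the input encoding below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a GRU sequence-to-sequence model for interference mitigation (illustrative sizes).
import torch
import torch.nn as nn

class InterferenceMitigationGRU(nn.Module):
    def __init__(self, in_feats=2, hidden=64, layers=2):
        super().__init__()
        self.gru = nn.GRU(in_feats, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, in_feats)        # re-project to real/imag per sample

    def forward(self, x):
        """x: (batch, n_samples, 2) corrupted signal -> reconstructed clean signal."""
        h, _ = self.gru(x)
        return self.out(h)

# After reconstruction, the range spectrum follows from a plain FFT, e.g.
# spectrum = torch.fft.fft(torch.complex(y[..., 0], y[..., 1]), dim=-1)
```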
  • 39. A Deep Learning Approach for Automotive Radar Interference Mitigation CS waveform of transmit and received signal
  • 40. A Deep Learning Approach for Automotive Radar Interference Mitigation Beat frequency
  • 41. A Deep Learning Approach for Automotive Radar Interference Mitigation Interrupted transmit signal, with interference occurring in region (a). Interrupted beat signal, with interference occurring around samples 0 to 80.
  • 42. A Deep Learning Approach for Automotive Radar Interference Mitigation
  • 43. A Deep Learning Approach for Automotive Radar Interference Mitigation Method I is a time-domain thresholding (TDT) method; Method II does not use an adaptive threshold. Simulated power levels with respect to range. Two targets exist at ranges 100 m and 120 m. Four interferences exist at ranges 40 m, 50 m, 60 m, and 70 m. Red circles are detected targets.
  • 44. Deep Radar Detector • While camera and LiDAR processing have been revolutionized since the introduction of deep learning, radar processing still relies on classical tools. • The radar-generated point clouds differ significantly from LiDAR point clouds in two aspects: • A) Viewpoint/pose variation – the point cloud of an object differs even for very similar object poses and close viewpoint angles. • B) Temporal variation – even with no pose variation, the point cloud of the same object varies over time. • This paper introduces a deep learning (CNN-based) approach for radar processing, working directly with the complex radar data. • A significant challenge of applying DL to radar data is the lack of labeled data. Here, they rely for training only on the radar calibration data and introduce new radar augmentation techniques. • Applying deep learning on radar data has several advantages, such as eliminating the need for an expensive radar calibration process each time and enabling classification of the detected objects with almost zero overhead.
  • 45. Deep Radar Detector Conventional radar signal processing flow. The sampled radar echoes are first transformed to the range-Doppler (RD) domain via the 2D fast Fourier transform (FFT). Next, the cells in the RD map whose energy exceeds the detection threshold are declared detections. In the following beamforming processing block, the direction in azimuth and elevation to these detections is estimated. Finally, detections are clustered, tracked and classified.
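The first two stages of the classical chain on slide 45 can be sketched with a few lines of NumPy. A simple fixed threshold over the noise floor stands in for the detector here (real systems typically use an adaptive scheme such as CFAR); beamforming, clustering and tracking are omitted.

```python
# Sketch of the range-Doppler transform plus a simple threshold detector (illustrative only).
import numpy as np

def range_doppler_detections(raw, threshold_db=15.0):
    """raw: (n_chirps, n_samples) complex baseband samples for one antenna."""
    rd = np.fft.fftshift(np.fft.fft2(raw), axes=0)            # range FFT + Doppler FFT
    power_db = 20.0 * np.log10(np.abs(rd) + 1e-12)
    noise_floor = np.median(power_db)
    # cells above the threshold are declared detections: (doppler_bin, range_bin) index pairs
    return np.argwhere(power_db > noise_floor + threshold_db)
```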
  • 46. Deep Radar Detector • This work proposes to use the radar calibration data, which contains the radar sensor array responses to a known target located at a variety of angles. • Typically, the radar is calibrated in the anechoic chamber with a point-target (corner reflector). • The radar is mounted to an accurate rotator to collect array responses at a variety of angles.
  • 47. Deep Radar Detector • The solution relies on a two-step detector as in the Faster R-CNN (Region CNN) model, whose detection is performed using the following two steps: • 1) A Region Proposal Network (RPN) proposes regions where it is likely to find objects. Each is provided with its possible coarse location (using anchors). • 2) A classifier classifies the proposed objects and fine-tunes their locations (via regression). • The detection task in the range-Doppler map is formulated as a segmentation, in which each cell (“pixel”) in the RD map is labeled with the correct class. • This model proposes the following two detection steps: • RD-Net: detects, classifies and localizes all detections in the range-Doppler domain. • Ang-Net: finds the azimuth and elevation angles of each detection found by the RD-Net. • The RD-Net, whose internal architecture adopts the 2D U-Net, performs the segmentation task. • The detections’ classes and locations in the range-Doppler map, together with a global feature vector, are then passed to the Ang-Net, which obtains the azimuth and elevation of each detection in the range-Doppler map.
  • 48. Deep Radar Detector Raw Radar Frame Input (left); Network Input Radar Frame Output (right). The radar targets at any range and Doppler can be augmented simply by shifting the phase of the raw radar frame elements. This is easily done by multiplying the window coefficients of the 2D FFT with a complex exponential (before applying the FFT to the data, it is first passed through a window, e.g. Hamming, to reduce side-lobes).
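The augmentation on slide 48 follows from the FFT shift property: multiplying the raw frame by complex phase ramps along fast time and slow time moves a target by a chosen number of range and Doppler bins. The sketch below applies the ramps directly to the frame; as the slide notes, they can equivalently be folded into the window coefficients (e.g. Hamming) applied before the 2D FFT. Shift amounts are arbitrary examples.

```python
# Sketch of shifting a target in range/Doppler by phase ramps applied before the 2D FFT.
import numpy as np

def shift_target(raw, d_range_bins, d_doppler_bins):
    """raw: (n_chirps, n_samples) complex frame; returns a frame whose RD map is shifted."""
    n_chirps, n_samples = raw.shape
    n = np.arange(n_samples)
    m = np.arange(n_chirps)
    range_ramp = np.exp(2j * np.pi * d_range_bins * n / n_samples)      # fast-time phase ramp
    doppler_ramp = np.exp(2j * np.pi * d_doppler_bins * m / n_chirps)   # slow-time phase ramp
    return raw * range_ramp[None, :] * doppler_ramp[:, None]

# The ramps can be merged with the side-lobe-reducing window, e.g. a Hamming window per axis:
win_fast = np.hamming(256)   # example fast-time window length
```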
  • 49. Deep Radar Detector DRD network flow. The radar frame is first passed to the RD-Net for RD-domain detection (range & Doppler) and global feature extraction. The detections (location & class) are then passed to the Ang-Net, which pools for each detection a 3x3 center crop from the radar frame. It uses this crop together with the global feature vector and class (extracted by the RD-Net) to find the angle (azimuth & elevation) of each detection. The class-balanced cross-entropy loss is used in the RD-Net. The two classification heads of the Ang-Net use the regular cross-entropy loss. The overall loss combines the RD-Net class-balanced cross-entropy loss with the two Ang-Net cross-entropy losses.
  • 50. Deep Radar Detector DRD-Network Architecture. In the RD-Net, a U-Net-shaped network is used to detect all targets in the RD domain. In the Ang-Net, for each detection the network takes a 3x3xCh crop and filters it with a 3x3x256 Conv, resulting in a 1x1x256 vector. The vector is concatenated with the 1x1x512 global feature vector extracted from the RD-Net and also with the class one-hot vector k. The concatenated vector is then passed through 3 fc layers (fc1-3) and the output is split into 2 separate classification heads, one for azimuth detection and the second for elevation detection.
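The Ang-Net head on slide 50 can be sketched as below. Dimensions follow the slide where stated (3x3xCh crop, 256-dim crop vector, 512-dim global feature, three fc layers, two heads); the fc widths and bin counts are assumptions.

```python
# Illustrative sketch of the Ang-Net head: crop conv -> concat with global feature and class
# one-hot -> three fc layers -> separate azimuth and elevation classification heads.
import torch
import torch.nn as nn

class AngNetHead(nn.Module):
    def __init__(self, crop_ch, n_classes, n_az_bins, n_el_bins):
        super().__init__()
        self.crop_conv = nn.Conv2d(crop_ch, 256, kernel_size=3)      # 3x3 crop -> 1x1x256
        self.fc = nn.Sequential(
            nn.Linear(256 + 512 + n_classes, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.az_head = nn.Linear(512, n_az_bins)
        self.el_head = nn.Linear(512, n_el_bins)

    def forward(self, crop, global_feat, class_onehot):
        """crop: (B, crop_ch, 3, 3); global_feat: (B, 512); class_onehot: (B, n_classes)."""
        v = self.crop_conv(crop).flatten(1)                          # (B, 256)
        x = self.fc(torch.cat([v, global_feat, class_onehot], dim=1))
        return self.az_head(x), self.el_head(x)
```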
  • 51. Deep Radar Detector Accuracy vs SNR: Range Doppler accuracy, Azimuth Accuracy, Elevation accuracy.