HiPEAC 2019 Workshop - Use Cases

Implementation on embedded platforms
• All use cases started with a reference implementation in a normal server environment
• Tools used on the embedded platform:
  • SDSoC, with the Tulipp platform installed
  • Vivado HLS
• Tulipp tools:
  • STHEM
  • Lynsyn
  • HIPPEROS OS

Workflow
• Clean the code of library dependencies that are not available on the embedded platform
• Make it run on the CPU side of the SoC FPGA: handle input/output, reduce the memory footprint, etc.
• Identify sections of the code that are candidates for hardware acceleration
• Refactor/restructure the algorithm to optimize it for the given conditions:
  • Streaming
  • Small local memory
  • Preferably no floating point
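To illustrate the "preferably no floating point" guideline: a minimal Python sketch of fixed-point arithmetic, assuming a hypothetical Q16.16 format (16 fractional bits). The helper names are illustrative and not taken from the Tulipp code; in hardware the same idea maps to plain integer multipliers and shifts.

```python
FRAC_BITS = 16            # hypothetical Q16.16 format: 16 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a real number to Q16.16 by scaling and rounding."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a Q16.16 value back to a float (for checking results only)."""
    return q / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values.

    The raw product has 32 fractional bits, so add half an LSB and shift
    right by FRAC_BITS to round back to Q16.16.
    """
    return (a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


# Example: apply a weight of 0.75 to a pixel value of 200
weight = to_fixed(0.75)
pixel = to_fixed(200)
print(to_float(fixed_mul(weight, pixel)))  # 150.0
```

The only floating-point operations are the conversions at the boundaries; everything in between is integer arithmetic.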

The ADAS use case: pedestrian detection
[Figure: pedestrian detection, safety application, car integration]
Requirements:
• 30 Hz frame rate
• Low latency (2-3 frames)
• Power consumption of no more than 5-10 W

Viola-Jones classification
• Machine learning algorithm based on training with labeled data
• The classifier is the weighted sum of "rectangular features"
• The training process selects which features to use and their weights

Rectangular features
A feature is the sum of all pixels in a rectangular region. A classifier consists of a large number of features calculated for a given patch. If the weighted sum is above a threshold, the patch contains a pedestrian.
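A minimal sketch of the classification rule above, assuming each feature is a hypothetical `(x, y, w, h, weight)` tuple relative to the patch origin. It follows the slide's simplified description (a weighted sum compared against a threshold); the rectangle sums are computed by brute force here to keep the example short.

```python
from typing import List, Sequence, Tuple

# Hypothetical feature encoding: rectangle (x, y, w, h) inside the patch, plus its trained weight
Feature = Tuple[int, int, int, int, float]


def rect_sum(patch: Sequence[Sequence[int]], x: int, y: int, w: int, h: int) -> int:
    """Sum of all pixels in the w x h rectangle with top-left corner (x, y)."""
    return sum(patch[r][c] for r in range(y, y + h) for c in range(x, x + w))


def classify(patch: Sequence[Sequence[int]], features: List[Feature], threshold: float) -> bool:
    """Weighted sum of rectangular features compared against a threshold."""
    score = sum(weight * rect_sum(patch, x, y, w, h) for x, y, w, h, weight in features)
    return score > threshold


patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
features = [(0, 0, 2, 2, 1.0),   # top-left 2x2 block: 1 + 2 + 4 + 5 = 12
            (2, 0, 1, 3, -0.5)]  # right column: 3 + 6 + 9 = 18
print(classify(patch, features, threshold=2.0))  # 12 - 9 = 3 > 2, so True
```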

Integral images
• In an integral image, each pixel (x, y) stores the sum of all pixels above and to the left of (x, y) in the original image
• With an integral image, the sum of all pixels in an arbitrary rectangle can be calculated with a small, fixed number of operations (four lookups), independent of the rectangle size
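A sketch of building an integral image and using it for a rectangle sum. It assumes a zero-padded layout (one extra row and column of zeros), a common choice that avoids boundary checks; with it, any rectangle sum needs exactly four lookups.

```python
from typing import List, Sequence


def integral_image(img: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return ii with one extra row/column of zeros, so that ii[y][x] is the
    sum of img[r][c] over all r < y and c < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0                      # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Sum of the w x h rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The integral image is built in a single sequential pass, which is stream-friendly; the rectangle lookups afterwards are the random accesses discussed in the challenges below.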

Integral images
[Figure: each 640 x 480 frame is converted into LUV color, gradient magnitude, and orientation channels, yielding 10 integral images]

Perform detections
[Figure: the classifier is swept over all pixel positions of the integral images, for 50 patch sizes]

Challenges
• High memory bandwidth requirements, combined with a non-sequential access pattern:
  • 30 frames/s
  • 50 patch sizes
  • Sweeping over all image positions
  • Each classifier requires roughly 1000 feature calculations
• Ineffective pipelining, since the classifier calculation can terminate at any stage
• Not all data can be kept locally (cached)

So we need some tricks
• Cascading: a successively trained chain of classifiers that focuses on eliminating non-pedestrians quickly. This reduces the average number of classifier steps by at least a factor of 10.
• The classifiers do not need to test every single position; instead they scan a grid
• This still results in a need for 5-10 GB/s, with random access!
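[Figure: classifier cascade; a patch with a possible pedestrian passes through successive stages, each of which can reject it ("No pedestrian"), and only a patch that passes every stage is reported ("Pedestrian!")]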
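The cascade and the grid scan can be sketched together as follows. The stages here are hypothetical `(score_function, threshold)` pairs, not trained classifiers; the point is the early exit, which makes the amount of work per position data-dependent (and is also why pipelining is hard).

```python
from typing import Callable, List, Tuple

# Hypothetical stage: (score function of the patch position (x, y), rejection threshold)
Stage = Tuple[Callable[[int, int], float], float]


def run_cascade(stages: List[Stage], x: int, y: int) -> Tuple[bool, int]:
    """Evaluate stages in order and stop at the first rejection.
    Returns (is_pedestrian, number_of_stages_evaluated)."""
    for i, (score, threshold) in enumerate(stages):
        if score(x, y) <= threshold:
            return False, i + 1          # early exit: "No pedestrian"
    return True, len(stages)             # passed every stage: "Pedestrian!"


def scan_grid(stages: List[Stage], width: int, height: int,
              patch: int, step: int) -> Tuple[List[Tuple[int, int]], int]:
    """Slide a patch x patch window over a grid with the given step,
    instead of over every pixel position; also count the stages evaluated."""
    detections: List[Tuple[int, int]] = []
    work = 0
    for y in range(0, height - patch + 1, step):
        for x in range(0, width - patch + 1, step):
            hit, evaluated = run_cascade(stages, x, y)
            work += evaluated
            if hit:
                detections.append((x, y))
    return detections, work


# Toy two-stage cascade: stage 1 rejects positions with x < 8, stage 2 those with y < 8
stages = [(lambda x, y: x, 7.0), (lambda x, y: y, 7.0)]
detections, work = scan_grid(stages, width=16, height=16, patch=8, step=4)
print(detections, work)  # [(8, 8)] 12
```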

Random access to DDR memory is very inefficient
• The trick was to find data requests that fell on the same DDR cache lines
• That required rewriting the algorithm so that it calculates many classifiers at the same time
• By then reordering all accesses in a cache-friendly manner, the achieved memory bandwidth came close to that of fully sequential accesses
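A simplified illustration of the reordering idea: collect the reads requested by many classifiers, bucket them by cache line, and serve each bucket with a single fetch in address order. The 64-byte line size, the 4-byte entries and the `(classifier_id, x, y)` request format are assumptions, not values from the original design.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

CACHE_LINE_BYTES = 64   # assumption: typical line size; the real memory geometry may differ
BYTES_PER_PIXEL = 4     # assumption: 32-bit integral-image entries

Request = Tuple[int, int, int]  # hypothetical (classifier_id, x, y) read request


def group_by_cache_line(requests: List[Request], width: int) -> Dict[int, List[Request]]:
    """Bucket read requests by the cache line they hit, so each line is fetched
    once and serves every classifier that needs it."""
    lines: DefaultDict[int, List[Request]] = defaultdict(list)
    for req in requests:
        _, x, y = req
        address = (y * width + x) * BYTES_PER_PIXEL   # row-major byte address
        lines[address // CACHE_LINE_BYTES].append(req)
    return dict(sorted(lines.items()))                # visit lines in increasing address order


requests = [(0, 3, 0), (1, 10, 0), (2, 20, 0), (3, 3, 1), (4, 0, 0)]
groups = group_by_cache_line(requests, width=640)
print(len(requests), "reads ->", len(groups), "cache-line fetches")  # 5 reads -> 3 cache-line fetches
```

Fetching lines in increasing address order is what brings the access pattern close to sequential; the gain grows with the number of classifiers evaluated together.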

Result
• Reference implementation on a PC platform: 10 s/frame
• Final implementation on the Tulipp platform:
  • 15 frames/s
  • Latency of 2-3 frames

The UAV Use Case objectives
• To perform real-time stereo depth estimation
• To detect obstacles based on the depth estimation and to avoid collisions
• Based on dual cameras forming a stereo pair:
  • Lower weight and lower price than a depth camera
• Requires real-time performance: high measurement rate and low latency

Stereo Depth Estimation
• Two cameras with baseline $b$ observe an object $M$ at two different image locations $x_1$ and $x_2$
• Depth $Z$ can be computed from the disparity $d = |x_1 - x_2| \propto \frac{1}{Z}$
• Disparity computation requires detecting the same objects in both images
[Figure: stereo camera setup]

Algorithm Description
• Stereo algorithm with Semi-Global Matching (SGM) [1] optimization
[1] H. Hirschmueller, "Accurate and efficient stereo processing by semi-global matching and mutual information," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
Processing pipeline (input: stereo images; output: depth map):
• Image rectification / pre-processing
• Depth estimation:
  • Local matching
  • Semi-global matching
  • Left-right consistency check
  • Median filtering
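As an example of one pipeline step, a left-right consistency check on a single scanline might look like this. It assumes the usual convention that a left-image pixel at column `x` with disparity `d` matches column `x - d` in the right image, and a tolerance of 1 pixel; both are assumptions here.

```python
from typing import List, Optional


def left_right_check(disp_left: List[int], disp_right: List[int],
                     max_diff: int = 1) -> List[Optional[int]]:
    """Keep a left-image disparity only if the right image, looked up at the
    matching column x - d, reports (almost) the same disparity."""
    out: List[Optional[int]] = []
    for x, d in enumerate(disp_left):
        xr = x - d                                   # corresponding column in the right image
        if 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= max_diff:
            out.append(d)
        else:
            out.append(None)                         # inconsistent (e.g. occluded): invalidate
    return out


disp_left = [0, 1, 1, 3, 1]
disp_right = [0, 1, 3, 0, 0]
print(left_right_check(disp_left, disp_right))  # [0, 1, 1, None, 1]
```

The invalidated pixels are typically filled in afterwards, which is one role of the median filtering step.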

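Semi-Global Matching
$$E(D) = \sum_{\mathbf{p}} \Big( \mathrm{Cost}(\mathbf{p}, D_{\mathbf{p}}) + \sum_{\mathbf{q} \in N_{\mathbf{p}}} P_1 \, T\big[|D_{\mathbf{p}} - D_{\mathbf{q}}| = 1\big] + \sum_{\mathbf{q} \in N_{\mathbf{p}}} P_2 \, T\big[|D_{\mathbf{p}} - D_{\mathbf{q}}| > 1\big] \Big), \quad \text{with } P_1 \le P_2$$
where $T[\cdot]$ is 1 if its argument is true and 0 otherwise, and $N_{\mathbf{p}}$ is the neighborhood of pixel $\mathbf{p}$.
• Small discontinuity: small penalty ($P_1$); large discontinuity: large penalty ($P_2$)
• Aggregation along paths is solved using dynamic programming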
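The dynamic-programming aggregation along one path can be sketched with the path recurrence from [1], which applies the penalties $P_1$ and $P_2$ from the energy above. This is a single left-to-right pass over one scanline with small integer costs; a full SGM implementation repeats it for every path direction and sums the results.

```python
from typing import List


def aggregate_path(cost: List[List[float]], p1: float, p2: float) -> List[List[float]]:
    """Aggregate matching costs cost[i][d] along one path with the SGM recurrence
    L(p,d) = C(p,d) + min(L(p-r,d), L(p-r,d-1)+P1, L(p-r,d+1)+P1, min_k L(p-r,k)+P2)
             - min_k L(p-r,k)
    """
    n_disp = len(cost[0])
    agg = [list(cost[0])]                        # first pixel on the path has no predecessor
    for i in range(1, len(cost)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = prev[d]                                   # same disparity: no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)           # disparity change of 1: P1
            if d + 1 < n_disp:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)                  # larger jump: P2
            row.append(cost[i][d] + best - prev_min)         # subtracting the min keeps values bounded
        agg.append(row)
    return agg


cost = [[0, 5, 5],
        [5, 5, 0],
        [5, 0, 5]]
agg = aggregate_path(cost, p1=1, p2=4)
print(agg)                                    # [[0, 5, 5], [5, 6, 4], [6, 1, 5]]
print([row.index(min(row)) for row in agg])   # winner-take-all disparities: [0, 2, 1]
```

Each pixel only needs the aggregated costs of its predecessor on the path, which is what makes a streaming implementation possible for suitable path directions.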

Stereo Depth Estimation
The depth $Z$ can be calculated as $Z = f \cdot \frac{b}{d}$, where $f$ is the focal length, $b$ the distance between the cameras (the baseline) and $d$ the disparity.
[Figure: input image and corresponding depth image]
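A direct transcription of $Z = f \cdot b / d$, assuming $f$ is given in pixels and $b$ in metres so that $Z$ comes out in metres; the camera parameters in the example are hypothetical.

```python
def depth_from_disparity(d_px: float, focal_px: float, baseline_m: float) -> float:
    """Z = f * b / d, with f and d in pixels and b in metres, giving Z in metres."""
    if d_px <= 0:
        raise ValueError("disparity must be positive (d = 0 corresponds to infinite depth)")
    return focal_px * baseline_m / d_px


# Hypothetical camera: focal length 700 px, baseline 0.10 m
for d in (70, 35, 7):
    print(d, "px ->", depth_from_disparity(d, 700.0, 0.10), "m")
```

Because $Z$ is inversely proportional to $d$, a one-pixel disparity error matters much more for distant objects (small $d$) than for close ones.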

Obstacle Avoidance
A reactive obstacle-avoidance algorithm computes the shortest path around an obstacle based on the disparity map:
1. U- / V-Map computation (Oleynikova et al. 2015)
2. Binary filtering and contour detection
3. Obstacle extraction and waypoint computation
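A sketch of the U- and V-map computation in step 1, using the common definition: a U-map is a per-column histogram of disparities and a V-map a per-row histogram. An obstacle at roughly constant distance shows up in the U-map as a horizontal segment at its disparity. The input format (a list of rows, with invalid disparities negative) is an assumption.

```python
from typing import List


def u_disparity(disp: List[List[int]], max_disp: int) -> List[List[int]]:
    """U-map: for every image column, a histogram of the disparities in that column.
    The result has (max_disp + 1) rows and one column per image column."""
    width = len(disp[0])
    umap = [[0] * width for _ in range(max_disp + 1)]
    for row in disp:
        for x, d in enumerate(row):
            if 0 <= d <= max_disp:          # skip invalid disparities (e.g. -1)
                umap[d][x] += 1
    return umap


def v_disparity(disp: List[List[int]], max_disp: int) -> List[List[int]]:
    """V-map: for every image row, a histogram of the disparities in that row."""
    vmap = []
    for row in disp:
        hist = [0] * (max_disp + 1)
        for d in row:
            if 0 <= d <= max_disp:
                hist[d] += 1
        vmap.append(hist)
    return vmap


# Toy 3 x 4 disparity map: obstacle at disparity 3 in columns 1-2, background 0, ground row 1
disp = [[0, 3, 3, 0],
        [0, 3, 3, 0],
        [1, 1, 1, 1]]
umap = u_disparity(disp, max_disp=3)
print(umap[3])  # [0, 2, 2, 0]: the obstacle appears as a segment at disparity 3
```

Binary filtering (thresholding the counts) and contour detection on the U-map then give the obstacle's column extent and disparity, i.e. its width and distance.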

Challenges
• Limited local storage on the FPGA
• Real-time/low-latency requirements

SGM optimization for streaming
• The original algorithm aggregates along 8 paths
• That requires access to the full image
• In the FPGA implementation, the full image cannot be stored locally, so a streaming solution is preferred
• By aggregating along only 4 paths, streaming can be used
• Going from 8 to 4 paths reduces accuracy by only 1.7%
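The slide does not say which 4 paths were kept. A natural assumption is the four directions whose predecessor pixel has already arrived in raster order (paths coming from the left, top-left, top and top-right), which needs only a line buffer rather than the full image. The sketch below derives exactly those directions:

```python
from typing import List, Tuple

# The 8 SGM path directions as (dx, dy) steps from a pixel's predecessor to the pixel
ALL_DIRECTIONS: List[Tuple[int, int]] = [
    (1, 0), (-1, 0), (0, 1), (0, -1),
    (1, 1), (-1, 1), (1, -1), (-1, -1),
]


def streamable(direction: Tuple[int, int]) -> bool:
    """A path can be aggregated in one raster-order pass (top-to-bottom,
    left-to-right) only if each pixel's predecessor p - r has already arrived."""
    dx, dy = direction
    px, py = -dx, -dy                   # offset of the predecessor relative to the pixel
    return py < 0 or (py == 0 and px < 0)


STREAMING_DIRECTIONS = [r for r in ALL_DIRECTIONS if streamable(r)]
print(STREAMING_DIRECTIONS)  # [(1, 0), (0, 1), (1, 1), (-1, 1)]
```

The other four directions would need pixels that arrive later in the stream, which is why they require the full image (or a second, reversed pass).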

Implementation and Results
• The disparity estimation is implemented in C/C++ and synthesized to the FPGA using HLS
• The obstacle avoidance is implemented purely on the CPU part of the SoC FPGA

The medical use case
• Used on X-ray video for surgery
• Lowers the radiation dose by a factor of 4
• Enhances the image quality by denoising and image filtering
• Operates on 1024 x 1024, 24-bit images at 30 Hz

Current solution vs. the goal
[Figure: current X-ray sensor architecture: sensor → raw image → PC dedicated to Thales → cleaned & enhanced image → UI]
With Tulipp:
• Reduce costs
• Reduce size
• Ease integration
• Choose an MPSoC
[Figure: future X-ray sensor architecture: a nano processing unit inside the sensor, based on an SoC (credit-card-sized board), delivers the cleaned & enhanced image over GigE-Vision+Msg]

Multi pass image filtering
• The image is filtered with several different methods
• Together they:
  • Remove sensor defects
  • Emphasize low-contrast parts of the image
  • Enhance details and edges
  • Adapt the image to the final display

Typical processing sequence, starting from the raw image:

Clean image stage
• Remove dead pixels
• AGC: Automatic Gain Control
• ABC: Automatic Brightness and Contrast, with feedback to the X-ray sensor
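The slide does not specify how dead pixels are removed. One common approach, assumed here, is to replace each pixel listed in a calibrated dead-pixel map by the median of its valid 3x3 neighbours:

```python
from statistics import median
from typing import List, Set, Tuple


def fix_dead_pixels(img: List[List[int]], dead: Set[Tuple[int, int]]) -> List[List[int]]:
    """Replace each known dead pixel (x, y) by the median of its valid 3x3 neighbours.
    The dead-pixel map is assumed to come from sensor calibration."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]               # leave the input image untouched
    for x, y in dead:
        neighbours = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))
                      if (xx, yy) not in dead]  # also excludes the dead pixel itself
        if neighbours:
            out[y][x] = int(median(neighbours))
    return out


img = [[10, 10, 10],
       [10, 0, 12],
       [10, 10, 14]]
print(fix_dead_pixels(img, {(1, 1)}))  # [[10, 10, 10], [10, 10, 12], [10, 10, 14]]
```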

Pre-equalization gamma
• Enhances the low-signal-level parts of the image
• Recursive temporal filter for denoising
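The simplest form of a recursive temporal filter is a per-pixel first-order IIR filter, $y_t = \alpha x_t + (1 - \alpha) y_{t-1}$. The sketch below assumes that form and a hypothetical blending factor `alpha`; the actual filter used in the medical application is not described on the slide. Frames are flat lists of pixel values.

```python
from typing import Iterable, Iterator, List, Optional


def recursive_temporal_filter(frames: Iterable[List[float]], alpha: float) -> Iterator[List[float]]:
    """Per-pixel first-order recursive filter: y_t = alpha * x_t + (1 - alpha) * y_{t-1}.
    A smaller alpha gives stronger denoising but more lag on moving content.
    Only the previous output frame has to be stored."""
    state: Optional[List[float]] = None
    for frame in frames:
        if state is None:
            state = list(frame)                          # initialise with the first frame
        else:
            state = [alpha * x + (1 - alpha) * y for x, y in zip(frame, state)]
        yield state


frames = [[100.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
for out in recursive_temporal_filter(frames, alpha=0.5):
    print(out)  # [100.0, 0.0], then [50.0, 0.0], then [25.0, 0.0]
```

Being recursive, the filter needs one stored frame instead of a window of past frames, which matters when local memory is small.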

Clip & Spatial filters
• Clipping to reduce the signal
levels of the very bright areas
• Spatial filtering for smoothing
(convolution)

Multiscale contrast & edge enhancement
• Multiscale filtering using a Laplacian/Gaussian pyramid
• Iteratively operates on downscaled images in a "pyramid"
• In each step, a low-pass filtered image is subtracted from the original to extract the high-frequency components
• The final image is composed of the results from each level
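A one-dimensional sketch of the pyramid decomposition described above, with two simplifying assumptions: a [1, 2, 1]/4 low-pass kernel and nearest-neighbour upsampling. Enhancement is done by amplifying the high-frequency levels before recomposing the signal; without gain the decomposition is lossless.

```python
from typing import List


def downsample(signal: List[float]) -> List[float]:
    """Low-pass with a [1, 2, 1] / 4 kernel (edges clamped), then keep every second sample."""
    n = len(signal)
    blurred = [(signal[max(i - 1, 0)] + 2 * signal[i] + signal[min(i + 1, n - 1)]) / 4
               for i in range(n)]
    return blurred[::2]


def upsample(signal: List[float], length: int) -> List[float]:
    """Nearest-neighbour upsampling back to the given length."""
    return [signal[i // 2] for i in range(length)]


def build_pyramid(signal: List[float], levels: int) -> List[List[float]]:
    """Return [high-pass level 0, ..., high-pass level levels-1, low-pass residual]."""
    pyramid: List[List[float]] = []
    current = signal
    for _ in range(levels):
        low = downsample(current)
        # high-frequency part = current level minus its low-pass approximation
        pyramid.append([c - u for c, u in zip(current, upsample(low, len(current)))])
        current = low
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid: List[List[float]]) -> List[float]:
    """Invert build_pyramid: upsample the coarse level and add back each high-pass level."""
    current = pyramid[-1]
    for high in reversed(pyramid[:-1]):
        current = [h + u for h, u in zip(high, upsample(current, len(high)))]
    return current


signal = [0.0, 0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 8.0]   # a step edge
pyr = build_pyramid(signal, levels=2)
print(reconstruct(pyr) == signal)  # True: without gain the decomposition is lossless

# Edge enhancement: amplify the high-frequency levels (hypothetical gain of 2) before recomposing
enhanced = reconstruct([[2.0 * v for v in level] for level in pyr[:-1]] + [pyr[-1]])
print(enhanced[3] - enhanced[2])  # the step height grows from 8.0 to 16.0
```

Every level has to be kept until recomposition, which illustrates why the pyramid needs far more memory than is available locally on the FPGA.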

Inversion & auto-adaptive LUT
• Adaptation to the display

Rotation & Resize

Challenges
• Handling all scales in the pyramid filtering requires much more memory than is available locally
• Some of the filters had to be redesigned because they had too many branches, which is poorly suited to hardware streaming solutions
• Implemented from C/C++ using SDSoC

Results
• The algorithm, although slightly modified, runs on the Tulipp platform with:
  • 29 frames/s
  • 29 ms latency

Conclusion
• The three use cases show that the Tulipp platform performs well for quite different applications
• The Tulipp tools, together with the vendor tools, offer a good development environment in which you can actually get efficient FPGA implementations from C/C++ using high-level tools
• Important to remember: a large portion of the work will (always) be refactoring/restructuring the algorithm to fit the underlying hardware structure
