"Taming the Beast: Performance and Energy Optimization Across Embedded Feature Detection and Tracking," a Presentation from Cadence

Copyright © 2014 Cadence Design Systems
Chris Rowen, Cadence Fellow
May 2014
Taming the Beast:
Performance and Energy Optimization Across
Embedded Feature Detection and Tracking

Agenda
• What’s the problem in imaging and vision?
• An extended example: feature/gesture recognition/tracking pipeline
• A quick look at feature detectors
• A deep dive on connected component identification
• Mapping to a vision DSP: issues and opportunities
• Performance and energy optimization results
• Wrap-up

IVP-EP: Platform for Imaging Applications Everywhere
Application areas and target markets:
• Auto (ADAS): Advanced Driver Assistance Systems
• Computer Vision: DTV, tablets, PCs, consumer gaming
• Still Image and Video Capture: handsets, tablets, PCs, DSCs
• Video Post-Processing: DTV, mobile
Example applications:
• Front-collision warning
• Automatic high beam
• Traffic sign detection/recognition
• Lane tracking
• Pedestrian detection and tracking
• Gesture control
• Face detection, recognition and tracking
• Scene analysis
• Object detection, tracking and identification
• Augmented reality registration
• High dynamic range (HDR) image/video capture
• Video pre-processing for improved encoding
• Stabilization
• Digital zoom
• Low-light image enhancement
• Digital effects photography
• Decode artifact compensation
• Scaling, frame rate adjustment
• Sharpening
• Display adaptation
• 3D effects

Imaging Computation Chain — Sensor to Display
Imaging processors fit where performance and complexity collide:
• High performance required (implying significant power consumption or latency sensitivity)
• Complex multi-frame YUV algorithms that require high non-local memory bandwidth
• New and complex algorithms are required in the Bayer domain for ISP performance
• Product differentiation depends on rapidly evolving, sometimes proprietary algorithms or performance
• Complete applications built by chaining a range of imaging/video functions
• The future: acceleration of demand from both RTL soft and CPU offload
[Diagram: imaging computation chain from sensor to display]
Blocks: Sensor; Gain; 3A control (Histogram, face, …); Multi sensor processing (HDR, Stereo depth); Single frame ISP; Multi-frame ISP; Smart Photo; Video Encode; Decode; Video Post Process; Image/Video Analysis; Display; Storage
Domains: Bayer domain ISP processing; Intelligent image post-processing
Functions shown in the diagram:
• 3D noise reduction
• Still image stabilization
• Video stabilization
• High Dynamic Range (HDR)
• Face, blink, smile detection
• Red-eye reduction
• Skin beautification
• Panorama stitching
• JPEG compression
• Video compression
• Transcoding
• 3D capture
• Pixel processing
• Filtering (incl. deblocking)
• Bitstream processing
• Frame rate conversion
• De-interlacing
• 3D noise filtering
• Display compensation
• Edge/corner/blob detection
• Feature detection
• Object tracking
• Barcode/QR code detection
• Motion analysis
• Gesture detection
• Face/gesture recognition
• Scene analysis
• Augmented reality
• RGB Bayer filtering
• Defect correction
• Lens shading correction
• Demosaicing
• Sharpening
• Color/gamma correction
• Noise filtering
• Auto exposure correction
• White balance
• Color space conversion
• Black level compensation
• Extended depth of field
• Scaling
• Image warping
• Super-resolution/digital zoom
• Multi-sensor array integration

Example Gesture Recognition Pipeline
1. Preprocess (denoise, contrast)
2. Region of interest (motion detect, skin tone, etc.)
3. Morphological operations (dilate, erode)
4. Bounding box (connected components)
5. Hand/pose detection (classifier)
6. Hand tracking (mean shift, feature detection + optical flow)
7. Gesture recognition
The pipeline moves from dense processing in the early stages to sparse processing in the later stages.

The Gesture Recognition Pipeline Visualized
[Figure panels: Input Image; Background subtraction; Denoising; Dilate/Erode; Connected components; Bounding Box]
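The dense stages named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code behind the figure: the function names, the 3x3 structuring element, the clipped-border handling, and the threshold value are all hypothetical choices made for the example. The bounding box here covers the whole mask; a per-object box would need connected component labels first.

```python
def foreground_mask(frame, background, thresh):
    """Background subtraction: 1 where the frame differs from the background."""
    return [[1 if abs(f - b) > thresh else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def _morph(mask, combine):
    """Apply min (erode) or max (dilate) over each 3x3 neighborhood.

    Neighborhoods are clipped at the image border (an assumption).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = combine(window)
    return out


def erode(mask):
    return _morph(mask, min)


def dilate(mask):
    return _morph(mask, max)


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of all foreground pixels, or None."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not points:
        return None
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


# Usage: a 3x3 bright object plus one isolated noise pixel.
background = [[10] * 6 for _ in range(6)]
frame = [row[:] for row in background]
for y in range(1, 4):
    for x in range(1, 4):
        frame[y][x] = 200
frame[5][5] = 200  # noise pixel

mask = foreground_mask(frame, background, thresh=50)
cleaned = dilate(erode(mask))   # erode then dilate removes the isolated pixel
print(bounding_box(mask))       # (1, 1, 5, 5)
print(bounding_box(cleaned))    # (1, 1, 3, 3)
```

Every output pixel depends only on a small fixed neighborhood, which is why these stages count as dense processing and map well onto vector hardware.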

Connected Component Labelling
[Figure: input image and output image]
Output image: each shade of gray represents a different object and is assigned a unique label (1, 2, 3, …)

Connected Components
• Each set of connected pixels is uniquely labeled
• Connectivity checks are performed on 4 or 8 neighbor pixels
• Input is usually binary (for example, all foreground pixels are considered connected) or grayscale (similar pixels are considered connected)
• Approaches to connected component labelling
  • Two-pass algorithm
    • First pass: scan and propagate labels top-down and left-right, give new labels to unconnected points, and maintain label mappings when two labels are found to be connected
    • Second pass: re-label using the label mappings from the first pass
    • Vector friendly; performance depends on the number of labels generated in the first pass
  • Single-pass algorithm
    • Often based on contour tracing
    • Not vector friendly; requires access to the entire image
  • A combination of the two approaches
[Figures: 4-connectivity and 8-connectivity neighborhoods. Images from Wikipedia]
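The two-pass approach described above can be sketched in scalar Python as follows. This is one common formulation, assuming a binary input and a union-find structure for the label mappings; the function name `label_components` and its return convention are hypothetical, and this is not the vectorized implementation discussed in the talk.

```python
def label_components(image, connectivity=4):
    """Two-pass labelling of a binary image (lists of 0/1 rows).

    Returns (labels, count), with labels numbered 1..count in raster order.
    """
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # union-find parents; index 0 is reserved for background

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Neighbors already visited in raster order: above and left,
    # plus the two upper diagonals for 8-connectivity.
    offsets = [(-1, 0), (0, -1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1)]

    # First pass: propagate labels and record equivalences.
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            neigh = [labels[y + dy][x + dx] for dy, dx in offsets
                     if 0 <= y + dy and 0 <= x + dx < w
                     and labels[y + dy][x + dx]]
            if not neigh:
                parent.append(len(parent))       # new provisional label
                labels[y][x] = len(parent) - 1
            else:
                smallest = min(neigh)
                labels[y][x] = smallest
                for n in neigh:
                    union(smallest, n)

    # Second pass: replace each provisional label by a compact final label.
    remap = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


# Usage: a U shape gets two provisional labels that merge at the bottom row.
u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
print(label_components(u_shape))        # ([[1, 0, 1], [1, 0, 1], [1, 1, 1]], 1)

diagonal = [[1, 0],
            [0, 1]]
print(label_components(diagonal, 4)[1])  # 2
print(label_components(diagonal, 8)[1])  # 1
```

The U shape shows why the mapping table is needed: the two arms receive different labels in the first pass and are only discovered to be connected at the last pixel.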

Connected Components: Vector Processing
• Checking connection and propagating labels in the down, right-down, or left-down direction is easy. [Figure labels: current vector, above vector, left label]
• Checking connection and propagating labels in the left-right direction potentially requires sequential propagation. The ability to terminate when labels have stopped changing improves performance.
• Many vectors are all background pixels. The ability to skip over entire vectors of background pixels improves performance.

Initial labels: 1 2 2 2 5 5
Label map: 1->1, 2->1, 5->2
Final labels: 1 1 1 1 2 2
Re-labelling is a lookup operation, but the table size may be large depending on how many initial labels were assigned. Fast vector lookup operations can improve performance.
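The slide's example can be reproduced with a small table lookup. This is a minimal sketch; the function name `relabel` and the dense list used as the lookup table are assumptions made for illustration.

```python
def relabel(initial_labels, label_map):
    """Second-pass relabelling as a table lookup.

    label_map maps each initial label to its final label. The table is
    indexed by initial label, so its size follows the largest initial
    label, not the number of final labels.
    """
    table = [0] * (max(label_map) + 1)
    for src, dst in label_map.items():
        table[src] = dst
    return [table[label] for label in initial_labels]


print(relabel([1, 2, 2, 2, 5, 5], {1: 1, 2: 1, 5: 2}))  # [1, 1, 1, 1, 2, 2]
```

Even in this tiny case the table needs six entries to serve three labels, which hints at how the table can grow when the first pass assigns many provisional labels.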

Test image from http://pets2012.net/
Using these vector techniques, performance improves by 30X over highly optimized scalar connected component code.

Example Gesture Recognition Pipeline
1. Preprocess (denoise, contrast)
2. Region of interest (motion detect, skin tone, etc.)
3. Morphological operations (dilate, erode)
4. Bounding box (connected components)
5. Hand/pose detection (classifier)
6. Hand tracking (mean shift, feature detection + optical flow)
7. Gesture recognition

Feature Detection
Detection:
1. Feature response: scale-space creation to search for keypoints (millions of pixels)
2. Find points with extrema in the scale-space response (millions of pixels input, a few hundred or thousand points output)
3. Feature localization (a few hundred or thousand points scattered in scale and space)
   • Interpolation in scale/space
   • Rejecting uninteresting features
Descriptor:
4. Feature descriptor creation (a few hundred or thousand points scattered in the image)
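The extrema search is the step where millions of response values shrink to a short list of points. The sketch below illustrates it under assumptions: it finds strict local maxima only (minima would be handled symmetrically), uses a 3x3x3 neighborhood across adjacent scales, and the name `scale_space_maxima` and the threshold parameter are hypothetical.

```python
def scale_space_maxima(responses, threshold):
    """Return (scale, y, x) for strict local maxima above a threshold.

    responses[s][y][x] is the feature response at scale index s.
    Border scales and border pixels are skipped because they lack
    a full 3x3x3 neighborhood.
    """
    n_scales = len(responses)
    h, w = len(responses[0]), len(responses[0][0])
    keypoints = []
    for s in range(1, n_scales - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                v = responses[s][y][x]
                if v <= threshold:
                    continue                  # cheap dense rejection
                neighbors = [responses[s + ds][y + dy][x + dx]
                             for ds in (-1, 0, 1)
                             for dy in (-1, 0, 1)
                             for dx in (-1, 0, 1)
                             if (ds, dy, dx) != (0, 0, 0)]
                if all(v > n for n in neighbors):
                    keypoints.append((s, y, x))
    return keypoints


# Usage: three 5x5 scales with a single peak in the middle scale.
stack = [[[0] * 5 for _ in range(5)] for _ in range(3)]
stack[1][2][2] = 10
print(scale_space_maxima(stack, threshold=1))   # [(1, 2, 2)]
```

The threshold test runs on every pixel and is regular, while the 26-neighbor comparison runs only on the few survivors, which mirrors the dense-to-sparse transition on the slide.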

Feature Detection: Three Popular Approaches
• SIFT (Scale Invariant Feature Transform)
  • Uses difference of Gaussians to calculate feature response
  • The filters used are symmetric; data is typically 8-bit
• SURF (Speeded Up Robust Features), Fast Hessian
  • Uses difference of boxes to calculate feature response
  • Integral images speed up calculation of the sum of pixels in a box by reducing it to sums and differences
  • Integral data is usually 32-bit
• FAST (Features from Accelerated Segment Test)
  • Uses pixel intensity comparisons to find “interesting” points
  • Finds N contiguous pixels in a circle around the point of interest that are either brighter or darker than the point of interest
  • Operations involve comparisons and bit manipulations; intensities are generally 8-bit
• A good architecture needs to support a range of data types and accelerate a range of operations
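Two of the operations above are compact enough to sketch directly: the integral-image box sum used by SURF and the contiguous-segment test used by FAST. Both are illustrations under assumptions rather than reference implementations. The function names are hypothetical, the 16-pixel circle of radius 3 and the default values for N and the threshold follow common FAST practice rather than anything stated on the slide, and the caller is assumed to keep the test point at least 3 pixels from the border.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows < y and columns < x (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1: four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


# (dx, dy) offsets of the 16-pixel circle of radius 3, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2),
          (-1, -3)]


def is_fast_corner(img, x, y, n=9, t=20):
    """True if n (<= 16) contiguous circle pixels are all brighter than
    img[y][x] + t or all darker than img[y][x] - t."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):                      # brighter, then darker
        flags = [sign * (v - p) > t for v in ring]
        run = 0
        for f in flags + flags:               # doubled to handle wrap-around
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3))   # 45
print(box_sum(ii, 1, 1, 3, 3))   # 28
```

The integral image accumulates sums well beyond 8 bits, while the segment test needs only 8-bit comparisons and flag manipulation, which illustrates the slide's point about supporting a range of data types.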

Mapping to a Vision DSP: Issues and Opportunities
IVP-EP subsystem organization (block diagram):
• Instruction fetch/dispatch: variable length
• Instruction memory: configurable
• Data memory: configurable
• Xtensa control processing (VLIW)
• Scalar execution pipelines
• Load/store units
• mDMA
• On-chip network bridge
• Memory data rotator
• Cross-element select/reduction network
• Repeated slices (four shown), each with: pixel register file, predicates, shift/select, MUL, shift, multiple ALUs, pixel accumulator, user-defined FUs
1. Exploit data locality:
• Compiler-automated vectorization
• Tile manager runtime layer hides integrated mDMA programming
• Vector data types
• Extended native C operators
• State-of-the-art code scheduling
2. Leverage libraries:
• New mappings only when needed
• >700 OpenVX/OpenCV-based functions
3. Use tools in tuning:
• Instant single/multi-core simulation
• Multi-dimensional profiling
• Memory analysis
• User-defined ISA extension

IVP-EP Performance: Up to 4x Boost Over Previous Generation IVP Processor
[Bar chart: speed-up over IVP, vertical axis from 0 to 5]
*with instruction set option package

Connected Components: Performance and Energy Comparison
[Bar chart: relative performance (vertical axis from 0 to 35) of RISC core vs. IVP-EP for frames per second, frames per watt (core), frames per watt (memory), and frames per watt (total)]

Lessons Learned
• More than enough interesting, hard vision problems
• Algorithms are diverse in structure and evolving
• Implementation approach must balance ease/agility of development vs. speed/efficiency of result
• Object detection and tracking is a multi-phase algorithm, with different techniques for exploiting parallelism in each phase
• Tools matter
• Target architecture matters
• Libraries matter
• Superior frames per second, frames per mW, and time-to-solution are achievable
Expanded Code Tuning Checklist
• Measure reference cycle performance/quality
• Convert floating-point types to the lowest-precision fixed-point type that meets the image quality need
• Identify any inherent loop recurrences and move dependencies from inner loops to outer loops
• Decompose deep table lookups into a computed function of shallow table lookups
• Choose the best native scalar/vector data types
• Compile with auto-vectorization
• If necessary, add vector reorganization operations to maximize vector usage
• Use an abstract DMA or “tile manager” library to pre-load/post-store data in the background
• If multiprocessor (MP), partition data and add task/communications library API calls to initiate tasks and coordinate computation across processors
• Use memory tools to validate stack/heap usage
• Measure and compare final performance/energy
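The checklist item on converting floating point to the lowest fixed-point precision that meets the quality need can be illustrated with a small search. This is a sketch under assumptions: the 3-tap filter, the maximum-absolute-error quality metric, and all function names are hypothetical, chosen only to show the idea of trying narrower formats first and keeping the first one that passes.

```python
def to_fixed(coeffs, frac_bits):
    """Quantize float coefficients to integers with frac_bits fractional bits."""
    return [round(c * (1 << frac_bits)) for c in coeffs]


def filter_float(pixels, coeffs):
    """Reference 3-tap filter in floating point (interior samples only)."""
    return [sum(c * p for c, p in zip(coeffs, pixels[i - 1:i + 2]))
            for i in range(1, len(pixels) - 1)]


def filter_fixed(pixels, q_coeffs, frac_bits):
    """Same filter in integer arithmetic, with rounding on the final shift."""
    half = 1 << (frac_bits - 1)
    out = []
    for i in range(1, len(pixels) - 1):
        acc = sum(q * p for q, p in zip(q_coeffs, pixels[i - 1:i + 2]))
        out.append((acc + half) >> frac_bits)
    return out


def smallest_frac_bits(pixels, coeffs, max_err, limit=15):
    """Smallest fractional bit count whose output stays within max_err
    of the floating-point reference, or None if none does up to limit."""
    reference = filter_float(pixels, coeffs)
    for bits in range(1, limit + 1):
        got = filter_fixed(pixels, to_fixed(coeffs, bits), bits)
        if max(abs(g - r) for g, r in zip(got, reference)) <= max_err:
            return bits
    return None


pixels = [0, 64, 128, 255, 32, 16]
print(smallest_frac_bits(pixels, [0.25, 0.5, 0.25], max_err=1.0))   # 2
```

For this filter two fractional bits already represent the coefficients exactly, so the remaining error comes only from rounding the output to an integer pixel value.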

Resources
• Cadence Tensilica Imaging/Video Processor (IVP)
  • http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
• Connected Component Labelling
  • http://en.wikipedia.org/wiki/Connected-component_labeling
• SIFT
  • Lowe, D.G. (2004), "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision 60 (2), http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/lowe_ijcv2004.pdf
  • http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  • http://w3.inf.fu-berlin.de/lehre/SS09/CV/uebungen/uebung09/SIFT.pdf
• SURF
  • http://www.vision.ee.ethz.ch/~surf/papers.html
• FAST
  • Rosten, Edward; Tom Drummond (2005). "Fusing points and lines for high performance tracking". IEEE International Conference on Computer Vision 2, http://edwardrosten.com/work/rosten_2005_tracking.pdf
  • http://en.wikipedia.org/wiki/Features_from_accelerated_segment_test#cite_note-2
© 2014 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and Xtensa are registered trademarks of Cadence
Design Systems, Inc. in the United States and other countries. All other trademarks are the property of their respective owners and are not affiliated with
Cadence.
