"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel

Copyright © 2016 Intel Corporation
Accelerating Deep Learning Using Altera FPGAs
Bill Jenkins
May 3, 2016

Legal Notices and Disclaimers
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service
activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated
using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Differences in
hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance
as you consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.
• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances
and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs
or cost reduction.
• All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product
specifications and roadmaps.
• Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-
looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s
results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
• The products described may contain design defects or errors known as errata which may cause the product to deviate from
published specifications. Current characterized errata are available on request.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the
referenced web site and confirm whether referenced data are accurate.
• Intel, the Intel logo, and Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and
brands may be claimed as the property of others.

Altera and Intel Enhance the FPGA Value Proposition
STRATEGIC RATIONALE
Accelerated FPGA investment
Operational excellence
• Accelerated FPGA innovation from combined R&D scale
• Improved FPGA performance/power via early access and greater optimization of process node advancements
• New, breakthrough Data Center and IoT products harnessing combined FPGA + CPU expertise
• Superior product design capabilities
• Continued excellence in customer service and support
• Increased resources bolster long-term innovation
• Focused, additive investments today

What is Machine Learning?
• Extracting features from data in order to solve predictive problems
  • Image classification & detection
  • Image recognition/tagging
  • Network intrusion detection
  • Fraud / face detection
• Aim is programs that automatically learn to recognize complex patterns and make intelligent decisions based on insight generated from learning
• For accuracy, models must be trained, tested and calibrated to detect patterns using previous experience

When to Apply Machine Learning
• Human expertise is absent
  • Navigating to Pluto
• Humans cannot explain their expertise
  • Speech recognition
• Solution changes over time
  • Tracking traffic
• Solution needs to be adapted to particular cases
  • Medical diagnosis
• Problem is vast in relation to human reasoning capabilities
  • Ranking web pages on Google or Bing

Value Proposition of Machine Learning
[Figure: "Data is the problem". Increasing variety of things × 35ZB/s (volume × velocity = throughput). Separating signal from noise provides value: revenue growth, cost savings, increased margin.]

Convolutional Neural Networks (CNN)
• A network of interconnected neurons, modeled after biological processes, for computing approximate functions
• Layers extract successively higher-level features
• Often want a custom topology to meet specific application accuracy/throughput requirements
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998.

CNN Computation in One Slide
\[
I_{new}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{old}(x + x', y + y') \times F(x', y')
\]
Input Feature Map (Set of 2D Images)
Filter (3D Space)
Output Feature Map
Repeat for Multiple Filters to Create Multiple “Layers” of Output Feature Map
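The computation on this slide can be sketched in a few lines of plain Python. This is only an illustration of the 3x3 sum, not the FPGA implementation: the function names, the list-of-lists layout, and the choice to skip border pixels are all assumptions made for the example.

```python
def convolve3x3(image, kernel):
    """One 2D image and one 3x3 kernel: sum of I_old(x + x', y + y') * F(x', y')
    for x', y' in {-1, 0, 1}."""
    height, width = len(image), len(image[0])
    # Only interior pixels have a full 3x3 neighbourhood, so the output
    # is two rows and two columns smaller than the input.
    output = [[0.0] * (width - 2) for _ in range(height - 2)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # The kernel is stored with indices 0..2, hence the +1 shift.
                    acc += image[y + dy][x + dx] * kernel[dy + 1][dx + 1]
            output[y - 1][x - 1] = acc
    return output


def conv_layer(input_maps, filters):
    """input_maps: list of C 2D images. filters: list of K filters, each a
    list of C 3x3 kernels (a 3D filter). Returns K output feature maps."""
    outputs = []
    for filt in filters:
        # Convolve each input image with its slice of the 3D filter ...
        planes = [convolve3x3(img, k) for img, k in zip(input_maps, filt)]
        # ... then add the per-image results element by element.
        summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*planes)]
        outputs.append(summed)
    return outputs
```

Each additional filter passed to `conv_layer` yields one more "layer" of the output feature map, which is the repetition the slide describes.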

What’s in my FPGA?
• DSPs
  • Dedicated single-precision floating point multiply and accumulators
• Block RAMs
  • Small embedded memories that can be stitched to form an arbitrary memory system
• Programmable Interconnect
  • Programmable logic and routing that can build arbitrary topologies
• Compute architecture with high degree of customization
[Figure: DSP block drawn as a multiplier (X) and an adder (+)]

Why an FPGA for CNN? (Arria 10)
• 1 TFLOP floating point performance in mid-range part
• 35W total device power
• Use every DSP, every clock cycle: compute spatially
• 8 TB/s memory bandwidth to keep the state on chip!
• Exceeds available external bandwidth by factor of 50
• Random access, low latency (2 clks)
• Place all data in on-chip memory: compute temporally
[Figure: multiply-add (X, +) units paired with M20K memory blocks. Fine-grained & low latency between compute and memory]
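A quick back-of-the-envelope check helps put the Arria 10 figures on this slide in relation to each other. The sketch below uses only the quoted numbers. Reading "1 TFLOP" as 1e12 floating-point operations per second and bandwidth as bytes per second are assumptions, as are all variable and function names.

```python
# Figures quoted on the slide (Arria 10).
PEAK_FLOPS = 1e12                 # "1 TFLOP", read here as FLOP/s (assumption)
DEVICE_POWER_W = 35.0             # "35W total device power"
ON_CHIP_BW_BYTES = 8e12           # "8 TB/s memory bandwidth"
ON_CHIP_TO_EXTERNAL_RATIO = 50    # "exceeds external bandwidth by factor of 50"


def implied_external_bandwidth(on_chip_bw, ratio):
    """External bandwidth implied by the on-chip figure and the quoted ratio."""
    return on_chip_bw / ratio


def flops_per_byte(peak_flops, bandwidth_bytes):
    """FLOPs that must be performed per byte moved to keep the DSPs busy."""
    return peak_flops / bandwidth_bytes


external_bw = implied_external_bandwidth(ON_CHIP_BW_BYTES, ON_CHIP_TO_EXTERNAL_RATIO)
gflops_per_watt = PEAK_FLOPS / 1e9 / DEVICE_POWER_W

print(f"implied external bandwidth: {external_bw / 1e9:.0f} GB/s")
print(f"peak efficiency: {gflops_per_watt:.1f} GFLOP/s per W")
print(f"FLOPs per external byte at peak: {flops_per_byte(PEAK_FLOPS, external_bw):.2f}")
print(f"FLOPs per on-chip byte at peak: {flops_per_byte(PEAK_FLOPS, ON_CHIP_BW_BYTES):.3f}")
```

Under these assumptions, running at peak from external memory alone would require several FLOPs of reuse per byte fetched, while the on-chip memories can supply multiple bytes per FLOP. That gap is the motivation for keeping state on chip.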

CNNs on FPGAs — Scalable Architecture

Market Demands Scalability for Machine Learning
Cloud Analytics
• 1000s of Classes
• Large Workloads
• Highly Efficient (Performance / W)
• Varying accuracy
• Server Form Factor
Transportation Safety
• < 10 Classes
• Frame Rate: 15–30fps
• Power: 1W-5W
• Cost: Low
• Varying accuracy
• Custom Form Factor

Different Parallelism in CNN
Old Approach
• Parallelism across the “face” of the kernel window, and across multiple convolution stages
• Low hardware re-use
New Approach
• Parallelism in the depth of the kernel window and across output features
• Defer complex spatial math to random access memory
• Re-use hardware to compute multiple layers

Scalable CNN Computations — In One Slide
[Figure: Filters, three accumulators (accum), Output Feature Map]
“Slide”: No data movement. Addressing an on-chip RAM!

Scalable CNN Architecture on FPGA (1)
[Figure: FPGA block diagram with DDR, Double-Buffer On-Chip RAM, Filters (on-chip RAM), and # of Parallel Convolutions]

• Array size (x, y)
• Clock rate
• External memory bandwidth
Calculated throughput & resource utilization
• Layer descriptions
• Given resource constraints, find optimal architecture
• Ex. AlexNet on A10-115 is 52x26 for 800 img/s @ 350 MHz
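The AlexNet example can be turned into a rough peak-rate estimate. The sketch below assumes that each element of the 52x26 array performs one multiply-accumulate (MAC) per clock and that one MAC counts as two floating-point operations. Neither is stated on the slide, and the function name is hypothetical.

```python
def peak_mac_rate(array_x, array_y, clock_hz):
    """Peak MACs per second, assuming one MAC per array element per clock."""
    return array_x * array_y * clock_hz


ARRAY_X, ARRAY_Y = 52, 26     # array size from the slide
CLOCK_HZ = 350e6              # 350 MHz
IMAGES_PER_SECOND = 800       # quoted AlexNet throughput

mac_units = ARRAY_X * ARRAY_Y
macs_per_second = peak_mac_rate(ARRAY_X, ARRAY_Y, CLOCK_HZ)
flops_per_second = 2 * macs_per_second            # multiply + add (assumption)
mac_slots_per_image = macs_per_second / IMAGES_PER_SECOND

print(f"MAC units: {mac_units}")
print(f"peak: {flops_per_second / 1e9:.1f} GFLOP/s")
print(f"MAC slots available per image: {mac_slots_per_image / 1e6:.1f} million")
```

Under these assumptions the array peaks a little below 1 TFLOP/s, in the same range as the device figure quoted earlier in the deck.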

• Choice of parallelism has large impact on end compute architecture and properties of solution
• Defined a scalable approach to CNNs on the FPGA
• Not tied to specific FPGA device
• Not tied to specific CNN topology
• Design Methodology:
1. Fit largest possible accelerator network on FPGA (52x26 on Arria 10)
• Limited by DSP Blocks & M20K (RAM) Resources
2. Tile network onto available accelerator
• Decompose filter window into 1x1xW vectors for dot product
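The decomposition in step 2 can be illustrated in software. In the sketch below the feature map is stored so that each (y, x) position holds a depth vector of W channels, and the filter window is split into 1x1xW vectors. Every output value is then a sum of dot products, with the spatial offset handled purely by indexing. The data layout and function names are assumptions for illustration, not the actual hardware mapping.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def conv_by_depth_vectors(fmap, filt):
    """fmap[y][x] is a depth vector of W channels; filt[fy][fx] is the
    matching 1x1xW slice of the filter window."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(fmap) - fh + 1
    out_w = len(fmap[0]) - fw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for fy in range(fh):
        for fx in range(fw):
            weights = filt[fy][fx]          # one 1x1xW filter vector
            for y in range(out_h):
                for x in range(out_w):
                    # The window offset (fy, fx) is only an address change
                    # into the stored feature map; the arithmetic is a dot product.
                    out[y][x] += dot(fmap[y + fy][x + fx], weights)
    return out


def conv_direct(fmap, filt):
    """Reference: the same convolution written as a plain nested sum."""
    fh, fw, depth = len(filt), len(filt[0]), len(filt[0][0])
    out_h = len(fmap) - fh + 1
    out_w = len(fmap[0]) - fw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            for fy in range(fh):
                for fx in range(fw):
                    for c in range(depth):
                        out[y][x] += fmap[y + fy][x + fx][c] * filt[fy][fx][c]
    return out
```

Both functions compute the same result. The first simply makes the depth-wise dot product the unit of work, which is the part a fixed array of multiply-accumulate units could execute regardless of filter window size.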

AlexNet Competitive Analysis — Classification
System (Precision, Image, Speed) [1]                  Throughput   Est. Board Power   Throughput / Watt
Arria 10-115 (Current: FP32, Full Size, @275MHz)      575 img/s    ~31W               18.5 img/s/W
Arria 10-115 (Optimized: FP32, Full Size, @350MHz)    750 img/s    ~36W               20.8 img/s/W
Arria 10-115 (Estimate: FP16, Full Size, @350MHz)     900 img/s    ~39W               23.1 img/s/W
Arria 10-115 (Estimate: 21b, Full Size, @350MHz)      1200 img/s   ~40W               30 img/s/W
2 x Arria 10-115, Nallatech 510T Board                2400 img/s   ~75W               32 img/s/W
cuDNN4 on NVIDIA Titan X                              3216 img/s   227W               14.2 img/s/W
Titan X source: NVIDIA Corporation, GPU-Based Deep Learning Inference: A Performance and Power Analysis, November 2015
• Further algorithmic optimization of FPGA possible
• Expect similar ratios for Stratix10 vs. NVIDIA 14nm Pascal
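The throughput-per-watt column follows directly from the other two columns, which a short script can confirm. The numbers below are copied from the table. The approximate ("~") power values are treated as exact for the arithmetic, and the labels and names are shortened for the example.

```python
# (system, throughput in img/s, estimated board power in W), from the table
SYSTEMS = [
    ("Arria 10-115, FP32 @ 275 MHz", 575, 31),
    ("Arria 10-115, FP32 @ 350 MHz", 750, 36),
    ("Arria 10-115, FP16 @ 350 MHz", 900, 39),
    ("Arria 10-115, 21b @ 350 MHz", 1200, 40),
    ("2 x Arria 10-115 (Nallatech 510T)", 2400, 75),
    ("cuDNN4 on NVIDIA Titan X", 3216, 227),
]


def images_per_second_per_watt(throughput, power_w):
    """img/s/W, which is the same quantity as images per joule."""
    return throughput / power_w


efficiency = {name: images_per_second_per_watt(t, p) for name, t, p in SYSTEMS}

for name, value in efficiency.items():
    print(f"{name:36s} {value:5.1f} img/s/W")

# Ratio of the dual-FPGA board to the GPU baseline.
fpga_vs_gpu = (efficiency["2 x Arria 10-115 (Nallatech 510T)"]
               / efficiency["cuDNN4 on NVIDIA Titan X"])
print(f"dual Arria 10 board vs Titan X: {fpga_vs_gpu:.1f}x")
```

On these figures the GPU delivers the highest absolute throughput, while the dual-FPGA board comes out at roughly 2.3 times the images per joule.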

Getting Started with CNNs on FPGAs
High-Performance Machine Learning Desired
Accelerate Computation
• Scale & Speed of Devices
• Better Compute Architecture
• Math Optimization (Winograd, FFT)
• Optimized RTL / HLD
(Current Intel PSG focus, original MSFT focus)
Tune Problem to Platform
• Simplify network topology
• Reduce precision / use fixed point
• Create more local neuron structures
• Integrated training and classification
(Current i-Abra and partner focus)
Not Mutually Exclusive: Combine for Optimal Solution
Overview: Design Flow Using CNN IP
[Figure: design flow. Data Collection, Data Store, Choose Network, Train Network, Execution Engine (diagram labels: Parameters, Selection, Architecture, Altera CNN IP)]
Improvement Strategies
• Collect more data
• Improve network
Choose Network
• Use framework (e.g. Caffe, Torch)
• Choose based on experience or limits of execution engine
Train Network
• An HPC workload
• Requires data to be pre-selected
• Weeks to Months process
Execution Engine
• Implementation of the Neural Network
• Flexibility, performance & power dominate choice

Overview: Design Flow for CNN Using Partner
[Figure: design flow. Data Collection, Data Store, Neural Pathways, Neural Synapse (diagram labels: Parameters, Selection, Architecture, Altera CNN IP)]
Neural Pathways
• Integrated Network selection and training
• Capable of acceleration in FPGA
• Minutes to hours process
Neural Synapse
• Implementation of highly efficient Neural Network
• Built in FPGA fabric with OpenCL

Join Us on Our Journey Together…
How can Intel + Altera help your business grow?
• New opportunities to increase the FPGA value proposition
• Accelerated FPGA investment driving product innovation to increase your performance and productivity
• Increased operational excellence to accelerate time-to-market
• Expanded product portfolio to arm you with new solutions for your most challenging applications
• Come join us at our booth to see a demo of machine learning on FPGAs

Resources
• Altera Website
  • Altera SDK for OpenCL Page (www.altera.com/opencl)
  • Technical Article “Efficient Implementation of Neural Network Systems Built on FPGAs, Programmed with OpenCL” (www.altera.com/deeplearning-tech-article)
  • GPU vs FPGA overview online training (available mid-May)
  • CNN on FPGA whitepaper (available mid-May)
  • “Machine Learning on FPGAs” web page (available mid-May)
• Embedded Vision Alliance Website
  • Technical Article “OpenCL Streamlines FPGA Acceleration of Computer Vision”

Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
© Intel Corporation
Slide 18
Footnote 1. Configurations:
AlexNet configurations on Arria 10-115 FPGAs optimized via IP - tested by Intel PSG
For more information go to https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/arria-10-product-table.pdf

Thank You
