"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel




Bill Jenkins, Senior Product Specialist for High Level Design Tools at Intel, presents the "Accelerating Deep Learning Using Altera FPGAs" tutorial at the May 2016 Embedded Vision Summit.

While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core technology, significant challenges in power, cost, and performance scaling remain. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks because they can combine computing, logic, and memory resources in a single device. Intel's Programmable Solutions Group has developed a scalable convolutional neural network reference design for deep learning systems using the OpenCL programming language, built with our SDK for OpenCL. The design's performance is being benchmarked using several popular CNN benchmarks: CIFAR-10, ImageNet, and KITTI.

Building the CNN with OpenCL kernels allows true scaling of the design from smaller to larger devices and from one device generation to the next. New designs can be sized using different numbers of kernels at each layer. Performance scaling from one generation to the next also benefits from architectural advancements, such as floating-point engines and frequency scaling. Thus, you achieve greater-than-linear performance and performance-per-watt scaling with each new series of devices.



  1. Accelerating Deep Learning Using Altera FPGAs. Bill Jenkins. May 3, 2016. Copyright © 2016 Intel Corporation.
  2. Legal Notices and Disclaimers • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at, or from the OEM or retailer. No computer system can be absolutely secure. • Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit • Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. • All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. • Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. • The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. • Intel, the Intel logo, and Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
  3. Altera and Intel Enhance the FPGA Value Proposition • Strategic rationale: accelerated FPGA investment and operational excellence • Accelerated FPGA innovation from combined R&D scale • Improved FPGA performance/power via early access and greater optimization of process node advancements • New, breakthrough Data Center and IoT products harnessing combined FPGA + CPU expertise • Superior product design capabilities • Continued excellence in customer service and support • Increased resources bolster long-term innovation • Focused, additive investments today
  4. What is Machine Learning? • Extracting features from data in order to solve predictive problems: image classification and detection, image recognition/tagging, network intrusion detection, fraud/face detection • The aim is programs that automatically learn to recognize complex patterns and make intelligent decisions based on insight generated from learning • For accuracy, models must be trained, tested, and calibrated to detect patterns using previous experience
  5. When to Apply Machine Learning • Human expertise is absent (navigating to Pluto) • Humans cannot explain their expertise (speech recognition) • The solution changes over time (tracking traffic) • The solution needs to be adapted to particular cases (medical diagnosis) • The problem is vast in relation to human reasoning capabilities (ranking web pages on Google or Bing)
  6. Value Proposition of Machine Learning • Increasing variety of things × volume × velocity = throughput (35 ZB/s) • Separating signal from noise provides value; data is the problem • Value: revenue growth, cost savings, increased margin
  7. Convolutional Neural Networks (CNN) • A network of interconnected neurons, modeled after biological processes, for computing approximate functions • Layers extract successively higher levels of features • Often a custom topology is wanted to meet specific application accuracy/throughput requirements • Reference: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, 1998
  8. CNN Computation in One Slide • I_new(x, y) = Σ_{y′=−1..1} Σ_{x′=−1..1} I_old(x + x′, y + y′) × F(x′, y′) • Input feature map (set of 2D images), filter (3D space), output feature map • Repeat for multiple filters to create multiple "layers" of the output feature map
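As a sanity check on the formula above, here is a minimal NumPy sketch of the per-pixel convolution (the function name, zero padding, and single 3x3 filter are illustrative assumptions, not the reference design):

```python
import numpy as np

def conv2d_3x3(I_old, F):
    """I_new(x, y) = sum over x', y' in [-1, 1] of
    I_old(x + x', y + y') * F(x', y'), with zero padding at the borders."""
    H, W = I_old.shape
    padded = np.pad(I_old, 1)          # zero-pad so output size matches input
    I_new = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            # padded[y:y+3, x:x+3] is the 3x3 window centred on (x, y) in I_old
            I_new[y, x] = np.sum(padded[y:y+3, x:x+3] * F)
    return I_new
```

Applying a stack of such filters to the same input produces the multiple output-feature-map layers the slide describes; a full CNN layer additionally sums over the input-feature-map depth.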
  9. What's in my FPGA? • DSPs: dedicated single-precision floating-point multipliers and accumulators • Block RAMs: small embedded memories that can be stitched together to form an arbitrary memory system • Programmable interconnect: programmable logic and routing that can build arbitrary topologies • A compute architecture with a high degree of customization
  10. Why an FPGA for CNN? (Arria 10) • 1 TFLOP floating-point performance in a mid-range part • 35 W total device power • Use every DSP, every clock cycle: compute spatially • 8 TB/s memory bandwidth to keep the state on chip, exceeding available external bandwidth by a factor of 50 • Random access, low latency (2 clocks) • Place all data in on-chip memory: compute temporally • Fine-grained, low-latency coupling between compute and memory
  11. CNNs on FPGAs — Scalable Architecture
  12. Market Demands Scalability for Machine Learning • Cloud analytics: 1000s of classes, large workloads, highly efficient (performance/W), varying accuracy, server form factor • Transportation safety: < 10 classes, frame rate 15–30 fps, power 1–5 W, low cost, varying accuracy, custom form factor
  13. Different Parallelism in CNN • Old approach: parallelism across the "face" of the kernel window and across multiple convolution stages; low hardware re-use • New approach: parallelism in the depth of the kernel window and across output features; defer complex spatial math to random-access memory; re-use hardware to compute multiple layers
  14. Scalable CNN Computations — In One Slide • Accumulators "slide" across the output feature map • No data movement: just addressing an on-chip RAM
  15. Scalable CNN Architecture on FPGA (1) • Double-buffered on-chip RAM, fed from DDR • Filters stored in on-chip RAM • Scaled by the number of parallel convolutions
  16. Scalable CNN Architecture on FPGA (2) • Inputs: array size (x, y), clock rate, external memory bandwidth, layer descriptions • Output: calculated throughput and resource utilization • Given resource constraints, find the optimal architecture • Example: AlexNet on an A10-115 is 52x26 for 800 img/s @ 350 MHz
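The sizing exercise on this slide can be approximated with back-of-the-envelope arithmetic. The sketch below assumes each element of the array retires one multiply-accumulate per cycle; the efficiency factor and MAC-per-image count are made-up illustrative numbers, not values from the presentation:

```python
def estimate_throughput(array_x, array_y, clock_hz, macs_per_image, efficiency=0.8):
    """Rough images/s estimate for an array_x * array_y grid of MAC units,
    each doing one multiply-accumulate per cycle at the given efficiency."""
    macs_per_second = array_x * array_y * clock_hz * efficiency
    return macs_per_second / macs_per_image

# A 52x26 array at 350 MHz on a hypothetical ~0.7 GMAC/image network
print(f"{estimate_throughput(52, 26, 350e6, 0.7e9):.0f} img/s")  # prints "541 img/s"
```

This lands in the rough vicinity of the slide's 800 img/s figure; the real sizing tool also accounts for external memory bandwidth and per-layer utilization, which this toy model ignores.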
  17. Scalable CNN Architecture on FPGA (3) • The choice of parallelism has a large impact on the end compute architecture and the properties of the solution • Defines a scalable approach to CNNs on the FPGA: not tied to a specific FPGA device, not tied to a specific CNN topology • Design methodology: 1. Fit the largest possible accelerator network on the FPGA (52x26 on Arria 10), limited by DSP block and M20K (RAM) resources; 2. Tile the network onto the available accelerator, decomposing each filter window into 1x1xW vectors for dot products
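Step 2's decomposition of a filter window into 1x1xW depth vectors can be sketched as follows (the shapes and function name are illustrative assumptions):

```python
import numpy as np

def conv_point_by_depth_vectors(window, filt):
    """Compute one output pixel from a KxKxD input window and a matching
    KxKxD filter by accumulating K*K depth-wise (1x1xD) dot products."""
    K = filt.shape[0]
    acc = 0.0
    for y in range(K):
        for x in range(K):
            # each step is one 1x1xW vector dot product along the depth axis
            acc += np.dot(window[y, x, :], filt[y, x, :])
    return acc
```

Mathematically this equals np.sum(window * filt); the point of the decomposition is that each depth-vector dot product is a fixed-size unit of work the accelerator array can execute in parallel, matching the "parallelism in the depth of the kernel window" described earlier.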
  18. AlexNet Competitive Analysis — Classification System (precision, image, speed)¹
      • Arria 10-115 (current: FP32, full size, @275 MHz): 575 img/s, ~31 W, 18.5 img/s/W
      • Arria 10-115 (optimized: FP32, full size, @350 MHz): 750 img/s, ~36 W, 20.8 img/s/W
      • Arria 10-115 (estimate: FP16, full size, @350 MHz): 900 img/s, ~39 W, 23.1 img/s/W
      • Arria 10-115 (estimate: 21-bit, full size, @350 MHz): 1200 img/s, ~40 W, 30 img/s/W
      • 2x Arria 10-115, Nallatech 510T board: 2400 img/s, ~75 W, 32 img/s/W
      • cuDNN4 on NVIDIA Titan X: 3216 img/s, 227 W, 14.2 img/s/W (source: NVIDIA Corporation, GPU-Based Deep Learning Inference: A Performance and Power Analysis, November 2015)
      • Further algorithmic optimization of the FPGA is possible • Expect similar ratios for Stratix 10 vs. NVIDIA 14nm Pascal
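The img/s/W column is simply throughput divided by estimated board power; a quick check of a few of the published rows:

```python
# (throughput img/s, estimated board power W) pairs taken from the slide
rows = {
    "Arria 10-115 (FP32 @275 MHz)": (575, 31),
    "Arria 10-115 (FP32 @350 MHz)": (750, 36),
    "2x Arria 10-115 (Nallatech 510T)": (2400, 75),
    "cuDNN4 on NVIDIA Titan X": (3216, 227),
}
for name, (img_s, watts) in rows.items():
    # efficiency is throughput per watt of board power
    print(f"{name}: {img_s / watts:.1f} img/s/W")
# prints 18.5, 20.8, 32.0 and 14.2 img/s/W, matching the table
```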
  19. Getting Started with CNNs on FPGAs • High-performance machine learning desired: accelerate the computation • Path 1, scale and speed of devices via a better compute architecture: math optimization (Winograd, FFT), optimized RTL/HLD (current Intel PSG focus, original MSFT focus) • Path 2, tune the problem to the platform: simplify the network topology, reduce precision / use fixed point, create more local neuron structures, integrate training and classification (current i-Abra and partner focus) • Not mutually exclusive: combine for the optimal solution
  20. Overview: Design Flow Using CNN IP • Flow: data collection, data store, choose network, train network, execution engine; improvement strategies: collect more data, improve the network • Choose network: use a framework (e.g., Caffe, Torch); choose based on experience or the limits of the execution engine • Train network: an HPC workload; requires data to be pre-selected; a weeks-to-months process • Execution engine: the implementation of the neural network (Altera CNN IP); flexibility, performance, and power dominate the choice
  21. Overview: Design Flow for CNN Using Partner • Neural Pathways: integrated network selection and training; capable of acceleration in FPGA; a minutes-to-hours process • Neural Synapse: implementation of a highly efficient neural network, built in FPGA fabric with OpenCL (Altera CNN IP)
  22. Join Us on Our Journey Together… • New opportunities to increase the FPGA value proposition • Accelerated FPGA investment driving product innovation to increase your performance and productivity • Increased operational excellence to accelerate time-to-market • Expanded product portfolio to arm you with new solutions for your most challenging applications • Come join us at our booth to see a demo of machine learning on FPGAs • How can Intel + Altera help your business grow?
  23. Resources • Altera website • Altera SDK for OpenCL page • Technical article "Efficient Implementation of Neural Network Systems Built on FPGAs, Programmed with OpenCL" • GPU vs. FPGA overview online training (available mid-May) • CNN on FPGA whitepaper (available mid-May) • "Machine Learning on FPGAs" web page (available mid-May) • Embedded Vision Alliance website • Technical article "OpenCL Streamlines FPGA Acceleration of Computer Vision"
  24. Legal Notices and Disclaimers • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer. • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. • Slide 18, footnote 1 configurations: AlexNet configurations on Arria 10-115 FPGAs, optimized via IP; tested by Intel PSG. © Intel Corporation
  25. Thank You