Intel FPGAs for AI (Supercomputing 2018)
Scale Your Innovation
Why FPGAs Win in Deep Learning
Enabling real-time AI in a wide range of embedded, edge, and data center applications

First to market to accelerate evolving AI workloads
▪ Precision
▪ Latency
▪ Sparsity
▪ Adversarial networks
▪ Reinforcement learning
▪ Neuromorphic computing
▪ …

Low-latency, memory-constrained workloads
▪ RNN
▪ LSTM
▪ Speech workloads

Delivering AI+ for flexible system-level functionality
▪ AI + I/O ingest
▪ AI + networking
▪ AI + security
▪ AI + pre/post processing
▪ …
FPGAs: Flexible for Evolving Precision

                       ResNet-34 1x Wide     ResNet-34 2x Wide     ResNet-34 3x Wide
Activation  Weight     Eq TOPS   Top-1 Acc   Eq TOPS   Top-1 Acc   Eq TOPS   Top-1 Acc
FP32        FP32       7         0.7359      NR        NR          NR        NR
8-bit       8-bit      8         0.7093      2         NR          1         NR
8-bit       Ternary    43        0.6919      11        NR          5         NR
8-bit       Binary     52        NR          13        NR          6         NR
4-bit       4-bit      18        0.7033      5         0.7453      2         NR
3-bit       3-bit      51        NR          13                    6         NR
2-bit       2-bit      85        0.6793      21        0.7332      9         NR
2-bit       Ternary    98        0.6793      25        0.7332      11        NR
1-bit       1-bit      267       0.6054      67        0.6985      30        0.7238

▪ Explore the precision/accuracy balance
▪ 4X performance gain with the same FPGA
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation
Throughput and Accuracy for various PE configurations on ResNet Topologies
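The low-precision rows in the table come from running activations and weights at reduced bit widths. A minimal sketch of symmetric uniform k-bit quantization is shown below; it is illustrative only and does not reflect Intel's actual PE implementation:

```python
def quantize(x, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits.

    Values are mapped onto 2**(bits-1) - 1 signed levels scaled by the
    absolute maximum, then dequantized back to floats. A toy sketch of
    reduced-precision arithmetic, not Intel's PE design.
    """
    levels = 2 ** (bits - 1) - 1          # e.g. 127 positive levels for 8-bit
    scale = max(abs(v) for v in x) / levels or 1.0
    return [round(v / scale) * scale for v in x]

weights = [0.91, -0.42, 0.07, -0.88, 0.33]
for b in (8, 4, 2):
    q = quantize(weights, b)
    err = max(abs(a - c) for a, c in zip(weights, q))
    print(f"{b}-bit: max error {err:.3f}")
```

Lower bit widths shrink each multiplier, which is why the equivalent TOPS on a fixed FPGA fabric climbs as precision drops, at some cost in Top-1 accuracy.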
FPGAs Solve Memory-Bound Workloads
Mozilla DeepSpeech topology implementation

▪ Intel® Stratix® 10 MX can further reduce latency by directly ingesting the speech signal

Stream Length   FPGA latency (estimated, 16-bit)
1s              0.003s
10s             0.312s
20s             0.624s
40s             1.25s

*Estimations performed by Manjeera Design Systems. Assumption: ~4.4 TOPS of 16-bit compute (8192 MACs at 266 MHz) for Intel Stratix 10 MX.
▪ Intel Stratix 10 MX offers 512 GB/s of bandwidth via multiple integrated HBM stacks
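A back-of-the-envelope check shows why bandwidth, not compute, dominates batch-1 workloads like this. The sketch below uses the slide's figures (~4.4 TOPS of 16-bit compute, 512 GB/s of HBM bandwidth) plus a made-up layer size, not DeepSpeech's real dimensions:

```python
def layer_times(macs, weight_bytes, tops=4.4e12, bw_bytes=512e9):
    """Compare compute time vs weight-streaming time for one batch-1 layer.

    `tops` and `bw_bytes` are the slide's Stratix 10 MX figures; the layer
    shape used below is a hypothetical example.
    """
    t_compute = 2 * macs / tops          # each MAC counts as 2 ops
    t_memory = weight_bytes / bw_bytes   # weights streamed once at batch = 1
    return t_compute, t_memory

# Hypothetical 2048x2048 fully connected layer with 16-bit (2-byte) weights
macs = 2048 * 2048
tc, tm = layer_times(macs, weight_bytes=macs * 2)
print(f"compute: {tc * 1e6:.2f} us, memory: {tm * 1e6:.2f} us")
```

Even with HBM-class bandwidth, streaming the weights takes longer than the arithmetic, which is the "memory wall" the integrated HBM stacks are there to push back.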
AI + Flexible I/O & Networking

▪ Per-chip performance increases when scaled
▪ AI + I/O & networking unlocks nonlinear performance gains through pooling
▪ 2x improvement with ResNet-101

[Diagram: three Intel® Xeon® processors pooling three Intel® Arria® 10 FPGAs]
AI + Pre/Post Processing & Direct I/O Provides Low Latency

FPGAs can perform in-line, real-time acceleration on the data ingest and avoid costly data movement within the system, lowering overall system latency.

[Diagram: data sources feed the FPGA directly; FPGA compute latency replaces the round trip through the Intel® Xeon® processor]
How Intel® FPGAs Enable Deep Learning I/O

▪ Millions of reconfigurable logic elements & routing fabric
▪ Thousands of 20Kb memory blocks & MLABs
▪ Thousands of variable-precision digital signal processing (DSP) blocks
▪ Hundreds of configurable I/O & high-speed transceivers

▪ Programmable datapath
▪ Customized memory structure
▪ Configurable compute
Adapting to Innovation

Many efforts to improve efficiency:
▪ Batching
▪ Reduced bit width
▪ Sparse weights
▪ Sparse activations
▪ Weight sharing
▪ Compact networks

Examples: Sparse CNN [CVPR'15], Spatially Sparse CNN [CIFAR-10 winner '14], Pruning [NIPS'15], TernaryConnect [ICLR'16], BinaryConnect [NIPS'15], Deep Compression [ICLR'16], HashedNets [ICML'15], XNOR-Net, SqueezeNet

Reference topologies: LeNet [IEEE], AlexNet [ILSVRC'12], VGG [ILSVRC'14], GoogLeNet [ILSVRC'14], ResNet [ILSVRC'15]

[Figure: shared-weights illustration, I × W = O]
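Several of these efficiency techniques (pruning, sparse weights) boil down to zeroing small-magnitude weights so hardware can skip them. A minimal magnitude-pruning sketch, illustrative rather than any cited paper's exact method:

```python
def prune(weights, sparsity):
    """Zero the smallest-magnitude weights until `sparsity` fraction are zero.

    A toy version of magnitude pruning; real schemes (e.g. iterative
    pruning with retraining, as in the Pruning/Deep Compression papers
    listed above) are more involved.
    """
    k = int(len(weights) * sparsity)    # number of weights to drop
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.8, -0.05, 0.3, -0.9, 0.02, -0.4, 0.1, 0.6]
pruned = prune(w, 0.5)
nonzero = sum(1 for v in pruned if v != 0.0)
print(pruned, f"-> {nonzero}/{len(w)} weights remain")
```

An FPGA's customizable datapath can then exploit that sparsity pattern directly, which is harder on fixed-function accelerators.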
Performance Improvement over Time

Model       Sept-17 Baseline  Dec-17  Feb-18  Apr-18  Jun-18  Oct-18  Dec-18 (projected)
SqueezeNet  1x                1.13x   1.75x   2.61x   3.89x   4.33x   4.51x
GoogLeNet   1x                1.13x   1.22x   1.46x   3.55x   4.11x   4.50x

▪ Continually adapting the custom data flow, memory hierarchy, and compute enables improved performance with the same power footprint

[Chart: SqueezeNet and GoogLeNet performance (img/s) over time, Batch=1, Jun-17 through Feb-19]
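The multipliers in the table are throughput ratios against the September 2017 baseline; given raw img/s measurements they fall out directly. The values below are invented for illustration, not Intel's measured numbers:

```python
def speedups(throughputs):
    """Normalize a series of throughput measurements (img/s) to the first entry."""
    base = throughputs[0]
    return [round(t / base, 2) for t in throughputs]

# Hypothetical img/s measurements for one model across four releases
print(speedups([266, 300, 465, 694]))  # -> [1.0, 1.13, 1.75, 2.61]
```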
Intel® FPGA Deep Learning Acceleration Suite

Pre-compiled graph architecture:
▪ Configuration engine
▪ Memory reader/writer (4x DDR)
▪ Crossbar connecting primitive blocks (conv, custom*)
▪ Conv PE array
▪ Feature map cache
*Deeper customization options coming soon!

Example topologies*: AlexNet, GoogLeNet, Tiny YOLO, SqueezeNet, VGG16, ResNet-18, ResNet-50, ResNet-101, …
Coming soon: MobileNet, ResNet SSD, SqueezeNet SSD
*More topologies added with every release
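The conv PE array in this graph architecture is essentially a grid of multiply-accumulate units fed from the feature map cache. A scalar sketch of the 2-D convolution those PEs parallelize, illustrative only and not the suite's actual dataflow:

```python
def conv2d(fmap, kernel):
    """Valid-mode 2-D convolution (cross-correlation, as CNNs compute it).

    Each output element is the MAC reduction a single PE would perform;
    the PE array computes many of these in parallel while the feature map
    cache keeps input reuse on-chip.
    """
    fh, fw = len(fmap), len(fmap[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(fh - kh + 1):
        row = []
        for j in range(fw - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += fmap[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(fmap, kernel))  # -> [[-4, -4], [-4, -4]]
```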
OpenVINO™ Toolkit for Intel FPGAs

An all-in-one solution to easily harness the benefits of FPGAs:
▪ Enables developers and data scientists to take their prototype application to production
▪ Utilize API-based & direct coding to maximize performance
▪ Deeper customization capabilities coming soon

The OpenVINO™ Toolkit comprises the Intel Deep Learning Deployment Toolkit (Model Optimizer and Inference Engine) and the Intel FPGA DL Acceleration Suite, supporting today's Intel FPGA-supported deep learning frameworks and heterogeneous CPU/FPGA deployment across Intel Xeon® processors and Intel FPGAs.

Free download: software.intel.com/openvino-toolkit
Your Application Acceleration with FPGA-Powered Platforms

Software tools: OpenVINO™ toolkit. Develop an NN model; deploy across Intel® CPU, GPU, VPU, and FPGA; leverage common algorithms.

Supported platforms for FPGA (interface):
▪ Mustang F-100 (PCIe x8)
▪ Intel Programmable Acceleration Card with Intel Arria® 10 (PCIe x8)
▪ Intel® Arria® 10 Development Kit (PCIe x8)

*Please contact an Intel representative for the complete list of ODM manufacturers. Other names and brands may be claimed as the property of others.
Use Case 1: Search

Solution Search
Looking for a quick path to deploy and accelerate instant reverse image searches of products for retail convenience.

Solution Success
Intel® FPGAs offered real-time AI inferencing using the OpenVINO™ toolkit. This enabled engineers to map neural networks to the FPGA, accelerating image searches with increased throughput and lower latency, all without the need for FPGA programming experience.

Real-time AI optimized for performance, power, and cost:
▪ OpenVINO™ Toolkit: accelerating workloads, enabling deep learning capabilities for smarter and faster ways to transform data for a competitive edge
▪ Intel Programmable Acceleration Card with Intel Arria® 10 FPGA: deployment-ready PCIe-based card with versatile built-in multifunction acceleration capabilities, low power dissipation, and a low-profile form factor
▪ Acceleration Stack for Intel® Xeon® CPU with FPGAs: abstracting programming complexity and maximizing ease of use by hot-swapping accelerators and enabling application portability for Intel FPGA-based acceleration solutions
Use Case 2: Microsoft's AI for Earth

Microsoft leverages the multimode capabilities of Intel FPGAs to push through the memory wall and maximize performance.

Project Brainwave with Intel® Stratix® 10 delivers performance/$: land cover mapping for the whole US (200M images, 20TB) in 10+ minutes for only $42 of compute*.

*Microsoft's blog
Summary

Intel FPGAs enable:
▪ First to market to accelerate evolving AI workloads
▪ Delivering AI+ for flexible system-level functionality
▪ The OpenVINO™ Toolkit is free to download and enables you to deploy on Intel FPGAs directly from TensorFlow or Caffe
▪ Intel's FPGA architecture enables a programmable datapath, custom memory structure, and configurable compute
Resources

Intel FPGA Training:
https://www.intel.com/content/www/us/en/programmable/support/training/overview.html

Get started quickly with:
▪ Find out more online at www.intel.com/ai and www.intel.com/fpga
▪ Intel Tech.Decoded online webinars, tool how-tos & quick tips
▪ Hands-on in-person events
▪ Download the free OpenVINO™ toolkit

Support
▪ Connect with Intel engineers & AI experts via the public Community Forum