Deep Learning Training At Scale
Spring Crest Deep Learning Accelerator
(Intel® Nervana™ NNP-T)
Andrew Yang | 8/16/2019
Legal notices & disclaimers
• This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest
forecast, schedule, specifications and roadmaps.
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system
can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to
evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances
will vary. Intel does not guarantee any costs or cost reduction.
• Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed
discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance The products
described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
• Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804.
• All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• ​Performance results are based on testing as of August 1, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely
secure.
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or
component can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/.
• Intel, the Intel logo, Intel Inside, Nervana, and others are trademarks of Intel Corporation in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• © 2019 Intel Corporation.
• Real-world performance - time to train and power
efficiency
• Compute used for largest models & training sets
doubles every 3.5 months**
• GEMMs and convolutions account for the vast majority of
computation required for deep neural networks.
• Primary factors that drive a DL training accelerator
• Power
• Compute
• Memory & Communication
• Scale-out
Deep Learning Training
* Based on analysis of ResNet, FaceNet, Open NMT
[Figures: total computation breakdown* (MAC operations >99%, other <1%); a deep neural network with input layer, hidden layers 1-4, and output layer]
**https://openai.com/blog/ai-and-compute/
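The claim that MAC operations dominate can be sanity-checked with a back-of-the-envelope count. The sketch below uses hypothetical layer sizes (chosen only for illustration) and tallies multiply-accumulates against bias/activation work for a small fully connected network:

```python
# Rough MAC count for a small fully connected network. Layer sizes are
# hypothetical, chosen only to illustrate the ratio.
layers = [(784, 4096), (4096, 4096), (4096, 4096), (4096, 10)]

macs = sum(i * o for i, o in layers)       # GEMM multiply-accumulates
other = sum(o for _, o in layers) * 2      # rough count: bias adds + activations

frac = macs / (macs + other)
print(f"MAC share of total ops: {frac:.3%}")  # > 99%
```

Even modest layer widths push the MAC share past 99%, which is why a training accelerator is built around matrix-multiply throughput.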
• Train a network as fast as possible within a given
power budget, targeting larger models and datasets
• Balance between Compute, Communication, &
Memory
• Re-use on-die data as much as possible
• Optimize for batched workloads
• Built-in scale-out support
• Support future workloads
Spring Crest (NNP-T) Architecture Direction
• PCIe Gen 4 x16 EP
• 4x HBM2
• 64 lanes SerDes
• 24 Tensor Processors
• Up to 119 TOPS
• 60 MB on-chip
distributed memory
• Management CPU and
Interfaces
• 2.5D packaging
Spring Crest (NNP-T) SoC
[Block diagram: 24 TPCs in a 2-D grid; four HBM2 stacks, each with its memory controller and PHY; PCIe Gen 4 x16 with DMA; and four 8x ICL SerDes blocks attached through two crossbars]
• TSMC CLN16FF+
• 680mm2, 1200mm2 interposer
• 27 Billion Transistors
• 60mm x 60mm/6-2-6 3325 pin BGA
package
• 4x8GB HBM2-2400 memory
• Up to 1.1 GHz core frequency
• 64 lanes SerDes HSIO up to
3.58Tbps aggregate BW
• PCIe Gen 4 x16, SPI, I2C, GPIOs
• PCIe & OAM form factors
• Air-cooled, 150-250W typical
workload power
Spring Crest (NNP-T) Implementation
NNP-T Software Stack
• Full software stack built with open components
• Direct integration with DL frameworks
• nGraph: Hardware-agnostic deep learning library
and compiler for DL platform developers.
• Provides common set of optimizations for NNP-T across
DL frameworks
• Argon: NNP-T DNN compute & communication
kernel library
• Low-level programmability: NNP-T kernel
development toolchain w/tensor compiler
[Diagram: software stack — Argon DNN kernel library, kernel-mode driver, and board/chip firmware; built from open-source components]
• Flexible and programmable Tensor-based ISA
• Limited instruction set
• Extensible with uController custom instructions
• Same distributed programming model for both intra- and
inter-chip communication
• Explicit SW memory management and message passing
• Synchronization primitives
• Compute has affinity to local data
• DL workloads are dominated by a limited set of operations
NNP-T Programming Model
Tensor Processing Cluster (TPC)
[Diagram: TPC microarchitecture — uController and control path, convolution engine, two 32x32 multiply arrays with PreOp/PostOp stages and partial-product accumulation, local memory banks, and neighbor links]
• Bfloat16 w/ FP32 accumulation
• No sacrifice in SOTA accuracy, improved power efficiency & training time*
• Minimal model changes
Bfloat16 Numerics
Numeric format    | Area/power efficient | Easy to converge?   | Notes
FP32              | No                   | Yes                 | Industry standard
FP16              | Yes                  | Medium              | Good precision for Fprop; Bprop/update is more challenging
16b integer       | Yes                  | No                  | Better area/power than FP, but hard to use
Bfloat16          | Yes                  | Yes                 | Good at Fprop, Bprop, and update
8/4/2/1b integer  | Extreme              | Extremely difficult | Research areas
* Facebook and Intel joint paper - A Study of BFLOAT16 for Deep Learning Training: https://arxiv.org/pdf/1905.12322.pdf
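Bfloat16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, which is why existing FP32 models usually need only minimal changes. A minimal sketch of the conversion (round-to-nearest-even on the upper 16 bits of the float32 encoding; NaN handling omitted):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Convert a float32 value to bfloat16 by keeping the top 16 bits of
    its encoding, with round-to-nearest-even. NaN handling omitted."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # round to nearest even
    bf16_bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bf16_bits))[0]

print(to_bfloat16(3.141592653589793))  # 3.140625 -- ~3 decimal digits, full FP32 exponent range
```

The exponent range matching FP32 is what makes Bprop/update gradients behave, where FP16's narrower range forces loss scaling.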
• Bfloat16 Matrix Multiply Core (32x32)
• FP32 & BF16 support for all other operations
• 2x multiply cores per TPC to amortize SoC
resources
• Vector operations for non-GEMM
• Compound pipeline
• DL specific optimizations
• Activation functions, RNG, Reductions &
accumulations
• Programmable FP32 look-up tables
Spring Crest Compute
[Diagram: compound pipeline — PreOp stages feeding the 32x32 multiply array, partial-product accumulation, and PostOps, with inputs A/B/EW/PP and outputs Out0/Out1]
• Four stacks of HBM2-2400 8GB devices
• 1.22TBps raw bandwidth and 32GB total device memory. ECC
protected
• 2.5MB/TPC of local scratchpad memory
• 60 MB total distributed memory, ECC protected
• Native Tensor Transpose
• Simultaneous read and write on each MRB
• 1.4TBps local read/write bandwidth per-TPC
• Support for direct memory to memory transfer for both HBM
and MRB
Memory Subsystem
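One way to see why 1.22 TBps of HBM bandwidth is paired with heavy on-die reuse is a simple roofline check. The sketch below uses the peak figures from these slides and computes a GEMM's arithmetic intensity under the optimistic assumption that each operand crosses HBM exactly once — reuse in the 60 MB of distributed local memory is what keeps real traffic near this lower bound:

```python
# Peak numbers from these slides: up to 119 TOPS and 1.22 TBps raw HBM BW.
PEAK_FLOPS = 119e12
PEAK_BW = 1.22e12                 # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW      # ~97.5 FLOPs/byte needed to be compute-bound

def gemm_intensity(m, n, k, bytes_per_elem=2):   # bfloat16 operands
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # A, B read; C written once
    return flops / traffic

for shape in [(1024, 700, 512), (4096, 7133, 4096)]:
    ai = gemm_intensity(*shape)
    bound = "compute" if ai > RIDGE else "memory"
    print(shape, f"AI = {ai:.0f} FLOPs/byte -> {bound}-bound")
```

In practice a tiled GEMM re-reads operand tiles many times, so without on-die reuse the actual traffic — and hence memory-boundedness — is far worse than this ideal single-pass model.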
Spring Crest On-die Communication
• Bidirectional 2-D mesh architecture
to allow any to any communication
• Prioritized for throughput and
congestion avoidance
• Cut-through forwarding and multi-
cast support
• 2.6TBps total cross-sectional BW,
1.3TBps per-direction
• All peripheral devices shared
through the mesh (HBM, SerDes)
• Separate meshes for different
traffic types
• Support for direct peer to peer
communication between TPCs
[Diagram: 2-D mesh of on-chip routers (OCRs) linking the TPC array to four 8-channel HBM controllers, the PCIe interface, and two inter-chip crossbars]
• Run the largest models across multiple
chips and across chassis
• 16 quads of 112Gbps. 3.58Tbps total bi-
directional BW per chip
• Fully programmable router w/multi-cast
support enables multiple glue-less
topologies
• Reliable transmission
• Virtual channels and priorities for
traffic management
• Direct low-latency local memory transfer
• Support for up to 1024 nodes
Scale-Out
• DL workloads require a variety of GEMM
sizes
• Minimize HBM memory boundedness
with on-die data re-use => higher
utilization of compute resources
• Faster training, fewer idle resources
• ~2x better than published utilization of
competitive products
Spring Crest Single-chip GEMM Performance
GEMM Size           | Spring Crest Utilization
1024 x 700 x 512    | 31.1%
1760 x 7133 x 1760  | 44.5%
2048 x 7133 x 2048  | 46.7%
2560 x 7133 x 2560  | 57.1%
4096 x 7133 x 4096  | 57.4%
5124 x 9124 x 2048  | 55.5%
* All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900MHz core clock and 2GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10GHz with 64 GB of system memory
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
** Other names and brands may be claimed as the property of others. Deepbench data from: https://github.com/baidu-research/DeepBench/blob/master/results/train/DeepBench_NV_V100.xlsx
Based on NVIDIA DGX-1, NVIDIA V100 GPU, Linux Kernel 4.4.0-124-generic, CUDA 10.0.130, CuDNN 7.3.1.20, NVIDIA Driver 410.48, Intel® Xeon® CPU ES-2698v4@2.2GHz
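For context, a utilization figure like those in the table is just achieved FLOP/s over peak. The measured time in this sketch is invented for illustration (not a published number); only the formula is the point:

```python
PEAK_FLOPS = 119e12   # "up to 119 TOPS" from the slides

def gemm_utilization(m, n, k, measured_seconds):
    achieved = 2 * m * n * k / measured_seconds   # FLOP/s actually delivered
    return achieved / PEAK_FLOPS

# Hypothetical: a 2048 x 7133 x 2048 bfloat16 GEMM finishing in 1.1 ms
print(f"{gemm_utilization(2048, 7133, 2048, 1.1e-3):.1%}")  # ~45.7%
```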
Description                        | Spring Crest Utilization
c64xh56xw56_k64xr3xs3_st1_n128     | 86%
c128xh28xw28_k128xr3xs3_st1_n128   | 71%
c512xh28xw28_k128xr1xs1_st1_n128   | 65%
c128xh28xw28_k512xr1xs1_st1_n128   | 59%
c256xh14xw14_k1024xr1xs1_st1_n128  | 62%
c256xh28xw28_k512xr1xs1_st2_n128   | 71%
c32xh120xw120_k64xr5xs5_st1_n128   | 87%
Spring Crest Convolutions
• Various convolution hyperparameters are required by DL workloads
• Support multiple tensor layouts for maximum on-die data reuse, resulting in higher
compute efficiency
[Bar chart: convolution performance — utilization (0-100%) for the benchmark shapes listed in the table]
C = # input channels, H = height, W = width, K = # filters, R = filter height, S = filter width, ST = stride, N = minibatch size
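The benchmark names encode the hyperparameters from the legend, so the compute cost of each case can be recovered mechanically. A small parser plus a MAC count — assuming 'same'-style padding so the output spatial size is H/ST x W/ST (an assumption; padding is not stated on the slide):

```python
import re

def parse_conv(desc):
    """Parse a benchmark name like c64xh56xw56_k64xr3xs3_st1_n128 using the
    legend on the slide, and count MACs assuming 'same'-style padding."""
    m = re.match(r"c(\d+)xh(\d+)xw(\d+)_k(\d+)xr(\d+)xs(\d+)_st(\d+)_n(\d+)", desc)
    c, h, w, k, r, s, st, n = map(int, m.groups())
    oh, ow = h // st, w // st                # output spatial size (padding assumed)
    macs = n * k * oh * ow * c * r * s       # one forward pass
    return {"c": c, "h": h, "w": w, "k": k, "r": r, "s": s, "st": st, "n": n,
            "macs": macs}

info = parse_conv("c64xh56xw56_k64xr3xs3_st1_n128")
print(f"{info['macs'] / 1e9:.1f} GMACs per forward pass")  # 14.8 GMACs
```

The two highest-utilization entries in the table (86% and 87%) are also the most MAC-dense shapes, consistent with the data-reuse argument above.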
• Benchmarked on ring topology (intra and inter-
chassis)
• Support for different All-reduce algorithms with
different communication patterns
Spring Crest Communication Performance
Communication Kernels      | Within-chassis (8 cards)  | Cross-chassis (16, 32 cards)
                           | Spring Crest, 16x ICL     | Spring Crest, 16x ICL
2-card Send/Recv BW (GB/s) | 161                       | 161
Allreduce BW (GB/s)        | 151                       | 151
Broadcast BW (GB/s)        | 147                       | 147
• Low overhead and direct memory transfer result in low latency
• High efficiency even at moderate transfer sizes
• Cross-chassis scale-out with the same network/connectivity
Communication Performance
Allreduce Latency, 2KB (ms) | Spring Crest, 16x ICL
2 cards (in-chassis)        | 3
4 cards (in-chassis)        | 8
8 cards (in-chassis)        | 9
16 cards (cross-chassis)    | 30
32 cards (cross-chassis)    | 36

Allreduce Data Rate (GB/s)
Message Size | Spring Crest (8 chips) | Spring Crest (32 chips)
1 MB         | 68.7                   | 39.9
8 MB         | 115.8                  | 92.2
32 MB        | 137.5                  | 130.2
128 MB       | 147.1                  | 147.4
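The shape of the data-rate table — lower rates for small messages, convergence toward line rate for large ones — is what a standard alpha-beta model of ring allreduce predicts: each of the 2(P-1) steps pays a fixed latency plus a bandwidth term. The constants below are illustrative guesses, not measured NNP-T values:

```python
def allreduce_rate(msg_bytes, p, link_gb_s=160.0, step_latency_s=1e-6):
    """Effective ring-allreduce bandwidth (GB/s) under an alpha-beta model:
    2*(P-1) steps, each moving msg_bytes/P over a link.
    link_gb_s and step_latency_s are illustrative guesses."""
    steps = 2 * (p - 1)
    step_time = step_latency_s + (msg_bytes / p) / (link_gb_s * 1e9)
    moved = msg_bytes * 2 * (p - 1) / p      # total bytes crossing each link
    return moved / (steps * step_time) / 1e9

for mb in (1, 8, 32, 128):
    print(f"{mb:>4} MB, 32 chips: {allreduce_rate(mb * 2**20, 32):.1f} GB/s")
```

Small messages are latency-bound (the fixed per-step cost dominates), which matches the drop from 147 GB/s at 128 MB to under 40 GB/s at 1 MB on 32 chips.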
• Domain specific acceleration has a place in DL training
• Training time and model size continue to be bottlenecks
• Numerics and compute tailored for DL
• No legacy workloads to support
• Architected from the ground up to reduce data movement and keep
compute units fed
• Higher utilization and efficiency on micro-benchmarks translate into
better overall workload performance => faster, more power-efficient
training
Concluding Remarks
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 

Deep Learning Training at Scale: Spring Crest Deep Learning Accelerator

  • 1. Deep Learning Training At Scale: Spring Crest Deep Learning Accelerator (Intel® Nervana™ NNP-T). Andrew Yang | 8/16/2019
  • 2. Legal notices & disclaimers: standard Intel notices on products in development, performance and benchmark testing, forward-looking statements, the Optimization Notice (Revision #20110804), and trademarks (Intel, the Intel logo, Intel Inside, Nervana). Performance results are based on testing as of August 1, 2019. *Other names and brands may be claimed as the property of others. © 2019 Intel Corporation.
  • 3. Deep Learning Training
    - Real-world performance: time to train and power efficiency
    - Compute used for the largest models & training sets doubles every 3.5 months**
    - GEMMs and convolutions dominate the computation required for deep neural networks
    - Primary factors that drive a DL training accelerator: power, compute, memory & communication, scale-out
    [Figures: a deep neural network diagram (input layer, hidden layers 1-4, output layer) and a total-computation breakdown* showing MAC operations at >99%]
    * Based on analysis of ResNet, FaceNet, OpenNMT
    ** https://openai.com/blog/ai-and-compute/
  • 4. Spring Crest (NNP-T) Architecture Direction
    - Train a network as fast as possible within a given power budget, targeting larger models and datasets
    - Balance between compute, communication & memory
    - Re-use on-die data as much as possible
    - Optimize for batched workloads
    - Built-in scale-out support
    - Support future workloads
  • 5. Spring Crest (NNP-T) SoC
    - PCIe Gen 4 x16 EP
    - 4x HBM2
    - 64 lanes of SerDes
    - 24 Tensor Processors (TPCs)
    - Up to 119 TOPS
    - 60 MB on-chip distributed memory
    - Management CPU and interfaces
    - 2.5D packaging
    [Figure: SoC block diagram; a grid of 24 TPCs surrounded by four HBM PHY/MC blocks, PCIe Gen 4 x16 with DMA, crossbars, and 64 SerDes lanes grouped into 8x ICL blocks]
  • 6. Spring Crest (NNP-T) Implementation
    - TSMC CLN16FF+; 680 mm2 die, 1200 mm2 interposer
    - 27 billion transistors
    - 60 mm x 60 mm, 6-2-6, 3325-pin BGA package
    - 4x 8 GB HBM2-2400 memory
    - Up to 1.1 GHz core frequency
    - 64 SerDes HSIO lanes, up to 3.58 Tbps aggregate BW
    - PCIe Gen 4 x16, SPI, I2C, GPIOs
    - PCIe & OAM form factors
    - Air-cooled; 150-250 W typical workload power
  • 7. NNP-T Software Stack
    - Full software stack built with open components
    - Direct integration with DL frameworks
    - nGraph: hardware-agnostic deep learning library and compiler for DL platform developers; provides a common set of optimizations for NNP-T across DL frameworks
    - Argon: NNP-T DNN compute & communication kernel library
    - Low-level programmability: NNP-T kernel development toolchain with tensor compiler
    [Figure: stack diagram; DL frameworks over nGraph, the Argon DNN kernel library, kernel-mode driver, and board/chip firmware, with open-source components highlighted]
  • 8. NNP-T Programming Model
    - Flexible and programmable tensor-based ISA; limited instruction set, extensible with uController custom instructions
    - Same distributed programming model for both intra- and inter-chip communication
    - Explicit SW memory management and message passing
    - Synchronization primitives
    - Compute has affinity to local data
    - DL workloads are dominated by a limited set of operations
  • 10. BFloat16 Numerics
    - Bfloat16 with FP32 accumulation
    - No sacrifice in SOTA accuracy; improved power efficiency & training time*
    - Minimal model changes

    Numeric format      | Area/power efficient | Easy to converge?   | Notes
    FP32                | No                   | Yes                 | Industry standard
    FP16                | Yes                  | Medium              | Good precision for fprop; bprop/update is more challenging
    16b integer formats | Yes                  | No                  | Better area/power than FP, but hard to use
    Bfloat16            | Yes                  | Yes                 | Good at fprop, bprop, and update
    8/4/2/1b integer    | Extreme              | Extremely difficult | Research areas

    * Facebook and Intel joint paper, "A Study of BFLOAT16 for Deep Learning Training": https://arxiv.org/pdf/1905.12322.pdf
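Bfloat16 keeps FP32's sign bit and 8-bit exponent and truncates the mantissa to 7 bits, which is why it preserves FP32's dynamic range while giving up precision. A minimal NumPy sketch of the conversion (illustrative only, not NNP-T code; the round-to-nearest-even step is an assumption about typical hardware behavior):

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Round FP32 values to bfloat16 (kept in an FP32 container).

    bfloat16 retains FP32's sign and 8-bit exponent but only 7
    mantissa bits, so range is preserved while precision drops.
    """
    bits = x.astype(np.float32).view(np.uint32)
    # round-to-nearest-even on the 16 bits being discarded
    rounded = bits + 0x7FFF + ((bits >> 16) & np.uint32(1))
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([1.0, 3.14159265], dtype=np.float32)
print(to_bfloat16(x))  # pi collapses to 3.140625 in bfloat16
```

Exactly representable values (like 1.0) survive unchanged; everything else lands on the nearest 7-bit-mantissa value, which training tolerates well per the paper cited above.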
  • 11. Spring Crest Compute
    - Bfloat16 matrix multiply core (32x32)
    - FP32 & BF16 support for all other operations
    - 2x multiply cores per TPC to amortize SoC resources
    - Vector operations for non-GEMM work
    - Compound pipeline
    - DL-specific optimizations: activation functions, RNG, reductions & accumulations
    - Programmable FP32 look-up tables
    [Figure: compute pipeline; A and B operands pass through pre-ops into the 32x32 multiply array, with partial products, post-ops, and an elementwise path feeding two outputs]
  • 12. Memory Subsystem
    - Four stacks of 8 GB HBM2-2400 devices: 1.22 TBps raw bandwidth and 32 GB total device memory, ECC protected
    - 2.5 MB per TPC of local scratchpad memory; 60 MB total distributed memory, ECC protected
    - Native tensor transpose
    - Simultaneous read and write on each MRB
    - 1.4 TBps local read/write bandwidth per TPC
    - Support for direct memory-to-memory transfer for both HBM and MRB
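The 1.22 TBps raw figure follows directly from the HBM2-2400 configuration. A quick check (the 1024-bit per-stack interface is the standard HBM2 width, assumed here since the slide does not state it):

```python
# Reconstructing the stated raw HBM bandwidth from first principles.
transfer_rate = 2400e6    # transfers/s per pin (HBM2-2400)
bus_width_bits = 1024     # standard HBM2 interface width per stack (assumed)
stacks = 4

bw = transfer_rate * bus_width_bits * stacks / 8   # bytes/s
print(f"{bw / 1e12:.2f} TB/s")  # → 1.23 TB/s
```

1.2288 TB/s, which the slide truncates to "1.22 TBps".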
  • 13. Spring Crest On-die Communication
    - Bidirectional 2-D mesh architecture allows any-to-any communication
    - Prioritized for throughput and congestion avoidance
    - Cut-through forwarding and multicast support
    - 2.6 TBps total cross-sectional BW, 1.3 TBps per direction
    - All peripheral devices (HBM, SerDes) shared through the mesh
    - Separate meshes for different traffic types
    - Support for direct peer-to-peer communication between TPCs
    [Figure: mesh diagram; on-chip routers (OCR) arranged in a grid, connected to four 8-channel HBM controllers, the PCIe interface, and the inter-chip crossbars]
  • 14. Scale-Out
    - Run the largest models across multiple chips and across chassis
    - 16 quads of 112 Gbps; 3.58 Tbps total bidirectional BW per chip
    - Fully programmable router with multicast support enables multiple glueless topologies
    - Reliable transmission
    - Virtual channels and priorities for traffic management
    - Direct low-latency local memory transfer
    - Support for up to 1024 nodes
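The per-chip total is consistent with counting both link directions, reading "112 Gbps" as the per-quad, per-direction rate (an interpretation; the slide does not spell this out, though it matches the 64-lane SerDes figure at 28 Gb/s per lane):

```python
# Checking the stated 3.58 Tbps bidirectional scale-out bandwidth.
quads = 16
quad_rate_gbps = 112      # per quad, per direction (assumed): 4 lanes x 28 Gb/s
directions = 2

total_tbps = quads * quad_rate_gbps * directions / 1000
print(total_tbps)  # → 3.584
```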
  • 15. Spring Crest Single-chip GEMM Performance
    - DL workloads require a variety of GEMM sizes
    - Minimizing HBM memory-boundedness through on-die data re-use => higher utilization of compute resources
    - Faster training, fewer idle resources
    - ~2x better than published utilization of competitive products

    GEMM size          | Spring Crest utilization
    1024 x 700 x 512   | 31.1%
    1760 x 7133 x 1760 | 44.5%
    2048 x 7133 x 2048 | 46.7%
    2560 x 7133 x 2560 | 57.1%
    4096 x 7133 x 4096 | 57.4%
    5124 x 9124 x 2048 | 55.5%

    * All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice. Performance measured on July 10, 2019 on pre-production NNP-T Spring Crest silicon, using 22 TPCs at 900 MHz core clock and 2 GHz HBM clock. Host is an Intel® Xeon® Gold 6130T CPU @ 2.10 GHz with 64 GB of system memory. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
    ** Other names and brands may be claimed as the property of others. DeepBench data from: https://github.com/baidu-research/DeepBench/blob/master/results/train/DeepBench_NV_V100.xlsx, based on NVIDIA DGX-1, NVIDIA V100 GPU, Linux kernel 4.4.0-124-generic, CUDA 10.0.130, cuDNN 7.3.1.20, NVIDIA driver 410.48, Intel® Xeon® CPU E5-2698 v4 @ 2.2 GHz
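Utilization here means achieved throughput as a fraction of peak compute. A hypothetical illustration of the arithmetic (the 3.5 ms runtime is assumed purely for the example, and the 119 TOPS peak comes from the SoC slide; neither is a measured value from the table):

```python
def gemm_utilization(m: int, n: int, k: int,
                     runtime_s: float, peak_ops_per_s: float) -> float:
    """Fraction of peak compute achieved by an m x n x k GEMM."""
    ops = 2 * m * n * k          # one multiply + one add per MAC
    return ops / runtime_s / peak_ops_per_s

# e.g. a 4096 x 7133 x 4096 GEMM with an assumed 3.5 ms runtime
# against the chip's stated 119 TOPS peak:
u = gemm_utilization(4096, 7133, 4096, 3.5e-3, 119e12)
print(f"{u:.1%}")  # → 57.5%
```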
  • 16. Spring Crest Convolutions
    - A variety of convolution hyperparameters are required by DL workloads
    - Support for multiple tensor layouts for maximum on-die data reuse, resulting in higher compute efficiency

    Description                       | Spring Crest utilization
    c64xh56xw56_k64xr3xs3_st1_n128    | 86%
    c128xh28xw28_k128xr3xs3_st1_n128  | 71%
    c512xh28xw28_k128xr1xs1_st1_n128  | 65%
    c128xh28xw28_k512xr1xs1_st1_n128  | 59%
    c256xh14xw14_k1024xr1xs1_st1_n128 | 62%
    c256xh28xw28_k512xr1xs1_st2_n128  | 71%
    c32xh120xw120_k64xr5xs5_st1_n128  | 87%

    Key: C = # input channels, H = height, W = width, K = # filters, R = filter X, S = filter Y, ST = stride, N = minibatch size
    [Figure: bar chart of convolution performance, 0-100%]
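Each entry's compute volume follows from the naming key above. A sketch of the MAC count for the first entry (padding is not stated on the slide; 'same' padding is assumed here, so output spatial dims equal input dims divided by stride):

```python
def conv_macs(n: int, c: int, h: int, w: int,
              k: int, r: int, s: int, stride: int) -> int:
    """MACs for one convolution layer, assuming 'same' padding."""
    h_out, w_out = h // stride, w // stride   # assumption: 'same' padding
    return n * k * c * r * s * h_out * w_out

# c64xh56xw56_k64xr3xs3_st1_n128 from the table:
macs = conv_macs(n=128, c=64, h=56, w=56, k=64, r=3, s=3, stride=1)
print(f"{macs / 1e9:.1f} GMACs")  # → 14.8 GMACs
```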
  • 17. Spring Crest Communication Performance
    - Benchmarked on a ring topology (intra- and inter-chassis)
    - Support for different all-reduce algorithms with different communication patterns

    Communication kernel (Spring Crest, 16x ICL) | Within-chassis (8 cards) | Cross-chassis (16, 32 cards)
    2-card Send/Recv BW (GB/s)                   | 161                      | 161
    Allreduce BW (GB/s)                          | 151                      | 151
    Broadcast BW (GB/s)                          | 147                      | 147
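For a ring all-reduce, the traffic each node pushes over its link is a fixed fraction of the buffer size, which is why the achievable rate stays close to raw link bandwidth regardless of ring size. A sketch of the standard formula (textbook ring all-reduce, not NNP-T-specific code):

```python
def ring_allreduce_traffic(size_bytes: float, n: int) -> float:
    """Bytes each node sends over its ring link for one all-reduce:
    (n-1)/n of the buffer in reduce-scatter, plus the same again
    in the all-gather phase."""
    return 2 * (n - 1) / n * size_bytes

# a 128 MB gradient buffer on an 8-card ring:
traffic = ring_allreduce_traffic(128e6, 8)
print(f"{traffic / 1e6:.0f} MB per link")  # → 224 MB per link
```

As n grows the factor 2(n-1)/n approaches 2, so per-link traffic (and thus all-reduce time at fixed link speed) is nearly independent of card count, matching the flat within- vs cross-chassis numbers in the table.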
  • 18. Communication Performance
    - Low overhead and direct memory transfer result in low latency
    - High efficiency even at moderate transfer sizes
    - Cross-chassis scale-out with the same network/connectivity

    Allreduce latency, 2 KB message (ms):
    2 cards (in-chassis)     | 3
    4 cards (in-chassis)     | 8
    8 cards (in-chassis)     | 9
    16 cards (cross-chassis) | 30
    32 cards (cross-chassis) | 36

    Allreduce data rate (GB/s):
    Message size | Spring Crest (8 chips) | Spring Crest (32 chips)
    1 MB         | 68.7                   | 39.9
    8 MB         | 115.8                  | 92.2
    32 MB        | 137.5                  | 130.2
    128 MB       | 147.1                  | 147.4
  • 19. Concluding Remarks
    - Domain-specific acceleration has a place in DL training: training time and model size continue to be bottlenecks
    - Numerics and compute tailored for DL, with no legacy workloads to support
    - Architected from the ground up to reduce data movement and keep compute units fed
    - Higher utilization and efficiency on micro-benchmarks translate into better overall workload performance => faster, more power-efficient training