The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI summer webinar series.
Businesses of every kind are increasing their use of artificial intelligence to gain efficiency and make better decisions. Traditional computer architectures do not serve these new data-processing demands well. Enterprises, developers, data scientists, and researchers need new platforms that unify all AI workloads, simplifying infrastructure and accelerating ROI. This has led to the development of high-performance, specialised hardware devices to meet these demands.
The focus of this webinar was the impact of AI data processing on data centres, particularly from the technology perspective. The webinar featured four expert presentations covering opportunities, implementation techniques and case studies, followed by a panel Q&A session.
Implementing AI: High Performance Architectures
1.
2. www.ktn-uk.org @KTNUK
What we do - Growth Through Innovation
• Connecting: finding valuable partners; project consortium building; supply chain knowledge; driving new connections; articulating challenges; finding creative solutions
• Funding: awareness and dissemination; public and private finance; advice – project scope; advice – proposal mentoring; project follow-up
• Influencing: promoting industry needs; informing policy makers; informing strategy; communicating trends and market drivers
• Supporting: intelligence on trends and markets; business planning support; success stories / raising profile
• Navigating: navigating the innovation support landscape; promoting coherent strategy and approach; engaging wider stakeholders; curation of innovation resources
3. eFutures aims to strengthen and support a network of people
working in electronic systems across the UK
• Building new links and increasing involvement with industry
• Mapping national electronics research, to ensure work across the UK is known and recognised
• Encouraging and funding innovative multi-disciplinary/multi-university proposals
• Communicating with our network via a monthly magazine, social media and new website
• Running events that support our network and our strategy
• Piloting an academic Mentoring Scheme
• Launching a Big Ideas Challenge – more details soon
• Ideas warmly welcomed. Please get involved!
Twitter @efuturesuk
Sign up to our mailing list: efutures@qub.ac.uk
4. Today’s Agenda
Large scale HPC hardware in the age of AI
Prof Simon McIntosh-Smith, Bristol University
Solving Core Recommendation Model Challenges in Data Centers
Giles Peckham, Myrtle.ai
Short Break
Arm SVE and Supercomputer Fugaku for Deep learning
Roxana Rusitoru, Arm
A Universal Accelerated Computing Platform
Timothy Lanfear, NVIDIA
Panel Q&A Session
Chaired by Prof Roger Woods
5. Large scale HPC hardware
in the age of AI
Prof. Simon McIntosh-Smith
Head of the HPC research group
University of Bristol, UK
Twitter: @simonmcs
Email: simonm@cs.bris.ac.uk
http://uob-hpc.github.io
6. AI is a primary goal for next-generation supercomputers
The coming generation of Exascale systems will
include a diverse range of architectures at massive
scale, all of which are targeting AI:
• Fugaku: Fujitsu A64FX Arm CPUs
• Perlmutter: AMD EPYC CPUs and NVIDIA GPUs
• Frontier: AMD EPYC CPUs and Radeon GPUs
• Aurora: Intel Xeon CPUs and Xe GPUs
• El Capitan: AMD EPYC CPUs and Radeon GPUs
http://uob-hpc.github.io
The Next Platform, Jan 13th 2020: “HPC in 2020: compute engine diversity gets real”, https://www.nextplatform.com/2020/01/13/hpc-in-2020-compute-engine-diversity-gets-real/
Overview
The Fugaku compute system was designed and built by Fujitsu and RIKEN. Fugaku (富岳) is another name for Mount Fuji, created by combining the first character of 富士 (Fuji) with 岳 (mountain). The system is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. RIKEN is a large scientific research institute in Japan with about 3,000 scientists across seven campuses. Development of the Fugaku hardware started in 2014 as the successor to the K computer, which mainly focused on basic science and simulation and modernized Japanese supercomputing with massive parallelism. The Fugaku system is designed to support a continuum of applications ranging from basic science to Society 5.0, an initiative to create a new social scheme and economic model by fully incorporating the technological innovations of the fourth industrial revolution. The Mount Fuji image reflects a broad base of applications and capacity for simulation, data science, and AI, spanning academia, industry, and cloud startups, together with high peak performance on large-scale applications.
Figure 1. Fugaku System as installed in RIKEN R-CCS
The Fugaku system is built on the A64FX, an Armv8.2-A processor that implements the Scalable Vector Extension (SVE) at a 512-bit vector width. Fujitsu adds the following extensions: hardware barrier, sector cache, prefetch, and a 48/52-core CPU configuration. It is optimized for high-performance computing (HPC) with extremely high-bandwidth 3D-stacked memory (4x 8 GB HBM2 at 1,024 GB/s), an on-die Tofu-D network (~400 Gbps), high SVE FLOP/s (3.072 TFLOP/s), and various AI support (FP16, INT8, etc.). The A64FX processor supports general-purpose Linux, Windows, and other cloud systems. Simply put, Fugaku is the largest and fastest supercomputer built to date. Below is a further breakdown of the hardware.
• Caches:
o L1D/core: 64 KB, 4-way, 256 GB/s (load), 128 GB/s (store)
o L2/CMG: 8 MB, 16-way
o L2/node: 4 TB/s (load), 2 TB/s (store)
o L2/core: 128 GB/s (load), 64 GB/s (store)
• 158,976 nodes
7. The UK’s Tier-2 exploring options
Isambard
• First production Arm-based HPC service
• 10,752 Armv8 cores (168 nodes x 2 sockets x 32 cores)
• Marvell ThunderX2, 32 cores, 2.5 GHz
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• >420 registered users, >100 of whom are
from outside the consortium
8. UK Tier-2 dense GPU systems
http://uob-hpc.github.io
• 22 NVIDIA DGX-1 Deep Learning Systems, each comprising:
• 8 NVIDIA Tesla V100 GPUs
• NVIDIA's high-speed NVlink interconnect
• 4 TB of SSD for machine learning datasets
• over 1PB of Seagate ClusterStor storage
• Mellanox EDR networking
• optimized versions of Caffe, TensorFlow, Theano, Torch, etc.
• system integration/delivery by Atos, hosting by STFC Hartree
• system management by Atos / STFC Hartree
http://www.hpc-uk.ac.uk/facilities/
9. Arm + GPU
http://uob-hpc.github.io
Source: https://nvidianews.nvidia.com/news/nvidia-and-tech-leaders-team-to-build-gpu-accelerated-arm-servers-for-new-era-of-diverse-hpc-architectures
10. Emerging architectures for AI / ML
http://uob-hpc.github.io
Google’s Tensor Processing Unit (TPU), Graphcore, Intel’s Nervana
12. Graphcore has just announced their 2nd generation “IPU”
http://uob-hpc.github.io
13. Graphcore IPU-M2000
• 4 x Colossus MK2 GC200 IPUs in a 1U box
• 1 PetaFLOP “AI compute” (16-bit FP)
• 5,888 processor cores, 35,328 independent threads
• Up to 450GB of Exchange Memory (off-chip DRAM)
• 2nd-gen IPU has 7-9x more performance on AI benchmarks
• 59.4B transistors (7nm) in 823mm²
• 900MB of fast on-chip SRAM per IPU (3x first gen.)
• 250 TFLOP/s AI compute per chip, 62.5 TFLOP/s single precision
http://uob-hpc.github.io
18. Key takeaways
• Orders of magnitude more AI / ML compute is coming
• Diverse architectures will deliver greater performance
• You need solutions that can work across CPUs, GPUs and now more exotic hardware
• Optimised libraries are the main path to exploitation: TensorFlow, PyTorch, Caffe et al. (see the sketch below)
• Anything lower level requires a lot more ninja programming
http://uob-hpc.github.io
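As a concrete illustration of the “optimised libraries” takeaway (an addition, not from the talk): a minimal PyTorch sketch of why high-level frameworks are the main path to exploitation; the same model definition runs unchanged on a CPU, a GPU, or other supported back-ends, with the vendor’s tuned kernels underneath.

```python
# Minimal sketch (assuming PyTorch): portable, device-agnostic model code.
import torch
import torch.nn as nn

# Pick the best available device; nothing below changes with the hardware.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
x = torch.randn(64, 1024, device=device)

# Dispatched to cuBLAS/cuDNN on NVIDIA GPUs, oneDNN/BLAS on CPUs.
y = model(x)
print(y.shape, y.device)
```

Anything below this level of abstraction (hand-written CUDA, SVE intrinsics, IPU kernels) ties the code to one architecture, which is the “ninja programming” the takeaway warns about.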
19. For more information
Bristol HPC group: https://uob-hpc.github.io/
Email: S.McIntosh-Smith@bristol.ac.uk
Twitter: @simonmcs
http://uob-hpc.github.io
54. 25 YEARS OF SCIENTIFIC COMPUTING ACCELERATION
X-Factor Speedup | Full Stack | One Architecture | Software Defined | Extreme Scale | 25 Years of Computing Acceleration Development
55. THE NEW COMPUTING
[Diagram: AI, edge, streaming, simulation, visualization and data analytics workloads spanning the cloud, supercomputer, extreme IO, edge appliance and the network]
56. A100 AVAILABLE VIA NVIDIA HGX A100 AND A100 PCIE
• HGX A100 8-GPU (scale-up, fastest time-to-solution for AI): 8 GPUs, full NVLink bandwidth between all GPUs with NVSwitch
• HGX A100 4-GPU (scale-up, mixed AI & HPC): 4 A100s, fully connected with shared NVLinks
• A100 PCIe (for mainstream servers): 1-8 GPUs per server, optional NVLink Bridge between 2 GPUs
57. 5 MIRACLES OF A100
• NVIDIA Ampere architecture: world’s largest 7nm chip, 54B transistors, HBM2
• 3rd-gen NVLink and NVSwitch: efficient scaling to enable a super GPU, 2x more bandwidth
• 3rd-gen Tensor Cores: faster, flexible, easier to use; 20x AI performance with TF32, 2.5x HPC performance
• New sparsity acceleration: harnesses sparsity in AI models, 2x AI performance
• New Multi-Instance GPU (MIG): optimal utilization with right-sized GPUs, 7 simultaneous instances per GPU (see the sketch after this slide)
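For context (an addition, not from the NVIDIA deck): each MIG instance appears to software as an independent GPU, so existing framework code can target one without modification. A minimal sketch, assuming PyTorch and an A100 already partitioned into MIG instances by an administrator; the instance UUID below is a hypothetical placeholder.

```python
# Minimal sketch (assumptions: PyTorch, and an A100 already partitioned into
# MIG instances, e.g. via nvidia-smi; the UUID below is a hypothetical
# placeholder; substitute a real one reported by `nvidia-smi -L`).
import os

# Select one MIG instance before any CUDA context is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

import torch

# To PyTorch, the MIG slice looks like an ordinary single GPU ("cuda:0").
device = torch.device("cuda:0")
x = torch.randn(32, 128, device=device)
print(torch.cuda.get_device_name(0), float(x.sum()))
```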
58. INTRODUCING DGX A100
The Universal AI System – Data Analytics, Training and Inference
• 8x NVIDIA A100 GPUs with 320GB total GPU memory; 12 NVLinks per GPU, 600GB/sec GPU-to-GPU bi-directional bandwidth
• 6x NVIDIA NVSwitches; 4.8TB/sec bi-directional bandwidth, 2x more than the previous-generation NVSwitch
• 9x Mellanox ConnectX-6 200Gb/s network interfaces; 450GB/sec peak bi-directional bandwidth
• 15TB Gen4 NVMe SSD; 25GB/sec peak bandwidth, 2x faster than Gen3 NVMe SSDs
• Dual 64-core AMD Rome CPUs and 1TB RAM; 3.2x more cores to power the most intensive AI jobs
59. UNIFIED AI ACCELERATION
BERT-Large training throughput (sequences/s): V100 FP32 = 216; V100 FP16 = 822; A100 TF32 = 1,260, a 6x out-of-the-box speedup over V100 FP32; A100 with AMP (FP16) = 2,274, a 3x speedup over V100 FP16 (see the precision sketch below)
BERT-Large inference (relative throughput): T4 = 0.6x; V100 = 1x; one MIG instance (1/7th of an A100) = 1x; seven MIG instances (one full A100) = 7x
Benchmark notes: BERT pre-training throughput using PyTorch, including (2/3) Phase 1 (seq len 128) and (1/3) Phase 2 (seq len 512); V100 on a DGX-1 server with 8x V100 using FP32 and FP16 precision; A100 on a DGX A100 server with 8x A100 using TF32 precision and FP16. BERT-Large inference: T4 with TRT 7.1, INT8, batch size 256; V100 with TRT 7.1, FP16, batch size 256; A100 with 7 MIG instances of 1g.5gb, pre-production TRT, batch size 94, INT8 with sparsity.
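To make the TF32 and AMP numbers above concrete, here is a minimal sketch (an addition, assuming PyTorch 1.7+ on an Ampere-class GPU; not part of the NVIDIA material) of the two switches involved: TF32 applies automatically to FP32 matmuls and convolutions on Ampere, while AMP (automatic mixed precision) runs eligible ops in FP16 under an autocast context.

```python
# Minimal sketch (assuming PyTorch >= 1.7 on an Ampere-class GPU) of the two
# precision modes behind the "6x with TF32" and "3x with AMP" figures above.
import torch
import torch.nn as nn

device = torch.device("cuda")

# TF32: on Ampere, FP32 matmuls/convolutions use TF32 Tensor Cores; these
# flags make the default explicit (set them to False to force classic FP32).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps FP16 gradients stable

x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1024, device=device)

opt.zero_grad()
# AMP: eligible ops inside autocast run in FP16 on Tensor Cores.
with torch.cuda.amp.autocast():
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```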
60. NVIDIA SHATTERS BIG DATA ANALYTICS BENCHMARK
19.5x faster TPCx-BB performance results on DGX A100 with RAPIDS:
• 350 CPU servers: $23M, 22 racks (16 servers per rack), 300 kW
• 16 NVIDIA DGX A100 systems: $3.3M, 4 racks, 100 kW
Equivalent performance at 1/7th the cost and 1/3rd the power.
Performance: CPU = 4.7 hr, DGX A100 = 14.5 min (19.5x faster). After normalizing performance across the CPU and GPU clusters: cost, CPU $23M vs DGX A100 $3.3M (1/7th the cost); power, CPU 298 kW vs DGX A100 104 kW (1/3rd the power); space, CPU 22 racks vs DGX A100 4 racks (less than 1/5th the space).
61. GPU-ACCELERATED APACHE SPARK 3.0
In Spark 2.x, data preparation runs on a CPU-powered cluster and model training (XGBoost | TensorFlow | PyTorch) on a separate GPU-powered cluster, linked by shared storage. In Spark 3.0, a single Spark-orchestrated, GPU-powered cluster handles both data preparation and model training.
Spark 3.0 enables:
• A single pipeline, from ingest to data preparation to model training (a minimal configuration sketch follows below)
• GPU-accelerated data preparation
• Consolidation and simplification of infrastructure
Built on the foundations of RAPIDS with the RAPIDS Accelerator for Apache Spark; now available on leading cloud analytics platforms. Learn more @ nvidia.com/spark-book
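As a concrete illustration (an addition, not from the slides), a minimal PySpark sketch of enabling the RAPIDS Accelerator so that DataFrame/SQL data preparation runs on the GPU; the plugin class and config keys are the ones the project documents, while the paths and column names are hypothetical placeholders.

```python
# Minimal sketch (assuming PySpark 3.x with the RAPIDS Accelerator for Apache
# Spark jars on the classpath; paths/columns are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-etl-sketch")
    # Load the RAPIDS Accelerator plugin so supported SQL/DataFrame
    # operators execute on the GPU instead of the CPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Resource scheduling: one GPU per executor, shared by four tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Data preparation step of the single pipeline: with the plugin active,
# the read, aggregation and write below are GPU-accelerated where supported.
df = spark.read.parquet("/data/clicks.parquet")      # hypothetical path
features = df.groupBy("user_id").count()             # hypothetical column
features.write.mode("overwrite").parquet("/data/features.parquet")
```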
62. UP TO 2X MORE HPC PERFORMANCE
A100 speedup over V100 (all results measured): NAMD 1.5x, GROMACS 1.5x, AMBER 1.6x, LAMMPS 1.9x (molecular dynamics); FUN3D 1.7x (physics); SPECFEM3D 1.8x, RTM 1.9x (geoscience); BerkeleyGW 2.0x, Chroma 2.1x (physics)
Except for BerkeleyGW, the V100 used is a single V100 SXM2 and the A100 a single A100 SXM4; BerkeleyGW (based on Chi Sum) uses 8x V100 in a DGX-1 vs 8x A100 in a DGX A100.
App details: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four-material model.
63. NGC: GPU-OPTIMIZED HPC & AI SOFTWARE
Accelerate time to discovery and solutions: 150+ application containers; 100+ AI models (ML and inference; healthcare, smart cities, conversational AI, robotics and more); toolkits & SDKs; Helm charts
NGC runs on-prem, in hybrid, multi-cloud and at the edge; encrypted; on x86 | Arm | POWER
64. MLPERF: DGX SUPERPOD SETS ALL 8 AT-SCALE AI RECORDS
Under 18 minutes to train each MLPerf benchmark. Time to train in minutes (lower is better), NVIDIA A100 at max scale, commercially available solutions:
• Reinforcement Learning (MiniGo): 17.1 (1,792 A100)
• Object Detection, heavy weight (Mask R-CNN): 10.5 (256 A100)
• Recommendation (DLRM): 3.3 (8 A100)
• NLP (BERT): 0.8 (2,048 A100)
• Object Detection, light weight (SSD): 0.8 (1,024 A100)
• Image Classification (ResNet-50 v1.5): 0.8 (1,840 A100)
• Translation, recurrent (GNMT): 0.7 (1,024 A100)
• Translation, non-recurrent (Transformer): 0.6 (480 A100)
Google TPUv3 results shown for comparison: 28.7 min and 56.7 min (each on 16 TPUv3); for the remaining bars, no result was submitted by NVIDIA V100, TPUv3 or Huawei Ascend (X = no result submitted).
MLPerf 0.7 performance comparison at max scale; max scale used for NVIDIA A100, NVIDIA V100, TPUv3 and Huawei Ascend for all applicable benchmarks. MLPerf IDs at scale: Transformer: 0.7-30, 0.7-52; GNMT: 0.7-34, 0.7-54; ResNet-50 v1.5: 0.7-37, 0.7-55, 0.7-1, 0.7-3; SSD: 0.7-33, 0.7-53; BERT: 0.7-38, 0.7-56, 0.7-1; DLRM: 0.7-17, 0.7-43; Mask R-CNN: 0.7-28, 0.7-48; MiniGo: 0.7-36, 0.7-51. MLPerf name and logo are trademarks; see www.mlperf.org for more information.
65. MLPERF: ALL 8 PER-CHIP AI PERFORMANCE RECORDS
Relative speedup over V100 (V100 = 1.0x), commercially available solutions:
• A100: ResNet-50 v1.5 (Image Classification) 1.5x; BERT (NLP) 1.6x; Mask R-CNN (Object Detection, heavy weight) 1.9x; MiniGo (Reinforcement Learning) 2.0x; SSD (Object Detection, light weight) 2.0x; GNMT (Translation, recurrent) 2.4x; Transformer (Translation, non-recurrent) 2.4x; DLRM (Recommendation) 2.5x
• Other submissions: Huawei Ascend 0.7x and TPUv3 1.2x on ResNet-50 v1.5; TPUv3 0.9x on BERT; no other results submitted (X = no result submitted)
Per-chip performance arrived at by comparing performance at the same scale where possible and normalizing it to a single chip. 8-chip scale: V100, A100 for Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16-chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512-chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8 A100 and 16 V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15; BERT: 0.7-1, 0.7-45, 0.7-22; Mask R-CNN: 0.7-40, 0.7-19; MiniGo: 0.7-41, 0.7-20; SSD: 0.7-40, 0.7-19; GNMT: 0.7-40, 0.7-19; Transformer: 0.7-40, 0.7-19; DLRM: 0.7-43, 0.7-17. MLPerf name and logo are trademarks; see www.mlperf.org for more information.
66. SELENE: DGX SUPERPOD DEPLOYMENT
• #7 on TOP500 (27.6 PetaFLOPS HPL); #2 on Green500 (20.5 GigaFLOPS/watt)
• Fastest industrial system in the U.S.: 1+ ExaFLOPS of AI compute
• Built with the NVIDIA DGX SuperPOD architecture in 3 weeks, using NVIDIA DGX A100 and NVIDIA Mellanox IB, drawing on NVIDIA’s decade of AI experience
Configuration: 2,240 NVIDIA A100 Tensor Core GPUs; 280 NVIDIA DGX A100 systems; 494 Mellanox 200G HDR IB switches; 7 PB of all-flash storage
67. ACCELERATED COMPUTING FIGHTS COVID-19
Spanning data analytics, simulation & visualization, and AI at the edge:
• Oxford Nanopore: sequence the viral genome in 7 hours
• Plotly, NVIDIA: real-time infection-rate analysis
• ORNL, Scripps: screen 2B drug compounds in 1 day vs 1 year
• Structura, NIH, UT Austin: CryoSPARC, first 3D structure of the virus spike protein
• NIH, NVIDIA: AI COVID-19 classification
• Kiwibot: robot medical-supply delivery
• Whiteboard Coordinator: AI elevated-body-temperature screening system