www.ktn-uk.org @KTNUK
What we do - Growth Through Innovation

Connecting
- Finding valuable partners
- Project consortium building
- Supply Chain Knowledge
- Driving new connections
- Articulating challenges
- Finding creative solutions

Funding
- Awareness and dissemination
- Public and private finance
- Advice – project scope
- Advice – proposal mentoring
- Project follow-up

Influencing
- Promoting Industry needs
- Informing policy makers
- Informing strategy
- Communicating trends and market drivers

Supporting
- Intelligence on trends and markets
- Business Planning support
- Success stories / raising profile

Navigating
- Navigating the innovation support landscape
- Promoting coherent strategy and approach
- Engaging wider stakeholders
- Curation of innovation resources
eFutures aims to strengthen and support a network of people
working in electronic systems across the UK
• Building new links and increasing involvement with industry
• Mapping national electronics research, to ensure work across the UK is known and noted
• Encouraging and funding innovative multi-disciplinary/multi-university proposals
• Communicating with our network via a monthly magazine, social media and new website
• Running events that support our network and our strategy
• Piloting an academic Mentoring Scheme
• Launching a Big Ideas Challenge – more details soon
• Ideas warmly welcomed. Please get involved!
Twitter @efuturesuk
Sign up to our mailing list: efutures@qub.ac.uk
Today’s Agenda
Large scale HPC hardware in the age of AI
Prof Simon McIntosh-Smith, Bristol University
Solving Core Recommendation Model Challenges in Data Centers
Giles Peckham, Myrtle.ai
Short Break
Arm SVE and Supercomputer Fugaku for Deep Learning
Roxana Rusitoru, Arm
A Universal Accelerated Computing Platform
Timothy Lanfear, NVIDIA
Panel Q&A Session
Chaired by Prof Roger Woods
Large scale HPC hardware
in the age of AI
Prof. Simon McIntosh-Smith
Head of the HPC research group
University of Bristol, UK
Twitter: @simonmcs
Email: simonm@cs.bris.ac.uk
http://uob-hpc.github.io
AI is a primary goal for next-generation supercomputers
The coming generation of Exascale systems will
include a diverse range of architectures at massive
scale, all of which are targeting AI:
• Fugaku: Fujitsu A64FX Arm CPUs
• Perlmutter: AMD EPYC CPUs and NVIDIA GPUs
• Frontier: AMD EPYC CPUs and Radeon GPUs
• Aurora: Intel Xeon CPUs and Xe GPUs
• El Capitan: AMD EPYC CPUs and Radeon GPUs
The Next Platform, Jan 13, 2020: “HPC in 2020: compute engine diversity gets real”
https://www.nextplatform.com/2020/01/13/hpc-in-2020-compute-engine-diversity-gets-real/
June 22, 2020
Overview
The Fugaku compute system was designed and built by Fujitsu and RIKEN. Fugaku (富岳) is another name for Mount Fuji, formed by combining the first character of 富士 (Fuji) with 岳 (mountain). The system is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. RIKEN is a large scientific research institute with about 3,000 scientists across seven campuses in Japan. Development of the Fugaku hardware started in 2014 as the successor to the K computer, which focused mainly on basic science and simulation and moved Japanese supercomputing to massive parallelism. Fugaku is designed to serve a continuum of applications ranging from basic science to Society 5.0, an initiative to create a new social scheme and economic model by fully incorporating the technological innovations of the fourth industrial revolution. The Mount Fuji image reflects a broad base of applications and capacity for simulation, data science, and AI (spanning academia, industry, and cloud startups), together with a high peak performance on large-scale applications.
Figure 1. Fugaku System as installed in RIKEN R-CCS
The Fugaku system is built on the A64FX, an Armv8.2-A CPU that implements the Scalable Vector Extension (SVE) with a 512-bit vector width. Fujitsu adds the following extensions: hardware barrier, sector cache, prefetch, and a 48/52-core CPU configuration. The chip is optimized for high-performance computing (HPC), with extremely high-bandwidth 3D-stacked memory (4x 8 GB HBM at 1024 GB/s), an on-die Tofu-D network (~400 Gbps), high SVE FLOP/s (3.072 TFLOP/s), and various AI data types (FP16, INT8, etc.). The A64FX processor supports general-purpose Linux, Windows, and other cloud software stacks. Simply put, Fugaku is the largest and fastest supercomputer built to date. Below is a further breakdown of the hardware.
• Caches:
o L1D/core: 64 KB, 4way, 256 GB/s (load), 128 GB/s (store)
o L2/CMG: 8 MB, 16 way
o L2/node: 4 TB/s (load), 2 TB/s (store)
o L2/core: 128 GB/s (load), 64 GB/s (store)
• 158,976 nodes
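As an arithmetic aside (mine, not from the report), the quoted 3.072 TFLOP/s SVE peak per chip follows directly from the core count, the 512-bit vector width and a 2.0 GHz clock:

$$48~\text{cores} \times 2~\text{FMA pipes} \times \tfrac{512}{64}~\text{FP64 lanes} \times 2~\tfrac{\text{flops}}{\text{FMA}} \times 2.0~\text{GHz} = 3.072~\text{TFLOP/s}$$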
The UK’s Tier-2: exploring options
Isambard
• First production Arm-based HPC service
• 10,752 Armv8 cores (168n x 2s x 32c)
• Marvell ThunderX2, 32 cores @ 2.5 GHz
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• >420 registered users, >100 of whom are
from outside the consortium
UK Tier-2 dense GPU systems
• 22 NVIDIA DGX-1 Deep Learning Systems, each comprising:
• 8 NVIDIA Tesla V100 GPUs
• NVIDIA's high-speed NVLink interconnect
• 4 TB of SSD for machine learning datasets
• over 1PB of Seagate ClusterStor storage
• Mellanox EDR networking
• optimized versions of Caffe, TensorFlow, Theano, Torch, etc.
• system integration/delivery by Atos, hosting by STFC Hartree
• system management by Atos / STFC Hartree
http://www.hpc-uk.ac.uk/facilities/
Arm + GPU
Source: https://nvidianews.nvidia.com/news/nvidia-and-tech-leaders-team-to-build-gpu-accelerated-arm-servers-for-new-era-of-diverse-hpc-architectures
Emerging architectures for AI / MP
Google’s Tensor Processing Unit (TPU), Graphcore, Intel’s Nervana
Google’s Tensor Processing Units:
https://cloud.google.com/tpu
Cloud TPU v3:
• 420 TFLOP/s
• 128 GB HBM
• $2.40 / TPU hour

Cloud TPU v3 Pod:
• 100+ PFLOP/s
• 32 TB HBM
• 2-D toroidal mesh network

TPU v4 reportedly improves performance by 2.7x
Graphcore has just announced their 2nd generation “IPU”
Graphcore IPU-M2000
• 4 x Colossus MK2 GC200 IPUs in a 1U box
• 1 PetaFLOP “AI compute” (16-bit FP)
• 5,888 processor cores, 35,328 independent threads
• Up to 450GB of exchange memory (off-chip DRAM)
• 2nd gen IPU has 7-9X more performance on AI benchmarks
• 59.4B 7nm transistors in 823mm2
• 900MB of on-chip fast SRAM per IPU (3x first gen.)
• 250 TFLOP/s AI compute per chip, 62.5 TFLOP/s single-precision
Graphcore systems now include their own interconnect too
Massive scale AI/ML supercomputers
Graphcore 3D torus topology for large-scale AI
Key takeaways
• Orders of magnitude more AI / ML compute coming
• Diverse architectures to deliver greater performance
• You need solutions that can work across CPUs, GPUs and now more
exotic hardware
• Optimised libraries are the main path to exploitation
• TensorFlow, PyTorch, Caffe, et al.
• Anything lower level requires a lot more ninja programming
For more information
Bristol HPC group: https://uob-hpc.github.io/
Email: S.McIntosh-Smith@bristol.ac.uk
Twitter: @simonmcs
Copyright © Myrtle.ai 2020
Solving Core Recommendation
Model Challenges in Data Centers
Giles Peckham, Myrtle.ai
Myrtle.ai accelerates Machine Learning inference
• Accelerates Recommendation Models, RNNs and other DNNs with sparse structures
• Achieves maximum throughput in applications with strict latency constraints
• Addresses hyper-scale inference
• Data Centers (Cloud & On-Premise) and Embedded applications
Myrtle.ai affiliations: founding member of MLCommons; alliance member; gold partner; AI keynote 2019; joint white paper. [Partner logos]
Application areas: Recommendation Systems, Speech Synthesis, Speech Transcription, Machine Translation
MAU Accelerator
Low latency inference accelerator for data center ML workloads
Optimized for highest latency-bounded throughput
[Diagram: a DNN model deployed on an FPGA accelerator card in a cloud or enterprise data center server.]
MAU Accelerator Benefits
Optimized for highest latency-bounded throughput:
• Reduced data center infrastructure required → lower CapEx; mitigates against rack space limitations
• Reduced energy consumption → lower OpEx; smaller carbon footprint; mitigates against power constraints
• Deterministic low tail-latency enables the use of higher quality models → improved customer experience; better services
• Uses readily-available data center accelerator cards compatible with typical server installations → rapid deployment at scale
• Development flow based on industry standards → easy to compile from popular open-source frameworks
• Flexible & reprogrammable solution → future proof
Applications
Target Applications
• Speech transcription
• Natural language processing
• Speech synthesis
• Time series cleansing & analysis
• Payment & trading fraud detection
• Anomaly detection
• Network security
Target Model Architectures
• Fully connected linear layers
• RNN, including LSTM and GRU
• Time delay neural network (TDNN)
Target Sectors
• Finance (trading, compliance, service)
• Search, Social Media & other Ad Servers
• HPC (very large ML)
• Life science (genomics, data analytics)
• Defense, Aerospace, Security
• Telcos & Conferencing Providers
An Accelerator for Recommendation Systems
Recommendation Models
• One of the most common data center workloads
• Used for search, adverts, feeds and personalization

Demands
• Throughput / Capacity
  • Need to ramp up capacity quickly to meet demand
  • Months/years to commission new data center floor space
• Cost
  • Data center rack server investment >$50B/yr [1]
• Latency / Model Accuracy / Revenue
  • 5 ms latency is challenging for typical server systems
  • A 100 ms delay in load time can cost e-commerce companies many $B/yr [2]
• Energy Consumption / Carbon Footprint
  • Global data center energy costs >$10B/yr [3]
  • Global data center emissions ~100M tonnes CO2/yr [4]
1. https://www.marketsandmarkets.com/Market-Reports/data-center-rack-server-market-53332315.html
2. https://www.akamai.com/uk/en/about/news/press/2017-press/akamai-releases-spring-2017-state-of-online-retail-performance-report.jsp
3. https://www.sciencedaily.com/releases/2020/02/200227144313.htm
4. https://www.comsoc.org/publications/tcn/2019-nov/energy-efficiency-data-centers
Design Challenges
• A typical Recommendation Model runs dense, compute-bound feature layers at the input and output, with a sparse, memory-bound feature stage in between
  • Up to 80% of time can be spent in the sparse stage
  • The memory architecture in typical data center infrastructure is inefficient here
  • Existing accelerators give a poor return here
• Traditional approach: put the whole model on one chip
• Myrtle.ai approach: offload different features of the model to different hardware accelerators, and make it equally practical to adopt (see the sketch below)
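To see why this stage is memory-bound, here is a minimal C sketch of the sparse lookup described above (an illustration with hypothetical names, not Myrtle.ai code): each multi-hot feature is a bag of row indices into a large embedding table, and the dense output is the sum of those rows, so the cost is dominated by random DRAM reads rather than arithmetic.

#include <stdint.h>

/* Hedged sketch of the sparse, memory-bound stage of a recommendation
 * model: sum the embedding-table rows selected by one multi-hot feature. */
void embedding_bag_sum(const float *table, int64_t dim,
                       const int64_t *indices, int64_t n_indices,
                       float *out)
{
    for (int64_t d = 0; d < dim; ++d)
        out[d] = 0.0f;                                /* zero the output row */
    for (int64_t k = 0; k < n_indices; ++k) {
        const float *row = table + indices[k] * dim;  /* random table access */
        for (int64_t d = 0; d < dim; ++d)
            out[d] += row[d];                         /* trivial arithmetic */
    }
}

With embedding tables of many gigabytes and essentially random indices, almost every row touch is a cache miss, which is why typical server memory systems and compute-oriented accelerators return so little here.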
SEAL: An Accelerator for Recommendation Systems
• Accelerates the memory-bound sparse operations in all recommendation models
• Delivers large gains in latency-bounded throughput
• Fully preserves existing model accuracy
• Is complementary to existing compute accelerators
• Is integrated into the PyTorch Deep Learning Framework
The “Virtuous Circle”
1. Add SEAL modules
2. Offload sparse operations to SEAL
3. CPU freed up; latency reduced
4. Increase CPU batch size
5. Throughput increased, and the cycle repeats
Performance
• Vector Processing Bandwidth is the bandwidth achievable when transforming random multi-hot vectors into real-valued dense vectors
• Carrier is Glacier Point v2

Vector Processing Bandwidth:
• 16 GB version: 18 GB/s (219 GB/s per carrier)
• 32 GB version: 16 GB/s (195 GB/s per carrier)
Key Benefits
Based on benchmarking using a weighted average of the mlperf.org benchmark recommendation models (Dec. 2019):
• Rapid 8x increase in latency-bounded throughput using existing infrastructure [1]
  • Enables more recommendations to be made
  • Enables better quality recommendations to be made: higher CTRs, increased revenue, greater consumer satisfaction
• Up to 50% CapEx savings on further capacity expansion [1,2]
• Up to 80% reduction in energy consumption [1,2]
  • OpEx savings
  • Smaller carbon footprint

In short: 8x more throughput, 50% less CapEx, 80% less energy.

1. Comparisons are between a Xeon D-2100 performing inference on its own and the same CPU leveraging SEAL acceleration. Performance and benefits will vary, depending on individual system configuration and model usage.
2. Based on servers + SEAL only. Excludes buildings, HVAC etc.
Highly Complementary to Existing Infrastructure
• Accelerates existing servers; easy to install
• Complementary to other accelerators
• Scalable
• Does not require any change to the recommendation model: no model retraining, no degradation in accuracy
• Supports co-location of models with no performance penalty
• Supports concurrent deployment of different versions of a model, and loading/unloading models on the fly to facilitate A/B testing
SEAL is the lowest power, smallest form factor, easiest-to-deploy method of optimizing memory-bound recommendation models in existing infrastructure.

Contact seal@myrtle.ai to evaluate what SEAL can do for your business.
For more information visit myrtle.ai/seal
Thank You
www.myrtle.ai
Giles Peckham
07785 278478
giles@myrtle.ai
© 2020 Arm Limited (or its affiliates)
Roxana Rusitoru
Arm ML Research Lab
7 August 2020
Arm SVE and Supercomputer Fugaku for Deep Learning
Implementing AI: High Performance
Architectures
Disclaimer
• Spent the better part of the
last decade on Arm in HPC
within Arm Research.
• Worked on all layers from
application optimization to
kernel development,
simulation infrastructure and
next-gen architectures.
• And now I do ML! (Looking
after ML on CPUs)
We want to train large networks on this
Supercomputer Fugaku
Supercomputer Fugaku: #1 on the Top500
Supercomputer Fugaku today
Supercomputer Fugaku
[Green500 listing]

Metric | K | Fugaku
Peak FP64 | 11.3 PFLOPs* | 0.537 ExaFLOPs
Peak FP32 | 11.3 PFLOPs | 1.07 ExaFLOPs
Peak FP16 | -- | 2.15 ExaFLOPs
Peak INT8 | -- | 4.30 ExaFLOPs
Total mem BW | 5.184 PB/s | 163 PB/s

*Reported in Top500 (including I/O nodes)
Fujitsu A64FX Specs
What is this Scalable Vector Extension (aka SVE)?
• New vector extension after Advanced SIMD (aka Arm NEON) with features needed for new
markets (e.g., gather load & scatter store, per-lane predication, longer vectors)
• There is no preferred vector length
• The vector length (VL) is a hardware choice, 128-2048b, in increments of 128b
• Vector Length Agnostic (VLA) programming adjusts dynamically to the available VL
• SVE is not an extension of Armv8 Advanced SIMD (aka NEON)
• SVE is a separate, optional extension with a new set of instruction encodings.
• Initial focus is HPC and general-purpose server, not media/image processing.
• SVE begins to tackle traditional barriers to auto-vectorization
• Very low overhead vs scalar code to encourage opportunistic vectorization.
• Software-managed speculative vectorization of uncounted loops (i.e. C break).
• Extract more data-level parallelism (DLP) from existing C/C++/Fortran source code
Why Vector Length Agnostic?
Pros
• Fit into the 32b fixed-width A64 instruction set
encoding
• Future-proofing: no need for a new instruction
set when the vector length increases
• Vectorized code will scale automatically to use
the whole vector length
• No need to recompile, or to re-write hand-coded SVE assembler and intrinsics
Challenges
• Programmers and compilers must think
differently about vectorization
• A different vector length may expose latent bugs
• Stack layout changes may expose stack
overwriting bugs
• Software developers cannot be expected to
validate at 16 different VLs (128b, 256b,
384b… 2048b)
VLA Instruction Set Support
• Vectors cannot be initialized from compile-time constant, so…
• INDEX Zd.S,#1,#4 : Zd = [ 1, 5, 9, 13, 17, 21, 25, 29 ]
• Predicates cannot be initialized from memory, so…
• PTRUE Pd.S, MUL3 : Pd = [ T, T, T , T, T, T , F, F ]
• Vector loop increment and trip count are unknown at compile-time, so…
• INCD Xi : increment scalar Xi by # of 64b dwords in vector
• WHILELT Pd.D,Xi,Xe : next iteration predicate Pd = [ while i++ < e ]
• Vector stores to stack must be dynamically allocated and indexed, so…
• ADDVL SP,SP,#-4 : decrement stack pointer by (4*VL)
• STR Zi, [SP,#3,MUL VL] : store vector Zi to address (SP+3*VL)
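Putting those instructions together, here is a minimal vector-length-agnostic daxpy in C using the SVE ACLE intrinsics (my sketch, not from the slides); svwhilelt_b64 and svcntd correspond to the WHILELT and INCD instructions above:

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic daxpy: y[i] += a * x[i].
 * The same binary runs unchanged at any hardware vector length. */
void daxpy_sve(double a, const double *x, double *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntd()) {   /* step by FP64 lanes (INCD) */
        svbool_t pg = svwhilelt_b64(i, n);        /* tail predicate (WHILELT) */
        svfloat64_t xv = svld1_f64(pg, &x[i]);    /* predicated loads */
        svfloat64_t yv = svld1_f64(pg, &y[i]);
        yv = svmla_n_f64_m(pg, yv, xv, a);        /* yv += a * xv per active lane */
        svst1_f64(pg, &y[i], yv);                 /* predicated store */
    }
}

Because the predicate handles the final partial vector, there is no scalar tail loop, and the code needs no recompilation when the vector length changes.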
Other SVE Features
• Gather-load and scatter-store
• Loads/stores a single vector register from/to non-contiguous memory locations.
• Per-lane predication
• Operate on individual lanes of vector controlled by a governing predicate register.
• Predicate-driven loop control and management
• Eliminate loop heads and tails and other overheads by processing partial vectors.
• Vector partitioning for software-managed speculation
• First-fault vector load instructions allow vector accesses to cross into invalid pages.
• Extended floating-point and bitwise horizontal reductions
• In-order or tree-based floating-point sum, trading-off repeatability vs performance.
[Slide diagrams: per-lane predication of a vector add, first-fault register (FFR) partitioning of a load that crosses an invalid page, and in-order vs. tree-based reduction of 1+2+3+4, alongside a scalar loop for (i = 0; i < n; ++i).]
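The gather-load feature above can also be exercised from C; a hedged sketch (my example, not Arm's), again using the ACLE intrinsics:

#include <arm_sve.h>
#include <stdint.h>

/* y[i] = x[idx[i]]: one vector register gathered from non-contiguous
 * addresses, under the same while-loop tail predication as before. */
void gather_f64(const double *x, const uint64_t *idx, double *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);
        svuint64_t vi = svld1_u64(pg, &idx[i]);                  /* load indices */
        svfloat64_t vx = svld1_gather_u64index_f64(pg, x, vi);   /* gather x[idx[i]] */
        svst1_f64(pg, &y[i], vx);                                /* contiguous store */
    }
}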
Back to the future!
On-CPU Machine Learning Training
ML software stack on AArch64 and SVE

Available frameworks
• Container images and build recipes (e.g., TF, PyTorch, Caffe)
• Popular ML frameworks support Arm as a first-class citizen

Variety of libraries and tools
• ML libraries, such as oneDNN, Eigen, ArmPL and ArmCL
• Profilers and debuggers: Arm Forge (DDT, MAP)

Selection of training benchmarks
• Amongst which MLPerf (Training, HPC)

Using Arm architectural features
• Large core count
• INT8, Bfloat16, FP16, FP32
• SVE / SVE2
• Matrix Multiplier Extension

On the latest AArch64 hardware
• Arm Neoverse N1
• Fujitsu A64FX
• Marvell ThunderX2
• Ampere eMAG
On-CPU Machine Learning

Primary drivers for on-CPU ML:
• Flexibility
• Ease of programming
• ML processing requirements

Features enhancing ML performance on Arm CPUs:
• Arch: dot product instructions (v8.0 – v8.4)
• Arch: matrix-multiply-and-accumulate instructions (*MMLA*) (v8.6)
• Arch: Bfloat16 support (v8.6)
• Micro arch: SVE vector length
Machine Learning Frameworks
• TensorFlow: Arm actively involved
• DeepBench: Arm actively involved
• Torch: community maintained
• Mahout: available via Apache Bigtop
• Weka: community maintained
• Caffe: community maintained
• Theano (EOL): community maintained

Docker builds of popular ML frameworks:
• PyTorch: https://github.com/ARM-software/Tool-Solutions/tree/master/docker/pytorch-aarch64
• TensorFlow: https://github.com/ARM-software/Tool-Solutions/tree/master/docker/tensorflow-aarch64
• Details: https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/aarch64-docker-images-for-pytorch-and-tensorflow
ML Training on Arm-based systems
Supercomputer Fugaku and Arm SVE are just the beginning!
• Versatile architecture enriched with ML features
• Diverse selection of Arm-based implementations
• Freedom to design fit-for-purpose hardware
• Vast Arm software ecosystem and partner base
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
‫ا‬ً‫شكر‬
ধন্যবাদ
‫תודה‬
The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
A UNIVERSAL ACCELERATED COMPUTING PLATFORM

25 YEARS OF SCIENTIFIC COMPUTING ACCELERATION
X-FACTOR SPEEDUP | SOFTWARE DEFINED | FULL STACK | ONE ARCHITECTURE | EXTREME SCALE | DEVELOPMENT
THE NEW COMPUTING
[Diagram: the platform spans edge appliance to supercomputer, covering AI, edge streaming, simulation, visualization, extreme IO, data analytics, cloud and the network.]
A100 AVAILABLE VIA NVIDIA HGX A100 AND A100 PCIE
• HGX A100 8-GPU (scale-up, fastest time-to-solution for AI): 8 GPUs, full NVLink bandwidth between all GPUs with NVSwitch
• HGX A100 4-GPU (scale-up, mixed AI & HPC): 4 A100s, fully connected with shared NVLinks
• A100 PCIe (for mainstream servers): 1-8 GPUs per server, optional NVLink Bridge between 2 GPUs
5 MIRACLES OF A100
• NVIDIA Ampere Architecture: world’s largest 7nm chip, 54B transistors, HBM2
• 3rd Gen NVLink and NVSwitch: efficient scaling to enable a super GPU, 2x more bandwidth
• 3rd Gen Tensor Cores: faster, flexible, easier to use; 20x AI perf with TF32, 2.5x HPC perf
• New Sparsity Acceleration: harness sparsity in AI models, 2x AI performance
• New Multi-Instance GPU: optimal utilization with right-sized GPUs, 7 simultaneous instances per GPU
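As a brief aside on what TF32 is (a sketch of the number format, not NVIDIA code): TF32 keeps FP32's 8-bit exponent but carries only 10 mantissa bits, so its effect on a value can be approximated in C by clearing the 13 low mantissa bits of a float (the hardware rounds rather than truncates):

#include <stdint.h>
#include <string.h>

/* Approximate TF32's reduced precision: FP32 has 23 mantissa bits,
 * TF32 keeps 10, so drop the low 13 (truncation; hardware rounds). */
float tf32_approx(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* type-pun safely */
    bits &= ~((1u << 13) - 1u);        /* clear 13 low mantissa bits */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

This is why TF32 accelerates matrix math out of the box: inputs keep FP32's dynamic range, only the multiplication precision is reduced, and accumulation stays in FP32.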
INTRODUCING DGX A100
The Universal AI System for Data Analytics, Training and Inference
• 8x NVIDIA A100 GPUs with 320GB total GPU memory
• 6x NVIDIA NVSwitches: 4.8TB/s bi-directional bandwidth, 2x more than the previous-generation NVSwitch
• 12 NVLinks/GPU: 600GB/s GPU-to-GPU bi-directional bandwidth
• 9x Mellanox ConnectX-6 200Gb/s network interfaces: 450GB/s peak bi-directional bandwidth
• 15TB Gen4 NVMe SSD: 25GB/s peak bandwidth, 2x faster than Gen3 NVMe SSDs
• Dual 64-core AMD Rome CPUs and 1TB RAM: 3.2x more cores to power the most intensive AI jobs
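As a quick arithmetic cross-check of these figures (mine, not from the slide), each third-generation NVLink carries 25 GB/s per direction:

$$12~\text{links} \times 25~\tfrac{\text{GB/s}}{\text{direction}} \times 2~\text{directions} = 600~\text{GB/s per GPU}, \qquad 8 \times 600~\text{GB/s} = 4.8~\text{TB/s}$$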
UNIFIED AI ACCELERATION

BERT-Large training throughput (sequences/s): V100 FP32 = 216, V100 FP16 = 822; A100 TF32 = 1260 (a 6x out-of-the-box speedup over V100 FP32), A100 FP16 = 2274 (a 3x speedup with AMP).

BERT-Large inference (normalized to V100 = 1x): T4 = 0.6x, one MIG instance (1/7 of an A100) = 1x, all 7 MIG instances (one A100) = 7x.

Benchmark configurations: BERT pre-training throughput using PyTorch, including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 seq len = 128, Phase 2 seq len = 512 | V100: DGX-1 server with 8x V100 using FP32 and FP16 precision | A100: DGX A100 server with 8x A100 using TF32 precision and FP16.
BERT-Large inference | T4: TRT 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 7 MIG instances of 1g.5gb: pre-production TRT, batch size = 94, precision = INT8 with sparsity.
NVIDIA SHATTERS BIG DATA ANALYTICS BENCHMARK
19.5x faster TPCx-BB performance results on DGX A100 with RAPIDS
• 350 CPU servers: $23M | 22 racks | 300 kW
• 16 NVIDIA DGX A100 systems: $3.3M | 4 racks | 100 kW
• Equivalent performance at 1/7th the cost and 1/3rd the power

Performance: CPU = 4.7 hr, DGX A100 = 14.5 min (19.5x faster). After normalizing performance across CPU and GPU clusters: Cost: CPU = $23M, DGX A100 = $3.3M (1/7th the cost); Power: CPU = 298 kW, DGX A100 = 104 kW (1/3rd the power); Space: CPU = 22 racks, DGX A100 = 4 racks (less than 1/5th the space).
GPU-ACCELERATED APACHE SPARK 3.0
[Diagram: in Spark 2.x, data preparation runs on a CPU-powered cluster and model training (XGBoost | TensorFlow | PyTorch) on a separate GPU-powered cluster, linked by shared storage; in Spark 3.0, one Spark-orchestrated, GPU-powered cluster runs the whole pipeline.]

Spark 3.0 enables:
• A single pipeline, from ingest to data preparation to model training
• GPU-accelerated data preparation
• Consolidation and simplification of infrastructure

RAPIDS Accelerator for Apache Spark: built on the foundations of RAPIDS, now available on leading cloud analytics platforms.
Learn more @ nvidia.com/spark-book
UP TO 2X MORE HPC PERFORMANCE
A100 speedup over V100 (values paired in slide order): NAMD 1.5x, GROMACS 1.5x, AMBER 1.6x, LAMMPS 1.9x, FUN3D 1.7x, SPECFEM3D 1.8x, RTM 1.9x, BerkeleyGW 2.0x, Chroma 2.1x (molecular dynamics, physics, geoscience).

All results are measured. Except for BerkeleyGW, the V100 used is a single V100 SXM2 and the A100 a single A100 SXM4. App details: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four-material model. BerkeleyGW based on Chi Sum, using 8x V100 in DGX-1 vs 8x A100 in DGX A100.
NGC – GPU-OPTIMIZED HPC & AI SOFTWARE
Accelerate time to discovery and solutions
• 150+ application containers, 100+ AI models (ML, inference), plus toolkits & SDKs and Helm charts
• Domains: Healthcare | Smart Cities | Conversational AI | Robotics | more
• NGC runs on-prem, in hybrid cloud, multi-cloud and at the edge; encrypted; x86 | Arm | POWER
MLPERF: DGX SUPERPOD SETS ALL 8 AT-SCALE AI RECORDS
Under 18 minutes to train each MLPerf benchmark (time to train in minutes, lower is better; commercially available solutions; values paired in chart order):
• Reinforcement Learning (MiniGo): 17.1 (1,792 A100)
• Object Detection, Heavy Weight (Mask R-CNN): 10.5 (256 A100)
• Recommendation (DLRM): 3.3 (8 A100)
• NLP (BERT): 0.8 (2,048 A100)
• Object Detection, Light Weight (SSD): 0.8 (1,024 A100)
• Image Classification (ResNet-50 v1.5): 0.8 (1,840 A100)
• Translation, Recurrent (GNMT): 0.7 (1,024 A100)
• Translation, Non-recurrent (Transformer): 0.6 (480 A100)
Google TPUv3 results shown in the chart: 28.7 and 56.7 minutes (16 TPUv3). X = no result submitted (several benchmarks had no V100, TPUv3 or Huawei Ascend submission).

MLPerf 0.7 performance comparison at max scale. Max scale used for NVIDIA A100, NVIDIA V100, TPUv3 and Huawei Ascend for all applicable benchmarks. | MLPerf ID at scale: Transformer: 0.7-30, 0.7-52; GNMT: 0.7-34, 0.7-54; ResNet-50 v1.5: 0.7-37, 0.7-55, 0.7-1, 0.7-3; SSD: 0.7-33, 0.7-53; BERT: 0.7-38, 0.7-56, 0.7-1; DLRM: 0.7-17, 0.7-43; Mask R-CNN: 0.7-28, 0.7-48; MiniGo: 0.7-36, 0.7-51 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.
MLPERF: ALL 8 PER-CHIP AI PERFORMANCE RECORDS
Relative per-chip speedup over V100 (V100 = 1.0x; commercially available solutions): A100 delivers 1.5x to 2.5x across the eight benchmarks (Image Classification ResNet-50 v1.5, NLP BERT, Object Detection Mask R-CNN and SSD, Reinforcement Learning MiniGo, Translation GNMT and Transformer, Recommendation DLRM). Where TPUv3 and Huawei Ascend submitted results, they ranged from 0.7x to 1.2x of V100. X = no result submitted.

Per-chip performance arrived at by comparing performance at the same scale when possible and normalizing it to a single chip. 8-chip scale: V100, A100 for Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16-chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512-chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8 A100 and 16 V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15; BERT: 0.7-1, 0.7-45, 0.7-22; Mask R-CNN: 0.7-40, 0.7-19; MiniGo: 0.7-41, 0.7-20; SSD: 0.7-40, 0.7-19; GNMT: 0.7-40, 0.7-19; Transformer: 0.7-40, 0.7-19; DLRM: 0.7-43, 0.7-17 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.
SELENE: DGX SuperPOD Deployment
• #7 on TOP500 (27.6 PetaFLOPS HPL)
• #2 on Green500 (20.5 GigaFLOPS/watt)
• Fastest industrial system in the U.S.: 1+ ExaFLOPS of AI
• Built with the NVIDIA DGX SuperPOD architecture in 3 weeks, on NVIDIA DGX A100 and NVIDIA Mellanox IB, drawing on NVIDIA's decade of AI experience

Configuration:
• 2,240 NVIDIA A100 Tensor Core GPUs
• 280 NVIDIA DGX A100 systems
• 494 Mellanox 200G HDR IB switches
• 7 PB of all-flash storage
ACCELERATED COMPUTING FIGHTS COVID-19
Spanning data analytics, simulation & visualization, AI and edge:
• Oxford Nanopore: sequence the viral genome in 7 hours
• Plotly, NVIDIA: real-time infection rate analysis
• ORNL, Scripps: screen 2B drug compounds in 1 day vs 1 year
• Structura, NIH, UT Austin: CryoSPARC, 1st 3D structure of the virus spike protein
• NIH, NVIDIA: AI COVID-19 classification
• Kiwibot: robot medical supply delivery
• Whiteboard Coordinator: AI elevated body temperature screening system
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Implementing AI: High Performance Architectures

  • 6. AI is a primary goal for next-generation supercomputers
The coming generation of Exascale systems will include a diverse range of architectures at massive scale, all of which are targeting AI:
• Fugaku: Fujitsu A64FX Arm CPUs
• Perlmutter: AMD EPYC CPUs and NVIDIA GPUs
• Frontier: AMD EPYC CPUs and Radeon GPUs
• Aurora: Intel Xeon CPUs and Xe GPUs
• El Capitan: AMD EPYC CPUs and Radeon GPUs
http://uob-hpc.github.io
The Next Platform, Jan 13th 2020: “HPC in 2020: compute engine diversity gets real” https://www.nextplatform.com/2020/01/13/hpc-in-2020-compute-engine-diversity-gets-real/
June 22, 2020
Overview: The Fugaku compute system was designed and built by Fujitsu and RIKEN. Fugaku 富岳 is another name for Mount Fuji, created by combining the first character of 富士, Fuji, and 岳, mountain. The system is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. RIKEN is a large scientific research institute in Japan with about 3,000 scientists on seven campuses across Japan. Development of the Fugaku hardware started in 2014 as the successor to the K computer. The K computer mainly focused on basic science and simulations and modernized the Japanese supercomputer to be massively parallel. The Fugaku system is designed to serve a continuum of applications ranging from basic science to Society 5.0, an initiative to create a new social scheme and economic model by fully incorporating the technological innovations of the fourth industrial revolution. The relation to the Mount Fuji image is to have a broad base of applications and capacity for simulation, data science, and AI, with academic, industry, and cloud startups, along with a high peak performance on large-scale applications.
Figure 1. Fugaku system as installed in RIKEN R-CCS
The Fugaku system is built on the A64FX, an Armv8.2-A CPU with a 512-bit implementation of the Scalable Vector Extension (SVE) instructions. It adds the following Fujitsu extensions: hardware barrier, sector cache, prefetch, and a 48/52-core CPU. It is optimized for high-performance computing (HPC) with extremely high-bandwidth 3D stacked memory (4x 8 GB HBM at 1024 GB/s), an on-die Tofu-D network (~400 Gbps), high SVE FLOP/s (3.072 TFLOP/s), and various AI support (FP16, INT8, etc.). The A64FX processor supports general-purpose Linux, Windows, and other cloud systems. Simply put, Fugaku is the largest and fastest supercomputer built to date. Below is a further breakdown of the hardware:
• Caches:
  o L1D/core: 64 KB, 4-way, 256 GB/s (load), 128 GB/s (store)
  o L2/CMG: 8 MB, 16-way
  o L2/node: 4 TB/s (load), 2 TB/s (store)
  o L2/core: 128 GB/s (load), 64 GB/s (store)
• 158,976 nodes
  • 7. The UK’s Tier-2: exploring options. Isambard
• First production Arm-based HPC service
• 10,752 Armv8 cores (168n x 2s x 32c)
• Marvell ThunderX2 32core 2.5GHz
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• >420 registered users, >100 of whom are from outside the consortium
  • 8. UK Tier-2 dense GPU systems http://uob-hpc.github.io • 22 NVIDIA DGX-1 Deep Learning Systems, each comprising: • 8 NVIDIA Tesla V100 GPUs • NVIDIA's high-speed NVlink interconnect • 4 TB of SSD for machine learning datasets • over 1PB of Seagate ClusterStor storage • Mellanox EDR networking • optimized versions of Caffe, TensorFlow, Theano and Torch etc • system integration/delivery by Atos, hosting by STFC Hartree • system management by Atos / STFC Hartree http://www.hpc-uk.ac.uk/facilities/
  • 9. Arm + GPU http://uob-hpc.github.io Source: https://nvidianews.nvidia.com/news/nvidia-and-tech-leaders-team-to-build-gpu-accelerated-arm-servers-for-new-era-of-diverse-hpc-architectures
  • 10. Emerging architectures for AI / ML http://uob-hpc.github.io Google’s Tensor Processing Unit (TPU), GraphCore, Intel’s Nervana
  • 11. Google’s Tensor Processing Units: https://cloud.google.com/tpu
• Cloud TPU v3: 420 TFLOP/s, 128 GB HBM, $2.40 / TPU hour
• Cloud TPU v3 Pod: 100+ PFLOP/s, 32 TB HBM, 2-D toroidal mesh network
• v4 supposedly improves performance by 2.7x
http://uob-hpc.github.io
  • 12. Graphcore has just announced their 2nd generation “IPU” http://uob-hpc.github.io
  • 13. Graphcore IPU-M2000 • 4 x Colossus MK2 GC200 IPUs in a 1U box • 1 PetaFLOP “AI compute” (16-bit FP) • 5,888 processor cores, 35,328 independent threads • Up to 450GB of exchange memory (off-chip DRAM) • 2nd gen IPU has 7-9X more performance on AI benchmarks • 59.4B 7nm transistors in 823mm2 • 900MB of on-chip fast SRAM per IPU (3x first gen.) • 250 TFLOP/s AI compute per chip, 62.5 TFLOP/s single-precision http://uob-hpc.github.io
  • 14. Graphcore systems now include their own interconnect too http://uob-hpc.github.io
  • 15. Massive scale AI/ML supercomputers http://uob-hpc.github.io
  • 16. Graphcore 3D torus topology for large-scale AI http://uob-hpc.github.io
  • 18. Key takeaways
• Orders of magnitude more AI / ML compute coming
• Diverse architectures to deliver greater performance
• You need solutions that can work across CPUs, GPUs and now more exotic hardware
• Optimised libraries are the main path to exploitation
• TensorFlow, PyTorch, Caffe et al.
• Anything lower level requires a lot more ninja programming
http://uob-hpc.github.io
  • 19. For more information Bristol HPC group: https://uob-hpc.github.io/ Email: S.McIntosh-Smith@bristol.ac.uk Twitter: @simonmcs http://uob-hpc.github.io
  • 20. Copyright © Myrtle.ai 2020 Solving Core Recommendation Model Challenges in Data Centers Giles Peckham, Myrtle.ai
  • 21. Myrtle.ai accelerates Machine Learning inference
• Accelerates Recommendation Models, RNNs and other DNNs with sparse structures
• Achieves maximum throughput in applications with strict latency constraints
• Addresses hyper-scale inference
• Data Centers (Cloud & On-Premise) and Embedded applications
Myrtle.ai: MLCommons Founding Member | Alliance Member | Gold Partner | AI Keynote 2019 | Joint White Paper
Copyright © Myrtle.ai 2020
Recommendation Systems | Speech Synthesis | Speech Transcription | Machine Translation
  • 22. Copyright © Myrtle.ai 2020
MAU Accelerator: low latency inference accelerator for data center ML workloads. Optimized for highest latency-bounded throughput.
DNN Model | FPGA accelerator card | Cloud or enterprise data center server
  • 23. Copyright © Myrtle.ai 2020
MAU Accelerator Benefits: optimized for highest latency-bounded throughput
• Reduced data center infrastructure required: lower CapEx; mitigates against rack space limitations
• Reduced energy consumption: lower OpEx; smaller carbon footprint; mitigates against power constraints
• Deterministic low tail-latency enables the use of higher quality models: improved customer experience; better services
• Uses readily-available data center accelerator cards compatible with typical server installations: rapid deployment at scale
• Development flow based on industry standards: easy to compile from popular open-source frameworks
• Flexible & reprogrammable solution: future proof
  • 24. Copyright © Myrtle.ai 2020
Applications
• Target Applications: speech transcription; natural language processing; speech synthesis; time series cleansing & analysis; payment & trading fraud detection; anomaly detection; network security
• Target Model Architectures: fully connected linear layers; RNN, including LSTM and GRU; time delay neural network (TDNN)
• Target Sectors: finance (trading, compliance, service); search, social media & other ad servers; HPC (very large ML); life science (genomics, data analytics); defense, aerospace, security; telcos & conferencing providers
  • 25. Copyright © Myrtle.ai 2020 An Accelerator for Recommendation Systems
  • 26. Copyright © Myrtle.ai 2020
Recommendation Models
• One of the most common data center workloads
• Used for search, adverts, feeds and personalization
Demands
• Throughput / Capacity: need to ramp up capacity quickly to meet demand; months/years to commission new data center floor space
• Cost: data center rack server investment >$50B /yr [1]
• Latency / Model Accuracy / Revenue: 5 ms latency is challenging for typical server systems; a 100 ms delay in load time can cost e-commerce companies many $B /yr [2]
• Energy Consumption / Carbon Footprint: global data center energy costs >$10B /yr [3]; global data center emissions ~100M tonnes CO2 /yr [4]
1. https://www.marketsandmarkets.com/Market-Reports/data-center-rack-server-market-53332315.html
2. https://www.akamai.com/uk/en/about/news/press/2017-press/akamai-releases-spring-2017-state-of-online-retail-performance-report.jsp
3. https://www.sciencedaily.com/releases/2020/02/200227144313.htm
4. https://www.comsoc.org/publications/tcn/2019-nov/energy-efficiency-data-centers
  • 27. Copyright © Myrtle.ai 2020
Design Challenges
• A typical Recommendation Model pipeline: Input → Dense Features (compute-bound) → Sparse Features (memory-bound) → Dense Features (compute-bound) → Output
• Up to 80% of time can be spent in the sparse, memory-bound stage; the memory architecture in typical data center infrastructure is inefficient here, and existing accelerators give a poor return here (see the sketch below)
• Traditional approach: put the whole model on one chip
• Myrtle.ai approach: offload different features of the model to different hardware accelerators, and make it equally practical to adopt
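To see why the sparse stage is memory-bound, consider a minimal C sketch of an embedding-bag lookup, the core sparse operation in recommendation models. This is an illustrative toy, not Myrtle.ai's implementation; the function name and shapes are invented for the example. Each multi-hot index triggers a random read of one embedding-table row, and each fetched byte is used in only a single add, so throughput is limited by random-access DRAM bandwidth rather than compute:

```c
#include <stddef.h>

/* Illustrative embedding-bag sum: gather the table rows selected by
 * multi-hot indices and reduce them into one dense vector. */
void embedding_bag_sum(const float *table,  /* [num_rows * dim] */
                       size_t dim,
                       const size_t *ids,   /* multi-hot indices */
                       size_t num_ids,
                       float *out)          /* [dim] dense output */
{
    for (size_t d = 0; d < dim; ++d)
        out[d] = 0.0f;

    for (size_t i = 0; i < num_ids; ++i) {
        /* Random access into a table far larger than any cache. */
        const float *row = table + ids[i] * dim;
        for (size_t d = 0; d < dim; ++d)
            out[d] += row[d];   /* ~1 FLOP per 4 bytes fetched */
    }
}
```

With embedding tables running to tens or hundreds of gigabytes, almost every row read misses the cache hierarchy, which is why offloading this stage to hardware with a suitable memory architecture pays off.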
  • 28. Copyright © Myrtle.ai 2020 • Accelerates the memory-bound sparse operations in all recommendation models • Delivers large gains in latency bounded throughput • Fully preserves existing model accuracy • Is complementary to existing compute accelerators • Is integrated into the PyTorch Deep Learning Framework SEAL: An Accelerator for Recommendation Systems
  • 29. Copyright © Myrtle.ai 2020
The “Virtuous Circle”: add SEAL modules → offload sparse operations to SEAL → CPU freed up; latency reduced → increase CPU batch size → throughput increased
  • 30. Copyright © Myrtle.ai 2020
Performance
• Vector Processing Bandwidth is the bandwidth achievable when transforming random multi-hot vectors into real-valued dense vectors
• Carrier is Glacier Point v2
Vector Processing Bandwidth:
• 16 GB version: 18 GB/s (219 GB/s per carrier)
• 32 GB version: 16 GB/s (195 GB/s per carrier)
  • 31. Copyright © Myrtle.ai 2020
Key Benefits
Based on benchmarking using a weighted average of the mlperf.org benchmark recommendation models (Dec. 2019):
• Rapid 8x increase in latency-bounded throughput using existing infrastructure [1]: enables more recommendations, and better quality recommendations, to be made; higher CTRs; increased revenue; greater consumer satisfaction
• Up to 50% CapEx savings on further capacity expansion [1][2]
• Up to 80% reduction in energy consumption [1][2]: OpEx savings; smaller carbon footprint
8x more throughput | 50% less CapEx | 80% less energy
1. Comparisons are between a Xeon D-2100 performing inference on its own and the same CPU leveraging SEAL acceleration. Performance and benefits will vary, depending on individual system configuration and model usage.
2. Based on servers + SEAL only. Excludes buildings, HVAC etc.
  • 32. Copyright © Myrtle.ai 2020
Highly Complementary to Existing Infrastructure
• Accelerates existing servers; easy to install
• Complementary to other accelerators
• Scalable
• Does not require any change to the recommendation model. No model retraining. No degradation in accuracy
• Supports co-location of models with no performance penalty
• Supports concurrent deployment of different versions of a model, and loading/unloading models on the fly to facilitate A/B testing
  • 33. Copyright © Myrtle.ai 2020 Contact seal@myrtle.ai to evaluate what SEAL can do for your business For more information visit myrtle.ai/seal SEAL is the • lowest power • smallest form factor • easiest-to-deploy method of optimizing memory bound recommendation models in existing infrastructure.
  • 34. Thank You www.myrtle.ai Copyright © Myrtle.ai 2020 Giles Peckham 07785 278478 giles@myrtle.ai
  • 35. © 2020 Arm Limited (or its affiliates) Roxana Rusitoru Arm ML Research Lab 7 August 2020 Arm SVE and Supercomputer Fugaku for Deep learning Implementing AI: High Performance Architectures
  • 36. 2 © 2020 Arm Limited (or its affiliates) Disclaimer • Spent the better part of the last decade on Arm in HPC within Arm Research. • Worked on all layers from application optimization to kernel development, simulation infrastructure and next-gen architectures. • And now I do ML! (Looking after ML on CPUs)
  • 37. 3 © 2020 Arm Limited (or its affiliates) We want to train large networks on this Supercomputer Fugaku
  • 38. 4 © 2020 Arm Limited (or its affiliates) Supercomputer Fugaku Top 1
  • 39. 5 © 2020 Arm Limited (or its affiliates) Supercomputer Fugaku today
  • 40. © 2020 Arm Limited (or its affiliates)
Supercomputer Fugaku (Green500)
                  K               Fugaku
Peak FP64:        11.3 PFLOPs*    0.537 ExaFLOPs
Peak FP32:        11.3 PFLOPs     1.07 ExaFLOPs
Peak FP16:        --              2.15 ExaFLOPs
Peak INT8:        --              4.30 ExaFLOPs
Total mem BW:     5.184 PB/sec    163 PB/s
*Reported in Top500 (including I/O nodes)
  • 41. 7 © 2020 Arm Limited (or its affiliates) Fujitsu A64FX Specs
  • 42. © 2020 Arm Limited (or its affiliates)
What is this Scalable Vector Extension (aka SVE)?
• New vector extension after Advanced SIMD (aka Arm NEON) with features needed for new markets (e.g., gather load & scatter store, per-lane predication, longer vectors)
• There is no preferred vector length
• The vector length (VL) is a hardware choice, 128-2048b, in increments of 128b
• Vector Length Agnostic (VLA) programming adjusts dynamically to the available VL
• SVE is not an extension of Armv8 Advanced SIMD (aka NEON)
• SVE is a separate, optional extension with a new set of instruction encodings
• Initial focus is HPC and general-purpose server, not media/image processing
• SVE begins to tackle traditional barriers to auto-vectorization
• Very low overhead vs scalar code to encourage opportunistic vectorization
• Software-managed speculative vectorization of uncounted loops (i.e. C break)
• Extract more data-level parallelism (DLP) from existing C/C++/Fortran source code
  • 43. © 2020 Arm Limited (or its affiliates)
Why Vector Length Agnostic?
Pros:
• Fits into the 32b fixed-width A64 instruction set encoding
• Future-proofing: no need for a new instruction set when the vector length increases
• Vectorized code will scale automatically to use the whole vector length
• No need to recompile, or to re-write hand-coded SVE assembler and intrinsics (see the sketch below)
Challenges:
• Programmers and compilers must think differently about vectorization
• A different vector length may expose latent bugs
• Stack layout changes may expose stack overwriting bugs
• Software developers cannot be expected to validate at 16 different VLs (128b, 256b, 384b… 2048b)
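To make the "no need to recompile" point concrete, here is a minimal C sketch (an illustrative example, assuming a compiler with SVE support, e.g. building with -march=armv8-a+sve) in which the same binary discovers the hardware's vector length at run time:

```c
#include <arm_sve.h>
#include <stdio.h>

int main(void) {
    /* svcntb() returns the number of bytes in one SVE vector, a
     * hardware property: 16..256 bytes, i.e. 128..2048 bits. */
    printf("SVE vector length: %lu bits\n",
           (unsigned long)(svcntb() * 8));
    return 0;
}
```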
  • 44. © 2020 Arm Limited (or its affiliates)
VLA Instruction Set Support
• Vectors cannot be initialized from a compile-time constant, so…
  INDEX Zd.S,#1,#4 : Zd = [ 1, 5, 9, 13, 17, 21, 25, 29 ]
• Predicates cannot be initialized from memory, so…
  PTRUE Pd.S, MUL3 : Pd = [ T, T, T, T, T, T, F, F ]
• Vector loop increment and trip count are unknown at compile-time, so…
  INCD Xi : increment scalar Xi by # of 64b dwords in vector
  WHILELT Pd.D,Xi,Xe : next iteration predicate Pd = [ while i++ < e ]
• Vector stores to stack must be dynamically allocated and indexed, so…
  ADDVL SP,SP,#-4 : decrement stack pointer by (4*VL)
  STR Zi, [SP,#3,MUL VL] : store vector Zi to address (SP+3*VL)
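These instructions are what a vector-length-agnostic loop compiles down to. The same pattern can be written directly with the ACLE SVE intrinsics; the following sketch (function name illustrative) is the intrinsics counterpart of a WHILELT/INCD-controlled loop with predicated loads and stores:

```c
#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written once, correct for any hardware VL. */
void daxpy_vla(double *y, const double *x, double a, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntd()) {  /* INCD: step by VL/64 */
        svbool_t pg = svwhilelt_b64(i, n);       /* WHILELT: tail predicate */
        svfloat64_t vx = svld1(pg, &x[i]);       /* predicated load */
        svfloat64_t vy = svld1(pg, &y[i]);
        vy = svmla_x(pg, vy, vx, a);             /* fused multiply-add */
        svst1(pg, &y[i], vy);                    /* predicated store */
    }
}
```

The final partial iteration needs no scalar tail loop: WHILELT simply switches off the lanes past n.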
  • 45. © 2020 Arm Limited (or its affiliates)
Other SVE Features
• Gather-load and scatter-store: loads/stores a single vector register from/to non-contiguous memory locations (see the sketch below)
• Per-lane predication: operate on individual lanes of a vector controlled by a governing predicate register
• Predicate-driven loop control and management: eliminate loop heads and tails and other overheads by processing partial vectors
• Vector partitioning for software-managed speculation: first-fault vector load instructions allow vector accesses to cross into invalid pages
• Extended floating-point and bitwise horizontal reductions: in-order or tree-based floating-point sum, trading off repeatability vs performance
(Slide diagrams: predicated add, first-fault register (FFR) partitioning, and in-order vs tree-based reductions over for (i = 0; i < n; ++i))
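Gather loads are the SVE feature most directly relevant to the sparse-lookup workloads discussed elsewhere in this deck. A short sketch with ACLE intrinsics (names illustrative), gathering table elements addressed by an index vector:

```c
#include <arm_sve.h>
#include <stdint.h>

/* out[i] = table[idx[i]] for i in [0, n). */
void gather_f64(const double *table, const uint64_t *idx,
                double *out, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);
        svuint64_t vi = svld1(pg, &idx[i]);                /* indices */
        svfloat64_t v = svld1_gather_index(pg, table, vi); /* gather  */
        svst1(pg, &out[i], v);                 /* contiguous store */
    }
}
```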
  • 46. © 2020 Arm Limited (or its affiliates) Back to the future! software
  • 47. © 2020 Arm Limited (or its affiliates)
On-CPU Machine Learning Training: ML software stack on AArch64 and SVE
• Available frameworks: container images and build recipes (e.g., TF, PyTorch, Caffe); popular ML frameworks support Arm as a first-class citizen
• Variety of libraries and tools: ML libraries, such as oneDNN, Eigen, ArmPL and ArmCL; profilers and debuggers: Arm Forge (DDT, MAP)
• Selection of training benchmarks, amongst which MLPerf (Training, HPC)
• Using Arm architectural features: large core count; INT8, Bfloat16, FP16, FP32; SVE / SVE2; Matrix Multiplier Extension
• On the latest AArch64 hardware: Arm Neoverse N1, Fujitsu A64FX, Marvell ThunderX2, Ampere eMAG
  • 48. © 2020 Arm Limited (or its affiliates)
On-CPU Machine Learning
Primary drivers for on-CPU ML: flexibility; ease of programming; ML processing requirements
Features enhancing ML performance on Arm CPUs:
• Arch: dot product instructions (v8.0 – v8.4) (see the sketch below)
• Arch: matrix-multiply-and-accumulate instructions (*MMLA*) (v8.6)
• Arch: Bfloat16 support (v8.6)
• Micro-arch: SVE vector length
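As an example of the first item, the dot-product instructions are exposed through NEON intrinsics. A minimal sketch (illustrative; requires an Armv8.2+ core with the dotprod feature, e.g. compiled with -march=armv8.2-a+dotprod):

```c
#include <arm_neon.h>

/* SDOT: each of the four int32 lanes of acc accumulates a 4-way
 * int8 dot product, the primitive behind INT8 quantized inference. */
int32x4_t int8_dot_accumulate(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}
```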
  • 49. 15 © 2020 Arm Limited (or its affiliates) Machine Learning Frameworks TensorFlow • Arm actively involved Deepbench • Arm actively involved Torch • Community maintained Mahout • Available via Apache Bigtop Weka • Community maintained Caffe • Community maintained Theano (EOL) • Community maintained • Docker builds of popular ML frameworks: • PyTorch: https://github.com/ARM-software/Tool-Solutions/tree/master/docker/pytorch-aarch64 • TensorFlow: https://github.com/ARM-software/Tool-Solutions/tree/master/docker/tensorflow-aarch64 • Details: https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/aarch64- docker-images-for-pytorch-and-tensorflow
  • 50. 16 © 2020 Arm Limited (or its affiliates) ML Training on Arm-based systems Supercomputer Fugaku and Arm SVE are just the beginning! Versatile architecture enriched with ML features Diverse selection of Arm-based implementations Freedom to design fit-for-purpose hardware Vast Arm software ecosystem and partner base
  • 51. © 2020 Arm Limited (or its affiliates) Thank You Danke Merci 谢谢 ありがとう Gracias Kiitos 감사합니다 धन्यवाद ‫ا‬ً‫شكر‬ ধন্যবাদ ‫תודה‬
  • 52. The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks © 2020 Arm Limited (or its affiliates)
  • 54. 25 YEARS OF SCIENTIFIC COMPUTING ACCELERATION: X-FACTOR SPEEDUP | FULL STACK | ONE ARCHITECTURE | SOFTWARE DEFINED | EXTREME SCALE. 25 years of computing acceleration development.
  • 55. THE NEW COMPUTING: Edge | Appliance | Network | Cloud | Supercomputer; spanning AI, edge streaming, simulation, visualization, extreme IO and data analytics
  • 56. A100 AVAILABLE VIA NVIDIA HGX A100 AND A100 PCIE
• HGX A100 8-GPU: scale-up, fastest time-to-solution for AI; 8 GPUs, full NVLink bandwidth between all GPUs with NVSwitch
• HGX A100 4-GPU: scale-up, mixed AI & HPC; 4 A100s, fully connected w/ shared NVLinks
• A100 PCIe: for mainstream servers; 1-8 GPUs per server, optional NVLink Bridge between 2 GPUs
  • 57. 5 MIRACLES OF A100
• NVIDIA Ampere Architecture: world’s largest 7nm chip, 54B XTORS, HBM2
• 3rd Gen Tensor Cores: faster, flexible, easier to use; 20x AI perf with TF32, 2.5x HPC perf
• New Sparsity Acceleration: harness sparsity in AI models, 2x AI performance
• 3rd Gen NVLINK and NVSWITCH: efficient scaling to enable a super GPU; 2x more bandwidth
• New Multi-Instance GPU (MIG): optimal utilization with right-sized GPUs; 7 simultaneous instances per GPU (see the sketch below)
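As a sketch of what MIG means for software: each instance is exposed to the CUDA runtime as if it were an independent GPU with its own memory and compute slice, so unmodified enumeration code sees whatever slice it has been granted. A hedged example using the standard CUDA runtime API (error checking omitted for brevity):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);  /* MIG instances appear as devices */
    for (int i = 0; i < n; ++i) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("device %d: %s, %zu MB\n",
               i, p.name, p.totalGlobalMem >> 20);
    }
    return 0;
}
```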
  • 58. INTRODUCING DGX A100: The Universal AI System – Data Analytics, Training and Inference
• 8x NVIDIA A100 GPUs with 320GB total GPU memory; 12 NVLinks/GPU, 600GB/sec GPU-to-GPU bi-directional bandwidth
• 6x NVIDIA NVSwitches: 4.8TB/sec bi-directional bandwidth, 2x more than previous generation
• 9x Mellanox ConnectX-6 200Gb/s network interfaces: 450GB/sec peak bi-directional bandwidth
• 15TB Gen4 NVMe SSD: 25GB/sec peak bandwidth, 2x faster than Gen3 NVMe SSDs
• Dual 64-core AMD Rome CPUs and 1TB RAM: 3.2x more cores to power the most intensive AI jobs
  • 59. UNIFIED AI ACCELERATION
BERT-Large training throughput (sequences/s, reading the chart): 216 (V100 FP32), 822 (V100 FP16), 1260 (A100 TF32), 2274 (A100 FP16); i.e. a 6x out-of-the-box speedup with TF32 and a 3x speedup with AMP (FP16).
BERT-Large inference (relative throughput): T4 0.6x; V100 1x; one MIG instance (1/7 of an A100) roughly 1x; a full A100 running 7 MIG instances 7x.
BERT pre-training throughput using PyTorch including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 Seq Len = 128, Phase 2 Seq Len = 512 | V100: DGX-1 server with 8xV100 using FP32 and FP16 precision; A100: DGX A100 server with 8xA100 using TF32 precision and FP16 | BERT Large Inference | T4: TRT 7.1, Precision = INT8, Batch Size = 256; V100: TRT 7.1, Precision = FP16, Batch Size = 256 | A100 with 7 MIG instances of 1g.5gb: pre-production TRT, Batch Size = 94, Precision = INT8 with Sparsity
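The "out-of-the-box" TF32 gain comes from the CUDA math libraries routing FP32 matrix operations onto the A100's Tensor Cores. A hedged sketch of requesting TF32 explicitly for a cuBLAS SGEMM (CUDA 11+; function name illustrative, error handling omitted, dA/dB/dC assumed to be device pointers in column-major layout):

```c
#include <cublas_v2.h>

/* C = A * B for n x n FP32 matrices, computed with TF32 math. */
void sgemm_tf32(cublasHandle_t h, int n,
                const float *dA, const float *dB, float *dC) {
    const float one = 1.0f, zero = 0.0f;
    /* TF32 keeps FP32 dynamic range with a shortened mantissa. */
    cublasSetMathMode(h, CUBLAS_TF32_TENSOR_OP_MATH);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
}
```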
  • 60. NVIDIA SHATTERS BIG DATA ANALYTICS BENCHMARK: 19.5x faster TPCx-BB performance results on DGX A100 with RAPIDS
• 350 CPU servers: $23M | 22 racks | 300 kW
• 16 NVIDIA DGX A100 systems: $3.3M | 4 racks | 100 kW
• Equivalent performance at 1/7th the cost and 1/3rd the power
Performance: CPU = 4.7 hr, DGX A100 = 14.5 min (19.5x faster); after normalizing performance across CPU and GPU clusters -> Cost: CPU = $23M, DGX A100 = $3.3M (1/7th the cost); Power: CPU = 298kW, DGX A100 = 104kW (1/3rd the power); Space: CPU = 22 racks, DGX A100 = 4 racks (less than 1/5th the space)
  • 61. GPU-ACCELERATED APACHE SPARK 3.0
With Spark 2.x, data preparation runs on a CPU-powered cluster and model training (XGBoost | TensorFlow | PyTorch) on a separate GPU-powered cluster via shared storage; with Spark 3.0, both stages are Spark-orchestrated on a single GPU-powered cluster.
Spark 3.0 enables:
• A single pipeline, from ingest to data preparation to model training
• GPU-accelerated data preparation
• Consolidation and simplification of infrastructure
Built on the foundations of RAPIDS (RAPIDS Accelerator for Apache Spark). Now available on leading cloud analytics platforms. Learn more @ nvidia.com/spark-book
  • 62. A100 UP TO 2X MORE HPC PERFORMANCE
Measured speedups over V100 of 1.5x to 2.1x across NAMD, GROMACS, AMBER, LAMMPS (molecular dynamics), FUN3D, SPECFEM3D, RTM (geoscience), BerkeleyGW and Chroma (physics).
All results are measured. Except BerkeleyGW, V100 used is single V100 SXM2; A100 used is single A100 SXM4. More app detail: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four material model. BerkeleyGW based on Chi Sum and uses 8xV100 in DGX-1 vs 8xA100 in DGX A100.
  • 63. NGC – GPU-OPTIMIZED HPC & AI SOFTWARE: Accelerate Time to Discovery and Solutions
• 150+ application containers, 100+ AI models, plus toolkits & SDKs and Helm charts
• ML and inference for healthcare | smart cities | conversational AI | robotics | more
• NGC runs on-prem, multi-cloud, hybrid cloud and edge; encrypted; x86 | Arm | POWER
  • 64. MLPERF: DGX SUPERPOD SETS ALL 8 AT-SCALE AI RECORDS: under 18 minutes to train each MLPerf benchmark
Time to train (minutes, lower is better), commercially available solutions, NVIDIA A100 at max scale: MiniGo (reinforcement learning) 17.1 (1,792 A100); Mask R-CNN (heavy-weight object detection) 10.5 (256 A100); DLRM (recommendation) 3.3 (8 A100); BERT (NLP) 0.8 (2,048 A100); SSD (light-weight object detection) 0.8 (1,024 A100); ResNet-50 v1.5 (image classification) 0.8 (1,840 A100); GNMT (recurrent translation) 0.7 (1,024 A100); Transformer (non-recurrent translation) 0.6 (480 A100). TPUv3 results shown: 28.7 and 56.7 minutes (16 TPUv3) on the two benchmarks it submitted; X = no result submitted.
MLPerf 0.7 performance comparison at max scale. Max scale used for NVIDIA A100, NVIDIA V100, TPUv3 and Huawei Ascend for all applicable benchmarks. | MLPerf ID at scale: Transformer: 0.7-30, 0.7-52; GNMT: 0.7-34, 0.7-54; ResNet-50 v1.5: 0.7-37, 0.7-55, 0.7-1, 0.7-3; SSD: 0.7-33, 0.7-53; BERT: 0.7-38, 0.7-56, 0.7-1; DLRM: 0.7-17, 0.7-43; Mask R-CNN: 0.7-28, 0.7-48; MiniGo: 0.7-36, 0.7-51 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • 65. MLPERF: ALL 8 PER-CHIP AI PERFORMANCE RECORDS
(Chart: relative per-chip speedup over V100, commercially available solutions. A100 leads on all eight benchmarks, from roughly 1.5x on ResNet-50 v1.5 up to 2.5x on DLRM; TPUv3 and Huawei Ascend results, where submitted, fall between 0.7x and 1.2x of V100; X = no result submitted.)
Per-chip performance arrived at by comparing performance at the same scale when possible and normalizing it to a single chip. 8-chip scale: V100, A100 for Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16-chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512-chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8 A100 and 16 V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15; BERT: 0.7-1, 0.7-45, 0.7-22; Mask R-CNN: 0.7-40, 0.7-19; MiniGo: 0.7-41, 0.7-20; SSD: 0.7-40, 0.7-19; GNMT: 0.7-40, 0.7-19; Transformer: 0.7-40, 0.7-19; DLRM: 0.7-43, 0.7-17 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • 66. SELENE: DGX SuperPOD deployment
• #7 on TOP500 (27.6 PetaFLOPS HPL), #2 on Green500 (20.5 GigaFLOPS/watt)
• Fastest industrial system in the U.S.: 1+ ExaFLOPS AI
• Built with the NVIDIA DGX SuperPOD architecture in 3 weeks, using NVIDIA DGX A100, NVIDIA Mellanox IB, and NVIDIA’s decade of AI experience
Configuration: 2,240 NVIDIA A100 Tensor Core GPUs; 280 NVIDIA DGX A100 systems; 494 Mellanox 200G HDR IB switches; 7 PB of all-flash storage
  • 67. ACCELERATED COMPUTING FIGHTS COVID-19 (data analytics, simulation & visualization, AI, edge)
• Oxford Nanopore: sequence viral genome in 7 hrs
• Plotly, NVIDIA: real-time infection rate analysis
• ORNL, Scripps: screen 2B drug compounds in 1 day vs 1 year
• Structura, NIH, UT Austin: CryoSPARC, 1st 3D structure of the virus spike protein
• NIH, NVIDIA: AI COVID-19 classification
• Kiwibot: robot medical supply delivery
• Whiteboard Coordinator: AI elevated body temperature screening system