The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI summer webinar series.
Businesses of every kind are increasing their use of artificial intelligence to gain efficiency and make better decisions. Traditional computer architectures do not serve these new data-processing demands well. Enterprises, developers, data scientists, and researchers need new platforms that unify all AI workloads, simplifying infrastructure and accelerating ROI. This has led to the development of high-performance, specialised hardware devices to meet these demands.
The focus of this webinar was the impact of AI data processing on data centres, particularly from the technology perspective. The webinar featured four expert presentations covering opportunities, implementation techniques and case studies, followed by a panel Q&A session.
Implementing AI: High Performance Architectures
1.
2. www.ktn-uk.org @KTNUK
What we do - Growth Through Innovation
• Connecting: finding valuable partners; project consortium building; supply chain knowledge; driving new connections; articulating challenges; finding creative solutions
• Funding: awareness and dissemination; public and private finance; advice – project scope; advice – proposal mentoring; project follow-up
• Influencing: promoting industry needs; informing policy makers; informing strategy; communicating trends and market drivers
• Supporting: intelligence on trends and markets; business planning support; success stories / raising profile
• Navigating: navigating the innovation support landscape; promoting coherent strategy and approach; engaging wider stakeholders; curation of innovation resources
3. eFutures aims to strengthen and support a network of people
working in electronic systems across the UK
• Building new links and increasing involvement with industry
• Mapping national electronics research, to ensure work across the UK is known and recognised
• Encouraging and funding innovative multi-disciplinary/multi-university proposals
• Communicating with our network via a monthly magazine, social media and new website
• Running events that support our network and our strategy
• Piloting an academic Mentoring Scheme
• Launching a Big Ideas Challenge – more details soon
• Ideas warmly welcomed. Please get involved!
Twitter @efuturesuk
Sign up to our mailing list: efutures@qub.ac.uk
4. Today’s Agenda
Large scale HPC hardware in the age of AI
Prof Simon McIntosh-Smith, Bristol University
Solving Core Recommendation Model Challenges in Data Centers
Giles Peckham, Myrtle.ai
Short Break
Arm SVE and Supercomputer Fugaku for Deep learning
Roxana Rusitoru, Arm
A Universal Accelerated Computing Platform
Timothy Lanfear, NVIDIA
Panel Q&A Session
Chaired by Prof Roger Woods
5. Large scale HPC hardware
in the age of AI
Prof. Simon McIntosh-Smith
Head of the HPC research group
University of Bristol, UK
Twitter: @simonmcs
Email: simonm@cs.bris.ac.uk
http://uob-hpc.github.io
6. AI is a primary goal for next-generation supercomputers
The coming generation of Exascale systems will
include a diverse range of architectures at massive
scale, all of which are targeting AI:
• Fugaku: Fujitsu A64FX Arm CPUs
• Perlmutter: AMD EPYC CPUs and NVIDIA GPUs
• Frontier: AMD EPYC CPUs and Radeon GPUs
• Aurora: Intel Xeon CPUs and Xe GPUs
• El Capitan: AMD EPYC CPUs and Radeon GPUs
http://uob-hpc.github.io
The Next Platform, Jan 13th 2020: “HPC in 2020: compute engine diversity gets real”, https://www.nextplatform.com/2020/01/13/hpc-in-2020-compute-engine-diversity-gets-real/
Overview
The Fugaku compute system was designed and built by Fujitsu and RIKEN. Fugaku (富岳) is another name for Mount Fuji, created by combining the first character of 富士 (Fuji) with 岳 (mountain). The system is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. RIKEN is a large scientific research institute in Japan with about 3,000 scientists across seven campuses. Development of the Fugaku hardware started in 2014 as the successor to the K computer, which mainly focused on basic science and simulation and modernized Japanese supercomputing with massive parallelism. The Fugaku system is designed to support a continuum of applications ranging from basic science to Society 5.0, an initiative to create a new social scheme and economic model by fully incorporating the technological innovations of the fourth industrial revolution. The Mount Fuji image reflects a broad base of applications and capacity for simulation, data science, and AI, spanning academia, industry, and cloud startups, together with high peak performance on large-scale applications.
Figure 1. Fugaku System as installed in RIKEN R-CCS
The Fugaku system is built on the A64FX, an Armv8.2-A processor that implements the Scalable Vector Extension (SVE) at a 512-bit vector width. Fujitsu adds the following extensions: hardware barrier, sector cache, prefetch, and a 48/52-core CPU configuration. It is optimized for high-performance computing (HPC) with extremely high-bandwidth 3D-stacked memory (4x 8 GB HBM2 at 1,024 GB/s), an on-die Tofu-D network (~400 Gbps), high SVE FLOP/s (3.072 TFLOP/s), and various AI support (FP16, INT8, etc.). The A64FX processor supports general-purpose Linux, Windows, and other cloud systems. Simply put, Fugaku is the largest and fastest supercomputer built to date. Below is a further breakdown of the hardware.
• Caches:
o L1D/core: 64 KB, 4-way, 256 GB/s (load), 128 GB/s (store)
o L2/CMG: 8 MB, 16-way
o L2/node: 4 TB/s (load), 2 TB/s (store)
o L2/core: 128 GB/s (load), 64 GB/s (store)
• 158,976 nodes
7. The UK’s Tier-2 exploring options
Isambard
• First production Arm-based HPC service
• 10,752 Armv8 cores (168 nodes x 2 sockets x 32 cores)
• Marvell ThunderX2, 32 cores, 2.5 GHz
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
• >420 registered users, >100 of whom are
from outside the consortium
8. UK Tier-2 dense GPU systems
http://uob-hpc.github.io
• 22 NVIDIA DGX-1 Deep Learning Systems, each comprising:
• 8 NVIDIA Tesla V100 GPUs
• NVIDIA's high-speed NVlink interconnect
• 4 TB of SSD for machine learning datasets
• over 1PB of Seagate ClusterStor storage
• Mellanox EDR networking
• optimized versions of Caffe, TensorFlow, Theano, Torch, etc.
• system integration/delivery by Atos, hosting by STFC Hartree
• system management by Atos / STFC Hartree
http://www.hpc-uk.ac.uk/facilities/
9. Arm + GPU
http://uob-hpc.github.io
Source: https://nvidianews.nvidia.com/news/nvidia-and-tech-leaders-team-to-build-gpu-accelerated-arm-servers-for-new-era-of-diverse-hpc-architectures
10. Emerging architectures for AI / ML
http://uob-hpc.github.io
Google’s Tensor Processing Unit (TPU), Graphcore, Intel’s Nervana
12. Graphcore has just announced their 2nd generation “IPU”
http://uob-hpc.github.io
13. Graphcore IPU-M2000
• 4 x Colossus MK2 GC200 IPUs in a 1U box
• 1 PetaFLOP “AI compute” (16-bit FP)
• 5,888 processor cores, 35,328 independent threads
• Up to 450GB of Exchange Memory (off-chip DRAM)
• 2nd-gen IPU has 7-9x more performance on AI benchmarks
• 59.4B transistors (7nm) in 823mm²
• 900MB of fast on-chip SRAM per IPU (3x first gen.)
• 250 TFLOP/s AI compute per chip, 62.5 TFLOP/s single precision
http://uob-hpc.github.io
18. Key takeaways
• Orders of magnitude more AI / ML compute is coming
• Diverse architectures will deliver greater performance
• You need solutions that can work across CPUs, GPUs and now more exotic hardware
• Optimised libraries are the main path to exploitation: TensorFlow, PyTorch, Caffe et al. (see the sketch below)
• Anything lower level requires a lot more ninja programming
http://uob-hpc.github.io
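As a concrete illustration of the “optimised libraries” takeaway (an addition, not from the talk): a minimal PyTorch sketch of why high-level frameworks are the main path to exploitation; the same model definition runs unchanged on a CPU, a GPU, or other supported back-ends, with the vendor’s tuned kernels underneath.

```python
# Minimal sketch (assuming PyTorch): portable, device-agnostic model code.
import torch
import torch.nn as nn

# Pick the best available device; nothing below changes with the hardware.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
x = torch.randn(64, 1024, device=device)

# Dispatched to cuBLAS/cuDNN on NVIDIA GPUs, oneDNN/BLAS on CPUs.
y = model(x)
print(y.shape, y.device)
```

Anything below this level of abstraction (hand-written CUDA, SVE intrinsics, IPU kernels) ties the code to one architecture, which is the “ninja programming” the takeaway warns about.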
19. For more information
Bristol HPC group: https://uob-hpc.github.io/
Email: S.McIntosh-Smith@bristol.ac.uk
Twitter: @simonmcs
http://uob-hpc.github.io
54. 25 YEARS OF SCIENTIFIC COMPUTING ACCELERATION
X-Factor Speedup | Full Stack | One Architecture | Software Defined | Extreme Scale | 25 Years of Computing Acceleration Development
55. THE NEW COMPUTING
[Diagram: AI, edge, streaming, simulation, visualization and data analytics workloads spanning the cloud, supercomputer, extreme IO, edge appliance and the network]
56. A100 AVAILABLE VIA NVIDIA HGX A100 AND A100 PCIE
• HGX A100 8-GPU (scale-up, fastest time-to-solution for AI): 8 GPUs, full NVLink bandwidth between all GPUs with NVSwitch
• HGX A100 4-GPU (scale-up, mixed AI & HPC): 4 A100s, fully connected with shared NVLinks
• A100 PCIe (for mainstream servers): 1-8 GPUs per server, optional NVLink Bridge between 2 GPUs
57. 5 MIRACLES OF A100
• NVIDIA Ampere architecture: world’s largest 7nm chip, 54B transistors, HBM2
• 3rd-gen NVLink and NVSwitch: efficient scaling to enable a super GPU, 2x more bandwidth
• 3rd-gen Tensor Cores: faster, flexible, easier to use; 20x AI performance with TF32, 2.5x HPC performance
• New sparsity acceleration: harnesses sparsity in AI models, 2x AI performance
• New Multi-Instance GPU (MIG): optimal utilization with right-sized GPUs, 7 simultaneous instances per GPU (see the sketch after this slide)
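For context (an addition, not from the NVIDIA deck): each MIG instance appears to software as an independent GPU, so existing framework code can target one without modification. A minimal sketch, assuming PyTorch and an A100 already partitioned into MIG instances by an administrator; the instance UUID below is a hypothetical placeholder.

```python
# Minimal sketch (assumptions: PyTorch, and an A100 already partitioned into
# MIG instances, e.g. via nvidia-smi; the UUID below is a hypothetical
# placeholder; substitute a real one reported by `nvidia-smi -L`).
import os

# Select one MIG instance before any CUDA context is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

import torch

# To PyTorch, the MIG slice looks like an ordinary single GPU ("cuda:0").
device = torch.device("cuda:0")
x = torch.randn(32, 128, device=device)
print(torch.cuda.get_device_name(0), float(x.sum()))
```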
58. INTRODUCING DGX A100
The Universal AI System – Data Analytics, Training and Inference
• 8x NVIDIA A100 GPUs with 320GB total GPU memory; 12 NVLinks per GPU, 600GB/sec GPU-to-GPU bi-directional bandwidth
• 6x NVIDIA NVSwitches; 4.8TB/sec bi-directional bandwidth, 2x more than the previous-generation NVSwitch
• 9x Mellanox ConnectX-6 200Gb/s network interfaces; 450GB/sec peak bi-directional bandwidth
• 15TB Gen4 NVMe SSD; 25GB/sec peak bandwidth, 2x faster than Gen3 NVMe SSDs
• Dual 64-core AMD Rome CPUs and 1TB RAM; 3.2x more cores to power the most intensive AI jobs
59. UNIFIED AI ACCELERATION
BERT-Large training throughput (sequences/s): V100 FP32 = 216; V100 FP16 = 822; A100 TF32 = 1,260, a 6x out-of-the-box speedup over V100 FP32; A100 with AMP (FP16) = 2,274, a 3x speedup over V100 FP16 (see the precision sketch below)
BERT-Large inference (relative throughput): T4 = 0.6x; V100 = 1x; one MIG instance (1/7th of an A100) = 1x; seven MIG instances (one full A100) = 7x
Benchmark notes: BERT pre-training throughput using PyTorch, including (2/3) Phase 1 (seq len 128) and (1/3) Phase 2 (seq len 512); V100 on a DGX-1 server with 8x V100 using FP32 and FP16 precision; A100 on a DGX A100 server with 8x A100 using TF32 precision and FP16. BERT-Large inference: T4 with TRT 7.1, INT8, batch size 256; V100 with TRT 7.1, FP16, batch size 256; A100 with 7 MIG instances of 1g.5gb, pre-production TRT, batch size 94, INT8 with sparsity.
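To make the TF32 and AMP numbers above concrete, here is a minimal sketch (an addition, assuming PyTorch 1.7+ on an Ampere-class GPU; not part of the NVIDIA material) of the two switches involved: TF32 applies automatically to FP32 matmuls and convolutions on Ampere, while AMP (automatic mixed precision) runs eligible ops in FP16 under an autocast context.

```python
# Minimal sketch (assuming PyTorch >= 1.7 on an Ampere-class GPU) of the two
# precision modes behind the "6x with TF32" and "3x with AMP" figures above.
import torch
import torch.nn as nn

device = torch.device("cuda")

# TF32: on Ampere, FP32 matmuls/convolutions use TF32 Tensor Cores; these
# flags make the default explicit (set them to False to force classic FP32).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps FP16 gradients stable

x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1024, device=device)

opt.zero_grad()
# AMP: eligible ops inside autocast run in FP16 on Tensor Cores.
with torch.cuda.amp.autocast():
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```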
60. NVIDIA SHATTERS BIG DATA ANALYTICS BENCHMARK
19.5x faster TPCx-BB performance results on DGX A100 with RAPIDS:
• 350 CPU servers: $23M, 22 racks (16 servers per rack), 300 kW
• 16 NVIDIA DGX A100 systems: $3.3M, 4 racks, 100 kW
Equivalent performance at 1/7th the cost and 1/3rd the power.
Performance: CPU = 4.7 hr, DGX A100 = 14.5 min (19.5x faster). After normalizing performance across the CPU and GPU clusters: cost, CPU $23M vs DGX A100 $3.3M (1/7th the cost); power, CPU 298 kW vs DGX A100 104 kW (1/3rd the power); space, CPU 22 racks vs DGX A100 4 racks (less than 1/5th the space).
61. GPU-ACCELERATED APACHE SPARK 3.0
In Spark 2.x, data preparation runs on a CPU-powered cluster and model training (XGBoost | TensorFlow | PyTorch) on a separate GPU-powered cluster, linked by shared storage. In Spark 3.0, a single Spark-orchestrated, GPU-powered cluster handles both data preparation and model training.
Spark 3.0 enables:
• A single pipeline, from ingest to data preparation to model training (a minimal configuration sketch follows below)
• GPU-accelerated data preparation
• Consolidation and simplification of infrastructure
Built on the foundations of RAPIDS with the RAPIDS Accelerator for Apache Spark; now available on leading cloud analytics platforms. Learn more @ nvidia.com/spark-book
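As a concrete illustration (an addition, not from the slides), a minimal PySpark sketch of enabling the RAPIDS Accelerator so that DataFrame/SQL data preparation runs on the GPU; the plugin class and config keys are the ones the project documents, while the paths and column names are hypothetical placeholders.

```python
# Minimal sketch (assuming PySpark 3.x with the RAPIDS Accelerator for Apache
# Spark jars on the classpath; paths/columns are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-etl-sketch")
    # Load the RAPIDS Accelerator plugin so supported SQL/DataFrame
    # operators execute on the GPU instead of the CPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Resource scheduling: one GPU per executor, shared by four tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Data preparation step of the single pipeline: with the plugin active,
# the read, aggregation and write below are GPU-accelerated where supported.
df = spark.read.parquet("/data/clicks.parquet")      # hypothetical path
features = df.groupBy("user_id").count()             # hypothetical column
features.write.mode("overwrite").parquet("/data/features.parquet")
```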
62. UP TO 2X MORE HPC PERFORMANCE
A100 speedup over V100 (all results measured): NAMD 1.5x, GROMACS 1.5x, AMBER 1.6x, LAMMPS 1.9x (molecular dynamics); FUN3D 1.7x (physics); SPECFEM3D 1.8x, RTM 1.9x (geoscience); BerkeleyGW 2.0x, Chroma 2.1x (physics)
Except for BerkeleyGW, the V100 used is a single V100 SXM2 and the A100 a single A100 SXM4; BerkeleyGW (based on Chi Sum) uses 8x V100 in a DGX-1 vs 8x A100 in a DGX A100.
App details: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four-material model.
63. NGC: GPU-OPTIMIZED HPC & AI SOFTWARE
Accelerate time to discovery and solutions: 150+ application containers; 100+ AI models (ML and inference; healthcare, smart cities, conversational AI, robotics and more); toolkits & SDKs; Helm charts
NGC runs on-prem, in hybrid, multi-cloud and at the edge; encrypted; on x86 | Arm | POWER
64. MLPERF: DGX SUPERPOD SETS ALL 8 AT-SCALE AI RECORDS
Under 18 minutes to train each MLPerf benchmark. Time to train in minutes (lower is better), NVIDIA A100 at max scale, commercially available solutions:
• Reinforcement Learning (MiniGo): 17.1 (1,792 A100)
• Object Detection, heavy weight (Mask R-CNN): 10.5 (256 A100)
• Recommendation (DLRM): 3.3 (8 A100)
• NLP (BERT): 0.8 (2,048 A100)
• Object Detection, light weight (SSD): 0.8 (1,024 A100)
• Image Classification (ResNet-50 v1.5): 0.8 (1,840 A100)
• Translation, recurrent (GNMT): 0.7 (1,024 A100)
• Translation, non-recurrent (Transformer): 0.6 (480 A100)
Google TPUv3 results shown for comparison: 28.7 min and 56.7 min (each on 16 TPUv3); for the remaining bars, no result was submitted by NVIDIA V100, TPUv3 or Huawei Ascend (X = no result submitted).
MLPerf 0.7 performance comparison at max scale; max scale used for NVIDIA A100, NVIDIA V100, TPUv3 and Huawei Ascend for all applicable benchmarks. MLPerf IDs at scale: Transformer: 0.7-30, 0.7-52; GNMT: 0.7-34, 0.7-54; ResNet-50 v1.5: 0.7-37, 0.7-55, 0.7-1, 0.7-3; SSD: 0.7-33, 0.7-53; BERT: 0.7-38, 0.7-56, 0.7-1; DLRM: 0.7-17, 0.7-43; Mask R-CNN: 0.7-28, 0.7-48; MiniGo: 0.7-36, 0.7-51. MLPerf name and logo are trademarks; see www.mlperf.org for more information.
65. MLPERF: ALL 8 PER-CHIP AI PERFORMANCE RECORDS
Relative speedup over V100 (V100 = 1.0x), commercially available solutions:
• A100: ResNet-50 v1.5 (Image Classification) 1.5x; BERT (NLP) 1.6x; Mask R-CNN (Object Detection, heavy weight) 1.9x; MiniGo (Reinforcement Learning) 2.0x; SSD (Object Detection, light weight) 2.0x; GNMT (Translation, recurrent) 2.4x; Transformer (Translation, non-recurrent) 2.4x; DLRM (Recommendation) 2.5x
• Other submissions: Huawei Ascend 0.7x and TPUv3 1.2x on ResNet-50 v1.5; TPUv3 0.9x on BERT; no other results submitted (X = no result submitted)
Per-chip performance arrived at by comparing performance at the same scale where possible and normalizing it to a single chip. 8-chip scale: V100, A100 for Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16-chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512-chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8 A100 and 16 V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15; BERT: 0.7-1, 0.7-45, 0.7-22; Mask R-CNN: 0.7-40, 0.7-19; MiniGo: 0.7-41, 0.7-20; SSD: 0.7-40, 0.7-19; GNMT: 0.7-40, 0.7-19; Transformer: 0.7-40, 0.7-19; DLRM: 0.7-43, 0.7-17. MLPerf name and logo are trademarks; see www.mlperf.org for more information.
66. SELENE: DGX SUPERPOD DEPLOYMENT
• #7 on TOP500 (27.6 PetaFLOPS HPL); #2 on Green500 (20.5 GigaFLOPS/watt)
• Fastest industrial system in the U.S.: 1+ ExaFLOPS of AI compute
• Built with the NVIDIA DGX SuperPOD architecture in 3 weeks, using NVIDIA DGX A100 and NVIDIA Mellanox IB, drawing on NVIDIA’s decade of AI experience
Configuration: 2,240 NVIDIA A100 Tensor Core GPUs; 280 NVIDIA DGX A100 systems; 494 Mellanox 200G HDR IB switches; 7 PB of all-flash storage
67. ACCELERATED COMPUTING FIGHTS COVID-19
Spanning data analytics, simulation & visualization, and AI at the edge:
• Oxford Nanopore: sequence the viral genome in 7 hours
• Plotly, NVIDIA: real-time infection-rate analysis
• ORNL, Scripps: screen 2B drug compounds in 1 day vs 1 year
• Structura, NIH, UT Austin: CryoSPARC, first 3D structure of the virus spike protein
• NIH, NVIDIA: AI COVID-19 classification
• Kiwibot: robot medical-supply delivery
• Whiteboard Coordinator: AI elevated-body-temperature screening system