SlideShare a Scribd company logo
1 of 41
Download to read offline
Arm A64fx and Post-K: Game
Changing CPU & Supercomputer for
HPC and its Convergence of with
Big Data / AI
Satoshi Matsuoka
Director, Riken Center for Computational Science /
Professor, Tokyo Institute of Technology
Hyperion HPC User’s Forum
Santa Fe, New Mexico
20190403
2
Riken Center for Computational Science(R-CCS)
World Leading HPC Research, active collaborations w/Universities, national labs, & Industry
New Materials & Electronic Devices e.g., Photonics,
Neuromorphics, Quantum, Reconfigurable
Mutual Synergy
Foundational research on computing in high
performance for K, Post-K, and beyond towards the
“Post-Moore” era, including future high
performance architectures, new computing and
programming models, system software, large scale
systems modeling, big data analytics, and scalable
artificial intelligence / machine learning
Breakthrough Science & Technology using high
performance computing capabilities of K, Post-K
and beyond to address the issues of high public
concern, in areas such as life sciences, climate &
environment, disaster prediction & prevention,
advanced manufacturing, applications of machine
learning for Society 5.0.
Sci. of Computing Sci. by Computing
Apr 1 2018 Became Director of Riken-CCS:
Science, of Computing, by Computing, and for Computing
Sci. for Computing
High Resolution, High
Fidelity Analysis &
Simulation
Novel Future High
Performance Computing
Architectures & Algorithms
Arm64fx & Post-K (to be renamed) are:
3
⚫ Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU
⚫ HPC Optimized: Extremely high package high memory BW
(1TByte/s), on-die Tofu-D network BW (~400Gbps), high SVE FLOPS
(~3Teraflops), various AI support (FP16, INT8, etc.)
⚫ Gen purpose CPU – Linux, Windows (Word), other SCs/Clouds
⚫ Extremely power efficient – > 10x power/perf efficiency for CFD
benchmark over current mainstream x86 CPU
⚫ Largest and fastest supercomputer to be ever built circa 2020
⚫ > 150,000 nodes, superseding LLNL Sequoia
⚫ > 150 PetaByte/s memory BW
⚫ Tofu-D 6D Torus NW, 60 Petabps injection BW (10x global IDC traffic)
⚫ 25~30PB NVMe L1 storage
⚫ ~10,000 endpoint 100Gbps I/O network into Lustre
⚫ The first ‘exascale’ machine (not exa64bitflops but in apps perf.)
4
1. Heritage of the K-Computer, HP in simulation via extensive Co-Design
• High performance: up to x100 performance of K in real applications
• Multitudes of Scientific Breakthroughs via Post-K application programs
• Simultaneous high performance and ease-of-programming
2. New Technology Innovations of Post-K
• High Performance, esp. via high memory BW
Performance boost by “factors” c.f. mainstream CPUs in many
HPC & Society5.0 apps via BW & Vector acceleration
• Very Green e.g. extreme power efficiency
Ultra Power efficient design & various power control knobs
• Arm Global Ecosystem & SVE contribution
Top CPU in ARM Ecosystem of 21 billion chips/year, SVE co-
design and world’s first implementation by Fujitsu
• High Perf. on Society5.0 apps incl. AI
Architectural features for high perf on Society 5.0 apps based
on Big Data, AI/ML, CAE/EDA, Blockchain security, etc.
Post-K: The Game Changer
ARM: Massive ecosystem
from embedded to HPC
Global leadership not just in
the machine & apps, but as
cutting edge IT
Technology not just limited to Post-K, but into societal IT infrastructures e.g. Clouds
Post K A64fx Processor is…
5
⚫ an Many-Core ARM CPU…
⚫ 48 compute cores + 2 or 4 assistant (OS) cores
⚫ Brand new core design
⚫ Near Xeon-Class Integer performance core
⚫ ARM V8 --- 64bit ARM ecosystem
⚫ Tofu-D + PCIe 3 external connection
⚫ …but also an accelerated GPU-like processor
⚫ SVE 512 bit vector extensions (ARM & Fujitsu)
⚫ Integer (1, 2, 4, 8 bytes) + Float (16, 32, 64 bytes)
⚫ Cache + scratchpad-like local memory (sector cache)
⚫ HBM2 on package memory – Massive Mem BW (Bytes/DPF ~0.4)
⚫ Streaming memory access, strided access, scatter/gather etc.
⚫ Intra-chip barrier synch. and other memory enhancing features
⚫ GPU-like High performance in HPC, AI/Big Data, Auto Driving…2018/3/13
“Post-K” Chronology
6
(Disclaimer: below includes speculative schedules and subject to change)
⚫ May 2018 A0 Chip came out, almost bug free
⚫ 1Q2019 B0 Chip on hand, bug free, exceeded perf. target
⚫ Mar 2019 “Post-K” manufacturing budget approval by the Diet, actual
manufacturing contract signed and starts
⚫ Aug 2019 End of K-Computer operations
⚫ 4Q2019 “Post-K” installation starts
⚫ 1H2020 “Post-K” preproduction operation starts
⚫ 2020~2021 “Post-K” production operation starts (hopefully)
⚫ And of course we move on…
Watch for announcements on “Post-K” technology commercialization by
Fujitsu and its partner vendors RSN
Co-design for Post-K
(slides by Mitsuhisa Sato Team Leader of Architecture Development Team)
Deputy project leader, FLAGSHIP 2020 project
Deputy Director, RIKEN Center for Computational Science (R-CCS)
ARM for HPC - Co-design Opportunities, ISC 2018 BOF
June 25, 2018
Richard F. BARRETT, et.al. “On the
Role of Co-design in High Performa
nce Computing”, Transition of HPC
Towards Exascale Computing
Started in 2009 (before K)
8
⚫ Architectural Parameters to be determined
⚫ #SIMD, SIMD length, #core, #NUMA node, O3 resources, specialized hardware
⚫ cache (size and bandwidth), memory technologies
⚫ Chip die-size, power consumption
⚫ Interconnect
Co-design from Apps to Architecture
Target applications representatives of
almost all our applications in terms of
computational methods and
communication patterns in order to
design architectural features.
⚫ We have selected a set of target applications
⚫ Performance estimation tool
⚫ Performance projection using Fujitsu FX100
execution profile to a set of arch. parameters.
⚫ Co-design Methodology (at early design
phase)
1. Setting set of system parameters
2. Tuning target applications under the
system parameters
3. Evaluating execution time using prediction
tools
4. Identifying hardware bottlenecks and
changing the set of system parameters
Protein simulation before K
◼ Simulation of a protein in isolation
Folding simulation of Villin, a small protein
with 36 amino acids
Protein simulation with K
DNA
GROEL
Ribosome
ProteinsRibosome
GROEL
400nm
TRNA
100nm
ATP
water
ion
metabolites
Genesis MD: proteins in a cell environment
9
◼ all atom simulation of a cell interior
◼ cytoplasm of Mycoplasma genitalium
NICAM: Global Climate Simulation
◼ Global cloud resolving model with 0.87 km-mesh which allows
resolution of cumulus clouds
◼ Month-long forecasts of Madden-Julian oscillations in the tropics is realized.
Global cloud resolving
model
Miyamoto et al (2013) , Geophys. Res. Lett., 40, 4922–4926, doi:10.1002/grl.50944.
13
11
Heart Simulator
◼ Heartbeat, blood ejection, coronary
circulation are simulated consistently.
◼ Applications explored
◼ congenital heart diseases
◼ Screening for drug-induced irregular
heartbeat risk
◼ Multi-scale
simulator of
heart starting
from molecules
and building up
cells, tissues, and
heart
electrocardiogram
(ECG)
ultrasonic waves
UT-Heart, Inc., Fujitsu Limited
12
⚫ Tools for performance tuning
⚫ Performance estimation tool
⚫ Performance projection using Fujitsu FX100
execution profile
⚫ Gives “target” performance
⚫ Post-K processor simulator
⚫ Based on gem5, O3, cycle-level simulation
⚫ Very slow, so limited to kernel-level evaluation
⚫ Co-design of apps
⚫ 1. Estimate “target” performance using
performance estimation tool
⚫ 2. Extract kernel code for simulator
⚫ 3. Measure exec time using simulator
⚫ 4. Feed-back to code optimization
⚫ 5. Feed-back to compiler
Co-design of Apps for Architecture
NICAM OPRT3D_divdamp区間 (1/3
推定方法
性能
予測ツ
メモリ HB
カーネル化、縮小 あ
ソースコード版数 r4_a
浮動小数点数精度 単精
SIMD幅 1
集計スレッド番号 3
実行時間[秒] 2.09E
GFLOPS/プロセス 97
メモリスループット(R+W)
[GB/s/プロセス]
10
メモリスループット(R)
[GB/s/プロセス]
88
メモリスループット(W)
[GB/s/プロセス]
20
浮動小数点演算数
/スレッド
1.77E
実行命令数/スレッド 2.91E
ロード・ストア命令数
/スレッド
9.56E
L1Dミス数/スレッド 1.39E
L2ミス数/スレッド 6.11E
①
②
③
Perform-
ance
estimation
toolcd
Simulator Simulator Simulator
As is Tuning 1
Tuning 2
Target
performance
Executiontime
13
⚫ ARM SVE Vector Length Agnostic feature is very interesting, since we can
examine vector performance using the same binary.
⚫ We have investigated how to improve the performance of SVE keeping
hardware-resource the same. (in “Rev-A” paper)
⚫ ex. “512 bits SVE x 2 pipes” vs. “1024 bits SVE x 1 pipe”
⚫ Evaluation of Performance and Power ( in “coolchips” paper) by using our gem-5
simulator (with “white” parameter) and ARM compiler.
⚫ Conclusion: Wide vector size over FPU element size will improve performance if there are
enough rename registers and the utilization of FPU has room for improvement.
ARM for HPC - Co-design Opportunities
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
triad nbody dgemm
RelativeExecutionTime
LEN=4 LEN=8 LEN=8 (x2)
Faster
Note that these researches are not relevant to
“post-K” architecture.
⚫ Y. Kodama, T. Oajima and M. Sato. “Preliminary
Performance Evaluation of Application Kernels Using
ARM SVE with Multiple Vector Lengths”, In Re-
Emergence of Vector Architectures Workshop (Rev-
A) in 2017 IEEE International Conference on Cluster
Computing, pp. 677-684, Sep. 2017.
⚫ T. Odajima, Y. Kodama and M. Sato, “Power
Performance Analysis of ARM Scalable Vector
Extension”, In IEEE Symposium on Low-Power and
High-Speed Chips and Systems (COOL Chips 21), Apr.
2018
14 © 2019 FUJITSU
A64FX: Summary
◼ Arm SVE, high performance and high efficiency
◼ DP performance 2.7+ TFLOPS, >90%@DGEMM
◼ Memory BW 1024 GB/s, >80%@STREAM Triad
12x compute cores
1x assistant core
A64FX
ISA (Base, extension) Armv8.2-A, SVE
Process technology 7 nm
Peak DP performance 2.7+ TFLOPS
SIMD width 512-bit
# of cores 48 + 4
Memory capacity 32 GiB (HBM2 x4)
Memory peak bandwidth 1024 GB/s
PCIe Gen3 16 lanes
High speed interconnect TofuD integrated
PCle
Controller
Tofu
Interface
C
C
C
C
N
O
C
HBM2HBM2
HBM2HBM2
CMG CMG
CMG CMG
CMG:Core Memory Group NOC:Network on Chip
15 © 2019 FUJITSU
A64FX technologies: Core performance
◼ High calc. throughput of Fujitsu’s original CPU core w/ SVE
◼ 512-bit wide SIMD x 2 pipelines and new integer functions
A0 A1 A2 A3
B0 B1 B2 B3
X X X X
8bit 8bit 8bit 8bit
C
32bit
INT8 partial dot product
16 © 2019 FUJITSU
A64FX technologies: Scalable architecture
◼ Core Memory Group (CMG)
◼ 12 compute cores for computing and
an assistant core for OS daemon, I/O, etc.
◼ Shared L2 cache
◼ Dedicated memory controller
◼ Four CMGs maintain cache coherence w/
on-chip directory
◼ Threads binding within a CMG allows linear
speed up of cores' performance
core core core core core core
core core core core core core
core
X-Bar connection
L2 cache 8MiB 16-way
Memory
controller
Network
on chip
CMG configuration
HBM2
Network
on
chip
CMG
CMG
CMG
CMG
HBM2
HBM2
HBM2
HBM2
A64FX chip configuration
Tofu
controller
PCIe
controller
HHBHBMBM2M2828G8GiGBiBiB
◼ Extremely high bandwidth in caches and memory
• A64FX has out-of-order mechanisms in cores, caches and memory controllers.
It maximizes the capability of each layer’s bandwidth
Performance
>2.7TFLOPS
L1 Cache
>11.0TB/s (BF ratio = 4)
High Bandwidth
CMG
L2 Cache
>3.6TB/s (BF ratio = 1.3)
L1D 64KiB, 4way
512-bit wide SIMD
2x FMAs
Core Core CoreCore
>230
GB/s
>115
GB/s
12x Computing Cores + 1x Assistant Core
Memory
1024GB/s (BF ratio =~0.37)
>115
GB/s
>57
GB/s
HBM2 8GiB
L2 Cache 8MiB, 16way
256
GB/s
13 All Rights Reserved. Copyright © FUJITSU LIMITED 2018
18 © 2019 FUJITSU
A64FX: L1D cache uncompromised BW
◼ 128B/cycle sustained BW even for
unaligned SIMD load
◼ “Combined Gather” doubles gather
(indirect) load’s data throughput,
when target elements are within a
“128-byte aligned block” for a pair of
two regs, even & odd
L1D$
Read port0
Read port1
Read data0
64B/cycle
Read data1
64B/cycle
Mem.
128B
0 1 2 3 4 5 6 7
0 1 3 2 6 7 45
8B
Regs
flow-1
flow-2
flow-4 flow-3
Maximizes BW to 32 bytes/cyc.
19 © 2019 FUJITSU
A64FX: Power monitor and analyzer
◼ Energy monitor (per chip)
◼ Node power via Power API*1 (~msec)
◼ Averaged power of a node, CMG (cores,
L2 cache, memory) etc.
◼ Energy analyzer (per core)
◼ Power profiler via PAPI*2 (~nsec)
◼ Fine grained power analysis of a core, L2
cache, and memory
*1: Sandia National Laboratory *2: Performance Application Programming Interface
A64FX
CMG#0
Core
Power calc.Core activity
L2 cache
Power calc.L2 cache activity
Memory
Power calc.Memory activity
CMG
#1
CMG
#2
CMG
#3
Node 12 Cores L2 cache HBM2 Tofu PCIe
Energy
Analyzer
Assistant core
Energy
Monitor
CMG
Core
Memory
L2 Cache
Power calc. for PCIePower calc. for Tofu
20 © 2019 FUJITSU
A64FX: Power Knobs to reduce power consumption
◼ “Power knob” limits units’ activity via user APIs
◼ Performance/W can be optimized by utilizing Power knobs, Energy monitor
& analyzer
L1 I$
Branch
Predictor
Decode
& Issue
RSE0
RSA
RSE1
RSBR
PGPR
EXA
EXB
EAGA
EXC
EAGB
EXD
PFPR
Fetch
Port
Store
Port L1D$
HBM2 Controller
Fetch Issue Dispatch Reg-read Execute Cache and Memory
CSE
Commit
PC
Control
Registers
L2$
HBM2
Write
Buffer
Tofu controller
FLA
PPR
FLB
PRX
FLA pipeline only
HBM2 B/W adjust
(units of 10%)
Frequency reduction
Decode width: 4 to 2
EXA pipeline only
PCI Controller
52 cores
21 © 2019 FUJITSU
Preliminary performance evaluation results
◼ Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
Post-K A64fx A0 (ES) performance
Performance / CPU Machine Performance (HPC)
Peak TF
(DFP)
Peak Mem.
BW
Stream
Triad
Theor
etical
B/F
DGEMM
Efficiency
Linpack
Efficiency
GF/W
Network BW
Per Chip
Post-K A64fx
(A0 Eng.
Sample)
2.764/
3.072
1024GB/s 840GB/s
0.37/
0.33
94 % 87.7 % >15
TOFU-D
40.8GB/s
(6.8x 6)
Intel KNL 3.0464 600GB/s 490GB/s 0.20 66% 54.4 % 4.9 12.5 GB/s
Intel Skylake 1.6128 127.8GB/s 97 GB/s 0.08 80 % 66.7 % 4.5 6.2GB/s
NVIDIA V100
(DGX-2)
7.8 900 GB/s 855GB/s 0.12 76 % 15.113
160GB/s
6.2GB/s
◼ OpenMP版を用いてノードあたり演算性能を評価
FX10に対するノードあたり性能向上比
9
8
7
6
5
4
3
2
1
0
NAS Parallel Benchmark of FX100
[Slide by Ikuo Miyoshi, Fujitsu, SSKen2015]
FX10(16スレッド) FX100(32スレッド) Haswell(32スレッド)†
† エラーバーはCPU周波数を1.9GHzに固定しない場合の性能を表す
使用コード: NAS Parallel Benchmarks Ver. 3.3.1 OpenMP版 クラスC
FX100は、FX10比平均3.9倍、Haswell比平均1.3倍のノードあたり演算性能
MG SP FT LU IS BT CG EP
多重格子法 スカラ五重 離散3次元 ガ ウ ス
対角ソルバ フーリエ変換 ザイデル
法
整数ソート ブロック三重 CG法
対角ソルバ カーネル
驚異的並列
LU EPISMG FT CGSP BT
23 Copyright 2015 FUJITSU LIMITED
性能向上比
Note: Haswell:
1 node = 2 chips
32 threads&cores
FX100: 1 node =
1 chip
32 threads&cores
◼ 評価条件
24 Copyright 2015 FUJITSU LIMITED
◼ 測定結果
FX10に対するノードあたり性能向上比
Fiber ( Post-K) MiniApp on FX100
[Slide by Ikuo Miyoshi, Fujitsu, SSKen2015]
FX100は、FX10比平均3.3倍、
Haswell比平均1.4倍の ノード
あたり性能
注) FFVCには開発版コンパイラ、 NTChemに
は 開発版数学ライブラリを使用。 QCDでは
セクタキャッシュ利用、
FFBではループリローリングのコード変更を実
施
アプリ名 問題サイズ 評価区間 スレッド数×プロセス数
(左からFX10、FX100、Haswell)
CCS QCD 32×32×32×32 BiCGStab 16t×2p 32t×1p 16t×2p
NICAM-DC-MINI gl05rl00z80pe10 Dynamics 3t×10p
FFB-MINI 1,048,576要素 MAIN_LOOP 1t×32p 8t×4p 1t×32p
FFVC-MINI 256×256×256 Total 4t×8p 16t×2p
NTChem-MINI taxol RIMP2_Driver 1t×32p 16t×2p 2t×16p
Note: Haswell:
1 node = 2 chips
32 threads&cores
FX100: 1 node =
1 chip
32 threads&cores
25 © 2019 FUJITSU
Post-K performance evaluation
◼ Himeno Benchmark (Fortran90)
† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,
SC18, https://dl.acm.org/citation.cfm?id=3291728
FUJITSU CONFIDENTIAL
Post-K Chassis, PCB (w/DLC), and CPU Package
CMU
CPU Package
A0 Chip Booted in June
Undergoing Tests
60
mm
60
mm
230 mm
280
mm
CPU
A64fx
Copyright 2018 FUJITSU LIMITED
W 800㎜
D1400㎜
H2000㎜
384 nodes
27 © 2019 FUJITSU
CMU: CPU Memory Unit
◼ A64FX CPU x2 (Two independent nodes)
◼ QSFP28 x3 for Active Optical Cables
◼ Single-side blind mate connectors of signals & water
◼ ~100% direct water cooling
Water
Water
Electrical signals
AOC
QSFP28 (X)
QSFP28 (Y)
QSFP28 (Z)
AOC
AOC
SCAsia2019, March 12
TOFU-D 6D Mesh/Torus Network
◼ Six coordinate axes: X, Y, Z, A, B, C
◼ X, Y, Z: the size varies according to the system configuration
◼ A, B, C: the size is fixed to 2×3×2
◼ Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)
Copyright 2018 FUJITSU LIMITED
September11th,2018,IEEECluster2018
Z
X
C
A
B
X×Y×Z×2×3×2
Y
28
29 © 2019 FUJITSU
A64FX: Tofu interconnect D
◼ Integrated w/ rich resources
◼ Increased TNIs achieves higher injection BW & flexible comm. patterns
◼ Increased barrier resources allow flexible collective comm. algorithms
◼ Memory bypassing achieves low latency
◼ Direct descriptor & cache injection
TofuD spec
Port bandwidth 6.8 GB/s
Injection bandwidth 40.8 GB/s
Measured
Put throughput 6.35 GB/s
Ping-pong latency 0.49~0.54 µs
c
c
c
c
c c
c
c
c
c
c
cc
NOC
HBM2
CMG
c
c
c
c
c c
c
c
c
c
c
cc
HBM2
CMG
c
c
c
c
c c
c
c
c
c
c
cc
HBM2
CMG
c
c
c
c
c c
c
c
c
c
c
cc
HBM2
CMG
PCle
A64FX
TNI0
TNI1
TNI2
TNI3
TNI4
TNI5
TofuNetworkRouter
2lanes×10ports
TofuD
Put Latencies
◼ 8B Put transfer between nodes on the same board
◼ The low-latency features were used
◼ Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1
◼ The cache injection feature contributed to this reduction
◼ TofuD reduced the Put latency by 0.22 μs from that of Tofu2
Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018
Communication settings Latency
Tofu1 Descriptor on main memory 1.15 µs
Direct Descriptor 0.91 µs
Tofu2 Cache injection OFF 0.87 µs
Cache injection ON 0.71 µs
TofuD To/From far CMGs 0.54 µs
To/From near CMGs 0.49 µs
0.20 µs
0.22 µs
30
Injection Rates per Node
◼ Simultaneous Put transfers to multiple nearest-neighbor nodes
◼ Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs
◼ The injection rate of TofuD was approximately 83% that of Tofu2
◼ The efficiencies of Tofu1 were lower than 90%
◼ Because of a bottleneck in the bus that connects CPU and ICC
◼ The efficiencies of Tofu2 and TofuD exceeded 90 %
◼ Integration into the processor chip removed the bottleneck
Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018
Injection rate Efficiency
Tofu1 (K) 15.0 GB/s 77 %
Tofu1 (FX10) 17.6 GB/s 88 %
Tofu2 45.8 GB/s 92 %
TofuD 38.1 GB/s 93 %
31
Packaging – Rack Structure of Post-K
◼ Rack
◼ 8 shelves
◼ 192 CMUs or 384 CPUs
◼ Shelf
◼ 24 CMUs or 48 CPUs
◼ X×Y×Z×A×B×C = 1×1×4×2×3×2
◼ Top or bottom half of rack
◼ 4 shelves
◼ X×Y×Z×A×B×C = 2×2×4×2×3×2
Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018
Rack
Shelves
32
33 © 2019 FUJITSU
Post-K system configuration
◼ Scalable design
Unit # of nodes Description
CPU 1 Single socket node with HBM2 & Tofu interconnect D
CMU 2 CPU Memory Unit: 2x CPU
BoB 16 Bunch of Blades: 8x CMU
Shelf 48 3x BoB
Rack 384 8x Shelf
System 150k+ As a Post-K system
CPU CMU BoB RackShelf System
Overview of Post-K System & Storage
34
⚫ 3-level hierarchical storage
⚫ 1st Layer: GFS Cache + Temp FS (25~30 PB NVMe)
⚫ 2nd Layer: Lustre-based GFS (a few hundred PB HDD)
⚫ 3rd Layer: Off-site Cloud Storage
⚫ Full Machine Spec
⚫ >150,000 nodes
~8 million High Perf. Arm v8.2 Cores
⚫ > 150PB/s memory BW
⚫ Tofu-D 10x Global IDC traffic @ 60Pbps
⚫ ~10,000 I/O fabric endpoints
⚫ > 400 racks
⚫ ~40 MegaWatts Machine+IDC
PUE ~ 1.1 High Pressure DLC
⚫ NRE pays off: ~= 15~30 million
state-of-the art competing CPU
Cores for HPC workloads
(both dense and sparse problems)
Prepping the 40+MW
Facility (actual photo)
35
Post-K system software
⚫ RIKEN and Fujitsu are developing a software stack for Post-
K
SCAsia2019, March 12
Post-K system hardware
Linux OS / McKernel (Lightweight kernel)
Fujitsu Technical Computing Suite / RIKEN-developed system software
Post-K applications
Management software Programming environmentFile system
System management
for high availability &
power saving operation
Job management for
higher system
utilization & power
efficiency
LLIO
NVM-based file I/O
accelerator
OpenMP, COARRAY, Math.libs.
MPI (Open MPI, MPICH)
XcalableMPFEFS
Lustre-based
distributed file system
Compilers (C, C++, Fortran)
Debugging and tuning tools
Post-K Under development w/ RIKEN
Post-K Programming Environment
37
⚫ Programing Languages and Compilers provided by Fujitsu
⚫ Fortran2008 & Fortran2018 subset
⚫ C11 & GNU and Clang extensions
⚫ C++14 & C++17 subset and GNU and Clang extensions
⚫ OpenMP 4.5 & OpenMP 5.0 subset
⚫ Java
⚫ Parallel Programming Language & Domain Specific Library provided by RIKEN
⚫ XcalableMP
⚫ FDPS (Framework for Developing Particle Simulator)
⚫ Process/Thread Library provided by RIKEN
⚫ PiP (Process in Process)
⚫ Script Languages provided by Linux distributor
⚫ E.g., Python+NumPy, SciPy
⚫ Communication Libraries
⚫ MPI 3.1 & MPI4.0 subset
⚫ Open MPI base (Fujitsu), MPICH (RIKEN)
⚫ Low-level Communication Libraries
⚫ uTofu (Fujitsu), LLC(RIKEN)
⚫ File I/O Libraries provided by RIKEN
⚫ Lustre
⚫ pnetCDF, DTF, FTAR
⚫ Math Libraries
⚫ BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
⚫ EigenEXA, Batched BLAS (RIKEN)
⚫ Programming Tools provided by Fujitsu
⚫ Profiler, Debugger, GUI
⚫ NEW: Containers (Singularity) and other Cloud APIs
⚫ NEW: AI software stacks (w/ARM)
2018/6/26
GCC and LLVM will be also available
OSS Application Porting @ Arm HPC Users Group
http://arm-hpc.gitlab.io/
(http://arm-hpc.gitlab.io/)
Application Lang. GCC LLVM Arm Fujitsu
LAMMPS C++ Modified Modified Modified Modified
GROMACS C Modified Modified Modified Modified
GAMESS* Fortran Modified Modified Modified Modified
OpenFOAM C++ Modified Modified Modified Modified
NAMD C++ Modified Modified Modified Modified
WRF Fortran Modified Modified Modified Modified
Quantum
ESPRESSO
Fortran Ok in as is Ok in as is Ok in as is Modified
NWChem Fortran Ok in as is Modified Modified Modified
ABINIT Fortran Modified Modified Modified Modified
CP2K Fortran Ok in as is Issues found Issues found Modified
NEST* C++ Ok in as is Modified Modified Modified
BLAST* C++ Ok in as is Modified Modified Modified
OSS Application Porting @ Arm HPC Users Group
http://arm-hpc.gitlab.io/
Application Lang. GCC LLVM Arm Fujitsu
LAMMPS C++ Modified Modified Modified Modified
GROMACS C Modified Modified Modified Modified
GAMESS* Fortran Modified Modified Modified Modified
OpenFOAM C++ Modified Modified Modified Modified
NAMD C++ Modified Modified Modified Modified
WRF Fortran Modified Modified Modified Modified
Quantum
ESPRESSO
Fortran Ok in as is Ok in as is Ok in as is Modified
NWChem Fortran Ok in as is Modified Modified ongoing
ABINIT Fortran Modified Modified Modified Modified
CP2K Fortran Ok in as is Issues found Issues found ongoing
NEST* C++ Ok in as is Modified Modified Modified
BLAST* C++ Ok in as is Modified Modified Modified
Twelve primary OSS applications are listed and being tested in
the Users Group for each compilers, collaboratively w/ Arm
(http://arm-hpc.gitlab.io/)
40
Post-K Processor
◆High perf FP16&Int8
◆High mem BW for convolution
◆Built-in scalable Tofu network
Unprecedened DL scalability
High Performance DNN Convolution
Low Precision ALU + High Memory Bandwi
dth + Advanced Combining of Convolution
Algorithms (FFT+Winograd+GEMM)
High Performance and Ultra-Scalable Network
for massive scaling model & data parallelism
Unprecedented Scalability of Data/
Massive Scale Deep Learning on Post-K
C P U
For the
Post-K
supercomputer
C P U
For the
Post-K
supercomputer
C P U
For the
Post-K
supercomputer
C P U
For the
Post-K
supercomputer
TOFU Network w/high
injection BW for fast
reduction
Open “Post-K” Naming
⚫ Foreign submissions welcome
⚫ Requirements for the post-K’s name are:
⚫ The name should preferably express the idea that RIKEN is a world-
class research institute operating a state-of-the–art supercomputer.
⚫ The name should be attractive not only to Japanese speakers but to
people around the world.
https://www.r-ccs.riken.jp/en/topics/naming.html
until April 8, 2019 5 pm JST

More Related Content

What's hot

第6回 配信講義 計算科学技術特論A(2021)
第6回 配信講義 計算科学技術特論A(2021)第6回 配信講義 計算科学技術特論A(2021)
第6回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向Shinya Takamaeda-Y
 
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...Odinot Stanislas
 
GPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIAGPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIANVIDIA Japan
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdfYasunori Goto
 
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA Japan
 
HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?NVIDIA Japan
 
オープンデータの本質と活用事例
オープンデータの本質と活用事例オープンデータの本質と活用事例
オープンデータの本質と活用事例Masahiko Shoji
 
RISC-V の現況と Esperanto Technologies のアプローチ
RISC-V の現況と Esperanto Technologies のアプローチRISC-V の現況と Esperanto Technologies のアプローチ
RISC-V の現況と Esperanto Technologies のアプローチYutaka Yasuda
 
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側Takeshi HASEGAWA
 
Chips alliance omni xtend overview
Chips alliance omni xtend overviewChips alliance omni xtend overview
Chips alliance omni xtend overviewRISC-V International
 
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けて
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けてZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けて
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けてHideki Takase
 
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化について
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化についてマルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化について
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化についてFixstars Corporation
 
FPGAをロボット(ROS)で「やわらかく」使うには
FPGAをロボット(ROS)で「やわらかく」使うにはFPGAをロボット(ROS)で「やわらかく」使うには
FPGAをロボット(ROS)で「やわらかく」使うにはHideki Takase
 
第12回 配信講義 計算科学技術特論B(2022)
第12回 配信講義 計算科学技術特論B(2022)第12回 配信講義 計算科学技術特論B(2022)
第12回 配信講義 計算科学技術特論B(2022)RCCSRENKEI
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedRCCSRENKEI
 
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App..."Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...Edge AI and Vision Alliance
 
Halide による画像処理プログラミング入門
Halide による画像処理プログラミング入門Halide による画像処理プログラミング入門
Halide による画像処理プログラミング入門Fixstars Corporation
 
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_FinalGopi Krishnamurthy
 

What's hot (20)

第6回 配信講義 計算科学技術特論A(2021)
第6回 配信講義 計算科学技術特論A(2021)第6回 配信講義 計算科学技術特論A(2021)
第6回 配信講義 計算科学技術特論A(2021)
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向
 
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...
Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express™ B...
 
GPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIAGPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIA
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdf
 
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
 
RISC-V User level ISA
RISC-V User level ISARISC-V User level ISA
RISC-V User level ISA
 
HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?
 
オープンデータの本質と活用事例
オープンデータの本質と活用事例オープンデータの本質と活用事例
オープンデータの本質と活用事例
 
RISC-V の現況と Esperanto Technologies のアプローチ
RISC-V の現況と Esperanto Technologies のアプローチRISC-V の現況と Esperanto Technologies のアプローチ
RISC-V の現況と Esperanto Technologies のアプローチ
 
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
 
Chips alliance omni xtend overview
Chips alliance omni xtend overviewChips alliance omni xtend overview
Chips alliance omni xtend overview
 
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けて
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けてZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けて
ZytleBot: ROSベースの自律移動ロボットへのFPGAの統合に向けて
 
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化について
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化についてマルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化について
マルチレイヤコンパイラ基盤による、エッジ向けディープラーニングの実装と最適化について
 
FPGAをロボット(ROS)で「やわらかく」使うには
FPGAをロボット(ROS)で「やわらかく」使うにはFPGAをロボット(ROS)で「やわらかく」使うには
FPGAをロボット(ROS)で「やわらかく」使うには
 
第12回 配信講義 計算科学技術特論B(2022)
第12回 配信講義 計算科学技術特論B(2022)第12回 配信講義 計算科学技術特論B(2022)
第12回 配信講義 計算科学技術特論B(2022)
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons Learned
 
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App..."Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...
"Snapdragon Hybrid Computer Vision/Deep Learning Architecture for Imaging App...
 
Halide による画像処理プログラミング入門
Halide による画像処理プログラミング入門Halide による画像処理プログラミング入門
Halide による画像処理プログラミング入門
 
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
 

Similar to Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI

01 From K to Fugaku
01 From K to Fugaku01 From K to Fugaku
01 From K to FugakuRCCSRENKEI
 
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...inside-BigData.com
 
05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPCRCCSRENKEI
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisNomanSiddiqui41
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
The Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCThe Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCinside-BigData.com
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesKTN
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit Ganesan Narayanasamy
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdfLevLafayette1
 
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...KTN
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Researchdbpublications
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"Phil Hughes
 

Similar to Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI (20)

01 From K to Fugaku
01 From K to Fugaku01 From K to Fugaku
01 From K to Fugaku
 
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
 
Japan's post K Computer
Japan's post K ComputerJapan's post K Computer
Japan's post K Computer
 
05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & Analysis
 
Future of hpc
Future of hpcFuture of hpc
Future of hpc
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
The Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCThe Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPC
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace Architectures
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"
ARM-based Supercomputer from Fujitsu and RIKEN - "Post-K"
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI

  • 1. Arm A64fx and Post-K: Game Changing CPU & Supercomputer for HPC and its Convergence of with Big Data / AI Satoshi Matsuoka Director, Riken Center for Computational Science / Professor, Tokyo Institute of Technology Hyperion HPC User’s Forum Santa Fe, New Mexico 20190403
  • 2. 2 Riken Center for Computational Science(R-CCS) World Leading HPC Research, active collaborations w/Universities, national labs, & Industry New Materials & Electronic Devices e.g., Photonics, Neuromorphics, Quantum, Reconfigurable Mutual Synergy Foundational research on computing in high performance for K, Post-K, and beyond towards the “Post-Moore” era, including future high performance architectures, new computing and programming models, system software, large scale systems modeling, big data analytics, and scalable artificial intelligence / machine learning Breakthrough Science & Technology using high performance computing capabilities of K, Post-K and beyond to address the issues of high public concern, in areas such as life sciences, climate & environment, disaster prediction & prevention, advanced manufacturing, applications of machine learning for Society 5.0. Sci. of Computing Sci. by Computing Apr 1 2018 Became Director of Riken-CCS: Science, of Computing, by Computing, and for Computing Sci. for Computing High Resolution, High Fidelity Analysis & Simulation Novel Future High Performance Computing Architectures & Algorithms
  • 3. Arm64fx & Post-K (to be renamed) are: 3 ⚫ Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU ⚫ HPC Optimized: Extremely high package high memory BW (1TByte/s), on-die Tofu-D network BW (~400Gbps), high SVE FLOPS (~3Teraflops), various AI support (FP16, INT8, etc.) ⚫ Gen purpose CPU – Linux, Windows (Word), other SCs/Clouds ⚫ Extremely power efficient – > 10x power/perf efficiency for CFD benchmark over current mainstream x86 CPU ⚫ Largest and fastest supercomputer to be ever built circa 2020 ⚫ > 150,000 nodes, superseding LLNL Sequoia ⚫ > 150 PetaByte/s memory BW ⚫ Tofu-D 6D Torus NW, 60 Petabps injection BW (10x global IDC traffic) ⚫ 25~30PB NVMe L1 storage ⚫ ~10,000 endpoint 100Gbps I/O network into Lustre ⚫ The first ‘exascale’ machine (not exa64bitflops but in apps perf.)
  • 4. 4 1. Heritage of the K-Computer, HP in simulation via extensive Co-Design • High performance: up to x100 performance of K in real applications • Multitudes of Scientific Breakthroughs via Post-K application programs • Simultaneous high performance and ease-of-programming 2. New Technology Innovations of Post-K • High Performance, esp. via high memory BW Performance boost by “factors” c.f. mainstream CPUs in many HPC & Society5.0 apps via BW & Vector acceleration • Very Green e.g. extreme power efficiency Ultra Power efficient design & various power control knobs • Arm Global Ecosystem & SVE contribution Top CPU in ARM Ecosystem of 21 billion chips/year, SVE co- design and world’s first implementation by Fujitsu • High Perf. on Society5.0 apps incl. AI Architectural features for high perf on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, Blockchain security, etc. Post-K: The Game Changer ARM: Massive ecosystem from embedded to HPC Global leadership not just in the machine & apps, but as cutting edge IT Technology not just limited to Post-K, but into societal IT infrastructures e.g. Clouds
  • 5. Post K A64fx Processor is… 5 ⚫ an Many-Core ARM CPU… ⚫ 48 compute cores + 2 or 4 assistant (OS) cores ⚫ Brand new core design ⚫ Near Xeon-Class Integer performance core ⚫ ARM V8 --- 64bit ARM ecosystem ⚫ Tofu-D + PCIe 3 external connection ⚫ …but also an accelerated GPU-like processor ⚫ SVE 512 bit vector extensions (ARM & Fujitsu) ⚫ Integer (1, 2, 4, 8 bytes) + Float (16, 32, 64 bytes) ⚫ Cache + scratchpad-like local memory (sector cache) ⚫ HBM2 on package memory – Massive Mem BW (Bytes/DPF ~0.4) ⚫ Streaming memory access, strided access, scatter/gather etc. ⚫ Intra-chip barrier synch. and other memory enhancing features ⚫ GPU-like High performance in HPC, AI/Big Data, Auto Driving…2018/3/13
  • 6. “Post-K” Chronology 6 (Disclaimer: below includes speculative schedules and subject to change) ⚫ May 2018 A0 Chip came out, almost bug free ⚫ 1Q2019 B0 Chip on hand, bug free, exceeded perf. target ⚫ Mar 2019 “Post-K” manufacturing budget approval by the Diet, actual manufacturing contract signed and starts ⚫ Aug 2019 End of K-Computer operations ⚫ 4Q2019 “Post-K” installation starts ⚫ 1H2020 “Post-K” preproduction operation starts ⚫ 2020~2021 “Post-K” production operation starts (hopefully) ⚫ And of course we move on… Watch for announcements on “Post-K” technology commercialization by Fujitsu and its partner vendors RSN
  • 7. Co-design for Post-K (slides by Mitsuhisa Sato Team Leader of Architecture Development Team) Deputy project leader, FLAGSHIP 2020 project Deputy Director, RIKEN Center for Computational Science (R-CCS) ARM for HPC - Co-design Opportunities, ISC 2018 BOF June 25, 2018 Richard F. BARRETT, et.al. “On the Role of Co-design in High Performa nce Computing”, Transition of HPC Towards Exascale Computing Started in 2009 (before K)
  • 8. 8 ⚫ Architectural Parameters to be determined ⚫ #SIMD, SIMD length, #core, #NUMA node, O3 resources, specialized hardware ⚫ cache (size and bandwidth), memory technologies ⚫ Chip die-size, power consumption ⚫ Interconnect Co-design from Apps to Architecture Target applications representatives of almost all our applications in terms of computational methods and communication patterns in order to design architectural features. ⚫ We have selected a set of target applications ⚫ Performance estimation tool ⚫ Performance projection using Fujitsu FX100 execution profile to a set of arch. parameters. ⚫ Co-design Methodology (at early design phase) 1. Setting set of system parameters 2. Tuning target applications under the system parameters 3. Evaluating execution time using prediction tools 4. Identifying hardware bottlenecks and changing the set of system parameters
  • 9. Protein simulation before K ◼ Simulation of a protein in isolation Folding simulation of Villin, a small protein with 36 amino acids Protein simulation with K DNA GROEL Ribosome ProteinsRibosome GROEL 400nm TRNA 100nm ATP water ion metabolites Genesis MD: proteins in a cell environment 9 ◼ all atom simulation of a cell interior ◼ cytoplasm of Mycoplasma genitalium
  • 10. NICAM: Global Climate Simulation ◼ Global cloud resolving model with 0.87 km-mesh which allows resolution of cumulus clouds ◼ Month-long forecasts of Madden-Julian oscillations in the tropics is realized. Global cloud resolving model Miyamoto et al (2013) , Geophys. Res. Lett., 40, 4922–4926, doi:10.1002/grl.50944. 13
  • 11. 11 Heart Simulator ◼ Heartbeat, blood ejection, coronary circulation are simulated consistently. ◼ Applications explored ◼ congenital heart diseases ◼ Screening for drug-induced irregular heartbeat risk ◼ Multi-scale simulator of heart starting from molecules and building up cells, tissues, and heart electrocardiogram (ECG) ultrasonic waves UT-Heart, Inc., Fujitsu Limited
  • 12. 12 ⚫ Tools for performance tuning ⚫ Performance estimation tool ⚫ Performance projection using Fujitsu FX100 execution profile ⚫ Gives “target” performance ⚫ Post-K processor simulator ⚫ Based on gem5, O3, cycle-level simulation ⚫ Very slow, so limited to kernel-level evaluation ⚫ Co-design of apps ⚫ 1. Estimate “target” performance using performance estimation tool ⚫ 2. Extract kernel code for simulator ⚫ 3. Measure exec time using simulator ⚫ 4. Feed-back to code optimization ⚫ 5. Feed-back to compiler Co-design of Apps for Architecture NICAM OPRT3D_divdamp区間 (1/3 推定方法 性能 予測ツ メモリ HB カーネル化、縮小 あ ソースコード版数 r4_a 浮動小数点数精度 単精 SIMD幅 1 集計スレッド番号 3 実行時間[秒] 2.09E GFLOPS/プロセス 97 メモリスループット(R+W) [GB/s/プロセス] 10 メモリスループット(R) [GB/s/プロセス] 88 メモリスループット(W) [GB/s/プロセス] 20 浮動小数点演算数 /スレッド 1.77E 実行命令数/スレッド 2.91E ロード・ストア命令数 /スレッド 9.56E L1Dミス数/スレッド 1.39E L2ミス数/スレッド 6.11E ① ② ③ Perform- ance estimation toolcd Simulator Simulator Simulator As is Tuning 1 Tuning 2 Target performance Executiontime
  • 13. 13 ⚫ ARM SVE Vector Length Agnostic feature is very interesting, since we can examine vector performance using the same binary. ⚫ We have investigated how to improve the performance of SVE keeping hardware-resource the same. (in “Rev-A” paper) ⚫ ex. “512 bits SVE x 2 pipes” vs. “1024 bits SVE x 1 pipe” ⚫ Evaluation of Performance and Power ( in “coolchips” paper) by using our gem-5 simulator (with “white” parameter) and ARM compiler. ⚫ Conclusion: Wide vector size over FPU element size will improve performance if there are enough rename registers and the utilization of FPU has room for improvement. ARM for HPC - Co-design Opportunities 0.00 0.20 0.40 0.60 0.80 1.00 1.20 1.40 triad nbody dgemm RelativeExecutionTime LEN=4 LEN=8 LEN=8 (x2) Faster Note that these researches are not relevant to “post-K” architecture. ⚫ Y. Kodama, T. Oajima and M. Sato. “Preliminary Performance Evaluation of Application Kernels Using ARM SVE with Multiple Vector Lengths”, In Re- Emergence of Vector Architectures Workshop (Rev- A) in 2017 IEEE International Conference on Cluster Computing, pp. 677-684, Sep. 2017. ⚫ T. Odajima, Y. Kodama and M. Sato, “Power Performance Analysis of ARM Scalable Vector Extension”, In IEEE Symposium on Low-Power and High-Speed Chips and Systems (COOL Chips 21), Apr. 2018
  • 14. 14 © 2019 FUJITSU A64FX: Summary ◼ Arm SVE, high performance and high efficiency ◼ DP performance 2.7+ TFLOPS, >90%@DGEMM ◼ Memory BW 1024 GB/s, >80%@STREAM Triad 12x compute cores 1x assistant core A64FX ISA (Base, extension) Armv8.2-A, SVE Process technology 7 nm Peak DP performance 2.7+ TFLOPS SIMD width 512-bit # of cores 48 + 4 Memory capacity 32 GiB (HBM2 x4) Memory peak bandwidth 1024 GB/s PCIe Gen3 16 lanes High speed interconnect TofuD integrated PCle Controller Tofu Interface C C C C N O C HBM2HBM2 HBM2HBM2 CMG CMG CMG CMG CMG:Core Memory Group NOC:Network on Chip
  • 15. 15 © 2019 FUJITSU A64FX technologies: Core performance ◼ High calc. throughput of Fujitsu’s original CPU core w/ SVE ◼ 512-bit wide SIMD x 2 pipelines and new integer functions A0 A1 A2 A3 B0 B1 B2 B3 X X X X 8bit 8bit 8bit 8bit C 32bit INT8 partial dot product
  • 16. 16 © 2019 FUJITSU A64FX technologies: Scalable architecture ◼ Core Memory Group (CMG) ◼ 12 compute cores for computing and an assistant core for OS daemon, I/O, etc. ◼ Shared L2 cache ◼ Dedicated memory controller ◼ Four CMGs maintain cache coherence w/ on-chip directory ◼ Threads binding within a CMG allows linear speed up of cores' performance core core core core core core core core core core core core core X-Bar connection L2 cache 8MiB 16-way Memory controller Network on chip CMG configuration HBM2 Network on chip CMG CMG CMG CMG HBM2 HBM2 HBM2 HBM2 A64FX chip configuration Tofu controller PCIe controller
  • 17. HHBHBMBM2M2828G8GiGBiBiB ◼ Extremely high bandwidth in caches and memory • A64FX has out-of-order mechanisms in cores, caches and memory controllers. It maximizes the capability of each layer’s bandwidth Performance >2.7TFLOPS L1 Cache >11.0TB/s (BF ratio = 4) High Bandwidth CMG L2 Cache >3.6TB/s (BF ratio = 1.3) L1D 64KiB, 4way 512-bit wide SIMD 2x FMAs Core Core CoreCore >230 GB/s >115 GB/s 12x Computing Cores + 1x Assistant Core Memory 1024GB/s (BF ratio =~0.37) >115 GB/s >57 GB/s HBM2 8GiB L2 Cache 8MiB, 16way 256 GB/s 13 All Rights Reserved. Copyright © FUJITSU LIMITED 2018
  • 18. 18 © 2019 FUJITSU A64FX: L1D cache uncompromised BW ◼ 128B/cycle sustained BW even for unaligned SIMD load ◼ “Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a “128-byte aligned block” for a pair of two regs, even & odd L1D$ Read port0 Read port1 Read data0 64B/cycle Read data1 64B/cycle Mem. 128B 0 1 2 3 4 5 6 7 0 1 3 2 6 7 45 8B Regs flow-1 flow-2 flow-4 flow-3 Maximizes BW to 32 bytes/cyc.
  • 19. 19 © 2019 FUJITSU A64FX: Power monitor and analyzer ◼ Energy monitor (per chip) ◼ Node power via Power API*1 (~msec) ◼ Averaged power of a node, CMG (cores, L2 cache, memory) etc. ◼ Energy analyzer (per core) ◼ Power profiler via PAPI*2 (~nsec) ◼ Fine grained power analysis of a core, L2 cache, and memory *1: Sandia National Laboratory *2: Performance Application Programming Interface A64FX CMG#0 Core Power calc.Core activity L2 cache Power calc.L2 cache activity Memory Power calc.Memory activity CMG #1 CMG #2 CMG #3 Node 12 Cores L2 cache HBM2 Tofu PCIe Energy Analyzer Assistant core Energy Monitor CMG Core Memory L2 Cache Power calc. for PCIePower calc. for Tofu
  • 20. 20 © 2019 FUJITSU A64FX: Power Knobs to reduce power consumption ◼ “Power knob” limits units’ activity via user APIs ◼ Performance/W can be optimized by utilizing Power knobs, Energy monitor & analyzer L1 I$ Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1D$ HBM2 Controller Fetch Issue Dispatch Reg-read Execute Cache and Memory CSE Commit PC Control Registers L2$ HBM2 Write Buffer Tofu controller FLA PPR FLB PRX FLA pipeline only HBM2 B/W adjust (units of 10%) Frequency reduction Decode width: 4 to 2 EXA pipeline only PCI Controller 52 cores
  • 21. 21 © 2019 FUJITSU Preliminary performance evaluation results ◼ Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
  • 22. Post-K A64fx A0 (ES) performance Performance / CPU Machine Performance (HPC) Peak TF (DFP) Peak Mem. BW Stream Triad Theor etical B/F DGEMM Efficiency Linpack Efficiency GF/W Network BW Per Chip Post-K A64fx (A0 Eng. Sample) 2.764/ 3.072 1024GB/s 840GB/s 0.37/ 0.33 94 % 87.7 % >15 TOFU-D 40.8GB/s (6.8x 6) Intel KNL 3.0464 600GB/s 490GB/s 0.20 66% 54.4 % 4.9 12.5 GB/s Intel Skylake 1.6128 127.8GB/s 97 GB/s 0.08 80 % 66.7 % 4.5 6.2GB/s NVIDIA V100 (DGX-2) 7.8 900 GB/s 855GB/s 0.12 76 % 15.113 160GB/s 6.2GB/s
  • 23. ◼ OpenMP版を用いてノードあたり演算性能を評価 FX10に対するノードあたり性能向上比 9 8 7 6 5 4 3 2 1 0 NAS Parallel Benchmark of FX100 [Slide by Ikuo Miyoshi, Fujitsu, SSKen2015] FX10(16スレッド) FX100(32スレッド) Haswell(32スレッド)† † エラーバーはCPU周波数を1.9GHzに固定しない場合の性能を表す 使用コード: NAS Parallel Benchmarks Ver. 3.3.1 OpenMP版 クラスC FX100は、FX10比平均3.9倍、Haswell比平均1.3倍のノードあたり演算性能 MG SP FT LU IS BT CG EP 多重格子法 スカラ五重 離散3次元 ガ ウ ス 対角ソルバ フーリエ変換 ザイデル 法 整数ソート ブロック三重 CG法 対角ソルバ カーネル 驚異的並列 LU EPISMG FT CGSP BT 23 Copyright 2015 FUJITSU LIMITED 性能向上比 Note: Haswell: 1 node = 2 chips 32 threads&cores FX100: 1 node = 1 chip 32 threads&cores
  • 24. ◼ 評価条件 24 Copyright 2015 FUJITSU LIMITED ◼ 測定結果 FX10に対するノードあたり性能向上比 Fiber ( Post-K) MiniApp on FX100 [Slide by Ikuo Miyoshi, Fujitsu, SSKen2015] FX100は、FX10比平均3.3倍、 Haswell比平均1.4倍の ノード あたり性能 注) FFVCには開発版コンパイラ、 NTChemに は 開発版数学ライブラリを使用。 QCDでは セクタキャッシュ利用、 FFBではループリローリングのコード変更を実 施 アプリ名 問題サイズ 評価区間 スレッド数×プロセス数 (左からFX10、FX100、Haswell) CCS QCD 32×32×32×32 BiCGStab 16t×2p 32t×1p 16t×2p NICAM-DC-MINI gl05rl00z80pe10 Dynamics 3t×10p FFB-MINI 1,048,576要素 MAIN_LOOP 1t×32p 8t×4p 1t×32p FFVC-MINI 256×256×256 Total 4t×8p 16t×2p NTChem-MINI taxol RIMP2_Driver 1t×32p 16t×2p 2t×16p Note: Haswell: 1 node = 2 chips 32 threads&cores FX100: 1 node = 1 chip 32 threads&cores
  • 25. 25 © 2019 FUJITSU Post-K performance evaluation ◼ Himeno Benchmark (Fortran90) † “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”, SC18, https://dl.acm.org/citation.cfm?id=3291728
  • 26. FUJITSU CONFIDENTIAL Post-K Chassis, PCB (w/DLC), and CPU Package CMU CPU Package A0 Chip Booted in June Undergoing Tests 60 mm 60 mm 230 mm 280 mm CPU A64fx Copyright 2018 FUJITSU LIMITED W 800㎜ D1400㎜ H2000㎜ 384 nodes
  • 27. 27 © 2019 FUJITSU CMU: CPU Memory Unit ◼ A64FX CPU x2 (Two independent nodes) ◼ QSFP28 x3 for Active Optical Cables ◼ Single-side blind mate connectors of signals & water ◼ ~100% direct water cooling Water Water Electrical signals AOC QSFP28 (X) QSFP28 (Y) QSFP28 (Z) AOC AOC SCAsia2019, March 12
  • 28. TOFU-D 6D Mesh/Torus Network ◼ Six coordinate axes: X, Y, Z, A, B, C ◼ X, Y, Z: the size varies according to the system configuration ◼ A, B, C: the size is fixed to 2×3×2 ◼ Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C) Copyright 2018 FUJITSU LIMITED September11th,2018,IEEECluster2018 Z X C A B X×Y×Z×2×3×2 Y 28
  • 29. 29 © 2019 FUJITSU A64FX: Tofu interconnect D ◼ Integrated w/ rich resources ◼ Increased TNIs achieves higher injection BW & flexible comm. patterns ◼ Increased barrier resources allow flexible collective comm. algorithms ◼ Memory bypassing achieves low latency ◼ Direct descriptor & cache injection TofuD spec Port bandwidth 6.8 GB/s Injection bandwidth 40.8 GB/s Measured Put throughput 6.35 GB/s Ping-pong latency 0.49~0.54 µs c c c c c c c c c c c cc NOC HBM2 CMG c c c c c c c c c c c cc HBM2 CMG c c c c c c c c c c c cc HBM2 CMG c c c c c c c c c c c cc HBM2 CMG PCle A64FX TNI0 TNI1 TNI2 TNI3 TNI4 TNI5 TofuNetworkRouter 2lanes×10ports TofuD
  • 30. Put Latencies ◼ 8B Put transfer between nodes on the same board ◼ The low-latency features were used ◼ Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1 ◼ The cache injection feature contributed to this reduction ◼ TofuD reduced the Put latency by 0.22 μs from that of Tofu2 Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 Communication settings Latency Tofu1 Descriptor on main memory 1.15 µs Direct Descriptor 0.91 µs Tofu2 Cache injection OFF 0.87 µs Cache injection ON 0.71 µs TofuD To/From far CMGs 0.54 µs To/From near CMGs 0.49 µs 0.20 µs 0.22 µs 30
  • 31. Injection Rates per Node ◼ Simultaneous Put transfers to multiple nearest-neighbor nodes ◼ Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs ◼ The injection rate of TofuD was approximately 83% that of Tofu2 ◼ The efficiencies of Tofu1 were lower than 90% ◼ Because of a bottleneck in the bus that connects CPU and ICC ◼ The efficiencies of Tofu2 and TofuD exceeded 90 % ◼ Integration into the processor chip removed the bottleneck Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 Injection rate Efficiency Tofu1 (K) 15.0 GB/s 77 % Tofu1 (FX10) 17.6 GB/s 88 % Tofu2 45.8 GB/s 92 % TofuD 38.1 GB/s 93 % 31
  • 32. Packaging – Rack Structure of Post-K ◼ Rack ◼ 8 shelves ◼ 192 CMUs or 384 CPUs ◼ Shelf ◼ 24 CMUs or 48 CPUs ◼ X×Y×Z×A×B×C = 1×1×4×2×3×2 ◼ Top or bottom half of rack ◼ 4 shelves ◼ X×Y×Z×A×B×C = 2×2×4×2×3×2 Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 Rack Shelves 32
  • 33. 33 © 2019 FUJITSU Post-K system configuration ◼ Scalable design Unit # of nodes Description CPU 1 Single socket node with HBM2 & Tofu interconnect D CMU 2 CPU Memory Unit: 2x CPU BoB 16 Bunch of Blades: 8x CMU Shelf 48 3x BoB Rack 384 8x Shelf System 150k+ As a Post-K system CPU CMU BoB RackShelf System
  • 34. Overview of Post-K System & Storage 34 ⚫ 3-level hierarchical storage ⚫ 1st Layer: GFS Cache + Temp FS (25~30 PB NVMe) ⚫ 2nd Layer: Lustre-based GFS (a few hundred PB HDD) ⚫ 3rd Layer: Off-site Cloud Storage ⚫ Full Machine Spec ⚫ >150,000 nodes ~8 million High Perf. Arm v8.2 Cores ⚫ > 150PB/s memory BW ⚫ Tofu-D 10x Global IDC traffic @ 60Pbps ⚫ ~10,000 I/O fabric endpoints ⚫ > 400 racks ⚫ ~40 MegaWatts Machine+IDC PUE ~ 1.1 High Pressure DLC ⚫ NRE pays off: ~= 15~30 million state-of-the art competing CPU Cores for HPC workloads (both dense and sparse problems)
  • 35. Prepping the 40+MW Facility (actual photo) 35
  • 36. Post-K system software ⚫ RIKEN and Fujitsu are developing a software stack for Post- K SCAsia2019, March 12 Post-K system hardware Linux OS / McKernel (Lightweight kernel) Fujitsu Technical Computing Suite / RIKEN-developed system software Post-K applications Management software Programming environmentFile system System management for high availability & power saving operation Job management for higher system utilization & power efficiency LLIO NVM-based file I/O accelerator OpenMP, COARRAY, Math.libs. MPI (Open MPI, MPICH) XcalableMPFEFS Lustre-based distributed file system Compilers (C, C++, Fortran) Debugging and tuning tools Post-K Under development w/ RIKEN
  • 37. Post-K Programming Environment 37 ⚫ Programing Languages and Compilers provided by Fujitsu ⚫ Fortran2008 & Fortran2018 subset ⚫ C11 & GNU and Clang extensions ⚫ C++14 & C++17 subset and GNU and Clang extensions ⚫ OpenMP 4.5 & OpenMP 5.0 subset ⚫ Java ⚫ Parallel Programming Language & Domain Specific Library provided by RIKEN ⚫ XcalableMP ⚫ FDPS (Framework for Developing Particle Simulator) ⚫ Process/Thread Library provided by RIKEN ⚫ PiP (Process in Process) ⚫ Script Languages provided by Linux distributor ⚫ E.g., Python+NumPy, SciPy ⚫ Communication Libraries ⚫ MPI 3.1 & MPI4.0 subset ⚫ Open MPI base (Fujitsu), MPICH (RIKEN) ⚫ Low-level Communication Libraries ⚫ uTofu (Fujitsu), LLC(RIKEN) ⚫ File I/O Libraries provided by RIKEN ⚫ Lustre ⚫ pnetCDF, DTF, FTAR ⚫ Math Libraries ⚫ BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu) ⚫ EigenEXA, Batched BLAS (RIKEN) ⚫ Programming Tools provided by Fujitsu ⚫ Profiler, Debugger, GUI ⚫ NEW: Containers (Singularity) and other Cloud APIs ⚫ NEW: AI software stacks (w/ARM) 2018/6/26 GCC and LLVM will be also available
  • 38. OSS Application Porting @ Arm HPC Users Group http://arm-hpc.gitlab.io/ (http://arm-hpc.gitlab.io/) Application Lang. GCC LLVM Arm Fujitsu LAMMPS C++ Modified Modified Modified Modified GROMACS C Modified Modified Modified Modified GAMESS* Fortran Modified Modified Modified Modified OpenFOAM C++ Modified Modified Modified Modified NAMD C++ Modified Modified Modified Modified WRF Fortran Modified Modified Modified Modified Quantum ESPRESSO Fortran Ok in as is Ok in as is Ok in as is Modified NWChem Fortran Ok in as is Modified Modified Modified ABINIT Fortran Modified Modified Modified Modified CP2K Fortran Ok in as is Issues found Issues found Modified NEST* C++ Ok in as is Modified Modified Modified BLAST* C++ Ok in as is Modified Modified Modified
  • 39. OSS Application Porting @ Arm HPC Users Group http://arm-hpc.gitlab.io/ Application Lang. GCC LLVM Arm Fujitsu LAMMPS C++ Modified Modified Modified Modified GROMACS C Modified Modified Modified Modified GAMESS* Fortran Modified Modified Modified Modified OpenFOAM C++ Modified Modified Modified Modified NAMD C++ Modified Modified Modified Modified WRF Fortran Modified Modified Modified Modified Quantum ESPRESSO Fortran Ok in as is Ok in as is Ok in as is Modified NWChem Fortran Ok in as is Modified Modified ongoing ABINIT Fortran Modified Modified Modified Modified CP2K Fortran Ok in as is Issues found Issues found ongoing NEST* C++ Ok in as is Modified Modified Modified BLAST* C++ Ok in as is Modified Modified Modified Twelve primary OSS applications are listed and being tested in the Users Group for each compilers, collaboratively w/ Arm (http://arm-hpc.gitlab.io/)
  • 40. 40 Post-K Processor ◆High perf FP16&Int8 ◆High mem BW for convolution ◆Built-in scalable Tofu network Unprecedened DL scalability High Performance DNN Convolution Low Precision ALU + High Memory Bandwi dth + Advanced Combining of Convolution Algorithms (FFT+Winograd+GEMM) High Performance and Ultra-Scalable Network for massive scaling model & data parallelism Unprecedented Scalability of Data/ Massive Scale Deep Learning on Post-K C P U For the Post-K supercomputer C P U For the Post-K supercomputer C P U For the Post-K supercomputer C P U For the Post-K supercomputer TOFU Network w/high injection BW for fast reduction
  • 41. Open “Post-K” Naming ⚫ Foreign submissions welcome ⚫ Requirements for the post-K’s name are: ⚫ The name should preferably express the idea that RIKEN is a world- class research institute operating a state-of-the–art supercomputer. ⚫ The name should be attractive not only to Japanese speakers but to people around the world. https://www.r-ccs.riken.jp/en/topics/naming.html until April 8, 2019 5 pm JST