The document summarizes the successes and lessons learned from Fugaku, Japan's flagship supercomputer. Key points include:
- Fugaku achieved the top performance on all major HPC benchmarks in 2020 and 2021, showing high performance across applications, not just traditional HPC workloads.
- While many applications achieved their target performance, some did not, due to issues such as insufficient parallelism, I/O scalability problems, and compiler vectorization failures.
- Lessons include the need for improved software stacks, deeper application analysis, and adaptation to modern applications beyond classic HPC.
- Looking ahead, sustained exascale performance will require data-centric architectures, along with the corresponding system software and algorithms, as transistor scaling slows.
3. "Applications First" Exascale R&D
Fugaku Target Applications: Priority Research Areas
- Co-design program for advanced applications run in parallel with Fugaku R&D: select one representative app from each of 9 priority areas, including Health & Medicine, Environment & Disaster, Energy, Materials & Manufacturing, and Basic Sciences.
- Target: up to 100x speedup c.f. the K computer => achieved!
  - 131x (GENESIS), 127x (NICAM+LETKF), 70x (NTChem), 63x (GAMERA), 63x (Adventure), 51x (FFB), 38x (RSDFT), 38x (LQCD), 23x (Genomon)
  - Average: ~70x
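As a quick sanity check, the quoted average can be recomputed from the per-application speedups listed above (the "~70x" reads as a rounded arithmetic mean):

```python
# Per-application speedups over the K computer, as listed on this slide.
speedups = {
    "GENESIS": 131, "NICAM+LETKF": 127, "NTChem": 70, "GAMERA": 63,
    "Adventure": 63, "FFB": 51, "RSDFT": 38, "LQCD": 38, "Genomon": 23,
}

mean = sum(speedups.values()) / len(speedups)
print(f"arithmetic mean speedup: {mean:.1f}x")  # ~67x, i.e. roughly the quoted ~70x
```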
4. Fugaku HPC + Big Data + AI + Cloud 'Converged' Software Stack
- OS: Red Hat Enterprise Linux 8 kernel + optional lightweight kernel (McKernel); RHEL 8 libraries
- Batch job and management system; open-source management tools; Spack and other DoE ECP software (~3,000 apps supported by Spack)
- Math libraries - Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II; RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, ...
- Tuning and debugging tools - Fujitsu: profiler, debugger, GUI
- High-level programming languages: XMP; domain-specific languages: FDPS
- Compilers and scripting languages: Fortran, C/C++, OpenMP, Java, Python, ... (multiple compilers supported: Fujitsu, Arm, GNU, LLVM/Clang, PGI, ...)
- Communication: Fujitsu MPI, RIKEN MPI; low-level communication: uTofu, LLC
- File I/O: DTF; hierarchical file system and storage: Lustre/LLIO
- Process/thread: PIP; virtualization & containers: KVM, Singularity
- Fugaku AI (DL4Fugaku) - RIKEN: Chainer, PyTorch, TensorFlow, DNNL, ...
- Cloud software stack: OpenStack, Kubernetes, NEWT, ...; live data analytics: Apache Flink, Kibana, ...; object store: S3-compatible
- Most applications work with a simple recompile from the x86/RHEL environment to the Arm processor; LLNL's Spack automates this.
- Covers traditional HPC systems (e.g., the K computer), traditional clouds (e.g., EC2), and Fugaku and future HPCI systems.
5. CUBE / COVID-19 Aerosol Simulation Performance
- C.f. Xeon (Skylake / Cascade Lake):
  - superior efficiency relative to peak (7.03% vs. 2.20%/3.96%)
  - superior per-socket time to solution (11.19 s vs. 20.61 / 8.29 x 2 s)
10. Lessons Learned from Fugaku
Positives: proper project vision and management
- General-purpose low-power CPU with good FLOPS and high bandwidth
- Arm ecosystem extremely important for programming, tools, & apps
- Aggressive R&D and adoption of new (risky) technologies: on-die HBM2, embedded ~400 Gbps partially optical switchless interconnect, mainframe RAS, low power, etc.
- Co-design and collaboration nationally and internationally
- Some evolution to cope with massive parallelism (K => Fugaku)
- Addition of modern architecture features, e.g., FP16
Shortcomings: lack of widespread commercial adoption
- Co-design focused too much on target-app optimization (only)
- Immaturity of the software stack, esp. compilers & libraries with SVE
- Still too focused on classic HPC for industry & cloud adoption
- Failed to address modern apps: data, AI, entertainment, mobility
- Failed to 'deprecate' classes of algorithms towards Post-Moore
- What elements can we learn for sustained performance improvements?
11. Performance projection of many-core CPU systems based on the IRDS roadmap
- ~2015: 25 years of sustained scaling in the many-core period (post-Dennard scaling)
- 2016~: difficulty in advancing Moore's law
- 2025~: Post-Moore era - the end of transistor-power advancement
- Challenge: exploration of computer architectures that will enable performance improvement even around the year 2028
- Key to sustained performance improvement: FLOPS to BYTES, "data-movement-centric architecture"
  - Reconfigurable, data-driven, vector computing
  - Ultra-deep and ultra-wide-bandwidth memory architectures
  - Optical networks
  - System software, programming, and algorithms that correspond to the new architectures
12. Performance projection of many-core CPU systems based on the IRDS roadmap
- Prediction based on the IRDS Roadmap (2020 ed.): extrapolation of traditional many-core architectures, relying merely on advances in semiconductor technology, will achieve only ~1.8 EFLOPS peak (3.37x c.f. Fugaku) if a machine with broad applicability is to be built.
- Methodology (CPU part) - assumptions from the IRDS Roadmap, Systems and Architectures:
  - Cores/socket = 70
  - SIMD width = 2048-bit x 2
  - Clock frequency = 3.9 GHz
  - Socket TDP = 351 W
- System assumptions:
  - System power = 30, 40, 50 MW
  - PUE = 1.1
  - CPU share of system power = 60, 70, 80%
- Even the most aggressive system configuration (50 MW power budget, 80% of power consumed by the CPUs) is projected to deliver only about 1.8 EFLOPS.
- From the NGACI white paper: https://sites.google.com/view/ngaci/home
(Figure of projected socket cores, injection bandwidth (Tb/s), and storage (EBytes) not reproduced.)
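One reading of the assumptions above reproduces the quoted figures; this is a sketch, and the FLOPS-per-cycle accounting (one FP64 result per 64-bit SIMD lane per cycle) is my assumption, not stated on the slide:

```python
# Projected peak of a 2028-era many-core system under the IRDS-based
# assumptions above (most aggressive case: 50 MW, 80% of power to CPUs).

CORES = 70                      # cores per socket
SIMD_BITS, SIMD_PIPES = 2048, 2
CLOCK_HZ = 3.9e9
SOCKET_TDP_W = 351.0

SYSTEM_POWER_W = 50e6
PUE = 1.1
CPU_POWER_SHARE = 0.8

# Assumed accounting: one FP64 result per 64-bit SIMD lane per cycle.
flop_per_cycle_per_core = SIMD_PIPES * (SIMD_BITS // 64)
peak_per_socket = CORES * flop_per_cycle_per_core * CLOCK_HZ        # ~17.5 TFLOPS

sockets = int(SYSTEM_POWER_W / PUE * CPU_POWER_SHARE / SOCKET_TDP_W)
system_peak = sockets * peak_per_socket

print(f"{sockets} sockets, {system_peak/1e18:.2f} EFLOPS peak")     # ~1.8 EFLOPS
print(f"{system_peak/537e15:.2f}x vs Fugaku's 537 PFLOPS peak")     # ~3.4x
```

The resulting ~3.37x over Fugaku's 537 PFLOPS peak matches the ratio quoted on the slide, which suggests this reading of the parameters is close to the authors' model.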
13. Application Kernel Categorization & Supercomputer Architecture
- Kernel categories: compute bound (aka Top500), bandwidth bound (aka HPCG), latency bound (aka Graph500)
- Classic vector machines (e.g., Earth Simulator), ~1990s
- COTS-CPU-based clusters, late 1990s ~ late 2000s (ASCI series, Tsubame1/T2K, Jaguar, K): standard memory technologies (DDR DRAM), massively parallel
- GPU-based 'heterogeneous' machines - high compute, bandwidth, & latency performance on the GPU: Tsubame2/3, ABCI, Summit, Piz Daint, Frontier, Aurora, ...
- Fugaku/A64FX, Sapphire Rapids: aggressive technology & good SW ecosystem
- GPU/Matrix CPU: unexplored but possibly best? (programmability, performance, industry adoption, ...)
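The compute-bound vs. bandwidth-bound split above is commonly captured by a simple roofline model. The sketch below classifies a kernel by its arithmetic intensity; the machine numbers are approximate A64FX-like figures assumed here for illustration, not exact specifications:

```python
# Roofline-style classification: a kernel is bandwidth bound when its
# arithmetic intensity (FLOP per byte moved) falls below the machine
# balance point peak_flops / peak_bw, and compute bound above it.

def classify(intensity_flop_per_byte, peak_flops, peak_bw_bytes):
    ridge = peak_flops / peak_bw_bytes            # machine balance (FLOP/byte)
    attainable = min(peak_flops, intensity_flop_per_byte * peak_bw_bytes)
    kind = "compute bound" if intensity_flop_per_byte >= ridge else "bandwidth bound"
    return kind, attainable

# Approximate A64FX-like figures (assumed for illustration):
PEAK = 3.0e12        # ~3 TFLOPS FP64 per chip
BW = 1.0e12          # ~1 TB/s HBM2

print(classify(50.0, PEAK, BW))    # dense GEMM-like kernel: compute bound
print(classify(0.25, PEAK, BW))    # stream/stencil-like kernel: bandwidth bound
```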
14. BLAS / GEMM Utilization in HPC Applications [Domke et al. @ R-CCS, IPDPS 2020]
Analyzed various data sources:
- Historical data from the K computer: only 53.4% of node-hours (in FY18) were consumed by applications that had GEMM functions in the symbol table
- Library dependencies: only 9% of Spack packages have a direct BLAS library dependency (51.5% have an indirect dependency)
- Tensor Core benefit for DL: up to 7.6x speedup for MLPerf kernels
- GEMM utilization in HPC: sampled across 77 HPC benchmarks (ECP proxy apps, RIKEN fiber, TOP500, SPEC CPU/OMP/MPI), measured/profiled via Score-P and VTune
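The direct-vs-indirect distinction behind the Spack statistic can be made concrete with a small graph walk. This is a toy dependency graph with hypothetical package names, not the real Spack data:

```python
# Toy package-dependency graph: classify each package as having a direct
# BLAS dependency, an indirect (transitive) one, or none at all.
deps = {
    "app-a": ["openblas"],     # direct BLAS dependency
    "app-b": ["libfoo"],       # indirect, via libfoo
    "libfoo": ["openblas"],
    "app-c": ["libbar"],       # no BLAS anywhere in its tree
    "libbar": [],
    "openblas": [],
}

def blas_dependency(pkg, blas=frozenset({"openblas"})):
    if set(deps[pkg]) & blas:
        return "direct"
    stack, seen = list(deps[pkg]), set()   # depth-first walk of transitive deps
    while stack:
        d = stack.pop()
        if d in seen:
            continue
        seen.add(d)
        if d in blas:
            return "indirect"
        stack.extend(deps[d])
    return "none"

for pkg in ("app-a", "app-b", "app-c"):
    print(pkg, blas_dependency(pkg))
```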
16. FLOPS to BYTES for Future Acceleration? (1)
- Increasing FLOPS by increasing the number of ALUs is no longer viable
- Compute power = ALU logic switching power + data movement between ALUs and registers/memory
- ALU logic power saturates faster than lithography saturates
- No more acceleration of pure FLOPS
- The only way to increase performance at the low level is logic simplification, e.g., lower precision, alternative numerical formats
- At higher levels, decreasing the number of numerical operations is very effective => sparse (iterative) methods (general HPC), network compaction (AI), algorithmic pruning (HPC & AI)
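The "decrease the number of numerical operations" point can be illustrated with a sparse, matrix-free solve: the conjugate-gradient sketch below applies a tridiagonal operator in O(n) work per iteration instead of the O(n^2) of a dense matrix-vector product. This is a generic textbook CG, not code from the talk:

```python
# Matrix-free conjugate gradient on a 1-D Laplacian (tridiagonal) operator:
# each operator application is O(n) work, vs O(n^2) for a dense matvec.

def apply_laplacian(x):
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] -= x[i + 1]
        y[i + 1] -= x[i]
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)
    r = list(b)                        # r = b - A x, with x = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        Ap = apply_laplacian(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new < tol * tol:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

b = [1.0] * 32
x = cg(b)
residual = max(abs(bi - axi) for bi, axi in zip(b, apply_laplacian(x)))
print(f"max residual: {residual:.2e}")
```

Note that each iteration is dominated by streaming the vectors through memory, so even this "operation-minimizing" method is bandwidth bound — exactly the FLOPS-to-BYTES shift the slide argues for.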
17. FLOPS to BYTES for Future Acceleration? (2)
Data movement has its own problems, but it is promising with new device and packaging technologies plus architectures & algorithms to exploit them.
- Devices & packaging:
  - 3-D stacking of memory + logic
  - Photonic interconnect
  - Dense and fast memory devices, from SRAM to MRAM
- Architecture:
  - Large, high-bandwidth local memory per processor (very large L1/L2)
  - Customized datapaths for frequent compute patterns - stencils/convolutions, matrix, FFT, tensor operations, ... => can they be generalized? Micro-dataflow in a core?
  - Coarse-grained dataflow (CGRA)? => optimize data movement in general over standard CPU/GPU (SMT vector)
  - Near-memory processing
- FLOPS to BYTES! Same motivation as embedded computing
18. Towards the 2030 Post-Moore Era
- End of ALU compute (FLOPS) advances
- Disruptive reduction in data-movement cost with new devices and packaging
- Algorithm advances to reduce the computational order (+ more reliance on data movement)
- Unification of BD/AI/simulation towards a data-centric view
Post-Moore Algorithmic Development: categorization of algorithms and their domains. "New problem domains require new computing accelerators" - in practice challenging, due to algorithms & programming. (Chart, Copyright 2021 FUJITSU LIMITED, mapping domains to computational complexity and target architectures, evolving from 2021 (present day) to 2030:)
- Data-movement (bandwidth/BYTES) centric: machine learning & HPC simulations - traditional but important kernels (HF, SVM, FFT, CG, H-matrix, graph; O(n log n) down to O(n)), with DL & quantum chemistry moving to lower-order, advanced algorithms (sparse NN, O(n) QM) => CPU and/or GPU with HBM etc., plus data-movement acceleration (e.g., CGRA?)
- FLOPS (compute) centric: deep learning (CNN) => GPU + matrix engines
- Search & optimization: combinatorial optimization, NP-hard or HSP (Ising model, crypto, etc.; O(2^n)) => new paradigms: quantum & digital annealers; "innovation challenge": new DL, vision
- Quantum systems: quantum algorithms => quantum gates
- Latency centric: neuromorphic
19. All Is Not Rosy: Modernizing & Downselecting Application & Algorithm Types
- Compute bound, via matrix/tensor HW: fairly low utilization; low memory capacity (O(n^k)); easy to encapsulate in a library etc.
- Latency bound, via standard localization & hiding techniques: good single-thread / low-latency communication HW; multithreading/latency hiding; latency-avoiding / localization algorithms
- BW bound, via 3-D-stacked near memory & photonics: tiered memory; extreme-high-BW memory is capacity-limited c.f. FLOPS (see figure); requires algorithmic changes and innovations, generic (e.g., temporal blocking), customized, ...
- Some apps/algorithms may not survive the change (e.g., traditional unstructured mesh ...)
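The "temporal blocking" technique mentioned above trades a small amount of redundant computation for data reuse: several stencil time steps are applied while a block (plus a halo) is cache-resident. A minimal 1-D sketch, a generic illustration rather than code from the talk, checked against the naive sweep:

```python
import math

# One step of a 3-point averaging stencil; the two domain endpoints are
# held fixed (Dirichlet boundary).
def step(u):
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return v

def naive(u, t):
    for _ in range(t):
        u = step(u)
    return u

# Temporal blocking: advance each block t steps at once, using a halo of
# width t so the block interior is unaffected by the stale halo edges.
def temporal_blocked(u, t, block=8):
    n, out = len(u), [0.0] * len(u)
    for s in range(0, n, block):
        e = min(s + block, n)
        lo, hi = max(0, s - t), min(n, e + t)
        chunk = u[lo:hi]
        for _ in range(t):
            chunk = step(chunk)
        out[s:e] = chunk[s - lo:e - lo]
    return out

u0 = [math.sin(0.3 * i) for i in range(24)]
err = max(abs(a - b) for a, b in zip(naive(u0, 3), temporal_blocked(u0, 3)))
print(f"max deviation from naive sweep: {err:.1e}")  # agrees with the naive result
```

Each block's data is read from main memory once per t steps instead of once per step, cutting memory traffic by roughly a factor of t at the cost of the halo recomputation.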
20. Are Domain-Specific Accelerators Useful for HPC?
- On-chip integration (SoC): accelerator on the same die as the CPU, or even embedded within a CPU (e.g., vector/matrix engines within CPU cores); shares various resources with the CPUs, e.g., on-chip cache; low-energy data movement; homogeneous across nodes
- Multi-chip packaging: accelerator chiplets interconnected with CPU chiplets using interposers etc.; shared main memory; medium-energy data movement
- On-node accelerators + CPUs: accelerator-CPU connection via standard chip-to-chip interconnects, e.g., PCIe, CXL, CAPI; low bandwidth; higher-energy data movement; scalable if homogeneous and the workload is exclusive to either the accelerator or the CPU
- Specific accelerated nodes/machines, via LAN or even WAN: expensive data movement, with the workload largely confined to each; limited utility, high cost of heterogeneous management, not scalable; only makes sense if the workload is well known and largely fixed
Accelerators are a means to an end, not a purpose in themselves.
(Diagrams of the four integration levels - SoC sharing L1$/L2$/memory controller/I/O, packaged CPU + accelerator with HBM, and PCIe-attached accelerators - not reproduced.)
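The qualitative energy ordering across these four integration levels can be sketched with a toy model. The per-bit energies below are hypothetical placeholders chosen only to reflect the low-to-high ordering described on this slide; they are not measured values from the talk:

```python
# Toy model of data-movement cost at each accelerator integration level.
# pJ/bit figures are HYPOTHETICAL illustrative values, chosen only to
# reflect the qualitative ordering on this slide (on-chip cheapest,
# off-node most expensive).
PJ_PER_BIT = {
    "on-chip (SoC)": 0.1,
    "multi-chip package": 1.0,
    "on-node (PCIe/CXL)": 10.0,
    "off-node (LAN/WAN)": 1000.0,
}

def transfer_energy_joules(gigabytes, level):
    bits = gigabytes * 8e9
    return bits * PJ_PER_BIT[level] * 1e-12

for level in PJ_PER_BIT:
    print(f"{level:22s} {transfer_energy_joules(1.0, level):10.4f} J per GB")
```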
21. Reality of Accelerated Computing
- From the user's point of view, the computing system should be uniform, with heterogeneity, distribution, etc. hidden under the hood; the success of clouds was achieved with this principle
- Modern IT involves a massive software ecosystem; heterogeneity hinders its use => integration with the CPU (or GPU) is most sensible; Fugaku / A64FX was designed exactly with this principle
- Performance is always governed by Amdahl's law (strong scaling) and Gustafson's law (weak scaling)
- Employing multiple heterogeneous accelerators in one app => bad idea
- "Homogeneous" parallelization of workloads, exclusively confined to a SINGLE accelerator type (or CPU) per node with good load balancing, is the ONLY way to overcome Amdahl's law; successful applications on large GPU machines follow this principle
- "Balanced" use of GPU and CPU is a myth => EITHER GPU or CPU
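The weight this slide puts on "good load balancing" can be quantified: in SPMD execution the step time is the maximum over nodes, so a single overloaded node drags down the whole machine. A toy model, not measurements:

```python
# SPMD step time is governed by the slowest node, so parallel efficiency
# is mean(work) / max(work) across nodes.
def parallel_efficiency(node_times):
    return sum(node_times) / len(node_times) / max(node_times)

balanced = [1.0] * 1024                 # perfectly uniform work
one_slow = [1.0] * 1023 + [2.0]         # a single node with 2x the work

print(f"balanced:      {parallel_efficiency(balanced):.2f}")   # 1.00
print(f"one slow node: {parallel_efficiency(one_slow):.2f}")   # ~0.50
```

One node at 2x load halves the efficiency of all 1024 nodes, which is why uniform node configurations and SPMD over a single accelerator type dominate in practice.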
22. Smartphones NOT Extrapolatable to HPC
- Smartphone SoCs are subject to Amdahl speedup (law); supercomputers are subject to both Amdahl & Gustafson speedup
- Apple A15 SoC (source: https://semianalysis.com/apple-a15-die-shot-and-annotation-ip-block-area-analysis/)
References:
- ECE 695NS Lecture 3: Practical Assessment of Code Performance, by Peter Bermel, Purdue University. https://nanohub.org/resources/20560/watch?resid=25763
- Gustafson, J.L. (2011). Gustafson's Law. In: Padua, D. (ed.), Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_78
23. Accelerators vs. Amdahl's Law & Gustafson's Law (1)
- Accelerators are subject to Amdahl's law (strong scaling): if a fraction a of the time-to-solution t is acceleratable, only that portion shrinks, so the speedup is bounded by 1/(1-a). For accelerators to work, the non-accelerated portion must be as small as possible; e.g., for GPU+CPU, the CPU processing must be minimized.
- Large-scale parallel computing is subject to Gustafson's law (weak scaling): with parallelism P, the problem size grows as xP while the non-parallelizable time stays constant. Performance ~= xP for large P, if the non-parallelizable overhead is minimized, e.g., with load balancing, communication minimization, etc. => entails uniform, well-balanced processing on every node.
(Diagrams of the accelerated and weak-scaled time-to-solution breakdowns not reproduced.)
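The two laws on this slide, and the combined bound used on the next one, can be written down directly. These are the standard textbook forms; the combined expression assumes acceleratable fraction a and ideal weak scaling:

```python
# Amdahl's law (strong scaling): a fraction a of the work is accelerated
# (or parallelized) with speedup s; the rest runs at the original speed.
def amdahl(a, s):
    return 1.0 / ((1.0 - a) + a / s)

# Gustafson's law (weak scaling): the problem grows with parallelism P,
# while the serial fraction (1 - a) stays constant.
def gustafson(a, P):
    return (1.0 - a) + a * P

# Combined bound for an accelerated supercomputer: the per-node Amdahl
# limit times the parallelism, asymptotically P / (1 - a) as s -> infinity.
def combined_asymptotic(a, P):
    return P / (1.0 - a)

print(f"Amdahl, a=0.95, s=100:     {amdahl(0.95, 100):8.2f}x (limit {1/0.05:.0f}x)")
print(f"Gustafson, a=0.95, P=1000: {gustafson(0.95, 1000):8.1f}x")
print(f"Combined limit, P=1000:    {combined_asymptotic(0.95, 1000):8.1f}x")
```

Even with a 100x accelerator, a 5% non-accelerated fraction caps the node speedup below 17x, while weak scaling over P nodes multiplies whatever the node achieves, which is the combination the next slide exploits.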
24. Accelerators vs. Amdahl's Law & Gustafson's Law (2)
Combining Amdahl's law and Gustafson's law in a supercomputer: the node performance gain from acceleration (bounded by 1/(1-a)) multiplies with the parallelism P, giving a theoretical asymptotic performance gain of P/(1-a) - BUT this is extremely susceptible to overhead, e.g., load imbalance and communication overhead.
Principles of an accelerated supercomputer:
- Maximize acceleration under Amdahl => dominant processing done on the same accelerator on every node. BAD: "intra-node" heterogeneous processing
- Extremely uniform load balancing => SPMD over uniform accelerators is best. BAD: heterogeneous task parallelism over multiple types of accelerators
- Minimize parallelization overhead, e.g., communication => tight communication coupling of accelerated components: on-chip > on-package > on-node > different machines. BAD: any segregation entailing data movement, poor interconnect, etc.
25. Accelerators vs. Amdahl's Law & Gustafson's Law (3)
It is no accident that every successful large-scale accelerated supercomputer (esp. the GPU machines) is built with:
- a single node configuration across the entire machine
- tight coupling and a robust interconnect (& I/O) to sustain maximum bandwidth in and out of the accelerator processor
- dominant processing on the GPU for maximum performance
- SPMD with very good load balancing (incl. data-parallel DNN training)
Examples: Tsubame, Tianhe-2A, Titan/Summit, Piz Daint, ABCI, Fugaku, Frontier, LUMI, Aurora, ...
... and this is a consequence of physical laws, so it will continue to apply to future machines (no extreme heterogeneity, asynchrony, ...).
(Photo: Tsubame3.)
26. Amahl’s law also will hit communication time and energy/power
consumption
Even if we achieve considerable speedup with low energy on the
accelerator, moving the data around to be processed by other
accelerators is hit by Amdahl's law in communication time
and power/energy consumption
Neither can be brought down; the farther the signal travels, from on-chip
towards inter-rack or inter-IDC, the larger it looms as the overall overhead factor
Thus the right approach to minimizing the effect of Amdahl's law is
SoC or even on-CPU integration of acceleration features, NOT
SEPARATE HETEROGENEOUS ACCELERATOR CHIPS & SYSTEMS
Again, Fugaku / A64FX was designed with this principle
“Multiple Heterogeneous Domain Specific Accelerator”
Considered Harmful
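The distance argument can be made concrete with order-of-magnitude energy-per-bit figures. The values below are representative assumptions for illustration only, not measurements from any specific machine:

```python
# Rough energy cost of moving one bit, by distance traveled.
# These pJ/bit figures are order-of-magnitude assumptions.
ENERGY_PJ_PER_BIT = {
    "on-chip":      0.1,
    "on-package":   1.0,
    "on-node":     10.0,
    "inter-node": 100.0,
}

def transfer_energy_joules(num_bytes, where):
    """Energy to move num_bytes over the given link class."""
    return num_bytes * 8 * ENERGY_PJ_PER_BIT[where] * 1e-12

# Moving 1 GB between nodes costs ~1000x the energy of keeping it
# on-chip, which is why SoC integration of acceleration features
# beats separate heterogeneous accelerator chips.
gb = 1 << 30
for where in ENERGY_PJ_PER_BIT:
    print(f"{where:>10}: {transfer_energy_joules(gb, where):.4f} J per GB")
```

Under these assumptions, the energy gap per link class is a fixed multiplicative factor, so the communication term in Amdahl's law grows with every hop away from the accelerator.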
27. Towards the 2030 Post-Moore era
• End of ALU compute (FLOPS) advances
• Disruptive reduction in data movement cost
with new devices and packaging
• Algorithm advances to reduce the computational
order (+ more reliance on data movement)
• Unification of BD/AI/Simulation towards data-
centric view
28.
[Figure: "Categorization of Algorithms and Their Domains" (Copyright 2021 FUJITSU LIMITED): two charts, "2021 present day" and "2030", mapping problem domains (machine learning, HPC simulations, deep learning, quantum chemistry/systems, combinatorial optimization, search & optimization, crypto, etc.) against computational complexity (O(n), O(n log n), O(2^n), NP-Hard or HSP) and target architectures. Today, compute-bound (FLOPS-centric) kernels such as CNNs run on GPU+MM or CPU/GPU w/ HBM, while data-movement (bandwidth) bound kernels (HF, SVM, FFT, CG, H-Matrix, graph) are BYTES-centric, and new-paradigm search & optimization (Ising model) targets quantum & digital annealers and quantum gates. By 2030, data-movement (BYTES) centric architectures (CPU and/or GPU + α, e.g. CGRA data movement acceleration) serve sparse NN, O(n) QM, H-Matrix, and graph workloads; quantum algorithms address NP-Hard or HSP problems; latency-centric neuromorphic devices form the non-quantum frontier.]
"New problem domains require new computing accelerators":
in practice challenging, due to algorithms & programming
29.
Applications that are infeasible to solve on conventional
computers due to high complexity => impractical time-to-solution
Material science: first-principles simulations (wave functions)
Higher-order problems difficult to solve with conventional
means due to high complexity: O(n^k) where k > 4
Other examples: cryptography
Applications that can be solved on existing computers but
benefit from quantum computers due to cheaper OPEX/CAPEX
Optimization problems: TSP and variants
Some classes of AI/DL: quantum learning
These are important applications; OTOH the list is unfortunately not
very long (and likely will not grow much...)
For which applications do we desire a quantum accelerator?
30. Accelerators are subject to Amdahl’s law (strong scaling)
Possible (polynomial?) speedup for NP-Hard problems, exponential
speedup for HSP problems
[Figure: time-to-solution bars. Accelerating the acceleratable fraction a shrinks the time-to-solution t by up to a factor of 1/(1-a); the non-accelerated portion remains. Likewise, as problem complexity increases, the quantum advantage of the QC portion grows, but the non-QC portion stays.]
For accelerators to work, the non-accelerated portion must be as small as possible
Quantum Computers: the only possible Amdahl Accelerator
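The picture above can be expressed as a toy model (assumed, illustrative timings): no matter how large the quantum advantage on the accelerated kernel grows, the classical part of the workflow bounds the end-to-end speedup.

```python
# Toy Amdahl model for a hybrid quantum-classical workflow.
# t_classical: non-QC portion of the runtime
# t_kernel:    kernel time when run classically
# q_speedup:   quantum advantage on the kernel (illustrative)
def total_speedup(t_classical, t_kernel, q_speedup):
    before = t_classical + t_kernel
    after = t_classical + t_kernel / q_speedup
    return before / after

# Kernel is 90% of runtime: even an effectively infinite quantum
# advantage caps the end-to-end speedup at 1 / (1 - 0.9) = 10x.
for q in (10, 1000, 1e9):
    print(f"quantum advantage {q:>10g}x -> total {total_speedup(1, 9, q):.2f}x")
```

This is the sense in which quantum computers are "Amdahl accelerators": the advantage only pays off when the QC-amenable fraction dominates the time-to-solution.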
31. Research on Quantum Computing/Computers (QC) @ R-CCS
① Development of large-scale QC simulators using Fugaku
• Bracket simulator (R-CCS Ito Team): large-medium scale (#qubits < 50)
• Qulacs simulator (RQC Fujii Team): medium-small scale (#qubits < 30)
• QC simulator designed using tensor networks (R-CCS Yunoki Team)
Development supported by the program "The enhancement of Fugaku
usability" for Fugaku CPU resources.
② Design of a hybrid programming
environment for integration of QC
(simulators) and classical HPC
supercomputers
• Workflow and task-parallel programming models
for offloading to QC (R-CCS Sato Team)
• Design and implementation of a common
framework such as Qibo (IHPC, Singapore)
③ Design of QC algorithms and development of QC applications
for QC-HPC hybrid computing
• Target: material simulation for the optimization of the ground state of molecules
by the VQE method using more than 40 qubits of the QC simulator on Fugaku.
• Expected to be used on the real QC developed by RQC.
Supported by a special track of the program "The enhancement of Fugaku
usability" for Fugaku CPU resources.
Access to external real QCs (IBM Q, D-WAVE)
④ Research on architectures to accelerate
QC simulation
To accelerate QC research, technology for high-speed
QC simulation is important. (R-CCS Kondo Team)
Acquisition of a GPU-based system (NVIDIA A100)
Outcomes expected to feed into Fugaku Next
⑤ International collaborations
• IHPC (Singapore) • CEA (France)
Integrated execution, tech transfer, and
collaboration with the RIKEN Center for
Quantum Computing (RQC)
32.
Increasing FLOPS by increasing the number of
ALUs is no longer viable
Compute power = ALU logic switching power + data
movement between ALUs and registers/memory
ALU logic power saturates faster than lithography
No more acceleration of pure FLOPS
The only way to increase performance at the low level is logic
simplification, e.g., lower precision, alternative numerical formats
At higher levels, decreasing the # of numerical operations is very
effective => sparse (iterative) methods (general HPC), network
compaction (AI), algorithmic pruning (HPC & AI)
Investigating the non-Quantum Future
FLOPS to BYTES for future acceleration? (1)
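The operation-count argument can be illustrated with a small sketch (synthetic sizes, assumed sparsity): a dense n x n matrix-vector product needs ~2n² FLOPs, while a sparse one needs ~2·nnz, so reducing the number of operations dwarfs any plausible FLOPS gain.

```python
# Operation counts for matrix-vector products (illustrative).
# Dense n x n matvec: ~2*n^2 FLOPs. Sparse matvec: ~2*nnz FLOPs.
def dense_matvec_flops(n):
    return 2 * n * n

def sparse_matvec_flops(n, nnz_per_row=5):
    # nnz_per_row = 5 is an assumed value, typical of 2-D PDE stencils.
    return 2 * n * nnz_per_row

n = 1_000_000
print(dense_matvec_flops(n) / sparse_matvec_flops(n))  # n/5 = 200000x fewer ops
```

A 200,000x reduction in operation count, at the price of becoming memory-bandwidth bound, is exactly the trade that motivates the FLOPS-to-BYTES shift.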
33.
Data movement has its own problems but is promising w/ new
device and packaging tech + architectures & algorithms to exploit
them
Devices & Packaging
3-D stacking of memory + logic
Photonic interconnect
Dense and fast memory devices, from SRAM to MRAM
Architecture
Large & high-bandwidth local-memory processors (very large L1/L2)
Customized datapaths for frequent compute patterns – stencils/convolutions,
matrix, FFT, tensor operations, ... => can they be generalized? Micro
dataflow in a core?
Coarse-grained dataflow (CGRA)? => optimize data movement in general
over standard CPU/GPU (SMT vector)
Near-memory processing
FLOPS to BYTES!
Same motivation as embedded computing
Investigating the non-Quantum Future
FLOPS to BYTES for future acceleration? (2)
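One standard way to quantify the FLOPS-to-BYTES trade-off is the roofline model. A minimal sketch follows; the peak and bandwidth numbers are assumptions for illustration, not a specific machine's specs:

```python
# Roofline model: attainable performance is capped either by peak
# compute or by memory bandwidth times arithmetic intensity.
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """ai: arithmetic intensity in FLOP per byte moved."""
    return min(peak_gflops, bw_gbs * ai)

# Assumed machine: 2000 GF/s peak, 200 GB/s memory bandwidth.
# A stream-like kernel (AI ~ 0.1 FLOP/B) is bandwidth-bound at 20 GF/s:
# doubling bandwidth doubles its performance, doubling FLOPS does nothing.
print(attainable_gflops(0.1, 2000, 200))   # bandwidth-bound: 20 GF/s
print(attainable_gflops(100,  2000, 200))  # compute-bound: 2000 GF/s
```

For the bandwidth-bound kernels the slide targets, only the BYTES term of the roofline moves the needle, which is the case for 3-D stacking, photonic interconnect, and near-memory processing.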
35. Our Project: Exploring versatile HPC architecture and system software technologies
to achieve 100x performance by 2028
Problems to be solved and goals to be achieved
• General-purpose computer architectures that will accelerate a wide range of applications in
the post-Moore era have not yet been established.
• What is a feasible approach to versatile HPC systems based on bandwidth improvement?
• Goal: to explore architectures that can achieve 100x performance in a wide range of
applications around 2028
[Figure: Exemplar FLOPS-to-BYTES architecture: general-purpose many-core CPUs plus other approaches (like CGRA), combined with near-memory processing, new memory devices, and a new memory hierarchy/connection for the cores.]
Subtask 1.1: Performance characterization and modeling with
benchmarks to identify directions for exploration and improvement
(Riken R-CCS)
Subtask 1.2: Exploring a reconfigurable vector data-flow
architecture (CGRA) that can exploit increased data
transfer capability
(Riken R-CCS)
Subtask 3: Exploring near-memory computing for highly
effective bandwidth and cooling efficiency for
general-purpose computing (U-Tokyo)
Approaches and subtasks
• Exploration of future CPU node
architectures and necessary
technologies
[Plan timeline: 2018–2022 explore individual technologies, then integrate promising technologies for a target node architecture; 2023~ developing stage ...]
Subtask 2: Exploring innovative memory architectures with
ultra-deep and ultra-wide bandwidth
(Tokyo Tech.)
Subtask 4: Exploration of node architectures as extensions of
existing many-core CPUs with non-von-Neumann methods
(unnamed company)
36. Advantages of our project
Approach of exploration
• Target general-purpose architectures
instead of special-purpose accelerators,
so that conventional codes remain usable
• Explore a 100x-performance node around 2028
• Discover node architectures to be developed
Novelty and innovativeness
• FLOPS-to-BYTES architecture
to improve performance effectively
using increased bandwidth
• Performance modeling and analysis
to guide development of node architectures
• Exploration of fundamental technologies
to achieve the target node architecture
+ Innovative memory systems
+ Reconfigurable vector data-flow architecture (CGRA)
+ Near-memory computing
Specs of an HPC CPU around 2018
Intel Skylake Xeon Platinum 8180:
Process: Intel 14nm CMOS
CPU: Intel Xeon CPU, 28 cores
Frequency: 1.9GHz (for vector compute with AVX)
Peak FP64 perf: 1.7 TF (3.4 TF for FP32)
Memory: DDR4-2666 DRAM, 6ch, 130GB/s
NW injection bw: 100Gbps (InfiniBand EDR)
Power: 300W (including memory and NW)
Specs of HPC CPUs and systems
expected in the Post-Moore era around 2028:
Process: 3nm CMOS (EUV lithography)
PCRAM and other non-volatile memories as novel devices
CPU: normal cores & vector data-flow, near-memory cores
Frequency: 2GHz+
Peak FP64 perf: 20 TF (200 TF for FP32, w/ or w/o Matrix Engine)
Memory: highly-dense TSV/silicon-on-silicon 3D stacked devices
1st layer: 3D TSV stacked SRAM, >10TB/s, cap. >8GByte
2nd layer: 2.5D TSV stacked DRAM, >3TB/s, cap. >64GByte
3rd layer: 2.5D TSV Flash/NVM, ~30GB/s, cap. >1TB
NW injection bw:
12Tbps (VCSEL CWDM Micro-Optics)
Power: 300W+ (including memory and NW)
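The two spec sheets imply a large shift in the bytes-per-FLOP balance. A quick check using the slide's own figures (2018: 130 GB/s vs 1.7 TF; 2028 projection: >10 TB/s first-layer SRAM vs 20 TF):

```python
# Bytes-per-FLOP (B/F) ratio computed from the slide's numbers.
def bytes_per_flop(bw_gbs, peak_tflops):
    return bw_gbs / (peak_tflops * 1000)

# 2018 Skylake 8180: 130 GB/s DDR4 vs 1.7 TF FP64.
bf_2018 = bytes_per_flop(130, 1.7)
# 2028 projection: 10 TB/s (1st-layer stacked SRAM) vs 20 TF FP64.
bf_2028 = bytes_per_flop(10_000, 20)
print(f"2018: {bf_2018:.3f} B/F, 2028: {bf_2028:.3f} B/F")  # ~0.076 vs 0.5
```

Even though peak FLOPS grows ~12x, the projected local-memory bandwidth grows far faster, raising B/F by roughly 6.5x: the quantitative meaning of "FLOPS to BYTES".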
37.
Assume constant memory per core
#cores ~ total problem (total machine memory) size n; core performance constant
Modern massively parallel architectures: core performance constant,
performance gains ~ increasing #cores in the system, runtime T ~ problem
complexity / core performance
Compute-bound codes, O(n^k) complexity where k > 1: runtime T ~ n^(k-1), so
increasing total machine size increases T, even w/ constant memory per core
Memory-bound codes, O(n) complexity: runtime T ~ #cores / # memory controllers
At the core level, # memory controllers (e.g. access to cache) ~ #cores, so runtime
remains constant with increasing cores (weak scaling).
However, at the chip level (external memory access), # memory controllers stays
constant even as #cores increase, so T ~ #cores (no scaling)
Increasing memory size per core further is meaningless, since T ~ n
Maintaining memory size per core, let alone increasing it, will not lead to effective
performance gains: diminishing Gustafson's Law
Towards Strong Scaling (1)
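The weak-scaling argument above can be sketched numerically (synthetic time units): with constant memory per core, problem size n grows with core count p, so a compute-bound O(n^k) code with k > 1 sees its runtime grow as the machine grows.

```python
# Weak scaling with constant memory per core: problem size n grows
# proportionally to #cores p. For an O(n^k) compute-bound code,
# runtime T ~ n^k / p ~ p^(k-1) (synthetic units).
def weak_scaling_runtime(p, k):
    n = p              # n ~ #cores (constant memory per core)
    return n**k / p    # total work / aggregate core throughput

# With k = 1.5 the runtime grows as sqrt(p): a 10000x bigger machine
# on a 10000x bigger problem takes 100x LONGER, not the same time.
for p in (1, 100, 10_000):
    print(p, weak_scaling_runtime(p, k=1.5))
```

This is the quantitative sense in which Gustafson's law diminishes: keeping memory per core constant no longer keeps time-to-solution constant once k > 1.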
38.
Some applications are inherently strong-scaling
E.g. molecular dynamics, c.f. Anton
Even traditional weak-scaling codes need to strong-scale
Architectural requirements: high-BW / low-latency memory requires capacity
to be small
Science requirements: from demonstrative big runs to real R&D
Ensembles of multiple smaller problem sizes
Time-to-solution matters more than problem size
If Gustafson's law is well satisfied (e.g., well load balanced), then strong
scaling will work up to the point where bad load balance and/or the non-parallel
region becomes significant
Applications must be prepared to strong-scale, or at least deal with
hierarchical memory
Advanced localization, e.g. temporal blocking, putting only BW-sensitive data
in fast memory, ...
Towards Strong Scaling (2)
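The point at which strong scaling stops working can be sketched with a minimal runtime model (illustrative coefficients, assumed log-cost collectives): a fixed serial fraction plus a communication term that grows with core count eventually outweighs the shrinking parallel term.

```python
import math

# Strong scaling of a fixed-size problem: runtime vs. core count p.
# T(p) = serial + parallel/p + comm, with an assumed log2(p)
# collective-communication cost (illustrative coefficients).
def strong_scaling_time(p, serial=0.01, parallel=1.0, comm_coef=0.002):
    comm = comm_coef * math.log2(p) if p > 1 else 0.0
    return serial + parallel / p + comm

# Past the sweet spot, adding cores makes the run SLOWER.
best = min(range(1, 4097), key=strong_scaling_time)
print(f"best p = {best}, T = {strong_scaling_time(best):.4f}")
print(f"T(4096) = {strong_scaling_time(4096):.4f}")
```

This matches the slide's caveat: strong scaling works until load imbalance, communication, or the non-parallel region becomes significant, after which hierarchical-memory techniques such as temporal blocking are the remaining levers.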
39.
New!: Comprehensive benchmarking platform effort
Collect benchmarks and machines incl. Fugaku, and also x86 & GPU
Construct a platform to run all benchmarks x all machines
Make all benchmarks repeatedly executable so that new
instrumentation can be added easily
Couple with architectural simulators to conduct what-if analyses
New!: Enhancing system software robustness, contributing to
compilers and other performance OSS tools
Make Fugaku performance-robust, not focused only on co-design apps
OSS as the future dev platform, e.g. LLVM, and contribute results to
the community
New optimizations for new architectures before actual HW
New Efforts at R-CCS towards the Non-Quantum Future
40. If you are excited about the future of HPC...
"Feasibility Study 2.0" starting Apr 2022, ~$4M USD/y
We are hiring!
Team/Unit leaders (digital twins, possibly more in the future)
Various researcher and post-doc positions
For details, 'google' the R-CCS home page