In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to AArch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
2. Welcome!
1. This is not our first rodeo…
a. Mont Blanc -
https://www.montblanc-project.eu/wp-content/uploads/2017/12/UCHPC_Presentation_PDF_lw.pdf
b. Linaro Connect -
http://connect.linaro.org.s3.amazonaws.com/sfo17/Presentations/SFO17-200K1.pdf
c. Linaro Connect - https://connect.linaro.org/resources/san19/san19-400k1/
d. Arm - https://developer.arm.com/solutions/hpc
2. Can AArch64/Arm64 do HPC? The answer is a resounding yes!
3. Typical components of an HPC system
1. Common components.
a. As near-identical a configuration per node as possible.
b. A method of interconnecting nodes.
2. A job scheduler.
a. Slurm workload manager
b. Univa Grid Engine
c. ...and other tools or ways to parallelise work across nodes.
3. CPU / RAM / Interconnect / Storage
Is that enough?
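To make the scheduler component concrete, a minimal Slurm batch script might look like the following sketch (the partition name `compute` and the node/task counts are hypothetical placeholders, not taken from the deck):

```shell
#!/bin/bash
#SBATCH --job-name=hello-arm      # job name shown in the queue
#SBATCH --nodes=2                 # number of near-identical nodes
#SBATCH --ntasks-per-node=4      # parallel tasks per node
#SBATCH --time=00:10:00           # wall-clock limit
#SBATCH --partition=compute       # hypothetical partition name

# srun launches the tasks across the interconnected nodes
srun hostname
```

Submitted with `sbatch`, Slurm allocates the nodes, launches the tasks over the interconnect, and releases the resources when the job completes.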
4. Components
1. Core volume/density.
a. We used to count the number of simultaneous processes by the number of physical CPUs. In each node we now look at:
i. The number of CPUs
ii. The number of cores
iii. The number of threads
1. Is threading intentionally disabled?
iv. Is NUMA supported?
v. Whether those CPUs are cache-coherent.
2. Levels of cache
L0 - macro-op cache
L1 - for each core
L2 - for each cluster of cores
L3 - for each cluster of CPUs
The L1 cache is typically split into separate instruction and data caches; L2 and L3 are usually unified.
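On Linux, the per-node counts and cache hierarchy above can be inspected programmatically. A minimal sketch follows; the `/sys` paths are Linux-specific, and the files are read defensively since not every platform exposes them:

```python
import os

def read(path):
    """Return the stripped contents of a sysfs file, or None if absent."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

# Logical CPUs visible to the scheduler (sockets x cores x threads).
print("logical CPUs:", os.cpu_count())

# Walk cpu0's cache hierarchy: index0/index1 are usually the split
# L1 data/instruction caches, higher indices the unified L2/L3.
base = "/sys/devices/system/cpu/cpu0/cache"
for i in range(10):
    level = read(f"{base}/index{i}/level")
    if level is None:
        break
    ctype = read(f"{base}/index{i}/type")
    size = read(f"{base}/index{i}/size")
    print(f"L{level} {ctype}: {size}")
```

On a NUMA system, `/sys/devices/system/node/` can be walked the same way to see which CPUs belong to which memory node.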
5. Chips
● Arm v8.0-A (Advanced SIMD/NEON: 32 × 128-bit registers)
○ Ampere eMAG 8180
○ Cavium ThunderX
○ Qualcomm Kryo
● Arm v8.1-A
○ Marvell ThunderX2 (28-core variant) - Astra Supercomputer (dual-socket)
○ Marvell ThunderX2 (32-core variant) - Isambard Supercomputer (dual-socket)
● Arm v8.2-A
○ Arm Neoverse N1
○ Fujitsu A64FX (+SVE) - Fugaku Supercomputer (single-socket)
○ Huawei Kunpeng 920
○ NVIDIA Carmel
○ Ampere Altra (v8.2+)
● Arm v8.3-A (SIMD Complex Number rotation support and Nested Virtualisation support)
○ Marvell ThunderX3 (v8.3+) 2020
○ Huawei Kunpeng 930 (almost v8.4 + SVE) 2021
https://en.wikipedia.org/wiki/ARM_architecture
6. Chips
● Arm v8.6-A (Neoverse N2 ‘Zeus’ to be used in the European Processor Initiative)
○ General Matrix Multiply (GEMM)
○ Bfloat16 format support
○ SIMD matrix manipulation instructions, BFDOT, BFMMLA, BFMLAL and BFCVT
○ Enhancements for virtualization, system management and security
● Arm SVE2
○ Fine-grained data-level parallelism
Support for v8.6-A and SVE2 is expected in GCC 10 and LLVM Clang 9 (SVE2 was announced in April 2019).
https://en.wikipedia.org/wiki/ARM_architecture
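To make the BFloat16 additions concrete: bfloat16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, so converting is essentially keeping the top 16 bits of the float32 pattern, and the BFDOT-style instructions multiply bfloat16 pairs while accumulating in float32. A Python sketch of those semantics (simplified to truncation, whereas hardware BFCVT rounds to nearest; the function names are illustrative, not an Arm API):

```python
import struct

def f32_to_bf16(x: float) -> int:
    # Reinterpret the float32 bit pattern and keep the top 16 bits.
    # Truncation is a simplification of the hardware's rounding, and
    # is exact for the small integer values used in the example below.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_f32(b: int) -> float:
    # Widening back to float32 is exact: just restore the low 16 bits.
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def bfdot(a, b):
    # BFDOT-style semantics: multiply bfloat16 inputs,
    # accumulate the products in float32 precision.
    acc = 0.0
    for x, y in zip(a, b):
        acc += bf16_to_f32(f32_to_bf16(x)) * bf16_to_f32(f32_to_bf16(y))
    return acc

print(hex(f32_to_bf16(1.0)))          # bfloat16 bit pattern of 1.0
print(bfdot([1.0, 2.0], [3.0, 4.0]))  # 1*3 + 2*4
```

The same pattern of narrow multiply plus wide accumulate is what BFMMLA applies to 2×4 matrix tiles, which is why these instructions matter for GEMM-heavy ML workloads.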
7. RISC, CISC, ACCELERATOR
● The ARM ISA is a RISC implementation
○ Do simple operations highly efficiently.
○ Each operation nominally takes one clock cycle, which enables pipelining.
● A CISC implementation
○ Do simple instructions like RISC, but with additional complex instructions that take more than one clock cycle; pipelining is more cumbersome.
● Accelerators
○ Do bespoke actions as quick as possible, even asynchronously.
● The challenge:
○ Can an ARM ISA extended with accelerator-style operations be as effective as a CISC CPU plus a plug-in accelerator?
8. Interconnects
● For up to 128 cores within a single chassis there is the Arm CMN-600 Coherent Mesh Network.
● Between chassis there are:
○ PCIe
○ CCIX
○ CXL?
○ Ares
○ Tofu
● Network options
○ InfiniBand - Low latency
○ Ethernet
11. Blending Containers
● Containers are packaged environments that enable the easy execution of applications by supplying their dependencies within.
● Multiple containers can work together as building blocks of a larger solution.
● Subject to operational requirements, containers can be built to run on a variety of platforms.
○ From SBC to HPC!
● With the right scheduler and orchestration tooling, jobs become:
○ Auto-built/tested
○ Parallelised
○ Flexible
○ Scalable
○ On-demand
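One way to realise "build once, run from SBC to HPC" is a multi-architecture container build. A sketch using Docker buildx (the image name `example.org/myapp` is a hypothetical placeholder):

```shell
# Create and select a builder that can target multiple platforms
docker buildx create --use --name multiarch

# Build the same Dockerfile for 64-bit Arm and x86_64 in one step,
# pushing a multi-arch manifest so each node pulls its native image
docker buildx build --platform linux/arm64,linux/amd64 \
    -t example.org/myapp:latest --push .
```

Each node, whether an AArch64 HPC compute node or an x86 workstation, then pulls the variant matching its own architecture from the same tag.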
12. Storage is still required...
● DRAM is volatile
● Virtual disks are ephemeral
● Diskless nodes
● Persistent storage is still needed:
○ File systems
■ ext4, LVM, XFS, ZFS
○ Parallel file systems
■ Lustre
○ Distributed storage
■ Ceph
○ Media
■ Conventional disks
■ SSD, NVMe
13. Applications
What does HPC enable...
● 292 libraries/applications tested for AArch64 -
https://gitlab.com/arm-hpc/packages/-/wikis/home
● Weather prediction
○ Although Scalable Probabilistic approximation might be more efficient…
https://advances.sciencemag.org/content/6/5/eaaw0961
● Molecular Dynamics
○ GROMACS supports NEON SIMD operations
○ SIMD algorithms for Arm SVE are scheduled for 2021: https://redmine.gromacs.org/issues/2806
● AI
14. All things Cloud...
● IDC - Worldwide Server Market Revenue Declined 11.6% Year Over Year in the Second Quarter
of 2019 https://www.idc.com/getdoc.jsp?containerId=prUS45482519
● COVID-19 pandemic causes Stock Market falls of 20% (Mar.2020).
https://www.wired.com/story/covid-19-spreads-listen-stock-market/
● Working remotely is now the norm.
● Scalable on-demand services bring serverless computing.
15. The Linaro Datacenter & Cloud Group (LDCG)
● Common development center for the Arm server & infrastructure ecosystem
● Eliminates fragmentation, reduces cost and accelerates time to market
● Members can focus on innovation and differentiated value-add
● Working on core open-source software for Arm servers
○ Server architecture – UEFI/ACPI/ServerReady
○ Armv8 enablement & optimization
○ Big Data: BigTop, Hadoop and Spark
○ Cloud infrastructure such as Kubernetes, OpenStack and Ceph
Linaro Developer Cloud: Enterprise-class Arm Powered servers hosted in the UK are available for development, test, CI and cloud deployments for VMs and containers. www.linaro.cloud
16. Lower deployment & management barriers
Leverage the Linaro Developer Cloud and other services to develop cost-effective, Cloud-integrated HPC development frameworks and generate reference implementations to accelerate adoption.
Member-driven with Advisory Board
Members determine the work completed by engineering resources, while the advisory board provides subject-matter expertise on HPC requirements and guidance and feedback on the ongoing HPC SIG strategic direction and roadmap.
Driving datacenter-class, open-source HPC development on Arm
Identify and adopt standards to make HPC deployment on Arm a commercial imperative. Develop real-world use cases that reap the benefits of Arm while ensuring interoperability, modularization and orchestration.
LDCG High Performance Computing (HPC) SIG: a collaborative project building on the work of the Linaro Datacenter & Cloud Group.
17. Functions-as-a-Service
● Linaro HPC hardware is being reconfigured towards a scalable environment.
○ A combination of OpenStack, K8S and OpenHPC.
○ A testbed to verify combinations of heterogeneous ingredients for the optimal recipes.
● Service Consumers
○ Send the service request and receive the service answer.
○ The service consumer will be CPU-, GPU-, ISA- and accelerator-agnostic!
If the equipment is billed as pay-per-use, then it is our challenge to ensure that AArch64 solutions match a significant number of requests.
18. Thank you
Continuing to accelerate deployment of your
Arm-based solutions through collaboration
hpc@linaro.org