©ARM 2017
13th ANNUAL WORKSHOP 2017
RDMA on ARM
Pavel Shamis (Pasha), Principal Research Engineer
March, 2017
ARM
RDMA on ARM?
ARM Servers from multiple manufacturers
ARM Overview
An introduction to ARM
ARM is the world's leading semiconductor
intellectual property supplier.
We license to over 350 partners and are present in 95% of smartphones,
80% of digital cameras, and 35% of all electronic devices. A total of
60 billion ARM cores have shipped since 1990.
Our CPU business model:
License technology to partners, who use it to create their own
system-on-chip (SoC) products.
• We may license an instruction set architecture (ISA), such as “ARMv8-A”,
• or a specific implementation, such as “Cortex-A72”.
• Partners who license an ISA can create their own
implementation, as long as it passes the compliance tests.
…and our IP extends beyond the CPU
A business model that shares success
• Everyone in the value chain benefits
• Long-term sustainability
Design once and reuse is fundamental
• Spread the cost amongst many partners
• Technology reused across multiple applications
• Creates market for ecosystem to target
• Re-use is also fundamental to the ecosystem
Upfront license fee
• Covers the development cost
Ongoing royalties
• Typically based on a percentage of chip price
• Vested interest in success of customers
A partnership business model
[Diagram: ARM, the IP provider, licenses technology to a SemiCo partner (licence fee, business development); partners develop chips; OEM customers sell consumer products (royalty).]
Approximately 1350 licenses
Grows by ~120 every year
More than 440 potential royalty payers
14.8bn+ ARM-powered chips in 2015
Partnership
Range of SoCs addressing infrastructure
One size does not fit all
[Figure: SoCs ranging from “Highly Accelerated” to “Massively Multicore”, including QorIQ® Layerscape 2080A.]
Serious ARM HPC deployments starting in 2017
Two big announcements about ARM in HPC in Europe:
Japan
slides from Fujitsu at ISC’16
Software Ecosystem
ARM HPC ecosystem roadmap
[Roadmap chart. Columns: 2016, 2017, Future. Rows: Hardware, Open-Source software, ARM HPC tools, ISV software. Legend: Released, Planned, Concept.]
Items shown on the chart:
• Cavium ThunderX
• Allinea DDT and MAP
• Rogue Wave TotalView
• AppliedMicro X-Gene 1 & 2
• Fujitsu – Post K (SVE)
• AMD Seattle
• Altair PBS Pro
• Cavium ThunderX2
• Qualcomm Centriq
• NAG Library & Compiler
• ARM Optimized Routines
• GCC (gcc/g++/gfortran)
• LLVM – clang
• LLVM – Flang
• ARM Code Advisor (Beta)
• ARM Code Advisor (Full release)
• ARM Optimized Routines – vector versions
• Phytium Mars
• OpenHPC 1.2
• ARM Performance Libraries
• PathScale ENZO
• ARM Fortran Compiler
• ARM C/C++ Compiler – ahead of LLVM trunk
• ARM Instruction Emulator
• AppliedMicro X-Gene 3
Linux / FreeBSD w/ AArch64 support
• Engaged with the FreeBSD Foundation, Semihalf, and Cavium to get FreeBSD on ARMv8
• FreeBSD beta version demo’d by Semihalf – Nov. 2015
• Ubuntu 12.04 LTS & 14.04 LTS released; also 14.10 & 15.04 releases
• Red Hat Enterprise Linux Server for ARM 7.2 beta – Sept. 2015
• Fedora 22 released – May 2015
• Fedora 23 released – Nov. 2015
• CentOS Linux 7 for AArch64 GA – August 2015
• OpenSUSE 13.2 – Nov. 2014
• SUSE launches partner program to bring SUSE Linux Enterprise 12 to 64-bit ARM – July 2015 @ ISC
• Debian 8 adds AArch64 – April 2015
Open source and commercial compilers
• LLVM
  • C, C++
  • OpenMP 3.1 (4.0 coming soon)
  • Fortran coming Q1 2017
• PathScale
  • C, C++, Fortran
  • OpenACC
  • OpenMP 4.0
• NAG
  • Fortran
  • OpenMP 3.1
• GCC
  • C, C++, Fortran
  • OpenMP 4.0
• ARM C/C++ Compiler
  • LLVM based
  • Includes SVE
RESEARCH & DEVELOPMENT
RDMA
RDMA Support
• Mellanox OFED 2.4 and above supports ARM
• Linux kernel 4.5.0 and above (possibly earlier versions as well)
• OFED – no support
• Linux distributions – ongoing process
Lessons Learned
• Memory barriers
  • Multithreaded environments
  • Software-hardware interaction
  • Examples: https://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25
• You can “fish” for these bugs in MPI implementations around Eager-RDMA and shared
memory protocols
[Diagram: the sender writes the payload, issues a write barrier, then writes the notify flag (RDMA); the receiver busy-waits reading the notify flag, issues a read barrier, then reads the payload.]
Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the ARM and POWER relaxed memory models." Draft available from
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (2012).
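The payload/notify ordering on this slide can be sketched with C11 fences. This is a minimal illustration of the CPU-to-CPU shared-memory case only; `mailbox_t`, `mailbox_publish`, and `mailbox_consume` are hypothetical names, not UCX or MPI APIs. When the writer is an RDMA device rather than another core, architecture-specific barriers such as those in the UCX header linked above are typically needed instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mailbox shared by one producer and one consumer. */
typedef struct {
    uint64_t   payload;  /* plain data written by the producer */
    atomic_int notify;   /* flag polled by the consumer */
} mailbox_t;

/* Producer side: write payload, write barrier, write notify. */
static inline void mailbox_publish(mailbox_t *mb, uint64_t value)
{
    mb->payload = value;
    /* Without this fence a weakly ordered CPU may make the flag
     * visible before the payload. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->notify, 1, memory_order_relaxed);
}

/* Consumer side: busy-wait on notify, read barrier, read payload. */
static inline uint64_t mailbox_consume(mailbox_t *mb)
{
    while (atomic_load_explicit(&mb->notify, memory_order_relaxed) == 0) {
        /* spin until the producer raises the flag */
    }
    /* Without this fence the payload load may be satisfied before
     * the flag load, returning stale data. */
    atomic_thread_fence(memory_order_acquire);
    return mb->payload;
}
```

In real use the two functions run on different threads or processes that share the mailbox. Code that omits either fence often appears to work on x86 and fails intermittently on ARM.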
Lessons Learned – continued
• Low-level timers
  • Typically found in benchmarks and MPI
  • Code examples: https://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L35
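As a rough sketch of what a portable low-level timer can look like: on AArch64 the generic timer's virtual counter (`cntvct_el0`) and its frequency (`cntfrq_el0`) are readable from user space on Linux, so they can stand in for an x86 `rdtsc`-style read. The function names `read_ticks` and `ticks_per_second` are illustrative only, and the non-ARM branch falls back to `clock_gettime` so that the example builds elsewhere.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Read a monotonically increasing tick counter. */
static inline uint64_t read_ticks(void)
{
#if defined(__aarch64__)
    uint64_t t;
    /* Virtual counter of the ARM generic timer. */
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
#else
    /* Fallback for other architectures: nanoseconds as ticks. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Tick rate, needed to convert tick deltas to seconds. */
static inline uint64_t ticks_per_second(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ull;
#endif
}
```

Elapsed seconds are then `(double)(t1 - t0) / ticks_per_second()`. Note that the counter frequency is usually far lower than the CPU clock, so it should not be treated as a cycle counter.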
Lessons Learned – continued
• Not all cache lines are 64 bytes!
  • The size is implementation dependent
  • Both 128-byte and 64-byte cache lines exist
http://xeroxnostalgia.com/duplicators/xerox-9200/
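One way to avoid hard-coding 64 is to query the line size at run time. The sketch below is an assumption-laden illustration, not code from UCX: on AArch64 it decodes the `DminLine` field (bits 19:16) of `CTR_EL0`, which reports the smallest data cache line in the hierarchy as log2 of the number of 4-byte words; elsewhere it tries the glibc `sysconf` extension and finally falls back to 64. The name `dcache_line_size` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Best-effort data cache line size in bytes. */
static inline size_t dcache_line_size(void)
{
#if defined(__aarch64__)
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    /* DminLine is log2(words); a word is 4 bytes. */
    return (size_t)4 << ((ctr >> 16) & 0xf);
#else
    long sz = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when unknown. */
    sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return sz > 0 ? (size_t)sz : 64;
#endif
}
```

Because `DminLine` is the minimum across cache levels, padding that must avoid false sharing on every level may need a larger, conservatively chosen constant (for example 128).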
Optimizations
• AVX => NEON
  • Mostly found around communication request initialization code
  • https://github.com/openucx/ucx/blob/891e20ef90257d1e2721da52461b0261220c82d8/src/uct/ib/mlx5/ib_mlx5.inl#L160
• Busy-wait loop
  • See Wait-For-Event (WFE)
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM.”
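A WFE-based busy-wait can be sketched as follows. This is an illustrative pattern, not the implementation from the cited paper: on AArch64 a load-exclusive (`ldaxr`) arms the exclusive monitor on the polled address, and `wfe` then parks the core until that monitor is cleared by a write or another event arrives. The name `wait_until_equal` is hypothetical, and other architectures fall back to a plain spin.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Block until *addr == expected, with acquire ordering. */
static inline void wait_until_equal(_Atomic uint32_t *addr, uint32_t expected)
{
#if defined(__aarch64__)
    uint32_t seen;
    __asm__ volatile(
        "sevl\n"                /* set local event so the first wfe falls through */
        "1: wfe\n"              /* sleep until an event is signalled */
        "ldaxr %w0, [%1]\n"     /* load-acquire exclusive; arms the monitor */
        "cmp %w0, %w2\n"
        "b.ne 1b\n"             /* value not there yet: wait again */
        : "=&r"(seen)
        : "r"(addr), "r"(expected)
        : "cc", "memory");
#else
    /* Portable fallback: ordinary polling loop. */
    while (atomic_load_explicit(addr, memory_order_acquire) != expected) {
        /* spin */
    }
#endif
}
```

Compared with a tight polling loop, the core issues far fewer loads while waiting, which is the motivation for using WFE in routines such as SHMEM_WAIT().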
Preliminary Results
Testbed
• 2 x SoftIron Overdrive 3000 servers with AMD Opteron A1100 / 2 GHz
• ConnectX-4 IB/VPI EDR (PCIe Gen2 x8)
• Ubuntu 16.04
• MOFED 3.3-1.5.0.0
• UCX [0558b41]
• XPMEM [bdfcc52]
• OSHMEM/OPEN-MPI [fed4849]
OpenUCX and OpenSHMEM on ARM
[Diagram: software stack, top to bottom: HPC applications; OpenSHMEM; UCX SPML; UCX framework (UCP); UCT transports: MLX5 UCT, Verbs UCT, XPMEM UCT.]
OpenUCX with InfiniBand
40%
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM.”
SHMEM_WAIT()
73%
35%
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM.”
OpenSHMEM SSCA
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM.”
7-30%
OpenSHMEM GUPs
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM.”
21%
Summary
• The Linux RDMA community is doing a great job!
• A lot of progress has been made in the ARM HPC/server software
ecosystem
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited
(or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be
trademarks of their respective owners.
Copyright © 2017 ARM Limited
Backup
CCIX
• Accelerators and network devices (NIC/HCA/etc.) as first-class
“citizens” in the system
• Seamless process and accelerator hardware cache
coherence support
• Low latency and high bandwidth
• Allows in-line acceleration
  • Bump-in-the-wire processing (network packet processing, storage
acceleration, etc.)
• Allows “off-line” acceleration (co-processor model)
• Driver-less / interrupt-less usage model
• http://www.ccixconsortium.com
CCIX
CCIX Enables System Connectivity For All Elements (CPU, GPU, IO, FPGA)
Gen-Z
• All data is accessed by some form of a Read or a Write
  • Examples of reads: DDR Row + Column Read, PCI DMA Read, SCSI Write,
Socket Read, File Read, RDMA Read
  • Examples of writes: DDR Row + Column Write, PCI DMA Write, SCSI Read,
Socket Write, File Write, RDMA Write
• The goal: simplify the world to memory-semantic Reads & Writes
Gen-Z Overview
• An open, standards-based, scalable system interconnect and protocol
• Optimized to support memory-semantic communications
• Breaks processor-memory interlock
• Split controller model
  • Memory controller
    • Initiates high-level requests: Read, Write, Atomic, Put / Get, etc.
    • Enforces ordering, reliability, path selection, etc.
  • Media controller
    • Abstracts memory media
    • Supports volatile / non-volatile / mixed media
    • Performs media-specific operations
    • Executes requests and returns responses
• Enables data-centric computing (accelerator, compute, etc.)
