
Arm tools and roadmap for SVE compiler support


By Richard Sandiford, Florian Hahn (Arm), ARM

This presentation will give an overview of what Arm is doing to develop the HPC ecosystem, with a particular focus on SVE. It will include a brief synopsis of both the commercial and open-source tools and libraries that Arm is developing and a description of the various community initiatives that Arm is involved in. The bulk of the talk will describe the roadmap for SVE compiler support in both GCC and LLVM. It will cover the work that has already been done to support both hand-optimised and automatically-vectorised code, and the plans for future improvements.

For more info on The Linaro High Performance Computing (HPC) visit

Published in: Technology

  1. Arm tools and roadmap for SVE compiler support
     Richard Sandiford, Florian Hahn
     Arm HPC Workshop Tokyo 2017
     © 2017 Arm Limited
  2. Tools and libraries for HPC on Armv8-A and SVE
     Open-source
     • SVE support for Linux, GLIBC, GDB, etc.: allows running GNU/Linux on an SVE system
     • SVE support for GCC: the system compiler for GNU/Linux systems
     • SVE support for LLVM/Clang: Clang is a widely used open-source compiler, and LLVM is now used as a library in several open-source projects
     Commercial
     • Arm Allinea Studio: fully supported commercial HPC suite for the Armv8-A architecture, including SVE
     • Arm Instruction Emulator: allows users to evaluate SVE code using existing Armv8-A hardware
  3. Arm Allinea Studio: a quick glance at what is in Arm Allinea Studio
     C/C++ Compiler
     • C++14 support
     • OpenMP 4.5 without offloading
     • SVE ready
     Fortran Compiler
     • Fortran 2003 support
     • Partial Fortran 2008 support
     • OpenMP 3.1
     • SVE ready
     Performance Libraries
     • Optimized math libraries
     • BLAS, LAPACK and FFT
     • Threaded parallelism with OpenMP
     Forge (DDT and MAP)
     • Profile, tune and debug
     • Scalable debugging with DDT
     • Parallel profiling with MAP
     Performance Reports
     • Analyze your application
     • Memory, MPI, threads, I/O, CPU metrics
     Tuned by Arm for a wide range of server-class Arm-based platforms
  4. SVE
  5. Introducing the Scalable Vector Extension (SVE)
     A vector extension to the Armv8-A architecture with some major new features:
     • Gather-load and scatter-store: loads a single register from several non-contiguous memory locations
     • Per-lane predication: operations work on individual lanes under control of a predicate register
     • Predicate-driven loop control and management: eliminate scalar loop heads and tails by processing partial vectors
     • Vector partitioning and software-managed speculation: First Faulting Load instructions allow memory accesses to cross into invalid pages
     • No preferred vector length: the above features allow the production of compiled binaries that are agnostic to hardware vector length (which can be between 128 and 2048 bits, in 128-bit increments)
     [Figures: a predicated vector add (1 2 3 4 + 5 5 5 5 under predicate 1 0 1 0 gives 6 2 8 4), further predicate diagrams, and predicate generation with INDEX and CMPLT for the loop for (i = 0; i < n; ++i)]
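
The per-lane predication and predicate-driven loop control listed on this slide can be imitated in portable C. What follows is a minimal sketch under stated assumptions, not SVE code: `while_lo`, `add_const_predicated` and `MAX_LANES` are hypothetical names, and the `lanes` argument stands in for the hardware vector length, which real vector-length-agnostic code never hard-codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the emulated lane count (an assumption of
 * this sketch; it is not an architectural constant). */
#define MAX_LANES 32

/* Emulates the idea behind WHILELO: lane k is active while i + k < n.
 * Returns the number of active lanes. */
static size_t while_lo(int pred[], size_t lanes, size_t i, size_t n)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; ++k) {
        pred[k] = (i + k < n);
        active += (size_t)pred[k];
    }
    return active;
}

/* Adds c to every element, one emulated vector of `lanes` elements per
 * iteration. The final, partial vector is handled by the predicate, so
 * no scalar tail loop is needed. */
void add_const_predicated(int *data, size_t n, int c, size_t lanes)
{
    int pred[MAX_LANES];
    assert(lanes > 0 && lanes <= MAX_LANES);

    for (size_t i = 0; while_lo(pred, lanes, i, n) > 0; i += lanes) {
        for (size_t k = 0; k < lanes; ++k) {
            if (pred[k])            /* inactive lanes touch no memory */
                data[i + k] += c;
        }
    }
}
```

Calling `add_const_predicated` with different `lanes` values gives the same result, which is the property that lets one binary run on hardware with different vector lengths.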
  6. GNU/Linux support for SVE: kernel and core userspace components
     Linux
     • Linux host OS and userspace support merged for v4.15. Details: linux/Documentation/arm64/sve.txt
     • Vector length selectable per user task (up to 2048 bits, subject to hardware support)
     • Self-hosted debug and introspection supported via ptrace
     • Support for SVE in KVM guests currently under discussion (RFC posted, not yet merged)
     GDB
     • SVE support currently being upstreamed, expected to be committed in Q1 2018
     • Supports both self-hosted debug and remote debugging
     GNU binutils
     • SVE support committed in Q3 2016
     • Available in GNU binutils 2.28 and later
     GLIBC
     • New GLIBC not needed to run SVE code
     • Header files need updating for new Linux userspace interfaces (minor change)
     libgcc unwinder
     • Needed to unwind through frames that spill SVE registers
     • Patch approved for GCC 8 (due for release in Q2 2018)
     Contact for details
  7. LLVM and GCC upstreaming roadmap
     [Timeline figure covering January 2018 to May 2019]
     LLVM (releases LLVM 6.0, LLVM 7.0, LLVM 8.0): initial SVE autovec support, partial SVE MC support, SVE MC support, codegen, vectorized IR, full SVE autovec support
     GCC (releases GCC 8, GCC 9): full SVE intrinsics support, automatic SVE vectorisation, deadline for new GCC 9 features
  8. GCC
  9. Status of GCC for SVE: making progress towards the aim of getting SVE support into GCC 8
     • GCC support was developed by a team in Arm and released publicly in Nov 2016
     • This year Linaro have been updating and incorporating those changes into upstream GCC
     • Nearly 60% of the patches have been committed and a further 30% have been accepted but can't be committed yet
     • Aim is to get the rest into GCC 8, due for release in Q2 2018
     Patches
     • Committed: 259 (58%)
     • Approved: 131 (30%)
     • Unreviewed: 52 (12%)
  10. Features of GCC for SVE: summary of features and scope of changes
      • "Vector-length agnostic" and "vector-length specific" code generation
      • Autovectorisation, including:
          • Fully-predicated loops
          • Predicated structure loads and stores
          • Gather loads and scatter stores
          • Aliasing checks with variable strides
          • Non-associative reductions (FADDA)
      • Spill code improvements
      • No support for SVE intrinsics yet: aim is to add that to GCC 9
      Total lines of code for SVE support
      • VL registers and offsets: 30263 (60%)
      • Autovectorisation: 7415 (15%)
      • AArch64 backend: 6717 (13%)
      • Spilling improvements: 2668 (5%)
      • Other: 3212 (7%)
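
The "non-associative reductions (FADDA)" item refers to floating-point sums whose result depends on evaluation order. The sketch below illustrates why a vectoriser cannot simply keep one partial sum per lane unless the source allows reassociation. It is plain C rather than SVE code, and the function names and the 8-lane limit are hypothetical choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum: the order a scalar loop uses, and the order
 * a strictly-ordered reduction such as FADDA preserves. */
double sum_in_order(const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Reassociated sum: one partial sum per emulated lane, combined at the
 * end. Faster to vectorise, but it may round differently. */
double sum_per_lane(const double *v, size_t n, size_t lanes)
{
    double partial[8] = {0};
    double acc = 0.0;
    assert(lanes > 0 && lanes <= 8);

    for (size_t i = 0; i < n; ++i)
        partial[i % lanes] += v[i];
    for (size_t k = 0; k < lanes; ++k)
        acc += partial[k];
    return acc;
}
```

With inputs such as `{1e16, 1.0, -1e16, 1.0}` the two functions can return different values, because the small terms are absorbed at different points in the two orderings.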
  11. SVE vectorisation example: naïve daxpy
      Comparison of GCC output: scalar code vs. WHILE-based SVE code

      Scalar code (gcc -O3 -march=armv8-a):
          daxpy:
                  cbz     x0, .L1
                  mov     x3, 0
                  .p2align 3
          .L3:
                  ldr     d1, [x1, x3, lsl 3]
                  ldr     d2, [x2, x3, lsl 3]
                  fmadd   d1, d1, d0, d2
                  str     d1, [x2, x3, lsl 3]
                  add     x3, x3, 1
                  cmp     x0, x3
                  bne     .L3
          .L1:
                  ret

      VL-agnostic code (gcc -O3 -march=armv8-a+sve):
          daxpy:
                  cbz     x0, .L1
                  mov     x3, 0
                  mov     z0.d, d0
                  whilelo p0.d, xzr, x0
                  ptrue   p1.d, all
                  .p2align 3
          .L3:
                  ld1d    z2.d, p0/z, [x2, x3, lsl 3]
                  ld1d    z1.d, p0/z, [x1, x3, lsl 3]
                  fmad    z1.d, p1/m, z0.d, z2.d
                  st1d    z1.d, p0, [x2, x3, lsl 3]
                  incd    x3
                  whilelo p0.d, x3, x0
                  bne     .L3
          .L1:
                  ret

      VL512-specific code (gcc -O3 -march=armv8-a+sve -msve-vector-bits=512):
          daxpy:
                  cbz     x0, .L1
                  mov     x3, 0
                  mov     z0.d, d0
                  whilelo p0.d, xzr, x0
                  ptrue   p1.d, vl8
                  .p2align 3
          .L3:
                  ld1d    z2.d, p0/z, [x2, x3, lsl 3]
                  ld1d    z1.d, p0/z, [x1, x3, lsl 3]
                  fmad    z1.d, p1/m, z0.d, z2.d
                  st1d    z1.d, p0, [x2, x3, lsl 3]
                  add     x3, x3, 8
                  whilelo p0.d, x3, x0
                  bne     .L3
          .L1:
                  ret

      • Vector code has only 3 extra instructions (could be just 1)
      • No benefit to VL-specific code here!
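
The slide shows only compiler output. For orientation, here is a plausible C source that would lead to listings of this shape. It is a reconstruction, not the authors' source: the parameter order is inferred from the register use in the assembly (x0 = n, x1 = x, x2 = y, d0 = a), and the `restrict` qualifiers are an assumption made because the listings contain no runtime alias check.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` asserts that x and y do not overlap (assumed here; the
 * original source is not shown on the slide). */
void daxpy(size_t n, const double *restrict x, double *restrict y, double a)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] * a + y[i];
}
```

The multiply and add in the loop body correspond to the single `fmadd` (scalar) or `fmad` (SVE) instruction in each listing.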
  12. LLVM/Clang
  13. Where are our changes to LLVM/Clang
      Line additions/removals in LLVM and Clang repos
      • Base is LLVM/Clang 5.0 release branch
      • Unit tests represent 79% of all changes, so are omitted here
      • LLVM: 58367 lines
      • Clang: 11294 lines
      Fork with SVE support -software/LLVM-SVE
  14. Where are our changes to LLVM/Clang: the SVE side
      Assembler/MC
      • Assembler support for SVE
      • 120 patches ready
      • Started upstreaming in November
      • Aim: submit changes by May
      IR Types
      • Introduce scalable vector type
      • Initially use intrinsics for functions like stepvector
      • Codegen for scalable IR
      • Aim: initial support after LLVM 6.0 release
      Autovectorization
      • Extend loop vectorizer to use length-agnostic IR
      • Make sure length-agnostic vectorization fits into VPlan
      • Aim: initial support in LLVM 7.0
  15. Changes to Clang and libraries
      SVE intrinsics support
      • C language extension with intrinsics for SVE, supported by the commercial compiler
      • Allows code to be hand-optimized for SVE
      • Aim is to upstream it once there is consensus in the community and LLVM support is committed
      Vectorized math library support
      • Goal: vectorize calls to libm functions
      • Define vector ABI for libraries to use
      • Extend LoopVectorize to generate calls using the ABI
      • Use SLEEF (a vectorizable implementation of parts of libm)
      • Support for Advanced SIMD is already upstream, SVE support downstream
  16. Conclusion
      Commercial Tools
      • Compiler with SVE & Fortran support available
      • Arm Instruction Emulator runs SVE userspace binaries
      Open Source, HPC Applications
      • GCC & LLVM SVE enablement is a priority
      • Working on Armv8-A performance
      • Improving Flang
      • Optimized libraries
      • Making sure HPC apps work on Armv8-A
      • Port HPC apps
      • Build Arm HPC community
  17. Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन्यवाद
  18. Backup
  19. Where are our changes to LLVM: new target-independent passes
      LoopExprTreeFactoring
      • Re-balances chains of multiplies/adds to allow better use of FMAs
      • Gives a significant (~30%) improvement on SpecCPU2006 Calculix
      LoopSpeculativeBoundsCheck
      • Re-factoring of LoopVersioningLICM
      • Allows hoisting of loop-independent loads that feed into the induction variable, with runtime checks for aliasing
      • Improves SpecCPU2000 GCC by ~5%
      PreInlinerTransforms
      • Split call sites where arguments are predicated conditions in predecessors
      • Exposes additional inlining opportunities
      • Improves SpecCPU2017 GCC by ~22%
      Aim: at least one of these passes for Clang 6.0, the others for Clang 7.0
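
The idea of hoisting a loop-invariant load under a runtime alias check can be shown at source level. The sketch below is an illustration of the general loop-versioning technique, written by hand in C; it is not the output or the implementation of any of the passes named on this slide, and `fill_original` and `fill_versioned` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Original form: *limit is reloaded on every iteration, because a store
 * to out[i] might change it. */
void fill_original(long *out, const long *limit, long value)
{
    for (long i = 0; i < *limit; ++i)
        out[i] = value;
}

/* Versioned form: check once at run time whether *limit can overlap the
 * stored range. If not, the load is hoisted and the loop bound becomes
 * a plain local, which is much easier to vectorise. */
void fill_versioned(long *out, const long *limit, long value)
{
    long n = *limit;
    if (n <= 0)
        return;

    uintptr_t lo = (uintptr_t)out;
    uintptr_t hi = (uintptr_t)(out + n);
    uintptr_t p  = (uintptr_t)limit;
    int may_alias = (p < hi) && (p + sizeof *limit > lo);

    if (!may_alias) {
        for (long i = 0; i < n; ++i)        /* fast path: bound hoisted */
            out[i] = value;
    } else {
        for (long i = 0; i < *limit; ++i)   /* fallback: original loop */
            out[i] = value;
    }
}
```

Both functions produce the same stores in every case; the versioned one only pays for the overlap test once per call.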
  20. GCC spill code improvements: more vectorisation opportunities means more potential for spilling
      Problem: instruction scheduling tends to increase register pressure
      Solution:
      • Use pressure-sensitive scheduling by default
      • Assume for register-pressure purposes that only 8 predicate registers are available
      Problem: stack frames can contain a mixture of variable-length and fixed-length data
      Solution:
      • Don't share stack slots between fixed-length and variable-length data
      • Use shared "anchor" addresses to access nearby spills
      Problem: normal function calls do not preserve SVE state, but optimisations can move vector operations across calls
      Solution:
      • A new "early rematerialisation" pass that runs before register allocation and tries to make sure that SVE values are recalculated after calls where necessary
          • More effective than trying to stop optimisations moving values across calls
          • Handles more cases than the register allocator
      Problem: spilt values are often duplicated invariants
      Solution:
      • Spill the duplicated invariant instead of the vector (future work)
  21. Evaluating SVE
      Compile: Arm Compiler
      • C/C++/Fortran code
      • SVE via auto-vectorization, intrinsics and assembly
      • Compiler Insight: compiler places results of compile-time decisions and analysis in the resulting binary
      • Supplied with SVE Performance Libraries
      Emulate: Arm Instruction Emulator
      • Runs userspace binaries for future Arm architectures on today's systems
      • Supported instructions run unmodified
      • Unsupported instructions are trapped and emulated
      Analyse: Arm Code Advisor
      • Console or web-based output shows prioritized advice in-line with original source code
  22. Community building
      Our app work involves engaging with code owners and users to get suitable test cases, to get Arm support built in, and to help them make AArch64 testing part of their development processes.
      Beyond the people we collaborate with directly, various complementary Arm HPC communities already exist:
      • Arm HPC User Group (SC) and GoingArm (ISC/ArmRS)
      • Arm HPC Google Group (
      • Arm HPC GitLab pages (
      Encouraging our partners to use GitLab is a priority
  23. Wiki
      • Dynamic list of common HPC applications
      • Up-to-date summary of package status
      • Provides focus for porting progress
      • Community driven: maintained by Arm, but anyone can join and contribute
      • Allows developers to share recipes, and learn from progress on other applications
      • Provides a mechanism for tracking status of applications and package sets (e.g. OpenHPC packages, Mantevo, etc.)
  24. Open source libraries for helping increase performance
      Arm Optimized Routines
      • These routines provide high-performing versions of many math.h functions
      • Algorithmically better performance than standard library calls
      • No loss of accuracy
      SLEEF library
      • Vectorized math.h functions
      • Provided as an option for use in Arm Compiler
      Perf-libs-tools
      • Understanding an application's needs for BLAS, LAPACK and FFT calls
      • Used in conjunction with Arm Performance Libraries, can generate logging info to help profile applications for specific case breakdowns
      • Example visualization: DGEMM cases called