
An evaluation of LLVM compiler for SVE with fairly complicated loops



By Hiroshi Nakashima, Kyoto University / RIKEN AICS

As part of the evaluation of Post-K’s compilers, we have been investigating the compiled code of vectorizable kernel loops in a particle-in-cell simulation program. This talk shows how the latest version of the LLVM compiler (v1.4) handles these loops, with a qualitative and quantitative comparison against the code generated by Intel’s compiler for KNL.

Hiroshi Nakashima Bio
Currently working as a professor of Kyoto University’s supercomputer center (ACCMS) for R&D on HPC programming and supercomputer system architecture, as well as a visiting senior researcher of RIKEN AICS for the evaluation of Post-K computer and its compilers.


For more info on The Linaro High Performance Computing (HPC) visit

Published in: Technology

An evaluation of LLVM compiler for SVE with fairly complicated loops

  1. An Evaluation of (ARM’s) LLVM Compiler for SVE with Fairly Complicated Loops
     Hiroshi Nakashima (Kyoto University / RIKEN-AICS)
  2. Introduction
       • We are evaluating several compilers targeting SVE and AVX-512, using kernel loops from a production-level particle-in-cell (PIC) code.
       • The program is:
           • written in C99 with restrict (and const) pointer qualifiers so that do-all loops operating on arrays can be vectorized;
           • parallelized with OpenMP (and MPI), so that all loops sit inside one large #pragma omp parallel region;
           • free of compiler-specific directives, intrinsics, and #pragma omp simd.
       • The evaluation inspects the generated assembly (.s) and is based on the number of instructions in each loop body.
     ARM HPC Workshop © 2017 H. Nakashima
  3. Kernel Loops (1/2)
       • The loops operate on:
           • SOA-type 1D arrays p{xyz}[p] and v{xyz}[p] holding the position and velocity vectors of a particle p;
           • SOA-type 4D arrays ef[][z][y][x] and bf[][z][y][x] for the electric and magnetic fields;
           • an SOA-type 4D array jv[][z][y][x] for the current density.
       • They accelerate p in each cell c by referring to the E/B vectors at c’s vertices, move p, and update the J vectors at c’s vertices.
  4. Kernel Loops (2/2)
       • particle_push-1: a simple v=a*u for 4-dimensional arrays v and u.
       • ppush-1: with p{xyz}[p], v{xyz}[p], E/B-field vectors in 48 scalar variables and base coordinates in 3 scalar variables, performs Lorentz acceleration to update v{xyz}[p], interpolating the E/B-field vectors.
       • ppush-2 / cscat-1: with p{xyz}[p], v{xyz}[p] and base coordinates in 6 scalar variables, extrapolate the contributions to the J vectors in 12 scalar variables and accumulate them. In ppush-2, p{xyz}[p] is also updated and the moving directions are recorded in mdir[i=p-head].
       • pmove-1: while (mdir[j]==0.0) j++;
  5. Bad News
       • Are the 5 loops vectorized?

                          particle_push-1  ppush-1  ppush-2  cscat-1  pmove-1
           ARM 1.4        NO               NO       NO       YES      NO
           Fujitsu Oct17  NO               YES      YES      YES      NO
           Intel 17.0.1   YES              YES      YES      YES      NO
           Cray 8.6.1     YES              YES      YES      YES      NO

       • Why can’t ARM’s compiler vectorize them?
           • We don’t know, especially for ppush-2, which is very similar to cscat-1.
       • Can the scalar loops serve as the base of vectorized loops?
           • Yes, but they need some improvements.
  6. particle_push-1
       • Source (summary):
           double (*const restrict et)[esz][esy][esx]=...;
           const double (*ef)[esz][esy][esx]=...;
           for(z) for(y) for(x) {
             et[0][z][y][x] = ef[0][z][y][x] * qmr;
             et[1][z][y][x] = ef[1][z][y][x] * qmr;
             et[2][z][y][x] = ef[2][z][y][x] * qmr;
             bt[0][z][y][x] = bf[0][z][y][x] * qmr;
             bt[1][z][y][x] = bf[1][z][y][x] * qmr;
             bt[2][z][y][x] = bf[2][z][y][x] * qmr;
           }
       • The const qualification of the RHS arrays appears insufficient for ARM (& Fujitsu) to vectorize the loop, while Intel & Cray exploit the qualification for 8x2 unrolling.
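
As a rough, compilable rendering of the loop summarized on this slide, the sketch below fixes the grid extents at compile time. ESZ, ESY, ESX and the grid_t typedef are assumptions made for illustration (the slide's esz/esy/esx are presumably runtime values), and bt is assumed to be declared like et, which the slide does not show. scale_fields_demo is a hypothetical driver, not part of the original program.

```c
#include <stddef.h>

/* Hypothetical compile-time grid extents standing in for esz/esy/esx. */
enum { ESZ = 2, ESY = 3, ESX = 8 };

/* One field component: a [z][y][x] block of doubles. */
typedef double grid_t[ESZ][ESY][ESX];

/* Scale the E and B fields by qmr.
 * Destinations are restrict-qualified and sources only const-qualified,
 * mirroring the declarations of et and ef on the slide. */
void scale_fields(grid_t *const restrict et, const grid_t *ef,
                  grid_t *const restrict bt, const grid_t *bf,
                  double qmr)
{
    for (size_t z = 0; z < ESZ; z++)
        for (size_t y = 0; y < ESY; y++)
            for (size_t x = 0; x < ESX; x++) {
                et[0][z][y][x] = ef[0][z][y][x] * qmr;
                et[1][z][y][x] = ef[1][z][y][x] * qmr;
                et[2][z][y][x] = ef[2][z][y][x] * qmr;
                bt[0][z][y][x] = bf[0][z][y][x] * qmr;
                bt[1][z][y][x] = bf[1][z][y][x] * qmr;
                bt[2][z][y][x] = bf[2][z][y][x] * qmr;
            }
}

/* Hypothetical driver: fill E with 2.0 and B with 4.0, scale by 1.5,
 * and return one scaled E element plus one scaled B element. */
double scale_fields_demo(void)
{
    static double ef[3][ESZ][ESY][ESX], bf[3][ESZ][ESY][ESX];
    static double et[3][ESZ][ESY][ESX], bt[3][ESZ][ESY][ESX];

    for (size_t c = 0; c < 3; c++)
        for (size_t z = 0; z < ESZ; z++)
            for (size_t y = 0; y < ESY; y++)
                for (size_t x = 0; x < ESX; x++) {
                    ef[c][z][y][x] = 2.0;
                    bf[c][z][y][x] = 4.0;
                }

    scale_fields(et, (const grid_t *)ef, bt, (const grid_t *)bf, 1.5);
    return et[2][ESZ - 1][ESY - 1][ESX - 1] + bt[0][0][0][0];
}
```

Only the destinations carry restrict here, so a compiler has to reason from that alone that stores to et and bt never alias the loads from ef and bf; this is the situation the slide describes.
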
  7. ppush-1 (1/2)
       • ARM (scalar) vs Intel (vector) in number of instructions:

                   (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
           ARM     163        0            1        162
           Intel   129        42           9        162

       • Is ARM’s code good enough as the base of a vectorized version?
           • No, because it has 21 redundant sub-s to access 21 loop-invariant scalar variables spilled to memory, whose displacements from the frame base are less than −256.
               sub  x21,x29,#168     //x29 is frame base
               ldur d2,[x21,#-256]   //load from x29-424
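
The offset arithmetic behind the sub/ldur pair can be checked with a small sketch. A64 LDUR encodes a signed 9-bit byte offset, so a spill slot at x29-424 cannot be addressed directly. The helper names fits_ldur and sub_amount are hypothetical, and pinning the LDUR immediate at -256 is an assumption drawn from the single instance shown on the slide.

```c
#include <stdbool.h>

/* A64 LDUR/STUR encode a signed 9-bit byte offset. */
enum { LDUR_MIN = -256, LDUR_MAX = 255 };

/* True if [base, #offset] can be encoded directly in one LDUR. */
bool fits_ldur(long offset)
{
    return offset >= LDUR_MIN && offset <= LDUR_MAX;
}

/* For an offset below LDUR_MIN, return the amount that a preceding
 * "sub tmp, base, #amount" must subtract when the LDUR immediate is
 * pinned at -256, as in the slide's sequence. */
long sub_amount(long offset)
{
    return LDUR_MIN - offset;
}
```

For the slide's slot, sub_amount(-424) is 168, matching sub x21,x29,#168 followed by ldur d2,[x21,#-256].
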
  8. ppush-1 (2/2)
       • Any other improvements/modifications needed to get good vectorized code?
           • Further eliminations:
               • one lsl for index scaling;
               • 2 redundant mov-s caused by a mysterious register allocation for the constant 1.0;
               • 2 redundant fsub-s for (b-a) to calculate c+d*(b-a) when other fsub-s for (a-b) already exist.
           • Additions:
               • 6 net additions to replace fdiv with a Newton-Raphson (NR) approximation using frecpe, frecps and fmul;
               • 2 movprfx-s for pseudo 4-operand FMAs out of 59 FMAs.

                   (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
           ARM     146        0            7        139 (3 movprfx)
           Intel   129        42           9        162
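
The NR replacement for fdiv can be sketched in scalar C. This is an illustration only: recip_step stands in for what FRECPS computes (2 - a*b), the initial estimate x0 stands in for the output of FRECPE and is passed in by the caller, and the number of steps is a free parameter rather than the count the compiler would actually need.

```c
/* Stand-in for FRECPS: returns 2 - a*b. */
static double recip_step(double a, double b)
{
    return 2.0 - a * b;
}

/* Refine an initial estimate x0 of 1/d with `steps` Newton-Raphson
 * iterations x <- x * (2 - d*x). Each iteration corresponds to one
 * frecps followed by one fmul. */
double nr_reciprocal(double d, double x0, int steps)
{
    double x = x0;
    for (int k = 0; k < steps; k++)
        x = x * recip_step(d, x);
    return x;
}

/* n/d computed without a divide: n times the refined reciprocal. */
double nr_divide(double n, double d, double x0, int steps)
{
    return n * nr_reciprocal(d, x0, steps);
}
```

Each iteration roughly squares the relative error of the estimate, so starting from 0.3 as an estimate of 1/3 (10% off) a handful of steps reaches double-precision accuracy.
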
  9. ppush-2
       • Do the small differences from cscat-1 really make it impossible to vectorize ppush-2?
           • the update of px[p], py[p] and pz[p];
           • the update of mdir[i], where i is defined by
               for(int p=head,i=0;p<tail;p++,i++)
       • The scalar loop cannot be considered as the base of vectorization, because of an overly shrewd optimization that couples variables/instructions using NEON’s 128-bit SIMD.
           • e.g. a1=a2+a3 and b1=b2+b3 are done by one instruction.
           • Even with the coupling, one loop-invariant scalar variable is spilled due to inappropriate instruction ordering.
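
To make the two-index loop header concrete, here is a one-dimensional sketch with both induction variables and an equivalent form driven by a single canonical index. The loop body (adding vx to px and classifying the new position against a unit cell) is a hypothetical simplification, not the real ppush-2 kernel; direction, push_two_iv, push_one_iv and ppush2_forms_agree are illustrative names.

```c
/* Hypothetical direction code for a unit cell [0,1):
 * +1.0 if the particle left upward, -1.0 if downward, 0.0 if it stayed. */
static double direction(double x)
{
    return (x >= 1.0) ? 1.0 : (x < 0.0) ? -1.0 : 0.0;
}

/* Loop header as on the slide: p indexes particles, i = p - head. */
void push_two_iv(double *restrict px, const double *restrict vx,
                 double *restrict mdir, int head, int tail)
{
    for (int p = head, i = 0; p < tail; p++, i++) {
        px[p] += vx[p];
        mdir[i] = direction(px[p]);
    }
}

/* Equivalent form with one canonical index i and p derived from it. */
void push_one_iv(double *restrict px, const double *restrict vx,
                 double *restrict mdir, int head, int tail)
{
    int n = tail - head;
    for (int i = 0; i < n; i++) {
        int p = head + i;
        px[p] += vx[p];
        mdir[i] = direction(px[p]);
    }
}

/* Hypothetical check: returns 1 if both forms give the same results. */
int ppush2_forms_agree(void)
{
    double px_a[6] = {0.2, 0.9, 0.5, 0.1, 0.7, 0.4};
    double px_b[6] = {0.2, 0.9, 0.5, 0.1, 0.7, 0.4};
    const double vx[6] = {0.0, 0.25, -0.75, 0.0, 0.5, -0.5};
    double md_a[4], md_b[4];

    push_two_iv(px_a, vx, md_a, 1, 5);
    push_one_iv(px_b, vx, md_b, 1, 5);

    for (int i = 0; i < 4; i++)
        if (md_a[i] != md_b[i] || px_a[i + 1] != px_b[i + 1])
            return 0;
    return md_a[0] == 1.0 && md_a[1] == -1.0 &&
           md_a[2] == 0.0 && md_a[3] == 1.0;
}
```

Both forms touch the same elements in the same order, so the second induction variable by itself does not change the loop's dependence structure.
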
  10. cscat-1
       • A fairly good job, without spilling any of the 12 reduction variables, 6 loop-invariants and 2 constants.
       • Still has a little room for improvement:
           • an add to obtain array index p from the canonicalized loop index, which holds p-head;
           • xr=(x0==x1)?(px0+px1)*0.5:((x0<x1)?x1:x0);
               • the false part is fcmgt+sel instead of fmax;
               • the final assignment is sel instead of movprfx’ed fmul;
           • movprfx+fnmls can be replaced with fnmsb.

                   (a) gross  (b) mem opd  net=(a)+(b)
           ARM     82         0            82 (3 movprfx)
           Intel   76         6            82
  11. pmove-1
       • Isn’t this a good example for fault-tolerant speculative vectorization?
           while (mdir[j]==0) j++;
       • Although none of the four compilers can vectorize this loop, the failure of ARM’s (& Fujitsu’s) compiler disappoints us, because speculative vectorization with ldff1d and the related predicate instructions is a selling point of SVE.
       • Vectorization is effective because particles tend to stay in a cell, with mdir[j]==0.
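
The idea of scanning a vector's worth of elements at a time can be emulated in scalar C. This sketch is not SVE code: VL, skip_chunked and sample_mdir are hypothetical, and the explicit bound n stands in for what a first-faulting load such as ldff1d would report at run time, namely which lanes were actually safe to read.

```c
#include <stddef.h>

/* Hypothetical lane count: a 512-bit vector of doubles. */
enum { VL = 8 };

/* Sample data: the first nonzero entry is at index 10. */
static const double sample_mdir[12] = {
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0
};

/* Scalar reference, as on the slide. It relies on a nonzero
 * element existing at or after j. */
size_t skip_scalar(const double *mdir, size_t j)
{
    while (mdir[j] == 0.0)
        j++;
    return j;
}

/* Chunked emulation. n is the number of readable elements; lanes at
 * or beyond n are treated as inactive. Returns the index of the first
 * nonzero element at or after j, or n if there is none. */
size_t skip_chunked(const double *mdir, size_t j, size_t n)
{
    while (j < n) {
        /* Predicate: how many lanes of this chunk are safe to read. */
        size_t active = (n - j < VL) ? n - j : VL;
        size_t lane = 0;

        /* Emulates a vector compare followed by "break at first hit". */
        while (lane < active && mdir[j + lane] == 0.0)
            lane++;

        j += lane;
        if (lane < active)
            return j;   /* a nonzero element was found in this chunk */
    }
    return j;
}
```

With the sample data, skip_chunked(sample_mdir, 0, 12) consumes one full chunk of zeros and then stops two lanes into the second chunk, the same position the scalar loop reaches one element at a time.
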
  12. Spilled Loop-Invariant Var/Const
       • Spill is inevitable in ppush-1 (51 invariants + 2 constants) and very likely in ppush-2 (12 reductions + 6 invariants + 4 constants).
       • Options:

           where            instruction  note
           mem (VL/8-byte)  ldr          Intel’s way for variables. Signed 9-bit offset. Consumes large space in L1.
           mem (8byte)      ld1rd        Intel’s way for constants. Unsigned 6-bit offset. As efficient as ldr?
           Xn               mov          Unique to SVE. Faster than mem?
           Zn (a lane)      dup          Not vector-length agnostic?
           immediate        fmov         All constants in the 2 loops are short enough to use this option.
           immediate        fadd, etc.   Used in cscat-1 for 0.5 (but Z3 also has 0.5).
  13. Summary
       • ARM’s LLVM compiler cannot vectorize 3 kernel loops that Intel’s can.
       • Investigating why is essential to compete with Intel (& others) on real-world HPC applications, whose programmers already know (or will soon know) what Intel’s compiler can do.
       • Since the scalar loops are of reasonable quality, simply removing the obstacles to vectorization should give us good code.
       • And with reasonable effort, ARM’s code can be superior to Intel’s.