
Compilation of COSMO for GPU using LLVM

By Tobias Grosser, Scalable Parallel Computing Laboratory

The COSMO climate and weather model delivers daily forecasts for Switzerland and many other nations. As a traditional HPC application, it was developed with SIMD CPUs in mind, and large manual efforts were required to enable the 2016 move to GPU acceleration. Because today's high-performance computer systems increasingly rely on accelerators to reach peak performance, and manual translation to accelerators is both costly and difficult to maintain, we propose a fully automatic accelerator compiler that translates scientific Fortran codes to CUDA GPU-accelerated systems. Several challenges had to be overcome to make this a reality: 1) improved scalability, 2) automatic data placement using unified memory, 3) loop rescheduling to expose coarse-grained parallelism, 4) inter-procedural loop optimization, and 5) plenty of performance tuning. Our evaluation shows that end-to-end automatic accelerator compilation is possible for non-trivial portions of the COSMO climate model, despite the lack of complete static information. Non-trivial loop optimizations previously implemented manually are performed fully automatically, and memory management happens fully transparently using unified memory. Our preliminary results show notable performance improvements over sequential CPU code (execution time reduced from 40s to 8s), and we are currently working on closing the remaining gap to hand-tuned GPU code. This talk is a status update on our most recent efforts; it is also intended to gather feedback on future research plans towards automatically mapping COSMO to FPGAs.

Tobias Grosser Bio
Tobias Grosser is a senior researcher in the Scalable Parallel Computing Laboratory (SPCL) of Torsten Hoefler at the Computer Science Department of ETH Zürich. Supported by a Google PhD Fellowship, he received his doctoral degree from Université Pierre et Marie Curie under the supervision of Albert Cohen. Tobias' research takes place at the border between low-level compilers and high-level program transformations, with the goal of enabling complex - but highly beneficial - program transformations in a production compiler environment. With the Polly loop optimizer he develops a loop transformation framework that is today a community project supported through the Polly Labs research laboratory. Tobias also developed advanced tiling schemes for the efficient execution of iterated stencils. Today Tobias leads the heterogeneous compute efforts in the Swiss Universities funded ComPASC project and is about to start a three-year SNSF Ambizione project on advancing automatic compilation and heterogenization techniques at ETH Zürich.


For more info on the Linaro High Performance Computing (HPC) SIG, visit https://www.linaro.org/sig/hpc/


Compilation of COSMO for GPU using LLVM

  1. 1. spcl.inf.ethz.ch @spcl_eth Automatic Accelerator Compilation of the COSMO Physics Core Tobias Grosser, Siddharth Bhat, Torsten Hoefler December 2017 Albert Cohen, Sven Verdoolaege, Oleksandr Zinenko Polly Labs, ENS Paris Johannes Doerfert Uni. Saarbruecken Roman Gereev, Ural Federal University Hongbin Zheng, Alexandre Isoard Xilinx Swiss Universities / PASC Qualcomm, ARM, Xilinx … many others
  2. 2. spcl.inf.ethz.ch @spcl_eth 2 Weather Physics Simulations Machine Learning Graphics
  3. 3. spcl.inf.ethz.ch @spcl_eth row = 0; output_image_ptr = output_image; output_image_ptr += (NN * dead_rows); for (r = 0; r < NN - KK + 1; r++) { output_image_offset = output_image_ptr; output_image_offset += dead_cols; col = 0; for (c = 0; c < NN - KK + 1; c++) { input_image_ptr = input_image; input_image_ptr += (NN * row); kernel_ptr = kernel; S0: *output_image_offset = 0; for (i = 0; i < KK; i++) { input_image_offset = input_image_ptr; input_image_offset += col; kernel_offset = kernel_ptr; for (j = 0; j < KK; j++) { S1: temp1 = *input_image_offset++; S1: temp2 = *kernel_offset++; S1: *output_image_offset += temp1 * temp2; } kernel_ptr += KK; input_image_ptr += NN; } S2: *output_image_offset = ((*output_image_offset)/ normal_factor); output_image_offset++ ; col++; } output_image_ptr += NN; row++; } } (figure: sequential software, written in Fortran or C/C++, mapped onto parallel hardware: a multi-core & SIMD CPU and a many-GPU accelerator) Sequential Software Parallel Hardware Development Time Maintenance Cost Performance Tuning
  4. 4. spcl.inf.ethz.ch @spcl_eth 4 COSMO: Weather and Climate Model • 500,000 Lines of Fortran • 18,000 Loops • 19 Years of Knowledge • Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, …
  5. 5. spcl.inf.ethz.ch @spcl_eth COSMO – Climate Modeling 5 • Global (low-resolution model) • Up to 5000 nodes • Runs “monthly” Piz Daint, Lugano, Switzerland
  6. 6. spcl.inf.ethz.ch @spcl_eth COSMO – Weather Forecast 6 • Regional model • High-resolution • Runs “hourly” (20 instances in parallel) • Today: 40 Nodes * 8 GPU • Manual translation to GPUs 3 Year, Multi-person Project Can we automate this GPU mapping?
  7. 7. spcl.inf.ethz.ch @spcl_eth 7 The LLVM Compiler (figure): Frontends: Static Languages (C / C++, Fortran, Go / D / C# / …, including COSMO), Compute Languages (Julia (MatLab style)), Dynamic (JavaScript, Java). Targets: CPU (Intel / AMD, PowerPC, ARM / MIPS), GPU (NVIDIA, AMD / ARM), FPGA (Xilinx, Altera).
  8. 8. spcl.inf.ethz.ch @spcl_eth Polyhedral Model – In a nutshell. Program Code: for (i = 0; i <= N; i++) for (j = 0; j <= i; j++) S(i,j); Iteration Space (figure: triangular grid of statement instances for N = 4, bounded by 0 ≤ i, 0 ≤ j, j ≤ i, i ≤ N): D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i } Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser, Armin Groesslinger, and Christian Lengauer, in Parallel Processing Letters (PPL), April 2012
  9. 9. spcl.inf.ethz.ch @spcl_eth Static Control Parts - SCoPs • Structured Control • IF-conditions • Counted FOR-loops (Fortran style) • Multi-dimensional array accesses (and scalars) • Loop-conditions and IF-conditions are Presburger formulas • Loop increments are constant (non-parametric) • Array subscript expressions are piecewise-affine • Can be modeled precisely with Presburger Sets
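To make the SCoP restrictions above concrete, here is a minimal C sketch (illustrative only, not taken from COSMO): the first nest satisfies every condition and can be modeled exactly, the second cannot, because its subscript idx[i] is data-dependent and can therefore only be approximated.

    /* A SCoP: counted loops, affine bounds, affine conditions and subscripts. */
    for (int i = 0; i < N; i++)
      for (int j = 0; j <= i; j++)
        if (i + j < M)                       /* Presburger condition */
          B[i][j] = A[i][j] + A[i][j + 1];

    /* Not a SCoP: the subscript idx[i] is read from memory (non-affine),
       so the access can only be approximated. */
    for (int i = 0; i < N; i++)
      C[idx[i]] += A[i][0];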
  10. 10. spcl.inf.ethz.ch @spcl_eth Polyhedral Model of Static Control Part for (i = 0; i <= N; i++) for (j = 0; j <= i; j++) S: B[i][j] = A[i][j] + A[i][j+1]; • Iteration Space (Domain) 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑁 ∧ 0 ≤ 𝑗 ≤ 𝑖 • Schedule 𝜃𝑆 = { 𝑆 𝑖, 𝑗 → 𝑖, 𝑗 } • Access Relation • Reads: {𝑆 𝑖, 𝑗 → 𝐴 𝑖, 𝑗 ; 𝑆 𝑖, 𝑗 → 𝐴(𝑖, 𝑗 + 1)} • Writes: {𝑆 𝑖, 𝑗 → 𝐵 𝑖, 𝑗 } 10
  11. 11. spcl.inf.ethz.ch @spcl_eth Model 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑛 ∧ 0 ≤ 𝑗 ≤ 𝑖 𝜃𝑆 = {𝑆 𝑖, 𝑗 → 𝑖, 𝑗 } → 𝑖 4 , 𝑗, 𝑖 𝑚𝑜𝑑 4 Code for (i = 0; i <= n; i++) for (j = 0; j <= i; j++) S(i, j); Polyhedral Schedule: Original 11
  12. 12. spcl.inf.ethz.ch @spcl_eth Model 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑛 ∧ 0 ≤ 𝑗 ≤ 𝑖 𝜃𝑆 = {𝑆 𝑖, 𝑗 → 𝑖, 𝑗 } → 𝑖 4 , 𝑗, 𝑖 𝑚𝑜𝑑 4 Code for (c0 = 0; c0 <= n; c0++) for (c1 = 0; c1 <= c0; c1++) S(c0, c1); Polyhedral Schedule: Original 12
  13. 13. spcl.inf.ethz.ch @spcl_eth Model 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑛 ∧ 0 ≤ 𝑗 ≤ 𝑖 𝜃𝑆 = {𝑆 𝑖, 𝑗 → 𝑗, 𝑖 } → 𝑖 4 , 𝑗, 𝑖 𝑚𝑜𝑑 4 Code for (c0 = 0; c0 <= n; c0++) for (c1 = c0; c1 <= n; c1++) S(c1, c0); Polyhedral Schedule: Interchanged 13
  14. 14. spcl.inf.ethz.ch @spcl_eth Model 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑛 ∧ 0 ≤ 𝑗 ≤ 𝑖 𝜃𝑆 = {𝑆 𝑖, 𝑗 → 𝑖 4 , 𝑗, 𝑖 𝑚𝑜𝑑 4 } Code for (c0 = 0; c0 <= floord(n, 4); c0++) for (c1 = 0; c1 <= min(n, 4 * c0 + 3); c1++) for (c2 = max(0, -4 * c0 + c1); c2 <= min(3, n - 4 * c0); c2++) S(4 * c0 + c2, c1); Polyhedral Schedule: Strip-mined 14
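As a quick check of the strip-mined bounds above (a short derivation, not part of the slides): substituting i = 4·c0 + c2 and j = c1 into the original constraints 0 ≤ j ≤ i ≤ n, together with 0 ≤ c2 ≤ 3, gives c1 − 4·c0 ≤ c2 ≤ n − 4·c0, i.e. max(0, −4·c0 + c1) ≤ c2 ≤ min(3, n − 4·c0), which matches the generated innermost loop over c2.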
  15. 15. spcl.inf.ethz.ch @spcl_eth Model 𝐼𝑆 = 𝑆 𝑖, 𝑗 0 ≤ 𝑖 ≤ 𝑛 ∧ 0 ≤ 𝑗 ≤ 𝑖 𝜃𝑆 = {𝑆 𝑖, 𝑗 → 𝑖 4 , 𝑗 4 , 𝑖 𝑚𝑜𝑑 4, 𝑗 𝑚𝑜𝑑 4 } Code for (c0 = 0; c0 <= floord(n, 4); c0++) for (c1 = 0; c1 <= c0; c1++) for (c2 = 0; c2 <= min(3, n - 4 * c0); c2++) for (c3 = 0; c3 <= min(3, 4 * c0 - 4 * c1 + c2); c3++) S(4 * c0 + c2, 4 * c1 + c3); Polyhedral Schedule: Blocked 15
  16. 16. spcl.inf.ethz.ch @spcl_eth Mapping Computation to Device (figure: iteration space over i and j partitioned into device blocks & threads) 𝐵𝐼𝐷 = { (𝑖, 𝑗) → (⌊𝑖/4⌋ % 2, ⌊𝑗/3⌋ % 2) } 𝑇𝐼𝐷 = { (𝑖, 𝑗) → (𝑖 % 4, 𝑗 % 3) }
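Reading the mapping above as a 2×2 grid of blocks, each with 4×3 threads, that wraps around the iteration space, a minimal CUDA sketch of the resulting kernel body might look as follows (an illustration of the mapping only; the kernel name, array A, and bounds ni, nj are placeholders, not what Polly-ACC actually emits):

    __global__ void tile_kernel(int ni, int nj, float *A) {
      /* BID = (floor(i/4) % 2, floor(j/3) % 2), TID = (i % 4, j % 3):
         blocks wrap, so each block strides over every tile whose
         coordinates match its block id modulo the grid size. */
      for (int ti = blockIdx.x; ti * 4 < ni; ti += gridDim.x)
        for (int tj = blockIdx.y; tj * 3 < nj; tj += gridDim.y) {
          int i = ti * 4 + threadIdx.x;      /* i % 4 == threadIdx.x */
          int j = tj * 3 + threadIdx.y;      /* j % 3 == threadIdx.y */
          if (i < ni && j < nj)
            A[i * nj + j] += 1.0f;           /* placeholder for S(i, j) */
        }
    }
    /* launch: tile_kernel<<<dim3(2, 2), dim3(4, 3)>>>(ni, nj, dA); */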
  17. 17. spcl.inf.ethz.ch @spcl_eth 17 Polly-ACC: Architecture Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler at the International Conference on Supercomputing (ICS), June 2016, Istanbul
  18. 18. spcl.inf.ethz.ch @spcl_eth 18 Kernels to Programs – Data Transfers void heat(int n, float A[n], float hot, float cold) { float B[n] = {0}; for (int i = 0; i < n; i++) A[i] = cold; setCenter(n, A, hot, n/4); for (int t = 0; t < T; t++) { average(n, A, B); average(n, B, A); printf("Iteration %d done", t); } } OpenCL Kernel Host code With unknown side effects CUDA GPU CUDA GPU CUDA GPU CUDA GPU
  19. 19. spcl.inf.ethz.ch @spcl_eth 19 Data Transfer – Per Kernel Host Memory initialize() setCenter() average() average() average() D → 𝐻 D → 𝐻 𝐻 → 𝐷 𝐷 → 𝐻 time 𝐻 → 𝐷 𝐷 → 𝐻 𝐻 → 𝐷 𝐷 → 𝐻 Device Memory
  20. 20. spcl.inf.ethz.ch @spcl_eth 20 Data Transfer – Inter Kernel Caching Host Memory 𝐷 → 𝐻 Host Memory initialize() setCenter() average() average() average() time 𝐻 → 𝐷 Device Memory
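A hedged host-side sketch of the inter-kernel caching idea for the heat() example above, written against the standard CUDA runtime API (the kernel, grid sizes, and buffer names are placeholders, not the code Polly-ACC generates): the array is copied to the device once before the time loop and back once after it, because the printf in between never touches A or B; the per-kernel strategy would instead wrap every launch in its own H→D / D→H pair.

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void average_kernel(int n, const float *in, float *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   /* stand-in for average() */
      if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }

    void heat_cached(int n, int T, float *A) {
      float *dA, *dB;
      cudaMalloc(&dA, n * sizeof(float));
      cudaMalloc(&dB, n * sizeof(float));
      /* One H->D transfer before the loop, one D->H after it: the data
         stays cached on the device across all kernel launches. */
      cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);
      for (int t = 0; t < T; t++) {
        average_kernel<<<(n + 255) / 256, 256>>>(n, dA, dB);
        average_kernel<<<(n + 255) / 256, 256>>>(n, dB, dA);
        printf("Iteration %d done\n", t);   /* host code, but touches no array */
      }
      cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(dA);
      cudaFree(dB);
    }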
  21. 21. spcl.inf.ethz.ch @spcl_eth 21 Evaluation – Polly ACC Workstation: 10-core SandyBridge, NVIDIA Titan Black (Kepler)
  22. 22. spcl.inf.ethz.ch @spcl_eth Lattice Boltzmann (SPEC 2006), Workstation (chart: runtime with icc, icc -openmp, clang, and Polly ACC) 2x speedup vs. multi-thread CPU • 4x speedup vs. single-thread CPU
  23. 23. spcl.inf.ethz.ch @spcl_eth 23 Cactus ADM (SPEC 2006) - Performance
  24. 24. spcl.inf.ethz.ch @spcl_eth 24 Statistics - COSMO • Number of Loops • 18,093 Total • 9,760 Static Control Loops (Modeled precisely by Polly) • 15,245 Non-Affine Memory Accesses (Approximated by Polly) • 11,154 Loops after precise modeling, fewer e.g. due to: infeasible assumptions taken, or modeling timeouts • Largest set of loops: 72 loops • Reasons why loops cannot be modeled • Function calls with side-effects • Uncomputable loop bounds (data-dependent loop bounds?) Siddharth Bhat
  25. 25. spcl.inf.ethz.ch @spcl_eth Radiation Computation in COSMO (call graph): init_radiation • organize_radiation • fesft • opt_th • opt_so • inv_th • coe_th • inv_so • coe_so. Hot Functions: Must be inlined and interchanged. Compute kernels in all functions.
  26. 26. spcl.inf.ethz.ch @spcl_eth Interprocedural Loop Interchange for GPU Execution (inv_th) 26 #ifdef _OPENACC !$acc parallel !$acc loop gang vector DO j1 = ki1sc, ki1ec CALL coe_th_gpu(pduh2oc (j1, ki3sc), pduh2of(j1, ki3sc), pduco2(j1, ki3sc), pduo3(j1, ki3sc), …, pa2f(j1), pa3c(j1), pa3f(j1)) ENDDO !$acc end parallel #else CALL coe_th (pduh2oc, pduh2of, pduco2, pduo3, palogp, palogt, podsc, podsf, podac, podaf, …, pa3c, pa3f) #endif Pulled out parallel loop for OpenACC Annotations
  27. 27. spcl.inf.ethz.ch @spcl_eth Optical Effect on Solar Layer (inv_th) DO j3 = ki3sc+1, ki3ec CALL coe_th (j3) { ! Determine effect of the layer in *coe_th* ! Optical depth of gases DO j1 = ki1sc, ki1ec … IF (kco2 /= 0) THEN zodgf = zodgf + pduco2(j1 ,j3)* (cobi(kco2,kspec,2)* EXP ( coali(kco2,kspec,2) * palogp(j1 ,j3) -cobti(kco2,kspec,2) * palogt(j1 ,j3))) ENDIF … zeps=SQRT(zodgf*zodgf) … ENDDO } DO j1 = ki1sc, ki1ec ! Set RHS … ENDDO DO j1 = ki1sc, ki1ec ! Elimination and storage of utility variables … ENDDO ENDDO ! End of vertical loop over layers Outer loop is sequential Inner loop is parallel Sequential Dependences Inner loop is parallel Inner loop is parallel
  28. 28. spcl.inf.ethz.ch @spcl_eth Optical Effect on Solar Layer – After interchange 28 !> Turn loop structure with multiple ip loops inside a !> single k loop into perfectly nested k-ip loop on GPU. #ifdef _OPENACC !$acc parallel !$acc loop gang vector DO j1 = ki1sc, ki1ec !$acc loop seq DO j3 = ki3sc+1, ki3ec ! Loop over vertical ! Determine effects of layer in *coe_so* CALL coe_so_gpu(pduh2oc (j1,j3) , pduh2of (j1,j3) , …, pa4c(j1), pa4f(j1), pa5c(j1), pa5f(j1)) ! Elimination … ztd1 = 1.0_dp/(1.0_dp-pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1)+pcc2(j1,j3)*ztu8(j1,j3-1))) ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3)+ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3) ENDDO END DO ! Vertical loop !$acc end parallel Inner loop is sequential Outer loop is parallel
  29. 29. spcl.inf.ethz.ch @spcl_eth Live-Range Reordering (IMPACT’16, Verdoolaege et al.) (figure: loop nests with sequential/parallel dimensions before and after reordering) Privatization needed for parallel execution • False dependences prevent interchange • Scalable Scheduling
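A small C illustration of the issue (my own example, not from the paper): the shared scalar tmp creates false output and anti dependences between all iterations, which both serializes the loops and blocks interchange; privatizing tmp per iteration (or, equivalently, letting the scheduler reorder its live ranges) removes the obstacle without changing any real data flow.

    /* Before: false dependences through the shared scalar tmp. */
    void scale(int nk, int ni, const float *a, const float *b, float *c) {
      float tmp;
      for (int k = 0; k < nk; k++)
        for (int i = 0; i < ni; i++) {
          tmp = a[k * ni + i] * b[k];          /* every iteration writes tmp */
          c[k * ni + i] = tmp + tmp * tmp;
        }
    }

    /* After privatization: the loops can be interchanged and the new
       outer i-loop is parallel. */
    void scale_privatized(int nk, int ni, const float *a, const float *b, float *c) {
      for (int i = 0; i < ni; i++)
        for (int k = 0; k < nk; k++) {
          float tmp = a[k * ni + i] * b[k];    /* tmp is private to the iteration */
          c[k * ni + i] = tmp + tmp * tmp;
        }
    }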
  30. 30. spcl.inf.ethz.ch @spcl_eth 30 Polly-ACC: Architecture Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler at the International Conference on Supercomputing (ICS), June 2016, Istanbul
  31. 31. spcl.inf.ethz.ch @spcl_eth 31 Polly-ACC: Architecture Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler at the International Conference on Supercomputing (ICS), June 2016, Istanbul Intrinsics to model Multi-dimensional strided arrays Better ways to link with NVIDIA libdevice Scalable Modeling Scalable Scheduling Unified Memory OpenCL + SPIR-V Backend
  32. 32. spcl.inf.ethz.ch @spcl_eth 32 Memory on CPU + GPU Hybrid Machine System DRAM System GDDR5 Automatic
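A minimal sketch of the unified-memory mechanism shown above, assuming the standard CUDA runtime API (illustrative only; this shows the mechanism, not Polly-ACC's generated code): a single managed allocation is valid on both the CPU and the GPU, and pages migrate automatically, so the compiler no longer needs to prove where explicit transfers are required.

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale_kernel(int n, float *A) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) A[i] *= 0.5f;
    }

    int main(void) {
      int n = 1 << 20;
      float *A;
      cudaMallocManaged(&A, n * sizeof(float));     /* one pointer, CPU and GPU */
      for (int i = 0; i < n; i++) A[i] = 1.0f;      /* host write, no explicit H->D copy */
      scale_kernel<<<(n + 255) / 256, 256>>>(n, A); /* device access, pages migrate */
      cudaDeviceSynchronize();
      printf("%f\n", A[0]);                         /* host read, no explicit D->H copy */
      cudaFree(A);
      return 0;
    }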
  33. 33. spcl.inf.ethz.ch @spcl_eth Performance (chart, log scale: COSMO runtime with Dragonegg + LLVM (CPU only), Cray (CPU only), Polly-ACC (P100), Manual OpenACC (P100); annotated speedups: 5x, 4.3x, 22x) All important loop transformations performed. Headroom: - Kernel compilation (1.5s) - Register usage (2x) - Block-size tuning - Unified-memory overhead?
  34. 34. spcl.inf.ethz.ch @spcl_eth Per-Kernel Performance (% | Total | Calls | Time-per-call | Kernel): 29.98% 1.22414s 939 1.0221ms FUNC___radiation_rg_MOD_inv_th_SCOP_0_KERNEL_0 • 19.19% 783.48ms 580 1.1786ms FUNC___radiation_rg_MOD_inv_so_SCOP_0_KERNEL_1 • 8.50% 347.10ms 140 146.62us FUNC___radiation_rg_MOD_fesft_dp_SCOP_11_KERNEL_0 • ... ~ 50 more. Per-Kernel Time is short • Many Small Kernels • Still way longer than OpenACC kernels
  35. 35. spcl.inf.ethz.ch @spcl_eth Correct Types for Loop Transformations 35 Maximilian Falkenstein for (int32 i = 1; i < N; i++) for (int32 j = 1; j <= M; j++) A(i,j) = A(i-1,j) + A(i,j-1) j i
  36. 36. spcl.inf.ethz.ch @spcl_eth Correct Types for Loop Transformations 36 Maximilian Falkenstein for (intX c = 2; c < N+M; c++) #pragma simd for (intX i = max(1, c-M); i <= min(N, c-1); i++) A(i,c-i) = A(i-1,c-i) + A(i,c-i-1) for (int32 i = 1; i < N; i++) for (int32 j = 1; j <= M; j++) A(i,j) = A(i-1,j) + A(i,j-1) j i i + j
  37. 37. spcl.inf.ethz.ch @spcl_eth Correct Types for Loop Transformations 37 Maximilian Falkenstein for (intX c = 2; c < N+M; c++) #pragma simd for (intX i = max(1, c-M); i <= min(N, c-1); i++) A(i,c-i) = A(i-1,c-i) + A(i,c-i-1) for (int32 i = 1; i < N; i++) for (int32 j = 1; j <= M; j++) A(i,j) = A(i-1,j) + A(i,j-1) j i i + j What is X? N + M larger than 32 bit TODAY • Use 64-bit • Hope it’s enough
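A concrete instance of the problem (my own numbers, not from the slides): both loop bounds fit comfortably into 32 bits, yet the skewed iterator c = i + j runs up to N + M, which does not, so keeping int32 for c would overflow.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
      int32_t N = INT32_MAX, M = INT32_MAX;   /* each bound fits in 32 bits ... */
      int64_t c_max = (int64_t)N + M;         /* ... but c = i + j can reach N + M,
                                                 4294967294, which needs 33 bits */
      printf("max c = %lld, INT32_MAX = %d\n", (long long)c_max, INT32_MAX);
      return 0;
    }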
  38. 38. spcl.inf.ethz.ch @spcl_eth What type would be optimal? 38 Server or Workstation CPU: 64-bit • Embedded or HPC GPU: 32-bit • Embedded CPU: 32/16 bit • FPGA: minimal (Today → Tomorrow; COSMO) Can we always get 32 bit types?
  39. 39. spcl.inf.ethz.ch @spcl_eth Precise Solution 39 for (intX c = 2; c < N+M; c++) #pragma simd for (intX i = max(1, c-M); i <= min(N, c-1); i++) A(i, c-i) = A(i-1, c-i) + A(i, c-i-1) (figure: expression tree for (c - i) - 1) Domain: { (c) : 2 <= c < N + M ∧ INT_MIN <= N, M <= INT_MAX } f0() = c - i f1() = c - i - 1 1) calc: min(fX()), max(fX()) under Domain 2) choose type accordingly
  40. 40. spcl.inf.ethz.ch @spcl_eth 40 ILP Solver • Minimal Types • Potentially Costly Approximations* • s(a+b) ≤ max(s(a), s(b)) + 1 • Good, if smaller than native type * Earlier uses in GCC and Polly Preconditions • Assume values fit into 32 bit • Derive required pre-conditions + - c i 1
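A hedged sketch of the cheap approximation named on the slide (my illustration of the rule, not Polly's implementation): bit-widths are propagated bottom-up through the expression tree, an addition or subtraction needing at most one bit more than its wider operand, and the exact ILP-based min/max computation is only consulted when this bound exceeds the native type.

    /* s(a + b) <= max(s(a), s(b)) + 1: an add/sub needs at most one bit
       more than its wider operand. */
    static int approx_width(int width_a, int width_b) {
      return (width_a > width_b ? width_a : width_b) + 1;
    }

    /* Example for c - i - 1 with 32-bit c and i:
       approx_width(32, 32) = 33, approx_width(33, 2) = 34.
       34 exceeds the native 32-bit type, so here the precise
       min/max analysis over the iteration domain decides the type. */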
  41. 41. spcl.inf.ethz.ch @spcl_eth Type Distribution for LNT SCOPS 41 32 + epsilon is almost always enough!
  42. 42. spcl.inf.ethz.ch @spcl_eth Compile Time Overhead (chart: GPU code generation time for 5000 lines of code with No Types, Solver, Solver + Approx, Solver + Approx (8 bit)) Less than 10% overhead vs. no types.
  43. 43. spcl.inf.ethz.ch @spcl_eth Automatic Compilation to FPGA? • Automatic Translation: Floating Point to Fixed Point • COSMO mostly floating point (single precision / double precision) • SPIR-V flow to Xilinx HLS Tools • Translate LLVM-IR directly to Verilog • How to get cache coherence • Visited Xilinx: Could share some of their software toolchain for Enzian • How to schedule kernels • Partial reconfiguration • Data Caching • Can we keep data in BRAM? • Kernel Size • Can we reduce the size of all kernels? • …
  44. 44. spcl.inf.ethz.ch @spcl_eth Conclusion 44 Optimal & Correct Types Automatic Unified Memory Transfers Complex Loop Transformations Hybrid Mapping
