
CFD on Power

Computational Fluid Dynamics on Power



  1. Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms. OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain, June 2018. Samuel Antao, IBM Research, Daresbury, UK. © 2018 IBM Corporation.
  2. IBM Research @ Daresbury, UK
  3. IBM Research @ Daresbury, UK – STFC Partnership Mission
     • 2015: £313 million investment over the next 5 years
     • Agreement for IBM Collaborative Research and Development (R&D) that established an IBM Research presence in the UK
     • Product and Services Agreement with IBM UK and Ireland
     • Access to the latest data-centric and cognitive computing technologies, including IBM's world-class Watson cognitive computing platform
     • Joint commercialization of intellectual property assets produced in the partnership
     Mission: help UK industries and institutions bring cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost the growth and development of the UK economy
  4. IBM Research @ Daresbury, UK – People: over 26 computational scientists and engineers
  5. IBM Research @ Daresbury, UK – Research areas
     • Case studies:
       – Smart Crop Protection / Precision Agriculture (data science + life sciences)
       – Improving disease diagnostics and personalised treatments (life sciences + machine learning)
       – Cognitive treatment plant (engineering + machine learning)
       – Parameterisation of engineering models (engineering + machine learning)
  6. Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms.
  7. CFD and Algebraic Multigrid
     • Solve a set of partial differential equations over several time steps
       – Discretization: unstructured vs structured
       – Equations: velocity, pressure, turbulence
     • Iterative solvers: Jacobi, Gauss-Seidel, Conjugate Gradient
     • Multigrid approaches: solve the problem at different resolutions
       – Coarse and fine grids/meshes
       – Fewer iterations needed on the fine grids
     • Algebraic multigrid (AMG): encode the mesh information in algebraic form, as sparse matrices
     (source: http://web.utk.edu/~wfeng1/research.html)
  8. CFD and Algebraic Multigrid
     • Grid partitioned across MPI ranks; ranks distributed over the nodes (NVLink within a node, InfiniBand between nodes)
     • More than one rank can execute on one node
     • Challenges:
       – Different grids have different compute needs
       – Access strides vary; unstructured data accesses
       – CPU-GPU data movements
       – Regular communication between ranks: halo elements, residuals, synchronizations
  9. CFD and Algebraic Multigrid – Code Saturne
     • Open source, developed and maintained by EDF
     • 350K lines of code: 50% C, 37% Fortran, 13% Python
     • Rich ecosystem to configure/parameterise simulations and generate meshes
     • History of good scalability on MIRA (BG/Q):

       105B-cell mesh    Cores       Time in Solver   Efficiency
                         262,144     789.79 s         -
                         524,288     403.18 s         97%

       13B-cell mesh     MPI Tasks   Time in Solver   Efficiency
                         524,288     70.114 s         -
                         1,048,576   52.574 s         66%
                         1,572,864   45.731 s         76%
  10. CFD and Algebraic Multigrid – execution time distribution
     • Many components (kernels) contribute to the total execution time
     • There are data dependencies between consecutive kernels
     • There are opportunities to keep data in the device between kernels
     • Even a kernel with lower compute intensity can be worth computing on the GPU if its data is already there
     (single-thread profile of Code Saturne 5.0+, pressure (AMG) phase: Gauss-Seidel solver (velocity), matrix-vector mult. MSR and CSR, dot products, multigrid setup, computing coarse cells from fine cells, other AMG-related work)
  11. Directive-based programming models
     • Porting existing code to accelerators is time consuming
     • The sooner the code runs on the GPU, the sooner you can start:
       – learning where the overheads are
       – identifying what data patterns are being used
       – spotting kernels performing poorly
       – making decisions on what strategies can improve performance
     • Directive-based programming models get you started much quicker:
       – No need to manage device memory allocation and data pointers by hand
       – Implementation defaults already exploit device features
       – Easily create data environments where data resides in the GPU
       – Improve code portability
     • Clang C/C++ and the IBM XL C/C++/Fortran compilers provide OpenMP 4.5 support
     • The PGI C/C++/Fortran compiler provides OpenACC support
     • Can be complemented with existing GPU-accelerated libraries: cuSPARSE, AMGX
  12. Directive-based programming models
     • OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient:

       static cs_sles_convergence_state_t _conjugate_gradient(/* ... */)
       {
         /* Move the result vector to the device and copy it back at the end of
            the scope; move the right-hand side vector to the device; allocate
            all auxiliary vectors on the device. */
         #pragma omp target data if(n_rows > GPU_THRESHOLD) \
             map(tofrom: vx[:vx_size])                      \
             map(to: rhs[:n_rows])                          \
             map(alloc: _aux_vectors[:tmp_size])
         {
           /* Solver code */
         }
       }

     • The right-hand side does not change during the computation of a level, so it can be copied to the device at the beginning of the level; the result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchanges
     • OpenMP 4.5 makes managing the data this way almost trivial: a single directive suffices to set the scope
     • All arrays reside in the device within this scope; the programming model manages the host/device pointer mapping for you
  13. Directive-based programming models
     • OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – dot products (from a GPU port of two streaming kernels: vector multiply-and-add and dot product):

       static void _cs_dot_xx_xy_superblock(cs_lnum_t n,
                                            const cs_real_t *restrict x,
                                            const cs_real_t *restrict y,
                                            double *xx,
                                            double *xy)
       {
         double dot_xx = 0.0, dot_xy = 0.0;

         #pragma omp target teams distribute parallel for   \
             reduction(+: dot_xx, dot_xy)                   \
             if(n > GPU_THRESHOLD)                          \
             map(to: x[:n], y[:n]) map(tofrom: dot_xx, dot_xy)
         for (cs_lnum_t i = 0; i < n; ++i) {
           const double tx = x[i];
           const double ty = y[i];
           dot_xx += tx*tx;
           dot_xy += tx*ty;
         }

         *xx = dot_xx;
         *xy = dot_xy;
       }

     • CUDA blocks map onto the OpenMP team; the OpenMP runtime library allocates the mapped data in the device before the region and releases it afterwards
  14. Directive-based programming models
     • OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
     • Observations from the AMG cycle and the AMG coarse-grid detail: allocations of small variables, high kernel launch latency, back-to-back kernels
  15. CUDA-based tuning
     • Avoid expensive GPU memory allocation/deallocation: allocate a memory chunk once and reuse it
     • Use pinned memory for data copied frequently to the GPU: avoids pageable-to-pinned memory copies by the CUDA implementation
     • Explore asynchronous execution of CUDA API calls: start copying data to/from the device while the host prepares the next set of data or the next kernel
     • Use CUDA constant memory to copy arguments for multiple kernels at once
       – The latency of copying tens of KB to the GPU is similar to that of copying 1 B
       – Double-buffering enables the copies to happen asynchronously
     • Produce specialized kernels instead of relying on runtime checks
       – CUDA is a C++ extension, so kernels and device functions can be templated
       – Leverage compile-time optimizations for the relevant sequences of kernels
       – The NVCC toolchain does very aggressive inlining
       – Lower register pressure = more occupancy

       template <KernelKinds Kind>
       __device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
         switch (Kind) {
         /* ... */
         case DP_xx: /* dot product */
           dot_product<Kind>(
             /* version          */ Arg.getArg<cs_lnum_t>(0),
             /* n_rows           */ Arg.getArg<cs_lnum_t>(1),
             /* x                */ Arg.getArg<cs_real_t *>(2),
             /* y                */ nullptr,
             /* z                */ nullptr,
             /* res              */ Arg.getArg<cs_real_t *>(3),
             /* n_rows_per_block */ n_rows_per_block);
           break;
         /* ... */
         }
         __syncthreads();
         return 0;
       }

       template <KernelKinds... Kinds>
       __global__ void any_kernels(void) {
         auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
         const unsigned n_rows_per_block = KA->RowsPerBlock;
         unsigned idx = 0;
         int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
         (void)dummy;
       }

       Listing 10: Device entry-point function for kernel execution.
  16. CUDA-based tuning
     • CUDA – Code Saturne 5.0+ – results for a single rank – IBM Minsky server, lid-driven cavity flow, 1.5M-cell grid:

       OpenMP threads                1       2       4       8
       Wall time CPU (s)         57.21   43.83   34.77   30.28
       Solver time CPU (s)       49.86   37.37   29.67   25.63
       Wall time CPU+GPU (s)     11.87   10.83    9.55    9.32
       Solver time CPU+GPU (s)    4.41    4.34    4.40    4.63

       GPU speedup, wall time     4.82x   4.05x   3.64x   3.25x
       GPU speedup, solvers      11.29x   8.60x   6.74x   5.53x
  17. CUDA-based tuning
     • CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank – IBM Minsky server, lid-driven cavity flow, 1.5M-cell grid (timeline highlights: Gauss-Seidel, AMG fine grid)
  18. CUDA-based tuning
     • CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank (cont.) – IBM Minsky server, lid-driven cavity flow, 1.5M-cell grid (timeline highlight: AMG coarse grid)
  19. MPI and GPU acceleration
     • Different processes (MPI ranks) use different CUDA contexts
     • The CUDA implementation serializes CUDA contexts by default
     • NVIDIA Multi-Process Service (MPS) provides context-switching capabilities so that multiple processes can share the same GPU
     • Workflow: each rank defines its visible GPU; one rank starts the MPS server instance in front of the GPU driver; all ranks execute the application; the MPS server is terminated at the end
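The workflow in the diagram maps onto the standard MPS control commands. A sketch of a per-node launch script, assuming a generic `mpirun`; the binary name and pipe/log directories are placeholders for whatever the site scheduler provides:

```shell
#!/bin/sh
# Pin the ranks on this node to the same GPU and route them through
# one MPS server so their CUDA contexts can overlap on the device.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the MPS control daemon (once per node)...
nvidia-cuda-mps-control -d

# ...execute the application with several ranks sharing the GPU...
mpirun -np 5 ./cs_solver

# ...and terminate the MPS server afterwards.
echo quit | nvidia-cuda-mps-control
```

On a multi-GPU node like Minsky, the scheduler would typically set a different `CUDA_VISIBLE_DEVICES` (and pipe directory) per group of ranks so each GPU gets its own MPS server.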
  20. MPI and GPU acceleration
     • CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU) – IBM Minsky server, lid-driven cavity flow, 111M-cell grid (timeline highlights: Gauss-Seidel, hiding data-movement latencies)
  21. MPI and GPU acceleration
     • CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.) – IBM Minsky server, lid-driven cavity flow, 111M-cell grid (timeline highlight: AMG coarse grid)
  22. MPI and GPU acceleration
     • CUDA – Code Saturne 5.0+ – results for multiple ranks (5 ranks per GPU – cont.) – IBM Minsky server, lid-driven cavity flow, 111M-cell grid
     • CPU+GPU efficiency: 65% @ 32 nodes

       Nodes                         1      2      4      8     16     32
       CPU wall time (s)         717.6  369.9  187.4  100.2   54.9   28.9
       CPU solvers time (s)      693.9  358.0  181.5   97.2   53.4   28.1
       CPU+GPU wall time (s)     300.4  153.1   80.8   45.2   26.4   14.4
       CPU+GPU solvers time (s)  274.7  139.6   74.0   41.1   24.3   13.4

       Speedup over CPU-only:
       Wall time                  2.39   2.42   2.32   2.22   2.08   2.00
       Solvers time               2.53   2.57   2.45   2.37   2.20   2.10
  23. POWER8 to POWER9
     • A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
       – No major code refactoring needed
       – More powerful CPUs, GPUs and interconnect (e.g. the ORNL Summit socket, 2 sockets per node, with NVLink between CPU and GPUs)
     • Some differences to consider:
       – Cores vs pairs of cores: the POWER9 L3 cache and store queue are shared by each pair of cores; SMT4 per core, or SMT8 per pair of cores
       – V100 (Volta) drops lock-step execution across warp threads:
         • One program counter per thread
         • If code assumes lock-step execution, explicit barriers have to be inserted
         • No guarantee that threads will reconverge after divergence within a warp
         • One has to leverage cooperative groups and thread-activity masks:

       for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
         // Depending on the number of rows, warps may diverge here.
         unsigned AM = __activemask();
         ...
         for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
           sii += __shfl_down_sync(AM, sii, kk, bdimx);
         ...
       }
  24. POWER8 to POWER9
     • CUDA – Code Saturne 5.0+ – results for multiple ranks (3 ranks per GPU, 6 GPUs per node) – IBM POWER9 with NVLink 2.0 (Summit), lid-driven cavity flow, 889M-cell grid
     • CPU+GPU efficiency: 76% @ 512 nodes

       Nodes                     64    256    512
       CPU wall time (s)      74.73  21.04  11.16
       CPU+GPU wall time (s)   32.3   7.25   4.76
       Speedup (wall time)    2.31x  2.90x  2.34x

     • POWER9 vs POWER8: better efficiency when scaling to 16x more nodes for an 8x larger problem
  25. CFD and AI – Cognitive Enhanced Design
     • Designing/prototyping new pieces of equipment is expensive in time and money:
       – Parameter sweeps need several expensive simulations
       – Want to make decisions faster, and on more complex problems
     • Use cognitive techniques (e.g. Bayesian neural networks) to build a model over a parameterized space that relates design parameters to performance; use this model in Bayesian optimization to improve the design and converge to optimal parameters more quickly
     • Example: airfoil optimization (lift/drag maximization) – adaptive Expected Improvement (EI) converges faster and with less variance
  26. CFD and AI – Enhanced 3D-Feature Detection
     • A typical bottleneck of the design process is analysing the output of the simulation-led workflow, especially identifying flow features such as separation, swirl and layering
     • Extend AI (deep-feature detection) techniques to automatically extract features in 3D:
       – Remove analysis bottlenecks
       – Semantic querying of simulation data
       – Contextual event classification
       – Computational steering for rare-event simulation
     • Example: racing-car vortex detection via AI-enabled feature detection
  28. Questions? samuel.antao@ibm.com
