ARAEL Performance

ARAEL is an FVM-based solver for transient incompressible flows that works with OpenFOAM.

  • Comment: "The next goal is to replace the diagonal preconditioner with an Algebraic Multigrid preconditioner to further accelerate the whole solver on the GPU."

    ARAEL Performance: Presentation Transcript

    • ARAEL – a GPU-based solver for 3D incompressible transient and steady-state flows. T. Tomczak, K. Zadarnowska, Z. Koza, M. Matyka, A. Kosior, J. Pola, L. Miroslaw. Vratis, University of Wroclaw, Wroclaw University of Technology.
    • Introduction (outline)
      – Motivation.
      – Attempt I: Replace linear system solvers with their GPU implementations.
      – Attempt II: Port the whole CFD solver to GPU.
      – Proposed Solution.
      – Validation and Profiling.
      – Runtime studies.
      – Conclusions.
      In collaboration with: (partner logos)
    • Motivation¡ CFD simulations take time. Sometimes too much time.¡ CFD solvers are more communication- bound than CPU-bound. ¡ Moore’s Law does not hold for CPU anymore. Peak performance for GPU gets doubled each two years. Fig. GPU and CPU Double Precision Performance. Source GPU modeling blog. ¡ CUDA provides massive parallelization but also imposes constraints on software to exploit hardware efficiency.
    • Proposed Solution
      – Solution II: port the complete finite volume SIMPLE and PISO solvers to the GPU.
      – Biomedical simulations as the first target:
        – Only a limited number of boundary conditions is needed: fixed gradient, time-varying inlet conditions.
        – Laminar flows.
        – Support for 3D unstructured meshes in widely accepted formats such as OpenFOAM or VTK.
      Fig.: a) WSS profile (WSS [Pa] vs. y [cm], LBM vs. FVM), b) LBM, c) FVM.
    • 2 years later…
      – PISO (Pressure Implicit with Split Operator): transient solver for laminar incompressible flow.
      – SIMPLE (Semi-Implicit Method for Pressure-Linked Equations): steady-state solver for laminar incompressible flow.
      – Jacobi preconditioner.
      – Boundary conditions on p, U: zero gradient, no-slip condition (fixed value), time-varying inlet conditions.
      – Easy to use: from the command line or via the API.
      Diagram: OpenFOAM case (geometry, boundary conditions, dictionaries) -> OpenFOAM-to-ARAEL converter; from the command line: gico, gsimple.
    • Main idea
      – Each thread analyzes exactly one cell to ensure massive and scalable parallelization.
      – A special storage format provides coalesced access to the GPU memory (a kernel sketch using this kind of layout follows after this slide).
      – Extensive use of fast shared memory.
      – Different paradigm = different implementation: high number of cores (SMs) with no synchronization between them, limited shared memory (kB), and more.
      Background (mesh: space, time and flow equations; text from the accompanying paper): Space is divided into a mesh of nc cells. Cells are polyhedrons with flat faces, and each face belongs to exactly two polyhedrons or is a boundary face. Pressure and velocity are defined at the centroids of the cells. Since the partial differential equations are local in space and time, their discretization leads to nonlinear algebraic equations relating the velocity and pressure at each polyhedron to their values at the adjacent cells only. After linearization, these equations reduce to a linear system
        \hat{A}\,\vec{x} = \vec{b},   (1)
      where \vec{x} and \vec{b} are vectors of length nc and \hat{A} is a sparse matrix such that A_{ij} \neq 0 if and only if cells i and j have a common face or i = j. The value of A_{ij} can depend on the current and previous values of the pressure and velocity at i and j, as well as on face-specific parameters (e.g. the face area) that can be stored in auxiliary arrays with the same sparsity pattern as \hat{A}. The matrix \hat{A} must be assembled many times and then used in a linear solver; during these operations it is accessed by rows or by columns, as in sparse matrix-vector products (SMVP) and sparse transposed matrix-vector products (STMVP). Since the highest priority must be granted to SMVP, \hat{A} must be stored in a way that enables a highly optimized SMVP and a reasonably efficient STMVP. Several formats designed for efficient SMVP kernels on modern GPUs were investigated by Bell and Garland [2]. As in the ELL format, matrix elements from the first set are stored in two arrays in GPU memory: vals and indices.
      Figure 1: Example of the construction of the vals and indices arrays for matrix elements from the first set, k = 2, nc = 4 (2D mesh of four cells V0-V3; vals = [A01, A10, A20, A31 | A02, A13, A23, A32], indices = [1, 0, 4, 5 | 2, 3, 7, 6], grouped into chunk 0 and chunk 1).
    • Test cases
      Fig.: Test cases. cav* refers to steady-state flow in a 3D lid-driven cavity, tp1M refers to transient 2D Poiseuille flow, and lca refers to steady-state flow in the left coronary artery (geometry provided by Dr. Vartan Kurtcouglu from ETH Zurich).
    • Validation
      Fig.: a)-c) geometries and boundary conditions of the three test cases (lid-driven cavity with ulid, channel with inlet velocity uin and p = 0, arterial cross-section); d)-f) velocity components u, v, w [m/s] along x [m], y [m], and the position along the cross-section [mm] at several times (t = 0.002 to 0.5); g)-i) the corresponding pressure profiles [Pa].
    • Convergence
      Fig.: Convergence of the CPU and GPU solvers: residuals for p and u (10^0 down to 10^-6) vs. iteration n (up to 350) for CPU/Jacobi, CPU/GAMG, and GPU.
    • Performance
      6-core, 12-thread Intel Xeon X5670 vs. Tesla C2070; CPU linear solvers: CG/BiCG with Jacobi vs. GAMG.
      Fig.: Acceleration of the GPU relative to the CPU (OpenFOAM) implementations of the SIMPLE and PISO solvers (GPU vs. CPU acceleration up to 4.5 in panel (a) and 1.6 in panel (b); problem size from 10^3 to 10^7 cells; lca0.5M and tp1M marked). In all cases the linear solvers used by the GPU were Jacobi-preconditioned BiCGStab and CG. The CPU implementation used BiCG and CG with a Jacobi preconditioner (a) or GAMG with the FDIC preconditioner (b).
    • Performance
      Lid-driven cavity, Intel Xeon X5670 (12 processes) vs. Tesla C2070 (2012); CPU linear solvers: CG/BiCG with DIC/DILU.
      Fig.: Acceleration of the GPU relative to the CPU (OpenFOAM) implementations of the SIMPLE and PISO solvers (acceleration up to about 4; problem size from 10^3 to 1.2*10^7 cells). In all cases the linear solvers used by the GPU were Jacobi-preconditioned BiCGStab and CG. The CPU implementation used BiCG and CG with DILU and DIC preconditioners.
    • Transient solver for incompressible fluids 
      Compared against an Intel Xeon E5620 2.4 GHz with DDR3 PC3-1066 MHz memory (2011). Time needed for the computation of 20 time steps; series: Xeon E5620 (1 core, DIC), Xeon E5620 (4 cores, DIC), Tesla C2070.
      Fig.: Time (s) required to perform 20 iterations for different resolutions of the cavity3D OpenFOAM case (icoFoam), from 10^3 to 9*10^6 cells; logarithmic scale.
    • Profiling
      Table 2: Cumulative number of pressure solver iterations: Jacobi-preconditioned CG (GPU and CPU) and GAMG (CPU).

      Case          GPU-CG       CPU-CG     CPU-GAMG
      SIMPLE
      cav10^3        2 468        2 474          269
      cav47^3      122 216      103 321        3 699
      cav100^3     635 086      629 182       14 721
      cav181^3   1 424 531    1 618 634       29 191
      cav223^3   1 733 366    1 766 484       27 220
      lca0.5M      157 176      159 959        1 704
      PISO
      cav10^3      104 972      110 614       10 507
      cav47^3      575 153      568 388       16 549
      cav100^3   1 255 664    1 352 614       25 257
      cav181^3   2 726 289    3 010 947       43 760
      cav223^3   3 705 806    3 716 480      100 279
      lca0.5M    1 770 672    1 830 437        9 875
      tp1M      24 117 157   27 042 577       34 589   (about x700 more CG iterations than GAMG!)

      Fig.: Cumulative number of pressure solver iterations (left, Table 2). Figure 4 (right): percentage of the PISO/SIMPLE solver time spent in the CG and BiCGStab linear solvers during a full PISO (P) and SIMPLE (S) solver execution, for the cav*, tp1M, and lca0.5M cases.
    • 1st Level Interface
      – 1st step: convert your OpenFOAM case to our internal format.
      – GPL-based converters for each solver type.
      – Features:
        – Results saved in the ./arael directory: mesh, solver settings, and initial p, U, phi field values.
        – Extensive error checking, e.g.: wrong solver, unsupported boundary conditions, unsupported linear solver or preconditioner, overwriting previous results.
    • 1st Level Interface
      – 2nd step: run gico or gsimple.
      – Features:
        – Reads data from the converters.
        – Calculates a given number of solver iterations.
        – Saves the final results in OpenFOAM text format: p, U, phi fields in the directory ./999999999.
    • Example output from gico.
    • 2nd Level ¡ Low Level API for calling ARAEL from C/Fortran/Python. ¡ Typical scenario:
 
 Create solver object   Prepare data (mesh and initial field values)   Set solver parameters   Run required number of time loop iterations   Get computed data from GPU   Destroy solver object 17
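      The slides do not show the actual API, so the sketch below only spells out the six-step scenario in C host code; every identifier (arael_solver_t, arael_piso_create, arael_load_case, arael_set_double, arael_run, arael_write_results, arael_destroy) and the header arael.h are hypothetical placeholders, not ARAEL's real functions.

        /* Hedged sketch only: all ARAEL identifiers below are hypothetical. */
        #include <stdio.h>
        #include "arael.h"                      /* hypothetical ARAEL header */

        int main(void)
        {
            /* 1. Create the solver object (here: a transient PISO solver). */
            arael_solver_t *solver = arael_piso_create();
            if (!solver) { fprintf(stderr, "could not create solver\n"); return 1; }

            /* 2. Prepare the data: mesh and initial p, U, phi fields,
             *    e.g. from a case converted into the ./arael directory. */
            arael_load_case(solver, "./arael");

            /* 3. Set solver parameters. */
            arael_set_double(solver, "deltaT", 1.0e-4);
            arael_set_double(solver, "tolerance", 1.0e-6);

            /* 4. Run the required number of time-loop iterations on the GPU. */
            arael_run(solver, 100 /* time steps */);

            /* 5. Get the computed data back from the GPU
             *    (written in OpenFOAM text format, as the 1st-level interface does). */
            arael_write_results(solver, "./999999999");

            /* 6. Destroy the solver object and release GPU resources. */
            arael_destroy(solver);
            return 0;
        }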
    • Conclusions
      – PISO and SIMPLE have been successfully ported to the GPU. Two access levels are provided.
      – Acceleration of x4 is consistent with the ratio of GPU to CPU memory bandwidth (144 GB/s vs. 32 GB/s); see the estimate after this slide.
      – ARAEL works entirely on the GPU; the CPU can be used for other calculations.
      – A large number of iterations per time step limits the acceleration (-> GAMG). A better preconditioner is needed (such as AMG).
      – Last but not least: ARAEL implements basic CFD solver operations such as Laplacian, gradient, and divergence, so it is possible to build other solvers using our API.
      Acceleration on a single GPU vs. an Intel Xeon X5670 (6 cores, hyper-threading):
      – CG/BiCG with Jacobi preconditioner: about x4-4.5
      – CG with DIC, BiCG with DILU: x2-x3
      – GAMG: minimal acceleration
      Reference: Tomczak T., Zadarnowska K., Koza Z., Matyka M., Miroslaw L., "Complete PISO and SIMPLE solvers on Graphics Processing Units", Computers & Fluids, 2012 (submitted).
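      As a rough sanity check of the bandwidth argument, assuming the solver is memory-bandwidth-bound, the attainable speedup is approximately the ratio of GPU to CPU memory bandwidth:

        S_{\max} \approx \frac{B_{\mathrm{GPU}}}{B_{\mathrm{CPU}}} = \frac{144\ \mathrm{GB/s}}{32\ \mathrm{GB/s}} = 4.5,

      which is in line with the reported x4-4.5 acceleration for the Jacobi-preconditioned runs.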
    • Feature Plans
      – Integration with pre- and post-processing for blood flow simulations (EU-funded project).
      – ARAEL is a platform for more advanced solutions, but feedback from the OpenFOAM community is needed. Possible extensions:
        – Multi-GPU for cases with > 11M cells.
        – More boundary conditions.
        – Turbulence modeling (k-omega SST).
        – More solvers ported to GPU: gpimple, ginter, gsonic?
    • Acknowledgments
      – Zbigniew Koza, Maciej Matyka (University of Wroclaw)
      – Tadeusz Tomczak, Kasia Zadarnowska (Wroclaw University of Technology)
      – Vartan Kurtcouglu, Farhad Rihtegar (ETH Zurich)
      – PLGRID Help desk
      Vratis Ltd., Muchoborska 18, Wroclaw, Poland, vratis.com, Email: lukasz.miroslaw@vratis.com. Questions? Comments?