Presentation of the open source CFD code Code_Saturne

This is a copy of a presentation given by EDF at the Open Source CFD International Conference (OSCIC) in 2009. It gives details of the open source CFD code Code_Saturne.

Transcript

  • 1. HPC and CFD at EDF with Code_Saturne
    Yvan Fournier, Jérôme Bonelle
    EDF R&D, Fluid Dynamics, Power Generation and Environment Department
    Open Source CFD International Conference, Barcelona, 2009
  • 2. Summary
    1. General elements on Code_Saturne
    2. Real-world performance of Code_Saturne
    3. Example applications: fuel assemblies
    4. Parallel implementation of Code_Saturne
    5. Ongoing work and future directions
  • 3. General elements on Code_Saturne
  • 4. Code_Saturne: main capabilities
    Physical modelling:
    - Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
    - Radiative heat transfer (DOM, P-1)
    - Combustion of coal, gas and heavy fuel oil (EBU, pdf, LWP)
    - Electric arc and Joule effect
    - Lagrangian module for dispersed particle tracking
    - Atmospheric flows (aka Mercure_Saturne)
    - Specific engineering module for cooling towers
    - ALE method for deformable meshes
    - Conjugate heat transfer (SYRTHES & 1D)
    - Common structure with NEPTUNE_CFD for Eulerian multiphase flows
    Flexibility:
    - Portability (UNIX, Linux and Mac OS X)
    - Standalone GUI, also integrated in the SALOME platform
    - Parallel on distributed-memory machines
    - Periodic boundaries (parallel, arbitrary interfaces)
    - Wide range of unstructured meshes with arbitrary interfaces
    - Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)
  • 5. Code_Saturne: general features
    Technology:
    - Co-located finite volume, arbitrary unstructured meshes (polyhedral cells), predictor-corrector method
    - 500 000 lines of code: 50% Fortran 90, 40% C, 10% Python
    Development:
    - 1998: prototype (building on long EDF in-house experience: ESTET-ASTRID, N3S, ...)
    - 2000: version 1.0 (basic modelling, wide range of meshes)
    - 2001: qualification for single-phase nuclear thermal-hydraulic applications
    - 2004: version 1.1 (complex physics, LES, parallel computing)
    - 2006: version 1.2 (state-of-the-art turbulence models, GUI)
    - 2008: version 1.3 (massively parallel, ALE, code coupling, ...); released as open source (GPL licence)
    - 2008: development version 1.4 (parallel I/O, multigrid, atmospheric flows, cooling towers, ...)
    - 2009: development version 2.0-beta (parallel mesh joining, code coupling, easy install & packaging, extended GUI); industrial release scheduled for the beginning of 2010
    Code_Saturne is developed under Quality Assurance
  • 6. Code_Saturne subsystems
    External libraries (EDF, LGPL):
    - BFT: Base Functions and Types (run-time environment, memory management, logging)
    - FVM: Finite Volume Mesh (code coupling, parallel mesh management, parallel treatment)
    - MEI: Mathematical Expression Interpreter
    [Diagram: meshes are read by the Preprocessor (mesh import, mesh joining, periodicity, domain partitioning) and passed to the parallel Kernel / CFD Solver, which relies on the BFT, FVM and MEI libraries, uses restart files and the XML data file produced by the GUI, couples with SYRTHES, Code_Aster and the SALOME platform, and writes post-processing output.]
  • 7. Code_Saturne environment
    Graphical User Interface:
    - setting up of calculation parameters
    - parameters stored in an XML file
    - interactive launch of calculations
    - some specific physics not yet covered by the GUI; advanced set-up via Fortran user routines
    Integration in the SALOME platform:
    - extension of GUI capabilities
    - mouse selection of boundary zones
    - advanced user file management
    - from CAD to post-processing in one tool
  • 8. Allowable mesh examples
    [Figures: mesh with stretched cells and hanging nodes; composite mesh of a PWR lower plenum; 3D polyhedral cells]
  • 9. Joining of non-conforming meshes (arbitrary interfaces)
    - Meshes may be contained in one single file or in several separate files, in any order
    - Arbitrary interfaces can be selected by mesh references
    - Caution must be exercised when arbitrary interfaces are used in critical regions, with LES across very different mesh refinements, or on curved CAD surfaces
    - Often used in ways detrimental to mesh quality, but a functionality we cannot do without as long as we do not have a proven alternative
    - Joining of meshes built in several pieces may also be used to circumvent meshing tool memory limitations
    - Periodicity is also constructed as an extension of mesh joining
  • 10. Real-world performance of Code_Saturne
  • 11. Code_Saturne: features of note for HPC
    Segregated solver:
    - All variables are solved for independently; coupling terms are explicit
    - Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGStab) for the other variables
    - More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
    - Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth is as important as peak flops
    - Linear equation solvers usually amount to about 80% of CPU cost (dominated by the pressure equation), gradient reconstruction about 20%
    - The larger the mesh, the higher the relative cost of the pressure step
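    To illustrate why this matrix-vector product is bandwidth-bound, here is a minimal C sketch of a face-based MatVec with indirect addressing; the array names and layout are illustrative assumptions, not the actual Code_Saturne data structures.

      /* y = A.x for a face-based sparse matrix: diagonal terms per cell,
       * one extra-diagonal term per interior face (symmetric case). */
      void mat_vec_face_based(int          n_cells,
                              int          n_faces,
                              const int    face_cell[][2], /* 2 cells adjacent to each face */
                              const double da[],           /* diagonal terms (per cell) */
                              const double xa[],           /* extra-diagonal terms (per face) */
                              const double x[],
                              double       y[])
      {
        /* Diagonal contribution: one regular pass over cells. */
        for (int i = 0; i < n_cells; i++)
          y[i] = da[i] * x[i];

        /* Extra-diagonal contribution: one pass over faces, gathering and
         * scattering through the face -> cells connectivity. Memory traffic,
         * not floating-point work, dominates this loop. */
        for (int f = 0; f < n_faces; f++) {
          int c0 = face_cell[f][0];
          int c1 = face_cell[f][1];
          y[c0] += xa[f] * x[c1];
          y[c1] += xa[f] * x[c0];
        }
      }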
  • 12. Current performance (1/3)
    Two LES test cases, most I/O factored out (the 1 M cell FATHER case and the 10 M cell HYPI case):
    - 1 M cells: (n_cells_min + n_cells_max)/2 = 880 local cells at 1024 cores, 109 at 8192
    - 10 M cells: (n_cells_min + n_cells_max)/2 = 9345 local cells at 1024 cores, 1150 at 8192
    [Figures: elapsed time vs. number of cores for the 1 M and 10 M hexahedra LES test cases, on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L systems]
  • 13. Current performance (2/3)
    RANS, 100 M tetrahedra + polyhedra (most I/O factored out):
    - Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts
    - 96286/102242 min/max cells per core at 1024 cores
    - 11344/12781 min/max cells per core at 8192 cores
    [Figure: elapsed time per iteration vs. number of cores for the fuel assembly grid RANS test case, on NovaScale and Blue Gene/L (CO and VN modes)]
  • 14. Current performance (3/3)
    - Efficiency often goes through an optimum (due to better cache hit rates) before dropping (due to latency induced by parallel synchronization)
    - Example shown here: HYPI (10 M cell LES test case)
    [Figure: parallel efficiency vs. number of MPI ranks (1 to 10000) on the Chatou cluster, Tantale, Platine and Blue Gene systems; efficiency rises above 1 before decreasing]
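    The efficiency plotted here is presumably the usual relative parallel efficiency with respect to a reference run (an assumption, not stated on the slide); with p_ref the smallest core count measured,

      E(p) = \frac{p_{\mathrm{ref}} \, T(p_{\mathrm{ref}})}{p \, T(p)}

    values above 1 (super-linear speed-up) then correspond to the regime where the per-core working set starts fitting in cache.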
  • 15. High performance computing with Code_Saturne
    Code_Saturne is used extensively on HPC machines:
    - in-house EDF clusters
    - CCRT calculation centre (CEA based)
    - EDF IBM Blue Gene machines (8 000 and 32 000 cores)
    - also run on MareNostrum (Barcelona Supercomputing Center), Cray XT, ...
    Code_Saturne is used as a reference in the PRACE European project:
    - reference code for CFD benchmarks at 6 large European HPC centres
    - Code_Saturne obtained "gold medal" status for scalability from Daresbury Laboratory (UK, HPCx machine)
  • 16. Example HPC applications: fuel assemblies
  • 17. Fuel assembly studies
    Conflicting design goals:
    - good thermal mixing properties, requiring turbulent flow
    - limit head loss
    - limit vibrations
    - fuel rods are held by dimples and springs, not welded, as they lengthen slightly over the years due to irradiation
    Complex core geometry:
    - circa 150 to 250 fuel assemblies per core depending on reactor type, 8 to 10 grids per fuel assembly, 17x17 grid (mostly fuel rods, 24 guide tubes)
    - geometry almost periodic, except for the mix of several fuel assembly types in a given core (reload by 1/3 or 1/4)
    - inlet and wall conditions are not periodic, and heat production is not uniform at fine scale
    Why we study these flows:
    - deformation may lead to difficulties in core unload/reload
    - turbulence-induced vibration of fuel assemblies in PWR power plants is a potential cause of deformation and of fretting-wear damage
    - these may lead to weeks or months of interruption of operations
  • 18. Prototype fuel assembly calculation with Code_Saturne
    - PWR nuclear reactor mixing grid mock-up (5x5)
    - 100 million cell calculation run on 4 000 to 8 000 cores
    - The main issue is mesh generation
  • 19. LES simulation of a reduced fuel assembly domain
    Particular features for LES:
    - SIMPLEC algorithm with Rhie and Chow interpolation
    - 2nd order in time (Crank-Nicolson and Adams-Bashforth)
    - 2nd order in space (fully centred, with sub-iterations for non-orthogonal faces)
    - fully hexahedral mesh, 8 million cells
    Boundary conditions:
    - implicit periodicity in x and y directions
    - constant inlet conditions
    - wall functions where needed
    - free outlet
    Simulation:
    - 1 million time steps: 40 flow passes, 20 flow passes for averaging (no homogeneous direction)
    - CFLmax = 0.8 (dt = 5×10⁻⁶ s)
    - Blue Gene/L system, 1024 processors
    - 5 s per time step; about 1 week for 100 000 time steps
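    As a quick arithmetic check of those wall-clock figures (not on the slide):

      10^5 \text{ steps} \times 5\ \mathrm{s/step} = 5 \times 10^5\ \mathrm{s} \approx 5.8\ \text{days} \approx 1\ \text{week}

    so the full 10^6-step simulation represents on the order of ten weeks of elapsed time at this rate on 1024 processors.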
  • 20. Parallel implementation of Code_Saturne
  • 21. Base parallel operations (1/4)
    Distributed-memory parallelism using domain partitioning:
    - the classical "ghost cell" method is used for both parallelism and periodicity
    - most operations require only ghost cells sharing faces
    - extended neighborhoods for gradients also require ghost cells sharing vertices
    - global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
    Periodicity uses the same mechanism:
    - vector and tensor rotation is also required
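    As an illustration of such a global reduction, a parallel dot product can be written as in the following minimal C sketch (the function and argument names are assumptions, not the Code_Saturne API):

      #include <mpi.h>

      /* Global dot product over a partitioned vector: each rank sums its own
       * (non-ghost) cells, then MPI_Allreduce combines the partial sums so
       * that every rank obtains the same global value. */
      static double
      dot_product_global(int n_local_cells, const double x[], const double y[],
                         MPI_Comm comm)
      {
        double s_local = 0.0, s_global = 0.0;

        for (int i = 0; i < n_local_cells; i++)  /* ghost cells excluded */
          s_local += x[i] * y[i];

        MPI_Allreduce(&s_local, &s_global, 1, MPI_DOUBLE, MPI_SUM, comm);

        return s_global;
      }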
  • 22. Base parallel operations (2/4)
    Use of global numbering:
    - a global number is associated with each mesh entity
    - a specific C type (fvm_gnum_t) is used for this; currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will become necessary
    - face-cell connectivity for hexahedral cells: size 4.n_faces, with n_faces about 3.n_cells, so size around 12.n_cells; numbers therefore require 64 bits above roughly 350 million cells
    - currently equal to the initial (pre-partitioning) numbering
    - allows for partition-independent single-image files: essential for restart files, also used for post-processor output
    - also used for legacy coupling, where matches can be saved
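    The 350-million-cell threshold follows from the size estimate above and the 32-bit range (a quick check, not spelled out on the slide):

      4\, n_{\text{faces}} \approx 4 \times 3\, n_{\text{cells}} = 12\, n_{\text{cells}} > 2^{32} \approx 4.29 \times 10^{9} \;\Rightarrow\; n_{\text{cells}} \gtrsim 3.6 \times 10^{8}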
  • 23. Base parallel operations (3/4)
    Use of global numbering: redistribution on n blocks
    - n blocks ≤ n cores
    - a minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force 1 block (for I/O with non-parallel libraries)
    - in the future, using at most 1 of every p processors may improve MPI I/O performance if we use a smaller communicator (to be tested)
  • 24. Base parallel operations (4/4)
    - Conversely, simply using global numbers allows reconstructing the mapping to equivalent entities on neighbouring partitions
    - Used for parallel ghost cell construction from an initially partitioned mesh with no ghost data
    - This arbitrary distribution is inefficient for halo exchange, but allows simpler data-structure-related algorithms with deterministic performance bounds
    - The owning processor is determined simply from the global number, and messages are aggregated
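    A minimal sketch of how a block owner can be determined from a global number, assuming contiguous blocks of near-equal size and an optional rank step so that only 1 of every p ranks holds a block (illustrative only; the real layout also honours the minimum block size mentioned above):

      /* Map a 1-based global number to the rank owning its block, assuming
       * n_blocks contiguous blocks of near-equal size over n_g_entities. */
      typedef unsigned long long gnum_t;  /* stand-in for a 64-bit global number */

      static int
      block_owner_rank(gnum_t g_num, gnum_t n_g_entities,
                       int n_blocks, int block_rank_step)
      {
        gnum_t block_size = (n_g_entities + n_blocks - 1) / n_blocks; /* ceiling */
        int    block_id   = (int)((g_num - 1) / block_size);

        return block_id * block_rank_step;  /* ranks 0, step, 2*step, ... own blocks */
      }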
  • 25. Parallel I/O (1/2)
    We prefer using single (partition-independent) files:
    - different stages or restarts of a calculation are easily run on different machines or queues
    - avoids having thousands or tens of thousands of files in a directory
    - better transparency of parallelism for the user
    Use MPI I/O when available:
    - uses a block-to-partition exchange when reading, partition-to-block when writing
    - use of indexed datatypes may be tested in the future, but will not be possible everywhere
    - used for reading preprocessor and partitioner output, as well as for restart files
    - these files use a unified binary format, consisting of a simple header and a succession of sections
    - the MPI I/O pattern is thus a succession of global reads (or local read + broadcast) for section headers, and collective reads of data (with a different portion for each rank)
    - we could switch to HDF5, but preferred a lighter model and avoiding an extra dependency or dependency conflicts
    Infrastructure in progress for post-processor output (layered approach, as we allow for multiple formats)
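    A stripped-down C sketch of that read pattern, with rank 0 broadcasting a small header and all ranks then reading their own contiguous portion collectively; the fixed header size and file layout are assumptions for illustration, not the actual Code_Saturne/FVM format:

      #include <mpi.h>
      #include <stdlib.h>

      enum { HEADER_SIZE = 64 };  /* assumed fixed-size section header */

      /* Read one data section of a binary file: header on rank 0 + broadcast,
       * then a collective read with a different portion for each rank. */
      static double *
      read_section(const char *path, MPI_Offset n_values, MPI_Comm comm)
      {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        char header[HEADER_SIZE];
        if (rank == 0)
          MPI_File_read_at(fh, 0, header, HEADER_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Bcast(header, HEADER_SIZE, MPI_BYTE, 0, comm);

        /* Simple block distribution of the n_values data items. */
        MPI_Offset block = (n_values + size - 1) / size;
        MPI_Offset start = (MPI_Offset)rank * block;
        if (start > n_values) start = n_values;
        MPI_Offset count = n_values - start;
        if (count > block) count = block;

        double *buf = malloc((size_t)count * sizeof(double));
        MPI_File_read_at_all(fh, HEADER_SIZE + start * (MPI_Offset)sizeof(double),
                             buf, (int)count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        return buf;
      }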
  • 26. Parallel I/O (2/2)
    - Parallel I/O is only of benefit with parallel filesystems
    - Use of MPI I/O may be disabled either at build time, or for a given file using specific hints
    - Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file API
    - Not much feedback yet, but initial results are disappointing: similar performance with and without MPI I/O on at least 2 systems, whether using MPI_File_read/write_at_all or MPI_File_read/write_all
    - This needs retesting while forcing fewer processors into the MPI I/O communicator
    - Bugs were encountered in several MPI I/O implementations
  • 27. Ongoing work and future directions
  • 28. Parallelization of mesh joining (2008-2009)
    Parallelizing this algorithm requires the same main steps as the serial algorithm:
    - detect intersections (within a given tolerance) between edges of overlapping faces, using a parallel octree of face bounding boxes, built in a bottom-up fashion (no balance condition required)
    - subdivide edges according to the inserted intersection vertices
    - merge coincident or nearly coincident vertices/intersections: this is the most complex step, it must be synchronized in parallel, and the choice of merging criteria has a profound impact on the quality of the resulting mesh
    - re-build sub-faces
    With parallel mesh joining, the most memory-intensive serial preprocessing step is removed
    We will add parallel mesh "append" within a few months (for version 2.1); this will allow generation of huge meshes even with serial meshing tools
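    The candidate search in that first step boils down to tolerance-aware bounding box overlap tests on the octree; a minimal C sketch of that primitive (the structure and the exact tolerance handling are assumptions):

      #include <stdbool.h>

      /* Axis-aligned bounding box, as might be stored per face in the octree. */
      typedef struct {
        double min[3];
        double max[3];
      } bbox_t;

      /* Return true if the two boxes, each enlarged by `tol`, overlap; such
       * pairs become candidates for the much more expensive edge
       * intersection tests of the joining algorithm. */
      static bool
      bbox_overlap(const bbox_t *a, const bbox_t *b, double tol)
      {
        for (int d = 0; d < 3; d++) {
          if (a->max[d] + tol < b->min[d] - tol)
            return false;
          if (b->max[d] + tol < a->min[d] - tol)
            return false;
        }
        return true;
      }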
  • 29. Coupling of Code_Saturne with itself
    Objectives:
    - coupling of different models (RANS/LES)
    - fluid-structure interaction with large displacements
    - rotating machines
    Two kinds of communication:
    - data exchange at boundaries for interface coupling
    - volume forcing for overlapping regions
    Still under development, but ...
    - data exchange is already implemented in the FVM library: optimised localisation algorithm, compliant with parallel/parallel coupling
    - prototype versions give promising results; more work is needed on conservativity at the exchange
    - a first version adapted to pump modelling is implemented in version 2.0; rotor/stator coupling compares favourably with CFX
  • 30. Multigrid
    - Currently, multigrid coarsening does not cross processor boundaries, which implies that on p processors the coarsest matrix may not contain fewer than p cells
    - With a high processor count, fewer grid levels are used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
    - This reduces scalability, and may be checked (if suspected) using the solver summary at the end of the log file
    - Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size becomes too small
    - The communication pattern is not expected to change too much, as partitioning is of a recursive nature and should already exhibit a "multigrid" structure
    - This may be less optimal than repartitioning at each level, but set-up time should also remain much cheaper, which is important, as grids may be rebuilt every time step
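    A minimal C sketch of the planned rank-gathering rule described above (purely illustrative, not the implemented code): when the mean local coarse-grid size falls below a threshold, each rank sends its coarse rows to the nearest lower multiple of a gathering step of 4 or 8.

      /* Decide which rank should own this rank's coarse-grid rows. */
      static int
      coarse_grid_target_rank(int rank, long n_local_rows,
                              long min_local_rows, int step /* 4 or 8 */)
      {
        if (n_local_rows >= min_local_rows)
          return rank;                  /* grid still large enough: no move */

        return (rank / step) * step;    /* gather onto ranks 0, step, 2*step, ... */
      }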
  • 31. Partitioning
    - We currently use METIS or SCOTCH, but should move to ParMETIS or PT-SCOTCH within a few months; the current infrastructure makes this quite easy
    - We have recently added a "backup" partitioning based on space-filling curves
    - We currently use the Z curve (reusing our octree construction for parallel joining), but appropriate changes in the coordinate comparison rules should allow switching to a Hilbert curve (reputed to lead to better partitioning)
    - This is fully parallel and deterministic
    - Performance in initial tests is about 20% worse on a single 10-million cell case on 256 processes: reasonable compared to unoptimized partitioning
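    For illustration, a Z (Morton) ordering key is obtained by interleaving the bits of quantized cell-centre coordinates; sorting cells by this key and cutting the sorted list into equal chunks gives a simple space-filling-curve partitioning. A minimal C sketch (not the Code_Saturne implementation):

      #include <stdint.h>

      /* Interleave the lower 21 bits of x, y, z (coordinates already
       * quantized onto a 2^21 grid) into a 63-bit Morton / Z-curve key. */
      static uint64_t
      morton3d_key(uint32_t x, uint32_t y, uint32_t z)
      {
        uint64_t key = 0;

        for (int b = 0; b < 21; b++) {
          key |= ((uint64_t)((x >> b) & 1u)) << (3*b);
          key |= ((uint64_t)((y >> b) & 1u)) << (3*b + 1);
          key |= ((uint64_t)((z >> b) & 1u)) << (3*b + 2);
        }
        return key;
      }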
  • 32. Tool chain evolution
    Code_Saturne V1.3 (current production version) added many HPC-oriented improvements compared to prior versions:
    - post-processor output handled by FVM / Kernel
    - ghost cell construction handled by FVM / Kernel: up to 40% gain in preprocessor memory peak compared to V1.2, parallelized and scalable (manages 2 ghost cell sets and multiple periodicities)
    - well adapted up to 150 million cells (with 64 GB for preprocessing); all fundamental limitations are pre-processing related
    [V1.3 tool chain: Meshes → Pre-Processor (serial run) → Kernel + FVM (distributed run) → post-processing output]
    Version 2.0 separates partitioning from preprocessing, which also reduces their memory footprint a bit, moving newly parallelized operations to the kernel
    [V2.0 tool chain: Meshes → Pre-Processor (serial run) → Partitioner (serial run) → Kernel + FVM (distributed run) → post-processing output]
  • 33. Future direction: hybrid MPI / OpenMP (1/2)
    - Currently, a pure MPI model is used: everything is parallel, synchronization is explicit when required
    - On multiprocessor / multicore nodes, shared-memory parallelism could also be used (using OpenMP directives)
    - Parallel sections must be marked, and parallel loops must avoid modifying the same values
    - Specific numberings must be used, similar to those used for vectorization but with different constraints: avoid false sharing, keep locality to limit cache misses
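    A cell loop is the easy case for such directives, since each iteration writes to a distinct entry; face loops, which scatter contributions to two cells per face, are what require the specific renumbering mentioned above. A minimal C/OpenMP sketch of the cell-loop case (illustrative only):

      /* Each iteration updates a distinct y[i], so the loop can be marked
       * parallel directly. Face loops would need faces grouped so that no
       * two threads update the same cell. */
      static void
      axpy_cells(int n_cells, double alpha, const double x[], double y[])
      {
        #pragma omp parallel for
        for (int i = 0; i < n_cells; i++)
          y[i] += alpha * x[i];
      }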
  • 34. Future direction: hybrid MPI / OpenMP (2/2)
    Hybrid MPI / OpenMP is being tested:
    - IBM is testing this on Blue Gene/P
    - requires work on renumbering algorithms
    OpenMP parallelism would ease packaging / installation on workstations:
    - removes the dependency on MPI library choices (source- but not binary-compatible), leaving only a dependency on the compiler runtime
    - good enough for current multicore workstations
    - coupling the code with itself or with SYRTHES 4 will still require MPI
    The main goal is to allow MPI communicators of "only" 10000s of ranks on machines with 100000 cores:
    - performance benefits are expected mainly at the very high end
    - reduces the risk of medium-term issues with MPI_Alltoallv, used in I/O and parallelism-related data redistribution (though sparse collective algorithms are the long-term solution for this specific issue)
  • 35. Code_Saturne HPC roadmap
    - 2003: study consecutive to the Civaux thermal fatigue event; with no experimental approach available up to then, computations with an LES approach for turbulence modelling and a refined mesh near the wall enable a better understanding of the wall thermal loading at an injection, and knowing the root causes of the event allows defining a new design to avoid the problem. 10^6 cells, 3×10^13 operations, Fujitsu VPP 5000 (1 of 4 vector processors), 2-month computation, about 1 GB of storage, 2 GB of memory.
    - 2006: part of a fuel assembly. 10^7 cells, 6×10^14 operations, IBM Power5 cluster (400 processors), 9 days, about 15 GB of storage, 25 GB of memory.
    - 2007: 3 grid assemblies, for a better understanding of vibration phenomena and wear-out of the rods. 10^8 cells, 10^16 operations, IBM Blue Gene/L (8000 processors), about 1 month, about 200 GB of storage, 250 GB of memory.
    - 2010: 9 fuel assemblies; will enable the study of side effects implied by the flow around neighbouring fuel assemblies. 10^9 cells, 3×10^17 operations, "Frontier" machine (30 times the power of the IBM Blue Gene/L), about 1 month, about 1 TB of storage, 2.5 TB of memory.
    - 2015: the whole reactor vessel. 10^10 cells, 5×10^18 operations, "Frontier" machine (500 times the power of the IBM Blue Gene/L), about 1 month, about 10 TB of storage, 25 TB of memory.
    - Main limiting factors over this period: power of the computer, non-parallelized pre-processing, mesh generation, solver scalability, visualisation.
  • 36. Thank you for your attention!
  • 37. Additional Notes
  • 38. Load imbalance (1/3)
    In this example, using 8 partitions (with METIS), we have the following local minima and maxima:
    - cells: 416 / 440 (6% imbalance)
    - cells + ghost cells: 469 / 519 (11% imbalance)
    - interior faces: 852 / 946 (11% imbalance)
    Most loops are on cells, but some are on cells + ghosts, and MatVec is on cells + faces
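    These percentages are consistent with measuring imbalance as the max-to-min ratio minus one (a plausible reading of the slide, not stated explicitly):

      \frac{440}{416} - 1 \approx 5.8\%, \qquad \frac{519}{469} - 1 \approx 10.7\%, \qquad \frac{946}{852} - 1 \approx 11.0\%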
  • 39. Load imbalance (2/3)
    - If load imbalance increases with processor count, scalability decreases
    - If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, though some processor power is wasted
    - Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops (the preconditioned conjugate gradient uses MatVec and dot products)
    - Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces)
  • 40. Load imbalance (3/3)
    - Another possible source of load imbalance is different cache miss rates on different ranks, which is difficult to estimate a priori
    - With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another a cache miss every 400 instructions, and the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%