In this talk we present the Distributed and Unified Numerics
Environment (DUNE), a software framework for the parallel numerical
solution of partial differential equations with grid-based methods.
Using generic programming techniques, it strives for both high
flexibility (efficiency of the programmer) and high performance
(efficiency of the program).
We present parallel applications realized with DUNE and
show their scalability on current HPC platforms such as the Blue
Gene/P system in Jülich.
Finally, we take a closer look at the hardware attributes that
influence the scalability of DUNE, and of software solving partial
differential equations in general, and investigate how DUNE will
perform on future hardware like Blue Gene/Q.
Special emphasis will be put on the performance of parallel
iterative solvers both in general and in DUNE.
1. Title
DUNE on current and next generation HPC Platforms
Dr. Markus Blatt
HPC-Simulation-Software & Services
Forschungszentrum Jülich, Germany
March 8, 2012
2. Outline
Outline
1 DUNE
2 Parallelization
Parallel Grid Interface
Parallel Iterative Solvers
Parallel Algebraic Multigrid
Scalability
A Glimpse at other DUNE projects
3 Trends and Outlook for HPC and DUNE
4 HPC-Simulation-Software & Services
3. DUNE
Why we created DUNE!
Problems with most PDE software
• Most packages support only one special set of features:
• IPARS: block-structured, parallel, multiphysics.
• Alberta: simplicial, unstructured, bisection refinement.
• UG: unstructured, multi-element, red-green refinement, parallel.
• QuocMesh: fast, on-the-fly structured grids.
• Other features are either not supported or supported inefficiently.
The idea of DUNE
• Separation of data structures and algorithms.
• Easy exchange of interface implementations.
• Reuse of legacy software.
• Fine-grained interfaces.
• C++ with templates (see the sketch below).
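As an illustration of this design, here is a minimal sketch (our own example, not taken from the DUNE sources) of an algorithm written once against the abstract grid-view interface:

// Written once against the grid-view interface, this compiles and runs
// unchanged for every grid implementation (YaspGrid, UGGrid, ALUGrid, ...)
// chosen at compile time, e.g. as totalVolume(grid.leafView()).
template<class GridView>
double totalVolume(const GridView& gv)
{
  typedef typename GridView::template Codim<0>::Iterator Iterator;
  double volume = 0.0;
  // iterate over all codim-0 entities (cells) of the view
  for (Iterator it = gv.template begin<0>(); it != gv.template end<0>(); ++it)
    volume += it->geometry().volume();
  return volume;
}

Because the compiler sees the concrete grid type, it can inline through the fine-grained interface, so the flexibility costs no runtime performance.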
4. DUNE
Modularity
Figure: DUNE module graph. dune-common at the base; the core modules dune-grid and dune-istl above it; dune-localfunctions, the discretization modules dune-pdelab and dune-fem, and the howto modules on top; external packages: UG, Alberta, ALU, NeuronGrid, METIS, SuperLU, VTK, Gmsh.
• Grid interface: (non-)conforming, hierarchical grid interface.
• Iterative Solver Template Library (ISTL): dense and sparse linear algebra.
• PDELab: discretization module based on a residual formulation.
5. DUNE
PDELab: Plug, Code and Play Simulation Software
• Choose:
• Grid
• Finite element
• Time stepping scheme
• (Non-)linear solvers
• Maybe implement a local operator.
• Recompile the application (composition sketched below).
• Run efficiently.
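The composition principle fits in a few lines of self-contained C++. All type names below are illustrative stand-ins, not PDELab's actual classes; the point is that every ingredient is a template parameter, so exchanging one is a one-line change plus a recompile.

#include <iostream>

// Illustrative stand-ins for the exchangeable ingredients.
struct Q1 { static const char* name() { return "Q1"; } };
struct Q2 { static const char* name() { return "Q2"; } };
struct PoissonOp { void assemble() { /* grad u . grad v volume term */ } };

// The generic driver knows nothing about the concrete ingredients.
template<class FiniteElement, class LocalOperator>
struct Simulation {
  void run() {
    std::cout << "assembling residual with " << FiniteElement::name()
              << " elements\n";
    LocalOperator().assemble();
  }
};

int main() {
  Simulation<Q1, PoissonOp>().run(); // switch to Q2 and recompile
}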
6. DUNE
Sample Simulations
Figure: Transport in porous media; density-driven flow; flow around root networks; neuron network simulations.
• Electromagnetics
• Computational neuroscience: biophysically realistic networks of neurons
• Geostatistical inversion: coping with uncertain parameters
• Linear Acoustics
• Multiphase-Flow and transport in porous media
• Density-driven flow
7. Parallelization
Parallelization
Parallel Grids
• Domain decomposition (overlapping or non-overlapping).
• Load balancing.
• Message passing based on MPI, handled by the grid manager.
Parallel linear algebra
• Message passing decoupled from the grid.
• Abstraction: parallel index sets to identify data globally.
• Reuse of efficient sequential linear algebra.
• Minimize communication.
8. Parallelization Parallel Grid Interface
An Example of a Parallel Grid
Figure: A grid distributed over three processes (c = 0, 1, 2). First row: with overlap and ghosts; second row: with overlap only; third row: with ghosts only. The legend distinguishes the entity partitions interior, overlap, and ghost, the boundary types border and front, and cells that are not stored. (Iterating over these partitions is sketched below.)
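In code, an algorithm restricts itself to one of these partitions via partition iterators. A minimal sketch, assuming the standard DUNE grid interface (countInteriorCells is an illustrative helper, not part of DUNE):

#include <dune/grid/common/gridenums.hh>  // PartitionIteratorType

// Visit only interior cells, skipping overlap and ghost copies that are
// owned by other processes.
template<class GridView>
int countInteriorCells(const GridView& gv)
{
  typedef typename GridView::template Codim<0>
    ::template Partition<Dune::Interior_Partition>::Iterator Iterator;
  int n = 0;
  for (Iterator it = gv.template begin<0, Dune::Interior_Partition>();
       it != gv.template end<0, Dune::Interior_Partition>(); ++it)
    ++n;
  return n; // a global count would need one sum-reduction over all ranks
}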
9. Parallelization Parallel Grid Interface
Parallel Grids in DUNE
• YaspGrid
• structured
• 2D/3D
• arbitrary overlap
• UGGrid
• unstructured
• 2D/3D
• multi-element
• one layer of ghost cells
• (conforming) red-green refinement
• (non-free!)
• ALUGrid
• unstructured
• 3D
• tetrahedral or hexahedral elements
• ghost cells
• (non-conforming) bisection refinement
12. Parallelization Parallel Iterative Solvers
Index Sets
Index Set
• Distributed overlapping index set I = ∪_{p=0}^{P−1} I_p.
• Process p manages the mapping I_p → [0, n_p).
• Might only store information about the mapping for shared indices.
Global Index
• Identifies a position (index) globally.
• Arbitrary and not consecutive (to support adaptivity).
• Persistent.
• On JUGENE this is not an int, to avoid the 32-bit limit!
Local Index
• Addresses a position in the local container.
• Convertible to an integral type.
• Consecutive index starting from 0.
• Non-persistent.
• Provides an attribute to identify the ghost region (see the sketch below).
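A minimal sketch of building such an index set with the ParallelIndexSet class from dune-common/dune-istl (header locations moved between releases; the attribute enum Flags is our own illustration):

#include <dune/common/parallel/indexset.hh>    // dune/istl/... in old releases
#include <dune/common/parallel/plocalindex.hh>

// Our own attribute flags marking the partition an index belongs to.
enum Flags { owner, ghost };

typedef Dune::ParallelLocalIndex<Flags>          LocalIndex;
typedef Dune::ParallelIndexSet<int, LocalIndex>  PIndexSet;

void buildIndexSet(PIndexSet& indexSet)
{
  indexSet.beginResize();
  // Global index 42 sits at local position 0, is owned by this process and
  // is public, i.e. also known to at least one other process.
  indexSet.add(42, LocalIndex(0, owner, true));
  // Global index 43 is a ghost copy whose owner is some other process.
  indexSet.add(43, LocalIndex(1, ghost, true));
  indexSet.endResize();
}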
15. Parallelization Parallel Iterative Solvers
Remote Information and Communication
Remote Index Information
• For each process q, process p knows all common global indices together with their attributes on q.
Communication
• Target and source partitions of an index are chosen using attribute flags, e.g. from ghost to owner and ghost.
• If there is remote index information of q available on p, then p sends all the data in one message.
• All communication takes place asynchronously at the same time (sketched below).
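A hedged sketch along the lines of the ISTL communication tutorial, reusing PIndexSet and Flags from the sketch above (CopyGatherScatter and updateGhosts are illustrative helpers; some signatures differ between releases):

#include <mpi.h>
#include <vector>
#include <dune/common/enumset.hh>
#include <dune/common/parallel/remoteindices.hh>
#include <dune/common/parallel/interface.hh>
#include <dune/common/parallel/communicator.hh>

// How to (un)pack a single entry; the communicator calls these per index.
template<class V>
struct CopyGatherScatter {
  typedef typename V::value_type IndexedType;
  static const IndexedType& gather(const V& v, std::size_t i) { return v[i]; }
  static void scatter(V& v, const IndexedType& item, std::size_t i) { v[i] = item; }
};

void updateGhosts(const PIndexSet& indexSet, std::vector<double>& data)
{
  // Find out which of our public indices exist on which neighbour.
  Dune::RemoteIndices<PIndexSet> remote(indexSet, indexSet, MPI_COMM_WORLD);
  remote.rebuild<true>();

  // Communicate from owner entries here to their ghost copies elsewhere.
  Dune::Interface interface;
  interface.build(remote, Dune::EnumItem<Flags,owner>(),
                  Dune::EnumItem<Flags,ghost>());

  Dune::BufferedCommunicator comm;
  comm.build(data, data, interface);   // one message buffer per neighbour
  comm.forward<CopyGatherScatter<std::vector<double> > >(data, data);
  comm.free();
}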
16. Parallelization Parallel Iterative Solvers
Parallel Matrix Representation
• Let I_i be a nonoverlapping decomposition of our index set I.
• Ĩ_i ⊇ I_i is the augmented index set such that for every j ∈ I_i each k with |a_kj| + |a_jk| ≠ 0 is also contained in Ĩ_i.
• Then the locally stored matrix has the block structure

               I_i   Ĩ_i∖I_i
   I_i       ( A_ii     *    )
   Ĩ_i∖I_i   (   0      I    )

• Therefore Av can be computed locally for the entries associated with I_i if v is known on all of Ĩ_i.
• A communication step ensures consistent ghost values.
• Matrix can be used for hybrid preconditioners (parallel matvec sketched below).
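A sketch of the resulting owner-computes matrix-vector product, assuming ISTL's OwnerOverlapCopyCommunication as the communication object (parallelMatVec is our own illustrative helper):

#include <dune/common/fmatrix.hh>
#include <dune/common/fvector.hh>
#include <dune/istl/bcrsmatrix.hh>
#include <dune/istl/bvector.hh>
#include <dune/istl/owneroverlapcopy.hh>

typedef Dune::BCRSMatrix<Dune::FieldMatrix<double,1,1> > Mat;
typedef Dune::BlockVector<Dune::FieldVector<double,1> >  Vec;
typedef Dune::OwnerOverlapCopyCommunication<int>         Comm;

// y = A x: the local product is exact for all rows in I_i; one nearest-
// neighbour communication then overwrites the ghost entries of y with the
// consistent values from their owner processes.
void parallelMatVec(const Mat& A, const Vec& x, Vec& y, Comm& comm)
{
  A.mv(x, y);                // purely local sparse matrix-vector product
  comm.copyOwnerToAll(y, y); // make ghost values consistent again
}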
17. Parallelization Parallel Algebraic Multigrid
Algebraic Multigrid (AMG)
Stationary Iterative Methods
• Error reduction stagnates with an increasing number of iterations and unknowns.
• Only high-frequency error components are reduced efficiently.
Algebraic Multigrid
• Approximate the smooth error on a coarser grid and solve there.
• Calculate a correction on the coarse grid.
• Interpolate the correction to the fine grid and add it to the current guess (schematic V-cycle below).
• Use the algebraic nature of the problem to define the coarse level.
• Coarsening adapts to problem and grid automatically.
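Schematically, one multigrid V-cycle then looks as follows (illustrative helper signatures only, bodies omitted; this is a sketch, not ISTL's actual code):

#include <vector>
typedef std::vector<double> Vec;

// One level of the hierarchy; method bodies omitted in this sketch.
struct Level {
  void smooth(Vec& x, const Vec& b) const;          // e.g. one SSOR sweep
  Vec  residual(const Vec& x, const Vec& b) const;  // b - A_l x
  Vec  restrictToCoarse(const Vec& r) const;        // P_l^T r
  Vec  prolongate(const Vec& e) const;              // P_l e
  void coarseSolve(Vec& x, const Vec& b) const;     // direct solve on level 0
};

void vcycle(const std::vector<Level>& levels, std::size_t l, Vec& x, const Vec& b)
{
  const Level& L = levels[l];
  if (l == 0) { L.coarseSolve(x, b); return; }
  L.smooth(x, b);                          // pre-smoothing damps high frequencies
  Vec rc = L.restrictToCoarse(L.residual(x, b));
  Vec ec(rc.size(), 0.0);                  // coarse correction, zero initial guess
  vcycle(levels, l - 1, ec, rc);
  Vec corr = L.prolongate(ec);             // interpolate correction to fine grid
  for (std::size_t i = 0; i < x.size(); ++i)
    x[i] += corr[i];                       // add it to the current guess
  L.smooth(x, b);                          // post-smoothing
}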
18. Parallelization Parallel Algebraic Multigrid
Aggregation AMG
Simple, non-smoothed version
• Piecewise constant prolongators P_l.
• Heuristic and greedy aggregation algorithm.
• A_{l−1} = P_l^T A_l P_l.
• Proposed by Raw, Vaněk et al., Braess.
• Preconditioner for Krylov methods (usage sketched below).
Observations
• Reasonable coarse grid operator for systems.
• Preserves FV discretization.
• Very memory efficient.
• Fast and scalable V-cycle.
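For reference, a minimal sequential sketch of using the ISTL aggregation AMG as a preconditioner for CG (the parallel variant additionally takes a communication object; exact constructor signatures varied between releases):

#include <dune/common/fmatrix.hh>
#include <dune/common/fvector.hh>
#include <dune/istl/bcrsmatrix.hh>
#include <dune/istl/bvector.hh>
#include <dune/istl/operators.hh>
#include <dune/istl/preconditioners.hh>
#include <dune/istl/solvers.hh>
#include <dune/istl/paamg/amg.hh>

typedef Dune::BCRSMatrix<Dune::FieldMatrix<double,1,1> > Mat;
typedef Dune::BlockVector<Dune::FieldVector<double,1> >  Vec;
typedef Dune::MatrixAdapter<Mat,Vec,Vec>                 Operator;
typedef Dune::SeqSSOR<Mat,Vec,Vec>                       Smoother;
typedef Dune::Amg::CoarsenCriterion<
  Dune::Amg::SymmetricCriterion<Mat,Dune::Amg::FirstDiagonal> > Criterion;
typedef Dune::Amg::AMG<Operator,Vec,Smoother>            AMG;

void solveWithAMG(Mat& A, Vec& x, Vec& b)
{
  Operator op(A);
  Criterion criterion;                     // default aggregation parameters
  Dune::Amg::SmootherTraits<Smoother>::Arguments smootherArgs;
  AMG amg(op, criterion, smootherArgs);    // setup phase: build the hierarchy
  Dune::CGSolver<Vec> cg(op, amg, 1e-8, 100, 2); // reduction, max iter, verbosity
  Dune::InverseOperatorResult stat;
  cg.apply(x, b, stat);                    // note: b is overwritten
}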
20. Parallelization Parallel Algebraic Multigrid
Illustration Parallel Setup
Figure: Decoupled aggregation.
21. Parallelization Parallel Algebraic Multigrid
Illustration Parallel Setup
Figure: Communicate ghost aggregates.
22. Parallelization Parallel Algebraic Multigrid
Parallel Setup Phase
• Every process builds aggregates in its owner region.
• One communication with the next neighbours to update the aggregate information in the ghost region.
• Coarse level index sets are a subset of the fine level.
• Remote index information can be deduced locally.
• Galerkin product can be calculated locally.
23. Parallelization Parallel Algebraic Multigrid
Data Agglomeration on Coarse Levels
• Repartition the data onto fewer processes.
• Use METIS on the graph of the communication pattern, as sketched below. (ParMETIS cannot handle the full machine!)
Figure: Number of vertices per processor over the number of non-idle processors n; coarsening proceeds through levels L, L−1, ..., and agglomeration maps levels onto fewer processes as (L−2)', (L−4)', (L−6)'.
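A hedged sketch of the repartitioning call, using the METIS 5 C API (the talk-era code used the METIS 4 interface; repartition is an illustrative helper):

#include <metis.h>
#include <vector>

// Partition the communication graph (CSR arrays xadj/adjncy) of the current
// level into nparts parts; part[v] is the target process of vertex v.
std::vector<idx_t> repartition(std::vector<idx_t>& xadj,
                               std::vector<idx_t>& adjncy,
                               idx_t nparts)
{
  idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1;
  idx_t ncon = 1, objval;
  std::vector<idx_t> part(nvtxs);
  METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                      /*vwgt*/ NULL, /*vsize*/ NULL, /*adjwgt*/ NULL,
                      &nparts, /*tpwgts*/ NULL, /*ubvec*/ NULL,
                      /*options*/ NULL, &objval, part.data());
  return part;
}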
25. Parallelization Scalability
Clipped Log-Random Problem
• −∇ · (k(x)∇u) = f in Ω
• κ(x): realization of a log-random field with variance σ², mean 0, and correlation length λ.
• k(x): binary medium constructed from κ(x).
• Weak scaling: λ scales with mesh width h.
26. Parallelization Scalability
Weak Scalability Results (Poisson vs. Clipped)
28. Parallelization Scalability
Parallel Groundwater Simulation
Figure: Cut through the ground beneath a field
• Highly discontinuous permeability of the ground.
• 3D simulations with high resolution.
• Efficient and robust parallel iterative solvers.
−∇ · (K(x)∇u) = f in Ω    (1)
29. Parallelization Scalability
Weak Scalability Results II
• Richards equation.
• 64 × 64 × 128 unknowns per process.
• 1.25 · 10¹¹ unknowns on the full JUGENE.
• One time step in the simulation.
30. Parallelization Scalability
Efficiency Solver Components
31. Parallelization Scalability
Efficiency IO
• Highly tuned by Olaf Ippisch.
• SIONlib from Jülich rocks!
• Still: IO is not very scalable!
32. Parallelization A Glimpse at other DUNE projects
Projects using DUNE-FEM on JUGENE
DFG SPP MetStröm: Adaptive Numerics for Multiscale Phenomena
· D. Kröner, S. Brdar (Freiburg)
· M. Baldauf, D. Schuster (DWD)
· R. Klöfkorn (Stuttgart)
· A. Dedner (Warwick)
Figure: Mountain wave test case (work by S. Brdar).
BW Stiftung HPC-11: Simulation of 2-stroke engines with detailed combustion
· D. Kröner, D. Lebiedz, M. Nolte, M. Fein (Freiburg)
· R. Klöfkorn (Stuttgart)
· A. Dedner (Warwick)
Figure: work by D. Trescher.
Slide credit: Robert Klöfkorn
33. Parallelization A Glimpse at other DUNE projects
Strong scaling on Blue Gene/P
Table: Strong scaling and efficiency on the supercomputer JUGENE (Jülich, Germany)

#cores   #cells/core¹   #DOFs/core   time (ms)²   speed-up   efficiency
   512            474       296250        46216         —            —
  4096             59        36875         6294       7.34         0.91
 32768              7         4375          949      48.71         0.76
 65536              3         1875          504      91.70         0.72

• Navier-Stokes equations solved with CDG2 (k = 4, 3D).
• Overall number of cells 243 000 (#DOFs ≈ 1.52 · 10⁸) on a Cartesian grid.
• ≈ 6.1 GB memory consumption on a desktop machine.
• Explicit Runge-Kutta method of order 3.
• Programming techniques for performance:
· template meta programming
· automated code generation of DG kernels
· hybrid parallelization (MPI/pthreads, work in progress)
· overlap of computation and communication (DCMF_INTERRUPT=1)

¹ average #cells/core   ² average run time per time step
Slide credit: Robert Klöfkorn
34. Trends and Outlook for HPC and DUNE
(My) Parallel Machines
• Helics I (2003): 256 nodes; dual AMD Athlon 1.4 GHz; 5.9 GFLOPS peak/node; 1 GB main memory/node; Myrinet 2 Gbit.
• Helics II (2007): 156+4 nodes; 2× dual-core AMD Opteron 2220, 2.8 GHz; 18.8 GFLOPS peak/node; 8 GB RAM/node (21.4 GB/s); Myricom 10 Gbit.
• Blue Gene/P (2009): 73728 nodes; quad-core PowerPC 450, 850 MHz; 13.6 GFLOPS peak/node; 2 GB RAM/node (13.6 GB/s).
35. Trends and Outlook for HPC and DUNE
Observations in Parallel Computing
Software
• Solution of time-dependent (nonlinear) equations with implicit time stepping schemes.
• Most time consuming: solution of the linear system.
• Peak GFLOPS out of reach!
• Methods are limited by memory bandwidth (see the example below).
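Back-of-the-envelope example (assuming a CRS sparse matrix-vector product with 8 bytes per value plus 4 bytes per column index, i.e. 12 bytes and 2 flops per nonzero): at 13.6 GB/s memory bandwidth, a Blue Gene/P node cannot exceed roughly 2 · 13.6/12 ≈ 2.3 GFLOPS in the matvec, a small fraction of its 13.6 GFLOPS peak, whatever the core count.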
Hardware
• Costs for compute power drop fast (2002: 12 USD/MFLOP; 2011: 0.01 USD/MFLOP).
• Costs for main memory drop only slightly.
• Main memory is not power efficient.
36. Trends and Outlook for HPC and DUNE
Current Hardware/Software Trends
The hardware manufacturers' solution (Green Computing)
• More cores per node.
• Less main memory per core.
• SIMD (Blue Gene/Q, GPGPU).
• More GFLOPS per GB/s of main memory bandwidth.
• Faster network interconnects.
How does DUNE cope?
• Memory efficiency!
• Minimize communication!
• Favor less rapidly convergent but more scalable iterative methods!
• Time to solution / scalability matters most!
• Ability to compute bigger problems faster.
Software is always two steps behind.
37. Trends and Outlook for HPC and DUNE
Current Work and Future Plans
Software Point of View
• Hybrid parallelization in ALUGrid (Klöfkorn, University of Stuttgart).
• Borrow ideas from GPGPU computing (coalesced memory access).
• Check out cache-oblivious algorithms.
• Check out parallel-in-time algorithms.
Application Point of View
• Inverse modeling.
• Geostatistical inversion: University of Heidelberg (Ippisch, Ngo) and Tübingen (Cirpka, Schwede).
• Run several parallel forward simulations in parallel.
38. Trends and Outlook for HPC and DUNE
DUNE on Blue Gene/Q?
Advantages of a central installation:
• Saves scientists a lot of time.
• Would help to optimize DUNE for the platform.
• Possibility of professional installation and user support.
• Closer cooperation between DUNE and IBM benefits everyone.
39. HPC-Simulation-Software & Services
What can we do for you?
Efficient simulation software made to measure.
Hans-Bunte-Str. 8-10, 69123 Heidelberg, Germany
http://www.dr-blatt.de