



  Scaling Algebraic Multigrid to over 287K processors

                                 Markus Blatt
                     Markus.Blatt@iwr.uni-heidelberg.de
                   Interdisziplinäres Zentrum für wissenschaftliches Rechnen
                                 Universität Heidelberg


             Copper Mountain Conference on Multigrid Methods
                             March 28, 2011






Outline


1   Motivation

2   IBM Blue Gene/P

3   Algebraic Multigrid

4   Parallelization Approach

5   Scalability






Motivation




                     Figure: Cut through the ground beneath an acre

  • Highly discontinuous permeability of the ground.
  • 3D Simulations with high resolution.
  • Efficient and robust parallel iterative solvers.

                               −∇ · (K(x) ∇u) = f   in Ω


IBM Blue Gene/P
JUGENE in Jülich




   • 72 racks, 73728 nodes (294912 cores)
   • Overall peak performance: 1 Petaflop/s
   • Main memory: 144 TB
   • Compute card / processor
       • PowerPC 450, 32-bit, 850 MHz, 4 cores
       • Main memory: 2 GB
   • Top500 list: rank 9 worldwide (rank 2 in Europe)


Challenges on IBM Blue Gene/P


Challenging Facts
  • Only 500MB main memory/core in VM-mode!
  • 32-bit architecture, but 144 TB memory!
  • Utilize many cores!
  • Idle cores for coarse problems!

Crucial Points to be Addressed
  • Be memory efficient, even if that means accepting methods that converge
     more slowly.
  • Be able to address 10^12 degrees of freedom (see the estimate below).
  • Speed up matrix-vector products as much as possible!
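To see why, here is a quick back-of-the-envelope check; only the 144 TB of
main memory and the 10^12-unknown target are taken from the slides, the
7-point-stencil CRS cost is my own assumption:

    #include <cstdio>

    int main() {
      // Machine-wide budget from the JUGENE slide: 144 TB of main memory.
      const double totalBytes = 144.0 * (1ULL << 40);   // 144 TiB
      const double targetDofs = 1e12;                    // goal: 10^12 unknowns

      // Budget per unknown for *everything*: matrix hierarchy, vectors,
      // index sets, MPI buffers, ...
      std::printf("budget per unknown: %.0f bytes\n", totalBytes / targetDofs);

      // Assumed cost of one fine-level CRS matrix row for a 7-point stencil
      // (8-byte value + 4-byte column index per entry, one row offset),
      // which already consumes a large share of that budget.
      std::printf("one 7-point CRS row: %.0f bytes\n", 7.0 * (8 + 4) + 8);
      return 0;
    }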





Algebraic Multigrid Methods
Interpolation AMG
  • Split the unknowns into C-nodes (coarse-grid unknowns) and F-nodes (all
     other unknowns).
  • Construct operator-dependent interpolation operators.
  • Optimal number of iterations for model problems.
  • Non-optimal operator complexity.
  • Convergence behavior not optimal for discontinuous coefficients.

Aggregation AMG
  • Builds aggregates of strongly connected unknowns.
  • Each aggregate is represented by a coarse-level unknown.
  • (Tentative) piecewise-constant prolongators (see the sketch below).
  • For smoothed aggregation the prolongation operator is improved via
     smoothing.
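As a minimal illustration of the piecewise-constant (tentative) prolongator,
assuming an aggregate map agg[i] that stores the coarse unknown of fine
unknown i (my notation, not the DUNE-ISTL interface):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // x_f += P x_c : every fine unknown receives the value of its aggregate.
    void prolongate(const std::vector<std::size_t>& agg,
                    const std::vector<double>& xc, std::vector<double>& xf) {
      for (std::size_t i = 0; i < xf.size(); ++i)
        xf[i] += xc[agg[i]];
    }

    // d_c = P^T d_f : the coarse defect sums the fine defect over each aggregate.
    void restrictDefect(const std::vector<std::size_t>& agg,
                        const std::vector<double>& df, std::vector<double>& dc) {
      std::fill(dc.begin(), dc.end(), 0.0);
      for (std::size_t i = 0; i < df.size(); ++i)
        dc[agg[i]] += df[i];
    }

Smoothed aggregation would additionally smooth the columns of this tentative
P, e.g. with one damped Jacobi step.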


Improved Algebraic Multigrid Methods



Improved Variants
  • Interpolation-based AMG with aggressive coarsening (less fill-in).
  • Adaptive AMG (better convergence, but more fill-in).
  • AMG based on compatible relaxation (are there efficient
     implementations?).
  • K-cycle Aggregation-AMG (good operator complexity, but more work
     on levels).






Our AMG Method



 • Non-smoothed aggregation AMG.
 • Heuristic and greedy aggregation algorithm.
 • Fast setup time.
 • Very memory efficient.
 • Fast and scalable V-cycle.
 • Used only as a preconditioner for Krylov methods (BiCGSTAB in the
    examples).






Sequential Aggregation




  • Strong connection (j, i):  a⁻_ij a⁻_ji / (a_ii a_jj) > α γ_i,
     with γ_i = max_{j≠i} a⁻_ij a⁻_ji / (a_ii a_jj)  and  a⁻_ij = max{0, −a_ij}
     (see the sketch below).
  • Prescribe size of the aggregates (e.g. min 8, max 12 for 3D).
  • Minimize fill-in.
  • Make aggregates round in the isotropic case.
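A sketch of the strength criterion for one matrix row, assuming a simple CRS
structure; the struct and function names are illustrative and not the
DUNE-ISTL aggregation code:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Simple CRS storage assumed throughout the following sketches.
    struct CRSMatrix {
      std::vector<std::size_t> rowPtr, col;
      std::vector<double> val;
      double operator()(std::size_t i, std::size_t j) const {  // slow lookup, fine for a sketch
        for (std::size_t k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
          if (col[k] == j) return val[k];
        return 0.0;
      }
    };

    // Neighbours j strongly connected to i:
    // a^-_ij a^-_ji / (a_ii a_jj) > alpha * gamma_i, with a^-_ij = max(0, -a_ij).
    std::vector<std::size_t> strongNeighbours(const CRSMatrix& A, std::size_t i,
                                              double alpha /* e.g. 1/3 */) {
      const auto neg = [](double a) { return std::max(0.0, -a); };
      const auto measure = [&](std::size_t k) {
        const std::size_t j = A.col[k];
        return neg(A.val[k]) * neg(A(j, i)) / (A(i, i) * A(j, j));
      };
      double gamma = 0.0;                                   // gamma_i
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
        if (A.col[k] != i) gamma = std::max(gamma, measure(k));
      std::vector<std::size_t> strong;
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
        if (A.col[k] != i && measure(k) > alpha * gamma)
          strong.push_back(A.col[k]);
      return strong;
    }

The greedy aggregation algorithm then grows aggregates along these strong
connections until the prescribed size is reached.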


Parallelization Approach


Goals
  • Reuse efficient sequential data structures and algorithms for
     aggregation and matrix-vector-products.
  • Agglomerate data onto fewer processes on coarser levels.
  • The smoother is hybrid Gauss-Seidel: Gauss-Seidel within each process,
     Jacobi-like coupling across process boundaries (see the sketch below).

Approach
  • Separate decomposition and communication information from data
     structures.
  • Use simple and portable index identification for data items.
  • Data structures need to be augmented to contain ghost items.
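A sketch of the hybrid Gauss-Seidel smoother named in the goals:
Gauss-Seidel over the locally owned rows while ghost values stay fixed. It
reuses the illustrative CRSMatrix struct from the aggregation sketch, and
the ghost update is passed in as a callable standing for the communication
step described on the following slides:

    #include <cstddef>
    #include <vector>

    template <class GhostUpdate>
    void hybridGaussSeidel(const CRSMatrix& A, std::vector<double>& x,
                           const std::vector<double>& b, std::size_t nOwned,
                           GhostUpdate updateGhosts) {
      // One sweep over the owned rows [0, nOwned); ghost entries of x are
      // read but not written, i.e. Jacobi-like coupling between processes.
      for (std::size_t i = 0; i < nOwned; ++i) {
        double sigma = 0.0, diag = 1.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
          if (A.col[k] == i) diag = A.val[k];
          else               sigma += A.val[k] * x[A.col[k]];  // may read ghost values
        }
        x[i] = (b[i] - sigma) / diag;
      }
      updateGhosts(x);  // refresh the ghost copies after the sweep
    }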





Index Sets


Index Set
  • Distributed overlapping index set I = ∪_{p=0}^{P−1} I_p
  • Process p manages the mapping I_p → [0, n_p).
  • Might only store information about the mapping for shared indices.

Global Index
  • Identifies a position (index) globally.
  • Arbitrary and not consecutive (to support adaptivity).
  • Persistent.
  • On JUGENE this is not an int, in order to get rid of the 32-bit limit!



Local Index
  • Addresses a position in the local container.
  • Convertible to an integral type.
  • Consecutive index starting from 0.
  • Non-persistent.
  • Provides an attribute to identify the ghost region (see the sketch below).
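The three notions above, condensed into one illustrative data structure; the
names are mine and the actual DUNE classes differ:

    #include <cstdint>
    #include <vector>

    // Attribute distinguishes owner entries from ghost copies.
    enum class Attribute : std::uint8_t { owner, ghost };

    struct IndexPair {
      std::uint64_t global;   // persistent, arbitrary, not consecutive;
                              // 64-bit to escape the 32-bit limit on JUGENE
      std::uint32_t local;    // consecutive position in the local containers,
                              // starting at 0, not persistent
      Attribute attribute;    // marks the ghost region
    };

    // Per process p: the (partial) mapping I_p -> [0, n_p); typically only
    // the shared (boundary and ghost) indices are stored explicitly.
    using LocalIndexSet = std::vector<IndexPair>;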




Remote Information and Communication


Remote Index Information
  • For each process q, process p knows all shared global indices together
     with their attribute on q.

Communication
  • The target and source partitions of the indices are chosen via attribute
     flags, e.g. from ghost to owner and ghost.
  • If remote index information about q is available on p, then p sends all
     the data in one message (see the sketch below).
  • Communication takes place collectively at the same time.
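A sketch of one such communication step, one message per neighbour; the
per-neighbour index lists are assumed to have been derived beforehand from
the remote index information and the attribute flags, and this is not the
DUNE-ISTL communicator interface:

    #include <mpi.h>
    #include <cstddef>
    #include <map>
    #include <vector>

    // sendIndices[q]: local indices whose values process q needs from us;
    // recvIndices[q]: local (ghost) indices whose values we receive from q.
    void exchange(const std::map<int, std::vector<std::size_t>>& sendIndices,
                  const std::map<int, std::vector<std::size_t>>& recvIndices,
                  std::vector<double>& x, MPI_Comm comm) {
      std::vector<MPI_Request> req;
      std::map<int, std::vector<double>> sendBuf, recvBuf;

      for (const auto& [q, idx] : recvIndices) {            // post all receives
        recvBuf[q].resize(idx.size());
        req.emplace_back();
        MPI_Irecv(recvBuf[q].data(), static_cast<int>(idx.size()), MPI_DOUBLE,
                  q, 0, comm, &req.back());
      }
      for (const auto& [q, idx] : sendIndices) {            // one message per neighbour
        auto& buf = sendBuf[q];
        for (std::size_t i : idx) buf.push_back(x[i]);      // gather values for q
        req.emplace_back();
        MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE,
                  q, 0, comm, &req.back());
      }
      MPI_Waitall(static_cast<int>(req.size()), req.data(), MPI_STATUSES_IGNORE);

      for (const auto& [q, idx] : recvIndices)              // scatter into ghosts
        for (std::size_t i = 0; i < idx.size(); ++i)
          x[idx[i]] = recvBuf[q][i];
    }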






Parallel Matrix Representation

  • Let {I_i} be a nonoverlapping decomposition of our index set I.
  • Ĩ_i is the augmented index set such that for all j ∈ I_i every k with
     |a_kj| + |a_jk| ≠ 0 is also contained in Ĩ_i.
  • Then the locally stored matrix has the block form (rows/columns ordered
     as I_i first, then Ĩ_i \ I_i):

                         ( A_{I_i I_i}   *  )
                         (                  )
                         (      0        I  )
  • Therefore A v can be computed locally for the entries associated with
     I_i if v is known on Ĩ_i (see the sketch below).
  • A communication step ensures consistent ghost values.
  • Matrix can be used for hybrid preconditioners.
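A sketch of the resulting local matrix-vector product, again with the
illustrative CRSMatrix struct; the subsequent ghost update would use a
communication step like the exchange() sketch above:

    #include <cstddef>
    #include <vector>

    // y = A x on the rows in I_i; columns may refer to ghost entries of x,
    // which must be consistent before the call. Ghost rows of y are left
    // untouched and are made consistent by a communication step if a hybrid
    // preconditioner needs them.
    void localMatVec(const CRSMatrix& A, const std::vector<double>& x,
                     std::vector<double>& y, std::size_t nOwned) {
      for (std::size_t i = 0; i < nOwned; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
          sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
      }
    }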



Parallel Setup Phase




  • Every process builds aggregates in its owner region.
  • One communication with next neighbors to update aggregate
     information in ghost region.
  • Coarse level index sets are a subset of the fine level.
  • Remote index information can be deduced locally.
  • The Galerkin product A_c = P^T A P can be calculated locally (sketched
     below).
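A sketch of the local Galerkin product for the piecewise-constant
prolongator, using the aggregate map and the illustrative CRSMatrix struct;
a real implementation would assemble directly into a sparse matrix instead
of a map:

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // (A_c)_{CD} = sum of a_ij over i in aggregate C and j in aggregate D.
    using CoarseMatrix = std::map<std::pair<std::size_t, std::size_t>, double>;

    CoarseMatrix galerkinProduct(const CRSMatrix& A,
                                 const std::vector<std::size_t>& agg) {
      CoarseMatrix Ac;
      for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i)         // all local rows
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
          Ac[{agg[i], agg[A.col[k]]}] += A.val[k];                   // accumulate coarse entry
      return Ac;
    }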






Data Agglomeration on Coarse Levels



   Figure: Schematic level hierarchy, plotted as number of vertices per
   processor versus number of non-idle processors: levels L, L−1, ..., L−7,
   with primed levels (L−2)', L−4', L−6' denoting levels repartitioned onto
   fewer processes ("coarsen" and "target" sizes marked on the vertical axis).

     • Repartition the data onto fewer processes (see the sketch below).
     • Use METIS on the graph of the communication pattern. (ParMETIS cannot
        handle the full machine!)
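A sketch of the repartitioning decision suggested by the figure; the target
size and the helper function are my assumptions, and the actual
repartitioning of the communication graph (METIS, e.g. METIS_PartGraphKway)
is omitted:

    #include <algorithm>
    #include <cstddef>

    // Once the average number of vertices per active process falls below the
    // target, shrink the set of non-idle processes so each survivor again
    // holds roughly `target` vertices; the others idle on this and all
    // coarser levels.
    std::size_t activeProcesses(std::size_t globalVertices,
                                std::size_t currentProcs,
                                std::size_t target) {
      if (globalVertices / currentProcs >= target)
        return currentProcs;                     // still enough work everywhere
      return std::max<std::size_t>(1, globalVertices / target);
    }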






Weak Scalability Results



   Figure: Permeability pattern of the test problem; the cross-section shows
   a 3 × 3 checkerboard-like layout with values 0.01, 1, and 1000 (1000 in
   the centre block).
  • Test problem from Griebel et al.
  • −∇ · (K ∇u) = 0
  • Dirichlet boundary conditions
  • 32^3 unknowns per process (87 · 10^9 unknowns for the biggest problem)



Weak Scalability Results

   Figure: Weak scaling on JUGENE: computation time [s] (0 to 140) versus the
   number of processors (1 to 10^6, logarithmic), shown separately for build
   time, solve time, and total time.




Weak Scalability Results II


  • Richards equation, i.e. water flow in partially saturated media.
  • One large random heterogeneity field.
  • 64 × 64 × 128 unknowns per process.
  • 1.25 · 10^11 unknowns on the full JUGENE.
  • One time step of the simulation.
  • Model and discretization due to Olaf Ippisch.
       procs        T (s)
           1      1133.44
      294849      3543.34





Summary



 • Optimal convergence is not everything for parallel iterative solvers.
 • Current architectures demand low memory footprints to tackle big
    problems.
 • Still, we were able to achieve reasonable scalability for large problems
    with jumping coefficients.
 • The software is open source and part of the ISTL module of the
    Distributed and Unified Numerics Environment (DUNE).



