



  Scaling Algebraic Multigrid to over 287K processors

                                 Markus Blatt
                     Markus.Blatt@iwr.uni-heidelberg.de
                   Interdisziplinäres Zentrum für wissenschaftliches Rechnen
                                 Universität Heidelberg


             Copper Mountain Conference on Multigrid Methods
                             March 28, 2011






Outline


1   Motivation

2   IBM Blue Gene/P

3   Algebraic Multigrid

4   Parallelization Approach

5   Scalability






Motivation




                     Figure: Cut through the ground beneath an acre

  • Highly discontinuous permeability of the ground.
  • 3D Simulations with high resolution.
  • Efficient and robust parallel iterative solvers.

                               −∇ · (K(x) ∇u) = f   in Ω


IBM Blue Gene/P
JUGENE in Jülich




   • 72 racks, 73728 nodes (294912 cores)
   • Overall peak performance: 1 Petaflop/s
   • Main memory: 144 TB
   • Compute card / processor
       • PowerPC 450, 32-bit, 850 MHz, 4 cores
       • Main memory: 2 GB
   • Top500 list: rank 9 worldwide (rank 2 in Europe)


Challenges on IBM Blue Gene/P


Challenging Facts
  • Only 500MB main memory/core in VM-mode!
  • 32-bit architecture, but 144 TB memory!
  • Utilize many cores!
  • Idle cores for coarse problems!

Crucial Points to be Addressed
  • Be memory efficient, even if that means accepting methods that converge
     more slowly.
  • Be able to address 10^12 degrees of freedom (see the estimate below).
  • Speed up matrix-vector products as much as possible!
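To see why, here is a quick back-of-the-envelope check; only the 144 TB of
main memory and the 10^12-unknown target are taken from the slides, the
7-point-stencil CRS cost is my own assumption:

    #include <cstdio>

    int main() {
      // Machine-wide budget from the JUGENE slide: 144 TB of main memory.
      const double totalBytes = 144.0 * (1ULL << 40);   // 144 TiB
      const double targetDofs = 1e12;                    // goal: 10^12 unknowns

      // Budget per unknown for *everything*: matrix hierarchy, vectors,
      // index sets, MPI buffers, ...
      std::printf("budget per unknown: %.0f bytes\n", totalBytes / targetDofs);

      // Assumed cost of one fine-level CRS matrix row for a 7-point stencil
      // (8-byte value + 4-byte column index per entry, one row offset),
      // which already consumes a large share of that budget.
      std::printf("one 7-point CRS row: %.0f bytes\n", 7.0 * (8 + 4) + 8);
      return 0;
    }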





Algebraic Multigrid Methods
Interpolation AMG
  • Split the unknowns into C-nodes (coarse-grid unknowns) and F-nodes (all
     other unknowns).
  • Construct operator-dependent interpolation operators.
  • Optimal number of iterations for model problems.
  • Non-optimal operator complexity.
  • Convergence behavior not optimal for discontinuous coefficients.

Aggregation AMG
  • Builds aggregates of strongly connected unknowns.
  • Each aggregate is represented by a coarse-level unknown.
  • (Tentative) piecewise-constant prolongators (see the sketch below).
  • For smoothed aggregation the prolongation operator is improved via
     smoothing.
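As a minimal illustration of the piecewise-constant (tentative) prolongator,
assuming an aggregate map agg[i] that stores the coarse unknown of fine
unknown i (my notation, not the DUNE-ISTL interface):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // x_f += P x_c : every fine unknown receives the value of its aggregate.
    void prolongate(const std::vector<std::size_t>& agg,
                    const std::vector<double>& xc, std::vector<double>& xf) {
      for (std::size_t i = 0; i < xf.size(); ++i)
        xf[i] += xc[agg[i]];
    }

    // d_c = P^T d_f : the coarse defect sums the fine defect over each aggregate.
    void restrictDefect(const std::vector<std::size_t>& agg,
                        const std::vector<double>& df, std::vector<double>& dc) {
      std::fill(dc.begin(), dc.end(), 0.0);
      for (std::size_t i = 0; i < df.size(); ++i)
        dc[agg[i]] += df[i];
    }

Smoothed aggregation would additionally smooth the columns of this tentative
P, e.g. with one damped Jacobi step.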


Improved Algebraic Multigrid Methods



Improved Variants
  • Interpolation-based AMG with aggressive coarsening (less fill-in).
  • Adaptive AMG (better convergence, but more fill-in).
  • AMG based on compatible relaxation (are there efficient
     implementations?).
  • K-cycle Aggregation-AMG (good operator complexity, but more work
     on levels).






Our AMG Method



 • Non-smoothed aggregation AMG.
 • Heuristic and greedy aggregation algorithm.
 • Fast setup time.
 • Very memory efficient.
 • Fast and scalable V-cycle.
 • Used only as a preconditioner for Krylov methods (BiCGSTAB in the
    examples).






Sequential Aggregation




  • Strong connection (j, i):  a⁻_ij a⁻_ji / (a_ii a_jj) > α γ_i,
     with γ_i = max_{j≠i} a⁻_ij a⁻_ji / (a_ii a_jj)  and  a⁻_ij = max{0, −a_ij}
     (see the sketch below).
  • Prescribe size of the aggregates (e.g. min 8, max 12 for 3D).
  • Minimize fill-in.
  • Make aggregates round in the isotropic case.
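A sketch of the strength criterion for one matrix row, assuming a simple CRS
structure; the struct and function names are illustrative and not the
DUNE-ISTL aggregation code:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Simple CRS storage assumed throughout the following sketches.
    struct CRSMatrix {
      std::vector<std::size_t> rowPtr, col;
      std::vector<double> val;
      double operator()(std::size_t i, std::size_t j) const {  // slow lookup, fine for a sketch
        for (std::size_t k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
          if (col[k] == j) return val[k];
        return 0.0;
      }
    };

    // Neighbours j strongly connected to i:
    // a^-_ij a^-_ji / (a_ii a_jj) > alpha * gamma_i, with a^-_ij = max(0, -a_ij).
    std::vector<std::size_t> strongNeighbours(const CRSMatrix& A, std::size_t i,
                                              double alpha /* e.g. 1/3 */) {
      const auto neg = [](double a) { return std::max(0.0, -a); };
      const auto measure = [&](std::size_t k) {
        const std::size_t j = A.col[k];
        return neg(A.val[k]) * neg(A(j, i)) / (A(i, i) * A(j, j));
      };
      double gamma = 0.0;                                   // gamma_i
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
        if (A.col[k] != i) gamma = std::max(gamma, measure(k));
      std::vector<std::size_t> strong;
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
        if (A.col[k] != i && measure(k) > alpha * gamma)
          strong.push_back(A.col[k]);
      return strong;
    }

The greedy aggregation algorithm then grows aggregates along these strong
connections until the prescribed size is reached.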


Parallelization Approach


Goals
  • Reuse efficient sequential data structures and algorithms for
     aggregation and matrix-vector-products.
  • Agglomerate data onto fewer processes on coarser levels.
  • The smoother is hybrid Gauss-Seidel: Gauss-Seidel within each process,
     Jacobi-like coupling across process boundaries (see the sketch below).

Approach
  • Separate decomposition and communication information from data
     structures.
  • Use simple and portable index identification for data items.
  • Data structures need to be augmented to contain ghost items.
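A sketch of the hybrid Gauss-Seidel smoother named in the goals:
Gauss-Seidel over the locally owned rows while ghost values stay fixed. It
reuses the illustrative CRSMatrix struct from the aggregation sketch, and
the ghost update is passed in as a callable standing for the communication
step described on the following slides:

    #include <cstddef>
    #include <vector>

    template <class GhostUpdate>
    void hybridGaussSeidel(const CRSMatrix& A, std::vector<double>& x,
                           const std::vector<double>& b, std::size_t nOwned,
                           GhostUpdate updateGhosts) {
      // One sweep over the owned rows [0, nOwned); ghost entries of x are
      // read but not written, i.e. Jacobi-like coupling between processes.
      for (std::size_t i = 0; i < nOwned; ++i) {
        double sigma = 0.0, diag = 1.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
          if (A.col[k] == i) diag = A.val[k];
          else               sigma += A.val[k] * x[A.col[k]];  // may read ghost values
        }
        x[i] = (b[i] - sigma) / diag;
      }
      updateGhosts(x);  // refresh the ghost copies after the sweep
    }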





Index Sets


Index Set
  • Distributed overlapping index set I = ∪_{p=0}^{P−1} I_p
  • Process p manages the mapping I_p → [0, n_p).
  • Might only store information about the mapping for shared indices.

Global Index
  • Identifies a position (index) globally.
  • Arbitrary and not consecutive (to support adaptivity).
  • Persistent.
  • On JUGENE this is not an int, in order to get rid of the 32-bit limit!



Local Index
  • Addresses a position in the local container.
  • Convertible to an integral type.
  • Consecutive index starting from 0.
  • Non-persistent.
  • Provides an attribute to identify the ghost region (see the sketch below).
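The three notions above, condensed into one illustrative data structure; the
names are mine and the actual DUNE classes differ:

    #include <cstdint>
    #include <vector>

    // Attribute distinguishes owner entries from ghost copies.
    enum class Attribute : std::uint8_t { owner, ghost };

    struct IndexPair {
      std::uint64_t global;   // persistent, arbitrary, not consecutive;
                              // 64-bit to escape the 32-bit limit on JUGENE
      std::uint32_t local;    // consecutive position in the local containers,
                              // starting at 0, not persistent
      Attribute attribute;    // marks the ghost region
    };

    // Per process p: the (partial) mapping I_p -> [0, n_p); typically only
    // the shared (boundary and ghost) indices are stored explicitly.
    using LocalIndexSet = std::vector<IndexPair>;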




Remote Information and Communication


Remote Index Information
  • For each process q, process p knows all shared global indices together
     with their attribute on q.

Communication
  • The target and source partitions of the indices are chosen via attribute
     flags, e.g. from ghost to owner and ghost.
  • If remote index information about q is available on p, then p sends all
     the data in one message (see the sketch below).
  • Communication takes place collectively at the same time.
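A sketch of one such communication step, one message per neighbour; the
per-neighbour index lists are assumed to have been derived beforehand from
the remote index information and the attribute flags, and this is not the
DUNE-ISTL communicator interface:

    #include <mpi.h>
    #include <cstddef>
    #include <map>
    #include <vector>

    // sendIndices[q]: local indices whose values process q needs from us;
    // recvIndices[q]: local (ghost) indices whose values we receive from q.
    void exchange(const std::map<int, std::vector<std::size_t>>& sendIndices,
                  const std::map<int, std::vector<std::size_t>>& recvIndices,
                  std::vector<double>& x, MPI_Comm comm) {
      std::vector<MPI_Request> req;
      std::map<int, std::vector<double>> sendBuf, recvBuf;

      for (const auto& [q, idx] : recvIndices) {            // post all receives
        recvBuf[q].resize(idx.size());
        req.emplace_back();
        MPI_Irecv(recvBuf[q].data(), static_cast<int>(idx.size()), MPI_DOUBLE,
                  q, 0, comm, &req.back());
      }
      for (const auto& [q, idx] : sendIndices) {            // one message per neighbour
        auto& buf = sendBuf[q];
        for (std::size_t i : idx) buf.push_back(x[i]);      // gather values for q
        req.emplace_back();
        MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE,
                  q, 0, comm, &req.back());
      }
      MPI_Waitall(static_cast<int>(req.size()), req.data(), MPI_STATUSES_IGNORE);

      for (const auto& [q, idx] : recvIndices)              // scatter into ghosts
        for (std::size_t i = 0; i < idx.size(); ++i)
          x[idx[i]] = recvBuf[q][i];
    }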






Parallel Matrix Representation

  • Let {I_i} be a nonoverlapping decomposition of our index set I.
  • Ĩ_i is the augmented index set such that for all j ∈ I_i every k with
     |a_kj| + |a_jk| ≠ 0 is also contained in Ĩ_i.
  • Then the locally stored matrix has the block form (rows/columns ordered
     as I_i first, then Ĩ_i \ I_i):

                         ( A_{I_i I_i}   *  )
                         (                  )
                         (      0        I  )
  • Therefore A v can be computed locally for the entries associated with
     I_i if v is known on Ĩ_i (see the sketch below).
  • A communication step ensures consistent ghost values.
  • Matrix can be used for hybrid preconditioners.
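A sketch of the resulting local matrix-vector product, again with the
illustrative CRSMatrix struct; the subsequent ghost update would use a
communication step like the exchange() sketch above:

    #include <cstddef>
    #include <vector>

    // y = A x on the rows in I_i; columns may refer to ghost entries of x,
    // which must be consistent before the call. Ghost rows of y are left
    // untouched and are made consistent by a communication step if a hybrid
    // preconditioner needs them.
    void localMatVec(const CRSMatrix& A, const std::vector<double>& x,
                     std::vector<double>& y, std::size_t nOwned) {
      for (std::size_t i = 0; i < nOwned; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
          sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
      }
    }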



Parallel Setup Phase




  • Every process builds aggregates in its owner region.
  • One communication with next neighbors to update aggregate
     information in ghost region.
  • Coarse level index sets are a subset of the fine level.
  • Remote index information can be deduced locally.
  • The Galerkin product A_c = P^T A P can be calculated locally (sketched
     below).
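A sketch of the local Galerkin product for the piecewise-constant
prolongator, using the aggregate map and the illustrative CRSMatrix struct;
a real implementation would assemble directly into a sparse matrix instead
of a map:

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // (A_c)_{CD} = sum of a_ij over i in aggregate C and j in aggregate D.
    using CoarseMatrix = std::map<std::pair<std::size_t, std::size_t>, double>;

    CoarseMatrix galerkinProduct(const CRSMatrix& A,
                                 const std::vector<std::size_t>& agg) {
      CoarseMatrix Ac;
      for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i)         // all local rows
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
          Ac[{agg[i], agg[A.col[k]]}] += A.val[k];                   // accumulate coarse entry
      return Ac;
    }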






Data Agglomeration on Coarse Levels



   Figure: Schematic level hierarchy, plotted as number of vertices per
   processor versus number of non-idle processors: levels L, L−1, ..., L−7,
   with primed levels (L−2)', L−4', L−6' denoting levels repartitioned onto
   fewer processes ("coarsen" and "target" sizes marked on the vertical axis).

     • Repartition the data onto fewer processes (see the sketch below).
     • Use METIS on the graph of the communication pattern. (ParMETIS cannot
        handle the full machine!)
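A sketch of the repartitioning decision suggested by the figure; the target
size and the helper function are my assumptions, and the actual
repartitioning of the communication graph (METIS, e.g. METIS_PartGraphKway)
is omitted:

    #include <algorithm>
    #include <cstddef>

    // Once the average number of vertices per active process falls below the
    // target, shrink the set of non-idle processes so each survivor again
    // holds roughly `target` vertices; the others idle on this and all
    // coarser levels.
    std::size_t activeProcesses(std::size_t globalVertices,
                                std::size_t currentProcs,
                                std::size_t target) {
      if (globalVertices / currentProcs >= target)
        return currentProcs;                     // still enough work everywhere
      return std::max<std::size_t>(1, globalVertices / target);
    }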






Weak Scalability Results



   Figure: Permeability pattern of the test problem; the cross-section shows
   a 3 × 3 checkerboard-like layout with values 0.01, 1, and 1000 (1000 in
   the centre block).
  • Test problem from Griebel et al.
  • −∇ · (K ∇u) = 0
  • Dirichlet boundary conditions
  • 32^3 unknowns per process (87 · 10^9 unknowns for the biggest problem)



Weak Scalability Results

   Figure: Weak scaling on JUGENE: computation time [s] (0 to 140) versus the
   number of processors (1 to 10^6, logarithmic), shown separately for build
   time, solve time, and total time.




Weak Scalability Results II


  • Richards equation, i.e. water flow in partially saturated media.
  • One large random heterogeneity field.
  • 64 × 64 × 128 unknowns per process.
  • 1.25 · 10^11 unknowns on the full JUGENE.
  • One time step of the simulation.
  • Model and discretization due to Olaf Ippisch.
       procs        T (s)
           1      1133.44
      294849      3543.34





Summary



 • Optimal convergence is not everything for parallel iterative solvers.
 • Current architectures demand low memory footprints to tackle big
    problems.
 • Still, we were able to achieve reasonable scalability for large problems
    with jumping coefficients.
 • The software is open source and part of the ISTL module of the
    Distributed and Unified Numerics Environment (DUNE).



