Druinsky_SIAMCSE15
RX-Solvers Presentation at SIAM CSE15
Slide 1: Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures

Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li, Osni A. Marques, Eric Roman, Samuel Williams (Lawrence Berkeley National Laboratory); Andrew Barker, Panayot Vassilevski (Lawrence Livermore National Laboratory); Delyan Kalchev (University of Colorado, Boulder)
Slide 2: What this talk is about

Performance optimization, comparison, and modeling of a novel shared-memory algebraic-multigrid solver, using the SPE10 reservoir-modeling problem, on a node of a Cray XC30 and on a Xeon Phi.
Slide 3: How our multigrid solver works

Repeat until converged:
  pre-smoothing:          y ← x + M⁻¹(b − Ax)
  coarse-grid correction: z ← y + P A_c⁻¹ Pᵀ(b − Ay)
  post-smoothing:         x ← z + M⁻¹(b − Az)
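The three steps of the cycle can be sketched in a few lines of NumPy. This is an illustrative sketch, not the solver's actual code: `M_inv`, `P`, and `Ac_solve` are placeholder names, `P` is taken to be the n × n_c interpolation matrix (so the Galerkin coarse matrix appears as Pᵀ A P), and the toy problem is a 1D Laplacian with a damped-Jacobi smoother and piecewise-constant interpolation.

```python
import numpy as np

def two_level_cycle(A, b, x, M_inv, P, Ac_solve):
    """One iteration of the cycle on the slide. M_inv applies the smoother
    M^-1, P interpolates coarse -> fine, Ac_solve solves with the Galerkin
    coarse matrix (all helper names here are hypothetical)."""
    y = x + M_inv(b - A @ x)                  # pre-smoothing
    z = y + P @ Ac_solve(P.T @ (b - A @ y))   # coarse-grid correction
    return z + M_inv(b - A @ z)               # post-smoothing

# toy usage: 1D Laplacian, damped-Jacobi smoother, aggregation-based P
n = 16
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
P = np.kron(np.eye(n // 2), np.ones((2, 1)))  # piecewise-constant interpolation
Ac = P.T @ A @ P                              # Galerkin coarse-grid matrix
M_inv = lambda r: (2.0 / 3.0) * r / np.diag(A)  # damped Jacobi, omega = 2/3
x = np.zeros(n)
for _ in range(50):
    x = two_level_cycle(A, b, x, M_inv, P, lambda rc: np.linalg.solve(Ac, rc))
```

The residual shrinks by several orders of magnitude over these 50 cycles, since the smoother damps oscillatory error and the coarse correction removes the smooth remainder.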
Slide 4: How we construct the interpolator

[figure illustrating the construction of the interpolator P]
Slide 5: How we construct the coarse-grid matrix

A_c = P A Pᵀ
Slide 6: What the spe10 problem is and how we are solving it

Credit: http://www.spe10.org
- oil-reservoir modeling benchmark problem
- solved using Darcy's equation (in primal form): −∇·(κ(x)∇p(x)) = f(x), where p(x) = pressure and κ(x) = permeability
- defined over a 60 × 220 × 85 grid, with isotropic and anisotropic versions
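To make the PDE concrete, here is a minimal 1D sketch of how the permeability κ enters the discrete operator. The function name, the finite-volume scheme, and the zero-Dirichlet boundaries are all illustrative assumptions; the actual SPE10 benchmark lives on a 60 × 220 × 85 3-D grid.

```python
import numpy as np

def darcy_matrix_1d(kappa_faces, h):
    """Assemble -d/dx(kappa(x) dp/dx) = f on a uniform 1D grid with
    zero-Dirichlet ends. kappa_faces[i] is the permeability on the face
    between cells i-1 and i, so len(kappa_faces) = n + 1 for n cells."""
    n = len(kappa_faces) - 1
    A = np.zeros((n, n))
    for i in range(n):
        # diagonal: flux coefficients of the two faces of cell i
        A[i, i] = (kappa_faces[i] + kappa_faces[i + 1]) / h**2
        if i > 0:
            A[i, i - 1] = -kappa_faces[i] / h**2
        if i + 1 < n:
            A[i, i + 1] = -kappa_faces[i + 1] / h**2
    return A
```

With constant κ = 1 this reduces to the familiar tridiagonal (−1, 2, −1)/h² Laplacian; a strongly varying κ, as in SPE10, makes the matrix badly conditioned, which is why the preconditioner choice later in the deck matters.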
Slide 8: What are the machines that we study?

                  Edison              Babbage
name              Ivy Bridge          Knights Corner
model             Xeon E5-2695 v2     Xeon Phi 5110P
clock speed       2.4 GHz             1.053 GHz
cores             12                  60
SMT threads       2                   4
SIMD width        4                   8
peak gflop/s      230.4               1010.88
bandwidth         48.5 GB/s           122.9 GB/s
L1-D cache/core   32 KB               32 KB
L2 cache/core     256 KB              512 KB
shared L3 cache   30 MB               none
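The peak-gflop/s rows follow from the other rows of the table; a quick sanity check, assuming the peak comes from clock × cores × SIMD lanes × 2 flops per lane per cycle (one fused multiply-add):

```python
def peak_gflops(clock_ghz, cores, simd_width, flops_per_lane_per_cycle=2):
    """Peak flop rate = clock x cores x SIMD lanes x flops/lane/cycle
    (the default 2 assumes one FMA per lane per cycle)."""
    return clock_ghz * cores * simd_width * flops_per_lane_per_cycle

edison = peak_gflops(2.4, 12, 4)      # 230.4 gflop/s, matching the table
babbage = peak_gflops(1.053, 60, 8)   # 1010.88 gflop/s, matching the table
```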
Slide 9: What the coarse-grid system is

n = 7,782; nnz = 1,412,840; nnz/n = 181.6
Slide 10: How we chose the preconditioner for PCG

preconditioner            operator
Jacobi                    z = D⁻¹r
Symmetric Gauss–Seidel    z = (L + D)⁻¹ D (L + D)⁻ᵀ r

where A_c = L + D + Lᵀ
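The two operators in the table can be sketched with dense NumPy; this is only an illustration of the algebra, not the solver's implementation, which presumably applies them to the sparse A_c with triangular sweeps rather than `np.linalg.solve`.

```python
import numpy as np

def jacobi_apply(Ac, r):
    """z = D^-1 r: divide by the diagonal of Ac."""
    return r / np.diag(Ac)

def sgs_apply(Ac, r):
    """z = (L + D)^-1 D (L + D)^-T r, with Ac = L + D + L^T as on the slide."""
    D = np.diag(np.diag(Ac))
    LD = np.tril(Ac)                    # L + D, the lower triangle of Ac
    t = np.linalg.solve(LD.T, r)        # backward sweep: (L + D)^-T r
    return np.linalg.solve(LD, D @ t)   # scale by D, then forward sweep
```

Both operators are symmetric positive definite when A_c is, which is what makes them admissible preconditioners for PCG.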
Slide 11: How we chose the preconditioner for PCG

                  unprecond.   Jacobi       SGS
condition number
  isotropic       3.37 × 10⁴   1.35 × 10³   1.83 × 10²
  anisotropic     9.68 × 10⁶   1.89 × 10⁴   2.91 × 10³
iterations
  isotropic       605.53       194.57       78.87
  anisotropic     1,267.85     288.32       122.85
Slide 12: How we chose the preconditioner for PCG

                SGS        Jacobi
                1 thread   1 thread   12 threads
time (s)
  isotropic     83.0       80.3       29.2
  anisotropic   128.6      121.6      43.8
Slide 13: Where does the AMG cycle spend most of its time?

[figure: runtime (s, log scale 32–2,048) vs. number of threads (1–120) for smoothing, PCG, and total]
Slide 14: How to improve the performance of PCG

Algorithm 1:
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   omp parallel for: τ ← w · p
 5:   α ← ρ/τ
 6:   omp parallel for: x ← x + αp
 7:   omp parallel for: r ← r − αw
 8:   omp parallel for: z ← M⁻¹r
 9:   omp parallel for: σ ← z · r
10:   β ← σ/ρ
11:   omp parallel for: p ← z + βp
12: end while
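The algorithm on this slide maps directly onto a sequential NumPy sketch, where each `omp parallel for` line becomes one vector kernel. This sketch omits the threading entirely (that is the point of the later variants) and adds an explicit convergence check, which the slide leaves implicit in "while not converged".

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient, following the step ordering of
    the slide's Algorithm 1. M_inv applies the preconditioner."""
    x = np.zeros_like(b)
    r = b.copy()
    z = M_inv(r)
    p = z.copy()
    sigma = z @ r
    for _ in range(max_iter):
        rho = sigma               # rho <- sigma
        w = A @ p                 # matrix-vector product
        tau = w @ p               # dot product
        alpha = rho / tau
        x = x + alpha * p         # solution update
        r = r - alpha * w         # residual update
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = M_inv(r)              # apply the preconditioner
        sigma = z @ r
        beta = sigma / rho
        p = z + beta * p          # new search direction
    return x
```

In this form, every vector kernel opens and closes its own parallel region; the next three slides progressively restructure the loop to cut that synchronization cost.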
Slide 15: How to improve the performance of PCG

Algorithm 2:
 1: omp parallel
 2: while not converged do
 3:   omp single: τ ← 0.0                    (implied barrier)
 4:   omp single nowait: ρ ← σ, σ ← 0.0
 5:   omp for nowait: w ← Ap
 6:   omp for reduction: τ ← w · p           (implied barrier)
 7:   α ← ρ/τ
 8:   omp for nowait: x ← x + αp
 9:   omp for nowait: r ← r − αw
10:   omp for nowait: z ← M⁻¹r
11:   omp for reduction: σ ← z · r           (implied barrier)
12:   β ← σ/ρ
13:   omp for nowait: p ← z + βp
14: end while
15: end omp parallel
Slide 16: How to improve the performance of PCG

Algorithm 3:
 1: omp parallel
 2: while not converged do
 3:   omp for: w ← Ap
 4:   omp single
 5:     τ ← w · p
 6:     α ← ρ/τ
 7:     x ← x + αp
 8:     r ← r − αw
 9:     z ← M⁻¹r
10:     ρ ← σ
11:     σ ← z · r
12:     β ← σ/ρ
13:     p ← z + βp
14:   end omp single
15: end while
16: end omp parallel
Slide 17: How to improve the performance of PCG

Algorithm 4:
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   τ ← w · p
 5:   α ← ρ/τ
 6:   x ← x + αp
 7:   r ← r − αw
 8:   z ← M⁻¹r
 9:   σ ← z · r
10:   β ← σ/ρ
11:   p ← z + βp
12: end while
Slide 18: How to improve the performance of PCG

[figure: runtime (s, log scale 16–256) vs. number of threads (1–120) for Algorithms 1–4]
Slide 19: How the sparse HSS solver works

- sparse matrix-factorization algorithm
- represents the frontal matrices as hierarchically-semiseparable (HSS) matrices
- uses randomized sampling for faster compression

[figure: HSS partitioning with diagonal blocks D1, D2, …, low-rank off-diagonal blocks U3B3V6^H and U6B6V3^H, and nested bases built from U3R3 and U6R6]

More details in Pieter Ghysels' talk tomorrow!
Slide 20: How do the parameters of the solver affect performance?

Parameter                  Values
coarse solver              HSS, PCG
elements-per-agglomerate   64, 128, 256, 512
ν_P                        0, 1, 2
ν_{M⁻¹}                    1, 3, 5
θ                          0.001, 0.001 × 10^0.5, 0.01
Slide 21: How do the parameters of the solver affect performance?

[figure: runtime (s, log scale 8–128) vs. percentile rank (1%–64%) for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG), with the default configuration marked]
Slide 22: What our performance model is

stage                   bytes                           flops
pre- and post-smooth    (3ν + 1)(12 nz_a + 3 · 8n)      2(3ν + 1)(nz_a + 2n)
restriction             12 nz_a + 12 nz_p + 3 · 8n      2(nz_a + nz_p)
one coarse solve:
  multiply by A_c       12 nz_c                         2 nz_c
  preconditioner        2 · 8n_c                        n_c
  vector operations     5 · 8n_c                        2 · 5n_c
interpolation           12 nz_p + 8n                    2 nz_p
stopping criterion      12 nz_a + 4 · 8n                2(nz_a + n)
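Each row of the table turns into a runtime bound by dividing the byte count by the machine's bandwidth and the flop count by its peak rate, then taking the larger of the two (a standard roofline-style estimate; I am assuming that is how the model is evaluated). The example uses the multiply-by-A_c row with the coarse-grid size from slide 9 and the Edison numbers from the machine table.

```python
def stage_seconds(nbytes, nflops, bw_gb_s, peak_gflop_s):
    """Lower bound on a stage's runtime: limited either by memory traffic
    or by arithmetic, whichever takes longer."""
    return max(nbytes / (bw_gb_s * 1e9), nflops / (peak_gflop_s * 1e9))

# example: the multiply-by-Ac row (bytes = 12 nz_c, flops = 2 nz_c)
nzc = 1_412_840                      # nnz of the coarse-grid matrix (slide 9)
t = stage_seconds(12 * nzc, 2 * nzc, bw_gb_s=48.5, peak_gflop_s=230.4)
```

Here the memory term dominates by more than an order of magnitude, so this stage is firmly memory bound.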
Slide 23: What our performance model is

[figure: runtime (s, log scale 8–128) vs. number of cores (1–12), comparing the memory-bound and flops-bound model predictions with actual runtimes]
Slide 24: Final comments

- HSS is an attractive option for solving coarse systems
- performance is quite sensitive to parameter tuning
- the performance model indicates where the bottlenecks are
Slide 25: Thank you!