RX-Solvers Presentation at SIAM CSE15

Published in:
Science


- 1. Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures
  - Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li, Osni A. Marques, Eric Roman, Samuel Williams (Lawrence Berkeley National Laboratory)
  - Andrew Barker, Panayot Vassilevski (Lawrence Livermore National Laboratory)
  - Delyan Kalchev (University of Colorado, Boulder)
- 2. What this talk is about: performance optimization, comparison, and modeling of a novel shared-memory algebraic-multigrid solver, using the SPE10 reservoir-modeling problem on a node of a Cray XC30 and on a Xeon Phi.
- 3. How our multigrid solver works. Repeat until converged:
  - pre-smoothing: $y \leftarrow x + M^{-1}(b - Ax)$
  - coarse-grid correction: $z \leftarrow y + P A_c^{-1} P^T (b - Ay)$
  - post-smoothing: $x \leftarrow z + M^{-1}(b - Az)$
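The cycle above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the solver from the talk: the test matrix is a small 1D Poisson matrix, the smoother $M^{-1}$ is damped Jacobi, $P$ is linear interpolation, the coarse matrix is formed as $P^T A P$ (the product whose dimensions match the correction step), and the coarse system is solved exactly by Gaussian elimination. All function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (stand-in coarse solver)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def two_grid_cycle(A, P, Ac, b, x, omega=2.0 / 3.0):
    def smooth(v):
        # v + M^{-1}(b - A v), with M^{-1} = omega * D^{-1} (assumed smoother)
        r = [bi - avi for bi, avi in zip(b, matvec(A, v))]
        return [vi + omega * ri / A[i][i] for i, (vi, ri) in enumerate(zip(v, r))]

    y = smooth(x)                                        # pre-smoothing
    r = [bi - ayi for bi, ayi in zip(b, matvec(A, y))]
    ec = solve_dense(Ac, matvec(transpose(P), r))        # A_c^{-1} P^T (b - A y)
    z = [yi + ci for yi, ci in zip(y, matvec(P, ec))]    # coarse-grid correction
    return smooth(z)                                     # post-smoothing

# Small 1D Poisson example: 7 fine points, 3 coarse points.
n, nc = 7, 3
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
P = [[0.0] * nc for _ in range(n)]
for j in range(nc):
    c = 2 * j + 1          # fine index of coarse point j
    P[c][j] = 1.0
    P[c - 1][j] = 0.5
    P[c + 1][j] = 0.5
Ac = matmul(matmul(transpose(P), A), P)

b = [1.0] * n
x = [0.0] * n
for _ in range(20):        # "repeat until converged", fixed count for the sketch
    x = two_grid_cycle(A, P, Ac, b, x)
residual = max(abs(bi - axi) for bi, axi in zip(b, matvec(A, x)))
```

In the talk the coarse solve is done by PCG or by the HSS solver rather than by dense elimination; the structure of the cycle is the same.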
- 4. How we construct the interpolator: $P = S\hat{P}$
- 5. How we construct the coarse-grid matrix: $A_c = P^T A P$
- 6. What the SPE10 problem is and how we are solving it (credit: http://www.spe10.org)
  - oil-reservoir modeling benchmark problem
  - solved using Darcy's equation (in primal form), $-\nabla \cdot (\kappa(x) \nabla p(x)) = f(x)$, where $p(x)$ is the pressure and $\kappa(x)$ is the permeability
  - defined over a 60 × 220 × 85 grid
  - with isotropic and anisotropic versions
- 8. What are the machines that we study?

  | | Edison | Babbage |
  |---|---|---|
  | name | Ivy Bridge | Knights Corner |
  | model | Xeon E5-2695 v2 | Xeon Phi 5110P |
  | clock speed | 2.4 GHz | 1.053 GHz |
  | cores | 12 | 60 |
  | SMT threads | 2 | 4 |
  | SIMD width | 4 | 8 |
  | peak gflop/s | 230.4 | 1010.88 |
  | bandwidth | 48.5 GB/s | 122.9 GB/s |
  | L1-D cache (per core) | 32 KB | 32 KB |
  | L2 cache (per core) | 256 KB | 512 KB |
  | L3 cache (shared) | 30 MB | none |
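The peak rates in the table can be reproduced from the other rows. The sketch below assumes two floating-point operations per SIMD lane per cycle (one multiply and one add); that factor is not stated on the slide, but it is the value that matches both listed peaks. The function name and the derived flop-per-byte ratio are illustrative additions.

```python
def peak_gflops(clock_ghz, cores, simd_width, flops_per_lane_per_cycle=2):
    """Peak rate = clock x cores x SIMD lanes x flops per lane per cycle.

    The default of 2 flops per lane per cycle is an assumption.
    """
    return clock_ghz * cores * simd_width * flops_per_lane_per_cycle

edison = peak_gflops(2.4, 12, 4)      # table lists 230.4 gflop/s
babbage = peak_gflops(1.053, 60, 8)   # table lists 1010.88 gflop/s

# Flops that must be done per byte of memory traffic to reach the peak.
edison_balance = edison / 48.5
babbage_balance = babbage / 122.9
```

Both machines need several flops per byte moved to run at peak, which is far more than a sparse matrix-vector product provides, so the solver kernels are expected to be limited by bandwidth rather than by arithmetic.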
- 9. What the coarse-grid system is: n = 7,782; nnz = 1,412,840; nnz/n = 181.6
- 10. How we chose the preconditioner for PCG

  | preconditioner | operator |
  |---|---|
  | Jacobi | $z = D^{-1} r$ |
  | symmetric Gauss–Seidel (SGS) | $z = (L + D)^{-1} D (L + D)^{-T} r$ |

  with the splitting $A_c = L + D + L^T$.
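A minimal sketch of how the two operators above can be applied to a residual $r$, assuming a small dense symmetric matrix stored as a list of rows (the solver in the talk works on sparse matrices). The SGS operator is applied right to left as written: a backward substitution with $(L + D)^T$, a diagonal scaling, then a forward substitution with $L + D$. The function names and the example matrix are hypothetical.

```python
def jacobi_apply(A, r):
    """z = D^{-1} r"""
    return [ri / A[i][i] for i, ri in enumerate(r)]

def sgs_apply(A, r):
    """z = (L + D)^{-1} D (L + D)^{-T} r for symmetric A = L + D + L^T."""
    n = len(r)
    # Backward substitution: solve (L + D)^T u = r.
    u = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[j][i] * u[j] for j in range(i + 1, n))
        u[i] = (r[i] - s) / A[i][i]
    # Diagonal scaling: v = D u.
    v = [A[i][i] * u[i] for i in range(n)]
    # Forward substitution: solve (L + D) z = v.
    z = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * z[j] for j in range(i))
        z[i] = (v[i] - s) / A[i][i]
    return z

# Hypothetical 3 x 3 symmetric example.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
r = [1.0, 2.0, 3.0]
z_jac = jacobi_apply(A, r)
z_sgs = sgs_apply(A, r)
```

The two triangular solves are inherently sequential, whereas the Jacobi scaling is independent per entry, which is one reason the simpler preconditioner can be competitive on many threads despite needing more iterations.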
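- 11. How we chose the preconditioner for PCG

  | | | unpreconditioned | Jacobi | SGS |
  |---|---|---|---|---|
  | conditioning | isotropic | $3.37 \times 10^4$ | $1.35 \times 10^3$ | $1.83 \times 10^2$ |
  | | anisotropic | $9.68 \times 10^6$ | $1.89 \times 10^4$ | $2.91 \times 10^3$ |
  | iterations | isotropic | 605.53 | 194.57 | 78.87 |
  | | anisotropic | 1,267.85 | 288.32 | 122.85 |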
- 12. How we chose the preconditioner for PCG

  | time (s) | SGS, 1 thread | Jacobi, 1 thread | Jacobi, 12 threads |
  |---|---|---|---|
  | isotropic | 83.0 | 80.3 | 29.2 |
  | anisotropic | 128.6 | 121.6 | 43.8 |
- 13. Where does the AMG cycle spend most of its time? [Figure: runtime (s) versus number of threads (1 to 120), with curves for smoothing, PCG, and total.]
- 14. How to improve the performance of PCG

      while not converged do
          ρ ← σ
          omp parallel for: w ← Ap
          omp parallel for: τ ← w · p
          α ← ρ/τ
          omp parallel for: x ← x + αp
          omp parallel for: r ← r − αw
          omp parallel for: z ← M⁻¹r
          omp parallel for: σ ← z · r
          β ← σ/ρ
          omp parallel for: p ← z + βp
      end while
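For reference, the loop above can be written as a serial routine. This sketch follows the slide's variable names and step order; the initialization (zero initial guess, $p = z = M^{-1} r$, $\sigma = z \cdot r$), the relative-residual stopping test, and the Jacobi preconditioner in the example are assumptions, since the slide shows only the loop body. Each list comprehension corresponds to one of the `omp parallel for` steps.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, apply_prec, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients; returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n                  # assumed initial guess
    r = b[:]                       # r = b - A x with x = 0
    z = apply_prec(r)
    p = z[:]
    sigma = dot(z, r)
    b_norm = math.sqrt(dot(b, b))
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:   # assumed stopping test
            return x, it
        rho = sigma
        w = matvec(A, p)
        tau = dot(w, p)
        alpha = rho / tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        z = apply_prec(r)
        sigma = dot(z, r)
        beta = sigma / rho
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, max_iter

# Hypothetical example: tridiagonal SPD matrix with a Jacobi preconditioner.
n = 10
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, iters = pcg(A, b, lambda r: [ri / A[i][i] for i, ri in enumerate(r)])
residual = max(abs(bi - axi) for bi, axi in zip(b, matvec(A, x)))
```

Every iteration contains two dot products whose results are needed before the next vector update can start, which is where the synchronization points in the threaded versions come from.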
- 15. How to improve the performance of PCG

      omp parallel
          while not converged do
              omp single: τ ← 0.0                   (implied barrier)
              omp single nowait: ρ ← σ, σ ← 0.0
              omp for nowait: w ← Ap
              omp for reduction: τ ← w · p          (implied barrier)
              α ← ρ/τ
              omp for nowait: x ← x + αp
              omp for nowait: r ← r − αw
              omp for nowait: z ← M⁻¹r
              omp for reduction: σ ← z · r          (implied barrier)
              β ← σ/ρ
              omp for nowait: p ← z + βp
          end while
      end omp parallel
- 16. How to improve the performance of PCG

      omp parallel
          while not converged do
              omp for: w ← Ap
              omp single
                  τ ← w · p
                  α ← ρ/τ
                  x ← x + αp
                  r ← r − αw
                  z ← M⁻¹r
                  ρ ← σ
                  σ ← z · r
                  β ← σ/ρ
                  p ← z + βp
              end omp single
          end while
      end omp parallel
- 17. How to improve the performance of PCG

      while not converged do
          ρ ← σ
          omp parallel for: w ← Ap
          τ ← w · p
          α ← ρ/τ
          x ← x + αp
          r ← r − αw
          z ← M⁻¹r
          σ ← z · r
          β ← σ/ρ
          p ← z + βp
      end while
- 18. How to improve the performance of PCG [Figure: runtime (s) versus number of threads (1 to 120) for Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4.]
- 19. How the sparse HSS solver works
  - sparse matrix-factorization algorithm
  - represents the frontal matrices as hierarchically semiseparable (HSS) matrices
  - uses randomized sampling for faster compression
  - [Figure: block structure of an HSS matrix, with diagonal blocks $D_i$ and off-diagonal blocks of the form $U_i B_i V_j^H$.]
  - More details in Pieter Ghysels' talk tomorrow!
- 20. How do the parameters of the solver affect performance?

  | Parameter | Values |
  |---|---|
  | coarse solver | HSS, PCG |
  | elements per agglomerate | 64, 128, 256, 512 |
  | $\nu_P$ | 0, 1, 2 |
  | $\nu_{M^{-1}}$ | 1, 3, 5 |
  | $\theta$ | 0.001, $0.001 \times 10^{0.5}$, 0.01 |
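If every combination of the listed values is tried (an assumption; the slide does not say whether the full grid was run), the search space is the Cartesian product of the five rows. The dictionary keys below are hypothetical names for the parameters.

```python
from itertools import product

grid = {
    "coarse_solver": ["HSS", "PCG"],
    "elements_per_agglomerate": [64, 128, 256, 512],
    "nu_P": [0, 1, 2],
    "nu_M_inv": [1, 3, 5],
    # Three values spaced by a factor of 10**0.5 between 0.001 and 0.01.
    "theta": [0.001, 0.001 * 10 ** 0.5, 0.01],
}

# One dict per configuration, e.g. {"coarse_solver": "HSS", ...}.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
n_configs = len(configs)            # 2 * 4 * 3 * 3 * 3 = 216
per_solver = n_configs // len(grid["coarse_solver"])
```

Under that assumption there are 216 configurations per machine, 108 for each coarse solver.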
- 21. How do the parameters of the solver affect performance? [Figure: runtime (s) versus percentile rank (1% to 64%) for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG), with the default configuration marked.]
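- 22. What our performance model is

  | stage | bytes | flops |
  |---|---|---|
  | pre- and post-smooth | $(3\nu + 1)(12\,\mathrm{nz}_a + 3 \cdot 8n)$ | $2(3\nu + 1)(\mathrm{nz}_a + 2n)$ |
  | restriction | $12\,\mathrm{nz}_a + 12\,\mathrm{nz}_p + 3 \cdot 8n$ | $2(\mathrm{nz}_a + \mathrm{nz}_p)$ |
  | one coarse solve: multiply by $A_c$ | $12\,\mathrm{nz}_c$ | $2\,\mathrm{nz}_c$ |
  | one coarse solve: preconditioner | $2 \cdot 8n_c$ | $n_c$ |
  | one coarse solve: vector operations | $5 \cdot 8n_c$ | $2 \cdot 5n_c$ |
  | interpolation | $12\,\mathrm{nz}_p + 8n$ | $2\,\mathrm{nz}_p$ |
  | stopping criterion | $12\,\mathrm{nz}_a + 4 \cdot 8n$ | $2(\mathrm{nz}_a + n)$ |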
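The table can be turned into runtime bounds by dividing bytes by memory bandwidth and flops by the peak rate. The sketch below is one reading of the model, with these assumptions: the three coarse-solve rows are summed into a single cost, the 12 bytes per nonzero are presumably an 8-byte value plus a 4-byte index, and the fine-grid sizes `n`, `nz_a`, `nz_p` and the value of `nu` in the example are made-up placeholders. Only the coarse sizes (n = 7,782; nnz = 1,412,840) and the Edison bandwidth and peak come from the slides.

```python
def stage_costs(nu, n, nz_a, nz_p, n_c, nz_c):
    """Return {stage: (bytes, flops)} following the table row by row."""
    return {
        "pre_and_post_smooth": ((3 * nu + 1) * (12 * nz_a + 3 * 8 * n),
                                2 * (3 * nu + 1) * (nz_a + 2 * n)),
        "restriction": (12 * nz_a + 12 * nz_p + 3 * 8 * n,
                        2 * (nz_a + nz_p)),
        # Sum of: multiply by A_c, preconditioner, vector operations.
        "coarse_rows": (12 * nz_c + 2 * 8 * n_c + 5 * 8 * n_c,
                        2 * nz_c + n_c + 2 * 5 * n_c),
        "interpolation": (12 * nz_p + 8 * n, 2 * nz_p),
        "stopping_criterion": (12 * nz_a + 4 * 8 * n, 2 * (nz_a + n)),
    }

def time_bounds(bytes_moved, flops, bandwidth_gb_s, peak_gflops):
    """Lower bounds on runtime in seconds: (memory bound, flops bound)."""
    return bytes_moved / (bandwidth_gb_s * 1e9), flops / (peak_gflops * 1e9)

costs = stage_costs(
    nu=1,                # placeholder
    n=1_000_000,         # placeholder fine-grid size
    nz_a=27_000_000,     # placeholder
    nz_p=3_000_000,      # placeholder
    n_c=7_782,           # coarse-grid size from slide 9
    nz_c=1_412_840,      # coarse-grid nonzeros from slide 9
)
coarse_bytes, coarse_flops = costs["coarse_rows"]
mem_t, flop_t = time_bounds(coarse_bytes, coarse_flops,
                            bandwidth_gb_s=48.5, peak_gflops=230.4)  # Edison
```

With the Edison numbers the memory bound for the coarse rows is larger than the flops bound, so the model predicts that this part of the solve is limited by bandwidth.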
- 23. What our performance model is [Figure: runtime (s) versus number of cores (1 to 12), comparing the memory bound, the flops bound, and the actual runtime.]
- 24. Final comments
  - HSS is an attractive option for solving coarse systems
  - performance is quite sensitive to parameter tuning
  - performance model indicates where the bottlenecks are
- 25. Thank you!
