Comparative Performance Analysis of an
Algebraic-Multigrid Solver on
Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li,
Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
What this talk is about
Performance optimization, comparison and modeling of
a novel shared-memory algebraic-multigrid solver
using the SPE10 reservoir-modeling problem
on a node of Cray XC30 and on a Xeon Phi.
How our multigrid solver works
Repeat until converged:
  pre-smoothing:          $y \leftarrow x + M^{-1}(b - Ax)$
  coarse-grid correction: $z \leftarrow y + P A_c^{-1} P^T (b - Ay)$
  post-smoothing:         $x \leftarrow z + M^{-1}(b - Az)$
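
The cycle above can be sketched on a toy problem. This is a minimal illustration, not the solver from the talk: it assumes a 1D Poisson matrix, weighted Jacobi for $M^{-1}$, linear interpolation for $P$ (taken here to map coarse vectors to fine ones, so that $A_c = P^T A P$), and a dense direct solve in place of the coarse solver. All function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (stand-in coarse solver)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def residual(A, x, b):
    return [bi - ai for bi, ai in zip(b, matvec(A, x))]

def smooth(A, x, b, omega=2.0 / 3.0):
    """x + M^{-1}(b - A x) with M = D / omega (weighted Jacobi, an assumed choice)."""
    r = residual(A, x, b)
    return [xi + omega * ri / A[i][i] for i, (xi, ri) in enumerate(zip(x, r))]

def two_grid_cycle(A, P, Ac, x, b):
    y = smooth(A, x, b)                                # pre-smoothing
    rc = matvec(transpose(P), residual(A, y, b))       # P^T (b - A y)
    ec = solve_dense(Ac, rc)                           # A_c^{-1} P^T (b - A y)
    z = [yi + ei for yi, ei in zip(y, matvec(P, ec))]  # coarse-grid correction
    return smooth(A, z, b)                             # post-smoothing

def demo(cycles=10):
    n, nc = 7, 3
    A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
          for j in range(n)] for i in range(n)]
    P = [[0.0] * nc for _ in range(n)]
    for j in range(nc):  # linear interpolation from 3 coarse to 7 fine points
        P[2 * j][j] = 0.5
        P[2 * j + 1][j] = 1.0
        P[2 * j + 2][j] = 0.5
    Ac = matmul(matmul(transpose(P), A), P)
    b, x = [1.0] * n, [0.0] * n
    norm = lambda v: sum(t * t for t in v) ** 0.5
    r0 = norm(residual(A, x, b))
    for _ in range(cycles):
        x = two_grid_cycle(A, P, Ac, x, b)
    return r0, norm(residual(A, x, b))
```

Calling `demo()` returns the initial and final residual norms, which gives a quick check that the three steps reduce the residual together.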
How we construct the interpolator
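[Figure: matrix diagram of the interpolator construction. The surviving labels are S, P, and a second P-like symbol whose accent was lost in extraction.]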
How we construct the coarse-grid matrix
[Figure: matrix diagram of the coarse-grid matrix $A_c$, formed as a triple product of $P$, $A$, and $P^T$.]
What the spe10 problem is and how we are solving it
Credit: http://www.spe10.org
oil-reservoir modeling benchmark problem
solved using Darcy’s equation (in primal form)
$-\nabla \cdot (\kappa(x) \nabla p(x)) = f(x)$,
where p(x) = pressure, and κ(x) = permeability
defined over a 60 × 220 × 85 grid
with isotropic and anisotropic versions
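
To make the role of the permeability concrete, here is a minimal one-dimensional sketch. It is not the discretization used in the talk, which the slides do not specify: it assumes linear finite elements on a uniform grid, a piecewise-constant $\kappa$ per cell, and $p = 0$ at both ends. The function name and the toy permeability values are hypothetical.

```python
def assemble_darcy_1d(kappa, length=1.0):
    """Stiffness matrix for -(kappa p')' = f with p = 0 at both ends.

    kappa[e] is the constant permeability of cell e. There are len(kappa)
    cells and len(kappa) - 1 interior unknowns; unknown i sits between
    cell i and cell i + 1.
    """
    cells = len(kappa)
    h = length / cells
    n = cells - 1
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = (kappa[i] + kappa[i + 1]) / h
        if i + 1 < n:
            # Neighbouring unknowns couple through the cell they share.
            A[i][i + 1] = A[i + 1][i] = -kappa[i + 1] / h
    return A

# Toy permeabilities spanning six orders of magnitude (made-up values).
A = assemble_darcy_1d([1.0, 1e-3, 1.0, 1e3])

# Size of the SPE10 grid quoted on the slide.
spe10_cells = 60 * 220 * 85
```

Large jumps in `kappa` show up directly as large jumps between matrix entries, which is what makes the benchmark hard for iterative solvers.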
What are the machines that we study?
                  Edison            Babbage
name              Ivy Bridge        Knights Corner
model             Xeon E5-2695 v2   Xeon Phi 5110P
clock speed       2.4 GHz           1.053 GHz
cores             12                60
SMT threads       2                 4
SIMD width        4                 8
peak gflop/s      230.4             1010.88
bandwidth         48.5 GB/s         122.9 GB/s
per-core caches:
  L1-D            32 KB             32 KB
  L2              256 KB            512 KB
shared cache:
  L3              30 MB             none
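
The peak rates in the table can be reproduced from the other rows. The sketch below assumes that the SIMD width counts double-precision lanes and that each lane retires 2 flops per cycle (one add plus one multiply, or one fused multiply-add); that factor is an assumption chosen because it matches the quoted peaks. The `Machine` class is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    cores: int
    clock_ghz: float
    simd_width: int       # double-precision lanes (assumed meaning of "SIMD width")
    bandwidth_gbs: float  # GB/s, from the table

    def peak_gflops(self, flops_per_lane_per_cycle=2):
        # Assumption: 2 flops per SIMD lane per cycle.
        return self.cores * self.clock_ghz * self.simd_width * flops_per_lane_per_cycle

    def balance(self):
        """Flops available per byte of memory traffic at peak."""
        return self.peak_gflops() / self.bandwidth_gbs

edison = Machine("Edison (Ivy Bridge)", 12, 2.4, 4, 48.5)
babbage = Machine("Babbage (Knights Corner)", 60, 1.053, 8, 122.9)

for m in (edison, babbage):
    print(f"{m.name}: {m.peak_gflops():.2f} gflop/s, {m.balance():.2f} flop/byte")
```

The flop-per-byte balance is the threshold that a kernel's arithmetic intensity has to exceed before it stops being limited by memory bandwidth.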
What the coarse-grid system is
n = 7,782; nnz = 1,412,840; nnz/n = 181.6
How we chose the preconditioner for PCG
preconditioner            operator
Jacobi                    $z = D^{-1} r$
Symmetric Gauss–Seidel    $z = (L + D)^{-1} D (L + D)^{-T} r$
where $A_c = L + D + L^T$
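
Both operators can be sketched for a small dense symmetric matrix. This is an illustration only: a real implementation would use sparse storage, and the function names are hypothetical. The symmetric Gauss–Seidel routine applies the factors in the order written on the slide.

```python
def jacobi_apply(A, r):
    """z = D^{-1} r."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

def sgs_apply(A, r):
    """z = (L + D)^{-1} D (L + D)^{-T} r for symmetric A = L + D + L^T."""
    n = len(r)
    # Backward substitution with the upper triangle (L + D)^T.
    u = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (r[i] - s) / A[i][i]
    # Scale by the diagonal D.
    v = [A[i][i] * ui for i, ui in enumerate(u)]
    # Forward substitution with the lower triangle L + D.
    z = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * z[j] for j in range(i))
        z[i] = (v[i] - s) / A[i][i]
    return z
```

Each Jacobi entry is independent of the others, whereas both substitution loops in `sgs_apply` carry a dependence from one row to the next, which makes them harder to thread.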
How we chose the preconditioner for PCG
                 unprecond        Jacobi           SGS
conditioning
  isotropic      3.37 × 10^4      1.35 × 10^3      1.83 × 10^2
  anisotropic    9.68 × 10^6      1.89 × 10^4      2.91 × 10^3
iterations
  isotropic      605.53           194.57           78.87
  anisotropic    1,267.85         288.32           122.85
How we chose the preconditioner for PCG
                 SGS          Jacobi       Jacobi
                 1 thread     1 thread     12 threads
time (s)
  isotropic      83.0         80.3         29.2
  anisotropic    128.6        121.6        43.8
Where does the AMG cycle spend most of its time?
[Figure: runtime (s) versus number of threads (1 to 120), with curves for smoothing, PCG, and total.]
How to improve the performance of PCG
while not converged do
    ρ ← σ
    omp parallel for: w ← Ap
    omp parallel for: τ ← w · p
    α ← ρ/τ
    omp parallel for: x ← x + αp
    omp parallel for: r ← r − αw
    omp parallel for: z ← M^{-1} r
    omp parallel for: σ ← z · r
    β ← σ/ρ
    omp parallel for: p ← z + βp
end while
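
For reference, the loop body above can be written as a serial Python sketch that keeps the slide's variable names. The initialization (zero initial guess, first search direction) and the relative-residual stopping test are not shown on the slide and are assumptions here, as are the function names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, apply_Minv, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = b[:]                 # r = b - A x for the zero initial guess
    z = apply_Minv(r)
    p = z[:]
    sigma = dot(z, r)
    r0 = math.sqrt(dot(r, r))
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * r0:           # assumed stopping test
            return x, it
        rho = sigma                                    # rho <- sigma
        w = matvec(A, p)                               # w <- A p
        tau = dot(w, p)                                # tau <- w . p
        alpha = rho / tau                              # alpha <- rho / tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # x <- x + alpha p
        r = [ri - alpha * wi for ri, wi in zip(r, w)]  # r <- r - alpha w
        z = apply_Minv(r)                              # z <- M^{-1} r
        sigma = dot(z, r)                              # sigma <- z . r
        beta = sigma / rho                             # beta <- sigma / rho
        p = [zi + beta * pi for zi, pi in zip(z, p)]   # p <- z + beta p
    return x, max_iter
```

A Jacobi preconditioner can be passed as `lambda r: [ri / A[i][i] for i, ri in enumerate(r)]`. Every commented line except the two scalar divisions is a loop over vector or matrix entries, which is why the listing places a work-sharing construct on each of them.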
How to improve the performance of PCG
omp parallel
    while not converged do
        omp single: τ ← 0.0                    (implied barrier)
        omp single nowait: ρ ← σ, σ ← 0.0
        omp for nowait: w ← Ap
        omp for reduction: τ ← w · p           (implied barrier)
        α ← ρ/τ
        omp for nowait: x ← x + αp
        omp for nowait: r ← r − αw
        omp for nowait: z ← M^{-1} r
        omp for reduction: σ ← z · r           (implied barrier)
        β ← σ/ρ
        omp for nowait: p ← z + βp
    end while
end omp parallel
How to improve the performance of PCG
omp parallel
    while not converged do
        omp for: w ← Ap
        omp single
            τ ← w · p
            α ← ρ/τ
            x ← x + αp
            r ← r − αw
            z ← M^{-1} r
            ρ ← σ
            σ ← z · r
            β ← σ/ρ
            p ← z + βp
        end omp single
    end while
end omp parallel
How to improve the performance of PCG
while not converged do
    ρ ← σ
    omp parallel for: w ← Ap
    τ ← w · p
    α ← ρ/τ
    x ← x + αp
    r ← r − αw
    z ← M^{-1} r
    σ ← z · r
    β ← σ/ρ
    p ← z + βp
end while
How to improve the performance of PCG
[Figure: runtime (s) versus number of threads (1 to 120), with curves for Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4.]
How the sparse HSS solver works
sparse matrix-factorization algorithm
represents the frontal matrices as hierarchically-semiseparable (HSS) matrices
uses randomized sampling for faster compression
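[Figure: block structure of an HSS matrix, with diagonal blocks $D_i$, off-diagonal blocks in factored form such as $U_3 B_3 V_6^H$ and $U_6 B_6 V_3^H$, and the nested basis $U_7$ built from $U_3 R_3$ and $U_6 R_6$.]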
More details in Pieter Ghysels’ talk tomorrow!
How do the parameters of the solver affect performance?
Parameter                   Values
coarse solver               HSS, PCG
elements-per-agglomerate    64, 128, 256, 512
$\nu_P$                     0, 1, 2
$\nu_{M^{-1}}$              1, 3, 5
$\theta$                    0.001, 0.001 × 10^0.5, 0.01
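
If the sweep covers every combination of these values (an assumption; the slide lists only the values per parameter), the configurations can be enumerated as below. The dictionary keys are hypothetical names for the table's parameters.

```python
import itertools

grid = {
    "coarse_solver": ["HSS", "PCG"],
    "elements_per_agglomerate": [64, 128, 256, 512],
    "nu_P": [0, 1, 2],
    "nu_Minv": [1, 3, 5],
    "theta": [0.001, 0.001 * 10 ** 0.5, 0.01],
}

def configurations(grid):
    """Yield one dict per point of the full Cartesian product."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(grid))
print(len(configs))  # 2 * 4 * 3 * 3 * 3 = 216
```

The middle value of `theta` is the geometric mean of the other two, so the three thresholds are evenly spaced on a logarithmic scale.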
How do the parameters of the solver affect performance?
[Figure: runtime (s) versus percentile rank (1% to 64%), with curves for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG); the default configuration is marked.]
What our performance model is
stage                   bytes                            flops
pre- and post-smooth    (3ν + 1)(12 nz_a + 3 · 8n)       2(3ν + 1)(nz_a + 2n)
restriction             12 nz_a + 12 nz_p + 3 · 8n       2(nz_a + nz_p)
one coarse solve
  multiply by A_c       12 nz_c                          2 nz_c
  preconditioner        2 · 8 n_c                        n_c
  vector operations     5 · 8 n_c                        2 · 5 n_c
interpolation           12 nz_p + 8n                     2 nz_p
stopping criterion      12 nz_a + 4 · 8n                 2(nz_a + n)
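
The table can be turned into two lower bounds on the cycle time by dividing the byte total by the memory bandwidth and the flop total by the peak rate. The sketch below rests on assumptions the slide does not state: the smoothing row is taken to cover both pre- and post-smoothing, and the three coarse-solve rows are taken as the cost of one PCG iteration, repeated `coarse_iters` times. The function names and all fine-level sizes in the example call are hypothetical placeholders.

```python
def amg_cycle_cost(n, nza, nzp, nc, nzc, nu, coarse_iters):
    """Sum the table's byte and flop counts for one AMG cycle."""
    rows = [
        # (bytes, flops)
        ((3 * nu + 1) * (12 * nza + 3 * 8 * n),
         2 * (3 * nu + 1) * (nza + 2 * n)),                      # smoothing
        (12 * nza + 12 * nzp + 3 * 8 * n, 2 * (nza + nzp)),      # restriction
        (coarse_iters * 12 * nzc, coarse_iters * 2 * nzc),       # multiply by A_c
        (coarse_iters * 2 * 8 * nc, coarse_iters * nc),          # preconditioner
        (coarse_iters * 5 * 8 * nc, coarse_iters * 2 * 5 * nc),  # vector operations
        (12 * nzp + 8 * n, 2 * nzp),                             # interpolation
        (12 * nza + 4 * 8 * n, 2 * (nza + n)),                   # stopping criterion
    ]
    return sum(b for b, _ in rows), sum(f for _, f in rows)

def time_bounds(nbytes, nflops, bandwidth_gbs, peak_gflops):
    """Return (memory-bound seconds, flops-bound seconds)."""
    return nbytes / (bandwidth_gbs * 1e9), nflops / (peak_gflops * 1e9)

# Coarse sizes come from the coarse-grid slide; every other input is made up.
nbytes, nflops = amg_cycle_cost(n=1_000_000, nza=27_000_000, nzp=3_000_000,
                                nc=7_782, nzc=1_412_840, nu=3, coarse_iters=200)
t_mem, t_flop = time_bounds(nbytes, nflops, bandwidth_gbs=48.5, peak_gflops=230.4)
print(f"memory bound {t_mem:.4f} s, flops bound {t_flop:.4f} s")
```

The coefficient 12 on each nonzero count is consistent with an 8-byte value plus a 4-byte column index per stored entry, though the slide does not spell that out.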
What our performance model is
[Figure: runtime (s) versus number of cores (1 to 12), comparing the memory bound, the flops bound, and the actual runtime.]
Final comments
HSS is an attractive option for solving coarse systems
performance is quite sensitive to parameter tuning
performance model indicates where the bottlenecks are
Thank you!

Druinsky_SIAMCSE15
