Communication Hiding Through Pipelining in
Krylov Methods
Wim Vanroose
Department of Mathematics and Computer Science, U. Antwerpen, Belgium
Pieter Ghysels, Siegfried Cools, Jeffrey Cornelis
Funding: IWT Flanders and Intel, Exa2ct Horizon 2020 (EU funding ref. no. 610741), FWO Flanders,
U. Antwerpen Research Fund.
SIAM PP 18, Tokyo
Krylov subspace methods
We search for a solution x of $Ax = b$, $Ax = \lambda x$, or $A^T A x = A^T b$ in the Krylov subspace
$$x \in x^{(0)} + \operatorname{span}\{ r^{(0)}, A r^{(0)}, A^2 r^{(0)}, \ldots, A^{k-1} r^{(0)} \}, \qquad (1)$$
where $r^{(0)} = b - Ax^{(0)}$.
Conjugate Gradients,
Lanczos,
GMRES, Minres,
BiCG, CGS, BiCGstab,
CGLS, bidiagonalization,
. . .
Algebraic and Geometric
Multigrid,
Incomplete factorization,
Domain Decomposition,
FETI, BDDC,
Physics based
preconditioners,
. . .
GMRES, classical Gram-Schmidt

1: $r_0 \leftarrow b - Ax_0$
2: $v_0 \leftarrow r_0 / \|r_0\|_2$
3: for $i = 0, \ldots, m-1$ do
4:   $w \leftarrow A v_i$
5:   for $j = 0, \ldots, i$ do
6:     $h_{j,i} \leftarrow \langle w, v_j \rangle$
7:   end for
8:   $\tilde v_{i+1} \leftarrow w - \sum_{j=0}^{i} h_{j,i} v_j$
9:   $h_{i+1,i} \leftarrow \|\tilde v_{i+1}\|_2$
10:  $v_{i+1} \leftarrow \tilde v_{i+1} / h_{i+1,i}$
11:  {apply Givens rotations to $h_{:,i}$}
12: end for
13: $y_m \leftarrow \arg\min_y \| H_{m+1,m}\, y - \|r_0\|_2 e_1 \|_2$
14: $x \leftarrow x_0 + V_m y_m$
Communication characteristics of the three kernels:
Sparse matrix-vector product: only communication with neighbors; good scaling.
Dot product: global communication; latency scales as log(P).
Scalar-vector multiplication, vector-vector addition: no communication.
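To make the communication pattern concrete, here is a minimal serial NumPy sketch of the algorithm above. The function name `gmres_cgs` and the test matrix are illustrative, and the Givens-rotation update of line 11 is replaced by one dense least-squares solve at the end, which is mathematically equivalent:

```python
import numpy as np

def gmres_cgs(A, b, x0, m):
    """GMRES(m) with classical Gram-Schmidt, mirroring the pseudocode above."""
    n = len(b)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)               # one global reduction (norm)
    V[:, 0] = r0 / beta
    for i in range(m):
        w = A @ V[:, i]                     # SpMV: neighbour communication only
        H[:i + 1, i] = V[:, :i + 1].T @ w   # CGS dot products: ONE global reduction
        w = w - V[:, :i + 1] @ H[:i + 1, i]
        H[i + 1, i] = np.linalg.norm(w)     # norm: a SECOND global reduction
        V[:, i + 1] = w / H[i + 1, i]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)   # min || H y - ||r0|| e1 ||
    return x0 + V[:, :m] @ y

# toy nonsymmetric test problem
A = np.diag(np.linspace(1.0, 100.0, 100)) + 0.01 * np.triu(np.ones((100, 100)), 1)
b = np.ones(100)
print(np.linalg.norm(b - A @ gmres_cgs(A, b, np.zeros(100), 40)))
```

Note that both reductions per iteration sit on the critical path; that is exactly the latency the pipelined variants hide.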
Bulk Synchronous Parallelism
Latency of global reductions

T. Hoefler, T. Schneider and A. Lumsdaine, SC10, 2010.
GMRES vs Pipelined GMRES iteration on 4 nodes

[Timeline diagram for classical GMRES on processes p1–p4: each iteration chains an SpMV, local dot products with a reduction and broadcast, and an AXPY (Gram-Schmidt), followed by a norm with its own reduction, broadcast and scale (normalization), and finally the H update.]
Outline
Communication Hiding and Communication Avoiding
Pipelined Conjugate Gradients
Propagation of Rounding Errors
Deeper Pipelines
Exploiting the properties of Symmetric Matrices
Implementation
Communication Hiding vs Communication Avoiding

(figure taken from Fukaya et al., 2014)

M. Mohiyuddin, M. Hoemmen, J. Demmel and K. Yelick, Minimizing communication in sparse matrix solvers, pp. 1–12, 2009.
M. Hoemmen, Communication-avoiding Krylov subspace methods, PhD thesis, 2010.
J. Demmel, L. Grigori, M. Hoemmen and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SISC, 34, pp. A206–A239, 2012.
Conjugate Gradients?

Preconditioned Conjugate Gradient
1: $r_0 := b - Ax_0$; $u_0 := M^{-1} r_0$; $p_0 := u_0$
2: for $i = 0, \ldots, m-1$ do
3:   $s := A p_i$
4:   $\alpha := \langle r_i, u_i \rangle / \langle s, p_i \rangle$
5:   $x_{i+1} := x_i + \alpha p_i$
6:   $r_{i+1} := r_i - \alpha s$
7:   $u_{i+1} := M^{-1} r_{i+1}$
8:   $\beta := \langle r_{i+1}, u_{i+1} \rangle / \langle r_i, u_i \rangle$
9:   $p_{i+1} := u_{i+1} + \beta p_i$
10: end for
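A minimal NumPy sketch of this algorithm, assuming a generic callable `M_inv` for the preconditioner solve; the Jacobi-preconditioned 1-D Laplacian at the bottom is a purely illustrative test case:

```python
import numpy as np

def pcg(A, b, x0, M_inv, maxit=400, tol=1e-10):
    """Preconditioned CG, mirroring the pseudocode above."""
    x = x0.copy()
    r = b - A @ x
    u = M_inv(r)
    p = u.copy()
    gamma = r @ u                       # <r_i, u_i>
    for _ in range(maxit):
        s = A @ p
        alpha = gamma / (s @ p)         # global reduction 1
        x += alpha * p
        r -= alpha * s
        u = M_inv(r)
        gamma_new = r @ u               # global reduction 2
        if np.sqrt(abs(gamma_new)) < tol:
            break
        p = u + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

# Jacobi preconditioner on a 1-D Laplacian, as a toy example
n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = pcg(A, np.ones(n), np.zeros(n), lambda r: r / 2.0)
print(np.linalg.norm(np.ones(n) - A @ x))
```

The two dot products (lines 4 and 8) occur at different points of the iteration, so classical PCG pays two reduction latencies per step.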
Auxiliary Recurrences

We have already introduced the notation
$$s_i := A p_i, \qquad u_i := M^{-1} r_i. \qquad (2)$$
The recurrence for the search directions,
$$p_i = u_i + \beta p_{i-1}, \qquad (3)$$
multiplied by $A$,
$$A p_i = A u_i + \beta A p_{i-1}, \qquad (4)$$
yields
$$s_i = w_i + \beta s_{i-1}, \qquad (5)$$
where
$$w_i := A u_i. \qquad (6)$$
Chronopoulos/Gear CG

Only one global reduction each iteration.
1: $r_0 := b - Ax_0$; $u_0 := M^{-1} r_0$; $w_0 := A u_0$
2: $\alpha_0 := \langle r_0, u_0 \rangle / \langle w_0, u_0 \rangle$; $\beta := 0$; $\gamma_0 := \langle r_0, u_0 \rangle$
3: for $i = 0, \ldots, m-1$ do
4:   $p_i := u_i + \beta_i p_{i-1}$
5:   $s_i := w_i + \beta_i s_{i-1}$
6:   $x_{i+1} := x_i + \alpha_i p_i$
7:   $r_{i+1} := r_i - \alpha_i s_i$
8:   $u_{i+1} := M^{-1} r_{i+1}$
9:   $w_{i+1} := A u_{i+1}$
10:  $\gamma_{i+1} := \langle r_{i+1}, u_{i+1} \rangle$
11:  $\delta := \langle w_{i+1}, u_{i+1} \rangle$
12:  $\beta_{i+1} := \gamma_{i+1} / \gamma_i$
13:  $\alpha_{i+1} := \gamma_{i+1} / (\delta - \beta_{i+1} \gamma_{i+1} / \alpha_i)$
14: end for
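The point of the reordering is that the two inner products of lines 10 and 11 can share a single reduction. A hypothetical mpi4py fragment illustrating this (the names `r_loc`, `u_loc`, `w_loc` stand in for the local slices of $r$, $u = M^{-1}r$ and $w = Au$; random data for illustration, run with `mpirun -n P python ...`):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
r_loc = rng.standard_normal(10000)   # local part of r
u_loc = rng.standard_normal(10000)   # local part of u = M^{-1} r
w_loc = rng.standard_normal(10000)   # local part of w = A u

# two local dot products, but only ONE latency-bound global reduction
local = np.array([r_loc @ u_loc, w_loc @ u_loc])
glob = np.empty_like(local)
comm.Allreduce(local, glob, op=MPI.SUM)
gamma, delta = glob                  # the gamma and delta of lines 10-11
```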
pipelined Chronopoulos/Gear CG

Global reduction overlaps with the sparse matrix-vector (SpMV) product.
1: $r_0 := b - Ax_0$; $w_0 := A r_0$
2: for $i = 0, \ldots, m-1$ do
3:   $\gamma_i := \langle r_i, r_i \rangle$
4:   $\delta := \langle w_i, r_i \rangle$
5:   $q_i := A w_i$
6:   if $i > 0$ then
7:     $\beta_i := \gamma_i / \gamma_{i-1}$; $\alpha_i := \gamma_i / (\delta - \beta_i \gamma_i / \alpha_{i-1})$
8:   else
9:     $\beta_i := 0$; $\alpha_i := \gamma_i / \delta$
10:  end if
11:  $z_i := q_i + \beta_i z_{i-1}$
12:  $s_i := w_i + \beta_i s_{i-1}$
13:  $p_i := r_i + \beta_i p_{i-1}$
14:  $x_{i+1} := x_i + \alpha_i p_i$
15:  $r_{i+1} := r_i - \alpha_i s_i$
16:  $w_{i+1} := w_i - \alpha_i z_i$
17: end for
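A sketch of what the overlap might look like with MPI-3 non-blocking collectives via mpi4py (assumed available): the reduction for $\gamma_i$ and $\delta$ (lines 3–4) is launched with Iallreduce, the SpMV of line 5 runs while it is in flight, and the result is only awaited where it is needed. A toy 1-D Laplacian stencil stands in for the local SpMV; as noted later in the talk, asynchronous progress support such as MPICH_ASYNC_PROGRESS=1 may be needed for real overlap:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
n = 200000
r = rng.standard_normal(n)
w = rng.standard_normal(n)

# start the reduction for gamma = <r,r> and delta = <w,r> (lines 3-4) ...
local = np.array([r @ r, w @ r])
glob = np.empty_like(local)
req = comm.Iallreduce(local, glob, op=MPI.SUM)

# ... and overlap it with the SpMV q = A w (line 5)
q = np.empty_like(w)
q[1:-1] = 2.0 * w[1:-1] - w[:-2] - w[2:]
q[0] = 2.0 * w[0] - w[1]
q[-1] = 2.0 * w[-1] - w[-2]

req.Wait()           # the reduction result is only needed from line 7 on
gamma, delta = glob
```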
Preconditioned pipelined CG

1: $r_0 := b - Ax_0$; $u_0 := M^{-1} r_0$; $w_0 := A u_0$
2: for $i = 0, \ldots$ do
3:   $\gamma_i := \langle r_i, u_i \rangle$
4:   $\delta := \langle w_i, u_i \rangle$
5:   $m_i := M^{-1} w_i$
6:   $n_i := A m_i$
7:   if $i > 0$ then
8:     $\beta_i := \gamma_i / \gamma_{i-1}$; $\alpha_i := \gamma_i / (\delta - \beta_i \gamma_i / \alpha_{i-1})$
9:   else
10:    $\beta_i := 0$; $\alpha_i := \gamma_i / \delta$
11:  end if
12:  $z_i := n_i + \beta_i z_{i-1}$
13:  $q_i := m_i + \beta_i q_{i-1}$
14:  $s_i := w_i + \beta_i s_{i-1}$
15:  $p_i := u_i + \beta_i p_{i-1}$
16:  $x_{i+1} := x_i + \alpha_i p_i$
17:  $r_{i+1} := r_i - \alpha_i s_i$
18:  $u_{i+1} := u_i - \alpha_i q_i$
19:  $w_{i+1} := w_i - \alpha_i z_i$
20: end for
Conjugate Gradients
Pipelined Conjugate Gradients
Preconditioned pipelined CR

1: $r_0 := b - Ax_0$; $u_0 := M^{-1} r_0$; $w_0 := A u_0$
2: for $i = 0, \ldots$ do
3:   $m_i := M^{-1} w_i$
4:   $\gamma_i := \langle w_i, u_i \rangle$
5:   $\delta := \langle m_i, w_i \rangle$
6:   $n_i := A m_i$
7:   if $i > 0$ then
8:     $\beta_i := \gamma_i / \gamma_{i-1}$; $\alpha_i := \gamma_i / (\delta - \beta_i \gamma_i / \alpha_{i-1})$
9:   else
10:    $\beta_i := 0$; $\alpha_i := \gamma_i / \delta$
11:  end if
12:  $z_i := n_i + \beta_i z_{i-1}$
13:  $q_i := m_i + \beta_i q_{i-1}$
14:  $p_i := u_i + \beta_i p_{i-1}$
15:  $x_{i+1} := x_i + \alpha_i p_i$
16:  $u_{i+1} := u_i - \alpha_i q_i$
17:  $w_{i+1} := w_i - \alpha_i z_i$
18: end for
Cost Model

G := time for a global reduction
SpMV := time for a sparse matrix-vector product
PC := time for a preconditioner application
Local work such as AXPYs is neglected.

            flops   time (excl. axpys, dots)     #glob syncs   memory
CG           10     2G + SpMV + PC                    2           4
Chro/Gea     12     G + SpMV + PC                     1           5
CR           12     2G + SpMV + PC                    2           5
pipe-CG      20     max(G, SpMV + PC)                 1           9
pipe-CR      16     max(G, SpMV) + PC                 1           7
Gropp-CG     14     max(G, SpMV) + max(G, PC)         2           6

P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, 40 (2014), pp. 224–238.
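A toy evaluation of this model (made-up timings, with SpMV + PC normalized to 1) shows how pipe-CG flattens to max(G, SpMV + PC) as the reduction latency G grows in the strong-scaling limit:

```python
# Per-iteration time under the cost model above; numbers are illustrative only.
def t_cg(G):      return 2 * G + 1.0     # CG:        2G + SpMV + PC
def t_chrono(G):  return G + 1.0         # Chro/Gea:   G + SpMV + PC
def t_pipecg(G):  return max(G, 1.0)     # pipe-CG:   max(G, SpMV + PC)

for G in (0.1, 0.5, 1.0, 2.0, 8.0):
    print(f"G={G:4.1f}   CG: {t_cg(G):5.1f}   "
          f"Chro/Gea: {t_chrono(G):5.1f}   pipe-CG: {t_pipecg(G):5.1f}")
```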
Better Strong Scaling
Hydrostatic ice sheet flow, 100 × 100 × 50 Q1 finite elements
line-search Newton method (rtol = $10^{-8}$, atol = $10^{-15}$)
CG with block Jacobi ICC(0) preconditioner (rtol = $10^{-5}$, atol = $10^{-50}$)
Measured speedup over standard CG for different variations of pipelined CG.
MPI trace
Pipelined methods in PETSc (from 3.4.2)

Krylov methods: KSPPIPECR, KSPPIPECG, KSPGROPPCG, KSPPGMRES
Uses MPI-3 non-blocking collectives
export MPICH_ASYNC_PROGRESS=1

1: ...
2: KSP_PCApply(ksp,W,M);          /* m ← Bw */
3: if (i > 0 && ksp->normtype == KSP_NORM_PRECONDITIONED)
4:   VecNormBegin(U,NORM_2,&dp);
5: VecDotBegin(W,U,&gamma);
6: VecDotBegin(M,W,&delta);
7: PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)U));
8: KSP_MatMult(ksp,Amat,M,N);     /* n ← Am */
9: if (i > 0 && ksp->normtype == KSP_NORM_PRECONDITIONED)
10:  VecNormEnd(U,NORM_2,&dp);
11: VecDotEnd(W,U,&gamma);
12: VecDotEnd(M,W,&delta);
13: ...
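In practice these solvers are typically selected at runtime through the PETSc options database, e.g. `-ksp_type pipecg`, `-ksp_type pipecr`, `-ksp_type groppcg` or `-ksp_type pgmres` (the option names corresponding to the KSP types above), with MPICH_ASYNC_PROGRESS=1 exported so the non-blocking reduction started by PetscCommSplitReductionBegin can make progress while the SpMV executes.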
True Residual vs recursed Residual

[Convergence plots of the true versus the recursively updated residual, for CG (left) and pipelined Chronopoulos/Gear CG (right).]
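The effect is easy to reproduce. A small NumPy experiment in the spirit of the figure, with an SPD test matrix of condition number about $10^6$ (the construction and iteration counts are arbitrary choices): one typically observes the true residual stagnate at the maximal attainable accuracy while the recursive residual keeps decreasing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(0, 6, n)) @ Q.T     # SPD, condition number ~1e6
b = rng.standard_normal(n)

x = np.zeros(n)
r = b.copy()
p = r.copy()
gamma = r @ r
for i in range(1, 401):
    s = A @ p
    alpha = gamma / (s @ p)
    x += alpha * p
    r -= alpha * s                               # recursively updated residual
    gamma_new = r @ r
    p = r + (gamma_new / gamma) * p
    gamma = gamma_new
    if i % 80 == 0:
        print(f"it {i:3d}   recursive {np.sqrt(gamma):9.3e}"
              f"   true {np.linalg.norm(b - A @ x):9.3e}")
```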
Rounding errors in classical CG

$$\bar x_{i+1} = \bar x_i + \alpha_{i+1} \bar p_i + \delta^x_i, \qquad \bar r_{i+1} = \bar r_i - \alpha_{i+1} \bar s_i + \delta^r_i, \qquad (7)$$

where $s_i := A p_i$, the bars $\bar x_i, \bar r_i, \bar s_i$ denote the computed iterates, and $\delta^x_i$ and $\delta^r_i$ are the local rounding errors. The gap between the true and the recursed residual then evolves as

$$\Delta^r_{i+1} = b - A\bar x_{i+1} - \bar r_{i+1} = b - A(\bar x_i + \alpha_{i+1}\bar p_i + \delta^x_i) - \bar r_i + \alpha_{i+1}\bar s_i - \delta^r_i = \underbrace{b - A\bar x_i - \bar r_i}_{\Delta^r_i} \, - \, \underbrace{\left( A\delta^x_i + \delta^r_i \right)}_{\text{local rounding error}}$$

(with $\bar s_i = A\bar p_i$ in classical CG, its SpMV rounding folded into $\delta^r_i$).
Rounding Errors
pipelined Chronopoulos/Gear CG

Global reduction overlaps with the matrix-vector product.
1: $r_0 := b - Ax_0$; $w_0 := A r_0$
2: for $i = 0, \ldots, m-1$ do
3:   $\gamma_i := \langle r_i, r_i \rangle$
4:   $\delta := \langle w_i, r_i \rangle$
5:   $q_i := A w_i$
6:   if $i > 0$ then
7:     $\beta_i := \gamma_i / \gamma_{i-1}$; $\alpha_i := \gamma_i / (\delta - \beta_i \gamma_i / \alpha_{i-1})$
8:   else
9:     $\beta_i := 0$; $\alpha_i := \gamma_i / \delta$
10:  end if
11:  $z_i := q_i + \beta_i z_{i-1}$
12:  $s_i := w_i + \beta_i s_{i-1}$
13:  $p_i := r_i + \beta_i p_{i-1}$
14:  $x_{i+1} := x_i + \alpha_i p_i$
15:  $r_{i+1} := r_i - \alpha_i s_i$
16:  $w_{i+1} := w_i - \alpha_i z_i$
17: end for
Deviation between the recursed and true value

The pipelined method keeps auxiliary vectors
$$s_i := A p_i, \qquad w_i := A r_i, \qquad z_i := A s_i = A^2 p_i. \qquad (8)$$
We use the following definitions for the errors:
$$\Delta^r_i := \bar r_i - (b - A \bar x_i), \quad \Delta^s_i := \bar s_i - A \bar p_i, \quad \Delta^w_i := \bar w_i - A \bar r_i, \quad \Delta^z_i := \bar z_i - A \bar s_i. \qquad (9)$$
Rounding errors

$$\begin{pmatrix} \Delta^r_i \\ \Delta^s_i \\ \Delta^w_i \\ \Delta^z_i \end{pmatrix}
= \begin{pmatrix} 1 & \alpha_i & 0 & 0 \\ 0 & \beta_i & 1 & 0 \\ 0 & 0 & 1 & \alpha_i \\ 0 & 0 & 0 & \beta_i \end{pmatrix}
\begin{pmatrix} \Delta^r_{i-1} \\ \Delta^s_{i-1} \\ \Delta^w_{i-1} \\ \Delta^z_{i-1} \end{pmatrix}
+ \begin{pmatrix} A\delta^x + \delta^r \\ A\delta^r + \delta^s \\ A\delta^r + \delta^w \\ A\delta^s + \delta^z \end{pmatrix} \qquad (10)$$
Residual replacement

Replace $r_i$, $s_i$, $w_i$ and $z_i$ with their true values:
$$r_i := b - A x_i, \quad s_i := A p_i, \quad w_i := A r_i, \quad z_i := A s_i = A^2 p_i. \qquad (11)$$

van der Vorst and Ye, Residual replacement strategies for Krylov subspace iterative methods for convergence of the true residuals, SISC, 2000.
Tong and Ye, Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems, Mathematics of Computation, 1999.

The perturbed Lanczos relation
$$A Z = Z T + f \qquad (12)$$
has perturbation columns
$$f_i = \frac{A \tau_i}{\|r_i\|} + \frac{1}{\alpha_i} \frac{\eta_{i+1}}{\|r_i\|} - \frac{\beta_i}{\alpha_{i-1}} \frac{\eta_i}{\|r_i\|}. \qquad (13)$$
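A sketch of what such a replacement step could look like for the pipelined recurrences; the helper name `replace_vectors` and the fixed replacement period are illustrative, and the cited papers derive a proper error-based criterion for when to replace:

```python
import numpy as np

def replace_vectors(A, b, x, p):
    """Recompute the recursed vectors from their definitions, eq. (11)."""
    r = b - A @ x          # true residual
    s = A @ p
    w = A @ r
    z = A @ s
    return r, s, w, z

# inside the pipelined loop one would call, at selected iterations i:
#     r, s, w, z = replace_vectors(A, b, x, p)
# a fixed period (say every 50 iterations) is the crudest possible choice.
```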
Residual Replacement for Pipelined Chron/Gear CG
S. Cools, E. F. Yetkin, E. Agullo, L. Giraud, W. Vanroose, Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method, arXiv:1601.07068. To appear in SIAM SIMAX.
Conjugate Gradients
Pipelined Conjugate Gradients
Extending the pipeline depth
We introduce the auxiliary vectors $Z_{i+1} = [z_0, z_1, \ldots, z_i, z_{i+1}]$ defined as
$$z_j := \begin{cases} v_0 & j = 0, \\ P_j(A) v_0 & 0 < j \le l, \\ P_l(A) v_{j-l} & j > l, \end{cases} \qquad (14)$$
where the polynomial $P_l(t)$ is fixed to degree $l$, with $l$ the depth of the pipeline,
$$P_l(t) := \prod_{j=0}^{l-1} (t - \sigma_j), \qquad (15)$$
and $P_j(t) = (t - \sigma_{j-1}) P_{j-1}(t)$ with $P_0(t) := 1$.
Example with l = 3

Let $V_4 = [v_0, v_1, v_2, v_3]$ be a Krylov subspace basis. We then have the auxiliary variables $Z_7 = [z_0, \ldots, z_6]$:
$$z_0 = v_0,$$
$$z_1 = (A - \sigma_0 I) v_0,$$
$$z_2 = (A - \sigma_1 I)(A - \sigma_0 I) v_0,$$
$$z_3 = (A - \sigma_2 I)(A - \sigma_1 I)(A - \sigma_0 I) v_0,$$
$$z_4 = (A - \sigma_2 I)(A - \sigma_1 I)(A - \sigma_0 I) v_1,$$
$$z_5 = (A - \sigma_2 I)(A - \sigma_1 I)(A - \sigma_0 I) v_2,$$
$$z_6 = (A - \sigma_2 I)(A - \sigma_1 I)(A - \sigma_0 I) v_3. \qquad (16)$$
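A quick numerical check of this example (arbitrary shifts and a random A), building $z_0, \ldots, z_3$ by the shift recurrence and comparing against the explicit product $P_3(A)v_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
v0 = rng.standard_normal(n)
sigma = [0.3, 1.1, 2.0]                      # arbitrary shifts sigma_0..sigma_2

z = [v0]
for j in range(3):
    z.append(A @ z[j] - sigma[j] * z[j])     # z_{j+1} = (A - sigma_j I) z_j

I = np.eye(n)
P3 = (A - sigma[2] * I) @ (A - sigma[1] * I) @ (A - sigma[0] * I)
print(np.linalg.norm(z[3] - P3 @ v0))        # ~1e-13: z_3 = P_3(A) v_0
```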
We start from the Arnoldi relation
$$A V_k = V_{k+1} H_{k+1,k}, \qquad (17)$$
where $V_k$ is the basis of the Krylov subspace and $H_{k+1,k}$ is the upper Hessenberg matrix. Written for basis vector $v_{j-l}$, this becomes
$$v_{j-l} = \frac{A v_{j-l-1} - \sum_{k=0}^{j-l-1} h_{k,j-l-1} v_k}{h_{j-l,j-l-1}}. \qquad (18)$$
When we apply the fixed polynomial,
$$P_l(A) v_{j-l} = \frac{P_l(A) A v_{j-l-1} - \sum_{k=0}^{j-l-1} h_{k,j-l-1} P_l(A) v_k}{h_{j-l,j-l-1}}. \qquad (19)$$
Using the auxiliary variables, this is
$$z_j = \frac{A z_{j-1} - \sum_{k=0}^{j-l-1} h_{k,j-l-1} z_{k+l}}{h_{j-l,j-l-1}}. \qquad (20)$$
Relations for auxiliary variables Z

$$A Z_i = Z_{i+1} B_{i+1,i} \qquad (21)$$
with
$$B_{i+1,i} := \begin{pmatrix}
\sigma_0 & & & & & \\
1 & \ddots & & & & \\
& \ddots & \sigma_{l-1} & & & \\
& & 1 & h_{0,0} & \cdots & h_{0,i-l} \\
& & & h_{1,0} & & \vdots \\
& & & & \ddots & h_{i+1-l,i-l}
\end{pmatrix} \qquad (22)$$
(the first l columns apply the shifts $\sigma_0, \ldots, \sigma_{l-1}$; the remaining columns carry the Hessenberg entries, shifted down by $l$ rows).

Observation: the auxiliary variables Z contain the information for the Hessenberg matrix. Is it possible to extract it?
Matrix G

The auxiliary variables also lie in the Krylov subspace:
$$[z_0, z_1, \ldots, z_{k-1}] = [v_0, v_1, \ldots, v_{k-1}]
\begin{pmatrix}
* & * & * & \cdots & * & g_{0,k-1} \\
& * & * & \cdots & * & g_{1,k-1} \\
& & * & \cdots & * & g_{2,k-1} \\
& & & \ddots & \vdots & \vdots \\
& & & & * & \vdots \\
& & & & & g_{k-1,k-1}
\end{pmatrix}$$
so
$$Z_k = V_k G_k, \qquad (23)$$
where $G_k$ is an upper triangular matrix.
Recursion for G

$$\forall j = 0, 1, \ldots, k-2: \qquad g_{j,k-1} = (z_{k-1}, v_j) = \left( z_{k-1}, \, \frac{z_j - \sum_{m=0}^{j-1} g_{m,j} v_m}{g_{j,j}} \right) = \frac{(z_{k-1}, z_j) - \sum_{m=0}^{j-1} g_{m,j}\, g_{m,k-1}}{g_{j,j}} \qquad (24)$$

Two cases:
If the vector $v_j$ is already constructed → use $(z_{k-1}, v_j)$.
If not, use $(z_{k-1}, z_j)$ and the recurrence.

[Schematic (25), a column of G: the top entries are obtained directly as $(z_3, v_0)$ and $(z_3, v_1)$; the remaining entries follow from $(z_3, z_2)$ and $(z_3, z_3)$ through the recurrence.]
Recursion for G

For the diagonal elements we use
$$g_{j,j} = \frac{(z_j, z_j) - \sum_{m=0}^{j-1} g_{m,j}\, g_{m,j}}{g_{j,j}}, \qquad (26)$$
resulting in
$$g_{j,j} = \sqrt{(z_j, z_j) - \sum_{m=0}^{j-1} g_{m,j}^2}. \qquad (27)$$
Note that this can lead to (square-root) breakdowns.
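A sketch of this recurrence as code, using the $(z, z)$ form throughout for simplicity (the helper name `extend_G` is illustrative). Since V is orthonormal and $Z = VG$, we have $Z^T Z = G^T G$, so G is the upper Cholesky factor of $Z^T Z$, which gives a handy correctness check:

```python
import numpy as np

def extend_G(G, zdots, k):
    """Fill column k-1 of G from zdots[j] = (z_{k-1}, z_j), following
    eqs. (24) and (27); G[:k-1, :k-1] must already be known."""
    for j in range(k - 1):                                 # eq. (24)
        G[j, k - 1] = (zdots[j] - G[:j, j] @ G[:j, k - 1]) / G[j, j]
    d = zdots[k - 1] - G[:k - 1, k - 1] @ G[:k - 1, k - 1]
    if d <= 0.0:                                           # eq. (27)
        raise RuntimeError("square-root breakdown in the G recurrence")
    G[k - 1, k - 1] = np.sqrt(d)

# quick check against the Cholesky factor of Z^T Z
rng = np.random.default_rng(3)
Z = rng.standard_normal((40, 6))
ZtZ = Z.T @ Z
G = np.zeros((6, 6))
for k in range(1, 7):
    extend_G(G, ZtZ[:k, k - 1], k)
print(np.linalg.norm(G - np.linalg.cholesky(ZtZ).T))       # ~1e-14
```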
Recursive calculation of the Hessenberg Matrix

$$H_{k+1,k} = V^T_{k+1} A V_k \qquad (28)$$
$$= V^T_{k+1} A Z_k G_k^{-1} \qquad \text{using } Z = VG$$
$$= V^T_{k+1} Z_{k+1} B_{k+1,k} G_k^{-1} \qquad \text{using } AZ = ZB$$
$$= G_{k+1} B_{k+1,k} G_k^{-1} \qquad \text{using } G_{k+1} = V^T_{k+1} Z_{k+1}$$
$$= \begin{pmatrix} G_k & g_{:,k+1} \\ 0 & g_{k+1,k+1} \end{pmatrix}
\begin{pmatrix} B_{k,k-1} & b_{:,k} \\ 0 & b_{k+1,k} \end{pmatrix}
\begin{pmatrix} G_{k-1} & g_{:,k} \\ 0 & g_{k,k} \end{pmatrix}^{-1}$$
$$= \begin{pmatrix} G_k B_{k,k-1} & G_k b_{:,k} + g_{:,k+1} b_{k+1,k} \\ 0 & g_{k+1,k+1} b_{k+1,k} \end{pmatrix}
\begin{pmatrix} G_{k-1}^{-1} & -G_{k-1}^{-1} g_{:,k}\, g_{k,k}^{-1} \\ 0 & g_{k,k}^{-1} \end{pmatrix}$$
$$= \begin{pmatrix} G_k B_{k,k-1} G_{k-1}^{-1} & \left( -G_k B_{k,k-1} G_{k-1}^{-1} g_{:,k} + G_k b_{:,k} + g_{:,k+1} b_{k+1,k} \right) g_{k,k}^{-1} \\ 0 & g_{k+1,k+1} b_{k+1,k}\, g_{k,k}^{-1} \end{pmatrix}$$
$$= \begin{pmatrix} H_{k,k-1} & \left( G_k b_{:,k} + g_{:,k+1} b_{k+1,k} - H_{k,k-1} g_{:,k} \right) g_{k,k}^{-1} \\ 0 & g_{k+1,k+1} b_{k+1,k}\, g_{k,k}^{-1} \end{pmatrix} \qquad (29)$$
p(1)-GMRES
Conjugate Gradients with longer pipelines

In conjugate gradients, $A = A^T$ and the matrix is positive definite.
How does the GMRES algorithm simplify if $A = A^T$?
How can we avoid storing the full basis V and the full set of auxiliary vectors Z?
How do the rounding errors propagate?
Lemma

(Cornelis, Master's Thesis, 2017) Let $A$ be a symmetric matrix and let $z_i = P_l(A) v_{i-l}$ be the auxiliary variables with $l > 0$. Then $G_i = V^T_i Z_i$ is a banded upper triangular matrix with bandwidth $2l + 1$, where
$$g_{j,i} = v^T_j z_i = 0 \quad \forall j \notin \{i - 2l, \ldots, i\}. \qquad (30)$$

[Spy plot of a 30 × 30 matrix G showing the banded upper triangular structure.]
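A numerical illustration of the lemma, with arbitrary shifts; a Lanczos helper with full reorthogonalization (an assumption of this sketch, not part of the deck) keeps the basis numerically orthogonal so the band structure is visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, l = 120, 20, 2
B = rng.standard_normal((n, n))
A = B + B.T                                   # symmetric A
sigma = [0.0, 1.0]                            # arbitrary shifts, l = 2

def lanczos(A, v0, k):
    """Orthonormal Krylov basis via Lanczos with full reorthogonalization."""
    V = np.zeros((len(v0), k))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(k - 1):
        w = A @ V[:, j]
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)   # orthogonalize
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)   # ... and once more
        V[:, j + 1] = w / np.linalg.norm(w)
    return V

V = lanczos(A, rng.standard_normal(n), k)

def P(v, deg):                                # apply P_deg(A) = prod (A - s I)
    for s in sigma[:deg]:
        v = A @ v - s * v
    return v

Z = np.zeros((n, k))
for i in range(k):
    Z[:, i] = P(V[:, 0], i) if i <= l else P(V[:, i - l], l)   # eq. (14)

G = V.T @ Z
print(np.max(np.abs(np.tril(G, -1))))         # ~0: upper triangular
print(np.max(np.abs(np.triu(G, 2 * l + 1))))  # ~0: bandwidth 2l + 1
```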
Updating the solution

In principle we need all basis vectors to construct the solution:
$$x_k = x_0 + V_k y_k = x_0 + V_k T_{k,k}^{-1} \|r_0\| e_1. \qquad (31)$$
However, we can introduce search directions $p_k$ that are also recursively updated. We can then update the current guess with a step in the search direction:
$$p_j = v_{j+1} - \mu_j p_{j-1}, \qquad (32)$$
$$x_k = x_{k-1} + p_k c_k, \qquad (33)$$
where $c_k$ and $\mu_j$ are determined from the Hessenberg matrix.

Liesen and Strakoš, Krylov Subspace Methods, 2012.
pipelined Lanczos vs classical GMRES
CG vs pipe CG
Pipeline for l = 2
Pipeline Storage
Example of pipeline storage for l = 2
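A sketch of the storage idea, assuming a ring buffer of $2l + 1$ columns (the class name and indexing scheme are illustrative, not the exact p(l)-CG layout): the lemma above means only a sliding window of basis vectors is ever needed, so the full basis V is never stored.

```python
import numpy as np

class SlidingBasis:
    """Keep only the last (2l + 1) basis vectors in a ring buffer; older
    columns are overwritten as the iteration advances."""
    def __init__(self, n, l):
        self.width = 2 * l + 1
        self.cols = np.zeros((n, self.width))

    def put(self, i, v):
        self.cols[:, i % self.width] = v   # overwrites column i - (2l + 1)

    def get(self, i):
        return self.cols[:, i % self.width]

# usage: a window of 5 vectors suffices for l = 2
W = SlidingBasis(n=1000, l=2)
```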
Contributions to PETSc

KSPPGMRES: pipelined GMRES (thanks to Jed Brown)
KSPPIPECG: pipelined CG
KSPPIPECR: pipelined conjugate residuals
KSPGROPPCG: Gropp variant where the overlap is done with SpMV and preconditioner
KSPPIPEBCGS: pipelined BiCGStab
Under development:
p(l)-CG: in a PETSc pull request, under review.
Soliciting feedback from applications.
Conclusions

Summary:
We have shown that by adding auxiliary variables and additional recurrences it is possible to hide the latencies of the dot products in Krylov methods.
It is possible to write a version of Conjugate Gradients where dot products are hidden behind multiple SpMVs.
Asynchronous: a dot product can take the time of multiple SpMVs to complete.
Better scaling in the strong scaling limit, where global reduction latencies rise and the volume of computation per core diminishes.

Lessons Learned:
Rather technical to implement.
Tremendous challenges in the numerical analysis to control the propagation of rounding errors and avoid breakdowns.
Opportunities and Future work

The dramatic increase in parallelism with very lightweight cores requires a way to deal with the global reductions.
Introduce deep pipelining in various other Krylov methods:
Blocked Krylov methods.
LSQR and CGLS, frequently used in inverse and data-science problems; there are hardly any preconditioners, and they are often implemented and run on cloud infrastructure.
Combine pipelining and communication-avoiding techniques such as s-step methods: I. Yamazaki, M. Hoemmen, P. Luszczek and J. Dongarra (MS40, MS82).
Understand performance on large systems and in the presence of noise: Eller and Gropp (MS40); Morgan, Knepley and Sanan (MS40).
Replace Bulk Synchronous Parallel (BSP) implementations of Krylov subspace methods with Pipelined Asynchronous Parallel (PAP), where preconditioners, SpMV and dot products operate asynchronously.
Our main publications

P. Ghysels, T. J. Ashby, K. Meerbergen, and W. Vanroose, Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines, SIAM J. Sci. Comput., 35(1), C48–C71 (2013).
P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, 40 (2014), pp. 224–238.
S. Cools, E. F. Yetkin, E. Agullo, L. Giraud, W. Vanroose, Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method, arXiv:1601.07068. To appear in SIAM SIMAX.
S. Cools and W. Vanroose, The communication-hiding pipelined BiCGstab method for the parallel solution of large unsymmetric linear systems, Parallel Computing, 65 (2017), pp. 1–20.
J. Cornelis, S. Cools, W. Vanroose, The Communication-Hiding Conjugate Gradient Method with Deep Pipelines, arXiv:1801.04728 (2018). Submitted.
