Nucleon TMD Contractions in Lattice QCD using QUDA

Christos Kallidonis, Sergey Syritsyn
GPU Hackathon
Brookhaven National Laboratory
Sep. 17-21, 2018
Progress Report
Mentors:
Kate Clark, Mathias Wagner
QUDA: https://github.com/lattice/quda
with GPU Lattice team:
C. Jung, M. Lin, D. Howarth, J. Tu, B. Wang, D. Guo

Problem at hand
Degrees of freedom:
• (local) volume sites: x = 1,…,512K
• Ns spin: α,β = 1,…,4
• Nc color: a, b = 1,…,3
• Vector index: k = 1,…,12
• Γ-matrix index: i = 1,…,16
• Complex numbers! x2
# cplx multiply-add / site: $N_c^2 N_s^2 \times (1 + N_c N_s) + N_s^3$ (15104 Flops)
Inp. mem/site: $(2 (N_c N_s)^2 + N_c^2) \times \mathrm{cplx}$ = 4752 Bytes
Out. mem/site: $N_s^2 \times \mathrm{cplx}$ = 256 Bytes
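The byte counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the 16-byte size of a complex number (double precision) is an assumption, although it is the value implied by the quoted 4752 and 256 Bytes; the function names are illustrative only.

```python
# Per-site memory traffic and arithmetic intensity, from the counts quoted above.
NC, NS, NVEC = 3, 4, 12      # colors, spins, number of vectors k
CPLX_BYTES = 16              # assumption: double-precision complex (2 x 8 bytes)
FLOPS_PER_SITE = 15104       # value quoted on the slide, not re-derived here


def input_bytes_per_site():
    # 12 vectors w_k and 12 vectors v_k with Nc*Ns components each,
    # plus one Nc x Nc gauge link U
    vectors = 2 * NVEC * NC * NS
    link = NC * NC
    return (vectors + link) * CPLX_BYTES


def output_bytes_per_site():
    # one complex number per Gamma-matrix index i = 1..16 (= Ns^2)
    return NS * NS * CPLX_BYTES


def arithmetic_intensity():
    # Flops per byte moved, counting both input and output
    return FLOPS_PER_SITE / (input_bytes_per_site() + output_bytes_per_site())


print(input_bytes_per_site(), output_bytes_per_site(), round(arithmetic_intensity(), 2))
```

With these numbers the kernel performs roughly 3 Flops per byte moved per site.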
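$W_k(x)^b_\alpha = \sum_a U(x)^{ba} \, w_k(x)^a_\alpha$
$F_k(x)_{\alpha\beta} = \sum_b W_k(x)^b_\alpha \, v^*_k(x)^b_\beta$
$G(x)_{\alpha\beta} = \sum_k F_k(x)_{\alpha\beta}$
$C^{(i)}(x) = \sum_{\alpha,\beta} \Gamma^{(i)}_{\beta\alpha} \, G(x)_{\alpha\beta}$
$C^{(i)}(x) = \sum_k \sum_{\alpha,\beta,a,b} \Gamma^{(i)}_{\beta\alpha} \, U(x)^{ba} \, w_k(x)^a_\alpha \, v^*_k(x)^b_\beta$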
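The contraction at a single site can be written as a plain CPU reference, which is handy for checking a GPU kernel against. The sketch below is not the QUDA kernel: `contract_site` is a hypothetical name, the Γ-matrices are passed in as dense 4x4 lists indexed `[beta][alpha]`, and the data layout is chosen for readability only.

```python
def contract_site(U, w, v, gammas):
    """Reference contraction at one lattice site.

    U[b][a]        : 3x3 complex gauge link
    w[k][a][alpha] : vectors w_k (color a, spin alpha)
    v[k][b][beta]  : vectors v_k (color b, spin beta), conjugated inside
    gammas[i][beta][alpha] : dense Gamma matrices
    Returns the list C[i].
    """
    nc, ns = 3, 4
    G = [[0j] * ns for _ in range(ns)]
    for wk, vk in zip(w, v):
        # W[b][alpha] = sum_a U[b][a] * w_k[a][alpha]
        W = [[sum(U[b][a] * wk[a][al] for a in range(nc)) for al in range(ns)]
             for b in range(nc)]
        # G[alpha][beta] += sum_b W[b][alpha] * conj(v_k[b][beta])
        for al in range(ns):
            for be in range(ns):
                G[al][be] += sum(W[b][al] * vk[b][be].conjugate() for b in range(nc))
    # C[i] = sum_{alpha,beta} Gamma[i][beta][alpha] * G[alpha][beta]
    return [sum(gam[be][al] * G[al][be] for al in range(ns) for be in range(ns))
            for gam in gammas]


# Usage: unit gauge link, 12 identical vectors, identity as the only "Gamma"
U = [[1 if a == b else 0 for a in range(3)] for b in range(3)]
vec = [[1j] * 4 for _ in range(3)]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C = contract_site(U, [vec] * 12, [vec] * 12, [identity])
```

In the usage example every entry of G equals 12 x 3 = 36, so the trace with the identity gives 144.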

Kernel optimization
Iteration-0:
• assign 1 thread per site
• loop over a, b, α, β
• sum over k
• perform trace
Iteration-1:
• QUDA: block/grid auto-tuning functionality
Can do better than that!
Performance per GPU (1/2 K80): ~ 6 GFlop/s
Memory Bandwidth: ~ 1.9 GB/s
Kernel exec. cost: 6 GPU*sec
-> Dominant part of workflow
Nvidia Visual Profiler:
Thanks, Mathias!

Kernel optimization
Iteration-2:
• move required buffers to shared memory
• extend the block dim. to 3D: assign color/spin indices to individual threads
• #pragma unroll the (remaining) loops
• inline relevant functions involving Γ-matrices
Kernel exec. cost: 5.2 GPU*sec, x1.15 improvement
Profiler still complains about very high local memory overhead…

Kernel optimization
Iteration-3:
• Moving the Γ-matrices to constant memory did the trick. Thanks, Kate!
-> Before this change the compiler could not resolve the array indexing, so buffers spilled to local memory
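To illustrate what a small, read-only Γ table might contain, the sketch below builds the 16 matrices as products of the four Dirac matrices and stores only their nonzero entries. Several things here are assumptions: the Dirac basis, the ordering of the 16 products, and the (column, value) layout are not stated in the slides, and `build_gamma_table` and `trace_sparse` are hypothetical names. It relies on the fact that each such product has exactly one nonzero entry per row, so the trace needs 4 multiply-adds per Γ.

```python
from itertools import product

IDENTITY = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
# Dirac matrices in the Dirac basis (assumed basis)
G0 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, 0], [0, 0, 0, -1]]
G1 = [[0, 0, 0, 1], [0, 0, 1, 0], [0, -1, 0, 0], [-1, 0, 0, 0]]
G2 = [[0, 0, 0, -1j], [0, 0, 1j, 0], [0, 1j, 0, 0], [-1j, 0, 0, 0]]
G3 = [[0, 0, 1, 0], [0, 0, 0, -1], [-1, 0, 0, 0], [0, 1, 0, 0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def dense_gammas():
    """All 16 products G0^n0 G1^n1 G2^n2 G3^n3 with each n in {0, 1}."""
    out = []
    for powers in product((0, 1), repeat=4):
        M = IDENTITY
        for g, n in zip((G0, G1, G2, G3), powers):
            if n:
                M = matmul(M, g)
        out.append(M)
    return out


def build_gamma_table(gammas):
    """For each matrix and each row beta: (column alpha, value) of its nonzero entry."""
    table = []
    for M in gammas:
        rows = []
        for row in M:
            nonzero = [(col, val) for col, val in enumerate(row) if val != 0]
            assert len(nonzero) == 1  # one nonzero entry per row
            rows.append(nonzero[0])
        table.append(rows)
    return table


def trace_sparse(G, table):
    """C[i] = sum_beta Gamma[i][beta][alpha] * G[alpha][beta], nonzero terms only."""
    return [sum(val * G[col][beta] for beta, (col, val) in enumerate(rows))
            for rows in table]


def trace_dense(G, gammas):
    return [sum(M[be][al] * G[al][be] for al in range(4) for be in range(4))
            for M in gammas]
```

The table holds 16 x 4 = 64 entries in total, and `trace_sparse` gives the same result as the dense trace.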
QUDA auto-tuner report:
Performance: 205 GFlop/s
Memory BW: 65 GB/s
Kernel exec. cost: 0.16 GPU*sec (compared with 5.2 GPU*sec)
-> Now only 4% of workflow
x32 improvement!!
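The reported rates can be cross-checked against the per-site counts from the problem description (15104 Flops, 4752 + 256 Bytes). The sketch below assumes that 1 GB = 10^9 bytes and that each site's input and output are moved exactly once; the function names are illustrative.

```python
FLOPS_PER_SITE = 15104          # quoted per-site Flop count
BYTES_PER_SITE = 4752 + 256     # quoted input + output bytes per site


def implied_bandwidth_gb_s(gflops):
    """Memory bandwidth (GB/s) implied by a measured Flop rate (GFlop/s)."""
    sites_per_sec = gflops * 1e9 / FLOPS_PER_SITE
    return sites_per_sec * BYTES_PER_SITE / 1e9


def speedup(t_before, t_after):
    """Ratio of kernel execution costs."""
    return t_before / t_after


print(round(implied_bandwidth_gb_s(6), 2))     # first working version
print(round(implied_bandwidth_gb_s(205), 2))   # after Iteration-3
print(round(speedup(6, 5.2), 2), round(speedup(5.2, 0.16), 1))
```

This gives about 2.0 GB/s and 68 GB/s, close to the reported ~1.9 GB/s and 65 GB/s, and speedup factors of about 1.15 and 32.5.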
On-going work:
• can we squeeze out more Flop/s?
• optimize communication-intensive code segments
• experiment with environment variables
• update/optimize the rest of contraction kernels
