IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)
 


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009


Note that some slides were borrowed from NVIDIA.

Presentation Transcript

    • CUDA Tricks and Computational Physics. Kipton Barros, Boston University. In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, and J. Ellowitz.
    • High energy physics has huge computational needs. Large Hadron Collider, CERN (27 km).
    • A request: please question/comment freely during the talk. A disclaimer: I’m not a high energy physicist.
    • View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN).
    • 15 petabytes to be processed annually. View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
    • The “Standard Model” of Particle Physics
    • I’ll discuss Quantum ChromoDynamics. Although it’s “standard”, these equations are hard to solve. Big questions: why do quarks appear in groups? What was the physics during the big bang?
    • Quantum ChromoDynamics: the theory of nuclear interactions (bound by “gluons”). Extremely difficult: we must work at the level of fields, not particles, and the calculation is quantum mechanical.
    • Lattice QCD: solving Quantum Chromodynamics by computer. Discretize space and time (place the quarks and gluons on a 4D lattice).
    • Spacetime = 3+1 dimensions; 32^4 ∼ 10^6 lattice sites. Quarks live on sites (24 floats each); gluons live on links (18 floats each). Total system size: 4 bytes/float × 32^4 lattice sites × (24 + 4 × 18) floats ∼ 384 MB.
    • Lattice QCD: the inner loop requires repeatedly solving a linear equation relating the quarks and gluons. DW is a sparse matrix with only nearest-neighbor couplings, and DW needs to be fast!
    • DW operation for 1 output quark site (24 floats): it reads 2×4 input quark sites (24×8 floats) and 2×4 input gluon links (18×8 floats). 1.4 kB of local storage required per quark update?
    • CUDA parallelization: we must process many quark updates simultaneously. Odd/even sites are processed separately.
    • Thread programming model (NVIDIA slide): a kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. Threads from different blocks cannot cooperate. (Diagram of host/device, kernels, grids, blocks, and threads; © NVIDIA Corporation 2006.)
    • DW parallelization: each thread processes 1 site. No communication is required between threads! All threads in a warp execute the same code.
    • Per-site update, built up neighbor by neighbor: Step 1: read a neighbor site. Step 2: read the corresponding neighbor link. Step 3: accumulate into the output site. Step 4: read the next neighbor site. Step 5: read the next neighbor link. Step 6: accumulate into the output site. (A schematic per-thread sketch appears after the transcript.)
    • Occupancy (NVIDIA slide): thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory.
    • Optimizing threads per block (NVIDIA slide): choose threads per block as a multiple of warp size; avoid wasting computation on under-populated warps. More threads per block means better memory latency hiding, but also fewer registers per thread, and kernel invocations can fail if too many registers are used. Heuristics: minimum of 64 threads per block (only if multiple concurrent blocks); 192 or 256 threads is a better choice; usually still enough regs to compile and invoke successfully. This all depends on your computation, so experiment!
    • Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers, 1024 active threads (max). High occupancy (roughly 25% or so) is needed for maximum performance.
    • DW: does it fit onto the GPU? Each thread actually requires only 0.2 kB of fast local memory, not the full 1.4 kB (roughly 12 + 18 + 24 floats held at a time). An MP has 16 kB shared mem, so threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only). MP occupancy = 64/1024 = 6%.
    • 6% occupancy sounds pretty bad! (Photo: Andreas Kuehn / Getty)
    • How can we get better occupancy? Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers (= 64 kB of memory), 1024 active threads (max). Each thread requires 0.2 kB of fast local memory, and we want occupancy > 25%.
    • Registers as data (possible because no inter-thread communication is needed): instead of shared memory, the per-thread data is allocated as registers.
    • Registers as data: registers can’t be indexed, so all loops must be EXPLICITLY expanded.
    • Code sample (approx. 1000 LOC, automatically generated). (A minimal hand-written sketch of the registers-as-data idea appears after the transcript.)
    • Performance results: 44 gigabytes/sec (Tesla C870); 82 gigabytes/sec, or 90 Gflops/s (GTX 280) -- completely bandwidth limited. For comparison: twice as fast as the Cell implementation (arXiv:0804.3654) and 20 times faster than CPU implementations.
    • GB/s vs. occupancy (bar charts: Tesla C870, up to 45 GB/s, at occupancies of ≥25%, 17%, 8%, 0%; GTX 280, up to 85 GB/s, at occupancies of ≥19%, 13%, 6%, 0%). Surprise! Very robust to low occupancy.
    • Device memory is the bottleneck, so coalesced memory accesses are crucial. Data reordering: instead of storing quark 1 (q1,1, q1,2, ... q1,24), quark 2 (q2,1, q2,2, ... q2,24), quark 3, ... contiguously, interleave the components as q1,1 q2,1 q3,1 ... q1,2 q2,2 q3,2 ... so that thread 0, thread 1, thread 2, ... read consecutive words. (A layout sketch appears after the transcript.)
    • Memory coalescing: store even/odd lattices separately
    • When memory access isn’t perfectly coalesced: sometimes float4 arrays can hide latency. This global memory read corresponds to a single CUDA instruction, and in case of a coalesce miss at least 4× the data is transferred (thread 0, thread 1, thread 2, ...). (A float4 sketch appears after the transcript.)
    • When memory access isn’t perfectly coalesced: binding to textures can help. The read still corresponds to a single CUDA instruction; this makes use of the texture cache and can reduce the penalty for nearly coalesced accesses. (A texture-binding sketch appears after the transcript.)
    • Regarding textures, there are two kinds of memory. Linear array: can be modified in a kernel; can only be bound to a 1D texture. “CUDA array”: can’t be modified in a kernel; gets reordered for 2D/3D locality; allows various hardware features.
    • When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality. (Wikipedia image)
    • Warnings: the effectiveness of float4 and textures depends on the CUDA hardware and driver (!). Certain “magic” access patterns are many times faster than others. Testing appears to be necessary.
    • Memory bandwidth test: a simple kernel with completely coalesced memory access -- it should be optimal. Measured bandwidth: 54 gigabytes/sec (GTX 280; 140 GB/s theoretical!). (A timing sketch appears after the transcript.)
    • So why are NVIDIA samples so fast? NVIDIA actually achieves 102 gigabytes/sec, not 54 (GTX 280, 140 GB/s theoretical).
    • Naive access pattern (diagram: blocks 1 and 2 stepping through memory, steps 1 and 2).
    • Modified access pattern, much more efficient (diagram: blocks 1 and 2 stepping through memory, steps 1 and 2). (An illustrative sketch of two such sweep patterns appears after the transcript.)
    • CUDA compiler (LOTS of optimization here): CUDA C code → PTX code → CUDA machine code. Use the unofficial CUDA disassembler to view the CUDA machine code (CUDA disassembly).
    • CUDA disassembler (decuda): compile foo.cu and save the cubin file, then disassemble it.
    • Look how CUDA implements integer division!
    • CUDA provides fast (but imperfect) trigonometry in hardware! (An intrinsics sketch appears after the transcript.)
    • The compiler is very aggressive in its optimization: it will group memory loads together to minimize latency (snippet from LQCD). Notice: each thread reads 20 floats!
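
The sketches below are editorial illustrations of ideas mentioned in the transcript. None of them is code from the talk, and every kernel name, parameter, and data layout in them is invented for illustration.

First, a very schematic version of the per-thread update described on the "Step 1 ... Step 6" slides: each thread owns one output site and, for each neighbor direction, reads the neighbor site, reads the connecting link, and accumulates. In the real DW kernel a site carries 24 floats, a link 18 floats, and the accumulate step involves 3x3 complex (SU(3)) matrix-vector products; here everything is collapsed to single floats.

    // Schematic only -- NOT the real DW kernel.
    __global__ void dwSketch(const float* __restrict__ quarkIn,
                             const float* __restrict__ links,     // 8 per site (4 dims x 2 dirs)
                             const int*   __restrict__ neighbor,  // 8 neighbor indices per site
                             float* __restrict__ quarkOut,
                             int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        float acc = 0.0f;                                  // accumulator lives in a register
        for (int d = 0; d < 8; d++) {
            float q = quarkIn[neighbor[8 * site + d]];     // step 1/4/...: read neighbor site
            float U = links[8 * site + d];                 // step 2/5/...: read neighbor link
            acc += U * q;                                  // step 3/6/...: accumulate
        }
        quarkOut[site] = acc;
    }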
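
A minimal sketch of the "registers as data" idea, assuming a toy kernel that handles 24 floats per site. A fixed-size local array that is only ever indexed with compile-time constants (here obtained by fully unrolling the loops) can be kept in registers rather than in shared memory. The talk's actual code spells the expansion out explicitly in roughly 1000 generated lines; #pragma unroll is used here only to keep the sketch short.

    __global__ void scaleSite24(const float* in, float* out, float a, int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        float q[24];                       // held entirely in registers once the loops are unrolled
    #pragma unroll
        for (int c = 0; c < 24; c++)       // indices become compile-time constants
            q[c] = in[24 * site + c];      // (memory layout/coalescing ignored in this sketch)

    #pragma unroll
        for (int c = 0; c < 24; c++)
            out[24 * site + c] = a * q[c];
    }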
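
A sketch of the data-reordering (structure-of-arrays) layout from the coalescing slide, with illustrative names (NCOMP, nSites): component c of site s is stored at c * nSites + s, so for every component consecutive threads touch consecutive words.

    #define NCOMP 24   // floats per quark site

    // Array of structures:  q[site * NCOMP + c]   -> strided, poorly coalesced reads
    // Structure of arrays:  q[c * nSites + site]  -> threads 0, 1, 2, ... read adjacent addresses
    __global__ void copyQuarksSoA(const float* __restrict__ in,
                                  float* __restrict__ out, int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        for (int c = 0; c < NCOMP; c++)
            out[c * nSites + site] = in[c * nSites + site];   // coalesced read and write
    }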
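
A sketch of the float4 idea: each thread issues a single 16-byte load, so even when coalescing is imperfect every memory instruction moves a useful chunk of data. The kernel and names are illustrative.

    __global__ void sumFloat4(const float4* __restrict__ in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 v = in[i];                  // one 16-byte load per thread, one CUDA instruction
        out[i] = v.x + v.y + v.z + v.w;
    }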
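
A sketch of binding a linear device array to a 1D texture, using the texture-reference API that was current when the talk was given (it has since been removed from recent CUDA toolkits). Reads then go through the texture cache, which can soften the cost of nearly coalesced access patterns. The names quarkTex and launchRead are illustrative.

    #include <cuda_runtime.h>

    texture<float4, 1, cudaReadModeElementType> quarkTex;   // file-scope texture reference

    __global__ void readThroughTexture(float4* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(quarkTex, i);                // cached read from linear memory
    }

    void launchRead(const float4* d_in, float4* d_out, int n)
    {
        cudaBindTexture(0, quarkTex, d_in, n * sizeof(float4));  // bind the existing allocation
        readThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(quarkTex);
    }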
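
A sketch of the kind of memory bandwidth micro-benchmark the slides describe (an assumed reconstruction, not the kernel behind the 54 GB/s figure): a perfectly coalesced device-to-device copy timed with CUDA events.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copyKernel(const float* __restrict__ in, float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                    // perfectly coalesced read and write
    }

    int main()
    {
        const int n = 1 << 24;                        // 16M floats, 64 MB per buffer
        float *d_in, *d_out;                          // contents don't matter for a bandwidth test
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbMoved = 2.0 * n * sizeof(float) / 1e9;   // bytes read + bytes written
        printf("Effective bandwidth: %.1f GB/s\n", gbMoved / (ms / 1e3));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }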
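
An illustration, in the spirit of the "naive" vs. "modified" access-pattern slides, of two ways a grid can sweep a large array. The exact pattern used in the NVIDIA samples is not recoverable from the transcript, so this is only indicative: in the chunked version each block walks its own contiguous region step by step, while in the interleaved (grid-stride) version all blocks advance together, so at every step the grid touches one contiguous window of memory. Which is faster depends on the hardware, which is the slide's point: measure, don't assume.

    __global__ void scaleChunked(float* data, float a, int n, int perBlock)
    {
        int base = blockIdx.x * perBlock;                    // each block owns a contiguous chunk
        for (int i = threadIdx.x; i < perBlock && base + i < n; i += blockDim.x)
            data[base + i] *= a;
    }

    __global__ void scaleInterleaved(float* data, float a, int n)
    {
        int stride = gridDim.x * blockDim.x;                 // all blocks advance together
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= a;
    }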
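
A sketch of the fast hardware trigonometry the slide alludes to: the __sinf, __cosf, and combined __sincosf intrinsics map to the GPU's special-function units and trade accuracy (notably for large arguments) for speed. The kernel is illustrative.

    __global__ void unitVectors(const float* angle, float2* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float s, c;
        __sincosf(angle[i], &s, &c);       // fast, reduced-precision sine and cosine
        out[i] = make_float2(c, s);        // (cos t, sin t) for each input angle t
    }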