GPGPU
Performance & Tools I
Outline
1.  Introduction
2.  Threads
3.  Physical Memory
4.  Logical Memory
5.  Efficient GPU Programming
6.  Some Examples
7.  CUDA Programming
8.  CUDA Tools Introduction
9.  CUDA Debugger
10. CUDA Visual Profiler

NOTE: A lot of this serves as a recap of what was covered so far.
REMEMBER: Repetition is the key to remembering things.
But first…
• Do you believe that there can be a school without exams?

• Do you believe that a 9-year-old kid in a South Indian village can
  understand how DNA works?

• Do you believe that schools and universities
  should be changed entirely?



• http://www.ted.com/talks/sugata_mitra_build_a_school_in_the_cloud.html
  • Fixing education is a task that requires everyone’s attention…
Most importantly…
• Do you believe that we can learn, driven entirely by
  motivation?
  • If your answer is “NO”, then try to…

  • … get a new perspective on life…
           … leave your comfort zone!
                             突破自己! (Break through your own limits!)
Introduction
Why are we here?
    CPU vs. GPU
Combining strengths:
CPU + GPU
 • Can’t we just build a new device that combines the two?

 • Short answer: Some new devices are just that!
   • AMD Fusion
   • Intel MIC (Xeon Phi)



 • Long answer:
   • Take 楊佳玲’s Advanced Computer Architecture class!
Writing Code
Performance vs. Design
• Programmers have two contradictory goals:
  1.     Good Performance (FAST!)
  2.     Good Design (bug-resilient, extensible, easy to use etc…)


• Rule of thumb: Fast code is not pretty

• Example:
  •    Mathematical description –   1 line
  •    Algorithm Pseudocode –       10 lines
  •    Algorithm Code –             20 lines
  •    Optimized Algorithm Code –   50 lines
Writing Code
Common Fallacies
1.    “GPU programs are always faster than their CPU counterparts”
     • Only if: 1. The problem allows it and 2. you invest a lot of time

2.    “I don’t need a profiler”
     • A profiler helps you analyze performance and find bottlenecks.
     • If you don’t care for performance, do NOT use the GPU.

3.    “I don’t need a debugger”
     • Yes you do.
     • Adding tons of printf’s makes things a lot more difficult (and takes longer)
     • (Plus, people are lazy)

4.    “I can write bug-free code”
     • No, you can’t – No one can.
Writing Code
A Tale of Two Address Spaces…
• Never forget – In the current architecture:
  • The CPU and each GPU have their own address space and code

• We CANNOT access host pointers from device or vice versa

• We CANNOT call host code from the device or vice versa

• We CANNOT access device pointers or call code from different
  devices
  [Diagram: HOST (CPU, BUS, Memory) connected via PCIe to DEVICE (GPU, BUS, Memory)]
Threads &
Parallel Programming
Why do we need multithreading?
• First and foremost: Speed!
  • There are some other reasons, but not today…

• Real-life example:
  • Ship 10k containers from 台北 to 香港
  • Question: Do you use 1 very fast ship, or 4 slow ships?

• Program example:
  • Add a scalar to 10k numbers
  • Question: Do you use 1 very fast processor, or 4 slow processors?


• The real issue: Single-unit speed never scales!
    There is no very fast ship or very fast processor
Why do we hate multithreading?
 • Multithreading adds whole new dimensions of complications
   to programming
   • … Communication
   • … Synchronization
   • (… Dead-locks – But generally not on the GPU)



 • Plus, debugging is a lot more complicated
How many Threads?

[Figure: a kitchen worked by four threads (T1–T4), shown twice: the kitchen analogy for choosing a thread count]
GPU Threads
Recap
Physical Memory
How our computer works
Memory Hierarchy
Smaller is faster!

[Figure: the memory hierarchy pyramid; the fastest, smallest level is registers & shared memory]
Processor vs. Memory Speed
• Memory latency keeps getting worse!




  • http://seven-degrees-of-freedom.blogspot.tw/2009/10/latency-elephant.html
Logical Memory
How we see memory in our programs
Working with Memory
What is Memory logically?
• Let’s define: Memory = 1D array of bytes
                  0   1      2   3   4     5   6   7     8      9


• An object is a set of 1 or more bytes with a special meaning
  • If the bytes are contiguous, the object is a struct

• Examples of structs:
  •   byte
  •   int
  •   float
  •   pointer !?!
  •   sequence of structs:           [ int | float* | short ]


• A pointer is a struct that represents a memory address
  • Basically it’s the same as a 1D array index!
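
• A tiny host-side illustration of that last point (not from the original slides): pointer arithmetic and array indexing are two views of the same thing.

  #include <stdio.h>

  int main(void) {
      float data[10];
      float *p = &data[3];              /* a pointer: the address of element 3   */
      size_t idx = (size_t)(p - data);  /* pointer arithmetic recovers the index */
      printf("p points at index %zu\n", idx);   /* prints: 3 */
      return 0;
  }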
Working with Memory
Structs vs. Arrays
• A chunk of contiguous memory is either an array or a struct
   • Array: 1 or more elements of the same type
   • Struct: 1 or more (possibly different) elements
      • Determined at compile time

• Don’t make silly assumptions about structs!
   • The compiler might change alignment
   • The compiler might reorder elements

• GPU pointers must be word-aligned (4 bytes)

• If the object is only a single element, it can be said to be both:
   • A one-element struct
   • A one-element array
    But don’t overthink it…
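
• A minimal sketch of why such assumptions are silly (illustrative, not from the original slides): the compiler is free to insert padding, so the naive size of a struct can be wrong.

  #include <stdio.h>

  struct Packet {
      char tag;    /* 1 byte                          */
      int  value;  /* 4 bytes, wants 4-byte alignment */
  };

  int main(void) {
      /* Naive guess: 1 + 4 = 5 bytes. On typical platforms the compiler
         pads after tag to keep value 4-byte aligned, so this prints 8. */
      printf("sizeof(struct Packet) = %zu\n", sizeof(struct Packet));
      return 0;
  }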
Working with Memory
Multi-dimensional Arrays
 • Arrays are often multi-dimensional!
   •   …a line      (1D)
   •   …a rectangle (2D)
   •   …a box       (3D)
   •   … and so on


 • But address space is only 1D!

 • We have to map higher dimensional space into 1D…
   • C and CUDA-C do not allow for multi-dimensional array indices
   • We need to compute indices ourselves
Working with Memory
Row-Major Indexing
• Element (x, y) of a row-major 2D array of width w is stored at flat index:

      idx = y * w + x

  [Figure: a 2D grid with w=5 columns and h rows, numbered row by row]
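
• A minimal CUDA sketch of row-major indexing (the kernel name and launch shape are illustrative assumptions): each thread maps its 2D coordinates to the flat 1D index.

  __global__ void addOne2D(float *data, int w, int h)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
      int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
      if (x < w && y < h) {
          int idx = y * w + x;    // row-major mapping into the 1D allocation
          data[idx] += 1.0f;
      }
  }

  // Launched e.g. as: addOne2D<<<dim3((w+15)/16, (h+15)/16), dim3(16,16)>>>(d_data, w, h);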
Working with Memory
Summary
Efficient GPU
Programming
Must Read!
• If you want to understand the GPU and write fast programs, read these:


  • CUDA C Programming Guide

  • CUDA Best Practices Guide


• All important CUDA documentation is right here:
  • http://docs.nvidia.com/cuda/index.html

• OpenCL documentation:
  • http://developer.amd.com/resources/heterogeneous-computing/opencl-zone/
  • http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf
Can Read!
Some More Optimization Slides
• The power of ILP:
  • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf



• Some tips and tricks:
  • http://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf
ILP Magic
• The GPU facilitates both TLP and ILP
  • Thread-level parallelism
  • Instruction-level parallelism

• ILP means: We can execute multiple instructions at the same time

• A thread does not stall on the memory access itself
  • It only stalls on RAW (Read-After-Write) dependencies:
  a = A[i];                // no stall
  b = B[i];                // no stall
  // …
  c = a * b;               // stall

• Threads can execute multiple arithmetic instructions in parallel
  a = k1 + c * d; // no stall
  b = k2 + f * g; // no stall
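
• A hedged sketch of exploiting this (in the spirit of the Volkov slides above; names and launch shape are illustrative): let each thread handle two independent elements, so the loads can be in flight at the same time.

  __global__ void mul2(const float *A, const float *B, float *C, int n)
  {
      int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
      if (i + 1 < n) {            // note: with odd n the last element needs separate handling
          float a0 = A[i];        // no stall
          float a1 = A[i + 1];    // no stall: independent of a0
          float b0 = B[i];        // no stall
          float b1 = B[i + 1];    // no stall
          C[i]     = a0 * b0;     // waits only for a0 and b0
          C[i + 1] = a1 * b1;     // waits only for a1 and b1
      }
  }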
Warps occupying a SM
(SM=Streaming Multiprocessor)
• Using the previous example:
  a = A[i];              // no stall
  b = B[i];              // no stall
  // …
  c = a * b;             // stall

  [Figure: the SM scheduler swapping resident warps, e.g. warp4, warp5, warp6, warp8]


• What happens on a stall?
  • The current warp is placed in the I/O queue and another can run on
    the SM
  • That is why we want as many threads (warps) per SM as possible
  • Also need multiple blocks
      • E.g. a GeForce 660 can have 2048 threads/SM but only 1024 threads/block
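
• A minimal launch-configuration sketch (kernel and names are illustrative): pick a block size that is a multiple of the warp size and launch enough blocks that several can be resident per SM.

  __global__ void vecAdd(const float *a, const float *b, float *c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
  }

  void launch(const float *d_a, const float *d_b, float *d_c, int n)
  {
      int threadsPerBlock = 256;   // 8 warps; a multiple of the warp size (32)
      int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
      // with 256-thread blocks, up to 8 blocks can stack on a 2048-thread SM
      vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
  }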
TLP vs. ILP
What is good Occupancy?
[Figure: scheduling timelines comparing TLP and ILP; Ex.: Only 50% processor utilization!]
Registers + Shared Memory vs.
Working Set Size
• Shared Memory + Registers must hold current working set of
  all active warps on a SM
  • In other words: Shared Memory + Registers must hold all (or most
    of the) data that all of the threads currently and most often need

• More threads = better TLP = fewer actual stalls

• More threads = less space for the working set
  • Fewer registers per thread & less shared memory per thread

• If Shm + Registers too small for working set, must use out-of-
  core method
  • For example: External merge sort
  • http://en.wikipedia.org/wiki/External_sorting
Memory Coalescing and
Bank Conflicts
• VERY big bottleneck!



• See the professor’s slides



• Also, see the Must Read! section
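
• A hedged sketch of the difference (illustrative kernels, not from the professor’s slides): consecutive threads touching consecutive addresses coalesce into few transactions; a large stride scatters one warp’s accesses across many.

  __global__ void copyCoalesced(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];    // thread i touches element i: coalesced
  }

  __global__ void copyStrided(const float *in, float *out, int n, int stride)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = i * stride;           // neighboring threads are `stride` apart
      if (j < n) out[j] = in[j];    // poor coalescing for large strides
  }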
OOP vs. DOP
• Array-of-Structs vs. Struct-of-Arrays (AoS vs. SoA)

• You probably all have heard of Object-Oriented Programming
  • Idealistic OOP is slow
  • OOP groups data (and code) into logical chunks (structs)
  • OOP generally ignores temporal locality of data

• Good performance requires: Data-Oriented Programming
  • http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

• Bundle data together that is accessed at about the same time!
  • I.e. group data in a way that maximizes temporal locality
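
• A minimal AoS-vs-SoA sketch (types and kernel are illustrative assumptions): in the SoA layout, a warp reading x[i] touches contiguous memory, which also helps coalescing.

  struct ParticleAoS  { float x, y, z; };     // AoS: reading only x strides over y and z

  struct ParticlesSoA { float *x, *y, *z; };  // SoA: each field is contiguous

  __global__ void shiftX(ParticlesSoA p, float dx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p.x[i] += dx;     // contiguous, coalesced accesses
  }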
Streams – Pipelining
memcpy vs. computation
• Use streams to overlap host/device copies with kernel computation!

           Why? Because:
           memcpy between host and device is a huge bottleneck!
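
• A hedged sketch of such pipelining (buffer names, the kernel `process`, and the chunking are assumptions): with two streams, the copy of chunk k+1 can overlap the kernel working on chunk k. Requires page-locked host memory (cudaHostAlloc).

  cudaStream_t s[2];
  for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

  for (int k = 0; k < numChunks; ++k) {
      cudaStream_t st = s[k % 2];       // alternate between the two streams
      size_t off = (size_t)k * chunk;
      cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                      cudaMemcpyHostToDevice, st);
      process<<<(chunk + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, chunk);
      cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                      cudaMemcpyDeviceToHost, st);
  }
  cudaDeviceSynchronize();              // drain both pipelines
  for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);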
Look beyond the code
E.g.
        int a = …, wA = …;
        int tx = threadIdx.x, ty = threadIdx.y;
        __shared__ int A[128];
        As[ty][tx] = A[a + wA * ty + tx];   // As is declared elsewhere



• Which resources does the line of code use?
  • Several registers and constant cache
       • Variables and constants
       • Intermediate results


  • Memory (shared or global)
       • Reads from    A     (shared)
       • Writes to     As    (maybe global)
Where to get the numbers?
• For actual NVIDIA device properties, check CUDA programming
  guide Appendix F, Table 10
  • (The appendix lists a lot of info complementary to device query)
  • Note: Every device has a max Compute Capability (CC) version
      • The CC version of your device decides which features it supports
  • More info can be found in each CC section (all in Appendix F)
      • E.g. # warp schedulers (2 for CC 2.x; 4 for CC 3.x)
         • Dual-issue since CC 2.1


• For comparison of device stats consider NVIDIA
  • http://en.wikipedia.org/wiki/GeForce_600_Series#Products
  • etc…

• E.g. Memory latency (from section 5.2.3 of the Progr. Guide)
  • “400 to 800 clock cycles for devices of compute capability 1.x and 2.x
    and about 200 to 400 clock cycles for devices of compute capability
    3.x”
Other Tuning Tips
• The most important contributor to performance is the algorithm!

• Block size always a multiple of Warp size (32 on NVIDIA, 64 on AMD)!

• There is a lot more…
  • Page-locked Host Memory
  • Etc…

• Read all the references mentioned in this talk and you’ll get it.
Writing the Code…
• Do not ask the TA via email to help you with the code!

• Use the forum instead
  • Other people probably have similar questions!


• The TA (this guy) will answer all forum posts to the best of his judgment

• Other students can also help!

• Just one rule: Do not share your actual code!
Some Examples
Example 1
    Scalar-Vector Multiplication
• Multiply each element of a vector by a scalar: y[i] = a * x[i]

                                   Why?
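
• A minimal sketch of such a kernel (name and launch are illustrative): one thread per element.

  __global__ void scalarMul(float *v, float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] *= a;    // each thread scales exactly one element
  }

  // Host side, with d_v a device pointer to n floats:
  // scalarMul<<<(n + 255) / 256, 256>>>(d_v, 2.0f, n);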
Example 2
A typical CUDA kernel…
Shared memory declarations

Repeat:
     Copy some input to shared memory (shm)

     __syncthreads();

     Use shm data for actual computation

     __syncthreads();

Write to global memory
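
• One concrete (hypothetical) instance of this pattern: a 3-point moving average staged through shared memory. Assumes a block size of 256; names are illustrative.

  __global__ void smooth(const float *in, float *out, int n)
  {
      __shared__ float tile[256 + 2];                  // block size 256, plus halo
      int gi = blockIdx.x * blockDim.x + threadIdx.x;  // global index
      int li = threadIdx.x + 1;                        // local index (slot 0 is halo)

      tile[li] = (gi < n) ? in[gi] : 0.0f;             // copy input to shm
      if (threadIdx.x == 0)                            // left halo cell
          tile[0] = (gi > 0 && gi - 1 < n) ? in[gi - 1] : 0.0f;
      if (threadIdx.x == blockDim.x - 1)               // right halo cell
          tile[li + 1] = (gi + 1 < n) ? in[gi + 1] : 0.0f;
      __syncthreads();                                 // the whole tile is now visible

      if (gi < n)                                      // use shm data for computation
          out[gi] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;
      // (the second __syncthreads() matters when the copy/compute steps
      //  repeat in a loop, so iterations do not race on `tile`)
  }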
Example 3
 Median Filter
• No code (sorry!), but here are some hints…

• Use shared memory!
  • The code skeleton looks like Example 2
  • Remember: All threads in a block can access the same shared memory
• Use 2D blocks!
  • To get increased shared memory data re-use
• Each thread computes one output pixel!

• Use the debugger!
• Use the profiler!

• Some more hints are in the homework description…
Many More Examples…
• Check out the NVIDIA CUDA and AMD APP SDK samples

• Some of them come with documents, explaining:
  • The parallel algorithm (and how it was developed)
  • Exactly how much speed up was gained from each optimization step

• CUDA 5 samples with docs:
  •   simpleMultiCopy
  •   Mandelbrot
  •   Eigenvalue
  •   recursiveGaussian
  •   sobelFilter
  •   smokeParticles
  •   BlackScholes
  •   …and many more…
CUDA Tools
Documentation

• Online Documentation for NSIGHT 3
  • http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Nsight_Visual_Studio_Edition_User_Guide.htm




• Again: Read the documents from the Must Read! section
CUDA Debugger
VS 2010 & NSIGHT
Works with Eclipse and VS 2010
(no VS 2012 support yet)
NSIGHT 3 and 2.2
  Setup
• Get NSIGHT 3.0:
  • Go to: https://developer.nvidia.com/nvidia-nsight-visual-studio-edition
  • Register (Create an account)
  • Login
     • https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access
  • Download NSIGHT 3
     • Works for CUDA 5
     • Also has an OpenGL debugger and more


• Alternative: Get NSIGHT 2.2
  • No login required
  • Only works for CUDA 4
CUDA Debugger
Some References
• http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Content/Debugging_CUDA_Application.htm

• https://www.youtube.com/watch?v=FLQuqXhlx40
  • A bit outdated, but still very useful


• etc…
Visual Studio 2010 & NSIGHT
• System Info
Visual Studio 2010 & NSIGHT
1. Enable Debugging
  • NOTE: CPU and GPU debugging are entirely separated at this point
  • You must set everything explicitly for GPU
  • When GPU debug mode is enabled GPU kernels will run a lot slower!
Visual Studio 2010 & NSIGHT
2. Set breakpoint in code:




3. Start CUDA Debugger
  • DO NOT USE THE VS DEBUGGER (F5) for CUDA debugging
Visual Studio 2010 & NSIGHT
4. Step through the code
  • Step Into (F11)
  • Step Over (F10)
  • Step Out (Shift + F11)



5. Open the corresponding windows
Visual Studio 2010 & NSIGHT
6. Inspect everything…
Visual Studio 2010 & NSIGHT
Conditions                    Remember?
• Right-Click on breakpoint



• Result:
Visual Studio 2010 & NSIGHT
• Move between warps
Visual Studio 2010 & NSIGHT
• Select a specific thread
Visual Studio 2010 & NSIGHT
• Inspect Thread and Warp State




  • Lists state information of all Threads. E.g.:
     • Id, Block, Warp, File, Line, PC (Program Counter), etc…
     • Barrier information (is warp currently waiting for sync?)
     • Active Mask
        • Which threads of the thread’s warp are currently running
        • One bit per thread
        • Prof. Chen will cover warp divergence later in the class
Visual Studio 2010 & NSIGHT
• Inspect Memory
  • Can use Drag & Drop!




                                 Why is
                                 1 == 00 00 80 3f?

                             Floating point representation!
                             1.0f is 0x3F800000 in IEEE 754; the memory
                             window shows its bytes in little-endian order.
CUDA Profilers
Understand your program’s performance profiles!
Comprehensive References
• Great Overview:
  • http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf



• http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0419B-GTC2012-Profiling-Profiling-Tools.pdf
NVIDIA Visual Profiler
TODO…
• Great Tool!


• Chance for bonus points:
  • Put together a comprehensive and easily understandable tutorial!

• We will cast a vote!
• The best tutorial gets bonus points!
nvprof
TODO
• Text-based profiler
  • For everyone without a GUI

• Maybe also bonus points?

• We will post more details on the forum…
GTC – More about the GPU
• NVIDIA’s annual GPU Technology Conference hosts many talks
  available online

• This year’s GTC is in progress RIGHT NOW!
  • http://www.gputechconf.com/page/sessions.html


• Of course it’s a big advertisement campaign for NVIDIA
  • But it also has a lot of interesting stuff!
The End
Any Questions?
Update (1)
1. Compiler Options
nvcc (the NVIDIA CUDA Compiler) has a lot of options worth playing with.
I recommend dumping nvcc's help text into a file and consulting it often before you start writing code:
nvcc --help > nvcchelp.txt

2. Compute Capability 1.3
The test system is quite old, so the CUDA version it runs probably differs from the one most of you have at home.
If your code passes at home but the 批改娘 judge does not let it pass, here is a good workaround: compile with "-arch=sm_13" and you will get the same machine code the test system runs:
nvcc -arch=sm_13

3. Register Pressure & Register Usage
This Stack Overflow post discusses nvcc and register usage:
http://stackoverflow.com/questions/9723431/tracking-down-cuda-kernel-register-usage
If you pass -Xptxas="-v" to nvcc, it will report exactly how many registers each thread uses.

(My Chinese is not great. Please feel free to correct me.)
Update (2)
• Occupancy Calculator!
  • http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
