14. Basics
• GPUs are SIMD machines and excel at data-parallel tasks
• A program for the GPU is called a ‘kernel’
• A kernel runs in many instances called threads
• The hardware takes care of thread scheduling
• A typical GPU has 100s of processors
• 1000s of threads are needed to fully utilize a GPU
15. Example
C=A+B
Kernel:
void sum(int c[], int a[], int b[]) {
    int index = getThreadId();  // pseudocode: each thread learns its own index
    c[index] = a[index] + b[index];
}
Adding vectors:
int A[10], B[10], C[10];
sum<<10>>(C, A, B);  // launch one thread per element
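A minimal sketch of what the launch means, in plain C: conceptually the hardware runs the kernel body once per thread index, in parallel. Here `getThreadId()` is modeled as a loop variable, so the "threads" run serialized; this is a conceptual model, not GPU code.

```c
/* Conceptual CPU model of the slide's kernel: the GPU runs this
   body once per thread, with getThreadId() returning 0..n-1. */
static void sum_kernel(int c[], const int a[], const int b[], int index) {
    c[index] = a[index] + b[index];
}

/* sum<<10>>(C, A, B) ~ launch n threads; the hardware schedules them.
   Serialized here; on a GPU the iterations run concurrently. */
static void launch_sum(int c[], const int a[], const int b[], int n) {
    for (int index = 0; index < n; index++)
        sum_kernel(c, a, b, index);
}
```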
17. GPU Computing Stack
High-level Language
  ↓  translation, no optimizations
Intermediate Language
  ↓  optimization goes here
ISA
  ↓
GPU Hardware
18. GPU Computing Stack
The GPU world is bipolar:
        NVIDIA               ATI
HLL     CUDA C, OpenCL       OpenCL
IL      PTX                  IL
ISA     Not documented       Documented for RV700 (48xx)
HW      G80 (8xxx) and up    RV670 (38xx) and up
19. Breaking passwords
the CPU way
CPU: Generate password → Compute H(p) → Verify hash
Computing H(p) takes the most time, so offload it to the GPU
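The CPU-way loop can be sketched in C. The hash here is a toy stand-in (FNV-1a, not a real cryptographic hash) and the two-lowercase-letter keyspace is illustrative; the point is the generate → H(p) → verify structure, where hashing dominates the time.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for H(p): FNV-1a, NOT a cryptographic hash. */
static uint32_t toy_hash(const char *p) {
    uint32_t h = 2166136261u;
    while (*p) { h ^= (uint8_t)*p++; h *= 16777619u; }
    return h;
}

/* The CPU way: one loop generates a candidate, hashes it, and
   verifies the hash. Returns 1 and the password on success. */
static int crack(uint32_t target, char *out) {
    for (char c1 = 'a'; c1 <= 'z'; c1++)
        for (char c2 = 'a'; c2 <= 'z'; c2++) {
            char p[3] = { c1, c2, 0 };    /* generate password  */
            if (toy_hash(p) == target) {  /* H(p) + verify hash */
                strcpy(out, p);
                return 1;
            }
        }
    return 0;
}
```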
20. Breaking passwords
the GPU way
CPU: Generate passwords → GPU: compute H(p) for many p in parallel → CPU: Verify hashes
21. Breaking passwords
the GPU way
CPU: Generate passwords → GPU: compute H(p) → CPU: Verify hashes
• If H(p) is fast, PCIe data transfers are the bottleneck
• E.g. if H(p) is SHA-1, the theoretical peak is ~200M p/s
The solution is to offload everything to the GPU
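The ~200M p/s figure is consistent with a back-of-envelope PCIe estimate. The numbers below are assumptions, not from the slides: roughly 4 GB/s of effective PCIe bandwidth, and 20 bytes moved across the bus per password (a SHA-1 digest is 20 bytes).

```c
/* Back-of-envelope PCIe ceiling: passwords per second is bounded by
   bus bandwidth divided by bytes transferred per password.
   4e9 B/s and 20 B/password are illustrative assumptions. */
static double pcie_peak_pps(double bytes_per_sec, double bytes_per_password) {
    return bytes_per_sec / bytes_per_password;
}
```

With these assumptions, 4e9 / 20 = 2e8, i.e. ~200M p/s, regardless of how fast the GPU can actually hash.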
22. Breaking passwords
the GPU way
GPU: Generate passwords → GPU: compute H(p) → GPU: Verify hashes
• If H(p) is fast, PCIe data transfers are the bottleneck
• E.g. if H(p) is SHA-1, the theoretical peak is ~200M p/s
The solution is to offload everything to the GPU
23. How to use GPUs?
Implementation considerations
24. GPU Computing Stack
        NVIDIA               ATI
HLL     CUDA C, OpenCL       OpenCL
IL      PTX                  IL
ISA     Not documented       Documented for RV700 (48xx)
HW      G80 (8xxx) and up    RV670 (38xx) and up
25. Choosing language
CUDA C vs. PTX
• C code translates into PTX without optimizations
• Optimization is done when compiling PTX
• Intrinsics for device-specific instructions
There is no real reason to develop in PTX
26. Choosing language
OpenCL
• Portability requires compilation at runtime
• May take significant time and resources
• Compiler is part of the driver ➯ testing hell
• Requires shipping source code in an HLL ➯ IP issues
• Implementations are incomplete and vary across vendors
Not mature enough
27. Choosing language
ATI IL
• The only viable option if you love your users
• Access to device-specific instructions
• Best performance
• Not an option if you love your developers
• Poor documentation, poor samples
• Meaningless compiler errors, no debugger
28. Achieving performance
• Minimize data transfers
• Minimize memory accesses
• Or at least plan them carefully
• Minimize the number of registers used
• Fewer registers used means more threads can run simultaneously
• Schedule enough threads to keep GPU
processors busy
• Avoid thread divergence
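On divergence: when threads in a SIMD group take different branches, the GPU executes both paths serially. A common fix is to replace the branch with mask-and-select logic so all threads run the same instructions. A minimal sketch in C:

```c
#include <stdint.h>

/* Divergent version: threads taking different branches force a
   SIMD group to execute both paths one after the other. */
static uint32_t pick_branchy(uint32_t cond, uint32_t a, uint32_t b) {
    if (cond) return a; else return b;
}

/* Branch-free version: one instruction stream for all threads,
   so the SIMD group never diverges. */
static uint32_t pick_branchless(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)-(int32_t)(cond != 0);  /* all-ones if cond */
    return (a & mask) | (b & ~mask);
}
```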
29. Porting crypto to GPU
• Usually pretty straightforward
• MD5, SHA-1 and the like require little to no changes
• Can be tricky sometimes
• RC4 requires many memory accesses, so a careful layout is needed
• DES requires table lookups, which are very expensive
30. Porting crypto to GPU
The DES
• Table lookups (s-boxes) are the bottleneck
• Avoid them by using bitslicing
• S-boxes replaced with logic functions
• 32 encryptions in parallel
• Requires many registers
• Performance depends on compiler heuristics
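The bitslicing idea, sketched on a toy example (not the real DES s-boxes): bit i of each machine word belongs to encryption instance i, so a single logic expression evaluates the s-box for 32 instances at once, with no memory lookups.

```c
#include <stdint.h>

/* Toy 2-in/1-out "s-box" as a lookup table (the expensive way).
   f(a,b) = ~(a ^ b), i.e. XNOR. */
static const uint8_t SBOX[4] = { 1, 0, 0, 1 };

/* Bitsliced version: 32 s-box evaluations with one logic expression. */
static uint32_t sbox_bitsliced(uint32_t a, uint32_t b) {
    return ~(a ^ b);
}

/* Reference: evaluate the table once per slice. */
static uint32_t sbox_lookup32(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        out |= (uint32_t)SBOX[(ai << 1) | bi] << i;
    }
    return out;
}
```

The real DES s-boxes are 6-in/4-out, so their logic replacements need dozens of gates and many live registers, which is why register pressure and compiler heuristics dominate performance.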
32. Scalability
Not all GPUs are created equal
1. Program should scale nicely with the number of processors on the GPU
• Query the processor count from the driver
• Partition the task accordingly:
  numThreads = F(numProcessors)
• Also helps to avoid triggering the watchdog and freezing the screen
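A minimal sketch of numThreads = F(numProcessors). The oversubscription factor of 256 threads per processor is an illustrative assumption (enough threads per processor to hide memory latency), not a figure from the slides.

```c
/* Scale the thread count with the processor count reported by the
   driver. 256 threads per processor is an assumed tuning constant. */
static int choose_num_threads(int num_processors) {
    const int threads_per_processor = 256;
    return num_processors * threads_per_processor;
}
```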
33. Scalability
8 GPUs in a system are not uncommon
2. Program should scale nicely with the number of GPUs
• Query the device count from the driver
• Spawn a CPU thread to control each device
• Partition the task accordingly
Speedup should be linear unless you hit PCIe limits
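A sketch of the partitioning step: give each device a disjoint slice of the keyspace, then (e.g. with pthread_create) spawn one CPU thread per Job to drive that device. The Job struct and the idea of a per-device search routine are illustrative assumptions.

```c
/* One job per GPU: a disjoint slice of the total keyspace. */
typedef struct {
    int  device;
    long first, count;   /* this device's slice */
} Job;

/* Split [0, total) across num_devices, spreading the remainder so
   slice sizes differ by at most one. */
static void partition(long total, int num_devices, Job jobs[]) {
    long per = total / num_devices, rem = total % num_devices, next = 0;
    for (int d = 0; d < num_devices; d++) {
        jobs[d].device = d;
        jobs[d].first  = next;
        jobs[d].count  = per + (d < rem ? 1 : 0);
        next += jobs[d].count;
    }
}
```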
34. Compatibility
Not everyone’s got Fermi. Yet.
• New hardware offers great new features
• Cache on Fermi
• bitalign instruction on RV770
• May require a different optimization strategy
• May require a separate codebase
• Support for legacy hardware shouldn’t be dropped
Be prepared to handle this sort of complexity
35. Including GPU code
Option 1: include PTX/IL code in your program
Pros:
• Recommended way
• Forward compatibility
• No hardware required
Cons:
• Compilation at runtime
• Can’t test all hardware
• IP issues
36. Including GPU code
Option 2: include pre-compiled GPU binaries
Pros:
• No dependency on users’ driver
• No compilation at runtime
• Better IP protection
Cons:
• May not work with future devices
• Need to precompile for every supported GPU
• No precompiled binary for a GPU = no support