14. Basics
• GPUs are SIMD machines and excel at data-parallel tasks
• A program for the GPU is called a ‘kernel’
• A kernel runs in many instances called threads
• The hardware takes care of thread scheduling
• A typical GPU has 100s of processors
• 1000s of threads are needed to fully utilize a GPU
15. Example
C=A+B
Kernel:
void sum(int c[], int a[], int b[]) {
    int index = getThreadId();  // pseudocode: each thread learns its own index
    c[index] = a[index] + b[index];
}
Adding vectors:
int A[10], B[10], C[10];
sum<<10>>(C, A, B);  // launch one thread per element
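A minimal sketch of what the launch means, in plain C: conceptually the hardware runs the kernel body once per thread index, in parallel. Here `getThreadId()` is modeled as a loop variable, so the "threads" run serialized; this is a conceptual model, not GPU code.

```c
/* Conceptual CPU model of the slide's kernel: the GPU runs this
   body once per thread, with getThreadId() returning 0..n-1. */
static void sum_kernel(int c[], const int a[], const int b[], int index) {
    c[index] = a[index] + b[index];
}

/* sum<<10>>(C, A, B) ~ launch n threads; the hardware schedules them.
   Serialized here; on a GPU the iterations run concurrently. */
static void launch_sum(int c[], const int a[], const int b[], int n) {
    for (int index = 0; index < n; index++)
        sum_kernel(c, a, b, index);
}
```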
17. GPU Computing Stack
High-level Language
  ↓  translation, no optimizations
Intermediate Language
  ↓  optimization goes here
ISA
  ↓
GPU Hardware
18. GPU Computing Stack
The GPU world is bipolar:
        NVIDIA               ATI
HLL     CUDA C, OpenCL       OpenCL
IL      PTX                  IL
ISA     Not documented       Documented for RV700 (48xx)
HW      G80 (8xxx) and up    RV670 (38xx) and up
19. Breaking passwords
the CPU way
CPU: Generate password → Compute H(p) → Verify hash
Computing H(p) takes the most time, so offload it to the GPU
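The CPU-way loop can be sketched in C. The hash here is a toy stand-in (FNV-1a, not a real cryptographic hash) and the two-lowercase-letter keyspace is illustrative; the point is the generate → H(p) → verify structure, where hashing dominates the time.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for H(p): FNV-1a, NOT a cryptographic hash. */
static uint32_t toy_hash(const char *p) {
    uint32_t h = 2166136261u;
    while (*p) { h ^= (uint8_t)*p++; h *= 16777619u; }
    return h;
}

/* The CPU way: one loop generates a candidate, hashes it, and
   verifies the hash. Returns 1 and the password on success. */
static int crack(uint32_t target, char *out) {
    for (char c1 = 'a'; c1 <= 'z'; c1++)
        for (char c2 = 'a'; c2 <= 'z'; c2++) {
            char p[3] = { c1, c2, 0 };    /* generate password  */
            if (toy_hash(p) == target) {  /* H(p) + verify hash */
                strcpy(out, p);
                return 1;
            }
        }
    return 0;
}
```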
20. Breaking passwords
the GPU way
CPU: Generate passwords → GPU: compute H(p) for many p in parallel → CPU: Verify hashes
21. Breaking passwords
the GPU way
CPU: Generate passwords → GPU: compute H(p) → CPU: Verify hashes
• If H(p) is fast, PCIe data transfers are the bottleneck
• E.g. if H(p) is SHA-1, the theoretical peak is ~200M p/s
The solution is to offload everything to the GPU
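The ~200M p/s figure is consistent with a back-of-envelope PCIe estimate. The numbers below are assumptions, not from the slides: roughly 4 GB/s of effective PCIe bandwidth, and 20 bytes moved across the bus per password (a SHA-1 digest is 20 bytes).

```c
/* Back-of-envelope PCIe ceiling: passwords per second is bounded by
   bus bandwidth divided by bytes transferred per password.
   4e9 B/s and 20 B/password are illustrative assumptions. */
static double pcie_peak_pps(double bytes_per_sec, double bytes_per_password) {
    return bytes_per_sec / bytes_per_password;
}
```

With these assumptions, 4e9 / 20 = 2e8, i.e. ~200M p/s, regardless of how fast the GPU can actually hash.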
22. Breaking passwords
the GPU way
GPU: Generate passwords → GPU: compute H(p) → GPU: Verify hashes
• If H(p) is fast, PCIe data transfers are the bottleneck
• E.g. if H(p) is SHA-1, the theoretical peak is ~200M p/s
The solution is to offload everything to the GPU
23. How to use GPUs?
Implementation considerations
24. GPU Computing Stack
        NVIDIA               ATI
HLL     CUDA C, OpenCL       OpenCL
IL      PTX                  IL
ISA     Not documented       Documented for RV700 (48xx)
HW      G80 (8xxx) and up    RV670 (38xx) and up
25. Choosing language
CUDA C vs. PTX
• C code translates into PTX without optimizations
• Optimization is done when compiling PTX
• Intrinsics for device-specific instructions
There is no real reason to develop in PTX
26. Choosing language
OpenCL
• Portability requires compilation at runtime
• May take significant time and resources
• Compiler is part of the driver ➯ testing hell
• Requires shipping source code in an HLL ➯ IP issues
• Implementations are incomplete and vary across vendors
Not mature enough
27. Choosing language
ATI IL
• The only viable option if you love your users
• Access to device-specific instructions
• Best performance
• Not an option if you love your developers
• Poor documentation, poor samples
• Meaningless compiler errors, no debugger
28. Achieving performance
• Minimize data transfers
• Minimize memory accesses
• Or at least plan them carefully
• Minimize the number of registers used
• Fewer registers used means more threads can run simultaneously
• Schedule enough threads to keep GPU
processors busy
• Avoid thread divergence
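On divergence: when threads in a SIMD group take different branches, the GPU executes both paths serially. A common fix is to replace the branch with mask-and-select logic so all threads run the same instructions. A minimal sketch in C:

```c
#include <stdint.h>

/* Divergent version: threads taking different branches force a
   SIMD group to execute both paths one after the other. */
static uint32_t pick_branchy(uint32_t cond, uint32_t a, uint32_t b) {
    if (cond) return a; else return b;
}

/* Branch-free version: one instruction stream for all threads,
   so the SIMD group never diverges. */
static uint32_t pick_branchless(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)-(int32_t)(cond != 0);  /* all-ones if cond */
    return (a & mask) | (b & ~mask);
}
```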
29. Porting crypto to GPU
• Usually pretty straightforward
• MD5, SHA-1 and the like require little to no changes
• Can be tricky sometimes
• RC4 requires many memory accesses, so a careful layout is needed
• DES requires table lookups, which are very expensive
30. Porting crypto to GPU
The DES
• Table lookups (s-boxes) are the bottleneck
• Avoid them by using bitslicing
• S-boxes replaced with logic functions
• 32 encryptions in parallel
• Requires many registers
• Performance depends on compiler heuristics
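The bitslicing idea, sketched on a toy example (not the real DES s-boxes): bit i of each machine word belongs to encryption instance i, so a single logic expression evaluates the s-box for 32 instances at once, with no memory lookups.

```c
#include <stdint.h>

/* Toy 2-in/1-out "s-box" as a lookup table (the expensive way).
   f(a,b) = ~(a ^ b), i.e. XNOR. */
static const uint8_t SBOX[4] = { 1, 0, 0, 1 };

/* Bitsliced version: 32 s-box evaluations with one logic expression. */
static uint32_t sbox_bitsliced(uint32_t a, uint32_t b) {
    return ~(a ^ b);
}

/* Reference: evaluate the table once per slice. */
static uint32_t sbox_lookup32(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        out |= (uint32_t)SBOX[(ai << 1) | bi] << i;
    }
    return out;
}
```

The real DES s-boxes are 6-in/4-out, so their logic replacements need dozens of gates and many live registers, which is why register pressure and compiler heuristics dominate performance.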
32. Scalability
Not all GPUs are created equal
1. Program should scale nicely with the number of processors on the GPU
• Query the processor count from the driver
• Partition the task accordingly:
  numThreads = F(numProcessors)
• Also helps to avoid triggering the watchdog and freezing the screen
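A minimal sketch of numThreads = F(numProcessors). The oversubscription factor of 256 threads per processor is an illustrative assumption (enough threads per processor to hide memory latency), not a figure from the slides.

```c
/* Scale the thread count with the processor count reported by the
   driver. 256 threads per processor is an assumed tuning constant. */
static int choose_num_threads(int num_processors) {
    const int threads_per_processor = 256;
    return num_processors * threads_per_processor;
}
```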
33. Scalability
8 GPUs in a system are not uncommon
2. Program should scale nicely with the number of GPUs
• Query the device count from the driver
• Spawn a CPU thread to control each device
• Partition the task accordingly
Speedup should be linear unless you hit PCIe limits
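A sketch of the partitioning step: give each device a disjoint slice of the keyspace, then (e.g. with pthread_create) spawn one CPU thread per Job to drive that device. The Job struct and the idea of a per-device search routine are illustrative assumptions.

```c
/* One job per GPU: a disjoint slice of the total keyspace. */
typedef struct {
    int  device;
    long first, count;   /* this device's slice */
} Job;

/* Split [0, total) across num_devices, spreading the remainder so
   slice sizes differ by at most one. */
static void partition(long total, int num_devices, Job jobs[]) {
    long per = total / num_devices, rem = total % num_devices, next = 0;
    for (int d = 0; d < num_devices; d++) {
        jobs[d].device = d;
        jobs[d].first  = next;
        jobs[d].count  = per + (d < rem ? 1 : 0);
        next += jobs[d].count;
    }
}
```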
34. Compatibility
Not everyone’s got Fermi. Yet.
• New hardware offers great new features
• Cache on Fermi
• bitalign instruction on RV770
• May require a different optimization strategy
• May require a separate codebase
• Support for legacy hardware shouldn’t be dropped
Be prepared to handle this sort of complexity
35. Including GPU code
Option 1: include PTX/IL code in your program
Pros:
• Recommended way
• Forward compatibility
• No hardware required
Cons:
• Compilation at runtime
• Can’t test all hardware
• IP issues
36. Including GPU code
Option 2: include pre-compiled GPU binaries
Pros:
• No dependency on users’ driver
• No compilation at runtime
• Better IP protection
Cons:
• May not work with future devices
• Need to precompile for every supported GPU
• No precompiled binary for a GPU = no support