Using Graphics Cards to Break Passwords
 



    Presentation Transcript

    • Using Graphics Cards to Break Passwords
      Andrey Belenko
      a.belenko@elcomsoft.com
    • Why use GPUs?
    • Core i7 die layout (transistor count: 1.17B):
      six cores surrounded by L3 cache, I/O & QPI, queue, and memory controller
    • Inside each core: branch prediction, fetch & L1, paging, decode, L2, μ-code, scheduler, execution units
    • The CPU dedicates only about 10% of its resources to calculations; the other 90% is support logic
    • GTX 480 die layout (transistor count: 3B)
    • The GPU dedicates about 30% of its resources to calculations
      • 2.5x more transistors than the CPU
      • ~7x more computing power overall
    • PBKDF2-SHA1 with 2000 iterations (passwords per second):
      • Core i7-970: 15.5K
      • GTX 480:    60K
      • GTX 580:    68K
      • HD 5970:    195K
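    The CPU side of this benchmark is easy to reproduce. A minimal sketch using Python's standard `hashlib.pbkdf2_hmac` (so the absolute rate will be well below an optimized native implementation, and will vary by machine); the salt and candidate format are arbitrary assumptions:

    ```python
    import hashlib
    import time

    def pbkdf2_rate(iterations=2000, trials=200):
        """Measure single-core PBKDF2-SHA1 throughput in passwords/sec."""
        start = time.perf_counter()
        for i in range(trials):
            # Each candidate password is stretched with PBKDF2-HMAC-SHA1.
            hashlib.pbkdf2_hmac("sha1", b"candidate%d" % i, b"salt", iterations)
        elapsed = time.perf_counter() - start
        return trials / elapsed

    print("~%.0f passwords/sec on one CPU core" % pbkdf2_rate())
    ```

    Because each guess costs 2000 HMAC-SHA1 iterations, throughput is thousands of passwords per second rather than millions, which is exactly why the GPU's raw arithmetic advantage matters here.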
    • How to use GPUs?
    • Basics:
      • GPUs are SIMD and excel at data-parallel tasks
      • A program for the GPU is called a ‘kernel’
      • A kernel runs in instances called threads
      • Hardware takes care of thread scheduling
      • A typical GPU has 100s of processors
      • You need 1000s of threads to fully utilize a GPU
    • Example: C = A + B
      Kernel:
        void sum (int c[], int a[], int b[]) {
            int Index = getThreadId();
            c[Index] = a[Index] + b[Index];
        }
      Adding vectors:
        int A[10], B[10], C[10];
        sum<<10>> (C, A, B);
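    The execution model behind that launch syntax can be mimicked on the host: launching a kernel with N threads means the same function body runs once per index. A sketch in Python (serial here, where real GPU hardware would schedule the instances concurrently; `launch` and `sum_kernel` are illustrative names):

    ```python
    def sum_kernel(c, a, b, index):
        # 'index' plays the role of getThreadId() inside a real kernel.
        c[index] = a[index] + b[index]

    def launch(kernel, num_threads, *args):
        # The GPU's hardware scheduler would run these concurrently;
        # we run them serially to show the per-thread decomposition.
        for index in range(num_threads):
            kernel(*args, index)

    A = list(range(10))
    B = list(range(10, 20))
    C = [0] * 10
    launch(sum_kernel, 10, C, A, B)   # analogous to sum<<10>> (C, A, B)
    print(C)
    ```

    Each "thread" touches only the element selected by its own index, which is what makes the task data-parallel.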
    • Example: MD5
      Kernel:
        void md5 (uint8 *dataIn, uint8 *dataOut) {
            int Index = getThreadId();
            uint8 *in = dataIn + MD5_BLOCK_SIZE * Index;
            uint8 *out = dataOut + MD5_HASH_SIZE * Index;
            MD5( out, in, MD5_BLOCK_SIZE );
        }
      Computing hashes:
        uint8 Src[10 * MD5_BLOCK_SIZE];
        uint8 Dst[10 * MD5_HASH_SIZE];
        md5<<10>> (Src, Dst);
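    The key idea in that kernel is the memory layout: inputs and outputs are packed into flat buffers, and each thread locates its own block by index. A host-side sketch of the same layout using `hashlib` (the block-size constant here is the MD5 input block size, an assumption matching the slide's naming):

    ```python
    import hashlib

    MD5_BLOCK_SIZE = 64   # bytes of input per thread (assumed, per slide naming)
    MD5_HASH_SIZE = 16    # bytes of MD5 digest

    def md5_kernel(data_in, data_out, index):
        # Each "thread" hashes its own fixed-size block, found by its index.
        block = data_in[index * MD5_BLOCK_SIZE:(index + 1) * MD5_BLOCK_SIZE]
        digest = hashlib.md5(block).digest()
        data_out[index * MD5_HASH_SIZE:(index + 1) * MD5_HASH_SIZE] = digest

    n = 10
    src = bytearray(n * MD5_BLOCK_SIZE)   # Src[10 * MD5_BLOCK_SIZE]
    dst = bytearray(n * MD5_HASH_SIZE)    # Dst[10 * MD5_HASH_SIZE]
    for i in range(n):                    # stands in for md5<<10>> (Src, Dst)
        md5_kernel(src, dst, i)
    ```

    Fixed-size, index-addressed blocks mean no thread needs coordination with any other, which is what lets thousands of them run independently.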
    • GPU Computing Stack:
      High-level Language -> (translation, no optimizations) -> Intermediate Language -> (optimization goes here) -> ISA -> GPU Hardware
    • GPU Computing Stack: the GPU world is bipolar
              NVIDIA                ATI
      HLL:    CUDA C, OpenCL        OpenCL
      IL:     PTX                   IL
      ISA:    Not documented        Documented for RV700 (48xx)
      HW:     G80 (8xxx) and up     RV670 (38xx) and up
    • Breaking passwords the CPU way:
      Generate password -> Compute H(p) -> Verify hash
      Computing H(p) takes the most time, so offload it to the GPU
    • Breaking passwords the GPU way:
      CPU: generate passwords -> GPU: compute H(p) ... H(p) -> CPU: verify hashes
    • Breaking passwords the GPU way (CPU -> GPU -> CPU):
      • If H(p) is fast, PCIe data transfers are the bottleneck
      • E.g. if H(p) is SHA-1, the theoretical peak is ~200M p/s
      • Solution: offload everything to the GPU
    • Breaking passwords the GPU way (GPU -> GPU -> GPU):
      Generate passwords, compute H(p), and verify hashes entirely on the GPU, so candidate passwords never cross the PCIe bus
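    The ~200M p/s ceiling can be sanity-checked with back-of-the-envelope arithmetic. The numbers below are assumptions (roughly PCIe 2.0 x16 usable bandwidth, and an assumed per-candidate transfer cost including padding/metadata), not figures from the slides:

    ```python
    # If candidates are generated on the CPU, every guess must cross PCIe,
    # so bus bandwidth caps the rate no matter how fast the GPU hashes.
    PCIE_BANDWIDTH = 8e9       # bytes/sec, assumed usable PCIe 2.0 x16 rate
    BYTES_PER_CANDIDATE = 40   # bytes per transferred guess, assumed

    peak = PCIE_BANDWIDTH / BYTES_PER_CANDIDATE
    print("Theoretical peak: %.0fM passwords/sec" % (peak / 1e6))
    ```

    Under these assumptions the bus tops out around 200M candidates per second, while a GPU can hash fast primitives like SHA-1 far faster than that; hence generating candidates on the GPU itself.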
    • How to use GPUs? Implementation considerations
    • GPU Computing Stack (recap):
              NVIDIA                ATI
      HLL:    CUDA C, OpenCL        OpenCL
      IL:     PTX                   IL
      ISA:    Not documented        Documented for RV700 (48xx)
      HW:     G80 (8xxx) and up     RV670 (38xx) and up
    • Choosing a language: CUDA C vs. PTX
      • C code translates into PTX without optimizations
      • Optimization is done when compiling PTX
      • Intrinsics provide access to device-specific instructions
      There is no real reason to develop in PTX
    • Choosing a language: OpenCL
      • Portability requires compilation at runtime
        • May take significant time and resources
        • Compiler is part of the driver ➯ testing hell
        • Requires source code in an HLL ➯ IP issues
      • Implementations are not complete and vary across vendors
      Not mature enough
    • Choosing a language: ATI IL
      • The only viable option if you love your users
        • Access to device-specific instructions
        • Best performance
      • Not an option if you love your developers
        • Poor documentation, poor samples
        • Meaningless compiler errors, no debugger
    • Achieving performance:
      • Minimize data transfers
      • Minimize memory accesses, or at least plan them carefully
      • Minimize the number of registers used; fewer registers per thread means more threads can run simultaneously
      • Schedule enough threads to keep the GPU's processors busy
      • Avoid thread divergence
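    The register/thread trade-off can be made concrete with a rough occupancy model: each multiprocessor has a fixed register file, so register-hungry kernels leave room for fewer resident threads. The hardware limits below are illustrative assumptions, not specifications of any particular GPU:

    ```python
    REGISTER_FILE = 32768   # registers per multiprocessor, assumed
    MAX_THREADS = 1536      # hardware thread limit per multiprocessor, assumed

    def resident_threads(regs_per_thread):
        # Threads resident at once: capped by either the register file
        # or the hardware's per-multiprocessor thread limit.
        return min(MAX_THREADS, REGISTER_FILE // regs_per_thread)

    for regs in (16, 32, 64):
        print(regs, "regs/thread ->", resident_threads(regs), "threads")
    ```

    Doubling register use from 32 to 64 per thread halves the number of resident threads in this model, which directly reduces the GPU's ability to hide memory latency.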
    • Porting crypto to the GPU:
      • Usually pretty straightforward: MD5, SHA-1, and the like require little to no change
      • Can be tricky sometimes:
        • RC4 requires many memory accesses, so a careful layout is needed
        • DES requires table lookups, which are very expensive
    • Porting crypto to the GPU: DES
      • Table lookups (S-boxes) are the bottleneck
      • Avoid them by using bitslicing:
        • S-boxes replaced with logic functions
        • 32 encryptions in parallel
        • Requires many registers
        • Performance depends on compiler heuristics
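    The bitslicing trick is worth seeing in miniature: pack 32 independent one-bit values into one 32-bit word, and a single sequence of AND/XOR/NOT then evaluates the same boolean function for all 32 "encryptions" at once, with no table lookup. The three-input function below is a toy stand-in, not a real DES S-box:

    ```python
    MASK = 0xFFFFFFFF

    def sbox_scalar(a, b, c):
        # One s-box output bit as a boolean function of three input bits.
        return ((a & b) ^ ((~a) & c)) & 1

    def sbox_bitsliced(A, B, C):
        # Identical logic, applied to all 32 lanes simultaneously.
        return ((A & B) ^ ((~A) & C)) & MASK

    # Lane i holds the inputs of the i-th parallel "encryption".
    a_bits = [(i >> 0) & 1 for i in range(32)]
    b_bits = [(i >> 1) & 1 for i in range(32)]
    c_bits = [(i >> 2) & 1 for i in range(32)]
    pack = lambda bits: sum(bit << i for i, bit in enumerate(bits))

    result = sbox_bitsliced(pack(a_bits), pack(b_bits), pack(c_bits))
    for i in range(32):
        assert (result >> i) & 1 == sbox_scalar(a_bits[i], b_bits[i], c_bits[i])
    print("all 32 lanes match the scalar function")
    ```

    A real bitsliced DES keeps dozens of such intermediate words live at once, which is where the heavy register pressure mentioned above comes from.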
    • How to use GPUs? Real-world problems
    • Scalability: not all GPUs are created equal
      1. A program should scale nicely with the number of processors on the GPU
        • Query the processor count from the driver
        • Partition the task accordingly: numThreads = F(numProcessors)
        • This also helps avoid triggering the watchdog and freezing the screen
    • Scalability: 8 GPUs in a system are not uncommon
      2. A program should scale nicely with the number of GPUs
        • Query the device count from the driver
        • Spawn a CPU thread to control each device
        • Partition the task accordingly
      Speedup should be linear unless you hit PCIe limits
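    Both scaling rules reduce to the same mechanical step: split the candidate keyspace into contiguous ranges, one per device (and per thread within a device). A minimal sketch of that partitioning; the function name and the keyspace size are illustrative:

    ```python
    def partition(total, num_workers):
        """Split `total` candidates into contiguous [start, end) ranges,
        with sizes differing by at most one so no worker idles early."""
        base, extra = divmod(total, num_workers)
        ranges, start = [], 0
        for i in range(num_workers):
            size = base + (1 if i < extra else 0)
            ranges.append((start, start + size))
            start += size
        return ranges

    # E.g. a keyspace of just over a million candidates across 4 GPUs:
    print(partition(10**6 + 3, 4))
    ```

    Keeping chunk sizes within one candidate of each other is what makes the speedup linear: the slowest worker finishes essentially at the same time as the rest.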
    • Compatibility: not everyone’s got Fermi. Yet.
      • New hardware offers great new features
        • Cache on Fermi
        • bitalign instruction on RV770
      • May require a different optimization strategy
      • May require a separate codebase
      • Support for legacy hardware shouldn’t be dropped
      Be prepared to handle this sort of complexity
    • Including GPU code, option 1: include PTX/IL code in your program
      Pros:
        • Recommended way
        • Forward compatibility
        • No hardware required
      Cons:
        • Compilation at runtime
        • Can’t test all hardware
        • IP issues
    • Including GPU code, option 2: include pre-compiled GPU binaries
      Pros:
        • No dependency on users’ drivers
        • No compilation at runtime
        • Better IP protection
      Cons:
        • May not work with future devices
        • Need to precompile for every supported GPU
        • No precompiled binary for a GPU = no support
    • Questions?
    • Thank you
    • Using Graphics Cards to Break Passwords
      Andrey Belenko
      a.belenko@elcomsoft.com