[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (Gregory Diamos, Georgia Tech)
Course: http://cs264.org
Presentation Transcript

  • 1. Dynamic Compilation for Massively Parallel Processors. Gregory Diamos, PhD candidate, Georgia Institute of Technology and NVIDIA Research. April 14, 2011.
  • 2. What is an execution model?
  • 3. Goals of programming languages. Programming languages are designed for productivity. Efficiency is measured in terms of: (1) cost: hardware investment, power consumption, and area requirements; (2) complexity: application development effort; (3) speed: amount of work performed per unit time.
  • 4. Goals of processor architecture. Hardware is designed for speed and efficiency.
  • 5. Goals of processor architecture - 2. Hardware is constrained by the limitations of physical devices. [1] M. Koyanagi, T. Fukushima, and T. Tanaka, "High-Density Through Silicon Vias for 3-D LSIs." [2] Novoselov et al., "Electric Field Effect in Atomically Thin Carbon Films." [3] Intel Corp., 22 nm test chip.
  • 6. Execution models bridge the gap
  • 7. Goals of execution models. Execution models provide impedance matching between applications and hardware. Goals: leverage common optimizations across multiple applications; limit the impact of hardware changes on software. ISAs have traditionally been effective execution models.
  • 8. Programming challenges of heterogeneity. The introduction of heterogeneous and multi-core processors (Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi) changes the hardware/software interface: (1) multi-core creates multiple interfaces; (2) heterogeneity creates different interfaces; (3) both increase software complexity.
  • 9. Program the entire processor, not individual cores. (New execution model abstractions are needed.)
  • 10. Emerging execution models
  • 11. Bulk-synchronous parallel (BSP). [1] Leslie Valiant, "A bridging model for parallel computation." (A sketch of a BSP superstep follows below.)
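To make the superstep structure concrete, here is a minimal BSP-style sketch in CUDA. It assumes one CTA stands in for the BSP machine, with __syncthreads() as the barrier; the kernel name and constants are illustrative, not from the talk. Each superstep performs local computation, communicates through shared memory, and ends at a barrier.

```cuda
#include <cstdio>

#define N 256        // threads per block (one BSP "processor" each)
#define NUM_STEPS 4  // number of supersteps

// One BSP superstep = local computation, communication through
// shared memory, then a barrier before the next superstep begins.
__global__ void bsp_step(float *data) {
    __shared__ float comm[N];
    int tid = threadIdx.x;
    float local = data[tid];

    for (int step = 0; step < NUM_STEPS; ++step) {
        local = local * 0.5f + 1.0f;    // 1. local computation
        comm[(tid + 1) % N] = local;    // 2. communicate to a neighbor
        __syncthreads();                // 3. barrier ends the superstep
        local += comm[tid];
        __syncthreads();                // comm[] is safe to reuse
    }
    data[tid] = local;
}

int main() {
    float h[N], *d;
    for (int i = 0; i < N; ++i) h[i] = (float)i;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    bsp_step<<<1, N>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("data[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}
```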
  • 12. The Parallel Thread eXecution (PTX) Model. PTX defines a kernel as a 2-level grid of bulk-synchronous tasks (see the launch sketch below).
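For reference, a minimal CUDA example of the two levels (CUDA compiles to PTX): the grid is a set of independent CTAs, and each CTA is a block of threads that can share memory and barriers. The kernel is a standard SAXPY, chosen here only for illustration.

```cuda
#include <cstdio>

// One launch = a grid of CTAs; one CTA = a block of threads.
// Threads within a CTA may share memory and barrier-synchronize;
// CTAs are independent, bulk-synchronous tasks.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 2-level thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    int cta = 256;                   // threads per CTA
    int grid = (n + cta - 1) / cta;  // CTAs per grid
    saxpy<<<grid, cta>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("launched %d CTAs of %d threads\n", grid, cta);
    cudaFree(x); cudaFree(y);
    return 0;
}
```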
  • 13. Dynamically translating PTX. Dynamic compilers can transform this parallelism to fit the hardware.
  • 14. Beyond PTX - Data distributions
  • 15. Beyond PTX - Memory hierarchies. [1] Leslie Valiant, "A bridging model for multi-core computing." [2] Fatahalian et al., "Sequoia: Programming the memory hierarchy."
  • 16. Dynamic compilation/binary translation
  • 17. Binary translation
  • 18. Binary translators are everywhere. If you are running a browser, you are using dynamic compilation.
  • 19. x86 binary translation
  • 20. Low Level Virtual Machines. Compile all programs to a common virtual machine representation (LLVM IR) and keep it around. Perform common optimizations on this IR. Target various machines by lowering it to a native ISA, either statically or via JIT compilation.
  • 21. Execution model translation
  • 22. Execution model translation. Extend binary translation to execution model translation. Dynamic compilers can map threads/tasks to the hardware.
  • 23. Different core architectures. Can we target these from the same execution model? What about efficiency?
  • 24. Ocelot. Ocelot enables thread-aware compiler transformations.
  • 25. Mapping CTAs to cores - thread fusion. [Figure: original vs. transformed PTX code, with a scheduler block and register spills/restores around a barrier.] Transform threads into loops over the program. Distribute the loops to handle barriers. (A sketch of the transformation follows below.)
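A minimal sketch of the fusion, written as host C++ (the function and variable names are illustrative, not Ocelot's actual output): the kernel body is split at the barrier, each half becomes a loop over all thread IDs, and the value live across the barrier is spilled to a per-thread slot and restored afterward.

```cuda
#include <vector>

// Original (conceptual) kernel body:
//   tmp = in[tid] * 2.0f;      // region 1
//   __syncthreads();           // barrier
//   out[tid] = tmp + in[0];    // region 2 reads tmp across the barrier
//
// Fused CPU version: one loop per barrier-delimited region; the
// live value `tmp` is spilled to a per-thread slot at the barrier.
void fused_cta(const float *in, float *out, int num_threads) {
    std::vector<float> spill(num_threads);   // spilled registers

    // Loop 1: everything before the barrier, for every thread.
    for (int tid = 0; tid < num_threads; ++tid)
        spill[tid] = in[tid] * 2.0f;         // spill `tmp`

    // The barrier is now implicit: loop 1 has finished for all threads.

    // Loop 2: everything after the barrier; restore `tmp`.
    for (int tid = 0; tid < num_threads; ++tid)
        out[tid] = spill[tid] + in[0];       // restore `tmp`
}

int main() {
    std::vector<float> in(32, 1.0f), out(32);
    fused_cta(in.data(), out.data(), 32);
    return 0;
}
```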
  • 26. Mapping CTAs to cores - vectorization. Pack adjacent threads into vector instructions. Speculate that divergence never occurs; check in case it does. (A sketch follows below.)
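A hedged scalar sketch of the speculation (no intrinsics, and all names are illustrative): a 4-wide "warp" of fused threads executes a vector fast path when its branch condition is uniform, and falls back to per-thread scalar code when the check detects divergence.

```cuda
// Conceptual per-thread body: y[i] = (x[i] > 0) ? x[i] * 2 : 0,
// executed for a "warp" of 4 fused threads at once.
void vectorized_warp(const float *x, float *y, int base) {
    // Speculation check: is the branch condition uniform?
    bool c0 = x[base + 0] > 0, c1 = x[base + 1] > 0,
         c2 = x[base + 2] > 0, c3 = x[base + 3] > 0;

    if (c0 && c1 && c2 && c3) {
        // Fast path: no divergence; execute as one 4-wide vector op
        // (a real translator would emit SSE/AVX instructions here).
        for (int l = 0; l < 4; ++l) y[base + l] = x[base + l] * 2.0f;
    } else if (!c0 && !c1 && !c2 && !c3) {
        for (int l = 0; l < 4; ++l) y[base + l] = 0.0f;
    } else {
        // Divergence detected: fall back to scalar per-thread code.
        for (int l = 0; l < 4; ++l)
            y[base + l] = (x[base + l] > 0) ? x[base + l] * 2.0f : 0.0f;
    }
}

int main() {
    float x[8] = {1, 2, 3, 4, -1, 2, 3, 4}, y[8];
    vectorized_warp(x, y, 0);  // uniform warp: takes the vector path
    vectorized_warp(x, y, 4);  // divergent warp: scalar fallback
    return 0;
}
```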
  • 27. Mapping CTAs to cores - multiple instruction streams. [Figure: instructions from threads T0-T3 interleaved onto shared functional units.] Instructions from different threads are independent: merge the instruction streams and statically schedule them on the functional units.
  • 28. PTX analysis
  • 29. Divergence analysis
  • 30. Subkernels
  • 31. Thread frontier analysis. Supporting control flow on SIMD processors requires finding divergent branches and potential re-convergence points. [Figure: a control-flow graph for the compound, short-circuit conditional if ((cond1() || cond2()) && (cond3() || cond4())), comparing immediate post-dominator re-convergence against re-convergence at thread frontiers; each block B1-B5 is annotated with its thread frontier, and push/pop re-convergence stack events are traced for threads T0-T3.] Compiler analysis can identify immediate post-dominators or thread frontiers as re-convergence points. (A small example of such a branch follows below.)
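For intuition, a small CUDA kernel containing the figure's compound conditional (the predicate functions are made up for illustration). Short-circuit evaluation of || and && lowers to a chain of branches, so threads of one warp can sit in different blocks until they re-converge.

```cuda
__device__ bool cond1(int v) { return (v & 1) != 0; }
__device__ bool cond2(int v) { return (v & 2) != 0; }
__device__ bool cond3(int v) { return (v & 4) != 0; }
__device__ bool cond4(int v) { return (v & 8) != 0; }

__global__ void divergent(const int *in, int *out) {
    int tid = threadIdx.x;
    int v = in[tid];
    // Short-circuit evaluation becomes a chain of branches: a
    // thread that skips cond2() because cond1() was true is at a
    // different point than one that evaluates it.  The statement's
    // immediate post-dominator is the line after the if/else; a
    // thread frontier may let subsets of threads rejoin earlier.
    if ((cond1(v) || cond2(v)) && (cond3(v) || cond4(v)))
        out[tid] = 1;
    else
        out[tid] = 0;
}
```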
  • 32. Consequences of architecture differences
  • 33. Degraded performance portability. [Figure: two plots of SGEMM throughput in GFLOPS vs. matrix size N (0-6000) on Fermi and AMD hardware.] Performance of two OpenCL applications, one tuned for AMD, the other for NVIDIA.
  • 34. Memory traversal patterns. [Figure: addresses touched per cycle by a 4-wide warp vs. a single fused thread loop.] Thread loops change row-major memory accesses into column-major accesses. (The sketch below shows the two traversal orders.)
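The effect can be seen by printing the address order under each mapping; a small illustrative sketch (WIDTH, ITERS, and the function names are assumptions):

```cuda
#include <cstdio>

#define WIDTH 4   // warp width / number of fused threads
#define ITERS 4   // loop iterations per thread

// GPU order: in each cycle, the 4 warp threads touch 4 consecutive
// addresses (coalesced, effectively row-major over the array).
void gpu_order(void) {
    for (int i = 0; i < ITERS; ++i)          // cycle
        for (int t = 0; t < WIDTH; ++t)      // threads, in lockstep
            printf("cycle %d: thread %d reads a[%d]\n",
                   i, t, i * WIDTH + t);     // consecutive addresses
}

// Fused CPU order: each fused thread runs to completion, so one
// thread walks the array with stride WIDTH (column-major order).
void cpu_fused_order(void) {
    for (int t = 0; t < WIDTH; ++t)          // thread loop (fusion)
        for (int i = 0; i < ITERS; ++i)
            printf("thread %d reads a[%d]\n",
                   t, i * WIDTH + t);        // stride-WIDTH addresses
}

int main() {
    gpu_order();
    cpu_fused_order();
    return 0;
}
```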
  • 35. Reduced memory bandwidth on CPUs. [Figure: code optimized for a single-threaded CPU vs. code optimized for SIMD (GPU).] This reduces memory bandwidth by 10x for a memory microbenchmark running on a 4-core CPU.
  • 36. The good news
  • 37. Scaling across three decades of processors. Many existing applications still scale. [Figure: 12x and 480x annotations across processor generations.] A GTX 280 has 40x more peak FLOPS than a Phenom and 480x more than an Atom.
  • 38. Questions?
  • 39. Databases on GPUs
  • 40. Who cares about databases?
  • 41. What do applications look like?
  • 42. Gobs of data
  • 43. Distributed systems
  • 44. Lots of parallelism
  • 45. What do CPU algorithms look like?
  • 46. B-trees
  • 47. Sequential algorithms. [Figure: a merge join stepping cursors through relation 1 and relation 2, using <, =, and > key comparisons to build the result.] (A sketch of this join follows below.)
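A minimal sketch of the sequential merge join the figure depicts, assuming both relations are sorted on the join key (duplicate-key handling is elided for brevity):

```cuda
#include <cstdio>
#include <utility>
#include <vector>

// Sequential sorted merge join: advance whichever cursor points at
// the smaller key (<, >); on a match (=), emit a result tuple.
std::vector<std::pair<int,int>> merge_join(const std::vector<int> &r1,
                                           const std::vector<int> &r2) {
    std::vector<std::pair<int,int>> result;
    size_t i = 0, j = 0;
    while (i < r1.size() && j < r2.size()) {
        if (r1[i] < r2[j])      ++i;        // '<' : advance relation 1
        else if (r1[i] > r2[j]) ++j;        // '>' : advance relation 2
        else {                              // '=' : keys match
            result.push_back({r1[i], r2[j]});
            ++i; ++j;  // (simplified: assumes unique keys)
        }
    }
    return result;
}

int main() {
    std::vector<int> r1 = {1, 3, 5, 7}, r2 = {3, 4, 5, 8};
    for (auto &t : merge_join(r1, r2))
        printf("joined key %d\n", t.first);
    return 0;
}
```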
  • 48. It doesn't look good. Outlook not so good...
  • 49. Or does it? Where is the parallelism?
  • 50. Flattened trees (one possible realization is sketched below)
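One plausible reading of this step, sketched under assumptions (the talk's exact data structure is not shown here): flatten the tree into a sorted array so each GPU thread can run an independent binary search for its own key, exposing one lookup per thread of parallelism.

```cuda
// A tree flattened into a sorted array: no pointers to chase, and
// every thread searches for its own key independently.
__global__ void batch_search(const int *sorted, int n,
                             const int *keys, int *found, int m) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= m) return;
    int key = keys[tid], lo = 0, hi = n - 1;
    found[tid] = 0;
    while (lo <= hi) {                  // standard binary search
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) { found[tid] = 1; break; }
        if (sorted[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
}
```

Launched with one thread per query, e.g. batch_search<<<(m + 255) / 256, 256>>>(sorted, n, keys, found, m).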
  • 51. Relational algebra
  • 52. A case study: inner join
  • 53. 1. Recursive partitioning
  • 54. 2. Block streaming. Blocking into pages, shared memory buffers, and transaction-sized chunks makes memory accesses efficient. (A sketch of shared-memory blocking follows below.)
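A hedged sketch of shared-memory blocking for an inner join (the tile size, names, and counting-only output are illustrative, not the talk's implementation; it assumes CTAs of TILE threads): each CTA streams chunks of relation B through shared memory so the inner comparison loop hits on-chip storage instead of DRAM.

```cuda
#include <climits>

#define TILE 256  // must equal the CTA size at launch

// Each CTA streams TILE-sized chunks of relation B through shared
// memory; every thread then compares its element of relation A
// against the cached chunk (a tiled nested-loop join, counting only).
__global__ void tiled_join_count(const int *A, int nA,
                                 const int *B, int nB, int *matches) {
    __shared__ int tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int a = (i < nA) ? A[i] : 0;
    int count = 0;

    for (int base = 0; base < nB; base += TILE) {
        // Cooperative, coalesced load of the next chunk of B.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < nB) ? B[j] : INT_MIN;  // pad sentinel
        __syncthreads();

        if (i < nA)                        // scan the on-chip chunk
            for (int k = 0; k < TILE && base + k < nB; ++k)
                count += (a == tile[k]);
        __syncthreads();                   // before overwriting the tile
    }
    if (i < nA) matches[i] = count;
}
```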
  • 55. 3. Shared memory merging network. A merging network for join, similar to a sorting network, can be constructed.
  • 56. 4. Data chunking. Stream compaction packs result data into chunks that can be streamed out of shared memory efficiently. (A minimal compaction kernel follows below.)
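A minimal stream-compaction sketch. It claims dense output slots with a global atomic counter; the predicate and names are illustrative assumptions.

```cuda
// Pack only the elements that satisfy a predicate into a dense
// output buffer.  `counter` is a single device int zeroed before
// the launch.
__global__ void compact(const int *in, int n, int *out, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (in[i] & 1)) {           // predicate: keep odd values
        int pos = atomicAdd(counter, 1);  // claim a dense output slot
        out[pos] = in[i];
    }
}
```

A tuned version would first compact within each CTA's shared memory (e.g., with a block-level prefix sum) and write whole chunks out, which is the pattern the slide describes.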
  • 57. Operator fusion
  • 58. Will it blend?
  • 59. Yes, it blends.

    Operator        NVIDIA C2050     Phenom 9570
    --------------  ---------------  ---------------
    inner-join      26.4-32.3 GB/s   0.11-0.63 GB/s
    select          104.2 GB/s       2.55 GB/s
    set operators   45.8 GB/s        0.72 GB/s
    projection      54.3 GB/s        2.34 GB/s
    cross product   98.8 GB/s        2.67 GB/s
  • 60. Questions?
  • 61. Conclusions. Emerging heterogeneous architectures need matching execution model abstractions; dynamic compilation can enable portability. When writing massively parallel codes, consider: data structures and algorithms; the mapping onto the execution model; transformations in the compiler/runtime; and the processor micro-architecture.
  • 62. Thoughts on open source software
  • 63. Questions? Contact me: gregory.diamos@gatech.edu. Contribute to Harmony, Ocelot, and Vanaheimr: http://code.google.com/p/harmonyruntime/ http://code.google.com/p/gpuocelot/ http://code.google.com/p/vanaheimr/