[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)

Published on

http://cs264.org

http://j.mp/fmzhYl


1. The R-Stream High-Level Program Transformation Tool
   N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin
   Reservoir Labs, Harvard, 04/12/2011
2. Outline
   • R-Stream Overview
   • Compilation Walk-through
   • Performance Results
   • Getting R-Stream
3. Power efficiency driving architectures
   (Architecture diagram: SIMD units, FPGAs, DMA-fed memories, and GPPs.)
   Trends in power-efficient architectures:
   • Heterogeneous
   • NUMA
   • Distributed local memories
   • Hierarchical (including board, chassis, cabinet)
   • Explicitly managed memory
   • Multiple execution models
   • Bandwidth starved
   • Multiple parallelism types
   • Mixed spatial dimensions
4. Computation choreography
   • Expressing it
     - Annotations and pragma dialects for C
     - Explicitly (e.g., new languages like CUDA and OpenCL)
   • But before expressing it, how can programmers find it?
     - Manual constructive procedures: art, sweat, time
       (artisans get complete control over every detail)
     - Automatically (our focus)
       - An operations research problem, like scheduling trucks to save fuel
       - Model, solve, implement
       - Faster, and sometimes better, than a human
5. How to do automatic scheduling?
   • Naïve approach
     - Model: tasks, dependences, resource use, latencies, and a machine
       model with connectivity and resource capacities
     - Solve with integer linear programming (ILP): minimize overall
       task length, subject to dependences and resource use
     - Problems: complexity (the task graph is huge!) and dynamics
       (loop lengths are unknown at compile time)
   • So we do something much more cool.
6. Program Transformations Specification
   (Figure: the iteration space of a statement S(i,j), mapped by a
   schedule Θ : Z² → Z² from (i,j) coordinates to (t1,t2) time
   coordinates.)
   • Schedule maps iterations to multi-dimensional time
     - A feasible schedule preserves dependences
   • Placement maps iterations to multi-dimensional space
     - UHPC in progress, partially done
   • Layout maps data elements to multi-dimensional space
     - UHPC in progress
   • Hierarchical by design; tiling serves separation of concerns
7. Loop transformations
   Original loop nest:
     for (i=0; i<N; i++)
       for (j=0; j<N; j++)
         S(i,j);
   Unimodular transformations:
   • Permutation, Θ(i,j) = (j,i), matrix [0 1; 1 0]:
       for (j=0; j<N; j++)
         for (i=0; i<N; i++)
           S(i,j);
   • Reversal, Θ(i,j) = (-i,j), matrix [-1 0; 0 1]:
       for (i=N-1; i>=0; i--)
         for (j=0; j<N; j++)
           S(i,j);
   • Skewing, Θ(i,j) = (i, α*i+j), matrix [1 0; α 1]:
       for (i=0; i<N; i++)
         for (j=α*i; j<N+α*i; j++)
           S(i,j-α*i);
   • Scaling, Θ(i,j) = (β*i,j), matrix [β 0; 0 1]:
       for (i=0; i<β*N; i+=β)
         for (j=0; j<N; j++)
           S(i/β,j);
8. Loop fusion and distribution
   Fused form:
     for (i=0; i<N; i++)
       for (j=0; j<N; j++) {
         S1(i,j);
         S2(i,j);
       }
   Distributed form (fission at depth 2):
     for (i=0; i<N; i++) {
       for (j=0; j<N; j++)
         S1(i,j);
       for (j=0; j<N; j++)
         S2(i,j);
     }
   As multi-dimensional schedules, where the constant dimensions
   interleave the two statements:
   • Fusion:       Θ1(i,j) = (0, i, 0, j, 0),  Θ2(i,j) = (0, i, 0, j, 1)
   • Distribution: Θ1(i,j) = (0, i, 0, j, 0),  Θ2(i,j) = (0, i, 1, j, 0)
9. Enabling technology is new compiler math
   • Uniform Recurrence Equations [Karp et al. 1970]
   • Loop transformations and parallelization [1970-]
     - Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam,
       Pugh, Pingali, etc.
     - Vectorization, SMP, locality optimizations
     - Dependence summary: direction/distance vectors
     - Unimodular transformations
     - Systolic array mapping; mostly linear-algebraic
   • Polyhedral model [1980-]
     - Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
     - Exact dependence analysis
     - General affine transformations
     - Loop synthesis via polyhedral scanning
     - New computational techniques based on polyhedral representations
10. R-Stream model: polyhedra
    n = f();
    for (i=5; i<=n; i+=2) {
      A[i][i] = A[i][i]/B[i];
      for (j=0; j<=i; j++) {
        if (j<=10) {
          … A[i+2j+n][i+3] …
        }
      }
    }
    Iteration domain of the inner statement:
      { (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n, 0 ≤ j ≤ i, j ≤ 10, i = 2k+1 }
    Access function of A[i+2j+n][i+3], as a matrix over (i, j, n, 1):
      [1 2 1 0]
      [1 0 0 3]
    • Affine and non-affine transformations
    • Order and place of operations and data
    • Loop code is represented (exactly or conservatively) with polyhedra
    • A high-level, mathematical view of a mapping that nevertheless
      targets concrete properties: parallelism, locality, memory footprint
11. Polyhedral slogans
    • Parametric imperfect loop nests
    • Subsumes classical transformations
    • Compacts the transformation search space
    • Parallelization, locality optimization (communication avoiding)
    • Preserves semantics
    • Analytic joint formulations of optimizations
    • Not just for affine static control programs
12. Polyhedral model: challenges in building a compiler
    • Killer math
    • Scalability of optimizations and code generation
    • Mostly confined to dependence-preserving transformations
    • Code can be radically transformed; outputs can look wildly different
    • Modeling indirections, pointers, non-affine code
    • Many of these challenges are solved
13. R-Stream blueprint
    (Block diagram: the EDG C front end produces a scalar
    representation; raising lifts it into the polyhedral mapper, which
    is driven by a machine model; lowering returns to the scalar
    representation, and a pretty printer emits the mapped code.)
14. Inside the polyhedral mapper
    (Diagram: the GDG representation is transformed by a tactics module
    that sequences the optimization modules.)
    • Parallelization, locality optimization, tiling
    • Placement, communication generation
    • Memory promotion, sync generation, layout optimization
    • Polyhedral scanning (Jolylib, …)
15. Inside the polyhedral mapper (cont.)
    The optimization modules are engineered to expose "knobs" that can
    be used by an auto-tuner.
16. Driving the mapping: the machine model
    • Target machine characteristics that influence how the mapping
      should be done:
      - Local memory / cache sizes
      - Communication facilities: DMA, cache(s)
      - Synchronization capabilities
      - Symmetrical or not
      - SIMD width
      - Bandwidths
    • Currently a two-level model (host and accelerators)
    • XML schema and graphical rendering
17. Machine model example: multi-Tesla
    (Diagram: a host running one thread per GPU, described by an XML
    machine model file; the host side uses the OpenMP morph, the GPU
    side the CUDA morph.)
18. Mapping process
    1. Scheduling: parallelism, locality, tilability
    2. Task formation:
       - Coarse-grain atomic tasks
       - Master/slave side operations
    3. Placement: assign tasks to blocks/threads
       - Local / global data layout optimization
       - Multi-buffering (explicitly managed)
       - Synchronization (barriers)
       - Bulk communications
       - Thread generation -> master/slave
       - CUDA-specific optimizations
19. Program Transformations Specification (recap of slide 6)
    • Schedule maps iterations to multi-dimensional time
      - A feasible schedule preserves dependences
    • Placement maps iterations to multi-dimensional space
      - UHPC in progress, partially done
    • Layout maps data elements to multi-dimensional space
      - UHPC in progress
    • Hierarchical by design; tiling serves separation of concerns
20. Model for scheduling trades 3 objectives jointly (patent pending)
    (Diagram:
    • Loop fission yields more parallelism and sufficient occupancy
    • Loop fusion yields more locality and fewer global memory accesses
    • Successive-thread contiguity yields memory coalescing and better
      effective bandwidth)
21. Optimization with BLAS vs. global optimization
    Optimization with BLAS (numerous cache misses):
      for loop {
        …
        BLAS call 1     // store data Z back to disk
        …
        BLAS call 2     // retrieve data Z from disk !!!
        …
        BLAS call n
        …
      }
    Global optimization (outer loops can be parallelized; loop fusion
    can improve locality):
      doall loop {
        …
        for loop {
          …
          [write to Z]
          …
          [read from Z]
          …
        }
      }
    Global optimization can expose better parallelism and locality.
22. Tradeoffs between parallelism and locality
    • Significant parallelism is needed to fully utilize all resources
      (high on-chip parallelism)
    • Locality is also critical to minimize communication: bandwidth is
      limited at the chip border, and reusing data once it is loaded on
      chip is what locality buys
    • Parallelism can come at the expense of locality
    • Our approach: the R-Stream compiler exposes parallelism via
      affine scheduling that simultaneously improves locality using
      loop fusion
23. Parallelism/locality tradeoff example
    /* Original code: simplified CSLC-LMS */
    for (k=0; k<400; k++) {
      for (i=0; i<3997; i++) {
        z[i] = 0;
        for (j=0; j<4000; j++)
          z[i] = z[i] + B[i][j]*x[k][j];
      }
      for (i=0; i<3997; i++)
        w[i] = w[i] + z[i];
    }
    Maximum parallelism (no fusion): maximum distribution destroys
    locality, and array z gets expanded to introduce another level of
    parallelism:
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        z_e[j][i] = 0;
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        for (k=0; k<4000; k++)
          z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];
    Two levels of parallelism, but poor data reuse (on array z_e).
24. Parallelism/locality tradeoff example (cont.)
    Aggressive loop fusion destroys parallelism (i.e., only 1 degree of
    parallelism remains):
    doall (i=0; i<3997; i++)       /* max. fusion */
      for (j=0; j<400; j++) {
        z[i] = 0;
        for (k=0; k<4000; k++)
          z[i] = z[i] + B[i][k]*x[j][k];
        w[i] = w[i] + z[i];
      }
    Very good data reuse (on array z), but only 1 level of parallelism.
25. Parallelism/locality tradeoff example (cont.)
    Partial fusion does not decrease parallelism; array z is expanded:
    doall (i=0; i<3997; i++) {
      doall (j=0; j<400; j++) {
        z_e[i][j] = 0;
        for (k=0; k<4000; k++)
          z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
      }
      for (j=0; j<400; j++)        /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    }
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];
    Two levels of parallelism with good data reuse (on array z_e).
26. Parallelism/locality tradeoffs: performance numbers
    (Performance chart.) Code with a good balance between parallelism
    and fusion performs best. In explicitly managed memory/scratchpad
    architectures this is even more true.
27. R-Stream: affine scheduling and fusion (patent pending)
    • R-Stream uses a heuristic based on an objective function with
      several cost coefficients:
      - p_l: the slowdown in execution if loop l is executed
        sequentially rather than in parallel
      - f_e: the cost in performance if the two loops joined by edge e
        remain unfused rather than fused
        minimize  Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
    • These two cost coefficients address parallelism and locality in a
      unified and unbiased manner (as opposed to traditional compilers)
    • Fine-grained parallelism, such as SIMD, can also be modeled with
      a similar formulation
28. Parallelism + locality + spatial locality
        minimize  Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
    • The first sum captures the benefits of parallel execution, the
      second the benefits of improved locality
    • Hypothesis: auto-tuning should adjust these parameters
    • A new (unpublished) algorithm balances contiguity to enhance
      coalescing for GPUs and SIMDization, modulo data-layout
      transformations
29. Outline
    • R-Stream Overview
    • Compilation Walk-through
    • Performance Results
    • Getting R-Stream
30. What R-Stream does for you, in a nutshell
    • Input: sequential code
      - Short and simple textbook C code
      - Just add a "#pragma map" and R-Stream figures out the rest
31. What R-Stream does for you, in a nutshell (cont.)
    • Input: sequential code
      - Short and simple textbook C code; just add a "#pragma map" and
        R-Stream figures out the rest
      - Example: a Gauss-Seidel 9-point stencil
        - Used in iterative PDE solvers for scientific modeling (heat,
          fluid flow, waves, etc.)
        - A building block for faster iterative solvers like Multigrid
          or AMR
        - Very difficult to hand-optimize; not available in any
          standard library
    • Output: OpenMP + CUDA code
      - Hundreds of lines of tightly optimized GPU-side CUDA code
      - A few lines of host-side OpenMP C code
32. What R-Stream does for you, in a nutshell (cont.)
    • Input: sequential code, as above
    • Output: OpenMP + CUDA code achieving up to
      - 20 GFLOPS on a GTX 285
      - 25 GFLOPS on a GTX 480
      (illustrated in the next few slides)
33. Finding and utilizing available parallelism
    (Excerpt of automatically generated code, shown against the GPU
    architecture diagram: SMs with shared memory, registers, and SPs
    over off-chip device memory.)
    R-Stream AUTOMATICALLY finds and forms parallelism: extracting and
    mapping parallel loops.
34. Memory compaction on GPU scratchpad
    (Excerpt of automatically generated code.)
    R-Stream AUTOMATICALLY manages the local scratchpad.
35. GPU DRAM to scratchpad coalesced communication
    (Excerpt of automatically generated code.)
    R-Stream AUTOMATICALLY chooses parallelism to favor coalescing:
    coalesced GPU DRAM accesses.
36. Host-to-GPU communication
    (Excerpt of automatically generated code.)
    R-Stream AUTOMATICALLY chooses the partition and sets up
    host-to-GPU communication across PCI Express.
37. Multi-GPU mapping
    (Excerpt of automatically generated code.)
    R-Stream AUTOMATICALLY finds another level of parallelism, across
    GPUs: the mapping spans all GPUs.
38. Multi-GPU mapping (cont.)
    (Excerpt of automatically generated code.)
    R-Stream AUTOMATICALLY creates n-way software pipelines for
    communications: multi-streaming of host-GPU communication.
39. Future capabilities: mapping to CPU-GPU clusters
    (Diagram: one program mapped onto CPU+GPU nodes connected by a
    high-speed interconnect, e.g. InfiniBand; on each node an MPI
    process runs an OpenMP process that launches CUDA kernels on the
    node's GPUs.)
40. Outline
    • R-Stream Overview
    • Compilation Walk-through
    • Performance Results
    • Getting R-Stream
41. Experimental evaluation
    • Configuration 1 (MKL): radar code calling MKL
    • Configuration 2 (low-level compilers): radar code compiled with
      GCC or ICC
    • Configuration 3 (R-Stream): radar code optimized by R-Stream,
      then compiled with GCC or ICC
    • Main comparisons:
      - R-Stream High-Level C Transformation Tool 3.1.2
      - Intel MKL 10.2.1
42. Experimental evaluation (cont.)
    • Intel Xeon workstation:
      - Dual quad-core E5405 Xeon processors (8 cores total)
      - 9 GB memory
    • 8 OpenMP threads
    • Single-precision floating-point data
    • Low-level compilers and the flags used:
      - GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
      - ICC: -fast -openmp
43. Radar benchmarks
    • Beamforming algorithms:
      - MVDR-SER: Minimum Variance Distortionless Response using
        Sequential Regression
      - CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean
        Square
      - CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least
        Square
    • Expressed in sequential ANSI C
    • 400 radar iterations
    • 3 radar sidelobes computed (for CSLC-LMS and CSLC-RLS)
44. MVDR-SER (performance chart)
45. CSLC-LMS (performance chart)
46. CSLC-RLS (performance chart)
47. 3D discretized wave equation input code (RTM): a 25-point,
    8th-order (in space) stencil
    #pragma rstream map
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X],
                double (*V)[Y][X], int pX, int pY, int pZ) {
      double temp;
      int i, j, k;
      for (k=4; k<pZ-4; k++) {
        for (j=4; j<pY-4; j++) {
          for (i=4; i<pX-4; i++) {
            temp = C0 * U2[k][j][i] +
                   C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                         U2[k][j-1][i] + U2[k][j+1][i] +
                         U2[k][j][i-1] + U2[k][j][i+1]) +
                   C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                         U2[k][j-2][i] + U2[k][j+2][i] +
                         U2[k][j][i-2] + U2[k][j][i+2]) +
                   C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                         U2[k][j-3][i] + U2[k][j+3][i] +
                         U2[k][j][i-3] + U2[k][j][i+3]) +
                   C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                         U2[k][j-4][i] + U2[k][j+4][i] +
                         U2[k][j][i-4] + U2[k][j][i+4]);
            U1[k][j][i] =
              2.0f * U2[k][j][i] - U1[k][j][i] +
              V[k][j][i] * temp;
          } } } }
48. 3D discretized wave equation mapped code (RTM)
    (Excerpt of the generated code.) The output is not so naïve: it
    exposes communication autotuning knobs, and it avoids threadIdx.x
    divergence, which is expensive.
49. Outline
    • R-Stream Overview
    • Compilation Walk-through
    • Performance Results
    • Getting R-Stream
50. Current status
    • Ongoing development, also supported by DOE and Reservoir
      - Improvements in scope, stability, performance
    • Installations/evaluations at US government laboratories
    • Forward collaboration with Georgia Tech on Keeneland
      - HP SL390: 3 Fermi GPUs and 2 Westmere CPUs per node
    • Basis of the compiler for the DARPA UHPC Intel Corporation team
51. Availability
    • Per-developer-seat licensing model
    • Support (releases, bug fixes, services)
    • Available with commercial-grade external solvers
    • Government has limited rights / SBIR data rights
    • Academic source licenses with collaborators
    • Professional team, continuity, software engineering
52. R-Stream Gives More Insight About Programs
    • Teaches you about parallelism and the polyhedral model:
      - Generates correct code for any transformation
        (the transformation itself may be incorrect if specified by
        the user)
      - Imperative code generation was the bottleneck until 2005; three
        theses have been written on the topic and it is still not
        completely covered. It is good to have an intuition of how
        these things work.
    • R-Stream has meaningful metrics to represent your program:
      - Maximal amount of parallelism given minimal expansion
      - Tradeoffs between coarse-grained and fine-grained parallelism
      - Loop types (doall, red, perm, seq)
    • Helps with algorithm selection
    • Tiling of imperfectly nested loops
    • Generates code for explicitly managed memory
53. How to Use R-Stream Successfully
    • R-Stream is a great transformation tool; it is also a great
      learning tool:
      - Takes "simple C" input code
      - Applies multiple transformations and lists them explicitly
      - Can dump code at any step in the process
      - It is the tool I wish I had during my PhD (to be fair, I
        already had a great set of tools)
    • R-Stream can be used in multiple modes:
      - Fully automatic + compile flag options
      - Autotuning mode (more than just tile sizes and unrolling …)
      - Scripting / programmatic mode (BeanShell + interfaces)
      - A mix of these modes + manual post-processing
54. Use Case: Fully Automatic Mode + Compile Flag Options
    • Akin to traditional gcc/icc compilation with flags: predefined
      transformations can be parameterized
      - Except with fewer (or no) phase-ordering issues
      - Except you can see what each transformation does at each step
      - Except you can generate compilable and executable code at
        (almost) any step in the process
      - Except you can control code generation for compactness or
        performance
55. Use Case: Autotuning Mode
    • More advanced than traditional approaches:
      - Knobs go far beyond loop unrolling + unroll-and-jam + tiling
      - Knobs are based on well-understood models
      - Knobs target high-level properties of the program: the amount
        of parallelism, the amount of memory expansion, the depth of
        pipelining of communications and computations, …
      - Knobs depend on the target machine, the program, and the state
        of the mapping process: our tool has introspection
    • We are really building a hierarchical autotuning transformation
      tool
56. Use Case: "Power user" interactive interface
    • BeanShell access to optimizations
    • Can direct and review the process of compilation
      - automatic tactics (affine scheduling)
    • Can direct code generation
    • Access to "tuning parameters"
    • All options and commands are also available on the command-line
      interface
57. Conclusion
    • R-Stream simplifies software development and maintenance
    • Porting: reduces expense and delivery delays
    • Does this by automatically parallelizing loop code, while
      optimizing for data locality, coalescing, etc.
    • Addresses dense, loop-intensive computations
    • Extensions: data-parallel programming idioms, sparse
      representations, dynamic runtime execution
58. Contact us
    • Per-developer seat, floating, and cloud-based licensing
    • Discounts for academic users
    • Research collaborations with academic partners
    • For more information:
      - Call us at 212-780-0527, or
      - See Rich or Ann, or
      - E-mail us at {sales,lethin,johnson}@reservoir.com