[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)
 

http://cs264.org

http://j.mp/fmzhYl
    Presentation Transcript

    • The R-Stream High-Level Program Transformation Tool. N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin (Reservoir Labs). Harvard, 04/12/2011.
    • Outline
        • R-Stream Overview
        • Compilation Walk-through
        • Performance Results
        • Getting R-Stream
    • Power efficiency driving architectures
        • Heterogeneous: GPPs, SIMD units, FPGAs, DMA engines
        • NUMA; distributed local memories; explicitly managed memory
        • Hierarchical (including board, chassis, cabinet)
        • Multiple execution models; multiple, mixed parallelism types
        • Bandwidth starved; spatial dimensions
    • Computation choreography
        • Expressing it
            • Annotations and pragma dialects for C
            • Explicitly (e.g., new languages like CUDA and OpenCL)
        • But before expressing it, how can programmers find it?
            • Manual constructive procedures: art, sweat, time; artisans get complete control over every detail
            • Automatically (our focus): an operations research problem, like scheduling trucks to save fuel; model, solve, implement; faster, and sometimes better, than a human
    • How to do automatic scheduling?
        • Naïve approach
            • Model: tasks, dependences; resource use, latencies; a machine model with connectivity and resource capacities
            • Solve with ILP: minimize overall task length, subject to dependences and resource use
            • Problems: complexity (the task graph is huge!) and dynamics (loop lengths are unknown)
        • So we do something much cooler.
    • Program transformations specification
        • Iteration space of a statement S(i,j); a schedule θ : Z² → Z² maps iterations (i,j) to time coordinates (t1,t2)
        • A schedule maps iterations to multi-dimensional time; a feasible schedule preserves dependences
        • A placement maps iterations to multi-dimensional space (UHPC in progress, partially done)
        • A layout maps data elements to multi-dimensional space (UHPC in progress)
        • Hierarchical by design; tiling serves separation of concerns
    • Loop transformations (unimodular examples), starting from: for(i=0; i<N; i++) for(j=0; j<N; j++) S(i,j);
        • Permutation, θ(i,j) = (j,i): for(j=0; j<N; j++) for(i=0; i<N; i++) S(i,j);
        • Reversal, θ(i,j) = (-i,j): for(i=N-1; i>=0; i--) for(j=0; j<N; j++) S(i,j);
        • Skewing, θ(i,j) = (i, j+α·i): for(i=0; i<N; i++) for(j=α*i; j<N+α*i; j++) S(i, j-α*i);
        • Scaling, θ(i,j) = (β·i, j): for(i=0; i<β*N; i+=β) for(j=0; j<N; j++) S(i/β, j);
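      A concrete illustration of the permutation (loop interchange) above, as a hand-written sketch rather than R-Stream output (N, A and B are illustrative names): interchanging the loops of a column-wise traversal of a row-major array turns strided accesses into unit-stride accesses.

          #define N 1024
          static double A[N][N], B[N][N];

          /* Before: j outermost, so successive iterations of the inner i loop
             touch A[i][j] a whole row apart (strided accesses). */
          void before(void) {
              for (int j = 0; j < N; j++)
                  for (int i = 0; i < N; i++)
                      A[i][j] = B[i][j] + 1.0;
          }

          /* After interchanging the two loops (the permutation theta above):
             the inner j loop now walks each row contiguously. */
          void after(void) {
              for (int i = 0; i < N; i++)
                  for (int j = 0; j < N; j++)
                      A[i][j] = B[i][j] + 1.0;
          }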
    • Loop fusion and distribution
        • Distributed: for(i=0; i<N; i++) { for(j=0; j<N; j++) S1(i,j); for(j=0; j<N; j++) S2(i,j); }
        • Fused: for(i=0; i<N; i++) for(j=0; j<N; j++) { S1(i,j); S2(i,j); }
        • Fusion and distribution are expressed by changing only the constant (scalar) dimensions of the affine schedules θ1 and θ2 of S1 and S2
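      A small hand-written sketch of why fusion helps locality (not R-Stream output; x, y, z and n are illustrative names): in the distributed form y is written in one pass and re-read from memory in the next, while in the fused form each y[i] is still in a register or cache when z[i] needs it.

          /* Distributed: two passes over the data. */
          void distributed(const double *x, double *y, double *z, int n) {
              for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
              for (int i = 0; i < n; i++) z[i] = y[i] + x[i];
          }

          /* Fused: one pass, y[i] is reused immediately. */
          void fused(const double *x, double *y, double *z, int n) {
              for (int i = 0; i < n; i++) {
                  y[i] = 2.0 * x[i];
                  z[i] = y[i] + x[i];
              }
          }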
    • Enabling technology is new compiler math
        • Uniform Recurrence Equations [Karp et al. 1970]
        • Loop transformations and parallelization [1970-]: vectorization, SMP, locality optimizations; dependence summaries (direction/distance vectors); unimodular transformations; systolic array mapping; mostly linear-algebraic. Many contributors: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
        • Polyhedral model [1980-]: exact dependence analysis; general affine transformations; loop synthesis via polyhedral scanning; new computational techniques based on polyhedral representations. Many contributors: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
    • R-Stream model: polyhedra
        • Example: n = f(); for (i=5; i<=n; i+=2) { A[i][i] = A[i][i]/B[i]; for (j=0; j<=i; j++) { if (j<=10) { … A[i+2j+n][i+3] … } } }
        • Iteration domain: {(i,j) ∈ Z² | ∃k ∈ Z: 5 ≤ i ≤ n, 0 ≤ j ≤ i, j ≤ 10, i = 2k+1}
        • Array accesses are affine functions of the iterators and parameters, e.g. A[i+2j+n][i+3] (spelled out below)
        • Affine and non-affine transformations; order and place of operations and data
        • Loop code is represented (exactly or conservatively) with polyhedra
        • A high-level, mathematical view of a mapping that nonetheless targets concrete properties: parallelism, locality, memory footprint
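      The access function of the reference A[i+2j+n][i+3] in the example can be spelled out as an affine map of the iterators and the parameter n (this reconstructs the access matrix that appears, garbled, on the slide):

          \[
            f_A(i,j,n) =
            \begin{pmatrix} 1 & 2 & 1 & 0 \\ 1 & 0 & 0 & 3 \end{pmatrix}
            \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}
            =
            \begin{pmatrix} i + 2j + n \\ i + 3 \end{pmatrix}
          \]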
    • Polyhedral slogans
        • Parametric imperfect loop nests
        • Subsumes classical transformations
        • Compacts the transformation search space
        • Parallelization, locality optimization (communication avoiding)
        • Preserves semantics
        • Analytic joint formulations of optimizations
        • Not just for affine static control programs
    • Polyhedral model: challenges in building a compiler
        • Killer math
        • Scalability of optimizations / code generation
        • Mostly confined to dependence-preserving transformations
        • Code can be radically transformed; outputs can look wildly different
        • Modeling indirections, pointers, non-affine code
        • Many of these challenges are solved
    • R-Stream blueprint
        • Pipeline: EDG C front end → scalar representation → raising → polyhedral mapper (driven by the machine model) → lowering → pretty printer
    • Inside the polyhedral mapper
        • A tactics module drives passes over the GDG representation: parallelization, locality optimization, tiling, placement, communication generation, memory promotion, synchronization generation, layout optimization, polyhedral scanning (Jolylib), …
    • Inside the polyhedral mapper (cont.)
        • The optimization modules are engineered to expose "knobs" that could be used by an auto-tuner
    • Driving the mapping: the machine model
        • Target machine characteristics that influence how the mapping should be done:
            • Local memory / cache sizes
            • Communication facilities: DMA, cache(s)
            • Synchronization capabilities
            • Symmetrical or not
            • SIMD width
            • Bandwidths
        • Currently a two-level model (host and accelerators)
        • XML schema and graphical rendering
    • Machine model example: multi-Tesla
        • A host with one thread per GPU, described in an XML file: an OpenMP morph for the host side and a CUDA morph for the GPUs
    • Mapping process
        • 1. Scheduling: parallelism, locality, tilability (driven by the dependences)
        • 2. Task formation: coarse-grain atomic tasks; master/slave side operations
        • 3. Placement: assign tasks to blocks/threads
        • Followed by: local/global data layout optimization; multi-buffering (explicitly managed); synchronization (barriers); bulk communications; thread generation (master/slave); CUDA-specific optimizations
    • Program transformations specification (revisited)
        • A schedule θ : Z² → Z² maps iterations (i,j) of a statement S(i,j) to multi-dimensional time; a feasible schedule preserves dependences
        • A placement maps iterations to multi-dimensional space (UHPC in progress, partially done)
        • A layout maps data elements to multi-dimensional space (UHPC in progress)
        • Hierarchical by design; tiling serves separation of concerns
    • The scheduling model trades 3 objectives jointly (patent pending)
        • Loop fusion → more locality → fewer global memory accesses
        • Loop fission → more parallelism → sufficient occupancy
        • Successive-thread contiguity → memory coalescing → better effective bandwidth
    • Optimization with BLAS vs. global optimization
        • With BLAS: an outer sequential loop of library calls (BLAS call 1 … BLAS call n) that read and write shared data Z, with Z retrieved and stored back between calls, causing numerous cache misses
        • With global optimization: the outer loop(s) can be parallelized (doall) and loops fused across the former library-call boundaries, which improves locality
        • Global optimization can expose better parallelism and locality
    • Tradeoffs between parallelism and locality
        • Significant parallelism is needed to fully utilize all resources (high on-chip parallelism)
        • Locality is also critical to minimize communication (limited bandwidth at the chip border; reusing data once loaded on chip = locality)
        • Parallelism can come at the expense of locality
        • Our approach: R-Stream exposes parallelism via affine scheduling that simultaneously improves locality using loop fusion
    • Parallelism/locality tradeoff example: maximum distribution destroys locality
        • Original code (simplified CSLC-LMS):
              for (k=0; k<400; k++) {
                for (i=0; i<3997; i++) {
                  z[i] = 0;
                  for (j=0; j<4000; j++)
                    z[i] = z[i] + B[i][j]*x[k][j];
                }
                for (i=0; i<3997; i++)
                  w[i] = w[i] + z[i];
              }
        • Maximum parallelism (no fusion); array z gets expanded (z_e) to introduce another level of parallelism:
              doall (i=0; i<400; i++)
                doall (j=0; j<3997; j++)
                  z_e[j][i] = 0;
              doall (i=0; i<400; i++)
                doall (j=0; j<3997; j++)
                  for (k=0; k<4000; k++)
                    z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
              doall (i=0; i<3997; i++)
                for (j=0; j<400; j++)      /* data accumulation */
                  w[i] = w[i] + z_e[i][j];
              doall (i=0; i<3997; i++)
                z[i] = z_e[i][399];
        • 2 levels of parallelism, but poor data reuse (on array z_e)
    • Parallelism/locality tradeoff example (cont.): aggressive loop fusion destroys parallelism (only 1 degree of parallelism)
        • Maximum fusion of the same original code:
              doall (i=0; i<3997; i++)
                for (j=0; j<400; j++) {
                  z[i] = 0;
                  for (k=0; k<4000; k++)
                    z[i] = z[i] + B[i][k]*x[j][k];
                  w[i] = w[i] + z[i];
                }
        • Very good data reuse (on array z), but only 1 level of parallelism
    • Parallelism/locality tradeoff example (cont.): partial fusion doesn't decrease parallelism
        • Parallelism with partial fusion and expansion of array z:
              doall (i=0; i<3997; i++) {
                doall (j=0; j<400; j++) {
                  z_e[i][j] = 0;
                  for (k=0; k<4000; k++)
                    z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
                }
                for (j=0; j<400; j++)      /* data accumulation */
                  w[i] = w[i] + z_e[i][j];
              }
              doall (i=0; i<3997; i++)
                z[i] = z_e[i][399];
        • 2 levels of parallelism with good data reuse (on array z_e)
    • Parallelism/locality tradeoffs: performance numbers
        • Code with a good balance between parallelism and fusion performs best
        • On explicitly managed memory / scratchpad architectures this is even more true
    • R-Stream: affine scheduling and fusion (patent pending)
        • R-Stream uses a heuristic based on an objective function with several cost coefficients:
            • w_l: the slowdown in execution if loop l is executed sequentially rather than in parallel
            • u_e: the cost in performance if the two loops joined by edge e remain unfused rather than fused
        • The objective is to minimize Σ_{l ∈ loops} w_l·p_l + Σ_{e ∈ loop edges} u_e·f_e, where p_l indicates that loop l runs sequentially and f_e indicates that the loops joined by e remain unfused
        • These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
        • Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation
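      A toy instance of this objective (the numbers are invented for illustration and are not from the talk): suppose loops l1 and l2 are joined by a single fusion edge e, with w_{l1} = 10, w_{l2} = 4 and u_e = 6, and suppose fusing the two loops would force l2 to run sequentially while keeping them distributed leaves both parallel. The two candidate mappings then cost

          \[
            \underbrace{w_{l_2}\,p_{l_2}}_{\text{fused, } l_2 \text{ sequential}} = 4
            \qquad\text{vs.}\qquad
            \underbrace{u_e\,f_e}_{\text{distributed, both parallel}} = 6 ,
          \]

      so the heuristic would fuse here; with w_{l2} = 8 it would instead keep the loops distributed and parallel.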
    • Parallelism + locality + spatial locality
        • The same objective, minimize Σ_{l ∈ loops} w_l·p_l + Σ_{e ∈ loop edges} u_e·f_e, balances the benefits of parallel execution against the benefits of improved locality
        • Hypothesis: auto-tuning should adjust these parameters
        • A new (unpublished) algorithm also balances contiguity, to enhance coalescing for GPUs and SIMDization, modulo data-layout transformations
    • Outline
        • R-Stream Overview
        • Compilation Walk-through
        • Performance Results
        • Getting R-Stream
    • What R-Stream does for you, in a nutshell
        • Input: sequential, short and simple textbook C code; just add a "#pragma map" and R-Stream figures out the rest
    • What R-Stream does for you, in a nutshell (cont.)
        • Output: OpenMP + CUDA code; hundreds of lines of tightly optimized GPU-side CUDA code and a few lines of host-side OpenMP C code
        • Running example: a Gauss-Seidel 9-point stencil
            • Used in iterative PDE solvers for scientific modeling (heat, fluid flow, waves, etc.)
            • Building block for faster iterative solvers like Multigrid or AMR
            • Very difficult to hand-optimize; not available in any standard library
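      For concreteness, the kind of textbook input described above might look like the following hand-written sketch (not the exact benchmark source from the talk; the array name, size and 1/9 weights are illustrative, and the pragma spelling follows the RTM example later in the deck):

          #define N 1024

          #pragma rstream map
          void gauss_seidel_9pt(double A[N][N], int iters) {
              for (int t = 0; t < iters; t++)
                  for (int i = 1; i < N - 1; i++)
                      for (int j = 1; j < N - 1; j++)
                          /* In-place update: neighbors with smaller (i,j) were
                             already updated in this sweep (Gauss-Seidel order). */
                          A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                                   + A[i][j-1]   + A[i][j]   + A[i][j+1]
                                   + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
          }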
    • What R-Stream does for you, in a nutshell (cont.)
        • From that input, R-Stream achieves up to 20 GFLOPS on a GTX 285 and 25 GFLOPS on a GTX 480
        • Illustrated in the next few slides
    • Finding and utilizing available parallelism
        • (Excerpt of automatically generated code shown on the slide, over a GPU block diagram: streaming multiprocessors, scalar processors, shared memory, registers, constant/texture caches, off-chip device memory)
        • R-Stream automatically finds and forms parallelism: extracting parallel loops and mapping them onto the GPU
    • Memory compaction on the GPU scratchpad
        • (Excerpt of automatically generated code shown on the slide)
        • R-Stream automatically manages the local scratchpad (per-SM shared memory)
    • GPU DRAM to scratchpad coalesced communication
        • (Excerpt of automatically generated code shown on the slide)
        • R-Stream automatically chooses parallelism to favor coalescing, producing coalesced GPU DRAM accesses
    • Host-to-GPU communication
        • (Excerpt of automatically generated code shown on the slide)
        • R-Stream automatically chooses the partition and sets up host-to-GPU communication over PCI Express, between host memory and off-chip device memory
    • Multi-GPU mapping
        • (Excerpt of automatically generated code shown on the slide)
        • R-Stream automatically finds another level of parallelism, across GPUs, mapping the computation over all GPUs and their memories
    • Multi-GPU mapping (cont.)
        • (Excerpt of automatically generated code shown on the slide)
        • R-Stream automatically creates n-way software pipelines for communications: multi-streaming of host-GPU transfers
    • Future capabilities: mapping to CPU-GPU clusters
        • One program mapped across CPU + GPU nodes connected by a high-speed interconnect (e.g. InfiniBand)
        • Per node: an MPI process runs an OpenMP process that launches CUDA on the node's GPUs, with data staged through CPU DRAM and the GPUs' DRAM
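      A minimal hand-written sketch of that execution structure, under the assumption of one MPI rank per node and one OpenMP thread per GPU (illustrative host-side scaffolding, not R-Stream-generated code):

          #include <mpi.h>
          #include <omp.h>
          #include <cuda_runtime.h>

          int main(int argc, char **argv) {
              MPI_Init(&argc, &argv);           /* one rank per cluster node */
              int ngpus = 0;
              cudaGetDeviceCount(&ngpus);
              #pragma omp parallel num_threads(ngpus)
              {
                  /* Bind each OpenMP thread to one GPU of this node. */
                  cudaSetDevice(omp_get_thread_num());
                  /* ... launch the generated CUDA kernels on this GPU's tile ... */
              }
              MPI_Finalize();
              return 0;
          }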
    • Outline
        • R-Stream Overview
        • Compilation Walk-through
        • Performance Results
        • Getting R-Stream
    • Experimental evaluation
        • Configuration 1 (MKL): radar code with MKL calls
        • Configuration 2 (low-level compilers): radar code compiled directly with GCC / ICC
        • Configuration 3 (R-Stream): radar code optimized by R-Stream, then compiled with GCC / ICC
        • Main comparisons: R-Stream High-Level C Transformation Tool 3.1.2 vs. Intel MKL 10.2.1
    • Experimental evaluation (cont.)
        • Intel Xeon workstation: dual quad-core E5405 Xeon processors (8 cores total), 9 GB memory
        • 8 OpenMP threads
        • Single-precision floating-point data
        • Low-level compilers and the flags used:
            • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
            • ICC: -fast -openmp
    • Radar benchmarks
        • Beamforming algorithms:
            • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
            • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
            • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
        • Expressed in sequential ANSI C
        • 400 radar iterations
        • 3 radar sidelobes computed (for CSLC-LMS and CSLC-RLS)
    • MVDR-SER (performance results chart on slide)
    • CSLC-LMS (performance results chart on slide)
    • CSLC-RLS (performance results chart on slide)
    • 3D discretized wave equation input code (RTM): a 25-point, 8th-order-in-space stencil

          #pragma rstream map
          void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                      int pX, int pY, int pZ) {
            double temp;
            int i, j, k;
            for (k=4; k<pZ-4; k++) {
              for (j=4; j<pY-4; j++) {
                for (i=4; i<pX-4; i++) {
                  temp = C0 * U2[k][j][i] +
                         C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                               U2[k][j-1][i] + U2[k][j+1][i] +
                               U2[k][j][i-1] + U2[k][j][i+1]) +
                         C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                               U2[k][j-2][i] + U2[k][j+2][i] +
                               U2[k][j][i-2] + U2[k][j][i+2]) +
                         C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                               U2[k][j-3][i] + U2[k][j+3][i] +
                               U2[k][j][i-3] + U2[k][j][i+3]) +
                         C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                               U2[k][j-4][i] + U2[k][j+4][i] +
                               U2[k][j][i-4] + U2[k][j][i+4]);
                  U1[k][j][i] =
                    2.0f * U2[k][j][i] - U1[k][j][i] +
                    V[k][j][i] * temp;
                }
              }
            }
          }
    • 3D discretized wave equation: generated code (RTM)
        • Excerpt of generated code shown on the slide; it is not so naïve
        • Annotations: communication autotuning knobs; threadIdx.x divergence is expensive
    • Outline
        • R-Stream Overview
        • Compilation Walk-through
        • Performance Results
        • Getting R-Stream
    • Current status
        • Ongoing development, also supported by DOE and Reservoir: improvements in scope, stability, performance
        • Installations/evaluations at US government laboratories
        • Forward collaboration with Georgia Tech on Keeneland (HP SL390: 3 Fermi GPUs and 2 Westmeres per node)
        • Basis of the compiler for the DARPA UHPC Intel Corporation team
    • Availability
        • Per-developer-seat licensing model
        • Support (releases, bug fixes, services)
        • Available with commercial-grade external solvers
        • Government has limited rights / SBIR data rights
        • Academic source licenses with collaborators
        • Professional team, continuity, software engineering
    • R-Stream gives more insight about programs
        • Teaches you about parallelism and the polyhedral model:
            • Generates correct code for any transformation (the transformation itself may be incorrect if specified by the user)
            • Imperative code generation was the bottleneck until 2005; three theses have been written on the topic and it is still not completely covered
            • It is good to have an intuition of how these things work
        • R-Stream has meaningful metrics to represent your program:
            • Maximal amount of parallelism given minimal expansion
            • Tradeoffs between coarse-grained and fine-grained parallelism
            • Loop types (doall, red, perm, seq), as sketched below
        • Helps with algorithm selection
        • Tiling of imperfectly nested loop nests
        • Generates code for explicitly managed memory
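      A hand-written sketch of what those loop types mean (the classification comments below are mine, not R-Stream output; a "perm" nest is one whose loops can legally be interchanged, as in the stencil examples earlier):

          void loop_types(const double *b, double *a, int n) {
              double sum = 0.0;
              for (int i = 0; i < n; i++)     /* doall: iterations independent  */
                  a[i] = 2.0 * b[i];
              for (int i = 0; i < n; i++)     /* red: a reduction (over +)      */
                  sum += a[i];
              for (int i = 1; i < n; i++)     /* seq: loop-carried dependence   */
                  a[i] = a[i-1] + b[i];
              a[0] += sum;                    /* keep the reduction result live */
          }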
    • How to use R-Stream successfully
        • R-Stream is a great transformation tool; it is also a great learning tool:
            • Takes "simple C" input code
            • Applies multiple transformations and lists them explicitly
            • Can dump code at any step in the process
            • It's the tool I wish I had during my PhD (to be fair, I already had a great set of tools)
        • R-Stream can be used in multiple modes:
            • Fully automatic + compile-flag options
            • Autotuning mode (more than just tile size and unrolling …)
            • Scripting / programmatic mode (BeanShell + interfaces)
            • A mix of these modes + manual post-processing
    • Use case: fully automatic mode + compile-flag options
        • Akin to traditional gcc/icc compilation with flags: predefined transformations can be parameterized
            • Except with fewer (or no) phase-ordering issues
            • Except you can see what each transformation does at each step
            • Except you can generate compilable, executable code at (almost) any step in the process
            • Except you can control code generation for compactness or performance
    • Use case: autotuning mode
        • More advanced than traditional approaches:
            • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
            • Knobs are based on well-understood models
            • Knobs target high-level properties of the program: amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations, …
            • Knobs depend on the target machine, the program, and the state of the mapping process: the tool has introspection
        • We are really building a hierarchical autotuning transformation tool
    • Use case: "power user" interactive interface
        • BeanShell access to optimizations
        • Can direct and review the compilation process, including automatic tactics (affine scheduling)
        • Can direct code generation
        • Access to "tuning parameters"
        • All options/commands are available on the command-line interface
    • Conclusion
        • R-Stream simplifies software development and maintenance
        • Porting: reduces expense and delivery delays
        • Does this by automatically parallelizing loop code, while optimizing for data locality, coalescing, etc.
        • Addresses dense, loop-intensive computations
        • Extensions: data-parallel programming idioms, sparse representations, dynamic runtime execution
    • Contact us
        • Per-developer-seat, floating, and cloud-based licensing
        • Discounts for academic users
        • Research collaborations with academic partners
        • For more information: call 212-780-0527, see Rich or Ann, or e-mail {sales,lethin,johnson}@reservoir.com