[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform loop transformations such as fusion, distribution, and tiling, optimizing for parallelism and locality. It models the target machine and uses that model to inform the mapping of operations onto resources such as GPUs.
http://cs264.org
1. The R-Stream High-Level Program Transformation Tool
N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin
Reservoir Labs Harvard 04/12/2011
4. Computation choreography
• Expressing it
  • Annotations and pragma dialects for C
  • Explicitly (e.g., new languages like CUDA and OpenCL)
• But before expressing it, how can programmers find it?
  • Manually: constructive procedures, art, sweat, time
    – Artisans get complete control over every detail
  • Automatically (our focus)
    – An operations research problem
    – Like scheduling trucks to save fuel
    – Model, solve, implement
    – Faster, and sometimes better, than a human
5. How to do automatic scheduling?
• Naïve approach
  • Model
    – Tasks, dependences
    – Resource use, latencies
    – Machine model with connectivity, resource capacities
  • Solve with ILP
    – Minimize overall task length
    – Subject to dependences, resource use
  • Problems
    – Complexity: the task graph is huge!
    – Dynamics: loop lengths are unknown at compile time.
• So we do something much more cool.
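The "model tasks, dependences, and latencies, then solve" framing above can be sketched in a few lines. The following toy (task names and latencies are made up, and resources are assumed unlimited, so the answer is just the critical path rather than a full ILP):

```python
# Toy scheduling model: tasks with latencies and dependences.
# With unlimited resources, the minimal overall task length (makespan)
# is the longest latency-weighted path through the dependence graph.
latency = {"load": 2, "mul": 3, "add": 1, "store": 1}
deps = [("load", "mul"), ("mul", "add"), ("add", "store"), ("load", "add")]

def schedule(latency, deps):
    preds = {t: [p for p, s in deps if s == t] for t in latency}
    start = {}

    def est(t):
        # earliest start time: all predecessors must have finished
        if t not in start:
            start[t] = max([0] + [est(p) + latency[p] for p in preds[t]])
        return start[t]

    for t in latency:
        est(t)
    makespan = max(start[t] + latency[t] for t in latency)
    return start, makespan
```

Once resources are finite (registers, functional units, memory banks), this turns into the ILP the slide describes, and the size of the task graph becomes the problem.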
6. Program Transformations Specification
[Figure: iteration space of a statement S(i,j), with a schedule Θ : Z² → Z² mapping iterations (i,j) to time coordinates (t1,t2)]
• Schedule maps iterations to multi-dimensional time:
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space:
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space:
  • UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
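A minimal concrete instance of a schedule Θ : Z² → Z² is loop interchange, Θ(i,j) = (j,i). The sketch below (array size and stencil are invented for illustration) shows that the interchanged order is feasible here because the only dependence, from (i-1,j) to (i,j), is still executed in order:

```python
# A schedule maps each iteration (i, j) to a multidimensional time stamp.
# Theta(i, j) = (j, i) is loop interchange, a simple affine schedule.
N = 8
A = [[float(i * N + j) for j in range(N)] for i in range(N)]

def original(A):
    B = [row[:] for row in A]
    for i in range(1, N):          # identity schedule: (i, j) time order
        for j in range(N):
            B[i][j] = B[i - 1][j] + 1.0   # dependence from (i-1, j)
    return B

def interchanged(A):
    B = [row[:] for row in A]
    for j in range(N):             # Theta(i, j) = (j, i): j runs outermost
        for i in range(1, N):
            B[i][j] = B[i - 1][j] + 1.0   # (i-1, j) still precedes (i, j)
    return B

assert original(A) == interchanged(A)   # feasible: semantics preserved
```

A schedule that reversed the i loop would violate the dependence; that is exactly what "a feasible schedule preserves dependences" rules out.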
9. Enabling technology is new compiler math
• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970–]
  – Vectorization, SMP, locality optimizations
  – Dependence summary: direction/distance vectors
  – Unimodular transformations
  – Mostly linear-algebraic
  – Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
• Systolic Array Mapping
• Polyhedral Model [1980–]
  – Exact dependence analysis
  – General affine transformations
  – Loop synthesis via polyhedral scanning
  – New computational techniques based on polyhedral representations
  – Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
10. R-Stream model: polyhedra
n = f();
for (i=5; i<=n; i+=2) {
  A[i][i] = A[i][i]/B[i];
  for (j=0; j<=i; j++) {
    if (j<=10) {
      … A[i+2j+n][i+3] …
    }
  }
}
Iteration domain of the guarded statement:
{ (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n; 0 ≤ j ≤ i; j ≤ 10; i = 2k+1 }
Affine access function for A[i+2j+n][i+3], applied to the vector (i, j, n, 1):
[ 1 2 1 0 ]
[ 1 0 0 3 ]
• Loop code represented (exactly or conservatively) with polyhedra
• Affine and non-affine transformations
• Order and place of operations and data
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint
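The correspondence between the loop nest and the integer set can be checked directly. The sketch below enumerates both for an arbitrary sample value of the runtime parameter n (the bounding box used to scan the set is a made-up choice):

```python
# The guarded statement's iterations, produced by the loop nest
#   for (i=5; i<=n; i+=2) for (j=0; j<=i; j++) if (j<=10) ...
n = 17
loop_iters = [(i, j) for i in range(5, n + 1, 2)
                     for j in range(0, i + 1) if j <= 10]

# ... are exactly the integer points of the polyhedron
#   { (i,j) | 5<=i<=n, 0<=j<=i, j<=10, i odd }  (i = 2k+1 for some k)
polyhedron = [(i, j) for i in range(-50, 50) for j in range(-50, 50)
              if 5 <= i <= n and 0 <= j <= i and j <= 10 and i % 2 == 1]

assert loop_iters == polyhedron

# The access A[i+2j+n][i+3] is affine: each subscript is a row of the
# matrix [[1,2,1,0],[1,0,0,3]] applied to (i, j, n, 1).
def access(i, j):
    return (1*i + 2*j + 1*n + 0, 1*i + 0*j + 0*n + 3)
```

Because both the domain and the accesses are affine, exact dependence analysis and transformation become linear-algebra problems rather than pattern matching on syntax.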
11. Polyhedral slogans
• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
12. Polyhedral model – challenges in building a compiler
• Killer math
• Scalability of optimizations/code generation
• Mostly confined to dependence-preserving transformations
• Code can be radically transformed – outputs can look wildly different
• Modeling indirections, pointers, non-affine code
• Many of these challenges are solved
13. R-Stream blueprint
[Diagram: EDG C Front End → Raising → Polyhedral Mapper (driven by the Machine Model) → Lowering → Pretty Printer, built on a scalar representation]
15. Inside the polyhedral mapper
Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner.
[Diagram: a Tactics Module drives passes over the GDG representation – Parallelization, Locality Optimization, Tiling, Placement, Communication Generation, Memory Promotion, Sync Generation, Layout Optimization, Polyhedral Scanning, … – on top of solver libraries such as Jolylib]
16. Driving the mapping: the machine model
• Target machine characteristics that influence how the mapping should be done
  • Local memory / cache sizes
  • Communication facilities: DMA, cache(s)
  • Synchronization capabilities
  • Symmetrical or not
  • SIMD width
  • Bandwidths
• Currently: two-level model (Host and Accelerators)
• XML schema and graphical rendering
17. Machine model example: multi-Tesla
[Diagram: a Host running one thread per GPU; the machine model is an XML file combining an OpenMP morph for the host with a CUDA morph for each GPU]
18. Mapping process
1- Scheduling:
   - Parallelism, locality, tilability
2- Task formation (from the dependencies):
   - Coarse-grain atomic tasks
   - Master/slave side operations
3- Placement:
   - Assign tasks to blocks/threads
Then:
- Local / global data layout optimization
- Multi-buffering (explicitly managed)
- Synchronization (barriers)
- Bulk communications
- Thread generation -> master/slave
- CUDA-specific optimizations
19. Program Transformations Specification (recap)
[Figure: iteration space of a statement S(i,j), with a schedule Θ : Z² → Z² mapping iterations (i,j) to time coordinates (t1,t2)]
• Schedule maps iterations to multi-dimensional time: a feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space: UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space: UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
20. Model for scheduling trades 3 objectives jointly
[Diagram: loop fission pushes toward more parallelism and sufficient occupancy; loop fusion pushes toward more locality and fewer global memory accesses; either one, combined with successive-thread contiguity, yields memory coalescing and better effective bandwidth]
Patent pending
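The fission/fusion tension above can be made concrete with two tiny loops sharing a producer/consumer array (the kernels are invented for illustration). Both forms compute the same thing; what changes is where parallelism and reuse live:

```python
# Fusion vs fission of a producer loop (a[i] = 2*x[i]) and a consumer
# loop (b[i] = a[i] + 1).  Same result either way; the tradeoff is
# structural, exactly as in the slide's diagram.
N = 1000
x = list(range(N))

def fused(x):
    b = [0] * N
    for i in range(N):
        a_i = 2 * x[i]        # producer
        b[i] = a_i + 1        # consumer reuses a_i while it is "hot":
    return b                  # locality, fewer global memory accesses

def fissioned(x):
    a = [0] * N
    b = [0] * N
    for i in range(N):        # doall: fully parallel producer loop
        a[i] = 2 * x[i]
    for i in range(N):        # doall: fully parallel consumer loop,
        b[i] = a[i] + 1       # but a[] makes a round trip through memory
    return b

assert fused(x) == fissioned(x)
```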
21. Optimization with BLAS vs. global optimization
/* Optimization with BLAS */
for loop {                       // outer loop(s)
  …
  Retrieve data Z from disk
  BLAS call 1
  Store data Z back to disk
  Retrieve data Z from disk      // numerous cache misses!
  BLAS call 2
  …
  BLAS call n
  …
}
/* Global optimization */
doall loop {                     // can parallelize outer loop(s)
  …
  for loop {                     // loop fusion can improve locality
    …
    [read from Z]
    …
    [write to Z]
    …
    [read from Z]
  }
  …
}
Global optimization can expose better parallelism and locality
22. Tradeoffs between parallelism and locality
• Significant parallelism is needed to fully utilize all resources
• Locality is also critical to minimize communication
• Parallelism can come at the expense of locality
[Diagram: high on-chip parallelism vs. limited bandwidth at the chip border; reusing data once loaded on chip = locality]
• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously improves locality using loop fusion
23. Parallelism/locality tradeoff example
/* Original code: simplified CSLC LMS */
for (k=0; k<400; k++) {
  for (i=0; i<3997; i++) {
    z[i]=0;
    for (j=0; j<4000; j++)
      z[i] = z[i]+B[i][j]*x[k][j];
  }
  for (i=0; i<3997; i++)
    w[i] = w[i]+z[i];
}
/* Max. parallelism (no fusion): array z gets expanded to z_e to introduce another level of parallelism, but maximum distribution destroys locality */
doall (i=0; i<400; i++)
  doall (j=0; j<3997; j++)
    z_e[j][i]=0;
doall (i=0; i<400; i++)
  doall (j=0; j<3997; j++)
    for (k=0; k<4000; k++)
      z_e[j][i] = z_e[j][i]+B[j][k]*x[i][k];
doall (i=0; i<3997; i++)
  for (j=0; j<400; j++)
    w[i] = w[i]+z_e[i][j];
doall (i=0; i<3997; i++)   /* data accumulation */
  z[i] = z_e[i][399];
2 levels of parallelism, but poor data reuse (on array z_e)
24. Parallelism/locality tradeoff example (cont.)
/* Original code: simplified CSLC LMS (as above) */
/* Max. fusion: aggressive loop fusion destroys parallelism (only 1 degree of parallelism) */
doall (i=0; i<3997; i++)
  for (j=0; j<400; j++) {
    z[i]=0;
    for (k=0; k<4000; k++)
      z[i] = z[i]+B[i][k]*x[j][k];
    w[i] = w[i]+z[i];
  }
Very good data reuse (on array z), but only 1 level of parallelism
25. Parallelism/locality tradeoff example (cont.)
/* Original code: simplified CSLC LMS (as above) */
/* Partial fusion with expansion of array z: doesn't decrease parallelism */
doall (i=0; i<3997; i++) {
  doall (j=0; j<400; j++) {
    z_e[i][j]=0;
    for (k=0; k<4000; k++)
      z_e[i][j] = z_e[i][j]+B[i][k]*x[j][k];
  }
  for (j=0; j<400; j++)
    w[i] = w[i]+z_e[i][j];
}
doall (i=0; i<3997; i++)   /* data accumulation */
  z[i] = z_e[i][399];
2 levels of parallelism with good data reuse (on array z_e)
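The three mapped variants of the LMS kernel can be cross-checked against the original. The transliteration below shrinks the sizes (K, I, J stand in for the slide's 400, 3997, 4000), renames the loop indices uniformly for readability, uses made-up input data, and marks in comments which loops are doall (parallel):

```python
# Original kernel vs. the three mapped variants; all must agree.
K, I, J = 4, 5, 6
B = [[(i + j) % 7 for j in range(J)] for i in range(I)]
x = [[(k * j) % 5 for j in range(J)] for k in range(K)]

def original():
    z, w = [0] * I, [0] * I
    for k in range(K):
        for i in range(I):
            z[i] = 0
            for j in range(J):
                z[i] += B[i][j] * x[k][j]
        for i in range(I):
            w[i] += z[i]
    return z, w

def max_parallel():                # z expanded to z_e, no fusion
    z_e = [[0] * K for _ in range(I)]
    z, w = [0] * I, [0] * I
    for i in range(I):             # doall
        for k in range(K):         # doall
            for j in range(J):
                z_e[i][k] += B[i][j] * x[k][j]
    for i in range(I):             # doall, but z_e is re-read from memory
        for k in range(K):
            w[i] += z_e[i][k]
    for i in range(I):             # doall: data accumulation
        z[i] = z_e[i][K - 1]
    return z, w

def max_fusion():                  # one parallel loop, best reuse of z[i]
    z, w = [0] * I, [0] * I
    for i in range(I):             # doall (the only parallel level)
        for k in range(K):
            z[i] = 0
            for j in range(J):
                z[i] += B[i][j] * x[k][j]
            w[i] += z[i]
    return z, w

def partial_fusion():              # 2 parallel levels AND reuse of z_e[i]
    z_e = [[0] * K for _ in range(I)]
    z, w = [0] * I, [0] * I
    for i in range(I):             # doall
        for k in range(K):         # doall
            for j in range(J):
                z_e[i][k] += B[i][j] * x[k][j]
        for k in range(K):         # fused with the producer of z_e[i]
            w[i] += z_e[i][k]
    for i in range(I):             # doall: data accumulation
        z[i] = z_e[i][K - 1]
    return z, w

assert original() == max_parallel() == max_fusion() == partial_fusion()
```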
26. Parallelism/locality tradeoffs: performance numbers
Code with a good balance between parallelism and fusion performs best.
On explicitly managed memory/scratchpad architectures this is even more true.
27. R-Stream: affine scheduling and fusion
• R-Stream uses a heuristic based on an objective function with several cost coefficients:
  • w_l: the slowdown in execution if loop l is executed sequentially rather than in parallel
  • u_e: the cost in performance if the two loops joined by edge e remain unfused rather than fused
minimize   Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
(the first sum captures the slowdown of sequential execution, the second the cost of unfusing two loops)
• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation
Patent pending
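A toy instance of this objective makes the tradeoff visible. Here p_l = 1 if loop l runs sequentially and f_e = 1 if edge e stays unfused; the coupling rule (fusing across a serializing dependence forces both loops sequential) is a simplified stand-in for the real legality constraints, and the weights are invented:

```python
from itertools import product

# minimize  sum_l w_l*p_l + sum_e u_e*f_e  over binary p, f,
# subject to a hypothetical legality rule for fusion.
w = {"l1": 10, "l2": 10}          # slowdown if a loop runs sequentially
u = {("l1", "l2"): 3}             # cost of leaving the pair unfused
serializing = {("l1", "l2")}      # fusing this pair kills parallelism

def cost(p, f):
    return sum(w[l] * p[l] for l in w) + sum(u[e] * f[e] for e in u)

def legal(p, f):
    # fused (f=0) across a serializing edge => both loops sequential
    return all(f[e] == 1 or (p[e[0]] == 1 and p[e[1]] == 1)
               for e in serializing)

best = min(
    (cost(dict(zip(w, ps)), dict(zip(u, fs))), ps, fs)
    for ps in product([0, 1], repeat=len(w))
    for fs in product([0, 1], repeat=len(u))
    if legal(dict(zip(w, ps)), dict(zip(u, fs)))
)
# With these weights parallelism wins: keep both loops parallel
# (p = 0, 0) and pay the unfusing cost (f = 1).
```

Changing the weights flips the decision, which is exactly why the slide calls the formulation "unified and unbiased": parallelism and locality compete in one objective instead of being handled by separate phases.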
28. Parallelism + locality + spatial locality
Hypothesis: auto-tuning should adjust these parameters
minimize   Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
(w_l weighs the benefits of parallel execution; u_e the benefits of improved locality)
A new algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization, modulo data-layout transformations
30. What R-Stream does for you – in a nutshell
• Input
  • Sequential
    – Short and simple textbook C code
    – Just add a "#pragma map" and R-Stream figures out the rest
31. What R-Stream does for you – in a nutshell
• Input
  • Sequential: short and simple textbook C code
  • Just add a "#pragma map" and R-Stream figures out the rest
  • Example: Gauss-Seidel 9-point stencil
    – Used in iterative PDE solvers
    – Scientific modeling (heat, fluid flow, waves, etc.)
    – Building block for faster iterative solvers like Multigrid or AMR
• Output
  • OpenMP + CUDA code
    – Hundreds of lines of tightly optimized GPU-side CUDA code
    – A few lines of host-side OpenMP C code
  • Very difficult to hand-optimize
  • Not available in any standard library
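For reference, this is roughly what the "short and simple textbook" input looks like before mapping: a Gauss-Seidel sweep with a 9-point stencil, updating the grid in place (in R-Stream's C input the loop nest would carry the "#pragma map"; the sizes, initial data, and 1/9 weighting here are illustrative):

```python
# Gauss-Seidel 9-point stencil sweep.  In-place updates mean later
# iterations read already-updated neighbors, which is what makes this
# kernel hard to parallelize by hand.
N = 8
A = [[float((i * N + j) ** 2) for j in range(N)] for i in range(N)]

def gauss_seidel_9pt(A, sweeps):
    for _ in range(sweeps):
        for i in range(1, N - 1):
            for j in range(1, N - 1):
                A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1] +
                           A[i][j-1]   + A[i][j]   + A[i][j+1] +
                           A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0
    return A

result = gauss_seidel_9pt(A, 2)
```

The sequential dependences running through both i and j are why a parallel version needs the kind of schedule (e.g., wavefront/skewed) that R-Stream derives automatically.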
32. What R-Stream does for you – in a nutshell
• Input: sequential, short and simple textbook C code with a "#pragma map"
• Output: OpenMP + CUDA code – hundreds of lines of tightly optimized GPU-side CUDA, a few lines of host-side OpenMP C
• Achieving up to
  – 20 GFLOPS on a GTX 285
  – 25 GFLOPS on a GTX 480
(illustrated in the next few slides)
33. Finding and utilizing available parallelism
Excerpt of automatically generated code
[Diagram: GPU with SMs 1..N, each with shared memory, registers, an instruction unit, SPs 1..M, constant and texture caches, and off-chip device memory (global, constant, texture)]
R-Stream AUTOMATICALLY finds and forms parallelism: extracting and mapping parallel loops
34. Memory compaction on GPU scratchpad
Excerpt of automatically generated code
R-Stream AUTOMATICALLY manages the local scratchpad (per-SM shared memory)
35. GPU DRAM to scratchpad coalesced communication
Excerpt of automatically generated code
R-Stream AUTOMATICALLY chooses parallelism to favor coalescing: coalesced GPU DRAM accesses
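Why schedules that give successive threads contiguous addresses matter so much: a warp's loads collapse into few memory transactions only when the per-thread addresses fall in the same segments. A simplified model (warp of 32 threads, 128-byte transaction segments, 4-byte words; the exact rules vary by GPU generation):

```python
# Count memory transactions for one warp-wide access: the number of
# distinct 128-byte segments touched by the 32 threads' addresses.
WARP, SEGMENT, WORD = 32, 128, 4

def transactions(addresses):
    return len({addr // SEGMENT for addr in addresses})

coalesced = [WORD * t for t in range(WARP)]        # thread t reads A[t]
strided   = [WORD * 32 * t for t in range(WARP)]   # thread t reads A[32*t]

assert transactions(coalesced) == 1    # one transaction serves the warp
assert transactions(strided) == 32     # 32x the memory traffic
```

This 32x gap in effective bandwidth is what the scheduler's contiguity objective (slide 28) is buying.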
36. Host-to-GPU communication
Excerpt of automatically generated code
R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication over PCI Express (host memory ↔ off-chip device memory)
37. Multi-GPU mapping
Excerpt of automatically generated code
R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs: mapping across all GPUs (each CPU driving its own GPUs and their memories)
38. Multi-GPU mapping
Excerpt of automatically generated code
R-Stream AUTOMATICALLY creates n-way software pipelines for communications: multi-streaming of host-GPU communication
39. Future capabilities – mapping to CPU-GPU clusters
[Diagram: CPU+GPU nodes connected by a high-speed interconnect (e.g., InfiniBand); on each node, an MPI process runs OpenMP processes that launch CUDA kernels on the node's GPUs, with DRAM at the host and per-GPU levels]
48. 3D discretized wave equation input code (RTM)
Not so naïve…
• Communication autotuning knobs
• ThreadIdx.x divergence is expensive
50. Current status
• Ongoing development also supported by DOE and Reservoir
  • Improvements in scope, stability, performance
• Installations/evaluations at US government laboratories
• Forward collaboration with Georgia Tech on Keeneland
  • HP SL390 – 3 Fermi GPUs, 2 Westmeres per node
• Basis of the compiler for the DARPA UHPC Intel Corporation team
51. Availability
• Per-developer-seat licensing model
• Support (releases, bug fixes, services)
• Available with commercial-grade external solvers
• Government has limited rights / SBIR data rights
• Academic source licenses with collaborators
• Professional team, continuity, software engineering
52. R-Stream Gives More Insight About Programs
• Teaches you about parallelism and the polyhedral model:
  • Generates correct code for any transformation
    – The transformation itself may be incorrect if specified by the user
  • Imperative code generation was the bottleneck until 2005
    – Three theses have been written on the topic and it is still not completely covered
    – It is good to have an intuition of how these things work
• R-Stream has meaningful metrics to represent your program
  • Maximal amount of parallelism given minimal expansion
  • Tradeoffs between coarse-grained and fine-grained parallelism
  • Loop types (doall, reduction, permutable, sequential)
• Helps with algorithm selection
• Tiling of imperfectly nested loops
• Generates code for explicitly managed memory
53. How to Use R-Stream Successfully
• R-Stream is a great transformation tool, and also a great learning tool:
  • Takes "simple C" input code
  • Applies multiple transformations and lists them explicitly
  • Can dump code at any step in the process
  • It's the tool I wish I had during my PhD
    – To be fair, I already had a great set of tools
• R-Stream can be used in multiple modes:
  • Fully automatic + compile flag options
  • Autotuning mode (more than just tile size and unrolling…)
  • Scripting / programmatic mode (BeanShell + interfaces)
  • A mix of these modes + manual post-processing
54. Use Case: Fully Automatic Mode + Compile Flag Options
• Akin to traditional gcc / icc compilation with flags:
  • Predefined transformations can be parameterized
  • Except there are few or no phase-ordering issues
  • Except you can see what each transformation does at each step
  • Except you can generate compilable and executable code at (almost) any step in the process
  • Except you can control code generation for compactness or performance
55. Use Case: Autotuning Mode
• More advanced than traditional approaches:
  • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
  • Knobs are based on well-understood models
  • Knobs target high-level properties of the program
    – Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations…
  • Knobs depend on the target machine, the program, and the state of the mapping process:
    – Our tool has introspection
• We are really building a hierarchical autotuning transformation tool
56. Use Case: "Power user" interactive interface
• BeanShell access to optimizations
• Can direct and review the process of compilation
  • Automatic tactics (affine scheduling)
• Can direct code generation
• Access to "tuning parameters"
• All options / commands available on the command-line interface
57. Conclusion
• R-Stream simplifies software development and maintenance
• Porting: reduces expense and delivery delays
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, etc.
• Addresses
  • Dense loop-intensive computations
• Extensions
  • Data-parallel programming idioms
  • Sparse representations
  • Dynamic runtime execution
58. Contact us
• Per-developer seat, floating, and cloud-based licensing
• Discounts for academic users
• Research collaborations with academic partners
• For more information:
  • Call us at 212-780-0527, or
  • See Rich or Ann
  • E-mail us at {sales,lethin,johnson}@reservoir.com