[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)
http://cs264.org

http://bit.ly/gjQ3k7

Presentation Transcript

  • High-Level Language Features for Low-Level Programming
    Cyrus Omar
    Computer Science Department, Carnegie Mellon University
    http://www.cs.cmu.edu/~comar/
  • The Needs of Scientists and Engineers [Nguyen-Hoan et al., 2010]
    [This slide reproduces survey figures; only the captions and headline numbers are recoverable:
    Figure 7: reasons for choice of programming language (tool is 'standard', only language known, ease of use, required, open source, favourite, performance, legacy, organisation, features).
    Figure 9: reasons tools are used.
    Figure 18: importance of non-functional requirements as rated by 46 respondents on a five-point Likert scale from 'very unimportant' to 'very important'; the neutral option was included so that overall importance, not just relative ranking, could be determined.
    Table 1, combined 'important' and 'very important' ratings: 1. Reliability 100%, 2. Functionality 95%, 3. Maintainability 90%.]
  • The State of Scientific Programming Today
    C, Fortran, CUDA, OpenCL (Fast):
      • Control over memory allocation
      • Control over data movement
      • Access to hardware primitives
      • Portability
      but Tedious:
      • Type annotations, templates, pragmas
      • Obtuse compilers, linkers, preprocessors
      • No support for high-level abstractions
    MATLAB, Python, R, Perl (Productive):
      • Low syntactic overhead
      • Read-eval-print loop (REPL)
      • Flexible data structures and abstractions
      • Nice development environments
      but Slow:
      • Dynamic lookups and indirection abound
      • Automatic memory management can cause problems
    Scientists relieve the tension by:
      • writing overall control flow and basic data analysis routines in a high-level language
      • calling into a low-level language for performance-critical sections (can be annoying)
  • The State of Scientific Programming Tomorrow
    C, Fortran, CUDA, OpenCL (Fast):
      • Control over memory allocation
      • Control over data movement
      • Access to hardware primitives
      • Portability
      but Tedious:
      • Type annotations, templates, pragmas
      • Obtuse compilers, linkers, preprocessors
      • No support for high-level abstractions
    MATLAB, Python, R, Perl (Productive):
      • Low syntactic overhead
      • Read-eval-print loop (REPL)
      • Flexible data structures and abstractions
      • Nice development environments
      but Slow:
      • Dynamic lookups and indirection abound
      • Automatic memory management can cause problems
    Scientists relieve any remaining tension by:
      • writing overall control flow and basic data analysis routines in a high-level language
      • calling into cl.oquence for performance-critical sections
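The hybrid workflow described on this slide can be sketched in plain Python: a high-level driver owns control flow and basic analysis, and delegates the hot inner loop to a separate "kernel" routine, the role a C/Fortran/OpenCL function would normally play. The function names here are illustrative, not part of any library; the kernel body is pure Python only so the sketch is self-contained.

```python
def saxpy_kernel(alpha, x, y):
    """Hot inner loop: in a real pipeline this would live in C,
    Fortran, or OpenCL; it is plain Python here for illustration."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def analyze(data):
    """High-level driver: setup, control flow, and basic analysis
    stay in the productive language."""
    x = [float(i) for i in range(len(data))]
    scaled = saxpy_kernel(2.0, x, data)   # delegate the hot section
    return sum(scaled) / len(scaled)      # simple reduction/analysis

print(analyze([1.0, 1.0, 1.0]))  # -> 3.0
```

The point of the split is that only `saxpy_kernel` ever needs to be rewritten in a low-level language; `analyze` is untouched.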
  • What is cl.oquence?
    A low-level programming language that maps closely onto, and compiles down to, OpenCL.
    What is OpenCL?
    OpenCL is an emerging standard for low-level programming in heterogeneous computing environments. It is designed as a library that can be used from a variety of higher-level languages.
    What is a heterogeneous computing environment?
    One where many different compute devices and address spaces must be managed. Devices include multi-core CPUs (using a variety of instruction sets), GPUs, hybrid-core processors like the Cell BE, and other specialized accelerators.
    Why should I use cl.oquence?
      • Same core type system (including pointers) and performance profile as OpenCL
      • Usable from any host language that has OpenCL bindings
      • Type inference and extension inference eliminate annotational burden
      • Simplified syntax is a subset of Python; can use existing tools
      • Structural polymorphism gives you generic programming by default
      • New features:
        • Higher-order functions
        • Default arguments for functions
        • Python as the preprocessor and module system
        • Rich support for compile-time metaprogramming
      • Write compiler extensions and new basic types as libraries; modular, clean design
      • Light-weight and easy to integrate into any build process
      • Packaged with special Python host bindings that eliminate even basic overhead when using from within Python
      • Built on top of pyopencl and numpy
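Two of the bullets above, structural polymorphism and higher-order functions, can be illustrated with a behavioral analogy in ordinary Python (this is not cl.oquence's compiler; the names `ew_op`, `plus`, and `mul` echo the kernels shown on the later slides):

```python
def ew_op(a, b, dest, op):
    """Generic elementwise binary operation: works for any element
    types that `op` accepts -- generic programming 'by default'.
    The loop over gid stands in for one GPU thread per index."""
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(a, b): return a + b
def mul(a, b):  return a * b

di = [0, 0, 0]
ew_op([1, 2, 3], [4, 5, 6], di, plus)     # int elements
df = [0.0, 0.0]
ew_op([1.5, 2.5], [0.5, 0.5], df, mul)    # float elements, same source
```

One definition of `ew_op` serves every element type and every operation; in cl.oquence the compiler specializes such a generic function into monomorphic OpenCL code, whereas Python pays for the genericity at runtime.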
  • OpenCL
    // Parallel elementwise sum
    __kernel void sum_ff(__global float* a, __global float* b,
                         __global float* dest) {
        size_t gid = get_global_id(0);  // Get thread index
        dest[gid] = a[gid] + b[gid];
    }
    __kernel void sum_ii(__global int* a, __global int* b,
                         __global int* dest) {
        size_t gid = get_global_id(0);
        dest[gid] = a[gid] + b[gid];
    }
    __kernel void sum_fi(__global float* a, __global int* b,
                         __global float* dest) {
        size_t gid = get_global_id(0);
        dest[gid] = a[gid] + b[gid];
    }
    __kernel void sum_di(__global double* a, __global int* b,
                         __global double* dest) {
        #pragma OPENCL EXTENSION cl_khr_fp64 : enable
        size_t gid = get_global_id(0);
        dest[gid] = a[gid] + b[gid];
    }
    ...
    // Parallel elementwise product
    __kernel void prod_ff(__global float* a, __global float* b,
                          __global float* dest) {
        size_t gid = get_global_id(0);  // Get thread index
        dest[gid] = a[gid] * b[gid];
    }
    __kernel void prod_ii(__global int* a, __global int* b,
                          __global int* dest) {
        size_t gid = get_global_id(0);
        dest[gid] = a[gid] * b[gid];
    }
    ...
  • cl.oquence (the same kernels, written once)
    @cl.oquence.fn
    def ew_op(a, b, dest, op):
        """Parallel elementwise binary operation."""
        gid = get_global_id(0)  # Get thread index
        dest[gid] = op(a[gid], b[gid])

    @cl.oquence.fn
    def plus(a, b):
        """Adds the two operands."""
        return a + b

    @cl.oquence.fn
    def mul(a, b):
        """Multiplies the two operands."""
        return a * b
  • OpenCL//  Parallel  elementwise  sum @cl.oquence.fn__kernel  void  sum_ff(__global  float*  a,  __global  float*  b,   def  ew_op(a,  b,  dest,  op):                                          __global  float*  dest)  {        Parallel  elementwise  binary  operation.        size_t  gid  =  get_global_id(0);  //  Get  thread  index        gid  =  get_global_id(0)                  #  Get  thread  index        dest[gid]  =  a[gid]  +  b[gid];        dest[gid]  =  op(a[gid],  b[gid])} @cl.oquence.fn__kernel  void  sum_ii(__global  int*  a,  __global  int*  b,   def  plus(a,  b):                                          __global  int*  dest)  {        Adds  the  two  operands.        size_t  gid  =  get_global_id(0);   return  a  +  b        dest[gid]  =  a[gid]  +  b[gid];} @cl.oquence.fn def  mul(a,  b):__kernel  void  sum_fi(__global  float*  a,  __global  int*  b,          Multiplies  the  two  operands.                                          __global  float*  dest)  { return  a  *  b        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}__kernel  void  sum_df(__global  double*  a,  __global  int*  b,                                            __global  double*  dest)  {        #pragma  OPENCL  EXTENSION  cl_khr_fp64  :  enable        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}... 
...//  Parallel  elementwise  product__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ff(__global  float*  a,  __global  float*  b,                                        __global  float*  dest)  {                                            __global  float*  dest)  {        size_t  gid  =  get_global_id(0);  //  Get  thread  index        dest[gid]  =  a[gid]  *  b[gid];}__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ii(__global  int*  a,  __global  int*  b,                                        __global  float*  dest)  {                                          __global  int*  dest)  {
  • OpenCL//  Parallel  elementwise  sum @cl.oquence.fn__kernel  void  sum_ff(__global  float*  a,  __global  float*  b,   def  ew_op(a,  b,  dest,  op):                                          __global  float*  dest)  {        Parallel  elementwise  binary  operation.        size_t  gid  =  get_global_id(0);  //  Get  thread  index        gid  =  get_global_id(0)                  #  Get  thread  index        dest[gid]  =  a[gid]  +  b[gid];        dest[gid]  =  op(a[gid],  b[gid])} @cl.oquence.fn__kernel  void  sum_ii(__global  int*  a,  __global  int*  b,   def  plus(a,  b):                                          __global  int*  dest)  {        Adds  the  two  operands.        size_t  gid  =  get_global_id(0);   return  a  +  b        dest[gid]  =  a[gid]  +  b[gid];} @cl.oquence.fn def  mul(a,  b):__kernel  void  sum_fi(__global  float*  a,  __global  int*  b,          Multiplies  the  two  operands.                                          __global  float*  dest)  { return  a  *  b        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}__kernel  void  sum_df(__global  double*  a,  __global  int*  b,                                            __global  double*  dest)  {        #pragma  OPENCL  EXTENSION  cl_khr_fp64  :  enable        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}... 
...//  Parallel  elementwise  product__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ff(__global  float*  a,  __global  float*  b,                                        __global  float*  dest)  {                                            __global  float*  dest)  {        size_t  gid  =  get_global_id(0);  //  Get  thread  index        dest[gid]  =  a[gid]  *  b[gid];}__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ii(__global  int*  a,  __global  int*  b,                                        __global  float*  dest)  {                                          __global  int*  dest)  {
These two libraries express the same thing, and the generated code will run in precisely the same amount of time.
  • Two invocation models

1. Standalone compilation to OpenCL. Use any host language that has OpenCL bindings available: C, C++, Fortran, MATLAB, Java, .NET, Ruby, Python.

Using the generic ew_op, plus and mul defined above:

```python
# Programmatically specialize and assign types to
# any externally callable versions you need.
sum  = ew_op.specialize(op=plus)
prod = ew_op.specialize(op=mul)

g_int_p   = cl_int.global_ptr
g_float_p = cl_float.global_ptr
sum_ff = sum.compile(g_float_p, g_float_p, g_float_p)
sum_ii = sum.compile(g_int_p, g_int_p, g_int_p)
```
Running `clqcc hello.clq` creates hello.cl:

```c
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
```
2. Integrated into a host language: Python + pyopencl (w/ extensions) + numpy.

```python
# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# transfer data to device
ctx = cl.ctx = cl.Context.for_device(0, 0)
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus,
      global_size=a.shape, local_size=(256,)).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))
```
Four simple memory management functions:
1. to_device: numpy array => new buffer
2. from_device: buffer => new numpy array
3. alloc: empty buffer
4. copy: copies between existing buffers or arrays

Buffers hold metadata (type, shape, order) so you don't have to provide it.
An implicit queue is associated with each context.
The auto annotation can hide the details of parallelization from the user:

```python
@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,))
@cl.oquence.fn
def ew_op(a, b, dest, op):
    """Parallel elementwise binary operation."""
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

# invoke function (automatically specialized as needed);
# global_size and local_size no longer need to be passed
ew_op(a_buf, b_buf, dest_buf, plus).wait()
```
The In, Out and InOut constructs can help automate data movement when convenient:

```python
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
c = numpy.empty_like(a)

# create an OpenCL context
ctx = cl.ctx = cl.Context.for_device(0, 0)

# invoke function (automatically specialized as needed)
ew_op(In(a), In(b), Out(c), plus).wait()

# check results
print la.norm(c - (a + b))
```
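To make the In/Out idea concrete, here is a pure-Python sketch of direction-annotated arguments driving data movement. These wrapper classes and the invoke helper are hypothetical stand-ins, not the cl.oquence API:

```python
class In:
    """Marks an argument that is only copied to the device."""
    def __init__(self, array):
        self.array = array

class Out:
    """Marks an argument whose result is copied back to the host."""
    def __init__(self, array):
        self.array = array

def invoke(kernel, *wrapped):
    """Copy every argument to a 'device-side' list, run the kernel,
    then copy Out arguments back into the caller's arrays."""
    device = [list(w.array) for w in wrapped]
    kernel(*device)
    for w, d in zip(wrapped, device):
        if isinstance(w, Out):
            w.array[:] = d

def ew_plus(a, b, dest):
    # elementwise sum, standing in for a specialized kernel
    for i in range(len(dest)):
        dest[i] = a[i] + b[i]

a, b, c = [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]
invoke(ew_plus, In(a), In(b), Out(c))
# c now holds [4.0, 6.0]; a and b are untouched
```

The point of the annotations is that the runtime, not the user, decides which buffers need allocation, upload, and download around each call.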
How?
  • cl.oquence code looks like Python, but it is not Python:
    • Same core type system as OpenCL (C99+)
    • Type inference to eliminate type annotations (not dynamic lookups)
    • Extension inference to eliminate pragmas
    • Higher-order functions (inlined at compile time)
    • Structural polymorphism
      • All functions are generic by default
      • You can call a function with any arguments that support the operations it uses.
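The type-inference bullet can be sketched concretely: given concrete argument types, each expression's type is determined statically from a promotion table, not by a dynamic lookup at runtime. A toy sketch (the table and function name are illustrative, not cl.oquence internals):

```python
# C99-style promotion table for a binary arithmetic operator (illustrative).
PROMOTE = {
    ("int", "int"): "int",
    ("int", "float"): "float",
    ("float", "int"): "float",
    ("float", "float"): "float",
    ("float", "double"): "double",
    ("double", "float"): "double",
    ("double", "double"): "double",
}

def infer_binop(type_a, type_b):
    """Infer the static result type of 'a + b' from the operand types."""
    return PROMOTE[(type_a, type_b)]

# Typing 'dest[gid] = a[gid] + b[gid]' for the float/int case:
result_type = infer_binop("float", "int")
# so dest must be a float buffer -- with no annotation written anywhere
```

This is why the sum_fi and sum_df variants need no hand-written signatures: the element types of a and b fix everything else.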
With either invocation model, Python is now your preprocessor:
  • Functions can be programmatically generated from source or ASTs
  • You're using Python's well-designed module system instead of the #include system (!)
    • Use distutils, PyPI, and so on
  • The syntax is a subset of Python
    • Same source code highlighters
    • Use standard documentation generators
  • You can write compiler and language extensions as libraries
  • Bonus: default values for function arguments
Downsides (e.g. open projects, email me!):
  • No current support for graphical debuggers
    • You can, however, optionally include line numbers from the original file in comments
    • The mapping is close enough that debugging is not typically a problem, and generated source code is formatted nicely
  • Calling non-cl.oquence OpenCL libraries requires writing an explicit extern directive:
    • ext_import("library.cl")
    • ext_function = extern(cl.void, cl_int, ...)
  • Neurobiological Circuit Simulations: cl.egans

A modular simulation architecture that uses compile-time code generation to avoid the typical performance penalties.

```python
from cl.egans import Simulation
from ahh.cl.egans.spiking.models import ReducedLIF
from ahh.cl.egans.spiking.inputs import ExponentialSynapse

sim = Simulation(ctx,
    n_realizations=1,
    n_timesteps=10000,
    DT=0.1)

# Create 4000 LIF neurons
N_Exc = 3200
N_Inh = 800
N = N_Exc + N_Inh
neurons = ReducedLIF(sim, "LIF",
    count=N,
    tau=20.0,
    v_reset=0.0,
    v_thresh=10.0,
    abs_refractory_period=5.0)

# Create excitatory and inhibitory synapses
e_synapse = ExponentialSynapse(neurons, ge,
    tau=5.0,
    reversal=60.0)

...

sim.generate()
print sim.code
```

Generated step function (excerpt):

```python
@cl.oquence.fn
def step_fn(timestep, realization_start):
    gid = get_global_id(0)
    gsize = get_global_size(0)
    first_idx_sim = realization_start * 4000
    last_idx_sim = min(first_idx_sim + 4000, 4000)
    for idx_sim in (first_idx_sim + gid, last_idx_sim, gsize):
        realization_num = idx_sim / 4000
        realization_first_idx_sim = realization_num * 4000
        realization_first_idx_div = (realization_num -
            realization_start) * 4000
        idx_realization = idx_sim - realization_first_idx_sim
        idx_division = idx_sim - first_idx_sim
        idx_model = idx_realization - 0
        idx_state = idx_model + (realization_num -
            realization_start) * 4000
        LIF_v = LIF_v_buffer[idx_state]
        # ...
        if v_new >= 10.0:
            LIF_v_buffer[idx_state] = 0.0
            target = LIF_ge_AtomicReceiver_out if idx_model < 3200 \
                else LIF_gi_AtomicReceiver_out
            neighbors_offset = neighbor_data[idx_realization]
            neighbor_size = neighbor_data[neighbors_offset]
            neighbors = neighbor_data + neighbors_offset + 1
            for i in (0, neighbor_size, 1):
                atom_add(target + realization_first_idx_div +
                    neighbors[i], 1)
        else:
            # ...
```
  • The State of Scientific Programming, Tomorrow

Fast (C, Fortran, CUDA, OpenCL):
  • Control over memory allocation
  • Control over data movement
  • Access to hardware primitives
  • Portability
But tedious:
  • Type annotations, templates, pragmas
  • Obtuse compilers, linkers, preprocessors
  • No support for high-level abstractions

Productive (MATLAB, Python, R, Perl):
  • Low syntactic overhead
  • Read-eval-print loop (REPL)
  • Flexible data structures and abstractions
  • Nice development environments
But slow:
  • Dynamic lookups and indirection abound
  • Automatic memory management can cause problems

Scientists relieve any remaining tension by:
  • writing overall control flow and basic data analysis routines in a high-level language
  • calling into cl.oquence for performance-critical sections
  • Current Status:
    • Everything works, just need to clean out some cobwebs.
    • It will be available at http://cl.oquence.org/ soon (May).
    • If you want to use it today, email me (cyrus@cmu.edu).
    • Join clq-announce on Google Groups for release announcements.
    • Paper will be in submission shortly.

OpenCL: The Good Parts
Cyrus Omar
Computer Science Department
Carnegie Mellon University
http://www.cs.cmu.edu/~comar/