Two invocation models
1. Standalone compilation to OpenCL (functions marked with @cl.oquence.fn)
2. … (@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,)))
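The `@cl.oquence.auto` decorator above apparently takes a function that derives the global work size from the kernel's arguments, plus a fixed local work size. The dispatch idea can be sketched in pure Python; note that `Array`, `auto`, and the launch logic below are illustrative stand-ins, not the real cl.oquence API.

```python
# Sketch of the "automatic invocation" model: a decorator that derives
# the OpenCL global/local work sizes from the kernel arguments, so the
# caller never spells out a launch configuration.
# NOTE: Array, auto, and the dispatch below are hypothetical stand-ins.

class Array:
    """Stand-in for a device buffer with a numpy-like shape."""
    def __init__(self, shape):
        self.shape = shape

def auto(global_size_fn, local_size):
    """Wrap a kernel so work sizes are computed per call."""
    def decorate(fn):
        def launch(*args):
            global_size = global_size_fn(*args)
            # A real implementation would enqueue the compiled kernel
            # here; we just report the launch configuration.
            return {'global_size': global_size, 'local_size': local_size}
        return launch
    return decorate

@auto(lambda a, b, dest, op: a.shape, (256,))
def ew_op(a, b, dest, op):
    pass  # kernel body runs on the device

config = ew_op(Array((1024,)), Array((1024,)), Array((1024,)), 'plus')
print(config)  # {'global_size': (1024,), 'local_size': (256,)}
```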
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)
Published on http://cs264.org (http://bit.ly/gjQ3k7)
Published in: Education, Technology
1. High-Level Language Features for Low-Level Programming
Cyrus Omar
Computer Science Department
Carnegie Mellon University
http://www.cs.cmu.edu/~comar/
2. The Needs of Scientists and Engineers
[Survey figures from Nguyen-Hoan et al., 2010: reasons for choice of programming language (performance, ease of use, only language known, legacy, features); reasons tools are used and tool types (modelling, frameworks, bug/change tracking, build tools, libraries/packages, testing); and the importance of non-functional requirements, rated on a five-point Likert scale from "very unimportant" to "very important". Top combined "important"/"very important" ratings: Reliability 100%, Functionality 95%, Maintainability 90%.]
3. The State of Scientific Programming Today

C, Fortran, CUDA, OpenCL: Fast
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability
...but Tedious
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl: Productive
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments
...but Slow
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve the tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into a low-level language for performance-critical sections (can be annoying)
4. The State of Scientific Programming Tomorrow
[Same two-column comparison as the previous slide.]

Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)
5. What is cl.oquence?
A low-level programming language that maps closely onto, and compiles down to, OpenCL.

What is OpenCL?
OpenCL is an emerging standard for low-level programming in heterogeneous computing environments. It is designed as a library that can be used from a variety of higher-level languages.

What is a heterogeneous computing environment?
One where many different compute devices and address spaces must be managed. Devices include multi-core CPUs (using a variety of instruction sets), GPUs, hybrid-core processors like the Cell BE, and other specialized accelerators.

Why should I use cl.oquence?
• Same core type system (including pointers) and performance profile as OpenCL
• Usable from any host language that has OpenCL bindings
• Type inference and extension inference eliminate annotational burden
• Simplified syntax is a subset of Python; can use existing tools
• Structural polymorphism gives you generic programming by default
• New features:
  • Higher-order functions
  • Default arguments for functions
• Python as the preprocessor and module system
• Rich support for compile-time metaprogramming
  • Write compiler extensions, new basic types as libraries; modular, clean design
• Light-weight and easy to integrate into any build process
• Packaged with special Python host bindings that eliminate even basic overhead when using from within Python
• Built on top of pyopencl and numpy
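Because the simplified syntax is a subset of Python, standard Python tooling can process kernel sources directly. As a small demonstration, the stdlib `ast` module can parse the elementwise-operation kernel from the later slides (this only shows that the surface syntax parses as Python; it says nothing about cl.oquence's own compiler):

```python
# Parse a cl.oquence-style kernel with Python's own parser: since the
# language's syntax is a Python subset, ast.parse accepts it unchanged.
import ast

source = '''\
def ew_op(a, b, dest, op):
    gid = get_global_id(0)
    dest[gid] = op(a[gid], b[gid])
'''

tree = ast.parse(source)
fn = tree.body[0]                          # the FunctionDef node
print(fn.name)                             # ew_op
print([arg.arg for arg in fn.args.args])   # ['a', 'b', 'dest', 'op']
```

This is what makes "can use existing tools" concrete: editors, linters, and AST-based analyzers built for Python apply out of the box.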
9. OpenCL

// Parallel elementwise sum
__kernel void sum(__global float* a, __global float* b,
                  __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum(__global int* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum(__global short* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum(__global float* a, __global double* b,
                  __global float* dest) {
    #pragma ...
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] * b[gid];
}
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
...
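The duplication on this slide grows multiplicatively: every combination of element types, times every operation, needs its own kernel. A common workaround today is to stamp the specializations out from a string template at build time; a minimal sketch (the template and naming scheme here are illustrative, and the destination type is crudely taken from the first operand):

```python
# Generate specialized OpenCL kernels from a template, one per
# (operation, type, type) combination, to cope with the lack of
# overloading/generics in OpenCL C. Illustrative sketch only.
from itertools import product

TEMPLATE = """__kernel void {op}_{sa}{sb}(__global {ta}* a, __global {tb}* b,
                    __global {ta}* dest) {{
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] {sym} b[gid];
}}"""

TYPES = {'float': 'f', 'int': 'i'}   # type -> name-mangling suffix
OPS = {'sum': '+', 'prod': '*'}      # kernel name -> operator

kernels = [
    TEMPLATE.format(op=op, sym=sym, ta=ta, tb=tb, sa=sa, sb=sb)
    for op, sym in OPS.items()
    for (ta, sa), (tb, sb) in product(TYPES.items(), repeat=2)
]
print(len(kernels))  # 8 kernels already, for just 2 types x 2 ops
```

Even this small configuration produces eight kernels; adding `short`, `double`, and a few more operations makes the generated source balloon, which is exactly the burden the cl.oquence slides that follow are addressing.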
12. OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
__kernel void sum_di(__global double* a, __global int* b,
                     __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
...
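Note that only the `double` variant needs the `cl_khr_fp64` pragma. This is the kind of bookkeeping the earlier slide's "extension inference" refers to: scan the types a kernel actually uses and emit the pragmas they require. A toy sketch of such inference (the type-to-extension table below covers only fp64/fp16 and is illustrative):

```python
# Toy "extension inference": map the OpenCL types used by a kernel to
# the extension pragmas they require, so the programmer never writes
# them by hand. Only fp64/fp16 are covered; illustrative sketch.

REQUIRED_EXTENSIONS = {
    'double': 'cl_khr_fp64',
    'half':   'cl_khr_fp16',
}

def inferred_pragmas(arg_types):
    """Return the pragma lines needed for the given parameter types."""
    exts = {REQUIRED_EXTENSIONS[t] for t in arg_types
            if t in REQUIRED_EXTENSIONS}
    return ['#pragma OPENCL EXTENSION %s : enable' % e
            for e in sorted(exts)]

print(inferred_pragmas(['double', 'int']))
# ['#pragma OPENCL EXTENSION cl_khr_fp64 : enable']
print(inferred_pragmas(['float', 'float']))  # []
```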
15. OpenCL

// Parallel elementwise product
__kernel void prod_ff(__global float* a, __global float* b,
                      __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] * b[gid];
}
__kernel void prod_ii(__global int* a, __global int* b,
                      __global int* dest) {
...
17. "My photographs tell stories of loss, human struggle, and personal exploration within landscapes scarred by technology and over-use… [I] strive to metaphorically and poetically link laborious actions, idiosyncratic rituals and strangely crude machines into tales about our modern experience."
    Robert ParkeHarrison
18. cl.oquence

@cl.oquence.fn
def plus(a, b):
    """Adds the two operands."""
    return a + b

@cl.oquence.fn
def mul(a, b):
    """Multiplies the two operands."""
    return a * b
  19. 19. OpenCL//  Parallel  elementwise  sum @cl.oquence.fn__kernel  void  sum_ff(__global  float*  a,  __global  float*  b,   def  ew_op(a,  b,  dest,  op):                                          __global  float*  dest)  {        Parallel  elementwise  binary  operation.        size_t  gid  =  get_global_id(0);  //  Get  thread  index        gid  =  get_global_id(0)                  #  Get  thread  index        dest[gid]  =  a[gid]  +  b[gid];        dest[gid]  =  op(a[gid],  b[gid])} @cl.oquence.fn__kernel  void  sum_ii(__global  int*  a,  __global  int*  b,   def  plus(a,  b):                                          __global  int*  dest)  {        Adds  the  two  operands.        size_t  gid  =  get_global_id(0);   return  a  +  b        dest[gid]  =  a[gid]  +  b[gid];} @cl.oquence.fn def  mul(a,  b):__kernel  void  sum_fi(__global  float*  a,  __global  int*  b,          Multiplies  the  two  operands.                                          __global  float*  dest)  { return  a  *  b        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}__kernel  void  sum_df(__global  double*  a,  __global  int*  b,                                            __global  double*  dest)  {        #pragma  OPENCL  EXTENSION  cl_khr_fp64  :  enable        size_t  gid  =  get_global_id(0);          dest[gid]  =  a[gid]  +  b[gid];}... 
...//  Parallel  elementwise  product__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ff(__global  float*  a,  __global  float*  b,                                        __global  float*  dest)  {                                            __global  float*  dest)  {        size_t  gid  =  get_global_id(0);  //  Get  thread  index        dest[gid]  =  a[gid]  *  b[gid];}__kernel  void  prod(__global  float*  a,  __global  float*  b,  __kernel  void  prod_ii(__global  int*  a,  __global  int*  b,                                        __global  float*  dest)  {                                          __global  int*  dest)  {
23. OpenCL vs. cl.oquence (same comparison as slide 19)

These two libraries express the same thing.
The code will run in precisely the same amount of time.
24. Two invocation models

1. Standalone compilation to OpenCL
   • Use any host language that has OpenCL bindings available:
     C, C++, Fortran, MATLAB, Java, .NET, Ruby, Python

(ew_op, plus, and mul defined as on slide 19.)

# Programmatically specialize and assign types to
# any externally callable versions you need.
sum = ew_op.specialize(op=plus)
prod = ew_op.specialize(op=mul)
g_int_p = cl_int.global_ptr
g_float_p = cl_float.global_ptr
sum_ff = sum.compile(g_float_p, g_float_p, g_float_p)
sum_ii = sum.compile(g_int_p, g_int_p, g_int_p)
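The specialization step can be emulated in plain Python with `functools.partial`, which fixes the higher-order `op` argument much as `ew_op.specialize(op=plus)` does on the slide. This is a hypothetical stand-in, not the cl.oquence API; the real `specialize`/`compile` additionally assigns OpenCL types and emits a kernel.

```python
from functools import partial

def ew_op(a, b, dest, op):
    """Generic elementwise binary operation (sequential emulation)."""
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(a, b):
    return a + b

def mul(a, b):
    return a * b

# Analogous to ew_op.specialize(op=plus) / specialize(op=mul):
ew_sum  = partial(ew_op, op=plus)
ew_prod = partial(ew_op, op=mul)

dest = [0] * 3
ew_sum([1, 2, 3], [10, 20, 30], dest)
print(dest)  # [11, 22, 33]
```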
25. Two invocation models (continued)

Running the standalone compiler:

clqcc hello.clq

creates hello.cl:

__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
26. Two invocation models

1. Standalone compilation to OpenCL
2. Integrated into a host language
   • Python + pyopencl (w/ extensions) + numpy

(ew_op, plus, and mul defined as on slide 19.)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# transfer data to device
ctx = cl.ctx = cl.Context.for_device(0, 0)
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus,
      global_size=a.shape, local_size=(256,)).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))
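The host-side flow above (transfer in, allocate, invoke, transfer out) can be sketched without a GPU by treating "device buffers" as plain Python lists. `Context`, `to_device`, `alloc`, and `from_device` below are toy stand-ins mirroring the names on the slide, not the real pyopencl-based API.

```python
import random

class Context:
    """Toy stand-in for the slide's cl.Context: "device buffers"
    are plain Python lists, so each transfer is just a copy."""
    def to_device(self, array):
        return list(array)            # host -> "device"
    def alloc(self, like):
        return [0.0] * len(like)      # empty "device" buffer
    def from_device(self, buf):
        return list(buf)              # "device" -> host

def ew_op(a, b, dest, op):
    for gid in range(len(dest)):      # one "thread" per index
        dest[gid] = op(a[gid], b[gid])

# Same flow as the slide's host code, minus the GPU:
a = [random.random() for _ in range(1000)]
b = [random.random() for _ in range(1000)]
ctx = Context()
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)
ew_op(a_buf, b_buf, dest_buf, lambda x, y: x + y)
result = ctx.from_device(dest_buf)

# check results (exact: same floats, same order of operations)
err = max(abs(r - (x + y)) for r, x, y in zip(result, a, b))
print(err)  # 0.0
```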
27. Two invocation models (continued)

Four simple memory management functions:
1. to_device: numpy array => new buffer
2. from_device: buffer => new numpy array
3. alloc: empty buffer
4. copy: copies between existing buffers or arrays

Buffers hold metadata (type, shape, order) so you don't have to provide it.
28. Two invocation models (continued)

Implicit queue associated with each context.
29. Two invocation models (continued)

@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,))
@cl.oquence.fn
def ew_op(a, b, dest, op):
    """Parallel elementwise binary operation."""
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

The auto annotation can allow you to hide the details of parallelization from the user:

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus).wait()
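The idea behind the auto annotation can be sketched as a plain-Python decorator: it derives the launch size from the call arguments, so callers no longer pass `global_size`/`local_size`. This `auto` is a hypothetical simplification (it ignores local/workgroup size and runs sequentially), not the cl.oquence decorator itself.

```python
def auto(global_size_fn):
    """Toy version of @cl.oquence.auto: compute the launch size from
    the call arguments so callers need not supply it."""
    def wrap(fn):
        def launched(*args):
            n = global_size_fn(*args)
            for gid in range(n):      # one "thread" per index
                fn(gid, *args)
        return launched
    return wrap

@auto(lambda a, b, dest, op: len(dest))
def ew_op(gid, a, b, dest, op):
    dest[gid] = op(a[gid], b[gid])

dest = [0] * 3
ew_op([1, 2, 3], [4, 5, 6], dest, lambda x, y: x + y)
print(dest)  # [5, 7, 9]
```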
30. Two invocation models (continued)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
c = numpy.empty_like(a)

# create an OpenCL context
ctx = cl.ctx = cl.Context.for_device(0, 0)

# invoke function (automatically specialized as needed)
ew_op(In(a), In(b), Out(c), plus).wait()

# check results
print la.norm(c - (a + b))

The In, Out and InOut constructs can help automate data movement when convenient.
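The In/Out idea can be sketched in plain Python: wrappers mark which arguments must be copied to the device before the call and which buffers must be copied back afterwards. `In`, `Out`, and `invoke` below are toy stand-ins under that assumption, with lists playing the role of device buffers.

```python
class In:
    """Marks an argument to be copied to the device before the call."""
    def __init__(self, arr):
        self.arr = arr

class Out:
    """Marks a buffer to be copied back to the host after the call."""
    def __init__(self, arr):
        self.arr = arr

def invoke(kernel, *args):
    """Toy dispatcher: copy In/Out arguments to "device" lists,
    run the kernel, then copy Out buffers back."""
    bufs = [list(a.arr) if isinstance(a, (In, Out)) else a
            for a in args]
    kernel(*bufs)
    for a, buf in zip(args, bufs):
        if isinstance(a, Out):
            a.arr[:] = buf

def ew_sum(a, b, dest):
    for gid in range(len(dest)):
        dest[gid] = a[gid] + b[gid]

c = [0, 0, 0]
invoke(ew_sum, In([1, 2, 3]), In([4, 5, 6]), Out(c))
print(c)  # [5, 7, 9]
```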