Ruby Supercomputing
Using The GPU For Massive Performance Speedup

Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
Last Updated: March 17th, 2011.

@prestonism | gmail: conmotto
http://prestonlee.com
http://github.com/preston
http://www.slideshare.net/preston.lee/
git@github.com:preston/ruby-gpu-examples.git

   Grab the code now if you want, but to run all the examples
   you'll need the NVIDIA development driver and toolkit,
   Ruby 1.9, the "barracuda" gem, and JRuby (without the
   barracuda gem) on a multi-core OS X Snow Leopard system.
   This takes time to set up, so maybe just chillax for now?
Let's find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
Math. Yay!

✤   The inner-most ring is ring #1.

✤   Total area of ring #5 is π times the square of the radius. (πr²)

✤   Area of only ring #5 is πr² minus the area of ring #4.

✤   (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))

✤   Let's find the area of every ring...
1st working attempt.
(Single-threaded Ruby.)

RINGS = 1_000_000 # Matches the C benchmark's DEFAULT_RINGS.

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

(1..RINGS).each do |i|
  puts ring_area(i)
end
2nd working attempt.
(Multi-threaded Ruby.)

RINGS = 1_000_000
NUM_THREADS = 8

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Multi-thread it!
threads = []
(0..(NUM_THREADS - 1)).each do |i|
  rings_per_thread = RINGS / NUM_THREADS
  offset = i * rings_per_thread
  threads << Thread.new(rings_per_thread, offset) do |num, offset|
    last = offset + num - 1
    (offset..last).each do |radius|
      ring_area(radius)
    end
  end
end
threads.each { |t| t.join }
3rd/4th working attempt.
(Single/Multi-threaded C. Ohhh crap...)

/*
  A baseline CPU-based benchmark program for CPU/GPU performance comparison.
  Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
  by taking the total area at a given radius and subtracting the area of the closest inner ring.
  Copyright © 2011 Preston Lee. All rights reserved.
  http://prestonlee.com
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#include <sys/time.h>

#include "tree_rings.h"

#define DEFAULT_RINGS 1000000
#define NUM_THREADS 8
#define DEBUG 0

int acc = 0;

int main(int argc, const char * argv[]) {
    int rings = DEFAULT_RINGS;
    if(argc > 1) {
        rings = atoi(argv[1]);
    }
    printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
    printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
    printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);
    printf("Number of tree rings: %i. Yay!\n", rings);

    struct timeval start, stop, diff;

    printf("\nRunning serial calculation using CPU...\t\t\t");
    gettimeofday(&start, NULL);
    calculate_ring_areas_in_serial(rings);
    gettimeofday(&stop, NULL);
    timeval_subtract(&diff, &stop, &start);
    printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

    printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
    gettimeofday(&start, NULL);
    calculate_ring_areas_in_parallel(rings);
    gettimeofday(&stop, NULL);
    timeval_subtract(&diff, &stop, &start);
    printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

    printf("\nDone!\n\n");
    return EXIT_SUCCESS;
}

/* Approximate the cross-sectional area between each pair of consecutive tree rings
   in serial */
void calculate_ring_areas_in_serial(int rings) {
    calculate_ring_areas_in_serial_with_offset(rings, 0);
}

void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
    int i;
    int offset = rings * thread;
    int max = rings + offset;
    float a = 0;
    for(i = offset; i < max; i++) {
        a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
    }
}

/* Approximate the cross-sectional area between each pair of consecutive tree rings
   in parallel on NUM_THREADS threads */
void calculate_ring_areas_in_parallel(int rings) {
    pthread_t threads[NUM_THREADS];
    int rc;
    int t;
    int rings_per_thread = rings / NUM_THREADS;
    ring_thread_data data[NUM_THREADS];
    for(t = 0; t < NUM_THREADS; t++){
        data[t].rings = rings_per_thread;
        data[t].number = t;
        rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    for(t = 0; t < NUM_THREADS; t++){
        pthread_join(threads[t], NULL);
    }
}

void ring_job(ring_thread_data * data) {
    calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
}
Speed: Your Primary Options

1. Pure Ruby in a big loop. (Single
   threaded.)

2. Pure Ruby, multi-threading
   smaller loops. (Limited to using
   a single core on 1.9 due to the
   GIL, but not on JRuby etc.)

3. C extension, single thread.

4. C extension, pthreads.

5. "Divide and conquer."
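Options 1 and 2 can be compared head-to-head with Ruby's standard Benchmark module. This is a sketch, not the benchmark from the repo; the ring count and thread count here are illustrative, and on MRI 1.9 expect the two timings to be roughly equal because of the GIL.

```ruby
require 'benchmark'

rings = 200_000
num_threads = 4

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

Benchmark.bm(16) do |bm|
  # Option 1: one big single-threaded loop.
  bm.report('single-threaded:') do
    (1..rings).each { |r| ring_area(r) }
  end

  # Option 2: the same work split across Ruby threads. On MRI 1.9 the
  # GIL serializes these; on JRuby they run on separate cores.
  bm.report('multi-threaded:') do
    per = rings / num_threads
    threads = (0...num_threads).map do |t|
      Thread.new(t * per) do |offset|
        ((offset + 1)..(offset + per)).each { |r| ring_area(r) }
      end
    end
    threads.each(&:join)
  end
end
```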
Ruby 1.9
CPU Concurrency

✤   Ideally, asynchronous tasks across multiple physical and/or logical
    cores.

✤   POSIX threading is the standard.

✤   Producer/Consumer pattern typically used to account for differences
    in machine execution time.

✤   Concurrency generally implemented with Time-Division
    Multiplexing. A CPU with 4 cores can run 100 threads, but the OS
    will time-share execution time, more-or-less fairly.

✤   MIMD: Multiple Instruction Multiple Data
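The Producer/Consumer pattern mentioned above can be sketched with Ruby's thread-safe Queue; the item count, consumer count, and :done sentinel here are illustrative choices, not part of the example repo.

```ruby
require 'thread'

work    = Queue.new
results = Queue.new

# Producer: pushes ring radii to be processed.
producer = Thread.new do
  (1..100).each { |radius| work << radius }
  4.times { work << :done } # One sentinel per consumer.
end

# Consumers: pop work items and compute areas at their own pace,
# which absorbs differences in per-item execution time.
consumers = 4.times.map do
  Thread.new do
    while (radius = work.pop) != :done
      results << (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
    end
  end
end

producer.join
consumers.each(&:join)
puts results.size # => 100
```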
Common CPU Issues

✤   Sometimes we need insane numbers of threads per host.

✤   Lock performance.

✤   Potential for excessive virtual memory swapping.

✤   Many tests are non-deterministic.

✤   Concurrency modeling is difficult to get correct.
Can't we just execute every instruction
concurrently, but with different data?

(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))

No.

(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))

Multiple Instruction, Multiple Data. (MIMD)

Yes!

(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))

Single Instruction, Multiple Data. (SIMD)
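In SIMD terms, every "thread" executes this one expression against a different radius; conceptually it is a map over the data. The Ruby below is only an analogy for that idea, not actual SIMD hardware execution:

```ruby
radii = (1..8).to_a

# One instruction stream, many data points: each element gets the
# identical expression; only the value of `radius` differs.
areas = radii.map do |radius|
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

puts areas.first # Area of ring #1 is exactly π.
```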
GPU Brief History

✤   Graphics Processing Units were initially developed for 3D scene rendering, such as computing
    luminosity values of each pixel on the screen concurrently.

✤   Consider a 1024x768 display. That's 786,432 pixels updating 60 times per second. 786,432 pixels
    @ 60Hz => 47,185,920 potential pixel pushes per second!

✤   Running 786,432 threads makes the OS unhappy. :(

✤   Sometimes it'd be great to have all calculations finish simultaneously. If you're updating every
    pixel in a display, each can be computed concurrently. Exactly concurrently.

✤   Generally fast floating point. (Critical for scientific computing.)

✤   SIMD: Single Instruction Multiple Data
Hardware Examples




         Images from NVIDIA and ATI.
GPU Pros

✤   SIMD architecture can run thousands of threads concurrently.

✤   Many synchronization issues are designed out.

✤   More FLOPS (floating point operations per second) than host CPU.

✤   Synchronously execute the same instruction for different data points.
NVidia Tesla C1060

CL_DEVICE_NAME:                         Tesla C1060
CL_DEVICE_VENDOR:                       NVIDIA Corporation
CL_DRIVER_VERSION:                      260.81
CL_DEVICE_TYPE:                         CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:            30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:     3
CL_DEVICE_MAX_WORK_ITEM_SIZES:          512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:          512
CL_DEVICE_MAX_CLOCK_FREQUENCY:          1296 MHz
CL_DEVICE_ADDRESS_BITS:                 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:           1014 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:              4058 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:     no
CL_DEVICE_LOCAL_MEM_TYPE:               local
CL_DEVICE_LOCAL_MEM_SIZE:               16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:     64 KByte
CL_DEVICE_QUEUE_PROPERTIES:             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:             CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:          128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:         8
CL_DEVICE_SINGLE_FP_CONFIG:             CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
CL_DEVICE_2D_MAX_WIDTH:                 4096
CL_DEVICE_2D_MAX_HEIGHT:                32768
CL_DEVICE_3D_MAX_WIDTH:                 2048
CL_DEVICE_3D_MAX_HEIGHT:                2048
CL_DEVICE_3D_MAX_DEPTH:                 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:   CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
OpenCL
(Image from Khronos Group.)

Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
OpenCL Terms

✤   Kernel: Your custom function(s) that will run on the Device.
    (Unrelated to the OS kernel.)

✤   Device: something that computes, such as a GPU chip. (You probably
    have 1, but then you might have 0... or 2+.) Could also be your CPU!

✤   Platform: Container for all your devices. When running code on your
    local machine you'll only have one "platform" instance. (A cluster of
    GPU-heavy systems connected over a network would yield multiple
    available platforms, but OpenCL work in this area is not yet
    complete.)

✤   Device-specific terms: thread group, streaming multi-processor, etc.
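To make the Kernel term concrete, here is a sketch of what a ring-area kernel could look like. The OpenCL C source is held as a string, the way a host program (such as one using the barracuda gem) would hand it to the driver; the kernel text is illustrative and is not compiled here, only emulated in plain Ruby underneath.

```ruby
# Illustrative OpenCL C kernel source: a host program compiles this at
# runtime and launches one work-item per ring.
KERNEL_SOURCE = <<-CL
  __kernel void ring_area(__global float *areas) {
    int i = get_global_id(0) + 1;  /* ring #1 is the first work-item */
    areas[i - 1] = (M_PI_F * i * i) - (M_PI_F * (i - 1) * (i - 1));
  }
CL

# Plain-Ruby emulation of the same kernel: the loop index plays the
# role of get_global_id(0). On a GPU these iterations run concurrently.
def ring_areas_emulated(rings)
  (0...rings).map do |gid|
    i = gid + 1
    (Math::PI * i * i) - (Math::PI * (i - 1) * (i - 1))
  end
end

puts ring_areas_emulated(8).first # Area of ring #1 is π.
```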
Code!

✤   Ruby 1.9: single threaded.
✤   Ruby 1.9: multi-threaded on 1.9.
✤   Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).
✤   JRuby 1.6: single threaded.
✤   JRuby 1.6: multi-threaded.
✤   C: single threaded.
✤   C: native threads.
GPU Not-So-Pros

✤   Copying data in system RAM to/from GPU shared memory is not free. (In my
    own testing, often the slowest link in the chain.)

✤   SIMD can feel limiting. Benefits start to break down when you can't design
    conditions out of your algorithm. (E.g. "if(thread_number > 42) { x += y; }" will
    cause some threads to idle.)

✤   64-bit CPU does not imply 64-bit GPU! (All GPUs I've used are 32-bit or less.)

✤   Allocation limitations. Having 4GB of GPU shared memory does not necessarily
    mean you can allocate one giant block.

✤   Kernels are essentially written in C. May be difficult if you're new to pointers or
    memory management. (You can bind to higher-level languages, though.)
Q&A and bonus content.
Names/Keywords To Know

✤   NVidia (CUDA)

✤   ATI (owned by AMD) (Stream SDK)

✤   Khronos (drives the OpenCL specification.)

✤   Apple (ships OpenCL-capable drivers with all newer Macs running
    Snow Leopard.)
NVidia GeForce GT 330M

CL_DEVICE_NAME:                         GeForce GT 330M
CL_DEVICE_VENDOR:                       NVIDIA
CL_DRIVER_VERSION:                      CLH 1.0
CL_DEVICE_TYPE:                         CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:            6
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:     3
CL_DEVICE_MAX_WORK_ITEM_SIZES:          0 / 0 / 0
CL_DEVICE_MAX_WORK_GROUP_SIZE:          512
CL_DEVICE_MAX_CLOCK_FREQUENCY:          1100 MHz
CL_DEVICE_ADDRESS_BITS:                 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:           128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:              512 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:     no
CL_DEVICE_LOCAL_MEM_TYPE:               local
CL_DEVICE_LOCAL_MEM_SIZE:               16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:     64 KByte
CL_DEVICE_QUEUE_PROPERTIES:             CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:          128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:         8
CL_DEVICE_SINGLE_FP_CONFIG:             CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:                 0
CL_DEVICE_2D_MAX_HEIGHT:                0
CL_DEVICE_3D_MAX_WIDTH:                 0
CL_DEVICE_3D_MAX_HEIGHT:                0
CL_DEVICE_3D_MAX_DEPTH:                 0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:   CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
Intel i7 CPU M 640, 2.80GHz

CL_DEVICE_NAME:                         Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
CL_DEVICE_VENDOR:                       Intel
CL_DRIVER_VERSION:                      1.0
CL_DEVICE_TYPE:                         CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:            4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:     3
CL_DEVICE_MAX_WORK_ITEM_SIZES:          0 / 0 / 0
CL_DEVICE_MAX_WORK_GROUP_SIZE:          1
CL_DEVICE_MAX_CLOCK_FREQUENCY:          2800 MHz
CL_DEVICE_ADDRESS_BITS:                 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:           1536 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:              6144 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:     no
CL_DEVICE_LOCAL_MEM_TYPE:               global
CL_DEVICE_LOCAL_MEM_SIZE:               16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:     64 KByte
CL_DEVICE_QUEUE_PROPERTIES:             CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:          128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:         8
CL_DEVICE_SINGLE_FP_CONFIG:             CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:                 0
CL_DEVICE_2D_MAX_HEIGHT:                0
CL_DEVICE_3D_MAX_WIDTH:                 0
CL_DEVICE_3D_MAX_HEIGHT:                0
CL_DEVICE_3D_MAX_DEPTH:                 0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:   CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
Great GPU Use Cases

✤   3D rendering. (Obviously!)

✤   Simulation. (Folding@home actually has a
    GPU client.)

✤   Physics.

✤   Probability.

✤   Network reliability.

✤   General floating-point speed-ups.

✤   Augmentation of existing apps by offloading
    work to the GPU. Combining CPU/GPU
    paradigms is a perfectly valid approach.
A Few Simple Kernels
Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios.

Multi-Dimensional Threading

Threads can have an X, Y, and Z coordinate within their thread group,
which you use to determine who you are and can thus derive your
specific local parameters and minimize control flow changes.
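A kernel typically flattens its (x, y, z) coordinate into one linear index to pick the data element it owns. The arithmetic can be sketched in plain Ruby; the group dimensions here are illustrative, not from any particular device.

```ruby
# Flatten a 3D thread coordinate into a linear global index, the way a
# kernel derives which data element it owns from its position.
def global_index(x, y, z, dim_x, dim_y)
  x + (y * dim_x) + (z * dim_x * dim_y)
end

dim_x, dim_y, dim_z = 4, 4, 2

# Thread (1, 2, 1) in a 4x4x2 group owns element 1 + 2*4 + 1*16 = 25.
puts global_index(1, 2, 1, dim_x, dim_y) # => 25

# The last thread maps to the last element, so indices cover the group
# exactly once with no control flow needed.
puts global_index(dim_x - 1, dim_y - 1, dim_z - 1, dim_x, dim_y) # => 31
```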
Links

✤   OpenCL Programming Guide for OSX. (Really good!)
    http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html

✤   NVidia CUDA:
    http://www.nvidia.com/object/cuda_home_new.html

✤   ATI Stream:
    http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx

✤   OpenCL:
    http://www.khronos.org/opencl/
More Related Content

What's hot

User defined functions
User defined functionsUser defined functions
User defined functions
shubham_jangid
Ā 
Network vs. Code Metrics to Predict Defects: A Replication Study
Network vs. Code Metrics  to Predict Defects: A Replication StudyNetwork vs. Code Metrics  to Predict Defects: A Replication Study
Network vs. Code Metrics to Predict Defects: A Replication Study
Kim Herzig
Ā 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
ankita44
Ā 
Refactoring in AS3
Refactoring in AS3Refactoring in AS3
Refactoring in AS3
Eddie Kao
Ā 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick Reference
LiquidHub
Ā 
[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design
ģ¢…ė¹ˆ ģ˜¤
Ā 
3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery
Vu Tran Lam
Ā 
Oech03
Oech03Oech03
Oech03
fangjiafu
Ā 

What's hot (19)

Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
Ā 
DDS-20m
DDS-20mDDS-20m
DDS-20m
Ā 
December 7, Projects
December 7, ProjectsDecember 7, Projects
December 7, Projects
Ā 
Python basic
Python basic Python basic
Python basic
Ā 
NAS EP Algorithm
NAS EP Algorithm NAS EP Algorithm
NAS EP Algorithm
Ā 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
Ā 
User defined functions
User defined functionsUser defined functions
User defined functions
Ā 
Network vs. Code Metrics to Predict Defects: A Replication Study
Network vs. Code Metrics  to Predict Defects: A Replication StudyNetwork vs. Code Metrics  to Predict Defects: A Replication Study
Network vs. Code Metrics to Predict Defects: A Replication Study
Ā 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
Ā 
Lecture1 classes3
Lecture1 classes3Lecture1 classes3
Lecture1 classes3
Ā 
Refactoring in AS3
Refactoring in AS3Refactoring in AS3
Refactoring in AS3
Ā 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
Ā 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick Reference
Ā 
Clean code ch15
Clean code ch15Clean code ch15
Clean code ch15
Ā 
[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design[ShaderX5] 8 1 Postprocessing Effects In Design
[ShaderX5] 8 1 Postprocessing Effects In Design
Ā 
cluster(python)
cluster(python)cluster(python)
cluster(python)
Ā 
3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery3D & Animation Effects Using CSS3 & jQuery
3D & Animation Effects Using CSS3 & jQuery
Ā 
Invited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshopInvited talk: Second Search Computing workshop
Invited talk: Second Search Computing workshop
Ā 
Oech03
Oech03Oech03
Oech03
Ā 

Similar to Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

Using an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdfUsing an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdf
giriraj65
Ā 
Data structures notes for college students btech.pptx
Data structures notes for college students btech.pptxData structures notes for college students btech.pptx
Data structures notes for college students btech.pptx
KarthikVijay59
Ā 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
shreeaadithyaacellso
Ā 
C aptitude questions
C aptitude questionsC aptitude questions
C aptitude questions
Srikanth
Ā 
C - aptitude3
C - aptitude3C - aptitude3
C - aptitude3
Srikanth
Ā 
Chp4(ref dynamic)
Chp4(ref dynamic)Chp4(ref dynamic)
Chp4(ref dynamic)
Mohd Effandi
Ā 
check the modifed code now you will get all operations done.termin.pdf
check the modifed code now you will get all operations done.termin.pdfcheck the modifed code now you will get all operations done.termin.pdf
check the modifed code now you will get all operations done.termin.pdf
angelfragranc
Ā 
Unit 4
Unit 4Unit 4
Unit 4
siddr
Ā 

Similar to Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1 (20)

Computer networkppt4577
Computer networkppt4577Computer networkppt4577
Computer networkppt4577
Ā 
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyRubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Ā 
Permute
PermutePermute
Permute
Ā 
Unit 3
Unit 3 Unit 3
Unit 3
Ā 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
Ā 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
Ā 
Using an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdfUsing an Array include ltstdiohgt include ltmpih.pdf
Using an Array include ltstdiohgt include ltmpih.pdf
Ā 
Task based Programming with OmpSs and its Application
Task based Programming with OmpSs and its ApplicationTask based Programming with OmpSs and its Application
Task based Programming with OmpSs and its Application
Ā 
Permute
PermutePermute
Permute
Ā 
Data structures notes for college students btech.pptx
Data structures notes for college students btech.pptxData structures notes for college students btech.pptx
Data structures notes for college students btech.pptx
Ā 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
Ā 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
Ā 
C aptitude questions
C aptitude questionsC aptitude questions
C aptitude questions
Ā 
C - aptitude3
C - aptitude3C - aptitude3
C - aptitude3
Ā 
Chp4(ref dynamic)
Chp4(ref dynamic)Chp4(ref dynamic)
Chp4(ref dynamic)
Ā 
Recursion to iteration automation.
Recursion to iteration automation.Recursion to iteration automation.
Recursion to iteration automation.
Ā 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
Ā 
check the modifed code now you will get all operations done.termin.pdf
check the modifed code now you will get all operations done.termin.pdfcheck the modifed code now you will get all operations done.termin.pdf
check the modifed code now you will get all operations done.termin.pdf
Ā 
Clojure basics
Clojure basicsClojure basics
Clojure basics
Ā 
Unit 4
Unit 4Unit 4
Unit 4
Ā 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
Ā 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
Enterprise Knowledge
Ā 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Ā 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Ā 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Ā 

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

  • 1. @prestonism Ruby Supercomputing gmail: conmotto http://prestonlee.com http://github.com/preston Using The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/ Last Updated: March 17th, 2011. Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
  • 2. git@github.com:preston/ruby-gpu-examples.git Grab the code now if you want, but to run all the examples you'll need the NVIDIA development driver and toolkit, Ruby 1.9, the "barracuda" gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
  • 3. Let's find the area of each ring. http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
  • 4. Math. Yay! ✤ The inner-most ring is ring #1. ✤ Total area of ring #5 is π times the square of the radius (πrr). ✤ Area of only ring #5 is πrr minus the area of ring #4. ✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) ✤ Let's find the area of every ring...
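The per-ring formula can be sanity-checked: successive ring areas telescope, so rings 1 through R must sum back to the full circle's area, πR². A minimal Ruby check (the `ring_area` helper just mirrors the slide's formula):

```ruby
# Area of ring `radius` alone: full disc minus the disc one ring smaller.
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# The per-ring areas telescope: rings 1..5 sum to the area of a circle of radius 5.
total = (1..5).sum { |r| ring_area(r) }
```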
  • 5. 1st working attempt. (Single-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end (1..RINGS).each do |i| puts ring_area(i) end
  • 6. 2nd working attempt. (Multi-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end # Multi-thread it! (0..(NUM_THREADS - 1)).each do |i| rings_per_thread = RINGS / NUM_THREADS offset = i * rings_per_thread threads << Thread.new(rings_per_thread, offset) do |num, offset| last = offset + num - 1 (offset..(last)).each do |radius| ring_area(radius) end end end threads.each do |t| t.join end
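The slide omits the surrounding setup (`threads = []` and the two constants). A complete, runnable sketch with assumed values for RINGS and NUM_THREADS might look like this; note that on MRI the GIL keeps this CPU-bound loop effectively on one core:

```ruby
RINGS = 100_000  # assumed; the slides use 1,000,000
NUM_THREADS = 4  # assumed thread count

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

threads = []
(0..(NUM_THREADS - 1)).each do |i|
  rings_per_thread = RINGS / NUM_THREADS
  offset = i * rings_per_thread
  # Each thread computes a contiguous block of ring areas.
  threads << Thread.new(rings_per_thread, offset) do |num, off|
    last = off + num - 1
    (off..last).each { |radius| ring_area(radius) }
  end
end

threads.each(&:join)
```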
  • 7. 3rd/4th working attempt. (Single/Multi-threaded C. Ohhh crap...) The slide shows the full benchmark source in two columns; de-interleaved, it reads:

    /*
     * A baseline CPU-based benchmark program for CPU/GPU performance comparison.
     * Approximates the cross-sectional area of every tree ring in a tree trunk
     * in serial and in parallel by taking the total area at a given radius and
     * subtracting the area of the closest inner ring.
     * Copyright © 2011 Preston Lee. All rights reserved. http://prestonlee.com
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <pthread.h>
    #include <sys/time.h>

    #include "tree_rings.h"

    #define DEFAULT_RINGS 1000000
    #define NUM_THREADS 8
    #define DEBUG 0

    int acc = 0;

    int main(int argc, const char * argv[]) {
      int rings = DEFAULT_RINGS;
      if(argc > 1) {
        rings = atoi(argv[1]);
      }
      printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
      printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
      printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);
      printf("Number of tree rings: %i. Yay!\n", rings);

      struct timeval start, stop, diff;

      printf("\nRunning serial calculation using CPU...\t\t\t");
      gettimeofday(&start, NULL);
      calculate_ring_areas_in_serial(rings);
      gettimeofday(&stop, NULL);
      timeval_subtract(&diff, &stop, &start);
      printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

      printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
      gettimeofday(&start, NULL);
      calculate_ring_areas_in_parallel(rings);
      gettimeofday(&stop, NULL);
      timeval_subtract(&diff, &stop, &start);
      printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

      printf("\nDone!\n\n");
      return EXIT_SUCCESS;
    }

    /* Approximate the cross-sectional area between each pair of consecutive
       tree rings in serial. */
    void calculate_ring_areas_in_serial(int rings) {
      calculate_ring_areas_in_serial_with_offset(rings, 0);
    }

    void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
      int i;
      int offset = rings * thread;
      int max = rings + offset;
      float a = 0;
      for(i = offset; i < max; i++) {
        a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
      }
    }

    /* Approximate the cross-sectional area between each pair of consecutive
       tree rings in parallel on NUM_THREADS threads. */
    void calculate_ring_areas_in_parallel(int rings) {
      pthread_t threads[NUM_THREADS];
      int rc;
      int t;
      int rings_per_thread = rings / NUM_THREADS;
      ring_thread_data data[NUM_THREADS];

      for(t = 0; t < NUM_THREADS; t++) {
        data[t].rings = rings_per_thread;
        data[t].number = t;
        rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
        if (rc) {
          printf("ERROR; return code from pthread_create() is %d\n", rc);
          exit(-1);
        }
      }

      for(t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
      }
    }

    void ring_job(ring_thread_data * data) {
      calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
    }
  • 8. Speed: Your Primary Options 1. Pure Ruby in a big loop. (Single threaded.) 2. Pure Ruby, multi-threading smaller loops. (Limited to using a single core on 1.9 due to the GIL, but not on JRuby etc.) 3. C extension, single thread. 4. C extension, pthreads. 5. "Divide and conquer."
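Options 1 and 2 are easy to compare with the stdlib Benchmark module. This sketch (the thread count and ring count are arbitrary) typically shows little or no speedup on MRI, since the GIL serializes CPU-bound Ruby threads:

```ruby
require 'benchmark'

RINGS = 200_000  # arbitrary workload size for the comparison
THREADS = 4      # arbitrary thread count

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Option 1: one big loop.
single = Benchmark.realtime do
  (1..RINGS).each { |r| ring_area(r) }
end

# Option 2: the same work split across threads.
threaded = Benchmark.realtime do
  per = RINGS / THREADS
  ts = (0...THREADS).map do |i|
    Thread.new { ((i * per + 1)..((i + 1) * per)).each { |r| ring_area(r) } }
  end
  ts.each(&:join)
end

puts format("single: %.4fs threaded: %.4fs", single, threaded)
```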
  • 9.
  • 11.
  • 12. CPU Concurrency ✤ Ideally, asynchronous tasks across multiple physical and/or logical cores. ✤ POSIX threading is the standard. ✤ Producer/Consumer pattern typically used to account for differences in machine execution time. ✤ Concurrency generally implemented with Time-Division Multiplexing. A CPU with 4 cores can run 100 threads, but the OS will time-share execution time, more-or-less fairly. ✤ MIMD: Multiple Instruction Multiple Data
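The Producer/Consumer pattern mentioned above can be sketched with Ruby's thread-safe Queue; the work items, sentinel value, and consumer count here are illustrative only:

```ruby
jobs    = Queue.new  # thread-safe FIFO from Ruby core
results = Queue.new

# Consumers pull radii until they see the :done sentinel.
consumers = Array.new(4) do
  Thread.new do
    while (radius = jobs.pop) != :done
      results << (Math::PI * radius**2) - (Math::PI * (radius - 1)**2)
    end
  end
end

# Producer: enqueue the work, then one sentinel per consumer.
(1..100).each { |r| jobs << r }
4.times { jobs << :done }
consumers.each(&:join)
```

Because `Queue#pop` blocks, consumers idle cheaply when producers fall behind, which is exactly how the pattern absorbs differences in execution time.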
  • 13. Common CPU Issues ✤ Sometimes we need insane numbers of threads per host. ✤ Lock performance. ✤ Potential for excessive virtual memory swapping. ✤ Many tests are non-deterministic. ✤ Concurrency modeling is difficult to get correct.
  • 14. Can't we just execute every instruction concurrently, but with different data? (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 15. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 16. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Multiple Instruction, Multiple Data. (MIMD)
  • 17.
  • 18. Yes! (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Single Instruction, Multiple Data. (SIMD)
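A CPU-side analogy only (this is not real SIMD hardware): `map` applies one identical instruction stream to every data point, which is exactly the shape SIMD execution wants.

```ruby
radii = [1, 2, 3, 4, 5]

# Same instruction, different data: every element runs the identical formula.
areas = radii.map { |r| (Math::PI * r**2) - (Math::PI * (r - 1)**2) }
```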
  • 19. GPU Brief History ✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently. ✤ Consider a 1024x768 display. That's 786,432 pixels updating 60 times per second. 786,432 pixels @ 60Hz => 47,185,920 potential pixel pushes per second! ✤ Running 786,432 threads makes the OS unhappy. :( ✤ Sometimes it'd be great to have all calculations finish simultaneously. If you're updating every pixel in a display, each can be computed concurrently. Exactly concurrently. ✤ Generally fast floating point. (Critical for scientific computing.) ✤ SIMD: Single Instruction Multiple Data
  • 20. Hardware Examples Images from NVIDIA and ATI.
  • 29. GPU Pros ✤ SIMD architecture can run thousands of threads concurrently. ✤ Many synchronization issues are designed out. ✤ More FLOPS (floating point operations per second) than host CPU. ✤ Synchronously execute the same instruction for different data points.
  • 30. NVidia Tesla C1060 CL_DEVICE_NAME: Tesla C1060 CL_DEVICE_VENDOR: NVIDIA Corporation CL_DRIVER_VERSION: 260.81 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 30 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64 CL_DEVICE_MAX_WORK_GROUP_SIZE: 512 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1014 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 4058 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_DEVICE_2D_MAX_WIDTH 4096 CL_DEVICE_2D_MAX_HEIGHT 32768 CL_DEVICE_3D_MAX_WIDTH 2048 CL_DEVICE_3D_MAX_HEIGHT 2048 CL_DEVICE_3D_MAX_DEPTH 2048 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  • 31. OpenCL (Image from Khronos Group.) Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
  • 32. OpenCL Terms ✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.) ✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU! ✤ Platform: Container for all your devices. When running code on your local machine you'll only have one "platform" instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.) ✤ Device-specific terms: thread group, streaming multi-processor, etc.
  • 33. Code! ✤ Ruby 1.9: single threaded. ✤ Ruby 1.9: multi-threaded on 1.9. ✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings). ✤ JRuby 1.6: single threaded. ✤ JRuby 1.6: multi-threaded. ✤ C: single threaded. ✤ C: native threads.
  • 34.
  • 35. GPU Not-So-Pros ✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.) ✤ SIMD can feel limiting. Benefits start to break down when you can't design conditions out of your algorithm. (E.g. "if(thread_number > 42) { x += y; }" will cause some threads to idle.) ✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I've used are 32-bit or less.) ✤ Allocation limitations. Having 4GB of GPU shared memory does not necessarily mean you can allocate one giant block. ✤ Kernels are essentially written in C. May be difficult if you're new to pointers or memory management. (You can bind to higher-level languages, though.)
  • 36. Q&A and bonus content.
  • 37. Names/Keywords To Know ✤ NVidia (CUDA) ✤ ATI (Owned by AMD) (Stream SDK) ✤ Khronos (Drives the OpenCL specification.) ✤ Apple (ships OpenCL-capable drivers with all newer Macs running Snow Leopard.)
  • 38. NVidia GeForce GT 330M CL_DEVICE_NAME: GeForce GT 330M CL_DEVICE_VENDOR: NVIDIA CL_DRIVER_VERSION: CLH 1.0 CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS: 6 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0 CL_DEVICE_MAX_WORK_GROUP_SIZE: 512 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_DEVICE_2D_MAX_WIDTH 0 CL_DEVICE_2D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_WIDTH 0 CL_DEVICE_3D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_DEPTH 0 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
  • 39. Intel i7 CPU M 640, 2.80GHz CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz CL_DEVICE_VENDOR: Intel CL_DRIVER_VERSION: 1.0 CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU CL_DEVICE_MAX_COMPUTE_UNITS: 4 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1 CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHz CL_DEVICE_ADDRESS_BITS: 64 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: global CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_IMAGE_SUPPORT: 1 CL_DEVICE_MAX_READ_IMAGE_ARGS: 128 CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8 CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_DEVICE_2D_MAX_WIDTH 0 CL_DEVICE_2D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_WIDTH 0 CL_DEVICE_3D_MAX_HEIGHT 0 CL_DEVICE_3D_MAX_DEPTH 0 CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  • 40. Great GPU Use Cases ✤ 3D rendering. (Obviously!) ✤ Simulation. (Folding@home actually has a GPU client.) ✤ Physics. ✤ Probability. ✤ Network reliability. ✤ General floating-point speed-ups. ✤ Augmentation of existing apps by offloading work to the GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
  • 41. A Few Simple Kernels Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios.
  • 42. Multi-Dimensional Threading Threads can have an X, Y, and Z coordinate within their thread group, which you use to determine who you are and can thus derive your specific local parameters and minimize control flow changes.
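A sketch of that index arithmetic, in Ruby and purely illustrative (a real GPU kernel gets these coordinates from built-ins such as OpenCL's get_global_id): each thread turns its (x, y) coordinate into a flat offset to select its own data, with no branching.

```ruby
WIDTH = 1024  # assumed row width of the data grid

# Map a 2D thread coordinate to a flat array index, as a kernel would.
def flat_index(x, y)
  y * WIDTH + x
end
```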
  • 43. Links ✤ OpenCL Programming Guide for OSX. (Really good!) http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html ✤ NVidia CUDA: http://www.nvidia.com/object/cuda_home_new.html ✤ ATI Stream: http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx ✤ OpenCL: http://www.khronos.org/opencl/

Editor's Notes

  26. Jump to Eclipse device query.\n
  29. Jump to demo!\n