Ruby Supercomputing
Using The GPU For Massive Performance Speedup
Last Updated: March 17th, 2011.
Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University

@prestonism
gmail: conmotto
http://prestonlee.com
http://github.com/preston
http://www.slideshare.net/preston.lee/
git@github.com:preston/ruby-gpu-examples.git

Grab the code now if you want, but to run all the examples you’ll need the NVIDIA
development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the
barracuda gem) on a multi-core OS X Snow Leopard system.
This takes time to set up, so maybe just chillax for now?
Let’s find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
Math. Yay!

✤   The inner-most ring is ring #1.

✤   Total area out to ring #5 is π times
    the square of the radius. (πrr)

✤   Area of only ring #5 is πrr
    minus the total area out to ring #4.
    (Each ring is one unit wide, so ring #4 ends at radius - 1.)

✤   (Math::PI * (radius ** 2)) -
    (Math::PI * ((radius - 1) ** 2))

✤   Let’s find the area of every ring...
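1st working attempt.
(Single-threaded Ruby.)

  RINGS = 1_000_000 # same default as DEFAULT_RINGS in the C version

  # Area of one ring: the disc out to this radius minus the disc inside it.
  def ring_area(radius)
    (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  (1..RINGS).each do |i|
    puts ring_area(i)
  end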
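A quick sanity check on the formula, as a sketch: expanding πrr - π(r - 1)(r - 1) gives π(2r - 1), and the areas of rings 1 through n should add back up to the full disc, πnn. The helper names `ring_area_closed_form` and `total_area` and the small ring count are illustrative assumptions, not part of the original examples.

```ruby
# Area of a single ring, as in the slide above.
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper: the same area after expanding the squares,
# pi*r^2 - pi*(r-1)^2 = pi*(2r - 1).
def ring_area_closed_form(radius)
  Math::PI * (2 * radius - 1)
end

# Hypothetical helper: sum of the areas of rings 1..rings.
def total_area(rings)
  (1..rings).inject(0.0) { |sum, radius| sum + ring_area(radius) }
end

rings = 10 # assumed small count so the output stays readable
(1..rings).each do |r|
  diff = (ring_area(r) - ring_area_closed_form(r)).abs
  puts "ring #{r}: #{ring_area(r)} (difference from closed form: #{diff})"
end
puts "sum of rings: #{total_area(rings)}"
puts "full disc:    #{Math::PI * rings ** 2}"
```

The two totals should agree up to floating-point rounding, which is a cheap way to validate any faster implementation against the simple one.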
2nd working attempt.
(Multi-threaded Ruby.)

  RINGS = 1_000_000 # same default as DEFAULT_RINGS in the C version
  NUM_THREADS = 8   # same value as NUM_THREADS in the C version

  def ring_area(radius)
    (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  # Multi-thread it!
  threads = []
  (0..(NUM_THREADS - 1)).each do |i|
    rings_per_thread = RINGS / NUM_THREADS
    offset = i * rings_per_thread
    # Pass the slice bounds in as thread arguments so each thread gets its own copy.
    threads << Thread.new(rings_per_thread, offset) do |num, first|
      last = first + num - 1
      (first..last).each do |radius|
        ring_area(radius)
      end
    end
  end
  threads.each do |t| t.join end
3rd/4th working attempt.
(Single/Multi-threaded C. Ohhh crap...)

  /*
    A baseline CPU-based benchmark program for CPU/GPU performance comparison.
    Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
    by taking the total area at a given radius and subtracting the area of the closest inner ring.
    Copyright © 2011 Preston Lee. All rights reserved.
    http://prestonlee.com
  */

  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>
  #include <pthread.h>
  #include <sys/time.h>

  #include "tree_rings.h"

  #define DEFAULT_RINGS 1000000
  #define NUM_THREADS 8
  #define DEBUG 0

  int acc = 0;

  int main(int argc, const char * argv[]) {
    int rings = DEFAULT_RINGS;
    if(argc > 1) {
      rings = atoi(argv[1]);
    }

    printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
    printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
    printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);

    printf("Number of tree rings: %i. Yay!\n", rings);

    struct timeval start, stop, diff;

    printf("\nRunning serial calculation using CPU...\t\t\t");
    gettimeofday(&start, NULL);
    calculate_ring_areas_in_serial(rings);
    gettimeofday(&stop, NULL);
    timeval_subtract(&diff, &stop, &start);
    printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

    printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
    gettimeofday(&start, NULL);
    calculate_ring_areas_in_parallel(rings);
    gettimeofday(&stop, NULL);
    timeval_subtract(&diff, &stop, &start);
    printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

    printf("\nDone!\n\n");
    return EXIT_SUCCESS;
  }

  /* Approximate the cross-sectional area between each pair of consecutive tree rings
     in serial */
  void calculate_ring_areas_in_serial(int rings) {
    calculate_ring_areas_in_serial_with_offset(rings, 0);
  }

  void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
    int i;
    int offset = rings * thread;
    int max = rings + offset;
    float a = 0;
    for(i = offset; i < max; i++) {
      a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
    }
  }

  /* Approximate the cross-sectional area between each pair of consecutive tree rings
     in parallel on NUM_THREADS threads */
  void calculate_ring_areas_in_parallel(int rings) {
    pthread_t threads[NUM_THREADS];
    int rc;
    int t;
    int rings_per_thread = rings / NUM_THREADS;
    ring_thread_data data[NUM_THREADS];

    for(t = 0; t < NUM_THREADS; t++){
      data[t].rings = rings_per_thread;
      data[t].number = t;
      rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
      if (rc){
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
      }
    }

    for(t = 0; t < NUM_THREADS; t++){
      pthread_join(threads[t], NULL);
    }
  }

  void ring_job(ring_thread_data * data) {
    calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
  }
Speed: Your Primary Options

1. Pure Ruby in a big loop. (Single
   threaded.)

2. Pure Ruby, multi-threading
   smaller loops. (Limited to using
   a single core on Ruby 1.9 due to the
   GIL, but not on JRuby etc.)

3. C extension, single thread.

4. C extension, pthreads.

5. “Divide and conquer.”
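Options 1 and 2 can be timed side by side with Ruby's standard `Benchmark` module. This is a minimal sketch, not the code from the repository: the `RINGS` and `NUM_THREADS` values and the helper names `sum_serial` and `sum_threaded` are assumptions, and each worker returns a partial sum so the two approaches can be checked against each other. On Ruby 1.9 the threaded version should not be expected to run faster, because of the GIL; on JRuby it may.

```ruby
require 'benchmark'

RINGS = 200_000  # assumed problem size for a quick run
NUM_THREADS = 4  # assumed; RINGS must divide evenly by this

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Option 1: one big loop on the calling thread.
def sum_serial(rings)
  (1..rings).inject(0.0) { |sum, radius| sum + ring_area(radius) }
end

# Option 2: split the radii into equal slices, one slice per thread.
def sum_threaded(rings, num_threads)
  per_thread = rings / num_threads
  threads = (0...num_threads).map do |i|
    first = i * per_thread + 1
    last  = first + per_thread - 1
    Thread.new(first, last) do |from, to|
      (from..to).inject(0.0) { |sum, radius| sum + ring_area(radius) }
    end
  end
  # Thread#value joins the thread and returns the block's result.
  threads.map { |t| t.value }.inject(0.0) { |sum, part| sum + part }
end

serial_time   = Benchmark.realtime { sum_serial(RINGS) }
threaded_time = Benchmark.realtime { sum_threaded(RINGS, NUM_THREADS) }
puts "serial:   #{serial_time} seconds"
puts "threaded: #{threaded_time} seconds"
```

Running the same file under Ruby 1.9 and under JRuby is the simplest way to see the effect of the GIL on option 2.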
Ruby 1.9
CPU Concurrency

✤   Ideally, asynchronous tasks across multiple physical and/or logical
    cores.

✤   POSIX threading is the standard.

✤   Producer/Consumer pattern typically used to account for differences
    in machine execution time.

✤   Concurrency generally implemented with Time-Division
    Multiplexing. A CPU with 4 cores can run 100 threads, but the OS
    will time-share execution time, more-or-less fairly.

✤   MIMD: Multiple Instruction Multiple Data
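The producer/consumer pattern mentioned above can be sketched with Ruby's thread-safe `Queue`. This is an illustrative example rather than anything from the talk's repository: the worker count, the `:done` stop marker and the `areas_via_queue` helper are all assumptions. One producer pushes radii, and each consumer pulls the next radius whenever it is free, so faster workers naturally take on more of the work.

```ruby
require 'thread' # Queue lives here on Ruby 1.9; newer Rubies load it by default

NUM_WORKERS = 4 # assumed worker count
RINGS = 1_000   # assumed problem size

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper: one producer fills a queue with radii and
# several consumers pull work as fast as each one can manage.
def areas_via_queue(rings, num_workers)
  work    = Queue.new
  results = Queue.new

  producer = Thread.new do
    (1..rings).each { |radius| work << radius }
    num_workers.times { work << :done } # one stop marker per consumer
  end

  consumers = (1..num_workers).map do
    Thread.new do
      while (radius = work.pop) != :done
        results << [radius, ring_area(radius)]
      end
    end
  end

  producer.join
  consumers.each { |t| t.join }

  # Drain the results; arrival order depends on thread scheduling, so sort.
  collected = []
  collected << results.pop until results.empty?
  collected.sort_by { |radius, _area| radius }
end

areas = areas_via_queue(RINGS, NUM_WORKERS)
puts "computed #{areas.size} ring areas; ring 5 has area #{areas[4][1]}"
```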
Common CPU Issues


✤   Sometimes we need insane numbers of threads per host.

✤   Lock performance.

✤   Potential for excessive virtual memory swapping.

✤   Many tests are non-deterministic.

✤   Concurrency modeling is difficult to get correct.
Can’t we just execute every instruction concurrently, but with different data?
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
Multiple Instruction, Multiple Data. (MIMD)
Yes!
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
Single Instruction, Multiple Data. (SIMD)
GPU Brief History

✤   Graphics Processing Units were initially developed for 3D scene rendering, such as computing
    luminosity values of each pixel on the screen concurrently.

✤   Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixels
    @ 60Hz = 47,185,920 potential pixel pushes per second!

✤   Running 786,432 threads makes the OS unhappy. :(

✤   Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every
    pixel in a display, each can be computed concurrently. Exactly concurrently.

✤   Generally fast floating point. (Critical for scientific computing.)

✤   SIMD: Single Instruction Multiple Data
Hardware Examples




         Images from NVIDIA and ATI.
GPU Pros


✤   SIMD architecture can run thousands of threads concurrently.

✤   Many synchronization issues are designed out.

✤   More FLOPS (floating point operations per second) than host CPU.

✤   Synchronously execute the same instruction for different data points.
NVIDIA Tesla C1060
CL_DEVICE_NAME:                       Tesla C1060
CL_DEVICE_VENDOR:                     NVIDIA Corporation
CL_DRIVER_VERSION:                    260.81
CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:          30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3
CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:        512
CL_DEVICE_MAX_CLOCK_FREQUENCY:        1296 MHz
CL_DEVICE_ADDRESS_BITS:               32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:         1014 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:            4058 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no
CL_DEVICE_LOCAL_MEM_TYPE:             local
CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte
CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:              1
CL_DEVICE_MAX_READ_IMAGE_ARGS:        128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8
CL_DEVICE_SINGLE_FP_CONFIG:           CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
CL_DEVICE_2D_MAX_WIDTH:               4096
CL_DEVICE_2D_MAX_HEIGHT:              32768
CL_DEVICE_3D_MAX_WIDTH:               2048
CL_DEVICE_3D_MAX_HEIGHT:              2048
CL_DEVICE_3D_MAX_DEPTH:               2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
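The gap between CL_DEVICE_MAX_MEM_ALLOC_SIZE (1014 MByte) and CL_DEVICE_GLOBAL_MEM_SIZE (4058 MByte) matters in practice: a data set can fit on the card and still be too large for a single buffer. The sketch below shows one way to plan around that in plain Ruby. It assumes "MByte" means 2^20 bytes and that the data is single-precision floats; the `allocation_chunks` helper and the sample size are hypothetical.

```ruby
MAX_ALLOC_BYTES = 1014 * 1024 * 1024 # CL_DEVICE_MAX_MEM_ALLOC_SIZE from the listing above (assumes MByte = 2^20 bytes)
BYTES_PER_FLOAT = 4                  # single-precision float

# Hypothetical helper: split `count` floats into [offset, length] chunks
# that each fit inside one allocation of at most `max_bytes`.
def allocation_chunks(count, max_bytes = MAX_ALLOC_BYTES)
  per_chunk = max_bytes / BYTES_PER_FLOAT
  chunks = []
  offset = 0
  while offset < count
    length = [per_chunk, count - offset].min
    chunks << [offset, length]
    offset += length
  end
  chunks
end

floats = 600_000_000 # about 2.2 GiB: fits in global memory, but not in one buffer
allocation_chunks(floats).each do |offset, length|
  puts "buffer at offset #{offset}: #{length} floats"
end
```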
OpenCL
(Image from Khronos Group.)

Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
OpenCL Terms

✤   Kernel: Your custom function(s) that will run on the Device.
    (Unrelated to the OS kernel.)

✤   Device: something that computes, such as a GPU chip. (You probably
    have 1, but then you might have 0... or 2+.) Could also be your CPU!

✤   Platform: Container for all your devices. When running code on your
    local machine you’ll only have one “platform” instance. (A cluster of
    GPU-heavy systems connected over a network would yield multiple
    available platforms, but OpenCL work in this area is not yet
    complete.)

✤   Device-specific terms: thread group, streaming multi-processor, etc.
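To make the kernel idea concrete, here is a plain-Ruby emulation. It is not real OpenCL and not the barracuda API: the "kernel" is a lambda that receives a work-item id and handles exactly one ring, and a hypothetical `run_kernel` helper plays the role of the device by calling it once per id. On a GPU those calls would run concurrently; here they run one after another.

```ruby
# The "kernel": every work-item runs this same code on a different element.
# global_id plays the role of OpenCL's get_global_id(0).
RING_KERNEL = lambda do |global_id, output|
  radius = global_id + 1 # ids start at 0, rings start at 1
  output[global_id] = (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical stand-in for the device: launch one work-item per output slot.
def run_kernel(kernel, global_size)
  output = Array.new(global_size, 0.0)
  global_size.times { |global_id| kernel.call(global_id, output) }
  output
end

areas = run_kernel(RING_KERNEL, 8)
areas.each_with_index { |area, i| puts "ring #{i + 1}: #{area}" }
```

Note that the kernel contains no loop over rings: the loop lives in the launcher, which is the part a GPU replaces with parallel hardware.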
Code!

✤   Ruby 1.9: single threaded.
✤   Ruby 1.9: multi-threaded.
✤   Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).
✤   JRuby 1.6: single threaded.
✤   JRuby 1.6: multi-threaded.
✤   C: single threaded.
✤   C: native threads.
GPU Not-So-Pros

✤   Copying data in system RAM to/from GPU memory is not free. (In my
    own testing, often the slowest link in the chain.)

✤   SIMD can feel limiting. Benefits start to break down when you can’t design
    conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will
    cause some threads to idle.)

✤   64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)

✤   Allocation limitations. Having 4GB of GPU memory does not necessarily
    mean you can allocate one giant block.

✤   Kernels are essentially written in C. May be difficult if you’re new to pointers or
    memory management. (You can bind to higher-level languages, though.)
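One common way to design a condition out, sketched here in plain Ruby rather than kernel C: turn the test into a 0-or-1 factor and multiply, so every thread executes the same instructions. The helper names and sample values are illustrative assumptions, the mask trick as written only works for integer thread numbers, and whether it actually wins on a given GPU depends on the hardware and compiler.

```ruby
THRESHOLD = 42 # from the example condition in the list above

# Branching version: threads at or below the threshold skip the add.
def update_with_branch(thread_number, x, y)
  x += y if thread_number > THRESHOLD
  x
end

# Branch-free version: clamp (thread_number - THRESHOLD) into 0..1 to get
# a mask, then have every thread perform the same multiply-add.
def update_branch_free(thread_number, x, y)
  mask = [[thread_number - THRESHOLD, 0].max, 1].min
  x + (y * mask)
end

(40..45).each do |n|
  puts "thread #{n}: branch=#{update_with_branch(n, 10, 5)} branch_free=#{update_branch_free(n, 10, 5)}"
end
```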
Q&A and bonus content.
Names/Keywords To Know


✤   NVIDIA (CUDA)

✤   ATI (Owned by AMD) (Stream SDK)

✤   Khronos Group (Drives OpenCL specification.)

✤   Apple (ships OpenCL-capable drivers with all newer Macs running
    Snow Leopard.)
NVIDIA GeForce GT 330M
CL_DEVICE_NAME:                       GeForce GT 330M
CL_DEVICE_VENDOR:                     NVIDIA
CL_DRIVER_VERSION:                    CLH 1.0
CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:          6
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3
CL_DEVICE_MAX_WORK_ITEM_SIZES:        0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:        512
CL_DEVICE_MAX_CLOCK_FREQUENCY:        1100 MHz
CL_DEVICE_ADDRESS_BITS:               32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:            512 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no
CL_DEVICE_LOCAL_MEM_TYPE:             local
CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte
CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:              1
CL_DEVICE_MAX_READ_IMAGE_ARGS:        128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8
CL_DEVICE_SINGLE_FP_CONFIG:           CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:               0
CL_DEVICE_2D_MAX_HEIGHT:              0
CL_DEVICE_3D_MAX_WIDTH:               0
CL_DEVICE_3D_MAX_HEIGHT:              0
CL_DEVICE_3D_MAX_DEPTH:               0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
Intel i7 CPU M 640, 2.80GHz
CL_DEVICE_NAME:                       Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
CL_DEVICE_VENDOR:                     Intel
CL_DRIVER_VERSION:                    1.0
CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:          4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3
CL_DEVICE_MAX_WORK_ITEM_SIZES:        0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:        1
CL_DEVICE_MAX_CLOCK_FREQUENCY:        2800 MHz
CL_DEVICE_ADDRESS_BITS:               64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:         1536 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:            6144 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no
CL_DEVICE_LOCAL_MEM_TYPE:             global
CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte
CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:              1
CL_DEVICE_MAX_READ_IMAGE_ARGS:        128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8
CL_DEVICE_SINGLE_FP_CONFIG:           CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:               0
CL_DEVICE_2D_MAX_HEIGHT:              0
CL_DEVICE_3D_MAX_WIDTH:               0
CL_DEVICE_3D_MAX_HEIGHT:              0
CL_DEVICE_3D_MAX_DEPTH:               0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
Great GPU Use Cases

✤   3D rendering. (Obviously!)

✤   Simulation. (Folding@home actually has a GPU client.)

✤   Physics.

✤   Probability.

✤   Network reliability.

✤   General floating-point speed-ups.

✤   Augmentation of existing apps by offloading work to GPU. Combining
    CPU/GPU paradigms is a perfectly valid approach.
A Few Simple Kernels
Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios

Multi-Dimensional Threading
Threads can have an X, Y, and Z coordinate within their thread group, which you
use to determine who you are and can thus derive your specific local parameters
and minimize control flow changes.
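As a sketch of how coordinates replace control flow, the plain-Ruby emulation below gives every work-item an (x, y) pair and lets it compute its own pixel index and value from those alone. The grid size, the `pixel_kernel` name and the stand-in pixel formula are illustrative assumptions; in a real OpenCL kernel the coordinates would come from `get_global_id(0)` and `get_global_id(1)` instead of method arguments.

```ruby
WIDTH  = 4 # assumed tiny "display" so the output is readable
HEIGHT = 3

# Each work-item knows only its own (x, y); no loops or ifs are needed
# to decide which pixel it owns.
def pixel_kernel(x, y, framebuffer)
  index = y * WIDTH + x           # row-major position of this pixel
  framebuffer[index] = x * 10 + y # stand-in for a real luminosity calculation
end

# Emulate the launch: one call per coordinate (concurrent on a real GPU).
def render
  framebuffer = Array.new(WIDTH * HEIGHT)
  HEIGHT.times do |y|
    WIDTH.times { |x| pixel_kernel(x, y, framebuffer) }
  end
  framebuffer
end

render.each_slice(WIDTH) { |row| puts row.inspect }
```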
Links


✤   OpenCL Programming Guide for OSX. (Really good!)
    http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html



✤   NVIDIA CUDA:
    http://www.nvidia.com/object/cuda_home_new.html


✤   ATI Stream:
    http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx


✤   OpenCL:
    http://www.khronos.org/opencl/

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Recently uploaded (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

  • 1. @prestonism Ruby Supercomputing gmail: conmotto http://prestonlee.com http://github.com/preston Using The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/ Last Updated: March 17th, 2011. Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
  • 2. git@github.com:preston/ruby-gpu-examples.git Grab the code now if you want, but to run all the examples you’ll need the NVIDIA development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
  • 3. Let’s find the area of each ring. http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
  • 4. Math. Yay! ✤ The inner-most ring is ring #1. ✤ Total area of ring #5 is π times the square of the radius. (πrr) ✤ Area of only ring #5 is πrr minus area of ring #4. ✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) ✤ Let’s find the area of every ring...
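The subtraction on this slide simplifies algebraically: πr² − π(r − 1)² = π(2r − 1). The following is a minimal Ruby sketch, not taken from the slides, that checks this shortcut against the slide's formula. The names `ring_area_closed_form` and `total_area` are hypothetical helpers introduced only for illustration.

```ruby
# Slide formula: area of the disc at this radius minus the disc one ring in.
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper (not in the slides): the same quantity after expanding
# r**2 - (r - 1)**2 = 2r - 1.
def ring_area_closed_form(radius)
  Math::PI * (2 * radius - 1)
end

# Hypothetical helper: summing rings 1..rings should give back the area of
# the whole disc, pi * rings**2.
def total_area(rings)
  (1..rings).inject(0.0) { |sum, r| sum + ring_area(r) }
end

(1..5).each do |r|
  puts "ring #{r}: #{ring_area(r)} vs #{ring_area_closed_form(r)}"
end
```

The closed form needs one multiply per ring instead of two powers and a subtraction, which matters little in Ruby but is the kind of simplification worth doing before moving a formula into a kernel.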
  • 5. 1st working attempt. (Single-threaded Ruby.)

      def ring_area(radius)
        (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
      end

      (1..RINGS).each do |i|
        puts ring_area(i)
      end
  • 6. 2nd working attempt. (Multi-threaded Ruby.)

      def ring_area(radius)
        (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
      end

      # Multi-thread it!
      threads = []
      (0..(NUM_THREADS - 1)).each do |i|
        rings_per_thread = RINGS / NUM_THREADS
        offset = i * rings_per_thread
        threads << Thread.new(rings_per_thread, offset) do |num, start|
          # Rings are numbered from 1, so this thread covers start+1 .. start+num.
          first = start + 1
          last = start + num
          (first..last).each do |radius|
            ring_area(radius)
          end
        end
      end

      threads.each do |t|
        t.join
      end
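The slide's loop assumes RINGS divides evenly by NUM_THREADS and throws the computed areas away. The following sketch, under those stated assumptions, shows one way to cover a remainder and collect the results. `ring_slices` is a hypothetical helper, and the RINGS and NUM_THREADS values are chosen only for the example.

```ruby
RINGS = 10
NUM_THREADS = 4

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper: split 1..rings into at most num_threads contiguous
# ranges, spreading any remainder over the first ranges so no ring is dropped.
def ring_slices(rings, num_threads)
  base, extra = rings.divmod(num_threads)
  first = 1
  slices = []
  num_threads.times do |i|
    size = base + (i < extra ? 1 : 0)
    next if size.zero?
    slices << (first..(first + size - 1))
    first += size
  end
  slices
end

threads = ring_slices(RINGS, NUM_THREADS).map do |slice|
  Thread.new(slice) { |radii| radii.map { |r| ring_area(r) } }
end

# Thread#value joins the thread and returns the value of its block.
areas = threads.map { |t| t.value }.flatten
puts areas.length
```

Passing the slice as an argument to `Thread.new` gives each thread its own copy of the range reference, so no shared state is touched until the results are gathered.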
  • 7. 3rd/4th working attempt. (Single/Multi-threaded C. Ohhh crap...)

      /*
       A baseline CPU-based benchmark program for CPU/GPU performance comparison.
       Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
       by taking the total area at a given radius and subtracting the area of the closest inner ring.

       Copyright © 2011 Preston Lee. All rights reserved.
       http://prestonlee.com
       */

      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>
      #include <pthread.h>
      #include <sys/time.h>

      #include "tree_rings.h"

      #define DEFAULT_RINGS 1000000
      #define NUM_THREADS 8
      #define DEBUG 0

      int acc = 0;

      int main(int argc, const char * argv[]) {
        int rings = DEFAULT_RINGS;
        if(argc > 1) {
          rings = atoi(argv[1]);
        }

        printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
        printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
        printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);

        printf("Number of tree rings: %i. Yay!\n", rings);

        struct timeval start, stop, diff;

        printf("\nRunning serial calculation using CPU...\t\t\t");
        gettimeofday(&start, NULL);
        calculate_ring_areas_in_serial(rings);
        gettimeofday(&stop, NULL);
        timeval_subtract(&diff, &stop, &start);
        printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

        printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
        gettimeofday(&start, NULL);
        calculate_ring_areas_in_parallel(rings);
        gettimeofday(&stop, NULL);
        timeval_subtract(&diff, &stop, &start);
        printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

        printf("\nDone!\n\n");
        return EXIT_SUCCESS;
      }

      /* Approximate the cross-sectional area between each pair of consecutive tree rings in serial */
      void calculate_ring_areas_in_serial(int rings) {
        calculate_ring_areas_in_serial_with_offset(rings, 0);
      }

      void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
        int i;
        int offset = rings * thread;
        int max = rings + offset;
        float a = 0;
        for(i = offset; i < max; i++) {
          a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
        }
      }

      /* Approximate the cross-sectional area between each pair of consecutive tree rings
         in parallel on NUM_THREADS threads */
      void calculate_ring_areas_in_parallel(int rings) {
        pthread_t threads[NUM_THREADS];
        int rc;
        int t;
        int rings_per_thread = rings / NUM_THREADS;
        ring_thread_data data[NUM_THREADS];

        for(t = 0; t < NUM_THREADS; t++){
          data[t].rings = rings_per_thread;
          data[t].number = t;
          rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
          if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
          }
        }

        for(t = 0; t < NUM_THREADS; t++){
          pthread_join(threads[t], NULL);
        }
      }

      void ring_job(ring_thread_data * data) {
        calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
      }
  • 8. Speed: Your Primary Options 1. Pure Ruby in a big loop. (Single threaded.) 2. Pure Ruby, multi-threading smaller loops. (Limited to using a single core on 1.9 due to the GIL, but not on JRuby etc.) 3. C extension, single thread. 4. C extension, pthreads. 5. “Divide and conquer.”
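Options 1 and 2 can be timed side by side with Ruby's standard Benchmark module. This is a rough sketch rather than the deck's benchmark harness: `run_serial`, `run_threaded`, and the RINGS and NUM_THREADS values are assumptions made for the example, and no particular timings are claimed.

```ruby
require 'benchmark'

RINGS = 200_000
NUM_THREADS = 4

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Option 1: one big loop on a single thread.
def run_serial(rings)
  (1..rings).each { |r| ring_area(r) }
end

# Option 2: the same work split into smaller loops, one per thread.
# Assumes rings divides evenly by num_threads.
def run_threaded(rings, num_threads)
  per_thread = rings / num_threads
  threads = (0...num_threads).map do |i|
    first = i * per_thread + 1
    Thread.new(first, first + per_thread - 1) do |lo, hi|
      (lo..hi).each { |r| ring_area(r) }
    end
  end
  threads.each { |t| t.join }
end

serial   = Benchmark.realtime { run_serial(RINGS) }
threaded = Benchmark.realtime { run_threaded(RINGS, NUM_THREADS) }

puts format('serial:   %.4f s', serial)
puts format('threaded: %.4f s', threaded)
```

On an interpreter with a GIL the two numbers would be expected to land close together for CPU-bound work like this; running the same file under JRuby is one way to see the difference the slide describes.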
  • 9.
  • 11.
  • 12. CPU Concurrency ✤ Ideally, asynchronous tasks across multiple physical and/or logical cores. ✤ POSIX threading is the standard. ✤ Producer/Consumer pattern typically used to account for differences in machine execution time. ✤ Concurrency generally implemented with Time-Division Multiplexing. A CPU with 4 cores can run 100 threads, but the OS will time-share execution time, more-or-less fairly. ✤ MIMD: Multiple Instruction Multiple Data
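The Producer/Consumer pattern mentioned above can be sketched with Ruby's thread-safe Queue. This is an illustration under assumptions, not code from the talk: the worker count, the `DONE` sentinel, and the choice of 20 rings are all made up for the example.

```ruby
require 'thread' # provides Queue on Ruby 1.9; harmless on newer versions

NUM_WORKERS = 3
DONE = :done # sentinel telling a consumer to stop (an assumption of this sketch)

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

jobs    = Queue.new
results = Queue.new

# Consumers: each pulls radii until it sees the sentinel. A slow worker simply
# takes fewer jobs, which evens out differences in execution time.
consumers = NUM_WORKERS.times.map do
  Thread.new do
    while (radius = jobs.pop) != DONE
      results << [radius, ring_area(radius)]
    end
  end
end

# Producer: enqueue the radii, then one sentinel per consumer.
(1..20).each { |r| jobs << r }
NUM_WORKERS.times { jobs << DONE }

consumers.each { |t| t.join }

# Drain the results into a hash keyed by radius.
areas = {}
until results.empty?
  radius, area = results.pop
  areas[radius] = area
end
puts areas.size
```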
  • 13. Common CPU Issues ✤ Sometimes we need insane numbers of threads per host. ✤ Lock performance. ✤ Potential for excessive virtual memory swapping. ✤ Many tests are non-deterministic. ✤ Concurrency modeling is difficult to get correct.
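One common way to limit lock cost is to accumulate per-thread results locally and take the lock once per thread rather than once per item. The sketch below illustrates that idea with a Mutex; the constants and variable names are assumptions for the example, not part of the original deck.

```ruby
require 'thread' # Mutex lives here on Ruby 1.9; harmless on newer versions

NUM_THREADS = 8
RINGS_PER_THREAD = 1_000

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

total = 0.0
lock = Mutex.new

threads = (0...NUM_THREADS).map do |i|
  Thread.new(i) do |n|
    first = n * RINGS_PER_THREAD + 1
    # Sum this thread's rings without touching shared state.
    local = (first...(first + RINGS_PER_THREAD)).inject(0.0) do |sum, r|
      sum + ring_area(r)
    end
    # One short critical section per thread instead of one per ring.
    lock.synchronize { total += local }
  end
end
threads.each { |t| t.join }

puts total
```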
  • 14. Can’t we just execute every instruction concurrently, but with different data? (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 15. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 16. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Multiple Instruction, Multiple Data. (MIMD)
  • 17.
  • 18. Yes! (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Single Instruction, Multiple Data. (SIMD)
  • 19. GPU Brief History ✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently. ✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixels @ 60Hz => 47,185,920 potential pixel pushes per second! ✤ Running 786,432 threads makes OS unhappy. :( ✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every pixel in a display, each can be computed concurrently. Exactly concurrently. ✤ Generally fast floating point. (Critical for scientific computing.) ✤ SIMD: Single Instruction Multiple Data
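The pixel arithmetic on this slide can be checked in a few lines of Ruby. The constant names are chosen here for illustration only.

```ruby
WIDTH = 1024
HEIGHT = 768
REFRESH_HZ = 60

# One "thread" of work per pixel, repeated on every refresh.
pixels = WIDTH * HEIGHT
updates_per_second = pixels * REFRESH_HZ

puts "#{pixels} pixels"
puts "#{updates_per_second} potential pixel updates per second"
```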
  • 20–28. Hardware Examples. Images from NVIDIA and ATI.
  • 29. GPU Pros ✤ SIMD architecture can run thousands of threads concurrently. ✤ Many synchronization issues are designed out. ✤ More FLOPS (floating point operations per second) than host CPU. ✤ Synchronously execute the same instruction for different data points.
  • 30. NVidia Tesla C1060

      CL_DEVICE_NAME: Tesla C1060
      CL_DEVICE_VENDOR: NVIDIA Corporation
      CL_DRIVER_VERSION: 260.81
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 30
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz
      CL_DEVICE_ADDRESS_BITS: 32
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1014 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 4058 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: local
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
      CL_DEVICE_2D_MAX_WIDTH 4096
      CL_DEVICE_2D_MAX_HEIGHT 32768
      CL_DEVICE_3D_MAX_WIDTH 2048
      CL_DEVICE_3D_MAX_HEIGHT 2048
      CL_DEVICE_3D_MAX_DEPTH 2048
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  • 31. OpenCL (Image from Khronos Group.) Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
  • 32. OpenCL Terms ✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.) ✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU! ✤ Platform: Container for all your devices. When running code on your local machine you’ll only have one “platform” instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.) ✤ Device-specific terms: thread group, streaming multi-processor, etc.
  • 33. Code! ✤ Ruby 1.9: single threaded. ✤ Ruby 1.9: multi-threaded. ✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings). ✤ JRuby 1.6: single threaded. ✤ JRuby 1.6: multi-threaded. ✤ C: single threaded. ✤ C: native threads.
  • 34.
  • 35. GPU Not-So-Pros ✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.) ✤ SIMD can feel limiting. Benefits start to break down when you can’t design conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will cause some threads to idle.) ✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.) ✤ Allocation limitations. Having 4GB of GPU shared memory does not necessarily mean you can allocate one giant block. ✤ Kernels are essentially written in C. May be difficult if you’re new to pointers or memory management. (You can bind to higher-level languages, though.)
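"Designing conditions out" of the example above usually means turning the branch into arithmetic that every thread executes identically. The Ruby below only emulates the per-thread computation on the CPU; it does not run on a GPU, and whether such a rewrite helps on real hardware depends on the compiler and device. Both method names are hypothetical.

```ruby
# Branching version from the slide: threads numbered 42 or lower skip the
# addition, so on SIMD hardware they would sit idle while the rest execute it.
def branching(x, y, thread_number)
  x += y if thread_number > 42
  x
end

# Branch-free version: every thread performs the same multiply-add, with a
# mask of 0 or 1 deciding whether y contributes. For integer thread numbers,
# clamping (thread_number - 42) into 0..1 yields 1 exactly when it is > 42.
def branch_free(x, y, thread_number)
  mask = [[thread_number - 42, 0].max, 1].min
  x + y * mask
end

(40..45).each do |t|
  puts "thread #{t}: #{branching(10, 5, t)} #{branch_free(10, 5, t)}"
end
```

In a C-like kernel language the clamp would typically map to min/max style built-ins rather than Ruby array methods; the point here is only that both versions agree for every thread number.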
  • 36. Q&A and bonus content.
  • 37. Names/Keywords To Know ✤ NVidia (CUDA) ✤ ATI (Owned by AMD) (Stream SDK) ✤ Khronos Group (Drives OpenCL specification.) ✤ Apple (ships OpenCL-capable drivers with all newer Macs running Snow Leopard.)
  • 38. NVidia GeForce GT 330M

      CL_DEVICE_NAME: GeForce GT 330M
      CL_DEVICE_VENDOR: NVIDIA
      CL_DRIVER_VERSION: CLH 1.0
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 6
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHz
      CL_DEVICE_ADDRESS_BITS: 32
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: local
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
      CL_DEVICE_2D_MAX_WIDTH 0
      CL_DEVICE_2D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_WIDTH 0
      CL_DEVICE_3D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_DEPTH 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
  • 39. Intel i7 CPU M 640, 2.80GHz

      CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
      CL_DEVICE_VENDOR: Intel
      CL_DRIVER_VERSION: 1.0
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 4
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHz
      CL_DEVICE_ADDRESS_BITS: 64
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: global
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
      CL_DEVICE_2D_MAX_WIDTH 0
      CL_DEVICE_2D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_WIDTH 0
      CL_DEVICE_3D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_DEPTH 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  • 40. Great GPU Use Cases ✤ 3D rendering. (Obviously!) ✤ Simulation. (Folding@home actually has a GPU client.) ✤ Physics. ✤ Probability. ✤ Network reliability. ✤ General floating-point speed-ups. ✤ Augmentation of existing apps by offloading work to GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
  • 41. A Few Simple Kernels Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios
  • 42. Multi-Dimensional Threading Threads can have an X, Y, and Z coordinate within their thread group, which you use to determine who you are and can thus derive your specific local parameters and minimize control flow changes.
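A typical use of those coordinates is to flatten (X, Y, Z) into a single array index so each thread knows which element is its own. The sketch below emulates that mapping in plain Ruby on the CPU. The grid size and both helper names are assumptions for illustration; a real kernel would read its coordinates from the runtime instead of from nested loops.

```ruby
WIDTH, HEIGHT, DEPTH = 4, 3, 2

# Hypothetical helper: map a thread's (x, y, z) coordinate to a unique
# index in 0...(width * height * depth), with x varying fastest.
def linear_index(x, y, z, width, height)
  x + (y * width) + (z * width * height)
end

# Hypothetical helper: the inverse mapping, recovering [x, y, z] from an index.
def coordinates(index, width, height)
  z, rest = index.divmod(width * height)
  y, x = rest.divmod(width)
  [x, y, z]
end

# Emulate launching one "thread" per coordinate and record which index each
# one would own.
indices = []
DEPTH.times do |z|
  HEIGHT.times do |y|
    WIDTH.times do |x|
      indices << linear_index(x, y, z, WIDTH, HEIGHT)
    end
  end
end

puts indices.inspect
```

Because each coordinate maps to exactly one index, threads can pick their own data without any branching on "who am I", which is the control-flow saving the slide refers to.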
  • 43. Links ✤ OpenCL Programming Guide for OSX. (Really good!) http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html ✤ NVidia CUDA: http://www.nvidia.com/object/cuda_home_new.html ✤ ATI Stream: http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx ✤ OpenCL: http://www.khronos.org/opencl/

Editor's Notes

  26. Jump to Eclipse device query.
  29. Jump to demo!