Ruby Supercomputing
Using The GPU For Massive Performance Speedup
Last Updated: March 17th, 2011.
Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University

@prestonism
gmail: conmotto
http://prestonlee.com
http://github.com/preston
http://www.slideshare.net/preston.lee/

git@github.com:preston/ruby-gpu-examples.git

   Grab the code now if you want, but to run all the examples
   you’ll need the NVIDIA development driver and toolkit,
   Ruby 1.9, the “barracuda” gem, and JRuby (without the
   barracuda gem) on a multi-core OS X Snow Leopard system.
   This takes time to set up, so maybe just chillax for now?
Let’s find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
Math. Yay!

✤   The inner-most ring is ring #1.

✤   Total area out to the edge of ring #5 is π times
    the square of the radius. (πrr)

✤   Area of only ring #5 is πrr
    minus the total area out to ring #4.

✤   (Math::PI * (radius ** 2)) -
    (Math::PI * ((radius - 1) ** 2))

✤   Let’s find the area of every ring...
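Expanding the squares in the formula above gives πr² − π(r − 1)² = π(2r − 1), assuming every ring is exactly one unit wide. The sketch below checks that both forms agree; `ring_area_simplified` is a hypothetical helper and not part of the original examples.

```ruby
# Area of ring number `radius`, assuming every ring is exactly 1 unit wide
# (the same assumption the slide formula makes).
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper (not in the original deck): the same quantity after
# expanding the squares, pi*r^2 - pi*(r-1)^2 = pi*(2r - 1).
def ring_area_simplified(radius)
  Math::PI * (2 * radius - 1)
end

(1..5).each do |r|
  puts format('ring %d: %.4f (simplified: %.4f)', r, ring_area(r), ring_area_simplified(r))
end
```

The simplified form does less arithmetic per ring, but the talk keeps the subtraction form, so the later examples do too.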
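1st working attempt.
(Single-threaded Ruby.)

  RINGS = 1_000_000 # assumed value; the slide leaves RINGS undefined

  # Area of one ring: the disc out to `radius` minus the disc out to `radius - 1`.
  def ring_area(radius)
    (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  (1..RINGS).each do |i|
    puts ring_area(i)
  end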
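To compare this attempt with the later ones, it helps to time it. The following is a minimal sketch using Ruby's standard `benchmark` library. The ring count and the choice to sum the areas instead of printing them are assumptions made here so that terminal I/O does not dominate the measurement.

```ruby
require 'benchmark'

RINGS = 1_000_000 # assumed ring count; matches DEFAULT_RINGS in the C version

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Sum the areas instead of printing each one, so the timing measures the
# arithmetic rather than terminal I/O.
total = 0.0
elapsed = Benchmark.realtime do
  (1..RINGS).each { |i| total += ring_area(i) }
end

puts format('%d rings, total area %.1f, %.3f seconds', RINGS, total, elapsed)
```

The ring areas add up to the area of the whole disc, π × RINGS², so the total doubles as a sanity check on the loop.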
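2nd working attempt.
(Multi-threaded Ruby.)

  RINGS = 1_000_000 # assumed value; the slide leaves RINGS undefined
  NUM_THREADS = 8   # assumed value; matches NUM_THREADS in the C version

  def ring_area(radius)
    (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  end

  # Multi-thread it! Each thread handles a contiguous block of rings.
  # Assumes RINGS divides evenly by NUM_THREADS.
  threads = []
  rings_per_thread = RINGS / NUM_THREADS
  (0..(NUM_THREADS - 1)).each do |i|
    first = (i * rings_per_thread) + 1 # ring numbers start at 1
    threads << Thread.new(rings_per_thread, first) do |num, offset|
      last = offset + num - 1
      (offset..last).each do |radius|
        ring_area(radius)
      end
    end
  end
  threads.each do |t| t.join end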
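The threaded version above discards its results and assumes the ring count divides evenly across threads. The sketch below is one possible variation, not the author's code: it slices the range with `each_slice` and collects each thread's results through `Thread#value`. The constants are illustrative assumptions.

```ruby
RINGS = 1_000      # assumed ring count, kept small for illustration
NUM_THREADS = 8    # assumed thread count

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Split 1..RINGS into NUM_THREADS contiguous slices; the last slice may be
# shorter when RINGS is not a multiple of NUM_THREADS.
slice_size = (RINGS.to_f / NUM_THREADS).ceil
threads = (1..RINGS).each_slice(slice_size).map do |radii|
  Thread.new(radii) do |chunk|
    chunk.map { |radius| ring_area(radius) } # block result becomes Thread#value
  end
end

# Thread#value joins the thread and returns the block's result.
areas = threads.map(&:value).flatten
puts "computed #{areas.size} ring areas"
```

On Ruby 1.9 the GIL still keeps this on one core, so the structure mainly pays off on JRuby.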
3rd/4th working attempt.
(Single/Multi-threaded C. Ohhh crap...)

  /*
    A baseline CPU-based benchmark program for CPU/GPU performance comparison.
    Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
    by taking the total area at a given radius and subtracting the area of the closest inner ring.
    Copyright © 2011 Preston Lee. All rights reserved.
    http://prestonlee.com
  */

  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>
  #include <pthread.h>
  #include <sys/time.h>

  /* Presumably declares ring_thread_data, timeval_subtract() and the function prototypes (not shown on the slide). */
  #include "tree_rings.h"

  #define DEFAULT_RINGS 1000000
  #define NUM_THREADS 8
  #define DEBUG 0

  int acc = 0;

  int main(int argc, const char * argv[]) {
      int rings = DEFAULT_RINGS;
      if(argc > 1) {
          rings = atoi(argv[1]);
      }

      printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
      printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
      printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);

      printf("Number of tree rings: %i. Yay!\n", rings);

      struct timeval start, stop, diff;

      printf("\nRunning serial calculation using CPU...\t\t\t");
      gettimeofday(&start, NULL);
      calculate_ring_areas_in_serial(rings);
      gettimeofday(&stop, NULL);
      timeval_subtract(&diff, &stop, &start);
      printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

      printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
      gettimeofday(&start, NULL);
      calculate_ring_areas_in_parallel(rings);
      gettimeofday(&stop, NULL);
      timeval_subtract(&diff, &stop, &start);
      printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

      printf("\nDone!\n\n");
      return EXIT_SUCCESS;
  }

  /* Approximate the cross-sectional area between each pair of consecutive tree rings
     in serial */
  void calculate_ring_areas_in_serial(int rings) {
      calculate_ring_areas_in_serial_with_offset(rings, 0);
  }

  /* Process `rings` consecutive rings, starting at block number `thread`. */
  void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
      int i;
      int offset = rings * thread;
      int max = rings + offset;
      float a = 0;
      for(i = offset; i < max; i++) {
          a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
      }
  }

  /* Approximate the cross-sectional area between each pair of consecutive tree rings
     in parallel on NUM_THREADS threads */
  void calculate_ring_areas_in_parallel(int rings) {
      pthread_t threads[NUM_THREADS];
      int rc;
      int t;
      int rings_per_thread = rings / NUM_THREADS;
      ring_thread_data data[NUM_THREADS];

      for(t = 0; t < NUM_THREADS; t++){
          data[t].rings = rings_per_thread;
          data[t].number = t;
          rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
          if (rc){
              printf("ERROR; return code from pthread_create() is %d\n", rc);
              exit(-1);
          }
      }

      /* Wait for every worker before stopping the clock. */
      for(t = 0; t < NUM_THREADS; t++){
          pthread_join(threads[t], NULL);
      }
  }

  /* Thread entry point: each thread processes its own block of rings. */
  void ring_job(ring_thread_data * data) {
      calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
  }
Speed: Your Primary Options

1. Pure Ruby in a big loop. (Single
   threaded.)

2. Pure Ruby, multi-threading
   smaller loops. (Limited to using
   a single core on 1.9 due to the
   GIL, but not on JRuby etc.)

3. C extension, single thread.

4. C extension, pthreads.

5. “Divide and conquer.”
Ruby 1.9
CPU Concurrency

✤   Ideally, asynchronous tasks across multiple physical and/or logical
    cores.

✤   POSIX threading is the standard.

✤   Producer/Consumer pattern typically used to account for differences
    in machine execution time.

✤   Concurrency generally implemented with Time-Division
    Multiplexing. A CPU with 4 cores can run 100 threads, but the OS
    will time-share execution time, more-or-less fairly.

✤   MIMD: Multiple Instruction Multiple Data
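The producer/consumer pattern mentioned above can be sketched with Ruby's thread-safe `Queue`. This is an illustrative sketch rather than code from the talk; the worker count, ring count, and the `:done` end marker are all assumptions.

```ruby
require 'thread' # provides Queue on Ruby 1.9; harmless on newer Rubies

NUM_WORKERS = 4   # assumed worker count
RINGS = 1_000     # assumed ring count

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

jobs = Queue.new
results = Queue.new

# Producer: enqueue every radius, then one :done marker per worker.
(1..RINGS).each { |radius| jobs << radius }
NUM_WORKERS.times { jobs << :done }

# Consumers: each worker pulls the next radius as soon as it is free, so a
# slow worker simply ends up processing fewer items.
workers = Array.new(NUM_WORKERS) do
  Thread.new do
    while (radius = jobs.pop) != :done
      results << [radius, ring_area(radius)]
    end
  end
end
workers.each(&:join)

puts "collected #{results.size} results"
```

Unlike the fixed per-thread blocks in the earlier attempt, the queue balances the load at run time, at the cost of a lock operation on every item.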
Common CPU Issues


✤   Sometimes we need insane numbers of threads per host.

✤   Lock performance.

✤   Potential for excessive virtual memory swapping.

✤   Many tests are non-deterministic.

✤   Concurrency modeling is difficult to get correct.
Can’t we just execute every instruction concurrently, but with different data?
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))




         Multiple Instruction, Multiple Data. (MIMD)
Yes!
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))




          Single Instruction, Multiple Data. (SIMD)
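To make the SIMD idea concrete, here is a plain-Ruby model of it: one instruction sequence (a "kernel") that differs between work items only in the index it is handed. Ruby runs the calls one after another, so this illustrates the programming model only and is not parallel execution. The name `global_id` is borrowed from OpenCL terminology and is otherwise an assumption.

```ruby
# A "kernel" in the SIMD sense: one instruction sequence, parameterised only
# by the index of the data element it works on.
ring_area_kernel = lambda do |global_id, radii, out|
  r = radii[global_id]
  out[global_id] = (Math::PI * (r ** 2)) - (Math::PI * ((r - 1) ** 2))
end

radii = (1..8).to_a
areas = Array.new(radii.size)

# Plain Ruby runs these one after another; a GPU would run one work item per
# index at the same time. The loop only models the launch.
radii.each_index { |global_id| ring_area_kernel.call(global_id, radii, areas) }

areas.each_with_index { |a, i| puts format('radius %d -> %.4f', radii[i], a) }
```

Because no work item reads another's output, the calls could run in any order, which is what makes the problem a good fit for the GPU.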
GPU Brief History

✤   Graphics Processing Units were initially developed for 3D scene rendering, such as computing
    luminosity values of each pixel on the screen concurrently.

✤   Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixels
    @ 60Hz => 47,185,920 potential pixel pushes per second!

✤   Running 786,432 threads makes the OS unhappy. :(

✤   Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every
    pixel in a display, each can be computed concurrently. Exactly concurrently.

✤   Generally fast floating point. (Critical for scientific computing.)

✤   SIMD: Single Instruction Multiple Data
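The pixel arithmetic above is easy to reproduce. The variable names in this short sketch are chosen here for illustration.

```ruby
width, height, refresh_hz = 1024, 768, 60

pixels = width * height                    # one work item per pixel
updates_per_second = pixels * refresh_hz   # every pixel redrawn each refresh

puts "#{pixels} pixels, #{updates_per_second} pixel updates per second"
```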
Hardware Examples




         Images from NVIDIA and ATI.
GPU Pros


✤   SIMD architecture can run thousands of threads concurrently.

✤   Many synchronization issues are designed out.

✤   Typically more FLOPS (floating point operations per second) than the host CPU.

✤   Synchronously execute the same instruction for different data points.
NVidia Tesla C1060
CL_DEVICE_NAME:                        Tesla C1060
CL_DEVICE_VENDOR:                      NVIDIA Corporation
CL_DRIVER_VERSION:                     260.81
CL_DEVICE_TYPE:                        CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:           30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:    3
CL_DEVICE_MAX_WORK_ITEM_SIZES:         512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:         512
CL_DEVICE_MAX_CLOCK_FREQUENCY:         1296 MHz
CL_DEVICE_ADDRESS_BITS:                32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:          1014 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:             4058 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:    no
CL_DEVICE_LOCAL_MEM_TYPE:              local
CL_DEVICE_LOCAL_MEM_SIZE:              16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:    64 KByte
CL_DEVICE_QUEUE_PROPERTIES:            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:            CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:               1
CL_DEVICE_MAX_READ_IMAGE_ARGS:         128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:        8
CL_DEVICE_SINGLE_FP_CONFIG:            CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
CL_DEVICE_2D_MAX_WIDTH:                4096
CL_DEVICE_2D_MAX_HEIGHT:               32768
CL_DEVICE_3D_MAX_WIDTH:                2048
CL_DEVICE_3D_MAX_HEIGHT:               2048
CL_DEVICE_3D_MAX_DEPTH:                2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
OpenCL
(Image from Khronos Group.)

Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
OpenCL Terms

✤   Kernel: Your custom function(s) that will run on the Device.
    (Unrelated to the OS kernel.)

✤   Device: something that computes, such as a GPU chip. (You probably
    have 1, but then you might have 0... or 2+.) Could also be your CPU!

✤   Platform: Container for all your devices. When running code on your
    local machine you’ll only have one “platform” instance. (A cluster of
    GPU-heavy systems connected over a network would yield multiple
    available platforms, but OpenCL work in this area is not yet
    complete.)

✤   Device-specific terms: thread group, streaming multi-processor, etc.
Code!

✤   Ruby 1.9: single threaded.
✤   Ruby 1.9: multi-threaded.
✤   Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).
✤   JRuby 1.6: single threaded.
✤   JRuby 1.6: multi-threaded.
✤   C: single threaded.
✤   C: native threads.
GPU Not-So-Pros

✤   Copying data in system RAM to/from GPU shared memory is not free. (In my
    own testing, often the slowest link in the chain.)

✤   SIMD can feel limiting. Benefits start to break down when you can’t design
    conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will
    cause some threads to idle.)

✤   64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)

✤   Allocation limitations. Having 4GB of GPU shared memory does not necessarily
    mean you can allocate one giant block.

✤   Kernels are essentially written in C. May be difficult if you’re new to pointers or
    memory management. (You can bind to higher-level languages, though.)
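The conditional in the SIMD bullet above can sometimes be designed out by turning the comparison into a 0-or-1 factor, so every work item executes the same multiply-add. The sketch below models that in Ruby. The helper names are hypothetical, and Ruby's ternary is itself a branch, whereas in a C-style kernel the comparison `(thread_number > 42)` already evaluates to 0 or 1. Whether this form is actually faster depends on the device and compiler.

```ruby
THRESHOLD = 42 # same constant as the slide's example

# Branching form: work items at or below the threshold do nothing, which on
# SIMD hardware leaves their lanes idle while the others execute the add.
def branching_update(thread_number, x, y)
  x += y if thread_number > THRESHOLD
  x
end

# Branch-free form (hypothetical helper): every work item runs the same
# multiply-add; the comparison only contributes a 0 or 1 factor.
def masked_update(thread_number, x, y)
  mask = thread_number > THRESHOLD ? 1 : 0
  x + (y * mask)
end

(40..45).each do |t|
  puts "thread #{t}: branching=#{branching_update(t, 10, 5)} masked=#{masked_update(t, 10, 5)}"
end
```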
Q&A and bonus content.
Names/Keywords To Know


✤   NVidia (CUDA)

✤   ATI (Owned by AMD) (Stream SDK)

✤   Khronos Group (Drives OpenCL specification.)

✤   Apple (ships OpenCL-capable drivers with all newer Macs running
    Snow Leopard.)
NVidia GeForce GT 330M
CL_DEVICE_NAME:                        GeForce GT 330M
CL_DEVICE_VENDOR:                      NVIDIA
CL_DRIVER_VERSION:                     CLH 1.0
CL_DEVICE_TYPE:                        CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:           6
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:    3
CL_DEVICE_MAX_WORK_ITEM_SIZES:         0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:         512
CL_DEVICE_MAX_CLOCK_FREQUENCY:         1100 MHz
CL_DEVICE_ADDRESS_BITS:                32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:          128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:             512 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:    no
CL_DEVICE_LOCAL_MEM_TYPE:              local
CL_DEVICE_LOCAL_MEM_SIZE:              16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:    64 KByte
CL_DEVICE_QUEUE_PROPERTIES:            CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:               1
CL_DEVICE_MAX_READ_IMAGE_ARGS:         128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:        8
CL_DEVICE_SINGLE_FP_CONFIG:            CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:                0
CL_DEVICE_2D_MAX_HEIGHT:               0
CL_DEVICE_3D_MAX_WIDTH:                0
CL_DEVICE_3D_MAX_HEIGHT:               0
CL_DEVICE_3D_MAX_DEPTH:                0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
Intel i7 CPU M 640, 2.80GHz
CL_DEVICE_NAME:                        Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
CL_DEVICE_VENDOR:                      Intel
CL_DRIVER_VERSION:                     1.0
CL_DEVICE_TYPE:                        CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:           4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:    3
CL_DEVICE_MAX_WORK_ITEM_SIZES:         0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE:         1
CL_DEVICE_MAX_CLOCK_FREQUENCY:         2800 MHz
CL_DEVICE_ADDRESS_BITS:                64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:          1536 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:             6144 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:    no
CL_DEVICE_LOCAL_MEM_TYPE:              global
CL_DEVICE_LOCAL_MEM_SIZE:              16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:    64 KByte
CL_DEVICE_QUEUE_PROPERTIES:            CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:               1
CL_DEVICE_MAX_READ_IMAGE_ARGS:         128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:        8
CL_DEVICE_SINGLE_FP_CONFIG:            CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH:                0
CL_DEVICE_2D_MAX_HEIGHT:               0
CL_DEVICE_3D_MAX_WIDTH:                0
CL_DEVICE_3D_MAX_HEIGHT:               0
CL_DEVICE_3D_MAX_DEPTH:                0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>:  CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
Great GPU Use Cases

✤   3D rendering. (Obviously!)

✤   Simulation. (Folding@home actually has a
    GPU client.)

✤   Physics.

✤   Probability.

✤   Network reliability.

✤   General floating-point speed-ups.

✤   Augmentation of existing apps by offloading
    work to GPU. Combining CPU/GPU
    paradigms is a perfectly valid approach.
A Few Simple Kernels
Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios

Multi-Dimensional Threading

Threads can have an X, Y, and Z coordinate within their thread
group, which you use to determine who you are and can thus derive
your specific local parameters and minimize control flow changes.
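One common use of those coordinates is computing which element of a flat buffer a thread owns. The sketch below shows the usual row-major mapping and its inverse in Ruby. The dimensions and the helper names `linear_index` and `coordinates` are hypothetical, chosen only for illustration.

```ruby
# Hypothetical group dimensions, chosen only for illustration.
WIDTH, HEIGHT, DEPTH = 4, 3, 2

# Flatten an (x, y, z) coordinate into a unique position in a 1-D buffer.
def linear_index(x, y, z)
  x + (y * WIDTH) + (z * WIDTH * HEIGHT)
end

# Inverse mapping: recover the coordinate from the flat index.
def coordinates(index)
  z, rest = index.divmod(WIDTH * HEIGHT)
  y, x = rest.divmod(WIDTH)
  [x, y, z]
end

puts linear_index(1, 2, 1)
puts coordinates(21).inspect
```

Each thread derives its own index from pure arithmetic, so no thread needs an `if` to find its data.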
Links


✤   OpenCL Programming Guide for OSX. (Really good!)
    http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html



✤   NVidia CUDA:
    http://www.nvidia.com/object/cuda_home_new.html


✤   ATI Stream:
    http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx


✤   OpenCL:
    http://www.khronos.org/opencl/

COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 

Recently uploaded (20)

Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

  • 1. @prestonism Ruby Supercomputing gmail: conmotto http://prestonlee.com http://github.com/preston Using The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/ Last Updated: March 17th, 2011. Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
  • 2. git@github.com:preston/ruby-gpu-examples.git Grab the code now if you want, but to run all the examples you’ll need the NVIDIA development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
  • 3. Let’s find the area of each ring. http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
  • 4. Math. Yay! ✤ The inner-most ring is ring #1. ✤ Total area of ring #5 is π times the square of the radius. (πrr) ✤ Area of only ring #5 is πrr minus area of ring #4. ✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) ✤ Let’s find the area of every ring...
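As a quick check on the slide 4 formula, expanding the squares gives πr² − π(r − 1)² = π(2r − 1). The sketch below compares the two forms in Ruby, assuming unit-width rings as in the slides. `ring_area_closed_form` is a name introduced here for illustration and is not part of the original examples.

```ruby
# Formula from the slides: the disc at `radius` minus the disc just inside it.
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper (not in the original code): the same quantity after
# expanding the squares, pi * r^2 - pi * (r - 1)^2 = pi * (2r - 1).
def ring_area_closed_form(radius)
  Math::PI * (2 * radius - 1)
end

(1..5).each do |r|
  puts format("ring %d: %.4f vs %.4f", r, ring_area(r), ring_area_closed_form(r))
end
```

The closed form does less arithmetic per ring, though the talk deliberately keeps the longer expression as its benchmark workload.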
  • 5. 1st working attempt. (Single-threaded Ruby.)

      def ring_area(radius)
        (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
      end

      (1..RINGS).each do |i|
        puts ring_area(i)
      end
  • 6. 2nd working attempt. (Multi-threaded Ruby.)

      def ring_area(radius)
        (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
      end

      # Multi-thread it!
      threads = []  # not shown on the slide; needed before the << below
      (0..(NUM_THREADS - 1)).each do |i|
        rings_per_thread = RINGS / NUM_THREADS
        offset = i * rings_per_thread
        # Each thread gets its own contiguous block of radii.
        threads << Thread.new(rings_per_thread, offset) do |num, offset|
          last = offset + num - 1
          (offset..(last)).each do |radius|
            ring_area(radius)
          end
        end
      end

      # Wait for every worker to finish.
      threads.each do |t|
        t.join
      end
  • 7. 3rd/4th working attempt. (Single/Multi-threaded C. Ohhh crap...)

      /*
       A baseline CPU-based benchmark program for CPU/GPU performance comparison.
       Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
       by taking the total area at a given radius and subtracting the area of the closest inner ring.
       Copyright © 2011 Preston Lee. All rights reserved.
       http://prestonlee.com
       */

      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>
      #include <pthread.h>
      #include <sys/time.h>

      #include "tree_rings.h"

      #define DEFAULT_RINGS 1000000
      #define NUM_THREADS 8
      #define DEBUG 0

      int acc = 0;

      int main(int argc, const char * argv[]) {
          int rings = DEFAULT_RINGS;
          if(argc > 1) {
              rings = atoi(argv[1]);
          }

          printf("\nA baseline CPU-based benchmark program for CPU/GPU performance comparison.\n");
          printf("Copyright © 2011 Preston Lee. All rights reserved.\n\n");
          printf("\tUsage: %s [NUM TREE RINGS]\n\n", argv[0]);

          printf("Number of tree rings: %i. Yay!\n", rings);

          struct timeval start, stop, diff;

          printf("\nRunning serial calculation using CPU...\t\t\t");
          gettimeofday(&start, NULL);
          calculate_ring_areas_in_serial(rings);
          gettimeofday(&stop, NULL);
          timeval_subtract(&diff, &stop, &start);
          printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

          printf("Running parallel calculation using %i CPU threads...\t", NUM_THREADS);
          gettimeofday(&start, NULL);
          calculate_ring_areas_in_parallel(rings);
          gettimeofday(&stop, NULL);
          timeval_subtract(&diff, &stop, &start);
          printf("%ld.%06ld seconds\n", (long)diff.tv_sec, (long)diff.tv_usec);

          printf("\nDone!\n\n");
          return EXIT_SUCCESS;
      }

      /* Approximate the cross-sectional area between each pair of consecutive tree rings in serial */
      void calculate_ring_areas_in_serial(int rings) {
          calculate_ring_areas_in_serial_with_offset(rings, 0);
      }

      void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {
          int i;
          int offset = rings * thread;
          int max = rings + offset;
          float a = 0;
          for(i = offset; i < max; i++) {
              a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));
          }
      }

      /* Approximate the cross-sectional area between each pair of consecutive tree rings
         in parallel on NUM_THREADS threads */
      void calculate_ring_areas_in_parallel(int rings) {
          pthread_t threads[NUM_THREADS];
          int rc;
          int t;
          int rings_per_thread = rings / NUM_THREADS;
          ring_thread_data data[NUM_THREADS];

          for(t = 0; t < NUM_THREADS; t++){
              data[t].rings = rings_per_thread;
              data[t].number = t;
              rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);
              if (rc){
                  printf("ERROR; return code from pthread_create() is %d\n", rc);
                  exit(-1);
              }
          }

          for(t = 0; t < NUM_THREADS; t++){
              pthread_join(threads[t], NULL);
          }
      }

      void ring_job(ring_thread_data * data) {
          calculate_ring_areas_in_serial_with_offset(data->rings, data->number);
      }
  • 8. Speed: Your Primary Options 1. Pure Ruby in a big loop. (Single threaded.) 2. Pure Ruby, multi-threading smaller loops. (Limited to using a single core on 1.9 due to the GIL, but not on JRuby etc.) 3. C extension, single thread. 4. C extension, pthreads. 5. “Divide and conquer.”
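The "multi-threading smaller loops" option depends on splitting the rings into per-thread chunks. The slides use `RINGS / NUM_THREADS`, which silently drops the remainder when the ring count is not a multiple of the thread count. The sketch below shows one way to cover every ring. It assumes a Ruby recent enough to have `Enumerable#sum` (2.4 or later, newer than the 1.9 used in the talk), and `partition_rings` is a hypothetical helper that is not in the original repository.

```ruby
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Hypothetical helper: split radii 1..rings into `parts` contiguous ranges
# whose sizes differ by at most one, so no ring is dropped.
def partition_rings(rings, parts)
  base, extra = rings.divmod(parts)
  start = 1
  (0...parts).map do |i|
    size = base + (i < extra ? 1 : 0)
    range = (start...(start + size))
    start += size
    range
  end
end

RINGS = 10
NUM_THREADS = 4

chunks = partition_rings(RINGS, NUM_THREADS)

# One thread per chunk; Thread#value joins the thread and returns its result.
threads = chunks.map do |chunk|
  Thread.new(chunk) { |radii| radii.sum { |r| ring_area(r) } }
end
total = threads.sum(&:value)

puts chunks.inspect
puts total
```

On MRI the GIL still keeps these threads on one core, as the slide notes. The same partitioning would apply unchanged under JRuby.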
  • 12. CPU Concurrency ✤ Ideally, asynchronous tasks across multiple physical and/or logical cores. ✤ POSIX threading is the standard. ✤ Producer/Consumer pattern typically used to account for differences in machine execution time. ✤ Concurrency generally implemented with Time-Division Multiplexing. A CPU with 4 cores can run 100 threads, but the OS will time-share execution time, more-or-less fairly. ✤ MIMD: Multiple Instruction Multiple Data
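The Producer/Consumer pattern mentioned above can be sketched with Ruby's built-in thread-safe `Queue`. This is an illustration rather than code from the talk: the worker count, the `:done` marker, and the 20-ring workload are all assumptions chosen to keep the example small.

```ruby
require 'thread' # Queue lives here on older Rubies; harmless on newer ones

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

WORKERS = 4
jobs = Queue.new
results = Queue.new

# Producer: enqueue one job per ring, then one :done marker per worker.
producer = Thread.new do
  (1..20).each { |radius| jobs << radius }
  WORKERS.times { jobs << :done }
end

# Consumers: each worker pulls radii until it receives a :done marker, so
# faster workers simply take more jobs than slower ones.
consumers = Array.new(WORKERS) do
  Thread.new do
    while (radius = jobs.pop) != :done
      results << [radius, ring_area(radius)]
    end
  end
end

producer.join
consumers.each(&:join)

# Drain the results into a hash keyed by radius.
areas = {}
areas.store(*results.pop) until results.empty?
puts areas.size
```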
  • 13. Common CPU Issues ✤ Sometimes we need insane numbers of threads per host. ✤ Lock performance. ✤ Potential for excessive virtual memory swapping. ✤ Many tests are non-deterministic. ✤ Concurrency modeling is difficult to get correct.
  • 14. Can’t we just execute every instruction concurrently, but with different data? (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 15. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
  • 16. No. (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Multiple Instruction, Multiple Data. (MIMD)
  • 18. Yes! (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Single Instruction, Multiple Data. (SIMD)
  • 19. GPU Brief History ✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently. ✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixels @ 60Hz => 47,185,920 potential pixel pushes per second! ✤ Running 786,432 threads makes OS unhappy. :( ✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every pixel in a display, each can be computed concurrently. Exactly concurrently. ✤ Generally fast floating point. (Critical for scientific computing.) ✤ SIMD: Single Instruction Multiple Data
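The pixel arithmetic on this slide is easy to reproduce. The variable names below are chosen for this sketch only:

```ruby
# Display size and refresh rate quoted on the slide.
width, height, refresh_hz = 1024, 768, 60

pixels = width * height                  # pixels on screen
updates_per_second = pixels * refresh_hz # per-pixel updates each second

puts pixels             # 786432
puts updates_per_second # 47185920
```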
  • 20. Hardware Examples Images from NVIDIA and ATI.
  • 29. GPU Pros ✤ SIMD architecture can run thousands of threads concurrently. ✤ Many synchronization issues are designed out. ✤ More FLOPS (floating point operations per second) than host CPU. ✤ Synchronously execute the same instruction for different data points.
  • 30. NVidia Tesla C1060
      CL_DEVICE_NAME: Tesla C1060
      CL_DEVICE_VENDOR: NVIDIA Corporation
      CL_DRIVER_VERSION: 260.81
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 30
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz
      CL_DEVICE_ADDRESS_BITS: 32
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1014 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 4058 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: local
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
      CL_DEVICE_2D_MAX_WIDTH 4096
      CL_DEVICE_2D_MAX_HEIGHT 32768
      CL_DEVICE_3D_MAX_WIDTH 2048
      CL_DEVICE_3D_MAX_HEIGHT 2048
      CL_DEVICE_3D_MAX_DEPTH 2048
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  • 31. OpenCL (Image from Khronos Group.) Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
  • 32. OpenCL Terms ✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.) ✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU! ✤ Platform: Container for all your devices. When running code on your local machine you’ll only have one “platform” instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.) ✤ Device-specific terms: thread group, streaming multi-processor, etc.
  • 33. Code! ✤ Ruby 1.9: single threaded. ✤ Ruby 1.9: multi-threaded. ✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings). ✤ JRuby 1.6: single threaded. ✤ JRuby 1.6: multi-threaded. ✤ C: single threaded. ✤ C: native threads.
  • 35. GPU Not-So-Pros ✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.) ✤ SIMD can feel limiting. Benefits start to break down when you can’t design conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will cause some threads to idle.) ✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.) ✤ Allocation limitations. Having 4GB of GPU shared memory does not necessarily mean you can allocate one giant block. ✤ Kernels are essentially written in C. May be difficult if you’re new to pointers or memory management. (You can bind to higher-level languages, though.)
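To make "designing conditions out" concrete, the slide's example condition can be rewritten so that every thread runs the same multiply-add and a 0/1 mask decides whether `y` contributes. The sketch below shows only the idea, in Ruby, with integer thread numbers assumed. Real kernels would be written in OpenCL C, and whether the masked form is faster on a given device is something to measure rather than assume. Both method names are hypothetical.

```ruby
# Branching version, mirroring the condition quoted on the slide.
def update_branching(thread_number, x, y)
  x += y if thread_number > 42
  x
end

# Branch-free variant: the mask is 1 when thread_number > 42 and 0 otherwise,
# so every "thread" runs the identical instruction sequence.
def update_masked(thread_number, x, y)
  mask = (thread_number - 42).clamp(0, 1)
  x + mask * y
end

[41, 42, 43].each do |t|
  puts "#{t}: #{update_branching(t, 1.5, 2.0)} #{update_masked(t, 1.5, 2.0)}"
end
```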
  • 36. Q&A and bonus content.
  • 37. Names/Keywords To Know ✤ NVidia (CUDA) ✤ ATI (Owned by AMD) (Stream SDK) ✤ Khronos Group (Drives OpenCL specification.) ✤ Apple (ships OpenCL-capable drivers with all newer Macs running Snow Leopard.)
  • 38. NVidia GeForce GT 330M
      CL_DEVICE_NAME: GeForce GT 330M
      CL_DEVICE_VENDOR: NVIDIA
      CL_DRIVER_VERSION: CLH 1.0
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 6
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHz
      CL_DEVICE_ADDRESS_BITS: 32
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 512 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: local
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
      CL_DEVICE_2D_MAX_WIDTH 0
      CL_DEVICE_2D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_WIDTH 0
      CL_DEVICE_3D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_DEPTH 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
  • 39. Intel i7 CPU M 640, 2.80GHz
      CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
      CL_DEVICE_VENDOR: Intel
      CL_DRIVER_VERSION: 1.0
      CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
      CL_DEVICE_MAX_COMPUTE_UNITS: 4
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
      CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0
      CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
      CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHz
      CL_DEVICE_ADDRESS_BITS: 64
      CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
      CL_DEVICE_LOCAL_MEM_TYPE: global
      CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
      CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT: 1
      CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
      CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
      CL_DEVICE_2D_MAX_WIDTH 0
      CL_DEVICE_2D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_WIDTH 0
      CL_DEVICE_3D_MAX_HEIGHT 0
      CL_DEVICE_3D_MAX_DEPTH 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  • 40. Great GPU Use Cases ✤ 3D rendering. (Obviously!) ✤ Simulation. (Folding@home actually has a GPU client.) ✤ Physics. ✤ Probability. ✤ Network reliability. ✤ General floating-point speed-ups. ✤ Augmentation of existing apps by offloading work to GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
  • 41. A Few Simple Kernels: Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios
  • 42. Multi-Dimensional Threading: Threads can have an X, Y, and Z coordinate within their thread group, which you use to determine who you are and can thus derive your specific local parameters and minimize control flow changes.
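One common way a thread "determines who it is" from its coordinates is to flatten (x, y, z) into a single index into the data it owns. The sketch below assumes a row-major layout with x varying fastest. Both helper names and the 4 x 3 x 2 group size are illustrative and do not come from the talk or from the OpenCL API.

```ruby
# Hypothetical helper: flatten an (x, y, z) coordinate into one index,
# with x varying fastest, then y, then z.
def flat_index(x, y, z, width, height)
  x + (y * width) + (z * width * height)
end

# Inverse mapping: recover [x, y, z] from a flat index.
def coordinates(index, width, height)
  z, rest = index.divmod(width * height)
  y, x = rest.divmod(width)
  [x, y, z]
end

WIDTH, HEIGHT, DEPTH = 4, 3, 2

puts flat_index(1, 2, 1, WIDTH, HEIGHT)
puts coordinates(21, WIDTH, HEIGHT).inspect
```

Because each thread computes its own index from its coordinates, no thread needs a conditional to find its slice of the input.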
  • 43. Links ✤ OpenCL Programming Guide for OSX. (Really good!) http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html ✤ NVidia CUDA: http://www.nvidia.com/object/cuda_home_new.html ✤ ATI Stream: http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx ✤ OpenCL: http://www.khronos.org/opencl/

Editor's Notes

  26. Jump to Eclipse device query.
  29. Jump to demo!