@prestonismRuby Supercomputing                                                                                gmail: conmo...
git@github.com:preston/ruby-gpu-examples.git   Grab the code now if you want, but to run all the examples     you’ll need ...
Let’s find the area of each ring.http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
Math. Yay!✤   The inner-most ring is ring #1.✤   Total area of ring #5 is π times            #1    the square of the radiu...
1st working attempt.                                              #1(Single-threaded Ruby.)  def ring_area(radius)      (M...
2nd working attempt.                                                    #1(Multi-threaded Ruby.)  def ring_area(radius)   ...
3rd/4th working attempt.                                                                                                  ...
Speed: Your Primary Options1. Pure Ruby in a big loop. (Single   threaded.)                         #12. Pure Ruby, multi-...
Ruby 1.9
CPU Concurrency✤   Ideally, asynchronous tasks across multiple physical and/or logical    cores.✤   POSIX threading is the...
Common CPU Issues✤   Sometime we need insane numbers of threads per host.✤   Lock performance.✤   Potential for excessive ...
Can’t we just execute every                       instructionconcurrently, but with different data?(Math::PI * (radius ** ...
No.(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
No.(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))         Multiple Instruction, Multiple Data. (MIMD)
Yes!(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))          Single Instruction, Multiple Data. (SIMD)
GPU Brief History✤   Graphics Processing Units were initially developed for 3D scene rendering, such as computing    lumin...
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
Hardware Examples         Images from NVIDIA and ATI.
GPU Pros✤   SIMD architecture can run thousands of threads concurrently.✤   Many synchronization issues are designed out.✤...
NVidia Tesla C1060CL_DEVICE_NAME:          Tesla C1060CL_DEVICE_VENDOR:               NVIDIA CorporationCL_DRIVER_VERSION:...
OpenCL(Image from Khronos Group.)Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Developme...
OpenCL Terms✤   Kernel: Your custom function(s) that will run on the Device.    (Unrelated to the OS kernel.)✤   Device: s...
Code!✤   Ruby 1.9: single threaded.✤   Ruby 1.9: multi-threaded on 1.9.✤   Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA ...
GPU Not-So-Pros✤   Copying data in system RAM to/from GPU shared memory is not free. (In my    own testing, often the slow...
Q&A and bonus content.
Names/Keywords To Know✤   NVidia (CUDA)✤   ATI (Owned by AMD) (Stream SDK)✤   Kronus (Drives OpenCL specification.)✤   Appl...
NVidia GeForce GT 330MCL_DEVICE_NAME:          GeForce GT 330MCL_DEVICE_VENDOR:              NVIDIACL_DRIVER_VERSION:     ...
Intel i7 CPU M 640, 2.80GHzCL_DEVICE_NAME:          Intel(R) Core(TM) i7 CPU    M 640 @ 2.80GHzCL_DEVICE_VENDOR:          ...
Great GPU UseCases✤   3D rendering. (Obviously!)✤   Simulation. (Folding@home actual has a    GPU client.)✤   Physics.✤   ...
A Few Simple KernelsSeveral simple C kernels w/Java, OpenCL and JOCL in Eclipse HeliosDate
Multi-DimensionalThreadingThreads can have an X, Y, and Zcoordinate within their threadgroup, which you use to determinewh...
Links✤   OpenCL Programming Guide for OSX. (Really good!)    http://developer.apple.com/library/mac/#documentation/Perform...
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
Upcoming SlideShare
Loading in …5
×

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

9,378 views

Published on

For MountainWest RubyConf 2011 in Salt Lake City, Utah. By Preston Lee.

Twitter: @prestonism
Blog: http://prestonlee.com
Code: https://github.com/preston/ruby-gpu-examples
Slides: http://www.slideshare.net/preston.lee/

Published in: Technology
1 Comment
6 Likes
Statistics
Notes
  • Thanks everyone for checking out the presentation, and for getting this on the SlideShare 'featured' frontpage section for at least a day now. Note that a more detailed explanation of these will be posted shortly after Confreaks releases the MountainWest RubyConf 2011 video recordings for which this presentation was given.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total views
9,378
On SlideShare
0
From Embeds
0
Number of Embeds
14
Actions
Shares
0
Downloads
100
Comments
1
Likes
6
Embeds 0
No embeds

No notes for slide
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • Jump to Eclipse device query.\n
  • \n
  • \n
  • Jump to demo!\n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

    1. @prestonismRuby Supercomputing gmail: conmotto http://prestonlee.com http://github.com/prestonUsing The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/Last Updated: March 17th, 2011.Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University
    2. git@github.com:preston/ruby-gpu-examples.git Grab the code now if you want, but to run all the examples you’ll need the NVIDIA development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
    3. Let’s find the area of each ring.http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
    4. Math. Yay!✤ The inner-most ring is ring #1.✤ Total area of ring #5 is π times #1 the square of the radius. (πrr) #4✤ Area of only ring #5 is πrr #5 minus area of ring #4.✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))✤ Let’s find the area of every ring...
    5. 1st working attempt. #1(Single-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end (1..RINGS).each do |i| puts ring_area(i) end
    6. 2nd working attempt. #1(Multi-threaded Ruby.) def ring_area(radius) (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) end # Multi-thread it! (0..(NUM_THREADS - 1)).each do |i| rings_per_thread = RINGS / NUM_THREADS offset = i * rings_per_thread threads << Thread.new(rings_per_thread, offset) do |num, offset| last = offset + num - 1 (offset..(last)).each do |radius| ring_area(radius) end end end threads.each do |t| t.join end
    7. 3rd/4th working attempt. #1(Single/Multi-threaded C. Ohhh crap...) /* printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec); A baseline CPU-based benchmark program CPU/GPU performance comparision. Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel printf("nDone!nn"); by taking the total area at a given radius and subtracting the area of the closest inner ring. return EXIT_SUCCESS; Copyright © 2011 Preston Lee. All rights reserved. } http://prestonlee.com */ /* Approximate the cross-sectional area between each pair of consecutive tree rings in serial */ #include <stdio.h> void calculate_ring_areas_in_serial(int rings) { #include <stdlib.h> calculate_ring_areas_in_serial_with_offset(rings, 0); #include <math.h> } #include <pthread.h> #include <sys/time.h> void calculate_ring_areas_in_serial_with_offset(int rings, int thread) { int i; #include "tree_rings.h" int offset = rings * thread; int max = rings + offset; #define DEFAULT_RINGS 1000000 float a = 0; #define NUM_THREADS 8 for(i = offset; i < max; i++) { #define DEBUG 0 a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2)); } int acc = 0; } int main(int argc, const char * argv[]) { /* Approximate the cross-sectional area between each pair of consecutive tree rings int rings = DEFAULT_RINGS; in parallel on NUM_THREADS threads */ void calculate_ring_areas_in_parallel(int rings) { if(argc > 1) { pthread_t threads[NUM_THREADS]; rings = atoi(argv[1]); int rc; } int t; int rings_per_thread = rings / NUM_THREADS; printf("nA baseline CPU-based benchmark program for CPU/GPU performance comparision.n"); ring_thread_data data[NUM_THREADS]; printf("Copyright © 2011 Preston Lee. All rights reserved.nn"); printf("tUsage: %s [NUM TREE RINGS]nn", argv[0]); for(t = 0; t < NUM_THREADS; t++){ data[t].rings = rings_per_thread; printf("Number of tree rings: %i. Yay!n", rings); data[t].number = t; rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]); struct timeval start, stop, diff; if (rc){ printf("ERROR; return code from pthread_create() is %dn", rc); printf("nRunning serial calculation using CPU...ttt"); exit(-1); gettimeofday(&start, NULL); } calculate_ring_areas_in_serial(rings); } gettimeofday(&stop, NULL); timeval_subtract(&diff, &stop, &start); for(t = 0; t < NUM_THREADS; t++){ printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec); pthread_join(threads[t], NULL); } printf("Running parallel calculation using %i CPU threads...t", NUM_THREADS); } gettimeofday(&start, NULL); calculate_ring_areas_in_parallel(rings); void ring_job(ring_thread_data * data) { gettimeofday(&stop, NULL); calculate_ring_areas_in_serial_with_offset(data->rings, data->number); timeval_subtract(&diff, &stop, &start); }
    8. Speed: Your Primary Options1. Pure Ruby in a big loop. (Single threaded.) #12. Pure Ruby, multi-threading smaller loops. (Limited to using a single core on 1.9 due to the GIL, but not on jruby etc.)3. C extension, single thread.4. C extension, pthreads.5. “Divide and conquer.”
    9. Ruby 1.9
    10. CPU Concurrency✤ Ideally, asynchronous tasks across multiple physical and/or logical cores.✤ POSIX threading is the standard.✤ Producer/Consumer pattern typically used to account for differences in machine execution time.✤ Concurrency generally implemented with Time-Division Multiplexing. An CPU with 4 cores can run 100 threads, but the OS will time-share execution time, more-or-less fairly.✤ MIMD: Multiple Instruction Multiple Data
    11. Common CPU Issues✤ Sometime we need insane numbers of threads per host.✤ Lock performance.✤ Potential for excessive virtual memory swapping.✤ Many tests are non-deterministic.✤ Concurrency modeling is difficult to get correct.
    12. Can’t we just execute every instructionconcurrently, but with different data?(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
    13. No.(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
    14. No.(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Multiple Instruction, Multiple Data. (MIMD)
    15. Yes!(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) Single Instruction, Multiple Data. (SIMD)
    16. GPU Brief History✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently.✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixles @ 60Hz => 47,185,920 => potential pixel pushes per second!✤ Running 768,432 threads makes OS unhappy. :(✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every pixel in a display, each can be computed concurrently. Exactly concurrently.✤ Generally fast floating point. (Critical for scientific computing.)✤ SIMD: Single Instruction Multiple Data
    17. Hardware Examples Images from NVIDIA and ATI.
    18. Hardware Examples Images from NVIDIA and ATI.
    19. Hardware Examples Images from NVIDIA and ATI.
    20. Hardware Examples Images from NVIDIA and ATI.
    21. Hardware Examples Images from NVIDIA and ATI.
    22. Hardware Examples Images from NVIDIA and ATI.
    23. Hardware Examples Images from NVIDIA and ATI.
    24. Hardware Examples Images from NVIDIA and ATI.
    25. Hardware Examples Images from NVIDIA and ATI.
    26. GPU Pros✤ SIMD architecture can run thousands of threads concurrently.✤ Many synchronization issues are designed out.✤ More FLOPS (floating point operations per second) than host CPU.✤ Synchronously execute the same instruction for different data points.
    27. NVidia Tesla C1060CL_DEVICE_NAME: Tesla C1060CL_DEVICE_VENDOR: NVIDIA CorporationCL_DRIVER_VERSION: 260.81CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPUCL_DEVICE_MAX_COMPUTE_UNITS: 30CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64CL_DEVICE_MAX_WORK_GROUP_SIZE: 512CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHzCL_DEVICE_ADDRESS_BITS: 32CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1014 MByteCL_DEVICE_GLOBAL_MEM_SIZE: 4058 MByteCL_DEVICE_ERROR_CORRECTION_SUPPORT: noCL_DEVICE_LOCAL_MEM_TYPE: localCL_DEVICE_LOCAL_MEM_SIZE: 16 KByteCL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByteCL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLECL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLECL_DEVICE_IMAGE_SUPPORT: 1CL_DEVICE_MAX_READ_IMAGE_ARGS: 128CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INFCL_FP_FMACL_DEVICE_2D_MAX_WIDTH 4096CL_DEVICE_2D_MAX_HEIGHT 32768CL_DEVICE_3D_MAX_WIDTH 2048CL_DEVICE_3D_MAX_HEIGHT 2048CL_DEVICE_3D_MAX_DEPTH 2048CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
    28. OpenCL(Image from Khronos Group.)Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.
    29. OpenCL Terms✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.)✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU!✤ Platform: Container for all your devices. When running code on your local machine you’ll only have one “platform” instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.)✤ Device-specific terms: thread group, streaming multi-processor, etc.
    30. Code!✤ Ruby 1.9: single threaded.✤ Ruby 1.9: multi-threaded on 1.9.✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).✤ JRuby 1.6: single threaded.✤ JRuby 1.6: multi-threaded.✤ C: single threaded.✤ C: native threads.
    31. GPU Not-So-Pros✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.)✤ SIMD can feel limiting. Benefits start to break down when you can’t design conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will cause some threads to idle.)✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)✤ Allocation limitations. Having 4GB of GPU shared memory does necessarily mean your can allocate one giant block.✤ Kernels are essentially written in C. May be difficult if you’re new to pointers or memory management. (You can bind to higher-level languages, though.)
    32. Q&A and bonus content.
    33. Names/Keywords To Know✤ NVidia (CUDA)✤ ATI (Owned by AMD) (Stream SDK)✤ Kronus (Drives OpenCL specification.)✤ Apple (ships OpenCL-capable drivers with all newer Macs running Snow Leopard.)
    34. NVidia GeForce GT 330MCL_DEVICE_NAME: GeForce GT 330MCL_DEVICE_VENDOR: NVIDIACL_DRIVER_VERSION: CLH 1.0CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPUCL_DEVICE_MAX_COMPUTE_UNITS: 6CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0CL_DEVICE_MAX_WORK_GROUP_SIZE: 512CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHzCL_DEVICE_ADDRESS_BITS: 32CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByteCL_DEVICE_GLOBAL_MEM_SIZE: 512 MByteCL_DEVICE_ERROR_CORRECTION_SUPPORT: noCL_DEVICE_LOCAL_MEM_TYPE: localCL_DEVICE_LOCAL_MEM_SIZE: 16 KByteCL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByteCL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLECL_DEVICE_IMAGE_SUPPORT: 1CL_DEVICE_MAX_READ_IMAGE_ARGS: 128CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_INF_NAN CL_FP_ROUND_TO_NEARESTCL_DEVICE_2D_MAX_WIDTH 0CL_DEVICE_2D_MAX_HEIGHT 0CL_DEVICE_3D_MAX_WIDTH 0CL_DEVICE_3D_MAX_HEIGHT 0CL_DEVICE_3D_MAX_DEPTH 0CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
    35. Intel i7 CPU M 640, 2.80GHzCL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHzCL_DEVICE_VENDOR: IntelCL_DRIVER_VERSION: 1.0CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPUCL_DEVICE_MAX_COMPUTE_UNITS: 4CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0CL_DEVICE_MAX_WORK_GROUP_SIZE: 1CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHzCL_DEVICE_ADDRESS_BITS: 64CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByteCL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByteCL_DEVICE_ERROR_CORRECTION_SUPPORT: noCL_DEVICE_LOCAL_MEM_TYPE: globalCL_DEVICE_LOCAL_MEM_SIZE: 16 KByteCL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByteCL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLECL_DEVICE_IMAGE_SUPPORT: 1CL_DEVICE_MAX_READ_IMAGE_ARGS: 128CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEARESTCL_DEVICE_2D_MAX_WIDTH 0CL_DEVICE_2D_MAX_HEIGHT 0CL_DEVICE_3D_MAX_WIDTH 0CL_DEVICE_3D_MAX_HEIGHT 0CL_DEVICE_3D_MAX_DEPTH 0CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
    36. Great GPU UseCases✤ 3D rendering. (Obviously!)✤ Simulation. (Folding@home actual has a GPU client.)✤ Physics.✤ Probability.✤ Network reliability.✤ General floating-point speed-ups.✤ Augmentation of existing apps by offloading work to GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
    37. A Few Simple KernelsSeveral simple C kernels w/Java, OpenCL and JOCL in Eclipse HeliosDate
    38. Multi-DimensionalThreadingThreads can have an X, Y, and Zcoordinate within their threadgroup, which you use to determinewho you are and can thus deriveyour specific local parameters andminimize control flow changes.
    39. Links✤ OpenCL Programming Guide for OSX. (Really good!) http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html✤ NVidia CUDA: http://www.nvidia.com/object/cuda_home_new.html✤ ATI Stream: http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx✤ OpenCL: http://www.khronos.org/opencl/

    ×