For MountainWest RubyConf 2011 in Salt Lake City, Utah. By Preston Lee.
Twitter: @prestonism
Blog: http://prestonlee.com
Code: https://github.com/preston/ruby-gpu-examples
Slides: http://www.slideshare.net/preston.lee/
Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1
1. Ruby Supercomputing: Using The GPU For Massive Performance Speedup
Preston Lee, MBA, Translational Genomics Research Institute and Arizona State University
Twitter: @prestonism · gmail: conmotto
http://prestonlee.com
http://github.com/preston
http://www.slideshare.net/preston.lee/
Last Updated: March 17th, 2011.
2. git@github.com:preston/ruby-gpu-examples.git
Grab the code now if you want, but to run all the examples you’ll need the NVIDIA development driver and toolkit, Ruby 1.9, the “barracuda” gem, and JRuby (without the barracuda gem) on a multi-core OS X Snow Leopard system. This takes time to set up, so maybe just chillax for now?
3. Let’s find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg
4. Math. Yay!
✤ The inner-most ring is ring #1.
✤ Total area of ring #5 is π times the square of its radius. (πr²)
✤ Area of only ring #5 is πr² minus the area of ring #4.
✤ (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2)) (quick check below)
✤ Let’s find the area of every ring...
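As a quick check of the formula (my arithmetic, not from the original deck): ring #5 alone is (π · 5²) − (π · 4²) = 25π − 16π = 9π ≈ 28.27.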
5. 1st working attempt. (Single-threaded Ruby.)
# RINGS is assumed here for illustration; the example repo defines its own value.
RINGS = 10_000_000

def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

(1..RINGS).each do |i|
  puts ring_area(i)
end
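The deck compares the speed of each attempt; as a minimal sketch (my addition, Ruby standard library only), one way to time a variant:

require 'benchmark'

# Reports user/system/real time for the single-threaded attempt.
puts Benchmark.measure {
  (1..RINGS).each { |i| ring_area(i) }
}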
6. 2nd working attempt. (Multi-threaded Ruby.)
def ring_area(radius)
  (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

# Multi-thread it! (threads must start as an empty array so the handles can
# be collected and joined; NUM_THREADS is assumed to divide RINGS evenly.)
threads = []
rings_per_thread = RINGS / NUM_THREADS
(0..(NUM_THREADS - 1)).each do |i|
  offset = i * rings_per_thread
  threads << Thread.new(rings_per_thread, offset) do |num, offset|
    last = offset + num - 1
    (offset..last).each do |radius|
      ring_area(radius)
    end
  end
end
threads.each(&:join)
8. Speed: Your Primary Options
1. Pure Ruby in a big loop. (Single-threaded.)
2. Pure Ruby, multi-threading smaller loops. (Limited to a single core on 1.9 due to the GIL, but not on JRuby etc.)
3. C extension, single thread.
4. C extension, pthreads.
5. “Divide and conquer.”
12. CPU Concurrency
✤ Ideally, asynchronous tasks across multiple physical and/or logical cores.
✤ POSIX threading is the standard.
✤ Producer/Consumer pattern typically used to account for differences in machine execution time. (See the sketch after this list.)
✤ Concurrency generally implemented with Time-Division Multiplexing. A CPU with 4 cores can run 100 threads, but the OS will time-share execution, more-or-less fairly.
✤ MIMD: Multiple Instruction, Multiple Data.
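A minimal Producer/Consumer sketch (my addition, plain Ruby standard library; the thread and item counts are arbitrary): one producer feeds work into a thread-safe Queue, and several consumers drain it at whatever pace each manages.

require 'thread' # Queue ships with Ruby 1.9's standard library

queue = Queue.new
consumer_count = 4

# Producer: push the work items, then one stop signal (nil) per consumer.
producer = Thread.new do
  (1..1_000).each { |radius| queue << radius }
  consumer_count.times { queue << nil }
end

# Consumers: pop work until the stop signal arrives.
consumers = consumer_count.times.map do
  Thread.new do
    while (radius = queue.pop)
      (Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
    end
  end
end

producer.join
consumers.each(&:join)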
13. Common CPU Issues
✤ Sometimes we need insane numbers of threads per host.
✤ Lock performance.
✤ Potential for excessive virtual memory swapping.
✤ Many tests are non-deterministic.
✤ Concurrency modeling is difficult to get correct.
14. Can’t we just execute every instruction concurrently, but with different data?
(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
19. GPU Brief History
✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing luminosity values of each pixel on the screen concurrently.
✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second: 786,432 pixels @ 60 Hz => 47,185,920 potential pixel pushes per second!
✤ Running 786,432 threads makes the OS unhappy. :(
✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every pixel in a display, each can be computed concurrently. Exactly concurrently.
✤ Generally fast floating point. (Critical for scientific computing.)
✤ SIMD: Single Instruction, Multiple Data.
29. GPU Pros
✤ SIMD architecture can run thousands of threads concurrently.
✤ Many synchronization issues are designed out.
✤ More FLOPS (floating point operations per second) than host CPU.
✤ Synchronously execute the same instruction for different data points.
31. OpenCL
(Image from Khronos Group.)
Mac OS X Snow Leopard ships with support for OpenCL 1.0, but you still need the development driver, SDK, etc.
32. OpenCL Terms
✤ Kernel: Your custom function(s) that will run on the Device. (Unrelated to the OS kernel.)
✤ Device: something that computes, such as a GPU chip. (You probably have 1, but then you might have 0... or 2+.) Could also be your CPU!
✤ Platform: Container for all your devices. When running code on your local machine you’ll only have one “platform” instance. (A cluster of GPU-heavy systems connected over a network would yield multiple available platforms, but OpenCL work in this area is not yet complete.)
✤ Device-specific terms: thread group, streaming multi-processor, etc.
33. Code!
✤ Ruby 1.9: single threaded.
✤ Ruby 1.9: multi-threaded on 1.9.
✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings; see the sketch after this list).
✤ JRuby 1.6: single threaded.
✤ JRuby 1.6: multi-threaded.
✤ C: single threaded.
✤ C: native threads.
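A sketch of the barracuda variant (my addition, modeled on the gem's README style: Program compiles OpenCL C from a string and OutputBuffer receives results, though exact method signatures may differ between gem versions). The kernel itself is OpenCL C:

require 'barracuda'
include Barracuda

RINGS = 1_000_000

# One work item per ring: each thread derives its radius from its global ID,
# so every thread executes the same instructions on different data.
program = Program.new <<-'KERNEL'
  __kernel void ring_area(__global float *out) {
    int radius = get_global_id(0) + 1;
    float pi = 3.14159265f;
    out[radius - 1] = (pi * radius * radius)
                    - (pi * (radius - 1) * (radius - 1));
  }
KERNEL

out = OutputBuffer.new(:float, RINGS)    # device-writable result buffer
program.ring_area(out, :times => RINGS)  # enqueue RINGS work items
areas = out.data                         # copy results back to host Ruby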
35. GPU Not-So-Pros
✤ Copying data in system RAM to/from GPU shared memory is not free. (In my own testing, often the slowest link in the chain.)
✤ SIMD can feel limiting. Benefits start to break down when you can’t design conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will cause some threads to idle; see the sketch after this list.)
✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)
✤ Allocation limitations. Having 4GB of GPU shared memory does not necessarily mean you can allocate one giant block.
✤ Kernels are essentially written in C. May be difficult if you’re new to pointers or memory management. (You can bind to higher-level languages, though.)
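To show what “designing conditions out” can look like (my sketch, not from the deck; plain OpenCL C held in Ruby strings, unattached to any particular binding): the branchy version forces the hardware to serialize the two paths, while the mask version keeps every thread on one instruction stream.

# Divergent: threads 0..42 sit idle while the others take the branch.
BRANCHY = <<-'KERNEL'
  __kernel void branchy(__global float *x, __global const float *y) {
    int tid = get_global_id(0);
    if (tid > 42) { x[tid] += y[tid]; }
  }
KERNEL

# Branchless: (tid > 42) evaluates to 1 or 0, so every thread executes
# the same add; the mask zeroes it out where the condition is false.
BRANCHLESS = <<-'KERNEL'
  __kernel void branchless(__global float *x, __global const float *y) {
    int tid = get_global_id(0);
    float mask = (float)(tid > 42);
    x[tid] += mask * y[tid];
  }
KERNEL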
39. Intel i7 CPU M 640, 2.80GHz
CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 0/0/0
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2800 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1536 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 6144 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_DEVICE_2D_MAX_WIDTH: 0
CL_DEVICE_2D_MAX_HEIGHT: 0
CL_DEVICE_3D_MAX_WIDTH: 0
CL_DEVICE_3D_MAX_HEIGHT: 0
CL_DEVICE_3D_MAX_DEPTH: 0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
40. Great GPU Use Cases
✤ 3D rendering. (Obviously!)
✤ Simulation. (Folding@home actually has a GPU client.)
✤ Physics.
✤ Probability.
✤ Network reliability.
✤ General floating-point speed-ups.
✤ Augmentation of existing apps by offloading work to the GPU. Combining CPU/GPU paradigms is a perfectly valid approach.
41. A Few Simple Kernels
(Screenshot: several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios.)
42. Multi-Dimensional Threading
Threads can have an X, Y, and Z coordinate within their thread group, which you use to determine who you are, derive your specific local parameters, and minimize control flow changes. (A sketch follows.)
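As a hedged illustration (my addition; the kernel name, arguments, and 2D dispatch are assumptions, with the OpenCL C held in a Ruby string as in the earlier examples): each work item reads its X and Y coordinates to locate its own matrix element, with no per-thread branching.

# Illustrative OpenCL C kernel source. Dispatched over a 2D global range
# (width x height), each work item flattens its (x, y) coordinates into
# the index of the one element it owns.
MATRIX_ADD_KERNEL = <<-'KERNEL'
  __kernel void matrix_add(__global const float *a,
                           __global const float *b,
                           __global float *out,
                           const int width) {
    int x = get_global_id(0);  // column: X coordinate of this work item
    int y = get_global_id(1);  // row: Y coordinate of this work item
    int i = y * width + x;     // flatten (x, y) into a 1D array index
    out[i] = a[i] + b[i];      // same instruction, different data
  }
KERNEL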