Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

@prestonism

Ruby Supercomputing gmail: conmotto
http://prestonlee.com
http://github.com/preston
Using The GPU For Massive Performance Speedup http://www.slideshare.net/preston.lee/
Last Updated: March 17th, 2011.
Preston Lee, MBA, Translational Genomics Research Institute And Arizona State University

git@github.com:preston/ruby-gpu-examples.git

Grab the code now if you want, but to run all the examples
you’ll need the NVIDIA development driver and toolkit,
Ruby 1.9, the “barracuda” gem, and JRuby (without the
barracuda gem) on a multi-core OS X Snow Leopard system.
This takes time to set up, so maybe just chillax for now?

Let’s find the area of each ring.
http://openwalls.com/image/7358/the_texture_of_the_tree_rings_1440x900.jpg

Math. Yay!

✤ The inner-most ring is ring #1.

✤ Total area of ring #5 is π times #1
the square of the radius. (πrr) #4

✤ Area of only ring #5 is πrr #5
minus area of ring #4.

✤ (Math::PI * (radius ** 2)) -
(Math::PI * ((radius - 1) ** 2))
✤ Let’s ﬁnd the area of every ring...

1st working attempt. #1
(Single-threaded Ruby.)

def ring_area(radius)

(Math::PI * (radius ** 2)) - (Math::PI * ((radius - 1) ** 2))
end

(1..RINGS).each do |i|
puts ring_area(i)
end

2nd working attempt. #1
(Multi-threaded Ruby.)

def ring_area(radius)

end

# Multi-thread it!
(0..(NUM_THREADS - 1)).each do |i|

rings_per_thread = RINGS / NUM_THREADS

offset = i * rings_per_thread

threads << Thread.new(rings_per_thread, offset) do |num, offset|

last = offset + num - 1

(offset..(last)).each do |radius|

ring_area(radius)

end

end
end
threads.each do |t| t.join end

3rd/4th working attempt. #1
(Single/Multi-threaded C. Ohhh crap...)

/*
printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec);
A baseline CPU-based benchmark program CPU/GPU performance comparision.
Approximates the cross-sectional area of every tree ring in a tree trunk in serial and in parallel
printf("nDone!nn");
by taking the total area at a given radius and subtracting the area of the closest inner ring.
return EXIT_SUCCESS;
Copyright © 2011 Preston Lee. All rights reserved. }
http://prestonlee.com
*/ /* Approximate the cross-sectional area between each pair of consecutive tree rings

in serial */
#include <stdio.h> void calculate_ring_areas_in_serial(int rings) {
#include <stdlib.h>

calculate_ring_areas_in_serial_with_offset(rings, 0);
#include <math.h> }
#include <pthread.h>
#include <sys/time.h> void calculate_ring_areas_in_serial_with_offset(int rings, int thread) {

int i;
#include "tree_rings.h"
int offset = rings * thread;

int max = rings + offset;
#define DEFAULT_RINGS 1000000
float a = 0;
#define NUM_THREADS 8
for(i = offset; i < max; i++) {
#define DEBUG 0

a = (M_PI * pow(i, 2)) - (M_PI * pow(i - 1, 2));

}
int acc = 0; }

int main(int argc, const char * argv[]) { /* Approximate the cross-sectional area between each pair of consecutive tree rings

int rings = DEFAULT_RINGS;
in parallel on NUM_THREADS threads */
void calculate_ring_areas_in_parallel(int rings) {

if(argc > 1) {
pthread_t threads[NUM_THREADS];

rings = atoi(argv[1]);
int rc;

}
int t;

int rings_per_thread = rings / NUM_THREADS;

printf("nA baseline CPU-based benchmark program for CPU/GPU performance comparision.n");
ring_thread_data data[NUM_THREADS];

printf("Copyright © 2011 Preston Lee. All rights reserved.nn");

printf("tUsage: %s [NUM TREE RINGS]nn", argv[0]);
for(t = 0; t < NUM_THREADS; t++){

data[t].rings = rings_per_thread;

printf("Number of tree rings: %i. Yay!n", rings);

data[t].number = t;

rc = pthread_create(&threads[t], NULL, (void *) ring_job, (void *) &data[t]);

struct timeval start, stop, diff;

if (rc){

printf("ERROR; return code from pthread_create() is %dn", rc);

printf("nRunning serial calculation using CPU...ttt");
exit(-1);

gettimeofday(&start, NULL);
}

calculate_ring_areas_in_serial(rings);
}

gettimeofday(&stop, NULL);

timeval_subtract(&diff, &stop, &start);
for(t = 0; t < NUM_THREADS; t++){

printf("%ld.%06ld secondsn", (long)diff.tv_sec, (long)diff.tv_usec);

pthread_join(threads[t], NULL);

}

printf("Running parallel calculation using %i CPU threads...t", NUM_THREADS); }

gettimeofday(&start, NULL);

calculate_ring_areas_in_parallel(rings); void ring_job(ring_thread_data * data) {

gettimeofday(&stop, NULL);
calculate_ring_areas_in_serial_with_offset(data->rings, data->number);

timeval_subtract(&diff, &stop, &start); }

Speed: Your Primary Options

1. Pure Ruby in a big loop. (Single
threaded.) #1

2. Pure Ruby, multi-threading
smaller loops. (Limited to using
a single core on 1.9 due to the
GIL, but not on jruby etc.)

3. C extension, single thread.

4. C extension, pthreads.

5. “Divide and conquer.”

CPU Concurrency

✤ Ideally, asynchronous tasks across multiple physical and/or logical
cores.

✤ POSIX threading is the standard.

✤ Producer/Consumer pattern typically used to account for differences
in machine execution time.

✤ Concurrency generally implemented with Time-Division
Multiplexing. An CPU with 4 cores can run 100 threads, but the OS
will time-share execution time, more-or-less fairly.

✤ MIMD: Multiple Instruction Multiple Data

Common CPU Issues

✤ Sometime we need insane numbers of threads per host.

✤ Lock performance.

✤ Potential for excessive virtual memory swapping.

✤ Many tests are non-deterministic.

✤ Concurrency modeling is difﬁcult to get correct.

Can’t we just execute every
instruction
concurrently, but with different data?

No.

No.

Multiple Instruction, Multiple Data. (MIMD)

Yes!

Single Instruction, Multiple Data. (SIMD)

GPU Brief History

✤ Graphics Processing Units were initially developed for 3D scene rendering, such as computing
luminosity values of each pixel on the screen concurrently.

✤ Consider a 1024x768 display. That’s 786,432 pixels updating 60 times per second. 786,432 pixles
@ 60Hz => 47,185,920 => potential pixel pushes per second!

✤ Running 768,432 threads makes OS unhappy. :(

✤ Sometimes it’d be great to have all calculations finish simultaneously. If you’re updating every
pixel in a display, each can be computed concurrently. Exactly concurrently.

✤ Generally fast floating point. (Critical for scientific computing.)

✤ SIMD: Single Instruction Multiple Data

Hardware Examples

Images from NVIDIA and ATI.

GPU Pros

✤ SIMD architecture can run thousands of threads concurrently.

✤ Many synchronization issues are designed out.

✤ More FLOPS (ﬂoating point operations per second) than host CPU.

✤ Synchronously execute the same instruction for different data points.

NVidia Tesla C1060
CL_DEVICE_NAME:

Tesla C1060
CL_DEVICE_VENDOR:

NVIDIA Corporation
CL_DRIVER_VERSION:

260.81
CL_DEVICE_TYPE:

CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:

30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:
3
CL_DEVICE_MAX_WORK_ITEM_SIZES:

512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:

512
CL_DEVICE_MAX_CLOCK_FREQUENCY:

1296 MHz
CL_DEVICE_ADDRESS_BITS:

32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:

1014 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:

4058 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:
no
CL_DEVICE_LOCAL_MEM_TYPE:

local
CL_DEVICE_LOCAL_MEM_SIZE:

16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
64 KByte
CL_DEVICE_QUEUE_PROPERTIES:

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:

1
CL_DEVICE_MAX_READ_IMAGE_ARGS:

128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:

8
CL_DEVICE_SINGLE_FP_CONFIG:

CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_FMA
CL_DEVICE_2D_MAX_WIDTH

4096
CL_DEVICE_2D_MAX_HEIGHT

32768

2048

2048
CL_DEVICE_3D_MAX_DEPTH

2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>
CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1

OpenCL
(Image from Khronos Group.)

Mac OS X Snow Leopard Ships With Support For OpenCL 1.0, But You Still Need The Development Driver, SDK Etc.

OpenCL Terms

✤ Kernel: Your custom function(s) that will run on the Device.
(Unrelated to the OS kernel.)

✤ Device: something that computes, such as a GPU chip. (You probably
have 1, but then you might have 0... or 2+.) Could also be your CPU!

✤ Platform: Container for all your devices. When running code on your
local machine you’ll only have one “platform” instance. (A cluster of
GPU-heavy systems connected over a network would yield multiple
available platforms, but OpenCL work in this area is not yet
complete.)

✤ Device-speciﬁc terms: thread group, streaming multi-processor, etc.

Code!

✤ Ruby 1.9: single threaded.
✤ Ruby 1.9: multi-threaded on 1.9.
✤ Ruby 1.9: barracuda (native OpenCL/NVIDIA CUDA bindings).
✤ JRuby 1.6: single threaded.
✤ JRuby 1.6: multi-threaded.
✤ C: single threaded.
✤ C: native threads.

GPU Not-So-Pros

✤ Copying data in system RAM to/from GPU shared memory is not free. (In my
own testing, often the slowest link in the chain.)

✤ SIMD can feel limiting. Beneﬁts start to break down when you can’t design
conditions out of your algorithm. (E.g. “if(thread_number > 42) { x += y; }” will
cause some threads to idle.)

✤ 64-bit CPU does not imply 64-bit GPU! (All GPUs I’ve used are 32-bit or less.)

✤ Allocation limitations. Having 4GB of GPU shared memory does necessarily
mean your can allocate one giant block.

✤ Kernels are essentially written in C. May be difﬁcult if you’re new to pointers or
memory management. (You can bind to higher-level languages, though.)

Names/Keywords To Know

✤ NVidia (CUDA)

✤ ATI (Owned by AMD) (Stream SDK)

✤ Kronus (Drives OpenCL speciﬁcation.)

✤ Apple (ships OpenCL-capable drivers with all newer Macs running
Snow Leopard.)

NVidia GeForce GT 330M
CL_DEVICE_NAME:

GeForce GT 330M
CL_DEVICE_VENDOR:

NVIDIA
CL_DRIVER_VERSION:

CLH 1.0
CL_DEVICE_TYPE:

CL_DEVICE_TYPE_GPU

6
3

0/0/0

512

1100 MHz

32

128 MByte

512 MByte
no

local

16 KByte
64 KByte


1

128

8

CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST

0

0

0

0

0

Intel i7 CPU M 640, 2.80GHz
CL_DEVICE_NAME:

Intel(R) Core(TM) i7 CPU M 640 @ 2.80GHz
CL_DEVICE_VENDOR:

Intel
CL_DRIVER_VERSION:

1.0
CL_DEVICE_TYPE:

CL_DEVICE_TYPE_CPU

4
3

0/0/0

1

2800 MHz

64

1536 MByte

6144 MByte
no

global

16 KByte
64 KByte


1

128

8

CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST

0

0

0

0

0

Great GPU Use
Cases
✤ 3D rendering. (Obviously!)

✤ Simulation. (Folding@home actual has a
GPU client.)

✤ Physics.

✤ Probability.

✤ Network reliability.

✤ General ﬂoating-point speed-ups.

✤ Augmentation of existing apps by ofﬂoading
work to GPU. Combining CPU/GPU
paradigms is a perfectly valid approach.

A Few Simple Kernels
Several simple C kernels w/Java, OpenCL and JOCL in Eclipse Helios

Date

Multi-
Dimensional
Threading
Threads can have an X, Y, and Z
coordinate within their thread
group, which you use to determine
who you are and can thus derive
your speciﬁc local parameters and
minimize control ﬂow changes.

Links

✤ OpenCL Programming Guide for OSX. (Really good!)
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html

✤ NVidia CUDA:
http://www.nvidia.com/object/cuda_home_new.html

✤ ATI Stream:
http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx

✤ OpenCL:
http://www.khronos.org/opencl/

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

Similar to Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1 (20)

Recently uploaded

Recently uploaded (20)

Ruby Supercomputing - Using The GPU For Massive Performance Speedup v1.1

Editor's Notes