PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Software Libraries
for CUDA & OpenCL

Heterogeneous Computing is Hard
Two Examples:
1. Median Filtering

2. Local Windowing

Median Filtering
Increasingly
Difficult

Local Windowing
 The best algorithm depends on which device is in the system.

Times in ms:
              Device 1   Device 2   Device 3   Device 4
Algorithm 1   395        599        244        102
Algorithm 2   270        703        241        103
Algorithm 3   699        407        138        116
Algorithm 4   380        522        202         98
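
Read column by column, the timings give a different winner on almost every device. A minimal sketch of that selection follows. The helper name `fastest_algorithm` and the array layout are illustrative assumptions, and all entries are assumed to share the ms unit shown on the first one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Timings from the Local Windowing slide, assumed to all be in milliseconds.
// Rows are Algorithms 1-4, columns are Devices 1-4.
constexpr std::array<std::array<int, 4>, 4> kTimingsMs = {{
    {395, 599, 244, 102},
    {270, 703, 241, 103},
    {699, 407, 138, 116},
    {380, 522, 202, 98},
}};

// Returns the 1-based number of the fastest algorithm for a 1-based device.
// Hypothetical helper for illustration only.
int fastest_algorithm(int device) {
    assert(device >= 1 && device <= 4);
    const std::size_t col = static_cast<std::size_t>(device - 1);
    std::size_t best = 0;
    for (std::size_t row = 1; row < kTimingsMs.size(); ++row) {
        if (kTimingsMs[row][col] < kTimingsMs[best][col]) best = row;
    }
    return static_cast<int>(best) + 1;
}
```

With these numbers the helper picks Algorithm 2 for Device 1, Algorithm 3 for Devices 2 and 3, and Algorithm 4 for Device 4.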

Why Software Libraries Are Great
 Reduce many lines of code to one line
 Obsessively tuned by experts; faster than DIY

 Well-tested and maintained
 Continuously improving

Five Influencers (besides price)

Performance

Portability

Programmability

Scalability

Community

Performance & Programmability

[Chart: approaches plotted by speed (slower to faster) against effort (time-consuming to easy-to-use): SSE or AVX, writing kernels, compiler directives, using libraries]

Portability
 Flavors of portability
 HW vendor options
 Accelerator options (GPU, coprocessor, FPGA)
 CPU fallback
 High-performance mobile computing

 Libraries can provide portability

Scalability
 Always start with one device
 Potential headaches of adding devices
 Performance hit
 Development complexity

 Libraries can make scaling easy

Community
 What do you do when bugs arise?
 Continuous refinement

 Someone to answer questions
 Libraries can have great community support

Benefits of Using a Library

Libraries eliminate hidden costs of software development.

[Figure: cost-versus-time charts labeled "Pain" and "Pleasure"; cost components shown: Porting, Maintenance, Documentation, Test and QA, Development]

ArrayFire: Technical Computing

 Super easy to program
 Highly optimized

Scalability
 Multi-GPU is 1-line of code
array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);           // change GPUs
    array x = randu(5,5);   // add work to GPU’s queue
    y[i] = fft(x);          // more work in queue
}
// all GPUs are now computing simultaneously

Community
 Over 8,000 posts at
http://forums.accelereyes.com
 Nightly library update releases
 Stable releases a few times a year
 v2.0 coming at the end of summer

Example Case Studies 1

Speedup   Application        Organization
17X       Neuro-imaging      Georgia Tech
20X       Viral Analyses     CDC
20X       Video Processing   Google
45X       Radar Imaging      System Planning
12X       Medical Devices    Spencer Tech

Example Case Studies 2

Speedup   Application      Organization
5X        Weather Models   NCAR
35X       Power Eng        IIT India
17X       Surveillance     BAE Systems
70X       Drug Delivery    Georgia Tech
35X       Bioinformatics   Leibnitz

Hundreds of Functions
reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc

dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power

convolutions
• 2D, 3D, ND

FFTs
• 2D, 3D, ND

image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms

interpolate & scale
• vectors, matrices
• rescaling

sorting
• along any dimension
• sort detection

and many more…

Intuitive Functions (estimate π)
#include <stdio.h>
#include <arrayfire.h>
using namespace af;
int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(x*x + y*y < 1) / n;
    printf("pi = %g\n", pi);
    return 0;
}
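
For readers without a GPU at hand, the same Monte Carlo estimate can be sketched on the CPU with the C++ standard library. This is not ArrayFire code. The function name `estimate_pi` and the seed parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draw n points uniformly in the unit square and count how many fall inside
// the quarter circle x*x + y*y < 1. Hypothetical CPU-only helper.
double estimate_pi(std::uint64_t n, std::uint32_t seed) {
    assert(n > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    // (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```

The loop body is the scalar form of the vectorized expression `x*x + y*y < 1` on the slide; the library version evaluates all samples at once instead of one per iteration.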

array x = randu(n, f32);
array y = randu(n, f64);
array z = randu(n, u32);

Data Types

array: container object
f32: real, single precision
f64: real, double precision
c32: complex, single precision
c64: complex, double precision
s32: signed integer
u32: unsigned integer
b8: boolean byte

ND Support
vectors

matrices

volumes

… ND

Subscripting
ArrayFire Keywords: end, span
A(1,1)

A(1,span)

A(end,1)

A(end,span)

A(span,span,2)

Generate Arrays

constant(0,3)         // 3-by-1 column of zeros, single-precision
constant(1,3,2,f64)   // 3-by-2 matrix, double-precision
randu(1,8)            // row vector (1x8) of random values (uniform)
randn(2,2)            // square matrix (2x2) random values (normal)
identity(3,3)         // 3-by-3 identity
randu(5,7,c32)        // complex random values

Create Arrays from CPU Data

float hA[] = {0,1,2,3,4,5};
array A(2,3,hA); // 2x3 matrix, single-precision
print(A);
// A = [ 0 2 4 ]
//     [ 1 3 5 ]

Note: Fortran storage order
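
Fortran (column-major) order means consecutive host values fill one column before moving to the next, which is why `{0,1,2,3,4,5}` prints as rows `0 2 4` and `1 3 5`. A small sketch of the index arithmetic, using a hypothetical helper that is not an ArrayFire call:

```cpp
#include <cassert>
#include <cstddef>

// Column-major lookup: element (r, c) of a rows-by-cols matrix lives at
// offset r + c * rows in the flat host buffer. Illustrative helper only.
float at_column_major(const float* data, std::size_t rows, std::size_t cols,
                      std::size_t r, std::size_t c) {
    assert(r < rows && c < cols);
    return data[r + c * rows];
}
```

For the 2x3 example above, row 0, column 1 maps to offset 0 + 1 * 2 = 2, which holds the value 2.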

Arithmetic

array R = randu(3,3);
array C = constant(1,3,3) + complex(sin(R)); // C is c32
// rescale complex values to unit circle
array a = randn(5,c32);
print(a / abs(a));
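
Dividing a complex value by its magnitude keeps its angle and sets its length to 1. The element-wise operation can be sketched with `std::complex`; the helper name `to_unit_circle` is an assumption for illustration, and the sketch assumes no entry is exactly zero.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Divide each complex value by its magnitude so it lands on the unit circle.
// Hypothetical CPU helper; zero entries are rejected by the assertion.
std::vector<std::complex<float>> to_unit_circle(std::vector<std::complex<float>> values) {
    for (auto& v : values) {
        const float magnitude = std::abs(v);
        assert(magnitude > 0.0f);
        v /= magnitude;
    }
    return values;
}
```

For example, 3 + 4i has magnitude 5 and maps to 0.6 + 0.8i.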

L-2 Norm Example

// calculate L-2 norm of every column
sqrt(sum(pow(X, 2)))      // norm of every column vector
sqrt(sum(pow(X, 2), 0))   // ..same
sqrt(sum(pow(X, 2), 1))   // norm of every row vector
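
Per the comments on the slide, summing along dimension 0 yields one norm per column and summing along dimension 1 yields one norm per row. A plain C++ sketch of both reductions over a column-major buffer follows; the function names are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L-2 norm of every column of a rows-by-cols matrix stored column-major.
std::vector<double> column_norms(const std::vector<double>& x,
                                 std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<double> norms(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        double sum_sq = 0.0;
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = x[r + c * rows];
            sum_sq += v * v;
        }
        norms[c] = std::sqrt(sum_sq);
    }
    return norms;
}

// L-2 norm of every row of the same matrix.
std::vector<double> row_norms(const std::vector<double>& x,
                              std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<double> norms(rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        double sum_sq = 0.0;
        for (std::size_t c = 0; c < cols; ++c) {
            const double v = x[r + c * rows];
            sum_sq += v * v;
        }
        norms[r] = std::sqrt(sum_sq);
    }
    return norms;
}
```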

Subscripting Examples

array A = randu(3,3);
array a1 = A(0);     // first element
array a2 = A(0,1);   // first row, second column
A(1,span);           // second row
A.row(end);          // last row
A.cols(1,end);       // all but first column

Subscripting Examples

float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};
array b(1,10,b_ptr);
b(seq(3));         // {0,1,2}
b(seq(1,7));       // {1,2,3,4,5,6,7}
b(seq(1,2,7));     // {1,3,5,7}
b(seq(0,2,end));   // {0,2,4,6,8}
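
The results in the comments imply a (first, step, last) argument order with an inclusive last index. A sketch of that index generation, using a hypothetical `make_seq` helper rather than the library's own `seq` type:

```cpp
#include <cassert>
#include <vector>

// Builds the index list first, first + step, ... up to and including last,
// following the argument order inferred from the slide. Illustrative only.
std::vector<int> make_seq(int first, int step, int last) {
    assert(step > 0);
    std::vector<int> indices;
    for (int i = first; i <= last; i += step) indices.push_back(i);
    return indices;
}
```

For the ten-element array above, `end` stands for index 9, so `make_seq(0, 2, 9)` reproduces the last line of the slide.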

Data Manipulation

// setting entries to a constant
A(span) = 4;          // fill entire array
A.row(0) = -1;        // first row
A(seq(3)) = 3.1415;   // first three elements

Data Manipulation

// copy in another matrix
array B = constant(1,4,4,f64);
B.row(0) = randu(1,4,f32); // set row (upcast)

Data Manipulation

// index with another array
float h_inds[] = {0, 4, 2, 1}; // zero-based
array inds(1,4,h_inds);
B(inds) = randu(4,1); // set to random
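
Indexing with an array is a scatter: each value is written to the position named by the matching index. The sketch below assumes the indices address the flattened storage of `B`, which is an assumption about the slide's semantics. The helper name `assign_at` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes values[k] into target[indices[k]] for every k (zero-based indices).
// Illustrative CPU helper, not a library call.
void assign_at(std::vector<double>& target,
               const std::vector<std::size_t>& indices,
               const std::vector<double>& values) {
    assert(indices.size() == values.size());
    for (std::size_t k = 0; k < indices.size(); ++k) {
        assert(indices[k] < target.size());
        target[indices[k]] = values[k];
    }
}
```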

Linear Algebra

// matrix factorization
array L, U;
lu(L, U, randu(n,n));
// linear systems: A x = b
array A = randu(n,n), b = randu(n,1);
array x = solve(A,b);
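
To make concrete what a call like `solve(A,b)` has to compute, here is a textbook CPU sketch of Gaussian elimination with partial pivoting. It does not claim to mirror how the library implements its solver, and the names `Matrix` and `solve_dense` are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, a[row][col]

// Solves A x = b for a square, non-singular A.
std::vector<double> solve_dense(Matrix a, std::vector<double> b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    for (std::size_t k = 0; k < n; ++k) {
        assert(a[k].size() == n);
        // Partial pivoting: bring the largest entry of column k to the diagonal.
        std::size_t pivot = k;
        for (std::size_t r = k + 1; r < n; ++r) {
            if (std::fabs(a[r][k]) > std::fabs(a[pivot][k])) pivot = r;
        }
        assert(a[pivot][k] != 0.0);  // a zero pivot means A is singular
        std::swap(a[k], a[pivot]);
        std::swap(b[k], b[pivot]);
        // Eliminate column k from the rows below.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double factor = a[r][k] / a[k][k];
            for (std::size_t c = k; c < n; ++c) a[r][c] -= factor * a[k][c];
            b[r] -= factor * b[k];
        }
    }
    // Back substitution on the upper-triangular system.
    std::vector<double> x(n, 0.0);
    for (std::size_t i = n; i-- > 0;) {
        double sum = b[i];
        for (std::size_t c = i + 1; c < n; ++c) sum -= a[i][c] * x[c];
        x[i] = sum / a[i][i];
    }
    return x;
}
```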

Graphics Functions
 asynchronous
 non-blocking

 throttled at 35 Hz

Graphics Functions
 non-blocking primitives
 surface - surface plotting (2d data)
 image - intensity image visualization
 arrows - vector fields
 plot2 - line plotting (x,y)
 plot3 - scatter plot (x,y,z)
 volume - volume rendering for 3d data

Graphics Functions
 utility commands
 keep_on
 keep_off
 subfigure
 palette
 clearfig
 draw (blocking)
 figure
 title
 close

Graphics Example
#include <arrayfire.h>
using namespace af;
int main() {
    // random 3d surface
    const int n = 256;
    while (1) {
        array x = randu(n,n);
        // 3d surface plot
        surface(x);
    }
    return 0;
}

GFOR Parallel Loops
Parallel matrix multiplications (1 kernel launch)
gfor (array i, 3)
C(span,span,i) = A(span,span,i) * B;

C(,,1) = A(,,1) * B
C(,,2) = A(,,2) * B
C(,,3) = A(,,3) * B

GFOR Parallel Loops
gfor (array i, 3)
C(,,1:3) = A(,,1:3) * B

GFOR Parallel Loops
gfor (array i, 3)
C = A * B
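
The arithmetic behind the GFOR example is a batch of independent products: every slice of A is multiplied by the same B. The sequential CPU sketch below shows only that arithmetic. It does not reproduce the single kernel launch the slides describe, and the names `Matrix`, `multiply` and `multiply_slices` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, m[row][col]

// Plain triple-loop matrix product.
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix c(a.size(), std::vector<double>(b[0].size(), 0.0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b[0].size(); ++j)
            for (std::size_t k = 0; k < b.size(); ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// C(,,i) = A(,,i) * B for every slice i. The iterations do not depend on
// each other, which is what allows them to run in parallel.
std::vector<Matrix> multiply_slices(const std::vector<Matrix>& a_slices, const Matrix& b) {
    std::vector<Matrix> c_slices;
    c_slices.reserve(a_slices.size());
    for (const Matrix& a : a_slices) c_slices.push_back(multiply(a, b));
    return c_slices;
}
```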

Four Quick Stories in Conclusion
Advertising

Healthcare

Finance

Oil & Gas

Acceleration Demands
 The CPU code
 45 seconds for one session to complete
 Highly optimized OpenMP code leveraging all cores

 1,000 sessions/minute required 750 CPU nodes

 Convert Mac-only research code to C#
 Focus on efficiently developed robust performance

ArrayFire Solution
 Linear algebra
 Matrix multiply, Transpose
 Linear solvers

 Image processing
 Convolutions
 Fast Fourier Transform
 Correlation Filter
 Sobel Filter
 Gaussian Blur

 OpenCV functions
 Custom edge detection

 Graphics
 Rendering points

 Reductions
 Min, Max, Sum

 JIT
 Increased productivity

Results
 3X acceleration
 Dropped from 750 nodes to 250 nodes
 Benefit from ongoing library support
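
The node counts follow from the session rate and the time per session. A small worked sketch, assuming each node works on one session at a time (the slides give only the totals) and using a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Nodes needed to sustain a session rate when each node handles one session
// at a time. The one-session-per-node model is an assumption.
int nodes_required(double sessions_per_minute, double seconds_per_session) {
    assert(sessions_per_minute > 0.0 && seconds_per_session > 0.0);
    return static_cast<int>(std::ceil(sessions_per_minute * seconds_per_session / 60.0));
}
```

At 1,000 sessions per minute and 45 seconds per session this gives 750 nodes. A 3X speedup brings a session to 15 seconds and the count to 250 nodes, matching the figures above.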

Culture-Free Microbiology
Computer-controlled pipettes
 Filling
 Filled

Microscope
 A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to program

 This platform provides a rapid alternative to traditional cell culturing for susceptibility testing
 The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic
 Culture-based methods can take 2-3 days, which is problematic for many critically ill patients

ArrayFire Solution
 Image Processing
 Heavily filter based

 Convolve, Filter, Resize

 Image Statistics
 Mean, StdDev, Variance

Results
 Realtime throughput

Kernel                                                   Speedup
Image Registration (Heavy use of statistics functions)   73.17x
Custom Filter (Prep Center Image)                        26.48x
Gaussian Blur                                            2.19x

 CPU-only version was taking 115 hours
 Needs to run entire database of portfolios each night before trading begins next day

ArrayFire Solution
 Statistics Functions
 Random number generation
 Variance
 Exponentials

 Arithmetic
 Sqrt
 Element-wise math

 Reductions
 Sum

Results
 GPU version drops runtime to 7 hours and meets the requirement to run overnight
 Time left over to try more permutations

Oil Well Monitoring
 Ordinary telecom fiber used as an efficient, high fidelity acoustic sensor
 Threaded along the length of oil well

 Require realtime signal processing from 24 channels per unit with an onsite server
 CPU-only solution was 5x slower than realtime

ArrayFire Solution
 Heavy usage of signal filtering functions
 FIR

 IIR
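
For readers unfamiliar with the two filter families, a minimal CPU sketch follows: an FIR filter is a weighted sum of recent inputs, while an IIR filter also feeds back earlier outputs. The function names and the first-order IIR form are assumptions chosen for illustration and are not taken from the application described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// FIR: y[n] = sum over k of b[k] * x[n - k]; inputs before the start count as zero.
std::vector<double> fir_filter(const std::vector<double>& b, const std::vector<double>& x) {
    assert(!b.empty());
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < b.size() && k <= n; ++k) {
            y[n] += b[k] * x[n - k];
        }
    }
    return y;
}

// First-order IIR: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1], with y[-1] = 0.
std::vector<double> iir_first_order(double alpha, const std::vector<double>& x) {
    assert(alpha > 0.0 && alpha <= 1.0);
    std::vector<double> y(x.size(), 0.0);
    double previous = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        previous = alpha * x[n] + (1.0 - alpha) * previous;
        y[n] = previous;
    }
    return y;
}
```

The FIR loop is independent per output sample, while the IIR recurrence depends on the previous output, which makes it harder to parallelize.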

Results
 6x performance improvements in signal processing
 20x overall performance improvement through more efficiently vectorized code

Software Shop for CUDA & OpenCL
 Two ways to work with us:
 Use

 Hire our CUDA & OpenCL developers
Code development; CUDA & OpenCL training
