PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Presentation PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos at the AMD Developer Summit (APU13) Nov. 11-13, 2013.


  1. Software Libraries for CUDA & OpenCL
  2. Heterogeneous Computing is Hard
     Two examples:
     1. Median Filtering
     2. Local Windowing
  3. Median Filtering: Increasingly Difficult
  4. Local Windowing
     • The best algorithm to use changes depending on which device is in the system.

                      Device 1   Device 2   Device 3   Device 4
       Algorithm 1    395 ms     599        244        102
       Algorithm 2    270        703        241        103
       Algorithm 3    699        407        138        116
       Algorithm 4    380        522        202        98
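To make the point of the table concrete, here is a minimal sketch that picks the fastest algorithm for each device from the timings above. It assumes the "ms" unit on the first cell applies to every cell; `kTimingsMs` and `fastest_algorithm` are hypothetical names used only for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Timings copied from the slide, assumed to be milliseconds throughout.
// Rows are Algorithms 1-4, columns are Devices 1-4.
constexpr int kTimingsMs[4][4] = {
    {395, 599, 244, 102},
    {270, 703, 241, 103},
    {699, 407, 138, 116},
    {380, 522, 202, 98},
};

// Returns the 1-based number of the fastest algorithm for a 1-based device.
// Hypothetical helper for illustration only.
int fastest_algorithm(int device) {
    assert(device >= 1 && device <= 4);
    std::size_t best = 0;
    for (std::size_t alg = 1; alg < 4; ++alg) {
        if (kTimingsMs[alg][device - 1] < kTimingsMs[best][device - 1]) {
            best = alg;
        }
    }
    return static_cast<int>(best) + 1;
}
```

With these numbers the winner is Algorithm 2 on Device 1, Algorithm 3 on Devices 2 and 3, and Algorithm 4 on Device 4, so no single algorithm is best everywhere.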
  5. Why Software Libraries Are Great
     • Reduce many lines of code to one line
     • Obsessively tuned by experts; faster than DIY
     • Well-tested and maintained
     • Continuously improving
  6. Five Influencers (besides price)
     • Performance
     • Portability
     • Programmability
     • Scalability
     • Community
  7. Performance & Programmability
     [Chart: axes Slower to Faster and Time-consuming to Easy-to-use; plotted: SSE or AVX]
  8. Performance & Programmability
     [Chart: same axes; plotted: SSE or AVX, Writing Kernels]
  9. Performance & Programmability
     [Chart: same axes; plotted: SSE or AVX, Writing Kernels, Compiler Directives]
  10. Performance & Programmability
     [Chart: same axes; plotted: SSE or AVX, Writing Kernels, Compiler Directives, Using Libraries]
  11. Performance
  12. Performance
  13. Portability
     • Flavors of portability
       • HW vendor options
       • Accelerator options (GPU, coprocessor, FPGA)
       • CPU fallback
       • High-performance mobile computing
     • Libraries can provide portability
  14. Scalability
     • Always start with one device
     • Potential headaches of adding devices
       • Performance hit
       • Development complexity
     • Libraries can make scaling easy
  15. Community
     • What do you do when bugs arise?
     • Continuous refinement
     • Someone to answer questions
     • Libraries can have great community support
  16. Benefits of Using a Library
     Libraries eliminate hidden costs of software development.
     [Figure: two cost-versus-time charts labeled "Pain" and "Pleasure"; cost components shown: Porting, Maintenance, Documentation, Test and QA, Development]
  17. ArrayFire: Technical Computing
  18. Performance & Programmability
     • Super easy to program
     • Highly optimized
  19. Portability
  20. Scalability
     • Multi-GPU is 1 line of code
         array *y = new array[n];
         for (int i = 0; i < n; ++i) {
             deviceset(i);           // change GPUs
             array x = randu(5,5);   // add work to GPU's queue
             y[i] = fft(x);          // more work in queue
         }
         // all GPUs are now computing simultaneously
  21. Community
     • Over 8,000 posts at http://forums.accelereyes.com
     • Nightly library update releases
     • Stable releases a few times a year
     • v2.0 coming at the end of summer
  22. Example Case Studies 1

       Application        Organization      Speedup
       Neuro-imaging      Georgia Tech      17X
       Viral Analyses     CDC               20X
       Video Processing   Google            20X
       Radar Imaging      System Planning   45X
       Medical Devices    Spencer Tech      12X
  23. Example Case Studies 2

       Application        Organization      Speedup
       Weather Models     NCAR              5X
       Power Eng          IIT India         35X
       Surveillance       BAE Systems       17X
       Drug Delivery      Georgia Tech      70X
       Bioinformatics     Leibnitz          35X
  24. Hundreds of Functions
     reductions
     • sum, min, max, count, prod
     • vectors, columns, rows, etc
     dense linear algebra
     • LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power
     convolutions
     • 2D, 3D, ND
     FFTs
     • 2D, 3D, ND
     image processing
     • filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms
     interpolate & scale
     • vectors, matrices
     • rescaling
     sorting
     • along any dimension
     • sort detection
     and many more…
  25. Intuitive Functions (estimate π)
         #include <stdio.h>
         #include <arrayfire.h>
         using namespace af;
         int main() {
             // 20 million random samples
             int n = 20e6;
             array x = randu(n,1), y = randu(n,1);
             // how many fell inside unit circle?
             float pi = 4 * sum<float>(x*x + y*y < 1) / n;
             printf("pi = %g\n", pi);
             return 0;
         }
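For comparison, the same Monte Carlo estimate can be sketched with only the C++ standard library. This is a sequential CPU version under stated assumptions, not ArrayFire code: `estimate_pi`, its parameters, and the fixed seed are illustrative choices that do not appear in the slides.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Estimates pi by sampling points uniformly in the unit square and counting
// the fraction that falls inside the quarter unit circle.
// Hypothetical helper; the seed parameter only makes runs repeatable.
double estimate_pi(std::uint64_t samples, std::uint32_t seed) {
    assert(samples > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) {
            ++inside;  // same test as the slide's (x*x + y*y < 1)
        }
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

The explicit loop and counter here correspond to the single expression `sum<float>(x*x + y*y < 1)` on the slide, which is the programmability argument the deck is making.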
  26. Data Types
         array x = randu(n, f32);
         array y = randu(n, f64);
         array z = randu(n, u32);

       Type     Meaning
       array    container object
       b8       boolean byte
       f32      real single precision
       f64      real double precision
       c32      complex single precision
       c64      complex double precision
       s32      signed integer
       u32      unsigned integer
  27. ND Support
     vectors, matrices, volumes, … ND
  28. Subscripting
     ArrayFire keywords: end, span
         A(1,1)
         A(1,span)
         A(end,1)
         A(end,span)
         A(span,span,2)
  29. Generate Arrays
         constant(0,3)         // 3-by-1 column of zeros, single-precision
         constant(1,3,2,f64)   // 3-by-2 matrix, double-precision
         randu(1,8)            // row vector (1x8) of random values (uniform)
         randn(2,2)            // square matrix (2x2) random values (normal)
         identity(3,3)         // 3-by-3 identity
         randu(5,7,c32)        // complex random values
  30. Create Arrays from CPU Data
         float hA[] = {0,1,2,3,4,5};
         array A(2,3,hA);   // 2x3 matrix, single-precision
         print(A);
         // A = [ 0 2 4 ]
         //     [ 1 3 5 ]
     Note: Fortran storage order
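The "Fortran storage order" note means the host buffer is read column by column. A minimal sketch of that indexing rule in plain C++ follows; `column_major_offset` and `element_at` are hypothetical helpers for illustration, not ArrayFire functions.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (row, col) in a column-major (Fortran-order) buffer
// holding a matrix with `rows` rows. Hypothetical helper.
std::size_t column_major_offset(std::size_t row, std::size_t col, std::size_t rows) {
    assert(row < rows);
    return col * rows + row;
}

// Reads element (row, col) from a column-major host buffer.
float element_at(const float* data, std::size_t row, std::size_t col, std::size_t rows) {
    return data[column_major_offset(row, col, rows)];
}
```

Applied to the slide's buffer `{0,1,2,3,4,5}` viewed as 2x3, row 0 reads 0, 2, 4 and row 1 reads 1, 3, 5, which matches the printed matrix.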
  31. Arithmetic
         array R = randu(3,3);
         array C = constant(1,3,3) + complex(sin(R));   // C is c32
         // rescale complex values to unit circle
         array a = randn(5,c32);
         print(a / abs(a));
  32. L-2 Norm Example
         // calculate L-2 norm of every column
         sqrt(sum(pow(X, 2)))      // norm of every column vector
         sqrt(sum(pow(X, 2), 0))   // ..same
         sqrt(sum(pow(X, 2), 1))   // norm of every row vector
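As a rough illustration of what the reductions on the L-2 norm slide compute, the sketch below spells out the per-column and per-row norms as explicit loops over a column-major buffer. It is a plain C++ stand-in under the assumption of column-major storage; `column_norms` and `row_norms` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L-2 norm of each column of a rows x cols column-major matrix
// (the slide's reduction along dimension 0). Hypothetical helper.
std::vector<double> column_norms(const std::vector<double>& x,
                                 std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<double> norms(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        double sum_sq = 0.0;
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = x[c * rows + r];
            sum_sq += v * v;
        }
        norms[c] = std::sqrt(sum_sq);
    }
    return norms;
}

// L-2 norm of each row (the slide's reduction along dimension 1).
std::vector<double> row_norms(const std::vector<double>& x,
                              std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<double> norms(rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        double sum_sq = 0.0;
        for (std::size_t c = 0; c < cols; ++c) {
            const double v = x[c * rows + r];
            sum_sq += v * v;
        }
        norms[r] = std::sqrt(sum_sq);
    }
    return norms;
}
```

For the 2x2 matrix with columns (3, 4) and (0, 5), both column norms are 5, while the row norms are 3 and the square root of 41.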
  33. Subscripting Examples
         array A = randu(3,3);
         array a1 = A(0);     // first element
         array a2 = A(0,1);   // first row, second column
         A(1,span);           // second row
         A.row(end);          // last row
         A.cols(1,end);       // all but first column
  34. Subscripting Examples
         float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};
         array b(1,10,b_ptr);
         b(seq(3));         // {0,1,2}
         b(seq(1,7));       // {1,2,3,4,5,6,7}
         b(seq(1,2,7));     // {1,3,5,7}
         b(seq(0,2,end));   // {0,2,4,6,8}
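The index sets in the comments above can be reproduced with a small helper that expands a (first, step, last) triple with an inclusive upper bound. This is only a sketch of the semantics implied by the slide's comments; `expand_seq` is a hypothetical function, not the library's `seq`, and `end` is replaced by the last valid index (9 for the ten-element array).

```cpp
#include <cassert>
#include <vector>

// Expands first, first+step, ... up to and including last, mirroring the
// slide's comment that seq(1,2,7) selects {1,3,5,7}. Hypothetical helper.
std::vector<int> expand_seq(int first, int step, int last) {
    assert(step > 0);
    std::vector<int> out;
    for (int i = first; i <= last; i += step) {
        out.push_back(i);
    }
    return out;
}
```

Under this reading, `seq(3)` corresponds to `expand_seq(0, 1, 2)` and `seq(0,2,end)` to `expand_seq(0, 2, 9)`.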
  35. Data Manipulation
         // setting entries to a constant
         A(span) = 4;           // fill entire array
         A.row(0) = -1;         // first row
         A(seq(3)) = 3.1415;    // first three elements
  36. Data Manipulation
         // copy in another matrix
         array B = constant(1,4,4,f64);
         B.row(0) = randu(1,4,f32);   // set row (upcast)
  37. Data Manipulation
         // index with another array
         float h_inds[] = {0, 4, 2, 1};   // zero-based
         array inds(1,4,h_inds);
         B(inds) = randu(4,1);            // set to random
  38. Linear Algebra
         // matrix factorization
         array L, U;
         lu(L, U, randu(n,n));
         // linear systems: A x = b
         array A = randu(n,n), b = randu(n,1);
         array x = solve(A,b);
  39. Graphics Functions
     • asynchronous
     • non-blocking
     • throttled at 35 Hz
  40. Graphics Functions
     • non-blocking primitives
       • surface - surface plotting (2d data)
       • image - intensity image visualization
       • arrows - vector fields
       • plot2 - line plotting (x,y)
       • plot3 - scatter plot (x,y,z)
       • volume - volume rendering for 3d data
  41. Graphics Functions
     • utility commands
       • keep_on
       • keep_off
       • subfigure
       • palette
       • clearfig
       • draw (blocking)
       • figure
       • title
       • close
  42. Graphics Example
         #include <arrayfire.h>
         using namespace af;
         int main() {
             // random 3d surface
             const int n = 256;
             while (1) {
                 array x = randu(n,n);
                 // 3d surface plot
                 surface(x);
             }
             return 0;
         }
  43. GFOR Parallel Loops
     Parallel matrix multiplications (1 kernel launch)
         gfor (array i, 3)
             C(span,span,i) = A(span,span,i) * B;
     [Diagram: C(,,1) = A(,,1) * B; C(,,2) = A(,,2) * B; C(,,3) = A(,,3) * B]
  44. GFOR Parallel Loops
     Parallel matrix multiplications (1 kernel launch)
         gfor (array i, 3)
             C(span,span,i) = A(span,span,i) * B;
     [Diagram: C(,,1:3) = A(,,1:3) * B]
  45. GFOR Parallel Loops
     Parallel matrix multiplications (1 kernel launch)
         gfor (array i, 3)
             C(span,span,i) = A(span,span,i) * B;
     [Diagram: C = A * B]
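To show the arithmetic that the gfor loop expresses, here is a sequential sketch in plain C++ that multiplies each n x n slice of A by the same matrix B. It assumes column-major storage and reads `*` as matrix multiplication, as the slide title suggests. It shows only what is computed: it does not reproduce the single-kernel parallel execution, and `batched_matmul` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C(:,:,s) = A(:,:,s) * B for every slice s.
// a holds `slices` column-major n x n matrices back to back; b holds one.
// Hypothetical helper; a CPU loop standing in for the slide's gfor.
std::vector<double> batched_matmul(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n, std::size_t slices) {
    assert(a.size() == n * n * slices);
    assert(b.size() == n * n);
    std::vector<double> c(n * n * slices, 0.0);
    for (std::size_t s = 0; s < slices; ++s) {  // the loop gfor replaces
        const std::size_t base = s * n * n;
        for (std::size_t col = 0; col < n; ++col) {
            for (std::size_t row = 0; row < n; ++row) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k) {
                    sum += a[base + k * n + row] * b[col * n + k];
                }
                c[base + col * n + row] = sum;
            }
        }
    }
    return c;
}
```

Each iteration of the outer loop is independent of the others, which is the property that lets the three multiplications on the slide be batched together.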
  46. Four Quick Stories in Conclusion
     • Advertising
     • Healthcare
     • Finance
     • Oil & Gas
  47. Virtual Glasses Try-On
  48. Acceleration Demands
     • The CPU code
       • 45 seconds for one session to complete
       • Highly optimized OpenMP code leveraging all cores
     • 1,000 sessions/minute required 750 CPU nodes
     • Convert Mac-only research code to C#
     • Focus on efficiently developed robust performance
  49. ArrayFire Solution
     • Linear algebra: matrix multiply, transpose, linear solvers
     • Image processing: convolutions, Fast Fourier Transform, correlation filter, Sobel filter, Gaussian blur
     • OpenCV functions
     • Custom edge detection
     • Graphics: rendering points
     • Reductions: min, max, sum
     • JIT
     • Increased productivity
  50. Results
     • 3X acceleration
     • Dropped from 750 nodes to 250 nodes
     • Benefit from ongoing library support
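The node counts in this story follow from simple capacity arithmetic. The sketch below assumes each node runs one session at a time and that the 3X acceleration cuts a session from 45 seconds to 15 seconds; both are assumptions made here to reproduce the slide's figures, and `nodes_required` is a hypothetical helper.

```cpp
#include <cassert>

// Nodes needed to sustain a session rate, assuming one session per node at a
// time. Work arriving each minute is sessions * seconds of node time, and a
// node supplies 60 seconds of it. Hypothetical model for illustration.
long nodes_required(long sessions_per_minute, long seconds_per_session) {
    assert(sessions_per_minute >= 0 && seconds_per_session >= 0);
    const long node_seconds_per_minute = sessions_per_minute * seconds_per_session;
    return (node_seconds_per_minute + 59) / 60;  // round up to whole nodes
}
```

Under those assumptions, `nodes_required(1000, 45)` gives 750 and `nodes_required(1000, 15)` gives 250, in line with the drop reported on the Results slide.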
  51. Culture-Free Microbiology
     Computer-controlled pipettes
     • Filling
     • Filled
  52. Microscope
     • A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to program
  53. Acceleration Demands
     • This platform provides a rapid alternative to traditional cell culturing for susceptibility testing
     • The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic
     • Culture-based methods can take 2-3 days, which is problematic for many critically ill patients
  54. ArrayFire Solution
     • Image Processing
       • Heavily filter based
       • Convolve, Filter, Resize
     • Image Statistics
       • Mean, StdDev, Variance
  55. Results
     • Realtime throughput

       Kernel                                                    Speedup
       Image Registration (heavy use of statistics functions)    73.17x
       Custom Filter (Prep Center Image)                         26.48x
       Gaussian Blur                                             2.19x
  56. Hedge Protection System
  57. Acceleration Demands
     • CPU-only version was taking 115 hours
     • Needs to run entire database of portfolios each night before trading begins next day
  58. ArrayFire Solution
     • Statistics Functions
       • Random number generation
       • Variance
       • Exponentials
     • Arithmetic
       • Sqrt
       • Element-wise math
     • Reductions
       • Sum
  59. Results
     • GPU version drops runtime to 7 hours and meets the requirement to run overnight
     • Time left over to try more permutations
  60. Oil Well Monitoring
     • Ordinary telecom fiber used as an efficient, high fidelity acoustic sensor
     • Threaded along the length of oil well
  61. Acceleration Demands
     • Require realtime signal processing from 24 channels per unit with an onsite server
     • CPU-only solution was 5x slower than realtime
  62. ArrayFire Solution
     • Heavy usage of signal filtering functions
       • FIR
       • IIR
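For readers unfamiliar with the term, an FIR (finite impulse response) filter computes each output sample as a weighted sum of the current and previous input samples. The sketch below is a direct-form CPU version in plain C++, given only to illustrate the operation; it is not the implementation used in the case study, and `fir_filter` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-form FIR filter: out[n] = sum over k of taps[k] * signal[n - k],
// treating samples before the start of the signal as zero.
// Hypothetical helper for illustration only.
std::vector<double> fir_filter(const std::vector<double>& taps,
                               const std::vector<double>& signal) {
    assert(!taps.empty());
    std::vector<double> out(signal.size(), 0.0);
    for (std::size_t n = 0; n < signal.size(); ++n) {
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k) {
            out[n] += taps[k] * signal[n - k];
        }
    }
    return out;
}
```

With taps `{0.5, 0.5}` this is a two-point moving average. An IIR filter differs in that it also feeds previous outputs back into the sum.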
  63. Results
     • 6x performance improvements in signal processing
     • 20x overall performance improvement through more efficiently vectorized code
  64. Software Shop for CUDA & OpenCL
     • Two ways to work with us:
       • Use
       • Hire our CUDA & OpenCL developers
     Code development; CUDA & OpenCL training
