PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 


Presentation PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

    PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos: Presentation Transcript

    • Software Libraries for CUDA & OpenCL
    • Heterogeneous Computing is Hard: two examples are (1) Median Filtering and (2) Local Windowing
    • Median Filtering Increasingly Difficult
    • Local Windowing: the best algorithm to use changes depending on which device is in the system (times in ms)
                       Device 1   Device 2   Device 3   Device 4
        Algorithm 1       395        599        244        102
        Algorithm 2       270        703        241        103
        Algorithm 3       699        407        138        116
        Algorithm 4       380        522        202         98
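The table's point can be checked mechanically. The following is a minimal sketch, not anything shown in the talk: it copies the slide's timings into an array and picks the fastest algorithm for each device. The names `kTimingsMs` and `fastest_algorithm` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Timings in ms copied from the slide: rows are Algorithms 1-4, columns are Devices 1-4.
constexpr int kTimingsMs[4][4] = {
    {395, 599, 244, 102},
    {270, 703, 241, 103},
    {699, 407, 138, 116},
    {380, 522, 202, 98},
};

// Hypothetical helper: returns the 1-based number of the fastest algorithm
// for a 1-based device number.
int fastest_algorithm(std::size_t device) {
    assert(device >= 1 && device <= 4);
    std::size_t best = 0;
    for (std::size_t alg = 1; alg < 4; ++alg) {
        if (kTimingsMs[alg][device - 1] < kTimingsMs[best][device - 1]) best = alg;
    }
    return static_cast<int>(best) + 1;
}
```

With these numbers the winner is Algorithm 2 on Device 1, Algorithm 3 on Devices 2 and 3, and Algorithm 4 on Device 4, so no single algorithm is best everywhere.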
    • Why Software Libraries Are Great
      • Reduce many lines of code to one line
      • Obsessively tuned by experts; faster than DIY
      • Well-tested and maintained
      • Continuously improving
    • Five Influencers (besides price): Performance, Portability, Programmability, Scalability, Community
    • Performance & Programmability (chart built up over four slides; axes: Slower to Faster and Time-consuming to Easy-to-use; approaches plotted: SSE or AVX, Writing Kernels, Compiler Directives, Using Libraries)
    • Performance
    • Performance
    • Portability
      • Flavors of portability
      • HW vendor options
      • Accelerator options (GPU, coprocessor, FPGA)
      • CPU fallback
      • High-performance mobile computing
      • Libraries can provide portability
    • Scalability
      • Always start with one device
      • Potential headaches of adding devices
      • Performance hit
      • Development complexity
      • Libraries can make scaling easy
    • Community
      • What do you do when bugs arise?
      • Continuous refinement
      • Someone to answer questions
      • Libraries can have great community support
    • Benefits of Using a Library: libraries eliminate hidden costs of software development (Development, Test and QA, Documentation, Maintenance, Porting). Figure: cost-versus-time plots labeled "Pain" and "Pleasure".
    • ArrayFire: Technical Computing
    • Performance & Programmability
      • Super easy to program
      • Highly optimized
    • Portability
    • Scalability: multi-GPU is one line of code
        array *y = new array[n];
        for (int i = 0; i < n; ++i) {
            deviceset(i);           // change GPUs
            array x = randu(5,5);   // add work to GPU's queue
            y[i] = fft(x);          // more work in queue
        }
        // all GPUs are now computing simultaneously
    • Community
      • Over 8,000 posts at http://forums.accelereyes.com
      • Nightly library update releases
      • Stable releases a few times a year
      • v2.0 coming at the end of summer
    • Example Case Studies 1
        Speedup   Application        Organization
        17X       Neuro-imaging      Georgia Tech
        20X       Viral Analyses     CDC
        20X       Video Processing   Google
        45X       Radar Imaging      System Planning
        12X       Medical Devices    Spencer Tech
    • Example Case Studies 2
        Speedup   Application        Organization
        5X        Weather Models     NCAR
        35X       Power Eng          IIT India
        17X       Surveillance       BAE Systems
        70X       Drug Delivery      Georgia Tech
        35X       Bioinformatics     Leibnitz
    • Hundreds of Functions
      • reductions: sum, min, max, count, prod; vectors, columns, rows, etc
      • dense linear algebra: LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power
      • convolutions: 2D, 3D, ND
      • FFTs: 2D, 3D, ND
      • image processing: filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms
      • interpolate & scale: vectors, matrices; rescaling
      • sorting: along any dimension; sort detection
      • and many more…
    • Intuitive Functions (estimate π)
        #include <stdio.h>
        #include <arrayfire.h>
        using namespace af;
        int main() {
            // 20 million random samples
            int n = 20e6;
            array x = randu(n,1), y = randu(n,1);
            // how many fell inside unit circle?
            float pi = 4 * sum<float>(x*x + y*y < 1) / n;
            printf("pi = %g\n", pi);
            return 0;
        }
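For comparison, the same Monte Carlo estimate can be written as a plain CPU loop. This is a sketch using only the C++ standard library, not ArrayFire; the function name `estimate_pi` and its `seed` parameter are assumptions made for illustration.

```cpp
#include <random>

// CPU sketch of the same estimate: draw n points uniformly in the unit square
// and count how many land inside the quarter circle x*x + y*y < 1.
// The fraction inside approximates pi/4.
double estimate_pi(int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    long long inside = 0;
    for (int i = 0; i < n; ++i) {
        double x = uniform(gen);
        double y = uniform(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    return 4.0 * static_cast<double>(inside) / n;
}
```

The slide's version expresses the loop body as whole-array operations (`x*x + y*y < 1` followed by a `sum`), which is what lets the library run it on the device.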
    • Data Types
        array   container object
        f32     real single precision
        f64     real double precision
        c32     complex single precision
        c64     complex double precision
        b8      boolean byte
        s32     signed integer
        u32     unsigned integer

        array x = randu(n, f32);
        array y = randu(n, f64);
        array z = randu(n, u32);
    • ND Support: vectors, matrices, volumes, … ND
    • Subscripting. ArrayFire keywords: end, span
        A(1,1)   A(1,span)   A(end,1)   A(end,span)   A(span,span,2)
    • Generate Arrays
        constant(0,3)         // 3-by-1 column of zeros, single-precision
        constant(1,3,2,f64)   // 3-by-2 matrix, double-precision
        randu(1,8)            // row vector (1x8) of random values (uniform)
        randn(2,2)            // square matrix (2x2) random values (normal)
        identity(3,3)         // 3-by-3 identity
        randu(5,7,c32)        // complex random values
    • Create Arrays from CPU Data
        float hA[] = {0,1,2,3,4,5};
        array A(2,3,hA);   // 2x3 matrix, single-precision
        print(A);          // A = [ 0 2 4 ]
                           //     [ 1 3 5 ]
      Note: Fortran storage order
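"Fortran storage order" means column-major: consecutive values in the host buffer fill a column before moving to the next one. A minimal sketch of the index arithmetic, with the hypothetical helper `at_column_major` standing in for what the library does internally:

```cpp
#include <cassert>
#include <cstddef>

// Column-major (Fortran-order) lookup: element (row, col) of a matrix with
// `rows` rows lives at offset row + col * rows in the flat buffer.
float at_column_major(const float* data, std::size_t rows,
                      std::size_t row, std::size_t col) {
    assert(row < rows);
    return data[row + col * rows];
}
```

Applied to the slide's buffer `{0,1,2,3,4,5}` with 2 rows, element (0,1) is 2 and element (1,2) is 5, which matches the printed matrix.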
    • Arithmetic
        array R = randu(3,3);
        array C = constant(1,3,3) + complex(sin(R));   // C is c32
        // rescale complex values to unit circle
        array a = randn(5,c32);
        print(a / abs(a));
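The rescaling step divides each complex value by its own magnitude, so every result has magnitude 1. A sketch of the same element-wise operation with `std::complex`; the function name `to_unit_circle` is hypothetical, and a zero input would produce NaN because it divides by zero.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Element-wise z / |z|: keeps each value's angle and sets its magnitude to 1.
// Assumes no element is exactly zero.
std::vector<std::complex<float>> to_unit_circle(
        const std::vector<std::complex<float>>& values) {
    std::vector<std::complex<float>> out;
    out.reserve(values.size());
    for (const auto& z : values) {
        out.push_back(z / std::abs(z));
    }
    return out;
}
```

For example, 3 + 4i has magnitude 5 and maps to 0.6 + 0.8i.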
    • L-2 Norm Example
        // calculate L-2 norm of every column
        sqrt(sum(pow(X, 2)))      // norm of every column vector
        sqrt(sum(pow(X, 2), 0))   // ..same
        sqrt(sum(pow(X, 2), 1))   // norm of every row vector
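The dimension argument selects which axis is summed away. The sketch below spells this out for a column-major matrix in plain C++; `l2_norms` is a hypothetical name, and `dim` follows the slide's convention (0 gives one norm per column, 1 gives one norm per row).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// L-2 norms of a column-major rows x cols matrix.
// dim == 0: sum squares down each column (one result per column).
// dim == 1: sum squares across each row (one result per row).
std::vector<double> l2_norms(const std::vector<double>& data,
                             std::size_t rows, std::size_t cols, int dim) {
    std::vector<double> sums(dim == 0 ? cols : rows, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            double v = data[r + c * rows];
            sums[dim == 0 ? c : r] += v * v;
        }
    }
    for (double& s : sums) s = std::sqrt(s);
    return sums;
}
```

For the matrix with rows [3 0] and [4 5], the column norms are both 5, while the row norms are 3 and sqrt(41).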
    • Subscripting Examples
        array A = randu(3,3);
        array a1 = A(0);     // first element
        array a2 = A(0,1);   // first row, second column
        A(1,span);           // second row
        A.row(end);          // last row
        A.cols(1,end);       // all but first column
    • Subscripting Examples
        float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};
        array b(1,10,b_ptr);
        b(seq(3));         // {0,1,2}
        b(seq(1,7));       // {1,2,3,4,5,6,7}
        b(seq(1,2,7));     // {1,3,5,7}
        b(seq(0,2,end));   // {0,2,4,6,8}
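Judging only from the results in the comments above, `seq(first, step, last)` appears to include both endpoints. A sketch of that reading in plain C++; `seq_indices` is a hypothetical helper, and the inclusive upper bound is inferred from the slide rather than from library documentation.

```cpp
#include <cassert>
#include <vector>

// Indices first, first + step, ... up to and including last (if reached).
// Inclusive bounds are inferred from the slide's commented results.
std::vector<int> seq_indices(int first, int step, int last) {
    assert(step > 0);
    std::vector<int> idx;
    for (int i = first; i <= last; i += step) {
        idx.push_back(i);
    }
    return idx;
}
```

Under this reading, `seq(3)` corresponds to `seq_indices(0, 1, 2)` and `seq(0,2,end)` on a 10-element vector corresponds to `seq_indices(0, 2, 9)`.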
    • Data Manipulation
        // setting entries to a constant
        A(span) = 4;           // fill entire array
        A.row(0) = -1;         // first row
        A(seq(3)) = 3.1415;    // first three elements
    • Data Manipulation
        // copy in another matrix
        array B = constant(1,4,4,f64);
        B.row(0) = randu(1,4,f32);   // set row (upcast)
    • Data Manipulation
        // index with another array
        float h_inds[] = {0, 4, 2, 1};   // zero-based
        array inds(1,4,h_inds);
        B(inds) = randu(4,1);            // set to random
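Assigning through an index array is a scatter: value k is written to the position named by index k. A minimal CPU sketch, assuming linear (flattened) indexing into the target; `assign_at` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter: target[inds[k]] = values[k] for every k.
// at() throws std::out_of_range if an index is past the end of target.
void assign_at(std::vector<double>& target,
               const std::vector<std::size_t>& inds,
               const std::vector<double>& values) {
    assert(inds.size() == values.size());
    for (std::size_t k = 0; k < inds.size(); ++k) {
        target.at(inds[k]) = values[k];
    }
}
```

With the slide's indices `{0, 4, 2, 1}`, only flat positions 0, 1, 2 and 4 of the 16-element buffer change; everything else keeps its old value.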
    • Linear Algebra
        // matrix factorization
        array L, U;
        lu(L, U, randu(n,n));
        // linear systems: A x = b
        array A = randu(n,n), b = randu(n,1);
        array x = solve(A,b);
    • Graphics Functions
      • asynchronous
      • non-blocking
      • throttled at 35 Hz
    • Graphics Functions: non-blocking primitives
      • surface - surface plotting (2d data)
      • image - intensity image visualization
      • arrows - vector fields
      • plot2 - line plotting (x,y)
      • plot3 - scatter plot (x,y,z)
      • volume - volume rendering for 3d data
    • Graphics Functions: utility commands
      • keep_on, keep_off, subfigure, palette, clearfig, draw (blocking), figure, title, close
    • Graphics Example
        #include <arrayfire.h>
        using namespace af;
        int main() {
            // random 3d surface
            const int n = 256;
            while (1) {
                array x = randu(n,n);
                // 3d surface plot
                surface(x);
            }
            return 0;
        }
    • GFOR Parallel Loops: parallel matrix multiplications (1 kernel launch)
        gfor (array i, 3)
            C(span,span,i) = A(span,span,i) * B;
      • Build 1 (diagram): C(,,1) = A(,,1) * B; C(,,2) = A(,,2) * B; C(,,3) = A(,,3) * B
      • Build 2 (diagram): C(,,1:3) = A(,,1:3) * B
      • Build 3 (diagram): C = A * B
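What the loop computes can be written sequentially. The sketch below takes the slide's description of `*` as matrix multiplication at face value and multiplies each slice of A by the same B on the CPU; `batched_matmul` is a hypothetical name, and the matrices are assumed square and column-major.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: for each slice t, C(:,:,t) = A(:,:,t) * B.
// `a` holds `batch` column-major n x n matrices back to back; `b` holds one.
std::vector<double> batched_matmul(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n, std::size_t batch) {
    std::vector<double> c(n * n * batch, 0.0);
    for (std::size_t t = 0; t < batch; ++t) {      // the iterations gfor batches together
        const std::size_t off = t * n * n;         // start of slice t
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t i = 0; i < n; ++i)
                    c[off + i + j * n] += a[off + i + k * n] * b[k + j * n];
    }
    return c;
}
```

The slices are independent of one another, which is the property that lets the slide's version run them in a single kernel launch instead of one launch per iteration.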
    • Four Quick Stories in Conclusion: Advertising, Healthcare, Finance, Oil & Gas
    • Virtual Glasses Try-On
    • Acceleration Demands
      • The CPU code
        • 45 seconds for one session to complete
        • Highly optimized OpenMP code leveraging all cores
        • 1,000 sessions/minute required 750 CPU nodes
      • Convert Mac-only research code to C#
      • Focus on efficiently developed robust performance
    • ArrayFire Solution
      • Linear algebra: matrix multiply, transpose, linear solvers
      • Image processing: convolutions, fast Fourier transform, correlation filter, Sobel filter, Gaussian blur, OpenCV functions, custom edge detection
      • Graphics: rendering points
      • Reductions: min, max, sum
      • JIT: increased productivity
    • Results
      • 3X acceleration
      • Dropped from 750 nodes to 250 nodes
      • Benefit from ongoing library support
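The node counts follow from the throughput figures if each node handles one session at a time, which is an assumption here and is not stated on the slides. A short sketch of that arithmetic, with the hypothetical helper `nodes_required`:

```cpp
#include <cmath>

// Nodes needed to sustain a session rate, assuming one session per node at a time
// (an assumption for illustration, not stated in the talk).
int nodes_required(double sessions_per_minute, double seconds_per_session) {
    return static_cast<int>(
        std::ceil(sessions_per_minute * seconds_per_session / 60.0));
}
```

At 1,000 sessions per minute and 45 seconds per session this gives 750 nodes; a 3X speedup brings a session down to 15 seconds and the count to 250.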
    • Culture-Free Microbiology: computer-controlled pipettes (figure labels: Filling, Filled)
    • Microscope: a computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to program
    • Acceleration Demands
      • This platform provides a rapid alternative to traditional cell culturing for susceptibility testing
      • The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic
      • Culture-based methods can take 2-3 days, which is problematic for many critically ill patients
    • ArrayFire Solution
      • Image processing: heavily filter based; convolve, filter, resize
      • Image statistics: mean, StdDev, variance
    • Results: realtime throughput
        Kernel                                                     Speedup
        Image Registration (heavy use of statistics functions)     73.17x
        Custom Filter (Prep Center Image)                          26.48x
        Gaussian Blur                                              2.19x
    • Hedge Protection System
    • Acceleration Demands
      • CPU-only version was taking 115 hours
      • Needs to run entire database of portfolios each night before trading begins next day
    • ArrayFire Solution
      • Statistics functions: random number generation, variance
      • Exponentials
      • Arithmetic: sqrt, element-wise math
      • Reductions: sum
    • Results
      • GPU version drops runtime to 7 hours and meets the requirement to run overnight
      • Time left over to try more permutations
    • Oil Well Monitoring
      • Ordinary telecom fiber used as an efficient, high-fidelity acoustic sensor
      • Threaded along the length of oil well
    • Acceleration Demands
      • Require realtime signal processing from 24 channels per unit with an onsite server
      • CPU-only solution was 5x slower than realtime
    • ArrayFire Solution
      • Heavy usage of signal filtering functions: FIR, IIR
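For readers unfamiliar with the first of those filter types, an FIR filter computes each output sample as a weighted sum of the current and previous input samples. The sketch below is a textbook direct-form implementation on the CPU, not the code used in this project; `fir_filter` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k].
// Samples before the start of the signal are treated as zero.
std::vector<double> fir_filter(const std::vector<double>& taps,
                               const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k) {
            y[n] += taps[k] * x[n - k];
        }
    }
    return y;
}
```

With taps `{0.5, 0.5}` this is a two-point moving average: the input `{2, 4, 6, 8}` becomes `{1, 3, 5, 7}`. Each output sample depends only on inputs, never on earlier outputs, so all samples (and all 24 channels) can be computed independently.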
    • Results
      • 6x performance improvements in signal processing
      • 20x overall performance improvement through more efficiently vectorized code
    • Software Shop for CUDA & OpenCL. Two ways to work with us:
      • Use
      • Hire our CUDA & OpenCL developers: code development; CUDA & OpenCL training