4. Local Windowing
The best algorithm changes depending on which device is in the system.
Run time (ms)  Device 1  Device 2  Device 3  Device 4
Algorithm 1    395       599       244       102
Algorithm 2    270       703       241       103
Algorithm 3    699       407       138       116
Algorithm 4    380       522       202       98
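The timings on this slide can be turned into a per-device lookup. The following is a minimal sketch in plain C++ (no ArrayFire calls); the names kTimingsMs and best_algorithm are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cstddef>

// Timings in ms copied from the slide: kTimingsMs[algorithm][device].
static const std::array<std::array<int, 4>, 4> kTimingsMs = {{
    {395, 599, 244, 102},  // Algorithm 1
    {270, 703, 241, 103},  // Algorithm 2
    {699, 407, 138, 116},  // Algorithm 3
    {380, 522, 202, 98},   // Algorithm 4
}};

// Returns the zero-based index of the fastest algorithm for a device.
// Hypothetical helper for illustration, not a library function.
inline std::size_t best_algorithm(std::size_t device) {
    std::size_t best = 0;
    for (std::size_t alg = 1; alg < kTimingsMs.size(); ++alg) {
        if (kTimingsMs[alg][device] < kTimingsMs[best][device]) best = alg;
    }
    return best;
}
```

With these numbers the winner differs by device: Algorithm 2 on Device 1, Algorithm 3 on Devices 2 and 3, and Algorithm 4 on Device 4.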
5. Why Software Libraries Are Great
Reduce many lines of code to one line
Obsessively tuned by experts; faster than DIY
Well-tested and maintained
Continuously improving
13. Portability
Flavors of portability
HW vendor options
Accelerator options (GPU, coprocessor, FPGA)
CPU fallback
High-performance mobile computing
Libraries can provide portability
14. Scalability
Always start with one device
Potential headaches of adding devices
Performance hit
Development complexity
Libraries can make scaling easy
15. Community
What do you do when bugs arise?
Continuous refinement
Someone to answer questions
Libraries can have great community support
16. Benefits of Using a Library
Libraries eliminate hidden costs of software development
Porting
Maintenance
Documentation
Test and QA
Development
[Figure: two cost-versus-time charts, labeled "Pain" and "Pleasure"]
20. Scalability
Multi-GPU is one line of code
array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);          // change GPUs
    array x = randu(5,5);  // add work to GPU's queue
    y[i] = fft(x);         // more work in queue
}
// all GPUs are now computing simultaneously
21. Community
Over 8,000 posts at http://forums.accelereyes.com
Nightly library update releases
Stable releases a few times a year
v2.0 coming at the end of summer
22. Example Case Studies 1
Speedup  Application       Organization
17X      Neuro-imaging     Georgia Tech
20X      Viral Analyses    CDC
20X      Video Processing  Google
45X      Radar Imaging     System Planning
12X      Medical Devices   Spencer Tech
23. Example Case Studies 2
Speedup  Application     Organization
5X       Weather Models  NCAR
35X      Power Eng       IIT India
17X      Surveillance    BAE Systems
70X      Drug Delivery   Georgia Tech
35X      Bioinformatics  Leibnitz
24. Hundreds of Functions
reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc
dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power
convolutions
• 2D, 3D, ND
FFTs
• 2D, 3D, ND
image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms
interpolate & scale
• vectors, matrices
• rescaling
sorting
• along any dimension
• sort detection
and many more…
25. Intuitive Functions (estimate π)
#include <stdio.h>
#include <arrayfire.h>
using namespace af;
int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(x*x + y*y < 1) / n;
    printf("pi = %g\n", pi);
    return 0;
}
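For comparison, the same Monte Carlo estimate can be written as an explicit CPU loop. This is a sketch using only the C++ standard library, not ArrayFire; the function name estimate_pi and its seed parameter are assumptions made for the illustration.

```cpp
#include <cstddef>
#include <random>

// Estimates pi by sampling n points uniformly in the unit square and
// counting how many fall inside the quarter unit circle.
// Hypothetical helper; the seed makes runs repeatable.
double estimate_pi(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = dist(gen);
        const double y = dist(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    // The inside/total ratio approximates pi/4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```

The array version on the slide expresses the same computation without the loop: the comparison and the sum each operate on whole arrays at once.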
26. Data Types
array x = randu(n, f32);
array y = randu(n, f64);
array z = randu(n, u32);
array: container object
f32: real, single precision
f64: real, double precision
c32: complex, single precision
c64: complex, double precision
b8: boolean byte
s32: signed integer
u32: unsigned integer
30. Create Arrays from CPU Data
float hA[] = {0,1,2,3,4,5};
array A(2,3,hA); // 2x3 matrix, single-precision
print(A);
// A = [ 0 2 4 ]
//     [ 1 3 5 ]
Note: Fortran storage order
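Fortran storage order means column-major: consecutive values in the host buffer fill a column before moving to the next one. A minimal sketch of the index mapping in plain C++ follows; at_colmajor is a hypothetical helper, not part of any library.

```cpp
#include <cstddef>

// Column-major (Fortran-order) lookup: element (row, col) of a matrix with
// `rows` rows sits at offset row + col * rows in the flat buffer.
// Hypothetical helper for illustration only.
inline float at_colmajor(const float* data, std::size_t rows,
                         std::size_t row, std::size_t col) {
    return data[row + col * rows];
}
```

For the buffer {0,1,2,3,4,5} viewed as 2x3, element (0,1) is 2 and element (1,2) is 5, which matches the printed matrix on the slide.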
31. Arithmetic
array R = randu(3,3);
array C = constant(1,3,3) + complex(sin(R)); // C is c32
// rescale complex values to unit circle
array a = randn(5,c32);
print(a / abs(a));
32. L-2 Norm Example
// calculate L-2 norm of every column
sqrt(sum(pow(X, 2)))     // norm of every column vector
sqrt(sum(pow(X, 2), 0))  // ..same
sqrt(sum(pow(X, 2), 1))  // norm of every row vector
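To make the two reduction directions concrete, here is a sketch of the same norms as explicit loops in plain C++. It assumes a column-major flat buffer; column_norms and row_norms are hypothetical names, not library functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// X is a rows x cols matrix stored column-major in a flat vector.
// Returns one L-2 norm per column (reduction along dimension 0).
std::vector<double> column_norms(const std::vector<double>& X,
                                 std::size_t rows, std::size_t cols) {
    std::vector<double> norms(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        double sum_sq = 0.0;
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = X[r + c * rows];
            sum_sq += v * v;
        }
        norms[c] = std::sqrt(sum_sq);
    }
    return norms;
}

// Returns one L-2 norm per row (reduction along dimension 1).
std::vector<double> row_norms(const std::vector<double>& X,
                              std::size_t rows, std::size_t cols) {
    std::vector<double> norms(rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        double sum_sq = 0.0;
        for (std::size_t c = 0; c < cols; ++c) {
            const double v = X[r + c * rows];
            sum_sq += v * v;
        }
        norms[r] = std::sqrt(sum_sq);
    }
    return norms;
}
```

For a 2x3 matrix with columns (3,4), (0,0) and (6,8), the column norms are 5, 0 and 10.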
33. Subscripting Examples
array A = randu(3,3);
array a1 = A(0);    // first element
array a2 = A(0,1);  // first row, second column
A(1,span);          // second row
A.row(end);         // last row
A.cols(1,end);      // all but first column
35. Data Manipulation
// setting entries to a constant
A(span) = 4;         // fill entire array
A.row(0) = -1;       // first row
A(seq(3)) = 3.1415;  // first three elements
36. Data Manipulation
// copy in another matrix
array B = constant(1,4,4,f64);
B.row(0) = randu(1,4,f32); // set row (upcast)
37. Data Manipulation
// index with another array
float h_inds[] = {0, 4, 2, 1}; // zero-based
array inds(1,4,h_inds);
B(inds) = randu(4,1); // set to random
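Indexed assignment of this kind is a scatter: each value is written to the position named by the matching index. The sketch below shows that pattern in plain C++, assuming the indices are linear offsets into a flat buffer and that both inputs have the same length; scatter is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Writes values[k] into dest[indices[k]] for every k.
// Assumes indices.size() == values.size() and all indices are in range.
// Hypothetical helper for illustration only.
void scatter(std::vector<double>& dest,
             const std::vector<std::size_t>& indices,
             const std::vector<double>& values) {
    for (std::size_t k = 0; k < indices.size(); ++k) {
        dest[indices[k]] = values[k];
    }
}
```

Elements whose offsets do not appear in the index list are left unchanged.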
38. Linear Algebra
// matrix factorization
array L, U;
lu(L, U, randu(n,n));
// linear systems: A x = b
array A = randu(n,n), b = randu(n,1);
array x = solve(A,b);
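For readers who want to see what a dense solve involves, here is a small CPU sketch of Gaussian elimination with partial pivoting in plain C++. It is one textbook way to solve A x = b and is not claimed to be how the library implements solve; solve_dense and Matrix are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n x n

// Solves A x = b by Gaussian elimination with partial pivoting.
// A and b are taken by value because they are modified during elimination.
std::vector<double> solve_dense(Matrix A, std::vector<double> b) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k) {
        // Choose the row with the largest pivot to limit round-off error.
        std::size_t pivot = k;
        for (std::size_t r = k + 1; r < n; ++r) {
            if (std::fabs(A[r][k]) > std::fabs(A[pivot][k])) pivot = r;
        }
        if (A[pivot][k] == 0.0) throw std::runtime_error("singular matrix");
        std::swap(A[k], A[pivot]);
        std::swap(b[k], b[pivot]);
        // Eliminate column k from the rows below the pivot.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= f * A[k][c];
            b[r] -= f * b[k];
        }
    }
    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}
```

For A = [2 1; 1 3] and b = (3, 5), the solution is x = (0.8, 1.4).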
42. Graphics Example
#include <arrayfire.h>
using namespace af;
int main() {
    // random 3d surface
    const int n = 256;
    while (1) {
        array x = randu(n,n);
        // 3d surface plot
        surface(x);
    }
    return 0;
}
43. GFOR Parallel Loops
Parallel matrix multiplications (1 kernel launch)
gfor (array i, 3)
C(span,span,i) = A(span,span,i) * B;
[Figure: C(,,1) = A(,,1) * B; C(,,2) = A(,,2) * B; C(,,3) = A(,,3) * B]
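The batched product on this slide multiplies every tile of A by the same matrix B. A serial CPU sketch of that computation in plain C++ is shown here, assuming square n x n tiles stored column-major and packed one after another; batched_matmul is a hypothetical helper, not a library call.

```cpp
#include <cstddef>
#include <vector>

// Computes C_t = A_t * B for each n x n tile A_t, with B shared by all tiles.
// All matrices are column-major; tiles are contiguous in A and in the result.
// Hypothetical helper for illustration only.
std::vector<double> batched_matmul(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t n, std::size_t tiles) {
    std::vector<double> C(n * n * tiles, 0.0);
    for (std::size_t t = 0; t < tiles; ++t) {  // tiles are independent
        const std::size_t off = t * n * n;
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) {
                for (std::size_t i = 0; i < n; ++i) {
                    C[off + i + j * n] += A[off + i + k * n] * B[k + j * n];
                }
            }
        }
    }
    return C;
}
```

No iteration of the outer loop reads another iteration's output, which is the property that lets a parallel loop run all tiles at once.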
48. Acceleration Demands
The CPU code
45 seconds for one session to complete
Highly optimized OpenMP code leveraging all cores
1,000 sessions/minute required 750 CPU nodes
Convert Mac-only research code to C#
Focus on efficiently developed robust performance
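The 750-node figure follows from the two numbers above it. The arithmetic is sketched below, assuming each CPU node works on one 45-second session at a time; the constant names are hypothetical.

```cpp
// Back-of-envelope check of the node count quoted on the slide, assuming
// one session occupies one CPU node for its full 45 seconds.
constexpr int kSecondsPerSession = 45;
constexpr int kSessionsPerMinute = 1000;

// CPU-seconds of work arriving each minute, divided by the 60 seconds
// a single node can supply in that minute.
constexpr int kNodesRequired = kSecondsPerSession * kSessionsPerMinute / 60;

static_assert(kNodesRequired == 750, "45 s x 1000 per minute / 60 s = 750");
```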
52. Microscope
A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to a program
53. Acceleration Demands
This platform provides a rapid alternative to traditional cell culturing for susceptibility testing
The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic
Culture-based methods can take 2-3 days, which is problematic for many critically ill patients
57. Acceleration Demands
CPU-only version was taking 115 hours
Needs to run entire database of portfolios each night before trading begins next day
58. ArrayFire Solution
Statistics Functions
Random number generation
Variance
Exponentials
Arithmetic
Sqrt
Element-wise math
Reductions
Sum
59. Results
GPU version drops runtime to 7 hours and meets the requirement to run overnight
Time left over to try more permutations
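The speedup implied by these runtimes is not stated on the slides, but it can be derived from the 115-hour CPU figure and the 7-hour GPU figure. A short sketch of that arithmetic follows, with hypothetical constant names.

```cpp
// Speedup derived from the runtimes quoted on the slides
// (115 hours CPU-only, 7 hours on the GPU).
constexpr double kCpuHours = 115.0;
constexpr double kGpuHours = 7.0;
constexpr double kSpeedup = kCpuHours / kGpuHours;  // roughly 16.4x

static_assert(kSpeedup > 16.0 && kSpeedup < 17.0, "115 h / 7 h is about 16x");
```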
60. Oil Well Monitoring
Ordinary telecom fiber used as an efficient, high fidelity acoustic sensor
Threaded along the length of oil well
61. Acceleration Demands
Requires real-time signal processing from 24 channels per unit with an onsite server
CPU-only solution was 5x slower than real time