Apache Spark performance
Adam Roberts, IBM Runtimes, Hursley, UK

● Sharing observations from IBM Runtimes
● High-level techniques and tools
● Writing efficient code
● Hardware accelerators: RDMA for networking, GPUs for computation
Workloads we're especially interested in
● HiBench
● SparkSqlPerf, all 100 TPC-DS queries
● Real customer applications
● PoCs and Spark demos
What I will be covering
✔ Best practices for Java/Scala code
✔ Writing code that works well with a JIT compiler
✔ Profiling techniques you can use
✔ How to use RDMA for fast networking
✔ How to use GPUs for fast data processing
✔ How we can use the above to dramatically increase our Spark performance: get results faster
✔ A package for anyone to try
What I won't be covering
● High-level application design decisions
● Avoiding the shuffle: knowing which Spark methods to use
● File systems, operating systems, and file types to use
● Conventional Spark options, e.g. spark.shuffle.*, compression codecs, spark.memory.*, spark.rpc.*, spark.streaming.*, spark.dynamicAllocation.*
● Java options in depth: though a matching -Xms and -Xmx shows good results in Spark 2 (omitted by default in a PR), and we use the Kryo serializer
Tooling we use, all freely available
● Health Center
● TPROF with Visual Performance Analyzer
● GCMV: Garbage Collection and Memory Visualizer
● MAT: diagnose and resolve memory leaks
● Linux perf tools
● Jenkins, Slack, Maven, ScalaTest, Eclipse, IntelliJ Community Edition
Profiling Spark with Health Center:
-Xhealthcenter:level=headless

Profiling Java with TPROF:
-agentlib:jprof=tprof
Tips for performance in Java and Scala
● Locals are faster than globals: the JIT can prove a closed set of storage readers/modifiers. Fields and statics are slow; parameters and locals are fast
● Constants are faster than variables: constants can be copied inline or across memory caches. Java's final and Scala's val are your friends
● private is faster than public: private methods can't be dynamically redefined, while protected and "package private" are just as slow as public
● Small methods (≤100 bytecodes) are good: more opportunities to inline them
● Simple is faster than complex: easier for the JIT to reason about the effects
● Limit extension points and architectural complexity when practical: it makes call sites concrete
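To make these concrete, here is a minimal sketch (mine, not from the deck) that follows several of the tips at once: a final constant, a private helper small enough to inline, and a field read hoisted into a local.

// Illustrative only: final constant, small private method, locals over fields
public final class Distance {
    private static final double SCALE = 2.54; // constant: safe to copy inline

    private final double[] inches;

    public Distance(double[] inches) {
        this.inches = inches;
    }

    // Small (well under 100 bytecodes), private, simple: an easy inlining candidate
    private double toCentimetres(double value) {
        return value * SCALE;
    }

    public double totalCentimetres() {
        final double[] local = inches; // read the field once, then work with the local
        double total = 0.0;
        for (int i = 0; i < local.length; i++) {
            total += toCentimetres(local[i]);
        }
        return total;
    }
}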
Scala has lots of features; not all of them are fast
● Understand the implementation of Scala language features – use them judiciously
● Reduce uncertainty for the compiler in your coding style: use type ascription, avoid ambiguous polymorphism
● Stick to common coding patterns – the JIT is tuned for them, and as new workloads emerge the latest JITs will change too
● Focus on performance hotspots using the profiling tools I mentioned
● Too much emphasis on performance can compromise maintainability!
● Too much emphasis on maintainability can compromise performance!
Idiomatic vs imperative Scala: for loops

Idiomatic:

for (x <- 1 to 10) {
  println("Value of x: " + x)
}

val values = List(1, 2, 3, 4, 5, 6)
for (x <- values) {
  println("Value of x: " + x)
}

Imperative equivalents (note these need var, not val, since the loop variable is reassigned):

var x = 1
while (x <= 10) {
  println("Value of x: " + x)
  x = x + 1
}

val values = List(1, 2, 3, 4, 5, 6)
var x = 0
while (x < values.length) {
  println("Value of x: " + values(x))
  x = x + 1
}
The takeaway: avoid boxing/unboxing (it involves object allocation) – avoid collections of type AnyRef! Know your types, and convert to AnyRef with care.
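The cost is easy to see in plain Java too; a minimal sketch (mine, not from the deck) contrasting a boxed collection with a primitive array – a Scala List[Any] holding numbers pays the same per-element allocation:

import java.util.ArrayList;
import java.util.List;

public class BoxingCost {
    public static void main(String[] args) {
        final int n = 1_000_000;
        // Boxed: each add() stores an Integer object; summing unboxes again
        List<Integer> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            boxed.add(i);
        }
        long boxedSum = 0;
        for (int value : boxed) {
            boxedSum += value;
        }
        // Primitive: no per-element objects, contiguous memory, no unboxing
        int[] primitives = new int[n];
        for (int i = 0; i < n; i++) {
            primitives[i] = i;
        }
        long primitiveSum = 0;
        for (int value : primitives) {
            primitiveSum += value;
        }
        System.out.println(boxedSum == primitiveSum); // same answer, very different allocation profiles
    }
}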
Observations with Java options
● Max heap size, initial heap size, and quickstart can make a big difference – for Spark 2 we've noticed that a matching -Xms and -Xmx improves performance on HiBench and SparkSqlPerf
● The O* JDK has a method size bytecode limit for the JIT; ours does not. If you do use the O* JDK and find certain queries become very slow, try -XX:-DontCompileHugeMethods (huge methods are skipped by default there, so disable that behaviour)
● Experiment, then profile – spend your time on what's actually used the most, not nitpicking over barely used code paths!
● spark-env.sh for environment variables
● spark-defaults.conf for Spark settings
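As a minimal sketch of setting such options programmatically (the values here are placeholders, not recommendations; the same keys can live in spark-defaults.conf instead):

import org.apache.spark.SparkConf;

public class TuningConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("tuning-example")
                // The Kryo serializer mentioned earlier
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Sizes the executor heap; the matching initial heap would come from JVM options
                .set("spark.executor.memory", "4g");
        System.out.println(conf.toDebugString());
    }
}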
Java's intermediate bytecodes are compiled as required, based on runtime profiling:
- code is compiled 'just in time' as required
- dynamic compilation can determine the target machine's capabilities and the application's demands

● The VM searches the JAR, loads and verifies bytecodes into an internal representation, and runs the bytecode form directly
● After many invocations (or via sampling), code gets compiled at the 'cold' or 'warm' level
● An internal, low-overhead sampling thread is used to identify frequently used methods
● Methods may get recompiled at the 'hot' or 'scorching' levels (for more optimizations)
● The transition to 'scorching' goes through a temporary profiling step

[Diagram: compilation levels – interpreter, cold, warm, hot, profiling, scorching]

The JIT takes a holistic view of the application, looking for global optimizations based on actual usage patterns and speculative assumptions.
What a difference a JIT makes...
export IBM_JAVA_OPTIONS="-Xint" to run without it – see the difference for yourself
Writing JIT-friendly code: guidelines
● Using type ascription
● Avoiding ambiguities
● Preferring val/final and private
● Reducing non-obvious polymorphism
● Avoiding collections of AnyRef
● Avoiding JNI
Can we tune a JDK to work well with Spark?

IBM JDK8 SR3 vs OpenJDK 8 (1/geometric mean of HiBench time, zLinux, 32 cores, 25G heap):

            IBM JDK8 SR3 (tuned)   IBM JDK8 SR3 (out of the box)
PageRank    160%                   148%
Sleep       187%                   113%
Sort        103%                   147%
WordCount   130%                   146%
Bayes       100%                    91%
Terasort    160%                   131%

Improvements in successive IBM Java 8 releases: performance compared with OpenJDK 8 reached 1.35x (HiBench huge, Spark 2.0.1, Linux POWER8, 12 core × 8-way SMT).
Contributing back changes to Spark core
● [SPARK-18231]: optimising the SizeEstimator

Hot methods in these classes with PageRank:
● [SPARK-18196]: optimising CompactBuffer
● [SPARK-18197]: optimising AppendOnlyMap
● [SPARK-18224]: optimising PartitionedPairBuffer

Blog post for more details here
Takeaways
● Profile Lots In Pre-production (PLIP) – our tools will help
● Not all Java implementations are the same
● Remember to focus on what's hot in the profiles! Make a change, rebuild, reprofile, repeat
● There are many ways to achieve the same goal in Scala: use convenient code in most places and simple imperative code for what's critical
Beyond optimum code...
We can only get so far writing fast code, so next I'll talk about RDMA for fast networking and how we can use GPUs for fast processing.
Remote Direct Memory Access (RDMA)
● A feature available in our SDK for Java: Java Sockets over RDMA (JSoR)
● Requires an RDMA-capable network adapter
● Investigating other RDMA implementations so we can avoid marshalling and data (de)serialization costs
● Breaking Sorting World Records with RDMA
● Getting started with RDMA
[Diagram: two Spark nodes, each with a Spark VM buffer and an off-heap buffer, connected through RDMA NIC/HCAs and an Ethernet/InfiniBand switch; B-Copy paths go through the OS, while Z-Copy paths DMA directly between the NICs]

Acronyms:
Z-Copy – Zero Copy
B-Copy – Buffer Copy
IB – InfiniBand
Ether – Ethernet
NIC – Network Interface Card
HCA – Host Channel Adapter

● Low-latency, high-throughput networking
● Direct 'application to application' memory pointer exchange between remote hosts
● Off-loads network processing to the RDMA NIC/HCA – OS/kernel bypass (zero-copy)
● Introduces new IO characteristics that can influence the Apache Spark transfer plan
[Chart: TCP/IP vs RDMA – RDMA exhibits improved throughput and reduced latency]

Our JVM makes RDMA available transparently via the java.net.Socket APIs (JSoR), or explicitly via com.ibm jVerbs calls.
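Because JSoR sits behind java.net.Socket, application code needn't change; a minimal sketch (the hostname, port, and the enabling configuration are placeholders – see IBM's JSoR documentation for the exact JVM settings):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class EchoClient {
    public static void main(String[] args) throws Exception {
        // Ordinary socket code: with the IBM SDK and a JSoR configuration
        // naming RDMA-capable interfaces, this same path can use RDMA
        try (Socket socket = new Socket("server.example.com", 9999);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println("ping");
            System.out.println(in.readLine());
        }
    }
}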
Spark HiBench TeraSort [30GB] – elapsed time with 30 GB of data and a 32 GB executor, on 32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8:

TCP/IP: 556s
JSoR: 159s
TPC-H benchmark, 100Gb: 30% improvement in database operations
Shuffle-intensive benchmarks show 30% – 40% better performance with RDMA
HiBench PageRank, 3Gb: 40% faster, lower CPU usage
(32 cores: 1 master, 4 nodes x 8 cores/node, 32GB Mem/node)
Fast computation: Graphics Processing Units
Why?
● Faster computation of results, or the ability to process more data in the same amount of time – we want to improve the accuracy of systems and free up CPUs for boring work
● GPUs are becoming available in servers and many modern computers for us to use
● Drivers and SDKs are freely available
How popular is Java?
[Slide shows Java running across IBM products such as z13 and BigInsights]
Who's interested in GPUs?
AlphaGo: 1,202 CPUs, 176 GPUs
Titan: 18,688 GPUs, 18,688 CPUs
CERN and Geant: reported to be using GPUs
Oak Ridge and IBM, "the world's fastest supercomputers by 2017": two, $325m
Databricks: a recent blog post mentions deep learning with GPUs and Spark
GPUs excel at executing many of the same operations at once (Single Instruction Multiple Data programming).

We'll program them using CUDA or OpenCL – like C and C++ – and we'll write JNI code so the GPU can access the data in our Java world.

We'll run code on computers that ship with graphics cards; free CUDA drivers exist for x86-64 Windows, Linux, and IBM's Power LE, and OpenCL drivers, SDKs and source are also widely available.

[Diagram: CPU vs GPU core layout]
How do we use a GPU?
Assume we have an integer array in CUDA C called myData:
1. Allocate space on the GPU (device side) using cudaMalloc; this returns a pointer we'll use later – let's call this variable myDataOnGPU
2. Copy myData from the host to your allocated space (myDataOnGPU) using cudaMemcpy with cudaMemcpyHostToDevice
3. Process your data on the GPU in a kernel (we use <<< and >>>)
4. Copy the result back (what's at myDataOnGPU replaces myData on the host) using cudaMemcpy with cudaMemcpyDeviceToHost
__global__ void addingKernel(int* array1, int* array2){
    array1[threadIdx.x] += array2[threadIdx.x];
}

__global__? Marks a kernel: a function we launch from the host (CPU) that executes on the device (GPU)

How is the data arranged and how can I access it?
A kernel runs on a grid (blocks × threads); that's how we run many threads that each work on a different part of the data

int*? A pointer to the integers we've copied to the GPU

threadIdx.x? We use this as an index into our array – remember, lots of threads run on the GPU, and in this example each one accesses one item
How would we use a GPU with Java or Scala?
● Assume we have an integer array on the Java heap: myData
● Create a native method in Java or Scala
● Write .cpp or .c code with a matching signature for your native method
● In your native code, use JNI to get a pointer to your data
● With this pointer, we can figure out how much memory we need
● Allocate space on the GPU (device side): cudaMalloc returns myDataOnTheGPU
● Copy myData to your allocated space (myDataOnTheGPU) using cudaMemcpyHostToDevice
● Process your data on the GPU in a kernel (look for <<< and >>>)
● Copy the result back (what's now at myDataOnTheGPU replaces myData on the host) using cudaMemcpyDeviceToHost
● Release the elements (updating your JNI pointer so the data in our JVM heap is now the result)
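The Java side of those steps is small; a sketch where the class, method, and library names are hypothetical – the native half would hold the JNI and CUDA calls listed above:

public class GpuDoubler {
    static {
        // Hypothetical library name; the native side is the .c/.cpp + .cu code described above
        System.loadLibrary("gpudoubler");
    }

    // The matching C signature would be:
    // JNIEXPORT void JNICALL Java_GpuDoubler_doubleOnGpu(JNIEnv*, jclass, jintArray)
    public static native void doubleOnGpu(int[] myData);

    public static void main(String[] args) {
        int[] myData = {1, 2, 3, 4};
        doubleOnGpu(myData); // native code copies to the GPU, runs the kernel, copies back
        System.out.println(java.util.Arrays.toString(myData));
    }
}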
Easier ways?
Making it simple: Java class library modification
Our option: -Dcom.ibm.gpu.enable/enforce/disable

[Chart: sorting throughput for ints – ints sorted per second (from 40m to 400m per second) against array length (30,000 to 300,000,000)]

Details online here
Making it simple: Java JIT compiler modification
Our option: -Xjit:enableGPU
Use an IntStream and specify our JIT option; primitive types can be used (byte, char, short, int, float, double, long)
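A minimal sketch of the pattern (the full Lambda listing is in the backup slides): a parallel IntStream over a primitive array, which the JIT may consider for GPU offload when -Xjit:enableGPU is set on the IBM SDK.

import java.util.stream.IntStream;

public class JitGpuDoubler {
    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Eligible parallel forEach bodies over primitive arrays are GPU candidates
        IntStream.range(0, data.length).parallel().forEach(i -> {
            data[i] = data[i] * 2;
        });
        System.out.println(data[1]); // 2
    }
}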
Performance of GPU-enabled lambdas
We measured the performance improvement with a GPU using four programs, against:
● 1-CPU-thread sequential execution
● 160-CPU-thread parallel execution

Experimental environment: IBM Java 8 Service Release 2 for PowerPC Little Endian; two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total), with one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off)
Name       Summary                                     Data size                   Data type
MM         Dense matrix multiplication: C = A.B        1024 x 1024 (1m) items      double
SpMM       As above, sparse matrix                     500k x 500k (250m) items    double
Jacobi2D   Solve an equation using the Jacobi method   8192 x 8192 (67m) items     double
LifeGame   Conway's Game of Life with 10k iterations   512 x 512 (262k) items      byte
[Chart] This shows the GPU execution time speedup compared to what's in blue (1 CPU thread) and yellow (160 CPU threads). The higher the bar, the bigger the speedup!
Making it simple: the CUDA4J API
Similar to JCuda but provides a higher-level abstraction; production ready and supported by us
● No arbitrary and unrestricted use of Pointer(long)
● Still feels like Java instead of C

Write your kernel and compile it into a fat binary:
nvcc --fatbin AdamKernel.cu

Add your Java code:
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;

Load your fat binary:
module = new Loader().loadModule("AdamDoubler.fatbin", device);
We're only doubling integers here, but it could be any use case where we do the same operation to lots of elements at once. The full code listing is at the end; for Javadocs, search for the IBM Java 8 API com.ibm.cuda.

* Tip: the offsets are byte offsets, so you'll want your Java index * the size of the object!

Our kernel compiles into AdamDoubler.fatbin:

module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);

numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
buffer1.copyTo(myData);

If our dynamically created grid dimensions are too big, we need to break down the problem and use the slice* API: see doChunkingProblem()
Improving MLlib
● Recommendation algorithms, such as Alternating Least Squares
  ● Movie recommendations on Netflix
  ● Recommended purchases on Amazon
  ● Similar songs with Spotify
● Clustering algorithms, such as K-means (unsupervised learning)
  ● Produce clusters from data to determine which cluster a new item can be categorised as
  ● Identify anomalies: transaction fraud or erroneous data
● Classification algorithms, such as logistic regression
  ● Create a model that we can use to predict where to plot the next item in a sequence
  ● Healthcare: predict adverse drug reactions based on known interactions between similar drugs
Improving Alternating Least Squares
● An under-the-covers optimisation: set the spark.mllib.ALS.useGPU property
● Full paper: http://arxiv.org/abs/1603.03820
● Full implementation: https://github.com/IBMSparkGPU

Netflix 1.5 GB        12 threads, CPU   64 threads, CPU   GPU
Intel, IBM Java 8     676 seconds       N/A               140 seconds

It currently always sends work to a GPU regardless of size – remember we have limited device memory!
2x Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, 16 cores in the machine (SMT-2), 256 GB RAM vs 2x NVIDIA Tesla K80Ms. Also available for Power LE.
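Enabling it is a single property; a minimal sketch, assuming the property accepts a boolean string (the property name is from this slide, the value format is my assumption – check the IBMSparkGPU repository):

import org.apache.spark.SparkConf;

public class AlsGpu {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("als-gpu")
                .set("spark.mllib.ALS.useGPU", "true"); // assumed value format
        System.out.println(conf.get("spark.mllib.ALS.useGPU"));
    }
}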
We modified the existing ALS (.scala) implementation's computeFactors method:
● Added code to check if spark.mllib.ALS.useGPU is set
● If set, we'll then call our native method written to use JNI (.cpp)
● Our JNI method calls a native CUDA (.cu) method
● CUDA is used to send our data to the GPU, call our kernel, and return the results over JNI back to the Java heap
● Built with our Spark distribution, and the shared library is included: libGPUALS.so
● Remember this will require the CUDA runtime (libcudart) and a capable GPU

[Call chain: ALS.scala (computeFactors) → CuMFJNIInterface.cpp → ALS.cu → libGPUALS.so]
Pervasive GPU opportunities for Spark
We can send code to a GPU with APIs, or make substantial changes to existing implementations, but we can also make our changes at a higher level to be more pervasive.

Input: a user application using DataFrames or Datasets, with data stored in Parquet format for now
✔ Spark with Tungsten: uses UnsafeRow and sun.misc.Unsafe; the idea is to bring Spark closer to the hardware than previously, exploit CPU caches, improve memory and CPU efficiency, reduce GC times, and avoid Java object overheads – good deep dive here
✔ Spark with Catalyst: the optimiser for the Spark SQL APIs (good deep dive here); transforms a query plan (an abstraction of a user's program) into an optimised version and generates optimised code with the Janino compiler
✔ Spark with our changes: Java and core Spark class optimisations, optimised JIT
Output: generated code able to leverage auto-SIMD and GPUs

We want generated code that:
✔ has a counted loop, e.g. one controlled by an automatic induction variable that increases from a lower to an upper bound
✔ accesses data in a linear fashion
✔ has as few branches as possible (simple for the GPU's kernel)
✔ does not have external method calls, or contains only calls that can be easily inlined

These help a JIT to either use auto-SIMD capabilities or GPUs.
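For illustration, a hand-written Java loop in exactly that shape (my analogue of what the generated code should look like, not actual Catalyst output):

public class SaxpyLoop {
    // Counted loop, linear access, no branches, no external calls:
    // the shape a JIT can auto-vectorise or consider for GPU offload
    static void saxpy(float a, float[] x, float[] y, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {4f, 5f, 6f};
        float[] out = new float[3];
        saxpy(2f, x, y, out);
        System.out.println(java.util.Arrays.toString(out)); // [6.0, 9.0, 12.0]
    }
}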
Problems
1) The data representation of columnar storage (CachedBatch with Array[Byte]) isn't commonly used
2) Compression schemes are specific to CachedBatch, limited to just several data types
3) Building the in-memory cache involves a long code path -> virtual method calls, conditional branches
4) Generated whole-stage code -> unnecessary conversion from CachedBatch or ColumnarBatch to UnsafeRow

Solutions
1) Use the ColumnarBatch format instead of CachedBatch for the in-memory cache generated by the cache() method; ColumnarBatch and ColumnVector are commonly used data representations for columnar storage
2) Use a common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
3) Generate code at runtime that is simple and specialized for building a concrete instance of the in-memory cache
4) Generate whole-stage code that directly reads data from columnar storage

(1) and (2) increase code reuse, (3) improves the runtime performance of executing the cache() method, and (4) improves the performance of user-defined DataFrame and Dataset operations
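For context, everything above is triggered by an ordinary cache() call; a minimal Java sketch, assuming a hypothetical Parquet path:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cache-example").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/path/to/data.parquet");
        df.cache();                     // declares the in-memory columnar cache
        System.out.println(df.count()); // the first action materialises it
        spark.stop();
    }
}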
We propose a new columnar format, CachedColumnarBatch, that has a pointer to a ColumnarBatch (used by the Parquet reader) which keeps each column as an OnHeapUnsafeColumnVector instead of an OnHeapColumnVector. Not yet using GPUs!

● [SPARK-13805], merged into 2.0, performance improvement: 1.2x
  Get data from a ColumnVector directly, avoiding a copy from ColumnVector to UnsafeRow when a program reads data in Parquet format
● [SPARK-14098], will be merged into 2.2, performance improvement: 3.4x
  Generate optimized code to build a CachedColumnarBatch, get data from a ColumnVector directly (avoiding a copy from the ColumnVector to UnsafeRow), and use lz4 to compress the ColumnVector when df.cache() or ds.cache() is executed
● [SPARK-15962], merged into 2.1, performance improvement: 1.7x
  Remove the indirection at the offsets field when accessing each element in UnsafeArrayData, reducing the memory footprint of UnsafeArrayData
● [SPARK-16043], performance improvement: 1.2x
  Use a Scala primitive array (e.g. Array[Int]) instead of Array[Any] to avoid boxing operations when putting a primitive array into GenericArrayData
● [SPARK-15985], merged into 2.1, performance improvement: 1.3x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a Dataset program with a primitive array is run
● [SPARK-16213], to be merged into 2.2, performance improvement: 16.6x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is run
● [SPARK-17490], merged into 2.1, performance improvement: 2.0x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is used
What's in it for me?
● Improving a commonly used API and contributing the code
● Ensuring generated code is in the right format for exploitation
● Making it simple for any Spark user to exploit hardware accelerators, be it GPUs or auto-SIMD code for the latest processors
● We know how to build GPU-based applications
● We can figure out if a GPU is available
● We can figure out what code to generate
● We can figure out which GPU to send that code to
● All while retaining Java safety features such as exceptions, bounds checking, serviceability, and tracing and profiling hooks
● Assuming you have the hardware: add an option and watch performance improve – this is our goal
Wrapping it all up...
● We provide an optimised JDK-with-Spark bundle that includes hardware offloading, profiling, and a tuned JIT, and it is under constant development
● We can talk more about performance aspects not covered here: FPGAs, CAPI flash, an improved serializer, GC optimisations, object layout, monitoring...
● Upcoming blog post at spark.tc outlining the Catalyst-related work
● Look out for more pull requests and involvement from IBM; we want to improve performance for everybody and maintain Spark's status
● Open to ideas and wanting to work in communities for everyone's benefit

http://ibm.biz/spark-kit
Feedback and suggestions welcome: aroberts@uk.ibm.com
Backup slides, code listing, legal information and disclaimers beyond this point
CUDA core: part of the GPU; CUDA cores execute groups of threads
Kernel: a function we'll run on the GPU
Grid: think of it as a CUBE of BLOCKS which lay out THREADS; our GPU functions (KERNELS) run on one of these, and we need to know the grid dimensions for each kernel
Threads: these do our computation; many more are available than on CPUs
Blocks: groups of threads

Recommended reading: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy
The nvidia-smi command tells you about your GPU's limits. One GPU can have MANY CUDA cores, and each CUDA core executes many threads.
CUDA grid: why is this important?
To achieve parallelism: a layout of threads we can use to solve our big data problems

Block dimensions? How many threads can run on a block
Grid dimensions? How many blocks we can have
threadIdx.x? (BLOCKS contain THREADS) A built-in variable giving the current x coordinate of a given THREAD (it can have y and z coordinates too)
blockIdx.x? (GRIDS contain BLOCKS) A built-in variable giving the current x coordinate of a given BLOCK (it can have y and z coordinates too)
For figuring out the dimensions we can use the following Java code; we want 512 threads per block and as many blocks as needed for the problem size:

int log2BlockDim = 9;                                // 2^9 = 512 threads per block
int numBlocks = (numElements + 511) >> log2BlockDim; // round up to a whole number of blocks
int numThreads = 1 << log2BlockDim;                  // 512

Size        Blocks   Threads
500         1        512
1,024       2        512
32,000      63       512
64,000      125      512
100,000     196      512
512,000     1,000    512
1,024,000   2,000    512
CUDA4J sample
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;
public class Sample {
private static final boolean PRINT_DATA = false;
private static int numElements;
private static int[] myData;
private static CudaBuffer buffer1;
private static CudaDevice device = new CudaDevice(0);
private static CudaModule module;
private static CudaKernel kernel;
private static CudaStream stream;
public static void main(String[] args) {
try {
module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);
doSmallProblem();
doMediumProblem();
doChunkingProblem();
} catch (CudaException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
private static void doSmallProblem() throws Exception {
System.out.println("Doing the small sized problem");
numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
}
private static void doMediumProblem() throws Exception {
System.out.println("Doing the medium sized problem");
numElements = 5_000_000;
myData = new int[numElements];
Util.fillWithInts(myData);
// This is only when handling more than max blocks * max threads per kernel
// Grid dim is the number of blocks in the grid
// Block dim is the number of threads in a block
// buffer1 is how we'll use our data on the GPU
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
// myData is on CPU, transfer it
buffer1.copyFrom(myData);
// Our stream executes the kernel, can launch many streams at once
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX +
">>>");
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1,
numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
}
private static void doChunkingProblem() throws Exception {
// I know 5m doesn't require chunking on the GPU but this does
System.out.println("Doing the too big to handle in one kernel problem");
numElements = 70_000_000;
myData = new int[numElements];
Util.fillWithInts(myData);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
// Check we can actually launch a kernel with this grid size
try {
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[numElements];
System.arraycopy(myData, 0, originalArrayCopy, 0, numElements);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
} catch (CudaException ce) {
if (ce.getMessage().equals("invalid argument")) {
System.out.println("it was invalid argument, too big!");
int maxThreadsPerBlockX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_BLOCK_DIM_X);
int maxBlocksPerGridX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_GRID_DIM_Y); // note: uses the Y-dimension limit (65,535) as a conservative block count
long maxThreadsPerGrid = maxThreadsPerBlockX * maxBlocksPerGridX;
// 67,107,840 on my Windows box
System.out.println("Max threads per grid: " + maxThreadsPerGrid);
long numElementsAtOnce = maxThreadsPerGrid;
long elementsDone = 0;
grid = new CudaGrid(maxBlocksPerGridX, maxThreadsPerBlockX, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
while (elementsDone < numElements) {
if ( (elementsDone + numElementsAtOnce) > numElements) {
numElementsAtOnce = numElements - elementsDone; // Just do the remainder
}
long toOffset = numElementsAtOnce + elementsDone;
// It's the byte offset not the element index offset
CudaBuffer slicedSection = buffer1.slice(elementsDone * Integer.BYTES, toOffset * Integer.BYTES);
Parameters kernelParams = new Parameters(2).set(0, slicedSection).set(1, numElementsAtOnce);
kernel.launch(grid, kernelParams);
elementsDone += numElementsAtOnce;
}
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
} else {
System.out.println(ce.getMessage());
}
}
}
CUDA4J kernel
#include <stdint.h>
#include <stdio.h>
/**
* 2D grid so we can have 1024 threads and many blocks
* Remember 1 grid -> has blocks/threads and one kernel runs on one grid
* In CUDA 6.5 we have cudaOccupancyMaxPotentialBlockSize which helps
*
* Let's say we have 100 ints to double, keeping it simple
* Assume we want to run with 256 threads at once
* For this size our kernel will be set up as follows
* 1 grid, 1 block, 512 threads
* blockDim.x is going to be 1
* threadIdx.x will remain at 0
* threadIdx.y will range from 0 to 512
* So we'll go from 1 to 512 and we'll limit access to how many elements we
have
*/
extern "C" __global__ void Cuda_cuda4j_AdamDoubler(int* toDouble, int
numElements){
int index = blockDim.x * threadIdx.x + threadIdx.y;
if (index < numElements) { // Don't go out of bounds
toDouble[index] *= 2; // Just double it
}
}
extern "C" __global__ void Cuda_cuda4j_AdamDoubler_Strider(int* toDouble,
int numElements){
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < numElements) { // don't go overboard
toDouble[i] *= 2;
}
}
Lambda example
import java.util.stream.IntStream;
public class Lambda {
private static long startTime = 0;
// -Xjit:enableGPU is our JVM option
public static void main(String[] args) {
boolean timeIt = true;
int numElements = 500_000_000;
int[] toDouble = new int[numElements];
Util.fillWithInts(toDouble);
myDoublerWithALambda(toDouble, timeIt);
double[] toHalf = new double[numElements];
Util.fillWithDoubles(toHalf);
myHalverWithALambda(toHalf, timeIt);
double[] toRandomFunc = new double[numElements];
Util.fillWithDoubles(toRandomFunc);
myRandomFuncWithALambda(toRandomFunc, timeIt);
}
private static void myDoublerWithALambda(int[] myArray, boolean timeIt) {
if (timeIt) startTime = System.currentTimeMillis();
IntStream.range(0, myArray.length).parallel().forEach(i -> {
myArray[i] = myArray[i] * 2; // Done on GPU for us
});
if (timeIt) {
System.out.println("Done doubling with a lambda, time taken: " +
(System.currentTimeMillis() - startTime) + " milliseconds");
}
}
private static void myHalverWithALambda(double[] myArray, boolean timeIt)
{
if (timeIt) startTime = System.currentTimeMillis();
IntStream.range(0, myArray.length).parallel().forEach(i -> {
myArray[i] = myArray[i] / 2; // Again on GPU
});
if (timeIt) {
System.out.println("Done halving with a lambda, time taken: " +
(System.currentTimeMillis() - startTime) + " milliseconds");
}
}
private static void myRandomFuncWithALambda(double[] myArray, boolean
timeIt) {
if (timeIt) startTime = System.currentTimeMillis();
IntStream.range(0, myArray.length).parallel().forEach(i -> {
myArray[i] = myArray[i] * 3.142; // Double so we don't lose precision
});
if (timeIt) {
System.out.println("Done with the random func with a lambda, time
taken: " +
(System.currentTimeMillis() - startTime) + " milliseconds");
}
}
}
Utility methods
import com.ibm.cuda.*;
public class Util {
protected static void fillWithInts(int[] toFill) {
for (int i = 0; i < toFill.length; i++) {
toFill[i] = i;
}
}
protected static void fillWithDoubles(double[] toFill) {
for (int i = 0; i < toFill.length; i++) {
toFill[i] = i;
}
}
protected static void printArray(int[] toPrint) {
System.out.println();
for (int i = 0; i < toPrint.length; i++) {
if (i == toPrint.length - 1) {
System.out.print(toPrint[i] + ".");
} else {
System.out.print(toPrint[i] + ", ");
}
}
System.out.println();
}
protected static CudaGrid makeGrid(int numElements, CudaStream stream) {
int numThreads = 512;
int numBlocks = (numElements + (numThreads - 1)) / numThreads;
return new CudaGrid(numBlocks, numThreads, stream);
}
/*
* Array will have been doubled at this point
*/
protected static void checkArrayResultsDoubler(int[] toCheck, int[] originalArray) {
long errorCount = 0;
// Check result, data has been copied back here
if (toCheck.length != originalArray.length) {
System.err.println("Something's gone horribly wrong, different array length");
}
for (int i = 0; i < originalArray.length; i++) {
if (toCheck[i] != (originalArray[i] * 2) ) {
errorCount++;
/*
System.err.println("Got an error, " + originalArray[i] +
" is incorrect: wasn't doubled correctly!" +
" Got " + toCheck[i] + " but should be " + originalArray[i] * 2);
*/
} else {
//System.out.println("Correct, doubled " + originalArray[i] + " and it became " +
toCheck[i]);
}
}
System.err.println("Incorrect results: " + errorCount);
}
}
CUDA4J module loader
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import com.ibm.cuda.CudaDevice;
import com.ibm.cuda.CudaException;
import com.ibm.cuda.CudaModule;
public class Loader {
private final CudaModule.Cache moduleCache = new CudaModule.Cache();
CudaModule loadModule(String moduleName, CudaDevice device) throws CudaException,
IOException {
CudaModule module = moduleCache.get(device, moduleName);
if (module == null) {
try (InputStream stream = getClass().getResourceAsStream(moduleName)) {
if (stream == null) {
throw new FileNotFoundException(moduleName);
}
module = new CudaModule(device, stream);
moduleCache.put(device, moduleName, module);
}
}
return module;
}
}
CUDA4J build script on Windows
nvcc -fatbin AdamDoubler.cu
"C:\ibm8sr3ga\sdk\bin\java" -version
"C:\ibm8sr3ga\sdk\bin\javac" *.java
"C:\ibm8sr3ga\sdk\bin\java" -Xmx2g Sample
"C:\ibm8sr3ga\sdk\bin\java" -Xmx4g Lambda
"C:\ibm8sr3ga\sdk\bin\java" -Xjit:enableGPU={verbose} -Xmx4g Lambda

Set the PATH to include the CUDA library. For example, set PATH=<CUDA_LIBRARY_PATH>;%PATH%, where the <CUDA_LIBRARY_PATH> variable is the full path to the CUDA library. The <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin, which assumes CUDA is installed to the default directory.

Note: If you are using Just-In-Time Compiler (JIT) based GPU support, you must also include paths to the NVIDIA Virtual Machine (NVVM) library and to the NVIDIA Management Library (NVML). For example, the <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin;<NVVM_LIBRARY_PATH>;<NVML_LIBRARY_PATH>.

If the NVVM library is installed to the default directory, the <NVVM_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\nvvm\bin. You can find the NVML library in your NVIDIA drivers directory; the default location of this directory is C:\Program Files\NVIDIA Corporation\NVSMI.

From IBM's Java 8 docs – environment example; see the docs for details
Notices and Disclaimers
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM
shall have no responsibility to update this information. THIS document is distributed "AS IS" without any warranty, either express
or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to,
loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the
terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not
necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither
intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or
represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (con’t)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this publication
and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not
warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s
products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties
of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document
Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG,
LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,
PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,
urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International
Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might
be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and
trademark information" at: www.ibm.com/legal/copytrade.shtml.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their
respective owners.
Databricks is a registered trademark of Databricks, Inc.
Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Spark, Apache, any other Apache project
mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software
Foundation
More Related Content

Similar to Apache Spark Performance Observations

IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkAdamRobertsIBM
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016Tim Ellison
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
IBM Power for Life Sciences
IBM Power for Life SciencesIBM Power for Life Sciences
IBM Power for Life SciencesDavid Spurway
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsGareth Rogers
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...Amazon Web Services
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooksAndrey Vykhodtsev
 

Similar to Apache Spark Performance Observations (20)

IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
IBM Power for Life Sciences
IBM Power for Life SciencesIBM Power for Life Sciences
IBM Power for Life Sciences
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech Analystics
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 

Recently uploaded

Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Apache Spark Performance Observations

  • 12. © 2016 IBM Corporation 12
Know your types
● Takeaway is to avoid boxing/unboxing (involves object allocation) – avoid collections of type AnyRef!
● Convert to AnyRef with care
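To make the boxing cost concrete, here is a minimal Scala sketch (the collection contents are illustrative): an Array[Int] stores primitives directly, while a List[AnyRef] forces each element to be boxed as a java.lang.Integer.

// Primitives stay unboxed: fast, cache-friendly, no per-element allocation
val primitives: Array[Int] = Array(1, 2, 3, 4, 5, 6)
val doubled: Array[Int] = primitives.map(_ * 2)

// Upcasting to AnyRef boxes every element: one object allocation per item
val boxed: List[AnyRef] = List(1, 2, 3, 4, 5, 6).map(Int.box)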
  • 13. © 2016 IBM Corporation 13
Observations with Java options
● Max heap size, initial heap size and quickstart can make a big difference – for Spark 2 we've noticed that a matching -Xms and -Xmx improves performance on HiBench and SparkSqlPerf
● OpenJDK has a method size bytecode limit for the JIT; ours does not. If you do use OpenJDK, try -XX:-DontCompileHugeMethods if you find certain queries become very slow
● Experiment, then profile – spend your time in what's actually used the most, not nitpicking over barely used code paths!
● spark-env.sh for environment variables
● spark-defaults.conf for Spark settings
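As a concrete starting point, here is a minimal Scala sketch of those settings (the 4g sizing is illustrative; the same properties can equally go in spark-defaults.conf). Note that Spark rejects -Xmx in extraJavaOptions – spark.executor.memory controls -Xmx, so we match -Xms to it instead.

import org.apache.spark.SparkConf

// A minimal sketch – sizes are illustrative, tune for your cluster
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")               // sets executor -Xmx
  .set("spark.executor.extraJavaOptions", "-Xms4g") // matching -Xms
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")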
  • 14. © 2016 IBM Corporation 14
● The VM searches the JAR, loads and verifies bytecodes to an internal representation, and runs the bytecode form directly
● After many invocations (or via sampling) code gets compiled at the ‘cold’ or ‘warm’ level
● An internal, low-overhead sampling thread is used to identify frequently used methods
● Methods may get recompiled at the ‘hot’ or ‘scorching’ levels (for more optimizations)
● Transition to ‘scorching’ goes through a temporary profiling step
[Diagram: compilation pipeline – interpreter → cold → warm → hot → profiling → scorching]
Java's intermediate bytecodes are compiled as required and based on runtime profiling – code is compiled 'just in time' as required; dynamic compilation can determine the target machine capabilities and app demands. The JIT takes a holistic view of the application, looking for global optimizations based on actual usage patterns and speculative assumptions.
  • 15. © 2016 IBM Corporation 15
What a difference a JIT makes...
export IBM_JAVA_OPTIONS="-Xint" to run without it, see the difference for yourself
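For a quick self-contained demonstration, a minimal Scala sketch like the following (the loop bound is illustrative) typically runs dramatically slower when interpreted with -Xint than with the JIT enabled:

object JitDemo {
  def main(args: Array[String]): Unit = {
    val start = System.currentTimeMillis()
    var sum = 0L
    var i = 0
    while (i < 500000000) { // hot loop the JIT will compile
      sum += i
      i += 1
    }
    println(s"sum=$sum took ${System.currentTimeMillis() - start} ms")
  }
}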
  • 16. © 2016 IBM Corporation 16
Writing JIT friendly code guidelines (see the sketch below)
● Using type ascription
● Avoiding ambiguities
● Preferring val/final and private
● Reducing non-obvious polymorphism
● Avoiding collections of AnyRef
● Avoiding JNI
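A minimal sketch of a few of these guidelines in Scala (class and member names are illustrative):

class Accumulator {
  private val scale: Int = 2          // val + private: easy for the JIT to fold and inline

  // Small private method with an explicit (ascribed) return type
  private def scaled(v: Int): Int = v * scale

  def total(values: Array[Int]): Long = {
    var sum: Long = 0L                // type ascription on the local
    var i = 0
    while (i < values.length) {       // simple counted loop over a primitive array
      sum += scaled(values(i))
      i += 1
    }
    sum
  }
}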
  • 17. © 2016 IBM Corporation 17
Can we tune a JDK to work well with Spark?
Improvements in successive IBM Java 8 releases (1/geometric mean of HiBench time on zLinux, 32 cores, 25G heap):

Workload    IBM JDK8 SR3 (tuned)   IBM JDK8 SR3 (out of the box)
PageRank    160%                   148%
Sleep       187%                   113%
Sort        103%                   147%
WordCount   130%                   146%
Bayes       100%                    91%
Terasort    160%                   131%

Performance compared with OpenJDK 8: 1.35x (HiBench huge, Spark 2.0.1, Linux Power8 12 core * 8-way SMT)
  • 18. © 2016 IBM Corporation 18
Contributing back changes to Spark core
● [SPARK-18231]: optimising the SizeEstimator
Hot methods in these classes with PageRank:
● [SPARK-18196]: optimising CompactBuffer
● [SPARK-18197]: optimising AppendOnlyMap
● [SPARK-18224]: optimising PartitionedPairBuffer
Blog post for more details here
  • 19. © 2016 IBM Corporation 19
Takeaways
● Profile Lots In Pre-production (PLIP) – our tools will help
● Not all Java implementations are the same
● Remember to focus on what's hot in the profiles! Make a change, rebuild and reprofile, repeat
● Many ways to achieve the same goal in Scala: use convenient code in most places and simple imperative code for what's critical
  • 20. © 2016 IBM Corporation 20
Beyond optimum code...
We can only get so far by writing fast code; next I'll talk about RDMA for fast networking and how we can use GPUs for fast processing.
  • 21. © 2016 IBM Corporation 21
Remote Direct Memory Access (RDMA)
● Feature available in our SDK for Java: Java Sockets over RDMA
● Requires an RDMA-capable network adapter
● Investigating other RDMA implementations so we can avoid marshalling and data (de)serialization costs
● Breaking Sorting World Records with RDMA
● Getting started with RDMA
  • 22. © 2016 IBM Corporation 22
[Diagram: Spark node #1 and Spark node #2, each with a Spark VM buffer and an off-heap buffer; RDMA NIC/HCAs connected through an Ethernet/InfiniBand switch; zero-copy DMA transfers bypass the OS, buffer copies happen between the VM and off-heap buffers]
Acronyms: Z-Copy – Zero Copy; B-Copy – Buffer Copy; IB – InfiniBand; Ether – Ethernet; NIC – Network Interface Card; HCA – Host Control Adapter
● Low-latency, high-throughput networking
● Direct 'application to application' memory pointer exchange between remote hosts
● Off-load network processing to the RDMA NIC/HCA – OS/kernel bypass (zero-copy)
● Introduces new IO characteristics that can influence the Apache Spark transfer plan
  • 23. © 2016 IBM Corporation 23
[Chart: TCP/IP vs RDMA]
RDMA exhibits improved throughput and reduced latency. Our JVM makes RDMA available transparently via the java.net.Socket APIs (JSoR) or explicitly via com.ibm jVerbs calls.
  • 24. © 2016 IBM Corporation 24
Elapsed time for Spark HiBench TeraSort [30GB] with 30 GB of data and a 32 GB executor; 32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8:
TCP/IP: 556s
JSoR:   159s
  • 25. © 2016 IBM Corporation 25
Shuffle-intensive benchmarks show 30% - 40% better performance with RDMA
● TPC-H benchmark, 100Gb: 30% improvement in database operations
● HiBench PageRank, 3Gb: 40% faster, lower CPU usage
32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node)
  • 26. © 2016 IBM Corporation 26
Fast computation: Graphics Processing Units
Why?
● Faster computation of results, or the ability to process more data in the same amount of time – we want to improve the accuracy of systems and free up CPUs for boring work
● GPUs are becoming available in servers and many modern computers for us to use
● Drivers and SDKs freely available
  • 27. © 2016 IBM Corporation 27
How popular is Java?
[Images: IBM z13, BigInsights]
  • 28. © 2016 IBM Corporation 28
Who's interested in GPUs?
● AlphaGo: 1,202 CPUs, 176 GPUs
● Titan: 18,688 GPUs, 18,688 CPUs
● CERN and Geant: reported to be using GPUs
● Oak Ridge, IBM, “the world's fastest supercomputers by 2017”: two, $325m
● Databricks: recent blog post mentions deep learning with GPUs and Spark
  • 29. © 2016 IBM Corporation 29
GPUs excel at executing many of the same operations at once (Single Instruction Multiple Data programming)
We'll program using CUDA or OpenCL – like C and C++ – and we'll write JNI code to access data in our Java world using the GPU
We'll run code on computers that are shipped with graphics cards; there are free CUDA drivers for x86-64 Windows, Linux, and IBM's Power LE. OpenCL drivers, SDK and source are also widely available
[Diagram: CPU vs GPU core layout]
  • 30. © 2016 IBM Corporation 30
How do we use a GPU?
Assume we have an integer array in CUDA C called myData
● Allocate space on the GPU (device side) using cudaMalloc; this returns a pointer we'll use later. Let's say we call this variable myDataOnGPU
● Copy myData from the host to your allocated space (myDataOnGPU) using cudaMemcpyHostToDevice
● Process your data on the GPU in a kernel (we use <<< and >>>)
● Copy the result back (what's at myDataOnGPU replaces myData on the host) using cudaMemcpyDeviceToHost
  • 31. © 2016 IBM Corporation 31

__global__ void addingKernel(int* array1, int* array2){
    array1[threadIdx.x] += array2[threadIdx.x];
}

__global__: it's a function we can call on the host (CPU); it's available to be called from everywhere
How is the data arranged and how can I access it? Sequentially; a kernel runs on a grid (blocks x threads) and that's how we can run many threads that work on different parts of the data
int*? A pointer to integers we've copied to the GPU
threadIdx.x? We use this as an index to our array; remember lots of threads run on the GPU. Access each item for our example using this
  • 32. © 2016 IBM Corporation 32
How would we use a GPU with Java or Scala? Easier ways?
● Assume we have an integer array on the Java heap: myData
● Create a native method in Java or Scala (see the sketch below)
● Write .cpp or .c code with a matching signature for your native method
● In your native code, use JNI to get a pointer to your data
● With this pointer, we can figure out how much memory we need
● Allocate space on the GPU (device side): cudaMalloc, returns myDataOnTheGPU
● Copy myData to your allocated space (myDataOnTheGPU) using cudaMemcpyHostToDevice
● Process your data on the GPU in a kernel (look for <<< and >>>)
● Copy the result back (what's now at myDataOnTheGPU replaces myData on the host) using cudaMemcpyDeviceToHost
● Release the elements (updating your JNI pointer so the data in our JVM heap is now the result)
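A minimal Scala sketch of the declaration side (the class and library names are hypothetical; the matching C/C++ implementation would perform the cudaMalloc/cudaMemcpy steps above):

class GpuDoubler {
  // Hypothetical native method – implemented in C/C++ with JNI + CUDA,
  // performing the cudaMalloc / cudaMemcpy / kernel launch steps above
  @native def doubleOnGpu(myData: Array[Int]): Unit
}

object GpuDoubler {
  System.loadLibrary("gpudoubler") // assumed shared library name (libgpudoubler.so)
}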
  • 33. © 2016 IBM Corporation 33
Making it simple: Java class library modification
Our option: -Dcom.ibm.gpu.enable/enforce/disable
[Chart: sorting throughput for ints by array length (30,000 to 300,000,000) – roughly 40m ints sorted per second rising to 400m per second]
Details online here
  • 34. © 2016 IBM Corporation 34
Making it simple: Java JIT compiler modification
Our option: -Xjit:enableGPU
Use an IntStream and specify our JIT option
Primitive types can be used (byte, char, short, int, float, double, long)
  • 35. © 2016 IBM Corporation 35
Performance of GPU enabled lambdas
Measured performance improvement with a GPU using four programs, comparing:
● 1-CPU-thread sequential execution
● 160-CPU-thread parallel execution
Experimental environment: IBM Java 8 Service Release 2 for PowerPC Little Endian; two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total); one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off)
  • 36. © 2016 IBM Corporation 36

Name      Summary                                    Data size                  Data type
MM        A dense matrix multiplication: C = A.B    1024 x 1024 (1m) items     double
SpMM      As above, sparse matrix                    500k x 500k (250m) items   double
Jacobi2D  Solve an equation using the Jacobi method  8192 x 8192 (67m) items    double
LifeGame  Conway's Game of Life with 10k iterations  512 x 512 (262k) items     byte
  • 37. © 2016 IBM Corporation 37
[Chart] This shows GPU execution time speedup amounts compared to what's in blue (1 CPU thread) and yellow (160 CPU threads)
The higher the bar, the bigger the speedup!
  • 38. © 2016 IBM Corporation 38
Making it simple: CUDA4J API
Similar to JCuda but provides a higher level abstraction; production ready and supported by us
● No arbitrary and unrestricted use of Pointer(long)
● Still feels like Java instead of C
Write your kernel and compile it into a fat binary:
nvcc --fatbin AdamKernel.cu
Add your Java code:
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;
Load your fat binary:
module = new Loader().loadModule("AdamDoubler.fatbin", device);
  • 39. © 2016 IBM Corporation 39
Only doubling integers; could be any use case where we're doing the same operation to lots of elements at once
Full code listing at the end. Javadocs: search IBM Java 8 API com.ibm.cuda
* Tip: the offsets are byte offsets, so you'll want your index in Java * the size of the object!

module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);

numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);

buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
buffer1.copyTo(myData);

If our dynamically created grid dimensions are too big we need to break down the problem and use the slice* API: doChunkingProblem()
Our kernel, compiled into AdamDoubler.fatbin
  • 40. © 2016 IBM Corporation 40
Improving MLlib
● Recommendation algorithms such as
  ● Alternating Least Squares
    ● Movie recommendations on Netflix
    ● Recommended purchases on Amazon
    ● Similar songs with Spotify
● Clustering algorithms such as
  ● K-means (unsupervised learning)
    ● Produce clusters from data to determine which cluster a new item can be categorised as
    ● Identify anomalies: transaction fraud or erroneous data
● Classification algorithms such as
  ● Logistic regression
    ● Create a model that we can use to predict where to plot the next item in a sequence
    ● Healthcare: predict adverse drug reactions based on known interactions between similar drugs
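As a concrete example of one of these, here is a minimal MLlib K-means sketch in Scala (the file path and parameters are illustrative; assumes an existing SparkContext named sc):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// points.txt holds space-separated doubles, one point per line (illustrative)
val data = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20) // k = 3 clusters, 20 iterations
val cluster = model.predict(Vectors.dense(1.0, 2.0)) // categorise a new item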
  • 41. © 2016 IBM Corporation 41
Improving Alternating Least Squares
● Under the covers optimisation: set the spark.mllib.ALS.useGPU property
● Full paper: http://arxiv.org/abs/1603.03820
● Full implementation: https://github.com/IBMSparkGPU

Netflix 1.5 GB       12 threads, CPU   64 threads, CPU   GPU
Intel, IBM Java 8    676 seconds       N/A               140 seconds

Currently always sends work to a GPU regardless of size; remember we have limited device memory!
2x Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, 16 cores in the machine (SMT-2), 256 GB RAM vs 2x Nvidia Tesla K80Ms. Also available for Power LE.
  • 42. © 2016 IBM Corporation 42
We modified the existing ALS (.scala) implementation's computeFactors method
● Added code to check if spark.mllib.ALS.useGPU is set
● If set we'll then call our native method written to use JNI (.cpp)
● Our JNI method calls the native CUDA (.cu) method
● CUDA is used to send our data to the GPU, call our kernel, and return the results over JNI back to the Java heap
● Built with our Spark distribution and the shared library is included: libGPUALS.so
● Remember this will require the CUDA runtime (libcudart) and a capable GPU
[Flow: ALS.scala computeFactors → CuMFJNIInterface.cpp → ALS.cu → libGPUALS.so]
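From the user's side, enabling this is just a property plus a normal ALS call. A minimal sketch, assuming our modified Spark distribution with libGPUALS.so on the library path (the rating data loading is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val conf = new SparkConf()
  .setAppName("GpuALS")
  .set("spark.mllib.ALS.useGPU", "true") // routes computeFactors to the GPU

val sc = new SparkContext(conf)
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val model = ALS.train(ratings, 10, 10) // rank 10, 10 iterations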
  • 43. © 2016 IBM Corporation 43
Pervasive GPU opportunities for Spark
We can send code to a GPU with APIs or by making substantial changes to existing implementations, but we can also make our changes at a higher level to be more pervasive
Input: user application using DataFrames or Datasets, data stored in Parquet format for now
✔ Spark with Tungsten. Uses UnsafeRow and sun.misc.Unsafe; the idea is to bring Spark closer to the hardware than previously, exploit CPU caches, improve memory and CPU efficiency, reduce GC times, and avoid Java object overheads – good deep dive here
✔ Spark with Catalyst. Optimiser for Spark SQL APIs (good deep dive here); transforms a query plan (an abstraction of a user's program) into an optimised version, and generates optimised code with the Janino compiler
✔ Spark with our changes: Java and core Spark class optimisations, optimised JIT
  • 44. © 2016 IBM Corporation 44
Output: generated code able to leverage auto-SIMD and GPUs
We want generated code that:
✔ has a counted loop, e.g. one controlled by an automatic induction variable that increases from a lower to an upper bound
✔ accesses data in a linear fashion
✔ has as few branches as possible (simple for the GPU's kernel)
✔ does not have external method calls, or contains only calls that can be easily inlined
These help a JIT to either use auto-SIMD capabilities or GPUs (see the sketch below)
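A minimal Scala sketch of the shape we're after – the kind of loop a JIT can vectorise or offload (the operation itself is illustrative):

// Counted loop: induction variable i runs from 0 to an upper bound,
// linear array access, no branches, no external calls
def axpy(a: Float, x: Array[Float], y: Array[Float]): Unit = {
  var i = 0
  while (i < x.length) {
    y(i) = a * x(i) + y(i)
    i += 1
  }
}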
  • 45. © 2016 IBM Corporation 45
Problems
1) The data representation of columnar storage (CachedBatch with Array[Byte]) isn't commonly used
2) Compression schemes are specific to CachedBatch, limited to just several data types
3) Building the in-memory cache involves a long code path -> virtual method calls, conditional branches
4) Generated whole-stage code -> unnecessary conversion from CachedBatch or ColumnarBatch to UnsafeRow
Solutions
1) Use the ColumnarBatch format instead of CachedBatch for the in-memory cache generated by the cache() method. ColumnarBatch and ColumnVector are commonly used data representations for columnar storage
2) Use a common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
3) Generate code at runtime that is simple and specialized for building a concrete instance of the in-memory cache
4) Generate whole-stage code that directly reads data from columnar storage
(1) and (2) increase code reuse, (3) improves runtime performance of executing the cache() method and (4) improves performance of user defined DataFrame and Dataset operations
  • 46. © 2016 IBM Corporation 46
We propose a new columnar format, CachedColumnarBatch, that has a pointer to ColumnarBatch (used by the Parquet reader) and keeps each column as OnHeapUnsafeColumnVector instead of OnHeapColumnVector. Not yet using GPUs!
● [SPARK-13805], merged into 2.0, performance improvement: 1.2x
Get data from a ColumnVector directly, avoiding a copy from ColumnVector to UnsafeRow when a program reads data in Parquet format
● [SPARK-14098], will be merged into 2.2, performance improvement: 3.4x
Generate optimized code to build CachedColumnarBatch, get data from a ColumnVector directly (avoiding a copy from the ColumnVector to UnsafeRow), and use lz4 to compress the ColumnVector when df.cache() or ds.cache() is executed
● [SPARK-15962], merged into 2.1, performance improvement: 1.7x
Remove the indirection at the offsets field when accessing each element in UnsafeArrayData, reducing the memory footprint of UnsafeArrayData
  • 47. © 2016 IBM Corporation 47
● [SPARK-16043], performance improvement: 1.2x
Use a Scala primitive array (e.g. Array[Int]) instead of Array[Any] to avoid boxing operations when putting a primitive array into GenericArrayData
● [SPARK-15985], merged into 2.1, performance improvement: 1.3x
Eliminate boxing operations that put a primitive array into GenericArrayData when a Dataset program with a primitive array is run
● [SPARK-16213], to be merged into 2.2, performance improvement: 16.6x
Eliminate boxing operations that put a primitive array into GenericArrayData when a DataFrame program with a primitive array is run
● [SPARK-17490], merged into 2.1, performance improvement: 2.0x
Eliminate boxing operations that put a primitive array into GenericArrayData when a DataFrame program with a primitive array is used
(A minimal sketch of the kind of program these changes speed up follows)
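A minimal sketch, assuming a SparkSession named spark, of a Dataset program carrying a primitive array plus the cache() call targeted by the changes above (the data is illustrative):

import spark.implicits._

// A Dataset whose rows carry a primitive array – the shape the
// GenericArrayData boxing fixes above target
val ds = Seq(Array(1, 2, 3), Array(4, 5, 6)).toDS()
val sums = ds.map(_.sum)

// cache() builds the in-memory columnar cache discussed on the previous slides
sums.cache()
sums.count() // forces materialisation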
  • 48. © 2016 IBM Corporation 48
What's in it for me?
● Improving a commonly used API and contributing the code
● Ensuring generated code is in the right format for exploitation
● Making it simple for any Spark user to exploit hardware accelerators, be it GPUs or auto-SIMD code for the latest processors
● We know how to build GPU based applications
● We can figure out if a GPU is available
● We can figure out what code to generate
● We can figure out which GPU to send that code to
● All while retaining Java safety features such as exceptions, bounds checking, serviceability, tracing and profiling hooks
● Assuming you have the hardware, add an option and watch performance improve: this is our goal
  • 49. © 2016 IBM Corporation 49
Wrapping it all up...
● We provide an optimised JDK with Spark bundle that includes hardware offloading, profiling and a tuned JIT, and is under constant development
● We can talk more about performance aspects not covered: FPGAs, CAPI flash, an improved serializer, GC optimisations, object layout, monitoring...
● Upcoming blog post at spark.tc outlining the Catalyst related work
● Look out for more pull requests and involvement from IBM; we want to improve performance for everybody and maintain Spark's status
● Open to ideas and wanting to work in communities for everyone's benefit
http://ibm.biz/spark-kit
Feedback and suggestions welcome: aroberts@uk.ibm.com
  • 50. © 2016 IBM Corporation Backup slides, code listing, legal information and disclaimers beyond this point
  • 51. © 2016 IBM Corporation 51
CUDA core: part of the GPU; they execute groups of threads
Kernel: a function we'll run on the GPU
Grid: think of it as a CUBE of BLOCKS which lay out THREADS; our GPU functions (KERNELS) run on one of these, and we need to know the grid dimensions for each kernel
Threads: these do our computation; many more are available than with CPUs
Blocks: groups of threads
Recommended reading: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy
The nvidia-smi command tells you about your GPU's limits
One GPU can have MANY CUDA cores; each CUDA core executes many threads
  • 52. © 2016 IBM Corporation 52
CUDA grid: why is this important? To achieve parallelism: a layout of threads we can use to solve our big data problems
Block dimensions? How many threads can run on a block
Grid dimensions? How many blocks we can have
threadIdx.x? (BLOCKS contain THREADS) Built-in variable to get the current x coordinate of a given THREAD (can have an x, y, z coordinate too)
blockIdx.x? (GRIDS contain BLOCKS) Built-in variable to get the current x coordinate of a given BLOCK (can have an x, y, z coordinate too)
  • 53. © 2016 IBM Corporation 53
For figuring out the dimensions we can use the following Java code; we want 512 threads per block and as many blocks as needed for the problem size

int log2BlockDim = 9; // 2^9 = 512 threads per block
int numBlocks = (numElements + 511) >> log2BlockDim; // ceil(numElements / 512)
int numThreads = 1 << log2BlockDim;

Size       Blocks  Threads
500        1       512
1,024      2       512
32,000     63      512
64,000     125     512
100,000    196     512
512,000    1,000   512
1,024,000  2,000   512
  • 54. CUDA4J sample, part 1 of 3

import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;

public class Sample {
    private static final boolean PRINT_DATA = false;
    private static int numElements;
    private static int[] myData;
    private static CudaBuffer buffer1;
    private static CudaDevice device = new CudaDevice(0);
    private static CudaModule module;
    private static CudaKernel kernel;
    private static CudaStream stream;

    public static void main(String[] args) {
        try {
            module = new Loader().loadModule("AdamDoubler.fatbin", device);
            kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
            stream = new CudaStream(device);
            doSmallProblem();
            doMediumProblem();
            doChunkingProblem();
        } catch (CudaException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void doSmallProblem() throws Exception {
        System.out.println("Doing the small sized problem");
        numElements = 100;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        buffer1.copyFrom(myData);
        Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
        kernel.launch(grid, kernelParams);
        int[] originalArrayCopy = new int[myData.length];
        System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
        buffer1.copyTo(myData);
        Util.checkArrayResultsDoubler(myData, originalArrayCopy);
    }
  • 55. CUDA4J sample, part 2 of 3

    private static void doMediumProblem() throws Exception {
        System.out.println("Doing the medium sized problem");
        numElements = 5_000_000;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        // This is only when handling more than max blocks * max threads per kernel
        // Grid dim is the number of blocks in the grid
        // Block dim is the number of threads in a block
        // buffer1 is how we'll use our data on the GPU
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        // myData is on CPU, transfer it
        buffer1.copyFrom(myData);
        // Our stream executes the kernel, can launch many streams at once
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
        kernel.launch(grid, kernelParams);
        int[] originalArrayCopy = new int[myData.length];
        System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
        buffer1.copyTo(myData);
        Util.checkArrayResultsDoubler(myData, originalArrayCopy);
    }
  • 56. CUDA4J sample, part 3 of 3

    private static void doChunkingProblem() throws Exception {
        // I know 5m doesn't require chunking on the GPU but this does
        System.out.println("Doing the too big to handle in one kernel problem");
        numElements = 70_000_000;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        buffer1.copyFrom(myData);
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        // Check we can actually launch a kernel with this grid size
        try {
            Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
            kernel.launch(grid, kernelParams);
            int[] originalArrayCopy = new int[numElements];
            System.arraycopy(myData, 0, originalArrayCopy, 0, numElements);
            buffer1.copyTo(myData);
            Util.checkArrayResultsDoubler(myData, originalArrayCopy);
        } catch (CudaException ce) {
            if (ce.getMessage().equals("invalid argument")) {
                System.out.println("it was invalid argument, too big!");
                int maxThreadsPerBlockX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_BLOCK_DIM_X);
                int maxBlocksPerGridX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_GRID_DIM_Y);
                long maxThreadsPerGrid = maxThreadsPerBlockX * maxBlocksPerGridX;
                // 67,107,840 on my Windows box
                System.out.println("Max threads per grid: " + maxThreadsPerGrid);
                long numElementsAtOnce = maxThreadsPerGrid;
                long elementsDone = 0;
                grid = new CudaGrid(maxBlocksPerGridX, maxThreadsPerBlockX, stream);
                System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
                while (elementsDone < numElements) {
                    if ((elementsDone + numElementsAtOnce) > numElements) {
                        numElementsAtOnce = numElements - elementsDone; // Just do the remainder
                    }
                    long toOffset = numElementsAtOnce + elementsDone;
                    // It's the byte offset not the element index offset
                    CudaBuffer slicedSection = buffer1.slice(elementsDone * Integer.BYTES, toOffset * Integer.BYTES);
                    Parameters kernelParams = new Parameters(2).set(0, slicedSection).set(1, numElementsAtOnce);
                    kernel.launch(grid, kernelParams);
                    elementsDone += numElementsAtOnce;
                }
                int[] originalArrayCopy = new int[myData.length];
                System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
                buffer1.copyTo(myData);
                Util.checkArrayResultsDoubler(myData, originalArrayCopy);
            } else {
                System.out.println(ce.getMessage());
            }
        }
    }
}
  • 57. CUDA4J kernel

#include <stdint.h>
#include <stdio.h>

/**
 * 2D grid so we can have 1024 threads and many blocks
 * Remember 1 grid -> has blocks/threads and one kernel runs on one grid
 * In CUDA 6.5 we have cudaOccupancyMaxPotentialBlockSize which helps
 *
 * Let's say we have 100 ints to double, keeping it simple
 * Assume we want to run with 512 threads at once
 * For this size our kernel will be set up as follows
 * 1 grid, 1 block, 512 threads
 * blockDim.x is going to be 1
 * threadIdx.x will remain at 0
 * threadIdx.y will range from 0 to 512
 * So we'll go from 1 to 512 and we'll limit access to how many elements we have
 */
extern "C" __global__ void Cuda_cuda4j_AdamDoubler(int* toDouble, int numElements) {
    int index = blockDim.x * threadIdx.x + threadIdx.y;
    if (index < numElements) { // Don't go out of bounds
        toDouble[index] *= 2;  // Just double it
    }
}

extern "C" __global__ void Cuda_cuda4j_AdamDoubler_Strider(int* toDouble, int numElements) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // don't go overboard
        toDouble[i] *= 2;
    }
}
  • 58. Lambda example, part 1 of 2

import java.util.stream.IntStream;

public class Lambda {
    private static long startTime = 0;

    // -Xjit:enableGPU is our JVM option
    public static void main(String[] args) {
        boolean timeIt = true;
        int numElements = 500_000_000;

        int[] toDouble = new int[numElements];
        Util.fillWithInts(toDouble);
        myDoublerWithALambda(toDouble, timeIt);

        double[] toHalf = new double[numElements];
        Util.fillWithDoubles(toHalf);
        myHalverWithALambda(toHalf, timeIt);

        double[] toRandomFunc = new double[numElements];
        Util.fillWithDoubles(toRandomFunc);
        myRandomFuncWithALambda(toRandomFunc, timeIt);
    }

    private static void myDoublerWithALambda(int[] myArray, boolean timeIt) {
        if (timeIt) startTime = System.currentTimeMillis();
        IntStream.range(0, myArray.length).parallel().forEach(i -> {
            myArray[i] = myArray[i] * 2; // Done on GPU for us
        });
        if (timeIt) {
            System.out.println("Done doubling with a lambda, time taken: "
                + (System.currentTimeMillis() - startTime) + " milliseconds");
        }
    }
  • 59. Lambda example, part 2 of 2

    private static void myHalverWithALambda(double[] myArray, boolean timeIt) {
        if (timeIt) startTime = System.currentTimeMillis();
        IntStream.range(0, myArray.length).parallel().forEach(i -> {
            myArray[i] = myArray[i] / 2; // Again on GPU
        });
        if (timeIt) {
            System.out.println("Done halving with a lambda, time taken: "
                + (System.currentTimeMillis() - startTime) + " milliseconds");
        }
    }

    private static void myRandomFuncWithALambda(double[] myArray, boolean timeIt) {
        if (timeIt) startTime = System.currentTimeMillis();
        IntStream.range(0, myArray.length).parallel().forEach(i -> {
            myArray[i] = myArray[i] * 3.142; // Double so we don't lose precision
        });
        if (timeIt) {
            System.out.println("Done with the random func with a lambda, time taken: "
                + (System.currentTimeMillis() - startTime) + " milliseconds");
        }
    }
}
  • 60. Utility methods, part 1 of 2

import com.ibm.cuda.*;

public class Util {
    protected static void fillWithInts(int[] toFill) {
        for (int i = 0; i < toFill.length; i++) {
            toFill[i] = i;
        }
    }

    protected static void fillWithDoubles(double[] toFill) {
        for (int i = 0; i < toFill.length; i++) {
            toFill[i] = i;
        }
    }

    protected static void printArray(int[] toPrint) {
        System.out.println();
        for (int i = 0; i < toPrint.length; i++) {
            if (i == toPrint.length - 1) {
                System.out.print(toPrint[i] + ".");
            } else {
                System.out.print(toPrint[i] + ", ");
            }
        }
        System.out.println();
    }

    protected static CudaGrid makeGrid(int numElements, CudaStream stream) {
        int numThreads = 512;
        int numBlocks = (numElements + (numThreads - 1)) / numThreads;
        return new CudaGrid(numBlocks, numThreads, stream);
    }
  • 61. Utility methods, part 2 of 2

    /*
     * Array will have been doubled at this point
     */
    protected static void checkArrayResultsDoubler(int[] toCheck, int[] originalArray) {
        long errorCount = 0;
        // Check result, data has been copied back here
        if (toCheck.length != originalArray.length) {
            System.err.println("Something's gone horribly wrong, different array length");
        }
        for (int i = 0; i < originalArray.length; i++) {
            if (toCheck[i] != (originalArray[i] * 2)) {
                errorCount++;
                /*
                System.err.println("Got an error, " + originalArray[i]
                    + " is incorrect: wasn't doubled correctly!"
                    + " Got " + toCheck[i] + " but should be " + originalArray[i] * 2);
                */
            } else {
                //System.out.println("Correct, doubled " + originalArray[i] + " and it became " + toCheck[i]);
            }
        }
        System.err.println("Incorrect results: " + errorCount);
    }
}
  • 62. CUDA4J module loader

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import com.ibm.cuda.CudaDevice;
import com.ibm.cuda.CudaException;
import com.ibm.cuda.CudaModule;

public class Loader {
    private final CudaModule.Cache moduleCache = new CudaModule.Cache();

    CudaModule loadModule(String moduleName, CudaDevice device) throws CudaException, IOException {
        CudaModule module = moduleCache.get(device, moduleName);
        if (module == null) {
            try (InputStream stream = getClass().getResourceAsStream(moduleName)) {
                if (stream == null) {
                    throw new FileNotFoundException(moduleName);
                }
                module = new CudaModule(device, stream);
                moduleCache.put(device, moduleName, module);
            }
        }
        return module;
    }
}
  • 63. CUDA4J build script on Windows

nvcc -fatbin AdamDoubler.cu
"C:\ibm8sr3ga\sdk\bin\java" -version
"C:\ibm8sr3ga\sdk\bin\javac" *.java
"C:\ibm8sr3ga\sdk\bin\java" -Xmx2g Sample
"C:\ibm8sr3ga\sdk\bin\java" -Xmx4g Lambda
"C:\ibm8sr3ga\sdk\bin\java" -Xjit:enableGPU={verbose} -Xmx4g Lambda
  • 64. From IBM's Java 8 docs
Environment example, see the docs for details
Set the PATH to include the CUDA library. For example, set PATH=<CUDA_LIBRARY_PATH>;%PATH%, where the <CUDA_LIBRARY_PATH> variable is the full path to the CUDA library. The <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin, which assumes CUDA is installed to the default directory.
Note: If you are using Just-In-Time Compiler (JIT) based GPU support, you must also include paths to the NVIDIA Virtual Machine (NVVM) library, and to the NVIDIA Management Library (NVML). For example, the <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin;<NVVM_LIBRARY_PATH>;<NVML_LIBRARY_PATH>.
If the NVVM library is installed to the default directory, the <NVVM_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\nvvm\bin. You can find the NVML library in your NVIDIA drivers directory. The default location of this directory is C:\Program Files\NVIDIA Corporation\NVSMI.
  • 65. Notices and Disclaimers Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
  • 66. Notices and Disclaimers (con’t) Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG, LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Databricks is a registered trademark of Databricks, Inc. Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Spark, Apache, any other Apache project mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation