Kazuaki Ishizaki
IBM Research – Tokyo
IBM Japan, Ltd., Tokyo Research Laboratory
Exploiting GPUs in Spark
Who am I?
• Kazuaki Ishizaki
– Lives in Tokyo, Japan
• Research staff member at IBM Research – Tokyo
– http://ibm.co/kiszk
• Research interests
– Compiler optimizations, language runtime, and parallel processing
• Worked on the Java virtual machine and just-in-time compiler for over 20 years
– From JDK 1.0 to Java SE 8
• Twitter: @kiszk
• Slideshare: http://www.slideshare.net/ishizaki
• Github: https://github.com/kiszk
My message is "Spark can meet GPUs"
• Let us discuss use cases, opportunities, and requirements in meetups,
conferences, and the Spark dev or user mailing lists
• While GPUs are not yet first-class citizens in Spark, four GPU-related talks
will be given at Spark Summit SF
Agenda
• Motivation & Goal
• Activities to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
– Binary columnar
– GPU enabler
• Two Approaches to Exploit GPUs in Spark
– Spark plug-in
– Enhancement of Catalyst in the Spark runtime
• Conclusion
Want to Accelerate Computation-heavy Applications
• Motivation
– Want to shorten the execution time of a long-running Spark application, which may be
  • Computation-heavy
  • Shuffle-heavy
  • I/O-heavy
• Goal
– Accelerate computation-heavy Spark applications
• According to Reynold's talk (p. 21), the CPU will become the bottleneck in Spark
Accelerate a Spark Application with GPUs
• Our approach
– Accelerate a Spark application by using GPUs effectively and transparently
  • Exploit the high performance of GPUs
  • Do not ask users to change their Spark programs
• New components for acceleration
– Binary columnar (cf. Apache Arrow)
  • Efficient data representations for GPUs and CPUs
– GPU enabler
  • Automatically handles execution on GPUs
    (GPU memory allocation, data copy between GPU and CPU, etc.)
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
More Than 10 Existing Projects to Exploit GPUs in Spark
• There are several activities, but none has been merged into the master branch
– The community could make GPUs first-class citizens in Spark
[Figure: a matrix of existing projects, classified along two axes — who prepares the GPU code (a Spark system programmer, a Spark application programmer, or generated from the Spark application) and how the GPU code is called (through the Spark standard APIs — RDD, Dataset, DataFrame — or through unique APIs). Projects shown: mllib (N/A on github), Deeplearning4J on Spark, our GPU enabler (spark-gpu), Spark SWAT, Columnar DataFrame (N/A on github), NUWA (product), our on-going work, Caffe on Spark, BidMach Spark, CSR in Spark, and HeteroSpark (N/A on github).]
Existing Resource Managers That Support GPUs for Spark
• Spark on Mesos
– https://spark-summit.org/2016/events/spark-on-mesos-the-state-of-the-art/
• YARN node labels
– https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
GPU Programming Model
• Five steps (a code sketch follows the figure below)
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the GPU cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory
• Usually, a programmer has to write these steps in CUDA or OpenCL
[Figure: a CPU with dozens of cores per socket and main memory (up to 1TB/socket) is connected over PCIe to a GPU with thousands of cores and device memory (up to 12GB); data must be copied between the two memories.]
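To make these steps concrete, below is a minimal sketch using the JCuda bindings for the CUDA runtime API. This is an illustration only, not code from spark-gpu; it assumes the jcuda library is on the classpath, and the kernel launch (step 3), which goes through the CUDA driver API, is abbreviated to a comment.

  import jcuda.{Pointer, Sizeof}
  import jcuda.runtime.JCuda._
  import jcuda.runtime.cudaMemcpyKind._

  val n = 1024
  val host = new Array[Int](n)
  val dev = new Pointer()
  cudaMalloc(dev, n * Sizeof.INT)                    // 1. allocate GPU device memory
  cudaMemcpy(dev, Pointer.to(host),
    n * Sizeof.INT, cudaMemcpyHostToDevice)          // 2. copy CPU -> GPU
  // 3. launch the GPU kernel here (cuLaunchKernel via the JCuda driver API)
  cudaMemcpy(Pointer.to(host), dev,
    n * Sizeof.INT, cudaMemcpyDeviceToHost)          // 4. copy GPU -> CPU
  cudaFree(dev)                                      // 5. free GPU device memory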
How We Can Run a Program Faster on a GPU
• Assign many parallel computations to the cores
• Make memory accesses coalesced
– An example is shown in the figure and sketch below
– A column-oriented layout achieves better performance
  • This paper reports about a 3x performance improvement in GPU kernel execution of
    kmeans with a column-oriented layout over a row-oriented layout
[Figure: with Pt(x: Int, y: Int), loading four Pt.x values and four Pt.y values takes 2 memory accesses to GPU device memory in a column-oriented layout (x1 x2 x3 x4 | y1 y2 y3 y4) versus 4 in a row-oriented layout (x1 y1 x2 y2 ...), under the assumption that 4 consecutive data elements can be coalesced by the GPU hardware into one access.]
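To make the two layouts concrete, here is a small Scala sketch with illustrative values. (On the JVM an Array[Pt] actually holds references to objects; the interleaved x y x y picture applies to a binary row-oriented format.)

  case class Pt(x: Int, y: Int)

  // Row-oriented: the fields of one point sit together (x1 y1 x2 y2 ...)
  val rowOriented = Array(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8))

  // Column-oriented: the same field of consecutive points sits together,
  // so four GPU threads loading x1..x4 touch one contiguous region
  val xs = Array(1, 2, 3, 4)
  val ys = Array(5, 6, 7, 8)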
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
High-Level View of GPU Exploitation
• Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU
• Transparent
– Map parallelism in a program into GPU native code
User's Spark program (Scala):
case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5), Pt(3, 6),
  Pt(4, 7), Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt = rdd2.map(p => p.x)
          .reduce((x1, x2) => x1 + x2)
[Figure: the map and reduce lambdas are translated into GPU native code. The binary columnar x and y columns of rdd1 are transferred from off-heap memory to the GPU, the kernel computes x*2 and y-1 for every element, and the results come back as rdd2. The GPU can exploit parallelism both among blocks in an RDD and within a block of an RDD; the GPU enabler drives the data transfer and kernel execution.]
What Does Binary Columnar Do?
• Keeps data in a binary representation (not the Java object representation)
• Keeps data in a column-oriented layout
• Keeps data off-heap or in GPU device memory
[Figure: for case class Pt(x: Int, y: Int) and Array(Pt(1, 4), Pt(2, 5)), the off-heap columnar (column-oriented) layout stores 1 2 | 4 5 (one column per field), while the off-heap row-oriented layout stores 1 4 | 2 5.]
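As a minimal Scala sketch of this idea, the two columns can be kept in direct (off-heap) ByteBuffers. The helper name toBinaryColumnar is hypothetical, not part of spark-gpu's API.

  import java.nio.{ByteBuffer, ByteOrder}

  case class Pt(x: Int, y: Int)

  // Store an Array[Pt] off-heap as two int columns (4 bytes per Int)
  def toBinaryColumnar(pts: Array[Pt]): (ByteBuffer, ByteBuffer) = {
    val xs = ByteBuffer.allocateDirect(4 * pts.length).order(ByteOrder.nativeOrder())
    val ys = ByteBuffer.allocateDirect(4 * pts.length).order(ByteOrder.nativeOrder())
    for (p <- pts) { xs.putInt(p.x); ys.putInt(p.y) }
    (xs, ys)  // for Array(Pt(1, 4), Pt(2, 5)): xs = [1, 2], ys = [4, 5]
  }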
Current RDD as Java Objects on the Java Heap
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the current RDD keeps each Pt as a Java object on the Java heap — a row-oriented layout in the Java object representation, where every object carries an object header for the Java virtual machine in front of its fields 1 4 and 2 5.]
Binary Columnar on Off-heap
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the current RDD keeps Pt objects (with JVM object headers) in a row-oriented layout in the Java object representation on the Java heap; binary columnar keeps the same data in a column-oriented binary layout (1 2 | 4 5) on off-heap memory.]
Long Path from the Current RDD to the GPU
• Three steps to send data from an RDD to the GPU
1. Java objects to a column-oriented binary representation on the Java heap
  • From Java objects to a binary representation
  • From a row-oriented format to columnar
2. Binary representation on the Java heap to binary columnar on off-heap
  • Garbage collection may move objects on the Java heap during GPU-related operations
3. Off-heap to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: Pt objects on the Java heap are converted to ByteBuffers on the Java heap (step 1), copied to off-heap memory (step 2), and finally copied to GPU device memory (step 3).]
A thread on the Spark dev mailing list also discusses the overhead of copying data between an RDD and the GPU.
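A rough Scala sketch of steps 1 and 2 (illustrative only; step 3 is a plain memory copy over PCIe, as in the earlier JCuda sketch):

  import java.nio.{ByteBuffer, ByteOrder}

  case class Pt(x: Int, y: Int)
  val pts = Array(Pt(1, 4), Pt(2, 5))
  // 1. Java objects -> column-oriented binary representation on the Java heap
  val xsHeap = pts.map(_.x)                          // [1, 2]
  val ysHeap = pts.map(_.y)                          // [4, 5]
  // 2. Java heap -> off-heap direct buffers, which the GC will not move
  val xsOff = ByteBuffer.allocateDirect(4 * xsHeap.length).order(ByteOrder.nativeOrder())
  val ysOff = ByteBuffer.allocateDirect(4 * ysHeap.length).order(ByteOrder.nativeOrder())
  xsHeap.foreach(xsOff.putInt)
  ysHeap.foreach(ysOff.putInt)
  // 3. off-heap -> GPU device memory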
Long Path from the Current Dataset to the GPU
• Two steps to send data from a Dataset to the GPU
1. Binary representation on the Java heap to binary columnar on off-heap
  • From a row-oriented format to columnar
2. Off-heap to GPU device memory
case class Pt(x: Int, y: Int)
ds = Array(Pt(1, 4), Pt(2, 5)).toDS()
ds.map(…).reduce(…) // execute on GPU
[Figure: row-oriented binary data on the Java heap is converted to columnar data on off-heap memory (step 1) and then copied to GPU device memory (step 2).]
Shorter Path from a Binary Columnar RDD to the GPU
• An RDD kept in binary columnar format can simply be copied to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: the Java-heap conversion steps are eliminated; the columnar data already resides off-heap and is copied directly to GPU device memory.]
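Why this path is shorter: a direct ByteBuffer already has a stable native address, so it can be handed straight to the device copy. A sketch, again assuming the JCuda bindings (not spark-gpu's actual code):

  import java.nio.ByteBuffer
  import jcuda.Pointer
  import jcuda.runtime.JCuda._
  import jcuda.runtime.cudaMemcpyKind._

  val xsOff = ByteBuffer.allocateDirect(2 * 4)       // the off-heap x column (two Ints)
  val dev = new Pointer()
  cudaMalloc(dev, xsOff.capacity())
  cudaMemcpy(dev, Pointer.to(xsOff), xsOff.capacity(),
    cudaMemcpyHostToDevice)                          // one copy, no Java-heap steps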
map() Can Execute in Parallel Using Binary Columnar
• Adjacent elements in binary columnar can be loaded in parallel
• The same type of operation (* or -) can then be executed in parallel on the
data loaded in parallel (see the sketch below)
case class Pt(x: Int, y: Int)
...
res = rdd.map(p => Pt(p.x*2, p.y-1)) // likewise for a Dataset
[Figure: with the current RDD or Dataset on the Java heap, elements are accessed one after another (memory access order 1 2 3 4); with binary columnar on off-heap, the x column and the y column can each be accessed in lock-step (order 1 1 2 2).]
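A small CPU-side Scala sketch of this element-wise parallelism (the .par collection stands in for GPU threads; on the GPU, i would be the thread index):

  val xs = Array(1, 2)
  val ys = Array(4, 5)
  val outX = new Array[Int](xs.length)
  val outY = new Array[Int](ys.length)
  (0 until xs.length).par.foreach { i =>  // every index i is independent
    outX(i) = xs(i) * 2                   // the * column operation
    outY(i) = ys(i) - 1                   // the - column operation
  }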
Advantages of Binary Columnar
• Can exploit the high performance of GPUs
• Can reduce the overhead of data copy between CPU and GPU
• Has a smaller memory footprint than the current RDD
• Can directly compute on columnar data from Apache Parquet or Apache Arrow
• Can exploit SIMD instructions on the CPU, too
What Does the GPU Enabler Do?
• Copies data in a binary columnar RDD between CPU main memory and GPU
device memory
• Launches GPU kernels
• Caches GPU native code for kernels
• Generates GPU native code from the transformations and actions in a program
– We already productized an IBM Java just-in-time compiler that generates GPU
native code from a lambda expression in Java 8
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
How to Exploit GPUs in Spark
• The bottom line is to enable columnar storage and the GPU enabler in Spark
– Either approach can use both of them to exploit GPUs in Spark effectively and
transparently
Comparisons Among DataFrame, Dataset, and RDD
• DataFrame (with relational operations) and Dataset (with lambda
functions) use Catalyst and a row-oriented data representation on off-heap
[Figure: comparison for case class Pt(x: Int, y: Int); d = Array(Pt(1, 4), Pt(2, 5)).
- DataFrame (v1.3-): df = d.toDF(…); df.filter("x>1").count() — row-oriented data; backend computation is Java bytecode generated by Catalyst.
- Dataset (v1.6-): ds = d.toDS(); ds.filter(p => p.x>1).count() — row-oriented data; backend computation is Java bytecode generated by Catalyst.
- RDD (v0.5-): rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count() — Java objects on the Java heap; backend computation is the Java bytecode in the Spark program and runtime.]
Two Approaches to Exploit GPUs
• Devise a Spark package for RDD
– Library developers can use this to enable their GPU code in Spark libraries
– Application programmers can use this to run their code in their Spark
applications
• Enhance Catalyst for DataFrame/Dataset
– Spark programs with DataFrame/Dataset will be translated to GPU code
transparently
– Roadmap:
1. Generate code for specific columnar storages for CPUs (the current first step)
  • https://github.com/apache/spark/pull/11636 for ColumnarBatch
  • https://github.com/apache/spark/pull/11956 for CachedBatch
2. Introduce a generic columnar storage (UnsafeColumn?) for CPU
3. Generate code for the generic columnar storage for CPU
4. Generate code for the generic columnar storage for GPU
Software Stack for RDD in Spark 2.0
• RDD keeps data on the Java heap
[Figure: the user's/library's Spark program calls the RDD API; RDD data lives on the Java heap.]
GPU Exploitation for RDD
• The current RDD and binary columnar can co-exist
• User/library-provided GPU code is managed by the GPU enabler
[Figure: alongside RDD data on the Java heap, columnar data lives off-heap and in GPU device memory; the GPU enabler sits between the RDD API and the GPU, managing the columnar copies.]
Software Stack for Dataset/DataFrame in Spark 2.0
• Dataset becomes the primary data structure for computation
• Dataset keeps data in UnsafeRow on the Java heap
[Figure: the user's/library's Spark program uses DataFrame and Dataset; Catalyst (logical optimizer, CPU code generator) and Tungsten sit below them; UnsafeRow data lives on the Java heap.]
GPU Exploitation for DataFrame/Dataset
• UnsafeRow and columnar can co-exist
• Catalyst will generate GPU code from a Spark program
[Figure: Catalyst gains a GPU enabler next to its logical optimizer and CPU code generator; columnar data lives on the Java heap, off-heap, and in GPU device memory alongside UnsafeRow.]
Exploit GPUs for RDD
• Execute user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copy are handled automatically
• Generate GPU native code for simple map()/reduce() methods
– Enabled by "spark.gpu.codegen=true" in spark-defaults.conf
// Generated GPU native code (spark.gpu.codegen=true)
rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
sum = rdd1.map(i => i * 2)
          .reduce((x, y) => x + y)

// User-provided GPU kernel, written in CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark code that calls the kernel
mapFunction = new CUDAFunction("sample_map",   // CUDA kernel name
  Array("this.x", "this.y"),                   // input object has two fields
  Array("this.x", "this.y"),                   // output object has two fields
  this.getClass.getResource("/sample.ptx"))    // ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)
How to Use GPU Exploitation for RDD
• Easy to install and run, each with a one-liner
– Works on x86_64, Mac, and ppc64le, with CUDA 7.0 or later, on any JVM such as
IBM JDK or OpenJDK
• A run script for AWS EC2 is available, which supports spot instances
$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz &&
tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu
$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$
Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers
Achieved a 3.15x Performance Improvement with a GPU
• Ran a naïve implementation of logistic regression
• Achieved a 3.15x performance improvement for logistic regression over the
version without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance
Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark
Program parameters:
N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5
Slices=128 (without GPU), 16 (with GPU)
MASTER=local[8] (without and with GPU)
Hardware and software:
Machine: nx360 M4, 2-socket 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, one NVIDIA K40m card
OS: RedHat 6.6, CUDA: 7.0
We Are Planning to Release a Spark Package Version
• You can use any Spark runtime
– Spark 1.6, 1.6.1, 2.0.0-SNAPSHOT, your own Spark, …
• Live demo
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
Takeaway
• Accelerate a Spark application by using GPUs effectively and transparently
• More than 10 approaches already exist for GPU exploitation
• Two fundamental components
– Binary columnar, to alleviate the overhead of GPU exploitation
– GPU enabler, to manage GPU kernel execution from a Spark program
  • Calls pre-compiled GPU libraries
  • Generates GPU native code at runtime
• Two approaches
– Spark plug-in for RDD
– Enhancement of Catalyst for DataFrame/Dataset
• Looking for anything from the community
– Use cases, discussions, requests, …
We appreciate your feedback and contributions