Kazuaki Ishizaki
IBM Research – Tokyo
IBM Japan, Ltd., Tokyo Research Laboratory
Exploiting GPUs in Spark
Who am I?
• Kazuaki Ishizaki
– Lives in Tokyo, Japan
• Research staff member at IBM Research – Tokyo
– http://ibm.co/kiszk
• Research interests
– Compiler optimizations, language runtime, and parallel processing
• Worked on the Java virtual machine and just-in-time compiler for over 20 years
– From JDK 1.0 to Java SE 8
• Twitter: @kiszk
• Slideshare: http://www.slideshare.net/ishizaki
• Github: https://github.com/kiszk
My message is "Spark can meet GPUs"
• Let us discuss use cases, opportunities, and requirements in meetups,
conferences, and the Spark dev or user mailing lists
• While GPUs are not yet first-class citizens in Spark, four GPU-related talks
will be given at Spark Summit SF
Agenda
• Motivation & Goal
• Activities to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
– Binary columnar
– GPU enabler
• Two Approaches to Exploit GPUs in Spark
– Spark plug-in
– Enhancement of Catalyst in the Spark runtime
• Conclusion
Want to Accelerate Computation-heavy Applications
• Motivation
– Want to shorten the execution time of a long-running Spark application, which may be
  • Computation-heavy
  • Shuffle-heavy
  • I/O-heavy
• Goal
– Accelerate computation-heavy Spark applications
• According to Reynold's talk (p. 21), the CPU will become the bottleneck in Spark
Accelerate a Spark Application with GPUs
• Our approach
– Accelerate a Spark application by using GPUs effectively and transparently
  • Exploit the high performance of GPUs
  • Do not ask users to change their Spark programs
• New components for acceleration
– Binary columnar (cf. Apache Arrow)
  • Efficient data representations for GPUs and CPUs
– GPU enabler
  • Automatically handles execution on GPUs
    (GPU memory allocation, data copy between GPU and CPU, etc.)
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
More Than 10 Existing Projects to Exploit GPUs in Spark
• There are several activities, but none has been merged into the master branch
– The community could make GPUs first-class citizens in Spark
[Figure: a matrix of existing projects, classified along two axes — who prepares the GPU code (a Spark system programmer, a Spark application programmer, or generated from the Spark application) and how the GPU code is called (through the Spark standard APIs — RDD, Dataset, DataFrame — or through unique APIs). Projects shown: mllib (N/A on github), Deeplearning4J on Spark, our GPU enabler (spark-gpu), Spark SWAT, Columnar DataFrame (N/A on github), NUWA (product), our on-going work, Caffe on Spark, BidMach Spark, CSR in Spark, and HeteroSpark (N/A on github).]
Existing Resource Managers That Support GPUs for Spark
• Spark on Mesos
– https://spark-summit.org/2016/events/spark-on-mesos-the-state-of-the-art/
• YARN node labels
– https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
GPU Programming Model
• Five steps (a code sketch follows the figure below)
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the GPU cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory
• Usually, a programmer has to write these steps in CUDA or OpenCL
[Figure: a CPU with dozens of cores per socket and main memory (up to 1TB/socket) is connected over PCIe to a GPU with thousands of cores and device memory (up to 12GB); data must be copied between the two memories.]
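To make these steps concrete, below is a minimal sketch using the JCuda bindings for the CUDA runtime API. This is an illustration only, not code from spark-gpu; it assumes the jcuda library is on the classpath, and the kernel launch (step 3), which goes through the CUDA driver API, is abbreviated to a comment.

  import jcuda.{Pointer, Sizeof}
  import jcuda.runtime.JCuda._
  import jcuda.runtime.cudaMemcpyKind._

  val n = 1024
  val host = new Array[Int](n)
  val dev = new Pointer()
  cudaMalloc(dev, n * Sizeof.INT)                    // 1. allocate GPU device memory
  cudaMemcpy(dev, Pointer.to(host),
    n * Sizeof.INT, cudaMemcpyHostToDevice)          // 2. copy CPU -> GPU
  // 3. launch the GPU kernel here (cuLaunchKernel via the JCuda driver API)
  cudaMemcpy(Pointer.to(host), dev,
    n * Sizeof.INT, cudaMemcpyDeviceToHost)          // 4. copy GPU -> CPU
  cudaFree(dev)                                      // 5. free GPU device memory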
How We Can Run a Program Faster on a GPU
• Assign many parallel computations to the cores
• Make memory accesses coalesced
– An example is shown in the figure and sketch below
– A column-oriented layout achieves better performance
  • This paper reports about a 3x performance improvement in GPU kernel execution of
    kmeans with a column-oriented layout over a row-oriented layout
[Figure: with Pt(x: Int, y: Int), loading four Pt.x values and four Pt.y values takes 2 memory accesses to GPU device memory in a column-oriented layout (x1 x2 x3 x4 | y1 y2 y3 y4) versus 4 in a row-oriented layout (x1 y1 x2 y2 ...), under the assumption that 4 consecutive data elements can be coalesced by the GPU hardware into one access.]
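To make the two layouts concrete, here is a small Scala sketch with illustrative values. (On the JVM an Array[Pt] actually holds references to objects; the interleaved x y x y picture applies to a binary row-oriented format.)

  case class Pt(x: Int, y: Int)

  // Row-oriented: the fields of one point sit together (x1 y1 x2 y2 ...)
  val rowOriented = Array(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8))

  // Column-oriented: the same field of consecutive points sits together,
  // so four GPU threads loading x1..x4 touch one contiguous region
  val xs = Array(1, 2, 3, 4)
  val ys = Array(5, 6, 7, 8)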
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
High-Level View of GPU Exploitation
• Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU
• Transparent
– Map parallelism in a program into GPU native code
User's Spark program (Scala):
case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5), Pt(3, 6),
  Pt(4, 7), Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt = rdd2.map(p => p.x)
          .reduce((x1, x2) => x1 + x2)
[Figure: the map and reduce lambdas are translated into GPU native code. The binary columnar x and y columns of rdd1 are transferred from off-heap memory to the GPU, the kernel computes x*2 and y-1 for every element, and the results come back as rdd2. The GPU can exploit parallelism both among blocks in an RDD and within a block of an RDD; the GPU enabler drives the data transfer and kernel execution.]
What Does Binary Columnar Do?
• Keeps data in a binary representation (not the Java object representation)
• Keeps data in a column-oriented layout
• Keeps data off-heap or in GPU device memory
[Figure: for case class Pt(x: Int, y: Int) and Array(Pt(1, 4), Pt(2, 5)), the off-heap columnar (column-oriented) layout stores 1 2 | 4 5 (one column per field), while the off-heap row-oriented layout stores 1 4 | 2 5.]
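As a minimal Scala sketch of this idea, the two columns can be kept in direct (off-heap) ByteBuffers. The helper name toBinaryColumnar is hypothetical, not part of spark-gpu's API.

  import java.nio.{ByteBuffer, ByteOrder}

  case class Pt(x: Int, y: Int)

  // Store an Array[Pt] off-heap as two int columns (4 bytes per Int)
  def toBinaryColumnar(pts: Array[Pt]): (ByteBuffer, ByteBuffer) = {
    val xs = ByteBuffer.allocateDirect(4 * pts.length).order(ByteOrder.nativeOrder())
    val ys = ByteBuffer.allocateDirect(4 * pts.length).order(ByteOrder.nativeOrder())
    for (p <- pts) { xs.putInt(p.x); ys.putInt(p.y) }
    (xs, ys)  // for Array(Pt(1, 4), Pt(2, 5)): xs = [1, 2], ys = [4, 5]
  }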
Current RDD as Java Objects on the Java Heap
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the current RDD keeps each Pt as a Java object on the Java heap — a row-oriented layout in the Java object representation, where every object carries an object header for the Java virtual machine in front of its fields 1 4 and 2 5.]
Binary Columnar on Off-heap
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the current RDD keeps Pt objects (with JVM object headers) in a row-oriented layout in the Java object representation on the Java heap; binary columnar keeps the same data in a column-oriented binary layout (1 2 | 4 5) on off-heap memory.]
Long Path from the Current RDD to the GPU
• Three steps to send data from an RDD to the GPU
1. Java objects to a column-oriented binary representation on the Java heap
  • From Java objects to a binary representation
  • From a row-oriented format to columnar
2. Binary representation on the Java heap to binary columnar on off-heap
  • Garbage collection may move objects on the Java heap during GPU-related operations
3. Off-heap to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: Pt objects on the Java heap are converted to ByteBuffers on the Java heap (step 1), copied to off-heap memory (step 2), and finally copied to GPU device memory (step 3).]
A thread on the Spark dev mailing list also discusses the overhead of copying data between an RDD and the GPU.
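A rough Scala sketch of steps 1 and 2 (illustrative only; step 3 is a plain memory copy over PCIe, as in the earlier JCuda sketch):

  import java.nio.{ByteBuffer, ByteOrder}

  case class Pt(x: Int, y: Int)
  val pts = Array(Pt(1, 4), Pt(2, 5))
  // 1. Java objects -> column-oriented binary representation on the Java heap
  val xsHeap = pts.map(_.x)                          // [1, 2]
  val ysHeap = pts.map(_.y)                          // [4, 5]
  // 2. Java heap -> off-heap direct buffers, which the GC will not move
  val xsOff = ByteBuffer.allocateDirect(4 * xsHeap.length).order(ByteOrder.nativeOrder())
  val ysOff = ByteBuffer.allocateDirect(4 * ysHeap.length).order(ByteOrder.nativeOrder())
  xsHeap.foreach(xsOff.putInt)
  ysHeap.foreach(ysOff.putInt)
  // 3. off-heap -> GPU device memory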
Long Path from the Current Dataset to the GPU
• Two steps to send data from a Dataset to the GPU
1. Binary representation on the Java heap to binary columnar on off-heap
  • From a row-oriented format to columnar
2. Off-heap to GPU device memory
case class Pt(x: Int, y: Int)
ds = Array(Pt(1, 4), Pt(2, 5)).toDS()
ds.map(…).reduce(…) // execute on GPU
[Figure: row-oriented binary data on the Java heap is converted to columnar data on off-heap memory (step 1) and then copied to GPU device memory (step 2).]
Shorter Path from a Binary Columnar RDD to the GPU
• An RDD kept in binary columnar format can simply be copied to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: the Java-heap conversion steps are eliminated; the columnar data already resides off-heap and is copied directly to GPU device memory.]
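Why this path is shorter: a direct ByteBuffer already has a stable native address, so it can be handed straight to the device copy. A sketch, again assuming the JCuda bindings (not spark-gpu's actual code):

  import java.nio.ByteBuffer
  import jcuda.Pointer
  import jcuda.runtime.JCuda._
  import jcuda.runtime.cudaMemcpyKind._

  val xsOff = ByteBuffer.allocateDirect(2 * 4)       // the off-heap x column (two Ints)
  val dev = new Pointer()
  cudaMalloc(dev, xsOff.capacity())
  cudaMemcpy(dev, Pointer.to(xsOff), xsOff.capacity(),
    cudaMemcpyHostToDevice)                          // one copy, no Java-heap steps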
map() Can Execute in Parallel Using Binary Columnar
• Adjacent elements in binary columnar can be loaded in parallel
• The same type of operation (* or -) can then be executed in parallel on the
data loaded in parallel (see the sketch below)
case class Pt(x: Int, y: Int)
...
res = rdd.map(p => Pt(p.x*2, p.y-1)) // likewise for a Dataset
[Figure: with the current RDD or Dataset on the Java heap, elements are accessed one after another (memory access order 1 2 3 4); with binary columnar on off-heap, the x column and the y column can each be accessed in lock-step (order 1 1 2 2).]
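A small CPU-side Scala sketch of this element-wise parallelism (the .par collection stands in for GPU threads; on the GPU, i would be the thread index):

  val xs = Array(1, 2)
  val ys = Array(4, 5)
  val outX = new Array[Int](xs.length)
  val outY = new Array[Int](ys.length)
  (0 until xs.length).par.foreach { i =>  // every index i is independent
    outX(i) = xs(i) * 2                   // the * column operation
    outY(i) = ys(i) - 1                   // the - column operation
  }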
Advantages of Binary Columnar
• Can exploit the high performance of GPUs
• Can reduce the overhead of data copy between CPU and GPU
• Has a smaller memory footprint than the current RDD
• Can directly compute on columnar data from Apache Parquet or Apache Arrow
• Can exploit SIMD instructions on the CPU, too
What Does the GPU Enabler Do?
• Copies data in a binary columnar RDD between CPU main memory and GPU
device memory
• Launches GPU kernels
• Caches GPU native code for kernels
• Generates GPU native code from the transformations and actions in a program
– We already productized an IBM Java just-in-time compiler that generates GPU
native code from a lambda expression in Java 8
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
How to Exploit GPUs in Spark
• The bottom line is to enable columnar storage and the GPU enabler in Spark
– Either approach can use both of them to exploit GPUs in Spark effectively and
transparently
Comparisons Among DataFrame, Dataset, and RDD
• DataFrame (with relational operations) and Dataset (with lambda
functions) use Catalyst and a row-oriented data representation on off-heap
[Figure: comparison for case class Pt(x: Int, y: Int); d = Array(Pt(1, 4), Pt(2, 5)).
- DataFrame (v1.3-): df = d.toDF(…); df.filter("x>1").count() — row-oriented data; backend computation is Java bytecode generated by Catalyst.
- Dataset (v1.6-): ds = d.toDS(); ds.filter(p => p.x>1).count() — row-oriented data; backend computation is Java bytecode generated by Catalyst.
- RDD (v0.5-): rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count() — Java objects on the Java heap; backend computation is the Java bytecode in the Spark program and runtime.]
Two Approaches to Exploit GPUs
• Devise a Spark package for RDD
– Library developers can use this to enable their GPU code in Spark libraries
– Application programmers can use this to run their code in their Spark
applications
• Enhance Catalyst for DataFrame/Dataset
– Spark programs with DataFrame/Dataset will be translated to GPU code
transparently
– Roadmap:
1. Generate code for specific columnar storages for CPUs (the current first step)
  • https://github.com/apache/spark/pull/11636 for ColumnarBatch
  • https://github.com/apache/spark/pull/11956 for CachedBatch
2. Introduce a generic columnar storage (UnsafeColumn?) for CPU
3. Generate code for the generic columnar storage for CPU
4. Generate code for the generic columnar storage for GPU
Software Stack for RDD in Spark 2.0
• RDD keeps data on the Java heap
[Figure: the user's/library's Spark program calls the RDD API; RDD data lives on the Java heap.]
GPU Exploitation for RDD
• The current RDD and binary columnar can co-exist
• User/library-provided GPU code is managed by the GPU enabler
[Figure: alongside RDD data on the Java heap, columnar data lives off-heap and in GPU device memory; the GPU enabler sits between the RDD API and the GPU, managing the columnar copies.]
Software Stack for Dataset/DataFrame in Spark 2.0
• Dataset becomes the primary data structure for computation
• Dataset keeps data in UnsafeRow on the Java heap
[Figure: the user's/library's Spark program uses DataFrame and Dataset; Catalyst (logical optimizer, CPU code generator) and Tungsten sit below them; UnsafeRow data lives on the Java heap.]
GPU Exploitation for DataFrame/Dataset
• UnsafeRow and columnar can co-exist
• Catalyst will generate GPU code from a Spark program
[Figure: Catalyst gains a GPU enabler next to its logical optimizer and CPU code generator; columnar data lives on the Java heap, off-heap, and in GPU device memory alongside UnsafeRow.]
Exploit GPUs for RDD
• Execute user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copy are handled automatically
• Generate GPU native code for simple map()/reduce() methods
– Enabled by "spark.gpu.codegen=true" in spark-defaults.conf
// Generated GPU native code (spark.gpu.codegen=true)
rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
sum = rdd1.map(i => i * 2)
          .reduce((x, y) => x + y)

// User-provided GPU kernel, written in CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark code that calls the kernel
mapFunction = new CUDAFunction("sample_map",   // CUDA kernel name
  Array("this.x", "this.y"),                   // input object has two fields
  Array("this.x", "this.y"),                   // output object has two fields
  this.getClass.getResource("/sample.ptx"))    // ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)
How to Use GPU Exploitation for RDD
• Easy to install and run, each with a one-liner
– Works on x86_64, Mac, and ppc64le, with CUDA 7.0 or later, on any JVM such as
IBM JDK or OpenJDK
• A run script for AWS EC2 is available, which supports spot instances
$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz &&
tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu
$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$
Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers
Achieved a 3.15x Performance Improvement with a GPU
• Ran a naïve implementation of logistic regression
• Achieved a 3.15x performance improvement for logistic regression over the
version without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance
Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark
Program parameters:
N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5
Slices=128 (without GPU), 16 (with GPU)
MASTER=local[8] (without and with GPU)
Hardware and software:
Machine: nx360 M4, 2-socket 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, one NVIDIA K40m card
OS: RedHat 6.6, CUDA: 7.0
We Are Planning to Release a Spark Package Version
• You can use any Spark runtime
– Spark 1.6, 1.6.1, 2.0.0-SNAPSHOT, your own Spark, …
• Live demo
• Motivation & Goal
• Projects to Exploit GPUs in Spark
• Introduction of GPUs
• Design & New Components
• Two Approaches to Exploit GPUs in Spark
• Conclusion
Takeaway
• Accelerate a Spark application by using GPUs effectively and transparently
• More than 10 approaches already exist for GPU exploitation
• Two fundamental components
– Binary columnar, to alleviate the overhead of GPU exploitation
– GPU enabler, to manage GPU kernel execution from a Spark program
  • Calls pre-compiled GPU libraries
  • Generates GPU native code at runtime
• Two approaches
– Spark plug-in for RDD
– Enhancement of Catalyst for DataFrame/Dataset
• Looking for anything from the community
– Use cases, discussions, requests, …
We appreciate your feedback and contributions