Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi

Patrick Stuedi, IBM Research
Running Spark on a High-
Performance Cluster
using RDMA Networking
and NVMe Flash

Hardware Trends
community
target
20172010
1Gbps
50us
10Gbps
20us
100 MB/s
100ms
1000 MB/s
200us

Hardware Trends
2017
10 GB/s
50us
100Gbps
1us
community
target
our
target
20172010
1Gbps
50us
10Gbps
20us
100 MB/s
100ms
1000 MB/s
200us

User-Level APIs
Storage App
HDD
FileSystem
iSCSI
Block Layer
OS Kernel
Network App
NIC
Sockets
Ethernet
TCP/IP
OS Kernel
Storage App
SSD, NIC
SPDK
Network App
NIC
OS KernelRDMA, DPDK
Kernel
Kernel
Traditional / Kernel-based User-Level / Kernel-Bypass
Needed to
achieve
1us RTT!

Remote Data Access
512 B 4 KB 64 KB 1 MB
0
200
400
600
800
1000
1200
NVMe over Fabrics
iSCSI
data size
time[usec]
DRAM NVMe Flash
512 B 4 KB 64 KB 1 MB
0
50
100
150
200
250
RDMA
Sockets
data size
time[usec]
User-level APIs offer 2-10x better performance than
traditional kernel-based APIs

Let's Use it!
Spark
NVMe Flash
DRAM
High-Performance RDMA network
SQL Streaming MLlib GraphX
Performance!
Realtime!
Disaggregation!

Case Study: Sorting in Spark
MapCores ReduceInput Output
read & classify shuffle store
Net Ser
Net Ser
Net Ser
Sort
Sort
Sort

Experiment Setup
●
Total data size: 12.8 TB
●
Cluster size: 128 nodes
●
Cluster hardware:
– DRAM: 512 GB DDR 4
– Storage: 4x 1.2 TB NVMe SSD
– Network: 100GbE Mellanox RDMA
Flash bandwidth
per node matches
network bandwidth

How is the Network Used?
Hardware limit
0
20
40
60
80
100
0 100 200 300 400 500
Throughput[Gbit/s]
Elapsed time (seconds)

What is the Problem?
JVM
Netty
TeraSort
Sorter
Serializer
sockets
Spark
● Spark uses legacy networking and storage APIs: no kernel-bypass
● Spark itself introduces additional I/O layers: Netty, serializer, sorter, etc.
TCP/IP
Ethernet
NIC
filesystem
block layer
iSCSI
SSD

Example: Shuffle (Map)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferedOutputStream
BufferJNI backend
BufferFile System Buffer Cache
Object

Example: Shuffle (Map)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream

Example: Shuffle (Map+Reduce)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream
Buffer
Buffer Buffer SocketBuffer
Netty
Buffer
Buffer
Kryo
Object
network
Sorter

Example: Shuffle (Map+Reduce)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream
Buffer
Buffer Buffer SocketBuffer
Netty
Buffer
Buffer
Kryo
Object
network
Sorter
Difficult at
100 Gbps!

How can we fix this...
● Not just for shuffle
– Also for broadcast, RDD transport, inter-job sharing, etc.
● Not just for RDMA and NVMe hardware
– But for any possible future high-performance I/O hardware
● Not just for co-located compute/storage
– Also for resource disaggregation, heterogeneous resource distribution, etc.
● Not just improve things
– Make it perform at the hardware limit

Example: Crail Shuffle (Map)
Buffer
Crail Serializer
CrailOutputStream
Object
local
DRAM
remote
DRAM
local
Flash
remote
Flash
memcpy RDMA NVMe NVMeF
No
buffering

Example: Crail Shuffle (Reduce)
Buffer
Crail Serializer
CrailInputStream
local
DRAM
remote
DRAM
local
Flash
remote
Flash
No
buffering
Object

Example: Crail Shuffle (Reduce)
Buffer
Crail Serializer
CrailInputStream
local
DRAM
remote
DRAM
local
Flash
remote
Flash
No
buffering
Object
Similar for
broadcast,
Input/output,
RDD, etc.

Performance: Configuration
●
Experiments
– Memory-only: Broadcast, GroupBy, Sorting, SQL
– Memory/Flash: disaggregation, tiering
●
Cluster size: 8 nodes, except Sorting: 128 nodes
●
Cluster hardware:
– DRAM: 512 GB DDR 4
– Storage: 4x 1.2 TB NVMe SSD
– Network: 100GbE Mellanox RDMA
●
Spark 2.1.0, Alluxio 1.4, Crail 1.0
●
Hadoop.tmp.dir: RamFS for microbenchmarks, flash SSD for workloads

Spark Broadcast
val bcVar = sc.Broadcast(new Array[Byte](128))
sc.parallelize(1 to tasks, tasks).map(_ => {
bcVar.value.length
}).count
Spark
driver
Spark
Executor
/spark/broadcast/broadcast_0 Crail
Store
Broadcast writing
RPC+2xRDMA+RPC
CDF
Broadcast reading
RPC+RDMA1
2
3
4

Spark GroupBy (80M keys, 4K)
val pairs = sc.parallelize(1 to tasks, tasks).flatmap(_ => {
var values = new array[(Long,Array[Byte])](numKeys)
values = initValues(values)
}).cache().groupByKey().count()
Spark
Exec0
/shuffle/1/ f0 f1 fN
/shuffle/3 f0 f1 fN
Spark
Exec1
Spark
ExecN
Spark
Exec0
Spark
Exec1
Spark
ExecN
hash hash hash1
2
map: Hash
Append (file)
reduce: fetch
Bucket (dir)
Crail
Spark/
Vanilla
5x
2.5x
2x
0
20
40
60
80
100
0 10 20 30 40 50 60 70 80 90 100 110 120
Throughput(Gbit/s)
1 core
4 cores
8 cores
0
20
40
60
80
100
0 10 20 30 40 50 60 70 80 90 100 110 120
Throughput(Gbit/s)
1 core
4 cores
8 cores
Spark/
Crail

Sorting 12.8 TB on 128 nodes
Spark Spark/Crail
0
200
400
600
reduce
map
Spark/Crail Network UsageSorting Runtime
Task ID
Throughput(Gbit/s)
6x
time(sec)

How fast is this?
Spark/Crail Winner 2014 Winner 2016
Size (TB) 12.8 100 100
Time (sec) 98 1406 134
Total cores 2560 6592 10240
Network HW (Gbit/s) 100 10 100
Rate/core (GB/min) 3.13 0.66 4.4
Native C
distributed
sorting
benchmark
Spark
www.sortingbenchmark.org
Sorting rate of
Crail/Spark only 27%
slower than rate of
Winner 2016

Spark SQL JoinSpark
Exec0
/shuffle/3 f0 f1 fN
Spark
Exec1
Spark
ExecN
hash hash hash
1
2
map: Hash
Append (file)
val ds1 = sparkSession.read.parquet(“...”) //64GB dataset
val ds2 = sparkSession.read.parquet(“...”) //64GB dataset
val resultDS = ds1.joinWith(ds2, ds1(“key”) === ds2(“key”)) //~100GB
resultDS.write.format(“parquet”).mode(SaveMode.Overwrite).save(“...”)
parquet parquet parquet
read from
Hdfs/crail/alluxio
deser
sort
join
write
sort
join
write
sort
join
write
deser deser
3 Sort
Merge join
hdfs/crail/alluxio
4
CrailStore
0
10
20
30
40
50
output
join
hash
input
Time(sec)
Spark/
Alluxio
Spark/
Crail

Storage Tiering: DRAM & NVMe
Spark
Exec0
Spark
Exec1
Spark
ExecN
Memory Tier
Flash
Tier
Network
Input/Output/
Shuffle/
Spark
With Crail, replacing memory with NVMe flash
for storing input/output/shuffle reduces the 200GB sorting runtime by only 48%
0
20
40
60
80
100
120
Reduce
MapVanilla
Spark
(100%
in-memory)
memory/flash ratio
Time(sec)
sorting runtime

DRAM & Flash Disaggregation
Spark
Exec1
Spark
ExecN
Using Crail, a Spark 200GB sorting workload can be run with
memory and flash disaggregated at no extra cost
Spark
Exec0
storage
nodes
compute
nodes
0
40
80
120
160
200
Reduce
Map
Crail/
DRAM
Crail/
NVMe
Vanilla/
Alluxio
Input/Output/
Shuffle
Time(sec)
sorting runtime

Conclusions
●
Effectively using high-performance I/O hardware in
Spark is challenging
●
Crail is an attempt to re-think how data processing
frameworks (not only Spark) should interact with
network and storage hardware
– User-level I/O, storage disaggregation, memory/flash convergence
●
Spark's modular architecture allows Crail to be used
seamlessly

Crail for Spark is Open Source
●
www.crail.io
●
github.com/zrlio/spark-io
●
github.com/zrlio/crail
●
github.com/zrlio/parquetgenerator
●
github.com/zrlio/crail-terasort

Thank You.
Contributors:
Animesh Trivedi, Jonas Pfefferle, Michael Kaufmann, Bernard
Metzler, Radu Stoica, Ioannis Koltsidas, Nikolos Ioannou,
Christoph Hagleitner, Peter Hofstee, Evangelos Eleftheriou, ...

Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi

Similar to Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi