Intel Architecture, Graphics, and Software (IAGS)

3. Apache Spark: A Unified Analytics Engine
• A fast, general-purpose cluster computing system with a rich set of high-level tools, including Spark SQL for SQL processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
• Provides high-level APIs in Java, Scala, Python, and R.
• Supports multiple resource managers (Apache YARN, Apache Mesos, Kubernetes, standalone).
• Supports diverse data sources, including HDFS, Apache Cassandra, Apache HBase, etc.
4. Enable and Accelerate Spark with DAOS
• Spark Input/Output Storage: From HDFS to DAOS
• Spark Shuffle Data Storage: From Local FS to DAOS
[Diagram] Colocated cluster: each node in a combined compute and storage cluster runs Spark, with input/output data stored in HDFS and shuffle data on the local filesystem. Disaggregated cluster: Spark runs on a compute cluster, while both input/output data and shuffle data are stored in DAOS on a separate storage cluster.
5. DAOS Hadoop Filesystem Interface
[Diagram] Spark applications sit on top of the Hadoop FileSystem abstract class, which has pluggable implementations:
• HDFS, backed by local disks
• S3A FileSystem, backed by the AWS Java SDK and Amazon S3
• DAOS Hadoop FileSystem, backed by the DAOS API Java wrapper, the DAOS library, and DAOS storage
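The plug-in pattern above can be sketched in a few lines of Python (a hypothetical analogue of Hadoop's FileSystem abstract class, not the real API): each backend implements the same abstract interface, so application code stays unchanged when switching from local disks or S3 to DAOS.

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Toy stand-in for Hadoop's FileSystem abstract class (hypothetical)."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalFileSystem(FileSystem):
    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:   # backed by local disks
            return f.read()

class DaosFileSystem(FileSystem):
    """Would delegate to the DAOS API Java wrapper / DAOS library;
    here a dict stands in for DAOS storage."""
    def __init__(self, store: dict):
        self.store = store
    def read(self, path: str) -> bytes:
        return self.store[path]

def load(fs: FileSystem, path: str) -> bytes:
    # Application code depends only on the abstract interface,
    # never on the concrete backend.
    return fs.read(path)
```

Swapping backends then requires only constructing a different `FileSystem` instance, which is the property the DAOS Hadoop FileSystem exploits to slot under Spark unmodified.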
6. DAOS Hadoop Filesystem Throughput
[Charts] Read throughput and write throughput (MB/s) vs. number of clients, comparing IOR DFS and DFSIO.

• The DAOS Hadoop Filesystem delivers the same read/write throughput as the DAOS DFS API given enough parallelism.

Test configuration: 2 DAOS servers, each with 1 P4500 NVMe SSD; 3 client nodes; network: 100 Gb Omni-Path.
7. Using DAOS Hadoop Filesystem in Spark

▪ Add the DAOS Hadoop FS jar file to the classpath of the Spark executor and driver:

spark.executor.extraClassPath /path/to/DAOS Hadoop FS jar
spark.driver.extraClassPath /path/to/DAOS Hadoop FS jar

▪ For a DAOS UNS path, read data using the following URI in Spark:

df = spark.read.json("daos:///UNS/path/root/dir/file")

▪ For a non-UNS path, add the pool and container UUIDs to the configuration file daos-site.xml and use the following URI in Spark:

df = spark.read.json("daos:///root/dir/file")
9. DAOS Hadoop FS based Shuffle
[Diagram] Map Task 1 and Map Task 2 each write an index file and a data file of shuffle blocks into a DAOS POSIX container; Reduce Tasks 1–3 read shuffle blocks from every data file.
1. Each map task writes its shuffle output file to DAOS using the DAOS Hadoop filesystem API. The shuffle data file is sorted by partition ID.
2. Each reduce task reads its shuffle partitions from every shuffle data file.
3. Every reduce task needs to open all shuffle data files; file handles can only be cached at the Spark executor level.
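The file-based flow in steps 1–3 can be simulated with plain Python dictionaries standing in for files in the DAOS POSIX container (a sketch of the data movement only, not the Spark implementation): each map task emits one data file whose blocks are ordered by partition ID, and each reduce task must touch every map's file.

```python
def map_write(map_id, records, num_reducers):
    """Step 1: one shuffle 'file' per map task, blocks sorted by partition ID."""
    parts = {r: [] for r in range(num_reducers)}
    for key, value in records:
        parts[hash(key) % num_reducers].append((key, value))
    # File contents: shuffle blocks laid out in partition (reduce) ID order.
    return {"map_id": map_id, "blocks": [parts[r] for r in range(num_reducers)]}

def reduce_read(reduce_id, shuffle_files):
    """Steps 2-3: every reduce task opens every map's file, so with M maps
    and R reducers there are R*M lookup/open operations in total."""
    out = []
    for f in shuffle_files:
        out.extend(f["blocks"][reduce_id])
    return out
```

The R*M open pattern in `reduce_read` is exactly the overhead the next slides identify as the bottleneck for small shuffle blocks.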
10. DAOS Hadoop FS based Shuffle Performance
Challenge: frequent file lookup/open overhead during shuffle read, especially when the shuffle block size is small.
11. DAOS Object based Spark Shuffle
[Diagram] Map Task 1 and Map Task 2 write into a single DAOS object; Reduce Tasks 1–3 read from it. The object is keyed as:
• dkey ReduceID 1 → akey MapID 1, akey MapID 2
• dkey ReduceID 2 → akey MapID 1, akey MapID 2
• dkey ReduceID 3 → akey MapID 1, akey MapID 2
1. Each map task divides its data into R partitions, where R equals the number of reducers. Every small block is stored in a DAOS object under a specific dkey and akey, where the dkey equals the reduce ID and the akey equals the map attempt ID.
2. For every reduce task, its shuffle data is stored under the same dkey. During shuffle read, each reduce task simply reads the values of all successful akeys under its dkey.
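The dkey/akey layout above can likewise be sketched with a dict keyed by (dkey, akey) pairs standing in for a DAOS object (a toy model; the real shuffle manager uses the DAOS object API): a reduce task then fetches everything under its single dkey instead of opening one file per map task.

```python
def map_write(obj, map_attempt_id, records, num_reducers):
    """Map side: store each partition under dkey=reduce_id, akey=map_attempt_id."""
    parts = {}
    for key, value in records:
        parts.setdefault(hash(key) % num_reducers, []).append((key, value))
    for reduce_id, block in parts.items():
        obj[(reduce_id, map_attempt_id)] = block  # one (dkey, akey) cell

def reduce_read(obj, reduce_id):
    """Reduce side: read all akeys' values under one dkey --
    no per-map-file lookup/open, regardless of the number of map tasks."""
    out = []
    for (dkey, _akey), block in sorted(obj.items()):
        if dkey == reduce_id:
            out.extend(block)
    return out
```

Because all of a reducer's input lives under one dkey, the small-block open overhead of the file-based design disappears, which matches the shuffle-read speedup reported on the next slide.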
12. DAOS Object based Spark Shuffle Performance
[Chart] Shuffle throughput (GB/s) for a 200 GB repartition, by shuffle block size:

Shuffle block size: 200K  400K  800K  1600K
Shuffle Read:       5.83  5.68  5.68  5.68
Shuffle Write:      1.94  1.94  2.06  2.06
• Shuffle read throughput is improved: 1.5x for a 200 KB shuffle block.
• Shuffle write throughput does not change, as it already hits the NVMe limit.
13. Using DAOS Object based Shuffle in Spark

▪ Add the remote shuffle jar file to the classpath of the Spark executor and driver:

spark.executor.extraClassPath /path/to/DAOS remote shuffle jar
spark.driver.extraClassPath /path/to/DAOS remote shuffle jar

▪ Enable DAOS remote shuffle:

spark.shuffle.manager org.apache.spark.shuffle.daos.DaosShuffleManager
spark.shuffle.daos.pool.uuid <Pool UUID>
spark.shuffle.daos.container.uuid <Container UUID>
14. Future Plan

▪ Performance:
• Switch to the DAOS async API in the DAOS Spark shuffle manager.
▪ User Experience:
• Simplify user configuration and set better default values.
• Test and benchmark to cover more use cases.