Chenzhao Guo (Intel), Carson Wang (Intel)
Improving Apache Spark by Taking
Advantage of Disaggregated Architecture
#UnifiedDataAnalytics #SparkAISummit
2
About me
• Chenzhao Guo
• Big Data Software Engineer at Intel
• Contributor to Spark*, committer of OAP and HiBench
• Our group’s work: enabling and optimizing Spark* on an
exascale (10^18) High Performance Computer
*Other names and brands may be claimed as the property of others.
3
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
4
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
5
Pain point 1: performance
Shuffle drags down end-to-end performance:
• slow due to random disk I/Os
• may fail when no space is left or a disk fails
The mechanical structure of HDDs makes seek operations slow.
What if I upgrade the HDDs to larger capacity or SSDs?
6
Pain point 2: easy management & cost efficiency
• Still, local disks may run out of space or fail
• Scaling out/up the storage resources affects the compute resources
• Scaling the storage to accommodate shuffle’s needs may leave
storage resources under-utilized for other (e.g. CPU-intensive)
workloads: cost inefficient
What if I upgrade the HDDs to larger capacity or SSDs?
Collocated compute & storage architecture
7
Limitations on default Spark shuffle: essence diagram

Traditional collocated architecture: tightly coupled compute & storage
• Cluster utilization: one of the two resources is under-utilized
• Scale out/up: hard to scale out or upgrade without affecting the other resource
• Software management: debugging an issue may be complicated
8
What brings us here?
• Challenges to resolve in enabling & optimizing Spark* on HPC:
– storage on the Distributed Asynchronous Object Storage (DAOS)
– networking on high-performance fabrics
– scalability issues at ~10,000 nodes
– intermediate (shuffle, etc.) storage on a disaggregated compute
& storage architecture
• Limitations on default Spark* shuffle
9
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
10
Disaggregated architecture
• Compute and storage resources are separated, interconnected by a
(usually) high-speed network
• Common in the HPC world and at large companies that own thousands of
servers, for higher efficiency and easier management
[Diagram: disaggregated architecture in HPC. Compute cabinets with CPUs,
memory and accelerators; storage cabinets with NVMe SSDs running DAOS;
interconnected by a fabrics network]
11
Collocated vs disaggregated compute &
storage architecture
Essence diagram

Traditional collocated architecture: tightly coupled compute & storage
• Cluster utilization: one of the two resources is under-utilized
• Scale out/up: hard to scale out or upgrade without affecting the other resource
• Software management: debugging an issue may be complicated

Disaggregated compute & storage architecture: decoupled compute & storage
• Cluster utilization: efficient for diverse workloads with different
storage/compute needs
• Scale out/up: independent scaling per compute & storage needs
• Software management: each team only needs to care about its own cluster
• Overhead: extra network transfer
12
Network transfer is less of a problem as technology evolves
• Software:
– Efficient compression algorithms
– Columnar file formats
– Other optimizations toward less I/O
• Hardware:
– Network bandwidth is evolving fast
[Chart: over time, the bottleneck shifts from I/O to CPU]
13
Target: remote shuffle to Hadoop-compatible storage
[Diagram: default Spark local shuffle, where SortShuffleManager’s
readers/writers use the Java File API against local disks and reducers
fetch blocks from the BlockManager over Netty, vs. remote shuffle to a
globally accessible Hadoop-compatible filesystem in a separate storage cluster]
• The Hadoop* Filesystem interface is general
• Supports globally accessible storage
• Lets the distributed storage take care of replication, free space, etc.
14
A remote shuffle manager plugin: overview
[Diagram: RemoteShuffleManager’s readers/writers use Hadoop OutputStream /
InputStream against a globally accessible Hadoop-compatible filesystem:
HDFS, Lustre, DAOS FS…]
• A pluggable custom shuffle manager storing files in a globally accessible
Hadoop-compatible filesystem
• The shuffle readers/writers talk to the storage system via the Hadoop* API
• The partitioning, sorting, unsafe-optimization, etc. algorithms are based
on those in the default SortShuffleManager
• Shuffle data/index files, spills and other temporary files are all kept in
remote storage
15
More customization details
• Shuffle writers/readers talk to storage via the Hadoop* API: instead of
fos = new FileOutputStream(file, true) on local disk, they use
fsdos = fs.create(file) on the globally accessible Hadoop-compatible filesystem
• Shuffle readers derive the data/index file path from applicationId, mapId
and reduceId, then load the files directly from remote storage without help
from any metadata keeper (MapOutputTracker)
• Avoids resubmitting map tasks when an executor fails, because the shuffle
blocks are still in remote storage and can be read by their derived paths
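The derived-path idea above can be sketched in plain Java. The directory layout and file-name scheme below are illustrative assumptions, not the plugin's actual on-storage format:

```java
// Sketch of deriving remote shuffle file paths purely from IDs, so readers
// need no MapOutputTracker lookup. The naming scheme here is an assumed
// illustration, not the plugin's actual layout.
class ShufflePathResolver {
    private final String rootDir; // e.g. "hdfs://namenode:9000/spark-shuffle"
    private final String appId;

    ShufflePathResolver(String rootDir, String appId) {
        this.rootDir = rootDir;
        this.appId = appId;
    }

    // Every executor can compute the same path independently.
    String dataFile(int shuffleId, long mapId) {
        return String.format("%s/%s/shuffle_%d_%d.data", rootDir, appId, shuffleId, mapId);
    }

    String indexFile(int shuffleId, long mapId) {
        return String.format("%s/%s/shuffle_%d_%d.index", rootDir, appId, shuffleId, mapId);
    }
}
```

Because any executor can recompute these paths, a reducer can locate a map output even after the executor that wrote it is gone.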
16
A remote shuffle I/O plugin
• No need to maintain the full shuffle algorithms (sort, partition,
aggregation logic); the plugin only cares about shuffle storage
• Based on the new shuffle abstraction layer introduced in SPARK-25299
[Diagram: a pluggable shuffle manager replaces the whole sort shuffle
manager, while a pluggable shuffle I/O keeps the non-storage part of the
sort shuffle manager and swaps in customized storage]
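A rough sketch of what the SPARK-25299-style split buys: the sort/partition logic writes through a small storage interface, and only the backend is swapped. The interface and class names here are hypothetical, not Spark's actual plugin API; the in-memory backend stands in for a remote Hadoop-compatible filesystem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage interface: the shuffle manager's sort/partition code
// only needs a way to open a stream for each map output.
interface ShuffleStorage {
    OutputStream createMapOutputWriter(int shuffleId, long mapId) throws IOException;
}

// In-memory stand-in for a remote Hadoop-compatible filesystem backend.
class InMemoryShuffleStorage implements ShuffleStorage {
    final Map<String, ByteArrayOutputStream> files = new HashMap<>();

    @Override
    public ByteArrayOutputStream createMapOutputWriter(int shuffleId, long mapId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put("shuffle_" + shuffleId + "_" + mapId, out);
        return out;
    }
}
```

Swapping in an HDFS-, Lustre- or DAOS-backed implementation would not touch the sort/partition code at all, which is the point of the abstraction.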
17
What have we offered so far?
A ShuffleManager (or shuffle I/O plugin) that
• shuffles to globally accessible Hadoop-compatible storage
• is pluggable and easy to deploy
• naturally survives executor failures, because shuffle metadata is not
managed locally but derived globally
• tolerates disk failures, because the shuffle data in storage can be
replicated: reliable
• is easy to scale and cost-efficient, by taking advantage of the
disaggregated compute and storage architecture
18
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
19
New challenges
• Repeatedly reading small index files from disk is inefficient
• Seek operations on data files in HDFS* are inefficient, reading more
data than expected over the network
20
Optimization: index cache
Map stage: map tasks in each executor process write index & data files to
storage (e.g. HDFS)
Index files are small; however, some Hadoop* storage systems like HDFS* are
not designed for small-file I/O
21
Optimization: index cache
Reduce stage: the first reader loads an index file from storage into an
executor-local index cache; other reduce tasks, even in other executor
processes, fetch from the index cache and then read the data files directly
• Saves small disk I/Os in the storage cluster
• Saves small network I/Os between the compute and storage clusters
• Utilizes the network inside the compute cluster
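As a sketch of the cache itself, assuming the standard Spark index-file layout of numPartitions + 1 offsets where partition i spans [offsets[i], offsets[i+1]), an executor-local cache of decoded offsets might look like this (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Executor-local cache of decoded index files: each small index file is
// read from remote storage at most once per executor, then served from
// memory for every reduce task that needs it.
class IndexCache {
    private final Map<String, long[]> offsetsByPath = new ConcurrentHashMap<>();

    // `loader` stands in for one remote read of the whole index file.
    long[] offsets(String indexPath, Function<String, long[]> loader) {
        return offsetsByPath.computeIfAbsent(indexPath, loader);
    }

    // (startOffset, length) of reduce partition `reduceId` in the data file.
    long[] span(long[] offsets, int reduceId) {
        return new long[] { offsets[reduceId], offsets[reduceId + 1] - offsets[reduceId] };
    }
}
```

`computeIfAbsent` gives the once-per-executor read for free: concurrent reduce tasks asking for the same index file share one load.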
22
Fall back to reading from remote storage
Reduce stage: when the index cache is not available in certain executors
(failures or network issues), fall back to reading the index files directly
from remote storage, just as if the index cache feature were never enabled
23
Solutions for inefficient HDFS* seeks
• A hash-based shuffle manager without any seek operations
• New block-merging logic to eliminate seek operations
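One way to picture the block-merging idea (a hypothetical sketch, not the actual implementation): coalesce read requests for back-to-back byte ranges of the same remote data file into single sequential reads, so the storage never seeks between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block merging: requests are {offset, length} pairs
// over one data file, in ascending offset order; contiguous ranges are
// coalesced so the storage serves one sequential read instead of
// seek-read-seek-read.
class BlockMerger {
    static List<long[]> merge(List<long[]> requests) {
        List<long[]> merged = new ArrayList<>();
        for (long[] r : requests) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[0] + last[1] == r[0]) {
                last[1] += r[1]; // contiguous: extend the previous read
            } else {
                merged.add(new long[] { r[0], r[1] });
            }
        }
        return merged;
    }
}
```

This is most useful when several requested blocks happen to sit next to each other in the same data file, since each merged range costs only one seek.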
24
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
25
Shuffle over Lustre*
Dataset: 4 GB & 100 GB datasets (~50 GB shuffle)
Workload: SQL inner join
Software: Spark* 2.4.3, Hadoop* 2.7.3
(baseline: 64 GB RAM disk | remote shuffle: Lustre*, remote)
Hardware: 1 master + 4 workers; 40-core Intel(R) Xeon(R) Gold 6138 CPU
@ 2.00GHz; 128 GB RAM; 10Gb Ethernet
26
Comparison: end-to-end time
[Chart: end-to-end time in seconds. Local shuffle to RAM disk: 54;
remote shuffle: 87; remote shuffle + index cache: 55]
27
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
28
Where next
• Future work: a purely hash-based shuffle path that resembles the
hash-based shuffle from Spark*’s history
• Disaggregated architecture can benefit more than just Spark* shuffle
• Towards a fully disaggregated world: CPU, memory and disks are all
disaggregated, connected through the network
Q&A
29
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
31
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request. No product or component can be absolutely secure.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, [List the Intel trademarks in your document] are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other
countries.
*Other names and brands may be claimed as the property of others
© Intel Corporation.
32
Legal Disclaimer
Software and workloads used in performance tests may have been optimized
for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance.
