Chenzhao Guo (Intel), Carson Wang (Intel)
Improving Apache Spark by Taking
Advantage of Disaggregated Architecture
#UnifiedDataAnalytics #SparkAISummit
2
About me
• Chenzhao Guo
• Big Data Software Engineer at Intel
• Contributor to Spark*, committer of OAP and HiBench
• Our group’s work: enabling and optimizing Spark* on an
exascale (10^18) High Performance Computer
*Other names and brands may be claimed as the property of others.
3
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
4
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
5
Pain point 1: performance
Shuffle drags down end-to-end performance:
• slow due to random disk I/Os
• may fail when no space is left or a disk fails
The mechanical structure of HDDs makes seek operations slow.
What if I upgrade the HDDs to larger capacity or SSDs?
6
Pain point 2: easy management & cost efficiency
• Still, local disks may run out of space or fail
• Scaling out/up the storage resources affects the compute resources
• Scaling the storage to accommodate shuffle’s needs may leave
storage resources under-utilized for other (e.g. CPU-intensive)
workloads: cost inefficient
What if I upgrade the HDDs to larger capacity or SSDs?
Collocated compute & storage architecture
7
Limitations on default Spark shuffle: essence diagram

Traditional collocated architecture: tightly coupled compute & storage
• Cluster utilization: one of the two resources is under-utilized
• Scale out/up: hard to scale out or upgrade without affecting the other resource
• Software management: debugging an issue may be complicated
8
What brings us here?
• Challenges to resolve in enabling & optimizing Spark* on HPC:
– storage on the Distributed Asynchronous Object Storage (DAOS)
– networking on high-performance fabrics
– scalability issues at ~10,000 nodes
– intermediate (shuffle, etc.) storage on a disaggregated compute
& storage architecture
• Limitations on default Spark* shuffle
9
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
10
Disaggregated architecture
• Compute and storage resources are separated, interconnected by a
(usually) high-speed network
• Common in the HPC world and at large companies that own thousands of
servers, for higher efficiency and easier management
[Diagram: disaggregated architecture in HPC. Compute cabinets with CPUs,
memory and accelerators; storage cabinets with NVMe SSDs running DAOS;
interconnected by a fabrics network]
11
Collocated vs disaggregated compute &
storage architecture
Essence diagram

Traditional collocated architecture: tightly coupled compute & storage
• Cluster utilization: one of the two resources is under-utilized
• Scale out/up: hard to scale out or upgrade without affecting the other resource
• Software management: debugging an issue may be complicated

Disaggregated compute & storage architecture: decoupled compute & storage
• Cluster utilization: efficient for diverse workloads with different
storage/compute needs
• Scale out/up: independent scaling per compute & storage needs
• Software management: each team only needs to care about its own cluster
• Overhead: extra network transfer
12
Network transfer is less of a problem as technology evolves
• Software:
– Efficient compression algorithms
– Columnar file formats
– Other optimizations toward less I/O
• Hardware:
– Network bandwidth is evolving fast
[Chart: over time, the bottleneck shifts from I/O to CPU]
13
Target: remote shuffle to Hadoop-compatible storage
[Diagram: default Spark local shuffle, where SortShuffleManager’s
readers/writers use the Java File API against local disks and reducers
fetch blocks from the BlockManager over Netty, vs. remote shuffle to a
globally accessible Hadoop-compatible filesystem in a separate storage cluster]
• The Hadoop* Filesystem interface is general
• Supports globally accessible storage
• Lets the distributed storage take care of replication, free space, etc.
14
A remote shuffle manager plugin: overview
[Diagram: RemoteShuffleManager’s readers/writers use Hadoop OutputStream /
InputStream against a globally accessible Hadoop-compatible filesystem:
HDFS, Lustre, DAOS FS…]
• A pluggable custom shuffle manager storing files in a globally accessible
Hadoop-compatible filesystem
• The shuffle readers/writers talk to the storage system via the Hadoop* API
• The partitioning, sorting, unsafe-optimization, etc. algorithms are based
on those in the default SortShuffleManager
• Shuffle data/index files, spills and other temporary files are all kept in
remote storage
15
More customization details
• Shuffle writers/readers talk to storage via the Hadoop* API: instead of
fos = new FileOutputStream(file, true) on local disk, they use
fsdos = fs.create(file) on the globally accessible Hadoop-compatible filesystem
• Shuffle readers derive the data/index file path from applicationId, mapId
and reduceId, then load the files directly from remote storage without help
from any metadata keeper (MapOutputTracker)
• Avoids resubmitting map tasks when an executor fails, because the shuffle
blocks are still in remote storage and can be read by their derived paths
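The derived-path idea above can be sketched in plain Java. The directory layout and file-name scheme below are illustrative assumptions, not the plugin's actual on-storage format:

```java
// Sketch of deriving remote shuffle file paths purely from IDs, so readers
// need no MapOutputTracker lookup. The naming scheme here is an assumed
// illustration, not the plugin's actual layout.
class ShufflePathResolver {
    private final String rootDir; // e.g. "hdfs://namenode:9000/spark-shuffle"
    private final String appId;

    ShufflePathResolver(String rootDir, String appId) {
        this.rootDir = rootDir;
        this.appId = appId;
    }

    // Every executor can compute the same path independently.
    String dataFile(int shuffleId, long mapId) {
        return String.format("%s/%s/shuffle_%d_%d.data", rootDir, appId, shuffleId, mapId);
    }

    String indexFile(int shuffleId, long mapId) {
        return String.format("%s/%s/shuffle_%d_%d.index", rootDir, appId, shuffleId, mapId);
    }
}
```

Because any executor can recompute these paths, a reducer can locate a map output even after the executor that wrote it is gone.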
16
A remote shuffle I/O plugin
• No need to maintain the full shuffle algorithms (sort, partition,
aggregation logic); the plugin only cares about shuffle storage
• Based on the new shuffle abstraction layer introduced in SPARK-25299
[Diagram: a pluggable shuffle manager replaces the whole sort shuffle
manager, while a pluggable shuffle I/O keeps the non-storage part of the
sort shuffle manager and swaps in customized storage]
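A rough sketch of what the SPARK-25299-style split buys: the sort/partition logic writes through a small storage interface, and only the backend is swapped. The interface and class names here are hypothetical, not Spark's actual plugin API; the in-memory backend stands in for a remote Hadoop-compatible filesystem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage interface: the shuffle manager's sort/partition code
// only needs a way to open a stream for each map output.
interface ShuffleStorage {
    OutputStream createMapOutputWriter(int shuffleId, long mapId) throws IOException;
}

// In-memory stand-in for a remote Hadoop-compatible filesystem backend.
class InMemoryShuffleStorage implements ShuffleStorage {
    final Map<String, ByteArrayOutputStream> files = new HashMap<>();

    @Override
    public ByteArrayOutputStream createMapOutputWriter(int shuffleId, long mapId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put("shuffle_" + shuffleId + "_" + mapId, out);
        return out;
    }
}
```

Swapping in an HDFS-, Lustre- or DAOS-backed implementation would not touch the sort/partition code at all, which is the point of the abstraction.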
17
What have we offered so far?
A ShuffleManager (or shuffle I/O plugin) that
• shuffles to globally accessible Hadoop-compatible storage
• is pluggable and easy to deploy
• naturally survives executor failures, because shuffle metadata is not
managed locally but derived globally
• tolerates disk failures, because the shuffle data in storage can be
replicated: reliable
• is easy to scale and cost-efficient, by taking advantage of the
disaggregated compute and storage architecture
18
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
19
New challenges
• Repeatedly reading small index files from disk is inefficient
• Seek operations on data files in HDFS* are inefficient, reading more
data than expected over the network
20
Optimization: index cache
Map stage: map tasks in each executor process write index & data files to
storage (e.g. HDFS)
Index files are small; however, some Hadoop* storage systems like HDFS* are
not designed for small-file I/O
21
Optimization: index cache
Reduce stage: the first reader loads an index file from storage into an
executor-local index cache; other reduce tasks, even in other executor
processes, fetch from the index cache and then read the data files directly
• Saves small disk I/Os in the storage cluster
• Saves small network I/Os between the compute and storage clusters
• Utilizes the network inside the compute cluster
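As a sketch of the cache itself, assuming the standard Spark index-file layout of numPartitions + 1 offsets where partition i spans [offsets[i], offsets[i+1]), an executor-local cache of decoded offsets might look like this (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Executor-local cache of decoded index files: each small index file is
// read from remote storage at most once per executor, then served from
// memory for every reduce task that needs it.
class IndexCache {
    private final Map<String, long[]> offsetsByPath = new ConcurrentHashMap<>();

    // `loader` stands in for one remote read of the whole index file.
    long[] offsets(String indexPath, Function<String, long[]> loader) {
        return offsetsByPath.computeIfAbsent(indexPath, loader);
    }

    // (startOffset, length) of reduce partition `reduceId` in the data file.
    long[] span(long[] offsets, int reduceId) {
        return new long[] { offsets[reduceId], offsets[reduceId + 1] - offsets[reduceId] };
    }
}
```

`computeIfAbsent` gives the once-per-executor read for free: concurrent reduce tasks asking for the same index file share one load.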
22
Fall back to reading from remote storage
Reduce stage: when the index cache is not available in certain executors
(failures or network issues), fall back to reading the index files directly
from remote storage, just as if the index cache feature were never enabled
23
Solutions for inefficient HDFS* seeks
• A hash-based shuffle manager without any seek operations
• New block-merging logic to eliminate seek operations
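One way to picture the block-merging idea (a hypothetical sketch, not the actual implementation): coalesce read requests for back-to-back byte ranges of the same remote data file into single sequential reads, so the storage never seeks between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block merging: requests are {offset, length} pairs
// over one data file, in ascending offset order; contiguous ranges are
// coalesced so the storage serves one sequential read instead of
// seek-read-seek-read.
class BlockMerger {
    static List<long[]> merge(List<long[]> requests) {
        List<long[]> merged = new ArrayList<>();
        for (long[] r : requests) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[0] + last[1] == r[0]) {
                last[1] += r[1]; // contiguous: extend the previous read
            } else {
                merged.add(new long[] { r[0], r[1] });
            }
        }
        return merged;
    }
}
```

This is most useful when several requested blocks happen to sit next to each other in the same data file, since each merged range costs only one seek.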
24
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
25
Shuffle over Lustre*
Dataset: 4 GB & 100 GB datasets (~50 GB shuffle)
Workload: SQL inner join
Software: Spark* 2.4.3, Hadoop* 2.7.3
(baseline: 64 GB RAM disk | remote shuffle: Lustre*, remote)
Hardware: 1 master + 4 workers; 40-core Intel(R) Xeon(R) Gold 6138 CPU
@ 2.00GHz; 128 GB RAM; 10Gb Ethernet
26
Comparison: end-to-end time
[Chart: end-to-end time in seconds. Local shuffle to RAM disk: 54;
remote shuffle: 87; remote shuffle + index cache: 55]
27
Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
28
Where next
• Future work: a purely hash-based shuffle path that resembles the
hash-based shuffle from Spark*’s history
• Disaggregated architecture can benefit more than just Spark* shuffle
• Towards a fully disaggregated world: CPU, memory and disks are all
disaggregated, connected through the network
Q&A
29
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
31
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request. No product or component can be absolutely secure.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, [List the Intel trademarks in your document] are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other
countries.
*Other names and brands may be claimed as the property of others
© Intel Corporation.
32
Legal Disclaimer
Software and workloads used in performance tests may have been optimized
for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance.
