Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Min Shen
Senior Staff Software Engineer, LinkedIn
Chandni Singh
Senior Software Engineer, LinkedIn
Agenda
§ Spark @ LinkedIn
§ Introducing Magnet
§ Production Results
§ Future Work
Spark @ LinkedIn
A massive-scale infrastructure
§ 11K nodes
§ 40K+ daily Spark apps
§ ~70% of cluster compute resources
§ ~18 PB daily shuffle data
§ 3X+ growth YoY
Shuffle Basics – Spark Shuffle
§ Transfer intermediate data across stages with a mesh connection between mappers and reducers.
§ Shuffle operation is a scaling and performance bottleneck.
§ Issues are especially visible at large scale.
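For readers less familiar with Spark, here is a minimal sketch of an operation that triggers a shuffle; the job and data below are illustrative, not taken from the deck:

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustration of a shuffle: reduceByKey repartitions the data by key,
// so every reducer fetches its share of blocks from every mapper (mesh connection).
object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-example").getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("spark", "shuffle", "spark", "magnet"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // shuffle boundary here

    counts.collect().foreach(println)
    spark.stop()
  }
}
```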
Issues with Spark Shuffle at LinkedIn
• Reliability issue: shuffle service unavailable under heavy load.
• Efficiency issue: small shuffle blocks hurt disk throughput, which prolongs shuffle wait time (see the back-of-the-envelope sketch below).
• Scalability issue: job runtime can be very inconsistent during peak cluster hours due to shuffle.
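To see why block size matters, here is a back-of-the-envelope sketch with hypothetical (but LinkedIn-scale) numbers, not figures from the deck:

```scala
// Average shuffle block size ≈ total shuffle bytes / (numMappers * numReducers).
// At this scale, blocks end up in the tens of KiB, which turns shuffle reads
// into small random disk I/O on HDDs and drags down throughput.
object BlockSizeSketch extends App {
  val totalShuffleBytes = 1L << 40   // 1 TiB of shuffle data for one stage (hypothetical)
  val numMappers        = 5000L
  val numReducers       = 5000L

  val avgBlockBytes = totalShuffleBytes / (numMappers * numReducers)
  println(s"average shuffle block size ≈ ${avgBlockBytes / 1024} KiB")  // ≈ 42 KiB
}
```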
Existing Optimizations Not Sufficient
• Broadcast join can reduce shuffle during join.
• Requires one of the tables to fit into memory.
• Caching RDD/DF after shuffling them can potentially reduce shuffle.
• Has performance implications and limited applicability.
• Bucketing is like caching by materializing DFs while preserving partitioning.
• Much more performant but still subject to limited applicability and requires manual setup effort.
• Adaptive Query Execution in Spark 3.0 with its auto-coalescing feature can optimize shuffle.
• It slightly increases shuffle block sizes. However, it is not sufficient to address the 3 issues. (A concrete illustration of these existing knobs follows below.)
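As a concrete illustration of two of the existing knobs above (the paths and column name are made up; the config keys are standard Spark 3.0 settings):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object ExistingOptimizations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("existing-optimizations")
      // Adaptive Query Execution with auto-coalescing of small shuffle partitions (Spark 3.0+).
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .getOrCreate()

    val facts = spark.read.parquet("/data/facts")   // hypothetical input paths
    val dims  = spark.read.parquet("/data/dims")

    // Broadcast join: avoids shuffling the large side, but requires the small
    // side to fit into executor memory.
    val joined = facts.join(broadcast(dims), "key")
    joined.write.parquet("/data/output")
  }
}
```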
Magnet Shuffle Service
Magnet shuffle service adopts a push-based shuffle mechanism.
M. Shen, Y. Zhou, C. Singh. “Magnet: Push-based Shuffle Service for Large-scale Data Processing.” Proceedings of the VLDB Endowment, 13(12), 2020.
Convert small random reads in shuffle to large sequential reads.
Shuffle intermediate data becomes 2-replicated, improving reliability.
Locality-aware scheduling of reducers, further improving performance.
Push-based Shuffle
§ Best-effort push/merge operation.
§ Complements existing shuffle.
§ Shuffle blocks get pushed to remote shuffle services.
§ Shuffle services merge blocks into per-partition merged shuffle files.
Push-based Shuffle
§ Reducers fetch a hybrid of blocks.
▪ Fetch merged blocks to improve I/O efficiency.
▪ Fetch original blocks if not merged.
§ Improved shuffle data locality.
§ Merged shuffle files create a second replica of shuffle data.
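For reference, here is a sketch of how push-based shuffle is enabled in the open-source form that SPARK-30602 later delivered in Apache Spark 3.2+; the exact keys come from the OSS contribution, not from this deck:

```scala
import org.apache.spark.SparkConf

// Client-side settings for push-based shuffle in OSS Spark 3.2+ (sketch).
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service is required
  .set("spark.shuffle.push.enabled", "true")     // mappers push blocks to remote shuffle services

// On the shuffle service side (e.g., the YARN node manager's Spark shuffle
// service configuration), the merged-shuffle file manager must be enabled:
//   spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//     org.apache.spark.network.shuffle.RemoteBlockPushResolver
```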
Magnet Results – Scalability and Reliability
Currently rolled out to 100% of offline Spark workloads at LinkedIn.
That’s about 15-18 PB of shuffle data per day.
Leveraged a ramp-based rollout to reduce risks and did not encounter any major issues during the rollout of Magnet.
Magnet Results – Improve Cluster Shuffle Efficiency
• 30X increase in shuffle data fetched locally compared with 6 months ago.
[Chart: shuffle locality ratio over time]
Magnet Results – Improve Cluster Shuffle Efficiency
• Overall shuffle fetch delay time reduced by 84.8%.
[Chart: average shuffle fetch delay % over time]
Magnet Results – Improve Job Performance
Magnet brings performance improvements to Spark jobs at scale without any user intervention.
We analyzed 9628 Spark workflows that were onboarded to Magnet.
Overall compute resource consumption reduction is 16.3%.
Among flows previously heavily impacted by shuffle fetch delays (shuffle fetch delay time > 30% of total task time), overall compute resource consumption reduction is 44.8%. These flows are about 19.5% of the flows we analyzed. (A sketch of how such flows can be identified from task metrics follows below.)
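A hypothetical sketch of how such "heavily impacted" flows could be flagged from Spark task metrics; the listener and threshold below are illustrative, not the analysis pipeline used at LinkedIn:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates shuffle fetch wait time vs. total executor run time across tasks
// and flags the application as "heavily impacted" when fetch wait exceeds 30%.
class ShuffleDelayListener extends SparkListener {
  private var fetchWaitMs = 0L
  private var runTimeMs   = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      fetchWaitMs += metrics.shuffleReadMetrics.fetchWaitTime
      runTimeMs   += metrics.executorRunTime
    }
  }

  def heavilyImpacted: Boolean =
    runTimeMs > 0 && fetchWaitMs.toDouble / runTimeMs > 0.3
}

// Usage: spark.sparkContext.addSparkListener(new ShuffleDelayListener)
```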
Magnet Results – Improve Job Performance
• 50% of heavily impacted workflows have seen at least 32% reduction in their job runtime.
• The percentile time-series graph represents thousands of Spark workflows.
[Chart: app duration reduction % over time, at the 25th, 50th, and 75th percentiles]
Key Finding
• Benefits from Magnet increase as adoption of Magnet increases.
[Chart: app duration reduction % over time]
Magnet Results on NVMe
• Magnet still achieves significant benefits with NVMe disks for storing shuffle data.
• Results of a benchmark job with and without Magnet on a 10-node cluster with HDD and NVMe disks, respectively (all percentages are relative to the 16-minute HDD, Magnet-disabled baseline):

                     Runtime with HDD (min)    Runtime with NVMe (min)
    Magnet disabled  16 (baseline)             10 (-37.5%)
    Magnet enabled   7.3 (-54.4%)              4.2 (-73.7%)
Future Work
§ Contribute Magnet back to OSS Spark: SPARK-30602
§ Cloud-native architecture for Magnet
§ Bring superior shuffle architecture to a broader set of use cases
Optimize Shuffle on Disaggregated Storage
Existing shuffle data storage solutions and their drawbacks:
§ Store shuffle data on the limited local storage devices attached to VMs. This hampers VM elasticity and runs the risk of exhausting local storage capacity.
§ Store shuffle data on remote disaggregated storage. This has non-negligible performance overhead.
Optimize Shuffle on Disaggregated Storage
§ Leverage both local and remote storage, such that the local storage acts as a caching layer for shuffle storage.
§ With elastic local storage used only as a cache, the system can tolerate running out of local storage space or compute VMs getting decommissioned.
§ Preserve Magnet’s benefit of much improved shuffle data locality while decoupling shuffle storage from compute VMs (see the sketch below).
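A toy sketch of this caching idea under the assumptions above; RemoteStore and TieredShuffleWriter are hypothetical names for illustration only, not part of Magnet or Spark:

```scala
import java.nio.file.{Files, Path}

// Hypothetical tiered shuffle writer: local disk acts as a cache for fast,
// locality-friendly reads, while remote disaggregated storage holds the
// authoritative copy so shuffle data survives VM decommissioning.
trait RemoteStore {
  def put(key: String, data: Array[Byte]): Unit
}

class TieredShuffleWriter(localDir: Path, remote: RemoteStore) {
  def write(partition: Int, data: Array[Byte]): Unit = {
    // Always persist to the remote store: shuffle data must not depend on the
    // lifetime of the compute VM.
    remote.put(s"merged-shuffle-$partition", data)

    // Best-effort local cache: tolerate the cache filling up (or the VM going
    // away) by simply skipping the local copy.
    try {
      Files.write(localDir.resolve(s"merged-shuffle-$partition.data"), data)
    } catch {
      case _: java.io.IOException => ()  // cache miss path: reducers fetch from remote
    }
  }
}
```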
Optimize Python-centric Data Processing
The surge in AI usage in recent years has driven the adoption of Python for building AI data pipelines.
Magnet’s optimization of shuffle is generic enough to benefit both SQL-centric analytics and Python-centric AI use cases.
Support Magnet on Kubernetes.
Resources
§ SPARK-30602 (Spark SPIP ticket)
§ Magnet: A scalable and performant shuffle architecture for Apache Spark (blog post)
§ Magnet: Push-based Shuffle Service for Large-scale Data Processing (VLDB 2020)
§ Bringing Next-Gen Shuffle Architecture To Data Infrastructure at LinkedIn Scale (blog post)
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
