Magnet Shuffle Service: Push-based Shuffle at LinkedIn




The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, one of the most costly operations in batch computation, processes PBs of data and billions of blocks daily in our clusters. With such a rapid increase in Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to random reads of small shuffle blocks on HDDs.

To tackle these challenges and optimize shuffle performance in Apache Spark, we have developed the Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency compared with the existing pull-based shuffle. In addition, by combining push-based and pull-based shuffle, we show how the Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale, both by reducing shuffle-related failures and by removing scaling bottlenecks. Furthermore, we will share our experience productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.


  1. Min Shen, Senior Staff Software Engineer, LinkedIn; Chandni Singh, Senior Software Engineer, LinkedIn
  2. Agenda: Spark @ LinkedIn • Introducing Magnet • Production Results • Future Work
  3. Spark @ LinkedIn – a massive-scale infrastructure: 11K nodes, 40K+ daily Spark apps, ~70% of cluster compute resources, ~18 PB of daily shuffle data, 3X+ growth YoY.
  4. Shuffle Basics – Spark Shuffle: transfers intermediate data across stages via a mesh of connections between mappers and reducers. The shuffle operation is a scaling and performance bottleneck, and its issues are especially visible at large scale.
  5. Issues with Spark Shuffle at LinkedIn: • Reliability issue – the shuffle service becomes unavailable under heavy load. • Efficiency issue – small shuffle blocks hurt disk throughput, which prolongs shuffle wait time. • Scalability issue – job runtime can be very inconsistent during peak cluster hours due to shuffle.
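To see why small shuffle blocks hurt HDD throughput so badly, consider a back-of-the-envelope model. The disk numbers below are illustrative assumptions (typical HDD figures), not measurements from the talk:

```python
# Hedged sketch: on a spinning disk, each random read pays a seek penalty,
# so fetching thousands of tiny shuffle blocks is dominated by seek time,
# while one large sequential read of the same bytes is dominated by transfer.

SEEK_MS = 10.0            # assumed average HDD seek + rotational latency
THROUGHPUT_MB_S = 100.0   # assumed sequential read throughput

def fetch_time_ms(num_reads: int, total_mb: float) -> float:
    """Total time = one seek per read + sequential transfer of all bytes."""
    return num_reads * SEEK_MS + total_mb / THROUGHPUT_MB_S * 1000.0

total_mb = 100.0  # e.g. 10,000 blocks of 10 KB each for one reduce partition
small_reads = fetch_time_ms(10_000, total_mb)  # pull-based: one read per block
merged_read = fetch_time_ms(1, total_mb)       # merged file: one sequential read

assert small_reads / merged_read == 100.0  # seeks dominate by ~100x here
```

Under these assumptions the same 100 MB takes ~101 seconds as small random reads but ~1 second as one sequential read, which is the efficiency issue Magnet targets.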
  6. Existing Optimizations Are Not Sufficient: • Broadcast join can reduce shuffle during joins, but requires one of the tables to fit in memory. • Caching RDDs/DataFrames after shuffling them can potentially reduce shuffle, but has performance implications and limited applicability. • Bucketing is like caching, materializing DataFrames while preserving partitioning; it is much more performant but still has limited applicability and requires manual setup effort. • Adaptive Query Execution in Spark 3.0, with its auto-coalescing feature, can optimize shuffle by slightly increasing shuffle block size. However, none of these is sufficient to address the three issues above.
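For reference, the broadcast-join and AQE auto-coalescing optimizations mentioned above are driven by Spark configuration. The property names below are real Spark 3.0 settings; the values are illustrative, not LinkedIn's production tuning:

```
# spark-defaults.conf sketch (Spark 3.0 property names; values illustrative)
spark.sql.autoBroadcastJoinThreshold             50MB
spark.sql.adaptive.enabled                       true
spark.sql.adaptive.coalescePartitions.enabled    true
spark.sql.adaptive.advisoryPartitionSizeInBytes  128MB
```

Raising the broadcast threshold and coalescing post-shuffle partitions makes individual shuffle blocks larger, but as the slide notes, it does not eliminate the reliability, efficiency, or scalability problems.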
  7. Magnet Shuffle Service adopts a push-based shuffle mechanism (M. Shen, Y. Zhou, C. Singh, "Magnet: Push-based Shuffle Service for Large-scale Data Processing," Proceedings of the VLDB Endowment, 13(12), 2020): • Converts small random reads in shuffle into large sequential reads. • Shuffle intermediate data becomes 2-replicated, improving reliability. • Locality-aware scheduling of reducers further improves performance.
  8. Push-based Shuffle: • A best-effort push/merge operation that complements the existing shuffle. • Shuffle blocks get pushed to remote shuffle services. • Shuffle services merge blocks into per-partition merged shuffle files.
  9. Push-based Shuffle: • Reducers fetch a hybrid of blocks – merged blocks to improve I/O efficiency, and original blocks where merging did not happen. • Improved shuffle data locality. • Merged shuffle files create a second replica of the shuffle data.
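The push/merge/fetch flow on these two slides can be sketched as follows. This is a toy model with hypothetical data structures, not Magnet's actual implementation: mappers push blocks to a shuffle service, the service appends them into per-partition merged files, and a push may be rejected (best-effort), in which case the reducer falls back to pulling the original block:

```python
# Toy sketch of best-effort push-based shuffle (hypothetical, for illustration).
from collections import defaultdict

class ShuffleService:
    def __init__(self):
        self.merged = defaultdict(bytes)  # partition id -> merged "file"

    def push(self, partition: int, block: bytes, accept: bool = True) -> bool:
        """Best-effort merge: a push may be rejected (e.g. under heavy load)."""
        if accept:
            self.merged[partition] += block
        return accept

def fetch(partition, service, original_blocks, unmerged):
    """Reducer fetches the merged block, plus originals that weren't merged."""
    data = service.merged[partition]
    for mapper in unmerged:
        data += original_blocks[(mapper, partition)]  # fall back to pull
    return data

svc = ShuffleService()
originals = {(0, 7): b"a", (1, 7): b"b", (2, 7): b"c"}
svc.push(7, originals[(0, 7)])
svc.push(7, originals[(1, 7)])
svc.push(7, originals[(2, 7)], accept=False)  # push failed; block stays unmerged

assert fetch(7, svc, originals, unmerged=[2]) == b"abc"
```

Note how the reducer still sees complete data even when some pushes fail, and how the merged file doubles as a second copy of the pushed blocks.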
  10. Magnet Results – Scalability and Reliability: currently rolled out to 100% of offline Spark workloads at LinkedIn, about 15-18 PB of shuffle data per day. We leveraged a ramp-based rollout to reduce risk and did not encounter any major issues during the rollout of Magnet.
  11. Magnet Results – Improved Cluster Shuffle Efficiency: 30X increase in shuffle data fetched locally compared with 6 months ago. (Chart: Shuffle Locality Ratio.)
  12. Magnet Results – Improved Cluster Shuffle Efficiency: overall shuffle fetch delay time reduced by 84.8%. (Chart: Avg Shuffle Fetch Delay %.)
  13. Magnet Results – Improved Job Performance: Magnet brings performance improvements to Spark jobs at scale without any user intervention. We analyzed 9,628 Spark workflows that were onboarded to Magnet. Overall compute resource consumption dropped by 16.3%. Among flows previously heavily impacted by shuffle fetch delays (shuffle fetch delay time > 30% of total task time), compute resource consumption dropped by 44.8%. These flows are about 19.5% of the flows we analyzed.
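A quick sanity check on these numbers, under the simplifying assumption (not stated in the talk) that every flow consumes roughly equal compute: if 19.5% of flows save 44.8%, the remaining 80.5% must save about 9.4% to yield the overall 16.3% reduction:

```python
# Back-of-the-envelope decomposition of the overall savings (assumes equal
# compute per flow, which the talk does not state; illustrative only).
heavy_share, heavy_saving = 0.195, 0.448
overall_saving = 0.163

light_saving = (overall_saving - heavy_share * heavy_saving) / (1 - heavy_share)
assert round(light_saving * 100, 1) == 9.4  # implied saving for the other flows
```

In other words, even flows not heavily bottlenecked on shuffle fetch appear to benefit meaningfully under this reading.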
  14. Magnet Results – Improved Job Performance: 50% of heavily impacted workflows have seen at least a 32% reduction in job runtime. (Chart: App Duration Reduction %, a percentile time series over thousands of Spark workflows showing the 25th, 50th, and 75th percentiles.)
  15. Key Finding: the benefits from Magnet increase as adoption of Magnet increases. (Chart: App Duration Reduction %.)
  16. Magnet Results on NVMe: Magnet still achieves significant benefits with NVMe disks for storing shuffle data. Results of a benchmark job with and without Magnet on a 10-node cluster, with HDD and NVMe disks respectively (runtime in minutes; comparison vs. baseline):

                          Runtime with HDD (min)   Runtime with NVMe (min)
        Magnet disabled   16 (baseline)            10 (-37.5%)
        Magnet enabled    7.3 (-54.4%)             4.2 (-73.7%)
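The percentage reductions in the table follow directly from the runtimes, all measured against the 16-minute HDD baseline without Magnet:

```python
# Reproduce the table's reduction percentages from the raw runtimes.
baseline = 16.0  # minutes: HDD, Magnet disabled

def reduction(runtime_min: float) -> float:
    """Percent reduction in runtime relative to the HDD baseline."""
    return (1 - runtime_min / baseline) * 100

assert round(reduction(10), 1) == 37.5    # NVMe alone
assert round(reduction(7.3), 1) == 54.4   # Magnet on HDD
assert abs(reduction(4.2) - 73.7) < 0.1   # Magnet on NVMe (73.75 exactly)
```

Notably, Magnet on HDD (7.3 min) outperforms NVMe without Magnet (10 min), suggesting the I/O pattern matters more than the storage medium here.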
  17. Future Work: contribute Magnet back to OSS Spark (SPARK-30602); a cloud-native architecture for Magnet; bring this superior shuffle architecture to a broader set of use cases.
  18. Optimize Shuffle on Disaggregated Storage. Existing shuffle data storage solutions either store shuffle data on the limited local storage devices attached to VMs, or store it on remote disaggregated storage. Both have drawbacks: the former hampers VM elasticity and risks exhausting local storage capacity; the latter carries non-negligible performance overhead.
  19. Optimize Shuffle on Disaggregated Storage: • Leverage both local and remote storage, with local storage acting as a caching layer for shuffle storage. • With elastic local storage as a cache, the system can tolerate running out of local space or compute VMs being decommissioned. • Preserves Magnet's benefit of much-improved shuffle data locality while decoupling shuffle storage from compute VMs.
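The tiered design above can be sketched as follows. This is a hypothetical toy model, not a design from the talk: the remote disaggregated store holds the durable copy, while bounded local storage acts as a cache that can be lost without losing data:

```python
# Toy sketch of local-cache-over-remote shuffle storage (hypothetical).
class TieredShuffleStore:
    def __init__(self, local_capacity: int):
        self.local, self.remote = {}, {}
        self.local_capacity = local_capacity  # bounded local disk

    def write(self, key: str, data: bytes) -> None:
        self.remote[key] = data                # remote copy is source of truth
        if len(self.local) < self.local_capacity:
            self.local[key] = data             # cache locally while space lasts

    def read(self, key: str) -> bytes:
        return self.local.get(key) or self.remote[key]  # prefer local (fast)

    def decommission_vm(self) -> None:
        self.local.clear()                     # local cache lost; remote survives

store = TieredShuffleStore(local_capacity=1)
store.write("part-0", b"x")
store.write("part-1", b"y")           # local is full; only the remote copy exists
store.decommission_vm()
assert store.read("part-0") == b"x"   # still served correctly after VM loss
```

Reads hit local storage when the cache holds the block (preserving Magnet's locality benefit) and transparently fall back to remote storage otherwise.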
  20. Optimize Python-centric Data Processing: the surge in AI usage in recent years has driven the adoption of Python for building AI data pipelines. Magnet's shuffle optimization is generic enough to benefit both SQL-centric analytics and Python-centric AI use cases. We also plan to support Magnet on Kubernetes.
  21. Resources: • SPARK-30602 (Spark SPIP ticket). • "Magnet: A scalable and performant shuffle architecture for Apache Spark" (blog post). • "Magnet: Push-based Shuffle Service for Large-scale Data Processing" (VLDB 2020). • "Bringing Next-Gen Shuffle Architecture To Data Infrastructure at LinkedIn Scale" (blog post).
  22. Feedback: your feedback is important to us. Don't forget to rate and review the sessions.