Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval Degani

567 views

Published on

The opportunity in accelerating Spark by improving its network data transfer facilities has been under much debate in the last few years. RDMA (remote direct memory access) is a network acceleration technology that is very prominent in the HPC (high-performance computing) world, but has not yet made its way to mainstream Apache Spark. Proper implementation of RDMA in network-oriented applications can improve scalability, throughput, latency and CPU utilization. In this talk we are going to present a new RDMA solution for Apache Spark that shows amazing improvements in multiple Spark use cases. The solution is under development in our labs, and is going to be released to the public as an open-source plug-in.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval Degani

  1. 1. Yuval Degani, Mellanox Technologies ACCELERATING SHUFFLE: A TAILOR- MADE RDMA SOLUTION FOR APACHE SPARK #EUres3
  2. 2. TeraSort accelerated with RDMA RDMA Standard 0 20 40 60 80 100 120 seconds
  3. 3. #EUres3 • Founded 1999 • End-to-end designer and supplier of interconnect solutions: network adapters, switches, system-on-a-chip, cables, silicon and software • 10-400 Gb/s Ethernet and InfiniBand Storage Front / Backend Server / Compute Switch / Gateway 56/100/200G InfiniBand 10/25/40/50/ 100/200/400GbE Virtual Protocol Interconnect 56/100/200G InfiniBand 10/25/40/50/ 100/200/400GbE Virtual Protocol Interconnect
  4. 4. Agenda • What’s RDMA? • Spark’s Shuffle Internals • SparkRDMA Shuffle Plugin • Results • What’s next? #EUres3
  5. 5. What’s RDMA? • Remote Direct Memory Access – Read/write from/to remote memory locations • Zero-copy • Direct hardware interface – bypasses the kernel and TCP/IP in IO path • Flow control and reliability is offloaded in hardware • Sub-microsecond latency • Supported on almost all mid-range/high-end network adapters • Growing cloud support – Already supported in Microsoft Azure (A, H instances) #EUres3 Java app buffer OS Sockets TCP/IP Driver Network Adapter RDMA Socket Context switch
  6. 6. Hardware acceleration in Big Data/Machine Learning platforms • Hardware acceleration adoption is continuously growing – GPU integration is now standard – ASIC integration is spreading fast • RDMA is already integrated in mainstream code of popular frameworks: – TensorFlow – Caffe2 – CNTK • Now it’s Spark’s turn to catch up #EUres3
  7. 7. Spark’s Shuffle Internals Under the hood #EUres3
  8. 8. MapReduce vs. Spark • Spark’s in-memory model completely changed how shuffle is done • In both Spark and MapReduce, map output is saved on the local disk (usually in buffer cache) • In MapReduce, map output is then copied over the network to the destined reducer’s local disk • In Spark, map output is fetched from the network, on-demand, to the reducer’s memory #EUres3 … … Map Reduce Map Reduce Memory-to-network-to-memory? RDMA is a perfect fit!
  9. 9. Spark’s Shuffle Basics #EUres3 Map Reduce task MapReduce Map Map Map Map Input Map output File File File File File Driver Reduce task Reduce task Reduce task Reduce task Fetch blocks Fetch blocks Fetch blocks Fetch blocks Fetch blocks
  10. 10. Shuffle Read Protocol Shuffle Read Driver Reader Writer 1 2 3 7 4 5 6 Request Map Statuses Send back Map Statuses Request blocks from writers Locate blocks, and setup as stream Request blocks from stream, one by one Group block locations by writer Locate block, send back 8 Block data is now ready
  11. 11. The Cost of Shuffling • Shuffling is very expensive in terms of CPU, RAM, disk and network IOs • Spark users try to avoid shuffles as much as they can • Speedy shuffles can relieve developers of such concerns, and simplify applications #EUres3
  12. 12. The Potential in Accelerating Shuffle What’s the opportunity here? #EUres3
  13. 13. The Potential in Accelerating Shuffle • Before setting off on the journey of adding RDMA to Spark, it was essential to estimate the ROI of such work • What’s better than a controlled experiment? #EUres3 Goal Quantify the potential performance gain in accelerating network transfers Method Bypass the network in Spark shuffle and compare to the original code: No network = maximum performance gain …Map Reduce
  14. 14. Experiment: Bypassing The Network in Shuffle • Do no fetch blocks from the network, instead, reuse a local sampled block over and over • Compare to standard Spark, but with block reuse – so reduce data will be identical #EUres3 Reduce task Reduce Reduce task Reduce task Reduce task Reduce task Fetch blocks Fetch blocks Fetch blocks Fetch blocks Fetch blocks Sample data Sample data Sample data Sample data Sample data
  15. 15. Results Benchmark: • HiBench TeraSort • Workload: 600GB Testbed: • HDFS on Hadoop 2.6.0 • Spark 2.0.0 – Standalone mode – Master + 30 workers – 28 cores per worker,840 total • Machine info: – Intel Xeon E5-2697 v3 @ 2.60GHz – 256GB RAM,128GB of it reserved as RAMDISK – RAMDISK is used for Spark local directories and HDFS #EUres3 Network bypass Standar d Spark 0 200 400 600 800 seconds
  16. 16. SparkRDMA Shuffle Plugin Accelerating Shuffle with RDMA #EUres3
  17. 17. Design Goals • Demonstrate significant improvements over standard Spark • Seamlessly accelerate Shuffles with RDMA – no functional limitations • Easy to use and deploy • Minimize code impact #EUres3
  18. 18. Design Approach • Entire Shuffle-related communication is done with RDMA – RPC messaging for meta-data transfers – Block transfers • SparkRDMA is an independent plugin – Implements the ShuffleManager interface – No changes to Spark’s code – use with any existing Spark installation • Reuse Spark facilities – Maximize reliability – Minimize impact on code • RDMA functionality is provided by “DiSNI” – Open-source Java interface to RDMAuser libraries – https://github.com/zrlio/disni • No functionality loss of any kind, SparkRDMA supports: – Compression – Spilling to disk – Recovery from failed map or reduce tasks #EUres3
  19. 19. ShuffleManager Plugin • Spark allows for external implementations of ShuffleManagers to be plugged in – Configurable per-job using: “spark.shuffle.manager” • Interface allows proprietary implementations of Shuffle Writers and Readers, and essentially defers the entire Shuffle process to the new component • SparkRDMA utilizes this interface to introduce RDMA in the Shuffle process #EUres3 SortShuffleManager RdmaShuffleManager
  20. 20. SparkRDMA Components • SparkRDMA reuses the main Shuffle Writer implementations of mainstream Spark: Unsafe & Sort • Shuffle data is written and stored identically to the original implementation • All-new ShuffleReader and ShuffleBlockResolver provide an optimized RDMA transport when blocks are being read over the network #EUres3 RdmaShuffleManager SortShuffleWriter UnsafeShuffleWriter Writers RdmaShuffleReader RdmaShuffleBlockResolver RdmaWrapperShuffleWriter SortShuffleManager SortShuffleWriter UnsafeShuffleWriter BypassMergeSortShuffleWriter Writers BlockStoreShuffleReader IndexShuffleBlockResolver
  21. 21. Shuffle Read Driver Reader Writer 1 2 3 7 4 5 6 Request Map Statuses Send back Map Statuses Request blocks from writers Locate blocks, and setup as stream Request blocks from stream, one by one Group block locations by writer Locate block, send back 8 Block data is now ready RDMA-Read blocks from writers No-op on writer HW offloads transfers 5 Block data is now ready Shuffle Read Protocol – Standard vs. RDMA
  22. 22. Shuffle Read Driver Reader Writer 1 2 3 7 4 5 6 Request Map Statuses Send back Map Statuses Request blocks from writers Locate blocks, and setup as stream Request blocks from stream, one by one Group block locations by writer Locate block, send back 8 Block data is now ready Shuffle Read Driver Reader Writer 1 2 3 4 6 Request Map Statuses Send back Map Statuses Group block locations by writer RDMA-Read blocks from writers No-op on writer HW offloads transfers 5 Block data is now ready StandardRDMA Server-side: ü 0 CPU ü Shuffle transfers are not blocked by GC in executor ü No buffering Client-side: ü Instant transfers ü Reduced messaging ü Direct, unblocked access to remote blocks Reader Writer 7 4 5 6 Request blocks from writers Request blocks from stream, one by one Locate block, send back 8 Block data is now ready Reader Writer 4 6 RDMA-Read blocks from writers No-op on writer HW offloads transfers 5 Block data is now ready Locate blocks, and setup as stream
  23. 23. Benefits • Substantial improvements in: – Block transfer times: latency and total transfer time – Memory consumption and management – CPU utilization • Easy to deploy and configure: – Packed into a single JAR file – Plugin is enabled through a simple configuration handle – Allows finer tuning with a set of configuration handles • Configuration and deployment are on a per-job basis: – Can be deployed incrementally – May be limited to Shuffle-intensive jobs #EUres3
  24. 24. Results #EUres3
  25. 25. Performance Results: TeraSort Testbed: • HiBench TeraSort – Workload:175GB • HDFS on Hadoop 2.6.0 – No replication • Spark 2.0.0 – 1 Master – 15 Workers – 28 active Spark cores on each node,420 total • Node info: – Intel Xeon E5-2697 v3 @ 2.60GHz – RoCE 100GbE – 256GB RAM – HDD is used for Spark local directories and HDFS RDMA Standard 0 20 40 60 80 100 120 seconds
  26. 26. Performance Results: GroupBy Testbed: • GroupBy – 48M keys – Each value:4096 bytes – Workload:183GB • Spark 2.0.0 – 1 Master – 15 Workers – 28 active Spark cores on each node,420 total • Node info: – Intel Xeon E5-2697 v3 @ 2.60GHz – RoCE 100GbE – 256GB RAM – HDD is used for Spark local directories and HDFS RDMA Standard 0 5 10 15 20 25 30 seconds
  27. 27. Roadmap What’s next #EUres3
  28. 28. What’s next? • SparkRDMAofficial v1.0 release is publicly available on GitHub • Integration to upstream Apache Spark • Extend RDMAcapabilities to more Spark components: – RDD accesses – Broadcast • Also in the works: RDMA acceleration for HDFS #EUres3
  29. 29. Open-source • SparkRDMAis available at https://github.com/Mellanox/SparkRDMA – Quick installation guide – Wiki pages for advanced settings • Vote for integrating SparkRDMAinto mainstream Spark: https://issues.apache.org/jira/browse/SPARK-22229 • Feel free to reach out at: yuvaldeg@mellanox.com #EUres3
  30. 30. Thank you. #EUres3

×