Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash with Patrick Stuedi

Effectively leveraging fast networking and storage hardware (e.g., RDMA, NVMe, etc.) in Apache Spark remains challenging. Current ways to integrate the hardware at the operating system level fall short, as the hardware performance advantages are shadowed by higher layer software overheads. This session will show how to integrate RDMA and NVMe hardware in Spark in a way that allows applications to bypass both the operating system and the Java virtual machine during I/O operations. With such an approach, the hardware performance advantages become visible at the application level, and eventually translate into workload runtime improvements. Stuedi will demonstrate how to run various Spark workloads (e.g., SQL, Graph, etc.) effectively on 100Gbit/s networks and NVMe flash.


  1. Patrick Stuedi, IBM Research. Running Spark on a High-Performance Cluster using RDMA Networking and NVMe Flash
  2. Hardware Trends (2010 vs. 2017 community target). Network: 1 Gbps / 50 us in 2010, 10 Gbps / 20 us in 2017. Storage: 100 MB/s / 100 ms in 2010, 1000 MB/s / 200 us in 2017.
  3. Hardware Trends (our 2017 target): network 100 Gbps / 1 us, storage 10 GB/s / 50 us, compared with the 2017 community target of 10 Gbps / 20 us and 1000 MB/s / 200 us.
  4. User-Level APIs. Traditional / kernel-based: storage applications go through the file system, block layer, and iSCSI inside the OS kernel; network applications go through sockets and TCP/IP over Ethernet. User-level / kernel-bypass: applications drive the SSD and NIC directly via SPDK, RDMA, and DPDK, bypassing the kernel. Kernel bypass is needed to achieve 1 us RTT!
  5. Remote Data Access. Charts: access latency vs. data size (512 B to 1 MB) for NVMe flash (NVMe over Fabrics vs. iSCSI) and for DRAM (RDMA vs. sockets). User-level APIs offer 2-10x better performance than traditional kernel-based APIs.
  6. Let's Use it! Run Spark (SQL, Streaming, MLlib, GraphX) on DRAM, NVMe flash, and a high-performance RDMA network: performance, real-time, disaggregation!
  7. Case Study: Sorting in Spark. Map tasks read the input and classify records across cores; the shuffle moves the data from map to reduce (network + serialization); reduce tasks sort the data and store the output.
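     A minimal sketch of what such a sort looks like in Spark, assuming a TeraSort-style record layout where the first 10 characters of each line form the key (paths and record format are illustrative, not the benchmark's actual code):

        import org.apache.spark.{SparkConf, SparkContext}

        // Map tasks read and classify records into (key, value) pairs, sortByKey
        // triggers the shuffle plus a per-partition sort, and the output is written
        // back out in sorted order.
        object SimpleSort {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("simple-sort"))
            sc.textFile("hdfs:///sort/input")                  // illustrative input path
              .map(line => (line.take(10), line.drop(10)))     // key = first 10 characters
              .sortByKey()                                     // shuffle + per-partition sort
              .map { case (k, v) => k + v }
              .saveAsTextFile("hdfs:///sort/output")           // illustrative output path
            sc.stop()
          }
        }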
  8. Experiment Setup ● Total data size: 12.8 TB ● Cluster size: 128 nodes ● Cluster hardware: – DRAM: 512 GB DDR4 – Storage: 4x 1.2 TB NVMe SSD – Network: 100GbE Mellanox RDMA. Flash bandwidth per node matches network bandwidth.
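     The note that per-node flash bandwidth matches the network can be sanity-checked with rough numbers; the per-drive throughput below is an assumed ballpark, not a figure from the slide:

        // Rough check: 4 NVMe SSDs per node vs. a 100 Gbit/s network.
        val assumedSsdGBps = 3.0                     // assumed sequential throughput per drive (GB/s)
        val flashGbitps    = 4 * assumedSsdGBps * 8  // 4 drives, bytes -> bits: ~96 Gbit/s
        val networkGbitps  = 100.0                   // 100GbE Mellanox RDMA
        // ~96 Gbit/s of flash vs. 100 Gbit/s of network: roughly balanced.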
  9. How is the Network Used? Chart: network throughput (Gbit/s) over elapsed time (seconds) during the sort, with the 100 Gbit/s hardware limit marked.
  10. What is the Problem? ● Spark uses legacy networking and storage APIs (sockets, TCP/IP, Ethernet; file system, block layer, iSCSI): no kernel bypass ● Spark itself introduces additional I/O layers: Netty, serializer, sorter, etc.
  11. Example: Shuffle (Map). On the map side, each object passes through the sorter, the Kryo serializer, a BufferedOutputStream, a JNI backend, and the file system buffer cache, with a buffer copy at every layer.
  12. Example: Shuffle (Map). The same map-side path viewed as a chain: object, sorter, Kryo, stream buffer, JNI buffer, file system buffer.
  13. Example: Shuffle (Map+Reduce). The reduce side mirrors it: network, socket buffer, Netty buffers, Kryo, sorter, again with buffering at every layer.
  14. Example: Shuffle (Map+Reduce). Sustaining all of these copies is difficult at 100 Gbit/s!
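     The map-side layering can be pictured with plain JVM APIs. The sketch below is not Spark's actual shuffle writer; it only illustrates the kind of copy chain slides 11-14 describe, with each record Kryo-serialized into a buffered stream that finally reaches the file system:

        import java.io.{BufferedOutputStream, FileOutputStream}
        import com.esotericsoftware.kryo.Kryo
        import com.esotericsoftware.kryo.io.Output

        val records = Seq.fill(1000)(Array.fill[Byte](64)(0))   // illustrative shuffle records

        // object -> Kryo buffer -> BufferedOutputStream buffer -> file system buffer cache:
        // the data is copied at every layer before it ever reaches the device.
        val kryo     = new Kryo()
        val file     = new FileOutputStream("/tmp/shuffle_0_0")   // illustrative path
        val buffered = new BufferedOutputStream(file, 32 * 1024)  // one more copy
        val output   = new Output(buffered, 4 * 1024)             // Kryo's own buffer

        records.foreach(r => kryo.writeObject(output, r))
        output.close()                                            // flushes through all layers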
  15. How can we fix this... ● Not just for shuffle – Also for broadcast, RDD transport, inter-job sharing, etc. ● Not just for RDMA and NVMe hardware – But for any possible future high-performance I/O hardware ● Not just for co-located compute/storage – Also for resource disaggregation, heterogeneous resource distribution, etc. ● Not just improve things – Make it perform at the hardware limit
  16. The CRAIL Approach
  17. Example: Crail Shuffle (Map). The Crail serializer writes objects straight to a CrailOutputStream; the data lands in local DRAM (memcpy), remote DRAM (RDMA), local flash (NVMe), or remote flash (NVMe over Fabrics). No buffering.
  18. Example: Crail Shuffle (Reduce). The Crail serializer reads objects straight from a CrailInputStream backed by the same local/remote DRAM and flash tiers. No buffering.
  19. Example: Crail Shuffle (Reduce). The same approach applies to broadcast, input/output, RDDs, etc.
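     By contrast, the Crail serializer hands data straight to the storage stream. The sketch below only shows the shape of that path: the streams are plain java.io stand-ins for Crail's CrailOutputStream/CrailInputStream, and the Crail client calls that would actually produce them are omitted rather than guessed at:

        import java.io.{InputStream, OutputStream}
        import com.esotericsoftware.kryo.Kryo
        import com.esotericsoftware.kryo.io.{Input, Output}

        // Conceptual Crail-style shuffle path: serialize directly into the stream exposed
        // by the storage tier (local/remote DRAM or flash) instead of stacking extra
        // buffering layers in between. In real code the streams would come from the
        // Crail client as CrailOutputStream / CrailInputStream.
        class DirectShuffleSerializer(kryo: Kryo = new Kryo()) {
          def writeAll(records: Iterator[Array[Byte]], storage: OutputStream): Unit = {
            val out = new Output(storage)                   // writes straight into the storage stream
            records.foreach(r => kryo.writeObject(out, r))
            out.close()
          }

          def readAll(storage: InputStream, count: Int): Seq[Array[Byte]] = {
            val in = new Input(storage)
            val result = (1 to count).map(_ => kryo.readObject(in, classOf[Array[Byte]]))
            in.close()
            result
          }
        }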
  20. Performance: Configuration ● Experiments – Memory-only: Broadcast, GroupBy, Sorting, SQL – Memory/Flash: disaggregation, tiering ● Cluster size: 8 nodes, except Sorting: 128 nodes ● Cluster hardware: – DRAM: 512 GB DDR4 – Storage: 4x 1.2 TB NVMe SSD – Network: 100GbE Mellanox RDMA ● Spark 2.1.0, Alluxio 1.4, Crail 1.0 ● hadoop.tmp.dir: RamFS for microbenchmarks, flash SSD for workloads
  21. Spark Broadcast.
      val bcVar = sc.broadcast(new Array[Byte](128))
      sc.parallelize(1 to tasks, tasks).map(_ => { bcVar.value.length }).count
      The Spark driver writes the broadcast variable to Crail Store under /spark/broadcast/broadcast_0 (RPC + 2x RDMA + RPC); Spark executors read it back (RPC + RDMA). Chart: CDF of broadcast times.
  22. Spark GroupBy (80M keys, 4K).
      val pairs = sc.parallelize(1 to tasks, tasks).flatMap(_ => {
        var values = new Array[(Long, Array[Byte])](numKeys)
        values = initValues(values)
        values
      }).cache().groupByKey().count()
      Map tasks hash each record and append it to per-bucket files under /shuffle/<id>/ in Crail; reduce tasks fetch a whole bucket directory. Charts: network throughput (Gbit/s) vs. elapsed time for 1, 4, and 8 cores, vanilla Spark vs. Spark/Crail, with speedups of 5x, 2.5x, and 2x.
  23. Sorting 12.8 TB on 128 nodes. Charts: sorting runtime (map and reduce phases) for vanilla Spark vs. Spark/Crail, and network usage per task (throughput in Gbit/s vs. task ID) for Spark/Crail. Spark/Crail is 6x faster.
  24. How fast is this?
                            Spark/Crail   Winner 2014   Winner 2016
      Size (TB)                    12.8           100           100
      Time (sec)                     98          1406           134
      Total cores                  2560          6592         10240
      Network HW (Gbit/s)           100            10           100
      Rate/core (GB/min)           3.13          0.66           4.4
      Winner 2014 ran Spark; Winner 2016 ran a native C distributed sorting benchmark (www.sortingbenchmark.org). The sorting rate of Spark/Crail is only 27% slower than that of the 2016 winner.
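     The Rate/core column follows directly from size, cores, and runtime; a quick check (assuming 1 TB = 1024 GB, which reproduces the slide's 3.13 figure):

        // Sorting rate per core in GB/min: (size per core) / runtime * 60.
        def ratePerCore(sizeTB: Double, cores: Int, seconds: Double): Double =
          (sizeTB * 1024 / cores) / seconds * 60

        val crail      = ratePerCore(12.8,  2560,   98)   // ~3.13 GB/min
        val winner2014 = ratePerCore(100.0, 6592, 1406)   // ~0.66 GB/min
        val winner2016 = ratePerCore(100.0, 10240, 134)   // ~4.48 GB/min (slide rounds to 4.4)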
  25. Spark SQL Join.
      val ds1 = sparkSession.read.parquet("...") // 64GB dataset
      val ds2 = sparkSession.read.parquet("...") // 64GB dataset
      val resultDS = ds1.joinWith(ds2, ds1("key") === ds2("key")) // ~100GB
      resultDS.write.format("parquet").mode(SaveMode.Overwrite).save("...")
      Sort-merge join: read the parquet inputs from HDFS/Crail/Alluxio and deserialize, shuffle via the map-side hash/append and reduce-side fetch pattern, then sort, join, and write the result back. Chart: time breakdown (input, hash, join, output) for Spark/Alluxio vs. Spark/Crail.
  26. Storage Tiering: DRAM & NVMe. Spark executors store input/output/shuffle data in Crail, which spreads it across a memory tier and a flash tier over the network. Chart: 200 GB sorting runtime (map and reduce) for different memory/flash ratios, with vanilla Spark (100% in-memory) as the baseline. With Crail, replacing memory with NVMe flash for storing input/output/shuffle data increases the 200 GB sorting runtime by only 48%.
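     Crail's actual tier-selection logic is not shown on the slide; purely as an illustration of the idea of tiering, a hypothetical policy might fill the cluster-wide DRAM tier first and overflow to flash:

        // Hypothetical tier-selection sketch, not Crail's real policy code.
        sealed trait Tier
        case object DramTier  extends Tier
        case object FlashTier extends Tier

        // Prefer DRAM while capacity remains, then spill the rest to flash.
        def pickTier(bytesToWrite: Long, dramBytesFree: Long): Tier =
          if (bytesToWrite <= dramBytesFree) DramTier else FlashTier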
  27. DRAM & Flash Disaggregation. The input/output/shuffle data is moved off the compute nodes running the Spark executors onto dedicated storage nodes. Chart: 200 GB sorting runtime (map and reduce) for Crail/DRAM, Crail/NVMe, and Vanilla/Alluxio. Using Crail, a Spark 200 GB sorting workload can be run with memory and flash disaggregated at no extra cost.
  28. Conclusions ● Effectively using high-performance I/O hardware in Spark is challenging ● Crail is an attempt to re-think how data processing frameworks (not only Spark) should interact with network and storage hardware – User-level I/O, storage disaggregation, memory/flash convergence ● Spark's modular architecture allows Crail to be used seamlessly
  29. Crail for Spark is Open Source ● www.crail.io ● github.com/zrlio/spark-io ● github.com/zrlio/crail ● github.com/zrlio/parquetgenerator ● github.com/zrlio/crail-terasort
  30. Thank You. Contributors: Animesh Trivedi, Jonas Pfefferle, Michael Kaufmann, Bernard Metzler, Radu Stoica, Ioannis Koltsidas, Nikolos Ioannou, Christoph Hagleitner, Peter Hofstee, Evangelos Eleftheriou, ...
