
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge

  1. 1. How Spark Beat Hadoop @ 100 TB Sort Advanced Apache Spark Meetup Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc
  2. 2. Meetup Housekeeping
  3. 3. IBM | spark.tc Announcements Deepak Srinivasan Big Commerce Steve Beier IBM Spark Tech Center
  4. 4. IBM | spark.tc Who am I? Streaming Platform Engineer Streaming Data Engineer Netflix Open Source Committer Data Solutions Engineer Apache Contributor Principal Data Solutions Engineer IBM Spark Technology Center
  5. 5. IBM | spark.tc Last Meetup (End-to-End Data Pipeline) Presented `Flux Capacitor` End-to-End Data Pipeline in a Box! Real-time, Advanced Analytics Machine Learning Recommendations Github github.com/fluxcapacitor Docker hub.docker.com/r/fluxcapacitor
  6. 6. IBM | spark.tc Since Last Meetup (End-to-End Data Pipeline) Meetup Statistics Total Spark Experts: ~850 (+100%) Mean RSVPs per Meetup: 268 Mean Attendance Percentage: ~60% of RSVPs Donations: $15 (Thank you so much, but please keep your $!) Github Statistics (github.com/fluxcapacitor) 18 forks, 13 clones, ~1300 views Docker Statistics (hub.docker.com/r/fluxcapacitor) ~1600 downloads
  7. 7. IBM | spark.tc Recent Events Replay of Last SF Meetup in Mtn View @ BaseCRM Presented Flux Capacitor End-to-End Data Pipeline (Scala + Big Data) By The Bay Conference Workshop and 2 Talks Trained ~100 on End-to-End Data Pipeline Galvanize Workshop Trained ~30 on End-to-End Data Pipeline
  8. 8. IBM | spark.tc Upcoming USA Events IBM Hackathon @ Galvanize (Sept 18th – Sept 21st) Advanced Apache Spark Meetup@DataStax (Sept 21st) Spark-Cassandra Spark SQL+DataFrame Connector Cassandra Summit Talk (Sept 22nd – Sept 24th) Real-time End-to-End Data Pipeline w/ Cassandra Strata New York (Sept 29th - Oct 1st)
  9. 9. IBM | spark.tc Upcoming European Events Dublin Spark Meetup Talk (Oct 15th) Barcelona Spark Meetup Talk (Oct ?) Madrid Spark Meetup Talk (Oct ?) Amsterdam Spark Meetup (Oct 27th) Spark Summit Amsterdam (Oct 27th – Oct 29th) Brussels Spark Meetup Talk (Oct 30th)
  10. 10. Spark and the Daytona GraySort Challenge sortbenchmark.org sortbenchmark.org/ApacheSpark2014.pdf
  11. 11. IBM | spark.tc Themes of this Talk: Mechanical Sympathy Seek Once, Scan Sequentially CPU Cache Locality, Memory Hierarchy are Key Go Off-Heap Whenever Possible Customize Data Structures for your Workload
  12. 12. IBM | spark.tc What is the Daytona GraySort Challenge? Key Metric Throughput of sorting 100TB of 100 byte data, 10 byte key Total time includes launching app and writing output file Daytona App must be general purpose Gray Named after Jim Gray
  13. 13. IBM | spark.tc Daytona GraySort Challenge: Input and Resources Input Records are 100 bytes in length First 10 bytes are random key Input generator: `ordinal.com/gensort.html` 28,000 fixed-size partitions for 100 TB sort 250,000 fixed-size partitions for 1 PB sort 1 partition = 1 HDFS block = 1 node = no partial read I/O Hardware and Runtime Resources Commercially available and off-the-shelf Unmodified, no over/under-clocking Generates 500TB of disk I/O, 200TB network I/O
  14. 14. IBM | spark.tc Daytona GraySort Challenge: Rules Must sort to/from OS files in secondary storage No raw disk since I/O subsystem is being tested File and device striping (RAID 0) are encouraged Output file(s) must have correct key order
  15. 15. IBM | spark.tc Daytona GraySort Challenge: Task Scheduling Types of Data Locality (in increasing level of shittiness) PROCESS_LOCAL NODE_LOCAL RACK_LOCAL ANY Delay Scheduling `spark.locality.wait.node`: how long to wait before falling back to the next (shittier) locality level Set to infinite to reduce shittiness and force NODE_LOCAL Straggling Executor JVMs naturally fade away on each run
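For context, a minimal sketch (illustrative value, not the benchmark's exact setting) of how the delay-scheduling wait can be raised in a Spark 1.x `SparkConf` so tasks hold out for NODE_LOCAL placement:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: wait "forever" for NODE_LOCAL instead of degrading to
// RACK_LOCAL/ANY. Spark 1.x interprets a bare number as milliseconds.
val conf = new SparkConf()
  .setAppName("locality-wait-sketch")
  .set("spark.locality.wait.node", "100000000")   // effectively infinite
val sc = new SparkContext(conf)
```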
  16. 16. IBM | spark.tc Daytona GraySort Challenge: Winning Results On-disk only, in-memory caching disabled! 100 TB run: EC2 (i2.8xlarge), 28,000 partitions 1 PB run: EC2 (i2.8xlarge), 250,000 partitions (!!)
  17. 17. IBM | spark.tc Daytona GraySort Challenge: EC2 Configuration 206 EC2 Worker nodes, 1 Master node i2.8xlarge 32 Intel Xeon CPU E5-2670 @ 2.5 GHz 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4 NOOP I/O scheduler: FIFO, request merging, no reordering 3 Gbps mixed read/write disk I/O Deployed within Placement Group/VPC Enhanced Networking Single Root I/O Virtualization (SR-IOV): extension of PCIe 10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  18. 18. IBM | spark.tc Daytona GraySort Challenge: Winning Configuration Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17 Disabled in-memory caching -- all on-disk! HDFS 2.4.1 short-circuit local reads, 2x replication Writes flushed after every run (5 runs for 28,000 partitions) Netty 4.0.23.Final with native epoll Speculative Execution disabled: `spark.speculation`=false Force NODE_LOCAL: `spark.locality.wait.node`=Infinite Force Netty Off-Heap: `spark.shuffle.io.preferDirectBufs`=true Spilling disabled: `spark.shuffle.spill`=false All compression disabled
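Reconstructed as a `SparkConf` sketch purely for readability; the property names come from the slide (Spark 1.2-era), and anything beyond the true/false values is illustrative rather than the team's actual configuration:

```scala
import org.apache.spark.SparkConf

// Approximate reconstruction of the settings named above; not the record
// run's real config file.
val winningConf = new SparkConf()
  .set("spark.speculation", "false")                 // speculative execution disabled
  .set("spark.locality.wait.node", "100000000")      // effectively infinite -> NODE_LOCAL
  .set("spark.shuffle.io.preferDirectBufs", "true")  // Netty off-heap buffers
  .set("spark.shuffle.spill", "false")               // spilling disabled
  .set("spark.shuffle.compress", "false")            // all compression disabled
  .set("spark.shuffle.spill.compress", "false")
```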
  19. 19. IBM | spark.tc Daytona GraySort Challenge: Partitioning Range Partitioning (vs. Hash Partitioning) Take advantage of sequential key space Similar keys grouped together within a partition Ranges defined by sampling 79 values per partition Driver sorts samples and defines range boundaries Sampling took ~10 seconds for 28,000 partitions
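A minimal Scala sketch of the idea (paths and the 10-character key are hypothetical): `sortByKey` samples the keys, lets the driver compute range boundaries, and uses a `RangePartitioner` so similar keys land in the same, globally ordered partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc    = new SparkContext(new SparkConf().setAppName("range-partition-sketch"))
val lines = sc.textFile("hdfs:///input/records.txt")      // hypothetical input path
val keyed = lines.map(line => (line.take(10), line))      // first 10 chars as sort key

// sortByKey samples the key space, the driver sorts the samples and picks
// range boundaries, and records are shuffled into 28,000 ordered ranges.
val sorted = keyed.sortByKey(ascending = true, numPartitions = 28000)
sorted.saveAsTextFile("hdfs:///output/sorted")
```

A hash partitioner, by contrast, scatters neighboring keys across partitions, so sorting each partition locally would not yield a totally ordered output.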
  20. 20. IBM | spark.tc Daytona GraySort Challenge: Why Bother? Sorting relies heavily on shuffle, I/O subsystem Shuffle is major bottleneck in big data processing Large number of partitions can exhaust OS resources Shuffle optimization benefits all high-level libraries Goal is to saturate network controller on all nodes ~125 MB/s (1 Gbps Ethernet), 1.25 GB/s (10 Gbps Ethernet)
  21. 21. IBM | spark.tc Daytona GraySort Challenge: Per Node Results Mappers: 3 Gbps/node disk I/O (8x800 SSD) Reducers: 1.1 Gbps/node network I/O (10Gbps)
  22. 22. Quick Shuffle Refresher
  23. 23. IBM | spark.tc Shuffle Overview All-to-All, Cartesian Product Operation (slide diagram: "Least Useful Example I Could Find")
  24. 24. IBM | spark.tc Spark Shuffle Overview (slide diagram: "Most Confusing Example I Could Find") Stages are Defined by Shuffle Boundaries
  25. 25. IBM | spark.tc Shuffle Intermediate Data: Spill to Disk Intermediate shuffle data stored in memory Spill to Disk `spark.shuffle.spill`=true `spark.shuffle.memoryFraction`=% of all shuffle buffers Competes with `spark.storage.memoryFraction` Bump this up from default!! Will help Spark SQL, too. Skipped Stages Reuse intermediate shuffle data found on reducer DAG for that partition can be truncated
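As a hedged illustration (values are examples, not the record run's settings), the Spark 1.x memory-fraction trade-off looks like this:

```scala
// Shuffle buffers and the block-storage cache compete for the same heap.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.spill", "true")             // spill shuffle data to disk when buffers fill
  .set("spark.shuffle.memoryFraction", "0.4")     // bumped up from the 0.2 default
  .set("spark.storage.memoryFraction", "0.4")     // lowered to make room for shuffle buffers
```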
  26. 26. IBM | spark.tc Shuffle Intermediate Data: Compression `spark.shuffle.compress` Compress outputs (mapper) `spark.shuffle.spill.compress` Compress spills (reducer) `spark.io.compression.codec` LZF: Most workloads (new default for Spark) Snappy: LARGE workloads (less memory required to compress)
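For completeness, a small illustrative snippet of these codec settings (values are examples only):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.compress", "true")          // compress mapper outputs
  .set("spark.shuffle.spill.compress", "true")    // compress reducer-side spills
  .set("spark.io.compression.codec", "lzf")       // or "snappy" for very large workloads
```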
  27. 27. IBM | spark.tc Spark Shuffle Operations join distinct cogroup coalesce repartition sortByKey groupByKey reduceByKey aggregateByKey
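A quick Scala sketch of three of these operations on a toy RDD; `reduceByKey` combines values map-side before the shuffle, so far less data crosses the network than `groupByKey`, which ships every value to its reducer:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc    = new SparkContext(new SparkConf().setAppName("shuffle-ops-sketch"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val counts   = pairs.reduceByKey(_ + _)    // ("a", 4), ("b", 2); map-side combine
val grouped  = pairs.groupByKey()          // ships every value; prefer reduceByKey
val resorted = pairs.sortByKey()           // range-partitions, then sorts each partition
```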
  28. 28. IBM | spark.tc Spark Shuffle Managers `spark.shuffle.manager` = { `hash` <10,000 Reducers Output file determined by hashing the key of (K,V) pair Each mapper creates an output buffer/file per reducer Leads to M*R number of output buffers/files per shuffle `sort` >= 10,000 Reducers Default since Spark 1.2 Wins Daytona GraySort Challenge w/ 250,000 reducers!! `tungsten-sort` (Future Meetup!) }
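Selecting the manager is a one-line setting (shown illustratively; `sort` is already the default from 1.2 on):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.manager", "sort")    // "hash" for small reducer counts, "sort" at scale
```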
  29. 29. IBM | spark.tc Shuffle Managers
  30. 30. IBM | spark.tc Hash Shuffle Manager M*R num open files per shuffle; M=num mappers R=num reducers Mapper Opens 1 File per Partition/Reducer HDFS (2x repl) HDFS (2x repl)
  31. 31. IBM | spark.tc Sort Shuffle Manager Hold Tight!
  32. 32. IBM | spark.tc Tungsten-Sort Shuffle Manager Future Meetup!!
  33. 33. IBM | spark.tc Shuffle Performance Tuning Hash Shuffle Manager (no longer default) `spark.shuffle.consolidateFiles`: consolidate mapper output files `o.a.s.shuffle.FileShuffleBlockResolver` Intermediate Files Increase `spark.shuffle.file.buffer`: reduce seeks & sys calls Increase `spark.reducer.maxSizeInFlight` if memory allows Use smaller number of larger workers to reduce total files SQL: BroadcastHashJoin vs. ShuffledHashJoin `spark.sql.autoBroadcastJoinThreshold`
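The same knobs as a hedged `SparkConf` sketch; the values shown are illustrative, not universal recommendations:

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")      // hash shuffle: fewer mapper output files
  .set("spark.shuffle.file.buffer", "64k")            // larger write buffer -> fewer seeks/sys calls
  .set("spark.reducer.maxSizeInFlight", "96m")        // more in-flight fetches if memory allows
  .set("spark.sql.autoBroadcastJoinThreshold", "20971520") // bytes; broadcast tables under ~20 MB
```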
  34. 34. IBM | spark.tc Shuffle Configuration Documentation spark.apache.org/docs/latest/configuration.html#shuffle-behavior Prefix spark.shuffle
  35. 35. Winning Optimizations Deployed across Spark 1.1 and 1.2
  36. 36. IBM | spark.tc Daytona GraySort Challenge: Winning Optimizations CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment Optimized Sort Algorithm: Elements of (K, V) Pairs Reduce Network Overhead: Async Netty, epoll Reduce OS Resource Utilization: Sort Shuffle
  37. 37. IBM | spark.tc CPU-Cache Locality: (Key, Pointer-to-Record) AlphaSort paper ~1995 Chris Nyberg and Jim Gray Naïve List (Pointer-to-Record) Requires Key to be dereferenced for comparison AlphaSort List (Key, Pointer-to-Record) Key is directly available for comparison
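A toy Scala sketch of the AlphaSort idea (names are illustrative, not Spark internals): sort compact (key, pointer) entries so comparisons never chase a pointer into the full 100-byte records:

```scala
// Each entry carries the sort key inline next to an index ("pointer") into the
// record array, so the comparator stays within a small, cache-friendly array.
case class KeyPtr(key: Long, recordIndex: Int)

def alphaSort(records: Array[Array[Byte]], keyOf: Array[Byte] => Long): Array[KeyPtr] = {
  val entries = Array.tabulate(records.length)(i => KeyPtr(keyOf(records(i)), i))
  // Only dereference records(entry.recordIndex) after sorting, when writing output.
  entries.sortBy(_.key)
}
```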
  38. 38. IBM | spark.tc CPU-Cache Locality: Cache Alignment Key(10 bytes) + Pointer(4 bytes*) = 14 bytes *4 bytes when using compressed OOPS (<32 GB heap) Not binary in size, not CPU-cache friendly Cache Alignment Options ① Add Padding (2 bytes) Key(10 bytes) + Pad(2 bytes) + Pointer(4 bytes)=16 bytes ② (Key-Prefix, Pointer-to-Record) Perf affected by key distro Key-Prefix (4 bytes) + Pointer (4 bytes)=8 bytes
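Option ② can be pictured as packing the 4-byte key prefix and 4-byte pointer into a single 8-byte long (a simplified illustration, not Spark's actual record layout; assumes a signed prefix):

```scala
// High 32 bits: key prefix; low 32 bits: record pointer/index.
def pack(keyPrefix: Int, pointer: Int): Long =
  (keyPrefix.toLong << 32) | (pointer.toLong & 0xFFFFFFFFL)

def prefixOf(packed: Long): Int  = (packed >>> 32).toInt
def pointerOf(packed: Long): Int = (packed & 0xFFFFFFFFL).toInt

// Sorting the packed longs orders entries by key prefix first (the high bits);
// only prefix ties require dereferencing the full 10-byte key via the pointer.
```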
  39. 39. IBM | spark.tc CPU-Cache Locality: Performance Comparison
  40. 40. IBM | spark.tc Optimized Sort Algorithm: Elements of (K, V) Pairs `o.a.s.util.collection.TimSort` Based on JDK 1.7 TimSort Performs best on partially-sorted datasets Optimized for elements of (K,V) pairs Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat) `o.a.s.util.collection.AppendOnlyMap` Open addressing hash, quadratic probing Array of [(key0, value0), (key1, value1)] Good memory locality Keys never removed, values only appended
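A greatly simplified, illustrative take on the `AppendOnlyMap` layout (not Spark's code): keys and values interleaved in one flat array, open addressing with a quadratic (triangular-step) probe, no deletes and no resize handling:

```scala
class TinyAppendOnlyMap[K, V](capacity: Int = 64) {          // capacity: power of 2
  private val data = new Array[Any](2 * capacity)            // data(2*i) = key, data(2*i+1) = value

  def changeValue(key: K, update: Option[V] => V): Unit = {
    var pos  = (key.hashCode() & 0x7fffffff) % capacity
    var step = 1
    while (true) {
      val cur = data(2 * pos)
      if (cur == null) {                                      // empty slot: append new entry
        data(2 * pos) = key
        data(2 * pos + 1) = update(None)
        return
      } else if (cur == key) {                                // existing key: merge value in place
        data(2 * pos + 1) = update(Some(data(2 * pos + 1).asInstanceOf[V]))
        return
      }
      pos = (pos + step) % capacity                           // probe offsets 1, 3, 6, 10, ...
      step += 1
    }
  }
}
```

Calling `map.changeValue("a", { case None => 1; case Some(v) => v + 1 })` then behaves like a reduce-by-key merge, while entries are only ever appended or updated in place.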
  41. 41. IBM | spark.tc Reduce Network Overhead: Async Netty, epoll New Network Module based on Async Netty Replaces old java.nio, low-level, socket-based code Zero-copy epoll uses kernel-space between disk & network Custom memory management reduces GC pauses `spark.shuffle.blockTransferService`=netty Spark-Netty Performance Tuning `spark.shuffle.io.numConnectionsPerPeer` Increase to saturate hosts with multiple disks `spark.shuffle.io.preferDirectBufs` On or off-heap (off-heap is default) Apache Spark Jira SPARK-2468
  42. 42. IBM | spark.tc Reduce OS Resource Utilization: Sort Shuffle M open files per shuffle; M = num of mappers `spark.shuffle.sort.bypassMergeThreshold` Mapper merge-sorts its partitions into 1 Master File, indexed by partition range offsets Reducers seek and scan from the range offset of the Master File on the mapper TimSort (RAM), Merge Sort (Disk) SPARK-2926: Replace TimSort w/ Merge Sort (Memory) HDFS (2x repl)
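A toy sketch (not Spark's implementation) of what the reducer-side seek-and-scan looks like: one data file per mapper holding all partitions back to back, plus an offsets index:

```scala
import java.io.RandomAccessFile

// offsets(p) is where partition p starts; offsets(p + 1) is where it ends.
def readPartition(masterFilePath: String, offsets: Array[Long], p: Int): Array[Byte] = {
  val file = new RandomAccessFile(masterFilePath, "r")
  try {
    val length = (offsets(p + 1) - offsets(p)).toInt
    val buf    = new Array[Byte](length)
    file.seek(offsets(p))        // seek once to the partition's range offset
    file.readFully(buf)          // then scan sequentially
    buf
  } finally file.close()
}
```

Each mapper thus keeps one open master file regardless of reducer count, instead of one file per reducer as in the hash shuffle.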
  43. 43. Bonus!
  44. 44. IBM | spark.tc External Shuffle Service: Separate JVM Process Takes over when Spark Executor is in GC or dies Uses the new Netty-based Network Module Required for YARN dynamic allocation Node Manager serves files Apache Spark Jira: SPARK-3796
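Enabling it is a pair of standard Spark properties (shown here illustratively in Scala):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.service.enabled", "true")      // NodeManager-side service keeps serving shuffle files
  .set("spark.dynamicAllocation.enabled", "true")    // dynamic allocation requires the external service
```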
  45. 45. Next Steps Project Tungsten
  46. 46. IBM | spark.tc Project Tungsten: CPU and Memory Optimizations Daytona GraySort Optimizations: Disk, Network Tungsten Optimizations: CPU, Memory Custom Memory Management Eliminates JVM object and GC overhead More Cache-aware Data Structs and Algos `o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashMap Code Generation (default in 1.5) Generate bytecode from overall query plan
  47. 47. Thank you! Special thanks to Big Commerce!! IBM Spark Tech Center is Hiring! Nice people only, please!!  IBM | spark.tc Sign up for our newsletter at To Be Continued…
  48. 48. IBM | spark.tc Relevant Links http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
  49. 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark
