1. How Spark Beat Hadoop @ 100 TB Sort
   Advanced Apache Spark Meetup
   Chris Fregly, Principal Data Solutions Engineer
   IBM Spark Technology Center
   Power of data. Simplicity of design. Speed of innovation.

2. Meetup Housekeeping

3. Announcements
   Deepak Srinivasan, Big Commerce
   Steve Beier, IBM Spark Tech Center

4. Who am I?
   Streaming Platform Engineer, Streaming Data Engineer
   Netflix Open Source Committer
   Data Solutions Engineer, Apache Contributor
   Principal Data Solutions Engineer, IBM Spark Technology Center

5. Last Meetup (End-to-End Data Pipeline)
   Presented `Flux Capacitor`: End-to-End Data Pipeline in a Box!
   Real-time, Advanced Analytics, Machine Learning, Recommendations
   GitHub: github.com/fluxcapacitor
   Docker: hub.docker.com/r/fluxcapacitor

6. Since Last Meetup (End-to-End Data Pipeline)
   Meetup Statistics
     Total Spark Experts: ~850 (+100%)
     Mean RSVPs per Meetup: 268
     Mean Attendance: ~60% of RSVPs
     Donations: $15 (Thank you so much, but please keep your $!)
   GitHub Statistics (github.com/fluxcapacitor): 18 forks, 13 clones, ~1,300 views
   Docker Statistics (hub.docker.com/r/fluxcapacitor): ~1,600 downloads

7. Recent Events
   Replay of last SF Meetup in Mtn View @ BaseCRM
     Presented Flux Capacitor End-to-End Data Pipeline
   (Scala + Big Data) By The Bay Conference: workshop and 2 talks
     Trained ~100 on End-to-End Data Pipeline
   Galvanize Workshop
     Trained ~30 on End-to-End Data Pipeline

8. Upcoming USA Events
   IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
   Advanced Apache Spark Meetup @ DataStax (Sept 21st)
     Spark-Cassandra Spark SQL + DataFrame Connector
   Cassandra Summit Talk (Sept 22nd – Sept 24th)
     Real-time End-to-End Data Pipeline w/ Cassandra
   Strata New York (Sept 29th – Oct 1st)

9. Upcoming European Events
   Dublin Spark Meetup Talk (Oct 15th)
   Barcelona Spark Meetup Talk (Oct ?)
   Madrid Spark Meetup Talk (Oct ?)
   Amsterdam Spark Meetup (Oct 27th)
   Spark Summit Amsterdam (Oct 27th – Oct 29th)
   Brussels Spark Meetup Talk (Oct 30th)

10. Spark and the Daytona GraySort Challenge
    sortbenchmark.org
    sortbenchmark.org/ApacheSpark2014.pdf

11. Themes of this Talk: Mechanical Sympathy
    Seek Once, Scan Sequentially
    CPU Cache Locality and the Memory Hierarchy are Key
    Go Off-Heap Whenever Possible
    Customize Data Structures for your Workload

12. What is the Daytona GraySort Challenge?
    Key Metric
      Throughput of sorting 100 TB of 100-byte records with 10-byte keys
      Total time includes launching the app and writing the output file
    Daytona: app must be general purpose
    Gray: named after Jim Gray

13. Daytona GraySort Challenge: Input and Resources
    Input
      Records are 100 bytes in length; the first 10 bytes are a random key
      Input generator: ordinal.com/gensort.html
      28,000 fixed-size partitions for the 100 TB sort
      250,000 fixed-size partitions for the 1 PB sort
      1 partition = 1 HDFS block = 1 node = no partial-read I/O
    Hardware and Runtime Resources
      Commercially available and off-the-shelf
      Unmodified, no over/under-clocking
    Generates 500 TB of disk I/O and 200 TB of network I/O

14. Daytona GraySort Challenge: Rules
    Must sort to/from OS files in secondary storage
    No raw disk, since the I/O subsystem is being tested
    File and device striping (RAID 0) are encouraged
    Output file(s) must have correct key order

15. Daytona GraySort Challenge: Task Scheduling
    Types of Data Locality (in increasing level of shittiness)
      PROCESS_LOCAL
      NODE_LOCAL
      RACK_LOCAL
      ANY
    Delay Scheduling
      `spark.locality.wait.node`: time to wait before falling back to the next (shittier) locality level
      Set it to infinite to force NODE_LOCAL and avoid falling back
    Straggling Executor JVMs naturally fade away on each run

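A minimal sketch of the delay-scheduling setup above, assuming it is applied through `SparkConf` in the driver. There is no literal "infinite" value, so an absurdly large wait stands in for it; the app name is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch: wait (effectively) forever for a NODE_LOCAL slot instead of
// falling back to RACK_LOCAL or ANY. Values are illustrative.
val conf = new SparkConf()
  .setAppName("graysort-locality-sketch")       // hypothetical app name
  .set("spark.locality.wait", "3s")             // base wait per locality level
  .set("spark.locality.wait.node", "1000000s")  // "infinite": pin tasks at NODE_LOCAL
```
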
16. Daytona GraySort Challenge: Winning Results
    On-disk only, in-memory caching disabled!
    100 TB on EC2 (i2.8xlarge), 28,000 partitions: 23 minutes on 206 nodes
      (previous Hadoop MapReduce record: 72 minutes on 2,100 nodes)
    1 PB on EC2 (i2.8xlarge), 250,000 partitions (!!): ~4 hours on 190 nodes

17. Daytona GraySort Challenge: EC2 Configuration
    206 EC2 Worker nodes, 1 Master node
    i2.8xlarge
      32 Intel Xeon CPU E5-2670 @ 2.5 GHz
      244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
      NOOP I/O scheduler: FIFO, request merging, no reordering
      3 GB/s mixed read/write disk I/O
    Deployed within Placement Group/VPC
    Enhanced Networking
      Single Root I/O Virtualization (SR-IOV): extension of PCIe
      10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)

18. Daytona GraySort Challenge: Winning Configuration
    Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
    Disabled in-memory caching -- all on-disk!
    HDFS 2.4.1, short-circuit local reads, 2x replication
    Writes flushed after every run (5 runs for 28,000 partitions)
    Netty 4.0.23.Final with native epoll
    Speculative Execution disabled: `spark.speculation`=false
    Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
    Force Netty Off-Heap: `spark.shuffle.io.preferDirectBufs`=true
    Spilling disabled: `spark.shuffle.spill`=false
    All compression disabled

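As a hedged sketch, the settings above translate into a `SparkConf` roughly like this (property names as in Spark 1.2, with `spark.shuffle.io.preferDirectBufs` being the actual flag name; this is an illustration, not the verified submission script):

```scala
import org.apache.spark.SparkConf

// Sketch of the record run's knobs; values are illustrative.
val conf = new SparkConf()
  .set("spark.speculation", "false")                 // no speculative execution
  .set("spark.locality.wait.node", "100000000")      // effectively infinite (ms in 1.2)
  .set("spark.shuffle.io.preferDirectBufs", "true")  // Netty off-heap buffers
  .set("spark.shuffle.spill", "false")               // spilling disabled
  .set("spark.shuffle.compress", "false")            // all compression disabled
  .set("spark.shuffle.spill.compress", "false")
```
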
19. Daytona GraySort Challenge: Partitioning
    Range Partitioning (vs. Hash Partitioning)
      Takes advantage of the sequential key space
      Similar keys are grouped together within a partition
    Ranges defined by sampling 79 values per partition
      Driver sorts the samples and defines the range boundaries
      Sampling took ~10 seconds for 28,000 partitions

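A small sketch of how range partitioning plus a per-partition sort looks in RDD code, assuming a hypothetical HDFS input path and treating the first 10 characters of each line as the key:

```scala
import org.apache.spark.{RangePartitioner, SparkContext}

// Sketch: RangePartitioner samples the keys to pick 28,000 contiguous
// key ranges; repartitionAndSortWithinPartitions then yields a globally
// sorted output once partitions are read back in order.
def rangeSort(sc: SparkContext): Unit = {
  val records = sc.textFile("hdfs:///graysort/input")     // hypothetical path
    .map(line => (line.take(10), line.drop(10)))          // 10-byte key, 90-byte value

  val partitioner = new RangePartitioner(28000, records)  // driver-side sampling
  val sorted = records.repartitionAndSortWithinPartitions(partitioner)
  sorted.saveAsTextFile("hdfs:///graysort/output")        // hypothetical path
}
```
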
20. Daytona GraySort Challenge: Why Bother?
    Sorting relies heavily on the shuffle and the I/O subsystem
    Shuffle is a major bottleneck in big data processing
    A large number of partitions can exhaust OS resources
    Shuffle optimization benefits all high-level libraries
    Goal is to saturate the network controller on all nodes
      ~125 MB/s (1 Gb Ethernet), ~1.25 GB/s (10 Gb Ethernet)

21. Daytona GraySort Challenge: Per-Node Results
    Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD)
    Reducers: 1.1 GB/s/node network I/O (on a 10 Gbps, ~1.25 GB/s NIC)

22. Quick Shuffle Refresher

23. Shuffle Overview
    All-to-All, Cartesian Product Operation
    [diagram: "the least useful example I could find"]

24. Spark Shuffle Overview
    [diagram: "the most confusing example I could find"]
    Stages are Defined by Shuffle Boundaries

25. Shuffle Intermediate Data: Spill to Disk
    Intermediate shuffle data is stored in memory
    Spill to Disk
      `spark.shuffle.spill`=true
      `spark.shuffle.memoryFraction`: % of heap for all shuffle buffers
        Competes with `spark.storage.memoryFraction`
        Bump this up from the default!! Will help Spark SQL, too.
    Skipped Stages
      Reuse intermediate shuffle data found on the reducer
      The DAG for that partition can be truncated

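A hedged sketch of rebalancing the two competing fractions (Spark 1.x memory model; the 0.4/0.4 split is an illustrative choice, not the challenge's setting):

```scala
import org.apache.spark.SparkConf

// Sketch: give shuffle buffers more heap at the expense of the block cache.
val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")          // spill when buffers overflow
  .set("spark.shuffle.memoryFraction", "0.4")  // up from the 0.2 default
  .set("spark.storage.memoryFraction", "0.4")  // down from the 0.6 default
```
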
26. Shuffle Intermediate Data: Compression
    `spark.shuffle.compress`: compress outputs (mapper)
    `spark.shuffle.spill.compress`: compress spills (reducer)
    `spark.io.compression.codec`
      LZF: most workloads (new default for Spark)
      Snappy: LARGE workloads (less memory required to compress)

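The same knobs as a sketch (short codec aliases like "lzf" and "snappy" are accepted; the choice here is illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: compress mapper outputs and reducer spills, and pick a codec.
val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")        // mapper outputs
  .set("spark.shuffle.spill.compress", "true")  // reducer spills
  .set("spark.io.compression.codec", "lzf")     // or "snappy" for very large workloads
```
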
27. Spark Shuffle Operations
    join, distinct, cogroup, coalesce, repartition,
    sortByKey, groupByKey, reduceByKey, aggregateByKey

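For example, each wide transformation below ends a stage at a shuffle boundary (a hypothetical word count, assuming an existing `SparkContext` named `sc`, just to make the boundaries visible):

```scala
// Sketch: two shuffles, hence three stages.
val counts = sc.parallelize(Seq("spark", "sort", "spark"))
  .map(w => (w, 1))        // narrow: stays in the same stage
  .reduceByKey(_ + _)      // shuffle #1: partitioned by key, map-side combined
  .sortByKey()             // shuffle #2: range-partitioned by key
counts.collect().foreach(println)
```
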
28. Spark Shuffle Managers
    `spark.shuffle.manager` = {
      `hash`: < 10,000 reducers
        Output file determined by hashing the key of the (K,V) pair
        Each mapper creates an output buffer/file per reducer
        Leads to M*R output buffers/files per shuffle
      `sort`: >= 10,000 reducers
        Default since Spark 1.2
        Won the Daytona GraySort Challenge w/ 250,000 reducers!!
      `tungsten-sort` (Future Meetup!)
    }

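Selecting a manager is a one-line setting (sketch; `sort` is already the default from 1.2 onward):

```scala
import org.apache.spark.SparkConf

// Sketch: "hash", "sort", or (in 1.5+) "tungsten-sort".
val conf = new SparkConf().set("spark.shuffle.manager", "sort")
```
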
29. Shuffle Managers

30. Hash Shuffle Manager
    M*R open files per shuffle; M = num mappers, R = num reducers
    Mapper opens 1 file per partition/reducer
    [diagram: mappers and reducers reading/writing HDFS (2x replication)]

31. Sort Shuffle Manager
    Hold Tight!

32. Tungsten-Sort Shuffle Manager
    Future Meetup!!

33. Shuffle Performance Tuning
    Hash Shuffle Manager (no longer the default)
      `spark.shuffle.consolidateFiles`: consolidate mapper output files
      `o.a.s.shuffle.FileShuffleBlockResolver`
    Intermediate Files
      Increase `spark.shuffle.file.buffer`: fewer seeks & sys calls
      Increase `spark.reducer.maxSizeInFlight` if memory allows
      Use a smaller number of larger workers to reduce the total file count
    SQL: BroadcastHashJoin vs. ShuffledHashJoin
      `spark.sql.autoBroadcastJoinThreshold`

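A hedged sketch gathering the tuning knobs above (values are illustrative starting points, not the challenge settings; the broadcast threshold is in bytes):

```scala
import org.apache.spark.SparkConf

// Sketch: fewer, larger I/O operations, and broadcast joins for small tables.
val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")   // hash manager only
  .set("spark.shuffle.file.buffer", "64k")         // default 32k: fewer seeks/syscalls
  .set("spark.reducer.maxSizeInFlight", "96m")     // default 48m: bigger fetch batches
  .set("spark.sql.autoBroadcastJoinThreshold", "67108864") // 64 MB: prefer BroadcastHashJoin
```
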
34. Shuffle Configuration
    Documentation: spark.apache.org/docs/latest/configuration.html#shuffle-behavior
    Property prefix: spark.shuffle

35. Winning Optimizations
    Deployed across Spark 1.1 and 1.2

36. Daytona GraySort Challenge: Winning Optimizations
    CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment
    Optimized Sort Algorithm: Elements of (K, V) Pairs
    Reduce Network Overhead: Async Netty, epoll
    Reduce OS Resource Utilization: Sort Shuffle

37. CPU-Cache Locality: (Key, Pointer-to-Record)
    AlphaSort paper, ~1995
      Chris Nyberg and Jim Gray
    Naïve: List(Pointer-to-Record)
      Requires the key to be dereferenced for every comparison
    AlphaSort: List((Key, Pointer-to-Record))
      Key is directly available for comparison

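A toy Scala sketch of the AlphaSort layout: extract each 10-byte key once into a compact (key, index) array so comparisons stay cache-resident, then dereference the records a single time in output order (record loading is left hypothetical):

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1

// Sketch: sort (key, index) pairs, not record pointers.
def alphaSort(records: Array[Array[Byte]]): Array[Array[Byte]] = {
  // One pass of dereferences to build the compact sort array.
  val keyed: Array[(String, Int)] = records.zipWithIndex.map {
    case (rec, i) => (new String(rec, 0, 10, ISO_8859_1), i)  // 10-byte key copy
  }
  // Comparisons now scan a contiguous array of small pairs.
  val inOrder = keyed.sortBy(_._1)
  // One sequential pass to emit records in sorted order.
  inOrder.map { case (_, i) => records(i) }
}
```
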
38. CPU-Cache Locality: Cache Alignment
    Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
      *4 bytes when using compressed OOPs (< 32 GB heap)
      14 is not a power of two, so entries straddle cache lines
    Cache Alignment Options
      ① Add Padding (2 bytes)
        Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
      ② (Key-Prefix, Pointer-to-Record): perf affected by key distribution
        Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes

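A sketch of option ②: sort on a 4-byte prefix and touch the full key only on prefix ties (the `keyOf` accessor is hypothetical; packing big-endian makes unsigned int order match byte order):

```scala
// Sketch: 8-byte sort entries (4-byte key prefix + 4-byte index).
final case class Entry(prefix: Int, index: Int)

// First 4 key bytes packed big-endian.
def prefixOf(key: Array[Byte]): Int =
  ((key(0) & 0xff) << 24) | ((key(1) & 0xff) << 16) |
  ((key(2) & 0xff) << 8)  |  (key(3) & 0xff)

// Compare prefixes first; fall back to the full keys only on a tie
// (rare when keys are well distributed, costly when they are skewed).
def compare(a: Entry, b: Entry, keyOf: Int => Array[Byte]): Int = {
  val c = java.lang.Integer.compareUnsigned(a.prefix, b.prefix)
  if (c != 0) c
  else keyOf(a.index).zip(keyOf(b.index))
    .map { case (x, y) => (x & 0xff) - (y & 0xff) }
    .find(_ != 0).getOrElse(0)
}
```
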
39. CPU-Cache Locality: Performance Comparison
    [chart]

40. Optimized Sort Algorithm: Elements of (K, V) Pairs
    `o.a.s.util.collection.TimSort`
      Based on JDK 1.7's TimSort
      Performs best on partially-sorted datasets
      Optimized for elements of (K,V) pairs
      Sorts implementations of SortDataFormat (e.g. KVArraySortDataFormat)
    `o.a.s.util.collection.AppendOnlyMap`
      Open-addressing hash with quadratic probing
      Array of [(key0, value0), (key1, value1), ...] for good memory locality
      Keys are never removed; values are only appended

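A grossly simplified sketch of the `AppendOnlyMap` idea: one flat array with keys and values interleaved, open addressing with a quadratic-style probe, and no removal (unlike the real class there is no growth, so the sketch must stay under-filled):

```scala
// Sketch: interleaved (k0, v0, k1, v1, ...) layout for memory locality.
class TinyAppendOnlyMap[K, V](capacity: Int = 64) { // capacity: a power of 2
  private val data = new Array[Any](2 * capacity)
  private val mask = capacity - 1

  // Insert or update in place; this is what combiners do during a shuffle.
  def changeValue(key: K, update: Option[V] => V): Unit = {
    var pos = key.hashCode() & mask
    var delta = 1
    while (true) {
      val k = data(2 * pos)
      if (k == null) {                 // empty slot: append
        data(2 * pos) = key
        data(2 * pos + 1) = update(None)
        return
      } else if (k == key) {           // hit: update the value, key stays put
        data(2 * pos + 1) = update(Some(data(2 * pos + 1).asInstanceOf[V]))
        return
      }
      pos = (pos + delta) & mask       // triangular (quadratic-style) probing
      delta += 1
    }
  }
}
```

A `reduceByKey`-style combine would call `changeValue(k, { case None => v; case Some(old) => merge(old, v) })` for each incoming pair.
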
41. Reduce Network Overhead: Async Netty, epoll
    New Network Module based on Async Netty
      Replaces the old low-level, socket-based java.nio code
      Zero-copy epoll stays in kernel space between disk & network
      Custom memory management reduces GC pauses
      `spark.shuffle.blockTransferService`=netty
    Spark-Netty Performance Tuning
      `spark.shuffle.io.numConnectionsPerPeer`: increase to saturate hosts with multiple disks
      `spark.shuffle.io.preferDirectBufs`: on- or off-heap (off-heap is the default)
    Apache Spark Jira: SPARK-2468

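Sketch of the Netty-side settings (Spark 1.2+ property names; the connection count is an illustrative value):

```scala
import org.apache.spark.SparkConf

// Sketch: Netty transport with more parallel streams per host pair.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty") // default since 1.2
  .set("spark.shuffle.io.numConnectionsPerPeer", "8") // default 1: raise for many-disk hosts
  .set("spark.shuffle.io.preferDirectBufs", "true")   // off-heap buffers, fewer GC pauses
```
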
42. Reduce OS Resource Utilization: Sort Shuffle
    M open files per shuffle; M = num of mappers
    `spark.shuffle.sort.bypassMergeThreshold`
    Mapper merge-sorts its partitions into 1 master file, indexed by partition range offsets
      TimSort in RAM, Merge Sort on disk
    Reducers seek to their range offset in the master file on the mapper and scan
    SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
    [diagram: mapper master file with partition offsets; HDFS (2x replication) on both ends]

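One knob worth noting from the slide above (sketch; 200 is the documented default):

```scala
import org.apache.spark.SparkConf

// Sketch: below this many reducers (and with no map-side combine), the sort
// shuffle bypasses merge-sorting and writes per-reducer files hash-style.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "200")
```
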
43. Bonus!

44. External Shuffle Service: Separate JVM Process
    Takes over serving shuffle files when a Spark Executor is in GC or dies
    Uses the new Netty-based Network Module
    Required for YARN dynamic allocation
      The Node Manager serves the files
    Apache Spark Jira: SPARK-3796

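Enabling it is two settings (sketch; both properties exist in this form):

```scala
import org.apache.spark.SparkConf

// Sketch: let the YARN NodeManager serve shuffle files so executors
// can be reclaimed by dynamic allocation without losing map output.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
```
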
45. Next Steps
    Project Tungsten

46. Project Tungsten: CPU and Memory Optimizations
    The Daytona GraySort optimizations targeted Disk and Network; Tungsten targets CPU and Memory
    Custom Memory Management
      Eliminates JVM object and GC overhead
    More Cache-aware Data Structures and Algorithms
      `o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashMap
    Code Generation (default in 1.5)
      Generates bytecode from the overall query plan

47. Thank you!
    Special thanks to Big Commerce!!
    IBM Spark Tech Center is Hiring!
    Nice people only, please!! 
    Sign up for our newsletter at
    To Be Continued…

48. Relevant Links
    http://sortbenchmark.org/ApacheSpark2014.pdf
    https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
    https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
    http://0x0fff.com/spark-architecture-shuffle/
    http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

49. Power of data. Simplicity of design. Speed of innovation.
    IBM Spark
