
Spark Summit EU talk by Qifan Pu

Boosting Spark Performance on Many-Core Machines



  1. Boosting Spark Performance on Many-Core Machines. Qifan Pu, Sameer Agarwal (Databricks), Reynold Xin (Databricks), Ion Stoica (UC Berkeley)
  2. Me: Ph.D. student in the AMPLab, advised by Prof. Ion Stoica. Spark-related research projects (e.g., how to run Spark in a geo-distributed fashion) and big data storage (e.g., Alluxio, memory management for multiple users). Intern at Databricks this past summer on the Spark SQL team: aggregates, shuffle.
  3. This project: Spark performance on many-core machines. Ongoing research, feedback is welcome. Focus of this talk: understand shuffle performance, then investigate and implement an in-memory shuffle. Moving beyond research: the hope is to get this into Spark (but no guarantee yet).
  4. Why do we care about many-core machines? Spark started as a cluster computing framework. [Diagram: an rdd.groupBy(...) job fanned out across the nodes of a cluster]
  5. Spark on a cluster: one of Spark's big successes has been high scalability. The largest known cluster is 8,000 nodes, and Spark won the 2014 Daytona GraySort benchmark. Typical cluster sizes in practice: "Our observation is that companies typically experiment with cluster size of under 100 nodes and expand to 200 or more nodes in the production stages." -- Susheel Kaushik (Senior Director at Pivotal)
  6. Increasingly powerful single-node machines: more cores are packed on a single chip (Intel Xeon Phi: 64-72 cores; Berkeley FireBox project: ~100 cores), and larger instances are available in the cloud (various 32-core instances on EC2, Azure & GCE; the EC2 X1 instance with 128 cores and 2 TB of memory, May 2016).
  7. Cost of many-core nodes: one x1.32xlarge instance is a small cluster in itself (with more memory and a fast inter-core network).
     Instance      Memory (GB)   vCPUs   Hourly Cost ($)   Cost/100GB ($/hr)   Cost/8vCPU ($/hr)
     x1.32xlarge   1952          128     13.338            0.68                0.83
     g2.2xlarge    15            8       0.65              4.33                0.65
     i2.2xlarge    61            8       1.705             2.80                1.70
     m3.2xlarge    30            8       0.532             1.78                0.53
     c3.2xlarge    15            8       0.42              2.80                0.42
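
     For reference, the two rightmost columns are simply the hourly price normalized per 100 GB of memory and per 8 vCPUs. A small Scala sketch that reproduces them from the prices listed above (2016-era on-demand rates; current EC2 pricing differs):

       // Recompute the normalized cost columns of the table above.
       case class Instance(name: String, memGb: Double, vcpus: Int, hourlyUsd: Double)

       val instances = Seq(
         Instance("x1.32xlarge", 1952, 128, 13.338),
         Instance("g2.2xlarge",    15,   8,  0.650),
         Instance("i2.2xlarge",    61,   8,  1.705),
         Instance("m3.2xlarge",    30,   8,  0.532),
         Instance("c3.2xlarge",    15,   8,  0.420))

       instances.foreach { i =>
         val costPer100Gb = i.hourlyUsd / i.memGb * 100   // $/hr per 100 GB of memory
         val costPer8Vcpu = i.hourlyUsd / i.vcpus * 8     // $/hr per 8 vCPUs
         println(f"${i.name}%-12s ${costPer100Gb}%6.2f ${costPer8Vcpu}%6.2f")
       }
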
  8. Spark's design assumed many nodes. Data communication (a.k.a. shuffle): intermediate data is stored on disk, and serialization/deserialization is needed across nodes; now there is plenty of memory to spare for an intra-node shuffle (the focus of this talk). Resource management: designed to handle a moderate amount of resources on each node; now, is one executor for 100 cores + 2 TB of memory the right model? (ongoing work)
  9. Can we improve shuffle on a single many-core machine? (1) Memory is fast. (2) We can use memory for shuffle. (3) Therefore, shuffle will be fast. Will this "common sense" hold?
  10. Put in practice. Vanilla Spark writes to a file stream and saves all bytes to disk; Attempt 1's solution: … save the bytes to memory instead (on the heap). Benchmark: spark.range(16M).repartition(1).selectExpr("sum(id)").collect() [Bar chart: runtime in seconds, Vanilla Spark vs. Attempt 1 — essentially identical] Why zero improvement? (A rough reproduction of the benchmark is sketched below.)
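
     A rough way to reproduce this microbenchmark from a spark-shell session. Assumptions: `spark` is the shell's predefined SparkSession, and the slide's 16M is read as 16 * 1024 * 1024 rows:

       // repartition(1) forces a single-partition shuffle, so the job is dominated
       // by shuffle write/read rather than by computing the sum.
       val n = 16L * 1024 * 1024
       val t0 = System.nanoTime()
       spark.range(n).repartition(1).selectExpr("sum(id)").collect()
       println(f"runtime: ${(System.nanoTime() - t0) / 1e9}%.2f s")
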
  11. Why zero improvement? [Profile: flushBuffer() accounts for 4.81% of time over the entire duration of the job (100%)]
  12. Why zero improvement? (1) I/O throughput is not the bottleneck (in this job). (2) Buffer cache: memory is already being exploited by the disk-based shuffle.
  13. Understanding shuffle performance, map side. [Diagram: mappers run a generated iterator (OP1, OP2, OP3) over the input queue and flush output to file streams]
  14. Understanding shuffle performance, reduce side. [Diagram: reducers run a generated iterator (OP1, OP2, OP3) over the shuffled data and produce results or feed another shuffle]
  15. More complications: sort- vs. hash-based shuffle, spilling when memory runs out, clearing data after the shuffle, … (some of the relevant configuration knobs are sketched below).
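
     For context, a few of the Spark 2.x settings that sit behind these complications. This is only an illustrative selection, and the values shown are the usual defaults of that era; check them against your Spark version:

       import org.apache.spark.SparkConf

       val conf = new SparkConf()
         .set("spark.shuffle.sort.bypassMergeThreshold", "200") // with fewer reducers than this, skip map-side sorting
         .set("spark.shuffle.compress", "true")                 // compress map outputs
         .set("spark.shuffle.spill.compress", "true")           // compress data spilled during shuffle
         .set("spark.memory.fraction", "0.6")                   // share of the heap for execution + storage
         .set("spark.local.dir", "/tmp/spark-local")            // where shuffle and spill files are written
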
  16. Case 1. [Diagram: thread 1 runs the generated iterator (OP1, OP2, OP3) over the input queue; thread 2 flushes to the file streams]
  17. Case 1: thread 1 (the generated iterator) is slower than thread 2 (flushing to file streams). [Same diagram as the previous slide]
  18. Case 2. [Diagram: generated iterator (OP1, OP2, OP3), input queue, and file stream flushes]
  19. Case 2: writing/reading the file streams is slow. [Same diagram as the previous slide]
  20. Case study 2. [Diagram: reducers run a generated iterator (OP1, OP2, OP3) and produce results or feed another shuffle]
  21. Case 3. [Diagram: generated iterator (OP1, OP2, OP3), input queue, and file stream flushes]
  22. Case 3: I/O contentions. Our first attempt should work in this case! [Same diagram as the previous slide]
  23. Can we improve case 2? The previous example is case 2. [Same diagram: generated iterator (OP1, OP2, OP3), input queue, file streams]
  24. Attempt 2: get rid of ser/deser, etc. Create NxM queues (N = mappers, M = reducers) and push each record into its corresponding queue. No serialization; no copy (the data structure is shared by both sides). [Diagram: mappers on one side, reducers on the other, connected by the queues]
  25. Attempt 2 (create NxM queues and push records into them), results. [Bar chart: runtime in seconds for Vanilla Spark, Attempt 1, and Attempt 2 — Attempt 2 is about 3x worse]
  26. Attempt 2, runtime breakdown. [Bar chart: runtime in seconds for Vanilla Spark, Attempt 1, and Attempt 2, split into processing time and GC time — Attempt 2's extra time is spent in GC]
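
     Attempt 2 was an experiment and is not Spark code. The toy sketch below only illustrates the layout described on these slides: one concurrent queue per (mapper, reducer) pair, records shared as on-heap objects with no serialization, which is exactly the per-record object churn that shows up as GC time in the chart above:

       import java.util.concurrent.ConcurrentLinkedQueue

       // Toy NxM in-memory shuffle: queues(m)(r) holds the records mapper m
       // produced for reducer r. Every record stays alive as its own heap object,
       // which is what drove up GC in Attempt 2.
       class InMemoryShuffle[T](numMappers: Int, numReducers: Int) {
         private val queues =
           Array.fill(numMappers, numReducers)(new ConcurrentLinkedQueue[T]())

         // Map side: route a record produced by `mapper` to its target partition.
         def write(mapper: Int, reducer: Int, record: T): Unit =
           queues(mapper)(reducer).add(record)

         // Reduce side: drain this reducer's queue from every mapper.
         def read(reducer: Int): Iterator[T] =
           (0 until numMappers).iterator.flatMap { m =>
             Iterator.continually(queues(m)(reducer).poll()).takeWhile(_ != null)
           }
       }
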
  27. Attempt 3: avoiding GC. Instead of queues, copy records onto memory pages; the number of objects drops from ~number of records to ~number of pages. Use NxM pages, or alternatively one page per reducer. [Diagram: record1, record2, record3, … packed back to back into pages]
  28. Spark SQL's UnsafeRow (a buffer-backed row format): row.pointTo(buffer) allows instantaneous creation of unsafe rows by pointing to different offsets in the page. [Diagram: record1, record2, record3, … laid out back to back in a page] (A toy sketch of this idea follows below.)
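
     The sketch below is not Spark's UnsafeRow implementation; it is a toy ByteBuffer version of the same idea: records are copied into one large page and a single reusable row view is re-pointed at successive offsets (the analogue of row.pointTo), so the garbage collector only ever sees the page object:

       import java.nio.ByteBuffer

       // A reusable "row" that is just an offset into a page plus field accessors.
       final class RowView(page: ByteBuffer) {
         private var offset = 0
         def pointTo(newOffset: Int): Unit = offset = newOffset       // analogous to UnsafeRow.pointTo
         def getLong(ordinal: Int): Long = page.getLong(offset + ordinal * 8)
       }

       val recordSize = 8                                   // one long per record in this toy
       val page = ByteBuffer.allocate(1024 * recordSize)    // the shared "memory page"
       (0 until 1024).foreach(i => page.putLong(i * recordSize, i.toLong))

       val row = new RowView(page)
       var sum = 0L
       var i = 0
       while (i < 1024) { row.pointTo(i * recordSize); sum += row.getLong(0); i += 1 }
       println(sum)  // 523776, computed without any per-record allocation
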
  29. Attempt 3 (copy records onto large memory pages), results. [Bar chart: runtime in seconds for Vanilla Spark, Attempt 1, Attempt 2, and Attempt 3] The improvement comes from avoiding ser/deser, copies, I/O contention, and an expensive code path.
  30. Consistent improvement with varying size: spark.range(N).repartition().selectExpr("sum(id)").collect() with N from 2^20 to 2^27. [Bar chart: nanoseconds/row for disk-based vs. in-memory shuffle at each N] (A sketch of the sweep follows below.)
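
     The sweep can be reproduced along these lines from a spark-shell session. Assumptions: `spark` is the predefined SparkSession; the slide calls repartition() with no argument, while a fixed partition count is used here so the sketch runs unchanged across versions:

       // Vary the input size from 2^20 to 2^27 rows and report nanoseconds per row.
       (20 to 27).foreach { exp =>
         val n = 1L << exp
         val t0 = System.nanoTime()
         spark.range(n).repartition(8).selectExpr("sum(id)").collect()
         val nsPerRow = (System.nanoTime() - t0).toDouble / n
         println(f"N = 2^$exp%d: $nsPerRow%.1f ns/row")
       }
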
  31. TPC-DS performance (single node). [Bar chart: query runtime in seconds, in-memory shuffle vs. Vanilla Spark] 27 of 33 queries improve, with a median improvement of 31%.
  32. Extending to multiple nodes. Implementation: all data goes to memory; for remote transfers, copy from memory to the network buffer. A more memory-preserving way: local transfers go to memory, remote transfers go to disk. Con 1: have to enforce stricter locality on reducers. Con 2: cannot avoid I/O contention. (A toy sketch of the hybrid placement rule follows below.)
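
     A toy sketch of the memory-preserving variant's placement rule, not the actual implementation: blocks destined for a reducer on the same host stay in memory, everything else falls back to the existing disk path. The hostname comparison stands in for whatever locality information the scheduler really exposes:

       sealed trait ShuffleTarget
       case object InMemory extends ShuffleTarget
       case object OnDisk   extends ShuffleTarget

       // Hybrid placement: intra-node transfers use the in-memory page path,
       // inter-node transfers keep using the serialized, disk-based path.
       def placeBlock(mapperHost: String, reducerHost: String): ShuffleTarget =
         if (mapperHost == reducerHost) InMemory else OnDisk

       placeBlock("node-1", "node-1")  // InMemory
       placeBlock("node-1", "node-2")  // OnDisk
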
  33. Simple shuffle job: spark.range(N).repartition().selectExpr("sum(id)").collect(). [Bar chart: ns/row for the map stage and the reduce stage, Vanilla Spark vs. in-memory shuffle, on 1-4 nodes] Map: consistent improvement. Reduce: the improvement decreases with more nodes.
  34. TPC-DS performance (x1.32xlarge). [Bar chart: query runtime in seconds for q13, q20, q18, q11, q3, in-memory shuffle vs. Vanilla Spark] Scale factor 100; the top 5 queries from the single-node experiment; best of 10 runs. Many other performance bottlenecks still need investigation!
  35. Summary. Spark on many-core machines requires many architectural changes. In-memory shuffle: how to improve shuffle performance with memory, with a median 31% improvement over vanilla Spark. Ongoing research: identify other performance bottlenecks.
  36. Thank you. Qifan Pu, qifan@cs.berkeley.edu
