Flash for Spark Shuffle with Cosco
Aaron Gabriel Feldman
Software Engineer at Facebook

Cosco is an efficient and reliable shuffle-as-a-service that powers Spark jobs at Facebook warehouse scale.

Flash for Apache Spark Shuffle with Cosco

  1. Flash for Spark Shuffle with Cosco. Aaron Gabriel Feldman, Software Engineer at Facebook
  2. Agenda
     1. Motivation
     2. Intro to shuffle architecture
     3. Flash
     4. Hybrid RAM + flash techniques
     5. Future improvements
     6. Testing techniques
  3. Feedback. Your feedback is important to us. Don’t forget to rate and review the sessions.
  4. Why should you care?
     ▪ IO efficiency
       ▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data
     ▪ Compute efficiency
       ▪ Flash supports more workload with less Cosco hardware
     ▪ Query latency is less of a focus
       ▪ Cosco helps shuffle-heavy queries, but query latency has not been our focus. We have been focused on batch workloads.
       ▪ Flash unlocks new possibilities to improve query latency, but that is future work
     ▪ Techniques for development and analysis
       ▪ Hopefully, some of these are applicable outside of Cosco
  5. Intro to Shuffle Architecture
  6. Spark Shuffle Recap. Map output files are written to local storage or a distributed filesystem. [Diagram: mappers (Map 0 through Map m), map output files (on disk/DFS) split by partition, reducers (Reduce 0 through Reduce r). Diagrams on slides 6 to 14 are adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019.]
  7. Spark Shuffle Recap. Reducers pull from map output files.
  8. Spark Shuffle Recap. Diagram adds: sort by key, one iterator per reducer.
  9. Spark Shuffle Recap. Write amplification problem: write amplification is ~3x.
  10. Spark Shuffle Recap. Write amplification is ~3x. And small IOs problem: M x R, avg IO size is ~200 KiB.
  11. Spark Shuffle Recap. Simplified drawing: reducers pull from map output files, sort by key, iterators.
  12-14. Spark Shuffle Recap. Build slides that reduce the simplified drawing to Map 1, Map m, Reduce 1, and Reduce r, with mappers, map output files (on disk/DFS), and reducers.
  15. Cosco Shuffle for Spark. Mappers stream their output to Cosco Shuffle Services, which buffer in memory. [Diagram: mappers (Map 1 through Map m) stream output to Shuffle Services 1 through N (N = thousands), which buffer Partition 1 through Partition r in memory; reducers are Reduce 1 through Reduce r. Diagrams on slides 15 to 20 are adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019.]
  16-19. Cosco Shuffle for Spark. Sort and flush buffers to DFS when full: streaming output, in-memory buffering, sort (if required by query), flush. Across these build slides, Partition 1 advances from its file 0 buffer to its file 2 buffer and Partition r from its file 0 buffer to its file 1 buffer, as files accumulate on the distributed filesystem (HDFS/Warm Storage).
  20. Cosco Shuffle for Spark. Reducers do a streaming merge after the map stage completes. [Diagram adds one iterator per reducer, each doing a streaming merge over that partition's files on the distributed filesystem.]
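The streaming merge on slide 20 can be illustrated with a minimal sketch. It assumes each flushed file is already sorted by key, and it uses a hypothetical `(key, value)` record shape and in-memory lists in place of DFS files. This is not Cosco's reducer code.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # hypothetical (key, value) record shape


def streaming_merge(sorted_chunks: List[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge chunks that are each already sorted by key."""
    # heapq.merge holds only one pending record per chunk in memory,
    # so the reducer never has to load whole files at once.
    return heapq.merge(*sorted_chunks, key=lambda record: record[0])


# Three sorted "files" for one partition (stand-ins for File 0, File 1, File 2).
file_0 = [("a", 1), ("c", 3)]
file_1 = [("b", 2), ("d", 4)]
file_2 = [("a", 5), ("e", 6)]

merged = list(streaming_merge([file_0, file_1, file_2]))
print([key for key, _ in merged])
```

Fewer, larger files mean fewer inputs to this merge, which is the motivation for buffering bigger chunks.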
  21. Replace DRAM with Flash for Buffering
  22-26. Buffering Is Appending. [Diagram, built up over five slides: mappers (Map 1 through Map m) send packages that are appended to the Partition r buffer on the Shuffle Services (N = thousands).] Each package is a few 10s of KiB.
  27. Replace DRAM with Flash for Buffering. Simply buffer to flash instead of memory. [Same diagram, with the Partition r buffer on flash.] Each package is a few 10s of KiB.
  28. Replace DRAM with Flash for Buffering
     ▪ Appending is a friendly pattern for flash
       ▪ Minimize flash write amplification -> minimizing wear on the drive
  29. Replace DRAM with Flash for Buffering. Read back to main memory for sorting.
  30. Replace DRAM with Flash for Buffering
     ▪ Appending is a friendly pattern for flash
       ▪ Minimize flash write amplification -> minimizing wear on the drive
     ▪ Flash write/read latency is negligible
       ▪ Generally non-blocking
       ▪ Latency is much less than buffering time
  31. Example Rule of Thumb
     ▪ Hypothetical example numbers
       ▪ Assume 1 GB Flash can endure ~10 GB of writes per day for the lifetime of the device
       ▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB Flash with write throughput at the endurance limit
       ▪ Then, you would be indifferent between consuming 1 GB DRAM vs ~100 GB/day of Flash write throughput
     ▪ Notes
       ▪ These numbers were chosen entirely because they are round -> easier to illustrate math on slides
       ▪ DRAM consumes more power than Flash
     ▪ Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
  32. Basic Evaluation
     ▪ Example Cosco cluster
       ▪ 10 nodes
       ▪ Each node uses 100 GB DRAM for buffering
       ▪ And has additional DRAM for sorting, RPCs, etc.
       ▪ So, 1 TB DRAM for buffering in total
       ▪ Again, numbers are chosen for illustration only
     ▪ Apply the example rule of thumb
       ▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day flash endurance
       ▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with Flash
       ▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering
       ▪ Nodes keep some DRAM for sorting, RPCs, etc.
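The rule-of-thumb arithmetic on slides 31 and 32 can be written out as a short sketch. The constants are the slides' hypothetical round numbers, the function names are made up for illustration, and decimal units (1 TB = 1000 GB) are assumed.

```python
# Hypothetical round numbers from the slides, not measured values.
FLASH_WRITES_GB_PER_DAY_PER_GB = 10  # writes/day that 1 GB of flash can endure
FLASH_GB_PER_DRAM_GB = 10            # GB of flash you would trade for 1 GB of DRAM


def break_even_writes_gb_per_day(dram_gb: float) -> float:
    """Daily write volume at which flash is as costly as dram_gb of DRAM."""
    equivalent_flash_gb = dram_gb * FLASH_GB_PER_DRAM_GB
    return equivalent_flash_gb * FLASH_WRITES_GB_PER_DAY_PER_GB


def flash_is_efficient(dram_gb: float, shuffled_gb_per_day: float) -> bool:
    """True if replacing the buffering DRAM with flash is the cheaper option."""
    return shuffled_gb_per_day < break_even_writes_gb_per_day(dram_gb)


# Example cluster from slide 32: 10 nodes with 100 GB of buffering DRAM each.
nodes = 10
dram_per_node_gb = 100
cluster_dram_gb = nodes * dram_per_node_gb                   # 1,000 GB = 1 TB
break_even = break_even_writes_gb_per_day(cluster_dram_gb)   # GB/day

print(f"break-even: {break_even / 1000:.0f} TB/day")
print("flash efficient at 80 TB/day:", flash_is_efficient(cluster_dram_gb, 80_000))
```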
  33. Basic Evaluation. Summary for cluster shuffling 100 TB/day. [Diagram: before, each of the 10 Shuffle Services has CPU, DRAM for sorting, RPCs, etc., and 100 GB DRAM for buffering; after, each has CPU, DRAM for sorting, RPCs, etc., and 1 TB Flash for buffering.]
  34. Hybrid Techniques for Efficiency
  35. Two Hybrid Techniques. Two ways to use both DRAM and flash for buffering:
     1. Buffer in DRAM first, flush to flash only under memory pressure
     2. Buffer fastest-filling partitions in DRAM, send slowest-filling partitions to flash
  36. Hybrid Technique #1. Take advantage of variation in shuffle workload over time. [Chart: bytes buffered in Cosco Shuffle Service over time.]
  37. Hybrid Technique #1. Take advantage of variation in shuffle workload over time. [Chart: bytes buffered over time. Buffer only in DRAM: 1 TB. Buffer only in flash: 100 TB written/day.]
  38. Hybrid Technique #1. Take advantage of variation in shuffle workload over time. [Chart adds the hybrid option. Buffer only in DRAM: 1 TB. Buffer only in flash: 100 TB written/day. Hybrid, buffer in DRAM and flash: 250 GB DRAM, 25 TB written/day.]
  39. Hybrid Technique #1. Buffer in DRAM first, flush to flash only under memory pressure. [Chart: 250 GB DRAM, 25 TB written/day to flash.]
     ▪ Example: 25% RAM + 25% flash supports 100% throughput
     ▪ Spikier workload -> more win
     ▪ Safer to push the system to its limits
       ▪ Run out of memory -> immediate bad consequences
       ▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
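A rough way to see why a spiky workload favors the hybrid is to estimate how many bytes spill to flash for a given DRAM size. The sketch below is a toy model, not the talk's method. It assumes that in each time step, writes land on flash in proportion to the share of the buffered bytes that does not fit in DRAM. The workload numbers are invented for illustration.

```python
from typing import List, Tuple

# Each step is (buffered_gb, written_gb): bytes resident in buffers during the
# step and bytes newly written during the step. Both are hypothetical inputs.
Step = Tuple[float, float]


def flash_gb_written(steps: List[Step], dram_gb: float) -> float:
    """Estimate GB written to flash when DRAM is filled first."""
    total = 0.0
    for buffered_gb, written_gb in steps:
        overflow_gb = max(0.0, buffered_gb - dram_gb)
        if buffered_gb > 0:
            # Assumption: writes split in proportion to where buffers live.
            total += written_gb * overflow_gb / buffered_gb
    return total


# Invented spiky day: 20 quiet steps and 4 peak steps.
workload = [(200.0, 800.0)] * 20 + [(1000.0, 4000.0)] * 4

print(flash_gb_written(workload, dram_gb=1000.0))  # pure DRAM sized for the peak
print(flash_gb_written(workload, dram_gb=250.0))   # hybrid: DRAM sized below the peak
```

In this toy workload the DRAM sized for quiet periods absorbs every quiet step, and flash only takes writes during the four peak steps.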
  40. Hybrid Technique #1. Implementation requires balancing. Flash adds another dimension. How to adapt balancing logic?
     ▪ Pure-DRAM Cosco: Shuffle Service is out of DRAM -> Balancing Logic -> redirect to another shuffle service, flush to DFS, or backpressure mappers
     ▪ Buffer in DRAM first, flush to flash: Shuffle Service is out of DRAM -> ??? -> redirect to another shuffle service, flush to DFS, backpressure mappers, or flush to flash
  41. Hybrid Technique #1. Plug into pre-existing balancing logic.
     ▪ Pure-DRAM Cosco: Shuffle Service is out of DRAM -> Balancing Logic -> redirect to another shuffle service, flush to DFS, or backpressure mappers
     ▪ Buffer in DRAM first, flush to flash: Shuffle Service is out of DRAM -> Flash working set smaller than THRESHOLD?
       ▪ Yes -> flush to flash
       ▪ No -> same Balancing Logic -> redirect to another shuffle service, flush to DFS, or backpressure mappers
  42. Hybrid Technique #1. Plug into pre-existing balancing logic. [Same flowchart as slide 41.]
     ▪ THRESHOLD limits flash working set size
     ▪ Configure THRESHOLD to stay under flash endurance limits
     ▪ Then predict cluster performance as if working-set flash were DRAM
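The decision on slides 41 and 42 can be sketched in a few lines. The Yes/No branches are read from the flattened flowchart, so treat them as an interpretation. The names `Action` and `on_out_of_dram` are hypothetical and are not part of Cosco.

```python
from enum import Enum


class Action(Enum):
    FLUSH_TO_FLASH = "flush to flash"
    # Pre-existing options: redirect to another shuffle service,
    # flush to DFS, or backpressure mappers.
    PRE_EXISTING_BALANCING = "pre-existing balancing logic"


def on_out_of_dram(flash_working_set_bytes: int, threshold_bytes: int) -> Action:
    """Choose what a Shuffle Service does when it runs out of DRAM."""
    if flash_working_set_bytes < threshold_bytes:
        # Flash still has room under THRESHOLD, so spill there first.
        return Action.FLUSH_TO_FLASH
    # Otherwise behave exactly like pure-DRAM Cosco.
    return Action.PRE_EXISTING_BALANCING


GIB = 1024 ** 3
print(on_out_of_dram(flash_working_set_bytes=10 * GIB, threshold_bytes=100 * GIB))
print(on_out_of_dram(flash_working_set_bytes=100 * GIB, threshold_bytes=100 * GIB))
```

Because the flash check sits in front of the unchanged balancing logic, the flash working set bounded by THRESHOLD behaves like extra DRAM for capacity planning.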
  43. Hybrid Technique #1. Summary
     ▪ Take advantage of variation in total shuffle workload over time
     ▪ Buffer in DRAM first, flush to flash only under memory pressure
     ▪ Adapt balancing logic
  44. Hybrid Technique #2. Take advantage of variation in partition fill rate
     ▪ Some partitions fill more slowly than others
     ▪ Slower partitions wear out flash less quickly
     ▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
  45. Hybrid Technique #2. Take advantage of variation in partition fill rate: illustrated with numbers
     ▪ DRAM
       ▪ 1 TB
       ▪ Supports 100K streams each buffering up to 10 MB
     ▪ Flash
       ▪ 10 TB, 100 TB written/day
       ▪ 100K streams each writing 1 GB/day, which is 12 KB/second. (Sanity check: 5 min map stage -> 3.6 MB partition.)
       ▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash
       ▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
  46. Hybrid Technique #2. Buffer fastest-filling partitions in DRAM and slowest-filling partitions in flash
     ▪ Technique
       ▪ Periodically measure partition fill rate
       ▪ If fill rate is less than threshold KB/s, then buffer partition data in flash
       ▪ Else, buffer partition data in DRAM
     ▪ Evaluation
       ▪ Assume “break-even” threshold of 12 KB/s from previous slide
       ▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s
       ▪ Suppose these slow partitions write an average of 3 KB/s
       ▪ Then, you can replace half of your buffering DRAM with 25% as much flash
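The numbers on slides 45 and 46 can be reproduced with a small sketch. It assumes decimal units and the slides' hypothetical figures. The function names and the `"flash"`/`"dram"` labels are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def break_even_fill_rate_kb_per_s(
    flash_endurance_tb_per_day: float, dram_tb: float, buffer_mb_per_stream: float
) -> float:
    """Per-stream fill rate at which flash and DRAM support the same stream count."""
    streams = dram_tb * 1e12 / (buffer_mb_per_stream * 1e6)
    bytes_per_stream_per_day = flash_endurance_tb_per_day * 1e12 / streams
    return bytes_per_stream_per_day / SECONDS_PER_DAY / 1e3


def choose_medium(fill_rate_kb_per_s: float, threshold_kb_per_s: float) -> str:
    """Slow-filling partitions go to flash, fast-filling ones stay in DRAM."""
    return "flash" if fill_rate_kb_per_s < threshold_kb_per_s else "dram"


def flash_fraction_needed(slow_avg_kb_per_s: float, threshold_kb_per_s: float) -> float:
    """Flash needed for the slow partitions, relative to the break-even amount."""
    return slow_avg_kb_per_s / threshold_kb_per_s


# Slide 45: 1 TB DRAM, 10 MB per stream, flash enduring 100 TB written/day.
threshold = break_even_fill_rate_kb_per_s(100, 1, 10)
print(f"break-even fill rate: {threshold:.1f} KB/s")  # the slide rounds this to 12

print(choose_medium(6, 12), choose_medium(24, 12))
# Slide 46: slow partitions average 3 KB/s against a 12 KB/s threshold.
print(flash_fraction_needed(3, 12))
```

The last value matches the slide's conclusion: the half of the buffering DRAM spent on slow partitions can be replaced with a quarter of the flash that the break-even rate would require.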
  47. Hybrid Technique #2. Real-world partition fill rates. [Two charts: fill rate vs percentile of partitions (1st to 99th), on a linear scale and on a log scale, ranging from 0 KiB/sec to MiB’s/sec.]
  48. Hybrid Technique #2. Real-world partition fill rates. [Same two charts, each with two series: percentile of partitions, and percentile of partitions weighted by buffering time.]
  49. Combine both hybrid techniques. Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure
     ▪ Evaluation
       ▪ Difficult theoretical estimation
       ▪ Or, do a discrete-event simulation -> later in this presentation
  50. Future Improvements
  51. Lower-Latency Queries. Made possible by flash
     ▪ Serve shuffle data directly from flash for some jobs
       ▪ This is “free” until the flash drive gets so full that the write amplification factor increases (~80% full)
       ▪ Prioritize interactive/low-latency queries to serve from flash
     ▪ Buffer bigger chunks to decrease reducer merging
       ▪ Fewer chunks -> less chance that a reducer needs to do an external merge
  52. Further Efficiency Wins. Made possible by flash
     ▪ Decrease Cosco replication factor since flash is non-volatile
       ▪ Currently Cosco replication is R2: each map output byte is stored on two shuffle services until it is flushed to durable DFS
       ▪ Most Shuffle Service crashes in production are resolved in a few minutes with a process restart
       ▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after restart
     ▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
  53. Practical Evaluation Techniques
  54. Practical Evaluation Techniques
     ▪ Discrete event simulation
     ▪ Synthetic load generation on a test cluster
     ▪ Shadow testing on a test cluster
     ▪ Special canary in a production cluster
  55. Discrete Event Simulation (https://en.wikipedia.org/wiki/Discrete-event_simulation, 2020-05-18)
  56-64. Discrete Event Simulation. Example with a Shuffle Service Model (Partition 3, Partition 42) and a DFS Model. Each slide advances the simulation by one discrete event:
     ▪ Slide 56: time 00h:01m:30.000s, total KB written to flash 9,000, overall avg file size written to DFS NaN
     ▪ Slide 57: time 00h:01m:30.250s, total KB written to flash 9,050, overall avg file size written to DFS NaN
     ▪ Slide 58: time 00h:01m:30.500s, total KB written to flash 9,100, overall avg file size written to DFS NaN
     ▪ Slide 59: time 00h:01m:30.750s, total KB written to flash 9,150, overall avg file size written to DFS NaN
     ▪ Slide 60: time 00h:01m:31.000s, total KB written to flash 9,200, overall avg file size written to DFS NaN
     ▪ Slide 61: time 00h:01m:31.500s, total KB written to flash 9,250, overall avg file size written to DFS NaN
     ▪ Slide 62: time 00h:01m:32.000s, total KB written to flash 9,300, overall avg file size written to DFS NaN
     ▪ Slide 63: sort & flush to File 0 in the DFS Model; time 00h:01m:32.000s, total KB written to flash 9,300, overall avg file size written to DFS changes from NaN to 9,200
     ▪ Slide 64: time 00h:01m:32.500s, total KB written to flash 9,350, overall avg file size written to DFS 9,200
  65. Discrete Event Simulation. Drive simulation based on production data: the cosco_chunks dataset. Chunk Fill Rate is derived from size and buffering time.

     | Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time        | Chunk Size | Chunk Buffering Time | Chunk Fill Rate |
     | 3         | 10                 | 5                       | 2020-05-19 00:00:00.000 | 10 MiB     | 5000ms               | 2 MiB/s         |
     | 42        | 10                 | 2                       | 2020-05-19 00:01:00.000 | 31 MiB     | 10000ms              | 3.1 MiB/s       |
     | …         | …                  |                         |                         |            |                      |                 |
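A discrete-event simulation of this kind can be sketched with a priority queue of timestamped events. The code below is a toy model under stated assumptions, not the simulator from the talk. Each chunk record is replayed as evenly spaced 50 KB appends to flash, and a partition is flushed to the DFS model once its buffer reaches a hypothetical `flush_threshold_kb`. The class and method names are made up for illustration.

```python
import heapq
import math
from dataclasses import dataclass, field


def fill_rate_mib_per_s(size_mib: float, buffering_ms: float) -> float:
    """Derive chunk fill rate from chunk size and buffering time."""
    return size_mib / (buffering_ms / 1000)


@dataclass
class Simulation:
    flush_threshold_kb: int
    events: list = field(default_factory=list)   # heap of (time_ms, seq, partition, kb)
    buffers: dict = field(default_factory=dict)  # partition -> KB buffered on flash
    flash_kb_written: int = 0
    dfs_file_sizes_kb: list = field(default_factory=list)
    _seq: int = 0  # tie-breaker so events at the same time pop in insertion order

    def schedule_chunk(self, partition, start_ms, size_kb, buffering_ms, package_kb=50):
        """Replay one chunk record as evenly spaced append events."""
        packages = max(1, size_kb // package_kb)
        for i in range(packages):
            time_ms = start_ms + buffering_ms * (i + 1) // packages
            heapq.heappush(self.events, (time_ms, self._seq, partition, package_kb))
            self._seq += 1

    def run(self):
        """Process events in time order, updating the flash and DFS counters."""
        while self.events:
            _time_ms, _, partition, kb = heapq.heappop(self.events)
            self.flash_kb_written += kb
            self.buffers[partition] = self.buffers.get(partition, 0) + kb
            if self.buffers[partition] >= self.flush_threshold_kb:
                # Sort & flush: the buffer becomes one file in the DFS model.
                self.dfs_file_sizes_kb.append(self.buffers.pop(partition))

    def avg_dfs_file_kb(self) -> float:
        """Overall average file size written to DFS, NaN before the first flush."""
        if not self.dfs_file_sizes_kb:
            return math.nan
        return sum(self.dfs_file_sizes_kb) / len(self.dfs_file_sizes_kb)


sim = Simulation(flush_threshold_kb=200)
sim.schedule_chunk(partition=3, start_ms=0, size_kb=200, buffering_ms=5000)
sim.schedule_chunk(partition=42, start_ms=1000, size_kb=100, buffering_ms=10000)
sim.run()
print(sim.flash_kb_written, sim.avg_dfs_file_kb(), sim.buffers)
print(fill_rate_mib_per_s(10, 5000), fill_rate_mib_per_s(31, 10000))
```

The counters mirror the ones shown on the slides: total KB written to flash grows with every append event, and the average DFS file size stays NaN until the first flush.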
  66. Canary on a Production Cluster
     ▪ Many important metrics are observed on mappers
       ▪ Example: “percentage of task time spent shuffling”
       ▪ Example: “map task success rate”
     ▪ Problem: Mappers talk to many Shuffle Services
       ▪ Simultaneously
       ▪ Dynamic balancing can re-route to different Shuffle Services
     ▪ Solution: Subclusters
       ▪ Pre-existing feature for large clusters
       ▪ Each Shuffle Service belongs to one subcluster
       ▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster
       ▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t
  67. Special Thanks
     ▪ Chen Yang, Software Engineer at Facebook
     ▪ Sergey Makagonov, Software Engineer at Facebook
  68. Previous Shuffle presentations from Facebook
     ▪ SOS: Optimizing Shuffle IO, Spark Summit 2018
     ▪ Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  69. Contact: cosco@fb.com mailing list