Skew Mitigation For Facebook's Petabyte-Scale Joins

Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks may take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, data skew pushes runtime latencies further, to the order of multiple hours and even days. Over the course of the last year, we introduced several state-of-the-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40%, and expanded Spark adoption for numerous latency-sensitive pipelines. In this talk, we’ll take a deep dive into Spark’s execution engine and share how we’re gradually solving the data skew problem at scale. To this end, we’ll discuss several Catalyst optimizations around implementing a hybrid skew join in Spark (that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco: Facebook’s petabyte-scale shuffle service (https://maxmind-databricks.pantheonsite.io/session/cosco-an-efficient-facebook-scale-shuffle-service).

  1. Skew Mitigation For Facebook’s Petabyte-Scale Joins, by Suganthi Dewakar & Guanzhong (Tony) Xu
  2. Agenda ▪ Skew Join Journey: Skew Hint, Runtime Skew Mitigation, Customized AQE Skew Mitigation ▪ Cosco + Skew Join: Shuffle Recap, Working with Original Cosco, Splitting in File Boundaries
  3. What Is Data Skew ▪ Uneven distribution of partitioned data ▪ Affects aggregate operations, e.g. Group By and Join (e.g. counting the number of people grouped by country) ▪ Data skew at FB ▪ PB-sized table joins with TB-sized skewed partitions ▪ Task latency ▪ Non-skewed partition: minutes to a few hours ▪ Skewed partition: hours to days ▪ Skewed pipelines ▪ Latency-sensitive daily jobs ▪ Complex DAGs with several upstream/downstream dependencies ▪ Can delay hundreds of downstream pipelines [Chart: shuffle partition sizes by percentile, from 0.01 GB up to a 1 TB skewed partition]
  4. Data Skew In Join ▪ Join strategies in Spark ▪ SortMergeJoin / ShuffleHashJoin ▪ Not skew resistant: requires input data to be partitioned ▪ BroadcastHashJoin ▪ Skew resistant: requires no data partitioning ▪ Requires one side of the join to be small ▪ Skew Join ▪ Hybrid join that combines the above strategies
  5. Mitigating Data Skew In Joins ▪ 1: Skew Hint (optimizer rule-based solution) ▪ 2: Runtime skew mitigation (built on the Spark 2.0 adaptive framework) ▪ 3: Customized AQE skew mitigation (based on the Spark 3.0 AQE framework)
  6. Mitigating Data Skew In Joins: Skew Hint (optimizer rule-based solution)
  7. Skew Hint ▪ User provides skew information as hints ▪ /*+ SKEWED_ON(column=value1,..) */ ▪ Split input into two parts ▪ Skewed keys ▪ Non-skewed keys ▪ Broadcast join skewed keys ▪ Shuffle/sort join non-skewed keys ▪ Union both join outputs
  8. Skew Hint [Plan before rewrite: Scan table_A → Exchange and Scan table_B → Exchange feed a SortMergeJoin]
  9. Skew Hint [Plan after rewrite: skewed data: Scan table_A → Filter [skewed keys] and Scan table_B → Filter [skewed keys] feed a BroadcastHashJoin; non-skewed data: Scan table_A → Filter [non-skewed keys] → Exchange and Scan table_B → Filter [non-skewed keys] → Exchange feed a SortMergeJoin; the two outputs are combined with a Union]
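The rewrite shown in the plans above can also be written out by hand with the DataFrame API. The sketch below is illustrative only: `table_A`, `table_B`, the join column `k`, and the skewed-key values are assumed names, and in the real system a Catalyst rule derives this plan from the /*+ SKEWED_ON(...) */ hint rather than from user code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, not}

val spark = SparkSession.builder.appName("skew-hint-sketch").getOrCreate()
val tableA = spark.table("table_A")
val tableB = spark.table("table_B")
val skewedKeys = Seq("key1", "key2") // values that would come from /*+ SKEWED_ON(k=key1,...) */

// Skewed keys: the matching slice of table_B is small, so broadcast-join it.
val skewedPart = tableA
  .filter(tableA("k").isin(skewedKeys: _*))
  .join(broadcast(tableB.filter(tableB("k").isin(skewedKeys: _*))), "k")

// Non-skewed keys: regular shuffle + sort-merge join.
val regularPart = tableA
  .filter(not(tableA("k").isin(skewedKeys: _*)))
  .join(tableB.filter(not(tableB("k").isin(skewedKeys: _*))), "k")

// Union of both halves gives the full join result. Note that each table is
// scanned twice (once per branch), which is the "double scan" con on the next slide.
val result = skewedPart.unionByName(regularPart)
```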
  10. Skew Hint ▪ Pros: reduces runtime latency ▪ Cons: requires prior knowledge of skewed keys; double scan of data
  11. Mitigating Data Skew In Joins: Runtime Skew Mitigation (built on the Spark 2.0 adaptive framework)
  12. Runtime Skew Mitigation (built on the Spark 2.0 adaptive framework) ▪ Spark 2.0 adaptive framework ▪ Adds an ExchangeCoordinator to all shuffle Exchange nodes ▪ The ExchangeCoordinator merges small partitions based on runtime stats ▪ Runtime skewed-partition detection ▪ Collect MapOutputStatistics of a join's input stages ▪ A partition is skewed iff its size is: ▪ > min_threshold_config_value (e.g. 1 GB) ▪ > median_ratio × median_size(shuffle) (e.g. 10 times the median size) ▪ > pct_99_ratio × pct99_size(shuffle) (e.g. 3 times the 99th-percentile size)
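A minimal sketch of the detection predicate described above, assuming per-partition shuffle sizes are already available from the join's map output statistics; the parameter names and default values are illustrative, not Facebook's actual configuration.

```scala
// A partition is considered skewed only if it clears an absolute size floor
// AND is large relative to both the median and the 99th-percentile partition.
def isSkewed(
    partitionBytes: Long,
    medianBytes: Long,
    pct99Bytes: Long,
    minThresholdBytes: Long = 1L << 30, // e.g. 1 GB absolute floor
    medianRatio: Double = 10.0,         // e.g. 10x the median partition size
    pct99Ratio: Double = 3.0            // e.g. 3x the 99th-percentile size
): Boolean =
  partitionBytes > minThresholdBytes &&
    partitionBytes > medianRatio * medianBytes &&
    partitionBytes > pct99Ratio * pct99Bytes
```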
  13. Runtime Skew Mitigation (built on the Spark 2.0 adaptive framework) ▪ Split large partitions into smaller sub-partitions [Diagram: for table_A join table_B, the skewed partition of shuffled table_A is split into sub-partitions]
  14. Runtime Skew Mitigation (built on the Spark 2.0 adaptive framework) ▪ Skew Hint: Pros: reduces runtime latency. Cons: requires prior knowledge of skewed keys; double scan of data ▪ Runtime Skew Mitigation (Spark 2.0 adaptive framework): Pros: reduces runtime latency; no prior skew-key knowledge required; no double scan of data. Cons: additional shuffles even when a join is not skewed (between joins, and between joins and aggregates)
  15. Mitigating Data Skew In Joins: Customized AQE Skew Mitigation (based on the Spark 3.0 AQE framework)
  16. AQE Skew Mitigation (Spark 3.0 adaptive framework) ▪ AQE recap ▪ New adaptive framework in OSS ▪ Features ▪ Switching join strategy ▪ Coalescing small shuffle partitions ▪ Optimizing skew join ▪ Runtime DAG changes
  17. AQE Skew Mitigation: Limitations ▪ Supports two-table joins only ▪ Single-stage skewed join + reducer operation ▪ Skewed join + aggregate ▪ Skewed multi-table joins ▪ Single-stage skewed join + mapper operation ▪ Skewed join + union ▪ Skewed join + broadcast join
  18. Customized AQE Skew Mitigation, Customization 1: Runtime Shuffle Insertion [Diagram: lifecycle of AQE execution: re-plan, EnsureRequirements (no new Exchange beyond this point), create sub-stages, execute, collect stats from the previous stage, optimize skew, then re-plan again]
  19. Customized AQE Skew Mitigation, Customization 1: Runtime Shuffle Insertion [Diagram: the same AQE lifecycle with an added skew-detection step: if skewed, the join's outputPartitioning is set to Unknown]
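A conceptual sketch of Customization 1, using simplified stand-in types rather than real Spark internals: once a skewed partition has been split, the join can no longer guarantee hash partitioning on its keys, so reporting Unknown output partitioning lets the framework insert a shuffle at runtime only where a downstream operator actually requires one.

```scala
// Simplified stand-ins, not Spark's Partitioning / SparkPlan classes.
sealed trait Partitioning
case class HashPartitioned(keys: Seq[String], numPartitions: Int) extends Partitioning
case object UnknownPartitioning extends Partitioning

case class JoinStage(joinKeys: Seq[String], numPartitions: Int, skewHandled: Boolean) {
  // After a skew split, rows of a skewed key are spread across several tasks,
  // which breaks the hash-partitioning guarantee the join would normally offer.
  def outputPartitioning: Partitioning =
    if (skewHandled) UnknownPartitioning
    else HashPartitioned(joinKeys, numPartitions)
}

// A downstream join or aggregate that needs hash partitioning on `requiredKeys`
// triggers a runtime shuffle insertion only when the guarantee is missing.
def needsRuntimeShuffle(child: JoinStage, requiredKeys: Seq[String]): Boolean =
  child.outputPartitioning match {
    case HashPartitioned(keys, _) if keys == requiredKeys => false
    case _                                                => true
  }
```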
  20. Customized AQE Skew Mitigation [Limitation of Customization 1: Runtime Shuffle Insertion] [Diagram: petabyte-sized table_A left outer join table_B, the result left outer join table_C, then left outer join table_D]
  21. Customized AQE Skew Mitigation [Limitation of Customization 1: Runtime Shuffle Insertion] [Diagram: the same multi-way join with Exchanges inserted between the joins, each shuffling PB-sized data]
  22. Customized AQE Skew Mitigation, Customization 2: No-Shuffle Multi-Table Join ▪ Split the skewed partition in one table ▪ Replicate the corresponding partitions in the rest of the input tables [Diagram: the skewed partition of shuffled table_A is split, and the corresponding partitions of shuffled table_B, table_C, and table_D are replicated]
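A rough sketch of the split-and-replicate idea with made-up types (Spark 3.x expresses something comparable with partial reducer partition specs, but the code below is not that API): the skewed side's partition is split into mapper ranges, and every other join input re-reads its full partition once per split, so no additional shuffle is needed.

```scala
// Which mappers' output of a given reduce partition a task should read.
case class ReadSpec(partitionId: Int, startMapIndex: Int, endMapIndex: Int)

// Split the skewed partition of the first input into `numSplits` mapper ranges
// and pair each split with a full read of the same partition on every other input.
def splitSkewedPartition(
    partitionId: Int,
    numMappersPerSide: Seq[Int], // index 0 is the skewed side
    numSplits: Int
): Seq[Seq[ReadSpec]] = {
  val skewedMappers = numMappersPerSide.head
  val step = math.max(1, math.ceil(skewedMappers.toDouble / numSplits).toInt)
  (0 until skewedMappers by step).map { start =>
    val end = math.min(start + step, skewedMappers)
    // One join task per split: a slice of the skewed side plus full copies of
    // the corresponding partition from every other input table.
    ReadSpec(partitionId, start, end) +:
      numMappersPerSide.tail.map(m => ReadSpec(partitionId, 0, m))
  }
}
```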
  23. Customized AQE Skew Mitigation, Customization 2: No-Shuffle Multi-Table Join ▪ Skew Hint: Pros: reduces runtime latency. Cons: requires prior knowledge of skewed keys; double scan of data ▪ Runtime Skew Mitigation (Spark 2.0 adaptive framework): Pros: reduces runtime latency; no prior skew-key knowledge required; no double scan of data. Cons: additional shuffles even when a join is not skewed (between joins, and between joins and aggregates) ▪ Customized AQE Skew Mitigation (Spark 3.0 adaptive framework): Pros: reduces runtime latency; no prior skew-key knowledge required; no double scan of data; no unnecessary shuffles when a join is not skewed
  24. Skew Join With Cosco
  25. The Problem ▪ Adaptive skew mitigation splits on mapper boundaries ▪ Each sub-reducer is only required to read data from a subset of mappers, i.e. only partial data is required ▪ E.g. skewed partition 0 is split so that sub-reducer 0 reads {mapper 1} and sub-reducer 1 reads {mapper 0, mapper 2} ▪ But mapper outputs are merged in files :( [Diagram: partition 0 is stored as files 0, 1, and 3, each read by multiple sub-reducers]
  26. Cosco Shuffle Recap ▪ Why Cosco? ▪ Shuffle as a service: IO efficiency ▪ Write-ahead buffer ▪ Groups shuffled output by partition id instead of mapper id ▪ Data deduplication is performed on the reducer side ▪ Efficiency wins: ▪ Solves the write-amplification problem: avoids spills (3X -> 1X) ▪ Solves the small-IO problem: average IO size 200KB -> 2.5MB ▪ Previous talk: Cosco: An Efficient Facebook-Scale Shuffle Service
  27. Cosco Shuffle Recap [Diagram: mappers send shuffle data to Cosco Shuffle Service instances, which hold per-partition write-ahead buffers (e.g. partition 0's file buffers) and flush files to DFS; reducers 0 and 1 read those files, tracked by the Cosco Metadata Service]
  28. Working with Mapper Boundaries ▪ Load all data but filter out the uninteresting records ▪ Each sub-reducer still has to load all data from all file chunks of the partition ▪ Records from uninteresting mappers are filtered out ▪ IO inefficient [Example: the sub-reducer for mappers {1, 4} reads file 0 and drops the records that came from mappers 2 and 3]
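A minimal sketch of this "read everything, drop the rest" strategy, with illustrative types: every file chunk of the partition is still streamed in full, and only records from the sub-reducer's assigned mappers survive the filter, which is why the approach ends up IO-bound.

```scala
case class ShuffleRecord(mapperId: Int, payload: Array[Byte])

// The iterator already covers ALL records of ALL file chunks of the partition;
// the sub-reducer merely drops what it was not assigned (e.g. keeps mappers {1, 4}).
def readForSubReducer(
    allRecordsInPartition: Iterator[ShuffleRecord],
    assignedMappers: Set[Int]
): Iterator[ShuffleRecord] =
  allRecordsInPartition.filter(r => assignedMappers.contains(r.mapperId))
```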
  29. Skewed Partition Worst Cases ▪ The most heavily skewed partitions can reach up to XXX TB and require XX k splits, which means the skewed partition's data has to be read XX k times as well ▪ Very inefficient: most of the time is spent on disk IO ▪ Compared to vanilla Spark shuffle, it no longer has any advantage in terms of IO
  30. Why can't we split on file boundaries?
  31. Terminology ▪ Record: unit of a data row ▪ Package: unit sent from a mapper to the shuffle service ▪ Chunk: unit to be flushed ▪ Multiple records form a package; multiple packages form a chunk
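A tiny model of this hierarchy, with illustrative field names, just to pin down the terms used on the following slides.

```scala
// Multiple records form a package, multiple packages form a chunk, and chunks
// are what the shuffle service flushes to files on the DFS.
case class Record(bytes: Array[Byte])
case class Package(mapperId: String, packageId: Long, records: Seq[Record])
case class Chunk(chunkId: String, packages: Seq[Package])
```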
  32. Restrictions ▪ Duplicated records ▪ A record can be duplicated across multiple files due to package resending, which can happen for various reasons ▪ Cosco relies on the reducer client to perform deduplication, which requires it to read a mapper's entire partition data ▪ Simply splitting on file boundaries would require the file sets of different sub-reducers to have no mapper overlap, which is not feasible [Example: sub-reducer 0 is assigned files 0 & 1, but it would also have to read files 2 & 3 because they have mapper overlap with files 0 & 1]
  33. Duplication ▪ Mapper failure ▪ Each mapper is assigned a unique id (a retried mapper gets a new id) ▪ Each package is tagged with its mapper's unique id ▪ The reducer consumes records only from the mapper ids it is interested in ▪ So a mapper whose output is not fully read does not cause duplicated records in the reducer [Example: mapper 0 fails with id `abc` and is retried with id `xyz`; sub-reducer 0 reads from `xyz`, so packages tagged `abc` are dropped and packages tagged `xyz` are accepted]
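A minimal sketch of that deduplication rule, assuming the reducer is told which mapper ids to consume (e.g. the retried attempt's `xyz` rather than the failed attempt's `abc`); records tagged with any other id are dropped during the read.

```scala
case class TaggedRecord(mapperUniqueId: String, payload: Array[Byte])

// Keep only records whose mapper id belongs to the set of accepted attempts.
def dedupeByMapperId(
    records: Iterator[TaggedRecord],
    acceptedMapperIds: Set[String]
): Iterator[TaggedRecord] =
  records.filter(r => acceptedMapperIds.contains(r.mapperUniqueId))
```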
  34. Duplication ▪ Network issues ▪ If a mapper fails to receive an ACK message before the timeout, it resends the package ▪ We call such packages "suspected packages": a suspected package is not necessarily a duplicated package, but a duplicated package is always a suspected package [Diagram: packages 1-3 sit in mapper 1's ACK queue; their records are flushed to chunk i via shuffle service 1, and after a resend to shuffle service 2 they are duplicated into chunk j]
  35. Identify suspected packages in the mapper
  36. Suspected Map ▪ Map<partition, Map<packageId, chunkId>> ▪ Each mapper keeps a map to track packages resent due to connection issues ▪ Whenever a resend happens, all the packages in the ACK queue are added to the map [Diagram: a resend adds Package-1, Package-2, and Package-3 to partition 1's suspected map with a null chunkId; Package-x already maps to Chunk-xyz]
  37. Suspected Map ▪ Map<partition, Map<packageId, chunkId>> ▪ The ACK message that eventually comes back contains a unique identifier of the chunk containing the package ▪ The mapper keeps the mapping from those suspected packages to their authorized chunks ▪ Once the mapper is done, it reports the suspected map to the Spark driver ▪ The Spark driver aggregates the suspected-package info from the different mappers [Diagram: the ACK for Package-1 returns its chunkId; Package-1 is removed from the ACK queue and its entry in the suspected map is updated to Chunk-abc]
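A minimal sketch of the mapper-side tracking described on these two slides; the class and method names are illustrative, but the structure follows the Map<partition, Map<packageId, chunkId>> shape from the slide.

```scala
import scala.collection.mutable

class SuspectedPackageTracker {
  // partition -> (packageId -> authorized chunkId, None until the ACK arrives)
  private val suspected =
    mutable.Map.empty[Int, mutable.Map[Long, Option[String]]]

  // On a resend, every package still waiting in the ACK queue becomes suspected.
  def markSuspected(partition: Int, pendingPackageIds: Seq[Long]): Unit = {
    val perPartition = suspected.getOrElseUpdate(partition, mutable.Map.empty)
    pendingPackageIds.foreach(id => perPartition.getOrElseUpdate(id, None))
  }

  // When the ACK finally arrives, record which chunk "owns" the package.
  def recordAck(partition: Int, packageId: Long, chunkId: String): Unit =
    suspected.get(partition).foreach { perPartition =>
      if (perPartition.contains(packageId)) perPartition(packageId) = Some(chunkId)
    }

  // Reported to the Spark driver when the mapper finishes, so the driver can
  // aggregate the info across mappers and ship it to the sub-reducers.
  def report(): Map[Int, Map[Long, Option[String]]] =
    suspected.map { case (p, m) => p -> m.toMap }.toMap
}
```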
  38. Reading a Subset of Files on the Sub-Reducer ▪ Each sub-reducer is assigned a subset of files along with the corresponding suspected map for the partition [Diagram: chunk set 1 of the skewed partition goes to sub-reducer 1, chunk set 2 to sub-reducer 2, and so on]
  39. Reading a Subset of Files on the Sub-Reducer ▪ A duplicated record is accepted only if it is read from its authorized chunk [Example: the record {mapper-3, package-4} is authorized to chunk-15, so the sub-reducer that reads it from chunk-15 accepts it, while the sub-reducer that reads the copy in chunk-4 drops it]
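A minimal sketch of the sub-reducer's accept/drop rule, assuming each record carries its (mapperId, packageId) tag plus the id of the chunk it was read from, and that the driver has shipped the aggregated suspected map along with the file list.

```scala
// (mapperId, packageId) -> authorized chunkId, as aggregated by the driver.
def acceptRecord(
    mapperId: String,
    packageId: Long,
    readFromChunkId: String,
    suspected: Map[(String, Long), String]
): Boolean =
  suspected.get((mapperId, packageId)) match {
    case Some(authorizedChunk) => authorizedChunk == readFromChunkId // keep exactly one copy
    case None                  => true // not suspected, so it cannot appear in more than one file
  }
```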
  40. Chunk Loss ▪ A chunk-loss failure requires restarting all corresponding sub-reducers: ▪ The Spark driver restarts mappers to regenerate the data of that particular partition ▪ Data in the new chunks is non-deterministic ▪ The Spark driver broadcasts the failure to all corresponding sub-reducer tasks and restarts them all ▪ This is rare since chunks are RS-encoded
  41. Putting It All Together ▪ Motivation: split a skewed partition into multiple non-overlapping file sets, each assigned to one sub-reducer ▪ Mapper: keeps track of suspected packages along with their authorized chunks; reports the suspected-package info to the Spark driver when it finishes ▪ Driver: aggregates the suspected-package info from all mappers and passes it along to sub-reducers if necessary; detects skewed partitions and splits them based on the file boundaries of each partition; fails and restarts all corresponding sub-reducer tasks of a partition if a chunk is lost ▪ Reducer: fetches the file list from the driver along with the suspected-package info; accepts a record of a suspected package only if it is read from the authorized chunk ▪ Data correctness: end-to-end checksum
  42. Summary ▪ Skew Join Journey: Skew Hint, Runtime Skew Mitigation, Customized AQE Skew Mitigation ▪ Cosco + Skew Join: Shuffle Recap, Working with Original Cosco, Splitting in File Boundaries
  43. Feedback ▪ Your feedback is important to us. Don't forget to rate and review the sessions.
