Skew Mitigation For Facebook’s
Petabyte-Scale Joins
Suganthi Dewakar & Guanzhong (Tony) Xu
Agenda
Skew Join Journey
▪ Skew Hint
▪ Runtime Skew Mitigation
▪ Customized AQE Skew Mitigation
Cosco + Skew Join
▪ Shuffle Recap
▪ Working with Original Cosco
▪ Splitting in File Boundaries
What Is Data Skew
▪ Uneven distribution of partitioned data
▪ Affects aggregate operations, e.g. Group By and Join
▪ E.g. counting the number of people grouped by country
▪ Data skew at FB
▪ PB-sized table joins with TB-sized skewed partitions
▪ Task latency
▪ Non-skewed partition: minutes to a few hours
▪ Skewed partition: hours to days
▪ Skewed pipelines
▪ Latency-sensitive daily jobs
▪ Complex DAGs with several upstream/downstream dependencies
▪ Can cause delays in hundreds of downstream pipelines
[Chart: shuffle partition size by percentile; partitions 1, 2, and 4 are around 0.01 GB, while partition 3 (skewed) is around 1 TB]
Data Skew In Join
▪ Join strategies in Spark
▪ SortMergeJoin/ ShuffleHashJoin
▪ Not skew resistant – requires input data to be partitioned
▪ BroadcastHashJoin
▪ Skew resistant – requires no data partitioning
▪ Requires one side of the join to be small
▪ Skew Join
▪ Hybrid join that combines above strategies
Mitigating Data Skew In Joins
1. Skew Hint: optimizer rule-based solution
2. Runtime skew mitigation: built on the Spark 2.0 adaptive framework
3. Customized AQE skew mitigation: based on the Spark 3.0 AQE framework
Mitigating Data Skew In Joins: 1. Skew Hint (optimizer rule-based solution)
Skew Hint
▪ User provides skew information as hints
▪ /*+ SKEWED_ON(column=value1,..) */
▪ Split input into two parts
▪ Skewed keys
▪ Non-skewed keys
▪ Broadcast join skewed keys
▪ Shuffle/Sort join non-skewed keys
▪ Union both join outputs
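As a hedged illustration (not Facebook's implementation), the rewrite the hint produces can be expressed by hand in Spark/Scala; the tables events and users, the join column user_id, and the skewed key value are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("skew-hint-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs; user_id = 0 is the known skewed key,
    // i.e. what /*+ SKEWED_ON(user_id=0) */ would declare.
    val events = spark.table("events")   // large table, skewed on user_id
    val users  = spark.table("users")
    val skewedKeys = Seq(0L)

    // Broadcast-join only the skewed keys ...
    val skewedPart = events.filter($"user_id".isin(skewedKeys: _*))
      .join(broadcast(users.filter($"user_id".isin(skewedKeys: _*))), "user_id")

    // ... sort-merge join everything else ...
    val nonSkewedPart = events.filter(!$"user_id".isin(skewedKeys: _*))
      .join(users.filter(!$"user_id".isin(skewedKeys: _*)), "user_id")

    // ... and union the two join outputs.
    val result = skewedPart.unionByName(nonSkewedPart)

The extra filters over both inputs are what cause the double scan of data noted in the cons below.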
Skew Hint
Plan without the hint:
    SortMergeJoin
      Exchange <- Scan table_A
      Exchange <- Scan table_B
Plan with the hint:
    Union
      BroadcastHashJoin                (skewed data)
        Filter [skewed keys] <- Scan table_A
        Filter [skewed keys] <- Scan table_B
      SortMergeJoin                    (non-skewed data)
        Exchange <- Filter [non-skewed keys] <- Scan table_A
        Exchange <- Filter [non-skewed keys] <- Scan table_B
Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data
Mitigating Data Skew In Joins: 2. Runtime skew mitigation (built on the Spark 2.0 adaptive framework)
Runtime skew mitigation
Built on the Spark 2.0 adaptive framework
▪ Spark 2.0 adaptive framework
▪ Adds an ExchangeCoordinator to all shuffle Exchange nodes
▪ ExchangeCoordinator merges small partitions based on runtime stats
▪ Runtime skewed partition detection
▪ Collect MapOutputStatistics of a join's input stages
▪ A partition is skewed iff its size is (see the sketch below):
▪ > min_threshold_config_value (e.g. 1 GB)
▪ > median_ratio × median_size(shuffle) (e.g. 10 times the median size)
▪ > pct_99_ratio × pct99_size(shuffle) (e.g. 3 times the 99th percentile size)
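A minimal sketch of that detection predicate in Scala; the threshold names mirror the config examples above, and the default values are illustrative:

    // A partition is flagged as skewed only if it exceeds all three thresholds.
    def isSkewed(
        partitionSize: Long,
        medianSize: Long,
        pct99Size: Long,
        minThreshold: Long = 1L << 30,   // min_threshold_config_value, e.g. 1 GB
        medianRatio: Double = 10.0,      // median_ratio, e.g. 10x the median size
        pct99Ratio: Double = 3.0         // pct_99_ratio, e.g. 3x the 99th percentile size
    ): Boolean = {
      partitionSize > minThreshold &&
        partitionSize > medianRatio * medianSize &&
        partitionSize > pct99Ratio * pct99Size
    }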
Runtime skew mitigation
Built on the Spark 2.0 adaptive framework
▪ Split large partitions into smaller sub-partitions
[Diagram: the skewed shuffle partition is split into sub-partitions before table_A join table_B]
Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data

Runtime Skew Mitigation (Spark 2.0 adaptive framework)
Pros
• Reduces runtime latency
• No prior skew-key knowledge required
• No double scan of data
Cons
• Additional shuffles even when a join is not skewed
  • Between joins
  • Between joins and aggregates
Mitigating Data Skew In Joins: 3. Customized AQE skew mitigation (based on the Spark 3.0 AQE framework)
AQE skew mitigation
▪ AQE recap
▪ New adaptive framework in OSS
▪ Features
▪ Switching join strategy
▪ Small shuffle partitions coalescing
▪ Optimizing skew join
▪ Runtime DAG change
Spark 3.0 adaptive framework
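In OSS Spark 3.0, the AQE skew-join handling is turned on through standard configs; a minimal sketch follows (the threshold values are illustrative, not Facebook's production settings):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("aqe-skew-join")
      // Enable AQE and its built-in skew-join optimization.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      // A partition is treated as skewed only if it exceeds both limits below.
      .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "10")
      .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "1g")
      .getOrCreate()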
AQE skew mitigation
Limitation
▪ Supports a plain two-table join only; not handled:
▪ Single stage with skewed join + reducer operation
▪ Skewed join + aggregate
▪ Skewed multi-table joins
▪ Single stage with skewed join + mapper operation
▪ Skewed join + union
▪ Skewed join + broadcast join
Customized AQE skew mitigation
Customization 1: Runtime Shuffle Insertion
[Diagram: lifecycle of AQE execution: Re-plan, EnsureRequirements, Optimize skew, Create sub-stages, Execute, with stats fed back from the previous stage; no new Exchange is added beyond EnsureRequirements]
▪ Customization: when skew is detected, set the outputPartitioning of the join to Unknown, so that re-planning can insert a new Exchange at runtime (see the sketch below).
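A hedged sketch of the idea, assuming a hypothetical SkewAwareJoin trait rather than Facebook's actual patch to the join operators:

    import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning}

    // Once a join's skewed partitions have been split, its output no longer satisfies
    // the original hash partitioning, so advertise UnknownPartitioning instead;
    // EnsureRequirements will then insert a fresh Exchange for any downstream
    // operator that still needs a specific distribution.
    trait SkewAwareJoin {
      def isSkewHandled: Boolean            // set when skew is detected and optimized
      def numPartitions: Int
      def originalPartitioning: Partitioning

      def outputPartitioning: Partitioning =
        if (isSkewHandled) UnknownPartitioning(numPartitions) else originalPartitioning
    }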
Customized AQE skew mitigation
[Limitation] Customization 1: Runtime Shuffle Insertion
[Diagram: a chain of left outer joins: table_A (petabyte) joined with table_B, then table_C, then table_D; inserting a new Exchange after each skew-handled join shuffles the PB-sized intermediate data again and again]
Customized AQE skew mitigation
Customization 2: No Shuffle Multi Table Join
▪ Split the skewed partition in one table
▪ Replicate the corresponding partitions in the rest of the input tables (see the sketch below)
[Diagram: the skewed shuffle partition of table_A is split into sub-partitions; the matching shuffle partitions of table_B, table_C, and table_D are replicated for each sub-partition]
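A minimal sketch of the split-and-replicate assignment in Scala; the names PartitionRead, splitSkewedJoin, and the mapper-range splitting are assumptions for illustration, not Facebook's code:

    // Which mapper outputs of which partition a sub-join task should read.
    case class PartitionRead(table: String, partitionId: Int, mapperRange: Range)

    def splitSkewedJoin(
        tables: Seq[String],      // e.g. Seq("table_A", "table_B", "table_C", "table_D")
        skewedTable: String,      // the input whose partition is skewed, e.g. "table_A"
        partitionId: Int,
        numMappers: Int,
        numSplits: Int): Seq[Seq[PartitionRead]] = {
      val step = math.max(1, math.ceil(numMappers.toDouble / numSplits).toInt)
      (0 until numMappers by step).map { start =>
        val slice = start until math.min(start + step, numMappers)
        tables.map { t =>
          // Skewed side: each sub-join reads only a slice of the mappers.
          // Every other side: replicate the full partition so no shuffle is needed.
          if (t == skewedTable) PartitionRead(t, partitionId, slice)
          else PartitionRead(t, partitionId, 0 until numMappers)
        }
      }
    }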
Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data

Runtime Skew Mitigation (Spark 2.0 adaptive framework)
Pros
• Reduces runtime latency
• No prior skew-key knowledge required
• No double scan of data
Cons
• Additional shuffles even when a join is not skewed
  • Between joins
  • Between joins and aggregates

Customized AQE Skew Mitigation (Spark 3.0 adaptive framework)
Pros
• Reduces runtime latency
• No prior skew-key knowledge required
• No double scan of data
• No unnecessary shuffles when a join is not skewed
Skew Join With Cosco
The Problem
▪ Adaptive skew mitigation splits on mapper boundaries
▪ Each sub-reducer is only required to read data from a subset of mappers.
▪ Only partial data is required
▪ E.g. a skewed partition 0 is split so that sub-reducer 0 reads {mapper 1} and sub-reducer 1 reads {mapper 0, mapper 2}
[Diagram: partition 0 is stored as File 0, File 1, File 3, ... shared by sub-reducers 0, 1, 2, ..., i; mapper outputs are merged in files :(]
Cosco Shuffle Recap
▪ Why Cosco?
▪ Shuffle as a service: IO Efficiency
▪ Write-ahead buffer
▪ Groups shuffled output based on partition id instead of mapper id.
▪ Data deduplication is performed on the reducer side
▪ Efficiency wins:
▪ Solves the write amplification problem: avoids spills (3X -> 1X)
▪ Solves the small IO problem: average IO size 200KB -> 2.5MB
▪ Previous Talk
▪ Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco Shuffle Recap
[Diagram: mappers send records to Cosco shuffle services, which keep per-partition write-ahead buffers (e.g. partition 0's file 1 and file 2 buffers) and flush them as files to DFS; reducers 0 and 1 then read their partition's files from DFS, with the Cosco metadata service tracking the shuffle files]
Working with Mapper Boundaries
▪ Load all data but filter out the uninteresting records
▪ Each sub-reducer still has to load all data from all file chunks of the partition.
▪ It then filters out records from mappers it is not interested in.
▪ IO inefficient
[Diagram: a sub-reducer assigned mappers {1, 4} reads every record in File 0, keeping the records from mapper 1 and mapper 4 and dropping those from mappers 2 and 3]
Skewed partition worst cases
▪ For the most heavily skewed partitions, a partition could be up to XXX TB and require XX k splits, which means the skewed partition data has to be read XX k times as well.
▪ Very inefficient: most of the time is spent on disk IO.
▪ Compared to vanilla Spark shuffle, it no longer has any advantage in terms of IO.
Why can't we split on file boundaries?
Terminology
▪ Record
▪ Unit of data: a single row
▪ Package
▪ Unit sent from a mapper to the shuffle service
▪ Chunk
▪ Unit flushed by the shuffle service
[Diagram: multiple records form a package; multiple packages form a chunk]
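To make the later deduplication discussion concrete, here is a hypothetical data model in Scala (field names are assumptions, not Cosco's actual types):

    case class Record(partitionId: Int, payload: Array[Byte])          // one data row
    case class Package(mapperId: String, packageId: Long,              // unit sent mapper -> shuffle service
                       records: Seq[Record])
    case class Chunk(chunkId: String, packages: Seq[Package])          // unit flushed by the shuffle service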
Restrictions
▪ Duplicated records
▪ A record could be duplicated in multiple files due to package resending, which might be caused by various reasons.
▪ Cosco relies on the reducer client to perform deduplication, which requires it to read the entire partition data from a mapper.
▪ Simply splitting on file boundaries requires no overlap of mappers between the file sets of different sub-reducers, which is not feasible.
[Diagram: sub-reducer 0 is assigned to read files 0 & 1, but it has to read files 2 & 3 as well because they have mapper overlap with files 0 & 1]
Duplication
▪ Mapper failure
▪ Each mapper is assigned a unique id.
▪ Each package is tagged with its mapper's unique id.
▪ The reducer consumes records only from the mapper attempts it is interested in, based on those unique ids.
▪ A mapper attempt whose output is not fully read by a reducer therefore does not cause duplicated records in the reducer.
[Diagram: mapper 0's attempt `abc` fails and is retried as attempt `xyz`; the partition contains package 3 from both `abc` and `xyz`, but sub-reducer 0 reads from `xyz`, so the copy from `abc` is dropped and the copy from `xyz` is accepted]
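A hedged sketch of that id-based filtering on the reducer side (the names are assumptions): only records whose mapper attempt id is in the set of attempts the reducer was told to read survive, so output from failed attempts is dropped.

    // records: (mapperAttemptId, payload) pairs as read from the partition's files.
    def dedupByMapperAttempt(
        records: Iterator[(String, Array[Byte])],
        acceptedAttempts: Set[String]): Iterator[Array[Byte]] =
      records.collect {
        case (attemptId, payload) if acceptedAttempts.contains(attemptId) => payload
      }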
Duplication
▪ Network issues
▪ If a mapper fails to receive an ack message before the timeout, it resends the package.
▪ We call such packages "suspected packages": a suspected package is not necessarily a duplicated package, but a duplicated package is always a suspected package.
[Diagram: after an ack timeout, mapper 1 resends packages 1-3 from its ack queue to shuffle service 2; shuffle service 1 had already flushed records from packages {1, 2, 3} into chunk i, so the resend duplicates them into chunk j]
Identify suspected packages in mapper
Suspected Map
▪ Map<partition, Map<packageId, chunkId>>
▪ Each mapper keeps a map to track packages resent due to connection issues.
▪ Whenever a resend happens, all the packages in the ACK queue are added to the map.
[Diagram: after resending the packages in its ack queue to shuffle service 2, mapper 1 adds them to the suspected package map for partition 1]
    packageId   chunkId
    Package-x   Chunk-xyz
    Package-1   null
    Package-2   null
    Package-3   null
Suspected Map
▪ Map<partition, Map<packageId, chunkId>>
▪ The ACK message, when it eventually returns, contains a unique identifier of the chunk that holds the package.
▪ The mapper keeps the mapping from those suspected packages to their authorized chunks.
▪ Once the mapper is done, it reports the suspected map to the Spark driver.
▪ The Spark driver aggregates the suspected package info from different mappers.
[Diagram: shuffle service 2 eventually returns the ack for Package-1 along with its chunkId; mapper 1 removes Package-1 from the ack queue and records the chunkId in the suspected map for partition 1]
    packageId   chunkId
    Package-x   Chunk-xyz
    Package-1   Chunk-abc
    Package-2   null
    Package-3   null
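A minimal Scala sketch of the mapper-side bookkeeping described above (class and method names are assumptions, not Cosco's code):

    import scala.collection.mutable

    // Map<partition, Map<packageId, chunkId>>; a None chunk means "suspected,
    // authorized chunk not known yet".
    class SuspectedMap {
      private val map = mutable.Map.empty[Int, mutable.Map[Long, Option[String]]]

      // On every resend, every package still sitting in the ACK queue becomes suspected.
      def markSuspected(partition: Int, packageIdsInAckQueue: Seq[Long]): Unit = {
        val perPartition = map.getOrElseUpdate(partition, mutable.Map.empty)
        packageIdsInAckQueue.foreach(pkg => perPartition.getOrElseUpdate(pkg, None))
      }

      // When the ACK finally arrives, it names the chunk holding the package.
      def bindAuthorizedChunk(partition: Int, packageId: Long, chunkId: String): Unit =
        map.get(partition).foreach(_.update(packageId, Some(chunkId)))

      // Reported to the Spark driver once the mapper finishes.
      def snapshot: Map[Int, Map[Long, Option[String]]] =
        map.map { case (p, m) => p -> m.toMap }.toMap
    }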
Reading subset of files on sub-reducer
▪ Each sub-reducer is assigned a subset of files along with the corresponding suspected map of the partition.
[Diagram: the skewed partition's files are grouped into chunk set 1, chunk set 2, ...; sub-reducer 1 reads chunk set 1, sub-reducer 2 reads chunk set 2, and so on]
Reading subset of files on sub-reducer
▪ A duplicated record is only accepted if it is read from its authorized chunk (see the sketch below).
[Diagram: the suspected map records {mapper-3, package-4} -> chunk-15; sub-reducer 2 reads that record from Chunk-15, the authorized chunk, and accepts it, while sub-reducer 7 reads the same record from Chunk-4 and drops it]
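A hedged sketch of the acceptance rule on the sub-reducer (names are assumptions): a record from a suspected package is kept only when it is read from the chunk the driver marked as authorized.

    def acceptRecord(
        mapperId: String,
        packageId: Long,
        readFromChunk: String,
        suspected: Map[(String, Long), String]): Boolean =
      suspected.get((mapperId, packageId)) match {
        case Some(authorizedChunk) => readFromChunk == authorizedChunk  // suspected: keep only the authorized copy
        case None                  => true                              // not suspected: always keep
      }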
Chunk Lost
▪ A lost chunk requires restarting all corresponding sub-reducers:
▪ The Spark driver restarts mappers to regenerate the data of that particular partition.
▪ Data in the new chunks is non-deterministic.
▪ So the Spark driver broadcasts the failure to all corresponding sub-reducer tasks and restarts them all.
▪ This is rare since chunks are RS-encoded.
Putting It All Together
▪ Motivation:
▪ Split a skewed partition into multiple non-overlapping file sets so that each sub-reducer can be assigned one of them.
▪ Mapper:
▪ Keeps track of suspected packages along with their authorized chunks.
▪ Reports the suspected package info to the Spark driver when it finishes.
▪ Driver:
▪ Aggregates suspected package info from all mappers and passes it along to sub-reducers when necessary.
▪ Detects skewed partitions and splits them along the file boundaries of each partition.
▪ Fails and restarts all corresponding sub-reducer tasks of a partition if a chunk is lost.
▪ Reducer:
▪ Fetches the file list from the driver along with the suspected package info.
▪ Accepts a record of a suspected package only if it is read from the authorized chunk.
▪ Data correctness
▪ End-to-end checksum
Summary
Skew Join Journey
▪ Skew Hint
▪ Runtime Skew Mitigation
▪ Customized AQE Skew Mitigation
Cosco + Skew Join
▪ Shuffle Recap
▪ Working with Original Cosco
▪ Splitting in File Boundaries
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
