Bucketing 2.0: Improve Spark SQL
Performance by Removing Shuffle
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, ByteDance
Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
What we do
o Manage Spark SQL / Presto / Hive
workload
o Offer Open API and self-serve platform
o Optimize Spark SQL / Presto / Hive
engine
o Design data architecture for most
business lines in ByteDance
Agenda
▪ Spark SQL at ByteDance
▪ What is Bucketing
▪ Spark Bucketing Limitations
▪ Bucketing Optimizations at ByteDance
Spark SQL at ByteDance
Spark SQL at ByteDance
2016: Small-scale experiments
2017: Ad-hoc workload
2018: A few ETL pipelines in production
2019: Full-production deployment
2020: Main engine in the DW area
What is Bucketing
What is bucketing
▪ Create Bucketed Table
-- Spark SQL DDL (USING)
CREATE TABLE order (
    order_id long,
    user_id  long,
    product  long,
    amount   long
)
USING parquet
CLUSTERED BY (user_id)
SORTED BY (user_id)
INTO 1024 BUCKETS
LOCATION '/user/warehouse/test.db/order'

-- Hive DDL (STORED AS)
CREATE TABLE order (
    order_id long,
    user_id  long,
    product  long,
    amount   long
)
CLUSTERED BY (user_id)
SORTED BY (user_id)
INTO 1024 BUCKETS
STORED AS parquet
LOCATION '/user/warehouse/test.db/order'
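Both statements declare 1024 buckets on user_id. Conceptually, each row is routed to a bucket by hashing the bucketing key and taking it modulo the bucket count. A minimal Python sketch, with the built-in hash standing in for the engine's real hash function (Spark uses Murmur3, Hive uses HiveHash):

```python
NUM_BUCKETS = 1024

def bucket_id(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for the engine's bucketing hash (Murmur3 in Spark,
    # HiveHash in Hive); only the modulo routing is the point here.
    return hash(user_id) % num_buckets

# All rows with the same user_id land in the same bucket file, which is
# what lets a later join on user_id skip the shuffle.
print(bucket_id(12345))  # some value in [0, 1024)
```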
What is bucketing
▪ Insert into Bucketed Table
INSERT INTO order SELECT order_id, user_id, product, amount FROM order_staging
What is bucketing
▪ ShuffledHashJoin

ShuffledHashJoin
├── Exchange(user_id) ← TableScan(order)
└── Exchange(user_id) ← TableScan(user)
What is bucketing
▪ ShuffledHashJoin with Bucketing

ShuffledHashJoin
├── TableScan(order)
└── TableScan(user)

There is no Exchange, since both tables are pre-shuffled on user_id.
What is bucketing
▪ SortMergeJoin

SortMergeJoin
├── Sort(user_id) ← Exchange(user_id) ← TableScan(order)
└── Sort(user_id) ← Exchange(user_id) ← TableScan(user)
What is bucketing
▪ SortMergeJoin with Bucketing

SortMergeJoin
├── TableScan(order)
└── TableScan(user)

There is no Exchange or Sort, since both tables are pre-shuffled and pre-sorted on the join key (user_id).
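Since bucketed tables are pre-shuffled and pre-sorted, the remaining work per bucket pair is just a streaming merge. A minimal sketch of that merge step, with plain Python lists standing in for sorted bucket files:

```python
def merge_join(left, right):
    """Merge two lists of (key, value) rows, each sorted by key,
    emitting matched triples -- the core of SortMergeJoin once
    Exchange and Sort have been removed."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # collect all right-side rows sharing this key
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

orders = [(1, "o1"), (2, "o2"), (2, "o3"), (5, "o4")]
users  = [(1, "alice"), (2, "bob"), (4, "carol")]
# → [(1, 'o1', 'alice'), (2, 'o2', 'bob'), (2, 'o3', 'bob')]
print(merge_join(orders, users))
```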
Spark Bucketing Limitations
Spark Bucketing Limitations
▪ Small files
hdfs dfs -ls /user/warehouse/test.db/order/_temporary/0/_temporary/attempt_20200519145628_0014_m_000014_0 | wc -l
988
INSERT INTO order SELECT order_id, user_id, product, amount FROM order_staging
Each task can generate up to 1024 small files, where 1024 is the bucket number. In total there can be up to 1024 * M small files, where M is the number of tasks. When M is 1024, that is up to about 1 million small files.
Spark Bucketing Limitations
▪ Small files
INSERT INTO order SELECT order_id, user_id, product, amount
FROM order_staging
DISTRIBUTE BY user_id
There are up to 1024 files in total when 1024 is a multiple of M, where M equals spark.sql.shuffle.partitions.
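The DISTRIBUTE BY works because the shuffle and the bucketing use the same hash family, so each reducer task receives whole buckets. A small Python sketch of the file-count argument, with plain integer hashing as a stand-in for Murmur3:

```python
def files_written(num_buckets: int, num_tasks: int, keys) -> int:
    """Count output files: one per (task, bucket) pair that receives at
    least one row, after DISTRIBUTE BY routes each key to task
    hash(key) % num_tasks using the same hash as the bucketing."""
    pairs = set()
    for k in keys:
        h = hash(k)
        pairs.add((h % num_tasks, h % num_buckets))
    return len(pairs)

keys = range(100_000)
# 1024 is a multiple of M = 256: every bucket lands on exactly one
# task, so at most 1024 files are written in total.
print(files_written(1024, 256, keys))  # at most 1024
```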
Spark Bucketing Limitations
▪ Small files
INSERT INTO order SELECT order_id, user_id, product, amount
FROM order_staging
DISTRIBUTE BY user_id
There are up to M files in total when M is a multiple of 1024, where M equals spark.sql.shuffle.partitions.
Spark Bucketing Limitations
▪ Incompatible across SQL engines
[Figure: Hive writes bucketed tables by shuffling from mappers (M) to reducers (R), so each reducer produces exactly one file per bucket — bucket 0, bucket 1, ..., bucket (n-1) — using HiveHash. Spark SQL has no such shuffle: every mapper writes its own file for each bucket it sees, using Murmur3.]
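The two hash schemes can be contrasted with a toy experiment. For illustration only: HiveHash of a non-negative int is (roughly) the int itself (Java's Integer.hashCode), while Spark's Murmur3 is a different function — approximated here by crc32, not a real Murmur3 implementation:

```python
import zlib

def hive_bucket(key: int, n: int) -> int:
    # Hive: the hashCode of an int is the int itself (non-negative case)
    return key % n

def spark_bucket(key: int, n: int) -> int:
    # Stand-in for Spark's Murmur3 (crc32 here, purely for illustration)
    return zlib.crc32(key.to_bytes(8, "little")) % n

n = 1024
mismatches = sum(1 for k in range(10_000)
                 if hive_bucket(k, n) != spark_bucket(k, n))
# Most keys land in different buckets under the two schemes, so files
# written by one engine cannot be read as bucketed by the other.
print(mismatches, "of 10000 keys disagree")
```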
Spark Bucketing Limitations
▪ Incompatible across SQL engines
SortMergeJoin (join on user_id)
├── Sort(user_id) ← Exchange(user_id) ← TableScan(order)
└── Sort(user_id) ← Exchange(user_id) ← TableScan(user)

Exchange and Sort are required when joining Hive-bucketed tables in Spark SQL, or Spark-bucketed tables in Hive.
Spark Bucketing Limitations
▪ Extra Sort
SortMergeJoin (join on user_id)
├── Sort(user_id) ← TableScan(order)
└── Sort(user_id) ← TableScan(user)

Sort is still required when joining Spark-bucketed tables in Spark SQL, because each bucket may consist of more than one file.
Spark Bucketing Limitations
▪ Unaligned Bucket Number
When the bucket numbers differ (order: clustered by user_id into 4096 buckets; user: clustered by user_id into 1024 buckets), an Exchange(user_id) is required on one of the two bucketed tables before the SortMergeJoin on user_id.
Spark Bucketing Limitations
▪ Join key set is different from bucket key set
SortMergeJoin (join on user_id, location_id)
├── Exchange(user_id, location_id) ← TableScan(order)  -- clustered by user_id into 1024 buckets
└── Exchange(user_id, location_id) ← TableScan(user)   -- clustered by user_id into 1024 buckets

Exchange is required when the bucketing key set differs from the join key set.
Spark Bucketing Limitations
▪ Union all after bucketing
SortMergeJoin (join on user_id)
├── Exchange(user_id) ← Union(TableScan(order_mobile), TableScan(order_web))  -- each clustered by user_id into 1024 buckets
└── TableScan(user)  -- clustered by user_id into 1024 buckets

Exchange is required in this case even though the underlying tables are all bucketed by user_id (the join key) with the same bucket number, because a plain Union does not preserve its children's output partitioning.
Bucketing Optimizations at ByteDance
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
[Figure (recap): Hive — mappers shuffle to reducers, one file per bucket, using HiveHash; Spark SQL — per-mapper bucket files, using Murmur3.]
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
▪ Spark SQL writes to Hive bucketed tables in the same way as Hive
▪ override InsertIntoHiveTable#requiredDistribution
▪ HashClusteredDistribution with HiveHash on the bucketing keys
▪ override InsertIntoHiveTable#requiredOrdering
▪ SortOrder on the bucketing keys, Ascending
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
▪ Spark SQL reads Hive bucketed tables using the bucketing metadata
▪ override HiveTableScanExec#outputPartitioning
▪ HashPartitioning with HiveHash
▪ override HiveTableScanExec#outputOrdering
▪ SortOrder on bucketing keys with Ascending
Bucketing Optimizations at ByteDance
With the overrides, the scans' output matches the join's requirements and no Exchange or Sort is added:

Sort Merge Join
  requireChildDistribution: HashClusteredDistribution(id, n, HiveHash)
  requireChildOrdering: SortOrder(id)
├── HiveTableScan  (outputPartitioning: HashPartitioning(id, n, HiveHash); outputOrdering: SortOrder(id))
└── HiveTableScan  (same)

Without them, the scans report no usable partitioning or ordering, so Exchange and Sort are inserted:

Sort Merge Join
  requireChildDistribution: HashClusteredDistribution(id, n, Murmur3Hash)
  requireChildOrdering: SortOrder(id)
├── Sort ← Exchange ← HiveTableScan  (outputPartitioning: UnknownPartitioning; outputOrdering: Nil)
└── Sort ← Exchange ← HiveTableScan  (same)
Bucketing Optimizations at ByteDance
▪ One to Many Bucket Join
Bucketing Optimizations at ByteDance
[Figure: one-to-many bucket join. Table A (3 buckets): (0, 3, 6, 9, 12, 15), (1, 4, 7, 10, 13, 16), (2, 5, 8, 11, 14, 17). Table B (6 buckets): (0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 16), (5, 11, 17). Every bucket of B falls entirely inside one bucket of A, so the two scans can be Sort-Merge-Joined with an extra Sort instead of a full Exchange.]
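The containment that makes the one-to-many join work is pure arithmetic: when the larger bucket count is a multiple of the smaller one, every fine bucket maps into exactly one coarse bucket. A quick check, with a plain modulo over the hash value standing in for the real bucketing function:

```python
def bucket(h: int, n: int) -> int:
    # h stands in for the bucketing hash of a key
    return h % n

# Keys in Table B's bucket j (6 buckets) all fall into Table A's
# bucket j % 3 (3 buckets), because (h % 6) % 3 == h % 3.
for h in range(1000):
    assert bucket(h, 3) == bucket(h, 6) % 3
print("every B bucket nests inside exactly one A bucket")
```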
Bucketing Optimizations at ByteDance
[Figure: the Bucket Union alternative. Table A (3 buckets) is scanned twice; Bucket Union combines the first scan's buckets 0-2 with the second scan's buckets 0'-2' into 6 buckets that line up one-to-one with Table B's 6 buckets, so the Sort Merge Join needs no Exchange.]
Bucketing Optimizations at ByteDance
[Figure: the Bucket Union plan — Table A (3 buckets) scanned twice, Bucket Union, Sort Merge Join with Table B (6 buckets) — evaluated against the join types: B left join A, B left semi join A, B anti join A, B inner join A, B right join A, B full outer join A, B cross join A.]
Bucketing Optimizations at ByteDance
[Figure: the same plan with a Filter added above each scan of Table A, so that every row of A appears in exactly one of the six virtual buckets; with the Filters in place, the listed join types (B left / left semi / anti / inner / right / full outer / cross join A) are all handled correctly.]
Bucketing Optimizations at ByteDance
[Figure: with the Filters, each bucket of Table A — (0, 3, 6, 9, 12, 15), (2, 5, 8, 11, 14, 17), (1, 4, 7, 10, 13, 16) — is split into virtual buckets 0-2 and 0'-2', which align one-to-one with Table B's 6 buckets (0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 16), (5, 11, 17).]
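A sketch of what the Filters do, using the slide's own numbers (h stands in for the bucketing hash of a key; the function name is illustrative, not ByteDance's actual operator):

```python
def split_bucket(rows, bucket_id, small_n=3, big_n=6):
    """Split one bucket of the 3-bucket table into two virtual buckets
    aligned with the 6-bucket table: a Filter on h % 6 keeps only the
    rows that belong to each finer bucket."""
    first  = [h for h in rows if h % big_n == bucket_id]
    second = [h for h in rows if h % big_n == bucket_id + small_n]
    return first, second

bucket0 = [0, 3, 6, 9, 12, 15]          # Table A, bucket 0 (h % 3 == 0)
# → ([0, 6, 12], [3, 9, 15]) -- matching B's buckets 0 and 3
print(split_bucket(bucket0, 0))
```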
Bucketing Optimizations at ByteDance
▪ Join on more than bucketing keys

[Figure: Table X and Table Y are both bucketed by A, but the join is on (A, B). Vanilla Spark adds Exchange on (A, B) and Sort on (A, B) to both sides of the Sort Merge Join. Sample rows (A, B, C) — Table X: (X, 1, 1), (X, 2, 3), (X, 4, 2), (Y, 6, 7), (Y, 7, 3), (Y, 8, 5), (Z, 2, 8), (Z, 4, 3), (Z, 5, 2); Table Y: (X, 2, 3), (X, 1, 1), (X, 4, 2), (Y, 8, 5), (Y, 6, 7), (Y, 7, 3), (Z, 2, 8), (Z, 5, 2), (Z, 4, 3).]
Bucketing Optimizations at ByteDance
▪ Join on more than bucketing keys

[Figure: since the join keys (A, B) are a superset of the bucketing keys (A), the Exchanges can be removed: rows that agree on (A, B) already agree on A and therefore sit in the same bucket, so each side only needs a Sort on (A, B) before the Sort Merge Join. Same sample rows as on the previous slide.]
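Why no Exchange is needed can be checked directly: two rows can only join if they agree on all join keys, including the bucketing key A, so they already share a bucket. A small Python sketch, with plain hash partitioning as a stand-in:

```python
def partition(a_key, n=4):
    # Both tables are bucketed on A only (Python's hash as a stand-in)
    return hash(a_key) % n

# Any two rows that agree on the full join key (A, B) necessarily agree
# on A, so they are already co-located: only a Sort on (A, B) inside
# each bucket is needed, never an Exchange.
left  = [("X", 1), ("Y", 7), ("Z", 4)]
right = [("X", 1), ("Y", 7), ("Z", 4)]
for (a, b) in left:
    for (a2, b2) in right:
        if (a, b) == (a2, b2):
            assert partition(a) == partition(a2)
print("all (A, B) matches are already in the same bucket")
```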
Bucketing Optimizations at ByteDance
▪ Bucketing evolution
▪ Case 1: A non-bucketed table is partitioned by date. Users want to convert it to a bucketed table without overhead
▪ Case 2: The bucket number is X and users need to enlarge it to 2X because the data volume has increased
Bucketing Optimizations at ByteDance
▪ Bucketing evolution
▪ Put the bucketing information into the partition parameters
▪ Only if all target partitions have the same bucketing information will the table be read as a bucketed table; otherwise it is read as a non-bucketed table
▪ Reading a bucketed table as a non-bucketed table only impacts performance, not correctness
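A sketch of the read-side rule, with illustrative (not actual Hive Metastore) parameter names:

```python
def read_mode(partitions):
    """Decide whether a table can be read as bucketed: every target
    partition must carry identical bucketing info in its parameters.
    The 'bucketing' key is a hypothetical name for illustration."""
    infos = {p.get("bucketing") for p in partitions}
    if len(infos) == 1 and None not in infos:
        return "bucketed"
    return "non-bucketed"   # safe fallback: correct, just slower

old = {"bucketing": None}                    # written before the change
new = {"bucketing": ("user_id", 1024)}
print(read_mode([new, new]))   # bucketed
print(read_mode([old, new]))   # non-bucketed
```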