Bucketing 2.0: Improve Spark SQL
Performance by Removing Shuffle
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, ByteDance
Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
What we do
o Manage Spark SQL / Presto / Hive
workload
o Offer Open API and self-serve platform
o Optimize Spark SQL / Presto / Hive
engine
o Design data architecture for most
business lines in ByteDance
Agenda
▪ Spark SQL at ByteDance
▪ What is Bucketing
▪ Spark Bucketing Limitations
▪ Bucketing Optimizations at ByteDance
Spark SQL at ByteDance
Spark SQL at ByteDance
2016: Small-scale experiments
2017: Ad-hoc workload
2018: A few ETL pipelines in production
2019: Full-production deployment
2020: Main engine in the DW area
What is Bucketing
What is bucketing
▪ Create Bucketed Table
-- Spark SQL DDL (USING)
CREATE TABLE order (
    order_id long,
    user_id  long,
    product  long,
    amount   long
)
USING parquet
CLUSTERED BY (user_id)
SORTED BY (user_id)
INTO 1024 BUCKETS
LOCATION '/user/warehouse/test.db/order'

-- Hive DDL (STORED AS)
CREATE TABLE order (
    order_id long,
    user_id  long,
    product  long,
    amount   long
)
CLUSTERED BY (user_id)
SORTED BY (user_id)
INTO 1024 BUCKETS
STORED AS parquet
LOCATION '/user/warehouse/test.db/order'
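Both statements declare 1024 buckets on user_id. Conceptually, each row is routed to a bucket by hashing the bucketing key and taking it modulo the bucket count. A minimal Python sketch, with the built-in hash standing in for the engine's real hash function (Spark uses Murmur3, Hive uses HiveHash):

```python
NUM_BUCKETS = 1024

def bucket_id(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for the engine's bucketing hash (Murmur3 in Spark,
    # HiveHash in Hive); only the modulo routing is the point here.
    return hash(user_id) % num_buckets

# All rows with the same user_id land in the same bucket file, which is
# what lets a later join on user_id skip the shuffle.
print(bucket_id(12345))  # some value in [0, 1024)
```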
What is bucketing
▪ Insert into Bucketed Table
INSERT INTO order SELECT order_id, user_id, product, amount FROM order_staging
What is bucketing
▪ ShuffledHashJoin

ShuffledHashJoin
├── Exchange(user_id) ← TableScan(order)
└── Exchange(user_id) ← TableScan(user)
What is bucketing
▪ ShuffledHashJoin with Bucketing

ShuffledHashJoin
├── TableScan(order)
└── TableScan(user)

There is no Exchange, since both tables are pre-shuffled on user_id.
What is bucketing
▪ SortMergeJoin

SortMergeJoin
├── Sort(user_id) ← Exchange(user_id) ← TableScan(order)
└── Sort(user_id) ← Exchange(user_id) ← TableScan(user)
What is bucketing
▪ SortMergeJoin with Bucketing

SortMergeJoin
├── TableScan(order)
└── TableScan(user)

There is no Exchange or Sort, since both tables are pre-shuffled and pre-sorted on the join key (user_id).
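Since bucketed tables are pre-shuffled and pre-sorted, the remaining work per bucket pair is just a streaming merge. A minimal sketch of that merge step, with plain Python lists standing in for sorted bucket files:

```python
def merge_join(left, right):
    """Merge two lists of (key, value) rows, each sorted by key,
    emitting matched triples -- the core of SortMergeJoin once
    Exchange and Sort have been removed."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # collect all right-side rows sharing this key
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

orders = [(1, "o1"), (2, "o2"), (2, "o3"), (5, "o4")]
users  = [(1, "alice"), (2, "bob"), (4, "carol")]
# → [(1, 'o1', 'alice'), (2, 'o2', 'bob'), (2, 'o3', 'bob')]
print(merge_join(orders, users))
```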
Spark Bucketing Limitations
Spark Bucketing Limitations
▪ Small files
hdfs dfs -ls /user/warehouse/test.db/order/_temporary/0/_temporary/attempt_20200519145628_0014_m_000014_0 | wc -l
988
INSERT INTO order SELECT order_id, user_id, product, amount FROM order_staging
Each task can generate up to 1024 small files, where 1024 is the bucket number. In total there can be up to 1024 * M small files, where M is the number of tasks. When M is 1024, that is up to about 1 million small files.
Spark Bucketing Limitations
▪ Small files
INSERT INTO order SELECT order_id, user_id, product, amount
FROM order_staging
DISTRIBUTE BY user_id
There are up to 1024 files in total when 1024 is a multiple of M, where M equals spark.sql.shuffle.partitions.
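The DISTRIBUTE BY works because the shuffle and the bucketing use the same hash family, so each reducer task receives whole buckets. A small Python sketch of the file-count argument, with plain integer hashing as a stand-in for Murmur3:

```python
def files_written(num_buckets: int, num_tasks: int, keys) -> int:
    """Count output files: one per (task, bucket) pair that receives at
    least one row, after DISTRIBUTE BY routes each key to task
    hash(key) % num_tasks using the same hash as the bucketing."""
    pairs = set()
    for k in keys:
        h = hash(k)
        pairs.add((h % num_tasks, h % num_buckets))
    return len(pairs)

keys = range(100_000)
# 1024 is a multiple of M = 256: every bucket lands on exactly one
# task, so at most 1024 files are written in total.
print(files_written(1024, 256, keys))  # at most 1024
```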
Spark Bucketing Limitations
▪ Small files
INSERT INTO order SELECT order_id, user_id, product, amount
FROM order_staging
DISTRIBUTE BY user_id
There are up to M files in total when M is a multiple of 1024, where M equals spark.sql.shuffle.partitions.
Spark Bucketing Limitations
▪ Incompatible across SQL engines
[Figure: Hive writes bucketed tables by shuffling from mappers (M) to reducers (R), so each reducer produces exactly one file per bucket — bucket 0, bucket 1, ..., bucket (n-1) — using HiveHash. Spark SQL has no such shuffle: every mapper writes its own file for each bucket it sees, using Murmur3.]
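The two hash schemes can be contrasted with a toy experiment. For illustration only: HiveHash of a non-negative int is (roughly) the int itself (Java's Integer.hashCode), while Spark's Murmur3 is a different function — approximated here by crc32, not a real Murmur3 implementation:

```python
import zlib

def hive_bucket(key: int, n: int) -> int:
    # Hive: the hashCode of an int is the int itself (non-negative case)
    return key % n

def spark_bucket(key: int, n: int) -> int:
    # Stand-in for Spark's Murmur3 (crc32 here, purely for illustration)
    return zlib.crc32(key.to_bytes(8, "little")) % n

n = 1024
mismatches = sum(1 for k in range(10_000)
                 if hive_bucket(k, n) != spark_bucket(k, n))
# Most keys land in different buckets under the two schemes, so files
# written by one engine cannot be read as bucketed by the other.
print(mismatches, "of 10000 keys disagree")
```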
Spark Bucketing Limitations
▪ Incompatible across SQL engines
SortMergeJoin (join on user_id)
├── Sort(user_id) ← Exchange(user_id) ← TableScan(order)
└── Sort(user_id) ← Exchange(user_id) ← TableScan(user)

Exchange and Sort are required when joining Hive-bucketed tables in Spark SQL, or Spark-bucketed tables in Hive.
Spark Bucketing Limitations
▪ Extra Sort
SortMergeJoin (join on user_id)
├── Sort(user_id) ← TableScan(order)
└── Sort(user_id) ← TableScan(user)

Sort is still required when joining Spark-bucketed tables in Spark SQL, because each bucket may consist of more than one file.
Spark Bucketing Limitations
▪ Unaligned Bucket Number
When the bucket numbers differ (order: clustered by user_id into 4096 buckets; user: clustered by user_id into 1024 buckets), an Exchange(user_id) is required on one of the two bucketed tables before the SortMergeJoin on user_id.
Spark Bucketing Limitations
▪ Join key set is different from bucket key set
SortMergeJoin (join on user_id, location_id)
├── Exchange(user_id, location_id) ← TableScan(order)  -- clustered by user_id into 1024 buckets
└── Exchange(user_id, location_id) ← TableScan(user)   -- clustered by user_id into 1024 buckets

Exchange is required when the bucketing key set differs from the join key set.
Spark Bucketing Limitations
▪ Union all after bucketing
SortMergeJoin (join on user_id)
├── Exchange(user_id) ← Union(TableScan(order_mobile), TableScan(order_web))  -- each clustered by user_id into 1024 buckets
└── TableScan(user)  -- clustered by user_id into 1024 buckets

Exchange is required in this case even though the underlying tables are all bucketed by user_id (the join key) with the same bucket number, because a plain Union does not preserve its children's output partitioning.
Bucketing Optimizations at ByteDance
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
[Figure (recap): Hive — mappers shuffle to reducers, one file per bucket, using HiveHash; Spark SQL — per-mapper bucket files, using Murmur3.]
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
▪ Spark SQL writes to Hive bucketed tables in the same way as Hive
▪ override InsertIntoHiveTable#requiredDistribution
▪ HashClusteredDistribution with HiveHash on the bucketing keys
▪ override InsertIntoHiveTable#requiredOrdering
▪ SortOrder on the bucketing keys, Ascending
Bucketing Optimizations at ByteDance
▪ Align Spark Bucketing with Hive
▪ Spark SQL reads Hive bucketed tables using the bucketing metadata
▪ override HiveTableScanExec#outputPartitioning
▪ HashPartitioning with HiveHash
▪ override HiveTableScanExec#outputOrdering
▪ SortOrder on bucketing keys with Ascending
Bucketing Optimizations at ByteDance
With the overrides, the scans' output matches the join's requirements and no Exchange or Sort is added:

Sort Merge Join
  requireChildDistribution: HashClusteredDistribution(id, n, HiveHash)
  requireChildOrdering: SortOrder(id)
├── HiveTableScan  (outputPartitioning: HashPartitioning(id, n, HiveHash); outputOrdering: SortOrder(id))
└── HiveTableScan  (same)

Without them, the scans report no usable partitioning or ordering, so Exchange and Sort are inserted:

Sort Merge Join
  requireChildDistribution: HashClusteredDistribution(id, n, Murmur3Hash)
  requireChildOrdering: SortOrder(id)
├── Sort ← Exchange ← HiveTableScan  (outputPartitioning: UnknownPartitioning; outputOrdering: Nil)
└── Sort ← Exchange ← HiveTableScan  (same)
Bucketing Optimizations at ByteDance
▪ One to Many Bucket Join
Bucketing Optimizations at ByteDance
[Figure: one-to-many bucket join. Table A (3 buckets): (0, 3, 6, 9, 12, 15), (1, 4, 7, 10, 13, 16), (2, 5, 8, 11, 14, 17). Table B (6 buckets): (0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 16), (5, 11, 17). Every bucket of B falls entirely inside one bucket of A, so the two scans can be Sort-Merge-Joined with an extra Sort instead of a full Exchange.]
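The containment that makes the one-to-many join work is pure arithmetic: when the larger bucket count is a multiple of the smaller one, every fine bucket maps into exactly one coarse bucket. A quick check, with a plain modulo over the hash value standing in for the real bucketing function:

```python
def bucket(h: int, n: int) -> int:
    # h stands in for the bucketing hash of a key
    return h % n

# Keys in Table B's bucket j (6 buckets) all fall into Table A's
# bucket j % 3 (3 buckets), because (h % 6) % 3 == h % 3.
for h in range(1000):
    assert bucket(h, 3) == bucket(h, 6) % 3
print("every B bucket nests inside exactly one A bucket")
```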
Bucketing Optimizations at ByteDance
[Figure: the Bucket Union alternative. Table A (3 buckets) is scanned twice; Bucket Union combines the first scan's buckets 0-2 with the second scan's buckets 0'-2' into 6 buckets that line up one-to-one with Table B's 6 buckets, so the Sort Merge Join needs no Exchange.]
Bucketing Optimizations at ByteDance
[Figure: the Bucket Union plan — Table A (3 buckets) scanned twice, Bucket Union, Sort Merge Join with Table B (6 buckets) — evaluated against the join types: B left join A, B left semi join A, B anti join A, B inner join A, B right join A, B full outer join A, B cross join A.]
Bucketing Optimizations at ByteDance
[Figure: the same plan with a Filter added above each scan of Table A, so that every row of A appears in exactly one of the six virtual buckets; with the Filters in place, the listed join types (B left / left semi / anti / inner / right / full outer / cross join A) are all handled correctly.]
Bucketing Optimizations at ByteDance
[Figure: with the Filters, each bucket of Table A — (0, 3, 6, 9, 12, 15), (2, 5, 8, 11, 14, 17), (1, 4, 7, 10, 13, 16) — is split into virtual buckets 0-2 and 0'-2', which align one-to-one with Table B's 6 buckets (0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 16), (5, 11, 17).]
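A sketch of what the Filters do, using the slide's own numbers (h stands in for the bucketing hash of a key; the function name is illustrative, not ByteDance's actual operator):

```python
def split_bucket(rows, bucket_id, small_n=3, big_n=6):
    """Split one bucket of the 3-bucket table into two virtual buckets
    aligned with the 6-bucket table: a Filter on h % 6 keeps only the
    rows that belong to each finer bucket."""
    first  = [h for h in rows if h % big_n == bucket_id]
    second = [h for h in rows if h % big_n == bucket_id + small_n]
    return first, second

bucket0 = [0, 3, 6, 9, 12, 15]          # Table A, bucket 0 (h % 3 == 0)
# → ([0, 6, 12], [3, 9, 15]) -- matching B's buckets 0 and 3
print(split_bucket(bucket0, 0))
```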
Bucketing Optimizations at ByteDance
▪ Join on more than bucketing keys

[Figure: Table X and Table Y are both bucketed by A, but the join is on (A, B). Vanilla Spark adds Exchange on (A, B) and Sort on (A, B) to both sides of the Sort Merge Join. Sample rows (A, B, C) — Table X: (X, 1, 1), (X, 2, 3), (X, 4, 2), (Y, 6, 7), (Y, 7, 3), (Y, 8, 5), (Z, 2, 8), (Z, 4, 3), (Z, 5, 2); Table Y: (X, 2, 3), (X, 1, 1), (X, 4, 2), (Y, 8, 5), (Y, 6, 7), (Y, 7, 3), (Z, 2, 8), (Z, 5, 2), (Z, 4, 3).]
Bucketing Optimizations at ByteDance
▪ Join on more than bucketing keys

[Figure: since the join keys (A, B) are a superset of the bucketing keys (A), the Exchanges can be removed: rows that agree on (A, B) already agree on A and therefore sit in the same bucket, so each side only needs a Sort on (A, B) before the Sort Merge Join. Same sample rows as on the previous slide.]
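Why no Exchange is needed can be checked directly: two rows can only join if they agree on all join keys, including the bucketing key A, so they already share a bucket. A small Python sketch, with plain hash partitioning as a stand-in:

```python
def partition(a_key, n=4):
    # Both tables are bucketed on A only (Python's hash as a stand-in)
    return hash(a_key) % n

# Any two rows that agree on the full join key (A, B) necessarily agree
# on A, so they are already co-located: only a Sort on (A, B) inside
# each bucket is needed, never an Exchange.
left  = [("X", 1), ("Y", 7), ("Z", 4)]
right = [("X", 1), ("Y", 7), ("Z", 4)]
for (a, b) in left:
    for (a2, b2) in right:
        if (a, b) == (a2, b2):
            assert partition(a) == partition(a2)
print("all (A, B) matches are already in the same bucket")
```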
Bucketing Optimizations at ByteDance
▪ Bucketing evolution
▪ Case 1: A non-bucketed table is partitioned by date. Users want to convert it to a bucketed table without overhead
▪ Case 2: The bucket number is X and users need to enlarge it to 2X because the data volume has increased
Bucketing Optimizations at ByteDance
▪ Bucketing evolution
▪ Put the bucketing information into the partition parameters
▪ Only if all target partitions have the same bucketing information will the table be read as a bucketed table; otherwise it is read as a non-bucketed table
▪ Reading a bucketed table as a non-bucketed table only impacts performance, not correctness
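A sketch of the read-side rule, with illustrative (not actual Hive Metastore) parameter names:

```python
def read_mode(partitions):
    """Decide whether a table can be read as bucketed: every target
    partition must carry identical bucketing info in its parameters.
    The 'bucketing' key is a hypothetical name for illustration."""
    infos = {p.get("bucketing") for p in partitions}
    if len(infos) == 1 and None not in infos:
        return "bucketed"
    return "non-bucketed"   # safe fallback: correct, just slower

old = {"bucketing": None}                    # written before the change
new = {"bucketing": ("user_id", 1024)}
print(read_mode([new, new]))   # bucketed
print(read_mode([old, new]))   # non-bucketed
```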