Carson Wang (Intel), Yuanjian Li (Baidu)
Spark SQL Adaptive Execution Unleashes
The Power of Cluster in Large Scale
#Exp5SAIS
Agenda
• Challenges in Using Spark SQL
• Adaptive Execution Introduction
• Adaptive Execution in Baidu
Tuning Shuffle Partition Number
• Too small: spill, OOM
• Too large: scheduling overhead, more IO requests, too many small output files
• The same shuffle partition number doesn’t fit all stages
Spark SQL Join Selection
• A join may take intermediate results as inputs. Spark SQL may choose an inefficient join strategy if it doesn’t know the exact input size at planning time.
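The planning-time decision can be sketched as follows. The 10 MB default mirrors Spark's `spark.sql.autoBroadcastJoinThreshold`; the helper function itself is illustrative, not Spark's code.

```python
# Illustrative sketch of size-based join selection; not Spark internals.
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10 MB

def choose_join_strategy(estimated_size_bytes):
    # A table estimated under the threshold can be broadcast to every
    # executor; otherwise the planner falls back to SortMergeJoin.
    if estimated_size_bytes <= BROADCAST_THRESHOLD:
        return "BroadcastJoin"
    return "SortMergeJoin"

# An intermediate result that is really 5 MB but estimated at 500 MB
# at planning time is locked into the slower strategy:
print(choose_join_strategy(500 * 1024 * 1024))  # SortMergeJoin
print(choose_join_strategy(5 * 1024 * 1024))    # BroadcastJoin
```

Because the estimate is fixed at planning time, a bad guess for a sub-query's output cannot be corrected later without adaptive execution.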
Data Skew in Join
• Data in some partitions is much larger than in other partitions. Data skew is a common source of slowness for shuffle joins.
• Common ways to solve data skew
– Increase shuffle partition number
– Increase the BroadcastJoin threshold
– Add prefix to the skewed keys
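The last workaround, prefixing (salting) the skewed keys, can be sketched as below; `NUM_SALTS` and the helper names are illustrative, not part of any Spark API.

```python
import random

NUM_SALTS = 4  # assumption: spread each hot key over 4 sub-keys

def salt_key(key):
    """Skewed side: prefix each row's key with a random salt so the
    hot key's rows spread across several shuffle partitions."""
    return f"{random.randrange(NUM_SALTS)}_{key}"

def explode_key(key):
    """Other side: replicate each key once per salt value so every
    salted row still finds its join partner."""
    return [f"{i}_{key}" for i in range(NUM_SALTS)]

print(explode_key("user42"))
# ['0_user42', '1_user42', '2_user42', '3_user42']
```

The cost of this manual fix is that the non-skewed side is replicated `NUM_SALTS` times, which is why a runtime skew handler (shown later) is preferable.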
Spark SQL Execution Mode
Spark SQL Adaptive Execution Mode
Auto Setting the Reducer Number
• Enable the feature
– spark.sql.adaptive.enabled -> true
• Configure the behavior
– Target input size for a reduce task
– Min/Max shuffle partition number
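A configuration sketch for the Intel spark-adaptive build: `spark.sql.adaptive.enabled` is the switch named on this slide; the partition-number and target-size property names follow the values shown later in this deck and may differ across builds.

```python
# Hedged sketch: enabling adaptive execution in a PySpark session.
# Property names for min/max partitions are from the Intel fork and
# are assumptions for vanilla Spark 2.x.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ae-demo")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.minNumPostShufflePartitions", "400")
         .config("spark.sql.adaptive.maxNumPostShufflePartitions", "10000")
         .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
                 str(512 * 1024 * 1024))  # 512 MB per reduce task
         .getOrCreate())
```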
(Diagram: the best shuffle partition number is chosen at runtime between the configured min and max shuffle partition numbers.)
Auto Setting the Reducer Number
• Target size per reducer = 64 MB.
• Min-Max shuffle partition number = 1 to 5
• ShuffledRowRDD partition sizes: 70 MB, 30 MB, 20 MB, 10 MB, 50 MB
• Adaptive Execution uses 3 reducers at runtime: [70 MB], [30 + 20 + 10 = 60 MB], [50 MB]
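The coalescing rule behind this example can be sketched as a greedy, order-preserving grouping of contiguous shuffle partitions (a simplified model, not Spark's actual implementation):

```python
def coalesce_partitions(sizes_mb, target_mb):
    """Greedily pack contiguous shuffle partitions into reduce tasks,
    closing a group once adding the next partition would exceed the
    target input size. Returns lists of partition indices."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        groups.append(current)
    return groups

# The slide's example: 70, 30, 20, 10, 50 MB with a 64 MB target
print(coalesce_partitions([70, 30, 20, 10, 50], 64))
# [[0], [1, 2, 3], [4]]  -> 3 reduce tasks
```

Note that partition 0 (70 MB) already exceeds the target, so it becomes a reduce task on its own; the target caps the merged size rather than splitting oversized partitions.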
Optimize Join Strategy at Runtime
Optimize Join Strategy at Runtime
• After optimizing a SortMergeJoin to a BroadcastJoin, each reduce task locally reads the whole map output file and joins it with the broadcast table.
(Diagram: the map task and reduce task run in the same executor, so the reduce task reads the map output file locally.)
Handle Skewed Join at Runtime
• spark.sql.adaptive.skewedJoin.enabled -> true
• A partition is considered skewed if its data size or row count is N times larger than the median, and also larger than a pre-defined threshold.
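A minimal sketch of this rule; the `factor` and `threshold` defaults below are illustrative, not Spark's configured values.

```python
from statistics import median

def is_skewed(size, all_sizes, factor=10, threshold=64):
    """A partition is flagged as skewed when it is both N (= factor)
    times larger than the median partition size AND above an absolute
    threshold (so tiny partitions never trigger the handler)."""
    return size > factor * median(all_sizes) and size > threshold

sizes = [8, 10, 9, 11, 500]  # MB; one runaway partition
print([s for s in sizes if is_skewed(s, sizes)])  # [500]
```

The same check is applied to row counts as well as byte sizes, per the slide.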
Handle Skewed Join at Runtime
(Diagram: the skewed partition 0 of Table 1 is split into several parts (part0, part1, part2, …). Each part is sort-merge joined with the whole partition 0 of Table 2, and the results are unioned with the ordinary SortMergeJoin of partitions 1-N of both tables.)
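The split-and-union handling of a skewed partition can be sketched as below, using a plain hash join in place of the SortMergeJoin for brevity; the function name and data layout are illustrative.

```python
def join_skewed_partition(t1_rows, t2_rows, num_splits):
    """Split the skewed partition of table 1 into num_splits chunks,
    join each chunk against the whole matching partition of table 2,
    and union the per-chunk results."""
    # Build a hash index over the (non-skewed) table 2 partition once.
    t2_index = {}
    for k, v in t2_rows:
        t2_index.setdefault(k, []).append(v)

    result = []
    for i in range(num_splits):          # each chunk = one join task
        chunk = t1_rows[i::num_splits]
        for k, v in chunk:
            for v2 in t2_index.get(k, []):
                result.append((k, v, v2))
    return result

t1 = [("a", 1), ("a", 2), ("b", 3)]      # skewed partition of table 1
t2 = [("a", "x"), ("b", "y")]            # matching partition of table 2
print(sorted(join_skewed_partition(t1, t2, num_splits=2)))
# [('a', 1, 'x'), ('a', 2, 'x'), ('b', 3, 'y')]
```

Splitting only the skewed side is correct for inner joins because every chunk still sees the full set of matching keys from the other side.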
Spark in Baidu
• 2014 (80 nodes, 50 jobs/day): Spark introduced to Baidu; version 0.8
• 2015 (1000 nodes, 300 jobs/day): built a standalone cluster; integrated with in-house FS, Pub-Sub, and DW systems; version 1.4
• 2016 (3000 nodes, 1500 jobs/day): built a cluster over YARN; integrated with the in-house resource scheduler system; version 1.6
• 2017 (6500 nodes, 5800 jobs/day): SQL/Graph service over Spark; OAP; version 2.1
• 2018 (9500 nodes, 18000 jobs/day): Structured Streaming; Spark Adaptive Execution; Hadoop-to-Spark migration; version 2.2
AE Boosting Scenarios in Baidu
• Specific user scenarios (SortMergeJoin -> BroadcastJoin)
• Long-running applications, or using Spark as a service
• Graph & ML
SortMergeJoin -> BroadcastJoin
• Common features in this scenario:
– Small table joined with a big table in a sub-query
– Small table generated by a sub-query
• Key point:
– Identify and determine the ‘small’ table
• Acceleration ratio:
– 50%~200%
SELECT t.c1, t.id, t.c2, t.c3, t.c4, sum(t.num1), sum(t.num2), sum(t.num3)
FROM (
  SELECT c1, t1.id as id, c2, c3, c4, sum(num1s) as num1, sum(num2) as num2, sum(num3) as num3
  FROM huge_table1 t1 INNER JOIN user_list t2 ON (t1.id = t2.id)
  WHERE (event_day=20171107) and flag != 'true'
  GROUP BY c1, t1.id, c2, c3, c4
  UNION ALL
  SELECT c1, t1.id as id, c2, c3, c4, sum(num1s) as num1, sum(num2) as num2, sum(num3) as num3
  FROM huge_table2 t1 INNER JOIN user_list t2 ON (t1.id = t2.id)
  WHERE (event_day=20171107) and flag != 'true'
  GROUP BY c1, t1.id, c2, c3, c4
) t
GROUP BY t.c1, t.id, t.c2, t.c3, c4
SortMergeJoin -> BroadcastJoin
(Query plan: UserList is inner joined with BaseTable1 in SubQuery1 and with BaseTable2 in SubQuery2; the two join outputs are unioned and aggregated into the final result.)
Long Running Application
• Included scenarios:
– Long-running batch jobs (> 1 hour)
– Using Spark as a service (Livy, Baidu BigSQL, Spark Shell, Zeppelin)
– Spark Streaming
• Key point:
– Adaptive parallelism adjustment
• Acceleration ratio:
– 50%~100%
Long Running Application
• AE disabled: duration 52 min (100 instances, 10 GB executor memory, 4 executor cores)
• AE enabled: duration 30 min (same resources; min/maxNumPostShufflePartitions = 400/10000, targetPostShuffleInputSize = 512 MB)
GraphFrame & MLlib
• Included scenarios:
– GraphFrame applications
– MLlib
• Key point:
– Adaptive parallelism adjustment
• Acceleration ratio:
– 50%~100%
AE probe in Spark
(Diagram: Spark runs over YARN; each host's NodeManager runs an Executor alongside a RigAgent. Spark applications (batch, streaming, SQL, …) emit metrics through a MetricsSink to Baidu Bigpipe, and the AE Probe feeds the Baidu ShowX console.)
Takeaways
• Three main features in our adaptive execution
– Auto setting the shuffle partition number
– Optimize join strategy at runtime
– Handle skewed join at runtime
• For more information about our implementation:
– https://issues.apache.org/jira/browse/SPARK-23128
– https://github.com/Intel-bigdata/spark-adaptive
Carson Wang carson.wang@intel.com
Yuanjian Li liyuanjian@baidu.com
Thank you!
