In Spark SQL, the physical plan provides the fundamental information about how a query will be executed. The objective of this talk is to build understanding and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the relevant information they expose about the execution. Once you understand the query plan, you can look for weak spots and rewrite the query to obtain a better plan and more efficient execution.
The main content of this talk is based on the Spark source code, but it also reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them, and describe what information can be taken from them. We will also describe what happens under the hood when the plan is generated, focusing mainly on the physical planning phase. In short, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
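For readers who want to follow along, here is a minimal PySpark sketch of how to print the plans discussed in the talk; the session name and the toy query are invented, and explain(mode="formatted") requires Spark 3.0+:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

    orders = spark.range(1_000_000).withColumn("amount", F.rand() * 100)
    agg = (orders
           .groupBy((F.col("id") % 10).alias("bucket"))
           .agg(F.sum("amount").alias("total")))

    # Prints the parsed, analyzed and optimized logical plans plus the physical plan.
    agg.explain(True)
    # Spark 3.0+ also offers a more readable operator-by-operator view:
    # agg.explain(mode="formatted")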
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing using analytics database techniques. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, together with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk will take a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
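As a small illustration of the generated-code path described above, recent Spark versions can dump the Java source produced by whole-stage code generation for a query; a sketch assuming an active SparkSession named spark (the query itself is arbitrary):

    df = (spark.range(100_000)
               .selectExpr("id", "id * 2 AS doubled")
               .filter("doubled > 100"))

    # "codegen" mode (Spark 3.0+) prints the Java source produced by whole-stage code generation;
    # on the Scala side, df.queryExecution.debug.codegen() gives similar output on older versions.
    df.explain(mode="codegen")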
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read, shuffle, and write partitions? How do you increase parallelism and decrease the number of output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
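A few of the knobs those questions point at, sketched in PySpark; the values and paths are placeholders, not recommendations, and an active SparkSession named spark is assumed:

    # Number of partitions produced by shuffles (joins, aggregations); default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    df = spark.read.parquet("/data/events")        # hypothetical input path
    print(df.rdd.getNumPartitions())               # read partitions

    # Increase parallelism by redistributing on a key before a wide operation...
    repartitioned = df.repartition(400, "customer_id")

    # ...and reduce the number of output files without a full shuffle when writing.
    repartitioned.coalesce(32).write.mode("overwrite").parquet("/data/events_out")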
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of the Data Source API V2, with a comparison to the Data Source API V1. We will also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew, and maximize cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
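One common way to implement the salting strategy mentioned above is to add a random salt column on the skewed side and replicate the smaller side across all salt values; a sketch with invented DataFrame and column names:

    from pyspark.sql import functions as F

    NUM_SALTS = 16  # assumes DataFrames `facts` (large, skewed) and `dims` (small) sharing a customer_id key

    # Skewed side: append a random salt so one hot key spreads across NUM_SALTS tasks.
    salted_facts = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Small side: replicate each row once per salt value so every salted key still matches.
    salted_dims = dims.withColumn(
        "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
    )

    joined = salted_facts.join(salted_dims, ["customer_id", "salt"]).drop("salt")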
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate, and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically skip useless data using Parquet file statistics, such as min-max statistics, via pushdown filters. In addition, Spark users can enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and I/O. Parquet is the default data format of the data warehouse at Bytedance. In practice, we find that Parquet pushdown filters work poorly and far too much unnecessary data is read, because the statistics have no discrimination across Parquet row groups (column data is out of order when written to Parquet files by ETL jobs).
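The two features discussed here are controlled by session settings and by writing filters that Spark can push into the Parquet scan; a hedged sketch with invented paths and columns, assuming an active SparkSession named spark:

    from pyspark.sql import functions as F

    # Both settings default to true in recent Spark releases.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")          # row-group skipping via min/max stats
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  # batch (columnar) decoding

    events = spark.read.parquet("/warehouse/events")                    # hypothetical path

    # A plain comparison on a top-level column can be pushed into the Parquet reader;
    # the PushedFilters entry of the scan node in explain() shows whether it was.
    recent = events.filter(F.col("event_date") >= "2020-01-01")
    recent.explain()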
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
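For reference, the main knobs of the unified memory model this talk covers can be set when the session is created; a sketch showing the default values, not tuning advice:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("memory-config-sketch")
        # Fraction of (heap - 300MB) shared by execution and storage (default 0.6).
        .config("spark.memory.fraction", "0.6")
        # Portion of that unified region shielded for cached (storage) data (default 0.5).
        .config("spark.memory.storageFraction", "0.5")
        .getOrCreate()
    )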
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you tune parameters based on your resources and job.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark (Bo Yang)
The slides explain how shuffle works in Spark and help people understand more details about Spark internals. They show how the major classes are implemented, including ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), and ShuffleReader (BlockStoreShuffleReader).
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
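To make the "many small files" point concrete, one common pattern is to repartition by the partitioning column before writing so each partition directory receives a bounded number of files; a sketch with an invented DataFrame, column, and output path:

    # Assumes a DataFrame `df` with an event_date column and an active SparkSession.
    (df
     .repartition("event_date")              # co-locate each date's rows so a partition directory gets one file, not hundreds
     .write
     .partitionBy("event_date")              # Hive-style directory layout usable for partition pruning
     .option("compression", "snappy")
     .mode("overwrite")
     .parquet("/warehouse/events_parquet"))  # hypothetical output path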
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster, and whole-stage code generation improves performance through Java JIT-compiled code. However, the Java JIT is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing the CPU usage for both I/O and execution.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data in the majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can not only bring better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... (Databricks)
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, and effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and because it stores and shuffles large amounts of data across the cluster at runtime, the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations such as GZip, Snappy, LZ4, and ZSTD, and the Intel Big Data Technologies team has also implemented more codecs based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running different micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
This session should help big data software engineers choose the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
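For context, the codec choices discussed here are exposed as ordinary Spark configuration; a sketch of where they are set (the values are examples, not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("codec-sketch")
        # Codec for shuffle, broadcast and spill blocks; lz4 is the default, snappy/zstd/lzf also ship with Spark.
        .config("spark.io.compression.codec", "zstd")
        # Codec used when writing Parquet files; snappy is the default.
        .config("spark.sql.parquet.compression.codec", "zstd")
        .getOrCreate()
    )

    spark.range(1_000_000).write.mode("overwrite").parquet("/tmp/events_zstd")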
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of the JVM, but Python is one of the officially supported languages. How does it actually work? How can Python communicate with Java/Scala? In this talk, we'll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
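One concrete consequence of that Python-to-JVM boundary is serialization cost; Arrow-backed pandas UDFs move data in columnar batches instead of row by row. A sketch assuming an active SparkSession named spark (the type-hint form shown requires Spark 3.0+):

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
        # Executed on whole Arrow batches, avoiding per-row Python <-> JVM serialization.
        return (temp_f - 32.0) * 5.0 / 9.0

    readings = spark.range(1_000).withColumn("temp_f", F.rand() * 100)
    readings.withColumn("temp_c", fahrenheit_to_celsius(F.col("temp_f"))).show(5)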
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
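The mechanism described here is enabled by default in Spark 3.0+ and shows up in the plan as a dynamic pruning subquery on the fact table's partition column; a sketch with an invented star schema, assuming an active SparkSession named spark:

    from pyspark.sql import functions as F

    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # default in 3.0+

    sales = spark.read.parquet("/warehouse/sales")      # hypothetical fact table partitioned by date_id
    dates = spark.read.parquet("/warehouse/dim_date")   # small dimension table

    q = (sales.join(dates, "date_id")
              .filter(F.col("fiscal_quarter") == "2020-Q1")
              .groupBy("store_id")
              .agg(F.sum("amount").alias("revenue")))

    # With dynamic partition pruning, the sales scan only reads the date_id partitions
    # that survive the dimension filter.
    q.explain()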
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of how Spark distributes the data within the cluster. You'll also find out how to work around common errors and even handle the trickiest corner cases we've encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
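A minimal illustration of the two basic strategies these sessions discuss (shuffle sort-merge join vs. broadcast hash join) and how to influence the choice; the tables and paths are invented, and the broadcast hint is only appropriate when that side genuinely fits in executor memory:

    from pyspark.sql.functions import broadcast

    orders = spark.read.parquet("/warehouse/orders")         # large
    customers = spark.read.parquet("/warehouse/customers")   # small

    # Default for two large inputs: shuffle both sides and run a sort-merge join on the key.
    merged = orders.join(customers, "customer_id")

    # Ship the small table to every executor and run a broadcast hash join instead.
    hinted = orders.join(broadcast(customers), "customer_id")

    # Tables below this size (default 10MB) are broadcast automatically.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

    hinted.explain()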
This is the presentation I gave at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level material, and can be used as an introduction to Apache Spark.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you'll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook's performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You'll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that help identify suitable candidate jobs that can benefit from bucketing.
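For orientation, the write side of the bucketing technique described above looks roughly like this in the DataFrame API; the table, column, and bucket count are invented:

    # Assumes a DataFrame `clicks` and a Hive-compatible metastore for saveAsTable.
    (clicks
     .write
     .bucketBy(64, "user_id")     # hash user_id into 64 buckets per writing task
     .sortBy("user_id")           # optional: pre-sort within each bucket file
     .mode("overwrite")
     .format("parquet")
     .saveAsTable("warehouse.clicks_bucketed"))

    # A join of two tables bucketed identically on user_id can run as a single-stage sort-merge join.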
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing Spark application examples and a dockerized Hadoop environment to experiment with.
Databricks: What We Have Learned by Eating Our Dog Food (Databricks)
"Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor Databricks is also a user of UAP.
So, what have we learned by eating our own dogfood? Attend a “from the trenches report” from Suraj Acharya, Director Engineering responsible for Databricks’ in-house data engineering team how his team put Databricks technology to use, the lessons they have learned along the way and best practices for using Databricks for data engineering.
"
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Anatomy of Data Frame API: A deep dive into Spark Data Frame API (datamantra)
In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
This short text will get you up to speed in no time on creating visualizations using R's ggplot2 package. It was developed as part of a training for those who had no prior experience in R and limited knowledge of general programming concepts. It's a must-have initial guide for those exploring the field of data science.
Simplifying Change Data Capture using Databricks Delta (Databricks)
In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance tuning considerations for both batch as well as streaming CDC pipelines into Delta.
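As a rough idea of what such CDC SQL looks like (not the exact notebooks from the talk), applying a batch of change records to a Delta table is typically a single MERGE statement; the updates view, column names, and target table are invented:

    # `updates_df` is assumed to hold one row per change event from the CDC feed,
    # with an `op` column marking deletes; `warehouse.customers` is an existing Delta table.
    updates_df.createOrReplaceTempView("updates")

    spark.sql("""
      MERGE INTO warehouse.customers AS t
      USING updates AS s
        ON t.customer_id = s.customer_id
      WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)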
Managing Apache Spark Workload and Automatic Optimizing (Databricks)
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which hold over 22 PB of data (compressed) and keep growing every year. In the machine learning domain, Spark is playing an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Europe Summit. Looking at the entire infrastructure, however, managing workload and efficiency for all Spark jobs across our data centers is still a big challenge. Our team runs the whole big data platform infrastructure and the management tools on top of it, helping our customers - not only DW engineers and data scientists, but also AI engineers - get on the same page. In this session, we will introduce how all of them benefit from a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time. We developed a component called Profiler that extends Spark core to support customized metric collection. Next, we will demonstrate some real user stories at eBay to show how the self-service system reduces effort on both the customer side and the infrastructure-team side; that is the highlight of the Spark job analysis and diagnosis part. Finally, we will introduce some upcoming advanced features that move toward an automatic optimization workflow rather than just alerting.
Speaker: Lantao Jin
"Lessons learned using Apache Spark for self-service data prep in SaaS world"Pavel Hardak
Slide deck for the presentation we delivered at Spark+AI Summit 2019 in San Francisco.
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017 and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empowers users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance.
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World (Databricks)
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017, and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empower users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance. We conclude with the unique Spark tuning guidelines distilled from our experience of running it in production, in order to ensure that the system is able to execute complex, nested pipelines with multiple self-joins and self-unions.
Speakers: Pavel Hardak, Jianneng Li
Data and AI summit: data pipelines observability with open lineage (Julien Le Dem)
Presentation of data lineage and observability with OpenLineage at the "Data and AI Summit" (formerly Spark Summit), with a focus on the Apache Spark integration for OpenLineage.
Observability for Data Pipelines With OpenLineage (Databricks)
Data is increasingly becoming core to many products. Whether to provide recommendations for users, getting insights on how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects, consumers of lineage in the ecosystem whether they focus on operations, governance or security.
Marquez is an open source project part of the LF AI & Data foundation which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A... (Databricks)
Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together.
It contains three major projects:
1) barrier execution mode
2) optimized data exchange and
3) accelerator-aware scheduling.
A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work.
Second, we will give updates on accelerator-aware scheduling and how it shall help accelerate your Spark training jobs. We will also outline on-going work for optimized data exchange.
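For reference, barrier execution mode itself is a small RDD-level API available since Spark 2.4; a toy sketch that gang-schedules all tasks of a stage before doing any work, assuming an active SparkSession named spark:

    from pyspark import BarrierTaskContext

    def train_slice(rows):
        ctx = BarrierTaskContext.get()
        ctx.barrier()   # every task of this stage is now running, so they can rendezvous
        # A real integration (e.g. Horovod) would launch its training worker for this slice here.
        yield (ctx.partitionId(), sum(1 for _ in rows))

    rdd = spark.sparkContext.parallelize(range(10_000), numSlices=4)
    print(rdd.barrier().mapPartitions(train_slice).collect())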
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
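The stage-level scheduling API referred to above is exposed on RDDs in Spark 3.1+; a hedged sketch of requesting GPU executors for just the training stage (resource names and amounts are illustrative, and the cluster must be configured for GPU discovery):

    from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                                  ResourceProfileBuilder)

    # Resources for the training stage only; the preceding ETL keeps the default profile.
    exec_reqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

    # `etl_df` is a hypothetical DataFrame produced by the data-preparation stages.
    training_rdd = etl_df.rdd.withResources(gpu_profile)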
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not “abelian groups”.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query it N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
3. ● David Vrba Ph.D.
● Data Scientist & Data Engineer at Socialbakers
○ Developing ETL pipelines in Spark
○ Optimizing Spark jobs
○ Productionalizing Spark applications
● Lecturing Spark trainings and workshops
○ Studying Spark source code
3#UnifiedDataAnalytics #SparkAISummit
4. Goal
● Share what we have learned by studying Spark source code and by using
Spark on a daily basis
○ Processing data from social networks (Facebook, Twitter, Instagram, ...)
○ Regularly processing data on scales up to 10s of TBs
○ Understanding the query plan is a must for efficient processing
4#UnifiedDataAnalytics #SparkAISummit
5. Outline
● Two part talk
○ In the first part we cover some theory
■ How query execution works
■ Physical Plan operators
■ Where to look for relevant information in Spark UI
○ In the second part we show some examples
■ Model queries with particular optimizations
■ Useful tips
○ Q/A
5#UnifiedDataAnalytics #SparkAISummit
6. Part I
● Query Execution
● Physical Plan Operators
● Spark UI
6#UnifiedDataAnalytics #SparkAISummit
7. Query Execution
7#UnifiedDataAnalytics #SparkAISummit
Query
→ Logical Planning: Parser → Unresolved Plan → Analyzer → Analyzed Plan → Cache Manager → Optimizer → Optimized Plan
→ Physical Planning: Query Planner → Spark Plan → Preparation → Executed Plan
→ Execution: RDD DAG → DAG Scheduler → Stages + Tasks → Task Scheduler → Executor
(The planning phases are the focus of today’s session.)
8. Logical Planning
● Logical plan is created, analyzed and optimized
● Logical plan
○ Tree representation of the query
○ It is an abstraction that carries information about what is supposed to happen
○ It does not contain precise information on how it happens
○ Composed of
■ Relational operators
– Filter, Join, Project, ... (they represent DataFrame transformations)
■ Expressions
– Column transformations, filtering conditions, joining conditions, ...
8#UnifiedDataAnalytics #SparkAISummit
9. Physical Planning
● The logical planning phase produces the Optimized Logical Plan
● The execution layer does not understand DataFrames / the Logical Plan (LP)
● The Logical Plan has to be converted to a Physical Plan (PP)
● Physical Plan
○ Bridge between LP and RDDs
○ Similarly to Logical Plan it is a tree
○ Contains more specific description of how things should happen (specific
choice of algorithms)
○ Uses lower level primitives - RDDs
9#UnifiedDataAnalytics #SparkAISummit
10. Physical Planning - 2 phases
10#UnifiedDataAnalytics #SparkAISummit
Spark Plan → (Additional Rules) → Executed Plan
● Spark Plan
  ○ Generated by the Query Planner using Strategies
  ○ For each node in LP there is a node in PP
● Strategy example: JoinSelection
  ○ Join in the Logical Plan becomes SortMergeJoin or BroadcastHashJoin in the Physical Plan
● Executed Plan
  ○ Final version of the query plan (after the Additional Rules are applied)
  ○ This is what will be executed
  ○ Generates the RDD code
See the plans:
df.queryExecution.sparkPlan
df.queryExecution.executedPlan
df.explain()
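For example, a minimal sketch of inspecting the plans from a Spark shell (the table and column names are illustrative, following the posts_fb example used later):

import spark.implicits._

val df = spark.table("posts_fb")
  .filter($"month" === 5)
  .groupBy("profile_id")
  .count()

df.queryExecution.sparkPlan    // Spark Plan, before the preparation rules
df.queryExecution.executedPlan // Executed Plan, after the preparation rules
df.explain()                   // prints the physical plan
df.explain(true)               // prints the parsed, analyzed, optimized and physical plans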
17. Let’s see some operators
● FileScan
● Exchange
● HashAggregate, SortAggregate, ObjectHashAggregate
● SortMergeJoin
● BroadcastHashJoin
17#UnifiedDataAnalytics #SparkAISummit
18. FileScan
18#UnifiedDataAnalytics #SparkAISummit
● Represents reading the data from a file format
spark.table(“posts_fb”)
.filter($“month” === 5)
.filter($”profile_id” === ...)
● table: posts_fb
● partitioned by month
● bucketed by profile_id into 20 buckets
● 1 file per bucket
26. 26#UnifiedDataAnalytics #SparkAISummit
We read from a bucketed table (20 buckets, 1 file per bucket)
Size of the whole partition: this is what will be read if we turn off bucketing (no bucket pruning)
If bucketing is OFF:
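A small sketch of how bucketing can be toggled for this comparison; the configuration key below is the standard Spark setting (true by default):

// Turn bucketing off: the FileScan can no longer prune buckets and reads the whole partition
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")

// Turn it back on to get bucket pruning on the profile_id filter again
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")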
41. Rules applied in preparation
• EnsureRequirements
• ReuseExchange
• ...
41#UnifiedDataAnalytics #SparkAISummit
Spark Plan → (Additional Rules) → Executed Plan
42. EnsureRequirements
● Adds Exchange & Sort operators to the plan
● How does it work?
● Each operator in PP has:
○ outputPartitioning
○ outputOrdering
○ requiredChildDistribution (this is a requirement for partitioning)
○ requiredChildOrdering (this is a requirement for ordering)
● If the requirement is not met, Exchange or Sort are used
42#UnifiedDataAnalytics #SparkAISummit
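A minimal sketch (illustrative data) of EnsureRequirements at work: we force a SortMergeJoin whose required child distribution and ordering are not met by the inputs, so Exchange and Sort get added to the plan:

import spark.implicits._

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // force a SortMergeJoin

val left  = spark.range(10000000).select(($"id" % 1000).as("key"), $"id".as("a"))
val right = spark.range(10000000).select(($"id" % 1000).as("key"), $"id".as("b"))

left.join(right, "key").explain()
// Neither input is hash-partitioned or sorted by "key", so EnsureRequirements inserts
// Exchange hashpartitioning(key, 200) and Sort [key ASC] on both sides of the SortMergeJoin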
45. 45#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
  requires from each child: HashPartitioning(id, 200) and ordering by id
  left:  Project <- FileScan   (outputPartitioning: Unknown, outputOrdering: None)
  right: Project <- FileScan   (outputPartitioning: Unknown, outputOrdering: None)
Since neither requirement is met, Exchange and Sort will be added to both branches.
49. Example with bucketing
● If the sources are tables bucketed by id into 50 buckets:
○ FileScan will have outputPartitioning as HashPartitioning(id, 50)
○ It will pass it to Project
○ This will happen in both branches of the Join
○ The requirement for partitioning is met => no Exchange needed
49#UnifiedDataAnalytics #SparkAISummit
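A sketch of this scenario (dfA, dfB, the paths and table names are illustrative); both sides are written with the same bucketing spec and then joined:

// Write both join inputs bucketed by the join key into the same number of buckets
dfA.write.bucketBy(50, "id").sortBy("id").option("path", pathA).saveAsTable("table_a")
dfB.write.bucketBy(50, "id").sortBy("id").option("path", pathB).saveAsTable("table_b")

// With matching bucketing on both sides the partitioning requirement of the
// SortMergeJoin is already met, so the plan should contain no Exchange
spark.table("table_a").join(spark.table("table_b"), "id").explain()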
50. 50#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
  requires from each child: HashPartitioning(id, 200) and ordering by id
  left:  Project <- FileScan   (outputPartitioning: HashPartitioning(id, 50))
  right: Project <- FileScan   (outputPartitioning: HashPartitioning(id, 50))
FileScan reports HashPartitioning(id, 50) and Project passes it on, so the partitioning requirement is met on both sides: no need for Exchange.
53. What about the Sort ?
53#UnifiedDataAnalytics #SparkAISummit
dataDF
  .write
  .bucketBy(50, “id”)
  .sortBy(“id”)
  .option(“path”, outputPath)
  .saveAsTable(tableName)
It depends on bucketing:
● If one file per bucket is created
  ○ FileScan will pick up the information about order
  ○ No need for Sort in the final plan
● If there are more files per bucket
  ○ FileScan will not use the information about order (the data would have to be sorted globally)
  ○ Spark will add Sort to the final plan (see the sketch below for one way to get a single file per bucket)
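A sketch (same illustrative names as above) of one way, under these assumptions, to end up with exactly one file per bucket so the sort order can be reused:

// Repartition by the bucketing key into the same number of partitions as buckets,
// so each task writes exactly one file into its bucket
dataDF
  .repartition(50, $"id")
  .write
  .bucketBy(50, "id")
  .sortBy("id")
  .option("path", outputPath)
  .saveAsTable(tableName)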
58. ReuseExchange
● Result of each Exchange (shuffle) is written to disk (shuffle-write)
58#UnifiedDataAnalytics #SparkAISummit
● The rule checks if the plan has the same Exchange-branches
○ if the branches are the same, the shuffle-write will also be the same
● Spark can reuse it since the data (shuffle-write) is persisted on disks
61. ReuseExchange
● The branches must be the same
● Only Exchange can be reused
● Can be turned off by internal configuration
○ spark.sql.exchange.reuse
● Otherwise you cannot control it directly, but there is an indirect way: tweaking the query (see the example later)
● It reduces the I/O and network cost:
○ One scan over the data
○ One shuffle write
61#UnifiedDataAnalytics #SparkAISummit
62. Part I conclusion
● Physical Plan operators carry information about execution
○ it is useful to pair information from the query plan with info about stages/tasks
● ReuseExchange allows Spark to reduce I/O and network cost
○ if Exchange sub-branches are the same Spark will compute it only once
● EnsureRequirements adds Exchange and Sort to the query plan
○ it makes sure that requirements of the operators are met
62#UnifiedDataAnalytics #SparkAISummit
67. Example I - exchange reuse
67#UnifiedDataAnalytics #SparkAISummit
● Take all profiles where
○ sum of interactions is bigger than 100
○ or sum of interactions is less than 20
71. 71#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” > 100)
No need to optimize
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
dfSumBig.union(dfSumSmall)
.write(...)
This Exchange is
reused. The data is
scanned only once
In our example:
72. 72#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” > 100)
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
dfSumBig.union(dfSumSmall)
.write(...)
Let’s suppose the assignment has changed:
● For dfSumSmall we want to consider
only specific profiles (profile_id is not
null)
73. 73#UnifiedDataAnalytics #SparkAISummit
Now add the additional filter to one DF:
val dfSumBig = df
  .groupBy(“profile_id”)
  .agg(sum(“interactions”).alias(“sum”))
  .filter($”sum” > 100)

val dfSumSmall = df
  .filter($”profile_id”.isNotNull)
  .groupBy(“profile_id”)
  .agg(sum(“interactions”).alias(“sum”))
  .filter($”sum” < 20)

dfSumBig.union(dfSumSmall)
  .write(...)

The two aggregation branches are no longer identical, so the Exchange is not reused.
75. val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
.filter($”profile_id”.isNotNull)
● How can we optimize this?
● Just calling the filter in a different place does not help
75#UnifiedDataAnalytics #SparkAISummit
The Spark optimizer will move this filter back by applying the PushDownPredicate rule
76. ● We can limit the optimizer by excluding the rule
76#UnifiedDataAnalytics #SparkAISummit
spark.conf.set(
“spark.sql.optimizer.excludeRules”,
“org.apache.spark.sql.catalyst.optimizer.PushDownPredicate”
)
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“metricValue”))
.filter($”metricValue” < 20)
.filter($”profile_id”.isNotNull)
It keeps both filter conditions after the aggregation, so the Exchange branch stays identical and can be reused
77. Limiting the optimizer
● Available since Spark 2.4
● Useful if you need to change the order of operators in the plan
○ reposition Filter, Exchange
● Queries:
○ one data source (which is expensive to read)
○ with multiple computations (using groupBy or Window)
■ combined together using Union or Join
77#UnifiedDataAnalytics #SparkAISummit
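A sketch of scoping the exclusion to just the query that needs it (the write target is illustrative); spark.conf.unset restores the default optimizer behavior afterwards:

spark.conf.set(
  "spark.sql.optimizer.excludeRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")

dfSumBig.union(dfSumSmall).write.parquet(outputPath)   // the query that benefits from the reused Exchange

spark.conf.unset("spark.sql.optimizer.excludeRules")   // re-enable PushDownPredicate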
78. Reused computation
● Similar effect to ReuseExchange can be achieved also by caching
○ caching may not be that useful with large datasets (if the data does not fit into the caching layer)
○ caching brings additional overhead when putting the data into the caching layer (memory or disk)
78#UnifiedDataAnalytics #SparkAISummit
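A sketch of this alternative, reusing the names from Example I (the write target is illustrative): the aggregation is computed once, cached, and both filtered DataFrames are derived from it:

import org.apache.spark.sql.functions.sum
import spark.implicits._

val dfSum = df
  .groupBy("profile_id")
  .agg(sum("interactions").alias("sum"))
  .cache()                                  // materialized on first use

val dfSumBig   = dfSum.filter($"sum" > 100)
val dfSumSmall = dfSum.filter($"sum" < 20)

dfSumBig.union(dfSumSmall).write.parquet(outputPath)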
79. Example II
● Get only records (posts) with max interactions for each profile
79#UnifiedDataAnalytics #SparkAISummit
Input table (posts):
post_id | profile_id | interactions | date
      1 |          1 |           20 | 2019-01-01
      2 |          1 |           50 | 2019-01-01
      3 |          1 |           50 | 2019-02-01
      4 |          2 |            0 | 2019-01-01
      5 |          2 |          100 | 2019-03-01

Expected output:
post_id | profile_id | interactions | date
      2 |          1 |           50 | 2019-01-01
      3 |          1 |           50 | 2019-02-01
Assume profile_id has non-null values.
80. Example II
● Three common ways to write the query:
○ Using Window function
○ Using groupBy + join
○ Using correlated subquery in SQL
80#UnifiedDataAnalytics #SparkAISummit
Which one is the most efficient?
81. 81#UnifiedDataAnalytics #SparkAISummit
val w = Window.partitionBy(“profile_id”)
posts
.withColumn(“maxCount”, max(“interactions”).over(w))
.filter($”interactions” === $”maxCount”)
Using Window:
82. 82#UnifiedDataAnalytics #SparkAISummit
val maxCount = posts
.groupBy(“profile_id”)
.agg(max(“interactions”).alias(“maxCount”))
posts.join(maxCount, Seq(“profile_id”))
.filter($”interactions” === $”maxCount”)
Using groupBy + join:
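The third variant from the list (correlated subquery in SQL) is not shown on the slides; a sketch, assuming posts is registered as a temporary view:

posts.createOrReplaceTempView("posts")

spark.sql("""
  SELECT *
  FROM posts p
  WHERE p.interactions = (SELECT max(m.interactions)
                          FROM posts m
                          WHERE m.profile_id = p.profile_id)
""")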
89. 89#UnifiedDataAnalytics #SparkAISummit
Query with window:
● Exchange + Sort
Query with join or correlated subquery:
● HashAggregate with reduced shuffle for the groupBy
● plus either SortMergeJoin (2 Exchanges + 1 Sort) or BroadcastHashJoin (broadcast exchange)
90. Window vs join + groupBy
90#UnifiedDataAnalytics #SparkAISummit
Both tables are large:
● Go with window
● SortMergeJoin will be more expensive (3 exchanges, 2 sorts)
Reduced table is much smaller and can be broadcasted:
● Go with broadcast join - it will be much faster
Both tables are small and comparable in size:
● It is not a big deal
● Broadcast will also generate traffic
93. Example III
● Sum interactions for each profile and each date
● Join an additional table about profiles
93#UnifiedDataAnalytics #SparkAISummit
Table B: profiles (Facebook pages)
profile_id | about               | lang
         1 | “some string”       | en
         2 | “some other string” | en

posts
  .groupBy(“profile_id”, “date”)
  .agg(sum(“interactions”))
  .join(profiles, Seq(“profile_id”))
103. 103#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle:
posts
  .repartition(“profile_id”)
  .groupBy(“profile_id”, “date”)
  .agg(sum(“interactions”))
  .join(profiles, Seq(“profile_id”))
● The repartition adds an Exchange with HashPartitioning(profile_id)
● The operators generated by the strategy (the groupBy aggregation and the join) are both OK with this partitioning, so one shuffle is eliminated
107. Adding repartition
107#UnifiedDataAnalytics #SparkAISummit
● What is the cost of using it?
  ○ Now we shuffle all the data
  ○ Before, we shuffled the reduced dataset
● What is more efficient?
  ○ It depends on the properties of the data
  ■ here, the cardinality of distinct (profile_id, date)
111. Adding repartition
111#UnifiedDataAnalytics #SparkAISummit
Use repartition to reduce the number of shuffles if:
● Cardinality of (profile_id, date) is comparable with row_count
● Each profile has only a few posts per date
● The data is not reduced much by the groupBy aggregation
● The reduced shuffle is comparable with the full shuffle
Use the original plan if:
● Cardinality of (profile_id, date) is much lower than row_count
● Each profile has many posts per date
● The data is reduced a lot by the groupBy aggregation
● The reduced shuffle is much smaller than the full shuffle
A quick way to check which case applies is sketched below.
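A quick sketch (reusing the names from this example) of estimating the reduction factor before deciding:

val rowCount   = posts.count()
val groupCount = posts.select("profile_id", "date").distinct().count()

// groupCount close to rowCount -> the groupBy barely reduces the data: repartition tends to pay off
// groupCount << rowCount       -> the groupBy reduces the data a lot: keep the original plan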
112. Useful tip I
● Filters are not pushed through nondeterministic expressions (and collect_list is one of them)
112#UnifiedDataAnalytics #SparkAISummit
posts
  .groupBy(“profile_id”)
  .agg(collect_list(“interactions”))
  .filter($”profile_id”.isNotNull )
● Here the filter stays above the aggregation, so it is applied only after the Exchange
114. Useful tip I
● Filters are not pushed through nondeterministic expressions
114#UnifiedDataAnalytics #SparkAISummit
posts
.filter($”profile_id”.isNotNull )
.groupBy(“profile_id”)
.agg(collect_list(“interactions”))
Make sure your filters are positioned at the right place to achieve efficient execution:
here the filter is applied before the Exchange
116. Useful tip IIa
● Important settings related to BroadcastHashJoin:
116#UnifiedDataAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold
● Default value is 10MB
● Spark will broadcast if
  ○ Spark thinks that the size of the data is below the threshold
  ○ or you use the broadcast hint
● Compute stats so that Spark can make good size estimates:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
● spark.sql.cbo.enabled (enables the cost-based optimizer, which uses these statistics)
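A sketch of applying these settings from code (table/column names and values are illustrative):

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString) // 50 MB

spark.sql("ANALYZE TABLE posts_fb COMPUTE STATISTICS FOR COLUMNS profile_id, interactions")

// Or bypass the size estimates entirely with an explicit hint:
import org.apache.spark.sql.functions.broadcast
posts.join(broadcast(profiles), Seq("profile_id"))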
119. Useful tip IIb
● Important settings related to BroadcastHashJoin:
119#UnifiedDataAnalytics #SparkAISummit
spark.sql.broadcastTimeout
● Default value is 300s
● If Spark does not make it in time:
SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
121. 3 basic solutions:
1. Disable the broadcast by setting the threshold to -1
2. Increase the timeout
3. Use caching
121#UnifiedDataAnalytics #SparkAISummit
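Sketches of the first two options (the third one, caching, is shown on the next slides):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // 1. disable automatic broadcasting
spark.conf.set("spark.sql.broadcastTimeout", "600")           // 2. raise the timeout (seconds)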
122. 122#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy(“profile_id”)
.agg(...some_aggregation...)
posts
.join(broadcast(df), Seq(“profile_id”))
Intense transformations:
● UDF call
● aggregation, …
The computation may take longer than 5 minutes.
If the size of df is small, we want to broadcast it.
123. 123#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy(“profile_id”)
.agg(...some_aggregation...)
.cache()
posts
.join(broadcast(df), Seq(“profile_id”))
df.count()
Three jobs:
1) count materializes the data into the cache (memory)
2) broadcast (fast because the data is taken from RAM)
3) join - will leverage the broadcasted data
If the size of df is small, we want to broadcast it.
We can use caching (but we have to materialize it immediately).
124. Conclusion
● Understanding the physical plan is important
● By limiting the optimizer you can get the Exchange reused
● Choice between Window vs groupBy+join depends on data properties
● Adding repartition can avoid an unnecessary Exchange
○ considering the data properties is important
● Be aware of nondeterministic expressions
● Fine-tune broadcast joins with configuration settings
○ make sure to have good size estimates using CBO
124#UnifiedDataAnalytics #SparkAISummit