WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
David Vrba, Socialbakers
Physical Plans in Spark SQL
#UnifiedDataAnalytics #SparkAISummit
● David Vrba Ph.D.
● Data Scientist & Data Engineer at Socialbakers
○ Developing ETL pipelines in Spark
○ Optimizing Spark jobs
○ Productionalizing Spark applications
● Lecturing at Spark trainings and workshops
○ Studying Spark source code
3#UnifiedDataAnalytics #SparkAISummit
Goal
● Share what we have learned by studying the Spark source code and by using
Spark on a daily basis
○ Processing data from social networks (Facebook, Twitter, Instagram, ...)
○ Regularly processing data on scales up to 10s of TBs
○ Understanding the query plan is a must for efficient processing
4#UnifiedDataAnalytics #SparkAISummit
Outline
● Two part talk
○ In the first part we cover some theory
■ How query execution works
■ Physical Plan operators
■ Where to look for relevant information in Spark UI
○ In the second part we show some examples
■ Model queries with particular optimizations
■ Useful tips
○ Q/A
5#UnifiedDataAnalytics #SparkAISummit
Part I
● Query Execution
● Physical Plan Operators
● Spark UI
6#UnifiedDataAnalytics #SparkAISummit
Query Execution
7#UnifiedDataAnalytics #SparkAISummit
Query
Logical Planning
Parser
Unresolved Plan
Analyzer
Analyzed Plan
Cache Manager
Optimizer
Optimized Plan
Physical Planning
Query Planner
Spark Plan
Preparation
Executed Plan
Execution
RDD
DAG
Task Scheduler
DAG Scheduler
Stages + Tasks
Executor
Today’s session
Logical Planning
● Logical plan is created, analyzed and optimized
● Logical plan
○ Tree representation of the query
○ It is an abstraction that carries information about what is supposed to happen
○ It does not contain precise information on how it happens
○ Composed of
■ Relational operators
– Filter, Join, Project, ... (they represent DataFrame transformations)
■ Expressions
– Column transformations, filtering conditions, joining conditions, ...
8#UnifiedDataAnalytics #SparkAISummit
Physical Planning
● The logical planning phase produces an Optimized Logical Plan
● Execution layer does not understand DataFrame / Logical Plan (LP)
● Logical Plan has to be converted to a Physical Plan (PP)
● Physical Plan
○ Bridge between LP and RDDs
○ Similarly to Logical Plan it is a tree
○ Contains a more specific description of how things should happen (specific
choice of algorithms)
○ Uses lower level primitives - RDDs
9#UnifiedDataAnalytics #SparkAISummit
Physical Planning - 2 phases
10#UnifiedDataAnalytics #SparkAISummit
Spark Plan Executed Plan
● Generated by Query
Planner using Strategies
● For each node in LP
there is a node in PP
Additional Rules
Physical Planning - 2 phases
11#UnifiedDataAnalytics #SparkAISummit
Spark Plan Executed Plan
● Generated by Query
Planner using Strategies
● For each node in LP
there is a node in PP
Additional Rules
Strategy example: JoinSelection
In Logical Plan: Join
In Physical Plan: SortMergeJoin or BroadcastHashJoin
Physical Planning - 2 phases
12#UnifiedDataAnalytics #SparkAISummit
Spark Plan Executed Plan
Additional Rules
● Generated by Query
Planner using Strategies
● For each node in LP
there is a node in PP
● Final version of query plan
● This will be executed
○ generates RDD code
Physical Planning - 2 phases
13#UnifiedDataAnalytics #SparkAISummit
Spark Plan Executed Plan
● Final version of query plan
● This will be executed
○ generates RDD code
Additional Rules
See the plans:
df.queryExecution.sparkPlan
df.queryExecution.executedPlan
df.explain()
14
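As a minimal sketch (not from the slides; df here is a hypothetical DataFrame built from spark.range), this is how the plans above can be inspected; explain(true) prints the parsed, analyzed, optimized and physical plans together:

import org.apache.spark.sql.functions.count

val df = spark.range(100)
  .withColumnRenamed("id", "profile_id")
  .groupBy("profile_id")
  .agg(count("profile_id").alias("cnt"))

df.explain(true)                         // parsed + analyzed + optimized + physical plan
println(df.queryExecution.sparkPlan)     // Spark Plan (before the preparation rules)
println(df.queryExecution.executedPlan)  // Executed Plan (after the preparation rules)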
Spark UI
● SQL tab: each query is listed with its Job IDs; click the query description to see the query plan
15
● Graphical representation of the Physical Plan - executed plan
● Details - string representation (LP + PP)
16
Let’s see some operators
● FileScan
● Exchange
● HashAggregate, SortAggregate, ObjectHashAggregate
● SortMergeJoin
● BroadcastHashJoin
17#UnifiedDataAnalytics #SparkAISummit
FileScan
18#UnifiedDataAnalytics #SparkAISummit
● Represents reading the data from a file format
spark.table("posts_fb")
.filter($"month" === 5)
.filter($"profile_id" === ...)
● table: posts_fb
● partitioned by month
● bucketed by profile_id into 20 buckets (20b)
● 1 file per bucket
FileScan
19#UnifiedDataAnalytics #SparkAISummit
number of files read
size of files read total
rows output
filesystem read data size total
FileScan
20#UnifiedDataAnalytics #SparkAISummit
It is useful to pair these numbers with the
information in the Jobs and Stages tabs in the Spark UI
21#UnifiedDataAnalytics #SparkAISummit
We read from bucketed table (20b, 1f/b)
22#UnifiedDataAnalytics #SparkAISummit
We read from bucketed table (20b, 1f/b)
23#UnifiedDataAnalytics #SparkAISummit
697.9 KB
We read from bucketed table (20b, 1f/b)
24#UnifiedDataAnalytics #SparkAISummit
Bucket pruning: 0B
697.9 KB
We read from bucketed table (20b, 1f/b)
25#UnifiedDataAnalytics #SparkAISummit
Bucket pruning: 0B
697.9 KB
We read from bucketed table (20b, 1f/b)
spark.sql.sources.bucketing.enabled
26#UnifiedDataAnalytics #SparkAISummit
We read from bucketed table (20b, 1f/b)
If bucketing is OFF: the size of the whole partition will be read (no bucket pruning)
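As a small illustrative sketch (not from the slides), the configuration mentioned above can be toggled to compare the FileScan metrics with and without bucket pruning; posts_fb is the table from the example and 42 is a placeholder filter value:

import spark.implicits._

// bucket pruning ON (default)
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
spark.table("posts_fb")
  .filter($"month" === 5)
  .filter($"profile_id" === 42)   // 42 is a placeholder profile_id
  .explain()

// bucket pruning OFF - the whole partition is scanned
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
spark.table("posts_fb")
  .filter($"month" === 5)
  .filter($"profile_id" === 42)
  .explain()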
FileScan (string representation)
27#UnifiedDataAnalytics #SparkAISummit
● PartitionFilters
● PushedFilters
● DataFilters
FileScan (string representation)
28#UnifiedDataAnalytics #SparkAISummit
● PartitionFilters
● PushedFilters
● DataFilters
● Format
● PartitionCount (partition pruning)
● SelectedBucketsCount (bucket pruning)
Exchange
29#UnifiedDataAnalytics #SparkAISummit
● Represents shuffle - physical data movement on the cluster
○ usually quite expensive
Exchange
30#UnifiedDataAnalytics #SparkAISummit
HashPartitioning
● groupBy
● distinct
● join
● repartition($"key")
● Window.partitionBy("key")
Exchange
31#UnifiedDataAnalytics #SparkAISummit
HashPartitioning
● groupBy
● distinct
● join
● repartition($"key")
● Window.partitionBy("key")
SinglePartition
● Window.partitionBy()
Exchange
32#UnifiedDataAnalytics #SparkAISummit
HashPartitioning
● groupBy
● distinct
● join
● repartition($"key")
● Window.partitionBy("key")
SinglePartition
● Window.partitionBy()
RoundRobinPartitioning
● repartition(10)
Exchange
33#UnifiedDataAnalytics #SparkAISummit
HashPartitioning
● groupBy
● distinct
● join
● repartition($"key")
● Window.partitionBy("key")
SinglePartition
● Window.partitionBy()
RoundRobinPartitioning
● repartition(10)
RangePartitioning
● orderBy
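A minimal sketch (not from the slides; df is a tiny ad-hoc DataFrame with a key column) showing the transformations listed above; each introduces an Exchange with the corresponding partitioning, visible via explain():

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = Seq((1, 10), (1, 20), (2, 30)).toDF("key", "value")

df.groupBy("key").sum("value").explain()   // Exchange hashpartitioning(key)
df.repartition($"key").explain()           // Exchange hashpartitioning(key)
df.withColumn("rn",
  row_number().over(Window.partitionBy("key").orderBy("value"))).explain()  // hashpartitioning(key)
df.withColumn("rn",
  row_number().over(Window.orderBy("value"))).explain()                     // SinglePartition
df.repartition(10).explain()               // Exchange RoundRobinPartitioning(10)
df.orderBy("value").explain()              // Exchange rangepartitioning(value)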
Aggregate
34#UnifiedDataAnalytics #SparkAISummit
HashAggregate
PP
SortAggregate
ObjectHashAggregate
Chosen in Aggregation strategy
● Represents data aggregation
Aggregate
35#UnifiedDataAnalytics #SparkAISummit
● groupBy
● distinct
● dropDuplicates
Aggregate
HashAggregate
LP
PP
SortAggregate
ObjectHashAggregate
Chosen in Aggregation strategy
Aggregate
36#UnifiedDataAnalytics #SparkAISummit
● groupBy
● distinct
● dropDuplicates
Aggregate
HashAggregate
LP
PP
SortAggregate
ObjectHashAggregate
Chosen in Aggregation strategy
df.groupBy("profile_id")
.agg(sum("interactions"))
Aggregate
37#UnifiedDataAnalytics #SparkAISummit
● groupBy
● distinct
● dropDuplicates
Aggregate
HashAggregate
LP
PP
Usually comes in a pair
○ partial_sum
○ finalmerge_sum
SortAggregate
ObjectHashAggregate
Chosen in Aggregation strategy
df.groupBy("profile_id")
.agg(sum("interactions"))
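A minimal sketch of the aggregation above on a tiny ad-hoc DataFrame (an assumption for illustration, not the slide's table); the physical plan shows the pair of HashAggregate operators, a partial aggregation before the Exchange and a final one after it:

import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = Seq((1, 10), (1, 20), (2, 30)).toDF("profile_id", "interactions")

df.groupBy("profile_id")
  .agg(sum("interactions"))
  .explain()
// expected shape: HashAggregate (final) <- Exchange hashpartitioning(profile_id) <- HashAggregate (partial)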
SortMergeJoin
38#UnifiedDataAnalytics #SparkAISummit
● Represents joining two
DataFrames
● Exchange & Sort often
before SortMergeJoin
but not necessarily (see
later)
BroadcastHashJoin
39#UnifiedDataAnalytics #SparkAISummit
BroadcastExchange
● data from this branch are
broadcasted to each executor
BroadcastHashJoin
● Represents joining two DataFrames
BroadcastHashJoin
40#UnifiedDataAnalytics #SparkAISummit
Two jobs: broadcasting the data is
handled in a separate job
● Represents joining two DataFrames
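A minimal sketch (posts and profiles are the placeholder DataFrames from the examples); without a hint the JoinSelection strategy picks SortMergeJoin or BroadcastHashJoin based on size estimates, while the broadcast hint forces a BroadcastHashJoin:

import org.apache.spark.sql.functions.broadcast

// join strategy chosen by JoinSelection (size estimates, hints)
posts.join(profiles, Seq("profile_id")).explain()

// BroadcastHashJoin forced with the broadcast hint
posts.join(broadcast(profiles), Seq("profile_id")).explain()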
Rules applied in preparation
• EnsureRequirements
• ReuseExchange
• ...
41#UnifiedDataAnalytics #SparkAISummit
Spark Plan Executed Plan
Additional Rules
EnsureRequirements
● Adds Exchange & Sort operators to the plan
● How does it work?
● Each operator in PP has:
○ outputPartitioning
○ outputOrdering
○ requiredChildDistribution (this is a requirement for partitioning)
○ requiredChildOrdering (this is a requirement for ordering)
● If the requirement is not met, Exchange or Sort are used
42#UnifiedDataAnalytics #SparkAISummit
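A minimal sketch (not from the slides) of the rule in action: two tiny non-bucketed DataFrames are joined, auto-broadcast is disabled so the planner picks SortMergeJoin, and EnsureRequirements inserts the Exchange and Sort operators the join requires:

import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

// force SortMergeJoin on this tiny demo data by disabling auto-broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

left.join(right, Seq("id")).explain()
// EnsureRequirements adds Exchange hashpartitioning(id) and Sort(id) below the SortMergeJoin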
43#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
44#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
45#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
outputPartitioning: Unknown
outputOrdering: None
outputPartitioning: Unknown
outputOrdering: None
46#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
outputPartitioning: Unknown
outputOrdering: None
outputPartitioning: Unknown
outputOrdering: None
47#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
Exchange Exchange
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
satisfies this
satisfies this
48#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
Exchange Exchange
Sort Sort
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
Example with bucketing
● If the sources are tables bucketed by id into 50 buckets:
○ FileScan will have outputPartitioning as HashPartitioning(id, 50)
○ It will pass it to Project
○ This will happen in both branches of the Join
○ The requirement for partitioning is met => no Exchange needed
49#UnifiedDataAnalytics #SparkAISummit
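A minimal sketch of this setup (dfA, dfB and the paths are placeholders, not from the slides): both sides are written bucketed by id into 50 buckets and then joined; with matching bucketing the Executed Plan contains no Exchange, and whether a Sort is still needed depends on having one file per bucket, as discussed on the next slides:

dfA.write
  .bucketBy(50, "id")
  .sortBy("id")
  .option("path", "/tmp/tableA")   // placeholder path
  .saveAsTable("tableA")

dfB.write
  .bucketBy(50, "id")
  .sortBy("id")
  .option("path", "/tmp/tableB")   // placeholder path
  .saveAsTable("tableB")

spark.table("tableA")
  .join(spark.table("tableB"), Seq("id"))
  .explain()   // SortMergeJoin with no Exchange below it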
50#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
51#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
52#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
requires:
HashPartitioning(id, 200)
order by id
requires:
HashPartitioning(id, 200)
order by id
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
outputPartitioning:
HashPartitioning(id, 50)
No need for Exchange
OK OK
What about the Sort?
53#UnifiedDataAnalytics #SparkAISummit
dataDF
.write
.bucketBy(50, "id")
.sortBy("id")
.option("path", outputPath)
.saveAsTable(tableName)
It depends on bucketing:
What about the Sort?
54#UnifiedDataAnalytics #SparkAISummit
dataDF
.write
.bucketBy(50, "id")
.sortBy("id")
.option("path", outputPath)
.saveAsTable(tableName)
It depends on bucketing:
● If one file per bucket is created
○ FileScan will pick up the information
about order
○ No need for Sort in the final plan
What about the Sort?
55#UnifiedDataAnalytics #SparkAISummit
dataDF
.write
.bucketBy(50, "id")
.sortBy("id")
.option("path", outputPath)
.saveAsTable(tableName)
It depends on bucketing:
● If one file per bucket is created
○ FileScan will pick up the information
about order
○ No need for Sort in the final plan
● If more files per bucket
○ FileScan will not use information
about order (data has to be sorted
globally)
○ Spark will add Sort to the final plan
56#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
Final Executed Plan if:
1. bucketed by id
2. sorted by id (one
file/bucket)
57#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
Project Project
FileScan FileScan
Sort Sort
Final Executed Plan if:
1. bucketed by id
2. not sorted or
3. sorted but more files/bucket
ReuseExchange
● Result of each Exchange (shuffle) is written to disk (shuffle-write)
58#UnifiedDataAnalytics #SparkAISummit
● The rule checks if the plan has the same Exchange-branches
○ if the branches are the same, the shuffle-write will also be the same
● Spark can reuse it since the data (shuffle-write) is persisted on disks
59#UnifiedDataAnalytics #SparkAISummit
Union
Project Project
FileScan FileScan
Exchange Exchange
Sort
Window
Generate
Project
The two sub-branches may become the
same in some situations
60#UnifiedDataAnalytics #SparkAISummit
Union
Project Project
FileScan FileScan
Exchange Exchange
Sort
Window
Generate
Union
Project
FileScan
Exchange
Sort
Window
Generate
Project Project
If the branches are
the same, only one
will be computed and
Spark will reuse it
ReuseExchange
● The branches must be the same
● Only Exchange can be reused
● Can be turned off by internal configuration
○ spark.sql.exchange.reuse
● Otherwise you cannot control it directly, but there is an indirect way:
tweaking the query (see the example later)
● It reduces the I/O and network cost:
○ One scan over the data
○ One shuffle write
61#UnifiedDataAnalytics #SparkAISummit
Part I conclusion
● Physical Plan operators carry information about execution
○ it is useful to pair information from the query plan with info about stages/tasks
● ReuseExchange helps reduce I/O and network cost
○ if Exchange sub-branches are the same, Spark will compute them only once
● EnsureRequirements adds Exchange and Sort to the query plan
○ it makes sure that requirements of the operators are met
62#UnifiedDataAnalytics #SparkAISummit
David Vrba, Socialbakers
Physical Plans in Spark SQL - continued
#UnifiedDataAnalytics #SparkAISummit
Part I recap
● We covered
○ Query Execution (physical planning)
○ Two preparation rules:
■ EnsureRequirements
■ ReuseExchange
64#UnifiedDataAnalytics #SparkAISummit
Part II
● Model examples with optimizations
● Some useful tips
65#UnifiedDataAnalytics #SparkAISummit
Data
66#UnifiedDataAnalytics #SparkAISummit
Table A: posts (messages published on FB)
post_id | profile_id | interactions | date
1 | 1 | 20 | 2019-01-01
2 | 1 | 15 | 2019-01-01
3 | 1 | 50 | 2019-02-01
Example I - exchange reuse
67#UnifiedDataAnalytics #SparkAISummit
● Take all profiles where
○ sum of interactions is bigger than 100
○ or sum of interactions is less than 20
68#UnifiedDataAnalytics #SparkAISummit
df.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100 || $"sum" < 20)
.write(...)
Simple query:
69#UnifiedDataAnalytics #SparkAISummit
df.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100 || $"sum" < 20)
.write(...)
val dfSumBig = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100)
val dfSumSmall = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
dfSumBig.union(dfSumSmall)
.write(...)
Simple query:
It can also be written in terms of a union
70#UnifiedDataAnalytics #SparkAISummit
Typical plans with unions:
Union
71#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100)
No need to optimize
val dfSumSmall = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
dfSumBig.union(dfSumSmall)
.write(...)
This Exchange is
reused. The data is
scanned only once
In our example:
72#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100)
val dfSumSmall = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
dfSumBig.union(dfSumSmall)
.write(...)
Let’s suppose the assignment has changed:
● For dfSumSmall we want to consider
only specific profiles (profile_id is not
null)
73#UnifiedDataAnalytics #SparkAISummit
dfSumBig.union(dfSumSmall)
.write(...)
val dfSumSmall = df
.filter($"profile_id".isNotNull)
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
val dfSumBig = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100)
Let’s suppose the assignment has changed:
● For dfSumSmall we want to consider
only specific profiles (profile_id is not
null)
74#UnifiedDataAnalytics #SparkAISummit
Now add an additional filter to one DF:
val dfSumBig = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" > 100)
val dfSumSmall = df
.filter($"profile_id".isNotNull)
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
dfSumBig.union(dfSumSmall)
.write(...)
val dfSumSmall = df
.groupBy("profile_id")
.agg(sum("interactions").alias("sum"))
.filter($"sum" < 20)
.filter($"profile_id".isNotNull)
● How can we optimize this?
● Just calling the filter in a different place does not help
75#UnifiedDataAnalytics #SparkAISummit
The Spark optimizer will move the filter back by
applying the PushDownPredicate rule
● We can limit the optimizer by excluding the rule
76#UnifiedDataAnalytics #SparkAISummit
spark.conf.set(
"spark.sql.optimizer.excludeRules",
"org.apache.spark.sql.catalyst.optimizer.PushDownPredicate"
)
val dfSumSmall = df
.groupBy("profile_id")
.agg(sum("interactions").alias("metricValue"))
.filter($"metricValue" < 20)
.filter($"profile_id".isNotNull)
Now the Filter has both conditions
Limiting the optimizer
● Available since Spark 2.4
● Useful if you need to change the order of operators in the plan
○ reposition Filter, Exchange
● Queries:
○ one data source (which is expensive to read)
○ with multiple computations (using groupBy or Window)
■ combined together using Union or Join
77#UnifiedDataAnalytics #SparkAISummit
Reused computation
● A similar effect to ReuseExchange can also be achieved by caching
○ caching may not be that useful with large datasets (if the data does not fit into the
caching layer)
○ caching adds overhead when putting the data into the caching
layer (memory or disk)
78#UnifiedDataAnalytics #SparkAISummit
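As an illustrative alternative (a sketch, not from the slides; df and outputPath are the placeholders used earlier), the aggregated DataFrame could be cached once and reused by both filters instead of relying on ReuseExchange:

import spark.implicits._
import org.apache.spark.sql.functions.sum

val dfSum = df
  .groupBy("profile_id")
  .agg(sum("interactions").alias("sum"))
  .cache()

val dfSumBig   = dfSum.filter($"sum" > 100)
val dfSumSmall = dfSum.filter($"sum" < 20)

dfSumBig.union(dfSumSmall).write.parquet(outputPath)  // outputPath is a placeholder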
Example II
● Get only records (posts) with max interactions for each profile
79#UnifiedDataAnalytics #SparkAISummit
Input:
post_id | profile_id | interactions | date
1 | 1 | 20 | 2019-01-01
2 | 1 | 50 | 2019-01-01
3 | 1 | 50 | 2019-02-01
4 | 2 | 0 | 2019-01-01
5 | 2 | 100 | 2019-03-01

Result:
post_id | profile_id | interactions | date
2 | 1 | 50 | 2019-01-01
3 | 1 | 50 | 2019-02-01
Assume profile_id has non-null values.
Example II
● Three common ways to write the query:
○ Using Window function
○ Using groupBy + join
○ Using correlated subquery in SQL
80#UnifiedDataAnalytics #SparkAISummit
Which one is the most efficient?
81#UnifiedDataAnalytics #SparkAISummit
val w = Window.partitionBy("profile_id")
posts
.withColumn("maxCount", max("interactions").over(w))
.filter($"interactions" === $"maxCount")
Using Window:
82#UnifiedDataAnalytics #SparkAISummit
val maxCount = posts
.groupBy("profile_id")
.agg(max("interactions").alias("maxCount"))
posts.join(maxCount, Seq("profile_id"))
.filter($"interactions" === $"maxCount")
Using groupBy + join:
83#UnifiedDataAnalytics #SparkAISummit
posts.createOrReplaceTempView("postsView")
val query = """
SELECT *
FROM postsView v1
WHERE interactions = (
select max(interactions) from postsView v2 where v1.profile_id = v2.profile_id
)
"""
spark.sql(query)
Using correlated subquery (equivalent plan with join + groupBy):
84#UnifiedDataAnalytics #SparkAISummit
85#UnifiedDataAnalytics #SparkAISummit
Query with window Query with join or
correlated subquery
SortMergeJoin BroadcastHashJoin
86#UnifiedDataAnalytics #SparkAISummit
Query with window Query with join or
correlated subquery
SortMergeJoin BroadcastHashJoin
Exchange + Sort
87#UnifiedDataAnalytics #SparkAISummit
Query with window Query with join or
correlated subquery
SortMergeJoin BroadcastHashJoin
2 Exchanges +
1 Sort
Exchange + Sort
88#UnifiedDataAnalytics #SparkAISummit
Query with window Query with join or
correlated subquery
SortMergeJoin BroadcastHashJoin
2 Exchanges +
1 Sort
Exchange + Sort
reduced shuffle
HashAggregate
89#UnifiedDataAnalytics #SparkAISummit
Query with window Query with join or
correlated subquery
SortMergeJoin BroadcastHashJoin
2 Exchanges +
1 Sort
Exchange + Sort
reduced shuffle
broadcast exchange
HashAggregate
Window vs join + groupBy
90#UnifiedDataAnalytics #SparkAISummit
Both tables are large
● Go with window
● SortMergeJoin will be
more expensive (3
exchanges, 2 sorts)
Window vs join + groupBy
91#UnifiedDataAnalytics #SparkAISummit
Both tables are large
● Go with window
● SortMergeJoin will be
more expensive (3
exchanges, 2 sorts)
Reduced table is much smaller
and can be broadcasted
● Go with broadcast join - it
will be much faster
Window vs join + groupBy
92#UnifiedDataAnalytics #SparkAISummit
Both tables are small and
comparable in size:
● It is not a big deal
● Broadcast will also
generate traffic
Both tables are large
● Go with window
● SortMergeJoin will be
more expensive (3
exchanges, 2 sorts)
Reduced table is much smaller
and can be broadcasted
● Go with broadcast join - it
will be much faster
Example III
● Sum interactions for each profile and each date
● Join additional table about profiles
93#UnifiedDataAnalytics #SparkAISummit
Table B: profiles (Facebook pages)
profile_id | about | lang
1 | "some string" | en
2 | "some other string" | en
Example III
● Sum interactions for each profile and each date
● Join additional table about profiles
94#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Table B: profiles (Facebook pages)
profile_id | about | lang
1 | "some string" | en
2 | "some other string" | en
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
95#UnifiedDataAnalytics #SparkAISummit
3 Exchange operators => 3 shuffles
96#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
97#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
SortMergeJoin (profile_id)
It requires (strictly)
● HashPartitioning (profile_id)
98#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
SortMergeJoin (profile_id)
It requires (strictly)
● HashPartitioning (profile_id)
HashPartitioning (profile_id)
99#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
SortMergeJoin (profile_id)
It requires (strictly)
● HashPartitioning (profile_id)
HashPartitioning (profile_id)
HashAggregate (profile_id, date)
It requires
● HashPartitioning (profile_id, date)
○ Or any subset of these cols
100#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
HashAggregate (profile_id, date)
It requires
● HashPartitioning (profile_id, date)
○ Or any subset of these cols
HashPartitioning (profile_id, date)
SortMergeJoin (profile_id)
It requires (strictly)
● HashPartitioning (profile_id)
HashPartitioning (profile_id)
101#UnifiedDataAnalytics #SparkAISummit
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Add repartition and eliminate one shuffle
102#UnifiedDataAnalytics #SparkAISummit
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Add repartition and eliminate one shuffle
Generated by strategy
HashPartitioning (profile_id)
103#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle
HashPartitioning (profile_id)
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Generated by strategy
104#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle
HashPartitioning (profile_id)
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Generated by strategy
105#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle
HashPartitioning (profile_id)
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Generated by strategy
OK with
106#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle
HashPartitioning (profile_id)
posts
.repartition("profile_id")
.groupBy("profile_id", "date")
.agg(sum("interactions"))
.join(profiles, Seq("profile_id"))
Generated by strategy
OK with
OK with
Adding repartition
● What is the cost of using it?
○ Now we shuffle all data
○ Before we shuffled reduced dataset
107#UnifiedDataAnalytics #SparkAISummit
108#UnifiedDataAnalytics #SparkAISummit
reduced shuffle
(shuffled data
are all distinct
combinations of
profile_id &
date)
Before using
repartition
HashAggregate
109#UnifiedDataAnalytics #SparkAISummit
reduced shuffle
(shuffled data
are all distinct
combinations of
profile_id &
date)
total shuffle
(all data is
shuffled)
Before using
repartition
After using
repartition
HashAggregate
HashAggregate
Adding repartition
● What is the cost of using it?
○ Now we shuffle all data
○ Before we shuffled reduced dataset
● What is more efficient?
○ depends on properties of data
■ here the cardinality of distinct (profile_id, date)
110#UnifiedDataAnalytics #SparkAISummit
Adding repartition
111#UnifiedDataAnalytics #SparkAISummit
Use repartition to reduce the number of shuffles if:
● Cardinality of (profile_id, date) is comparable with row_count
● Each profile has only a few posts per date
● The data is not reduced much by the groupBy aggregation
● the reduced shuffle is comparable with the full shuffle
Use the original plan if:
● Cardinality of (profile_id, date) is much lower than row_count
● Each profile has many posts per date
● The data is reduced a lot by the groupBy aggregation
● the reduced shuffle is much smaller than the full shuffle
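As a rough illustrative check (a sketch, not from the slides; posts is the placeholder table), the cardinality of (profile_id, date) can be compared with the total row count to decide between the two plans:

val totalRows    = posts.count()
val distinctKeys = posts.select("profile_id", "date").distinct().count()

println(s"rows = $totalRows, distinct (profile_id, date) = $distinctKeys")
// distinctKeys close to totalRows -> groupBy reduces little -> repartition first (one shuffle less)
// distinctKeys much lower than totalRows -> partial aggregation shrinks the shuffle -> keep the original plan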
Useful tip I
● Filters are not pushed through nondeterministic expressions
112#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id")
.agg(collect_list("interactions"))
.filter($"profile_id".isNotNull)
Useful tip I
● Filters are not pushed through nondeterministic expressions
113#UnifiedDataAnalytics #SparkAISummit
posts
.groupBy("profile_id")
.agg(collect_list("interactions"))
.filter($"profile_id".isNotNull)
Exchange
Useful tip I
● Filters are not pushed through nondeterministic expressions
114#UnifiedDataAnalytics #SparkAISummit
posts
.filter($"profile_id".isNotNull)
.groupBy("profile_id")
.agg(collect_list("interactions"))
Make sure your filters are positioned in the right place to achieve
efficient execution
Exchange
Nondeterministic expressions
115#UnifiedDataAnalytics #SparkAISummit
● collect_list
● collect_set
● first
● last
● input_file_name
● spark_partition_id
● monotonically_increasing_id
● rand
● randn
● shuffle
Useful tip IIa
● Important settings related to BroadcastHashJoin:
116#UnifiedDataAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold
● Default value is 10MB
● Spark will broadcast if
○ Spark thinks that the size of the data is less
○ or you use broadcast hint
Useful tip IIa
● Important settings related to BroadcastHashJoin:
117#UnifiedDataAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold
● Default value is 10MB
● Spark will broadcast if
○ Spark thinks that the size of the data is less
○ or you use broadcast hint
Compute stats to make
good estimates
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
Useful tip IIa
● Important settings related to BroadcastHashJoin:
118#UnifiedDataAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold
● Default value is 10MB
● Spark will broadcast if
○ Spark thinks that the size of the data is less
○ or you use broadcast hint
Compute stats to make
good estimates
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
spark.sql.cbo.enabled
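A minimal sketch of the settings above (table_name is the slide's placeholder; profile_id and date are illustrative column names):

// raise the broadcast threshold to ~50 MB (default is 10 MB); the value is in bytes
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50L * 1024 * 1024).toString)

// compute table and column statistics so size estimates are accurate
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS profile_id, date")

// enable the cost-based optimizer so the column statistics are used
spark.conf.set("spark.sql.cbo.enabled", "true")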
Useful tip IIb
● Important settings related to BroadcastHashJoin:
119#UnifiedDataAnalytics #SparkAISummit
spark.sql.broadcastTimeout
● Default value is 300s
Useful tip IIb
● Important settings related to BroadcastHashJoin:
120#UnifiedDataAnalytics #SparkAISummit
spark.sql.broadcastTimeout
● Default value is 300s
● If Spark does not make it in time:
SparkException: Could not execute broadcast in 300 secs. You can increase the
timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast
join by setting spark.sql.autoBroadcastJoinThreshold to -1
3 basic solutions:
1. Disable the broadcast by setting the threshold to -1
2. Increase the timeout
3. Use caching
121#UnifiedDataAnalytics #SparkAISummit
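A minimal sketch of the first two solutions (the values are illustrative); the third solution, caching, is shown on the next slides:

// 1. disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// 2. increase the broadcast timeout (in seconds, default 300)
spark.conf.set("spark.sql.broadcastTimeout", "1200")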
122#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy("profile_id")
.agg(...some_aggregation...)
posts
.join(broadcast(df), Seq("profile_id"))
Intense transformations:
● udf call
● aggregation, …
Computation may take longer than
5 minutes
If the size of df is small we want to
broadcast it
123#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy("profile_id")
.agg(...some_aggregation...)
.cache()
posts
.join(broadcast(df), Seq("profile_id"))
df.count()
Three jobs:
1) count will write the data to memory
2) broadcast (fast because taken from RAM)
3) join - will leverage broadcasted data
If the size of df is small we want to
broadcast it
We can use caching (but we have
to materialize it immediately)
Conclusion
● Understanding the physical plan is important
● By limiting the optimizer you can achieve a reused Exchange
● Choice between Window vs groupBy+join depends on data properties
● Adding repartition can avoid an unnecessary Exchange
○ considering the data properties is important
● Be aware of nondeterministic expressions
● Fine-tune broadcast joins with configuration settings
○ make sure to have good size estimates using CBO
124#UnifiedDataAnalytics #SparkAISummit
Questions
125#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 

Physical Plans in Spark SQL

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. David Vrba, Socialbakers Physical Plans in Spark SQL #UnifiedDataAnalytics #SparkAISummit
  • 3. ● David Vrba Ph.D. ● Data Scientist & Data Engineer at Socialbakers ○ Developing ETL pipelines in Spark ○ Optimizing Spark jobs ○ Productionalizing Spark applications ● Lecturing Spark trainings and workshops ○ Studying Spark source code 3#UnifiedDataAnalytics #SparkAISummit
  • 4. Goal ● Share what we have learned by studying Spark source code and by using Spark on daily basis ○ Processing data from social networks (Facebook, Twitter, Instagram, ...) ○ Regularly processing data on scales up to 10s of TBs ○ Understanding the query plan is a must for efficient processing 4#UnifiedDataAnalytics #SparkAISummit
  • 5. Outline ● Two part talk ○ In the first part we cover some theory ■ How query execution works ■ Physical Plan operators ■ Where to look for relevant information in Spark UI ○ In the second part we show some examples ■ Model queries with particular optimizations ■ Useful tips ○ Q/A 5#UnifiedDataAnalytics #SparkAISummit
  • 6. Part I ● Query Execution ● Physical Plan Operators ● Spark UI 6#UnifiedDataAnalytics #SparkAISummit
  • 7. Query Execution 7#UnifiedDataAnalytics #SparkAISummit Query Logical Planning Parser Unresolved Plan Analyzer Analyzed Plan Cache Manager Optimizer Optimized Plan Physical Planning Query Planner Spark Plan Preparation Executed Plan Execution RDD DAG Task Scheduler DAG Scheduler Stages + Tasks Executor Today’s session
  • 8. Logical Planning ● Logical plan is created, analyzed and optimized ● Logical plan ○ Tree representation of the query ○ It is an abstraction that carries information about what is supposed to happen ○ It does not contain precise information on how it happens ○ Composed of ■ Relational operators – Filter, Join, Project, ... (they represent DataFrame transformations) ■ Expressions – Column transformations, filtering conditions, joining conditions, ... 8#UnifiedDataAnalytics #SparkAISummit
  • 9. Physical Planning ● In the phase of logical planning we get Optimized Logical Plan ● Execution layer does not understand DataFrame / Logical Plan (LP) ● Logical Plan has to be converted to a Physical Plan (PP) ● Physical Plan ○ Bridge between LP and RDDs ○ Similarly to Logical Plan it is a tree ○ Contains more specific description of how things should happen (specific choice of algorithms) ○ Uses lower level primitives - RDDs 9#UnifiedDataAnalytics #SparkAISummit
  • 10. Physical Planning - 2 phases 10#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Additional Rules
  • 11. Physical Planning - 2 phases 11#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Additional Rules Join SortMergeJoin BroadcastHashJoin In Logical Plan: In Physical Plan: Strategy example: JoinSelection
  • 12. Physical Planning - 2 phases 12#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan Additional Rules ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP ● Final version of query plan ● This will be executed ○ generates RDD code
  • 13. ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Physical Planning - 2 phases 13#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan ● Final version of query plan ● This will be executed ○ generates RDD code Additional Rules df.queryExecution.sparkPlan df.queryExecution.executedPlan df.explain() See the plans:
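For readers who want to reproduce this, here is a minimal sketch in Scala (the query itself is illustrative and not taken from the talk) showing where the Spark Plan and the Executed Plan can be inspected on any DataFrame:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("physical-plans").getOrCreate()
    import spark.implicits._

    // an illustrative query
    val df = spark.range(100).groupBy(($"id" % 10).as("key")).count()

    df.queryExecution.sparkPlan     // Spark Plan: produced by the strategies, before the preparation rules
    df.queryExecution.executedPlan  // Executed Plan: after the preparation rules, this is what actually runs
    df.explain(true)                // prints the parsed, analyzed and optimized logical plans plus the physical plan

The later sketches in this transcript reuse the same spark session and imports.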
  • 14. 14 SQL Job IDs Click here to see the query plan Spark UI
  • 15. 15 Graphical representation of the Physical Plan - executed plan Details - string representation (LP + PP)
  • 16. 16
  • 17. Let’s see some operators ● FileScan ● Exchange ● HashAggregate, SortAggregate, ObjectHashAggregate ● SortMergeJoin ● BroadcastHashJoin 17#UnifiedDataAnalytics #SparkAISummit
  • 18. FileScan 18#UnifiedDataAnalytics #SparkAISummit ● Represents reading the data from a file format spark.table(“posts_fb”) .filter($“month” === 5) .filter($”profile_id” === ...) ● table: posts_fb ● partitioned by month ● bucketed by profile_id, 20b ● 1 file per bucket
  • 19. FileScan 19#UnifiedDataAnalytics #SparkAISummit number of files read size of files read total rows output filesystem read data size total
  • 20. FileScan 20#UnifiedDataAnalytics #SparkAISummit It is useful to pair these numbers with information in Jobs and Stages tab in Spark UI
  • 21. 21#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b)
  • 22. 22#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b)
  • 23. 23#UnifiedDataAnalytics #SparkAISummit 697.9 KB We read from bucketed table (20b, 1f/b)
  • 24. 24#UnifiedDataAnalytics #SparkAISummit Bucket pruning: 0B 697.9 KB We read from bucketed table (20b, 1f/b)
  • 25. 25#UnifiedDataAnalytics #SparkAISummit Bucket pruning: 0B 697.9 KB We read from bucketed table (20b, 1f/b) spark.sql.sources.bucketing.enabled
  • 26. 26#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b) Size of the whole partition. This will be read if we turn off bucketing (no bucket pruning) If bucketing is OFF:
  • 27. FileScan (string representation) 27#UnifiedDataAnalytics #SparkAISummit PartitionFilters PushedFilters DataFilters
  • 28. FileScan (string representation) 28#UnifiedDataAnalytics #SparkAISummit PartitionFilters PushedFilters PartitionCount (partition pruning) DataFilters Format SelectedBucketsCount (bucket pruning)
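As a rough sketch of where these entries come from, consider a read of the posts_fb table described earlier (partitioned by month, bucketed by profile_id); the literal filter values below are placeholders:

    val q = spark.table("posts_fb")
      .filter($"month" === 5)          // partition column  -> appears under PartitionFilters (partition pruning)
      .filter($"profile_id" === 42L)   // bucketing column  -> appears under DataFilters / PushedFilters

    q.explain()
    // the FileScan line of the output lists Format, PartitionFilters, PushedFilters and DataFilters;
    // depending on the Spark version it also shows PartitionCount and SelectedBucketsCount (bucket pruning)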
  • 29. Exchange 29#UnifiedDataAnalytics #SparkAISummit ● Represents shuffle - physical data movement on the cluster ○ usually quite expensive
  • 30. Exchange 30#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”)
  • 31. Exchange 31#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy()
  • 32. Exchange 32#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy() RoundRobinPartitioning ● repartition(10)
  • 33. Exchange 33#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy() RoundRobinPartitioning ● repartition(10) RangePartitioning ● orderBy
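A small sketch tying the four partitioning schemes to transformations that typically produce them (column names are illustrative; the 200 in the comments is just the default value of spark.sql.shuffle.partitions):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val data = spark.range(1000).withColumn("key", $"id" % 10)

    data.groupBy("key").count().explain()                                      // Exchange hashpartitioning(key, 200)
    data.repartition($"key").explain()                                         // Exchange hashpartitioning(key, 200)
    data.withColumn("total", sum("id").over(Window.partitionBy())).explain()   // Exchange SinglePartition
    data.repartition(10).explain()                                             // Exchange RoundRobinPartitioning(10)
    data.orderBy("key").explain()                                              // Exchange rangepartitioning(key, 200)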
  • 35. Aggregate 35#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP SortAggregate ObjectHashAggregate Chosen in Aggregation strategy
  • 36. Aggregate 36#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP SortAggregate ObjectHashAggregate Chosen in Aggregation strategy df.groupBy(“profile_id”) .agg(sum(“interactions”))
  • 37. Aggregate 37#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP Usually comes in pair ○ partial_sum ○ finalmerge_sum SortAggregate ObjectHashAggregate Chosen in Aggregation strategy df.groupBy(“profile_id”) .agg(sum(“interactions”))
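The partial/final pair, and the choice between the physical aggregate implementations, can be observed by explaining two small aggregations; a sketch with an inline DataFrame standing in for the posts table:

    val posts = Seq((1L, 20L), (1L, 15L), (2L, 50L)).toDF("profile_id", "interactions")

    // two HashAggregate nodes around the Exchange:
    // partial aggregation before the shuffle, final merge after it
    posts.groupBy("profile_id").agg(sum("interactions")).explain()

    // aggregation functions with object-based buffers (e.g. collect_list) are typically
    // planned as ObjectHashAggregate instead of HashAggregate
    posts.groupBy("profile_id").agg(collect_list("interactions")).explain()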
  • 38. SortMergeJoin 38#UnifiedDataAnalytics #SparkAISummit ● Represents joining two DataFrames ● Exchange & Sort often before SortMergeJoin but not necessarily (see later)
  • 39. BroadcastHashJoin 39#UnifiedDataAnalytics #SparkAISummit BroadcastExchange ● data from this branch are broadcasted to each executor BroadcastHashJoin ● Represents joining two DataFrames
  • 40. BroadcastHashJoin 40#UnifiedDataAnalytics #SparkAISummit Two jobs: broadcasting the data is handled in a separate job ● Represents joining two DataFrames
  • 41. Rules applied in preparation • EnsureRequirements • ReuseExchange • ... 41#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan Additional Rules
  • 42. EnsureRequirements ● Adds Exchange & Sort operators to the plan ● How does it work? ● Each operator in PP has: ○ outputPartitioning ○ outputOrdering ○ requiredChildDistribution (this is a requirement for partitioning) ○ requiredChildOrdering (this is a requirement for ordering) ● If a requirement is not met, an Exchange or Sort is added 42#UnifiedDataAnalytics #SparkAISummit
  • 44. 44#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id
  • 45. 45#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: Unknown outputOrdering: None outputPartitioning: Unknown outputOrdering: None
  • 46. 46#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: Unknown outputOrdering: None outputPartitioning: Unknown outputOrdering: None
  • 47. 47#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Exchange Exchange requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id satisfies this satisfies this
  • 48. 48#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Exchange Exchange Sort Sort requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id
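The build-up shown on the last few slides can be reproduced by comparing sparkPlan with executedPlan for a plain, non-bucketed join; a sketch with illustrative inputs, with broadcasting disabled so that a SortMergeJoin is planned:

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)  // force SortMergeJoin for the illustration

    val left  = spark.range(1000).withColumn("left_value", $"id" + 1)
    val right = spark.range(1000).withColumn("right_value", $"id" + 2)
    val joined = left.join(right, Seq("id"))

    println(joined.queryExecution.sparkPlan)    // SortMergeJoin over the scans, no Exchange or Sort yet
    println(joined.queryExecution.executedPlan) // EnsureRequirements added Exchange hashpartitioning(id) + Sort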
  • 49. Example with bucketing ● If the sources are tables bucketed into 50 buckets by id: ○ FileScan will have outputPartitioning as HashPartitioning(id, 50) ○ It will pass it on to Project ○ This will happen in both branches of the Join ○ The requirement for partitioning is met => no Exchange needed 49#UnifiedDataAnalytics #SparkAISummit
  • 50. 50#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50)
  • 51. 51#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50)
  • 52. 52#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) No need for Exchange OK OK
  • 53. What about the Sort ? 53#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing:
  • 54. What about the Sort ? 54#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing: ● If one file per bucket is created ○ FileScan will pick up the information about order ○ No need for Sort in the final plan
  • 55. What about the Sort ? 55#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing: ● If one file per bucket is created ○ FileScan will pick up the information about order ○ No need for Sort in the final plan ● If more files per bucket ○ FileScan will not use information about order (data has to be sorted globally) ○ Spark will add Sort to the final plan
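As a side note that is not on the slides: one way to end up with a single file per bucket is to repartition by the bucketing column before the write, so that each bucket's rows are produced by a single task. A hedged sketch, reusing the dataDF name from the slide, with an illustrative path and table name:

    dataDF
      .repartition(50, $"id")            // each bucket's rows end up in one task => at most one file per bucket
      .write
      .bucketBy(50, "id")
      .sortBy("id")
      .option("path", "/illustrative/output/path")
      .saveAsTable("bucketed_table")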
  • 56. 56#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Final Executed Plan if: 1. bucketed by id 2. sorted by id (one file/bucket)
  • 57. 57#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Sort Sort Final Executed Plan if: 1. bucketed by id 2. not sorted or 3. sorted but more files/bucket
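To check which of these two final plans you actually get, join two such bucketed tables and inspect the plan; a sketch assuming two tables written as above (50 buckets by id, one sorted file per bucket; the table names are placeholders) and broadcasting disabled:

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    spark.table("bucketed_table_a")
      .join(spark.table("bucketed_table_b"), Seq("id"))
      .explain()
    // expected: SortMergeJoin sitting directly on top of the two FileScans,
    // with no Exchange and no Sort added by EnsureRequirements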
  • 58. ReuseExchange ● Result of each Exchange (shuffle) is written to disk (shuffle-write) 58#UnifiedDataAnalytics #SparkAISummit ● The rule checks if the plan has the same Exchange-branches ○ if the branches are the same, the shuffle-write will also be the same ● Spark can reuse it since the data (shuffle-write) is persisted on disks
  • 59. 59#UnifiedDataAnalytics #SparkAISummit Union Project Project FileScan FileScan Exchange Exchange Sort Window Generate Project The two sub-branches may become the same in some situations
  • 60. 60#UnifiedDataAnalytics #SparkAISummit Union Project Project FileScan FileScan Exchange Exchange Sort Window Generate Union Project FileScan Exchange Sort Window Generate Project Project If the branches are the same, only one will be computed and Spark will reuse it
  • 61. ReuseExchange ● The branches must be the same ● Only Exchange can be reused ● Can be turned off by an internal configuration ○ spark.sql.exchange.reuse ● Otherwise you cannot control it directly, but there is an indirect way by tweaking the query (see the example later) ● It reduces the I/O and network cost: ○ One scan over the data ○ One shuffle write 61#UnifiedDataAnalytics #SparkAISummit
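For completeness, a small illustrative sketch of the knob mentioned above (it is an internal flag, so treat this as a sketch rather than a supported API):

    // the rule is on by default; it can be switched off through the internal flag
    spark.conf.set("spark.sql.exchange.reuse", false)
    // ... and back on
    spark.conf.set("spark.sql.exchange.reuse", true)

    // when the rule fires, the duplicated branch shows up in the executed plan
    // as a ReusedExchange node instead of being computed a second time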
  • 62. Part I conclusion ● Physical Plan operators carry information about execution ○ it is useful to pair information from the query plan with info about stages/tasks ● ReuseExchange allows you to reduce I/O and network cost ○ if Exchange sub-branches are the same, Spark will compute them only once ● EnsureRequirements adds Exchange and Sort to the query plan ○ it makes sure that the requirements of the operators are met 62#UnifiedDataAnalytics #SparkAISummit
  • 63. David Vrba, Socialbakers Physical Plans in Spark SQL - continues #UnifiedDataAnalytics #SparkAISummit
  • 64. Part I recap ● We covered ○ Query Execution (physical planning) ○ Two preparation rules: ■ EnsureRequirements ■ ReuseExchange 64#UnifiedDataAnalytics #SparkAISummit
  • 65. Part II ● Model examples with optimizations ● Some useful tips 65#UnifiedDataAnalytics #SparkAISummit
  • 66. Data 66#UnifiedDataAnalytics #SparkAISummit post_id profile_id interactions date 1 1 20 2019-01-01 2 1 15 2019-01-01 3 1 50 2019-02-01 Table A: posts (messages published on FB)
  • 67. Example I - exchange reuse 67#UnifiedDataAnalytics #SparkAISummit ● Take all profiles where ○ sum of interactions is bigger than 100 ○ or sum of interactions is less than 20
  • 69. 69#UnifiedDataAnalytics #SparkAISummit df.groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100 || $”sum” < 20) .write(...) val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) Simple query: Can be written also in terms of union
  • 71. 71#UnifiedDataAnalytics #SparkAISummit val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) No need to optimize val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) This Exchange is reused. The data is scanned only once In our example:
  • 72. 72#UnifiedDataAnalytics #SparkAISummit val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) Let’s suppose the assignment has changed: ● For dfSumSmall we want to consider only specific profiles (profile_id is not null)
  • 73. 73#UnifiedDataAnalytics #SparkAISummit dfSumBig.union(dfSumSmall) .write(...) val dfSumSmall = df .filter($”profile_id”.isNotNull) .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) Let’s suppose the assignment has changed: ● For dfSumSmall we want to consider only specific profiles (profile_id is not null)
  • 74. 74#UnifiedDataAnalytics #SparkAISummit Now add additional filter to one DF:val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .filter($”profile_id”.isNotNull) .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...)
  • 75. val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) .filter($”profile_id”.isNotNull) ● How can we optimize this? ● Just calling the filter in a different place does not help 75#UnifiedDataAnalytics #SparkAISummit The Spark optimizer will move this filter back by applying the PushDownPredicate rule
  • 76. ● We can limit the optimizer to stop the rule 76#UnifiedDataAnalytics #SparkAISummit spark.conf.set( “spark.sql.optimizer.excludeRules”, “org.apache.spark.sql.catalyst.optimizer.PushDownPredicate” ) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“metricValue”)) .filter($”metricValue” < 20) .filter($”profile_id”.isNotNull) It has both the filter conditions
  • 77. Limiting the optimizer ● Available since Spark 2.4 ● Useful if you need to change the order of operators in the plan ○ reposition Filter, Exchange ● Queries: ○ one data source (which is expensive to read) ○ with multiple computations (using groupBy or Window) ■ combined together using Union or Join 77#UnifiedDataAnalytics #SparkAISummit
  • 78. Reused computation ● A similar effect to ReuseExchange can also be achieved by caching ○ caching may not be that useful with large datasets (if the data does not fit into the caching layer) ○ caching incurs additional overhead when putting data into the caching layer (memory or disk) 78#UnifiedDataAnalytics #SparkAISummit
  • 79. Example II ● Get only records (posts) with max interactions for each profile 79#UnifiedDataAnalytics #SparkAISummit post_id profile_id interactions date 1 1 20 2019-01-01 2 1 50 2019-01-01 3 1 50 2019-02-01 4 2 0 2019-01-01 5 2 100 2019-03-01 post_id profile_id interactions date 2 1 50 2019-01-01 3 1 50 2019-02-01 Assume profile_id has non-null values.
  • 80. Example II ● Three common ways to write the query: ○ Using a Window function ○ Using groupBy + join ○ Using a correlated subquery in SQL 80#UnifiedDataAnalytics #SparkAISummit Which one is the most efficient?
  • 81. 81#UnifiedDataAnalytics #SparkAISummit val w = Window.partitionBy(“profile_id”) posts .withColumn(“maxCount”, max(“interactions”).over(w)) .filter($”interactions” === $”maxCount”) Using Window:
  • 82. 82#UnifiedDataAnalytics #SparkAISummit val maxCount = posts .groupBy(“profile_id”) .agg(max(“interactions”).alias(“maxCount”)) posts.join(maxCount, Seq(“profile_id”)) .filter($”interactions” === $”maxCount”) Using groupBy + join:
  • 83. 83#UnifiedDataAnalytics #SparkAISummit posts.createOrReplaceTempView(“postsView”) val query = “““ SELECT * FROM postsView v1 WHERE interactions = ( select max(interactions) from postsView v2 where v1.profile_id = v2.profile_id ) ””” spark.sql(query) Using correlated subquery (equivalent plan with Join+groupBy):
  • 85. 85#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin
  • 86. 86#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin Exchange + Sort
  • 87. 87#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort
  • 88. 88#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort reduced shuffle HashAggregate
  • 89. 89#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort reduced shuffle broadcast exchange HashAggregate
  • 90. Window vs join + groupBy 90#UnifiedDataAnalytics #SparkAISummit Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts)
  • 91. Window vs join + groupBy 91#UnifiedDataAnalytics #SparkAISummit Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts) Reduced table is much smaller and can be broadcasted ● Go with broadcast join - it will be much faster
  • 92. Window vs join + groupBy 92#UnifiedDataAnalytics #SparkAISummit Both tables are small and comparable in size: ● It is not a big deal ● Broadcast will also generate traffic Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts) Reduced table is much smaller and can be broadcasted ● Go with broadcast join - it will be much faster
  • 93. Example III ● Sum interactions for each profile and each date ● Join additional table about profiles 93#UnifiedDataAnalytics #SparkAISummit profile_id about lang 1 “some string” en 2 “some other string” en Table B: profiles (Facebook pages)
  • 94. Example III ● Sum interactions for each profile and each date ● Join additional table about profiles 94#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) profile_id about lang 1 “some string” en 2 “some other string” en Table B: profiles (Facebook pages)
  • 97. 97#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id)
  • 98. 98#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id)
  • 99. 99#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id) HashAggregate (profile_id, date) It requires ● HashPartitioning (profile_id, date) ○ Or any subset of these cols
  • 100. 100#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) HashAggregate (profile_id, date) It requires ● HashPartitioning (profile_id, date) ○ Or any subset of these cols HashPartitioning (profile_id, date) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id)
  • 102. 102#UnifiedDataAnalytics #SparkAISummit posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Add repartition and eliminate one shuffle Generated by strategy HashPartitioning (profile_id)
  • 103. 103#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy
  • 104. 104#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy
  • 105. 105#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy OK with
  • 106. 106#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy OK with OK with
  • 107. Adding repartition ● What is the cost of using it? ○ Now we shuffle all the data ○ Before, we shuffled only the reduced dataset 107#UnifiedDataAnalytics #SparkAISummit
  • 108. 108#UnifiedDataAnalytics #SparkAISummit reduced shuffle (shuffled data are all distinct combinations of profile_id & date) Before using repartition HashAggregate
  • 109. 109#UnifiedDataAnalytics #SparkAISummit reduced shuffle (shuffled data are all distinct combinations of profile_id & date) total shuffle (all data is shuffled) Before using repartition After using repartition HashAggregate HashAggregate
  • 110. Adding repartition ● What is the cost of using it? ○ Now we shuffle all the data ○ Before, we shuffled only the reduced dataset ● What is more efficient? ○ depends on the properties of the data ■ here the cardinality of distinct (profile_id, date) 110#UnifiedDataAnalytics #SparkAISummit
  • 111. Adding repartition 111#UnifiedDataAnalytics #SparkAISummit ● Cardinality of (profile_id, date) is comparable with row_count ● Each profile has only few posts per date ● The data is not reduced much by groupBy aggregation ● the reduced shuffle is comparable with full shuffle ● Cardinality of (profile_id, date) is much lower than row_count ● Each profile has many posts per date ● The data is reduced a lot by groupBy aggregation ● the reduced shuffle is much lower than full shuffle Use repartition to reduce number of shuffles if: Use the original plan if:
  • 112. Useful tip I ● Filters are not pushed through nondeterministic expressions 112#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”) .agg(collect_list(“interactions”)) .filter($”profile_id”.isNotNull )
  • 113. Useful tip I ● Filters are not pushed through nondeterministic expressions 113#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”) .agg(collect_list(“interactions”)) .filter($”profile_id”.isNotNull ) Exchange
  • 114. Useful tip I ● Filters are not pushed through nondeterministic expressions 114#UnifiedDataAnalytics #SparkAISummit posts .filter($”profile_id”.isNotNull ) .groupBy(“profile_id”) .agg(collect_list(“interactions”)) Make sure your filters are positioned at the right place to achieve efficient execution Exchange
  • 115. Nondeterministic expressions 115#UnifiedDataAnalytics #SparkAISummit ● collect_list ● collect_set ● first ● last ● input_file_name ● spark_partition_id ● monotonically_increasing_id ● rand ● randn ● shuffle
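To verify where the Filter ends up relative to the Exchange in the two variants above, compare their plans; a sketch reusing the posts DataFrame from the examples:

    // filter placed after the aggregation: it is not pushed below the Exchange,
    // because collect_list is nondeterministic
    posts.groupBy("profile_id")
      .agg(collect_list("interactions"))
      .filter($"profile_id".isNotNull)
      .explain()

    // filter placed before the aggregation: it runs before the shuffle
    posts.filter($"profile_id".isNotNull)
      .groupBy("profile_id")
      .agg(collect_list("interactions"))
      .explain()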
  • 116. Useful tip IIa ● Important settings related to BroadcastHashJoin: 116#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint
  • 117. Useful tip IIa ● Important settings related to BroadcastHashJoin: 117#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint Compute stats to make good estimates ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
  • 118. Useful tip IIa ● Important settings related to BroadcastHashJoin: 118#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint Compute stats to make good estimates ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ... spark.sql.cbo.enabled
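A sketch of these settings in code; the threshold value is illustrative, and posts / profiles stand for the example tables from Part II (loaded here just for the sake of the example):

    import org.apache.spark.sql.functions.broadcast

    val posts    = spark.table("posts")
    val profiles = spark.table("profiles")

    // raise the automatic broadcast threshold (default 10MB) to 100MB
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

    // or force the broadcast explicitly, regardless of the size estimate
    posts.join(broadcast(profiles), Seq("profile_id"))

    // help Spark make better size estimates
    spark.sql("ANALYZE TABLE profiles COMPUTE STATISTICS FOR COLUMNS profile_id, lang")
    spark.conf.set("spark.sql.cbo.enabled", true)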
  • 119. Useful tip IIb ● Important settings related to BroadcastHashJoin: 119#UnifiedDataAnalytics #SparkAISummit spark.sql.broadcastTimeout ● Default value is 300s
  • 120. Useful tip IIb ● Important settings related to BroadcastHashJoin: 120#UnifiedDataAnalytics #SparkAISummit spark.sql.broadcastTimeout ● Default value is 300s ● If Spark does not make it: SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
  • 121. 3 basic solutions: 1. Disable the broadcast by setting the threshold to -1 2. Increase the timeout 3. Use caching 121#UnifiedDataAnalytics #SparkAISummit
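In code, the three options translate to something like the following (the timeout value is illustrative):

    // 1. disable automatic broadcasting (Spark falls back to a non-broadcast join)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    // 2. give the broadcast more time than the default 300 seconds
    spark.conf.set("spark.sql.broadcastTimeout", 1200L)

    // 3. use caching to materialize the expensive side before the broadcast (see the next slides)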
  • 122. 122#UnifiedDataAnalytics #SparkAISummit val df = profiles .some_udf_call .groupBy(“profile_id”) .agg(...some_aggregation...) posts .join(broadcast(df), Seq(“profile_id”)) Intense transformations: ● udf call ● aggregation, … Computation may take longer than 5mins If size of df is small we want to broadcast
  • 123. 123#UnifiedDataAnalytics #SparkAISummit val df = profiles .some_udf_call .groupBy(“profile_id”) .agg(...some_aggregation...) .cache() posts .join(broadcast(df), Seq(“profile_id”)) df.count() Three jobs: 1) count will write the data to memory 2) broadcast (fast because taken from RAM) 3) join - will leverage broadcasted data If size of df is small we want to broadcast We can use caching (but we have to materialize immediately)
  • 124. Conclusion ● Understanding the physical plan is important ● By limiting the optimizer you can achieve Exchange reuse ● The choice between Window and groupBy+join depends on the data properties ● Adding repartition can avoid an unnecessary Exchange ○ considering the data properties is important ● Be aware of nondeterministic expressions ● Fine-tune broadcast joins with configuration settings ○ make sure to have good size estimates using CBO 124#UnifiedDataAnalytics #SparkAISummit
  • 126. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT