
Physical Plans in Spark SQL


In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.

The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.


Physical Plans in Spark SQL

  1. 1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. David Vrba, Socialbakers Physical Plans in Spark SQL #UnifiedDataAnalytics #SparkAISummit
  3. 3. ● David Vrba Ph.D. ● Data Scientist & Data Engineer at Socialbakers ○ Developing ETL pipelines in Spark ○ Optimizing Spark jobs ○ Productionalizing Spark applications ● Lecturing Spark trainings and workshops ○ Studying Spark source code 3#UnifiedDataAnalytics #SparkAISummit
  4. 4. Goal ● Share what we have learned by studying Spark source code and by using Spark on a daily basis ○ Processing data from social networks (Facebook, Twitter, Instagram, ...) ○ Regularly processing data on scales up to 10s of TBs ○ Understanding the query plan is a must for efficient processing 4#UnifiedDataAnalytics #SparkAISummit
  5. 5. Outline ● Two part talk ○ In the first part we cover some theory ■ How query execution works ■ Physical Plan operators ■ Where to look for relevant information in Spark UI ○ In the second part we show some examples ■ Model queries with particular optimizations ■ Useful tips ○ Q/A 5#UnifiedDataAnalytics #SparkAISummit
  6. 6. Part I ● Query Execution ● Physical Plan Operators ● Spark UI 6#UnifiedDataAnalytics #SparkAISummit
  7. 7. Query Execution 7#UnifiedDataAnalytics #SparkAISummit (pipeline diagram) Logical Planning: Query → Parser → Unresolved Plan → Analyzer → Analyzed Plan → Cache Manager → Optimizer → Optimized Plan. Physical Planning: Query Planner → Spark Plan → Preparation → Executed Plan. Execution: RDD DAG → DAG Scheduler → Task Scheduler → Stages + Tasks → Executor. Today’s session focuses on the planning phases, mainly physical planning.
  8. 8. Logical Planning ● Logical plan is created, analyzed and optimized ● Logical plan ○ Tree representation of the query ○ It is an abstraction that carries information about what is supposed to happen ○ It does not contain precise information on how it happens ○ Composed of ■ Relational operators – Filter, Join, Project, ... (they represent DataFrame transformations) ■ Expressions – Column transformations, filtering conditions, joining conditions, ... 8#UnifiedDataAnalytics #SparkAISummit
  9. 9. Physical Planning ● In the phase of logical planning we get Optimized Logical Plan ● Execution layer does not understand DataFrame / Logical Plan (LP) ● Logical Plan has to be converted to a Physical Plan (PP) ● Physical Plan ○ Bridge between LP and RDDs ○ Similarly to Logical Plan it is a tree ○ Contains more specific description of how things should happen (specific choice of algorithms) ○ Uses lower level primitives - RDDs 9#UnifiedDataAnalytics #SparkAISummit
  10. 10. Physical Planning - 2 phases 10#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Additional Rules
  11. 11. Physical Planning - 2 phases 11#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Additional Rules Join SortMergeJoin BroadcastHashJoin In Logical Plan: In Physical Plan: Strategy example: JoinSelection
  12. 12. Physical Planning - 2 phases 12#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan Additional Rules ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP ● Final version of query plan ● This will be executed ○ generates RDD code
  13. 13. Physical Planning - 2 phases 13#UnifiedDataAnalytics #SparkAISummit Spark Plan ● Generated by Query Planner using Strategies ● For each node in LP there is a node in PP Executed Plan ● Final version of query plan ● This will be executed ○ generates RDD code Additional Rules See the plans: df.queryExecution.sparkPlan df.queryExecution.executedPlan df.explain()
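
A minimal sketch of inspecting these plans from code (assuming a SparkSession named spark; the exact plan text depends on your data and Spark version):

    import org.apache.spark.sql.functions.{col, sum}

    val df = spark.range(100)
      .groupBy((col("id") % 10).as("key"))
      .agg(sum("id").as("total"))

    df.queryExecution.sparkPlan    // Spark Plan: produced by the strategies, before preparation rules
    df.queryExecution.executedPlan // Executed Plan: after rules such as EnsureRequirements
    df.explain()                   // prints the final physical plan
    df.explain(true)               // prints parsed, analyzed, optimized and physical plans
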
  14. 14. 14 SQL Job IDs Click here to see the query plan Spark UI
  15. 15. 15 Graphical representation of the Physical Plan - executed plan Details - string representation (LP + PP)
  16. 16. 16
  17. 17. Let’s see some operators ● FileScan ● Exchange ● HashAggregate, SortAggregate, ObjectHashAggregate ● SortMergeJoin ● BroadcastHashJoin 17#UnifiedDataAnalytics #SparkAISummit
  18. 18. FileScan 18#UnifiedDataAnalytics #SparkAISummit ● Represents reading the data from a file format spark.table(“posts_fb”) .filter($“month” === 5) .filter($”profile_id” === ...) ● table: posts_fb ● partitioned by month ● bucketed by profile_id, 20b ● 1 file per bucket
  19. 19. FileScan 19#UnifiedDataAnalytics #SparkAISummit number of files read size of files read total rows output filesystem read data size total
  20. 20. FileScan 20#UnifiedDataAnalytics #SparkAISummit It is useful to pair these numbers with information in Jobs and Stages tab in Spark UI
  21. 21. 21#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b)
  22. 22. 22#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b)
  23. 23. 23#UnifiedDataAnalytics #SparkAISummit 697.9 KB We read from bucketed table (20b, 1f/b)
  24. 24. 24#UnifiedDataAnalytics #SparkAISummit Bucket pruning: 0B 697.9 KB We read from bucketed table (20b, 1f/b)
  25. 25. 25#UnifiedDataAnalytics #SparkAISummit Bucket pruning: 0B 697.9 KB We read from bucketed table (20b, 1f/b) spark.sql.sources.bucketing.enabled
  26. 26. 26#UnifiedDataAnalytics #SparkAISummit We read from bucketed table (20b, 1f/b) Size of the whole partition. This will be read if we turn off bucketing (no bucket pruning) If bucketing is OFF:
  27. 27. FileScan (string representation) 27#UnifiedDataAnalytics #SparkAISummit PartitionFilters PushedFilters DataFilters
  28. 28. FileScan (string representation) 28#UnifiedDataAnalytics #SparkAISummit PartitionFilters PushedFilters PartitionCount (partition pruning) DataFilters Format SelectedBucketsCount (bucket pruning)
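
As a hedged illustration of where these entries come from: reading a table partitioned by month (the posts_fb table from the earlier FileScan slide) and filtering on both the partition column and an ordinary data column (interactions is assumed here) might produce a FileScan along the lines of the comment below:

    import org.apache.spark.sql.functions.col

    val q = spark.table("posts_fb")
      .filter(col("month") === 5)          // partition column -> PartitionFilters (partition pruning)
      .filter(col("interactions") > 100)   // data column      -> DataFilters / PushedFilters

    q.explain()
    // FileScan parquet ... PartitionCount: 1,
    //   PartitionFilters: [isnotnull(month), (month = 5)],
    //   PushedFilters: [IsNotNull(interactions), GreaterThan(interactions,100)], ...
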
  29. 29. Exchange 29#UnifiedDataAnalytics #SparkAISummit ● Represents shuffle - physical data movement on the cluster ○ usually quite expensive
  30. 30. Exchange 30#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”)
  31. 31. Exchange 31#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy()
  32. 32. Exchange 32#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy() RoundRobinPartitioning ● repartition(10)
  33. 33. Exchange 33#UnifiedDataAnalytics #SparkAISummit HashPartitioning ● groupBy ● distinct ● join ● repartition($”key”) ● Window.partitionBy(“key”) SinglePartition ● Window.partitionBy() RoundRobinPartitioning ● repartition(10) RangePartitioning ● orderBy
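
A small sketch that triggers each kind of Exchange; the exact wording in the explain() output (hashpartitioning, RoundRobinPartitioning, rangepartitioning, SinglePartition) varies slightly between Spark versions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val df = spark.range(1000).withColumn("key", col("id") % 10)

    df.groupBy("key").count().explain()    // Exchange hashpartitioning(key, 200)
    df.repartition(10).explain()           // Exchange RoundRobinPartitioning(10)
    df.orderBy("key").explain()            // Exchange rangepartitioning(key ASC, 200)
    df.withColumn("rn", row_number().over(Window.partitionBy().orderBy("id")))
      .explain()                           // Exchange SinglePartition
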
  34. 34. Aggregate 34#UnifiedDataAnalytics #SparkAISummit HashAggregate PP SortAggregate ObjectHashAggregate Chosen in Aggregation strategy ● Represents data aggregation
  35. 35. Aggregate 35#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP SortAggregate ObjectHashAggregate Chosen in Aggregation strategy
  36. 36. Aggregate 36#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP SortAggregate ObjectHashAggregate Chosen in Aggregation strategy df.groupBy(“profile_id”) .agg(sum(“interactions”))
  37. 37. Aggregate 37#UnifiedDataAnalytics #SparkAISummit ● groupBy ● distinct ● dropDuplicates Aggregate HashAggregate LP PP Usually comes in pair ○ partial_sum ○ finalmerge_sum SortAggregate ObjectHashAggregate Chosen in Aggregation strategy df.groupBy(“profile_id”) .agg(sum(“interactions”))
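
A sketch of the two-step aggregation for the query above, assuming a DataFrame df with profile_id and interactions columns; the comment shows the rough shape of the plan (partial aggregation before the shuffle, merge/final aggregation after it), although the exact function names printed depend on the Spark version:

    import org.apache.spark.sql.functions.sum

    df.groupBy("profile_id")
      .agg(sum("interactions").as("sum"))
      .explain()
    // HashAggregate(keys=[profile_id], functions=[sum(interactions)])              <- final / merge
    // +- Exchange hashpartitioning(profile_id, 200)
    //    +- HashAggregate(keys=[profile_id], functions=[partial_sum(interactions)])
    //       +- FileScan ...
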
  38. 38. SortMergeJoin 38#UnifiedDataAnalytics #SparkAISummit ● Represents joining two DataFrames ● Exchange & Sort often before SortMergeJoin but not necessarily (see later)
  39. 39. BroadcastHashJoin 39#UnifiedDataAnalytics #SparkAISummit BroadcastExchange ● data from this branch are broadcasted to each executor BroadcastHashJoin ● Represents joining two DataFrames
  40. 40. BroadcastHashJoin 40#UnifiedDataAnalytics #SparkAISummit Two jobs: broadcasting the data is handled in a separate job ● Represents joining two DataFrames
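
If Spark's size estimate does not trigger the broadcast on its own, it can be requested explicitly with a hint; a minimal sketch with two hypothetical DataFrames posts and profiles joined on profile_id:

    import org.apache.spark.sql.functions.broadcast

    posts.join(broadcast(profiles), Seq("profile_id")).explain()
    // BroadcastHashJoin [profile_id], [profile_id], Inner, BuildRight
    // :- FileScan ...                                  (posts)
    // +- BroadcastExchange HashedRelationBroadcastMode(...)
    //    +- FileScan ...                               (profiles)
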
  41. 41. Rules applied in preparation • EnsureRequirements • ReuseExchange • ... 41#UnifiedDataAnalytics #SparkAISummit Spark Plan Executed Plan Additional Rules
  42. 42. EnsureRequirements ● Adds Exchange & Sort operators to the plan ● How does it work? ● Each operator in PP has: ○ outputPartitioning ○ outputOrdering ○ requiredChildDistribution (this is a requirement for partitioning) ○ requiredChildOrdering (this is a requirement for ordering) ● If a requirement is not met, an Exchange or Sort is added 42#UnifiedDataAnalytics #SparkAISummit
  43. 43. 43#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan
  44. 44. 44#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id
  45. 45. 45#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: Unknown outputOrdering: None outputPartitioning: Unknown outputOrdering: None
  46. 46. 46#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: Unknown outputOrdering: None outputPartitioning: Unknown outputOrdering: None
  47. 47. 47#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Exchange Exchange requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id (each Exchange satisfies this requirement)
  48. 48. 48#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Exchange Exchange Sort Sort requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id
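
A minimal sketch of the situation in the diagram above: joining two hypothetical DataFrames dfA and dfB that are neither bucketed nor pre-partitioned (and too large to be broadcast), explain() shows the Exchange and Sort that EnsureRequirements inserted on both sides:

    val joined = dfA.join(dfB, Seq("id"))
    joined.explain()
    // SortMergeJoin [id], [id], Inner
    // :- Sort [id ASC NULLS FIRST]                  <- added by EnsureRequirements
    // :  +- Exchange hashpartitioning(id, 200)      <- added by EnsureRequirements
    // :     +- ... FileScan ...
    // +- Sort [id ASC NULLS FIRST]
    //    +- Exchange hashpartitioning(id, 200)
    //       +- ... FileScan ...
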
  49. 49. Example with bucketing ● If the sources are tables bucketed into 50 buckets by id: ○ FileScan will have outputPartitioning as HashPartitioning(id, 50) ○ This is passed on to Project ○ This happens in both branches of the Join ○ The requirement for partitioning is met => no Exchange needed 49#UnifiedDataAnalytics #SparkAISummit
  50. 50. 50#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50)
  51. 51. 51#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50)
  52. 52. 52#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan requires: HashPartitioning(id, 200) order by id requires: HashPartitioning(id, 200) order by id outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) outputPartitioning: HashPartitioning(id, 50) No need for Exchange OK OK
  53. 53. What about the Sort ? 53#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing:
  54. 54. What about the Sort ? 54#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing: ● If one file per bucket is created ○ FileScan will pick up the information about order ○ No need for Sort in the final plan
  55. 55. What about the Sort ? 55#UnifiedDataAnalytics #SparkAISummit dataDF .write .bucketBy(50, “id”) .sortBy(“id”) .option(“path”, outputPath) .saveAsTable(tableName) It depends on bucketing: ● If one file per bucket is created ○ FileScan will pick up the information about order ○ No need for Sort in the final plan ● If more files per bucket ○ FileScan will not use information about order (data has to be sorted globally) ○ Spark will add Sort to the final plan
  56. 56. 56#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Final Executed Plan if: 1. bucketed by id 2. sorted by id (one file/bucket)
  57. 57. 57#UnifiedDataAnalytics #SparkAISummit SortMergeJoin (id) Project Project FileScan FileScan Sort Sort Final Executed Plan if: 1. bucketed by id 2. not sorted or 3. sorted but more files/bucket
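
A hedged sketch of producing the plans from the previous two slides, assuming hypothetical DataFrames dfA and dfB and output paths pathA and pathB; both sides are bucketed by id into the same number of buckets and are too large to be broadcast:

    dfA.write.bucketBy(50, "id").sortBy("id").option("path", pathA).saveAsTable("table_a")
    dfB.write.bucketBy(50, "id").sortBy("id").option("path", pathB).saveAsTable("table_b")

    spark.table("table_a")
      .join(spark.table("table_b"), Seq("id"))
      .explain()
    // One file per bucket:      SortMergeJoin with no Exchange and no Sort.
    // Several files per bucket: the Sort comes back, but the Exchange is still avoided.
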
  58. 58. ReuseExchange ● Result of each Exchange (shuffle) is written to disk (shuffle-write) 58#UnifiedDataAnalytics #SparkAISummit ● The rule checks if the plan has the same Exchange-branches ○ if the branches are the same, the shuffle-write will also be the same ● Spark can reuse it since the data (shuffle-write) is persisted on disks
  59. 59. 59#UnifiedDataAnalytics #SparkAISummit Union Project Project FileScan FileScan Exchange Exchange Sort Window Generate Project The two sub-branches may become the same in some situations
  60. 60. 60#UnifiedDataAnalytics #SparkAISummit Union Project Project FileScan FileScan Exchange Exchange Sort Window Generate Union Project FileScan Exchange Sort Window Generate Project Project If the branches are the same, only one will be computed and Spark will reuse it
  61. 61. ReuseExchange ● The branches must be the same ● Only Exchange can be reused ● Can be turned off by internal configuration ○ spark.sql.exchange.reuse ● Otherwise you cannot control it directly, but there is an indirect way: tweaking the query (see the example later) ● It reduces the I/O and network cost: ○ One scan over the data ○ One shuffle write 61#UnifiedDataAnalytics #SparkAISummit
  62. 62. Part I conclusion ● Physical Plan operators carry information about execution ○ it is useful to pair information from the query plan with info about stages/tasks ● ReuseExchange allows reducing I/O and network cost ○ if Exchange sub-branches are the same, Spark will compute them only once ● EnsureRequirements adds Exchange and Sort to the query plan ○ it makes sure that the requirements of the operators are met 62#UnifiedDataAnalytics #SparkAISummit
  63. 63. David Vrba, Socialbakers Physical Plans in Spark SQL - continued #UnifiedDataAnalytics #SparkAISummit
  64. 64. Part I recap ● We covered ○ Query Execution (physical planning) ○ Two preparation rules: ■ EnsureRequirements ■ ReuseExchange 64#UnifiedDataAnalytics #SparkAISummit
  65. 65. Part II ● Model examples with optimizations ● Some useful tips 65#UnifiedDataAnalytics #SparkAISummit
  66. 66. Data 66#UnifiedDataAnalytics #SparkAISummit post_id profile_id interactions date 1 1 20 2019-01-01 2 1 15 2019-01-01 3 1 50 2019-02-01 Table A: posts (messages published on FB)
  67. 67. Example I - exchange reuse 67#UnifiedDataAnalytics #SparkAISummit ● Take all profiles where ○ sum of interactions is bigger than 100 ○ or sum of interactions is less than 20
  68. 68. 68#UnifiedDataAnalytics #SparkAISummit df.groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100 || $”sum” < 20) .write(...) Simple query:
  69. 69. 69#UnifiedDataAnalytics #SparkAISummit df.groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100 || $”sum” < 20) .write(...) val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) Simple query: Can be written also in terms of union
  70. 70. 70#UnifiedDataAnalytics #SparkAISummit Typical plans with unions: Union
  71. 71. 71#UnifiedDataAnalytics #SparkAISummit val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) No need to optimize val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) This Exchange is reused. The data is scanned only once In our example:
  72. 72. 72#UnifiedDataAnalytics #SparkAISummit val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...) Let’s suppose the assignment has changed: ● For dfSumSmall we want to consider only specific profiles (profile_id is not null)
  73. 73. 73#UnifiedDataAnalytics #SparkAISummit dfSumBig.union(dfSumSmall) .write(...) val dfSumSmall = df .filter($”profile_id”.isNotNull) .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) Let’s suppose the assignment has changed: ● For dfSumSmall we want to consider only specific profiles (profile_id is not null)
  74. 74. 74#UnifiedDataAnalytics #SparkAISummit Now add additional filter to one DF:val dfSumBig = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” > 100) val dfSumSmall = df .filter($”profile_id”.isNotNull) .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) dfSumBig.union(dfSumSmall) .write(...)
  75. 75. val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“sum”)) .filter($”sum” < 20) .filter($”profile_id”.isNotNull) ● How can we optimize this? ● Just calling the filter in a different place does not help 75#UnifiedDataAnalytics #SparkAISummit The Spark optimizer will move this filter back by applying the rule PushDownPredicate
  76. 76. ● We can limit the optimizer to stop the rule 76#UnifiedDataAnalytics #SparkAISummit spark.conf.set( “spark.sql.optimizer.excludeRules”, “org.apache.spark.sql.catalyst.optimizer.PushDownPredicate” ) val dfSumSmall = df .groupBy(“profile_id”) .agg(sum(“interactions”).alias(“metricValue”)) .filter($”metricValue” < 20) .filter($”profile_id”.isNotNull) It has both the filter conditions
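
A follow-up sketch: excludeRules is a session-wide setting, so verify the plan and remove the setting only after the affected queries have actually run, since the optimizer rules are applied whenever a query is planned for execution:

    dfSumSmall.explain()
    // The Filter on profile_id now stays above HashAggregate instead of being
    // pushed below the Exchange, so both Exchange branches match again and can be reused.

    // Restore the default optimizer behaviour afterwards:
    spark.conf.unset("spark.sql.optimizer.excludeRules")
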
  77. 77. Limiting the optimizer ● Available since Spark 2.4 ● Useful if you need to change the order of operators in the plan ○ reposition Filter, Exchange ● Queries: ○ one data source (which is expensive to read) ○ with multiple computations (using groupBy or Window) ■ combined together using Union or Join 77#UnifiedDataAnalytics #SparkAISummit
  78. 78. Reused computation ● A similar effect to ReuseExchange can also be achieved by caching ○ caching may not be that useful with large datasets (if they do not fit into the caching layer) ○ caching incurs additional overhead when putting data into the caching layer (memory or disk) 78#UnifiedDataAnalytics #SparkAISummit
  79. 79. Example II ● Get only records (posts) with max interactions for each profile 79#UnifiedDataAnalytics #SparkAISummit post_id profile_id interactions date 1 1 20 2019-01-01 2 1 50 2019-01-01 3 1 50 2019-02-01 4 2 0 2019-01-01 5 2 100 2019-03-01 post_id profile_id interactions date 2 1 50 2019-01-01 3 1 50 2019-02-01 Assume profile_id has non-null values.
  80. 80. Example II ● Three common ways to write the query: ○ Using Window function ○ Using groupBy + join ○ Using correlated subquery in SQL 80#UnifiedDataAnalytics #SparkAISummit Which one is the most efficient?
  81. 81. 81#UnifiedDataAnalytics #SparkAISummit val w = Window.partitionBy(“profile_id”) posts .withColumn(“maxCount”, max(“interactions”).over(w)) .filter($”interactions” === $”maxCount”) Using Window:
  82. 82. 82#UnifiedDataAnalytics #SparkAISummit val maxCount = posts .groupBy(“profile_id”) .agg(max(“interactions”).alias(“maxCount”)) posts.join(maxCount, Seq(“profile_id”)) .filter($”interactions” === $”maxCount”) Using groupBy + join:
  83. 83. 83#UnifiedDataAnalytics #SparkAISummit posts.createOrReplaceTempView(“postsView”) val query = “““ SELECT * FROM postsView v1 WHERE interactions = ( select max(interactions) from postsView v2 where v1.profile_id = v2.profile_id ) ””” spark.sql(query) Using correlated subquery (equivalent plan with Join+groupBy):
  84. 84. 84#UnifiedDataAnalytics #SparkAISummit
  85. 85. 85#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin
  86. 86. 86#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin Exchange + Sort
  87. 87. 87#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort
  88. 88. 88#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort reduced shuffle HashAggregate
  89. 89. 89#UnifiedDataAnalytics #SparkAISummit Query with window Query with join or correlated subquery SortMergeJoin BroadcastHashJoin 2 Exchanges + 1 Sort Exchange + Sort reduced shuffle broadcast exchange HashAggregate
  90. 90. Window vs join + groupBy 90#UnifiedDataAnalytics #SparkAISummit Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts)
  91. 91. Window vs join + groupBy 91#UnifiedDataAnalytics #SparkAISummit Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts) Reduced table is much smaller and can be broadcasted ● Go with broadcast join - it will be much faster
  92. 92. Window vs join + groupBy 92#UnifiedDataAnalytics #SparkAISummit Both tables are small and comparable in size: ● It is not a big deal ● Broadcast will also generate traffic Both tables are large ● Go with window ● SortMergeJoin will be more expensive (3 exchanges, 2 sorts) Reduced table is much smaller and can be broadcasted ● Go with broadcast join - it will be much faster
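
For completeness, a sketch of the broadcast variant discussed above, assuming the aggregated DataFrame is small enough to broadcast:

    import org.apache.spark.sql.functions.{broadcast, col, max}

    val maxCount = posts
      .groupBy("profile_id")
      .agg(max("interactions").as("maxCount"))

    posts
      .join(broadcast(maxCount), Seq("profile_id"))
      .filter(col("interactions") === col("maxCount"))
      .explain()   // BroadcastHashJoin instead of SortMergeJoin
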
  93. 93. Example III ● Sum interactions for each profile and each date ● Join additional table about profiles 93#UnifiedDataAnalytics #SparkAISummit profile_id about lang 1 “some string” en 2 “some other string” en Table B: profiles (Facebook pages)
  94. 94. Example III ● Sum interactions for each profile and each date ● Join additional table about profiles 94#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) profile_id about lang 1 “some string” en 2 “some other string” en Table B: profiles (Facebook pages)
  95. 95. posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) 95#UnifiedDataAnalytics #SparkAISummit 3 Exchange operators => 3 shuffles
  96. 96. 96#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”))
  97. 97. 97#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id)
  98. 98. 98#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id)
  99. 99. 99#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id) HashAggregate (profile_id, date) It requires ● HashPartitioning (profile_id, date) ○ Or any subset of these cols
  100. 100. 100#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) HashAggregate (profile_id, date) It requires ● HashPartitioning (profile_id, date) ○ Or any subset of these cols HashPartitioning (profile_id, date) SortMergeJoin (profile_id) It requires (strictly) ● HashPartitioning (profile_id) HashPartitioning (profile_id)
  101. 101. 101#UnifiedDataAnalytics #SparkAISummit posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Add repartition and eliminate one shuffle
  102. 102. 102#UnifiedDataAnalytics #SparkAISummit posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Add repartition and eliminate one shuffle Generated by strategy HashPartitioning (profile_id)
  103. 103. 103#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy
  104. 104. 104#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy
  105. 105. 105#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy OK with
  106. 106. 106#UnifiedDataAnalytics #SparkAISummit Add repartition and eliminate one shuffle HashPartitioning (profile_id) posts .repartition(“profile_id”) .groupBy(“profile_id”, “date”) .agg(sum(“interactions”)) .join(profiles, Seq(“profile_id”)) Generated by strategy OK with OK with
  107. 107. Adding repartition ● What is the cost of using it? ○ Now we shuffle all data ○ Before we shuffled reduced dataset 107#UnifiedDataAnalytics #SparkAISummit
  108. 108. 108#UnifiedDataAnalytics #SparkAISummit reduced shuffle (shuffled data are all distinct combinations of profile_id & date) Before using repartition HashAggregate
  109. 109. 109#UnifiedDataAnalytics #SparkAISummit reduced shuffle (shuffled data are all distinct combinations of profile_id & date) total shuffle (all data is shuffled) Before using repartition After using repartition HashAggregate HashAggregate
  110. 110. Adding repartition ● What is the cost of using it? ○ Now we shuffle all data ○ Before we shuffled reduced dataset ● What is more efficient? ○ depends on properties of data ■ here the cardinality of distinct (profile_id, date) 110#UnifiedDataAnalytics #SparkAISummit
  111. 111. Adding repartition 111#UnifiedDataAnalytics #SparkAISummit ● Cardinality of (profile_id, date) is comparable with row_count ● Each profile has only few posts per date ● The data is not reduced much by groupBy aggregation ● the reduced shuffle is comparable with full shuffle ● Cardinality of (profile_id, date) is much lower than row_count ● Each profile has many posts per date ● The data is reduced a lot by groupBy aggregation ● the reduced shuffle is much lower than full shuffle Use repartition to reduce number of shuffles if: Use the original plan if:
  112. 112. Useful tip I ● Filters are not pushed through nondeterministic expressions 112#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”) .agg(collect_list(“interactions”)) .filter($”profile_id”.isNotNull )
  113. 113. Useful tip I ● Filters are not pushed through nondeterministic expressions 113#UnifiedDataAnalytics #SparkAISummit posts .groupBy(“profile_id”) .agg(collect_list(“interactions”)) .filter($”profile_id”.isNotNull ) Exchange
  114. 114. Useful tip I ● Filters are not pushed through nondeterministic expressions 114#UnifiedDataAnalytics #SparkAISummit posts .filter($”profile_id”.isNotNull ) .groupBy(“profile_id”) .agg(collect_list(“interactions”)) Make sure your filters are positioned at the right place to achieve efficient execution Exchange
  115. 115. Nondeterministic expressions 115#UnifiedDataAnalytics #SparkAISummit ● collect_list ● collect_set ● first ● last ● input_file_name ● spark_partition_id ● monotonically_increasing_id ● rand ● randn ● shuffle
  116. 116. Useful tip IIa ● Important settings related to BroadcastHashJoin: 116#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint
  117. 117. Useful tip IIa ● Important settings related to BroadcastHashJoin: 117#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint Compute stats to make good estimates ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
  118. 118. Useful tip IIa ● Important settings related to BroadcastHashJoin: 118#UnifiedDataAnalytics #SparkAISummit spark.sql.autoBroadcastJoinThreshold ● Default value is 10MB ● Spark will broadcast if ○ Spark thinks that the size of the data is less ○ or you use broadcast hint Compute stats to make good estimates ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ... spark.sql.cbo.enabled
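
A minimal sketch of putting these settings together, using the profiles table from Example III (the threshold value and column list are only illustrative):

    // Value is in bytes; raise the threshold to roughly 50 MB
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

    // Compute statistics so Spark's size estimates are realistic
    spark.sql("ANALYZE TABLE profiles COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE profiles COMPUTE STATISTICS FOR COLUMNS profile_id, lang")

    // Enable cost-based optimization so the statistics are used when estimating plan sizes
    spark.conf.set("spark.sql.cbo.enabled", true)
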
  119. 119. Useful tip IIb ● Important settings related to BroadcastHashJoin: 119#UnifiedDataAnalytics #SparkAISummit spark.sql.broadcastTimeout ● Default value is 300s
  120. 120. Useful tip IIb ● Important settings related to BroadcastHashJoin: 120#UnifiedDataAnalytics #SparkAISummit spark.sql.broadcastTimeout ● Default value is 300s ● If Spark does not make it: SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
  121. 121. 3 basic solutions: 1. Disable the broadcast by setting the threshold to -1 2. Increase the timeout 3. Use caching 121#UnifiedDataAnalytics #SparkAISummit
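
The first two options translate directly into configuration, as in this sketch (the caching approach is illustrated on the following slides):

    // 1) Disable automatic broadcasting; an explicit broadcast() hint still forces a broadcast
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    // 2) Or give the broadcast more time (the value is in seconds)
    spark.conf.set("spark.sql.broadcastTimeout", 1200)
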
  122. 122. 122#UnifiedDataAnalytics #SparkAISummit val df = profiles .some_udf_call .groupBy(“profile_id”) .agg(...some_aggregation...) posts .join(broadcast(df), Seq(“profile_id”)) Intense transformations: ● udf call ● aggregation, … Computation may take longer than 5mins If size of df is small we want to broadcast
  123. 123. 123#UnifiedDataAnalytics #SparkAISummit val df = profiles .some_udf_call .groupBy(“profile_id”) .agg(...some_aggregation...) .cache() posts .join(broadcast(df), Seq(“profile_id”)) df.count() Three jobs: 1) count will write the data to memory 2) broadcast (fast because taken from RAM) 3) join - will leverage broadcasted data If size of df is small we want to broadcast We can use caching (but we have to materialize immediately)
  124. 124. Conclusion ● Understanding the physical plan is important ● By limiting the optimizer you can achieve Exchange reuse ● The choice between Window and groupBy+join depends on data properties ● Adding repartition can avoid an unnecessary Exchange ○ considering the data properties is important ● Be aware of nondeterministic expressions ● Fine-tune broadcast joins with configuration settings ○ make sure to have good size estimates using CBO 124#UnifiedDataAnalytics #SparkAISummit
  125. 125. Questions 125#UnifiedDataAnalytics #SparkAISummit
  126. 126. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
