A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai


Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.

In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.



  1. Deep Dive Into Catalyst: Apache Spark’s Optimizer. Yin Huai, yhuai@databricks.com. 2017-06-06, Spark Summit
  2. About me
      • Software engineer at Databricks
      • Apache Spark committer and PMC member
      • One of the original developers of Spark SQL
      • Before joining Databricks: Ohio State University
  3. About Databricks
      • TEAM: started the Spark project (now Apache Spark) at UC Berkeley in 2009
      • PRODUCT: Unified Analytics Platform
      • MISSION: Making Big Data Simple
  4. Overview
      • The stack: Spark Core (RDD) at the bottom, Catalyst and Spark SQL on top of it, exposing the SQL and DataFrame/Dataset APIs
      • Built on Spark SQL: ML Pipelines, Structured Streaming, GraphFrames, and more…
      • Spark SQL applies structured views to data from different systems stored in different kinds of formats.
  5. Why structure APIs?
      RDD:
        data.map { case (dept, age) => dept -> (age, 1) }
          .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
          .map { case (dept, (age, c)) => dept -> age / c }
      DataFrame:
        data.groupBy("dept").avg("age")
      SQL:
        select dept, avg(age) from data group by 1
  6. Why structure APIs?
      • Structure will limit what can be expressed.
      • In practice, we can accommodate the vast majority of computations.
      • Limiting the space of what can be expressed enables optimizations.
  7. Why structure APIs?
      [Bar chart: runtime performance of aggregating 10 million int pairs (secs), comparing RDD, DataFrame, and SQL]
  8. How to take advantage of optimization opportunities?
  9. Get an optimizer that automatically finds the most efficient plan to execute the data operations specified in the user’s program.
  10. Catalyst: Apache Spark’s Optimizer
  11. How Catalyst Works: An Overview
      SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs (via Code Generation)
      Catalyst applies Transformations to abstractions of users’ programs (Trees)
  12. How Catalyst Works: An Overview
      SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs (via Code Generation)
      Catalyst applies Transformations to abstractions of users’ programs (Trees)
  13. Trees: Abstractions of Users’ Programs
      SELECT sum(v)
      FROM (
        SELECT
          t1.id,
          1 + 2 + t1.value AS v
        FROM t1 JOIN t2
        WHERE
          t1.id = t2.id AND
          t2.id > 50000) tmp
  14. Trees: Abstractions of Users’ Programs (the same query)
      Expression
      • An expression represents a new value, computed based on input values, e.g. 1 + 2 + t1.value
      • Attribute: a column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v)
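      To make the tree concrete, the expression 1 + 2 + t1.value can be built by hand from Catalyst’s internal expression classes. This is a sketch against the Spark 2.x internals this talk targets; these classes are not public, stable APIs:

        import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
        import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

        // Add(Add(Literal(1), Literal(2)), 'value) is the tree for 1 + 2 + t1.value
        val tree = Add(Add(Literal(1), Literal(2)), UnresolvedAttribute("value"))
        println(tree.treeString)

        // A purely literal subtree needs no input row to evaluate
        println(Add(Literal(1), Literal(2)).eval())   // 3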
  15. Trees: Abstractions of Users’ Programs (the same query)
      Query Plan:
      Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id AND t2.id > 50000] → Project [t1.id, 1+2+t1.value AS v] → Aggregate [sum(v)]
  16. Logical Plan
      • A Logical Plan describes computation on datasets without defining how to conduct the computation
      • output: a list of attributes generated by this Logical Plan, e.g. [id, v]
      • constraints: a set of invariants about the rows generated by this plan, e.g. t2.id > 50000
      • statistics: size of the plan in rows/bytes; per-column stats (min/max/ndv/nulls)
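      The output and constraints of a logical plan can be inspected on a real DataFrame through its queryExecution handle. A minimal exploratory sketch (queryExecution is internal API; the spark session and df below are hypothetical, not from the deck):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
        import spark.implicits._

        val df = spark.range(100).where($"id" > 50).select($"id", ($"id" + 1).as("v"))

        val plan = df.queryExecution.analyzed   // the resolved logical plan
        println(plan.numberedTreeString)        // the operator tree
        println(plan.output)                    // attributes produced by this plan, e.g. [id, v]
        println(plan.constraints)               // row invariants inferred for this plan, e.g. (id > 50)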
  17. Physical Plan
      • A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation
      • A Physical Plan is executable
      • Example: Parquet Scan(t1), JSON Scan(t2) → Sort-Merge Join [t1.id = t2.id] → Filter [t2.id > 50000] → Project [t1.id, 1+2+t1.value AS v] → Hash-Aggregate [sum(v)]
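      To see that the physical plan really is the executable form, it can be pulled from the same queryExecution handle and run directly (again internal API, shown only as an exploratory sketch; df is the hypothetical DataFrame from the previous sketch):

        val physical = df.queryExecution.executedPlan   // concrete operators, e.g. *Project, *Filter, *Range
        println(physical.treeString)
        // A Physical Plan is executable: execute() yields the underlying RDD of InternalRows
        println(physical.execute().count())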
  18. How Catalyst Works: An Overview
      SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs (via Code Generation)
      Catalyst applies Transformations to abstractions of users’ programs (Trees)
  19. Transformations
      • Transformations without changing the tree type (Transform and Rule Executor)
        • Expression => Expression
        • Logical Plan => Logical Plan
        • Physical Plan => Physical Plan
      • Transforming a tree to another kind of tree
        • Logical Plan => Physical Plan
  20. Transform
      • A function associated with every tree, used to implement a single rule
      • Example: the tree for 1 + 2 + t1.value, Add(Add(Literal(1), Literal(2)), Attribute(t1.value)), is rewritten to the tree for 3 + t1.value, Add(Literal(3), Attribute(t1.value)); 1 + 2 is evaluated once instead of for every row
  21. Transform
      • A transformation is defined as a Partial Function
      • Partial Function: a function that is defined for a subset of its possible arguments
        val expression: Expression = ...
        expression.transform {
          case Add(Literal(x: Int, IntegerType), Literal(y: Int, IntegerType)) =>
            Literal(x + y)
        }
      • The case statement determines whether the partial function is defined for a given input
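      A runnable version of this constant-folding transform, assuming the Spark 2.x internals this talk targets (on Spark 3.x the Add extractor takes an extra evaluation-mode argument, so treat this only as a sketch):

        import org.apache.spark.sql.functions.{col, lit}
        import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}
        import org.apache.spark.sql.types.IntegerType

        // Build the tree for 1 + 2 + value via the public Column API
        val expression: Expression = (lit(1) + lit(2) + col("value")).expr

        // Fold Add(Literal, Literal) subtrees into a single Literal
        val folded = expression.transform {
          case Add(Literal(x: Int, IntegerType), Literal(y: Int, IntegerType)) =>
            Literal(x + y)
        }

        println(expression)   // e.g. ((1 + 2) + 'value)
        println(folded)       // e.g. (3 + 'value)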
  22. Combining Multiple Rules: Predicate Pushdown
      • Before: Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id AND t2.id > 50000] → Project [t1.id, 3+t1.value AS v]
      • After: Scan(t1), Filter [t2.id > 50000] over Scan(t2) → Join [t1.id = t2.id] → Project [t1.id, 3+t1.value AS v]
  23. Combining Multiple Rules: Column Pruning
      • Before: the plan after predicate pushdown (slide 22)
      • After: Project [t1.id, t1.value] over Scan(t1), Project [t2.id] over Filter [t2.id > 50000] over Scan(t2) → Join [t1.id = t2.id] → Project [t1.id, 3+t1.value AS v]
  24. Combining Multiple Rules
      • Before transformations: Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id AND t2.id > 50000] → Project [t1.id, 3+t1.value AS v]
      • After transformations: Project [t1.id, t1.value] over Scan(t1), Project [t2.id] over Filter [t2.id > 50000] over Scan(t2) → Join [t1.id = t2.id] → Project [t1.id, 3+t1.value AS v]
  25. Combining Multiple Rules: Rule Executor
      • A Rule Executor transforms a Tree into another Tree of the same type by applying many rules defined in batches (Batch 1, Batch 2, …, Batch n, each containing Rule 1, Rule 2, …)
      • Every rule is implemented based on Transform
      • Approaches of applying rules: 1. Fixed point  2. Once
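      Roughly what a rule plus rule executor looks like in code, using Catalyst’s internal Rule and RuleExecutor APIs. This is a minimal sketch; RemoveTrivialFilter and MyOptimizer are made-up names, not part of Spark:

        import org.apache.spark.sql.catalyst.expressions.Literal
        import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
        import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
        import org.apache.spark.sql.types.BooleanType

        // A toy rule built on Transform: drop Filter operators whose condition is literally true
        object RemoveTrivialFilter extends Rule[LogicalPlan] {
          def apply(plan: LogicalPlan): LogicalPlan = plan transform {
            case Filter(Literal(true, BooleanType), child) => child
          }
        }

        // A Rule Executor groups rules into batches; each batch runs Once or until FixedPoint
        object MyOptimizer extends RuleExecutor[LogicalPlan] {
          val batches = Seq(
            Batch("Trivial filters", FixedPoint(100), RemoveTrivialFilter)
          )
        }

        // Usage, e.g. on an analyzed logical plan:
        //   val rewritten = MyOptimizer.execute(df.queryExecution.analyzed)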
  26. Transformations
      • Transformations without changing the tree type (Transform and Rule Executor)
        • Expression => Expression
        • Logical Plan => Logical Plan
        • Physical Plan => Physical Plan
      • Transforming a tree to another kind of tree
        • Logical Plan => Physical Plan
  27. From Logical Plan to Physical Plan
      • A Logical Plan is transformed to a Physical Plan by applying a set of Strategies
      • Every Strategy uses pattern matching to convert a Logical Plan to a Physical Plan
        object BasicOperators extends Strategy {
          def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
            …
            case logical.Project(projectList, child) =>
              execution.ProjectExec(projectList, planLater(child)) :: Nil
            case logical.Filter(condition, child) =>
              execution.FilterExec(condition, planLater(child)) :: Nil
            …
          }
        }
      • planLater(child) triggers other Strategies to plan the children
  28. SQL AST / DataFrame / Dataset (Java/Scala) → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → RDDs. Catalyst covers the stages from Unresolved Logical Plan to Selected Physical Plan.
  29. Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → (Cost Model) → Selected Physical Plan
      • Analysis (Rule Executor): transforms an Unresolved Logical Plan to a Resolved Logical Plan
        • Unresolved => Resolved: uses the Catalog to find where datasets and columns come from and the types of columns
      • Logical Optimization (Rule Executor): transforms a Resolved Logical Plan to an Optimized Logical Plan
      • Physical Planning (Strategies + Rule Executor):
        • Phase 1: transforms an Optimized Logical Plan to a Physical Plan
        • Phase 2: a Rule Executor is used to adjust the Physical Plan to make it ready for execution
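      All of these stages can be inspected on a real Dataset through its QueryExecution object (internal, but handy for exploration). A minimal sketch, reusing the hypothetical spark session and df from the earlier sketches:

        val qe = df.queryExecution
        println(qe.logical)        // parsed (possibly unresolved) logical plan
        println(qe.analyzed)       // after Analysis: resolved against the Catalog
        println(qe.optimizedPlan)  // after Logical Optimization (Rule Executor)
        println(qe.sparkPlan)      // after Physical Planning (Strategies)
        println(qe.executedPlan)   // after the final preparation rules; ready to run

      Alternatively, df.explain(true) prints the parsed, analyzed, optimized, and physical plans in one call.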
  30. Put what we have learned into action
  31. Use Catalyst’s APIs to customize Spark: roll your own planner rule
  32. Roll your own Planner Rule
      import org.apache.spark.sql.functions._
      import spark.implicits._   // for the $"..." column syntax

      // tableA is a dataset of integers in the range of [0, 19999999]
      val tableA = spark.range(20000000).as('a)
      // tableB is a dataset of integers in the range of [0, 9999999]
      val tableB = spark.range(10000000).as('b)

      // result is the number of records after joining tableA and tableB
      val result = tableA
        .join(tableB, $"a.id" === $"b.id")
        .groupBy()
        .count()

      result.show()
      This takes 4-8s on Databricks Community Edition.
  33. Roll your own Planner Rule
      result.explain() prints:
      == Physical Plan ==
      *HashAggregate(keys=[], functions=[count(1)])
      +- Exchange SinglePartition
         +- *HashAggregate(keys=[], functions=[partial_count(1)])
            +- *Project
               +- *SortMergeJoin [id#642L], [id#646L], Inner
                  :- *Sort [id#642L ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(id#642L, 200)
                  :     +- *Range (0, 20000000, step=1, splits=8)
                  +- *Sort [id#646L ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(id#646L, 200)
                        +- *Range (0, 10000000, step=1, splits=8)
  34. Roll your own Planner Rule
      Exploit the structure of the problem: we are joining two intervals; the result will be the intersection of these intervals (A, B → A∩B).
  35. Roll your own Planner Rule
      // Import internal APIs of Catalyst
      import org.apache.spark.sql.Strategy
      import org.apache.spark.sql.catalyst.expressions.{Alias, EqualTo}
      import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Join, Range}
      import org.apache.spark.sql.catalyst.plans.Inner
      import org.apache.spark.sql.execution.{ProjectExec, RangeExec, SparkPlan}

      case object IntervalJoin extends Strategy with Serializable {
        def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
          case Join(
              Range(start1, end1, 1, part1, Seq(o1)),   // matches tableA
              Range(start2, end2, 1, part2, Seq(o2)),   // matches tableB
              Inner,
              Some(EqualTo(e1, e2)))                    // matches the join condition
              if ((o1 semanticEquals e1) && (o2 semanticEquals e2)) ||
                 ((o1 semanticEquals e2) && (o2 semanticEquals e1)) =>
            // See the next slide for the rule body
          case _ => Nil
        }
      }
  36. Roll your own Planner Rule
      // matches cases like:
      // tableA: start1----------------------------end1
      // tableB: ...------------------end2
      if ((end2 >= start1) && (end2 <= end1)) {
        // start of the intersection
        val start = math.max(start1, start2)
        // end of the intersection
        val end = math.min(end1, end2)
        val part = math.max(part1.getOrElse(200), part2.getOrElse(200))
        // Create a new Range to represent the intersection
        val result = RangeExec(Range(start, end, 1, Some(part), o1 :: Nil))
        val twoColumns = ProjectExec(
          Alias(o1, o1.name)(exprId = o1.exprId) :: Nil,
          result)
        twoColumns :: Nil
      } else {
        Nil
      }
  37. Roll your own Planner Rule
      Hook it up with Spark:
        spark.experimental.extraStrategies = IntervalJoin :: Nil
      Use it:
        result.show()
      This now takes ~0.5s to complete.
  38. Roll your own Planner Rule
      result.explain() now prints:
      == Physical Plan ==
      *HashAggregate(keys=[], functions=[count(1)])
      +- Exchange SinglePartition
         +- *HashAggregate(keys=[], functions=[partial_count(1)])
            +- *Project
               +- *Project [id#642L AS id#642L]
                  +- *Range (0, 10000000, step=1, splits=8)
  39. Contribute your ideas to Spark
      • A 110-line patch took a user’s query from “never finishing” to 200s.
      • Overall, 200+ people have contributed to the analyzer/optimizer/planner in the last 2 years.
  40. Try Apache Spark in Databricks!
      UNIFIED ANALYTICS PLATFORM
      • Collaborative cloud environment
      • Free version (Community Edition)
      DATABRICKS RUNTIME 3.0
      • Apache Spark - optimized for the cloud
      • Caching and optimization layer - DBIO
      • Enterprise security - DBES
      Try for free today: databricks.com
  41. Thank you! Want to chat? Find me after this talk or at the Databricks booth, 3-3:40pm.
