Deep Dive Into Catalyst:
Apache Spark 2.0’s Optimizer
Yin Huai
Spark Summit 2016
Write Programs Using RDD API

SELECT count(*)
FROM (
  SELECT t1.id
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp
Solution 1

val count = t1.cartesian(t2).filter {
  case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2
}.filter {
  case (id1FromT1, id2FromT2) => id2FromT2 > 50 * 1000
}.map {
  case (id1FromT1, id2FromT2) => id1FromT1
}.count
println("Count: " + count)

(The cartesian product implements t1 join t2; the two filters implement t1.id = t2.id and t2.id > 50 * 1000.)
Solution 2

val filteredT2 =
  t2.filter(id1FromT2 => id1FromT2 > 50 * 1000)
val preparedT1 =
  t1.map(id1FromT1 => (id1FromT1, id1FromT1))
val preparedT2 =
  filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2))
val count = preparedT1.join(preparedT2).map {
  case (id1FromT1, _) => id1FromT1
}.count
println("Count: " + count)

(Here t2.id > 50 * 1000 is applied before the join, and t1 join t2 WHERE t1.id = t2.id is done as a key-based join instead of a cartesian product.)
Solution 1 vs. Solution 2

Execution time (s): Solution 1 = 53.92, Solution 2 = 0.26 (roughly 200x faster)
Write Programs Using RDD API
• Users’ functions are black boxes
• Opaque computation
• Opaque data type
• Programs built using RDD API have total control on
how to executeevery data operation
• Developers have to write efficient programs for
different kinds of workloads
8
Is there an easy way to write efficient programs?
The easiest way to write efficient
programs is to not worry about it and get
your programs automatically optimized
How?
• Write programs using high level programming interfaces
  • Programs describe what data operations are needed without specifying how to execute those operations
  • High level programming interfaces: SQL, DataFrame, and Dataset
• Get an optimizer that automatically finds the most efficient plan to execute the data operations specified in the user’s program (see the sketch below)
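For concreteness, here is a rough sketch (not from the slides) of the same query written against the DataFrame API; it assumes t1 and t2 are registered tables with an id column, and it only states what to compute, leaving the how to the optimizer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-example").getOrCreate()
val t1 = spark.table("t1")
val t2 = spark.table("t2")

// Declarative version of the RDD programs above: join, filter, count.
val count = t1.join(t2, t1("id") === t2("id"))
  .filter(t2("id") > 50 * 1000)
  .count()
println("Count: " + count)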
Catalyst: Apache Spark’s Optimizer
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset (Java/Scala)
  -> Query Plan -> (Transformations) -> Optimized Query Plan -> (Code Generation) -> RDDs

Catalyst works on trees: abstractions of users’ programs.
Trees: Abstractions of Users’ Programs

SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp
Expression
• An expression represents a new value, computed based on input values
  • e.g. 1 + 2 + t1.value
• Attribute: A column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v); see the sketch below
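As a side note (an illustrative sketch, not from the slides), the expression tree for 1 + 2 + t1.value can be built and inspected through the public Column API; Column.expr exposes the underlying Catalyst expression:

import org.apache.spark.sql.functions.{col, lit}

// Build the expression 1 + 2 + value with the Column API.
val v = lit(1) + lit(2) + col("value")

// Column.expr is the underlying Catalyst expression tree:
// an Add node whose left child is another Add over two Literals.
println(v.expr)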
Query Plan (for the query above)

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation
• output: a list of attributes generated by this Logical Plan, e.g. [id, v]
• constraints: a set of invariants about the rows generated by this plan, e.g. t2.id > 50 * 1000
(A sketch for inspecting these plans follows below.)
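As an illustrative addition (using Spark’s public QueryExecution API and the SparkSession from the earlier sketch, with t1 and t2 assumed to be registered as temporary views), the logical plans behind the example query can be printed directly:

val df = spark.sql("""
  SELECT sum(v)
  FROM (
    SELECT t1.id, 1 + 2 + t1.value AS v
    FROM t1 JOIN t2
    WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
""")

// The resolved logical plan and the optimized logical plan, printed as trees.
println(df.queryExecution.analyzed)
println(df.queryExecution.optimizedPlan)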
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation
• A Physical Plan is executable
(A usage note on explain follows below.)

Hash-Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Sort-Merge Join
        Parquet Scan (t1)
        JSON Scan (t2)
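To see the physical plan that will actually run (a usage note added here, not from the slides, continuing with the df from the previous sketch), the public explain API prints it; explain(true) also prints the parsed, analyzed, and optimized logical plans:

df.explain()      // physical plan only
df.explain(true)  // parsed, analyzed, optimized logical plans + physical plan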
Transformations
• Transformations without changing the tree type (Transform and Rule Executor)
  • Expression => Expression
  • Logical Plan => Logical Plan
  • Physical Plan => Physical Plan
• Transforming a tree to another kind of tree
  • Logical Plan => Physical Plan
Transform
• A function associated with every tree, used to implement a single rule

Before: 1 + 2 + t1.value
  Add(Add(Literal(1), Literal(2)), Attribute(t1.value))   (evaluates 1 + 2 for every row)

After: 3 + t1.value
  Add(Literal(3), Attribute(t1.value))                    (evaluates 1 + 2 only once)
Transform
• A transformation is defined as a Partial Function
• Partial Function: A function that is defined for a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case statement determines whether the partial function is defined for a given input.
Applying this transform rewrites the tree for 1 + 2 + t1.value, Add(Add(Literal(1), Literal(2)), Attribute(t1.value)), into the tree for 3 + t1.value, Add(Literal(3), Attribute(t1.value)). A runnable sketch follows below.
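Here is a runnable version of that rule (a sketch against Spark 2.0’s internal Catalyst classes, which are not a stable public API; note the casts, since Literal.value is typed as Any):

import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.IntegerType

// The Catalyst expression tree behind 1 + 2 + value ...
val expression: Expression = (lit(1) + lit(2) + col("value")).expr

// ... with adjacent integer literals folded, as in the rule above.
val folded = expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x.asInstanceOf[Int] + y.asInstanceOf[Int])
}
println(folded)  // the inner 1 + 2 is now a single Literal(3)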
Combining Multiple Rules

Predicate Pushdown: the predicate t2.id > 50 * 1000 is pushed below the join, so t2 is filtered before it is joined with t1; t1.id = t2.id remains as the join condition.

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50 * 1000]
        Scan (t2)
Constant Folding: 1 + 2 + t1.value becomes 3 + t1.value and 50 * 1000 becomes 50000, so the constant sub-expressions are evaluated once at planning time rather than once per row.

Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50000]
        Scan (t2)
Column Pruning: a Project is added on top of each scan so that only the needed columns are read, t1.id and t1.value from t1, and t2.id from t2.

Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Project [t1.id, t1.value]
        Scan (t1)
      Filter [t2.id > 50000]
        Project [t2.id]
          Scan (t2)
Before vs. after transformations: together, the three rules turn the original query plan (a join of full scans followed by Filter, Project, and Aggregate) into the optimized plan above, with predicates pushed down, constants folded, and columns pruned at the scans.
Combining Multiple Rules: Rule Executor

A Rule Executor transforms a Tree into another Tree of the same type by applying many rules organized in batches (Batch 1, Batch 2, ..., Batch n, each containing Rule 1, Rule 2, ...).
• Every rule is implemented based on Transform
• Approaches of applying the rules in a batch:
  1. Fixed point (repeat until the tree stops changing)
  2. Once
A sketch of a user-defined rule follows below.
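As an illustration (a sketch, not from the slides; Rule, LogicalPlan, and the expression classes are internal Catalyst APIs, and FoldIntAdds is a made-up name), a rule is just a function from plan to plan built on Transform, and extra rules can be appended to a SparkSession’s optimizer:

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// A toy optimizer rule: fold Add(Literal, Literal) wherever it appears
// in a logical plan's expressions (a simplified cousin of ConstantFolding).
object FoldIntAdds extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
      Literal(x.asInstanceOf[Int] + y.asInstanceOf[Int])
  }
}

// Register it so the optimizer runs it in addition to the built-in batches.
spark.experimental.extraOptimizations = Seq(FoldIntAdds)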
From Logical Plan to Physical Plan
• A Logical Plan is transformed to a Physical Plan by applying a set of Strategies
• Every Strategy uses pattern matching to convert a Tree to another kind of Tree

object BasicOperators extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    …
    case logical.Project(projectList, child) =>
      execution.ProjectExec(projectList, planLater(child)) :: Nil
    case logical.Filter(condition, child) =>
      execution.FilterExec(condition, planLater(child)) :: Nil
    …
  }
}

planLater(child) triggers other Strategies to plan the child. A registration sketch follows below.
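For completeness (a sketch, not from the slides; Strategy and SparkPlan are developer/internal APIs, and DoNothingStrategy is a made-up name), user-defined strategies can be added to the planner through the experimental hook; returning Nil means the strategy produced no candidate plans, so the other Strategies are consulted:

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// The minimal skeleton of a Strategy: it declines every plan.
object DoNothingStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

spark.experimental.extraStrategies = Seq(DoNothingStrategy)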
Catalyst, end to end:

SQL AST / DataFrame / Dataset (Java/Scala)
  -> Unresolved Logical Plan
  -> Analysis (uses the Catalog)
  -> Logical Plan
  -> Logical Optimization
  -> Optimized Logical Plan
  -> Physical Planning
  -> Physical Plans
  -> Cost Model
  -> Selected Physical Plan
  -> RDDs
• Analysis (Rule Executor): Transforms an Unresolved Logical Plan to a Resolved Logical Plan
  • Unresolved => Resolved: Use the Catalog to find where datasets and columns are coming from and the types of columns
• Logical Optimization (Rule Executor): Transforms a Resolved Logical Plan to an Optimized Logical Plan
• Physical Planning (Strategies + Rule Executor): Transforms an Optimized Logical Plan to a Physical Plan
(These phases can be observed through QueryExecution; see the sketch below.)
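As an illustrative note (not from the slides; the field names are taken from Spark 2.0’s QueryExecution class), each phase corresponds to a field on a Dataset’s queryExecution, continuing with the df defined earlier:

val qe = df.queryExecution
println(qe.logical)        // plan produced from the SQL AST / DataFrame / Dataset
println(qe.analyzed)       // after Analysis
println(qe.optimizedPlan)  // after Logical Optimization
println(qe.sparkPlan)      // physical plan selected by the planner
println(qe.executedPlan)   // physical plan after the preparation rules, ready to run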
Spark’s Planner
• 1st Phase: Transforms the Logical Plan to the Physical Plan using Strategies
• 2nd Phase: Uses a Rule Executor to make the Physical Plan ready for execution
  • Prepare scalar sub-queries
  • Ensure requirements on input rows
  • Apply physical optimizations
Ensure Requirements on Input Rows

A Sort-Merge Join [t1.id = t2.id] has requirements on its inputs: rows from t1 must be sorted by t1.id, and rows from t2 must be sorted by t2.id. To satisfy them, the preparation rule inserts a Sort [t1.id] above t1 and a Sort [t2.id] above t2. What if t1 has already been sorted by t1.id? Then no Sort is added on t1, and only t2 is sorted by t2.id before the Sort-Merge Join. A simplified sketch of this idea follows below.
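Conceptually (a simplified sketch of the idea only, not the real EnsureRequirements rule; it reuses the names of the SparkPlan properties requiredChildOrdering and outputOrdering, and ignores distribution requirements):

import org.apache.spark.sql.execution.{SortExec, SparkPlan}

// Add a Sort on a child only when its existing output ordering does not
// already satisfy what the parent operator requires.
def ensureSorted(plan: SparkPlan): SparkPlan = {
  val newChildren = plan.children.zip(plan.requiredChildOrdering).map {
    case (child, required) =>
      if (required.isEmpty || child.outputOrdering.startsWith(required)) {
        child                                      // e.g. t1 already sorted by t1.id
      } else {
        SortExec(required, global = false, child)  // e.g. sort t2 by t2.id
      }
  }
  plan.withNewChildren(newChildren)
}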
Catalyst in Apache Spark
Spark Core (RDD) sits at the bottom; Catalyst is built on top of it, and SQL and the DataFrame/Dataset APIs sit on Catalyst (together forming Spark SQL). Higher-level libraries such as ML Pipelines, Structured Streaming, and GraphFrames, as well as data sources like { JSON }, JDBC, and more, build on these APIs. With Spark 2.0, we expect most users to migrate to the high level APIs (SQL, DataFrame, and Dataset).
Where to Start
• Source code:
  • Trees: TreeNode, Expression, Logical Plan, and Physical Plan
  • Transformations: Analyzer, Optimizer, and Planner
• Check out previous pull requests
• Start to write code using Catalyst
SPARK-12032: ~200 lines of changes (95 lines are for tests); certain workloads can run hundreds of times faster.
SPARK-8992: Pivot table support; ~250 lines of changes (99 lines are for tests). A usage sketch follows below.
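For reference (an illustrative example, not from the slides; the table and column names are made up), the pivot support added by SPARK-8992 is used through the DataFrame API like this:

// Turn the distinct values of "course" into columns, summing "earnings"
// for each (year, course) pair.
salesDF.groupBy("year").pivot("course").sum("earnings")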
Try Apache Spark with Databricks
• Try the latest version of Apache Spark and a preview of Spark 2.0
http://databricks.com/try
Thank you.
Office hour: 2:45pm–3:30pm @ Expo Hall