Optimizing the Catalyst Optimizer for Complex Plans
Jianneng Li
Software Engineer, Workday
Asif Shahid
Software Engineer, Workday
Safe Harbor Statement
This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday’s results is included in our filings with the Securities and Exchange Commission, which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.
Agenda
▪ Workday Prism Analytics
▪ Complex Plans
▪ Handling Complex Plans
▪ Common Subexpression Elimination (CSE)
▪ Large Case Expressions
▪ Constraint Propagation
▪ Closing Thoughts
Workday Prism Analytics
Spark in Workday Prism Analytics
• Customers use our self-service product to build data transformation pipelines, which are compiled to DataFrames and executed by Spark
• Finance and HR use cases
• Use cases often involve complex data pipelines
Example Spark physical plan of our pipeline shown in Spark UI
For more details, see session from SAIS 2019 - Lessons Learned using Apache Spark for Self-Service Data Prep in SaaS World
Complex Plans
What are Complex Plans
• Thousands of operators
• Many (self) joins, unions, and large expressions
• Takes Catalyst hours to compile and optimize
• Difficult to understand or inspect visually
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
Part 1: filter rows based on criteria
SELECT *
FROM dataset
WHERE id > 1 AND name != "b"
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
Part 2: if a row is filtered, other rows in the same group_id are also filtered
SELECT *
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
LEFT ANTI JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
Part 3: compute invalid rows too
SELECT *
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
LEFT ANTI JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
SELECT *
FROM dataset
WHERE NOT (id > 1 AND name != "b")
UNION ALL
SELECT l.id, l.name, l.group_id
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
INNER JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Example: Data Validation
id name group_id
1 a x
2 b y
3 c y
4 d z
SELECT *
FROM dataset
WHERE NOT (id > 1 AND name != "b")
UNION ALL
SELECT l.id, l.name, l.group_id
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
INNER JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
Part 4: show unique error message for each filter criterion
SELECT *
FROM
(SELECT *
FROM dataset
WHERE id > 1 AND name != "b") l
LEFT ANTI JOIN
(SELECT group_id
FROM dataset
WHERE NOT (id > 1 AND name != "b")
GROUP BY group_id) r
ON l.group_id = r.group_id
MORE SELF UNIONS
About Complex Plans
• Complexity increases gradually over time
• Could ask customers to optimize their pipelines, but it is much better if
performance is good without manual tuning
Handling Complex Plans
Common Subexpression Elimination (CSE)
Common Subexpression Elimination (CSE)
• Identify shared subplans, and cache them
• E.g. self joins, self unions, reused scans
• Performed while creating DataFrames
• Heuristic
• Algorithmic
Common Subexpression Elimination (CSE)
Union(
Parse(“Dataset A”),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”)),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”))
)
Common Subexpression Elimination (CSE)
Union(
Cache(ID=1,
Parse(“Dataset A”)),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”)),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”))
)
Union(
Parse(“Dataset A”),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”)),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”))
)
Common Subexpression Elimination (CSE)
Union(
Cache(ID=1,
Parse(“Dataset A”)),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”)),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”))
)
Union(
Cache(ID=1,
Parse(“Dataset A”)),
Cache(ID=2,
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”))),
Cache(ID=2, ∅)
)
Union(
Parse(“Dataset A”),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”)),
Join(
Parse(“Dataset A”),
Parse(“Dataset B”))
)
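The caching pass sketched in the trees above can be illustrated outside Spark as a structural-hashing rewrite over a small plan tree. This is a minimal sketch: `Node`, `Cache`, and `CacheRef` are hypothetical names for this example (the slide writes the back-reference as Cache(ID, ∅)), not Catalyst classes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

# Minimal illustrative plan node; structural equality and hashing come from
# the frozen dataclass, which is what lets identical subtrees compare equal.
@dataclass(frozen=True)
class Node:
    op: str                      # e.g. "Parse A", "Join", "Union"
    children: Tuple = ()

def subtrees(plan):
    yield plan
    for child in plan.children:
        yield from subtrees(child)

def cse(plan):
    """Wrap each subtree that occurs more than once in a Cache node on its
    first use, and replace later occurrences with a CacheRef."""
    counts = Counter(subtrees(plan))        # count structurally equal subtrees
    ids = {}
    def rewrite(node):
        new = Node(node.op, tuple(rewrite(c) for c in node.children))
        if counts[node] > 1:
            if node in ids:                 # already cached: just reference it
                return Node(f"CacheRef({ids[node]})")
            ids[node] = len(ids) + 1
            return Node(f"Cache({ids[node]})", (new,))
        return new
    return rewrite(plan)
```

Applied to the Union example above, the first Join is wrapped in a cache and the second Join collapses to a reference, matching the final tree on the slide.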
CSE Benchmark

                                       Without CSE   With CSE
Number of operators in optimized plan  10K           150
Time to compile and optimize plan      10 minutes    30 seconds

4 Data Validations in one data pipeline
Spark 2.4, local mode, 4GB memory
Logging Complex Plans (10s of MBs in Size)
• Stream plans to log without generating them upfront
• Send only truncated plans to log aggregation service
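The two logging ideas above can be sketched with a generator: render the plan one line at a time instead of building the full multi-megabyte string, and cut off after a line budget. The `(name, [children])` node encoding and function names are hypothetical, for illustration only.

```python
import itertools

def plan_lines(node, depth=0):
    """Lazily yield one rendered line per operator instead of materializing
    the whole plan string up front. A node is a (name, [children]) pair."""
    name, children = node
    yield "  " * depth + name
    for child in children:
        yield from plan_lines(child, depth + 1)

def log_truncated(node, max_lines=100):
    """Render at most max_lines lines to send to a log aggregation service."""
    lines = list(itertools.islice(plan_lines(node), max_lines + 1))
    if len(lines) > max_lines:
        lines[max_lines:] = ["... (truncated)"]
    return "\n".join(lines)
```

Because `plan_lines` is lazy, `log_truncated` does work proportional to the line budget, not to the size of the plan.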
Large Case Expressions
Large Case Expression
1000s of branches
CASE
WHEN f(a, b) = 1 then 1
WHEN f(a, b) = 2 then 2
...
WHEN f(a, b) = 1000 then 1000
ELSE -1
END
Problems with Large Case Expressions
• Function f is evaluated once for each branch
• Inlined into nested Projects (CollapseProject rule)
• OOM during code generation (SPARK-29561)
Relation
a, b
Project
(CASE
WHEN f(a, b) = 1 then 10
WHEN f(a, b) = 2 then 20
...
END) as c
Project
c, c as c1
Relation
a, b
Project
(CASE
WHEN k = 1 then 10
WHEN k = 2 then 20
...
END) as c,
(CASE
WHEN k = 1 then 10
WHEN k = 2 then 20
...
END) as c1
Handling Large Case Expressions in Catalyst
• Identify large expressions and avoid collapsing them
• Identify and extract f as an alias
• Only if f is used more than once
• Disable whole stage codegen if too many branches
Relation
a, b
Project
(CASE
WHEN f(a, b) = 1 then 10
WHEN f(a, b) = 2 then 20
...
END) as c
Project
c, c as c1
Relation
a, b
Project
(CASE
WHEN k = 1 then 10
WHEN k = 2 then 20
...
END) as c
Project
f(a, b) as k
Project
c, c as c1
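The extraction step above, pulling the repeated f(a, b) out as an alias k evaluated once in its own Project, can be sketched as a rewrite over expression trees. This is an illustrative encoding (nested tuples, with "k" as the alias name from the slide), not Catalyst's expression classes.

```python
from collections import Counter

def subexprs(expr):
    # Expressions as nested tuples, e.g. ("f", "a", "b") for f(a, b)
    if isinstance(expr, tuple):
        yield expr
        for child in expr[1:]:
            yield from subexprs(child)

def extract_common(expr, alias="k"):
    """If a non-trivial subexpression occurs more than once, replace every
    occurrence with an alias reference so it is evaluated a single time in
    a parent Project. Returns (rewritten expr, (alias, extracted expr))."""
    counts = Counter(subexprs(expr))
    shared = [s for s in counts if counts[s] > 1]
    if not shared:
        return expr, None                       # nothing worth extracting
    target = max(shared, key=lambda s: counts[s])  # most repeated subexpression
    def sub(e):
        if e == target:
            return alias
        if isinstance(e, tuple):
            return (e[0],) + tuple(sub(c) for c in e[1:])
        return e
    return sub(expr), (alias, target)
```

On a CASE with N branches over f(a, b), this turns N evaluations of f into one, at the cost of one extra Project.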
Large Case Expression Benchmark
SELECT CASE
WHEN cf1 + cf2 = -1 then 1
WHEN cf1 + cf2 = -2 then 2
...
END as cf3
FROM (SELECT cf1, cf1 AS cf2
FROM (SELECT CASE
WHEN f(a) = 1 AND g(b) = 1 THEN 1
WHEN f(a) = 2 AND g(b) = 1 THEN 2
...
END as cf1
FROM dataset))
Spark 3.1, local mode, 4GB memory
Constraint Propagation
What are Constraints
• Filters on column values
• Can be used to
• Generate new filters (e.g. IsNotNull)
• Prune redundant filters
• Push down new filters on the "other" side of a join
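The first use, deriving an IsNotNull constraint from a filter, can be sketched as follows. This is a simplified model assuming null-intolerant predicates (a null input cannot satisfy the filter); the tuple encoding is illustrative, not Catalyst's.

```python
def references(expr):
    """Collect column names from a predicate written as nested tuples,
    e.g. (">", "a", 10) for `a > 10`."""
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from references(child)
    elif isinstance(expr, str):
        yield expr

def infer_constraints(predicate):
    """Every surviving row satisfies the Filter's predicate, and for the
    null-intolerant predicates assumed here a null input cannot satisfy it,
    so each referenced column also gains an IsNotNull constraint."""
    constraints = {predicate}
    constraints |= {("IsNotNull", col) for col in set(references(predicate))}
    return constraints
```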
Example: Generate New Filter
Relation
a, b
Filter
a > 10
Constraints:
a is not null
a > 10
Filter
a > 10
IsNotNull(a)
Relation
a, b
Example: Prune Redundant Filter
Relation
a, b
Filter
a > 10
Constraints:
a is not null
a > 10
Filter
a > 10
IsNotNull(a)
Relation
a, b
Filter
a1 > 10
Project
a, b, a as a1
Project
a, b, a as a1
Example: New Filter on “Other” Side of Join
Relation
a, b
Filter
a > 5,
a!=null
Project
a, a as a1,
b
Constraints:
a is not null
a > 5
Join
a1 == x
b == y
Relation
x, y
Filter
x > 5
x != null
Relation
a, b
Filter
a > 5,
a!=null
Project
a, a as a1,
b
Join
a1 == x
b == y
Relation
x, y
Current Constraint Propagation Algorithm
• Traverses tree from bottom to top
• On Filter node, create additional IsNotNull constraints
• On Project node with alias, create all possible
combinations of constraints
Relation
a, b
Filter
a > 10
Constraints:
a is not null
a > 10
Project
a, b, a as a1
Constraints:
a is not null
a > 10
a1 is not null
a1 > 10
EqualsNullSafe(a, a1)
Current Constraint Propagation Algorithm
• To prune filter
• Check if the filter already exists in constraints
• To add a new filter to right hand side of join
• Check if any constraint exists on join key
• Consider only those constraints dependent on a single join key
Current Algorithm Takes High Memory
• Given a filter function F(a, b), if
• count of attribute a and its aliases is m
• count of attribute b and its aliases is n
• Then total intermediate constraints created for 1 such filter expression
≈ m * n (alias combinations) + m + n (IsNotNull constraints) + C(m, 2) + C(n, 2) (EqualsNullSafe constraints)
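The formula above is easy to evaluate directly; a small helper (hypothetical, just encoding the slide's approximation) shows how quickly it grows with the number of aliases:

```python
from math import comb

def intermediate_constraints(m, n):
    """Approximate intermediate constraints the current algorithm creates for
    one filter F(a, b): m * n alias-combination constraints, m + n IsNotNull
    constraints, and C(m, 2) + C(n, 2) EqualsNullSafe constraints."""
    return m * n + m + n + comb(m, 2) + comb(n, 2)
```

With no aliases (m = n = 1) this is 3 constraints, but with 10 alias forms of each attribute it is already 210, which is the combinatorial growth behind the memory problem.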
Recall: Fix for Large Case Expressions
• We created new aliases!
• New aliases cause OOM in Catalyst, due to
• Large number of aliases
• Large number of operators in plan
Optimized Constraint Propagation (SPARK-33152)
• Traverses tree from bottom to top
• On Filter node, create additional IsNotNull constraints
• On Project node, create lists where
• Each list tracks an original attribute and its aliases; constraints are
stored in terms of the original attribute
Relation
a, b
Filter
a > 10
Constraints:
a is not null
a > 10
Project
a, b, a as a1
Constraints:
a is not null
a > 10
Aliases:
[a, a1]
Optimized Constraint Propagation (SPARK-33152)
• To prune filter
• Rewrite expression in terms of original attribute
• a1 > 10 becomes a > 10
• Check if canonical version already exists in constraints
Relation
a, b
Filter
a > 10
Project
a, b, a as a1
Constraints:
a is not null
a > 10
Aliases:
[a, a1]
Filter
a1 > 10
Relation
a, b
Filter
a > 10
Project
a, b, a as a1
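The pruning step above rests on canonicalization: rewrite a candidate filter in terms of original attributes, then check set membership. A minimal sketch, using the same illustrative tuple encoding as before (not Catalyst's classes):

```python
def canonicalize(expr, alias_to_original):
    """Rewrite an expression (nested tuples, e.g. (">", "a1", 10)) in terms
    of original attributes, so `a1 > 10` and `a > 10` compare equal."""
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(canonicalize(c, alias_to_original)
                                  for c in expr[1:])
    return alias_to_original.get(expr, expr) if isinstance(expr, str) else expr

def is_redundant(filter_expr, constraints, alias_to_original):
    """A filter can be pruned when its canonical form is already a known
    constraint."""
    return canonicalize(filter_expr, alias_to_original) in constraints
```

Because constraints are stored once per original attribute, the check stays a single lookup no matter how many aliases exist, instead of materializing every alias combination.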
Optimized Constraint Propagation (SPARK-33152)
• To add a new filter to right hand side of join
• Rewrite expression in terms of original attributes
• Check if any constraint exists on join key
Relation
a, b
Filter
a + b > 5,
a!=null,
b!=null
Project
a, a as a1,
b, b as b1
Constraints:
a is not null
b is not null
a + b > 5
Aliases:
[a, a1]
[b, b1]
Join
a1 == x
b1 == y
Relation
x, y
Filter
x + y > 5
x != null
y != null
Relation
a, b
Filter
a + b > 5,
a!=null,
b!=null
Project
a, a as a1,
b, b as b1
Join
a1 == x
b1 == y
Relation
x, y
Constraint Propagation Algorithms Comparison

                                   Current Algorithm                                  Improved Algorithm
Number of constraints              Combinatorial, dependent on the number of aliases  Independent of the number of aliases
Memory usage                       High                                               Low
Filter pushdown for join           Single-reference filters                           Single-reference and compound filters
Creation of IsNotNull constraints  Can miss IsNotNull constraints                     Detects more IsNotNull constraints
Constraint Propagation Benchmark
SELECT cf
FROM (SELECT cf
FROM (SELECT CASE
WHEN abs(c01) < 1 THEN 1
WHEN abs(c01) < 2 THEN 2
WHEN abs(c02) < 1 THEN 3
WHEN abs(c02) < 2 THEN 4
...
END AS cf
FROM (SELECT sum(a + a) AS c01,
sum(a + b) AS c02
...
FROM dataset
GROUP BY a))
WHERE cf > 0)
INNER JOIN letters ON a = cf
Spark 3.1, local mode, 4GB memory
Effect on Customer Pipeline
• Financial use case for
large insurance company
• Uses nested case
statements to validate
and categorize data
Closing Thoughts
Tuning Tips
• Take advantage of CSE
• Reduce the number of operators
• Limit the number of aliases
• Follow SPARK-33152 to receive updates on the
improved constraint propagation algorithm
Future Work
• Improve logic for Catalyst rules
• PushDownPredicates
• CollapseProject
• Implement rules engine in Spark
• Algorithms for converting to lookup table
• Rete Algorithm
Thank You
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.