Adi Polak, Sr. Software Engineer @ Akamai
Spark UDFs are EviL,
Catalyst to the rEsCue!
• Adi Polak
• Sr. Software Engineer @ Akamai
• Former security researcher
• Majored in Machine Learning
• Tel Avivian
• BGU alumni
• Co-founder of FLIP
• Spark & Scala enthusiast
• Foodie
Who am I
@adipolak
• Apache Spark with Scala
• Spark 1.6
• Catalyst optimization
• Spark custom UDFs
Bottom line..
CATALYST
Fundamentals of Catalyst Optimizer
Tree rewrite example:
SUB( Attribute(x), SUB( some_func(1), some_func(2) ) )
⇒ SUB( Attribute(x), some_func(-1) )
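The rewrite above can be sketched as a toy constant-folding rule. Note these `Expr` types are simplified stand-ins for illustration, not Catalyst's real `Expression` classes:

```scala
// Simplified expression tree, standing in for Catalyst's internal one.
sealed trait Expr
case class Attribute(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr

// Constant folding: recursively replace SUB(Literal, Literal)
// with the pre-computed Literal before execution.
def fold(e: Expr): Expr = e match {
  case Sub(l, r) =>
    (fold(l), fold(r)) match {
      case (Literal(a), Literal(b)) => Literal(a - b)
      case (fl, fr)                 => Sub(fl, fr)
    }
  case other => other
}

// The inner pair is foldable, so the tree shrinks at planning time:
fold(Sub(Attribute("x"), Sub(Literal(1), Literal(2))))
// => Sub(Attribute(x), Literal(-1))
```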
Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
Physical Planning –> Planning strategies
Catalyst
Frontend Backend
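A minimal sketch of watching these phases at work, assuming a local SparkSession (the column names here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (5, "b")).toDF("id", "name")

// explain(true) prints all four plans: parsed logical, analyzed
// logical, optimized logical, and physical. In the optimized plan
// the constant expression 1 + 2 has been folded into the literal 3.
df.filter($"id" > lit(1) + lit(2)).explain(true)
```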
What is Spark Custom UDF
What is Spark Custom UDF
"Use the higher-level standard Column-based functions with
Dataset operators whenever possible before reverting to
using your own custom UDF functions since UDFs are a
blackbox for Spark and so it does not even try to optimize
them."
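A minimal sketch of such a custom UDF (the data and names are illustrative). The point of the quote is visible here: Catalyst sees only "call this function", never the logic inside the lambda:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// The Scala lambda is a black box to the optimizer: it cannot be
// folded, pushed down, or rewritten like a built-in expression.
val toUpper = udf((s: String) => s.toUpperCase)
df.withColumn("upper_name", toUpper($"name")).show()
```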
What do we lose when using a custom UDF?
• Constant folding
• Predicate pushdown
• Null handling
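The null-handling point in particular can be sketched as follows (column names are illustrative). Built-in functions propagate nulls for free; a naive UDF must guard against them by hand:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, upper}

val spark = SparkSession.builder()
  .master("local[*]").appName("null-demo").getOrCreate()
import spark.implicits._

// One row has a null name.
val people = Seq((1, "alice"), (2, null)).toDF("id", "name")

val naive   = udf((s: String) => s.toUpperCase)                // NPE on null
val guarded = udf((s: String) => Option(s).map(_.toUpperCase)) // null-safe

people.select(upper($"name")).show()   // built-in: null-safe for free
people.select(guarded($"name")).show() // UDF: must guard manually
```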
What can we do?
Use queryExecution & explain(true)
Catalyst
Frontend Backend
Use queryExecution & explain(true) API
Lost: push-down filter
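A sketch of how the lost pushdown shows up in `explain(true)`, assuming a file-based source (the parquet path and `age` column are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .master("local[*]").appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Hypothetical parquet path, for illustration only.
val users = spark.read.parquet("/data/users.parquet")

// Built-in comparison: the plan lists the predicate under
// PushedFilters, i.e. it is evaluated while scanning the files.
users.filter($"age" >= 18).explain(true)

// Same logic behind a UDF: Catalyst cannot push the opaque filter
// into the scan, so every row is read and the UDF runs on all of them.
val isAdult = udf((age: Int) => age >= 18)
users.filter(isAdult($"age")).explain(true)
```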
What can be done instead?
sql functions (DataFrame API):
Aggregate functions
Collection functions
Date time functions
Math functions
Non-aggregate functions
Sorting functions
String functions
Window functions
sql functions (Column API)
Expression operations..
How can I find what functions are available?
version, arrayContains, minute, round, rand, spark_partition_id, isin …
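A short sketch exercising several of these built-ins in place of a UDF. The schema is made up, and exact function availability varies by Spark version (e.g. `to_timestamp` arrived after 1.6):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("builtins-demo").getOrCreate()
import spark.implicits._

// Illustrative schema: price, event timestamp, and a tags array.
val orders = Seq(
  (12.345, "2019-01-01 10:30:00", Seq("sale", "new")),
  (7.5,    "2019-01-01 11:45:00", Seq("used"))
).toDF("price", "ts", "tags")

orders.select(
  round($"price", 2),              // math function
  minute(to_timestamp($"ts")),     // date/time function
  array_contains($"tags", "sale"), // collection function
  spark_partition_id(),            // non-aggregate function
  $"price".isin(7.5, 12.0)         // Column API
).show()
```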
Can you show a complex example?
Using column functions ...
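A sketch of the kind of branching logic that often ends up in a UDF, expressed entirely with column functions so Catalyst can still optimize around it (the grading scheme and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("column-funcs").getOrCreate()
import spark.implicits._

val scores = Seq(("alice", 93), ("bob", 84), ("carol", 61))
  .toDF("name", "score")

// when/otherwise builds a CASE WHEN expression: a real Catalyst
// expression tree, not an opaque lambda.
val grade = when($"score" >= 90, "A")
  .when($"score" >= 80, "B")
  .otherwise("C")

scores.withColumn("grade", grade).show()
```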
TEST – BENCHMARKS
Total output rows – 100
• Average execution time per row with UDF: ~299.9 ms ⇒ 26 sec
• Average execution time per row without one of the UDFs: ~265.93 ms ⇒ 23 sec
Takeaways
• Use UDFs as a last resort
• Always check yourself with df.explain(true)
• Act accordingly
THANK YOU
@adipolak
Editor's Notes

  • #4 Two years ago we talked about an SLA of 12 hours; six months ago we talked about an SLA of one hour; and the products we are working on today need to meet an SLA of just a few minutes — a real-time product handling 12 million records per minute. Today I'll explain how you can improve your performance by avoiding UDFs.
  • #9 SQL query → query parser → AST (abstract syntax tree). Catalog: used to resolve the attributes, tables, etc. Logical optimization: rule-based. Cost model: helps choose the JOIN algorithm. Code gen: Janino / quasiquotes.
  • #17 SQL Parser - AST - abstract syntax tree
  • #19 SQL Parser - AST - abstract syntax tree
  • #22 broadcast-hash-join