Adi Polak, Sr. Software Engineer @ Akamai
Spark UDFs are EviL,
Catalyst to the rEsCue!
• Adi Polak
• Sr. Software Engineer @ Akamai
• Former security researcher
• Majored in Machine Learning
• Tel Avivian
• BGU alumni
• Co-founder of FLIP
• Spark & Scala enthusiast
• Foodie
Who am I
@adipolak
• Apache Spark with Scala
• Spark 1.6
• Catalyst optimization
• Spark custom UDFs
Bottom line..
CATALYST
Fundamentals of Catalyst Optimizer
Tree rewrite example:
SUB( Attribute(x), SUB( some_func(1), some_func(2) ) )
⇒ SUB( Attribute(x), some_func(-1) )
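The rewrite above can be sketched as a toy constant-folding rule. Note these `Expr` types are simplified stand-ins for illustration, not Catalyst's real `Expression` classes:

```scala
// Simplified expression tree, standing in for Catalyst's internal one.
sealed trait Expr
case class Attribute(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr

// Constant folding: recursively replace SUB(Literal, Literal)
// with the pre-computed Literal before execution.
def fold(e: Expr): Expr = e match {
  case Sub(l, r) =>
    (fold(l), fold(r)) match {
      case (Literal(a), Literal(b)) => Literal(a - b)
      case (fl, fr)                 => Sub(fl, fr)
    }
  case other => other
}

// The inner pair is foldable, so the tree shrinks at planning time:
fold(Sub(Attribute("x"), Sub(Literal(1), Literal(2))))
// => Sub(Attribute(x), Literal(-1))
```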
Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
Physical Planning –> Planning strategies
Catalyst
Frontend Backend
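A minimal sketch of watching these phases at work, assuming a local SparkSession (the column names here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (5, "b")).toDF("id", "name")

// explain(true) prints all four plans: parsed logical, analyzed
// logical, optimized logical, and physical. In the optimized plan
// the constant expression 1 + 2 has been folded into the literal 3.
df.filter($"id" > lit(1) + lit(2)).explain(true)
```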
What is Spark Custom UDF
What is Spark Custom UDF
"Use the higher-level standard Column-based functions with
Dataset operators whenever possible before reverting to
using your own custom UDF functions since UDFs are a
blackbox for Spark and so it does not even try to optimize
them."
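A minimal sketch of such a custom UDF (the data and names are illustrative). The point of the quote is visible here: Catalyst sees only "call this function", never the logic inside the lambda:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// The Scala lambda is a black box to the optimizer: it cannot be
// folded, pushed down, or rewritten like a built-in expression.
val toUpper = udf((s: String) => s.toUpperCase)
df.withColumn("upper_name", toUpper($"name")).show()
```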
What do we lose when using a custom UDF?
• Constant folding
• Predicate pushdown
• Null handling
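The null-handling point in particular can be sketched as follows (column names are illustrative). Built-in functions propagate nulls for free; a naive UDF must guard against them by hand:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, upper}

val spark = SparkSession.builder()
  .master("local[*]").appName("null-demo").getOrCreate()
import spark.implicits._

// One row has a null name.
val people = Seq((1, "alice"), (2, null)).toDF("id", "name")

val naive   = udf((s: String) => s.toUpperCase)                // NPE on null
val guarded = udf((s: String) => Option(s).map(_.toUpperCase)) // null-safe

people.select(upper($"name")).show()   // built-in: null-safe for free
people.select(guarded($"name")).show() // UDF: must guard manually
```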
What can we do?
Use queryExecution & explain(true)
Catalyst
Frontend Backend
Use queryExecution & explain(true) API
Lost: push-down filter
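A sketch of how the lost pushdown shows up in `explain(true)`, assuming a file-based source (the parquet path and `age` column are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .master("local[*]").appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Hypothetical parquet path, for illustration only.
val users = spark.read.parquet("/data/users.parquet")

// Built-in comparison: the plan lists the predicate under
// PushedFilters, i.e. it is evaluated while scanning the files.
users.filter($"age" >= 18).explain(true)

// Same logic behind a UDF: Catalyst cannot push the opaque filter
// into the scan, so every row is read and the UDF runs on all of them.
val isAdult = udf((age: Int) => age >= 18)
users.filter(isAdult($"age")).explain(true)
```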
What can be done instead?
sql functions (DataFrame API):
Aggregate functions
Collection functions
Date time functions
Math functions
Non-aggregate functions
Sorting functions
String functions
Window functions
sql functions (Column API)
Expression operations..
How can I find what functions are available?
version, arrayContains, minute, round, rand, spark_partition_id, isin …
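A short sketch exercising several of these built-ins in place of a UDF. The schema is made up, and exact function availability varies by Spark version (e.g. `to_timestamp` arrived after 1.6):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("builtins-demo").getOrCreate()
import spark.implicits._

// Illustrative schema: price, event timestamp, and a tags array.
val orders = Seq(
  (12.345, "2019-01-01 10:30:00", Seq("sale", "new")),
  (7.5,    "2019-01-01 11:45:00", Seq("used"))
).toDF("price", "ts", "tags")

orders.select(
  round($"price", 2),              // math function
  minute(to_timestamp($"ts")),     // date/time function
  array_contains($"tags", "sale"), // collection function
  spark_partition_id(),            // non-aggregate function
  $"price".isin(7.5, 12.0)         // Column API
).show()
```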
Can you show a complex example?
Using column functions ...
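A sketch of the kind of branching logic that often ends up in a UDF, expressed entirely with column functions so Catalyst can still optimize around it (the grading scheme and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]").appName("column-funcs").getOrCreate()
import spark.implicits._

val scores = Seq(("alice", 93), ("bob", 84), ("carol", 61))
  .toDF("name", "score")

// when/otherwise builds a CASE WHEN expression: a real Catalyst
// expression tree, not an opaque lambda.
val grade = when($"score" >= 90, "A")
  .when($"score" >= 80, "B")
  .otherwise("C")

scores.withColumn("grade", grade).show()
```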
TEST – BENCHMARKS
Total output rows – 100
• Average execution time per row with UDF: ~299.9 ms ⇒ 26 sec
• Average execution time per row without one of the UDFs: ~265.93 ms ⇒ 23 sec
Takeaways
• Use UDFs as a last resort
• Always check yourself with df.explain(true)
• Act accordingly
THANK YOU
@adipolak
Editor's Notes

  • #4 Two years ago we talked about an SLA of 12 hours; six months ago we talked about an SLA of one hour; and the products we are working on today need to meet an SLA of just a few minutes — a real-time product handling 12 million records per minute. Today I'll explain how you can improve your performance by avoiding UDFs.
  • #9 SQL query → query parser → AST (abstract syntax tree). Catalog: used to resolve the attributes, tables, etc. Logical optimization: rule-based. Cost model: helps choose the JOIN algorithm. Code gen: Janino / quasiquotes.
  • #17 SQL Parser - AST - abstract syntax tree
  • #19 SQL Parser - AST - abstract syntax tree
  • #22 broadcast-hash-join