Portable UDFs: Write Once, Run Anywhere

Most query engines ship with a rich set of built-in functions, but these do not cover every user's needs. User-defined functions (UDFs) fill the gap by letting users express their business logic and call it from their queries. Users commonly rely on more than one compute engine to solve their data problems. At Facebook, we provide multiple systems for ad hoc, batch, and streaming / real-time workloads, and users pick a system based on their needs and the problem at hand. Each system typically has its own way of creating UDFs, so a UDF defined in one system will sooner or later be needed in the others as well. Users then end up rewriting the same UDF once for every system they want to use it in.

In this talk, we’ll take a deep dive into Portable UDFs, our way of letting users write a function once in an engine-agnostic way and use it across several compute engines. We’ll present the motivation, design, and current state of the Portable UDF project.

Portable UDFs: Write Once, Run Anywhere

  1. Portable UDFs: Write Once, Run Anywhere. Rongrong Zhong & Tejas Patil, Software Engineers, Facebook
  2. Introductions. Rongrong Zhong: • Presto committer • TL in Presto Team. Tejas Patil: • Apache Spark committer • Co-founded Spark team @ FB • TL in Spark Team
  3. Agenda • Why Portable UDFs? • What are Portable UDFs? • How do Portable UDFs work?
  4. Why Portable UDFs?
  5. Why Portable UDFs? [Diagram: User, Presto UDF]
  6. Presto’s ‘array_contains’ UDF
  7. Why Portable UDFs? [Diagram: User, Presto UDF] Does not work :(
  8. Presto’s ‘array_contains’ UDF vs. Spark’s ‘array_contains’ UDF
  9. Why Portable UDFs? [Diagram: User, Presto UDF, Spark UDF]
  10. Why Portable UDFs? • Learn engine specifics
  11. Why Portable UDFs? • Learn engine specifics • Rewrite UDF logic
  12. Why Portable UDFs? • Learn engine specifics • Rewrite UDF logic • Bugs in rewrite (e.g. corner cases: NULL, 0, empty, negative, ...)
  13. Why Portable UDFs? [Diagram: User, Presto UDF, Spark UDF, Presto UDF; versions v1, v2, v1]
  14. Why Portable UDFs? • Learn engine specifics • Rewrite UDF logic • Bugs in rewrite • Maintain both versions
  15. What are Portable UDFs?
  16. Portable UDFs [Diagram: User, Portable UDF, Stream processing engine]
  17. Hello World! • Supported types: boolean, byte (tinyint), short (smallint), int (integer), long (bigint), float (real), double, String (varchar), List (array), Map • Primitive types can be boxed
  18. Function Metadata • Function management metadata: ownership, description, etc. • Function resolution metadata: function signature, calling convention, determinism, etc. • Function execution metadata: package location, version, boxed/unboxed, etc.
  19. How do Portable UDFs work?
  20. Running on Spark
  21. Running on Spark [Diagram: Query, Spark Driver]
  22. Running on Spark [Diagram: Query, Spark Driver, Metastore]
  23. Running on Spark
  24. Running on Spark
  25. Running on Spark
  26. Running on Spark [Diagram: Query, Spark Driver, Metastore, three Spark Executors, Maven server]
  27. Running on Spark [Diagram: Query, Spark Driver, Metastore, three Spark Executors, Maven server]
  28. Spark microbenchmark
  29. Running on Presto: UDF • Presto already supports dynamically registered SQL functions (UDFs) • We extended this to also support external functions • External functions run remotely in a separate cluster
  30. Running on Presto: Portable UDFs • Augment function metadata to indicate whether a function is local or remote • Planner change to allow batch processing • Per-catalog configuration of which remote cluster to send requests to
  31. Running on Presto: Portable UDF Support
  32. Running on Presto: UDF Servers • Thrift service • invokeUdf(functionHandle, inputs) • Tries to prefetch packages in advance to reduce overhead • Retrieves function metadata from the metastore using the function handle and constructs the Java method to be invoked • Executes the function for all inputs and returns the results
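
The UDF server flow on the last slide (resolve the Java method from function metadata, then execute it for every input row) can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual Portable UDF API: the class names `ArrayContainsUdf` and `PortableUdfSketch`, the `invokeBatch` helper, and the NULL-handling policy are all hypothetical, the class and method names passed as strings stand in for the function handle and metastore lookup, and the call runs in-process rather than over Thrift. The UDF itself uses boxed types from the supported-types slide so that NULL inputs can be represented.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical engine-agnostic UDF: plain Java with boxed types so NULL can be passed in. */
final class ArrayContainsUdf {
    private ArrayContainsUdf() {}

    // Returning null for a NULL array or NULL needle is an assumption made for this sketch.
    public static Boolean arrayContains(List<Long> values, Long needle) {
        if (values == null || needle == null) {
            return null;
        }
        return values.contains(needle);
    }
}

/** Hypothetical in-process stand-in for a UDF server: resolve a method, run it per row. */
final class PortableUdfSketch {
    private PortableUdfSketch() {}

    // className, methodName and parameterTypes play the role of the function metadata
    // that a real server would fetch from the metastore (assumption).
    static List<Object> invokeBatch(String className, String methodName,
                                    Class<?>[] parameterTypes, List<Object[]> rows) {
        try {
            Method method = Class.forName(className).getMethod(methodName, parameterTypes);
            List<Object> results = new ArrayList<>(rows.size());
            for (Object[] row : rows) {
                // The UDF is a static method, so the receiver argument is null.
                results.add(method.invoke(null, row));
            }
            return results;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not resolve or invoke UDF " + methodName, e);
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {Arrays.asList(1L, 2L, 3L), 2L}); // needle present
        rows.add(new Object[] {Arrays.asList(1L), 5L});         // needle absent
        rows.add(new Object[] {null, 1L});                      // NULL array
        List<Object> results = invokeBatch(
                ArrayContainsUdf.class.getName(), "arrayContains",
                new Class<?>[] {List.class, Long.class}, rows);
        System.out.println(results); // prints [true, false, null]
    }
}
```

Batching all rows into one call mirrors the planner change mentioned for Presto: the cost of resolving the method is paid once per batch instead of once per row.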
