Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​

These slides describe LinkedIn efforts in building a view virtualization layer that enables compute engines to access Hive views, reason about them semantically, and optionally rewrite them before presenting to various compute engines such as Spark, Hive, and Presto. Namely, we describe two frameworks: Coral for reasoning about views using their logical plans, and Transport UDFs for reasoning about UDFs.

  • Be the first to comment

Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​

  1. 1. Coral & Transport UDFs Building Blocks of a Postmodern Data Warehouse​ Walaa Eldin Moustafa Staff Software Engineer @
  2. 2. The lifecycle of a data application at LinkedIn [Simplified] HDFS Dali Tables Dali Views
  3. 3. The lifecycle of a data application at LinkedIn [Simplified] HDFS Dali Tables Dali Views
  4. 4. The lifecycle of a data application at LinkedIn [Simplified] HDFS Dali Tables Dali Views
  5. 5. Why do we like catalogs and tables Location independence Format independence Statistics Atomicity SQL and compute engines Data availability and completeness PII and data governance Type system
  6. 6. Why do we like catalogs and tables Location independence Format independence Statistics Atomicity SQL and compute engines Data availability and completeness PII and data governance Type system
  7. 7. Why do we like views Data cleaning Data transformation Data abstraction ML Features Data staging Data reuse
  8. 8. View Virtualization View Virtualization Dali Tables Dali Views
  9. 9. View Virtualization Coral ⋈ Dali Tables Dali Views
  10. 10. Dali Views SELECT my_udf(c) FROM R JOIN S JOIN T WHERE date > today() - 5 AND date <= today() TBLPROPERTIES( 'functions' = 'my_udf: com.linkedin.MyUDF', 'dependencies' = 'group:artifact:0.0.1') SQL + UDFs View.SQL Functions .gradle Code Review Test Publish Deploy
  11. 11. Coral ⋈
  12. 12. Coral SELECT my_udf(c) FROM R JOIN S JOIN T WHERE date > today() - 5 AND date <= today() TBLPROPERTIES( 'functions' = 'my_udf', 'dependencies' = 'group:artifact:0.0.1') ⋈ σ T ⋈ R S Coral IR coral-hive
  13. 13. Coral ⋈ σ T ⋈ R S Coral IR identity coral-spark coral-presto coral-pig DataRealm
  14. 14. Coral-Hive Metastore Client Calcite Schema Calcite Table Hive Schema Hive Table Implements Implements Used by Calcite SqlOperatorTable DaliOperatorTable Implements Calcite TypeSystem Hive TypeSystem Implements
  15. 15. DaliOperatorTable f1 new SqlOperator(…) f2 f3 new SqlOperator(…) new SqlOperator(…) UDF Type Inference Calcite Type Inference
  16. 16. Coral-Presto ⋈ σ T ⋈ R S Coral IR ⋈ σ T ⋈ R S Presto Rel SELECT presto_udf(c + 1) FROM R JOIN S JOIN T WHERE date > today() - 5 AND date <= today()
  17. 17. Coral-Pig ⋈ σ T ⋈ R S Coral IR ⋈ σ T ⋈ R S Pig Rel R = LOAD "R" S = LOAD "S" T = LOAD "T" RS = R JOIN S RSF = FILTER RS BY ... RSFT = RSF JOIN T
  18. 18. Presto on Spark, or Spark on Presto? ⋈ σ T ⋈ R S Coral IR identity coral-spark coral-presto coral-pig DataRealm
  19. 19. Presto on Spark, or Spark on Presto? ⋈ σ T ⋈ R S Coral IR identity coral-spark coral-presto coral-pig DataRealm Both, and more!
  20. 20. What does the future look like?
  21. 21. Flexible Language Interface GraphQL Pig Latin Dataflow Datalog user { name address{} } A = LOAD B = LOAD C = A JOIN B d = load() .filter() .map() .count() TwoHop (X,Y) :- E(X,Z), E(Z,Y).
  22. 22. Flexible Language Interface GraphQL Pig Latin Dataflow Datalog user { name address{} } A = LOAD B = LOAD C = A JOIN B d = load() .filter() .map() .count() TwoHop (X,Y) :- E(X,Z), E(Z,Y). DataRealm
  23. 23. Flexible Language Interface GraphQL Pig Latin Dataflow Datalog user { name address{} } A = LOAD B = LOAD C = A JOIN B d = load() .filter() .map() .count() TwoHop (X,Y) :- E(X,Z), E(Z,Y). W. E. Moustafa, et al. Datalography: Scaling Datalog Graph Analytics on Graph Processing Systems. In IEEE BigData 2016.
  24. 24. Workload Analytics ⋈ σ T ⋈ R S ⋈ σ T R ⋈ σ T ⋈ Q S
  25. 25. Materialized View Selection ⋈ σ T ⋈ R S ⋈ σ T⋈ R S
  26. 26. Data Governance ⋈ σ T ⋈ R S ⋈ σ ⋈ T R S U U U
  27. 27. Cascading Deletes ⋈ σ T ⋈ R S ⋈ σ ▷ ⋈ ▷ ▷ T R S U U U
  28. 28. PII Lineage and Derivation ⋈ σ T ⋈ R S ⋈ σ T ⋈ R S PII(R) PII(S) PII(T) PII(V)
  29. 29. Format-specific View Schema Derivation ⋈ σ T ⋈ R S ⋈ σ T ⋈ R S Avro(R) Avro (S) Avro(T) Avro(V)
  30. 30. View Macros SELECT my_udf(c) FROM R JOIN S JOIN T WHERE date = '2020-02-26' SELECT my_udf(c) FROM R JOIN S JOIN T WHERE date = #LATEST_AVAILABLE
  31. 31. Transport UDFs: Translatable Portable UDFs
  32. 32. IR • SQL has pretty well-understood IR: Relational Algebra • Standard Operators • Scan, Filter, Project, Join, Group By, etc • UDFs • Opaque • Use imperative language • Not portable or translatable
  33. 33. UDF Denormalization Multiple versions of the same UDF. Not clear which is the source of truth. Duplication Duplicate implementations can diverge causing data inconsistency Inconsistency Developers need to learn multiple APIs, implement same logic multiple times. Low Productivity In some cases, use tuple conversion adapters to enable portability. Low Performance
  34. 34. A UDF Primer UDFs 101
  35. 35. Example Hive UDF
  36. 36. Example Hive UDF
  37. 37. Example Hive UDF
  38. 38. Example Hive UDF
  39. 39. Example Hive UDF
  40. 40. Example Hive UDF
  41. 41. Example Hive UDF
  42. 42. Example Presto UDF
  43. 43. Example Presto UDF
  44. 44. Example Presto UDF
  45. 45. Example Presto UDF
  46. 46. Example Presto UDF
  47. 47. Example Presto UDF
  48. 48. Example Presto UDF
  49. 49. Example Presto UDF
  50. 50. UDF APIs • API Complexity • APIs expose low-level details of engines • Data types may not intuitvely map to SQL type-system • API Disparity • APIs differ in what to expect from developer • APIs differ in features they can provide
  51. 51. Transport UDFs
  52. 52. Transport UDFs
  53. 53. Transport UDFs
  54. 54. Transport UDFs
  55. 55. Transport UDFs
  56. 56. Transport UDFs
  57. 57. Then What? > gradle build > ls build/my-udfs/libs my-udfs-presto.jar my-udfs-hive.jar my-udfs-spark.jar
  58. 58. Auto-generated UDFs Transport Gradle Plugin Code Analysis – Metadata Generation Autogenerated Engine UDFs Presto Hive Spark … Autogenerated UDF JARs Presto Hive Spark … User-defined Transport UDF
  59. 59. Engine Engine-specific Autogenerated UDF Wrapper User-defined Transport UDF Engine-specific Data and Types Architecture 1 2 3
  60. 60. Conclusions Only implement what is needed to define logic. No boilerplate code. Simple Declarative type signatures with generics. getRequiredFiles() support. Feature-rich Can run on multiple platforms. Code specific to platform is auto-generated. Translatable Direct access to native platform data. Performant Transport UDFs API
  61. 61. How to Contribute? github.com/linkedin/transport transport-udfs@googlegroups.com • New platform support • New Transport UDFs • Machine Learning UDFs • Spatial UDFs • New UDF type support • Aggregate UDFs (UDAF) • Table Functions (UDTF)
  62. 62. Contributors

×