The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)

  1. My Language History
      First Language: APL
      • Example (arbitrary dim regression): Const⌹Coeff∘.*⌽0,Dim
      • High-Dimensional Arrays/Nested Arrays as Data Model
      • Expressions: Mathematics (declarative), parallel
      • Control flow: Recursion, GoTo
      • Syntax: Greek and Special Characters (easier to write than read)
      Next Languages: Pascal, Modula-2, Oberon, C/C++
      • Procedural, imperative (structured control flow)
      • One Item: Structured types and Object Models
      • Single-node, Parallelism/Distributed via Libraries
      Other experiences: Lisp, Prolog
      • Functional and Logic
      • List as Data Model
      • Recursion instead of control flow
      Data Processing Languages:
      • SQL: Declarative Expressions, Procedural Control Flow
      • DataLog: Recursion
      • XQuery: Tree Data Model, Declarative/Functional
      Imperative vs Declarative
      Procedural vs Functional vs Logical
      One item vs Sets
      Single-node vs Parallel vs Distributed
      Programming vs Data Processing
  2. Imperative vs Declarative
      Imperative
      • Tell the system how to get it
      Declarative
      • Tell the system what you want
      • Let the system find a way to get it
      Declarative leaves options that an optimizer can reason about
  3. Procedural vs Functional
      Procedural
      • Operates by changing “persistent state” (variables) with each expression
      • Allows side-effects in control flow
      • i = 0; FOR j FROM 1 TO 100 DO i = i + 1; i is now 100.
      Functional
      • Transforms input into output, no “persistent state”
      • No side-effects in control flow
      • i = 0; FOR j FROM 1 TO 100 DO i = i + 1; i is now 1.
  4. One Item vs Sets; Data Models
      Single objects
      • Requires control flow
      • Explicit Parallelism
      Sets of objects
      • Allows higher-level abstraction expressions
      • Implicit Parallelism
      • Objects can be
        • Object/value
        • Tuples
        • Trees
        • Graphs
      Meta Data Service provides sharing of data model objects
  5. Single-node vs Parallel vs Distributed
      • Language Parallelism vs Libraries
      • Scale-up vs Scale-out
      • Synchronization/Transactions
        • Explicit imperative vs Implicit declarative
        • ACID support
  6. Programming vs Data Processing
      Programming Languages
      • Long-term data is in a store but not part of the language model
      • Designed for tight coupling of data and application logic
      • Often imperative, procedural, one-item object-oriented, explicit/library based parallelism
      Data Processing Languages
      • Long-term data is part of the language model
      • Data can evolve independently of application
      • Declarative and Functional
      • Set-based
      • Built-in parallelism and implicit/declarative synchronization
  7. Syntax Matters
      • Writeability vs Readability
      • Consistency
      • Familiarity
      • Context independent
      • Composable
      • Mathematical vs natural language
      • Reserved Keywords?
  8. Semantics Matters
      • No surprises!
      • Avoid complexities!
      • Composable
      • Optimizable
      • Implementable
  9. How to create languages
      Consortiums/Standard Bodies:
      • Slow (it took XQuery 6 years!)
      • “Political” interests of participants can negatively impact design
      Individual/Small team:
      • More focused
      • Risk: being different for difference's sake
      Evolve vs New
      Create a language that addresses demand
  10. Some sample use cases
      • Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
      • Image Processing – Large-scale image feature extraction and classification using custom code
      • Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms
  11. • Declarativity does scaling and parallelization for you
      • Extensibility is bolted on and not “native”
        • Hard to work with anything other than structured data
        • Difficult to extend with custom code
  12. • Extensibility through custom code is “native”
      • Declarativity is bolted on and not “native”
        • User often has to care about scale and performance
        • SQL is second-class: it is embedded within strings
        • Often no code reuse/sharing across queries
  13. • Declarativity and Extensibility are equally native to the language!
      Get benefits of both! Makes it easy for you by unifying:
      • Unstructured and structured data processing
      • Declarative SQL and custom imperative code (C#, Python, R, …)
      Scales up and scales out custom code within the declarative framework
  14. The origins of U-SQL
      SCOPE – Microsoft’s internal Big Data language
      • SQL and C# integration model
      • Optimization and Scaling model
      • Runs 100’000s of jobs daily
      Hive
      • Complex data types (Maps, Arrays)
      • Data format alignment for text files
      T-SQL/ANSI SQL
      • Many of the SQL capabilities (windowing functions, meta data model etc.)
  15. U-SQL Language Philosophy
      Declarative Query and Transformation Language:
      • Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
      • Optimizable, Scalable
      Expression-flow programming style:
      • Easy to use functional lambda composition
      • Composable, globally optimizable
      Operates on Unstructured & Structured Data
      • Schema on read over files
      • Relational metadata objects (e.g. database, table)
      Extensible from ground up:
      • Type system is based on C#
      • Expression language IS C#
      • User-defined functions (U-SQL and C#)
      • User-defined Aggregators (C#)
      • User-defined Operators (UDO) (C#)
      U-SQL provides the Parallelization and Scale-out Framework for user code
      • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
      Example script:
        REFERENCE ASSEMBLY MyDB.MyAssembly;
        CREATE TABLE T( cid int, first_order DateTime, last_order DateTime, order_count int, order_amount float, ... );
        @o = EXTRACT oid int, cid int, odate DateTime, amount float
             FROM "/input/orders.txt"
             USING Extractors.Csv();
        @c = EXTRACT cid int, name string, city string
             FROM "/input/customers.txt"
             USING Extractors.Csv();
        @j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX(o.odate) AS lastorder,
                    COUNT(o.oid) AS ordercnt, AGG<MyAgg.MySum>(o.amount) AS totalamount
             FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
             WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
             GROUP BY c.cid;
        OUTPUT @j TO "/output/result.txt" USING new MyData.Write();
        INSERT INTO T SELECT * FROM @j;
  16. U-SQL Data Model Files and Tables Set-based
  17. Unstructured Data
        @s = EXTRACT a string, b int, date DateTime, file string
             FROM "filepath/{date:yyyy}/{date:MM}/{date:dd}/{file}.csv"
             USING Extractors.Csv(encoding: Encoding.Unicode);
      • Pro: Flexible, scaling with file sets and over parts of partitionable files
      • Cons: System doesn’t know data distribution, statistics; no indexing
      Structured Data
        CREATE TABLE T (col1 int, col2 string, col3 SQL.MAP<string,string>,
                        INDEX idx CLUSTERED (col2 ASC)
                        PARTITIONED BY (col1)
                        DISTRIBUTED BY HASH (col2) );
      • Pro: Provides system guarantees about data distributions, statistics, indices to help performance and scale; object discoverability
      • Cons: Needs schema a priori; cost of additional storage and generation
  18. U-SQL Familiarity SQL C# Python, R
  19. Familiar Operations
      • ORDER BY FETCH n ROWS
      • GROUP BY HAVING
      • UNION/INTERSECT/EXCEPT
      • OVER Expression: Windowing, Analytics, Ranking Functions
      • JOINS: INNER, FULL/LEFT/RIGHT OUTER, CROSS, SEMI, ANTI-SEMI-JOIN
      • CROSS APPLY
      • PIVOT/UNPIVOT (new!)
      New Operations
      • SET OPERATION BY NAME
          SELECT * FROM @left INTERSECT BY NAME ON (id, *) SELECT * FROM @right;
      • OUTER UNION BY NAME
          SELECT * FROM @left OUTER UNION BY NAME ON (A, K) SELECT * FROM @right;
      • Flexible Column Sets for parameter polymorphism
  20. Top 5 Surprises for SQL Users
      • AS is not as
        • C# keywords and SQL keywords overlap
        • Future-proofing against new reserved keywords in both languages: reserve all upper-case words as U-SQL keywords
      • " vs ' vs []
      • = != ==
        • Remember: C# expression language
      • null IS NOT NULL
        • C# nulls are two-valued
      • PROCEDURES but no WHILE
      • No UPDATE, DELETE, nor MERGE (yet)
  21. U-SQL Object Model Reusability and Discoverability
  22. [Diagram: U-SQL catalog object model. Hierarchy levels: ADLA Account/Catalog, Database, Schema (cardinality labels [1,n], [1,n], [0,n]). Objects shown: tables, views, TVFs, procedures, table types, external tables, clustered index, partitions, statistics, credentials, data source, C# assemblies, C# functions, C# UDAggs, C# UDTs, C# extractors, C# outputters, C# processors, C# reducers, C# combiners, C# appliers. Legend: user objects; relationships "Refers to", "Contains", "Implemented and named by"; MD name vs C# name.]
  23. U-SQL Extensibility
      Input rows:
        Start Time | End Time | User Name
        5:00 AM    | 6:00 AM  | ABC
        5:00 AM    | 6:00 AM  | XYZ
        8:00 AM    | 9:00 AM  | ABC
        8:00 AM    | 10:00 AM | ABC
        10:00 AM   | 2:00 PM  | ABC
        7:00 AM    | 11:00 AM | ABC
        9:00 AM    | 11:00 AM | ABC
        11:00 AM   | 11:30 AM | ABC
        11:40 PM   | 11:59 PM | FOO
      Output rows:
        Start Time | End Time | User Name
        5:00 AM    | 6:00 AM  | ABC
        5:00 AM    | 6:00 AM  | XYZ
        7:00 AM    | 2:00 PM  | ABC
        11:40 PM   | 0:40 AM  | FOO
  24. U-SQL extensibility
      Extend U-SQL with C#/.NET
      • Built-in operators, functions, aggregates
      • C# expressions (in SELECT expressions)
      • User-defined aggregates (UDAGGs)
      • User-defined functions (UDFs)
      • User-defined operators (UDOs)
  25. What are UDOs?
      Custom Operator Extensions, scaled out by U-SQL
      • User-Defined Extractors
      • User-Defined Outputters
      • User-Defined Processors
        • Take one row and produce one row
        • Pass-through versus transforming
      • User-Defined Appliers
        • Take one row and produce 0 to n rows
        • Used with OUTER/CROSS APPLY
      • User-Defined Combiners
        • Combines rowsets (like a user-defined join)
      • User-Defined Reducers
        • Take n rows and produce m rows (normally m<n)
      • Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE, CROSS APPLY
  26. How to specify UDOs?
      • .Net API provided to build UDOs
        • Any .Net language usable; however, only C# is first-class in tooling
        • Use U-SQL specific .Net DLLs
      • Deploying UDOs
        • Compile DLL
        • Upload DLL to ADLS
        • Register with U-SQL script
        • Visual Studio provides tool support
      • UDOs can
        • Invoke managed code
        • Invoke native code deployed with UDO assemblies
        • Invoke other language runtimes (e.g., Python, R)
        • Be scaled out by U-SQL execution framework
      • UDOs cannot
        • Communicate between different UDO invocations
        • Call Webservices/Reach outside the vertex boundary
      Provide integration into U-SQL Data Models (Files and Rowsets)
      Integrates into Processing model and optimization model
  27. U-SQL Scalability and Performance Script level optimization Scales as Data scales
  28. • Automatic "in-lining", optimized out-of-the-box
      • Per-job parallelization: visibility into execution
      • Heatmap to identify bottlenecks
  29. Summary
      • U-SQL is designed for Big Data Analytics
      • Functional, Declarative, Set-based => Scalable and Optimizable
      • Provides Extensibility with known Programming Languages
      • Familiarity: Evolution and Re-use
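Slide 3 gives the same loop two different results. One way to read that contrast, sketched here in Python rather than in any language from the deck: under the procedural reading the loop body mutates `i`, while under the functional reading `i + 1` is a pure expression over the unchanged binding `i = 0`, so every evaluation yields 1. The function names are illustrative assumptions, not part of the talk.

```python
from functools import reduce


def procedural_count() -> int:
    """Procedural reading: the loop body mutates the persistent variable i."""
    i = 0
    for _j in range(1, 101):
        i = i + 1  # side effect: each pass sees the previous pass's update
    return i


def functional_step_results() -> list[int]:
    """Functional reading: i + 1 is a pure expression over the binding i = 0."""
    i = 0
    # i is never rebound, so each of the 100 evaluations computes 0 + 1.
    return [i + 1 for _j in range(1, 101)]


def functional_count() -> int:
    """Counting without persistent state: thread the value through a fold."""
    return reduce(lambda acc, _j: acc + 1, range(1, 101), 0)
```

The fold in `functional_count` shows that a functional style can still count to 100; it just has to pass the intermediate value along explicitly instead of updating a shared variable.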
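Slide 17 extracts `date` and `file` as columns from the path pattern `filepath/{date:yyyy}/{date:MM}/{date:dd}/{file}.csv`. The sketch below illustrates the idea of deriving such virtual columns from a file path. It is written in Python with a hypothetical regex-based helper; it is not how U-SQL implements file sets, and it only handles the three date parts that appear on the slide.

```python
import re
from datetime import date

# Matches {name} and {name:format} tokens in a file-set pattern.
_TOKEN = re.compile(r"\{(\w+)(?::(\w+))?\}")
# Only the date parts used on the slide are supported in this sketch.
_DATE_PARTS = {"yyyy": ("year", r"\d{4}"), "MM": ("month", r"\d{2}"), "dd": ("day", r"\d{2}")}


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate a file-set pattern into a regex with named groups."""
    parts, pos = [], 0
    for m in _TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        name, fmt = m.group(1), m.group(2)
        if fmt is None:
            parts.append(f"(?P<{name}>[^/]+)")          # plain string column
        else:
            part, rx = _DATE_PARTS[fmt]
            parts.append(f"(?P<{name}__{part}>{rx})")   # one piece of a date column
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + r"\Z")


def virtual_columns(pattern: str, path: str):
    """Return the column values encoded in path, or None if it does not match."""
    m = compile_pattern(pattern).match(path)
    if m is None:
        return None
    columns, date_parts = {}, {}
    for key, value in m.groupdict().items():
        name, sep, part = key.rpartition("__")
        if sep:
            date_parts.setdefault(name, {})[part] = int(value)
        else:
            columns[key] = value
    for name, parts in date_parts.items():
        columns[name] = date(**parts)  # assemble year/month/day into one value
    return columns
```

For example, `virtual_columns("filepath/{date:yyyy}/{date:MM}/{date:dd}/{file}.csv", "filepath/2017/02/15/orders.csv")` yields a `file` value of `"orders"` and a `date` value of 15 February 2017, while a path that lacks one of the date directories does not match.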
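Slide 19 lists OUTER UNION BY NAME among the new operations. A minimal Python sketch of the general idea follows, assuming the usual outer-union meaning: rows are aligned by column name and a column missing on one side is filled with null. The sketch deliberately ignores the `ON (...)` column list and any duplicate handling, and `outer_union_by_name` is a hypothetical helper, not a U-SQL API.

```python
def outer_union_by_name(left, right):
    """Union two rowsets (lists of dicts) by column name, padding gaps with None."""
    rows = [*left, *right]
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    return [{name: row.get(name) for name in columns} for row in rows]


left_rows = [{"A": 1, "K": "x", "B": True}]
right_rows = [{"A": 2, "K": "y", "C": 3.5}]
combined = outer_union_by_name(left_rows, right_rows)
```

Here `combined` has the columns A, K, B and C; the left row carries `None` for C and the right row carries `None` for B.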
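The ABC and XYZ rows on slide 23 are consistent with merging overlapping time ranges per user, which is the kind of custom logic the extensibility slides have in mind. The Python sketch below reproduces that result under this assumption. Times are written as 24-hour "HH:MM" strings for simplicity, the FOO row is left out because its end time differs between input and output in the transcript, and `merge_ranges` is an illustrative name.

```python
from collections import defaultdict


def minutes(hhmm: str) -> int:
    """Convert a 24-hour 'HH:MM' string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def merge_ranges(rows):
    """Merge overlapping or touching (start, end) ranges separately for each user."""
    by_user = defaultdict(list)
    for start, end, user in rows:
        by_user[user].append((minutes(start), minutes(end)))
    merged = {}
    for user, ranges in by_user.items():
        ranges.sort()
        out = [list(ranges[0])]
        for start, end in ranges[1:]:
            if start <= out[-1][1]:                  # overlaps or touches the current range
                out[-1][1] = max(out[-1][1], end)
            else:                                    # gap: start a new range
                out.append([start, end])
        merged[user] = [tuple(r) for r in out]
    return merged


SLIDE_ROWS = [
    ("05:00", "06:00", "ABC"), ("05:00", "06:00", "XYZ"),
    ("08:00", "09:00", "ABC"), ("08:00", "10:00", "ABC"),
    ("10:00", "14:00", "ABC"), ("07:00", "11:00", "ABC"),
    ("09:00", "11:00", "ABC"), ("11:00", "11:30", "ABC"),
]
result = merge_ranges(SLIDE_ROWS)
```

With these rows, ABC ends up with 5:00 to 6:00 and 7:00 to 14:00 (2:00 PM), and XYZ keeps its single range, matching the output table on the slide. Because each user's ranges are merged independently, the work partitions naturally by user.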
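Slide 25 distinguishes UDO kinds by their row contracts: a processor maps one row to one row, an applier maps one row to 0..n rows, and a reducer maps the n rows of a group to m rows. The Python sketch below models only those contracts with plain functions over dict rows. The driver and UDO names are hypothetical; the real UDOs are .NET classes invoked by the U-SQL runtime.

```python
from itertools import groupby
from operator import itemgetter


def process(rows, processor):
    """PROCESS: one row in, one row out."""
    return [processor(row) for row in rows]


def cross_apply(rows, applier):
    """CROSS APPLY: pair each input row with the 0..n rows the applier yields for it."""
    return [{**row, **extra} for row in rows for extra in applier(row)]


def reduce_on(rows, key, reducer):
    """REDUCE ON key: the reducer sees all rows of one group and yields m rows."""
    ordered = sorted(rows, key=itemgetter(key))
    return [out
            for _k, group in groupby(ordered, key=itemgetter(key))
            for out in reducer(list(group))]


# Example UDOs (illustrative only).
def upper_name(row):
    return {**row, "name": row["name"].upper()}


def split_tags(row):
    for tag in row["tags"].split(";"):
        if tag:
            yield {"tag": tag}


def latest(group_rows):
    yield max(group_rows, key=itemgetter("ts"))
```

None of the example UDOs keeps state across calls or looks at another group's rows, which mirrors the constraint on slide 26 that UDO invocations cannot communicate with each other; that independence is what allows a framework to run them in parallel.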

Editor's Notes

  1. It is not often that one designs a new query language, but sometimes a new data model or new processing requirements offer the opportunity to design a new language. I have been fortunate to be involved in implementing, influencing, and designing a few data processing languages during my career, ranging from T-SQL through XQuery to U-SQL. In this presentation, I will present my experiences around language design, what in my opinion makes a good language (and what may make a not so good one), what trade-offs have to be considered, and show some of the design decisions behind U-SQL.
  2. It is not often that one designs a new query language, but sometimes a new data model or new processing requirements offer the opportunity to design a new language. I have been fortunate to be involved in implementing, influencing, and designing a few data processing languages during my career, ranging from T-SQL through XQuery to U-SQL. In this presentation, I will present my experiences around language design, what in my opinion makes a good language (and what may make a not so good one), what trade-offs have to be considered, and show some of the design decisions behind U-SQL.
  3. Add velocity?
  4. Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators and SerDes involves many steps, often requires access to the server’s head node, and differs based on the type of operation. Requires many tools and steps. Some examples:
      Hive UDAgg
      • Code and compile .java into .jar
      • Extend AbstractGenericUDAFResolver class: does type checking, argument checking and overloading
      • Extend GenericUDAFEvaluator class: implements logic in 8 methods
      • Deploy: deploy jar into class path on server; edit FunctionRegistry.java to register as built-in; update the content of show functions with ant
      Hive UDF (as of v0.13)
      • Code
      • Load JAR into head node or at URI
      • CREATE FUNCTION USING JAR to register and load jar into classpath for every function (instead of registering the jar and just using the functions)
  5. Spark supports custom “inputters and outputters” for defining custom RDDs. No UDAGGs. Simple integration of UDFs, but only for the duration of the program: no reuse/sharing. Cloud dataflow? User has to care about scale and perf.
      Spark UDAgg
      • Is not yet supported (SPARK-3947)
      Spark UDF
      • Write inline function:
          def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
      • For SQL usage, need to register the table:
          customerTable.registerTempTable("customerTable")
      • Register each UDF:
          sqlContext.udf.register("westernState", westernState _)
      • Call it:
          val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  6. Offers auto-scaling and performance. Operates on unstructured data without tables needed. Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg. Easy to query remote sources even without external tables.
      U-SQL UDAgg
      • Code and compile .cs file: implement IAggregate’s 3 methods: Init(), Accumulate(), Terminate(). C# takes care of type checking, generics etc.
      • Deploy: with tooling, one-click registration of the assembly in the user db; by hand, copy file to ADL and CREATE ASSEMBLY to register the assembly
      • Use via AGG<MyNamespace.MyAggregate<T>>(a)
      U-SQL UDF
      • Code in C#, register assembly once, call by C# name.
  7. Remove SCOPE for external customers?
  8. Use for language experts
  9. Extensions require .NET assemblies to be registered with a database
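Note 6 describes a user-defined aggregator as three methods: Init(), Accumulate(), Terminate(). The sketch below mirrors that lifecycle in Python to show how a runtime might drive such an aggregator once per group. It is an illustration under assumptions, not the .NET interface itself: `MySum` (borrowing the name used on slide 15), `aggregate_by`, and the lower-case method names are all hypothetical.

```python
class MySum:
    """Toy aggregator following the Init / Accumulate / Terminate lifecycle."""

    def init(self):
        self.total = 0.0           # reset state at the start of a group

    def accumulate(self, value):
        self.total += value        # fold one input value into the state

    def terminate(self):
        return self.total          # produce the result for the group


def aggregate_by(rows, key, column, agg_type):
    """Hypothetical driver: run one aggregator instance per distinct key value."""
    states = {}
    for row in rows:
        group = row[key]
        if group not in states:
            states[group] = agg_type()
            states[group].init()
        states[group].accumulate(row[column])
    return {group: agg.terminate() for group, agg in states.items()}


orders = [
    {"cid": 1, "amount": 10.0},
    {"cid": 1, "amount": 5.5},
    {"cid": 2, "amount": 3.0},
]
totals = aggregate_by(orders, "cid", "amount", MySum)
```

Here `totals` maps customer 1 to 15.5 and customer 2 to 3.0. The aggregator author writes only the per-group logic; grouping and distribution stay with the engine, which is the division of labor the note emphasizes.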