The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
First Language: APL
Example (arbitrary dim regression): Const⌹Coeff∘.*⌽0,Dim
High-Dimensional Arrays/Nested Arrays as Data Model
Expressions: Mathematics (declarative), parallel
Control flow: Recursion, GoTo
Syntax: Greek and Special Characters (easier to write than read)
Next Languages: Pascal, Modula-2, Oberon, C/C++
Procedural, imperative (structured control flow)
One Item: Structured types and Object Models
Single-node, Parallelism/Distributed via Libraries
Other experiences: Lisp, Prolog
Functional and Logic
List as Data Model
Recursion instead of control flow
Data Processing Languages:
SQL: Declarative Expressions, Procedural Control Flow
DataLog: Recursion
XQuery: Tree Data Model, Declarative/Functional
My Language
History
Imperative vs Declarative
Procedural vs Functional vs Logical
One item vs Sets
Single-node vs Parallel vs Distributed
Programming vs Data Processing
Imperative
Tell the system how to get it
Declarative
Tell the system what you want
Let the system find a way to get it
Declarative leaves options that an optimizer can reason about
Imperative vs
Declarative
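As a minimal sketch (in Python, not part of the original slides), here is the same computation written both ways — the imperative version fixes an evaluation order, while the declarative version only states the result, leaving the strategy to the system:

```python
data = [3, 1, 4, 1, 5, 9]

# Imperative: tell the system HOW -- step by step, in a fixed order.
total = 0
for x in data:
    total += x

# Declarative: tell the system WHAT -- "the sum of data"; the runtime
# (or an optimizer) is free to choose how, e.g. in parallel chunks.
declarative_total = sum(data)

assert total == declarative_total == 23
```

The declarative form is what leaves an optimizer room to reorder, batch, or parallelize the work.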
Procedural
operates by changing “persistent state” (variables) with each
expression
Allows side-effects in control flow
i = 0;
FOR j FROM 1 TO 100 DO i = i + 1;
i is now 100.
Functional
Transforms input into output, no “persistent state”
No side-effects in control flow
i = 0;
FOR j FROM 1 TO 100 DO i = i + 1;
i is now 1.
Procedural vs
Functional
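The contrast on the slide can be sketched in Python (an illustration, not from the talk): under procedural semantics each iteration mutates the persistent variable, while under a functional reading `i + 1` is a pure expression over the unchanged binding `i = 0`, so every iteration yields the same value:

```python
# Procedural: persistent state, mutated on every iteration.
i = 0
for j in range(1, 101):
    i = i + 1
assert i == 100

# Functional reading: no persistent state; "i + 1" always evaluates
# against the original binding i0 = 0, so the result is always 1.
i0 = 0
values = [i0 + 1 for j in range(1, 101)]
assert values[-1] == 1
```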
Single objects
Requires control flow
Explicit Parallelism
Sets of objects
Allows higher-level abstraction expressions
Implicit Parallelism
Objects can be
Object/value
Tuples
Trees
graphs
Meta Data Service provides sharing of data model
objects
One Item vs
Sets
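One-item processing forces explicit control flow; set-based processing lets a single expression describe the whole transformation, which a system can then parallelize implicitly. A small Python sketch (illustrative data and names only):

```python
from itertools import groupby

orders = [("a", 10), ("b", 25), ("a", 5)]

# One item at a time: explicit control flow, explicit accumulator.
totals = {}
for customer, amount in orders:
    totals[customer] = totals.get(customer, 0) + amount

# Set-based: one higher-level expression over the whole collection;
# grouping/aggregation like this is what an engine can distribute.
set_totals = {
    k: sum(a for _, a in g)
    for k, g in groupby(sorted(orders), key=lambda t: t[0])
}

assert totals == set_totals == {"a": 15, "b": 25}
```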
Data Models
Language Parallelism vs Libraries
Scale-up vs Scale-out
Synchronization/Transactions
Explicit imperative vs Implicit declarative
ACID support
Single-node vs
Parallel vs
Distributed
Programming Languages
Long-term data is in a store but not part of the language
model
Designed for tight coupling of data and application logic
Often imperative, procedural, one-item object-oriented,
explicit/library based parallelism
Data Processing Languages
Long-term data is part of the language model
Data can evolve independently of application
Declarative and Functional
Set-based
Built-in parallelism and implicit/declarative synchronization
Programming
vs
Data
Processing
Consortiums/Standard Bodies:
Slow (it took XQuery 6 years!)
“Political” interests of participants can
negatively impact design
Individual/Small team:
More focused
Risk: different for difference's sake
Evolve vs New
Create a language that addresses demand
How to create
languages
Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Declarativity does scaling and
parallelization for you
Extensibility is bolted on and
not “native”
Hard to work with anything other than
structured data
Difficult to extend with custom code
Extensibility through custom code
is “native”
Declarativity is bolted on and
not “native”
User often has to
care about scale and performance
SQL is 2nd class, embedded within strings
Often no code reuse/
sharing across queries
Declarativity and Extensibility are
equally native to the language!
Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
(C#, Python, R, …)
Scales-up and Scales-out custom code within
declarative framework
The origins
of U-SQL
SCOPE – Microsoft’s internal
Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs 100’000s of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, metadata
model, etc.)
U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out
Framework for Usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIER
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float, ... );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
     , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
     , AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Unstructured Data
@s = EXTRACT a string, b int, date DateTime, file string
FROM "filepath/{date:yyyy}/{date:MM}/{date:dd}/{file}.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Pro: Flexible, scaling with file sets and over parts of partitionable files
• Cons: System doesn’t know data distribution, statistics; no indexing
Structured Data
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col2 ASC)
PARTITIONED BY (col1)
DISTRIBUTED BY HASH (col2) );
• Pro: Provides system guarantees about data distributions, statistics, indices to
help performance and scale; object discoverability
• Cons: Needs Schema a priori, Cost of additional storage and generation
Familiar Operations
• ORDER BY FETCH n ROWS
• GROUP BY HAVING
• UNION/INTERSECT/EXCEPT
• OVER Expression: Windowing, Analytics, Ranking Functions
• JOINS: INNER, FULL/LEFT/RIGHT OUTER, CROSS, SEMI, ANTI-SEMI-JOIN
• CROSS APPLY
• PIVOT/UNPIVOT (new!)
New Operations
• SET OPERATION BY NAME
SELECT * FROM @left
INTERSECT BY NAME ON (id, *)
SELECT * FROM @right;
• OUTER UNION BY NAME
SELECT * FROM @left
OUTER UNION BY NAME ON (A, K)
SELECT * FROM @right;
• Flexible Column Sets for parameter polymorphism
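The BY NAME variants align columns by name rather than by position. A rough Python sketch of the idea (dict rows stand in for U-SQL rowsets; all names are illustrative):

```python
left  = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "z"}, {"id": 3, "b": "w"}]

# INTERSECT BY NAME ON (id): match rows on the named column(s),
# regardless of where that column sits in either input.
ids = {r["id"] for r in right}
intersect_by_name = [r for r in left if r["id"] in ids]

# OUTER UNION BY NAME: union the schemas, padding missing columns
# with nulls (None here).
cols = sorted({c for r in left + right for c in r})
outer_union = [{c: r.get(c) for c in cols} for r in left + right]

assert [r["id"] for r in intersect_by_name] == [2]
assert all(set(r) == set(cols) for r in outer_union)
```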
“Top 5”s
Surprises for
SQL Users
• AS is not as
• C# keywords and SQL keywords overlap
• Future Proofing against new reserved keywords in
both languages: Reserve all upper-case words as
U-SQL keywords
• " vs ' vs []
• = != ==
• Remember: C# expression language
• null IS NOT NULL
• C# nulls are two-valued
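Because U-SQL's expression language is C#, its `null` is two-valued: `null == null` is true, unlike SQL's three-valued `NULL`, where `NULL = NULL` evaluates to unknown. A Python analogy (Python's `None`, like C#'s `null`, compares two-valued; the helper below is hypothetical):

```python
# Two-valued null (C#-style, shown with Python's None):
# comparing two nulls yields a definite True, not "unknown".
assert (None == None) is True

# SQL-style three-valued logic treats NULL = NULL as unknown;
# outside SQL, "unknown" can be modeled with a sentinel value.
UNKNOWN = object()

def sql_eq(a, b):
    # Hypothetical helper: any comparison involving NULL (None)
    # is unknown; otherwise ordinary equality.
    if a is None or b is None:
        return UNKNOWN
    return a == b

assert sql_eq(None, None) is UNKNOWN
assert sql_eq(1, 1) is True
```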
• PROCEDURES but no WHILE
• No UPDATE, DELETE, nor
MERGE (yet)
[Diagram: ADLA Account/Catalog metadata model]
An ADLA Account/Catalog contains [1,n] Databases; each Database contains [1,n] Schemas; each Schema contains [0,n] user objects:
tables (with clustered index, partitions, statistics), views, TVFs, procedures, table types, external tables, data sources, credentials,
C# assemblies, and the C# objects implemented in them: C# functions, UDAggs, UDTs, extractors, processors, reducers, combiners, appliers, outputters.
Legend: objects are related by "Contains" and "Refers to"; C# objects are implemented and named by a C# name in addition to their metadata (MD) name.
U-SQL Extensibility
Input rowset:
Start Time - End Time  - User Name
5:00 AM    - 6:00 AM   - ABC
5:00 AM    - 6:00 AM   - XYZ
8:00 AM    - 9:00 AM   - ABC
8:00 AM    - 10:00 AM  - ABC
10:00 AM   - 2:00 PM   - ABC
7:00 AM    - 11:00 AM  - ABC
9:00 AM    - 11:00 AM  - ABC
11:00 AM   - 11:30 AM  - ABC
11:40 PM   - 11:59 PM  - FOO

Output rowset (ranges merged per user):
Start Time - End Time  - User Name
5:00 AM    - 6:00 AM   - ABC
5:00 AM    - 6:00 AM   - XYZ
7:00 AM    - 2:00 PM   - ABC
11:40 PM   - 0:40 AM   - FOO
User-Defined Extractors
User-Defined Outputters
User-Defined Processors
Take one row and produce one row
Pass-through versus transforming
User-Defined Appliers
Take one row and produce 0 to n rows
Used with OUTER/CROSS APPLY
User-Defined Combiners
Combines rowsets (like a user-defined join)
User-Defined Reducers
Take n rows and produce m rows (normally m<n)
Scaled out with explicit U-SQL Syntax that takes a UDO
instance (created as part of the execution):
EXTRACT
OUTPUT
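The time-range example earlier in the deck is a classic reducer: partition by user, then merge overlapping ranges within each group. A Python sketch of that reduce logic (minutes stand in for times; names and data are illustrative, not the U-SQL API):

```python
from itertools import groupby

def merge_ranges(rows):
    # Reducer body: receives the rows of ONE group as (start, end)
    # pairs and merges overlapping/adjacent ranges -- n rows in,
    # m <= n rows out.
    out = []
    for s, e in sorted(rows):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

rows = [("ABC", 480, 540), ("ABC", 480, 600), ("ABC", 600, 840),
        ("ABC", 420, 660), ("XYZ", 300, 360)]

# "REDUCE ... ON user": partition by key, run the reducer per group.
reduced = {u: merge_ranges([(s, e) for _, s, e in g])
           for u, g in groupby(sorted(rows), key=lambda r: r[0])}

assert reduced == {"ABC": [(420, 840)], "XYZ": [(300, 360)]}
```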
What are
UDOs?
Custom Operator Extensions
Scaled out by U-SQL
• PROCESS
• COMBINE
• REDUCE
• CROSS APPLY
.Net API provided to build UDOs
Any .Net language usable
however only C# is first-class in tooling
Use U-SQL specific .Net DLLs
Deploying UDOs
Compile DLL
Upload DLL to ADLS
Register with a U-SQL script (CREATE ASSEMBLY)
VisualStudio provides tool support
UDOs can
Invoke managed code
Invoke native code deployed with UDO assemblies
Invoke other language runtimes (e.g., Python, R)
be scaled out by U-SQL execution framework
UDOs cannot
Communicate between different UDO invocations
Call Webservices/Reach outside the vertex boundary
How to specify
UDOs?
Provide integration into U-SQL Data
Models (Files and Rowsets)
Integrates into Processing model and
optimization model
U-SQL’s designed for Big Data
Analytics
Functional, Declarative, Set-based =>
Scalable and Optimizable
Provides Extensibility with known
Programming Languages
Familiarity: Evolution and Re-use
Summary
Editor's Notes
It is not often that one designs a new query language, but sometimes a new data model or new processing requirements offer the opportunity to design a new language. I have been fortunate to be involved in both implementing, influencing and designing a few data processing languages during my career ranging from T-SQL over XQuery to U-SQL. In this presentation, I will present my experiences around language designs, what in my opinion makes a good language (and what may make a not so good one), what trade-offs have to be considered and show some of the design decisions behind U-SQL.
Add velocity?
Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves many steps, often requires access to the server's head node, and differs depending on the type of operation. Requires many tools and steps.
Some examples:
Hive UDAgg
Code and compile .java into .jar
Extend AbstractGenericUDAFResolver class: Does type checking, argument checking and overloading
Extend GenericUDAFEvaluator class: implements logic in 8 methods.
Deploy:
Deploy jar into class path on server
Edit FunctionRegistry.java to register as built-in
Update the content of show functions with ant
Hive UDF (as of v0.13)
Code
Load JAR into head node or at URI
CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions)
Spark supports Custom “inputters and outputters” for defining custom RDDs
No UDAGGs
Simple integration of UDFs but only for duration of program. No reuse/sharing.
Cloud Dataflow? Requires the user to care about scale and perf
Spark UDAgg
Is not yet supported ( SPARK-3947)
Spark UDF
Write inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
For SQL usage, register the table: customerTable.registerTempTable("customerTable")
Register each UDF: sqlContext.udf.register("westernState", westernState _)
Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
Offers Auto-scaling and performance
Operates on unstructured data without tables needed
Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg.
Easy to query remote sources even without external tables
U-SQL UDAgg
Code and compile .cs file:
Implement IAggregate's 3 methods: Init(), Accumulate(), Terminate()
C# takes care of type checking, generics etc.
Deploy:
Tooling: one click registration in user db of assembly
By Hand:
Copy file to ADL
CREATE ASSEMBLY to register assembly
Use via AGG<MyNamespace.MyAggregate<T>>(a)
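The Init/Accumulate/Terminate contract can be sketched generically in Python (a conceptual stand-in for C#'s IAggregate, with illustrative names, not the actual interface):

```python
class MySum:
    # Mirrors the three-method aggregator contract:
    # init() resets state, accumulate() folds in one value,
    # terminate() returns the final aggregate for the group.
    def init(self):
        self.total = 0

    def accumulate(self, value):
        self.total += value

    def terminate(self):
        return self.total

def run_agg(agg, values):
    # How an engine drives the aggregator over one group.
    agg.init()
    for v in values:
        agg.accumulate(v)
    return agg.terminate()

assert run_agg(MySum(), [10, 25, 5]) == 40
```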
U-SQL UDF
Code in C#, register assembly once, call by C# name.
Remove SCOPE for external customers?
Use for language experts
Extensions require .NET assemblies to be registered with a database