The Data Lake approach
Ingest all data
regardless of
requirements
Store all data
in native format
without schema
definition
Do analysis
Using analytic
engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Introducing Azure Data Lake
Big Data Made Easy
WebHDFS
YARN
U-SQL
ADL Analytics ADL HDInsight
Store
HiveAnalytics
Storage
Azure Data Lake (Store, HDInsight, Analytics)
Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
 Declarativity does scaling and
parallelization for you
 Extensibility is bolted on and
not “native”
 hard to work with anything other than
structured data
 difficult to extend with custom code
 Extensibility through custom code
is “native”
 Declarativity is bolted on and
not “native”
 User often has to
care about scale and performance
 SQL is 2nd class within string
 Often no code reuse/
sharing across queries
 Declarativity and Extensibility are
equally native to the language!
Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
(C#)
• Local and remote Queries
• Increase productivity and agility from Day 1 and
at Day 100 for YOU!
The origins
of U-SQL
SCOPE – Microsoft’s internal
Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs 100’000s of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, meta
data model etc.)
Benefits
• Avoid moving large amounts of data across the
network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by
maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Projections
• Filters
• Joins
U-SQL
Query
Query
Azure
Storage Blobs
Azure SQL
in VMs
Azure
SQL DB
Azure Data
Lake Analytics
Azure
SQL Data Warehouse
Azure
Data Lake Storage
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
U-SQL extensibility
Extend U-SQL with C#/.NET
Built-in operators,
function, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Managing
Assemblies
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by system.dll system.core.dll
system.data.dll, System.Runtime.Serialization.dll,
mscorelib.dll (e.g., System.Text,
System.Text.RegularExpressions, System.Linq)
• Add all other .Net Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• Powershell command
• U-SQL Studio Server Explorer
• DROP ASSEMBLY db.assembly;
 Create assemblies
 Reference assemblies
 Enumerate assemblies
 Drop assemblies
 VisualStudio makes registration easy!
'USING' csharp_namespace
| Alias '=' csharp_namespace_or_class.
Examples:
DECLARE @ input string = "somejsonfile.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@data0 =
EXTRACT IPAddresses string
FROM @input
USING new JsonExtractor("Devices[*]");
USING json =
[Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];
@data1 =
EXTRACT IPAddresses string
FROM @input
USING new json("Devices[*]");
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Simple pattern language on filename and path
@pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Limits on number of files
(Current limit 800 and 3000 being increased in next refresh)
Virtual columns
EXTRACT
name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query predicates to get partition
elimination
(otherwise you will get a warning)
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
ADLA Account/Catalog
Database
Schema
[1,n]
[1,n]
[0,n]
tables views TVFs
C# Fns C# UDAgg
Clustered
Index
partitions
C#
Assemblies
C# Extractors
Data
Source
C# Reducers
C# Processors
C# Combiners
C# Outputters
Ext. tables
Abstract
objects
User
objects
Refers toContains Implemented
and named by
Procedures
Creden-
tials
MD
Name
C# Name
C# Applier
Table Types
Legend
Statistics
C# UDTs
• Naming
• Discovery
• Sharing
• Securing
U-SQL Catalog Naming
• Default Database and Schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure Powershell commands
Sharing
• Within an Azure Data Lake Analytics account
Securing
• Secured with AAD principals at catalog level (inherited from ADL
Storage)
• At General Availability: Database level access control
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col1 ASC)
DISTRIBUTED BY HASH (driver_id)
);
• Structured Data
• Built-in Data types only (no UDTs)
• Clustered Index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and partitioning
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
U-SQL
Joins
Join operators
• INNER JOIN
• LEFT or RIGHT or FULL OUTER JOIN
• CROSS JOIN
• SEMIJOIN
• equivalent to IN subquery
• ANTISEMIJOIN
• Equivalent to NOT IN subquery
Notes
• ON clause comparisons need to be of the simple form:
rowset.column == rowset.column
or AND conjunctions of the simple equality comparison
• If a comparand is not a column, wrap it into a column in a previous
SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• turn the join into a CROSS JOIN if no equality comparison
Reason: Syntax calls out which joins are efficient
U-SQL
Analytics
Windowing Expression
Window_Function_Call 'OVER' '('
[ Over_Partition_By_Clause ]
[ Order_By_Clause ]
[ Row _Clause ]
')'.
Window_Function_Call :=
Aggregate_Function_Call
| Analytic_Function_Call
| Ranking_Function_Call.
Windowing Aggregate Functions
ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
Analytics Functions
CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT,
PERCENTILE_DISC, PERCENT_RANK; soon: LEAD/LAG
Ranking Functions
DENSE_RANK, NTILE, RANK, ROW_NUMBER
“Top 5”s
Surprises for
SQL Users
• AS is not as
• C# keywords and SQL keywords overlap
• Costly to make case-insensitive -> Better build
capabilities than tinker with syntax
• = != ==
• Remember: C# expression language
• null IS NOT NULL
• C# nulls are two-valued
• PROCEDURES but no WHILE
• No UPDATE nor MERGE
U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out
Framework for Usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(c.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Unifies natively SQL’s declarativity and C#’s extensibility
Unifies querying structured and unstructured
Unifies local and remote queries
Increase productivity and agility from Day 1 forward for
YOU!
Sign up for an Azure Data Lake account and join the Public Preview
http://www.azure.com/datalake and give us your feedback via
http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
http://usql.io
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://azure.microsoft.com/en-
us/documentation/services/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-
US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
Taming the Data Science Monster with A New ‘Sword’ – U-SQL

Taming the Data Science Monster with A New ‘Sword’ – U-SQL

  • 4.
    The Data Lakeapproach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 5.
    Introducing Azure DataLake Big Data Made Easy
  • 6.
    WebHDFS YARN U-SQL ADL Analytics ADLHDInsight Store HiveAnalytics Storage Azure Data Lake (Store, HDInsight, Analytics)
  • 9.
    Some sample usecases Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms Image Processing – Large-scale image feature extraction and classification using custom code Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms
  • 10.
     Declarativity doesscaling and parallelization for you  Extensibility is bolted on and not “native”  hard to work with anything other than structured data  difficult to extend with custom code
  • 11.
     Extensibility throughcustom code is “native”  Declarativity is bolted on and not “native”  User often has to care about scale and performance  SQL is 2nd class within string  Often no code reuse/ sharing across queries
  • 12.
     Declarativity andExtensibility are equally native to the language! Get benefits of both! Makes it easy for you by unifying: • Unstructured and structured data processing • Declarative SQL and custom imperative Code (C#) • Local and remote Queries • Increase productivity and agility from Day 1 and at Day 100 for YOU!
  • 13.
    The origins of U-SQL SCOPE– Microsoft’s internal Big Data language • SQL and C# integration model • Optimization and Scaling model • Runs 100’000s of jobs daily Hive • Complex data types (Maps, Arrays) • Data format alignment for text files T-SQL/ANSI SQL • Many of the SQL capabilities (windowing functions, meta data model etc.)
  • 14.
    Benefits • Avoid movinglarge amounts of data across the network between stores • Single view of data irrespective of physical location • Minimize data proliferation issues caused by maintaining multiple copies • Single query language for all data • Each data store maintains its own sovereignty • Design choices based on the need • Push SQL expressions to remote SQL sources • Projections • Filters • Joins U-SQL Query Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics Azure SQL Data Warehouse Azure Data Lake Storage
  • 15.
  • 16.
    EXTRACT Expression @s =EXTRACT a string, b int FROM "filepath/file.csv" USING Extractors.Csv(encoding: Encoding.Unicode); • Built-in Extractors: Csv, Tsv, Text with lots of options • Custom Extractors: e.g., JSON, XML, etc. OUTPUT Expression OUTPUT @s TO "filepath/file.csv" USING Outputters.Csv(); • Built-in Outputters: Csv, Tsv, Text • Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io) Filepath URIs • Relative URI to default ADL Storage account: "filepath/file.csv" • Absolute URIs: • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv" • WASB: "wasb://container@account/filepath/file.csv"
  • 17.
    U-SQL extensibility Extend U-SQLwith C#/.NET Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  • 18.
  • 19.
    Managing Assemblies • CREATE ASSEMBLYdb.assembly FROM @path; • CREATE ASSEMBLY db.assembly FROM byte[]; • Can also include additional resource files • REFERENCE ASSEMBLY db.assembly; • Referencing .Net Framework Assemblies • Always accessible system namespaces: • U-SQL specific (e.g., for SQL.MAP) • All provided by system.dll system.core.dll system.data.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq) • Add all other .Net Framework Assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML]; • Enumerating Assemblies • Powershell command • U-SQL Studio Server Explorer • DROP ASSEMBLY db.assembly;  Create assemblies  Reference assemblies  Enumerate assemblies  Drop assemblies  VisualStudio makes registration easy!
  • 20.
    'USING' csharp_namespace | Alias'=' csharp_namespace_or_class. Examples: DECLARE @ input string = "somejsonfile.json"; REFERENCE ASSEMBLY [Newtonsoft.Json]; REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; USING Microsoft.Analytics.Samples.Formats.Json; @data0 = EXTRACT IPAddresses string FROM @input USING new JsonExtractor("Devices[*]"); USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor]; @data1 = EXTRACT IPAddresses string FROM @input USING new json("Devices[*]");
  • 21.
  • 22.
    Simple pattern languageon filename and path @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"; • Binds two columns date and suffix • Wildcards the filename • Limits on number of files (Current limit 800 and 3000 being increased in next refresh) Virtual columns EXTRACT name string , suffix string // virtual column , date DateTime // virtual column FROM @pattern USING Extractors.Csv(); • Refer to virtual columns in query predicates to get partition elimination (otherwise you will get a warning)
  • 23.
  • 24.
    ADLA Account/Catalog Database Schema [1,n] [1,n] [0,n] tables viewsTVFs C# Fns C# UDAgg Clustered Index partitions C# Assemblies C# Extractors Data Source C# Reducers C# Processors C# Combiners C# Outputters Ext. tables Abstract objects User objects Refers toContains Implemented and named by Procedures Creden- tials MD Name C# Name C# Applier Table Types Legend Statistics C# UDTs
  • 25.
    • Naming • Discovery •Sharing • Securing U-SQL Catalog Naming • Default Database and Schema context: master.dbo • Quote identifiers with []: [my table] • Stores data in ADL Storage /catalog folder Discovery • Visual Studio Server Explorer • Azure Data Lake Analytics Portal • SDKs and Azure Powershell commands Sharing • Within an Azure Data Lake Analytics account Securing • Secured with AAD principals at catalog level (inherited from ADL Storage) • At General Availability: Database level access control
  • 26.
    CREATE TABLE T(col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col1 ASC) DISTRIBUTED BY HASH (driver_id) ); • Structured Data • Built-in Data types only (no UDTs) • Clustered Index (needs to be specified): row-oriented • Fine-grained distribution (needs to be specified): • HASH, DIRECT HASH, RANGE, ROUND ROBIN CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …; CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…; CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT); • Infer the schema from the query • Still requires index and partitioning
  • 27.
  • 28.
    U-SQL Joins Join operators • INNERJOIN • LEFT or RIGHT or FULL OUTER JOIN • CROSS JOIN • SEMIJOIN • equivalent to IN subquery • ANTISEMIJOIN • Equivalent to NOT IN subquery Notes • ON clause comparisons need to be of the simple form: rowset.column == rowset.column or AND conjunctions of the simple equality comparison • If a comparand is not a column, wrap it into a column in a previous SELECT • If the comparison operation is not ==, put it into the WHERE clause • turn the join into a CROSS JOIN if no equality comparison Reason: Syntax calls out which joins are efficient
  • 29.
    U-SQL Analytics Windowing Expression Window_Function_Call 'OVER''(' [ Over_Partition_By_Clause ] [ Order_By_Clause ] [ Row _Clause ] ')'. Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call. Windowing Aggregate Functions ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP Analytics Functions CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK; soon: LEAD/LAG Ranking Functions DENSE_RANK, NTILE, RANK, ROW_NUMBER
  • 30.
    “Top 5”s Surprises for SQLUsers • AS is not as • C# keywords and SQL keywords overlap • Costly to make case-insensitive -> Better build capabilities than tinker with syntax • = != == • Remember: C# expression language • null IS NOT NULL • C# nulls are two-valued • PROCEDURES but no WHILE • No UPDATE nor MERGE
  • 32.
    U-SQL Language Philosophy DeclarativeQuery and Transformation Language: • Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions • Optimizable, Scalable Expression-flow programming style: • Easy to use functional lambda composition • Composable, globally optimizable Operates on Unstructured & Structured Data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up: • Type system is based on C# • Expression language IS C# • User-defined functions (U-SQL and C#) • User-defined Aggregators (C#) • User-defined Operators (UDO) (C#) U-SQL provides the Parallelization and Scale-out Framework for Usercode • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER Federated query across distributed data sources REFERENCE MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  • 33.
    Unifies natively SQL’sdeclarativity and C#’s extensibility Unifies querying structured and unstructured Unifies local and remote queries Increase productivity and agility from Day 1 forward for YOU! Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
  • 34.

Editor's Notes

  • #5 A data lake is an enterprise wide repository of every type of data collected in a single place. Data of all types can be arbitrarily stored in the data lake prior to any formal definition of requirements or schema for the purposes of operational and exploratory analytics. Advanced analytics can be done using Hadoop, Machine Learning tools, or act as a lower cost data preparation location prior to moving curated data into a data warehouse. In these cases, customers would load data into the data lake prior to defining any transformation logic. This is bottom up because data is collected first and the data itself gives you the insight and helps derive conclusions or predictive models.
  • #10 Add velocity?
  • #11 Hard to operate on unstructured data: Even Hive requires meta data to be created to operate on unstructured data. Adding Custom Java functions, aggregators and SerDes is involving a lot of steps and often access to server’s head node and differs based on type of operation. Requires many tools and steps. Some examples: Hive UDAgg Code and compile .java into .jar Extend AbstractGenericUDAFResolver class: Does type checking, argument checking and overloading Extend GenericUDAFEvaluator class: implements logic in 8 methods. - Deploy: Deploy jar into class path on server Edit FunctionRegistry.java to register as built-in Update the content of show functions with ant Hive UDF (as of v0.13) Code Load JAR into head node or at URI CREATE FUNCTION USING JAR to register and load jar into classpath for every function (instead of registering jar and just use the functions)
  • #12 Spark supports Custom “inputters and outputters” for defining custom RDDs No UDAGGs Simple integration of UDFs but only for duration of program. No reuse/sharing. Cloud dataflow? Requires has to care about scale and perf Spark UDAgg Is not yet supported ( SPARK-3947) Spark UDF Write inline function def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state) for SQL usage need to register the table customerTable.registerTempTable("customerTable") Register each UDF sqlContext.udf.register("westernState", westernState _) Call it val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  • #13 Offers Auto-scaling and performance Operates on unstructured data without tables needed Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg. Easy to query remote sources even without external tables U-SQL UDAgg Code and compile .cs file: Implement IAggregate’s 3 methods :Init(), Accumulate(), Terminate() C# takes case of type checking, generics etc. Deploy: Tooling: one click registration in user db of assembly By Hand: Copy file to ADL CREATE ASSEMBLY to register assembly Use via AGG<MyNamespace.MyAggregate<T>>(a) U-SQL UDF Code in C#, register assembly once, call by C# name.
  • #14 Remove SCOPE for external customers?
  • #15 DATA SOURCE: Represents a remote data source such as Azure SQL Database. Have to specify all the details (connection string, credentials, etc required to connect to and issues queries. EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table that it is based on. U-SQL automatically does the type conversion. External tables lets you impose a specific schema against the remote data, shielding you from remote schema changes. You can issue queries that ‘join’ external and local tables. PASS THROUGH queries: These queries are issued directly against the remote data source in the syntax of the remote data source (say T-SQL for Azure SQL database). REMOTABLE_TYPES: For every external data source you have to specify the list of ‘remoteable types. This list constrains the types of queries that will be remoted. Ex: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal); LAZY METADATA LOADING: Here the remote data schematized only when the query is actually issues to the remote data source. Your program must be able to deal with remote schema changes.
  • #16 Shows simple Extract, OUTPUT Then simple extensibility with string functions.
  • #18 Extensions require .NET assemblies to be registered with a database
  • #19 Shows simple Extract, OUTPUT Then simple extensibility with string functions.
  • #22 Add file sets.
  • #24 Show Views, TVFs and Tables
  • #31 GROUP BY, ORDER BY, CROSS APPLY
  • #36 Use for language experts