SeaScale Meetup
Jan 2016
Azure Data Lake &
U-SQL
Michael Rys, @MikeDoesBigData
http://www.azure.com/datalake
{mrys, usql}@microsoft.com
Analytics
Storage
HDInsight
(“managed clusters”)
Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake
ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
For developers familiar with the
Open Source: Java, Eclipse, Hive, etc.
Clusters offer customization, control,
and flexibility in a managed Hadoop
cluster
ADLA
Enables customers to leverage
existing experience with C#, SQL &
PowerShell
Offers convenience, efficiency,
automatic scale, and management in
a “job service” form factor
WebHDFS
YARN
U-SQL
Analytics Service
HDInsight (managed Hadoop clusters)
Analytics
Store
Azure Data Lake
Azure Data Lake
Analytics Service
Enterprise-grade
Limitless scale
Productivity from day one
Easy and powerful data preparation
All data
Azure Data Lake Analytics
Azure
Data Lake
Analytics Service
A new distributed
analytics service
Built on Apache YARN
Scales dynamically with the turn of a dial
Pay by the query
Supports Azure AD for access control,
roles, and integration with on-prem
identity systems
Built with U-SQL to unify the benefits of
SQL with the power of C#
Processes data across Azure
Work across all cloud data
Azure Data Lake Analytics
Azure SQL DW
Azure SQL DB
Azure Storage Blobs
Azure Data Lake Store
SQL DB in an Azure VM
Azure Data Lake
U-SQL
• Hard to work with anything other than structured data
• Difficult to extend with custom code
• User often has to care about scale and performance
• SQL is second class, embedded within a string
• Often no code reuse or sharing across queries
Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
• Local and remote Queries
• Increase productivity and agility from Day 1 and
at Day 100 for YOU!
Extend U-SQL with C#/.NET
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the parallelization and scale-out
framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Intro Blog entry: http://aka.ms/usql-intro
Blog entry on UDFs: http://aka.ms/usql-udf
U-SQL Reference Doc (beta): http://aka.ms/usql_reference
U-SQL Community & Team site: http://usql.io/
Videos: https://channel9.msdn.com/Series/AzureDataLake
Microsoft Confidential Material - covered under NDA
Additional Resources
• Blogs and community page:
  • http://usql.io
  • https://blogs.msdn.microsoft.com/azuredatalake/
  • http://blogs.msdn.com/b/visualstudio/
  • http://azure.microsoft.com/en-us/blog/topics/big-data/
  • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation:
  • http://aka.ms/usql_reference
  • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• ADL forums and feedback:
  • http://aka.ms/adlfeedback
  • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
  • http://stackoverflow.com/questions/tagged/u-sql
Natively unifies SQL’s declarativity and C#’s extensibility
Unifies querying of structured and unstructured data
Unifies local and remote queries
Increases productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview at
http://www.azure.com/datalake, and give us your feedback via
http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!


Editor's Notes

  • #7 All data: unstructured, semi-structured, and structured data; domain-specific user-defined types using C#; queries over Data Lake and Azure Blobs; federated queries over operational and DW SQL stores, removing the complexity of ETL.
    Productive from day one: effortless scale and performance without the need to manually tune or configure; best developer experience throughout the development lifecycle for both novices and experts; leverage your existing skills with SQL and .NET.
    Easy and powerful data preparation: easy-to-use built-in connectors for common data formats; simple and rich extensibility model for adding customer-specific data transformations, both existing and new.
    No-limits scale: scales on demand with no change to code; automatically parallelizes SQL and custom code; designed to process petabytes of data.
    Enterprise grade: managing, securing, sharing, and discovery of familiar data and code objects (tables, functions, etc.); role-based authorization of catalogs and storage accounts using AAD security; auditing of catalog objects (databases, tables, etc.).
  • #8 A new distributed analytics service, built on Apache YARN. It scales dynamically: it handles jobs of any scale instantly by simply setting the dial for how much power you need, and you only pay for the cost of the query. It supports Azure Active Directory for access control, roles, and integration with on-premises identity systems. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#. U-SQL’s scalable runtime processes data across multiple Azure data sources.
  • #9 ADLA allows you to compute on data anywhere and join data from multiple cloud sources.
  • #16 Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves many steps, often requires access to the server’s head node, and differs based on the type of operation. It requires many tools and steps. Some examples:
    Hive UDAgg
    - Code and compile .java into .jar
    - Extend the AbstractGenericUDAFResolver class: does type checking, argument checking, and overloading
    - Extend the GenericUDAFEvaluator class: implements the logic in 8 methods
    - Deploy: deploy the jar into the class path on the server; edit FunctionRegistry.java to register it as built-in; update the content of show functions with ant
    Hive UDF (as of v0.13)
    - Code
    - Load the JAR onto the head node or at a URI
    - CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions)
  • #17 Spark supports custom “inputters and outputters” for defining custom RDDs. No UDAGGs. Simple integration of UDFs, but only for the duration of the program; no reuse or sharing. Cloud Dataflow? The user has to care about scale and perf.
    Spark UDAgg
    - Not yet supported (SPARK-3947)
    Spark UDF
    - Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
    - For SQL usage, register the table: customerTable.registerTempTable("customerTable")
    - Register each UDF: sqlContext.udf.register("westernState", westernState _)
    - Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  • #18 Offers auto-scaling and performance. Operates on unstructured data without needing tables. Easy to extend declaratively with custom code: a consistent model for UDO, UDF, and UDAgg. Easy to query remote sources, even without external tables.
    U-SQL UDAgg
    - Code and compile a .cs file: implement IAggregate’s 3 methods, Init(), Accumulate(), and Terminate(); C# takes care of type checking, generics, etc.
    - Deploy: with tooling, one-click registration of the assembly in the user db; by hand, copy the file to ADL and run CREATE ASSEMBLY to register the assembly
    - Use via AGG<MyNamespace.MyAggregate<T>>(a)
    U-SQL UDF
    - Code in C#, register the assembly once, call by C# name.
  • #20 Extensions require .NET assemblies to be registered with a database
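
The aggregate contract mentioned in note #18 (Init(), Accumulate(), Terminate()) can be illustrated outside the service. What follows is a minimal sketch in Python rather than C# or U-SQL, so that it runs anywhere. The names `MySum` and `group_aggregate` and the sample rows are hypothetical and chosen only for illustration. The sketch mimics the per-group lifecycle of a user-defined aggregate, roughly what `AGG<MyAgg.MySum>(o.amount)` with `GROUP BY c.cid` expresses. It says nothing about how the U-SQL runtime actually parallelizes the work.

```python
from collections import defaultdict


class MySum:
    """Hypothetical aggregate mirroring the Init/Accumulate/Terminate contract."""

    def init(self):
        # Called once per group, before any row is seen.
        self.total = 0.0

    def accumulate(self, value):
        # Called once for every row in the group.
        self.total += value

    def terminate(self):
        # Called once per group to produce the final value.
        return self.total


def group_aggregate(rows, key, value, agg_factory):
    """Group rows by `key` and run one aggregate instance per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    result = {}
    for group_key, values in groups.items():
        agg = agg_factory()
        agg.init()
        for v in values:
            agg.accumulate(v)
        result[group_key] = agg.terminate()
    return result


# Made-up order rows, for illustration only.
orders = [
    {"cid": 1, "amount": 10.0},
    {"cid": 1, "amount": 5.5},
    {"cid": 2, "amount": 7.0},
]
totals = group_aggregate(orders, "cid", "amount", MySum)
print(totals)  # {1: 15.5, 2: 7.0}
```

Keeping the state inside one small object per group is what lets the same aggregate be reused across queries once it is registered, which is the code-sharing point the notes make.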