Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Taming the Data Science Monster with A New ‘Sword’ – U-SQL


Published on

Presentation from Microsoft Data Science Summit 2016
Introduces U-SQL

Published in: Data & Analytics
  • Login to see the comments

Taming the Data Science Monster with A New ‘Sword’ – U-SQL

  1. 1. The Data Lake approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  2. 2. Introducing Azure Data Lake Big Data Made Easy
  3. 3. WebHDFS YARN U-SQL ADL Analytics ADL HDInsight Store HiveAnalytics Storage Azure Data Lake (Store, HDInsight, Analytics)
  4. 4. Some sample use cases Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms Image Processing – Large-scale image feature extraction and classification using custom code Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms
  5. 5.  Declarativity does scaling and parallelization for you  Extensibility is bolted on and not “native”  hard to work with anything other than structured data  difficult to extend with custom code
  6. 6.  Extensibility through custom code is “native”  Declarativity is bolted on and not “native”  User often has to care about scale and performance  SQL is 2nd class within string  Often no code reuse/ sharing across queries
  7. 7.  Declarativity and Extensibility are equally native to the language! Get benefits of both! Makes it easy for you by unifying: • Unstructured and structured data processing • Declarative SQL and custom imperative Code (C#) • Local and remote Queries • Increase productivity and agility from Day 1 and at Day 100 for YOU!
  8. 8. The origins of U-SQL SCOPE – Microsoft’s internal Big Data language • SQL and C# integration model • Optimization and Scaling model • Runs 100’000s of jobs daily Hive • Complex data types (Maps, Arrays) • Data format alignment for text files T-SQL/ANSI SQL • Many of the SQL capabilities (windowing functions, meta data model etc.)
  9. 9. Benefits • Avoid moving large amounts of data across the network between stores • Single view of data irrespective of physical location • Minimize data proliferation issues caused by maintaining multiple copies • Single query language for all data • Each data store maintains its own sovereignty • Design choices based on the need • Push SQL expressions to remote SQL sources • Projections • Filters • Joins U-SQL Query Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics Azure SQL Data Warehouse Azure Data Lake Storage
  10. 10.
  11. 11. EXTRACT Expression @s = EXTRACT a string, b int FROM "filepath/file.csv" USING Extractors.Csv(encoding: Encoding.Unicode); • Built-in Extractors: Csv, Tsv, Text with lots of options • Custom Extractors: e.g., JSON, XML, etc. OUTPUT Expression OUTPUT @s TO "filepath/file.csv" USING Outputters.Csv(); • Built-in Outputters: Csv, Tsv, Text • Custom Outputters: e.g., JSON, XML, etc. (see Filepath URIs • Relative URI to default ADL Storage account: "filepath/file.csv" • Absolute URIs: • ADLS: "adl://" • WASB: "wasb://container@account/filepath/file.csv"
  12. 12. U-SQL extensibility Extend U-SQL with C#/.NET Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  13. 13.
  14. 14. Managing Assemblies • CREATE ASSEMBLY db.assembly FROM @path; • CREATE ASSEMBLY db.assembly FROM byte[]; • Can also include additional resource files • REFERENCE ASSEMBLY db.assembly; • Referencing .Net Framework Assemblies • Always accessible system namespaces: • U-SQL specific (e.g., for SQL.MAP) • All provided by system.dll system.core.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq) • Add all other .Net Framework Assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML]; • Enumerating Assemblies • Powershell command • U-SQL Studio Server Explorer • DROP ASSEMBLY db.assembly;  Create assemblies  Reference assemblies  Enumerate assemblies  Drop assemblies  VisualStudio makes registration easy!
  15. 15. 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class. Examples: DECLARE @ input string = "somejsonfile.json"; REFERENCE ASSEMBLY [Newtonsoft.Json]; REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; USING Microsoft.Analytics.Samples.Formats.Json; @data0 = EXTRACT IPAddresses string FROM @input USING new JsonExtractor("Devices[*]"); USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor]; @data1 = EXTRACT IPAddresses string FROM @input USING new json("Devices[*]");
  16. 16.
  17. 17. Simple pattern language on filename and path @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"; • Binds two columns date and suffix • Wildcards the filename • Limits on number of files (Current limit 800 and 3000 being increased in next refresh) Virtual columns EXTRACT name string , suffix string // virtual column , date DateTime // virtual column FROM @pattern USING Extractors.Csv(); • Refer to virtual columns in query predicates to get partition elimination (otherwise you will get a warning)
  18. 18.
  19. 19. ADLA Account/Catalog Database Schema [1,n] [1,n] [0,n] tables views TVFs C# Fns C# UDAgg Clustered Index partitions C# Assemblies C# Extractors Data Source C# Reducers C# Processors C# Combiners C# Outputters Ext. tables Abstract objects User objects Refers toContains Implemented and named by Procedures Creden- tials MD Name C# Name C# Applier Table Types Legend Statistics C# UDTs
  20. 20. • Naming • Discovery • Sharing • Securing U-SQL Catalog Naming • Default Database and Schema context: master.dbo • Quote identifiers with []: [my table] • Stores data in ADL Storage /catalog folder Discovery • Visual Studio Server Explorer • Azure Data Lake Analytics Portal • SDKs and Azure Powershell commands Sharing • Within an Azure Data Lake Analytics account Securing • Secured with AAD principals at catalog level (inherited from ADL Storage) • At General Availability: Database level access control
  21. 21. CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col1 ASC) DISTRIBUTED BY HASH (driver_id) ); • Structured Data • Built-in Data types only (no UDTs) • Clustered Index (needs to be specified): row-oriented • Fine-grained distribution (needs to be specified): • HASH, DIRECT HASH, RANGE, ROUND ROBIN CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …; CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…; CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT); • Infer the schema from the query • Still requires index and partitioning
  22. 22.
  23. 23. U-SQL Joins Join operators • INNER JOIN • LEFT or RIGHT or FULL OUTER JOIN • CROSS JOIN • SEMIJOIN • equivalent to IN subquery • ANTISEMIJOIN • Equivalent to NOT IN subquery Notes • ON clause comparisons need to be of the simple form: rowset.column == rowset.column or AND conjunctions of the simple equality comparison • If a comparand is not a column, wrap it into a column in a previous SELECT • If the comparison operation is not ==, put it into the WHERE clause • turn the join into a CROSS JOIN if no equality comparison Reason: Syntax calls out which joins are efficient
  24. 24. U-SQL Analytics Windowing Expression Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ] [ Order_By_Clause ] [ Row _Clause ] ')'. Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call. Windowing Aggregate Functions ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP Analytics Functions CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK; soon: LEAD/LAG Ranking Functions DENSE_RANK, NTILE, RANK, ROW_NUMBER
  25. 25. “Top 5”s Surprises for SQL Users • AS is not as • C# keywords and SQL keywords overlap • Costly to make case-insensitive -> Better build capabilities than tinker with syntax • = != == • Remember: C# expression language • null IS NOT NULL • C# nulls are two-valued • PROCEDURES but no WHILE • No UPDATE nor MERGE
  26. 26. U-SQL Language Philosophy Declarative Query and Transformation Language: • Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions • Optimizable, Scalable Expression-flow programming style: • Easy to use functional lambda composition • Composable, globally optimizable Operates on Unstructured & Structured Data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up: • Type system is based on C# • Expression language IS C# • User-defined functions (U-SQL and C#) • User-defined Aggregators (C#) • User-defined Operators (UDO) (C#) U-SQL provides the Parallelization and Scale-out Framework for Usercode • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER Federated query across distributed data sources REFERENCE MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX( AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE"New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  27. 27. Unifies natively SQL’s declarativity and C#’s extensibility Unifies querying structured and unstructured Unifies local and remote queries Increase productivity and agility from Day 1 forward for YOU! Sign up for an Azure Data Lake account and join the Public Preview and give us your feedback via or at!
  28. 28. us/documentation/services/data-lake-analytics/ US/home?forum=AzureDataLake