3 CityNetConf - sql+c#=u-sql

SQL + C# = U-SQL
Łukasz Grala
Architect Data Platform & Advanced Analytics & BI Solutions
Data Platform MVP
Gdańsk, 18.05.2016

@Łukasz Grala – lukasz@tidk.pl
• Architect of Data Platform, Business Intelligence and Advanced Analytics solutions at TIDK
• Microsoft Certified Trainer and university lecturer
• Author of advanced training courses and workshops, and of numerous publications and webcasts
• Recipient of the Microsoft Data Platform MVP award since 2010
• PhD student at Politechnika Poznańska (Poznań University of Technology), Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and abroad
• Holds numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Member of Polskie Towarzystwo Informatyczne (Polish Information Processing Society)
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage and processing; jazz lover

Data (Big Data)
• 72 hours of video are uploaded per minute on YouTube (1 terabyte every 4 minutes)
• 500 terabytes of new data per day are ingested in Facebook databases
• Sensors from a Boeing jet engine create 20 terabytes of data every hour
• The proposed Square Kilometer Array telescope will generate “a few Exabytes of data per day” (single beam)

Internet of Things (IoT)

Azure Data Lake Storage & Analytics

Azure Data Lake
• Analytics: Analytics Service (U-SQL) and HDInsight (managed Hadoop Clusters), on YARN
• Store: accessed through WebHDFS

Azure Data Lake Store Services
• HADOOP FILE SYSTEM: HDFS for the cloud
• ENTERPRISE READY: access control, encryption at rest
• Store ANY DATA in its native format (csv, tsv, json, tables, images, …)
• No limits to SCALE
• Optimized for analytic workload PERFORMANCE

Data Lake Store – Filesystem
• WebHDFS API, REST
• Use: adl://
• adl://<data_lake_store_name>.azuredatalakestore.net
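
The addressing scheme above can be sketched in a few lines. This is a minimal illustration in Python (used here only as a neutral scripting language), not an official client: the helper names `adl_uri` and `webhdfs_url`, the store name `mystore`, and the `/webhdfs/v1` URL prefix are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def adl_uri(store_name: str, path: str = "/") -> str:
    """Build an adl:// URI for a path in a Data Lake Store account."""
    return f"adl://{store_name}.azuredatalakestore.net/{quote(path.lstrip('/'))}"


def webhdfs_url(store_name: str, path: str, op: str) -> str:
    """Build a WebHDFS-style REST URL (the /webhdfs/v1 prefix is an assumption)."""
    query = urlencode({"op": op})
    return (
        f"https://{store_name}.azuredatalakestore.net"
        f"/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"
    )


# "mystore" is a hypothetical account name used only for illustration.
print(adl_uri("mystore", "/input/orders.txt"))
print(webhdfs_url("mystore", "/input", "LISTSTATUS"))
```

Both helpers only build strings; authentication (Azure AD) and the actual HTTP call are left out on purpose.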

Azure Data Lake Analytics Services
• Built on Apache YARN
• Scales dynamically with the turn of a dial
• Pay by the query
• Supports Azure AD for access control, roles, and integration with on-prem identity systems
• Built with U-SQL to unify the benefits of SQL with the power of C#
• Processes data across Azure

Work across all cloud data
Azure Data Lake Analytics works across:
• Azure SQL DW
• Azure SQL DB
• Azure Storage Blobs
• Azure Data Lake Store
• SQL DB in an Azure VM

Azure Data Lake Store & Analytics

Demo
ADLS&A in Action!

Azure Data Lake Analytics
An elastic analytics service built on Apache YARN that processes all data, at any size
• No limits to SCALE
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Optimized to work with ADL STORE
• FEDERATED QUERY across Azure data sources
• ENTERPRISE READY Role based access control & Auditing
• Pay PER JOB & Scale PER JOB

The origins of U-SQL
SCOPE – Microsoft’s internal Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs hundreds of thousands of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, metadata model, etc.)

U-SQL – Language Overview
U-SQL Fundamentals
• All the familiar SQL clauses: SELECT | FROM | WHERE | GROUP BY | JOIN | OVER
• Operate on unstructured and structured data
• Relational metadata objects
.NET integration and extensibility
• U-SQL expressions are full C# expressions
• Reuse .NET code in your own assemblies
• Use C# to define your own: Types | Functions | Joins | Aggregators | I/O (Extractors, Outputters)

U-SQL Capabilities
[Figure: workloads Batch, Interactive, Streaming and Machine Learning, each labelled AVAILABLE NOW, IN PROGRESS or FUTURE]

U-SQL Distributed Query
Data sources (the slide draws READ and WRITE arrows for each):
• Azure Data Lake Store
• Azure Storage Blobs
• Azure SQL Database
• Azure SQL Data Warehouse
• Azure SQL DB in Azure VM

U-SQL extensibility
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)

U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for user code:
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources
// Reference the assembly that holds the user-defined C# code
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );

// Schema on read: extract typed rowsets from the input files
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// SQL clauses combined with C# expressions and a user-defined aggregator
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

// Write the result with a user-defined outputter, then load it into the table
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;

Meta Data Object Model
• ADLA Catalog contains [1,n] Databases
• Database contains [1,n] Schemas, plus C# Assemblies, Data Sources and Credentials
• Schema contains [0,n] tables, views, TVFs, Procedures, Ext. tables and Table Types
• Tables have a Clustered Index, partitions and Statistics
• C# Assemblies implement: C# Fns, C# UDAgg, C# UDTs, C# Extractors, C# Processors, C# Reducers, C# Combiners, C# Appliers, C# Outputters
• Legend: Abstract objects and User objects; relationships "Refers to", "Contains", "Implemented and named by"; objects are identified by an MD Name or a C# Name

U-SQL Joins
Join operators
• INNER JOIN
• LEFT or RIGHT or FULL OUTER JOIN
• CROSS JOIN
• SEMIJOIN
• Equivalent to IN subquery
• ANTISEMIJOIN
• Equivalent to NOT IN subquery
Notes
• ON clause comparisons need to be of the simple form rowset.column == rowset.column, or AND conjunctions of such simple equality comparisons
• If a comparand is not a column, wrap it into a column in a previous SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• Turn the join into a CROSS JOIN if there is no equality comparison
Reason: Syntax calls out which joins are efficient
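
What SEMIJOIN and ANTISEMIJOIN return can be modelled outside U-SQL. The sketch below uses Python lists of dictionaries as stand-ins for rowsets; the function names and sample data are hypothetical, and null handling is deliberately ignored.

```python
def semijoin(left, right, key):
    """Rows of `left` that have at least one match in `right` (like IN)."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def antisemijoin(left, right, key):
    """Rows of `left` that have no match in `right` (like NOT IN)."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


customers = [
    {"cid": 1, "name": "Ann"},
    {"cid": 2, "name": "Bob"},
    {"cid": 3, "name": "Eve"},
]
orders = [{"oid": 10, "cid": 1}, {"oid": 11, "cid": 1}, {"oid": 12, "cid": 3}]

with_orders = semijoin(customers, orders, "cid")
without_orders = antisemijoin(customers, orders, "cid")
print([c["name"] for c in with_orders])
print([c["name"] for c in without_orders])
```

Unlike an INNER JOIN, the semijoin returns each left row at most once, even when several right rows match it, and it returns only columns from the left side.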

U-SQL Analytics
Windowing Aggregate Functions
ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
Analytics Functions
CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC,
PERCENT_RANK, LEAD, LAG
Ranking Functions
DENSE_RANK, NTILE, RANK, ROW_NUMBER
Windowing Expression
Window_Function_Call 'OVER' '('
    [ Over_Partition_By_Clause ]
    [ Order_By_Clause ]
    [ Row_Clause ]
')'.
Window_Function_Call :=
    Aggregate_Function_Call
  | Analytic_Function_Call
  | Ranking_Function_Call.
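
To make the difference between the ranking functions concrete, here is a small Python model of ROW_NUMBER, RANK and DENSE_RANK evaluated over a partition with an ascending order. It is a sketch of the usual SQL semantics, not of the U-SQL engine; `rank_rows` and the sample rows are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def rank_rows(rows, partition_by, order_by):
    """Add row_number, rank and dense_rank per partition, ordered ascending."""
    result = []
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, partition in groupby(ordered, key=itemgetter(partition_by)):
        rank = dense_rank = 0
        previous = object()  # sentinel that equals no real value
        for row_number, row in enumerate(partition, start=1):
            if row[order_by] != previous:
                rank = row_number  # RANK jumps to the current position
                dense_rank += 1    # DENSE_RANK advances by exactly one
                previous = row[order_by]
            result.append({**row, "row_number": row_number,
                           "rank": rank, "dense_rank": dense_rank})
    return result


rows = [
    {"city": "A", "amount": 10},
    {"city": "A", "amount": 10},
    {"city": "A", "amount": 20},
    {"city": "B", "amount": 5},
]
ranked = rank_rows(rows, partition_by="city", order_by="amount")
for r in ranked:
    print(r)
```

Ties share the same RANK and DENSE_RANK; after a tie RANK leaves a gap while DENSE_RANK does not, and ROW_NUMBER is always unique within the partition.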

Views and TVFs
• Views for simple cases
• TVFs for parameterization and most cases
Table-Valued Functions (TVFs)
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Infers schema or checks against specified return schema
Views
CREATE VIEW V AS EXTRACT…
CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (such as UDFs or UDOs)
• Will be inlined

Procedures
• Allows encapsulation of non-DDL scripts
• Provides parameterization
• No result but writes into file or table
• Can contain multiple statements
• Can contain user code (needs assembly reference)
• Will always be inlined
• Cannot contain DDL (no CREATE, DROP)
CREATE PROCEDURE P (@arg string = "default")
AS
BEGIN
…;
OUTPUT @res TO …;
INSERT INTO T …;
END;

Tables
• CREATE TABLE
• CREATE TABLE AS SELECT
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and partitioning
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col1 ASC)
  PARTITIONED BY HASH (col1)
);
• Structured Data
• Built-in Data types only (no UDTs)
• Clustered index (must be specified): row-oriented
• Fine-grained partitioning (must be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN

INSERT
• INSERT constant values
• INSERT from queries
• Multiple INSERTs
Multiple INSERTs into same table
• Is supported
• Generates separate file per insert in physical storage:
• Can lead to performance degradation
• Recommendations:
• Try to avoid small inserts
• Rebuild table after frequent insertions with:
ALTER TABLE T REBUILD;
INSERT constant values
INSERT INTO T VALUES (1, "text",
new SQL.MAP<string,string>("key","value"));
INSERT from queries
INSERT INTO T SELECT col1, col2, col3 FROM @rowset;

Top 5 Surprises for SQL Users
AS is not as
• C# keywords and SQL keywords overlap
• Costly to make case-insensitive -> Better build capabilities than tinker with syntax
= != ==
• Remember: C# expression language
null IS NOT NULL
• C# nulls are two-valued
PROCEDURES but no WHILE
No UPDATE nor MERGE
• Transform/Recook instead
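
The "two-valued nulls" point is easiest to see side by side with SQL's three-valued logic. The Python sketch below models both kinds of equality, with `None` standing in for null and for SQL's UNKNOWN; the function names are hypothetical and only equality is modelled.

```python
def csharp_equals(a, b):
    """Two-valued, C#-style equality: null == null is simply true."""
    return a == b


def sql_equals(a, b):
    """Three-valued, SQL-style equality: a NULL operand gives UNKNOWN (None)."""
    if a is None or b is None:
        return None
    return a == b


def where(rows, predicate):
    """Keep rows whose predicate is exactly True, as a WHERE clause would."""
    return [row for row in rows if predicate(row) is True]


values = [1, None, 3]
kept_csharp = where(values, lambda v: csharp_equals(v, None))
kept_sql = where(values, lambda v: sql_equals(v, None))
print(kept_csharp)
print(kept_sql)
```

With C#-style semantics a comparison against null selects the null row; with SQL-style semantics the same comparison selects nothing, which is why SQL needs IS NULL.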

Additional capabilities and resources
• Tools: http://aka.ms/adltoolsVS
• Blogs and community page:
• http://usql.io
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• http://www.slideshare.net/MichaelRys
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
Michael Rys - @MikeDoesBigData

• 16-18 May 2016
• Wrocław, Centrum Konferencyjne (conference centre)
• 3 days, 6 workshops, 4 tracks, over 30 speakers, 50 sessions
• 600 attendees + sponsors + speakers + organizers
• Guests from, among others, the USA, England, Germany, Ukraine, Bulgaria and Slovenia
• Technical premiere of SQL Server 2016
sqlday.pl @sqlday
Including the workshop Big Data Analytics by Łukasz Grala & Marcin Szeliga

Masterclass: Cloud Storage
23-25.05.2016, Warsaw
Azure SQL Server and Azure SQL Database, scaling a relational database in the cloud, data warehouse in the cloud, PowerShell and databases in Azure, Azure BLOB Storage, document databases, Big Data with HDInsight, Hadoop, Apache Spark, other HDInsight and Hadoop components, virtual machines
Masterclass: Cloud Analytics
20-22.06.2016, Warsaw
Data Catalog, Data Factory, Data Lake, PowerBI and relational data in the cloud, Hadoop, Apache Spark, streaming data analysis, analysis of document and graph databases, machine learning, Polybase in SQL Server 2016
Łukasz Grala
Data Platform MVP, MCT, MCSE, MCSA, MCITP, MCSA, MCP, MTA
Łukasz on the training courses:
"More data is being produced than ever before; it comes from the Internet, from social networks, from devices. The rapid growth of the Internet of Things (IoT) increases the volume of this data even further. That is why we have prepared two special courses, Cloud Storage and Cloud Analytics, which present the mechanisms for storing, processing and analysing data using the cloud."
Big Data, BI, Analytics, SQL
Standard price -25% with the code 3CityNetConf
www.hexcode.pl
