Big data processing increasingly needs to address not just querying big data but also applying domain-specific algorithms to large amounts of data at scale. This ranges from developing and applying machine learning models to custom, domain-specific processing of images, texts, and other data. Often the domain experts and programmers have a favorite language in which they implement their algorithms, such as Python, R, or C#. The Microsoft Azure Data Lake Analytics service makes it easy for customers to bring their domain expertise and their favorite languages to address their big data processing needs. In this session, I will showcase how you can bring your Python, R, and .NET code and apply it at scale using U-SQL.
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...Michael Rys
When analyzing big data, you often have to process data at scale that is not rectangular in nature, and you would like to scale out your existing programs and cognitive algorithms to analyze it. To address this need and make it easy for the programmer to add her domain-specific code, U-SQL includes a rich extensibility model that lets you process any kind of data, from CSV files through JSON and XML to image files, and add your own custom operators. In this presentation, we provide examples of how to use U-SQL to process interesting data formats with custom extractors and functions (including JSON and images), show how to use U-SQL's cognitive library, and finally show how U-SQL allows you to invoke custom code written in Python and R.
Slides for the SQL Saturday 635 presentation, Vancouver BC, Aug 2017.
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
Data Lakes have become a new tool in building modern data warehouse architectures. In this presentation we introduce Microsoft's Azure Data Lake offering and its big data processing language U-SQL, which makes big data processing easy by combining the declarativity of SQL with the extensibility of C#. We give an initial introduction to U-SQL by explaining why we introduced it, show with an example how to analyze some tweet data with U-SQL and its extensibility capabilities, and take you on an introductory tour of U-SQL geared towards existing SQL users.
Slides for SQL Saturday 635, Vancouver BC, Aug 2017.
Killer Scenarios with Data Lake in Azure with U-SQLMichael Rys
Presentation from Microsoft Data Science Summit 2016
Presents 4 examples of custom U-SQL data processing: Overlapping Range Aggregation, JSON Processing, Image Processing and R with U-SQL
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
When processing TBs and PBs of data, running your big data queries at scale and having them perform at their peak is essential. In this session, we show you some state-of-the-art tools for analyzing U-SQL job performance, and we discuss in-depth best practices for designing your data layout, both for files and tables, and for writing performant and scalable U-SQL queries. You will learn how to analyze performance and scale bottlenecks and pick up several tips on how to make your big data processing scripts both faster and more scalable.
U-SQL Query Execution and Performance TuningMichael Rys
This 400 level presentation explains the U-SQL Query Execution in Azure Data Lake and provides several Performance Tuning tips: What tools are available and some best practices.
A concentrated look at Apache Spark's Spark SQL library, including background information and numerous Scala code examples of using Spark SQL with CSV, JSON, and databases such as MySQL.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living in the age of the deconstructed database. Modern enterprises are powered by data, and that data lives in many formats and locations, in flight and at rest, but somewhat surprisingly, the lingua franca for data remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Cloudera, Inc.
Speaker: Marcel Kornacker
As data is ingested into Apache Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible—because “data freshness” can have a direct impact on business results.
In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, “at-source” analytics—for example, the “ETL lag” reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain, and reduces the speed at which new analytic applications can be introduced.
In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics—including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines. This performance level is attained in spite of the most challenging and diverse input formats, which are addressed through an automated background conversion process into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.
In this talk, attendees will learn about Impala’s upcoming features that will enable at-source analytics: support for nested structures such as JSON and XML documents, which allows direct querying of the source schema; automated background file format conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem; and automated creation of declaratively-specified derived data for simplified data cleansing and transformation.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...Amazon Web Services
In this session, we’ll look at the AWS services that customers are using to build and deploy Microsoft-based solutions that use technologies like Windows, .NET, SQL Server, and PowerShell. We’ll start by showing you how to build a Windows-based CI/CD pipeline on AWS using AWS CodeDeploy, AWS CodePipeline, AWS CloudFormation, and PowerShell using an AWS Quick Start. We’ll also cover best practices for how you can create templates that let you automatically deploy ready-to-use Windows products by leveraging services and tools like AWS CloudFormation, PowerShell, and Git. Woot, an online retailer for electronics, will share how it moved from using a complex mix of custom PowerShell code for its DevOps processes to using services like Amazon EC2 Simple Systems Manager (SSM), AWS CodeDeploy, and AWS Directory Service. This migration eliminated the need for complex PowerShell scripts and reduced the operational complexity of performing operational tasks like renaming servers, joining domains, and securely handling keys.
Azure Resource Manager templates: Improve deployment time and reusabilityStephane Lapointe
Azure Resource Manager is the future of Azure, and its templating features are a big improvement and simplification of how you provision resources on Azure. See how you can use Visual Studio to create complex, multi-resource ARM templates and how they can be combined and reused. Learn the different template functions available and how they help you build more advanced templates.
Introduction to Azure Data Lake and U-SQL presented at Seattle Scalability Meetup, January 2016. Demo code available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
Continuous Integration and the Data Warehouse - PASS SQL Saturday SloveniaDr. John Tunnicliffe
Continuous integration is not normally associated with data warehouse projects due to the perceived complexity of implementation. John shows how modern tools make it simple to apply CI to the data warehouse. The session covers:
* The benefits of the SQL Server Data Tools declarative model
* Using PowerShell and psake to automate your build and deployments
* Implementing the TeamCity build server
* Integration and regression testing
* Auto-code generation within SSDT using T4 templates and DacFx
Alex Thissen "Server-less compute with .NET based Azure Functions"Fwdays
Azure Functions offer serverless cloud services and are the next evolution in distributed computing and hosting after containers. Join this session to learn how to develop Azure Functions using C# and .NET, and how to test, build, and deploy them to Azure. We will cover .NET programming details, architecture, internals, and hosting. You will also learn how to set up a local development environment to build and host functions locally. After this session you will be ready to design and create state-of-the-art Azure Functions using .NET.
In this presentation you will learn about:
• CloudFormation 101
– The building block of Infrastructure as Code
• CodePipeline and CodeCommit 101
– Tools for our IaC pipeline
• Review of an example IaC Pipeline
– Automated validation
– Least privilege enforcement
– Manual review/approval
Getting started with Appcelerator TitaniumTechday7
Techday7: "Getting Started with Appcelerator Titanium," from the cross-platform application development with Appcelerator Titanium event, presented by Naga Harish M, Lead Developer at Anubavam Technologies.
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Michael Rys
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Michael Rys
Presentation by James Baker and me on running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, the integration benefits with ADLS, and cost-reducing features such as Query Acceleration, integrated Spark and SQL processing with shared metadata, and .NET for Apache Spark support.
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
The presentation discusses how to migrate expensive open-source big data workloads to Azure and leverage the latest compute and storage innovations within Azure Synapse and Azure Data Lake Storage to develop powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
More and more customers who are looking to modernize their analytics are exploring the data lake approach in Azure. Typically, they are most challenged by a bewildering array of poorly integrated technologies and a variety of data formats and data types, not all of which are conveniently handled by existing ETL technologies. In this session, we'll explore the basic shape of a modern ETL pipeline through the lens of Azure Data Lake. We will explore how this pipeline can scale from one to thousands of nodes at a moment's notice to respond to business needs, how its extensibility model allows pipelines to simultaneously integrate procedural code written in .NET languages or even Python and R, how that same extensibility model allows pipelines to deal with a variety of formats such as CSV, XML, JSON, images, or any enterprise-specific document format, and finally how the next generation of ETL scenarios is enabled through the integration of intelligence in the data layer in the form of built-in cognitive capabilities.
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake.
During this full training day we will share the architecture patterns, tooling, learnings, and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time to develop some of your own U-SQL scripts to process your data, and discuss the pros and cons of files versus tables.
These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand grow and supply continue to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R code at scale with U-SQL (SQLBits and SQLKonferenz 2018)
1. Run Python, R and .NET
code at Data Lake scale
with U-SQL in Azure
Data Lake
Michael Rys
Principal Program Manager Big Data Team, Microsoft
@MikeDoesBigData
usql@microsoft.com
2. Agenda
• Characteristics of Big Data Analytics Programming
• Scaling out existing code with U-SQL:
• Scaling out Cognitive Libraries
• Introduction to U-SQL’s Extensibility Framework
• Scaling out .NET with U-SQL:
• Custom Image processing
• Scaling out Python with U-SQL
• Scaling out R with U-SQL:
• Model generation, Model testing and scoring
3. Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Characteristics
of Big Data
Analytics
• Requires processing
of any type of data
• Allow use of custom
algorithms
• Scale to any size and
be efficient
Bring your own coding expertise and
existing code and scale it out?
4. Status Quo:
SQL for
Big Data
Declarativity does scaling and
parallelization for you
Extensibility is bolted on and
not “native”
hard to work with anything other than
structured data
difficult to extend with custom code:
complex installations and frameworks
Limited to one or two languages
5. Status Quo:
Programming
Languages for
Big Data
Extensibility through custom code
is “native”
Declarativity is bolted on and
not “native”
User often has to
care about scale and performance
SQL is 2nd class within string, only local
optimizations
Often no code reuse/
sharing across queries
6. Why U-SQL? Declarativity and Extensibility
are equally native!
Get benefits of both!
Scales out your custom imperative Code
(written in .NET, Python, R, and more to come)
in a declarative SQL-based framework
R
Python
.NET
U-SQL Framework
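For illustration, a minimal U-SQL sketch of this combination (the file path and column names are made up, not taken from the deck): the query is declarative SQL, while every expression in it is ordinary C# that the compiler parallelizes for you.
@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();
// The WHERE predicate and the SELECT expressions below are plain C#;
// U-SQL scales them out across vertices automatically.
@rs =
    SELECT Region.ToUpperInvariant() AS NormalizedRegion,
           Query,
           (Query ?? "").Length AS QueryLength
    FROM @searchlog
    WHERE Start >= DateTime.Parse("2018-01-01");
OUTPUT @rs
TO "/output/queries_by_region.csv"
USING Outputters.Csv();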
8. Scale Out Cognitive Library
https://github.com/Azure/usql/tree/master/Examples/ImageApp
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
Car Green
Parked
Outdoor
Racing
9. REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;
@imgs =
EXTRACT FileName string, ImgData byte[]
FROM @"/images/{FileName}.jpg"
USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects on each image and tag them
@objects =
PROCESS @imgs
PRODUCE FileName,
NumObjects int,
Tags SqlMap<string, float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
OUTPUT @objects
TO "/objects.tsv"
USING Outputters.Tsv();
Imaging
10. REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];
@WarAndPeace =
EXTRACT No int,
Year string,
Book string, Chapter string,
Text string
FROM @"/usqlext/samples/cognition/war_and_peace.csv"
USING Extractors.Csv();
@sentiment =
PROCESS @WarAndPeace
PRODUCE No,
Year,
Book, Chapter,
Text,
Sentiment string,
Conf double
USING new Cognition.Text.SentimentAnalyzer(true);
OUTPUT @sentiment
TO "/sentiment.tsv"
USING Outputters.Tsv();
Text Analysis
11. U-SQL/Cognitive
Example
• Identify objects in images (tags)
• Identify faces and emotions and images
• Join datasets – find out which tags are associated with happiness
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
@objects =
PROCESS MegaFaceView
PRODUCE FileName, NumObjects int, Tags SqlMap<string,float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
@tags =
SELECT FileName, T.Tag
FROM @objects CROSS APPLY EXPLODE(Tags.Split) AS T(Tag, Conf)
WHERE Tag.Contains("dog") OR Tag.Contains("cat");
@emotion =
SELECT ImageName, Details.Emotion
FROM MegaFaceView
CROSS APPLY new Cognition.Vision.EmotionApplier(imgCol:"image")
AS Details(NumFaces int, FaceIndex int,
RectX float, RectY float, Width float, Height float,
Emotion string, Confidence float);
@correlation =
SELECT T.FileName, Emotion, Tag
FROM @emotion AS E
INNER JOIN
@tags AS T
ON E.FileName == T.FileName;
Images
Objects Emotions
filter
join
aggregate
12. U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R etc.
Built-in operators,
functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
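A small sketch of the first two extension points (the assembly and function names MyDb.MyCode and MyCode.Formatters.CleanName are hypothetical, used only for illustration): an inline C# expression next to a user-defined function from a registered assembly.
REFERENCE ASSEMBLY MyDb.MyCode; // hypothetical assembly containing the UDF
@cleaned =
    SELECT CustomerId,
           Name.Trim().ToLowerInvariant() AS NormalizedName, // inline C# expression
           MyCode.Formatters.CleanName(Name) AS CleanName    // user-defined function
    FROM @customers; // @customers assumed extracted earlier in the script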
13. What are UDOs? • User-Defined Extractors
• Converts files into rowset
• User-Defined Outputters
• Converts rowset into files
• User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming
• User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY
• User-Defined Combiners
• Combines rowsets (like a user-defined join)
• User-Defined Reducers
• Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL Syntax that takes a UDO
instance (created as part of the execution):
• EXTRACT
• OUTPUT
• CROSS APPLY
Custom Operator Extensions in
language of your choice
Scaled out by U-SQL
• PROCESS
• COMBINE
• REDUCE
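To make the list concrete, a minimal sketch of one UDO kind, a pass-through processor (class and column names are illustrative, not from the deck), together with the PROCESS statement that scales it out:
using Microsoft.Analytics.Interfaces;
[SqlUserDefinedProcessor]
public class CountryNameProcessor : IProcessor
{
    // Takes one row and produces one row: expands a country code into a name.
    public override IRow Process(IRow input, IUpdatableRow output)
    {
        string code = input.Get<string>("CountryCode");
        output.Set<string>("CountryCode", code);
        output.Set<string>("CountryName", code == "DE" ? "Germany" : "Unknown");
        return output.AsReadOnly();
    }
}
// Invocation in the U-SQL script:
@expanded =
    PROCESS @rows
    PRODUCE CountryCode string,
            CountryName string
    USING new CountryNameProcessor();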
14. Scaling out C# with U-SQL
https://github.com/Azure/usql/tree/master/Examples/ImageApp
Copyright Camera
Make
Camera
Model
Thumbnail
Michael Canon 70D
Michael Samsung S7
15. How to specify
.NET UDOs?
• .Net API provided to build UDOs
• Any .Net language usable
• however only C# is first-class in tooling
• Use U-SQL specific .Net DLLs
• Deploying UDOs
• Compile DLL
• Upload DLL to ADLS
• register with U-SQL script
• VisualStudio provides tool support
• UDOs can
• Invoke managed code
• Invoke native code deployed with UDO assemblies
• Invoke other language runtimes (e.g., Python, R)
• be scaled out by U-SQL execution framework
• UDOs cannot
• Communicate between different UDO invocations
• Call Webservices or Reach outside the vertex
boundary
17. • C# Class Project for U-SQL
How to specify UDOs?
18. [SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
private byte[] _row_delim;
private string _col_delim;
private Encoding _encoding;
// Define a non-default constructor since I want to pass in my own parameters
public DriverExtractor( string row_delim = "\r\n", string col_delim = ","
, Encoding encoding = null )
{
_encoding = encoding == null ? Encoding.UTF8 : encoding;
_row_delim = _encoding.GetBytes(row_delim);
_col_delim = col_delim;
} // DriverExtractor
// Converting text to target schema
private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
{
var schema = outputrow.Schema;
if (schema[i].Type == typeof(int))
{
var tmp = Convert.ToInt32(c);
outputrow.Set(i, tmp);
}
...
} //SerializeCol
public override IEnumerable<IRow> Extract( IUnstructuredReader input
, IUpdatableRow outputrow)
{
foreach (var row in input.Split(_row_delim))
{
using(var s = new StreamReader(row, _encoding))
{
int i = 0;
foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
{
OutputValueAtCol_I(c, i++, outputrow);
} // foreach
} // using
yield return outputrow.AsReadOnly();
} // foreach
} // Extract
} // class DriverExtractor
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing
pattern
• Rowsets and their schemas
in UDOs
• Setting results
• By position
• By name
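For illustration of the last bullet, the two ways a UDO can set an output cell (the column name "FileName" is just an example):
// by name
outputrow.Set<string>("FileName", fileName);
// by position (column 0 of the output schema)
outputrow.Set<string>(0, fileName);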
19. Managing Assemblies
• Create assemblies
• Reference assemblies
• Enumerate assemblies
• Drop assemblies
• VisualStudio makes registration easy!
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by system.dll, system.core.dll, system.data.dll,
System.Runtime.Serialization.dll, mscorlib.dll (e.g.,
System.Text, System.Text.RegularExpressions,
System.Linq)
• Add all other .Net Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• Powershell command
• U-SQL Studio Server Explorer and Azure Portal
• DROP ASSEMBLY db.assembly;
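A hedged end-to-end sketch of the assembly lifecycle (database, assembly, and path names are illustrative):
// Register a compiled UDO assembly previously uploaded to ADLS
CREATE ASSEMBLY IF NOT EXISTS MyDb.ImageOps FROM "/assemblies/ImageOps.dll";
// Reference it in the script that needs it
REFERENCE ASSEMBLY MyDb.ImageOps;
// Pull in a .NET Framework assembly that is not loaded by default
REFERENCE SYSTEM ASSEMBLY [System.Xml];
// Remove the assembly when it is no longer needed
DROP ASSEMBLY IF EXISTS MyDb.ImageOps;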
20. DEPLOY RESOURCE Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Use Cases:
• Script specific configuration files (not stored with Asm)
• Script specific models
• Any other file you want to access from user code on all
vertices
Semantics:
• Files have to be in ADLS or WASB
• Files are deployed to vertex and are accessible from any custom
code
Limits:
• Single resource file limit is 400MB
• Overall limit for deployed resource files is 3GB
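Since deployed resources are copied to each vertex's working directory, user code can open them by file name. A small assumption-based sketch (reusing the configfile.xml name from the example above):
// Inside a UDO or code-behind method; "configfile.xml" was deployed
// with DEPLOY RESOURCE and sits in the vertex working directory.
var configXml = System.IO.File.ReadAllText("configfile.xml");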
21. U-SQL Vertex Code (.NET)
C#
C++
Algebra
Additional non-dll files &
Deployed resources
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ADLS DEPLOY RESOURCE
System files
(built-in Runtimes, Core DLLs, OS)
22. Scale Out Python With U-SQL
Python
Author Tweet
MikeDoesBigData @AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL
AzureDataLake What are your recommendations for #SQLKonferenz? @MikeDoesBigData
Author Mentions Topics
MikeDoesBigData {@AzureDataLake} {#SQLKonferenz, #USQL}
AzureDataLake {@MikeDoesBigData} {#SQLKonferenz}
23. REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
Use U-SQL to create a massively
distributed program.
Executing Python code across
many nodes.
Using standard libraries such as
numpy and pandas.
Documentation:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
Python
Extensions
24. U-SQL Vertex Code (Python)
C#
C++
Algebra
Additional Python Libs and Script
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ExtPython
ADLS DEPLOY RESOURCE
Script.py
OtherLibs.zip
System files
(built-in Runtimes, Core DLLs, OS)
Python Python Engine & Libs
27. R running in U-
SQL
Generate a linear model
SampleScript_LM_Iris.R
REFERENCE ASSEMBLY [ExtR];
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string =
@"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
#do not return readonly columns and make sure that the column names are
the same in usql and r scripts,
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"",
""Pr"")
outputToUSQL";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData = SELECT 0 AS Par, * FROM @InputData;
@ModelCoefficients = REDUCE @ExtendedData ON Par
PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript,
rReturnType:"dataframe");
OUTPUT @ModelCoefficients TO @OutputFileModelSummary USING Outputters.Tsv();
28. R running in U-
SQL
Use a previously
generated model
REFERENCE ASSEMBLY master.ExtR;
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; //
Prediction Model
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
DECLARE @PartitionCount int = 10;
// R script to run
DECLARE @myRScript = @"
load(""my_model_LM_Iris.rda"")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
//Randomly partition the data to apply the model in parallel
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe",
stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions
USING Outputters.Csv(outputHeader:true);
29. U-SQL Vertex Code (R)
C#
C++
Algebra
Additional R Libs and Script
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ExtR
ADLS DEPLOY RESOURCE
Script.R
OtherLibs.zip
System files
(built-in Runtimes, Core DLLs, OS)
R R Engine & Libs
31. Scaling Out your Code and Language with U-SQL
Bring your Code or Write your Custom Operator Extensions in
.Net (C#, F#, etc)
Python
R
…
Scaled out by U-SQL
32. Additional
Resources
• Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education
at Microsoft Virtual
Academy online.
33. Thank you for your attention!
usql@microsoft.com@MikeDoesBigData
http://aka.ms/azuredatalake
Editor's Notes
Add velocity?
Hard to operate on unstructured data: even Hive requires metadata to be created to operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves a lot of steps, often requires access to the server's head node, and differs based on the type of operation. Requires many tools and steps.
Some examples:
Hive UDAgg
Code and compile .java into .jar
Extend AbstractGenericUDAFResolver class: Does type checking, argument checking and overloading
Extend GenericUDAFEvaluator class: implements logic in 8 methods.
- Deploy:
Deploy jar into class path on server
Edit FunctionRegistry.java to register as built-in
Update the content of show functions with ant
Hive UDF (as of v0.13)
Code
Load JAR into head node or at URI
CREATE FUNCTION USING JAR to register and load jar into classpath for every function (instead of registering jar and just use the functions)
Spark supports Custom “inputters and outputters” for defining custom RDDs
No UDAGGs
Simple integration of UDFs but only for duration of program. No reuse/sharing.
Cloud Dataflow? Requires the user to care about scale and performance
Spark UDAgg
Is not yet supported ( SPARK-3947)
Spark UDF
Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
For SQL usage, register the table: customerTable.registerTempTable("customerTable")
Register each UDF: sqlContext.udf.register("westernState", westernState _)
Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
Makes it easy for you by unifying:
Declarative and imperative
Unstructured and structured data processing
Local and remote Queries
Increase productivity and agility from Day 1 and at Day 100 for YOU!
ADL uses U-SQL to create a distributed, parallel job using simple declarative statements and provides discrete points for attaching user code
U-SQL is built on top of existing frameworks and languages
Extensions require .NET assemblies to be registered with a database