A popular pattern today is the injection of declarative (or functional) mini-languages into general-purpose host languages. Years ago, this is what LINQ for C# was all about. Now there are many more examples, such as the Spark or Beam APIs for Java and Scala. The opposite embedding is also possible: start with a declarative (or functional) language as the outer host and then embed a general-purpose language. This is the path we took years ago for Scope (a Microsoft-internal big data analytics language) and have recently shipped as U-SQL. In this case, the host language is close to T-SQL (Transact-SQL, Microsoft’s SQL dialect for SQL Server and Azure SQL DB) and the embedded language is C#. By embedding the general-purpose language in a declarative language, we enable all-of-program (not just all-of-stage) optimization, parallelization, and scheduling. The resulting jobs can flexibly scale to leverage thousands of machines.
2. Who is Clemens?
Masters in EE: RWTH Aachen
PhD in CS: ETH Zurich (under Niklaus Wirth)
Post Doc: ICSI at UC Berkeley
Co-founder of Oberon microsystems and Myriad Group
Assoc/Prof CS at Queensland UT, Brisbane, Australia
Architect, Developer, Lead, Manager, Group Manager in Research, Office, Connected Systems, Data Group at Microsoft since 1999
Languages: Oberon 2, Sather 2, Component Pascal, Mianjin, PQEL (M), SA-QL, U-SQL
Systems: Ethos, Tenet 2, Gardens, Blackbox Component Builder, Project Oslo, .Net Managed Extensibility Framework, Power Query (in Excel & Power BI), Azure Stream Analytics, Azure Data Lake Analytics, Azure Time Series Insights
3. Big Data and Machine Learning / AI
Big Data means
- a lot of data (volume)
- crazy shapes (variety)
- incoming!! (velocity)
Analytics (incl. ML/AI) over Big Data means
- compute at massive scale
- complexity
- fault tolerance
XKCD license
4. The Big Data Explosion
[Figure: data size (GB, TB, PB, EB, …) plotted against data complexity (variety and velocity).]
Big Data: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video
Web 2.0: web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce
ERM / CRM: payables, payroll, inventory, contacts, deal tracking, sales pipeline
Prefix  Symbol  1000^n   10^n    Name
yotta   Y       1000^8   10^24   septillion
zetta   Z       1000^7   10^21   sextillion
exa     E       1000^6   10^18   quintillion
peta    P       1000^5   10^15   quadrillion
tera    T       1000^4   10^12   trillion
giga    G       1000^3   10^9    billion
mega    M       1000^2   10^6    million
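To make the prefix table concrete, here is a minimal Python sketch that renders a byte count with the largest matching decimal prefix. The function name `humanize_bytes` is made up for illustration, and only the prefixes listed in the table are covered (so anything below a megabyte is printed in plain bytes).

```python
# Decimal (SI) prefixes from the table, as powers of 1000, largest first.
SI_PREFIXES = [
    ("yotta", "Y", 8),
    ("zetta", "Z", 7),
    ("exa", "E", 6),
    ("peta", "P", 5),
    ("tera", "T", 4),
    ("giga", "G", 3),
    ("mega", "M", 2),
]


def humanize_bytes(n: int) -> str:
    """Render a byte count using the largest prefix that keeps the value >= 1."""
    for _name, symbol, power in SI_PREFIXES:
        unit = 1000 ** power
        if n >= unit:
            return f"{n / unit:g} {symbol}B"
    # Smaller than the smallest prefix in the table: report plain bytes.
    return f"{n} B"
```

For example, `humanize_bytes(10**18)` describes one exabyte, the order of magnitude quoted for Cosmos storage later in the deck.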
6. All data generated
Schema agility AND experimentation AND ML, image processing, graph, streaming
Operational Data
Highly modeled schema
Relational algebra
7. Examples: Big Data at Microsoft
Cosmos and Scope
- Rooted in Dryad
- A decade of development
Productized as Azure Data Lake
- ADL Store
- ADL Analytics with U-SQL
Kafka Four Commas Club (ingestion of a trillion+ events/day)
Cosmos stores exabytes of active data
Scope processes hundreds of petabytes a day
Supports batch, interactive, streaming, ML
Data ranges across most Microsoft products
Bing and MSN click streams
Office and Windows telemetry
Xbox gaming
Cosmos comprises
- hundreds of thousands of machines
- millions of cores
- petabytes of RAM
- exabytes of disks
Still only a fraction of the global Microsoft Cloud
9. How can you leverage Big Data?
Use the Power of Public Cloud Services to move beyond
- hardware lifecycles
- infrastructure management
- physical and cyber security infrastructure
- inflexible demand/scale/cost structures
- inadequate geo reach
Well, it can be fairly simple, actually
10. Full Service
A family of services designed for composition is called Platform as a Service.
Contrast with low-level building blocks (VMs, storage, network): Infrastructure as a Service.
Contrast also with finished solution services: Software as a Service.
Composition units are fully deployed and operated instances.
Contrast this with source reuse and with software components.
Litmus test: Can you ask “who pays the power bill?”
Cloud services allow users to shed the cost of operations, enable users to stay on top of software and hardware trends, and virtualize physical resources of extreme capacity.
11. Who Does What
Note that private clouds aim for many of the same benefits, but on top of resources deployed and controlled on a customer’s premises.
Azure Stack enables such private clouds while retaining much of the Azure model.
Key Cloud Value Proposition: Separation of Responsibilities
[Figure: the same stack of layers shown four times, once each for On Premises, Infrastructure (as a Service), Platform (as a Service), and Software (as a Service), with every layer marked as either Customer Manages or Provider Manages.]
Layers, top to bottom: Applications, Data, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking
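The per-layer customer/provider marking of the four stacks did not survive in this transcript. The Python sketch below encodes the split as it is commonly drawn for this kind of diagram; the cut-off points are an assumption, not something read off the slide, and `who_manages` is a made-up helper name.

```python
# Layers of the stack, top to bottom, as listed on the slide.
LAYERS = [
    "Applications", "Data", "Runtime", "Middleware", "OS",
    "Virtualization", "Servers", "Storage", "Networking",
]

# ASSUMPTION: how many layers, counted from the top, the customer manages
# in each model. These values follow the commonly cited split and are not
# recoverable from the slide text itself.
CUSTOMER_MANAGED_TOP_LAYERS = {
    "On Premises": 9,
    "Infrastructure (as a Service)": 5,
    "Platform (as a Service)": 2,
    "Software (as a Service)": 0,
}


def who_manages(model: str, layer: str) -> str:
    """Return 'Customer' or 'Provider' for a layer under a service model."""
    cut = CUSTOMER_MANAGED_TOP_LAYERS[model]
    return "Customer" if LAYERS.index(layer) < cut else "Provider"
```

Under this assumed split, moving from IaaS to PaaS hands the OS, middleware, and runtime over to the provider, which is the "separation of responsibilities" the slide title refers to.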
12. Service Quality
Security, Privacy
Tenancy Model
Performance Predictability
Scale vs. Price
Result Semantics
Management
Geographic and Geopolitical
Reliability
There are many qualities that characterize a service.
The total design space for services is overconstrained if all qualities are equally important.
Understanding tradeoffs is thus essential – and defines the service engineering discipline.
13. Security, Privacy
• Encryption
  • Data at rest (or always)
  • Service- vs. user-managed keys
• Authentication
  • Two-factor authentication
  • Azure Active Directory (AAD) integration
  • Federation with other providers and on-premises systems (AD)
• Authorization
  • Role definitions, user-to-role assignments (RBAC)
  • Managed shareable access keys (SAS, OAuth)
• DoS protection, attack and intrusion detection
• Network isolation
  • All IP endpoints that are internal to a solution, on premises or in the cloud, are not exposed on the Internet (VNets: virtual networks)
Is the data secured?
Is the solution IP secured?
Is the service quality secured?
14. Performance Predictability
Spectrum of predictability based on willingness to pay
• Static pre-allocation of fully dedicated resources
• Based on conservative static calculation of resource needs
• Based on requested resources – typically based on estimates grounded in historic observation
• Either fails early or will run with promised resources
• Can be undermined by failures of underlying infrastructure
• Dynamic allocation of needed resources
• Risk of out-of-resource rejection in mid course
• Dynamic sharing of resources
• Risk of noisy neighbor impact
• Can use quota enforcement policies to keep individual resource consumers within bounds
• Typically subject to overbooking policies to ensure high level of resource utilization (and thus delivery at low cost)
Common performance metrics:
Throughput – data processed per time unit.
Latency – time between earliest possible and actual delivery of results.
Metrics are usually observed relative to benchmark or actual workloads.
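The two metric definitions can be written down directly. The following is a small Python sketch under stated assumptions: the `Delivery` record and its field names are invented for illustration, and times are plain seconds on a shared clock.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    """One delivered result (hypothetical record, not from the deck)."""
    bytes_processed: int
    earliest_possible_s: float  # when the result could have been ready
    delivered_s: float          # when it was actually delivered


def throughput(deliveries, window_s: float) -> float:
    """Data processed per time unit over an observation window."""
    return sum(d.bytes_processed for d in deliveries) / window_s


def latencies(deliveries) -> list:
    """Time between earliest possible and actual delivery, per result."""
    return [d.delivered_s - d.earliest_possible_s for d in deliveries]
```

In practice both numbers would be collected against a benchmark or a real workload, as the slide notes, and latency is usually summarized as a percentile rather than an average.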
15. Tenancy Model
Single-Tenant Services
• Dedicated and isolated resources are granted to a tenant
• Example: dedicated clusters (VM sets with cluster-level features for management, monitoring, etc.)
Multi-Tenant Services
• Resources are shared among tenants – simpler and cheaper
• Example: job-execution service
• Tenant isolation is fine-grained
• Security bar is more difficult to uphold
• Predictable performance often impacted: “noisy neighbors”
• Performance irregularities caused by heavy use of shared resources by some other users
A tenant is a logical customer organization.
Multiple users may authenticate under the same tenant.
A large customer may have multiple tenancies.
16. Scale vs. Price
One Size will never Fit All
• Optimizing for most desirable qualities (high availability, reliability, predictability, security, …) will counteract optimizing for price
Average vs. Peak
• Max resource envelope scale (like dedicated clusters)
• Max job envelope scale (auto-resourcing per job submission)
• Actual job envelope scale (elastic resourcing over duration of job)
• Likely discretized to max. elastic adaptation rate
Price sensitivity of a customer depends on the value that a solution generates for the customer.
Worst case example: solution is required by law but does not generate business value.
Best case example: solution itself is a high-margin product sold by customer.
[Figure: workload over time]
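The three envelopes under "Average vs. Peak" can be compared with a little arithmetic. The Python sketch below is one possible reading, not the deck's own model: a job is a hypothetical `(start_slot, demand_per_slot)` pair, cost is counted in resource-slots, and the `step` parameter stands in for the "max. elastic adaptation rate" by letting resources change only every `step` slots.

```python
def demand_per_slot(jobs, horizon):
    """Total resource demand in each time slot across all jobs."""
    total = [0] * horizon
    for start, demand in jobs:
        for offset, units in enumerate(demand):
            total[start + offset] += units
    return total


def max_resource_envelope(jobs, horizon):
    """Dedicated cluster: sized to the overall peak, paid for all the time."""
    return max(demand_per_slot(jobs, horizon)) * horizon


def max_job_envelope(jobs):
    """Per-job sizing: each job holds its own peak for its whole duration."""
    return sum(max(demand) * len(demand) for _start, demand in jobs)


def actual_job_envelope(jobs, step=1):
    """Elastic sizing: follow demand, adapting only every `step` slots."""
    total = 0
    for _start, demand in jobs:
        for i in range(0, len(demand), step):
            window = demand[i:i + step]
            total += max(window) * len(window)
    return total


# Two made-up jobs over a 10-slot horizon.
example_jobs = [(0, [2, 8, 3, 1]), (6, [4, 4])]
```

With these made-up jobs the dedicated envelope is the most expensive, per-job sizing roughly halves it, and elastic sizing approaches the actual demand as `step` shrinks.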
17. Result Semantics
Many possible models for “good results”
Deterministic results
• Given inputs and chosen service operations fully determine results
Repeatable results
• While not fully pre-determined, rerunning a service request over the same inputs will yield the same results (e.g., journaling of tie breaks)
Asymptotic results
• Over time, a service operation will yield closer approximations of the ideal results (e.g., eventual consistency)
Best effort
• For some definition of effort, a service makes a best effort to yield the desired result, but always returns the result it came up with
Time boxed
• Special best-effort case: do the best within an allotted time bound
A valid result of a service request is one that meets the requestor’s requirements.
Ideal definitions of validity (such as mathematical ones) are often “overkill” and sometimes unattainable.
Practical definitions create a high-dimensional engineering space.
18. Composing Solutions
Security, Privacy
Tenancy Model
Performance Predictability
Scale vs. Price
Result Semantics
Management
Geographic and Geopolitical
Reliability
A closed solution can meet many requirements by fiat.
A composed solution can still be closed (hide its composition).
An openly composed solution exposes its composition; here, meeting many requirements is hard.
19. Composition
Composing solutions over cloud platform services
In-house or Solution Integrator
  Cloud-only or hybrid
  Hide cloud platform as implementation detail
Independent Solution Vendor
  As above, but in addition:
  Create multi-tenancy solution on top of platform
  Create billing models, incl. abstraction of platform bills
Independent Platform Vendor
  As above, but in addition:
  Create new platform abstractions over existing ones
  A platform is open for third-party contributions (extensions) and solutions built atop
  Enabling independent platform construction atop creates many hard transitive challenges
Composition takes components as is and assembles them into larger pieces.
This is different from lower-level forms of source-code reuse.
For service composition, lower-level forms are definitely excluded.
20. How to Encode Compositions?
Hey, Clemens, why can’t I just encode all this stuff in …
… Python
… Scala
… Go
… R
… C#
… Java
… whatnot?
I mean, composition is just programming after all, no?
21. Composing using Languages
Instructions can be very low-level (close to the machine’s primitive operations)
Instructions can be very high-level (close to the problem domain at hand)
Most languages strike a balance
Too low-level (limits audience, limits target machines)
Too high-level (limits audience, limits problem domains)
Given a computer with some primitive operations and a problem to solve: formulate a composition of instructions to the computer that solves the problem.
[Figure: axes labeled Skills/Interest and Audience; regions labeled Machine Specific, “General Purpose”, and Domain Specific.]
22. Audience-Specific Languages
Languages that strive to be “general purpose” end up being not quite right at most anything.
To compensate, such languages develop a large arsenal of specialized but overlapping capabilities.
The ideal maximized audience is subdued by complexity.
Larger audiences can be served with simpler languages to either side of the “general purpose” point.
Consider a variety of personas that characterize how groups of people get their tasks done.
Consider a set of personas that fall into comparable needs/skills categories. Call that an audience.
[Figure: axes labeled Skills/Interest, Audience, and Complexity; regions labeled Machine Specific, “General Purpose”, Domain Specific, and “Audience Specific”.]
23. Domain-Specific Languages
Embedded or internal DSLs were the latest craze for a while
  Language-Integrated Query (LINQ) is a popular example
  Functional monadic query operators embedded as an expression sub-language in general-purpose C#, even with its own syntax
Analyzability (static or dynamic) suggests the more limited language be on the outside
  Opposite of LINQ style of languages (that embed a functional DSL inside a general-purpose programming language)
  Example: U-SQL language that is essentially T-SQL DQL as the outer layer and C# as the inner layer
DSLs come in two common shapes: internal (or embedded) and external.
An internal DSL is embedded inside a general-purpose language.
An external DSL is its own top-level entity. Oddly, it may embed a general-purpose language.
24. The Power of Declaration – Examples
Azure Resource Manager (ARM) templates
  Declarative composition of resources across services
  Repeatable deployment of solutions built over Azure platform services
Power Query expressions for Excel and Power BI
  Functional composition of dataflow across many data sources
  Dynamic analysis – pushes nested work out to smart data sources (like databases)
Azure Stream Analytics jobs
  Declarative job definition
  Functional composition of dataflow from N sources to M destinations
  Static analysis – guarantees repeatable, at-least-once results from streaming jobs
U-SQL scripts for Azure Data Lake
  Declarative job definition
  Functional composition of distributed & federated dataflow, incl. custom code
  Static analysis – determines distribution of work / federation of nested work
  Dynamic analysis – determines affinity of compute, failure masking tactics
Declarative languages establish the shape of the result in a form fully amenable to static analysis (and comprehension).
For functional programming folks: think of a fixed universe of higher-order functions that are “understood” by the system (and usually have distinctive syntax)
25. U-SQL as an example of declarative power
U-SQL scripts for Azure Data Lake
• Cost
  Compile-time partition elimination
• Predictable execution
  Compile-time per-vertex memory determination
  Compiler (and Optimizer, Code Gen) runs for every submission
• Performance
  Optimizer-time plan building for scale-out and staged pipelining
  All-of-topology optimization (not just all-of-stage)
  Example: predicates pushed through (r/o annotated) custom code
  Native code gen around arbitrary custom code
• Security
  Compile-time separation of trusted from untrusted (custom) code, deployment into segregated containers
Declaration of all-of-topology semantics as a dataflow graph
Hosting of custom code in well-defined roles inside that graph
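The "predicates pushed through (r/o annotated) custom code" example can be illustrated with a toy plan rewriter. This Python sketch is not how the U-SQL optimizer is implemented; `Process`, `Filter`, `push_filters_up`, and the `read_only` annotation are hypothetical names chosen for illustration. The rewrite assumes each custom step maps one row to one row and passes its read-only columns through unchanged, which is what makes moving the filter ahead of it safe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Process:
    """Opaque custom code: one row in, one row out."""
    fn: Callable[[dict], dict]
    read_only: frozenset  # columns declared as passed through unchanged


@dataclass(frozen=True)
class Filter:
    """Declarative predicate over a single column."""
    column: str
    predicate: Callable[[object], bool]


def push_filters_up(plan):
    """Move each filter ahead of custom steps that only read its column."""
    plan = list(plan)
    moved = True
    while moved:
        moved = False
        for i in range(1, len(plan)):
            above, step = plan[i - 1], plan[i]
            if (isinstance(step, Filter) and isinstance(above, Process)
                    and step.column in above.read_only):
                plan[i - 1], plan[i] = step, above
                moved = True
    return plan


def run(plan, rows):
    """Execute a linear plan over a list of row dicts."""
    for step in plan:
        if isinstance(step, Filter):
            rows = [r for r in rows if step.predicate(r[step.column])]
        else:
            rows = [step.fn(r) for r in rows]
    return rows
```

The payoff is that the (possibly expensive) custom code only sees rows that survive the filter, while the result stays the same. A filter on a column the custom code writes is left where it is.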
26. Azure Data Lake Analytics U-SQL Scripts
U-SQL
Unifies natively SQL’s declarative nature and C#’s general power
  Metadata service keeps “assembly” definitions
  Assembly: collection of uploaded files
  Custom code is at least a .NET assembly with a public method
  Custom code can then spawn processes and load other code; for example: spin up a Python runtime with libraries and run a Python script
  Built-in support for R and Python as well as a range of cognitive functions
Unifies querying structured and unstructured data
  Unstructured data: Schema on read
  Structured data: Metadata service keeps schema
Unifies local (distributed) and remote (federated) queries
  Federate T-SQL queries to Azure SQL DB, Azure SQL DW, or to SQL Server on VMs
A U-SQL script uses an outer language of T-SQL (DDL and DQL) to host an inner imperative language (C#).
The outer declarative language is used to automatically scale and parallelize the inner islands of custom code.
27. U-SQL Extensibility
Extensibility at many levels, capturing semantic intent
  C# expressions in SELECT statements
  User-defined functions (UDFs)
  User-defined aggregates (UDAggs)
  User-defined operators (UDOs) – several kinds
Remember: T-SQL DML/DQL on the outside, C# on the inside.
C# abstractions are also the basis for extensibility.
28. User-Defined Operators
User-Defined Extractors
  Extract streams of rows from input sources
User-Defined Outputters
  Serialize results and send to output targets
User-Defined Processors
  Take one row and produce one row
  Pass-through versus transforming
User-Defined Appliers
  Take one row and produce 0 to n rows
  Used with OUTER/CROSS APPLY
User-Defined Combiners
  Combine rowsets (like a user-defined join)
User-Defined Reducers
  Take n rows and produce 1 row
Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
EXTRACT
OUTPUT
PROCESS
COMBINE
REDUCE
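The row-count contracts of three of these operator kinds (processor, applier, reducer) can be sketched in a few lines. This Python sketch only mirrors the shapes described above; it is not the U-SQL UDO interface (which is defined in C#), and the `run_*` helper names are made up.

```python
from collections import defaultdict


def run_processor(rows, processor):
    """Processor: one row in, one row out."""
    for row in rows:
        yield processor(row)


def run_applier(rows, applier):
    """Applier: one row in, zero to n rows out."""
    for row in rows:
        yield from applier(row)


def run_reducer(rows, key, reducer):
    """Reducer: all rows sharing a key in, one row out per key.

    Groups are independent of each other, so a system could hand each
    group to a different machine.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for group in groups.values():
        yield reducer(group)
```

Because each helper fixes how many rows flow in and out, the surrounding system can decide how to partition and schedule the work without looking inside the user's function.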
29. U-SQL Metadata Object Model
[Figure: object-model diagram. Recoverable labels follow.]
Containment chain: ADLA Account/Catalog, Database, Schema (multiplicities shown: [1,n], [1,n], [0,n])
Objects: Tables, Views, TVFs, Procedures, Table Types, External tables, Clustered Index, Partitions, Statistics, Data Source, Credentials, C# Assemblies, Other resources
C# objects: C# Fns, C# UDAgg, C# UDTs, C# Extractors, C# Processors, C# Appliers, C# Reducers, C# Combiners, C# Outputters
Legend: Abstract objects; User objects; relations “Refers to”, “Contains”, “Implemented and named by”; MD Name; C# Name
30. U-SQL Language Construction
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use dataflow composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across external data sources
// Make the custom C# code (MyAgg, MyNamespace, MyData) available to the script.
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );

// Schema on read: column names and types are applied while extracting.
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// C# expressions and a user-defined aggregate inside a declarative SELECT.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

// User-defined outputter.
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
31. Designed for Massive Scale
JOIN operators
  INNER JOIN
  LEFT, RIGHT, or FULL OUTER JOIN
  CROSS JOIN
  SEMIJOIN equivalent to IN subquery
  ANTISEMIJOIN equivalent to NOT IN subquery
Language constraints steer user towards parallelizable patterns
• ON clause comparisons need to be of the simple form (“equijoin”): rowset.column == rowset.column, or AND conjunctions of such simple equality comparisons
• If a comparand is not a column, wrap it into a column in a previous SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• Turn the join into a CROSS JOIN if no equality comparison
U-SQL is the product form of the Microsoft-internal Scope.
Scope runs big parts of the business on hundreds of thousands of machines.
Single jobs easily expand to run on thousands of machines.
The U-SQL language is constrained to steer users towards patterns that can be parallelized. Example here: Joins.
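One reason the equijoin restriction helps parallelization: rows that can match always share a key value, so both inputs can be partitioned by that key and each pair of partitions joined independently. The Python sketch below illustrates this general technique (a hash-partitioned join); it is not a description of the U-SQL runtime, and the function names are made up.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Split rows into n buckets by hash of the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def hash_join(left, right, key):
    """Inner equijoin of two row lists on a shared key column."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def parallel_equijoin(left, right, key, n=4):
    """Join bucket i of left with bucket i of right only.

    Each pair of buckets is independent, so each could run on its own
    machine; here they simply run one after another.
    """
    out = []
    for l_part, r_part in zip(partition(left, key, n), partition(right, key, n)):
        out.extend(hash_join(l_part, r_part, key))
    return out
```

A comparison such as `<` offers no such partitioning key, which is why the language pushes non-equality conditions into the WHERE clause or into an explicit CROSS JOIN.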
32. U-SQL Compilation
Run before every execution to leverage actual input data characteristics.
Analysis of entire U-SQL script as well as metadata from data sources.
Elimination of empty partitions.
Splitting into pipelineable steps.
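The deck does not say how partitions are eliminated. As an illustration only, the Python sketch below assumes the compiler can see a row count and a key range for each partition, and drops partitions that are empty or cannot satisfy a range predicate. The `Partition` record and `eliminate_partitions` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Per-partition metadata (assumed to be available at compile time)."""
    path: str
    row_count: int
    min_key: int
    max_key: int


def eliminate_partitions(partitions, lo, hi):
    """Keep only partitions that can hold rows with lo <= key <= hi."""
    return [
        p for p in partitions
        if p.row_count > 0 and p.max_key >= lo and p.min_key <= hi
    ]
```

Running this kind of check before every execution, against current metadata, is what lets the job skip reading data it can prove irrelevant.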
33. U-SQL Optimization
Run after every compilation.
Builds physical execution graph.
Groups pipelineable steps into stages.
Stages are scaled out to execute over a chosen number of vertices, influenced by input sharding, stages before and after.
Per-job and user-driven level of parallelization.
Tooling:
Detailed visibility into execution steps, for debugging.
Heatmap-like functionality to identify performance bottlenecks.
The power of top-level declarative
  Not just all-of-stage, but all-of-topology optimization
  Example: push predicate upstream, even through (certain) UDOs
  Topology is computed by the system, not programmed by the developer (as it is in Storm, Spark, etc.)
  Both static and dynamic optimization
  Sizing of resources to meet capacity needs is more predictable
The power of code inside defined “islands” within declarative topology
  UDx as points of defined extensibility
  Top-level coupling of C# into U-SQL
  Nested support for Python, R, Java
Thus: bring your favorite library, written for your favorite language requiring your favorite runtime – and enjoy!
Some caveats …
Flow
Intro -> Big Data / Analytics -> First Demo -> Declarative + Code Power -> Q&A
Show 2.2 PB ADLS account with demo data for customer support scenario (call center logs in the hundreds of gigabytes, plus related social media and telemetry logs)
Show analysis over petabytes of data using U-SQL
Show solution “monitor” page
Show 2 PB ADLS dashboard
Show call center log in file preview
Switch to VM, show U-SQL and R snippets, incl. simple job graph, replay it
Switch to VS, show telemetry jobs, pick 2h running job, show job graph and replay it, show issues analysis (data skew), click link to help page