Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
U-SQL Federated Distributed Queries
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
  • Filters
  • Joins
[Diagram: U-SQL queries from Azure Data Lake Analytics reaching Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]
Federated queries
• Minimize data proliferation through data consolidation
• Same U-SQL over all Azure data (WASB, SQL Azure)
• Efficient and reliable execution strategies
• Striving to maintain semantic equivalence
• Design choices based on requirements:
  • Schema-less design: fast time-to-query and exploratory analysis
  • Schematized design: protect applications from data source changes
• Advanced federated query capabilities:
  • Built-in decisions to optimize for performance: pushdown of joins, predicates, and projections
  • Control when and what to push down
    • Prevent data source overload
    • Provide control over semantics
Data sources and external tables
• Secure credential management
• Data sources to manage connections and remoting of queries
• Schematized design: external tables to provide early bound tables for federated queries
Create secret in PowerShell
New-AzureRMDataLakeAnalyticsCatalogSecret
Create credential
CREATE CREDENTIAL Secret
WITH USER_NAME = "user@server", IDENTITY = "Secret";
Create external data source on
• Azure SQL DB
• Azure SQL DW
• SQL Server in Azure VM
CREATE DATA SOURCE SQL_PATIENTS FROM SQLSERVER WITH
( PROVIDER_STRING =
"Database=DB;Trusted_Connection=False;Encrypt=False"
, CREDENTIAL = Secret
, REMOTABLE_TYPES = (bool, byte, short, string, DateTime)
);
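For comparison, a data source over Azure SQL DB might look like the sketch below. It assumes the source type keyword is AZURESQLDB and that encryption is enabled in the provider string; the name SQLDB_PATIENTS is hypothetical, and the credential and remotable types are simply reused from the slide.

```
// Sketch only: SQLDB_PATIENTS is a hypothetical name, and AZURESQLDB is
// assumed to be the source type keyword for Azure SQL DB.
CREATE DATA SOURCE IF NOT EXISTS SQLDB_PATIENTS
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=DB;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = Secret,   // credential created in the previous step
    REMOTABLE_TYPES = (bool, byte, short, string, DateTime)
);
```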
External tables (optional)
CREATE EXTERNAL TABLE sql_patients (
[custkey] int,
[name] string,
[address] string
) FROM SQL_PATIENTS LOCATION "dbo.patients";
Federated queries
• Queries have to be in a different script from the one that creates the data source
• Pass-through queries to execute remote language
• Schema-less design: query data source location
• Schematized design: query external tables
• Semantics of federated queries close to U-SQL and C#
Pass-Through Query
@alive_patients =
SELECT *
FROM EXTERNAL SQL_PATIENTS EXECUTE @"
SELECT name
, CASE WHEN is_alive = 1
THEN 'Alive' ELSE 'Deceased' END AS status
, address, nationkey, phone
FROM dbo.patients";
Query Data Source Location
@patients = SELECT *
FROM EXTERNAL master.SQL_PATIENTS LOCATION "dbo.patients";
Query External Tables
@patients = SELECT * FROM EXTERNAL master.dbo.sql_patients;
Execution
• U-SQL Semantics
• Pushes predicates and even joins based on remotable types
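As an illustration of a predicate that is a candidate for pushdown, the sketch below filters on a string column, since string appears in the REMOTABLE_TYPES list of SQL_PATIENTS. The filter value and output path are hypothetical, and whether the filter is actually remoted is left to the U-SQL optimizer.

```
// Sketch only: "Smith" and the output path are made-up values.
@smiths =
    SELECT name, address
    FROM EXTERNAL master.SQL_PATIENTS LOCATION "dbo.patients"
    WHERE name == "Smith";   // string is remotable, so this filter may run at the source

OUTPUT @smiths
TO "/output/smiths.csv"
USING Outputters.Csv();
```

Note that a filter on custkey (an int) would not qualify under the same data source definition, because int is not in its REMOTABLE_TYPES list.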
http://aka.ms/AzureDataLake

U-SQL Federated Distributed Queries (SQLBits 2016)

Editor's Notes

  • #4
    • DATA SOURCE: Represents a remote data source such as Azure SQL Database. You have to specify all the details (connection string, credentials, etc.) required to connect to it and issue queries.
    • EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table it is based on. U-SQL does the type conversion automatically. External tables let you impose a specific schema on the remote data, shielding you from remote schema changes. You can issue queries that join external and local tables.
    • PASS-THROUGH queries: These queries are issued directly against the remote data source in the syntax of the remote data source (for example, T-SQL for Azure SQL Database).
    • REMOTABLE_TYPES: For every external data source you have to specify the list of remotable types. This list constrains the types of queries that will be remoted. Example: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal);
    • LAZY METADATA LOADING: The remote data is schematized only when the query is actually issued to the remote data source. Your program must be able to deal with remote schema changes.
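
The notes mention joining remote and local data. A minimal sketch of that pattern follows, using the schema-less LOCATION form from the slides. The file /data/visits.csv, its columns, and the output path are hypothetical, and the join assumes the remote custkey column maps to int.

```
// Remote rowset read through the data source defined earlier.
@patients =
    SELECT custkey, name
    FROM EXTERNAL master.SQL_PATIENTS LOCATION "dbo.patients";

// Local rowset: hypothetical CSV file in the Data Lake store.
@visits =
    EXTRACT custkey int,
            visit_date DateTime
    FROM "/data/visits.csv"
    USING Extractors.Csv();

// Join remote and local rows; U-SQL comparisons use C# syntax (==).
@patient_visits =
    SELECT p.name, v.visit_date
    FROM @patients AS p
         INNER JOIN @visits AS v
         ON p.custkey == v.custkey;

OUTPUT @patient_visits
TO "/output/patient_visits.csv"
USING Outputters.Csv();
```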