Cloud-First ETL in the Modern Data
Warehouse w/Azure Data Factory
Sr. Program Manager
Azure Data Management
https://github.com/kromerm/Azure-Data-Week-ADF
A fully-managed data integration service in the cloud
A Z U R E D ATA F A C T O R Y
H Y B R I D S C A L A B L EP R O D U C T I V E T R U S T E D
 Serverless scalability
with no infrastructure
to manage
 Drag & Drop UI
 Codeless Data
Movement
 Orchestrate where
your data lives
 Lift SSIS packages
to Azure
 Certified compliant
Data Movement
MODEL & SERVE
Azure Analysis ServicesAzure SQL Data
Warehouse
Power BI
Modernize your enterprise data warehouse at scale
A Z U R E D A T A F A C T O R Y
On-premises data
Oracle, SQL, Teradata,
fileshares, SAP
Cloud data
Azure, AWS, GCP
SaaS data
Salesforce, Workday,
Dynamics
INGEST STORE PREP & TRAIN
Azure Data Factory Azure Blob Storage
Azure Databricks
Polybase
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure SQL Database and Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.
Orchestrate with Azure Data Factory
Lift your SQL Server Integration Services (SSIS) packages to Azure
On-Premise data sources
SQL DB Managed Instance
SQL Server
VNET
Azure Data Factory
SSIS Cloud ETL
SSIS Integration Runtime
Cloud data sources
Cloud
On-premises
Microsoft
SQL Server
Integration Services
Control Flow in ADF Pipeline Builder
Coordinate pipeline activities into finite execution steps to enable looping,
conditionals and chaining while separating data transformations into
individual data flows
Activity 1 Activity 2
Activity 3
“On
Error”
Activity 1
Success,
params
Error,
param
s
My Pipeline 1
…
My Pipeline 2For Each…
Activity 4
Success,
params
Trigger
Event
Wall Clock
On Demand
Activity 1
Activity 2
…
Built-in source
control support
ADF Integration Runtime (IR)
 ADF compute environment with multiple capabilities:
- Activity dispatch & monitoring
- Data movement
- SSIS package execution
 To integrate data flow and control flow across the
enterprises’ hybrid cloud, customer can instantiate
multiple IR instances for different network environments:
- On premises (similar to DMG in ADF V1)
- In public cloud
- Inside VNet
 Bring a consistent provision and monitoring experience
across the network environments
Portal Application & SDK
Azure Data Factory Service
Data Movement & Activity
Dispatch on-prem, Cloud,
VNET
Data Movement & Activity
Dispatch In Azure Public
Network, SSIS
VNET coming soon
Self-Hosted IR Azure IR
Azure Data Factory Patterns
Azure SQL Database Change Tracking
Data Flow: Star Schema Fact &
Dimension Table Loading
Pipeline Master / Child Controller Pattern
Microsoft ETL/ELT Services in Azure
• Azure-SSIS IR: Managed cluster of Azure VMs
(nodes) dedicated to run your SSIS packages
and no other activities
• You can scale it up/out by specifying the node
size /number of nodes in the cluster
• You can bring your own Azure SQL Database
(DB)/Managed Instance (MI) server to host the
catalog of SSIS projects/packages (SSISDB) that
will be attached to it
• You can join it to a Virtual Network (VNet) that is
connected to your on-prem network to enable on-
prem data access
• Once provisioned, you can enter your Azure SQL
DB/MI server endpoint on SSDT/SSMS to deploy
SSIS projects/packages and configure/execute
them just like using SSIS on premises
Execute/Manage
Provision
SSDT SSMS
Cloud
On-
Premises Design/Deploy
SSIS Server
Design/Deploy Execute/Manage
ISVs
SSIS PaaS w/
ADF compute
resource
ADF
(PROVISIONING)
HDI
ML
Azure Data Factory
Visual Data Flow
Visual Data Flow Authoring
• Transform Data, At Scale, in the Cloud, Zero-Code
• Cloud-first, scale-out ELT
• Code-free dataflow pipelines
• Serverless scale-out transformation execution engine
• Maximum Productivity for Data Engineers
• Does NOT require understanding of Spark / Scala / Python / Java
• Resilient Data Transformation Flows
• Built for big data scenarios with unstructured data requirements
• Operationalize with Data Factory scheduling, control flow and monitoring
Code-free Data Transformation At Scale
• Does not require understanding of Spark, Big Data Execution
Engines, Clusters, Scala, Python …
• Focus on building business logic and data transformation
• Data cleansing
• Aggregation
• Data conversions
• Data prep
• Data exploration
ADF Data Flow Workstream
Data Sources Staging Transformation
s
Destination
Sort, Merge, Join,
Lookup …
• Explicit user action
• User places data
source(s) on design
surface, from toolbox
• Select explicit sources
• Implicit/Explicit
• Data Lake staging area
as default
• User does not need to
configure this manually
• Advanced feature to set
staging area options
• File Formats / Types
(Parquet, JSON, txt,
CSV …)
• Explicit user action
• User places
transformations on
design surface, from
toolbox
• User must set
properties for
transformation steps
and step connectors
• Explicit user action
• User chooses
destination
connector(s)
• User sets connector
property options
Simple Copy Flow
Build your logical data flows adding data
transformations in a guided experience
Switch to Debug Mode and select sample
data to work with for debugging
Debug mode provides row-level context and
visible results in inspector pane
Debug mode provides row-level context and
visible results in inspector pane
Interactive Expression Builder – Build data
transform expressions, not Spark code
Data Engineer derives columns using template expression
based on name and type matching. No need to define static
field names.

Azure Data Factory for Azure Data Week

  • 1.
    Cloud-First ETL inthe Modern Data Warehouse w/Azure Data Factory Sr. Program Manager Azure Data Management https://github.com/kromerm/Azure-Data-Week-ADF
  • 2.
    A fully-managed dataintegration service in the cloud A Z U R E D ATA F A C T O R Y H Y B R I D S C A L A B L EP R O D U C T I V E T R U S T E D  Serverless scalability with no infrastructure to manage  Drag & Drop UI  Codeless Data Movement  Orchestrate where your data lives  Lift SSIS packages to Azure  Certified compliant Data Movement
  • 3.
    MODEL & SERVE AzureAnalysis ServicesAzure SQL Data Warehouse Power BI Modernize your enterprise data warehouse at scale A Z U R E D A T A F A C T O R Y On-premises data Oracle, SQL, Teradata, fileshares, SAP Cloud data Azure, AWS, GCP SaaS data Salesforce, Workday, Dynamics INGEST STORE PREP & TRAIN Azure Data Factory Azure Blob Storage Azure Databricks Polybase Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure SQL Database and Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs. Orchestrate with Azure Data Factory
  • 4.
    Lift your SQLServer Integration Services (SSIS) packages to Azure On-Premise data sources SQL DB Managed Instance SQL Server VNET Azure Data Factory SSIS Cloud ETL SSIS Integration Runtime Cloud data sources Cloud On-premises Microsoft SQL Server Integration Services
  • 5.
    Control Flow inADF Pipeline Builder Coordinate pipeline activities into finite execution steps to enable looping, conditionals and chaining while separating data transformations into individual data flows Activity 1 Activity 2 Activity 3 “On Error” Activity 1 Success, params Error, param s My Pipeline 1 … My Pipeline 2For Each… Activity 4 Success, params Trigger Event Wall Clock On Demand Activity 1 Activity 2 …
  • 6.
  • 11.
    ADF Integration Runtime(IR)  ADF compute environment with multiple capabilities: - Activity dispatch & monitoring - Data movement - SSIS package execution  To integrate data flow and control flow across the enterprises’ hybrid cloud, customer can instantiate multiple IR instances for different network environments: - On premises (similar to DMG in ADF V1) - In public cloud - Inside VNet  Bring a consistent provision and monitoring experience across the network environments Portal Application & SDK Azure Data Factory Service Data Movement & Activity Dispatch on-prem, Cloud, VNET Data Movement & Activity Dispatch In Azure Public Network, SSIS VNET coming soon Self-Hosted IR Azure IR
  • 12.
    Azure Data FactoryPatterns Azure SQL Database Change Tracking Data Flow: Star Schema Fact & Dimension Table Loading Pipeline Master / Child Controller Pattern
  • 13.
    Microsoft ETL/ELT Servicesin Azure • Azure-SSIS IR: Managed cluster of Azure VMs (nodes) dedicated to run your SSIS packages and no other activities • You can scale it up/out by specifying the node size /number of nodes in the cluster • You can bring your own Azure SQL Database (DB)/Managed Instance (MI) server to host the catalog of SSIS projects/packages (SSISDB) that will be attached to it • You can join it to a Virtual Network (VNet) that is connected to your on-prem network to enable on- prem data access • Once provisioned, you can enter your Azure SQL DB/MI server endpoint on SSDT/SSMS to deploy SSIS projects/packages and configure/execute them just like using SSIS on premises Execute/Manage Provision SSDT SSMS Cloud On- Premises Design/Deploy SSIS Server Design/Deploy Execute/Manage ISVs SSIS PaaS w/ ADF compute resource ADF (PROVISIONING) HDI ML
  • 16.
  • 17.
    Visual Data FlowAuthoring • Transform Data, At Scale, in the Cloud, Zero-Code • Cloud-first, scale-out ELT • Code-free dataflow pipelines • Serverless scale-out transformation execution engine • Maximum Productivity for Data Engineers • Does NOT require understanding of Spark / Scala / Python / Java • Resilient Data Transformation Flows • Built for big data scenarios with unstructured data requirements • Operationalize with Data Factory scheduling, control flow and monitoring
  • 18.
    Code-free Data TransformationAt Scale • Does not require understanding of Spark, Big Data Execution Engines, Clusters, Scala, Python … • Focus on building business logic and data transformation • Data cleansing • Aggregation • Data conversions • Data prep • Data exploration
  • 19.
    ADF Data FlowWorkstream Data Sources Staging Transformation s Destination Sort, Merge, Join, Lookup … • Explicit user action • User places data source(s) on design surface, from toolbox • Select explicit sources • Implicit/Explicit • Data Lake staging area as default • User does not need to configure this manually • Advanced feature to set staging area options • File Formats / Types (Parquet, JSON, txt, CSV …) • Explicit user action • User places transformations on design surface, from toolbox • User must set properties for transformation steps and step connectors • Explicit user action • User chooses destination connector(s) • User sets connector property options
  • 21.
  • 22.
    Build your logicaldata flows adding data transformations in a guided experience
  • 23.
    Switch to DebugMode and select sample data to work with for debugging
  • 24.
    Debug mode providesrow-level context and visible results in inspector pane
  • 25.
    Debug mode providesrow-level context and visible results in inspector pane
  • 26.
    Interactive Expression Builder– Build data transform expressions, not Spark code
  • 27.
    Data Engineer derivescolumns using template expression based on name and type matching. No need to define static field names.

Editor's Notes

  • #6 5
  • #14 SSIS PaaS requires VNet to enable on-prem data access for now, but we may also reuse ADF Data Management Gateway (DMG) to do the same in the near future.