… data warehousing has reached the most
significant tipping point since its inception.
The biggest, possibly most elaborate data
management system in IT is changing.
– Gartner, “The State of Data Warehousing in 2012”
Data sources
1. Increasing data volumes
2. Real-time data
3. New data sources & types (non-relational data)
4. Cloud-born data
5. Data sources
Diagram: ETL Tool (SSIS, etc.) → EDW (SQL Server, Teradata, etc.)
• Steps: Extract (original data), Transform, Load (transformed data)
• Outputs: BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Diagram: ETL Tool (SSIS, etc.) → EDW (SQL Server, Teradata, etc.), with Ingest (EL) added
• Steps: Extract (original data), Transform, Load (transformed data)
• Added: Ingest (EL) of original data
• Outputs: BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Diagram: ETL Tool (SSIS, etc.) → EDW (SQL Server, Teradata, etc.), with scale-out storage and streaming added
• Steps: Extract (original data), Transform, Load (transformed data)
• Ingest (EL) of original data into Scale-out Storage & Compute (HDFS, Blob Storage, etc.), followed by Transform & Load
• Streaming data
• Outputs: BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Diagram: two Data Hubs (Storage & Compute), each with Data Sources (Import From)
• Flows: Ingest from data sources, Move data among Hubs, Move to data mart, etc.
• Information Production stages: Connect & Collect, Transform & Enrich, Publish
• Outputs: BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Diagram: two Data Hubs (Storage & Compute), each with Data Sources (Import From)
• Data Connector: Import from source to Hub
• Data Connector: Import/Export among Hubs
• Data Connector: Export from Hub to data store
• Stages: Connect & Collect, Transform & Enrich, Publish
• Outputs: BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Information Production:
• Coordination & Scheduling
• Monitoring & Mgmt
• Data Lineage
Example Scenario:
Data warehouse sales to Azure pipeline
Raw sales (custom view on top of DW tables) → Hive processing → Sales by category by day

OrderDate  Company                     Category     Qty Ordered  Unit Price  Sales Order
6/1/2004   Action Bicycle Specialists  Accessories  1716         22.0393     SO71784
6/1/2004   Action Bicycle Specialists  Bikes        2288         864.0452    SO71784
6/1/2004   Action Bicycle Specialists  Clothing     2340         26.8155     SO71784
6/1/2004   Action Bicycle Specialists  Components   598          329.8538    SO71784
6/1/2004   Aerobic Exercise Company    Components   338          133.8744    SO71915
6/1/2004   Action Bicycle Specialists  Accessories  910          25.1057     SO71938
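
The Hive step in this scenario rolls the raw rows up to sales by category by day. A minimal Python sketch of that roll-up over the sample rows follows. It is not the Hive script used in the walkthrough. It assumes that a row's sales amount is Qty Ordered × Unit Price, which the slide does not state, and the function name is illustrative only.

```python
from collections import defaultdict

# Sample rows from the raw sales view:
# (OrderDate, Company, Category, QtyOrdered, UnitPrice, SalesOrder)
RAW_SALES = [
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 1716, 22.0393, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Bikes", 2288, 864.0452, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Clothing", 2340, 26.8155, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Components", 598, 329.8538, "SO71784"),
    ("6/1/2004", "Aerobic Exercise Company", "Components", 338, 133.8744, "SO71915"),
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 910, 25.1057, "SO71938"),
]


def sales_by_category_by_day(rows):
    """Sum quantity and amount per (OrderDate, Category)."""
    totals = defaultdict(lambda: {"qty": 0, "amount": 0.0})
    for order_date, _company, category, qty, unit_price, _order in rows:
        bucket = totals[(order_date, category)]
        bucket["qty"] += qty
        # Assumption (not stated on the slide): amount = quantity * unit price.
        bucket["amount"] += qty * unit_price
    return dict(totals)


if __name__ == "__main__":
    for (day, category), t in sorted(sales_by_category_by_day(RAW_SALES).items()):
        print(day, category, t["qty"], round(t["amount"], 2))
```

Grouping on the (OrderDate, Category) pair mirrors the "by category by day" grain named on the slide; Company and Sales Order are dropped by the roll-up.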
Data Factory Walkthrough
New-AzureDataFactory -Name "HaloTelemetry" -Location "West-US"
New-AzureDataFactory -Name "DW-Demo2" -Location "West-US"
Step 1: Azure Data Factory with two data stores
• On Premises SQL Server: New User View
• Azure Blob Storage

Step 2: Datasets
• On Premises SQL Server (AdventureWorksLTDW2014): ViewOf New Sales
• Azure Blob Storage: Aggregated sales

Step 3: Copy activity
• On Premises SQL Server: ViewOf New Sales (New User View)
• Pipeline: New User Activity, Copy "NewSales" to Blob Storage
• Azure Blob Storage: Cloud New Sales

Step 4: Aggregation activity
• Pipeline: Copy New Sales to Blob Storage (output: Cloud New Sales)
• Pipeline: Aggregate New Sales on HDInsight (input: Cloud New Sales, output: AggregatedSales)
• Azure Blob Storage: Aggregated Sales
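
The walkthrough chains two activities through a shared dataset: the copy activity produces Cloud New Sales, which the HDInsight aggregation consumes. A minimal Python sketch of how a run order can be derived from such input/output declarations follows. It is not Data Factory code; the activity and dataset identifiers are illustrative names based on the slide labels, and `run_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each activity maps to (input datasets, output dataset).
# Names are illustrative, derived from the walkthrough diagram labels.
ACTIVITIES = {
    "AggregateNewSales": (["CloudNewSales"], "AggregatedSales"),
    "CopyNewSalesToBlob": (["NewSales"], "CloudNewSales"),
}


def run_order(activities):
    """Order activities so each runs after the producers of its inputs."""
    # Map every output dataset to the activity that produces it.
    producer = {out: name for name, (_inputs, out) in activities.items()}
    # An activity depends on the producers of its inputs; source datasets
    # such as NewSales have no producer and add no dependency.
    graph = {
        name: {producer[d] for d in inputs if d in producer}
        for name, (inputs, _out) in activities.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(run_order(ACTIVITIES))
```

Declaring only inputs and outputs, and deriving the order from them, is the design choice this sketch illustrates: adding a third activity requires no change to the ordering logic.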
Pipeline: OnPrem SSIS package

"availability": { "frequency": "Day", "interval": 6 }

Hourly slices: 12-6, 6-12, 12-6

AggregatesSalesActivity (e.g. Hive): Dataset2, Dataset3
• Hourly: 12-1, 1-2, 2-3
• Daily: Monday, Tuesday, Wednesday
• Daily: Monday, Tuesday, Wednesday
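
The slices on this slide are fixed-length time windows of a dataset. A minimal Python sketch of how such windows can be enumerated follows. It is not the Data Factory scheduler; `slices` is a hypothetical helper, the dates are made up for illustration, and the 6-hour interval is an assumption chosen to match the 12-6, 6-12, 12-6 windows shown.

```python
from datetime import datetime, timedelta


def slices(start, end, interval):
    """Yield (slice_start, slice_end) windows of fixed length covering [start, end)."""
    current = start
    while current < end:
        # The last window is clipped so it never runs past `end`.
        yield current, min(current + interval, end)
        current += interval


if __name__ == "__main__":
    # Hypothetical day, split into 6-hour slices.
    day_start = datetime(2014, 6, 1)
    day_end = day_start + timedelta(days=1)
    for s, e in slices(day_start, day_end, timedelta(hours=6)):
        print(s.strftime("%H:%M"), "-", e.strftime("%H:%M"))
```

Passing `timedelta(hours=1)` or `timedelta(days=1)` gives the hourly and daily windows listed for the other datasets.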
Hive Activity: inputs Sales From DW and other source; output Daily Sales
• Is my data successfully getting produced?
• Is it produced on time?
• Am I alerted quickly of failures?
• What about troubleshooting information?
• Are there any policy warnings or errors?
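
The first three questions can be answered from per-slice run records. A minimal Python sketch follows. It is not the Data Factory monitoring API; the record fields (`slice`, `status`, `due`, `finished`), the status values, and the `check_slices` helper are all assumptions made for illustration.

```python
from datetime import datetime


def check_slices(slice_runs):
    """Split slice run records into failed slices and slices produced late."""
    failed = [r["slice"] for r in slice_runs if r["status"] == "Failed"]
    late = [
        r["slice"]
        for r in slice_runs
        if r["status"] == "Ready" and r["finished"] > r["due"]
    ]
    return {"failed": failed, "late": late}


if __name__ == "__main__":
    # Hypothetical run records for three daily slices.
    runs = [
        {"slice": "Monday", "status": "Ready",
         "due": datetime(2014, 6, 3, 2), "finished": datetime(2014, 6, 3, 1)},
        {"slice": "Tuesday", "status": "Ready",
         "due": datetime(2014, 6, 4, 2), "finished": datetime(2014, 6, 4, 5)},
        {"slice": "Wednesday", "status": "Failed",
         "due": datetime(2014, 6, 5, 2), "finished": None},
    ]
    print(check_slices(runs))
```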
• Easily move data to my existing data marts for consumption by my existing BI
tools
• Azure DB
• SQL Server on premises
• Oracle
• Files
• Azure Blob content
Coordination:
• Rich scheduling
• Complex dependencies
• Incremental rerun
Authoring:
• JSON & PowerShell/C#
Management:
• Lineage
• Data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
• Activities: Hive, Pig, C#
• Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, Oracle
• Contact me: ChristianCote@IA-TechConsulting.com
www.microsoft.com/learning
http://microsoft.com/technet
http://channel9.msdn.com/Events/TechEd
http://developer.microsoft.com
