Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Microsoft Data Integration Pipelines: Azure Data Factory and SSIS

1,588 views

Published on

A look at Data Integration in Microsoft using Azure Data Factory and SSIS

Published in: Technology
  • Be the first to comment

Microsoft Data Integration Pipelines: Azure Data Factory and SSIS

  1. 1. Automate pipelines: Tools for building ETL pipelines Mark Kromer Microsoft Sr. Program Manager Information Management @kromerbigdata
  2. 2. ETL Pipeline Objectives • Consume hybrid disparate data (Extract) • On-prem + Cloud • Structured, un-structured, semi-structured • Calculate and format data for analytics (Transform) • Transform, aggregate, join, normalize • Address large-scale Big Data requirements (Scale / Load) • Scale-up or Scale-out data movement and transformation • Operationalize (Automate) • Create pipeline orchestrations for different org requirements • Manage & monitor multiple pipelines
  3. 3. Hybrid Data Integration Pattern 1: Analyze blog comments Azure SQL Database SQL Server (on-premises) Data Management Gateway Req’d for ADF Azure Data Factory (PaaS) Capture blog comments via API Drop into Blob Store Copy & lookup Power BI Dashboard Visualize and analyze SSIS (self-hosted) Transform via SPROC (ELT) Transform via Dataflow (ETL)
  4. 4. Hybrid Data Integration Pattern 2: Sentiment Analysis with Machine Learning Azure Data Factory Power BI Blob Storage Azure Functions
  5. 5. Hybrid Data Integration Pattern 3: Modern Data Warehouse Daily flat files OLTP DB Tables Analytical Schemas AML: Churn Model Customer Call Details Azure Data Factory (PaaS) SSIS (self-hosted) Social Media (un/semi structured)
  6. 6. SQL Server Integration Services (SSIS) SSIS is a platform for building enterprise-grade data integration solutions User-friendly code-free authoring / management client tools: SQL Server Data Tools (SSDT) SQL Server Management Studio (SSMS) Wealth of connectors + rich transformations to Extract, Transform, and Load (ETL) data between various sources and destinations, on premises and in the cloud Low Total Cost of Ownership (TCO)
  7. 7. Azure Feature Pack Provides Azure connectivity components for SSIS 1. Move / transfer hybrid data between various sources and destinations, on premises and in Azure 2. Develop ELT workflows with Big Data transformations / processing in Azure 3. Combine traditional ETL and modern ELT workflows spanning on-premises and Azure
  8. 8. SSIS Azure Feature Pack Features SSIS Azure Feature Pack contains: 1. Connection Managers 1. Azure Subscription Connection Manager 2. Azure Storage Connection Manager 3. ADLS Connection Manager (NEW) 2. Control Flows / Tasks 1. Azure Blob Upload / Download Tasks 2. Azure HDInsight Hive / Pig Tasks 3. Azure HDInsight Create / Delete Cluster Tasks 4. Azure SQL DW Upload Task (NEW) 3. Data Flows 1. Azure Blob Source / Destination 2. ADLS Source / Destination (NEW) 4. Azure Blob Enumerator
  9. 9. SQL DW + SSIS – Quick intro SQL DW is Microsoft’s scale-out database in the cloud Built on Massively Parallel Processing (MPP) architecture Capable of processing huge volumes of relational and non-relational data. It divides data and processing capability across multiple nodes Control Node receives, optimizes, and distributes requests to Compute Nodes that work in parallel. There are 2 ways to load data into SQL DW using SSIS: Front-loading through Control Node with data flows Back-loading through Compute Nodes with PolyBase
  10. 10. Azure SQL DW Upload Task – Typical scenario Azure Blob Storage SQL DW Cloud On-Premise SQL Server Flat File SSIS Machine 0. Export to a flat file 1. Upload to Azure Blob 2. Create an external table 3. Trigger PolyBase to load data Azure SQL DW Upload Task automates steps 1 – 2 – 3 below:
  11. 11. Azure SQL DW Upload Task On Azure SQL DW Upload Task Editor, you can 1. Name and describe a create / insert table task 2. Select and configure UTF-8-encoded text file(s) as your data source 3. Select and configure Azure Storage Connection Manager + new / existing blob container as your data staging area 4. Select and configure ADO.NET Connection Manager for SQL DW + new / existing table as your data destination 5. Map source and destination columns for the create / insert table task 6. Define metadata / data types for source columns
  12. 12. Azure SQL DW Upload Task Following configurations on Azure SQL DW Upload Task Editor, T-SQL script that triggers PolyBase to load data from your Azure Blob Storage into SQL DW will be automatically generated You can still manually edit this auto- generated T-SQL script to customize it for your particular needs
  13. 13. RAW DATA, DATA CHAOS REFINED, ORGANIZED DATA DATA CLARITY, BETTER DECISIONS DATA MOVEMENT DATA TRANSFORMATIONS BUSINESS INTELLIGENCE AZURE DATA FACTORY HYBRID DATA INTEGRATION AT SCALE Customer Profitability Sentiment Analysis Market Basket Analysis Machine Learning, Big Data Analytics, SQL, NoSQL, Data Warehouse , Data Lake
  14. 14. ADF: Orchestrate data services at scale with fully managed Data Integration cloud service PREPARE TRAN AN INGEST SQL <> SQL DATA SOURCES { } SQL • Create, schedule, orchestrate, and manage data pipelines • Visualize data lineage • Connect to on-premises and cloud data sources • Monitor data pipeline health • Automate cloud resource management • Move relational data for Hadoop processing • Transform with Hive, Pig, PySpark, SQL SPROC or custom code
  15. 15. Cloud Analytics – Common Challenges
  16. 16. ELT with Apache Spark Activity from ADF Pipeline Create new pipeline and HDI Cluster Linked Service for Spark from Azure Portal Invoke Python script from Spark activity to transform data at scale Schedule, monitor and manage pipeline from ADF Verify results and perform analytics from Jupyter notebooks / PBI
  17. 17. ELT with Azure Data Lake from ADF Pipeline Create new pipeline and Azure Data Lake Analytics Linked Service from Azure Portal Perform data transformations at scale with U-SQL script Schedule, monitor and manage pipeline from ADF Verify results and perform analytics from ADLA
  18. 18. Monitor & Manage Pipelines
  19. 19. Microsoft Data Integration Roadmap • SSIS • SQL Server 2017 • SSIS on Linux • Scale-out • ADF • SSIS as a Cloud “Integration Runtime” • Code-free web-based user experience • Control Flow orchestration + Data Flow steps • On-Demand Spark Cluster

×