3. Data Pipelines
Data Factory Data Pipelines provide cloud-scale data orchestration with a point-and-click UI for building analytics and ETL workflows.
• Built-in control flow and conditional execution constructs
• Shared platform service with “soft limits”, e.g. a maximum of 80 (previously 40) activities per pipeline
• Copy Activity provides massive-scale data movement inside data pipelines
• Use Copy for ELT patterns to land data quickly into the Lakehouse first (see the sketch below)
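As a rough illustration of that Copy-first ELT pattern, here is a minimal sketch of what a pipeline definition could look like, written as a Python dict mirroring the pipeline JSON; the activity names, the notebook activity type, and the sink type are assumptions, not an exported Fabric artifact.

```python
import json

# Minimal sketch: land raw data with a Copy activity first (ELT), then
# transform it downstream. Names and type strings are illustrative
# assumptions; check your own exported pipeline JSON.
pipeline = {
    "name": "pl_land_orders_to_lakehouse",   # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyOrdersToLakehouseFiles",
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "BinarySink"},  # land files quickly, transform later
                },
            },
            {
                "name": "TransformLandedData",
                "type": "TridentNotebook",   # assumed notebook activity type name
                "dependsOn": [
                    {"activity": "CopyOrdersToLakehouseFiles",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```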
4. Pipelines Tip
Parallelism vs. Sequential Execution
Multiple levels of parallelism are available to you as a pipeline developer in Data Factory (sketched below).
• Activities with no dependencies between them will all run in parallel
• Activities chained together with dependency conditions will run in sequence
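A minimal sketch of how the two arrangements differ in the underlying pipeline JSON (shown here as Python dicts): activities with empty dependsOn lists start in parallel, while adding a dependencyConditions entry chains them sequentially. Activity names are illustrative.

```python
import json

# Parallel: neither activity declares a dependency, so both start together.
parallel_activities = [
    {"name": "CopyTableA", "type": "Copy", "dependsOn": []},
    {"name": "CopyTableB", "type": "Copy", "dependsOn": []},  # runs alongside CopyTableA
]

# Sequential: CopyTableB waits for CopyTableA to succeed before starting.
sequential_activities = [
    {"name": "CopyTableA", "type": "Copy", "dependsOn": []},
    {"name": "CopyTableB", "type": "Copy",
     "dependsOn": [{"activity": "CopyTableA", "dependencyConditions": ["Succeeded"]}]},
]

print(json.dumps({"parallel": parallel_activities,
                  "sequential": sequential_activities}, indent=2))
```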
5. Pipelines Tips Continued
• Gain more throughput with multiple Copy activities, but be aware this can drain resources on the source (see the sketch after the link below)
• Use the Deactivate Activity option as a stand-in for “debug until” functionality
https://blog.fabric.microsoft.com/en-us/blog/data-pipeline-performance-improvements-part-2-creating-an-array-of-jsons
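One way to apply the first tip is to fan Copy activities out with a ForEach whose batchCount caps concurrency so the source is not overwhelmed. Below is a hedged sketch as a Python dict mirroring the activity JSON; the pipeline parameter name and the source/sink types are illustrative assumptions.

```python
import json

# Sketch of fanning out multiple Copy activities with a ForEach loop:
# isSequential=False plus batchCount controls how many copies run at once.
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList",  # hypothetical parameter
                  "type": "Expression"},
        "isSequential": False,
        "batchCount": 10,   # cap concurrency so the source isn't drained
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ],
    },
}

print(json.dumps(foreach_activity, indent=2))
```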
6. Microsoft Fabric Data Factory Pipeline
Copy activity:
• Warehouse as a destination uses the COPY INTO command
• Direct copy from Azure Blob Storage or ADLS Gen2
• Staging copy required for other sources (sketched at the end of this slide)
• Can also use the warehouse as a source
Supports pipeline activities:
• Lookup activity
• Get metadata activity
• Script activity
• Stored procedure activity
(Diagram: Copy activity loading data into a new Fabric (Trident) warehouse table.)
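A rough sketch of a staged copy into the Warehouse from a source that does not support direct copy, written as a Python dict mirroring the copy activity JSON; the sink type name, staging linked service, and path are assumptions to adapt to your workspace.

```python
import json

# Sketch: Warehouse sink (COPY INTO under the hood) fed from a source that
# isn't Blob Storage / ADLS Gen2, so staging is enabled. In the Fabric UI
# this surfaces as the "Enable staging" option on the Copy activity.
copy_to_warehouse = {
    "name": "CopySqlToWarehouse",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},   # not a direct-copy source
        "sink": {"type": "WarehouseSink"},      # assumed sink type name
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {"referenceName": "ls_staging_adls",  # hypothetical
                                  "type": "LinkedServiceReference"},
            "path": "staging/copy",             # hypothetical staging path
        },
    },
}

print(json.dumps(copy_to_warehouse, indent=2))
```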
7. Pipelines: Copy Activity
Throughput and Parallelism
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
ADF calls this setting Data Integration Units (DIUs); Fabric uses Intelligent Throughput Optimization (ITO).
Apply more compute with a higher DIU/ITO value; these values are upper-bound recommendations sent to the copy engine.
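A sketch of raising the compute applied to a single Copy activity. The dataIntegrationUnits and parallelCopies properties below follow the ADF copy activity schema; in Fabric the equivalent knob appears as Intelligent Throughput Optimization in the activity settings, so treat the property names as assumptions for your environment.

```python
import json

# Sketch: ask the copy engine for more compute and parallelism on one
# Copy activity. The values are caps the service may use, not guarantees.
copy_with_more_compute = {
    "name": "CopyLargeDataset",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "ParquetSource"},
        "sink": {"type": "ParquetSink"},
        "dataIntegrationUnits": 128,   # ADF-style DIU property; Fabric exposes ITO in the UI
        "parallelCopies": 16,          # optionally raise per-activity parallelism too
    },
}

print(json.dumps(copy_with_more_compute, indent=2))
```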
8. Pipelines: Copy Activity Binary Optimization
• Utilizing binary copy can improve load times to Lakehouse Files or storage accounts by 60–70%
• A new Fabric Copy optimization is being implemented for binary to improve data movement speeds by roughly 2x
• Tip: use wildcards, file lists, and folder paths inside the Copy source rather than iterating over many files with a ForEach at the pipeline level (see the sketch below)
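A sketch of the wildcard approach: the Copy source enumerates the files itself instead of a ForEach loop issuing one copy per file. The store-settings type and paths are illustrative assumptions.

```python
import json

# Sketch: push file enumeration into the Copy source with wildcards,
# avoiding a per-file ForEach loop at the pipeline level.
binary_copy_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",       # ADLS Gen2 read settings
        "recursive": True,
        "wildcardFolderPath": "landing/2024/*",  # hypothetical folder pattern
        "wildcardFileName": "*.parquet",
    },
}

print(json.dumps(binary_copy_source, indent=2))
```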
9. Pipelines: Copy with SQL Source Partitioning
• Be cognizant of the blocking and locking effects you may have on your SQL sources
• Set optimistic concurrency on SQL sources to improve read performance
• Expand resource throughput on the source database during ETL jobs
Dynamic Range: created 822 files, ranging from 22 MB to 200 MB
o Throughput: 155.415 MB/s
o Total duration: 00:37:47
o Source to staging: duration 00:34:55; optimized throughput: Balanced (100 DIU); used parallel copies: 251
o Staging to destination: duration 00:02:51; optimized throughput: Standard (4 DIU); used parallel copies: 1
No Partitioning: created 777 files, 200 MB each
o Throughput: 47.971 MB/s
o Total duration: 02:01:37
o Source to staging: duration 01:48:40; optimized throughput: Standard (4 DIU); used parallel copies: 1
o Staging to destination: duration 00:12:57; optimized throughput: Standard (4 DIU); used parallel copies: 1
Logical Partitioning: created 818 files, ranging from 80 MB to 180 MB
o Throughput: 9.5 MB/s average per copy (50 * 9.5 = 475 MB/s aggregate); min 1.246 MB/s
o Total duration: 00:15:32
o Source to staging: duration 00:12:00 (average); optimized throughput: Standard (4 DIU); used parallel copies: 1
o Staging to destination: duration 00:02:00 (average); optimized throughput: Standard (4 DIU); used parallel copies: 1
Improved data load performance by ~88% (total duration dropped from 02:01:37 with no partitioning to 00:15:32 with logical partitioning).
Source: Azure SQL DB General Purpose Serverless Standard-series (Gen5)
o Max vCores: 80
o Min vCores: 20
Table: dbo.orders
Record count: 1,625,000,000
Database used space: 220 GB
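For reference, a sketch of the SQL source settings that drive a dynamic-range run like the one above, written as a Python dict following the ADF/Fabric SQL source schema; the partition column and bounds are assumptions for a table shaped like dbo.orders.

```python
import json

# Sketch: DynamicRange partitioning lets the copy engine split reads on a
# numeric column and issue them in parallel against the SQL source.
partitioned_sql_source = {
    "type": "AzureSqlSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "order_id",    # hypothetical numeric key column
        "partitionLowerBound": "1",
        "partitionUpperBound": "1625000000",  # roughly the row count of dbo.orders
    },
}

print(json.dumps(partitioned_sql_source, indent=2))
```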
10. Copy Performance Metrics with Lakehouse
Parquet -> LH (binary): throughput 5.141 GB/s, total duration 3 min 31 sec
CSV -> LH (V-Order): throughput 1.284 GB/s, total duration 13 min 44 sec
SQL -> LH (V-Order): throughput 1.354 GB/s, total duration 1 min 41 sec
Pipeline developer goal for copy performance: maximize throughput