3. Data Pipelines
Data Factory Data Pipelines provide cloud-scale data orchestration with a point-and-click UI for building analytics and ETL workflows.
• Built-in control flow and conditional execution constructs
• Shared platform service with “soft limits”, e.g. a maximum of 80 (previously 40) activities per pipeline
• Copy Activity provides massive-scale data movement inside data pipelines
• Use Copy for ELT patterns to land data quickly into the Lakehouse first (see the sketch below)
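As a rough illustration of that Copy-first ELT pattern, here is a minimal sketch of what a pipeline definition could look like, written as a Python dict mirroring the pipeline JSON; the activity names, the notebook activity type, and the sink type are assumptions, not an exported Fabric artifact.

```python
import json

# Minimal sketch: land raw data with a Copy activity first (ELT), then
# transform it downstream. Names and type strings are illustrative
# assumptions; check your own exported pipeline JSON.
pipeline = {
    "name": "pl_land_orders_to_lakehouse",   # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyOrdersToLakehouseFiles",
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "BinarySink"},  # land files quickly, transform later
                },
            },
            {
                "name": "TransformLandedData",
                "type": "TridentNotebook",   # assumed notebook activity type name
                "dependsOn": [
                    {"activity": "CopyOrdersToLakehouseFiles",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```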
4. Pipelines Tip
Parallelism vs. Sequential Execution
Multiple levels of parallelism are available to you as a pipeline developer in Data Factory (sketched below).
• Activities with no dependencies between them will all run in parallel
• Activities chained together with dependency conditions will run in sequence
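A minimal sketch of how the two arrangements differ in the underlying pipeline JSON (shown here as Python dicts): activities with empty dependsOn lists start in parallel, while adding a dependencyConditions entry chains them sequentially. Activity names are illustrative.

```python
import json

# Parallel: neither activity declares a dependency, so both start together.
parallel_activities = [
    {"name": "CopyTableA", "type": "Copy", "dependsOn": []},
    {"name": "CopyTableB", "type": "Copy", "dependsOn": []},  # runs alongside CopyTableA
]

# Sequential: CopyTableB waits for CopyTableA to succeed before starting.
sequential_activities = [
    {"name": "CopyTableA", "type": "Copy", "dependsOn": []},
    {"name": "CopyTableB", "type": "Copy",
     "dependsOn": [{"activity": "CopyTableA", "dependencyConditions": ["Succeeded"]}]},
]

print(json.dumps({"parallel": parallel_activities,
                  "sequential": sequential_activities}, indent=2))
```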
5. Pipelines Tips Continued
• Gain more throughput with multiple Copy activities, but be aware this can drain resources on the source (see the sketch after the link below)
• Use the Deactivate Activity option as a stand-in for “debug until” functionality
https://blog.fabric.microsoft.com/en-us/blog/data-pipeline-performance-improvements-part-2-creating-an-array-of-jsons
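One way to apply the first tip is to fan Copy activities out with a ForEach whose batchCount caps concurrency so the source is not overwhelmed. Below is a hedged sketch as a Python dict mirroring the activity JSON; the pipeline parameter name and the source/sink types are illustrative assumptions.

```python
import json

# Sketch of fanning out multiple Copy activities with a ForEach loop:
# isSequential=False plus batchCount controls how many copies run at once.
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList",  # hypothetical parameter
                  "type": "Expression"},
        "isSequential": False,
        "batchCount": 10,   # cap concurrency so the source isn't drained
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ],
    },
}

print(json.dumps(foreach_activity, indent=2))
```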
6. Microsoft Fabric Data Factory Pipeline
Copy activity:
• Warehouse as a destination uses the COPY INTO command
• Direct copy from Azure Blob Storage or ADLS Gen2
• Staging copy required for other sources (sketched at the end of this slide)
• Can also use the warehouse as a source
Supports pipeline activities:
• Lookup activity
• Get metadata activity
• Script activity
• Stored procedure activity
(Diagram: Copy activity loading data into a new Fabric (Trident) warehouse table.)
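A rough sketch of a staged copy into the Warehouse from a source that does not support direct copy, written as a Python dict mirroring the copy activity JSON; the sink type name, staging linked service, and path are assumptions to adapt to your workspace.

```python
import json

# Sketch: Warehouse sink (COPY INTO under the hood) fed from a source that
# isn't Blob Storage / ADLS Gen2, so staging is enabled. In the Fabric UI
# this surfaces as the "Enable staging" option on the Copy activity.
copy_to_warehouse = {
    "name": "CopySqlToWarehouse",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},   # not a direct-copy source
        "sink": {"type": "WarehouseSink"},      # assumed sink type name
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {"referenceName": "ls_staging_adls",  # hypothetical
                                  "type": "LinkedServiceReference"},
            "path": "staging/copy",             # hypothetical staging path
        },
    },
}

print(json.dumps(copy_to_warehouse, indent=2))
```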
7. Pipelines: Copy Activity
Throughput and Parallelism
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
ADF calls this setting Data Integration Units (DIUs); Fabric uses Intelligent Throughput Optimization (ITO).
Apply more compute with a higher DIU/ITO value; these values are upper-bound recommendations sent to the copy engine.
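A sketch of raising the compute applied to a single Copy activity. The dataIntegrationUnits and parallelCopies properties below follow the ADF copy activity schema; in Fabric the equivalent knob appears as Intelligent Throughput Optimization in the activity settings, so treat the property names as assumptions for your environment.

```python
import json

# Sketch: ask the copy engine for more compute and parallelism on one
# Copy activity. The values are caps the service may use, not guarantees.
copy_with_more_compute = {
    "name": "CopyLargeDataset",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "ParquetSource"},
        "sink": {"type": "ParquetSink"},
        "dataIntegrationUnits": 128,   # ADF-style DIU property; Fabric exposes ITO in the UI
        "parallelCopies": 16,          # optionally raise per-activity parallelism too
    },
}

print(json.dumps(copy_with_more_compute, indent=2))
```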
8. Pipelines: Copy Activity Binary Optimization
• Utilizing binary copy can improve load times to Lakehouse Files or storage accounts by 60–70%
• A new Fabric Copy optimization is being implemented for binary to improve data movement speeds by roughly 2x
• Tip: use wildcards, file lists, and folder paths inside the Copy source rather than iterating over many files with a ForEach at the pipeline level (see the sketch below)
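A sketch of the wildcard approach: the Copy source enumerates the files itself instead of a ForEach loop issuing one copy per file. The store-settings type and paths are illustrative assumptions.

```python
import json

# Sketch: push file enumeration into the Copy source with wildcards,
# avoiding a per-file ForEach loop at the pipeline level.
binary_copy_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",       # ADLS Gen2 read settings
        "recursive": True,
        "wildcardFolderPath": "landing/2024/*",  # hypothetical folder pattern
        "wildcardFileName": "*.parquet",
    },
}

print(json.dumps(binary_copy_source, indent=2))
```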
9. Pipelines: Copy with SQL Source Partitioning
• Be cognizant of the blocking and locking effects you may have on your SQL sources
• Set optimistic concurrency on SQL sources to improve read performance
• Expand resource throughput on the source database during ETL jobs
Dynamic Range: created 822 files, ranging from 22 MB to 200 MB
o Throughput: 155.415 MB/s
o Total duration: 00:37:47
o Source to staging: duration 00:34:55; optimized throughput: Balanced (100 DIU); used parallel copies: 251
o Staging to destination: duration 00:02:51; optimized throughput: Standard (4 DIU); used parallel copies: 1
No Partitioning: created 777 files, 200 MB each
o Throughput: 47.971 MB/s
o Total duration: 02:01:37
o Source to staging: duration 01:48:40; optimized throughput: Standard (4 DIU); used parallel copies: 1
o Staging to destination: duration 00:12:57; optimized throughput: Standard (4 DIU); used parallel copies: 1
Logical Partitioning: created 818 files, ranging from 80 MB to 180 MB
o Throughput: 9.5 MB/s average per copy (50 * 9.5 = 475 MB/s aggregate); min 1.246 MB/s
o Total duration: 00:15:32
o Source to staging: duration 00:12:00 (average); optimized throughput: Standard (4 DIU); used parallel copies: 1
o Staging to destination: duration 00:02:00 (average); optimized throughput: Standard (4 DIU); used parallel copies: 1
Improved data load performance by ~88% (total duration dropped from 02:01:37 with no partitioning to 00:15:32 with logical partitioning).
Source: Azure SQL DB General Purpose Serverless Standard-series (Gen5)
o Max vCores: 80
o Min vCores: 20
Table: dbo.orders
Record count: 1,625,000,000
Database used space: 220 GB
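For reference, a sketch of the SQL source settings that drive a dynamic-range run like the one above, written as a Python dict following the ADF/Fabric SQL source schema; the partition column and bounds are assumptions for a table shaped like dbo.orders.

```python
import json

# Sketch: DynamicRange partitioning lets the copy engine split reads on a
# numeric column and issue them in parallel against the SQL source.
partitioned_sql_source = {
    "type": "AzureSqlSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "order_id",    # hypothetical numeric key column
        "partitionLowerBound": "1",
        "partitionUpperBound": "1625000000",  # roughly the row count of dbo.orders
    },
}

print(json.dumps(partitioned_sql_source, indent=2))
```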
10. Copy Performance Metrics with Lakehouse
Parquet -> LH (binary): throughput 5.141 GB/s, total duration 3 min 31 sec
CSV -> LH (V-Order): throughput 1.284 GB/s, total duration 13 min 44 sec
SQL -> LH (V-Order): throughput 1.354 GB/s, total duration 1 min 41 sec
Pipeline developer goal for copy performance: maximize throughput