Agenda
Why Azure Data Factory
Introduction
Steps involved in ADF
ADF Components
ADF Activities
Linked Services
Integration Runtime and its types
How Azure Data Factory works
Azure Data Factory V1 vs V2
System Variables
Functions in ADF
Expressions in ADF
Questions & Answers
Why Azure Data Factory
[Diagram: Azure Data Factory moving data between Azure Data Lake and Azure SQL DW.]
Modern DW for Business Intelligence
[Diagram: modern data warehouse pipeline. Sources: logs, files & media (unstructured) and business/custom apps (structured), from on-premises and cloud apps & data. Stages: Ingest (Data Factory) -> Store (Azure Storage) -> Prep & Train (Azure Databricks with Spark) -> Model & Serve (Azure SQL Data Warehouse, Azure Analysis Services) -> Intelligence (analytical dashboards in Power BI).]
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Modern DW for SaaS Apps
[Diagram: the same Ingest (Data Factory) -> Store (Azure Storage) -> Prep & Train (Azure Databricks with Spark) -> Model & Serve stages, here serving a SaaS app through app storage to browsers and devices.]
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Lift & Shift existing SSIS packages to Cloud
[Diagram: on-premises data sources and SQL Server connect through Data Factory to cloud data sources and an Azure SQL DB Managed Instance running inside a VNET.]
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Introduction
Azure Data Factory is a cloud-based integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
These data-driven workflows can be scheduled.
Sources and destinations can be either on-premises or in the cloud.
Transformation can be done using Azure HDInsight (Hadoop), Spark, Azure Data Lake Analytics, and Machine Learning.
How does it work?
The pipelines (data-driven workflows) in Azure Data Factory typically
perform the following four steps:
Connect and collect
The first step in building an information production system is to connect to all the
required sources of data and processing, such as software-as-a-service (SaaS)
services, databases, file shares, and FTP web services.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis.
For example, you can collect data in Azure Data Lake as well as in Azure Blob storage.
Transform and enrich
After data is present in a centralized data store in
the cloud, process or transform the collected data
by using compute services such as
HDInsight Hadoop
Spark
Data Lake Analytics
Machine Learning.
Publish
After the raw data has been refined into a business-ready, consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or any other store that fits your needs.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from
refined data, monitor the scheduled activities and pipelines for success and failure rates.
Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Log
Analytics, and health panels on the Azure portal.
ADF Components
[Diagram: component relationships. A pipeline (schedule, monitor) is a logical group of activities (Hive, Copy). An activity consumes and produces datasets (table, file). A dataset represents a data item stored in a linked service (SQL Server, Hadoop cluster). Activities run on linked services.]
An Azure subscription might have one or more Azure Data Factory instances (or data factories).
Azure Data Factory is composed of four key components.
These components work together to provide the platform on which you can compose data-driven workflows with
steps to move and transform data.
Pipeline
A data factory might have one or more pipelines.
A pipeline is a logical grouping of activities that performs a
unit of work.
Together, the activities in a pipeline perform a task.
The pipeline allows you to manage the activities as a set
instead of managing each one individually.
The activities in a pipeline can be chained together to operate
sequentially, or they can operate independently in parallel.
To create a Data Factory pipeline, we can use any one of the following methods:
Data Factory UI, Copy Data Tool, Azure PowerShell, REST API, Resource Manager template, .NET, or Python.
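For illustration, a minimal pipeline definition in JSON might look like the following sketch (the pipeline, activity, and dataset names are hypothetical):

{
    "name": "MyCopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [ { "referenceName": "BlobInputDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "SqlOutputDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}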
Pipeline Execution
Triggers
Triggers represent the unit of processing that
determines when a pipeline execution needs to
be kicked off.
There are different types of triggers for different
types of events.
Pipeline Runs
A pipeline run is an instance of the pipeline
execution.
Pipeline runs are typically instantiated by
passing the arguments to the parameters
that are defined in pipelines.
The arguments can be passed manually or
within the trigger definition.
Parameters
Parameters are key-value pairs of read-only
configuration.
Parameters are defined in the pipeline.
The arguments for the defined parameters are
passed during execution from the run context
that was created by a trigger or a pipeline that
was executed manually.
Activities within the pipeline consume the
parameter values.
Control Flow
Control flow is an orchestration of pipeline
activities that includes chaining activities in a
sequence, branching, defining parameters at
the pipeline level, and passing arguments
while invoking the pipeline on-demand or
from a trigger.
It also includes custom-state passing and
looping containers, that is, For-each
iterators.
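To tie these concepts together, here is a sketch of a schedule trigger that kicks off a pipeline run and passes an argument for a pipeline parameter (all names and values are hypothetical):

{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2019-01-01T00:00:00Z" }
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "MyCopyPipeline", "type": "PipelineReference" },
                "parameters": { "inputPath": "raw/sales" }
            }
        ]
    }
}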
Activity
Activities represent a processing step in a pipeline.
For example, you might use a copy activity to copy data from one data store to another data store.
Data Factory supports three types of activities:
1. Data movement activities
2. Data transformation activities
3. Control activities.
[Diagram: a Copy Activity lands data in Azure Blob storage; a Transformation Activity processes it; a second Copy Activity loads the output data into Azure SQL Data Warehouse, which feeds a BI tool.]
Data Movement Activities
Copy Activity in Data Factory copies data from a source data store to a sink data store.
Data from any source can be written to any sink.
Data Transformation Activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.
Data Transformation Activity    Compute Environment
Hive                            HDInsight (Hadoop)
Pig                             HDInsight (Hadoop)
MapReduce                       HDInsight (Hadoop)
Hadoop Streaming                HDInsight (Hadoop)
Spark                           HDInsight (Hadoop)
Stored Procedure                Azure SQL, Azure SQL DW, or SQL Server
Machine Learning activities     Azure VM
U-SQL                           Azure Data Lake Analytics
Custom Activity                 Azure Batch
Databricks Notebook             Azure Databricks
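As a sketch, an HDInsight Hive transformation activity could be defined like this inside a pipeline (the linked service names and script path are hypothetical):

{
    "name": "RunHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": { "referenceName": "HDInsightLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
        "scriptPath": "scripts/transform.hql",
        "scriptLinkedService": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" }
    }
}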
Control Activities
The following control flow activities are supported:
Execute Pipeline Activity
It allows a Data Factory pipeline to invoke another pipeline.
For Each Activity
It defines a repeating control flow in your
pipeline.
Web Activity
It can be used to call a custom
REST endpoint from a Data Factory
pipeline.
Lookup Activity
It can be used to read or look up a record/
table name/ value from any external
source.
Get Metadata Activity
It can be used to retrieve metadata
of any data in Azure Data Factory.
Until Activity
It implements a do-until loop that is similar to the do-until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true.
If Condition Activity
It can be used to branch based on a condition that evaluates to true or false (see the sketch after this list).
Wait Activity
When you use a Wait activity in a pipeline, the
pipeline waits for the specified period of time
before continuing with execution of
subsequent activities.
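For illustration, an If Condition activity might be sketched as follows, branching on a hypothetical pipeline parameter and running a simple Wait activity on each branch:

{
    "name": "CheckRowCount",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(pipeline().parameters.rowCount, 0)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            { "name": "WaitThenProcess", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 10 } }
        ],
        "ifFalseActivities": [
            { "name": "WaitAndSkip", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
        ]
    }
}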
Linked services
Linked services are much like connection strings, which define the connection information that's needed for Data
Factory to connect to external resources.
A linked service defines the connection to the data source.
For example, an Azure Storage-linked service specifies a connection string to connect to the Azure Storage
account.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes data stores located on-premises and in the cloud. E.g. Tables, Files,
Folders or Documents
To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive activity runs on an HDInsight Hadoop cluster.
[Diagram: linked services point to data stores (tables, files, ...) and to compute resources (HDInsight, Apache Spark, ...).]
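A minimal Azure Storage linked service definition might look like this sketch (the account name and key are placeholders):

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}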
Integration Runtime
Think of it as a bridge between two networks.
It is the compute infrastructure that provides the following capabilities across different network environments:
Data Movement
Copy data across data stores in a public network and data stores in a private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and scalable data transfer.
Activity Dispatch
These capabilities are used to dispatch and monitor transformation activities running on compute services such as Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
SSIS Package Execution
These capabilities are used when SSIS packages need to be executed in a managed Azure compute environment.
Integration runtime types
The three types and their capabilities are:

IR type        Public network                        Private network
Azure          Data movement, Activity dispatch      (not supported)
Self-hosted    Data movement, Activity dispatch      Data movement, Activity dispatch
Azure-SSIS     SSIS package execution                SSIS package execution
How Azure Data Factory Works
[Diagram: a pipeline chains activities; each activity consumes and produces datasets; integration runtimes connect the activities and datasets to data stores such as an on-premises SQL Server database.]
Pipeline Scope
These system variables can be referenced anywhere in the pipeline JSON.
@pipeline().DataFactory   Name of the data factory the pipeline run is running within
@pipeline().Pipeline      Name of the pipeline
@pipeline().RunId         ID of the specific pipeline run
@pipeline().TriggerType   Type of the trigger that invoked the pipeline (Manual, Scheduler)
@pipeline().TriggerId     ID of the trigger that invoked the pipeline
@pipeline().TriggerName   Name of the trigger that invoked the pipeline
@pipeline().TriggerTime   Time when the trigger invoked the pipeline. The trigger time is the actual fired time, not the scheduled time.
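These variables can be embedded in any expression; for example, a hypothetical output folder named after the current run:

"folderPath": "@concat('logs/', pipeline().RunId)"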
Schedule Trigger Scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "ScheduleTrigger".
@trigger().scheduledTime
Time when the trigger was scheduled to invoke the pipeline run.
For example, for a trigger that fires every 5 minutes, this variable returns 2017-06-01T22:20:00Z, 2017-06-01T22:25:00Z, and 2017-06-01T22:30:00Z respectively.
@trigger().startTime
Time when the trigger actually fired to invoke the pipeline run.
For example, for a trigger that fires every 5 minutes, this variable might return something like 2017-06-01T22:20:00.4061448Z, 2017-06-01T22:25:00.7958577Z, and 2017-06-01T22:30:00.9935483Z respectively.
Tumbling window Trigger Scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "TumblingWindowTrigger".
@trigger().outputs.windowStartTime
Start of the window when the trigger was scheduled to invoke the pipeline run. If the
tumbling window trigger has a frequency of "hourly" this would be the time at the
beginning of the hour.
@trigger().outputs.windowEndTime
End of the window when the trigger was scheduled to invoke the pipeline run. If the
tumbling window trigger has a frequency of "hourly" this would be the time at the
end of the hour.
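A common pattern, sketched below with hypothetical parameter names, is to pass the window boundaries from the trigger definition into the pipeline as parameters:

"parameters": {
    "windowStart": "@trigger().outputs.windowStartTime",
    "windowEnd": "@trigger().outputs.windowEndTime"
}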
String Functions
Function      Description                                                            Example
concat        Combines any number of strings together.                               concat('Hi ', 'team') : Hi team
substring     Returns a subset of characters from a string.                          substring('somevalue', 1, 3) : ome
replace       Replaces a string with a given string.                                 replace('Hi team', 'Hi', 'Hey') : Hey team
guid          Generates a globally unique string.                                    guid() : c2ecc88d-88c8-4096-912c-d6
toLower       Converts a string to lowercase.                                        toLower('Two') : two
toUpper       Converts a string to uppercase.                                        toUpper('Two') : TWO
indexof       Finds the index of a value within a string, case-insensitively.        indexof('Hi team', 'Hi') : 0
endswith      Checks if the string ends with a value, case-insensitively.            endswith('Hi team', 'team') : true
startswith    Checks if the string starts with a value, case-insensitively.          startswith('Hi team', 'team') : false
split         Splits the string using a separator.                                   split('Hi;team', ';') : ["Hi", "team"]
lastindexof   Finds the last index of a value within a string, case-insensitively.   lastindexof('foofoo', 'foo') : 3
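These functions compose inside expressions; for example, a hypothetical file name could be built like this:

"fileName": "@toLower(concat(pipeline().parameters.tableName, '.csv'))"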
Collection Functions
Function       Description                                                                                                    Example
contains       Returns true if a dictionary contains a key, a list contains a value, or a string contains a substring.        contains('abacaba', 'aca') : true
length         Returns the number of elements in an array or string.                                                          length('abc') : 3
empty          Returns true if the object, array, or string is empty.                                                         empty('') : true
intersection   Returns a single array or object with the common elements between the arrays or objects passed to it.          intersection([1, 2, 3], [101, 2, 1, 10], [6, 8, 1, 2]) : [1, 2]
union          Returns a single array or object with all of the elements that are in either array or object passed to it.     union([1, 2, 3], [101, 2, 1, 10]) : [1, 2, 3, 101, 10]
first          Returns the first element in the array or string passed in.                                                    first([0, 2, 3]) : 0
last           Returns the last element in the array or string passed in.                                                     last('0123') : 3
take           Returns the first Count elements from the array or string passed in.                                           take([1, 2, 3, 4], 2) : [1, 2]
skip           Returns the elements in the array starting at index Count.                                                     skip([1, 2, 3, 4], 2) : [3, 4]
Conversion Functions
Function      Description                                                                                         Example
int           Converts the parameter to an integer.                                                               int('100') : 100
string        Converts the parameter to a string.                                                                 string(10) : '10'
json          Converts the parameter to a JSON type value.                                                        json('[1,2,3]') : [1,2,3]; json('{"bar" : "baz"}') : { "bar" : "baz" }
float         Converts the parameter argument to a floating-point number.                                         float('10.333') : 10.333
bool          Converts the parameter to a Boolean.                                                                bool(0) : false
coalesce      Returns the first non-null object in the arguments passed in. Note: an empty string is not null.    coalesce(pipeline().parameters.parameter1, pipeline().parameters.parameter2, 'fallback') : fallback (when both parameters are null)
array         Converts the parameter to an array.                                                                 array('abc') : ["abc"]
createArray   Creates an array from the parameters.                                                               createArray('a', 'c') : ["a", "c"]
Math Functions
Function   Description                                                                                           Example
add        Returns the result of the addition of the two numbers.                                                add(10, 10.333) : 20.333
sub        Returns the result of the subtraction of the two numbers.                                             sub(10, 10.333) : -0.333
mul        Returns the result of the multiplication of the two numbers.                                          mul(10, 10.333) : 103.33
div        Returns the result of the division of the two numbers.                                                div(10.333, 10) : 1.0333
mod        Returns the remainder after the division of the two numbers (modulo).                                 mod(10, 4) : 2
min        Returns the minimum value; accepts either an array of numbers or a comma-separated list of numbers.   min([0, 1, 2]) : 0; min(0, 1, 2) : 0
max        Returns the maximum value; accepts either an array of numbers or a comma-separated list of numbers.   max([0, 1, 2]) : 2; max(0, 1, 2) : 2
range      Generates an array of integers starting from a certain number with the given length.                  range(3, 4) : [3, 4, 5, 6]
rand       Generates a random integer within the specified range.                                                rand(-1000, 1000) : 42
Date Functions
Function         Description                                                                                                              Example
utcnow           Returns the current timestamp as a string.                                                                               utcnow() : 2019-02-21T13:27:36Z
addseconds       Adds an integer number of seconds to a string timestamp passed in. The number of seconds can be positive or negative.    addseconds('2015-03-15T13:27:36Z', -36) : 2015-03-15T13:27:00Z
addminutes       Adds an integer number of minutes to a string timestamp passed in. The number of minutes can be positive or negative.    addminutes('2015-03-15T13:27:36Z', 33) : 2015-03-15T14:00:36Z
addhours         Adds an integer number of hours to a string timestamp passed in. The number of hours can be positive or negative.        addhours('2015-03-15T13:27:36Z', 12) : 2015-03-16T01:27:36Z
adddays          Adds an integer number of days to a string timestamp passed in. The number of days can be positive or negative.          adddays('2015-03-15T13:27:36Z', -20) : 2015-02-23T13:27:36Z
formatDateTime   Returns the timestamp as a string in the specified date format.                                                          formatDateTime('2015-03-15T13:27:36Z', 'o') : 2015-03-15T13:27:36.0000000Z
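A common use, sketched here with a hypothetical folder layout, is partitioning output paths by run date:

"folderPath": "@concat('output/', formatDateTime(utcnow(), 'yyyy/MM/dd'))"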
Expressions in Azure Data Factory
JSON values in the definition can be literal or expressions that are evaluated at runtime.
E.g., "name": "value" or "name": "@pipeline().parameters.password"
Expressions can appear anywhere in a JSON string value and always result in another JSON value.
If a JSON value is an expression, the body of the expression is extracted by removing the at-sign (@).
JSON value        Result
"parameters"      The characters 'parameters' are returned.
"parameters[1]"   The characters 'parameters[1]' are returned.
"@@"              A 1-character string that contains '@' is returned.
" @"              A 2-character string that contains ' @' is returned.
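For instance, to produce a literal string that starts with @ rather than an evaluated expression, the at-sign is escaped by doubling it:

"name": "@@pipeline().parameters.password"

This returns the literal characters "@pipeline().parameters.password" instead of evaluating the expression.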
A dataset with a parameter
Suppose the BlobDataset takes a parameter named path.
Its value is used to set a value for the folderPath property by using the following expression:
"folderPath": "@dataset().path"
A pipeline with a parameter
In the following example, the pipeline takes inputPath and outputPath parameters.
The path for the parameterized blob dataset is set by using values of these parameters.
The syntax used here is:
"path": "@pipeline().parameters.inputPath"