2. About me
Eugene Polonychko, PASS SQL Server User Group chapter
Over six years of software development experience, mostly focused on data. Has designed and implemented data warehouses using custom coding as well as ETL tools. Experienced in front-end application development, BI reporting, and database administration. Has worked with MS SQL Server, MySQL, and other databases. Strong experience in data modelling, data migration, and performance troubleshooting and tuning.
Social networks:
https://www.linkedin.com/in/eugenepolonichko/
https://msolapblog.wordpress.com/
3. What do we talk about?
• What is Azure Data Factory?
• Concepts
• Dataset
• Pipeline
• Linked Services
• Actions and monitoring
4. What is Azure Data Factory?
Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. With the Data Factory service you can create data integration solutions that ingest data from various data stores, transform/process the data, and publish the results back to data stores.
6. Concepts
Pipeline: a grouping of logically related activities. It is used to group activities into a unit that performs a task.
Activity: activities define the actions to perform on your data. Each activity takes zero or more datasets as inputs and produces one or more datasets as output.
(Diagram labels: Data Source, Dataset, Linked services, computing environment.)
9. Linked services
Linked services define the information needed for Data Factory to connect to external
resources (Examples: Azure Storage, on-premises SQL Server, Azure HDInsight). Linked
services are used for two purposes in Data Factory:
◦ To represent a data store including, but not limited to, an on-premises SQL Server, Oracle
database, file share, or an Azure Blob Storage account. See the Data movement activities section
for a list of supported data stores.
◦ To represent a compute resource that can host the execution of an activity. For example, the
HDInsightHive activity runs on an HDInsight Hadoop cluster. See Data transformation activities
section for a list of supported compute environments.
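As a minimal sketch, a linked service is defined in JSON. The example below shows an Azure Storage linked service in the classic Data Factory JSON format; the name AzureStorageLinkedService and the account placeholders are illustrative, not from the slides.

```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
```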
10. Dataset
Datasets represent data structures within the data stores. For example, an Azure Storage linked service provides connection information for Data Factory to connect to an Azure Storage account, while an Azure Blob dataset specifies the blob container and folder in Azure Blob Storage from which the pipeline should read the data. Similarly, an Azure SQL linked service provides connection information for an Azure SQL database, and an Azure SQL dataset specifies the table that contains the data.
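To make this concrete, here is a sketch of an Azure Blob dataset in the classic Data Factory JSON format. The names AzureBlobInput and AzureStorageLinkedService and the folder path are hypothetical placeholders.

```json
{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/input/",
            "format": { "type": "TextFormat" }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```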
11. Pipeline
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities, used to group activities into a unit that together performs a task.
Activities define the actions to perform on your data. For example, you may use a Copy activity to copy data from one data store to another. Similarly, you may use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports two types of activities: data movement activities and data transformation activities.
{
    "name": "PipelineName",
    "properties": {
        "description": "pipeline description",
        "activities": [
        ],
        "start": "<start date-time>",
        "end": "<end date-time>"
    }
}

{
    "name": "ActivityName",
    "description": "description",
    "type": "<ActivityType>",
    "inputs": [],
    "outputs": [],
    "linkedServiceName": "MyLinkedService",
    "typeProperties": {
    },
    "policy": {
    },
    "scheduler": {
    }
}
12. Activity
Data movement: imports data from one data store to another (Copy activity, Copy Wizard).
Data transformation: analysis and transformation using Machine Learning, Hadoop, Hive, etc.
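For example, a data movement activity is a Copy activity inside a pipeline's activities array. The sketch below uses the classic Data Factory JSON format; the dataset names AzureBlobInput and AzureSqlOutput are hypothetical placeholders.

```json
{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [ { "name": "AzureBlobInput" } ],
    "outputs": [ { "name": "AzureSqlOutput" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    },
    "scheduler": { "frequency": "Day", "interval": 1 }
}
```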
14. Import Data
Category    Data store                    Supported as a source   Supported as a sink
Azure       Azure Blob storage            ✓                       ✓
            Azure Data Lake Store         ✓                       ✓
            Azure SQL Database            ✓                       ✓
            Azure SQL Data Warehouse      ✓                       ✓
            Azure Table storage           ✓                       ✓
            Azure DocumentDB              ✓                       ✓
            Azure Search Index                                    ✓
Databases   SQL Server*                   ✓                       ✓
            Oracle*                       ✓                       ✓
            MySQL*                        ✓
            DB2*                          ✓
            Teradata*                     ✓
            PostgreSQL*                   ✓
            Sybase*                       ✓
            Cassandra*                    ✓
            MongoDB*                      ✓
            Amazon Redshift               ✓
File        File System*                  ✓                       ✓
            HDFS*                         ✓
            Amazon S3                     ✓
            FTP                           ✓
Others      Salesforce                    ✓
            Generic ODBC*                 ✓
            Generic OData                 ✓
            Web Table (table from HTML)   ✓
            GE Historian*                 ✓
15. Data transformation
Data transformation activity                                        Compute environment
Hive                                                                HDInsight [Hadoop]
Pig                                                                 HDInsight [Hadoop]
MapReduce                                                           HDInsight [Hadoop]
Hadoop Streaming                                                    HDInsight [Hadoop]
Machine Learning activities: Batch Execution and Update Resource    Azure VM
Stored Procedure                                                    Azure SQL, Azure SQL Data Warehouse, or SQL Server
Data Lake Analytics U-SQL                                           Azure Data Lake Analytics
DotNet                                                              HDInsight [Hadoop] or Azure Batch
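As a sketch of how a transformation activity looks, here is an HDInsight Hive activity in the classic Data Factory JSON format. The linked service, dataset names, and script path are hypothetical placeholders.

```json
{
    "name": "RunHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "HDInsightLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/transformdata.hql",
        "scriptLinkedService": "AzureStorageLinkedService"
    },
    "inputs": [ { "name": "AzureBlobInput" } ],
    "outputs": [ { "name": "AzureBlobOutput" } ],
    "scheduler": { "frequency": "Day", "interval": 1 }
}
```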
19. Price
                                                              Low frequency               High frequency
Activities running in the cloud                               $0.60 per activity/month    $1.00 per activity/month
Activities running on-premises, involving
Data Management Gateway                                       $1.50 per activity/month    $2.50 per activity/month
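The per-activity rates above can be turned into a quick monthly estimate. The helper below is a hypothetical sketch; the rates are copied from the pricing table and may change.

```python
# Estimate the monthly Azure Data Factory activity charge from the
# per-activity rates in the pricing table above (rates may change).
RATES = {
    ("cloud", "low"): 0.60,
    ("cloud", "high"): 1.00,
    ("on-premises", "low"): 1.50,
    ("on-premises", "high"): 2.50,
}

def monthly_cost(activities: int, location: str, frequency: str) -> float:
    """Estimated monthly cost in USD for `activities` activities
    running in `location` ('cloud' or 'on-premises') at the given
    `frequency` ('low' or 'high')."""
    return activities * RATES[(location, frequency)]

# Example: 10 low-frequency cloud activities plus 2 high-frequency
# on-premises activities (via Data Management Gateway).
total = monthly_cost(10, "cloud", "low") + monthly_cost(2, "on-premises", "high")
print(f"${total:.2f} per month")  # $11.00 per month
```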
The Data Factory service allows you to create data pipelines that move and transform data, and then run the pipelines on a specified schedule (hourly, daily, weekly, etc.). It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and lets you monitor all pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.
So we have four concepts: pipelines, activities, datasets, and linked services.