2. Agenda
1. What is a Data Pipeline?
2. Components of a Data Pipeline
3. Traditional Data Flows and Issues
4. Introduction to Apache Airflow
5. Features
6. Core Components
7. Key Concepts
8. Demo
3. What is a Data Pipeline?
A data pipeline is a set of data processing elements connected in series, where the
output of one element is the input of the next one. The elements of the pipeline are
often executed in parallel or in a time-sliced fashion. The name ‘pipeline’ comes from
a rough analogy with physical plumbing.
● Modern data pipelines are used to ingest and process vast volumes of data in
real time.
● Real-time processing of data, as opposed to traditional ETL / batch modes.
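The “output of one element is the input of the next” idea can be sketched in plain Python with generators (a toy illustration only, not Airflow code; the stage names and sample records are invented):

```python
# Toy pipeline: each stage lazily consumes the previous stage's output.
def ingest():
    # Stand-in for a real source (files, a queue, an API).
    yield from [" 10", "7 ", "oops", "3"]

def clean(records):
    # Filtering stage: drop records that are not integers.
    for rec in records:
        rec = rec.strip()
        if rec.isdigit():
            yield int(rec)

def process(numbers):
    # Processing stage: square each value.
    for n in numbers:
        yield n * n

# Stages connected in series, exactly like pipes.
result = list(process(clean(ingest())))
print(result)  # [100, 49, 9]
```

Because each stage is a generator, records flow through one at a time, which is the same lazy, element-by-element behavior a streaming pipeline relies on.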
4. Common Components of a Data Pipeline
Typical parts of a data pipeline
● Data Ingestion
● Filtering
● Processing
● Querying of the data
● Data warehousing
● Reprocessing capabilities
Typical Requirements
● Scalability
○ Billions of messages and terabytes of data, 24/7
● Availability and redundancy
○ Across physical locations
● Latency
○ Real time / Batch
● Platform support
5. Traditional data flow model
[Diagram: web clients, apps, a public REST API, microservices, and a billing system write to an OLTP DB; reporting and analytics pull from separate report and metrics DBs, often via ad-hoc scripts such as:
$ curl api.example.com | filter.py | psql ]
6. Messy data flow model (6 / 12 months later)
[Diagram: the same systems (web clients, apps, public REST API, microservices, billing system, OLTP DB, report DB, metrics DB, reporting, analytics) now joined by an external cloud, a document store, and a DWH, with ad-hoc point-to-point flows between them.]
7. Apache Airflow Introduction
● Apache Airflow is a way to programmatically author, schedule, and monitor
workflows.
● Developed in Python and open source.
● Workflows are configured as Python code.
● Because pipelines are plain Python, they can be enriched with Python’s built-in
libraries.
● Has multiple hooks and operators for handling big-data ecosystem components
(Hive, Sqoop, etc.) and DB hooks for RDBMSs and NoSQL databases.
8. Features
● Cron replacement
● Fault tolerant
● Dependency rules
● Beautiful UI
● Handles task failures
● Pipelines as Python code
● Reports / alerts on failures
● Monitor your pipelines from the WebUI
9. Core Components
● Webserver - the Apache Airflow WebUI.
● Scheduler - responsible for scheduling your jobs.
● Executor - bound to the scheduler; determines the worker process that
executes each scheduled task (SequentialExecutor, LocalExecutor, CeleryExecutor).
● Worker - process that executes the task, as determined by the executor.
● Metadatabase - database where all the metadata related to your jobs is stored.
10. Key Concepts
● DAG - Directed Acyclic Graph; the graphical representation of your data
pipeline.
● Operator - describes a single task in your data pipeline.
● Task - an instance of an operator.
● Workflow - DAG + Operators + Tasks
11. Overview
● What is a DAG?
● What is an Operator?
● Operator relationships and bitshift composition
● How does the scheduler work?
● What is a Workflow?
12. DAG (Directed Acyclic Graph)
A simple DAG where we could imagine that:
Task 1 - downloading the data.
Task 2 - sending the data for processing.
Task 3 - monitoring the data processing.
Task 4 - generating the report.
Task 5 - sending the email to the DAG owner or intended recipients.
Task 1 → Task 2 → Task 3 → Task 4 → Task 5
14. Operators
While a DAG describes how to run a workflow, an Operator defines what actually gets
done.
● An Operator describes a single task in a workflow.
● Operators should be idempotent (they should produce the same result
irrespective of how many times they are executed).
● They retry automatically in case of failure.
15. Different Operators
● Bash Operator
○ Executes a bash command
● Python Operator
○ Calls an Arbitrary python function
● Email Operator
○ Sends an Email
● Mysql Operator, SQLite Operator, Postgres Operator.
○ Executes the SQL commands
● <Custom Operators> Inheriting from the BaseOperator
16. Types of Operators
There are 3 types of operators:
● Action operators
○ Perform an action (BashOperator, PythonOperator, EmailOperator)
● Transfer operators
○ Move data from one system to another (PrestoToMySql operator, SFTP operator)
● Sensor operators
○ Wait for data to arrive at a defined location.
17. Important Properties
● DAG’s are defined in Python files placed into Airflows DAG_FOLDER
● dag_id serves as a unique identifier for your DAG.
● description the description of your DAG.
● start_date - tell when your DAG should start.
● schedule_interval - define how often your DAG runs.
● depend_on_past - run the next DAGRun if the previous one is completed
successfully.
● default_args - a dictionary of variables to be used as constructor keyword
parameter when initializing operators