Airflow 101
Saar Bergerbest
Data Engineer
What is Airflow?
● Airflow is a platform to programmatically author, schedule and monitor workflows.
● Started as an Airbnb project in 2014.
● Joined the Apache Software Foundation’s incubation program in March 2016.
● Used by many well-known companies.
Why Airflow?
● Scheduling - easily schedule data pipelines.
● Dependencies - conveniently define sequences of tasks.
● Triggering - target task instances in specific states (e.g. failed or success).
● Downtime recovery - if Airflow is restarted, it can fill the gaps.
● UI - the Airflow UI makes it easy to monitor and troubleshoot your data pipelines.
● Error handling (retries) and logging.
Basic Components
● Task/Operator: an individual unit of work to be done.
● Dependencies: the relations between the different tasks/operators (for
example, task A needs to run after task B).
● DAG:
○ Directed Acyclic Graph -
https://en.wikipedia.org/wiki/Directed_acyclic_graph
○ A collection of all the tasks you want to run, organized in a way that reflects
their relationships and dependencies.
○ A DAG can have sub-DAGs.
○ A DAG can invoke other DAGs.
DAG
● Describes a single workflow.
● Scheduled with a cron expression.
● Determines the dependencies between tasks.
A task can depend on one or more other tasks before it triggers.
● A task can be configured to execute after all its dependencies have succeeded (the
default), once one has succeeded, once one has failed, etc.
● If a DAG fails at task X, we can restart that task and its dependents.
● If a sub-DAG fails, we can restart all the tasks of that sub-DAG.
Operators: Python + Branch + Bash
1. PythonOperator - executes a Python callable.
2. BranchPythonOperator - allows a workflow to "branch", i.e. follow a
single path, after the execution of this task.
It derives from PythonOperator and expects a Python callable that returns
the task_id to follow.
3. BashOperator - executes a Bash script, command or set of commands.
Jinja Template
● Jinja2 is a modern and designer-friendly templating language for Python.
● Airflow leverages the power of Jinja templating and provides the pipeline author with a
set of built-in parameters and macros.
● Provides concise and elegant syntax.
Jinja documentation:
http://jinja.pocoo.org/docs/dev/
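A small sketch of the templating itself, using Jinja2 directly; `ds` is one of Airflow's built-in parameters (the execution date), and the table name is made up:

```python
from jinja2 import Template

# render the kind of template Airflow substitutes into operator fields
# such as bash_command
template = Template("processing date {{ ds }}, partition {{ params.table }}")
rendered = template.render(ds="2021-01-01", params={"table": "events"})
print(rendered)  # processing date 2021-01-01, partition events
```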
XCom (cross-communication via the db)
● Airflow tasks can run on several workers; XCom lets tasks exchange messages.
● An XCom is defined by a key, a value and a timestamp, but also tracks attributes like the
task/DAG that created it and when it should become visible.
● Any object that can be pickled* can be used as an XCom value, so users should make sure to
use objects of appropriate size.
● Tasks can push XComs at any time by calling the xcom_push() method. In addition, if a task
returns a value, an XCom containing that value is automatically pushed.
● Tasks call xcom_pull() to retrieve XComs.
*pickle:
https://docs.python.org/2/library/pickle.html
Connections
● Information such as hostnames, ports, logins and passwords for other systems and services
is handled in the ‘Connections’ section of the UI.
● The pipeline code you author references the ‘conn_id’ of the Connection objects.
● The information is saved in the db that Airflow manages; there is an option to encrypt
passwords.
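Connections are usually created in the UI, but can also be scripted; a hypothetical example using the Airflow 2.x CLI (the conn id "my_api" and all credentials are made up):

```shell
airflow connections add my_api \
    --conn-type http \
    --conn-host https://api.example.com \
    --conn-login user \
    --conn-password secret
```

Pipeline code then refers only to conn_id="my_api"; the credentials stay in Airflow's db.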
Variables
● Key-value storage within Airflow.
● Variables can be listed, created, updated and deleted from the UI, code or CLI.
Operator/Sensor: SimpleHttp + HttpSensor
An Operator performs a task; a Sensor monitors a status.
1. SimpleHttpOperator - calls an endpoint on an HTTP system to execute an action.
2. HttpSensor - executes an HTTP GET request and returns False on failure:
404 not found, or the response_check function returned False.
Operators: SubDag
● The sub-DAG id must be ‘parent.child’.
● Used to pack workflows that are used multiple times (modules).
● Gives the ability to retry a whole logical unit.
More Example Operators
HiveOperator, PrestoToMysqlOperator, S3FileTransformOperator,
SlackOperator, DockerOperator, EmailOperator
Scheduler
● Monitors DAGs according to their cron configuration.
● Monitors tasks within DAGs (triggers the task instances whose dependencies have
been met).
● The scheduler starts an instance of the executor that executes the tasks - either the one
specified in your airflow.cfg file (the default executor) or the one defined for the task in the code.
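Starting the scheduler is a single CLI command (sketch; assumes Airflow is installed and airflow.cfg is configured):

```shell
# uses the executor configured in airflow.cfg
airflow scheduler
```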
Executors
1. SequentialExecutor (default, with SQLite):
● Runs one task instance at a time.
2. LocalExecutor:
● Runs tasks in parallel.
● A configured number of LocalWorkers execute the tasks.
3. CeleryExecutor:
● Parallel and distributed.
● Uses a Celery backend - a task-execution queue (RabbitMQ, Redis).
● The scheduler inserts the relevant tasks into the Celery queue; Celery workers execute
them.
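A hypothetical airflow.cfg fragment selecting the CeleryExecutor with a Redis broker (host, port and credentials are made up):

```ini
[core]
executor = CeleryExecutor

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```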
UI
