Apache Airflow
WORKFLOW MANAGEMENT PLATFORM
Nikolai Grishchenkov
CC BY-NC-SA 4.0
Agenda
● Workflows
● Airflow
− Principles
− Architecture
− Concepts
− UI Demo
Introducing Apache Airflow
● Apache Airflow
− Open source workflow management platform.
− Apache Software Foundation project.
− Initially developed by Airbnb.
Airflow Core Ideas
● Core ideas
− Workflow as a Directed Acyclic Graph
(DAG).
− DAG is defined programmatically
(“Configuration as code”).
Directed Acyclic Graph
● DAG - Directed graph that doesn’t
have any cycles
● Workflow - A collection of tasks with
their dependencies
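The idea can be illustrated without Airflow itself: a workflow is just a mapping from each task to its upstream dependencies, and any valid run order is a topological sort of that graph. A stdlib-only sketch with a hypothetical extract/transform/load pipeline (Airflow expresses the same idea with operator instances and the `>>` dependency syntax):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A toy workflow: each task maps to the set of tasks it depends on.
# (Hypothetical task names, not an Airflow API.)
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid execution order always runs dependencies first.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Because the graph is acyclic, such an order always exists; a cycle would make the workflow unschedulable.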
Airflow Features
● Core features
− Rich Web UI & Powerful CLI
− Integration with Hadoop/Hive, S3, SQL Databases, Druid, Google Cloud, etc. (30+
operators).
− Dynamic pipeline generation (tasks are instantiated dynamically)
− Jinja Templating
− Plugins support
● Workflow features
− Complex dependencies support
− Automatic retries & Email alerts & SLAs
− Comprehensive logging
− Backfilling option
Airflow Features (cont.)
● Resource management features
− Queues & Resource Pools
− Distributed execution (Scaling)
● Administration features
− Easy installation
− Security features: Web Authentication/LDAP/Kerberos/OAuth
● Misc. features
− Friendly for non-programmers
− Growing community
− Apache license / ASF project
Dynamic pipeline gen.
● Reduces “copy-paste” and
allows configuration with
Airflow Variables
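The pattern is ordinary Python: read a list from configuration and create one task per entry in a loop, instead of copy-pasting near-identical blocks. A stdlib-only sketch of the idea (in real Airflow the JSON would come from `Variable.get()` and the loop would instantiate operators; the table names are hypothetical):

```python
import json

# Hypothetical configuration, as an Airflow Variable would hold it.
config = json.loads('["users", "orders", "payments"]')

# One "task" per table, generated in a loop rather than copy-pasted.
tasks = [f"export_{table}_to_s3" for table in config]
print(tasks)
```

Adding a new table to the pipeline then means editing the Variable, not the DAG code.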
Jinja Templating
● Allows task parametrization with a
set of built-in parameters and macros
such as the execution date.
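For instance, a templated command like `bash_command="run_etl.sh {{ ds }}"` is rendered at run time with the execution date. A minimal stand-in for that substitution (Airflow uses real Jinja; this sketch only mimics the `{{ ds }}` macro):

```python
from datetime import date

def render(template: str, context: dict) -> str:
    # Naive stand-in for Jinja rendering of {{ key }} placeholders.
    for key, value in context.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template

cmd = render("run_etl.sh {{ ds }}", {"ds": date(2017, 3, 1).isoformat()})
print(cmd)  # run_etl.sh 2017-03-01
```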
Airflow Architecture
● Web Application & CLI
● Metadata Repository
● Scheduler Process
● Array of workers
● Jobs Definition in Python
● ETL Framework & Plugins
Airflow Operators
● Operators are task factories.
● Operator types:
− Operators that perform an action
− Operators that move data
− Sensors
Operator Groups
● Total: > 100 operators (including
contrib)
● Perform action
− BashOperator, PythonOperator,
DockerOperator
− SparkSQLOperator,
SparkSubmitOperator
− HiveOperator, PostgresOperator,
MySqlOperator, BigQueryOperator
− EmailOperator, SlackOperator
● Sensors
− HdfsSensor, HivePartitionSensor
− SqlSensor
− TimeSensor, ExternalTaskSensor
● Move data
− S3ToHiveTransfer
− MySqlToHiveTransfer
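The “task factory” idea behind all of these: each operator class packages one kind of work behind a common `execute()` interface, and a DAG is built by instantiating them. A stdlib sketch of the shape (hypothetical class names, not the Airflow base classes):

```python
import subprocess

class Operator:
    """Common interface: every operator knows how to execute itself."""
    def __init__(self, task_id: str):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError

class BashLikeOperator(Operator):
    """Action operator: runs a shell command (cf. BashOperator)."""
    def __init__(self, task_id: str, bash_command: str):
        super().__init__(task_id)
        self.bash_command = bash_command

    def execute(self):
        return subprocess.run(
            self.bash_command, shell=True,
            capture_output=True, text=True, check=True,
        ).stdout.strip()

task = BashLikeOperator("say_hello", "echo hello")
print(task.execute())  # hello
```

A transfer operator would follow the same shape, with `execute()` reading from one system and writing to another.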
Airflow Sensors
● Sensors are a special type of operator that keeps running until a given
criterion is met.
− They wait for the appearance of:
● Time
● Another DAG run
● Database Record
● Hive Partition
● File
● REST Query Result
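Under the hood a sensor is an operator whose `execute()` re-checks a condition (the “poke”) at an interval until it is true or a timeout expires. A stdlib sketch of that loop (hypothetical class, not the Airflow sensor base class):

```python
import time

class SketchSensor:
    """Keeps poking until the criterion is met or the timeout expires."""
    def __init__(self, criterion, poke_interval=0.01, timeout=1.0):
        self.criterion = criterion
        self.poke_interval = poke_interval
        self.timeout = timeout

    def execute(self) -> bool:
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            if self.criterion():   # the "poke"
                return True
            time.sleep(self.poke_interval)
        raise TimeoutError("sensor timed out")

# A criterion that only becomes true on the third poke.
pokes = iter([False, False, True])
sensor = SketchSensor(lambda: next(pokes))
ok = sensor.execute()
print(ok)  # True
```

Swapping the criterion for a file check, an SQL query, or an HTTP call yields each of the sensor types listed above.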
Airflow Scheduling
● The scheduler runs a job one
schedule_interval AFTER the start
date, at the END of the period.
● Jinja: {{ ds }} - execution date as
YYYY-MM-DD
● Backfill: run DAG for any interval that
has not been run or cleared
[Timeline diagram: daily periods 2017-03-01 … 2017-03-05; the DagRun for
2017-03-01 is triggered at the end of its period, and the Jinja template
passes «2017-03-01» to the job as data.]
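The consequence of “at the end of the period”: a daily DAG starting 2017-03-01 has its first run triggered around 2017-03-02 00:00, while `{{ ds }}` still renders the period start. A small datetime sketch of that rule:

```python
from datetime import date, timedelta

start_date = date(2017, 3, 1)
schedule_interval = timedelta(days=1)

# The run covering [start_date, start_date + interval) is triggered
# at the END of that period, but its execution date (ds) is the start.
first_trigger = start_date + schedule_interval
ds = start_date.isoformat()
print(first_trigger, ds)  # 2017-03-02 2017-03-01
```

This off-by-one surprises many newcomers: the “March 1st run” actually happens on March 2nd, once March 1st's data is complete.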
Airflow Scheduling
Airflow Metadata
● Keeps:
− DAG status
− Task status (passed/failed)
● Runs a heartbeat function to:
− Update “Last_updated”
− Run kill_zombies()
Airflow Executors
● Executors are the mechanism by which task instances get run.
● Types:
− Sequential (for debugging)
− Local
− Celery
− Apache Mesos
− Kubernetes
● Scalable “by design” (multiple workers):
− Celery / Apache Mesos / Kubernetes
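The difference between the sequential and parallel executors can be sketched with the standard library: the same task list, run under two strategies, yields the same results (`concurrent.futures` stands in here for a pool of Celery/Mesos/Kubernetes workers; `run_task` is a hypothetical stand-in for a task instance):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id: str) -> str:
    return f"done:{task_id}"

tasks = ["t1", "t2", "t3", "t4"]

# SequentialExecutor-style: one task at a time, in order.
sequential = [run_task(t) for t in tasks]

# LocalExecutor/Celery-style: a pool of workers runs tasks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(run_task, tasks))

print(sequential == parallel)  # same results, different strategy
```

The executor choice changes throughput and fault tolerance, not the DAG semantics, which is why it is “scalable by design”.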
Airflow Alerting
● On event:
− Retry/failure/success
− Timeout
− SLAs
● Using:
− Email
− SlackOperator
− Callback
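Callbacks are the most flexible of the three channels: a function receives the task context when an event fires. A sketch of the pattern (hypothetical names; in Airflow such a function is passed as `on_failure_callback`):

```python
alerts = []

def notify_on_failure(context: dict) -> None:
    # In real life this would send an email or a Slack message.
    alerts.append(f"task {context['task_id']} failed: {context['error']}")

def run_with_callback(task_id, fn, on_failure):
    """Run a task; invoke the failure callback with context on error."""
    try:
        fn()
    except Exception as exc:
        on_failure({"task_id": task_id, "error": str(exc)})

def flaky():
    raise RuntimeError("upstream file missing")

run_with_callback("load_orders", flaky, notify_on_failure)
print(alerts)
```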
Airflow Web-based UI
● The Web UI allows you to
− visualize
● pipelines
● dependencies
● runs
− monitor progress and status
− trigger tasks
− manage variables and connections
− explore logs and metadata
− run ad-hoc queries
UI: DAG Graph View
UI: DAG Tree View
UI: Task Options
UI: DAG Task Duration
UI: DAG Code
UI: Data Profiling Charts
Airflow CLI
● Task level
− Test
− Run
− List tasks
● DAG level
− Check DAG state
− Pause
− Backfill
● Instance level (for maintenance)
Airflow users (officially)
● Airbnb
● Bloomberg
● Change.org
● DigitalOcean
● Glassdoor
● HBO
● PayPal
● Reddit
● Spotify
● Tesla
● Tinder
● Twitter
● Ubisoft
● … and 237 more
Airflow community
● GitHub:
Apache/airflow
− 946 contributors
(242 in 03.2017)
− 4069 forks
(1,182 in 03.2017)
− > 110k lines of code
Any questions?