2. Airflow @ ING
3. ING
Multinational banking and financial services corporation headquartered in Amsterdam. Its primary businesses are retail banking, direct banking, wholesale banking, investment banking, asset management, and insurance services.
4. Why Apache Airflow (incubating)?
• Cron replacement
• Fault tolerant
• No XML (looking at you, Oozie!)
• Testable
• Python code
• Extendable
• Now Apache (incubating)
• Scales out
• Complex dependency rules
• Pools
• CLI & web UI
5. Growing community
6. Airflow Operational Design
[Architecture diagram: the Airflow webserver, the Airflow scheduler, and the Airflow executor (local/celery/mesos worker) all talk to the database; the executor runs the Airflow tasks; the webserver talks to an auth backend.]
7. Choose an executor that fits your environment

SequentialExecutor
• Use case: mainly testing
• Scalability: n/a
• Complexity: low
• DAG files: local
• Configuration:
  [core]
  executor = SequentialExecutor

LocalExecutor
• Use case: production (~50% of installed base)
• Scalability: vertical
• Complexity: medium
• DAG files: local
• Configuration:
  [core]
  executor = LocalExecutor
  parallelism = 32

CeleryExecutor
• Use case: production (~50% of installed base)
• Scalability: horizontal and vertical
• Complexity: medium/high
• DAG files: need sync / pickle across workers
• Configuration:
  [core]
  executor = CeleryExecutor
  [celery]
  celeryd_concurrency = 32
  broker_url = rabbitmq
  celery_result_backend
  default_queue =

Remark: don't use num_runs.
8. UTC everywhere
"Engineers here respond in UTC if you ask them what time it is" (Max)
• Airflow assumes every server / worker runs in UTC
• Airflow does not manage time zones (correctly) (to be fixed)
• UTC does not observe Daylight Saving Time
9. Tasks run at the end of the period, not at the start
• For an hourly DAG with start date 2016-06-01 21:00 UTC, the first run will be at 2016-06-01 22:00 UTC
• Its execution date will be 2016-06-01 21:00 UTC
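The rule can be stated as: the run for a schedule period is triggered once that period has closed, and the run's execution date is the *start* of the period. A minimal sketch of that contract (plain Python, not the Airflow API):

```python
from datetime import datetime, timedelta

# The run covering [execution_date, execution_date + interval) can only
# fire once the whole period's data can exist, i.e. at the period's end.
def trigger_time(execution_date, schedule_interval):
    return execution_date + schedule_interval

execution_date = datetime(2016, 6, 1, 21, 0)  # period start (UTC)
run_at = trigger_time(execution_date, timedelta(hours=1))
print(run_at)  # 2016-06-01 22:00:00
```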
10. How to stop/kill a task?
11. How to force running a task?
Celery only (for now).
12. Make your tasks and DAGs idempotent
"An idempotent operation is one that has no additional effect if it is called more than once with the same input parameters."
• DAGs and tasks receive an execution date
• on_retry_callback can be used to do a cleanup before a retry
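As a sketch of the idea (plain Python, not the Airflow API; the partition layout is invented for illustration): a task keyed on its execution date can wipe and rewrite its own output partition, so retries and reruns converge to the same state.

```python
import os
import shutil
import tempfile

def run_task(base_dir, execution_date, records):
    """Idempotent 'task': (re)build the output partition for one date."""
    out_dir = os.path.join(base_dir, "ds=" + execution_date)
    # The cleanup an on_retry_callback would do: drop any partial output.
    shutil.rmtree(out_dir, ignore_errors=True)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.writelines(r + "\n" for r in records)

base = tempfile.mkdtemp()
run_task(base, "2016-06-01", ["a", "b"])
run_task(base, "2016-06-01", ["a", "b"])  # rerun: same output, no duplicates
```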
13. Generate your tasks programmatically
• List file names on HDFS
• Loop over the file names
• Create a task per file
• Assign upstream / downstream dependencies
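The steps above can be sketched as follows (plain Python; `list_hdfs_files` is a made-up stand-in for a real HDFS listing, and plain dicts stand in for Airflow operators):

```python
def list_hdfs_files():
    # Stand-in for an HDFS client call; paths are invented.
    return ["/data/in/trans_001.csv", "/data/in/trans_002.csv"]

def build_dag(filenames):
    # One "task" per file, all wired upstream of a final join task.
    tasks = {}
    join = {"task_id": "all_files_loaded", "upstream": []}
    for path in filenames:
        name = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        task_id = "load_" + name
        tasks[task_id] = {"task_id": task_id, "path": path}
        join["upstream"].append(task_id)  # the set_upstream step
    tasks[join["task_id"]] = join
    return tasks

dag = build_dag(list_hdfs_files())
print(sorted(dag))  # ['all_files_loaded', 'load_trans_001', 'load_trans_002']
```

New files appearing on HDFS show up as new tasks on the next DAG parse, with no manual intervention.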
14. When using ExternalTaskSensor, manually raise the priority of the tasks it is waiting for
• Otherwise scheduling can deadlock, as the sensors take up all the slots in the scheduler
• Another way to circumvent this issue is to give sensors a separate pool
15. Some last bits
• Do you have longer-running tasks? Increase the heartbeat of the scheduler to decrease load
• Smaller tasks make for easier debugging and retrying
• Choose your start date properly: the scheduler will fill gaps
• Changing the schedule requires changing the dag_id
• Backfills are used to add runs where the scheduler has already gone by
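"The scheduler will fill gaps" means that one run is created for every schedule interval between the start date and now. A sketch of that bookkeeping (plain Python, not the Airflow scheduler itself):

```python
from datetime import datetime, timedelta

def missing_runs(start_date, interval, now):
    """Execution dates the scheduler would create between start_date and now."""
    runs = []
    cursor = start_date
    while cursor + interval <= now:  # a period must have closed before it runs
        runs.append(cursor)
        cursor += interval
    return runs

# A daily DAG given a start_date four days in the past immediately gets
# four catch-up runs -- so choose start_date deliberately.
runs = missing_runs(datetime(2016, 6, 1), timedelta(days=1), datetime(2016, 6, 5))
print(len(runs))  # 4
```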
16. Use case
[Architecture diagram: Transactions, Risk, Products, and External data sources are ingested (Flume, XFB, Sqoop) into HDFS, processed with Spark and Tez, and Sqooped out to PostgreSQL.]
17. Wait for files to arrive (Sensor)
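A sensor boils down to a poke loop: check a condition at an interval until it holds or a timeout expires. A minimal file-based sketch (plain Python, not Airflow's sensor API; the local temp file simulates the arriving data):

```python
import os
import tempfile
import time

def wait_for_file(path, poke_interval=0.01, timeout=1.0):
    """Return True once `path` exists, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)  # the "poke interval" between checks
    return False

with tempfile.NamedTemporaryFile() as f:
    print(wait_for_file(f.name))                     # True: file is there
print(wait_for_file("/no/such/file", timeout=0.05))  # False: gave up
```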
18. Copy & clean up
19. Model creation
• Run Spark
• Run Tez
• Sharding
20. Sqooping to DB
21. Draft Roadmap
• Apache release
• Allow auto-aligned start_date
• Backfills to use DAG runs
• Improve pooling
• DAG parsing isolation
• REST API
• Further Kerberos integration
• Schedule backfill DAG runs
• Isolation: DAG syncing across workers; no direct imports for operators from __init__
• Event-driven scheduler
• Make tasks not need the database
• Roles / principals
(Four of these items were marked "in progress".)
22. Aspiring committer? Contributor? User?
http://gitter.im/apache/incubator-airflow/
https://github.com/apache/incubator-airflow/
http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/
Apache Airflow (incubating), NL HUG Meetup, 2016-07-19


Editor's Notes

  • Slide 9 (Tasks run at the end of the period): Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which is right after all data for 2016-02-19 becomes available.
  • Slide 13 (Generate your tasks programmatically): One of the most powerful features of a system where workflows are described in code is that you can programmatically generate your DAG. This is very useful when you want to automatically pick up new data sources without manual intervention.