Introducing Apache
Airflow (Incubating)
Sid Anand (@r39132)
Data Day Seattle 2016
1
About Me
2
Work [ed | s] @
Maintainer on
Reports to
Co-Chair for
Apache Airflow
3
What is it?
4
Apache Airflow : What is it?
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
5
Apache Airflow : What is it?
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
It ships with
• DAG Scheduler
• Web application (UI)
• Powerful CLI
6
Apache Airflow : What is it?
Airflow - Authoring DAGs
7
Airflow: Visualizing a DAG
8
Airflow: Author DAGs in Python! No need to bundle many XML files!
Airflow - Authoring DAGs
9
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Authoring DAGs
Airflow - Performance Insights
10
Airflow: Gantt charts reveal the slowest tasks for a run!
11
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
12
Apache Airflow : What is it?
When would you use a Workflow Scheduler like
Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
• Fraud Detection, Scoring/Ranking, Classification,
Recommender System, etc…
• General Job Scheduling (e.g. Cron)
• DB Back-ups, Scheduled code/config deployment
13
Apache Airflow : What is it?
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
• where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
• E.g. Alerting if time or correctness SLAs are not met
• Scale
14
Apache Airflow : What is it?
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Apache Airflow
15
Incubating
16
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime
Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow —> Apache Incubator
Today
• 166+ Contributors
• 300+ Users
• 40+ companies officially using it!
• 9 Committers/Maintainers <— We’re growing here
Agari
17
What We Do!
Agari : What We Do
18
19
Agari : What We Do
20
Agari : What We Do
21
Agari : What We Do
22
Agari : What We Do
23
Enterprise
Customers
email
metadata
apply
trust
models
email md
+ trust
score
Agari’s Current Product
Agari : What We Do
24
email
metadata
apply
trust
models
email md
+ trust
score
Agari’s Future ProductEnterprise
Customers
Agari : What We Do
Apache Airflow @ Agari
How Do We Use It?
25
Classes of Orchestration
26
apply trust
models
(message
scoring)
build trust
models
cron++
(general
job
scheduler)
New Product
(Enterprise Protect)
Operational
Automation
BI / ETL
N / A
Classes of Orchestration
27
apply trust
models
(message
scoring)
build trust
models
cron++
(general
job
scheduler)
New Product
(Enterprise Protect)
Operational
Automation
BI / ETL
N / A
This Talk
Use-Case : Message
Scoring
Batch Pipeline Architecture
28
Use-Case : Message Scoring
29
enterprise A
enterprise B
enterprise C
S3
S3 uploads every 15
minutes
Use-Case : Message Scoring
30
enterprise A
enterprise B
enterprise C
S3
Airflow kicks of a Spark
message scoring job
every hour
Use-Case : Message Scoring
31
enterprise A
enterprise B
enterprise C
S3
Spark job writes scored
messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
32
enterprise A
enterprise B
enterprise C
S3
This triggers SNS/SQS
messages events
S3
SNS
SQS
Use-Case : Message Scoring
33
enterprise A
enterprise B
enterprise C
S3
An Autoscale Group
(ASG) of Importers spins
up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
34
enterprise A
enterprise B
enterprise C
S3
The importers rapidly ingest scored
messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
35
enterprise A
enterprise B
enterprise C
S3
Users receive alerts of
untrusted emails &
can review them in
the web app
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
36
enterprise A
enterprise B
enterprise C
S3 S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process
Use-Case : Message Scoring
Use-Case : Message
Scoring
Airflow DAG
37
38
Airflow DAG
39
5 minute wait for S3
eventual consistency
Airflow DAG
40
1 hour a day, we
also build new
models
Airflow DAG
41
build
models
Airflow DAG
42
dummy needed for
branch operator
Airflow DAG
43
• trigger_rule:
one_success
Airflow DAG
44
Prep Spark
Run
Airflow DAG
45
• Run Spark
• Verify a record is
written to the DB
• Wait for the SQS
queue to empty
Airflow DAG
46
• Compute
discrepancies
• Send email report
• Update monitoring
graphs
• Raise SLA
(correctness) alerts
Airflow DAG
SLAs & Insights
Airflow
47
48
Desirable Qualities of a Resilient
Data Pipeline
OperabilityCorrectness
Timeliness Scalable/Available
• Data Integrity (no loss, etc…)
• Expected data distributions
• All output within time-bound SLAs
(e.g. 1 hour)
• Fine-grained Monitoring &
Alerting of Correctness &
Timeliness SLAs
• Quick Recoverability
• ASGs, SQS, SNS, S3
49
Desirable Qualities of a Resilient
Data Pipeline
OperabilityCorrectness
Timeliness
• Data Integrity (no loss, etc…)
• Expected data distributions
• All output within time-bound SLAs
(e.g. 1 hour)
• Fine-grained Monitoring &
Alerting of Correctness &
Timeliness SLAs
• Quick Recoverability
SLA
SLA
• ASGs, SQS, SNS, S3
Scalable/Available
50
Correctness : Email Reporting
orgs
51
Correctness : Email Reporting
For each org, we check for duplicate or missing data
as a count & percentage
orgs
52
Correctness : Email Reporting
These are the 3 stages of the pipeline. We can detect where a
discrepancy is coming from - often related to a code push!
orgs
53
Correctness : Monitoring
54
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
55
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
56
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
dag = DAG(DAG_NAME,
schedule_interval='@hourly',
default_args=default_args,
sla_miss_callback=sla_alert_func)
57
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
Correctness
SLA miss
58
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness &
Correctness SLA
misses sent to
PagerDuty/VictorOps
Use-Case : Model Building v2
For Both Batch & Near Realtime Scoring
Pipelines
59
60
Airflow DAG
61
Model Building DAG
Launch an
EMR cluster
62
Run model
building as
EMR steps
Model Building DAG
63
Validate
models
Send email
notification if
tests fail
Model Building DAG
64
Terminate
EMR cluster
Model Building DAG
Apache Airflow Next Steps
65
Areas for Improvement
66
Apache Airflow Next Steps
Improvement Areas
• Security
• API (though we do have a CLI)
• Deployment / Versioning
• Execution Scale Out
• On-demand Execution
Acknowledgments
67
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
None of this work would be possible without the
contributions of the strong team below
Questions? (@r39132)
68

Introduction to Apache Airflow - Data Day Seattle 2016