Building (Better) Data Pipelines
using Apache Airflow
Sid Anand (@r39132)
QCon.AI 2018
1
About Me
2
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
Apache Airflow
3
What is it?
4
Apache Airflow : What is it?
In a :
Airflow is a platform to
programmatically author, schedule
and monitor workflows (a.k.a. DAGs
or Directed Acyclic Graphs)
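A minimal sketch of what "a DAG of tasks" means, using only the Python standard library rather than Airflow itself. The task names are hypothetical; the point is that each task lists what it depends on, and a valid run order falls out of the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

"load" and "report" both depend only on "transform", so a scheduler is free to run them in either order or in parallel.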
Apache Airflow
5
UI Walk-Through
6
Apache Airflow : UI Walk-through
Airflow - Authoring DAGs
7
Airflow: Visualizing a DAG
8
Airflow: Author DAGs in Python! No need to bundle many XML files!
9
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights
10
Airflow: Gantt charts reveal the slowest tasks for a run!
11
Airflow: …And we can easily see performance trends over time
Apache Airflow
12
Why use it?
13
Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
14
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
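Two of the items above, handling task failures and alerting on them, can be sketched in plain Python. This is an illustration under stated assumptions, not Airflow's implementation: `run_with_retries`, `flaky_task`, and the `alerts` list are hypothetical names. Airflow exposes comparable per-task settings such as `retries` and `retry_delay`.

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0, on_failure=None):
    """Call task(); retry up to `retries` extra times, then alert and re-raise."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if attempt >= retries:
                # Retries exhausted: report the failure, then surface it.
                if on_failure is not None:
                    on_failure(exc)
                raise
            attempt += 1
            time.sleep(retry_delay)


alerts = []          # stands in for an email / pager integration
calls = {"n": 0}


def flaky_task():
    """Hypothetical task that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_task, retries=2, on_failure=alerts.append)
print(result, calls["n"], alerts)
```

The alert hook fires only once retries are exhausted, so transient failures do not page anyone.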
15
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
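Resource pooling, in rough terms, means capping how many tasks may hold a shared resource at once. The sketch below models that idea with a semaphore from the standard library; `POOL_SLOTS` and `pooled_task` are assumptions for illustration and not Airflow's pool API.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # assumed limit, e.g. connections a database can spare
pool = threading.BoundedSemaphore(POOL_SLOTS)
lock = threading.Lock()
running = 0
peak = 0


def pooled_task(task_id):
    """Hypothetical task that must hold a pool slot while it works."""
    global running, peak
    with pool:  # blocks until a slot is free
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # simulated work against the shared resource
        with lock:
            running -= 1
    return task_id


# Eight tasks compete for two slots, even though eight threads are available.
with ThreadPoolExecutor(max_workers=8) as executor:
    finished = sorted(executor.map(pooled_task, range(8)))

print(finished, peak)
```

All eight tasks finish, but at no point do more than `POOL_SLOTS` of them hold the resource.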
Use-Case : Message Scoring
Batch Pipeline Architecture
16
Use-Case : Message Scoring
17
[Diagram, built up step by step: enterprise A, enterprise B, enterprise C, S3, a second S3 bucket, SNS, SQS, Importers ASG, DB]
• S3 uploads every 15 minutes
• Airflow kicks off a Spark message scoring job every hour
• Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Autoscale Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process
25
Airflow DAG
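Since uploads land every 15 minutes and the scoring job runs hourly, each run has four upload windows to read. A small sketch of that arithmetic follows; the bucket name and key layout are made up for illustration and are not taken from the talk.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)


def upload_windows(hour_start):
    """Return the start times of the four 15-minute uploads within one hourly run."""
    hour_start = hour_start.replace(minute=0, second=0, microsecond=0)
    return [hour_start + i * UPLOAD_INTERVAL for i in range(4)]


def input_prefixes(hour_start, bucket="example-input-bucket"):
    """Build one S3 prefix per upload window (hypothetical bucket and key layout)."""
    return [f"s3://{bucket}/{w:%Y/%m/%d/%H%M}/" for w in upload_windows(hour_start)]


prefixes = input_prefixes(datetime(2018, 4, 10, 9, 0))
for prefix in prefixes:
    print(prefix)
```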
Apache Airflow
26
Incubating
27
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator
Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers (we’re growing here)
Thank You!
28
Apache Airflow
29
Behind the Scenes
30
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
Apache Airflow : Behind the Scenes
31
Apache Airflow : Behind the Scenes
[Diagram: Webserver, Scheduler, three Workers, Meta DB, Celery / RabbitMQ]
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
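Steps 3 and 4 follow a producer/consumer pattern: the scheduler puts work on a queue and workers pull from it. The sketch below models that with an in-process `queue.Queue` standing in for Celery / RabbitMQ. The task names and the `schedule` / `worker` helpers are hypothetical, and real Airflow workers run as separate processes.

```python
import queue
import threading

broker = queue.Queue()          # stands in for the Celery / RabbitMQ queue
results = []
results_lock = threading.Lock()
STOP = object()                 # sentinel telling a worker to exit


def worker(name):
    """Pull tasks off the broker until a STOP sentinel arrives."""
    while True:
        task = broker.get()
        if task is STOP:
            break
        with results_lock:
            results.append((name, task))  # "running" a task just records it


def schedule(tasks, num_workers=3):
    """Start workers, enqueue the tasks, then send one STOP per worker."""
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for task in tasks:
        broker.put(task)
    for _ in threads:
        broker.put(STOP)
    for thread in threads:
        thread.join()


schedule(["score_messages", "write_stats", "notify"])
print(sorted(task for _, task in results))
```

The scheduler never addresses a specific worker; whichever worker is free takes the next task, which is what lets the worker tier scale independently.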
34
[Diagram: Webserver, Scheduler, three Workers, Meta DB, Celery / RabbitMQ]
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks from RabbitMQ
Apache Airflow : Behind the Scenes
Thank You!
35
