Airflow
for beginners
https://github.com/karpenkovarya/airflow_for_beginners
What is Airflow?
It is a tool to BUILD, SCHEDULE and MONITOR
data pipelines
A data pipeline is a set of data processing elements connected in series:
the output of one element is the input of the next one.
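A minimal sketch of this idea in plain Python (not Airflow code). The step names `extract`, `filter_questions`, and `render`, the helper `run_pipeline`, and the sample data are all hypothetical, chosen only to show each step receiving the previous step's output.

```python
from functools import reduce


def extract():
    # Hypothetical sample data standing in for an API response.
    return [
        {"title": "How do I schedule a DAG?", "score": 5},
        {"title": "Why is my task failing?", "score": 0},
    ]


def filter_questions(questions):
    # Keep only questions with a positive score.
    return [q for q in questions if q["score"] > 0]


def render(questions):
    # Turn the remaining questions into one line of text each.
    return "\n".join(q["title"] for q in questions)


def run_pipeline(source, steps):
    # Feed the output of each step into the next one.
    return reduce(lambda data, step: step(data), steps, source())


result = run_pipeline(extract, [filter_questions, render])
```

Here `result` holds the rendered text built from the one question that passed the filter.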
I. Create Questions table
II. Store data from Stack Overflow
III. Write filtered questions to S3
IV. Render HTML template
V. Send me an email
Building blocks
of Airflow
Operator
(Worker)
Knows how to perform a task
and has the tools to do it.
Example:
Python Operator
Postgres Operator
Bash Operator
Email Operator
DAG
(Protocol /
Instructions)
Describes the order of tasks and
what to do if a task fails.
Example:
Run Task A, when it is finished, run
Task B. If one of the tasks failed, stop
the whole process and send me a
notification.
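The behaviour in this example can be modelled in a few lines of plain Python. This is a conceptual sketch, not Airflow's API: `run_dag`, the task names, and the `notify` callback are hypothetical, and the ordering is delegated to the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks, notify):
    """Run tasks in dependency order; stop and notify on the first failure."""
    finished = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name]()
        except Exception as error:
            # Stop the whole process and send a notification.
            notify(f"Task {name} failed: {error}")
            break
        finished.append(name)
    return finished


def task_a():
    pass  # succeeds


def task_b():
    raise RuntimeError("boom")  # simulated failure


def task_c():
    pass  # never reached in this run


notifications = []
# Each key maps a task to the set of tasks that must finish before it.
dependencies = {"task_b": {"task_a"}, "task_c": {"task_b"}}
tasks = {"task_a": task_a, "task_b": task_b, "task_c": task_c}
finished = run_dag(dependencies, tasks, notifications.append)
```

In Airflow itself the same ordering is usually declared inside a DAG definition with the `>>` operator, for example `task_a >> task_b`.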
Task
(Specific job)
Job that is done by an
Operator.
Example:
- Load data from some API using
Python Operator
- Write data to the database using
MySQL Operator
Hooks
Interfaces to the external
platforms and databases.
They implement a common interface
(all hooks look very similar) and
use Connections.
Example:
S3 Hook
Slack Hook
HDFS Hook
Connection
Credentials for external
systems that can be securely
stored in Airflow.
Example:
Postgres Connection = Connection
string to the Postgres database
AWS Connection = AWS access
keys
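To make the relationship between Hooks and Connections concrete, here is a toy model in plain Python. It is a sketch under assumptions, not Airflow code: the `CONNECTIONS` dictionary, the `questions_db` id, and the `SqliteHookSketch` class are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs without any external service.

```python
import sqlite3

# Hypothetical connection store: credentials are kept in one place
# and looked up by id, so task code never hard-codes them.
CONNECTIONS = {"questions_db": {"database": ":memory:"}}


class SqliteHookSketch:
    """Toy hook: looks up a stored connection by id and opens it."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def get_conn(self):
        settings = CONNECTIONS[self.conn_id]
        return sqlite3.connect(settings["database"])


hook = SqliteHookSketch("questions_db")
conn = hook.get_conn()
conn.execute("CREATE TABLE questions (title TEXT)")
conn.execute("INSERT INTO questions VALUES (?)", ("What is Airflow?",))
rows = conn.execute("SELECT title FROM questions").fetchall()
conn.close()
```

The task code only knows the connection id; swapping the database means changing the stored Connection, not the task.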
Variables
Like environment
variables.
They can store arbitrary
information and be used in
Tasks.
Examples:
Stack Overflow base URL
Gmail Client ID and Secret
XComs
Let Tasks exchange
small messages.
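A toy model of this message exchange in plain Python. It is a sketch, not Airflow's implementation: the `XComSketch` class, the task functions, and the S3 key are hypothetical. In Airflow the corresponding calls are `xcom_push` and `xcom_pull` on the task instance.

```python
class XComSketch:
    """Toy message store keyed by (task_id, key)."""

    def __init__(self):
        self._messages = {}

    def push(self, task_id, key, value):
        self._messages[(task_id, key)] = value

    def pull(self, task_id, key):
        return self._messages.get((task_id, key))


def write_to_s3(xcom):
    # Hypothetical S3 key; only this small string is shared, not the file.
    s3_key = "questions/filtered.json"
    xcom.push("write_to_s3", "s3_key", s3_key)


def render_template(xcom):
    # The downstream task reads the message left by the upstream task.
    s3_key = xcom.pull("write_to_s3", "s3_key")
    return f"<p>Questions file: {s3_key}</p>"


xcom = XComSketch()
write_to_s3(xcom)
html = render_template(xcom)
```

The message is a short reference to the data (a file location), which matches the "small messages" intent: the data itself stays in S3.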
I. Create Questions table
II. Store data from Stack Overflow
III. Write filtered questions to S3
IV. Render HTML template
V. Send me an email
Building blocks used in this pipeline (labels from the diagram):
- Operators: Postgres Operator, Python Operator (three times), Email Operator
- Hooks: Postgres Hook, S3 Hook
- Connections: Postgres Connection, S3 Connection
- XCom
- Variables
What have we learned?
- What is Apache Airflow
- What is a data pipeline
- Main Airflow concepts (DAG, Task, Operator, Connection, etc.)
- First pipeline
Thank you!
🌻✨💛
📬 hello@varya.io
