A walkthrough of what Airflow is and isn't, and how to use Airflow to construct dynamic tasks and automate your entire ETL process. The presentation can be seen here: http://dovy.io/airflow/airflow-strength-and-weaknesses-and-dynamic-tasks
2. A BRIEF HISTORY OF DATA PIPELINES
For real, this is how we used to do it…
3. The dev’s answer to EVERYTHING
Cron / crontab
This works great for some use cases, but falls short in many other ways.
It works great, provided the computer is on: it runs your job at the time you set, every time it can.
But there is no recovery, logs are self-managed, you're never sure when it actually ran, and it can only execute on one computer.
4. It keeps tasks alive.
Supervisor / Supervisord
A fantastic utility that works as expected, with an optional embedded UI and CLI utility.
It keeps everything up and lets you see what's going on. It even rotates logs and allows process groups.
But it still executes on the one computer, and it isn't more than it advertises to be. Limited scope.
6. Airflow is a “workflow management system” created by Airbnb.
“Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform.”
June 2, 2015
https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8
And it’s all written in Python!
7. What IS Airflow?
But really…
Dependency Control
Task Management
Task Recovery
Charting
Logging
Alerting
History
Folder Watching
Trending
Dynamic Tasks
ANYTHING your pipeline may need…
13. We place it all on a single Google Compute Engine VM.
No bull!
Excuse me?
CPU: n1-standard-2
2 vCPUs, 7.5 GB memory
HD: 30 GB
Standard Persistent Disk (Non-SSD)
15. A few key Airflow concepts.
01 DAGs
A Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Written in Python.
02 Operators
An operator describes how a single task performs in a workflow (DAG). There are many types of operators: BashOperator, PythonOperator, EmailOperator, HTTPOperator, MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, Sensor, DockerOperator.
03 Tasks
Once an operator is instantiated, it is referred to as a task.
import time
from datetime import datetime
from pprint import pprint

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='example_python_operator',
    schedule_interval=None,
    start_date=datetime(2016, 1, 1)  # a start_date is required for tasks to run
)

def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)

def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

# Generate 10 sleeping tasks, sleeping from 0 to 0.9 seconds respectively
for i in range(10):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        dag=dag
    )
    task.set_upstream(run_this)
20. Stop doing things the way you have; think dynamically.
You can generate your tasks automatically by parsing source code or by listing files in a directory (see the sketch below).
You don't have to worry about execution order; you only need to present Airflow with the relationships.
Think in terms of how you can remove human error. Let Airflow work for you.
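As a minimal sketch of the files-in-a-directory approach (the jobs/ directory, the .sql naming convention, and the process callable are assumptions for illustration, not from the deck):

import os

from airflow.operators.python_operator import PythonOperator

JOBS_DIR = 'jobs'  # hypothetical: one SQL file per job

for filename in sorted(os.listdir(JOBS_DIR)):
    if not filename.endswith('.sql'):
        continue
    # One task per file: adding a file adds a task, with no DAG edits needed.
    PythonOperator(
        task_id=filename[:-len('.sql')],
        python_callable=process,  # assumed callable, defined elsewhere
        op_kwargs={'sql_file': os.path.join(JOBS_DIR, filename)},
        dag=dag,
    )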
21. Airflow really shines with dynamic tasks.
Dictionary (array) of Dependencies
What if you made a script that parsed all your jobs and detected all dependencies automatically? (A sketch follows the example dictionary below.)
Now what if you took that dictionary and fed it into Airflow?
How would that simplify your pipeline?
dependencies = {
    'topic_billing_frequency': [
        'dim_billing_frequency',
        'dim_account'
    ],
    'topic_payment_method': [
        'dim_credit_card_type',
        'dim_payment_accounts'
    ]
}
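One way such a detection script could work, as a sketch: assume each job is a SQL file, and treat every table read in a FROM or JOIN clause as a dependency (the file layout and the regex here are illustrative assumptions, not the author's actual parser):

import glob
import os
import re

dependencies = {}
for path in glob.glob('jobs/*.sql'):
    job_name = os.path.basename(path)[:-len('.sql')]
    with open(path) as f:
        sql = f.read()
    # Every table referenced in a FROM or JOIN clause is a dependency.
    tables = re.findall(r'\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)', sql, re.IGNORECASE)
    dependencies[job_name] = sorted(set(tables))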
Let’s take a look…
Let me show you…
22. Airflow really shines with dynamic tasks.
The code to run it all
01 Top Level Dependencies
Top-level tasks are created first. Each of these tasks runs after the cluster is created and before it is deleted.
02 Child Dependencies
Each child dependency is then iterated over, and a task is created for each. Each is given the delete task as a downstream, so the cluster is never deleted until the tasks are complete.
03 Connect children to parents
Finally, each child task is set as an upstream of its parent task, so dependencies always run first.
from datetime import timedelta

from airflow.operators.python_operator import PythonOperator

# process, dag, task_create_cluster, task_delete_cluster, the airflow_*
# callback functions, and the dependencies module are defined elsewhere
# in the author's project.

all_tasks = {}

# 01: Create all parent (top-level) tasks.
for key, value in dependencies.all_dependencies.items():
    if key not in all_tasks:
        all_tasks[key] = PythonOperator(
            task_id=key,
            python_callable=process,
            op_kwargs={},
            provide_context=True,
            dag=dag,
            retries=30,
            retry_delay=timedelta(minutes=10),
            on_retry_callback=airflow_retry_function,
            on_failure_callback=airflow_error_function,
            on_success_callback=airflow_success_function,
        )
        # Run after the cluster is created and before it is deleted.
        all_tasks[key].set_upstream(task_create_cluster)
        all_tasks[key].set_downstream(task_delete_cluster)

# 02 + 03: Create all nested dependency tasks and wire them to their parents.
for key, value in dependencies.all_dependencies.items():
    for item in value:
        if item not in all_tasks:
            all_tasks[item] = PythonOperator(
                task_id=item,
                python_callable=process,
                op_kwargs={},
                provide_context=True,
                dag=dag,
                retries=30,
                retry_delay=timedelta(minutes=10),
                on_retry_callback=airflow_retry_function,
                on_failure_callback=airflow_error_function,
                on_success_callback=airflow_success_function,
            )
            # Never delete the cluster until this task is complete.
            all_tasks[item].set_downstream(task_delete_cluster)
        # The child must finish before its parent runs.
        all_tasks[item].set_downstream(all_tasks[key])
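To make the wiring concrete, here is what those loops effectively produce for the sample dependencies dictionary shown earlier (a hand-expanded sketch, not code from the deck):

# Parents run between cluster creation and deletion:
#   task_create_cluster -> topic_billing_frequency -> task_delete_cluster
#   task_create_cluster -> topic_payment_method -> task_delete_cluster
# Children run before their parents and also gate cluster deletion:
#   dim_billing_frequency -> topic_billing_frequency
#   dim_account -> topic_billing_frequency
#   dim_credit_card_type -> topic_payment_method
#   dim_payment_accounts -> topic_payment_method
#   and each dim_* task -> task_delete_cluster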
23. What does that code do? This is real code being used today.
24. Dovy Paukstys
Consultant at Caserta
#geek #bigdata #redux
How can I help?
http://dovy.io
http://twitter.com/simplerain
dovy.paukstys@caserta.com
http://reduxframework.com
https://github.com/dovy/
http://linkedin.com/in/dovyp