Apache Airflow is a platform to programmatically author, schedule and monitor workflows. Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban. It's primary goal is to solve problem nicely described in this XKCD comic (https://xkcd.com/2054/) What's unique about Airflow is that it brings "infrastructure as a code" concept to building scalable, manageable and elegant workflows. Workflows are defined as Python code - thus making dynamic workflow possible. It provides hundreds of out-of-the-box Operators that allow your pipeline to tap into pretty much any resource possible - starting from resources from multiple cloud providers as well as on-the-premises systems of yours. It's super-easy to write your own operators and leverage the power of data pipeline infrastructure provided by Airflow. This talk will be about general concepts behind Airflow - how you can author your workflow, write your own operators and run and monitor your pipelines. It will also explain how you can leverage Kubernetes (in recent release of Airflow) to make use of your on-premises or in-the-cloud infrastructure efficiently. You leave the talk armed with enough knowledge to evaluate if Airflow is good for you to solve your data pipeline problems and get some insight from Airflow contributors in case you are already an Airflow user.
6. PyWaw #80 @higrys, @sprzedwojski
Cron vs. Airflow
● no installation needed ● setup complex (but easy with Google Composer)
● locally stored job output ● aggregate logs, show them in UI
● no versioning ● versioning as usual (for programmers)
● no testing ● automated testing as usual (for programmers)
● hard re-running, backfilling ● re-running, backfilling (run all missed runs)
● no dependency management ● complex data pipelines (task dependencies)
12. @higrys, @sprzedwojskiPyWaw #80
Operator Properties
Important when writing your operators:
● Make operators idempotent
● Task execution is atomic
● State is not stored locally
● Data exchanged between tasks with XCOM
38. @higrys, @sprzedwojskiPyWaw #80
Single node
Local Executor
Web
server
RDBMS DAGs
Scheduler
Local executors`
Local executors
Local executors
Local executors
multiprocessing