What’s upcoming in Airflow 2?
PiterPy
4 August 2020
Kaxil Naik
Senior Data Engineer @ Astronomer
Twitter: @kaxil
Who am I?
● Airflow Committer, PMC Member & Release Manager
● Senior Data Engineer @ Astronomer
○ Part of the Airflow team
○ Work full-time on Airflow
● Previously worked at DataReply
● Master's in Data Science & Analytics from Royal Holloway, University of London
● Twitter: https://twitter.com/kaxil
● Github: https://github.com/kaxil/
● LinkedIn: https://www.linkedin.com/in/kaxil/
About Astronomer
Founded in 2018, Astronomer is the
commercial developer of Apache
Airflow, the open-source standard for
data orchestration.
100+ enterprise customers around the world
Locations: San Francisco, London, New York, Cincinnati, Hyderabad
Investors
3 of the top 10 Airflow committers are Astronomer advisors or employees
Mission
To help organizations adopt new standards for
data orchestration.
Airflow 2 - Highlights
● Scheduler High-Availability
● DAG Serialization
● DAG Versioning
● Stable REST API
● Functional DAGs
● Official Docker Image & Helm Chart
● Providers Packages
● and other notable changes ...
High Availability
Scheduler High Availability
Goals:
● Performance - reduce task-to-task schedule "lag"
● Scalability - increase task throughput by horizontal scaling
● Resiliency - kill a scheduler and have tasks continue to be scheduled
Scheduler High Availability - Design
● Active-active model: each scheduler does everything
● Uses the existing database - no new components needed, no extra operational burden
● Plan is to use row-level locks in the DB (SELECT … FOR UPDATE); see the sketch below
● Will re-evaluate if performance/stress testing shows the need
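A minimal sketch of the locking idea, assuming a PostgreSQL metadata DB and SQLAlchemy; this is illustrative, not the actual scheduler code, and the SKIP LOCKED variant shown is one possible refinement of plain FOR UPDATE:

    # Illustrative only: schedulers coordinating through row-level locks.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # placeholder DSN

    with engine.begin() as conn:  # one transaction per scheduling pass
        rows = conn.execute(text(
            "SELECT dag_id, task_id, execution_date "
            "FROM task_instance WHERE state = 'scheduled' "
            "LIMIT 10 FOR UPDATE SKIP LOCKED"  # a peer scheduler skips rows we hold
        ))
        for dag_id, task_id, execution_date in rows:
            ...  # examine the task instance and decide whether to queue it
    # locks are released when the transaction ends, so schedulers never double-queue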
Example HA configuration
Measuring Performance
The key metric we define is "scheduler lag":
● Amount of "wasted" time not running tasks
● ti.start_date - max(t.end_date for t in upstream_tis)
● Zero is the goal (we'll never get to 0)
● Tasks are "echo true" - tiny, but still executing
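Expressed as a tiny helper (illustrative, not Airflow API; ti stands for a task-instance record with start_date/end_date attributes):

    # Scheduler lag for one task instance, given its upstream task instances.
    def scheduler_lag(ti, upstream_tis):
        latest_upstream_end = max(t.end_date for t in upstream_tis)
        return (ti.start_date - latest_upstream_end).total_seconds()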
Preliminary performance results
Case: 100 DAG files | 1 DAG per file | 10 tasks per DAG | 1 run per DAG
Workers: 4 | Parallelism: 64
1.10.10: lag 54.17s (σ 19.38) | total runtime: 22m22s
HA branch, 1 scheduler: lag 4.39s (σ 1.40) | total runtime: 1m10s
HA branch, 3 schedulers: lag 1.96s (σ 0.51) | total runtime: 48s
DAG Serialization
DAG Serialization
Serialized DAG Representation
DAG Serialization (Tasks Completed)
● Stateless webserver: the Scheduler parses the DAG files, serializes them into JSON format & saves them in the metadata DB.
● Lazy loading of DAGs: instead of loading the entire DagBag when the webserver starts, each DAG is loaded on demand. This reduces webserver startup time and memory usage, which is notable with a large number of DAGs.
● Deploying new DAGs to Airflow no longer requires long webserver restarts (if DAGs are baked into the Docker image).
● Option to use the JSON library of your choice for serialization (the default is the built-in json library).
● Paves the way for DAG Versioning & Scheduler HA.
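On the 1.10 series this behaviour is opt-in; a minimal airflow.cfg sketch with the option names from the 1.10 docs (in 2.0, serialization is planned to be on by default):

    [core]
    # scheduler writes serialized DAGs to the metadata DB; webserver reads them
    store_serialized_dags = True
    # minimum seconds between re-serializations of the same DAG
    min_serialized_dag_update_interval = 30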
DAG Serialization (Tasks In Progress for Airflow 2.0)
● Decouple DAG parsing and serializing from the scheduling loop.
● The Scheduler will fetch DAGs from the DB.
● DAGs will be parsed, serialized and saved to the DB by a separate component, the "Serializer" / "DAG Parser".
● This should reduce the delay in scheduling tasks when the number of DAGs is large.
DAG Versioning
DAG Versioning
Current problem:
● A change in DAG structure affects the view of previous DagRuns too
● Not possible to view the code or DAG shape associated with an old DagRun
● Checking the logs of a deleted task in the UI is not straightforward
DAG Versioning (Current Problem)
A new task is shown in the Graph View for older DAG Runs too, with "no status".
DAG Versioning (Proposed Solution - Version Badge)
● DAG version badge:
○ Hash of the serialized DAG
○ Used to identify a DAG version
○ Links to the associated code in the Code View
DAG Versioning (Proposed Solution)
● The Graph View shows the actual version that was run.
● Mixed version: if a DAG was changed mid-way through a run, an internal version is calculated that shows the actual version that was run.
Two different versions (mixed version)
DAG Versioning
Goal:
● Scope is limited to making sure the visibility behaviour of Airflow is correct
○ No change in the execution behaviour
○ Execution will continue to be based on the most recent version of the DAG
● Support for storing multiple versions of serialized DAGs
● All the views show the correct DAG associated with that DagRun
● Baked-in maintenance DAGs
○ Clean up old DagRuns & associated serialized DAGs
Stable REST API
API: follows the OpenAPI 3.0 specification
Stable REST API
● Built using Connexion
○ OpenAPI compliant
○ Spec-first approach
○ Maps endpoints to Python functions
● Migration guide from the old experimental API to the new Stable REST API
● Integration with the current Flask-AppBuilder permission model
○ The experimental API didn't have a permission model; it was all or nothing
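A hedged example of calling one of the new endpoints; the path follows the OpenAPI spec, while the host and credentials are placeholders and a basic-auth backend is assumed to be enabled:

    import requests

    resp = requests.get(
        "http://localhost:8080/api/v1/dags",
        auth=("admin", "admin"),  # placeholder credentials
    )
    resp.raise_for_status()
    for dag in resp.json()["dags"]:
        print(dag["dag_id"], "paused:", dag["is_paused"])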
Clients in multiple languages for the API
● Using OpenAPI Generator, we can generate clients for multiple languages.
Functional DAGs
Functional DAGs
Current problem:
● No explicit way of passing data between tasks
○ XCom exists, but it is hidden inside task execution
○ XCom isn't intuitive for beginners
● Boilerplate code is needed for:
○ Task dependencies
○ Using a Python function in PythonOperator
● Can't pass large data between tasks with XCom
Example DAG - the normal way
"load" uses the data from the "transform" task, but you still need to define an explicit dependency between them.
Too much boilerplate code is needed to read data from the previous task using XCom.
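The slide showed a code screenshot; here is a hedged reconstruction of the "normal way" (task names and payloads are illustrative):

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    def extract():
        return {"a": 1, "b": 2}  # the return value is pushed to XCom

    def transform(ti):
        data = ti.xcom_pull(task_ids="extract")  # boilerplate XCom read
        return sum(data.values())

    def load(ti):
        print("total =", ti.xcom_pull(task_ids="transform"))

    with DAG("etl_classic", start_date=days_ago(1), schedule_interval=None) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3  # dependencies must still be declared by hand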
Example DAG - the functional way
Simple and readable code to get the data (without knowledge of XCom or Jinja).
Converting a function into an operator lets you pass arguments in a more Pythonic way; this also assigns the task to the DAG.
No need to explicitly define task dependencies.
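Again a hedged reconstruction, using the decorator API as it eventually landed in airflow.decorators; the task names mirror the classic example above:

    from airflow.decorators import dag, task
    from airflow.utils.dates import days_ago

    @dag(start_date=days_ago(1), schedule_interval=None)
    def etl_functional():

        @task
        def extract():
            return {"a": 1, "b": 2}

        @task
        def transform(data: dict):
            return sum(data.values())

        @task
        def load(total: int):
            print("total =", total)

        # passing return values wires up XCom and the dependencies automatically
        load(transform(extract()))

    etl_dag = etl_functional()  # calling the decorated function creates the DAG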
Functional DAGs
● Automatically convert functions into tasks using decorators:
○ @airflow.decorators.task
○ @dag.task
● Task IDs are automatically generated
● Pluggable XCom storage engine (sketched below):
○ Store and retrieve data in GCS, S3, etc.
● Simplifies writing DAG code and increases readability
● Task dependencies are automatically set when a task uses data from another task
● Backwards compatible
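A sketch of what a pluggable backend could look like, assuming the BaseXCom serialize/deserialize hooks and a [core] xcom_backend option land as proposed; the S3 plumbing and bucket name are illustrative:

    import json
    import uuid

    from airflow.models.xcom import BaseXCom
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    class S3XComBackend(BaseXCom):
        BUCKET = "my-xcom-bucket"  # hypothetical bucket

        @staticmethod
        def serialize_value(value):
            # spill the payload to S3 and store only a reference in the DB
            key = f"xcom/{uuid.uuid4()}.json"
            S3Hook().load_string(json.dumps(value), key, bucket_name=S3XComBackend.BUCKET)
            return BaseXCom.serialize_value(f"s3://{S3XComBackend.BUCKET}/{key}")

        # the matching deserialize_value hook (download from S3) is omitted for brevity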
Official Docker Image & Helm Chart
Official Container Image
● An alpha-quality image has been available since Airflow 1.10.10
● A beta-quality community image has been available since Airflow 1.10.11
● Available at https://hub.docker.com/r/apache/airflow
○ Run docker pull apache/airflow to pull the latest image
Official Container Image
● Supported Python Versions: 2.7, 3.5, 3.6, 3.7, 3.8
● Size: ~ 210 MB (compressed size)
● Uses python slim-buster images as the base image by default
Image Customization options
● Choose Base image (python)
● Install Airflow from PyPI
● Install from GitHub branch/tag
● Install additional extras, python deps, apt dev/runtime deps
● Choose different UID/GID
● Choose different AIRFLOW_HOME
● Choose different HOME dir
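A hedged example combining a few of these options through Docker build args, with argument names as documented in IMAGES.rst at the time:

    docker build . \
      --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
      --build-arg AIRFLOW_VERSION="1.10.11" \
      --build-arg ADDITIONAL_AIRFLOW_EXTRAS="jdbc" \
      --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
      --tag my-company/airflow:1.10.11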
Image Customization options
Details: https://github.com/apache/airflow/blob/master/IMAGES.rst#ci-images
Helm Chart
● Donated by Astronomer
○ Battle-tested & used by hundreds of production deployments run by Astronomer
● Uses official Airflow Docker Image
● Currently used in Airflow CI to run Kubernetes tests
● Check https://github.com/apache/airflow/tree/master/chart for details
Helm Chart
● Supports KEDA (Kubernetes-based Event Driven Autoscaling)
● Supports the Sequential, Local, Celery and Kubernetes Executors
● Several options to mount DAGs:
○ using a git-sync sidecar without persistence
○ from a PVC
● A liveness probe restarts the Scheduler on heartbeat failure
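A hedged sketch of installing the in-repo chart with Helm 3; the value names (executor, workers.keda.enabled) follow the chart's values.yaml at the time, and the namespace is assumed to exist:

    git clone https://github.com/apache/airflow.git
    helm install airflow ./airflow/chart \
      --namespace airflow \
      --set executor=CeleryExecutor \
      --set workers.keda.enabled=true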
Providers Package
Providers Package
● As part of AIP-21, all the contents (hooks/operators/sensors) of the airflow/contrib directory were grouped by provider and moved into airflow/providers packages.
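One example of the pattern (classes from the Google provider; check each provider's docs for exact names):

    # Airflow 1.10 contrib path:
    #   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
    # Airflow 2 providers path:
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryExecuteQueryOperator,
    )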
Backport Providers Package
● Bring Airflow 2.0 providers to 1.10.*
● One package per provider
● 58 packages
● Python 3.6+ only
● Automatically tested in CI
● Released on a separate cadence from Airflow
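Backport packages follow the naming pattern apache-airflow-backport-providers-<provider>, e.g.:

    pip install apache-airflow-backport-providers-google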
Other Notable Changes ...
Other notable changes
● Python 3 only ✔
Python 2.7 has been unsupported upstream since Jan 1, 2020
● "RBAC" UI is now the only UI ✔
Was a config option before; now the only option. Charts/data profiling removed due to security risks
● Improvements to SubDAGs ✔
Were run as a backfill job and didn't always respect Pools / Concurrency limits
● CLI refactor ✔
Commands are now grouped functionally; a few examples below
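Illustrative mappings from the old flat commands to the new grouped ones:

    airflow list_dags                ->  airflow dags list
    airflow trigger_dag <dag_id>     ->  airflow dags trigger <dag_id>
    airflow test <dag> <task> <date> ->  airflow tasks test <dag> <task> <date>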
Road to Airflow 2.0
When will Airflow 2.0 be available?
How to upgrade to 2.0 safely
● Install the latest 1.10 release
● Run airflow upgrade-check (doesn't exist yet; see #8765)
● Fix any warnings
● Upgrade Airflow
Links / References
Links
● Airflow Summit Talks:
○ Keynote: Future of Airflow by Kaxil, Ash, Jarek, Kamil, Tomek & Daniel
○ AIP-31: Airflow functional DAG definition by Gerard Casas Saez
○ Production Docker image for Apache Airflow by Jarek Potiuk
● AIPs (Airflow Improvement Proposals):
○ AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance
○ AIP-21 Changes in import paths
○ AIP-24 DAG Serialization
○ AIP-32 Airflow REST API
○ AIP-36 DAG Versioning
Links
● Airflow
○ Repo: https://github.com/apache/airflow
○ Website: https://airflow.apache.org/
○ Blog: https://airflow.apache.org/blog/
○ Documentation: https://airflow.apache.org/docs/stable/
○ Slack: https://s.apache.org/airflow-slack
○ Twitter: https://twitter.com/apacheairflow
● Contact Me:
○ Twitter: https://twitter.com/kaxil
○ Github: https://github.com/kaxil/
○ LinkedIn: https://www.linkedin.com/in/kaxil/
Thank You!
