SlideShare a Scribd company logo
1 of 61
Download to read offline
What’s
coming in
Apache
Airflow 2.0
NYC Meetup
13th of May 2020
Who are we?
Tomek Urbaszek
Committer
Software Engineer @ Polidea
Jarek Potiuk
Committer, PMC member
Principal Software Engineer @ Polidea
Kamil Breguła
Committer
Software Engineer @ Polidea
Ash Berlin-Taylor
Committer, PMC member
Airflow Engineering Lead @ Astronomer
Daniel Imberman
Committer
Senior Data Engineer @ Astronomer
Kaxil Naik
Committer, PMC member
Senior Data Engineer @ Astronomer
High Availability
Scheduler High Availability
Goals:
● Performance - reduce task-to-task schedule "lag"
● Scalability - increase task throughput by horizontal scaling
● Resiliency - kill a scheduler and have tasks continue to be scheduled
Scheduler High Availability: Design
● Active-active model. Each scheduler does everything
● Uses existing database - no new components needed, no extra operational
burden
● Plan to use row-level-locks in the DB
● Will re-evaluate if performance/stress testing show the need
Example HA configuration
Scheduler High Availability: Tasks
● Separate DAG parsing from DAG scheduling
This removes the tie between parsing and scheduling that is still present
● Run a mini scheduler in the worker after each task is completed
A.K.A. "fast follow". Look at immediate down stream tasks of what just finished and see what we can
schedule
● Test it to destruction
This is a big architectural change, we need to be sure it works well.
DAG Serialization
Dag Serialization
Dag Serialization (Tasks Completed)
● Stateless Webserver: Scheduler parses the DAG files, serializes them in JSON format & saves
them in the Metadata DB.
● Lazy Loading of DAGs: Instead of loading an entire DagBag when the Webserver starts we only
load each DAG on demand. This helps reduce Webserver startup time and memory. This
reduction is notable with large number of DAGs.
● Deploying new DAGs to Airflow - no longer requires long restarts of webserver (if DAGs are baked in
Docker image)
● Feature to use the “JSON” library of choice for Serialization (default is inbuilt ‘json’ library)
● Paves way for DAG Versioning & Scheduler HA
Dag Serialization (Tasks In-Progress for Airflow 2.0)
● Decouple DAG Parsing and Serializing from the scheduling loop.
● Scheduler will fetch DAGs from DB
● DAG will be parsed, serialized and saved to DB by a separate component
“Serializer”/ “Dag Parser”
● This should reduce the delay in Scheduling tasks when the number of DAGs
are large
DAG Versioning
Dag Versioning
Current Problem:
● Change in DAG structure affects viewing previous DagRuns too
● Not possible to view the code associated with a specific DagRun
Dag Versioning (Current Problem)
Dag Versioning (Current Problem)
New task is shown in Graph View for older DAG Runs too with “no status”.
Dag Versioning
Current Problem:
● Change in DAG structure affects viewing previous DagRuns too
● Not possible to view the code associated with a specific DagRun
Goal:
● Support for storing multiple versions of Serialized DAGs
● Baked-In Maintenance DAGs to cleanup old DagRuns & associated
Serialized DAGs
● Graph View shows the DAG associated with that DagRun
Performance Improvements
Performance improvements
● Review each component of scheduler in turn and its optimization.
● Perf kit
○ A set of tools that allows you to quickly check the performance of a component
Do you see a performance problem?
Results for DagFileProcessor
When we have one DAG file with 200 DAGs, each DAG with 5 tasks:
Before After Diff
Average time: 8080.246 ms 628.801 ms -7452 ms (92%)
Queries count: 2692 5 -2687 (99%)
How to avoid regression?
REST API
API: follows Open API 3.0 specification
Outreachy interns
Ephraim Anierobi
Omair Khan
Dev/CI environment
CI environment
● Moving to GitHub Actions
○ Kubernetes Tests
○ Easier way to test Kubernetes Tests locally
● Quarantined tests
○ Process of fixing the Quarantined tests
● Thinning CI image
○ Move integrations out of the image (hadoop etc)
● Automated System Tests (AIP-21)
GitHub Actions
Dev environment
● Breeze
○ unit testing
○ package building
○ release preparation
○ refreshing videos
● CodeSpaces integration
Backport Packages
● Bring Airflow 2.0 providers to 1.10.*
● Packages per-provider
● 58 packages (!)
● Python 3.6+ only(!)
● Automatically tested on CI
● Future
○ Automated System Tests (AIP-21)
○ Split Airflow (AIP-8)?
Automated release notes for backport packages
Support for Production
Deployments
Production Image
● Alpha quality image is ready
● Gathering feedback
● Started with “bare image”
● Listening to use cases from users
● Integration with Docker Compose
● Integration with Helm Chart
KEDA Autoscaling
KubernetesExecutor
KubernetesExecutor
KubernetesExecutor
KubernetesExecutor vs. CeleryExecutor
KEDA Autoscaling
● Kubernetes Event-driven Autoscaler
● Scales based on # of RUNNING and QUEUED tasks in PostgreSQL backend
KEDA Autoscaling
KEDA Autoscaling
KEDA Autoscaling
KEDA Queues
● Historically Queues were expensive and hard to allocate
● With KEDA, queues are free! (can have 100 queues)
● KEDA works with k8s deployments so any customization you can make in a
k8s pod, you can make in a k8s queue (worker size, GPU, secrets, etc.)
KubernetesExecutor
Pod Templating from
YAML/JSON
KubernetesExecutor Pod Templating
● In the K8sExecutor currently, users can modify certain parts of the pod, but
many features of the k8s API are abstracted away
● We did this because at the time the airflow community was not well
acquainted with the k8s API
● We want to enable users to modify their worker pods to better match their
use-cases
KubernetesExecutor Pod Templating
● Users can now set the pod_template_file config in their
airflow.cfg
● Given a path, the KubernetesExecutor will now parse the yaml
file when launching a worker pod
● Huge thank you to @davlum for this feature
Official Airflow Helm Chart
Helm Chart
● Donated by astronomer.io.
● This is the official helm chart that we have used both in our
enterprise and in our cloud offerings (thousands of deployments
of varying sizes)
● Users can turn on KEDA autoscaling through helm variables
Helm Chart
● Chart will cut new releases with each airflow release
● Will be tested on official docker image
● Significantly simplifies airflow onboarding process for
Kubernetes users
DAG authoring "sugar"
Functional DAGs
➔ PythonOperator boilerplate code
➔ Define order and data relation
separately
➔ Writing jinja strings by hand
Functional DAGs
No PythonOperator boilerplate code!
Data and order relationship are same!
And works for all operators
Functional DAGs
AIP-31: Airflow functional DAG definition
➔ Easy way to convert a function to
an operator
➔ Simplified way of writing DAGs
➔ Pluggable XCom Storage engine
Example: store and retrieve DataFrames on
GCS or S3 buckets without boilerplate code
Smaller changes
Other changes of note
● Connection IDs now need to be unique
It was often confusing, and there are better ways to do load balancing
● Python 3 only
Python 2.7 unsupported upstream since Jan 1, 2020
● "RBAC" UI is now the only UI.
Was a config option before, now only option. Charts/data profiling removed due to security risks
Road to Airflow 2.0
When will Airflow 2.0 be available?
Airflow 2.0 – deprecate, but (try) not to remove
● Breaking changes should be avoided where we can – if upgrade is to difficult
users will be left behind
● Release "backport providers" to make new code layout available "now":
● Before 2.0 we want to make sure we've fixed everything we want to remove
or break.
pip install apache-airflow-backport-providers-aws 
apache-airflow-backport-providers-google
How to upgrade to 2.0 safely
● Install the latest 1.10 release
● Run airflow upgrade-check (doesn't exist yet)
● Fix any warnings
● Upgrade Airflow
Thank you!
Time for Q & A

More Related Content

What's hot

Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
From business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowFrom business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowDerrick Qin
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High AvailabilityRobert Sanders
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_onpko89403
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composerBruce Kuo
 
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019Jarek Potiuk
 
Data Pipelines with Apache Airflow
Data Pipelines with Apache AirflowData Pipelines with Apache Airflow
Data Pipelines with Apache AirflowManning Publications
 
Elasticsearch features and ecosystem
Elasticsearch features and ecosystemElasticsearch features and ecosystem
Elasticsearch features and ecosystemPavel Alexeev
 

What's hot (20)

Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Airflow and supervisor
Airflow and supervisorAirflow and supervisor
Airflow and supervisor
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
From business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflowFrom business requirements to working pipelines with apache airflow
From business requirements to working pipelines with apache airflow
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High Availability
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019
What's Coming in Apache Airflow 2.0 - PyDataWarsaw 2019
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Data Pipelines with Apache Airflow
Data Pipelines with Apache AirflowData Pipelines with Apache Airflow
Data Pipelines with Apache Airflow
 
AIRflow at Scale
AIRflow at ScaleAIRflow at Scale
AIRflow at Scale
 
Elasticsearch features and ecosystem
Elasticsearch features and ecosystemElasticsearch features and ecosystem
Elasticsearch features and ecosystem
 

Similar to Apache Airflow 2.0 NYC Meetup Recap

19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s worldDávid Kőszeghy
 
What's new in Airflow 2.3?
What's new in Airflow 2.3?What's new in Airflow 2.3?
What's new in Airflow 2.3?Kaxil Naik
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023VMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And BeyondVMware Tanzu
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingStanislav Osipov
 
Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit
 
Extreme Replication - RMOUG Presentation
Extreme Replication - RMOUG PresentationExtreme Replication - RMOUG Presentation
Extreme Replication - RMOUG PresentationBobby Curtis
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding PlatformHotstar
 
Why Airflow? & What's new in Airflow 2.3?
Why Airflow? & What's new in Airflow 2.3?Why Airflow? & What's new in Airflow 2.3?
Why Airflow? & What's new in Airflow 2.3?Kaxil Naik
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Automating using Ansible
Automating using AnsibleAutomating using Ansible
Automating using AnsibleAlok Patra
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineShu-Jeng Hsieh
 

Similar to Apache Airflow 2.0 NYC Meetup Recap (20)

19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
 
What's new in Airflow 2.3?
What's new in Airflow 2.3?What's new in Airflow 2.3?
What's new in Airflow 2.3?
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scaling
 
Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018
 
Extreme Replication - RMOUG Presentation
Extreme Replication - RMOUG PresentationExtreme Replication - RMOUG Presentation
Extreme Replication - RMOUG Presentation
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding Platform
 
Why Airflow? & What's new in Airflow 2.3?
Why Airflow? & What's new in Airflow 2.3?Why Airflow? & What's new in Airflow 2.3?
Why Airflow? & What's new in Airflow 2.3?
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Automating using Ansible
Automating using AnsibleAutomating using Ansible
Automating using Ansible
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
 

Recently uploaded

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 

Recently uploaded (20)

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 

Apache Airflow 2.0 NYC Meetup Recap

  • 2. Who are we? Tomek Urbaszek Committer Software Engineer @ Polidea Jarek Potiuk Committer, PMC member Principal Software Engineer @ Polidea Kamil Breguła Committer Software Engineer @ Polidea Ash Berlin-Taylor Committer, PMC member Airflow Engineering Lead @ Astronomer Daniel Imberman Committer Senior Data Engineer @ Astronomer Kaxil Naik Committer, PMC member Senior Data Engineer @ Astronomer
  • 4. Scheduler High Availability Goals: ● Performance - reduce task-to-task schedule "lag" ● Scalability - increase task throughput by horizontal scaling ● Resiliency - kill a scheduler and have tasks continue to be scheduled
  • 5. Scheduler High Availability: Design ● Active-active model. Each scheduler does everything ● Uses existing database - no new components needed, no extra operational burden ● Plan to use row-level-locks in the DB ● Will re-evaluate if performance/stress testing show the need
  • 7. Scheduler High Availability: Tasks ● Separate DAG parsing from DAG scheduling This removes the tie between parsing and scheduling that is still present ● Run a mini scheduler in the worker after each task is completed A.K.A. "fast follow". Look at immediate down stream tasks of what just finished and see what we can schedule ● Test it to destruction This is a big architectural change, we need to be sure it works well.
  • 10. Dag Serialization (Tasks Completed) ● Stateless Webserver: Scheduler parses the DAG files, serializes them in JSON format & saves them in the Metadata DB. ● Lazy Loading of DAGs: Instead of loading an entire DagBag when the Webserver starts we only load each DAG on demand. This helps reduce Webserver startup time and memory. This reduction is notable with large number of DAGs. ● Deploying new DAGs to Airflow - no longer requires long restarts of webserver (if DAGs are baked in Docker image) ● Feature to use the “JSON” library of choice for Serialization (default is inbuilt ‘json’ library) ● Paves way for DAG Versioning & Scheduler HA
  • 11. Dag Serialization (Tasks In-Progress for Airflow 2.0) ● Decouple DAG Parsing and Serializing from the scheduling loop. ● Scheduler will fetch DAGs from DB ● DAG will be parsed, serialized and saved to DB by a separate component “Serializer”/ “Dag Parser” ● This should reduce the delay in Scheduling tasks when the number of DAGs are large
  • 13. Dag Versioning Current Problem: ● Change in DAG structure affects viewing previous DagRuns too ● Not possible to view the code associated with a specific DagRun
  • 15. Dag Versioning (Current Problem) New task is shown in Graph View for older DAG Runs too with “no status”.
  • 16. Dag Versioning Current Problem: ● Change in DAG structure affects viewing previous DagRuns too ● Not possible to view the code associated with a specific DagRun Goal: ● Support for storing multiple versions of Serialized DAGs ● Baked-In Maintenance DAGs to cleanup old DagRuns & associated Serialized DAGs ● Graph View shows the DAG associated with that DagRun
  • 18. Performance improvements ● Review each component of scheduler in turn and its optimization. ● Perf kit ○ A set of tools that allows you to quickly check the performance of a component
  • 19. Do you see a performance problem?
  • 20.
  • 21. Results for DagFileProcessor When we have one DAG file with 200 DAGs, each DAG with 5 tasks: Before After Diff Average time: 8080.246 ms 628.801 ms -7452 ms (92%) Queries count: 2692 5 -2687 (99%)
  • 22. How to avoid regression?
  • 24. API: follows Open API 3.0 specification Outreachy interns Ephraim Anierobi Omair Khan
  • 26. CI environment ● Moving to GitHub Actions ○ Kubernetes Tests ○ Easier way to test Kubernetes Tests locally ● Quarantined tests ○ Process of fixing the Quarantined tests ● Thinning CI image ○ Move integrations out of the image (hadoop etc) ● Automated System Tests (AIP-21)
  • 28. Dev environment ● Breeze ○ unit testing ○ package building ○ release preparation ○ refreshing videos ● CodeSpaces integration
  • 29. Backport Packages ● Bring Airflow 2.0 providers to 1.10.* ● Packages per-provider ● 58 packages (!) ● Python 3.6+ only(!) ● Automatically tested on CI ● Future ○ Automated System Tests (AIP-21) ○ Split Airflow (AIP-8)?
  • 30. Automated release notes for backport packages
  • 32. Production Image ● Alpha quality image is ready ● Gathering feedback ● Started with “bare image” ● Listening to use cases from users ● Integration with Docker Compose ● Integration with Helm Chart
  • 38.
  • 39. KEDA Autoscaling ● Kubernetes Event-driven Autoscaler ● Scales based on # of RUNNING and QUEUED tasks in PostgreSQL backend
  • 43. KEDA Queues ● Historically Queues were expensive and hard to allocate ● With KEDA, queues are free! (can have 100 queues) ● KEDA works with k8s deployments so any customization you can make in a k8s pod, you can make in a k8s queue (worker size, GPU, secrets, etc.)
  • 45. KubernetesExecutor Pod Templating ● In the K8sExecutor currently, users can modify certain parts of the pod, but many features of the k8s API are abstracted away ● We did this because at the time the airflow community was not well acquainted with the k8s API ● We want to enable users to modify their worker pods to better match their use-cases
  • 46. KubernetesExecutor Pod Templating ● Users can now set the pod_template_file config in their airflow.cfg ● Given a path, the KubernetesExecutor will now parse the yaml file when launching a worker pod ● Huge thank you to @davlum for this feature
  • 48. Helm Chart ● Donated by astronomer.io. ● This is the official helm chart that we have used both in our enterprise and in our cloud offerings (thousands of deployments of varying sizes) ● Users can turn on KEDA autoscaling through helm variables
  • 49. Helm Chart ● Chart will cut new releases with each airflow release ● Will be tested on official docker image ● Significantly simplifies airflow onboarding process for Kubernetes users
  • 51. Functional DAGs ➔ PythonOperator boilerplate code ➔ Define order and data relation separately ➔ Writing jinja strings by hand
  • 52. Functional DAGs No PythonOperator boilerplate code! Data and order relationship are same! And works for all operators
  • 53. Functional DAGs AIP-31: Airflow functional DAG definition ➔ Easy way to convert a function to an operator ➔ Simplified way of writing DAGs ➔ Pluggable XCom Storage engine Example: store and retrieve DataFrames on GCS or S3 buckets without boilerplate code
  • 55. Other changes of note ● Connection IDs now need to be unique It was often confusing, and there are better ways to do load balancing ● Python 3 only Python 2.7 unsupported upstream since Jan 1, 2020 ● "RBAC" UI is now the only UI. Was a config option before, now only option. Charts/data profiling removed due to security risks
  • 57. When will Airflow 2.0 be available?
  • 58. Airflow 2.0 – deprecate, but (try) not to remove ● Breaking changes should be avoided where we can – if upgrade is to difficult users will be left behind ● Release "backport providers" to make new code layout available "now": ● Before 2.0 we want to make sure we've fixed everything we want to remove or break. pip install apache-airflow-backport-providers-aws apache-airflow-backport-providers-google
  • 59. How to upgrade to 2.0 safely ● Install the latest 1.10 release ● Run airflow upgrade-check (doesn't exist yet) ● Fix any warnings ● Upgrade Airflow
  • 60.