Eran Shemesh @ Fyber:
Fyber uses Airflow to manage its entire big data pipeline, including monitoring and auto-fix. The session will describe best practices we implemented in production.
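One common building block behind such monitoring and auto-fix setups is Airflow's own retry and callback machinery. The sketch below is illustrative only, not Fyber's actual code; the task names, the notify_and_remediate callback, and the Airflow 2.x import paths are assumptions:

```python
# Minimal sketch of a monitoring/auto-fix pattern: automatic retries for transient
# failures plus a failure callback that can alert or trigger remediation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_and_remediate(context):
    """Failure callback: receives the task context from Airflow."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed on try {ti.try_number}: {context.get('exception')}")
    # Here one could page an on-call engineer or kick off a cleanup/auto-fix step.


def load_events():
    print("loading events...")


default_args = {
    "owner": "data-eng",
    "retries": 3,                          # retry transient failures automatically
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_and_remediate,
}

with DAG(
    dag_id="events_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(task_id="load_events", python_callable=load_events)
```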
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market... - Itai Yaffe
Tal Sharon (Software Architect), Aviel Buskila (DevOps Engineer) and Max Peres (Data Engineer) @ Nielsen:
At the Nielsen Marketing Cloud, we used to manage our data pipelines via AWS Data Pipeline. Over the years, we’ve encountered several issues with this tool, and a year ago we decided to embark on a journey to replace it with a tool more suitable for our needs.
In this session, we’ll discuss how we actually migrated to Airflow, what challenges we faced and how we mitigated them (and even contributed to the open-source project along the way). We’ll also provide some helpful tips for Airflow users.
How I learned to time travel, or, data pipelining and scheduling with Airflow - Laura Lorenz
****UPDATE: Project is now open sourced at https://www.github.com/industrydive/fileflow****
From Pydata DC 2016
Description
Data warehousing and analytics projects can, like ours, start out small - and fragile. With an organically growing mess of scripts glued together and triggered by cron jobs hiding on different servers, we needed better plumbing. After perusing the data pipelining landscape, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool from Airbnb.
Abstract
The power of any reporting tool depends on the data behind it, so when our data warehousing process got too big for its humble origins, we searched for something better. After testing out several options such as Drake, Pydoit, Luigi, AWS Data Pipeline, and Pinball, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool originating from Airbnb, that provides the benefits of pipeline construction as directed acyclic graphs (DAGs), along with a scheduler that can handle alerting, retries, callbacks and more to make your pipeline robust. This talk will discuss the value of DAG-based pipelines for data processing workflows, highlight useful features in all of the pipelining projects we tested, and dive into some of the specific challenges (like time travel) and successes (like time travel!) we’ve experienced using Airflow to productionize our data engineering tasks. By the end of this talk, you will learn:
- pros and cons of several Python-based/Python-supporting data pipelining libraries
- the design paradigm behind Airflow, an Apache incubating data pipelining and scheduling service, and what it is good for
- some epic fails to avoid and some epic wins to emulate from our experience porting our data engineering tasks to a more robust system
- some quick-start tips for implementing Airflow at your organization.
This presentation covers how to set up an Airflow instance as a cluster spanning multiple machines, instead of the traditional single-machine deployment. In addition, it covers an added step you can take to ensure high availability in that cluster.
Tao Feng gave a presentation on Airflow at Lyft. Some key points:
1) Lyft uses Apache Airflow for ETL workflows with over 600 DAGs and 800 DAG runs daily across three AWS Auto Scaling Groups of worker nodes.
2) Lyft has customized Airflow with additional UI links, DAG dependency graphs, and integration with internal tools.
3) Lyft is working to improve the backfill experience, support DAG-level access controls, and explore running Airflow with Kubernetes executors.
4) Tao discussed challenges like daylight saving time issues and long-running tasks occupying slots, and thanked other Lyft engineers contributing to Airflow.
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py... - Kaxil Naik
Apache Airflow allows users to programmatically author, schedule, and monitor workflows or directed acyclic graphs (DAGs) using Python. It is an open-source workflow management platform developed by Airbnb that is used to orchestrate data pipelines. The document provides an overview of Airflow including what it is, its architecture, and concepts like DAGs, tasks, and operators. It also includes instructions on setting up Airflow and running tutorials on basic and dynamic workflows.
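As a flavor of that programmatic authoring, here is a minimal DAG sketch in the style of the basic tutorial (assuming Airflow 2.x import paths; the task names are placeholders):

```python
# A minimal "hello world" DAG: two tasks defined in Python with an explicit dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    print("transforming data...")


with DAG(
    dag_id="tutorial_basic",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = PythonOperator(task_id="transform", python_callable=transform)

    extract >> load   # run extract before transform
```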
We will introduce Airflow, an Apache project for scheduling and workflow orchestration. We will discuss use cases, applicability and how best to use Airflow, mainly in the context of building data engineering pipelines. We have been running Airflow in production for about 2 years; we will also go over some learnings, best practices and some tools we have built around it.
Speakers: Robert Sanders, Shekhar Vemuri
Building Better Data Pipelines using Apache Airflow - Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
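A rough sketch of how those dependencies and SLAs look in DAG code (illustrative, assuming Airflow 2.x; the two-hour SLA and task names are made up):

```python
# Tasks chained with >>, an SLA on the final task, and retries handled by the scheduler.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_with_sla",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract_a = BashOperator(task_id="extract_a", bash_command="echo a")
    extract_b = BashOperator(task_id="extract_b", bash_command="echo b")
    join = BashOperator(
        task_id="join_and_load",
        bash_command="echo join",
        sla=timedelta(hours=2),   # alert if this task hasn't finished 2h into the run
    )

    [extract_a, extract_b] >> join   # fan-in dependency
```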
Building a Data Pipeline using Apache Airflow (on AWS / GCP) - Yohei Onishi
This is the slide I presented at PyCon SG 2019. I talked about overview of Airflow and how we can use Airflow and the other data engineering services on AWS and GCP to build data pipelines.
Introduction to Apache Airflow, its main concepts and features, and an example of a DAG. Afterwards, some lessons and best practices learned from the 3 years I have been using Airflow to power workflows in production.
Apache Airflow is a platform to author, schedule and monitor workflows as directed acyclic graphs (DAGs) of tasks. It allows workflows to be defined as code making them more maintainable, versionable and collaborative. The rich user interface makes it easy to visualize pipelines and monitor progress. Key concepts include DAGs, operators, hooks, pools and xcoms. Alternatives include Azkaban from LinkedIn and Oozie for Hadoop workflows.
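As a small illustration of two of those concepts, the sketch below uses a pool to cap concurrency and a hook to read a stored connection. It assumes a pool named external_api and a connection my_postgres already exist, and Airflow 2.x import paths:

```python
# A pool limits how many tasks hit an external system at once; a hook looks up
# credentials stored in Airflow rather than in code.
from datetime import datetime

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator


def call_external_api():
    conn = BaseHook.get_connection("my_postgres")   # connection defined via UI/CLI
    print(f"connecting to {conn.host}:{conn.port}")


with DAG(
    dag_id="pool_and_hook_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="call_api",
        python_callable=call_external_api,
        pool="external_api",   # at most <pool slots> such tasks run at once, cluster-wide
    )
```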
In the session, we discussed the end-to-end working of Apache Airflow, focusing mainly on the "Why, What and How" factors. It includes DAG creation and implementation, the architecture, and pros & cons. It also covers how a DAG is created for scheduling a job, what steps are required to create the DAG using a Python script, and finally a working demo.
Airflow Best Practises & Roadmap to Airflow 2.0 - Kaxil Naik
This document provides an overview of new features in Airflow 1.10.8/1.10.9 and best practices for writing DAGs and configuring Airflow for production. It also outlines the roadmap for Airflow 2.0, including dag serialization, a revamped real-time UI, developing a production-grade modern API, releasing official Docker/Helm support, and improving the scheduler. The document aims to help users understand recent Airflow updates and plan their migration to version 2.0.
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri... - Kaxil Naik
From not knowing Python (let alone Airflow), and from submitting a first PR that fixes a typo, to becoming Airflow Committer, PMC Member, Release Manager, and #1 Committer this year, this talk walks through Kaxil’s journey in the Airflow world.
The second part of this talk explains:
- How you can also start your OSS journey by contributing to Airflow
- Expanding familiarity with a different part of the Airflow codebase
- Committing regularly and steadily to become an Airflow Committer (including the current guidelines for becoming a Committer)
- Different mediums of communication (dev list, users list, Slack channel, GitHub Discussions, etc.)
Airflow is a workflow management system for authoring, scheduling and monitoring workflows or directed acyclic graphs (DAGs) of tasks. It has features like DAGs to define tasks and their relationships, operators to describe tasks, sensors to monitor external systems, hooks to connect to external APIs and databases, and a user interface for visualizing pipelines and monitoring runs. Airflow uses a variety of executors like SequentialExecutor, CeleryExecutor and MesosExecutor to run tasks on backends like Celery or Kubernetes. It provides security features like authentication, authorization and impersonation to manage access.
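For example, a sensor-gated DAG might look roughly like this (a sketch assuming Airflow 2.x; the file path and intervals are placeholders):

```python
# The DAG waits for a file to land before processing it; "reschedule" mode frees the
# worker slot between pokes.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_file",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait = FileSensor(
        task_id="wait_for_export",
        filepath="/data/exports/daily.csv",
        poke_interval=300,        # check every 5 minutes
        timeout=60 * 60 * 6,      # give up after 6 hours
        mode="reschedule",
    )
    process = BashOperator(task_id="process", bash_command="echo processing")

    wait >> process
```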
The document discusses upcoming features and changes in Apache Airflow 2.0. Key points include:
1. Scheduler high availability will use an active-active model with row-level locks to allow killing a scheduler without interrupting tasks.
2. DAG serialization will decouple DAG parsing from scheduling to reduce delays, support lazy loading, and enable features like versioning.
3. Performance improvements include optimizing the DAG file processor and using a profiling tool to identify other bottlenecks.
4. The Kubernetes executor will integrate with KEDA for autoscaling and allow customizing pods through templating.
5. The official Helm chart, functional DAGs, and smaller usability changes are also planned.
The document provides an overview of Apache Airflow, an open-source workflow management platform for data pipelines. It describes how Airflow allows users to programmatically author, schedule and monitor workflows or data pipelines via a GUI. It also outlines key Airflow concepts like DAGs (directed acyclic graphs), tasks, operators, sensors, XComs (cross-communication), connections, variables and executors that allow parallel task execution.
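A minimal sketch of XComs and Variables in practice (assuming Airflow 2.x, where the task context is passed automatically; the batch_size Variable is hypothetical):

```python
# One task pushes a value by returning it; the next pulls it from the metadata
# database via XCom. A Variable supplies a runtime-configurable setting.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def produce():
    return {"rows": 12345}            # return value is pushed to XCom automatically


def consume(**context):
    stats = context["ti"].xcom_pull(task_ids="produce")
    batch_size = Variable.get("batch_size", default_var=1000)
    print(f"upstream produced {stats['rows']} rows, batch size {batch_size}")


with DAG(
    dag_id="xcom_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    produce_task = PythonOperator(task_id="produce", python_callable=produce)
    consume_task = PythonOperator(task_id="consume", python_callable=consume)
    produce_task >> consume_task
```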
Slide deck for the fourth data engineering lunch, presented by guest speaker Will Angel. It covered the topic of using Airflow for data engineering. Airflow is a scheduling tool for managing data pipelines.
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodge-podge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
Check out the contents on our browser-based liveBook reader here: https://livebook.manning.com/book/data-pipelines-with-apache-airflow/
Apache Airflow is an open-source workflow management platform developed by Airbnb and now an Apache Software Foundation project. It allows users to define and manage data pipelines as directed acyclic graphs (DAGs) of tasks. The tasks can be operators to perform actions, move data between systems, and use sensors to monitor external systems. Airflow provides a rich web UI, CLI and integrations with databases, Hadoop, AWS and others. It is scalable, supports dynamic task generation and templates, alerting, retries, and distributed execution across clusters.
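Dynamic task generation and templating typically look something like the following sketch (table names are illustrative; {{ ds }} is Airflow's built-in logical-date macro):

```python
# Tasks are created in a loop, and the Jinja-templated {{ ds }} macro injects the
# run's logical date at execution time.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = ["users", "orders", "payments"]

with DAG(
    dag_id="dynamic_exports",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo exporting {table} for {{{{ ds }}}}",
        )
```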
This document provides an overview of building data pipelines using Apache Airflow. It discusses what a data pipeline is, common components of data pipelines like data ingestion and processing, and issues with traditional data flows. It then introduces Apache Airflow, describing its features like being fault tolerant and supporting Python code. The core components of Airflow including the web server, scheduler, executor, and worker processes are explained. Key concepts like DAGs, operators, tasks, and workflows are defined. Finally, it demonstrates Airflow through an example DAG that extracts and cleanses tweets.
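The tweet-cleansing DAG itself isn't reproduced here; as a hypothetical two-step version of the same idea (hard-coded sample tweets, local /tmp paths, Airflow 2.x imports), it could be structured like this:

```python
# Extract writes raw records to a date-stamped file; the cleanse step reads and
# normalizes them. This is a stand-in for the deck's actual tweet pipeline.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_tweets(ds, **_):
    raw = [{"text": "  Airflow is great!!  "}, {"text": "ETL with #Airflow "}]
    with open(f"/tmp/tweets_raw_{ds}.json", "w") as f:
        json.dump(raw, f)


def clean_tweets(ds, **_):
    with open(f"/tmp/tweets_raw_{ds}.json") as f:
        raw = json.load(f)
    cleaned = [{"text": t["text"].strip().lower()} for t in raw]
    print(f"cleaned {len(cleaned)} tweets")


with DAG(
    dag_id="twitter_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_tweets", python_callable=extract_tweets)
    clean = PythonOperator(task_id="clean_tweets", python_callable=clean_tweets)
    extract >> clean
```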
Building an analytics workflow using Apache Airflow - Yohei Onishi
This document discusses using Apache Airflow to build an analytics workflow. It begins with an overview of Airflow and how it can be used to author workflows through Python code. Examples are shown of using Airflow to copy files between S3 buckets. The document then covers setting up a highly available Airflow cluster, implementing continuous integration/deployment, and monitoring workflows. It emphasizes that Google Cloud Composer can simplify deploying and managing Airflow clusters on Google Kubernetes Engine and integrating with other Google Cloud services.
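An S3-to-S3 copy of the kind described can be sketched with S3Hook from the Amazon provider package; the bucket names, key layout and the aws_default connection are placeholders, not the deck's actual setup:

```python
# Copies the day's export between buckets. Requires apache-airflow-providers-amazon
# and an AWS connection configured in Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def copy_daily_export(ds, **_):
    hook = S3Hook(aws_conn_id="aws_default")
    hook.copy_object(
        source_bucket_name="raw-bucket",
        source_bucket_key=f"exports/{ds}/data.csv",
        dest_bucket_name="analytics-bucket",
        dest_bucket_key=f"landing/{ds}/data.csv",
    )


with DAG(
    dag_id="s3_copy_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="copy_daily_export", python_callable=copy_daily_export)
```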
The document discusses using Nginx and Ansible to easily implement a lightweight maintenance mode. It describes configuring Nginx to deny all requests except from allowed IP ranges by modifying the Nginx configuration file. Ansible playbooks are used to automatically deploy the maintenance configuration by copying template configuration files to destination servers. This allows developers to simply run one Ansible command to enable or disable maintenance mode rather than manually deploying new application versions.
This document discusses Apache Airflow and its use at Dailymotion. It provides an agenda that covers data at Dailymotion, Apache Airflow, how Airflow is used at Dailymotion, deployment of Airflow at Dailymotion, working on a DAG (directed acyclic graph) pipeline, and an example pipeline for Dailymotion's new Advanced Analytics project. The example pipeline aggregates data from different sources with varying frequencies and timezones into BigQuery and Exasol for visualization in Tableau.
Apache Airflow (incubating) NL HUG Meetup 2016-07-19 - Bolke de Bruin
Introduction to Apache Airflow (Incubating), best practices and roadmap. Airflow is a platform to programmatically author, schedule and monitor workflows.
How I learned to time travel, or, data pipelining and scheduling with Airflow - PyData
This document discusses how the author learned to use Airflow for data pipelining and scheduling tasks. It describes some early tools like Cron and Luigi that were used for scheduling. It then evaluates options like Drake, Pydoit, Pinball, Luigi, and AWS Data Pipeline before settling on Airflow due to its sophistication in handling complex dependencies, built-in scheduling and monitoring, and flexibility. The author also develops a plugin called smart-airflow to add file-based checkpointing capabilities to Airflow to track intermediate data transformations.
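The real implementation lives in the smart-airflow/fileflow project linked above; purely as an illustration of the file-based checkpointing idea (not that project's actual API), a custom operator could derive a deterministic path per task instance and persist intermediate results there:

```python
# Rough illustration only: each task writes its output to a path derived from
# dag_id/task_id/execution date, so downstream tasks and reruns can read intermediate
# data instead of recomputing it.
import json
import os

from airflow.models import BaseOperator


class CheckpointedOperator(BaseOperator):
    """Subclasses implement work() and get file-based checkpointing for free."""

    BASE_DIR = "/data/checkpoints"   # illustrative location

    def _path(self, context, task_id):
        return os.path.join(self.BASE_DIR, self.dag_id, task_id, f"{context['ds']}.json")

    def read_upstream(self, context, task_id):
        with open(self._path(context, task_id)) as f:
            return json.load(f)

    def execute(self, context):
        result = self.work(context)
        out_path = self._path(context, self.task_id)
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        with open(out_path, "w") as f:
            json.dump(result, f)
        return out_path

    def work(self, context):
        raise NotImplementedError
```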
It's a Breeze to develop Apache Airflow (London Apache Airflow meetup) - Jarek Potiuk
This talk is about the tools and mechanisms we developed and used to improve productivity and teamwork in our team (of 6 currently) while developing 70+ operators for Airflow over more than 6 months.
We developed an "Airflow Breeze" simplified development environment, which cuts down the time to become a productive Apache Airflow developer from days to minutes.
It is part of Airflow Improvement Proposals:
AIP-10 Multi-layered and multi-stage official Airflow image
AIP-7 Simplified development workflow
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020] - Noam Elfanbaum
Building an ML analytics platform into production using Apache Airflow at Bluevine. This includes:
- Migrating our ML workload to Airflow
- Hacking at Airflow to provide a semi-streaming solution
- Monitoring business sensitive processes
Container Orchestration from Theory to Practice - Docker, Inc.
Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice and walk away with more insights into your production applications.
One of the most boring things in software development in large companies is following bureaucracy. Tons of developers have been worn down by that ruthless machine with its not-always-obvious rules. That's why we decided to delegate all the boring work to machines instead of humans, and the talk will cover the results we achieved.
Heart of the SwarmKit: Store, Topology & Object Model - Docker, Inc.
Heart of the SwarmKit: Store, Topology & Object Model by Aaron, Andrea, Stephen D (Docker)
Swarmkit repo - https://github.com/docker/swarmkit
Liveblogging: http://canopy.mirage.io/Liveblog/SwarmKitDDS2016
Flux architecture and Redux - theory, context and practice - Jakub Kocikowski
Flux Architecture changes how we think about data in frontend applications. In the talk I will cover the theory — what Flux Architecture is, what the driving principles behind it are, and how it compares to other patterns in the software development landscape. And the practice — what implementation decisions made Redux the most popular implementation of the pattern, and whether you need Redux to use Flux in your project.
And finally I will try to answer the most important question: when will Flux add value to your project and when it just adds unnecessary complexity?
Managing Apache Spark Workload and Automatic Optimizing - Databricks
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which contain over 22PB of data (compressed) and keep growing every year. In the machine learning domain, Spark is playing an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Europe Summit. Furthermore, looking at the entire infrastructure, managing workload and efficiency for all Spark jobs across our data center is still a big challenge. Our team leads the big data platform infrastructure and the management tools on top of it, helping our customers -- not only DW engineers and data scientists, but also AI engineers -- work from the same page. In this session, we will introduce how to benefit all of them with a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time. We developed a component called Profiler which enhances Spark core to support customized metric collection. Next, we will demonstrate some real user stories from eBay to show how the self-service system reduces effort on both the customer side and the infra-team side. That's the highlight part about Spark job analysis and diagnosis. Finally, some upcoming advanced features will be introduced, describing an automatic optimization workflow rather than just alerting.
Speaker: Lantao Jin
This document is a presentation about Gearman, an open source application framework for distributing tasks to multiple machines or processes. The presentation covers what Gearman is; its main concepts of client-daemon-worker communication and the distributed model; how to do a quick start with Gearman, including installation and a simple PHP example; deeper topics like persistence, workers and monitoring; and PHP integration, including usage, frameworks, handling conditions, and use cases like image processing and log analysis. The presenter provides contact details for more information and asks if there are any questions.
Airflow is a platform for authoring, scheduling, and monitoring workflows or data pipelines. It uses a directed acyclic graph (DAG) to define dependencies between tasks and schedule their execution. The UI provides dashboards to monitor task status and view workflow histories. Hands-on exercises demonstrate installing Airflow and creating sample DAGs.
Improving the performance of Rails web Applications - John McCaffrey
This presentation is the first in a series on Improving Rails application performance. This session covers the basic motivations and goals for improving performance, the best way to approach a performance assessment, and a review of the tools and techniques that will yield the best results. Tools covered include: Firebug, yslow, page speed, speed tracer, dom monster, request log analyzer, oink, rack bug, new relic rpm, rails metrics, showslow.org, msfast, webpagetest.org and gtmetrix.org.
The upcoming sessions will focus on:
- Improving SQL queries and ActiveRecord use
- Improving general Rails/Ruby code
- Improving the front-end
And a final presentation will cover how to be a more efficient and effective developer!
This series will be compressed into a best-of session for the 2010 http://windycityRails.org conference.
Paris.rb – 07/19 – Sidekiq scaling, workers vs processes - Maxence Haltel
Presentation given to Paris.RB meetup in July 2019.
- How to scale Sidekiq to handle millions of jobs?
- Is there a magic recipe to do API and computing jobs?
- Can we be cost-sensitive in scaling?
Presentation of a research protocol with observations and results.
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink - Flink Forward
Flink provides fault tolerance guarantees through checkpointing and recovery mechanisms. Checkpoints take consistent snapshots of distributed state and data, while barriers mark checkpoints in the data flow. This allows Flink to recover jobs from failures and resume processing from the last completed checkpoint. Flink also implements high availability by persisting metadata like the execution graph and checkpoints to Apache Zookeeper, enabling a standby JobManager to take over if the active one fails.
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w... - Data Con LA
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Real-time Stream Processing using Apache Apex - Apache Apex
Apache Apex is a stream processing framework that provides high performance, scalability, and fault tolerance. It uses YARN for resource management, can achieve single digit millisecond latency, and automatically recovers from failures without data loss through checkpointing. Apex applications are modeled as directed acyclic graphs of operators and can be partitioned for scalability. It has a large community of committers and is in the process of becoming a top-level Apache project.
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018 - javier ramirez
In this talk I will present the architecture that allows runners to execute a Beam pipeline. I will explain what needs to happen in order for a compatible runner to know which transforms to run, how to pass data from one step to the next, and how beam allows runners to be SDK agnostic when running pipelines.
This document discusses various metrics for evaluating computer performance, with a focus on latency. It defines latency as the time it takes a computer to perform a single task and discusses how latency is measured. Latency is important for application responsiveness, real-time applications, and other situations where waiting time matters. The document also introduces the performance equation that models latency in terms of architectural parameters like instructions, clock cycles, and clock frequency.
Testing Persistent Storage Performance in Kubernetes with Sherlock - ScyllaDB
Getting to understand your Kubernetes storage capabilities is important in order to run a proper cluster in production. In this session I will demonstrate how to use Sherlock, an open source platform written to test persistent NVMe/TCP storage in Kubernetes, either via synthetic workloads or via a variety of databases, all easily done and summarized to give you an estimate of the IOPS, latency and throughput your storage can provide to the Kubernetes cluster.
The document discusses various techniques for optimizing web performance and React applications. It covers topics like loading time, rendering time, dev tools, React tools, the latest features in React 17 and 18 like the new root API and startTransition API. It also discusses best practices for performance optimization in React like using pure components, React.memo, lazy loading, throttling events, debouncing events, and virtualization. Code snippets are provided as examples for some of these techniques.
Understanding of linux kernel memory model - SeongJae Park
SeongJae Park introduces himself and his work contributing to the Linux kernel memory model documentation. He developed a guaranteed contiguous memory allocator and maintains the Korean translation of the kernel's memory barrier documentation. The document discusses how the increasing prevalence of multi-core processors requires careful programming to ensure correct parallel execution given relaxed memory ordering. It notes that compilers and CPUs optimize for instruction throughput over programmer goals, and memory accesses can be reordered in ways that affect correctness on multi-processors. Understanding the memory model is important for writing high-performance parallel code.
Similar to Fyber - airflow best practices in production (20)
Mastering Partitioning for High-Volume Data Processing - Itai Yaffe
Yulia Antonovsky (Senior Software Engineer II) @ Akamai:
Our cloud-based ingest pipeline processes over 10 Gb of security events data per second, which demands high-performance processing and analysis. To achieve this, we've implemented efficient partitioning using Java and Spark applications running on AKS and leveraging Kafka. This allows us to provide real-time analytics within two minutes and heavy batch processing for deeper analysis hourly. During this talk, we will cover how we use Kafka to scale our Spark application on K8s, partitioning strategies for high-volume data processing, and how partitioning helps avoid storage throttling issues.
Solving Data Engineers Velocity - Wix's Data Warehouse Automation - Itai Yaffe
This document discusses Wix's solution to automate data warehouse maintenance called Data Warehouse Automation (DWHA). It consists of three components - BI Bank, Metric Collector, and DWHA. BI Bank defines data sources and semantics. Metric Collector extracts metrics from sources efficiently. DWHA understands changes between runs, aggregates data, and handles different table types and changes over time. The presenter demonstrates how DWHA streamlines maintenance by automatically handling differences between runs. They also discuss additional DWHA capabilities and plans for a UI.
Lessons Learnt from Running Thousands of On-demand Spark Applications - Itai Yaffe
Ada Sharoni (Software Engineering Architect) @ Hunters:
Imagine you had to manage thousands of Spark applications that are automatically spinning up on-demand upon every customer interaction.
Our unique constraints in Hunters have led us to adopt an architecture and concepts that we believe many other companies will find useful.
In this lecture we will share our solutions and insights in running many lightweight, cheap Spark applications on Kubernetes, that can easily survive frequent restarts and smartly share resources on Spot EC2 instances.
Why do the majority of Data Science projects never make it to production? - Itai Yaffe
María de la Fuente (Solutions Architect Manager for IMEA) @ Databricks:
While most companies understand the value creation of leveraging data and are taking on board an AI strategy, only 13% of the data science projects make it to production successfully.
Besides the well-known skills gap in the market, we need to level up our end-to-end approach and cover all aspects involved when working with AI.
In this session, we will discuss the main obstacles to overcome and how we can avoid the major pitfalls to ensure our data science journey becomes successful.
Planning a data solution - "By Failing to prepare, you are preparing to fail" - Itai Yaffe
Eynav Mass (VP R&D) @ Oribi:
When it comes to data solutions, one-size doesn't fit all.
Choosing the right best-matching database, or data tools, can be a game-changer for your system.
How can you take such a decision effectively?
The system, the company, the product, and probably your team - all are evolving, and the best solution for today may not fit tomorrow's needs.
In order to pick a data solution for longer term, you should evaluate the optional data tools according to several factors.
These factors will reflect the requirements looking forward.
At the session, we will discuss these factors, along with sharing some real-life stories and lessons learned, to help you properly plan & prepare your data solutions.
Evaluating Big Data & ML Solutions - Opening Notes - Itai Yaffe
Esther Sánchez (Global Executive Committee member & Spain Chapter Lead) and Itai Yaffe (Israel Chapter Lead) @ Women in Big Data:
Opening notes about the digital era, diversity, and Women in Big Data (https://www.womeninbigdata.org/).
Big data serving: Processing and inference at scale in real time - Itai Yaffe
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
Data Lakes on Public Cloud: Breaking Data Management Monoliths - Itai Yaffe
Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud:
The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings.
In this session, Sharon will share her personal journey as a data professional since the 90s weaved into the history of data management systems.
The session will also cover the differences between on-premise and cloud Data Lakes.
Orit Alul (Sr. Solutions Architect) @ AWS:
As data is growing at an exponential rate, we are interested not only in being able to analyze the past or present but also in predicting the future!
In this session, Orit will talk about the power of data combined with machine learning: building a highly scalable and flexible data architecture in the cloud to collect, process, and analyze data, in order to get timely insights and react quickly to new information.
In addition, Orit will present best practices, performance and optimization tips for building a Data Lake in the cloud.
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ... - Itai Yaffe
Roi Teveth (Data Engineer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Identity Engine, we use Spark to process 10’s of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day.
In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid - Itai Yaffe
Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
Every day, millions of advertising campaigns are happening around the world.
As campaign owners, measuring the ongoing campaign effectiveness (e.g "how many distinct users saw my online ad VS how many distinct users saw my online ad, clicked it and purchased my product?") is super important.
However, this task (often referred to as "funnel analysis") is not an easy task, especially if the chronological order of events matters.
So, while the combination of Druid and ThetaSketch aggregators can answer some of these questions, it still can’t answer the question "how many distinct users viewed the brand’s homepage FIRST and THEN viewed product X page?"
In this talk, we will discuss how we combine Spark, Druid and ThetaSketch aggregators to answer such questions at scale.
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything) - Itai Yaffe
Itai Yaffe (Tech Lead, Big Data group) @ Nielsen and David Bar (Software Architect) @ ForeScout:
At this Ask Me Anything-style virtual meetup, Itai and David answered questions about Apache Druid from the unique perspectives of an open-source Druid user (Itai) and Imply customer (David).
Introducing Kafka Connect and Implementing Custom Connectors - Itai Yaffe
Kobi Hikri (Independent Software Architect and Consultant):
Kobi provides a short intro to Kafka Connect, and then shows an actual code example of developing and dockerizing a custom connector.
A Day in the Life of a Druid Implementor and Druid's Roadmap - Itai Yaffe
This document summarizes a typical day for a Druid architect. It describes common tasks like evaluating production clusters, analyzing data and queries, and recommending optimizations. The architect asks stakeholders questions to understand usage and helps evaluate if Druid is a good fit. When advising on Druid, the architect considers factors like data sources, query types, and technology stacks. The document also provides tips on configuring clusters for performance and controlling segment size.
Dr. Edward (Eddie) Bortnikov (Senior Director of Research) @ Verizon Media:
Ingestion and queries of real-time data in Druid are performed by a core software component named Incremental Index (I^2).
I^2’s scalability is paramount to the speed of the ingested data becoming queryable as well as to the operational efficiency of the Druid cluster.
The current I^2 Implementation is based on the traditional ordered JDK key-value (KV-)map.
We present an experimental I^2 implementation that is based on a novel data structure named OakMap - a scalable thread-safe off-heap KV-map for Big Data applications in Java.
With OakMap, I^2 can ingest data at almost 2x speed while using 30% less RAM.
The project is expected to become GA in 2020.
The benefits of running Spark on your own Docker - Itai Yaffe
Shir Bromberg (Big Data team leader) @ Yotpo:
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.
The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive, with potentially high setup time. Therefore, we developed a way to run Spark on any container orchestration platform. This allows us to run Spark in a simple, custom and testable way.
In this talk, we will present our open-source dockers for running Spark on Nomad servers. We will cover:
* The issues we had running Spark on managed clusters and the solution we developed.
* How to build a Spark Docker image.
* And finally, what you may achieve by using Spark on Nomad.
Optimizing Spark-based data pipelines - are you up for it? - Itai Yaffe
Etti Gur (Senior Big Data developer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) real-time analytics tools to measure their ongoing campaigns' efficiency.
To achieve that, we need to ingest billions of events per day into our big data stores and we need to do it in a scalable yet cost-efficient manner.
In this talk, we will discuss how we significantly optimized our Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 2 hours, resulting in a huge cost reduction.
Topics include:
* Ways to identify optimization opportunities
* Optimizing Spark resource allocation
* Parallelizing Spark output phase with dynamic partition inserts (see the sketch after this list)
* Running multiple Spark "jobs" in parallel within a single Spark application
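A PySpark sketch of the dynamic-partition-inserts technique mentioned above (paths and column names are illustrative, not Nielsen's actual pipeline):

```python
# With partitionOverwriteMode=dynamic, only the partitions present in the output
# DataFrame are overwritten, so independent date/country partitions can be written
# without truncating the whole dataset.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-inserts")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

events = spark.read.parquet("s3://example-bucket/raw/events/")

(
    events
    .repartition("dt", "country")        # group output work by target partition
    .write
    .mode("overwrite")                   # with "dynamic", touches only matching partitions
    .partitionBy("dt", "country")
    .parquet("s3://example-bucket/analytics/events/")
)
```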
Scheduling big data workloads on serverless infrastructure - Itai Yaffe
Ilai Malka from Nielsen at AWS Community Day TLV, December 2019 (https://awscommunitydaytelaviv2019.splashthat.com/):
Scheduling big data workloads is challenging. It's extra challenging when running on Serverless infrastructure.
At Nielsen Marketing Cloud, we've built a system that uploads 250 billion events per day to partner ad platforms, running on Serverless infrastructure (AWS Lambda and OpenFaaS).
Creating a 'scheduler' for this system required:
1. Rate-limiting to prevent flooding partner platforms.
2. High utilization to keep costs low
3. Careful bottleneck management to keep the system humming
https://www.linkedin.com/in/ilai-malka-93b06172/
https://twitter.com/IlaiMalka
#Nielsen #NielsenMarketingCloud #AWSCommunityDay #Serverless
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
The Ipsos - AI - Monitor 2024 Report.pdf - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and the worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
5. Why?
The cron way
■ Each valid flow takes more time than it should
■ Each job has to be aware of the buffer between its scheduled execution time and its actual working time
■ If a single task in the flow needs a retry, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, due to a product requirement change, I need to re-run the past X runs?
■ Visibility
6. Why?
The airflow way
■ Tasks are truly dependent on each other
■ Easily scalable
■ Web UI
■ Can recover from downtime
7. Why?
The airflow way
■ Each valid flow takes more time than it should
■ Each job has to be aware of the buffer between its scheduled execution time and its actual working time
■ If a single task in the flow needs a retry, the whole flow can fail
■ What if the buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, due to a product requirement change, I need to re-run the past X runs?
8. Hello Airflow
■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator)
■ Extract the Databricks task_id from the response (PythonOperator)
■ Monitor task progress by task id (HttpSensor)
■ On success, get the result (SimpleHttpOperator)
■ Extract the result from the HttpResponse (PythonOperator)
SimpleHttpOperator → PythonOperator → HttpSensor → SimpleHttpOperator → PythonOperator (a DAG sketch follows)
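A minimal sketch of how that chain could look as an Airflow DAG, assuming an HTTP connection named databricks_api and hypothetical endpoints and response fields (this is not the original deck's code; on Airflow 1.10 the imports and context passing differ slightly):

from datetime import datetime
import json

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor


def extract_run_id(ti, **_):
    # Pull the submit response and push the Databricks run id for downstream tasks
    response = json.loads(ti.xcom_pull(task_ids="submit_job"))
    ti.xcom_push(key="run_id", value=response["run_id"])  # field name is an assumption


with DAG("hello_airflow", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    submit_job = SimpleHttpOperator(
        task_id="submit_job",
        http_conn_id="databricks_api",            # assumed Airflow connection
        endpoint="api/2.1/jobs/run-now",          # hypothetical endpoint
        method="POST",
        data=json.dumps({"job_id": 123}),
    )
    extract_id = PythonOperator(task_id="extract_run_id", python_callable=extract_run_id)
    wait_for_completion = HttpSensor(
        task_id="wait_for_completion",
        http_conn_id="databricks_api",
        endpoint="api/2.1/jobs/runs/get",
        request_params={"run_id": "{{ ti.xcom_pull(task_ids='extract_run_id', key='run_id') }}"},
        response_check=lambda r: r.json().get("state", {}).get("life_cycle_state") == "TERMINATED",
        poke_interval=60,
    )
    get_result = SimpleHttpOperator(
        task_id="get_result",
        http_conn_id="databricks_api",
        endpoint="api/2.1/jobs/runs/get-output",  # hypothetical endpoint
        method="GET",
        data={"run_id": "{{ ti.xcom_pull(task_ids='extract_run_id', key='run_id') }}"},
    )
    parse_result = PythonOperator(
        task_id="parse_result",
        python_callable=lambda ti, **_: json.loads(ti.xcom_pull(task_ids="get_result")),
    )

    submit_job >> extract_id >> wait_for_completion >> get_result >> parse_result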
13. Retryable Sub DAGs
■ There is no retry mechanism at the DAG level, only at the task level
■ Out of the box, a sub DAG does not retry well
■ We utilized the sub DAG's on_retry_callback to implement its retry mechanism when needed (a sketch follows)
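A hedged sketch of that idea, assuming the Airflow 1.10-era SubDagOperator; the callback name is hypothetical. On each retry of the sub DAG task, the sub DAG's task instances are cleared so the whole sub DAG runs again:

from airflow.models import DagBag


def clear_subdag_on_retry(context):
    # By convention the sub DAG's dag_id is "<parent_dag_id>.<subdag_task_id>"
    subdag_id = "{}.{}".format(context["dag"].dag_id, context["task"].task_id)
    execution_date = context["execution_date"]
    subdag = DagBag().get_dag(subdag_id)
    # Clearing the task instances for this execution date makes them eligible to rerun
    subdag.clear(start_date=execution_date, end_date=execution_date)


# Attached to the SubDagOperator, for example:
# SubDagOperator(task_id="monitoring", subdag=build_monitoring_subdag(...),
#                retries=2, on_retry_callback=clear_subdag_on_retry, dag=dag)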
15. Sub DAGs - use with caution!
[Diagram, slides 15-17: each running sub DAG and each task occupies a slot in the worker's concurrency level, so as more sub DAGs run they gradually take over all of the worker's slots]
18. Sub DAGs - use with caution!
[Diagram: sub DAG tasks filling the worker's thread pool alongside regular tasks]
Airflow 1.10's default solution: SequentialExecutor (one process to run them all)
19. Sub DAGs - use with caution!
[Diagram: spreading sub DAGs and tasks across multiple workers, each with its own concurrency level]
Second option - add more workers!
26. Building modules
DAG extensions
■ A template of tasks and the dependencies between them
■ Using the template method design pattern, the module dictates the general flow, to be implemented by different business-logic subclasses (a sketch follows)
■ Most commonly used inside a sub DAG, like in the monitoring example
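A rough sketch of what such a module might look like; the class and task names are hypothetical, not the original implementation. The base class fixes the task layout, and subclasses plug in the business logic:

from abc import ABC, abstractmethod

from airflow.models import DAG
from airflow.operators.python import PythonOperator


class JobModule(ABC):
    """A template of tasks and the dependencies between them (hypothetical example)."""

    def build(self, dag: DAG, prefix: str):
        # Template method: the module dictates the general flow submit -> monitor -> handle
        submit = PythonOperator(task_id=prefix + "_submit", python_callable=self.submit, dag=dag)
        monitor = PythonOperator(task_id=prefix + "_monitor", python_callable=self.monitor, dag=dag)
        handle = PythonOperator(task_id=prefix + "_handle_result", python_callable=self.handle_result, dag=dag)
        submit >> monitor >> handle
        return submit, handle  # entry and exit tasks, so the module can be wired into a (sub) DAG

    @abstractmethod
    def submit(self, **context):
        ...

    @abstractmethod
    def monitor(self, **context):
        ...

    @abstractmethod
    def handle_result(self, **context):
        ...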
31. Use case 1: Skipping daily tasks
Hourly and daily flow
■ Each hourly run calculates the hourly aggregation and then the daily aggregation
■ When fixing data or when task runs are delayed, it's unnecessary to calculate partial daily aggregations
■ Using the ShortCircuitOperator, we check whether the next execution should already have happened (sketched after this list)
■ If it should have, we skip all the following tasks in the same DAG run
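A minimal sketch of how such a check could be wired with the ShortCircuitOperator, assuming an hourly schedule; the context variable names follow Airflow 1.10/2.x (next_execution_date was later superseded by data_interval_end), and the DAG and task names are made up:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.utils import timezone


def should_run_daily_agg(next_execution_date, **_):
    # The next hourly run would normally start one interval after next_execution_date;
    # if that moment has already passed, we are catching up (backfill or delay), so skip
    # the partial daily aggregation. Returning False skips all downstream tasks.
    return timezone.utcnow() < next_execution_date + timedelta(hours=1)


with DAG("hourly_and_daily_flow", start_date=datetime(2024, 1, 1),
         schedule_interval="@hourly", catchup=False) as dag:
    check_latest = ShortCircuitOperator(
        task_id="skip_daily_if_catching_up",
        python_callable=should_run_daily_agg,
    )
    # hourly_agg >> check_latest >> daily_agg   (aggregation tasks omitted)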
32.-34. Use case 1: Skipping daily tasks - Hourly and daily flow [diagram slides]
35. Use case 2: Programmatically clearing a DAG
S3/{bucket_name}/day=23
S3/{bucket_name}/day=22
S3/{bucket_name}/day=21
S3/{bucket_name}/day=10
36. Use case 2: Programmatically clearing a DAG
■ Creating a DAG that executes a single day's flow
■ That DAG is scheduled by another DAG (not by Airflow's scheduler)
■ The scheduling DAG would:
○ Create a new run of the target DAG for each day
○ Clear the target DAG's runs for the previous 14 days
37. Use case 2: Programmatically clearing a DAG
Using another DAG to clear the target DAG for the last 14 days:
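A hedged sketch of what that scheduling DAG could look like; the DAG ids and the exact clearing call are assumptions, not the original implementation:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagBag
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

TARGET_DAG_ID = "single_day_flow"   # assumed id of the single-day DAG


def clear_previous_days(execution_date, **_):
    # Clearing the target DAG's task instances for the last 14 days makes Airflow
    # rerun those days with the (possibly corrected) data now sitting in S3
    target = DagBag().get_dag(TARGET_DAG_ID)
    target.clear(start_date=execution_date - timedelta(days=14), end_date=execution_date)


with DAG("single_day_flow_scheduler", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    trigger_today = TriggerDagRunOperator(
        task_id="trigger_today",
        trigger_dag_id=TARGET_DAG_ID,
        execution_date="{{ ds }}",
    )
    clear_history = PythonOperator(
        task_id="clear_previous_14_days",
        python_callable=clear_previous_days,
    )
    trigger_today >> clear_history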
39. Tips and best practices
■ Create only idempotent tasks
■ Note that the worker only creates an OS process for each task
■ Always set retries on tasks; workers can fail! (see the example after this list)
■ Use Connections to store passwords and secret keys (they can be stored encrypted)
■ Note that your Python files get executed constantly by the scheduler
■ Use a Docker Compose environment on your dev machine
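To make a couple of these tips concrete, a small hypothetical example with default retries and credentials read from an Airflow Connection; the connection name and DAG id are assumptions:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,                        # workers can fail, so always retry
    "retry_delay": timedelta(minutes=5),
}


def call_partner_api():
    # Credentials come from a Connection named "partner_api" (which Airflow can store
    # encrypted), not from the DAG file; the task itself should stay idempotent.
    conn = BaseHook.get_connection("partner_api")
    print("Calling {} as {}".format(conn.host, conn.login))


with DAG("tips_example", start_date=datetime(2024, 1, 1), schedule_interval="@daily",
         catchup=False, default_args=default_args) as dag:
    PythonOperator(task_id="call_partner_api", python_callable=call_partner_api)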