Building Robust Pipelines with Airflow

The data science team at Zymergen is applying machine learning techniques to identify genetic targets, work that is supported by extensive analytical automation that systematically identifies outliers, removes process-related bias, and quantifies performance improvements. We’re using Apache Airflow to construct robust data pipelines that allow us to produce clean, reliable inputs to our predictive models. In this talk, I’ll discuss the unique data processing challenges we face in working with high-throughput, biological data and provide an overview of how we’re using Apache Airflow to meet those challenges.

@erinshellman
Wrangle Conf
July 20th, 2017
Building Robust
Pipelines with Airflow

Zymology is the science of fermentation, and it’s applied to
make materials and molecules:
• Beer
• Insulin
• Food additives
• Plastics

Zymergen provides
a platform for
rapid improvement
of microbial strains
through genetic
engineering.

Robotic automation
Our experimentation
is increasingly
orchestrated with
robotics and machine
learning.

Learning how to efficiently
navigate the genome is the mission
of data science at Zymergen

Blocker: process failure
Orchestrating complex experiments with
robots is hard, and there are process failures.
These failures often cause sporadic, extreme
measurement values.
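The slides do not say which outlier rule is used, so the following is only a minimal sketch of one common robust choice: flagging values by their modified z-score, computed from the median and the median absolute deviation (MAD). The function name, the example values, and the 3.5 cutoff are assumptions for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value: True when its modified z-score exceeds threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # No spread around the median, so no value can be scored; flag nothing.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [0.6745 * d / mad > threshold for d in abs_dev]


# Hypothetical measurements from one plate; 42.0 mimics a process failure.
titers = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
flags = flag_outliers(titers)
```

Median-based statistics are used here because a single extreme value barely moves them, whereas it would inflate a mean and standard deviation and could hide itself.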

Blocker: batch effects
We see temporal
effects based on
when experiments
were performed
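As a rough illustration of removing a temporal batch effect, the sketch below subtracts each batch's median from its measurements. This is a deliberately simple stand-in, not the method from the talk; the batch labels and values are made up.

```python
from collections import defaultdict
from statistics import median


def center_by_batch(measurements):
    """Take (batch_id, value) pairs and return them with each batch's median subtracted."""
    by_batch = defaultdict(list)
    for batch, value in measurements:
        by_batch[batch].append(value)
    batch_median = {b: median(vs) for b, vs in by_batch.items()}
    return [(b, v - batch_median[b]) for b, v in measurements]


# Hypothetical runs: week2 reads about 5 units higher than week1 across the board.
runs = [("week1", 10.0), ("week1", 12.0), ("week1", 11.0),
        ("week2", 15.0), ("week2", 17.0), ("week2", 16.0)]
centered = center_by_batch(runs)
```

Centering like this only makes sense when batches contain comparable samples (for example shared controls); otherwise it would also erase real differences between strains.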

Blocker:
different interpretations of results
We’re building a platform that can
support any microbe and any molecule.
Sometimes that results in a proliferation
of solutions with disagreement on which
is best.

Processing pipeline
1. Identify process failures (outlier detection)
2. Quantify and remove process-related bias (normalization)
3. Identify strains that show improvement using consistent criteria (hit detection)
Result: clean model inputs
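To make "consistent criteria" concrete, here is a small sketch of hit detection in which every strain is judged by the same rule: its mean must exceed the control mean by a fixed number of control standard deviations. The rule, the `min_sds` parameter, and the strain names are assumptions, not the criteria used at Zymergen.

```python
from statistics import mean, stdev


def find_hits(strain_values, control_values, min_sds=2.0):
    """Return sorted names of strains whose mean beats the control mean by min_sds control SDs."""
    cutoff = mean(control_values) + min_sds * stdev(control_values)
    return sorted(name for name, vals in strain_values.items() if mean(vals) > cutoff)


# Hypothetical, already cleaned and normalized measurements.
control = [10.0, 10.5, 9.5, 10.0]
strains = {
    "strain_a": [10.2, 10.4, 10.1],
    "strain_b": [12.0, 12.5, 11.8],
}
hits = find_hits(strains, control)
```

Because the cutoff is computed once from the controls and applied to every strain, two analysts running the same data get the same list of hits.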

Rolling our own ETL pipeline
There are many
ways to measure
the concentration of
a molecule.
Any microbe, any
molecule… any
experiment, many
data formats.

Rolling our own ETL pipeline
Describing complex processing dependencies is hard.

Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and
monitor workflows.”
Airflow gives us the flexibility to apply a common
set of processing steps to variable data
inputs and to schedule complex processing
workflows, and it has become a delivery
mechanism for our products.

e.g. Normalization
Airflow workflows are
described as directed
acyclic graphs (DAGs).
Each task node in the
DAG is an operator.
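The idea of a workflow as a directed acyclic graph can be shown without Airflow itself. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to derive a valid run order from declared dependencies; the task names are hypothetical and this is not how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "detect_outliers": {"load_measurements"},
    "normalize": {"detect_outliers"},
    "detect_hits": {"normalize"},
    "publish_report": {"detect_hits", "normalize"},
}

# static_order() yields every task after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the graph has to be acyclic for an order to exist.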

The anatomy of a
DAG
Custom operators
Ordering
Instantiate DAG

Airflow + PyStan
With Bayesian hierarchical models we estimate
(and monitor) the distribution of batch effects.
Experimental bias
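The slides do not give the model itself. As a generic illustration only, a hierarchical model for batch effects is often written along these lines, where every symbol is an assumption introduced here: $y_{ij}$ is measurement $i$ in batch $j$, $\mu$ is the overall mean, $b_j$ is the effect of batch $j$, and $\tau$ and $\sigma$ are the between-batch and within-batch standard deviations.

```latex
\begin{aligned}
y_{ij} &= \mu + b_j + \varepsilon_{ij}, \\
b_j &\sim \mathcal{N}(0, \tau^2), \\
\varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Under a model of this shape, the posterior for each $b_j$ describes how far a batch sits from the overall mean, and tracking $\tau$ over time is one way to monitor whether batch effects are growing or shrinking.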

Dropbox
• Scientists at Zymergen work with data using
many different tools including JMP, SQL, and
Excel.
• We use a custom Dropbox hook to make
quick data ingestion pipelines.

Pairs well with Superset!
“Apache Superset is a
modern, enterprise-ready
business intelligence web
application”
https://github.com/apache/incubator-superset

Fairflow: Functional Airflow
• The core of Fairflow is an
abstract base class foperator
that takes care of
instantiating your Airflow
operators and setting their
dependencies.
• In Fairflow, DAGs are
constructed from foperators
that create the upstream
operators when the final
foperator is called.
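The pattern described in these bullets can be sketched in plain Python. The code below is not Fairflow's API and does not import Airflow: `MiniDag`, `FunctionalTask`, and `Step` are hypothetical stand-ins that only illustrate how calling the final task can create every upstream task and record the dependencies.

```python
from abc import ABC, abstractmethod


class MiniDag:
    """Stand-in for an Airflow DAG: records built tasks and their upstream edges."""

    def __init__(self):
        self.tasks = {}      # task_id -> built operator (a plain dict in this sketch)
        self.upstream = {}   # task_id -> set of upstream task_ids


class FunctionalTask(ABC):
    """Hypothetical analogue of a foperator: knows its upstream tasks, builds them on call."""

    def __init__(self, task_id, *upstream_tasks):
        self.task_id = task_id
        self.upstream_tasks = upstream_tasks

    @abstractmethod
    def build(self):
        """Return the concrete operator for this task."""

    def __call__(self, dag):
        # Register each task once, even when several downstream tasks share it.
        if self.task_id not in dag.tasks:
            dag.tasks[self.task_id] = self.build()
            dag.upstream[self.task_id] = set()
            for parent in self.upstream_tasks:
                parent(dag)  # recursively create the upstream task
                dag.upstream[self.task_id].add(parent.task_id)
        return self


class Step(FunctionalTask):
    def build(self):
        return {"task_id": self.task_id}


load = Step("load")
model_a = Step("fit_model_a", load)
model_b = Step("fit_model_b", load)
compare = Step("compare", model_a, model_b)

dag = MiniDag()
compare(dag)  # calling only the final task wires up everything upstream
```

The appeal of this style is that the workflow is declared by composing task objects, and the graph is materialized in one place, at the moment the last task is applied to a DAG.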

Configuring complex ML
workflows… functionally

Defining ML workflows
In the DAG
definition, create
an instance of the
task.
Then, instantiate a
DAG like usual and
call the compare
task on the DAG.

Defining ML workflows
The design allows for simple creation of
complicated experimental workﬂows with arbitrary
sets of models, parameters, and evaluation metrics.
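One way to picture "arbitrary sets of models, parameters, and evaluation metrics" is as a small configuration that expands into one fitting task per model and parameter set, followed by a single comparison task. The sketch below is an assumption about how such an expansion could look; the model names, parameters, metrics, and task id format are all invented for illustration.

```python
def expand_experiments(param_sets, metrics):
    """Return (fit task descriptions, compare task description) for a model/parameter config."""
    fits = []
    for model, grids in param_sets.items():
        for i, params in enumerate(grids):
            fits.append({"task_id": f"fit_{model}_{i}", "model": model, "params": params})
    compare = {
        "task_id": "compare",
        "metrics": list(metrics),
        # The comparison depends on every fitting task.
        "upstream": [f["task_id"] for f in fits],
    }
    return fits, compare


# Hypothetical experiment configuration.
param_sets = {
    "ridge": [{"alpha": 0.1}, {"alpha": 1.0}],
    "random_forest": [{"n_estimators": 100}],
}
fits, compare_task = expand_experiments(param_sets, metrics=["rmse", "r2"])
```

Adding a model or a parameter set then means editing the configuration only; the fan-out into tasks and the final comparison follow automatically.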

Is Airflow for you?
Do you have heterogeneous data sources?
Do you have complex dependencies between
processing tasks?
Do you have data with different velocities?
Do you have constraints on your time?
Probably!


Designing Great Products: The Power of Design and Leadership by Chief Designe...

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Bits & Pixels using AI for Good.........

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...

The Art of the Pitch: WordPress Relationships and Sales

DevOps and Testing slides at DASA Connect

Essentials of Automations: Optimizing FME Workflows with Parameters

JMeter webinar - integration with InfluxDB and Grafana

PCI PIN Basics Webinar from the Controlcase Team

Building Robust Pipelines with Airflow

1. @erinshellman, Wrangle Conf, July 20th, 2017. Building Robust Pipelines with Airflow

2. Zymology is the science of fermentation, and it’s applied to make materials and molecules: beer, insulin, food additives, and plastics.

4. Zymergen provides a platform for rapid improvement of microbial strains through genetic engineering.

5. Robotic automation: Our experimentation is increasingly orchestrated with robotics and machine learning.

6. Learning how to efficiently navigate the genome is the mission of data science at Zymergen.

7. Blocker: process failure. Orchestrating complex experiments with robots is hard, and there are process failures. These failures often cause sporadic, extreme measurement values.

8. Blocker: batch effects. We see temporal effects based on when experiments were performed.

9. Blocker: different interpretations of results. We’re building a platform that can support any microbe and any molecule. Sometimes that results in a proliferation of solutions with disagreement on which is best.

10. Processing pipeline: (1) Identify process failures. (2) Quantify and remove process-related bias. (3) Identify strains that show improvement using consistent criteria. Stage labels: outlier detection, normalization, hit detection. Output: clean model inputs.
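The three stages on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline from the talk: the modified z-score rule (median and MAD), the per-batch median subtraction, the fixed improvement threshold, and every name below (`flag_outliers`, `normalize_by_batch`, `detect_hits`, `run_pipeline`, the record fields) are hypothetical stand-ins chosen for brevity.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Mark extreme values with the modified z-score (median and MAD based)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing is flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def normalize_by_batch(records):
    """Subtract each batch's median so values are comparable across batches."""
    by_batch = {}
    for r in records:
        by_batch.setdefault(r["batch"], []).append(r["value"])
    batch_median = {b: median(vs) for b, vs in by_batch.items()}
    return [{**r, "value": r["value"] - batch_median[r["batch"]]} for r in records]


def detect_hits(records, min_improvement):
    """Apply one consistent criterion: normalized value at or above a threshold."""
    return sorted({r["strain"] for r in records if r["value"] >= min_improvement})


def run_pipeline(records, min_improvement=1.0):
    """Outlier detection, then normalization, then hit detection."""
    flags = flag_outliers([r["value"] for r in records])
    clean = [r for r, is_outlier in zip(records, flags) if not is_outlier]
    return detect_hits(normalize_by_batch(clean), min_improvement)


# Made-up measurements: batch B reads about 5 units higher than batch A,
# and strain s5 carries a process failure (an extreme value).
records = [
    {"strain": "s1", "batch": "A", "value": 10.0},
    {"strain": "s2", "batch": "A", "value": 10.2},
    {"strain": "s3", "batch": "A", "value": 9.8},
    {"strain": "s4", "batch": "A", "value": 12.0},
    {"strain": "s5", "batch": "A", "value": 95.0},
    {"strain": "s6", "batch": "B", "value": 15.0},
    {"strain": "s7", "batch": "B", "value": 15.1},
    {"strain": "s8", "batch": "B", "value": 14.9},
    {"strain": "s9", "batch": "B", "value": 17.2},
]
hits = run_pipeline(records)
print(hits)
```

The order matters in this sketch: the failed measurement is dropped before batch medians are computed, and the hit criterion is applied only to batch-corrected values, so a strain is not called a hit merely because it ran in a high-reading batch.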

11. Rolling our own ETL pipeline: There are many ways to measure the concentration of a molecule. Any microbe, any molecule… any experiment, many data formats.

12. Rolling our own ETL pipeline: Describing complex processing dependencies is hard.

13. Airflow (https://airflow.incubator.apache.org/): “Airflow is a platform to programmatically author, schedule and monitor workflows.” Airflow gives us the flexibility to apply a common set of processing steps to variable data inputs and to schedule complex processing workflows, and it has become a delivery mechanism for our products.

14. Structure and Flexibility

15. e.g. Normalization: Airflow workflows are described as directed acyclic graphs (DAGs). Each task node in the DAG is an operator.
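To make the DAG idea concrete without depending on an Airflow installation, here is a small sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API: `run_dag`, the task names, and the toy normalization steps are all hypothetical. It only shows what "tasks as nodes, dependencies as edges" buys you, namely an execution order derived from the declared dependencies.

```python
from graphlib import TopologicalSorter
from statistics import mean


def run_dag(tasks, upstream):
    """Run callables in dependency order.

    tasks:    task name -> callable taking the dict of results so far
    upstream: task name -> set of task names that must finish first
    """
    order = list(TopologicalSorter(upstream).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# A toy normalization workflow; each function plays the role of one task node.
tasks = {
    "load_measurements": lambda r: {"A": [10.0, 12.0], "B": [15.0, 17.0]},
    "estimate_batch_effect": lambda r: {
        batch: mean(values) for batch, values in r["load_measurements"].items()
    },
    "normalize": lambda r: {
        batch: [v - r["estimate_batch_effect"][batch] for v in values]
        for batch, values in r["load_measurements"].items()
    },
    "publish": lambda r: sum(len(values) for values in r["normalize"].values()),
}

# Edges of the graph, written as "task: the tasks it depends on".
upstream = {
    "estimate_batch_effect": {"load_measurements"},
    "normalize": {"load_measurements", "estimate_batch_effect"},
    "publish": {"normalize"},
}

order, results = run_dag(tasks, upstream)
print(order)
print(results["normalize"])
```

If the declared dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is the "acyclic" part of the definition enforced at run time.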

16. The anatomy of a DAG: custom operators, ordering, instantiate DAG.

17. Modularity and flexibility

18. Airflow + PyStan: With Bayesian hierarchical models we estimate (and monitor) the distribution of batch effects. (Figure label: Experimental bias)
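The slide does not show the model itself, so the following is a deliberately simplified stand-in rather than the PyStan model from the talk. It illustrates the core behavior of a hierarchical model, partial pooling, using the closed-form posterior mean of a normal-normal model in which the within-batch variance `sigma2` and the between-batch variance `tau2` are assumed known, and the overall mean is plugged in from the data. The function name and all numbers are hypothetical.

```python
from statistics import mean


def shrunken_batch_effects(batches, sigma2, tau2):
    """Estimate each batch's offset from the overall mean with partial pooling.

    batches: batch name -> list of measurements
    sigma2:  assumed within-batch (measurement) variance
    tau2:    assumed between-batch variance of the batch effects
    """
    grand_mean = mean(v for values in batches.values() for v in values)
    effects = {}
    for name, values in batches.items():
        n = len(values)
        # Weight on the batch's own data; approaches 1 as n or tau2 grows.
        weight = (n * tau2) / (n * tau2 + sigma2)
        effects[name] = weight * (mean(values) - grand_mean)
    return effects


batches = {"A": [1.0, 3.0], "B": [5.0, 7.0]}
effects = shrunken_batch_effects(batches, sigma2=2.0, tau2=1.0)
print(effects)
```

With these inputs the raw batch offsets are -2 and +2, and the weight of 0.5 pulls each estimate halfway toward zero. A full hierarchical model would also estimate the variances and report uncertainty, which is what makes monitoring the distribution of batch effects possible.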

19. Dropbox: • Scientists at Zymergen work with data using many different tools, including JMP, SQL, and Excel. • We use a custom Dropbox hook to make quick data ingestion pipelines.

20. Alerting / Communication

21. 3rd-party hooks & operators

22. Operator

23. Pairs well with Superset! “Apache Superset is a modern, enterprise-ready business intelligence web application” https://github.com/apache/incubator-superset

24. Constructing machine learning workflows

25. Fairflow: Functional Airflow • The core of Fairflow is an abstract base class foperator that takes care of instantiating your Airflow operators and setting their dependencies. • In Fairflow, DAGs are constructed from foperators that create the upstream operators when the final foperator is called.
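The pattern described on this slide can be sketched without Airflow or Fairflow installed. The code below is not Fairflow's actual API: `FOperator`, `Step`, and `SketchDag` are hypothetical names, and `SketchDag` merely records task ids and edges where a real implementation would instantiate Airflow operators. What it does show is the described behavior: calling the final object on a DAG recursively creates everything upstream of it and wires the dependencies.

```python
from abc import ABC, abstractmethod


class SketchDag:
    """Stand-in for an Airflow DAG: records task ids and dependency edges."""

    def __init__(self):
        self.tasks = []
        self.edges = []  # (upstream_id, downstream_id) pairs


class FOperator(ABC):
    """Abstract base: knows its upstream steps and how to register itself."""

    def __init__(self, task_id, upstream=()):
        self.task_id = task_id
        self.upstream = list(upstream)

    @abstractmethod
    def create(self, dag):
        """Register the concrete task on the DAG."""

    def __call__(self, dag):
        # Build everything upstream first, then this task, then the edges.
        upstream_ids = [step(dag) for step in self.upstream]
        if self.task_id not in dag.tasks:
            self.create(dag)
        for upstream_id in upstream_ids:
            edge = (upstream_id, self.task_id)
            if edge not in dag.edges:
                dag.edges.append(edge)
        return self.task_id


class Step(FOperator):
    """Simplest concrete step: just records its id on the DAG."""

    def create(self, dag):
        dag.tasks.append(self.task_id)


# Two models trained on shared features, then compared.
features = Step("build_features")
model_a = Step("train_model_a", upstream=[features])
model_b = Step("train_model_b", upstream=[features])
compare = Step("compare", upstream=[model_a, model_b])

dag = SketchDag()
compare(dag)  # one call on the final step builds the whole graph
print(dag.tasks)
print(dag.edges)
```

The shared `build_features` step is reached twice during the recursion but registered once, which is the property that lets many models and metrics fan out from common inputs without duplicate tasks.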

26. Configuring complex ML workflows… functionally

27. Defining ML workflows: In the DAG definition, create an instance of the task. Then instantiate a DAG as usual and call the compare task on the DAG.

28. Defining ML workflows: The design allows for simple creation of complicated experimental workflows with arbitrary sets of models, parameters, and evaluation metrics.

29. Is Airflow for you? Do you have heterogeneous data sources? Do you have complex dependencies between processing tasks? Do you have data with different velocities? Do you have constraints on your time? Probably!

30. Thanks team!
