Vladislav Supalov introduces data pipeline architecture and workflow engines like Luigi. He discusses how custom scripts are problematic for maintaining data pipelines and recommends using workflow engines instead. Luigi is presented as a Python-based workflow engine that was created at Spotify to manage thousands of daily Hadoop jobs. It provides features like parameterization, email alerts, dependency resolution, and task scheduling through a central scheduler. Luigi aims to minimize boilerplate code and make pipelines testable, versioning-friendly, and collaborative.
Workflow Engines + Luigi
1. Workflow Engines + Luigi
A broad overview and a brief introduction.
Vladislav Supalov, 8 March 2016
2. Hi, I’m Vladislav
● Wow, that’s neat. We can do cool stuff with data.
○ Machine Learning
○ Data Mining
○ Computer Vision
● DevOps? Shiny!
○ Lots of servers being useful and reliable <3
○ Automation
● Oh, so this is how businesses perceive things. I WAS BLIND.
○ Business goals and values
○ Measurable impact
3. Here’s What I Do
● Yes, please. All of this. Data engineering consulting.
○ “We built data stuff in-house and it delivers lots of value!”
■ “But it also sucks. We are losing money.”
■ “How can we do better?”
○ Mobile application marketing agencies
■ Not necessarily huge data
■ Very valuable and worthwhile topic
■ datapipelinearchitect.com
4. Not Necessarily Big Data
● There’s Big Data
○ It’s pretty fascinating, alright
○ Most companies are a few steps away from having these problems
● Let’s talk more about
○ Messy data (multiple data sources, no overview)
○ Tedious-to-handle data (multiple data sources, lots of manual work)
5. The Big Picture
Actually handling the data is a very small part.
Straightforward, once the business needs are clear.
It’s about communication and people.
6. My First Data Pipeline
● ~20 GB per day
● Legacy MongoDB setup
● BEFORE: “It takes HOURS to get query results!”
● AFTER: “Already done. That was hardly a minute.”
● Google BigQuery
○ Batch: daily
○ Streaming: near real-time
● … So I wrote some modular scripts from scratch (Python & Bash)
○ It worked alright
○ I’m so sorry!
7. What’s Wrong with Custom Scripts?
● What happens when the original author leaves?
○ the hit-by-bus criterion
● Cost of ownership
○ Learning curve, uniqueness
○ Maintenance time, tricky bugs, code duplication
○ Unexpected failure modes
● Extensibility
● Growth
● Metadata?
● You’re reinventing the wheel
8. Here’s What Most People Don’t Search For
“I need to get data from A to B on a regular basis!”
● ETL
○ Extract
○ Transform
○ Load
● Long history
● Even longer beard
● A lot of enterprise-grade tools
→ Data pipelines
9. Data Plumbing - There Are Many Approaches
● Data Virtuality
○ Access data across multiple sources transparently
○ Redshift used in the background intelligently
● Snowplow
○ “Event analytics platform” - designed to run on AWS services
○ Generate special events instead of plumbing existing data
● Segment.io
○ “Collect customer data with one API and send it to hundreds of tools for analytics, marketing, and data warehousing.”
http://datapipelinearchitect.com/tools-for-combining-multiple-data-sources/
10. Workflow Engines!
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)
[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python“
11. Workflow Engine Specimens
● Oozie
● Azkaban
○ XML, strong Hadoop ecosystem focus.
● Luigi
● Airflow
● Pinball
○ Glue!
● Google Cloud Dataflow
● AWS Data Pipeline
○ Managed! Fancy.
A nice comparison: http://bytepawn.com/luigi-airflow-pinball.html
12. Let’s Talk Luigi!
● Spotify
○ Lots of data!
○ 10k+ Hadoop jobs every day [1]
● Battle hardened
○ Published 2009
○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
13. Core Goals and Concepts
● Goals [1]
○ Minimize boilerplate code
○ As general as possible
○ Easy to go from test to production
● Dependencies modeled as directed acyclic graph (DAG)
● Tasks, Targets (see the sketch after this list)
● Assumptions:
○ Idempotency
○ Atomic file operations
■ File X is there? I’m done forever.
[1] Elias Freider, 2013, “Luigi - Batch Data Processing in Python“
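A minimal sketch of these ideas, with a made-up task name, parameter, and path: the task counts as done exactly when its output target exists, and the target writes through a temporary file, which is what keeps the file operation atomic.
import luigi

class MakeReport(luigi.Task):  # hypothetical task
    date = luigi.DateParameter()

    def output(self):
        # the existence of this file is the whole completeness check
        return luigi.LocalTarget("/data/report-%s.txt" % self.date)

    def run(self):
        # Target.open("w") stages a temporary file and moves it into place
        # on close, so a crash mid-run never leaves a half-written output
        with self.output().open("w") as out_file:
            out_file.write("report for %s\n" % self.date)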
14. What Luigi Provides
● Parametrization (command line arguments; example after this list)
● Email alerts
● Dependency resolution
● Retries
● History
● Visualizations
● Preventing duplication of effort
● Testable
● Versioning-friendly
● Collaborative
● Community!
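For instance, parametrization works by declaring parameters on the task class: the MakeReport sketch above declares a date parameter, which Luigi exposes as a --date command line flag, and each distinct value is a separate task instance with its own output (dataflow.py is just an assumed module name here):
$ python dataflow.py MakeReport --date 2016-03-07
$ python dataflow.py MakeReport --date 2016-03-08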
15. Workers and the Scheduler
(diagram from https://github.com/spotify/luigi)
16. Workers and the Scheduler
(diagram from https://github.com/spotify/luigi)
17. Workers and the Central Scheduler
● Workers
○ Crunch data
○ Started via cron, or by hand
● Scheduler
○ Not cron
○ Doesn’t do any data processing
○ Synchronization
○ Web interface - dashboard, visualizations
○ Prevents the same task from running multiple times
○ Edit configuration → run luigid (example below)
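A minimal sketch of that setup (paths and host name are just examples): start luigid once, then let cron or a human kick off workers that report to it.
$ luigid --background --pidfile /var/run/luigid.pid --logdir /var/log/luigi
$ python dataflow.py MyTask --scheduler-host localhost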
18. A Luigi Script
import luigi

# structure
class MyTask(luigi.Task):
    def requires(self):  # a list of Task(s) this one depends on
        return []
    def output(self):  # a Target that marks the task as done
        return luigi.LocalTarget("/data/mytask-output.txt")  # illustrative path
    def run(self):  # the work happens here
        with self.output().open("w") as out_file:
            out_file.write("done\n")

if __name__ == "__main__":
    luigi.run()
---
$ python dataflow.py MyTask
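The command above expects a central scheduler (luigid) to be reachable; for a quick test on a single machine you can pass the --local-scheduler flag instead:
$ python dataflow.py MyTask --local-scheduler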
20. Task Inputs and Outputs
[...]
    # where the data goes
    def output(self):
        return luigi.LocalTarget("/data/out1-%s.txt" % self.param)

    # what needs to run beforehand
    def requires(self):
        return OtherTask(self.param)
[...]
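Put together, the fragment above assumes a declared parameter and an upstream task roughly like the following (names and paths are illustrative); requires() is what Luigi uses to build the dependency graph and run OtherTask first.
import luigi

class OtherTask(luigi.Task):  # hypothetical upstream task
    param = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("/data/in-%s.txt" % self.param)

    def run(self):
        with self.output().open("w") as out_file:
            out_file.write("raw data for %s\n" % self.param)

class MyTask(luigi.Task):
    param = luigi.Parameter()

    def requires(self):  # what needs to run beforehand
        return OtherTask(self.param)

    def output(self):  # where the data goes
        return luigi.LocalTarget("/data/out1-%s.txt" % self.param)

    def run(self):
        # self.input() is the output() of the required task
        with self.input().open("r") as in_file, self.output().open("w") as out_file:
            out_file.write(in_file.read().upper())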
21. Doing the Work
[...]
    def run(self):
        with self.input().open("r") as in_file:
            with self.output().open("w") as out_file:
                # read from in_file, ???, write to out_file
[...]
---
run() can yield tasks to create dynamic dependencies
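A sketch of such a dynamic dependency (task names and the chunk count are made up): run() becomes a generator, the yielded tasks are scheduled on the fly, and Luigi suspends this task until they have all finished.
import luigi

class FetchChunk(luigi.Task):  # hypothetical upstream task
    idx = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget("/data/chunk-%d.txt" % self.idx)

    def run(self):
        with self.output().open("w") as out_file:
            out_file.write("chunk %d\n" % self.idx)

class MergeChunks(luigi.Task):
    n_chunks = luigi.IntParameter(default=3)

    def output(self):
        return luigi.LocalTarget("/data/merged.txt")

    def run(self):
        # decide the upstream tasks at runtime, then yield them;
        # Luigi pauses here until all of them are complete
        chunks = [FetchChunk(idx=i) for i in range(self.n_chunks)]
        yield chunks
        with self.output().open("w") as out_file:
            for chunk in chunks:
                with chunk.output().open("r") as in_file:
                    out_file.write(in_file.read())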
23. Takeaways
● Don’t consider data plumbing in isolation
● Technical decisions should be informed by business needs & goals
● Don’t go with home-baked scripts
○ “Quick and easy”? No.
● ETL is a thing
● There are workflow engines
○ Lots of them
○ Not only for data
● There are other approaches and services
● Luigi is a useful tool
24. Thanks! Let’s stay in touch :)
You’ll also get a step-by-step guide on learning Luigi.
http://datapipelinearchitect.com/big-data-eindhoven/