Data Pipeline Architect
Workflow Engines + Luigi.
A broad overview and a brief introduction.
Vladislav Supalov, 8 March 2016
Hi, I’m Vladislav
● Wow, that’s neat. We can do cool stuff with data.
○ Machine Learning
○ Data Mining
○ Computer Vision
● DevOps? Shiny!
○ Lots of servers being useful and reliable <3
○ Automation
● Oh, so this is how businesses perceive things. I WAS BLIND.
○ Business goals and values
○ Measurable impact
Here’s What I Do
● Yes, please. All of this. Data engineering consulting.
○ “We built data stuff in-house and it delivers lots of value!”
■ “But it also sucks. We are losing money.”
■ “How can we do better?”
○ Mobile application marketing agencies
■ Not necessarily huge data
■ Very valuable and worthwhile topic
■ datapipelinearchitect.com
Not Necessarily Big Data
● There’s Big Data
○ It’s pretty fascinating, alright
○ Most companies are a few steps away from having these problems
● Let’s talk more about
○ Messy data (multiple data sources, no overview)
○ Tedious-to-handle data (multiple data sources, lots of manual work)
The Big Picture
Actually handling the data is a very small part.
Straightforward, once the business needs are clear.
It’s about communication and people.
My First Data Pipeline
● ~20 GB per day
● Legacy MongoDB setup
● BEFORE: “It takes HOURS to get query results!”
● AFTER: “Already done. That was hardly a minute.”
● Google BigQuery
○ Batch: daily
○ Streaming: near real-time
● … So I wrote some modular scripts from scratch (Python & Bash)
○ It worked alright
○ I’m so sorry!
What’s Wrong with Custom Scripts?
● What happens when the original author leaves?
○ hit-by-a-bus criterion
● Cost of ownership
○ Learning curve, uniqueness
○ Maintenance time, tricky bugs, code duplication
○ Unexpected failure modes
● Extensibility
● Growth
● Metadata?
● You’re reinventing the wheel
Here’s What Most People Don’t Search For
“I need to get data from A to B on a regular basis!”
● ETL
○ Extract
○ Transform
○ Load
● Long history
● Even longer beard
● A lot of enterprise-grade tools
→ Data pipelines
Data Plumbing - There Are Many Approaches
● Data Virtuality
○ Access data across multiple sources transparently
○ Redshift used in the background intelligently
● Snowplow
○ “Event analytics platform” - designed to run on AWS services
○ Generate special events instead of plumbing existing data
● Segment.io
○ “Collect customer data with one API and send it to hundreds of tools for analytics, marketing, and data warehousing.”
http://datapipelinearchitect.com/tools-for-combining-multiple-data-sources/
Workflow Engines!
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)
[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
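Dependency management, as listed above, comes down to running each job only after the jobs it depends on. The following is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). The job names are hypothetical, and a real workflow engine adds scheduling, retries and monitoring on top of this ordering step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical batch jobs: each key depends on the jobs in its value set.
dependencies = {
    "export_events": set(),
    "export_users": set(),
    "join_events_users": {"export_events", "export_users"},
    "daily_report": {"join_events_users"},
}


def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(execution_order(dependencies))
```

The two export jobs have no mutual dependency, so their relative order is not fixed; only the constraint "dependencies first" is guaranteed.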
Workflow Engine Specimens
● Oozie
● Azkaban
○ XML, strong Hadoop ecosystem focus.
● Luigi
● Airflow
● Pinball
○ Glue!
● Google Cloud Dataflow
● AWS Data Pipeline
○ Managed! Fancy.
A nice comparison: http://bytepawn.com/luigi-airflow-pinball.html
Let’s Talk Luigi!
● Spotify
○ Lots of data!
○ 10k+ Hadoop jobs every day [1]
● Battle hardened
○ Open-sourced in 2012
○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
Core Goals and Concepts
● Goals [1]
○ Minimize boilerplate code
○ As general as possible
○ Easy to go from test to production
● Dependencies modeled as directed acyclic graph (DAG)
● Tasks, Targets
● Assumptions:
○ Idempotency
○ Atomic file operations
■ File X is there? I’m done forever.
[1] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
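The two assumptions above (idempotency and atomic file operations) can be illustrated with the standard library alone. This is a sketch of the general idea, not Luigi's own implementation; `write_atomically` and `run_once` are hypothetical helper names.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so that readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem) ...
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        # ... then rename it into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_once(path, produce):
    """Idempotent step: skip the work entirely if the output already exists."""
    if os.path.exists(path):
        return False  # "File X is there? I'm done forever."
    write_atomically(path, produce())
    return True
```

Because the output only appears once it is complete, "the file exists" can safely be read as "the task is done", and re-running the step is harmless.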
What Luigi Provides
● Parametrization (command line arguments)
● Email alerts
● Dependency resolution
● Retries
● History
● Visualizations
● Preventing duplication of effort
● Testable
● Versioning-friendly
● Collaborative
● Community!
Workers and the Scheduler
[Two figure slides, not reproduced here. Source: https://github.com/spotify/luigi]
Workers and the Central Scheduler
● Workers
○ Crunch data
○ Started via cron, or by hand
● Scheduler
○ Not cron
○ Doesn’t do any data processing
○ Synchronization
○ Web interface - dashboard, visualizations
○ Prevents the same task from running multiple times
○ Edit configuration → run luigid
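The scheduler's synchronization role described above, making sure a task is picked up by only one worker, can be sketched as a small in-process registry. This is a toy under stated assumptions: `ToyScheduler` and its method names are hypothetical, and the real central scheduler runs as its own `luigid` process rather than as an object inside the workers.

```python
import threading


class ToyScheduler:
    """Tracks which task ids are running or done; does no data processing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}  # task_id -> "RUNNING" or "DONE"

    def claim(self, task_id):
        """Return True only for the first worker that asks to run task_id."""
        with self._lock:
            if task_id in self._status:
                return False
            self._status[task_id] = "RUNNING"
            return True

    def mark_done(self, task_id):
        """Record that the worker holding task_id has finished it."""
        with self._lock:
            self._status[task_id] = "DONE"

    def status(self, task_id):
        """Report PENDING for task ids nobody has claimed yet."""
        with self._lock:
            return self._status.get(task_id, "PENDING")
```

A worker would call `claim()` before doing any work and skip the task when it returns `False`.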
A Luigi Script
import luigi

# structure
class MyTask(luigi.Task):
    def requires(self):
        return []  # a list of Task(s) that must finish first
    def output(self):
        return []  # a Target (or list of Targets); empty in this skeleton
    def run(self):
        pass  # the work happens here

if __name__ == "__main__":
    luigi.run()
---
$ python dataflow.py MyTask
Parameters
class MyTask(luigi.Task):
    # magic!
    # IntParameter, so that "--param 2" is parsed as an integer like the default
    param = luigi.IntParameter(default=3)
    [...]
---
$ python dataflow.py MyTask --param 2
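The "magic" above is that a parameter declared on the class becomes a command line flag. A rough sketch of that idea with `argparse` from the standard library follows. `Param` and `parse_task_args` are hypothetical stand-ins written for this illustration and say nothing about how Luigi implements it internally.

```python
import argparse


class Param:
    """Hypothetical stand-in for a declared task parameter."""

    def __init__(self, default=None, parse=str):
        self.default = default
        self.parse = parse


class MyTask:
    # Declared once on the class, exposed as --param on the command line.
    param = Param(default=3, parse=int)


def parse_task_args(task_cls, argv):
    """Build one --flag per declared Param and parse argv into a dict."""
    parser = argparse.ArgumentParser(prog=task_cls.__name__)
    for name, value in vars(task_cls).items():
        if isinstance(value, Param):
            parser.add_argument("--" + name, type=value.parse, default=value.default)
    return vars(parser.parse_args(argv))
```

Calling `parse_task_args(MyTask, ["--param", "2"])` yields the parsed value, and an empty argument list falls back to the declared default.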
Task Inputs and Outputs
[...]
    # where the data goes
    def output(self):
        return luigi.LocalTarget("/data/out1-%s.txt" % self.param)
    # what needs to run beforehand
    def requires(self):
        return OtherTask(self.param)
[...]
Doing the Work
[...]
    def run(self):
        with self.input().open("r") as in_file:
            with self.output().open("w") as out_file:
                # read from in_file, ???, write to out_file
                for line in in_file:
                    out_file.write(line)  # placeholder: copies the input unchanged
[...]
---
run() can also yield tasks to create dynamic dependencies.
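The note about yielding tasks can be made concrete with a toy driver. The sketch below uses plain Python generators and hypothetical `Task`, `build`, `Download`, `Cleanup` and `Report` names. It only shows the control flow of "a task discovers a dependency while running"; Luigi's own worker handles yielded tasks differently, so this is not a description of its internals.

```python
import inspect


class Task:
    """Toy task: subclasses override requires() and run()."""

    def requires(self):
        return []

    def run(self):
        pass


def build(task, finished):
    """Run task after its static and dynamically yielded dependencies."""
    name = type(task).__name__
    if name in finished:
        return
    for dep in task.requires():
        build(dep, finished)
    result = task.run()
    if inspect.isgenerator(result):
        for dep in result:  # each yielded task is built before run() resumes
            build(dep, finished)
    finished.append(name)


class Download(Task):
    pass


class Cleanup(Task):
    pass


class Report(Task):
    def requires(self):
        return [Download()]  # static dependency, known up front

    def run(self):
        # Dynamic dependency: decided only once run() is executing.
        yield Cleanup()
```

Building `Report()` with an empty `finished` list runs the static dependency first, then the dynamically yielded one, then the report itself.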
The Perks
● Specify dates
$ python dataflow.py MyTask --date 2016-03-08
$ [...] --date_interval 2016-W20
● Concurrency
$ [...] --workers 3
● Lots of functionality already provided
○ Targets (HDFS, S3, …)
○ Tasks (HadoopJobTask, CopyToTable, ...)
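A week specification such as `2016-W20` stands for a range of seven days. The sketch below shows how such an ISO week string could be expanded into concrete dates with the standard library (Python 3.8+ for `date.fromisocalendar`). `week_interval` is a hypothetical helper for illustration and is not Luigi's own date interval parser.

```python
import datetime


def week_interval(spec):
    """Turn an ISO week such as '2016-W20' into its list of seven dates."""
    year_text, week_text = spec.split("-W")
    # Day 1 of an ISO week is its Monday.
    monday = datetime.date.fromisocalendar(int(year_text), int(week_text), 1)
    return [monday + datetime.timedelta(days=offset) for offset in range(7)]


if __name__ == "__main__":
    for day in week_interval("2016-W20"):
        print(day.isoformat())
```

A pipeline run for a whole week can then be treated as seven daily runs, one per returned date.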
Takeaways
● Don’t consider data plumbing in isolation
● Technical decisions should be informed by business needs & goals
● Don’t go with home-baked scripts
○ “Quick and easy”? No.
● ETL is a thing
● There are workflow engines
○ Lots of them
○ Not only for data
● There are other approaches and services
● Luigi is a useful tool
Thanks! Let’s stay in touch :)
You’ll also get a step-by-step guide on learning Luigi.
http://datapipelinearchitect.com/big-data-eindhoven/

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 

Workflow Engines + Luigi

  • 1. Data Pipeline Architect: Workflow Engines + Luigi. A broad overview and a brief introduction. Vladislav Supalov, 8 March 2016
  • 2. Hi, I’m Vladislav
      ● Wow, that’s neat. We can do cool stuff with data.
        ○ Machine Learning
        ○ Data Mining
        ○ Computer Vision
      ● DevOps? Shiny!
        ○ Lots of servers being useful and reliable <3
        ○ Automation
      ● Oh, so this is how businesses perceive things. I WAS BLIND.
        ○ Business goals and values
        ○ Measurable impact
  • 3. Here’s What I Do
      ● Yes, please. All of this. Data engineering consulting.
        ○ “We built data stuff in-house and it delivers lots of value!”
          ■ “But it also sucks. We are losing money.”
          ■ “How can we do better?”
        ○ Mobile application marketing agencies
          ■ Not necessarily huge data
          ■ Very valuable and worthwhile topic
          ■ datapipelinearchitect.com
  • 4. Not Necessarily Big Data
      ● There’s Big Data
        ○ It’s pretty fascinating, alright
        ○ Most companies are a few steps away from having these problems
      ● Let’s talk more about
        ○ Messy data (multiple data sources, no overview)
        ○ Tedious-to-handle data (multiple data sources, lots of manual work)
  • 5. The Big Picture
      Actually handling the data is a very small part.
      Straightforward, once the business needs are clear.
      It’s about communication and people.
  • 6. My First Data Pipeline
      ● ~20 GB per day
      ● Legacy MongoDB setup
      ● BEFORE: “It takes HOURS to get query results!”
      ● AFTER: “Already done. That was hardly a minute.”
      ● Google BigQuery
        ○ Batch: daily
        ○ Streaming: near real-time
      ● … So I wrote some modular scripts from scratch (Python & Bash)
        ○ It worked alright
        ○ I’m so sorry!
  • 7. What’s Wrong with Custom Scripts?
      ● What happens when the original author leaves?
        ○ hit-by-bus criterion
      ● Cost of ownership
        ○ Learning curve, uniqueness
        ○ Maintenance time, tricky bugs, code duplication
        ○ Unexpected failure modes
      ● Extensibility
      ● Growth
      ● Metadata?
      ● You’re reinventing the wheel
  • 8. Here’s What Most People Don’t Search For
      “I need to get data from A to B on a regular basis!”
      ● ETL
        ○ Extract
        ○ Transform
        ○ Load
      ● Long history
      ● Even longer beard
      ● A lot of enterprise-grade tools
      → Data pipelines
  • 9. Data Plumbing - There Are Many Approaches
      ● Data Virtuality
        ○ Access data across multiple sources transparently
        ○ Redshift used in the background intelligently
      ● Snowplow
        ○ “Event analytics platform” - designed to run on AWS services
        ○ Generate special events instead of plumbing existing data
      ● Segment.io
        ○ “Collect customer data with one API and send it to hundreds of tools for analytics, marketing, and data warehousing.”
      http://datapipelinearchitect.com/tools-for-combining-multiple-data-sources/
  • 10. Workflow Engines!
      ● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
      ● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
      ● Pipelines of batch jobs
        ○ complex, long-running
      ● Dependency management
      ● Reusability of intermediate steps
      ● Logging and alerting
      ● Failure handling
      ● Monitoring
      ● Lots of effort went into them (Broken data? Crashes? Partial failures?)
      [1] https://en.wikipedia.org/wiki/Workflow
      [2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
  • 11. Workflow Engine Specimens
      ● Oozie
      ● Azkaban
        ○ XML, strong Hadoop ecosystem focus.
      ● Luigi
      ● Airflow
      ● Pinball
        ○ Glue!
      ● Google Cloud Dataflow
      ● AWS Data Pipeline
        ○ Managed! Fancy.
      A nice comparison: http://bytepawn.com/luigi-airflow-pinball.html
  • 12. Let’s Talk Luigi!
      ● Spotify
        ○ Lots of data!
        ○ 10k+ Hadoop jobs every day [1]
      ● Battle hardened
        ○ Open-sourced in 2012
        ○ Has been used in production by large companies for a while
      ● Python
      ● Modular & extensible
      ● Dependency graph
      ● Not just for data tasks
      [1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
  • 13. Core Goals and Concepts
      ● Goals [1]
        ○ Minimize boilerplate code
        ○ As general as possible
        ○ Easy to go from test to production
      ● Dependencies modeled as directed acyclic graph (DAG)
      ● Tasks, Targets
      ● Assumptions:
        ○ Idempotency
        ○ Atomic file operations
          ■ File X is there? I’m done forever.
      [1] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
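The two assumptions on this slide (idempotency and atomic file operations) can be illustrated without Luigi at all. The following is a minimal standard-library sketch, not Luigi's implementation: `write_atomically` and `run_once` are hypothetical helper names introduced here. The output is written to a temporary file and renamed into place, so the final path either does not exist or holds the complete result, and "file X is there" can safely mean "done".

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        # Renaming within one filesystem is atomic: readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_once(path, produce):
    """Call produce() only if the output is missing; return True if work was done."""
    if os.path.exists(path):
        return False  # output already there: treat the step as complete
    write_atomically(path, produce())
    return True
```

Calling `run_once` a second time with the same path does nothing, which is the idempotency a workflow engine relies on when it re-runs a pipeline after a partial failure.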
  • 14. What Luigi Provides
      ● Parametrization (command line arguments)
      ● Email alerts
      ● Dependency resolution
      ● Retries
      ● History
      ● Visualizations
      ● Preventing duplication of effort
      ● Testable
      ● Versioning-friendly
      ● Collaborative
      ● Community!
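To make "dependency resolution" and "preventing duplication of effort" concrete, here is a small sketch of the general idea in plain Python. It is an illustration under stated assumptions, not Luigi's scheduler: the function `plan`, the `requires` dictionary, and the `is_complete` callback are hypothetical names. Tasks are visited depth-first, dependencies come before the tasks that need them, and anything already complete is skipped together with its upstream tasks.

```python
def plan(target, requires, is_complete):
    """Return the tasks that must run to build `target`, dependencies first.

    requires: dict mapping a task name to the list of task names it depends on.
    is_complete: callable taking a task name and returning True if its output exists.
    """
    ordered, done, in_progress = [], set(), set()

    def visit(task):
        if task in done or is_complete(task):
            return  # nothing to do, and no need to look further upstream
        if task in in_progress:
            raise ValueError("dependency cycle at %r" % task)
        in_progress.add(task)
        for dependency in requires.get(task, []):
            visit(dependency)
        in_progress.discard(task)
        done.add(task)
        ordered.append(task)

    visit(target)
    return ordered


# Hypothetical pipeline: the download already happened, so only three steps remain.
requires = {
    "report": ["clean", "rates"],
    "clean": ["download"],
    "rates": [],
    "download": [],
}
steps = plan("report", requires, is_complete=lambda task: task == "download")
```

The cycle check is what the "acyclic" in DAG buys: without it, two tasks that require each other would recurse forever.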
  • 15. Workers and the Scheduler (diagram; source: https://github.com/spotify/luigi)
  • 16. Workers and the Scheduler (diagram; source: https://github.com/spotify/luigi)
  • 17. Workers and the Central Scheduler
      ● Workers
        ○ Crunch data
        ○ Started via cron, or by hand
      ● Scheduler
        ○ Not cron
        ○ Doesn’t do any data processing
        ○ Synchronization
        ○ Web interface - dashboard, visualizations
        ○ Prevents the same task from running more than once
        ○ Edit configuration → run luigid
  • 18. A Luigi Script
        import luigi

        # structure
        class MyTask(luigi.Task):
            def requires(self):
                # a list of Task(s)
                return []

            def output(self):
                # a Target
                pass

            def run(self):
                # the work happens here
                pass

        if __name__ == "__main__":
            luigi.run()
        ---
        $ python dataflow.py MyTask
  • 19. Parameters
        class MyTask(luigi.Task):
            # magic! (IntParameter so that "--param 2" is parsed as an integer, like the default)
            param = luigi.IntParameter(default=3)
            [...]
        ---
        $ python dataflow.py MyTask --param 2
  • 20. Task Inputs and Outputs
        [...]
        # where the data goes
        def output(self):
            return luigi.LocalTarget("/data/out1-%s.txt" % self.param)

        # what needs to run beforehand
        def requires(self):
            return OtherTask(self.param)
        [...]
  • 21. Doing the Work
        [...]
        def run(self):
            with self.input().open("r") as in_file, self.output().open("w") as out_file:
                # read from in_file, ???, write to out_file
                pass
        [...]
        ---
        Note: run() can also yield tasks to create dynamic dependencies.
  • 22. The Perks
      ● Specify dates
          $ python dataflow.py MyTask --date 2016-03-08
          $ [...] --date_interval 2016-W20
      ● Concurrency
          $ [...] --workers 3
      ● Lots of functionality already provided
        ○ Targets (HDFS, S3, …)
        ○ Tasks (HadoopJobTask, CopyToTable, ...)
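A value such as `2016-W20` reads as an ISO 8601 week. As a rough illustration of what a week-long date interval covers, the sketch below expands such a string into its seven calendar days using only the standard library. It assumes ISO week numbering and Python 3.8 or newer; `days_in_iso_week` is a hypothetical helper and does not show how Luigi itself parses the parameter.

```python
import datetime


def days_in_iso_week(spec):
    """Expand a string such as '2016-W20' into the seven dates of that ISO week."""
    year_text, week_text = spec.split("-W")
    # Day 1 of an ISO week is Monday.
    monday = datetime.date.fromisocalendar(int(year_text), int(week_text), 1)
    return [monday + datetime.timedelta(days=offset) for offset in range(7)]


week = days_in_iso_week("2016-W20")
```

A task parameterized by such an interval would typically require one daily task per date in the list, which is where the dependency graph and the skip-if-complete behavior pay off.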
  • 23. Takeaways
      ● Don’t consider data plumbing in isolation
      ● Technical decisions should be informed by business needs & goals
      ● Don’t go with home-baked scripts
        ○ “Quick and easy”? No.
      ● ETL is a thing
      ● There are workflow engines
        ○ Lots of them
        ○ Not only for data
      ● There are other approaches and services
      ● Luigi is a useful tool
  • 24. Thanks! Let’s stay in touch :)
      You’ll also get a step-by-step guide on learning Luigi.
      http://datapipelinearchitect.com/big-data-eindhoven/