SlideShare a Scribd company logo
Data Workflows
at Foursquare
using Luigi
Foursquare
•  35 million users
•  Nearly 4 billion check-ins
•  More than 5 million check-ins per day
•  50 million point-of-interest database
•  100's of GB of log data per day
Tools We Use
•  Hive
o  Ad hoc analytics, data dumping ground
•  Raw MapReduce
o  100's of MapReduce jobs in our codebase
•  Pig
o  Fits between structure Hive and free-form
MapReduce
•  Vertica
o  Low latency analytics
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
•  Brittle
•  Hard to reason about / visualize
•  Spend a lot of time waiting
•  Difficult to tell what succeeded or failed
•  No one likes writing Bash scripts
Oozie
XML-based Workflow Engine, with support for
Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g
"Run this Hive query, then run these two
MapReduce jobs in parallel"
Coordinators launch recurring workflows at a
given frequency, when dependent data is
available
Oozie - Example
Oozie - Problems
•  Workflows are all-or-nothing
o  Cannot just run step that failed
o  Very little code reuse
•  Little to no extensibility
•  Limited control flow
•  Extremely verbose
•  Difficult to test
•  No one likes writing XML
Luigi
•  Python framework for batch processing jobs
•  Created by Spotify, open-sourced Sept. 2012
•  Tasks are units of work that produce Targets
•  Tasks can depend on one or more other Tasks
•  A Task is only run if all of its dependent Tasks are done
•  Tasks are idempotent
Luigi - Example Task
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
Luigi - Scheduler
Central scheduler ensures each Task is only
run by a single worker.
A task is uniquely identified by its class name
and its Parameters, e.g.
WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
Luigi - Visualizer
Luigi - Visualizer
Luigi - Visualizer
Luigi - Advantages over Cron
•  Explicit dependencies
•  No wasted time waiting
•  Easy to tell what has failed
•  Avoid duplicate work / partial failures
Luigi - Advantages over Oozie
•  Explicit dependencies between workflows
•  Easier to write
•  Vastly more extensible
•  Code reuse
•  Can easily re-run individual steps
Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com

More Related Content

What's hot

Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
Gerard Toonstra
 
HTTP Request Smuggling via higher HTTP versions
HTTP Request Smuggling via higher HTTP versionsHTTP Request Smuggling via higher HTTP versions
HTTP Request Smuggling via higher HTTP versions
neexemil
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
Konstantin Knauf
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
Knoldus Inc.
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
SATOSHI TAGOMORI
 
Apache airflow
Apache airflowApache airflow
Apache airflow
Purna Chander
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
Varya Karpenko
 
Concurrency With Go
Concurrency With GoConcurrency With Go
Concurrency With Go
John-Alan Simmons
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
Liangjun Jiang
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
Tao Feng
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
Databricks
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
Yohei Onishi
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
Chandler Huang
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetryMeasuring P99 Latency in Event-Driven Architectures with OpenTelemetry
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry
ScyllaDB
 

What's hot (20)

Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
HTTP Request Smuggling via higher HTTP versions
HTTP Request Smuggling via higher HTTP versionsHTTP Request Smuggling via higher HTTP versions
HTTP Request Smuggling via higher HTTP versions
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
Concurrency With Go
Concurrency With GoConcurrency With Go
Concurrency With Go
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetryMeasuring P99 Latency in Event-Driven Architectures with OpenTelemetry
Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry
 

Similar to Luigi presentation OA Summit

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
Joe Crobak
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
Metosin Oy
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social games
Wooga
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
Paolo Negri
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
Anshum Gupta
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
Bill Buchan
 
ProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.x
Russell Searle
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty Details
Achievers Tech
 
Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)
Ford Prior
 
SGCE 2015 REST APIs
SGCE 2015 REST APIsSGCE 2015 REST APIs
SGCE 2015 REST APIs
Domingo Suarez Torres
 
APIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadAPIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidad
Software Guru
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
Denis Dus
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
Tomas Doran
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
Trent Hornibrook
 
Julia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopJulia Computing - an alternative to Hadoop
Julia Computing - an alternative to Hadoop
Shaurya Shekhar
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
Andrei KUCHARAVY
 
CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development
LetsConnect
 
Serverless for High Performance Computing
Serverless for High Performance ComputingServerless for High Performance Computing
Serverless for High Performance Computing
Luciano Mammino
 
Reactive Development: Commands, Actors and Events. Oh My!!
Reactive Development: Commands, Actors and Events.  Oh My!!Reactive Development: Commands, Actors and Events.  Oh My!!
Reactive Development: Commands, Actors and Events. Oh My!!
David Hoerster
 

Similar to Luigi presentation OA Summit (20)

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social games
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
ProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.x
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty Details
 
Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)
 
SGCE 2015 REST APIs
SGCE 2015 REST APIsSGCE 2015 REST APIs
SGCE 2015 REST APIs
 
APIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadAPIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidad
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
 
Julia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopJulia Computing - an alternative to Hadoop
Julia Computing - an alternative to Hadoop
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
 
CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development
 
Serverless for High Performance Computing
Serverless for High Performance ComputingServerless for High Performance Computing
Serverless for High Performance Computing
 
Reactive Development: Commands, Actors and Events. Oh My!!
Reactive Development: Commands, Actors and Events.  Oh My!!Reactive Development: Commands, Actors and Events.  Oh My!!
Reactive Development: Commands, Actors and Events. Oh My!!
 

More from Open Analytics

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)
Open Analytics
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Open Analytics
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)
Open Analytics
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)
Open Analytics
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
Open Analytics
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Open Analytics
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
Open Analytics
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
Open Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
Open Analytics
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
Open Analytics
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
Open Analytics
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
Open Analytics
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Open Analytics
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Open Analytics
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
Open Analytics
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
Open Analytics
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
Open Analytics
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
Open Analytics
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
Open Analytics
 

More from Open Analytics (20)

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
 

Luigi presentation OA Summit

  • 2. Foursquare •  35 million users •  Nearly 4 billion check-ins •  More than 5 million check-ins per day •  50 million point-of-interest database •  100's of GB of log data per day
  • 3. Tools We Use •  Hive o  Ad hoc analytics, data dumping ground •  Raw MapReduce o  100's of MapReduce jobs in our codebase •  Pig o  Fits between structure Hive and free-form MapReduce •  Vertica o  Low latency analytics
  • 4. Cron E.g. 0 0 * * * ./hadoop-script-1.sh # Wait two hours for that job to finish... 0 2 * * * ./hadoop-script-2.sh # And on and on and on
  • 5. Cron - Problems •  Brittle •  Hard to reason about / visualize •  Spend a lot of time waiting •  Difficult to tell what succeeded or failed •  No one likes writing Bash scripts
  • 6. Oozie XML-based Workflow Engine, with support for Hadoop, Hive, and Pig Workflows specify computations in a DAG, e.g "Run this Hive query, then run these two MapReduce jobs in parallel" Coordinators launch recurring workflows at a given frequency, when dependent data is available
  • 8. Oozie - Problems •  Workflows are all-or-nothing o  Cannot just run step that failed o  Very little code reuse •  Little to no extensibility •  Limited control flow •  Extremely verbose •  Difficult to test •  No one likes writing XML
  • 9. Luigi •  Python framework for batch processing jobs •  Created by Spotify, open-sourced Sept. 2012 •  Tasks are units of work that produce Targets •  Tasks can depend on one or more other Tasks •  A Task is only run if all of its dependent Tasks are done •  Tasks are idempotent
  • 11. Luigi - Running the Task $ python word-count.py WordCount --date 2013-06-01
  • 12. Luigi - Scheduler Central scheduler ensures each Task is only run by a single worker. A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01) Will retry failed Tasks after a configured timeout Emails someone when a Task fails
  • 16. Luigi - Advantages over Cron •  Explicit dependencies •  No wasted time waiting •  Easy to tell what has failed •  Avoid duplicate work / partial failures
  • 17. Luigi - Advantages over Oozie •  Explicit dependencies between workflows •  Easier to write •  Vastly more extensible •  Code reuse •  Can easily re-run individual steps
  • 18. Thank you! Check out Luigi: https://github.com/spotify/luigi Drop me a line: Joe Ennever jennever@foursquare.com