SlideShare a Scribd company logo
1 of 39
Download to read offline
Workflow Engines for
Hadoop
Joe Crobak
@joecrobak
NYC Data Engineering Meetup
September 5, 2013
1
Intro
2
Background
• Devops/Infra for Hadoop
• ~4 years with Hadoop
• Have done two migrations from EMR to the colo.
• Formerly Data/Analytics Infrastructure @
• worked with Apache Oozie and Luigi
• Before that, Hadoop @
• worked with Azkaban 1.0
Disclosure: I’ve contributed to Luigi and Azkaban 1.0
3
What is Apache Hadoop?
4
What is a workflow?
5
What is a workflow engine?
6
Two Example Use-Cases
7
Analytics / Data Warehousing
• logs -> fact table(s).
• database backups -> dimension tables.
• Compute rollups/cubes.
• Load data into a low-latency store (e.g.
Redshift,Vertica, HBase).
• Dashboarding & BI tools hit database.
8
Analytics / Data Warehousing
9
Analytics / Data Warehousing
• What happens if there’s a failure?
• rebuild the failed day
• ... and any downstream datasets
10
Hadoop-Driven Features
• PeopleYou May Know
• Amazon-style “People
that buy this often by
that”
• SPAM detection
• logs, databases ->
machine learning /
collaborative filtering
• derivative datasets ->
production database
(often k/v store)
11
Hadoop-Driven Features
12
Hadoop-Driven Features
• What happens if there’s a failure?
• possibly OK to skip a day.
• Workflow tends to be self-contained, so
you don’t need to rerun downstream.
• Sanity check your data before pushing to
production.
13
Workflow Engine Evolution
• Usually start with cron
• at 01:00 import data
• at 02:00 run really expensive query A
• at 03:00 run query B, C, D
• ...
• This goes on until you have ~10 jobs or so.
• It’s hard to debug and rerun.
• Doesn’t scale to many people.
14
Workflow Engine Evolution
• Two possibilities:
1. “a workflow engine can’t be too hard,
let’s write our own”
2. spend weeks evaluating all the options
out there.Try to shoehorn your
workflow into each one.
15
Workflow Engine
Considerations
How do I...
• Deploy and Upgrade
• workflows and the workflow engine
• Test
• Detect Failure
• Debug/find logs
• Rebuild/backfill datasets
• Load data to/from a RDBMS
• Manage a set of similar tasks
16
Apache
http://oozie.apache.org/
17
Oozie - architecture
18
Oozie - the good
• Great community support
• Integrated with HUE, Cloudera Manager,Apache
Ambari
• HCatalog integration
• SLA alerts (new in Oozie 4)
• Ecosystem support: Pig, Hive, Sqoop, etc.
• Very detailed documentation
• Launcher jobs as map tasks
19
Oozie - the bad
• Launcher jobs as map tasks.
• UI - but HUE, oozie-web (and
good API)
• Confusing object model (bundles,
coordinators, workflows) - high
barrier to entry.
• Setup - extjs, hadoop proxy user,
RDBMS.
• XML!
20
Oozie - the bad
• Hello World in Oozie
21
http://azkaban.github.io/azkaban2/
22
Azkaban - architecture
Source: http://azkaban.github.io/azkaban2/overview.html
23
Azkaban - the good
• Great UI
• DAG visualization
• Task history
• Easy access to log files
• Plugin architecture
• Pig, Hive, etc. Also, voldemort “build and push” integration
• SLA Alerting
• HDFS Browser
• User Authentication/Authorization and auditing.
• Reportal: https://github.com/azkaban/azkaban-plugins/pull/6
24
25
Azkaban - the bad
• Representing data dependencies
• i.e. run job X when datasetY is available.
• Executors run on separate workers, can be
under-utilized (YARN anyone?).
• Community - mostly just LinkedIn, and they
rewrote it in isolation.
• mailing list responsiveness is good.
26
Azkaban - good and
bad
• Job definitions as java properties
• Web uploads/deploy
• Running jobs, scheduling jobs.
• nearly impossible to integrate with
configuration management
27
https://github.com/spotify/luigi
28
Luigi - architecture
29
Luigi - the good
• Task definitions are code.
• Tasks are idempotent.
• Workflow defines data (and task) dependencies.
• Growing community.
• Easy to hack on the codebase (<6k LoC).
• Postgres integration
• Foursquare got this working with Redshift and
Vertica.
30
Luigi - the bad
• Missing some key features, e.g. Pig support
• but this is easy to add
• Deploy situation is confusing (but easy to
automate)
• visualizer scaling
• no persistent backing
• JVM overhead
31
Comparison matrix -
part 1
Lang
Code
Complexity
Frameworks Logs Community Docs
oozie java high - 105k
pig, hive, sqoop,
mapreduce
decentralized,
map tasks
Good - ASF in
many distros
excelle
nt
azkaban java moderate - 26k
pig, hive,
mapreduce
UI-accessible
few users,
responsive on
MLs
good
luigi python simple - 5.9k
hive, postgres,
scalding, python
streaming
decentral-ized
on workers
few users,
responsive on
github and MLs
good
32
Comparison matrix -
part 2
property
configuration
Reruns
Customizat
ion (new
job type)
Testing User Auth
oozie
command-line,
properties file, xml
defaults
oozie job -
rerun
difficult MiniOozie
Kerberos, simple,
custom
azkaban
bundled inside
workflow zip, system
defaults
partial
reruns in UI
plugin
architecture
?
xml-based,
custom
luigi
command-line,
python ini file
remove
output,
idempotency
subclass
luigi.Task
python
unittests
linux-based
33
Other workflow
engines
• Chronos
• EMR
• Mortar
• Qubole
• general purpose:
• kettle, spring batch
34
Qualities I like in a
workflow engine
• scripting language
• you end up writing scripts to run your job anyway
• custom logic, e.g. representing a dep on 7-days of data or run
only every week
• Less property propagation
• Idempotency
• WYSIWYG
• It shouldn't be hard to take my existing job and move it to the
workflow engine (it should just work).
• Easy to hack on
35
Less important
• High availability (cold failover with manual
intervention is OK)
• Multiple cluster support
• Security
36
Best Practices
• Version datasets
• Backfilling datasets
• Monitor the absence of a job running
• Continuous deploy?
37
Resources
• Azkaban talk at Hadoop User Group:
http://www.youtube.com/watch?
v=rIUlh33uKMU
• PyData talk on Luigi: http://vimeo.com/
63435580
• Oozie talk at Hadoop user Group: http://
www.slideshare.net/mislam77/oozie-hug-
may12
38
Thanks!
• Questions?
• shameless plug: Subscribe to my
newsletter: http://hadoopweekly.com
39

More Related Content

What's hot

Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with LuigiTeemu Kurppa
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Getting to Know Airflow
Getting to Know AirflowGetting to Know Airflow
Getting to Know AirflowRosanne Hoyem
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 

What's hot (20)

What is Spark
What is SparkWhat is Spark
What is Spark
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
AIRflow at Scale
AIRflow at ScaleAIRflow at Scale
AIRflow at Scale
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Getting to Know Airflow
Getting to Know AirflowGetting to Know Airflow
Getting to Know Airflow
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 

Similar to Workflow Engines for Hadoop: Oozie, Azkaban and Luigi Compared

Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 DistilledGrig Gheorghiu
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopAllen Wittenauer
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experienceIgor Anishchenko
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experienceAlex Tumanoff
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilitycherryhillco
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"IT Event
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
What we talk about when we talk about DevOps
What we talk about when we talk about DevOpsWhat we talk about when we talk about DevOps
What we talk about when we talk about DevOpsRicard Clau
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexMaking sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexApache Apex
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Using LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsUsing LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsAlexander Gladysh
 
Databases in the hosted cloud
Databases in the hosted cloudDatabases in the hosted cloud
Databases in the hosted cloudColin Charles
 

Similar to Workflow Engines for Hadoop: Oozie, Azkaban and Luigi Compared (20)

Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache Hadoop
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experience
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experience
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalability
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
What we talk about when we talk about DevOps
What we talk about when we talk about DevOpsWhat we talk about when we talk about DevOps
What we talk about when we talk about DevOps
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexMaking sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Using LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsUsing LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projects
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
Databases in the hosted cloud
Databases in the hosted cloudDatabases in the hosted cloud
Databases in the hosted cloud
 

Recently uploaded

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Workflow Engines for Hadoop: Oozie, Azkaban and Luigi Compared

  • 1. Workflow Engines for Hadoop Joe Crobak @joecrobak NYC Data Engineering Meetup September 5, 2013 1
  • 3. Background • Devops/Infra for Hadoop • ~4 years with Hadoop • Have done two migrations from EMR to the colo. • Formerly Data/Analytics Infrastructure @ • worked with Apache Oozie and Luigi • Before that, Hadoop @ • worked with Azkaban 1.0 Disclosure: I’ve contributed to Luigi and Azkaban 1.0 3
  • 4. What is Apache Hadoop? 4
  • 5. What is a workflow? 5
  • 6. What is a workflow engine? 6
  • 8. Analytics / Data Warehousing • logs -> fact table(s). • database backups -> dimension tables. • Compute rollups/cubes. • Load data into a low-latency store (e.g. Redshift,Vertica, HBase). • Dashboarding & BI tools hit database. 8
  • 9. Analytics / Data Warehousing 9
  • 10. Analytics / Data Warehousing • What happens if there’s a failure? • rebuild the failed day • ... and any downstream datasets 10
  • 11. Hadoop-Driven Features • PeopleYou May Know • Amazon-style “People that buy this often by that” • SPAM detection • logs, databases -> machine learning / collaborative filtering • derivative datasets -> production database (often k/v store) 11
  • 13. Hadoop-Driven Features • What happens if there’s a failure? • possibly OK to skip a day. • Workflow tends to be self-contained, so you don’t need to rerun downstream. • Sanity check your data before pushing to production. 13
  • 14. Workflow Engine Evolution • Usually start with cron • at 01:00 import data • at 02:00 run really expensive query A • at 03:00 run query B, C, D • ... • This goes on until you have ~10 jobs or so. • It’s hard to debug and rerun. • Doesn’t scale to many people. 14
  • 15. Workflow Engine Evolution • Two possibilities: 1. “a workflow engine can’t be too hard, let’s write our own” 2. spend weeks evaluating all the options out there.Try to shoehorn your workflow into each one. 15
  • 16. Workflow Engine Considerations How do I... • Deploy and Upgrade • workflows and the workflow engine • Test • Detect Failure • Debug/find logs • Rebuild/backfill datasets • Load data to/from a RDBMS • Manage a set of similar tasks 16
  • 19. Oozie - the good • Great community support • Integrated with HUE, Cloudera Manager,Apache Ambari • HCatalog integration • SLA alerts (new in Oozie 4) • Ecosystem support: Pig, Hive, Sqoop, etc. • Very detailed documentation • Launcher jobs as map tasks 19
  • 20. Oozie - the bad • Launcher jobs as map tasks. • UI - but HUE, oozie-web (and good API) • Confusing object model (bundles, coordinators, workflows) - high barrier to entry. • Setup - extjs, hadoop proxy user, RDBMS. • XML! 20
  • 21. Oozie - the bad • Hello World in Oozie 21
  • 23. Azkaban - architecture Source: http://azkaban.github.io/azkaban2/overview.html 23
  • 24. Azkaban - the good • Great UI • DAG visualization • Task history • Easy access to log files • Plugin architecture • Pig, Hive, etc. Also, voldemort “build and push” integration • SLA Alerting • HDFS Browser • User Authentication/Authorization and auditing. • Reportal: https://github.com/azkaban/azkaban-plugins/pull/6 24
  • 25. 25
  • 26. Azkaban - the bad • Representing data dependencies • i.e. run job X when datasetY is available. • Executors run on separate workers, can be under-utilized (YARN anyone?). • Community - mostly just LinkedIn, and they rewrote it in isolation. • mailing list responsiveness is good. 26
  • 27. Azkaban - good and bad • Job definitions as java properties • Web uploads/deploy • Running jobs, scheduling jobs. • nearly impossible to integrate with configuration management 27
  • 30. Luigi - the good • Task definitions are code. • Tasks are idempotent. • Workflow defines data (and task) dependencies. • Growing community. • Easy to hack on the codebase (<6k LoC). • Postgres integration • Foursquare got this working with Redshift and Vertica. 30
  • 31. Luigi - the bad • Missing some key features, e.g. Pig support • but this is easy to add • Deploy situation is confusing (but easy to automate) • visualizer scaling • no persistent backing • JVM overhead 31
  • 32. Comparison matrix - part 1 Lang Code Complexity Frameworks Logs Community Docs oozie java high - 105k pig, hive, sqoop, mapreduce decentralized, map tasks Good - ASF in many distros excelle nt azkaban java moderate - 26k pig, hive, mapreduce UI-accessible few users, responsive on MLs good luigi python simple - 5.9k hive, postgres, scalding, python streaming decentral-ized on workers few users, responsive on github and MLs good 32
  • 33. Comparison matrix - part 2 property configuration Reruns Customizat ion (new job type) Testing User Auth oozie command-line, properties file, xml defaults oozie job - rerun difficult MiniOozie Kerberos, simple, custom azkaban bundled inside workflow zip, system defaults partial reruns in UI plugin architecture ? xml-based, custom luigi command-line, python ini file remove output, idempotency subclass luigi.Task python unittests linux-based 33
  • 34. Other workflow engines • Chronos • EMR • Mortar • Qubole • general purpose: • kettle, spring batch 34
  • 35. Qualities I like in a workflow engine • scripting language • you end up writing scripts to run your job anyway • custom logic, e.g. representing a dep on 7-days of data or run only every week • Less property propagation • Idempotency • WYSIWYG • It shouldn't be hard to take my existing job and move it to the workflow engine (it should just work). • Easy to hack on 35
  • 36. Less important • High availability (cold failover with manual intervention is OK) • Multiple cluster support • Security 36
  • 37. Best Practices • Version datasets • Backfilling datasets • Monitor the absence of a job running • Continuous deploy? 37
  • 38. Resources • Azkaban talk at Hadoop User Group: http://www.youtube.com/watch? v=rIUlh33uKMU • PyData talk on Luigi: http://vimeo.com/ 63435580 • Oozie talk at Hadoop user Group: http:// www.slideshare.net/mislam77/oozie-hug- may12 38
  • 39. Thanks! • Questions? • shameless plug: Subscribe to my newsletter: http://hadoopweekly.com 39