Workflow Engines for Hadoop

Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.



  1. Workflow Engines for Hadoop. Joe Crobak (@joecrobak), NYC Data Engineering Meetup, September 5, 2013
  2. Intro
  3. Background • Devops/infra for Hadoop • ~4 years with Hadoop • Have done two migrations from EMR to the colo • Formerly Data/Analytics Infrastructure @ • worked with Apache Oozie and Luigi • Before that, Hadoop @ • worked with Azkaban 1.0 • Disclosure: I've contributed to Luigi and Azkaban 1.0
  4. What is Apache Hadoop?
  5. What is a workflow?
  6. What is a workflow engine?
  7. Two Example Use-Cases
  8. Analytics / Data Warehousing • logs -> fact table(s) • database backups -> dimension tables • Compute rollups/cubes • Load data into a low-latency store (e.g. Redshift, Vertica, HBase) • Dashboarding & BI tools hit the database
  9. Analytics / Data Warehousing
  10. Analytics / Data Warehousing • What happens if there's a failure? • rebuild the failed day • ... and any downstream datasets
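The failure-handling rule above (rebuild the failed day, then everything downstream of it) amounts to computing a downstream closure over the dataset DAG. A minimal sketch, with a made-up dependency graph (dataset names are illustrative, not from the talk):

```python
from collections import deque

# Hypothetical dependency graph: each dataset maps to the datasets
# built directly from it.
DOWNSTREAM = {
    "raw_logs/2013-09-04": ["fact_pageviews/2013-09-04"],
    "fact_pageviews/2013-09-04": ["rollup_daily/2013-09-04", "cube_geo/2013-09-04"],
    "rollup_daily/2013-09-04": [],
    "cube_geo/2013-09-04": [],
}

def datasets_to_rebuild(failed):
    """Return the failed dataset plus everything downstream of it."""
    to_rebuild, queue = set(), deque([failed])
    while queue:
        ds = queue.popleft()
        if ds in to_rebuild:
            continue
        to_rebuild.add(ds)
        queue.extend(DOWNSTREAM.get(ds, []))
    return to_rebuild

print(sorted(datasets_to_rebuild("fact_pageviews/2013-09-04")))
# -> ['cube_geo/2013-09-04', 'fact_pageviews/2013-09-04', 'rollup_daily/2013-09-04']
```

A workflow engine does this walk for you; with cron you end up doing it by hand.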
  11. Hadoop-Driven Features • People You May Know • Amazon-style "people that buy this often buy that" • SPAM detection • logs, databases -> machine learning / collaborative filtering • derivative datasets -> production database (often a k/v store)
  12. Hadoop-Driven Features
  13. Hadoop-Driven Features • What happens if there's a failure? • possibly OK to skip a day • Workflow tends to be self-contained, so you don't need to rerun downstream • Sanity-check your data before pushing to production
  14. Workflow Engine Evolution • Usually start with cron • at 01:00 import data • at 02:00 run really expensive query A • at 03:00 run queries B, C, D • ... • This goes on until you have ~10 jobs • It's hard to debug and rerun • Doesn't scale to many people
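The cron starting point described above looks something like this crontab (script names and times are illustrative, not from the talk):

```cron
# crontab -e: each job fires at a fixed wall-clock time and just hopes
# its upstream finished in time -- no dependencies, no retries, no backfill.
0 1 * * * /opt/etl/import_data.sh
0 2 * * * /opt/etl/run_query_a.sh
0 3 * * * /opt/etl/run_query_b.sh && /opt/etl/run_query_c.sh
```

When the 01:00 import runs long or fails, everything after it silently computes on stale data, which is exactly the debugging pain the slide describes.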
  15. Workflow Engine Evolution • Two possibilities: 1. "a workflow engine can't be too hard, let's write our own" 2. spend weeks evaluating all the options out there; try to shoehorn your workflow into each one
  16. Workflow Engine Considerations • How do I... • deploy and upgrade workflows and the workflow engine • test • detect failure • debug / find logs • rebuild/backfill datasets • load data to/from an RDBMS • manage a set of similar tasks
  17. Apache Oozie http://oozie.apache.org/
  18. Oozie - architecture
  19. Oozie - the good • Great community support • Integrated with HUE, Cloudera Manager, Apache Ambari • HCatalog integration • SLA alerts (new in Oozie 4) • Ecosystem support: Pig, Hive, Sqoop, etc. • Very detailed documentation • Launcher jobs run as map tasks
  20. Oozie - the bad • Launcher jobs run as map tasks • UI - but there's HUE, oozie-web (and a good API) • Confusing object model (bundles, coordinators, workflows) - high barrier to entry • Setup - extjs, Hadoop proxy user, RDBMS • XML!
  21. Oozie - the bad • Hello World in Oozie
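To illustrate the "XML!" complaint, a minimal "Hello World" workflow of the kind the slide shows might look like this sketch (the shell action and names are illustrative; a real deployment also needs a separate job.properties file supplying ${jobTracker} and ${nameNode}):

```xml
<workflow-app name="hello-world" xmlns="uri:oozie:workflow:0.4">
  <start to="hello"/>
  <action name="hello">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>Hello World</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

That is roughly twenty lines of XML to run one echo, which is the barrier to entry the slide is pointing at.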
  22. Azkaban http://azkaban.github.io/azkaban2/
  23. Azkaban - architecture. Source: http://azkaban.github.io/azkaban2/overview.html
  24. Azkaban - the good • Great UI • DAG visualization • Task history • Easy access to log files • Plugin architecture • Pig, Hive, etc.; also Voldemort "build and push" integration • SLA alerting • HDFS browser • User authentication/authorization and auditing • Reportal: https://github.com/azkaban/azkaban-plugins/pull/6
  25.
  26. Azkaban - the bad • Representing data dependencies • i.e. run job X when dataset Y is available • Executors run on separate workers and can be under-utilized (YARN, anyone?) • Community - mostly just LinkedIn, and they rewrote it in isolation • mailing list responsiveness is good
  27. Azkaban - good and bad • Job definitions as java properties • Web uploads/deploys • Running jobs, scheduling jobs • nearly impossible to integrate with configuration management
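On the "job definitions as java properties" point: Azkaban 2 jobs are flat key=value files, one file per job, zipped and uploaded through the web UI, with dependencies declared by job name. A hypothetical two-job flow (file names and commands are illustrative):

```properties
# import.job
type=command
command=hadoop fs -put /data/logs /raw/logs

# rollup.job -- Azkaban runs this only after "import" succeeds
type=command
command=hive -f rollup.sql
dependencies=import
```

The format is simple to read, but because flows are deployed as zip uploads through the UI, they are hard to drive from configuration management, which is the "bad" half of the slide.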
  28. Luigi https://github.com/spotify/luigi
  29. Luigi - architecture
  30. Luigi - the good • Task definitions are code • Tasks are idempotent • Workflow defines data (and task) dependencies • Growing community • Easy to hack on the codebase (<6k LoC) • Postgres integration • Foursquare got this working with Redshift and Vertica
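The "task definitions are code" and "tasks are idempotent" points can be modeled in a few lines of plain Python. This is a stdlib-only sketch that mirrors the shape of Luigi's requires/output/run API rather than importing the real luigi library; task names and file contents are made up:

```python
import os
import tempfile

class Task:
    """Stdlib-only stand-in for luigi.Task: requires/output/run."""
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def complete(self):
        # A task is done when its output exists; this existence check
        # is what makes reruns idempotent.
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

def build(task):
    # Build dependencies first, then the task itself, skipping done work.
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

workdir = tempfile.mkdtemp()

class ImportLogs(Task):
    def output(self):
        return os.path.join(workdir, "logs.tsv")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("2013-09-05\t42\n")

class DailyRollup(Task):
    def requires(self):
        return [ImportLogs()]
    def output(self):
        return os.path.join(workdir, "rollup.tsv")
    def run(self):
        with open(ImportLogs().output()) as src, open(self.output(), "w") as dst:
            dst.write(src.read())

build(DailyRollup())  # runs ImportLogs, then DailyRollup
build(DailyRollup())  # no-op: both outputs already exist
```

Because completeness is derived from output existence, "rebuild the failed day" is just deleting the bad outputs and running build() again.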
  31. Luigi - the bad • Missing some key features, e.g. Pig support • but this is easy to add • Deploy situation is confusing (but easy to automate) • visualizer scaling • no persistent backing store • JVM overhead
  32. Comparison matrix - part 1
      • oozie: lang: java; code complexity: high (~105k LoC); frameworks: pig, hive, sqoop, mapreduce; logs: decentralized, in map tasks; community: good (ASF, in many distros); docs: excellent
      • azkaban: lang: java; code complexity: moderate (~26k LoC); frameworks: pig, hive, mapreduce; logs: UI-accessible; community: few users, responsive on mailing lists; docs: good
      • luigi: lang: python; code complexity: simple (~5.9k LoC); frameworks: hive, postgres, scalding, python streaming; logs: decentralized on workers; community: few users, responsive on GitHub and mailing lists; docs: good
  33. Comparison matrix - part 2
      • oozie: property configuration: command-line, properties file, xml defaults; reruns: oozie job -rerun; customization (new job type): difficult; testing: MiniOozie; user auth: Kerberos, simple, custom
      • azkaban: property configuration: bundled inside workflow zip, system defaults; reruns: partial reruns in UI; customization (new job type): plugin architecture; testing: ?; user auth: xml-based, custom
      • luigi: property configuration: command-line, python ini file; reruns: remove output (tasks are idempotent); customization (new job type): subclass luigi.Task; testing: python unittests; user auth: linux-based
  34. Other workflow engines • Chronos • EMR • Mortar • Qubole • general purpose: Kettle, Spring Batch
  35. Qualities I like in a workflow engine • scripting language • you end up writing scripts to run your job anyway • custom logic, e.g. representing a dependency on 7 days of data, or running only once a week • Less property propagation • Idempotency • WYSIWYG • It shouldn't be hard to take my existing job and move it to the workflow engine (it should just work) • Easy to hack on
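The "dependency on 7 days of data" example above is the kind of custom logic that is awkward in XML but trivial when task definitions are code: a weekly rollup's requires() can simply enumerate the daily partitions. A sketch of the date arithmetic (function name is made up for illustration):

```python
from datetime import date, timedelta

def last_n_days(end, n=7):
    """Dates of the n daily partitions a rollup ending on `end` depends on.

    In a Luigi-style engine, requires() would return one daily task
    per date in this list.
    """
    return [end - timedelta(days=i) for i in range(n)]

deps = last_n_days(date(2013, 9, 5))
# deps[0] is 2013-09-05, deps[-1] is 2013-08-30
```

A "run only every week" rule falls out the same way: a guard on end.weekday() in ordinary code, rather than a new scheduler concept.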
  36. Less important • High availability (cold failover with manual intervention is OK) • Multiple-cluster support • Security
  37. Best Practices • Version datasets • Backfill datasets • Monitor for the absence of a job running • Continuous deploy?
  38. Resources • Azkaban talk at Hadoop User Group: http://www.youtube.com/watch?v=rIUlh33uKMU • PyData talk on Luigi: http://vimeo.com/63435580 • Oozie talk at Hadoop User Group: http://www.slideshare.net/mislam77/oozie-hug-may12
  39. Thanks! • Questions? • Shameless plug: subscribe to my newsletter: http://hadoopweekly.com
