Building and managing complex dependencies pipeline using Apache Oozie

Published in: Technology

  1. 1. Building and managing complex dependencies pipeline using Apache Oozie. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team; Apache Oozie PMC member and committer
  2. 2. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and Monitoring; (4) Monitoring Limitations and User monitoring systems; (5) Future Work
  3. 3. Why Oozie?
     - Out-of-box support for multiple job types
       - Java, shell, distcp
       - MapReduce (Pipes, streaming)
       - Pig, Hive, Spark
     - Highly scalable
     - High availability
       - Hot-Hot with rolling upgrades
       - Load balanced
     - Hue integration
     - [Diagram labels: Oozie, HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
  4. 4. Deployment Architecture at Yahoo. [Diagram components: Load Balancer; Oozie Server 1 and Oozie Server 2; Oracle RAC; ZooKeeper (Curator) for lock management; Hadoop Cluster, HBase, HCatalog. Flow labels: submit request; request redirection; inter-server communication for log streaming, sharelib update, etc. Security labels: https + kerberos / cookie-based auth; https + kerberos; kerberos.]
  5. 5. Scale at Yahoo
     - Deployed on all clusters (production, non-production)
     - One instance per cluster
     - 75 products / 2000+ projects
     - 255 monthly users
     - 90,00 workflow jobs daily (June 2016, one busy cluster)
     - Between 1-8 actions per workflow; avg. 4 actions/workflow
     - Extreme use case: 100-200 workflow jobs submitted per minute
     - 2,277 coordinator jobs daily (June 2016, one busy cluster)
     - Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
     - 99% of workflow jobs are kicked off from a coordinator
     - 97 bundle jobs daily (June 2016, one busy cluster)
  6. 6. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and monitoring; (4) Monitoring Limitations and User monitoring systems; (5) Future Work
  7. 7. Data Pipelines: Ad Exchange Ad Latency Search Advertising Content Management Content Optimization Content Personalization Flickr Video Audience Targeting Behavioral Targeting Partner Targeting Retargeting Web Targeting Advertisement Content Targeting
  8. 8. Data Pipelines: Anti Spam Content Retargeting Research Dashboards & Reports Forecasting Email Data Intelligence Data Management Audience Pipeline
  9. 9. Use Case - Data pipeline
  10. 10. Large Scale Data Pipeline Requirements
      - Administrative
        - One should be able to start, stop, and pause all related pipelines, or part of them, at the same time
      - Dependency Management
        - BCP support
        - Data is not guaranteed; start processing even if only partial data is available
        - Mandatory and optional feeds
  11. 11. Large Scale Data Pipeline Requirements
      - Multiple Providers
        - If data is available from multiple providers, the user should be able to specify the provider priority
        - Combining datasets from multiple providers
      - SLA Management
        - Monitor pipeline processing to take immediate action in case of failures or SLA misses
        - Pipeline owners should get notified if an SLA is missed
  12. 12. Bundle
      - The Bundle system allows the user to define and execute a loosely coupled set of coordinators. The coordinators depend on each other, but the dependency is enforced only through their inputs and outputs.
      - A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline
  13. 13. Complex dependencies. OOZIE-1976: Specifying coordinator input datasets in more logical ways
  14. 14. BCP Support
      Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.
          <input-logic>
            <or name="AorB">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
            </or>
          </input-logic>
  15. 15. Minimum availability processing
      Sometimes we want to start processing even if only partial data is available.
          <input-logic>
            <data-in dataset="A" min="4"/>
          </input-logic>
  16. 16. Optional feeds
      Dataset B is optional; Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B.
          <input-logic>
            <and name="optional">
              <data-in dataset="A"/>
              <data-in dataset="B" min="0"/>
            </and>
          </input-logic>
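The `min` attribute in the two slides above can be read as "how many instances must exist before the input counts as satisfied". The sketch below is a toy model of that reading, not Oozie's implementation; the function names `data_in_ready` and `and_ready` and the instance labels are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def data_in_ready(
    expected: Iterable[str],
    available: Iterable[str],
    minimum: Optional[int] = None,
) -> Tuple[bool, List[str]]:
    """Toy readiness check for a single <data-in>.

    Assumption: without `min`, every expected instance is required;
    with `min`, that many instances are enough (min=0 means optional).
    """
    expected_list = list(expected)
    available_set = set(available)
    found = [inst for inst in expected_list if inst in available_set]
    needed = len(expected_list) if minimum is None else minimum
    return len(found) >= needed, found


def and_ready(*checks: Tuple[bool, List[str]]) -> Tuple[bool, List[str]]:
    """Toy <and>: ready only if every child is ready; pass along all found instances."""
    ready = all(ok for ok, _ in checks)
    resolved = [inst for _, found in checks for inst in found]
    return ready, resolved


# Optional feed: A is mandatory, B is optional (min=0).
a = data_in_ready(["a1", "a2"], ["a1", "a2"])
b = data_in_ready(["b1", "b2"], ["b2"], minimum=0)
optional_result = and_ready(a, b)

# Minimum availability: 4 of 6 instances are enough.
partial_result = data_in_ready(
    ["a1", "a2", "a3", "a4", "a5", "a6"], ["a1", "a2", "a3", "a5"], minimum=4
)
```

In this model the optional feed never blocks the action, yet whatever part of B exists is still handed to the job.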
  17. 17. Priority Among Dataset Instances
      A takes precedence over B, and B takes precedence over C.
          <input-logic>
            <or name="AorBorC">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
              <data-in dataset="C"/>
            </or>
          </input-logic>
  18. 18. Wait for primary
      Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.
          <input-logic>
            <or name="AorB">
              <data-in dataset="A" wait="120"/>
              <data-in dataset="B"/>
            </or>
          </input-logic>
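The priority and wait behaviour described in the `<or>` slides above can be sketched as a small resolver. This is a simplified reading of the slides, not Oozie's code: the function name `resolve_or` is hypothetical, and the wait values are assumed to be in the same time unit as `minutes_waited`.

```python
from typing import Optional, Sequence, Set, Tuple


def resolve_or(
    sources: Sequence[Tuple[str, int]],
    available: Set[str],
    minutes_waited: int,
) -> Optional[str]:
    """Pick a dataset from an <or> block.

    `sources` lists (dataset, wait) pairs in priority order. A missing
    higher-priority dataset blocks lower-priority ones until its wait
    has elapsed. Returns None while nothing can be chosen yet.
    """
    for name, wait in sources:
        if name in available:
            return name
        if minutes_waited < wait:
            # Still inside the grace period of a preferred source.
            return None
    return None


primary_then_secondary = [("A", 120), ("B", 0)]
choice_early = resolve_or(primary_then_secondary, {"B"}, minutes_waited=30)
choice_late = resolve_or(primary_then_secondary, {"B"}, minutes_waited=120)
```

With all waits set to zero the same function reduces to plain priority ordering, as in the AorBorC slide.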
  19. 19. Combining Dataset From Multiple Providers
      The combine function first checks instances from A, then goes to B for whatever is missing in A.
          <data-in name="A" dataset="dataset_A">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
          </data-in>
          <data-in name="B" dataset="dataset_B">
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(-1)}</end-instance>
          </data-in>
          <input-logic>
            <combine name="AB">
              <data-in dataset="A"/>
              <data-in dataset="B"/>
            </combine>
          </input-logic>
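As a rough illustration of the combine behaviour described above, the sketch below resolves each instance offset from A first and falls back to B. It is a toy model: the function name `combine` and the rule that every instance must be covered by one of the two providers are assumptions, not taken from the slides.

```python
from typing import Dict, Iterable, Optional, Set


def combine(
    expected: Iterable[int],
    available_a: Set[int],
    available_b: Set[int],
) -> Optional[Dict[int, str]]:
    """Map each instance offset to the provider that supplies it.

    A is checked first; B fills in only what A is missing. Returns None
    if some instance is missing from both (assumed to mean "keep waiting").
    """
    resolved: Dict[int, str] = {}
    for offset in expected:
        if offset in available_a:
            resolved[offset] = "A"
        elif offset in available_b:
            resolved[offset] = "B"
        else:
            return None
    return resolved


# Offsets -5..-1 mirror coord:current(-5) through coord:current(-1).
offsets = range(-5, 0)
plan = combine(offsets, available_a={-5, -4, -2}, available_b={-5, -3, -1})
```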
  20. 20. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and monitoring; (4) Monitoring Limitations and User monitoring systems; (5) Future Work
  21. 21. Monitoring
      - Configure to receive notifications
        - Email action
        - HTTP notifications for job status change
        - Email notification for SLA misses
        - JMS notification for SLA events
      - By polling
        - CLI/REST API monitoring
          - Single job monitoring
          - Bulk monitoring for bundles and coordinators
          - SLA monitoring
  22. 22. Monitoring
      - An email action can be added to a workflow to send mail
      - Job status change notification for coordinator actions
        - oozie.coord.action.notification.url
        - oozie.coord.action.notification.proxy
      - Job status change notification for workflows
        - oozie.wf.workflow.notification.url
        - oozie.wf.workflow.notification.proxy
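On the receiving side of a workflow status notification, a monitoring service only needs to pull the job id and status out of the callback URL. The sketch below assumes the notification URL was configured with the `$jobId` and `$status` tokens in its query string; the host `monitor.example.com`, the path, and the function names are hypothetical.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical value for oozie.wf.workflow.notification.url:
#   http://monitor.example.com/oozie-callback?jobId=$jobId&status=$status

# Workflow job states treated as failures in this sketch.
FAILURE_STATUSES = {"KILLED", "FAILED"}


def parse_callback(url: str) -> Tuple[str, str]:
    """Extract (job id, status) from a notification callback URL."""
    query = parse_qs(urlsplit(url).query)
    return query["jobId"][0], query["status"][0]


def needs_attention(status: str) -> bool:
    """Return True if the reported status should raise an alert."""
    return status in FAILURE_STATUSES


job_id, status = parse_callback(
    "http://monitor.example.com/oozie-callback"
    "?jobId=0000012-160601000000000-oozie-W&status=KILLED"
)
```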
  23. 23. Job Monitoring - polling
      - Supported for both CLI and web service
      - Single job monitoring
      - Bulk job monitoring
      - Multiple parameters, such as
        - Bundle name, bundle id, username, startcreatedtime, endcreatedtime
      - Multiple job statuses, such as
        - oozie jobs -bulk bundle=bundle-app-1; actionstatus=RUNNING; actionstatus=FAILED
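A polling script has to assemble the bulk filter shown above before passing it to the CLI or the web service. The sketch below only builds and URL-encodes the filter string (written without the spaces after the semicolons); the helper name `bulk_filter` is hypothetical, and how the encoded value is attached to a request is left out because it depends on the Oozie version in use.

```python
from typing import Iterable, Tuple
from urllib.parse import quote


def bulk_filter(pairs: Iterable[Tuple[str, str]]) -> str:
    """Join key=value pairs with ';'. A key may repeat, e.g. several statuses."""
    return ";".join(f"{key}={value}" for key, value in pairs)


filter_string = bulk_filter(
    [
        ("bundle", "bundle-app-1"),
        ("actionstatus", "RUNNING"),
        ("actionstatus", "FAILED"),
    ]
)

# Percent-encode the filter so it can be used as a single query parameter value.
encoded_filter = quote(filter_string, safe="")
```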
  24. 24. SLA Monitoring
      - Oozie can actively track SLAs on jobs'
        - Start-time, End-time, Duration
      - Access/filter SLA info via
        - Web-console dashboard
        - REST API
        - JMS messages
        - Email alert
  25. 25. SLA dashboard – tabular view
  26. 26. SLA dashboard – Graph view
  27. 27. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and monitoring; (4) Monitoring Limitations and User monitoring systems; (5) Future Work
  28. 28. Monitoring Limitations
      - User view
      - BCP SLA support
      - No color coding
      - Paging/on-call
      - Threshold
      - Consolidated email
      - Multi grid view
  29. 29. Data pipeline monitoring use case from Y!
  30. 30. Case-1
      - Set up a cron job that periodically pulls SLA information from Oozie
      - If there is any SLA miss, a notification is sent to the internal monitoring system, which
        - Pages and sends a mobile alert to the on-call person
        - Sends an email alert
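The cron job in Case-1 boils down to "fetch SLA records, keep the misses, notify". The sketch below covers only the filtering and notification step on an already-fetched JSON payload. The field names `slaSummaryList`, `slaStatus`, `appName`, and `id` are assumptions about the response shape and should be checked against the Oozie version in use; `notify` stands in for the internal monitoring system, which the slides do not describe.

```python
from typing import Any, Callable, Dict, List


def find_sla_misses(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the SLA records whose status is MISS (assumed field names)."""
    return [
        entry
        for entry in payload.get("slaSummaryList", [])
        if entry.get("slaStatus") == "MISS"
    ]


def alert_on_misses(payload: Dict[str, Any], notify: Callable[[str], None]) -> int:
    """Send one notification per SLA miss and return how many were sent."""
    misses = find_sla_misses(payload)
    for entry in misses:
        notify(f"SLA miss: {entry.get('appName')} ({entry.get('id')})")
    return len(misses)


# Hand-written sample payload for illustration only.
sample_payload = {
    "slaSummaryList": [
        {"id": "job-1", "appName": "audience-pipeline", "slaStatus": "MET"},
        {"id": "job-2", "appName": "billing-feed", "slaStatus": "MISS"},
    ]
}
sent_messages: List[str] = []
miss_count = alert_on_misses(sample_payload, sent_messages.append)
```

Passing the notifier in as a callable keeps the paging and email paths swappable without touching the filtering logic.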
  31. 31. Case-1
  32. 32. Case-2
      - Divided into four sections
        - SLA details
        - Error jobs
        - Long running jobs
        - Running jobs
  33. 33. SLA information
  34. 34. SLA-status
  35. 35. Long Waiting jobs
  36. 36. Long Waiting jobs – missing dependencies
  37. 37. Error Jobs
  38. 38. Running job details
  39. 39. Job explorer
  40. 40. Feeds - jobs
  41. 41. Validation job
      - The data pipeline also periodically runs validation jobs to validate its output
      - Different pipelines have different validation requirements. One example of a validation job is to validate the number of click impressions against billing details.
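A validation job of the kind mentioned above can be as simple as comparing two counts within a tolerance. The sketch below is purely illustrative: the function name, the 1% default tolerance, and the sample numbers are all hypothetical and not taken from the slides.

```python
def counts_match(pipeline_count: int, billing_count: int, tolerance: float = 0.01) -> bool:
    """Return True if the pipeline count is within `tolerance` of the billing count.

    The relative difference is measured against the billing count; when the
    billing count is zero, the pipeline count must be zero as well.
    """
    if billing_count == 0:
        return pipeline_count == 0
    return abs(pipeline_count - billing_count) / billing_count <= tolerance


within_tolerance = counts_match(1000, 1005)
outside_tolerance = counts_match(1000, 1100)
```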
  42. 42. Alert
  43. 43. Reprocessing
      - One of the biggest requirements of a pipeline is to reprocess the whole dependent DAG
      - Oozie does not track data dependencies between jobs
      - This makes it very difficult to rerun the whole pipeline for a particular nominal time
  44. 44. Reprocessing
      - To work around this Oozie limitation, they have built a job dependency DAG
      - It is very similar to the job explorer -> feed lookup feature
      - Job explorer -> feed lookup is based on the output produced by coordinator jobs
      - The job dependency DAG is based on the input to jobs
      - Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file
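One way such a dependency DAG could be derived is by matching each job's input datasets to the jobs that produce them. The sketch below is an assumption about the approach, not the team's actual tool: the job names, the `inputs`/`outputs` dictionary layout, and the `producer -> consumer` text format are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_dependency_edges(jobs: Dict[str, Dict[str, List[str]]]) -> Set[Tuple[str, str]]:
    """Return (producer, consumer) edges by matching inputs to outputs."""
    producers: Dict[str, Set[str]] = {}
    for job, spec in jobs.items():
        for dataset in spec.get("outputs", []):
            producers.setdefault(dataset, set()).add(job)

    edges: Set[Tuple[str, str]] = set()
    for job, spec in jobs.items():
        for dataset in spec.get("inputs", []):
            for producer in producers.get(dataset, ()):
                if producer != job:
                    edges.add((producer, job))
    return edges


def to_text(edges: Set[Tuple[str, str]]) -> str:
    """Serialize edges one per line, sorted, for storage in a text file."""
    return "\n".join(f"{up} -> {down}" for up, down in sorted(edges))


jobs = {
    "ingest": {"inputs": [], "outputs": ["raw"]},
    "clean": {"inputs": ["raw"], "outputs": ["cleaned"]},
    "report": {"inputs": ["cleaned", "raw"], "outputs": []},
}
edges = build_dependency_edges(jobs)
dag_text = to_text(edges)
```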
  45. 45. Reprocessing
      - Option 1: rerun the failed action and all dependent coordinator jobs
        - Easy to do
        - Con: difficult to monitor
      - Option 2: create a new coordinator for the timeline that failed
        - Easy to monitor
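Rerunning a failed action together with everything that depends on it requires the set of downstream jobs in an order that respects the dependencies. The sketch below computes that order from a list of (producer, consumer) edges; the edge format and job names are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def downstream_rerun_order(edges: Iterable[Tuple[str, str]], failed: str) -> List[str]:
    """Return the failed job and all its dependents, upstream jobs first."""
    edge_set = set(edges)
    children: Dict[str, Set[str]] = {}
    for up, down in edge_set:
        children.setdefault(up, set()).add(down)

    # Breadth-first search to collect every job reachable from the failed one.
    affected = {failed}
    queue = deque([failed])
    while queue:
        job = queue.popleft()
        for child in children.get(job, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Kahn's algorithm restricted to the affected jobs.
    indegree = {job: 0 for job in affected}
    for up, down in edge_set:
        if up in affected and down in affected:
            indegree[down] += 1
    ready = sorted(job for job, degree in indegree.items() if degree == 0)
    order: List[str] = []
    while ready:
        job = ready.pop(0)
        order.append(job)
        for child in sorted(children.get(job, ())):
            if child in affected:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order


sample_edges = {
    ("ingest", "clean"),
    ("ingest", "report"),
    ("clean", "report"),
    ("other", "report"),
}
rerun_from_clean = downstream_rerun_order(sample_edges, "clean")
rerun_from_ingest = downstream_rerun_order(sample_edges, "ingest")
```

Jobs that feed a dependent but were not affected by the failure (here `other`) are left out of the rerun list.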
  46. 46. Reprocessing
  47. 47. Reprocessing
  48. 48. Consolidate SLA Monitoring
  49. 49. Agenda: (1) Oozie at Yahoo; (2) Data Pipelines; (3) SLA and monitoring; (4) Monitoring Limitations and User monitoring systems; (5) Future Work
  50. 50. Future Work
      - Oozie unit testing framework
        - No unit tests now; tested directly by running in staging
      - Coordinator dependency management
        - Better reprocessing
      - Aperiodic and incremental processing
        - Managed through workarounds
  51. 51. Oozie BOF at Ballroom B
  52. 52. Thank you. Purshotam Shah (purushah@yahoo-inc.com), Sr. Software Engineer, Yahoo Hadoop team