1. Building and Managing Complex Dependency Pipelines Using Apache Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
2. Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User Monitoring Systems
5. Future Work
4. Deployment Architecture at Yahoo
[Architecture diagram: clients submit requests to a load balancer (HTTPS + Kerberos / cookie-based auth), which redirects them to Oozie Server 1 or Oozie Server 2. The servers communicate with each other (HTTPS + Kerberos) for log streaming, sharelib updates, etc. ZooKeeper, accessed via Curator, provides lock management (Kerberos). The servers persist state in Oracle RAC and reach the Hadoop cluster, HBase, and HCatalog over Kerberos.]
5. Scale at Yahoo
Deployed on all clusters (production and non-production); one instance per cluster
75 products / 2,000+ projects
255 monthly users
90,000 workflow jobs daily (June 2016, one busy cluster)
• Between 1 and 8 actions; avg. 4 actions per workflow
• Extreme use case: 100-200 workflow job submissions per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
• Frequency: 5, 10, 15 min, hourly, daily, weekly, monthly (25%: < 15 min)
• 99% of workflow jobs are kicked off by coordinators
97 bundle jobs daily (June 2016, one busy cluster)
6. Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User Monitoring Systems
5. Future Work
7. Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
10. Large Scale Data Pipeline Requirements
Administrative
• One should be able to start, stop, and pause all related pipelines, or parts of them, at the same time
Dependency Management
• BCP support
• Data is not guaranteed: start processing even if only partial data is available
• Mandatory and optional feeds
11. Large Scale Data Pipeline Requirements
Multiple Providers
• If data is available from multiple providers, I want to specify the provider priority
• Combining datasets from multiple providers
SLA Management
• Monitor pipeline processing to take immediate action in case of failures or SLA misses
• Pipeline owners should get notified if an SLA is missed
12. Bundle
The bundle system allows the user to define and execute a set of loosely coupled coordinators. The coordinators may depend on each other, but dependencies are enforced via their inputs and outputs.
A bundle can be used to start/stop/suspend/resume/rerun a whole pipeline.
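As a minimal sketch of what such a bundle definition looks like (the bundle name, coordinator names, and app paths are hypothetical):

```xml
<!-- Minimal bundle sketch; names and app paths are made up for illustration -->
<bundle-app name="my-pipeline-bundle" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="ingest-coord">
    <app-path>${nameNode}/apps/ingest/coordinator.xml</app-path>
  </coordinator>
  <coordinator name="aggregate-coord">
    <app-path>${nameNode}/apps/aggregate/coordinator.xml</app-path>
  </coordinator>
</bundle-app>
```

Starting, suspending, resuming, or killing the bundle job then applies to every coordinator it contains.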
14. BCP Support
Pull data from A or B. Specify the dataset as "A or B"; the action will start running as soon as either dataset A or B is available.
<input-logic>
<or name="AorB">
<data-in dataset="A"/>
<data-in dataset="B"/>
</or>
</input-logic>
15. Minimum availability processing
Sometimes we want to start processing even if only partial data is available; min specifies the minimum number of dataset instances required.
<input-logic>
<data-in dataset="A" min="4"/>
</input-logic>
16. Optional feeds
Dataset B is optional; Oozie will start processing as soon as A is available. It will include the data from A plus whatever is available from B.
<input-logic>
<and name="optional">
<data-in dataset="A"/>
<data-in dataset="B" min="0"/>
</and>
</input-logic>
17. Priority Among Dataset Instances
A has higher precedence than B, and B has higher precedence than C.
<input-logic>
<or name="AorBorC">
<data-in dataset="A"/>
<data-in dataset="B"/>
<data-in dataset="C"/>
</or>
</input-logic>
18. Wait for primary
Sometime we want to give preference to primary data source and switch to secondary
only after waiting for some specific amount of time.
<input-logic>
<or name="AorB">
<data-in dataset="A” wait=“120”/>
<data-in dataset="B"/>
</or>
</input-logic>
18
19. Combining Dataset From Multiple Providers
The combine function first takes instances from A, then falls back to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
<start-instance> ${coord:current(-5)} </start-instance>
<end-instance> ${coord:current(-1)} </end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
<start-instance>${coord:current(-5)}</start-instance>
<end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
<combine name="AB">
<data-in dataset="A"/>
<data-in dataset="B"/>
</combine>
</input-logic>
20. Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User Monitoring Systems
5. Future Work
21. Monitoring
By notifications (configure to receive):
• Email action
• HTTP notifications for job status changes
• Email notifications for SLA misses
• JMS notifications for SLA events
By polling (CLI/REST API):
• Single job monitoring
• Bulk monitoring for bundles and coordinators
• SLA monitoring
22. Monitoring
An email action can be added to a workflow to send mail.
Job status change notifications for coordinator actions:
• oozie.coord.action.notification.url
• oozie.coord.action.notification.proxy
Job status change notifications for workflows:
• oozie.wf.workflow.notification.url
• oozie.wf.workflow.notification.proxy
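As a sketch, these properties go into the job configuration; the monitoring endpoint URL below is hypothetical, while $jobId, $actionId, and $status are the placeholders Oozie substitutes before calling the URL:

```properties
# job.properties sketch -- the monitor.example.com endpoint is made up
oozie.wf.workflow.notification.url=http://monitor.example.com/notify?id=$jobId&status=$status
oozie.coord.action.notification.url=http://monitor.example.com/notify?id=$actionId&status=$status
```

Oozie then issues an HTTP callback to the resolved URL on every status change.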
23. Job Monitoring - polling
Supported by both the CLI and the web service:
• Single job monitoring
• Bulk job monitoring
Bulk monitoring can filter on multiple parameters, e.g.:
• Bundle name, bundle id, username, startcreatedtime, endcreatedtime
and on multiple job statuses, e.g.:
• oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
24. SLA Monitoring
Oozie can actively track SLAs on jobs' start time, end time, and duration.
Access/filter SLA info via:
• Web-console dashboard
• REST API
• JMS messages
• Email alerts
30. Case 1
Set up a cron job which periodically pulls SLA information from Oozie.
If there is any SLA miss, a notification is sent to the internal monitoring system, which:
› Pages and sends a mobile alert to the on-call person
› Sends an email alert
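A minimal sketch of such a poller, assuming a 15-minute schedule; the script path, hostname, and coordinator app name are hypothetical, and the endpoint shown is Oozie's v2 SLA REST API:

```
# Crontab fragment: poll every 15 minutes (schedule and script path are made up)
*/15 * * * * /home/ops/bin/check_oozie_sla.sh

# Inside check_oozie_sla.sh, the pull can hit the SLA REST endpoint, e.g.:
#   curl 'http://oozie-host:11000/oozie/v2/sla?timezone=GMT&filter=app_name=my-coord-app'
# and forward any EXPECTED_END miss to the internal monitoring system.
```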
41. Validation job
Data pipelines also periodically run validation jobs to validate their output.
Different pipelines have different validation requirements; one example of a validation job is checking the number of click impressions against billing details.
43. Reprocessing
One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG.
Oozie does not track data dependencies across jobs, which makes it very difficult to rerun the whole pipeline for a particular nominal time.
44. Reprocessing
To work around this Oozie limitation, the team built a job-dependency DAG.
It is very similar to the job explorer -> feed lookup feature: job explorer -> feed lookup is based on the output produced by coordinator jobs, while the job-dependency DAG is based on the inputs to jobs.
There is currently no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
45. Reprocessing
Option 1: Rerun the failed action and all dependent coordinator actions.
• Pro: easy to do
• Con: difficult to monitor
Option 2: Create a new coordinator for the timeline that failed.
• Pro: easy to monitor
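The first approach (rerunning the failed action and its dependents) maps onto the standard Oozie coordinator rerun command; a sketch, with job IDs and action numbers made up for illustration (-refresh re-resolves the input dependencies):

```
# Rerun the failed coordinator action (IDs and action numbers are hypothetical)
oozie job -rerun 0000123-160601000000-oozie-C -action 42 -refresh
# Repeat for each dependent coordinator at the same nominal time
oozie job -rerun 0000124-160601000000-oozie-C -action 42 -refresh
```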
49. Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User Monitoring Systems
5. Future Work
50. Future Work
Oozie unit-testing framework
• No unit-testing support today; jobs are tested directly by running in staging
Coordinator dependency management
• Better reprocessing
Aperiodic and incremental processing
• Currently managed through workarounds