Overview: Workflow
• Oozie executes a workflow defined as a DAG of jobs.
• Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
• Introduced in Oozie 1.x.
[Figure: example workflow DAG with start, fork, join, decision (MORE/ENOUGH branches) and end nodes connecting M/R streaming, M/R, Pig, FS and Java jobs.]
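To make the "DAG of jobs" idea concrete, here is a minimal Python sketch (Python 3.9+ for `graphlib`) that derives one valid execution order from a dependency graph. It is only an illustration: the node names are hypothetical and this is not how Oozie itself schedules actions.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node, each value is the set of nodes
# that must finish before it can start. Names are illustrative only.
workflow = {
    "streaming_job": {"start"},
    "fork": {"streaming_job"},
    "mr_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"mr_job", "pig_job"},
    "end": {"join"},
}


def execution_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The two branches after the fork have no ordering constraint between them, so they could run in parallel; the join only starts once both have finished.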
Overview: Coordinator
• Oozie executes workflows based on:
– Time dependency (frequency)
– Data dependency
• Introduced in Oozie 2.x.
[Figure: architecture diagram showing Oozie Client, WS API, Oozie Server, Oozie Coordinator, "Check Data Availability", Oozie Workflow and Hadoop.]
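The two kinds of dependency can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the helper names (`nominal_times`, `ready_actions`, `daily_dir`) and the one-directory-per-day layout are hypothetical, and the real coordinator engine is considerably more involved.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield the nominal time of each coordinator action in [start, end)."""
    t = start
    while t < end:
        yield t
        t += frequency


def ready_actions(times, available_dirs, dir_for):
    """Keep only the nominal times whose input directory already exists."""
    return [t for t in times if dir_for(t) in available_dirs]


def daily_dir(t):
    # Hypothetical layout: one input directory per day.
    return t.strftime("/data/%Y/%m/%d")


# Time dependency: one action per day for three days.
times = list(nominal_times(datetime(2008, 1, 1), datetime(2008, 1, 4), timedelta(days=1)))

# Data dependency: only two of the three daily inputs have arrived so far.
available = {"/data/2008/01/01", "/data/2008/01/03"}
ready = ready_actions(times, available, daily_dir)
print(ready)
```

Actions whose input is missing simply stay pending until the data shows up.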
Oozie 3.x: Bundle
• Users can define and execute a group of coordinator applications.
• Users can start/stop/suspend/resume/rerun at the bundle level.
• Benefit: large data pipeline applications become easier for the Service Engineering team to maintain and control.
[Figure: architecture diagram showing Oozie Client, WS API, Oozie Server, Bundle, Coordinator, "Check Data Availability", Oozie Workflow and Hadoop.]
Enhanced Stability and Scalability
• Issue:
– At very high load, Oozie becomes slow.
– This accounts for 90% of all Oozie support incidents.
• Reason:
– Many active but non-progressing jobs.
– Oozie's internal queue is full.
• Resolution:
– Throttle the number of active jobs per coordinator.
– Put the job into the timeout state.
– Enforce uniqueness of Oozie queue elements.
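Two of the resolutions above, queue uniqueness and throttling, can be illustrated with a small Python sketch. The class and function names are hypothetical and the design is an assumption made for illustration; it does not reproduce Oozie's actual queue implementation.

```python
from collections import deque


class UniqueQueue:
    """Bounded FIFO queue that rejects a key that is already queued."""

    def __init__(self, max_size):
        self._items = deque()
        self._keys = set()
        self._max_size = max_size

    def offer(self, key, command):
        """Return True if queued, False if the key is a duplicate or the queue is full."""
        if key in self._keys or len(self._items) >= self._max_size:
            return False
        self._keys.add(key)
        self._items.append((key, command))
        return True

    def poll(self):
        """Remove and return the oldest command, or None if the queue is empty."""
        if not self._items:
            return None
        key, command = self._items.popleft()
        self._keys.discard(key)
        return command

    def __len__(self):
        return len(self._items)


def can_start_more(active_jobs, throttle):
    """Throttling: allow another active job only while below the limit."""
    return active_jobs < throttle
```

Rejecting duplicates keeps repeated requests for the same job from filling the queue, and the throttle stops a single coordinator from accumulating an unbounded number of active jobs.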
Improved Usability
• Issue:
– The coordinator job status is not intuitive and confuses Oozie users.
• Reason:
– Status SUCCEEDED doesn't mean the job was successful!
– Status PREMATER is for Oozie internal use only, but it was exposed to users.
• Resolution:
– Redesign the coordinator status.
Coordinator Status Redesign
• Current: SUSPENDED, KILLED, PREP, PREMATER, RUNNING, SUCCEEDED, FAILED
• New: SUSPENDED, KILLED, SUCCEEDED, PREP, RUNNING, DONE_WITH_ERROR, PAUSED, FAILED
The Second Year ...
• Number of releases:
– Feature releases: 3
– Patches: 9
• Backward compatibility is strongly maintained.
• No need to resubmit jobs if Oozie is restarted.
• Code overhaul:
– Redesigned the command pattern to avoid DB connection leaks and to improve DB connection usage.
Oozie Usage
• Y! internal usage:
– Total number of users: 377
– Total number of processed jobs ≈ 600K/month
• External downloads:
– 1500+ in the last 8 months from GitHub
– A large number of downloads via packages maintained by third parties.
Challenge 1: Data Availability Check
• Issue:
– Oozie currently checks the directory every minute (polling based).
– This increases NameNode (NN) overhead and does not scale well.
• Reason: there is no metadata system with an appropriate notification mechanism.
• Planned resolution: integrate with the HCatalog metadata system.
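The cost of polling can be seen in a short Python sketch. It assumes a stand-in `FakeNameNode` class (hypothetical, not an HDFS API) that counts lookups; each loop iteration represents one polling interval rather than actually sleeping for a minute.

```python
class FakeNameNode:
    """Stand-in for HDFS: a path starts to exist on its N-th lookup."""

    def __init__(self, appears_on_check):
        self.appears_on_check = appears_on_check  # path -> lookup number at which it exists
        self.calls = 0                            # total lookups served
        self._checks = {}

    def exists(self, path):
        self.calls += 1
        self._checks[path] = self._checks.get(path, 0) + 1
        return self._checks[path] >= self.appears_on_check.get(path, float("inf"))


def poll_until_available(path, exists, max_checks):
    """Return how many checks were needed, or None if the path never appeared."""
    for check in range(1, max_checks + 1):
        if exists(path):
            return check
    return None


nn = FakeNameNode({"/data/2008/input": 5})
checks_used = poll_until_available("/data/2008/input", nn.exists, max_checks=10)
never_seen = poll_until_available("/data/missing", nn.exists, max_checks=10)
print(checks_used, never_seen, nn.calls)
```

Every waiting directory costs one lookup per interval whether or not anything has changed, which is why the load on the NameNode grows with the number of pending dependencies. A notification-based system would replace those repeated lookups with a single event per arrival.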
Challenge 2: Adaptability to Hadoop
• Issue: if the Hadoop NN or JT is down, Oozie still submits the job, which then fails. User intervention is required once the Hadoop server is back.
• Impact: inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until the following Monday.
• Planned resolution: graceful handling of Hadoop downtime:
– If Hadoop is down, block submission.
– When Hadoop becomes available:
  • Submit the blocked jobs.
  • Auto-resubmit the untraced jobs.
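The planned "block, then submit when Hadoop is back" behaviour might look roughly like the following Python sketch. The `Submitter` class and its callbacks are hypothetical names chosen for illustration, and the sketch covers only blocked jobs, not the auto-resubmission of untraced ones.

```python
class Submitter:
    """Holds jobs while Hadoop is unreachable and flushes them afterwards."""

    def __init__(self, hadoop_is_up, submit):
        self._hadoop_is_up = hadoop_is_up  # callable returning True/False
        self._submit = submit              # callable taking a job id
        self.blocked = []                  # job ids waiting for Hadoop

    def request(self, job_id):
        """Submit now if Hadoop is reachable, otherwise hold the job."""
        if self._hadoop_is_up():
            self._submit(job_id)
            return "SUBMITTED"
        self.blocked.append(job_id)
        return "BLOCKED"

    def on_hadoop_available(self):
        """Submit held jobs in arrival order; return how many were flushed."""
        flushed = 0
        while self.blocked and self._hadoop_is_up():
            self._submit(self.blocked.pop(0))
            flushed += 1
        return flushed


# Usage: simulate an outage followed by recovery.
cluster = {"up": False}
submitted = []
s = Submitter(lambda: cluster["up"], submitted.append)
first = s.request("job-1")
second = s.request("job-2")
cluster["up"] = True
flushed = s.on_hadoop_available()
third = s.request("job-3")
print(submitted)
```

With this approach a Friday-night restart no longer needs a human on Monday: the held jobs go out as soon as the cluster is reachable again.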
Challenge 3: Horizontal Scalability
• Issue: a single Oozie instance cannot efficiently handle a very large number of jobs (say 100K/hour). In addition, Oozie doesn't support load balancing.
• Reason: Oozie's internal task queue is not synchronized across multiple Oozie instances.
• Planned resolution: use ZooKeeper for coordination.
• Benefit: as the load increases, add extra Oozie servers.
Future Plan
• Automatic failover: using ZooKeeper.
• Monitoring: rich WS API for application monitoring/alerting.
• Improved usability:
– DistCp action
– Hive action
• Asynchronous data processing.
• Incremental data processing.
• Apache migration: work initiated.
Q&A
• GitHub link: http://yahoo.github.com/oozie
• Mailing list: Oozieemail@example.com
Mohammad K Islam
kamrul@yahoo-inc.com
Oozie Workflow Application
• Contents
– A workflow.xml file
– Resource files, config files and Pig scripts
– All necessary JAR and native library files
• Parameters
– The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig & ssh jobs
• Deployment
– In a directory in the HDFS of the Hadoop cluster where the Hadoop & Pig jobs will run
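As a rough analogy for how a parameterized workflow.xml gets its values, the Python sketch below fills `${input}` and `${output}` placeholders in an XML fragment. Oozie resolves parameters with its own expression language, so this only mimics the idea; the fragment itself is a hypothetical excerpt, not a complete workflow definition.

```python
from string import Template

# Hypothetical excerpt of a map-reduce action's configuration.
workflow_fragment = Template(
    "<property><name>mapred.input.dir</name><value>${input}</value></property>\n"
    "<property><name>mapred.output.dir</name><value>${output}</value></property>"
)

# Job parameters supplied at submission time.
params = {"input": "/data/2008/input", "output": "/data/2008/output"}

# substitute() raises KeyError if a placeholder has no matching parameter.
resolved = workflow_fragment.substitute(params)
print(resolved)
```

Keeping paths out of the workflow definition lets the same deployed application run against different inputs and outputs.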
Oozie Running a Workflow Job
Workflow Application Deployment
$ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf
$ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib
$ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://bar.corp:9000/usr/tucu/wordcount-wf
$ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib
$
Workflow Job Execution
$ oozie run -o http://foo.corp:8080/oozie -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf input=/data/2008/input output=/data/2008/output
Workflow job id [1234567890-wordcount-wf]
$
Workflow Job Status
$ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
Workflow job status [RUNNING]
...
$