Oozie 3: Improved Scheduling and Control of Workflows
Mohammad K Islam
kamrul@yahoo-inc.com
Introductions
• Who I am
• Technical Lead at Yahoo!
• Oozie Team
• Architecture, Development, Management
  – Mayank Bansal
  – Angelo Huang
  – Mohammad Islam
  – Amol Kekre
  – Andreas Newman
  – Lei Zhang
• External contributors
• QE
  – Marcy Chang
  – Michelle Chiang
Overview: Workflow
• Oozie executes workflows defined as a DAG of jobs.
• Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
• Introduced in Oozie 1.x.
[Figure: example workflow DAG with control nodes start, fork, join, decision (branches MORE and ENOUGH) and end, and action nodes M/R streaming job, M/R job, Pig job, FS job and Java job]
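To make "a DAG of jobs" concrete, here is a minimal sketch of ordering such a workflow so that every job runs only after the jobs it depends on. It is not Oozie's implementation (Oozie workflows are defined in XML and run by the Oozie server). The node names loosely follow the figure, and `WORKFLOW` and `execution_order` are hypothetical names used only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each node maps to the set of nodes it must wait for.
WORKFLOW = {
    "start": set(),
    "fork": {"start"},
    "mr_streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"mr_streaming_job", "pig_job"},
    "fs_job": {"join"},
    "end": {"fs_job"},
}


def execution_order(workflow):
    """Return one valid run order; raises graphlib.CycleError if not a DAG."""
    return list(TopologicalSorter(workflow).static_order())
```

The two jobs behind the fork have no dependency on each other, so a real engine could run them in parallel; the join only proceeds once both have finished.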
Overview: Coordinator
• Oozie executes workflows based on:
  – Time dependency (frequency)
  – Data dependency
• Introduced in Oozie 2.x.
[Figure: Oozie Client, WS API, Oozie Server containing Oozie Coordinator and Oozie Workflow, data availability check, Hadoop]
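The two kinds of dependency can be sketched as follows: a run becomes eligible once its scheduled time has passed and its input data is present. This is an illustrative sketch only. The function names, the `input_for` callback and the set of available paths are assumptions, not the Oozie API.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield each scheduled (nominal) time in the half-open range [start, end)."""
    t = start
    while t < end:
        yield t
        t += frequency


def ready_actions(start, end, frequency, now, input_for, available):
    """Return the nominal times whose time and data dependencies are both met.

    input_for: hypothetical callback mapping a nominal time to its input path.
    available: set of input paths that currently exist.
    """
    return [
        t
        for t in nominal_times(start, end, frequency)
        if t <= now and input_for(t) in available
    ]
```

An hourly coordinator whose input for 01:00 has not arrived yet would run the 00:00 and 02:00 instances and keep the 01:00 instance waiting.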
Bundle
• What is a bundle?
  – A new abstraction layer on top of Coordinator.
  – Users can define and execute a set of coordinator applications.
  – Introduced in Oozie 3.x.
• Why is it required?
  – Data pipeline: large data processing requires a set of interrelated coordinator applications.
  – Operational nightmare: these pipelines are hard for the Service Engineering team to maintain and control.
Bundle Cont.
• Users define the bundle through a new XML file.
• Users can start/stop/suspend/resume/rerun at the bundle level.
• Bundle is optional.
[Figure: Oozie Client, WS API, Oozie Server containing Bundle, Coordinator and Workflow, data availability check, Hadoop]
Enhanced Stability and Scalability
• Issue: At very high load, Oozie becomes slow.
• Impact: 90% of all Oozie support incidents.
• Reason:
  – Many active but non-progressing jobs.
  – Non-progressing jobs consume a lot of resources.
  – The Oozie internal queue is full.
• Resolution:
  – Throttle the number of active jobs/coordinator.
  – Put the job into a timeout state.
  – Enforce uniqueness of Oozie queue elements.
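One way to picture the last resolution item is a bounded queue that refuses an element whose key is already waiting, so repeated checks for the same job cannot fill the queue. This is a sketch under stated assumptions, not Oozie's internal queue. `UniqueQueue`, its methods and the key format are hypothetical.

```python
from collections import deque


class UniqueQueue:
    """Bounded FIFO queue that rejects an element whose key is already queued."""

    def __init__(self, max_size):
        self._items = deque()
        self._keys = set()
        self._max_size = max_size

    def offer(self, key, command):
        """Enqueue command under key; return False on duplicate key or full queue."""
        if key in self._keys or len(self._items) >= self._max_size:
            return False
        self._keys.add(key)
        self._items.append((key, command))
        return True

    def poll(self):
        """Remove and return the oldest command, or None if the queue is empty."""
        if not self._items:
            return None
        key, command = self._items.popleft()
        self._keys.discard(key)  # the key may be queued again from now on
        return command

    def __len__(self):
        return len(self._items)
```

Once a command has been polled, the same key can be queued again, so a job is never starved, only de-duplicated while it waits.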
Improved Usability
• Issue: Coordinator job status is not intuitive and confuses Oozie users.
• Impact: User confusion and related Oozie support requests.
• Reason:
  – Status SUCCEEDED doesn't mean the job was successful!
  – Status PREMATER is for Oozie internal use only, but it was exposed to users.
• Resolution:
  – Redesign Coordinator status.
Coordinator Status Redesign
• Current: PREP, PREMATER, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED
• New: PREP, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, DONE_WITH_ERROR, KILLED, FAILED
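The slides do not spell out when each new status applies. As a purely illustrative assumption, the sketch below shows one rule under which SUCCEEDED would mean that every action succeeded, with DONE_WITH_ERROR covering a finished job that had a mix of successes and errors. `final_job_status` is a hypothetical helper, and treating every non-SUCCEEDED action as an error is a simplification.

```python
def final_job_status(action_statuses):
    """Derive a terminal job status from the terminal statuses of its actions.

    Assumed rule (not taken from the slides): any action status other than
    "SUCCEEDED" counts as an error.
    """
    if not action_statuses:
        raise ValueError("need at least one action status")
    succeeded = sum(1 for s in action_statuses if s == "SUCCEEDED")
    if succeeded == len(action_statuses):
        return "SUCCEEDED"
    if succeeded == 0:
        return "FAILED"
    return "DONE_WITH_ERROR"
```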
Future Plan
• Higher scalability: Change the polling-based data-dependency check to a push model through HCatalog and a notification system.
• Adaptability: Graceful handling of Hadoop downtime:
  – If Hadoop is down, block submission.
  – When Hadoop becomes available:
    • Submit the blocked jobs.
    • Auto-resubmit the untraced jobs.
• Monitoring: Rich WS API for application monitoring/alerting.
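The "block submission" idea can be sketched as a thin wrapper that holds jobs while Hadoop is unreachable and flushes them once it is back. This illustrates the plan under stated assumptions and is not an Oozie component. `BlockingSubmitter`, the `is_hadoop_up` probe and the `submit` callback are hypothetical, and resubmission of untraced jobs is not modeled.

```python
class BlockingSubmitter:
    """Hold job submissions while Hadoop is down; flush them when it returns."""

    def __init__(self, is_hadoop_up, submit):
        self._is_hadoop_up = is_hadoop_up  # hypothetical health probe
        self._submit = submit              # hypothetical submission callback
        self._blocked = []

    def submit(self, job):
        """Submit now if Hadoop is up; otherwise block the job. Return True if sent."""
        if self._is_hadoop_up():
            self._submit(job)
            return True
        self._blocked.append(job)
        return False

    def on_hadoop_available(self):
        """Submit all blocked jobs in arrival order; return how many were sent."""
        blocked, self._blocked = self._blocked, []
        for job in blocked:
            self._submit(job)
        return len(blocked)
```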
Future Plan Cont.
• Automatic failover: Using ZooKeeper.
• Load balancing: Through server replication.
• Improved usability:
  – DistCp action
  – Hive action
• Asynchronous data processing.
• Incremental data processing.
• Apache migration: Work initiated.
Q&A
• GitHub link: http://yahoo.github.com/oozie
• Mailing list: Ooziefirstname.lastname@example.org
Mohammad K Islam
kamrul@yahoo-inc.com