Everything you wanted to know, but were afraid to ask about Oozie

Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.

View the HD video of this talk here: http://vimeo.com/chug/oozie-overview

Published in: Technology

Transcript

  • 1. Everything that you ever wanted to know about Oozie, but were afraid to ask. B. Lublinsky, A. Yakubovich
  • 2. Apache Oozie
      • Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
      • A single Oozie server implements all four functional Oozie components:
          – Oozie workflow
          – Oozie coordinator
          – Oozie bundle
          – Oozie SLA
  • 3. Main components
      [Architecture diagram. Labels: Oozie Server (Bundle; Coordinator with time condition monitoring and data condition monitoring; workflow with wf logic and actions); 3rd party application via WS API; Oozie Command Line Interface (job submission and monitoring); definitions, states; Oozie shared libraries; HDFS; MapReduce; Data; Hadoop.]
  • 4. Oozie workflow
  • 5. Workflow Language
      Flow-control node | XML element type | Description
      Decision | workflow:DECISION | expresses “switch-case” logic
      Fork | workflow:FORK | splits one path of execution into multiple concurrent paths
      Join | workflow:JOIN | waits until every concurrent execution path of a previous fork node arrives at it
      Kill | workflow:kill | forces a workflow job to kill (abort) itself

      Action node | XML element type | Description
      java | workflow:JAVA | invokes the main() method from the specified java class
      fs | workflow:FS | manipulates files and directories in HDFS; supports commands: move, delete, mkdir
      MapReduce | workflow:MAP-REDUCE | starts a Hadoop map/reduce job, which could be a java MR job, a streaming job, or a pipes job
      Pig | workflow:pig | runs a Pig job
      Sub workflow | workflow:SUB-WORKFLOW | runs a child workflow job
      Hive * | workflow:HIVE | runs a Hive job
      Shell * | workflow:SHELL | runs a Shell command
      ssh * | workflow:SSH | starts a shell command on a remote machine as a remote secure shell
      Sqoop * | workflow:SQOOP | runs a Sqoop job
      Email * | workflow:EMAIL | sends emails from an Oozie workflow application
      Distcp ? | | Under development (Yahoo)
  • 6. Workflow actions
      • Oozie workflow supports two types of actions:
          – Synchronous, executed inside Oozie runtime
          – Asynchronous, executed as a Map Reduce job.
      • [Sequence diagram. Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient. Calls:]
          1 : workflow := getWorkflow()
          2 : action := getAction()
          3 : context := init<>()
          4 : executor := get()
          5 : start()
          6 : submitLauncher()
          7 : jobClient := get()
          8 : runningJob := submit()
          9 : setStartData()
  • 7. Workflow lifecycle
      [State diagram. States: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED.]
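The slide names the workflow job states but the transcript loses the arrows between them. As a rough sketch, the states can be modeled as a Java enum. The transitions below are an assumption based on the transitions described in the Oozie workflow specification, not something recoverable from this transcript, and the names `WfState`, `next`, `canMoveTo`, and `isTerminal` are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

/** Workflow job states named on the slide; the transitions are assumed, not read off the slide. */
enum WfState {
    PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED;

    /** States assumed to be reachable from this one in a single step. */
    Set<WfState> next() {
        switch (this) {
            case PREP:
                return EnumSet.of(RUNNING, KILLED);
            case RUNNING:
                return EnumSet.of(SUSPENDED, SUCCEEDED, KILLED, FAILED);
            case SUSPENDED:
                return EnumSet.of(RUNNING, KILLED);
            default:
                // SUCCEEDED, KILLED and FAILED are treated as terminal states
                return EnumSet.noneOf(WfState.class);
        }
    }

    boolean canMoveTo(WfState target) {
        return next().contains(target);
    }

    boolean isTerminal() {
        return next().isEmpty();
    }
}
```

A client polling job status could use `isTerminal()` to decide when to stop polling.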
  • 8. Oozie execution console
  • 9. Extending Oozie workflow
      • Oozie provides a “minimal” workflow language, which contains only a handful of control and action nodes.
      • Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes make it possible to extend Oozie’s language with additional actions (verbs).
      • Creating a custom action requires the following:
          – A Java action implementation, which extends the ActionExecutor class.
          – An XML schema for the action, defining the action’s configuration parameters.
          – Packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war.
          – Extending oozie-site.xml to register information about the custom executor with the Oozie runtime.
  • 10. Oozie Workflow Client
      • Oozie provides an easy way to integrate with enterprise applications through Oozie client APIs. It provides two types of APIs:
      • REST HTTP API: a number of HTTP requests
          – Info requests (job status, job configuration)
          – Job management (submit, start, suspend, resume, kill)
          – Example: job definition info request
                GET /oozie/v0/job/job-ID?show=definition
      • Java API: package org.apache.oozie.client
          – OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
          – WorkflowJob, WorkflowAction
          – CoordinatorJob, CoordinatorAction
          – SLAEvent
  • 11. Oozie workflow good, bad and ugly
      • Good
          – Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs
          – Nice UI for tracking execution progress
          – Simple APIs for integration with other applications
          – Simple extensibility APIs
      • Bad
          – Process has to be expressed directly in hPDL with no visual support
          – No support for uber jars (but we added our own)
      • Ugly
          – Static forking (but you can regenerate the workflow and invoke it on the fly)
          – No support for loops
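The job definition request from slide 10 can be issued with nothing but the JDK. The sketch below builds the request URI in the `/v0/job/<job-ID>?show=definition` form shown on the slide and reads the response body. The class name `OozieRestUrls`, the base URL `http://localhost:11000/oozie`, and the placeholder job ID are assumptions for illustration, and the job ID is assumed to contain only URL-safe characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class OozieRestUrls {
    /** Builds <base>/v0/job/<jobId>?show=<show>; inputs are assumed to be URL-safe. */
    static URI jobInfo(String oozieBase, String jobId, String show) {
        String base = oozieBase.endsWith("/")
                ? oozieBase.substring(0, oozieBase.length() - 1)
                : oozieBase;
        return URI.create(base + "/v0/job/" + jobId + "?show=" + show);
    }

    /** Sends a GET request and returns the response body as text. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Base URL and job ID are placeholders; pass real values as arguments.
        String base = args.length > 0 ? args[0] : "http://localhost:11000/oozie";
        String jobId = args.length > 1 ? args[1] : "job-ID";
        System.out.println(fetch(jobInfo(base, jobId, "definition")));
    }
}
```

Running `main` needs a reachable Oozie server; `jobInfo` alone can be used to check the URL shape offline.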
  • 12. Oozie Coordinator
  • 13. Coordinator language
      Element type | Description | Attributes and sub-elements
      coordinator-app | top-level element in a coordinator instance | frequency, start, end
      controls | specifies the execution policy for the coordinator and its elements (workflow actions) | timeout (actions), concurrency (actions), execution order (workflow instances)
      action | required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances | workflow name
      datasets | collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances |
      input event | specifies the input conditions (in the form of present data sets) that are required in order to execute a coordinator action |
      output event | specifies the dataset that should be produced by a coordinator action |
  • 14. Coordinator lifecycle
  • 15. Oozie Bundle
  • 16. Bundle lifecycle
      [State diagram. States: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED.]
  • 17. Oozie SLA
  • 18. SLA Navigation
      [Entity diagram. Tables and columns shown:]
      • COORD_JOBS: id, app_name, app_path, …
      • WF_JOBS: id, app_name, app_path, …
      • SLA_EVENT: event_id, alert_contact, alert-frequency, …, sla_id, ...
      • COORD_ACTIONS: id, action_number, action_xml, …, external_id, ...
      • WF_ACTIONS: id, conf, console_url, …
  • 19. Using Probes to analyze/monitor Places
      • Select probe data for specified time/location
      • Validate, filter, and transform probe data
      • Calculate statistics on available probe data
      • Distribute data per geo-tiles
      • Calculate place statistics (e.g. attendance index)
      If an exception condition happens, report failure.
      If all steps succeeded, report success.
  • 20. Workflow as acyclic graph
  • 21. Workflow – fragment 1
  • 22. Workflow – fragment 2
  • 23. Oozie tips and tricks
  • 24. Configuring workflow
      • Oozie provides 3 overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of command line invocations.
      • Oozie processes these three sets of parameters as follows:
          – Use all of the parameters from the command line invocation
          – For remaining unresolved parameters, use the job properties
          – Use config-default.xml for everything else
      • Although the documentation does not describe clearly when to use which, the overall recommendation is as follows:
          – Use config-default.xml for parameters that never change for a given workflow
          – Use job properties for parameters that are common to a given deployment of a workflow
          – Use command line arguments for parameters that are specific to a given workflow invocation.
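The precedence described on slide 24 amounts to a layered merge in which later layers overwrite earlier ones. The sketch below only illustrates that ordering with plain `java.util.Properties`; it is not how Oozie implements it, and the class name `WorkflowConfigResolver` and the sample keys are hypothetical.

```java
import java.util.Properties;

final class WorkflowConfigResolver {
    /**
     * Merges the three parameter sources so that command line values win over
     * job properties, which in turn win over config-default.xml values.
     */
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        Properties merged = new Properties();
        merged.putAll(configDefault);  // lowest priority
        merged.putAll(jobProperties);  // overrides the defaults
        merged.putAll(commandLine);    // highest priority
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("queueName", "default");
        defaults.setProperty("retries", "3");

        Properties job = new Properties();
        job.setProperty("queueName", "analytics");

        Properties cli = new Properties();
        cli.setProperty("inputDate", "2012-12-13");

        Properties resolved = resolve(defaults, job, cli);
        for (String key : resolved.stringPropertyNames()) {
            System.out.println(key + "=" + resolved.getProperty(key));
        }
    }
}
```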
  • 25. Accessing and storing process variables
      • Accessing
          – Through the arguments in java main
      • Storing
            import java.io.File;
            import java.io.FileOutputStream;
            import java.io.OutputStream;
            import java.util.Properties;

            // Oozie passes the path of the action's output properties file in this system property
            String ooziePropFileName = System.getProperty("oozie.action.output.properties");
            Properties props = new Properties();
            props.setProperty(key, value);
            // try-with-resources closes the stream even if store() throws
            try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
                props.store(os, "");
            }
  • 26. Validating data presence
      • Oozie provides two possible approaches for validating the presence of resource files:
          – Using the Oozie coordinator’s input events based on the data set. This is technically the simplest implementation approach, but it does not provide the more complex decision support that might be required. It simply either runs the corresponding workflow or does not.
          – A custom java node inside the Oozie workflow. This allows extending the decision logic, for example by sending notifications about data absence or running execution on partial data under certain timing conditions.
      • Additional configuration parameters for the Oozie coordinator (for example, the ability to wait for files to arrive) can expand the usage of the Oozie coordinator.
  • 27. Invoking Map Reduce jobs
      • Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.
      • Invoking a Map Reduce job with the java action is somewhat similar to invoking the job with the Hadoop command line from the edge node. You specify a driver as the class for the java activity and Oozie invokes the driver. This approach has two main advantages:
          – The same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process.
          – A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.
      • The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
  • 28. Implementing predefined looping and forking
      • hPDL is an XML document with a well-defined schema.
      • This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
      • This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.
      • The other option is to create a template process and modify it based on calculated parameters.
  • 29. Oozie client security (or lack of)
      • By default the Oozie client reads the client’s identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
      • Impersonation can be implemented by overriding the OozieClient class’s createConfiguration method, where client variables can be set through a new constructor.
            @Override
            public Properties createConfiguration() {
                Properties conf = new Properties();
                // user is a field set by the new constructor; fall back to the OS user when it is absent
                if (user == null)
                    conf.setProperty(USER_NAME, System.getProperty("user.name"));
                else
                    conf.setProperty(USER_NAME, user);
                return conf;
            }
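The kind of decision logic a custom java node could add on top of a plain presence check might look like the sketch below: run on full data, wait, run on partial data once a deadline passes, or report absence. Everything here is hypothetical, including the class `DataPresenceCheck`, the `Decision` values, and the thresholds. The local filesystem (`java.nio.file`) stands in for HDFS, which a real node would query through the Hadoop FileSystem API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;

final class DataPresenceCheck {
    enum Decision { RUN_FULL, RUN_PARTIAL, WAIT, REPORT_ABSENT }

    /**
     * Decides what to do given how many expected inputs are present and how
     * long the process has already waited for them.
     */
    static Decision decide(int present, int expected, Duration waited, Duration maxWait) {
        if (present >= expected) {
            return Decision.RUN_FULL;       // everything arrived
        }
        if (waited.compareTo(maxWait) < 0) {
            return Decision.WAIT;           // still inside the waiting window
        }
        // Deadline passed: run on what is there, or report that nothing arrived.
        return present > 0 ? Decision.RUN_PARTIAL : Decision.REPORT_ABSENT;
    }

    /** Counts the inputs that exist; local paths stand in for HDFS paths here. */
    static int countPresent(List<Path> inputs) {
        int n = 0;
        for (Path p : inputs) {
            if (Files.exists(p)) {
                n++;
            }
        }
        return n;
    }
}
```

The resulting decision could be written to the action's output properties so that a decision node in the workflow can branch on it.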
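A shutdown hook of the kind slide 27 calls for can be sketched with the JDK alone. The `RunningJobHandle` interface below is a hypothetical stand-in for whatever handle the driver holds on the submitted job; in a real driver it would wrap the Hadoop job object. The class and method names are likewise assumptions.

```java
final class DriverShutdown {
    /** Hypothetical stand-in for a handle on a submitted Map Reduce job. */
    interface RunningJobHandle {
        boolean isComplete();
        void kill();
    }

    /** Creates (but does not register) a hook thread that kills the job if it is still running. */
    static Thread newKillHook(RunningJobHandle job) {
        return new Thread(() -> {
            if (!job.isComplete()) {
                job.kill();
            }
        }, "kill-lingering-mr-job");
    }

    /** Registers the hook with the JVM so it runs when the driver process is shut down. */
    static Thread install(RunningJobHandle job) {
        Thread hook = newKillHook(job);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    /** Removes the hook once the job has finished normally. */
    static void uninstall(Thread hook) {
        Runtime.getRuntime().removeShutdownHook(hook);
    }
}
```

A driver would call `install` right after submitting the job and `uninstall` after the job completes, so that the hook only acts when the launcher itself is terminated.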
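To illustrate generating a workflow with a calculated number of fork branches, the sketch below emits hPDL as text. It deliberately swaps plain string building in for the JAXB objects mentioned on the slide, to stay dependency-free. The class name, node names, the `uri:oozie:workflow:0.2` namespace version, and the `${jobTracker}`/`${nameNode}` parameters are assumptions. Inputs are assumed to need no XML escaping, and the two-branch minimum reflects our recollection that the hPDL schema expects at least two paths in a fork.

```java
final class ForkWorkflowGenerator {
    /** Generates a workflow that forks into the given number of java actions and joins them. */
    static String generate(String appName, String mainClass, int branches) {
        if (branches < 2) {
            throw new IllegalArgumentException("a fork needs at least two branches");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<workflow-app xmlns=\"uri:oozie:workflow:0.2\" name=\"")
           .append(appName).append("\">\n");
        xml.append("  <start to=\"fork-node\"/>\n");

        // One <path> per calculated branch
        xml.append("  <fork name=\"fork-node\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("  </fork>\n");

        // One java action per branch; each receives its branch index as an argument
        for (int i = 0; i < branches; i++) {
            xml.append("  <action name=\"branch-").append(i).append("\">\n");
            xml.append("    <java>\n");
            xml.append("      <job-tracker>${jobTracker}</job-tracker>\n");
            xml.append("      <name-node>${nameNode}</name-node>\n");
            xml.append("      <main-class>").append(mainClass).append("</main-class>\n");
            xml.append("      <arg>").append(i).append("</arg>\n");
            xml.append("    </java>\n");
            xml.append("    <ok to=\"join-node\"/>\n");
            xml.append("    <error to=\"fail\"/>\n");
            xml.append("  </action>\n");
        }

        xml.append("  <join name=\"join-node\" to=\"end\"/>\n");
        xml.append("  <kill name=\"fail\"><message>A branch failed</message></kill>\n");
        xml.append("  <end name=\"end\"/>\n");
        xml.append("</workflow-app>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("generated-fork", "com.example.TileProcessor", 3));
    }
}
```

The generated document would still need to be validated against the hPDL schema and uploaded to HDFS before submission.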
  • 30. Uber jars with Oozie
      • An uber jar contains resources: other jars, so libraries, zip files.
      • [Diagram. Labels: Oozie server; java action; uber jar; launcher classes; jars, so, zip; mapper. Launcher steps shown:]
          – unpack resources to current dir
          – set inverse classloader
          – invoke MR driver, pass arguments
          – set shutdown hook
          – ‘wait for complete’
      • Java action fragment:
            <java>
                …
                <main-class>${wfUberLauncher}</main-class>
                <arg>-appStart=${wfAppMain}</arg>
                …
            </java>
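The `-appStart=` argument in the fragment above suggests a launcher that picks out the real driver class and hands it the remaining arguments. The sketch below shows only that dispatch step, under the assumption that this is how the argument is used. It omits unpacking the uber jar, the inverse classloader, and the shutdown hook, and the class name `UberLauncherSketch` is hypothetical; it is not the authors' launcher.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class UberLauncherSketch {
    static final String APP_START = "-appStart=";

    /** Returns the driver class name given in the -appStart=<class> argument. */
    static String appStartClass(String[] args) {
        for (String a : args) {
            if (a.startsWith(APP_START)) {
                return a.substring(APP_START.length());
            }
        }
        throw new IllegalArgumentException("missing " + APP_START + "<class>");
    }

    /** Returns all arguments except -appStart=..., to be passed on to the driver. */
    static String[] passThroughArgs(String[] args) {
        List<String> rest = new ArrayList<>();
        for (String a : args) {
            if (!a.startsWith(APP_START)) {
                rest.add(a);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        // Load the driver through the context classloader and call its static main(String[])
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        Class<?> driver = Class.forName(appStartClass(args), true, loader);
        Method main = driver.getMethod("main", String[].class);
        main.invoke(null, (Object) passThroughArgs(args));
    }
}
```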