
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study

Oozie Introduction, Case Study, and Tips

Also includes an introduction to integrating Kettle and Oozie using Spoon.

PDF download: http://user.cs.tu-berlin.de/~tqiu/Oozie_BigData_Workflow_Scheduler_Case_Study.pdf

During the past three years Oozie has become the de-facto workflow scheduling system for Hadoop. Oozie has proven itself as a scalable, secure and multi-tenant service.

More: http://www.chinahadoop.net/thread-6659-1-1.html

Online Open Course: http://chinahadoop.edusoho.cn/course/19

Video: http://www.youtube.com/watch?v=qzk08ggdIDw&hd=1

Vimeo: http://vimeo.com/84164730



  1. ChinaHadoop Open Course: A Big Data Workflow Scheduling System, an Introduction to Oozie and Related Products. Teng Qiu (邱腾) http://abcn.net/ http://www.fxlive.de/
  2. Outline
     ● Oozie overview
     ● Scenarios where Oozie fits
     ● How Oozie works, and its characteristics
     ● Oozie core components
     ● Oozie in practice, and tips
     ● Oozie programming interfaces
     ● A first look at Kettle, a graphical open-source ETL tool with Oozie support
     ● Summary and outlook
     Berlin | 2014.01.14 | Teng Qiu
  3. Oozie Overview
     ● Workflow engine
     ● Runs a group of Hadoop jobs in sequence
     ● Directed Acyclic Graph (DAG)
     ● Workflow 1:1 Coordinator n:1 Bundle
     ● A coordinator can be triggered by events or run like a cron job; time-based scheduling supports UTC only
     ● XML as the workflow description language: hPDL (Process Definition Language)
     ● Similar to the jPDL used in JBoss jBPM
     ● Control flow nodes steer the execution path: start, end, fail/kill, decision, fork-join
     ● Action nodes:
       ● HDFS, MapReduce, Pig, Hive, Sqoop, Java, SSH, E-Mail, Sub-Workflow
       ● (mkdir, delete, move, chmod, touchz, DistCp)
     ● State is kept in a database: Derby / MySQL
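The node types above can be sketched as a minimal hPDL workflow. This is only a sketch under assumed names: the app name, the action name, and the mkdir path are all made up for illustration.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="demo-wf">
    <start to="fs-step"/>
    <!-- an HDFS (fs) action node: create a working directory (path is illustrative) -->
    <action name="fs-step">
        <fs>
            <mkdir path="${nameNode}/tmp/demo-wf"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- kill node: the error path of every action ends here -->
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```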
  4. Scenarios Where Oozie Fits
     ● Data processing workflows in Hadoop that must run in order
     ● Steps can run both sequentially and in parallel (fork-join)
     ● Notification and handling of results or failures
     ● ETL jobs inside the Hadoop cluster
     ● A replacement for cron jobs inside the Hadoop cluster
  5. Scenarios Where Oozie Fits
     ● Tasks that must run on a schedule, e.g. ETL:
       cron job A, on host hdp01, starts at minute 15 of every hour and processes raw dataset 1
       cron job B, on host hdp05, starts at minute 20 of every hour and processes raw dataset 2
       cron job C, on host hdp11, starts at minute 50 of every hour, reads the results of A and B, and processes them
     ● Tables in an RDBMS => HBase tables / Hive tables
     ● Triggers / stored procedures in an RDBMS => HBase RegionObserver and Endpoint coprocessors
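A cron chain like the one above can be folded into a single workflow that a coordinator triggers every hour. A minimal sketch, assuming hypothetical names, dates, and an HDFS app path; note the times are UTC, the only timezone the coordinator scheduling supports:

```xml
<coordinator-app xmlns="uri:oozie:coordinator:0.4" name="hourly-etl"
                 frequency="${coord:hours(1)}"
                 start="2014-01-14T00:00Z" end="2015-01-01T00:00Z"
                 timezone="UTC">
    <action>
        <workflow>
            <!-- one workflow replaces cron jobs A, B and C; A and B become a
                 fork-join, C becomes the node after the join -->
            <app-path>${nameNode}/user/etl/apps/hourly-etl</app-path>
        </workflow>
    </action>
</coordinator-app>
```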
  6. Scenarios Where Oozie Fits
     ● Data processing workflows in Hadoop that must run in order
     ● Steps can run both sequentially and in parallel (fork-join)
     ● Notification and handling of results or failures
     ● ETL jobs inside the Hadoop cluster
     ● A replacement for cron jobs inside the Hadoop cluster
     ● Suited to batch processing and DWH workloads; not really an option for real-time data processing
  7. How Oozie Works, and Its Characteristics
     ● How it works
       ● Launcher job name pattern: oozie:launcher:T=:W=:A=
       ● Based on the workflow XML, the Oozie server submits a map-only MR job
       ● The map wraps the user-defined action; job.jar and job.xml are submitted to the JobTracker via JobClient
       ● While the action job runs, the map-only launcher job waits => Oozie always ties up one extra map slot
       ● Action status is obtained via callback / polling
       ● Under normal conditions, completion is reported through the callback URL
     ● Characteristics
       ● Load balancing and fault tolerance / retries come from the MapReduce framework
       ● Supports parameterization via the Java EL language
       ● DAG; no retry on failure (Error / Exception / exit code != 0)
       ● But a workflow can be rerun (oozie.wf.rerun.failnodes=true or oozie.wf.rerun.skip.nodes=xxx,yyy,zzz)
  8. Oozie Core Components: Control Flow Nodes
     ● Core components: an introduction to the control flow nodes
  9. Oozie Core Components: Control Flow Nodes
     ● decision node: ${wf:conf("etl_only_do_something") eq "yes"}
     ● fork-join
     ● A bug: OOZIE-1142, fixed after 3.3.2
     ● Workaround: in oozie-site.xml, set oozie.validate.ForkJoin to false
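A sketch of how the decision and fork-join control flow nodes fit together in a workflow; apart from the `wf:conf` test taken from the slide, all node names here are made up:

```xml
<!-- decision: route on a job property -->
<decision name="check-mode">
    <switch>
        <case to="light-step">${wf:conf("etl_only_do_something") eq "yes"}</case>
        <default to="split"/>
    </switch>
</decision>
<!-- fork-join: run two actions in parallel, then rejoin -->
<fork name="split">
    <path start="step-a"/>
    <path start="step-b"/>
</fork>
<!-- actions step-a and step-b, each with <ok to="merge"/>, are omitted here -->
<join name="merge" to="next-step"/>
```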
 10. Oozie Core Components: Action Nodes
     ● HDFS
       ● move, delete, mkdir, chmod, touchz, DistCp
     ● MapReduce
       ● job.xml specifies the M/R classes and directories
     ● Pig / Hive
       ● <job-xml>hive-site.xml</job-xml>
       ● <script>${hiveScript}</script>
     ● SSH
       ● needs a public key... sigh
       ● <host>, <command>, <args>
     ● Sub Workflow
       ● <propagate-configuration/>
 11. Oozie Core Components: Action Nodes
     ● The Sqoop action is rather painful to work with
 12. Oozie Core Components: Action Nodes
     ● Java action
       ● <main-class>
       ● <arg>
       ● <capture-output />
       ● ${wf:actionData('action-node-name')['property-name']}
     // propKey / propVal stand for the property to expose back to Oozie
     String oozieProp = System.getProperty("oozie.action.output.properties");
     if (oozieProp != null) {
         Properties props = new Properties();
         props.setProperty(propKey, propVal);
         File propFile = new File(oozieProp);
         OutputStream os = new FileOutputStream(propFile);
         props.store(os, "Results from oozie task");
         os.close();
     }
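On the workflow side, a main class that writes such a properties file pairs with a java action like the following; the node name, main class, and argument are hypothetical:

```xml
<action name="init-step">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.etl.InitMain</main-class>
        <arg>${currentDate}</arg>
        <!-- makes the properties written by the main class available to later
             nodes as ${wf:actionData('init-step')['property-name']} -->
        <capture-output/>
    </java>
    <ok to="next-step"/>
    <error to="fail"/>
</action>
```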
 13. Oozie Core Components: Action Nodes
     ● Custom actions
       ● Implement the ActionExecutor interface
       ● Call super(ACTION_TYPE) in the constructor
       ● ActionExecutor.Context
       ● start / end / kill / check
     ● Edit oozie-site.xml
       ● Add the custom class name to the property
       ● oozie.service.ActionService.executor.ext.classes
     ● Perhaps one could write one for Impala?
 14. Oozie in Practice, and Tips: Scenario
     ● A typical DMP (Data Management Platform) ETL application
     ● Aggregate user behavior and classify users
     ● User behavior tables 1..n (TTL = 30 days), user | time | item:
       A | 101 | XXX
       B | 102 | YYY
       A | 103 | ZZZ
     ● Item category table, item | categories:
       XXX | 1,2,3
       YYY | 4,3,2
       ZZZ | 7,9,8
     ● Final result, external user ID | classification | generation:
       A1 | 1,7,2,3,9,8 | 0
       B1 | 4,3,2 | 0
     ● Intermediate tables:
       internal user classification: A -> 1,7,2,3,9,8 | B -> 4,3,2
       internal-to-external user ID mapping: A -> A1 | B -> B1
 15. Oozie in Practice, and Tips: The Workflow
     ● START → ZKClient getGen and checkTime:
       1) get old and new generation
       2) compare lastImportedTime vs. lastExportedTime
     ● decision: is there new data?
       ● No → E-Mail client (MSG: nothing to export) → END successful
     ● Yes → fork:
       coprocessor client aggregates X events
       coprocessor client aggregates Y events
       coprocessor client aggregates Z events
     ● join → Hive script to generate the export table
     ● fork:
       Hive/FTP script to create/send export files for A
       Hive/FTP script to create/send export files for B
       Hive/FTP script to create/send export files for C
     ● join → ZKClient setGen: set the new generation
     ● On errors after the coprocessors: ZKClient-fail-after-coproc sets the generation back, an E-Mail client sends MSG: failed after coproc, and the workflow is KILLED with ERROR
     ● On ZK client errors: E-Mail client sends MSG: failed by ZK Client
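The first half of this diagram could translate into an hPDL skeleton along the following lines. This is only a sketch: the action bodies are elided as comments, and the captured property name `hasNewData` is invented for illustration.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="dmp-export">
    <start to="zk-get-gen"/>
    <!-- java action "zk-get-gen": ZKClient getGen and checkTime,
         with <capture-output/>, <ok to="new-data"/> -->
    <decision name="new-data">
        <switch>
            <case to="mail-nothing">${wf:actionData('zk-get-gen')['hasNewData'] eq 'no'}</case>
            <default to="aggregate"/>
        </switch>
    </decision>
    <fork name="aggregate">
        <path start="agg-x"/>
        <path start="agg-y"/>
        <path start="agg-z"/>
    </fork>
    <!-- java actions agg-x / agg-y / agg-z call the HBase coprocessor clients,
         each with <ok to="agg-done"/> and <error to="zk-fail-after-coproc"/> -->
    <join name="agg-done" to="hive-export-table"/>
    <!-- then: the hive action, a second fork/join for the A/B/C Hive/FTP export
         scripts, the "zk-set-gen" java action, and the e-mail actions -->
    <kill name="killed">
        <message>failed after coproc</message>
    </kill>
    <end name="end"/>
</workflow-app>
```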
 16. Oozie in Practice, and Tips: The Actions Involved
     ● ETL scenario: DMP data aggregation, processing, and export
     ● decision / fork-join
     ● Java (HBase, ZooKeeper)
     ● Hive
     ● E-Mail
 17. Oozie in Practice, and Tips: Step One of the Long March, Running It
     ● Ways to use Oozie
       ● command line
       ● Java client API / REST API
       ● Hue
     jobTracker=xxx:8021
     nameNode=xxx:8020
     oozie.coord.application.path=${workflowRoot}/coordinator.xml
     oozie.wf.application.path=${workflowRoot}/workflow.xml
     $ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run
     $ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W
     $ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W
     $ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000001-130104191423486-oozie-oozi-W
     ● ShareLib
       ● /usr/lib/oozie/oozie-sharelib.tar.gz
       ● sudo -u oozie hadoop fs -put share /user/oozie/
       ● in job.properties: oozie.use.system.libpath=true
       ● oozie.service.WorkflowAppService.system.libpath
       ● oozie.libpath=${nameNode}/xxx/xxx/jars
 18. Oozie in Practice, and Tips: It Won't Run?
     ● Permission problems
       ● Error: E0902 : E0902: Exception occured: [org.apache.hadoop.ipc.RemoteException: User: oozie is not allowed to impersonate xxx]
       ● Set in core-site.xml:
         ● hadoop.proxyuser.oozie.groups
         ● hadoop.proxyuser.oozie.hosts
     ● The fork-join bug
       ● Error: E0735 : E0735: There was an invalid "error to" transition to node [xxx] while using fork/join
       ● OOZIE-1142
       ● In oozie-site.xml, set oozie.validate.ForkJoin to false
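The impersonation fix amounts to a core-site.xml fragment along these lines; the wildcard values are only illustrative, and tighter group and host lists are preferable in production:

```xml
<!-- allow the oozie user to impersonate the users submitting workflows -->
<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>
```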
 19. Oozie in Practice, and Tips: All Kinds of Trouble With HBase?
     ● hbase-site.xml
       ● Oozie does not support HBase, so it knows nothing about HBase's ZooKeeper settings etc.
     Configuration conf = new Configuration();
     conf.addResource("hbase-site.xml");
     conf.reloadConfiguration();
     ● If you are unlucky enough to need sqoop + hbase
       ● there is an hbase-xxx.jar under /lib/sqoop/ in the sharelib
       ● replace the hbase-site.xml inside that jar!?
       ● or put hbase-site.xml into oozie/share/lib/sqoop/ with hadoop fs
 20. Oozie in Practice, and Tips: Hive Errors of Every Kind
     ● Every hive action node must specify hive-site.xml via <job-xml>
     ● FAILED: Error in metadata
       ● NestedThrowables: JDOFatalInternalException or InvocationTargetException
       ● The driver for the metastore database
       ● e.g. for the MySQL Java Connector, check whether mysql-connector-java-xxx-bin.jar is in the workflow's lib directory
     ● Directory permissions
       ● Hive's warehouse and tmp directories must be writable for the user launching the Oozie job
     ● If HBase is to be integrated
       ● auxpath and the ZooKeeper settings in hive-site.xml
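A hive action wired up as described might look like the following sketch; the node name, script parameter, and target nodes are hypothetical, and the metastore JDBC driver jar still has to sit in the workflow's lib/ directory on HDFS:

```xml
<action name="hive-step">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- every hive action node needs the site config -->
        <job-xml>hive-site.xml</job-xml>
        <script>${hiveScript}</script>
        <param>dateFrom=${dateFrom}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>
```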
 21. Oozie in Practice, and Tips: Global Properties
     ● Global properties and job-xml
     <workflow-app name="xxx">
         <global>
             <job-xml>${hiveSite}</job-xml>
             <configuration>
                 <property>
                     <name>mapred.child.java.opts</name>
                     <value>-Xmx2048m</value>
                 </property>
                 <property>
                     <name>oozie.launcher.mapred.child.java.opts</name>
                     <value>-server -Xmx2G -Djava.net.preferIPv4Stack=true</value>
                 </property>
             </configuration>
         </global>
         ...
 22. Oozie in Practice, and Tips: Global Properties
     ● Property checks and substitution
     <workflow-app name="">
         <parameters>
             <property>
                 <!-- if the current_month variable is not set, this fails with Error: E0738 -->
                 <name>current_month</name>
             </property>
             <property>
                 <!-- if the current_date variable is not set, this falls back to '' -->
                 <name>currentDate</name>
                 <value>${concat(concat("'", wf:conf('current_date')), "'")}</value>
             </property>
             <property>
                 <name>dateFrom</name>
                 <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-01'))), "'")}</value>
             </property>
             <property>
                 <name>dateTo</name>
                 <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-31'))), "'")}</value>
             </property>
         </parameters>
         ...
 23. Oozie in Practice, and Tips: Variable Names and Their Use
     ● Variables that match the naming rule ( [A-Za-z_][0-9A-Za-z_]* )
       ● ${xxx} or wf:conf(xxx)
       ● ${wf:conf("etl_only_do_something") eq "yes"}
     ● Variables that do not match the rule (e.g. input.path)
       ● ${input.path}
     ● No minus sign allowed
       ● it cannot be written as input-path
 24. Oozie in Practice, and Tips: Collecting KPI Values While the Workflow Runs
     ● MapReduce action / Pig action
       ● hadoop:counters
       ● ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}
     ● Java / SSH action
       ● <capture-output />
       ● ${wf:actionData('java-action-node-name')['property-name']}
       ● ${wf:action:output('ssh-action-node-name')['property-name']}
     ● Hive: no good way
       ● hive -e -S
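As an illustration, such a counter value can be dropped straight into an email action's body; the node name, recipient address, and target nodes here are made up:

```xml
<action name="report">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>etl-team@example.com</to>
        <subject>KPI report for ${wf:id()}</subject>
        <!-- EL is evaluated inside the body, so the counter value is inlined -->
        <body>Bytes read by the aggregation step: ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>
```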
 25. Oozie in Practice, and Tips: Passing a Java Action's Output Back to Oozie
     ● Using the Java action's output as variables
       ● <capture-output />
       ● Write a Properties file from the program
     String oozieProp = System.getProperty("oozie.action.output.properties");
     if (oozieProp != null) {
         Properties props = new Properties();
         props.setProperty("last.import.date", "2013-12-01T00:00:00Z"); // ISO-8601 date format
         File propFile = new File(oozieProp);
         OutputStream os = new FileOutputStream(propFile);
         props.store(os, "Results from oozie task");
         os.close();
     }
 26. Oozie in Practice, and Tips: Using the Java Action's Output
     ● Java
       ● The output can be passed into another action as main() arguments
       ● Or used in a decision node
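Both uses can be sketched in hPDL. Assuming a preceding java action named `init-step` ran with `<capture-output/>` and exposed a `last.import.date` property, the node names, classes, and the date threshold below are all illustrative:

```xml
<!-- pass a captured property into the next java action as a main() argument -->
<action name="consume">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.etl.ConsumeMain</main-class>
        <arg>${wf:actionData('init-step')['last.import.date']}</arg>
    </java>
    <ok to="fresh-check"/>
    <error to="fail"/>
</action>
<!-- or branch on the same property in a decision node -->
<decision name="fresh-check">
    <switch>
        <case to="import">${wf:actionData('init-step')['last.import.date'] lt '2014-01-01T00:00:00Z'}</case>
        <default to="end"/>
    </switch>
</decision>
```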
 27. Oozie in Practice, and Tips: Collecting Output Variables Carries Risks Too
     ● There is a default size limit on an Oozie action's output data, and it is only 2K!
     Failing Oozie Launcher, Output data size [4321] exceeds maximum [2048]
     Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
     org.apache.oozie.action.hadoop.LauncherException
         at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)
     ● Modify oozie-site.xml:
     <property>
         <name>oozie.action.max.output.data</name>
         <value>1048576</value>
     </property>
     ● This raises the limit to 1M
     ● And then... Oozie has to be restarted
 28. Oozie Programming Interfaces
     ● Oozie Web Services API
       ● HTTP REST API
       ● curl -X POST -H "Content-Type: application/xml" -d @config.xml "http://localhost:11000/oozie/v1/jobs?action=start"
     ● Oozie Java client API
     import org.apache.oozie.client.OozieClient;
     new OozieClient(String oozie_url)
     create a Properties object
     String jobId = oozieClient.run(Properties prop)
     org.apache.oozie.client.WorkflowJob
     WorkflowJob job = oozieClient.getJobInfo(String jobID);
 29. Kettle, a Graphical Open-Source ETL Tool
     ● A few of Oozie's limitations
       ● confined to the inside of the Hadoop cluster
       ● what about HBase?
     ● A first look at Kettle, a graphical open-source ETL tool with Oozie support
       ● Job / Transformation
       ● HBase Input / Output
 30. Summary and Outlook
     ● An effective replacement for cron jobs inside a Hadoop cluster
     ● Tightly integrated with Hadoop; user permissions can be managed in one place
     ● Error alerting and handling (rerun) for workflow nodes
     ● Flexible workflow control through the control flow nodes
     ● Compared with Azkaban, supports more task types
       ● but at a price: it always ties up one map slot
     ● Compared with Azkaban, supports variables and the EL language
     ● The coordinator offers an event-triggered start mode
     ● Rich APIs
     ● No HBase support
     ● Laborious XML authoring
