Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study

Oozie Introduction, Case Study, and Tips

Also includes an introduction to integrating Kettle and Oozie using Spoon.

PDF download: http://user.cs.tu-berlin.de/~tqiu/Oozie_BigData_Workflow_Scheduler_Case_Study.pdf

During the past three years, Oozie has become the de facto workflow scheduling system for Hadoop. It has proven itself as a scalable, secure, multi-tenant service.

More: http://www.chinahadoop.net/thread-6659-1-1.html

Online Open Course: http://chinahadoop.edusoho.cn/course/19

Video: http://www.youtube.com/watch?v=qzk08ggdIDw&hd=1

Vimeo: http://vimeo.com/84164730

Transcript

  • 1. ChinaHadoop Open Course: Big Data Workflow Scheduling System, an Introduction to Oozie and Related Products. Teng Qiu (邱腾). http://abcn.net/ http://www.fxlive.de/
  • 2. Agenda
    ● Oozie overview
    ● Scenarios where Oozie fits
    ● How Oozie works and its characteristics
    ● Oozie core components
    ● Oozie in practice, with tips
    ● Oozie programming interfaces
    ● A first look at Kettle, a graphical open-source ETL tool with Oozie support
    ● Summary and outlook
  • 3. Oozie Overview
    ● A workflow engine: runs a set of Hadoop jobs in sequence
    ● Workflows are directed acyclic graphs (DAG)
    ● Workflow 1:1 Coordinator n:1 Bundle
    ● A coordinator can be triggered by events or run like a cron job; time-based scheduling supports UTC only
    ● XML is the workflow description language, hPDL (Process Definition Language), similar to the jPDL used in JBoss jBPM
    ● Control flow nodes steer the execution path: start, end, fail / kill, decision, fork-join
    ● Action nodes: HDFS (mkdir, delete, move, chmod, touchz, DistCp), MapReduce, Pig, Hive, Sqoop, Java, SSH, E-Mail, Sub-Workflow
    ● State is kept in a database (Derby / MySQL)
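  A minimal workflow.xml sketch of the hPDL structure described above; the node names, the fs action, and the output path are illustrative choices, not taken from the slides:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="first-action"/>
        <action name="first-action">
            <!-- placeholder HDFS action; any other action type could sit here -->
            <fs>
                <mkdir path="${nameNode}/tmp/demo-output"/>
            </fs>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>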
  • 4. Scenarios Where Oozie Fits
    ● Data processing workflows in Hadoop that must run in a defined order
    ● Both sequential execution and parallel processing (fork-join)
    ● Notification and handling of results and failures
    ● ETL jobs inside the Hadoop cluster
    ● Replacing cron jobs inside the Hadoop cluster
  • 5. Scenarios Where Oozie Fits
    ● Tasks that must run on a schedule, e.g. ETL (see the coordinator sketch below):
      cron job A on host hdp01 starts at minute 15 of every hour and processes raw data set 1
      cron job B on host hdp05 starts at minute 20 of every hour and processes raw data set 2
      cron job C on host hdp11 starts at minute 50 of every hour, reads the results of A and B, and processes them further
    ● Tables in an RDBMS => HBase tables / Hive tables
    ● RDBMS triggers / stored procedures => HBase RegionObserver and Endpoint coprocessors
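  A cron job like job A above would typically become an Oozie coordinator. A minimal coordinator.xml sketch, assuming an hourly frequency and a made-up application path; the start/end times are illustrative and must be given in UTC:

    <coordinator-app name="etl-a-coord" frequency="${coord:hours(1)}"
                     start="2014-01-01T00:15Z" end="2015-01-01T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.2">
        <action>
            <workflow>
                <!-- hypothetical HDFS path of the workflow that replaces cron job A -->
                <app-path>${nameNode}/user/etl/apps/job-a-workflow</app-path>
            </workflow>
        </action>
    </coordinator-app>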
  • 6. Scenarios Where Oozie Fits
    ● Data processing workflows in Hadoop that must run in a defined order
    ● Both sequential execution and parallel processing (fork-join)
    ● Notification and handling of results and failures
    ● ETL jobs inside the Hadoop cluster
    ● Replacing cron jobs inside the Hadoop cluster
    ● Suited to batch processing and DWH; real-time data processing is unlikely to be a fit
  • 7. How Oozie Works and Its Characteristics
    ● How it works
      ● The launcher job is named oozie:launcher:T=:W=:A=
      ● From the workflow XML, the Oozie server submits a map-only MapReduce job
      ● The map task wraps the user-defined action and submits job.jar and job.xml to the JobTracker via JobClient
      ● While the action job runs, the map-only launcher job waits => Oozie always occupies one extra map slot
      ● Action status is obtained via callback / polling; normally completion is reported through the callback URL
    ● Characteristics
      ● Load balancing and fault tolerance / retry come from the MapReduce framework
      ● Supports parameterization with the Java EL language
      ● Workflows are DAGs; there is no automatic retry on failure (Error / Exception / exit code != 0)
      ● But a workflow can be rerun (oozie.wf.rerun.failnodes=true or oozie.wf.rerun.skip.nodes=xxx,yyy,zzz)
  • 8. Oozie Core Components: Control Flow Nodes
    ● Oozie core components (introduction to the control flow nodes)
  • 9. Oozie Core Components: Control Flow Nodes
    ● decision node: ${wf:conf("etl_only_do_something") eq "yes"}
    ● fork-join
      ● A bug: OOZIE-1142, fixed after 3.3.2
      ● Workaround: in oozie-site.xml, set oozie.validate.ForkJoin to false
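  A sketch of a decision node built around the EL predicate from this slide; the node names it transitions to are illustrative:

    <decision name="check-etl-flag">
        <switch>
            <!-- take this branch only when the flag is set to "yes" -->
            <case to="do-something">${wf:conf("etl_only_do_something") eq "yes"}</case>
            <default to="skip-step"/>
        </switch>
    </decision>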
  • 10. Oozie Core Components: Action Nodes
    ● HDFS: move, delete, mkdir, chmod, touchz, DistCp
    ● MapReduce: job.xml specifies the M/R classes and directories
    ● Pig / Hive: <job-xml>hive-site.xml</job-xml>, <script>${hiveScript}</script>
    ● SSH: needs public-key setup (sigh); <host>, <command>, <args>
    ● Sub Workflow: <propagate-configuration/>
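  A Hive action sketch that puts the <job-xml> and <script> elements above together; the action name, transition targets, and the optional <param> are illustrative:

    <action name="run-hive-script">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hive-site.xml</job-xml>
            <script>${hiveScript}</script>
            <!-- pass a workflow variable into the Hive script -->
            <param>dateFrom=${dateFrom}</param>
        </hive>
        <ok to="next-step"/>
        <error to="fail"/>
    </action>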
  • 11. Oozie Core Components: Action Nodes
    ● The Sqoop action is rather maddening to work with
  • 12. Oozie Core Components: Action Nodes
    ● Java action
      ● <main-class>
      ● <arg>
      ● <capture-output />
      ● ${wf:actionData('action-node-name')['property-name']}

      // write properties to the file named by oozie.action.output.properties,
      // so that <capture-output/> can hand them back to the workflow
      String oozieProp = System.getProperty("oozie.action.output.properties");
      if (oozieProp != null) {
          Properties props = new Properties();
          props.setProperty(propKey, propVal);
          File propFile = new File(oozieProp);
          OutputStream os = new FileOutputStream(propFile);
          props.store(os, "Results from oozie task");
          os.close();
      }
  • 13. Oozie Core Components: Action Nodes
    ● Custom actions
      ● Implement the ActionExecutor interface
      ● Call super(ACTION_TYPE) in the constructor
      ● ActionExecutor.Context
      ● start / end / kill / check
      ● Edit oozie-site.xml: add the custom class name to the property oozie.service.ActionService.executor.ext.classes
      ● Maybe one could be written for Impala?
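  A sketch of the oozie-site.xml entry that registers such a custom executor; the class name is a hypothetical placeholder:

    <property>
        <name>oozie.service.ActionService.executor.ext.classes</name>
        <!-- hypothetical custom executor class; separate multiple classes with commas -->
        <value>com.example.oozie.ImpalaActionExecutor</value>
    </property>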
  • 14. Oozie in Practice and Tips: Scenario
    ● A typical DMP (Data Management Platform) ETL application
    ● Aggregate user behavior events and classify users

      User behavior tables 1..n (TTL = 30 days):
        user | time | product
        A    | 101  | XXX
        B    | 102  | YYY
        A    | 103  | ZZZ

      Product classification table:
        product | categories
        XXX     | 1,2,3
        YYY     | 4,3,2
        ZZZ     | 7,9,8

      Final result:
        external user ID | categories  | generation
        A1               | 1,7,2,3,9,8 | 0
        B1               | 4,3,2       | 0

    ● Intermediate tables: internal user classification (A -> 1,7,2,3,9,8 | B -> 4,3,2) and internal-to-external user ID mapping (A -> A1 | B -> B1)
  • 15. Oozie in Practice and Tips: Workflow Layout (workflow diagram, approximately)
    ● START -> ZKClient getGen and checkTime: 1) get old and new generation, 2) compare lastImportedTime vs. lastExportedTime
    ● decision: is there new data? No -> E-Mail Client (MSG: nothing to export) -> END Successful
    ● Yes -> fork: coprocessor Client aggregate X events / aggregate Y events / aggregate Z events -> join
    ● Hive Script to generate export table
    ● fork: Hive/FTP Script to create/send export files for A / for B / for C -> join
    ● ZKClient-setGen: set new generation -> END Successful
    ● Error after the coprocessor steps -> ZKClient-fail-after-coproc: set generation back -> E-Mail Client (MSG: failed after coproc) -> KILLED with ERROR
    ● Error in the ZK client steps -> E-Mail Client (MSG: failed by ZK Client) -> KILLED with ERROR
  • 16. Oozie in Practice and Tips: Actions Involved
    ● ETL scenario: DMP data aggregation, processing, and export
    ● decision / fork-join
    ● Java (HBase, ZooKeeper)
    ● Hive
    ● E-Mail
  • 17. Oozie in Practice and Tips, Step One of the Long March: Getting It to Run
    ● Ways to use Oozie: command line, Java client API / REST API, Hue

      # job.properties
      jobTracker=xxx:8021
      nameNode=xxx:8020
      oozie.coord.application.path=${workflowRoot}/coordinator.xml
      oozie.wf.application.path=${workflowRoot}/workflow.xml

      $ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run
      $ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W
      $ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W
      $ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000001-130104191423486-oozie-oozi-W

    ● ShareLib
      ● /usr/lib/oozie/oozie-sharelib.tar.gz
      ● sudo -u oozie hadoop fs -put share /user/oozie/
      ● In job.properties: oozie.use.system.libpath=true
      ● oozie.service.WorkflowAppService.system.libpath
      ● oozie.libpath=${nameNode}/xxx/xxx/jars
  • 18. Oozie in Practice and Tips: It Won't Run?
    ● Permission problems
      ● Error: E0902 : E0902: Exception occured: [org.apache.hadoop.ipc.RemoteException: User: oozie is not allowed to impersonate xxx]
      ● Set in core-site.xml: hadoop.proxyuser.oozie.groups and hadoop.proxyuser.oozie.hosts
    ● The fork/join bug
      ● Error: E0735 : E0735: There was an invalid "error to" transition to node [xxx] while using fork/join
      ● OOZIE-1142
      ● Set oozie.validate.ForkJoin to false in oozie-site.xml
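  A sketch of the core-site.xml proxyuser entries referred to above; the wildcard values are an illustrative, permissive choice and should be narrowed to real groups and hosts in production:

    <property>
        <name>hadoop.proxyuser.oozie.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.oozie.hosts</name>
        <value>*</value>
    </property>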
  • 19. Oozie in Practice and Tips: Trouble with HBase?
    ● hbase-site.xml
      ● Oozie has no HBase support, so it knows nothing about HBase's ZooKeeper settings and the like; load them yourself:

        Configuration conf = new Configuration();
        conf.addResource("hbase-site.xml");
        conf.reloadConfiguration();

    ● If you are unlucky enough to need Sqoop + HBase
      ● hbase-xxx.jar lives under /lib/sqoop/ in the sharelib
      ● Replace the hbase-site.xml inside that jar!?
      ● Or put hbase-site.xml into oozie/share/lib/sqoop/ with hadoop fs -put
  • 20. Oozie in Practice and Tips: Hive Errors of All Kinds
    ● Every Hive action node must point to hive-site.xml via <job-xml>
    ● FAILED: Error in metadata
      ● NestedThrowables: JDOFatalInternalException or InvocationTargetException
      ● The JDBC driver of the metastore database, e.g. the MySQL Java Connector: is mysql-connector-java-xxx-bin.jar in the workflow's lib directory?
    ● Directory permissions
      ● Hive's warehouse and tmp directories must be writable by the user who starts the Oozie job
    ● If integrating with HBase
      ● auxpath and ZooKeeper settings in hive-site.xml
  • 21. Oozie in Practice and Tips, TIP: Global Properties
    ● Global properties and job-xml:

      <workflow-app name="xxx">
          <global>
              <job-xml>${hiveSite}</job-xml>
              <configuration>
                  <property>
                      <name>mapred.child.java.opts</name>
                      <value>-Xmx2048m</value>
                  </property>
                  <property>
                      <name>oozie.launcher.mapred.child.java.opts</name>
                      <value>-server -Xmx2G -Djava.net.preferIPv4Stack=true</value>
                  </property>
              </configuration>
          </global>
      ...
  • 22. Oozie in Practice and Tips, TIP: Global Properties
    ● Property checking and substitution:

      <workflow-app name="">
          <parameters>
              <property>
                  <!-- if current_month is not supplied, Oozie fails with Error: E0738 -->
                  <name>current_month</name>
              </property>
              <property>
                  <!-- if current_date is not supplied, this evaluates to '' -->
                  <name>currentDate</name>
                  <value>${concat(concat("'", wf:conf('current_date')), "'")}</value>
              </property>
              <property>
                  <name>dateFrom</name>
                  <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-01'))), "'")}</value>
              </property>
              <property>
                  <name>dateTo</name>
                  <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-31'))), "'")}</value>
              </property>
          </parameters>
      ...
  • 23. Oozie in Practice and Tips, TIP: Variable Names and Their Use
    ● Variables whose names follow the naming rule ( [A-Za-z_][0-9A-Za-z_]* )
      ● ${xxx} or wf:conf(xxx)
      ● ${wf:conf("etl_only_do_something") eq "yes"}
    ● Variables whose names do not follow the rule (e.g. input.path)
      ● ${input.path}
    ● Minus signs are not allowed: it cannot be written as input-path
  • 24. Oozie in Practice and Tips: Collecting KPI Values While the Workflow Runs
    ● MapReduce action / Pig action
      ● hadoop:counters
      ● ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}
    ● Java / SSH action
      ● <capture-output />
      ● ${wf:actionData('java-action-node-name')['property-name']}
      ● ${wf:action:output('ssh-action-node-name')['property-name']}
    ● Hive: no good way
      ● hive -e -S
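  A sketch of an email action that reports one of the counters above in its message body; the recipient address, node names, and transitions are illustrative:

    <action name="report-kpi">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>etl-team@example.com</to>
            <subject>Workflow ${wf:id()} KPI report</subject>
            <body>FILE_BYTES_READ: ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>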
  • 25. Oozie in Practice and Tips: Passing Output Data from a Java Action Back to Oozie
    ● The Java action's output becomes workflow variables
      ● <capture-output />
      ● Write a Properties file from the program:

        String oozieProp = System.getProperty("oozie.action.output.properties");
        if (oozieProp != null) {
            Properties props = new Properties();
            props.setProperty("last.import.date", "2013-12-01T00:00:00Z"); // ISO-8601 date format
            File propFile = new File(oozieProp);
            OutputStream os = new FileOutputStream(propFile);
            props.store(os, "Results from oozie task");
            os.close();
        }
  • 26. Oozie in Practice and Tips: Using a Java Action's Output Data
    ● Java
      ● The captured values can be passed in as arguments to a main function
      ● Or used in a decision node
  • 27. Oozie in Practice and Tips: Capturing Output Carries a Risk Too
    ● Oozie action output data has a default size limit, and it is only 2K!

      Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]
      Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
      org.apache.oozie.action.hadoop.LauncherException
          at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)

    ● Raise the limit in oozie-site.xml (here to 1 MB):

      <property>
          <name>oozie.action.max.output.data</name>
          <value>1048576</value>
      </property>

    ● And then... Oozie has to be restarted
  • 28. Oozie Programming Interfaces
    ● Oozie Web Services API
      ● HTTP REST API
      ● curl -X POST -H "Content-Type: application/xml" -d @config.xml "http://localhost:11000/oozie/v1/jobs?action=start"
    ● Oozie Java client API

      import org.apache.oozie.client.OozieClient;
      import org.apache.oozie.client.WorkflowJob;

      OozieClient oozieClient = new OozieClient(oozie_url);   // e.g. "http://localhost:11000/oozie"
      // build a Properties object with the job configuration, then submit:
      String jobId = oozieClient.run(prop);
      WorkflowJob job = oozieClient.getJobInfo(jobId);
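  A sketch of the config.xml posted by the curl command above; the REST job-submission endpoint takes a Hadoop-style configuration document, typically with at least user.name and the application path. The values shown are illustrative:

    <configuration>
        <property>
            <name>user.name</name>
            <value>etl</value>
        </property>
        <property>
            <name>oozie.wf.application.path</name>
            <value>hdfs://namenode:8020/user/etl/apps/demo-wf</value>
        </property>
    </configuration>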
  • 29. Kettle, a Graphical Open-Source ETL Tool
    ● A few limitations of Oozie
      ● It lives inside the Hadoop cluster
      ● What about HBase?
    ● A first look at Kettle, a graphical open-source ETL tool with Oozie support
      ● Job / Transformation
      ● HBase Input / Output
  • 30. Summary and Outlook
    ● An effective replacement for cron jobs inside a Hadoop cluster
    ● Tightly integrated with Hadoop; user permissions can be managed in one place
    ● Error alerting and handling (rerun) for workflow nodes
    ● Control flow nodes allow flexible control over the workflow
    ● Compared with Azkaban, it supports more task types
      ● But at a price: it always occupies one extra map slot
    ● Compared with Azkaban, it supports variables and the EL language
    ● The coordinator offers an event-triggered start mode
    ● Rich APIs
    ● No HBase support
    ● Writing the XML is a chore
