Hadoop MapReduce Streaming and Pipes

Introduction of Hadoop MapReduce Streaming and Pipes, for training.

  1. Hadoop Streaming and Pipes
     July 10, 2012
     Clay Jiang, Big Data Engineering Team, Hanborq Inc.
  2. Hadoop Streaming
     • Hadoop Streaming is a utility that lets any executable or script act as the map/reduce of an MR job.
     • $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar
  3. First Streaming Run
     • Basic command:
       hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
         -input /path/to/inputdir \
         -output /path/to/outputdir \
         -mapper /path/to/map_exec \
         -reducer /path/to/reduce_exec
  4. How Streaming Works?
     • The Mapper/Reducer launches map_exec/reduce_exec as a separate process.
     • The Mapper/Reducer exchanges <key,value> pairs with it over stdin and stdout.
     • <key,value> pairs are passed to map_exec/reduce_exec in an agreed-upon format; the default is "key\tvalue".
  5. How Streaming Works? (diagram only; not reproduced in this export)
  6. Hadoop Streaming Example
     • Streaming WordCount
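The WordCount code on this slide does not survive the export. As a stand-in, here is a minimal sketch of what a streaming WordCount mapper and reducer could look like in Python; the function names are mine, not Hadoop's, and the tab-separated "key\tvalue" line format follows the streaming default described above.

```python
from itertools import groupby

def map_words(lines):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer: sum the counts for each word.

    Streaming delivers reducer input sorted by key, so consecutive
    pairs with the same key can be grouped and summed.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Wired into a streaming job, the mapper script would read lines from
# sys.stdin and print "word\t1" lines; the reducer script would parse
# the sorted "word\tcount" lines from sys.stdin and print the sums.
```

Each script would then be passed to the job via -mapper and -reducer as on slide 3.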
  7. Streaming Internal
     • Streaming is just a tool, not a new mechanism.
     • It adds an adapter layer on top of the existing MapReduce framework:
       – PipeMapper + PipeMapRunner
       – PipeCombiner
       – PipeReducer
       – no PipePartitioner
  8. Streaming Internal
     • PipeMapper/PipeReducer exchange data with the executable through stdin/stdout.
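The stdin/stdout exchange that PipeMapper performs around map_exec can be mimicked with a plain child process. The sketch below is a toy illustration, not Hadoop code: `run_through_executable` plays the role of PipeMapper (writing "key\tvalue" lines to the child's stdin) plus MROutputThread (collecting the lines the child writes to stdout), and the child command is a stand-in for map_exec.

```python
import subprocess
import sys

def run_through_executable(records, argv):
    """Feed "key\tvalue" lines to a child process's stdin and collect
    the "key\tvalue" lines it writes back on stdout."""
    lines = "".join("%s\t%s\n" % kv for kv in records)
    out = subprocess.run(argv, input=lines, capture_output=True,
                         text=True, check=True).stdout
    return [tuple(line.split("\t", 1)) for line in out.splitlines()]

# Toy "map_exec": uppercases every value it receives.
child = [sys.executable, "-c",
         "import sys\n"
         "for line in sys.stdin:\n"
         "    k, _, v = line.rstrip('\\n').partition('\\t')\n"
         "    print(k + '\\t' + v.upper())"]
```

The real PipeMapper does this incrementally and concurrently (a writer thread and an MROutputThread reader), rather than buffering the whole input and output as this sketch does.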
  9. Streaming Internal
     • Main entry point of hadoop-streaming*.jar: (code screenshot, not reproduced)
     • One of the three tools it dispatches to: (code screenshot, not reproduced)
 10. Streaming - StreamJob
     • StreamJob
       – parseArgv: argv → field members
       – setJobConf: field members → JobConf
       – submitAndMonitorJob: JobConf is submitted to JobClient
 11. Streaming Map
     • -mapper <cmd|JavaClassName>
     • PipeMapRunner/PipeMapper
       – startOutputThreads: starts an MROutputThread to "tail" map_exec's stdout, reads the output with an OutputReader, and writes the parsed records to the collector
       – PipeMapper.map: uses an InputWriter to serialize each key/value into a string that map_exec can parse, and writes it to map_exec's stdin
 12. Streaming Reduce
     • -reducer <cmd|JavaClassName>
     • PipeReducer
       – relies on MapReduce's internal shuffle to deliver data to the reducer
       – startOutputThread: on the first reduce call, likewise starts an MROutputThread to collect the reducer cmd's stdout
       – likewise uses an InputWriter to translate the reduce key/values and feed them pair by pair to the reducer cmd
 13. InputWriter/OutputReader
     • InputWriter
       – writes <key,value> pairs to the executable's stdin in a predefined encoding
     • OutputReader
       – reads the executable's stdout and decodes it back into <key,value> pairs
     • InputWriter + OutputReader
       – together form the data-transfer protocol between the Java process and the map/reduce executable
 14. TextInputWriter/TextOutputReader
     • Used by default: TextInputWriter, TextOutputReader
     • <key,value> → key + separator + value
     • Default separator: \t
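The TextInputWriter/TextOutputReader convention above amounts to a trivial line codec. A minimal sketch of that codec (the helper names are mine): splitting on the first separator only matters, because values may themselves contain the separator character.

```python
def encode_kv(key, value, separator="\t"):
    """TextInputWriter-style: one line of key + separator + value."""
    return "%s%s%s\n" % (key, separator, value)

def decode_kv(line, separator="\t"):
    """TextOutputReader-style: split on the FIRST separator only.

    A line with no separator yields the whole line as the key and an
    empty value.
    """
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value
```

Round-tripping a pair through encode_kv and decode_kv returns the original key and value unchanged as long as the key contains no separator.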
 15. Streaming Data Flow (diagram only; not reproduced in this export)
 16. Streaming Combiner
     • -combiner <cmd|JavaClassName>
     • PipeCombiner simply extends PipeReducer; its flow is the same as PipeReducer's.
 17. Streaming Partitioner
     • -partitioner <javaClassName>
     • For now, the partitioner must be a Java class.
 18. Streaming I/O Format
     • -inputFormat <javaClassName>
       – JobConf.setInputFormat()
     • -outputFormat <javaClassName>
       – JobConf.setOutputFormat()
     • -inputreader <javaClassName>
       – uses StreamInputFormat as the InputFormat
 19. Streaming IO Spec
     • TextInputWriter/TextOutputReader:
       – stream.map/reduce.output.field.separator
         • separator used in the map/reduce executable's output
       – stream.map/reduce.input.field.separator
         • separator used in the map/reduce executable's input
       – stream.num.map/reduce.output.key.fields
         • the separator splits each line into fields; this sets how many leading fields form the key
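The key-fields setting above determines where a line splits into key and value. A sketch of that split in Python (the function name is mine; the fallback behavior when a line has too few fields, where the whole line becomes the key, follows the streaming convention for separator-less lines):

```python
def split_key_fields(line, num_key_fields=1, separator="\t"):
    """Treat the first num_key_fields separator-delimited fields as the
    key and the rest as the value, as stream.num.*.output.key.fields
    configures. A line with too few fields becomes key-only."""
    fields = line.rstrip("\n").split(separator)
    if len(fields) <= num_key_fields:
        return separator.join(fields), ""
    return (separator.join(fields[:num_key_fields]),
            separator.join(fields[num_key_fields:]))
```

For example, with two key fields, "a\tb\tc" splits into key "a\tb" and value "c".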
 20. Streaming IO Spec
     • -io text|rawbytes|typedbytes
       – text → TextInputWriter/TextOutputReader
       – rawbytes → RawBytesInputWriter/RawBytesOutputReader
       – typedbytes → TypedBytesInputWriter/TypedBytesOutputReader
       – the option is resolved by IdentifierResolver
 21. User-Defined IO Spec
     • MyInputWriter/MyOutputReader
       – extend InputWriter/OutputReader
     • MyIdentifierResolver
       – extends IdentifierResolver
       – resolves "my" → MyInputWriter/MyOutputReader
       – -D stream.io.identifier.resolver.class=MyIdentifierResolver
 22. Debug Streaming
     • -mapdebug/-reducedebug
       – when a map/reduce task fails, runs the given debug script:
       – $script $stdout $stderr $syslog $jobconf
     • -debug
       – on job completion, does not delete /tmp/${user.name}/streamjob.jar
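A debug script like the one above receives the paths of the failed task's stdout, stderr, syslog, and job configuration as its four arguments. A minimal sketch of such a script's core in Python (the `tail` helper is hypothetical, not part of Hadoop; a real script would read the four paths from sys.argv and report on each):

```python
def tail(path, n=20):
    """Return the last n lines of a file, tolerating a missing file,
    so the script still produces output when a log was never written."""
    try:
        with open(path) as f:
            return f.readlines()[-n:]
    except OSError:
        return []

# A real debug script would be invoked by streaming as:
#   script $stdout $stderr $syslog $jobconf
# and might print, say, tail(stderr_path) to surface the failure cause.
```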
 23. V.S. Hadoop Pipes
     • stdin/stdout → socket
     • fixed I/O interface → $HADOOP_HOME/c++/$PLATFORM/include
       – HadoopPipes::Mapper::map(MapContext& context)
       – HadoopPipes::Reducer::reduce(ReduceContext& context)
     • Performance: is one better than the other?
 24. V.S. Hadoop Pipes
     • The implementations are quite similar:
       – PipeMapper/PipeReducer ↔ PipesMapper/PipesReducer
       – InputWriter/OutputReader ↔ Application
       – any executable ↔ a Pipes client must link against the C++ library
 25. References
     • (1) Hadoop: The Definitive Guide
     • (2) Hadoop Streaming - http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
     • (3) How to Debug Map/Reduce Programs - http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
     • (4) Hadoop Wiki - http://wiki.apache.org/hadoop/
 26. The End
     Thank You Very Much!
     chiangbing@gmail.com
