Hadoop
Streaming and Pipes
         July 10, 2012
           Clay Jiang
  Big Data Engineering Team
         Hanborq Inc.
Hadoop Streaming
• Hadoop Streaming is a utility that runs any
  executable or script as the mapper/reducer of a
  MapReduce job

• $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar

First Streaming Run
• Basic command:
 – hadoop jar \
   $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -input /path/to/inputdir \
   -output /path/to/outputdir \
   -mapper /path/to/map_exec \
   -reducer /path/to/reduce_exec
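• For example, a quick sanity check can use standard Unix
  utilities as the mapper and reducer (paths are illustrative):
 – hadoop jar \
   $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -input /path/to/inputdir \
   -output /path/to/outputdir \
   -mapper /bin/cat \
   -reducer /usr/bin/wc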




How Streaming Works
• The Mapper/Reducer launches map_exec/reduce_exec
  as a separate process
• The Mapper/Reducer exchanges <key,value> pairs
  with it over stdin and stdout
• <key,value> pairs are passed to
  map_exec/reduce_exec in an agreed-upon format,
  "key\tvalue" by default

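• To make the convention concrete, a minimal identity mapper in
  Python (a sketch; any executable that reads stdin and writes
  stdout works the same way):

  #!/usr/bin/env python
  # Each input line arrives as "key\tvalue"; a line without a
  # tab is treated as a key with an empty value.
  import sys

  for line in sys.stdin:
      key, sep, value = line.rstrip("\n").partition("\t")
      # Emit output in the same "key\tvalue" form on stdout.
      print(key + "\t" + value)
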
How Streaming Works
[Diagram: the Java map/reduce task feeding the executable's
stdin and tailing its stdout]

Hadoop Streaming Example
• Streaming WordCount
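• A minimal sketch of a streaming WordCount as two Python
  scripts (names are illustrative; both must be executable):

  #!/usr/bin/env python
  # mapper.py: emit "<word>\t1" for every word on stdin
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t1")

  #!/usr/bin/env python
  # reducer.py: input is sorted by key, so all lines for one
  # word arrive consecutively; sum the counts per word
  import sys

  current, count = None, 0
  for line in sys.stdin:
      word, sep, value = line.rstrip("\n").partition("\t")
      if word != current:
          if current is not None:
              print(current + "\t" + str(count))
          current, count = word, 0
      count += int(value)
  if current is not None:
      print(current + "\t" + str(count))

• Submitted with (-file ships the scripts to the cluster):
 – hadoop jar \
   $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -input /path/to/inputdir -output /path/to/outputdir \
   -mapper mapper.py -reducer reducer.py \
   -file mapper.py -file reducer.py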




Streaming Internals
• Just a utility, not a new mechanism
• Adds an adapter layer on top of the existing
  MapReduce framework:
 – PipeMapper + PipeMapRunner
 – PipeCombiner
 – PipeReducer
 – No PipePartitioner

Streaming Internals
PipeMapper/PipeReducer handle the data transfer to and
      from the executable over stdin/stdout

Streaming Internals
• Main entry point of hadoop-streaming*.jar:
 – org.apache.hadoop.streaming.HadoopStreaming

• It dispatches to one of three tools:
 – StreamJob (the default), DumpTypedBytes, LoadTypedBytes

Streaming-StreamJob
• StreamJob
  – parseArgv:
     • Argv → field members

  – setJobConf:
     • field members → JobConf

  – submitAndMonitorJob:
     • JobConf submitted to the JobClient

Streaming Map
• -mapper <cmd|JavaClassName>
• PipeMapRunner/PipeMapper
  – startOutputThreads: starts an MROutputThread that
    "tails" map_exec's stdout, reads the output with an
    OutputReader, and writes the parsed records to the
    collector
  – PipeMapper.map: uses an InputWriter to serialize each
    key/value into a string map_exec can parse, and writes
    it to map_exec's stdin
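• The same pattern in a self-contained Python sketch (not
  Hadoop's actual Java code; /bin/cat stands in for map_exec,
  collect() and records stand in for the collector and input):

  #!/usr/bin/env python
  # Feed records to a child process over stdin while a
  # background thread tails its stdout.
  import subprocess, threading

  def collect(key, value):           # stand-in for the collector
      print(key + "\t" + value)

  records = [("k1", "v1"), ("k2", "v2")]  # stand-in map() input

  proc = subprocess.Popen(["/bin/cat"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE)

  def tail_stdout():
      # Role of MROutputThread + OutputReader: parse
      # "key\tvalue" lines and hand them to the collector.
      for line in proc.stdout:
          key, sep, value = line.decode().rstrip("\n").partition("\t")
          collect(key, value)

  t = threading.Thread(target=tail_stdout)
  t.start()

  for key, value in records:
      # Role of InputWriter: serialize to "key\tvalue" lines.
      proc.stdin.write((key + "\t" + value + "\n").encode())

  proc.stdin.close()                 # EOF lets the child finish
  t.join()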


Streaming Reduce
• -reducer <cmd|JavaClassName>
• PipeReducer
  – relies on the normal MapReduce shuffle to deliver the
    data to the reducer
  – startOutputThread: on the first reduce() call, similarly
    starts an MROutputThread to collect the reducer cmd's
    stdout
  – likewise uses an InputWriter to serialize the reduce
    key/values and feeds them pair by pair to the reducer cmd

InputWriter/OutputReader
• InputWriter
  – encodes each <key,value> and writes it to the
    executable's stdin

• OutputReader
  – reads the executable's stdout and decodes it back into
    <key,value> pairs

• InputWriter + OutputReader
  – together form the data-transfer protocol between the
    Java process and the map/reduce executable

TextInputWriter/TextOutputReader

• Used by default:
  – TextInputWriter, TextOutputReader
• <key,value> → key + separator + value
• Default separator: \t (tab)

Streaming Data Flow
[Diagram: end-to-end streaming data flow]

Streaming Combiner
• -combiner <cmd|JavaClassName>

• PipeCombiner simply extends PipeReducer; its flow is
  identical to PipeReducer's

Streaming Partitioner
• -partitioner <javaClassName>
• Currently, the partitioner must be a Java class

Streaming I/O Format
• -inputFormat <javaClassName>
  – JobConf.setInputFormat()

• -outputFormat <javaClassName>
  – JobConf.setOutputFormat()

• -inputreader <javaClassName>
  – uses StreamInputFormat as the InputFormat

Streaming IO Spec
• TextInputWriter/TextOutputReader:
  – stream.map/reduce.output.field.separator
     • separator used in the output of the map/reduce
       executable
  – stream.map/reduce.input.field.separator
     • separator used in the input of the map/reduce
       executable
  – stream.num.map/reduce.output.key.fields
     • the separator splits each line into fields; this sets
       how many leading fields make up the key (see the
       example below)
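• For example, to split map output on "." and use the first
  four fields as the key (values are illustrative):
 – hadoop jar \
   $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -D stream.map.output.field.separator=. \
   -D stream.num.map.output.key.fields=4 \
   -input /path/to/inputdir -output /path/to/outputdir \
   -mapper /bin/cat -reducer /bin/cat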



Streaming IO Spec
• -io text|rawbytes|typedbytes
  – text → TextInputWriter/TextOutputReader
  – rawbytes → RawBytesInputWriter/RawBytesOutputReader
  – typedbytes → TypedBytesInputWriter/TypedBytesOutputReader
  – the option is resolved by IdentifierResolver
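• Selected per job on the command line, e.g.:
 – hadoop jar \
   $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -io typedbytes \
   -input /path/to/inputdir -output /path/to/outputdir \
   -mapper /path/to/map_exec -reducer /path/to/reduce_exec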




User-Defined IO Spec
• MyInputWriter/MyOutputReader
  – extend InputWriter/OutputReader
• MyIdentifierResolver
  – extends IdentifierResolver
  – resolves the identifier "my" →
    MyInputWriter/MyOutputReader
  – -D stream.io.identifier.resolver.class=MyIdentifierResolver

Debug Streaming
• -mapdebug/-reducedebug
  – runs a debug script when a map/reduce task fails
  – the script is invoked as: $script $stdout $stderr
    $syslog $jobconf
• -debug
  – keeps /tmp/${user.name}/streamjob.jar after the job
    finishes instead of deleting it
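• A minimal debug-script sketch (Python; assumes the script is
  shipped to the nodes and executable) that dumps the failed
  task's captured stderr:

  #!/usr/bin/env python
  # debug.py: invoked as
  #   debug.py <stdout> <stderr> <syslog> <jobconf>
  # when a task fails; print the stderr file's contents.
  import sys

  stdout_file, stderr_file, syslog_file, jobconf = sys.argv[1:5]
  with open(stderr_file) as f:
      print(f.read())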




vs. Hadoop Pipes
• stdin/stdout → socket

• Fixed I/O interface:
  $HADOOP_HOME/c++/$PLATFORM/include
  – HadoopPipes::Mapper::map(MapContext& context)

  – HadoopPipes::Reducer::reduce(ReduceContext& context)

• Performance: is one better than the other?

vs. Hadoop Pipes
• The implementations are very similar:
 – PipeMapper/PipeReducer →
   PipesMapper/PipesReducer
 – InputWriter/OutputReader →
   Application
 – any executable → Pipes clients must link
   against the C++ library

References
• (1) "Hadoop: The Definitive Guide"
• (2) Hadoop Streaming:
  http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
• (3) How to Debug Map/Reduce Programs:
  http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
• (4) Hadoop Wiki: http://wiki.apache.org/hadoop/

The End
Thank You Very Much!
    chiangbing@gmail.com



