MapReduce
@shot6
Cloudera
                   Avro	
           Sqoop	
  
Desktop	
  



 Pig	
        Hive	
          HBase	
           Chukwa	
  


 Map                                       Zoo
                         HDFS	
  
Reduce	
                                  Keeper	
  

                   Core	
  
Cloudera
                   Avro	
           Sqoop	
  
Desktop	
  



 Pig	
        Hive	
          HBase	
           Chukwa	
  


 Map                                       Zoo
                         HDFS	
  
Reduce	
                                  Keeper	
  

                   Core	
  
•                 MapReduce

     –    Mapper/Reducer
• 
MapReduce                      	
•              WordCount
• 
• 
     – Mapper/Reducer       Job   ⾏行行
     – InputFormat/OutputFormat         ⽅方
     – HDFS(FileSystem)
     –     Writable     ⽅方
WordCount	
•  Hadoop          Hello World
•                   API
   (org.apache.hadoop.mapreduce)
•  API
Grep	
•  grep
  – grepJob/sortJob 2
        ⾏行行
  – JobConf/Mapper/Reducer            ⽅方
  – Mapper RegexMapper     ⾏行行   <Text,
    Long> SequenceFileFormat
  – sortJob
  –                                ⼒力力
  – 
Grep
                  -	
•  JobConf
•  Mapper
•  Reducer
o.a.hadoop.mapred.JobConf	
• 
     –           mapred-default.xml
     –    conf/mapred-site.xml
     – XML    ⾝身
       DOM
     – ⾃自        ⽬目    ⼿手
     –  ⼦子
       •  JobConf child = new JobConf(   Conf,   jar
                                 );
mapred-site.xml	
<configuration>
<!–                 -->
<property>
 <key>mapred.job.tracker</key>
 <value>your-site:9001</value>
</property>
</configuration>
o.a.hadoop.mapred.Mapper	
•  Mapper
•  InputSplit    Mapper
•  MapTask/MapRunner
•  map(KEY, VALUE, COLLECTOR,
   REPORTER)
     – KEY:Map      VALUE:Map
     – COLLECTOR:
     – REPORTER:                     API
•         MapReduceBase
o.a.hadoop.mapred.MapTask	
•  Map
•  initiazlize              (Task Reducer    )
  –                                     ⽣生
  –               (o.a.h.mapred.TaskStatus.State)
       •  RUNNING, SUCCEEDED, FAILED, UNASSIGNED,
          KILLED, COMMIT_PENDING, FAILED_UNCLEAN,
          KILLED_UNCLEAN
  – OutputCommiter ⽣生
       •  Task        ⼒力力            ⾏行行
       •                                         ⼒力力
          – mapred.work.output.dir
o.a.h.mapred.MapTask cont	
•  run        runOldMapper
•  JobClient
   InputSplit
•  RecordReader
o.a.h.mapred.MapTask cont2	
•  Reduce
  –              spill                   (*            )
       •  $mapred.local.dir/taskTracker/jobcache/$
          {taskid}/output/spill${spillNumber}.out
  – Reducer
                 ⼒力力
       •  Combiner        min.num.spills.for.combine
                          combiner
  –              RecordWriter                 ⼒力力
•  MapRunner
o.a.h.mapred.MapRunner	
•  MapRunnable
  – mapred.map.runner.class
  – Hadoop
    PipeMapRunner
  –               Map
    MultiThreadedMapRunner
o.a.h.mapred.MapRunner
                cont	
•  run(RecordReader, OutputCollector,
   Reporter)
     – RecordReader: InputFormat Split
         Reader(InputFormat/RecordReader
                           )
• 
     – RecordReader
     – 
                ⾝身
     – 
MapTask	
      MapRunner	
              Mapper	
         Record            Output
                                                         Reader	
          Collector	
       Input
      Split⽣生 	
  
                                                          	
                                                                                   	

                                                                             Spill
              & run	
                            createKey()                SpillThread
                                                 createValue()	
                    	
  

                                                 next(key, value)	

            EOF         	
     Map(key, value,
                                                                           Spill
                               outputCollector, reporter)
m(_ _)m
•  Mapper
     – JobConf
     – Mapper/MapRunner/MapTask
• 
     – Reducer
       •  Reducer   ⾏行行
       •  Reducer                 ⾏行行
     – InputFormat/RecordReader
o.a.h.mapred.Reducer	
•  Reducer
•  InputSplit      Mapper
•  ReduceTask/ReduceRunner
•  reduce(KEY, Iterator<VALUE>,
   COLLECTOR, REPORTER)
     – KEY:    Iterator<VALUE>:
     – COLLECTOR:
     – REPORTER:                       API
•         MapReduceBase
o.a.h.mapred.ReduceTask	
•  SHUFFLE
•  ReduceTask.ReduceCopier
  – fetchOutputs(            Merger.MergeQueue)
    •  Map                            x   mapred.reduce.parallel.copies

         – MapOutputCopier
    •  Map
          ⾏行行 LocalFSMerger
    •                  ⾏行行 InMemFSMergeThread
    •  GetMapEventsThread
         – Map
         – <     , MapOutputLocation(taskId, host, httpUrl)>
    •    ⼀一 TaskTracker                                         ⼯工
o.a.h.mapred.ReduceTask	
•  run(RecordReader, OutputCollector,
   Reporter)
•  SORT
  – Memory, disk                        ⽣生
    •  RowKeyValueItetator
  – Reducer ⽣生
  – RecordWriter ⽣生
  – ReduceValuesIterator       ⾏行行

サンプルから見るMap reduceコード

  • 1.
  • 2.
    Cloudera Avro   Sqoop   Desktop   Pig   Hive   HBase   Chukwa   Map Zoo HDFS   Reduce   Keeper   Core  
  • 3.
    Cloudera Avro   Sqoop   Desktop   Pig   Hive   HBase   Chukwa   Map Zoo HDFS   Reduce   Keeper   Core  
  • 4.
    •  MapReduce –  Mapper/Reducer • 
  • 5.
    MapReduce •  WordCount •  •  – Mapper/Reducer Job ⾏行行 – InputFormat/OutputFormat ⽅方 – HDFS(FileSystem) –  Writable ⽅方
  • 6.
    WordCount •  Hadoop Hello World •  API (org.apache.hadoop.mapreduce) •  API
  • 7.
    Grep •  grep – grepJob/sortJob 2 ⾏行行 – JobConf/Mapper/Reducer ⽅方 – Mapper RegexMapper ⾏行行 <Text, Long> SequenceFileFormat – sortJob –  ⼒力力 – 
  • 8.
    Grep - •  JobConf •  Mapper •  Reducer
  • 9.
    o.a.hadoop.mapred.JobConf •  –  mapred-default.xml –  conf/mapred-site.xml – XML ⾝身 DOM – ⾃自 ⽬目 ⼿手 –  ⼦子 •  JobConf child = new JobConf( Conf, jar );
  • 10.
    mapred-site.xml <configuration> <!– --> <property> <key>mapred.job.tracker</key> <value>your-site:9001</value> </property> </configuration>
  • 11.
    o.a.hadoop.mapred.Mapper •  Mapper •  InputSplit Mapper •  MapTask/MapRunner •  map(KEY, VALUE, COLLECTOR, REPORTER) – KEY:Map VALUE:Map – COLLECTOR: – REPORTER: API •  MapReduceBase
  • 12.
    o.a.hadoop.mapred.MapTask •  Map •  initiazlize (Task Reducer ) –  ⽣生 –  (o.a.h.mapred.TaskStatus.State) •  RUNNING, SUCCEEDED, FAILED, UNASSIGNED, KILLED, COMMIT_PENDING, FAILED_UNCLEAN, KILLED_UNCLEAN – OutputCommiter ⽣生 •  Task ⼒力力 ⾏行行 •  ⼒力力 – mapred.work.output.dir
  • 13.
    o.a.h.mapred.MapTask cont •  run runOldMapper •  JobClient InputSplit •  RecordReader
  • 14.
    o.a.h.mapred.MapTask cont2 •  Reduce –  spill (* ) •  $mapred.local.dir/taskTracker/jobcache/$ {taskid}/output/spill${spillNumber}.out – Reducer ⼒力力 •  Combiner min.num.spills.for.combine combiner –  RecordWriter ⼒力力 •  MapRunner
  • 15.
    o.a.h.mapred.MapRunner •  MapRunnable – mapred.map.runner.class – Hadoop PipeMapRunner –  Map MultiThreadedMapRunner
  • 16.
    o.a.h.mapred.MapRunner cont •  run(RecordReader, OutputCollector, Reporter) – RecordReader: InputFormat Split Reader(InputFormat/RecordReader ) •  – RecordReader –  ⾝身 – 
  • 17.
    MapTask MapRunner Mapper Record Output Reader Collector Input Split⽣生   Spill & run createKey() SpillThread createValue()   next(key, value) EOF   Map(key, value, Spill outputCollector, reporter)
  • 18.
  • 19.
    •  Mapper – JobConf – Mapper/MapRunner/MapTask •  – Reducer •  Reducer ⾏行行 •  Reducer ⾏行行 – InputFormat/RecordReader
  • 20.
    o.a.h.mapred.Reducer •  Reducer •  InputSplit Mapper •  ReduceTask/ReduceRunner •  reduce(KEY, Iterator<VALUE>, COLLECTOR, REPORTER) – KEY: Iterator<VALUE>: – COLLECTOR: – REPORTER: API •  MapReduceBase
  • 21.
    o.a.h.mapred.ReduceTask •  SHUFFLE •  ReduceTask.ReduceCopier – fetchOutputs( Merger.MergeQueue) •  Map x mapred.reduce.parallel.copies – MapOutputCopier •  Map ⾏行行 LocalFSMerger •  ⾏行行 InMemFSMergeThread •  GetMapEventsThread – Map – < , MapOutputLocation(taskId, host, httpUrl)> •  ⼀一 TaskTracker ⼯工
  • 22.
    o.a.h.mapred.ReduceTask •  run(RecordReader, OutputCollector, Reporter) •  SORT – Memory, disk ⽣生 •  RowKeyValueItetator – Reducer ⽣生 – RecordWriter ⽣生 – ReduceValuesIterator ⾏行行