Hadoop MapReduce
Introduction and Deep Insight
              July 9, 2012
                Anty Rao
       Big Data Engineering Team
              Hanborq Inc.
Outline
•   Architecture
•   Job Tracker
•   Task Tracker
•   Map/Reduce internal
•   Optimization
•   YARN



Architecture

[Diagram: a MapReduce client talks to the JobTracker over RPC. Three
TaskTrackers exchange heartbeats with the JobTracker, and each
TaskTracker runs several Child JVMs.]

Job Tracker




Job Tracker
• Manages cluster resources
• Job scheduling




Implementation Overview




ExpireLaunchingTasks
• A thread that times out tasks that have been
  assigned to task trackers but have not
  reported back yet.
• Once the task tracker has reported the task, the
  task tracker takes over responsibility for
  monitoring its execution, such as killing an
  unresponsive task.
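
The timeout bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the JobTracker code: the class `LaunchingTaskExpiry`, its method names, and the use of plain millisecond timestamps are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a launch-timeout tracker; not the Hadoop class. */
class LaunchingTaskExpiry {
    private final long timeoutMs;
    // Attempt id -> time at which the attempt was handed to a task tracker.
    private final Map<String, Long> launchTimes = new LinkedHashMap<>();

    LaunchingTaskExpiry(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called when a task attempt is assigned to a task tracker. */
    synchronized void addLaunching(String attemptId, long nowMs) {
        launchTimes.put(attemptId, nowMs);
    }

    /** Called when the task tracker first reports the attempt. */
    synchronized void reported(String attemptId) {
        launchTimes.remove(attemptId);
    }

    /** Removes and returns the attempts that never reported within the timeout. */
    synchronized List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launchTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A background thread would call `expire` periodically and mark the returned attempts as failed so that they can be scheduled again.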



ExpireTrackers
• Monitors task tracker status and expires
  task trackers that have gone down.
• After a task tracker dies, all tasks that
  resided on it are rescheduled.
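
A rough sketch of that expiry-and-reschedule step is shown below. The class `TrackerExpiry` and its methods are hypothetical names chosen for the example, and the real JobTracker tracks far more state per tracker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of heartbeat-based tracker expiry; not the Hadoop class. */
class TrackerExpiry {
    private final long expiryMs;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> tasksByTracker = new HashMap<>();

    TrackerExpiry(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    /** Records that a tracker was heard from at the given time. */
    void heartbeat(String tracker, long nowMs) {
        lastHeartbeat.put(tracker, nowMs);
    }

    /** Records that a task was assigned to a tracker. */
    void assign(String tracker, String task) {
        tasksByTracker.computeIfAbsent(tracker, k -> new TreeSet<>()).add(task);
    }

    /** Drops trackers silent for longer than expiryMs and returns their tasks, sorted. */
    List<String> expireDeadTrackers(long nowMs) {
        List<String> toReschedule = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > expiryMs) {
                Set<String> tasks = tasksByTracker.remove(e.getKey());
                if (tasks != null) {
                    toReschedule.addAll(tasks);
                }
                it.remove();
            }
        }
        Collections.sort(toReschedule);
        return toReschedule;
    }
}
```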




RetireJobs
• Removes old finished jobs that
  have been around too long.
• The job tracker can’t retain the info of
  every finished job.
• There is also an upper limit on the number
  of retained jobs per user.
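
One way to picture the two retirement rules (age and per-user cap) is the sketch below. The names `JobRetirement`, `FinishedJob`, `maxAgeMs`, and `maxPerUser` are invented for the example and do not correspond to Hadoop configuration keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the retirement policy; not the Hadoop class. */
class JobRetirement {
    static final class FinishedJob {
        final String id;
        final String user;
        final long finishMs;

        FinishedJob(String id, String user, long finishMs) {
            this.id = id;
            this.user = user;
            this.finishMs = finishMs;
        }
    }

    /** Returns ids of jobs to retire: too old, or beyond the per-user cap (newest kept first). */
    static List<String> selectRetired(List<FinishedJob> jobs, long nowMs,
                                      long maxAgeMs, int maxPerUser) {
        List<FinishedJob> newestFirst = new ArrayList<>(jobs);
        newestFirst.sort((a, b) -> Long.compare(b.finishMs, a.finishMs));
        Map<String, Integer> keptPerUser = new HashMap<>();
        List<String> retired = new ArrayList<>();
        for (FinishedJob j : newestFirst) {
            int kept = keptPerUser.getOrDefault(j.user, 0);
            if (nowMs - j.finishMs > maxAgeMs || kept >= maxPerUser) {
                retired.add(j.id);
            } else {
                keptPerUser.put(j.user, kept + 1);
            }
        }
        return retired;
    }
}
```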




JobInitThread
• Used to initialize jobs that have just been
  created.
• Job initialization includes
  – Create split info per map
  – Create map tasks
  – Create reduce tasks




TaskCommitQueue
• A thread that performs all HDFS-related
  operations for tasks
  – Promotes outputs of COMMIT_PENDING tasks
  – Discards outputs of FAILED/KILLED tasks
• All local file system operations are
  handled by the task trackers.

HTTP Server
• Serves job tracker status
• Serves the status of all jobs
  – Per-job metrics
• Serves job history

Key Data Structures
• JobInProgress
   – Maintains all the info needed to keep a job on the straight and narrow.
   – Keeps the job's JobProfile and its latest JobStatus, plus a set of
     tables for bookkeeping of its tasks.
   – When a task tracker is lost, penalizes it for each job that had
     tasks running on it.

• TaskInProgress
   – Maintains all the info needed for a task over the lifetime of its
     owning job.
   – A given task might be speculatively executed or re-executed.
   – Maintains separate task states for the different task attempts.

The whole life of a job




The life of a job
• Client
  – The user writes a custom mapper and reducer; the client
    computes splits and uploads the job configuration file,
    jar file, and split meta info to HDFS
  – Submits the job to the job tracker
• Job Tracker
  – Initializes the job: reads in the job split info, determines
    the final number of maps, and creates all needed map and
    reduce tasks along with the structures that represent them
  – Tasks are pulled by task trackers through heartbeats

The life of a job
• Task Tracker
  – Pulls tasks from the job tracker through heartbeats
  – Initializes the job, only once per job
  – Initializes the task
     • Downloads all needed jar files, configuration files, and
       distributed cache files from HDFS to local disk
     • Creates a staging working directory for the task on local disk
     • Localizes the configuration file
  – Builds the Java launch options and sets up the Child
    JVM

The life of a job
• Child JVM
  – Contacts the task tracker over RPC to get its task info
  – Does the actual work: executes the map or reduce
    function, reporting status regularly so that the task
    tracker does not kill it as unresponsive
  – Retrieves map completion events from the task tracker
    if needed, and reports fetch failures to the TT
  – When the task is done, reports COMMIT_PENDING or
    SUCCEEDED state to the TT

Task Tracker




Task Tracker
• Per-node agent
• Manage tasks




Implementation Overview

TT Main Thread
• Heartbeats with the JT periodically to report task
  status and retrieve directives, which include
  launch task, kill job, and kill task actions
• Kills tasks that have been unresponsive for longer
  than the configured time period
• If there isn’t enough disk space to
  accommodate all running tasks, picks tasks to
  kill
• If the TT has expired, it reinitializes itself.

taskCleanupThread
• Thread dedicated to processing cleanup
  actions assigned by the JT
  – Kill job action
  – Kill task action

directoryCleanupThread
• Before a task executes, an execution
  environment is created
  – Create staging directory
  – Copy configuration file
  – Etc.
• While a task runs, it may produce multiple
  intermediate files in its local staging directory
• After a job/task completes or fails, all these
  leftover directories and files are deleted.

taskLauncher
• Localize job
• Localize task
• Create a TaskRunner thread to manage
  the Child JVM

TaskRunner
• It’s a Thread
• Two types
  – MapTaskRunner
  – ReduceTaskRunner
• Main duties
  – Builds the Java launch options and
    execution environment
  – Launches and kills the Child JVM.

MapEventsFetcherThread
• When there are reduce tasks in the shuffle
  phase, fetches map completion events from
  the JT over RPC, on a per-job basis.

Child JVM
• Actually executes the map/reduce function
• Reports status to the TT periodically
• Retrieves map completion events from the TT
  for reduce tasks if needed.

Key data structures
• Running Jobs
  – JobID
  – JobConf
  – Set<TaskInProgress>
• TaskInProgress
  – Task
  – TaskStatus
  – TaskRunner


Map/Reduce Internal




Map/Reduce Programming Model

         Hadoop: The Definitive Guide

Map Phase Diagram

Steps of Map Phase
• Records emitted by the map function are written
  continually into a circular buffer
• When buffer usage exceeds
  io.sort.mb * io.sort.spill.percent, a spill starts:
  records are sorted by partition and then by key, and the
  buffer is written out to disk together with an index file
  recording the position where each partition begins.
• A final merge combines all the intermediate spill files
  into a single large file plus an index file.
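
The spill steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class `MapSpillSketch` and its `Rec` type are invented for the example, records are plain strings rather than serialized bytes, and the separate accounting space that the real map output buffer reserves is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the map-side spill logic; not the Hadoop MapTask code. */
class MapSpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    /** Spill threshold in bytes: io.sort.mb * io.sort.spill.percent. */
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return Math.round(ioSortMb * 1024L * 1024L * spillPercent);
    }

    /** Sorts buffered records by partition, then by key, as a spill does. */
    static List<Rec> sortForSpill(List<Rec> buffer) {
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing((Rec r) -> r.key));
        return sorted;
    }

    /** Index: position of the first record of each partition, or -1 if the partition is empty. */
    static int[] partitionStarts(List<Rec> sorted, int numPartitions) {
        int[] starts = new int[numPartitions];
        Arrays.fill(starts, -1);
        for (int i = 0; i < sorted.size(); i++) {
            int p = sorted.get(i).partition;
            if (starts[p] == -1) {
                starts[p] = i;
            }
        }
        return starts;
    }
}
```

With io.sort.mb set to 100 and io.sort.spill.percent set to 0.80, `spillThresholdBytes` gives 83,886,080 bytes, that is, 80 MB of buffered output before a spill begins.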


Main map-side tuning Knobs




Reduce Phase Diagram




<property>
  <name>mapred.tasktracker.indexcache.mb</name>
  <value>10</value>
  <description>The maximum memory that a task tracker allows for the
    index cache that is used when serving map outputs to reducers.
  </description>
</property>

Steps of Reduce Phase
• Pull map output over from the map side. If there
  is space available in memory and the size of the
  map output is less than
  25% * HeapSize * mapred.job.shuffle.input.buffer.percent,
  keep it in memory; otherwise store it directly
  on disk.
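
The memory-or-disk decision can be written out as a small sketch that follows the rule exactly as stated on this slide. The class `ShuffleMemoryPolicy` and its method are hypothetical, and the real reduce-side copier handles a full buffer differently in detail.

```java
/** Hypothetical sketch of the slide's memory-or-disk rule; not the Hadoop ReduceTask code. */
class ShuffleMemoryPolicy {
    // HeapSize * mapred.job.shuffle.input.buffer.percent
    final long shuffleBufferBytes;
    // A single map output must be smaller than 25% of the shuffle buffer.
    final long maxSingleOutputBytes;
    private long usedBytes = 0;

    ShuffleMemoryPolicy(long heapBytes, double inputBufferPercent) {
        this.shuffleBufferBytes = (long) (heapBytes * inputBufferPercent);
        this.maxSingleOutputBytes = shuffleBufferBytes / 4;
    }

    /** Returns true and reserves space if the output stays in memory; false means go to disk. */
    boolean tryReserveInMemory(long mapOutputBytes) {
        boolean fits = mapOutputBytes < maxSingleOutputBytes
                && usedBytes + mapOutputBytes <= shuffleBufferBytes;
        if (fits) {
            usedBytes += mapOutputBytes;
        }
        return fits;
    }
}
```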




Steps of Reduce Phase(Cont.)
• The merge operation merges and sorts data
  from memory and/or disk and writes the result
  to disk. It comes in two flavors:
  – In-memory merge
     • Triggered when the accumulated in-memory data
       exceeds mapred.job.shuffle.merge.percent of the
       shuffle buffer.
  – On-disk merge
     • Triggered when the number of files on disk
       exceeds a configured threshold.

Steps of Reduce Phase(Cont.)
• When shuffle and sort complete, the following
  constraints must hold before the reduce
  function is fed:
  – Memory used to buffer reduce input can’t exceed
    mapred.job.reduce.input.buffer.percent
  – The number of files on disk can’t exceed io.sort.factor

Notes about Reduce
• Shuffle & sort use a percentage of the reduce task’s
  heap to buffer shuffle data, because the reduce
  function can’t start until shuffle and sort complete.
  In the map phase, by contrast, the buffer size is
  set by io.sort.mb.
• Reduce input may consist of multiple files rather
  than a single file; a heap-based iterator merges
  them to feed the reduce function.
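
A heap-based iterator over several sorted inputs is a standard k-way merge. The sketch below shows the idea with in-memory lists of string keys standing in for sorted segments; the class `HeapMergeIterator` is an invented name, and real segments would hold serialized key/value pairs.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical k-way merge over sorted segments using a min-heap. */
class HeapMergeIterator implements Iterator<String> {
    /** The current smallest unread key of one segment, plus the rest of that segment. */
    private static final class Head {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    private final PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparing((Head h) -> h.key));

    HeapMergeIterator(List<List<String>> sortedSegments) {
        for (List<String> segment : sortedSegments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heap.isEmpty();
    }

    @Override
    public String next() {
        Head smallest = heap.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        // Refill the heap from the segment that just gave up its head.
        if (smallest.rest.hasNext()) {
            heap.add(new Head(smallest.rest.next(), smallest.rest));
        }
        return smallest.key;
    }
}
```

Because only the head of each segment sits in the heap, the reducer never needs the segments combined into one file first.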


Reduce-side Key Parameters

Optimization Tuning
• We can use
  mapred.job.reduce.input.buffer.percent, which
  specifies how much memory can be spared
  as a reduce input buffer
• Compare the following cases
  – Case-1
  – Case-2
  – Case-3

Case-1

All reduce input resides on disk
Case-2

Part of the reduce input is in memory, the rest is on disk
Case-3

Much better: all data is in memory
• If the reduce function doesn’t stress memory too
  much, we can spare some memory to
  buffer reduce input and boost overall
  performance.
• What’s more, if the input data is small, reducers
  can hold all intermediate data in memory and
  avoid disk access entirely.
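
To make the three cases concrete, the sketch below classifies a reduce task by how much of its shuffled input can stay in memory under a given mapred.job.reduce.input.buffer.percent. The class `ReduceInputCases` is hypothetical, and treating the in-memory data as freely divisible is a simplification: a real reducer keeps or flushes whole map outputs.

```java
/** Hypothetical classifier for the three reduce-input cases; a simplification. */
class ReduceInputCases {
    /** Bytes of shuffled data allowed to remain in memory while reduce runs. */
    static long retainedInMemory(long heapBytes, double inputBufferPercent,
                                 long shuffledInMemoryBytes) {
        long budget = (long) (heapBytes * inputBufferPercent);
        return Math.min(budget, shuffledInMemoryBytes);
    }

    /** Returns "Case-1" (all on disk), "Case-2" (mixed), or "Case-3" (all in memory). */
    static String classify(long heapBytes, double inputBufferPercent,
                           long inMemoryBytes, long onDiskBytes) {
        long retained = retainedInMemory(heapBytes, inputBufferPercent, inMemoryBytes);
        // Whatever exceeds the budget has to be flushed to disk before reduce starts.
        long onDisk = onDiskBytes + (inMemoryBytes - retained);
        if (retained == 0) {
            return "Case-1";
        }
        if (onDisk > 0) {
            return "Case-2";
        }
        return "Case-3";
    }
}
```

With the percentage at 0.0 nothing is retained and every byte of reduce input is read back from disk, which corresponds to Case-1.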



Optimization




Shuffle: Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduces the effect of TCP slow start.
• More importantly, better shuffle scheduling in the
  reduce phase results in better overall
  performance.
Shuffle: Netty Server & Batch Fetch (2)
One connection per map
• Each fetch thread in the reducer copies one map output
  per connection, even when there are many outputs on
  the TT.
vs
Batch fetch
• A fetch thread copies multiple map outputs per
  connection.
• While it does so, the fetch thread takes over the TT:
  other fetch threads can’t fetch outputs from that TT
  during the copying period.
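
The difference between the two strategies comes down to how pending map outputs are grouped before connecting. A minimal sketch, assuming hypothetical names (`BatchFetchPlanner`, `MapOutput`) and ignoring the actual network transfer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of batch-fetch planning; no networking involved. */
class BatchFetchPlanner {
    static final class MapOutput {
        final String host;
        final String mapId;

        MapOutput(String host, String mapId) {
            this.host = host;
            this.mapId = mapId;
        }
    }

    /** Groups pending map outputs by host so one connection can fetch a whole batch. */
    static Map<String, List<String>> planBatches(List<MapOutput> pending) {
        Map<String, List<String>> byHost = new TreeMap<>();
        for (MapOutput o : pending) {
            byHost.computeIfAbsent(o.host, h -> new ArrayList<>()).add(o.mapId);
        }
        return byHost;
    }

    /** Connections needed when every map output gets its own connection. */
    static int connectionsOnePerMap(List<MapOutput> pending) {
        return pending.size();
    }

    /** Connections needed when all outputs on a host share one connection. */
    static int connectionsBatched(List<MapOutput> pending) {
        return planBatches(pending).size();
    }
}
```

For five pending outputs spread over two task trackers, the first strategy opens five connections and the batched one opens two.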
Sort Avoidance
• Many real-world jobs require shuffling but not sorting, and
  sorting adds considerable overhead.
    – Hash Aggregation
    – Hash Join
    – … etc.

• When sorting is turned off, the mapper feeds data to the reducer
  which directly passes the data to the Reduce() function bypassing
  the intermediate sorting step.
    – Spilling, Partitioning, Merging and Reducing will be more efficient.

• How to turn off sorting?
    – JobConf job = (JobConf) getConf();
    – job.setBoolean("mapred.sort.avoidance", true);

• MAPREDUCE-4039
Sort Avoidance: Spill and Partition
• When spilling, records are compared by partition
  only.
• Records are ordered by partition using counting sort [O(n)]
  instead of quick sort [O(n log n)].
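
Counting sort fits here because the sort key is a small integer, the partition number. The sketch below orders record indices by partition in a single histogram pass plus a placement pass; the class and method names are invented for illustration and are not taken from the MAPREDUCE-4039 patch.

```java
/** Hypothetical counting sort of record indices by partition number. */
class PartitionCountingSort {
    /**
     * Returns record indices ordered by partition. The sort is stable, so records
     * within a partition keep their original relative order.
     */
    static int[] orderByPartition(int[] partitionOf, int numPartitions) {
        int[] start = new int[numPartitions + 1];
        // Histogram: start[p + 1] counts the records in partition p.
        for (int p : partitionOf) {
            start[p + 1]++;
        }
        // Prefix sums: start[p] becomes the first output slot of partition p.
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];
        }
        // Placement: drop each record index into the next free slot of its partition.
        int[] order = new int[partitionOf.length];
        for (int i = 0; i < partitionOf.length; i++) {
            order[start[partitionOf[i]]++] = i;
        }
        return order;
    }
}
```

The running time is O(n + P) for n records and P partitions, with no key comparisons at all.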
Sort Avoidance: Early Reduce (Remove shuffle barrier)
• Currently the reduce function can’t start until
  all map outputs have been fetched.
• When sorting is unnecessary, the reduce function
  can start as soon as any map output is
  available.
• Greatly improves overall performance!
Sort Avoidance: Bytes Merge
• No overhead from key/value
  serialization/deserialization or comparison
• Doesn’t deal with records, just bytes
• Just concatenates byte streams together:
  reads in bytes, writes out bytes.
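
A byte-level merge of this kind is little more than a copy loop. The following sketch uses only `java.io` streams; the class `BytesMerge` is a made-up name, and any per-segment headers or checksums a real spill file carries are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

/** Hypothetical byte-level merge: concatenates segments without parsing records. */
class BytesMerge {
    /** Copies every segment to the output in order and returns the total bytes written. */
    static long concatenate(List<InputStream> segments, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        for (InputStream in : segments) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```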
Sort Avoidance: Sequential Reduce Input
• Input files are read sequentially to feed the reduce
  function, so there are no disk seeks and performance
  is better.
YARN
(Yet Another Resource Negotiator)

Current Limitations
• Hard partition of resources into map and
  reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms
  – Iterative applications implemented using
    MapReduce are 10x slower.
  – Hacks for the likes of MPI/Graph Processing
• Lack of wire-compatible protocols
  – Client and cluster must be of the same version
  – Applications and workflows cannot migrate to
    different clusters

Current Limitations(Cont.)
• Scalability
  – Maximum cluster size: 4,000 nodes
  – Maximum concurrent tasks: 40,000
  – Coarse synchronization in JobTracker
• Single point of failure
  – Failure kills all queued and running jobs
  – Jobs need to be resubmitted by the user
• Restart is very tricky due to complex state

YARN Architecture

Architecture
• Resource Manager
  – Global resource scheduler
  – Hierarchical queues
• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – E.g. MapReduce Application Master


Design Centre
• Split up the two major functions of the
  JobTracker
  – Cluster resource management
  – Application life-cycle management
• MapReduce becomes a user-land library

Code
• MapReduce Classic
  – Mess
• YARN
  – Better

Questions?



ant.rao@gmail.com




Secondary Sort
• Want to sort by value
• Solution
  – setOutputKeyComparatorClass
  – setOutputValueGroupingComparator
  – Partitioner
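
The three pieces listed above cooperate through a composite key. The sketch below simulates their roles in plain Java rather than with Hadoop classes, so `SecondarySortSketch`, `CompositeKey`, and the `SORT` and `GROUP` comparators are illustrative stand-ins for the classes a job would register via `setOutputKeyComparatorClass`, `setOutputValueGroupingComparator`, and the partitioner setting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java simulation of secondary sort; no Hadoop classes involved. */
class SecondarySortSketch {
    /** Composite key: the natural key plus the value we want sorted. */
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    /** Partitioner role: route on the natural key only, so one key meets one reducer. */
    static int partition(CompositeKey k, int numReducers) {
        return (k.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Key comparator role: order by natural key, then by value. */
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                    .thenComparingInt((CompositeKey k) -> k.value);

    /** Grouping comparator role: keys with the same natural key form one reduce call. */
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    /** Simulates one reducer: sort, then start a new group whenever GROUP sees a change. */
    static Map<String, List<Integer>> reduceSide(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        for (CompositeKey k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                groups.put(k.naturalKey, new ArrayList<>());
            }
            groups.get(k.naturalKey).add(k.value);
            previous = k;
        }
        return groups;
    }
}
```

Each group then reaches the reduce function with its values already in ascending order, without the reducer buffering and sorting them itself.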




