Hadoop MapReduce Introduction and Deep Insight


Introduction and Implementation of Hadoop MapReduce, for training.

Published in: Technology

  1. 1. Hadoop MapReduce: Introduction and Deep Insight. July 9, 2012. Anty Rao, Big Data Engineering Team, Hanborq Inc.
  2. 2. Outline
     • Architecture
     • Job Tracker
     • Task Tracker
     • Map/Reduce internal
     • Optimization
     • YARN
  3. 3. Architecture
     [Diagram: a MapReduce client talks to the JobTracker over RPC; three TaskTrackers report to the JobTracker through heartbeats; each TaskTracker runs three Child JVMs.]
  4. 4. Job Tracker
  5. 5. Job Tracker
     • Manages cluster resources
     • Job scheduling
  6. 6. Implementation Overview
  7. 7. ExpireLaunchingTasks
     • A thread that times out tasks that have been assigned to task trackers but have not reported back yet.
     • Once the task tracker reports on a task, the task tracker takes over responsibility for monitoring its execution, such as killing the task if it becomes unresponsive.
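The launch-timeout bookkeeping described on this slide can be sketched as follows. This is a minimal illustration in plain Java, not the JobTracker's code: the class name, method names, and the idea of passing the clock in as a parameter are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of launch-timeout bookkeeping; not a Hadoop class.
class LaunchingTaskExpirySketch {
    private final long timeoutMillis;
    // attempt id -> time at which the attempt was handed to a task tracker
    private final Map<String, Long> launching = new LinkedHashMap<>();

    LaunchingTaskExpirySketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when the job tracker assigns an attempt to a task tracker.
    synchronized void taskAssigned(String attemptId, long nowMillis) {
        launching.put(attemptId, nowMillis);
    }

    // Called when the task tracker first reports on the attempt. From here on
    // the task tracker monitors it, so the launch timeout no longer applies.
    synchronized void taskReported(String attemptId) {
        launching.remove(attemptId);
    }

    // One pass of the expiry thread: collect and forget every attempt that
    // was assigned more than timeoutMillis ago and never reported back.
    synchronized List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launching.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A real expiry thread would call something like `expire(System.currentTimeMillis())` in a loop and hand the returned attempts back to the scheduler as failed launches.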
  8. 8. ExpireTrackers
     • Monitors task tracker status and expires task trackers that have gone down.
     • After a task tracker dies, all tasks that resided on it are rescheduled.
  9. 9. RetireJobs
     • Removes old finished jobs that have been around too long.
     • The job tracker can’t retain the info of every finished job.
     • There is also an upper limit on the number of retained jobs on a per-user basis.
  10. 10. JobInitThread
     • Initializes jobs that have just been created.
     • Job initialization includes:
       – Creating split info per map
       – Creating map tasks
       – Creating reduce tasks
  11. 11. TaskCommitQueue
     • A thread that does all of the HDFS-related operations for tasks:
       – Promotes outputs of COMMIT_PENDING tasks
       – Discards outputs of FAILED/KILLED tasks
     • All local file system operations are handled by the task trackers.
  12. 12. HTTP Server
     • Serves job tracker status
     • Serves the status of all jobs
       – Per-job metrics
     • Serves the status of historical jobs
  13. 13. Key Data Structures
     • JobInProgress
       – Maintains all the info needed to keep a job on the straight and narrow.
       – Keeps the job’s JobProfile and its latest JobStatus, plus a set of tables for bookkeeping of its tasks.
       – Penalizes a lost task tracker once for each job that had any tasks running on it when it was lost.
     • TaskInProgress
       – Maintains all the info needed for a task during the lifetime of its owning job.
       – A given task might be speculatively executed or re-executed.
       – Maintains multiple task states, one for each task attempt.
  14. 14. The whole life of a job
  15. 15. [Figure]
  16. 16. The life of a job
     • Client
       – The user creates a custom mapper and reducer; the client computes the splits and uploads the job configuration file, jar file, and split meta info to HDFS.
       – Submits the job to the job tracker.
     • Job Tracker
       – Initializes the job: reads in the job split info, determines the final number of maps, and creates all needed map and reduce tasks along with the structures that represent them.
       – Tasks are pulled by task trackers through heartbeats.
  17. 17. The life of a job
     • Task Tracker
       – Pulls tasks from the job tracker through heartbeats.
       – Initializes the job, only once per job.
       – Initializes the task:
         • Downloads all needed jar files, configuration files, and distributed cache files from HDFS to local disk.
         • Creates a staging working directory for the task on local disk.
         • Localizes the configuration file.
       – Builds the Java launch options and sets up the Child JVM.
  18. 18. The life of a job
     • Child JVM
       – Gets its task info from the task tracker over RPC.
       – Does the actual work of executing the map or reduce function; while running, it reports status regularly so that the task tracker does not kill it as unresponsive.
       – Retrieves map completion events from the task tracker if needed, and reports fetch failures to the TT.
       – When the task is done, reports the COMMIT_PENDING or SUCCEEDED state to the TT.
  19. 19. Task Tracker
  20. 20. Task Tracker
     • Per-node agent
     • Manages tasks
  21. 21. Implementation Overview
  22. 22. TT Main Thread
     • Heartbeats with the JT periodically to report task status and retrieve directives, which include launch-task, kill-job, and kill-task actions.
     • Kills tasks that stay unresponsive beyond the configured time period.
     • If there isn’t enough disk space to accommodate all running tasks, picks tasks to kill.
     • If the TT has expired, it reinitializes itself.
  23. 23. taskCleanupThread
     • Thread dedicated to processing the cleanup actions assigned by the JT:
       – Kill job action
       – Kill task action
  24. 24. directoryCleanupThread
     • Before a task executes, an execution environment is created:
       – Create the staging directory
       – Copy the configuration file
       – Etc.
     • While running, a task may produce multiple intermediate files in its local staging directory.
     • After the job or task completes or fails, all of these leftover directories and files are deleted.
  25. 25. taskLauncher
     • Localizes the job
     • Localizes the task
     • Creates a TaskRunner thread to manage the Child JVM
  26. 26. TaskRunner
     • It’s a thread
     • Two types
       – MapTaskRunner
       – ReduceTaskRunner
     • Main duties
       – Builds the Java launch options and execution environment
       – In charge of launching and killing the Child JVM
  27. 27. MapEventsFetcherThread
     • When there are reduce tasks in the shuffle phase, fetches map completion events from the JT over RPC, on a per-job basis.
  28. 28. Child JVM
     • Actually executes the map/reduce function
     • Reports status to the TT periodically
     • Retrieves map completion events from the TT for reduce tasks if needed
  29. 29. Key data structures
     • Running Jobs
       – JobID
       – JobConf
       – Set<TaskInProgress>
     • TaskInProgress
       – Task
       – TaskStatus
       – TaskRunner
  30. 30. Map/Reduce Internal
  31. 31. Map/Reduce Programming Model
     Hadoop: The Definitive Guide
  32. 32. Map Phase Diagram
  33. 33. Steps of Map Phase
     • Records emitted by the map function are put into a circular buffer continually.
     • When buffer usage exceeds io.sort.mb * io.sort.spill.percent, a spill starts: it sorts the records by partition and then by key, and writes the buffer out to disk together with an index file that records the position where each partition begins.
     • A final merge combines all the intermediate spill files into a single large file, plus an index file.
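The spill threshold and the sort-then-index step can be illustrated with a small plain-Java sketch. It is a simplification under stated assumptions: the `Rec` type and all method names are hypothetical, records are counted rather than measured in bytes for the index, and the split of the buffer between record data and accounting data is ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a map-side spill; not Hadoop's MapOutputBuffer.
class MapSpillSketch {
    // One buffered map output record (illustrative type).
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Buffer usage in bytes at which a spill starts:
    // io.sort.mb (in megabytes) * io.sort.spill.percent.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return (long) (ioSortMb * 1024L * 1024L * spillPercent);
    }

    // Order the buffered records by partition first, then by key.
    static List<Rec> sortForSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing((Rec r) -> r.key));
        return sorted;
    }

    // Index of the sorted run: start[p] is the position of the first record
    // of partition p, and start[numPartitions] is the total record count.
    static int[] partitionIndex(List<Rec> sorted, int numPartitions) {
        int[] start = new int[numPartitions + 1];
        for (Rec r : sorted) {
            start[r.partition + 1]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];
        }
        return start;
    }
}
```

With io.sort.mb set to 100 and io.sort.spill.percent set to 0.80, `spillThresholdBytes` gives 80 MB of buffered output before a spill begins. The index lets a reducer's fetch seek straight to its own partition inside the spill file.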
  34. 34. Main map-side tuning knobs
  35. 35. Reduce Phase Diagram
  36. 36.
     <property>
       <name>mapred.tasktracker.indexcache.mb</name>
       <value>10</value>
       <description>The maximum memory that a task tracker allows for the
       index cache that is used when serving map outputs to reducers.
       </description>
     </property>
  37. 37. Steps of Reduce Phase
     • Pull data over from the maps. If there is space available in memory and the size of the map output is less than 25% * HeapSize * mapred.job.shuffle.input.buffer.percent, the output is kept in memory; otherwise it is stored directly on disk.
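The memory-or-disk decision on this slide amounts to two checks, sketched below in plain Java. The 25% factor and the formula come from the slide; the class and method names are hypothetical, and the way "space available" is tested here (used bytes plus the new output must fit in the shuffle buffer) is an assumption made for illustration.

```java
// Hypothetical sketch of the reduce-side shuffle placement rule; not Hadoop code.
class ShuffleDestinationSketch {
    // A single map output may use at most this fraction of the shuffle buffer
    // (the 25% on the slide).
    static final double MAX_SINGLE_OUTPUT_FRACTION = 0.25;

    // Size of the in-memory shuffle buffer:
    // HeapSize * mapred.job.shuffle.input.buffer.percent.
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // True if the fetched map output should be kept in memory,
    // false if it should be written directly to disk.
    static boolean keepInMemory(long outputBytes, long usedBytes,
                                long heapBytes, double inputBufferPercent) {
        long buffer = shuffleBufferBytes(heapBytes, inputBufferPercent);
        long maxSingleOutput = (long) (buffer * MAX_SINGLE_OUTPUT_FRACTION);
        boolean smallEnough = outputBytes < maxSingleOutput;
        boolean spaceAvailable = usedBytes + outputBytes <= buffer; // assumed check
        return smallEnough && spaceAvailable;
    }
}
```

For example, with a 1,000,000-byte heap and a buffer percent of 0.5, the buffer is 500,000 bytes, so any single map output of 125,000 bytes or more goes to disk regardless of how empty the buffer is.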
  38. 38. Steps of Reduce Phase (Cont.)
     • The merge operation merges and sorts data from memory and/or disk and writes the result to disk. It comes in two flavors:
       – In-memory merge
         • Triggered when the memory used by accumulated map outputs exceeds the mapred.job.shuffle.merge.percent threshold.
       – On-disk merge
         • Triggered when the number of files on disk exceeds the configured threshold.
  39. 39. Steps of Reduce Phase (Cont.)
     • When shuffle and sort complete, the following constraints must be satisfied before the reduce function is fed:
       – Memory used for buffering reduce input can’t exceed mapred.job.reduce.input.buffer.percent.
       – The number of files on disk can’t exceed io.sort.factor.
  40. 40. Notes about Reduce
     • Shuffle and sort use a percentage of the reduce task’s heap to buffer shuffle data, which is possible because the reduce function can’t start until shuffle and sort complete. In the map phase, by contrast, the buffer size is set by io.sort.mb.
     • Reduce input may consist of multiple files, not necessarily a single file; a heap-based iterator merges them to feed the reduce function.
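A heap-based iterator over several sorted inputs is a k-way merge driven by a priority queue. The sketch below shows the technique on in-memory lists of string keys; the class name and the use of lists in place of on-disk segments are assumptions made to keep it runnable without Hadoop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a heap-driven k-way merge; not Hadoop's Merger.
class HeapMergeSketch {
    // One sorted run together with its current smallest unread key.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> run) {
            this.rest = run;
            this.head = run.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merge several individually sorted runs into one sorted sequence
    // without first combining them into a single file.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance the run and put it back so the heap reorders it.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

Each step costs O(log k) for k runs, which is why the reduce side only needs the number of inputs to stay bounded rather than a single fully merged file.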
  41. 41. Reduce-side key parameters
  42. 42. Optimization Tuning
     • We can make use of mapred.job.reduce.input.buffer.percent, which specifies how much memory can be spared as a reduce input buffer.
     • Look at the difference between the following cases:
       – Case-1
       – Case-2
       – Case-3
  43. 43. Case-1: All reduce input resides on disk
  44. 44. Case-2: Partial data in memory, plus data on disk, as reduce input
  45. 45. Case-3: Much better, all data in memory
  46. 46.
     • If the reduce function doesn’t stress memory too much, we can spare some memory to buffer reduce input and boost overall performance.
     • What’s more, if the input data is small, we can let reduces hold all intermediate data in memory, with no disk access involved.
  47. 47. Optimization
  48. 48. Shuffle: Netty Server & Batch Fetch (1)
     • Less TCP connection overhead.
     • Reduces the effect of TCP slow start.
     • More importantly, better shuffle scheduling in the reduce phase results in better overall performance.
  49. 49. Shuffle: Netty Server & Batch Fetch (2)
     • One connection per map
       – Each fetch thread in a reduce copies one map output per connection, even when there are many outputs on the TT.
     • vs. Batch fetch
       – A fetch thread copies multiple map outputs per connection.
       – The fetch thread takes over the TT: other fetch threads can’t fetch outputs from this TT during the copying period.
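The difference between the two fetch strategies comes down to how map outputs are grouped before a connection is opened. The following plain-Java sketch counts connections under each strategy; the `MapOutput` type, the host names, and the method names are hypothetical, and no networking is performed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch comparing per-map fetch with batch fetch; not Hadoop code.
class BatchFetchSketch {
    // A completed map output and the task tracker host that serves it.
    static final class MapOutput {
        final String mapId;
        final String host;

        MapOutput(String mapId, String host) {
            this.mapId = mapId;
            this.host = host;
        }
    }

    // Batch fetch: group pending outputs by host so that one connection
    // to a task tracker can copy all of its outputs.
    static Map<String, List<String>> groupByHost(List<MapOutput> pending) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (MapOutput output : pending) {
            byHost.computeIfAbsent(output.host, h -> new ArrayList<>()).add(output.mapId);
        }
        return byHost;
    }

    // One connection per map output.
    static int connectionsPerMap(List<MapOutput> pending) {
        return pending.size();
    }

    // One connection per task tracker host.
    static int connectionsBatched(List<MapOutput> pending) {
        return groupByHost(pending).size();
    }
}
```

Fewer connections means fewer TCP handshakes and fewer transfers that start in slow start, which matches the benefits listed on the previous slide.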
  50. 50. Sort Avoidance
     • Many real-world jobs require shuffling but not sorting, and sorting adds considerable overhead.
       – Hash Aggregation
       – Hash Join
       – … etc.
     • When sorting is turned off, the mapper feeds data to the reducer, which passes the data directly to the Reduce() function, bypassing the intermediate sorting step.
       – Spilling, Partitioning, Merging and Reducing will be more efficient.
     • How to turn off sorting?
       – JobConf job = (JobConf) getConf();
       – job.setBoolean("mapred.sort.avoidance", true);
     • MAPREDUCE-4039
  51. 51. Sort Avoidance: Spill and Partition
     • When spilling, records are compared by partition only.
     • Partition ordering uses counting sort [O(n)] instead of quicksort [O(n log n)].
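Counting sort works here because the partition number is a small integer in a known range. A minimal plain-Java sketch follows; the class and method names are hypothetical, and records are represented only by their partition numbers.

```java
import java.util.Arrays;

// Hypothetical sketch of ordering spill records by partition with counting sort.
class PartitionCountingSortSketch {
    // partitions[i] is the partition of record i. Returns the record indices
    // grouped by partition, keeping the original order within each partition.
    static int[] orderByPartition(int[] partitions, int numPartitions) {
        // Pass 1: count the records in each partition.
        int[] start = new int[numPartitions + 1];
        for (int p : partitions) {
            start[p + 1]++;
        }
        // Prefix sums turn the counts into the start offset of each partition.
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];
        }
        // Pass 2: drop each record index into the next free slot of its partition.
        int[] next = Arrays.copyOf(start, numPartitions);
        int[] order = new int[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
            order[next[partitions[i]]++] = i;
        }
        return order;
    }
}
```

Both passes are linear in the number of records, and no key is ever compared, which is the saving the slide refers to.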
  52. 52. Sort Avoidance: Early Reduce (Remove shuffle barrier)
     • Currently the reduce function can’t start until all map outputs have been fetched.
     • When sorting is unnecessary, the reduce function can start as soon as any map output is available.
     • This greatly improves overall performance.
  53. 53. Sort Avoidance: Bytes Merge
     • No overhead from key/value serialization, deserialization, or comparison.
     • The merge doesn’t care about records, just bytes.
     • It just concatenates byte streams together: read in bytes, write out bytes.
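A merge that only concatenates byte streams can be sketched with the standard `java.io` classes. The class name, method name, and buffer size below are assumptions for illustration; the point is that nothing in the loop decodes or compares a record.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch of a byte-level merge; not Hadoop's merge code.
class BytesMergeSketch {
    // Copy every segment to the output in order and return the byte count.
    // No key or value is deserialized, serialized, or compared.
    static long concatenate(List<? extends InputStream> segments, OutputStream out) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            for (InputStream segment : segments) {
                int read;
                while ((read = segment.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                    total += read;
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

This only produces valid reduce input when ordering does not matter, which is exactly the sort-avoidance case.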
  54. 54. Sort Avoidance: Sequential Reduce Input
     • Input files are read sequentially to feed the reduce function, so there are no disk seeks and performance is better.
  55. 55. YARN (Yet Another Resource Negotiator)
  56. 56. Current Limitations
     • Hard partition of resources into map and reduce slots
       – Low resource utilization
     • Lacks support for alternate paradigms
       – Iterative applications implemented using MapReduce are 10x slower.
       – Hacks for the likes of MPI/Graph Processing
     • Lack of wire-compatible protocols
       – Client and cluster must be of the same version
       – Applications and workflows cannot migrate to different clusters
  57. 57. Current Limitations (Cont.)
     • Scalability
       – Maximum cluster size: 4,000 nodes
       – Maximum concurrent tasks: 40,000
       – Coarse synchronization in JobTracker
     • Single point of failure
       – Failure kills all queued and running jobs
       – Jobs need to be re-submitted by the user
     • Restart is very tricky due to complex state
  58. 58. YARN Architecture
  59. 59. Architecture
     • Resource Manager
       – Global resource scheduler
       – Hierarchical queues
     • Node Manager
       – Per-machine agent
       – Manages the life cycle of containers
       – Container resource monitoring
     • Application Master
       – Per-application
       – Manages application scheduling and task execution
       – E.g. MapReduce Application Master
  60. 60. Design Centre
     • Split up the two major functions of the JobTracker
       – Cluster resource management
       – Application life-cycle management
     • MapReduce becomes a user-land library
  61. 61. Code
     • MapReduce Classic
       – Mess
     • YARN
       – Better
  62. 62. Questions? ant.rao@gmail.com
  63. 63. Secondary Sort
     • Want to sort by value
     • Solution
       – setOutputKeyComparatorClass
       – setOutputValueGroupingComparator
       – Partitioner
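The three pieces of the secondary-sort recipe each play a distinct role, which the plain-Java simulation below tries to make visible. It does not use the Hadoop API: `CompositeKey`, `SORT`, `GROUP`, `partition`, and `reduceGroups` are hypothetical names, and the reducer is simulated by a loop over a sorted list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of secondary sort; no Hadoop classes are used.
class SecondarySortSketch {
    // The value is moved into the key so the framework's sort can see it.
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Role of setOutputKeyComparatorClass: order by natural key, then by value.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                    .thenComparingInt(k -> k.value);

    // Role of setOutputValueGroupingComparator: keys with the same natural
    // key belong to the same reduce call.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Role of the Partitioner: route by natural key only, so every record of
    // one natural key reaches the same reducer.
    static int partition(CompositeKey key, int numReduces) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // What one reducer would see: one group per natural key, values ascending.
    static Map<String, List<Integer>> reduceGroups(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.put(key.naturalKey, new ArrayList<>());
            }
            groups.get(key.naturalKey).add(key.value);
            previous = key;
        }
        return groups;
    }
}
```

In a real job the same three roles would be filled by classes registered through the JobConf methods named on the slide.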