ExpireLaunchingTasks
• A thread that times out tasks that have been assigned to task trackers but have not reported back yet.
• Once a task has reported back, the task tracker takes over responsibility for monitoring its execution, such as killing the task if it becomes unresponsive.
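The expiry check described above can be sketched roughly as follows. This is a minimal illustration, not the JobTracker's actual code: the class `LaunchExpirySketch`, its method names, and the timeout value are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not Hadoop's.
class LaunchExpirySketch {
    // Task attempt id -> time (ms) at which it was assigned to a task tracker.
    private final Map<String, Long> launching = new LinkedHashMap<>();
    private final long timeoutMs;

    LaunchExpirySketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a task is handed to a task tracker.
    void addLaunching(String attemptId, long nowMs) {
        launching.put(attemptId, nowMs);
    }

    // Called when the task tracker first reports on the task;
    // from here on the task tracker monitors the task itself.
    void reported(String attemptId) {
        launching.remove(attemptId);
    }

    // One pass of the expiry loop: remove and return attempts that waited too long.
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        launching.entrySet().removeIf(e -> {
            boolean tooOld = nowMs - e.getValue() > timeoutMs;
            if (tooOld) {
                expired.add(e.getKey());
            }
            return tooOld;
        });
        return expired;
    }

    public static void main(String[] args) {
        LaunchExpirySketch sketch = new LaunchExpirySketch(1000);
        sketch.addLaunching("attempt_1", 0);
        sketch.addLaunching("attempt_2", 0);
        sketch.reported("attempt_1");
        System.out.println(sketch.expire(1500)); // prints [attempt_2]
    }
}
```

A real implementation would run `expire` periodically on its own thread and fail the expired attempts so they can be rescheduled.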
ExpireTrackers
• Monitors task tracker status and expires task trackers that have gone down.
• After a task tracker dies, all tasks residing on it are rescheduled.
RetireJobs
• Removes old finished jobs that have been around too long.
• The job tracker can’t retain info for all finished jobs.
• There is also an upper limit on the number of retained jobs per user.
JobInitThread
• Initializes jobs that have just been created.
• Job initialization includes:
  – Create split info per map
  – Create map tasks
  – Create reduce tasks
TaskCommitQueue
• A thread that performs all HDFS-related file system operations for tasks:
  – Promote outputs of COMMIT_PENDING tasks
  – Discard outputs of FAILED/KILLED tasks
• All local file system operations are handled by the task trackers.
HTTP Server
• Serves job tracker status
• Serves the status of all jobs
  – Per-job metrics
• Serves job history status
Key Data Structures
• JobInProgress
  – Maintains all the info for keeping a job on the straight and narrow.
  – Keeps the job’s JobProfile and its latest JobStatus, plus a set of tables for bookkeeping of its tasks.
  – Penalizes a task tracker for each of the jobs that had tasks running on it when it was lost.
• TaskInProgress
  – Maintains all the info needed for a task over the lifetime of its owning job.
  – A given task might be speculatively executed or re-executed.
  – Maintains multiple task states, one per task attempt.
The life of a job
• Client
  – The user writes a custom mapper and reducer; the client computes splits and uploads the job configuration file, jar file, and split meta info to HDFS
  – Submits the job to the job tracker
• Job Tracker
  – Initializes the job: reads in the job split info, determines the final number of maps, and creates all needed map and reduce tasks along with the structures that represent them
  – Tasks are pulled by task trackers through heartbeats
The life of a job
• Task Tracker
  – Pulls tasks from the job tracker through heartbeats
  – Initializes the job, only once per job
  – Initializes the task
    • Downloads all needed jar files, configuration files, and distributed cache files from HDFS to local disk
    • Creates a staging working directory for the task on local disk
    • Localizes the configuration file
  – Builds the Java launch options and sets up the Child JVM
The life of a job
• Child JVM
  – Gets its task info from the task tracker over RPC
  – Does the actual work: executes the map or reduce function, reporting status regularly so that the task tracker does not kill it as unresponsive
  – Retrieves map completion events from the task tracker, if needed, and reports fetch failures to the TT
  – When the task is done, reports the COMMIT_PENDING or SUCCEEDED state to the TT
TT Main Thread
• Heartbeats with the JT periodically to report task status and retrieve directives, which include launch task, kill job, and kill task actions
• Kills tasks that have been unresponsive for longer than the configured time period
• If there isn’t enough disk space to accommodate all running tasks, picks tasks to kill
• If the TT has expired, it reinitializes itself.
taskCleanupThread
• A thread dedicated to processing cleanup actions assigned by the JT
  – Kill job action
  – Kill task action
directoryCleanupThread
• Before a task executes, an execution environment is created
  – Create staging directory
  – Copy configuration file
  – Etc.
• While running, a task may produce multiple intermediate files in its local staging directory
• After a job or task completes or fails, this thread deletes all these leftover directories and files.
taskLauncher
• Localize job
• Localize task
• Create a TaskRunner thread to manage the Child JVM
TaskRunner
• It’s a Thread
• Two types
  – MapTaskRunner
  – ReduceTaskRunner
• Main duties
  – Builds the Java launch options and the execution environment
  – Is in charge of launching and killing the Child JVM.
MapEventsFetcherThread
• When there are reduce tasks in the shuffle phase, fetches map completion events from the JT over RPC, on a per-job basis.
Child JVM
• Actually executes the map/reduce function
• Reports status to the TT periodically
• Retrieves map completion events from the TT for reduce tasks, if needed.
Key data structures
• Running Jobs
  – JobID
  – JobConf
  – Set<TaskInProgress>
• TaskInProgress
  – Task
  – TaskStatus
  – TaskRunner
Steps of Map Phase
• Records emitted by the map function are continually put into a circular buffer.
• When buffer usage exceeds io.sort.mb * io.sort.spill.percent, a spill starts: records are sorted by partition and then by key, and the buffer is written to disk along with an index file that records where each partition begins.
• A final merge combines all the intermediate spill files into a single large file, plus an index file.
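As a rough illustration of the spill step above, the sketch below computes the spill threshold and orders buffered records by partition and then key. It is a simplification under stated assumptions: `SpillSketch`, `Rec`, and the method names are hypothetical, keys are plain strings, and the "index" holds record offsets where a real index file would hold byte positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; class and method names are illustrative, not Hadoop's.
class SpillSketch {
    // A buffered map output record, reduced to the fields that drive the sort.
    static final class Rec {
        final int partition;
        final String key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    // Buffer usage (bytes) at which a spill starts: io.sort.mb * io.sort.spill.percent.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return Math.round(ioSortMb * 1024.0 * 1024.0 * spillPercent);
    }

    // Spill order: by partition first, then by key within a partition.
    static List<Rec> sortForSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing(r -> r.key));
        return sorted;
    }

    // Simplified index: for each partition, the offset of its first record.
    // An empty partition gets the offset where the next partition starts.
    static int[] partitionStarts(List<Rec> sorted, int numPartitions) {
        int[] starts = new int[numPartitions];
        int i = 0;
        for (int p = 0; p < numPartitions; p++) {
            while (i < sorted.size() && sorted.get(i).partition < p) {
                i++;
            }
            starts[p] = i;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(spillThresholdBytes(100, 0.80));
        List<Rec> sorted = sortForSpill(Arrays.asList(
                new Rec(1, "b"), new Rec(0, "z"), new Rec(1, "a"), new Rec(0, "c")));
        System.out.println(Arrays.toString(partitionStarts(sorted, 3)));
    }
}
```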
<property>
  <name>mapred.tasktracker.indexcache.mb</name>
  <value>10</value>
  <description>The maximum memory that a task tracker allows for the index cache that is used when serving map outputs to reducers.</description>
</property>
Steps of Reduce Phase
• Map outputs are pulled over from the map side. If there is space available in memory and the size of the output is less than 25% * HeapSize * mapred.job.shuffle.input.buffer.percent, it is kept in memory; otherwise it is stored directly on disk.
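The placement rule above amounts to a small piece of arithmetic, sketched here. The class `ShufflePlacementSketch`, its method names, and the `usedBytes` bookkeeping are assumptions for illustration; only the 25% factor and the property name come from the slide.

```java
// Hypothetical sketch of the in-memory vs. on-disk decision; not Hadoop's code.
class ShufflePlacementSketch {
    // The 25% factor from the rule above: no single map output may take more
    // than a quarter of the shuffle buffer.
    static final double MAX_SINGLE_SEGMENT_FRACTION = 0.25;

    // Shuffle buffer size: HeapSize * mapred.job.shuffle.input.buffer.percent.
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // True if a fetched map output should be held in memory, false if it goes to disk.
    static boolean fitsInMemory(long mapOutputBytes, long usedBytes,
                                long heapBytes, double inputBufferPercent) {
        long buffer = shuffleBufferBytes(heapBytes, inputBufferPercent);
        long maxSingle = (long) (buffer * MAX_SINGLE_SEGMENT_FRACTION);
        boolean spaceAvailable = usedBytes + mapOutputBytes <= buffer;
        return spaceAvailable && mapOutputBytes < maxSingle;
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024L * 1024L; // 1 GiB reduce heap, chosen for the example
        double percent = 0.70;             // example value for the buffer percent
        System.out.println(fitsInMemory(10L * 1024 * 1024, 0, heap, percent));
        System.out.println(fitsInMemory(400L * 1024 * 1024, 0, heap, percent));
    }
}
```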
Steps of Reduce Phase(Cont.)
• The merge operation merges and sorts data from memory and/or disk and writes the result to disk. It comes in two flavors:
  – In-memory merge
    • Can be triggered when accumulated memory usage exceeds mapred.job.shuffle.merge.percent.
  – On-disk merge
    • Is triggered when the number of files on disk exceeds a configured threshold.
Steps of Reduce Phase(Cont.)
• When shuffle and sort complete, the following constraints must be satisfied before the reduce function is fed:
  – Memory used for buffering reduce input can’t exceed mapred.job.reduce.input.buffer.percent;
  – The number of files on disk can’t exceed io.sort.factor
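The merge triggers and the final constraints described in the reduce-phase steps can be expressed as simple predicates. The sketch below makes several assumptions: `ReduceMergeSketch` and its methods are hypothetical, the in-memory trigger is read as a fraction of the shuffle buffer, the reduce input limit is read as a fraction of the heap, and the on-disk threshold is passed in as a plain parameter because the slides do not name it.

```java
// Hypothetical sketch of the merge triggers and pre-reduce constraints; not Hadoop's code.
class ReduceMergeSketch {
    // In-memory merge trigger: buffered bytes exceed
    // shuffleBufferBytes * mapred.job.shuffle.merge.percent (assumed interpretation).
    static boolean inMemoryMergeNeeded(long usedBytes, long shuffleBufferBytes,
                                       double mergePercent) {
        return usedBytes > shuffleBufferBytes * mergePercent;
    }

    // On-disk merge trigger: file count exceeds a configured threshold
    // (the threshold is an unnamed parameter here).
    static boolean onDiskMergeNeeded(int filesOnDisk, int fileThreshold) {
        return filesOnDisk > fileThreshold;
    }

    // Constraints to satisfy before the reduce function is fed:
    // in-memory input within heap * mapred.job.reduce.input.buffer.percent,
    // and no more than io.sort.factor files on disk.
    static boolean readyForReduce(long inMemoryBytes, long heapBytes,
                                  double reduceInputBufferPercent,
                                  int filesOnDisk, int ioSortFactor) {
        boolean memoryOk = inMemoryBytes <= heapBytes * reduceInputBufferPercent;
        boolean filesOk = filesOnDisk <= ioSortFactor;
        return memoryOk && filesOk;
    }

    public static void main(String[] args) {
        System.out.println(inMemoryMergeNeeded(70, 100, 0.66));
        System.out.println(onDiskMergeNeeded(5, 19));
        // With a buffer percent of 0.0, any data left in memory must be merged to disk first.
        System.out.println(readyForReduce(1, 1000, 0.0, 5, 10));
    }
}
```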
Notes about Reduce
• Shuffle and sort use a percentage of the reduce heap to buffer shuffle data, because the reduce function can’t start until shuffle and sort complete. In the map phase, by contrast, the buffer size is determined by io.sort.mb.
• Reduce input may consist of multiple files, not necessarily a single file; a heap iterator is used to feed the reduce function.
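The heap iterator mentioned above is, in essence, a k-way merge over already-sorted inputs. Here is a minimal sketch, assuming string keys held in in-memory lists in place of real sorted segments; `HeapMergeSketch` and `Cursor` are hypothetical names, not Hadoop classes.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical sketch of a heap-based merge iterator; not Hadoop's merger.
class HeapMergeSketch {
    // Tracks the current head of one sorted run.
    static final class Cursor {
        final Iterator<String> it;
        String head;

        Cursor(Iterator<String> it) {
            this.it = it;
            this.head = it.next();
        }

        // Moves to the next element; returns false when the run is exhausted.
        boolean advance() {
            if (it.hasNext()) {
                head = it.next();
                return true;
            }
            return false;
        }
    }

    // Lazily merges sorted runs using a min-heap keyed on each run's head,
    // so the runs never need to be combined into a single file first.
    static Iterator<String> merge(List<List<String>> sortedRuns) {
        final PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public String next() {
                Cursor smallest = heap.poll();
                if (smallest == null) {
                    throw new NoSuchElementException();
                }
                String out = smallest.head;
                if (smallest.advance()) {
                    heap.add(smallest); // re-insert with its new head
                }
                return out;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> merged = merge(Arrays.asList(
                Arrays.asList("a", "c", "e"), Arrays.asList("b", "d")));
        while (merged.hasNext()) {
            System.out.print(merged.next() + " "); // prints a b c d e
        }
        System.out.println();
    }
}
```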
Optimization Tuning
• mapred.job.reduce.input.buffer.percent specifies how much memory can be set aside as a reduce input buffer.
• Compare the following cases:
  – Case-1
  – Case-2
  – Case-3
• If the reduce function doesn’t stress memory too much, some memory can be set aside to buffer reduce input and boost overall performance.
• What’s more, if the input data is small, reducers can hold all intermediate data in memory, with no disk access.
Shuffle: Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduces the effect of TCP slow start.
• More importantly, better shuffle scheduling in the reduce phase results in better overall performance.
Shuffle: Netty Server & Batch Fetch (2)
One connection per map
• Each fetch thread in a reducer copies one map output per connection, even if the TT holds many outputs.
vs. Batch fetch
• A fetch thread copies multiple map outputs per connection.
• This fetch thread takes over the TT; other fetch threads can’t fetch outputs from this TT during the copying period.
Sort Avoidance
• Many real-world jobs require shuffling but not sorting, and sorting adds significant overhead.
  – Hash Aggregation
  – Hash Join
  – … etc.
• When sorting is turned off, the mapper feeds data to the reducer, which passes it directly to the Reduce() function, bypassing the intermediate sorting step.
  – Spilling, partitioning, merging, and reducing become more efficient.
• How to turn off sorting?
  – JobConf job = (JobConf) getConf();
  – job.setBoolean("mapred.sort.avoidance", true);
• MAPREDUCE-4039
Sort Avoidance: Spill and Partition
• When spilling, records are compared by partition only.
• Records are ordered by partition with a counting sort, O(n), instead of a quick sort, O(n log n).
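To make the O(n) claim concrete, here is a minimal counting sort that orders record indices by partition number only. The class and method names are hypothetical, and this is a generic counting sort, not the code from MAPREDUCE-4039.

```java
import java.util.Arrays;

// Hypothetical sketch: a generic counting sort keyed on partition number.
class PartitionCountingSortSketch {
    // partitionOf[i] is the partition of record i.
    // Returns record indices grouped by partition, preserving arrival order
    // within each partition, in O(n + numPartitions) time.
    static int[] orderByPartition(int[] partitionOf, int numPartitions) {
        int[] start = new int[numPartitions + 1];
        for (int p : partitionOf) {
            start[p + 1]++;                 // count records per partition
        }
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];       // prefix sums give each partition's first slot
        }
        int[] next = Arrays.copyOf(start, numPartitions);
        int[] order = new int[partitionOf.length];
        for (int i = 0; i < partitionOf.length; i++) {
            order[next[partitionOf[i]]++] = i; // place record i in its partition's next slot
        }
        return order;
    }

    public static void main(String[] args) {
        int[] order = orderByPartition(new int[]{2, 0, 1, 0, 2}, 3);
        System.out.println(Arrays.toString(order)); // prints [1, 3, 2, 0, 4]
    }
}
```

No key comparisons happen at all, which is why the cost is linear in the number of records.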
Sort Avoidance: Early Reduce (Remove shuffle barrier)
• Currently, the reduce function can’t start until all map outputs have been fetched.
• When sorting is unnecessary, the reduce function can start as soon as any map output is available.
• This greatly improves overall performance.
Sort Avoidance: Bytes Merge
• No overhead from key/value serialization, deserialization, or comparison
• Works on raw bytes without interpreting records
• Just concatenates byte streams together: read in bytes, write out bytes.
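A byte-level merge of this kind needs nothing beyond plain stream copying, as the sketch below illustrates. `BytesMergeSketch` is a hypothetical name, and real map output segments carry framing (headers, checksums, possibly compression) that this example ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of merging by concatenation; segment framing is ignored.
class BytesMergeSketch {
    // Copies each input to the output in order, without decoding any records.
    static void concat(List<InputStream> inputs, OutputStream out) {
        byte[] buffer = new byte[8192];
        try {
            for (InputStream in : inputs) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        concat(Arrays.<InputStream>asList(
                new ByteArrayInputStream(new byte[]{1, 2}),
                new ByteArrayInputStream(new byte[]{3})), merged);
        System.out.println(Arrays.toString(merged.toByteArray())); // prints [1, 2, 3]
    }
}
```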
Sort Avoidance: Sequential Reduce Input
• Input files are read sequentially to feed the reduce function, so there are no disk seeks and performance is better.
Current Limitations
• Hard partition of resources into map and reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms
  – Iterative applications implemented using MapReduce are 10x slower.
  – Hacks for the likes of MPI/Graph Processing
• Lack of wire-compatible protocols
  – Client and cluster must be of the same version
  – Applications and workflows cannot migrate to different clusters
Current Limitations(Cont.)
• Scalability
  – Maximum cluster size: 4,000 nodes
  – Maximum concurrent tasks: 40,000
  – Coarse synchronization in JobTracker
• Single point of failure
  – Failure kills all queued and running jobs
  – Jobs need to be re-submitted by the user
• Restart is very tricky due to complex state