Large-Scale Data Processing
◦ Want to process lots of data (> 1 TB); the web alone is > 400 TB
◦ Want to parallelize across 100s/1000s of CPUs
◦ But don't want the hassle of managing things
MapReduce provides
◦ Automatic parallelization & distribution
◦ Fault tolerance
◦ Monitoring & status update tools
◦ A clear abstraction for programmers
Borrows from functional programming. Users implement an interface of two functions:
◦ Map
◦ Reduce
map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs
◦ Ex: (filename, line)
map() produces one or more intermediate values along with an output key from the input
let map(k, v) = emit(k.toUpper(), v.toUpper())
◦ ("foo", "bar") -> ("FOO", "BAR")
◦ ("key2", "data") -> ("KEY2", "DATA")
let map(k, v) = foreach char c in v: emit(k, c)
◦ ("A", "cats") -> ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
◦ ("B", "hi") -> ("B", "h"), ("B", "i")
let map(k, v) = if isPrime(v) then emit(k, v)
◦ ("foo", 7) -> ("foo", 7)
◦ ("test", 10) -> (nothing)
let map(k, v) = emit(v.length, v)
◦ ("hi", "test") -> (4, "test")
◦ ("x", "quux") -> (4, "quux")
After the map phase is over, all the intermediate values for a given output key are combined together into a list. Reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
let reduce(k, vals) =
    sum = 0
    foreach int v in vals: sum += v
    emit(k, sum)
◦ ("A", [42, 100, 312]) -> ("A", 454)
◦ ("B", [12, 6, -2]) -> ("B", 16)
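To see the two halves together, here is a minimal runnable sketch in Python of the classic word-count job: a map that emits (word, 1) per word, paired with a sum-reduce like the one above. The function names and the use of yield in place of emit are illustrative choices, not part of the MapReduce library.

    # Map: emit (word, 1) for every word in the input line.
    def word_count_map(key, value):
        for word in value.split():
            yield (word, 1)

    # Reduce: add up the counts for one word (same loop as the pseudocode above).
    def word_count_reduce(key, values):
        total = 0
        for v in values:
            total += v
        yield (key, total)

    print(list(word_count_reduce("A", [42, 100, 312])))   # [('A', 454)]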
Distributed Grep
◦ Input consists of (url+offset, single line)
◦ map(key=url+offset, val=line): if the line matches the regexp, emit (line, "1")
◦ reduce(key=line, values=uniq_counts): don't do anything; just emit line
Other examples: Count of URL Access Frequency, Reverse Web-Link Graph, Term-Vector per Host, Inverted Index, Distributed Sort
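A minimal sketch of the distributed-grep map and reduce just described, assuming Python; the hard-coded pattern and the use of yield for emit are assumptions made for illustration.

    import re

    PATTERN = re.compile(r"ERROR")   # assumed search pattern, for illustration only

    # Map: key is url+offset, value is one line of input.
    def grep_map(key, line):
        if PATTERN.search(line):
            yield (line, "1")

    # Reduce: the matching line is the key; just pass it through.
    def grep_reduce(line, counts):
        yield line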
Several different implementations of the MapReduce interface are possible depending on the environment. The implementation described here targets the computing environment in use at Google:
◦ Large clusters of commodity PCs
◦ Dual x86 processors
◦ Commodity networking hardware – 100 Mb/s or 1 Gb/s
◦ A cluster scheduling system
When the user program calls the MapReduce function, the following sequence of actions occurs:
1) The MapReduce library in the user program first splits the input files into M pieces of 16 MB to 64 MB each. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is the master. The rest are workers that are assigned work by the master.
3) A worker who is assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory (RAM).
4) The buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers.
5) When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6) The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the MapReduce execution is available in the R output files.
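The seven steps above can be mimicked on a single machine. The sketch below is a toy, sequential stand-in for the distributed flow (no cluster, no local-disk spill, hypothetical names) that makes the split/partition/sort/reduce order concrete.

    import hashlib
    from collections import defaultdict

    def default_partition(key, R):
        # Step 4's partitioning function: hash(key) mod R.
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % R

    def run_mapreduce(splits, map_fn, reduce_fn, R):
        # Steps 3-4: each "map task" parses its split, calls map_fn, and buckets
        # the intermediate pairs into R regions.
        regions = [defaultdict(list) for _ in range(R)]
        for records in splits:                       # one map task per input split
            for k, v in records:
                for ik, iv in map_fn(k, v):
                    regions[default_partition(ik, R)][ik].append(iv)
        # Steps 5-7: each "reduce task" sorts its intermediate keys, calls
        # reduce_fn once per unique key, and appends to its own output "file".
        output_files = []
        for r in range(R):
            out = []
            for ik in sorted(regions[r]):
                out.extend(reduce_fn(ik, regions[r][ik]))
            output_files.append(out)
        return output_files

    wc_map = lambda k, v: ((w, 1) for w in v.split())
    wc_reduce = lambda k, vs: [(k, sum(vs))]
    print(run_mapreduce([[("f1", "a b a")], [("f2", "b c")]], wc_map, wc_reduce, R=2))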
For every map and reduce task, the master stores
◦ State (idle, in-progress, or completed)
◦ Identity of the worker machine (for non-idle tasks)
The locations of intermediate files are propagated from map tasks to reduce tasks through the master. For every completed map task, the master stores
◦ the locations of the R intermediate files produced by the map task
◦ the sizes of those R intermediate files
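A sketch of what the master's per-task bookkeeping might look like, assuming Python dataclasses; the field names mirror the bullets above and are illustrative, not the actual implementation.

    from dataclasses import dataclass, field
    from enum import Enum

    class State(Enum):
        IDLE = 1
        IN_PROGRESS = 2
        COMPLETED = 3

    @dataclass
    class TaskInfo:
        state: State = State.IDLE
        worker: str | None = None                 # identity of the worker machine
        # For completed map tasks: (location, size) of each of the R
        # intermediate files, forwarded to the reduce workers by the master.
        intermediate_files: list[tuple[str, int]] = field(default_factory=list)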
Failure detection mechanism: the master pings every worker periodically.
When a worker fails, the master re-executes its completed & in-progress map() tasks
◦ because their output was stored on the failed worker's local disk
and re-executes only its in-progress reduce() tasks
◦ because completed reduce output is stored in the global file system
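A compact sketch of that re-execution rule, in hypothetical Python (the Task class and field names are assumptions; the real logic lives inside the master's scheduler).

    from dataclasses import dataclass

    @dataclass
    class Task:
        kind: str                 # "map" or "reduce"
        state: str                # "idle", "in_progress", or "completed"
        worker: str | None = None

    def on_worker_failure(tasks, failed_worker):
        for task in tasks:
            if task.worker != failed_worker:
                continue
            if task.kind == "map":
                # Map output lives only on the failed worker's local disk,
                # so even completed map tasks must be re-executed.
                task.state, task.worker = "idle", None
            elif task.state == "in_progress":
                # Completed reduce output is already in the global file system,
                # so only in-progress reduce tasks are re-executed.
                task.state, task.worker = "idle", None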
Master failure is unlikely (there is only a single master), but it can be handled:
◦ Periodically checkpoint the state of the master data structures
◦ Write the checkpoints to the GFS filesystem
◦ A new master recovers from the last checkpoint and continues
If the Map and Reduce operators are deterministic functions of their input values, MapReduce relies on atomic commits of map and reduce task outputs:
◦ When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message.
◦ If the master receives a completion message for an already completed map task, it ignores the message. Otherwise, it records the names of the R files in a master data structure (for use by the reduce tasks).
◦ The output of a reduce task is stored in GFS (high availability via replication). The filename of the output produced by a reduce task is deterministic. When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task executes on multiple machines, multiple rename calls will be executed for the same output file, but the atomic rename guarantees the final file contains the data from just one execution.
If the Map and Reduce operators are NOT deterministic functions of their input values, MapReduce provides weaker but still reasonable semantics.
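An illustrative sketch of the commit step for a reduce task, assuming a POSIX-style filesystem standing in for GFS; the deterministic final filename plus an atomic rename make duplicate executions harmless.

    import os

    def commit_reduce_output(tmp_path, reduce_task_id, out_dir):
        # The final filename is deterministic -- it depends only on the task id --
        # so every re-execution of this reduce task targets the same name.
        final_path = os.path.join(out_dir, f"part-{reduce_task_id:05d}")
        # rename() is atomic on a POSIX filesystem: even if several workers run
        # the same reduce task, the final file holds exactly one execution's data.
        os.rename(tmp_path, final_path)
        return final_path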
Locality: the master asks GFS for the locations of the replicas of the input file blocks. Map tasks typically operate on 64 MB splits (the GFS block size) and are scheduled so that a GFS replica of the input block is on the same machine or the same rack. Effect: thousands of machines read input at local-disk speed.
M and R should be much larger than the number of worker machines:
◦ better dynamic load balancing
◦ speeds up recovery when a worker fails
Straggler – a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
Causes of stragglers:
◦ a bad disk
◦ competition from the cluster scheduling system
◦ bugs in machine initialization code
Solution: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as complete when either the primary or the backup execution completes.
A few extensions that are useful on top of the basic Map and Reduce functions:
Partitioning Function: users of MapReduce specify the number of reduce tasks/output files that they desire (R).
◦ Default: "hash(key) mod R"
◦ Can be customized, e.g. "hash(Hostname(urlkey)) mod R" so that all URLs from the same host end up in the same output file
Ordering Guarantees: within a given partition, the intermediate key/value pairs are guaranteed to be processed in increasing key order.
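A small Python sketch contrasting the default and customized partitioning functions above; the helper names are illustrative, and md5 is used only to get a hash that is stable across processes.

    import hashlib
    from urllib.parse import urlparse

    def _stable_hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def default_partition(key, R):
        # Default: hash(key) mod R
        return _stable_hash(key) % R

    def host_partition(url_key, R):
        # Custom: hash(Hostname(urlkey)) mod R -- all URLs from the same host
        # land in the same reduce partition, hence the same output file.
        return _stable_hash(urlparse(url_key).netloc) % R

    print(host_partition("http://example.com/a", 4) ==
          host_partition("http://example.com/b", 4))   # True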
Combiner Function: runs on the mapper nodes after the map phase, as a "mini-reduce" over the local map output, to save network bandwidth. Typically the same code is used to implement both the combiner and the reduce function; the difference lies in how MapReduce handles the output of the function:
◦ the output of a reduce function is written to the final output file
◦ the output of a combiner function is written to an intermediate file (i.e. sent to a reduce task)
Use: significantly speeds up certain classes of MapReduce operations, e.g. word count, where each map task would otherwise emit a huge number of (word, 1) pairs.
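A sketch of a word-count combiner in Python, run over one map task's local output; the function name and the list representation of the buffered pairs are assumptions for illustration.

    from collections import defaultdict

    def word_count_combine(pairs):
        # "Mini-reduce" on the mapper node: collapse many (word, 1) pairs into
        # one (word, partial_count) pair per word before anything crosses the network.
        partial = defaultdict(int)
        for word, count in pairs:
            partial[word] += count
        return list(partial.items())

    local_map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
    print(word_count_combine(local_map_output))   # [('the', 3), ('cat', 1)]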
Input and Output Types: the MapReduce library provides support for reading input data in several formats. Users can add support for a new input type by providing an implementation of a simple reader interface. A reader can read records from a database or from data structures mapped in memory.
Side-effects: auxiliary files may be produced as additional outputs from the map and/or reduce operators. The application writer must make such side effects atomic and idempotent: the application writes to a temporary file and atomically renames this file once it has been fully generated.
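A sketch of what such a reader interface might look like, assuming Python; the class names are illustrative, not the library's API.

    from abc import ABC, abstractmethod

    class RecordReader(ABC):
        # Yield (key, value) records for one input split.
        @abstractmethod
        def read(self):
            ...

    class LineReader(RecordReader):
        # Text-file reader: key is the byte offset, value is the line contents.
        def __init__(self, path):
            self.path = path

        def read(self):
            offset = 0
            with open(self.path, "rb") as f:
                for line in f:
                    yield (offset, line.decode().rstrip("\n"))
                    offset += len(line)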
Skipping Bad Records: Map/Reduce functions sometimes crash deterministically on particular records, due to bugs in user code. The best solution is to debug & fix the bug
◦ but that is not always possible, e.g. for third-party source libraries
On a segmentation fault:
◦ the worker sends a UDP packet to the master from a signal handler
◦ the packet includes the sequence number of the record being processed
If the master sees two failures for the same record:
◦ the next worker is told to skip that record
Local Execution: debugging a Map/Reduce function is tricky because the computation occurs in a distributed environment and worker processes are dynamically allocated by the master. To enable debugging, profiling and small-scale testing, an alternative implementation sequentially executes all of the work for a MapReduce operation on a local machine. Controls are provided so that the computation can be limited to a particular map task.
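A toy sketch of the skip-bad-records idea: the master counts deterministic failures per record sequence number and, after seeing two failures, tells the next execution to skip that record. The class, names, and threshold handling are illustrative Python, not the real mechanism (which reports via UDP from a signal handler).

    from collections import defaultdict

    class Master:
        def __init__(self):
            self.failures = defaultdict(int)   # record sequence number -> failure count

        def report_failure(self, record_seq):
            # Called when a worker reports a crash while processing this record.
            self.failures[record_seq] += 1

        def records_to_skip(self):
            # Records that have crashed at least twice are skipped on re-execution.
            return {seq for seq, n in self.failures.items() if n >= 2}

    def run_map_task(records, map_fn, master):
        skip = master.records_to_skip()
        for seq, (k, v) in enumerate(records):
            if seq in skip:
                continue                        # the master told us to skip this record
            try:
                yield from map_fn(k, v)
            except Exception:
                master.report_failure(seq)
                raise                           # the task fails and will be re-executed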
Status Information: the master runs an internal HTTP server and exports a set of status pages to the user. The status pages show the progress of the computation, such as:
◦ Number of tasks completed
◦ Number of tasks in progress
◦ Bytes of input, intermediate and output data
◦ Processing rates
◦ Links to the standard error and standard output files
This information can be used to predict how long the computation will take, whether more resources should be added to the computation, and whether the computation is much slower than expected.
Top-level status page: shows which workers have failed and which map/reduce tasks they were working on when they failed.
◦ Use: makes it easy to detect bugs; the master can order re-execution of the failed tasks.
Counters: the MapReduce library provides a counter facility to count occurrences of various events, e.g. counting the total number of words processed.
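A small Python sketch of how such a counter might be used from inside a map function; the Counter class and its aggregation by the master are assumptions made for illustration.

    class Counter:
        # Per-worker counter; in a real run the master would aggregate the
        # counts reported by all workers along with their status updates.
        def __init__(self, name):
            self.name = name
            self.value = 0

        def increment(self, delta=1):
            self.value += delta

    words_processed = Counter("words_processed")

    def map_with_counter(key, value):
        for word in value.split():
            words_processed.increment()
            yield (word, 1)

    list(map_with_counter("doc", "one two three"))
    print(words_processed.value)   # 3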
MR_Grep – scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): ~150 seconds.
MR_Sort – sort 10^10 100-byte records (modeled after the TeraSort benchmark): ~839 seconds for the normal execution.
A cluster consisting of ~1800 PCs:
◦ 2 GHz Intel Xeon processors
◦ 4 GB of memory (1-1.5 GB reserved for other tasks sharing the nodes)
◦ 320 GB storage: two 160 GB IDE disks
Grep: scans through 10^10 100-byte records (~1 TB) for a rare 3-character pattern (the pattern occurs in ~100,000 records).
◦ M = 15000, R = 1
◦ Input data chunk size = 64 MB
◦ Execution time is ~150 seconds
◦ 1764 workers are assigned
◦ Startup overhead: time to schedule the tasks; the plot also marks when the map tasks finished
Sort: the map function extracts a 10-byte sorting key from a text line, emitting the key and the original text line as the intermediate key/value pair.
◦ Each intermediate key/value pair will be sorted by the library
◦ 1800 machines used
◦ 10^10 100-byte records (~1 TB)
◦ M = 15000, R = 4000
◦ Input data chunk size = 64 MB
◦ 2 TB of final output (GFS maintains 2 copies)
Three executions were compared: normal, with no backup tasks, and with 200 worker processes killed. Backup tasks reduce job completion time considerably, and the system deals well with machine failures.
MapReduce has proven to be a useful abstraction that greatly simplifies large-scale computations at Google. The production indexing code was rewritten using MapReduce: the code is simpler, smaller, and more readable, and MapReduce handles failures and slow machines automatically.
Programming model inspired by functional language primitives.
Partitioning/shuffling similar to many large-scale sorting systems
◦ NOW-Sort
Re-execution for fault tolerance
◦ BAD-FS and TACC
Locality optimization has parallels with Active Disks/Diamond work
◦ Active Disks [12, 15]
Backup tasks similar to eager scheduling in the Charlotte system
◦ Charlotte
Dynamic load balancing solves a similar problem as River's distributed queues
◦ River
MapReduce has proven to be a useful abstraction. It greatly simplifies large-scale computations at Google and is fun to use:
◦ focus on the problem,
◦ let the library deal with the messy details