2. Background
Large sets of data need to be processed quickly and efficiently
To process a large data set in a reasonable amount of time, the work needs to be distributed across thousands of machines
Programmers need to focus on solving problems without worrying about the distributed implementation
MapReduce is the answer.
3. What is MapReduce?
Programming model for processing large data sets
Hides the implementation of parallelization, fault tolerance, data distribution and load balancing in a library
Inspired by characteristics of functional programming:
Functional operations do not modify data structures; they always create new ones
The original data is not modified
Data flow is implicit within the application
The order of the operations does not matter
4. What is MapReduce?
There are two functions: Map and Reduce
Map
Input: Key/Value pairs
Output: Intermediate key/value pairs
Reduce
Input: Key, Iterator over values
Output: List of results
map(k1, v1) --> list(k2, v2)
reduce(k2, list(v2)) --> list(v2)
Complicated?
5. MapReduce by example
Counting each word in a large set of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
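The pseudocode above can be sketched as runnable Python; here the Emit/EmitIntermediate helpers are modeled as returned values, and the function names are illustrative, not part of any real MapReduce API:

```python
def word_count_map(key, value):
    # key: document name (unused here)
    # value: document contents
    # emit an intermediate (word, "1") pair for each word
    return [(w, "1") for w in value.split()]

def word_count_reduce(key, values):
    # key: a word
    # values: an iterable of string counts
    result = 0
    for v in values:
        result += int(v)   # ParseInt(v)
    return str(result)     # Emit(AsString(result))
```

For example, `word_count_map("doc", "foo bar foo")` returns `[("foo", "1"), ("bar", "1"), ("foo", "1")]`, and `word_count_reduce("foo", ["1", "1"])` returns `"2"`.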
6. MapReduce by example
Counting each word in a large set of documents
Document_1
foo
bar
baz
foo
bar
test
Document_2
test
foo
baz
bar
foo
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
7. MapReduce by example
Counting each word in a large set of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Map(document_1, contents(document_1)):
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">

Map(document_2, contents(document_2)):
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
8. MapReduce by example
Counting each word in a large set of documents
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Reduce(word, values)
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">

Reduce(word, values)
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
9. MapReduce by example
Counting each word in a large set of documents
Reduce(word, values)
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">

<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
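The whole trace above can be reproduced with a small single-process simulation. This is only a sketch: the shuffle step that groups intermediate pairs by key is modeled with a plain dict, and all names are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # emit an intermediate (word, "1") pair per word
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    # sum the string counts for one word
    return str(sum(int(v) for v in values))

documents = {
    "document_1": "foo bar baz foo bar test",
    "document_2": "test foo baz bar foo",
}

# Map phase: run map_fn over every input pair
intermediate = []
for name, contents in documents.items():
    intermediate.extend(map_fn(name, contents))

# Shuffle: group intermediate values by key
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

# Reduce phase: one call per distinct key
results = {k: reduce_fn(k, vs) for k, vs in groups.items()}
print(results)  # {'foo': '4', 'bar': '3', 'baz': '2', 'test': '2'}
```

This matches the expected results from the example.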
11. Master node
The master keeps data structures for the Map and Reduce tasks in which the status of each task is maintained
Status: idle, in-progress or completed
The master node keeps track of the locations of the intermediate files that feed the reduce tasks
The master node controls the interaction between the M map tasks and the R reduce tasks
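One way to picture the master's bookkeeping is the hypothetical sketch below; the paper only specifies the three task states and that intermediate file locations are tracked, so the field and variable names here are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    state: str = IDLE
    worker: Optional[str] = None                      # worker currently assigned
    output_files: list = field(default_factory=list)  # intermediate files (map tasks only)

# One entry per map task and per reduce task
map_tasks = {i: Task() for i in range(4)}     # M = 4, illustrative
reduce_tasks = {i: Task() for i in range(2)}  # R = 2, illustrative

# Assign a map task to a worker
map_tasks[0].state, map_tasks[0].worker = IN_PROGRESS, "worker-7"

# On completion, record the intermediate files that will feed the reduce tasks
map_tasks[0].state = COMPLETED
map_tasks[0].output_files = ["mr-0-0", "mr-0-1"]  # one file per reduce partition
```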
12. Fault Tolerance
The master pings every worker periodically
If a worker fails, the master marks it as failed and assigns its task to another worker
Every worker must notify the master when it has finished its task; the master then assigns it another task
Each task is independent and can be restarted at any moment, so MapReduce is resilient to worker failures
What if the master fails? The master periodically checkpoints its status and data structures, so another master can restart from the last checkpoint
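A hypothetical sketch of the re-assignment logic: when a ping times out, the worker is marked failed and its tasks go back to idle so another worker can pick them up. (The paper notes that even completed map tasks on a failed machine must be redone, because their output lives on that machine's local disk.) All names and the timeout value here are illustrative:

```python
import time

PING_TIMEOUT = 10.0  # seconds, illustrative

workers = {"worker-1": {"last_ping": time.time(), "failed": False}}
tasks = {0: {"state": "in-progress", "worker": "worker-1"}}

def check_workers(now):
    # The master pings workers periodically; here we just compare timestamps
    for name, w in workers.items():
        if not w["failed"] and now - w["last_ping"] > PING_TIMEOUT:
            w["failed"] = True
            # Reset this worker's tasks so they can be re-assigned
            for t in tasks.values():
                if t["worker"] == name:
                    t["state"], t["worker"] = "idle", None

check_workers(time.time() + 60)  # simulate a missed ping
print(tasks[0]["state"])  # idle
```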
13. Task Granularity
There are M map tasks and R reduce tasks
M and R should be much larger than the number of workers
Fine-grained tasks enable dynamic load balancing across workers to optimize resource usage
The master must make O(M+R) scheduling decisions and keep O(M*R) state: roughly one byte per map task/reduce task pair
According to the paper, Google often runs with M=200,000 and R=5,000 using 2,000 workers
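At the paper's reported scale, the master's O(M*R) state is still modest, assuming roughly one byte per map/reduce task pair as the paper estimates:

```python
M, R = 200_000, 5_000
scheduling_decisions = M + R   # O(M + R)
state_bytes = M * R            # O(M * R), ~1 byte per task pair
print(scheduling_decisions)    # 205000
print(state_bytes / 1e9)       # 1.0  -> about 1 GB of master state
```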
14. Refinements
Partition function: controls load balancing across reduce tasks
Ordering guarantee: within a partition, keys are processed in sorted order, making it easy to generate sorted output files
Combiner function: usually the same as the Reduce function; see the word-count example
Input and output readers: support for several input and output formats
Skipping bad records: tolerate bad input that crashes the user code
Local execution for debugging
Status information exported by the master for monitoring
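The default partition function described in the paper is hash(key) mod R. A deterministic Python sketch (crc32 is used here only to get a hash that is stable across runs; it is not what Google's library uses):

```python
from zlib import crc32

R = 5000  # number of reduce tasks, the paper's figure

def partition(key, num_reduce_tasks=R):
    # Decide which reduce task receives this intermediate key.
    return crc32(key.encode("utf-8")) % num_reduce_tasks

# Every occurrence of the same key lands in the same partition,
# so a single reduce task sees all of that key's values.
assert partition("foo") == partition("foo")
```

The paper notes that users can supply their own partitioner, e.g. hash(Hostname(urlkey)) mod R, so that all URLs from one host end up in the same output file.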
15. What are the benefits of map reduce?
Easy to use for programmers, who don't need to worry about the details of distributed computing
A large class of problems can be expressed in the MapReduce programming model
Flexible and scales to large clusters of machines; its fault tolerance is simple and effective
16. Programs that can be expressed with MapReduce
Distributed Grep <word, match>
Count URL Access Frequency <URL, total_count>
Reverse Web-link graph <target, list(source)>
Term-Vector per Host <word, frequency>
Inverted index <word, document ID>
Distributed Sort <key, record>
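As one instance from this list, the inverted index fits the model directly: map emits <word, document ID> pairs and reduce collects all documents per word. A minimal single-process sketch with illustrative names:

```python
from collections import defaultdict

def index_map(doc_id, contents):
    # emit (word, document ID) once per distinct word in the document
    return [(w, doc_id) for w in sorted(set(contents.split()))]

def index_reduce(word, doc_ids):
    # output (word, sorted list of documents containing the word)
    return (word, sorted(set(doc_ids)))

# Tiny driver: map over two documents, shuffle by word, reduce
docs = {"d1": "foo bar", "d2": "foo baz"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in index_map(doc_id, text):
        groups[word].append(d)

index = dict(index_reduce(w, ds) for w, ds in groups.items())
print(index)  # {'bar': ['d1'], 'foo': ['d1', 'd2'], 'baz': ['d2']}
```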
17. References
MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://www.mapreduce.org
http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=
http://hadoop.apache.org/