2. Background
Large sets of data need to be processed quickly and efficiently
To process a large data set in a reasonable amount of time, the work needs to be distributed across thousands of machines
Programmers need to focus on solving problems without worrying about the distributed implementation
MapReduce is the answer.
3. What is MapReduce?
Programming model for processing large data sets
Hides the implementation of parallelization, fault tolerance, data distribution and load balancing in a library
Inspired by characteristics of functional programming:
Functional operations do not modify data structures; they always create new ones
The original data is not modified
Data flow is implicit within the application
The order of the operations does not matter
4. What is MapReduce?
There are two functions: Map and Reduce
Map
Input: Key/Value pairs
Output: Intermediate key/value pairs
Reduce
Input: Key, Iterator over values
Output: List of results
map(k1, v1) --> list(k2, v2)
reduce(k2, list(v2)) --> list(v2)
Complicated?
5. MapReduce by example
Counting each word in a large set of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
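The pseudocode above can be sketched as runnable Python; here the Emit/EmitIntermediate helpers are modeled as returned values, and the function names are illustrative, not part of any real MapReduce API:

```python
def word_count_map(key, value):
    # key: document name (unused here)
    # value: document contents
    # emit an intermediate (word, "1") pair for each word
    return [(w, "1") for w in value.split()]

def word_count_reduce(key, values):
    # key: a word
    # values: an iterable of string counts
    result = 0
    for v in values:
        result += int(v)   # ParseInt(v)
    return str(result)     # Emit(AsString(result))
```

For example, `word_count_map("doc", "foo bar foo")` returns `[("foo", "1"), ("bar", "1"), ("foo", "1")]`, and `word_count_reduce("foo", ["1", "1"])` returns `"2"`.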
6. MapReduce by example
Counting each word in a large set of documents
Document_1
foo
bar
baz
foo
bar
test
Document_2
test
foo
baz
bar
foo
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
7. MapReduce by example
Counting each word in a large set of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Map(document_1, contents(document_1)):
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">

Map(document_2, contents(document_2)):
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
8. MapReduce by example
Counting each word in a large set of documents
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Reduce(word, values)
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">

Reduce(word, values)
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
9. MapReduce by example
Counting each word in a large set of documents
Reduce(word, values)
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">

<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
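The whole trace above can be reproduced with a small single-process simulation. This is only a sketch: the shuffle step that groups intermediate pairs by key is modeled with a plain dict, and all names are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # emit an intermediate (word, "1") pair per word
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    # sum the string counts for one word
    return str(sum(int(v) for v in values))

documents = {
    "document_1": "foo bar baz foo bar test",
    "document_2": "test foo baz bar foo",
}

# Map phase: run map_fn over every input pair
intermediate = []
for name, contents in documents.items():
    intermediate.extend(map_fn(name, contents))

# Shuffle: group intermediate values by key
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

# Reduce phase: one call per distinct key
results = {k: reduce_fn(k, vs) for k, vs in groups.items()}
print(results)  # {'foo': '4', 'bar': '3', 'baz': '2', 'test': '2'}
```

This matches the expected results from the example.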
11. Master node
The master keeps data structures for the Map and Reduce tasks in which the status of each task is maintained
Status: idle, in-progress or completed
The master node keeps track of the locations of the intermediate files that feed the reduce tasks
The master node controls the interaction between the M map tasks and the R reduce tasks
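One way to picture the master's bookkeeping is the hypothetical sketch below; the paper only specifies the three task states and that intermediate file locations are tracked, so the field and variable names here are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    state: str = IDLE
    worker: Optional[str] = None                      # worker currently assigned
    output_files: list = field(default_factory=list)  # intermediate files (map tasks only)

# One entry per map task and per reduce task
map_tasks = {i: Task() for i in range(4)}     # M = 4, illustrative
reduce_tasks = {i: Task() for i in range(2)}  # R = 2, illustrative

# Assign a map task to a worker
map_tasks[0].state, map_tasks[0].worker = IN_PROGRESS, "worker-7"

# On completion, record the intermediate files that will feed the reduce tasks
map_tasks[0].state = COMPLETED
map_tasks[0].output_files = ["mr-0-0", "mr-0-1"]  # one file per reduce partition
```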
12. Fault Tolerance
The master pings every worker periodically
If a worker fails, the master marks it as failed and assigns its task to another worker
Every worker must notify the master when it has finished its task; the master then assigns it another task
Each task is independent and can be restarted at any moment, so MapReduce is resilient to worker failures
What if the master fails? The master periodically checkpoints its status and data structures, so another master can restart from the last checkpoint
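A hypothetical sketch of the re-assignment logic: when a ping times out, the worker is marked failed and its tasks go back to idle so another worker can pick them up. (The paper notes that even completed map tasks on a failed machine must be redone, because their output lives on that machine's local disk.) All names and the timeout value here are illustrative:

```python
import time

PING_TIMEOUT = 10.0  # seconds, illustrative

workers = {"worker-1": {"last_ping": time.time(), "failed": False}}
tasks = {0: {"state": "in-progress", "worker": "worker-1"}}

def check_workers(now):
    # The master pings workers periodically; here we just compare timestamps
    for name, w in workers.items():
        if not w["failed"] and now - w["last_ping"] > PING_TIMEOUT:
            w["failed"] = True
            # Reset this worker's tasks so they can be re-assigned
            for t in tasks.values():
                if t["worker"] == name:
                    t["state"], t["worker"] = "idle", None

check_workers(time.time() + 60)  # simulate a missed ping
print(tasks[0]["state"])  # idle
```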
13. Task Granularity
There are M map tasks and R reduce tasks
M and R should be much larger than the number of workers
Fine-grained tasks enable dynamic load balancing across workers to optimize resource usage
The master must make O(M+R) scheduling decisions and keep O(M*R) state: roughly one byte per map task/reduce task pair
According to the paper, Google often runs with M=200,000 and R=5,000 using 2,000 workers
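At the paper's reported scale, the master's O(M*R) state is still modest, assuming roughly one byte per map/reduce task pair as the paper estimates:

```python
M, R = 200_000, 5_000
scheduling_decisions = M + R   # O(M + R)
state_bytes = M * R            # O(M * R), ~1 byte per task pair
print(scheduling_decisions)    # 205000
print(state_bytes / 1e9)       # 1.0  -> about 1 GB of master state
```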
14. Refinements
Partition function: controls load balancing across reduce tasks
Ordering guarantee: within a partition, keys are processed in sorted order, making it easy to generate sorted output files
Combiner function: usually the same as the Reduce function; see the word-count example
Input and output readers: support for several input and output formats
Skipping bad records: tolerate bad input that crashes the user code
Local execution for debugging
Status information exported by the master for monitoring
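The default partition function described in the paper is hash(key) mod R. A deterministic Python sketch (crc32 is used here only to get a hash that is stable across runs; it is not what Google's library uses):

```python
from zlib import crc32

R = 5000  # number of reduce tasks, the paper's figure

def partition(key, num_reduce_tasks=R):
    # Decide which reduce task receives this intermediate key.
    return crc32(key.encode("utf-8")) % num_reduce_tasks

# Every occurrence of the same key lands in the same partition,
# so a single reduce task sees all of that key's values.
assert partition("foo") == partition("foo")
```

The paper notes that users can supply their own partitioner, e.g. hash(Hostname(urlkey)) mod R, so that all URLs from one host end up in the same output file.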
15. What are the benefits of map reduce?
Easy to use for programmers, who don't need to worry about the details of distributed computing
A large class of problems can be expressed in the MapReduce programming model
Flexible and scales to large clusters of machines; its fault tolerance is simple and effective
16. Programs that can be expressed with MapReduce
Distributed Grep <word, match>
Count URL Access Frequency <URL, total_count>
Reverse Web-link graph <target, list(source)>
Term-Vector per Host <word, frequency>
Inverted index <word, document ID>
Distributed Sort <key, record>
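As one instance from this list, the inverted index fits the model directly: map emits <word, document ID> pairs and reduce collects all documents per word. A minimal single-process sketch with illustrative names:

```python
from collections import defaultdict

def index_map(doc_id, contents):
    # emit (word, document ID) once per distinct word in the document
    return [(w, doc_id) for w in sorted(set(contents.split()))]

def index_reduce(word, doc_ids):
    # output (word, sorted list of documents containing the word)
    return (word, sorted(set(doc_ids)))

# Tiny driver: map over two documents, shuffle by word, reduce
docs = {"d1": "foo bar", "d2": "foo baz"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in index_map(doc_id, text):
        groups[word].append(d)

index = dict(index_reduce(w, ds) for w, ds in groups.items())
print(index)  # {'bar': ['d1'], 'foo': ['d1', 'd2'], 'baz': ['d2']}
```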
17. References
MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://www.mapreduce.org
http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=
http://hadoop.apache.org/