1. Map Reduce
   by Manuel Correa
2. Background
   • Large sets of data need to be processed quickly and efficiently.
   • To finish in a reasonable amount of time, the processing must be distributed across thousands of machines.
   • Programmers need to focus on solving the problem without worrying about the details of that distribution. MapReduce is the answer.
3. What is MapReduce?
   • A programming model for processing large data sets.
   • Hides the implementation of parallelization, fault tolerance, data distribution, and load balancing inside a library.
   • Inspired by characteristics of functional programming:
     • Functional operations do not modify data structures; they always create new ones.
     • The original data is never modified.
     • Data flow is implicit within the application.
     • The order of the operations does not matter.
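   A tiny Python illustration of that functional style (not from the slides): the transformation builds a new list and leaves the original untouched, which is what makes operations safe to re-run or reorder.

       words = ["foo", "bar", "baz"]
       upper = [w.upper() for w in words]   # a new list is created

       print(words)   # ['foo', 'bar', 'baz']  -- the original data is unchanged
       print(upper)   # ['FOO', 'BAR', 'BAZ']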
4. What is MapReduce?
   • There are two functions: Map and Reduce.
   • Map
     • Input: key/value pairs
     • Output: intermediate key/value pairs
   • Reduce
     • Input: a key and an iterator over its values
     • Output: a list of results

   map(k1, v1)          --> list(k2, v2)
   reduce(k2, list(v2)) --> list(v2)

   Complicated?
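   A minimal in-memory sketch of that contract (plain Python; run_mapreduce and its signature are illustrative, not part of the paper's library): the driver applies map to each (k1, v1) pair, groups the intermediate (k2, v2) pairs by key, and hands each group to reduce.

       from collections import defaultdict

       def run_mapreduce(map_fn, reduce_fn, inputs):
           """inputs: iterable of (k1, v1) pairs.
           map_fn(k1, v1)        -> iterable of (k2, v2) pairs
           reduce_fn(k2, values) -> iterable of result values"""
           groups = defaultdict(list)
           # Map phase: produce intermediate key/value pairs, grouped by key.
           for k1, v1 in inputs:
               for k2, v2 in map_fn(k1, v1):
                   groups[k2].append(v2)
           # Reduce phase: one call per intermediate key.
           return {k2: list(reduce_fn(k2, values)) for k2, values in groups.items()}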
5. MapReduce by example: counting each word in a large set of documents

   map(String key, String value):
     // key: document name
     // value: document contents
     for each word w in value:
       EmitIntermediate(w, "1");

   reduce(String key, Iterator values):
     // key: a word
     // values: a list of counts
     int result = 0;
     for each v in values:
       result += ParseInt(v);
     Emit(AsString(result));
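   A rough Python equivalent of that pseudocode (a sketch, not the paper's C++ implementation; word splitting here is naive whitespace splitting):

       def word_count_map(key, value):
           # key: document name, value: document contents
           for word in value.split():
               yield word, "1"

       def word_count_reduce(key, values):
           # key: a word, values: an iterable of string counts
           yield str(sum(int(v) for v in values))

   Plugging these two functions into a grouping driver such as the run_mapreduce sketch above reproduces the counts shown on the following slides.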
6. MapReduce by example: counting each word in a large set of documents

   Document_1: foo bar baz foo bar test
   Document_2: test foo baz bar foo

   Expected results: <foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
7. MapReduce by example: counting each word in a large set of documents

   map(String key, String value):
     // key: document name
     // value: document contents
     for each word w in value:
       EmitIntermediate(w, "1");

   Map(Document_1, contents(Document_1)):
     <foo, "1"> <bar, "1"> <baz, "1"> <foo, "1"> <bar, "1"> <test, "1">

   Map(Document_2, contents(Document_2)):
     <test, "1"> <foo, "1"> <baz, "1"> <bar, "1"> <foo, "1">
8. MapReduce by example: counting each word in a large set of documents

   reduce(String key, Iterator values):
     // key: a word
     // values: a list of counts
     int result = 0;
     for each v in values:
       result += ParseInt(v);
     Emit(AsString(result));

   Reduce(word, values) over Document_1's intermediate pairs:
     <foo, "2"> <bar, "2"> <baz, "1"> <test, "1">

   Reduce(word, values) over Document_2's intermediate pairs:
     <test, "1"> <foo, "2"> <baz, "1"> <bar, "1">
9. MapReduce by example: counting each word in a large set of documents

   Inputs to the final Reduce (the partial counts from each document):
     <foo, "2"> <bar, "2"> <baz, "1"> <test, "1">
     <test, "1"> <foo, "2"> <baz, "1"> <bar, "1">

   Reduce(word, values) output:
     <foo, "4"> <bar, "3"> <baz, "2"> <test, "2">

   Expected results: <foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
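   The per-document Reduce calls shown above act as the combiner mentioned later in the Refinements slide: each worker pre-aggregates its own map output before the final reduce merges the partial counts. A short Python sketch of that flow (collections.Counter is used here only to keep the example compact):

       from collections import Counter

       documents = {
           "Document_1": "foo bar baz foo bar test",
           "Document_2": "test foo baz bar foo",
       }

       # Combine step: each worker pre-aggregates the counts for its own document.
       partial = {name: Counter(text.split()) for name, text in documents.items()}
       # partial["Document_1"] -> {'foo': 2, 'bar': 2, 'baz': 1, 'test': 1}
       # partial["Document_2"] -> {'test': 1, 'foo': 2, 'baz': 1, 'bar': 1}

       # Final reduce: merge the partial counts per word.
       totals = sum(partial.values(), Counter())
       print(dict(totals))   # {'foo': 4, 'bar': 3, 'baz': 2, 'test': 2}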
10. Implementation
11. Master node
    • The master keeps separate data structures for the map and reduce tasks, in which the status of each task is maintained.
    • Status: idle, in-progress, or completed.
    • The master keeps track of the intermediate files that feed the reduce tasks.
    • The master controls the interaction between the M map tasks and the R reduce tasks.
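    A rough sketch of that bookkeeping (plain Python with hypothetical field names; the real data structures are internal to Google's implementation):

        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class TaskInfo:
            kind: str                     # "map" or "reduce"
            state: str = "idle"           # "idle", "in-progress", or "completed"
            worker: Optional[str] = None  # worker currently running the task
            intermediate_files: List[str] = field(default_factory=list)

        # One entry per map task and per reduce task.
        tasks = {f"map-{i}": TaskInfo("map") for i in range(4)}
        tasks.update({f"reduce-{j}": TaskInfo("reduce") for j in range(2)})

        # When a map task completes, the master records the locations of its
        # intermediate files so they can later be handed to the reduce tasks.
        tasks["map-0"].state = "completed"
        tasks["map-0"].intermediate_files = ["/local/map-0/part-0", "/local/map-0/part-1"]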
12. Fault Tolerance
    • The master pings every worker periodically.
    • If a worker fails, the master marks it as failed and assigns its task to another worker.
    • Every worker must notify the master when it has finished its task; the master then assigns it another task.
    • Each task is independent and can be restarted at any moment, so MapReduce is resilient to worker failures.
    • What if the master fails? The master periodically checkpoints its status and data structures, so another master can start from the last checkpoint.
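    A sketch of the ping/re-assignment loop (plain Python; the structure and names are illustrative, the real system uses RPCs and timeouts):

        def check_workers(workers, tasks, ping):
            """workers: {worker_id: set of task ids assigned to that worker}
            tasks:   {task_id: state string ("idle", "in-progress", "completed")}
            ping:    callable that returns True if the worker responded"""
            for worker_id, assigned in list(workers.items()):
                if ping(worker_id):
                    continue
                # No response: mark the worker failed and return its tasks to
                # the idle pool so they can be re-executed on another worker.
                for task_id in assigned:
                    tasks[task_id] = "idle"
                del workers[worker_id]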
13. Task Granularity
    • There are M map tasks and R reduce tasks.
    • M and R should be larger than the number of workers.
    • This allows dynamic load balancing across workers and better use of resources.
    • The master must make O(M + R) scheduling decisions and keep O(M * R) pieces of state: approximately one byte per map task/reduce task pair.
    • According to the paper, Google often uses M = 200,000 and R = 5,000 with 2,000 workers.
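    A back-of-the-envelope check of those numbers (simple arithmetic, not from the slides): at roughly one byte per map/reduce task pair, the O(M * R) state for M = 200,000 and R = 5,000 comes to about 1 GB on the master.

        M, R = 200_000, 5_000
        print(M * R)   # 1_000_000_000 task pairs -> roughly 1 GB at ~1 byte each
        print(M + R)   # 205_000 -> order of the scheduling decisions, O(M + R)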
14. Refinements
    • Partition function: custom partitioning for load balancing.
    • Ordering guarantees: keys within a partition are processed in sorted order, making it easy to generate sorted output files.
    • Combiner function: the same as the reduce function; see the word-count example.
    • Custom input and output readers, in addition to standard input and output.
    • Skipping bad records: control over bad input.
    • Local execution for debugging.
    • Status information through an external application.
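    As an illustration of the partitioning refinement, the paper's default partitioner is hash(key) mod R; a plain-Python sketch (md5 is used here only as a stable hash, not what Google uses):

        import hashlib

        def partition(key, R):
            """Assign an intermediate key to one of the R reduce tasks.
            A stable hash keeps every value for a given key in the same partition."""
            digest = hashlib.md5(key.encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big") % R

        # All occurrences of a word go to the same reduce task.
        assert partition("foo", 5) == partition("foo", 5)

    The paper's example of a custom partitioner is hashing only the hostname of a URL key, so that all pages from the same host end up in the same output file.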
15. What are the benefits of MapReduce?
    • Easy to use: programmers do not need to worry about the details of distributed computing.
    • A large set of problems can be expressed in the MapReduce programming model.
    • Flexible and scalable on large clusters of machines, with fault tolerance that is elegant and works.
16. Programs that can be expressed with MapReduce
    • Distributed grep: <word, match>
    • Count of URL access frequency: <URL, total_count>
    • Reverse web-link graph: <target, list(source)>
    • Term vector per host: <word, frequency>
    • Inverted index: <word, document ID>
    • Distributed sort: <key, record>
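    As one concrete case from that list, distributed grep maps a line to the output when it matches a pattern, and the reduce is an identity that copies the intermediate data through. A Python sketch (PATTERN, grep_map, and grep_reduce are illustrative names, not from the slides):

        import re

        PATTERN = re.compile(r"error")   # the pattern being grepped for (illustrative)

        def grep_map(filename, contents):
            # Emit every matching line; the line itself serves as the key.
            for line in contents.splitlines():
                if PATTERN.search(line):
                    yield line, 1

        def grep_reduce(key, values):
            # Identity reduce: copy the matched line to the output.
            yield key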
17. References
    • MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)
    • http://code.google.com/edu/parallel/mapreduce-tutorial.html
    • www.mapreduce.org
    • http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=
    • http://hadoop.apache.org/
18. Map Reduce
    Questions?