
# Map Reduce

Map Reduce presentation. Operating Systems, University of Georgia, 2010.

1. Map Reduce, by Manuel Correa
2. Background
   - Large sets of data need to be processed quickly and efficiently
   - To process a large data set in a reasonable amount of time, the work must be distributed across thousands of machines
   - Programmers need to focus on solving problems without worrying about the distributed implementation. MapReduce is the answer.
3. What is MapReduce?
   - A programming model for processing large data sets
   - Hides the implementation of parallelization, fault tolerance, data distribution, and load balancing in a library
   - Inspired by characteristics of functional programming:
     - Functional operations do not modify data structures; they always create new ones
     - The original data is not modified
     - Data flow is implicit within the application
     - The order of the operations does not matter
4. What is MapReduce?
   - There are two functions: Map and Reduce
   - Map takes input key/value pairs and outputs intermediate key/value pairs
   - Reduce takes a key and an iterator over that key's values, and outputs a list of result values

         map(k1, v1) --> list(k2, v2)
         reduce(k2, list(v2)) --> list(v2)

   Complicated? The next slides walk through an example.
5. MapReduce by example: counting each word in a large set of documents

         map(String key, String value):
           // key: document name
           // value: document contents
           for each word w in value:
             EmitIntermediate(w, "1");

         reduce(String key, Iterator values):
           // key: a word
           // values: a list of counts
           int result = 0;
           for each v in values:
             result += ParseInt(v);
           Emit(AsString(result));
6. MapReduce by example: counting each word in a large set of documents
   - Document_1: foo bar baz foo bar test
   - Document_2: test foo baz bar foo
   - Expected results: <foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
7. MapReduce by example: applying the map function to each document
   - Map(document_1, contents(document_1)) emits: <foo, "1">, <bar, "1">, <baz, "1">, <foo, "1">, <bar, "1">, <test, "1">
   - Map(document_2, contents(document_2)) emits: <test, "1">, <foo, "1">, <baz, "1">, <bar, "1">, <foo, "1">
8. MapReduce by example: applying the reduce function to each map task's output
   - Reduce(word, values) over document_1's pairs yields: <foo, "2">, <bar, "2">, <baz, "1">, <test, "1">
   - Reduce(word, values) over document_2's pairs yields: <test, "1">, <foo, "2">, <baz, "1">, <bar, "1">
9. MapReduce by example: the final reduce merges the partial counts
   - Inputs: <foo, "2">, <bar, "2">, <baz, "1">, <test, "1"> and <test, "1">, <foo, "2">, <baz, "1">, <bar, "1">
   - Reduce(word, values) yields: <foo, "4">, <bar, "3">, <baz, "2">, <test, "2">
   - Expected results: <foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
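The walk-through above can be run end to end. Below is a minimal, single-process Python sketch of the same pipeline (the function and variable names are mine, not from the slides): a map phase, a shuffle that groups intermediate values by key, and a reduce phase.

```python
from collections import defaultdict

def map_word_count(key, value):
    """Map: emit an intermediate (word, "1") pair for each word."""
    return [(w, "1") for w in value.split()]

def reduce_word_count(key, values):
    """Reduce: sum the counts emitted for a single word."""
    return [str(sum(int(v) for v in values))]

def run_map_reduce(documents, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (name, contents) pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)   # shuffle: group values by key
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return {k: reduce_fn(k, vs)[0] for k, vs in intermediate.items()}

docs = {
    "Document_1": "foo bar baz foo bar test",
    "Document_2": "test foo baz bar foo",
}
print(run_map_reduce(docs, map_word_count, reduce_word_count))
# {'foo': '4', 'bar': '3', 'baz': '2', 'test': '2'}
```

In the real system the shuffle happens across machines, but the grouping-by-key step is exactly what the `defaultdict` stands in for here.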
10. Implementation
11. Master node
   - The master keeps data structures for every map and reduce task, recording the status of each: idle, in-progress, or completed
   - The master keeps track of the intermediate files that feed the reduce tasks
   - The master controls the interaction between the M map tasks and the R reduce tasks
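A minimal sketch of that bookkeeping might look like the following; the class and field names are illustrative, not from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    task_id: int
    kind: str                          # "map" or "reduce"
    status: Status = Status.IDLE
    worker: Optional[str] = None       # worker the task is assigned to
    outputs: list = field(default_factory=list)  # intermediate file names

class Master:
    def __init__(self, n_map, n_reduce):
        self.tasks = ([Task(i, "map") for i in range(n_map)] +
                      [Task(n_map + i, "reduce") for i in range(n_reduce)])

    def assign(self, worker):
        """Hand the next idle task to an asking worker."""
        for t in self.tasks:
            if t.status is Status.IDLE:
                t.status, t.worker = Status.IN_PROGRESS, worker
                return t
        return None

    def complete(self, task_id, outputs=()):
        """Mark a task completed and record its intermediate files,
        which the master later feeds to the reduce tasks."""
        t = self.tasks[task_id]
        t.status, t.outputs = Status.COMPLETED, list(outputs)
```

The real master also tracks file locations and sizes per reduce partition; this sketch keeps only the status transitions the slide names.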
12. Fault Tolerance
   - The master pings every worker periodically
   - If a worker fails, the master marks it as failed and assigns its task to another worker
   - Every worker must notify the master when it has finished its task; the master then assigns it another task
   - Each task is independent and can be restarted at any moment, so MapReduce is resilient to worker failures
   - What if the master fails? The master periodically checkpoints its status and data structures, so another master can start from the last checkpoint
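The re-assignment rule can be sketched as follows (plain dicts, names mine). One subtlety from the paper: completed map tasks on a failed worker are re-run too, because their intermediate files live on that machine's local disk, whereas completed reduce output already sits in the global file system.

```python
def handle_worker_failure(tasks, failed_worker):
    """Re-queue a dead worker's tasks so the master can assign them elsewhere."""
    for t in tasks:
        if t["worker"] != failed_worker:
            continue
        redo = (t["status"] == "in-progress" or
                # completed map output lived on the failed machine's disk
                (t["kind"] == "map" and t["status"] == "completed"))
        if redo:
            t["status"], t["worker"] = "idle", None

tasks = [
    {"kind": "map",    "status": "completed",   "worker": "w1"},
    {"kind": "reduce", "status": "completed",   "worker": "w1"},
    {"kind": "map",    "status": "in-progress", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
# w1's map task is re-queued; its reduce output survives untouched
```

Because tasks are deterministic and side-effect free, re-running one is always safe, which is what makes this simple rule sufficient.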
13. Task Granularity
   - There are M map tasks and R reduce tasks
   - M and R should be much larger than the number of workers
   - This enables dynamic loading and load balancing across workers to optimize resources
   - The master makes O(M + R) scheduling decisions and keeps O(M * R) pieces of state, roughly one byte per map/reduce task pair
   - According to the paper, Google runs with M = 200,000 and R = 5,000 using 2,000 workers
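Plugging the slide's numbers into those bounds shows why they are practical:

```python
M, R, workers = 200_000, 5_000, 2_000
state_bytes = M * R          # ~one byte per map/reduce task pair
print(M / workers)           # 100.0 -> ~100 map tasks per worker, plenty for balancing
print(state_bytes / 1e9)     # 1.0 -> about 1 GB of scheduling state on the master
```

A gigabyte of in-memory state is well within a single master machine's capacity, and a hundred tasks per worker gives the scheduler room to rebalance around slow or failed machines.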
14. Refinements
   - Partitioning function: load balancing across reduce tasks
   - Ordering guarantee: intermediate keys arrive sorted, making it easy to generate sorted output files
   - Combiner function, typically the same code as the Reduce function: see the word-count example
   - Input and output readers: standard input and output formats
   - Skipping bad records: controlled handling of bad input
   - Local execution for debugging
   - Status information exposed to an external application
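The combiner refinement can be illustrated on the word-count example: the reduce logic runs locally over each map task's output, so repeated words in one document cross the network as a single partial count. A small Python sketch, with `Counter` standing in for the locally-run reduce logic (names are mine):

```python
from collections import Counter

def map_with_combiner(doc_name, contents):
    # Apply the reduce logic (summing counts) on the map worker itself,
    # so each distinct word leaves the machine as one partial count
    # instead of one "1" per occurrence.
    local_counts = Counter(contents.split())
    return [(word, str(n)) for word, n in local_counts.items()]

pairs = map_with_combiner("Document_1", "foo bar baz foo bar test")
# six words in the document, but only four pairs cross the network
```

This is exactly the partial result shown on slide 8: <foo, "2">, <bar, "2">, <baz, "1">, <test, "1">.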
15. What are the benefits of MapReduce?
   - Easy to use: programmers do not need to worry about the details of distributed computing
   - A large set of problems can be expressed in the MapReduce programming model
   - Flexible and scalable on large clusters of machines, and the fault tolerance is elegant and works
16. Programs that can be expressed with MapReduce
   - Distributed grep: <word, match>
   - Count of URL access frequency: <URL, total_count>
   - Reverse web-link graph: <target, list(source)>
   - Term vector per host: <word, frequency>
   - Inverted index: <word, document ID>
   - Distributed sort: <key, record>
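As an illustration of the first entry, distributed grep needs almost no code on top of the model: map emits a line whenever it matches the pattern, and reduce is the identity. A Python sketch under assumed inputs (the pattern and file names below are made up for illustration):

```python
import re

def map_grep(filename, contents, pattern="error"):
    """Map: emit (filename, line) for every line matching the pattern."""
    return [(filename, line)
            for line in contents.splitlines()
            if re.search(pattern, line)]

def reduce_grep(key, values):
    """Reduce: identity; the matching lines are already the answer."""
    return list(values)

matches = map_grep("server.log", "boot ok\nerror: disk full\nshutdown")
# [("server.log", "error: disk full")]
```

The other entries in the list follow the same shape: only the two small functions change, while the library supplies the distribution, shuffling, and fault tolerance.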