MapReduce: Simplified Data Processing on Large Clusters

1. MapReduce: Simplified Data Processing on Large Clusters
    by Jeffrey Dean and Sanjay Ghemawat
    Communications of the ACM, 2008
    Presented by: Abolfazl Asudeh
2. MapReduce
     Patented by Google
     A parallel programming model and an associated implementation for processing and generating large datasets
     Users specify the computation in terms of a map and a reduce function
     The system automatically parallelizes the computation across large-scale clusters
3.   Previously, users had to handle the parallelization of programs across hundreds or thousands of machines:
         Distribute the data
         Handle failures
         Schedule inter-machine communication to make efficient use of resources
     From experience, most of these computations involved applying a map operation to produce intermediate key/value pairs, and then applying a reduce operation to combine/aggregate the pairs
4. Programming Model
     Input: a set of input key/value pairs
     Output: a set of output key/value pairs
     Map (written by the user): takes an input pair and produces a set of intermediate key/value pairs
     MapReduce library: groups all intermediate values associated with the same intermediate key and passes them to the reduce function
     Reduce (written by the user): accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values
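In the paper's notation, the two user-supplied functions have the types

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)

where the intermediate keys and values (k2, v2) are drawn from a different domain than the input keys and values (k1, v1).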
5. Basic Example: Word Counting (figure)
6. Basic Example: Word Counting (figure)
7. Basic Example: Word Counting
   Input:           "How now brown cow"   "How does it work now"
   Map output:      <How,1> <now,1> <brown,1> <cow,1>   <How,1> <does,1> <it,1> <work,1> <now,1>
   Grouped by key:  <How,(1 1)> <now,(1 1)> <brown,(1)> <cow,(1)> <does,(1)> <it,(1)> <work,(1)>
   Reduce output:   brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
   (Figure: four map tasks (M) feed two reduce tasks (R))
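To make the example concrete, here is a Python rendering of the word-count job (the paper presents it in C++-like pseudocode); the small driver that groups intermediate pairs stands in for the MapReduce library and is only for illustration:

# Word-count map and reduce, following the example above.
# "Emitting" a pair is modeled by yielding it from a generator.

from collections import defaultdict

def word_count_map(doc_name, contents):
    """Map: for each word in the document, emit <word, 1>."""
    for word in contents.split():
        yield word, 1

def word_count_reduce(word, counts):
    """Reduce: sum all the counts emitted for the same word."""
    yield word, sum(counts)

# Tiny driver standing in for the MapReduce library's grouping step.
docs = {"d1": "How now brown cow", "d2": "How does it work now"}
grouped = defaultdict(list)
for name, text in docs.items():
    for key, value in word_count_map(name, text):
        grouped[key].append(value)
for key in sorted(grouped, key=str.lower):
    for word, total in word_count_reduce(key, grouped[key]):
        print(word, total)   # brown 1, cow 1, does 1, How 2, it 1, now 2, work 1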
8. Execution Overview
   1. The MapReduce library splits the input files into M pieces, typically 16-64 MB each, and starts up many copies of the program on a cluster of machines.
   2. One of the copies becomes the master. It assigns the M map tasks and R reduce tasks to the workers, picking idle workers and handing each one a task.
   3. A worker assigned a map task reads the corresponding input split, parses the key/value pairs, and passes each pair to the user-defined map function.
9. Execution Overview
   4. The buffered intermediate pairs are periodically written to local disk, partitioned into R regions.
       The locations of these pairs on local disk are passed back to the master, which is responsible for forwarding them to the reduce workers.
   5. A reduce worker remotely reads the buffered data from the local disks of the corresponding map workers.
       It sorts the data by intermediate key and groups together all values for the same key.
10. Execution Overview
   6. The reduce worker passes each intermediate key and its grouped values to the reduce function.
   7. When all map and reduce tasks have completed, the MapReduce call returns to the user program.
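As a way to see how steps 1-7 fit together, the sketch below simulates the data flow on a single machine: apply map to every record, partition the intermediate pairs into R regions with the paper's default partitioner hash(key) mod R, then sort and group each region by key and apply reduce. It is purely illustrative (no cluster, master, or local disks), and run_mapreduce is a hypothetical helper, not part of any library.

from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, records, num_reduce_tasks=2):
    """records: a list of (key, value) pairs standing in for the input splits."""
    # Steps 3-4: apply map to every record and partition the intermediate
    # pairs into R regions (default partitioner: hash(key) mod R).
    regions = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            regions[hash(ikey) % num_reduce_tasks][ikey].append(ivalue)

    # Steps 5-7: each "reduce worker" sorts its region by intermediate key
    # and passes every key with its grouped values to the reduce function.
    output = []
    for region in regions:
        for ikey in sorted(region):
            output.extend(reduce_fn(ikey, region[ikey]))
    return output

# Example: the word-count job again, written inline so this sketch is self-contained.
docs = [("d1", "How now brown cow"), ("d2", "How does it work now")]
wc_map = lambda name, text: ((word, 1) for word in text.split())
wc_reduce = lambda word, counts: [(word, sum(counts))]
print(run_mapreduce(wc_map, wc_reduce, docs))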
11. Execution Overview (figure: overall execution flow)
12. Fault Tolerance
     Failure detection: the master pings every worker periodically; a worker that does not respond is marked as failed
     If a map worker fails:
         All of its map tasks are rescheduled, even completed ones, because their output is stored on the failed machine's local disk
         The new location of the re-executed map output is sent to all corresponding reduce workers; any reducer that has not yet read from the failed mapper reads from the new one instead
     If a reduce worker fails:
         Completed reduce tasks do not need to be re-executed, since their output is stored in a global file system; only in-progress tasks are rescheduled
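A minimal sketch of the rescheduling rule above, assuming a toy Task/Master bookkeeping model (the names are hypothetical, not the paper's data structures): completed map tasks on a failed worker go back to idle because their output lived on that worker's local disk, while completed reduce tasks stay done because their output is in the global file system.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in-progress", or "completed"
    worker: Optional[int] = None   # id of the worker the task was assigned to

@dataclass
class Master:
    tasks: List[Task] = field(default_factory=list)

    def handle_worker_failure(self, worker_id):
        """Reschedule tasks when a worker stops answering the master's pings."""
        for task in self.tasks:
            if task.worker != worker_id:
                continue
            if task.kind == "map":
                # Map output sits on the failed worker's local disk, so even
                # completed map tasks are reset and re-executed elsewhere.
                task.state, task.worker = "idle", None
            elif task.state == "in-progress":
                # Completed reduce output is already in the global file system;
                # only in-progress reduce work needs to be redone.
                task.state, task.worker = "idle", None

master = Master([Task("map", "completed", 3),
                 Task("reduce", "completed", 3),
                 Task("reduce", "in-progress", 3)])
master.handle_worker_failure(3)
print([(t.kind, t.state) for t in master.tasks])
# -> [('map', 'idle'), ('reduce', 'completed'), ('reduce', 'idle')]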
13. Execution Optimization
     Locality
         Network bandwidth is a relatively scarce resource
         Compute on local copies of the input, which the underlying distributed file system (GFS) spreads across the machines
     Task Granularity
         Ideally, M and R should be much larger than the number of worker machines
         Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails
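For a sense of scale, the paper reports typical runs with M = 200,000 map tasks and R = 5,000 reduce tasks on 2,000 worker machines, i.e. on the order of 100 map tasks per machine, which gives the master plenty of slack for dynamic load balancing and for spreading a failed worker's tasks across the rest of the cluster.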
14. Practice
     Write the map and reduce functions for the PageRank algorithm.
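One possible (and deliberately simplified) answer to the exercise, sketched in Python: map emits a share of each page's current rank to every page it links to, and re-emits the adjacency list so the graph structure survives the shuffle; reduce sums the incoming shares and applies the damping factor. This sketch assumes a damping factor of 0.85 and a known page count, ignores dangling pages, and covers a single iteration (in practice the job is run repeatedly until the ranks converge); it is one way to do the exercise, not the slides' intended solution.

from collections import defaultdict

DAMPING = 0.85
NUM_PAGES = 3   # assumed to be known up front for the (1 - d) / N term

def pagerank_map(page, value):
    rank, out_links = value
    # Re-emit the adjacency list so the reducer can rebuild the page record.
    yield page, ("links", out_links)
    # Give each linked page an equal share of this page's current rank.
    for target in out_links:
        yield target, ("share", rank / len(out_links))

def pagerank_reduce(page, values):
    out_links, incoming = [], 0.0
    for tag, payload in values:
        if tag == "links":
            out_links = payload
        else:
            incoming += payload
    new_rank = (1 - DAMPING) / NUM_PAGES + DAMPING * incoming
    yield page, (new_rank, out_links)

# One iteration over a 3-page graph, with a simple in-memory shuffle.
graph = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
grouped = defaultdict(list)
for page, value in graph.items():
    for key, emitted in pagerank_map(page, value):
        grouped[key].append(emitted)
ranks = dict(pair for key in grouped for pair in pagerank_reduce(key, grouped[key]))
print(ranks)   # {'A': (0.333..., ['B', 'C']), 'B': (0.191..., ['C']), 'C': (0.475, ['A'])}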
15. Thank you
