MapReduce: Simplified Data Processing on Large Clusters
Google, Inc.
Presented by Noha El-Prince
Winter 2011

Problem and Motivations

—  Large data sizes
—  Limited CPU power
—  Difficulties of distributed, parallel computing

MapReduce

—  MapReduce is a software framework
—  Introduced by Google
—  Enables automatic parallelization and distribution of large-scale computations
—  Hides the details of parallelization, data distribution, load balancing, and fault tolerance
—  Achieves high performance

Outline

—  MapReduce: Execution Example
—  Programming Model
—  MapReduce: Distributed Execution
—  More Examples
—  Customizations on Clusters
—  Refinements
—  Performance Measurement
—  Conclusion & Future Work
—  Companies using MapReduce

Programming Model

[Diagram: raw data (k, v) pairs enter the MapReduce library; a user-supplied Map (Mu) produces intermediate data <k’, v’>*, which is grouped into (k’, <v’>*) and fed to a user-supplied Reduce (Ru) to yield the reduced, processed output.]

Example

Input:
—  Page 1: the weather is good
—  Page 2: today is good
—  Page 3: good weather is good
Output desired: the frequency with which each word is encountered across all pages:
(the 1), (is 3), (weather 2), (today 1), (good 4)

Example: Word-Count Dataflow

Input data:
  The weather is good / Today is good / Good weather is good

Map function:
  map(key, value):
    for each word w in value:
      emit(w, 1)

Intermediate data:
  (the,1) (weather,1) (is,1) (good,1) (today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1)

Grouped by key:
  (the,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (today,[1])

Reduce function:
  reduce(key, values):
    result = 0
    for each count v in values:
      result += v
    emit(key, result)

Output data:
  (the,1) (weather,2) (is,3) (good,4) (today,1)

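This dataflow can be simulated locally. Below is a minimal, self-contained Python sketch (an illustration, not Google's implementation): it runs the same map and reduce functions over the three pages and prints the word frequencies.

    from collections import defaultdict

    def map_fn(key, value):
        # Emit (word, 1) for every word on the page.
        for word in value.lower().split():
            yield (word, 1)

    def reduce_fn(key, values):
        # Sum the per-occurrence counts for one word.
        return (key, sum(values))

    pages = {
        "page1": "the weather is good",
        "page2": "today is good",
        "page3": "good weather is good",
    }

    # Map phase: apply map_fn to every input pair,
    # grouping intermediate values by key (the "shuffle").
    intermediate = defaultdict(list)
    for key, value in pages.items():
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)

    # Reduce phase: apply reduce_fn to each group.
    for word in sorted(intermediate):
        print(reduce_fn(word, intermediate[word]))
    # Prints: ('good', 4) ('is', 3) ('the', 1) ('today', 1) ('weather', 2)
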
Programming Model

§  Input: a set of key/value pairs
§  The programmer specifies two functions:
   •  map (k, v) → <k’, v’>*
   •  reduce (k’, <v’>*) → <k’, v’>*
All v’ with the same k’ are reduced together.

Distributed Execution Overview

[Diagram: the user program forks a master and several workers. The master assigns map tasks and reduce tasks. Map workers read input splits (Split 0, 1, 2) and write intermediate data to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1.]

MapReduce Examples

Distributed Grep. Search pattern (key): "virus"
Input (web pages): page A (…virus…), page B (no match), page C (…virus…)
MAP: emits (virus, A…), (virus, C…) for the pages containing the pattern
RED: collects the matches → (virus, [A…, C…])

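A Python sketch of this example, following the slide's formulation of emitting (pattern, page) pairs (in the paper's version, map emits matching lines and reduce is the identity):

    import re

    PATTERN = re.compile("virus")   # the search pattern acts as the key

    def map_fn(doc_id, text):
        # Emit (pattern, doc_id) whenever the pattern occurs in the page.
        if PATTERN.search(text):
            yield ("virus", doc_id)

    def reduce_fn(pattern, doc_ids):
        # Collect all pages that matched the pattern.
        return (pattern, sorted(doc_ids))
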
MapReduce Examples

Count of URL Access Frequency
Input (web server logs): www.cbc.com, www.cnn.com, www.bbc.com, www.cbc.com, www.cbc.com, www.bbc.com
MAP: emits (URL, 1) per log entry; grouped by key → CBC [1,1,1], CNN [1], BBC [1,1]
RED: sums each group → (CBC, 3), (BBC, 2), (CNN, 1)

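Structurally this is word count with log lines as input. A sketch, assuming a hypothetical log format of one requested URL per line:

    def map_fn(log_name, log_line):
        # Each log line is assumed to hold one requested URL.
        yield (log_line.strip(), 1)

    def reduce_fn(url, counts):
        # Total number of accesses for one URL.
        return (url, sum(counts))
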
MapReduce Examples

Reverse Web-Link Graph
Input (web server logs): (source, target) link pairs; here www.youtube.com and www.disney.com each link to www.facebook.com
MAP: emits (target, source) for each link → (facebook, youtube), (facebook, disney)
RED: concatenates all sources pointing at a target → (facebook, [youtube, disney])

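A sketch, assuming each input record is a (source page, list of link targets) pair; map reverses each edge so that reduce can group all sources pointing at a target:

    def map_fn(source, targets):
        # Reverse each edge: key on the target so sources group together.
        for target in targets:
            yield (target, source)

    def reduce_fn(target, sources):
        # All pages that link to `target`.
        return (target, sorted(sources))

    # e.g. map_fn("www.youtube.com", ["www.facebook.com"]) and
    #      map_fn("www.disney.com", ["www.facebook.com"]) both emit
    #      ("www.facebook.com", <source>), so reduce yields
    #      ("www.facebook.com", ["www.disney.com", "www.youtube.com"])
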
MapReduce Examples

Term-Vector per Host
MAP: for each document of a host (e.g. facebook), emits (hostname, term vector) pairs → <facebook, word1>, <facebook, word2>, …
RED: merges the per-document term vectors for each host → <facebook, [word2, …]>, a summary of the host's most popular words

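A sketch using Python's Counter as the term vector; the TOP_N cutoff for "most popular" is an assumption:

    from collections import Counter
    from urllib.parse import urlparse

    TOP_N = 10  # assumed cutoff for the "most popular" terms

    def map_fn(url, text):
        host = urlparse(url).netloc   # e.g. "www.facebook.com"
        yield (host, Counter(text.lower().split()))

    def reduce_fn(host, counters):
        # Merge per-document term vectors, keep only the top terms.
        total = Counter()
        for c in counters:
            total.update(c)
        return (host, total.most_common(TOP_N))
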
MapReduce Examples

Inverted Index
MAP: emits (word, docID) for every word in every document → <word1, docID1>, <word2, docID1>, <word3, docID2>, <word1, docID2>, <word1, docID3>, …
RED: collects the document IDs for each word → <word1, [docID1, docID2, docID3]>, <word2, [docID1]>, …

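A sketch of the inverted-index map and reduce:

    def map_fn(doc_id, text):
        # Emit (word, doc_id) for every distinct word in the document.
        for word in set(text.lower().split()):
            yield (word, doc_id)

    def reduce_fn(word, doc_ids):
        # Sorted posting list for one word.
        return (word, sorted(doc_ids))
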
Outline

✓  MapReduce: Execution Example
✓  Programming Model
✓  MapReduce: Distributed Execution
✓  More Examples
—  Customizations on Clusters
—  Refinements
—  Performance Measurement
—  Conclusion & Future Work
—  Companies using MapReduce

Customizations on Clusters

—  Coordination
—  Scheduling
—  Fault Tolerance
—  Task Granularity
—  Backup Tasks

Customizations on Clusters

Coordination: the master data structure tracks, for each task, its type, assigned worker, state, and file location:

  Type | Worker        | State       | File
  M    | 250.133.22.7  | completed   | Root/intFile.txt
  M    | 250.133.22.8  | in progress | Root/intFile.txt
  R    | 250.123.23.3  | idle        | Root/outFile.txt

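A minimal sketch of this bookkeeping in Python (the field names are assumptions, not the paper's actual structures):

    from dataclasses import dataclass

    @dataclass
    class TaskInfo:
        kind: str      # "M" (map) or "R" (reduce)
        state: str     # "idle", "in-progress", or "completed"
        worker: str    # IP of the assigned worker ("" if idle)
        location: str  # path of the task's intermediate/output file

    tasks = {
        0: TaskInfo("M", "completed",   "250.133.22.7", "Root/intFile.txt"),
        1: TaskInfo("M", "in-progress", "250.133.22.8", "Root/intFile.txt"),
        2: TaskInfo("R", "idle",        "",             "Root/outFile.txt"),
    }
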
Customizations on Clusters

Scheduling. Master scheduling policy (objective: conserve network bandwidth):
1.  GFS divides each file into 64 MB blocks.
2.  Input data are stored on the workers' local disks (managed by GFS).
    →  Locality: the same cluster is used for both data storage and data processing.
3.  GFS stores multiple copies of each block (typically 3) on different machines.

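A sketch of the locality preference this policy implies (a simplification; the real master also considers network topology): assign a map task to a worker that already holds a replica of its input block when possible.

    def pick_worker(split_replicas, idle_workers):
        """split_replicas: machines holding a copy of this 64 MB block.
        Prefer a worker with a local copy; otherwise take any idle worker
        (assumes idle_workers is non-empty)."""
        for worker in idle_workers:
            if worker in split_replicas:
                return worker              # local read: no network transfer
        return next(iter(idle_workers))    # fall back to a remote read
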
Customizations on Clusters

Fault Tolerance
On worker failure:
•  Detect the failure via periodic heartbeats
•  Re-execute completed and in-progress map tasks
•  Re-execute in-progress reduce tasks
•  Task completion is committed through the master
On master failure:
•  Could be handled, but isn't yet (master failure is unlikely)
•  The MapReduce task is aborted and the client is notified

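A sketch of heartbeat-based failure handling under these rules, reusing the TaskInfo sketch above (the timeout value is an assumption): a worker that misses heartbeats is marked failed; its map tasks are rescheduled even if completed (their output lived on the dead worker's local disk), while only its in-progress reduce tasks are rescheduled (completed reduce output is already in the global file system).

    import time

    HEARTBEAT_TIMEOUT = 60.0  # seconds without a heartbeat => failed (assumed)

    def check_workers(last_heartbeat, tasks, now=None):
        now = time.time() if now is None else now
        for worker, last in list(last_heartbeat.items()):
            if now - last <= HEARTBEAT_TIMEOUT:
                continue                            # worker is still alive
            for t in tasks.values():
                if t.worker != worker:
                    continue
                if t.kind == "M":                   # map output is on local disk:
                    t.state, t.worker = "idle", ""  # redo even completed maps
                elif t.state == "in-progress":      # reduce output is global:
                    t.state, t.worker = "idle", ""  # redo only unfinished reduces
            del last_heartbeat[worker]              # forget the failed worker
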
Customizations on Clusters

Task Granularity (how are tasks divided?)
Rule of thumb: make M and R much larger than the number of worker machines
→  improves dynamic load balancing
→  speeds recovery from worker failure
Usually R is smaller than M.

Customizations on Clusters

Backup Tasks
—  Problem of stragglers: machines that take a long time to complete one of the last few tasks
—  When a MapReduce operation is about to complete:
   →  the master schedules backup executions of the remaining in-progress tasks
   →  a task is marked "complete" whenever either the primary or the backup execution completes
Effect: dramatically shortens job completion time

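A sketch of the backup-task rule, again reusing the TaskInfo sketch (the "about to complete" threshold is an assumption):

    def maybe_schedule_backups(tasks, schedule):
        """schedule(task_id) launches another execution of the task."""
        remaining = [tid for tid, t in tasks.items() if t.state != "completed"]
        if len(remaining) <= 5:        # "about to complete" (assumed cutoff)
            for tid in remaining:
                schedule(tid)          # primary OR backup completion marks it done
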
Outline

✓  MapReduce: Execution Example
✓  Programming Model
✓  MapReduce: Distributed Execution
✓  More Examples
✓  Customizations on Clusters
—  Refinements
—  Performance Measurement
—  Conclusion & Future Work
—  Companies using MapReduce

Refinements

—  Partitioning functions
—  Skipping bad records
—  Status information
—  Other refinements

Refinements: Partitioning Function

—  MapReduce users specify the number of reduce tasks/output files desired (R)
—  For reduce, records with the same intermediate key must end up at the same worker
—  The system uses a default partition function, e.g. hash(key) mod R (yields fairly well-balanced partitions)
—  Sometimes it is useful to override it, e.g. hash(hostname(URL key)) mod R
   →  ensures all URLs from one host end up in the same output file

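A sketch of both partition functions in Python, using a stable CRC32 rather than the built-in hash() (which is salted per process and therefore unsuitable for partitioning):

    import zlib
    from urllib.parse import urlparse

    R = 4000  # number of reduce tasks / output files

    def default_partition(key):
        # hash(key) mod R
        return zlib.crc32(key.encode()) % R

    def host_partition(url_key):
        # hash(hostname(URL)) mod R: all URLs from one host land in one file
        return zlib.crc32(urlparse(url_key).netloc.encode()) % R
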
Refinements: Skipping Bad Records

§  Map/Reduce functions sometimes fail for particular inputs
•  MapReduce has special treatment for "bad" input data, i.e. input that repeatedly leads to the crash of a task
   →  The master, tracking task crashes, recognizes such records and, after a number of failed retries, decides to skip them
•  Effect: can work around bugs in third-party libraries

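A simplified local analogue of this mechanism (in the real system a signal handler reports the record's sequence number to the master; the retry limit here is an assumption):

    from collections import defaultdict

    MAX_FAILURES = 2             # skip a record after this many crashes (assumed)
    failures = defaultdict(int)  # record id -> crash count, kept by the master

    def run_map_on(record_id, record, map_fn):
        if failures[record_id] >= MAX_FAILURES:
            return []                # bad record: acknowledged and skipped
        try:
            return list(map_fn(record_id, record))
        except Exception:
            failures[record_id] += 1 # remember the crash; the task is retried
            raise
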
Refinements: Status Information

—  Status pages show the computation's progress
—  Links to the standard error and output files generated by each task
—  The user can:
   →  predict how long the computation will take
   →  add more resources if needed
   →  see which workers have failed
—  Useful for diagnosing bugs in user code

Other Refinements

§  Combiner function: local compression of intermediate data
   →  useful for saving network bandwidth
§  User-defined counters
   →  periodically propagated from the worker machines to the master
   →  useful for checking the behavior of MapReduce operations (shown on the master's status page)

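A sketch of a combiner for the word-count example: it pre-aggregates one map task's output locally, so the shuffle ships a single (word, partial count) pair per word instead of one pair per occurrence.

    from collections import Counter

    def combine(map_output):
        """map_output: iterable of (word, 1) pairs from ONE map task.
        Returns the locally summed pairs that actually cross the network."""
        partial = Counter()
        for word, count in map_output:
            partial[word] += count
        return list(partial.items())

    # e.g. 1,000 occurrences of "the" become a single ("the", 1000) pair.
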
Outline

✓  MapReduce: Execution Example
✓  Programming Model
✓  MapReduce: Distributed Execution
✓  More Examples
✓  Customizations on Clusters
✓  Refinements
—  Performance Measurement
—  Conclusion & Future Work
—  Companies using MapReduce

Performance

§  Tests were run on a cluster of 1,800 machines; each machine has:
—  4 GB of memory
—  Dual-processor 2 GHz Xeons with Hyper-Threading
—  Dual 160 GB IDE disks
—  A Gigabit Ethernet link
—  Bisection bandwidth of approximately 100–200 Gbps
§  Two benchmarks:
—  Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
—  Sort: sort 10^10 100-byte records

Grep

1,764 workers; M = 15,000 (input split = 64 MB); R = 1
Search pattern: 3 characters, found in 92,337 records
•  1,800 machines read 1 TB of data at a peak of ~31 GB/s
•  Startup overhead is significant for short jobs (≈80 s of scanning plus about a minute of startup)

Sort

M = 15,000 (input split = 64 MB); R = 4,000; 1,746 workers
[Figure: data transfer rates over time for (a) normal execution, (b) no backup tasks, (c) 200 tasks killed]
(a) Normal execution finishes faster than the reported TeraSort benchmark result of 1,057 s.
    Locality optimization → the input rate exceeds the shuffle and output rates;
    the output phase writes 2 copies of the sorted data → the shuffle rate exceeds the output rate.
(b) Without backup tasks, 5 stragglers stretch the total elapsed time by 44% over normal.

Experience: Rewrite of the Production Indexing System

§  The new code is simpler and easier to understand
§  MapReduce takes care of failures and slow machines
§  It is easy to make indexing faster by adding more machines

Outline

✓  MapReduce: Execution Example
✓  Programming Model
✓  MapReduce: Distributed Execution
✓  More Examples
✓  Customizations on Clusters
✓  Refinements
✓  Performance Measurement
—  Conclusion & Future Work
—  Companies using MapReduce

Conclusion & Future Work

—  MapReduce has proven to be a useful abstraction
—  It greatly simplifies large-scale computations
—  It is fun to use: focus on the problem, and let the library deal with the messy details

MapReduce Advantages/Disadvantages

Now it's easy to program for many CPUs:
•  Communication management is effectively gone
   →  I/O scheduling is done for us
•  Fault tolerance and monitoring
   →  machine failures, suddenly-slow machines, etc. are handled
•  Can be much easier to design and program
•  Can cascade several (many?) MapReduce tasks
But it further restricts the set of solvable problems:
•  It might be hard to express a problem in MapReduce terms
•  Data parallelism is key
   →  you need to be able to break the problem up by data chunks
•  MapReduce is closed-source (internal to Google) C++
   →  Hadoop is an open-source, Java-based rewrite

Outline

✓  MapReduce: Execution Example
✓  Programming Model
✓  MapReduce: Distributed Execution
✓  More Examples
✓  Customizations on Clusters
✓  Refinements
✓  Performance Measurement
✓  Conclusion & Future Work
—  Companies using MapReduce

Companies using MapReduce

—  Amazon: Amazon Elastic MapReduce
   §  a web service
   §  enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data
   §  utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)
   §  allows you to use Hadoop with no hardware investment
—  http://aws.amazon.com/elasticmapreduce/

Companies using MapReduce

—  Amazon: building product search indices
—  Facebook: processing web logs, via both Map-Reduce and Hive
—  IBM and Google: making large compute clusters available to higher-education and research organizations
—  New York Times: large-scale image conversions
—  Yahoo: MapReduce and Pig for web-log processing, data-model training, web-map construction, and much more
—  Many universities, for teaching parallel and large-data systems
And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy
