MapReduce
Simplified Data Processing on Large Clusters
Original Research by: Jeffrey Dean and Sanjay Ghemawat
Google Inc., Published in OSDI 2004
PRESENTATION BY: ABE ARREDONDO & JASON BEERE
UNIVERSITY OF TEXAS AT AUSTIN
GRADUATE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
EE 382N: DISTRIBUTED SYSTEMS OPT III PRC, FALL 2015
DR. VIJAY K. GARG, PROFESSOR; WEI-LUN HUNG, TEACHING ASSISTANT
SEPTEMBER 18TH, 2015
1
Agenda
• Introduction & Overview
• Motivation, Background, Examples
• Implementation
• Diagram
• Program Example
• Advantages, Disadvantages, Refinements, and Extensions
• Performance
• Conclusion
• References and Appendix
2
Introduction and Overview
•Motivation:
• Process lots of data, scalable to thousands of commodity CPUs, easy to use
•What does it do?
• Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling, Monitor
• Locally optimized: reduces the amount of data sent across the network
•Examples
• Web Search Service, Sorting, Data Mining, Machine Learning
• Distributed Grep & Sort, Web link-Graph Reversal, Inverted Indexes
•How and where is it used?
◦ Analytics, Maps, User Behavior, Retail, Commercial Advertising, Social Media,
◦ Human Genome, Cancer Research, Facial Recognition… Government & Military…
Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
3
Diagram
4
Snippet of code
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
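The word-count pseudocode above can be sketched as a toy single-process Python program (the function and variable names here are illustrative, not the library's API; the real library shards the input and shuffles intermediate pairs across many worker machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: a list of counts
    return str(sum(values))

def run_mapreduce(documents):
    # Shuffle phase: group all intermediate pairs by key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Reduce phase: one reduce call per distinct key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run_mapreduce({"doc1": "the quick the"}))  # {'the': '2', 'quick': '1'}
```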
5
Advantages, Disadvantages,
Refinements, and Extensions
•Master Fault Tolerance
• Master failure is unlikely; if it occurs, start a new copy of the master
• Worker Failure
• Master Redirects tasks
• Detect Failure: Periodic Heartbeat
• Re-execute completed and in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
•Straggler Machine Delays
• A machine with a bad disk, etc.
•SOLUTION: Master schedules backup executions of the remaining in-progress tasks
6
Refinements :
◦ Ordering Guarantees: within a partition, intermediate key/value pairs are
processed in increasing key order.
◦ Skipping Bad Records: workers report offending records to the master via UDP packets
◦ Combiner Function: Partial combining speeds up MapReduce Ops
◦ Local Execution and Local debugging tools (gdb)
◦ Status Info: Master runs an internal HTTP server and exports status.
◦ User-Defined Counters: facility used for sanity checking
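As a rough illustration of the combiner refinement (a single-machine sketch with made-up function names): partial combining pre-aggregates a map task's output locally, so only one pair per distinct key, rather than one per occurrence, crosses the network to the reducers.

```python
from collections import Counter

def map_task(contents):
    # A map task's raw output: one ("word", 1) pair per occurrence.
    return [(w, 1) for w in contents.split()]

def combiner(pairs):
    # Partial combining on the map side: pre-sum counts so only one
    # pair per distinct word is sent over the network.
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

raw = map_task("to be or not to be")
print(len(raw), len(combiner(raw)))  # 6 4
```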
Task Granularity and Pipelining
◦ Many more Map Tasks than machines
◦ Min time for fault recovery
◦ Pipeline shuffling
◦ Dynamic Load Balancing
Advantages, Disadvantages,
Refinements, and Extensions 2
200,000 Map, 5000 Reduce, w/ 2000 Machines
7
Grep Performance
8
•Tests run on cluster of 1800 machines:
• 4 GB of memory
• Dual-processor 2 GHz Xeons with Hyper-Threading
• Dual 160 GB IDE disks
• Gigabit Ethernet per machine
• Bisection bandwidth approximately 100 Gbps
•Two benchmarks:
• MR_Grep: scan 10^10 100-byte records to extract records matching a rare
pattern (92K matching records)
• MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
•Locality Optimization Helped
• 1800 machines read 1 TB of data at a peak of ~31 GB/s
• Without this, rack switches would limit throughput to 10 GB/s
• Startup overhead is significant for short jobs
Sort Performance
9
Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Easy to use: focus on problem, let library deal w/ messy details
• Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling,
Monitor
• Locally optimized: reduces the amount of data sent across the
network
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
10
Appendix and References
11
One Final Thought
In pioneer days they used oxen for heavy pulling,
and when one ox couldn't budge a log, they didn't
try to grow a larger ox. We shouldn't be trying for
bigger computers, but for more systems of
computers.
- Grace Hopper
12
Questions?
13
Errata
14
Believed “an apple a day keeps the doctor away”
Sam’s Mother
Mother
Sam
An Apple
(3) Ekanayake
15
Sam thought of “drinking” the apple
One day
 He used a [knife] to cut the [apple] and a [blender] to make juice.
(3) Ekanayake
16
 (map ‘(…)) → (…)   [fruit icons omitted]
Sam applied his invention to all the fruits he could find in the fruit
basket
Next Day
 (reduce ‘(…))   [juice icons omitted]
Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list
of values, which gets reduced into a
single value
(3) Ekanayake
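The classical functional notion described above can be shown directly with Python's built-in `map` and `functools.reduce` (the numbers are hypothetical, e.g. cups of juice per fruit):

```python
from functools import reduce

fruits = [1, 2, 3]  # hypothetical: cups of juice obtainable per fruit

# map: a list of values -> another list of values
juiced = list(map(lambda cups: cups * 2, fruits))

# reduce: fold the mapped list down to a single value
total = reduce(lambda a, b: a + b, juiced)

print(juiced, total)  # [2, 4, 6] 12
```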
17
18 Years Later
Sam got his first job in JuiceRUs for his talent in making juice
 Now, it’s not just one basket
but a whole container of fruits
 Also, they produce a list of juice types
separately
NOT ENOUGH!!
 But, Sam had just ONE [knife] and ONE [blender]
Large data and a list of values for
output
Wait!
(3) Ekanayake
18
Implemented a parallel version of his innovation
Brave Sam
(<a, ·> , <o, ·> , <p, ·> , …)   [fruit icons omitted]
Each input to a map is a list of <key, value> pairs
Each output of a map is a list of <key, value> pairs
(<a’, ·> , <o’, ·> , <p’, ·> , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a
list of these, depending on the grouping/hashing
mechanism)
e.g. <a’, (· · ·)>
Reduced into a list of values
(3) Ekanayake
19
Implemented a parallel version of his innovation
Brave Sam
The idea of MapReduce in Data Intensive
Computing
A list of <key, value> pairs mapped into another
list of <key, value> pairs which gets grouped by
the key and reduced into a list of values
(3) Ekanayake
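The group-by-key step between map and reduce can be sketched as follows (keys `a'`, `o'`, `p'` mirror the slide; the values are made up for illustration):

```python
from collections import defaultdict

def group_by_key(pairs):
    # The "grouped by key" step: a list of <key, value> pairs
    # becomes a mapping of <key, value-list>.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Hypothetical mapper output.
mapped = [("a'", 1), ("o'", 2), ("a'", 3), ("p'", 5)]
grouped = group_by_key(mapped)   # {"a'": [1, 3], "o'": [2], "p'": [5]}

# Each reducer then folds one value-list into a result.
reduced = {k: sum(vs) for k, vs in grouped.items()}
print(reduced)  # {"a'": 4, "o'": 2, "p'": 5}
```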
20
Sam realized,
◦ To create his favorite mix fruit juice he can use a combiner after the reducers
◦ If several <key, value-list> fall into the same group (based on the grouping/hashing
algorithm), then use the blender (reducer) separately on each of them
◦ The knife (mapper) and blender (reducer) should not contain residue after use – Side
Effect Free
◦ In general, the reducer should be associative and commutative
Afterwards
(3) Ekanayake
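Why the reducer should be associative and commutative: partial results may be combined in any order and any grouping, so the final answer must not depend on either. A small sketch contrasting a safe reducer (sum) with an unsafe one (subtraction):

```python
from functools import reduce

values = [5, 3, 2]

# Sum is associative and commutative: any order gives the same result.
assert reduce(lambda a, b: a + b, values) == \
       reduce(lambda a, b: a + b, list(reversed(values)))  # both 10

# Subtraction is neither: the result depends on combination order,
# so it is unsafe when partial results arrive in arbitrary order.
left = reduce(lambda a, b: a - b, values)                   # (5-3)-2 == 0
right = reduce(lambda a, b: a - b, list(reversed(values)))  # (2-3)-5 == -6
assert left != right
```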
21
References
(1) MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey
Dean and Sanjay Ghemawat. Presentation:
http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html
(2) Map-Reduce Meets Wider Varieties of Applications, by Chen and
Schlosser, Intel. http://www.cs.cmu.edu/~chensm/papers/IRP-TR-08-05.pdf
(3) MapReduce: Story of Sam, by Saliya Ekanayake, SALSA HPC Group,
Pervasive Technology Institute, Indiana University, Bloomington.
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
(4) Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
(5) Wikipedia: https://en.wikipedia.org/wiki/MapReduce
22
