MapReduce
Simplified Data Processing on Large Clusters
Original Research by: Jeffrey Dean and Sanjay Ghemawat
Google Inc., Published in OSDI 2004
PRESENTATION BY: ABE ARREDONDO & JASON BEERE
UNIVERSITY OF TEXAS AT AUSTIN
GRADUATE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
EE 382N: DISTRIBUTED SYSTEMS OPT III PRC, FALL 2015
DR. VIJAY K. GARG, PROFESSOR; WEI-LUN HUNG, TEACHING ASSISTANT
SEPTEMBER 18TH, 2015
1
Agenda
• Introduction & Overview
• Motivation, Background, Examples
• Implementation
• Diagram
• Program Example
• Advantages, Disadvantages, Refinements, and Extensions
• Performance
• Conclusion
• References and Appendix
2
Introduction and Overview
•Motivation:
• Process lots of data, scalable to thousands of commodity CPUs, easy to use
•What does it do?
• Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling, Monitor
• Locally optimized: reduces the amount of data sent across the network
•Examples
• Web Search Service, Sorting, Data Mining, Machine Learning
• Distributed Grep & Sort, Web link-Graph Reversal, Inverted Indexes
•How and where is it used?
◦ Analytics, Maps, User Behavior, Retail, Commercial Advertising, Social Media,
◦ Human Genome, Cancer Research, Facial Recognition… Government & Military…
Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
3
Diagram
4
Snippet of code
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
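The word-count pseudocode above can be sketched as a toy single-process Python program (the function and variable names here are illustrative, not the library's API; the real library shards the input and shuffles intermediate pairs across many worker machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: a list of counts
    return str(sum(values))

def run_mapreduce(documents):
    # Shuffle phase: group all intermediate pairs by key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Reduce phase: one reduce call per distinct key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run_mapreduce({"doc1": "the quick the"}))  # {'the': '2', 'quick': '1'}
```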
5
Advantages, Disadvantages,
Refinements, and Extensions
•Master Fault Tolerance
• Master failure is unlikely; if it occurs, start a new copy of the master
• Worker Failure
• Master Redirects tasks
• Detect Failure: Periodic Heartbeat
• Re-execute completed and in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
•Straggler Machine Delays
• A machine with a bad disk, etc.
•SOLUTION: Master schedules backup executions of the remaining in-progress tasks
6
Refinements :
◦ Ordering Guarantees: within a partition, intermediate key/value pairs are
processed in increasing key order.
◦ Skipping Bad Records: workers report offending records to the master via UDP packets
◦ Combiner Function: Partial combining speeds up MapReduce Ops
◦ Local Execution and Local debugging tools (gdb)
◦ Status Info: Master runs an internal HTTP server and exports status.
◦ User-Defined Counters: facility used for sanity checking
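As a rough illustration of the combiner refinement (a single-machine sketch with made-up function names): partial combining pre-aggregates a map task's output locally, so only one pair per distinct key, rather than one per occurrence, crosses the network to the reducers.

```python
from collections import Counter

def map_task(contents):
    # A map task's raw output: one ("word", 1) pair per occurrence.
    return [(w, 1) for w in contents.split()]

def combiner(pairs):
    # Partial combining on the map side: pre-sum counts so only one
    # pair per distinct word is sent over the network.
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

raw = map_task("to be or not to be")
print(len(raw), len(combiner(raw)))  # 6 4
```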
Task Granularity and Pipelining
◦ Many more Map Tasks than machines
◦ Min time for fault recovery
◦ Pipeline shuffling
◦ Dynamic Load Balancing
Advantages, Disadvantages,
Refinements, and Extensions 2
200,000 Map, 5000 Reduce, w/ 2000 Machines
7
Grep Performance
8
•Tests run on cluster of 1800 machines:
• 4 GB of memory
• Dual-processor 2 GHz Xeons with Hyper-Threading
• Dual 160 GB IDE disks
• Gigabit Ethernet per machine
• Bisection bandwidth approximately 100 Gbps
•Two benchmarks:
• MR_Grep: scan 10^10 100-byte records to extract records matching a rare
pattern (92K matching records)
• MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
•Locality Optimization Helped
• 1800 machines read 1 TB of data at a peak of ~31 GB/s
• Without this, rack switches would limit throughput to 10 GB/s
• Startup overhead is significant for short jobs
Sort Performance
9
Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Easy to use: focus on problem, let library deal w/ messy details
• Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling,
Monitor
• Locally optimized: reduces the amount of data sent across the
network
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
10
Appendix and References
11
One Final Thought
In pioneer days they used oxen for heavy pulling,
and when one ox couldn't budge a log, they didn't
try to grow a larger ox. We shouldn't be trying for
bigger computers, but for more systems of
computers.
- Grace Hopper
12
Questions?
13
Errata
14
Believed “an apple a day keeps the doctor away”
Sam’s Mother
Mother
Sam
An Apple
(3) Ekanayake
15
Sam thought of “drinking” the apple
One day
 He used a [knife] to cut the [apple] and a [blender] to make juice.
(3) Ekanayake
16
 (map ‘(…)) → (…)   [fruit icons omitted]
Sam applied his invention to all the fruits he could find in the fruit
basket
Next Day
 (reduce ‘(…))   [juice icons omitted]
Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list
of values, which gets reduced into a
single value
(3) Ekanayake
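The classical functional notion described above can be shown directly with Python's built-in `map` and `functools.reduce` (the numbers are hypothetical, e.g. cups of juice per fruit):

```python
from functools import reduce

fruits = [1, 2, 3]  # hypothetical: cups of juice obtainable per fruit

# map: a list of values -> another list of values
juiced = list(map(lambda cups: cups * 2, fruits))

# reduce: fold the mapped list down to a single value
total = reduce(lambda a, b: a + b, juiced)

print(juiced, total)  # [2, 4, 6] 12
```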
17
18 Years Later
Sam got his first job in JuiceRUs for his talent in making juice
 Now, it’s not just one basket
but a whole container of fruits
 Also, they produce a list of juice types
separately
NOT ENOUGH!!
 But, Sam had just ONE [knife] and ONE [blender]
Large data and a list of values for
output
Wait!
(3) Ekanayake
18
Implemented a parallel version of his innovation
Brave Sam
(<a, ·> , <o, ·> , <p, ·> , …)   [fruit icons omitted]
Each input to a map is a list of <key, value> pairs
Each output of a map is a list of <key, value> pairs
(<a’, ·> , <o’, ·> , <p’, ·> , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a
list of these, depending on the grouping/hashing
mechanism)
e.g. <a’, (· · ·)>
Reduced into a list of values
(3) Ekanayake
19
Implemented a parallel version of his innovation
Brave Sam
The idea of MapReduce in Data Intensive
Computing
A list of <key, value> pairs mapped into another
list of <key, value> pairs which gets grouped by
the key and reduced into a list of values
(3) Ekanayake
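The group-by-key step between map and reduce can be sketched as follows (keys `a'`, `o'`, `p'` mirror the slide; the values are made up for illustration):

```python
from collections import defaultdict

def group_by_key(pairs):
    # The "grouped by key" step: a list of <key, value> pairs
    # becomes a mapping of <key, value-list>.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Hypothetical mapper output.
mapped = [("a'", 1), ("o'", 2), ("a'", 3), ("p'", 5)]
grouped = group_by_key(mapped)   # {"a'": [1, 3], "o'": [2], "p'": [5]}

# Each reducer then folds one value-list into a result.
reduced = {k: sum(vs) for k, vs in grouped.items()}
print(reduced)  # {"a'": 4, "o'": 2, "p'": 5}
```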
20
Sam realized,
◦ To create his favorite mix fruit juice he can use a combiner after the reducers
◦ If several <key, value-list> fall into the same group (based on the grouping/hashing
algorithm), then use the blender (reducer) separately on each of them
◦ The knife (mapper) and blender (reducer) should not contain residue after use – Side
Effect Free
◦ In general, the reducer should be associative and commutative
Afterwards
(3) Ekanayake
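Why the reducer should be associative and commutative: partial results may be combined in any order and any grouping, so the final answer must not depend on either. A small sketch contrasting a safe reducer (sum) with an unsafe one (subtraction):

```python
from functools import reduce

values = [5, 3, 2]

# Sum is associative and commutative: any order gives the same result.
assert reduce(lambda a, b: a + b, values) == \
       reduce(lambda a, b: a + b, list(reversed(values)))  # both 10

# Subtraction is neither: the result depends on combination order,
# so it is unsafe when partial results arrive in arbitrary order.
left = reduce(lambda a, b: a - b, values)                   # (5-3)-2 == 0
right = reduce(lambda a, b: a - b, list(reversed(values)))  # (2-3)-5 == -6
assert left != right
```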
21
References
(1) MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey
Dean and Sanjay Ghemawat. Presentation:
http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html
(2) Map-Reduce Meets Wider Varieties of Applications, by Chen and
Schlosser, Intel. http://www.cs.cmu.edu/~chensm/papers/IRP-TR-08-05.pdf
(3) MapReduce: Story of Sam, by Saliya Ekanayake, SALSA HPC Group,
Pervasive Technology Institute, Indiana University, Bloomington.
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
(4) Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
(5) Wikipedia: https://en.wikipedia.org/wiki/MapReduce
22
