1. MapReduce
Simplified Data Processing on Large Clusters
Original Research by: Jeffrey Dean and Sanjay Ghemawat
Google Inc., Published in OSDI 2004
PRESENTATION BY: ABE ARREDONDO & JASON BEERE
UNIVERSITY OF TEXAS AT AUSTIN
GRADUATE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
EE 382N: DISTRIBUTED SYSTEMS OPT III PRC, FALL 2015
DR. VIJAY K. GARG, PROFESSOR; WEI-LUN HUNG, TEACHING ASSISTANT
SEPTEMBER 18TH, 2015
2. Agenda
• Introduction & Overview
• Motivation, Background, Examples
• Implementation
• Diagram
• Program Example
• Advantages, Disadvantages, Refinements, and Extensions
• Performance
• Conclusion
• References and Appendix
3. Introduction and Overview
• Motivation:
  • Process lots of data, scalable to thousands of commodity CPUs, easy to use
• What does it do?
  • Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling, Monitoring
  • Locality optimization: reduces the amount of data sent across the network
• Examples
  • Web Search Service, Sorting, Data Mining, Machine Learning
  • Distributed Grep & Sort, Web Link-Graph Reversal, Inverted Indexes
• How and where is it used?
  ◦ Analytics, Maps, User Behavior, Retail Commercial Advertising, Social Media,
  ◦ Human Genome, Cancer Research, Facial Recognition, … Gov & Military, …
Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
5. Snippet of code
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
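The pseudocode above can be tried out on a single machine. The following is a minimal Python sketch, not the library's actual API: `map_fn`, `reduce_fn`, `run_mapreduce`, and the sample documents are hypothetical names chosen for illustration, and the in-memory grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    # key: a word, values: the counts emitted for that word
    return word, sum(counts)


def run_mapreduce(documents):
    # Stand-in for the shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    # Apply the reducer once per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


# Hypothetical sample input
docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
word_counts = run_mapreduce(docs)
```

Here `word_counts` maps each distinct word to its total count across both documents.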
6. Advantages, Disadvantages, Refinements, and Extensions
• Master Fault Tolerance
  • Master failure is unlikely; if it happens, start a new copy of the master
• Worker Failure
  • Master redirects tasks to other workers
  • Detect failure: periodic heartbeat
  • Re-execute completed and in-progress map tasks
  • Re-execute in-progress reduce tasks
  • Task completion committed through master
• Straggler Machine Delays
  • For example, a machine with a bad disk
• Solution: Master schedules backup executions of the remaining in-progress tasks
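The worker-failure rules on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real master's code: `Task`, `failed_workers`, `reschedule`, and the 10-second timeout are all hypothetical. Completed map tasks are reset because their output lives on the failed machine's local disk, while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical value


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str                     # "idle", "in_progress", or "completed"
    worker: Optional[str] = None   # worker the task is assigned to


def failed_workers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    # A worker counts as failed if it has not answered within the timeout.
    return {w for w, t in last_heartbeat.items() if now - t > timeout}


def reschedule(tasks, dead):
    # Reset every task whose result was lost with a failed worker.
    for task in tasks:
        if task.worker not in dead:
            continue
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state, task.worker = "idle", None
    return tasks


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
dead = failed_workers({"w1": 0.0, "w2": 95.0}, now=100.0)
reschedule(tasks, dead)
states = [t.state for t in tasks]
```

Only the completed reduce task on the failed worker `w1` keeps its state; everything else that ran on `w1` goes back to idle for re-execution.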
7. Advantages, Disadvantages, Refinements, and Extensions 2
Refinements:
◦ Ordering Guarantees: within a partition, key/value pairs are guaranteed to be processed in increasing key order.
◦ Skipping bad records: a crashing worker reports the offending record to the master in a UDP packet
◦ Combiner Function: partial combining on the map side speeds up MapReduce operations
◦ Local execution and local debugging tools (gdb)
◦ Status Info: Master runs an internal HTTP server and exports status.
◦ User-defined counters facility, used for sanity checking
Task Granularity and Pipelining
◦ Many more map tasks than machines
◦ Minimizes time for fault recovery
◦ Shuffling can be pipelined with map execution
◦ Better dynamic load balancing
Example: 200,000 map tasks and 5,000 reduce tasks on 2,000 machines
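The combiner refinement listed on this slide can be illustrated with word count. This is a minimal sketch with hypothetical names (`map_words`, `combine`) and a made-up input split; it only shows how partial merging on the map worker shrinks the intermediate data before it crosses the network.

```python
from collections import Counter


def map_words(contents):
    # Plain map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in contents.split()]


def combine(pairs):
    # Partial merge on the map worker: one pair per distinct word.
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


split = "to be or not to be"   # hypothetical input split
raw = map_words(split)         # pairs sent without a combiner
combined = combine(raw)        # pairs sent with a combiner
```

For this split the combiner sends four pairs instead of six. It is safe here because addition is associative and commutative, so the reducer sees the same totals either way.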
8. Grep Performance
• Tests run on a cluster of 1800 machines:
  • 4 GB of memory
  • Dual-processor 2 GHz Xeons with Hyper-Threading
  • Dual 160 GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth approximately 100 Gbps
• Two benchmarks:
  • MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  • MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
• Locality optimization helped
  • 1800 machines read 1 TB of data at a peak of ~31 GB/s
  • Without this, rack switches would limit to 10 GB/s
• Startup overhead is significant for short jobs
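The figures on this slide can be checked with a little arithmetic, taking the record count as 10^10 and using decimal units (1 GB = 10^9 bytes). The variable names below are illustrative only. The peak rate is not sustained for the whole job, so the times are lower bounds on the scan, not measured run times.

```python
records = 10**10
record_bytes = 100
total_bytes = records * record_bytes      # 10**12 bytes = 1 TB

GB = 10**9
peak_rate = 31 * GB        # bytes/s with the locality optimization
switch_limited = 10 * GB   # bytes/s if rack switches were the bottleneck

best_case_s = total_bytes / peak_rate     # about 32 seconds
limited_s = total_bytes / switch_limited  # 100 seconds
```

Under these assumptions, reading from local disks cuts the best-case scan time to roughly a third of what the rack switches would allow.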
10. Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Easy to use: focus on the problem, let the library deal with messy details
• Parallelization, Fault Tolerance, Load Balancing, I/O Scheduling, Monitoring
• Locality optimization: reduces the amount of data sent across the network
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
12. One Final Thought
In pioneer days they used oxen for heavy pulling,
and when one ox couldn't budge a log, they didn't
try to grow a larger ox. We shouldn't be trying for
bigger computers, but for more systems of
computers.
- Grace Hopper
15. Sam’s Mother
Believed “an apple a day keeps the doctor away”
[Figure: Mother, Sam, an apple]
(3) Ekanayake
16. One day
Sam thought of “drinking” the apple
He used a [knife] to cut the [apple] and a [blender] to make juice.
(3) Ekanayake
17. Next Day
Sam applied his invention to all the fruits he could find in the fruit basket
(map ‘( [fruit images] ))
( [cut fruit images] )
(reduce ‘( [cut fruit images] ))
Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list of values, which gets reduced into a single value
(3) Ekanayake
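The classical functional notion on this slide maps directly onto Python's built-in `map` and `functools.reduce`. In this small sketch, `cut`, `blend`, and the basket contents are hypothetical stand-ins for Sam's knife, blender, and fruits.

```python
from functools import reduce


def cut(fruit):
    # The "knife": applied independently to every item.
    return f"cut {fruit}"


def blend(juice, piece):
    # The "blender": folds two values into one.
    return f"{juice} + {piece}"


basket = ["apple", "orange", "pear"]
pieces = list(map(cut, basket))   # a list mapped into another list
juice = reduce(blend, pieces)     # the list reduced into a single value
```

`pieces` has one entry per fruit, while `juice` is a single string built from all of them.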
18. 18 Years Later
Sam got his first job in JuiceRUs for his talent in making juice
Wait!
Now, it’s not just one basket but a whole container of fruits
Also, they produce a list of juice types separately
(Large data and a list of values for output)
But Sam had just ONE [knife] and ONE [blender]
NOT ENOUGH!!
(3) Ekanayake
19. Brave Sam
Implemented a parallel version of his innovation
Each input to a map is a list of <key, value> pairs
(<a, [fruit]> , <o, [fruit]> , <p, [fruit]> , …)
Each output of a map is a list of <key, value> pairs
(<a’, [cut fruit]> , <o’, [cut fruit]> , <p’, [cut fruit]> , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism)
e.g. <a’, ( [cut fruit] … )>
Reduced into a list of values
(3) Ekanayake
20. Brave Sam
Implemented a parallel version of his innovation
The idea of MapReduce in Data Intensive Computing
A list of <key, value> pairs mapped into another list of <key, value> pairs, which gets grouped by the key and reduced into a list of values
(3) Ekanayake
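The data-intensive version of the idea, with keyed pairs, grouping by key, and several reducers, can be sketched as follows. Everything here is an assumption made for illustration: the function names, `PIECES_PER_FRUIT`, the CRC32-based partitioner, and the use of a thread pool as a stand-in for worker machines (threads in CPython do not give real CPU parallelism).

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PIECES_PER_FRUIT = 4  # hypothetical: each fruit is cut into 4 pieces


def mapper(pair):
    # <kind, number of fruits> -> list of <kind, number of pieces>
    kind, count = pair
    return [(kind, count * PIECES_PER_FRUIT)]


def partition(key, n_reducers):
    # Stable hash, so a given key always goes to the same reducer.
    return zlib.crc32(key.encode()) % n_reducers


def shuffle(mapped, n_reducers):
    # Group values by key inside each reducer's partition.
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            partitions[partition(key, n_reducers)][key].append(value)
    return partitions


def reduce_partition(part):
    # One reducer handles every <key, value-list> in its partition.
    return [(key, sum(values)) for key, values in part.items()]


def run(pairs, n_reducers=2):
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(mapper, pairs))
        partitions = shuffle(mapped, n_reducers)
        reduced = list(pool.map(reduce_partition, partitions))
    return dict(kv for part in reduced for kv in part)


result = run([("apple", 2), ("orange", 1), ("apple", 3), ("pear", 1)])
```

A reducer may receive several keys, depending on how the hash falls, but each key's values are always reduced together.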
21. Afterwards
Sam realized,
◦ To create his favorite mixed fruit juice he can use a combiner after the reducers
◦ If several <key, value-list> fall into the same group (based on the grouping/hashing algorithm) then use the blender (reducer) separately on each of them
◦ The knife (mapper) and blender (reducer) should not contain residue after use: Side Effect Free
◦ In general, the reducer should be associative and commutative
(3) Ekanayake
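The last point, that a reducer should in general be associative and commutative, can be checked with a tiny experiment. In this sketch `reduce_in_groups` is a hypothetical helper that reduces each group separately and then reduces the partial results, which is roughly what happens when values are split across workers.

```python
import operator
from functools import reduce


def reduce_in_groups(op, groups):
    # Reduce each group on its own, then reduce the partial results.
    partials = [reduce(op, group) for group in groups]
    return reduce(op, partials)


# The same values 5, 3, 2, grouped in two different ways.
grouping_a = [[5, 3], [2]]
grouping_b = [[5], [3, 2]]

add_a = reduce_in_groups(operator.add, grouping_a)  # (5 + 3) + 2
add_b = reduce_in_groups(operator.add, grouping_b)  # 5 + (3 + 2)
sub_a = reduce_in_groups(operator.sub, grouping_a)  # (5 - 3) - 2
sub_b = reduce_in_groups(operator.sub, grouping_b)  # 5 - (3 - 2)
```

Addition gives the same answer for both groupings, while subtraction does not, so a subtraction-style reducer would make the result depend on how the framework happened to split the data.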
22. References
(1) MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat. Presentation: http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html
(2) Map-Reduce Meets Wider Varieties of Applications, by Chen, Schlosser: Intel. http://www.cs.cmu.edu/~chensm/papers/IRP-TR-08-05.pdf
(3) MapReduce: Story of Sam, by Saliya Ekanayake, SALSA HPC Group, Pervasive Technology Institute, Indiana University, Bloomington. http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
(4) Monash Research Pub: http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
(5) Wikipedia: https://en.wikipedia.org/wiki/MapReduce#References