3. Multiple choice: MapReduce is…
a) A combination of two common functional programming functions
b) Used extensively* by Google
c) Implemented in libraries for all languages (that matter)
d) A framework for the management and execution of parallel processing
e) Getting more and more relevant with the emergence of “Big Data”
f) Implementable as a service via AWS
g) Targeted towards batch-style computation
h) All of the above
* Approx. 12K MR programs, from http://www.youtube.com/watch?v=NXCIItzkn3E
4. A potted history of MapReduce
Timeline (2002–2012):
- Google publishes the GFS and MapReduce papers
  (http://labs.google.com/papers/gfs.html, http://labs.google.com/papers/mapreduce.html)
- Hadoop started by Doug Cutting at Yahoo
- Yahoo announces a 10K Hadoop cluster
- AWS launch ElasticMapReduce
- Facebook announces a 21PB Hadoop cluster
5. Processing flow
1. Read and split the input into chunks
2. Call MAP for each chunk, returning intermediate results
3. Partition and sort the intermediate results
4. Call REDUCE for each partition
5. Process each partition and persist the output
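The processing flow above can be sketched as a single-process simulation. The function names (`run_mapreduce`, `map_fn`, `reduce_fn`) are my own for illustration; a real framework runs the same steps distributed across many machines.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce flow (illustrative only)."""
    # 1-2. Call MAP for each chunk, collecting intermediate (key, value) pairs
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))
    # 3. Partition and sort the intermediate results by key
    intermediate.sort(key=itemgetter(0))
    # 4-5. Call REDUCE once per key group and collect the output
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results
```

With a word-count `map_fn` (emit `(word, 1)` per word) and a summing `reduce_fn`, this driver reproduces the word-count example later in the deck.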
6. Map and Reduce by example: word count

part_1.txt:
  Peter Piper picked a peck of pickled peppers,
  A peck of pickled peppers Peter Piper picked;

part_2.txt:
  If Peter Piper picked a peck of pickled peppers,
  Where's the peck of pickled peppers Peter Piper picked?

Map calls:

  Input key  | Input value                                      | Output (key, value) pairs
  -----------|--------------------------------------------------|--------------------------
  part_1.txt | Peter Piper picked a peck of pickled peppers,    | (peter, 1) (piper, 1) (picked, 1) (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1)
  part_1.txt | A peck of pickled peppers Peter Piper picked;    | (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1) (peter, 1) (piper, 1) (picked, 1)
  part_2.txt | If Peter Piper picked a peck of pickled peppers, | (if, 1) (peter, 1) (piper, 1) (picked, 1) (a, 1) (peck, 1) (of, 1) (pickled, 1) (peppers, 1)

Reduce calls:

  Input key | Input values  | Output value
  ----------|---------------|-------------
  a         | [1, 1, 1]     | a -> 3
  if        | [1]           | if -> 1
  of        | [1, 1, 1, 1]  | of -> 4
  peck      | [1, 1, 1, 1]  | peck -> 4
  peppers   | [1, 1, 1, 1]  | peppers -> 4
  peter     | [1, 1, 1, 1]  | peter -> 4
  picked    | [1, 1, 1, 1]  | picked -> 4
  pickled   | [1, 1, 1, 1]  | pickled -> 4
  piper     | [1, 1, 1, 1]  | piper -> 4
  the       | [1]           | the -> 1
8. Map and Reduce by pattern

  (A, B)  --map-->  [(C, D), (E, F), (G, H), …]

  (W, [X, Y, Z])  --reduce-->  V
9. Map and Reduce for Word count

  (fileoffset, line of text)  --map-->  [(word1, 1), (word2, 1), (word3, 1), …]

  (word, [1, 1, 1])  --reduce-->  (word, 3)
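The word-count pattern can be sketched as a pair of Python functions (the function names are my own, not from the deck):

```python
def wordcount_map(file_offset, line):
    """(fileoffset, line of text) -> [(word1, 1), (word2, 1), ...]"""
    # Lower-case and strip simple punctuation so "peppers," and
    # "peppers" are counted as the same word, as in the example.
    cleaned = line.lower().replace(",", " ").replace(";", " ").replace("?", " ")
    return [(word, 1) for word in cleaned.split()]

def wordcount_reduce(word, counts):
    """(word, [1, 1, 1]) -> (word, 3)"""
    return (word, sum(counts))
```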
10. Map and Reduce for Search

  (fileoffset, line of text)  --map-->  [(searchterm, filename + line1),
                                         (searchterm, filename + line2)]

  (searchterm, [filename1 + line1,      --reduce-->  (searchterm, [filename1 + line1 + line2,
                filename1 + line2,                                 filename2 + line1])
                filename2 + line1])
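One way to sketch the search pattern in Python: the map emits the search term as the key with a (filename, line) hit as the value, and the reduce groups the hits per file, as in the pattern above. The names and the closure-based way of fixing the search term per job are my own assumptions, not from the deck.

```python
def search_map(search_term, file_name):
    """Returns a mapper for one input file with the search term fixed."""
    def mapper(line_no, line):
        # (fileoffset, line of text) -> [(searchterm, filename + line)]
        # Emit a hit only for lines containing the term.
        if search_term in line:
            return [(search_term, (file_name, line_no))]
        return []
    return mapper

def search_reduce(search_term, hits):
    """(searchterm, [filename + line, ...]) -> hits grouped per file."""
    by_file = {}
    for file_name, line_no in hits:
        by_file.setdefault(file_name, []).append(line_no)
    return (search_term, by_file)
```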
11. Map and Reduce for Index

  (fileoffset, line of text)  --map-->  [(word1, filename),
                                         (word2, filename),
                                         (word3, filename)]

  (word1, [filename1,         --reduce-->  (word1, [filename1,
           filename2,                                filename2,
           filename3])                               filename3])
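The indexing pattern can be sketched as follows. In the pattern above the reduce is essentially the identity; this sketch additionally deduplicates and sorts the filename list, which is an assumption of mine rather than something the deck specifies.

```python
def index_map(file_name, line):
    """(fileoffset, line of text) -> [(word, filename), ...]"""
    # Emit the source filename once per word occurrence.
    return [(word, file_name) for word in line.lower().split()]

def index_reduce(word, file_names):
    """(word, [filename1, filename2, ...]) -> (word, sorted unique filenames)"""
    return (word, sorted(set(file_names)))
```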
30. Limitations
- Processing must be parallelisable
- Best suited to large amounts of uniform data requiring consistent processing, with few dependencies between records
- Not designed for high reliability, e.g. the NameNode is a single point of failure in the Hadoop DFS
31. MapReduce in practice
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
Source: http://en.wikipedia.org/wiki/Apache_Hadoop
32. FABUQ
- What if my input has multiline records?
- What if my EMR instances don’t have the required libraries, etc. to run my steps?
- What if I need to nest jobs within steps?
- What are the signs that an MR solution might “fit” the problem?
- How do I control the number of mappers and reducers used?
- What if I don’t need to do any reduction?
- How does MR provide fault tolerance?
Outline: What is MapReduce? How does it work? An implementation without the framework; an implementation with the framework; AWS architecture for MapReduce; an example using Hive; an example using Pig; a custom example in Java; limitations.
Goals: unconscious incompetence -> conscious incompetence; high-level understanding; knowledge of low-level usage; limitations; have a conversation with the customer.
The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
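The partitioning into R regions described above is typically hash-based, so that every occurrence of a given intermediate key lands in the same reduce region. A minimal sketch (the function name and the specific hash are my own illustrative choices):

```python
def partition(key, num_reduce_tasks):
    """Assign an intermediate key to one of R reduce regions.

    Uses a simple deterministic string hash rather than Python's
    built-in hash(), which is salted per process and so would not
    be stable across workers.
    """
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_reduce_tasks
```

Because the hash is deterministic, map workers on different machines agree on which reduce region each key belongs to.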
Note: the map input and output have different domains; the reduce input and output have different domains; the domain of the map output is the same as that of the reduce input.
Fibonacci ✖ (each term depends on the previous one) — Searching ✔ (each chunk can be searched independently)
“Monitoring the filesystem counters for a job- particularly relative to byte counts from the map and into the reduce- is invaluable to the tuning of these parameters.” (from http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Source+Code)