Dean and Ghemawat (2004), MapReduce: Simplified Data Processing on Large Clusters
The reasons for the success of MapReduce
The MapReduce programming model has been
successfully used at Google
Reasons
• MapReduce is easy to use
• Many problems are easily expressible
as MapReduce computations
• An implementation of MapReduce can run on a large
cluster of commodity machines and is highly scalable
The reasons why MapReduce is easy to use
MapReduce hides the messy details of the following
• Parallelization
• Fault-tolerance
• Locality optimization
• Load balancing
MapReduce computation
Computation is expressed as two functions: Map and Reduce
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)
Input and output are sets of key/value pairs
[Figure: a flow of an execution counting the number of occurrences of each word in a document — an input file is split across machines; in the map phase each split (n, s) is mapped to pairs such as (w1, 1) and (w2, 1); in the reduce phase the values grouped per word, e.g. (w1, (1, 1)), are reduced to a count, e.g. (w1, 2)]
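The two type signatures above can be sketched as a minimal single-process word count in Python (function and variable names are my own illustration, not the paper's API):

```python
from collections import defaultdict

def map_fn(name, contents):          # (k1, v1) -> list(k2, v2)
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):         # (k2, list(v2)) -> list(v2)
    return [sum(counts)]

def run(inputs):
    # Map phase: apply map_fn to every input pair.
    intermediate = defaultdict(list)
    for name, contents in inputs:
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)   # group values by key (the shuffle)
    # Reduce phase: apply reduce_fn to each key and its grouped values.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run([("doc1", "a b a"), ("doc2", "b c")]))
# -> {'a': [2], 'b': [2], 'c': [1]}
```

A real implementation distributes the two phases across machines; the grouping step in the middle is what the library does between map and reduce.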
Map
Map takes an input pair and produces a set of intermediate
key/value pairs
Map example
(Counting the number of occurrences of each word in a document)
Map, written by the user, produces a set of intermediate key/value pairs
Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Reduce
Reduce accepts an intermediate key and a set of values for
that key, and merges these values to form a possibly smaller set of values
Reduce example
(Counting the number of occurrences of each word in a document)
Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Parallelization
Map and reduce allow programmers to parallelize
computations easily
• They are inspired by map and reduce present in
functional languages
• Referential transparency is one of the principles of
functional programming
• Referential transparency encourages language-based
parallelism of computation
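The point about referential transparency can be illustrated in Python: because a pure map function's result depends only on its argument, the input splits can be processed in any order or concurrently and the answer cannot change. (This is a local sketch with threads; the real library distributes the work across machines.)

```python
from concurrent.futures import ThreadPoolExecutor

# map_fn is referentially transparent: same input -> same output,
# no shared state, so the splits may be processed concurrently.
def map_fn(contents):
    return [(w, 1) for w in contents.split()]

splits = ["a b a", "b c", "c c a"]

with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(map_fn, splits))  # concurrent, order-preserving

serial = [map_fn(s) for s in splits]
assert parallel == serial   # purity guarantees the same answer
```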
MapReduce operation
First, the MapReduce library copies the user program
onto a cluster of machines
[Figure: copying the user program onto a cluster of machines — the user program forks copies of itself]
The master and workers
One of the copies is the master; it assigns map tasks and
reduce tasks to the workers (the rest)
[Figure: copies of the user program on a cluster of machines — the master assigns map and reduce tasks to the workers]
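A minimal local sketch of this master/worker split, with threads standing in for the forked copies and a shared queue standing in for the master's task assignment (all names here are my own, not the library's):

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # A worker copy repeatedly pulls an assigned map task and runs it.
    while True:
        try:
            split = tasks.get_nowait()
        except queue.Empty:
            return
        out = [(w, 1) for w in split.split()]   # run the map task
        with lock:
            results.append(out)

# The "master" assigns map tasks by enqueueing the input splits.
for split in ["a b", "c", "a c"]:
    tasks.put(split)

workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(len(results))  # one output list per map task
```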
Fault-tolerance | worker
Any map task or reduce task in progress on a failed worker
becomes eligible for rescheduling
[Figure: task scheduling on a failure — the master assigns a task to machine A (idle → in-progress), a later ping gets no response, the task becomes idle again, and the master reassigns it to machine B]
The master stores the state (idle, in-progress, or completed) of each task
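This rescheduling logic can be sketched as a small state table, assuming a hypothetical master that tracks each task's state and owner:

```python
# Hypothetical sketch of the master's task table.
task_state = {"t1": "idle", "t2": "idle"}
task_worker = {}

def assign(task, worker):
    task_state[task] = "in-progress"
    task_worker[task] = worker

def on_worker_failure(worker):
    # A ping went unanswered: every in-progress task on that worker
    # becomes eligible for rescheduling.
    for task, w in task_worker.items():
        if w == worker and task_state[task] == "in-progress":
            task_state[task] = "idle"

assign("t1", "machine-A")
on_worker_failure("machine-A")   # ping to machine-A times out
assign("t1", "machine-B")        # master reschedules t1 elsewhere
```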
Fault-tolerance | master
If the master task dies, a new copy can be started from
the last checkpointed state
[Figure: restart of the master task from a checkpoint — after an exception, a new copy resumes from the last checkpoint]
The master writes periodic checkpoints of the master data structures
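The checkpoint/restore idea in miniature (an in-memory sketch; a real master would write checkpoints to durable storage):

```python
import copy

# The master's data structures and a periodic checkpoint of them.
state = {"tasks": {"t1": "completed", "t2": "in-progress"}}
checkpoint = copy.deepcopy(state)      # periodic checkpoint

state["tasks"]["t2"] = "corrupted"     # master dies mid-update
state = copy.deepcopy(checkpoint)      # new copy resumes from the checkpoint
assert state["tasks"]["t2"] == "in-progress"
```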
Locality optimization
Network bandwidth is conserved by taking
advantage of GFS
• Input data (managed by GFS) is stored on the local disks of
the machines that make up a cluster
• GFS divides each file into 64MB blocks, and stores several
copies of each block on different machines
• The master attempts to schedule a map task on a machine
that contains a replica of the corresponding input data
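The scheduling preference can be sketched as follows, assuming a hypothetical replica table mapping each input block to the machines holding a copy:

```python
# Hypothetical replica table: block -> machines holding a copy.
replicas = {"block-1": {"m1", "m2"}, "block-2": {"m3"}}

def schedule(block, idle_machines):
    # Prefer an idle machine that already holds a replica of the
    # task's input block; otherwise fall back to any idle machine.
    local = replicas[block] & idle_machines
    return next(iter(local)) if local else next(iter(idle_machines))

assert schedule("block-2", {"m3", "m4"}) == "m3"   # local read, no network
assert schedule("block-2", {"m4"}) == "m4"         # no replica idle: remote read
```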
Load balancing
Having each worker perform many different tasks improves
dynamic load balancing
If a worker fails, the many map tasks it has completed can be
spread out across all the other machines
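A toy simulation of why many fine-grained tasks help dynamic load balancing: a worker that finishes early simply pulls the next task, so faster machines end up doing proportionally more work (the speeds and task counts here are made up for illustration):

```python
tasks = list(range(12))           # 12 map tasks, far more than workers
speed = {"fast": 3, "slow": 1}    # tasks completed per scheduling round

done = {w: 0 for w in speed}
while tasks:
    for w, s in speed.items():
        # Each worker pulls as many tasks as it can finish this round.
        for _ in range(min(s, len(tasks))):
            tasks.pop()
            done[w] += 1

print(done)   # the fast worker completes about 3x as many tasks
```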