Dean and Ghemawat (2004), MapReduce: Simplified Data Processing on Large Clusters
The reasons for the success of MapReduce
The MapReduce programming model has been
successfully used at Google
Reasons
• MapReduce is easy to use
• Many problems are easily expressible
as MapReduce computations
• An implementation of MapReduce can run on a large
cluster of commodity machines and is highly scalable
The reasons why MapReduce is easy to use
MapReduce hides the messy details of the following
• Parallelization
• Fault-tolerance
• Locality optimization
• Load balancing
MapReduce computation
Computation is expressed as two functions: Map and Reduce
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)
Input and output are sets of key/value pairs
[Figure: a flow of an execution counting the number of occurrences of each word in a document — an input file is split across machines; in the map phase each split (n, s) is mapped to pairs such as (w1, 1) and (w2, 1); in the reduce phase the values grouped per word, e.g. (w1, (1, 1)), are reduced to a count, e.g. (w1, 2)]
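The two type signatures above can be sketched as a minimal single-process word count in Python (function and variable names are my own illustration, not the paper's API):

```python
from collections import defaultdict

def map_fn(name, contents):          # (k1, v1) -> list(k2, v2)
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):         # (k2, list(v2)) -> list(v2)
    return [sum(counts)]

def run(inputs):
    # Map phase: apply map_fn to every input pair.
    intermediate = defaultdict(list)
    for name, contents in inputs:
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)   # group values by key (the shuffle)
    # Reduce phase: apply reduce_fn to each key and its grouped values.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run([("doc1", "a b a"), ("doc2", "b c")]))
# -> {'a': [2], 'b': [2], 'c': [1]}
```

A real implementation distributes the two phases across machines; the grouping step in the middle is what the library does between map and reduce.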
Map
Map takes an input pair and produces a set of intermediate
key/value pairs
Map example
(Counting the number of occurrences of each word in a document)
Map, written by the user, produces a set of intermediate key/value pairs
Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Reduce
Reduce accepts an intermediate key and a set of values for
that key, and merges these values to form a possibly smaller set of values
Reduce example
(Counting the number of occurrences of each word in a document)
Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Parallelization
Map and reduce allow programmers to parallelize
computations easily
• They are inspired by map and reduce present in
functional languages
• Referential transparency is one of the principles of
functional programming
• Referential transparency encourages language-based
parallelism of computation
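The point about referential transparency can be illustrated in Python: because a pure map function's result depends only on its argument, the input splits can be processed in any order or concurrently and the answer cannot change. (This is a local sketch with threads; the real library distributes the work across machines.)

```python
from concurrent.futures import ThreadPoolExecutor

# map_fn is referentially transparent: same input -> same output,
# no shared state, so the splits may be processed concurrently.
def map_fn(contents):
    return [(w, 1) for w in contents.split()]

splits = ["a b a", "b c", "c c a"]

with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(map_fn, splits))  # concurrent, order-preserving

serial = [map_fn(s) for s in splits]
assert parallel == serial   # purity guarantees the same answer
```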
MapReduce operation
First, the MapReduce library copies the user program
onto a cluster of machines
[Figure: copying the user program onto a cluster of machines — the user program forks copies of itself]
The master and workers
One of the copies is the master; it assigns map tasks and
reduce tasks to the workers (the rest)
[Figure: copies of the user program on a cluster of machines — the master assigns map and reduce tasks to the workers]
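A minimal local sketch of this master/worker split, with threads standing in for the forked copies and a shared queue standing in for the master's task assignment (all names here are my own, not the library's):

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # A worker copy repeatedly pulls an assigned map task and runs it.
    while True:
        try:
            split = tasks.get_nowait()
        except queue.Empty:
            return
        out = [(w, 1) for w in split.split()]   # run the map task
        with lock:
            results.append(out)

# The "master" assigns map tasks by enqueueing the input splits.
for split in ["a b", "c", "a c"]:
    tasks.put(split)

workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(len(results))  # one output list per map task
```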
Fault-tolerance | worker
Any map task or reduce task in progress on a failed worker
becomes eligible for rescheduling
[Figure: task scheduling on a failure — the master assigns a task to machine A (idle → in-progress), a later ping gets no response, the task becomes idle again, and the master reassigns it to machine B]
The master stores the state (idle, in-progress, or completed) of each task
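This rescheduling logic can be sketched as a small state table, assuming a hypothetical master that tracks each task's state and owner:

```python
# Hypothetical sketch of the master's task table.
task_state = {"t1": "idle", "t2": "idle"}
task_worker = {}

def assign(task, worker):
    task_state[task] = "in-progress"
    task_worker[task] = worker

def on_worker_failure(worker):
    # A ping went unanswered: every in-progress task on that worker
    # becomes eligible for rescheduling.
    for task, w in task_worker.items():
        if w == worker and task_state[task] == "in-progress":
            task_state[task] = "idle"

assign("t1", "machine-A")
on_worker_failure("machine-A")   # ping to machine-A times out
assign("t1", "machine-B")        # master reschedules t1 elsewhere
```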
Fault-tolerance | master
If the master task dies, a new copy can be started from
the last checkpointed state
[Figure: restart of the master task from a checkpoint — after an exception, a new copy resumes from the last checkpoint]
The master writes periodic checkpoints of the master data structures
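The checkpoint/restore idea in miniature (an in-memory sketch; a real master would write checkpoints to durable storage):

```python
import copy

# The master's data structures and a periodic checkpoint of them.
state = {"tasks": {"t1": "completed", "t2": "in-progress"}}
checkpoint = copy.deepcopy(state)      # periodic checkpoint

state["tasks"]["t2"] = "corrupted"     # master dies mid-update
state = copy.deepcopy(checkpoint)      # new copy resumes from the checkpoint
assert state["tasks"]["t2"] == "in-progress"
```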
Locality optimization
Network bandwidth is conserved by taking
advantage of GFS
• Input data (managed by GFS) is stored on the local disks of
the machines that make up a cluster
• GFS divides each file into 64MB blocks, and stores several
copies of each block on different machines
• The master attempts to schedule a map task on a machine
that contains a replica of the corresponding input data
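The scheduling preference can be sketched as follows, assuming a hypothetical replica table mapping each input block to the machines holding a copy:

```python
# Hypothetical replica table: block -> machines holding a copy.
replicas = {"block-1": {"m1", "m2"}, "block-2": {"m3"}}

def schedule(block, idle_machines):
    # Prefer an idle machine that already holds a replica of the
    # task's input block; otherwise fall back to any idle machine.
    local = replicas[block] & idle_machines
    return next(iter(local)) if local else next(iter(idle_machines))

assert schedule("block-2", {"m3", "m4"}) == "m3"   # local read, no network
assert schedule("block-2", {"m4"}) == "m4"         # no replica idle: remote read
```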
Load balancing
Having each worker perform many different tasks improves
dynamic load balancing
If a worker fails, the many map tasks it has completed can be
spread out across all the other machines
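A toy simulation of why many fine-grained tasks help dynamic load balancing: a worker that finishes early simply pulls the next task, so faster machines end up doing proportionally more work (the speeds and task counts here are made up for illustration):

```python
tasks = list(range(12))           # 12 map tasks, far more than workers
speed = {"fast": 3, "slow": 1}    # tasks completed per scheduling round

done = {w: 0 for w in speed}
while tasks:
    for w, s in speed.items():
        # Each worker pulls as many tasks as it can finish this round.
        for _ in range(min(s, len(tasks))):
            tasks.pop()
            done[w] += 1

print(done)   # the fast worker completes about 3x as many tasks
```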